How to Troubleshoot Cilium LoadBalancer Issues on Talos Linux¶
Goal: Diagnose and fix common LoadBalancer problems in Cilium L2 announcement deployments.
Audience: Kubernetes administrators and SREs managing Talos clusters with Cilium.
Time: Variable (5 minutes to 1 hour depending on issue complexity)
Quick Diagnostic Decision Tree¶
graph TD
A[LoadBalancer Service Created] --> B{Has EXTERNAL-IP?}
B -->|No - Pending| C[Problem 1: IP Stuck in Pending]
B -->|Yes| D{Accessible from outside?}
D -->|No| E{Can ping IP?}
E -->|No| F[Problem 2: ARP Incomplete]
E -->|Yes, but no HTTP| G[Problem 5: Application Issue]
D -->|Yes, but wrong source IP| H[Problem 4: Traffic Policy]
C --> I[Check IP Pool & RBAC]
F --> J[Check L2 Policy & Interface]
G --> K[Check Pod/Service Config]
H --> L[Check externalTrafficPolicy]
Problem 1: LoadBalancer IP Stuck in Pending¶
Symptoms¶
Output shows <pending> indefinitely:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-service LoadBalancer 10.96.100.123 <pending> 80:30123/TCP 5m
Root Causes¶
- No IP pool configured or IP pool exhausted
- LB-IPAM not enabled in Cilium
- Service selector doesn't match any IP pool
- RBAC permissions missing for IPAM controller
Solution 1.1: Verify IP Pool Exists¶
Check for IP pools:
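kubectl get ciliumloadbalancerippool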
Expected output:
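A healthy pool looks roughly like this (pool name and counts are illustrative; exact columns vary slightly by Cilium version):
NAME           DISABLED   CONFLICTING   IPS AVAILABLE   AGE
default-pool   false      False         14              2m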
If no pools exist, create one:
cat <<EOF | kubectl apply -f -
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
name: default-pool
spec:
blocks:
- cidr: "192.168.10.64/28"
EOF
Solution 1.2: Check IP Pool Availability¶
Verify pool has available IPs:
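# Pool name from Solution 1.1; adjust to your own pool
kubectl get ciliumloadbalancerippool default-pool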
Look at the IPS AVAILABLE column.
If IPS AVAILABLE is 0, expand your pool or create a new one:
Change CIDR to larger range:
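# Example only: widen the pool to a /26. A JSON merge patch replaces the whole blocks list,
# so include every CIDR you want to keep, and pick a range that is free on your LAN.
kubectl patch ciliumloadbalancerippool default-pool --type=merge \
  -p '{"spec":{"blocks":[{"cidr":"192.168.10.64/26"}]}}'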
Solution 1.3: Check Service Selector Matching¶
View pool's service selector:
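kubectl get ciliumloadbalancerippool prod-pool -o yaml | grep -A 8 serviceSelector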
Example with namespace restriction:
spec:
serviceSelector:
matchExpressions:
- key: io.kubernetes.service.namespace
operator: In
values:
- production
- staging
If your service is in default namespace, it won't match. Either:
Option A: Remove selector to match all services:
kubectl patch ciliumloadbalancerippool prod-pool --type=merge -p '{"spec":{"serviceSelector":null}}'
Option B: Add label to your service:
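# "environment=production" is only an example; use whatever label your pool selects on
kubectl label service my-service environment=production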
And update pool selector:
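# Point the selector at the label you just added, e.g.:
#   serviceSelector:
#     matchLabels:
#       environment: production
kubectl edit ciliumloadbalancerippool prod-pool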
Solution 1.4: Verify IPAM is Enabled¶
Check Cilium operator logs:
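kubectl logs -n kube-system deployment/cilium-operator | grep -i "ipam\|pool"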
You should see messages indicating that the LB-IPAM controller is running and allocating IPs from your pools.
If not enabled, check Helm values:
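helm get values cilium -n kube-system | grep -A 1 -E "l2announcements|externalIPs"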
Should have:
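l2announcements:
  enabled: true
externalIPs:
  enabled: true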
If missing, upgrade Cilium:
helm upgrade cilium cilium/cilium \
--namespace kube-system \
--reuse-values \
--set l2announcements.enabled=true \
--set externalIPs.enabled=true
Solution 1.5: Check RBAC Permissions¶
Verify IPAM controller has permissions:
kubectl get role cilium-l2-announcement -n kube-system
kubectl get rolebinding cilium-l2-announcement -n kube-system
If missing, create RBAC:
apiVersion: v1
kind: ServiceAccount
metadata:
name: cilium-l2-announcement
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: cilium-l2-announcement
namespace: kube-system
rules:
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["create", "get", "update"]
- apiGroups: [""]
resources: ["services", "endpoints"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: cilium-l2-announcement
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: cilium-l2-announcement
subjects:
- kind: ServiceAccount
name: cilium-l2-announcement
namespace: kube-system
Apply:
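# Use whatever file name you saved the manifest above under
kubectl apply -f cilium-l2-rbac.yaml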
Restart Cilium operator to pick up permissions:
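kubectl rollout restart deployment/cilium-operator -n kube-system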
Problem 2: IP Assigned but Not Accessible (ARP Incomplete)¶
Symptoms¶
Service has EXTERNAL-IP assigned:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
my-service LoadBalancer 10.96.100.123 192.168.10.75 80:30123/TCP 5m
But cannot reach from outside cluster. ARP shows incomplete:
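arp -a | grep 192.168.10.75
? (192.168.10.75) at <incomplete> on eth0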
Or ping fails:
PING 192.168.10.75 (192.168.10.75) 56(84) bytes of data.
--- 192.168.10.75 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2047ms
Root Causes¶
- L2 announcements not enabled in Cilium
- No L2 announcement policy configured
- Wrong network interface selected in policy
- No nodes match the node selector
- Leader election failing due to lease issues
Solution 2.1: Verify L2 Announcements Enabled¶
Check Cilium agent status:
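kubectl exec -n kube-system ds/cilium -- cilium-dbg status | grep -i l2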
The status output should report L2 announcements as enabled.
If disabled, check Helm configuration:
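helm get values cilium -n kube-system | grep -A 1 l2announcements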
Should have:
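l2announcements:
  enabled: true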
If missing, upgrade:
helm upgrade cilium cilium/cilium \
--namespace kube-system \
--reuse-values \
--set l2announcements.enabled=true
Solution 2.2: Check L2 Announcement Policy Exists¶
List policies:
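kubectl get ciliuml2announcementpolicy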
If none exist, create one:
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
name: default-l2-policy
spec:
serviceSelector:
matchLabels: {} # Match all services
nodeSelector:
matchExpressions:
- key: node-role.kubernetes.io/control-plane
operator: DoesNotExist # Only worker nodes
interfaces:
- ^enp0s.*
- ^eth0$
- ^ens18$
externalIPs: true
loadBalancerIPs: true
Apply:
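# File name is arbitrary; save the manifest above and apply it
kubectl apply -f l2-policy.yaml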
Solution 2.3: Verify Correct Network Interface¶
This is the most common issue.
Step 1: Find your node's actual interface¶
From Talos node:
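talosctl get links -n <node-ip>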
Example output:
NAME TYPE ENABLED
enp0s1 ether true
lo loopback true
cilium_host ether true
cilium_net ether true
cilium_vxlan ether true
The physical interface is enp0s1.
Step 2: Check policy's interface regex¶
Look at spec.interfaces:
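kubectl get ciliuml2announcementpolicy default-l2-policy -o jsonpath='{.spec.interfaces}'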
Step 3: Update policy with correct interface¶
Change the policy's interfaces list to match your interface exactly, or use a pattern that matches several names, for example:
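# Exact match for a single interface (interface names are examples;
# a JSON merge patch replaces the whole list)
kubectl patch ciliuml2announcementpolicy default-l2-policy --type=merge \
  -p '{"spec":{"interfaces":["^enp0s1$"]}}'
# Or a regex that matches several (enp0s1, enp0s2, ...)
kubectl patch ciliuml2announcementpolicy default-l2-policy --type=merge \
  -p '{"spec":{"interfaces":["^enp0s.*"]}}'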
Step 4: Verify from debug pod¶
Run pod with host network:
kubectl run net-debug --rm -it --image=nicolaka/netshoot --overrides='{"spec":{"hostNetwork":true}}' -- ip link show
Identify the interface that connects to your LAN (usually the one with the node's IP).
Solution 2.4: Check Node Selector Matches Nodes¶
View policy's node selector:
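kubectl get ciliuml2announcementpolicy default-l2-policy -o jsonpath='{.spec.nodeSelector}'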
Check if any nodes match:
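# Example for the default policy above, which excludes control-plane nodes
kubectl get nodes -l '!node-role.kubernetes.io/control-plane'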
Common issue: the policy excludes control-plane nodes, but you're running a single-node cluster, so no nodes match.
For single-node cluster, remove node selector:
kubectl patch ciliuml2announcementpolicy default-l2-policy --type=json -p='[{"op": "remove", "path": "/spec/nodeSelector"}]'
Or specifically include control plane:
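kubectl patch ciliuml2announcementpolicy default-l2-policy --type=merge \
  -p '{"spec":{"nodeSelector":{"matchExpressions":[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists"}]}}}'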
Solution 2.5: Check Lease Leader Election¶
Verify lease exists and has owner:
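kubectl get lease -n kube-system | grep cilium-l2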
Should see lease for each service:
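Lease names follow cilium-l2announce-<namespace>-<service>; the holder is the announcing node (values here are illustrative):
NAME                                    HOLDER      AGE
cilium-l2announce-default-my-service    worker-01   10m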
If no lease or no holder, check Cilium logs:
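kubectl logs -n kube-system -l k8s-app=cilium --tail=200 | grep -i "lease\|l2"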
Look for errors saying the agent is forbidden from creating or updating leases. This indicates an RBAC issue; apply the RBAC from Solution 1.5.
Solution 2.6: Force Announcement Refresh¶
Delete and recreate the service to trigger new announcement:
kubectl get svc my-service -o yaml > my-service-backup.yaml
kubectl delete svc my-service
kubectl apply -f my-service-backup.yaml
Or restart Cilium agent on announcing node:
# Find which node is announcing
kubectl get lease -n kube-system | grep cilium-l2
# Restart Cilium on that node
kubectl delete pod -n kube-system -l k8s-app=cilium --field-selector spec.nodeName=worker-01
Problem 3: RBAC Errors in Cilium Logs¶
Symptoms¶
Checking Cilium operator logs shows permission denied:
level=error msg="Failed to create lease" error="leases.coordination.k8s.io is forbidden: User 'system:serviceaccount:kube-system:cilium-operator' cannot create resource 'leases'"
The Cilium agent logs may show similar "forbidden" errors when creating or updating leases.
Root Causes¶
- Missing RBAC Role/RoleBinding for lease management
- ClusterRole/ClusterRoleBinding not created during installation
- ServiceAccount not assigned to Cilium pods
Solution 3.1: Create Complete RBAC Configuration¶
Create comprehensive RBAC covering all L2 announcement needs:
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: cilium-l2-announcement
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: cilium-l2-announcement
namespace: kube-system
rules:
# Lease management for leader election
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["create", "get", "update", "list", "watch"]
# Service and endpoint access
- apiGroups: [""]
resources: ["services", "endpoints"]
verbs: ["get", "list", "watch"]
# Node information
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: cilium-l2-announcement
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: cilium-l2-announcement
subjects:
- kind: ServiceAccount
name: cilium-l2-announcement
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: cilium-l2-announcement
rules:
# Access to Cilium CRDs
- apiGroups: ["cilium.io"]
resources:
- ciliumloadbalancerippools
- ciliuml2announcementpolicies
verbs: ["get", "list", "watch"]
# Service access across namespaces
- apiGroups: [""]
resources: ["services"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: cilium-l2-announcement
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cilium-l2-announcement
subjects:
- kind: ServiceAccount
name: cilium-l2-announcement
namespace: kube-system
Apply:
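# Use whatever file name you saved the manifest above under
kubectl apply -f cilium-l2-rbac.yaml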
Solution 3.2: Verify ServiceAccount Assignment¶
Check Cilium operator deployment uses correct ServiceAccount:
kubectl get deployment cilium-operator -n kube-system -o jsonpath='{.spec.template.spec.serviceAccountName}'
Should output: cilium-operator (Cilium's default SA)
Check if Cilium operator's SA has proper ClusterRole:
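# A default Helm install creates a cilium-operator ClusterRoleBinding
kubectl get clusterrolebinding cilium-operator -o wide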
The binding should reference the cilium-operator ClusterRole and the cilium-operator ServiceAccount in kube-system.
Solution 3.3: Restart Cilium Components¶
After applying RBAC, restart to pick up new permissions:
# Restart operator
kubectl rollout restart deployment/cilium-operator -n kube-system
# Restart agents
kubectl rollout restart daemonset/cilium -n kube-system
Wait for rollout:
kubectl rollout status daemonset/cilium -n kube-system
kubectl rollout status deployment/cilium-operator -n kube-system
Problem 4: Wrong externalTrafficPolicy Behavior¶
Symptoms¶
Scenario A: Service works but can't see client source IP in application logs.
Scenario B: Service is inaccessible, but only when using externalTrafficPolicy: Local.
Root Cause¶
externalTrafficPolicy: Local requires the announcing node to have a local pod. If not, traffic is dropped.
Solution 4.1: Understand Traffic Policy Differences¶
| Policy | Client IP Preserved | Works Without Local Pod | Load Distribution |
|---|---|---|---|
| Cluster (default) | ❌ No (SNAT'd) | ✅ Yes | ✅ Even across all pods |
| Local | ✅ Yes | ❌ No | ⚠️ Only to local pods |
Solution 4.2: Check Current Policy¶
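Read the policy straight off the service (my-service is the example name used throughout this guide):
kubectl get svc my-service -o jsonpath='{.spec.externalTrafficPolicy}'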
Solution 4.3: Verify Pod Distribution with Local Policy¶
If using Local, check which node is announcing:
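# Lease name follows cilium-l2announce-<namespace>-<service>
kubectl get lease -n kube-system cilium-l2announce-default-my-service -o jsonpath='{.spec.holderIdentity}'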
Check if that node has pods:
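kubectl get pods -o wide -l app=my-service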
NAME READY STATUS NODE
my-service-7d4c9b8f6d-abc123 1/1 Running worker-01
my-service-7d4c9b8f6d-def456 1/1 Running worker-01
Problem: Worker-02 is announcing but pods are on worker-01!
Solution 4.4: Fix Pod Distribution¶
Option A: Force pods to announcing node:
apiVersion: v1
kind: Pod
metadata:
  name: my-service
spec:
  nodeSelector:
    kubernetes.io/hostname: worker-02
  containers:
  - name: my-service
    image: my-service:latest   # placeholder; use your application image
Option B: Increase replicas for better distribution:
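kubectl scale deployment my-service --replicas=3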
Option C: Use pod anti-affinity to spread pods:
apiVersion: apps/v1
kind: Deployment
spec:
template:
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: my-service
topologyKey: kubernetes.io/hostname
Solution 4.5: Switch to Cluster Policy (If Source IP Not Needed)¶
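kubectl patch svc my-service -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'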
Solution 4.6: Keep Local Policy and Use Ingress¶
If you need source IP preservation, use ingress controller with Local policy:
apiVersion: v1
kind: Service
metadata:
  name: traefik
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # Preserves source IP
  selector:
    app: traefik
  ports:
  - name: web
    port: 80
    targetPort: 80
---
apiVersion: apps/v1
kind: DaemonSet  # Runs on all nodes
metadata:
  name: traefik
spec:
  selector:
    matchLabels:
      app: traefik
  template:
    metadata:
      labels:
        app: traefik
    spec:
      containers:
      - name: traefik
        image: traefik:v2.10
DaemonSet ensures every node has a pod, making Local policy work reliably.
Problem 5: Service Accessible but Application Not Responding¶
Symptoms¶
- Can ping LoadBalancer IP ✅
- TCP handshake fails or HTTP returns errors ❌
Root Cause¶
This is typically an application or service configuration issue, not Cilium L2.
Solution 5.1: Verify Pods Are Running¶
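# Adjust the label to your app's selector
kubectl get pods -l app=my-service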
All pods should be Running and READY 1/1.
Solution 5.2: Check Service Selector Matches Pods¶
Get service selector:
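kubectl get svc my-service -o jsonpath='{.spec.selector}'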
The output shows the selector labels (for example, app: my-service).
Verify pods have matching labels:
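kubectl get pods -l app=my-service --show-labels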
If no pods match, update service selector or pod labels.
Solution 5.3: Verify Service Port Configuration¶
Describe service:
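kubectl describe svc my-service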
In the output, check that TargetPort matches the port your container actually listens on.
Verify container port:
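# Assumes the workload is a Deployment named my-service
kubectl get deployment my-service -o jsonpath='{.spec.template.spec.containers[0].ports}'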
Solution 5.4: Test Pod Directly¶
Port-forward to pod to bypass service:
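# Forward straight to the workload (8080 is the example container port)
kubectl port-forward deployment/my-service 8080:8080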
Test locally:
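curl -v http://localhost:8080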
If this works, issue is with Service configuration, not application.
Solution 5.5: Check Network Policies¶
Verify no NetworkPolicy is blocking traffic:
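kubectl get networkpolicy -A
kubectl get ciliumnetworkpolicy -A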
If policies exist, ensure they allow ingress from all sources:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-loadbalancer
spec:
podSelector:
matchLabels:
app: my-service
policyTypes:
- Ingress
ingress:
- from:
- ipBlock:
cidr: 0.0.0.0/0 # Allow from anywhere
ports:
- protocol: TCP
port: 8080
Advanced Troubleshooting Techniques¶
Technique 1: Enable Debug Logging¶
Enable debug logs for Cilium agent:
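# debug.enabled turns on agent debug logging via the Cilium config; agents pick it up after a restart
helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set debug.enabled=true
kubectl rollout restart daemonset/cilium -n kube-system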
Watch logs:
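kubectl logs -n kube-system -l k8s-app=cilium -f --tail=50 | grep -i "l2\|announce\|lease"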
Disable when done:
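helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set debug.enabled=false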
Technique 2: Check L2 Announcement Status¶
Get detailed status for specific service:
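kubectl describe svc my-service
# Datapath view of the LoadBalancer IP and its backends (IP is the example from above)
kubectl exec -n kube-system ds/cilium -- cilium-dbg service list | grep 192.168.10.75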
Technique 3: Monitor ARP Announcements¶
From node announcing the IP:
# Find announcing node
ANNOUNCING_NODE=$(kubectl get lease -n kube-system -l cilium.io/service=default/my-service -o jsonpath='{.items[0].spec.holderIdentity}')
# SSH to node (if using SSH) or use Talos
talosctl -n $ANNOUNCING_NODE logs kubelet | grep -i arp
Or capture ARP packets from external machine:
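# eth0 is whatever interface connects this machine to the same LAN
sudo tcpdump -n -i eth0 arp and host 192.168.10.75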
Should see:
ARP, Request who-has 192.168.10.75 tell 192.168.10.1
ARP, Reply 192.168.10.75 is-at aa:bb:cc:dd:ee:ff
Technique 4: Verify Cilium BPF Program¶
Check BPF programs loaded:
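kubectl exec -n kube-system ds/cilium -- cilium-dbg bpf lb list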
Should show LoadBalancer IP and backends.
Technique 5: Check Cilium Connectivity¶
Test Cilium connectivity:
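# Requires the cilium CLI on your workstation
cilium connectivity test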
This runs comprehensive tests (takes 5-10 minutes).
Common Error Messages and Solutions¶
Error: "no CiliumLoadBalancerIPPool matches"¶
Logged by the operator when LB-IPAM cannot match the service to any pool.
Solution: Check service namespace matches pool selector (see Solution 1.3)
Error: "interface not found"¶
Logged by the agent when the interface named in the L2 policy doesn't exist on the node.
Solution: Update L2 policy with correct interface name (see Solution 2.3)
Error: "failed to acquire lease"¶
Logged by the agent when it cannot create or update the service's coordination lease.
Solution: Apply RBAC permissions (see Solution 3.1)
Error: "no nodes match announcement policy"¶
Logged when the policy's node selector doesn't select any node, so nothing announces the IP.
Solution: Adjust node selector in policy (see Solution 2.4)
Debugging Command Cheat Sheet¶
# Check Cilium components status
kubectl get pods -n kube-system -l k8s-app=cilium
kubectl exec -n kube-system ds/cilium -- cilium-dbg status
# Verify L2 feature enabled
kubectl exec -n kube-system ds/cilium -- cilium-dbg status | grep -i l2
# List IP pools and policies
kubectl get ciliumloadbalancerippool
kubectl get ciliuml2announcementpolicy
# Check RBAC
kubectl get role,rolebinding -n kube-system | grep cilium-l2
kubectl get clusterrole,clusterrolebinding | grep cilium
# View leases (shows which node is announcing)
kubectl get lease -n kube-system | grep cilium-l2
# Check service details
kubectl describe svc <service-name>
kubectl get svc <service-name> -o yaml
# View Cilium logs
kubectl logs -n kube-system -l k8s-app=cilium --tail=100 | grep -i "l2\|announce\|lease"
kubectl logs -n kube-system deployment/cilium-operator | grep -i "l2\|ipam"
# Test from external machine
ping <LOADBALANCER_IP>
curl -v http://<LOADBALANCER_IP>
arp -a | grep <LOADBALANCER_IP>
# Check node interfaces (Talos)
talosctl get links -n <node-ip>
# Verify pod distribution
kubectl get pods -o wide -l app=<service>
Prevention Best Practices¶
1. Always Test in Development First¶
Create test service before deploying production:
kubectl create deployment test-nginx --image=nginx
kubectl expose deployment test-nginx --type=LoadBalancer --port=80
kubectl get svc test-nginx -w
2. Document Your IP Pool Allocations¶
Keep a record:
192.168.10.64 - 192.168.10.79: Production services
192.168.10.80 - 192.168.10.95: Staging services
192.168.10.96 - 192.168.10.111: Development services
3. Use Explicit IP Assignment for Critical Services¶
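One option (recent Cilium releases; older releases used the io.cilium/lb-ipam-ips annotation, so check the docs for your version) is to request specific addresses with the lbipam.cilium.io/ips annotation:
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    lbipam.cilium.io/ips: "192.168.10.70"   # must fall inside one of your pools
spec:
  type: LoadBalancer
  selector:
    app: my-service
  ports:
  - port: 80
    targetPort: 8080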
4. Monitor IP Pool Usage¶
Set up alert when pool is 80% full:
# Check current usage
kubectl get ciliumloadbalancerippool -o json | jq '.items[] | {name: .metadata.name, available: .status.ipsAvailable}'
5. Label Services for Troubleshooting¶
Makes filtering logs easier:
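# Label keys/values are examples; pick your own convention
kubectl label service my-service team=platform exposure=l2-loadbalancer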
When to Escalate¶
If you've tried all solutions and still have issues, gather this information before escalating:
- Cilium version: helm list -n kube-system
- Talos version: talosctl version
- Full Cilium status: kubectl exec -n kube-system ds/cilium -- cilium-dbg status
- Configuration dumps:
helm get values cilium -n kube-system > cilium-values.yaml
kubectl get ciliumloadbalancerippool -o yaml > ip-pools.yaml
kubectl get ciliuml2announcementpolicy -o yaml > l2-policies.yaml
- Recent logs:
kubectl logs -n kube-system -l k8s-app=cilium --tail=500 > cilium-logs.txt
kubectl logs -n kube-system deployment/cilium-operator --tail=500 > operator-logs.txt
- Service and lease details:
kubectl describe svc <service-name> > service-details.txt
kubectl get lease -n kube-system -o yaml > leases.yaml
Open issue at: https://github.com/cilium/cilium/issues
Related Documentation¶
- Tutorial: Deploy Cilium with L2 LoadBalancer on Talos
- How to: Configure Cilium L2 Announcements
- Explanation: Cilium L2 Networking Architecture