How to Troubleshoot Cilium LoadBalancer Issues on Talos Linux

Goal: Diagnose and fix common LoadBalancer problems in Cilium L2 announcement deployments.

Audience: Kubernetes administrators and SREs managing Talos clusters with Cilium.

Time: Variable (5 minutes to 1 hour depending on issue complexity)


Quick Diagnostic Decision Tree

graph TD
    A[LoadBalancer Service Created] --> B{Has EXTERNAL-IP?}
    B -->|No - Pending| C[Problem 1: IP Stuck in Pending]
    B -->|Yes| D{Accessible from outside?}
    D -->|No| E{Can ping IP?}
    E -->|No| F[Problem 2: ARP Incomplete]
    E -->|Yes, but no HTTP| G[Problem 5: Application Issue]
    D -->|Yes, but wrong source IP| H[Problem 4: Traffic Policy]
    C --> I[Check IP Pool & RBAC]
    F --> J[Check L2 Policy & Interface]
    G --> K[Check Pod/Service Config]
    H --> L[Check externalTrafficPolicy]

Problem 1: LoadBalancer IP Stuck in Pending

Symptoms

kubectl get svc my-service

Output shows <pending> indefinitely:

NAME         TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
my-service   LoadBalancer   10.96.100.123   <pending>     80:30123/TCP   5m

Root Causes

  1. No IP pool configured or IP pool exhausted
  2. LB-IPAM not enabled in Cilium
  3. Service selector doesn't match any IP pool
  4. RBAC permissions missing for IPAM controller

Solution 1.1: Verify IP Pool Exists

Check for IP pools:

kubectl get ciliumloadbalancerippool

Expected output:

NAME          DISABLED   CONFLICTING   IPS AVAILABLE   AGE
prod-pool     false      false         14              10m

If no pools exist, create one:

cat <<EOF | kubectl apply -f -
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: default-pool
spec:
  blocks:
    - cidr: "192.168.10.64/28"
EOF
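
Once a matching pool exists, LB-IPAM should assign an address within a few seconds. A quick way to confirm, using the service from the symptom above:

kubectl get svc my-service -w   # EXTERNAL-IP should change from <pending> to an address in 192.168.10.64/28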

Solution 1.2: Check IP Pool Availability

Verify pool has available IPs:

kubectl describe ciliumloadbalancerippool prod-pool

Look for:

Status:
  Conditions:
    Status: True
    Type: io.cilium/ips-available

If IPS AVAILABLE is 0, expand your pool or create a new one:

kubectl edit ciliumloadbalancerippool prod-pool

Change CIDR to larger range:

spec:
  blocks:
    - cidr: "192.168.10.64/27" # Changed from /28 to /27 (32 IPs)

Solution 1.3: Check Service Selector Matching

View pool's service selector:

kubectl get ciliumloadbalancerippool prod-pool -o yaml

Example with namespace restriction:

spec:
  serviceSelector:
    matchExpressions:
      - key: io.kubernetes.service.namespace
        operator: In
        values:
          - production
          - staging

If your service is in the default namespace, it won't match. Either:

Option A: Remove the selector entirely so the pool matches all services (a JSON merge patch value of null deletes the field):

kubectl patch ciliumloadbalancerippool prod-pool --type=merge -p '{"spec":{"serviceSelector":null}}'

Option B: Add label to your service:

kubectl label service my-service pool=production

And update pool selector:

spec:
  serviceSelector:
    matchLabels:
      pool: production
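
If you prefer not to edit the pool manually, the same change can be applied with a merge patch. This is a sketch that assumes the pool currently restricts by namespace via matchExpressions, which the patch clears before setting the new matchLabels:

kubectl patch ciliumloadbalancerippool prod-pool --type=merge \
  -p '{"spec":{"serviceSelector":{"matchExpressions":null,"matchLabels":{"pool":"production"}}}}'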

Solution 1.4: Verify IPAM is Enabled

Check Cilium operator logs:

kubectl logs -n kube-system deployment/cilium-operator | grep -i ipam

Should see:

level=info msg="LB-IPAM is enabled"

If not enabled, check Helm values:

helm get values cilium -n kube-system | grep -A1 -E "l2announcements|externalIPs"

Should have:

l2announcements:
  enabled: true
externalIPs:
  enabled: true

If missing, upgrade Cilium:

helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set l2announcements.enabled=true \
  --set externalIPs.enabled=true
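
After the upgrade, wait for the operator to roll out and re-check the log line from above:

kubectl rollout status deployment/cilium-operator -n kube-system
kubectl logs -n kube-system deployment/cilium-operator | grep -i "lb-ipam"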

Solution 1.5: Check RBAC Permissions

Verify IPAM controller has permissions:

kubectl get role cilium-l2-announcement -n kube-system
kubectl get rolebinding cilium-l2-announcement -n kube-system

If missing, create the RBAC below. Bind it to the ServiceAccounts Cilium actually runs as (cilium for the agent DaemonSet, cilium-operator for the operator), since those are the identities that need lease access:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cilium-l2-announcement
  namespace: kube-system
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create", "get", "update"]
  - apiGroups: [""]
    resources: ["services", "endpoints"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cilium-l2-announcement
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cilium-l2-announcement
subjects:
  - kind: ServiceAccount
    name: cilium
    namespace: kube-system
  - kind: ServiceAccount
    name: cilium-operator
    namespace: kube-system

Apply:

kubectl apply -f cilium-l2-rbac.yaml

Restart Cilium operator to pick up permissions:

kubectl rollout restart deployment/cilium-operator -n kube-system

Problem 2: IP Assigned but Not Accessible (ARP Incomplete)

Symptoms

Service has EXTERNAL-IP assigned:

kubectl get svc my-service
NAME         TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)        AGE
my-service   LoadBalancer   10.96.100.123   192.168.10.75   80:30123/TCP   5m

But cannot reach from outside cluster. ARP shows incomplete:

arp -a | grep 192.168.10.75
? (192.168.10.75) at (incomplete) on enp0s1

Or ping fails:

ping -c 3 192.168.10.75
PING 192.168.10.75 (192.168.10.75) 56(84) bytes of data.
--- 192.168.10.75 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2047ms

Root Causes

  1. L2 announcements not enabled in Cilium
  2. No L2 announcement policy configured
  3. Wrong network interface selected in policy
  4. No nodes match the node selector
  5. Leader election failing due to lease issues

Solution 2.1: Verify L2 Announcements Enabled

Check Cilium agent status:

kubectl exec -n kube-system ds/cilium -- cilium-dbg status | grep -i l2

Expected output:

L2 Announcements:     Enabled

If disabled, check Helm configuration:

helm get values cilium -n kube-system | grep -A5 l2announcements

Should have:

l2announcements:
  enabled: true

If missing, upgrade:

helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  --set l2announcements.enabled=true

Solution 2.2: Check L2 Announcement Policy Exists

List policies:

kubectl get ciliuml2announcementpolicy

If none exist, create one:

apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: default-l2-policy
spec:
  serviceSelector:
    matchLabels: {} # Match all services
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist # Only worker nodes
  interfaces:
    - ^enp0s.*
    - ^eth0$
    - ^ens18$
  externalIPs: true
  loadBalancerIPs: true

Apply:

kubectl apply -f l2-announcement-policy.yaml
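
Once the policy is in place, a lease should appear for the service and the LoadBalancer IP should answer ARP. A quick check, using the example service and address from this guide:

kubectl get lease -n kube-system | grep my-service
ping -c 3 192.168.10.75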

Solution 2.3: Verify Correct Network Interface

This is the most common issue.

Step 1: Find your node's actual interface

From Talos node:

talosctl get links -n <node-ip>

Example output:

NAME          TYPE     ENABLED
enp0s1        ether    true
lo            loopback true
cilium_host   ether    true
cilium_net    ether    true
cilium_vxlan  ether    true

The physical interface is enp0s1.

Step 2: Check policy's interface regex

kubectl get ciliuml2announcementpolicy default-l2-policy -o yaml

Look at spec.interfaces:

spec:
  interfaces:
    - ^eth0$ # This won't match enp0s1!

Step 3: Update policy with correct interface

kubectl edit ciliuml2announcementpolicy default-l2-policy

Change to match your interface:

spec:
  interfaces:
    - ^enp0s1$ # Exact match

Or use pattern to match multiple:

spec:
  interfaces:
    - ^enp0s.* # Matches enp0s1, enp0s3, enp0s8, etc.

Step 4: Verify from debug pod

Run pod with host network:

kubectl run net-debug --rm -it --image=nicolaka/netshoot --overrides='{"spec":{"hostNetwork":true}}' -- ip link show

Identify the interface that connects to your LAN (usually the one with the node's IP).
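
To confirm that your policy regex actually matches a real interface, start the same host-network debug pod with a shell and filter the interface list with the pattern from the policy (^enp0s.* is the example pattern used above):

ip -o link show | awk -F': ' '{print $2}' | grep -E '^enp0s.*'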

Solution 2.4: Check Node Selector Matches Nodes

View policy's node selector:

kubectl get ciliuml2announcementpolicy default-l2-policy -o jsonpath='{.spec.nodeSelector}' | jq

Check if any nodes match:

kubectl get nodes --show-labels

Common issue: The policy excludes control-plane nodes, but you're running a single-node cluster:

nodeSelector:
  matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: DoesNotExist

For a single-node cluster, remove the node selector:

kubectl patch ciliuml2announcementpolicy default-l2-policy --type=json -p='[{"op": "remove", "path": "/spec/nodeSelector"}]'

Or match all nodes, including the control plane:

nodeSelector:
  matchLabels: {} # Match all nodes
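
You can preview which nodes a DoesNotExist selector would pick by running the equivalent label query:

kubectl get nodes -l '!node-role.kubernetes.io/control-plane'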

Solution 2.5: Check Lease Leader Election

Verify lease exists and has owner:

kubectl get lease -n kube-system | grep cilium-l2

Should see lease for each service:

NAME                          HOLDER                     AGE
cilium-l2-default-my-service  worker-01                  5m

If no lease or no holder, check Cilium logs:

kubectl logs -n kube-system -l k8s-app=cilium --tail=100 | grep -i "lease\|l2"

Look for errors like:

level=error msg="Failed to acquire lease" error="leases.coordination.k8s.io is forbidden"

This indicates an RBAC issue. Apply the RBAC from Solution 1.5.

Solution 2.6: Force Announcement Refresh

Delete and recreate the service to trigger new announcement:

kubectl get svc my-service -o yaml > my-service-backup.yaml
kubectl delete svc my-service
kubectl apply -f my-service-backup.yaml

Or restart Cilium agent on announcing node:

# Find which node is announcing
kubectl get lease -n kube-system | grep cilium-l2

# Restart Cilium on that node
kubectl delete pod -n kube-system -l k8s-app=cilium --field-selector spec.nodeName=worker-01

Problem 3: RBAC Errors in Cilium Logs

Symptoms

Checking Cilium operator logs shows permission denied:

kubectl logs -n kube-system deployment/cilium-operator | grep -i error
level=error msg="Failed to create lease" error="leases.coordination.k8s.io is forbidden: User 'system:serviceaccount:kube-system:cilium-operator' cannot create resource 'leases'"

Or Cilium agent logs show:

kubectl logs -n kube-system ds/cilium | grep -i forbidden
level=error msg="cannot list services" error="services is forbidden"

Root Causes

  1. Missing RBAC Role/RoleBinding for lease management
  2. ClusterRole/ClusterRoleBinding not created during installation
  3. ServiceAccount not assigned to Cilium pods
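
Before applying new RBAC, you can confirm exactly which permission is missing by impersonating the ServiceAccount named in the error and asking the API server directly (shown for cilium-operator; swap in the agent's cilium SA as needed):

kubectl auth can-i create leases.coordination.k8s.io -n kube-system \
  --as=system:serviceaccount:kube-system:cilium-operator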

Solution 3.1: Create Complete RBAC Configuration

Create comprehensive RBAC covering all L2 announcement needs, bound to the ServiceAccounts Cilium runs as (cilium for the agent, cilium-operator for the operator):

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cilium-l2-announcement
  namespace: kube-system
rules:
  # Lease management for leader election
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create", "get", "update", "list", "watch"]
  # Service and endpoint access
  - apiGroups: [""]
    resources: ["services", "endpoints"]
    verbs: ["get", "list", "watch"]
  # Node information
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cilium-l2-announcement
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cilium-l2-announcement
subjects:
  - kind: ServiceAccount
    name: cilium
    namespace: kube-system
  - kind: ServiceAccount
    name: cilium-operator
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cilium-l2-announcement
rules:
  # Access to Cilium CRDs
  - apiGroups: ["cilium.io"]
    resources:
      - ciliumloadbalancerippools
      - ciliuml2announcementpolicies
    verbs: ["get", "list", "watch"]
  # Service access across namespaces
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cilium-l2-announcement
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cilium-l2-announcement
subjects:
  - kind: ServiceAccount
    name: cilium
    namespace: kube-system
  - kind: ServiceAccount
    name: cilium-operator
    namespace: kube-system

Apply:

kubectl apply -f cilium-l2-rbac-complete.yaml

Solution 3.2: Verify ServiceAccount Assignment

Check Cilium operator deployment uses correct ServiceAccount:

kubectl get deployment cilium-operator -n kube-system -o jsonpath='{.spec.template.spec.serviceAccountName}'

Should output: cilium-operator (Cilium's default SA)

Check if Cilium operator's SA has proper ClusterRole:

kubectl get clusterrolebinding | grep cilium-operator

Should see:

cilium-operator    ClusterRole/cilium-operator    10d

Solution 3.3: Restart Cilium Components

After applying RBAC, restart to pick up new permissions:

# Restart operator
kubectl rollout restart deployment/cilium-operator -n kube-system

# Restart agents
kubectl rollout restart daemonset/cilium -n kube-system

Wait for rollout:

kubectl rollout status daemonset/cilium -n kube-system
kubectl rollout status deployment/cilium-operator -n kube-system
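
Once both rollouts complete, confirm the forbidden errors have stopped appearing:

kubectl logs -n kube-system deployment/cilium-operator --since=5m | grep -i forbidden || echo "No RBAC errors in the last 5 minutes"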

Problem 4: Wrong externalTrafficPolicy Behavior

Symptoms

Scenario A: Service works but can't see client source IP in application logs.

Scenario B: Service is inaccessible, but only when using externalTrafficPolicy: Local.

Root Cause

externalTrafficPolicy: Local requires the announcing node to run a local backend pod for the service. If it doesn't, traffic arriving at the LoadBalancer IP is dropped.


Solution 4.1: Understand Traffic Policy Differences

| Policy            | Client IP Preserved | Works Without Local Pod | Load Distribution       |
|-------------------|---------------------|-------------------------|-------------------------|
| Cluster (default) | ❌ No (SNAT'd)      | ✅ Yes                  | ✅ Even across all pods |
| Local             | ✅ Yes              | ❌ No                   | ⚠️ Only to local pods   |

Solution 4.2: Check Current Policy

kubectl get svc my-service -o jsonpath='{.spec.externalTrafficPolicy}'

Solution 4.3: Verify Pod Distribution with Local Policy

If using Local, check which node is announcing:

kubectl get lease -n kube-system | grep cilium-l2
cilium-l2-default-my-service   worker-02   5m

Check if that node has pods:

kubectl get pods -l app=my-service -o wide
NAME                          READY   STATUS    NODE
my-service-7d4c9b8f6d-abc123  1/1     Running   worker-01
my-service-7d4c9b8f6d-def456  1/1     Running   worker-01

Problem: worker-02 is announcing, but the pods are on worker-01!

Solution 4.4: Fix Pod Distribution

Option A: Pin pods to the announcing node by adding a nodeSelector to the Deployment's pod template:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: worker-02

Option B: Increase replicas for better distribution:

kubectl scale deployment my-service --replicas=3

Option C: Use pod anti-affinity to spread pods:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-service
                topologyKey: kubernetes.io/hostname

Solution 4.5: Switch to Cluster Policy (If Source IP Not Needed)

kubectl patch svc my-service -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'

Solution 4.6: Keep Local Policy and Use Ingress

If you need source IP preservation, run your ingress controller as a DaemonSet with the Local policy:

apiVersion: v1
kind: Service
metadata:
  name: traefik
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local # Preserves source IP
  selector:
    app: traefik
  ports:
    - name: web
      port: 80
      targetPort: 80
---
apiVersion: apps/v1
kind: DaemonSet # Runs on all nodes
metadata:
  name: traefik
spec:
  selector:
    matchLabels:
      app: traefik
  template:
    metadata:
      labels:
        app: traefik
    spec:
      containers:
        - name: traefik
          image: traefik:v2.10

DaemonSet ensures every node has a pod, making Local policy work reliably.
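To confirm the DaemonSet gives the Local policy a backend on every node, check that each node runs one pod (using the app=traefik label from the example above):

kubectl get pods -l app=traefik -o wide
kubectl get nodes --no-headers | wc -l   # Pod count should match node count (minus any tainted nodes)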


Problem 5: Service Accessible but Application Not Responding

Symptoms

  • Can ping LoadBalancer IP ✅
  • TCP handshake fails or HTTP returns errors ❌

ping -c 3 192.168.10.75    # Works
curl http://192.168.10.75  # Fails or times out

Root Cause

This is typically an application or service configuration issue, not Cilium L2.


Solution 5.1: Verify Pods Are Running

kubectl get pods -l app=my-service

All pods should be Running and READY 1/1.

Solution 5.2: Check Service Selector Matches Pods

Get service selector:

kubectl get svc my-service -o jsonpath='{.spec.selector}' | jq

Output:

{
  "app": "my-service"
}

Verify pods have matching labels:

kubectl get pods -l app=my-service --show-labels

If no pods match, update service selector or pod labels.
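
A quick cross-check is the service's endpoints; <none> means the selector matches no ready pods:

kubectl get endpoints my-service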

Solution 5.3: Verify Service Port Configuration

Describe service:

kubectl describe svc my-service

Check TargetPort matches container port:

Port: 80/TCP
TargetPort: 8080/TCP # Must match container's port

Verify container port:

kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].ports}' | jq

Solution 5.4: Test Pod Directly

Port-forward to pod to bypass service:

kubectl port-forward pod/<pod-name> 8080:8080

Test locally:

curl http://localhost:8080

If this works, issue is with Service configuration, not application.
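
To isolate the Service layer from the L2 path, you can also curl the service's cluster DNS name from inside the cluster (this sketch assumes the service lives in the default namespace):

kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -sS http://my-service.default.svc.cluster.local

If this works but the LoadBalancer IP does not, revisit Problems 2 and 4; if it also fails, the issue is the Service or the pods.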

Solution 5.5: Check Network Policies

Verify no NetworkPolicy is blocking traffic:

kubectl get networkpolicy -n <namespace>

If policies exist, ensure they allow ingress from all sources:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-loadbalancer
spec:
  podSelector:
    matchLabels:
      app: my-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 0.0.0.0/0 # Allow from anywhere
      ports:
        - protocol: TCP
          port: 8080
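
Because Cilium is the CNI, Cilium-specific policies are enforced alongside standard NetworkPolicy objects, so check for those as well:

kubectl get ciliumnetworkpolicies -A
kubectl get ciliumclusterwidenetworkpolicies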

Advanced Troubleshooting Techniques

Technique 1: Enable Debug Logging

Enable debug logs for Cilium agent:

kubectl exec -n kube-system ds/cilium -- cilium-dbg config set debug true

Watch logs:

kubectl logs -n kube-system -l k8s-app=cilium --tail=100 -f | grep -i "l2\|announce"

Disable when done:

kubectl exec -n kube-system ds/cilium -- cilium-dbg config set debug false

Technique 2: Check L2 Announcement Status

Get detailed status for specific service:

kubectl exec -n kube-system ds/cilium -- cilium-dbg service list | grep <EXTERNAL-IP>

Technique 3: Monitor ARP Announcements

First, identify which node is currently announcing the IP:

# Find the announcing node (the lease HOLDER column)
ANNOUNCING_NODE=$(kubectl get lease -n kube-system | grep my-service | awk '{print $2}')
echo "Announcing node: $ANNOUNCING_NODE"

Then capture ARP traffic from an external machine on the same L2 segment:

sudo tcpdump -i enp0s1 arp -n

Should see:

ARP, Request who-has 192.168.10.75 tell 192.168.10.1
ARP, Reply 192.168.10.75 is-at aa:bb:cc:dd:ee:ff

Technique 4: Verify Cilium BPF Program

Check BPF programs loaded:

kubectl exec -n kube-system ds/cilium -- cilium-dbg bpf lb list

Should show LoadBalancer IP and backends.

Technique 5: Check Cilium Connectivity

Run the connectivity test with the Cilium CLI from a machine with cluster access (this is part of the cilium CLI, not the in-agent cilium-dbg tool):

cilium connectivity test

This deploys test workloads and runs a comprehensive suite (takes 5-10 minutes).


Common Error Messages and Solutions

Error: "no CiliumLoadBalancerIPPool matches"

Message in logs:

level=warning msg="No CiliumLoadBalancerIPPool matches service" service=default/my-service

Solution: Check service namespace matches pool selector (see Solution 1.3)


Error: "interface not found"

Message in logs:

level=error msg="Failed to announce IP" error="interface eth0 not found"

Solution: Update L2 policy with correct interface name (see Solution 2.3)


Error: "failed to acquire lease"

Message in logs:

level=error msg="Failed to acquire lease" error="leases.coordination.k8s.io is forbidden"

Solution: Apply RBAC permissions (see Solution 3.1)


Error: "no nodes match announcement policy"

Message in logs:

level=warning msg="No nodes match CiliumL2AnnouncementPolicy" policy=default-l2-policy

Solution: Adjust node selector in policy (see Solution 2.4)


Debugging Command Cheat Sheet

# Check Cilium components status
kubectl get pods -n kube-system -l k8s-app=cilium
kubectl exec -n kube-system ds/cilium -- cilium-dbg status

# Verify L2 feature enabled
kubectl exec -n kube-system ds/cilium -- cilium-dbg status | grep -i l2

# List IP pools and policies
kubectl get ciliumloadbalancerippool
kubectl get ciliuml2announcementpolicy

# Check RBAC
kubectl get role,rolebinding -n kube-system | grep cilium-l2
kubectl get clusterrole,clusterrolebinding | grep cilium

# View leases (shows which node is announcing)
kubectl get lease -n kube-system | grep cilium-l2

# Check service details
kubectl describe svc <service-name>
kubectl get svc <service-name> -o yaml

# View Cilium logs
kubectl logs -n kube-system -l k8s-app=cilium --tail=100 | grep -i "l2\|announce\|lease"
kubectl logs -n kube-system deployment/cilium-operator | grep -i "l2\|ipam"

# Test from external machine
ping <LOADBALANCER_IP>
curl -v http://<LOADBALANCER_IP>
arp -a | grep <LOADBALANCER_IP>

# Check node interfaces (Talos)
talosctl get links -n <node-ip>

# Verify pod distribution
kubectl get pods -o wide -l app=<service>

Prevention Best Practices

1. Always Test in Development First

Create test service before deploying production:

kubectl create deployment test-nginx --image=nginx
kubectl expose deployment test-nginx --type=LoadBalancer --port=80
kubectl get svc test-nginx -w
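
Clean up the test resources when finished:

kubectl delete svc,deployment test-nginx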

2. Document Your IP Pool Allocations

Keep a record:

192.168.10.64 - 192.168.10.79: Production services
192.168.10.80 - 192.168.10.95: Staging services
192.168.10.96 - 192.168.10.111: Development services

3. Use Explicit IP Assignment for Critical Services

metadata:
  annotations:
    io.cilium/lb-ipam-ips: "192.168.10.75"
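
For example, a Service pinned to a specific pool address might look like this (the name, selector, and ports are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: ingress-public
  annotations:
    io.cilium/lb-ipam-ips: "192.168.10.75"
spec:
  type: LoadBalancer
  selector:
    app: ingress
  ports:
    - port: 80
      targetPort: 8080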

4. Monitor IP Pool Usage

Set up alert when pool is 80% full:

# Check current usage via the IPS AVAILABLE column
kubectl get ciliumloadbalancerippool

5. Label Services for Troubleshooting

metadata:
  labels:
    app: my-service
    team: platform
    environment: production

Makes filtering logs easier:

kubectl logs -n kube-system -l k8s-app=cilium | grep "service=production/my-service"

When to Escalate

If you've tried all solutions and still have issues, gather this information before escalating:

  1. Cilium version: helm list -n kube-system
  2. Talos version: talosctl version
  3. Full Cilium status: kubectl exec -n kube-system ds/cilium -- cilium-dbg status
  4. Configuration dumps:
helm get values cilium -n kube-system > cilium-values.yaml
kubectl get ciliumloadbalancerippool -o yaml > ip-pools.yaml
kubectl get ciliuml2announcementpolicy -o yaml > l2-policies.yaml
  5. Recent logs:
kubectl logs -n kube-system -l k8s-app=cilium --tail=500 > cilium-logs.txt
kubectl logs -n kube-system deployment/cilium-operator --tail=500 > operator-logs.txt
  6. Service and lease details:
kubectl describe svc <service-name> > service-details.txt
kubectl get lease -n kube-system -o yaml > leases.yaml

Open an issue at: https://github.com/cilium/cilium/issues



References