
If you work with Docker, there’s no doubt you’ve heard about Kubernetes. I won’t introduce this amazing gift from Google. This post is about a particular resource, the Ingress, and its controller. Introduced as a beta feature in Kubernetes 1.2, the Ingress resource is the missing piece for opening your cluster to the world.

Sure, if you’re lucky enough to run Kubernetes in a supported cloud environment (AWS, Google Cloud…), you can provision a load balancer automatically when creating a new Service. But that’s it. If you need SSL termination, or just some simple routing rules, you’re stuck. This is where the Ingress resource steps in.

The beauty of the Ingress Controller is the freedom of choice. Of course Google offers its own implementation (based on NGINX), but NGINX Inc. offers one too, as do Rancher, HAProxy, Vulcand, etc.

In the end, they all do the same thing:

  • Watch the Kubernetes API for any Ingress resource change
  • Update the configuration of the controller (create/update/delete endpoint, SSL certificate, etc.)
  • Reload the configuration
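The loop above can be sketched in a few lines of Python. This is a simulation for illustration only: all names are hypothetical, and a real controller streams events from the Kubernetes API and templates a full proxy configuration rather than the placeholder used here.

```python
# Sketch of the watch -> regenerate -> reload cycle every Ingress Controller
# implements. Events are handled in memory instead of streamed from the API.

def regenerate_config(ingresses):
    """Render one (placeholder) upstream block per known Ingress backend."""
    lines = []
    for ing in ingresses.values():
        upstream = f"{ing['namespace']}-{ing['service']}-{ing['port']}"
        lines.append(f"upstream {upstream} {{ ... }}")
    return "\n".join(lines)

def reload_proxy(config):
    pass  # placeholder: e.g. signal NGINX to re-read its configuration

def handle_event(event, ingresses):
    """Apply one watch event, then regenerate and reload the configuration."""
    kind, ing = event  # kind is "ADDED", "MODIFIED" or "DELETED"
    if kind == "DELETED":
        ingresses.pop(ing["name"], None)
    else:
        ingresses[ing["name"]] = ing
    config = regenerate_config(ingresses)
    reload_proxy(config)
    return config

state = {}
print(handle_event(
    ("ADDED", {"name": "my-wordpress-ingress", "namespace": "default",
               "service": "my-wordpress-svc", "port": 80}),
    state))
```

Every implementation is a variation on this shape; the differences lie in how the configuration is rendered and how cheaply it can be reloaded.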

The main differences between implementations are the tweaking possibilities (basic authentication, rewrite rules…). But they all have one thing in common: the way they denormalize a Kubernetes Service into a proxy configuration (NGINX, HAProxy, etc.).

Let’s imagine you have a Service my-wordpress-svc with 2 pods running behind it:

my-wordpress-svc: 10.3.0.149
  my-wordpress-pod-atgwrf: 10.2.73.3  
  my-wordpress-pod-ioasfk: 10.2.28.4

And an Ingress resource pointing to that service:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-wordpress-ingress
  namespace: default
spec:
  rules:
  - host: onelineatatime.io
    http:
      paths:
      - path: /
        backend:
          serviceName: my-wordpress-svc
          servicePort: 80

When this Ingress resource is created, the Ingress Controller detects it, fetches all the endpoints behind the service, and regenerates the configuration.

With NGINX, it would look like this (slightly simplified here):

upstream default-my-wordpress-svc-80 {
  least_conn;
  server 10.2.73.3:80 max_fails=0 fail_timeout=0;
  server 10.2.28.4:80 max_fails=0 fail_timeout=0;   
}

server {
  server_name onelineatatime.io;
  listen 80;

  location / {
    proxy_set_header Host               $host;

    # Pass Real IP
    proxy_set_header X-Real-IP          $remote_addr;
    proxy_set_header X-Forwarded-For    $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Host   $host;
    proxy_set_header X-Forwarded-Port   $server_port;
    proxy_set_header X-Forwarded-Proto  $pass_access_scheme;

    proxy_pass http://default-my-wordpress-svc-80;
  }
}
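The denormalization step itself can be approximated like this. This is an illustrative sketch, not any real controller’s code: actual controllers run templates over the full endpoint objects, but the naming follows the same namespace-service-port pattern as the configuration above.

```python
def render_upstream(namespace, service, port, endpoints):
    """Render an NGINX upstream block from a Service's endpoint IPs,
    using the <namespace>-<service>-<port> naming convention."""
    name = f"{namespace}-{service}-{port}"
    servers = "\n".join(
        f"  server {ip}:{port} max_fails=0 fail_timeout=0;" for ip in endpoints)
    return f"upstream {name} {{\n  least_conn;\n{servers}\n}}"

# Reproduce the upstream block for the two WordPress pods.
print(render_upstream("default", "my-wordpress-svc", 80,
                      ["10.2.73.3", "10.2.28.4"]))
```

Whenever the endpoint list changes, the controller re-renders this block and reloads the proxy.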

But wait a minute: why not use the Service virtual IP in the upstream configuration, instead of fetching the endpoints? It sounds like a good idea:

  1. No need to update the configuration when scaling up/down or updating a deployment: Service VIPs don’t change.
  2. No risk of sending requests to pods that no longer exist (the Ingress Controller is not always in sync with the API; it’s just a watcher, so there may be a delay).
  3. No risk of unexpected behaviour (non-idempotent requests are automatically retried by NGINX on connection error, timeout, HTTP 502, 503, 504…, by passing the request to the next server).

In theory, by using the Service VIP, the Ingress Controller would not have to worry about any pod change, and that would guarantee seamless scaling and deployment of your service.

Well, that’s what I thought, and that’s what other people thought too, and we were wrong. The idea sounds good, but in practice there is no guarantee at all when using the Service VIP. No more than when letting the Ingress Controller maintain the pod list.

Why is that? First, let’s describe how scaling up works. Imagine this scenario:

  1. Replication Controller creates 2 pods
  2. Pods become ready
  3. The controller manager updates the endpoints
  4. Kube-proxy detects the endpoints change and updates iptables

Now, what about scaling down? We set the number of replicas to 1:

  1. Replication Controller deletes 1 pod
  2. Pod is marked as Terminating
  3. Kubelet observes that change and sends SIGTERM
  4. Endpoint controller observes the pod change (terminating) and removes the pod from Endpoints
  5. Kube-proxy observes the Endpoints change and updates iptables
  6. Pod receives SIGKILL after grace period

The important part here is that steps 3 and 4 are triggered in parallel. There is no synchronisation; one can happen before the other. That means your pod might be shutting down while the endpoints have not been updated yet.

If the endpoints have not been updated, the Service VIP will still serve traffic to those pods. And even if the endpoints have been updated, there is no guarantee that kube-proxy has picked up on those changes yet, and the iptables rules will still route traffic to those dying pods.
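This race is easy to model with a toy simulation (IPs are taken from the example above; everything else is hypothetical). The kubelet and the endpoint controller are modelled as independent actors, and the interleaving of their actions is the only variable:

```python
# Toy model of the scale-down race: the kubelet (SIGTERM) and the endpoint
# controller (endpoint removal) react to the same pod change independently,
# so a request can still be routed to a pod that is already shutting down.

def simulate(interleaving, dying_pod="10.2.28.4"):
    """Replay one possible ordering of the watchers' actions and return
    the requests that were routed to an already-terminating pod."""
    endpoints = {"10.2.73.3", "10.2.28.4"}  # what iptables currently routes to
    terminating = set()
    misrouted = []
    for step in interleaving:
        if step == "kubelet:SIGTERM":
            terminating.add(dying_pod)        # pod begins graceful shutdown
        elif step == "endpoints:remove":
            endpoints.discard(dying_pod)      # endpoint update finally lands
        elif step == "request":
            # traffic may still reach any IP left in the endpoint set
            misrouted += [ip for ip in endpoints if ip in terminating]
    return misrouted

# Unlucky ordering: SIGTERM lands before the endpoint update propagates.
print(simulate(["kubelet:SIGTERM", "request", "endpoints:remove", "request"]))
# Lucky ordering: endpoints are updated first, nothing is misrouted.
print(simulate(["endpoints:remove", "kubelet:SIGTERM", "request"]))
```

Nothing in Kubernetes forces the lucky ordering, whether the proxy targets the VIP or the endpoint list directly.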

Just like the Ingress Controller, kube-proxy is an API watcher. It does not work synchronously with scaling events: it detects changes at some point and tries to apply them.

That’s the reason why using the Service VIP in the configuration generated by the Ingress Controller does not provide more resiliency.

So if it’s the same, why do Ingress Controllers bypass the Service and maintain the endpoint list themselves?

Advantages of bypassing the Service VIP

There are multiple advantages to having the pods’ endpoints in the NGINX upstream configuration:

  • Sticky sessions on a pod (through nginx-sticky-module-ng, for example)
  • Fine tuning of the upstream behaviour in case of failure (max_fails, fail_timeout…)
  • Because Service VIPs rely on DNAT in the iptables config, going through them would add unnecessary overhead
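To illustrate that last point, the DNAT hop behind a Service VIP looks roughly like this in kube-proxy’s iptables mode (simplified; the chain suffixes are hashes and will differ on a real node):

```
-A KUBE-SERVICES -d 10.3.0.149/32 -p tcp --dport 80 -j KUBE-SVC-XXXXXXXXXXXXXXXX
-A KUBE-SVC-XXXXXXXXXXXXXXXX -m statistic --mode random --probability 0.5 -j KUBE-SEP-YYYYYYYYYYYYYYYY
-A KUBE-SVC-XXXXXXXXXXXXXXXX -j KUBE-SEP-ZZZZZZZZZZZZZZZZ
-A KUBE-SEP-YYYYYYYYYYYYYYYY -p tcp -j DNAT --to-destination 10.2.73.3:80
-A KUBE-SEP-ZZZZZZZZZZZZZZZZ -p tcp -j DNAT --to-destination 10.2.28.4:80
```

By targeting the pod IPs directly in its upstreams, the Ingress Controller skips this translation entirely.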

There are probably more. But the point was to show that an Ingress Controller can provide a more powerful experience by bypassing the Service VIP, without losing anything in terms of resiliency.

This blog entry was inspired by this conversation on GitHub.
