Kubernetes Services and Ingress Under X-ray

I haven’t blogged here for over 2 years. It’s not that I had nothing to say, but every time I started writing a new post I never pushed myself to finish it. So most of the drafts ended up rotting in my private GitHub gists. Although my interests have expanded way beyond the Linux container space, my professional life has remained tied to it.

Over the past two years I have been quite heavily involved in the Kubernetes (K8s) community. I helped to start the Kubernetes London Meetup as well as Kubecast, a podcast about all things K8s. It’s been amazing to see the community [not only in London] grow so fast in such a short time.

More and more companies are jumping on board to orchestrate their container deployments and address the container Jevons paradox. This is great for the project, but it’s not a free lunch for newcomers. New and shiny things often make newcomers anxious, especially when there are a lot of new concepts to grasp before becoming fully productive. Changing the mindset is often the hardest thing to do.

Over the past few months I have been noticing one thing in particular. K8s abstracts a lot of infra through its API. This is similar to other modern platforms like Cloud Foundry and the like. Hiding things away makes “traditional” Ops teams feel uneasy. The idea of “not caring” what’s going on underneath the K8s roof is unsettling. This is hardly surprising, as most of us polished our professional skills whilst debugging all kinds of crazy OS and hardware issues (hello Linux on desktop!); we naturally tend to dig deep when new tech comes up. It’s good to be prepared when disaster strikes.

Quite a few people have asked me recently, both in person and via Twitter DMs, what’s going on underneath K8s when an HTTP request arrives in the cluster from external traffic, i.e. traffic from outside the K8s cluster. People wanted to know how all the pieces such as services, ingress and DNS work together inside the cluster. If you are one of the curious folks, this post might be for you. We will put service requests through an X-ray!

Cluster Setup

This post assumes we run our own “bare metal” standalone K8s installation. If you don’t know how to get K8s running in your own infrastructure, check out the Kubernetes the hard way guide by Kelsey Hightower which can be easily translated into your own environment.

We will assume we have both the control plane and 3 worker nodes up and running:

$ kubectl get componentstatuses
NAME                 STATUS    MESSAGE              ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   {"health": "true"}
etcd-2               Healthy   {"health": "true"}
etcd-1               Healthy   {"health": "true"}

$ kubectl get nodes
NAME      STATUS    AGE
worker0   Ready     18m
worker1   Ready     16m
worker2   Ready     16m

We will also assume that we have the DNS and K8s dashboard add-ons set up in the cluster. As for DNS, we will use the off-the-shelf kube-dns. [Remember, the add-ons run as services in the kube-system namespace]:

$ kubectl get svc --namespace=kube-system
NAME                   CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
kube-dns               10.32.0.10    <none>        53/UDP,53/TCP   21m
kubernetes-dashboard   10.32.0.148   <nodes>       80:31711/TCP    1m

No services [and thus no pods] are running in the default K8s namespace except for the kubernetes API service:

$ kubectl get svc
NAME         CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   10.32.0.1    <none>        443/TCP   5h

$ kubectl get pods
No resources found.

Cluster state and configuration

A K8s cluster stores all of its internal state in an etcd cluster. The idea is that you should interact with K8s only via its API, provided by the API service. The API service abstracts away all K8s cluster state manipulation by reading from and writing into the etcd cluster. Let’s explore what’s stored in the etcd cluster after a fresh installation:

$ etcdctl --ca-file=/etc/etcd/ca.pem ls
/registry

The /registry key is where all the magic happens in K8s. If you are at least a bit familiar with K8s, listing the contents of this key will reveal a tree structure referencing keys named after familiar K8s concepts:

$ etcdctl --ca-file=/etc/etcd/ca.pem ls registry
/registry/services
/registry/events
/registry/secrets
/registry/minions
/registry/deployments
/registry/clusterroles
/registry/ranges
/registry/namespaces
/registry/replicasets
/registry/pods
/registry/clusterrolebindings
/registry/serviceaccounts

Let’s have a look at what’s hiding underneath the /registry/services key, which is what we are interested in in this blog post. We will list the services key space recursively, i.e. sort of like running ls -lR on your command line:

$ etcdctl --ca-file=/etc/etcd/ca.pem ls /registry/services --recursive
/registry/services/specs
/registry/services/specs/default
/registry/services/specs/default/kubernetes
/registry/services/specs/kube-system
/registry/services/specs/kube-system/kubernetes-dashboard
/registry/services/specs/kube-system/kube-dns
/registry/services/endpoints
/registry/services/endpoints/default
/registry/services/endpoints/default/kubernetes
/registry/services/endpoints/kube-system
/registry/services/endpoints/kube-system/kube-scheduler
/registry/services/endpoints/kube-system/kube-controller-manager
/registry/services/endpoints/kube-system/kube-dns
/registry/services/endpoints/kube-system/kubernetes-dashboard

The output of this command uncovers a wealth of information. For starters, we can see the two service namespaces, default and kube-system, under the specs key. We can assume that each service’s configuration is stored in the values under the keys named after the particular services.

Another important key in the output above is endpoints. I’ve noticed in the community that not a lot of people are familiar with the K8s endpoints API resource. This is because normally you don’t need to interact with it directly, at least not when doing the usual K8s work like deploying and managing apps via kubectl. But you do need to be familiar with it when debugging malfunctioning services or building ingress controllers or custom load balancers.

Endpoints are a crucial concept for K8s services. They represent a list of IP:PORT mappings created automatically (unless you are using headless services) when you create a new K8s service. A K8s service selects a particular set of pods and maps them into endpoints.

In the context of a K8s service, endpoints are basically service traffic routes. A K8s service must keep an eye on its endpoints at all times: it watches its particular endpoints key, which notifies it when some pods in its list have been terminated or rescheduled on another host (in which case they most likely get a new IP:PORT allocation). The service then routes the traffic to the new endpoint instead of the old [dead] one. In other words, K8s services are K8s API watchers.
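You can observe this watch behaviour from the command line, too; a minimal sketch (the --watch flag streams endpoint changes as pods come and go, and the raw API watch below is essentially what services and controllers use under the hood):

$ kubectl get endpoints kubernetes --watch

# the same stream is available straight from the API, e.g. via kubectl proxy
$ kubectl proxy &
$ curl "http://127.0.0.1:8001/api/v1/namespaces/default/endpoints?watch=true"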

In our cluster we only have the kubernetes service running right now, in the default namespace. Let’s check its endpoints using kubectl:

$ kubectl get endpoints kubernetes
NAME         ENDPOINTS                                            AGE
kubernetes   10.240.0.10:6443,10.240.0.11:6443,10.240.0.12:6443   7h

We can find the same list of endpoints if we display the contents of the corresponding etcd key:

$ etcdctl --ca-file=/etc/etcd/ca.pem get /registry/services/endpoints/default/kubernetes
{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
      "name": "kubernetes",
      "namespace": "default",
      "selfLink": "/api/v1/namespaces/default/endpoints/kubernetes",
      "uid": "918dc93c-e61c-11e6-8119-42010af00015",
      "creationTimestamp": "2017-01-29T12:15:05Z"
  },
  "subsets": [{
      "addresses": [{
          "ip": "10.240.0.10"
      }, {
          "ip": "10.240.0.11"
      }, {
          "ip": "10.240.0.12"
      }],
      "ports": [{
          "name": "https",
          "port": 6443,
          "protocol": "TCP"
      }]
  }]
}

Now that we have scrutinized K8s services a bit, let’s move on and create our own K8s service and try to route some traffic to it from outside the cluster.

Services, kube-proxy and kube-dns

We will create a simple service which will run two replicas of nginx and we will scrutinize the request flow within the K8s cluster. The following command will create a K8s deployment of 2 replicas of nginx servers running in separate pods:

$ kubectl run nginx --image=nginx --replicas=2 --port=80
deployment "nginx" created

The command we ran actually created a K8s deployment which bundles a K8s replica set consisting of two pods. We can verify that very easily:

$ kubectl get deployments,rs,pods
NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/nginx   2         2         2            2           17s

NAME                  DESIRED   CURRENT   READY     AGE
rs/nginx-3449338310   2         2         2         17s

NAME                        READY     STATUS    RESTARTS   AGE
po/nginx-3449338310-140d3   1/1       Running   0          17s
po/nginx-3449338310-qc77h   1/1       Running   0          17s

We can also check that the K8s cluster keeps track of the new pods in etcd under the pods key:

$ etcdctl --ca-file=/etc/etcd/ca.pem ls /registry/pods/default --recursive
/registry/pods/default/nginx-3449338310-qc77h
/registry/pods/default/nginx-3449338310-140d3

We could equally check the contents of the /registry/deployments and /registry/replicasets keys, but let’s skip that for now. The next step is to turn the nginx deployment into a service. We will call it nginx-svc and expose it on port 8080 inside the cluster:

$ kubectl expose deployment nginx --port=8080 --target-port=80 --name=nginx-svc
service "nginx-svc" exposed

This will, as expected, create a new K8s service which has two endpoints:

$ kubectl get svc,endpoints
NAME             CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
svc/kubernetes   10.32.0.1     <none>        443/TCP    9h
svc/nginx-svc    10.32.0.191   <none>        8080/TCP   26m

NAME            ENDPOINTS                                            AGE
ep/kubernetes   10.240.0.10:6443,10.240.0.11:6443,10.240.0.12:6443   9h
ep/nginx-svc    10.200.1.5:80,10.200.2.4:80                          26m

We could also query etcd and see that the K8s API service has taken care of creating the particular service and endpoints keys and populating them with the correct information about connection mappings.

Now, here comes the first newcomer “gotcha”. When the service is created it is assigned a Virtual IP (VIP). Many people try to ping the VIP and fail miserably, which leads them to do all kinds of debugging until they get frustrated and give up. The service VIP is only really useful in combination with the service port (we will get back to this later on in the post), so pinging it gets you nowhere. However, accessing the service endpoints from any pod in the cluster works perfectly fine, as we will see later on.

If you don’t specify a service type, K8s by default uses the ClusterIP option, which means that the new service is only exposed within the cluster. It’s kind of like an internal K8s service, so it’s not particularly useful if you want to accept external traffic:

$ kubectl describe svc nginx-svc
Name:         nginx-svc
Namespace:        default
Labels:           run=nginx
Selector:     run=nginx
Type:         ClusterIP
IP:           10.32.0.191
Port:         <unset>   8080/TCP
Endpoints:        10.200.1.5:80,10.200.2.4:80
Session Affinity: None
No events.
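For reference, kubectl expose is just a convenient shortcut. A rough sketch of what an equivalent declarative manifest would look like, assuming the run=nginx label added by the kubectl run command above (creating it would of course clash with the service that already exists):

$ cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  name: nginx-svc
spec:
  type: ClusterIP        # the default, spelled out here just to be explicit
  selector:
    run: nginx           # matches the pods created by 'kubectl run nginx'
  ports:
  - port: 8080           # the service port
    targetPort: 80       # the container port
EOF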

Our service consists of two nginx servers, so we can try to curl it from a pod that has curl installed. We can use the service VIP in combination with the exposed service port 8080:

$ kubectl run -i --tty client --image=tutum/curl
root@client:/# curl -I 10.32.0.191:8080
HTTP/1.1 200 OK
Server: nginx/1.11.9
Date: Mon, 30 Jan 2017 10:36:19 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 24 Jan 2017 14:02:19 GMT
Connection: keep-alive
ETag: "58875e6b-264"
Accept-Ranges: bytes

Let’s move to more interesting service exposure options now. If you want to expose your service to the outside world you can use either the NodePort or the LoadBalancer type. Let’s have a look at the NodePort service first. We will delete the service we created earlier:

$ kubectl delete service nginx-svc
service "nginx-svc" deleted
$ kubectl delete deployment nginx
deployment "nginx" deleted
$ kubectl get pods,svc,deployments
NAME         CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   10.32.0.1    <none>        443/TCP   9h

We will recreate the same service, but this time we will use the NodePort type:

$ kubectl run nginx --image=nginx --replicas=2 --port=80
deployment "nginx" created
$ kubectl expose deployment nginx --port=8080 --target-port=80 --name=nginx-svc --type=NodePort
service "nginx-svc" exposed
$ kubectl get deployments,svc,pods
NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/nginx   2         2         2            2           1m

NAME             CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
svc/kubernetes   10.32.0.1     <none>        443/TCP          9h
svc/nginx-svc    10.32.0.156   <nodes>       8080:30299/TCP   53s

NAME                        READY     STATUS    RESTARTS   AGE
po/nginx-3449338310-9ffk6   1/1       Running   0          1m
po/nginx-3449338310-wrp9x   1/1       Running   0          1m

The NodePort type, according to the documentation, opens a service port on every worker node in the K8s cluster. Now, here comes another newcomer gotcha. What a lot of people ask me is, “ok, but how come I can’t see the service port listening on any of the worker nodes?” Often, people simply run netstat -ntlp and grep for the exposed service port; in our case that would be port 8080. Well, the bad news is, they won’t find anything listening on port 8080. This is where the magic of kube-proxy happens: the service port is mapped to a different port on the node, the NodePort. You can find the NodePort by describing the service:

$ kubectl describe svc nginx-svc
Name:         nginx-svc
Namespace:        default
Labels:           run=nginx
Selector:     run=nginx
Type:         NodePort
IP:           10.32.0.156
Port:         <unset>   8080/TCP
NodePort:     <unset>   30299/TCP
Endpoints:        10.200.1.6:80,10.200.2.6:80
Session Affinity: None
No events.

So if you now grep for processes listening on port 30299 on every worker node you should see kube-proxy listening on the NodePort. See the output below from one of the worker nodes:

worker0:~$ sudo netstat -ntlp|grep 30299
tcp6       0      0 :::30299                :::*                    LISTEN      16080/kube-proxy

Again, as far as service accessibility goes, nothing changes; the service is perfectly accessible through the VIP:PORT combination, just as in the ClusterIP case:

$ kubectl run -i --tty client --image=tutum/curl
root@client:/# curl -I 10.32.0.156:8080
HTTP/1.1 200 OK
Server: nginx/1.11.9
Date: Mon, 30 Jan 2017 10:50:56 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 24 Jan 2017 14:02:19 GMT
Connection: keep-alive
ETag: "58875e6b-264"
Accept-Ranges: bytes

Now that you have a port open on every node, you can configure your external load balancer or edge router to route the traffic to any of the K8s worker nodes on the NodePort. Simples! And indeed this is what we had to do in the past, before ingress was introduced.
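As a quick sanity check, you can already curl the service from outside the cluster by hitting any worker node on the NodePort; a minimal sketch (the node address below is a placeholder for one of your worker IPs):

# find the node addresses
$ kubectl get nodes -o wide

# from a machine outside the cluster: any worker node IP plus the NodePort works,
# and you should get the same nginx response as from inside the cluster
$ curl -I http://<worker-node-ip>:30299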

The “problem” with the NodePort type is that the load balancer (or proxy) that routes the traffic to the worker nodes needs to balance across the K8s cluster nodes, which in turn load balance the traffic across pod endpoints. There is also no easy way of adding TLS or more sophisticated traffic routing. This is what the ingress API resource addresses, but let’s talk about kube-proxy first, as it’s the most crucial component with regards to K8s services and also a bit of a source of confusion for newcomers.

kube-proxy

kube-proxy is a special daemon (application) running on every worker node. It can run in two modes [configurable via the --proxy-mode command line switch]:

  • userspace
  • iptables

In the userspace mode, kube-proxy runs as a userspace process, i.e. a regular application. It terminates all incoming service connections and creates a new connection to a particular service endpoint. The advantage of the userspace mode is that, because the connections are created from a userspace process, if a connection fails kube-proxy can retry with a different endpoint.

In iptables mode, the traffic routing is done entirely in kernel space via some quite complex iptables kung-fu. Feel free to check the iptables rules on each node. This is way more efficient than moving packets from the kernel to userspace and back to the kernel, so you get higher throughput and better latency. The downside is that services can be more difficult to debug, because you need to inspect the iptables rules and maybe do some tcpdumping or whatnot.
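If you want to poke at those rules yourself, here is a rough sketch of where to look on any worker node (KUBE-SERVICES, KUBE-SVC-* and KUBE-SEP-* are the chains kube-proxy creates; the exact output will differ per cluster):

# every service VIP:PORT is matched in the KUBE-SERVICES chain of the nat table;
# kube-proxy tags the rules with a comment containing the service name
$ sudo iptables -t nat -L KUBE-SERVICES -n | grep nginx-svc

# each service jumps to a KUBE-SVC-* chain, which spreads the traffic across
# KUBE-SEP-* (service endpoint) chains, one per pod, using random probabilities
$ sudo iptables -t nat -L -n | grep KUBE-SEP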

The moral of the story is: there will always be a kube-proxy running on the worker nodes regardless of what mode it is running in. The difference is that in userspace mode it acts as a TCP proxy intercepting and forwarding traffic, whilst in iptables mode it configures iptables rather than proxying connections itself; the traffic forwarding is done by iptables automagically.
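A quick way to tell which mode your kube-proxy runs in is to look at its command line on a worker node; if --proxy-mode is not set explicitly, recent versions default to iptables:

# on any worker node; look for the --proxy-mode flag in the output
$ ps -ef | grep [k]ube-proxy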

kube-dns

Now, kube-proxy is just one piece of the K8s service puzzle. Another one is kube-dns, which is responsible for DNS service discovery. If the kube-dns add-on has been set up properly you can access K8s services using their names directly; you don’t need to remember the VIP:PORT combination, the name of the service will suffice. How is this possible? Well, when you use kube-dns, K8s “injects” certain name resolution configuration into new pods that allows you to query the DNS records in the cluster. Let’s have a look at our familiar tutum/curl pod we created to test services.

First, let’s check the DNS resolution configuration:

$ kubectl run -i --tty client --image=tutum/curl
root@client:/# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local google.internal c.kube-blog.internal
nameserver 10.32.0.10
options ndots:5

You can see that the IP address of the kube-dns service (check earlier in the post that this is indeed the kube-dns VIP) has been injected into the new pod along with some search domains. kube-dns creates an internal cluster DNS zone which is used for DNS-based service discovery. This means that we can access the services from inside the pods via their names directly:

root@client:/# curl -I nginx-svc:8080
HTTP/1.1 200 OK
Server: nginx/1.11.9
Date: Mon, 30 Jan 2017 11:05:40 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 24 Jan 2017 14:02:19 GMT
Connection: keep-alive
ETag: "58875e6b-264"
Accept-Ranges: bytes

Since our nginx-svc has been created in the default namespace (check the etcd queries shown earlier in this post), we can access it using the cluster internal DNS name, too:

root@client:/# curl -I nginx-svc.default.svc.cluster.local:8080
HTTP/1.1 200 OK
Server: nginx/1.11.9
Date: Mon, 30 Jan 2017 11:06:12 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 24 Jan 2017 14:02:19 GMT
Connection: keep-alive
ETag: "58875e6b-264"
Accept-Ranges: bytes
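Under the hood this is just an ordinary A record lookup against kube-dns; you can verify it from the same pod, assuming nslookup (or dig) is available in the image:

root@client:/# nslookup nginx-svc
# kube-dns (10.32.0.10) should answer with the service VIP, i.e. it resolves
# nginx-svc.default.svc.cluster.local to 10.32.0.156 thanks to the search domains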

Ok, so this is really handy. No more remembering IP addresses, no more crafting and hacking our own /etc/hosts files within the pods - this almost feels like “No-Traditional-Ops” (oops) ;-)

We won’t talk about the LoadBalancer type in this post as it’s only handy when running your cluster on one of the supported cloud providers and, like I said, this post is about running K8s on a bare metal deployment - we have no luxury of ELBs and the like! Either way, we should now be well equipped to take the next step and look into the magic of ingress.

Services and ingresses

Let’s talk about the Ingress resource and how it can address the “shortcomings” of the NodePort service type. Don’t forget to check the documentation about the Ingress API resource. I will try to summarize the most important bits and show you how it works underneath.

Ingress is an API resource which represents a set of traffic routing rules that map external traffic to K8s services. Ingress allows external traffic to land on a particular service in the cluster. Ingress on its own is just one part of the puzzle; it merely creates the traffic route maps. We need one more piece to make this work: ingress controllers, which are responsible for the actual traffic routing. So we need to:

  1. Create Ingress (API object)
  2. Run Ingress controller

In practice we actually do this in the opposite order. First you create an ingress controller, which handles the traffic, and wait until it’s ready. Then you create and “open” the route for the incoming traffic. This order makes sense: you need to have your traffic controller ready to handle the traffic before you “open the door”.

Ingress

Let’s create a simple ingress to route the traffic to the nginx-svc service we created earlier. Before that we need to create a default backend. The default backend is a special service which handles traffic that arrives at the ingress and does not match any of the configured routes in the ingress route map. It is sort of like the default “fallback” host known from various application and HTTP servers. We will use the default-http-backend available in the official documentation and expose it as a new service:

$ kubectl create -f https://raw.githubusercontent.com/kubernetes/contrib/master/ingress/controllers/nginx/examples/default-backend.yaml
replicationcontroller "default-http-backend" created
$ kubectl expose rc default-http-backend --port=80 --target-port=8080 --name=default-http-backend
service "default-http-backend" exposed
$ kubectl get svc
NAME                   CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
default-http-backend   10.32.0.227   <none>        80/TCP           8s
kubernetes             10.32.0.1     <none>        443/TCP          23h
nginxsvc               10.32.0.156   <nodes>       8080:31826/TCP   10h

Now, before we create the ingress API object we need to create an ingress controller. We don’t want to be caught off guard, exposing the service before we are ready to handle it. You are spoilt for choice here: you can use the Rancher one, the NGINX Inc. one, or build your own more specialized controller. In this guide we will stick to the basic nginx ingress controller available in the Kubernetes repo. So let’s create it now:

$ kubectl create -f https://raw.githubusercontent.com/kubernetes/contrib/master/ingress/controllers/nginx/examples/default/rc-default.yaml
replicationcontroller "nginx-ingress-controller" created
$ kubectl get pods
NAME                             READY     STATUS    RESTARTS   AGE
default-http-backend-lb96h       1/1       Running   0          14m
nginx-3449338310-64h52           1/1       Running   0          1h
nginx-3449338310-8pcq5           1/1       Running   0          1h
nginx-ingress-controller-k3d2s   0/1       Running   0          18s

Notice that the nginx-ingress-controller we have created is just a simple application running in a K8s pod, albeit one with some special powers, as we will see later on. What ingress controllers do underneath is first register themselves into the list of controllers via the API service and store some configuration there. We can list the available controllers in the cluster by listing our familiar etcd registry. In this case we are interested in the /registry/controllers/default key (default implies the default K8s namespace):

$ etcdctl --ca-file=/etc/etcd/ca.pem ls /registry/controllers/default
/registry/controllers/default/default-http-backend
/registry/controllers/default/nginx-ingress-controller

Great, so both default-http-backend and nginx-ingress-controller have registered themselves correctly. We should be ready to create some ingress rules and bring the external traffic into the cluster now. For the purpose of this post I will use the following ingress:

---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-nginx-ingress
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: foobar.com
    http:
      paths:
      - path:
        backend:
          serviceName: nginx-svc
          servicePort: 8080

What this will do is create an ingress API resource which maps all the incoming requests whose HTTP Host header is set to foobar.com to our nginx-svc service. All the other requests arriving at this ingress point will be routed to the default-http-backend. Please note that we are mapping only the root URL, but you have the option to create maps to particular URL paths. Let’s go ahead and create the ingress now:

$ kubectl create -f foobar.yaml
ingress "my-nginx-ingress" created
$ kubectl get ing -o wide
NAME               HOSTS        ADDRESS           PORTS     AGE
my-nginx-ingress   foobar.com   X.X.X.X           80        21s

Excellent! The ingress has now been created as per the YAML config shown earlier. As always, every ingress stores its configuration in the etcd cluster, so let’s have a look there:

$ etcdctl --ca-file=/etc/etcd/ca.pem ls /registry/ingress/default
/registry/ingress/default/my-nginx-ingress

Let’s see what the contents of the /registry/ingress/default/my-nginx-ingress key is:

{
  "kind": "Ingress",
  "apiVersion": "extensions/v1beta1",
  "metadata": {
      "name": "my-nginx-ingress",
      "namespace": "default",
      "selfLink": "/apis/extensions/v1beta1/namespaces/default/ingresses/my-nginx-ingress",
      "uid": "f2e57376-e6e5-11e6-9d09-42010af0000c",
      "generation": 1,
      "creationTimestamp": "2017-01-30T12:16:37Z",
      "annotations": {
          "kubernetes.io/ingress.class": "nginx"
      }
  },
  "spec": {
      "rules": [{
          "host": "foobar.com",
          "http": {
              "paths": [{
                  "backend": {
                      "serviceName": "nginx-svc",
                      "servicePort": 8080
                  }
              }]
          }
      }]
  },
  "status": {
      "loadBalancer": {
          "ingress": [{
              "ip": "X.X.X.X"
          }]
      }
  }
}

We can see that the ingress maps all the traffic for foobar.com to our nginx-svc service as expected, and that it is available through an external IP address, which has been redacted in this post. This is the IP to which you would point your DNS records and which routes to the externally exposed ingress controller.

Why is this etcd key so important? Well, ingress controllers are actually K8s API watcher applications which watch particular /registry/ingress keys for changes in order to keep an eye on particular K8s service endpoints. That was a mouthful! The key takeaway here is: ingress controllers monitor service endpoints, i.e. ingress controllers don’t route traffic to the service, but rather to the actual service endpoints, i.e. the pods. There is a good reason for this behavior.

Imagine one of your service pods dies. Until the K8s service notices it’s dead it won’t remove it from its list of endpoints. A K8s service is, as we should know by now, just another API watcher which simply watches /registry/endpoints; this key is updated by the K8s controller-manager. Even if the controller-manager does pick up the endpoints change, there is no guarantee kube-proxy has picked it up and updated the iptables rules accordingly - kube-proxy can still route traffic to the dead pods. So it’s safer for the ingress controllers to watch the endpoints themselves and update their routing tables as soon as the controller-manager updates the list of endpoints.

Now, what is all the buzz around ingress controllers about? Well, for starters they can terminate the traffic and load balance it across service endpoints. The load balancing can be quite sophisticated, and it’s entirely up to the ingress controller design. Finally, you can have them do the SSL/TLS termination heavy lifting and relieve the actual services of it. The TLS configuration is indeed done through the secrets API.
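Just to sketch what the TLS part looks like in practice: you store the certificate and key in a secret and reference it from the ingress spec. The file paths below are placeholders, and the exact behaviour (SNI, default certificates and so on) depends on the ingress controller you run:

# create a TLS secret from an existing certificate/key pair
$ kubectl create secret tls foobar-tls --cert=/path/to/foobar.crt --key=/path/to/foobar.key

# and reference it from the ingress spec via the tls section
$ cat <<EOF | kubectl apply -f -
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-nginx-ingress
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  tls:
  - hosts:
    - foobar.com
    secretName: foobar-tls
  rules:
  - host: foobar.com
    http:
      paths:
      - path: /
        backend:
          serviceName: nginx-svc
          servicePort: 8080
EOF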

Now that we have both the Ingress and the Ingress controller in place, we should be able to curl our nginxsvc directly from outside the cluster as long as we set the Host HTTP header to foobar.com. Let’s try to do that:

$ curl -I -H "Host: foobar.com" X.X.X.X
HTTP/1.1 200 OK
Server: nginx/1.11.3
Date: Sat, 28 Jan 2017 00:41:51 GMT
Content-Type: text/html
Content-Length: 612
Connection: keep-alive
Last-Modified: Tue, 24 Jan 2017 14:02:19 GMT
ETag: "58875e6b-264"
Accept-Ranges: bytes

We could easily verify that the traffic is being routed to particular pods by tailing the logs of the ingress controller and the logs of the nginxsvc pods. I will leave that as an exercise for you :-)

Finally, we should now have a better mental model of how all the pieces come together when a request arrives in the K8s cluster through the ingress [controller]:

(traffic from outside K8s cluster) -> (ingress-controller) -> (kube-dns lookup for particular service) -> (kube-proxy/iptables) -> service endpoint (pod)

Conclusion

Thanks for staying with me until the end! Let’s quickly summarize what we learnt in this post. Everything that happens in the K8s cluster goes through the API service. Services, controllers and other K8s components are implemented as API watchers that watch a particular set of keys by registering watches with the K8s API service. The K8s API service in turn reads from and writes into the etcd cluster, which stores all of the cluster’s internal state.

K8s service traffic is routed directly to the service’s pods, which are the service’s endpoints, either via sophisticated iptables kung-fu set up by kube-proxy or via kube-proxy itself. Service name addressing is handled by DNS discovery through the DNS add-on; in the simplest case this is kube-dns, but you are spoilt for choice, so pick the one that suits you best.

Ingress is an API resource which allows you to map external network traffic to K8s services. Ingress controllers are applications deployed in pods that work as K8s API watchers, monitoring the /registry/ingress key through the K8s API service and updating their routes based on service endpoint changes.

Docker vs Rocket Gimme a Break

Alert! This is another rant blog post! I promise the next one will be more technical :-)

Rocket launch me some opinions

For the past few days I’ve been playing around with Rocket, the new container runtime by CoreOS. Quite a few people have asked for my opinion, so I figured I would put it into a blog post. I hadn’t planned on publishing it any time soon, but a few blog posts I have come across recently have prompted me to put off the original post and write a different one first. So before I finalise it I need to get something off my chest. Anyhow, let’s get to it!

Potayto, Potahto

First of all, let’s start by setting some ground rules. For those who are comparing Docker with Rocket, I would advise you to either stop right now or at least do proper research. SERIOUSLY. If you compare software which has been developed by thousands of open source developers around the world for the past year or more, software which has been and is being deployed and tested in real production environments, with a “fresh from the oven” kit, then surely there is a big likelihood that the more mature software - if it’s not a total pile of shite - will come out as the winner. Don’t you think?

A much fairer comparison would have been to compare Docker 0.1.1 with the latest pre-release of Rocket. Furthermore, Brandon Philips has said a few times already that some of the features available in Docker are not planned to be implemented in Rocket - not right now and maybe never. Why? Because the focus of the Rocket project - at least where it stands now - is not to reimplement Docker. The focus of the project is simply not the same as the focus of Docker Inc. Rocket is an implementation of the App Container specification, hence a much better comparison would be one between the App Container specification and the original Docker manifest or whatever other specification Docker implements.

Ok, now that we have the COMMON SENSE bit out of the way, let’s move on.

SystemABCDEFG

One of the most discussed topics around Rocket is systemd-nspawn, or systemd in general. CoreOS have picked systemd as the init system for their Linux distro. They could’ve gone with init scripts (oh my!) or upstart (ouch!), but they went with what is de facto going to be the standard init system in the near future on all mainstream Linux distributions. If I have to be honest, I don’t have a proper opinion about systemd. I’ve read a few of Lennart’s blog posts and a few other ones on the internet. I liked some things and disliked others, but even just for the sake of the above, picking systemd as the init system of their new Linux distribution was really a no-brainer for CoreOS. Plus, CoreOS have earned my confidence, so I trust them with this one without a blink of an eye. You might argue by throwing the choice of btrfs at me, but even that might go away soon. I like that: if shit does not work it either has to be fixed, or, if the cost of fixing it is too high, it should be replaced by a reasonably reliable alternative. After all, stability is more valuable than features.

Ok, let’s get back to the point. One of the many things systemd does, or rather is capable of, is process supervision. You might argue that the way systemd approaches this problem somewhat conflates the concepts of service and process supervision and unnecessarily overcomplicates things which would probably be better done outside of PID 1, but it has this capability, and since CoreOS is an extensive user of systemd it would be quite surprising for the guys to come up with something else or use some other tool for the job.

Furthermore, Brandon said that the reason the initial implementation of Rocket uses systemd-nspawn is that they wanted to use systemd, and systemd-nspawn was already implemented and did what they intended to do, so it helped to kick off the project. I don’t remember him saying they’ll be sticking with it forever - they might, they might not. Who knows. And frankly, at this moment of the 0.1.1 release, I don’t really care. Remember, Rocket is designed to be pluggable: if you don’t like systemd-nspawn you can use your own implementation of stage1. Rocket already provides the relevant CLI arguments, so go nuts and start hacking.

Another important point some people are getting totally wrong is that if you want to use Rocket you need to run systemd. Wrong! Rocket does NOT require systemd on the host at all. It should work with any other init system like SysV init or upstart. When I played around with it I was testing it on upstart and did not encounter any issues. Rocket merely “reuses” systemd and systemd-nspawn to handle stage1 and stage2.

App Container

It seems that some people probably haven’t given the App Container specification a proper read. They keep crying about systemd blah blah and saying how awesome Docker is because it does not need systemd to run processes in containers. Docker handles - or rather is supposed to handle - one-process container supervision via the Docker daemon. Now, let’s face it: the Docker daemon was not written as a process supervisor, and that already kinda shows when your containers die. You end up hacking around it by using some kind of process supervisor on the host or even in the container. If you have not experienced this, you have probably either been lucky or haven’t run lots of containers in production.

So here we are. We need a process supervisor sometimes. Or at least some of the functionality process supervisors provide. Wait a second, it gets better. If you read the App Container specification properly you will come across this point, which I’m going to copy-paste here for reference:

“A container executes one or more apps with shared PID namespace, network namespace, mount namespace, IPC namespace and UTS namespace. Each app will start pivoted (i.e. chrooted) into its own unique read-write rootfs before execution. The definition of the container is a list of apps that should be launched together, along with isolators that should apply to the entire container.”

The above snippet clearly mentions several processes sharing Linux namespaces. In other words, App Container’s definition of a container suspiciously resembles a Kubernetes Pod. I don’t know if it has anything to do with CoreOS being actively involved in Kubernetes development, but that’s a fact. It seems like they are the same thing, but they are really not: a Kubernetes Pod is a set of containers, as opposed to a set of multiple processes running in one container. That gives you the benefit of composing a Pod from multiple Docker images. This is where the App Container spec “jumps” in again and defines a concept of dependencies, so your container can depend on other containers, and the final container runtime manifest created by stage0 might look like this.

Now that you have several processes running in the container, and some of them can be daemons - yep, you’ve guessed it - they might need to be supervised. So you need a process supervisor. Arguably one of the best supervisors I’ve ever used is runit, but again, why would you want to write runit scripts or use any other supervisor when you have already amassed a lot of experience with systemd? If I were in the same position I would make the same decision as CoreOS did - go with systemd.

Docker and namespaces

Now let’s go back to Docker. Docker has been advocating the concept of running preferably ONE single process per container. In general I agree with this, as Linux containers are really just isolated processes on the host, and running one process per container gives you, or is supposed to give you, more flexibility and composability, independent updates and rollbacks etc., i.e. an easier life for an operator. However, there is a subtlety hiding here. When you create a new Docker container, or in fact an LXC container, the container is allocated a fresh set of ALL new Linux namespaces provided by libcontainer. I find this kinda wasteful and unnecessary, if I have to be honest. With Docker it kinda makes sense: you are supposed to run a single process in the container, and if you want a generic process environment that covers the majority of use cases, you might need to provide a large chunk of all available namespaces, which libcontainer effectively does.

By creating a new set of namespaces for just one single process you are adding one extra bit of work for the kernel to keep an eye on. If you run loads of containers on your host you might start hitting some kernel limits. You might argue the overhead is small, but why would you want to unnecessarily occupy the kernel anyway? I remember hitting these kernel limits deliberately when running hundreds of LXCs on my host a while ago, kinda trying to push the kernel as far as possible.

So, to save the overhead, you can share namespaces between processes. The problem, though, is that if you start sharing namespaces and the container which created them dies, it takes down all the other processes that share those namespaces with it. This is one of the issues the Kubernetes guys were dealing with when designing Pods. A Kubernetes Pod is allocated an IP address from some internal /24 subnet, which is assigned to the first created container. This IP address is then shared between all containers within the Pod, which share the same network (and other) namespaces.
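Incidentally, you can play with network namespace sharing in plain Docker as well; a minimal sketch with made-up container names (this is essentially what Kubernetes does with its infrastructure container):

# the first container owns the network namespace
$ docker run -d --name netholder nginx

# the second container joins it, so both see the same interfaces and IP address
$ docker run --rm --net=container:netholder busybox ip addr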

You can see the subtlety here. Whenever the “network-creating” container died, it took down all the other containers in the Pod with it, as there was NO namespace to share any more, so they would have to be cleaned up by the replication controller and a new Pod would be created from scratch and assigned a new IP … or not. That’s not really important. What’s worse, your links die too. So when your container dies and you can’t restart it automatically, i.e. it’s not properly supervised, all linked containers are effectively useless and you need to restart them. I’m sure there are some solutions based on watching Docker events for dying containers, but let’s face it, that’s just a hack-around. So if you want multiple processes running inside a Docker container you need…… drum roll… a process supervisor.

Supervisor logging

Lastly, logging. Reliable logging, that is. A process supervisor is supposed to take care of this. Normally process supervisors do this by capturing the process’s stdout/stderr and routing it into a log. Now read what the App Container spec says about logging:

“Apps should log to stdout and stderr. The container executor is responsible for capturing and persisting the output.”

Again, systemd seems like an obvious choice for CoreOS to go with. Runit is pretty awesome at this, too, but again this comes down to what I’ve already mentioned above: expertise, unnecessary extra work etc.

The state of Docker logging is documented in the almost 6-month-old logging plugin proposal. It has yet to be implemented and, truth be told, at this point I’ve already forgotten what it was about, but given it was proposed by Michael Crosby I have big confidence and trust in it.

Experience the user and Use the experience

Ok, now that we have the important bits of the App Container spec covered, one last point people seem to be crying about is the user experience provided by Rocket. Again. Please: GIMME A BREAK. The latest Rocket release at the time of this writing is 0.1.1. Have you seen what the first car looked like? It looked something like this. That was not the greatest of user experiences. Nope. It was, however, the first step towards Ferraris and Porsches.

But AGAIN, and most importantly, Rocket is an implementation of the App Container specification. It is totally up to you to implement your improved version of Dockerfiles, Docker daemons and whatever else your imagination gives you. You can implement the other niceties you like when using Docker. Rocket does not enforce anything on you. Not even stage1! You want Docker as a supervisor? Hack on it and pass it as an argument to rkt run as the stage1 process. Boom! I can imagine CoreOS might implement some nifty tools in the future and even improve the user experience for Ops and Devs, but I’m guessing those tools will be created as separate projects and not as part of the core of Rocket. At this point I would argue it’s more important to hack the project to some stability.

Switching off the rant

To close this rant off, I’d like to invite you to hack on both Docker and Rocket. Hell, if you like coding in C, you should get involved in LXC or the like, too. There is so much work to be done and such tremendous opportunity for you to be part of something awesome happening in our industry at the moment.

Lastly, I want to make absolutely clear that I love Docker. I love hacking on it every now and then. I love the community around it and chatting to the awesome people on various Docker IRC channels. This rant was not about Docker or Rocket being bad or good. It was about giving my five cents on the topic of the Docker vs Rocket comparison and the questionable opinions I’ve come across recently, mostly from people who probably didn’t even bother to read the App Container spec. I’ll be covering Rocket on this blog more and more in the near future, so stay tuned if you’re interested.

Now, let’s hack and keep on shipping Dockers, launching Rockets, executing LXCs…

Future of Docker Networking

Disclaimer: I do not work for Docker nor for any company whose business is tied to Docker in any way

Recently there have been a lot of discussions about the future of Docker networking in various communication channels. This is due to hugely increased Docker usage over the past year. Users are now running into Docker’s networking limitations, which are becoming more apparent. The truth is, and I’m sure we can all agree on this, that the current networking capabilities of Docker are not sufficient for more complicated setups. The current model is plagued with performance issues and is not flexible enough, but for basic usage it does a pretty good job. However, we never just stay within the “basic usage” realm; with the cloud and the current world of microservices this is almost impossible. So the time has come to move forward. As one of the many contributors to libcontainer, I feel like I need to express my opinion about where I stand in all these discussions, and this blog post is about that.

As with every project, there is a bit of history behind it. Early adopters and tinkerers realised the insufficiencies of the current networking model very quickly. When they needed to assign more than one IP address to their Docker containers they often couldn’t understand why this was possible with LXC but not with Docker, which at the time was using LXC as its container engine. Whilst containers have been around for a long time, only recently, thanks to Docker, have more and more people started learning about Linux namespaces, including the network namespace, which is the cornerstone of Linux container networking. Thanks to the many blog posts by the Docker Tinkerer Extraordinaire and his awesome pipework tool, more people gained a better understanding of the topic and could build more complicated network setups with Docker. Still, we all knew pipework was an awesome and helpful tool, but it was still a “hack-around” tool. Much more was needed from the core of Docker.

I became interested in this topic at the dawn of version 0.7, but instead of diving into the source code straight away, I tried to get my hands dirty with LXC networking first, as I hoped it would help me understand the core of the problem so that I could then help by contributing to Docker. I got in touch with Jérôme and we agreed on some golang-pipework hack which could potentially be embedded into Docker in the future. The rest is history. This happened over the span of 7 months. In those seven months there wasn’t much activity in Docker-network-land. There were a few network-related issues opened on GitHub. When I opened a PR to “fix” the netlink issues, I realised not much had changed in that time. The Docker guys were focussing on more crucial issues (rightly so!), and since we had pipework and pipework-inspired tools, as well as various orchestration tools, we could hack around Docker’s networking limitations. People did not seem to be too interested in networking, as only a small number of people contributed by fixing the existing issues or adding new functionality.

Fast forward X months and things are finally starting to change. We now have the first proper official networking proposal and we even have a #libnetwork IRC room, albeit its existence is rather premature at this point. The discussion about the future of networking in Docker is now happening with a much larger audience than just a few people concentrated in GitHub issues, IRC conversations and private email threads. The most important thing, though, will be the outcome of these discussions and how it will affect Docker users, both companies and individuals. This could, in my opinion, be incredibly crucial for Docker, and I’m glad we as a community are starting to realise it more.

The topic of Docker networking is quite complex. It touches both the underlay and overlay side of things. It affects literally every project, from the smallest ones like local dev environments to large monsters like the awesome Kubernetes project. Feel free to read through the Kubernetes networking design document and you will get a pretty good idea of what I’m talking about. Over the past year we have developed lots of tools and libraries to solve these issues, so I really hope the outcome of the ongoing discussion does not fuck up any of the existing work, but rather creates a much nicer environment for those tools and for the newcomers. I chose the F word deliberately because, having been in the industry for a fairly long time, I have had a chance to see how bad decisions can turn into total dismay. And that slightly worries me with regards to Docker as well. Maybe it’s really just a result of my experience with other projects, but you know the old saying: “History teaches us nothing…”, so I am on “alert” :-)

To me Docker is not just another means of shipping software. It’s not just another DevOps enabler. To me Docker is a tool, but also, maybe even more importantly, a platform. An open platform, that is. A platform which people can build on top of. Not a platform in the sense we are normally used to when talking about software platforms, but a platform nevertheless. If it is to stay open to users (and companies) building their solutions on top of it, it must not tie itself to another project. I have big hopes it won’t, because the Docker guys learnt that lesson when they got rid of the dependency on LXC and started the libcontainer project as a home-grown replacement. Similar thinking should be applied to the decision about the future of networking.

There are some visible initiatives trying to tie Docker to the awesome OVS project, which has been part of the mainline Linux kernel since version 3.3. I hope I’m just being a little paranoid here. I have nothing against OVS. In fact, I find it to be one of the most awesome and advanced networking toolkits out there. It has a bit of a rocky setup and learning curve for a beginner, but once you master it you gain some serious networking power! One of the arguments I’ve heard on this topic was “if OVS just works, Docker should just work fine with it”. Let me tell you something, my friend: if there is one thing I learnt over the years in this industry, it is that “NOTHING JUST WORKS”. I don’t want to sound cynical, but some of us have seen even the ls utility crash under certain circumstances. So building an argument on an assumption like that is simply crazy. There are much smarter people than me who would agree. Yes, you can hack anything to work FOR YOU, we all do it on a daily basis. Yes, you can make SysAdmins or NetOps happy, but probably not both, let alone programmers on top of that. This is a complex topic and of course there is no silver bullet solution, so we should not approach it like that.

But this post is not about whether or not to make OVS part of Docker. I chose to talk about OVS above as it seemed to be the most vocally discussed “solution”. This is about the more generic question of tying Docker to any third-party project, regardless of whether it’s open source or not. So far the Docker guys have done a pretty good job of avoiding this. This is important because, apart from the aforementioned problems of project dependencies, once you make a decision like that, your focus will inevitably shift to the chosen path, which can harm the ecosystem that could grow organically from your platform if you keep it free of such dependencies. Right now we can talk about projects like flannel, weave, docket and probably many others I have not heard of, which grew out of the current platform, some out of necessity, others bringing something new.

There is currently a proposal for pluggable networking backends which looks promising, but it will take a bit of time until it materialises, as it seems Docker is going through a phase of designing a solid pluggable architecture not necessarily related just to networking but to other important parts as well. This must happen in the core of Docker before the project moves on to networking. It is, however, hugely important in my opinion that Docker ships sane default backends whilst leaving the decisions about more advanced solutions to the users, not forcing them down a certain path. Forcing them would just inevitably lead to hack-arounds as we know them from other projects and would arguably harm current and possibly future users. We all know you can’t make everyone happy, but you can create a nice “playground” for users where they can build their own solutions without harming each other’s work. If you’re thinking unified pluggable API, then you might just be in the same boat as me!

I’m really excited about the future of networking in Docker and I have big hopes that Docker will make the best decision. Not that I ever doubted it, but I felt like I had to get this post off my chest, having been involved with Docker networking and many discussions about it which made me sense some odd lobbying sneaking in. Most importantly, I hope this post will inspire even more people to get involved in these conversations and help the project to be even more awesome than it already is right now.

Tenus - Golang Powered Linux Networking

2014-07-30 22:35 Update: I’ve updated the post with the link to the netlink RFC. I’ve also replaced references to golang with the Go programming language in the majority of mentions in the article. I do agree with the people in the discussions on the topic of Go/golang, but I’ve adopted golang in my vocabulary as that’s my standard search term on Google for information about the Go language, hence the abundance of the word golang in the original post.

Long Overdue

When I published the first post on this site, some time in November last year, about LXC networking, I noticed that Docker lacked the advanced networking functionality provided by LXC. Docker master extraordinaire Jérôme Petazzoni, who reviewed the post, pointed my attention to pipework. Pipework is a pretty awesome project, written entirely in bash, which allows you to implement complicated networking setups with Docker.

At the time I was becoming more and more interested in golang, so I thought reimplementing pipework in Go could be a great learning experience. Back then I had only written a few small programs in what has now become my favourite language, for one of the companies I worked for, and I was looking for a bigger challenge to tackle to improve my skills. Unfortunately a lot of things happened between December last year and now, and I only really got back into a proper working “process” in mid May. That explains the lack of posts on this site.

Anyways, to cut a long story short, eventually I decided NOT to reimplement pipework in Go, but rather to create a library which would allow you to configure and manage Linux network devices programmatically, directly from your Go programs (such as Docker). In this blog post I’ll try to give you an introduction to the project I decided to call tenus, which in Latin means something like down to (the problem). You can check out the source code of the project on GitHub.

First we need to understand a little bit more about how the management of Linux network devices is implemented in modern Linux tools and then have a look at how the networking is implemented in LXC and Docker.

Netlink

The core of the Linux kernel (just like the core of any other modern operating system) has a monolithic design for performance reasons, while non-essential parts - non-essential for starting and running the kernel - are built as dynamically loadable modules. So the core is monolithic but the kernel itself is modular. Kernel subsystems provide various interfaces to user-space processes/tools to configure and manage them. This reduces the bloat of adding every new feature into the Linux kernel.

The networking implementations in LXC and Docker, as well as, for example, iproute2, make use of the netlink Linux kernel interface. Before we dive into discussing what tenus can do, I’d like to give you a tiny introduction to netlink and hopefully motivate you a little bit to learn more about it, as it is one of those fascinating but slightly obscure kernel features!

Netlink is a Linux kernel interface which was introduced some time around version 1.3 and then reimplemented around version 2.1. I must admit, I could not find ANY design document or specification when working on this post, only the Linux kernel source code and some man pages. But then one of the commenters on Hacker News pointed out the following link, which contains the full netlink RFC! Wow! I wish I had asked Google the right questions and paid proper attention to the results! Thanks for pointing this out, signa11.

Netlink is a datagram-oriented messaging system that allows passing messages from kernel to user-space and vice-versa. It is implemented on top of the generic BSD sockets so it supports well known calls like socket(), bind(), sendmsg() and recvmsg() as well as common socket polling mechanisms.

Netlink allows 2 types of communication (displayed on the image below):

  • unicast - typically used to send commands from user-space tools/processes to kernel-space and receive the result
  • multicast - typically the sender is the Kernel and listeners are the user-space programs - this is useful for event-based notifications to multiple processes which might be listening on the netlink socket

As you can see on the above image, the communication from user-space to kernel-space is synchronous whilst the communication from the kernel to user-space is asynchronous (notice the queues). In other words, when you communicate with Linux Kernel over netlink socket from the user-space, your program will be in a blocking state until you receive the answer back from the Kernel.

We are only going to talk about the unicast communication as that’s the one used in tenus and in the above mentioned containers. In simplified terms it looks like this:

  • bind to netlink socket
  • construct a netlink message
  • send the netlink message via socket to the kernel
  • receive and parse netlink response from the kernel

Unfortunately, netlink does not hide the protocol details from user-space as other protocols do, so you must implement the netlink message construction and response parsing yourself.
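To make the unicast round trip a bit more concrete, below is a minimal, hedged Go sketch (using only the standard syscall package) which binds a netlink socket on the rtnetlink bus, hand-crafts an RTM_GETLINK dump request - the same kind of message ip link list sends - and parses the kernel's reply. It assumes a little-endian Linux host, since netlink payloads are expressed in host byte order.

package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"syscall"
)

func main() {
	// 1. bind to a netlink socket on the rtnetlink bus
	fd, err := syscall.Socket(syscall.AF_NETLINK, syscall.SOCK_RAW, syscall.NETLINK_ROUTE)
	if err != nil {
		log.Fatal(err)
	}
	defer syscall.Close(fd)
	if err := syscall.Bind(fd, &syscall.SockaddrNetlink{Family: syscall.AF_NETLINK}); err != nil {
		log.Fatal(err)
	}

	// 2. construct a netlink message: a 16-byte nlmsghdr followed by a 16-byte ifinfomsg payload.
	// Netlink expects host byte order; this sketch assumes a little-endian host.
	msg := make([]byte, syscall.SizeofNlMsghdr+syscall.SizeofIfInfomsg)
	binary.LittleEndian.PutUint32(msg[0:4], uint32(len(msg)))                                 // nlmsg_len
	binary.LittleEndian.PutUint16(msg[4:6], uint16(syscall.RTM_GETLINK))                      // nlmsg_type
	binary.LittleEndian.PutUint16(msg[6:8], uint16(syscall.NLM_F_REQUEST|syscall.NLM_F_DUMP)) // nlmsg_flags
	binary.LittleEndian.PutUint32(msg[8:12], 1)                                               // nlmsg_seq
	msg[syscall.SizeofNlMsghdr] = syscall.AF_UNSPEC                                           // ifi_family

	// 3. send the message to the kernel (a netlink peer with Pid 0 is the kernel itself)
	if err := syscall.Sendto(fd, msg, 0, &syscall.SockaddrNetlink{Family: syscall.AF_NETLINK}); err != nil {
		log.Fatal(err)
	}

	// 4. receive and parse the (multipart) response until NLMSG_DONE arrives
	buf := make([]byte, 65536)
	for done := false; !done; {
		n, _, err := syscall.Recvfrom(fd, buf, 0)
		if err != nil {
			log.Fatal(err)
		}
		msgs, err := syscall.ParseNetlinkMessage(buf[:n])
		if err != nil {
			log.Fatal(err)
		}
		for _, m := range msgs {
			if m.Header.Type == syscall.NLMSG_DONE {
				done = true
				break
			}
			fmt.Printf("netlink message type %d, %d payload bytes\n", m.Header.Type, len(m.Data))
		}
	}
}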

The main advantages provided by netlink in my opinion are these:

  • extensible data format which allows for adding new features without breaking backwards compatibility
  • no new kernel syscalls need to be introduced as netlink builds on BSD sockets
  • event-based notifications from Kernel to user-space

Currently netlink is mostly used by various networking tools to configure and manage network interfaces, advanced routing, firewalls etc., but it also sees some use in non-networking kernel subsystems such as ACPI. Netlink allows up to 32 communication buses in kernel-space, and in general each bus is attached to one kernel subsystem. We are going to focus on its use in the Linux network device management bus, called rtnetlink.

If you’re interested in netlink programming there are a couple of really good sources (linuxjournal, Neil Horman) which cover the topic pretty well.

LXC, Docker and netlink

Both LXC and Docker implement their own netlink libraries to interact with the kernel’s netlink interface. The LXC one is written in C (just like the whole LXC project), whilst the Docker one is written in Go. Since my primary motivation for creating tenus was curiosity about the networking features in Docker and a desire to hack on Go, I’m not going to talk about LXC here. If you are interested in the networking implementation in LXC, I’d recommend looking at the LXC networking source code on GitHub. The most interesting parts with regards to networking and netlink are on the links below:

I must admit, I hadn’t done any C for a while before I looked at the above source files, and I was very surprised how readable the code was. Kudos to Stéphane Graber and the whole LXC team for the awesome job on this. Reading the LXC source code massively helped me better understand how the networking is implemented in LXC containers and inspired my approach to the implementation of the tenus Go package.

Like I said, Docker has its own Go netlink library which it uses to create and configure network devices. The library is implemented as a Go package and is now part of libcontainer. You can have a look at its implementation here. Armed with the knowledge I gained studying netlink I decided to give this a shot and test what it had to offer.

I created the following Vagrantfile for my tests and started hacking:

Vagrantfile
$provision = <<SCRIPT
apt-get update -qq && apt-get install -y vim curl python-software-properties golang
add-apt-repository -y "deb https://get.docker.io/ubuntu docker main"
curl -s https://get.docker.io/gpg | sudo apt-key add -
apt-get update -qq; apt-get install -y lxc-docker
cat > /etc/profile.d/envvar.sh <<'EOF'
export GOPATH=/opt/golang
export PATH=$PATH:$GOPATH/bin
EOF
. /etc/profile.d/envvar.sh
SCRIPT

VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "trusty64"
  config.vm.hostname = "netlink"
  config.vm.provider "virtualbox" do |v|
    v.customize ['modifyvm', :id, '--nicpromisc1', 'allow-all']
  end

  config.vm.provision "shell", inline: $provision
end

I picked Ubuntu Trusty for my tests as it ships with quite a new Linux kernel. This is important because older kernels ship with a slightly buggy netlink interface. As long as you use 3.10+ you should be fine, though:

[email protected]:~# uname -r
3.13.0-24-generic
[email protected]:~#
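If you want to perform the same kernel version sanity check from Go before touching netlink, a minimal hedged sketch using syscall.Uname (with the field layout found on linux/amd64) might look like this:

package main

import (
	"fmt"
	"log"
	"syscall"
)

func main() {
	var uts syscall.Utsname
	if err := syscall.Uname(&uts); err != nil {
		log.Fatal(err)
	}
	// Utsname.Release is a fixed-size C char array; turn it into a Go string.
	release := make([]byte, 0, len(uts.Release))
	for _, c := range uts.Release {
		if c == 0 {
			break
		}
		release = append(release, byte(c))
	}
	fmt.Printf("running kernel %s (this post assumes 3.10+ for the netlink bits)\n", release)
}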

I quickly put together a simple program which would make use of libcontainer’s netlink package. The program was supposed to create a Linux network bridge called “mybridge”. No more, no less:

mybridge.go
package main

import (
  "log"

  "github.com/docker/libcontainer/netlink"
)

func main() {
  if err := netlink.NetworkLinkAdd("mybridge", "bridge"); err != nil {
      log.Fatal(err)
  }
}

I downloaded the netlink package by running go get and then built and ran the program:

[email protected]:/opt/golang/src/tstnetlink# go get github.com/docker/libcontainer/netlink
[email protected]:/opt/golang/src/tstnetlink# go build mybridge.go
[email protected]:/opt/golang/src/tstnetlink# ./mybridge
2014/07/27 14:59:19 operation not supported
[email protected]:/opt/golang/src/tstnetlink#

Uhh, something’s not right! The network bridge was indeed NOT created, as the program failed with an error straight after the netlink call. I wondered why, so the first thing I did was check dmesg, as I would expect the kernel to spit out some data into the kernel ring buffer:

[email protected]:/opt/golang/src/tstnetlink# dmesg|tail
...
...
[ 2553.121658] netlink: 1 bytes leftover after parsing attributes.
[email protected]:/opt/golang/src/tstnetlink#

I immediately knew that the program was trying to send more data to the kernel than the kernel was expecting. From a very low-level point of view, when you interact with the kernel’s netlink interface you essentially send a stream of bytes down the open netlink socket. That stream of bytes comprises a netlink message which contains rtnetlink attributes encoded in the message payload in TLV (type-length-value) format. Using the attributes you can tell the kernel what you want it to do. So I started looking into the implementation and set out to fix this.

I figured that the problem was most likely related to how the rtnetlink attributes are packed into the netlink message which is then sent down the netlink socket to the kernel. See, netlink messages are aligned to 32 bits and contain data expressed in host byte order, so you have to be very careful: miss one byte or get the byte order wrong and you are pretty much screwed. The kernel will “ignore” you in the sense that it will have no clue what you are asking of it!
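To illustrate what packing an rtnetlink attribute actually involves, here is a small, hedged Go sketch of a TLV attribute encoder. The rtaAlignOf and packAttr helpers are hypothetical names of my own, but they roughly mirror what libcontainer's Len() and ToWireFormat() have to get right, and the sketch again assumes a little-endian host.

package main

import (
	"encoding/binary"
	"fmt"
	"syscall"
)

// rtaAlignOf rounds a length up to the 4-byte boundary that rtnetlink
// attributes must be aligned to (the kernel's RTA_ALIGN macro).
func rtaAlignOf(l int) int {
	return (l + syscall.RTA_ALIGNTO - 1) & ^(syscall.RTA_ALIGNTO - 1)
}

// packAttr encodes one rtnetlink attribute in TLV form: a 2-byte length,
// a 2-byte type, then the payload padded out to a 4-byte boundary.
// Note that rta_len covers the 4-byte header plus the UNPADDED payload;
// getting either the length or the padding wrong is what earns you the
// "bytes leftover after parsing attributes" message in dmesg.
func packAttr(attrType uint16, data []byte) []byte {
	length := syscall.SizeofRtAttr + len(data)
	buf := make([]byte, rtaAlignOf(length))
	binary.LittleEndian.PutUint16(buf[0:2], uint16(length)) // rta_len (host byte order)
	binary.LittleEndian.PutUint16(buf[2:4], attrType)       // rta_type
	copy(buf[4:], data)
	return buf
}

func main() {
	// e.g. an IFLA_IFNAME attribute carrying a NUL-terminated device name
	attr := packAttr(syscall.IFLA_IFNAME, append([]byte("mybridge"), 0))
	fmt.Printf("% x\n", attr)
}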

I pinpointed the problem to be in either the func (a *RtAttr) Len() int or the func (a *RtAttr) ToWireFormat() method. After a couple of hours of reading and re-reading netlink’s man pages, reading the kernel’s netlink source code (the last time I’d done that was probably when I was at uni) and various other sources on the netlink specification, I thought I had figured it out. I forked the libcontainer repository to verify my assumptions, hacked on it for a WHILE and finally rebuilt the original test program using my libcontainer fork:

mybridge.go
package main

import (
  "log"

  "github.com/milosgajdos83/libcontainer-milosgajdos83/netlink"
)

func main() {
  if err := netlink.NetworkLinkAdd("mybridge", "bridge"); err != nil {
      log.Fatal(err)
  }
}

Building and running the test program was then pretty much the same routine. Get the forked package, rebuild the program and run it:

[email protected]:/opt/golang/src/tstnetlink# go get github.com/milosgajdos83/libcontainer-milosgajdos83/netlink
[email protected]:/opt/golang/src/tstnetlink# go build mybridge.go
[email protected]:/opt/golang/src/tstnetlink# ./mybridge
[email protected]:/opt/golang/src/tstnetlink#

Boom! No error! This looks promising! Let’s have a look if the network bridge was actually created:

[email protected]:/opt/golang/src/tstnetlink# ip link list mybridge
88: mybridge: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
    link/ether 8a:7d:1a:c0:93:15 brd ff:ff:ff:ff:ff:ff
[email protected]:/opt/golang/src/tstnetlink#

Wooha! Looks like the fixes I implemented in the netlink package worked like a charm. I also noticed that the Docker hackers must have realised that the original implementation of func NetworkLinkAdd() didn’t work very well, because they added various network bridge functions to the netlink package which unfortunately have nothing to do with netlink and instead use pure “auld” Linux syscalls.

I’m guessing this is partly down to time constraints for investigating the bugs (this is where we as a community come in!), or it was simply to get network bridging working on older Linux kernels such as the ones shipped with RHEL 6 or CentOS 6, which come with a slightly buggy netlink interface like I mentioned earlier. On these two distros you can’t, for example, even create a bridge with the well-known ip utility by running ip link add name mybridge type bridge. Either way, I believe these bridge-related functions should be moved somewhere outside the netlink package, as they encourage bad practice and simply don’t belong where they are now. But that’s a discussion for the mailing list and the #docker-dev channel, rather than this blog post.
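For the curious, the netlink way of creating a bridge (which is what ip link add name mybridge type bridge does on newer kernels) is an RTM_NEWLINK message carrying an IFLA_IFNAME attribute plus a nested IFLA_LINKINFO attribute whose payload is an IFLA_INFO_KIND attribute with the value "bridge". Here is a hedged sketch of building just that attribute blob, reusing the hypothetical packAttr helper from the earlier sketch; IFLA_INFO_KIND isn't exported by Go's syscall package, so it is defined locally.

// IFLA_INFO_KIND is not exported by the syscall package, so define it here.
const IFLA_INFO_KIND = 1

// buildBridgeAttrs builds the attribute section of an RTM_NEWLINK request
// asking the kernel to create a link of kind "bridge" named ifcName.
// packAttr is the TLV helper sketched earlier.
func buildBridgeAttrs(ifcName string) []byte {
	// IFLA_IFNAME: the NUL-terminated interface name.
	attrs := packAttr(syscall.IFLA_IFNAME, append([]byte(ifcName), 0))

	// IFLA_LINKINFO is a *nested* attribute: its payload is itself a
	// TLV-encoded IFLA_INFO_KIND attribute carrying the link kind.
	kind := packAttr(IFLA_INFO_KIND, []byte("bridge"))
	attrs = append(attrs, packAttr(syscall.IFLA_LINKINFO, kind)...)

	// These bytes get appended after the nlmsghdr + ifinfomsg of an
	// RTM_NEWLINK message sent with NLM_F_REQUEST|NLM_F_CREATE|NLM_F_EXCL|NLM_F_ACK.
	return attrs
}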

I also noticed that the netlink package was missing the advanced networking functionality present in LXC, and that it contained a few more bugs which had to be fixed before I could start hacking on tenus. I spent a couple of nights after work adding these into the package. Once the rtnetlink attributes were correctly packed into the netlink message payload, adding the new functionality was quite easy.

You can have a look at the actual implementation in my libcontainer fork. I have of course removed the syscall-based bridge functions from the package, because their presence in there was seriously teasing my OCD. Once the core functionality I was after was available in the netlink package, I could finally start hacking on my library.

But before we dive into tenus, let’s quickly have a look at Docker networking. As you probably know by now, when you install Docker on your host and start the docker daemon, it creates a network bridge called docker0:

[email protected]:~/# ip link list docker0
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
[email protected]:~/#

When you create a new docker container, docker will create a veth pair, stick one of the interfaces into the container and bridge the other interface of the pair against the docker0 bridge. Let’s look at an example. I will create a docker container running /bin/bash:

[email protected]:~# docker run -i -t --rm --name simple ubuntu:14.04 /bin/bash
[email protected]:/# ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
89: eth0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 86:21:ab:94:ad:9c brd ff:ff:ff:ff:ff:ff
[email protected]:/#

Now, when you open another terminal on the host you can verify what I mentioned above:

[email protected]:~# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
bc08199098f8        ubuntu:14.04        /bin/bash           5 seconds ago       Up 5 seconds                            simple
[email protected]:~#
[email protected]:~# brctl show docker0
bridge name   bridge id       STP enabled interfaces
docker0       8000.06316b00394b   no      veth8007
[email protected]:~# ip link list veth8007
90: veth8007: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master docker0 state UP mode DEFAULT group default qlen 1000
    link/ether 06:31:6b:00:39:4b brd ff:ff:ff:ff:ff:ff
[email protected]:~#

There is MUCH more to Docker networking than what I showed here, but all in all it’s a bit of iptables kung-fu and/or running an extra container (with socat inside) on the host, or using quite a few different command line options which you need to understand and remember. This is not a criticism! It’s just my personal opinion. Networking in Docker has come a long way since December - kudos to the guys at Docker who are working hard on this. I can tell from my own experience it’s not a trivial matter. Either way, I was after something more straightforward, more familiar, something which would offer the simplicity of LXC network configuration, hence I created tenus.

tenus

I’m not going to talk about the actual package implementation. Go ahead and get shocked by my awful Go code, which is available on GitHub. I do admit the code is pretty bad, but it was a great learning experience for me to work on something I was passionate about in my free time. I would welcome any comments, and obviously I would love PRs to get the codebase into a more sane state than it is in now.

As I mentioned earlier, the package only works with newer Linux kernels (3.10+) which ship with a reasonably new version of the netlink interface, so if you are running an older Linux kernel this package won’t be of much help to you, I’m afraid. Like I said, I developed tenus on Ubuntu Trusty Tahr, which ships with kernel 3.13+, and then verified its functionality on Precise Pangolin with the kernel upgraded to version 3.10.

Now, let’s do some interesting stuff with the tenus package. I’ve put together a few example programs to help you get started with it easily. They are located in the examples subdirectory of the project. Let’s have a look at one of the examples, which creates two MAC VLAN interfaces on the host and sends one of them to a running docker container. First let’s create a simple docker container which will run /bin/bash, as we did above to show the basic networking. I will name it mcvlandckr:

[email protected]:~# docker run -i -t --rm --privileged -h mcvlandckr --name mcvlandckr ubuntu:14.04 /bin/bash
[email protected]:/# ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
91: eth0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 92:af:be:4f:2b:3f brd ff:ff:ff:ff:ff:ff
[email protected]:/#

Nothing extraordinary about the above. As already said, docker creates a veth pair, one end of which is sent to the docker container whilst the other one is bridged against the docker0 network bridge on the host. Now leave the container running and open a new terminal on the host (the above container is running with the -i option, i.e. in interactive mode, so it will run in the foreground).

Let’s now build the example program from the source file which is called tenus_macvlanns_linux.go and run it as shown below:

[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples# go build tenus_macvlanns_linux.go
[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples# ./tenus_macvlanns_linux
[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples#

I have also created a gist where you can check out the source code if you prefer that to loading the GitHub page of the project and navigating to examples subdirectory. The program does the following:

  • it creates a MAC VLAN interface operating in bridge mode on the host, names it macvlanHostIfc and assigns the 10.0.41.2/16 IPv4 address to it
  • it creates another MAC VLAN interface called macvlanDckrIfc, again operating in bridge mode and “bridged” to the same host interface, eth1, just like macvlanHostIfc; it then sends it to the running docker container called mcvlandckr which we created earlier and finally assigns the 10.0.41.3/16 IPv4 address to it

The end result of running the example program should be 2 MAC VLAN interfaces - one on the host and another one in the running docker container - “bridged” against the same host interface. Creating a MAC VLAN interface on the host and “bridging” it against the same host interface as the docker one was done on purpose, so that we can test the connectivity from the host to the container. If you want to know more about how MAC VLAN bridge mode works, and why I keep putting double quotes around the word bridging in the MAC VLAN context, make sure you check out my previous blog post.
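If you would rather see the shape of that example inline, here is a hedged sketch of roughly what tenus_macvlanns_linux.go does. The interface names and addresses come from the description above, while SetLinkIp and SetLinkUp are my assumption about the host-side helpers tenus exposes, so check the project's examples for the exact API.

package main

import (
	"log"
	"net"

	"github.com/milosgajdos83/tenus"
)

func main() {
	// MAC VLAN interface kept on the host, bridge mode, "bridged" against eth1.
	macVlanHost, err := tenus.NewMacVlanLinkWithOptions("eth1", tenus.MacVlanOptions{Mode: "bridge", MacVlanDev: "macvlanHostIfc"})
	if err != nil {
		log.Fatal(err)
	}
	hostIp, hostIpNet, err := net.ParseCIDR("10.0.41.2/16")
	if err != nil {
		log.Fatal(err)
	}
	// ASSUMPTION: SetLinkIp/SetLinkUp are the tenus helpers for configuring
	// a link that stays in the host's namespace - see the project examples.
	if err := macVlanHost.SetLinkIp(hostIp, hostIpNet); err != nil {
		log.Fatal(err)
	}
	if err := macVlanHost.SetLinkUp(); err != nil {
		log.Fatal(err)
	}

	// Second MAC VLAN interface, also bridged against eth1, destined for the container.
	macVlanDckr, err := tenus.NewMacVlanLinkWithOptions("eth1", tenus.MacVlanOptions{Mode: "bridge", MacVlanDev: "macvlanDckrIfc"})
	if err != nil {
		log.Fatal(err)
	}
	// Find the container's init PID via the docker socket...
	pid, err := tenus.DockerPidByName("mcvlandckr", "/var/run/docker.sock")
	if err != nil {
		log.Fatal(err)
	}
	// ...move the link into its network namespace and configure it there.
	if err := macVlanDckr.SetLinkNetNsPid(pid); err != nil {
		log.Fatal(err)
	}
	dckrIp, dckrIpNet, err := net.ParseCIDR("10.0.41.3/16")
	if err != nil {
		log.Fatal(err)
	}
	if err := macVlanDckr.SetLinkNetInNs(pid, dckrIp, dckrIpNet, nil); err != nil {
		log.Fatal(err)
	}
}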

Let’s verify if the program delivers what it promises. First we will check if the macvlanHostIfc interface has been created on the host and if it has a given IP assigned:

[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples# ip address list macvlanHostIfc
93: [email protected]: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 46:e4:70:c1:a2:a7 brd ff:ff:ff:ff:ff:ff
    inet 10.0.41.2/16 scope global macvlanHostIfc
       valid_lft forever preferred_lft forever
    inet6 fe80::44e4:70ff:fec1:a2a7/64 scope link
       valid_lft forever preferred_lft forever
[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples#

Boom! It looks like the host part worked as expected. Now let’s check whether the other MAC VLAN interface has been created in the docker container and whether it has the right IP address assigned, too:

[email protected]:/# ip address list macvlanDckrIfc
94: [email protected]: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 26:57:ce:39:32:8e brd ff:ff:ff:ff:ff:ff
    inet 10.0.41.3/16 scope global macvlanDckrIfc
       valid_lft forever preferred_lft forever
    inet6 fe80::2457:ceff:fe39:328e/64 scope link
       valid_lft forever preferred_lft forever
[email protected]:/#

Awesome! That’s another good sign that all worked as expected. Now, the last check - connectivity. If all went well, we should be able to ping the new IP address allocated to macvlanDckrIfc interface in the running docker. So let’s verify that claim:

[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples# ping -c 2 10.0.41.3
PING 10.0.41.3 (10.0.41.3) 56(84) bytes of data.
64 bytes from 10.0.41.3: icmp_seq=1 ttl=64 time=0.038 ms
64 bytes from 10.0.41.3: icmp_seq=2 ttl=64 time=0.044 ms

--- 10.0.41.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.038/0.041/0.044/0.003 ms
[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples#

Now, if I remove the host’s MAC VLAN interface, I should no longer be able to ping the IP address assigned to mcvlandckr’s MAC VLAN interface:

[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples# ip link del macvlanHostIfc
[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples# ip l l macvlanHostIfc
Device "macvlanHostIfc" does not exist.
[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples# ping -c 2 10.0.41.3
PING 10.0.41.3 (10.0.41.3) 56(84) bytes of data.

--- 10.0.41.3 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1001ms

[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples#

Cool. We can no longer reach the mcvlandckr container from the host via its MAC VLAN interface, so it’s isolated, but it’s also pretty useless because we simply can’t make any use of the newly assigned IP address. So let’s start another docker container called mcvlandckr2:

[email protected]:~# docker run -i -t --rm --privileged -h mcvlandckr2 --name mcvlandckr2 ubuntu:14.04 /bin/bash
[email protected]:/# ip l l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
9: eth0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 62:3d:74:f1:94:d9 brd ff:ff:ff:ff:ff:ff
[email protected]:/#

It’s pretty much identical, but it will do for the purpose of what I want to show you. Now, let’s modify the program which helped us create the MAC VLAN interface for the mcvlandckr container and use tenus to create a MAC VLAN interface in the newly created mcvlandckr2 docker container. Here’s the code snippet:

package main

import (
  "log"
  "net"

  "github.com/milosgajdos83/tenus"
)

func main() {
  macVlanDocker2, err := tenus.NewMacVlanLinkWithOptions("eth1", tenus.MacVlanOptions{Mode: "bridge", MacVlanDev: "macvlanDckr2Ifc"})
  if err != nil {
      log.Fatal(err)
  }

  pid, err := tenus.DockerPidByName("mcvlandckr2", "/var/run/docker.sock")
  if err != nil {
      log.Fatal(err)
  }

  if err := macVlanDocker2.SetLinkNetNsPid(pid); err != nil {
      log.Fatal(err)
  }

  macVlanDckr2Ip, macVlanDckr2IpNet, err := net.ParseCIDR("10.0.41.4/16")
  if err != nil {
      log.Fatal(err)
  }

  if err := macVlanDocker2.SetLinkNetInNs(pid, macVlanDckr2Ip, macVlanDckr2IpNet, nil); err != nil {
      log.Fatal(err)
  }
}

The source code should look familiar to you if you have already checked the gist of the first example. All it does is create a new MAC VLAN interface inside the new container and “bridge” it against the SAME host interface as the one created for the mcvlandckr container. Now, let’s run it and verify that the interface has indeed been created and has the correct IP address assigned, as per the above:

[email protected]:/# ip address list macvlanDckr2Ifc
13: [email protected]: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 1e:7a:5d:53:52:0a brd ff:ff:ff:ff:ff:ff
    inet 10.0.41.4/16 scope global macvlanDckr2Ifc
       valid_lft forever preferred_lft forever
    inet6 fe80::1c7a:5dff:fe53:520a/64 scope link
       valid_lft forever preferred_lft forever
[email protected]:/#

Awesome! We are almost done here. Now, the interesting part! If all works as expected, the mcvlandckr2 and mcvlandckr containers should be able to ping the IPs assigned to each other’s new network interfaces, whilst remaining isolated from the host (remember, we deleted the macvlanHostIfc MAC VLAN interface from the host). Now, let’s go ahead and test this.

Connectivity from the host:

[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples# ping -c 2 10.0.41.3
PING 10.0.41.3 (10.0.41.3) 56(84) bytes of data.

--- 10.0.41.3 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1001ms

[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples#
[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples# ping -c 2 10.0.41.4
PING 10.0.41.4 (10.0.41.4) 56(84) bytes of data.

--- 10.0.41.4 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1001ms

[email protected]:/opt/golang/src/github.com/milosgajdos83/tenus/examples#

Awesome, both dockers remain isolated from the host. Now let’s see if they can reach each other. Let’s try to ping the new IP assigned to mcvlandckr2 from mcvlandckr container:

[email protected]:/# ping -c 2 10.0.41.4
PING 10.0.41.4 (10.0.41.4) 56(84) bytes of data.
64 bytes from 10.0.41.4: icmp_seq=1 ttl=64 time=0.037 ms
64 bytes from 10.0.41.4: icmp_seq=2 ttl=64 time=0.041 ms

--- 10.0.41.4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.037/0.039/0.041/0.002 ms
[email protected]:/#

Boom! Perfect. Now, let’s do the same thing but in the opposite direction i.e. let’s ping the new IP assigned to mcvlandckr from mcvlandckr2:

[email protected]:/# ping -c 2 10.0.41.3
PING 10.0.41.3 (10.0.41.3) 56(84) bytes of data.
64 bytes from 10.0.41.3: icmp_seq=1 ttl=64 time=0.036 ms
64 bytes from 10.0.41.3: icmp_seq=2 ttl=64 time=0.057 ms

--- 10.0.41.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.036/0.046/0.057/0.012 ms
[email protected]:/#

Excellent! We can see from the above example that tenus can indeed help you create MAC VLAN interfaces inside your docker containers and isolate them from the host, or from the other containers on the host, very easily. Let your imagination loose and come up with your own scenarios to connect your containers, or simply get inspired by the awesome pipework documentation.

If you liked what you saw, feel free to check out the other examples and the whole project, which will show you how to create VLAN interfaces as well as an extra VETH pair with one peer inside the container. I would love it if people started hacking on this and improved the codebase massively.

Conclusion

Before I conclude this post, I just want to really stress that none of what is written in this post is meant to criticise the state of networking or any part of the Docker codebase, although I do admit it might seem like it in some parts. I do appreciate all the hard work being done by the tremendous community created around Docker. It also doesn’t mean that the advanced networking shown here needs to be implemented in the core of Docker. tenus was simply created out of my curiosity and interest in container networking. I’ll be happy if anyone finds it at least a tiny bit useful, and even happier if anyone starts hacking on it!

Furthermore, tenus is a generic Linux networking package. It does not necessarily need to be used with Docker. You can use it with LXC containers as well, or simply use it to create, configure and manage network devices programmatically on any Linux host running a reasonably new Linux kernel.
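For instance, here is a minimal, hedged sketch of wiring a MAC VLAN interface into an LXC container's network namespace. Only the PID lookup differs from the Docker examples above: the interface name and address are made up for illustration, and the PID is assumed to be passed on the command line (e.g. taken from lxc-info -n NAME -p).

package main

import (
	"log"
	"net"
	"os"
	"strconv"

	"github.com/milosgajdos83/tenus"
)

func main() {
	// PID of the target process, e.g. an LXC container's init PID
	// as reported by `lxc-info -n mycontainer -p`.
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}

	// Create a MAC VLAN link in bridge mode, "bridged" against the host's eth1.
	macVlan, err := tenus.NewMacVlanLinkWithOptions("eth1", tenus.MacVlanOptions{Mode: "bridge", MacVlanDev: "macvlanLxcIfc"})
	if err != nil {
		log.Fatal(err)
	}

	// Move the link into the target network namespace and configure it there.
	if err := macVlan.SetLinkNetNsPid(pid); err != nil {
		log.Fatal(err)
	}

	ip, ipNet, err := net.ParseCIDR("10.0.41.5/16")
	if err != nil {
		log.Fatal(err)
	}
	if err := macVlan.SetLinkNetInNs(pid, ip, ipNet, nil); err != nil {
		log.Fatal(err)
	}
}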

The current state of the package reflects the short time and effort spent on it and my rookie knowledge of the Go programming language. In other words, I do appreciate it needs quite a lot of work, especially around testing and most likely around design, too. Right now the test coverage is not great, but the core functionality should be covered by functional tests. I would massively welcome PRs.

I’m not going to try to get this into Docker, as it looks like the Docker guys have decided to port netlink to C bindings, but I’d love to integrate this into my own Docker fork and test the advanced networking functionality provided by tenus from the core of Docker, as opposed to from separate Go programs. If you are up for this challenge, hit me up! My idea is to come up with new entries in Dockerfiles which would specify the type of networking you want for the docker container you are building. But that’s just an idea. Let’s see where we can take this!

Exploring LXC Networking

Daily Dilemma

Recently I’ve been finding myself in various conversations about Docker and Linux Containers (LXC). Most of the time the conversations eventually end up at one and the same question: whether we should run containers in production. Initially this post had a few paragraphs where I philosophised about the readiness of the technology, but the attention dedicated to containers in the past year has been nothing short of amazing, and more and more companies are running their infrastructures, or at least big parts of them, in containers. I’m sure this trend will continue to a high degree in the future, so I’ve removed those paragraphs and kept the technical content only.

One of the (many) things which were not entirely clear to me, and to the people I speak with about containers almost on a daily basis, is how networking can be done and configured when using LXC. Hopefully this first blog post on the topic is going to shed some more light on the matter and inspire further posts on various other container-related topics.

Setup

Whenever I’m looking for answers to my technical questions or learning a new technology, I always prefer “hands-on” blog posts, so that’s the approach I’ve decided to take when writing this post. You can follow it step by step, and feel free to experiment as you work through the examples provided in each chapter.

All you need to follow the guide in this post is Vagrant. I used version 1.3.2, but for the purpose of this post any 1.2+ version will do (possibly even lower versions, but I’ve encountered some issues with older versions, so I don’t recommend them). In order to run Vagrant you need to put together a Vagrantfile. The one I used for this guide looks like this:

Vagrantfile
pkg_cmd = "apt-get update -qq && apt-get install -y vim curl python-software-properties; "
pkg_cmd << "add-apt-repository -y ppa:ubuntu-lxc/stable; "
pkg_cmd << "apt-get update -qq && apt-get install -y lxc"

VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "precise64"
  config.vm.box_url = "http://files.vagrantup.com/precise64.box"

  config.vm.provision :shell, :inline => pkg_cmd

  # Create a private network
  config.vm.network :private_network, ip: "10.0.4.2"

  # Create a public network
  config.vm.network :public_network
end

I decided to use Ubuntu Precise Pangolin as that was the latest Ubuntu LTS version released at the time I wrote this post, and that’s the OS which the companies I work for are running on their production servers.

As for the LXC tools, I initially ran all the examples in this blog post on daily LXC builds, as back when I was writing this post there was no stable LXC build per se and I thought it could be fun to test “on the edge” stuff. But on the 20th of February 2014 the Ubuntu guys released the first stable LXC release, 1.0. You can look up the stable LXC PPA on Launchpad. I have revisited the blog post and updated it accordingly, so that people who visit this page later on can find some up-to-date information here. Anyways, place your Vagrantfile into some working directory, fire up vagrant up and off we go:

[email protected]:~/Vagrant/blogpost$ ls -l
total 8
-rw-r--r--  1 milosgajdos  staff  632  3 Apr 09:37 Vagrantfile
[email protected]:~/Vagrant/blogpost$ vagrant up
...
[email protected]:~/Vagrant/blogpost$

If you used the same Vagrantfile as me, then before the Virtual Machine (VM) boots you will be prompted by vagrant to pick which network interface on your workstation you want to bridge the VirtualBox one with. Pick any active interface on the workstation you are running vagrant on, i.e. the one you’re using for networking (which is hopefully on a private network). vagrant up downloads the Vagrant box image (which is an OS image) if it’s not already present on your workstation, boots it, and runs the commands specified in pkg_cmd once the VM is up and running.

Once the whole setup has finished, you can run vagrant ssh in the same directory you ran vagrant up in. This will log you on to the newly created VM via ssh, and you should see the LXC packages installed and ready to be used:

[email protected]:~$ dpkg -l|grep -i lxc
ii  liblxc1                         1.0.5-0ubuntu0.1~ubuntu12.04.1~ppa1 Linux Containers userspace tools (library)
ii  lxc                             1.0.5-0ubuntu0.1~ubuntu12.04.1~ppa1 Linux Containers userspace tools
ii  lxc-templates                   1.0.5-0ubuntu0.1~ubuntu12.04.1~ppa1 Linux Containers userspace tools (templates)
ii  python3-lxc                     1.0.5-0ubuntu0.1~ubuntu12.04.1~ppa1 Linux Containers userspace tools (Python 3.x bindings)
[email protected]:~$

The actual version tags (such as the timestamp) of the packages above may differ for you, as you’ll be installing the latest ones available at the time you follow this guide. As for the networking part of the setup, you should end up with something like the below, though obviously the IP and MAC addresses will be different on your workstation (for easier readability I’ve added an extra new line after each interface output):

[email protected]:~$ sudo ip address list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:88:0c:a6 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
    inet6 fe80::a00:27ff:fe88:ca6/64 scope link
       valid_lft forever preferred_lft forever

3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:2e:8a:7a brd ff:ff:ff:ff:ff:ff
    inet 10.0.4.2/24 brd 10.0.4.255 scope global eth1
    inet6 fe80::a00:27ff:fe2e:8a7a/64 scope link
       valid_lft forever preferred_lft forever

4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:f9:8f:2e brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.135/24 brd 192.168.1.255 scope global eth2
    inet6 fe80::a00:27ff:fef9:8f2e/64 scope link
       valid_lft forever preferred_lft forever

5: lxcbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether ee:5d:90:3b:26:d0 brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.1/24 brd 10.0.3.255 scope global lxcbr0
    inet6 fe80::85f:69ff:fe8e:9df8/64 scope link
       valid_lft forever preferred_lft forever
[email protected]:~$

Quick explanation of the above:

  • lo - loopback interfaces (more on this later)
  • eth0 - default network interface created by Vagrant
  • eth1 - private network interface which Vagrant created as instructed by our Vagrantfile configuration :private_network
  • eth2 - public network interface which Vagrant created as instructed by our Vagrantfile configuration :public_network
  • lxcbr0 - network bridge which was created automatically when the lxc tools have been installed

The installation of the LXC tools hides a small subtlety you might want to be aware of, and that is the creation of one iptables rule. The rule basically masquerades all the traffic leaving the containers which are bridged with lxcbr0, i.e. which are on the 10.0.3.0/24 LAN, so that you can reach the “outside” world from inside these containers:

[email protected]:~$ sudo iptables -t nat -nL
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  10.0.3.0/24         !10.0.3.0/24
[email protected]:~$

The above iptables rule is also created automatically on boot via the lxc-net upstart job, which you can easily verify by looking into the particular upstart configuration file: /etc/init/lxc-net.conf.

LXC Network Types

As the official LXC Ubuntu manual pages mention there are 5 network virtualization types available to be used with LXC:

  • empty
  • veth
  • macvlan
  • vlan
  • phys

There is one extra network type available on Trusty, called “none”. Each of these types works with different concepts and requires some knowledge of Linux networking. I hope the practical examples in this blog post will give you a better understanding of them. Enough talking, let’s get the party started!

Empty

Let’s create our first container to explore the empty network type. We will create a new container by running the lxc-create command, which can use the ubuntu template script shipped with the lxc-templates package. LXC templates are essentially bash scripts which make the creation of Linux containers super easy. You can have a look at what’s “hiding” inside the ubuntu one (which is the one we will be using in this guide) in the following path: /usr/share/lxc/templates/lxc-ubuntu. You can specify a particular Ubuntu version via the -r command line switch passed to the template script, as you can see below:

[email protected]:~$ sudo lxc-create -t ubuntu -n empty01 -- -r precise

The above command uses the debootstrap package to download and install the particular Ubuntu OS image into the /var/lib/lxc directory on the host machine. Apart from downloading the image, lxc-create sets up the container directory structure and performs a few other tasks. Once the above command has successfully finished, you should be able to see that the new image has been installed and is ready to be used:

[email protected]:~$ sudo ls -1 /var/lib/lxc/
empty01
[email protected]:~$ sudo ls -l /var/lib/lxc/empty01/
total 12
-rw-r--r--  1 root root 1516 Nov 10 19:37 config
-rw-r--r--  1 root root  329 Nov 10 19:29 fstab
drwxr-xr-x 22 root root 4096 Nov 10 19:38 rootfs
[email protected]:~$

Every container created using the lxc-create utility has a configuration file in the particular container’s path (/var/lib/lxc/CONTAINERNAME/config). The configuration file is then used by another utility, called lxc-start, to actually create and start the container. We need to modify this file to use the empty network type, as by default the veth network type is used. The new network-related configuration looks like this:

[email protected]:~$ sudo grep network /var/lib/lxc/empty01/config
lxc.network.type = empty
lxc.network.hwaddr = 00:16:3e:67:4f:a5
lxc.network.flags = up
[email protected]:~$

Now that we have our networking configuration ready, let’s start the container. Note the -d switch below, which starts the container in the background - if you don’t use this switch when starting the container, you will see the container start logs streaming to your standard output and you will get a container console once the container has booted. After giving it a short time to boot, let’s check whether it’s running:

[email protected]:~$ sudo lxc-start -n empty01 -d
[email protected]:~$ sudo lxc-ls --fancy
NAME     STATE    IPV4  IPV6  AUTOSTART
---------------------------------------
empty01  RUNNING  -     -     NO
[email protected]:~$

As you can see in the above output, the container is up and running, but it doesn’t have any IP address assigned. This is exactly what we should expect, since we specified the empty network type, which the documentation describes as follows: “empty network type creates only the loopback interface”. In fact, the empty network type creates an EMPTY NETWORK NAMESPACE in which the container runs. When you create an empty network namespace only the loopback device is created, which is exactly what the documentation says. The newly created loopback interface is not visible on the host when running sudo ip link list, as it’s in a different network namespace.

I will talk more about Linux namespacing in another blog post, but if you are impatient and want to know more now, there is a small presentation on network namespaces I gave a short while ago, or you can check out a great blog post written by Jérôme Petazzoni which explains them very well. The short explanation is that namespacing is functionality implemented in the Linux kernel which allows for the isolation of various kernel subsystems on the host.

You can verify that the new container is running on the host and has a new network namespace created by checking the contents of its init process’s proc namespace entry. Make sure you are checking the container’s init process PID entry (2nd column on the 2nd line in the output below) and NOT the lxc-start process one, as lxc-start is the parent process which forks the container’s init process:

[email protected]:~$ ps faux | grep -A 1 "lxc-start -n empty01 -d"
root      20465  0.0  0.3  31952  1172 ?        Ss   19:38   0:00 lxc-start -n empty01 -d
root      20469  0.0  0.5  24076  2040 ?        Ss   19:38   0:00  \_ /sbin/init
[email protected]:~$
[email protected]:~$ sudo ls -l /proc/20469/ns/net
-r-------- 1 root root 0 Nov 10 19:39 /proc/20469/ns/net
[email protected]:~$
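As an aside, that /proc/PID/ns/net handle is also how tools join a container's network namespace programmatically. Below is a minimal, hedged Go sketch which hops into the namespace via the setns() syscall and lists the interfaces it can see - for our empty01 container that would be just lo. It needs root (CAP_SYS_ADMIN) and a kernel that supports setns() for network namespaces, and the container init PID is assumed to be passed on the command line.

package main

import (
	"fmt"
	"log"
	"net"
	"os"
	"runtime"
	"syscall"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <container-init-pid>", os.Args[0])
	}

	// setns() changes the namespace of the calling thread, so make sure
	// this goroutine stays pinned to one OS thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Open the container's network namespace handle from procfs.
	nsFile, err := os.Open("/proc/" + os.Args[1] + "/ns/net")
	if err != nil {
		log.Fatal(err)
	}
	defer nsFile.Close()

	// Join that network namespace (needs root / CAP_SYS_ADMIN).
	if _, _, errno := syscall.RawSyscall(syscall.SYS_SETNS, nsFile.Fd(), syscall.CLONE_NEWNET, 0); errno != 0 {
		log.Fatal(errno)
	}

	// From here on we only see the container's interfaces - just "lo"
	// for a container using the empty network type.
	ifaces, err := net.Interfaces()
	if err != nil {
		log.Fatal(err)
	}
	for _, ifc := range ifaces {
		fmt.Println(ifc.Index, ifc.Name)
	}
}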

We can log on to the container using lxc-console, which attaches to one of the container’s ttys. The user/password is ubuntu/ubuntu. We can now verify that the loopback device has been created correctly:

[email protected]:~$ sudo lxc-console -n empty01 -t 2

Inside the container simply run ip address list:

[email protected]:~$ sudo ip address list
6: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
[email protected]:~$ 

We didn’t really even have to log on to the container to verify this. We could simply switch into the container’s network namespace directly on the host, but let’s leave that for later. Alternatively, we could run lxc-attach -- /sbin/ip address list, but due to some namespace implementation issues in the kernel I’m running this guide on, lxc-attach does not seem to work properly, hence I’m using lxc-console. Stéphane Graber, one of the LXC project core developers, recommends in the comments below using a kernel version >= 3.8, with which lxc-attach will work just fine.

Before we dive into a discussion about how useful the “empty” network type is in combination with a container, I’m going to talk a bit about what the loopback device actually is, as a lot of people I come in touch with somehow accept its existence but don’t really know even the basic implementation details.

The loopback device is a virtual network device implemented entirely in the kernel, which the host uses to communicate with itself. The device (interface) is assigned the special non-routable IP address block 127.0.0.0/8, that is, the IP addresses 127.0.0.1 through 127.255.255.254 can be used to communicate with the host itself. Linux distributions also “alias” one of these IP addresses in /etc/hosts with an entry called localhost.

How does the actual communication work when we know that the assigned IP address block is non-routable? Simply via a special entry in the host’s routing table which routes all packets destined for 127.0.0.0/8 back to the loopback device lo:

[email protected]:~$ sudo ip route show table local
broadcast 127.0.0.0 dev lo  proto kernel  scope link  src 127.0.0.1
local 127.0.0.0/8 dev lo  proto kernel  scope host  src 127.0.0.1
local 127.0.0.1 dev lo  proto kernel  scope host  src 127.0.0.1
broadcast 127.255.255.255 dev lo  proto kernel  scope link  src 127.0.0.1
[email protected]:~$

How is this useful? Mostly for diagnostics, troubleshooting, testing, and when you don’t want your service to be exposed on any network. Imagine you want to test different PostgreSQL configurations which bind to the localhost IP address on the same TCP port. Now, you can argue that you could copy the configuration files and run each instance on a different TCP port on the same host, but why all that “hassle” when you can simply run the PostgreSQL daemon in a container and avoid the “interference” with other PostgreSQL instances bound to the same IP address and port. Obviously, this is just one of many use cases. I’m sure you will be able to come up with many others. I’d like to hear about them in the comments below.

Veth

In order to explore the veth network type we will create a new container, but with different network settings. Let’s not waste any more time and fire up another lxc-create command:

[email protected]:~$ sudo lxc-create -t ubuntu -n veth01 -- -r precise
Checking cache download in /var/cache/lxc/precise/rootfs-amd64 ...
Copy /var/cache/lxc/precise/rootfs-amd64 to /var/lib/lxc/veth01/rootfs ...
Copying rootfs to /var/lib/lxc/veth01/rootfs ...

As you must have noticed, the container creation takes much less time than when we created our first container, as debootstrap caches the previously downloaded image in the directory you can see in the output above and then simply copies it over to the new container’s filesystem. Of course, if the second container were created with a different template than the first one, it would take longer, as a whole new OS image would have to be downloaded from scratch.

Now, before we look into the container’s network configuration, let’s talk a little bit about what the veth network type actually is. LXC documentation says:

a peer network device is created with one side assigned to the container and the other side is attached to a bridge specified by the lxc.network.link

The peer device can be understood as a pair of fake Ethernet devices that act as a pipe, i.e. traffic sent via one interface comes out the other one. As these devices are Ethernet devices and not point to point devices you can handle broadcast traffic on these interfaces and use protocols other than IP - you are basically protocol independent on top of Ethernet. For the purpose of this guide we will only focus on the IP protocol though.

Armed with this knowledge, what we should expect when running the container with the veth network type enabled, is to have one network interface created on the host and the other one in the container, where the container interface will be “hidden” in the container’s network namespace. The host’s interface will then be bridged to the bridge created on the host if so configured.

Let’s proceed with the container’s network configuration modifications so that it looks as below:

[email protected]:~$ sudo grep network /var/lib/lxc/veth01/config
lxc.network.type = veth
lxc.network.hwaddr = 00:16:3e:7e:11:ac
lxc.network.flags = up
lxc.network.link = lxcbr0
[email protected]:~$ 

As you’ve probably noticed, the container configuration file generated by lxc-create is already set to use the veth network type by default, so you shouldn’t need to make any modifications to the veth01 container’s configuration if you have followed this guide carefully. Now let’s start our new container!

[email protected]:~$ sudo lxc-start -n veth01 -d
[email protected]:~$ sudo lxc-ls --fancy
NAME     STATE    IPV4        IPV6  AUTOSTART
---------------------------------------------
empty01  RUNNING  -           -     NO
veth01   RUNNING  10.0.3.118  -     NO
[email protected]:~$ 

Brilliant! The container is running and it has been assigned the 10.0.3.118 IP address automatically. How is the IP address assigned? In order to understand that, we need to understand how the container is actually created. What happens in terms of networking on the host is - in very simplified terms - the following:

  1. A pair of veth devices is created on the host. The future container’s network device is then configured via a DHCP server (it’s actually a dnsmasq daemon) listening on the IP assigned to the LXC network bridge (in this case that’s lxcbr0). You can verify this by running sudo netstat -ntlp|grep 53. The bridge’s IP address will serve as the container’s default gateway as well as its nameserver
  2. The host part of the veth device pair is attached to the bridge configured in the container configuration - as I said, in this case that is lxcbr0
  3. The “slave” part of the pair is then “moved” to the container, renamed to eth0 and finally configured in the container’s network namespace
  4. Once the container’s init process is started, it brings up the particular network interface in the container and we can start networking!

In other words, the above 4 steps serve to bridge the container’s network namespace with the host network namespace via a VETH device pair, so you should be able to communicate with the container directly from the host. Let’s send it a couple of pings from the host and see whether that’s the case:

[email protected]:~$ ping -c 2 10.0.3.118
PING 10.0.3.118 (10.0.3.118) 56(84) bytes of data.
64 bytes from 10.0.3.118: icmp_req=1 ttl=64 time=0.220 ms
64 bytes from 10.0.3.118: icmp_req=2 ttl=64 time=0.130 ms

--- 10.0.3.118 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.130/0.175/0.220/0.045 ms
[email protected]:~$

Awesome! Life is beautiful! But because we are curious, we will log on to the veth01 container, have a poke around and verify whether the theory we spoke about earlier has practical backing by checking the network configuration of the container. You can log on to the veth01 container either via ssh, which runs in the container (ssh is installed by default when you create the container by following the steps in this guide), or via lxc-console as we did when we introduced the empty networking type:

[email protected]:~$ sudo ip address list
7: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:16:3e:7e:11:ac brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.118/24 brd 10.0.3.255 scope global eth0
    inet6 fe80::216:3eff:fe7e:11ac/64 scope link
       valid_lft forever preferred_lft forever

9: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
[email protected]:~$
[email protected]:~$ sudo ip route show
default via 10.0.3.1 dev eth0  metric 100
10.0.3.0/24 dev eth0  proto kernel  scope link  src 10.0.3.118
ubuntu@veth01:~$ grep nameserver /etc/resolv.conf
nameserver 10.0.3.1
nameserver 10.0.2.3
nameserver 192.168.1.254
ubuntu@veth01:~$

The container’s side of the veth pair looks as we expected. eth0 has the correct IP assigned, and its network is configured to use the IP address of the lxcbr0 bridge. Let’s check the host’s side:

vagrant@precise64:~$ sudo ip address list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:88:0c:a6 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
    inet6 fe80::a00:27ff:fe88:ca6/64 scope link
       valid_lft forever preferred_lft forever

3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:2e:8a:7a brd ff:ff:ff:ff:ff:ff
    inet 10.0.4.2/24 brd 10.0.4.255 scope global eth1
    inet6 fe80::a00:27ff:fe2e:8a7a/64 scope link
       valid_lft forever preferred_lft forever

4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:f9:8f:2e brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.135/24 brd 192.168.1.255 scope global eth2
    inet6 fe80::a00:27ff:fef9:8f2e/64 scope link
       valid_lft forever preferred_lft forever

5: lxcbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether ee:5d:90:3b:26:d0 brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.1/24 brd 10.0.3.255 scope global lxcbr0
    inet6 fe80::85f:69ff:fe8e:9df8/64 scope link
       valid_lft forever preferred_lft forever

8: vethD9YPJ0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master lxcbr0 state UP qlen 1000
    link/ether fe:25:26:02:77:25 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc25:26ff:fe02:7725/64 scope link
       valid_lft forever preferred_lft forever
vagrant@precise64:~$

vethD9YPJ0 device has been created as the Host’s side device of the veth network pair (remember the other side of the pair is in the container).

Let’s see if the Host’s side veth device is bridged to the lxcbr0 bridge:

vagrant@precise64:~$ sudo brctl show
bridge name   bridge id       STP enabled interfaces
lxcbr0        8000.febc614cdc21   no      vethD9YPJ0
vagrant@precise64:~$ 

BINGO! Looks like all is as we expected it to be and we can sleep well at night knowing that we learnt something new. Well at least I hope we did :-)

Now, let’s take this a bit further and explore Network Namespaces for a bit and try to simulate something similar to what is happening on the network level when the veth01 container is created. We won’t dive into actual Kernel namespacing implementation, but we will play around with the userspace tools a bit.

The task we will work on now is to create a veth pair of devices, one of which will be in a different network namespace than the other, and then attach it to the same bridge the veth01 container is bridged to. In other words, let’s create a separate network stack in a separate network namespace from the host one, without all the “boilerplate” which comes with the creation of an LXC container, i.e. we won’t be creating any namespaces except for the network one. Let’s pretend we are network experts, we need to perform some network activities, and we don’t need full-blown containers - just separate network stacks on the host.

One way of completing this task is simply to perform the following steps:

  1. create new network namespace
  2. create veth pair of network devices in the new namespace
  3. configure the “host isolated” device
  4. pass the other side of the pair back to the Host’s namespace
  5. bridge the Host’s veth pair device to the lxcbr0 bridge

Simple! So let’s roll our sleeves up and start working through the above plan. The result should be a pingable IP address in a separate network namespace. All the following steps are performed on the host.

Let’s create a directory where the network namespaces are read from:

vagrant@precise64:~$ sudo mkdir -p /var/run/netns

Let’s create a new network namespace and verify it was created (the ip command has a special flag for listing network namespaces):

vagrant@precise64:~$ sudo ip netns add mynamespace
vagrant@precise64:~$ sudo ip netns list
mynamespace
vagrant@precise64:~$ sudo ls -l /var/run/netns/
total 0
-r-------- 1 root root 0 Nov 10 22:24 mynamespace
vagrant@precise64:~$ 

Awesome! We are all set for a good Linux networking ride! We can check the box next to step (1) in the plan above.

Now let’s switch to the newly created namespace, create a pair of veth devices and configure one of them to use the 10.0.3.78/24 IP address:

vagrant@precise64:~$ sudo ip netns exec mynamespace bash
root@precise64:~# ip link add vethMYTEST type veth peer name eth0
root@precise64:~# ip link list
10: lo: <LOOPBACK> mtu 16436 qdisc noop state DOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

11: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 82:6b:b3:08:36:34 brd ff:ff:ff:ff:ff:ff

12: vethMYTEST: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether ea:6a:c3:f0:62:d7 brd ff:ff:ff:ff:ff:ff
root@precise64:~#

As you can see, 3 separate network devices have been created: the lo device, a loopback interface which is created automatically by default for every newly created network namespace, and a pair of veth network devices. None of them has been assigned an IP address yet:

root@precise64:~# ip address list
10: lo: <LOOPBACK> mtu 16436 qdisc noop state DOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

11: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 82:6b:b3:08:36:34 brd ff:ff:ff:ff:ff:ff

12: vethMYTEST: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether ea:6a:c3:f0:62:d7 brd ff:ff:ff:ff:ff:ff
root@precise64:~#

Let’s assign an IP address to the eth0 interface from the same IP range as lxcbr0 is on and bring it to life:

root@precise64:~# ip address add 10.0.3.78/24 dev eth0
root@precise64:~# ip link set eth0 up
root@precise64:~# ip address list
10: lo: <LOOPBACK> mtu 16436 qdisc noop state DOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

11: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN qlen 1000
    link/ether 82:6b:b3:08:36:34 brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.78/24 scope global eth0

12: vethMYTEST: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether ea:6a:c3:f0:62:d7 brd ff:ff:ff:ff:ff:ff
root@precise64:~#

Brilliant! We can now check the boxes next to steps (2) and (3). Let’s proceed with step (4) and move the vethMYTEST device to the Host’s namespace, i.e. to the network namespace of the process whose PID = 1, which is the host’s init process. We can do that like so:

root@precise64:~# ip link set vethMYTEST netns 1

The vethMYTEST device should now be present in the Host’s network namespace, so we should be able to bring it up on the Host. First, let’s exit the shell running in the “mynamespace” network namespace (I know I’ve picked a horrible name for it) and then bring the device to life:

root@precise64:~# exit
exit
vagrant@precise64:~$ sudo ip link set vethMYTEST up
vagrant@precise64:~$ sudo ip link list vethMYTEST
12: vethMYTEST: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether ea:6a:c3:f0:62:d7 brd ff:ff:ff:ff:ff:ff
vagrant@precise64:~$

Until we bridge the vethMYTEST device with the lxcbr0 bridge, the “network isolated” IP address should not be accessible. We can verify that very easily:

vagrant@precise64:~$ ping -c 2 10.0.3.78
PING 10.0.3.78 (10.0.3.78) 56(84) bytes of data.
From 10.0.3.1 icmp_seq=1 Destination Host Unreachable
From 10.0.3.1 icmp_seq=2 Destination Host Unreachable

--- 10.0.3.78 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1009ms
pipe 2
vagrant@precise64:~$ 

Now let’s do some bridging! We have an awesome brctl utility for this so let’s use it:

vagrant@precise64:~$ sudo brctl addif lxcbr0 vethMYTEST
vagrant@precise64:~$ sudo brctl show
bridge name   bridge id       STP enabled interfaces
lxcbr0        8000.3234ea7e8ace   no      vethD9YPJ0
                                          vethMYTEST
vagrant@precise64:~$

Now that we can check the box next to step (5) we should be able to access the “mynamespace” isolated IP address. So let’s verify that claim:

vagrant@precise64:~$ ping -c 2 10.0.3.78
PING 10.0.3.78 (10.0.3.78) 56(84) bytes of data.
64 bytes from 10.0.3.78: icmp_req=1 ttl=64 time=0.094 ms
64 bytes from 10.0.3.78: icmp_req=2 ttl=64 time=0.080 ms

--- 10.0.3.78 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.080/0.087/0.094/0.007 ms
vagrant@precise64:~$

Boom! We are done! Great job everyone!

Before we conclude this part of the post, here are a few use cases the veth interface can be used for:

  • create virtual networks between containers - by linking them via different bridges on different networks
  • provide a routed link for a container - for routing packets to the “outside” world with the help of iptables, as mentioned earlier in this post (see the sketch after this list)
  • emulate bridged networks - handy for testing complicated network setups locally
  • test pretty much any network topology
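
To make the routed-link bullet a bit more concrete, here is a minimal, untested sketch that reuses the mynamespace namespace from earlier. The interface names, the 10.0.7.0/30 range and the assumption that eth0 is the host’s outbound interface are all made up for illustration, and it needs a reasonably recent iproute2:

# veth pair: one end stays on the host, the other goes into the namespace
sudo ip link add vethHOST type veth peer name vethNS
sudo ip link set vethNS netns mynamespace

# point-to-point style addressing on a dedicated /30
sudo ip address add 10.0.7.1/30 dev vethHOST
sudo ip link set vethHOST up
sudo ip netns exec mynamespace ip address add 10.0.7.2/30 dev vethNS
sudo ip netns exec mynamespace ip link set vethNS up
sudo ip netns exec mynamespace ip route add default via 10.0.7.1

# let the host forward and masquerade the namespace's traffic
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -s 10.0.7.0/30 -o eth0 -j MASQUERADE

With that in place, traffic originating in the namespace is routed through the host rather than bridged onto lxcbr0.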

As always, I’m going to leave the rest to your imagination and creativity. Feel free to leave suggestions in the comments. And now, let’s move on to the next network type available in containers - macvlan.

Macvlan

Another network type I will talk about in this guide is macvlan. But before we get to work, let’s get familiar with a little bit of theory first. macvlan - MAC VLAN for easier understanding - is a way to take a single network interface and create multiple virtual network interfaces on top of it, with different MAC addresses assigned to them.

This is a “many-to-one” mapping. Linux VLANs, which can take a single network interface and map it to multiple virtual networks, provide a “one-to-many” mapping - one network interface, many VLANs in one trunk. MAC VLANs map multiple virtual network interfaces (i.e. with different MAC addresses assigned) to one physical network interface. We can obviously combine Linux VLANs with MAC VLANs if we want to.

MAC VLAN allows each configured “slave” device to be in one of three modes (a short iproute2 sketch follows the list):

  • PRIVATE - the device never communicates with any other device on the “upper_dev” (i.e. the “master” device), which means that all incoming packets on the “slave” virtual interface are dropped if their source MAC address matches one of the MAC VLAN interfaces - i.e. the “slaves” can’t communicate with each other

  • VEPA - Virtual Ethernet Port Aggregator is a MAC VLAN mode that aggregates virtual machine packets on the server before the resulting single stream is transmitted to the switch. When using VEPA we assume that the adjacent bridge returns all frames whose source and destination are both local to the macvlan port, i.e. the bridge is set up as a reflective relay. This mode of operation is also called “hairpin mode” and it must be supported by the upstream switch, not the Linux kernel itself. All traffic is forwarded out to the switch even if it is destined for us, and we rely on the switch at the other end to send it back, so the isolation is done on the switch, not in the Linux kernel.

  • BRIDGE - provides the behavior of a simple bridge between different macvlan interfaces on the same port. Frames from one interface to another are delivered directly and are not sent out externally, which gives you some security guarantees for inter-container communication. Think of this as a simplified switch which does not need to learn MAC addresses as it already knows them.
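
To make the distinction concrete, here is a minimal iproute2 sketch (the interface names are made up and nothing here is needed for the rest of the guide): a Linux VLAN sub-interface versus macvlan sub-interfaces in each of the three modes.

# one physical interface, many VLANs ("one-to-many")
ip link add link eth0 name eth0.10 type vlan id 10

# many MAC addresses on one physical interface ("many-to-one"),
# one sub-interface per macvlan mode
ip link add link eth0 name mvlanpriv type macvlan mode private
ip link add link eth0 name mvlanvepa type macvlan mode vepa
ip link add link eth0 name mvlanbr type macvlan mode bridge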

What do the above modes mean for LXC containers? Let’s have a look:

  • PRIVATE mode disallows any communication between LXC containers

  • VEPA mode isolates the containers from one another - UNLESS you have an upstream switch configured to work as a reflective relay, in which case you CAN address the containers directly. On top of that, traffic destined for a different MAC VLAN on the same interface (i.e. on the same bridge) travels through the physical (“master”) interface twice - once to leave (egress) the interface, where it is switched and sent back, and once more to enter (ingress) via the SAME interface - which means this can affect the available physical bandwidth and also restricts inter-MAC-VLAN traffic to the speed of the physical connection

  • BRIDGE mode creates a special bridge (a “pseudo” bridge - not the same thing as a standard Linux bridge!) which allows the containers to talk to one another but isolates the “pseudo-bridged” interfaces from the host

As VEPA requires a specially configured (reflective relay) switch to demonstrate its functionality, and I could not find a way to configure a Linux bridge in reflective relay mode, I will not deal with this mode in this guide. PRIVATE mode doesn’t provide much fun to play with, so I won’t be touching on it either. That leaves us with BRIDGE mode, so let’s hurry up and have some fun.

Bridge mode

To demonstrate how this mode works, we will create 2 LXC containers and bridge them over a manually created bridge using the bridge mode of the macvlan network type. Let’s go ahead and create a new bridge on the Host - if you have read the theory above carefully you will know that we don’t need to assign the bridge an IP address, as doing so would not affect the results of this test:

vagrant@precise64:~$ sudo brctl addbr lxcbr1
vagrant@precise64:~$ sudo ip link set lxcbr1 up
vagrant@precise64:~$ sudo brctl show
bridge name   bridge id       STP enabled interfaces
lxcbr0        8000.3234ea7e8ace   no      vethD9YPJ0
                                          vethMYTEST
lxcbr1        8000.000000000000   no      
vagrant@precise64:~$ 

Let’s create the containers now and configure them to use the macvlan network type in bridge mode. Let’s also assign each of them an IP address so they can communicate with each other:

vagrant@precise64:~$ sudo lxc-create -t ubuntu -n macvlanbridge01 -- -r precise
vagrant@precise64:~$ sudo lxc-create -t ubuntu -n macvlanbridge02 -- -r precise
vagrant@precise64:~$ sudo grep network /var/lib/lxc/macvlanbridge01/config
lxc.network.type = macvlan
lxc.network.macvlan.mode = bridge
lxc.network.hwaddr = 00:16:3e:48:35:d2
lxc.network.flags = up
lxc.network.link = lxcbr1
lxc.network.ipv4 = 10.0.5.3/24
vagrant@precise64:~$
vagrant@precise64:~$ sudo grep network /var/lib/lxc/macvlanbridge02/config
lxc.network.type = macvlan
lxc.network.macvlan.mode = bridge
lxc.network.hwaddr = 00:16:3e:69:b3:4d
lxc.network.flags = up
lxc.network.link = lxcbr1
lxc.network.ipv4 = 10.0.5.4/24
vagrant@precise64:~$

We can now fire them up and start playing with them. As you can see below, both containers are now up and running and have been assigned IP addresses as per the configuration above:

vagrant@precise64:~$ sudo lxc-start -n macvlanbridge01 -d
vagrant@precise64:~$ sudo lxc-start -n macvlanbridge02 -d
vagrant@precise64:~$ sudo lxc-ls --fancy
NAME             STATE    IPV4           IPV6  AUTOSTART
--------------------------------------------------------
empty01          RUNNING  -              -     NO
macvlanbridge01  RUNNING  10.0.5.3       -     NO
macvlanbridge02  RUNNING  10.0.5.4       -     NO
veth01           RUNNING  10.0.3.118     -     NO
vagrant@precise64:~$

If the theory discussed at the beginning of this subchapter is correct, we should not be able to access any of the newly created containers from the host. So let’s go ahead and verify this with simple ping tests:

vagrant@precise64:~$ ping -c 2 10.0.5.3
PING 10.0.5.3 (10.0.5.3) 56(84) bytes of data.

--- 10.0.5.3 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1000ms

vagrant@precise64:~$ ping -c 2 10.0.5.4
PING 10.0.5.4 (10.0.5.4) 56(84) bytes of data.

--- 10.0.5.4 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1008ms

vagrant@precise64:~$

On the other hand, the containers should be able to communicate with each other. So let’s go ahead and run a few tests from inside the containers.

Test performed on the first container:

ubuntu@macvlanbridge01:~$ sudo ip address list
14: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:16:3e:48:35:d2 brd ff:ff:ff:ff:ff:ff
    inet 10.0.5.3/24 brd 10.0.5.255 scope global eth0
    inet6 fe80::216:3eff:fe48:35d2/64 scope link
       valid_lft forever preferred_lft forever

15: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
ubuntu@macvlanbridge01:~$
ubuntu@macvlanbridge01:~$ ping -c 3 10.0.5.4
PING 10.0.5.4 (10.0.5.4) 56(84) bytes of data.
64 bytes from 10.0.5.4: icmp_req=1 ttl=64 time=0.106 ms
64 bytes from 10.0.5.4: icmp_req=2 ttl=64 time=0.080 ms
64 bytes from 10.0.5.4: icmp_req=3 ttl=64 time=0.118 ms

--- 10.0.5.4 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.080/0.101/0.118/0.017 ms
ubuntu@macvlanbridge01:~$

Test performed on the second container:

ubuntu@macvlanbridge02:~$ sudo ip address list
16: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:16:3e:69:b3:4d brd ff:ff:ff:ff:ff:ff
    inet 10.0.5.4/24 brd 10.0.5.255 scope global eth0
    inet6 fe80::216:3eff:fe69:b34d/64 scope link
       valid_lft forever preferred_lft forever

17: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
ubuntu@macvlanbridge02:~$
ubuntu@macvlanbridge02:~$ ping -c 3 10.0.5.3
PING 10.0.5.3 (10.0.5.3) 56(84) bytes of data.
64 bytes from 10.0.5.3: icmp_req=1 ttl=64 time=0.061 ms
64 bytes from 10.0.5.3: icmp_req=2 ttl=64 time=0.058 ms
64 bytes from 10.0.5.3: icmp_req=3 ttl=64 time=0.061 ms

--- 10.0.5.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.058/0.060/0.061/0.001 ms
ubuntu@macvlanbridge02:~$

Perfect! It looks like everything is working as expected! But, let’s test one more thing.

One of the obvious things I have not mentioned until now is that Linux containers can be configured with more than just ONE network interface. In fact, you can have various network configurations in a single container, each applied to a different network interface. Armed with this knowledge, we can create a container which will have:

  • one veth network interface linked to the lxcbr0 bridge - so we can communicate with it directly from the host on which the container is running
  • one MAC VLAN network interface in bridge mode linked to lxcbr1 - so the container can communicate with the macvlanbridge01 and macvlanbridge02 containers; remember, these 2 containers are NOT accessible directly from the host (not accessible via the network, that is - you can of course always get into them from the host via lxc-console)

This will give us a “management” container accessible from the host over the network, from which we can access the other two containers - think DMZ. So let’s get started and create a new container with these capabilities:

vagrant@precise64:~$ sudo lxc-create -t ubuntu -n dmzmaster01 -- -r precise

The network configuration should follow the requirements mentioned above - the container should be accessible from the host and should have access to the MAC VLAN-ed containers:

vagrant@precise64:~$ sudo grep network /var/lib/lxc/dmzmaster01/config
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = lxcbr0
# MAC VLAN network
lxc.network.type = macvlan
lxc.network.macvlan.mode = bridge
lxc.network.flags = up
lxc.network.link = lxcbr1
lxc.network.ipv4 = 10.0.5.5/24
vagrant@precise64:~$ 

Let’s start the container and verify that the requirements we set ourselves have been satisfied. The newly created container should have 2 IP addresses assigned on two separate network interfaces. The interface linked to lxcbr0 should have an IP from the 10.0.3.0/24 range, and the interface linked to lxcbr1 should have the manually assigned IP address from our configuration above, on the same LAN as the macvlanbridge01 and macvlanbridge02 containers:

vagrant@precise64:~$ sudo lxc-start -n dmzmaster01 -d
vagrant@precise64:~$ sudo lxc-ls --fancy
NAME             STATE    IPV4                  IPV6  AUTOSTART
---------------------------------------------------------------
dmzmaster01      RUNNING  10.0.3.251, 10.0.5.5  -     NO
empty01          RUNNING  -                     -     NO
macvlanbridge01  RUNNING  10.0.5.3              -     NO
macvlanbridge02  RUNNING  10.0.5.4              -     NO
veth01           RUNNING  10.0.3.118            -     NO
vagrant@precise64:~$

Perfect! Looks like the IP assignment has worked as expected! Let’s see if 10.0.3.251 is accessible from the host and let’s confirm that 10.0.5.5 is NOT:

vagrant@precise64:~$ ping -c 3 10.0.3.251
PING 10.0.3.251 (10.0.3.251) 56(84) bytes of data.
64 bytes from 10.0.3.251: icmp_req=1 ttl=64 time=0.078 ms
64 bytes from 10.0.3.251: icmp_req=2 ttl=64 time=0.057 ms
64 bytes from 10.0.3.251: icmp_req=3 ttl=64 time=0.073 ms

--- 10.0.3.251 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.057/0.069/0.078/0.011 ms
vagrant@precise64:~$ ping -c 3 10.0.5.5
PING 10.0.5.5 (10.0.5.5) 56(84) bytes of data.

--- 10.0.5.5 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1998ms

vagrant@precise64:~$

Brilliant! Now, let’s ssh to the dmzmaster01 container, verify that it has 2 separate network interfaces with the expected IP addresses assigned, and see if the macvlanbridge01 and macvlanbridge02 containers are accessible from it:

vagrant@precise64:~$ ssh ubuntu@10.0.3.251
The authenticity of host '10.0.3.251 (10.0.3.251)' can't be established.
ECDSA key fingerprint is 87:01:4e:04:51:3e:db:98:71:e2:3b:c5:59:fd:1b:51.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.3.251' (ECDSA) to the list of known hosts.
ubuntu@10.0.3.251's password:
ubuntu@dmzmaster01:~$ sudo ip address list
18: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether ce:b6:ff:7d:8a:23 brd ff:ff:ff:ff:ff:ff
    inet 10.0.3.251/24 brd 10.0.3.255 scope global eth0
    inet6 fe80::ccb6:ffff:fe7d:8a23/64 scope link
       valid_lft forever preferred_lft forever

20: eth1@if13: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 2e:e2:37:06:56:b9 brd ff:ff:ff:ff:ff:ff
    inet 10.0.5.5/24 brd 10.0.5.255 scope global eth1
    inet6 fe80::2ce2:37ff:fe06:56b9/64 scope link
       valid_lft forever preferred_lft forever

21: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
ubuntu@dmzmaster01:~$ 

And the container-connectivity test:

ubuntu@dmzmaster01:~$ ping -c 2 10.0.5.3
PING 10.0.5.3 (10.0.5.3) 56(84) bytes of data.
64 bytes from 10.0.5.3: icmp_req=1 ttl=64 time=0.116 ms
64 bytes from 10.0.5.3: icmp_req=2 ttl=64 time=0.061 ms

--- 10.0.5.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.061/0.088/0.116/0.029 ms
ubuntu@dmzmaster01:~$ ping -c 2 10.0.5.4
PING 10.0.5.4 (10.0.5.4) 56(84) bytes of data.
64 bytes from 10.0.5.4: icmp_req=1 ttl=64 time=0.107 ms
64 bytes from 10.0.5.4: icmp_req=2 ttl=64 time=0.063 ms

--- 10.0.5.4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.063/0.085/0.107/0.022 ms
ubuntu@dmzmaster01:~$ 

With the last test we have concluded our walk through the LXC MAC VLAN networking mode, with a practical example of how to use it to create a simple DMZ network. As with the other examples in this guide, the possibilities are almost endless. We could, for example, create a separate private PostgreSQL replication VLAN to avoid interfering with the application VLAN, or a private VLAN for corosync intra-cluster communication, etc. I hope these examples will inspire your creativity and I’m looking forward to hearing the use cases you come up with - leave them in the comments.

Before we close this subchapter on MAC VLAN, it’s worth mentioning that there is a way to reach the MAC VLAN containers from the host: by creating a new macvlan network interface linked to the same bridge the macvlan containers were created on earlier. The new interface will operate in the same macvlan mode as the container ones, i.e. bridge mode. Here is how we would go about it.

We will create a new macvlan interface and link it to the same network interface the macvlan containers are linked against:

vagrant@precise64:~$ sudo ip link add link lxcbr1 macvlancont01 type macvlan mode bridge
vagrant@precise64:~$ sudo ip link list macvlancont01
15: macvlancont01@lxcbr1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether 36:54:11:b9:9d:f7 brd ff:ff:ff:ff:ff:ff
vagrant@precise64:~$

Now let’s assign it an IP address from the same IP range as the macvlan containers are in and bring the interface up:

vagrant@precise64:~$ sudo ip address add 10.0.5.6/24 dev macvlancont01
vagrant@precise64:~$ sudo ip link set dev macvlancont01 up
vagrant@precise64:~$ sudo ip addr list macvlancont01
15: macvlancont01@lxcbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 36:54:11:b9:9d:f7 brd ff:ff:ff:ff:ff:ff
    inet 10.0.5.6/24 scope global macvlancont01
    inet6 fe80::3454:11ff:feb9:9df7/64 scope link
       valid_lft forever preferred_lft forever
vagrant@precise64:~$

We should now be able to reach the macvlan containers FROM THE HOST. So let’s verify that claim:

vagrant@precise64:~$ ping -c 2 10.0.5.3
PING 10.0.5.3 (10.0.5.3) 56(84) bytes of data.
64 bytes from 10.0.5.3: icmp_req=1 ttl=64 time=0.092 ms
64 bytes from 10.0.5.3: icmp_req=2 ttl=64 time=0.066 ms

--- 10.0.5.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.066/0.079/0.092/0.013 ms
vagrant@precise64:~$ ping -c 2 10.0.5.4
PING 10.0.5.4 (10.0.5.4) 56(84) bytes of data.
64 bytes from 10.0.5.4: icmp_req=1 ttl=64 time=0.097 ms
64 bytes from 10.0.5.4: icmp_req=2 ttl=64 time=0.056 ms

--- 10.0.5.4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.056/0.076/0.097/0.022 ms
vagrant@precise64:~$

Boom! All works as expected. With this last test we can conclude this subchapter: to reach macvlan containers from the host, we can either build a DMZ container or simply create a new macvlan interface linked to the same master interface the containers are linked with.

Vlan

I’m not going to explain how Linux VLANs work, as there is plenty of really good material available online already, so I’ll jump straight to the LXC documentation. The LXC vlan network type is described as an interface which is linked with the interface specified by lxc.network.link and which is then assigned to the container. The new interface “tags” the packets with the VLAN tag specified via the lxc.network.vlan.id configuration directive. In other words, the vlan network type lets you create a regular VLAN interface in the container’s network namespace.

If you wanted to create your own VLAN container which would tag ethernet packets with its VLAN id, your container configuration would look something like this:

vagrant@precise64:~$ grep network config
lxc.network.type = vlan
lxc.network.flags = up
lxc.network.link = eth0
lxc.network.vlan.id = 10
lxc.network.hwaddr = 00:16:3e:fd:84:da
vagrant@precise64:~$

You could now run our well known lxc-start command and start a new LXC container as per the above configuration. What I’d like to talk about, though, is what is actually happening in the background. You could translate the network part of the container creation process into a concrete set of commands as follows.

First, you create a vlan interface on the host and link it against an existing host interface - I’ve picked the eth0 interface for our tests:

vagrant@precise64:~$ sudo ip link add name vlan20 link eth0 type vlan id 20
vagrant@precise64:~$ sudo ip link list vlan20
17: vlan20@eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether 08:00:27:88:0c:a6 brd ff:ff:ff:ff:ff:ff

The above command has created a new interface which will tag the outgoing traffic with a VLAN id of 20. You can verify that the newly created interface really is a VLAN interface with the right VLAN id:

vagrant@precise64:~$ ip -d link show vlan20
17: vlan20@eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether 08:00:27:88:0c:a6 brd ff:ff:ff:ff:ff:ff
    vlan id 20 <REORDER_HDR>
vagrant@precise64:~$

Now that the new interface exists, it needs to be sent into a particular container. So let’s do it: let’s send it to our empty01 container and verify afterwards that we have succeeded:

vagrant@precise64:~$ ps faux|grep -A 1 empty01
root     21269  0.0  0.4  40760  1512 ?        Ss   13:49   0:00 lxc-start -n empty01 -d
root     21275  0.0  0.5  24200  2056 ?        Ss   13:49   0:00  \_ /sbin/init
vagrant@precise64:~$
vagrant@precise64:~$ sudo ip link set vlan20 netns 21275
vagrant@precise64:~$ ip l l vlan20
Device "vlan20" does not exist.
vagrant@precise64:~$

The network interface no longer exists on the host. Let’s see if it really was moved into the said container. Let’s fire up lxc-console and check the links inside the container:

ubuntu@empty01:~$ ip link list
9: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
17: vlan20@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether 08:00:27:88:0c:a6 brd ff:ff:ff:ff:ff:ff
ubuntu@empty01:~$

Boom! Our VLAN interface has indeed been moved into the empty01 container as expected. The link annotation has changed from the original vlan20@eth0 into vlan20@if2, which essentially means that the vlan20 network interface is linked with the host’s network interface whose index is 2. Whenever you create a new network interface it is assigned an index, and in this case eth0 has been assigned an index of 2 on the host.
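
If you want to double-check an interface’s index, here is a quick sketch (assuming the host still has an interface called eth0; the leading number in the ip output, and the content of the sysfs file, is the index):

ip link show eth0
cat /sys/class/net/eth0/ifindex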

So what are VLAN containers useful for? Their usefulness mostly comes into play when you start scaling your container infrastructure whilst keeping your network segmentation in place for various reasons such as security, bandwidth etc. You can “hide” your container interface behind a physical network interface on the host which is plugged into a physical switch. And obviously you can use them to simulate various VLAN setups.

One really interesting use case I could come up with: say you have a switch with 12 ports. Ports 1-11 are allocated to VLANs 1-11 and port 12 is a trunk port. You plug your host interface into the trunk port and create 11 containers linked to that interface - each with a different VLAN id. Each container on this host now sits behind a physical interface and can communicate on its particular VLAN. Pretty cool! There is a bit of a catch though - the trunk port can become a bandwidth bottleneck.

There is a subtle thing in the example I gave above - you can not have more than 1 container with the same VLAN id linked to the same physical network interface. This is logical given how VLANs work. If you want to have several containers on the same VLAN linked to one interface, there is a solution for that - it’s called bridging!

So what do you do if you want a setup like that? You make a bridge, let’s call it br10 for the VLAN with id 10, and then create a vlan interface vlan10 like so:

vagrant@precise64:~$ sudo ip link add name vlan10 link eth0 type vlan id 10

Then you attach the vlan10 device to br10 and use the br10 bridge for the containers on the VLAN with tag 10. These containers are created with the veth network type and bridged against br10. Simples!
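
Put together, a hedged sketch of that setup might look like this (untested, and the br10/vlan10 names are just examples):

# bridge dedicated to VLAN 10, with the tagged sub-interface attached to it
sudo brctl addbr br10
sudo ip link add name vlan10 link eth0 type vlan id 10
sudo brctl addif br10 vlan10
sudo ip link set vlan10 up
sudo ip link set br10 up

# container network config snippet - a plain veth bridged against br10
# lxc.network.type = veth
# lxc.network.flags = up
# lxc.network.link = br10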

This concludes the subchapter on the vlan network type. If you have any other cool use cases, I want to hear about them!

Phys

Finally, we have reached the last available network type in the LXC configuration - phys. In my opinion this network type is the least complicated to understand. So what does the LXC documentation say about the phys network type?

an already existing interface specified by the lxc.network.link is assigned to the container

What this means in practice - in very simplified terms - is that you basically rip the physical network card out of the VirtualBox VM (VirtualBox is what Vagrant uses to run the virtual machines this guide is performed on), or whatever other host you’re running this guide on, and plug it into the container.

The interface “moves” from one network namespace to another, but should remain accessible on the same network the original interface existed on. In other words, you have isolated a physical interface on the host (a new network stack has been created in the container) in a similar way to what we were doing at the end of the veth subchapter when we pretended to be network experts.
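
For the curious, a rough sketch of what the phys type does for you under the hood - the same netns trick we used for vlan20 earlier. Do not run this against an interface you still need on the host; the PID and interface name below are placeholders:

CONTAINER_PID=21275   # find the container's init PID with: ps faux | grep lxc-start
sudo ip link set eth2 netns "$CONTAINER_PID"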

Let’s get back to the Vagrantfile - the reason I specified the :public_network config directive in it was so that Vagrant creates a public network interface; “public” meaning the IP address assigned to this interface is on the same network as the laptop on which I’m running the vagrant commands.

By applying the above theory and moving the “public” interface into the container, we should be able to access the container directly from our workstations, or from ANY other host on the same network the public interface has been created on. Let’s see if we can back our theory up in practice! Let’s create a new container and configure it to use the phys network type.

A couple of points about the LXC network configuration before we proceed. MAKE SURE that lxc.network.hwaddr is the same as the original MAC address created by Vagrant - I came across some ARP madness when I didn’t reuse the original MAC address and a new one was generated randomly by the lxc-create command. You can find the MAC address of the public interface by running sudo ip link list eth2 on the host:

vagrant@precise64:~$ sudo ip link list eth2
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:f9:8f:2e brd ff:ff:ff:ff:ff:ff
vagrant@precise64:~$

Now that you know the MAC address (the link/ether field in the output above) of the physical device, you can proceed with the container’s network configuration. Notice that you MUST specify the correct gateway - it’s the default gateway of the network on which the original public interface was created. In my case, the public interface is on the same network as my laptop, so a simple check of Network Preferences gives me 192.168.1.254.

Cool, we have all the tools, now we can start doing some work! Let’s create the container and modify the network configuration as below:

vagrant@precise64:~$ sudo lxc-create -t ubuntu -n phys01 -- -r precise
vagrant@precise64:~$ sudo grep network /var/lib/lxc/phys01/config
lxc.network.type = phys
lxc.network.hwaddr = 08:00:27:f9:8f:2e
lxc.network.flags = up
lxc.network.link = eth2
lxc.network.ipv4 = 192.168.1.135/24
lxc.network.ipv4.gateway = 192.168.1.254
vagrant@precise64:~$

Now, let’s start the container and log on to it via lxc-console:

vagrant@precise64:~$ sudo lxc-start -n phys01 -d
vagrant@precise64:~$ sudo lxc-ls --fancy
NAME             STATE    IPV4                  IPV6  AUTOSTART
---------------------------------------------------------------
dmzmaster01      RUNNING  10.0.3.251, 10.0.5.5  -     NO
empty01          RUNNING  -                     -     NO
macvlanbridge01  RUNNING  10.0.5.3              -     NO
macvlanbridge02  RUNNING  10.0.5.4              -     NO
phys01           RUNNING  192.168.1.135         -     NO
veth01           RUNNING  10.0.3.118            -     NO
vagrant@precise64:~$ sudo lxc-console -n phys01 -t 2

From the above, we can see that the phys01 container has been assigned the correct IP. If our theory is correct and the physical network interface has been “ripped off” and “moved” inside the container, we should no longer see it on the host. So let’s have a look and check whether that’s the case:

vagrant@precise64:~$ sudo ip link list eth2
Device "eth2" does not exist.
vagrant@precise64:~$

Bingo! It looks like the interface no longer exists on the Host. Let’s have a look at whether everything looks good inside the container. We need to check the routing and whether we can ping the default gateway. Let’s fire up our well known lxc-console command and check the container’s network setup:

ubuntu@phys01:~$ sudo ip address list
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:f9:8f:2e brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.135/24 brd 192.168.1.255 scope global eth2
    inet6 fe80::a00:27ff:fef9:8f2e/64 scope link
       valid_lft forever preferred_lft forever

22: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
ubuntu@phys01:~$
ubuntu@phys01:~$ sudo ip route show
default via 192.168.1.254 dev eth2
192.168.1.0/24 dev eth2  proto kernel  scope link  src 192.168.1.135
ubuntu@phys01:~$ 
ubuntu@phys01:~$ ping -c 2 192.168.1.254
PING 192.168.1.254 (192.168.1.254) 56(84) bytes of data.
64 bytes from 192.168.1.254: icmp_req=1 ttl=64 time=260 ms
64 bytes from 192.168.1.254: icmp_req=2 ttl=64 time=1.78 ms

--- 192.168.1.254 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1003ms
rtt min/avg/max/mdev = 1.785/131.265/260.745/129.480 ms
ubuntu@phys01:~$

Brilliant! A route to the default gateway has been created and we can successfully ping it. That means our packets should be routable across the whole 192.168.1.0/24 network! Which also means - in my case - that I should be able to:

  • ping my laptop from inside the container
  • ping the container’s IP from my laptop
  • ssh into the container straight from my laptop, given that SSH is running in the container on the publicly accessible interface we have just moved into it, and that access to the listening IP and ssh port is allowed

So let’s verify these claims. The IP address of my laptop is:

milos@dingops:~/Vagrant/blogpost$ ifconfig |grep -A 5 en0|grep inet
  inet6 fe80::7ed1:c3ff:fef4:da13%en0 prefixlen 64 scopeid 0x5
  inet 192.168.1.116 netmask 0xffffff00 broadcast 192.168.1.255
milos@dingops:~/Vagrant/blogpost$

Let’s ping it from inside the container:

ubuntu@phys01:~$ ping -c 3 192.168.1.116
PING 192.168.1.116 (192.168.1.116) 56(84) bytes of data.
64 bytes from 192.168.1.116: icmp_req=1 ttl=64 time=0.170 ms
64 bytes from 192.168.1.116: icmp_req=2 ttl=64 time=0.245 ms
64 bytes from 192.168.1.116: icmp_req=3 ttl=64 time=0.464 ms

--- 192.168.1.116 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.170/0.293/0.464/0.124 ms
ubuntu@phys01:~$

Perfect! Let’s try to ping the container from my laptop. As we know, the container’s IP address is 192.168.1.135:

milos@dingops:~/Vagrant/blogpost$ ping -c 3 192.168.1.135
PING 192.168.1.135 (192.168.1.135): 56 data bytes
64 bytes from 192.168.1.135: icmp_seq=0 ttl=64 time=0.535 ms
64 bytes from 192.168.1.135: icmp_seq=1 ttl=64 time=0.471 ms
64 bytes from 192.168.1.135: icmp_seq=2 ttl=64 time=0.289 ms

--- 192.168.1.135 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.289/0.432/0.535/0.104 ms
milos@dingops:~/Vagrant/blogpost$

Awesome! Now, I have installed ssh in the container and it’s listening on all of the container’s interfaces. Also, there are no iptables rules present in the container, so I should be able to ssh to it from anywhere on the same network:

ubuntu@phys01:~$ sudo netstat -ntlp|grep 22
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      140/sshd
tcp6       0      0 :::22                   :::*                    LISTEN      140/sshd
ubuntu@phys01:~$ 

Let’s try to SSH into the container directly from my laptop:

milos@dingops:~/Vagrant/blogpost$ ssh ubuntu@192.168.1.135
The authenticity of host '192.168.1.135 (192.168.1.135)' can't be established.
RSA key fingerprint is de:03:8a:23:df:10:56:bc:77:1b:8e:4e:d0:13:ab:97.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.1.135' (RSA) to the list of known hosts.
ubuntu@192.168.1.135's password:
Welcome to Ubuntu 12.04.3 LTS (GNU/Linux 3.2.0-23-generic x86_64)

 * Documentation:  https://help.ubuntu.com/
Last login: Mon Nov 11 00:01:40 2013
ubuntu@phys01:~$ 

Excellent! With the last example we have now concluded our walk through the phys network configuration! Let’s move on to the “extra” one, which the LXC configuration calls none.

None

The none networking type is a bit tricky and can get you into “trouble” if you’re not careful. It certainly got me into some interesting, or rather unexpected, situations whilst working on this subchapter, until I learnt what was actually happening. In short, this networking type makes the container share the host’s network namespace. This means that the host’s network devices are usable in the container, as the container is not running in its own private network namespace.

Let’s fire up a new container with network type set to none and have a poke around.

vagrant@precise64:~$ sudo lxc-create -t ubuntu -n none01 -- -r precise

The network configuration should look something like this:

vagrant@precise64:~$ sudo grep network /var/lib/lxc/none01/config
lxc.network.type = none
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:0c:42:31
vagrant@precise64:~$

Let’s start the container and log on to it:

vagrant@precise64:~$ sudo lxc-ls --fancy
NAME             STATE    IPV4                  IPV6  AUTOSTART
---------------------------------------------------------------
dmzmaster01      RUNNING  10.0.3.251, 10.0.5.5  -     NO
empty01          RUNNING  -                     -     NO
macvlanbridge01  RUNNING  10.0.5.3              -     NO
macvlanbridge02  RUNNING  10.0.5.4              -     NO
phys01           RUNNING  192.168.1.135         -     NO
veth01           RUNNING  10.0.3.118            -     NO
none01           RUNNING  10.0.2.15, 10.0.3.1, 10.0.4.2, 10.0.5.6, 192.168.1.138  -     NO
vagrant@precise64:~$

Boom! The above list shows exactly what we would expect to see for the none01 container: all the IP addresses assigned on the host are available inside the container. To be absolutely sure, we can try to lxc-console into the container and make sure the output of the lxc-ls command above is correct:

vagrant@precise64:~$ sudo lxc-console -n none01
Connected to tty 1
Type <Ctrl+a q> to exit the console, <Ctrl+a Ctrl+a> to enter Ctrl+a itself

Uhh, what is going on? We are not getting a login prompt from the container like we would normally expect! Further investigation uncovers that getty didn’t get spawned. I was a bit clueless about why this was happening, so I turned back to the Ubuntu LXC server guide for some answers.

It turned out that this problem is related to running an Ubuntu container on an Ubuntu host, or more specifically an upstart-based container on an upstart-based host. Go figure! Like I said, the awesome Ubuntu server guide actually does mention this, so do read it carefully and don’t be surprised like I was.

Now, to get back to what is actually happening. Upstart uses an abstract unix socket for communication between processes and the init daemon. Because it is an abstract socket, the path still clashes even though the container doesn’t share the same filesystem as the host, so upstart in the container can’t bind it, and anything running in the container talks straight to the host’s init instead. The effect is that some things don’t get spawned, like getty in our case. What is worse, shutting down the container will likely shut down your machine too. Yes, that’s correct - shutting down a container set up with the none network type can bring the whole host down. Just try to run lxc-stop -n none01 in the vagrant VM and you will see your vagrant machine halt.
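
If you want to see the clashing socket for yourself, abstract unix sockets can be listed from the host - they show up with a leading “@” because they live in the network namespace rather than on the filesystem. A hedged sketch (exact socket names and output format vary between releases):

grep '@' /proc/net/unix
sudo ss -xl | grep '@'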

So be careful when you decide to use this network type in your container setup. I could not think of any practical use cases for it, so I turned to the lxcontainers channel on IRC to ask whether people are using this networking type and how. Stéphane Graber, who works for Ubuntu and is one of the LXC core developers, mentioned that they use the none network type on the Ubuntu phone to have android (running in a container) be the one bringing up networking for the host.

If you have any interesting use cases, do let me know!

Conclusion

I hope that this blog post helped you understand LXC networking at least a little bit more, and that it will also motivate you to explore the LXC world even further, including the technologies built on top of it such as Docker. If this blog post got you interested in the topic, I would recommend checking out pipework, an awesome tool written by Docker guru Jérôme Petazzoni which automates the creation of extra interfaces for containers (for “raw” LXC as well as Docker containers).

Linux containers have been here for a long time and they are here to stay - in one form or another. The possibilities they offer are endless. Recently they have mostly been spoken about as the future of software/application delivery. However, we should not stop there!

The grumpy ops guy in me sees containers not only as an application delivery tool but also as a missing piece of infrastructure delivery. We can test our Chef cookbooks, Puppet modules or what have you, and we can model networks just like this blog post showed - we can literally model the full stack infrastructure with unquestionable ease and speed on a small budget. Recently I attended DevOps Days London, where John Willis gave a presentation about Software Defined Networking. LXC can be an awesome helping piece in what John suggests is the next step in infrastructure-as-code.

So, to sum up: let’s not be afraid of containers; let’s embrace them and understand them, and I’m confident this will pay off in the long term. Hopefully this site will help us on the road to container love….