Go Contain Me (SREcon Americas 2018)

My first day at SREcon Americas 2018 was very exciting and inspiring. It started with the Containers from Scratch workshop by Avishai Ish-Shalom and Nati Cohen. They developed a syscall-level workshop about Linux containers that I can highly recommend. It walks through a program containing and isolating itself step by step using Linux system calls. In the end, the program forks to drop the last bit of privilege that is left. That was super fun, although the network namespaces gave me a hard time due to a silly implementation mistake I made.

The workshop code is in Python and worked perfectly. However, I wanted to improve my Golang skills, so I decided to redo the assignments from the workshop in Golang. Shouldn’t be too hard, should it? After all, Docker is written in Golang.

This is my write-up of the endeavor.

General Idea

Similar to other popular container solutions, our little program should:

  • Use the host’s Kernel space
  • Have its own userspace binaries (e.g. busybox)
  • Have a unique ID
  • Use cgroups to limit its own resource usage
  • Use an overlay filesystem to avoid messing with the userspace binaries that may be used by multiple containers at the same time
  • Use Linux namespacing for mounts, processes, network, and UNIX time-sharing (uhm, the last one is some historical thingy)
  • Make special devices, such as /dev/null and /dev/urandom available inside the container
  • Overload the currently running binary with a binary from inside the container. And run it.

Spoiler alert: We will not make all of this work. The process namespace gave me a hard time because I was restricting myself to state-of-the-art standard libraries and avoiding custom CGO code. Unfortunately, the syscall package is not considered state-of-the-art anymore, so I had to avoid using it. 😔

Who Am I?

Our container will start off as a simple process that will slowly isolate itself from the current environment. First, we want to know who we are, so we fetch the process ID (PID).

pid := unix.Getpid()
log.Printf("pid: %v", pid)

We also want a unique identifier for our container. Process IDs are limited and might eventually be re-assigned. So let’s grab a UUID and use that.

id := uuid.New().String()
log.Printf("container id: %v", id)

We can use the container ID to name directories in a non-conflicting way later. It is highly unlikely that any two UUIDs collide in an environment like ours.

Building Fences

Next thing we want to do is to build a fence around our process. We do this for CPU and memory using Linux cgroups. For this we want to write our PID to /sys/fs/cgroup/cpu/go-contain-me/<UUID>/tasks and write the number of CPU shares we want to grant the process into /sys/fs/cgroup/cpu/go-contain-me/<UUID>/cpu.shares.

cgroupCPU := "/sys/fs/cgroup/cpu/go-contain-me/" + id + "/"
log.Println("cpu cgroup: create")
err = os.MkdirAll(cgroupCPU, 0744)
if err != nil {
    log.Fatal(err)
}
log.Println("cpu cgroup: add pid")
err = ioutil.WriteFile(cgroupCPU+"tasks", []byte(strconv.Itoa(pid)), 0644)
if err != nil {
    log.Fatal(err)
}
if len(*cpuShares) > 0 {
    log.Println("cpu cgroup: set shares")
    err := ioutil.WriteFile(cgroupCPU+"cpu.shares",
        []byte(*cpuShares), 0644)
    if err != nil {
        log.Fatal(err)
    }
}

For the memory cgroup we do something similar, yet there is a small difference. We are not supposed to set shares here, but the actual number of bytes we want the process to be limited to. We can also limit the number of swap bytes the process is allowed to consume.

cgroupMemory := "/sys/fs/cgroup/memory/go-contain-me/" + id + "/"
log.Println("memory cgroup: create")
err = os.MkdirAll(cgroupMemory, 0644)
if err != nil {
    log.Fatal(err)
}
log.Println("memory cgroup: add pid")
err = ioutil.WriteFile(cgroupMemory+"tasks",
    []byte(strconv.Itoa(pid)), 0644)
if err != nil {
    log.Fatal(err)
}
if len(*memoryLimit) > 0 {
    log.Println("memory cgroup: set memory limit")
    err := ioutil.WriteFile(cgroupMemory+"memory.limit_in_bytes",
        []byte(*memoryLimit), 0644)
    if err != nil {
        log.Fatal(err)
    }
}
if len(*swapLimit) > 0 {
    log.Println("memory cgroup: set swap limit")
    err := ioutil.WriteFile(cgroupMemory+"memory.memsw.limit_in_bytes",
        []byte(*swapLimit), 0644)
    if err != nil {
        log.Fatal(err)
    }
}

Great, now we have contained ourselves in terms of resource usage.

Overlay Root File System

Now we shall isolate our process further from the host system step by step. Let’s assume we have an extracted userspace image at /root/go-contain-me/images/busybox. You can use docker export (run against a container created from the busybox image) to get your hands on one quickly if needed. 🐳

As multiple containers might be using the same underlying image, we have to make sure we do not write to the image data. However, we still want to be able to make changes to the data, such as adding, modifying, or removing files. But how? The overlay filesystem comes to the rescue! As the name suggests, we can overlay something called an upperdir onto a lowerdir. We would also need a workdir where we store some copy-on-write information, e.g. for files that have been deleted during operation.

So the first order of business for overlaying a filesystem is to make sure all the required directories exist:

newRoot := baseDir + "/containers/" + id + "/rootfs"
workDir := baseDir + "/containers/" + id + "/workdir"
for _, path := range []string{newRoot, workDir} {
    err = os.MkdirAll(path, os.ModePerm)
    if err != nil {
        log.Fatal(err)
    }
}

After that the whole operation is just a regular mount with a less often seen set of options:

log.Printf("mount: overlay")
imageRoot := baseDir + "/images/" + *image
err = unix.Mount("overlay", newRoot, "overlay", uintptr(unix.MS_NODEV),
    "lowerdir="+imageRoot+",upperdir="+newRoot+",workdir="+workDir)
if err != nil {
    log.Fatal(err)
}

The MS_NODEV flag, by the way, prevents special files (devices) from being accessed on this filesystem. We will create those later using the mknod system call.

Moving To A New Mount Namespace

Right now, our mounts show up on the host and pollute it a bit. Luckily, we can isolate our mounts from the host mounts by creating a new namespace for our process (at some point we can start calling the process a container).

log.Printf("newns: mount")
err = unix.Unshare(unix.CLONE_NEWNS)
if err != nil {
    log.Fatal(err)
}

Now we change the propagation type of the root mount in our new namespace so that mount events no longer leak between it and the host:

log.Printf("remount: /")
err = unix.Mount("", "/", "", uintptr(unix.MS_PRIVATE|unix.MS_REC), "")
if err != nil {
    log.Fatal(err)
}

We are using flags again:

  • MS_PRIVATE makes sure that mount and unmount events do not propagate into or out of this mount point.
  • MS_REC means that the flags it accompanies are applied recursively to all mount points under this one.

Special Mounts

Now that we have our isolated mount namespace, it is time to mount some special filesystems there. Let’s use a for loop to avoid writing the same code over and over again.

mounts := []struct {
    source  string
    target  string
    fsType  string
    flags   uint
    options string
}{
    {source: "proc", target: newRoot + "/proc", fsType: "proc"},
    {source: "sysfs", target: newRoot + "/sys", fsType: "sysfs"},
    {
        source:  "tmpfs",
        target:  newRoot + "/dev",
        fsType:  "tmpfs",
        flags:   unix.MS_NOSUID | unix.MS_STRICTATIME,
        options: "mode=755",
    },
    {
        source: "devpts",
        target: newRoot + "/dev/pts",
        fsType: "devpts",
    },
}
for _, mnt := range mounts {
    // ensure mount target exists
    log.Printf("mkdirall: %v", mnt.target)
    err := os.MkdirAll(mnt.target, os.ModePerm)
    if err != nil {
        log.Fatal(err)
    }

    // mount
    log.Printf("mount: %v (%v)", mnt.source, mnt.fsType)
    flags := uintptr(mnt.flags)
    err = unix.Mount(mnt.source, mnt.target, mnt.fsType, flags, mnt.options)
    if err != nil {
        log.Fatal(err)
    }
}

This should leave us with the most important filesystems in place under our new root. Remember, our new root is still seen as /root/go-contain-me/containers/<UUID>/rootfs. But that is going to change soon.

Essential File Descriptors

We will soon pivot the process’ root to use our container’s rootfs directory as root. See how I just used container now instead of process? This was totally arbitrary. 🙃 But before we lose access to the current filesystem tree, let’s rescue essential file descriptors such as stdin and stdout. Without them functioning, we would not have much fun with our container.

A simple symlink() does the job:

for i, name := range []string{"stdin", "stdout", "stderr"} {
    source := "/proc/self/fd/" + strconv.Itoa(i))
    target := newRoot + "/dev/" + name
    log.Printf("symlink: %v", name)
    err := unix.Symlink(source, target)
    if err != nil {
        log.Fatal(err)
    }
}

Creating Devices

Processes running inside our container may assume that a certain set of special devices is present. One popular example being /dev/null, which is often used to drop data streams into Nirvana. If /dev/null weren’t present those data streams may end up in a regular file. This could, in turn, quickly fill up the filesystem. If there are no quotas on the container’s filesystem, this might affect the host’s filesystem as well. Not cool.

We’ll use the loop approach one more time here:

devices := []struct {
    name  string
    attr  uint32
    major uint32
    minor uint32
}{
    {name: "null", attr: 0666 | unix.S_IFCHR, major: 1, minor: 3},
    {name: "zero", attr: 0666 | unix.S_IFCHR, major: 1, minor: 3},
    {name: "random", attr: 0666 | unix.S_IFCHR, major: 1, minor: 8},
    {name: "urandom", attr: 0666 | unix.S_IFCHR, major: 1, minor: 9},
    {name: "console", attr: 0666 | unix.S_IFCHR, major: 136, minor: 1},
    {name: "tty", attr: 0666 | unix.S_IFCHR, major: 5, minor: 0},
    {name: "full", attr: 0666 | unix.S_IFCHR, major: 1, minor: 7},
}
for _, dev := range devices {
    dt := int(unix.Mkdev(dev.major, dev.minor))
    log.Printf("mknod: %v (%v)", dev.name, dt)
    err := unix.Mknod(newRoot+"/dev/"+dev.name, dev.attr, dt)
    if err != nil {
        log.Fatal(err)
    }
}

Isolate The UNIX Time-Sharing Namespace

We are coming closer to pivoting the root. I promise. However, there are still a few more isolation steps to take. For example, we want the hostname of the container to be isolated from the hostname of the host. One might expect this to fall under the domain of the network namespace. Surprisingly, that is not the case. For historical reasons, the namespace for this is the UNIX Time-Sharing namespace, or UTS for short.

So let’s unshare() this one before setting the hostname:

log.Printf("newns: UNIX time sharing")
err = unix.Unshare(unix.CLONE_NEWUTS)
if err != nil {
    log.Fatal(err)
}
// change hostname in new UTS
log.Printf("set hostname")
err = unix.Sethostname([]byte(id))
if err != nil {
    log.Fatal(err)
}

Isolate The Process Namespace (b0rked)

We also want to isolate the container’s process namespace from the host. Meaning that if we run ps in the container, we don’t want to see the processes of the host.

Note: I was not able to get this one to work. The code compiles, the code runs, but then the contained processes run out of memory real quick, despite a generous memory cgroup setting. I did not invest much time in debugging this. Feel free to drop me a line if you happen to know what the problem is. 🤓

For the sake of completeness, here is my code:

log.Printf("newns: processes")
err = unix.Unshare(unix.CLONE_NEWPID)
if err != nil {
    log.Fatal(err)
}

Isolating The Network

For the network namespace, we make another call to unshare(). This will give us a new namespace that contains only a loopback interface. Clean and lean!

log.Printf("newns: network")
err = unix.Unshare(unix.CLONE_NEWNET)
if err != nil {
    log.Fatal(err)
}

If you’d like to dig deeper into network namespacing: try ip netns help for a start, and don’t forget to link the namespace to the container’s default namespace before unsharing!
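
As a purely illustrative sketch (my assumption, not code from the workshop or from the program above): a small helper running on the host could make the container’s network namespace visible to the ip tool by bind-mounting /proc/<pid>/ns/net under /var/run/netns. The containerPID and containerID variables here are hypothetical placeholders.

// Hedged sketch: expose the container's network namespace as
// /var/run/netns/<containerID> so the host's `ip netns` can manage it,
// e.g. `ip link set veth1 netns <containerID>`.
nsDir := "/var/run/netns"
nsPath := nsDir + "/" + containerID
if err := os.MkdirAll(nsDir, 0755); err != nil {
    log.Fatal(err)
}
// The bind mount target must exist as a regular file.
f, err := os.Create(nsPath)
if err != nil {
    log.Fatal(err)
}
f.Close()
// Bind-mount the container's network namespace onto that file.
src := "/proc/" + strconv.Itoa(containerPID) + "/ns/net"
if err := unix.Mount(src, nsPath, "", unix.MS_BIND, ""); err != nil {
    log.Fatal(err)
}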

Pivoting

Phew. That was a long journey. Now we can pivot the root! Hooray! The operation looks more complicated than it is. Basically, we just do the following things:

  • Create a directory named .old-root. This is where the kernel will mount the old root after pivoting.
  • Pivot (obviously)
  • Change directory to /.
  • Unmount the old root.
  • Remove the old root directory created in step one.
log.Printf("pivot root")
oldRootBeforePivot := newRoot + "/.old-root"
oldRootAfterPivot := "/.old-root"
err = os.MkdirAll(oldRootBeforePivot, os.ModePerm)
if err != nil {
    log.Fatalf("mkdirall old root: %v", err)
}

err = unix.PivotRoot(newRoot, oldRootBeforePivot)
if err != nil {
    log.Fatalf("pivot root: %v", err)
}
err = unix.Chdir("/")
if err != nil {
    log.Fatalf("chdir: %v", err)
}
err = unix.Unmount(oldRootAfterPivot, unix.MNT_DETACH)
if err != nil {
    log.Fatalf("unmount old root: %v", err)
}
err = unix.Rmdir(oldRootAfterPivot)
if err != nil {
    log.Fatalf("rmdir old root: %v", err)
}

The Finale

Hold your breath, now comes the final operation before we fully enter container land! We overload the process with the new binary to run. Here we are using sh to get a shell we can interact with.

Ideally, we would do this in a child process after fork() or clone(), but it turns out that forking isn’t a great idea in Golang. I’ll spare you the details, but there are plenty of discussions about this at the usual places.

err = unix.Exec("/bin/sh", []string{"sh"}, []string{})
log.Fatal(err)

Ideally, the line reading log.Fatal(err) is never reached.
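
As an aside, the usual Go workaround for the missing fork() is to re-execute the current binary as a child process via /proc/self/exe and let the runtime place that child into the new namespaces. The following is only a hedged sketch of that idea, not how the program above works; it relies on the syscall package’s SysProcAttr, which I wanted to avoid in this experiment.

// Hedged sketch: instead of unshare() followed by Exec(), start a child
// copy of ourselves inside fresh namespaces. The child (invoked with the
// hypothetical "child" argument) would then do the mounts, pivot_root,
// and the final exec.
cmd := exec.Command("/proc/self/exe", "child")
cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
}
if err := cmd.Run(); err != nil {
    log.Fatal(err)
}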

Running It!

It’s time to run this thing! Do yourself a favor and run this in a virtual machine. The code is not free of risk and could force you to reboot in case something goes wrong. And we don’t reboot our computers anymore nowadays, do we? 😂

# ./go-contain-me
2018/03/29 04:03:46 pid: 1054
2018/03/29 04:03:46 container id: c16f889c-6a49-49a4-bbb0-add1094993c5
2018/03/29 04:03:46 cpu cgroup: create
2018/03/29 04:03:46 cpu cgroup: add pid
2018/03/29 04:03:46 memory cgroup: create
2018/03/29 04:03:46 memory cgroup: add pid
2018/03/29 04:03:46 memory cgroup: set memory limit
2018/03/29 04:03:46 mount: overlay
2018/03/29 04:03:46 newns: mount
2018/03/29 04:03:46 remount: /
2018/03/29 04:03:46 mkdirall: /root/go-contain-me/containers/c16f889c-6a49-49a4-bbb0-add1094993c5/rootfs/proc
2018/03/29 04:03:46 mount: proc (proc)
2018/03/29 04:03:46 mkdirall: /root/go-contain-me/containers/c16f889c-6a49-49a4-bbb0-add1094993c5/rootfs/sys
2018/03/29 04:03:46 mount: sysfs (sysfs)
2018/03/29 04:03:46 mkdirall: /root/go-contain-me/containers/c16f889c-6a49-49a4-bbb0-add1094993c5/rootfs/dev
2018/03/29 04:03:46 mount: tmpfs (tmpfs)
2018/03/29 04:03:46 mkdirall: /root/go-contain-me/containers/c16f889c-6a49-49a4-bbb0-add1094993c5/rootfs/dev/pts
2018/03/29 04:03:46 mount: devpts (devpts)
2018/03/29 04:03:46 symlink: stdin
2018/03/29 04:03:46 symlink: stdout
2018/03/29 04:03:46 symlink: stderr
2018/03/29 04:03:46 mknod: null (259)
2018/03/29 04:03:46 mknod: zero (261)
2018/03/29 04:03:46 mknod: random (264)
2018/03/29 04:03:46 mknod: urandom (265)
2018/03/29 04:03:46 mknod: console (34817)
2018/03/29 04:03:46 mknod: tty (1280)
2018/03/29 04:03:46 mknod: full (263)
2018/03/29 04:03:46 newns: UNIX time sharing
2018/03/29 04:03:46 set hostname
2018/03/29 04:03:46 newns: network
2018/03/29 04:03:46 pivot root

Inside the container, we can see only our own mounts:

/ # mount
overlay on / type overlay (rw,nodev,relatime,lowerdir=/root/go-contain-me/images/busybox,upperdir=/root/go-contain-me/containers/c16f889c-6a49-49a4-bbb0-add1094993c5/rootfs,workdir=/root/go-contain-me/containers/c16f889c-6a49-49a4-bbb0-add1094993c5/workdir)
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,mode=755)
devpts on /dev/pts type devpts (rw,relatime,mode=600,ptmxmode=000)

We also have our own network namespace. All the host’s devices are gone. If we want to add network interfaces, we may use the netns functionality of iputils.

/ # ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

The situation is not that good for the process namespace. As I said, I was not able to get it to work reliably. So here we see all the processes of the host as well. Meh.

/ # ps -e
PID   USER     TIME  COMMAND
    1 root      0:00 {systemd} /sbin/init
    2 root      0:00 [kthreadd]
    3 root      0:00 [ksoftirqd/0]
✂️
 1054 root      0:00 sh
 1066 root      0:00 ps -e

Full Source

Here is the full piece of code for your amusement and further experimentation. The code works with a directory structure that looks similar to this:

root@go-contain-me-1:~# tree
.
`-- go-contain-me
    |-- containers
    |   `-- 8f0f5a2d-0ce8-4bd1-887a-2c5b275ee337
    |       |-- rootfs
    |       `-- workdir
    `-- images
        `-- busybox
            `-- (a full user space here)

Compile the program:

$ CGO_ENABLED=0 GOOS=linux go build -a -ldflags '-extldflags "-static"' .

Here’s the source for your interest:

package main

import (
	"flag"
	"io/ioutil"
	"log"
	"os"
	"strconv"

	"github.com/google/uuid"
	"golang.org/x/sys/unix"
)

var (
	baseDir = "/root/go-contain-me"
)

func main() {
	var err error
	cpuShares := flag.String("cpu-shares", "",
		"CPU shares of the container.")
	memoryLimit := flag.String("memory-limit", "256m",
		"Memory limit of the container.")
	swapLimit := flag.String("swap-limit", "",
		"Swap limit of the container.")
	image := flag.String("image", "busybox", "name of the container image")
	flag.Parse()

	pid := unix.Getpid()
	log.Printf("pid: %v", pid)

	// generate container id
	id := uuid.New().String()
	log.Printf("container id: %v", id)

	// CPU cgroup
	cgroupCPU := "/sys/fs/cgroup/cpu/go-contain-me/" + id + "/"
	log.Println("cpu cgroup: create")
	err = os.MkdirAll(cgroupCPU, 0744)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("cpu cgroup: add pid")
	err = ioutil.WriteFile(cgroupCPU+"tasks", []byte(strconv.Itoa(pid)), 0644)
	if err != nil {
		log.Fatal(err)
	}
	if len(*cpuShares) > 0 {
		log.Println("cpu cgroup: set shares")
		err := ioutil.WriteFile(cgroupCPU+"cpu.shares",
			[]byte(*cpuShares), 0644)
		if err != nil {
			log.Fatal(err)
		}
	}

	// memory cgroup
	cgroupMemory := "/sys/fs/cgroup/memory/go-contain-me/" + id + "/"
	log.Println("memory cgroup: create")
	err = os.MkdirAll(cgroupMemory, 0644)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("memory cgroup: add pid")
	err = ioutil.WriteFile(cgroupMemory+"tasks",
		[]byte(strconv.Itoa(pid)), 0644)
	if err != nil {
		log.Fatal(err)
	}
	if len(*memoryLimit) > 0 {
		log.Println("memory cgroup: set memory limit")
		err := ioutil.WriteFile(cgroupMemory+"memory.limit_in_bytes",
			[]byte(*memoryLimit), 0644)
		if err != nil {
			log.Fatal(err)
		}
	}
	if len(*swapLimit) > 0 {
		log.Println("memory cgroup: set swap limit")
		err := ioutil.WriteFile(cgroupMemory+"memory.memsw.limit_in_bytes",
			[]byte(*swapLimit), 0644)
		if err != nil {
			log.Fatal(err)
		}
	}

	// create container directories
	newRoot := baseDir + "/containers/" + id + "/rootfs"
	workDir := baseDir + "/containers/" + id + "/workdir"
	for _, path := range []string{newRoot, workDir} {
		err = os.MkdirAll(path, os.ModePerm)
		if err != nil {
			log.Fatal(err)
		}
	}

	// mount rootfs as overlay
	log.Printf("mount: overlay")
	imageRoot := baseDir + "/images/" + *image
	err = unix.Mount("overlay", newRoot, "overlay", uintptr(unix.MS_NODEV),
		"lowerdir="+imageRoot+",upperdir="+newRoot+",workdir="+workDir)
	if err != nil {
		log.Fatal(err)
	}

	// new mount namespace
	log.Printf("newns: mount")
	err = unix.Unshare(unix.CLONE_NEWNS)
	if err != nil {
		log.Fatal(err)
	}

	// remount rootfs in new namespace
	log.Printf("remount: /")
	err = unix.Mount("", "/", "", uintptr(unix.MS_PRIVATE|unix.MS_REC), "")
	if err != nil {
		log.Fatal(err)
	}

	// mount special
	mounts := []struct {
		source  string
		target  string
		fsType  string
		flags   uint
		options string
	}{
		{source: "proc", target: newRoot + "/proc", fsType: "proc"},
		{source: "sysfs", target: newRoot + "/sys", fsType: "sysfs"},
		{
			source:  "tmpfs",
			target:  newRoot + "/dev",
			fsType:  "tmpfs",
			flags:   unix.MS_NOSUID | unix.MS_STRICTATIME,
			options: "mode=755",
		},
		{
			source: "devpts",
			target: newRoot + "/dev/pts",
			fsType: "devpts",
		},
	}
	for _, mnt := range mounts {
		// ensure mount target exists
		log.Printf("mkdirall: %v", mnt.target)
		err := os.MkdirAll(mnt.target, os.ModePerm)
		if err != nil {
			log.Fatal(err)
		}

		// mount
		log.Printf("mount: %v (%v)", mnt.source, mnt.fsType)
		flags := uintptr(mnt.flags)
		err = unix.Mount(mnt.source, mnt.target, mnt.fsType, flags, mnt.options)
		if err != nil {
			log.Fatal(err)
		}
	}

	// essential file descriptors
	for i, name := range []string{"stdin", "stdout", "stderr"} {
		source := "/proc/self/fd/" + strconv.Itoa(i))
		target := newRoot + "/dev/" + name
		log.Printf("symlink: %v", name)
		err := unix.Symlink(source, target)
		if err != nil {
			log.Fatal(err)
		}
	}

	// create devices
	devices := []struct {
		name  string
		attr  uint32
		major uint32
		minor uint32
	}{
		{name: "null", attr: 0666 | unix.S_IFCHR, major: 1, minor: 3},
		{name: "zero", attr: 0666 | unix.S_IFCHR, major: 1, minor: 3},
		{name: "random", attr: 0666 | unix.S_IFCHR, major: 1, minor: 8},
		{name: "urandom", attr: 0666 | unix.S_IFCHR, major: 1, minor: 9},
		{name: "console", attr: 0666 | unix.S_IFCHR, major: 136, minor: 1},
		{name: "tty", attr: 0666 | unix.S_IFCHR, major: 5, minor: 0},
		{name: "full", attr: 0666 | unix.S_IFCHR, major: 1, minor: 7},
	}
	for _, dev := range devices {
		dt := int(unix.Mkdev(dev.major, dev.minor))
		log.Printf("mknod: %v (%v)", dev.name, dt)
		err := unix.Mknod(newRoot+"/dev/"+dev.name, dev.attr, dt)
		if err != nil {
			log.Fatal(err)
		}
	}
	// new UTS (UNIX Timesharing System) namespace
	log.Printf("newns: UNIX time sharing")
	err = unix.Unshare(unix.CLONE_NEWUTS)
	if err != nil {
		log.Fatal(err)
	}
	// change hostname in new UTS
	log.Printf("set hostname")
	err = unix.Sethostname([]byte(id))
	if err != nil {
		log.Fatal(err)
	}

	/*
		 * can't get it to work :,(
		// new process namespace
		log.Printf("newns: processes")
		err = unix.Unshare(unix.CLONE_NEWPID)
		if err != nil {
			log.Fatal(err)
		}
	*/

	// new network namespace
	log.Printf("newns: network")
	err = unix.Unshare(unix.CLONE_NEWNET)
	if err != nil {
		log.Fatal(err)
	}

	// pivot root
	log.Printf("pivot root")
	oldRootBeforePivot := newRoot + "/.old-root"
	oldRootAfterPivot := "/.old-root"
	err = os.MkdirAll(oldRootBeforePivot, os.ModePerm)
	if err != nil {
		log.Fatalf("mkdirall old root: %v", err)
	}

	err = unix.PivotRoot(newRoot, oldRootBeforePivot)
	if err != nil {
		log.Fatalf("pivot root: %v", err)
	}
	err = unix.Chdir("/")
	if err != nil {
		log.Fatalf("chdir: %v", err)
	}
	err = unix.Unmount(oldRootAfterPivot, unix.MNT_DETACH)
	if err != nil {
		log.Fatalf("unmount old root: %v", err)
	}
	err = unix.Rmdir(oldRootAfterPivot)
	if err != nil {
		log.Fatalf("rmdir old root: %v", err)
	}

	err = unix.Exec("/bin/sh", []string{"sh"}, []string{})
	log.Fatal(err)
}

Note: I used path.Join() in a previous version but decided to remove it; I found it made the code look cluttered. So this will not run properly should the POSIX standard ever decide to replace the path separator / with something else. I am willing to take this risk, though. 😉

Merging Huuuge CSV Files Using Golang Channels

More often than not in engineering, the saying “all you need is less” holds true. The story I would like to share today certainly belongs in that category. It is about a fearless SRE working at a fitness company who had the honorable task of migrating exercise data from a not-so-reliable database to a shiny, new microservice.

What in the good world is exercise data, you may ask? Legitimate question! Exercise data is generated by tens of thousands of strength machines, organized in training circles, from gyms and fitness studios all over the world. These machines challenge our users, make them sweat, and help them reach their training goals. Awesome! But these machines also create a ton of data every time they are used:

  • Metadata, that is, data identifying the machine, which training circle it is assigned to, and a bunch of other relevant data points.
  • Strength Sets, that is, data about the user’s performance and strength distribution over time while pushing or pulling a machine’s lever. Even when compressed and saved as a protocol buffer, this data lies in the two-digit kilobyte range per set.

An exercise can have one or more strength sets, depending on the number of consecutive training intervals a user had on that machine before switching to another machine. All of this data was about to be migrated.

After the SRE team had spent a questionable amount of engineering time and felt responsible for a shockingly large share of the monthly cloud bill, the situation was as follows:

  • The most recent data has been fully migrated to the new service.
  • Remaining strength sets have been parked on Cloud Storage. Each strength set is stored in a protobuf file.
  • The expensive and unreliable database has been shut off for good. 💔
  • About half a billion exercises from early days were still awaiting migration.
  • Metadata for these half a billion exercises has been extracted from a variety of data sources and consolidated into two CSV files.

The Data

Note: Data structures in this example are simplified, leaving out the majority of the columns, a sophisticated pipeline of data augmentation, and backend-specific optimizations. All the data presented here is sourced from pure imagination. 🐸

Exercise metadata contains the exercise ID and some additional columns:

id  user_id  machine_id  circle_id  timestamp
23  1        100         7          1522032074
24  1        204         7          1522032317
25  4711     34          99         1522032161
26  90       70          99         1522012462
27  1        101         7          1522032728

The file metadata.csv looks like this:

id,user_id,machine_id,circle_id,timestamp
23,1,100,7,1522032074
24,1,204,7,1522032317
25,4711,34,99,1522032161
26,90,70,99,1522012462
27,1,101,7,1522032728

Let’s assume this file is a couple of gigabytes in size. Just large enough that it will not fit into the memory of any reasonably priced compute instance.

The other file is strength_sets.csv. It holds the mapping from an exercise’s ID to one or more strength set IDs.

id  strength_set_id
23  13402394
23  13402894
23  48402394
24  84023943
25  26382958
25  23801213
26  93208493
26  93204762
27  94800394

Remember, the strength sets are located in a bucket on Cloud Storage. They shall not be moved. A bucket is just the right place for them to live for now.

Although much smaller than the previous file, it is also large enough so that one would not want to squeeze it into memory. The file looks like this:

id,strength_set_id
23,13402394
23,13402894
23,48402394
24,84023943
25,26382958
25,23801213
26,93208493
26,93204762
27,94800394

The Goal

For the final migration of the exercises we want to create a file exercises.csv that should contain something like this:

id  user_id  machine_id  circle_id  timestamp   strength_set_ids
23  1        100         7          1522032074  13402394 13402894 48402394
24  1        204         7          1522032317  84023943
25  4711     34          99         1522032161  26382958 23801213
26  90       70          99         1522012462  93208493 93204762
27  1        101         7          1522032728  94800394

Metadata and strength set IDs shall be merged so that an importer tool could easily ingest the exercises one at a time.

In pure CSV this would be:

id,user_id,machine_id,circle_id,timestamp,strength_set_ids
23,1,100,7,1522032074,13402394 13402894 48402394
24,1,204,7,1522032317,84023943
25,4711,34,99,1522032161,26382958 23801213
26,90,70,99,1522012462,93208493 93204762
27,1,101,7,1522032728,94800394

Approaches

There are plenty of ways to achieve this, but only a few really pay respect to the tagline from earlier: all you need is less!

Not so great ideas:

  • We could load everything into memory and nest some for loops. Nah, too expensive. Also too easy 😉
  • We could throw all the data into an SQL database and then run a fancy query to get exactly what we want. Hell no, this data was just painfully extracted from a complex SQL schema. We had better keep a safe distance from SQL for now. Also, this would be too easy 🤨
  • We could iterate over metadata.csv and, for every ID, iterate over all lines in strength_sets.csv and copy over the set IDs we need. Sorry, nope! That’s O(n*m) with n being the number of lines in one file and m the number of lines in the other. But the direction this is heading is not that bad, actually!

A better idea:

  • Assuming all lines are sorted with respect to the primary identifier in, let’s say, ascending order, we have to iterate over every file only once.
  • For the metadata, we don’t need access to records that we have already processed. We will never look them up again. The same holds true for records that are still to be processed. We don’t need a line any sooner than the very moment we want to process that line. No lookups here, either.
  • The situation is similar for the strength sets. We only need access to a couple of them at a time, because if everything is sorted, we will find equal IDs next to each other. Once the ID changes, we know there are no further mappings for that exercise’s ID.
  • This solution has no lookups at all. Not even a single hashmap, usually an SRE’s favorite data structure. 😯

Testing The Assumption

So far we have just assumed that the data sets are sorted. We should test this assumption. Of course, by writing a small program for that.

package main

import (
	"bufio"
	"log"
	"os"
	"strconv"
	"strings"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	lastID := -1
	for scanner.Scan() {
		columns := strings.Split(scanner.Text(), ",")
		id, err := strconv.Atoi(columns[0])
		if err != nil {
			log.Fatalf("ParseInt: %v", err)
		}
		if id <= lastID {
			log.Fatal("๐Ÿ˜ญ")
		}
		lastID = id
	}
	if err := scanner.Err(); err != nil {
		log.Fatal("scanner: %v", err)
	}
	log.Println("๐Ÿ‘")
}

Let’s run it, skipping the header line of the CSV:

$ tail -n +2 metadata.csv | go run main.go
2018/03/25 21:53:25 👍

Looking good!

For the strength sets, we have to modify the test. Here we are OK with duplicate IDs as long as they are consecutive. We change the conditional from id <= lastID to id < lastID. That’s all.

✂️
		if id < lastID {
			log.Fatal("😭")
		}
✂️

Let’s run this one, too:

$ tail -n +2 strength_sets.csv | go run main.go
2018/03/25 21:56:22 👍

Great! Now we can implement our memory-saving approach.

The Implementation

The architecture of our program is simple: two reader goroutines stream the CSV files into channels line by line, and a joiner goroutine merges the two streams and hands the augmented lines to main() for output.

First, we define a struct that holds a single CSV line. We will be using a simplified format that treats the ID specially but leaves everything else as it is.

type line struct {
	id         int
	restOfLine string
}

Since we will be reading lines from two different files, a little helper function that abstracts away and synchronizes the line intake comes in handy. Its job is to read lines, parse the ID, and send the result to a channel for easier consumption by other functions. The function takes a filename and a channel as parameters.

func reader(fname string, out chan<- *line) {
	defer close(out) // close channel on return

	// open the file
	file, err := os.Open(fname)
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	header := true
	for scanner.Scan() {
		var l line
		columns := strings.SplitN(scanner.Text(), ",", 2)
		// ignore first line (header)
		if header {
			header = false
			continue
		}
		// convert ID to integer for easier comparison
		id, err := strconv.Atoi(columns[0])
		if err != nil {
			log.Fatalf("ParseInt: %v", err)
		}
		l.id = id
		l.restOfLine = columns[1]
		// send the line to the channel
		out <- &l
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}

We read from stdin while testing our assumptions. Here we read from a file, but the rest is quite similar. We parse the first column into an int to make it easier to match IDs from both input streams later. Pointers to the parsed lines will be sent to the outbound channel for further processing. Processing is done in another function, let’s call it joiner().

func joiner(metadata, setIDs <-chan *line, out chan<- *line) {
	defer close(out) // close channel on return

	si := &line{}
	for md := range metadata {
		sep := ","
		// add matching strength set (if left over from previous iteration)
		if si.id == md.id {
			md.restOfLine += sep + si.restOfLine
			sep = " "
		}
		// look for matching strength sets
		for si = range setIDs {
			// add all strength sets with matching IDs
			if si.id == md.id {
				md.restOfLine += sep + si.restOfLine
				sep = " "
			} else if si.id > md.id {
				break
			}
			// skip all strength sets with lower IDs
		}
		// send the augmented line into the channel
		out <- md
	}
}

Unsurprisingly, this function joins the matching data from our two input channels. To do so, it reads from the metadata channel to fetch a line with exercise metadata. Then we read from the strength set channel, skipping all lines that have a lower ID and appending all values from lines that have a matching ID. A pointer to the augmented line goes to the outbound channel.

In our main() function we fire up the workers and channels and print out the augmented lines.

func main() {
	// read metadata
	metadataChan := make(chan *line)
	go reader("metadata.csv", metadataChan)

	// read strength set IDs
	strengthSetChan := make(chan *line)
	go reader("strength_sets.csv", strengthSetChan)

	// join the two data streams
	augmentedLinesChan := make(chan *line)
	go joiner(metadataChan, strengthSetChan, augmentedLinesChan)

	// write computed lines
	fmt.Println("id,user_id,machine_id,circle_id,timestamp,strength_set_ids")
	for l := range augmentedLinesChan {
		fmt.Printf("%v,%v\n", l.id, l.restOfLine)
	}
}

And that’s all! The design of the program is agnostic to the size of the input files.

$ go run main.go
id,user_id,machine_id,circle_id,timestamp,strength_set_ids
23,1,100,7,1522032074,13402394 13402894 48402394
24,1,204,7,1522032317,84023943
25,4711,34,99,1522032161,26382958 23801213
26,90,70,99,1522012462,93208493 93204762
27,1,101,7,1522032728,94800394

I successfully ran this program to convert hundreds of gigabytes of exercises with as little as a couple of megabytes of memory consumption. 🎉

Conclusion

Sometimes it is enough to evaluate a few simple assumptions to unlock a scaling approach. Once we are able to divide and conquer, things become easy very quickly.

Bonus: Try modifying the code to run concurrent joiner() goroutines to make use of all CPU cores (Hint: Modulus).
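
Here is a hedged sketch of what that hint points to, using a hypothetical fanOut() helper that shards lines across per-worker channels by ID modulo the worker count. Because both files are fanned out the same way, matching IDs always land on the same joiner, and each worker still sees its sub-stream in sorted order (you would have to merge the workers’ outputs afterwards if the final file must stay sorted).

// Hypothetical sketch: distribute lines to len(outs) joiner goroutines by
// ID modulo. Lines with the same ID always go to the same worker, so each
// joiner sees a consistent, still-sorted sub-stream.
func fanOut(in <-chan *line, outs []chan *line) {
	// close all worker channels once the input is drained
	defer func() {
		for _, out := range outs {
			close(out)
		}
	}()
	for l := range in {
		outs[l.id%len(outs)] <- l
	}
}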

How to score The Daily Show tickets using Twilio and Golang

I’m a big fan of Comedy Central’s The Daily Show, previously hosted by Jon Stewart, who handed over to Trevor Noah in 2015. The show is recorded in New York City in front of a live studio audience. Tickets to the show are free but limited. And therefore they are hard to get. There are two types of tickets:

  • GENERAL GUARANTEED tickets which guarantee a seat in the audience if you arrive in time, and
  • GENERAL - ENTRY NOT GUARANTEED tickets that might get you in when you wait in line next to the studio. People holding this ticket type may be invited to fill in remaining seats.

The ticket website has a little calendar in the top right corner. Dates on which the show will be airing become available in chunks from time to time. Over the short period I monitored this, I was not able to find a recurring pattern from which I could predict future ticket rounds. If I wanted to get lucky with a ticket on a specific date, I would have to check the ticket website regularly. I believe humans should delegate repetitive tasks like this one to a computer.

A small program would do the job of

  • checking the website regularly for ticket availability at specific dates and
  • notifying me once tickets are available.

Checking the Dates

To get an idea of how the website was structured and how I could detect changes in the availability of tickets, I took a look at the source code. 👀 The site is structured mostly using plain HTML.

The calendar, however, is rendered via JavaScript.

var dates_avail = {
  "2018-02-20":"style1",
  "2018-02-21":"style1",
  "2018-02-22":"style1",
  ✂️
  "2018-03-28":"style1",
  "2018-03-29":"style1"
};

Luckily, the available dates are stored in an object in the source. Furthermore, to find a particular date I simply needed to search for it in the right notation, for example "2018-02-20":"style1". That string matched nothing but the ticket calendar, and only if tickets were available on that date. That was so much easier than I thought!

So the checking part of my program was down to simply fetching the website’s source and then running a substring search over the received data.

My weapon of choice, as usual: Golang. I used the http package to fetch the website’s source:

ticketWebsiteURL := "https://www.showclix.com/event/TheDailyShowwithTrevorNoah"
response, err := http.Get(ticketWebsiteURL)
if err != nil {
  ✂️
}
contents, err := ioutil.ReadAll(response.Body)
if err != nil {
  ✂️
}

For the substring search, I started by defining the dates on which I would be in New York and have enough spare time to make it to the studio. After all, there is no use blocking a seat that others would be happy to have.

triggerDates := []string{"2018-02-20", "2018-02-21"}

Ranging over the triggerDates I defined the search pattern for each date individually and then used the Contains() function of the strings package.

for _, td := range triggerDates {
  search := "\"" + td + "\":\"style1\""
  if strings.Contains(string(contents), search) {
    // Found it! Let's send a notification.
    ✂️
  }
}
response.Body.Close()

That was the easier part.

Notification via SMS

I spend most of my life in the Central European Time (CET) timezone, but the show is recorded in the Eastern Standard Time (EST) timezone. I assumed the tickets would, therefore, become available sometime during EST office hours, during which I might be asleep.

I needed a notification method that would safely wake me up but not interfere with the quiet hours settings of my phone. I could have used the Pagerduty API but somehow I found this to be overkill. Overkill, because I have an alert escalation configuration at Pagerduty that truly wakes me up. And my girlfriend, much to her excitement. #not

A communication channel that has been almost forgotten is the good old Short Message Service (SMS). I get so few SMS messages that receiving one can hardly disturb my usual flow. With SMS to the rescue, I configured my phone to notify me about SMS even during quiet hours. Now I only needed to make my program send SMS. That may sound tricky, but in the age of Cloud, this is just a web service away. I headed over to Twilio to register a mobile phone number that I would use as the message source.

Here is how I crafted the message in Go:

msgData := url.Values{}
msgData.Set("To", "+00-MY-PHONE-NUMBER")
msgData.Set("From", "+00-MY-TWILIO-NUMBER")
msgData.Set("Body",
  "Tickets available now ("+info+")! Visit "+ticketWebsiteURL)

Sending the text was then as easy as creating a REST call:

req, _ := http.NewRequest("POST", apiURL, &msgDataReader)
req.SetBasicAuth(accountSid, authToken)
req.Header.Add("Accept", "application/json")
req.Header.Add("Content-Type", "application/x-www-form-urlencoded")

There are a few more lines to making a successful call to the Twilio API, of course. Scroll down for the full source.

Wait for it… Win!

With fetching and notifications being set up I just had to run the code. For this I copied it over to my jump host, a compute instance that is running 24/7 anyway and could use some additional load. 😉

04:35:33 waiting...
04:35:33 fetching...
04:50:34 waiting...
05:05:34 fetching...
05:05:35 sms sent!
exit status 1

One day, right after my early workout, the long-awaited text arrived. I got lucky and scored two tickets for one of the dates I wanted. Hooray!

Greetings from The Daily Show with Trevor Noah everyone!

Source Code

Here’s the full source code FYI. Enjoy the show! 🎥

package main

import (
  "encoding/json"
  "io/ioutil"
  "log"
  "net/http"
  "net/url"
  "strings"
  "time"
)

var (
  triggerDates     = []string{"2018-02-20", "2018-02-21"}
  ticketWebsiteURL = "https://www.showclix.com/event/TheDailyShowwithTrevorNoah"
)

func sendSMS(info string) {
  accountSid := "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  authToken := "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"
  apiURL := "https://api.twilio.com/2010-04-01/Accounts/" +
    accountSid + "/Messages.json"

  msgData := url.Values{}
  msgData.Set("To", "+00-YOUR-NUMBER")
  msgData.Set("From", "+00-YOUR-TWILIO-NUMBER")
  msgData.Set("Body",
    "Tickets available now ("+info+")! Visit "+ticketWebsiteURL)
  msgDataReader := *strings.NewReader(msgData.Encode())

  client := &http.Client{}
  req, _ := http.NewRequest("POST", apiURL, &msgDataReader)
  req.SetBasicAuth(accountSid, authToken)
  req.Header.Add("Accept", "application/json")
  req.Header.Add("Content-Type", "application/x-www-form-urlencoded")

  resp, _ := client.Do(req)
  // Let's be very generous with the status code. We want those tickets!
  if resp.StatusCode >= 200 && resp.StatusCode < 300 {
    var data map[string]interface{}
    decoder := json.NewDecoder(resp.Body)
    err := decoder.Decode(&data)
    if err != nil {
      log.Printf("twilio: decode json: %v", err)
    }
  } else {
    log.Printf("twilio: status code: %v", resp.Status)
  }
}

func main() {
  for {
    // Behave! Wait 15 minutes to not overload the site or spam their logs.
    log.Printf("waiting...\n")
    time.Sleep(15 * time.Minute)

    // Fetch the website.
    log.Printf("fetching...\n")
    response, err := http.Get(ticketWebsiteURL)
    if err != nil {
      log.Printf("get: %v", err)
      continue
    }
    contents, err := ioutil.ReadAll(response.Body)
    if err != nil {
      log.Printf("read body: %v", err)
      continue
    }
    // Look for the trigger dates in the source code.
    for _, td := range triggerDates {
      search := "\"" + td + "\":\"style1\""
      if strings.Contains(string(contents), search) {
        // Found it! Let's send a notification.
        sendSMS(td)
        log.Fatalf("sms sent!")
      }
    }
    response.Body.Close()
  }
}