Replying To Domain Abuse Mail

A while ago I heard about an abandoned World War II-era anti-aircraft platform that sits in the North Sea and hosts a self-proclaimed micronation called the Principality of Sealand. One can hardly spot it on Google Maps; that is how micro it really is.

sealand on google maps,small

Here is an aerial shot of Sealand by Ryan Lackey:

sealand from the sky

Back then I was fascinated by the history of the platform and thought it was quite funny to establish a nation in the middle of nowhere. Years went by and I did not think about the story anymore.

One day, however, I found myself setting up a number of off-site backup systems and wanted to collect them under their own domain name for disaster recovery purposes. While thinking about an appropriate domain name, the Sealand story came back to me, so I ended up registering sealand.io. Skip ahead another couple of years: the off-site systems were eaten up by the cloud and the domain served no purpose anymore. Forgotten and abandoned, it sat in a registrar account, just like the anti-aircraft platform sat abandoned in the North Sea.

And then I received an abuse notification via my registrar from an unrecognized sender, claiming to represent the Government of Sealand.

Please find below a message we received regarding your domain name sealand.io:

From: [redacted address, appearing to be sent from a personal account]

Message: I would like gandi.net to help me contact the owner of Sealand.io I represent the government of Sealand (https://www.sealandgov.org). We would like to discuss the release of sealand.io from its current owner.

What to make of that? Usually, I do not follow up on emails like these. However, given my earlier fascination with the platform’s history, I decided to contact the official address of the Government of Sealand.

Hi,

I received the attached message and wanted to check out if this is a legitimate request. If it is, I think we will find a way. I am open to talk.

Cheers

Dan

Now that I think about it, this was the sloppiest mail I have ever written to a government. I probably assumed that folks who board platforms in the sometimes unforgiving North Sea must be cool enough to deal with my sloppiness.

During the following conversation, I learned that the Principality of Sealand sees itself as a non-profit organization, a property I believe most governments share in one way or another. Furthermore, I learned about future plans for managing the platform that would benefit from having a fancy domain like sealand.io. I do not want to go into details here, so as not to endanger the success of the project, and also to avoid getting into trouble with the Government of Sealand. Who knows what these badass guys are capable of? After all, they threw a bunch of pirate radio broadcasters off the platform in 1967.

Not having any idea what I could use the domain for, I agreed to trade it with the Government of Sealand for a title. A couple of days later my noble title arrived. There was even Sealand postage on the parcel (which was delivered by UPS).

sealand stamp,small

The title certificate was inside a very fancy-looking leather wrapper with the seal of the Principality of Sealand on it. If you look closely, you can see the slogan E Mare Libertas, which means From The Sea, Freedom.

leather wrapper,small

Inside the leather wrapper was a 29-page fact sheet about the history of the micronation and, of course, the certificate to prove my surprisingly gained nobleness.

duke of sealand certificate

I am now officially a Duke of Sealand. 👑 Maybe I should read my abuse mail more often. 🤔

Protip: Do not ask your partner to address you as your highness in future conversations. Not a good idea. Funny, but not a good idea. 😬

Eight Queens: A Simple Backtracking Algorithm In Golang

The Eight Queens Puzzle, a popular computer science question, reads:

How do you place eight queens on a chess board so that they do not attack each other?

The queen is the most powerful piece in a chess game. She can move in any direction and as far as she likes. This means a queen placed on the board attacks quite a few fields. None of these fields can be used to place another queen, because the two queens would attack one another.

One way to solve this problem is backtracking, which Wikipedia defines as follows:

Backtracking is a general algorithm for finding all (or some) solutions to some computational problems, notably constraint satisfaction problems, that incrementally builds candidates to the solutions, and abandons a candidate (“backtracks”) as soon as it determines that the candidate cannot possibly be completed to a valid solution.

The QueensBoard Package

Let’s implement a simple backtracking algorithm for the puzzle! It uses a package called QueensBoard, which provides the following functions (a short usage sketch follows the list):

  • New() creates an 8x8 board.
  • board.Queens() returns the number of queens that are currently placed on the board. We will use this function to check if we have found a place for all the queens.
  • board.AvailableFields() returns a list of coordinates that are not under attack and are therefore available for placing a queen.
  • board.PlaceQueen() takes a coordinate as a parameter and places a queen there if possible. It updates the board to mark all fields that are now under attack as unavailable.
  • board.RemoveQueen() takes a coordinate as a parameter and removes the queen from the field addressed by that coordinate.
  • board.Print() takes a Writer interface as a parameter and writes a Unicode-powered, human-readable representation of the state of the board to it. We will use it with os.Stdout to print the final result.
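
To get a feel for the API before we write the solver, here is a minimal usage sketch. It assumes, just like the recursive solution below, that board.AvailableFields() returns a slice of coordinates that can be handed directly to board.PlaceQueen(), and it additionally uses fmt:

board := qb.New()                  // fresh 8x8 board, no queens placed yet
fields := board.AvailableFields()  // on an empty board, every field is available
board.PlaceQueen(fields[0])        // place a queen on the first free field
fmt.Println(board.Queens())        // 1
board.Print(os.Stdout)             // human-readable view of the current board state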

Recursive Approach

Backtracking and recursion often go very well together. We can use recursion to dive deeper and deeper into a prospective solution until

  • we either hit the base case and return the solution, or
  • we realize that we are on a path that will not lead to a solution.

If we are barking up the wrong tree, we backtrack to the last state and try a different way. In our example, backtracking means that we remove the last queen we placed on the board once we realize that we cannot place all eight queens. This usually happens when there are not enough unattacked fields left to place queens on.

Boringly, we will name the function queens(). It takes a *qb.Board as a parameter and also returns a pointer to a qb.Board struct. The base case is the perfect board: one with eight queens that do not attack each other:

func queens(board *qb.Board) *qb.Board {
	if board.Queens() == 8 {
		return board
	}

	// here be magic

	return nil
}

Now we need to fill in the actual algorithm. First, we fetch the coordinates of all the fields that are currently not under attack. We are free to place a queen on any of them, so why not start with the first one? After placing the queen, we recursively dive into this prospective solution and place the next queen and the next queen and so on… If we did not reach our goal of placing eight queens, we remove the queen and try again with the next coordinate.

func queens(board *qb.Board) *qb.Board {
	if board.Queens() == 8 {
		return board
	}
	for _, coordinate := range board.AvailableFields() {
		board.PlaceQueen(coordinate)
		next := queens(board)
		if next != nil {
			return next
		}
		board.RemoveQueen(coordinate)
	}
	return nil
}

Find the full source at the end of the article. This algorithm is not the fastest one; however, on most machines it should finish in less than a second. The final board should look like this:

Here is an animation of the first couple of steps of the algorithm:

eight queens backtracking animation,small

Improve!

The QueensBoard package also offers the ability to create custom sized boards via the NewCustom() function.

  • Investigate how the algorithm scales with increasing board sizes.
  • Have you tried randomizing the list returned by board.AvailableFields()? How does this influence the average runtime of your algorithm? (A sketch of this idea follows below.)
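
As an illustration of the second idea, here is a sketch of a shuffled loop that could replace the loop inside queens(). It assumes the slice returned by board.AvailableFields() may be reordered freely and additionally needs the math/rand package (on older Go versions, seed it, e.g. with rand.Seed(time.Now().UnixNano()), to get a different order on every run):

fields := board.AvailableFields()
// try the candidate fields in random order instead of top-left to bottom-right
rand.Shuffle(len(fields), func(i, j int) {
	fields[i], fields[j] = fields[j], fields[i]
})
for _, coordinate := range fields {
	board.PlaceQueen(coordinate)
	if next := queens(board); next != nil {
		return next
	}
	board.RemoveQueen(coordinate)
}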

Source Code

Here is the full source, ready to run:

package main

import (
	"os"

	qb "github.com/danrl/golibby/queensboard"
)

func queens(board *qb.Board) *qb.Board {
    // uncomment the next line to print intermediate steps
	//board.Print(os.Stdout)
	if board.Queens() == 8 {
		return board
	}
	for _, coordinate := range board.AvailableFields() {
		board.PlaceQueen(coordinate)
		next := queens(board)
		if next != nil {
			return next
		}
		board.RemoveQueen(coordinate)
	}
	return nil
}

func main() {
	board := qb.New()
	queens(board)
	board.Print(os.Stdout)
}

Go Contain Me (SREcon Americas 2018)

My first day at SREcon Americas 2018 was very exciting and inspiring. It started with the Containers from Scratch workshop by Avishai Ish-Shalom and Nati Cohen. They developed a syscall-level workshop about Linux containers that I can highly recommend. It deals with a program containing and isolating itself step by step using Linux system calls. In the end, the program forks to drop the last bit of privilege that is left. That was super fun, although the network namespaces gave me a hard time due to a silly implementation mistake I made.

The workshop code is in Python and worked perfectly. However, I want to improve my Golang skills, so I decided to redo the assignments from the workshop in Golang. Shouldn’t be too hard, should it? After all, Docker is written in Golang.

This is my write-up of the endeavor.

General Idea

Similar to other popular container solutions, our little program should:

  • Use the host’s kernel space
  • Have its own userspace binaries (e.g. busybox)
  • Have a unique ID
  • Use cgroups to limit its own resource usage
  • Use an overlay filesystem to avoid messing with the userspace binaries that may be used by multiple containers at the same time
  • Use Linux namespacing for mounts, processes, network, and UNIX time-sharing (uhm, the last one is some historical thingy)
  • Make special devices, such as /dev/null and /dev/urandom, available inside the container
  • Overload the currently running binary with a binary from inside the container. And run it.

Spoiler alert: We will not make all of this work. The process namespace gave me a hard time because I was restricting myself to state-of-the-art standard libraries and avoiding custom CGO code. Unfortunately, the syscall package is not considered state-of-the-art anymore, so I had to avoid using it. 😔

Who Am I?

Our container will start off as a simple process that slowly isolates itself from its environment. First, we want to know who we are, so we fetch the process ID (PID).

pid := unix.Getpid()
log.Printf("pid: %v", pid)

We also want a unique identifier for our container. Process IDs are limited and might eventually be re-assigned. So let’s grab a UUID and use that.

id := uuid.New().String()
log.Printf("container id: %v", id)

We can use the container ID to name directories in a non-conflicting way later. It is highly unlikely that any two UUIDs collide in an environment like ours.

Building Fences

The next thing we want to do is build a fence around our process. We do this for CPU and memory using Linux cgroups. For the CPU, we write our PID to /sys/fs/cgroup/cpu/go-contain-me/<UUID>/tasks and the number of CPU shares we want to grant the process to /sys/fs/cgroup/cpu/go-contain-me/<UUID>/cpu.shares.

cgroupCPU := "/sys/fs/cgroup/cpu/go-contain-me/" + id + "/"
log.Println("cpu cgroup: create")
err = os.MkdirAll(cgroupCPU, 0744)
if err != nil {
    log.Fatal(err)
}
log.Println("cpu cgroup: add pid")
err = ioutil.WriteFile(cgroupCPU+"tasks", []byte(strconv.Itoa(pid)), 0644)
if err != nil {
    log.Fatal(err)
}
if len(*cpuShares) > 0 {
    log.Println("cpu cgroup: set shares")
    err := ioutil.WriteFile(cgroupCPU+"cpu.shares",
        []byte(*cpuShares), 0644)
    if err != nil {
        log.Fatal(err)
    }
}

For the memory cgroup we do something similar, yet there is a small difference. We are not supposed to set shares here, but the actual number of bytes we want the process to be limited to. We can also limit the number of swap bytes the process is allowed to consume.

cgroupMemory := "/sys/fs/cgroup/memory/go-contain-me/" + id + "/"
log.Println("memory cgroup: create")
err = os.MkdirAll(cgroupMemory, 0644)
if err != nil {
    log.Fatal(err)
}
log.Println("memory cgroup: add pid")
err = ioutil.WriteFile(cgroupMemory+"tasks",
    []byte(strconv.Itoa(pid)), 0644)
if err != nil {
    log.Fatal(err)
}
if len(*memoryLimit) > 0 {
    log.Println("memory cgroup: set memory limit")
    err := ioutil.WriteFile(cgroupMemory+"memory.limit_in_bytes",
        []byte(*memoryLimit), 0644)
    if err != nil {
        log.Fatal(err)
    }
}
if len(*swapLimit) > 0 {
    log.Println("memory cgroup: set swap limit")
    err := ioutil.WriteFile(cgroupMemory+"memory.memsw.limit_in_bytes",
        []byte(*swapLimit), 0644)
    if err != nil {
        log.Fatal(err)
    }
}

Great, now we have contained ourselves in terms of resource usage.
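
If you want to convince yourself that the fence is really in place, a quick read-back of the cgroup files does the trick. This is just a sanity-check sketch, not part of the original program; it reuses the cgroupMemory path from above and additionally needs the strings package:

limit, err := ioutil.ReadFile(cgroupMemory + "memory.limit_in_bytes")
if err != nil {
    log.Fatal(err)
}
// the kernel reports the limit in bytes, even if we wrote a shorthand value like "256m"
log.Printf("memory limit: %v bytes", strings.TrimSpace(string(limit)))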

Overlay Root File System

Now we shall isolate our process further from the host system, step by step. Let’s assume we have an extracted userspace image at /root/go-contain-me/images/busybox. You can use docker export to get your hands on one quickly if needed. 🐳

As multiple containers might be using the same underlying image, we have to make sure we do not write to the image data. However, we still want to be able to make changes to the data, such as adding, modifying, or removing files. But how? The overlay filesystem comes to the rescue! As the name suggests, we can overlay something called an upperdir onto a lowerdir. We would also need a workdir where we store some copy-on-write information, e.g. for files that have been deleted during operation.

So the first order of business for overlaying a filesystem is to make sure all the required directories exist:

newRoot := baseDir + "/containers/" + id + "/rootfs"
workDir := baseDir + "/containers/" + id + "/workdir"
for _, path := range []string{newRoot, workDir} {
    err = os.MkdirAll(path, os.ModePerm)
    if err != nil {
        log.Fatal(err)
    }
}

After that, the whole operation is just a regular mount with a rarely seen set of options:

log.Printf("mount: overlay")
imageRoot := baseDir + "/images/" + *image
err = unix.Mount("overlay", newRoot, "overlay", uintptr(unix.MS_NODEV),
    "lowerdir="+imageRoot+",upperdir="+newRoot+",workdir="+workDir)
if err != nil {
    log.Fatal(err)
}

The MS_NODEV flag, by the way, prevents special files (devices) from being accessed on this filesystem. We will create those later using the mknod system call.

Moving To A New Mount Namespace

Right now, our mounts show up on the host and pollute it a bit. Luckily, we can isolate our mounts from the host mounts by creating a new namespace for our process (at some point we can start calling the process a container).

log.Printf("newns: mount")
err = unix.Unshare(unix.CLONE_NEWNS)
if err != nil {
    log.Fatal(err)
}

Now we remount the root filesystem to make it private to the newly created namespace:

log.Printf("remount: /")
err = unix.Mount("", "/", "", uintptr(unix.MS_PRIVATE|unix.MS_REC), "")
if err != nil {
    log.Fatal(err)
}

We are using flags again:

  • MS_PRIVATE makes sure that mount and unmount events do not propagate into or out of this mount point.
  • MS_REC just means that the flags it is used in conjunction with are meant to be applied recursively.

Special Mounts

Now that we have our isolated mount namespace, it is time to mount some special filesystems there. Let’s use a for loop to avoid writing the same code over and over again.

mounts := []struct {
    source  string
    target  string
    fsType  string
    flags   uint
    options string
}{
    {source: "proc", target: newRoot + "/proc", fsType: "proc"},
    {source: "sysfs", target: newRoot + "/sys", fsType: "sysfs"},
    {
        source:  "tmpfs",
        target:  newRoot + "/dev",
        fsType:  "tmpfs",
        flags:   unix.MS_NOSUID | unix.MS_STRICTATIME,
        options: "mode=755",
    },
    {
        source: "devpts",
        target: newRoot + "/dev/pts",
        fsType: "devpts",
    },
}
for _, mnt := range mounts {
    // ensure mount target exists
    log.Printf("mkdirall: %v", mnt.target)
    err := os.MkdirAll(mnt.target, os.ModePerm)
    if err != nil {
        log.Fatal(err)
    }

    // mount
    log.Printf("mount: %v (%v)", mnt.source, mnt.fsType)
    flags := uintptr(mnt.flags)
    err = unix.Mount(mnt.source, mnt.target, mnt.fsType, flags, mnt.options)
    if err != nil {
        log.Fatal(err)
    }
}

This should leave us with the most important filesystems in place under our new root. Remember, our new root is still seen as /root/go-contain-me/containers/<UUID>/rootfs. But that is going to change soon.

Essential File Descriptors

We will soon pivot the process’ root to use our container’s rootfs directory as root. See how I just used container now instead of process? This was totally arbitrary. 🙃 But before we lose access to the current filesystem tree, let’s rescue essential file descriptors such as stdin and stdout. Without them functioning, we would not have much fun with our container.

A simple symlink() does the job:

for i, name := range []string{"stdin", "stdout", "stderr"} {
    source := "/proc/self/fd/" + strconv.Itoa(i))
    target := newRoot + "/dev/" + name
    log.Printf("symlink: %v", name)
    err := unix.Symlink(source, target)
    if err != nil {
        log.Fatal(err)
    }
}

Creating Devices

Processes running inside our container may assume that a certain set of special devices is present. One popular example is /dev/null, which is often used to drop data streams into Nirvana. If /dev/null weren’t present, those data streams might end up in a regular file. This could, in turn, quickly fill up the filesystem. If there are no quotas on the container’s filesystem, this might affect the host’s filesystem as well. Not cool.

We’ll use the loop approach one more time here:

devices := []struct {
    name  string
    attr  uint32
    major uint32
    minor uint32
}{
    {name: "null", attr: 0666 | unix.S_IFCHR, major: 1, minor: 3},
    {name: "zero", attr: 0666 | unix.S_IFCHR, major: 1, minor: 3},
    {name: "random", attr: 0666 | unix.S_IFCHR, major: 1, minor: 8},
    {name: "urandom", attr: 0666 | unix.S_IFCHR, major: 1, minor: 9},
    {name: "console", attr: 0666 | unix.S_IFCHR, major: 136, minor: 1},
    {name: "tty", attr: 0666 | unix.S_IFCHR, major: 5, minor: 0},
    {name: "full", attr: 0666 | unix.S_IFCHR, major: 1, minor: 7},
}
for _, dev := range devices {
    dt := int(unix.Mkdev(dev.major, dev.minor))
    log.Printf("mknod: %v (%v)", dev.name, dt)
    err := unix.Mknod(newRoot+"/dev/"+dev.name, dev.attr, dt)
    if err != nil {
        log.Fatal(err)
    }
}

Isolate The UNIX Time-Sharing Namespace

We are coming closer to pivoting the root. I promise. However, there are still a few more isolation steps we should take. For example, we want the hostname of the container to be isolated from the hostname of the host. One might expect this to fall under the domain of the network namespace. Surprisingly, that is not the case. For historical reasons, the namespace for this is the UNIX Time-Sharing namespace, or UTS for short.

So let’s unshare() this one before setting the hostname:

log.Printf("newns: UNIX time sharing")
err = unix.Unshare(unix.CLONE_NEWUTS)
if err != nil {
    log.Fatal(err)
}
// change hostname in new UTS
log.Printf("set hostname")
err = unix.Sethostname([]byte(id))
if err != nil {
    log.Fatal(err)
}

Isolate The Process Namespace (b0rked)

We also want to isolate the container’s process namespace from the host, meaning that if we run ps inside the container, we don’t want to see the processes of the host.

Note: I was not able to get this one to work. The code compiles, the code runs, but then the contained processes run out of memory real quick, despite a generous cgroup setting for memory. I did not invest much time into debugging this. Feel free to drop me a line if you happen to know what the problem is. 🤓

For the sake of completeness, here is my code:

log.Printf("newns: processes")
err = unix.Unshare(unix.CLONE_NEWPID)
if err != nil {
    log.Fatal(err)
}

Isolating The Network

For the network namespace, we make another call to unshare(). This gives us a new namespace that contains only a loopback interface. Clean and lean!

log.Printf("newns: network")
err = unix.Unshare(unix.CLONE_NEWNET)
if err != nil {
    log.Fatal(err)
}

If you’d like to dig deeper into network namespacing: try ip netns help for a start, and don’t forget to link the namespace to the container’s default namespace before unsharing!

Pivoting

Phew. That was a long journey. Now we can pivot the root! Hooray! The operation looks more complicated than it is. Basically, we just do the following things:

  • Create a directory named .old-root. This is where the kernel will mount the old root after pivoting.
  • Pivot (obviously)
  • Change directory to /.
  • Unmount the old root.
  • Remove the old root directory created in step one.

log.Printf("pivot root")
oldRootBeforePivot := newRoot + "/.old-root"
oldRootAfterPivot := "/.old-root"
err = os.MkdirAll(oldRootBeforePivot, os.ModePerm)
if err != nil {
    log.Fatalf("mkdirall old root: %v", err)
}

err = unix.PivotRoot(newRoot, oldRootBeforePivot)
if err != nil {
    log.Fatalf("pivot root: %v", err)
}
err = unix.Chdir("/")
if err != nil {
    log.Fatalf("chdir: %v", err)
}
err = unix.Unmount(oldRootAfterPivot, unix.MNT_DETACH)
if err != nil {
    log.Fatalf("unmount old root: %v", err)
}
err = unix.Rmdir(oldRootAfterPivot)
if err != nil {
    log.Fatalf("rmdir old root: %v", err)
}

The Finale

Hold your breath, now comes the final operation before we fully enter container land! We overload the process with the new binary to run. Here we are using sh to get a shell we can interact with.

Ideally, we would do this in a child process after fork() or clone(), but it turns out forking isn’t a great idea in Golang. I’ll spare you the details, but there are plenty of discussions about this at the usual places.

err = unix.Exec("/bin/sh", []string{"sh"}, []string{})
log.Fatal(err)

Ideally, the line reading log.Fatal(err) is never reached.
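
If you would rather keep the parent process around, for example to clean up after the shell exits, a child-process variant using the standard os/exec package is a possible alternative to exec-ing in place. This is only a sketch, and it does not solve the process namespace problem either:

cmd := exec.Command("/bin/sh")
// attach the shell to our stdin, stdout, and stderr
cmd.Stdin = os.Stdin
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil {
    log.Fatal(err)
}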

Running It!

It’s time to run this thing! Do yourself a favor and run this in a virtual machine. The code is not free of risk and could force you to reboot in case something goes wrong. And we don’t reboot our computers anymore nowadays, do we? 😂

# ./go-contain-me
2018/03/29 04:03:46 pid: 1054
2018/03/29 04:03:46 container id: c16f889c-6a49-49a4-bbb0-add1094993c5
2018/03/29 04:03:46 cpu cgroup: create
2018/03/29 04:03:46 cpu cgroup: add pid
2018/03/29 04:03:46 memory cgroup: create
2018/03/29 04:03:46 memory cgroup: add pid
2018/03/29 04:03:46 memory cgroup: set memory limit
2018/03/29 04:03:46 mount: overlay
2018/03/29 04:03:46 newns: mount
2018/03/29 04:03:46 remount: /
2018/03/29 04:03:46 mkdirall: /root/go-contain-me/containers/c16f889c-6a49-49a4-bbb0-add1094993c5/rootfs/proc
2018/03/29 04:03:46 mount: proc (proc)
2018/03/29 04:03:46 mkdirall: /root/go-contain-me/containers/c16f889c-6a49-49a4-bbb0-add1094993c5/rootfs/sys
2018/03/29 04:03:46 mount: sysfs (sysfs)
2018/03/29 04:03:46 mkdirall: /root/go-contain-me/containers/c16f889c-6a49-49a4-bbb0-add1094993c5/rootfs/dev
2018/03/29 04:03:46 mount: tmpfs (tmpfs)
2018/03/29 04:03:46 mkdirall: /root/go-contain-me/containers/c16f889c-6a49-49a4-bbb0-add1094993c5/rootfs/dev/pts
2018/03/29 04:03:46 mount: devpts (devpts)
2018/03/29 04:03:46 symlink: stdin
2018/03/29 04:03:46 symlink: stdout
2018/03/29 04:03:46 symlink: stderr
2018/03/29 04:03:46 mknod: null (259)
2018/03/29 04:03:46 mknod: zero (261)
2018/03/29 04:03:46 mknod: random (264)
2018/03/29 04:03:46 mknod: urandom (265)
2018/03/29 04:03:46 mknod: console (34817)
2018/03/29 04:03:46 mknod: tty (1280)
2018/03/29 04:03:46 mknod: full (263)
2018/03/29 04:03:46 newns: UNIX time sharing
2018/03/29 04:03:46 set hostname
2018/03/29 04:03:46 newns: network
2018/03/29 04:03:46 pivot root

Inside the container, we can see only our own mounts:

/ # mount
overlay on / type overlay (rw,nodev,relatime,lowerdir=/root/go-contain-me/images/busybox,upperdir=/root/go-contain-me/containers/c16f889c-6a49-49a4-bbb0-add1094993c5/rootfs,workdir=/root/go-contain-me/containers/c16f889c-6a49-49a4-bbb0-add1094993c5/workdir)
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,mode=755)
devpts on /dev/pts type devpts (rw,relatime,mode=600,ptmxmode=000)

We also have our own network namespace. All the host’s devices are gone. If we want to add network interfaces, we may use the netns functionality of iproute2.

/ # ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

The situation is not that good for the process namespace. As I said, I was not able to get it to work reliably. So here we see all the processes of the host as well. Meh.

/ # ps -e
PID   USER     TIME  COMMAND
    1 root      0:00 {systemd} /sbin/init
    2 root      0:00 [kthreadd]
    3 root      0:00 [ksoftirqd/0]
✂️
 1054 root      0:00 sh
 1066 root      0:00 ps -e

Full Source

Here is the full piece of code for your amusement and further experimentation. The code works with a directory structure that looks similar to this:

root@go-contain-me-1:~# tree
.
`-- go-contain-me
    |-- containers
    |   `-- 8f0f5a2d-0ce8-4bd1-887a-2c5b275ee337
    |       |-- rootfs
    |       `-- workdir
    `-- images
        `-- busybox
            `-- (a full user space here)

Compile the program:

$ CGO_ENABLED=0 GOOS=linux go build -a -ldflags '-extldflags "-static"' .

Here’s the source for your interest:

package main

import (
	"flag"
	"io/ioutil"
	"log"
	"os"
	"strconv"

	"github.com/google/uuid"
	"golang.org/x/sys/unix"
)

var (
	baseDir = "/root/go-contain-me"
)

func main() {
	var err error
	cpuShares := flag.String("cpu-shares", "",
		"CPU shares of the container.")
	memoryLimit := flag.String("memory-limit", "256m",
		"Memory limit of the container.")
	swapLimit := flag.String("swap-limit", "",
		"Swap limit of the container.")
	image := flag.String("image", "busybox", "name of the container image")
	flag.Parse()

	pid := unix.Getpid()
	log.Printf("pid: %v", pid)

	// generate container id
	id := uuid.New().String()
	log.Printf("container id: %v", id)

	// CPU cgroup
	cgroupCPU := "/sys/fs/cgroup/cpu/go-contain-me/" + id + "/"
	log.Println("cpu cgroup: create")
	err = os.MkdirAll(cgroupCPU, 0744)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("cpu cgroup: add pid")
	err = ioutil.WriteFile(cgroupCPU+"tasks", []byte(strconv.Itoa(pid)), 0644)
	if err != nil {
		log.Fatal(err)
	}
	if len(*cpuShares) > 0 {
		log.Println("cpu cgroup: set shares")
		err := ioutil.WriteFile(cgroupCPU+"cpu.shares",
			[]byte(*cpuShares), 0644)
		if err != nil {
			log.Fatal(err)
		}
	}

	// memory cgroup
	cgroupMemory := "/sys/fs/cgroup/memory/go-contain-me/" + id + "/"
	log.Println("memory cgroup: create")
	err = os.MkdirAll(cgroupMemory, 0644)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("memory cgroup: add pid")
	err = ioutil.WriteFile(cgroupMemory+"tasks",
		[]byte(strconv.Itoa(pid)), 0644)
	if err != nil {
		log.Fatal(err)
	}
	if len(*memoryLimit) > 0 {
		log.Println("memory cgroup: set memory limit")
		err := ioutil.WriteFile(cgroupMemory+"memory.limit_in_bytes",
			[]byte(*memoryLimit), 0644)
		if err != nil {
			log.Fatal(err)
		}
	}
	if len(*swapLimit) > 0 {
		log.Println("memory cgroup: set swap limit")
		err := ioutil.WriteFile(cgroupMemory+"memory.memsw.limit_in_bytes",
			[]byte(*swapLimit), 0644)
		if err != nil {
			log.Fatal(err)
		}
	}

	// create container directories
	newRoot := baseDir + "/containers/" + id + "/rootfs"
	workDir := baseDir + "/containers/" + id + "/workdir"
	for _, path := range []string{newRoot, workDir} {
		err = os.MkdirAll(path, os.ModePerm)
		if err != nil {
			log.Fatal(err)
		}
	}

	// mount rootfs as overlay
	log.Printf("mount: overlay")
	imageRoot := baseDir + "/images/" + *image
	err = unix.Mount("overlay", newRoot, "overlay", uintptr(unix.MS_NODEV),
		"lowerdir="+imageRoot+",upperdir="+newRoot+",workdir="+workDir)
	if err != nil {
		log.Fatal(err)
	}

	// new mount namespace
	log.Printf("newns: mount")
	err = unix.Unshare(unix.CLONE_NEWNS)
	if err != nil {
		log.Fatal(err)
	}

	// remount rootfs in new namespace
	log.Printf("remount: /")
	err = unix.Mount("", "/", "", uintptr(unix.MS_PRIVATE|unix.MS_REC), "")
	if err != nil {
		log.Fatal(err)
	}

	// mount special
	mounts := []struct {
		source  string
		target  string
		fsType  string
		flags   uint
		options string
	}{
		{source: "proc", target: newRoot + "/proc", fsType: "proc"},
		{source: "sysfs", target: newRoot + "/sys", fsType: "sysfs"},
		{
			source:  "tmpfs",
			target:  newRoot + "/dev",
			fsType:  "tmpfs",
			flags:   unix.MS_NOSUID | unix.MS_STRICTATIME,
			options: "mode=755",
		},
		{
			source: "devpts",
			target: newRoot + "/dev/pts",
			fsType: "devpts",
		},
	}
	for _, mnt := range mounts {
		// ensure mount target exists
		log.Printf("mkdirall: %v", mnt.target)
		err := os.MkdirAll(mnt.target, os.ModePerm)
		if err != nil {
			log.Fatal(err)
		}

		// mount
		log.Printf("mount: %v (%v)", mnt.source, mnt.fsType)
		flags := uintptr(mnt.flags)
		err = unix.Mount(mnt.source, mnt.target, mnt.fsType, flags, mnt.options)
		if err != nil {
			log.Fatal(err)
		}
	}

	// essential file descriptors
	for i, name := range []string{"stdin", "stdout", "stderr"} {
		source := "/proc/self/fd/" + strconv.Itoa(i))
		target := newRoot + "/dev/" + name
		log.Printf("symlink: %v", name)
		err := unix.Symlink(source, target)
		if err != nil {
			log.Fatal(err)
		}
	}

	// create devices
	devices := []struct {
		name  string
		attr  uint32
		major uint32
		minor uint32
	}{
		{name: "null", attr: 0666 | unix.S_IFCHR, major: 1, minor: 3},
		{name: "zero", attr: 0666 | unix.S_IFCHR, major: 1, minor: 3},
		{name: "random", attr: 0666 | unix.S_IFCHR, major: 1, minor: 8},
		{name: "urandom", attr: 0666 | unix.S_IFCHR, major: 1, minor: 9},
		{name: "console", attr: 0666 | unix.S_IFCHR, major: 136, minor: 1},
		{name: "tty", attr: 0666 | unix.S_IFCHR, major: 5, minor: 0},
		{name: "full", attr: 0666 | unix.S_IFCHR, major: 1, minor: 7},
	}
	for _, dev := range devices {
		dt := int(unix.Mkdev(dev.major, dev.minor))
		log.Printf("mknod: %v (%v)", dev.name, dt)
		err := unix.Mknod(newRoot+"/dev/"+dev.name, dev.attr, dt)
		if err != nil {
			log.Fatal(err)
		}
	}
	// new UTS (UNIX Timesharing System) namespace
	log.Printf("newns: UNIX time sharing")
	err = unix.Unshare(unix.CLONE_NEWUTS)
	if err != nil {
		log.Fatal(err)
	}
	// change hostname in new UTS
	log.Printf("set hostname")
	err = unix.Sethostname([]byte(id))
	if err != nil {
		log.Fatal(err)
	}

	/*
		 * can't get it to work :,(
		// new process namespace
		log.Printf("newns: processes")
		err = unix.Unshare(unix.CLONE_NEWPID)
		if err != nil {
			log.Fatal(err)
		}
	*/

	// new network namespace
	log.Printf("newns: network")
	err = unix.Unshare(unix.CLONE_NEWNET)
	if err != nil {
		log.Fatal(err)
	}

	// pivot root
	log.Printf("pivot root")
	oldRootBeforePivot := newRoot + "/.old-root"
	oldRootAfterPivot := "/.old-root"
	err = os.MkdirAll(oldRootBeforePivot, os.ModePerm)
	if err != nil {
		log.Fatalf("mkdirall old root: %v", err)
	}

	err = unix.PivotRoot(newRoot, oldRootBeforePivot)
	if err != nil {
		log.Fatalf("pivot root: %v", err)
	}
	err = unix.Chdir("/")
	if err != nil {
		log.Fatalf("chdir: %v", err)
	}
	err = unix.Unmount(oldRootAfterPivot, unix.MNT_DETACH)
	if err != nil {
		log.Fatalf("unmount old root: %v", err)
	}
	err = unix.Rmdir(oldRootAfterPivot)
	if err != nil {
		log.Fatalf("rmdir old root: %v", err)
	}

	err = unix.Exec("/bin/sh", []string{"sh"}, []string{})
	log.Fatal(err)
}

Note: I used path.Join() in a previous version but decided to remove it. I found it made the code look cluttered. So this will not run properly should the POSIX standard ever decide to replace the path separator / with something else. I am willing to take this risk, though. 😉