My Little Helper: Slack Bot

As a Site Reliability Engineer (SRE) I spend a significant amount of my time on the Linux console. I also spend some time writing software and tooling. And I spend a lot of time on Slack (substitute your organization's preferred chat platform) communicating with humans. [1]

Bridging these domains often requires copying and pasting information, and sometimes reformatting it. At one point, I was so annoyed by moving console output from a terminal window to Slack that I decided to find a better way.

My idea was to have a single, statically linked binary that I can scp to a system and that would run without further actions. The job of that binary helper would be to post to Slack on my behalf.

To honor the great inventor (and actress) Hedy Lamarr, I named my little helper after her.

Great, we have a problem, a solution, and the name hedybot. But what are actual use cases in a world of (or at least striving for) full automation? Actually, there are still a lot of manual tasks left, including:

  • Tasks that require human oversight to avoid disaster
  • One-time but often long-running tasks

Manual Deployment Notifications

One example of a task that requires human oversight at my workplace is the deployment of Domain Name System (DNS) changes. Since a mistake here can easily cost thousands of dollars and an immeasurable loss of customer trust, we tend to have an experienced engineer deploy the changes. For additional assurance, we always post the deployed changes to Slack for everyone to read. People double-check and sometimes ask questions about the changes. That is a wonderful use case for hedybot! Here it is in action, using dns-tools:

$ rrpush --quiet --dry-run=false --delay=0 --no-color 2>&1 \
  | hedybot --channel-id=FOO2342 --title="Deployment on Production DNS"

In Slack it looks like this.

[Screenshot: the hedybot message in Slack]

By the way, the color follows a loose internal convention and is hardcoded. A potential improvement is to make the color configurable via a command-line flag.
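One way such a flag could look is sketched below. It is only an illustration: the flag name `color` and the helper `validateColor` are assumptions and not part of hedybot. The default keeps the hardcoded orange, and the value is checked before it would be handed to the Slack attachment.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"regexp"
)

// hexColor matches colors of the form #RRGGBB, for example #FFA500.
var hexColor = regexp.MustCompile(`^#[0-9A-Fa-f]{6}$`)

// validateColor (hypothetical helper) returns the color unchanged
// if it is a valid hex color and an error otherwise.
func validateColor(c string) (string, error) {
	if !hexColor.MatchString(c) {
		return "", fmt.Errorf("invalid color %q, want #RRGGBB", c)
	}
	return c, nil
}

func main() {
	// The flag name "color" is an assumption; the default is the
	// orange that is currently hardcoded.
	color := flag.String("color", "#FFA500", "Hex color of the message bar")
	flag.Parse()

	c, err := validateColor(*color)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// In hedybot, c would be assigned to the attachment's Color field.
	fmt.Println("using color", c)
}
```

Validating early means a typo in the flag fails on the console instead of producing an oddly colored message.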

Long-running Jobs

Another great use case for hedybot is a long-running job. Let’s assume there is a server that we need to wipe to comply with regulations. One could easily lose track of such a task once it is started. Daily business and occasional firefighting push less urgent matters aside and soon our brain has forgotten about them. This is where a little helper comes in handy by posting a quick message:

$ dd if=/dev/urandom of=/dev/sdx bs=4096; \
  echo "disk erase finished" | hedybot --title="Example Server"

The resulting message is clear and simple:

[Screenshot: the "disk erase finished" message in Slack]

Thanks to the timely reminder, we can decommission the server right away and save some money here.

Hedybot Source Code

Here is the Go code that I used. Grab it to craft your own little helper.

package main

import (
  "flag"
  "io/ioutil"
  "log"
  "os"

  "github.com/nlopes/slack"
)

const (
  // fetch API key from your slack workspace
  apiKey = "xxxx-xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxx"
)

func main() {
  channelID := flag.String("channel-id", "C85PT1ULR",
    "ID of the channel to post to")
  title := flag.String("title", "Message",
    "Title for the message to post")
  flag.Parse()

  bytes, err := ioutil.ReadAll(os.Stdin)
  if err != nil {
    log.Fatalf("read stdin: %v", err)
  }
  if len(bytes) < 1 {
    log.Fatalf("stdin is empty")
  }
  report := string(bytes)

  params := slack.PostMessageParameters{
    AsUser: true,
    Attachments: []slack.Attachment{
      {
        Color: "#FFA500",
        Fields: []slack.AttachmentField{
          {
            Title: *title,
            Value: report,
          },
        },
      },
    },
  }
  api := slack.New(apiKey)
  _, _, err = api.PostMessage(*channelID, "", params)
  if err != nil {
    log.Fatalf("post report: %v", err)
  }
}
  1. The tricky part of my job is to figure out which activity is worth automating, which activity requires time boxing, and when going deep into the details is advised.

Reducing Stackdriver Logging Resource Usage

Yesterday I received an alarming email from Google informing me about the new pricing model for Stackdriver logging and that I am exceeding the free tier limit. The Stackdriver pricing model had a rough start, including some adjustments and postponements. As of today, charging is expected to start on March 31, 2018. This means that if I want to stay within the free tier limit, I should not exceed 50GB of log intake per month. That is quite a lot for my small cluster, so why would it use more than that?

First Look

I decided to take a look at how bad the situation really was.

Whoa! 😱 The morning of day 2 of the month, and I am already 37GB in? Good thing charging has not yet started. Facing reality, I moved on to drill down into where the logs come from. Since I had a good portion of log data, chances were high I would find something in the logs, right? 😉 The resource table clearly showed me where to find the low-hanging fruit. The Month To Date (MTD) and projected End Of Month (EOM) numbers for the resource GKE Container top everything else by orders of magnitude.

Reason 1: Google Kubernetes Engine Bug

Looking through the logs I found out that there is a bug in the synchronizer. It has been firing multiple times per second for days:

09:18:54 Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.
09:18:54 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout
09:18:54 Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.
09:18:54 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout
09:18:54 Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.
09:18:54 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout

This produces quite some log volume for Stackdriver to ingest, and that piles up, adding to the overall bill. It’s one of those moments where I catch myself mumbling "exponential backoff".

To stop the torrent of log lines from the broken dashboard, I restarted the Kubernetes dashboard pod. The hard way, of course:

$ kubectl -n kube-system delete pod kubernetes-dashboard-768854d6dc-j26qx

Reason 2: Verbose Services

Note: This subsection’s data is sourced from a different cluster which did not experience the aforementioned bug but had a huge log intake for a different reason.

In another cluster I also experienced a huge intake of logs. However, there was no log spamming, meaning that this cluster was just full of regular log lines. To find out whether some services produce significantly more log lines than others, I created a log-based metric.

This metric is basically just a counter of log lines, grouped by the resource label namespace_id. With this metric in place, I headed over to Stackdriver Monitoring and created a graph that plots the log lines per second grouped by namespace.

Obviously, this is most valuable when every service is confined to exactly one namespace. Now I was able to spot the most verbose services and dug a bit deeper into them to reduce their verbosity.
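Conceptually, the metric does nothing more than the following sketch: it counts entries and groups them by namespace. The entry type and the sample data are made up for illustration; in Stackdriver the counting happens inside the logging system, not in your own code.

```go
package main

import "fmt"

// entry is a simplified, hypothetical log entry. Only the namespace
// label matters for this metric.
type entry struct {
	NamespaceID string
	Message     string
}

// countByNamespace returns the number of log entries per namespace.
func countByNamespace(entries []entry) map[string]int {
	counts := make(map[string]int)
	for _, e := range entries {
		counts[e.NamespaceID]++
	}
	return counts
}

func main() {
	// Made-up sample data.
	entries := []entry{
		{NamespaceID: "shop", Message: "GET /cart"},
		{NamespaceID: "shop", Message: "GET /checkout"},
		{NamespaceID: "kube-system", Message: "watch ended"},
	}
	for ns, n := range countByNamespace(entries) {
		fmt.Printf("%s: %d\n", ns, n)
	}
}
```

Plotting these counts over time is what makes a single chatty namespace stand out.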

Mitigation 1: Exclusion

The first solution to the high log intake problem is to take fewer logs in. How unexpected! Luckily, there is a method for that called Exclusion. On the resources page we can create exclusion rules (filters, if you will) to reduce the log intake in a reasonable way. Reasonable here means allowing important log entries to enter the system while dropping the less useful ones.


The following rule, for example, discards all log entries of log level INFO. It is a pretty simple example, but we are free to use all the nice operators we know from regular log filtering activities. Exclusions are a powerful tool!

Here is a copy’n’paste friendly version of the same rule.

resource.type="container"
severity="INFO"

Note that you can even sample logs by creating an exclusion filter and setting the drop rate to a value less than 100%. For my use case, an exclusion rate of 95% provides me with just enough samples to assess a past problem while keeping the log intake amount reasonable. During issue triage I recommend disabling exclusions temporarily, or at least adjusting them to pass all related logs.
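To make the sampling idea concrete, here is a small sketch of what an exclusion with a 95% rate amounts to: an entry is dropped only if it matches the filter and a random draw falls below the rate. This illustrates the principle only and says nothing about how Stackdriver implements it; the logEntry type and the excluded function are assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
)

// logEntry is a simplified, hypothetical log entry.
type logEntry struct {
	ResourceType string
	Severity     string
}

// excluded reports whether an entry is dropped by an exclusion that
// matches container logs of severity INFO and drops the fraction
// `rate` of all matches. r is a random number in [0, 1).
func excluded(e logEntry, rate, r float64) bool {
	matches := e.ResourceType == "container" && e.Severity == "INFO"
	return matches && r < rate
}

func main() {
	const total = 100000
	kept := 0
	for i := 0; i < total; i++ {
		e := logEntry{ResourceType: "container", Severity: "INFO"}
		if !excluded(e, 0.95, rand.Float64()) {
			kept++
		}
	}
	// Roughly 5% of the matching entries survive.
	fmt.Printf("kept %d of %d entries\n", kept, total)
}
```

Entries that do not match the filter, such as errors, are never dropped, no matter how high the rate is.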

Fun fact: Stackdriver logs the actions (create, delete, etc.) performed on exclusion rules, thus creating just another log source, the Log Exclusion log source. #inception

I wonder if one can create an exclusion rule for log exclusion. 🤔

Mitigation 2: Monitoring

The next log overdose mitigation technique I would like to share uses a log-based metric to alert before things turn ugly. Stackdriver comes with some handy system metrics. System metrics are metadata about the logging system itself. One of those data points is byte_count. I use this metric in the Stackdriver Monitoring system to get an early warning if log intake exceeds the expected levels.

Here is my policy using a Metric Threshold condition:

[Screenshot: alerting policy with a Metric Threshold condition]

Let’s have a closer look at the metric threshold.

I am monitoring the metric “Log bytes” of the resource type Log Metrics.

An acceptable intake rate for me is 10kB/s. If hit constantly, that results in about 24.2GB of total log intake in a 28-day month and about 26.8GB in one of those longer 31-day months. Both values leave some good room for unforeseen issues and reaction time.
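The numbers can be reproduced with a few lines of Go. The sketch assumes decimal units (1kB = 1,000 bytes, 1GB = 1,000,000,000 bytes); the function name monthlyGB is made up for this example.

```go
package main

import "fmt"

// monthlyGB returns the total intake in GB (10^9 bytes) for a constant
// rate in kB/s (10^3 bytes per second) over the given number of days.
func monthlyGB(rateKBps float64, days int) float64 {
	const secondsPerDay = 24 * 60 * 60
	totalBytes := rateKBps * 1e3 * secondsPerDay * float64(days)
	return totalBytes / 1e9
}

func main() {
	for _, days := range []int{28, 31} {
		fmt.Printf("%d days: %.1f GB\n", days, monthlyGB(10, days))
	}
	// Prints:
	// 28 days: 24.2 GB
	// 31 days: 26.8 GB
}
```

Running the same calculation backwards, the 50GB free tier corresponds to a constant rate of a little under 19kB/s in a 31-day month, so a 10kB/s threshold leaves roughly half of the budget as headroom.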

As you can see in the graph, my cluster was way beyond that threshold for quite a while. That was the bug I described earlier and which took me some time to find. With that alert in place, the same or similar bugs will fire an alert after a 1-minute grace period for log bursts.
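The effect of the grace period can be sketched as follows: the alert fires only if the rate stays above the threshold for a full minute without interruption, so a short burst does not trigger it. This is a simplified model with one sample per second; Stackdriver's actual evaluation works on aligned time series and will differ in detail. The function name fires is an assumption.

```go
package main

import "fmt"

// fires reports whether the samples (one per second, in bytes/s) stay
// above the threshold for at least graceSeconds consecutive seconds.
func fires(samples []float64, threshold float64, graceSeconds int) bool {
	run := 0 // length of the current streak above the threshold
	for _, s := range samples {
		if s > threshold {
			run++
			if run >= graceSeconds {
				return true
			}
		} else {
			run = 0
		}
	}
	return false
}

// constant returns n samples that all have the value v.
func constant(n int, v float64) []float64 {
	out := make([]float64, n)
	for i := range out {
		out[i] = v
	}
	return out
}

func main() {
	const threshold = 10000 // 10kB/s
	burst := constant(30, 20000)     // 30 seconds above the threshold
	sustained := constant(90, 20000) // 90 seconds above the threshold
	fmt.Println("30s burst fires:", fires(burst, threshold, 60))
	fmt.Println("90s overload fires:", fires(sustained, threshold, 60))
}
```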

Before I wrap this up, one word of caution: Thresholds set too low may harm your inbox! 😅 Been there, done that.

Conclusion

Stackdriver’s warning email may sound scary, but there are ways to gain control over the log intake and also be prepared for unforeseen issues by having metrics-based alerts in place.

Serving a static website using Firebase Hosting

How to (mis)use Firebase Hosting to host a static website for free.

Firebase provides mobile app developers with some nice ready-to-use backend services such as user authentication, real-time database, crash reporting, and analytics. Many apps nowadays come with static content that is loaded on demand and not built into the app. For this type of content Firebase provides a hosting solution called Firebase Hosting.

According to the pricing information (as of the time of writing), 1GB of data storage and 10GB of monthly data transfer are free, including a TLS certificate and a custom domain. That makes Firebase Hosting interesting for serving static websites. One might argue, though, that this is not really what it was made for. On the other hand, in times of mobile first (or mobile only), chances are high that most users of a website are mobile users (whatever that means).
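To get a feeling for what the free transfer volume means for a small site, a back-of-the-envelope calculation helps. The sketch below assumes decimal units and a hypothetical page weight of 500kB (HTML plus image); it ignores caching and protocol overhead, so treat the result as a rough order of magnitude only.

```go
package main

import "fmt"

// maxPageViews estimates how many page views fit into a monthly
// transfer budget. transferGB is in GB (10^9 bytes), pageKB is the
// size of one page view in kB (10^3 bytes).
func maxPageViews(transferGB, pageKB float64) int {
	return int(transferGB * 1e9 / (pageKB * 1e3))
}

func main() {
	// 500kB per page view is an assumed value, not a measurement.
	fmt.Println("page views per month:", maxPageViews(10, 500))
	// Prints: page views per month: 20000
}
```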

Bringing the cake online

Let’s assume we have a static website that informs visitors about the cake being a lie.

[Screenshot: cakelie website]

The website is fairly simple: It consists of a single HTML page and a picture of the virtual cake. Virtual, because the cake is a lie and you will never have it! 😉

$ tree
.
├── index.html
└── index.png

0 directories, 2 files

I found that small websites like this, for example statically rendered source code documentation, are perfect candidates for this kind of hosting service.

Create a Firebase project

Before we can create a Firebase project for our cake information portal, we have to install the firebase-tools package via npm, the Node package manager:

$ npm install -g firebase-tools

If we have never logged into Firebase from the command line before, we will be missing the credentials to perform actions in the next steps. Let’s log in first to get this out of the way:

$ firebase login

With the tools installed and credentials set up, we can initialize a new project in the current folder. Note the fancy Unicode output of the firebase-tools! 😎

$ firebase init
✂️
? Which Firebase CLI features do you want to setup for this folder? Press Space to select features, then Enter to confirm your choices.
 ◯ Database: Deploy Firebase Realtime Database Rules
 ◯ Firestore: Deploy rules and create indexes for Firestore
 ◯ Functions: Configure and deploy Cloud Functions
❯◉ Hosting: Configure and deploy Firebase Hosting sites
 ◯ Storage: Deploy Cloud Storage security rules
✂️
 ? Select a default Firebase project for this directory:
  [don't setup a default project]
❯ [create a new project]
? What do you want to use as your public directory? public
✂️
? Configure as a single-page app (rewrite all urls to /index.html)? No
✂️
✔  Firebase initialization complete!

We select Hosting only and choose to create a new project. Using the public/ folder for files is fine. We decline the single-page app option, because we do not want every URL rewritten to /index.html. We will take care of the content ourselves.

Now we have to head over to the Firebase Console and create a new project there as well. Unfortunately, the command-line tools do not support new project creation yet.

[Screenshot: new Firebase project]

Now we need to link our local project with the project we just created at the Firebase console:

$ firebase use --add
? Which project do you want to add? cakelie-static
? What alias do you want to use for this project? (e.g. staging) production

Created alias production for cakelie-static.
Now using alias production (cakelie-static)

Deploy the project

We are almost there! Before we deploy our project we have to move our static files to the public/ folder:

$ mv index.{html,png} public/

Fasten your seatbelt, we are ready to deploy:

$ firebase deploy

=== Deploying to 'cakelie-static'...

i  deploying hosting
i  hosting: preparing public directory for upload...
✔  hosting: 3 files uploaded successfully

✔  Deploy complete!

Project Console: https://console.firebase.google.com/project/cakelie-static/overview
Hosting URL: https://cakelie-static.firebaseapp.com

Here it is, the website is deployed and ready, including a valid TLS certificate and fully backed by a Content Delivery Network (CDN):

[Screenshot: cakelie website in browser]

Custom domain

Another nice feature of Firebase Hosting is support for custom domains. I mapped the cakelie.net domain to the Firebase project and can now stop worrying about hosting this particularly important website myself. 😂