Reducing Stackdriver Logging Resource Usage

Yesterday I received an alarming email from Google informing me about the new pricing model for Stackdriver Logging and that I am exceeding the free tier limit. The Stackdriver pricing model had a rough start, including some adjustments and postponements. As of today, charging is expected to start on March 31, 2018. This means that if I want to stay within the free tier, I should not exceed 50 GB of log intake per month. That is quite a lot for my small cluster, so why would it use more than that?

First Look

I decided to take a look at how bad the situation really was.

Woah! 😱 The morning of day 2 of the month, and I am already 37 GB in? Good thing charging has not yet started. Facing reality, I moved on to drill down into where the logs come from. Since I had a good portion of log data, chances were high I would find something in the logs, right? 😉 The resource table clearly showed me where to find the low-hanging fruit: the Month To Date (MTD) and projected End Of Month (EOM) numbers for the resource GKE Container topped everything else by orders of magnitude.

Reason 1: Google Kubernetes Engine Bug

Looking through the logs, I found a bug in the kubernetes dashboard's synchronizer. It had been firing multiple times per second for days:

09:18:54 Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.
09:18:54 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout
09:18:54 Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.
09:18:54 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout
09:18:54 Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.
09:18:54 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout

This produces quite some log volume for Stackdriver to ingest, and it piles up, adding to the overall bill. It's one of those moments where I catch myself mumbling "exponential backoff"…

To stop the torrent of log lines from the broken dashboard, I restarted the kubernetes dashboard pod. The hard way, of course:

$ kubectl -n kube-system delete pod kubernetes-dashboard-768854d6dc-j26qx
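
To confirm that a replacement pod came up cleanly (assuming the dashboard pods carry the usual k8s-app=kubernetes-dashboard label), we can list the pods again:

$ kubectl -n kube-system get pods -l k8s-app=kubernetes-dashboard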

Reason 2: Verbose Services

Note: This subsection’s data is sourced from a different cluster which did not experience the aforementioned bug but had a huge log intake for a different reason.

In another cluster I also experienced a huge log intake. However, there was no log spamming; this cluster was simply full of regular log lines. To find out whether some services produce significantly more log lines than others, I created a log-based metric.

This metric is basically just a counter of log lines, grouped by the resource label namespace_id. With this metric in place, I headed over to Stackdriver Monitoring and created a graph that plots the log lines per second grouped by namespace.
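
If you prefer the command line over the console, such a counter metric can be created with gcloud. This is a minimal sketch; the metric name is my choice, and the filter simply matches all GKE container log entries (grouping by the namespace_id resource label then happens when charting the metric in Stackdriver Monitoring):

$ gcloud logging metrics create container-log-lines \
    --description="Counts GKE container log entries" \
    --log-filter='resource.type="container"'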

Obviously, this is most valuable when every service is confined to exactly one namespace. Now I was able to spot the most verbose services and dug a bit deeper into them to reduce their verbosity.

Mitigation 1: Exclusion

The first solution to the high log intake problem is to take in fewer logs. How unexpected! Luckily, there is a method for that called Exclusion. On the resources page we can create exclusion rules (filters if you will) to reduce the log intake in a reasonable way. Reasonable here means allowing important log entries to enter the system while dropping the less useful ones.

[Screenshot: exclusion rule]

The following rule, for example, discards all log entries of log level INFO. It is a pretty simple example; however, we are free to use all the operators we know from regular log filtering. Exclusions are a powerful tool!

Here is a copy'n'paste friendly version of the same rule:

resource.type="container"
severity="INFO"

Note that you can even sample logs by creating an exclusion filter and setting the drop rate to a value below 100%. For my use case, an exclusion rate of 95% provides just enough samples to assess a past problem while keeping the log intake reasonable. During issue triage I recommend temporarily disabling exclusions, or at least adjusting them so that all related logs pass through.
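
If your setup supports the sample() function in exclusion filters, such a sampling rule could look roughly like the sketch below: it matches (and therefore drops) 95% of the INFO entries and lets the remaining 5% through.

resource.type="container"
severity="INFO"
sample(insertId, 0.95)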

Fun fact: Stackdriver logs the actions (create, delete, etc.) performed on exclusion rules, thus creating just another log source, the Log Exclusion log source. #inception

I wonder if one can create an exclusion rule for log exclusion. 🤔

Mitigation 2: Monitoring

The next log overdose mitigation technique I'd like to share uses a log-based metric to alert before things turn ugly. Stackdriver comes with some handy system metrics, that is, metadata emitted by the logging system itself. One of those data points is bytes_count. I use this metric in Stackdriver Monitoring to get an early warning whenever log intake exceeds the expected level.

Here is my policy using a Metric Threshold condition:

[Screenshot: alerting policy with a Metric Threshold condition]

Let’s have a closer look at the metric threshold.

I am monitoring the resource type Log Metrics and, within that resource type, the metric “Log bytes”.

An acceptable intake rate for me is 10 kB/s. If hit constantly, that results in about 24.2 GB of total log intake in a 28-day month and about 26.8 GB in one of those longer 31-day months. Both values leave good room for unforeseen issues and reaction time.
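
For the curious, those numbers are easy to double-check on the shell (10 kB/s times 86,400 seconds per day times the number of days, divided by 10^9 bytes per GB):

$ echo "scale=3; 10000*86400*28/10^9" | bc
24.192
$ echo "scale=3; 10000*86400*31/10^9" | bc
26.784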

As you can see in the graph, my cluster was way beyond that threshold for quite a while. That was caused by the bug I described earlier, which took me some time to find. With this alert in place, the same or similar bugs will trigger a notification after a one-minute grace period that tolerates short log bursts.

Before I wrap this up, one word of caution: thresholds set too low may harm your inbox! 😅 Been there, done that.

Conclusion

Stackdriver’s warning email may sound scary, but there are ways to gain control over the log intake and also be prepared for unforeseen issues by having metrics-based alerts in place.

Serving a static website using Firebase Hosting

How to (mis)use Firebase Hosting to host a static website for free.

Firebase provides mobile app developers with some nice ready-to-use backend services such as user authentication, real-time database, crash reporting, and analytics. Many apps nowadays come with static content that is loaded on demand and not built into the app. For this type of content Firebase provides a hosting solution called Firebase Hosting.

According to the pricing information (as of the time of writing), one gigabyte of data storage and 10 gigabytes of monthly data transfer are free, including a TLS certificate and custom domain support. That makes Firebase Hosting interesting for serving static websites. One might argue, though, that this is not really what it was made for. On the other hand, in times of mobile first (or mobile only), chances are high that most users of a website are mobile users (whatever that means).

Bringing the cake online

Let’s assume we have a static website that informs visitors about the cake being a lie.

[Screenshot: cakelie website]

The website is fairly simple: It consists of a single HTML page and a picture of the virtual cake. Virtual, because the cake is a lie and you will never have it! 😉

$ tree
.
├── index.html
└── index.png

0 directories, 2 files

I found that small websites like this, for example statically rendered source code documentation, are perfect candidates for this kind of hosting service.

Create a Firebase project

Before we can create a Firebase project for our cake information portal, we have to install the firebase-tools package via the Node package manager (npm):

$ npm install -g firebase-tools

If we have never logged into Firebase from the command line before, we will be missing the credentials needed for the next steps. Let's log in first to get this out of the way:

$ firebase login

With the tools installed and credentials set up, we can initialize a new project in the current folder. Note the fancy Unicode output of the firebase-tools! 😎

$ firebase init
✂ī¸
? Which Firebase CLI features do you want to setup for this folder? Press Space to select features, then Enter to confirm your choices.
 ◯ Database: Deploy Firebase Realtime Database Rules
 ◯ Firestore: Deploy rules and create indexes for Firestore
 ◯ Functions: Configure and deploy Cloud Functions
❯◉ Hosting: Configure and deploy Firebase Hosting sites
 ◯ Storage: Deploy Cloud Storage security rules
✂ī¸
 ? Select a default Firebase project for this directory:
  [don't setup a default project]
❯ [create a new project]
? What do you want to use as your public directory? public
✂ī¸
? Configure as a single-page app (rewrite all urls to /index.html)? No
✂ī¸
✔  Firebase initialization complete!

We select Hosting only and choose to create a new project. Using the public/ folder for our files is fine, but we do not want every URL rewritten to a single index.html file, so we answer the single-page app question with No. We will take care of the content ourselves.
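
For reference, the generated firebase.json should look roughly like this (details vary between firebase-tools versions):

$ cat firebase.json
{
  "hosting": {
    "public": "public"
  }
}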

Now we have to head over to the Firebase Console and create a new project there as well. Unfortunately, the command line tools do not support new project creation yet.

[Screenshot: new Firebase project]

Now we need to link our local project with the project we just created at the Firebase console:

$ firebase use --add
? Which project do you want to add? cakelie-static
? What alias do you want to use for this project? (e.g. staging) production

Created alias production for cakelie-static.
Now using alias production (cakelie-static)

Deploy the project

We are almost there! Before we deploy our project we have to move our static files to the public/ folder:

$ mv index.{html,png} public/

Fasten your seatbelt, we are ready to deploy:

$ firebase deploy

=== Deploying to 'cakelie-static'...

i  deploying hosting
i  hosting: preparing public directory for upload...
✔  hosting: 3 files uploaded successfully

✔  Deploy complete!

Project Console: https://console.firebase.google.com/project/cakelie-static/overview
Hosting URL: https://cakelie-static.firebaseapp.com

Here it is: the website is deployed and ready, including a valid TLS certificate, and fully backed by a Content Delivery Network (CDN):

[Screenshot: cakelie website in the browser]

Custom domain

Another nice feature of Firebase Hosting is custom domains. I mapped the cakelie.net domain to the Firebase project and can now stop worrying about hosting this particularly important website myself. 😂

IPv6 and Let's Encrypt TLS on Google Kubernetes Engine

In a previous article I described how I deployed my blog on kubernetes and served it over HTTP. Today I’d like to add three more pieces:

  • Automate Let’s Encrypt certificate retrieval (and renewal)
  • Add a TLS-capable load balancer
  • Add IPv6 support (because it’s 2017)

Automating certificate management

Thanks to Let's Encrypt, web servers can request trusted, signed certificates for free in a fully automated manner. A web traffic load balancer is basically a proxy server, acting like a web server on the frontend and like an HTTP client towards the backend. So why not let the load balancer's frontend (the web server part) take care of fetching a certificate from Let's Encrypt? We have seen other web servers, such as Caddy, take care of certificate management.

Unfortunately, this feature is not available on Google Cloud Platform (GCP). Furthermore, I can imagine this working fine with a single load balancer but failing at scale in a multi-balancer setup. The reason is that Let's Encrypt enforces rate limits: one can only request so many certificates per week. But even with unlimited API access, it would still be a non-trivial task to make sure the right load balancer responds to the HTTP challenge request from Let's Encrypt.

What we need to address the problem is software that retrieves and renews certificates and deploys them to our load balancer(s) whenever a relevant change occurs. A relevant change could be a modified hostname, a new subdomain, or the approaching expiration date of a currently deployed certificate. Fortunately, there are tools for that already, and they run on kubernetes, which makes deployment really straightforward.

In this article we will use kube-lego, but I can highly recommend cert-manager, too. Of course, for non-production use cases only. 😉

Note: If your kubernetes cluster has Role-Based Access Control (RBAC) enabled, grant kube-lego the required privileges before you proceed!
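
A quick, deliberately coarse way to do that is to bind the default service account of the kube-lego namespace (which we create below) to the cluster-admin role. This is just a sketch to get going; a narrower, purpose-built ClusterRole is preferable for anything beyond experiments:

$ kubectl create clusterrolebinding kube-lego \
    --clusterrole=cluster-admin \
    --serviceaccount=kube-lego:default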

Deploying kube-lego

Like every other workload, we like to cage kube-lego into a dedicated namespace. We define the namespace in k8s/kube-lego.ns.yaml:

apiVersion: v1
kind: Namespace
metadata:
  name: kube-lego

And create it via the command line tool kubectl:

$ kubectl create -f k8s/kube-lego.ns.yaml

The next step is to define and configure the kube-lego deployment in k8s/kube-lego.deployment.yaml. For the initial deployment of kube-lego, I recommend setting LEGO_LOG_LEVEL to debug:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-lego
  namespace: kube-lego
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: kube-lego
    spec:
      containers:
      - name: kube-lego
        image: jetstack/kube-lego:0.1.5
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
        env:
        - name: LEGO_LOG_LEVEL
          value: info  # more verbose: debug
        - name: LEGO_EMAIL
          value: mail@example.com  # change this!
        - name: LEGO_URL
          value: https://acme-v01.api.letsencrypt.org/directory
        - name: LEGO_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: LEGO_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        resources:
          limits:
            cpu: 100m
            memory: 50Mi
          requests:
            cpu: 50m
            memory: 50Mi
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 1

Once the namespace exists, we can create the deployment and check whether it succeeded:

$ kubectl create -f k8s/kube-lego.deployment.yaml
✂ī¸
$ kubectl -n kube-lego get deployments
NAME        DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-lego   1         1         1            1           10m

Tip: Consider using a ConfigMap as an alternative to hard-coding configuration parameters into the deployment.
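
As a sketch of how that could look (the ConfigMap name is my choice), the static values could live in a ConfigMap and be referenced from the deployment via envFrom or configMapKeyRef instead of literal values:

$ kubectl -n kube-lego create configmap kube-lego-config \
    --from-literal=LEGO_LOG_LEVEL=info \
    --from-literal=LEGO_EMAIL=mail@example.com \
    --from-literal=LEGO_URL=https://acme-v01.api.letsencrypt.org/directory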

Adding a TLS-enabled load balancer

With kube-lego there are two different ways of defining a load balancer. The easier (but more expensive) one is to use a load balancer provided by GCP. The alternative is deploying an nginx ingress pod and using that as the load balancer. I got good results from both in my experiments. For the sake of brevity, we will use the quicker GCP way in this article.

First, we need to create a kubernetes ingress object to balance and proxy incoming web traffic. The important part here is that we can influence the behavior of the ingress object by providing annotations.

  • kubernetes.io/ingress.class: "gce" This annotation lets kubernetes know that we want to use a GCP load balancer for ingress traffic. Obviously, this annotation does not make sense on kubernetes installations that do not run on GCP.
  • kubernetes.io/tls-acme: "true" This annotation allows kube-lego to manage the domains and certificates referenced in this ingress object for us. If we leave out this annotation, kube-lego will refrain from touching the object or its associated kubernetes secrets.

The complete ingress definition goes into k8s/website.ingress.yaml:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/tls-acme: "true"
    kubernetes.io/ingress.class: "gce"
  name: website
  namespace: website
spec:
  rules:
  - host: test.danrl.com
    http:
      paths:
      - backend:
          serviceName: website
          servicePort: 80
        path: /
  tls:
  - hosts:
    - test.danrl.com
    secretName: test-danrl-com-certificate

We create the ingress object:

$ kubectl create -f k8s/website.ingress.yaml

It may take a while for the ingress object to become fully visible. GCP is not the fastest fellow to spin up new load balancers in my experience. ⏱

$ kubectl -n website get ingress
NAME      HOSTS            ADDRESS       PORTS     AGE
website   test.danrl.com   35.196.54.8   80, 443   3m

Very soon after the load balancer is up and running, kube-lego should jump in, notice the missing certificate, fetch one, and deploy it automatically. Awesome! We can watch this process in the logs. I use Stackdriver for collecting logs from kubernetes workloads, but there are many other options as well. Wherever your logs are, look out for a line similar to this one:

level=info msg="requesting certificate for test.danrl.com" context="ingress_tls" name=website namespace=website
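
The same line can also be spotted by tailing kube-lego directly from the cluster, for example:

$ kubectl -n kube-lego logs deployment/kube-lego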

Once the requested certificate has been received, kube-lego will create or update the secret for it. We can verify the existence of the secret:

$ kubectl -n website get secrets
NAME                          TYPE                                  DATA      AGE
✂ī¸
test-danrl-com-certificate    kubernetes.io/tls                     2         22m
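
To take a closer look at what was actually issued, we can pull the certificate out of the secret and inspect its subject and validity dates; a sketch using openssl:

$ kubectl -n website get secret test-danrl-com-certificate \
    -o jsonpath='{.data.tls\.crt}' | base64 --decode \
    | openssl x509 -noout -subject -dates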

From now on, kube-lego will monitor the certificate and renew and replace it as necessary. The certificate should also show up in the load balancer configuration on the GCP console at Network Services → Load balancing → Certificates (you may have to enable the advanced menu at the bottom):

[Screenshot: initial certificate]

To test the automation further, we could trigger a certificate renewal by tweaking the LEGO_MINIMUM_VALIDITY environment variable (optional). For reference, here is the automatically retrieved follow-up certificate I got:

[Screenshot: follow-up certificate]

Adding IPv6 to the load balancer

In the standard configuration, GCP load balancers are created without an IPv6 address assigned. Technically, they can handle IPv6 traffic, and we are free to assign IPv6 addresses to them. To do this, we first have to reserve a static IPv6 address. This is done at VPC network → External IP addresses.

[Screenshot: VPC external IP addresses]

Reserving an address means that it cannot be used by anyone else on the platform. If we reserve addresses but do not use them, charges will apply.

[Screenshot: reserving a static address]
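
If you prefer the command line, the same reservation can be made with gcloud; the address name here is just an example:

$ gcloud compute addresses create website-ipv6 \
    --global \
    --ip-version IPV6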

Once the address is reserved, we can assign it to the load balancer. To do that, we have to add an additional frontend for every address and every protocol (HTTP, HTTPS). That is, two frontends for each additional address.

[Screenshot: adding IPv6 frontends to the load balancer]

We have to do the same for HTTPS, too, of course. When setting the IPv6 HTTPS frontend, we select the current certificate from the dropdown menu.

Almost automated… 😤

And now I have some bad news for you. ☹ī¸ IPv6 load balancer frontends, certificate renewal via kube-lego, and GCP load balancers do not go well together (as of the time of writing). When kube-lego renews the certificates, it ignores manually added frontends. This means the certificate for the IPv6 frontend will not be replaced automatically. Very frustrating!

[Screenshot: certificates differ]

In the screenshot we can see the new certificate k8s-ssl-1-website2-website2--a02b6ae745a706f8 alongside the old one k8s-ssl-website2-website2--a02b6ae745a706f8. The certificate was replaced only for the IPv4 frontend.