Structured Logging with journald
There are a lot of philosophies around logging and I have based my solution around a couple of key ideas.
- Logging too much is better then not enough.
- Consequently, analysis by humans will not work. There is too much information for timely insights.
- Parsing human readable messages is difficult and error prone, thus the more machine-readable the better.
Based on these thoughts I wanted a structured logging solution. Furthermore for ease on management I wanted to have all of my logs to go to the same place. For a linux system running systemd the obvious answer was the systemd journal. It already gathers information such as service output and status, as well as acting as a syslog implementation. Journald also has other benefits such as adding trusted information to every log entry such as the time, user, service (if applicable) and executable. This means that rouge processes can't trick readers of the journal.
Getting started is easy, most of my logs are already going there. The biggest source of data I wasn't currently capturing was NGINX access logs, which were going to a custom log file. However since NGINX supports logging to syslog that was quickly remedied. These logs aren't structured but that is the best we can do for now. It would be an interesting project to add structured logging to NGINX, ideally supporting the systemd journal directly.
Managing and Analyzing the Logs
Journald provides some simple queries via the
journalctl command line tool. While this is good for simple cases, especially looking at the real time logs of a specific system it is insufficient for global queries, aggregating and alerts.
I decided to use sumologic to manage my logs. sumologc is a hosted log collection and query service. They have a good free tier, allowing myself to know what I am getting into before committing as well providing good pipeline latency (I can generally see my logs in queries within a minute). While they don't explicitly support structured logs they have support for parsing JSON allowing reliable queries over the structured data.
Uploading to sumologic is quite easy. While they don't natively support journald they have a simple HTTP API, allowing a simple script to upload my logs in JSON format. (This script also works with logentries and can probably handle just about any service that accepts logs via HTTP.)
Running my script is very easy, just provide the upload URL via environment variable and it will do it's job. Just make sure the user that journal-forwarder runs as has permission to view the full journal, otherwise you will only get the logs for that user. Generally adding the user to the
systemd-journal group is enough to provide this access.
Setting up Logging and Alerts
This will vary depending on the log aggregater you use however I would like to provide a couple of tips.
First of all, the most important alerts are inactivity alerts. You wont get errors from things that aren't running and it is easy to not realize when something goes down. Also, the most important inactivity alert is ensuring that all of your systems are logging. Otherwise all of your other alerts are unreliable.
In sumologic the following query alerts me if any of my servers stops logging. This script looks for the stop of logs (falling edge) so it has the downside that if it isn't run often enough it can miss the stop, however written this way it has the advantage that I don't need to explicitly add and remove hosts. It will automatically notice if any that has logged before stops and will stop bothering me after a while (which is nice when decommissioning nodes).
* | json auto "_HOSTNAME" | max(_receiptTime) as latest by _HOSTNAME | now() - latest as age | age > 2*60*1000 as old | where old | formatDate(tolong(latest), "yyyy-MM-dd HH:mm:ss.SSSZ") as %"Last Log At"
(Ab)using the System
Now that we have a method of getting structured information into our analysis system it is time to add more info. I was previously using DataDog to collect system information however I realized I could replace it with my new system quite easily, with the added advantage of more powerful queries and all of my data in the same place.
The first step was writing a tool to dump system statistics into the journal. So I wrote sysstat-journald. It utilizes sysstat/sar to gather system statistics, then simply mangles the name to please journald and inserts an entry with all of the collected information.
Running this script on a timer is enough to get all my statistics into sumologic. I run it every 10 seconds to get fairly fine grained information.
The second step is writing queries and creating dashboards. This shows how easy it is to add more data to this pipeline.
With a couple of simple steps, and a couple of shell scripts I am collecting a lot of information from all of my servers into a centeralized environment where I can easily answer questions, alert on arbartray conditions and create nice dashboards. I have also encouraged myself to keep logging more structured data becasue of the simplicity and rapid benefit.
I am immideatly a number graphs richer, and get alerts when something goes wrong, if you haven't figured out your logging yet I would highly recomend you do so. And whether or not you have, I hope you found my solution interesting.