For the most part, I try to avoid the trap of using jargon that would go over the heads of an intelligent but non-technical audience. Then someone highly technical reminds me that they had to look up some term that I used, and I’m back to defining things carefully. Or someone misunderstands me, and I realise that a term is thoroughly overloaded and I need to clarify the context.
Case in point. Outside of the technical space, overloading is something you do to a vehicle or receptacle intended for the transport of goods. We use the term in the technical space to refer to a variable or function which has a different meaning or role in a different context. I then extrapolate that meaning into my everyday language, and suggest that terms we use might have different meanings in different contexts. I believe the linguists would call this polysemy (and yes, I did have a “surely there’s a word for this?” moment and look it up).
Spelling out RED metrics
The term of the day is a colourful acronym used in the DevOps and monitoring space. It is not ubiquitous, but I kind of feel like it should be. RED expands to Requests, Errors, and Duration: the three most critical metrics for any remote system. If the thing is running on your machine, or in your browser, then you probably don’t need them. If there is a server somewhere that is supposed to be staying alive and serving traffic, then you want to keep an eye on these.
Request metrics are the most nuanced, because they are tied to your business logic and implementation details. A request is what happens when a client comes to your service and asks for an action to be taken. There are all manner of ways this can happen, but the easiest to understand is to consider HTTP requests. The browser makes a call to the server with a method type, a location, and some parameters. Or, if you’re old-school like me, it might be a terminal command using cURL. As an example, this command will return the HTML view of a simple Google search.
➤ curl -X GET "https://www.google.com/search?q=REST"
The dimensions we would be interested in for this request are the method (GET, rather than POST or DELETE) and the path (/search). By aggregating these counts into a graph, you can very quickly determine the level of traffic being driven to your service. In the case of Google, that is obviously a constant high number. If the number of search requests suddenly dropped to zero, someone, somewhere would start panicking. In the case of a tiny startup system, seeing one or two requests in an hour can be exciting, and prove to your business people that you have customers at all.
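To make that concrete, here is a minimal sketch of how you might count requests with the Prometheus Python client. The metric name, the label names, and the handle_request function are illustrative choices of mine, not a standard.

from prometheus_client import Counter, start_http_server

REQUESTS = Counter(
    "http_requests_total",           # illustrative name, pick your own
    "Total HTTP requests received",
    ["method", "path"],              # the dimensions we want to slice traffic by
)

def handle_request(method: str, path: str):
    # Count the request, then go on to do the real work of serving it.
    # Use route templates (e.g. /users/{id}) rather than raw paths to keep
    # the number of distinct label values under control.
    REQUESTS.labels(method=method, path=path).inc()

if __name__ == "__main__":
    start_http_server(8000)          # exposes /metrics for Prometheus to scrape
    handle_request("GET", "/search")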
The importance of error metrics is clear to most people. I really want to know if I failed to serve my customers, particularly if it is because something broke internally. Depending on the maturity and type of usage of your system, you might want to dive in a bit deeper. Error metrics are usually broken out on the same axes as request metrics, and then split further by response code. HTTP response codes are worth reading through, just to get a sense of what they might mean. They are a standard which helps clients and servers communicate clearly, and which gives us some places to monitor. As a really good rule of thumb to start with, any time you return a 5xx response to the client, you should investigate what went wrong. If your only client is a web page or app which you own, you should have sufficient client-side validation to be able to investigate all 4xx errors (except 401, which indicates that login is required). If you have clients calling via code you do not own, then those errors are much harder to investigate. You still want them to be minimised, but sometimes people just get things wrong.
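Error metrics can ride on the same kind of counter, with the response code added as a label so the 5xx and 4xx slices can be graphed separately. A small sketch continuing the Python example above (again, the names are mine):

from prometheus_client import Counter

RESPONSES = Counter(
    "http_responses_total",
    "HTTP responses sent, by method, path and status code",
    ["method", "path", "status"],
)

def record_response(method: str, path: str, status: int):
    RESPONSES.labels(method=method, path=path, status=str(status)).inc()

# In the dashboard you would graph (and alert on) the 5xx slice of this counter,
# and keep an eye on 4xx for clients you control.
record_response("GET", "/search", 200)
record_response("POST", "/search", 500)   # this one deserves an investigation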
For both request and error metrics, a sudden change in behaviour should catch your attention. Suddenly having more of something or less of it is worth checking, with varying levels of urgency. Increased traffic means you should verify you are scaled sufficiently to handle it. A decrease in client errors is probably good, but did the customer give up, or did they fix their script? A decrease in requests might come with a decrease in client errors at the same time, in which case there is a good chance that the script was fixed, and the lower traffic is actually a good thing.
Duration metrics are both simpler and trickier. They are simpler because it is fairly easy to understand why we would want to know about requests which take a minute or more to fulfil (no one wants to wait that long). They are trickier because most metrics tools store aggregations, not every data point. This means that creating useful duration metrics requires understanding how the tools aggregate. Thankfully, they are sufficiently standardised across the industry that knowing one tool makes it pretty easy to shift to the next.
Duration metrics are usually aggregated in buckets. These are complex counters which tell you, for each unit of time (e.g. each minute), how many requests fell in each bucket. The buckets are duration ranges. For example, all requests taking less than 100ms might fall in the first bucket. Then you determine the granularity you want; 200ms, 500ms, 1s, 5s, 10s, 30s, 60s is not a bad set of buckets. For small, simple requests you would expect most of them to land in the first two, with the more complex requests in the next two. By the time you’re hitting 10s for a request, I hope you’re doing a lot of complex computation.
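The Prometheus client libraries do this bucketing for you. Here is a sketch in Python using the bucket boundaries suggested above; bounds are given in seconds, and the metric name is once again just my illustrative choice.

import time
from prometheus_client import Histogram

REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "Time taken to serve each request",
    ["method", "path"],
    buckets=[0.1, 0.2, 0.5, 1, 5, 10, 30, 60],   # 100ms up to 60s
)

def handle_request(method: str, path: str):
    start = time.monotonic()
    try:
        pass  # ... do the real work here ...
    finally:
        # The observation is counted in every bucket whose upper bound it fits under.
        REQUEST_DURATION.labels(method=method, path=path).observe(time.monotonic() - start)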
Now that you have these buckets, you need to display them meaningfully. The histogram they create is a little weird on its own, so we usually move into percentile metrics. Yay, now we’re doing statistics! Because the data is aggregated, it is actually easier to reason about the statistics. Of course, if you don’t really have much data, the stats are going to be really boring. When you get to cloud scale, they start to be really useful. When I was looking at metrics in EC2, the difference in duration between the 99th percentile and the 95th percentile was meaningful. In the system I am working on now, there is not enough data for it to be statistically relevant. Plotting the 50th and 75th percentiles from your aggregation can give you a sense of the difference between average-case request times and slower request times. If there is a huge gap (500ms or more), it might be worth looking into the cause.
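To show why aggregated data keeps the statistics manageable, here is a rough sketch of estimating a percentile from cumulative bucket counts. It is roughly what PromQL’s histogram_quantile() does (linear interpolation inside the bucket that contains the target rank); the function and the example numbers are made up for illustration.

def estimate_quantile(q, buckets):
    """buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound."""
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return prev_bound

# Example: 1000 requests, mostly fast, with a slow tail.
counts = [(0.1, 700), (0.2, 870), (0.5, 950), (1, 980), (5, 998), (10, 1000)]
print(estimate_quantile(0.50, counts))   # ~0.07s
print(estimate_quantile(0.95, counts))   # ~0.5s, the kind of p50/p95 gap worth watching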
Of course, just as with errors and requests, you want to look for sudden changes in behaviour in the duration graphs. Somewhat counter-intuitively, a massive unexpected drop in duration can be an indicator that something has gone wrong. Why? Because we design systems to fail fast. A bad deployment might cause a customer script to get into a retry loop and start hammering the server with invalid requests: the duration graphs will plummet, but requests and errors will spike. A massive step up in latency is one we already know to be concerned about. Usually it means there is a bottleneck in the system somewhere, and the investigation may lead you to an undersized data store, or to a poorly implemented algorithm.
Monitoring Systems
RED metrics, as I have described them here, are the baseline for monitoring production systems. The tooling around them is so mature that I am confident in saying even the smallest, scrappiest startup system can afford to have them. Using open source tools like Prometheus and Grafana isn’t going to break the bank, and it allows you to start off with the confidence that you will know when you get new customers, as well as what sort of experience they are having. If you have more complex infrastructure you are going to want even more metrics, and you should certainly still talk to your customers — don’t assume the metrics tell you everything. This is my bare minimum requirement for monitoring in production.
Step two is teaching your product managers and business analysts how to read the dashboards, and that they can ask you to add new metrics. If you can achieve that, as a DevOps engineer, you will save yourself countless “please pull this data” requests, giving you more time to build cool stuff, with the confidence that if something breaks you will know about it without needing a phone call from your boss!
