Creating alerts to spot problems before your users do is simple. However, when you receive too many alerts, you might end up ignoring critical problems.
Having intelligent alerts that filter out the daily noise will have a positive impact on the way you manage incidents. Like many developers, I’ve suffered the fatigue that comes with too many alerts. When we see how easy it is to create an alert, we create one “just in case,” right? Been there, done that. The problem is that, like me, you might end up creating email rules to filter out the noise. That’s a good sign you need to fine-tune your alerts.
Don’t get me wrong. Alerts are great because you don’t have to wait until the system is down to react. Your services’ dependencies can often hint that something terrible will happen if you don’t take care of a problem soon. In this post, I’ll give you a few tips and strategies for creating intelligent alerts to help you act proactively rather than reactively.
Alerts Should Be Actionable
One of the first questions you need to answer before creating an alert is “what will you do when the alert triggers?”
An alert needs to be actionable. For alerts, quality is more important than quantity. You don’t need to know that one of your 50 servers had a spike in memory usage. Instead, you might be interested to know when database connections are increasing rapidly. Too many things can go wrong when a dependency in the system is down, and you don’t want to get tons of emails referring to the same problem. Sometimes, failures are temporary, and it’s OK not to act immediately.
In the Site Reliability Engineering (SRE) book, you’ll find Google’s four golden signals: latency, traffic, errors, and saturation. They’re a helpful starting point for deciding what’s worth alerting on.
But from a developer’s perspective, when an alert triggers, you want to know what caused the error. Was it a database problem? Were there missing keys in the cache? Perhaps an external service was timing out. Each problem calls for a different remediation, and knowing which one you’re facing helps make alerts actionable.
So what can we do when coding?
First, we can classify logs (and thus alerts) with information indicating how critical an error is. Simple categories in the log details, like “WARNING” or “CRITICAL,” will be enough. Warning logs can be informative, and we can give them a look during working hours. Critical logs demand attention right away, even if it’s in the middle of the night.
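As a minimal sketch of what that classification might look like in code, here’s an example using Python’s standard logging module. The service name, exception types, and payment gateway call are placeholders for illustration, not part of any particular system:

```python
import logging

logger = logging.getLogger("orders-service")  # hypothetical service name
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

def charge_customer(order_id, payment_gateway):
    try:
        payment_gateway.charge(order_id)  # hypothetical dependency call
    except TimeoutError:
        # Transient issue: informative, review it during working hours.
        logger.warning("payment gateway timed out order_id=%s", order_id)
    except ConnectionError:
        # The dependency is unreachable: this should page someone right away.
        logger.critical("payment gateway unreachable order_id=%s", order_id)
        raise
```

With the severity in the log line itself, your alerting rules can page on CRITICAL and simply collect WARNING entries for later review.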
Second, you need to know what your dependencies are.
Learn What Your Service Dependencies Are
I remember being on a previous team that had a problem with network calls to other services. Before the system crashed, we’d notice the number of threads in the server increasing. So when the thread count started to grow, we triggered an alarm. The action of this alarm was to restart the service, and after that, we stopped seeing the system crash.
We had to learn what the service dependencies were before we could tune the alerts and make them actionable. Additionally, we ended up improving our service health checks. In a microservices world, every service must have a health endpoint for service discovery. At first, we had a traditional endpoint whose only purpose was to say “Hey, I’m alive!” without any other logic. Once we knew what our dependencies were, we started to include checks for those dependencies in the health endpoint. This helped us build a more resilient system.
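Here’s a minimal sketch of what such a health endpoint could look like, assuming a Python service built with Flask and a Redis dependency; the host name and route are hypothetical:

```python
from flask import Flask, jsonify
import redis

app = Flask(__name__)
cache = redis.Redis(host="redis.internal", port=6379)  # hypothetical host

@app.route("/health")
def health():
    # Report not just "I'm alive" but whether each dependency responds.
    checks = {}
    try:
        checks["redis"] = cache.ping()  # True when Redis answers
    except redis.exceptions.ConnectionError:
        checks["redis"] = False
    status = 200 if all(checks.values()) else 503
    return jsonify(checks), status
```

If Redis is down, the endpoint now fails, so whatever watches it (a load balancer, service discovery, or an alert) finds out before your customers do.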
Every time you add a call to a service dependency, take note of it; you’ll need it when configuring alerts.
Does your service interact with a Redis server? Take note of this and configure an alert on a metric like the number of evictions in the cluster. Perhaps Redis is running short on memory and you simply need to get rid of unused old keys. I’ve worked with these types of alerts before, and the alert noise got lower and lower over time. We either had alerts needing human interaction or alerts where the action could be automated, like the Redis example.
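As a rough sketch of how that automated action could work, a scheduled job might read the evicted_keys counter from Redis’s INFO stats and purge a rebuildable key space when it grows past a threshold. The threshold and key pattern below are made up for illustration:

```python
import redis

cache = redis.Redis(host="redis.internal", port=6379)  # hypothetical host
EVICTION_THRESHOLD = 1000  # illustrative; tune to your workload

def check_evictions():
    stats = cache.info("stats")              # reads the INFO stats section
    evicted = stats.get("evicted_keys", 0)
    if evicted > EVICTION_THRESHOLD:
        # Redis is running short on memory; instead of paging someone,
        # drop a stale, rebuildable key space (hypothetical key pattern).
        for key in cache.scan_iter("session:stale:*"):
            cache.delete(key)
        return f"evicted_keys={evicted}, purged stale session keys"
    return f"evicted_keys={evicted}, within limits"
```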
However, to get to this point of noise reduction, we needed to understand our systems better.
Learn What Action an Alert Needs, Then Try to Automate
There are going to be problems you can quickly resolve by automating the action a human operator would do anyway.
For instance, this might include things like restarting a service. However, we need to be careful; sometimes, automation can bring even more problems. You don’t want to drop service calls because of constant restarts. Learn which actions actually solve the problem an alert notifies you about. Perhaps you could start by automating simple things. Then, when someone looks at the problem, they’ll know for sure it’s more serious, because a simple restart didn’t fix it.
When you think about automating alert actions, you’re delegating tedious tasks to a computer. In the middle of the night, who do you think would be better to restart a server: a computer or you? There are many tools out there we can use to help us out. For instance, I’ve used Lambda functions in AWS to automate restarts, modify scaling policies, and roll back a deployment, among other things. We need to develop applications capable of getting back to work with as little human interaction as possible.
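For illustration, a Lambda handler that “restarts” an ECS service by forcing a new deployment might look something like this. The cluster and service names are placeholders, and this sketch assumes the alarm invokes the function through SNS or EventBridge:

```python
import boto3

ecs = boto3.client("ecs")

CLUSTER = "production"        # placeholder cluster name
SERVICE = "orders-service"    # placeholder service name

def handler(event, context):
    # Triggered by an alarm notification; restart the service by forcing
    # a new deployment instead of waking a human up.
    ecs.update_service(
        cluster=CLUSTER,
        service=SERVICE,
        forceNewDeployment=True,
    )
    return {"restarted": SERVICE}
```

The point isn’t that restarting is the fix; it’s that the obvious first action happens without waking anyone up, and a human only gets involved if the problem persists.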
When we put this strategy into practice, we put people first. We don’t want to burn out from alert fatigue.
Know What Your Limits Are
Besides the traditional error logs your systems emit, you can instrument your applications to keep other types of metrics. For instance, this might include metrics your customers actually care about. Customers don’t care whether you have an alert for memory usage. If your service is selling products on a website, they care about being able to finish an order. Sometimes we need to put our customers before technology. There may be times when everything looks great, but you’re still receiving complaints.
So what approach could work here? You have to know what your limits are.
To get this type of information, you need to collaborate with other teams at the company. For instance, there was a time when I configured alerts from the data a system was emitting. It wasn’t necessarily for troubleshooting purposes; it was information our account executives needed. We learned we could aggregate some of this information and send it to our logging platform. We were configuring alerts important to our business, and we knew when things were starting to look bad before our customers did.
What I described before has a name, and the SRE book talks about it in more detail. SRE refers to this as service-level objectives. The idea is to configure limits your customers care about—because they care about them, you need to keep an eye on them. If you ask your customers or even upper management what percentage of errors is acceptable, they’ll say “none.” But you have to be objective and embrace failure, as the SRE model fosters.
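As a quick illustration of what embracing failure looks like in numbers, here’s a back-of-the-envelope error budget calculation; the SLO target and request counts are made up:

```python
SLO_TARGET = 0.999  # "99.9% of checkout requests succeed"; illustrative

def error_budget_remaining(total_requests, failed_requests):
    # Fraction of the error budget still left for the SLO window.
    if total_requests == 0:
        return 1.0
    allowed_failures = total_requests * (1 - SLO_TARGET)
    used = failed_requests / allowed_failures if allowed_failures else 1.0
    return max(0.0, 1 - used)

# 2,000,000 requests this month with 1,200 failures: the budget allows
# 2,000 failures, so 40% of the error budget is still left.
print(error_budget_remaining(2_000_000, 1_200))
```

When the remaining budget approaches zero, that’s an alert your business cares about, not just your servers.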
Keep a Healthy Alerting System
Lastly, you should occasionally evaluate your alerts to see whether they’re obsolete or not relevant anymore. Configuring alerts is a dynamic activity you can always improve. Remember, quality is better than quantity. Try to tune your alerts, and make sure you only have actionable ones. As your systems continue changing, your alert strategy will need to adapt. To see how SolarWinds® Papertrail™ can integrate with your alert strategy, check out how we can help manage alerts here.
As the saying goes, “perfect is the enemy of good.” In other words, systems are not perfect, and we need to keep identifying where we can improve our alerts so they remain “intelligent.”
This post was written by Christian Meléndez. Christian is a technologist who started as a software developer and has more recently become a cloud architect focused on implementing continuous delivery pipelines with applications in several flavors, including .NET, Node.js, and Java, often using Docker containers.