Auth Monitoring Runbook: What to Alert On and What to Ignore
Alerting on every signal an authentication system emits buries real problems in noise. Whether you run Bastionary, a self-hosted platform for authentication, billing, licensing, and feature flags, or another system, you need to know which signals deserve an alert and which can be ignored. This runbook covers three signals worth alerting on (error rate, P99 token verification latency, and login anomaly spikes), the patterns that are usually safe to ignore, and a sample alert configuration.
Error Rate Thresholds
Error rate is one of the first metrics to monitor in an authentication system. A high error rate can indicate misconfiguration, network problems, or a security incident, but not all errors carry the same weight. Separate transient errors from those that point to a deeper problem: a sudden spike warrants immediate attention, while a gradual increase that coincides with a planned upgrade or maintenance window is usually expected.
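One way to make the spike-versus-gradual distinction concrete is to compare the newest measurement window against the ones before it. The sketch below is a minimal illustration, not a Bastionary feature: the function names, the per-window input shape, and the thresholds (`spike_factor`, `floor`, the 1.5x drift test) are all assumptions chosen for the example.

```python
from statistics import mean


def error_ratio(errors: int, requests: int) -> float:
    """Fraction of failed auth requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def classify_error_trend(ratios: list[float], spike_factor: float = 3.0,
                         floor: float = 0.01) -> str:
    """Label a series of per-window error ratios (oldest first).

    Returns "spike", "drift", or "steady". All thresholds are illustrative.
    """
    if len(ratios) < 2:
        return "steady"
    *history, latest = ratios
    baseline = mean(history)
    # Sudden jump: the newest window is several times the baseline and above
    # a minimum floor, so tiny absolute changes are not flagged.
    if latest >= floor and latest > spike_factor * max(baseline, 1e-9):
        return "spike"
    # Slow climb: no window is lower than the one before it, and the series
    # ends well above where it started.
    rising = all(b >= a for a, b in zip(ratios, ratios[1:]))
    if rising and latest >= floor and latest > 1.5 * ratios[0]:
        return "drift"
    return "steady"
```

A "spike" result would be a candidate for paging, while a "drift" result is the kind of change to check against the upgrade or maintenance calendar before acting on it.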
Token Verification Latency P99
Token verification latency, particularly at the 99th percentile (P99), is another critical metric. It measures how long the system takes to verify a token; the P99 value is the latency that 99% of verifications stay at or below, so it describes the slowest 1% of requests. A consistently high P99 suggests the system is under heavy load or has a technical fault. An occasional spike that stays within your acceptable range is less of a concern.
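To illustrate the difference between a consistently high P99 and an occasional spike, here is a small sketch that computes a nearest-rank P99 from raw latency samples and only reports a breach when several consecutive windows exceed the threshold. The helper names, the nearest-rank method, and the choice of threshold and window count are assumptions for the example; a real deployment would more likely read percentiles from its metrics backend.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sustained_breach(window_p99s: list[float], threshold_s: float,
                     min_windows: int) -> bool:
    """True only when the last min_windows P99 values all exceed the threshold."""
    if len(window_p99s) < min_windows:
        return False
    return all(v > threshold_s for v in window_p99s[-min_windows:])


# 100 verifications: 98 fast, one slightly slow, one very slow outlier.
latencies = [0.1] * 98 + [0.2, 2.5]
p99 = percentile(latencies, 99)  # the single 2.5 s outlier does not set the P99
```

Requiring several consecutive windows above the threshold keeps a single slow window from triggering an alert, which matches the idea that isolated spikes are less concerning than a sustained rise.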
Login Anomaly Spikes
Monitoring login attempts is crucial for security. A sudden spike in login attempts can be a sign of a brute-force attack or a compromised account, but not every spike is malicious: a promotional event or a new feature release can produce one too. Correlate each spike with other data points, such as the share of failed attempts or your release calendar, to decide whether it is legitimate or malicious.
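The correlation step can be sketched as a simple check that looks at more than raw volume. In the example below, the `LoginWindow` fields, the function name, and every threshold are hypothetical; the point is only that a spike made up mostly of successful logins spread across many accounts looks different from one made up of failures concentrated on a few accounts.

```python
from dataclasses import dataclass


@dataclass
class LoginWindow:
    attempts: int           # total login attempts in the window
    failures: int           # attempts that failed
    distinct_accounts: int  # distinct accounts that were targeted


def assess_login_spike(current: LoginWindow, baseline_attempts: float,
                       spike_factor: float = 3.0,
                       max_failure_ratio: float = 0.5,
                       max_attempts_per_account: float = 10.0) -> str:
    """Return "normal", "likely_legitimate", or "suspicious".

    Thresholds are illustrative and should be tuned against real traffic.
    """
    # Not a spike at all: volume is within the expected multiple of baseline.
    if current.attempts <= spike_factor * baseline_attempts:
        return "normal"
    failure_ratio = current.failures / current.attempts
    per_account = current.attempts / max(current.distinct_accounts, 1)
    # Mostly failures, or many attempts against few accounts, suggests
    # brute forcing rather than a wave of real users.
    if failure_ratio > max_failure_ratio or per_account > max_attempts_per_account:
        return "suspicious"
    return "likely_legitimate"
```

For example, `assess_login_spike(LoginWindow(5000, 400, 4200), baseline_attempts=1000)` is classified as likely legitimate, while the same volume with 4,700 failures across 40 accounts is classified as suspicious.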
What to Ignore
Some patterns in these metrics can usually be ignored:
- Gradual changes in error rates: a slow change that coincides with a planned upgrade or maintenance window is expected and does not, by itself, indicate a serious problem.
- Occasional spikes in latency: as long as they stay within your acceptable range and are not accompanied by other issues, isolated latency spikes are not a cause for concern.
- Expected login spikes: spikes during promotional events or new feature releases are normal, provided that correlation with other data points confirms they are legitimate.
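The ignore rules above can be expressed as a small routing function. This is a sketch under stated assumptions: the signal labels (`"latency_blip"`, `"expected_login_spike"`, `"gradual_error_increase"`) and the maintenance-window representation are invented for illustration, and it presumes an earlier step has already classified each signal.

```python
from datetime import datetime, timezone

# (start, end) pairs; the end is exclusive. Hypothetical representation.
MaintenanceWindow = tuple[datetime, datetime]


def in_maintenance(now: datetime, windows: list[MaintenanceWindow]) -> bool:
    """True when now falls inside any planned maintenance window."""
    return any(start <= now < end for start, end in windows)


def should_notify(kind: str, now: datetime,
                  windows: list[MaintenanceWindow]) -> bool:
    """Decide whether an already-classified signal should reach on-call."""
    # Isolated latency spikes within range and confirmed-legitimate login
    # spikes are treated as noise.
    if kind in ("latency_blip", "expected_login_spike"):
        return False
    # A gradual error increase is ignored only while maintenance is planned.
    if kind == "gradual_error_increase":
        return not in_maintenance(now, windows)
    # Anything else (for example a sudden error spike) always notifies.
    return True


planned = [(datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc),
            datetime(2024, 1, 10, 4, 0, tzinfo=timezone.utc))]
```

Keeping the suppression logic in one place makes it easy to review exactly which signals are being dropped and why.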
Practical Implementation
Let's look at a practical implementation using Bastionary. Suppose you want to set up alerts for high error rates and token verification latency. The following snippet, written in Prometheus alerting rule syntax, configures both:
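    groups:
      - name: bastionary-auth  # illustrative group name
        rules:
          # Fires when auth_errors grows faster than 0.05 per second for 10 minutes.
          - alert: HighErrorRate
            expr: rate(auth_errors[5m]) > 0.05
            for: 10m
            labels:
              severity: critical
          # Threshold is a plain number of seconds, assuming the metric is reported in seconds.
          - alert: HighLatency
            expr: auth_token_verification_latency{job="bastionary"} > 1
            for: 10m
            labels:
              severity: warning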
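In this example, we set up two alerts: one for high error rates and another for high latency. The first fires when the per-second rate of auth_errors, averaged over 5 minutes, stays above 0.05 for 10 minutes. That is an absolute rate (about 3 errors per minute), not a percentage; to alert on a 5% error ratio, divide it by the rate of total authentication requests. The second fires when token verification latency stays above 1 second for 10 minutes. Each alert carries a severity label: critical for the error rate and warning for latency.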
Conclusion
Monitoring your authentication system is crucial for maintaining security and efficiency. Alert on error rates, P99 token verification latency, and login anomalies, and treat gradual changes during planned maintenance, isolated latency spikes within range, and expected login spikes as noise. Following this runbook gives you a monitoring strategy that surfaces the signals that matter without drowning them in the ones that do not.
About Bastionary
Bastionary is a self-hosted platform for authentication, billing, licensing, and feature flags. It provides a single solution for managing these aspects of your system, helping you streamline your operations.