As software engineers, we continually build and add new features to our product. As those features increase the capability of our system, it is crucial to ensure that the system stays stable and reliable.
The obvious questions are:
How can we ensure that our system remains healthy and stable?
How can we ensure that the performance of our system doesn't degrade upon adding new features?
And the answer to all of those questions is:
By observing our system closely… by collecting crucial metrics to understand our system behavior.
What are metrics?
Let's understand them with a simple example.
When a person feels sick, he visits a doctor. The doctor will measure his body temperature, right? This body temperature is a metric that determines whether he is healthy or not.
Similarly, we need metrics to ensure our system's health. Examples include system metrics such as CPU or memory utilization.
In today's world, we try to model our systems on humans as closely as possible, which lets us reason about them more naturally.
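To make the idea of a system metric concrete, here is a minimal Python sketch that takes one snapshot of a host's health using only the standard library. The function name `collect_system_metrics` and the dictionary keys are my own illustrative choices, not part of any tool used in this guide. Load per CPU core is only a rough proxy for CPU utilization, and disk usage stands in for memory because the standard library has no portable memory reading.

```python
import os
import shutil
import time


def collect_system_metrics(path: str = "/") -> dict:
    """Return one snapshot of a few basic host metrics as a plain dict."""
    try:
        # 1-minute load average; only available on Unix-like systems.
        load_1m = os.getloadavg()[0]
    except (OSError, AttributeError):
        load_1m = None  # Not obtainable on this platform.

    cpus = os.cpu_count() or 1
    disk = shutil.disk_usage(path)

    return {
        "timestamp": time.time(),
        # Rough proxy for CPU pressure, not a true utilization percentage.
        "load_per_cpu": None if load_1m is None else load_1m / cpus,
        "disk_used_pct": 100.0 * disk.used / disk.total,
    }


if __name__ == "__main__":
    print(collect_system_metrics())
```

Just like a single temperature reading, one snapshot says little on its own. The value comes from collecting these readings repeatedly and watching how they change.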
Symptoms (what's not working) vs causes (why it's not working)
Metrics are like symptoms. They just tell us what's not working, but they don't tell us why it's not working.
Let's understand this concept with a simple example.
If a person is not feeling well, it does not necessarily mean he has a particular illness. He may have coronavirus disease, or he may not. Feeling unwell is just a symptom. A doctor will analyze it and diagnose the cause accordingly.
Similarly, there could be numerous reasons for our system to be unhealthy.
|Symptoms (what's not working)|Causes (why it's not working)|
|---|---|
|REST application is returning 500 as a response code.|The database went down or the system ran out of memory.|
|A specific endpoint is slow.|A query took too long to respond.|
|The database is rejecting connections.|No free disk space is left.|
Now that we understand the basic concepts, let's look at the basic workflow.
The workflow is as follows.
- First, a metric collector will collect metrics from the system under consideration.
- The metric collector will store collected metrics in a database.
- We can visualize the collected metrics using a visualization tool.
- As we now have metrics in our database, we can retrieve them and inform our team of any problems.
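The workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `MetricStore` is an in-memory stand-in for the database, `fake_collector` is a hypothetical collector that returns a fixed reading, and the `cpu_pct` key and threshold are made up for the example. The visualization step is left out because it needs a real tool.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Metric = Dict[str, float]


@dataclass
class MetricStore:
    """In-memory stand-in for the metrics database."""
    records: List[Metric] = field(default_factory=list)

    def save(self, metric: Metric) -> None:
        self.records.append(metric)

    def latest(self) -> Metric:
        return self.records[-1]


def run_once(collect: Callable[[], Metric], store: MetricStore,
             threshold_pct: float, notify: Callable[[str], None]) -> None:
    """Run one collect -> store -> retrieve -> inform cycle."""
    metric = collect()        # Step 1: collect a metric from the system.
    store.save(metric)        # Step 2: store it in the database.
    latest = store.latest()   # Step 4: retrieve it again ...
    if latest["cpu_pct"] > threshold_pct:
        # ... and inform the team when it crosses the threshold.
        notify(f"CPU at {latest['cpu_pct']:.0f}% exceeds {threshold_pct:.0f}%")


def fake_collector() -> Metric:
    """Hypothetical collector that always reports the same reading."""
    return {"timestamp": time.time(), "cpu_pct": 93.0}


if __name__ == "__main__":
    alerts: List[str] = []
    run_once(fake_collector, MetricStore(), threshold_pct=80.0,
             notify=alerts.append)
    print(alerts)
```

Passing the collector and the notifier in as functions keeps each step replaceable, which mirrors how the real stack swaps in a dedicated tool for each role.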
Tech Stack used for monitoring in this guide
I will be using the following tech stack in this guide.
Elasticsearch + Beats + Kibana
Elasticsearch: To store metrics.
Beats: To collect metrics. Please note that I will be using Metricbeat and Filebeat only.
Kibana: To visualize collected metrics.
Watchers: To inform our team of any problem by fetching metrics from Elasticsearch.
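To give a feel for how an alerting check might fetch metrics from Elasticsearch, here is a small Python sketch that builds a search body and evaluates the response. It is not a Watcher definition and makes no network calls. The helper names are hypothetical, and the field `system.cpu.total.norm.pct` is an assumption based on Metricbeat's system module, so please verify it against your own index mapping.

```python
from typing import Optional


def build_avg_cpu_query(window_minutes: int = 5) -> dict:
    """Build a search body: average CPU over the last N minutes."""
    return {
        "size": 0,  # We only need the aggregation, not the documents.
        "query": {
            "range": {"@timestamp": {"gte": f"now-{window_minutes}m"}}
        },
        "aggs": {
            # Field name is an assumption; check your Metricbeat mapping.
            "avg_cpu": {"avg": {"field": "system.cpu.total.norm.pct"}}
        },
    }


def should_alert(response: dict, threshold: float) -> bool:
    """Return True when the average in the response exceeds the threshold."""
    value: Optional[float] = response["aggregations"]["avg_cpu"]["value"]
    # The value is null when no documents matched the time window.
    return value is not None and value > threshold


if __name__ == "__main__":
    print(build_avg_cpu_query(10))
    # Hand-written response in the shape an avg aggregation returns.
    sample = {"aggregations": {"avg_cpu": {"value": 0.93}}}
    print(should_alert(sample, threshold=0.8))
```

In a real setup, the body would be sent to the Metricbeat indices and the decision would trigger a notification to the team.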
- Monitor Ansible Playbook Executions
- Monitor Aborted MySQL Connections using metricbeat
- Monitoring Nginx Logs
I will keep adding more examples in this guide in the future.
Your feedback is crucial for me
Hey Network, please leave your feedback in the comment section about this guide. I would really appreciate your time and input.
I will use your input as constructive feedback and improve the guide accordingly.
Thank you so much. Looking forward to hearing from you :).