Anas Anjaria
Anas Anjaria's blog

Monitor Your System - Practical Guide

Ensure your system is healthy like you :)

Anas Anjaria
Sep 3, 2022

As software engineers, we continually build and add new features to our product. As those features increase the capability of our system, it is crucial to ensure that the system stays stable and reliable.

The obvious questions are:

How can we ensure that our system remains healthy and stable?

How can we ensure that the performance of our system doesn't degrade upon adding new features?

And the answer to both questions is:

By observing our system closely… by collecting crucial metrics to understand our system behavior.

Basic Concepts

What are metrics?

Let's understand it with a simple example.

When a person feels sick, he visits a doctor. The doctor will measure his body temperature, right? This body temperature is a metric that determines whether he is healthy or not.

Similarly, we need metrics to ensure our system's health. Examples include system metrics such as CPU or memory utilization.
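To make the analogy concrete, here is a minimal Python sketch of a metric and a health check against a threshold. The `Metric` class, the `is_healthy` function, the sample readings, and the 90 percent threshold are all assumptions made up for illustration; they are not part of any tool used later in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """One measurement taken from a system."""
    name: str     # hypothetical metric name, e.g. "cpu_utilization"
    value: float  # the measured value
    unit: str     # e.g. "percent"


def is_healthy(metric: Metric, threshold: float) -> bool:
    """Return True while the measured value stays at or below the threshold."""
    return metric.value <= threshold


# Hypothetical readings; a real collector would measure these values.
cpu = Metric(name="cpu_utilization", value=42.0, unit="percent")
memory = Metric(name="memory_utilization", value=93.5, unit="percent")

# The 90 percent threshold is an assumption chosen for illustration.
print(is_healthy(cpu, threshold=90.0))     # True
print(is_healthy(memory, threshold=90.0))  # False
```

Just like a body temperature reading, each value only means something once it is compared against what we consider normal.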

In today's world, we try to model our systems as closely on humans as possible, which lets us reason about them more naturally.

Symptoms (what's not working) vs causes (why it's not working)

Metrics are like symptoms. They just tell us what's not working, but they don't tell us why it's not working.

Let's understand this concept with a simple example.

If a person has a fever, the fever alone does not tell us what is wrong. He could have a coronavirus disease, or something else entirely. So the fever is just a symptom. A doctor will analyze it and diagnose the cause accordingly.

Similarly, there could be numerous reasons for our system to be unhealthy.

| Symptoms (what's not working) | Causes (why it's not working) |
| --- | --- |
| REST application is returning 500 as a response code. | Database went down or system went out-of-memory. |
| Specific endpoint is slow. | Query took too long to respond. |
| Database is rejecting connection. | No free disk space left. |

Basic workflow

Now that we understand the basic concepts, let's have a look at the basic workflow.

The basic workflow for system monitoring

The workflow is as follows.

  1. First, a metric collector will collect metrics from the system under consideration.
  2. The metric collector will store collected metrics in a database.
  3. We can visualize those collected metrics using some tool.
  4. As we now have metrics in our database, we can retrieve them and inform our team of any problems.
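The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `InMemoryStore` stands in for the database, the readers return fixed hypothetical values instead of measuring a real system, and the thresholds are made up. Step 3 (visualization) is left out because it needs a separate tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Sample:
    name: str
    value: float


class InMemoryStore:
    """Hypothetical stand-in for the metrics database (step 2)."""

    def __init__(self) -> None:
        self._samples: List[Sample] = []

    def save(self, sample: Sample) -> None:
        self._samples.append(sample)

    def latest(self, name: str) -> Optional[Sample]:
        # Walk backwards so the most recently saved sample wins.
        for sample in reversed(self._samples):
            if sample.name == name:
                return sample
        return None


def collect(readers: Dict[str, Callable[[], float]], store: InMemoryStore) -> None:
    """Steps 1 and 2: read every metric once and store the result."""
    for name, read in readers.items():
        store.save(Sample(name, read()))


def check(store: InMemoryStore, thresholds: Dict[str, float]) -> List[str]:
    """Step 4: retrieve the latest samples and report threshold violations."""
    alerts: List[str] = []
    for name, limit in thresholds.items():
        sample = store.latest(name)
        if sample is not None and sample.value > limit:
            alerts.append(f"{name} is {sample.value}, above limit {limit}")
    return alerts


store = InMemoryStore()
# Fixed, hypothetical readings instead of real measurements.
readers = {"cpu_percent": lambda: 35.0, "disk_used_percent": lambda: 97.0}
collect(readers, store)

alerts = check(store, {"cpu_percent": 90.0, "disk_used_percent": 95.0})
for message in alerts:
    print(message)  # disk_used_percent is 97.0, above limit 95.0
```

Keeping collection, storage, and alerting as separate pieces mirrors the workflow: each step can be replaced by a dedicated tool without touching the others.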

Tech Stack used for monitoring in this guide

I will be using the following tech stack in this guide.

Elasticsearch + Beats + Kibana

Elasticsearch: To store metrics. See details here [1].

Beats: To collect metrics. Please note that I will be using Metricbeat and Filebeat only. See details here [2].

Kibana: To visualize collected metrics. See details here [3].

Watcher: To inform our team of any problem by fetching metrics from Elasticsearch. See details here [4].
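To give a rough idea of how the alerting piece fits together, here is a sketch of a watch definition built as a Python dictionary and printed as JSON. It follows the general trigger, input, condition, and actions layout of a watch. The index pattern `metricbeat-*`, the field `system.memory.used.pct`, the 0.9 threshold, the time window, and the action name `log_high_memory` are assumptions chosen for illustration, so adjust them to your own setup.

```python
import json

# A sketch of a watch definition. All concrete values below are assumptions.
watch = {
    # Run the watch once per minute.
    "trigger": {"schedule": {"interval": "1m"}},
    # Search recent Metricbeat documents that report high memory usage.
    "input": {
        "search": {
            "request": {
                "indices": ["metricbeat-*"],  # assumed index pattern
                "body": {
                    "query": {
                        "bool": {
                            "filter": [
                                {"range": {"@timestamp": {"gte": "now-5m"}}},
                                {"range": {"system.memory.used.pct": {"gt": 0.9}}},
                            ]
                        }
                    }
                },
            }
        }
    },
    # Fire the actions only if at least one document matched.
    "condition": {"compare": {"ctx.payload.hits.total": {"gt": 0}}},
    # A logging action keeps the sketch simple; a real setup might notify a team channel.
    "actions": {
        "log_high_memory": {
            "logging": {
                "text": "High memory usage: {{ctx.payload.hits.total}} matching documents"
            }
        }
    },
}

watch_json = json.dumps(watch, indent=2)
print(watch_json)
```

The printed JSON is the kind of body you would register with Watcher. Treat it as a starting point and check it against the Watcher documentation [4] before using it.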

Practical examples

  1. Monitor Ansible Playbook Executions
  2. Monitor Aborted MySQL Connections using metricbeat
  3. Monitoring Nginx Logs

I will keep adding more examples in this guide in the future.

Your feedback is crucial for me

Hey Network, please leave your feedback in the comment section about this guide. I would really appreciate your time and input.

I will use your input as constructive feedback and improve it accordingly.

Thank you so much. Looking forward to hearing from you :).


[1] What is Elasticsearch

[2] Beats

[3] Kibana

[4] Watcher
