course updates

This commit is contained in:
marcel-dempers 2025-02-20 19:22:28 +11:00
parent 3ca3ad6f6e
commit 9c79cda5f1
6 changed files with 319 additions and 77 deletions

View File

@ -28,7 +28,7 @@ In this chapter we'll take a look at monitoring concepts and introduce a few mor
* Analyzing CPU Usage
* Practical Examples
### 🎬 [Module 3: Linux Memory Monitoring](../../content/operating-systems//linux/monitoring/memory/README.md)
#### In this module
@ -37,7 +37,7 @@ In this chapter we'll take a look at monitoring concepts and introduce a few mor
* Analyzing Memory Usage
* Practical Examples
### 🎬 [Module 4: Linux Disk Usage Monitoring](../../content/operating-systems//linux/monitoring/disk/README.md)
#### In this module
@ -46,7 +46,7 @@ In this chapter we'll take a look at monitoring concepts and introduce a few mor
* Analyzing Disk Usage
* Practical Examples
### 🎬 [Module 5: Linux Network Monitoring](../../content/operating-systems//linux/monitoring/network/README.md)
#### In this module

View File

@ -29,22 +29,22 @@ The cycle of a CPU involves:
3) Read the address from memory
4) Execute the instruction
In Chapter 2, I suggested thinking of a CPU Core as a spinning wheel, each "wheelspin" is a CPU cycle that processes 1 task. </br>
When the wheel is spinning, it cannot process another task and has to finish its cycle so another task can hop on. </br>
This means that tasks may queue to wait their turn to get onto the CPU. </br>
### CPU Multi-Task execution
Because a single CPU core is extremely fast as mentioned above, it may process tasks extremely fast and appear as if it's completing them all at once. </br>
However, we now know that a single CPU core can only perform one task at a time. </br>
To speed this up, the CPU has multiple cores for true parallel processing. </br>
So CPU is a shared resource, similar to memory and disk and other resources. </br>
If an application needs disk space, it gets disk space allocation. </br>
If an application needs memory, it gets memory allocated. </br>
However, CPU is a much more intensively shared resource, as it does not get allocated to an application. <br>
Applications all queue up their tasks and the CPU executes them </br>
This makes it difficult for operating systems to display exactly how much CPU each application
@ -52,11 +52,11 @@ is using, but it does a pretty good job in doing so. </br>
### Understanding CPU as a Shared Resource
The CPU, like memory and other resources, is a critical resource that is shared among all running applications on a system.
The CPU is not allocated to applications in fixed chunks. Instead, the CPU time is shared among all running applications through a process called scheduling.
The operating system's scheduler rapidly switches the CPU's focus between different applications, giving each one a small time slice to execute its tasks. This switching happens so quickly that it appears as though applications are running simultaneously.
Because the CPU is a highly contended resource, the operating system must manage this sharing efficiently to ensure that all applications get a fair amount of CPU time and that high-priority tasks are executed promptly.
@ -87,48 +87,48 @@ In addition to the above, we may need to understand the `%` of CPU usage <b>on a
Since we know that a CPU core can only execute one task at a time, it's important for engineers to know how to take advantage of this and also avoid the pitfalls.
Example 1: If you have a task which may contain poorly written code, it could keep CPU cores busy unnecessarily, causing other tasks to queue up for execution. This can slow the entire system down. </br>
Example 2: If you have code that may be poorly written, you could end up in situations where you are only utilizing 1 core during task execution. This is common when engineers write loops in their programs. </br>
This means your application is not running optimally and not utilizing all available CPU cores. </br>
Example 3: Another example of poorly written code is where one task waits on another task; this may end up in what's called a CPU "Deadlock". This occurs when all executing tasks are waiting on each other in a circular reference. </br>
### Worker threads and tasks
To solve some of the above issues, programming, scripting and runtime frameworks allow us to write our code in such a way that we can create what's called "threads", "worker threads", or "tasks". </br>
Web servers are a good example of this. When an HTTP request comes to a web server, the web server can create a "worker thread" to handle that request. While that request gets executed on a CPU core, the web server may issue another "worker thread" to handle other incoming requests. </br>
This way, a web server can handle multiple incoming requests and makes use of all available CPU cores on the system. </br>
We may call the HTTP request handler of the web server a multithreaded application. </br>
The web server code for handling each HTTP request may be viewed as "single threaded" code </br>
We can take a look at two bash scripts that I have written that demonstrate single vs multithreaded code </br>
#### example: single threaded code
If we execute `./singlethread-cpu.sh`, we will notice that only one core of the CPU is busy at a time. </br>
Now because we execute a loop, each iteration of that loop may run on a different core, but only one iteration runs at a time. </br>
Bash itself is single threaded, so this script needs to be optimized if we want to make use of all available CPU cores. </br>
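To make this concrete, here is a minimal sketch of what a single-threaded script like this could look like (an assumption on my part: the real `singlethread-cpu.sh` may differ, and the loop bounds here are purely illustrative):

```shell
#!/bin/bash
# Illustrative sketch, not the actual singlethread-cpu.sh:
# each call runs to completion before the next starts, so only
# one CPU core is busy at any moment.
simulate_cpu_usage() {
  n=0
  while [ "$n" -lt 200000 ]; do n=$((n + 1)); done
}

for i in 1 2 3 4; do
  simulate_cpu_usage   # foreground: iterations run sequentially
done
```

Watching `htop` while something like this runs would show a single busy core at any moment, even though the scheduler may bounce the work between cores.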
#### example: multi threaded code
If we execute `./multithread-cpu.sh`, we will notice all CPU cores get busy. </br>
This is because in this script, we read the number of available cores with the `nproc` command. </br>
Then we loop the number of available cores and execute our `simulate_cpu_usage()` function. </br>
At this point it would technically still be single threaded, because it's still a loop and does not create more threads or processes. </br> To get around this we use a special character in bash called `&` at the end of our function. </br>
In Bash, the `&` character is used to run a command or a script in the background. When you append `&` to the end of a command, it tells the shell to execute the command asynchronously, allowing the shell to continue processing subsequent commands without waiting for the background command to complete. </br>
So if we have 8 CPU cores, our script will spawn 8 instances of our function, and each will run on a different CPU core. </br>
This means our application is a multithreaded application! </br>
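Putting those pieces together, a sketch of the multithreaded variant might look like this (again an illustration; the real `multithread-cpu.sh` may differ):

```shell
#!/bin/bash
# Illustrative sketch, not the actual multithread-cpu.sh:
# spawn one background worker per CPU core.
simulate_cpu_usage() {
  n=0
  while [ "$n" -lt 200000 ]; do n=$((n + 1)); done
}

cores=$(nproc)            # number of available CPU cores
for ((i = 0; i < cores; i++)); do
  simulate_cpu_usage &    # '&' runs each instance in the background
done
wait                      # block until every background worker finishes
echo "spawned $cores workers"
```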
### CPU Busy vs. System Busy: Understanding the Difference
Common scenarios where CPU usage interpretations may lead to misdiagnosis include:
When overall CPU usage is low, applications or systems can still be slow if there are other bottlenecks </br>
@ -168,12 +168,12 @@ This will give us overall CPU usage for processes as well as other performance m
[sysstat](https://github.com/sysstat/sysstat) is a collection of performance tools for the Linux operating system. </br>
The `sysstat` package provides us with many performance monitoring tools such as `iostat`, `mpstat`, `pidstat` and more. </br>
All of these provide insights into CPU usage and load on the system </br>
<b>Important Note:</b> <br/>
`sysstat` also contains tools which you can schedule to collect and record historical performance and activity data. </br>
We'll learn throughout this chapter that Linux writes performance data to file, but only has the current statistics written to file. So in order to monitor statistics over time, we need to collect this data from these files and collect it over a period of time we'd like to monitor. </br>
To install sysstat on Ubuntu Linux:
```
sudo apt-get install -y sysstat
```
@ -196,12 +196,12 @@ Usage: iostat [ options ] [ <interval> [ <count> ] ]
`iostat` gives us average CPU load and IO since system startup </br>
This means that if the system is only recently busy, the average CPU usage may be very low because it's an average from the time the system started up </br>
To get the current stats, we can provide an interval and we can also provide a count of snapshots we want of the stats. For example, we can get stats for every second, with a total of 5 snapshots.
We can also provide the `-c` option, which states that we are only interested in CPU stats and not I/O stats. </br>
```
iostat -c 1 5
```
@ -213,7 +213,7 @@ Let's understand the output. This output is pretty common across `sysstat` tools
* `%nice`
* Percentage of CPU utilization that occurred while executing at the user level with nice priority.
The Niceness value is a number assigned to processes in Linux. It helps the kernel decide how much CPU time each process gets by determining its priority.
* `%system`
* Percentage of CPU utilization that occurred while executing at the system level (kernel).
@ -269,7 +269,7 @@ Usage: pidstat [ options ] [ <interval> [ <count> ] ] [ -e <program> <args> ]
We can have `pidstat` also monitor a given program using the `-e` option as shown above. `-e` allows us to:
```
Execute program with given arguments and monitor it with pidstat. pidstat stops when the program terminates.
```
Examples:
@ -321,7 +321,7 @@ The `PID` is the process identifier of a process running on the operating system
Once we have the process we can see the command that is causing it. </br>
Most of the time, you can identify the culprit by looking at the `COMMAND` column of `top` or `htop`. </br>
If it's an executable, a script or program running somewhere we can locate it using the `ps` command. </br>
This command line helps us display information about a selection of the active processes </br>
You can check out `man ps` to get more details about all the available options. We are after the process using the `-p` option to pass the process ID of the culprit. This process ID can change, so if you follow along be aware of that.
@ -331,9 +331,9 @@ PID CMD
6268 /bin/bash ./singlethread-cpu.sh
```
Above, we use `-p` to pass the process ID, and `-o` to display output about the process and its command. That helps us locate the executable, in our case a bash script called `singlethread-cpu.sh` which is executed by bash, under the `/bin` folder. </br>
Now we learned in our command line episode, that the `./` allows us to execute scripts and executables in the current working directory. </br>
We cannot see what the current directory is, so if we need to locate that culprit, we can try to find it using the `find` command </br>
`find` takes a file path, where we can specify to search from the root of the filesystem with `/` </br>
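For example, to search for the script from the earlier `ps` output starting at the root of the filesystem (`2>/dev/null` and `|| true` simply suppress "Permission denied" noise, since an unprivileged user cannot read every directory):

```shell
# Search the whole filesystem for the culprit script.
# stderr is discarded and the non-zero exit caused by unreadable
# directories is ignored.
find / -name "singlethread-cpu.sh" 2>/dev/null || true
```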

View File

@ -5,3 +5,245 @@
This module is part of a course on DevOps. </br>
Checkout the [course introduction](../../../../../README.md) for more information </br>
This module is part of [chapter 3](../../../../../chapters/chapter-3-linux-monitoring/README.md)
This module draws from my extensive experience in server management, performance monitoring, and issue diagnosis. Unlike typical Linux Disk usage and monitoring guides, which dive into the technical history and architectures of different types of disk storage, this guide is rooted in practical, hands-on experience.
As DevOps engineers, we don't necessarily need to understand exactly how disk writes and reads are performed and how disk sectors are structured and designed. </br>
Many of these low-level details that I learned in university have not helped me at all in today's modern DevOps and Cloud Architecture roles. </br>
This module focuses on my approach to understanding disk functionality and usage, the key aspects I prioritize when monitoring disk usage and performance bottlenecks, and the strategies I employ to address related challenges.
## Disk usage - Understand how disk is used
Disk is an important resource for the health and performance of systems and applications. </br>
As a DevOps, SRE or platform engineer, we need to form a basic understanding of how disk is utilised by applications that developers write as well as by the operating system. </br>
As we have learned in our introduction to monitoring, disk is the only long term persistent storage for a computer system </br>
When a server shuts down, all important data, files, applications and configuration files are all retained on disk. </br>
### Applications are just files
Applications that developers write are compiled into binaries or executables, which are all just files on disk. </br>
Those files are then combined alongside configuration files and are "packaged" - which could just be a `zip` file. </br>
The package is then deployed (which is a fancy word for "copy") to a server (just like our server), extracted into a directory and the application can then be started. </br>
### Linux is a bunch of files
In Linux, everything is a file. Files are stored on disk </br>
We've seen this by looking at CPU and Memory statistics and we also learned that Linux writes all the statistics into `/proc` where other tools can read the statistics from </br>
Therefore, even real time statistical data is kept in files by the operating system. </br>
In our chapter on operating systems and servers, we saw that users, permissions, operating system configuration and statistics are all just files. </br>
The reason I am stating this is that files take up space. </br>
This makes disk space a critical resource for the operating system and the applications running on it, to remain stable and healthy. </br>
### Disk space
In Windows, we have C:\ drive as well as any other disk drives added. </br>
In Linux, instead of viewing a single disk like C:\, we view file systems instead. </br>
When we installed Linux in chapter 2, we had to select partitions and the Linux installation process would divide our single disk drive into multiple partitions which we now see as separate file systems.
This includes things like the boot partition. </br>
Linux file systems can run out of space, so we need to monitor overall disk space:
* Avoid storing data on the operating system disk. Instead use attached disk or other volumes or storage
When space in Linux runs out, many things come to a halt. </br>
In my experience, catastrophic things can happen when the operating system disk space runs out. </br>
Unlike Windows, Linux does not pop up with prompts and alerts stating that the available disk space is low. </br>
We need to have monitoring in place to detect this before it's too late. </br>
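As a minimal sketch of such a check (the 90% threshold and the warning message are my own illustrative choices, and `--output=pcent` assumes GNU `df`):

```shell
#!/bin/bash
# Warn when the root file system crosses a usage threshold.
threshold=90
# --output=pcent prints only the "Use%" column; tr strips all but digits
usage=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$usage" -ge "$threshold" ]; then
  echo "WARNING: root file system at ${usage}% capacity"
fi
```

A script like this could be scheduled with cron, though in practice you would usually feed the metric into a proper monitoring system instead.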
### Disk Reads & Writes (I/O)
Disk space is not the only resource to monitor. A process may read a file to read the content or it may write to the file if it needs to write content. </br> These read and write operations are technically a resource too. The disk has a certain bandwidth of reads and writes per second that it can handle. </br>
This metric is generally referred to as Input/Output speeds or disk I/O. </br>
When deploying virtual servers in the cloud, the cloud provider will often publish the maximum disk speeds. </br> This is important during cost analysis and choosing the right server for your needs. </br>
Some applications need to write to disk and may require better performing disks. Most web applications may not need fast disk. </br>
This is where it becomes important for engineers to understand the differences between stateful and stateless applications. </br>
### Stateful vs Stateless applications
When talking about disk, it's important to understand that applications or processes in software engineering can have state. </br>
Many applications may simply run, receive some data, process it and respond with output. </br>
When these types of applications are terminated, they technically do not lose any data, because they rely purely on inputs and outputs. </br>
A Web server is a prime example of this. Requests come in from a customer's browser to load a page and the web server simply responds with the content. If the web server is restarted, shut down, or terminated unexpectedly, it can recover and continue to serve content. The web server process does not have any internal state. </br>
However, let's say a customer is logged in on our website and their session data is stored on our web server; this makes our web server stateful. It also means all web requests for that logged-in customer need to be routed to that same web server where the session data is. </br>
If our web server shuts down or terminates unexpectedly, the customer may need to be routed to another healthy web server. However, their logged-in state is on our faulty server which has terminated.
A fix for the above scenario is that a web server should not store session state; this state can be handled by something like a database. Then our web servers can be restarted and the customers all remain logged in.
This means web applications should strive to be stateless. </br>
An example of a stateful application is something like a database, media storage, caching server or a content system. </br>
Stateful applications have to strive to ensure when they are terminated, that their data is always written to disk and potentially even replicated across a few servers to ensure data integrity. </br>
#### Why is this important for DevOps, SRE and Platform engineers?
It's very important when it comes to disk monitoring, to understand what will be running on our servers in a production environment. </br>
When largely managing stateless applications, you should not expect high disk usage or a high number of read and write operations. These types of applications may need some disk access to load any libraries, files or configuration needed to run, but should not be dependent on storing state. </br>
When managing stateful applications, we need to be aware and expect higher disk consumption and utilization.
Things to keep in mind:
* Stateful vs Stateless applications and impacts to disk usage
* Stateless applications are less reliant on disk space and I/O
* Stateful applications are more dependent on disk space and I/O
* Build servers (CI/CD) generally use higher disk space
* Build servers download large codebases, which will consume disk space
* Build servers compile code which may involve a lot of disk read/write operations
* Antivirus and security software scan files on disk, resulting in high disk read/write operations
## Disk monitoring tools for Linux
For disk monitoring tools, we should be able to monitor disk space for the overall system as well as be able to investigate directory sizes and where disk space consumption is coming from. </br>
In addition to disk space, we also need to monitor disk I/O usage to identify any bottlenecks when processes are intensively writing to disk. </br>
### df
[df](https://man7.org/linux/man-pages/man1/df.1.html) is a native Linux monitoring tool that allows us to review the disk space of each file system. </br>
The `df` tool allows us to view the file systems in our operating system. </br>
It's a great start to see overall used and available space for our server </br>
The `df` command takes a `-h` flag that tells `df` to print the output in human readable form.
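For example (device names and sizes will differ per system):

```shell
# Disk space for every mounted file system, human readable
df -h
# Only the file system that holds the root directory
df -h /
```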
### du
[du](https://man7.org/linux/man-pages/man1/du.1.html) is a native Linux monitoring tool that allows us to review file and directory sizes. </br>
It allows us to summarize disk usage of the set of files, recursively for directories.
`du` also supports a `-h` flag to output in human readable format. </br>
In my experience, `du` is great at finding large directories and files manually whilst navigating the file system.
`du` and `df` are great because they are available in most Linux distributions.
These commands may help in production environments where there may not be any tools installed, so I think it's important to practise using these two tools efficiently in combination. </br>
You can run `du` at different depth levels, which tells the command how far to traverse within directories. </br> This helps us visualise directories that might have high disk usage and then focus on those directories to locate large files or large sets of files. <br>
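A combination I often reach for, shown here against the home directory as a safe illustration (any path works):

```shell
# Sizes one directory deep, sorted largest first
du -h --max-depth=1 "$HOME" | sort -hr | head
# -s prints a single total for the directory
du -sh "$HOME"
```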
### tree
[tree](https://linux.die.net/man/1/tree) is a pretty cool utility in Linux that provides us similar capabilities as `du` but with a more human-friendly output. It allows us to list contents of directories in a tree-like format. </br>
It can print sizes of files and folders as well and we can also control the depth at which to list directories. </br>
There are a large number of tools when it comes to Linux and monitoring. </br>
In this course, I generally focus on the main most popular ones, but also keep in mind that in modern distributed systems, you may use an external monitoring system and may not log into production servers directly. </br>
### sysstat tools
[sysstat](https://github.com/sysstat/sysstat) is a collection of performance tools for the Linux operating system. </br>
The `sysstat` package provides us with many performance monitoring tools such as `iostat`, `mpstat`, `pidstat` and more. </br>
Many of these provide insights into disk utilization </br>
<b>Important Note:</b> <br/>
`sysstat` also contains tools which you can schedule to collect and record historical performance and activity data. </br>
We'll learn throughout this chapter that Linux writes performance data to file, but only has the current statistics written to file. So in order to monitor statistics over time, we need to collect this data from these files and collect it over a period of time we'd like to monitor. </br>
To install sysstat on Ubuntu Linux:
```
sudo apt-get install -y sysstat
```
### iostat
According to the documentation:
```
iostat -
Report Central Processing Unit (CPU) statistics and input/output statistics for devices and partitions.
```
We can run `iostat --help` to see high level usage
```
Usage: iostat [ options ] [ <interval> [ <count> ] ]
```
`iostat` gives us average disk utilization and IO since system startup </br>
This means that if the system has only recently become busy, the average utilization may appear low because it is averaged over the entire uptime of the system. </br>
To get the current stats, we can provide an interval and we can also provide a count of snapshots we want of the stats. For example, we can get stats for every second, with a total of 5 snapshots.
We provide the `-d` option, which states that we are interested in device utilization stats. We also use `-x` which indicates we would like extended statistics </br>
```
iostat -dx 1 5
```
Let's understand the output for the I/O report:
* `Device`: The name of the device (e.g., loop0, sda).
* `r/s`: The number of read requests per second.
* `w/s`: The number of write requests per second.
* `rkB/s`: The number of kilobytes read per second.
* `wkB/s`: The number of kilobytes written per second.
* `rrqm/s`: The number of read requests merged per second.
* `wrqm/s`: The number of write requests merged per second.
* `r_await`: The average time (in milliseconds) for read requests to be served.
* `w_await`: The average time (in milliseconds) for write requests to be served.
* `svctm`: The average service time (in milliseconds) for I/O requests.
* `%util`: The percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device).
### pidstat
`pidstat` provides detailed statistics for individual processes. This is useful for identifying which processes are consuming the most resources. </br>
According to documentation:
```
pidstat - Report statistics for Linux tasks.
The pidstat command is used for monitoring individual tasks currently being managed by the Linux kernel.
```
We can run `pidstat --help` to see high level usage
```
pidstat --help
Usage: pidstat [ options ] [ <interval> [ <count> ] ] [ -e <program> <args> ]
```
We can have `pidstat` also monitor a given program using the `-e` option as shown above. `-e` allows us to:
```
Execute program with given arguments and monitor it with pidstat. pidstat stops when the program terminates.
```
Examples:
The `-d` option allows us to `Report I/O statistics` specifically
```
pidstat -d 1 5
```
## Troubleshooting Disk usage
When it comes to troubleshooting disk usage, we are either looking at disk space monitoring to understand disk consumption or disk I/O monitoring to troubleshoot bottlenecks when applications are performing high reads or write operations on disk. </br>
I would always start with `df` and then `du` to dive deeper into working out which file system is affected and where space consumption is taking place. </br> So we try to locate the folder that takes up all the space. </br>
Since we learned that Linux is just a bunch of files, that is a good thing, because these files are all documented. </br>
This means when you see space is consumed by a directory that you don't know much about, you can simply search that directory online and you would be able to locate a root cause fairly quickly in most cases. </br>
We can then also quickly find out if it's safe to remove those directories or not.
And also work out if we can prevent that space from being consumed in the future. </br>
Other than disk space, another situation is working out whether there are any disk bottlenecks. </br>
Generally speaking web services and applications don't do much disk I/O, so it's quite rare in my experience to come across disk trouble. </br>
One place where you may find this come up as a problem is in the cloud, especially on build servers.
Build servers are servers that are dedicated to compile applications, run tests and package applications into artifacts that can be deployed. </br>
We'll cover build and deployments in our CI/CD chapter. But it's important to know that build operations involve high disk interactions. </br>
This means you'll want an SSD type disk which is fast for build pipelines. </br>
The cloud provider may give you certain virtual servers to choose from, but keep in mind that disk is important, so you will need to choose a server with an SSD type disk and not a general HDD. Otherwise your build pipelines may run a lot slower than expected. </br>
For disk I/O, we use `iostat` as mentioned above to work out if the overall server is having any I/O trouble. </br>
We can then also use `pidstat` to work out which process is the culprit and then see where that process is coming from by running the following and replacing `<pid>` with the process ID that we discovered in `pidstat`
```
ps -p <pid> -o pid,cmd
```

View File

@ -16,19 +16,19 @@ As a DevOps, SRE or platform engineer, we need to form a basic understanding of
Since developers will write applications that use data, this data would ultimately be stored and processed which will consume memory on the server. </br>
Software developers may forget that memory is not an unlimited resource and may also not know what the memory limits on servers are. </br>
In my experience most applications use anywhere between roughly `50mb` to `2gb` of memory depending on the role of the application.
The size of the memory is generally dependent on what data the application uses. </br>
In a future chapter, we will cover HTTP & Web servers. </br>
Some applications that developers write will be hosted by web servers, and these applications will often accept HTTP requests along with some data.
These applications may also rely on external systems like databases and other microservices to retrieve data like inventory records etc. </br>
The size of this data can cause memory usage to increase. </br>
Think about it. If an application accepts an HTTP request and loads a customer record from a database, let's say that record is `500kb` in size. If we deal with a large number of HTTP requests per second, this data could add up quickly.
That means memory usage could be pretty high for this one application. </br>
As requests by this application increase or decrease, memory usage can either increase or decrease over time. </br>
If this application runs on a server that is shared by other applications, they would contend for the memory resource. </br>
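To put rough numbers on the earlier `500kb` example (the request rate and hold time here are hypothetical), a quick back-of-envelope calculation:

```shell
# 500kb record, 100 requests/sec, each record held in memory for ~2 seconds
record_kb=500
req_per_sec=100
seconds_held=2
echo "$(( record_kb * req_per_sec * seconds_held / 1024 )) MB held by in-flight requests"
# → 97 MB held by in-flight requests
```

That is nearly 100MB of memory for a single modest service, before counting the runtime, caches or any other data the application holds.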
The garbage collector may run as a separate thread in the application, running periodically and releasing memory back to the operating system. </br>
There are more trade-offs here. </br>
The more frequently the GC runs, the more CPU it may consume iterating memory objects, but the less frequently it runs, the longer it will hold onto memory. </br>
The larger the application's memory demand, the more CPU the GC may consume to iterate over large quantities of memory. </br>
If this process is not working as expected, or there is a bug in the application code preventing release of unused memory, we would call that a memory leak. </br>
### free
[free](https://man7.org/linux/man-pages/man1/free.1.html) is a Linux tool that displays the amount of free and used memory in the system. The `free` command in Linux provides a summary of the system's memory usage, including total, used, free, shared, buffer/cache, and available memory </br>
#### Understanding the output
* `total`: The total amount of physical RAM in the system.
* `used`: The amount of RAM currently used by processes and the operating system.
* `buff/cache`: The amount of memory used for buffers and cache. This memory is available for use by processes if needed.
* `available`: An estimate of the amount of memory available for starting new applications, without swapping. This value is calculated by the kernel and is more accurate than the free column for determining how much memory is truly available.
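As an illustration of reading these fields, here is a captured sample of `free -m` output (the numbers are made up) and a small `awk` one-liner that computes the used percentage from the `total` and `used` columns:

```shell
free_sample='              total        used        free      shared  buff/cache   available
Mem:           7936        2048         512         256        5376        5376
Swap:          2047           0        2047'
echo "$free_sample" | awk '/^Mem:/ {printf "memory used: %d%%\n", $3 * 100 / $2}'
# → memory used: 25%
```

Against a live system you would simply pipe `free -m` into the same `awk` expression.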
The above tools are great when you need to jump into a server to see what's going on in relation to system load, cpu and memory usage. </br>
Linux stores memory statistics in `/proc/meminfo`
```
cat /proc/meminfo
```
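`MemAvailable` is usually the field worth watching; a sketch of pulling it out with `awk` (shown here against a captured sample so the numbers are reproducible):

```shell
meminfo_sample='MemTotal:        8058004 kB
MemFree:         1032000 kB
MemAvailable:    5505024 kB'
echo "$meminfo_sample" | awk '/^MemAvailable/ {printf "%d MiB available\n", $2 / 1024}'
# → 5376 MiB available
```

On a live server, replace the sample with `cat /proc/meminfo`.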
[sysstat](https://github.com/sysstat/sysstat) is a collection of performance tools for the Linux operating system. </br>
The `sysstat` package provides us with many performance monitoring tools such as `iostat`, `mpstat`, `pidstat` and more. </br>
Some of these provide insights into memory usage by the system as well as individual breakdowns of memory usage by applications. </br>
<b>Important Note:</b> <br/>
`sysstat` also contains tools which you can schedule to collect and historize performance and activity data. </br>
We'll learn throughout this chapter that Linux writes performance data to file, but only the current statistics. So in order to monitor statistics over time, we need to collect this data from these files over the period we'd like to monitor. </br>
To install sysstat on Ubuntu Linux:
```
sudo apt-get install -y sysstat
```
### pidstat
`pidstat` provides detailed statistics on memory usage for individual processes. This is useful for identifying which processes are consuming memory and for reporting on memory usage for a specific process over time. </br>
According to documentation:
```
Usage: pidstat [ options ] [ <interval> [ <count> ] ] [ -e <program> <args> ]
```
We can have `pidstat` also monitor a given program using the `-e` option as shown above. `-e` allows us to:
```
Execute program with given arguments and monitor it with pidstat. pidstat stops when the program terminates.
```
Examples:
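A few invocations I commonly reach for (sketches; `1234` is a placeholder PID), plus a reproducible `awk` pass over captured `pidstat -r`-style output to find the biggest memory consumer:

```shell
# pidstat -r 5 4          # memory stats for all tasks, every 5s, 4 reports
# pidstat -r -p 1234 5    # memory stats for PID 1234 only, every 5 seconds
# pidstat -r -e sleep 30  # launch `sleep 30` and monitor it until it exits

# Finding the top %MEM consumer in captured output (sample data, simplified columns):
sample='PID minflt/s majflt/s VSZ RSS %MEM Command
1234 0.50 0.00 1200000 524288 6.25 java
5678 0.10 0.00 300000 102400 1.22 nginx'
echo "$sample" | awk 'NR > 1 && $6 + 0 > max {max = $6; cmd = $7} END {print cmd, max "%"}'
# → java 6.25%
```

The real `pidstat` output has a few extra leading columns, so adjust the field numbers to match what your version prints.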


This module is based on my long experience looking after servers, performance, monitoring and diagnosing issues. </br>
This is not your average Linux network monitoring guide. </br>
Although we'll be covering theory, the objective is not to bombard the viewer with too much detail. </br>
We'll cover the theory conceptually, and then use practical examples to show you real world concepts in action </br>
This guide will feature basic and some deeper advanced topics, tools and techniques for dealing with network observability and monitoring </br>
Although it's basic, this guide touches on all the components that I still use in day-to-day modern DevOps & Cloud engineering </br>
It will be important to pay attention as all of the details in this module will form the foundation of monitoring HTTP web & microservices, especially when: </br>
* One service or server cannot talk to another service or server
* Troubleshooting connection errors
* Understanding latency
* Understanding basic network bottlenecks
Unlike CPU, memory & disk, the network has a few components to understand, as each one of the components can cause connection failure, errors, delays and bottlenecks
In the first module, we looked at a high level overview of networking, and in the diagram below I highlight some of the components we will cover in this module. </br>
As an engineer you want to have an understanding of how servers/processes talk to one another over the network. This happens through a network connection. </br>
There are two types of connection (which are referred to as "network protocols"), called TCP and UDP. </br>
Each protocol varies slightly in the way connections are established. </br>
The box on the left is our source server which runs a process and makes a network connection to the box on the right which is another server with a process on it </br>
This could be a Web browser (left box) opening a web page (Github.com) which connects to a Web server (right box) on the internet somewhere hosted by Github. </br>
Let's keep referring to the diagram above, and talk about each network component. </br>
### IP addresses
We covered in previous modules that IP addresses are identifiers for servers belonging to a network and a server must have an IP address in order to belong to a network. </br>
An IP address can be either public or private. </br>
Generally speaking, servers always have a private IP address when belonging to a network. A server may or may NOT have a public IP address depending on network setup and configuration. </br>
So basically that "gateway" is the gateway to the public internet as the network goes to your router and the router gets a Public IP address from your Internet service provider. </br> That's why when you reboot your home router, your Public IP address may change. </br>
A similar architecture is generally followed in company and office networks. Your computer in the office will route outbound traffic to a network device or router and that will have a Public IP address provided by the company's ISP. All similar to what is shown in the above diagram</br>
In the cloud, servers would generally have a Public IP address you can visibly see in the cloud provider web interface </br>
So each server could have its own Public IP address. Cloud providers also allow you to remove the Public IP address, which renders the server completely private and inaccessible from public networks </br>
It's important to know that Public IP addresses are used for both inbound and outbound traffic. </br>
So network requests can go from `server-a` to the router, and out via the router's Public IP address </br>
`server-b` or any destination that receives requests from `server-a` will see that it originates from the Public IP address we have for `server-a` <br>
If `server-b` needs to respond to that request, it may just respond to `server-a` over the same connection
Because the illustration shows a network request from left to right, it's important to know that requests can also go from right to left </br>
However to do this, `server-a` needs to listen on a port and have a process running that can accept requests. Also the router device on the left needs to have a "port forwarding" rule to tell the router which Private IP address to send all traffic that is coming over a given port. </br>
Therefore a server and its router need to be configured in order to allow network requests to flow all the way through
### Server vs Client
In order for a client and server to talk, a network connection must be made. </br>
The client will need a private IP as we've mentioned earlier and it will also need a source port for the connection. This is so that the reply can find its way back to the client. The source port is generally assigned by the client's Operating System. </br>
Source ports are limited and each Operating System can have different limits for the number of source ports it can allow. This means that we could have port exhaustion if a client tries to create too many connections. </br>
Once the client has a source IP and Port, it establishes a connection to the destination IP address. Now there are some technical nuances to establishing network connections and there is more to it, however I'll be keeping this brief and simple. </br>
In my opinion, a simplified understanding is always a better place to start instead of drowning in the depth of theory and details. </br>
When it comes to monitoring we'll have everything we need to know to form a great fundamental understanding in troubleshooting systems. </br>
Now this connection attempt from the client will end up at a destination server which would have a process running on it and listening on a port. This process could be a web server or application. </br>
To accept a connection, a process must "listen" on a port </br>
That port must also be open on the server, meaning no firewall or antivirus should be blocking that port </br>
If there is a network device, proxy, load balancer or router in front as per our diagram, that device needs a port forwarding rule to send traffic to that destination server on a given port. </br>
Please make an important note here: when you see "Connection Refused", it generally means there is no process or application listening on the destination port you are trying to reach. This is a popular error that's often misinterpreted by developers and engineers.
Another important error is "Connection Timeout". If you see this error, or simply a network request hanging, it generally means that the port you are trying to reach is being blocked by something. This could be a cloud security rule, firewall, or network device like a router etc.
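A sketch that reproduces the difference from the command line (this uses bash's `/dev/tcp` feature; the host and port are placeholders — port `1` on localhost almost never has a listener, so this normally prints the "refused" case):

```shell
check_port() {
  if timeout 3 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null; then
    echo "open: something is listening"
  elif [ $? -eq 124 ]; then
    # timeout(1) exits 124 when nothing answered at all -> likely blocked
    echo "timeout: port is probably blocked by a firewall or security rule"
  else
    echo "refused: no process is listening on that port"
  fi
}
check_port 127.0.0.1 1
```

An immediate "refused" tells you the network path is fine but nothing is listening; a hang followed by "timeout" points at something silently dropping your packets.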
### Network Bandwidth
Once a connection is established between client and server, then the client can start sending network requests and the server can respond with network responses. These requests and responses often contain data. </br>
This can be a client web browser getting HTML and web page content from a server, it could be a web browser client calling an API service for data, it could be two microservices communicating with one another. </br>
These network requests and responses may sometimes contain large datasets. These requests and responses generally take up what's called network bandwidth. </br>
Network bandwidth is limited by network speeds which can involve the client and server network interfaces (or network cards), network cables, devices and ISP speeds </br>
Bandwidth is often monitored and measured in bytes per sec, megabytes per sec, gigabytes per sec, etc. </br>
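One detail that trips people up: link speeds are quoted in bits per second, while transfer rates in monitoring tools are usually shown in bytes per second. A quick sanity check:

```shell
link_mbps=1000                               # a 1 Gbit/s link (hypothetical)
echo "$(( link_mbps / 8 )) MB/s theoretical maximum"
# → 125 MB/s theoretical maximum
```

So if a tool reports a transfer hovering around 120 MB/s on a gigabit link, the link itself is probably saturated.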
The two are `TCP` and `UDP` </br>
So when we talked about network connection and how the client and server establish this connection I mentioned there are some nuances to how this connection is established. </br>
`TCP` is the main network protocol used by the Web because it's designed to be reliable. </br> Networks are flaky, meaning a network packet is never guaranteed to arrive from client to server </br>
To make the network more reliable, TCP involves a handshake and a few network requests back and forth between client and server to establish connection. </br>
This network connection handshake in TCP is designed to help ensure connections are established when the network can be flaky in nature. </br>
This comes at a performance cost, therefore there is another protocol called `UDP` which is more of a "send and forget" type of network request: a client sends a request, waits for a response, and will simply retry if it does not get a response. </br>
A client may throw an error after a few retries when a UDP request fails. </br>
```
echo "Starting TCP server on port 12345..."
nc -lk 12345
```
- The `-l` flag tells netcat to listen on a given port for incoming connections
- The `-k` flag tells netcat to keep listening for another connection once it has received one. Without `-k`, netcat will close once it receives one connection
We can leave our server running in a terminal and open another to test the port </br>
This helps us during monitoring to test whether a server is listening on a port or not:
netstat:
```
netstat -a | grep 12345
```
This will help us understand the connection lifecycle as well. We can view the connection state, in this case it's `LISTEN` and `ESTABLISHED` </br>
We've learned now that `LISTEN` is for servers generally listening on a port for inbound connections. `ESTABLISHED` is the state when a connection is open and the client can send and receive messages to each other. </br>
In our example client and server, we are keeping the connection open so we can observe it in our command line tools above. </br>
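On modern Linux, `ss` (from the iproute2 package) is the usual replacement for `netstat`, and it can filter directly on socket state; a couple of sketches:

```shell
ss -ltn                   # all listening TCP sockets, numeric ports
ss -tn state established  # only currently-established TCP connections
```

While the client and server above are connected, the second command should show our `12345` connection.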
TCP socket states represent the various stages a TCP connection goes through during its lifetime. </br>
Interestingly, if we close our client by pressing `CTRL+C` and quickly use `ss` or `netstat` to check the connection, we will notice the `ESTABLISHED` state has gone and we now see a `TIME_WAIT` state. </br>
This is when a connection has been used by client and server and is about to be closed. The connection is now in a "recycling" state where the operating system will get to reuse that port. Generally a connection will spend roughly 60 sec in `TIME_WAIT` before the operating system will be able to re-use that connection </br>
The time wait duration can be adjusted in Linux; if we require a lot more connections quickly, we may reduce it </br>
We can run the following loop that keeps sending messages every 5 seconds and will close the connection after sending each message. This allows us to observe `TIME_WAIT` sockets
Here are the TCP socket states in order:
* `CLOSED`: The initial state. No connection exists.
* `LISTEN`: The server is waiting for incoming connection requests.
* `SYN_SENT`: The client has sent a connection request (SYN) and is waiting for a matching connection request (SYN-ACK) from the server.
* `SYN_RECEIVED`: The server has received the client's connection request (SYN) and sent a connection acknowledgement (SYN-ACK), waiting for the final acknowledgement (ACK) from the client.
* `ESTABLISHED`: The connection is open, and data can be sent and received between the client and server.
* `FIN_WAIT_1`: The client or server has initiated the connection termination and is waiting for the other side to acknowledge (FIN).
* `FIN_WAIT_2`: The side that initiated the termination has received the acknowledgement (ACK) of its FIN and is waiting for the other side to send its FIN.
* `CLOSE_WAIT`: The side that received the first FIN is waiting to send its own FIN.
* `CLOSING`: Both sides have sent FINs, but neither has received the final acknowledgement (ACK).
* `LAST_ACK`: The side that sent the first FIN is waiting for the final acknowledgement (ACK) of its FIN.
* `TIME_WAIT`: The side that sent the final acknowledgement (ACK) is waiting for a period of time to ensure the other side received it.
* `CLOSED`: The connection is fully terminated, and no further communication is possible.
## Network Traffic Monitoring
To monitor this network traffic, I often rely on logging. </br>
Traffic will flow from client to server, and the applications we run on servers generally have the capability to log traffic requests. </br>
All popular web servers have this feature that you can configure. </br>
Once configured, the server process will write incoming traffic logs to a file, where each line in the file represents a request. </br>
It is generally in a similar format like this, either delimited by spaces or commas separating each field:
```
<date> <client-IP> <request-info> <status> <time-taken-in-milliseconds> <etc>

```
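Once logs are in that shape, even plain `awk` gets you useful traffic metrics. A sketch over a captured sample (the fields here follow the format above: date, client IP, request, status, time-taken in milliseconds — the entries themselves are made up):

```shell
logs='2025-02-20 10.0.0.5 GET/index 200 123
2025-02-20 10.0.0.6 GET/api 500 894
2025-02-20 10.0.0.5 GET/index 200 101'
# Count requests and average latency per HTTP status code
echo "$logs" | awk '{n[$4]++; sum[$4] += $5}
  END {for (s in n) printf "%s: %d requests, avg %dms\n", s, n[s], sum[s] / n[s]}' | sort
# → 200: 2 requests, avg 112ms
# → 500: 1 requests, avg 894ms
```

Against a real access log you would replace the heredoc with `cat access.log` and adjust the field numbers to your server's log format.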