# 🎬 Introduction to Linux monitoring

## 💡 Preface

This module is part of a course on DevOps. </br>
Check out the [course introduction](../../../../README.md) for more information </br>
This module is part of [chapter 3](../../../../chapters/chapter-3-linux-monitoring/README.md)
## What is Monitoring

Monitoring is the process of collecting, analyzing and using data to track the performance and health of systems. </br>

Monitoring involves the use of tooling to:

* Capture, collect or extract data from systems, services, applications, processes, etc.
  * This data could be logs, metrics or traces
* Store this raw data in a storage system where it can be processed
* Process the raw data so it can be analysed
* Visualise the data, which enables teams to track and analyze the health and performance of these systems, services, applications and processes </br>
* Detect and notify engineering teams when a potential issue occurs that needs attention
  * For example, if CPU usage stays high for a certain amount of time, disk usage runs over a threshold, or a process crashes with an error </br>
## What is Observability

Observability is a broader concept that refers to the ability to understand the internal state of a system based on the data it produces. </br>

It goes beyond traditional monitoring by providing deeper insights into the system's behavior and enabling more effective troubleshooting and root cause analysis. </br>

Some examples of observability data include:

* Logs
* Metrics
* Traces

Observability is often a more investigative approach than monitoring. We use it to find bottlenecks in a system or to find the root cause of issues. </br>
## Monitoring examples

The most basic form of monitoring is to use the tools the operating system provides to look at a system's basic resource utilization and analyze its health and performance. </br>
### Average System Load (memory+cpu)

For example, the operating system provides a native command called `top` to analyze and monitor overall system load and some performance metrics. </br> `top` is another command line executable. On many distributions it lives at `/bin/top`, and `which top` shows where it is on yours.

If we run `top` on our Linux server, we see the system load averages, current memory and CPU usage, and all the processes and threads running on our system. </br>
That gives us an overview of memory and CPU usage </br>

Load averages are made up of 3 important numbers. Each number is the average system load over a given timeframe. On Linux, this load counts processes that are running or waiting to run on a CPU. It also counts processes in uninterruptible sleep, which are usually waiting on disk I/O. </br>
The first number is the average system load over the last minute, the second covers the last 5 minutes, and the third covers the last 15 minutes. </br>
This tells us whether there is ongoing load or just a small recent spike. </br>
In simple terms it tells us whether the system was recently busy or constantly busy </br>

We can also see the load averages by printing the contents of `/proc/loadavg`. </br>

Below we see load averages were:

* `0.62` in the last 1 minute
* `0.14` in the last 5 minutes
* `0.05` in the last 15 minutes

```
cat /proc/loadavg
0.62 0.14 0.05 1/568 1807
```

It's good to know that Linux stores a ton of process information in the `/proc` folder. </br>
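We can label the five fields in `/proc/loadavg` with a few lines of `awk`. According to the `proc(5)` man page, the fourth field has two numbers separated by a slash. The first is the number of currently runnable kernel scheduling entities and the second is the total number that exist. The fifth field is the PID of the most recently created process. This is a minimal sketch. `describe_loadavg` is a hypothetical helper name, not a standard command:

```shell
#!/usr/bin/env bash
# describe_loadavg (hypothetical helper): reads one line in /proc/loadavg
# format on stdin and prints each field with a label, one per line.
describe_loadavg() {
  awk '{
    split($4, tasks, "/")              # "1/568" -> runnable / total
    printf "load_1min=%s\n",   $1
    printf "load_5min=%s\n",   $2
    printf "load_15min=%s\n",  $3
    printf "runnable=%s\n",    tasks[1]
    printf "total_tasks=%s\n", tasks[2]
    printf "last_pid=%s\n",    $5
  }'
}

# Example: describe the live values on this machine (Linux only)
if [ -r /proc/loadavg ]; then
  describe_loadavg < /proc/loadavg
fi
```

If we feed it the sample line above, it prints `load_1min=0.62` through `last_pid=1807`, one field per line.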
### Network Utilization

In a previous module, we briefly covered networking when we created and configured a network for our virtual server. We also learned about IP addresses. </br>

For servers to communicate with other servers in a network, or even over the internet, they need to have an IP address. </br>

In addition, to connect to another server, we need the IP address of that server as well as a port number. </br> All network connections occur over a network port. </br> Ports are a limited resource, and a server only has so many ports available. A server can also only support a certain number of network connections over a given port. </br>

```
                     --- 144.0.1.2 -------- 143.0.1.2:443 -\
                 ---/    public IP            public IP     ---\
            ----/                                               ---\
         --/                                                        --
+----------------------+                                  +----------------------+
| private IP           |                                  | private IP           |
| 10.0.0.4:1024        |                                  | 10.0.0.4:443         |
| port                 |                                  | port                 |
|                      |                                  |                      |
+----------------------+                                  +----------------------+
```
#### Network resources

There are a number of resources we need to consider when monitoring networks:

* IP addresses
  * Every network has a limited range of IP addresses, so a network can run out of IP addresses.
* Ports
  * Every network connection needs a source and a destination port number.
  * Source ports are allocated by the operating system when we make a network connection. The operating system assigns an ephemeral (temporary) port number from a predefined range of ports. The range depends on the operating system and its configuration. On many Linux systems the default is 32768 to 60999.
* Connections
  * A server can only make and receive a limited number of network connections.
  * We are often tasked with monitoring how many connections a server has open, so we know whether connections are being exhausted.
  * Connection limits can depend on hardware support and operating system settings.
* Bandwidth
  * I like to think of bandwidth as the speed at which our server can send and receive data on the network
  * Bandwidth depends on the network speeds and network hardware the server uses
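To see the ephemeral port range on a Linux server, we can read `/proc/sys/net/ipv4/ip_local_port_range`, which holds the `net.ipv4.ip_local_port_range` kernel setting. The sketch below prints the range and how many ports it contains. It assumes the file holds two numbers, low and high. `ephemeral_port_count` is a made-up helper name:

```shell
#!/usr/bin/env bash
# ephemeral_port_count (hypothetical helper): reads "LOW HIGH" on stdin, as
# found in /proc/sys/net/ipv4/ip_local_port_range, and prints how many
# ports the range contains (both ends included).
ephemeral_port_count() {
  awk '{ print $2 - $1 + 1 }'
}

range_file=/proc/sys/net/ipv4/ip_local_port_range
if [ -r "$range_file" ]; then
  echo "ephemeral range: $(cat "$range_file")"
  echo "ports available: $(ephemeral_port_count < "$range_file")"
fi
```

For example, with a range of `32768 60999` it reports `28232` ports.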
#### Network monitoring tools

If we run commands like `netstat` or `ss` (socket statistics), we can see the network connections on our server, which helps us review network connectivity. <br>

`ss -s` gives us a summary </br>

These tools can help us troubleshoot whether a network port is open. </br>
Microservices, web services, databases and other applications that accept network connections usually do so by listening on a network port. </br>

`netstat` can also list connections and listening ports, as shown below. `netstat -s` prints per-protocol networking statistics.

```
netstat -a -l | head -n 10
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 localhost:ipp           0.0.0.0:*               LISTEN
tcp        0      0 localhost:34521         0.0.0.0:*               LISTEN
tcp        0      0 localhost:34789         0.0.0.0:*               LISTEN
tcp        0      0 localhost:36491         0.0.0.0:*               LISTEN
tcp        0      0 localhost:44305         0.0.0.0:*               LISTEN
tcp        0      0 localhost:domain        0.0.0.0:*               LISTEN
tcp        0      0 Marcel-Laptop:49910     162.159.36.20:https     ESTABLISHED
tcp        0      1 Marcel-Laptop:57518     169.254.169.254:http    SYN_SENT
```
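When we watch for connection exhaustion, a count of TCP sockets per state is often easier to read than the full list. The sketch below summarises `ss -tan` output. `-t` selects TCP sockets, `-a` includes listening and non-listening sockets, and `-n` shows numeric addresses and ports. `count_tcp_states` is a hypothetical helper name. It assumes the state is the first column, as in current `ss` output:

```shell
#!/usr/bin/env bash
# count_tcp_states (hypothetical helper): reads `ss -tan` output on stdin,
# skips the header line and prints "STATE COUNT" pairs sorted by state name.
count_tcp_states() {
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Example: summarise the TCP sockets on this machine, if `ss` is installed
if command -v ss >/dev/null 2>&1; then
  ss -tan | count_tcp_states
fi
```

`ss` abbreviates some state names, such as `ESTAB` for established connections and `TIME-WAIT`. Those are the names that appear in the counts.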
### Disk space

To monitor disk space usage, we can use `df -h`. The `-h` flag prints sizes in human-readable units such as M and G. </br>
If we run `df -h` on our server, we can see file system usage on our server. </br>

```
df -h
Filesystem                         Size  Used Avail Use% Mounted on
tmpfs                              197M  1.1M  196M   1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv   12G  4.4G  6.4G  41% /
tmpfs                              985M     0  985M   0% /dev/shm
tmpfs                              5.0M     0  5.0M   0% /run/lock
GIT                                192G   83G  109G  44% /home/devopsguy/gitrepos
/dev/sda2                          2.0G   95M  1.7G   6% /boot
tmpfs                              197M   12K  197M   1% /run/user/1000
```

This gives an overview of whether any file systems are running low on disk space and need looking into </br>
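The same information can drive a simple alert. This sketch reads `df -P` output and prints any file system at or above a usage percentage. `-P` selects the POSIX output format, which keeps each file system on one line. The helper name `disks_over` and the 80% threshold are assumptions for illustration:

```shell
#!/usr/bin/env bash
# disks_over PERCENT (hypothetical helper): reads `df -P` output on stdin and
# prints "MOUNTPOINT USE%" for every file system at or above PERCENT.
# Note: mount points containing spaces are not handled by this simple sketch.
disks_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                  # "41%" -> "41"
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Example: list file systems that are 80% full or more
df -P | disks_over 80
```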

We can also analyze space in a file system or in specific directories using the `du -h` command </br>
This command takes a directory. In our case we can start from the root directory `/` and dig further down to find large directories or files. The `-d 1` flag limits the output to one directory level below the one we pass in. </br>

We use `sudo` here because we need it to read certain folders outside of our home directory. </br>

```
sudo du -h -d 1 /
```
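To find the biggest directories quickly, we can sort the `du` output. This sketch assumes GNU coreutils, whose `sort -h` understands sizes like `4.4G` and `95M`. `largest_dirs` is a hypothetical helper name. The `-x` flag keeps `du` on a single file system, so it skips other mounts such as `/proc`:

```shell
#!/usr/bin/env bash
# largest_dirs DIR [COUNT] (hypothetical helper): prints the COUNT largest
# entries one level below DIR, biggest first. The first line is DIR itself,
# because du also reports the directory's own total.
largest_dirs() {
  local dir="$1" count="${2:-10}"
  du -x -h -d 1 "$dir" 2>/dev/null | sort -rh | head -n "$count"
}
```

Without the function, the same idea is `sudo du -x -h -d 1 / | sort -rh | head -n 10`.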
## Basic Monitoring Commands

* `top`
* `htop`
* `netstat`
* `ss`
* `df`
* `du`
* `vmstat` (provided by the `procps` package, called `procps-ng` on some distributions)
* `pidstat` (provided by the `sysstat` package)
* `iostat` (provided by the `sysstat` package)
* `mpstat` (provided by the `sysstat` package)
## Logs vs Metrics vs Traces

Logs, metrics, and traces are different data formats produced by systems. They help us understand various aspects of performance and health.

Each data format is produced in a different way and needs different tools, involving both developers and operations.

For example, developers use logging SDKs and configure log verbosity for application logs. These logs are then collected, processed, and stored for analysis by tools that DevOps engineers set up and configure.
### Logs

Applications and programs generate logs as detailed records of the activities and events that occur within them. <br/>
They capture information such as errors, warnings, informational messages, and debugging data, which are essential for monitoring, troubleshooting, and analyzing the behavior and performance of the software. </br>

We already have a little experience with logging from the previous chapter, when we wrote our first bash script. We used the `echo` command to output events and activities about the execution of our script. </br>

Logs are often written to a file on disk. Applications can generally be configured to write logs to a given file path on disk. </br>
The challenges with writing logs to a file are:

* Files can get too large if the application keeps writing to the same file.
  * Applications often perform log rotation to prevent this. They write only a fixed amount of logs to a file, then start writing to a new file.
* Old logs must be cleaned up from the file system to prevent the disk from running out of space.

Operating systems provide output streams for applications to write output to </br>

For example, in previous modules we used command line programs, and these programs write their output to our terminal. </br>
This output stream is called `stdout` or "standard out" </br>

It's advantageous for applications to write logs to `stdout` rather than to a file, as this avoids the challenges of writing files to disk mentioned above. </br>

There are a number of tools that help collect logs:

* <b>Fluentd</b>: An open-source data collector for unified logging layers.
* <b>Logstash</b>: A server-side data processing pipeline that ingests data from multiple sources simultaneously.
* <b>Graylog</b>: A powerful log management and analysis tool.
* <b>Filebeat</b>: A lightweight shipper for forwarding and centralizing log data.
* <b>Promtail</b>: An agent which ships the contents of local logs to a Loki instance.
* <b>Splunk</b>: A platform for searching, monitoring, and analyzing machine-generated big data.
* <b>Elastic Agent</b>: A single, unified way to collect data from your infrastructure and applications.
* <b>Vector</b>: A high-performance, end-to-end observability data pipeline.
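The sketch below shows what writing logs to `stdout` can look like in a bash script. Every line gets a timestamp and a level, and the script never opens a log file itself. The `log` function and its line format are assumptions for illustration, not a standard:

```shell
#!/usr/bin/env bash
# log LEVEL MESSAGE... (hypothetical helper): writes one log line to stdout
# in the form "<UTC timestamp> <LEVEL> <message>".
log() {
  local level="$1"
  shift
  printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$level" "$*"
}

log INFO "backup started"
log ERROR "backup failed: disk full"
```

Whoever runs the script still decides where the output goes. A log collector can read the process output, or we can use a plain redirect such as `./myscript.sh >> myscript.log` (the script name is hypothetical).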
### Metrics

Logs are great for monitoring application behaviour, as they report activities and events, which may include errors. </br>
However, logs can be quite heavy. They need to be parsed and stored, which can take up a lot of space. </br>
It also takes a lot of compute to process logs into analytical metrics that can be aggregated and used in real time </br>
This is where metrics help. </br>
Think of metrics as "key" + "value" pairs of data. </br>
Metrics are much smaller than logs. They are also faster to process and summarize, and to run analytical computations on in real time. </br>

For example, CPU, memory and disk usage can be described in metrics format.
The data is a lot smaller, and we can quickly calculate CPU usage over time to detect high system load.

There are a number of tools that help collect metrics:

* <b>Prometheus</b>: An open-source systems monitoring and alerting toolkit originally built at SoundCloud. It has a multi-dimensional data model and a powerful query language called PromQL.
* <b>Grafana</b>: Primarily a visualization tool. Grafana can query and display metrics from various data sources, including Prometheus, InfluxDB, and Graphite.
* <b>InfluxDB</b>: A time-series database designed to handle high write and query loads. It is often used for storing metrics and events.
* <b>Graphite</b>: An enterprise-ready monitoring tool that runs equally well on cheap hardware or Cloud infrastructure. It stores numeric time-series data and renders graphs of this data on demand.
* <b>Telegraf</b>: An agent for collecting, processing, aggregating, and writing metrics. It is part of the TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor).
* <b>Zabbix</b>: An open-source monitoring software tool for diverse IT components, including networks, servers, virtual machines, and cloud services.
* <b>Datadog</b>: A monitoring and analytics platform for cloud-scale applications. It provides metrics collection, visualization, and alerting.
* <b>New Relic</b>: A comprehensive monitoring tool that provides real-time insights into application performance, infrastructure, and user experience.
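To make the "key" + "value" idea concrete, this sketch turns `/proc/meminfo` into three small metric lines. The metric names are made up for this example. Real collectors such as Prometheus' node exporter use their own names. "Used" memory is approximated here as `MemTotal - MemAvailable`, which is one common convention:

```shell
#!/usr/bin/env bash
# memory_metrics (hypothetical helper): reads /proc/meminfo-formatted text on
# stdin and prints memory usage as "key value" metric lines.
memory_metrics() {
  awk '
    $1 == "MemTotal:"     { total = $2 }   # values are reported in kB
    $1 == "MemAvailable:" { avail = $2 }
    END {
      printf "memory_total_kb %d\n", total
      printf "memory_available_kb %d\n", avail
      if (total > 0) printf "memory_used_percent %.1f\n", (total - avail) * 100 / total
    }'
}

if [ -r /proc/meminfo ]; then
  memory_metrics < /proc/meminfo
fi
```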
### Tracing

Metrics are mostly designed to give us statistical data about applications, such as CPU, memory and disk IO usage, requests per second, or iterations per second of functions. </br>
Just like with logs, developers can add metrics to their applications too.

However, we may have multiple applications, web services and microservices all talking to one another over networks. In that case it can be useful to trace a network request all the way through the system to monitor a full transaction.

For example, a customer interacts with a website in the browser. That makes a web request to our front end. Our front end makes a few requests to back ends, and some back ends interact with one another and with databases.

This is where tracing comes in. With tracing, applications and some web servers inject tracking data into requests as they flow through systems. We can then use visualization tools to see an entire transaction with all its requests.

A lot can happen to form a transaction, and sometimes systems can slow down. </br>
Tracing is very useful to detect bottlenecks in a distributed system

There are a number of tools that help collect traces:

* <b>Jaeger</b>: An open-source, end-to-end distributed tracing tool originally developed by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems.

* <b>Zipkin</b>: An open-source distributed tracing system that helps gather timing data needed to troubleshoot latency problems in service architectures.

* <b>OpenTelemetry</b>: A collection of tools, APIs, and SDKs that can be used to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.

* <b>New Relic APM</b>: Provides distributed tracing capabilities to monitor and troubleshoot application performance issues.

* <b>Datadog APM</b>: Provides end-to-end distributed tracing from frontend devices to backend services, with automatic instrumentation for popular frameworks.
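Tracing relies on each request carrying an ID that services pass along. One widely used format is the W3C Trace Context `traceparent` HTTP header, which OpenTelemetry uses by default. Its value has the form `version-traceid-parentid-flags`. The sketch below only builds such a header value to show its shape. It is not a tracer, and the function names are hypothetical:

```shell
#!/usr/bin/env bash
# random_hex N (hypothetical helper): prints N random bytes as 2*N lowercase
# hex digits.
random_hex() {
  od -A n -t x1 -N "$1" /dev/urandom | tr -d ' \n'
}

# new_traceparent (hypothetical helper): builds a W3C Trace Context value:
#   00               - version
#   32 hex digits    - trace ID (16 random bytes), shared by the whole trace
#   16 hex digits    - parent/span ID (8 random bytes)
#   01               - flags, "sampled"
new_traceparent() {
  printf '00-%s-%s-01\n' "$(random_hex 16)" "$(random_hex 8)"
}

tp="$(new_traceparent)"
echo "traceparent: $tp"
```

A client could send it with a request, e.g. `curl -H "traceparent: $tp" https://example.com/`. Each service that supports Trace Context keeps the trace ID and puts its own span ID in the parent position when it calls the next service. That is how a tracing tool can stitch together all the requests of one transaction. In practice, SDKs such as OpenTelemetry generate and propagate these headers for you.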
# 🎬 Linux Network Monitoring

## 💡 Preface

This module is part of a course on DevOps. </br>
Check out the [course introduction](../../../../../README.md) for more information </br>
This module is part of [chapter 3](../../../../../chapters/chapter-3-linux-monitoring/README.md)

This module is based on my long experience looking after servers, performance and monitoring, and diagnosing issues. </br>
This is not your average Linux network monitoring guide. </br>

Although we'll be covering theory, the objective is not to bombard the viewer with too much detail. </br>
We'll cover the theory conceptually, and then use practical examples to show you real world concepts in action </br>
This guide will feature basic and some more advanced topics, tools and techniques for dealing with network observability and monitoring </br>
Although it's basic, this guide touches on all the components that I still use in day to day modern DevOps & Cloud engineering </br>

It will be important to pay attention, as all of the details in this module will form the foundation of monitoring HTTP web & microservices, especially when: </br>

* One service or server cannot talk to another service or server
* Troubleshooting connection errors
* Understanding latency
* Understanding basic network bottlenecks
## Network usage - How the Network works

In our chapter on Operating Systems, we covered the basics of the system resources that the operating system manages. These include the network, which processes use to communicate with one another.</br>

Processes can communicate with one another on the same server or across servers, if the network allows it </br>
Unlike CPU, memory & disk, the network has a few components to understand, as each one of them can cause connection failures, errors, delays and bottlenecks

In the first module, we looked at a high level overview of networking. In the diagram below, I highlight some of the components we will cover in this module. </br>
As an engineer you want to understand how servers/processes talk to one another over the network. This happens through a network connection. </br>
There are two common types of connection, referred to as "network protocols": TCP and UDP. </br>
Each protocol handles connections slightly differently. </br>

The box on the left is our source server. It runs a process that makes a network connection to the box on the right, which is another server with a process on it </br>
This could be a web browser (left box) opening a web page (Github.com), which connects to a web server (right box) hosted by Github somewhere on the internet. </br>
It could be a web client talking to a server, two applications talking to one another, two microservices, a service talking to a database, or technically anything talking to something else over a network or the internet </br>

```
connection from server-a to 143.0.1.2:443
or https://143.0.1.2

                     --- 144.0.1.2 -------- 143.0.1.2:443 -\
                 ---/    public IP            public IP     ---\
            ----/                                               ---\
         --/                                                        --
+----------------------+                                  +----------------------+
| private IP           |                                  | private IP           |
| 10.0.0.4:1024        |                                  | 10.0.0.4:443         |
| port                 |                                  | port                 |
| server-a             |                                  | server-b             |
+----------------------+                                  +----------------------+
```
## Important Network components

Let's keep referring to the diagram above and go through each network component in it to understand network connectivity

### IP addresses

We covered in previous modules that IP addresses are identifiers for servers belonging to a network, and that a server must have an IP address in order to belong to a network. </br>
An IP address can be either public or private. </br>
Generally speaking, servers always have a private IP address when belonging to a network. A server may or may NOT have a public IP address depending on network setup and configuration. </br>
### Private IP address

For example, in the diagram, `server-a` has a private IP address `10.0.0.4`. In our module on virtualisation we learned that a DHCP service provides us with that private IP when our server joins the network. Virtualization software as well as Cloud providers will generally handle this IP address assignment for you. </br>

It's also important to note that every server also has the IP address `127.0.0.1`, which we call "localhost". If a server refers to this IP, it is referring to itself </br>
For example, opening a web browser on a server and going to `http://127.0.0.1` connects to that same server. So a web server must be running on that server for anything to be reachable via that IP address </br>
### Public IP address

Whether `server-a` has a public address depends on network configuration. </br>
Generally speaking, the operating system on a server is configured with what's called a "gateway" IP address, so it knows where to send all outbound network traffic. </br> On home networks, this "gateway" address would usually be your home router address. </br>
Virtualization software and Cloud providers will also generally handle this for you, so you don't need to worry about setting up gateways. </br>
In our module on Virtualization and Servers, our virtualization software used `10.0.0.0` as the gateway address. The software forwarded that traffic to the host machine's gateway, so it ended up going out via the router on the home network. </br>

So basically that "gateway" is the gateway to the public internet. Traffic goes to your router, and the router gets a Public IP address from your Internet Service Provider (ISP). </br> That's why your Public IP address may change when you reboot your home router. </br>

Company and office networks generally follow a similar architecture. Your computer in the office routes outbound traffic to a network device or router, which has a Public IP address provided by the company's ISP. This is similar to what is shown in the diagram above</br>

In the cloud, servers generally have a Public IP address that you can see in the cloud provider's web interface </br>
So each server could have its own Public IP address. Cloud providers also allow you to remove the Public IP address, which makes the server completely private and inaccessible from public networks </br>

It's important to know that Public IP addresses are used for both inbound and outbound traffic. </br>
So network requests can go from `server-a` to the router, and out via the router's Public IP address </br>
`server-b`, or any destination that receives requests from `server-a`, will see that they originate from the Public IP address we have for `server-a` <br>
If `server-b` needs to respond to that request, it can respond to `server-a` over the same connection

The illustration shows a network request going from left to right, but requests can also go from right to left </br>
However, to do this, `server-a` needs to listen on a port and have a process running that can accept requests. The router device on the left also needs a "port forwarding" rule. This rule tells the router which Private IP address should receive traffic that comes in on a given port. </br>

Therefore a server and its router need to be configured to allow network requests to flow all the way through
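On a Linux server, the `ip` command from the `iproute2` package shows this setup. `ip -4 addr show` lists the private IP addresses and `ip route show default` shows the gateway. The sketch below extracts the gateway and interface from the route output. `default_gateway` is a hypothetical helper name. It assumes the usual `default via <gateway> dev <interface> ...` line format:

```shell
#!/usr/bin/env bash
# default_gateway (hypothetical helper): reads `ip route show default` output
# on stdin and prints the gateway IP and interface of each default route.
default_gateway() {
  awk '$1 == "default" {
    gw = ""; dev = ""
    for (i = 2; i < NF; i++) {
      if ($i == "via") gw = $(i + 1)     # the address after "via"
      if ($i == "dev") dev = $(i + 1)    # the interface after "dev"
    }
    print "gateway=" gw, "interface=" dev
  }'
}

# Example: show the gateway on this machine, if the `ip` tool is installed
if command -v ip >/dev/null 2>&1; then
  ip route show default | default_gateway
fi
```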
### Server VS Client

In a previous chapter we talked about "servers" and what a server is. </br>
This is not to be confused with the concept of servers and clients in networking. </br>
In networking terminology, the <u>client</u> is the server or process that is <u>initiating the network request.</u> </br>
In networking terminology, the <u>server</u> is the server or process that is <u>receiving the network request.</u> </br>
### Ports & Connections

In order for a client and server to talk, a network connection must be made. </br>
As we've mentioned earlier, the client needs a private IP. It also needs a source port for the connection, so that the reply can find its way back to the client. The source port is generally assigned by the client's operating system. </br>
Source ports are limited, and each operating system can have different limits for the number of source ports it allows. This means that we could have port exhaustion if a client tries to create too many connections. </br>

Once the client has a source IP and port, it establishes a connection to the destination IP address. There are some technical nuances to establishing network connections, and there is more to it, however I'll be keeping this brief and simple. </br>
In my opinion, a simplified understanding is always a better place to start than drowning in the depth of theory and details. </br>

For monitoring, this simplified picture gives us everything we need to form a great fundamental understanding for troubleshooting systems. </br>

This connection attempt from the client ends up at a destination server, which has a process running on it that is listening on a port. This process could be a web server or application. </br>

To accept a connection, a process must "listen" on a port </br>
That port must also be open on the server, meaning no firewall or anti-virus should be blocking that port </br>
There may be a network device, proxy, load balancer or router in front of the server, as in our diagram. If so, that device needs a port forwarding rule to send traffic to the destination server on a given port. </br>

Please make an important note here: when you see "Connection refused", it generally means there is no process or application listening on the destination port you are trying to reach. This is a popular error that's often misinterpreted by developers and engineers.

Another important error is "Connection Timeout". If you see this error, or simply a network request hanging, it generally means that something is blocking the port you are trying to reach. This could be a cloud security rule, a firewall, or a network device like a router.
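We can tell these two errors apart from a script as well. This sketch classifies a TCP connection attempt. It assumes bash, for its built-in `/dev/tcp` redirection, and GNU coreutils `timeout`, which exits with status 124 when the time limit is hit. The helper name `check_port` and the 3 second default are assumptions:

```shell
#!/usr/bin/env bash
# check_port HOST PORT [SECONDS]  (hypothetical helper)
# Tries to open a TCP connection with bash's /dev/tcp redirection, bounded
# by GNU coreutils `timeout`, and reports what happened:
#   open    - something accepted the connection
#   timeout - no answer in time; often a firewall or security rule dropping it
#   closed  - failed quickly, usually "Connection refused" (nothing listening),
#             but other errors such as an unknown host name also land here
check_port() {
  local host="$1" port="$2" secs="${3:-3}" rc=0
  timeout "$secs" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null || rc=$?
  case "$rc" in
    0)   echo "open" ;;
    124) echo "timeout" ;;
    *)   echo "closed" ;;
  esac
}

check_port 127.0.0.1 12345
```

Against a local port with nothing listening we would expect `closed`. A port behind a firewall that silently drops packets would typically report `timeout` instead.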
### Network Bandwidth

Once a connection is established between client and server, the client can start sending network requests and the server can respond with network responses. These requests and responses often contain data. </br>
This can be a client web browser getting HTML and web page content from a server, a web browser client calling an API service for data, or two microservices communicating with one another. </br>

These network requests and responses may sometimes contain large datasets. These requests and responses take up what's called network bandwidth. </br>
Network bandwidth is limited by network speeds. These involve the client and server network interfaces (or network cards), network cables, devices and ISP speeds </br>

Bandwidth is often monitored and measured in bytes per second, megabytes per second, gigabytes per second, etc. Network links and ISP plans are usually rated in bits per second instead, for example megabits per second (Mbps). </br>

If you are using the cloud, the server will have limits and specifications on its network bandwidth capabilities. </br>

You shouldn't have to worry about this most of the time. The exception is a company that deals with large data transmission, where services that talk to each other exchange large messages. Most servers in the cloud can handle pretty large amounts of bandwidth, and latency will likely be your biggest challenge </br>
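On Linux, the kernel keeps running byte counters for each network interface under `/sys/class/net/<interface>/statistics/`. Sampling them twice gives an average throughput over the interval. This is a sketch with hypothetical helper names. Interface names vary, and `ip link` lists them:

```shell
#!/usr/bin/env bash
# rate_per_sec BEFORE AFTER SECONDS  (hypothetical helper)
# Integer average rate between two counter samples taken SECONDS apart.
rate_per_sec() {
  echo $(( ($2 - $1) / $3 ))
}

# sample_throughput IFACE [SECONDS]  (hypothetical helper)
# Reads the kernel's rx/tx byte counters for IFACE, waits, reads them again
# and prints the average receive and transmit rates over that interval.
sample_throughput() {
  local iface="$1" secs="${2:-1}"
  local dir="/sys/class/net/${iface}/statistics"
  local rx1 tx1 rx2 tx2 rx tx
  rx1=$(cat "$dir/rx_bytes"); tx1=$(cat "$dir/tx_bytes")
  sleep "$secs"
  rx2=$(cat "$dir/rx_bytes"); tx2=$(cat "$dir/tx_bytes")
  rx=$(rate_per_sec "$rx1" "$rx2" "$secs")
  tx=$(rate_per_sec "$tx1" "$tx2" "$secs")
  # bytes/sec * 8 = bits/sec; divide by 1,000,000 for megabits (rounded down)
  echo "${iface} rx: ${rx} B/s ($(( rx * 8 / 1000000 )) Mbit/s)"
  echo "${iface} tx: ${tx} B/s ($(( tx * 8 / 1000000 )) Mbit/s)"
}
```

For example, `sample_throughput eth0 5` would average over 5 seconds, assuming the interface is called `eth0`. Megabits are rounded down, so light traffic shows as `0 Mbit/s`.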
### Latency

Latency is the time taken for the client to receive a response from the server. </br>
Latency is generally measured in milliseconds (ms) or seconds (s). </br>

In my experience, the main way I monitor latency is through proxy or web server logs. Web logs generally list each network request with details such as its latency, so we know how long the server took to respond to the client. </br>
This helps us understand timeouts and long running requests, which could impact customers </br>

In our chapter on Web servers we will take a look at web server monitoring, which covers this topic in further detail. </br>
To summarise, when a server is slow and not responding to client requests in a timely manner, we can generally find evidence of this in the web server logs. </br>
Network devices in front of web servers, like load balancers, proxies or what we call "ingresses", may also provide web traffic logs that show the latency of requests. </br> This is a key focus point when monitoring HTTP web server traffic performance and stability. </br>
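The function below is a sketch of that log-based approach. It summarises request latency from access-log lines. It assumes the request duration in seconds is the last field on each line. With nginx, for example, you could add the `$request_time` variable to the end of a `log_format`. The helper name and the log layout are assumptions:

```shell
#!/usr/bin/env bash
# latency_summary  (hypothetical helper)
# Reads access-log lines on stdin, ASSUMING the request duration in seconds
# is the last whitespace-separated field, and prints the number of requests
# plus the average and maximum latency.
latency_summary() {
  awk '
    {
      t = $NF + 0                        # last field as a number
      sum += t
      if (NR == 1 || t > max) max = t
    }
    END {
      if (NR > 0) printf "requests=%d avg=%.3fs max=%.3fs\n", NR, sum / NR, max
    }'
}
```

For example: `tail -n 1000 /var/log/nginx/access.log | latency_summary`, assuming that log path and format. The maximum matters alongside the average, because a few very slow requests can hide inside a healthy-looking average.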
### Network Protocols

There are two major network protocols you will come across in the field </br>
The two are `TCP` and `UDP` </br>

When we talked about network connections and how the client and server establish them, I mentioned there are some nuances to how a connection is established. </br>

`TCP` is the main network protocol used by the Web because it's designed to be reliable. </br> Networks are flaky, meaning a network packet is never guaranteed to arrive from client to server </br>
To make the network more reliable, TCP uses a handshake to establish a connection. The handshake is a few network packets sent back and forth between client and server. </br>
This handshake is designed to help ensure connections are established even though the network can be flaky in nature. </br>

This reliability comes at a performance cost. So there is another protocol called `UDP`, which is more of a "send and forget" type of network request. There is no handshake: a client simply sends a request. If the client expects a response and does not get one, it is up to the client application to retry. </br>
A client may throw an error after a few retries when a UDP request fails. </br>
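We can observe the "send and forget" behaviour from bash. Bash supports `/dev/udp/<host>/<port>` and `/dev/tcp/<host>/<port>` redirections; these are a bash feature, not real files. This sketch assumes nothing is listening on the chosen local port. `udp_vs_tcp` is a hypothetical name:

```shell
#!/usr/bin/env bash
# udp_vs_tcp PORT  (hypothetical helper)
# Sends "hello" to 127.0.0.1:PORT once over UDP and once over TCP, using
# bash's /dev/udp and /dev/tcp redirections, and reports whether each
# send succeeded from the client's point of view.
udp_vs_tcp() {
  local port="$1"
  # UDP: no handshake, so the send normally succeeds even if no one listens
  if bash -c "echo hello > /dev/udp/127.0.0.1/${port}" 2>/dev/null; then
    echo "udp: sent"
  else
    echo "udp: failed"
  fi
  # TCP: the handshake must complete first; with no listener it is refused
  if bash -c "echo hello > /dev/tcp/127.0.0.1/${port}" 2>/dev/null; then
    echo "tcp: sent"
  else
    echo "tcp: failed"
  fi
}

udp_vs_tcp 12345
```

With no listener, we would expect `udp: sent` followed by `tcp: failed`. The UDP datagram is sent without any handshake and simply dropped on arrival. The TCP connection is refused during its handshake.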
## Making a Server
We have learned that a server is the receiver of network traffic. </br>
A server has to run a process that listens on a port, ready to receive traffic. </br>

If this server needs to be accessible on a private network only, we just need to ensure we can reach it from another server on the same network and that no firewall is blocking the port. Sometimes firewall software on a server can block ports: Linux has a basic network firewall, and Windows has a firewall too. </br>

Remember what I said: if we see "Connection refused" when trying to connect to our server, it means nothing is listening on the port we are trying to connect to. </br>
If we see "Connection timed out", it's generally a firewall blocking the request, so the connection attempt hangs and eventually times out. </br>

It's important to learn how to validate whether a server port is open. </br>
To do this, let's use a tool called netcat, or `nc`:

```
nc -zv localhost 12345
nc: connect to localhost (127.0.0.1) port 12345 (tcp) failed: Connection refused
```
- The `-z` flag tells `nc` to only scan for a listening service, without sending any data to it
- The `-v` flag tells `nc` to turn on verbose output, so we get more helpful information about whether the port is open or not

Let's use `nc` to start a server. </br>
We can create a script to start our server if we wanted:

```
#!/bin/bash
echo "Starting TCP server on port 12345..."
nc -lk 12345
```
- The `-l` flag tells netcat to listen on a given port for incoming connections
- The `-k` flag tells netcat to listen for another connection once the current one has finished. Without `-k`, netcat will exit after its first connection closes

We can leave our server running in one terminal and open another to test the port. </br>
This helps us during monitoring to test whether a server is listening on a port or not:

```
nc -zv localhost 12345
Connection to localhost (127.0.0.1) 12345 port [tcp/*] succeeded!
```
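
The "Connection refused" versus "Connection timed out" rule can be turned into a small check that a monitoring script could run on a schedule. This is a sketch, assuming bash's `/dev/tcp` support and the `timeout` command from coreutils; `check_port` is a hypothetical helper name and the 3 second default is arbitrary. `timeout` exits with status `124` when its time limit is hit, which is how we tell a connection attempt that hangs apart from one that is refused straight away.

```shell
#!/usr/bin/env bash
# check_port is a hypothetical helper for this example.
# Assumes bash /dev/tcp support and coreutils `timeout`.
check_port() {
  local host=$1 port=$2 seconds=${3:-3}
  # Try to open a TCP connection on file descriptor 3 inside a child bash
  timeout "$seconds" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
  case $? in
    0)   echo "open: $host:$port is accepting connections" ;;
    124) echo "timeout: no answer from $host:$port after ${seconds}s, possibly a firewall" ;;
    *)   echo "closed: connection to $host:$port was refused or the host is unreachable" ;;
  esac
}

check_port localhost 12345
```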
## Making a Client

Now that we have a server, we can make a client that sends requests to the server. </br>
The whole point of this exercise is to learn how client and server connectivity works, so that we can learn how to monitor these requests and connections. </br>
We can use the `nc` command to troubleshoot and test connectivity between clients and servers. </br>
We will also learn about a couple of other important and popular network commands. </br>

We can use the `nc` command to create network requests. </br>
Send one message to our server but hold the connection open:

```
echo "Sending one message and holding connection!" | nc localhost 12345
```
The above is a basic TCP client. </br>
Another network protocol, which is built on top of TCP, is `HTTP`. </br>
HTTP is used in Web communications, and we'll learn more about the Web in a future chapter. This is how we create an HTTP connection and send an HTTP message:

```
curl -X POST http://localhost:12345 -d "test"
```
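
Because HTTP is built on top of TCP, an HTTP request is just text sent over a TCP connection. While `curl` is running, our `nc` server prints the raw request it receives, and since `nc` never sends an HTTP response back, `curl` keeps waiting until we cancel it with `CTRL+C`. The sketch below builds a similar request by hand so we can see its shape. `build_http_post` is a hypothetical helper, and its headers are a simplified subset of what `curl -d` sends (curl also adds headers such as `User-Agent` and `Accept`).

```shell
#!/usr/bin/env bash
# build_http_post is a hypothetical helper that prints a minimal HTTP/1.1 POST request.
# HTTP lines end with \r\n, and an empty line separates the headers from the body.
build_http_post() {
  local host=$1 port=$2 body=$3
  printf 'POST / HTTP/1.1\r\n'
  printf 'Host: %s:%s\r\n' "$host" "$port"
  printf 'Content-Type: application/x-www-form-urlencoded\r\n'
  # ${#body} counts characters, which equals bytes for plain ASCII bodies
  printf 'Content-Length: %s\r\n' "${#body}"
  printf '\r\n'
  printf '%s' "$body"
}

# With the nc server from earlier running, this sends the request over plain TCP:
#   build_http_post localhost 12345 "test" | nc localhost 12345
build_http_post localhost 12345 "test"
```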
We'll get into more variations of what the client can do. But before we do, we need to learn a couple of popular network tools, so we can monitor the network at the same time.
## Network Connection Monitoring Tools
To monitor and further understand TCP connections, we can use two popular tools:

- `ss`
  - Another utility to investigate sockets. <br/>
    `ss` is used to dump socket statistics. It shows information similar to `netstat` and can display more TCP and state information than other tools.
- `netstat`
  - Prints network connections, routing tables, interface statistics, masquerade connections, and multicast memberships

We can monitor current TCP connections, see our server listening, and also see the open, established TCP connections between client and server. </br>
In both commands below, `-t` limits the output to TCP sockets, `-a` shows all sockets including listening ones, and `-n` shows numeric ports instead of service names, so our `grep` for `12345` matches reliably.

ss:

```
ss -tan | grep 12345
```

netstat:

```
netstat -tan | grep 12345
```
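
When many clients connect at once, scrolling through `grep` output gets hard, so it can help to count sockets per state instead. This is a minimal sketch, assuming the column layout of `ss -tan` (state first, then the receive and send queues, then the local and peer `address:port`); `count_states` is a hypothetical helper name. Note that `ss` prints shortened state names such as `ESTAB` and `TIME-WAIT`, while `netstat` prints `ESTABLISHED` and `TIME_WAIT`.

```shell
#!/usr/bin/env bash
# count_states is a hypothetical helper: it reads `ss -tan` output on stdin and
# prints how many sockets using the given port are in each TCP state.
count_states() {
  local port=$1
  # $1 is the state, $4 the local address:port and $5 the peer address:port.
  # The regex ":PORT$" matches the port at the end of either address.
  awk -v suffix=":${port}\$" '
    NR > 1 && ($4 ~ suffix || $5 ~ suffix) { count[$1]++ }
    END { for (state in count) print state, count[state] }
  ' | sort
}

# Only run the live check when ss is installed
if command -v ss > /dev/null; then
  ss -tan | count_states 12345
fi
```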
This will help us understand the connection lifecycle as well. We can view the connection state; in this case it's `LISTEN` and `ESTABLISHED` (`ss` shortens the latter to `ESTAB`). </br>
We've now learned that `LISTEN` is for servers listening on a port for inbound connections. `ESTABLISHED` is the state when a connection is open and the client and server can send and receive messages to each other. </br>
In our example client and server, we are keeping the connection open so we can observe it in our command line tools above. </br>

TCP socket states represent the various stages a TCP connection goes through during its lifecycle. </br>

Interestingly, if we close our client by pressing `CTRL+C` and quickly check the connection with `ss` or `netstat`, we will notice the `ESTABLISHED` state has gone and we now see a `TIME_WAIT` state (shown as `TIME-WAIT` by `ss`). </br>

This happens when a connection has been used by client and server and is being closed. The side that closed the connection first keeps it in `TIME_WAIT` for a while, so that any delayed packets from the old connection have expired before the operating system re-uses that combination of addresses and ports. On Linux, a connection spends roughly 60 seconds in `TIME_WAIT` before the operating system can re-use it. </br>
On Linux this period is compiled into the kernel rather than being a simple setting, but if we need to open a lot of connections quickly, options such as `net.ipv4.tcp_tw_reuse` allow new outgoing connections to re-use sockets that are still in `TIME_WAIT`. </br>

We can run the following loop, which keeps sending messages roughly every 5 seconds and closes the connection after each message. This allows us to observe `TIME_WAIT` sockets:

```
echo "Sending messages one after another!"
while true; do
  # -q 5: after stdin reaches EOF, wait 5 seconds and then quit, closing the connection
  echo "Sending message and closing connection!" | nc -q 5 localhost 12345
done
```
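
While the loop above is running, we can watch the number of `TIME_WAIT` sockets rise and fall. This sketch reads `/proc/net/tcp`, the kernel's table of IPv4 TCP sockets, where ports are written in hexadecimal (12345 is `3039`) and the `st` column holds a numeric state code, with `06` meaning `TIME_WAIT`. It assumes IPv4 only (IPv6 sockets are listed in `/proc/net/tcp6`), and `count_time_wait` is a hypothetical helper name. With `ss`, a similar list comes from `ss -tan state time-wait '( sport = :12345 or dport = :12345 )'`.

```shell
#!/usr/bin/env bash
# count_time_wait is a hypothetical helper: it reads /proc/net/tcp formatted
# text on stdin and counts TIME_WAIT sockets (state code 06) using the given port.
count_time_wait() {
  local port_hex
  port_hex=$(printf '%04X' "$1")   # 12345 -> 3039
  # $2 is the local address:port, $3 the remote address:port, $4 the state code
  awk -v suffix=":${port_hex}\$" '
    NR > 1 && $4 == "06" && ($2 ~ suffix || $3 ~ suffix) { n++ }
    END { print n + 0 }
  '
}

# Take three samples, one second apart (IPv4 sockets only)
for sample in 1 2 3; do
  echo "$(date +%T) sample $sample, TIME_WAIT sockets on port 12345: $(count_time_wait 12345 < /proc/net/tcp)"
  sleep 1
done
```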
Here are the TCP socket states, roughly in the order a connection moves through them:

* `CLOSED`: The initial state. No connection exists.
* `LISTEN`: The server is waiting for incoming connection requests.
* `SYN_SENT`: The client has sent a connection request (SYN) and is waiting for a matching connection request and acknowledgment (SYN-ACK) from the server.
* `SYN_RECEIVED`: The server has received the client's connection request (SYN), sent a connection acknowledgment (SYN-ACK), and is waiting for the final acknowledgment (ACK) from the client.
* `ESTABLISHED`: The connection is open, and data can be sent and received between the client and server.
* `FIN_WAIT_1`: The client or server has initiated the connection termination by sending a FIN, and is waiting for the other side to acknowledge it.
* `FIN_WAIT_2`: The side that initiated the termination has received the acknowledgment (ACK) of its FIN and is waiting for the other side to send its own FIN.
* `CLOSE_WAIT`: The side that received the first FIN has acknowledged it and is waiting for its application to close the connection, so it can send its own FIN.
* `CLOSING`: Both sides have sent FINs at about the same time, but neither has received the acknowledgment (ACK) of its FIN yet.
* `LAST_ACK`: The side that received the first FIN has sent its own FIN and is waiting for the final acknowledgment (ACK) of it.
* `TIME_WAIT`: The side that sent the final acknowledgment (ACK) is waiting for a period of time to ensure the other side received it and that any delayed packets from the connection have expired.
* `CLOSED`: The connection is fully terminated, and no further communication is possible.
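
The kernel keeps these states as numbers: in `/proc/net/tcp`, each IPv4 socket's state is a hex code in the `st` column, which tools translate into names. The codes in this sketch follow the Linux kernel's TCP state numbering (the kernel spells a few names slightly differently, for example `FIN_WAIT1` and `SYN_RECV`). `tcp_state_name` and `list_port_states` are hypothetical helper names, only IPv4 sockets are covered, and addresses are printed in the kernel's raw hex form.

```shell
#!/usr/bin/env bash
# tcp_state_name is a hypothetical helper mapping the hex state codes used in
# /proc/net/tcp to the state names listed above.
tcp_state_name() {
  case $1 in
    01) echo ESTABLISHED ;;
    02) echo SYN_SENT ;;
    03) echo SYN_RECEIVED ;;
    04) echo FIN_WAIT_1 ;;
    05) echo FIN_WAIT_2 ;;
    06) echo TIME_WAIT ;;
    07) echo CLOSED ;;
    08) echo CLOSE_WAIT ;;
    09) echo LAST_ACK ;;
    0A) echo LISTEN ;;
    0B) echo CLOSING ;;
    *)  echo "UNKNOWN($1)" ;;
  esac
}

# list_port_states prints "local remote STATE" for every IPv4 socket using a port
list_port_states() {
  local port_hex
  port_hex=$(printf '%04X' "$1")
  # Skip the header line; keep rows whose local or remote address ends in :PORT
  tail -n +2 /proc/net/tcp | while read -r _ local remote state _; do
    case "$local $remote " in
      *":$port_hex "*) echo "$local $remote $(tcp_state_name "$state")" ;;
    esac
  done
}

list_port_states 12345
```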
## Network Traffic Monitoring
In everything above, we monitored network connections. </br>
But once a connection is established, the client and server can go back and forth, sending and receiving messages to one another. </br>

To monitor this network traffic, I often rely on logging. </br>
Traffic flows from client to server, and the applications we run on servers generally have the capability to log traffic requests. </br>
All popular web servers have this feature, and you can configure it. </br>

Once configured, the server process will write incoming traffic logs to a file, where each line in the file represents a request. </br>
Each line generally follows a format similar to this, with spaces or commas separating each field:

```
<date> <client-IP> <request-info> <status> <time-taken-in-milliseconds> <etc>
```

With these logs, we can monitor where traffic is coming from, how long each request took to complete, and whether it succeeded or failed. </br>
This will be of great help when we start troubleshooting slow requests, which can result in systems slowing down. </br>
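
As a taste of what this looks like, here is a sketch that works on the simplified format above. It assumes space-separated fields, a `<request-info>` field without spaces (for example `GET:/api/orders`), the status code in field 4 and the time taken in milliseconds in field 5; real web servers each have their own log formats, so the field positions would need adjusting. `slow_requests` and `summarize_requests` are hypothetical helper names, and the log lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a simplified access log where each line is:
# <date> <client-IP> <request-info> <status> <time-taken-in-milliseconds>

# Print requests slower than the given threshold (in milliseconds)
slow_requests() {
  awk -v limit="$1" '$5 + 0 > limit + 0 { print $5 "ms", $4, $2, $3 }'
}

# Print the request count, the number of server errors (5xx) and the average latency
summarize_requests() {
  awk '{ total++; sum += $5; if ($4 >= 500) errors++ }
       END { if (total) printf "requests=%d errors=%d avg_ms=%.1f\n", total, errors, sum / total }'
}

# Made-up sample lines, only for illustration
logs='2024-05-01T10:00:00Z 10.0.0.7 GET:/api/orders 200 35
2024-05-01T10:00:01Z 10.0.0.8 POST:/api/orders 500 1200
2024-05-01T10:00:02Z 10.0.0.7 GET:/health 200 4'

printf '%s\n' "$logs" | slow_requests 500
printf '%s\n' "$logs" | summarize_requests
# Against a real log file (hypothetical path): slow_requests 500 < /path/to/access.log
```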
We'll be covering web server logs and log monitoring in more detail in our chapter on Web servers </br>