Distributed Tracing
Distributed tracing shows how requests and messages travel between the microservices in a system. It is most useful when we need to locate failures or delays in a system made of many microservices.
Problem
In systems with many microservices, it can be hard to know what happens when a user makes a request. Some common problems are:
- Finding the root cause of a failure: Imagine users report an error when trying to buy a product. How do we know which microservice caused the error?
- Tracking a specific entity: Suppose a user has a question about order #12345. How can we find all log messages from all microservices involved in processing that order?
- Identifying delays: If a system is responding slowly, how can we find which microservice in the chain is causing the delay?
Solution
To solve these problems, we use correlation IDs. A correlation ID is a unique identifier that links all requests and messages for one action or user request.
Here’s how it works:
- When a new request comes in, the system generates a unique correlation ID.
- Every microservice that processes this request includes the correlation ID in its logs.
- If a microservice works with a specific business entity (such as a customer ID or an order ID), it logs that entity's ID alongside the correlation ID, so the request can also be found by entity.
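The steps above can be sketched in Python with the standard logging module. This is a minimal illustration, not a definitive implementation: the names correlation_id_var, CorrelationIdFilter, build_logger, and handle_order are hypothetical, and the log format is an assumption.

```python
import contextvars
import logging
import uuid

# Holds the correlation ID of the request currently being handled.
# "-" is an assumed placeholder for log lines written outside any request.
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id_var.get()
        return True  # never drop the record, only annotate it


def new_correlation_id():
    """Generate a unique ID for a newly arrived request."""
    return uuid.uuid4().hex


def build_logger(name, stream=None):
    """Create a logger whose lines always carry the correlation ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s correlation_id=%(correlation_id)s %(message)s"
    ))
    handler.addFilter(CorrelationIdFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_order(logger, order_id):
    """Hypothetical request handler: assign an ID, then log with it."""
    correlation_id_var.set(new_correlation_id())
    # The business entity (order ID) is logged next to the correlation ID,
    # so a search by either value finds this line.
    logger.info("order received order_id=%s", order_id)
    return correlation_id_var.get()
```

Storing the ID in a context variable means individual log calls do not have to pass it explicitly; the filter attaches it to every record written while the request is being handled.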
This way, we can:
- Find all log messages related to a specific request or entity.
- Measure how long each microservice takes to process the request, helping identify slow points in the system.
Example: A user wants to buy a laptop.
- The request enters the Order Service and gets correlation ID abc123.
- The Payment Service and Inventory Service receive abc123 along with the request and include it in their own logs.
- If the user reports a problem, support can search logs using abc123 and see all steps in the request.
- If the payment took too long, timestamps in the logs will show which microservice caused the delay.
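As a rough sketch of this example, the snippet below filters log entries by correlation ID and measures how long each service spent on the request. The log entries, field names, and timestamps are invented for illustration; a real central logging system would supply its own query interface.

```python
from datetime import datetime

# Hypothetical log entries, as a central logging system might return them.
LOGS = [
    {"ts": "2024-01-01T10:00:00", "service": "order", "correlation_id": "abc123", "event": "request_in"},
    {"ts": "2024-01-01T10:00:01", "service": "payment", "correlation_id": "abc123", "event": "request_in"},
    {"ts": "2024-01-01T10:00:02", "service": "order", "correlation_id": "xyz789", "event": "request_in"},
    {"ts": "2024-01-01T10:00:06", "service": "payment", "correlation_id": "abc123", "event": "response_out"},
    {"ts": "2024-01-01T10:00:06", "service": "inventory", "correlation_id": "abc123", "event": "request_in"},
    {"ts": "2024-01-01T10:00:07", "service": "inventory", "correlation_id": "abc123", "event": "response_out"},
    {"ts": "2024-01-01T10:00:07", "service": "order", "correlation_id": "abc123", "event": "response_out"},
]


def find_by_correlation_id(entries, correlation_id):
    """Return all entries for one request, ordered by timestamp."""
    matches = [e for e in entries if e["correlation_id"] == correlation_id]
    return sorted(matches, key=lambda e: e["ts"])


def time_per_service(entries):
    """Seconds between the earliest and latest entry of each service."""
    stamps = {}
    for entry in entries:
        ts = datetime.fromisoformat(entry["ts"])
        stamps.setdefault(entry["service"], []).append(ts)
    return {
        service: (max(times) - min(times)).total_seconds()
        for service, times in stamps.items()
    }


trace = find_by_correlation_id(LOGS, "abc123")
durations = time_per_service(trace)
```

With this sample data, the Order Service's total includes the time it spent waiting on the services it called, so the Payment Service's own duration is what points to the slow step.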
Solution Requirements
To implement distributed tracing effectively, we need to:
- Assign correlation IDs to all incoming requests and new events. Usually, the ID is carried in a request header.
- Propagate the correlation ID in every outgoing request or message from a microservice.
- Include the correlation ID in all logs in a standard format, so the central logging system can find and link related events.
- Record timestamps when requests, responses, or messages enter and leave each microservice. This helps analyze delays in the system.
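A minimal sketch of these four requirements follows. The header name X-Correlation-ID, the JSON field names, and the helper functions are assumptions made for illustration; the requirements themselves do not prescribe them. Reusing an ID that arrives with the request, rather than always generating a new one, is what keeps one ID across the whole chain.

```python
import json
import time
import uuid

# Assumed header name; the requirement only says "a request header".
CORRELATION_HEADER = "X-Correlation-ID"


def get_or_create_correlation_id(headers):
    """Reuse the caller's ID if present, otherwise start a new one.

    Note: a plain dict lookup is case-sensitive, unlike real HTTP headers.
    """
    return headers.get(CORRELATION_HEADER) or uuid.uuid4().hex


def outgoing_headers(correlation_id, extra=None):
    """Build headers for a downstream call, carrying the ID along."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = correlation_id
    return headers


def log_line(service, correlation_id, event, **fields):
    """Render one log entry as JSON in a fixed, searchable format."""
    entry = {
        "ts": time.time(),  # timestamp of entry or exit, in epoch seconds
        "service": service,
        "correlation_id": correlation_id,
        "event": event,
    }
    entry.update(fields)
    return json.dumps(entry, sort_keys=True)


def handle_request(service, incoming_headers):
    """Hypothetical handler: log entry and exit, return downstream headers."""
    correlation_id = get_or_create_correlation_id(incoming_headers)
    lines = [log_line(service, correlation_id, "request_in")]
    downstream = outgoing_headers(correlation_id)
    lines.append(log_line(service, correlation_id, "response_out"))
    return downstream, lines
```

Calling handle_request("order", {}) starts a new ID, and passing the returned headers to handle_request("payment", ...) keeps the same ID in the second service's log lines.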