Propose a high-level design for a tracing system that can handle millions of spans per second across thousands of services.

Question

Anonymous · Accepted Answer

Before proposing a design, it is important to clarify some high-level details
about the system in question, such as:
1. What is the data size of an average span? This helps determine how much data
is transmitted across the wire.
2. What are the outliers (especially on the high side), and how do they impact
system performance? This is important for assessing the options available for
handling the data flow.
3. What are the SLAs?
3.a. How quickly does the trace data need to be indexed and accessible for
debugging?
3.b. Is there a threshold on how much trace data we consider acceptable to
lose? This indicates whether data can be queued on the machine and sent in
small batches. (NOTE: Most of the data-loss risk can be mitigated by
replicating the data both on disk and in memory, and only falling back to disk
in the event of a failure. Provided a proper RAID setup is in place, the risk
at this point is quite low, and losing data would likely require a
catastrophic failure.)
3.c. What types of searches do we want to support? Is a trace ID sufficient to
debug known failed requests, or do we need broader search functionality?
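
To show how the answers to questions 1, 2, and 3.b feed into sizing, here is a
rough back-of-the-envelope sketch. Every number in it (span rate, span sizes,
retention window, host count, flush interval) is an illustrative assumption,
not a figure from the question, and the `Workload` type and function names are
hypothetical.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    """Hypothetical inputs; replace with measured values."""
    spans_per_second: int   # aggregate rate across all services
    avg_span_bytes: int     # answer to question 1
    p99_span_bytes: int     # high-side outlier from question 2
    retention_days: int     # how long raw spans are kept


def ingest_bytes_per_second(w: Workload) -> int:
    """Average bytes per second sent across the wire."""
    return w.spans_per_second * w.avg_span_bytes


def worst_case_bytes_per_second(w: Workload) -> int:
    """Pessimistic upper bound: every span is as large as the p99 span."""
    return w.spans_per_second * w.p99_span_bytes


def raw_storage_bytes(w: Workload) -> int:
    """Uncompressed, unreplicated storage for the retention window."""
    return ingest_bytes_per_second(w) * SECONDS_PER_DAY * w.retention_days


def spans_at_risk_per_host(w: Workload, hosts: int, flush_interval_s: float) -> float:
    """Spans buffered on one host between flushes (question 3.b).

    Assumes load is spread evenly across hosts; this is roughly what is
    lost if a host dies just before flushing and nothing was persisted.
    """
    return w.spans_per_second / hosts * flush_interval_s


if __name__ == "__main__":
    # Illustrative numbers only.
    w = Workload(
        spans_per_second=2_000_000,
        avg_span_bytes=500,
        p99_span_bytes=10_000,
        retention_days=7,
    )
    print("avg ingest (GB/s):", ingest_bytes_per_second(w) / 1e9)
    print("worst case (GB/s):", worst_case_bytes_per_second(w) / 1e9)
    print("raw storage (TB): ", raw_storage_bytes(w) / 1e12)
    print("spans at risk per host:", spans_at_risk_per_host(w, 5_000, 5.0))
```

With these assumed inputs the average ingest comes to about 1 GB/s and a week
of raw spans to roughly 600 TB before compression or replication. If the
acceptable loss in 3.b is smaller than the per-host figure, the flush interval
needs to shrink or the buffer needs to be persisted to disk.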



