Anonymous
It would be important to clarify some high-level details about the system in question, such as:
1. What is the data size of an average span? This will help determine how much data is being transmitted across the wire.
2. What are the outliers (especially on the high side), and how do they impact system performance? This is important for assessing the options available for handling the data flow.
3. What are the SLAs?
3.a. How quickly does the trace data need to be indexed and accessible for debugging?
3.b. Do we have a threshold on how much trace data we consider acceptable to lose? This will indicate whether data can be queued up on the machine and sent in small batches. (NOTE: Most of the data loss risk can be mitigated by replicating the data both on disk and in memory and only falling back to disk in the event of a failure. Provided there is a proper RAID setup, the risk at this point is quite low, and losing data would likely require a catastrophic failure.)
3.c. What type of log searches do we want to support? Is a trace ID sufficient to debug known failed requests, or do we need broader search functionality?
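
To make questions 1 and 2 concrete, here is a rough back-of-the-envelope sketch in Python. It only illustrates the arithmetic: the sample span sizes, the 10,000 spans/sec rate, and the function names are all made-up assumptions, not measurements from any real system.

```python
import math
import statistics

# Hypothetical sample of span sizes in bytes (assumed, not measured).
# One large outlier is included on purpose.
SAMPLE_SPAN_BYTES = [400, 450, 500, 520, 480, 470, 510, 490, 505, 50_000]


def wire_bytes_per_sec(avg_span_bytes: float, spans_per_sec: float) -> float:
    """Estimate data sent across the wire: average span size times span rate."""
    return avg_span_bytes * spans_per_sec


def percentile(sizes: list[int], pct: float) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(sizes)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    avg = statistics.mean(SAMPLE_SPAN_BYTES)
    assumed_rate = 10_000  # spans per second, an assumption for illustration

    print(f"average span size: {avg:.1f} bytes")
    print(f"median span size:  {percentile(SAMPLE_SPAN_BYTES, 50)} bytes")
    print(f"p99 span size:     {percentile(SAMPLE_SPAN_BYTES, 99)} bytes")
    print(f"estimated wire throughput: {wire_bytes_per_sec(avg, assumed_rate) / 1e6:.1f} MB/s")
```

With a sample like this, a single oversized span pulls the average far above the median, which is why the high-side outliers in question 2 matter as much as the average in question 1 when sizing the pipeline.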