All posts by
Maciej Mościcki

4 months ago

Unlocking Kafka's Potential: Tackling Tail Latency with eBPF

At Allegro, we use Kafka as a backbone for asynchronous communication between microservices. With up to 300k messages published and 1M messages consumed every second, it is a key part of our infrastructure. A few months ago, in our main Kafka cluster, we noticed the following discrepancy: while median response times for produce requests were in single-digit milliseconds, the tail latency was much worse. Namely, the p99 latency was up to 1 second, and the p999 latency was up to 3 seconds. This was unacceptable for a new project that we were about to start, so we decided to look into this issue. In this blog post, we would like to describe our journey — how we used Kafka protocol sniffing and eBPF to identify and remove the performance bottleneck.

Maciej Mościcki

Software Engineer at core Technical Platform team. Mainly interested in distributed systems, concurrency and Information Retrieval. Likes to open black boxes.