Unlocking Kafka's Potential: Tackling Tail Latency with eBPF

At Allegro, we use Kafka as a backbone for asynchronous communication between microservices. With up to 300k messages published and 1M messages consumed every second, it is a key part of our infrastructure. A few months ago, in our main Kafka cluster, we noticed the following discrepancy: while median response times for produce requests were in single-digit milliseconds, the tail latency was much worse. Namely, the p99 latency was up to 1 second, and the p999 latency was up to 3 seconds. This was unacceptable for a new project that we were about to start, so we decided to look into this issue. In this blog post, we would like to describe our journey — how we used Kafka protocol sniffing and eBPF to identify and remove the performance bottleneck.

Tired of repetitive tasks?! Go for RPA!

Have you ever thought about ways of reducing repetitive, monotonous tasks? Maybe you would like to try to automate your own tasks? I will show you what technology we use at Allegro, what processes we have automated, and how to do it on your own.

Don’t bother: it is only a little expired

This story shows how we strive to fix issues reported by our customers regarding inconsistent listing views on our e-commerce platform. We will use a top-down manner to guide you through our story. At the beginning, we highlight the challenges faced by our customers, followed by presenting basic information on how views are personalized on our web application. We then delve deeper into our internal architecture, aiming to clarify how it supports High Availability (HA) by using two data centers. Finally, we advertise a little Couchbase, distributed NoSQL database, and explain why it is an excellent storage solution for such an architecture.

WCAG 2.2 is here! And what about 3.0?

Ready to turn web accessibility from a headache into a breeze? Join us as we demystify WCAG, explore its latest 2.2 version, and gaze into the future of digital inclusivity. Get ready for a journey that’s as enlightening as it is entertaining!

Clever, surprised and gray-haired

This article is a form of a public postmortem in which we would like to share our bumpy way of revealing the cause of a mysterious performance problem. Besides unveiling part of our technical stack based on open-source solutions, we also show how some false assumptions made such a bug triage process much harder. Besides all NOT TO DOs, you can find some exciting information about performance hunting and reproducing performance issues on a small scale. As a perk, we prepared a repository where you can reproduce the problem and make yourself familiar with tools that allowed us to confirm the cause. The last part (lessons learned) is the most valuable if you prefer to learn from the mistakes of others.

allegro.tech

At Allegro, we build and maintain some of the most distributed and scalable applications in Central Europe. This poses many challenges, starting with architecture and design, through the choice of technologies, code quality and performance tuning, and ending with deployment and devops. In this blog, our engineers share their experiences and write about some of the interesting things taking place at the company.