Articles tagged with
big data
20 Jun 2024
In this article we present methods for efficiently optimizing physical resources and fine-tuning the configuration of a Google Cloud Platform (GCP)
Dataflow pipeline in order to reduce costs.
The optimization is presented as a real-life scenario, carried out in stages.
04 Nov 2020
We are excited to announce that we have just released BigFlow 1.0 as open source.
It’s a Python framework for big data processing on the Google Cloud Platform.
07 Jan 2020
When designing the architecture of a system, one always needs to think about what can go wrong and
what kind of failures can occur in the system. This kind of problem analysis is especially hard in distributed systems.
Failure is inevitable and the best we can do is to prepare for it.
22 Oct 2018
Two years ago at Allegro we had a very typical Big Data technology stack. The architecture was based
on a Hadoop cluster, which we queried with plain Hive,
Spark jobs and Jupyter notebooks. Over the last two
years we have transformed it into a more efficient and easier-to-use
OLAP platform.
29 Jun 2017
I bet you found this article after googling some of the issues you encounter when working with a Hadoop cluster.
You probably deal with Hive queries for exploratory data analysis that take far too long to process. Moreover, you
cannot adopt Spark in your organization for every use case, because writing jobs requires quite strong
programming skills. Clogged YARN queues might be your nightmare, and waiting for a container to launch when you run
even a small query drives you mad. Before we deployed Presto, a fast SQL engine from Facebook, our
analysts struggled with these problems on a regular basis.
26 Jan 2017
Caching is a well-known technique used to increase application performance and decrease overall system load.
Usually small or medium data sets, which are read often and changed rarely, are considered good candidates for
caching. In this article we focus on determining the optimal cache size using big data techniques.
17 Dec 2014
Big Data Spain is an annual conference on Big Data and related topics held in the
suburbs of Madrid. This year’s edition, the third, has been the biggest so far; it attracted more than 500 guests
and various speakers, including Big Data celebrities like Paco Nathan of Databricks. During the two days of the conference,
guests could attend many keynotes, talks and workshops and learn about various products, services and specific
use cases, in both English and Spanish. Allegro was represented by two employees with a presentation on Hadoop pitfalls
and gotchas.
05 Nov 2014
This year’s edition of Strata Hadoop World, held in New York, was humongous: 16
workshops, over 20 keynotes, over 130 talks and, most importantly, over 5000 attendees! This massive crowd wouldn’t fit in
the Hilton hotel where the previous edition was held, which is why the organizers had to move the conference to the Javits Convention
Center, an enormous building in which Big Data believers occupied just one sector. The fact that the European edition of
Hadoop Summit went through exactly the same transition (its third edition is going to be held in a bigger venue in
Brussels) gives pleasant assurance that Big Data technologies are still a hot topic and that the Big Data community is growing
at a steady pace.