Help Hadoop survive the 300 million block barrier and then back it up

Wednesday, June 20
11:50 AM - 12:30 PM
Executive Ballroom 210D/H

Criteo has a 100 PB production cluster, with four namenodes and 2000 datanodes, that runs 300,000 jobs a day. At the moment, we back up some of the cluster's data on our old 1200-node ex-production cluster.

We build our clusters in our own datacenters, as running in the cloud would be many times more expensive. In 2018, we will build yet another datacenter that will back up the main cluster, in both storage and compute, and then replace it. This presentation will describe what we learned when building multiple Hadoop clusters and why in-house bare metal is better than the cloud.

Building a cluster requires testing hardware from several manufacturers and choosing the most cost-effective option. We have now done these tests twice and can offer advice on how to do it right the first time.

Our two clusters were meant to provide a redundant solution to Criteo's storage and compute needs. We will explain our project, what went wrong, and our progress in building yet another, even bigger, cluster to create a computing system that will survive the loss of an entire datacenter. Disasters are not always external: an operator error could wipe out both Hadoop clusters. We are therefore adding a second, different backup technology for our most important 10 PB of data. Which technology will we use, and how did we test it?

An HDFS namenode cannot store an infinite number of data blocks. We will share what we have had to do to reach almost half a billion blocks, including sharding our data by adding namenodes. Hadoop, especially at this scale, does not run itself, so what operational skills and tools are required to keep the clusters healthy, the data safe, and the jobs running 24 hours a day, every day?
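For readers unfamiliar with namespace sharding, the sketch below shows the general shape of an HDFS federation setup: several namenodes each own part of the namespace, and a client-side ViewFs mount table stitches them back into a single file system view. The nameservice names, hostnames, and paths are invented for illustration; they are not Criteo's actual configuration.

    <!-- hdfs-site.xml: declare two federated namenodes (example names only) -->
    <property>
      <name>dfs.nameservices</name>
      <value>ns-user,ns-logs</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns-user</name>
      <value>namenode-user.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.ns-logs</name>
      <value>namenode-logs.example.com:8020</value>
    </property>

    <!-- core-site.xml: ViewFs mount table presenting both namespaces as one cluster -->
    <property>
      <name>fs.defaultFS</name>
      <value>viewfs://cluster</value>
    </property>
    <property>
      <name>fs.viewfs.mounttable.cluster.link./user</name>
      <value>hdfs://namenode-user.example.com:8020/user</value>
    </property>
    <property>
      <name>fs.viewfs.mounttable.cluster.link./logs</name>
      <value>hdfs://namenode-logs.example.com:8020/logs</value>
    </property>

Each namenode then holds only the block metadata for its own subtree, which is what keeps any single namenode under the block-count ceiling as the cluster grows.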

Speaker

Stuart Pook
Senior DevOps Engineer
Criteo
Stuart loves storage (208 PB at Criteo) and is part of Criteo's Lake team that runs some small and two rather large Hadoop clusters. He also loves automation with Chef because configuring more than 3000 Hadoop nodes by hand is just too slow. Before discovering Hadoop he developed user interfaces and databases for biotech companies. Stuart has presented at ACM CHI 2000, Devoxx 2016, NABD 2016, Hadoop Summit Tokyo 2016, Apache Big Data Europe 2016, Big Data Tech Warsaw 2017, and Apache Big Data North America 2017.