A couple of thousand servers for a big data system is also a big investment. Microsoft Bing has figured out a way to fulfill our capacity needs without signing a huge check: we harvest spare cycles on underutilized servers and tune Hadoop and Spark configurations to fit this flexible capacity base, saving hundreds of millions of dollars per year.
Bing is adopting open source big data technologies for its offline data processing system. The system requires a massive amount of capacity, which implies a significant bill. In collaboration with the Windows, Azure, and Research teams, we harvest most of the needed capacity from our existing server fleet: we make use of the capacity on reserve servers while keeping them instantly available for emergency use, and we allocate compute and storage on servers that are not fully occupied. We updated Hadoop node decommissioning, HDFS block placement, YARN node label mapping, and a few other policies so that they can adapt to capacity that is even less reliable than commodity servers. We brought open source capacity to the Bing product at less than 1 percent of the cost of the conventional approach. We also extended the YARN and Spark frameworks to better fit the needs of deep learning training and inference workloads in our system; this extension is equipping Bing with interactive, direct question-and-answer query features.
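As a rough illustration of the YARN node label mapping mentioned above, here is a minimal sketch of how node labels are enabled in a stock Hadoop deployment (standard `yarn-site.xml` properties; the label name `harvested` and the store path are hypothetical, and Bing's actual policy changes go beyond this configuration):

```xml
<!-- yarn-site.xml: enable node labels so harvested, less-reliable
     capacity can be tagged separately from dedicated servers.
     The label name "harvested" and the store path are illustrative. -->
<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs:///system/yarn/node-labels</value>
</property>
```

With labels in place, queues can be restricted to (or kept off) the harvested partition, so that workloads tolerant of preemption land on the less reliable capacity while latency-sensitive jobs stay on dedicated nodes.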
Big does not mean expensive. From Bing's approach, the audience will learn how to make better use of their existing servers to run additional big data systems.