Oath has one of the largest Hadoop footprints in the world, with tens of thousands of jobs running every day. Reliability and consistency are key here. With 50k+ nodes, at any given time a considerable number of nodes will have disk, memory, network, or slowness issues. Hosts that have trouble serving or running jobs can increase the run times of tightly SLA-bound jobs dramatically, frustrating both users and the support team who must debug them.
We are constantly working to develop systems that work in tandem with Hadoop to quickly identify and single out pressure points. Here we concentrate on disks: in our experience, disks are the most fragile and troublesome components, especially high-density disks. Because of the huge scale and the monetary impact of slow-performing disks, we took on the challenge of building a system to predict worn-out disks and take them out of service before they become performance bottlenecks and hit jobs' SLAs. Now the task sounds simple: look for the symptoms of hard drive failure and take those drives out, right? It is not so straightforward when we are talking about 200k+ disk drives. Just collecting that much data periodically and reliably is a small challenge compared to analyzing such huge datasets and predicting bad disks. For each disk, we have counters such as reallocated sector count, reported uncorrectable errors, command timeouts, and uncorrectable sector count. On top of that, each hard disk model has its own interpretation of these statistics.
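As a minimal sketch of where such counters come from, the snippet below parses the per-disk attributes named above out of `smartctl -A`-style output. The `SAMPLE` text is fabricated example output (in production the text would come from running smartctl on each host), and the exact attribute names and column layout are assumptions based on the usual ATA SMART attribute table, not our actual collection pipeline.

```python
# Fabricated example of `smartctl -A` attribute-table output; real
# output varies by drive model, as noted above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       3
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       7
"""

# The four counters called out in the text.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Command_Timeout",
    "Offline_Uncorrectable",
}

def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for the counters we watch."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID; this skips the header.
        if len(parts) >= 10 and parts[0].isdigit() and parts[1] in WATCHED:
            stats[parts[1]] = int(parts[9])
    return stats

print(parse_smart_attributes(SAMPLE))
# → {'Reallocated_Sector_Ct': 12, 'Reported_Uncorrect': 3,
#    'Command_Timeout': 0, 'Offline_Uncorrectable': 7}
```

A collector along these lines would run per host on a schedule and ship the resulting records to a central store.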
How can one predict whether a disk is going to fail based on these symptoms and the system stats collected? We chose machine learning to accomplish this. The system can be divided broadly into four subsystems:
1. Dataset collection: This involves getting the disk performance stats, counters, and logs for thousands of disk drives from thousands of servers and storing them in a centralized location. We used the Elastic Stack to collect the data centrally.
2. Clean and preprocess the data: Cleaning extracts the necessary fields from the collected data, and preprocessing converts the extracted data into the format required for further processing.
3. Cluster and label the data (using domain knowledge): Once the data has been preprocessed, we need to find data points that are similar to each other by clustering them. Why clustering? The raw data is just a dump; to derive meaning from it, we need to group similar data points together and keep dissimilar points apart. Self-organizing maps are the clustering technique we use. Once the data has been grouped, each cluster needs to be labeled as "good disk," "going to be bad," or "bad disk." Labeling requires domain knowledge.
4. Train on the clustered data using artificial neural networks: An artificial neural network is a supervised machine learning classifier, and it is the main algorithm used for prediction. Initially, we pass the labeled data to train the neural network model. Once the model is trained, we use it to predict whether a disk is good, going to fail, or bad by passing in new, unlabeled data. The efficiency of this model depends on various factors: the accuracy of the labeled data and the learning/training parameters, to name a few.
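To make step 3 concrete, here is a from-scratch sketch of a small self-organizing map grouping disks by their SMART counters. The grid size, decay schedules, and the synthetic "healthy" vs. "failing" feature vectors are all illustrative assumptions; the production system's SOM configuration is not described here.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(3, 3), epochs=50, lr0=0.5, sigma0=1.5):
    """Train a rectangular SOM; returns weights of shape (rows, cols, features)."""
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every node, for the neighborhood function.
    coords = np.array([[r, c] for r in range(rows) for c in range(cols)])
    coords = coords.reshape(rows, cols, 2)
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Best-matching unit: node whose weight vector is closest to x.
            dists = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Linearly decay learning rate and neighborhood radius.
            frac = step / n_steps
            lr = lr0 * (1 - frac)
            sigma = sigma0 * (1 - frac) + 1e-3
            # Gaussian neighborhood around the BMU on the grid.
            grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=2)
            h = np.exp(-grid_d2 / (2 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

def map_to_bmu(weights, x):
    """Grid coordinates of the node a sample maps to."""
    dists = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(dists), dists.shape)

# Synthetic disks (assumed, normalized counters): healthy disks have
# near-zero error counts, failing disks show elevated counts.
healthy = rng.normal([0.02, 0.01, 0.01, 0.02], 0.01, size=(40, 4))
failing = rng.normal([0.80, 0.60, 0.30, 0.90], 0.05, size=(10, 4))
data = np.clip(np.vstack([healthy, failing]), 0, 1)

som = train_som(data)
# Healthy and failing disks should land on different map nodes,
# which is what makes per-node labeling by a domain expert possible.
print(map_to_bmu(som, data[0]), map_to_bmu(som, data[-1]))
```

Once trained, each map node can be shown to a domain expert along with the disks it attracts, and labeled "good," "going to be bad," or "bad" as the text describes.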
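The supervised step 4 could be sketched as below, using scikit-learn's `MLPClassifier` as a stand-in feed-forward neural network. The feature vectors, label scheme, and network shape are illustrative assumptions, not the production configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)

LABELS = ["good", "going_to_fail", "bad"]

# Synthetic labeled training data (assumed): four normalized SMART
# counters per disk, with three well-separated health regimes.
good = rng.normal([0.02, 0.01, 0.01, 0.02], 0.01, size=(60, 4))
warn = rng.normal([0.40, 0.30, 0.15, 0.35], 0.05, size=(60, 4))
bad = rng.normal([0.85, 0.70, 0.40, 0.90], 0.05, size=(60, 4))
X = np.clip(np.vstack([good, warn, bad]), 0, 1)
y = np.repeat([0, 1, 2], 60)  # labels from the clustering/labeling step

# Small feed-forward network; one hidden layer is enough for this toy data.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
clf.fit(X, y)

# Classify a new, unlabeled disk with clearly elevated error counters.
new_disk = np.array([[0.82, 0.68, 0.42, 0.88]])
print(LABELS[clf.predict(new_disk)[0]])
```

As the text notes, how well this generalizes hinges on the quality of the labels coming out of the clustering step and on the training parameters, which in practice need tuning per disk model.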