Today, the state of the art in observing HDFS metadata changes lends itself to only a couple of architectures. A typical pipeline uses the legacy FsImage, the OfflineImageViewer to parse it to plaintext, and Elasticsearch, Kibana, or perhaps HDFSDU for viewing. Some teams instead run scripts against the active NameNode that perform status listings or counts on various target directories.
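As a rough sketch, the legacy workflow looks something like the following. The cluster commands are shown commented out, and the sample rows and paths are illustrative stand-ins for real OfflineImageViewer output, not data from our clusters:

```shell
# Fetch the most recent checkpoint image from the NameNode:
#   hdfs dfsadmin -fetchImage /tmp/fsimage
#
# Parse it to delimited plaintext with the OfflineImageViewer:
#   hdfs oiv -p Delimited -delimiter ',' -i /tmp/fsimage -o /tmp/fsimage.csv
#
# A typical post-processing step: sum file sizes per user directory.
# The rows below are simplified to (path,filesize) for illustration.
cat <<'EOF' > /tmp/fsimage_sample.csv
/user/alice/part-0000,1048576
/user/alice/part-0001,2097152
/user/bob/logs/app.log,524288
EOF
awk -F',' '{ split($1, p, "/"); bytes[p[2] "/" p[3]] += $2 }
           END { for (d in bytes) print d, bytes[d] }' /tmp/fsimage_sample.csv
```

Every run of this pipeline starts from a full image dump, which is why it only ever yields a snapshot view.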
These approaches suffer from long processing times and offer only a point-in-time snapshot view. Even worse, they put additional read load on the Active and Standby NameNodes.
For our largest clusters at PayPal, fetching a NameNode image, parsing it, and generating reports can take hours. Many times the damage was already done by the time we got those reports. Thus, we strove to find a way to graph HDFS usage by user, by directory, or nearly any other dimension, in much closer to real time.
So, we decided to create a new tool built around a new kind of NameNode: a Standby NameNode with no RPC server (so no clients or DataNodes can connect to it) and a custom query engine on top of it, accessible through a REST API. It stays up to date by fetching edit batches from the JournalNodes, just like the real Standby. With this new NameNode, which we call NNA internally, we can generate reports much more quickly, roughly once every 30 minutes, that give great insight into directory usage and growth, per-user usage growth, quota usage, and more. It even gives us the ability to define very precise searches across the entire namespace.
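To give a feel for what querying such a REST API might look like, here is a minimal sketch. The hostname, port, endpoint, and query parameter names below are all illustrative assumptions for this example, not a definitive description of NNA's API:

```shell
# Illustrative query: count all files smaller than 1 KiB across the
# namespace. Host, endpoint, and parameters are hypothetical.
base='http://nna-host:8080'
endpoint='/filter'
query='set=files&filters=fileSize:lte:1024&sum=count'
url="${base}${endpoint}?${query}"
echo "$url"
# Against a live instance, this would be issued as:
#   curl -s "$url"
```

Because the query engine runs on its own in-memory copy of the namespace, queries like this put no read load on the Active or Standby NameNodes.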
While this is still very much an incomplete project in need of several improvements, it has already helped us immensely at PayPal: we can better graph in near real time how our HDFS users are behaving, and see what HDFS activity looks like within a window as small as a single day.