Integrating different sources into a centralized data lake is crucial to a data-driven company. For many industries one of the most common platforms is the mainframe, for which it is often challenging to achieve interoperability with other platforms.
A good ecosystem of tools have been built around Mainframes, to connect them to external systems on which analytics tools can be built. For instance, data can be streamed directly from Mainframes via message queues. Another option is to run Spark directly on Mainframes, although analytics teams are seldom allowed to run anything on mainframe infrastructures due to risk and security policies. Relational databases like DB2 can be accessed using direct ODBC/JDBC connectors and tools like Sqoop.
Consuming data from hierarchical databases like IMS, though, is much more challenging. Open source tools like legstar can help with it, but most of the time integrating IMS data requires dedicated software development for each data set and/or very expensive proprietary tools. Also, existing tools focus primarily on relational data so original hierarchical schema needs to be flattened, exploded and/or projected which can make resulting tables very wide (sometime >10k columns) and inefficient.
In this session we are going to talk about Cobrix (https://github.com/AbsaOSS/cobrix), a library that extends Spark SQL API to allow reading from binary files generated by mainframes directly from HDFS. This provides yet another option for accessing mainframe data, that is, to export the data as a set of records, provide the schema as a COBOL copybook and transfer it to an HDFS folder for direct processing in Spark. Spark’s flexibility in naturally supporting nested structures and arrays allows retaining the original schema.
One of the biggest challenges loading data from mainframe files is performance. While datasets are stored in mainframes, their access is managed by specialized mainframe filesystems. But when those datasets are copied to a PC, they become files containing sequences of records linked together in a linked-list fashion. This makes such files naturally sequential and hard to process in parallel. To solve this issue we designed a two-phase in-memory sparse index that allows those files to be read in a distributed and parallel way.
In this talk we will: provide an overview of the difference between data definition models in mainframes and PCs, show how the schema mapping between COBOL and Spark was done in Cobrix, and present several use cases of reading simple and multi-segment data files in Spark to illustrate how the library can be used. We are also going to talk about how we achieved scalability reading big record sequence files and how we achieved data locality awareness when processing such files.