The financial industry operates on a variety of data and computing platforms. Integrating these disparate sources into a centralized data lake is crucial to support reporting and analytics tools.
Apache Spark is becoming the tool of choice for big data integration and analytics due to its scalability and its support for processing data from a variety of sources and formats, such as JSON, Parquet, and Kafka. However, one of the most common platforms in the financial industry is the mainframe, which does not provide easy interoperability with other platforms.
COBOL is the most widely used language in the mainframe environment. Designed in 1959, it evolved in parallel with other programming languages and thus has its own constructs and primitives. Furthermore, data produced by COBOL applications is EBCDIC-encoded and uses binary representations of numeric data types that differ from those of other platforms.
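To make the encoding gap concrete, the following self-contained Scala sketch decodes two typical mainframe field layouts: EBCDIC text (via the JVM's IBM037 charset) and a COMP-3 packed-decimal number, where each byte holds two digits and the final nibble is the sign. The helper names are illustrative, not part of any library.

```scala
import java.nio.charset.Charset

object EbcdicDemo {
  // Decode EBCDIC (code page IBM037) text bytes into a JVM String.
  def decodeEbcdic(bytes: Array[Byte]): String =
    new String(bytes, Charset.forName("IBM037"))

  // Decode a COMP-3 (packed decimal) field: two decimal digits per byte,
  // with the low nibble of the last byte holding the sign (0xD = negative).
  def decodePacked(bytes: Array[Byte]): Long = {
    val nibbles = bytes.flatMap(b => Array((b >> 4) & 0x0F, b & 0x0F))
    val sign    = if (nibbles.last == 0x0D) -1L else 1L
    sign * nibbles.dropRight(1).foldLeft(0L)((acc, d) => acc * 10 + d)
  }

  def main(args: Array[String]): Unit = {
    // "Hello" in EBCDIC code page 037
    println(decodeEbcdic(Array(0xC8, 0x85, 0x93, 0x93, 0x96).map(_.toByte)))
    // 12345 stored as packed decimal bytes 0x12 0x34 0x5C (C = positive)
    println(decodePacked(Array(0x12, 0x34, 0x5C).map(_.toByte)))
  }
}
```

Neither decoding falls out of a plain byte-for-byte read, which is why generic Spark data sources cannot consume such files directly.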
We have developed Cobrix, a library that extends the Spark SQL API to allow reading binary files generated by mainframes directly.
While projects like Sqoop focus on transferring relational data via direct connectors to a mainframe, Cobrix can parse and load hierarchical data (from IMS, for instance) after it has been transferred from a mainframe as a dump of records in a binary file. The schema is provided as a COBOL copybook, which may contain nested structures and arrays. We present how the schema mapping between COBOL and Spark was done and how it was used in the implementation of a Spark COBOL data source. We also present use cases of simple and multi-segment files to illustrate how we use the library to load data from mainframes into our Hadoop data lake.
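As a sketch of the usage pattern described above (based on the Cobrix README; the paths are placeholders and option names may vary between versions, and the spark-cobol artifact must be on the classpath), reading a mainframe dump with its copybook as the schema looks like:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cobrix-example")
  .getOrCreate()

// "cobol" is the short name of the Cobrix data source; the copybook
// supplies the record layout, including nested structures and arrays.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/record.cpy")
  .load("/path/to/mainframe_dump.bin")

// Nested COBOL groups and OCCURS clauses map to Spark structs and arrays.
df.printSchema()
```

The resulting DataFrame can then be queried or written to the data lake like any other Spark source.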
We have open sourced Cobrix at https://github.com/AbsaOSS/cobrix