Cobrix – a COBOL Data Source for Spark

Wednesday, March 20
11:50 AM - 12:30 PM
Room 131-132

The financial industry operates on a variety of data and computing platforms. Integrating these different sources into a centralized data lake is crucial to support reporting and analytics tools.

Apache Spark is becoming the tool of choice for big data integration and analytics due to its scalability and its support for processing data from a variety of sources and formats, such as JSON, Parquet, and Kafka. However, one of the most common platforms in the financial industry is the mainframe, which does not provide easy interoperability with other platforms.

COBOL is the most widely used language in the mainframe environment. Designed in 1959, it evolved in parallel with other programming languages and thus has its own constructs and primitives. Furthermore, data produced by COBOL programs is EBCDIC-encoded and uses binary representations of numeric data types that differ from those on other platforms.
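As a small illustration of the encoding gap (a sketch, not part of Cobrix itself), the snippet below decodes a handful of EBCDIC bytes on the JVM. The Cp037 code page and the sample byte values are assumptions for the example, since mainframes use several EBCDIC variants, and the code assumes the JVM's extended charsets are available:

    import java.nio.charset.Charset

    object EbcdicDemo {
      def main(args: Array[String]): Unit = {
        // "HELLO" encoded in EBCDIC code page 037: H=0xC8, E=0xC5, L=0xD3, O=0xD6.
        // In ASCII the same text would be 0x48 0x45 0x4C 0x4C 0x4F, so a raw
        // mainframe dump cannot be read as plain text without conversion.
        val ebcdicBytes = Array(0xC8, 0xC5, 0xD3, 0xD3, 0xD6).map(_.toByte)
        val decoded = new String(ebcdicBytes, Charset.forName("Cp037"))
        println(decoded) // prints HELLO
      }
    }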

We have developed Cobrix, a library that extends the Spark SQL API to allow direct reading of binary files generated by mainframes.
While projects like Sqoop focus on transferring relational data by providing direct connectors to a mainframe, Cobrix can be used to parse and load hierarchical data (from IMS, for instance) after it has been transferred from a mainframe by dumping records to a binary file. The schema should be provided as a COBOL copybook and can contain nested structures and arrays. We present how the schema mapping between COBOL and Spark was done and how it was used in the implementation of the Spark COBOL data source. We also present use cases of simple and multi-segment files to illustrate how we use the library to load data from mainframes into our Hadoop data lake.
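As a rough sketch of how such a data source could be used from a Spark application (the format name, option key, and paths below are illustrative assumptions; see the Cobrix repository for the actual API and required packages), loading a mainframe dump together with its copybook might look like this:

    import org.apache.spark.sql.SparkSession

    object CobrixLoadExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("Load mainframe data")
          .getOrCreate()

        // The copybook describes the record layout, including nested groups
        // and arrays, for example:
        //   01  COMPANY-RECORD.
        //       05  COMPANY-ID    PIC 9(6).
        //       05  COMPANY-NAME  PIC X(20).
        //       05  ADDRESS.
        //           10  CITY      PIC X(15).
        //           10  ZIP       PIC X(10).
        //
        // The "cobol" format name and the "copybook" option key are
        // assumptions for this sketch; consult the Cobrix documentation
        // for the exact options.
        val df = spark.read
          .format("cobol")
          .option("copybook", "/path/to/company_record.cpy")
          .load("/path/to/mainframe_dump.dat")

        df.printSchema() // nested COBOL groups map to Spark struct and array fields
        df.show()
      }
    }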

We have open-sourced Cobrix at https://github.com/AbsaOSS/cobrix

Speakers

Ruslan Iushchenko
Big Data Engineer
ABSA
Ruslan is a Scala and Spark enthusiast with a degree in High Performance Computing. He lives in Prague, Czech Republic. Until 2016 he worked on seismic wave simulation software for the oil and gas industry in Kiev, Ukraine, where he also taught parallel programming at a university. He now works as a big data engineer in the Big Data R&D team at ABSA, a multinational African bank. His interests include distributed systems and concurrent and parallel programming.
Felipe Melo
Big Data Engineer
ABSA
Senior Big Data Engineer with experience in Information Retrieval and Machine Learning.