Freddie Mac and KPMG will share an innovative solution that accelerates data model (ERM) development and data integration on a highly distributed, in-memory computing platform. The framework's machine learning component (PySpark) executes against evolving semi-structured and structured data sets to learn and automate data mapping from various sources to a target schema. As a result, it significantly reduces manual analysis, design, and development effort and establishes faster data integration across a variety of complex, high-volume datasets.
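To illustrate the idea of learned source-to-target mapping, here is a deliberately simplified, plain-Python sketch. It stands in for the PySpark ML model with a name-similarity heuristic, and the schema and field names (`loan_id`, `LoanID`, etc.) are invented for illustration, not taken from the actual solution.

```python
from difflib import SequenceMatcher

# Hypothetical target schema; the real solution learns mappings into Hive tables.
TARGET_SCHEMA = ["loan_id", "borrower_name", "origination_date", "loan_amount"]

def _similarity(a, b):
    """Name similarity after normalizing case and separators."""
    norm = lambda s: s.lower().replace("_", "").replace("-", "")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def learn_mapping(source_fields, target_schema=TARGET_SCHEMA, threshold=0.6):
    """Propose a source-field -> target-column mapping.

    Fields whose best match falls below the threshold are left unmapped,
    mimicking records that would need manual review.
    """
    mapping = {}
    for field in source_fields:
        best = max(target_schema, key=lambda col: _similarity(field, col))
        if _similarity(field, best) >= threshold:
            mapping[field] = best
    return mapping

# Map an incoming feed's field names onto the target schema.
mapping = learn_mapping(["LoanID", "BorrowerName", "OrigDate", "Amount"])
```

In the described framework this matching step is replaced by a trained model that can also exploit value distributions and prior mappings, not just field names.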
The solution will leverage several components of the Hadoop data platform. It will use Sqoop to import the data into the platform, and PySpark to process it. In addition, a PySpark ML model will run as a continuous Spark job, processing the ingested semi-structured data and intelligently mapping it into the appropriate Hive tables. The entire workflow will be scheduled through Oozie.
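The per-batch logic of that continuous job can be sketched as follows. This is a plain-Python mock, not the actual implementation: in the described architecture the batch would arrive via Sqoop, run inside Spark, and be written to Hive under an Oozie schedule; here the Hive tables are dicts so the control flow is easy to follow, and the `"loans"` table name and field mapping are invented for illustration.

```python
def process_batch(records, field_mapping, hive_tables):
    """Apply a learned field mapping to each ingested record and route the
    mapped row to its (mocked) Hive table."""
    for record in records:
        # Rename fields per the learned mapping; unmapped fields are dropped
        # (a real job might quarantine them for review instead).
        row = {field_mapping[f]: v for f, v in record.items() if f in field_mapping}
        # Route the mapped row; a real Spark job would write to Hive here,
        # e.g. via DataFrameWriter, rather than appending to a dict.
        hive_tables.setdefault("loans", []).append(row)
    return hive_tables

# One micro-batch of ingested semi-structured records.
tables = process_batch(
    [{"LoanID": "A-1", "Amount": 250000}],
    {"LoanID": "loan_id", "Amount": "loan_amount"},
    {},
)
```

Running the mapping continuously in this way is what removes the manual design step: as new feeds arrive, the learned mapping routes them without per-source ETL code.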