Dmitriyev, Viktor and Kruse, Felix and Precht, Hauke and Becker, Simon and Solsbach, Andreas and Marx Gómez, Jorge Carlos (2017) Building a big data analytical pipeline with Hadoop for processing enterprise XML data. The 11th Mediterranean Conference on Information Systems (MCIS 2017).

Full text not available from this repository.

This paper presents an end-to-end approach to processing XML files in the Hadoop ecosystem and demonstrates how to handle the problems that arise when analyzing large amounts of XML files. It describes a complete Extract, Load and Transform (ELT) cycle based on the open source software stack Apache Hadoop, which has become a standard for processing huge amounts of data. The work shows that applying open source solutions to a particular set of problems may not be enough: most open source big data processing tools were implemented to address only a limited number of use cases. It explains and shows why specific use cases may require significant extension with multiple self-developed software components. The use case described in the paper deals with huge amounts of semi-structured XML files that are to be persisted and processed daily.

Item Type: Article
Uncontrolled Keywords: Big Data, Hadoop, ETL, ELT, XML, Data Analytical Pipeline
Divisions: School of Computing Science, Business Administration, Economics and Law > Department of Computing Science
Date Deposited: 12 Sep 2018 11:51
Last Modified: 10 May 2019 15:39
URN: urn:nbn:de:gbv:715-oops-35873
