Reducing I/O Cost in OLAP Query Processing with MapReduce

Woo-Lam KANG  Hyeon-Gyu KIM  Yoon-Joon LEE  

Publication
IEICE TRANSACTIONS on Information and Systems   Vol.E98-D   No.2   pp.444-447
Publication Date: 2015/02/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2014EDL8143
Type of Manuscript: LETTER
Category: Data Engineering, Web Information Systems
Keyword: MapReduce, Hadoop, OLAP, data warehouse, TPC-H benchmark

Summary: 
This paper presents a method to reduce I/O cost in MapReduce when online analytical processing (OLAP) queries are used for data analysis. The method rests on two ideas. First, to reduce network transmission cost, mappers are organized to receive only the data needed to perform a map task, rather than the entire input data set. Second, to reduce storage consumption, only record IDs are stored for checkpointing, rather than the raw records. Experiments conducted with the TPC-H benchmark show that the proposed method is about 40% faster than Hive, the well-known data warehouse solution for MapReduce, while reducing the size of data stored for checkpointing to about 80%.
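
The two ideas in the summary can be illustrated with a toy, single-process sketch. It is not the authors' implementation and does not use Hadoop or Hive. The table contents and the function names (project, map_filter, checkpoint_ids, checkpoint_raw, restore) are all hypothetical, chosen only to show why sending a mapper just the columns it needs, and checkpointing record IDs instead of raw records, moves and stores less data.

```python
import json

# Hypothetical toy table: record ID -> full record. It stands in for
# input data that remains available in the underlying store.
LINEITEM = {
    1: {"orderkey": 10, "quantity": 5, "price": 100.0, "comment": "x" * 50},
    2: {"orderkey": 10, "quantity": 30, "price": 250.0, "comment": "y" * 50},
    3: {"orderkey": 11, "quantity": 12, "price": 80.0, "comment": "z" * 50},
}


def project(table, columns):
    """Yield (record_id, row) pairs holding only the columns a map task needs."""
    for rid, row in table.items():
        yield rid, {c: row[c] for c in columns}


def map_filter(projected, min_quantity):
    """Toy map task: return the IDs of records that pass the predicate."""
    return [rid for rid, row in projected if row["quantity"] >= min_quantity]


def checkpoint_ids(ids):
    """Checkpoint that stores only record IDs (assumed illustration)."""
    return json.dumps(ids)


def checkpoint_raw(table, ids):
    """Baseline checkpoint that stores the raw records themselves."""
    return json.dumps([table[i] for i in ids])


def restore(table, checkpoint):
    """Rebuild the intermediate result by looking the IDs up again."""
    return [table[i] for i in json.loads(checkpoint)]


if __name__ == "__main__":
    ids = map_filter(project(LINEITEM, ["quantity"]), min_quantity=10)
    small = checkpoint_ids(ids)
    large = checkpoint_raw(LINEITEM, ids)
    print(ids, len(small), len(large))
    print([row["orderkey"] for row in restore(LINEITEM, small)])
```

The ID-only checkpoint is smaller because it defers the cost to recovery time: restoring requires the original records to still be readable by ID, which this sketch simply assumes.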