Delay Distribution Based Remote Data Fetch Scheme for Hadoop Clusters in Public Cloud

Ravindra Sandaruwan RANAWEERA
Eiji OKI
Nattapong KITSUWAN

IEICE TRANSACTIONS on Communications   Vol.E102-B    No.8    pp.1617-1625
Publication Date: 2019/08/01
Publicized: 2019/02/04
Online ISSN: 1745-1345
DOI: 10.1587/transcom.2018EBP3243
Type of Manuscript: PAPER
Category: Network
public cloud,  Hadoop,  big data,  HDFS,  

Full Text: PDF>>
Buy this Article

Apache Hadoop and its ecosystem have become the de facto platform for processing large-scale data, or Big Data, because it hides the complexity of distributed computing, scheduling, and communication while providing fault-tolerance. Cloud-based environments are becoming a popular platform for hosting Hadoop clusters due to their low initial cost and limitless capacity. However, cloud-based Hadoop clusters bring their own challenges due to contradictory design principles. Hadoop is designed on the shared-nothing principle while cloud is based on the concepts of consolidation and resource sharing. Most of Hadoop's features are designed for on-premises data centers where the cluster topology is known. Hadoop depends on the rack assignment of servers (configured by the cluster administrator) to calculate the distance between servers. Hadoop calculates the distance between servers to find the best remote server from which to fetch data from when fetching non-local data. However, public cloud environment providers do not share rack information of virtual servers with their tenants. Lack of rack information of servers may allow Hadoop to fetch data from a remote server that is on the other side of the data center. To overcome this problem, we propose a delay distribution based scheme to find the closest server to fetch non-local data for public cloud-based Hadoop clusters. The proposed scheme bases server selection on the delay distributions between server pairs. Delay distribution is calculated measuring the round-trip time between servers periodically. Our experiments observe that the proposed scheme outperforms conventional Hadoop nearly by 12% in terms of non-local data fetch time. This reduction in data fetch time will lead to a reduction in job run time, especially in real-world multi-user clusters where non-local data fetching can happen frequently.