High-Performance End-to-End Integrity Verification on Big Data Transfer

Eun-Sung JUNG  Si LIU  Rajkumar KETTIMUTHU  Sungwook CHUNG  

IEICE TRANSACTIONS on Information and Systems   Vol.E102-D   No.8   pp.1478-1488
Publication Date: 2019/08/01
Publicized: 2019/04/24
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2018EDP7297
Type of Manuscript: PAPER
Category: Fundamentals of Information Systems
Keyword: high-performance data transfer, IoT-based big data, data integrity, pipelining


The scale of scientific data generated by experimental facilities and by simulations at high-performance computing facilities has been growing rapidly with the emergence of IoT-based big data. In many cases, this data must be transferred rapidly and reliably to remote facilities for storage, analysis, or sharing in Internet of Things (IoT) applications. To ensure integrity, the data can be verified with a checksum after it has been written to disk at the destination. This end-to-end integrity verification, however, inevitably incurs overhead (extra disk I/O and additional computation) and thus increases the overall data transfer time. In this article, we evaluate strategies that maximize the overlap between data transfer and checksum computation for astronomical observation data. Specifically, we examine file-level and block-level (with various block sizes) pipelining to overlap data transfer and checksum computation. We analyze these pipelining approaches in the context of GridFTP, a widely used protocol for scientific data transfers. We evaluate our methods through both theoretical analysis and experiments. The results show that block-level pipelining is effective in maximizing this overlap and can reduce the overall data transfer time with end-to-end integrity verification by up to 70% compared with sequential execution of transfer and checksum, and by up to 60% compared with file-level pipelining.
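The core idea of block-level pipelining described in the abstract — updating a running checksum on each block as it arrives, so hashing overlaps with the ongoing transfer rather than waiting for the whole file — can be sketched as follows. This is a minimal illustration only: the block size, the bounded queue, the MD5 digest, and the simulated producer/consumer threads are assumptions for demonstration, not the paper's GridFTP implementation.

```python
import hashlib
import queue
import threading

BLOCK_SIZE = 4096  # illustrative block size; the paper studies various sizes


def transfer(data: bytes, blocks: queue.Queue) -> None:
    """Producer: deliver the payload block by block (a stand-in for the
    network transfer stage of the pipeline)."""
    for off in range(0, len(data), BLOCK_SIZE):
        blocks.put(data[off:off + BLOCK_SIZE])
    blocks.put(None)  # end-of-stream marker


def checksum_pipeline(data: bytes) -> str:
    """Consumer: update a running checksum as each block arrives, so the
    checksum computation overlaps with the still-ongoing transfer."""
    blocks: queue.Queue = queue.Queue(maxsize=8)  # bounded queue caps memory use
    producer = threading.Thread(target=transfer, args=(data, blocks))
    producer.start()
    digest = hashlib.md5()
    while True:
        block = blocks.get()
        if block is None:
            break
        digest.update(block)  # runs while the producer fetches the next block
    producer.join()
    return digest.hexdigest()


payload = b"x" * (10 * BLOCK_SIZE + 123)
# The pipelined checksum matches a one-shot checksum of the whole payload.
assert checksum_pipeline(payload) == hashlib.md5(payload).hexdigest()
```

In the sequential baseline the checksum pass starts only after the last byte is on disk; here each block is hashed as soon as it arrives, which is what lets block-level pipelining hide most of the verification cost behind the transfer.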