Logging Inter-Thread Data Dependencies in Linux Kernel

Takafumi KUBOTA, Naohiro AOTA, Kenji KONO

IEICE TRANSACTIONS on Information and Systems   Vol.E103-D   No.7   pp.1633-1646
Publication Date: 2020/07/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.2019EDP7255
Type of Manuscript: PAPER
Category: Software System
Keywords: logging automation, inter-thread dependency, debugging, operating systems


Logging is a practical and useful way of diagnosing failures in software systems. The logged events are crucially important to learning what happened during a failure. If key events are not logged, it is almost impossible to track error propagations during diagnosis. Tracking an error propagation becomes utterly complicated if an inter-thread data dependency is involved. An inter-thread data dependency arises when one thread accesses shared data corrupted by another thread. Since the erroneous state propagates from a buggy thread to a failing thread through the corrupted shared data, the root cause cannot be tracked back solely by investigating the failing thread. This paper presents the design and implementation of K9, a tool that automatically inserts logging code to trace inter-thread data dependencies. K9 is designed to be “practical”: it scales to one million lines of C code, incurs negligible runtime overhead, and provides clues for tracking inter-thread dependencies in real-world bugs. To scale to one million lines of code, K9 forgoes rigorous static analysis of pointers for detecting code locations where an inter-thread data dependency can occur. Instead, K9 takes a best-effort approach and finds “most” of those code locations by exploiting coding conventions. This paper demonstrates that K9 is applicable to Linux and, in spite of the best-effort approach, captures enough relevant code locations to provide useful clues to the root causes of real-world bugs, including a previously unknown bug in Linux. The paper also shows that K9's runtime overhead is negligible: in our evaluation, K9 incurs 1.25% throughput degradation and a 0.18% increase in CPU usage on average.