Improving Cache Partitioning Algorithms for Pseudo-LRU Policies

Xi ZHANG  Chuanyi LIU  Zhenyu LIU  Dongsheng WANG  

IEICE TRANSACTIONS on Information and Systems   Vol.E96-D   No.12   pp.2514-2523
Publication Date: 2013/12/01
Online ISSN: 1745-1361
DOI: 10.1587/transinf.E96.D.2514
Print ISSN: 0916-8532
Type of Manuscript: Special Section PAPER (Special Section on Parallel and Distributed Computing and Networking)
cache partitioning,  pseudo-LRU,  curve approximation,  

Full Text: PDF(1.5MB)>>
Buy this Article

As the number of concurrently running applications on the chip multiprocessors (CMPs) is increasing, efficient management of the shared last-level cache (LLC) is crucial to guarantee overall performance. Recent studies have shown that cache partitioning can provide benefits in throughput, fairness and quality of service. Most prior arts apply true Least Recently Used (LRU) as the underlying cache replacement policy and rely on its stack property to work properly. However, in commodity processors, pseudo-LRU policies without stack property are commonly used instead of LRU for their simplicity and low storage overhead. Therefore, this study sets out to understand whether LRU-based cache partitioning techniques can be applied to commodity processors. In this work, we instead propose a cache partitioning mechanism for two popular pseudo-LRU policies: Not Recently Used (NRU) and Binary Tree (BT). Without the help of true LRU's stack property, we propose a profiling logic that applies curve approximation methods to derive the hit curve (hit counts under varied way allocations) for an application. We then propose a hybrid partitioning mechanism, which mitigates the gap between the predicted hit curve and the actual statistics. Simulation results demonstrate that our proposal can improve throughput by 15.3% on average and outperforms the stack-estimate proposal by 12.6% on average. Similar results can be achieved in weighted speedup. For the cache configurations under study, it requires less than 0.5% storage overhead compared to the last-level cache. In addition, we also show that profiling mechanism with only one true LRU ATD achieves comparable performance and can further reduce the hardware cost by nearly two thirds compared with the hybrid mechanism.