Profile-Guided Instruction and Data Memory Layout

February 7, 2001

12:30 pm - 1:30 pm

Halligan 106

Speaker: David Kaeli, Northeastern University

Abstract

Memory hierarchy performance has always been an important issue in computer architecture design. The likelihood of a bottleneck in the memory hierarchy is increasing, as improvements in microprocessor performance continue to outpace those made in the memory system. As a result, effective utilization of cache memories is essential in today's architectures. The procedural nature of software limits visibility when attempting to perform program optimization. One approach to increasing visibility across procedure boundaries is procedure inlining. The main downside of inlining is that inlined procedures can place excess pressure on the instruction cache. To address this issue, we perform code reordering. By combining reordering with aggressive inlining, the larger executable image produced through inlining can be effectively remapped onto the cache address space without noticeably increasing the instruction cache miss rate. In this talk, we discuss how employing cache line coloring lets us perform aggressive inlining. We have implemented three variations of our coloring algorithm in the Alto toolset and compare them against Alto's aggressive basic block reordering algorithms. Alto allows us to generate optimized executables that can be run on hardware to obtain results. We find that by using our algorithms, we can achieve up to a 21% reduction in execution runtime over the base Compaq optimizing compiler, and a 6.4% reduction when compared to Alto's interprocedural basic block reordering algorithm. In the second part of this talk, we discuss recent work on improving data heap layout using profile-guided allocation. We have developed our own malloc library and use it to exploit profile information on heap access patterns. Again, our goal is to reduce runtime cache conflicts. We use a call-stack pattern predictor to drive our layout. Using this new layout, we have improved runtime performance by up to 5%.
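For readers unfamiliar with the term, the general idea behind cache line coloring can be illustrated with a small sketch. This is not the speaker's algorithm or anything from the Alto toolset; it is a simplified greedy placement under assumed parameters. The cache geometry (LINE_SIZE, NUM_LINES), the procedure names and sizes, and the helper functions colors, align_up, and place are all hypothetical. Each cache line index is treated as a "color", and a procedure is shifted forward in memory until its colors do not overlap those of procedures it frequently interacts with.

```python
# Illustrative sketch only: a toy greedy "coloring" placement for a
# direct-mapped instruction cache. All values and names are assumptions.

LINE_SIZE = 32   # bytes per cache line (assumed value)
NUM_LINES = 8    # lines in the direct-mapped cache (assumed, kept tiny)


def colors(start, size):
    """Return the set of cache lines ("colors") touched by code
    occupying the byte range [start, start + size)."""
    first = start // LINE_SIZE
    last = (start + size - 1) // LINE_SIZE
    return {line % NUM_LINES for line in range(first, last + 1)}


def align_up(addr):
    """Round addr up to the next cache line boundary."""
    return -(-addr // LINE_SIZE) * LINE_SIZE


def place(procs, hot_pairs):
    """Greedily assign a start address to each procedure.

    procs:     list of (name, size_in_bytes), in placement order
    hot_pairs: set of frozenset({a, b}) for procedure pairs that a
               profile reports as frequently executed together
    """
    sizes = dict(procs)
    layout = {}
    next_free = 0
    for name, size in procs:
        # Collect the colors already used by this procedure's hot partners.
        avoid = set()
        for other, other_start in layout.items():
            if frozenset((name, other)) in hot_pairs:
                avoid |= colors(other_start, sizes[other])
        # Slide forward one line at a time looking for a conflict-free slot.
        start = next_free
        for _ in range(NUM_LINES):
            if not (colors(start, size) & avoid):
                break
            start += LINE_SIZE
        else:
            start = next_free  # no conflict-free slot exists; accept overlap
        layout[name] = start
        next_free = align_up(start + size)
    return layout


procs = [("A", 64), ("B", 192), ("C", 64)]
hot = {frozenset(("A", "C"))}
layout = place(procs, hot)
print(layout)
```

With these assumed inputs, a purely sequential layout would put C at address 256, which maps to the same two cache lines as A. The sketch instead leaves a small gap and places C where it no longer evicts A, at the cost of a slightly larger image.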