Concurrency

Niagara: A 32-Way Multithreaded Sparc Processor
Poonacha Kongetira Sun Microsystems
Kathirgamar Aingaran Sun Microsystems
Kunle Olukotun Sun Microsystems
MICRO 2005
ACM portal

The Niagara processor implements a thread-rich architecture designed to provide a high-performance solution for commercial server applications. The hardware supports 32 threads with a memory subsystem consisting of an on-board crossbar, level-2 cache, and memory controllers for a highly integrated design that exploits the thread-level parallelism inherent to server applications, while targeting low levels of power consumption.

Synergistic processsing in CELL's multicore architecture
A bunch of people from IBM
MICRO 2006
PDF

Transactional memory: architectural support for lock-free data structures
Maurice Herlihy, J. Eliot B. Moss
ISCA 1993
ACM portal

A shared data structure is lock-free if its operations do not require mutual exclusion. If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object. In highly concurrent systems, lock-free data structures avoid common problems associated with conventional locking techniques, including priority inversion, convoying, and difficulty of avoiding deadlock. This paper introduces transactional memory, a new multiprocessor architecture intended to make lock-free synchronization as efficient (and easy to use) as conventional techniques based on mutual exclusion. Transactional memory allows programmers to define customized read-modify-write operations that apply to multiple, independently-chosen words of memory. It is implemented by straightforward extensions to any multiprocessor cache-coherence protocol. Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.

Software transactional memory
Nir Shavit MIT and Tel-Aviv University
Dan Touitou Tel-Aviv University
PODC 1995
ACM portal

Software transactional memory for dynamic-sized data structures
Maurice Herlihy Brown University, Providence, RI
Victor Luchangco Sun Microsystems Laboratories, Burlington, MA
Mark Moir Sun Microsystems Laboratories, Burlington, MA
William N. Scherer, III University of Rochester, Rochester, NY
PODC 2003
ACM portal

We propose a new form of software transactional memory (STM) designed to support dynamic-sized data structures, and we describe a novel non-blocking implementation. The non-blocking property we consider is obstruction-freedom. Obstruction-freedom is weaker than lock-freedom; as a result, it admits substantially simpler and more efficient implementations. A novel feature of our obstruction-free STM implementation is its use of modular contention managers to ensure progress in practice. We illustrate the utility of our dynamic STM with a straightforward implementation of an obstruction-free red-black tree, thereby demonstrating a sophisticated non-blocking dynamic data structure that would be difficult to implement by other means. We also present the results of simple preliminary performance experiments that demonstrate that an "early release" feature of our STM is useful for reducing contention, and that our STM lends itself to the effective use of modular contention managers.

Compiler and runtime support for efficient software transactional memory
Ali-Reza Adl-Tabatabai Intel Labs, Santa Clara, CA
Brian T. Lewis Intel Labs, Santa Clara, CA
Vijay Menon Intel Labs, Santa Clara, CA
Brian R. Murphy Intel China Research Center, Beijing, China
Bratin Saha Intel Labs, Santa Clara, CA
Tatiana Shpeisman Intel Labs, Santa Clara, CA
PLDI 2006
ACM portal

Programmers have traditionally used locks to synchronize concurrent access to shared data. Lock-based synchronization, however, has well-known pitfalls: using locks for fine-grain synchronization and composing code that already uses locks are both difficult and prone to deadlock. Transactional memory provides an alternate concurrency control mechanism that avoids these pitfalls and significantly eases concurrent programming. Transactional memory language constructs have recently been proposed as extensions to existing languages or included in new concurrent language specifications, opening the door for new compiler optimizations that target the overheads of transactional memory.This paper presents compiler and runtime optimizations for transactional memory language constructs. We present a high-performance software transactional memory system (STM) integrated into a managed runtime environment. Our system efficiently implements nested transactions that support both composition of transactions and partial roll back. Our JIT compiler is the first to optimize the overheads of STM, and we show novel techniques for enabling JIT optimizations on STM operations. We measure the performance of our optimizations on a 16-way SMP running multi-threaded transactional workloads. Our results show that these techniques enable transactional memory's performance to compete with that of well-tuned synchronization.

The Atomos transactional programming language
Brian D. Carlstrom Stanford University
Austen McDonald Stanford University
Hassan Chafi Stanford University
JaeWoong Chung Stanford University
Chi Cao Minh Stanford University
Christos Kozyrakis Stanford University
Kunle Olukotun Stanford University
PLDI 2006
ACM portal

Atomos is the first programming language with implicit transactions, strong atomicity, and a scalable multiprocessor implementation. Atomos is derived from Java, but replaces its synchronization and conditional waiting constructs with simpler transactional alternatives.The Atomos watch statement allows programmers to specify fine-grained watch sets used with the Atomos retry conditional waiting statement for efficient transactional conflict-driven wakeup even in transactional memory systems with a limited number of transactional contexts. Atomos supports open-nested transactions, which are necessary for building both scalable application programs and virtual machine implementations.The implementation of the Atomos scheduler demonstrates the use of open nesting within the virtual machine and introduces the concept of transactional memory violation handlers that allow programs to recover from data dependency violations without rolling back.Atomos programming examples are given to demonstrate the usefulness of transactional programming primitives. Atomos and Java are compared through the use of several benchmarks. The results demonstrate both the improvements in parallel programming ease and parallel program performance provided by Atomos.

A flexible framework for implementing software transactional memory
Maurice Herlihy Brown University
Victor Luchangco Sun Microsystems Laboratories
Mark Moir Sun Microsystems Laboratories
OOPSLA 2006
ACM portal

We describe DSTM2, a Java“¢â software library that provides a flexible framework for implementing object-based software transactional memory (STM). The library uses transactional factories to transform sequential (unsynchronized) classes into atomic (transactionally synchronized) ones, providing a substantial improvement over the awkward programming interface of our previous DSTM library. Furthermore, researchers can experiment with alternative STM mechanisms by providing their own factories. We demonstrate this flexibility by presenting two factories: one that uses essentially the same mechanisms as the original DSTM (with some enhancements),and another that uses a completely different approach.Because DSTM2 is packaged as a Java library, a wide range of programmers can easily try it out, and the community can begin to gain experience with transactional programming. Furthermore, researchers will be able to use the body of transactional programs that arises from this community experience to test and evaluate different STM mechanisms simply by supplying new transactional factories. We believe that this flexible approach will help to build consensus about the best ways to implement transactions, and will avoid the premature "lock-in" that may arise if STM mechanisms are baked into compilers before such experimentation is done.

Speculative lock elision: enabling highly concurrent multithreaded execution
Ravi Rajwar University of Wisconsin-Madison Madison, WI
James R. Goodman University of Wisconsin-Madison Madison, WI
MICRO 2001
ACM portal

Serialization of threads due to critical sections is a fundamental bottleneck to achieving high performance in multithreaded programs. Dynamically, such serialization may be unnecessary because these critical sections could have safely executed concurrently without locks. Current processors cannot fully exploit such parallelism because they do not have mechanisms to dynamically detect such false inter-thread dependences.We propose Speculative Lock Elision (SLE), a novel micro-architectural technique to remove dynamically unnecessary lock-induced serialization and enable highly concurrent multithreaded execution. The key insight is that locks do not always have to be acquired for a correct execution. Synchronization instructions are predicted as being unnecessary and elided. This allows multiple threads to concurrently execute critical sections protected by the same lock. Misspeculation due to inter-thread data conflicts is detected using existing cache mechanisms and rollback is used for recovery. Successful speculative elision is validated and committed without acquiring the lock.SLE can be implemented entirely in microarchitecture without instruction set support and without system-level modifications, is transparent to programmers, and requires only trivial additional hardware support. SLE can provide programmers a fast path to writing correct high-performance multithreaded programs.

Detecting access anomalies in programs with critical sections
Anne Dinning D.E. Shaw & Co., 251 Park Ave., So., N.Y., New York
Edith Schonberg IBM Research, T. J. Watson, Research Center, Yorktown Heights, NY
1991 ACM/ONR workshop on Parallel and distributed debugging
ACM portal

Guava: a dialect of Java without data races
David F. Bacon IBM Watson Research Center, P.O. Box 704, Yorktown Heights, NY
Robert E. Strom IBM Watson Research Center, P.O. Box 704, Yorktown Heights, NY
Ashis Tarafdar Dept. of Computer Sciences, University of Texas at Austin, Austin, Texas
OOPSLA 2000
ACM portal

We introduce Guava, a dialect of Java whose rules statically guarantee that parallel threads access shared data only through synchronized methods. Our dialect distinguishes three categories of classes: (1) monitors, which may be referenced from multiple threads, but whose methods are accessed serially; (2) values, which cannot be referenced and therefore are never shared; and (3) objects, which can have multiple references but only from within one thread, and therefore do not need to be synchronized. Guava circumvents the problems associated with today's Java memory model, which must define behavior when concurrent threads access shared memory without synchronization.We present an overview of the syntax and the semantic rules of Guava. We discuss how implementations of Guava can exploit these rules to re-enable compiler optimizations inhibited by standard Java. We discuss how compilers for certain multiprocessor architectures can automatically generate certain programming idioms, such as double-check reads, as optimizations of serialized monitors.

Type-based race detection for Java
Cormac Flanagan Compaq Systems Research Center, 130 Lytton Ave., Palo Alto, CA
Stephen N. Freund Department of Computer Science, Stanford University, Stanford, CA
PLDI 2000
ACM portal

This paper presents a static race detection analysis for multithreaded Java programs. Our analysis is based on a formal type system that is capable of capturing many common synchronization patterns. These patterns include classes with internal synchronization, classes thatrequire client-side synchronization, and thread-local classes. Experience checking over 40,000 lines of Java code with the type system demonstrates that it is an effective approach for eliminating races conditions. On large examples, fewer than 20 additional type annotations per 1000 lines of code were required by the type checker, and we found a number of races in the standard Java libraries and other test programs.

Efficient and precise datarace detection for multithreaded object-oriented programs
Jong-Deok Choi IBM T. J. Watson Research Center
Keunwoo Lee Univ. of Washington
Alexey Loginov Univ. of Wisconsin - Madison
Robert O'Callahan IBM T. J. Watson Research Center
Vivek Sarkar IBM T. J. Watson Research Center
Manu Sridharan MIT
PLDI 2002
ACM portal

We present a novel approach to dynamic datarace detection for multithreaded object-oriented programs. Past techniques for on-the-fly datarace detection either sacrificed precision for performance, leading to many false positive datarace reports, or maintained precision but incurred significant overheads in the range of 3x to 30x. In contrast, our approach results in very few false positives and runtime overhead in the 13% to 42% range, making it both efficient and precise. This performance improvement is the result of a unique combination of complementary static and dynamic optimization techniques.

LOCKSMITH: context-sensitive correlation analysis for race detection
Polyvios Pratikakis University of Maryland, College Park
Jeffrey S. Foster University of Maryland, College Park
Michael Hicks University of Maryland, College Park
PLDI 2006
ACM portal

One common technique for preventing data races in multi-threaded programs is to ensure that all accesses to shared locations are consistently protected by a lock. We present a tool called LOCKSMITH for detecting data races in C programs by looking for violations of this pattern. We call the relationship between locks and the locations they protect consistent correlation, and the core of our technique is a novel constraint-based analysis that infers consistent correlation context-sensitively, using the results to check that locations are properly guarded by locks. We present the core of our algorithm for a simple formal language ’¦Ë> which we have proven sound, and discuss how we scale it up to an algorithm that aims to be sound for all of C. We develop several techniques to improve the precision and performance of the analysis, including a sharing analysis for inferring thread locality; existential quantification for modeling locks in data structures; and heuristics for modeling unsafe features of C such as type casts. When applied to several benchmarks, including multi-threaded servers and Linux device drivers, LOCKSMITH found several races while producing a modest number of false alarm.

Effective static race detection for Java
Mayur Naik Stanford University
Alex Aiken Stanford University
John Whaley Stanford University
PLDI 2006
ACM portal

We present a novel technique for static race detection in Java programs, comprised of a series of stages that employ a combination of static analyses to successively reduce the pairs of memory accesses potentially involved in a race. We have implemented our technique and applied it to a suite of multi-threaded Java programs. Our experiments show that it is precise, scalable, and useful, reporting tens to hundreds of serious and previously unknown concurrency bugs in large, widely-used programs with few false alarms.

KISS: keep it simple and sequential
Shaz Qadeer Microsoft, Redmond, WA
Dinghao Wu Princeton University, Princeton, NJ
PLDI 2004
ACM portal

The design of concurrent programs is error-prone due to the interaction between concurrently executing threads. Traditional automated techniques for finding errors in concurrent programs, such as model checking, explore all possible thread interleavings. Since the number of thread interleavings increases exponentially with the number of threads, such analyses have high computational complexity. In this paper, we present a novel analysis technique for concurrent programs that avoids this exponential complexity. Our analysis transforms a concurrent program into a sequential program that simulates the execution of a large subset of the behaviors of the concurrent program. The sequential program is then analyzed by a tool that only needs to understand the semantics of sequential execution. Our technique never reports false errors but may miss errors. We have implemented the technique in KISS, an automated checker for multithreaded C programs, and obtained promising initial results by using KISS to detect race conditions in Windows device drivers.

Threads cannot be implemented as a library
Hans-J. Boehm HP Laboratories, Palo Alto, CA
PLDI 2005
ACM portal

In many environments, multi-threaded code is written in a language that was originally designed without thread support (e.g. C), to which a library of threading primitives was subsequently added. There appears to be a general understanding that this is not the right approach. We provide specific arguments that a pure library approach, in which the compiler is designed independently of threading issues, cannot guarantee correctness of the resulting code.We first review why the approach almost works, and then examine some of the surprising behavior it may entail. We further illustrate that there are very simple cases in which a pure library-based approach seems incapable of expressing an efficient parallel algorithm.Our discussion takes place in the context of C with Pthreads, since it is commonly used, reasonably well specified, and does not attempt to ensure type-safety, which would entail even stronger constraints. The issues we raise are not specific to that context.

The Java Memory Model is Fatally Flawed
William Pugh Dept. of Computer Science, Univ. of Maryland, College Park
Concurrency: Practice and Experience, 2000
PDF

The Java memory model
Jeremy Manson University of Maryland, College Park, MD
William Pugh University of Maryland, College Park, MD
Sarita V. Adve University of Illinois at Urbana-Champaign, Urbana-Champaign, IL
POPL 2005
ACM portal

This paper describes the new Java memory model, which has been revised as part of Java 5.0. The model specifies the legal behaviors for a multithreaded program; it defines the semantics of multithreaded Java programs and partially determines legal implementations of Java virtual machines and compilers.The new Java model provides a simple interface for correctly synchronized programs -- it guarantees sequential consistency to data-race-free programs. Its novel contribution is requiring that the behavior of incorrectly synchronized programs be bounded by a well defined notion of causality. The causality requirement is strong enough to respect the safety and security properties of Java and weak enough to allow standard compiler and hardware optimizations. To our knowledge, other models are either too weak because they do not provide for sufficient safety/security, or are too strong because they rely on a strong notion of data and control dependences that precludes some standard compiler transformations.Although the majority of what is currently done in compilers is legal, the new model introduces significant differences, and clearly defines the boundaries of legal transformations. For example, the commonly accepted definition for control dependence is incorrect for Java, and transformations based on it may be invalid.In addition to providing the official memory model for Java, we believe the model described here could prove to be a useful basis for other programming languages that currently lack well-defined models, such as C++ and C#.




ACM portal

Dynamic optimization

ATOM: a system for building customized program analysis tools
Amitabh Srivastava, Alan Eustace -- Digital Equipment Western Research Laboratory
PLDI 1994
[
PDF | ACM ]

ATOM (Analysis Tools with OM) is a single framework for building a wide range of customized program analysis tools. It provides the common infrastructure present in all code-instrumenting tools; this is the difficult and time-consuming part. The user simply defines the tool-specific details in instrumentation and analysis routines. Building a basic block counting tool like Pixie with ATOM requires only a page of code. ATOM, using OM link-time technology, organizes the final executable such that the application program and user's analysis routines run in the same address space. Information is directly passed from the application program to the analysis routines through simple procedure calls instead of inter-process communication or files on disk. ATOM takes care that analysis routines do not interfere with the program's execution, and precise information about the program is presented to the analysis routines at all times. ATOM uses no simulation or interpretation. ATOM has been implemented on the Alpha AXP under OSF/1. It is efficient and has been used to build a diverse set of tools for basic block counting, profiling, dynamic memory recording, instruction and data cache simulation, pipeline simulation, evaluating branch prediction, and instruction scheduling.

Valgrind: A Program Supervision Framework
Nicholas Nethercote, Julian Seward -- University of Cambridge

[
PDF ] -- valgrind web-site

Dynamo: a transparent dynamic optimization system
Vasanth Bala, Evelyn Duesterwald, Sanjeev Banerjia -- Hewlett-Packard Labs
PLDI 2000
[
PDF | ACM ]

We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated Dynamo, and often by a significant degree. For example, the average performance of -O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their -O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.

Secure Execution via Program Shepherding
Vladimir Kiriansky Derek Bruening, Saman Amarasinghe -- Massachusetts Institute of Technology
USENIX 2002
[
PDF | ACM ]

We introduce program shepherding, a method for monitoring control flow transfers during program execution to enforce security policies. Program shepherding provides three techniques as building blocks for security policies. First, shepherding can restrict execution privileges on the basis of code origins. This distinction can ensure that malicious code masquerading as data is never executed, thwarting a large class of security attacks. Second, shepherding can restrict control transfers based on instruction class, source, and target. For example, shepherding can forbid execution of shared library code except through declared entry points, and can ensure that a return instruction only targets the instruction after a call. Finally, shepherding guarantees that sandboxing checks placed around any type of program operation will never be bypassed. We have implemented these capabilities efficiently in a runtime system with minimal or no performance penalties. This system operates on unmodified native binaries, requires no special hardware or operating system support, and runs on existing IA-32 machines under both Linux and Windows.

An infrastructure for adaptive dynamic optimization
Derek Bruening, Timothy Garnett, Saman Amarasinghe -- Massachusetts Institute of Technology
CGO 2003
[
PDF | ACM ]

Dynamic optimization is emerging as a promising approach to overcome many of the obstacles of traditional static compilation. But while there are a number of compiler infrastructures for developing static optimizations, there are very few for developing dynamic optimizations. We present a framework for implementing dynamic analyses and optimizations. We provide an interface for building external modules, or clients, for the DynamoRlO dynamic code modification system. This interface abstracts away many low-level details of the DynamoRlO runtime system while exposing a simple and powerful, yet efficient and lightweight, API. This is achieved by restricting optimization units to linear streams of code and using adaptive levels of detail for representing instructions. The interface is not restricted to optimization and can be used for instrumentation, profiling, dynamic translation, etc.To demonstrate the usefulness and effectiveness of our framework, we implemented several optimizations. These improve the performance of some applications by as much as 40% relative to native execution. The average speedup relative to base DynamoRlO performance is 12%.

Pin: building customized program analysis tools with dynamic instrumentation
Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace -- Intel Corporation
Vijay Janapa Reddi -- University of Colorado
Kim Hazelwood -- Intel Corporation
PLDI 2005
[
PDF | ACM ]

Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.