Concurrency
Niagara: A 32-Way Multithreaded Sparc Processor The Niagara processor
implements a thread-rich architecture designed to provide a
high-performance solution for commercial server applications. The
hardware supports 32 threads with a memory subsystem consisting of an
on-board crossbar, level-2 cache, and memory controllers for a highly
integrated design that exploits the thread-level parallelism inherent
to server applications, while targeting low levels of power
consumption.
Poonacha Kongetira Sun Microsystems
Kathirgamar Aingaran Sun Microsystems
Kunle Olukotun Sun Microsystems
MICRO 2005
ACM portal
Synergistic processsing in CELL's multicore architecture
A bunch of people from IBM
MICRO 2006
PDF
Transactional memory: architectural support for lock-free data structures A shared data structure is
lock-free if its operations do not require mutual exclusion. If one
process is interrupted in the middle of an operation, other processes
will not be prevented from operating on that object. In highly
concurrent systems, lock-free data structures avoid common problems
associated with conventional locking techniques, including priority
inversion, convoying, and difficulty of avoiding deadlock. This paper
introduces transactional memory, a new multiprocessor architecture
intended to make lock-free synchronization as efficient (and easy to
use) as conventional techniques based on mutual
exclusion. Transactional memory allows programmers to define
customized read-modify-write operations that apply to multiple,
independently-chosen words of memory. It is implemented by
straightforward extensions to any multiprocessor cache-coherence
protocol. Simulation results show that transactional memory matches or
outperforms the best known locking techniques for simple benchmarks,
even in the absence of priority inversion, convoying, and deadlock.
Maurice Herlihy,
J. Eliot B. Moss
ISCA 1993
ACM portal
Software transactional memory
Nir Shavit MIT and Tel-Aviv University
Dan Touitou Tel-Aviv University
PODC 1995
ACM portal
Software transactional memory for dynamic-sized data structures We propose a new form of
software transactional memory (STM) designed to support dynamic-sized
data structures, and we describe a novel non-blocking
implementation. The non-blocking property we consider is
obstruction-freedom. Obstruction-freedom is weaker than lock-freedom;
as a result, it admits substantially simpler and more efficient
implementations. A novel feature of our obstruction-free STM
implementation is its use of modular contention managers to ensure
progress in practice. We illustrate the utility of our dynamic STM
with a straightforward implementation of an obstruction-free red-black
tree, thereby demonstrating a sophisticated non-blocking dynamic data
structure that would be difficult to implement by other means. We also
present the results of simple preliminary performance experiments that
demonstrate that an "early release" feature of our STM is useful for
reducing contention, and that our STM lends itself to the effective
use of modular contention managers.
Maurice Herlihy Brown University, Providence, RI
Victor Luchangco Sun Microsystems Laboratories, Burlington, MA
Mark Moir Sun Microsystems Laboratories, Burlington, MA
William N. Scherer, III University of Rochester, Rochester, NY
PODC 2003
ACM portal
Compiler and runtime support for efficient software transactional memory Programmers have traditionally
used locks to synchronize concurrent access to shared data. Lock-based
synchronization, however, has well-known pitfalls: using locks for
fine-grain synchronization and composing code that already uses locks
are both difficult and prone to deadlock. Transactional memory
provides an alternate concurrency control mechanism that avoids these
pitfalls and significantly eases concurrent programming. Transactional
memory language constructs have recently been proposed as extensions
to existing languages or included in new concurrent language
specifications, opening the door for new compiler optimizations that
target the overheads of transactional memory.This paper presents
compiler and runtime optimizations for transactional memory language
constructs. We present a high-performance software transactional
memory system (STM) integrated into a managed runtime environment. Our
system efficiently implements nested transactions that support both
composition of transactions and partial roll back. Our JIT compiler is
the first to optimize the overheads of STM, and we show novel
techniques for enabling JIT optimizations on STM operations. We
measure the performance of our optimizations on a 16-way SMP running
multi-threaded transactional workloads. Our results show that these
techniques enable transactional memory's performance to compete with
that of well-tuned synchronization.
Ali-Reza Adl-Tabatabai Intel Labs, Santa Clara, CA
Brian T. Lewis Intel Labs, Santa Clara, CA
Vijay Menon Intel Labs, Santa Clara, CA
Brian R. Murphy Intel China Research Center, Beijing, China
Bratin Saha Intel Labs, Santa Clara, CA
Tatiana Shpeisman Intel Labs, Santa Clara, CA
PLDI 2006
ACM portal
The Atomos transactional programming language Atomos is the first
programming language with implicit transactions, strong atomicity, and
a scalable multiprocessor implementation. Atomos is derived from Java,
but replaces its synchronization and conditional waiting constructs
with simpler transactional alternatives.The Atomos watch statement
allows programmers to specify fine-grained watch sets used with the
Atomos retry conditional waiting statement for efficient transactional
conflict-driven wakeup even in transactional memory systems with a
limited number of transactional contexts. Atomos supports open-nested
transactions, which are necessary for building both scalable
application programs and virtual machine implementations.The
implementation of the Atomos scheduler demonstrates the use of open
nesting within the virtual machine and introduces the concept of
transactional memory violation handlers that allow programs to recover
from data dependency violations without rolling back.Atomos
programming examples are given to demonstrate the usefulness of
transactional programming primitives. Atomos and Java are compared
through the use of several benchmarks. The results demonstrate both
the improvements in parallel programming ease and parallel program
performance provided by Atomos.
Brian D. Carlstrom Stanford University
Austen McDonald Stanford University
Hassan Chafi Stanford University
JaeWoong Chung Stanford University
Chi Cao Minh Stanford University
Christos Kozyrakis Stanford University
Kunle Olukotun Stanford University
PLDI 2006
ACM portal
A flexible framework for implementing software transactional memory We describe DSTM2, a Java“¢â
software library that provides a flexible framework for implementing
object-based software transactional memory (STM). The library uses
transactional factories to transform sequential (unsynchronized)
classes into atomic (transactionally synchronized) ones, providing a
substantial improvement over the awkward programming interface of our
previous DSTM library. Furthermore, researchers can experiment with
alternative STM mechanisms by providing their own factories. We
demonstrate this flexibility by presenting two factories: one that
uses essentially the same mechanisms as the original DSTM (with some
enhancements),and another that uses a completely different
approach.Because DSTM2 is packaged as a Java library, a wide range of
programmers can easily try it out, and the community can begin to gain
experience with transactional programming. Furthermore, researchers
will be able to use the body of transactional programs that arises
from this community experience to test and evaluate different STM
mechanisms simply by supplying new transactional factories. We believe
that this flexible approach will help to build consensus about the
best ways to implement transactions, and will avoid the premature
"lock-in" that may arise if STM mechanisms are baked into compilers
before such experimentation is done.
Maurice Herlihy Brown University
Victor Luchangco Sun Microsystems Laboratories
Mark Moir Sun Microsystems Laboratories
OOPSLA 2006
ACM portal
Speculative lock elision: enabling highly concurrent multithreaded execution Serialization of threads due
to critical sections is a fundamental bottleneck to achieving high
performance in multithreaded programs. Dynamically, such
serialization may be unnecessary because these critical sections
could have safely executed concurrently without locks. Current
processors cannot fully exploit such parallelism because they do not
have mechanisms to dynamically detect such false inter-thread
dependences.We propose Speculative Lock Elision (SLE), a novel
micro-architectural technique to remove dynamically unnecessary
lock-induced serialization and enable highly concurrent multithreaded
execution. The key insight is that locks do not always have to be
acquired for a correct execution. Synchronization instructions are
predicted as being unnecessary and elided. This allows multiple
threads to concurrently execute critical sections protected by the
same lock. Misspeculation due to inter-thread data conflicts is
detected using existing cache mechanisms and rollback is used for
recovery. Successful speculative elision is validated and committed
without acquiring the lock.SLE can be implemented entirely in
microarchitecture without instruction set support and without
system-level modifications, is transparent to programmers, and
requires only trivial additional hardware support. SLE can provide
programmers a fast path to writing correct high-performance
multithreaded programs.
Ravi Rajwar University of Wisconsin-Madison Madison, WI
James R. Goodman University of Wisconsin-Madison Madison, WI
MICRO 2001
ACM portal
Detecting access anomalies in programs with critical sections
Anne Dinning D.E. Shaw & Co., 251 Park Ave., So., N.Y., New York
Edith Schonberg IBM Research, T. J. Watson, Research Center, Yorktown Heights, NY
1991 ACM/ONR workshop on Parallel and distributed debugging
ACM portal
Guava: a dialect of Java without data races We introduce Guava, a dialect
of Java whose rules statically guarantee that parallel threads access
shared data only through synchronized methods. Our dialect
distinguishes three categories of classes: (1) monitors, which may be
referenced from multiple threads, but whose methods are accessed
serially; (2) values, which cannot be referenced and therefore are
never shared; and (3) objects, which can have multiple references but
only from within one thread, and therefore do not need to be
synchronized. Guava circumvents the problems associated with today's
Java memory model, which must define behavior when concurrent threads
access shared memory without synchronization.We present an overview of
the syntax and the semantic rules of Guava. We discuss how
implementations of Guava can exploit these rules to re-enable compiler
optimizations inhibited by standard Java. We discuss how compilers for
certain multiprocessor architectures can automatically generate
certain programming idioms, such as double-check reads, as
optimizations of serialized monitors.
David F. Bacon IBM Watson Research Center, P.O. Box 704, Yorktown Heights, NY
Robert E. Strom IBM Watson Research Center, P.O. Box 704, Yorktown Heights, NY
Ashis Tarafdar Dept. of Computer Sciences, University of Texas at Austin, Austin, Texas
OOPSLA 2000
ACM portal
Type-based race detection for Java This paper presents a static
race detection analysis for multithreaded Java programs. Our analysis
is based on a formal type system that is capable of capturing many
common synchronization patterns. These patterns include classes with
internal synchronization, classes thatrequire client-side
synchronization, and thread-local classes. Experience checking over
40,000 lines of Java code with the type system demonstrates that it is
an effective approach for eliminating races conditions. On large
examples, fewer than 20 additional type annotations per 1000 lines of
code were required by the type checker, and we found a number of races
in the standard Java libraries and other test programs.
Cormac Flanagan Compaq Systems Research Center, 130 Lytton Ave., Palo Alto, CA
Stephen N. Freund Department of Computer Science, Stanford University, Stanford, CA
PLDI 2000
ACM portal
Efficient and precise datarace detection for multithreaded object-oriented programs We present a novel approach to
dynamic datarace detection for multithreaded object-oriented
programs. Past techniques for on-the-fly datarace detection either
sacrificed precision for performance, leading to many false positive
datarace reports, or maintained precision but incurred significant
overheads in the range of 3x to 30x. In contrast, our approach
results in very few false positives and runtime overhead in the 13%
to 42% range, making it both efficient and precise. This performance
improvement is the result of a unique combination of complementary
static and dynamic optimization techniques.
Jong-Deok Choi IBM T. J. Watson Research Center
Keunwoo Lee Univ. of Washington
Alexey Loginov Univ. of Wisconsin - Madison
Robert O'Callahan IBM T. J. Watson Research Center
Vivek Sarkar IBM T. J. Watson Research Center
Manu Sridharan MIT
PLDI 2002
ACM portal
LOCKSMITH: context-sensitive correlation analysis for race detection One common technique for
preventing data races in multi-threaded programs is to ensure that all
accesses to shared locations are consistently protected by a lock. We
present a tool called LOCKSMITH for detecting data races in C programs
by looking for violations of this pattern. We call the relationship
between locks and the locations they protect consistent correlation,
and the core of our technique is a novel constraint-based analysis
that infers consistent correlation context-sensitively, using the
results to check that locations are properly guarded by locks. We
present the core of our algorithm for a simple formal language ’¦Ë>
which we have proven sound, and discuss how we scale it up to an
algorithm that aims to be sound for all of C. We develop several
techniques to improve the precision and performance of the analysis,
including a sharing analysis for inferring thread locality;
existential quantification for modeling locks in data structures; and
heuristics for modeling unsafe features of C such as type casts. When
applied to several benchmarks, including multi-threaded servers and
Linux device drivers, LOCKSMITH found several races while producing a
modest number of false alarm.
Polyvios Pratikakis University of Maryland, College Park
Jeffrey S. Foster University of Maryland, College Park
Michael Hicks University of Maryland, College Park
PLDI 2006
ACM portal
Effective static race detection for Java We present a novel technique
for static race detection in Java programs, comprised of a series of
stages that employ a combination of static analyses to successively
reduce the pairs of memory accesses potentially involved in a
race. We have implemented our technique and applied it to a suite of
multi-threaded Java programs. Our experiments show that it is
precise, scalable, and useful, reporting tens to hundreds of serious
and previously unknown concurrency bugs in large, widely-used
programs with few false alarms.
Mayur Naik Stanford University
Alex Aiken Stanford University
John Whaley Stanford University
PLDI 2006
ACM portal
KISS: keep it simple and sequential The design of concurrent
programs is error-prone due to the interaction between concurrently
executing threads. Traditional automated techniques for finding errors
in concurrent programs, such as model checking, explore all possible
thread interleavings. Since the number of thread interleavings
increases exponentially with the number of threads, such analyses have
high computational complexity. In this paper, we present a novel
analysis technique for concurrent programs that avoids this
exponential complexity. Our analysis transforms a concurrent program
into a sequential program that simulates the execution of a large
subset of the behaviors of the concurrent program. The sequential
program is then analyzed by a tool that only needs to understand the
semantics of sequential execution. Our technique never reports false
errors but may miss errors. We have implemented the technique in KISS,
an automated checker for multithreaded C programs, and obtained
promising initial results by using KISS to detect race conditions in
Windows device drivers.
Shaz Qadeer Microsoft, Redmond, WA
Dinghao Wu Princeton University, Princeton, NJ
PLDI 2004
ACM portal
Threads cannot be implemented as a library In many environments,
multi-threaded code is written in a language that was originally
designed without thread support (e.g. C), to which a library of
threading primitives was subsequently added. There appears to be a
general understanding that this is not the right approach. We provide
specific arguments that a pure library approach, in which the
compiler is designed independently of threading issues, cannot
guarantee correctness of the resulting code.We first review why the
approach almost works, and then examine some of the surprising
behavior it may entail. We further illustrate that there are very
simple cases in which a pure library-based approach seems incapable
of expressing an efficient parallel algorithm.Our discussion takes
place in the context of C with Pthreads, since it is commonly used,
reasonably well specified, and does not attempt to ensure
type-safety, which would entail even stronger constraints. The issues
we raise are not specific to that context.
Hans-J. Boehm HP Laboratories, Palo Alto, CA
PLDI 2005
ACM portal
The Java Memory Model is Fatally Flawed
William Pugh Dept. of Computer Science, Univ. of Maryland, College Park
Concurrency: Practice and Experience, 2000
PDF
The Java memory model This paper describes the new
Java memory model, which has been revised as part of Java 5.0. The
model specifies the legal behaviors for a multithreaded program; it
defines the semantics of multithreaded Java programs and partially
determines legal implementations of Java virtual machines and
compilers.The new Java model provides a simple interface for correctly
synchronized programs -- it guarantees sequential consistency to
data-race-free programs. Its novel contribution is requiring that the
behavior of incorrectly synchronized programs be bounded by a well
defined notion of causality. The causality requirement is strong
enough to respect the safety and security properties of Java and weak
enough to allow standard compiler and hardware optimizations. To our
knowledge, other models are either too weak because they do not
provide for sufficient safety/security, or are too strong because they
rely on a strong notion of data and control dependences that precludes
some standard compiler transformations.Although the majority of what
is currently done in compilers is legal, the new model introduces
significant differences, and clearly defines the boundaries of legal
transformations. For example, the commonly accepted definition for
control dependence is incorrect for Java, and transformations based on
it may be invalid.In addition to providing the official memory model
for Java, we believe the model described here could prove to be a
useful basis for other programming languages that currently lack
well-defined models, such as C++ and C#.
Jeremy Manson University of Maryland, College Park, MD
William Pugh University of Maryland, College Park, MD
Sarita V. Adve University of Illinois at Urbana-Champaign, Urbana-Champaign, IL
POPL 2005
ACM portal
Dynamic optimization
ATOM: a system for building customized program analysis tools ATOM (Analysis Tools with OM) is a
single framework for building a wide range of customized program analysis
tools. It provides the common infrastructure present in all
code-instrumenting tools; this is the difficult and time-consuming
part. The user simply defines the tool-specific details in
instrumentation and analysis routines. Building a basic block counting
tool like Pixie with ATOM requires only a page of code. ATOM, using OM
link-time technology, organizes the final executable such that the
application program and user's analysis routines run in the same address
space. Information is directly passed from the application program to the
analysis routines through simple procedure calls instead of inter-process
communication or files on disk. ATOM takes care that analysis routines do
not interfere with the program's execution, and precise information about
the program is presented to the analysis routines at all times. ATOM uses
no simulation or interpretation. ATOM has been implemented on the Alpha
AXP under OSF/1. It is efficient and has been used to build a diverse set
of tools for basic block counting, profiling, dynamic memory recording,
instruction and data cache simulation, pipeline simulation, evaluating
branch prediction, and instruction scheduling.
Amitabh Srivastava,
Alan Eustace -- Digital Equipment Western Research Laboratory
PLDI 1994
[ PDF | ACM ]
Valgrind: A Program Supervision Framework
Nicholas Nethercote, Julian Seward -- University of Cambridge
[ PDF ] -- valgrind web-site
Dynamo: a transparent dynamic optimization system We describe the design and
implementation of Dynamo, a software dynamic optimization system that is
capable of transparently improving the performance of a native
instruction stream as it executes on the processor. The input native
instruction stream to Dynamo can be dynamically generated (by a JIT for
example), or it can come from the execution of a statically compiled
native binary. This paper evaluates the Dynamo system in the latter, more
challenging situation, in order to emphasize the limits, rather than the
potential, of the system. Our experiments demonstrate that even
statically optimized native binaries can be accelerated Dynamo, and often
by a significant degree. For example, the average performance of -O
optimized SpecInt95 benchmark binaries created by the HP product C
compiler is improved to a level comparable to their -O4 optimized version
running without Dynamo. Dynamo achieves this by focusing its efforts on
optimization opportunities that tend to manifest only at runtime, and
hence opportunities that might be difficult for a static compiler to
exploit. Dynamo's operation is transparent in the sense that it does not
depend on any user annotations or binary instrumentation, and does not
require multiple runs, or any special compiler, operating system or
hardware support. The Dynamo prototype presented here is a realistic
implementation running on an HP PA-8000 workstation under the HPUX 10.20
operating system.
Vasanth Bala,
Evelyn Duesterwald,
Sanjeev Banerjia -- Hewlett-Packard Labs
PLDI 2000
[ PDF | ACM ]
Secure Execution via Program Shepherding We introduce program
shepherding, a method for monitoring control flow transfers during
program execution to enforce security policies. Program shepherding
provides three techniques as building blocks for security
policies. First, shepherding can restrict execution privileges on the
basis of code origins. This distinction can ensure that malicious
code masquerading as data is never executed, thwarting a large class
of security attacks. Second, shepherding can restrict control
transfers based on instruction class, source, and target. For
example, shepherding can forbid execution of shared library code
except through declared entry points, and can ensure that a return
instruction only targets the instruction after a call. Finally,
shepherding guarantees that sandboxing checks placed around any type
of program operation will never be bypassed. We have implemented
these capabilities efficiently in a runtime system with minimal or no
performance penalties. This system operates on unmodified native
binaries, requires no special hardware or operating system support,
and runs on existing IA-32 machines under both Linux and Windows.
Vladimir Kiriansky
Derek Bruening,
Saman Amarasinghe -- Massachusetts Institute of Technology
USENIX 2002
[ PDF | ACM ]
An infrastructure for adaptive dynamic optimization Dynamic optimization is emerging as
a promising approach to overcome many of the obstacles of traditional
static compilation. But while there are a number of compiler
infrastructures for developing static optimizations, there are very few
for developing dynamic optimizations. We present a framework for
implementing dynamic analyses and optimizations. We provide an interface
for building external modules, or clients, for the DynamoRlO dynamic code
modification system. This interface abstracts away many low-level details
of the DynamoRlO runtime system while exposing a simple and powerful, yet
efficient and lightweight, API. This is achieved by restricting
optimization units to linear streams of code and using adaptive levels of
detail for representing instructions. The interface is not restricted to
optimization and can be used for instrumentation, profiling, dynamic
translation, etc.To demonstrate the usefulness and effectiveness of our
framework, we implemented several optimizations. These improve the
performance of some applications by as much as 40% relative to native
execution. The average speedup relative to base DynamoRlO performance is
12%.
Derek Bruening,
Timothy Garnett,
Saman Amarasinghe -- Massachusetts Institute of Technology
CGO 2003
[ PDF | ACM ]
Pin: building customized program analysis tools with dynamic instrumentation Robust and powerful software
instrumentation tools are essential for program analysis tasks such as
profiling, performance evaluation, and bug detection. To meet this need,
we have developed a new instrumentation system called Pin. Our goals are
to provide easy-to-use, portable, transparent, and efficient
instrumentation. Instrumentation tools (called Pintools) are written in
C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the
tool writer to analyze an application at the instruction level without
the need for detailed knowledge of the underlying instruction set. The
API is designed to be architecture independent whenever possible, making
Pintools source compatible across different architectures. However, a
Pintool can access architecture-specific details when
necessary. Instrumentation with Pin is mostly transparent as the
application and Pintool observe the application's original,
uninstrumented behavior. Pin uses dynamic compilation to instrument
executables while they are running. For efficiency, Pin uses several
techniques, including inlining, register re-allocation, liveness
analysis, and instruction scheduling to optimize instrumentation. This
fully automated approach delivers significantly better instrumentation
performance than similar tools. For example, Pin is 3.3x faster than
Valgrind and 2x faster than DynamoRIO for basic-block counting. To
illustrate Pin's versatility, we describe two Pintools in daily use to
analyze production software. Pin is publicly available for Linux
platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86),
Itanium®, and ARM. In the ten months since Pin 2 was released in July
2004, there have been over 3000 downloads from its website.
Chi-Keung Luk,
Robert Cohn,
Robert Muth,
Harish Patil,
Artur Klauser,
Geoff Lowney,
Steven Wallace -- Intel Corporation
Vijay Janapa Reddi -- University of Colorado
Kim Hazelwood -- Intel Corporation
PLDI 2005
[ PDF | ACM ]