The Tension Between Evolving Programming Paradigms and Determinism: Data Triggered Threads as a Case Study
Dean Tullsen, University of California San Diego
Abstract: The current parallelism crisis (lots of hardware parallelism, not so much software parallelism) means that future performance scaling is completely dependent on our ability to increase available software parallelism. This has two implications, which are often at odds. On the one hand, any programming model, execution model, architecture, or compiler optimization that gives us more parallelism should be exploited. On the other hand, we need to make parallel programming (and debugging) accessible to novice programmers to increase the supply of parallel code. This talk will examine a new programming and execution model for parallel threads called Data Triggered Threads (DTT), and explore some of the tradeoffs between the performance goals of DTT and programmability, debuggability, and correctness concerns. Unlike threads in conventional programming models, data-triggered threads are initiated by a change to a memory location. This enables increased parallelism and the elimination of redundant, unnecessary computation.
Bio: Dean Tullsen is a professor in the computer science and engineering department at UCSD. He received his PhD from the University of Washington in 1996, where he worked on simultaneous multithreading (hyper-threading). He has continued to work in the area of computer architecture and back-end compilation, where with various co-authors he has introduced many new ideas to the research community, including threaded multipath execution, symbiotic job scheduling for multithreaded processors, dynamic critical path prediction, speculative precomputation, heterogeneous multi-core architectures, conjoined core architectures, event-driven simultaneous code optimization, and data triggered threads. He is a Fellow of the ACM and the IEEE. He has twice won the ACM SIGARCH/IEEE-CS TCCA Influential ISCA Paper Award.
Performance and Programmability Trade-offs in the OpenCL 2.0 SVM and Memory Model
Brian Lewis, Intel
Abstract: GPUs offer the potential of huge amounts of data-parallel computation for relatively little energy. NVIDIA’s Tesla K40 GPU card, for example, provides up to 4.29 TFLOPS of single-precision performance for 235 watts. GPUs are also becoming ubiquitous: more than 90% of processors shipping today include integrated GPUs on die. To take advantage of GPUs and other accelerators, a number of heterogeneous programming frameworks have been developed including OpenCL, CUDA, HSA, OpenACC, Renderscript, and C++AMP. These frameworks differ in what capabilities they provide, how much control they give programmers, and what guarantees they provide.
The recently released OpenCL 2.0 standard, for example, includes support for cross-device shared virtual memory (SVM), memory consistency, and C11-style atomics and fences. AMD’s new integrated Kaveri APU is the first processor to include HW support for these capabilities. While these features make it simple to share pointer-containing data structures like trees among a CPU and a number of GPUs or other devices, they also increase the risk of data races, which is only made worse by the large amount of parallelism. To help tame memory errors and to assist programmers, HW designers, and compiler implementers, OpenCL 2.0 includes a memory model based on those of C11 and C++11.
Based on my experience helping to develop OpenCL 2.0 SVM and the OpenCL memory model, I will talk about some features and compromises that made it into OpenCL 2.0. My focus will be on the performance and programmability trade-offs involved in SVM and the memory model. Along the way, I will also touch on other heterogeneous computing frameworks like HSA, CUDA, Renderscript, and C++AMP.
Bio: Brian Lewis is a Senior Staff Researcher at Intel Labs. His research interests include programming language implementation, virtual machines for managed programming languages, runtime system design, and heterogeneous computing. He is currently focused on simplifying and extending the use of graphics processing units to accelerate general-purpose computing tasks. He is the lead author of the shared virtual memory (SVM) support and the memory model for the OpenCL 2.0 standard. He previously helped to implement a lightweight common runtime for concurrent managed languages, as well as virtual machine enhancements to support transactional memory, dynamic code reorganization, pointer compression, and other optimizations for both Java and CLI. Prior to Intel, he worked at Sun Microsystems Laboratories, where he was one of the key developers of a persistent Java virtual machine, a new programming language for system programming, a retargetable binary translation framework, and performance monitoring tools for a new distributed operating system. He received a Ph.D. in Computer Science from the University of Washington.