Data Parallel C++ : Programming Accelerated Systems Using C++ and SYCL.

Bibliographic Details
Contributors:
Place / Publishing House:Berkeley, CA : Apress L. P., 2023.
©2023.
Year of Publication:2023
Edition:2nd ed.
Language:English
Online Access:
Physical Description:1 online resource (648 pages)
Table of Contents:
  • Intro
  • Table of Contents
  • About the Authors
  • Preface
  • Foreword
  • Acknowledgments
  • Chapter 1: Introduction
  • Read the Book, Not the Spec
  • SYCL 2020 and DPC++
  • Why Not CUDA?
  • Why Standard C++ with SYCL?
  • Getting a C++ Compiler with SYCL Support
  • Hello, World! and a SYCL Program Dissection
  • Queues and Actions
  • It Is All About Parallelism
  • Throughput
  • Latency
  • Think Parallel
  • Amdahl and Gustafson
  • Scaling
  • Heterogeneous Systems
  • Data-Parallel Programming
  • Key Attributes of C++ with SYCL
  • Single-Source
  • Host
  • Devices
  • Sharing Devices
  • Kernel Code
  • Kernel: Vector Addition (DAXPY)
  • Asynchronous Execution
  • Race Conditions When We Make a Mistake
  • Deadlock
  • C++ Lambda Expressions
  • Functional Portability and Performance Portability
  • Concurrency vs. Parallelism
  • Summary
  • Chapter 2: Where Code Executes
  • Single-Source
  • Host Code
  • Device Code
  • Choosing Devices
  • Method #1: Run on a Device of Any Type
  • Queues
  • Binding a Queue to a Device When Any Device Will Do
  • Method #2: Using a CPU Device for Development, Debugging, and Deployment
  • Method #3: Using a GPU (or Other Accelerators)
  • Accelerator Devices
  • Device Selectors
  • When Device Selection Fails
  • Method #4: Using Multiple Devices
  • Method #5: Custom (Very Specific) Device Selection
  • Selection Based on Device Aspects
  • Selection Through a Custom Selector
  • Mechanisms to Score a Device
  • Creating Work on a Device
  • Introducing the Task Graph
  • Where Is the Device Code?
  • Actions
  • Host Tasks
  • Summary
  • Chapter 3: Data Management
  • Introduction
  • The Data Management Problem
  • Device Local vs. Device Remote
  • Managing Multiple Memories
  • Explicit Data Movement
  • Implicit Data Movement
  • Selecting the Right Strategy
  • USM, Buffers, and Images
  • Unified Shared Memory
  • Accessing Memory Through Pointers
  • USM and Data Movement
  • Explicit Data Movement in USM
  • Implicit Data Movement in USM
  • Buffers
  • Creating Buffers
  • Accessing Buffers
  • Access Modes
  • Ordering the Uses of Data
  • In-order Queues
  • Out-of-Order Queues
  • Explicit Dependences with Events
  • Implicit Dependences with Accessors
  • Choosing a Data Management Strategy
  • Handler Class: Key Members
  • Summary
  • Chapter 4: Expressing Parallelism
  • Parallelism Within Kernels
  • Loops vs. Kernels
  • Multidimensional Kernels
  • Overview of Language Features
  • Separating Kernels from Host Code
  • Different Forms of Parallel Kernels
  • Basic Data-Parallel Kernels
  • Understanding Basic Data-Parallel Kernels
  • Writing Basic Data-Parallel Kernels
  • Details of Basic Data-Parallel Kernels
  • The range Class
  • The id Class
  • The item Class
  • Explicit ND-Range Kernels
  • Understanding Explicit ND-Range Parallel Kernels
  • Work-Items
  • Work-Groups
  • Sub-Groups
  • Writing Explicit ND-Range Data-Parallel Kernels
  • Details of Explicit ND-Range Data-Parallel Kernels
  • The nd_range Class
  • The nd_item Class
  • The group Class
  • The sub_group Class
  • Mapping Computation to Work-Items
  • One-to-One Mapping
  • Many-to-One Mapping
  • Choosing a Kernel Form
  • Summary
  • Chapter 5: Error Handling
  • Safety First
  • Types of Errors
  • Let's Create Some Errors!
  • Synchronous Error
  • Asynchronous Error
  • Application Error Handling Strategy
  • Ignoring Error Handling
  • Synchronous Error Handling
  • Asynchronous Error Handling
  • The Asynchronous Handler
  • Invocation of the Handler
  • Errors on a Device
  • Summary
  • Chapter 6: Unified Shared Memory
  • Why Should We Use USM?
  • Allocation Types
  • Device Allocations
  • Host Allocations
  • Shared Allocations
  • Allocating Memory
  • What Do We Need to Know?
  • Multiple Styles
  • Allocations à la C
  • Allocations à la C++
  • C++ Allocators
  • Deallocating Memory
  • Allocation Example
  • Data Management
  • Initialization
  • Data Movement
  • Explicit
  • Implicit
  • Migration
  • Fine-Grained Control
  • Queries
  • One More Thing
  • Summary
  • Chapter 7: Buffers
  • Buffers
  • Buffer Creation
  • Buffer Properties
  • use_host_ptr
  • use_mutex
  • context_bound
  • What Can We Do with a Buffer?
  • Accessors
  • Accessor Creation
  • What Can We Do with an Accessor?
  • Summary
  • Chapter 8: Scheduling Kernels and Data Movement
  • What Is Graph Scheduling?
  • How Graphs Work in SYCL
  • Command Group Actions
  • How Command Groups Declare Dependences
  • Examples
  • When Are the Parts of a Command Group Executed?
  • Data Movement
  • Explicit Data Movement
  • Implicit Data Movement
  • Synchronizing with the Host
  • Summary
  • Chapter 9: Communication and Synchronization
  • Work-Groups and Work-Items
  • Building Blocks for Efficient Communication
  • Synchronization via Barriers
  • Work-Group Local Memory
  • Using Work-Group Barriers and Local Memory
  • Work-Group Barriers and Local Memory in ND-Range Kernels
  • Local Accessors
  • Synchronization Functions
  • A Full ND-Range Kernel Example
  • Sub-Groups
  • Synchronization via Sub-Group Barriers
  • Exchanging Data Within a Sub-Group
  • A Full Sub-Group ND-Range Kernel Example
  • Group Functions and Group Algorithms
  • Broadcast
  • Votes
  • Shuffles
  • Summary
  • Chapter 10: Defining Kernels
  • Why Three Ways to Represent a Kernel?
  • Kernels as Lambda Expressions
  • Elements of a Kernel Lambda Expression
  • Identifying Kernel Lambda Expressions
  • Kernels as Named Function Objects
  • Elements of a Kernel Named Function Object
  • Kernels in Kernel Bundles
  • Interoperability with Other APIs
  • Summary
  • Chapter 11: Vectors and Math Arrays
  • The Ambiguity of Vector Types
  • Our Mental Model for SYCL Vector Types
  • Math Array (marray)
  • Vector (vec)
  • Loads and Stores
  • Interoperability with Backend-Native Vector Types
  • Swizzle Operations
  • How Vector Types Execute
  • Vectors as Convenience Types
  • Vectors as SIMD Types
  • Summary
  • Chapter 12: Device Information and Kernel Specialization
  • Is There a GPU Present?
  • Refining Kernel Code to Be More Prescriptive
  • How to Enumerate Devices and Capabilities
  • Aspects
  • Custom Device Selector
  • Being Curious: get_info<>
  • Being More Curious: Detailed Enumeration Code
  • Very Curious: get_info plus has()
  • Device Information Descriptors
  • Device-Specific Kernel Information Descriptors
  • The Specifics: Those of "Correctness"
  • Device Queries
  • Kernel Queries
  • The Specifics: Those of "Tuning/Optimization"
  • Device Queries
  • Kernel Queries
  • Runtime vs. Compile-Time Properties
  • Kernel Specialization
  • Summary
  • Chapter 13: Practical Tips
  • Getting the Code Samples and a Compiler
  • Online Resources
  • Platform Model
  • Multiarchitecture Binaries
  • Compilation Model
  • Contexts: Important Things to Know
  • Adding SYCL to Existing C++ Programs
  • Considerations When Using Multiple Compilers
  • Debugging
  • Debugging Deadlock and Other Synchronization Issues
  • Debugging Kernel Code
  • Debugging Runtime Failures
  • Queue Profiling and Resulting Timing Capabilities
  • Tracing and Profiling Tools Interfaces
  • Initializing Data and Accessing Kernel Outputs
  • Multiple Translation Units
  • Performance Implication of Multiple Translation Units
  • When Anonymous Lambdas Need Names
  • Summary
  • Chapter 14: Common Parallel Patterns
  • Understanding the Patterns
  • Map
  • Stencil
  • Reduction
  • Scan
  • Pack and Unpack
  • Pack
  • Unpack
  • Using Built-In Functions and Libraries
  • The SYCL Reduction Library
  • The reduction Class
  • The reducer Class
  • User-Defined Reductions
  • Group Algorithms
  • Direct Programming
  • Map
  • Stencil
  • Reduction
  • Scan
  • Pack and Unpack
  • Pack
  • Unpack
  • Summary
  • For More Information
  • Chapter 15: Programming for GPUs
  • Performance Caveats
  • How GPUs Work
  • GPU Building Blocks
  • Simpler Processors (but More of Them)
  • Expressing Parallelism
  • Expressing More Parallelism
  • Simplified Control Logic (SIMD Instructions)
  • Predication and Masking
  • SIMD Efficiency
  • SIMD Efficiency and Groups of Items
  • Switching Work to Hide Latency
  • Offloading Kernels to GPUs
  • SYCL Runtime Library
  • GPU Software Drivers
  • GPU Hardware
  • Beware the Cost of Offloading!
  • Transfers to and from Device Memory
  • GPU Kernel Best Practices
  • Accessing Global Memory
  • Accessing Work-Group Local Memory
  • Avoiding Local Memory Entirely with Sub-Groups
  • Optimizing Computation Using Small Data Types
  • Optimizing Math Functions
  • Specialized Functions and Extensions
  • Summary
  • For More Information
  • Chapter 16: Programming for CPUs
  • Performance Caveats
  • The Basics of Multicore CPUs
  • The Basics of SIMD Hardware
  • Exploiting Thread-Level Parallelism
  • Thread Affinity Insight
  • Be Mindful of First Touch to Memory
  • SIMD Vectorization on CPU
  • Ensure SIMD Execution Legality
  • SIMD Masking and Cost
  • Avoid Array of Struct for SIMD Efficiency
  • Data Type Impact on SIMD Efficiency
  • SIMD Execution Using single_task
  • Summary
  • Chapter 17: Programming for FPGAs
  • Performance Caveats
  • How to Think About FPGAs
  • Pipeline Parallelism
  • Kernels Consume Chip "Area"
  • When to Use an FPGA
  • Lots and Lots of Work
  • Custom Operations or Operation Widths
  • Scalar Data Flow
  • Low Latency and Rich Connectivity
  • Customized Memory Systems
  • Running on an FPGA
  • Compile Times
  • The FPGA Emulator
  • FPGA Hardware Compilation Occurs "Ahead-of-Time"
  • Writing Kernels for FPGAs
  • Exposing Parallelism
  • Keeping the Pipeline Busy Using ND-Ranges