Data Parallel C++ : Programming Accelerated Systems Using C++ and SYCL.
Place / Publishing House: Berkeley, CA : Apress L. P., 2023. ©2023.
Year of Publication: 2023
Edition: 2nd ed.
Language: English
Physical Description: 1 online resource (648 pages)
Table of Contents:
- Intro
- Table of Contents
- About the Authors
- Preface
- Foreword
- Acknowledgments
- Chapter 1: Introduction
- Read the Book, Not the Spec
- SYCL 2020 and DPC++
- Why Not CUDA?
- Why Standard C++ with SYCL?
- Getting a C++ Compiler with SYCL Support
- Hello, World! and a SYCL Program Dissection
- Queues and Actions
- It Is All About Parallelism
- Throughput
- Latency
- Think Parallel
- Amdahl and Gustafson
- Scaling
- Heterogeneous Systems
- Data-Parallel Programming
- Key Attributes of C++ with SYCL
- Single-Source
- Host
- Devices
- Sharing Devices
- Kernel Code
- Kernel: Vector Addition (DAXPY)
- Asynchronous Execution
- Race Conditions When We Make a Mistake
- Deadlock
- C++ Lambda Expressions
- Functional Portability and Performance Portability
- Concurrency vs. Parallelism
- Summary
- Chapter 2: Where Code Executes
- Single-Source
- Host Code
- Device Code
- Choosing Devices
- Method #1: Run on a Device of Any Type
- Queues
- Binding a Queue to a Device When Any Device Will Do
- Method #2: Using a CPU Device for Development, Debugging, and Deployment
- Method #3: Using a GPU (or Other Accelerators)
- Accelerator Devices
- Device Selectors
- When Device Selection Fails
- Method #4: Using Multiple Devices
- Method #5: Custom (Very Specific) Device Selection
- Selection Based on Device Aspects
- Selection Through a Custom Selector
- Mechanisms to Score a Device
- Creating Work on a Device
- Introducing the Task Graph
- Where Is the Device Code?
- Actions
- Host Tasks
- Summary
- Chapter 3: Data Management
- Introduction
- The Data Management Problem
- Device Local vs. Device Remote
- Managing Multiple Memories
- Explicit Data Movement
- Implicit Data Movement
- Selecting the Right Strategy
- USM, Buffers, and Images
- Unified Shared Memory
- Accessing Memory Through Pointers
- USM and Data Movement
- Explicit Data Movement in USM
- Implicit Data Movement in USM
- Buffers
- Creating Buffers
- Accessing Buffers
- Access Modes
- Ordering the Uses of Data
- In-order Queues
- Out-of-Order Queues
- Explicit Dependences with Events
- Implicit Dependences with Accessors
- Choosing a Data Management Strategy
- Handler Class: Key Members
- Summary
- Chapter 4: Expressing Parallelism
- Parallelism Within Kernels
- Loops vs. Kernels
- Multidimensional Kernels
- Overview of Language Features
- Separating Kernels from Host Code
- Different Forms of Parallel Kernels
- Basic Data-Parallel Kernels
- Understanding Basic Data-Parallel Kernels
- Writing Basic Data-Parallel Kernels
- Details of Basic Data-Parallel Kernels
- The range Class
- The id Class
- The item Class
- Explicit ND-Range Kernels
- Understanding Explicit ND-Range Parallel Kernels
- Work-Items
- Work-Groups
- Sub-Groups
- Writing Explicit ND-Range Data-Parallel Kernels
- Details of Explicit ND-Range Data-Parallel Kernels
- The nd_range Class
- The nd_item Class
- The group Class
- The sub_group Class
- Mapping Computation to Work-Items
- One-to-One Mapping
- Many-to-One Mapping
- Choosing a Kernel Form
- Summary
- Chapter 5: Error Handling
- Safety First
- Types of Errors
- Let's Create Some Errors!
- Synchronous Error
- Asynchronous Error
- Application Error Handling Strategy
- Ignoring Error Handling
- Synchronous Error Handling
- Asynchronous Error Handling
- The Asynchronous Handler
- Invocation of the Handler
- Errors on a Device
- Summary
- Chapter 6: Unified Shared Memory
- Why Should We Use USM?
- Allocation Types
- Device Allocations
- Host Allocations
- Shared Allocations
- Allocating Memory
- What Do We Need to Know?
- Multiple Styles
- Allocations à la C
- Allocations à la C++
- C++ Allocators
- Deallocating Memory
- Allocation Example
- Data Management
- Initialization
- Data Movement
- Explicit
- Implicit
- Migration
- Fine-Grained Control
- Queries
- One More Thing
- Summary
- Chapter 7: Buffers
- Buffers
- Buffer Creation
- Buffer Properties
- use_host_ptr
- use_mutex
- context_bound
- What Can We Do with a Buffer?
- Accessors
- Accessor Creation
- What Can We Do with an Accessor?
- Summary
- Chapter 8: Scheduling Kernels and Data Movement
- What Is Graph Scheduling?
- How Graphs Work in SYCL
- Command Group Actions
- How Command Groups Declare Dependences
- Examples
- When Are the Parts of a Command Group Executed?
- Data Movement
- Explicit Data Movement
- Implicit Data Movement
- Synchronizing with the Host
- Summary
- Chapter 9: Communication and Synchronization
- Work-Groups and Work-Items
- Building Blocks for Efficient Communication
- Synchronization via Barriers
- Work-Group Local Memory
- Using Work-Group Barriers and Local Memory
- Work-Group Barriers and Local Memory in ND-Range Kernels
- Local Accessors
- Synchronization Functions
- A Full ND-Range Kernel Example
- Sub-Groups
- Synchronization via Sub-Group Barriers
- Exchanging Data Within a Sub-Group
- A Full Sub-Group ND-Range Kernel Example
- Group Functions and Group Algorithms
- Broadcast
- Votes
- Shuffles
- Summary
- Chapter 10: Defining Kernels
- Why Three Ways to Represent a Kernel?
- Kernels as Lambda Expressions
- Elements of a Kernel Lambda Expression
- Identifying Kernel Lambda Expressions
- Kernels as Named Function Objects
- Elements of a Kernel Named Function Object
- Kernels in Kernel Bundles
- Interoperability with Other APIs
- Summary
- Chapter 11: Vectors and Math Arrays
- The Ambiguity of Vector Types
- Our Mental Model for SYCL Vector Types
- Math Array (marray)
- Vector (vec)
- Loads and Stores
- Interoperability with Backend-Native Vector Types
- Swizzle Operations
- How Vector Types Execute
- Vectors as Convenience Types
- Vectors as SIMD Types
- Summary
- Chapter 12: Device Information and Kernel Specialization
- Is There a GPU Present?
- Refining Kernel Code to Be More Prescriptive
- How to Enumerate Devices and Capabilities
- Aspects
- Custom Device Selector
- Being Curious: get_info<>
- Being More Curious: Detailed Enumeration Code
- Very Curious: get_info plus has()
- Device Information Descriptors
- Device-Specific Kernel Information Descriptors
- The Specifics: Those of "Correctness"
- Device Queries
- Kernel Queries
- The Specifics: Those of "Tuning/Optimization"
- Device Queries
- Kernel Queries
- Runtime vs. Compile-Time Properties
- Kernel Specialization
- Summary
- Chapter 13: Practical Tips
- Getting the Code Samples and a Compiler
- Online Resources
- Platform Model
- Multiarchitecture Binaries
- Compilation Model
- Contexts: Important Things to Know
- Adding SYCL to Existing C++ Programs
- Considerations When Using Multiple Compilers
- Debugging
- Debugging Deadlock and Other Synchronization Issues
- Debugging Kernel Code
- Debugging Runtime Failures
- Queue Profiling and Resulting Timing Capabilities
- Tracing and Profiling Tools Interfaces
- Initializing Data and Accessing Kernel Outputs
- Multiple Translation Units
- Performance Implication of Multiple Translation Units
- When Anonymous Lambdas Need Names
- Summary
- Chapter 14: Common Parallel Patterns
- Understanding the Patterns
- Map
- Stencil
- Reduction
- Scan
- Pack and Unpack
- Pack
- Unpack
- Using Built-In Functions and Libraries
- The SYCL Reduction Library
- The reduction Class
- The reducer Class
- User-Defined Reductions
- Group Algorithms
- Direct Programming
- Map
- Stencil
- Reduction
- Scan
- Pack and Unpack
- Pack
- Unpack
- Summary
- For More Information
- Chapter 15: Programming for GPUs
- Performance Caveats
- How GPUs Work
- GPU Building Blocks
- Simpler Processors (but More of Them)
- Expressing Parallelism
- Expressing More Parallelism
- Simplified Control Logic (SIMD Instructions)
- Predication and Masking
- SIMD Efficiency
- SIMD Efficiency and Groups of Items
- Switching Work to Hide Latency
- Offloading Kernels to GPUs
- SYCL Runtime Library
- GPU Software Drivers
- GPU Hardware
- Beware the Cost of Offloading!
- Transfers to and from Device Memory
- GPU Kernel Best Practices
- Accessing Global Memory
- Accessing Work-Group Local Memory
- Avoiding Local Memory Entirely with Sub-Groups
- Optimizing Computation Using Small Data Types
- Optimizing Math Functions
- Specialized Functions and Extensions
- Summary
- For More Information
- Chapter 16: Programming for CPUs
- Performance Caveats
- The Basics of Multicore CPUs
- The Basics of SIMD Hardware
- Exploiting Thread-Level Parallelism
- Thread Affinity Insight
- Be Mindful of First Touch to Memory
- SIMD Vectorization on CPU
- Ensure SIMD Execution Legality
- SIMD Masking and Cost
- Avoid Array of Struct for SIMD Efficiency
- Data Type Impact on SIMD Efficiency
- SIMD Execution Using single_task
- Summary
- Chapter 17: Programming for FPGAs
- Performance Caveats
- How to Think About FPGAs
- Pipeline Parallelism
- Kernels Consume Chip "Area"
- When to Use an FPGA
- Lots and Lots of Work
- Custom Operations or Operation Widths
- Scalar Data Flow
- Low Latency and Rich Connectivity
- Customized Memory Systems
- Running on an FPGA
- Compile Times
- The FPGA Emulator
- FPGA Hardware Compilation Occurs "Ahead-of-Time"
- Writing Kernels for FPGAs
- Exposing Parallelism
- Keeping the Pipeline Busy Using ND-Ranges