Data Parallel C++ : Programming Accelerated Systems Using C++ and SYCL.
Place / Publishing House: Berkeley, CA : Apress L. P., 2023. ©2023.
Year of Publication: 2023
Edition: 2nd ed.
Language: English
Physical Description: 1 online resource (648 pages)
Table of Contents:
- Intro
- Table of Contents
- About the Authors
- Preface
- Foreword
- Acknowledgments
- Chapter 1: Introduction
- Read the Book, Not the Spec
- SYCL 2020 and DPC++
- Why Not CUDA?
- Why Standard C++ with SYCL?
- Getting a C++ Compiler with SYCL Support
- Hello, World! and a SYCL Program Dissection
- Queues and Actions
- It Is All About Parallelism
- Throughput
- Latency
- Think Parallel
- Amdahl and Gustafson
- Scaling
- Heterogeneous Systems
- Data-Parallel Programming
- Key Attributes of C++ with SYCL
- Single-Source
- Host
- Devices
- Sharing Devices
- Kernel Code
- Kernel: Vector Addition (DAXPY)
- Asynchronous Execution
- Race Conditions When We Make a Mistake
- Deadlock
- C++ Lambda Expressions
- Functional Portability and Performance Portability
- Concurrency vs. Parallelism
- Summary
- Chapter 2: Where Code Executes
- Single-Source
- Host Code
- Device Code
- Choosing Devices
- Method #1: Run on a Device of Any Type
- Queues
- Binding a Queue to a Device When Any Device Will Do
- Method #2: Using a CPU Device for Development, Debugging, and Deployment
- Method #3: Using a GPU (or Other Accelerators)
- Accelerator Devices
- Device Selectors
- When Device Selection Fails
- Method #4: Using Multiple Devices
- Method #5: Custom (Very Specific) Device Selection
- Selection Based on Device Aspects
- Selection Through a Custom Selector
- Mechanisms to Score a Device
- Creating Work on a Device
- Introducing the Task Graph
- Where Is the Device Code?
- Actions
- Host Tasks
- Summary
- Chapter 3: Data Management
- Introduction
- The Data Management Problem
- Device Local vs. Device Remote
- Managing Multiple Memories
- Explicit Data Movement
- Implicit Data Movement
- Selecting the Right Strategy
- USM, Buffers, and Images
- Unified Shared Memory
- Accessing Memory Through Pointers
- USM and Data Movement
- Explicit Data Movement in USM
- Implicit Data Movement in USM
- Buffers
- Creating Buffers
- Accessing Buffers
- Access Modes
- Ordering the Uses of Data
- In-order Queues
- Out-of-Order Queues
- Explicit Dependences with Events
- Implicit Dependences with Accessors
- Choosing a Data Management Strategy
- Handler Class: Key Members
- Summary
- Chapter 4: Expressing Parallelism
- Parallelism Within Kernels
- Loops vs. Kernels
- Multidimensional Kernels
- Overview of Language Features
- Separating Kernels from Host Code
- Different Forms of Parallel Kernels
- Basic Data-Parallel Kernels
- Understanding Basic Data-Parallel Kernels
- Writing Basic Data-Parallel Kernels
- Details of Basic Data-Parallel Kernels
- The range Class
- The id Class
- The item Class
- Explicit ND-Range Kernels
- Understanding Explicit ND-Range Parallel Kernels
- Work-Items
- Work-Groups
- Sub-Groups
- Writing Explicit ND-Range Data-Parallel Kernels
- Details of Explicit ND-Range Data-Parallel Kernels
- The nd_range Class
- The nd_item Class
- The group Class
- The sub_group Class
- Mapping Computation to Work-Items
- One-to-One Mapping
- Many-to-One Mapping
- Choosing a Kernel Form
- Summary
- Chapter 5: Error Handling
- Safety First
- Types of Errors
- Let's Create Some Errors!
- Synchronous Error
- Asynchronous Error
- Application Error Handling Strategy
- Ignoring Error Handling
- Synchronous Error Handling
- Asynchronous Error Handling
- The Asynchronous Handler
- Invocation of the Handler
- Errors on a Device
- Summary
- Chapter 6: Unified Shared Memory
- Why Should We Use USM?
- Allocation Types
- Device Allocations
- Host Allocations
- Shared Allocations
- Allocating Memory
- What Do We Need to Know?
- Multiple Styles
- Allocations à la C
- Allocations à la C++
- C++ Allocators
- Deallocating Memory
- Allocation Example
- Data Management
- Initialization
- Data Movement
- Explicit
- Implicit
- Migration
- Fine-Grained Control
- Queries
- One More Thing
- Summary
- Chapter 7: Buffers
- Buffers
- Buffer Creation
- Buffer Properties
- use_host_ptr
- use_mutex
- context_bound
- What Can We Do with a Buffer?
- Accessors
- Accessor Creation
- What Can We Do with an Accessor?
- Summary
- Chapter 8: Scheduling Kernels and Data Movement
- What Is Graph Scheduling?
- How Graphs Work in SYCL
- Command Group Actions
- How Command Groups Declare Dependences
- Examples
- When Are the Parts of a Command Group Executed?
- Data Movement
- Explicit Data Movement
- Implicit Data Movement
- Synchronizing with the Host
- Summary
- Chapter 9: Communication and Synchronization
- Work-Groups and Work-Items
- Building Blocks for Efficient Communication
- Synchronization via Barriers
- Work-Group Local Memory
- Using Work-Group Barriers and Local Memory
- Work-Group Barriers and Local Memory in ND-Range Kernels
- Local Accessors
- Synchronization Functions
- A Full ND-Range Kernel Example
- Sub-Groups
- Synchronization via Sub-Group Barriers
- Exchanging Data Within a Sub-Group
- A Full Sub-Group ND-Range Kernel Example
- Group Functions and Group Algorithms
- Broadcast
- Votes
- Shuffles
- Summary
- Chapter 10: Defining Kernels
- Why Three Ways to Represent a Kernel?
- Kernels as Lambda Expressions
- Elements of a Kernel Lambda Expression
- Identifying Kernel Lambda Expressions
- Kernels as Named Function Objects
- Elements of a Kernel Named Function Object
- Kernels in Kernel Bundles
- Interoperability with Other APIs
- Summary
- Chapter 11: Vectors and Math Arrays
- The Ambiguity of Vector Types
- Our Mental Model for SYCL Vector Types
- Math Array (marray)
- Vector (vec)
- Loads and Stores
- Interoperability with Backend-Native Vector Types
- Swizzle Operations
- How Vector Types Execute
- Vectors as Convenience Types
- Vectors as SIMD Types
- Summary
- Chapter 12: Device Information and Kernel Specialization
- Is There a GPU Present?
- Refining Kernel Code to Be More Prescriptive
- How to Enumerate Devices and Capabilities
- Aspects
- Custom Device Selector
- Being Curious: get_info<>
- Being More Curious: Detailed Enumeration Code
- Very Curious: get_info plus has()
- Device Information Descriptors
- Device-Specific Kernel Information Descriptors
- The Specifics: Those of "Correctness"
- Device Queries
- Kernel Queries
- The Specifics: Those of "Tuning/Optimization"
- Device Queries
- Kernel Queries
- Runtime vs. Compile-Time Properties
- Kernel Specialization
- Summary
- Chapter 13: Practical Tips
- Getting the Code Samples and a Compiler
- Online Resources
- Platform Model
- Multiarchitecture Binaries
- Compilation Model
- Contexts: Important Things to Know
- Adding SYCL to Existing C++ Programs
- Considerations When Using Multiple Compilers
- Debugging
- Debugging Deadlock and Other Synchronization Issues
- Debugging Kernel Code
- Debugging Runtime Failures
- Queue Profiling and Resulting Timing Capabilities
- Tracing and Profiling Tools Interfaces
- Initializing Data and Accessing Kernel Outputs
- Multiple Translation Units
- Performance Implication of Multiple Translation Units
- When Anonymous Lambdas Need Names
- Summary
- Chapter 14: Common Parallel Patterns
- Understanding the Patterns
- Map
- Stencil
- Reduction
- Scan
- Pack and Unpack
- Pack
- Unpack
- Using Built-In Functions and Libraries
- The SYCL Reduction Library
- The reduction Class
- The reducer Class
- User-Defined Reductions
- Group Algorithms
- Direct Programming
- Map
- Stencil
- Reduction
- Scan
- Pack and Unpack
- Pack
- Unpack
- Summary
- For More Information
- Chapter 15: Programming for GPUs
- Performance Caveats
- How GPUs Work
- GPU Building Blocks
- Simpler Processors (but More of Them)
- Expressing Parallelism
- Expressing More Parallelism
- Simplified Control Logic (SIMD Instructions)
- Predication and Masking
- SIMD Efficiency
- SIMD Efficiency and Groups of Items
- Switching Work to Hide Latency
- Offloading Kernels to GPUs
- SYCL Runtime Library
- GPU Software Drivers
- GPU Hardware
- Beware the Cost of Offloading!
- Transfers to and from Device Memory
- GPU Kernel Best Practices
- Accessing Global Memory
- Accessing Work-Group Local Memory
- Avoiding Local Memory Entirely with Sub-Groups
- Optimizing Computation Using Small Data Types
- Optimizing Math Functions
- Specialized Functions and Extensions
- Summary
- For More Information
- Chapter 16: Programming for CPUs
- Performance Caveats
- The Basics of Multicore CPUs
- The Basics of SIMD Hardware
- Exploiting Thread-Level Parallelism
- Thread Affinity Insight
- Be Mindful of First Touch to Memory
- SIMD Vectorization on CPU
- Ensure SIMD Execution Legality
- SIMD Masking and Cost
- Avoid Array of Struct for SIMD Efficiency
- Data Type Impact on SIMD Efficiency
- SIMD Execution Using single_task
- Summary
- Chapter 17: Programming for FPGAs
- Performance Caveats
- How to Think About FPGAs
- Pipeline Parallelism
- Kernels Consume Chip "Area"
- When to Use an FPGA
- Lots and Lots of Work
- Custom Operations or Operation Widths
- Scalar Data Flow
- Low Latency and Rich Connectivity
- Customized Memory Systems
- Running on an FPGA
- Compile Times
- The FPGA Emulator
- FPGA Hardware Compilation Occurs "Ahead-of-Time"
- Writing Kernels for FPGAs
- Exposing Parallelism
- Keeping the Pipeline Busy Using ND-Ranges