Data Parallel C++ : Programming Accelerated Systems Using C++ and SYCL.

Bibliographic Details
Contributor:
Place / Publishing House: Berkeley, CA : Apress L. P., 2023.
©2023.
Year of Publication:2023
Edition:2nd ed.
Language:English
Online Access:
Physical Description:1 online resource (648 pages)
id 50030882798
ctrlnum (MiAaPQ)50030882798
(Au-PeEL)EBL30882798
(OCoLC)1403550971
collection bib_alma
record_format marc
spelling Reinders, James.
Data Parallel C++ : Programming Accelerated Systems Using C++ and SYCL.
2nd ed.
Berkeley, CA : Apress L. P., 2023.
©2023.
1 online resource (648 pages)
text txt rdacontent
computer c rdamedia
online resource cr rdacarrier
Intro -- Table of Contents -- About the Authors -- Preface -- Foreword -- Acknowledgments -- Chapter 1: Introduction -- Read the Book, Not the Spec -- SYCL 2020 and DPC++ -- Why Not CUDA? -- Why Standard C++ with SYCL? -- Getting a C++ Compiler with SYCL Support -- Hello, World! and a SYCL Program Dissection -- Queues and Actions -- It Is All About Parallelism -- Throughput -- Latency -- Think Parallel -- Amdahl and Gustafson -- Scaling -- Heterogeneous Systems -- Data-Parallel Programming -- Key Attributes of C++ with SYCL -- Single-Source -- Host -- Devices -- Sharing Devices -- Kernel Code -- Kernel: Vector Addition (DAXPY) -- Asynchronous Execution -- Race Conditions When We Make a Mistake -- Deadlock -- C++ Lambda Expressions -- Functional Portability and Performance Portability -- Concurrency vs. Parallelism -- Summary -- Chapter 2: Where Code Executes -- Single-Source -- Host Code -- Device Code -- Choosing Devices -- Method#1: Run on a Device of Any Type -- Queues -- Binding a Queue to a Device When Any Device Will Do -- Method#2: Using a CPU Device for Development, Debugging, and Deployment -- Method#3: Using a GPU (or Other Accelerators) -- Accelerator Devices -- Device Selectors -- When Device Selection Fails -- Method#4: Using Multiple Devices -- Method#5: Custom (Very Specific) Device Selection -- Selection Based on Device Aspects -- Selection Through a Custom Selector -- Mechanisms to Score a Device -- Creating Work on a Device -- Introducing the Task Graph -- Where Is the Device Code? -- Actions -- Host tasks -- Summary -- Chapter 3: Data Management -- Introduction -- The Data Management Problem -- Device Local vs. Device Remote -- Managing Multiple Memories -- Explicit Data Movement -- Implicit Data Movement -- Selecting the Right Strategy -- USM, Buffers, and Images -- Unified Shared Memory -- Accessing Memory Through Pointers.
USM and Data Movement -- Explicit Data Movement in USM -- Implicit Data Movement in USM -- Buffers -- Creating Buffers -- Accessing Buffers -- Access Modes -- Ordering the Uses of Data -- In-order Queues -- Out-of-Order Queues -- Explicit Dependences with Events -- Implicit Dependences with Accessors -- Choosing a Data Management Strategy -- Handler Class: Key Members -- Summary -- Chapter 4: Expressing Parallelism -- Parallelism Within Kernels -- Loops vs. Kernels -- Multidimensional Kernels -- Overview of Language Features -- Separating Kernels from Host Code -- Different Forms of Parallel Kernels -- Basic Data-Parallel Kernels -- Understanding Basic Data-Parallel Kernels -- Writing Basic Data-Parallel Kernels -- Details of Basic Data-Parallel Kernels -- The range Class -- The id Class -- The item Class -- Explicit ND-Range Kernels -- Understanding Explicit ND-Range Parallel Kernels -- Work-Items -- Work-Groups -- Sub-Groups -- Writing Explicit ND-Range Data-Parallel Kernels -- Details of Explicit ND-Range Data-Parallel Kernels -- The nd_range Class -- The nd_item Class -- The group Class -- The sub_group Class -- Mapping Computation to Work-Items -- One-to-One Mapping -- Many-to-One Mapping -- Choosing a Kernel Form -- Summary -- Chapter 5: Error Handling -- Safety First -- Types of Errors -- Let's Create Some Errors! -- Synchronous Error -- Asynchronous Error -- Application Error Handling Strategy -- Ignoring Error Handling -- Synchronous Error Handling -- Asynchronous Error Handling -- The Asynchronous Handler -- Invocation of the Handler -- Errors on a Device -- Summary -- Chapter 6: Unified Shared Memory -- Why Should We Use USM? -- Allocation Types -- Device Allocations -- Host Allocations -- Shared Allocations -- Allocating Memory -- What Do We Need to Know? -- Multiple Styles -- Allocations à la C -- Allocations à la C++ -- C++ Allocators.
Deallocating Memory -- Allocation Example -- Data Management -- Initialization -- Data Movement -- Explicit -- Implicit -- Migration -- Fine-Grained Control -- Queries -- One More Thing -- Summary -- Chapter 7: Buffers -- Buffers -- Buffer Creation -- Buffer Properties -- use_host_ptr -- use_mutex -- context_bound -- What Can We Do with a Buffer? -- Accessors -- Accessor Creation -- What Can We Do with an Accessor? -- Summary -- Chapter 8: Scheduling Kernels and Data Movement -- What Is Graph Scheduling? -- How Graphs Work in SYCL -- Command Group Actions -- How Command Groups Declare Dependences -- Examples -- When Are the Parts of a Command Group Executed? -- Data Movement -- Explicit Data Movement -- Implicit Data Movement -- Synchronizing with the Host -- Summary -- Chapter 9: Communication and Synchronization -- Work-Groups and Work-Items -- Building Blocks for Efficient Communication -- Synchronization via Barriers -- Work-Group Local Memory -- Using Work-Group Barriers and Local Memory -- Work-Group Barriers and Local Memory in ND-Range Kernels -- Local Accessors -- Synchronization Functions -- A Full ND-Range Kernel Example -- Sub-Groups -- Synchronization via Sub-Group Barriers -- Exchanging Data Within a Sub-Group -- A Full Sub-Group ND-Range Kernel Example -- Group Functions and Group Algorithms -- Broadcast -- Votes -- Shuffles -- Summary -- Chapter 10: Defining Kernels -- Why Three Ways to Represent a Kernel? -- Kernels as Lambda Expressions -- Elements of a Kernel Lambda Expression -- Identifying Kernel Lambda Expressions -- Kernels as Named Function Objects -- Elements of a Kernel Named Function Object -- Kernels in Kernel Bundles -- Interoperability with Other APIs -- Summary -- Chapter 11: Vectors and Math Arrays -- The Ambiguity of Vector Types -- Our Mental Model for SYCL Vector Types -- Math Array (marray) -- Vector (vec).
Loads and Stores -- Interoperability with Backend-Native Vector Types -- Swizzle Operations -- How Vector Types Execute -- Vectors as Convenience Types -- Vectors as SIMD Types -- Summary -- Chapter 12: Device Information and Kernel Specialization -- Is There a GPU Present? -- Refining Kernel Code to Be More Prescriptive -- How to Enumerate Devices and Capabilities -- Aspects -- Custom Device Selector -- Being Curious: get_info<> -- Being More Curious: Detailed Enumeration Code -- Very Curious: get_info plus has() -- Device Information Descriptors -- Device-Specific Kernel Information Descriptors -- The Specifics: Those of "Correctness" -- Device Queries -- Kernel Queries -- The Specifics: Those of "Tuning/Optimization" -- Device Queries -- Kernel Queries -- Runtime vs. Compile-Time Properties -- Kernel Specialization -- Summary -- Chapter 13: Practical Tips -- Getting the Code Samples and a Compiler -- Online Resources -- Platform Model -- Multiarchitecture Binaries -- Compilation Model -- Contexts: Important Things to Know -- Adding SYCL to Existing C++ Programs -- Considerations When Using Multiple Compilers -- Debugging -- Debugging Deadlock and Other Synchronization Issues -- Debugging Kernel Code -- Debugging Runtime Failures -- Queue Profiling and Resulting Timing Capabilities -- Tracing and Profiling Tools Interfaces -- Initializing Data and Accessing Kernel Outputs -- Multiple Translation Units -- Performance Implication of Multiple Translation Units -- When Anonymous Lambdas Need Names -- Summary -- Chapter 14: Common Parallel Patterns -- Understanding the Patterns -- Map -- Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Using Built-In Functions and Libraries -- The SYCL Reduction Library -- The reduction Class -- The reducer Class -- User-Defined Reductions -- Group Algorithms -- Direct Programming -- Map.
Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Summary -- For More Information -- Chapter 15: Programming for GPUs -- Performance Caveats -- How GPUs Work -- GPU Building Blocks -- Simpler Processors (but More of Them) -- Expressing Parallelism -- Expressing More Parallelism -- Simplified Control Logic (SIMD Instructions) -- Predication and Masking -- SIMD Efficiency -- SIMD Efficiency and Groups of Items -- Switching Work to Hide Latency -- Offloading Kernels to GPUs -- SYCL Runtime Library -- GPU Software Drivers -- GPU Hardware -- Beware the Cost of Offloading! -- Transfers to and from Device Memory -- GPU Kernel Best Practices -- Accessing Global Memory -- Accessing Work-Group Local Memory -- Avoiding Local Memory Entirely with Sub-Groups -- Optimizing Computation Using Small Data Types -- Optimizing Math Functions -- Specialized Functions and Extensions -- Summary -- For More Information -- Chapter 16: Programming for CPUs -- Performance Caveats -- The Basics of Multicore CPUs -- The Basics of SIMD Hardware -- Exploiting Thread-Level Parallelism -- Thread Affinity Insight -- Be Mindful of First Touch to Memory -- SIMD Vectorization on CPU -- Ensure SIMD Execution Legality -- SIMD Masking and Cost -- Avoid Array of Struct for SIMD Efficiency -- Data Type Impact on SIMD Efficiency -- SIMD Execution Using single_task -- Summary -- Chapter 17: Programming for  FPGAs -- Performance Caveats -- How to Think About FPGAs -- Pipeline Parallelism -- Kernels Consume Chip "Area" -- When to Use an FPGA -- Lots and Lots of Work -- Custom Operations or Operation Widths -- Scalar Data Flow -- Low Latency and Rich Connectivity -- Customized Memory Systems -- Running on an FPGA -- Compile Times -- The FPGA Emulator -- FPGA Hardware Compilation Occurs "Ahead-of-Time" -- Writing Kernels for FPGAs -- Exposing Parallelism.
Keeping the Pipeline Busy Using ND-Ranges.
Description based on publisher supplied metadata and other sources.
Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2024. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.
Electronic books.
Ashbaugh, Ben.
Brodman, James.
Kinsner, Michael.
Pennycook, John.
Tian, Xinmin.
Print version: Reinders, James Data Parallel C++ Berkeley, CA : Apress L. P., c2023 9781484296905
ProQuest (Firm)
https://ebookcentral.proquest.com/lib/oeawat/detail.action?docID=30882798 Click to View
language English
format eBook
author Reinders, James.
spellingShingle Reinders, James.
Data Parallel C++ : Programming Accelerated Systems Using C++ and SYCL.
Intro -- Table of Contents -- About the Authors -- Preface -- Foreword -- Acknowledgments -- Chapter 1: Introduction -- Read the Book, Not the Spec -- SYCL 2020 and DPC++ -- Why Not CUDA? -- Why Standard C++ with SYCL? -- Getting a C++ Compiler with SYCL Support -- Hello, World! and a SYCL Program Dissection -- Queues and Actions -- It Is All About Parallelism -- Throughput -- Latency -- Think Parallel -- Amdahl and Gustafson -- Scaling -- Heterogeneous Systems -- Data-Parallel Programming -- Key Attributes of C++ with SYCL -- Single-Source -- Host -- Devices -- Sharing Devices -- Kernel Code -- Kernel: Vector Addition (DAXPY) -- Asynchronous Execution -- Race Conditions When We Make a Mistake -- Deadlock -- C++ Lambda Expressions -- Functional Portability and Performance Portability -- Concurrency vs. Parallelism -- Summary -- Chapter 2: Where Code Executes -- Single-Source -- Host Code -- Device Code -- Choosing Devices -- Method#1: Run on a Device of Any Type -- Queues -- Binding a Queue to a Device When Any Device Will Do -- Method#2: Using a CPU Device for Development, Debugging, and Deployment -- Method#3: Using a GPU (or Other Accelerators) -- Accelerator Devices -- Device Selectors -- When Device Selection Fails -- Method#4: Using Multiple Devices -- Method#5: Custom (Very Specific) Device Selection -- Selection Based on Device Aspects -- Selection Through a Custom Selector -- Mechanisms to Score a Device -- Creating Work on a Device -- Introducing the Task Graph -- Where Is the Device Code? -- Actions -- Host tasks -- Summary -- Chapter 3: Data Management -- Introduction -- The Data Management Problem -- Device Local vs. Device Remote -- Managing Multiple Memories -- Explicit Data Movement -- Implicit Data Movement -- Selecting the Right Strategy -- USM, Buffers, and Images -- Unified Shared Memory -- Accessing Memory Through Pointers.
USM and Data Movement -- Explicit Data Movement in USM -- Implicit Data Movement in USM -- Buffers -- Creating Buffers -- Accessing Buffers -- Access Modes -- Ordering the Uses of Data -- In-order Queues -- Out-of-Order Queues -- Explicit Dependences with Events -- Implicit Dependences with Accessors -- Choosing a Data Management Strategy -- Handler Class: Key Members -- Summary -- Chapter 4: Expressing Parallelism -- Parallelism Within Kernels -- Loops vs. Kernels -- Multidimensional Kernels -- Overview of Language Features -- Separating Kernels from Host Code -- Different Forms of Parallel Kernels -- Basic Data-Parallel Kernels -- Understanding Basic Data-Parallel Kernels -- Writing Basic Data-Parallel Kernels -- Details of Basic Data-Parallel Kernels -- The range Class -- The id Class -- The item Class -- Explicit ND-Range Kernels -- Understanding Explicit ND-Range Parallel Kernels -- Work-Items -- Work-Groups -- Sub-Groups -- Writing Explicit ND-Range Data-Parallel Kernels -- Details of Explicit ND-Range Data-Parallel Kernels -- The nd_range Class -- The nd_item Class -- The group Class -- The sub_group Class -- Mapping Computation to Work-Items -- One-to-One Mapping -- Many-to-One Mapping -- Choosing a Kernel Form -- Summary -- Chapter 5: Error Handling -- Safety First -- Types of Errors -- Let's Create Some Errors! -- Synchronous Error -- Asynchronous Error -- Application Error Handling Strategy -- Ignoring Error Handling -- Synchronous Error Handling -- Asynchronous Error Handling -- The Asynchronous Handler -- Invocation of the Handler -- Errors on a Device -- Summary -- Chapter 6: Unified Shared Memory -- Why Should We Use USM? -- Allocation Types -- Device Allocations -- Host Allocations -- Shared Allocations -- Allocating Memory -- What Do We Need to Know? -- Multiple Styles -- Allocations à la C -- Allocations à la C++ -- C++ Allocators.
Deallocating Memory -- Allocation Example -- Data Management -- Initialization -- Data Movement -- Explicit -- Implicit -- Migration -- Fine-Grained Control -- Queries -- One More Thing -- Summary -- Chapter 7: Buffers -- Buffers -- Buffer Creation -- Buffer Properties -- use_host_ptr -- use_mutex -- context_bound -- What Can We Do with a Buffer? -- Accessors -- Accessor Creation -- What Can We Do with an Accessor? -- Summary -- Chapter 8: Scheduling Kernels and Data Movement -- What Is Graph Scheduling? -- How Graphs Work in SYCL -- Command Group Actions -- How Command Groups Declare Dependences -- Examples -- When Are the Parts of a Command Group Executed? -- Data Movement -- Explicit Data Movement -- Implicit Data Movement -- Synchronizing with the Host -- Summary -- Chapter 9: Communication and Synchronization -- Work-Groups and Work-Items -- Building Blocks for Efficient Communication -- Synchronization via Barriers -- Work-Group Local Memory -- Using Work-Group Barriers and Local Memory -- Work-Group Barriers and Local Memory in ND-Range Kernels -- Local Accessors -- Synchronization Functions -- A Full ND-Range Kernel Example -- Sub-Groups -- Synchronization via Sub-Group Barriers -- Exchanging Data Within a Sub-Group -- A Full Sub-Group ND-Range Kernel Example -- Group Functions and Group Algorithms -- Broadcast -- Votes -- Shuffles -- Summary -- Chapter 10: Defining Kernels -- Why Three Ways to Represent a Kernel? -- Kernels as Lambda Expressions -- Elements of a Kernel Lambda Expression -- Identifying Kernel Lambda Expressions -- Kernels as Named Function Objects -- Elements of a Kernel Named Function Object -- Kernels in Kernel Bundles -- Interoperability with Other APIs -- Summary -- Chapter 11: Vectors and Math Arrays -- The Ambiguity of Vector Types -- Our Mental Model for SYCL Vector Types -- Math Array (marray) -- Vector (vec).
Loads and Stores -- Interoperability with Backend-Native Vector Types -- Swizzle Operations -- How Vector Types Execute -- Vectors as Convenience Types -- Vectors as SIMD Types -- Summary -- Chapter 12: Device Information and Kernel Specialization -- Is There a GPU Present? -- Refining Kernel Code to Be More Prescriptive -- How to Enumerate Devices and Capabilities -- Aspects -- Custom Device Selector -- Being Curious: get_info<> -- Being More Curious: Detailed Enumeration Code -- Very Curious: get_info plus has() -- Device Information Descriptors -- Device-Specific Kernel Information Descriptors -- The Specifics: Those of "Correctness" -- Device Queries -- Kernel Queries -- The Specifics: Those of "Tuning/Optimization" -- Device Queries -- Kernel Queries -- Runtime vs. Compile-Time Properties -- Kernel Specialization -- Summary -- Chapter 13: Practical Tips -- Getting the Code Samples and a Compiler -- Online Resources -- Platform Model -- Multiarchitecture Binaries -- Compilation Model -- Contexts: Important Things to Know -- Adding SYCL to Existing C++ Programs -- Considerations When Using Multiple Compilers -- Debugging -- Debugging Deadlock and Other Synchronization Issues -- Debugging Kernel Code -- Debugging Runtime Failures -- Queue Profiling and Resulting Timing Capabilities -- Tracing and Profiling Tools Interfaces -- Initializing Data and Accessing Kernel Outputs -- Multiple Translation Units -- Performance Implication of Multiple Translation Units -- When Anonymous Lambdas Need Names -- Summary -- Chapter 14: Common Parallel Patterns -- Understanding the Patterns -- Map -- Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Using Built-In Functions and Libraries -- The SYCL Reduction Library -- The reduction Class -- The reducer Class -- User-Defined Reductions -- Group Algorithms -- Direct Programming -- Map.
Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Summary -- For More Information -- Chapter 15: Programming for GPUs -- Performance Caveats -- How GPUs Work -- GPU Building Blocks -- Simpler Processors (but More of Them) -- Expressing Parallelism -- Expressing More Parallelism -- Simplified Control Logic (SIMD Instructions) -- Predication and Masking -- SIMD Efficiency -- SIMD Efficiency and Groups of Items -- Switching Work to Hide Latency -- Offloading Kernels to GPUs -- SYCL Runtime Library -- GPU Software Drivers -- GPU Hardware -- Beware the Cost of Offloading! -- Transfers to and from Device Memory -- GPU Kernel Best Practices -- Accessing Global Memory -- Accessing Work-Group Local Memory -- Avoiding Local Memory Entirely with Sub-Groups -- Optimizing Computation Using Small Data Types -- Optimizing Math Functions -- Specialized Functions and Extensions -- Summary -- For More Information -- Chapter 16: Programming for CPUs -- Performance Caveats -- The Basics of Multicore CPUs -- The Basics of SIMD Hardware -- Exploiting Thread-Level Parallelism -- Thread Affinity Insight -- Be Mindful of First Touch to Memory -- SIMD Vectorization on CPU -- Ensure SIMD Execution Legality -- SIMD Masking and Cost -- Avoid Array of Struct for SIMD Efficiency -- Data Type Impact on SIMD Efficiency -- SIMD Execution Using single_task -- Summary -- Chapter 17: Programming for  FPGAs -- Performance Caveats -- How to Think About FPGAs -- Pipeline Parallelism -- Kernels Consume Chip "Area" -- When to Use an FPGA -- Lots and Lots of Work -- Custom Operations or Operation Widths -- Scalar Data Flow -- Low Latency and Rich Connectivity -- Customized Memory Systems -- Running on an FPGA -- Compile Times -- The FPGA Emulator -- FPGA Hardware Compilation Occurs "Ahead-of-Time" -- Writing Kernels for FPGAs -- Exposing Parallelism.
Keeping the Pipeline Busy Using ND-Ranges.
author_facet Reinders, James.
Ashbaugh, Ben.
Brodman, James.
Kinsner, Michael.
Pennycook, John.
Tian, Xinmin.
author_variant j r jr
author2 Ashbaugh, Ben.
Brodman, James.
Kinsner, Michael.
Pennycook, John.
Tian, Xinmin.
author2_variant b a ba
j b jb
m k mk
j p jp
x t xt
author2_role Contributor
Contributor
Contributor
Contributor
Contributor
author_sort Reinders, James.
title Data Parallel C++ : Programming Accelerated Systems Using C++ and SYCL.
title_sub Programming Accelerated Systems Using C++ and SYCL.
title_full Data Parallel C++ : Programming Accelerated Systems Using C++ and SYCL.
title_fullStr Data Parallel C++ : Programming Accelerated Systems Using C++ and SYCL.
title_full_unstemmed Data Parallel C++ : Programming Accelerated Systems Using C++ and SYCL.
title_auth Data Parallel C++ : Programming Accelerated Systems Using C++ and SYCL.
title_new Data Parallel C++ :
title_sort data parallel c++ : programming accelerated systems using c++ and sycl.
publisher Apress L. P.,
publishDate 2023
physical 1 online resource (648 pages)
edition 2nd ed.
contents Intro -- Table of Contents -- About the Authors -- Preface -- Foreword -- Acknowledgments -- Chapter 1: Introduction -- Read the Book, Not the Spec -- SYCL 2020 and DPC++ -- Why Not CUDA? -- Why Standard C++ with SYCL? -- Getting a C++ Compiler with SYCL Support -- Hello, World! and a SYCL Program Dissection -- Queues and Actions -- It Is All About Parallelism -- Throughput -- Latency -- Think Parallel -- Amdahl and Gustafson -- Scaling -- Heterogeneous Systems -- Data-Parallel Programming -- Key Attributes of C++ with SYCL -- Single-Source -- Host -- Devices -- Sharing Devices -- Kernel Code -- Kernel: Vector Addition (DAXPY) -- Asynchronous Execution -- Race Conditions When We Make a Mistake -- Deadlock -- C++ Lambda Expressions -- Functional Portability and Performance Portability -- Concurrency vs. Parallelism -- Summary -- Chapter 2: Where Code Executes -- Single-Source -- Host Code -- Device Code -- Choosing Devices -- Method#1: Run on a Device of Any Type -- Queues -- Binding a Queue to a Device When Any Device Will Do -- Method#2: Using a CPU Device for Development, Debugging, and Deployment -- Method#3: Using a GPU (or Other Accelerators) -- Accelerator Devices -- Device Selectors -- When Device Selection Fails -- Method#4: Using Multiple Devices -- Method#5: Custom (Very Specific) Device Selection -- Selection Based on Device Aspects -- Selection Through a Custom Selector -- Mechanisms to Score a Device -- Creating Work on a Device -- Introducing the Task Graph -- Where Is the Device Code? -- Actions -- Host tasks -- Summary -- Chapter 3: Data Management -- Introduction -- The Data Management Problem -- Device Local vs. Device Remote -- Managing Multiple Memories -- Explicit Data Movement -- Implicit Data Movement -- Selecting the Right Strategy -- USM, Buffers, and Images -- Unified Shared Memory -- Accessing Memory Through Pointers.
USM and Data Movement -- Explicit Data Movement in USM -- Implicit Data Movement in USM -- Buffers -- Creating Buffers -- Accessing Buffers -- Access Modes -- Ordering the Uses of Data -- In-order Queues -- Out-of-Order Queues -- Explicit Dependences with Events -- Implicit Dependences with Accessors -- Choosing a Data Management Strategy -- Handler Class: Key Members -- Summary -- Chapter 4: Expressing Parallelism -- Parallelism Within Kernels -- Loops vs. Kernels -- Multidimensional Kernels -- Overview of Language Features -- Separating Kernels from Host Code -- Different Forms of Parallel Kernels -- Basic Data-Parallel Kernels -- Understanding Basic Data-Parallel Kernels -- Writing Basic Data-Parallel Kernels -- Details of Basic Data-Parallel Kernels -- The range Class -- The id Class -- The item Class -- Explicit ND-Range Kernels -- Understanding Explicit ND-Range Parallel Kernels -- Work-Items -- Work-Groups -- Sub-Groups -- Writing Explicit ND-Range Data-Parallel Kernels -- Details of Explicit ND-Range Data-Parallel Kernels -- The nd_range Class -- The nd_item Class -- The group Class -- The sub_group Class -- Mapping Computation to Work-Items -- One-to-One Mapping -- Many-to-One Mapping -- Choosing a Kernel Form -- Summary -- Chapter 5: Error Handling -- Safety First -- Types of Errors -- Let's Create Some Errors! -- Synchronous Error -- Asynchronous Error -- Application Error Handling Strategy -- Ignoring Error Handling -- Synchronous Error Handling -- Asynchronous Error Handling -- The Asynchronous Handler -- Invocation of the Handler -- Errors on a Device -- Summary -- Chapter 6: Unified Shared Memory -- Why Should We Use USM? -- Allocation Types -- Device Allocations -- Host Allocations -- Shared Allocations -- Allocating Memory -- What Do We Need to Know? -- Multiple Styles -- Allocations à la C -- Allocations à la C++ -- C++ Allocators.
Deallocating Memory -- Allocation Example -- Data Management -- Initialization -- Data Movement -- Explicit -- Implicit -- Migration -- Fine-Grained Control -- Queries -- One More Thing -- Summary -- Chapter 7: Buffers -- Buffers -- Buffer Creation -- Buffer Properties -- use_host_ptr -- use_mutex -- context_bound -- What Can We Do with a Buffer? -- Accessors -- Accessor Creation -- What Can We Do with an Accessor? -- Summary -- Chapter 8: Scheduling Kernels and Data Movement -- What Is Graph Scheduling? -- How Graphs Work in SYCL -- Command Group Actions -- How Command Groups Declare Dependences -- Examples -- When Are the Parts of a Command Group Executed? -- Data Movement -- Explicit Data Movement -- Implicit Data Movement -- Synchronizing with the Host -- Summary -- Chapter 9: Communication and Synchronization -- Work-Groups and Work-Items -- Building Blocks for Efficient Communication -- Synchronization via Barriers -- Work-Group Local Memory -- Using Work-Group Barriers and Local Memory -- Work-Group Barriers and Local Memory in ND-Range Kernels -- Local Accessors -- Synchronization Functions -- A Full ND-Range Kernel Example -- Sub-Groups -- Synchronization via Sub-Group Barriers -- Exchanging Data Within a Sub-Group -- A Full Sub-Group ND-Range Kernel Example -- Group Functions and Group Algorithms -- Broadcast -- Votes -- Shuffles -- Summary -- Chapter 10: Defining Kernels -- Why Three Ways to Represent a Kernel? -- Kernels as Lambda Expressions -- Elements of a Kernel Lambda Expression -- Identifying Kernel Lambda Expressions -- Kernels as Named Function Objects -- Elements of a Kernel Named Function Object -- Kernels in Kernel Bundles -- Interoperability with Other APIs -- Summary -- Chapter 11: Vectors and Math Arrays -- The Ambiguity of Vector Types -- Our Mental Model for SYCL Vector Types -- Math Array (marray) -- Vector (vec).
Loads and Stores -- Interoperability with Backend-Native Vector Types -- Swizzle Operations -- How Vector Types Execute -- Vectors as Convenience Types -- Vectors as SIMD Types -- Summary -- Chapter 12: Device Information and Kernel Specialization -- Is There a GPU Present? -- Refining Kernel Code to Be More Prescriptive -- How to Enumerate Devices and Capabilities -- Aspects -- Custom Device Selector -- Being Curious: get_info<> -- Being More Curious: Detailed Enumeration Code -- Very Curious: get_info plus has() -- Device Information Descriptors -- Device-Specific Kernel Information Descriptors -- The Specifics: Those of "Correctness" -- Device Queries -- Kernel Queries -- The Specifics: Those of "Tuning/Optimization" -- Device Queries -- Kernel Queries -- Runtime vs. Compile-Time Properties -- Kernel Specialization -- Summary -- Chapter 13: Practical Tips -- Getting the Code Samples and a Compiler -- Online Resources -- Platform Model -- Multiarchitecture Binaries -- Compilation Model -- Contexts: Important Things to Know -- Adding SYCL to Existing C++ Programs -- Considerations When Using Multiple Compilers -- Debugging -- Debugging Deadlock and Other Synchronization Issues -- Debugging Kernel Code -- Debugging Runtime Failures -- Queue Profiling and Resulting Timing Capabilities -- Tracing and Profiling Tools Interfaces -- Initializing Data and Accessing Kernel Outputs -- Multiple Translation Units -- Performance Implication of Multiple Translation Units -- When Anonymous Lambdas Need Names -- Summary -- Chapter 14: Common Parallel Patterns -- Understanding the Patterns -- Map -- Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Using Built-In Functions and Libraries -- The SYCL Reduction Library -- The reduction Class -- The reducer Class -- User-Defined Reductions -- Group Algorithms -- Direct Programming -- Map.
Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Summary -- For More Information -- Chapter 15: Programming for GPUs -- Performance Caveats -- How GPUs Work -- GPU Building Blocks -- Simpler Processors (but More of Them) -- Expressing Parallelism -- Expressing More Parallelism -- Simplified Control Logic (SIMD Instructions) -- Predication and Masking -- SIMD Efficiency -- SIMD Efficiency and Groups of Items -- Switching Work to Hide Latency -- Offloading Kernels to GPUs -- SYCL Runtime Library -- GPU Software Drivers -- GPU Hardware -- Beware the Cost of Offloading! -- Transfers to and from Device Memory -- GPU Kernel Best Practices -- Accessing Global Memory -- Accessing Work-Group Local Memory -- Avoiding Local Memory Entirely with Sub-Groups -- Optimizing Computation Using Small Data Types -- Optimizing Math Functions -- Specialized Functions and Extensions -- Summary -- For More Information -- Chapter 16: Programming for CPUs -- Performance Caveats -- The Basics of Multicore CPUs -- The Basics of SIMD Hardware -- Exploiting Thread-Level Parallelism -- Thread Affinity Insight -- Be Mindful of First Touch to Memory -- SIMD Vectorization on CPU -- Ensure SIMD Execution Legality -- SIMD Masking and Cost -- Avoid Array of Struct for SIMD Efficiency -- Data Type Impact on SIMD Efficiency -- SIMD Execution Using single_task -- Summary -- Chapter 17: Programming for  FPGAs -- Performance Caveats -- How to Think About FPGAs -- Pipeline Parallelism -- Kernels Consume Chip "Area" -- When to Use an FPGA -- Lots and Lots of Work -- Custom Operations or Operation Widths -- Scalar Data Flow -- Low Latency and Rich Connectivity -- Customized Memory Systems -- Running on an FPGA -- Compile Times -- The FPGA Emulator -- FPGA Hardware Compilation Occurs "Ahead-of-Time" -- Writing Kernels for FPGAs -- Exposing Parallelism.
Keeping the Pipeline Busy Using ND-Ranges.
isbn 9781484296912
9781484296905
callnumber-first Q - Science
callnumber-subject QA - Mathematics
callnumber-label QA76
callnumber-sort QA 276.76 C65
genre Electronic books.
genre_facet Electronic books.
url https://ebookcentral.proquest.com/lib/oeawat/detail.action?docID=30882798
illustrated Not Illustrated
oclc_num 1403550971
work_keys_str_mv AT reindersjames dataparallelcprogrammingacceleratedsystemsusingcandsycl
AT ashbaughben dataparallelcprogrammingacceleratedsystemsusingcandsycl
AT brodmanjames dataparallelcprogrammingacceleratedsystemsusingcandsycl
AT kinsnermichael dataparallelcprogrammingacceleratedsystemsusingcandsycl
AT pennycookjohn dataparallelcprogrammingacceleratedsystemsusingcandsycl
AT tianxinmin dataparallelcprogrammingacceleratedsystemsusingcandsycl
status_str n
ids_txt_mv (MiAaPQ)50030882798
(Au-PeEL)EBL30882798
(OCoLC)1403550971
carrierType_str_mv cr
is_hierarchy_title Data Parallel C++ : Programming Accelerated Systems Using C++ and SYCL.
author2_original_writing_str_mv noLinkedField
noLinkedField
noLinkedField
noLinkedField
noLinkedField
marc_error Info : MARC8 translation shorter than ISO-8859-1, choosing MARC8. --- [ 856 : z ]
_version_ 1792331073078689792
fullrecord <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>11855nam a22004933i 4500</leader><controlfield tag="001">50030882798</controlfield><controlfield tag="003">MiAaPQ</controlfield><controlfield tag="005">20240229073851.0</controlfield><controlfield tag="006">m o d | </controlfield><controlfield tag="007">cr cnu||||||||</controlfield><controlfield tag="008">240229s2023 xx o ||||0 eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781484296912</subfield><subfield code="q">(electronic bk.)</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="z">9781484296905</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(MiAaPQ)50030882798</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(Au-PeEL)EBL30882798</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)1403550971</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">MiAaPQ</subfield><subfield code="b">eng</subfield><subfield code="e">rda</subfield><subfield code="e">pn</subfield><subfield code="c">MiAaPQ</subfield><subfield code="d">MiAaPQ</subfield></datafield><datafield tag="050" ind1=" " ind2="4"><subfield code="a">QA76.76.C65</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Reinders, James.</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Data Parallel C++ :</subfield><subfield code="b">Programming Accelerated Systems Using C++ and SYCL.</subfield></datafield><datafield tag="250" ind1=" " ind2=" "><subfield code="a">2nd ed.</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Berkeley, CA :</subfield><subfield code="b">Apress L. P.,</subfield><subfield code="c">2023.</subfield></datafield><datafield tag="264" ind1=" " ind2="4"><subfield code="c">©2023.</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">1 online resource (648 pages)</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="a">text</subfield><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="a">computer</subfield><subfield code="b">c</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="a">online resource</subfield><subfield code="b">cr</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="505" ind1="0" ind2=" "><subfield code="a">Intro -- Table of Contents -- About the Authors -- Preface -- Foreword -- Acknowledgments -- Chapter 1: Introduction -- Read the Book, Not the Spec -- SYCL 2020 and DPC++ -- Why Not CUDA? -- Why Standard C++ with SYCL? -- Getting a C++ Compiler with SYCL Support -- Hello, World! and a SYCL Program Dissection -- Queues and Actions -- It Is All About Parallelism -- Throughput -- Latency -- Think Parallel -- Amdahl and Gustafson -- Scaling -- Heterogeneous Systems -- Data-Parallel Programming -- Key Attributes of C++ with SYCL -- Single-Source -- Host -- Devices -- Sharing Devices -- Kernel Code -- Kernel: Vector Addition (DAXPY) -- Asynchronous Execution -- Race Conditions When We Make a Mistake -- Deadlock -- C++ Lambda Expressions -- Functional Portability and Performance Portability -- Concurrency vs. 
Parallelism -- Summary -- Chapter 2: Where Code Executes -- Single-Source -- Host Code -- Device Code -- Choosing Devices -- Method#1: Run on a Device of Any Type -- Queues -- Binding a Queue to a Device When Any Device Will Do -- Method#2: Using a CPU Device for Development, Debugging, and Deployment -- Method#3: Using a GPU (or Other Accelerators) -- Accelerator Devices -- Device Selectors -- When Device Selection Fails -- Method#4: Using Multiple Devices -- Method#5: Custom (Very Specific) Device Selection -- Selection Based on Device Aspects -- Selection Through a Custom Selector -- Mechanisms to Score a Device -- Creating Work on a Device -- Introducing the Task Graph -- Where Is the Device Code? -- Actions -- Host tasks -- Summary -- Chapter 3: Data Management -- Introduction -- The Data Management Problem -- Device Local vs. Device Remote -- Managing Multiple Memories -- Explicit Data Movement -- Implicit Data Movement -- Selecting the Right Strategy -- USM, Buffers, and Images -- Unified Shared Memory -- Accessing Memory Through Pointers.</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield code="a">USM and Data Movement -- Explicit Data Movement in USM -- Implicit Data Movement in USM -- Buffers -- Creating Buffers -- Accessing Buffers -- Access Modes -- Ordering the Uses of Data -- In-order Queues -- Out-of-Order Queues -- Explicit Dependences with Events -- Implicit Dependences with Accessors -- Choosing a Data Management Strategy -- Handler Class: Key Members -- Summary -- Chapter 4: Expressing Parallelism -- Parallelism Within Kernels -- Loops vs. Kernels -- Multidimensional Kernels -- Overview of Language Features -- Separating Kernels from Host Code -- Different Forms of Parallel Kernels -- Basic Data-Parallel Kernels -- Understanding Basic Data-Parallel Kernels -- Writing Basic Data-Parallel Kernels -- Details of Basic Data-Parallel Kernels -- The range Class -- The id Class -- The item Class -- Explicit ND-Range Kernels -- Understanding Explicit ND-Range Parallel Kernels -- Work-Items -- Work-Groups -- Sub-Groups -- Writing Explicit ND-Range Data-Parallel Kernels -- Details of Explicit ND-Range Data-Parallel Kernels -- The nd_range Class -- The nd_item Class -- The group Class -- The sub_group Class -- Mapping Computation to Work-Items -- One-to-One Mapping -- Many-to-One Mapping -- Choosing a Kernel Form -- Summary -- Chapter 5: Error Handling -- Safety First -- Types of Errors -- Let's Create Some Errors! -- Synchronous Error -- Asynchronous Error -- Application Error Handling Strategy -- Ignoring Error Handling -- Synchronous Error Handling -- Asynchronous Error Handling -- The Asynchronous Handler -- Invocation of the Handler -- Errors on a Device -- Summary -- Chapter 6: Unified Shared Memory -- Why Should We Use USM? -- Allocation Types -- Device Allocations -- Host Allocations -- Shared Allocations -- Allocating Memory -- What Do We Need to Know? -- Multiple Styles -- Allocations à la C -- Allocations à la C++ -- C++ Allocators.</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield code="a">Deallocating Memory -- Allocation Example -- Data Management -- Initialization -- Data Movement -- Explicit -- Implicit -- Migration -- Fine-Grained Control -- Queries -- One More Thing -- Summary -- Chapter 7: Buffers -- Buffers -- Buffer Creation -- Buffer Properties -- use_host_ptr -- use_mutex -- context_bound -- What Can We Do with a Buffer? -- Accessors -- Accessor Creation -- What Can We Do with an Accessor? 
-- Summary -- Chapter 8: Scheduling Kernels and Data Movement -- What Is Graph Scheduling? -- How Graphs Work in SYCL -- Command Group Actions -- How Command Groups Declare Dependences -- Examples -- When Are the Parts of a Command Group Executed? -- Data Movement -- Explicit Data Movement -- Implicit Data Movement -- Synchronizing with the Host -- Summary -- Chapter 9: Communication and Synchronization -- Work-Groups and Work-Items -- Building Blocks for Efficient Communication -- Synchronization via Barriers -- Work-Group Local Memory -- Using Work-Group Barriers and Local Memory -- Work-Group Barriers and Local Memory in ND-Range Kernels -- Local Accessors -- Synchronization Functions -- A Full ND-Range Kernel Example -- Sub-Groups -- Synchronization via Sub-Group Barriers -- Exchanging Data Within a Sub-Group -- A Full Sub-Group ND-Range Kernel Example -- Group Functions and Group Algorithms -- Broadcast -- Votes -- Shuffles -- Summary -- Chapter 10: Defining Kernels -- Why Three Ways to Represent a Kernel? -- Kernels as Lambda Expressions -- Elements of a Kernel Lambda Expression -- Identifying Kernel Lambda Expressions -- Kernels as Named Function Objects -- Elements of a Kernel Named Function Object -- Kernels in Kernel Bundles -- Interoperability with Other APIs -- Summary -- Chapter 11: Vectors and Math Arrays -- The Ambiguity of Vector Types -- Our Mental Model for SYCL Vector Types -- Math Array (marray) -- Vector (vec).</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield code="a">Loads and Stores -- Interoperability with Backend-Native Vector Types -- Swizzle Operations -- How Vector Types Execute -- Vectors as Convenience Types -- Vectors as SIMD Types -- Summary -- Chapter 12: Device Information and Kernel Specialization -- Is There a GPU Present? -- Refining Kernel Code to Be More Prescriptive -- How to Enumerate Devices and Capabilities -- Aspects -- Custom Device Selector -- Being Curious: get_info&amp;lt -- &amp;gt -- -- Being More Curious: Detailed Enumeration Code -- Very Curious: get_info plus has() -- Device Information Descriptors -- Device-Specific Kernel Information Descriptors -- The Specifics: Those of "Correctness" -- Device Queries -- Kernel Queries -- The Specifics: Those of "Tuning/Optimization" -- Device Queries -- Kernel Queries -- Runtime vs. 
Compile-Time Properties -- Kernel Specialization -- Summary -- Chapter 13: Practical Tips -- Getting the Code Samples and a Compiler -- Online Resources -- Platform Model -- Multiarchitecture Binaries -- Compilation Model -- Contexts: Important Things to Know -- Adding SYCL to Existing C++ Programs -- Considerations When Using Multiple Compilers -- Debugging -- Debugging Deadlock and Other Synchronization Issues -- Debugging Kernel Code -- Debugging Runtime Failures -- Queue Profiling and Resulting Timing Capabilities -- Tracing and Profiling Tools Interfaces -- Initializing Data and Accessing Kernel Outputs -- Multiple Translation Units -- Performance Implication of Multiple Translation Units -- When Anonymous Lambdas Need Names -- Summary -- Chapter 14: Common Parallel Patterns -- Understanding the Patterns -- Map -- Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Using Built-In Functions and Libraries -- The SYCL Reduction Library -- The reduction Class -- The reducer Class -- User-Defined Reductions -- Group Algorithms -- Direct Programming -- Map.</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield code="a">Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Summary -- For More Information -- Chapter 15: Programming for GPUs -- Performance Caveats -- How GPUs Work -- GPU Building Blocks -- Simpler Processors (but More of Them) -- Expressing Parallelism -- Expressing More Parallelism -- Simplified Control Logic (SIMD Instructions) -- Predication and Masking -- SIMD Efficiency -- SIMD Efficiency and Groups of Items -- Switching Work to Hide Latency -- Offloading Kernels to GPUs -- SYCL Runtime Library -- GPU Software Drivers -- GPU Hardware -- Beware the Cost of Offloading! -- Transfers to and from Device Memory -- GPU Kernel Best Practices -- Accessing Global Memory -- Accessing Work-Group Local Memory -- Avoiding Local Memory Entirely with Sub-Groups -- Optimizing Computation Using Small Data Types -- Optimizing Math Functions -- Specialized Functions and Extensions -- Summary -- For More Information -- Chapter 16: Programming for CPUs -- Performance Caveats -- The Basics of Multicore CPUs -- The Basics of SIMD Hardware -- Exploiting Thread-Level Parallelism -- Thread Affinity Insight -- Be Mindful of First Touch to Memory -- SIMD Vectorization on CPU -- Ensure SIMD Execution Legality -- SIMD Masking and Cost -- Avoid Array of Struct for SIMD Efficiency -- Data Type Impact on SIMD Efficiency -- SIMD Execution Using single_task -- Summary -- Chapter 17: Programming for  FPGAs -- Performance Caveats -- How to Think About FPGAs -- Pipeline Parallelism -- Kernels Consume Chip "Area" -- When to Use an FPGA -- Lots and Lots of Work -- Custom Operations or Operation Widths -- Scalar Data Flow -- Low Latency and Rich Connectivity -- Customized Memory Systems -- Running on an FPGA -- Compile Times -- The FPGA Emulator -- FPGA Hardware Compilation Occurs "Ahead-of-Time" -- Writing Kernels for FPGAs -- Exposing Parallelism.</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield code="a">Keeping the Pipeline Busy Using ND-Ranges.</subfield></datafield><datafield tag="588" ind1=" " ind2=" "><subfield code="a">Description based on publisher supplied metadata and other sources.</subfield></datafield><datafield tag="590" ind1=" " ind2=" "><subfield code="a">Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2024. Available via World Wide Web. 
Access may be limited to ProQuest Ebook Central affiliated libraries. </subfield></datafield><datafield tag="655" ind1=" " ind2="4"><subfield code="a">Electronic books.</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Ashbaugh, Ben.</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Brodman, James.</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Kinsner, Michael.</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Pennycook, John.</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Tian, Xinmin.</subfield></datafield><datafield tag="776" ind1="0" ind2="8"><subfield code="i">Print version:</subfield><subfield code="a">Reinders, James</subfield><subfield code="t">Data Parallel C++</subfield><subfield code="d">Berkeley, CA : Apress L. P.,c2023</subfield><subfield code="z">9781484296905</subfield></datafield><datafield tag="797" ind1="2" ind2=" "><subfield code="a">ProQuest (Firm)</subfield></datafield><datafield tag="856" ind1="4" ind2="0"><subfield code="u">https://ebookcentral.proquest.com/lib/oeawat/detail.action?docID=30882798</subfield><subfield code="z">Click to View</subfield></datafield></record></collection>