Data Parallel C++ : Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL.

Bibliographic Details
Contributors: Ashbaugh, Ben; Brodman, James; Kinsner, Michael; Pennycook, John; Tian, Xinmin
Place / Publishing House: Berkeley, CA : Apress L. P., 2020.
©2021.
Year of Publication: 2020
Edition: 1st ed.
Language: English
Online Access: https://ebookcentral.proquest.com/lib/oeawat/detail.action?docID=6383586
Physical Description: 1 online resource (565 pages)
LEADER 11784nam a22004933i 4500
001 5006383586
003 MiAaPQ
005 20240229073836.0
006 m o d |
007 cr cnu||||||||
008 240229s2020 xx o ||||0 eng d
020 |a 9781484255742  |q (electronic bk.) 
020 |z 9781484255735 
035 |a (MiAaPQ)5006383586 
035 |a (Au-PeEL)EBL6383586 
035 |a (OCoLC)1204226016 
040 |a MiAaPQ  |b eng  |e rda  |e pn  |c MiAaPQ  |d MiAaPQ 
050 4 |a QA76.76.C65 
100 1 |a Reinders, James. 
245 1 0 |a Data Parallel C++ :  |b Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL. 
250 |a 1st ed. 
264 1 |a Berkeley, CA :  |b Apress L. P.,  |c 2020. 
264 4 |c ©2021. 
300 |a 1 online resource (565 pages) 
336 |a text  |b txt  |2 rdacontent 
337 |a computer  |b c  |2 rdamedia 
338 |a online resource  |b cr  |2 rdacarrier 
505 0 |a Intro -- Table of Contents -- About the Authors -- Preface -- Acknowledgments -- Chapter 1: Introduction -- Read the Book, Not the Spec -- SYCL 1.2.1 vs. SYCL 2020, and DPC++ -- Getting a DPC++ Compiler -- Book GitHub -- Hello, World! and a SYCL Program Dissection -- Queues and Actions -- It Is All About Parallelism -- Throughput -- Latency -- Think Parallel -- Amdahl and Gustafson -- Scaling -- Heterogeneous Systems -- Data-Parallel Programming -- Key Attributes of DPC++ and SYCL -- Single-Source -- Host -- Devices -- Sharing Devices -- Kernel Code -- Kernel: Vector Addition (DAXPY) -- Asynchronous Task Graphs -- Race Conditions When We Make a Mistake -- C++ Lambda Functions -- Portability and Direct Programming -- Concurrency vs. Parallelism -- Summary -- Chapter 2: Where Code Executes -- Single-Source -- Host Code -- Device Code -- Choosing Devices -- Method#1: Run on a Device of Any Type -- Queues -- Binding a Queue to a Device, When Any Device Will Do -- Method#2: Using the Host Device for Development and Debugging -- Method#3: Using a GPU (or Other Accelerators) -- Device Types -- Accelerator Devices -- Device Selectors -- When Device Selection Fails -- Method#4: Using Multiple Devices -- Method#5: Custom (Very Specific) Device Selection -- device_selector Base Class -- Mechanisms to Score a Device -- Three Paths to Device Code Execution on CPU -- Creating Work on a Device -- Introducing the Task Graph -- Where Is the Device Code? -- Actions -- Fallback -- Summary -- Chapter 3: Data Management -- Introduction -- The Data Management Problem -- Device Local vs. Device Remote -- Managing Multiple Memories -- Explicit Data Movement -- Implicit Data Movement -- Selecting the Right Strategy -- USM, Buffers, and Images -- Unified Shared Memory -- Accessing Memory Through Pointers -- USM and Data Movement -- Explicit Data Movement in USM. 
505 8 |a Implicit Data Movement in USM -- Buffers -- Creating Buffers -- Accessing Buffers -- Access Modes -- Ordering the Uses of Data -- In-order Queues -- Out-of-Order (OoO) Queues -- Explicit Dependences with Events -- Implicit Dependences with Accessors -- Choosing a Data Management Strategy -- Handler Class: Key Members -- Summary -- Chapter 4: Expressing Parallelism -- Parallelism Within Kernels -- Multidimensional Kernels -- Loops vs. Kernels -- Overview of Language Features -- Separating Kernels from Host Code -- Different Forms of Parallel Kernels -- Basic Data-Parallel Kernels -- Understanding Basic Data-Parallel Kernels -- Writing Basic Data-Parallel Kernels -- Details of Basic Data-Parallel Kernels -- The range Class -- The id Class -- The item Class -- Explicit ND-Range Kernels -- Understanding Explicit ND-Range Parallel Kernels -- Work-Items -- Work-Groups -- Sub-Groups -- Writing Explicit ND-Range Data-Parallel Kernels -- Details of Explicit ND-Range Data-Parallel Kernels -- The nd_range Class -- The nd_item Class -- The group Class -- The sub_group Class -- Hierarchical Parallel Kernels -- Understanding Hierarchical Data-Parallel Kernels -- Writing Hierarchical Data-Parallel Kernels -- Details of Hierarchical Data-Parallel Kernels -- The h_item Class -- The private_memory Class -- Mapping Computation to Work-Items -- One-to-One Mapping -- Many-to-One Mapping -- Choosing a Kernel Form -- Summary -- Chapter 5: Error Handling -- Safety First -- Types of Errors -- Let's Create Some Errors! -- Synchronous Error -- Asynchronous Error -- Application Error Handling Strategy -- Ignoring Error Handling -- Synchronous Error Handling -- Asynchronous Error Handling -- The Asynchronous Handler -- Invocation of the Handler -- Errors on a Device -- Summary -- Chapter 6: Unified Shared Memory -- Why Should We Use USM? -- Allocation Types -- Device Allocations. 
505 8 |a Host Allocations -- Shared Allocations -- Allocating Memory -- What Do We Need to Know? -- Multiple Styles -- Allocations à la C -- Allocations à la C++ -- C++ Allocators -- Deallocating Memory -- Allocation Example -- Data Management -- Initialization -- Data Movement -- Explicit -- Implicit -- Migration -- Fine-Grained Control -- Queries -- Summary -- Chapter 7: Buffers -- Buffers -- Creation -- Buffer Properties -- use_host_ptr -- use_mutex -- context_bound -- What Can We Do with a Buffer? -- Accessors -- Accessor Creation -- What Can We Do with an Accessor? -- Summary -- Chapter 8: Scheduling Kernels and Data Movement -- What Is Graph Scheduling? -- How Graphs Work in DPC++ -- Command Group Actions -- How Command Groups Declare Dependences -- Examples -- When Are the Parts of a CG Executed? -- Data Movement -- Explicit -- Implicit -- Synchronizing with the Host -- Summary -- Chapter 9: Communication and Synchronization -- Work-Groups and Work-Items -- Building Blocks for Efficient Communication -- Synchronization via Barriers -- Work-Group Local Memory -- Using Work-Group Barriers and Local Memory -- Work-Group Barriers and Local Memory in ND-Range Kernels -- Local Accessors -- Synchronization Functions -- A Full ND-Range Kernel Example -- Work-Group Barriers and Local Memory in Hierarchical Kernels -- Scopes for Local Memory and Barriers -- A Full Hierarchical Kernel Example -- Sub-Groups -- Synchronization via Sub-Group Barriers -- Exchanging Data Within a Sub-Group -- A Full Sub-Group ND-Range Kernel Example -- Collective Functions -- Broadcast -- Votes -- Shuffles -- Loads and Stores -- Summary -- Chapter 10: Defining Kernels -- Why Three Ways to Represent a Kernel? -- Kernels As Lambda Expressions -- Elements of a Kernel Lambda Expression -- Naming Kernel Lambda Expressions -- Kernels As Named Function Objects. 
505 8 |a Elements of a Kernel Named Function Object -- Interoperability with Other APIs -- Interoperability with API-Defined Source Languages -- Interoperability with API-Defined Kernel Objects -- Kernels in Program Objects -- Summary -- Chapter 11: Vectors -- How to Think About Vectors -- Vector Types -- Vector Interface -- Load and Store Member Functions -- Swizzle Operations -- Vector Execution Within a Parallel Kernel -- Vector Parallelism -- Summary -- Chapter 12: Device Information -- Refining Kernel Code to Be More Prescriptive -- How to Enumerate Devices and Capabilities -- Custom Device Selector -- Being Curious: get_info<> -- Being More Curious: Detailed Enumeration Code -- Inquisitive: get_info<> -- Device Information Descriptors -- Device-Specific Kernel Information Descriptors -- The Specifics: Those of "Correctness" -- Device Queries -- Kernel Queries -- The Specifics: Those of "Tuning/Optimization" -- Device Queries -- Kernel Queries -- Runtime vs. Compile-Time Properties -- Summary -- Chapter 13: Practical Tips -- Getting a DPC++ Compiler and Code Samples -- Online Forum and Documentation -- Platform Model -- Multiarchitecture Binaries -- Compilation Model -- Adding SYCL to Existing C++ Programs -- Debugging -- Debugging Kernel Code -- Debugging Runtime Failures -- Initializing Data and Accessing Kernel Outputs -- Multiple Translation Units -- Performance Implications of Multiple Translation Units -- When Anonymous Lambdas Need Names -- Migrating from CUDA to SYCL -- Summary -- Chapter 14: Common Parallel Patterns -- Understanding the Patterns -- Map -- Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Using Built-In Functions and Libraries -- The DPC++ Reduction Library -- The reduction Class -- The reducer Class -- User-Defined Reductions -- oneAPI DPC++ Library -- Group Functions. 
505 8 |a Direct Programming -- Map -- Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Summary -- For More Information -- Chapter 15: Programming for GPUs -- Performance Caveats -- How GPUs Work -- GPU Building Blocks -- Simpler Processors (but More of Them) -- Expressing Parallelism -- Expressing More Parallelism -- Simplified Control Logic (SIMD Instructions) -- Predication and Masking -- SIMD Efficiency -- SIMD Efficiency and Groups of Items -- Switching Work to Hide Latency -- Offloading Kernels to GPUs -- SYCL Runtime Library -- GPU Software Drivers -- GPU Hardware -- Beware the Cost of Offloading! -- Transfers to and from Device Memory -- GPU Kernel Best Practices -- Accessing Global Memory -- Accessing Work-Group Local Memory -- Avoiding Local Memory Entirely with Sub-Groups -- Optimizing Computation Using Small Data Types -- Optimizing Math Functions -- Specialized Functions and Extensions -- Summary -- For More Information -- Chapter 16: Programming for CPUs -- Performance Caveats -- The Basics of a General-Purpose CPU -- The Basics of SIMD Hardware -- Exploiting Thread-Level Parallelism -- Thread Affinity Insight -- Be Mindful of First Touch to Memory -- SIMD Vectorization on CPU -- Ensure SIMD Execution Legality -- SIMD Masking and Cost -- Avoid Array-of-Struct for SIMD Efficiency -- Data Type Impact on SIMD Efficiency -- SIMD Execution Using single_task -- Summary -- Chapter 17: Programming for FPGAs -- Performance Caveats -- How to Think About FPGAs -- Pipeline Parallelism -- Kernels Consume Chip "Area" -- When to Use an FPGA -- Lots and Lots of Work -- Custom Operations or Operation Widths -- Scalar Data Flow -- Low Latency and Rich Connectivity -- Customized Memory Systems -- Running on an FPGA -- Compile Times -- The FPGA Emulator -- FPGA Hardware Compilation Occurs "Ahead-of-Time" -- Writing Kernels for FPGAs. 
505 8 |a Exposing Parallelism. 
588 |a Description based on publisher supplied metadata and other sources. 
590 |a Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2024. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.  
655 4 |a Electronic books. 
700 1 |a Ashbaugh, Ben. 
700 1 |a Brodman, James. 
700 1 |a Kinsner, Michael. 
700 1 |a Pennycook, John. 
700 1 |a Tian, Xinmin. 
776 0 8 |i Print version:  |a Reinders, James  |t Data Parallel C++  |d Berkeley, CA : Apress L. P., c2020  |z 9781484255735 
797 2 |a ProQuest (Firm) 
856 4 0 |u https://ebookcentral.proquest.com/lib/oeawat/detail.action?docID=6383586  |z Click to View