Data Parallel C++ : Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL.

Bibliographic Details
Contributors: Ashbaugh, Ben; Brodman, James; Kinsner, Michael; Pennycook, John; Tian, Xinmin
Place / Publishing House: Berkeley, CA : Apress L. P., 2020.
©2021.
Year of Publication: 2020
Edition: 1st ed.
Language: English
Online Access: https://ebookcentral.proquest.com/lib/oeawat/detail.action?docID=6383586
Physical Description: 1 online resource (565 pages)
LEADER 11784nam a22004933i 4500
001 5006383586
003 MiAaPQ
005 20240229073836.0
006 m o d |
007 cr cnu||||||||
008 240229s2020 xx o ||||0 eng d
020 |a 9781484255742  |q (electronic bk.) 
020 |z 9781484255735 
035 |a (MiAaPQ)5006383586 
035 |a (Au-PeEL)EBL6383586 
035 |a (OCoLC)1204226016 
040 |a MiAaPQ  |b eng  |e rda  |e pn  |c MiAaPQ  |d MiAaPQ 
050 4 |a QA76.76.C65 
100 1 |a Reinders, James. 
245 1 0 |a Data Parallel C++ :  |b Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL. 
250 |a 1st ed. 
264 1 |a Berkeley, CA :  |b Apress L. P.,  |c 2020. 
264 4 |c ©2021. 
300 |a 1 online resource (565 pages) 
336 |a text  |b txt  |2 rdacontent 
337 |a computer  |b c  |2 rdamedia 
338 |a online resource  |b cr  |2 rdacarrier 
505 0 |a Intro -- Table of Contents -- About the Authors -- Preface -- Acknowledgments -- Chapter 1: Introduction -- Read the Book, Not the Spec -- SYCL 1.2.1 vs. SYCL 2020, and DPC++ -- Getting a DPC++ Compiler -- Book GitHub -- Hello, World! and a SYCL Program Dissection -- Queues and Actions -- It Is All About Parallelism -- Throughput -- Latency -- Think Parallel -- Amdahl and Gustafson -- Scaling -- Heterogeneous Systems -- Data-Parallel Programming -- Key Attributes of DPC++ and SYCL -- Single-Source -- Host -- Devices -- Sharing Devices -- Kernel Code -- Kernel: Vector Addition (DAXPY) -- Asynchronous Task Graphs -- Race Conditions When We Make a Mistake -- C++ Lambda Functions -- Portability and Direct Programming -- Concurrency vs. Parallelism -- Summary -- Chapter 2: Where Code Executes -- Single-Source -- Host Code -- Device Code -- Choosing Devices -- Method#1: Run on a Device of Any Type -- Queues -- Binding a Queue to a Device, When Any Device Will Do -- Method#2: Using the Host Device for Development and Debugging -- Method#3: Using a GPU (or Other Accelerators) -- Device Types -- Accelerator Devices -- Device Selectors -- When Device Selection Fails -- Method#4: Using Multiple Devices -- Method#5: Custom (Very Specific) Device Selection -- device_selector Base Class -- Mechanisms to Score a Device -- Three Paths to Device Code Execution on CPU -- Creating Work on a Device -- Introducing the Task Graph -- Where Is the Device Code? -- Actions -- Fallback -- Summary -- Chapter 3: Data Management -- Introduction -- The Data Management Problem -- Device Local vs. Device Remote -- Managing Multiple Memories -- Explicit Data Movement -- Implicit Data Movement -- Selecting the Right Strategy -- USM, Buffers, and Images -- Unified Shared Memory -- Accessing Memory Through Pointers -- USM and Data Movement -- Explicit Data Movement in USM. 
505 8 |a Implicit Data Movement in USM -- Buffers -- Creating Buffers -- Accessing Buffers -- Access Modes -- Ordering the Uses of Data -- In-order Queues -- Out-of-Order (OoO) Queues -- Explicit Dependences with Events -- Implicit Dependences with Accessors -- Choosing a Data Management Strategy -- Handler Class: Key Members -- Summary -- Chapter 4: Expressing Parallelism -- Parallelism Within Kernels -- Multidimensional Kernels -- Loops vs. Kernels -- Overview of Language Features -- Separating Kernels from Host Code -- Different Forms of Parallel Kernels -- Basic Data-Parallel Kernels -- Understanding Basic Data-Parallel Kernels -- Writing Basic Data-Parallel Kernels -- Details of Basic Data-Parallel Kernels -- The range Class -- The id Class -- The item Class -- Explicit ND-Range Kernels -- Understanding Explicit ND-Range Parallel Kernels -- Work-Items -- Work-Groups -- Sub-Groups -- Writing Explicit ND-Range Data-Parallel Kernels -- Details of Explicit ND-Range Data-Parallel Kernels -- The nd_range Class -- The nd_item Class -- The group Class -- The sub_group Class -- Hierarchical Parallel Kernels -- Understanding Hierarchical Data-Parallel Kernels -- Writing Hierarchical Data-Parallel Kernels -- Details of Hierarchical Data-Parallel Kernels -- The h_item Class -- The private_memory Class -- Mapping Computation to Work-Items -- One-to-One Mapping -- Many-to-One Mapping -- Choosing a Kernel Form -- Summary -- Chapter 5: Error Handling -- Safety First -- Types of Errors -- Let's Create Some Errors! -- Synchronous Error -- Asynchronous Error -- Application Error Handling Strategy -- Ignoring Error Handling -- Synchronous Error Handling -- Asynchronous Error Handling -- The Asynchronous Handler -- Invocation of the Handler -- Errors on a Device -- Summary -- Chapter 6: Unified Shared Memory -- Why Should We Use USM? -- Allocation Types -- Device Allocations. 
505 8 |a Host Allocations -- Shared Allocations -- Allocating Memory -- What Do We Need to Know? -- Multiple Styles -- Allocations à la C -- Allocations à la C++ -- C++ Allocators -- Deallocating Memory -- Allocation Example -- Data Management -- Initialization -- Data Movement -- Explicit -- Implicit -- Migration -- Fine-Grained Control -- Queries -- Summary -- Chapter 7: Buffers -- Buffers -- Creation -- Buffer Properties -- use_host_ptr -- use_mutex -- context_bound -- What Can We Do with a Buffer? -- Accessors -- Accessor Creation -- What Can We Do with an Accessor? -- Summary -- Chapter 8: Scheduling Kernels and Data Movement -- What Is Graph Scheduling? -- How Graphs Work in DPC++ -- Command Group Actions -- How Command Groups Declare Dependences -- Examples -- When Are the Parts of a CG Executed? -- Data Movement -- Explicit -- Implicit -- Synchronizing with the Host -- Summary -- Chapter 9: Communication and Synchronization -- Work-Groups and Work-Items -- Building Blocks for Efficient Communication -- Synchronization via Barriers -- Work-Group Local Memory -- Using Work-Group Barriers and Local Memory -- Work-Group Barriers and Local Memory in ND-Range Kernels -- Local Accessors -- Synchronization Functions -- A Full ND-Range Kernel Example -- Work-Group Barriers and Local Memory in Hierarchical Kernels -- Scopes for Local Memory and Barriers -- A Full Hierarchical Kernel Example -- Sub-Groups -- Synchronization via Sub-Group Barriers -- Exchanging Data Within a Sub-Group -- A Full Sub-Group ND-Range Kernel Example -- Collective Functions -- Broadcast -- Votes -- Shuffles -- Loads and Stores -- Summary -- Chapter 10: Defining Kernels -- Why Three Ways to Represent a Kernel? -- Kernels As Lambda Expressions -- Elements of a Kernel Lambda Expression -- Naming Kernel Lambda Expressions -- Kernels As Named Function Objects. 
505 8 |a Elements of a Kernel Named Function Object -- Interoperability with Other APIs -- Interoperability with API-Defined Source Languages -- Interoperability with API-Defined Kernel Objects -- Kernels in Program Objects -- Summary -- Chapter 11: Vectors -- How to Think About Vectors -- Vector Types -- Vector Interface -- Load and Store Member Functions -- Swizzle Operations -- Vector Execution Within a Parallel Kernel -- Vector Parallelism -- Summary -- Chapter 12: Device Information -- Refining Kernel Code to Be More Prescriptive -- How to Enumerate Devices and Capabilities -- Custom Device Selector -- Being Curious: get_info<> -- Being More Curious: Detailed Enumeration Code -- Inquisitive: get_info<> -- Device Information Descriptors -- Device-Specific Kernel Information Descriptors -- The Specifics: Those of "Correctness" -- Device Queries -- Kernel Queries -- The Specifics: Those of "Tuning/Optimization" -- Device Queries -- Kernel Queries -- Runtime vs. Compile-Time Properties -- Summary -- Chapter 13: Practical Tips -- Getting a DPC++ Compiler and Code Samples -- Online Forum and Documentation -- Platform Model -- Multiarchitecture Binaries -- Compilation Model -- Adding SYCL to Existing C++ Programs -- Debugging -- Debugging Kernel Code -- Debugging Runtime Failures -- Initializing Data and Accessing Kernel Outputs -- Multiple Translation Units -- Performance Implications of Multiple Translation Units -- When Anonymous Lambdas Need Names -- Migrating from CUDA to SYCL -- Summary -- Chapter 14: Common Parallel Patterns -- Understanding the Patterns -- Map -- Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Using Built-In Functions and Libraries -- The DPC++ Reduction Library -- The reduction Class -- The reducer Class -- User-Defined Reductions -- oneAPI DPC++ Library -- Group Functions. 
505 8 |a Direct Programming -- Map -- Stencil -- Reduction -- Scan -- Pack and Unpack -- Pack -- Unpack -- Summary -- For More Information -- Chapter 15: Programming for GPUs -- Performance Caveats -- How GPUs Work -- GPU Building Blocks -- Simpler Processors (but More of Them) -- Expressing Parallelism -- Expressing More Parallelism -- Simplified Control Logic (SIMD Instructions) -- Predication and Masking -- SIMD Efficiency -- SIMD Efficiency and Groups of Items -- Switching Work to Hide Latency -- Offloading Kernels to GPUs -- SYCL Runtime Library -- GPU Software Drivers -- GPU Hardware -- Beware the Cost of Offloading! -- Transfers to and from Device Memory -- GPU Kernel Best Practices -- Accessing Global Memory -- Accessing Work-Group Local Memory -- Avoiding Local Memory Entirely with Sub-Groups -- Optimizing Computation Using Small Data Types -- Optimizing Math Functions -- Specialized Functions and Extensions -- Summary -- For More Information -- Chapter 16: Programming for CPUs -- Performance Caveats -- The Basics of a General-Purpose CPU -- The Basics of SIMD Hardware -- Exploiting Thread-Level Parallelism -- Thread Affinity Insight -- Be Mindful of First Touch to Memory -- SIMD Vectorization on CPU -- Ensure SIMD Execution Legality -- SIMD Masking and Cost -- Avoid Array-of-Struct for SIMD Efficiency -- Data Type Impact on SIMD Efficiency -- SIMD Execution Using single_task -- Summary -- Chapter 17: Programming for FPGAs -- Performance Caveats -- How to Think About FPGAs -- Pipeline Parallelism -- Kernels Consume Chip "Area" -- When to Use an FPGA -- Lots and Lots of Work -- Custom Operations or Operation Widths -- Scalar Data Flow -- Low Latency and Rich Connectivity -- Customized Memory Systems -- Running on an FPGA -- Compile Times -- The FPGA Emulator -- FPGA Hardware Compilation Occurs "Ahead-of-Time" -- Writing Kernels for FPGAs. 
505 8 |a Exposing Parallelism. 
588 |a Description based on publisher supplied metadata and other sources. 
590 |a Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2024. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.  
655 4 |a Electronic books. 
700 1 |a Ashbaugh, Ben. 
700 1 |a Brodman, James. 
700 1 |a Kinsner, Michael. 
700 1 |a Pennycook, John. 
700 1 |a Tian, Xinmin. 
776 0 8 |i Print version:  |a Reinders, James  |t Data Parallel C++  |d Berkeley, CA : Apress L. P., c2020  |z 9781484255735 
797 2 |a ProQuest (Firm) 
856 4 0 |u https://ebookcentral.proquest.com/lib/oeawat/detail.action?docID=6383586  |z Click to View