Table of Contents: Optimizing HPC Applications with Intel Cluster Tools

Optimizing HPC Applications with Intel Cluster Tools: Hunting Petaflops

Bibliographic Details
Main Author:	Supalov, Alexander.
Contributors:	Semin, Andrey. Dahnken, Christopher. Klemm, Michael.
Place / Publishing House:	Berkeley, CA : Apress L. P., 2014. ©2014.
Year of Publication:	2014
Edition:	1st ed.
Language:	English
Online Access:	https://ebookcentral.proquest.com/lib/oeawat/detail.action?docID=6422827
Physical Description:	1 online resource (291 pages)

Table of Contents:

Intro
Contents at a Glance
Contents
About the Authors
About the Technical Reviewers
Acknowledgments
Foreword
Introduction
Chapter 1: No Time to Read This Book?
Using Intel MPI Library
Using Intel Composer XE
Tuning Intel MPI Library
Gather Built-in Statistics
Optimize Process Placement
Optimize Thread Placement
Tuning Intel Composer XE
Analyze Optimization and Vectorization Reports
Use Interprocedural Optimization
Summary
References
Chapter 2: Overview of Platform Architectures
Performance Metrics and Targets
Latency, Throughput, Energy, and Power
Peak Performance as the Ultimate Limit
Scalability and Maximum Parallel Speedup
Bottlenecks and a Bit of Queuing Theory
Roofline Model
Performance Features of Computer Architectures
Increasing Single-Threaded Performance: Where You Can and Cannot Help
Process More Data with SIMD Parallelism
Distributed and Shared Memory Systems
Use More Independent Threads on the Same Node
Don't Limit Yourself to a Single Server
HPC Hardware Architecture Overview
A Multicore Workstation or a Server Compute Node
Coprocessor for Highly Parallel Applications
Group of Similar Nodes Form an HPC Cluster
Other Important Components of HPC Systems
Summary
References
Chapter 3: Top-Down Software Optimization
The Three Levels and Their Impact on Performance
System Level
Application Level
Working Against the Memory Wall
The Magic of Vectors
Distributed Memory Parallelization
Shared Memory Parallelization
Other Existing Approaches and Methods
Microarchitecture Level
Addressing Pipelines and Execution
Closed-Loop Methodology
Workload, Application, and Baseline
Iterating the Optimization Process
Summary
References
Chapter 4: Addressing System Bottlenecks
Classifying System-Level Bottlenecks
Identifying Issues Related to System Condition
Characterizing Problems Caused by System Configuration
Understanding System-Level Performance Limits
Checking General Compute Subsystem Performance
Testing Memory Subsystem Performance
Testing I/O Subsystem Performance
Characterizing Application System-Level Issues
Selecting Performance Characterization Tools
Monitoring the I/O Utilization
Analyzing Memory Bandwidth
Summary
References
Chapter 5: Addressing Application Bottlenecks: Distributed Memory
Algorithm for Optimizing MPI Performance
Comprehending the Underlying MPI Performance
Recalling Some Benchmarking Basics
Gauging Default Intranode Communication Performance
Gauging Default Internode Communication Performance
Discovering Default Process Layout and Pinning Details
Gauging Physical Core Performance
Doing Initial Performance Analysis
Is It Worth the Trouble?
Example 1: Initial HPL Performance Investigation
Getting an Overview of Scalability and Performance
Learning Application Behavior
Example 2: MiniFE Performance Investigation
Choosing Representative Workload(s)
Example 2 (cont.): MiniFE Performance Investigation
Balancing Process and Thread Parallelism
Example 2 (cont.): MiniFE Performance Investigation
Doing a Scalability Review
Example 2 (cont.): MiniFE Performance Investigation
Analyzing the Details of the Application Behavior
Example 2 (cont.): MiniFE Performance Investigation
Choosing the Optimization Objective
Detecting Load Imbalance
Example 2 (cont.): MiniFE Performance Investigation
Dealing with Load Imbalance
Classifying Load Imbalance
Addressing Load Imbalance
Example 2 (cont.): MiniFE Performance Investigation
Example 3: MiniMD Performance Investigation
Optimizing MPI Performance
Classifying the MPI Performance Issues
Addressing MPI Performance Issues
Mapping Application onto the Platform
Understanding Communication Paths
Selecting Proper Communication Fabrics
Using Scalable Datagrams
Specifying a Network Provider
Using IP over IB
Controlling the Fabric Fallback Mechanism
Using Multirail Capabilities
Detecting and Classifying Improper Process Layout and Pinning Issues
Controlling Process Layout
Controlling the Global Process Layout
Controlling the Detailed Process Layout
Setting the Environment Variables at All Levels
Controlling the Process Pinning
Controlling Memory and Network Affinity
Example 4: MiniMD Performance Investigation on Xeon Phi
Example 5: MiniGhost Performance Investigation
Tuning the Intel MPI Library
Tuning Intel MPI for the Platform
Tuning Point-to-Point Settings
Adjusting the Eager and Rendezvous Protocol Thresholds
Changing DAPL and DAPL UD Eager Protocol Threshold
Bypassing Shared Memory for Intranode Communication
Bypassing the Cache for Intranode Communication
Choosing the Best Collective Algorithms
Tuning Intel MPI Library for the Application
Using Magical Tips and Tricks
Disabling the Dynamic Connection Mode
Applying the Wait Mode to Oversubscribed Jobs
Fine-Tuning the Message-Passing Progress Engine
Reducing the Pre-reserved DAPL Memory Size
What Else?
Example 5 (cont.): MiniGhost Performance Investigation
Optimizing Application for Intel MPI
Avoiding MPI_ANY_SOURCE
Avoiding Superfluous Synchronization
Using Derived Datatypes
Using Collective Operations
Betting on the Computation/Communication Overlap
Replacing Blocking Collective Operations by MPI-3 Nonblocking Ones
Using Accelerated MPI File I/O
Example 5 (cont.): MiniGhost Performance Investigation
Using Advanced Analysis Techniques
Automatically Checking MPI Program Correctness
Comparing Application Traces
Instrumenting Application Code
Correlating MPI and Hardware Events
Collecting and Analyzing Hardware Counter Information in ITAC
Collecting and Analyzing Hardware Counter Information in VTune
Summary
References
Chapter 6: Addressing Application Bottlenecks: Shared Memory
Profiling Your Application
Using VTune Amplifier XE for Hotspots Profiling
Hotspots for the HPCG Benchmark
Compiler-Assisted Loop/Function Profiling
Sequential Code and Detecting Load Imbalances
Thread Synchronization and Locking
Dealing with Memory Locality and NUMA Effects
Thread and Process Pinning
Controlling OpenMP Thread Placement
Thread Placement in Hybrid Applications
Summary
References
Chapter 7: Addressing Application Bottlenecks: Microarchitecture
Overview of a Modern Processor Pipeline
Pipelined Execution
Data Conflicts
Control Conflicts
Structural Conflicts
Out-of-order vs. In-order Execution
Superscalar Pipelines
SIMD Execution
Speculative Execution: Branch Prediction
Memory Subsystem
Putting It All Together: A Final Look at the Sandy Bridge Pipeline
A Top-down Method for Categorizing the Pipeline Performance
Intel Composer XE Usage for Microarchitecture Optimizations
Basic Compiler Usage and Optimization
Using Optimization and Vectorization Reports to Read the Compiler's Mind
Optimizing for Vectorization
The AVX Instruction Set
Why Doesn't My Code Vectorize in the First Place?
Data Dependences
Data Aliasing
Array Notations
Vectorization Directives
ivdep
vector
simd
Understanding AVX: Intrinsic Programming
What Are Intrinsics?
First Steps: Loading and Storing
Arithmetic
Data Rearrangement
Dealing with Disambiguation
Dealing with Branches
__builtin_expect
Profile-Guided Optimization
Pragmas for Unrolling Loops and Inlining
unroll/nounroll
unroll_and_jam/nounroll_and_jam
inline, noinline, forceinline
Specialized Routines: How to Exploit the Branch Prediction for Maximal Performance
When Optimization Leads to Wrong Results
Using a Standard Library Method
Using a Manual Implementation in C
Vectorization with Directives
Analyzing Pipeline Performance with Intel VTune Amplifier XE
Summary
References
Chapter 8: Application Design Considerations
Abstraction and Generalization of the Platform Architecture
Types of Abstractions
Levels of Abstraction and Complexities
Raw Hardware vs. Virtualized Hardware in the Cloud
Questions about Application Design
Designing for Performance and Scaling
Designing for Flexibility and Performance Portability
Data Layout
Structured Approach to Express Parallelism
Understanding Bounds and Projecting Bottlenecks
Data Storage or Transfer vs. Recalculation
Total Productivity Assessment
Summary
References
Index
