TL;DR
- Thread and block configurations significantly impact CUDA kernel performance
- Power-of-2 sizes generally provide the most efficient execution patterns
- Warp alignment (multiples of 32 threads) optimizes memory coalescing
- Block size involves trade-offs between thread cooperation and SM distribution
- Visual analysis helps understand how different configurations affect execution
Interactive Exploration
Before diving into the technical details, try out our interactive CUDA visualizer to get a hands-on understanding of thread and block patterns.
CUDA Architecture Overview
Before diving into configurations, let’s understand key CUDA concepts:
Thread Hierarchy
Grid
├── Block 0
│   ├── Thread (0,0)
│   ├── Thread (0,1)
│   └── ...
├── Block 1
│   ├── Thread (1,0)
│   └── ...
└── ...
- Threads: Basic execution units
- Warps: Groups of 32 threads executed simultaneously
- Blocks: Collections of threads that can cooperate
- Grid: Collection of blocks executing the same kernel
Hardware Constraints
- Maximum threads per block: 1024
- Warp size: 32 threads
- Typical SM can handle multiple blocks simultaneously
- Memory access is coalesced within warps
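These limits can also be confirmed at runtime. The sketch below is not from the original setup; it simply queries the relevant fields with cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("Shared memory per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Registers per SM:      %d\n", prop.regsPerMultiprocessor);
    return 0;
}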
CUDA Architecture Deep Dive
Hardware Architecture (SM89)
Streaming Multiprocessor (SM)
├── Warp Schedulers: 4 per SM
├── CUDA Cores: 128 per SM
├── Combined L1 Data Cache / Shared Memory: 128KB per SM
│   └── Shared memory carve-out configurable up to 100KB
└── Register File: 64K 32-bit registers
Memory Hierarchy
Device Memory (Global Memory)
├── L2 Cache (Shared by all SMs)
│   └── Cache Line Size: 128 bytes
├── L1 Cache (Per SM)
│   └── Cache Line Size: 128 bytes
├── Shared Memory (Per SM)
│   └── Bank Width: 32-bit
└── Register File (Per SM)
    └── Access Latency: 1 cycle
Theoretical Performance Metrics
- Global Memory Bandwidth: 912 GB/s
- Shared Memory Bandwidth: ~19 TB/s per SM
- Register Bandwidth: ~39 TB/s per SM
- Warp Scheduling Rate: 1 instruction per clock per scheduler
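These figures are specific to the test GPU. As a sanity check, the theoretical global memory bandwidth can be derived from the device properties using the usual double-data-rate approximation. The snippet below is an illustrative sketch, not part of the original setup (note that memoryClockRate is deprecated in recent CUDA releases but still populated):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // memoryClockRate is reported in kHz, memoryBusWidth in bits.
    // Peak GB/s ~= 2 (DDR) * memory clock * bus width in bytes
    double peakGBs = 2.0 * prop.memoryClockRate * 1e3
                   * (prop.memoryBusWidth / 8.0) / 1e9;
    printf("Theoretical peak global memory bandwidth: %.0f GB/s\n", peakGBs);
    return 0;
}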
Experimental Setup
Our test kernel uses unified memory and records detailed execution patterns:
// Structure to store thread processing info
struct ThreadInfo {
    int index;      // Global array index
    int blockId;    // Block identifier
    int threadId;   // Thread identifier within block
    float value;    // Computed value
};

__global__
void add(int n, float *x, float *y, ThreadInfo *info) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        y[idx] += x[idx];

        // Record thread/block information
        info[idx].index    = idx;
        info[idx].blockId  = blockIdx.x;
        info[idx].threadId = threadIdx.x;
        info[idx].value    = y[idx];
    }
}
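The host side is not shown in the article; a minimal sketch using unified memory might look like the following (the initialization values are assumptions for illustration):

int main() {
    const int N = 256;
    float *x, *y;
    ThreadInfo *info;

    // Unified memory: accessible from both host and device
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    cudaMallocManaged(&info, N * sizeof(ThreadInfo));

    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int blockSize = 32;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y, info);
    cudaDeviceSynchronize();   // results and info[] are now valid on the host

    cudaFree(x);
    cudaFree(y);
    cudaFree(info);
    return 0;
}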
1. Power-of-2 Configurations
Standard Case (N=256, Block=32)
// Launch configuration
int N = 256;
int blockSize = 32; // One warp
int numBlocks = (N + blockSize - 1) / blockSize; // 8 blocks
add<<<numBlocks, blockSize>>>(N, x, y, info);
Technical Analysis:
- Block size matches warp size (32 threads)
- Perfect memory coalescing within warps
- 8 blocks distribute evenly across SMs
- 100% thread utilization: 256/(32*8) = 1.0
Performance Implications:
- Optimal warp execution efficiency
- No partial warps
- Good SM occupancy
- Minimal scheduling overhead
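To quantify these effects rather than reason about them qualitatively, each launch can be timed with CUDA events. This sketch is not the article's measurement harness; it reuses the variables from the host-side sketch in the experimental setup:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
add<<<numBlocks, blockSize>>>(N, x, y, info);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds
printf("blockSize=%d: %.3f ms\n", blockSize, ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);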
Large Blocks (N=1024, Block=256)
// 256 threads per block = 8 warps per block
add<<<4, 256>>>(1024, x, y, info);
Technical Analysis:
- 8 warps per block (256/32)
- 4 blocks total
- Higher register pressure per block
- More thread cooperation possible within blocks
Performance Considerations:
- May limit SM occupancy due to resource usage
- Better for compute-bound kernels with thread cooperation
- Reduced block scheduling overhead
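Whether 256-thread blocks actually limit occupancy for a given kernel can be checked with the occupancy API. The call below is a sketch assuming the add kernel from the experimental setup:

int maxBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &maxBlocksPerSM,   // resident blocks per SM for this kernel
    add,               // the kernel defined above
    256,               // threads per block
    0);                // dynamic shared memory per block (bytes)

printf("Resident blocks per SM: %d (%d resident threads)\n",
       maxBlocksPerSM, maxBlocksPerSM * 256);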
2. Small Block Configurations
Minimal Blocks (N=100, Block=8)
// Sub-warp block size
add<<<13, 8>>>(100, x, y, info);
Technical Deep Dive:
- Block size (8) is 1/4 of a warp
- Each warp underutilized (75% idle threads)
- 13 blocks for 100 elements
- Thread utilization: 100/(8*13) ≈ 96.2%
Performance Impact:
- Poor warp execution efficiency
- Higher block scheduling overhead
- Better load balancing across SMs
- Worse memory coalescing
3. Prime Numbers and Odd Sizes
Prime Array Size (N=97, Block=32)
// Prime number of elements with warp-sized blocks
add<<<4, 32>>>(97, x, y, info);
Warp-Level Analysis:
- Full warps except last block
- Last block: 97 - (3*32) = 1 element
- Last warp: 31/32 threads idle
- Overall thread utilization: 97/(32*4) ≈ 75.8%
Memory Access Patterns:
- First 3 blocks: fully coalesced access
- Last block: only 1 of its 32 lanes is active, so the transaction it issues is almost entirely wasted
- Divergence comes only from the bounds check; the inactive lanes are simply masked off
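The utilization figures quoted in sections 2 and 3 follow directly from the launch arithmetic. A small host-side helper (hypothetical, for illustration only) reproduces them:

#include <cstdio>

// Utilization = useful elements / total launched threads
void reportUtilization(int n, int blockSize) {
    int numBlocks = (n + blockSize - 1) / blockSize;
    double util = static_cast<double>(n) / (numBlocks * blockSize);
    printf("N=%d, block=%d -> %d blocks, utilization %.1f%%\n",
           n, blockSize, numBlocks, 100.0 * util);
}

int main() {
    reportUtilization(256, 32);   // 100%
    reportUtilization(100, 8);    // ~96.2% (sub-warp blocks still waste warp lanes)
    reportUtilization(97, 32);    // ~75.8%
    return 0;
}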
4. Memory Access Pattern Visualizations
Coalesced vs Non-Coalesced Access
Coalesced Access (Efficient):
Thread    0    1    2    3  ...   31     // One warp
Memory   [0]  [1]  [2]  [3] ...  [31]    // One transaction
         └──────────── 128B ─────────┘

Non-Coalesced Access (Inefficient):
Thread    0    1    2    3  ...   31     // One warp
Memory   [0] [32] [64] [96] ... [992]    // Multiple transactions
          ↑    ↑    ↑    ↑         ↑
          └────┴────┴────┴── ... ──┘
            32 separate transactions
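As a concrete illustration (not from the original article), the two kernels below generate exactly these access patterns; launching stridedRead with stride = 32 reproduces the non-coalesced case:

__global__ void coalescedRead(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx];                  // lane i reads word i: one 128B transaction per warp
}

__global__ void stridedRead(const float *in, float *out, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[(idx * stride) % n];   // lane i reads word i*stride: up to 32 transactions per warp
}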
Memory Bank Access Patterns
Shared Memory Banks (32 Banks):
Bank0   Bank1   Bank2   Bank3   ...   Bank31
 [0]     [1]     [2]     [3]    ...    [31]
 [32]    [33]    [34]    [35]   ...    [63]
 [64]    [65]    [66]    [67]   ...    [95]

Sequential Access (No Conflicts):
Thread0  → Bank0  [0]
Thread1  → Bank1  [1]
Thread2  → Bank2  [2]
...
Thread31 → Bank31 [31]
Strided Access (stride 2, 2-way Bank Conflicts):
Thread0  → Bank0  [0]
Thread16 → Bank0  [32]  ⚠️ Conflict!
Thread1  → Bank2  [2]
Thread17 → Bank2  [34]  ⚠️ Conflict!
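Bank conflicts typically appear when a shared-memory tile is read by column. The classic fix is to pad each row by one element; the transpose sketch below (an illustrative example, assuming a square matrix whose side is a multiple of 32 and a 32×32 thread block) shows the technique:

__global__ void transposeTile(const float *in, float *out) {
    // +1 column of padding shifts each row by one bank,
    // so a column read touches 32 different banks instead of one.
    __shared__ float tile[32][32 + 1];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    int width = gridDim.x * 32;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced row-wise load
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;                    // transposed block offset
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column-wise read, conflict-free with padding
}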
Performance Optimization Guidelines
- Warp Alignment

  // Prefer warp-aligned block sizes
  blockSize = 32 * k;   // where k = 1, 2, 4, 8, ...

- Occupancy Optimization (see the occupancy sketch after this list)
  - Balance block size vs. number of blocks
  - Consider register usage
  - Account for shared memory requirements

- Memory Access Patterns

  // Ensure coalesced access within warps
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Access pattern follows the thread index
  y[idx] += x[idx];

- Block Size Selection

  // Rule of thumb for compute-bound kernels
  const int threadsPerBlockCompute = 256;   // 8 warps
  // Rule of thumb for memory-bound kernels
  const int threadsPerBlockMemory  = 128;   // 4 warps
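For a first pass, the CUDA runtime can also suggest a block size directly. The sketch below uses the occupancy API with the add kernel and the N, x, y, info variables from the experimental setup:

int minGridSize = 0;   // minimum grid size needed to reach full occupancy
int blockSize   = 0;   // suggested threads per block for this kernel
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, add, 0, 0);

int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y, info);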
Hands-On Learning
To solidify your understanding of these concepts, experiment with our interactive visualizer:
→ Try Different Configurations
Observe how different grid and block sizes affect:
- Thread utilization efficiency
- Memory coalescing potential
- Warp-level execution patterns
Conclusions
This analysis reveals several key insights for CUDA optimization:
- Block Size Selection
  - Match warp size (32) or multiples of it
  - Consider resource limits
  - Balance with total thread count
- Thread Utilization
  - Power-of-2 sizes optimize efficiency
  - Handle irregular sizes carefully
  - Consider warp-level effects
- Performance Trade-offs
  - Block size affects resource usage
  - Thread count impacts memory access patterns
  - Configuration affects SM utilization
These insights help in selecting optimal configurations for different CUDA workloads, balancing factors like:
- Memory access patterns
- Thread cooperation needs
- SM utilization
- Scheduling overhead
What’s Next?
In upcoming articles, I’ll expand on:
- Dynamic parallelism strategies
- Tensor Core utilization for AI inference
- Memory throughput optimization techniques
- Register pressure vs. occupancy trade-offs