TL;DR

  • Thread and block configurations significantly impact CUDA kernel performance
  • Power-of-2 sizes generally provide the most efficient execution patterns
  • Warp alignment (multiples of 32 threads) optimizes memory coalescing
  • Block size involves trade-offs between thread cooperation and SM distribution
  • Visual analysis helps understand how different configurations affect execution

Interactive Exploration

Before diving into the technical details, try out our interactive CUDA visualizer to get a hands-on understanding of thread and block patterns:

Block Visualizer

CUDA Architecture Overview

Before looking at specific configurations, let’s review the key CUDA concepts:

Thread Hierarchy

Grid
├── Block 0
│   ├── Thread (0,0)
│   ├── Thread (0,1)
│   └── ...
├── Block 1
│   ├── Thread (1,0)
│   └── ...
└── ...
  • Threads: Basic execution units
  • Warps: Groups of 32 threads executed simultaneously
  • Blocks: Collections of threads that can cooperate
  • Grid: Collection of blocks executing the same kernel

Hardware Constraints

  • Maximum threads per block: 1024
  • Warp size: 32 threads
  • A typical SM can run multiple blocks concurrently
  • Memory access is coalesced within warps

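These limits can be confirmed at runtime rather than memorized. A minimal sketch, assuming a single CUDA device at index 0:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("Max blocks per SM:     %d\n", prop.maxBlocksPerMultiProcessor);
    return 0;
}
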
CUDA Architecture Deep Dive

Hardware Architecture (SM89)

Streaming Multiprocessor (SM)
├── Warp Schedulers: 4 per SM
├── CUDA Cores: 128 per SM
├── Shared Memory: 64KB per SM
├── L1 Cache: 128KB per SM
└── Register File: 64K 32-bit registers

Memory Hierarchy

Device Memory (Global Memory)
├── L2 Cache (Shared by all SMs)
│   └── Cache Line Size: 128 bytes
├── L1 Cache (Per SM)
│   └── Cache Line Size: 128 bytes
├── Shared Memory (Per SM)
│   └── Bank Width: 32-bit
└── Register File (Per SM)
    └── Access Latency: 1 cycle

Theoretical Performance Metrics

  • Global Memory Bandwidth: 912 GB/s
  • Shared Memory Bandwidth: ~19 TB/s per SM
  • Register Bandwidth: ~39 TB/s per SM
  • Warp Scheduling Rate: 1 instruction per clock per scheduler

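The global-memory figure can be reproduced approximately from device properties as 2 × memory clock × (bus width / 8). A sketch, assuming the memoryClockRate and memoryBusWidth fields are still exposed by your CUDA toolkit (they are deprecated in recent releases):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // memoryClockRate is reported in kHz, memoryBusWidth in bits;
    // the factor of 2 accounts for double-data-rate memory.
    double peakGBs = 2.0 * prop.memoryClockRate * 1e3
                   * (prop.memoryBusWidth / 8.0) / 1e9;
    printf("Theoretical peak global memory bandwidth: %.0f GB/s\n", peakGBs);
    return 0;
}
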
Experimental Setup

Our test kernel uses unified memory and records detailed execution patterns:

// Structure to store thread processing info
struct ThreadInfo {
    int index;      // Global array index
    int blockId;    // Block identifier
    int threadId;   // Thread identifier within block
    float value;    // Computed value
};
 
__global__
void add(int n, float *x, float *y, ThreadInfo *info) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        y[idx] += x[idx];
        // Record thread/block information
        info[idx].index = idx;
        info[idx].blockId = blockIdx.x;
        info[idx].threadId = threadIdx.x;
        info[idx].value = y[idx];
    }
}

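The buffers are allocated with unified memory so the host can read back both the results and the recorded ThreadInfo entries. A sketch of the host-side setup (initial values are illustrative; the launch configurations themselves are explored in the sections below):

int N = 256;
float *x, *y;
ThreadInfo *info;

// Unified (managed) memory: one pointer valid on both host and device
cudaMallocManaged(&x, N * sizeof(float));
cudaMallocManaged(&y, N * sizeof(float));
cudaMallocManaged(&info, N * sizeof(ThreadInfo));

for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

// ... kernel launch (see the configurations below) ...

cudaDeviceSynchronize();   // make y[] and info[] visible to the host
cudaFree(x); cudaFree(y); cudaFree(info);
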
1. Power-of-2 Configurations

Standard Case (N=256, Block=32)

// Launch configuration
int N = 256;
int blockSize = 32;  // One warp
int numBlocks = (N + blockSize - 1) / blockSize;  // 8 blocks
add<<<numBlocks, blockSize>>>(N, x, y, info);

Technical Analysis:

  • Block size matches warp size (32 threads)
  • Perfect memory coalescing within warps
  • 8 blocks distribute evenly across SMs
  • 100% thread utilization: 256/(32*8) = 1.0

Performance Implications:

  • Optimal warp execution efficiency
  • No partial warps
  • Good SM occupancy
  • Minimal scheduling overhead

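One practical hedge worth wrapping around any of these launches (omitted from the listings for brevity) is an explicit error check; an invalid configuration otherwise fails silently:

add<<<numBlocks, blockSize>>>(N, x, y, info);

cudaError_t err = cudaGetLastError();   // catches invalid launch configurations
if (err != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();          // surfaces faults during kernel execution
if (err != cudaSuccess)
    printf("Kernel error: %s\n", cudaGetErrorString(err));
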
Large Blocks (N=1024, Block=256)

// 256 threads per block = 8 warps per block
add<<<4, 256>>>(1024, x, y, info);

Technical Analysis:

  • 8 warps per block (256/32)
  • 4 blocks total
  • Higher register pressure per block
  • More thread cooperation possible within blocks

Performance Considerations:

  • May limit SM occupancy due to resource usage
  • Better for compute-bound kernels with thread cooperation
  • Reduced block scheduling overhead

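Whether 256-thread blocks actually limit occupancy depends on the compiled kernel's per-thread register count and per-block shared memory. These can be queried at runtime; a sketch using cudaFuncGetAttributes on the add kernel above:

cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, add);
printf("Registers per thread:        %d\n", attr.numRegs);
printf("Static shared mem per block: %zu bytes\n", attr.sharedSizeBytes);
printf("Max threads per block:       %d\n", attr.maxThreadsPerBlock);
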
2. Small Block Configurations

Minimal Blocks (N=100, Block=8)

// Sub-warp block size
add<<<13, 8>>>(100, x, y, info);

Technical Deep Dive:

  • Block size (8) is 1/4 of a warp
  • Each warp is underutilized: only 8 of its 32 lanes are active (75% idle)
  • 13 blocks for 100 elements
  • Thread utilization: 100/(8*13) ≈ 96.2%

Performance Impact:

  • Poor warp execution efficiency
  • Higher block scheduling overhead
  • Better load balancing across SMs
  • Worse memory coalescing

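The utilization figures quoted in these breakdowns are easy to reproduce. A small host-side sketch (the helper name is ours, not part of the test code); the second call anticipates the prime-size case in the next section:

#include <cstdio>

// Fraction of launched threads that actually process an element
double threadUtilization(int n, int blockSize) {
    int numBlocks = (n + blockSize - 1) / blockSize;
    return (double)n / (numBlocks * blockSize);
}

int main() {
    printf("N=100, block=8  -> %.1f%%\n", 100.0 * threadUtilization(100, 8));   // ~96.2%
    printf("N=97,  block=32 -> %.1f%%\n", 100.0 * threadUtilization(97, 32));   // ~75.8%
    return 0;
}
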
3. Prime Numbers and Odd Sizes

Prime Array Size (N=97, Block=32)

// Prime number of elements with warp-sized blocks
add<<<4, 32>>>(97, x, y, info);

Warp-Level Analysis:

  • Full warps except last block
  • Last block: 97 - (3*32) = 1 element
  • Last warp: 31/32 threads idle
  • Overall thread utilization: 97/(32*4) ≈ 75.8%

Memory Access Patterns:

  • First 3 blocks: fully coalesced access
  • Last block: only one thread passes the bounds check, so its warp is almost entirely idle
  • The tail's single active thread still costs a full memory transaction for one element, wasting bandwidth

4. Memory Access Pattern Visualizations

Coalesced vs Non-Coalesced Access

Coalesced Access (Efficient):
Thread   0   1   2   3  ...  31    // One warp
Memory  [0] [1] [2] [3] ... [31]   // One transaction
         └───────────128B───────┘

Non-Coalesced Access (Inefficient):
Thread   0   1   2   3  ...  31    // One warp
Memory  [0] [32][64][96]... [992]  // Multiple transactions
         ↑   ↑   ↑   ↑      ↑
         └───┴───┴───┴──...─┘
         32 separate transactions

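In kernel code, the two patterns above look like the following (illustrative only; the stride of 32 matches the diagram, and the strided version assumes the input buffer holds at least 32*n elements):

// Coalesced: consecutive threads read consecutive addresses, so a warp's
// 32 float loads fall into a single 128-byte segment.
__global__ void copyCoalesced(int n, const float *in, float *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx];
}

// Strided: consecutive threads read addresses 32 elements apart, so every
// load in the warp lands in a different 128-byte segment.
__global__ void copyStrided(int n, const float *in, float *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx * 32];
}
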
Memory Bank Access Patterns

Shared Memory Banks (32 Banks):
Bank0  Bank1  Bank2  Bank3 ... Bank31
[0]    [1]    [2]    [3]  ... [31]
[32]   [33]   [34]   [35] ... [63]
[64]   [65]   [66]   [67] ... [95]

Sequential Access (No Conflicts):
Thread0 → Bank0  [0]
Thread1 → Bank1  [1]
Thread2 → Bank2  [2]
...
Thread31→ Bank31 [31]

Strided Access, stride 2 (2-way Bank Conflicts):
Thread0  → Bank0  [0]
Thread1  → Bank2  [2]
...
Thread16 → Bank0  [32] ⚠️ Conflicts with Thread0!
Thread17 → Bank2  [34] ⚠️ Conflicts with Thread1!

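In code, the difference between the two shared-memory patterns looks like this (a sketch; the kernel assumes a launch with exactly one warp per block, e.g. bankDemo<<<1, 32>>>(out)):

__global__ void bankDemo(float *out) {
    __shared__ float tile[64];
    int t = threadIdx.x;

    tile[t]      = (float)t;          // fill the tile: two words per thread
    tile[t + 32] = (float)(t + 32);
    __syncthreads();

    // Conflict-free: thread t reads word t, which lives in bank t
    float a = tile[t];

    // 2-way conflict: thread t reads word 2*t, i.e. bank (2*t) % 32,
    // so threads t and t+16 hit the same bank and serialize
    float b = tile[2 * t];

    out[t] = a + b;
}
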
Performance Optimization Guidelines

  1. Warp Alignment

    // Prefer warp-aligned block sizes
    blockSize = 32 * k;  // where k is 1, 2, 4, or 8
  2. Occupancy Optimization

    • Balance block size vs. number of blocks
    • Consider register usage
    • Account for shared memory requirements
  3. Memory Access Patterns

    // Ensure coalesced access within warps
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Access pattern follows thread index
    y[idx] += x[idx];
  4. Block Size Selection

    // Rule of thumb for compute-bound kernels
    const int threadsPerBlock = 256;  // 8 warps
    // Rule of thumb for memory-bound kernels
    const int threadsPerBlock = 128;  // 4 warps

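For guideline 2, the runtime can suggest a block size that maximizes theoretical occupancy for a specific kernel. A sketch using the occupancy API on the add kernel above (the two trailing zeros mean no dynamic shared memory and no block-size limit):

int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, add, 0, 0);

int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y, info);
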
Hands-On Learning

To solidify your understanding of these concepts, experiment with our interactive visualizer:

→ Try Different Configurations

Observe how different grid and block sizes affect:

  • Thread utilization efficiency
  • Memory coalescing potential
  • Warp-level execution patterns

Conclusions

This analysis reveals several key insights for CUDA optimization:

  1. Block Size Selection

    • Match warp size (32) or multiples
    • Consider resource limits
    • Balance with total thread count
  2. Thread Utilization

    • Power-of-2 sizes optimize efficiency
    • Handle irregular sizes carefully
    • Consider warp-level effects
  3. Performance Trade-offs

    • Block size affects resource usage
    • Thread count impacts memory patterns
    • Configuration affects SM utilization

These insights help in selecting optimal configurations for different CUDA workloads, balancing factors like:

  • Memory access patterns
  • Thread cooperation needs
  • SM utilization
  • Scheduling overhead

What’s Next?

In upcoming articles, I’ll expand on:

  1. Dynamic parallelism strategies
  2. Tensor Core utilization for AI inference
  3. Memory throughput optimization techniques
  4. Register pressure vs. occupancy trade-offs