Demystifying CuTe Layouts: From Fundamentals to Nsight Profiling

A hands-on walkthrough of layout algebra with CuTe and of profiling GPU memory accesses.

Introduction

I recently got nerd-sniped by NVIDIA’s announcement of Python support for CuTe / CUTLASS at GTC 2025, so I am setting out on a worklog to implement a SOTA attention mechanism using CUTLASS / CuTe.

I recommend that you follow along and try to implement each step yourself. For this worklog I will use only an RTX 4090, although at some point I might bite the bullet and level up to an H100/B200.

We will learn by working through a specific set of tasks that progressively level us up. The end goal of this worklog is to implement one of the SOTA attention kernels in CUTLASS.

Part 1: Understanding Basic Layouts in CuTe

Task 1 – Row-Major vs Column-Major Layouts

CuTe uses a clean DSL for expressing tensor layouts. You can define layouts using:

make_layout(make_shape(...), make_stride(...))

Row-Major vs Column-Major Example

auto rm_layout = make_layout(make_shape(4, 4), make_stride(4, 1));  // row-major
auto cm_layout = make_layout(make_shape(4, 4), make_stride(1, 4));  // column-major

This yields the following visualizations via cute::print_layout(), where each printed entry is the linear offset computed by the layout function (4*i + j for row-major, i + 4*j for column-major):

Row-Major (4,4):(4,1):

       0    1    2    3 
    +----+----+----+----+
 0  |  0 |  1 |  2 |  3 |
    +----+----+----+----+
 1  |  4 |  5 |  6 |  7 |
    +----+----+----+----+
 2  |  8 |  9 | 10 | 11 |
    +----+----+----+----+
 3  | 12 | 13 | 14 | 15 |
    +----+----+----+----+

Column-Major (4,4):(1,4):

       0    1    2    3 
    +----+----+----+----+
 0  |  0 |  4 |  8 | 12 |
    +----+----+----+----+
 1  |  1 |  5 |  9 | 13 |
    +----+----+----+----+
 2  |  2 |  6 | 10 | 14 |
    +----+----+----+----+
 3  |  3 |  7 | 11 | 15 |
    +----+----+----+----+
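
For completeness, here is a minimal sketch of a standalone program that builds and prints both layouts. It assumes only that the CUTLASS/CuTe headers are on the include path (the umbrella header cute/tensor.hpp is enough) and that it is compiled as CUDA C++ with nvcc and -std=c++17:

#include <cute/tensor.hpp>
using namespace cute;

int main() {
    // The same layouts as above: shape (4,4) with row-major / column-major strides.
    auto rm_layout = make_layout(make_shape(4, 4), make_stride(4, 1));
    auto cm_layout = make_layout(make_shape(4, 4), make_stride(1, 4));
    print_layout(rm_layout);  // prints the (4,4):(4,1) grid shown above
    print_layout(cm_layout);  // prints the (4,4):(1,4) grid shown above
    return 0;
}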

Part 2: Composition of Layouts

Task 2 – Composing Layouts C = A ∘ B

You can compose two layouts functionally; the composed layout C = A ∘ B satisfies

C(i) = A(B(i))

for every linear index i.

Given:

auto A = make_layout(make_shape(6, 2), make_stride(8, 2));  // (6,2):(8,2)
auto B = make_layout(make_shape(4, 3), make_stride(3, 1));  // (4,3):(3,1), row-major
auto C = composition(A, B);                                 // C(i) == A(B(i))

CuTe provides a helpful print_layout() to visualize all three layouts:

Layout A (6,2):(8,2)

       0    1 
    +----+----+
 0  |  0 |  2 |
    +----+----+
 1  |  8 | 10 |
    +----+----+
 2  | 16 | 18 |
    +----+----+
 3  | 24 | 26 |
    +----+----+
 4  | 32 | 34 |
    +----+----+
 5  | 40 | 42 |
    +----+----+

Layout B (4,3):(3,1)

       0    1    2 
    +----+----+----+
 0  |  0 |  1 |  2 |
    +----+----+----+
 1  |  3 |  4 |  5 |
    +----+----+----+
 2  |  6 |  7 |  8 |
    +----+----+----+
 3  |  9 | 10 | 11 |
    +----+----+----+

Composed Layout C ((2,2),(3,1)):((24,2),(8,2))

       0    1    2 
    +----+----+----+
 0  |  0 |  8 | 16 |
    +----+----+----+
 1  | 24 | 32 | 40 |
    +----+----+----+
 2  |  2 | 10 | 18 |
    +----+----+----+
 3  | 26 | 34 | 42 |
    +----+----+----+
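
To see where these numbers come from, trace one entry by hand. Linear index i = 1 corresponds to coordinate (1,0) of B, so B(1) = 1*3 + 0*1 = 3. Treating 3 as a linear index into A's shape (6,2) gives coordinate (3,0), so A(3) = 3*8 + 0*2 = 24, which is exactly the entry at row 1, column 0 of the grid above.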

Validating C(i) = A(B(i))

// Check every linear index of C against the definition of composition.
for (int i = 0; i < size(C); ++i) {
    assert(C(i) == A(B(i)));
}

All indices passed, confirming that CuTe composition behaves correctly.

Part 2b: Getting the Coordinate from a 1D Index

We learned that:

  • B(i) returns a memory offset (a scalar), never a coordinate.
  • Composition works anyway, because a layout also accepts a plain 1D index: A(B(i)) treats B's offset as a linear index into A's domain.
  • To recover the (row, col) coordinate behind a linear index, use idx2crd(i, shape(B)). Note that coalesce(B) only flattens nested modes; it is functionally identical to B and still returns offsets, not coordinates. See the sketch below.

We then manually verified that C(i) == A(B(i)) holds for all valid i.
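
Here is a minimal sketch of that round-trip, under the same header assumptions as before. idx2crd converts a linear index into a natural coordinate for a given shape:

// Recover the natural coordinate behind a linear index of B.
auto B = make_layout(make_shape(4, 3), make_stride(3, 1));
int i = 5;
auto coord = idx2crd(i, shape(B));  // linear index 5 -> coordinate (1,1)
print(coord);                       // prints (1,1)
// Applying B to the coordinate or to the index yields the same offset:
assert(B(coord) == B(i));           // both equal 1*3 + 1*1 = 4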

Part 3: Profiling Layout Access with Nsight Compute

Real CuTe Layouts from Task 2

[Plot: Layout A (6,2):(8,2), coordinates mapped to linear offsets]

[Plot: Layout B (4,3):(3,1), coordinates mapped to linear offsets]

[Plot: Composed layout C = A ∘ B, plotting C(i) = A(B(i))]

These plots show how CuTe translates coordinate systems and compositions into linear memory offsets.

Warp Divergence from Nsight Compute

Profiling showed that, on average, only 20.5 of the 32 threads in a warp were active.

[Plot: warp divergence, active vs. inactive threads per warp]

This warp divergence visualization illustrates how thread-level control flow impacts GPU efficiency: red bars represent threads left inactive by branching or predication, and green bars represent active threads.

These visualizations help illustrate how memory layout and access patterns impact performance and occupancy on the GPU. They are useful tools for understanding and optimizing low-level CUDA behavior.
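
Since ncu only reports on real kernels, here is a hypothetical minimal probe (the actual worklog code lives in the repo) that streams data through the composed layout from Task 2 and gives the profiler something to measure. With a 32-thread launch, only size(C) = 12 threads pass the guard, which is the kind of imbalance the divergence chart above shows:

#include <cute/tensor.hpp>
using namespace cute;

// Hypothetical probe kernel: each active thread loads one element through the
// composed layout's offset function and stores it densely.
__global__ void gather_via_C(float const* in, float* out) {
    auto A = make_layout(make_shape(6, 2), make_stride(8, 2));
    auto B = make_layout(make_shape(4, 3), make_stride(3, 1));
    auto C = composition(A, B);
    int i = threadIdx.x;
    if (i < size(C)) {       // only 12 of 32 launched threads take this branch
        out[i] = in[C(i)];   // C(i) == A(B(i)) is the linear memory offset
    }
}

// in must hold at least 48 floats (the largest offset C produces is 42).
// Launched as gather_via_C<<<1, 32>>>(d_in, d_out) and profiled with: ncu ./a.out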

Key Learnings

  • Layout definition: make_layout(shape, stride) expresses memory access patterns.
  • Composition: C(i) = A(B(i)) holds when the layouts are compatible (B's image must fit within A's domain).
  • B(i) as offset: applying a layout always yields a scalar offset, not a coordinate.
  • Coordinate extraction: use idx2crd(i, shape(B)) to recover the (row, col) coordinate.
  • Profiling: an actual CUDA kernel has to run for anything to show up in ncu.

Up Next

In the next phase, we’ll explore:

  • Batched layouts with 3D shapes
  • How batching affects memory traversal and layout strides
  • Preparing for tiling and fused kernels in MLA

Appendix

  • Code examples: available in the repo/task directory
  • Reference: NVIDIA's CuTe layout algebra PDF
  • Python was used only to generate the plots from layout metadata.