Here I want to show how low-level bit manipulation can lead to curious results in the memory layout of 2D matrices.

It is a well known fact among programmers that certain arithmetic operations can be trivially expressed at the bit level. Among them:¹

Division by 2^N is a right shift by N bits.
Multiplication by 2^N is a left shift by N bits.
Finding the remainder after division by 2^N. Just take N rightmost bits.
Alignment of memory objects to a 2^N byte boundary. Clear N rightmost bits to go to the nearest lower address.

The operations above are special cases that are faster than their full-fledged counterparts that are not limited to power-of-two arguments. This is, for example, why compilers love replacing any mul and div with corresponding bitwise operations whenever possible.²

Linear layout

Take a look at the following matrix.

The red thread shows how this matrix is laid out in memory as an array. The numbers in each cell are memory offsets of the corresponding elements.

This layout is called the linear layout.

Each element in the matrix has a Cartesian coordinate. Assuming the origin is in the top left corner each coordinate can be written as (y, x). Offset 0 corresponds to coordinate (0, 0); offset 42 to coordinate (5, 2) and so on.

The mapping between the Cartesian coordinates and memory offsets can be expressed as a function:

def linear(x, y, width): return width * y + x

This function works for any value of width. To see something interesting on the bit level, we need to limit ourselves to square matrices with a power-of-two size.

Here are a few mappings for the matrix above:

y	x	linear offset
0₁₀ = 000₂	0₁₀ = 000₂	0₁₀ = 000 000₂
1₁₀ = 001₂	2₁₀ = 010₂	10₁₀ = 001 010₂
5₁₀ = 101₂	2₁₀ = 010₂	42₁₀ = 101 010₂

Calculating a linear offset is a matter of concatenating y and x. This works because the bit size of y and x is log₂(8) = 3.

The linear layout is intuitive and easy to work with, but there are cases when it is inefficient.

Think about querying a texture in a shader. Textures are rarely read in linear order, row by row. More often a small rectangle is needed, that spans several rows and several columns. As it is always the case with memory, bytes are read in batches equal to the size of a cache line³, which means that reading random rectangles from a texture is inefficient, as it requires reading and then discarding a large amount of bytes.

To improve efficiency, variations of the swizzled layout are sometimes used by GPUs.

Swizzling

The swizzled layout keeps spatially close areas close together in memory:

The Z-pattern repeats itself on all levels, over and over again. This layout is more cache-friendly. Fortunately, mapping the Cartesian coordinates to memory offsets is very easy:

def swizzle(x, y): return ???

Well…

y	x	swizzled offset
0₁₀ = 000₂	0₁₀ = 000₂	0₁₀ = 000000₂
0₁₀ = 000₂	7₁₀ = 111₂	21₁₀ = 010101₂
5₁₀ = 101₂	2₁₀ = 010₂	38₁₀ = 100110₂
y₂y₁y₀	x₂x₁x₀	y₂x₂y₁x₁y₀x₀

The last line explains the pattern. The bits of y and x are zipped together:

   x2    x1    x0
y2    y1    y0

No arithmetic is needed.

The swizzle function can not be implemented efficiently in a high level language, so I am not showing such an implementation.⁴ To be useful, this kind of layout needs to be implemented in hardware.

I find it amusing how simple bit manipulations may produce useful arithmetic results, or, in this case, even a recursive Z-pattern.

Now let’s hope no one succeeds in finding a similar pattern while factoring large numbers…

Of course these are basic properties that hold for any numeral system. ↩
These optimizations are part of a more general strength reduction optimization pass. ↩
The exact size of a cache line varies across processor architectures, 64 and 128 bytes are not uncommon. ↩
Modern x86 processors provide the PDEP instruction, which allows for an efficient implementation. ↩