Here I want to show how low-level bit manipulation can lead to curious results in the memory layout of 2D matrices.

It is a well-known fact among programmers that certain arithmetic operations can be trivially expressed at the bit level. Among them:¹

• Division by 2^N is a right shift by N bits.
• Multiplication by 2^N is a left shift by N bits.
• Finding the remainder after division by 2^N: just take the N rightmost bits.
• Aligning memory objects to a 2^N-byte boundary: clear the N rightmost bits to go to the nearest lower address.
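
These identities are easy to check. A quick Python sketch (the values 42 and N = 3 are arbitrary examples):

``````
n, N = 42, 3          # 2**N == 8; both values are arbitrary examples

# Division by 2**N is a right shift by N bits.
assert n // (1 << N) == n >> N

# Multiplication by 2**N is a left shift by N bits.
assert n * (1 << N) == n << N

# Remainder after division by 2**N: keep the N rightmost bits.
assert n % (1 << N) == n & ((1 << N) - 1)

# Align down to a 2**N-byte boundary: clear the N rightmost bits.
addr = 42
assert addr & ~((1 << N) - 1) == 40   # nearest lower 8-byte boundary
``````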

The operations above are special cases that are faster than their full-fledged counterparts, which are not limited to power-of-two arguments. This is, for example, why compilers love replacing `mul` and `div` with the corresponding bitwise operations whenever possible.²

# Linear layout

Take a look at the following matrix.

The red thread shows how this matrix is laid out in memory as an array. The numbers in each cell are memory offsets of the corresponding elements.

This layout is called the linear layout.

Each element in the matrix has a Cartesian coordinate. Assuming the origin is in the top-left corner, each coordinate can be written as (y, x). Offset 0 corresponds to coordinate (0, 0), offset 42 to coordinate (5, 2), and so on.

The mapping between the Cartesian coordinates and memory offsets can be expressed as a function:

``````
def linear(x, y, width):
    return width * y + x
``````

This function works for any value of `width`. To see something interesting on the bit level, we need to limit ourselves to square matrices with a power-of-two size.

Here are a few mappings for the matrix above:

| y | x | linear offset |
|---|---|---------------|
| 0₁₀ = 000₂ | 0₁₀ = 000₂ | 0₁₀ = 000 000₂ |
| 1₁₀ = 001₂ | 2₁₀ = 010₂ | 10₁₀ = 001 010₂ |
| 5₁₀ = 101₂ | 2₁₀ = 010₂ | 42₁₀ = 101 010₂ |

Calculating a linear offset is a matter of concatenating the bits of y and x. This works because, for an 8×8 matrix, both y and x fit in exactly log₂(8) = 3 bits.
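
In code, for this fixed width of 8 the multiplication and addition collapse into a shift and an OR. A sketch (the name `linear8` is mine, not from any library):

``````
def linear(x, y, width):
    return width * y + x

def linear8(x, y):
    # width == 8 == 2**3: shift y into the high bits, OR in the 3 bits of x
    return (y << 3) | x

# The concatenation agrees with the arithmetic version everywhere:
for y in range(8):
    for x in range(8):
        assert linear8(x, y) == linear(x, y, 8)

assert linear8(2, 5) == 0b101_010 == 42
``````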

The linear layout is intuitive and easy to work with, but there are cases when it is inefficient.

Think about querying a texture in a shader. Textures are rarely read in linear order, row by row. More often a small rectangle is needed, spanning several rows and several columns. As is always the case with memory, bytes are read in batches the size of a cache line³, which means that reading random rectangles from a texture is inefficient: a large number of bytes are read and then discarded.

To improve efficiency, variations of the swizzled layout are sometimes used by GPUs.

# Swizzling

The swizzled layout keeps spatially close areas close together in memory:

The Z-pattern repeats itself on all levels, over and over again. This layout is more cache-friendly. Fortunately, mapping the Cartesian coordinates to memory offsets is very easy:

``````
def swizzle(x, y):
    return ???
``````

Well…

| y | x | swizzled offset |
|---|---|-----------------|
| 0₁₀ = 000₂ | 0₁₀ = 000₂ | 0₁₀ = 000000₂ |
| 0₁₀ = 000₂ | 7₁₀ = 111₂ | 21₁₀ = 010101₂ |
| 5₁₀ = 101₂ | 2₁₀ = 010₂ | 38₁₀ = 100110₂ |
| y₂y₁y₀ | x₂x₁x₀ | y₂x₂y₁x₁y₀x₀ |

The last row explains the pattern. The bits of y and x are zipped together:

``````
y:  y₂    y₁    y₀
x:     x₂    x₁    x₀

    y₂ x₂ y₁ x₁ y₀ x₀
``````

No arithmetic is needed.
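
For illustration, the zipping can be written out bit by bit in Python. This naive loop is exactly the kind of code that is too slow to be useful in practice:

``````
def swizzle(x, y, bits=3):
    # Zip the bits: for bits == 3 the result is y2 x2 y1 x1 y0 x0.
    offset = 0
    for i in range(bits):
        offset |= ((x >> i) & 1) << (2 * i)       # x bits land at even positions
        offset |= ((y >> i) & 1) << (2 * i + 1)   # y bits land at odd positions
    return offset

assert swizzle(0, 0) == 0
assert swizzle(7, 0) == 0b010101 == 21
assert swizzle(2, 5) == 0b100110 == 38
``````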

The `swizzle` function cannot be implemented efficiently in a high-level language.⁴ To be useful, this kind of layout needs to be implemented in hardware.

I find it amusing how simple bit manipulations may produce useful arithmetic results, or, in this case, even a recursive Z-pattern.

Now let’s hope no one succeeds in finding a similar pattern while factoring large numbers…

1. Of course, these are basic properties of positional numeral systems: in base b, the same tricks apply to powers of b.

2. These optimizations are part of a more general strength reduction optimization pass.

3. The exact size of a cache line varies across processor architectures; 64 and 128 bytes are not uncommon.

4. Modern x86 processors provide the `PDEP` instruction (part of the BMI2 extension), which allows for an efficient implementation.