Bit Swizzling
Here I want to show how low-level bit manipulation can lead to curious results in the memory layout of 2D matrices.
It is a well-known fact among programmers that certain arithmetic operations can be trivially expressed at the bit level. Among them:¹
- Division by 2^N is a right shift by N bits.
- Multiplication by 2^N is a left shift by N bits.
- Finding the remainder after division by 2^N: just take the N rightmost bits.
- Aligning a memory object to a 2^N-byte boundary: clear the N rightmost bits to go to the nearest lower aligned address.
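The identities above can be sanity-checked in a few lines (a quick sketch of mine; the constant 2^4 = 16 and the value 173 are arbitrary examples):

```python
# Verify the power-of-two shortcuts with shifts and masks.
# N = 4, so 2^N = 16; `value` is an arbitrary example.
N, value = 4, 173

assert value >> N == value // 16                      # division by 2^N
assert value << N == value * 16                       # multiplication by 2^N
assert value & ((1 << N) - 1) == value % 16           # remainder mod 2^N
assert value & ~((1 << N) - 1) == (value // 16) * 16  # align down to 16 bytes
```

The mask `(1 << N) - 1` is simply N one-bits, which is why AND-ing with it selects the N rightmost bits, and AND-ing with its complement clears them.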
The operations above are special cases that are faster than their full-fledged counterparts, which are not limited to power-of-two arguments. This is, for example, why compilers love replacing mul and div with the corresponding bitwise operations whenever possible.²
Linear layout
Take a look at the following matrix.
The red thread shows how this matrix is laid out in memory as an array. The numbers in each cell are memory offsets of the corresponding elements.
This layout is called the linear layout.
Each element in the matrix has a Cartesian coordinate. Assuming the origin is in the top left corner, each coordinate can be written as (y, x). Offset 0 corresponds to coordinate (0, 0); offset 42 to coordinate (5, 2), and so on.
The mapping between the Cartesian coordinates and memory offsets can be expressed as a function:
def linear(x, y, width): return width * y + x
This function works for any value of width. To see something interesting at the bit level, we need to limit ourselves to square matrices with a power-of-two size.
Here are a few mappings for the matrix above:
y          | x          | linear offset
0₁₀ = 000₂ | 0₁₀ = 000₂ |  0₁₀ = 000 000₂
1₁₀ = 001₂ | 2₁₀ = 010₂ | 10₁₀ = 001 010₂
5₁₀ = 101₂ | 2₁₀ = 010₂ | 42₁₀ = 101 010₂
Calculating a linear offset is a matter of concatenating y and x. This works because y and x each fit in log₂(8) = 3 bits.
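The concatenation is easy to verify against the linear function from above (a quick sketch of mine, with width fixed at 8):

```python
def linear(x, y, width):
    return width * y + x

# For an 8x8 matrix, shifting y left by log2(8) = 3 bits and OR-ing in x
# concatenates the two 3-bit values, which matches the linear offset.
for y in range(8):
    for x in range(8):
        assert (y << 3) | x == linear(x, y, 8)
```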
The linear layout is intuitive and easy to work with, but there are cases when it is inefficient.
Think about querying a texture in a shader. Textures are rarely read in linear order, row by row. More often, what is needed is a small rectangle spanning several rows and several columns. As is always the case with memory, bytes are read in batches the size of a cache line³, which means that reading random rectangles from a linearly laid-out texture is inefficient, as it requires reading and then discarding a large number of bytes.
To improve efficiency, variations of the swizzled layout are sometimes used by GPUs.
Swizzling
The swizzled layout keeps spatially close areas close together in memory:
The Z-pattern repeats itself on all levels, over and over again. This layout is more cache-friendly. Fortunately, mapping the Cartesian coordinates to memory offsets is very easy:
def swizzle(x, y): return ???
Well…
y          | x          | swizzled offset
0₁₀ = 000₂ | 0₁₀ = 000₂ |  0₁₀ = 000000₂
0₁₀ = 000₂ | 7₁₀ = 111₂ | 21₁₀ = 010101₂
5₁₀ = 101₂ | 2₁₀ = 010₂ | 38₁₀ = 100110₂
y₂y₁y₀     | x₂x₁x₀     | y₂x₂y₁x₁y₀x₀
The last row explains the pattern. The bits of y and x are zipped together:

x:       x₂  x₁  x₀
y:     y₂  y₁  y₀
zip:   y₂ x₂ y₁ x₁ y₀ x₀
No arithmetic is needed.
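Purely as an illustration (a naive sketch of mine, not an efficient implementation), the zipping can be spelled out as a bit loop:

```python
def swizzle(x, y, bits=3):
    # Interleave the bits of y and x: x's bits land on even positions,
    # y's bits on odd positions, reproducing y2 x2 y1 x1 y0 x0.
    offset = 0
    for i in range(bits):
        offset |= ((x >> i) & 1) << (2 * i)
        offset |= ((y >> i) & 1) << (2 * i + 1)
    return offset
```

For the values from the table, swizzle(7, 0) gives 21 and swizzle(2, 5) gives 38.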
The swizzle function cannot be implemented efficiently in a high-level language.⁴ To be useful, this kind of layout needs to be implemented in hardware.
I find it amusing how simple bit manipulations may produce useful arithmetic results or, in this case, even a recursive Z-pattern.
Now let’s hope no one succeeds in finding a similar pattern while factoring large numbers…

1. Of course, these are basic properties that hold for any numeral system. ↩
2. These optimizations are part of a more general strength reduction optimization pass. ↩
3. The exact size of a cache line varies across processor architectures; 64 and 128 bytes are not uncommon. ↩
4. Modern x86 processors provide the PDEP instruction, which allows for an efficient implementation. ↩