Memory layout

The memory layout is a 4 GB address space split into 16 blocks of 256 MB each. The memory inside a block is always contiguous. This allows for quick lookups at the expense of RAM usage.

| Address    | Description |
|------------|-------------|
| 0x00000000 | ELF binary mapping area. SPU problem area at 0x00040000. |
| 0x10000000 | Internal ps3emu area. Currently only 16MB are used. Managed using Boost’s rbtree_best_fit.hpp. |
| 0x20000000 | Unused. |
| 0x30000000 | Heap area. malloc() returns memory from this range. |
| 0x40000000 | GCM registers at 0x40100040, GCM labels at 0x40300000, empty otherwise. |
| 0x50000000 | Unused. |
| 0x60000000 | Unused. |
| 0x70000000 | Unused. |
| 0x80000000 | Unused. |
| 0x90000000 | Unused. |
| 0xa0000000 | Unused. |
| 0xb0000000 | Unused. |
| 0xc0000000 | RSX local memory of size 0xf900000. An OpenGL buffer is mapped here. |
| 0xd0000000 | Stack area of size 32MB. |
| 0xe0000000 | Raw SPU memory, including mapped LS memory and MMIO registers. |
| 0xf0000000 | SPU thread base address. |
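
Because every block is exactly 256 MB, a 32-bit address splits into a block index and an offset with one shift and one mask. The sketch below illustrates that lookup under stated assumptions: `AddressSpace`, `splitAddress` and `translate` are hypothetical names, and one host allocation per block (null when the block is unused) is an assumption, not necessarily how ps3emu stores its blocks.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned kBlockShift = 28;                // 256 MB = 2^28 bytes
constexpr uint32_t kBlockSize = 1u << kBlockShift;  // 0x10000000
constexpr unsigned kBlockCount = 16;                // 16 * 256 MB = 4 GB

struct BlockAddress {
    uint32_t block;   // index of the 256 MB block (top 4 address bits)
    uint32_t offset;  // byte offset inside that block (low 28 bits)
};

// Split a 32-bit guest address into (block, offset).
constexpr BlockAddress splitAddress(uint32_t va) {
    return {va >> kBlockShift, va & (kBlockSize - 1)};
}

// Hypothetical container: one host pointer per block, null when unallocated.
struct AddressSpace {
    std::array<uint8_t*, kBlockCount> blocks{};

    // Translate a guest address to a host pointer, or null for an unused block.
    uint8_t* translate(uint32_t va) const {
        BlockAddress a = splitAddress(va);
        assert(a.block < kBlockCount);  // always true for a 32-bit address
        uint8_t* base = blocks[a.block];
        return base ? base + a.offset : nullptr;
    }
};
```

For example, the GCM register address 0x40100040 splits into block 4 and offset 0x00100040. The lookup cost is constant, while each touched block has to be backed by a full contiguous 256 MB range, which is the RAM trade-off mentioned above.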

Load-reserve/Store-conditional (lwarx/stwcx)

Every SPU or PPU thread has a unique granule that can be assigned to a cache line to create a reservation. Once a reservation is created, it is possible to execute a conditional store to the same cache line. The store succeeds if the thread has an active reservation on this cache line, and fails otherwise. The memory is not modified when the conditional store fails. The cache line size on CELL is 128 bytes.

A single cache line can have multiple granules assigned to it at the same time. A single granule can either be unassigned, or assigned to exactly one cache line.

There are two ways to make a reservation. On the PPU side, the lwarx or ldarx instruction creates a reservation and simultaneously reads 4 or 8 bytes respectively. On the SPU side, the DMA command GETLLAR creates a reservation and reads the whole 128-byte cache line.

After a reservation has been made, it can be destroyed in several ways. The thread that created the reservation can destroy it either by creating a new reservation using lwarx/ldarx/GETLLAR or by executing a conditional store stwcx/stdcx/PUTLLC. Additionally, an SPU thread can execute a special unconditional store PUTLLUC that always writes the memory and destroys the reservation if present. A regular store made by the thread that created a reservation doesn’t destroy that reservation.

Any regular store, PUTLLUC or successful conditional store always destroys reservations made by any other thread.

If a thread makes a store while there is an active reservation on the target cache line, the existing reservations can be either kept or destroyed depending on which thread created them. If there are multiple reservations active at the same time, it is possible that only some of them are destroyed, while others are kept.

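| Reservation Owner \ Type | Cond Store: Success | Cond Store: Failure | Regular Store | Uncond Store |
|--------------------------|---------------------|---------------------|---------------|--------------|
| Other Thread             | Destroy             | Keep                | Destroy       | Destroy      |
| Same Thread              | Destroy             | -                   | Keep          | Destroy      |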
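
The rules in the table can be modelled with one optional cache line index per thread. This is a minimal single-threaded sketch for illustration only: `ReservationModel` and its method names are hypothetical, it tracks only reservation state (no memory contents), and it is not the emulator's actual data structure. On a failed conditional store the model also clears the caller's own granule, following the rule that any conditional store by the holder ends its reservation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr uint32_t kCacheLineSize = 128;  // Cell cache line size

class ReservationModel {
public:
    explicit ReservationModel(std::size_t threads) : granules_(threads) {}

    // lwarx/ldarx/GETLLAR: (re)assign the thread's granule to the line.
    void reserve(std::size_t thread, uint32_t addr) {
        granules_[thread] = lineOf(addr);
    }

    // stwcx./stdcx./PUTLLC: succeeds only if the caller holds the line.
    bool conditionalStore(std::size_t thread, uint32_t addr) {
        bool ok = holds(thread, addr);
        granules_[thread].reset();           // the caller's reservation ends
        if (ok) destroyOthers(thread, addr); // failure keeps others intact
        return ok;
    }

    // Regular store: destroys other threads' reservations, keeps the caller's.
    void regularStore(std::size_t thread, uint32_t addr) {
        destroyOthers(thread, addr);
    }

    // PUTLLUC: destroys every reservation on the line, including the caller's.
    void unconditionalStore(std::size_t thread, uint32_t addr) {
        destroyOthers(thread, addr);
        if (holds(thread, addr)) granules_[thread].reset();
    }

    bool holds(std::size_t thread, uint32_t addr) const {
        return granules_[thread] == lineOf(addr);
    }

private:
    static uint32_t lineOf(uint32_t addr) { return addr / kCacheLineSize; }

    void destroyOthers(std::size_t thread, uint32_t addr) {
        for (std::size_t i = 0; i < granules_.size(); ++i)
            if (i != thread && granules_[i] == lineOf(addr)) granules_[i].reset();
    }

    // One granule per thread: empty, or assigned to exactly one cache line.
    std::vector<std::optional<uint32_t>> granules_;
};
```

With two threads reserving addresses 0x1000 and 0x1040 (the same 128-byte line), a regular store by thread 0 leaves thread 0's reservation active and removes thread 1's, so a later conditional store by thread 1 fails.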

As per the documentation, there are exactly three events that are guaranteed to destroy a reservation.

  1. The processor element holding the reservation executes another lwarx or ldarx instruction. This clears the first reservation and establishes a new one.
  2. The processor element holding the reservation executes any stwcx. or stdcx. instruction regardless of whether its address matches that of the lwarx or ldarx.
  3. Some other processor element executes a store or dcbz (data cache block clear to zero) to the same reservation granule (128 bytes), or modifies a referenced or changed bit in the same reservation granule.

All stores to the same cache line are serialized using spinlocks. While a lock is held, it is guaranteed that no other thread executes a store to the same cache line. Depending on the type of the store, the thread that owns the lock might destroy one or more active reservations on that cache line.

This design limits thread contention to individual cache lines, while any unrelated stores proceed without blocking.
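
One way to picture per-line serialization is a table of spinlocks indexed by cache line number. The sketch below is an assumption-laden illustration, not ps3emu's implementation: `LineLocks`, `flagFor`, `lockedStore` and the table size `kLockCount` are hypothetical, and mapping lines onto a fixed-size table means two distant lines can occasionally share a lock, which a real implementation may avoid.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

constexpr uint32_t kCacheLineSize = 128;  // Cell cache line size
constexpr std::size_t kLockCount = 1024;  // assumed table size (power of two)
static_assert((kLockCount & (kLockCount - 1)) == 0, "mask needs a power of two");

class LineLocks {
public:
    LineLocks() {
        for (auto& f : flags_) f.store(false, std::memory_order_relaxed);
    }

    // Lock slot for the cache line that contains `addr`.
    std::atomic<bool>& flagFor(uint32_t addr) {
        return flags_[(addr / kCacheLineSize) & (kLockCount - 1)];
    }

    // RAII spinlock guard: spins until the slot is free, releases on scope exit.
    class Guard {
    public:
        explicit Guard(std::atomic<bool>& flag) : flag_(flag) {
            while (flag_.exchange(true, std::memory_order_acquire))
                std::this_thread::yield();
        }
        ~Guard() { flag_.store(false, std::memory_order_release); }
        Guard(const Guard&) = delete;
        Guard& operator=(const Guard&) = delete;

    private:
        std::atomic<bool>& flag_;
    };

private:
    std::array<std::atomic<bool>, kLockCount> flags_;
};

// Store a word while holding the lock of its cache line.
inline void lockedStore(LineLocks& locks, uint32_t guestAddr,
                        uint32_t& hostWord, uint32_t value) {
    LineLocks::Guard guard(locks.flagFor(guestAddr));
    // Reservation bookkeeping for this line would happen here, under the lock.
    hostWord = value;
}
```

Stores to 0x1000 and 0x107f take the same lock because both fall in one 128-byte line, while a store to 0x1080 uses a different slot and does not wait for them.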

Alternatives

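Recognizing common lwarx/stwcx instruction patterns and replacing each with a single host atomic operation has been considered. For example, the loop below is an atomic add, summarized by the pseudocode on the first line of the listing: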

rT = __sync_fetch_and_add([rA|rB], const) + const
107F8 loc_loop:
107F8                 lwarx     rT, rA, rB
107FC                 addi      rT, rT, const
10800                 stwcx.    rT, rA, rB
10804                 bne-      loc_loop

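The problem is that x86 has no atomic read-modify-write instructions that operate on big-endian values. In addition, emulating certain uses of the LR/SC pair with CAS leads to the ABA problem: a CAS succeeds whenever the current value matches the expected one, even if the location was modified and restored in between, whereas a store-conditional would fail.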
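
To make the endianness issue concrete, here is a sketch of what the atomic add looks like on a little-endian host when the guest value is stored big-endian. A plain atomic add would carry between bytes in the wrong order, so the value has to be swapped, modified and written back with a CAS loop. `bswap32` and `fetchAddBigEndian` are hypothetical helpers written for this illustration, and a little-endian host is assumed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Reverse the byte order of a 32-bit value.
constexpr uint32_t bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

// Atomically add `delta` to a big-endian 32-bit value held in host memory.
// Returns the new value in host byte order, like rT after the loop above.
inline uint32_t fetchAddBigEndian(std::atomic<uint32_t>& cell, uint32_t delta) {
    uint32_t raw = cell.load(std::memory_order_relaxed);
    for (;;) {
        uint32_t updated = bswap32(raw) + delta;  // add in host byte order
        // On failure `raw` is refreshed with the current contents and we retry.
        if (cell.compare_exchange_weak(raw, bswap32(updated),
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed))
            return updated;
    }
}
```

The CAS compares values only. If another thread changes the word and then restores the original bytes between the load and the CAS, the CAS still succeeds. That is harmless for an addition, but it is exactly the ABA behaviour that makes CAS an unsafe replacement for LR/SC in pointer-based structures.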

Modification map

Several GCM objects such as textures and shaders can’t be directly accessed from memory by OpenGL. To use a texture, it needs to be copied to a separate location in a particular format that is not compatible with RSX. At the same time, RSX is allowed to access a texture directly, for both reading and writing. This means the emulator needs to keep track of texture modifications and update the internal OpenGL representation when a write to the texture is made by an SPU or PPU thread. A failure to do so leads to stale textures. Updating textures before every draw, on the other hand, leads to severe performance degradation. A middle ground is to keep a map that keeps track of modifications. When a store is made by a thread, a bit is set in the map indicating that a particular location has been changed. When RSX is preparing to make a draw, it checks the map first to make sure all active textures are updated.

The map has granularity unrelated to cache lines and might be tweaked in the future.
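
A modification map of this kind can be sketched as a bitmap with one bit per fixed-size region of the address space. Everything below is illustrative: `ModificationMap`, `mark` and `testAndClear` are hypothetical names, and the 4 KB granule (shift of 12) is an assumed value chosen for the example, since the actual granularity is not stated here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class ModificationMap {
public:
    // granuleShift = 12 means one bit per 4 KB (assumed, tunable).
    explicit ModificationMap(unsigned granuleShift = 12)
        : shift_(granuleShift),
          words_((((uint64_t(1) << 32) >> granuleShift) + 63) / 64) {}

    // Called on a store: flag every granule touched by [addr, addr + size).
    void mark(uint32_t addr, uint32_t size) {
        forEachGranule(addr, size, [&](uint64_t g) {
            words_[g / 64].fetch_or(uint64_t(1) << (g % 64),
                                    std::memory_order_relaxed);
        });
    }

    // Called before a draw: report whether the range changed, clearing its bits.
    bool testAndClear(uint32_t addr, uint32_t size) {
        bool dirty = false;
        forEachGranule(addr, size, [&](uint64_t g) {
            uint64_t mask = uint64_t(1) << (g % 64);
            uint64_t old = words_[g / 64].fetch_and(~mask,
                                                    std::memory_order_relaxed);
            dirty = dirty || (old & mask) != 0;
        });
        return dirty;
    }

private:
    template <typename F>
    void forEachGranule(uint32_t addr, uint32_t size, F&& visit) {
        assert(size > 0);
        assert(uint64_t(addr) + size <= (uint64_t(1) << 32));
        uint64_t first = addr >> shift_;
        uint64_t last = (uint64_t(addr) + size - 1) >> shift_;
        for (uint64_t g = first; g <= last; ++g) visit(g);
    }

    unsigned shift_;
    std::vector<std::atomic<uint64_t>> words_;  // one bit per granule
};
```

With these assumptions the whole 4 GB space needs 128 KB of bits. An 8-byte store at 0xc0001ffc straddles two granules and sets both bits, so a texture covering either granule would be refreshed before the next draw, while textures elsewhere are left alone. A coarser granule shrinks the map but causes more unnecessary texture updates.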