Memory layout
The memory layout consists of a 4GB address space split into 16 blocks of 256MB each. The memory inside a block is always contiguous. This allows for quick lookups at the expense of RAM usage.
Address | Description |
---|---|
0x00000000 | ELF binary mapping area. SPU problem area at 0x00040000. |
0x10000000 | Internal ps3emu area. Currently only 16MB are used. Managed using Boost’s rbtree_best_fit.hpp. |
0x20000000 | Unused. |
0x30000000 | Heap area. malloc() returns memory from this range. |
0x40000000 | GCM registers at 0x40100040, GCM labels at 0x40300000, empty otherwise. |
0x50000000 | Unused. |
0x60000000 | Unused. |
0x70000000 | Unused. |
0x80000000 | Unused. |
0x90000000 | Unused. |
0xa0000000 | Unused. |
0xb0000000 | Unused. |
0xc0000000 | RSX local memory of size 0xf900000. An OpenGL buffer is mapped here. |
0xd0000000 | Stack area of size 32MB. |
0xe0000000 | Raw SPU memory, including mapped LS memory and MMIO registers. |
0xf0000000 | SPU thread base address. |
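
The quick lookup mentioned above follows from the fixed block size: the top 4 bits of a 32-bit address select the block and the low 28 bits are the offset inside it. The sketch below only illustrates that arithmetic. The names `splitAddress`, `translate` and `BlockTable` are hypothetical and not taken from the ps3emu sources, and a table of host pointers with `nullptr` for unmapped blocks is an assumption.

```cpp
#include <array>
#include <cstdint>

// 16 blocks of 256MB each, as described in the table above.
constexpr std::uint32_t kBlockBits = 28;                  // 256MB == 1 << 28
constexpr std::uint32_t kBlockSize = 1u << kBlockBits;
constexpr std::uint32_t kBlockCount = 16;                 // 16 * 256MB == 4GB

struct BlockAddress {
    std::uint32_t block;   // index of the 256MB block, 0..15
    std::uint32_t offset;  // byte offset inside that block
};

// Splits a guest address into a block index and an offset.
constexpr BlockAddress splitAddress(std::uint32_t guestAddress) {
    return BlockAddress{guestAddress >> kBlockBits,
                        guestAddress & (kBlockSize - 1)};
}

// Hypothetical table of host base pointers, nullptr for unused blocks.
using BlockTable = std::array<std::uint8_t*, kBlockCount>;

// Translates a guest address to a host pointer, or nullptr if the
// block is not mapped. No bounds check against the mapped size is done.
inline std::uint8_t* translate(const BlockTable& blocks,
                               std::uint32_t guestAddress) {
    const BlockAddress a = splitAddress(guestAddress);
    std::uint8_t* base = blocks[a.block];
    return base ? base + a.offset : nullptr;
}
```

For instance, the GCM registers at 0x40100040 fall into block 4 at offset 0x100040.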
Load-reserve/Store-conditional (lwarx/stwcx)
Every SPU or PPU thread has a unique granule that can be assigned to a cache line to create a reservation. Once a reservation is created, it is possible to execute a conditional store to the same cache line. The store either succeeds if a reservation is active on this cache line, or fails if there is no active reservation. The memory is not modified when the conditional store fails. The cache line size on CELL is 128 bytes.
A single cache line can have multiple granules assigned to it at the same time. A single granule can either be unassigned, or assigned to exactly one cache line.
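The relation between granules and cache lines can be sketched as a small bookkeeping structure. This is a single-threaded illustration under assumptions: the names `Granule` and `ReservationTable` are hypothetical, the hash map is chosen for brevity, and no locking is shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr std::uint32_t kCacheLineSize = 128;  // cache line size on CELL

// One granule per emulated thread: unassigned, or assigned to one line.
struct Granule {
    bool assigned = false;
    std::uint32_t line = 0;  // cache line index, meaningful only if assigned
};

class ReservationTable {
    // A cache line may have several granules assigned at the same time.
    std::unordered_map<std::uint32_t, std::vector<Granule*>> lines_;

public:
    static std::uint32_t lineOf(std::uint32_t address) {
        return address / kCacheLineSize;
    }

    // Detaches the granule from its line, if it is assigned to one.
    void release(Granule& g) {
        if (!g.assigned) return;
        auto it = lines_.find(g.line);
        if (it != lines_.end()) {
            auto& v = it->second;
            v.erase(std::remove(v.begin(), v.end(), &g), v.end());
            if (v.empty()) lines_.erase(it);
        }
        g.assigned = false;
    }

    // Creates a reservation; any previous one of this granule is dropped,
    // so a granule is never assigned to more than one line.
    void reserve(Granule& g, std::uint32_t address) {
        release(g);
        g.line = lineOf(address);
        g.assigned = true;
        lines_[g.line].push_back(&g);
    }

    // Number of granules currently assigned to the line of `address`.
    std::size_t count(std::uint32_t address) const {
        auto it = lines_.find(lineOf(address));
        return it == lines_.end() ? 0 : it->second.size();
    }
};
```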
There are two ways to make a reservation. From the PPU side, the lwarx or ldarx instructions can be used to create a reservation and simultaneously read 4 or 8 bytes respectively. On the SPU side, the DMA command GETLLAR can be used, which also reads the whole 128 byte cache line.
After a reservation has been made, it can be destroyed in several ways. The thread that created the reservation can destroy it either by creating a new reservation using lwarx/ldarx/GETLLAR or by executing a conditional store stwcx./stdcx./PUTLLC. Additionally, an SPU thread can execute a special unconditional store PUTLLUC that always stores the memory and destroys the reservation if present. Any regular store made by the thread that created a reservation doesn’t destroy the reservation.
Any regular store, PUTLLUC, or successful conditional store always destroys reservations made by any other thread.
If a thread makes a store while there is an active reservation on the target cache line, the existing reservations are either kept or destroyed depending on which thread created them. If there are multiple reservations active at the same time, it is possible that only some of them are destroyed, while others are kept.
Reservation Owner \ Type | Cond Store: Success | Cond Store: Failure | Regular Store | Uncond Store |
---|---|---|---|---|
Other Thread | Destroy | Keep | Destroy | Destroy |
Same Thread | Destroy | - | Keep | Destroy |
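
The table above can be written down as a single predicate. The sketch below is only a restatement of the table, with hypothetical names (`Owner`, `StoreKind`, `destroysReservation`). The "-" cell (same thread, failed conditional store) is treated as "nothing to destroy", which is an assumption.

```cpp
// Who created the reservation, relative to the thread doing the store.
enum class Owner { SameThread, OtherThread };

// The four store kinds from the table columns.
enum class StoreKind {
    ConditionalSuccess,  // stwcx./stdcx./PUTLLC that succeeded
    ConditionalFailure,  // stwcx./stdcx./PUTLLC that failed
    Regular,             // ordinary store
    Unconditional        // PUTLLUC
};

// Returns true if a store of the given kind destroys a reservation
// held by `owner` on the target cache line.
inline bool destroysReservation(Owner owner, StoreKind kind) {
    switch (kind) {
    case StoreKind::ConditionalSuccess:
    case StoreKind::Unconditional:
        return true;                          // destroyed for every owner
    case StoreKind::ConditionalFailure:
        return false;                         // memory and reservations untouched
    case StoreKind::Regular:
        return owner == Owner::OtherThread;   // own reservation is kept
    }
    return false;  // unreachable, keeps compilers quiet
}
```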
As per the documentation, there are exactly three events that are guaranteed to destroy a reservation.
- The processor element holding the reservation executes another lwarx or ldarx instruction. This clears the first reservation and establishes a new one.
- The processor element holding the reservation executes any stwcx. or stdcx. instruction regardless of whether its address matches that of the lwarx or ldarx.
- Some other processor element executes a store or dcbz (data cache block clear to zero) to the same reservation granule (128 bytes), or modifies a referenced or changed bit in the same reservation granule.
All stores to the same cache line are serialized using spinlocks. While a lock is held, it is guaranteed that no other thread executes a store to the same cache line. Depending on the type of the store, the thread that owns the lock might destroy one or more active reservations on that cache line.
This design limits thread contention to individual cache lines, while any unrelated stores proceed without blocking.
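A minimal sketch of a store serialized by a per-line spinlock is shown below. It rests on several assumptions: the names `SpinLock`, `LineLocks` and `lockedStore` are hypothetical, and the lock table is a fixed array indexed by the cache line number modulo its size. With such a table two unrelated lines can occasionally share a lock, so it only approximates the per-line behavior described above.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>

// Test-and-set spinlock usable with std::lock_guard.
class SpinLock {
    std::atomic<bool> locked_{false};

public:
    void lock() {
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // busy-wait until the current owner releases the lock
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

class LineLocks {
    static constexpr std::size_t kLockCount = 1024;  // assumed table size
    std::array<SpinLock, kLockCount> locks_;

public:
    // Picks the lock that guards the 128-byte line containing `address`.
    SpinLock& forAddress(std::uint32_t address) {
        return locks_[(address / 128) % kLockCount];
    }
};

// Performs a store while holding the lock of the target cache line.
// `destroy` is a caller-supplied callback that receives the line index
// and drops whichever reservations the store kind requires.
template <class DestroyReservations>
void lockedStore(LineLocks& locks, std::uint8_t* host,
                 std::uint32_t guestAddress, const void* src,
                 std::size_t size, DestroyReservations destroy) {
    std::lock_guard<SpinLock> guard(locks.forAddress(guestAddress));
    destroy(guestAddress / 128);
    std::memcpy(host, src, size);
}
```

The sketch assumes the store does not cross a cache line boundary.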
Alternatives
The idea of emulating common patterns of lwarx/stwcx. instructions has been considered, for example the block below:
```
rT = __sync_fetch_and_add([rA|rB], const) + const
107F8 loc_loop:
107F8 lwarx rT, rA, rB
107FC addi rT, rT, const
10800 stwcx. rT, rA, rB
10804 bne- loc_loop
```
The problem is the lack of atomic instructions on x86 that can operate on big-endian values. Also, emulating certain uses of the LR/SC pair with CAS leads to the ABA problem.
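To illustrate the endianness issue: a single x86 atomic add would propagate carries in the wrong byte order when applied to a big-endian value, so the addition has to be done on a byte-swapped copy and written back with a compare-and-swap loop. The sketch below is an assumption-laden illustration, not something the emulator does. `byteSwap32` and `fetchAddBigEndian` are hypothetical names, and guest memory is assumed to be addressable as `std::atomic<std::uint32_t>`.

```cpp
#include <atomic>
#include <cstdint>

// Reverses the byte order of a 32-bit value.
inline std::uint32_t byteSwap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

// Adds `delta` to a big-endian 32-bit value kept in `cell` and returns
// the new value in host byte order, like rT after the loop above.
inline std::uint32_t fetchAddBigEndian(std::atomic<std::uint32_t>& cell,
                                       std::uint32_t delta) {
    std::uint32_t raw = cell.load(std::memory_order_relaxed);
    for (;;) {
        const std::uint32_t updated = byteSwap32(raw) + delta;
        // On failure `raw` is refreshed with the current contents
        // and the addition is retried.
        if (cell.compare_exchange_weak(raw, byteSwap32(updated))) {
            return updated;
        }
    }
}
```

The compare-and-swap succeeds whenever the stored bits match, even if another thread changed the value and changed it back in between. A real stwcx. would fail in that case, which is the ABA problem mentioned above.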
Modification map
Several GCM objects such as textures and shaders can’t be directly accessed from memory by OpenGL. To use a texture, it needs to be copied to a separate location in a particular format that is not compatible with RSX. At the same time, RSX is allowed to access a texture directly, for both reading and writing. This leads to a problem: the emulator needs to keep track of texture modifications and update the internal OpenGL representation when a write to the texture is made by an SPU or PPU thread. A failure to do so leads to stale textures. Updating textures before every draw, on the other hand, leads to severe performance degradation. A middle ground is to keep a map that tracks modifications. When a store is made by a thread, a bit is set in the map indicating that a particular location has been changed. When RSX is preparing to make a draw, it checks the map first to make sure all active textures are updated.
The map has granularity unrelated to cache lines and might be tweaked in the future.
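A modification map of this kind can be sketched as an atomic bitmap with one bit per fixed-size page. The 4KB granularity, the class name `ModificationMap` and its methods are assumptions made for illustration. The actual granularity is not stated here.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kPageBits = 12;  // assumed granularity: 4KB per bit

class ModificationMap {
    std::vector<std::atomic<std::uint64_t>> words_;  // 64 pages per word

    static std::uint64_t bitOf(std::uint64_t page) {
        return std::uint64_t(1) << (page % 64);
    }

public:
    // `bytes` is the size of the tracked address range, starting at 0.
    explicit ModificationMap(std::uint64_t bytes)
        : words_(static_cast<std::size_t>(((bytes >> kPageBits) + 63) / 64)) {}

    // Called on a store: marks every page touched by [address, address + size).
    void mark(std::uint32_t address, std::uint32_t size) {
        if (size == 0) return;
        const std::uint64_t first = address >> kPageBits;
        const std::uint64_t last = (std::uint64_t(address) + size - 1) >> kPageBits;
        for (std::uint64_t page = first; page <= last; ++page) {
            words_[page / 64].fetch_or(bitOf(page), std::memory_order_release);
        }
    }

    // Called before a draw: reports whether any page of the range was
    // modified and clears the corresponding bits.
    bool testAndClear(std::uint32_t address, std::uint32_t size) {
        if (size == 0) return false;
        bool modified = false;
        const std::uint64_t first = address >> kPageBits;
        const std::uint64_t last = (std::uint64_t(address) + size - 1) >> kPageBits;
        for (std::uint64_t page = first; page <= last; ++page) {
            const std::uint64_t old = words_[page / 64].fetch_and(
                ~bitOf(page), std::memory_order_acq_rel);
            modified = modified || (old & bitOf(page)) != 0;
        }
        return modified;
    }
};
```

In this sketch a draw would call `testAndClear` for the range of each active texture and re-upload the texture when it returns true. Clearing the bits assumes no other tracked object shares the same pages.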