Real-Time 3D Shaders on Game Boy Color: A Masterclass in Constraint-Driven Engineering
How logarithms, self-modifying code, and 8-bit fractions enable the impossible
The Challenge
The Game Boy Color has no multiply instruction. No floating point. An 8 MHz processor, and only in double-speed mode (roughly 140,000 cycles per frame). And yet, a developer managed to render real-time 3D shaders with player-controlled lighting.
This is what engineering under extreme constraints looks like.
The Core Math
A Lambert shader—the simplest 3D lighting model—uses the dot product:
v = N · L
Where N is the normal vector and L is the light direction. Expanded component-wise:
v = Nx*Lx + Ny*Ly + Nz*Lz
Three multiplications, two additions. Trivial on modern hardware. Impossible as written on a Game Boy.
Solution #1: Spherical Coordinates
By expressing both unit vectors in spherical coordinates (θ is the polar angle, φ the azimuth), the dot product becomes:
v = sin(Nθ) * sin(Lθ) * cos(Nφ - Lφ) + cos(Nθ) * cos(Lθ)
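The identity is easy to check numerically. The following is a minimal sketch, not code from the project: the helper names and the sample angles are made up for illustration, and θ is assumed to be measured from the +z axis.

```python
import math

def unit_from_spherical(theta, phi):
    """Unit vector for polar angle theta (from +z) and azimuth phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def dot_cartesian(n, l):
    """Plain component-wise dot product: three multiplies, two adds."""
    return sum(a * b for a, b in zip(n, l))

def dot_spherical(n_theta, n_phi, l_theta, l_phi):
    """The same dot product written in terms of the angles only."""
    return (math.sin(n_theta) * math.sin(l_theta) * math.cos(n_phi - l_phi)
            + math.cos(n_theta) * math.cos(l_theta))

# Arbitrary sample angles in radians (illustrative values only).
n_theta, n_phi = 0.7, 1.9
l_theta, l_phi = 1.2, -0.4

via_cartesian = dot_cartesian(unit_from_spherical(n_theta, n_phi),
                              unit_from_spherical(l_theta, l_phi))
via_spherical = dot_spherical(n_theta, n_phi, l_theta, l_phi)
print(via_cartesian, via_spherical)
```

Both forms agree to floating-point precision, which is what makes the later simplification possible.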
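If we fix Lθ (the light’s vertical angle) as a constant, the two products involving Nθ no longer depend on anything that changes at runtime. They can be precomputed for each pixel as coefficients m and b:
m = sin(Nθ) * sin(Lθ) [precomputed per pixel]
b = cos(Nθ) * cos(Lθ) [precomputed per pixel]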
v = m * cos(Nφ - Lφ) + b
Now we only compute one multiplication per pixel. But we still have no multiply instruction.
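A rough sketch of how that split might look, using ordinary floats. The names `bake_pixel` and `shade` and the 60° light elevation are assumptions for illustration, not taken from the project.

```python
import math

L_THETA = math.radians(60)  # hypothetical fixed light elevation

def bake_pixel(n_theta):
    """Build time: fold the fixed L_THETA into two coefficients for one pixel."""
    m = math.sin(n_theta) * math.sin(L_THETA)
    b = math.cos(n_theta) * math.cos(L_THETA)
    return m, b

def shade(m, b, n_phi, l_phi):
    """Runtime: one cosine lookup, one multiply, one add."""
    return m * math.cos(n_phi - l_phi) + b

# Bake one pixel whose normal has a polar angle of 30 degrees, then light it.
m, b = bake_pixel(math.radians(30))
v = shade(m, b, n_phi=0.5, l_phi=0.2)
print(v)
```

Only `l_phi` changes while the program runs, so everything in `bake_pixel` can be done ahead of time and stored alongside the normal data.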
Solution #2: Logarithmic Multiplication
Here’s the clever bit. Logarithms have this property:
log2(x * y) = log2(x) + log2(y)
x * y = 2^(log2(x) + log2(y))
We can multiply by adding logarithms, then looking up the result in a power table. In pseudocode:
pow_table = [...] # 256 entries
x = float_to_logspace(0.3) # compile-time
y = float_to_logspace(0.5) # compile-time
result = pow_table[x + y] # runtime: just add + lookup
The Sign Bit Trick
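You can’t take the log of a negative number. Solution: store the sign in the MSB (bit 7) and the log magnitude in the low 7 bits. When two log-space values are added, the sign bits effectively XOR (negative times negative gives positive), provided the magnitudes don’t carry into bit 7. The power table is indexed by the full byte, so it returns a positive or negative result as appropriate.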
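The XOR behaviour falls out of ordinary 8-bit addition. Here is a small illustration with made-up magnitudes; `log_add` and `sign_of` are hypothetical helper names, and the result only holds while the two 7-bit magnitudes sum to 127 or less.

```python
SIGN = 0x80  # bit 7 carries the sign

def log_add(a, b):
    """Add two log-space bytes the way an 8-bit register would (wraps at 256)."""
    return (a + b) & 0xFF

def sign_of(byte):
    """-1 if the sign bit is set, +1 otherwise."""
    return -1 if byte & SIGN else 1

pos = 0x05          # positive value, log magnitude 5
neg = SIGN | 0x03   # negative value, log magnitude 3

pos_times_neg = log_add(pos, neg)   # 0x88: sign set, magnitude 8
neg_times_neg = log_add(neg, neg)   # 0x06: sign cleared, magnitude 6
print(hex(pos_times_neg), hex(neg_times_neg))
```

Two set sign bits add up to 0x100, which is discarded by the 8-bit wrap, so the product of two negatives comes out positive with no extra instructions.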
Solution #3: 8-Bit Fractions
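All scalars are restricted to [-1.0, +1.0] and encoded in a single byte. The same byte means different things depending on whether it is read as a linear value or as a log-space value:
| Byte | Value if linear | Value if log-space |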
|---|---|---|
| 0 | 0/127 = 0 | 2^0 = 1 |
| 1 | 1/127 ≈ 0.0079 | 2^(-1/6) ≈ 0.89 |
| 127 | 127/127 = 1 | 2^(-127/6) ≈ 0 |
| 128 | undefined | -2^0 = -1 |
| 255 | -1/127 ≈ -0.0079 | -2^(-127/6) ≈ -0 |
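Why 127 instead of 128? So that both +1 (byte 127) and -1 (byte 129, i.e. -127) are exactly representable in two’s complement. A divisor of 128 could reach -1 but not +1.
Why base 2^(1/6)? To ensure adding 3 log values won’t overflow into the sign bit: 42 + 42 + 42 = 126, which still fits in 7 bits. A log value of 42 corresponds to 2^(-42/6) = 1/128, roughly the smallest step the linear encoding can represent (1/127).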
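The two encodings and the power table can be sketched in a few lines. This is an illustrative reconstruction from the table above, not the project's build script: the function names are hypothetical, zero is mapped to the largest magnitude as an assumption, and the power table is assumed to return linear-encoded bytes.

```python
import math

STEPS = 6     # log base 2^(1/6): six steps per halving
SIGN = 0x80   # bit 7 is the sign in log space

def float_to_logspace(x):
    """Encode x in [-1, 1] as sign bit + 7-bit log magnitude."""
    if x == 0:
        return 127  # assumption: zero maps to the smallest representable magnitude
    mag = min(round(-STEPS * math.log2(abs(x))), 127)
    return (SIGN if x < 0 else 0) | mag

def logspace_to_float(byte):
    """Decode a log-space byte back to a float."""
    sign = -1.0 if byte & SIGN else 1.0
    return sign * 2.0 ** (-(byte & 0x7F) / STEPS)

def float_to_linear(x):
    """Encode x in [-1, 1] as a two's-complement byte where 127 means 1.0."""
    return round(x * 127) & 0xFF

def linear_to_float(byte):
    """Decode a linear byte back to a float."""
    return (byte - 256 if byte >= 128 else byte) / 127

# 256-entry table: log-space byte in, linear byte out (built ahead of time).
POW_TABLE = [float_to_linear(logspace_to_float(i)) for i in range(256)]

def multiply(x_log, y_log):
    """Runtime multiply: one 8-bit add and one table lookup."""
    return POW_TABLE[(x_log + y_log) & 0xFF]

product = linear_to_float(multiply(float_to_logspace(0.3), float_to_logspace(0.5)))
negative = linear_to_float(multiply(float_to_logspace(-0.3), float_to_logspace(0.5)))
print(product, negative)
```

The result lands near 0.15 rather than exactly on it: each operand is rounded to the nearest sixth of an octave, so a few percent of error is the price of replacing the multiply with an add.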
Solution #4: Combined Lookup Tables
Instead of separate cos and log lookups, create a combined cos_log table:
cos_log(x) = log(cos(x))
This lets us rewrite the shader as:
v = pow(m_log + cos_log(Nφ - Lφ)) + b
Per-pixel operations:
- 1 subtraction
- 1 lookup (cos_log)
- 1 addition
- 1 lookup (pow)
- 1 addition
Total: 3 add/sub, 2 lookups. About 130 cycles per pixel.
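To see the five operations working together, here is a Python emulation of the per-pixel path using only byte arithmetic at runtime. Several details are assumptions made for the sketch rather than facts about the project: angles are stored as bytes with 256 steps per full turn, log magnitudes are clamped to 42 so that two of them cannot carry into the sign bit, and all names are hypothetical.

```python
import math

STEPS = 6      # log base 2^(1/6)
SIGN = 0x80    # sign bit in log space
MAX_MAG = 42   # assumed clamp for log magnitudes

def to_log(x):
    """Sign bit + clamped log magnitude of x in [-1, 1]."""
    mag = MAX_MAG if x == 0 else min(round(-STEPS * math.log2(abs(x))), MAX_MAG)
    return (SIGN if x < 0 else 0) | mag

def to_linear(x):
    """Two's-complement byte where 127 represents 1.0."""
    return round(x * 127) & 0xFF

def from_linear(byte):
    """Decode a linear byte back to a float (for inspection only)."""
    return (byte - 256 if byte >= 128 else byte) / 127

def log_to_float(byte):
    sign = -1.0 if byte & SIGN else 1.0
    return sign * 2.0 ** (-(byte & 0x7F) / STEPS)

# Tables built ahead of time; on hardware these would be baked into ROM.
COS_LOG = [to_log(math.cos(2 * math.pi * a / 256)) for a in range(256)]
POW = [to_linear(log_to_float(i)) for i in range(256)]

def shade(m_log, b, n_phi, l_phi):
    """Runtime path; every value is a byte."""
    delta = (n_phi - l_phi) & 0xFF         # 1 subtraction
    c_log = COS_LOG[delta]                 # 1 lookup (cos_log)
    product_log = (m_log + c_log) & 0xFF   # 1 addition
    product = POW[product_log]             # 1 lookup (pow)
    return (product + b) & 0xFF            # 1 addition

# Bake one pixel (normal at 30 degrees, light at 80 degrees), then shade it.
n_theta, l_theta = math.radians(30), math.radians(80)
m = math.sin(n_theta) * math.sin(l_theta)
b = math.cos(n_theta) * math.cos(l_theta)
m_log, b_lin = to_log(m), to_linear(b)
v = from_linear(shade(m_log, b_lin, n_phi=40, l_phi=10))
print(v)
```

The final addition wraps rather than saturates in this sketch, so it relies on `m` and `b` being chosen such that the sum stays within [-1, 1].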
Solution #5: Self-Modifying Code
The final optimization: hard-coded immediate values are faster than memory loads.
; Slower: 28 cycles
ldh a, [Ltheta] ; 12 cycles (HRAM load; a 16-bit-address ld would cost 16)
ld b, a ; 4 cycles
ld a, [hl+] ; 8 cycles
sub a, b ; 4 cycles
; Faster: 16 cycles
ld a, [hl+] ; 8 cycles
sub a, 8 ; 8 cycles
The difference: 12 cycles per pixel × 960 pixels (15 tiles × 64 pixels each) = 11,520 cycles saved per frame (~10% of shader runtime).
How do you use the faster form when the value changes? Modify the instruction operand at runtime. The instruction sub a, 8 is encoded as D6 08. Change the 08 to a different value, and the instruction now subtracts something else.
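The patching idea can be illustrated with a toy interpreter in Python. This is only a simulation of the concept: it understands just the two opcodes from the faster snippet (0x2A for ld a, [hl+] and 0xD6 for sub a, d8), and the `run` function is hypothetical. On real hardware the patched routine would have to live in RAM, since cartridge ROM cannot be written.

```python
LD_A_HLI = 0x2A   # ld a, [hl+]
SUB_A_D8 = 0xD6   # sub a, d8 (the next byte is the immediate operand)

def run(code, memory, hl):
    """Interpret a byte sequence made of the two opcodes above; returns (a, hl)."""
    a, pc = 0, 0
    while pc < len(code):
        op = code[pc]
        if op == LD_A_HLI:
            a = memory[hl]
            hl += 1
            pc += 1
        elif op == SUB_A_D8:
            a = (a - code[pc + 1]) & 0xFF
            pc += 2
        else:
            raise ValueError(f"unsupported opcode {op:#04x}")
    return a, hl

code = bytearray([LD_A_HLI, SUB_A_D8, 0x08])   # ld a, [hl+] / sub a, 8
memory = bytes([0x20])

before, _ = run(code, memory, hl=0)   # 0x20 - 0x08
code[2] = 0x10                        # patch the operand: now "sub a, 16"
after, _ = run(code, memory, hl=0)    # 0x20 - 0x10
print(hex(before), hex(after))
```

The instruction bytes themselves act as the variable, so updating the light angle is a single store into the routine's operand byte rather than a per-pixel memory load.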
Performance Results
- 15 tiles rendered per frame
- ~130 cycles per pixel
- ~89% of frame time in the shader
- Remaining time for input handling and I/O
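These figures are consistent with one another, as a quick back-of-the-envelope check shows. The frame length used here (154 scanlines of 456 dots, doubled for double-speed mode) is standard Game Boy timing rather than a number from the article, and the variable names are illustrative.

```python
TILES_PER_FRAME = 15
PIXELS_PER_TILE = 8 * 8
CYCLES_PER_PIXEL = 130
CYCLES_PER_FRAME = 154 * 456 * 2   # scanlines x dots x 2 for double-speed mode

pixels = TILES_PER_FRAME * PIXELS_PER_TILE      # 960
shader_cycles = pixels * CYCLES_PER_PIXEL       # 124,800
shader_share = shader_cycles / CYCLES_PER_FRAME

# Saving from the immediate-operand form: 28 - 16 = 12 cycles per pixel.
smc_saving = (28 - 16) * pixels                 # 11,520

print(pixels, shader_cycles, CYCLES_PER_FRAME, round(shader_share, 3), smc_saving)
```

Roughly 15,600 cycles per frame are left over for everything that is not shading.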
The visual tearing is intentional—different portions of the image render on different frames. LCD ghosting makes it less noticeable.
The AI Question
The developer attempted to use Claude Sonnet 4 for the SM83 assembly. Result: failure. The code required too much domain-specific knowledge and constraint-aware design that current models can’t replicate.
What worked: Python scripts for Blender automation, reading OpenEXR layers, documented hardware features.
What didn’t: The core algorithmic work. The soul of the project.
Technical Tags
#Low-Level-Programming #Assembly #Game-Development #Optimization #Constraint-Engineering #Mathematics #Retro-Computing
Key Insights
- Constraints breed creativity – The best solutions often come from severe limitations
- Mathematical transforms unlock performance – Spherical coords + logarithms turned multiplication into addition
- Lookup tables are powerful – Trading memory for computation is timeless
- Self-modifying code has valid uses – When cycles matter more than maintainability
- AI has limits – Novel, constraint-driven engineering remains a human domain
Sometimes the most impressive engineering isn’t building the fastest system—it’s making an impossible system work at all.