Real-Time 3D Shaders on Game Boy Color: A Masterclass in Constraint-Driven Engineering


How logarithms, self-modifying code, and 8-bit fractions enable the impossible


The Challenge

The Game Boy Color has no multiply instruction. No floating point. An 8 MHz CPU in double-speed mode (roughly 140,000 cycles per frame). And yet, a developer managed to render real-time 3D shaders with player-controlled lighting.

This is what engineering under extreme constraints looks like.

The Core Math

A Lambert shader—the simplest 3D lighting model—uses the dot product:

v = N · L

Where N is the normal vector and L is the light direction. Expanded component-wise:

v = Nx*Lx + Ny*Ly + Nz*Lz

Three multiplications, two additions. Trivial on modern hardware. Impossible as written on a Game Boy.

Solution #1: Spherical Coordinates

By converting to spherical coordinates, the dot product becomes:

v = sin(Nθ) * sin(Lθ) * cos(Nφ - Lφ) + cos(Nθ) * cos(Lθ)
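A quick numerical check of this identity, as a Python sketch. It assumes a convention the article does not spell out: θ is the polar angle measured from the z-axis and φ is the azimuth, so a unit vector is (sin θ cos φ, sin θ sin φ, cos θ). The helper names are hypothetical.

```python
import math

def spherical_to_cartesian(theta, phi):
    """Unit vector from polar angle theta (measured from the z-axis) and azimuth phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def dot_cartesian(n_theta, n_phi, l_theta, l_phi):
    """N · L computed the usual way: three multiplies, two adds."""
    n = spherical_to_cartesian(n_theta, n_phi)
    l = spherical_to_cartesian(l_theta, l_phi)
    return sum(a * b for a, b in zip(n, l))

def dot_spherical(n_theta, n_phi, l_theta, l_phi):
    """The same dot product using the spherical-coordinate form."""
    return (math.sin(n_theta) * math.sin(l_theta) * math.cos(n_phi - l_phi)
            + math.cos(n_theta) * math.cos(l_theta))

# Spot-check the identity on a few arbitrary angle combinations.
for angles in [(0.3, 1.2, 0.9, -0.4), (1.5, 3.0, 0.2, 2.2), (2.8, -1.0, 1.1, 0.5)]:
    assert math.isclose(dot_cartesian(*angles), dot_spherical(*angles), abs_tol=1e-12)
```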

If we fix Lθ (the light’s vertical angle) as a constant, the coefficients m and b depend only on each pixel’s normal. They can be precomputed per pixel and never change at runtime:

m = sin(Nθ) * sin(Lθ)  [precomputed per pixel]
b = cos(Nθ) * cos(Lθ)  [precomputed per pixel]
v = m * cos(Nφ - Lφ) + b

Now we only compute one multiplication per pixel. But we still have no multiply instruction.
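A Python sketch of that split between build time and run time. The fixed light elevation L_THETA and the list of per-pixel normal angles are hypothetical stand-ins for data the real project bakes ahead of time; only shade runs when the light moves, and it does one multiply per pixel.

```python
import math

L_THETA = math.radians(45)  # hypothetical fixed light elevation

# Hypothetical per-pixel normal angles (n_theta, n_phi), e.g. baked from a 3D render.
normals = [(0.2, 0.0), (1.0, 1.5), (1.4, -2.0), (0.7, 3.0)]

# Build time: m and b depend only on the pixel's N_theta and the fixed L_theta.
coeffs = [(math.sin(nt) * math.sin(L_THETA), math.cos(nt) * math.cos(L_THETA), nphi)
          for nt, nphi in normals]

def shade(l_phi):
    """Run time: one multiply per pixel, v = m * cos(N_phi - L_phi) + b."""
    return [m * math.cos(nphi - l_phi) + b for m, b, nphi in coeffs]
```

Calling shade with a new light azimuth re-lights every pixel without recomputing m or b.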

Solution #2: Logarithmic Multiplication

Here’s the clever bit. Logarithms have this property:

log₂(x * y) = log₂(x) + log₂(y)
x * y = 2^(log₂(x) + log₂(y))

We can multiply by adding logarithms, then looking up the result in a power table. In pseudocode:

pow_table = [...]  # 256 entries
x = float_to_logspace(0.3)  # compile-time
y = float_to_logspace(0.5)  # compile-time
result = pow_table[x + y]   # runtime: just add + lookup
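A runnable Python version of the pseudocode above, using floats for readability where the Game Boy uses byte tables. The body of float_to_logspace is an assumption: it borrows the base 2^(1/6) and the 7-bit magnitude from the byte encoding described below, and it handles only positive inputs.

```python
import math

LOG_STEPS = 6  # assumption: log base 2^(1/6), i.e. 6 steps per halving
MAX_K = 127    # assumption: 7-bit magnitude

def float_to_logspace(v):
    """Compile-time encoder for 0 < v <= 1: nearest k with v ≈ 2^(-k/6)."""
    return min(MAX_K, round(-LOG_STEPS * math.log2(v)))

# pow_table[k] = 2^(-k/6). 256 entries cover any sum of two 7-bit magnitudes (max 254).
pow_table = [2 ** (-k / LOG_STEPS) for k in range(256)]

x = float_to_logspace(0.3)  # compile-time
y = float_to_logspace(0.5)  # compile-time
result = pow_table[x + y]   # runtime: just add + lookup
```

Here x is 10 and y is 6, so the lookup returns 2^(-16/6) ≈ 0.157 against an exact product of 0.15. That rounding error is the price of replacing a multiply with an add.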

The Sign Bit Trick

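You can’t take the log of a negative number. Solution: store the sign in the MSB (bit 7) and the log magnitude in the low 7 bits. When two log-space bytes are added, bit 7 of the sum is the XOR of the two sign bits, provided the magnitudes don’t carry out of bit 6, and XOR is exactly the sign rule for multiplication (two negatives give a positive). The power table reads that bit and returns a positive or negative result.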
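A Python sketch of the byte arithmetic. log_byte and add_log_bytes are hypothetical helpers; add_log_bytes models the CPU’s ordinary 8-bit add, which wraps at 256.

```python
SIGN = 0x80  # bit 7 holds the sign; bits 0-6 hold the log magnitude

def log_byte(negative, k):
    """Pack a sign flag and a 7-bit log magnitude into one byte."""
    assert 0 <= k <= 0x7F
    return (SIGN if negative else 0) | k

def add_log_bytes(a, b):
    """What the CPU does: a plain 8-bit add that wraps at 256."""
    return (a + b) & 0xFF

# Negative times positive: only one sign bit is set, so the sum keeps bit 7 set.
p = add_log_bytes(log_byte(True, 10), log_byte(False, 6))
# Negative times negative: 0x80 + 0x80 carries out of the byte, so bit 7 clears.
q = add_log_bytes(log_byte(True, 10), log_byte(True, 6))
```

If the two magnitudes summed past 127, the carry out of bit 6 would flip bit 7 and corrupt the sign, which is why the encoding below keeps magnitudes small.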

Solution #3: 8-Bit Fractions

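All scalars are restricted to [-1.0, +1.0] and encoded in a single byte, in one of two formats. The linear format reads the byte as a signed two’s-complement value and divides by 127. The log format uses bit 7 as the sign and bits 0-6 as a magnitude k, representing ±2^(-k/6):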

Byte | Linear value       | Log value
0    | 0/127 = 0          | 2^0 = 1
1    | 1/127 ≈ 0.0079     | 2^(-1/6) ≈ 0.89
127  | 127/127 = 1        | 2^(-127/6) ≈ 0
128  | undefined          | -2^0 = -1
255  | -1/127 ≈ -0.0079   | -2^(-127/6) ≈ -0

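Why 127 instead of 128? A signed byte spans -128 to +127, so dividing by 127 lets both +1 (127) and -1 (-127) be represented in two’s complement. Dividing by 128 would make -1 reachable but not +1.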

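Why base 2^(1/6)? To ensure adding 3 log values won’t overflow. In this base, a magnitude of 42 already means 2^(-42/6) = 2^(-7) = 1/128, just under the 1/127 step of the linear format, so magnitudes can be capped at 42 with little loss. Three capped values then sum to at most 42+42+42 = 126, which still fits in the 7 magnitude bits without carrying into the sign bit.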
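The two byte formats can be written as Python decoders that reproduce the rows of the table above. encode_log is a hypothetical compile-time helper; rounding to the nearest k and clamping at 42 are assumptions consistent with the base-2^(1/6) argument, not the project’s confirmed tooling.

```python
import math

def decode_linear(byte):
    """Linear format: the byte as a signed (two's complement) value over 127."""
    signed = byte - 256 if byte >= 128 else byte
    if signed == -128:
        return None  # 128 is left undefined: -128/127 falls outside [-1, +1]
    return signed / 127

def decode_log(byte):
    """Log format: bit 7 is the sign, bits 0-6 hold k, value = ±2^(-k/6)."""
    k = byte & 0x7F
    magnitude = 2 ** (-k / 6)
    return -magnitude if byte & 0x80 else magnitude

def encode_log(x, cap=42):
    """Hypothetical encoder for 0 < |x| <= 1: nearest k, clamped to cap, plus sign bit."""
    k = min(cap, round(-6 * math.log2(abs(x))))
    return (0x80 if x < 0 else 0) | k
```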

Solution #4: Combined Lookup Tables

Instead of separate cos and log lookups, create a combined cos_log table:

cos_log(x) = log(cos(x))

With m_log = log(m) precomputed per pixel, this lets us rewrite the shader as:

v = pow(m_log + cos_log(Nφ - Lφ)) + b

Per-pixel operations:
– 1 subtraction
– 1 lookup (cos_log)
– 1 addition
– 1 lookup (pow)
– 1 addition

Total: 3 add/sub, 2 lookups. About 130 cycles per pixel.
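Putting the pieces together, here is a Python model of the per-pixel work. Everything in it is a hypothetical sketch: the table names, the mapping of an angle byte onto a full turn (0 to 255 for 0 to 2π), the cap of 42 on log magnitudes, and the rounding rules are assumptions, and m_log and b_lin stand for the per-pixel values that would be precomputed at build time. Python can’t show cycle counts, but shade_pixel performs the same five operations listed above, with & 0xFF mimicking 8-bit register wraparound.

```python
import math

CAP = 42  # log magnitudes capped so adding two never carries into bit 7

def encode_log(x):
    """Float in [-1, 1] -> log byte: bit 7 = sign, bits 0-6 = k with |x| ≈ 2^(-k/6)."""
    k = CAP if x == 0 else min(CAP, round(-6 * math.log2(abs(x))))
    return (0x80 if x < 0 else 0) | k

def encode_linear(x):
    """Float in [-1, 1] -> signed byte round(x * 127), stored as 0..255."""
    return round(x * 127) & 0xFF

def to_signed(byte):
    """Reinterpret 0..255 as a two's-complement value in -128..127."""
    return byte - 256 if byte >= 128 else byte

# Assumption: an angle byte maps 0..255 onto one full turn (0 to 2*pi).
cos_log = [encode_log(math.cos(2 * math.pi * a / 256)) for a in range(256)]

def pow_entry(byte):
    """Log byte -> linear byte: ±2^(-k/6) rescaled to the 1/127 format."""
    value = 2 ** (-(byte & 0x7F) / 6)
    return encode_linear(-value if byte & 0x80 else value)

pow_table = [pow_entry(i) for i in range(256)]

def shade_pixel(n_phi, m_log, b_lin, l_phi):
    """Per-pixel work: 3 add/sub and 2 lookups on wrapping 8-bit values."""
    angle = (n_phi - l_phi) & 0xFF      # 1 subtraction
    c = cos_log[angle]                  # 1 lookup (cos_log)
    p = pow_table[(m_log + c) & 0xFF]   # 1 addition + 1 lookup (pow)
    return (p + b_lin) & 0xFF           # 1 addition, in linear space
```

One edge case this sketch does not guard against: when a normal points almost straight at the light, the rounded pow result and b_lin can sum past 127 and wrap to a negative byte.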

Solution #5: Self-Modifying Code

The final optimization: hard-coded immediate values are faster than memory loads.

; Slower: 28 cycles
ldh a, [Ltheta] ; 12 cycles (Ltheta in HRAM; ld a, [a16] takes 16)
ld b, a         ; 4 cycles
ld a, [hl+]     ; 8 cycles
sub a, b        ; 4 cycles

; Faster: 16 cycles
ld a, [hl+]     ; 8 cycles
sub a, 8        ; 8 cycles

The difference: 12 cycles per pixel × 960 pixels (15 tiles of 8×8 pixels) = 11,520 cycles saved per frame, roughly 9% of the shader’s ~125,000-cycle runtime (960 × 130).

How do you use the faster form when the value changes? Modify the instruction operand at runtime. The instruction sub a, 8 is encoded as D6 08. Change the 08 to a different value, and the instruction now subtracts something else.
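To make the mechanism concrete, here is a Python model of patching an operand byte. It is not Game Boy code: code is a hypothetical bytearray standing in for the routine in memory, and run is a toy interpreter that understands only the two opcodes involved (0x2A for ld a, [hl+] and 0xD6 for sub a, d8). On real hardware the routine has to live in RAM (for example, copied to WRAM at startup), since writes to the cartridge ROM area don’t change ROM contents.

```python
# Hypothetical 3-byte routine: ld a, [hl+] (2A) followed by sub a, 8 (D6 08).
code = bytearray([0x2A, 0xD6, 0x08])
SUB_OPERAND = 2  # offset of the immediate byte that follows the D6 opcode

def run(code, memory, hl):
    """Toy interpreter for just these two SM83 opcodes; returns register A."""
    pc, a = 0, 0
    while pc < len(code):
        op = code[pc]
        if op == 0x2A:    # ld a, [hl+]: load from memory at HL, then increment HL
            a, hl, pc = memory[hl], hl + 1, pc + 1
        elif op == 0xD6:  # sub a, d8: subtract the immediate byte that follows
            a, pc = (a - code[pc + 1]) & 0xFF, pc + 2
        else:
            raise ValueError(f"unsupported opcode {op:#04x}")
    return a

memory = bytes([100])
before = run(code, memory, 0)  # 100 - 8
code[SUB_OPERAND] = 20         # patch the operand, e.g. when the light moves
after = run(code, memory, 0)   # 100 - 20
```

The instruction stream keeps its fast immediate form; only the data baked into it changes.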

Performance Results

  • 15 tiles rendered per frame
  • ~130 cycles per pixel
  • ~89% of frame time in the shader
  • Remaining time for input handling and I/O

The visual tearing is intentional: only 15 tiles are rendered per frame, so different portions of the image update on different frames. LCD ghosting makes it less noticeable.
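The performance figures are consistent with one another, as this Python arithmetic check shows. It assumes the cycle counts are CPU cycles in double-speed mode, where a frame is 154 scanlines of 456 dots and the CPU clock runs at twice the dot rate; the 130-cycle and 12-cycle figures are the article’s own.

```python
# Assumption: double-speed mode, 154 scanlines x 456 dots per frame,
# with two CPU cycles per PPU dot.
CYCLES_PER_FRAME = 154 * 456 * 2          # 140,448

TILES_PER_FRAME = 15
PIXELS_PER_TILE = 8 * 8
CYCLES_PER_PIXEL = 130                    # figure quoted in the article

pixels = TILES_PER_FRAME * PIXELS_PER_TILE        # 960
shader_cycles = pixels * CYCLES_PER_PIXEL         # 124,800
shader_share = shader_cycles / CYCLES_PER_FRAME   # fraction of the frame spent shading
smc_savings = 12 * pixels                         # cycles saved by the immediate operand
```

960 pixels at 130 cycles each comes to 124,800 cycles, about 89% of a 140,448-cycle frame, which matches the figure above.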

The AI Question

The developer attempted to use Claude Sonnet 4 for the SM83 assembly. Result: failure. The code required more domain-specific knowledge and constraint-aware design than the model could supply.

What worked: Python scripts for Blender automation, reading OpenEXR layers, documented hardware features.

What didn’t: The core algorithmic work. The soul of the project.

Technical Tags

#Low-Level-Programming #Assembly #Game-Development #Optimization #Constraint-Engineering #Mathematics #Retro-Computing

Key Insights

  1. Constraints breed creativity – The best solutions often come from severe limitations
  2. Mathematical transforms unlock performance – Spherical coords + logarithms turned multiplication into addition
  3. Lookup tables are powerful – Trading memory for computation is timeless
  4. Self-modifying code has valid uses – When cycles matter more than maintainability
  5. AI has limits – Novel, constraint-driven engineering remains a human domain

Sometimes the most impressive engineering isn’t building the fastest system—it’s making an impossible system work at all.

