Prototyping DOS effects with ShaderToy

From SizeCoding
Revision as of 05:33, 6 October 2025 by Pestis (talk | contribs)

Sometimes it is useful to prototype ideas for DOS effects before going through the trouble of writing them in x86/x87 assembly. Shadertoy is a popular choice for creating such prototypes. However, Shadertoy shaders are written in GLSL (the shading language used by WebGL), which is a relatively powerful language with native support for vectors, matrices, arithmetic operators on them, and many built-in functions. Most of these features are not available in x86 assembly, so it is fairly easy to write a tiny shader in Shadertoy that ends up well over 256 bytes once ported to DOS.

To make sure your Shadertoy prototype is portable to DOS, you should avoid operations that are going to be costly (in terms of bytes) and only use ones that will be cheap in assembly. Below you’ll find some size estimates for GLSL code once ported to x87 math.

Scalar operators

ShaderToy Bytes Rough x87 equivalent
x+=y 2 faddp st1, st0
x+y 4 fld st0; fadd st0, st2 (if both x and y are needed later)

The cost for -, *, and / scalar operations is identical. A lot of this depends on how your x87 stack is organized (which variable is at the top of the stack at st0) and whether you need to keep copies of the variables for later use. In the last optimization phases, you can often save a few bytes by reorganizing your stack, to avoid unnecessary fld or fxch instructions.

Notice the existence of fsubr and fdivr instructions, so x=(y/x) can still be just 2 bytes, even if it looks more complicated in ShaderToy.

Also notice that operating on a single component of a vector (b.x += a.x) is actually a scalar operation and thus takes the same 2-4 bytes.

Scalar functions

ShaderToy Bytes x87 equivalent
-x 2 fchs
abs(x) 2 fabs
sqrt(x) 2 fsqrt
sin(x) 2 fsin
cos(x) 2 fcos
sin(x) ... cos(x) 2 fsincos
tan(x) 2 fptan (also pushes 1.0 on top of the result; popping it costs 2 more bytes)
atan(y,x) 2 fpatan
log2(x) 4 fld1; ...; fyl2x
exp2(x) 14 fld1; fld st1; fprem; f2xm1; faddp st1,st0; fscale; fstp st1
pow(x,y) 16 Computed as 2^(y*log2(x)) i.e. fyl2x, followed by the exp2(x) code
exp(x) 18 Computed as 2^(x*log2(e)) i.e. fldl2e and fmulp, followed by the exp2(x) code
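
The exp2(x) sequence above splits x into an integer part and a fractional part, because f2xm1 only accepts arguments in -1 .. 1. The following is a minimal Python sketch of the same data flow, meant for checking the idea before writing the assembly. The function names are hypothetical, and Python doubles stand in for the 80-bit x87 registers, so the last bits of the results will differ from real hardware.

```python
import math

def exp2_x87(x):
    """Model of: fld1; fld st1; fprem; f2xm1; faddp st1,st0; fscale; fstp st1."""
    frac = math.fmod(x, 1.0)      # fprem: remainder of x / 1, sign follows x
    m = 2.0 ** frac - 1.0         # f2xm1: 2^frac - 1, valid because |frac| < 1
    m += 1.0                      # faddp st1, st0: add back the 1 loaded by fld1
    return math.ldexp(m, math.trunc(x))  # fscale: multiply by 2^trunc(x)

def pow_x87(x, y):
    """pow(x, y) as 2^(y*log2(x)); fyl2x computes the exponent in one instruction."""
    return exp2_x87(y * math.log2(x))

def exp_x87(x):
    """exp(x) as 2^(x*log2(e)); fldl2e loads the constant log2(e)."""
    return exp2_x87(x * math.log2(math.e))

for x in (3.7, -2.3, 0.5):
    print(x, exp2_x87(x), 2.0 ** x)
```

Both fprem and fscale truncate toward zero, which is why the two halves recombine correctly for negative x as well.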

There are no acos, asin, sinh, cosh, tanh, asinh, acosh, and atanh instructions on x87 and implementing them yourself is probably not worth your bytes. This is a pity, as tanh is a classic "squash" function to get any number into -1 .. 1 range.

Rounding and remainders

ShaderToy Bytes x87 equivalent
round(x) 2 frndint (the default rounding mode is to nearest)
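mod(x, y) 2 fprem or fprem1 (fprem truncates the quotient like C's fmod, so it matches GLSL mod only for non-negative x; fprem1 rounds the quotient to nearest)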
ceil(x) 2 + up to 5 Up to 5 bytes to set up the rounding mode with fldcw, followed by frndint
floor(x) 2 + up to 5 Up to 5 bytes to set up the rounding mode with fldcw, followed by frndint

Notice that x-round(x) is a very compact way to do domain repetition for raymarchers.
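
To see why x-round(x) works as domain repetition, and how it differs from a remainder, here is a small Python sketch. Python's round() rounds halves to even, like frndint in its default mode, and math.fmod truncates like fprem. The helper names are made up for illustration.

```python
import math

def repeat(x):
    """x - round(x), i.e. fld st0; frndint; fsubp st1, st0 (6 bytes)."""
    return x - round(x)

def glsl_mod(x, y):
    """GLSL mod(x, y), which is defined as x - y*floor(x/y)."""
    return x - y * math.floor(x / y)

# Four inputs that are exactly one period apart from each other.
for x in (-1.25, -0.25, 0.75, 1.75):
    print(x, repeat(x), math.fmod(x, 1.0), glsl_mod(x, 1.0))
```

All four inputs map to the same repeat() value inside -0.5 .. 0.5 and to the same glsl_mod value inside 0 .. 1, whereas the fmod result changes sign together with x. A prototype that relies on mod for negative coordinates will therefore not port to a bare fprem.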

Vector arithmetic (examples)

ShaderToy Bytes x87 equivalent
a.xy = a.yx 2 fxch st0, st1
a.xyz = a.yzx 4 fxch st0, st2; fxch st0, st1;
a+=b 5-6 If b is not needed later. 6 bytes: faddp st3, st0; faddp st3, st0; faddp st3, st0. 5 bytes: if you have a trashable register with suitable parity, use the looping three times trick
dot(a,b) 9-10 If neither a nor b is needed later, compute this as a*=b followed by a.z+=a.y+=a.x
a+=b 10 If b is needed later: fadd st3, st0; fld st1; faddp st5, st0; fld st2; faddp st6, st0
length(a) 16 If a is not needed later: fmul st0, st0; fld st1; fmul st0, st0; faddp st1, st0; fld st2; fmul st0, st0; faddp st1, st0; fsqrt;
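
Stack offsets are easy to get wrong, because every fld and every popping instruction shifts them. Below is a tiny Python interpreter for the handful of instructions used in this table, as a sketch for checking a sequence before assembling it. It is a hypothetical helper, not part of any toolchain, and it assumes every operand is written out in the two-operand form (destination first), with st0 at index 0 of the list.

```python
import math

def run(program, stack):
    """Run a ';'-separated x87 snippet on a list where stack[0] is st0."""
    st = list(stack)
    for line in program.split(";"):
        parts = line.replace(",", " ").split()
        if not parts:
            continue
        op = parts[0]
        regs = [int(p[2:]) for p in parts[1:]]    # "st3" -> 3
        if op == "fld":                            # push a copy of st(i)
            st.insert(0, st[regs[0]])
        elif op == "fxch":                         # swap st0 with st(i)
            i = max(regs)
            st[0], st[i] = st[i], st[0]
        elif op == "fadd":                         # dst += src
            st[regs[0]] += st[regs[1]]
        elif op == "fmul":                         # dst *= src
            st[regs[0]] *= st[regs[1]]
        elif op == "faddp":                        # dst += st0, then pop
            st[regs[0]] += st[0]
            st.pop(0)
        elif op == "fsqrt":
            st[0] = math.sqrt(st[0])
        else:
            raise ValueError("unsupported instruction: " + op)
    return st

ROTATE = "fxch st0, st2; fxch st0, st1"
ADD_DROP_B = "faddp st3, st0; faddp st3, st0; faddp st3, st0"
ADD_KEEP_B = "fadd st3, st0; fld st1; faddp st5, st0; fld st2; faddp st6, st0"
LENGTH = ("fmul st0, st0; fld st1; fmul st0, st0; faddp st1, st0; "
          "fld st2; fmul st0, st0; faddp st1, st0; fsqrt")

print(run(ROTATE, [1.0, 2.0, 3.0]))                          # a.xyz = a.yzx
print(run(ADD_KEEP_B, [10.0, 20.0, 30.0, 1.0, 2.0, 3.0]))    # b on top of a
print(run(LENGTH, [3.0, 4.0, 12.0]))                         # length in st0
```

Note that the length sequence leaves a.y and a.z on the stack below the result, so "a is not needed later" only means that a.x gets overwritten.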

From this you can already see that a simple normalize(x) is going to take a lot of bytes, as it has to be computed as x/=length(x). Therefore, normalizing your raymarcher's rays is usually best avoided. cross, reflect, and refract are probably also too costly for sizecoding.

Floating point constants

x87 has the following constants built-in and loading each takes just 2 bytes:

Constant Approximation Instruction
0.0 0.0 fldz
1.0 1.0 fld1
pi 3.14159... fldpi
log2(e) 1.44270... fldl2e
ln(2) 0.69315... fldln2
log2(10) 3.32193... fldl2t
log10(2) 0.30103... fldlg2

Thus, if you just need "some random constant" in your shader, using one of these can save bytes. Notice, however, that fldpi; fmulp st1, st0 is still 4 bytes, whereas fmul st0, dword [bp+offset] can be as little as 3 bytes, if the offset is short and you can reuse code or another value as the constant.

Even if you need to define a new constant, you don't always need the full 4 bytes of an IEEE single-precision number; sometimes even a single byte suffices. The top byte alone holds the sign and most of the exponent, so the order of magnitude is already correct. You can then try to place this byte somewhere in your code or data where at least the first few bits of the mantissa come out right, to slightly increase the accuracy. An IEEE 754 float converter is handy for seeing what floating-point values encode to, and what different byte patterns represent as floating-point constants.
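
The effect of keeping only the top bytes can be checked with a few lines of Python. This is a sketch with made-up helper names; it assumes a positive 32-bit float stored little-endian, and reports the range the constant can decode to depending on what the unspecified low bytes happen to contain.

```python
import struct

def float_bits(value):
    """32-bit IEEE 754 pattern of value, as an integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def bits_float(bits):
    """Float whose 32-bit IEEE 754 pattern is bits."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def truncated_range(value, nbytes):
    """Keep the top nbytes of a positive float.

    Returns (kept bytes as an integer, lowest, highest), where lowest and
    highest are the values obtained when the dropped bytes are all 0x00 or
    all 0xff.
    """
    shift = 8 * (4 - nbytes)
    top = float_bits(value) >> shift
    lo = bits_float(top << shift)
    hi = bits_float((top << shift) | ((1 << shift) - 1))
    return top, lo, hi

for n in (1, 2, 3):
    top, lo, hi = truncated_range(0.055, n)
    print(f"{n} byte(s): 0x{top:0{2 * n}x} decodes to {lo:.6f} .. {hi:.6f}")
```

With one byte, 0.055 is only pinned down to 0.03125 .. 0.125, a factor of four, because the lowest exponent bit lives in the second byte. With two bytes (0x3d61) the worst-case error is already well below 1%.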

Case study: Balrog

With all of the above in mind, the 256-byte executable graphics entry Balrog can serve as a case study. Balrog is a fractal raymarcher whose innermost loop is:

for(int j=0;j<ITERS;j++){                    
    t.x = abs(t.x - round(t.x)); // abs is folding, t.x - round(t.x) is domain repetition               
    t.x += t.x; // domain scaling
    r *= RSCALE;          
    r += t.x*t.x;
    t.xyz = t.yzx; // shuffle coordinates so next time we operate on previous y etc.
    t.x += t.z * o; // rotation, but using very poor math
    t.z -= t.x * o;               
}

Even though t is a vector, the code mostly does scalar math on t.x and uses coordinate shuffling (t.xyz = t.yzx) to reach the other coordinates. That code ports to:

    mov     cl, ITERS
.maploop:
    fld     st0          ; t.x t.x
    frndint
    fsubp   st1, st0     ; t.x-round(t.x)
    fabs                 ; t.x = abs(t.x - round(t.x))
    fadd    st0          ; t.x += t.x;
    fld     dword [c_rscale+bp-BASE]
    fmulp   st4, st0     ; r *= RSCALE
    fld     st0
    fmul    st0
    faddp   st4, st0     ; r += t.x*t.x
    fxch    st2, st0
    fxch    st1, st0     ; t.xyz = t.yzx
    fld     st2
    fmul    dword [si]
    faddp   st1, st0     ; t.x += t.z * o;
    fld     st0
    fmul    dword [si]
    fsubp   st3, st0     ; t.z -= t.x * o
    loop    .maploop

The comments show exactly how each ShaderToy line maps to different x87 instructions.
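
One way to gain confidence in such a port is to model both versions and compare them. The Python sketch below transcribes the ShaderToy loop and the assembly loop instruction by instruction. The stack layout (t.x, t.y, t.z, r in st0 .. st3) is inferred from the operands of the listing, and the values used for o, RSCALE and ITERS are placeholders, not the ones Balrog uses.

```python
def balrog_glsl(t, r, o, rscale, iters):
    """Direct transcription of the ShaderToy loop."""
    tx, ty, tz = t
    for _ in range(iters):
        tx = abs(tx - round(tx))
        tx += tx
        r *= rscale
        r += tx * tx
        tx, ty, tz = ty, tz, tx
        tx += tz * o
        tz -= tx * o
    return [tx, ty, tz, r]

def balrog_x87(t, r, o, rscale, iters):
    """The same loop on a modelled x87 stack; st[0] is st0."""
    st = [t[0], t[1], t[2], r]
    for _ in range(iters):                # mov cl, ITERS ... loop .maploop
        st.insert(0, st[0])               # fld st0
        st[0] = float(round(st[0]))       # frndint
        st[1] -= st[0]; st.pop(0)         # fsubp st1, st0
        st[0] = abs(st[0])                # fabs
        st[0] += st[0]                    # fadd st0
        st.insert(0, rscale)              # fld dword [c_rscale+bp-BASE]
        st[4] *= st[0]; st.pop(0)         # fmulp st4, st0
        st.insert(0, st[0])               # fld st0
        st[0] *= st[0]                    # fmul st0
        st[4] += st[0]; st.pop(0)         # faddp st4, st0
        st[0], st[2] = st[2], st[0]       # fxch st2, st0
        st[0], st[1] = st[1], st[0]       # fxch st1, st0
        st.insert(0, st[2])               # fld st2
        st[0] *= o                        # fmul dword [si]
        st[1] += st[0]; st.pop(0)         # faddp st1, st0
        st.insert(0, st[0])               # fld st0
        st[0] *= o                        # fmul dword [si]
        st[3] -= st[0]; st.pop(0)         # fsubp st3, st0
    return st

# Placeholder inputs, chosen only for the comparison.
args = ((0.3, 1.7, -2.2), 0.0, 0.25, 1.26, 8)
print(balrog_glsl(*args))
print(balrog_x87(*args))
```

Because both functions perform the same floating-point operations in the same order, the two printed lists are identical. A mismatch would point at a wrong stack offset in the port.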

The Balrog code also later exemplifies the floating point truncation technique:

c_mindist equ $-3
    db      0x38  ; 0.0001
c_glowamount equ $-2
c_colorscale equ $-2
    dw      0x3d61  ; 0.055
c_stepsizediv equ $-1
    db      0x03 ; 807
c_stepsizediv_z equ $-3
    db      0x40 ; 2.1006666666666662
c_glowdecay equ $-2
    dw      0x461c ; 1e4
c_rscale equ $-2
    db      0xa1, 0x3f  ; 1.2599210498948732
c_rdiv equ $-2
    dw      0x434b ; 203.18733465192963
c_camz equ $-1
    db      0xcc, 0x12, 0x42 ; 36.7
c_xdiv equ $-1
    db      0x09, 0x00, 0x40 ; 2.0006
c_xmult equ $-2
    dw      0x3f2a
c_camy equ $-2
    dw      0x3f1c ; 0.61

Two of the constants ended up sharing the same value (c_glowamount and c_colorscale), many consist of only the exponent byte (a single db), and two needed as many as 3 bytes to get enough precision (c_camz and c_xdiv). The ordering of the constants was carefully chosen so that when the exponent of one constant serves as part of the mantissa of the next, the value is still at least roughly correct.
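
The overlap can be reproduced offline by laying out the same bytes and reading a dword at each label. The Python sketch below does this under the assumption that these constants are loaded as 32-bit floats (as c_rscale is in the loop above). Labels that reach back into the code before the first db, and the c_stepsizediv pair, are left out because their full contents cannot be derived from the listing alone.

```python
import struct

# Constant bytes in the order of the listing (dw values are little-endian).
data = bytes([
    0x38,              # c_mindist
    0x61, 0x3d,        # c_glowamount / c_colorscale
    0x03,              # c_stepsizediv
    0x40,              # c_stepsizediv_z
    0x1c, 0x46,        # c_glowdecay
    0xa1, 0x3f,        # c_rscale
    0x4b, 0x43,        # c_rdiv
    0xcc, 0x12, 0x42,  # c_camz
    0x09, 0x00, 0x40,  # c_xdiv
    0x2a, 0x3f,        # c_xmult
    0x1c, 0x3f,        # c_camy
])

# Offset of each label into data: the position of its first listed byte
# minus the N of its "equ $-N".
labels = {
    "c_glowdecay": 3, "c_rscale": 5, "c_rdiv": 7, "c_camz": 10,
    "c_xdiv": 13, "c_xmult": 15, "c_camy": 17,
}

def load_dword(name):
    """Value a 32-bit float load from the label would see."""
    return struct.unpack_from("<f", data, labels[name])[0]

for name in labels:
    print(f"{name:12s} {load_dword(name):.6g}")
```

For example, c_camz decodes to roughly 36.7 although only three bytes were spent on it: its lowest byte is the 0x43 that also serves as the top byte of c_rdiv.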