Difference between revisions of "Prototyping DOS effects with ShaderToy"
m |
|||
(18 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | Sometimes it is useful to prototype ideas for DOS effects before going through the trouble of writing | + | Sometimes it is useful to prototype ideas for DOS effects before going through the trouble of writing them in x86/x87 assembly. [https://shadertoy.com/ Shadertoy] is a popular choice for creating such prototypes. However, the shaders are written in WebGL, which is a relatively powerful language and includes native support for vectors, matrices, many built-in functions, and arithmetic operations. Most of these features are not available in x86 assembly. It is fairly easy to write tiny shaders in Shadertoy that end up well over 256 bytes once finally ported to DOS. Furthermore, the x87's limit of 8 stack items imposes significant constraints on the number of temporary variables you can use: for example, two 3D vectors already occupy 6 stack slots, and you typically need at least one additional item as scratch space, so that's almost the entire stack already. |
− | To make sure your | + | To make sure your Shadertoy prototype is portable to DOS, you should avoid operations that are going to be costly (in terms of bytes) and only use ones that will be cheap in assembly. Below you’ll find some size estimates for WebGL code once ported to x87 math. |
== Scalar operators == | == Scalar operators == | ||
Line 7: | Line 7: | ||
{| class="wikitable" | {| class="wikitable" | ||
|- | |- | ||
− | ! ShaderToy !! Bytes !! | + | ! ShaderToy !! Bytes !! x87 equivalent |
|- | |- | ||
| x+=y || 2 || faddp st1, st0 | | x+=y || 2 || faddp st1, st0 | ||
Line 16: | Line 16: | ||
The cost for <code>-</code>, <code>*</code>, and <code>/</code> scalar operations is identical. A lot of this depends on how your x87 stack is organized (which variable is at the top of the stack at st0) and whether you need to keep copies of the variables for later use. In the last optimization phases, you can often save a few bytes by reorganizing your stack, to avoid unnecessary <code>fld</code> or <code>fxch</code> instructions. | The cost for <code>-</code>, <code>*</code>, and <code>/</code> scalar operations is identical. A lot of this depends on how your x87 stack is organized (which variable is at the top of the stack at st0) and whether you need to keep copies of the variables for later use. In the last optimization phases, you can often save a few bytes by reorganizing your stack, to avoid unnecessary <code>fld</code> or <code>fxch</code> instructions. | ||
− | Notice the existence of <code>fsubr</code> | + | Notice the existence of <code>fsubr</code> and <code>fdivr</code> instructions, so <code>x=(y/x)</code> can still be just 2 bytes, even if it looks more complicated in ShaderToy. |
Also notice that operating on a single component of a vector (<code>b.x += a.x</code>) is actually a scalar operation and thus takes the same 2-4 bytes. | Also notice that operating on a single component of a vector (<code>b.x += a.x</code>) is actually a scalar operation and thus takes the same 2-4 bytes. | ||
Line 44: | Line 44: | ||
| log2(x) || 4 || fld1<br>...<br>fyl2x | | log2(x) || 4 || fld1<br>...<br>fyl2x | ||
|- | |- | ||
− | | exp2( | + | | 2*min(x,y) || 10 || Computed as a+b-abs(a-b) i.e.<br/>fld st0; fsub st0, st2; fabs; fsubp st1, st0; faddp st1, st0<br/>Replace fsubp with faddp for 2*max(x,y) |
+ | |- | ||
+ | | min(x,y) || 11 || fcom st0, st1; fnstsw ax; sahf; jc .S; fxch st0, st1; .S: fstp st1, st0<br/>Replace jc with jnc for max(x,y) | ||
+ | |- | ||
+ | | exp2(y) || 14 || fld1; fld st1; fprem; f2xm1; faddp st1,st0; fscale; fstp st1 | ||
|- | |- | ||
| pow(x,y) || 16 || Computed as 2^(y*log2(x)) i.e. fyl2x, followed by the exp2(x) code | | pow(x,y) || 16 || Computed as 2^(y*log2(x)) i.e. fyl2x, followed by the exp2(x) code | ||
|- | |- | ||
− | | exp( | + | | exp(y) || 18 || Computed as 2^(y*log2(e)) i.e. fldl2e and fmulp, followed by the exp2(y) code |
|} | |} | ||
− | <code>acos</code>, <code>asin</code>, <code>sinh</code>, <code>cosh</code>, <code>tanh</code>, <code>asinh</code>, <code>acosh</code>, and <code>atanh</code> | + | There are no <code>acos</code>, <code>asin</code>, <code>sinh</code>, <code>cosh</code>, <code>tanh</code>, <code>asinh</code>, <code>acosh</code>, and <code>atanh</code> instructions on x87 and implementing them yourself is probably not worth your bytes. This is a pity, as <code>tanh</code> is a classic "squash" function to get any number into -1 .. 1 range. |
+ | |||
+ | Notice the significant byte cost of min and max operations, primarily due to the absence of fcomi and fmovcc instructions on older processors. This can catch people off guard, as many shaders use min/max extensively. For example, raymarchers often rely on them to compute shape unions. Avoiding unnecessary use of min and max can prevent a few headaches later. | ||
== Rounding and remainders == | == Rounding and remainders == | ||
Line 59: | Line 65: | ||
! ShaderToy !! Bytes !! x87 equivalent | ! ShaderToy !! Bytes !! x87 equivalent | ||
|- | |- | ||
− | | round(x) || 2 || frndint | + | | round(x) || 2 || frndint with the default rounding mode, which is nearest |
|- | |- | ||
− | | x | + | | mod(x,y) || 2 || fprem or fprem1. Notice they compute the remainder, not modulo, so they are the same only for positive values of x. |
|- | |- | ||
− | | ceil(x) || 2 | + | | ceil(x) || 2-7 || Up to 5 bytes to setup the rounding mode with fldcw, followed by frndint |
|- | |- | ||
− | | floor(x) || 2 | + | | floor(x) || 2-7 || Up to 5 bytes to setup the rounding mode with fldcw, followed by frndint |
|} | |} | ||
Line 74: | Line 80: | ||
{| class="wikitable" | {| class="wikitable" | ||
|- | |- | ||
− | ! ShaderToy !! x87 equivalent | + | ! ShaderToy !! Bytes !! x87 equivalent |
|- | |- | ||
|| a.xy = a.yx || 2 || fxch st0, st1 | || a.xy = a.yx || 2 || fxch st0, st1 | ||
Line 80: | Line 86: | ||
|| a.xyz = a.yzx || 4 || fxch st0, st2; fxch st0, st1; | || a.xyz = a.yzx || 4 || fxch st0, st2; fxch st0, st1; | ||
|- | |- | ||
− | || a+=b || 5-6 || Assuming b is not needed later.<br/>6 bytes: faddp st3, st0; faddp st3, st0; faddp st3, st0;<br/>5 bytes: if you have a register with suitable parity | + | || a+=b || 5-6 || Assuming b is not needed later.<br/>6 bytes: faddp st3, st0; faddp st3, st0; faddp st3, st0;<br/>5 bytes: if you have a trashable register with suitable parity, use the [[General_Coding_Tricks#Looping_three_times|looping three times trick]] |
|- | |- | ||
|| dot(a,b) || 9-10 || If neither a or b is needed later, compute this as a*=b followed by a.z+=a.y+=a.x | || dot(a,b) || 9-10 || If neither a or b is needed later, compute this as a*=b followed by a.z+=a.y+=a.x | ||
Line 86: | Line 92: | ||
|| a+=b || 10 || If b is needed later: fadd st3; fld st1; faddp st5; fld st2; faddp st6 | || a+=b || 10 || If b is needed later: fadd st3; fld st1; faddp st5; fld st2; faddp st6 | ||
|- | |- | ||
− | || length(a) || | + | || length(a) || 14 || If a is not needed later: fmul st0, st0; fxch st0, st2; fmul st0, st0; faddp st2, st0; fmul st0, st0; faddp st1, st0; fsqrt |
|} | |} | ||
− | From this you can already see that a simple <code>normalize( | + | From this you can already see that a simple <code>normalize(a)</code> is going to take a lot of bytes, as it has to be computed as <code>a/=length(a)</code>. Therefore, normalizing your raymarchers rays is usually to be avoided. <code>cross</code>, <code>reflect</code>, and <code>refract</code> are probably also too costly for sizecoding. |
== Floating point constants == | == Floating point constants == | ||
− | x87 has the following constants | + | x87 has the following constants built-in and loading each takes just 2 bytes: |
{| class="wikitable" | {| class="wikitable" | ||
Line 114: | Line 120: | ||
|} | |} | ||
− | Thus, if you just need "some random constant" in your shader, using one of these can save bytes. Notice, however, that <code>fldpi;fmulp st1, st0</code> is still 4 bytes, whereas <code>fmul st0, dword [bp+offset]</code> can be as little as 3 bytes, if the offset is | + | Thus, if you just need "some random constant" in your shader, using one of these can save bytes. Notice, however, that <code>fldpi; fmulp st1, st0</code> is still 4 bytes, whereas <code>fmul st0, dword [bp+offset]</code> can be as little as 3 bytes, if the offset is short and you can reuse code or another value as the constant. |
− | Even if you need to define a new constant, you don't always need | + | Even if you need to define a new constant, you don't always need the full 4 bytes to define a single IEEE floating-point number—sometimes even a single byte suffices. With a single byte, you can already define the exponent of a float, so the order of magnitude is already correct. You can then try to place this somewhere in your code or data where at least the first few bits of the mantissa are correct, to slightly increase the accuracy. You can use tools like [https://www.h-schmidt.net/FloatConverter/IEEE754.html this] to see what floating-point values encode to, and what different byte patterns represent as floating-point constants. |
== Case study: Balrog == | == Case study: Balrog == | ||
Line 138: | Line 144: | ||
<pre> | <pre> | ||
mov cl, ITERS | mov cl, ITERS | ||
− | .maploop: | + | .maploop: ; for(int j=0;j<ITERS;j++) { |
− | fld st0 | + | fld st0 |
frndint | frndint | ||
− | fsubp st1, st0 | + | fsubp st1, st0 |
fabs ; t.x = abs(t.x - round(t.x)) | fabs ; t.x = abs(t.x - round(t.x)) | ||
fadd st0 ; t.x += t.x; | fadd st0 ; t.x += t.x; | ||
Line 157: | Line 163: | ||
fmul dword [si] | fmul dword [si] | ||
fsubp st3, st0 ; t.z -= t.x * o | fsubp st3, st0 ; t.z -= t.x * o | ||
− | loop .maploop | + | loop .maploop ; } |
</pre> | </pre> | ||
Latest revision as of 12:51, 7 October 2025
Sometimes it is useful to prototype ideas for DOS effects before going through the trouble of writing them in x86/x87 assembly. Shadertoy is a popular choice for creating such prototypes. However, the shaders are written in WebGL, which is a relatively powerful language and includes native support for vectors, matrices, many built-in functions, and arithmetic operations. Most of these features are not available in x86 assembly. It is fairly easy to write tiny shaders in Shadertoy that end up well over 256 bytes once finally ported to DOS. Furthermore, the x87's limit of 8 stack items imposes significant constraints on the number of temporary variables you can use: for example, two 3D vectors already occupy 6 stack slots, and you typically need at least one additional item as scratch space, so that's almost the entire stack already.
To make sure your Shadertoy prototype is portable to DOS, you should avoid operations that are going to be costly (in terms of bytes) and only use ones that will be cheap in assembly. Below you’ll find some size estimates for WebGL code once ported to x87 math.
Contents
Scalar operators
ShaderToy | Bytes | x87 equivalent |
---|---|---|
x+=y | 2 | faddp st1, st0 |
x+y | 4 | If both x and y are needed later: fld st0; fadd st0, st2 |
The cost for -
, *
, and /
scalar operations is identical. A lot of this depends on how your x87 stack is organized (which variable is at the top of the stack at st0) and whether you need to keep copies of the variables for later use. In the last optimization phases, you can often save a few bytes by reorganizing your stack, to avoid unnecessary fld
or fxch
instructions.
Notice the existence of fsubr
and fdivr
instructions, so x=(y/x)
can still be just 2 bytes, even if it looks more complicated in ShaderToy.
Also notice that operating on a single component of a vector (b.x += a.x
) is actually a scalar operation and thus takes the same 2-4 bytes.
Scalar functions
ShaderToy | Bytes | x87 equivalent |
---|---|---|
-x | 2 | fchs |
abs(x) | 2 | fabs |
sqrt(x) | 2 | fsqrt |
sin(x) | 2 | fsin |
cos(x) | 2 | fcos |
sin(x) ... cos(x) | 2 | fsincos |
tan(x) | 2 | fptan |
atan(y,x) | 2 | fpatan |
log2(x) | 4 | fld1 ... fyl2x |
2*min(x,y) | 10 | Computed as a+b-abs(a-b) i.e. fld st0; fsub st0, st2; fabs; fsubp st1, st0; faddp st1, st0 Replace fsubp with faddp for 2*max(x,y) |
min(x,y) | 11 | fcom st0, st1; fnstsw ax; sahf; jc .S; fxch st0, st1; .S: fstp st1, st0 Replace jc with jnc for max(x,y) |
exp2(y) | 14 | fld1; fld st1; fprem; f2xm1; faddp st1,st0; fscale; fstp st1 |
pow(x,y) | 16 | Computed as 2^(y*log2(x)) i.e. fyl2x, followed by the exp2(x) code |
exp(y) | 18 | Computed as 2^(y*log2(e)) i.e. fldl2e and fmulp, followed by the exp2(y) code |
There are no acos
, asin
, sinh
, cosh
, tanh
, asinh
, acosh
, and atanh
instructions on x87 and implementing them yourself is probably not worth your bytes. This is a pity, as tanh
is a classic "squash" function to get any number into -1 .. 1 range.
Notice the significant byte cost of min and max operations, primarily due to the absence of fcomi and fmovcc instructions on older processors. This can catch people off guard, as many shaders use min/max extensively. For example, raymarchers often rely on them to compute shape unions. Avoiding unnecessary use of min and max can prevent a few headaches later.
Rounding and remainders
ShaderToy | Bytes | x87 equivalent |
---|---|---|
round(x) | 2 | frndint with the default rounding mode, which is nearest |
mod(x,y) | 2 | fprem or fprem1. Notice they compute the remainder, not modulo, so they are the same only for positive values of x. |
ceil(x) | 2-7 | Up to 5 bytes to setup the rounding mode with fldcw, followed by frndint |
floor(x) | 2-7 | Up to 5 bytes to setup the rounding mode with fldcw, followed by frndint |
Notice that x-round(x)
is a very compact way to do domain repetition for raymarchers.
Vector arithmetic (examples)
ShaderToy | Bytes | x87 equivalent |
---|---|---|
a.xy = a.yx | 2 | fxch st0, st1 |
a.xyz = a.yzx | 4 | fxch st0, st2; fxch st0, st1; |
a+=b | 5-6 | Assuming b is not needed later. 6 bytes: faddp st3, st0; faddp st3, st0; faddp st3, st0; 5 bytes: if you have a trashable register with suitable parity, use the looping three times trick |
dot(a,b) | 9-10 | If neither a or b is needed later, compute this as a*=b followed by a.z+=a.y+=a.x |
a+=b | 10 | If b is needed later: fadd st3; fld st1; faddp st5; fld st2; faddp st6 |
length(a) | 14 | If a is not needed later: fmul st0, st0; fxch st0, st2; fmul st0, st0; faddp st2, st0; fmul st0, st0; faddp st1, st0; fsqrt |
From this you can already see that a simple normalize(a)
is going to take a lot of bytes, as it has to be computed as a/=length(a)
. Therefore, normalizing your raymarchers rays is usually to be avoided. cross
, reflect
, and refract
are probably also too costly for sizecoding.
Floating point constants
x87 has the following constants built-in and loading each takes just 2 bytes:
Constant | Approximation | Instruction |
---|---|---|
0.0 | 0.0 | fldz |
1.0 | 1.0 | fld1 |
pi | 3.14159... | fldpi |
log2(e) | 1.44270... | fldl2e |
loge(2) | 0.69315... | fldln2 |
log2(10) | 3.32193... | fldl2t |
log10(2) | 0.30103... | fldlg2 |
Thus, if you just need "some random constant" in your shader, using one of these can save bytes. Notice, however, that fldpi; fmulp st1, st0
is still 4 bytes, whereas fmul st0, dword [bp+offset]
can be as little as 3 bytes, if the offset is short and you can reuse code or another value as the constant.
Even if you need to define a new constant, you don't always need the full 4 bytes to define a single IEEE floating-point number—sometimes even a single byte suffices. With a single byte, you can already define the exponent of a float, so the order of magnitude is already correct. You can then try to place this somewhere in your code or data where at least the first few bits of the mantissa are correct, to slightly increase the accuracy. You can use tools like this to see what floating-point values encode to, and what different byte patterns represent as floating-point constants.
Case study: Balrog
With all the earlier in mind, Balrog 256b executable graphics can serve as a case study. Balrog is a fractal raymarcher, with the innermost loop of:
for(int j=0;j<ITERS;j++){ t.x = abs(t.x - round(t.x)); // abs is folding, t.x - round(t.x) is domain repetition t.x += t.x; // domain scaling r *= RSCALE; r += t.x*t.x; t.xyz = t.yzx; // shuffle coordinates so next time we operate on previous y etc. t.x += t.z * o; // rotation, but using very poor math t.z -= t.x * o; }
Even if there's vectors, the code mostly does scalar math, and then uses coordinate shuffling (t.xyz = t.yzx) to do math on other coordinates. That code ports to:
mov cl, ITERS .maploop: ; for(int j=0;j<ITERS;j++) { fld st0 frndint fsubp st1, st0 fabs ; t.x = abs(t.x - round(t.x)) fadd st0 ; t.x += t.x; fld dword [c_rscale+bp-BASE] fmulp st4, st0 ; r *= RSCALE fld st0 fmul st0 faddp st4, st0 ; r += t.x*t.x fxch st2, st0 fxch st1, st0 ; t.xyz = t.yzx fld st2 fmul dword [si] faddp st1, st0 ; t.x += t.z * o; fld st0 fmul dword [si] fsubp st3, st0 ; t.z -= t.x * o loop .maploop ; }
The comments show exactly how each ShaderToy line maps to different x87 instructions.
The Balrog code also later exemplifies the floating point truncation technique:
c_mindist equ $-3 db 0x38 ; 0.0001 c_glowamount equ $-2 c_colorscale equ $-2 dw 0x3d61 ; 0.055 c_stepsizediv equ $-1 db 0x03 ; 807 c_stepsizediv_z equ $-3 db 0x40 ; 2.1006666666666662 c_glowdecay equ $-2 dw 0x461c ; 1e4 c_rscale equ $-2 db 0xa1, 0x3f ; 1.2599210498948732 c_rdiv equ $-2 dw 0x434b ; 203.18733465192963 c_camz equ $-1 db 0xcc, 0x12, 0x42 ; 36.7 c_xdiv equ $-1 db 0x09, 0x00, 0x40 ; 2.0006 c_xmult equ $-2 dw 0x3f2a c_camy equ $-2 dw 0x3f1c ; 0.61
Two of the constants were finally the same constant (c_glowamount and c_colorscale), many are only have the exponents (single db), and two of the constants required as much as 3 bytes to get enough precision (c_camz and c_xdiv). The ordering of the constants was carefully chosen, so that when the exponent of one constant serves as a part of the mantissa of next constant, the value is at least roughly correct.