Difference between revisions of "General Coding Tricks"

From SizeCoding
Latest revision as of 13:44, 15 February 2024

Data is code, code is data

Code is nothing more than data that the CPU interprets. For example, consider this multi-byte instruction:

        mov ah,37h

This assembles to B4 37. B4 by itself isn't interesting, but 37 happens to be the opcode for AAA (AAS would be 3F). Let's say you had this code before a loop, and you needed to perform AAA at the top of the loop. Rather than put AAA at the top of the loop, you can reuse the opcode that is already there as part of the mov ah,37h that comes before it. Just jump directly into the middle of the mov ah,37h, which will get interpreted and executed as AAA:

label:
        mov ah,37h
        ;misc. stuff
        loop label+1

The +1 specifies the jump should go to 1 byte past the actual location.

Reuse

You can use opcodes hidden in your existing data. For example, .COM files can end with RET, which is opcode C3. If you already have a C3 somewhere else in your code, even as part of data, just JMP to that pre-existing C3 instead of adding a RET.
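A minimal sketch of this, assuming a hypothetical program that already needs BL = 0C3h for some other purpose (the labels and the ESC check are illustrative only). The immediate byte of the MOV doubles as a RET, and a conditional jump into it exits without spending a byte on a separate RET:

```nasm
        org 100h
start:
        mov bl, 0C3h       ; assembles to B3 C3; the second byte is also the RET opcode
mainloop:
        in al, 60h         ; read the last keyboard scan code
        dec al             ; ESC has scan code 1, so ZF is set while ESC is held
        jz start+1         ; jump into the middle of the MOV: executes C3 = RET
        jmp mainloop       ; otherwise keep running
```

This only works while the stack is balanced at the jump, so that RET pops the zero word that DOS placed on the stack and lands on the INT 20h at the start of the PSP.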

If your environment holds you back, change it

The default MCGA palette is fairly horrible, but there can be size advantages to changing it: while setting a new palette costs bytes, the new palette arrangement could save you headaches down the road. For example, if your code is calculating pixel colors that fall into goofy ranges, rather than constantly adjusting the colors to sane ranges (e.g. aligned to powers of 2), just set the palette so that values falling into those ranges look the way you want. (This assumes you have very small ways of redefining the palette, of course.)
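As one possible illustration of a very small palette setup, here is a sketch of a commonly used grayscale loop. It assumes mode 13h is already set and that the standard VGA DAC ports 3C8h/3C9h are available; the label name is made up for this example:

```nasm
        mov dx, 3C8h       ; DAC write index port
        xor ax, ax         ; start at color index 0 (AL = 0)
        out dx, al
        inc dx             ; 3C9h: DAC data port (R, G, B written in sequence)
setpal:
        out dx, al         ; red
        out dx, al         ; green
        out dx, al         ; blue
        inc al             ; next gray level; the DAC index advances by itself
        jnz setpal         ; stop after 256 colors, when AL wraps to 0
```

Since the VGA DAC typically uses only 6 bits per component, this should give four repeating 64-step gray ramps rather than a single 256-step ramp. Whether that arrangement suits your color calculation is exactly the kind of trade-off described above.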

The above is maybe not the best example. Rewrites to this section are welcome.

Need a constant?

If you need a constant value but you're out of space, search your assembled code for a byte value you can use.

If you need more than a byte, another method is to create a literal pool in memory which can be addressed for constants. This technique was first used in TERRA256 (https://www.pouet.net/prod.php?which=94080; please update if there are earlier examples) and is especially useful if Floating-point_Opcodes are used, since those cannot address byte integer constants in memory, only word and dword. Unlike byte constants, word or dword constants whose high or low byte must be zero are unlikely to be found inside the opcodes. Chances are higher if the lower byte of the wanted constant word does not matter much and can be ignored. Otherwise, the literal pool is a very nice technique and simplifies the search for constants a lot.

The construction of the literal pool can look like this:

        mov cx, 255 ;can very likely be replaced by a shorter instruction or skipped entirely; CX may also be larger than 255
initlp: ;other initialization, e.g. setting the DAC color palette, can very likely share this loop
        push cx ;push the current counter value as a 16-bit word
        loop initlp
        push cx ;optionally push a final zero constant, if required

The loop pushes the words 0x00FF down to 0x0001 onto the stack. The optional "push cx" after the loop adds a final 0x0000, giving a literal pool of 256 words. The final zero also makes it possible to exit from a COM executable using the "ret" instruction.

Almost any existing initialization loop can be used to set up such a pool, so in the best case this construction costs only a single additional byte for the "push cx" instruction.

To use the literal pool, an index register such as SI, DI, BX or BP should be initialized to point to it. The initial value can be the content of the stack pointer (SP) but also a fixed offset, since the position of the literal pool and the current stack position are typically well known. The initial value of DI (0xFFFE) or a zero offset may also work. With the final zero pushed and the index register equal to SP, the word at offset k*2 holds the value k. Accessing the literal pool can look like this:

        mov bx, sp ;point index register to literal pool
        fild word [bx+0x13*2] ;load integer value 0x0013 into FPU register
        fild word [bx+0x31*2] ;load integer value 0x0031 into FPU register

To address larger values, some further tricks can be used to keep the size low:

        fild word [bx+0x087*2]  ;note: 1 byte larger than the loads above, because the offset 0x10E no longer fits into a signed byte
        fild word [bx+si+0x087*2-0x100] ;this trick can help: load integer value 0x0087 into FPU register, assuming SI is 0x0100
        fild word [bx+0x03*2-1] ;unaligned access: load integer value 0x0300 into FPU register
        fidiv dword [bx+0x05*2-3] ;unaligned access: divide by integer value 0x05000400

A smaller way to point to Mode 13's screen segment

Rather than mov ah,0a0h; mov es,ax or push word 0a000h; pop es, try this 2-byte wonder:

les bp,[bx]

This sets ES=9FFF, only one away from A000. You can write to the screen with ES: this way as long as you are aware the segment is one paragraph (16 bytes) behind, so just increase your offset by 16 if you need exact placement.

How does this work? At start of execution of a .COM file, BX=0, and DS=CS. The contents of the COM file get loaded to offset 0x100 in that segment, but loaded before that is the PSP (program segment prefix), which DOS populates with information about the loaded program and other info. The PSP starts with CD 20 (INT 20, which exits the program), so that's what gets loaded into BP. The next word is the number of the last free conventional memory segment, typically 0x9fff (but can be something different if parts of the upper memory range are either not installed or allocated).

Warning: This trick doesn't always work. On FreeDOS, this can set ES=9FE0, and there is something resident at that location that can screw up the system after normal program exit if you overwrite it.
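A minimal usage sketch, assuming BX=0 at start and that the PSP word at offset 2 really is 9FFFh (see the warning above). The pixel position and color are arbitrary choices for this example:

```nasm
        org 100h
        les bp, [bx]       ; C4 2F: BP = 20CDh, ES = word at DS:0002 (assumed 9FFFh)
        mov ax, 13h        ; set mode 13h (320x200, 256 colors)
        int 10h
        mov di, 16 + 100*320 + 160 ; +16 compensates for ES being one paragraph early
        mov al, 15         ; white in the default palette
        stosb              ; plot a pixel at x=160, y=100
        ret                ; exit via the zero word on the stack
```

For full-screen effects that simply write all 65536 offsets, the 16-byte shift is often left uncorrected.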

Accessing the timer tick for free

If using a 386+, FS=0 at .COM start. So, FS:[046C] gets you the timer tick variable in the BIOS data area, which you could use for timing/pacing, or as a random seed. Some environments, especially with EMS/XMS programs loaded, can modify the FS register, so it can't always be assumed to be 0000h. In that case, a POP DS right after the start (the word on top of the stack is 0000h) followed by accessing [046C] does the trick at equal size.
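A small sketch of the POP DS variant, used here to wait for the start of the next timer tick. The label name is hypothetical, and the code assumes the stack is still untouched when POP DS executes:

```nasm
        org 100h
        pop ds             ; top of stack is 0000h at .COM start, so DS = 0
        mov ax, [046Ch]    ; low word of the tick counter at 0000:046C
waittick:
        cmp ax, [046Ch]    ; has the counter changed yet?
        je waittick        ; no: keep polling
        ; a new tick has just begun (roughly 18.2 ticks per second)
        int 20h            ; exit; RET is no longer usable, its zero word was popped
```

Keep in mind that DS no longer points at the program afterwards, so the program's own data has to be reached through a CS: override or a different segment register.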

Looping twice

If you need to repeat a section of code that doesn't modify the carry flag, and you know the carry flag is clear on entry, you can run it twice (looping back once) at a cost of only 3 bytes:

looping:
        ;do stuff here
        cmc
        jc      looping

Looping three times

If you need to repeat a section of code and you have a register whose value is zero and can be incremented freely, or whose value is -1 and can be decremented freely, you can run it three times (looping back twice) at a cost of only 3 bytes:

looping:
        ;do stuff here
        inc     bx ;when decrementing from -1 instead, JPO also works: 0FEh and 0FDh have odd parity, 0FCh has even parity
        jpo     looping ;parity is taken from the low byte: 1 (01b) and 2 (10b) have odd parity, 3 (11b) has even parity

Obtaining X and Y without DIV (The Rrrola Trick)

In 320x200 mode, instead of constructing X and Y from the screen pointer DI with DIV, you can get a decent estimate by multiplying the screen pointer by 0xCCCD and reading Y from DH (or DX as a 16-bit value) and X from DL (or DL:AH as a 16-bit value). Note that X comes out scaled to the range 0..255 rather than 0..319. The idea is to interpret DI as a kind of 16-bit fixed-point number in the range [0,1], from start to end. Multiplying this number by 65536 / 320 = 204.8 yields the row in the integer part and, again as a kind of fixed-point number, the column in the fractional part. The constant 0xCCCD is 204.8 * 256 = 52428.8 rounded to the nearest integer (52429 = 0xCCCD). As long as the 16-bit representations are used, there is no precision loss.

This is adapted from "Puls" by Rrrola (http://www.pouet.net/prod.php?which=53816), where X and Y are directly modified on the stack by performing add dword[di],0000CCCDh on each pixel iteration, which requires 7 bytes of code. The vertical alignment correction is solved with a good starting value for said DWORD on the stack before each frame, which requires 2 additional bytes. The two approaches are too different to compare directly, but share the core idea of multiplying with 0xCCCD (http://www.pouet.net/topic.php?which=8791&page=8#c411796), so "Rrrola's trick" is an appropriate term to use.
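The multiplication variant can be sketched as follows. The XOR pattern and the label name are just placeholders for a real effect, and the loop never exits, so this is an illustration rather than a complete intro:

```nasm
        org 100h
        mov ax, 13h        ; set mode 13h (320x200, 256 colors)
        int 10h
        push word 0a000h   ; point ES at the screen segment
        pop es
pixel:
        mov ax, 0CCCDh
        mul di             ; DX:AX = DI * 0CCCDh
        mov al, dh         ; DH = Y (row)
        xor al, dl         ; DL = X scaled to 0..255
        stosb              ; plot the pixel and advance DI
        jmp pixel
```

MUL overwrites both AX and DX, so the constant has to be reloaded for every pixel.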

Alternative explanation by pjc50

Interactive snippet: https://gistpreview.github.io/?9b252f267cd1fdf9754059bb73a18487

More clearly: DI = (y * 320) + x

Multiply by 0xCCCD => (y * 0x1000040) + (x * 0xcccd)

Taking the top byte is equivalent to dividing by 0x1000000, so that gives you Y.

The next lower (third) byte is then (x * 0xcccd / 0x10000) == (x * 52429 / 65536) =~ (x * 256/320).

And the lower two bytes are noise.

Use the entire register for a smaller opcode form

As you know, add cl,1 produces 3 bytes of code while inc cl assembles to 2 bytes. If CH does not matter (or you know that it won't be affected), use inc cx instead, which takes only 1 byte. This is no real trick, but such things are sometimes overlooked, and the 2 saved bytes could be invested wisely.
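For reference, the three forms with their encodings in 16-bit code:

```nasm
        bits 16
        add cl, 1          ; 80 C1 01  (3 bytes)
        inc cl             ; FE C1     (2 bytes)
        inc cx             ; 41        (1 byte)
```

Besides touching CH when CL wraps from 0FFh to 0, INC differs from ADD in that it leaves the carry flag unchanged.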

Use the carry flag in your calculations

Let's say you have to add si,128. Unfortunately this takes 1 byte more than add si,127, because 128 no longer fits into a signed 8-bit immediate. But you can add 128 without that extra byte: if your previous code leaves the carry flag set, simply include it in your calculation and use adc si,127. The same goes for sub si,128 vs sbb si,127.
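A short sketch with the encodings, where STC merely stands in for whatever earlier code happens to leave the carry flag set:

```nasm
        bits 16
        add si, 128        ; 81 C6 80 00  (4 bytes)
        add si, 127        ; 83 C6 7F     (3 bytes)
        stc                ; placeholder for earlier code that leaves CF = 1
        adc si, 127        ; 83 D6 7F     (3 bytes): SI = SI + 127 + CF = SI + 128
```

If the state of the carry flag is not known, sub si,-128 (83 EE 80) also adds 128 in 3 bytes, since -128 still fits into a signed 8-bit immediate.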