RISC OS on ARM based CPUs

From SizeCoding
Revision as of 04:15, 14 July 2020 by Kuemmel (Code Tricks added)

Why ARM and why on RISC OS ?

x86 and ARM are the two major CPU architectures of modern times, the latter especially for mobile devices of any kind. Back in the 1980s the ARM architecture was created to power the successor of the BBC Micro. The best-known machines are probably the Acorn Archimedes range (1987) and the Acorn RISC PC. All of those home computers ran RISC OS, a unique operating system for ARM CPUs.

Nowadays, thanks to the work of a few enthusiasts, RISC OS is still in development and you can run it on popular single-board computers. The Raspberry Pi range is cheap and especially recommended. The fastest CPU to run RISC OS natively at the time of writing is an overclocked RPi4 at 2147 MHz.

Actually I'm not aware whether Android or any kind of Linux would be a better platform for sizecoding on ARM hardware. Just prove us wrong and write to us about it.

What does ARM offer compared to x86 ?

If you come from x86, coding on ARM will be a very different experience, as the architecture never carried any inherited obstacles from an 8 or 16 Bit age. It was purely RISC and 32 Bit from the beginning regarding instruction set and register size. Over the years a lot of enhancements took place. In general you get:

  • 16 full size 32-Bit registers (r0...15, usually: r13: stack pointer, r14: link register, r15: program counter)
  • VFP/NEON (SIMD) units with 32 32-Bit single precision registers (s0...s31), 32 64-Bit multi purpose or double precision registers for SIMD (d0...d31), and 16 128-Bit multi purpose registers for SIMD (q0...q15). All those registers are mapped onto each other (s0...s31 overlap d0...d15, and d0...d31 overlap q0...q15)
  • THUMB/THUMB-2 instruction set (especially useful regarding sizecoding)

...and of course the individual instructions are in general very different from x86. Some things might be familiar, some not at all. Over the years the ARM instruction set became quite huge: nowadays there's hardware integer divide and various SIMD approaches in either ARM or NEON instructions. Only the FPU still lacks trigonometric and other fancy instructions compared to x87. RISC OS has a so called FPEmulator to take care of that, but it is rather slow, as it is implemented in software, and it is not available for THUMB/THUMB-2. It might still be an option for e.g. precalc when you use the Basic Assembler from RISC OS.

ARM instructions are always 4 Bytes long. THUMB offers a limited instruction set with a length of 2 Bytes per instruction, and THUMB-2 mixes 2 Byte and 4 Byte encodings.

That may sound like a bit of a handicap regarding sizecoding, and for some tasks that is definitely true. For others it's not, due to the things even a single instruction can do (e.g. conditional execution and shifts for free). The following shows an example:

ARM (8 Bytes)

cmp   r0,r1            //compare r0 with r1
addhi r0,r2,r3,lsr#4   //if r0>r1 then r0 = r2 + r3>>4  (r3 is only shifted for the add and remains unchanged)

x86 (11 Bytes)

cmp eax,ebx
jna skip 
   mov eax,edx
   shr eax,4
   add eax,ecx
skip:

The conditional execution in ARM mode isn't limited to the next instruction. You can continue with as many conditional instructions as you like until the code executes an instruction that sets the flags, e.g. cmp or an instruction with the suffix s added, like adds r0,r1,r2.

In THUMB mode unfortunately only branches are conditional. But THUMB-2 introduced the it instruction, which makes up to 4 following instructions conditional. Some code from the ARM Information Center explains this with the GCD algorithm (Greatest Common Divisor).

ARM (16 Bytes)

gcd:
   cmp   r0,r1
   subgt r0,r0,r1
   suble r1,r1,r0
   bne gcd

THUMB-2 (10 Bytes)

gcd:
   cmp   r0,r1
   ite   gt 
   subgt r0,r0,r1
   suble r1,r1,r0
   bne gcd
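
To make the control flow of the loop above easier to follow, here is a small Python model (host-side only, not RISC OS code). It assumes positive inputs and treats the two conditional subtractions as non flag-setting, so the flags from cmp survive until bne. The function name is made up for this sketch. One detail it makes visible: because the second subtraction uses le, it also runs once when both registers are equal, so r0 holds the result and r1 ends up as 0.

```python
def gcd_by_subtraction(r0, r1):
    """Model of the cmp / subgt / suble / bne loop for positive integers."""
    while True:
        equal = (r0 == r1)   # cmp r0,r1 sets the flags once per pass
        if r0 > r1:          # subgt r0,r0,r1
            r0 = r0 - r1
        else:                # suble r1,r1,r0 (also runs when r0 == r1)
            r1 = r1 - r0
        if equal:            # bne gcd falls through when cmp found equality
            return r0, r1

print(gcd_by_subtraction(48, 18))   # (6, 0)
```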

What does RISC OS offer for sizecoding ?

  • more or less easy access to common screen modes
  • all screen modes have a linear frame buffer, no 16Bit screen banks limit like on DOS
  • convenient access to operating system/kernel routines (so called SWI's (SoftWare Interrupt), comparable to 'int' on x86).
  • up to date 16-Bit sound system, for e.g. generating bytebeat based stuff
  • built in BBC Basic including an Assembler

What does it lack (only partly relevant to tiny intro sizecoding) ?

  • an FPU (like x87) with trigonometric or logarithmic functions
  • multicore support
  • shader access or any kind of OpenGL or DirectX
  • software development in general, so web browsing is there but a bit limited

Code Examples - Simple sizecoding framework and output to screen

So what would a common intro framework look like ? For now we will use the GNU assembler to assemble our code, as the built in BASIC Assembler doesn't support THUMB code.

Before we start with the actual code it's best to define some of the mentioned SWI's for OS interaction by their number. Here's a list of some basic ones.

.set OS_ScreenMode, 0x65
.set OS_RemoveCursors, 0x36
.set OS_ReadVduVariables, 0x31
.set OS_ReadMonotonicTime, 0x42
.set OS_ReadEscapeState, 0x2c
.set OS_Exit, 0x11
.set OS_CallASWI, 0x6f

So for a basic intro loop in THUMB-2 this would look like

.syntax unified
.thumb                   //assemble using thumb mode
movs r0,#0               //reason code to set screen mode by number
movs r1,#13              //screen mode 13 = 320x256 256 colours
swi OS_ScreenMode        //set screen mode 
adr.n r0,screen_address  //address of input block to read screen mode address
movs r1,r0               //address of output block where screen mode address is stored  
swi OS_ReadVduVariables  //read and write screen mode address from/to blocks 

mainloop:
ldr.n r7,screen_address  //read screen address
swi OS_ReadMonotonicTime //get OS timer to r0
movs r2,#255             //screen y
yloop:
   movs r1,#320          //screen x
   xloop:
      adds r3,r1,r0      //p = x+timer
      eors r3,r3,r2      //p = (x+timer) xor y
      strb r3,[r7],#1    //plot result as byte (with standard palette)
      subs r1,r1,#1      //dec x 
   bne xloop
   subs r2,r2,#1         //dec y
bge yloop
swi OS_ReadEscapeState   //ESC pressed ?
bcc mainloop
swi OS_Exit              //if yes exit to OS

.align 2                 //align
screen_address:
.word 148                //input block to read screen address
.word -1                 //request block needs to be terminated by -1

This assembles to 52 Bytes.
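
If you want to check the pattern before assembling anything, the inner loops can be modelled in a few lines of Python. This is only a host-side sketch: the function name and the idea of returning the frame as a bytes object are assumptions for illustration, while the loop bounds and the formula follow the listing above.

```python
WIDTH, HEIGHT = 320, 256   # screen mode 13

def render_frame(timer):
    """Return one frame as bytes, in the order the loop writes them."""
    frame = bytearray()
    for y in range(HEIGHT - 1, -1, -1):    # r2 counts 255 down to 0
        for x in range(WIDTH, 0, -1):      # r1 counts 320 down to 1
            frame.append(((x + timer) ^ y) & 0xFF)   # strb keeps the low byte
    return bytes(frame)

frame = render_frame(timer=0)
print(len(frame), frame[0], frame[-1])   # 81920 191 1
```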

As you can see, for setting the screen mode you can rely on the smaller old school modes with up to e.g. 800x600x256 colours by just choosing a mode by its number (listed here: Screen Modes). After you set the screen mode you have to read its start address with OS_ReadVduVariables, as that is not a fixed address. On one specific device it should work to read that address once and hardcode it into your code, but then of course you would be restricted to your device (e.g. an RPi4 shows different results than an RPi3 for the same screen mode).

An intro showing that technique is e.g. Exoticorn's Edgedancer

(Screenshot: Edgedancer.png)

If you want to go for true colour it's a bit more complex. Probably the shortest way is to upgrade one of those old school screen modes with a mode string, using reason code 15 of the SWI (Check out this link for further information). That would look like this code snippet:

.syntax unified
.thumb                   //assemble using thumb mode
movs r0,#15              //reason code to request screen mode by string     
adr.n r1,mode_string     //pointer to string
swi OS_ScreenMode        //set screen mode 
adr.n r0,screen_address  //address of input block to read screen mode address
movs r1,r0               //address of output block where screen mode address is stored  
swi OS_ReadVduVariables  //read and write screen mode address from/to blocks 

mainloop:
ldr.n r7,screen_address  //read screen address
swi OS_ReadMonotonicTime //get OS timer to r0
movs r2,#255             //screen y
ands r0,r0,r2            //get lowest byte of timer
lsls r0,r0,#8            //create 'B' for RGB from timer
yloop:
   lsls r4,r2,#16        //create 'R' for RGB from y
   orrs r4,r4,r0         //combine 'R' and 'B'
   movs r1,#320          //screen x
   xloop:
      lsrs r3,r1,#1      //x>>1 for 'G' as x>256
      orrs r3,r3,r4      //finalize RGB value 
      stmia r7!,{r3}     //store true colour pixel and increment address
      subs r1,r1,#1      //dec x 
   bne xloop
   subs r2,r2,#1         //dec y
bge yloop
swi OS_ReadEscapeState   //ESC pressed ?
bcc mainloop
swi OS_Exit              //if yes exit to OS

.align 2                 //align
mode_string:
.string "13 C16M"        //screen mode string (terminated by 0) => 13 = 320*256 C16M = true colour
screen_address:
.word 148                //input block to read screen address
.word -1                 //request block needs to be terminated by -1

This assembles to 68 Bytes.

An intro showing that technique is e.g. Exoticorn's Elsecaller

(Screenshot: Elsecaller.png)

Another approach is to read the current screen mode instead of setting one, as most users run in 1920x1080x32Bit anyway. This also makes the intro independent of the resolution.

An intro showing that technique is e.g. Kuemmel's RISC OS 3dball. In a later upgrade to that intro you can also see the combined use of THUMB-2 and NEON within the code, which led to a reduction in code size of around 44 Bytes compared to the initial non-THUMB version. For more insights and requirements of the use of VFP/NEON check out the section below.

To trigger THUMB mode for the resulting executable you can conveniently set the lowest Bit (Bit 0) of the start address. Executables in RISC OS have a load address and a start address stored in the filesystem as attributes, and &8000 is the general start address for executables in RISC OS. The following call sets both addresses (SYS is the BBC BASIC keyword for calling a SWI). The best way to do so is to use a batch file for that, as shown in most of the above mentioned intros:

SYS "OS_File",1,"filename",&8000,&8001,,19

Regarding THUMB mode on RISC OS there's one small thing to address. A very ancient module has to be removed from the OS, otherwise it crashes your code. To date that bug is still not fixed. The module's name is "SpecialFX" and it needs to be removed by "rmkill SpecialFX" on the command line or in a batch file, as shown in the intro links from above.

To exit your intro and go back to the desktop you simply use the shown SWI OS_Exit. If you didn't change the mode you have to use e.g. the SWI "OS_NewLine" to re-trigger the desktop redraw. Of course all of those can be omitted if your tiny intro compo rules allow you to...

Code Examples - Using VFP/NEON code

VFP and NEON are basically the FPU and the SIMD (single instruction, multiple data) extension of the ARM instruction set. VFP works with 32 single (32 Bit) and double precision (64 bit) floating point registers (s0...s31,d0...d31).

NEON uses the same register set regarding d0...d31 and adds the 128 Bit sized q0...q15 registers. It can use and operate with multiple integer data types (8,16,32,64 Bit signed and unsigned) and single precision floating point (32 Bit) numbers. It's also possible to use an indexed register like d0[0], e.g. for multiplying multiple data in a register by a single scalar from another register (like vmul.f32 q0,q1,d4[0] => multiply each of the 4 single floats in q1 by single float d4[0] and place results in q0).

Another feature is instructions that saturate their results, which is quite useful when working with colours. So e.g. vqadd.u8 q0,q1,q2 would add 4 true colour RGB pixels (datatype u8 = unsigned 8 Bit integer) from q1 to the ones in q2 and place the results in q0. If an overflow occurs the value is saturated to 255.
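
As a rough illustration of what saturation means per lane, here is a Python model of an unsigned 8 Bit saturating add over 16 byte lanes (the width of one q register). It is a host-side sketch only. The function name and the sample pixel values are made up, and the byte order inside a pixel is just for illustration.

```python
def vqadd_u8(a, b):
    """Lane-wise unsigned saturating add of two 16-byte vectors."""
    assert len(a) == len(b) == 16       # one q register holds 16 u8 lanes
    return bytes(min(x + y, 255) for x, y in zip(a, b))

# four identical 32-bit pixels of 4 bytes each (values are illustrative)
q1 = bytes([200, 100, 10, 0] * 4)
q2 = bytes([100, 100, 10, 0] * 4)
print(list(vqadd_u8(q1, q2)[:4]))       # [255, 200, 20, 0]
```

Without saturation the first lane would wrap around to 44 (300 mod 256), which shows up as a wrong colour instead of a clipped bright one.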

The sheer number of available NEON instructions and their variations (saturating, narrowing, widening,...) is quite huge, so make sure to check the links below to read up on that. As explained before, the register sets of VFP and NEON are mapped onto each other. So modifying s0 would result in modifying the low 32 Bits of d0 and q0.
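
The overlap can be pictured as different views onto the same 16 bytes. The following Python sketch models q0 as a little-endian byte buffer and writes s0 through it. The buffer model is an assumption made for illustration, not a description of the hardware.

```python
import struct

q0 = bytearray(16)                        # q0 seen as 16 raw bytes
struct.pack_into("<f", q0, 0, 1.0)        # write s0 = 1.0 (lowest 32 bits)

d0 = struct.unpack_from("<Q", q0, 0)[0]   # d0 = low 64 bits of q0
s1 = struct.unpack_from("<I", q0, 4)[0]   # s1 = upper half of d0, untouched
print(hex(d0), s1)                        # 0x3f800000 0
```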

Before we can use the VFP/NEON unit within our RISC OS code we need to invest some bytes in requesting a so called VFPContext for initialization. In ARM code that would look like this.

mov r0,#3+(1<<31)
mov r1,#32          //request full set of 32 VFP/NEON registers
mov r2,#0
swi VFPSupport_CreateContext

Further documentation on the SWI VFPSupport_CreateContext can be found here

The same in THUMB-2:

movs r1,#32
lsls r0,r1,#26      //reuse r1
adds r0,r0,#3       //r0=3+(1<<31)
movs r2,#0
movw r10,#0x8ec1
movt r10,#0x5
swi OS_CallASWI     //needed due to swi number >0xff
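
The register values produced by the THUMB-2 variant can be double-checked with plain integer arithmetic. This Python sketch only mirrors the shifts and immediates from the listing above; the variable names follow the registers.

```python
r1 = 32
r0 = ((r1 << 26) & 0xFFFFFFFF) + 3   # lsls r0,r1,#26 then adds r0,r0,#3
r10 = (0x5 << 16) | 0x8EC1           # movt r10,#0x5 on top of movw r10,#0x8ec1

print(hex(r0), hex(r10))             # 0x80000003 0x58ec1
```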

The major interest in using NEON is speed (for floats and integers) and working with floats in general, not so much size, as the setup shown above consumes some bytes. If your code doesn't need floats and the speed is good enough there might not be much need for NEON. You will also find a small number of parallel arithmetic and saturating instructions for normal ARM integer code in the instruction set. But as shown in Kuemmel's RISC OS 3dball...that probably wouldn't be possible in that size/speed without NEON.

(Screenshot: 3dball.png)

Code Examples - Sound output by interrupt driven bytebeat

For basic sound output the principle of a so called timer based bytebeat can be used. For further reference check out this thread on pouet: Experimental music from very short C programs. I took an example bytebeat from rrrola (shortened by ryg).

To achieve that we need to set up an interrupt handler that takes care of a timed output to the system's sound buffer. Here comes a bit of an obstacle. The SWI's for that purpose have numbers that exceed 0xff, which would be fine for normal ARM code but not for THUMB. So here we have to use the SWI OS_CallASWI to call those SWI's indirectly. The number of the SWI to be called has to be set in r10. As we need 3 different SWI's in total (install handler, sample rate, remove handler) and those SWI's are within a short range of numbers, we can save some bytes by just adding/subtracting an offset for the other calls. Check out the code here:

.syntax unified
.thumb
//--- set up shared sound interrupt handler ---------------
adr.w r0,soundcode+1    //+1 as code address for interrupt routine needs to be in thumb state also
movs  r2,#0             //immediate handler
adr.n r3,soundhandler_title
str   r2,[r3]           //dummy title string
movw  r10,#0xb440
movt  r10,#0x6          //install XSharedSound handler (SWI 0x6b440)
swi   OS_CallASWI
mov   r4,r0             //backup handler number (r0 gets corrupted by SharedSound_SampleRate)
mov   r1,#8000*1024     //sample rate *1024
add   r10,r10,#6        //XSharedSound_SampleRate (SWI 0x6b446)
swi   OS_CallASWI
sub   r10,r10,#5        //prepare r10 for XSharedSound_RemoveHandler (SWI 0x6b441) on exit later
//--- main intro loop -------------------------------------
mainloop:
//any graphics code or whatever would be here
swi OS_ReadEscapeState
bcc mainloop
mov r0,r4               //restore handler number
swi OS_CallASWI         //Remove XSharedSound handler
swi OS_Exit
//--- interrupt routine/sound generation ------------------
// r1 -> base of buffer, r2 -> end of buffer, r6 = 8.24 fractional step
// ByteBeat formula is = t*(0xca98>>(t>>9&14)&15)|t>>8
soundcode:
push {r0-r7,LR}
lsrs  r6,r6,#8          //adjust fractional step
ldr.n r0,soundtimer     //t = soundtimer
soundloop:
   lsrs r5,r0,#16       //adjust timer for bytebeat
   movw r7,#0xca98      //bytebeat multi constant
   lsrs r4,r5,#9        //t>>9
   and  r4,r4,#14       //(t>>9)&14
   lsrs r7,r7,r4        //0xca98>>((t>>9)&14)
   and  r7,r7,#15       //(0xca98>>((t>>9)&14))&15
   muls r7,r5,r7        //t*((0xca98>>((t>>9)&14))&15)
   orr  r7,r7,r5,lsr#8  //(t*((0xca98>>((t>>9)&14))&15))|(t>>8)
   lsls r7,r7,#8        //8Bit => 16Bit sound
   orr  r7,r7,r7,lsl#16 //mono => stereo copy
   stm  r1!,{r7}        //store sound word
   adds r0,r0,r6        //inc timer by fractional step
   cmp  r1,r2           //check if buffer filled
bne soundloop
adr.n r4,soundtimer
str r0,[r4]             //save timer...no pc relative str in Thumb...
pop {r0-r7,PC}
//--- data ----------------------------------------------
.align 2
soundhandler_title:
soundtimer:

This assembles to 96 Bytes.
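
To hear or inspect the formula without touching the sound system, it can be evaluated on any machine. The Python sketch below implements the classic 8 Bit form of the bytebeat from the comment in the listing. It is not a bit-exact model of the interrupt handler (which scales the timer and writes 32-bit stereo words), and the function name is an assumption.

```python
def bytebeat(t):
    """8-bit sample of t*(0xca98>>(t>>9&14)&15)|t>>8."""
    nibble = (0xCA98 >> ((t >> 9) & 14)) & 15   # 4-bit multiplier from 0xca98
    return ((t * nibble) | (t >> 8)) & 0xFF

samples = bytes(bytebeat(t) for t in range(8000))   # one second at 8000 Hz
print(samples[:4].hex(), bytebeat(1024))            # 00081018 4
```

Writing samples to a file gives raw unsigned 8 Bit mono data at 8000 Hz that most audio tools can import.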

There are other ways to do sound on RISC OS, but those were not evaluated at the time of writing. BBC Basic also has ways to create sounds by note or frequency (Link is here). Check out the Sound SWI calls in detail here: Sound SWI Calls. Some further insights on the sound system can be found here: The RISC OS sound system by J. Lesurf.

Code Tricks

Cheap Absolute Value

Usually calculating ABS() would look like this in ARM (of course the cmp can be omitted if a preceding instruction already set the flags).

cmp   r0,#0
rsblt r0,r0,#0

and in Thumb

cmp   r0,#0
it    lt
rsblt r0,r0,#0

If your routine is okay with a small deviation (for a negative x this variant returns -x - 1 instead of -x, positive values are unchanged) this can be done by just one instruction (ARM). Hint taken from Coder's Revenge 06/96 disc mag.

eor r0,r0,r0,asr#31

In Thumb this would need 2 instructions so I skip that here.
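
The off-by-one behaviour is easy to see in a small Python model. The helper asr31 stands in for the "asr#31" operand on a signed 32-bit value. Both function names are made up for this sketch.

```python
def asr31(x):
    """x asr #31 for a signed 32-bit value: 0 if x >= 0, else -1."""
    return -1 if x < 0 else 0

def cheap_abs(x):
    return x ^ asr31(x)    # eor r0,r0,r0,asr#31

print([cheap_abs(x) for x in (5, 0, -1, -5)])   # [5, 0, 0, 4]
```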

Clamping

Clamping towards zero or similar code cases can be optimized in the same way. Here are 3 examples also from Coder's Revenge 06/96:

Example 1 - All values < 0 should be 0

cmp   r0,#0 // r0 ? 0
movlt r0,#0 // r0<0 => r0=0

can be done by

bic   r0,r0,r0,asr#31

Example 2 - All values >= 0 should be 0

cmp   r0,#0 // r0  ? 0
movge r0,#0 // r0>=0 => r0=0

can be done by

and   r0,r0,r0,asr#31

Example 3 - All values <= -1 should be -1

cmn   r0,#1 // r0 ? -1
mvnle r0,#0 // r0<=-1 => r0=-1

can be done by

orr   r0,r0,r0,asr#31

All of those simplifications can be coded in Thumb, too, but they would also need two instructions each and therefore end up with the same size, so I skip that here.
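
The three one-instruction variants can be compared against their two-instruction originals with a short Python model. As before, asr31 stands in for the "asr#31" operand on a signed 32-bit value, and all function names are hypothetical.

```python
def asr31(x):
    """x asr #31 for a signed 32-bit value: 0 if x >= 0, else -1."""
    return -1 if x < 0 else 0

def negatives_to_zero(x):        # bic r0,r0,r0,asr#31
    return x & ~asr31(x)

def non_negatives_to_zero(x):    # and r0,r0,r0,asr#31
    return x & asr31(x)

def negatives_to_minus_one(x):   # orr r0,r0,r0,asr#31
    return x | asr31(x)

for x in (-7, -1, 0, 7):
    print(x, negatives_to_zero(x), non_negatives_to_zero(x),
          negatives_to_minus_one(x))
```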

Compression

Due to the overhead of a decompression routine, compressing your intro starts to make sense from a size of maybe 512 Bytes, and for sure when you aim at coding an intro of >= 1 KByte.

Luckily we already have a tool for that: the absolute and untyped file compressor called Codepressor, originally developed by Pervect/Topix and now updated and maintained by Phlamethrower. Check out this Link to get the latest version. It contains different compression algorithms, tries all of them and finally chooses the best one.

The usage is quite straightforward. Just let the filer see the application and type codepressr <filename_in> <filename_out> on the command line to compress your intro. As an example the intro 'blury' was compressed from 966 to 832 Bytes.

Codepressor also supports Thumb executables, both if the Thumb trigger is set within the code and if it is set within the filesystem as shown above. Future versions might also have Thumb decompression routines to make the files even shorter. At the time of writing the decompression routines are written in normal ARM code.

Resources

Links on the OS

RISC OS Open - Home of the current OS version, documentation on the OS (e.g. SWI's) and discussion forum

RISC OS Direct - Easy installation package for RISC OS and all needed sizecoding tools for your Raspberry Pi including !GCC (includes the GNU assembler) and !StrongED (most popular text editor)

Links on ARM coding

Thumb 16-bit Instruction Set Quick Reference Card

ARM and Thumb-2 Instruction Set Quick Reference Card

Vector Floating Point Instruction Set Quick Reference Card

NEON Programmer's Guide

Instruction Set Overview

Coding for NEON - Part 1 - load and stores

Coding for NEON - Part 2 - dealing with leftovers

Coding for NEON - Part 3 - matrix multiplication

Coding for NEON - Part 4 - shifting left and right

Coding for NEON - Part 5 - rearranging vectors

Condition Codes 1: Condition Flags and Codes

Condition Codes 2: Conditional Execution

Condition Codes 3: Conditional Execution in Thumb-2

Condition Codes 4: Floating-Point Comparisons Using VFP