Optimising C Compilers

JetSetIlly

While working on the CDFJ driver for the STM32 cartridges (UnoCart and PlusCart),
the question of optimisation has arisen; in particular, how C is optimised
during compilation.

To illustrate the effect of written code on the instructions generated by the
compiler, I'll use the bank-switching logic for CDFJ.

The logic for bank-switching is a little unusual in that the bank-switching
addresses (hotspots) do not line up exactly with the bank that will be switched
to (CDFJ+ corrects this). For reference, here is the list of hotspot addresses
and corresponding banks:

hotspot   bank
1ff4       6
1ff5       0
1ff6       1
1ff7       2
1ff8       3
1ff9       4
1ffa       5
1ffb       6
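
To make the off-by-one relationship easier to see, the same mapping can be
written out as a small C table. This is purely illustrative and not part of the
driver; the array name is mine:

#include <stdint.h>

/* bank selected by each hotspot, indexed by (addr - 0x1ff4).
   0x1ff4 selects bank 6 and then 0x1ff5 to 0x1ffb select banks 0 to 6 */
static const uint8_t hotspotBank[8] = { 6, 0, 1, 2, 3, 4, 5, 6 };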


The first pass of driver development resulted in this. We'll call this code-A:

if (addr >= 0x1ff4 && addr <= 0x1ffb) {
   if (addr == 0x1ff4) {
      bankPtr = &cartridge[0x6000];
   } else {
      bankPtr = &cartridge[(addr - 0x1ff5) * 0x1000 ];
   }
}

On the surface it looks concise and, as best as I can tell, the logic is
correct. But, and this is important, it's not /easy/ to tell that it's correct
just by looking. It requires a little more thinking than I would like as a
maintainer of the code.


The second pass of development changed the bank-switching code to this. We'll
call this code-B:

switch (addr) {
   case 0x1ff4:
      bankPtr = cartridge + 0x6000;
      break;
   case 0x1ff5:
      bankPtr = cartridge + 0x0000;
      break;
   case 0x1ff6:
      bankPtr = cartridge + 0x1000;
      break;
   case 0x1ff7:
      bankPtr = cartridge + 0x2000;
      break;
   case 0x1ff8:
      bankPtr = cartridge + 0x3000;
      break;
   case 0x1ff9:
      bankPtr = cartridge + 0x4000;
      break;
   case 0x1ffa:
      bankPtr = cartridge + 0x5000;
      break;
   case 0x1ffb:
      bankPtr = cartridge + 0x6000;
      break;
}

What do you think? Better or worse?

Well, it takes up more room on the screen, but I would hope we can agree that
the /intention/ of the code is clearer. With a little bit of commentary
explaining and reinforcing that intention, I would be happy with that as the
final code.


But what about the execution efficiency? This is a time-critical application so
the code needs to be efficient.

To understand this we can use the objdump tool, which lets us view the
instructions generated by the compiler alongside the line of source code that
produced them. (In the extracts below, the results are sorted by line number.)
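
For reference, with the GNU Arm toolchain a command along these lines produces
this kind of interleaved listing. The binary name and the ELF filename are
placeholders for whatever your project produces, and the code needs to be built
with -g for the source annotations to appear:

arm-none-eabi-objdump -d -l -S driver.elf > driver.lst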


These are the results for code-A:

cartridge_emulation_ACEROM.c:104
         if (addr >= 0x1ff4 && addr <= 0x1ffb) {
 802819a:   f5a1 50ff    sub.w   r0, r1, #8160   @ 0x1fe0
 802819e:   3814         subs   r0, #20
 80281a0:   b280         uxth   r0, r0
 80281a2:   2807         cmp   r0, #7
 80281a4:   d854         bhi.n   8028250 <emulate_ACEROM_cartridge+0x150>
cartridge_emulation_ACEROM.c:105
            if (addr == 0x1ff4) {
 80281a6:   f641 70f4    movw   r0, #8180   @ 0x1ff4
 80281aa:   4281         cmp   r1, r0
 80281ac:   f000 8090    beq.w   80282d0 <emulate_ACEROM_cartridge+0x1d0>
cartridge_emulation_ACEROM.c:106
               bankPtr = &cartridge[0x6000];
 80282d0:   4e1e         ldr   r6, [pc, #120]   @ (802834c <emulate_ACEROM_cartridge+0x24c>)
 80282d2:   e778         b.n   80281c6 <emulate_ACEROM_cartridge+0xc6>
cartridge_emulation_ACEROM.c:108
               bankPtr = &cartridge[(addr - 0x1ff5) * 0x1000 ];
 80281b0:   031e         lsls   r6, r3, #12
 80281b2:   f106 56f0    add.w   r6, r6, #503316480   @ 0x1e000000
 80281b6:   f506 4640    add.w   r6, r6, #49152   @ 0xc000
 
 
It doesn't look too bad, but we won't really know until we compare it with the
instructions generated by code-B:

cartridge_emulation_ACEROM.c:124
            switch (addr) {
 802817a:   f5a3 51ff    sub.w   r1, r3, #8160   @ 0x1fe0
 802817e:   3914         subs   r1, #20
 8028180:   2907         cmp   r1, #7
 8028182:   d803         bhi.n   802818c <emulate_ACEROM_cartridge+0x8c>
 8028184:   4e89         ldr   r6, [pc, #548]   @ (80283ac <emulate_ACEROM_cartridge+0x2ac>)
 8028186:   447e         add   r6, pc
 8028188:   f856 9021    ldr.w   r9, [r6, r1, lsl #2]
 
 
That's it! The entire switch statement from code-B has been collapsed into 7
instructions. Compare that to the 13 instructions from code-A. In terms of
bytes, that's 18 bytes versus 36 bytes.

(Both sets of instructions also load data from outside the instruction block,
so strictly speaking the amount of space required is a little greater for both
examples.)
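
As best as I can tell from the listing, the compiler has turned the switch into
a table lookup: the range check rejects anything outside 1ff4 to 1ffb and the
final ldr.w indexes a constant table with (addr - 0x1ff4). Written out as C,
the effect is something like the following. This is only my reading of the
disassembly, not code from the driver, and the table name is invented; addr,
cartridge and bankPtr are as in the examples above:

/* the switch from code-B, as the compiler appears to have rewritten it:
   a constant table indexed by (addr - 0x1ff4) */
static const uint32_t bankOffset[8] = {
   0x6000, 0x0000, 0x1000, 0x2000, 0x3000, 0x4000, 0x5000, 0x6000
};

uint32_t idx = (uint32_t)addr - 0x1ff4;   /* out-of-range addresses fail the test */
if (idx <= 7) {
   bankPtr = cartridge + bankOffset[idx];
}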
 
In terms of cycles, it's difficult to say without actually executing the code,
but looking at the negative path (i.e. when the address is not one of the
hotspot addresses): code-A takes 5 instructions to reach the exit branch, while
code-B takes only 4.

That's a rough estimate, but the switch statement certainly isn't any slower.

So not only is code-B easier to read and therefore maintain, it also produces
smaller and faster code.
 
 
The first lesson here is not to "optimise" without profiling the execution or,
as we've done here, inspecting the generated code. For small sections of code,
inspection is fine, but for larger sections, such as the game logic in your
custom section, profiling is the only practical option.
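
For what it's worth, on these Cortex-M parts the DWT cycle counter makes simple
profiling fairly painless. Here is a minimal sketch, assuming an STM32F4-class
part with the usual CMSIS headers; the device header and the function being
measured are placeholders rather than anything from the actual driver:

#include "stm32f4xx.h"   /* whichever CMSIS device header the project uses */

/* measure the core cycles taken by a function under test */
static uint32_t profile_cycles(void (*fn)(void))
{
   CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the trace unit */
   DWT->CYCCNT = 0;                                  /* reset the cycle counter */
   DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;              /* start counting */
   fn();                                             /* the code under test */
   return DWT->CYCCNT;                               /* elapsed core cycles */
}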
 
But the most important lesson is that optimising C compilers are better than
you might think. Write clear code and the C compiler will do well. I always
appreciate being reminded of that and I hope you do too :-)
https://github.com/JetSetIlly/Gopher2600
@JetSetIlly@mastodon.gamedev.place
@jetsetilly.bsky.social

Zack

That's very interesting. The double subtraction in code-B especially.

I'm guessing the double subtraction is because 1fe0 can be encoded as ff << 5,
and it's doing that to avoid consuming another register to subtract the literal
1ff4 directly. Do you know if that's why?

Apparently trying to understand the generated assembly is a great way to
improve your ARM assembly skills.