Display Kernel for CC and iCC

Zack

As many of you know, I've been assisting with the display kernels used by Andrew Davie in the Sentinel and Elite demos. The current code is actually 5 different kernels copy/pasted and hacked up until things looked right. It was done as quickly as possible to enable Andrew to progress with his demos.

I'd like this topic to be both a discussion on how to make one general display kernel for CC/iCC and a view into the process I use to generate display kernels with the new ELF format. Eventually there will be a library available with several of these general kernels that can be combined as needed to produce the desired results.

Step 1 - Cycle Budgeting
I like to do some rough estimates to check that everything will fit before trying to make sure the timing can work too. For this kernel there will be a full asymmetric playfield and a 48-pixel sprite. Rather than trying to change the color of the playfield and both players, they will all be set to black, and the background will be used for the different color each scanline. The PF will require 6 writes, and the 48-pixel sprite will need 6 writes plus an extra strobe to flush the last value to the VDEL register. (1+6+6) * 5 cycles + 3 cycles = 68 cycles. Out of the 76 cycles in a scanline, that leaves 8 cycles to adjust the timing of writes and maintain the program counter.
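The estimate above can be double-checked with a few lines of C. This is only a sketch of the arithmetic: the constant and function names are made up for illustration, and the flat 5-cycle cost per write is the rough figure used in the estimate, not a measured value.

```c
#include <assert.h>

/* Hypothetical names, used only to illustrate the estimate above. */
enum {
    CYCLES_PER_SCANLINE = 76, /* 6507 cycles in one scanline */
    CYCLES_PER_WRITE    = 5,  /* rough cost assumed for each write */
    CYCLES_FLUSH_STROBE = 3   /* extra strobe that flushes the VDEL register */
};

/* Cycles used by one COLUBK write, the PF and GRP writes, and the flush. */
static int write_cycles(int pf_writes, int grp_writes)
{
    assert(pf_writes >= 0 && grp_writes >= 0);
    return (1 + pf_writes + grp_writes) * CYCLES_PER_WRITE + CYCLES_FLUSH_STROBE;
}

/* Cycles left over for timing adjustment and PC maintenance. */
static int spare_cycles(int pf_writes, int grp_writes)
{
    return CYCLES_PER_SCANLINE - write_cycles(pf_writes, grp_writes);
}
```

Calling write_cycles(6, 6) and spare_cycles(6, 6) reproduces the 68 and 8 quoted above.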

Note: It's important to call vcsJmp3() a few times throughout the screen because the 6507 can only address 4KB of contiguous ROM space and allowing PC to go past $1fff would transfer control to TIA read registers, eventually crashing the 6507.

Step 2 - Scheduling
If the 8 extra cycles can be kept together, they can be done with a single call to vcsNop2n(4), which will put a nop instruction on the bus and give about 7 6507 cycles of time where the M0+ can do some useful work. The M0+ is running about 53 times faster than the 6507, so that's over 350 ARM cycles each scanline to process graphics. On lines where vcsJmp3() is called to reset the PC back to $1000, there will only be about half as many cycles free.
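For reference, the "over 350" figure is just the product of the two numbers quoted above. A minimal sketch, assuming the approximate 53:1 clock ratio and the 7 usable cycles hold as stated (the names here are hypothetical):

```c
#include <assert.h>

/* Figures quoted in the post; the names are hypothetical. */
enum {
    ARM_CYCLES_PER_6507_CYCLE = 53, /* "about 53 times faster" */
    USABLE_6507_CYCLES        = 7   /* "about 7" cycles from vcsNop2n(4) */
};

/* Approximate ARM cycles available during a nop window. */
static int arm_cycles_available(int usable_6507_cycles)
{
    assert(usable_6507_cycles >= 0);
    return usable_6507_cycles * ARM_CYCLES_PER_6507_CYCLE;
}
```

With 7 usable cycles this comes to 371 ARM cycles, which is where the "over 350" comes from.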

CC/iCC consists of 3 scanlines per graphics line. I will refer to them as R, G, and B even though they may be assigned different colors. To keep things simple, I plan to calculate the 6 PF values for the next triplet on the R line, the 6 GRP values on the G line, and colors and indexes on the B line. Since B is doing less, it can also fit in a vcsJmp3().
R: 68 cycles of writes, 8 cycles of nop - PF for next triplet are calculated
G: 68 cycles of writes, 8 cycles of nop - GRP values for next triplet are calculated
B: 69 cycles of writes, 4 cycles of nop - colors and indexes are calculated, 3 cycles to reset PC back to $1000
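The three line budgets above can be kept in a small table so that the totals are easy to check. This is a sketch with hypothetical type and variable names; the numbers are copied straight from the R, G, and B lines.

```c
#include <assert.h>

/* Hypothetical bookkeeping for the three scanlines of one graphics line. */
typedef struct {
    char name;   /* 'R', 'G' or 'B' */
    int  write;  /* cycles spent on register writes */
    int  nop;    /* cycles of nop, i.e. time for the M0+ to compute */
    int  jmp;    /* cycles spent resetting the PC with vcsJmp3() */
} LineBudget;

static const LineBudget kTriplet[3] = {
    { 'R', 68, 8, 0 }, /* PF values for the next triplet */
    { 'G', 68, 8, 0 }, /* GRP values for the next triplet */
    { 'B', 69, 4, 3 }, /* colors and indexes, plus the PC reset */
};

/* Total 6507 cycles accounted for on one line. */
static int line_total(const LineBudget *b)
{
    assert(b != 0);
    return b->write + b->nop + b->jmp;
}
```

Each of the three entries should add up to a full 76-cycle scanline.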

I will refer to COLUBK write as BK, and the PF writes as 0a, 1a, 2a, 0b, 1b, 2b

The GRP0/1 writes are more complicated since the last 4 writes need to all be 3 cycles apart. They will be referred to as follows:
LDX - 2 cycle load - pixels 32-39
LDY - 2 cycle load - pixels 40-47
VD0 - 5 cycle write to GRP0 - pixels 0-7
VD1 - 5 cycle write to GRP1 - pixels 8-15
GR0 - 5 cycle write to GRP0 - pixels 16-23
GR1 - 5 cycle write to GRP1 - pixels 24-31
STX - 3 cycle write to GRP0 - pixels 32-39
STY - 3 cycle write to GRP1 - pixels 40-47
STA - 3 cycle write to GRP0 - flush vdel register

{ LDX, LDY, VD0, VD1, GR0 } can be scheduled anytime prior to the 48-pixel sprite
[XY] will refer to the 4 cycles consumed by LDX and LDY combined. They will be treated as a single operation for scheduling purposes.
{ GR1, STX, STY, STA } must be scheduled together and the final write must occur on pixel 36-38 of the 48ps (48-pixel sprite). I will refer to this group as [---24-47----] in the schedules below.
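As a quick cross-check of the list above, the operations can be tabulated and summed. This is an illustrative sketch: the struct and function names are hypothetical, and the cycle counts are the ones given in the list.

```c
#include <assert.h>
#include <stddef.h>

/* One GRP-related operation; timed_group marks { GR1, STX, STY, STA }. */
typedef struct {
    const char *name;
    int cycles;
    int timed_group;
} GrpOp;

static const GrpOp kGrpOps[] = {
    { "LDX", 2, 0 }, { "LDY", 2, 0 },
    { "VD0", 5, 0 }, { "VD1", 5, 0 }, { "GR0", 5, 0 },
    { "GR1", 5, 1 }, { "STX", 3, 1 }, { "STY", 3, 1 }, { "STA", 3, 1 },
};

/* Sum the cycles of all operations, or only of the timed group. */
static int grp_cycles(int grouped_only)
{
    int total = 0;
    for (size_t i = 0; i < sizeof kGrpOps / sizeof kGrpOps[0]; i++)
        if (!grouped_only || kGrpOps[i].timed_group)
            total += kGrpOps[i].cycles;
    assert(total > 0);
    return total;
}
```

The timed group comes to 14 cycles, which matches the width of [---24-47----], and all GRP work comes to 33 cycles. Adding the seven 5-cycle BK and PF writes gives the 68 cycles from Step 1.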

Note: each 6507 cycle is represented as 1 character in the schedule, e.g. [GR0] is 5 cycles.
Hazards:
                      [-BK-------------------------------------------------]
                      [-0a--]         [-2a-------]    [-1b-------]
                            [-1a------]          [-0b-]          [-2b------]                 
                                               
When the 48ps is positioned at pixel 0 the following schedule can be used:
][BK-][0a-][1a-][2a-][---24-47----][0b-][1b-][2b-][XY][VD0][VD1][GR0][-nop8-
Sliding the entire schedule over 1 cycle at a time allows this sequence to handle additional 48ps positions
-][BK-][0a-][1a-][2a-][---24-47----][0b-][1b-][2b-][XY][VD0][VD1][GR0][-nop8
8-][BK-][0a-][1a-][2a-][---24-47----][0b-][1b-][2b-][XY][VD0][VD1][GR0][-nop
p8-][BK-][0a-][1a-][2a-][---24-47----][0b-][1b-][2b-][XY][VD0][VD1][GR0][-no
op8-][BK-][0a-][1a-][2a-][---24-47----][0b-][1b-][2b-][XY][VD0][VD1][GR0][-n
...
0][-nop8-][BK-][0a-][1a-][2a-][---24-47----][0b-][1b-][2b-][XY][VD0][VD1][GR
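Sliding the schedule is the same as rotating the 76-character string to the right, one character per cycle. The helper below is a hypothetical illustration of that idea for checking schedules on paper, not part of any kernel code.

```c
#include <assert.h>
#include <string.h>

#define SCANLINE_CYCLES 76

/* The schedule for the 48ps at pixel 0, one character per 6507 cycle. */
static const char kBaseSchedule[] =
    "][BK-][0a-][1a-][2a-][---24-47----][0b-][1b-][2b-][XY][VD0][VD1][GR0][-nop8-";

/* Rotate a 76-character schedule right by `shift` cycles.
   `out` must have room for SCANLINE_CYCLES + 1 bytes. */
static void slide_schedule(const char *base, int shift, char *out)
{
    assert(strlen(base) == SCANLINE_CYCLES);
    assert(shift >= 0);
    for (int i = 0; i < SCANLINE_CYCLES; i++)
        out[(i + shift) % SCANLINE_CYCLES] = base[i];
    out[SCANLINE_CYCLES] = '\0';
}
```

Rotating the base schedule by 1 through 4 reproduces the shifted rows listed above.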

To be continued...

Thomas Jentzsch


Zack

So far there is no need for busstuffing in this kernel. I prefer to avoid using busstuffing when possible.

Andrew Davie

Quote from: Zack on 17 Jul 2024, 01:26 PM
Note: It's important to call vcsJmp3() a few times throughout the screen because the 6507 can only address 4KB of contiguous ROM space and allowing PC to go past $1fff would transfer control to TIA read registers, eventually crashing the 6507.

Took me a while to understand this, but to clarify for other readers - the kernel is pumping bytes onto the bus which are instructions executed by the 6507 -- various register updates (PF0/GRP0) etc. The kernel does this for 192 scanlines - and the 6507 happily executes these instructions. However, the 6507 program counter is "free running" - it thinks it is retrieving all of these instructions from memory linearly. So it's doing 192 scanlines of code execution without any looping or branching. Let's say (just roughly) 50 bytes per scanline.  So with just 4K of address space we would run out of runway within about 80 scanlines at the very best. Depends on where the PC is when we start. It seems to me that setting PC to $1000 at the start of a kernel, then every (say) 32 lines we do a vcsJmp3() to reset it... would be safe enough.

However, the kernel could keep track of the implied PC address and only do the vcsJmp3() write when it becomes necessary/urgent. If I'm understanding all of this correctly, the current centered kernel uses 36 bytes per scanline without the PC reset. If the PC was set to $1000 just before the scanline loop, then that would suggest 113 scanlines would be possible before we're in trouble. That would require just one vcsJmp3() halfway down the screen.
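A minimal sketch of that bookkeeping, assuming the kernel knows how many bytes it streams per scanline. Everything here (PcTracker and the helper names) is hypothetical, and it ignores the 3 bytes of the jmp itself for simplicity.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    ROM_START = 0x1000, /* where vcsJmp3() sends the PC */
    ROM_END   = 0x1fff  /* last address before the PC leaves ROM */
};

/* The PC value the 6507 is assumed to have reached. */
typedef struct { unsigned pc; } PcTracker;

static void pc_reset(PcTracker *t) { t->pc = ROM_START; }
static void pc_advance(PcTracker *t, unsigned bytes) { t->pc += bytes; }

/* True when streaming another `bytes` would push the PC past ROM_END. */
static bool pc_needs_reset(const PcTracker *t, unsigned bytes)
{
    return t->pc + bytes > ROM_END + 1u;
}

/* Whole scanlines that fit in 4KB before a reset is needed. */
static unsigned lines_before_reset(unsigned bytes_per_line)
{
    assert(bytes_per_line > 0);
    return (ROM_END + 1u - ROM_START) / bytes_per_line;
}

/* Count how many resets a screen of `lines` scanlines would need. */
static unsigned count_resets(unsigned lines, unsigned bytes_per_line)
{
    PcTracker t;
    unsigned resets = 0;
    pc_reset(&t);
    for (unsigned i = 0; i < lines; i++) {
        if (pc_needs_reset(&t, bytes_per_line)) {
            pc_reset(&t); /* the kernel would emit vcsJmp3() on this line */
            resets++;
        }
        pc_advance(&t, bytes_per_line);
    }
    return resets;
}
```

With 36 bytes per line this gives 113 lines per 4KB and a single reset over 192 scanlines. With the rough 50 bytes per line it gives 81 lines and two resets.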


Zack

Step 2 - Scheduling (continued).

As the [---24-47----] group slides across the screen one cycle at a time, it eventually lands in a spot that cannot work. It overlaps 1a on the left and doesn't leave enough time to write 1b after it. I think the only way to handle this is to leave gaps or switch to busstuffing.

                      [-BK-------------------------------------------------]
                      [-0a--]        [-2a-------]    [-1b-------]
                            [-1a------]          [-0b-]          [-2b------] 
                                      [---24-47----] <-- not enough cycles for writing 1b             
We will proceed with busstuffing, so the VDEL registers will no longer be needed. I will refer to each write as follows:
COLUBK as bk]
PF as 0a], 1a], 2a], 0b], 1b], and 2b]
g0] - preload GRP0 - pixels 0-7
g1] - preload GRP1 - pixels 8-15
[--16--47--] - 4 contiguous writes to GRP0, GRP1, GRP0, and GRP1 - pixels 16-47
[j] - vcsJmp3 - 3 cycle jmp to reset PC back to $1000, also used to pad space between writes for scheduling purposes
[--] - nops (must be a multiple of 2)


The 48ps can be located anywhere from 0-111 horizontally on the screen. That is 112 positions at 3 pixels per 6507 cycle: 112 pixels / 3 = 37.3 cycles. The 48ps will slide 38 cycles total with a one-cycle resolution. Some writes and nops will need to be moved around as the 48ps slides over. The entire range can be handled with a single kernel in 4 configurations, as shown here.

                      [-BK-------------------------------------------------]
                      [-0a--]        [-2a-------]    [-1b-------]
                            [-1a------]          [-0b-]          [-2b------]  offset48psCycles 
-]bk]0a]1a]2a]g0]g1][j][--16--47--]0b]1b][----]2b][--------][---------------  0 
------------]bk]0a]1a]2a]g0]g1][j][--16--47--]0b]1b][----]2b][--------][----  11
// [----] and 0b] moved over
----]bk]0a]1a]2a]g0]g1][j][----]0b][--16--47--]1b]2b][--------][------------  12
--------]bk]0a]1a]2a]g0]g1][j][----]0b][--16--47--]1b]2b][--------][--------  16
// 1b] moved over
------]bk]0a]1a]2a]g0]g1][j][----]0b]1b][--16--47--]2b][--------][----------  17
---------------]bk]0a]1a]2a]g0]g1][j][----]0b]1b][--16--47--]2b][--------][-  26
// [--------] and 2b] moved over
---]bk]0a]1a]2a]g0]g1][--------][j][----]0b]1b]2b][--16--47--][-------------  27
--------------]bk]0a]1a]2a]g0]g1][--------][j][----]0b]1b]2b][--16--47--][--  38
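The table can be summarized as a lookup from offset48psCycles to one of the four configurations. The sketch below uses hypothetical helper names. The prefix lengths and the constant 23 (the column where [--16--47--] begins in the offset 0 row) were obtained by counting characters in the table, so treat them as a reading of the diagram rather than measured values.

```c
#include <assert.h>

enum { MAX_OFFSET_CYCLES = 38 };

/* Configuration 0..3, using the same thresholds as the table rows. */
static int config_for_offset(int offset48psCycles)
{
    assert(offset48psCycles >= 0 && offset48psCycles <= MAX_OFFSET_CYCLES);
    if (offset48psCycles > 26) return 3;
    if (offset48psCycles > 16) return 2;
    if (offset48psCycles > 11) return 1;
    return 0;
}

/* Cycles from the start of bk] to the start of [--16--47--]. */
static int cycles_before_grp_group(int config)
{
    static const int kPrefix[4] = { 21, 30, 33, 46 };
    assert(config >= 0 && config < 4);
    return kPrefix[config];
}

/* Column (0-based cycle within the row) at which bk] begins. */
static int kernel_start_cycle(int offset48psCycles)
{
    int config = config_for_offset(offset48psCycles);
    return 23 + offset48psCycles - cycles_before_grp_group(config);
}
```

For example, offsets 0, 12, 17, and 27 start the kernel at columns 2, 5, 7, and 4, matching the leading characters of those rows.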

Step 3 - Implementing the Schedule

Translating the schedule to code is very simple. Each item is replaced with the corresponding vcs function, and placeholder values are used to make testing easy.

Starting with only the first configuration we get this.

vcsWrite3(COLUBK, i);
vcsWrite3(PF0, 0xff);
vcsWrite3(PF1, 0xff);
vcsWrite3(PF2, 0xff);
vcsWrite3(GRP0, i & 1 ? 0 : 0xff);
vcsWrite3(GRP1, i & 1 ? 0 : 0xff);
vcsJmp3();
vcsWrite3(GRP0, i & 1 ? 0xff : 0);
vcsWrite3(GRP1, i & 1 ? 0xff : 0);
vcsWrite3(GRP0, i & 1 ? 0 : 0xff);
vcsWrite3(GRP1, i & 1 ? 0 : 0xff);
vcsWrite3(PF0, 0);
vcsWrite3(PF1, 0);
vcsNop2n(3);
vcsWrite3(PF2, 0);
vcsNop2n(5);
vcsNop2n(9);

The cycle count suffix of each function makes it easy to check that we have scheduled exactly 76 cycles.
14 three-cycle functions + 17 two-cycle nops = 42 + 34 = 76 cycles total
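The same count can be done mechanically by listing the first configuration as a table of steps. This is a host-side sketch with hypothetical names; it only adds up the cycle suffixes and does not call the real vcs functions.

```c
#include <assert.h>
#include <stddef.h>

/* One scheduled item and its cost in 6507 cycles. */
typedef struct { const char *what; int cycles; } Step;

/* First configuration, in the order the kernel emits it. */
static const Step kConfig0[] = {
    { "COLUBK", 3 }, { "PF0", 3 }, { "PF1", 3 }, { "PF2", 3 },
    { "GRP0", 3 }, { "GRP1", 3 },
    { "jmp", 3 },
    { "GRP0", 3 }, { "GRP1", 3 }, { "GRP0", 3 }, { "GRP1", 3 },
    { "PF0", 3 }, { "PF1", 3 },
    { "nop x3", 2 * 3 },
    { "PF2", 3 },
    { "nop x5", 2 * 5 },
    { "nop x9", 2 * 9 },
};
static const size_t kConfig0Len = sizeof kConfig0 / sizeof kConfig0[0];

/* Total cycles of a schedule. */
static int total_cycles(const Step *steps, size_t n)
{
    int total = 0;
    assert(steps != NULL);
    for (size_t i = 0; i < n; i++)
        total += steps[i].cycles;
    return total;
}

/* Number of steps that cost exactly `cost` cycles. */
static int count_steps_costing(const Step *steps, size_t n, int cost)
{
    int count = 0;
    assert(steps != NULL);
    for (size_t i = 0; i < n; i++)
        if (steps[i].cycles == cost)
            count++;
    return count;
}
```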

The other configurations can be added in by conditioning on the value of offset48psCycles. This allows the kernel to stay compact despite having to rearrange itself multiple times.

vcsWrite3(COLUBK, i);
vcsWrite3(PF0, 0xff);
vcsWrite3(PF1, 0xff);
vcsWrite3(PF2, 0xff);
vcsWrite3(GRP0, i & 1 ? 0 : 0xff);
vcsWrite3(GRP1, i & 1 ? 0 : 0xff);
if(offset48psCycles > 26)
  vcsNop2n(5);
vcsJmp3();
if(offset48psCycles > 11){
  vcsNop2n(3);
  vcsWrite3(PF0, 0);
  if(offset48psCycles > 16){
    vcsWrite3(PF1, 0);
    if(offset48psCycles > 26)
      vcsWrite3(PF2, 0);
  }
}

// These 4 writes must be performed accurately to 1 cycle; the others can be moved around to accommodate them
vcsWrite3(GRP0, i & 1 ? 0xff : 0);
vcsWrite3(GRP1, i & 1 ? 0xff : 0);
vcsWrite3(GRP0, i & 1 ? 0 : 0xff);
vcsWrite3(GRP1, i & 1 ? 0 : 0xff);

if(offset48psCycles < 12)
  vcsWrite3(PF0, 0);
if(offset48psCycles < 17)
  vcsWrite3(PF1, 0);
if(offset48psCycles < 12)
  vcsNop2n(3);
if(offset48psCycles < 27) {
  vcsWrite3(PF2, 0);
  vcsNop2n(5);
}
vcsNop2n(9);
// Lots of ARM cycles available to calculate next triplet here

Wrap it with some setup code and we have our first test image.
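Before running on hardware, the conditional kernel above can be sanity-checked on a host machine by swapping the vcs calls for stand-ins that only count cycles. The mock functions below are hypothetical and are not part of the real library. The sketch checks that every offset from 0 to 38 still produces exactly 76 cycles, and records where the four timed GRP writes begin.

```c
#include <assert.h>

static int g_cycles;       /* 6507 cycles emitted so far on this line */
static int g_grp_group_at; /* cycle at which the 4 timed GRP writes begin */

/* Cycle-counting stand-ins for vcsWrite3(), vcsJmp3() and vcsNop2n(). */
static void mockWrite3(void) { g_cycles += 3; }
static void mockJmp3(void)   { g_cycles += 3; }
static void mockNop2n(int n) { assert(n >= 0); g_cycles += 2 * n; }

/* Mirrors the conditional kernel; returns total cycles for one scanline. */
static int emit_line(int offset48psCycles)
{
    g_cycles = 0;
    for (int k = 0; k < 6; k++)
        mockWrite3();                       /* COLUBK, PF0-PF2, GRP0, GRP1 */
    if (offset48psCycles > 26)
        mockNop2n(5);
    mockJmp3();
    if (offset48psCycles > 11) {
        mockNop2n(3);
        mockWrite3();                       /* PF0 */
        if (offset48psCycles > 16) {
            mockWrite3();                   /* PF1 */
            if (offset48psCycles > 26)
                mockWrite3();               /* PF2 */
        }
    }
    g_grp_group_at = g_cycles;
    for (int k = 0; k < 4; k++)
        mockWrite3();                       /* GRP0, GRP1, GRP0, GRP1 */
    if (offset48psCycles < 12) mockWrite3(); /* PF0 */
    if (offset48psCycles < 17) mockWrite3(); /* PF1 */
    if (offset48psCycles < 12) mockNop2n(3);
    if (offset48psCycles < 27) {
        mockWrite3();                       /* PF2 */
        mockNop2n(5);
    }
    mockNop2n(9);
    return g_cycles;
}

/* Returns 1 if every offset in 0..38 yields `expected` cycles. */
static int all_offsets_total(int expected)
{
    for (int offset = 0; offset <= 38; offset++)
        if (emit_line(offset) != expected)
            return 0;
    return 1;
}
```

The GRP group starts 21, 30, 33, or 46 cycles into the kernel depending on the configuration, so the setup code has to start the kernel earlier or later in the scanline to keep the sprite where it belongs.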