As part of my project to emulate an IBM PC or XT with cycle accuracy, I need to be able to get the machine into a completely known and consistent state, which I call lockstep. That way I can run a program many times and be sure of getting exactly the same result each time.
This is a bit tricky, because while all the PC's clocks are derived from a single 14.318MHz crystal, they divide it in different ways. The CPU clock is made by dividing this frequency by 3, the PIT clock is made by dividing it by 4 and the CGA clock is made by dividing it by 16.
Getting the CPU clock in lockstep with the CGA clock is the difficult bit, since the CGA clock is in lockstep with the PIT clock by definition (assuming that such a lockstep is possible - I'm not sure offhand if the phase relationship between the PIT and the CGA clock is always the same or if it's randomized on startup - the latter would make it more complicated to use the PIT in lockstep mode, but that's not really a big problem since the point of lockstep mode is to be able to do timing statically).
Since the CGA clock and the CPU clock have periods which are relatively prime numbers of hdots, it's definitely possible to get them into lockstep. Once I had a rough idea of what the CGA wait states were, I realized that achieving lockstep ought to be possible with a combination of delays and CGA accesses. The algorithm would be:
- Do a CGA memory access, reducing the number of possible relative phases from 16 to 3.
- Delay for A ccycles.
- Do a CGA memory access, reducing the number of possible relative phases from 3 to 2.
- Delay for B ccycles.
- Do a CGA memory access, reducing the number of possible relative phases from 2 to 1.
Delaying for 16 ccycles gives the same relative phases as delaying for 0 ccycles, so the problem boils down to finding A mod 16 and B mod 16. That's only 256 possibilities (and probably quite a few of those will work) so trial and error works fine. Delaying for particular numbers of cycles is okay too - the 8-bit MUL instruction takes 69 ccycles plus 1 ccycle for each set bit in AL, so as long as you don't mind waiting for 69 ccycles you can get a delay of any number of ccycles you like.
But there's a more fundamental problem - how do we recognize when we've succeeded? The definition of lockstep involves consistency - ending up in a known end state no matter what the initial state. So in order to determine whether we're in lockstep or not, we really need to be able to control the initial state - in other words, in order to figure out whether we're in lockstep or not we first need to be in lockstep! That's a bit of a chicken-and-egg problem.
If I knew exactly what the CGA wait states were at this stage I could have figured out the right A and B values on purely theoretical principles, but I didn't - my examinations of the CGA schematics left some questions (particularly in areas involving how the 8088 treats READY signals occuring at different clock phases, and how some apparent race conditions actually turn out in real hardware). It was only in the course of achieving lockstep that I discovered what the CGA wait states actually are.
I had a few false starts involving identifying the 6 behavior classes for the 27 possible transition tables involved in the long-term behaviors of repeated sections of code. For example, if a piece of repeated code has 3 possible long-term behaviors depending on the relative phase at the start, I know that the repeated section must leave the relative phase alone.
But that was getting rather complicated, and I wasn't really getting anywhere until I hit on a better way - I realized that I could visualize exactly how long a piece of code was taking by running it and then changing the CGA palette register, which has an immediate effect, so marks the position on the screen where the electron beam was pointing when the register changed.
That's only useful if the transition happens in exactly the same place on the screen in each frame (otherwise you don't get a stable image and can't see what's going on). Which sounds like we're back to the chicken-and-egg problem again. But it's a more limited kind of lockstep that we need for this particular experiment - we don't need absolute lockstep, just a way of getting code to run at a consistent place relative to the raster beam, from frame to frame. That is to say, it doesn't matter if next time we run the program the image appears in a different place on the screen.
Fortunately, there's a way to do this on the PC even if we don't have full lockstep, since we can use the PIT to introduce a lockstep that's just consistent from frame-to-frame, not from run-to-run. If we set a timer to go off exactly 262*912/12 = 19912 PIT cycles, it'll occur exactly once every frame. That's not quite enough though, because interrupts don't quite have an instantaneous effect on the CPU - the CPU does finish whatever instruction it's currently executing before starting an interrupt. So you have to make sure it's not executing any instructions - i.e. is in the halt state. Another complication is that I had to disable all other interrupts and the DRAM refresh in order to avoid them messing up the timings, which meant that I had to access each DRAM row within a certain period of time lest the capacitors discharge and the memory contents decay, which meant that I couldn't leave the CPU halted for too long!
Once I had a stable image I was able to generate the 16 different CGA/CPU relative phases with multiply instructions, and made a diagonal line that advanced 3 hdots (1 ccycle) on each line just by cycle counting. Then by placing CGA memory accesses between this diagonal line and the palette write I was able to see exactly what the CGA wait states were:
It look a while to get this image because whenever I added some code to a line I had to change the delay code at the end of the line to get the start of the next line to line up correctly, so there was a fair amount of trial and error involved.
In this image, every other scanline displays (just using normal 640-pixel graphics mode) a pattern that repeats every 16 pixels, so that I can see where the lchar boundaries are.
Once I had the CGA wait states, getting CGA/CPU lockstep was relatively easy. Here's a photo I took when I finally got it:
Note that of the lower 16 black horizontal lines, they all end at the same position mod 48 hdots (you'll have to take my word that before the lockstep code they were at 16 different relative phases, like the lines further up which make a diagonal pattern).
Phew, that's a lot of complications for such a tiny piece of code!
Tomorrow we'll look at how to get the CRTC in lockstep as well.
[...] Reenigne blog Stuff I think about « Adventures in CGA lockstep [...]
[...] importantly it makes the timings much more consistent (which is essential for anything involving lockstep). However, whenever I've told people about this the reaction is "doesn't that make the machine [...]
c64 demo programmer here.
Such problems also occur on our hardware. The solutions are well-developed by now; we use a CIA (PIT) timer which is first synced to the raster, then there's N cycles of jitter from uncertainty in the interrupt, so we developed a "clockslide" routine to consume N cycles. To know what N is, just read the timer value during the interrupt (assuming it counts cycles). You can probably translate the clockslide ideas; add one cycle for a branch taken, self-modifying code where the data of one 3 bytes instruction happens to be an opcode for a 2 byte instruction, jumping into a table of NOPs, loops of various granularity (2*a+3*b=N) etc.
I don't see why you'd have to keep your entire code in sync, only the parts during screen drawing need it, you can resync once at the top of the screen, that's how we do it.
We also deal with DMA during our code, and code around it, however it is consistent on each line, so your refresh trick is easiest.
Yeah, the fact that the CIA timer counts actual cycles on the C64 makes it a bit easier there. The PIT clock frequency is 1/4 of the PC/XT CPU clock frequency, so it's not accurate enough to do it the same way.
One reason I wanted to avoid resyncing each frame is that it's a fairly slow process to do from scratch. The reason for this is that we have to disable DRAM refresh to avoid losing cycles to DMA. To do this without losing data, we need to increase the refresh frequency and then wait for a while to ensure all rows have been refreshed recently.
In practice, the best solution is probably to synchronize the CGA and PIT once at startup, then set up a timer interrupt for the start of the frame. Make sure to HLT before this interrupt occurs to eliminate jitter, and then the CPU will be synchronized to the PIT on the interrupt. It's not possible to eliminate the PIT from the equation anyway because of DRAM refresh.
However, there is another problem. Since writing this blog post, I've discovered that there are actually 4 possible relative phases between the CGA and the PIT, and which one you get is determined at random on power on. Also, one of the 16 wait states suffers from a race condition and can take an inconsistent amount of time even from one iteration to the next on a single machine.
This doesn't make it impossible to write cycle-exact PC/XT CGA code, just really difficult! The trick is to figure out sequences that work the same no matter which of the 4 phases the machine boots up in (or to detect the phase and dispatch to one of 4 routines), and which avoid accessing CGA memory on the inconsistent cycle. I think this will be doable once we have a cycle-exact emulator (which I'm working on, slowly).