As part of my project to emulate an IBM PC or XT with cycle accuracy, I also wanted to emulate the CGA card with cycle accuracy. That meant figuring out exactly what the wait states are when accessing CGA memory. Here's what I found out.
It helps to have a common terminology for the several units of timing involved. This is the terminology I use:
- 1 hdot = ~70ns = 1/14.318MHz = 1 pixel time in 640-pixel mode
- 1 ldot = 2 hdots = ~140ns = 1/7.159MHz = 1 pixel time in 320-pixel mode
- 1 ccycle = 3 hdots = ~210ns = 1/4.77MHz = 1 CPU cycle
- 1 cycle = 4 hdots = ~279ns = 1/3.58MHz = 1 NTSC color burst cycle
- 1 hchar = 8 hdots = ~559ns = 1/1.79MHz = 1 character time in 80-column text mode
- 1 lchar = 16 hdots = ~1.12us = 1/895kHz = 1 character time in 40-column text mode
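As a quick sanity check on the list above, here is a minimal Python sketch that derives each period and frequency from the hdot clock. The names `UNITS_IN_HDOTS`, `period_ns` and `frequency_mhz` are made up for illustration, and the sketch takes the hdot clock to be 315/22 MHz (the usual NTSC-derived value, which rounds to the 14.318MHz quoted above).

```python
# Length of each timing unit, expressed in hdots (taken from the list above).
UNITS_IN_HDOTS = {
    "hdot": 1,
    "ldot": 2,
    "ccycle": 3,
    "cycle": 4,
    "hchar": 8,
    "lchar": 16,
}

# Assumption: the hdot clock is 315/22 MHz, i.e. ~14.318 MHz.
HDOT_HZ = 315e6 / 22


def period_ns(unit):
    """Duration of one unit in nanoseconds."""
    return UNITS_IN_HDOTS[unit] * 1e9 / HDOT_HZ


def frequency_mhz(unit):
    """Rate of one unit in MHz."""
    return HDOT_HZ / UNITS_IN_HDOTS[unit] / 1e6


if __name__ == "__main__":
    for name in UNITS_IN_HDOTS:
        print(f"{name:7s} {period_ns(name):8.1f} ns  {frequency_mhz(name):7.3f} MHz")
```

Working in whole hdots and only converting to nanoseconds for display keeps all the phase arithmetic exact.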
The wait state algorithm for the original IBM CGA is basically "wait 1 hchar, then wait for the next lchar, then wait for the next ccycle". That works out at between 3 and 8 ccycles depending on the relative phase of the CPU and CGA clocks. There are actually 16 possible relative phases (one for each of the hdots within the lchar at which the CPU cycle starts).
One relative phase has a 3 ccycle wait state and there are 3 relative phases for each of the other 5 possible wait state lengths (4, 5, 6, 7 and 8 ccycles respectively). 1+3+3+3+3+3=16. So the average wait state is (3+4*3+5*3+6*3+7*3+8*3)/16 = 5.8125 ccycles, but you might measure a different average depending on how your piece of code ends up synchronizing with the CGA clock.
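The distribution can be reproduced with a small Python sketch of the "wait 1 hchar, then the next lchar, then the next ccycle" rule. This is a simplified model and not a description of the actual circuit: the function names are hypothetical, and the boundary conventions (the lchar edge must fall strictly after the hchar delay ends, and the CPU resumes on the first ccycle edge at or after that lchar edge) are assumptions chosen because they give the counts quoted above.

```python
from collections import Counter
from fractions import Fraction

HDOTS_PER_CCYCLE = 3
HDOTS_PER_HCHAR = 8
HDOTS_PER_LCHAR = 16


def wait_ccycles(phase):
    """Wait in ccycles for an access starting `phase` hdots into an lchar."""
    # Step 1: wait one hchar.
    t = phase + HDOTS_PER_HCHAR
    # Step 2: wait for the next lchar edge (assumed to be strictly after t).
    lchar_edge = (t // HDOTS_PER_LCHAR + 1) * HDOTS_PER_LCHAR
    # Step 3: wait for the next ccycle edge; the CPU's ccycle edges fall at
    # phase + 3k, so round the elapsed hdots up to a whole number of ccycles.
    elapsed_hdots = lchar_edge - phase
    return -(-elapsed_hdots // HDOTS_PER_CCYCLE)


def wait_distribution():
    """How many of the 16 relative phases give each wait length."""
    return Counter(wait_ccycles(p) for p in range(HDOTS_PER_LCHAR))


def average_wait():
    """Mean wait over all 16 phases, as an exact fraction of ccycles."""
    dist = wait_distribution()
    return Fraction(sum(w * n for w, n in dist.items()), HDOTS_PER_LCHAR)


if __name__ == "__main__":
    print(sorted(wait_distribution().items()))
    print(float(average_wait()))
```

In this model the single 3-ccycle case is phase 7, where the hchar delay ends one hdot before an lchar edge and that edge happens to coincide with a ccycle edge.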
In a way it's rather unfortunate because with a slight hardware modification I think the 1 hchar wait state could have been eliminated, making the average wait state about 3 ccycles shorter and roughly doubling the average speed of the CGA memory access.
Also unfortunately, "rep stosw" gives almost the worst possible wait state behavior. I haven't tried it yet, but I suspect that it would be possible to write CGA code that self-synchronizes to get the best possible wait states (though of course that would probably only improve performance on machines that were cycle exact with the machine that it was tuned for).
A third unfortunate thing is that the wait states are the same wherever the raster is on the screen: they aren't disabled during the retrace interval or anything like that. There's a good reason for that, though. The CRTC continues to strobe through the CGA RAM throughout the overscan/retrace areas for dynamic RAM refresh, so allowing the CPU access to the full memory bandwidth could result in loss of video RAM data, since the CGA doesn't participate in the system DRAM refresh cycles (which is a good thing, because otherwise all those wait states would propagate to the entire memory system).
Could you expand on why REP MOVSW gives the worst possible wait state behavior? And would the alternatives really be faster, since they would likely require 10x the instructions to sync up?
REP MOVSW moves 256KB/s to CGA RAM with DRAM refresh disabled, which works out at 112 hdots per word. To system RAM, REP MOVSW moves 382KB/s, which works out at 75 hdots per word, which means the wait states are adding 37 hdots per word. So I suppose it could be a bit worse: the theoretical worst set of wait states would be 48 hdots per word and the theoretical best would be 18, so the "badness index" is (37-18)/(48-18) = 19/30 = 63%.
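The arithmetic here can be checked with a few lines of Python. The helper names are hypothetical, and two things are assumed rather than stated above: that KB means 1000 bytes (which is what makes the quoted hdot counts come out), and that the 18 and 48 hdot bounds come from two byte-wide accesses per word at 3 and 8 ccycles of wait each.

```python
from fractions import Fraction

HDOT_HZ = 14.318e6        # hdot clock as quoted in the post
HDOTS_PER_CCYCLE = 3
ACCESSES_PER_WORD = 2     # assumption: a word takes two byte-wide bus accesses


def hdots_per_word(bytes_per_second):
    """Convert a measured throughput into hdots spent per 16-bit word."""
    return HDOT_HZ * 2 / bytes_per_second


def wait_bound_hdots(wait_ccycles):
    """Wait-state hdots per word if every access waited `wait_ccycles`."""
    return ACCESSES_PER_WORD * wait_ccycles * HDOTS_PER_CCYCLE


def badness_index(measured_wait_hdots, best=None, worst=None):
    """Where the measured wait sits between best (0) and worst (1)."""
    best = wait_bound_hdots(3) if best is None else best
    worst = wait_bound_hdots(8) if worst is None else worst
    return Fraction(measured_wait_hdots - best, worst - best)


if __name__ == "__main__":
    cga = round(hdots_per_word(256_000))      # to CGA RAM
    system = round(hdots_per_word(382_000))   # to system RAM
    added = cga - system                      # hdots added by wait states
    print(cga, system, added, float(badness_index(added)))
```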
What would be even better is if "REP MOVSW" only took 4 EU ccycles per IO instead of 8. Then the CPU wouldn't be a bottleneck at all: you'd be able to transfer at 1 byte every other lchar (447KB/s) to CGA RAM, and you'd be able to max out the bus (1432KB/s) when doing RAM<->RAM transfers.
It's probably not possible to do better than REP MOVSW if you're just transferring a block of data from system RAM to CGA RAM (though I should time strings of MOVSB, MOVSW, LODSW STOSW and LODSB STOSB just to make sure). There are some surprising things about wait states, though: for example, a string of "STOSW NOP NOP" to CGA takes the same time as a string of "STOSW NOP" (i.e. you can get an extra NOP per word for free because it fits into the wait states).
If you're doing some processing at the same time as writing to CGA RAM (e.g. a demo effect, or even just transferring attribute bytes for 160x100x16 mode), the number of wait states you'll hit will depend on the exact ordering of the instructions, and may also differ from run to run of the routine (depending on the relative phase of the CGA and CPU clocks when the routine starts). With care it might be possible to arrange the routine so that it goes into CGA/CPU lockstep itself and then has the best possible set of wait states. That's what I meant by "self synchronizing" CGA code.
[...] the 2bpp modes. Each output composite pixel depends on the colours of four consecutive pixels of hdot width. These pixels cover at most 3 consecutive ldots, so any given pixel position depends on at [...]