Adventures in CRTC lockstep

October 1st, 2012

Once I had achieved CGA lockstep, I tried some test programs. This image was made by cycling through the possible palette registers as quickly as possible (i.e. it's running a big unrolled loop of "INC AX; OUT DX,AL" to the palette register):

That worked great, except that in making it I noticed that the pattern wasn't always starting the same way - half the time the first visible scanline had a different set of colours. Somehow a bit of state was leaking through my lockstep routine!

After a while I figured out that it was due to the way I was getting the CRTC into lockstep with the CGA and CPU. The smallest frame that the 6845 CRTC can do is two character clocks (1 character by 2 scanlines - a 1 scanline high frame doesn't work with that CRTC). I thought I could get around this by going into high-res mode - then 1 character clock is 1 hchar so a frame would be 1 lchar and we'd be in a known place in the frame once we were in a known place in the CGA cycle.

Have you spotted the problem yet? The problem is that I don't know what the phase relationship is between the CGA clock and the CRTC clock - the first hchar of the frame could be the left or the right hchar within the lchar! And in fact, which it turned out to be was decided at random on startup.

With a bit of fiddling I eventually came up with a way to get the CRTC into lockstep as well. The trick hinges on the fact that if we set up the CRTC parameters so that one of the scanlines is displaying a normal visible image and one is overscan, we can tell which scanline is which by reading the display enable bit of the CGA status register. Then we delay an odd number of lchars if the display enable bit is set one way and an even number of lchars if it's set the other way (it doesn't matter which is which). Because we want to keep the CGA and CPU in lockstep as well, the difference in the codepath lengths must also be a multiple of 3 lchars, so delaying for X lchars one way and X+3 the other works fine.
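A quick check on the "multiple of 3 lchars" requirement - the difference in codepath lengths has to be a whole number of lchars and a whole number of ccycles at the same time, and 3 lchars is the smallest interval that satisfies both:

```python
# Durations in hdots, per the terminology in "The CGA wait states" below.
from math import lcm

CCYCLE, LCHAR = 3, 16            # 1 ccycle = 3 hdots, 1 lchar = 16 hdots
smallest = lcm(CCYCLE, LCHAR)    # 48 hdots = 3 lchars = 16 ccycles
```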

That's about all there is to it. The full lockstep routine is on github. Once lockstep is entered it'll persist until you wait for a period of time that depends on an external event (such as reading from disk/serial/parallel/ethernet/joystick or waiting for a keystroke). That doesn't mean that lockstep mode games and trackmos are impossible, though. The keyboard can be read by polling (pretty much all PC software directly or indirectly uses an interrupt for keyboard access, but it isn't compulsory and I've done it by polling a few times). You just have to make sure the code paths are the same length no matter whether a key was pressed or not, and no matter which key was pressed if there is one, which can be done by adding suitable delays. Disk access is a bit more difficult, since there's going to be a DMA bus access at some unpredictable point, and after it's happened you'll be out of lockstep. I think the solution is to HLT after the disk access is complete and restart execution on a timer interrupt. In the event that lockstep between CGA and PIT isn't possible, regaining lockstep once the timer interrupt has occurred should be possible by delaying for N ccycles for some N between 0 and 15 and then doing a CGA memory access. Another possible way is to make sure the CPU is running code that is either:

  1. BIU-bound with no wait states, or
  2. EU-bound and never exhausts the prefetch queue

for the entire time that the accesses might be happening. That way the time taken to run the code doesn't depend on exactly when the accesses occur.

Adventures in CGA lockstep

September 30th, 2012

As part of my project to emulate an IBM PC or XT with cycle accuracy, I need to be able to get the machine into a completely known and consistent state, which I call lockstep. That way I can run a program many times and be sure of getting exactly the same result each time.

This is a bit tricky, because while all the PC's clocks are derived from a single 14.318MHz crystal, they divide it in different ways. The CPU clock is made by dividing this frequency by 3, the PIT clock is made by dividing it by 12 and the CGA clock is made by dividing it by 16.
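Those ratios can be checked numerically (a sketch; 157500000/11 Hz is the exact NTSC crystal value, ~14.318MHz):

```python
# Derived clocks: dividers of 3 (CPU), 12 (PIT) and 16 (CGA character clock).
CRYSTAL = 157_500_000 / 11       # exact NTSC crystal frequency in Hz

cpu_mhz = CRYSTAL / 3 / 1e6      # CPU clock
pit_mhz = CRYSTAL / 12 / 1e6     # PIT clock
cga_khz = CRYSTAL / 16 / 1e3     # CGA character (lchar) clock
```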

Getting the CPU clock in lockstep with the CGA clock is the difficult bit, since the CGA clock is in lockstep with the PIT clock by definition (assuming that such a lockstep is possible - I'm not sure offhand if the phase relationship between the PIT and the CGA clock is always the same or if it's randomized on startup - the latter would make it more complicated to use the PIT in lockstep mode, but that's not really a big problem since the point of lockstep mode is to be able to do timing statically).

Since the CGA clock and the CPU clock have periods which are relatively prime numbers of hdots, it's definitely possible to get them into lockstep. Once I had a rough idea of what the CGA wait states were, I realized that achieving lockstep ought to be possible with a combination of delays and CGA accesses. The algorithm would be:

  1. Do a CGA memory access, reducing the number of possible relative phases from 16 to 3.
  2. Delay for A ccycles.
  3. Do a CGA memory access, reducing the number of possible relative phases from 3 to 2.
  4. Delay for B ccycles.
  5. Do a CGA memory access, reducing the number of possible relative phases from 2 to 1.
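The steps above can be sketched with a toy simulation in Python. It assumes the wait state rule worked out in "The CGA wait states" post below - wait 1 hchar, then for the next lchar boundary, then for the next ccycle boundary - plus one modelling assumption of mine (an access landing exactly on an lchar boundary still waits for the following one, which is what reproduces the wait state distribution from that post). The A and B values it finds are therefore illustrative, not necessarily the ones that work on real hardware:

```python
# Times are in hdots; the CPU/CGA relative phase repeats every
# lcm(3, 16) = 48 hdots, giving 16 possible phases (multiples of 3 ccycles).

def resume(t):
    """Hdot at which the CPU resumes after starting a CGA access at hdot t."""
    t += 8            # fixed 1 hchar wait
    t += 16 - t % 16  # wait for the (strictly) next lchar boundary
    t += -t % 3       # wait for the next ccycle boundary
    return t

def run(a, b):
    """Access, delay a ccycles, access, delay b ccycles, access."""
    phases = set(range(0, 48, 3))    # the 16 possible CPU/CGA relative phases
    for delay in (0, a, b):
        phases = {resume((t + 3 * delay) % 48) % 48 for t in phases}
    return phases

# Search for delays A and B (mod 16) that end in a single known phase:
solutions = [(a, b) for a in range(16) for b in range(16)
             if len(run(a, b)) == 1]
```

In this model a single access collapses the 16 phases to 3, and a delay of 13 ccycles between accesses (twice over) collapses them to 1.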

Delaying for 16 ccycles gives the same relative phases as delaying for 0 ccycles, so the problem boils down to finding A mod 16 and B mod 16. That's only 256 possibilities (and probably quite a few of those will work) so trial and error works fine. Delaying for particular numbers of cycles is okay too - the 8-bit MUL instruction takes 69 ccycles plus 1 ccycle for each set bit in AL, so as long as you don't mind waiting for 69 ccycles you can get a delay of any number of ccycles you like.
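To make the delay arithmetic concrete, here's a sketch (using exactly the 69-plus-set-bits timing quoted above) of which residues mod 16 a single MUL can produce, and a check that chaining two MULs covers all of them:

```python
def mul_delay(al):
    """Ccycles taken by an 8-bit MUL: 69 plus one per set bit of AL."""
    return 69 + bin(al).count("1")

# Residues mod 16 reachable with a single MUL (delays of 69..77 ccycles):
one_mul = {mul_delay(al) % 16 for al in range(256)}

# Chaining two MULs (138..154 ccycles) reaches every residue mod 16:
two_muls = {(mul_delay(a) + mul_delay(b)) % 16
            for a in range(256) for b in range(256)}
```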

But there's a more fundamental problem - how do we recognize when we've succeeded? The definition of lockstep involves consistency - ending up in a known end state no matter what the initial state. So in order to determine whether we're in lockstep or not, we really need to be able to control the initial state - in other words, in order to figure out whether we're in lockstep or not we first need to be in lockstep! That's a bit of a chicken-and-egg problem.

If I knew exactly what the CGA wait states were at this stage I could have figured out the right A and B values on purely theoretical principles, but I didn't - my examinations of the CGA schematics left some questions (particularly in areas involving how the 8088 treats READY signals occurring at different clock phases, and how some apparent race conditions actually turn out in real hardware). It was only in the course of achieving lockstep that I discovered what the CGA wait states actually are.

I had a few false starts involving identifying the 6 behavior classes for the 27 possible transition tables involved in the long-term behaviors of repeated sections of code. For example, if a piece of repeated code has 3 possible long-term behaviors depending on the relative phase at the start, I know that the repeated section must leave the relative phase alone.

But that was getting rather complicated, and I wasn't really getting anywhere until I hit on a better way - I realized that I could visualize exactly how long a piece of code was taking by running it and then changing the CGA palette register, which has an immediate effect, so marks the position on the screen where the electron beam was pointing when the register changed.

That's only useful if the transition happens in exactly the same place on the screen in each frame (otherwise you don't get a stable image and can't see what's going on). Which sounds like we're back to the chicken-and-egg problem again. But it's a more limited kind of lockstep that we need for this particular experiment - we don't need absolute lockstep, just a way of getting code to run at a consistent place relative to the raster beam, from frame to frame. That is to say, it doesn't matter if next time we run the program the image appears in a different place on the screen.

Fortunately, there's a way to do this on the PC even if we don't have full lockstep, since we can use the PIT to introduce a lockstep that's just consistent from frame-to-frame, not from run-to-run. If we set a timer to go off every 262*912/12 = 19912 PIT cycles, it'll go off exactly once per frame. That's not quite enough though, because interrupts don't have an instantaneous effect on the CPU - the CPU finishes whatever instruction it's currently executing before starting the interrupt. So you have to make sure it's not executing any instructions when the interrupt fires - i.e. it's in the halt state. Another complication is that I had to disable all other interrupts and the DRAM refresh in order to avoid them messing up the timings, which meant that I had to access each DRAM row within a certain period of time lest the capacitors discharge and the memory contents decay, which meant that I couldn't leave the CPU halted for too long!
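The period arithmetic spelled out (the PIT ticks once every 12 hdots, i.e. at 14.318MHz/12):

```python
hdots_per_frame = 262 * 912   # 262 scanlines of 912 hdots each
pit_period_hdots = 12         # PIT clock = crystal / 12

pit_cycles_per_frame = hdots_per_frame // pit_period_hdots
```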

Once I had a stable image I was able to generate the 16 different CGA/CPU relative phases with multiply instructions, and made a diagonal line that advanced 3 hdots (1 ccycle) on each line just by cycle counting. Then by placing CGA memory accesses between this diagonal line and the palette write I was able to see exactly what the CGA wait states were:

It took a while to get this image because whenever I added some code to a line I had to change the delay code at the end of the line to get the start of the next line to line up correctly, so there was a fair amount of trial and error involved.

In this image, every other scanline displays (just using normal 640-pixel graphics mode) a pattern that repeats every 16 pixels, so that I can see where the lchar boundaries are.

Once I had the CGA wait states, getting CGA/CPU lockstep was relatively easy. Here's a photo I took when I finally got it:

Note that the lower 16 black horizontal lines all end at the same position mod 48 hdots (you'll have to take my word for it that before the lockstep code they were at 16 different relative phases, like the lines further up, which make a diagonal pattern).

Phew, that's a lot of complications for such a tiny piece of code!

Tomorrow we'll look at how to get the CRTC in lockstep as well.

The CGA wait states

September 29th, 2012

As part of my project to emulate an IBM PC or XT with cycle accuracy, I also wanted to emulate the CGA card with cycle accuracy. That meant figuring out exactly what the wait states are when accessing CGA memory. Here's what I found out.

When talking about this stuff it helps to have a common terminology to talk about the several units of timing involved. This is the terminology I use:

  • 1 hdot = ~70ns = 1/14.318MHz = 1 pixel time in 640-pixel mode
  • 1 ldot = 2 hdots = ~140ns = 1/7.159MHz = 1 pixel time in 320-pixel mode
  • 1 ccycle = 3 hdots = ~210ns = 1/4.77MHz = 1 CPU cycle
  • 1 cycle = 4 hdots = ~279ns = 1/3.58MHz = 1 NTSC color burst cycle
  • 1 hchar = 8 hdots = ~559ns = 1/1.79MHz = 1 character time in 80-column text mode
  • 1 lchar = 16 hdots = ~1.12us = 1/895KHz = 1 character time in 40-column text mode
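These figures can be cross-checked from the crystal frequency (using 157500000/11 Hz, the exact NTSC value, for 14.318MHz):

```python
CRYSTAL_HZ = 157_500_000 / 11          # exact NTSC crystal, ~14.318 MHz

units_in_hdots = {"hdot": 1, "ldot": 2, "ccycle": 3,
                  "cycle": 4, "hchar": 8, "lchar": 16}
# Duration of each unit in nanoseconds:
ns = {unit: n * 1e9 / CRYSTAL_HZ for unit, n in units_in_hdots.items()}
```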

The wait state algorithm for the original IBM CGA is basically "wait 1 hchar, then wait for the next lchar, then wait for the next ccycle". That works out at between 3 and 8 ccycles depending on the relative phase of the CPU and CGA clocks. There are actually 16 possible relative phases (one for each of the hdots within the lchar at which the CPU cycle starts).

One relative phase has a 3 ccycle wait state and there are 3 relative phases for each of the other 5 possible wait state lengths (4, 5, 6, 7 and 8 ccycles respectively). 1+3+3+3+3+3=16. So the average wait state is (3+4*3+5*3+6*3+7*3+8*3)/16 = 5.8125 ccycles, but you might measure a different average depending on how your piece of code ends up synchronizing with the CGA clock.
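As a quick check on that arithmetic, the distribution written out:

```python
# One phase gets the minimum 3-ccycle wait; three phases each get 4..8.
waits = [3] + [4, 5, 6, 7, 8] * 3
average = sum(waits) / len(waits)
```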

In a way it's rather unfortunate because with a slight hardware modification I think the 1 hchar wait state could have been eliminated, making the average wait state about 3 ccycles shorter and roughly doubling the average speed of the CGA memory access.

Also unfortunately, "rep stosw" gives almost the worst possible wait state behavior. I haven't tried it yet, but I suspect that it would be possible to write CGA code that self-synchronizes to get the best possible wait states (though of course that would probably only improve performance on machines that were cycle exact with the machine that it was tuned for).

A third unfortunate thing is that the wait states are the same wherever the raster is on the screen - they aren't disabled during the retrace interval or anything like that. There's a good reason for that though - the CRTC continues to strobe through the CGA RAM throughout the overscan/retrace areas for dynamic RAM refresh - allowing the CPU access to the full memory bandwidth could result in loss of video RAM data, since the CGA doesn't participate in the system DRAM refresh cycles (which is a good thing, because otherwise all those wait states would propagate to the entire memory system).

Multifunction gates

September 28th, 2012

Recently, I came across an interesting article about unusual electronic components. One of the components that article talks about is the multifunction gate, which is a 6-pin integrated circuit which can act as one of several different 2-input logic gates depending on which input pins are connected to which incoming signal lines, and whether the ones that aren't connected are pulled high or low.

However, the gates mentioned by that article can't act as any 2-input logic gate, only some of them. The 74LVC1G97 can't act as NAND or NOR, the 74LVC1G98 can't act as AND or OR, and neither of the devices can act as XOR or XNOR. Their truth tables are as follows:

Inputs 74LVC1G97 74LVC1G98
0 0 0 0 1
0 0 1 0 1
0 1 0 1 0
0 1 1 1 0
1 0 0 0 1
1 0 1 1 0
1 1 0 0 1
1 1 1 1 0

That got me wondering if it's possible for such a 6-pin device to act as any 2-input logic gate. Such a 6-pin device is naturally equivalent to a single 3-input logic gate, since two of the pins are needed for power. There are only 256 (2^8) possible 3-input logic gates (since the truth table for a 3-input logic gate contains 2^3 = 8 bits of information). So it's a simple matter to just write a program that enumerates all the possible 3-input logic gates, tries all the possible ways of connecting them up, and sees what the results are equivalent to.

There are 16 possible 2-input logic gates: 0, 1, A, B, ~A, ~B, A&B, A|B, ~(A&B), ~(A|B), A^B, ~(A^B), A&(~B), (~A)&B, A|(~B) and (~A)|B. Discounting the trivial gates and the gates that ignore one of their inputs, and treating two gates as identical if we can get one from the other by swapping the inputs, gives us 8 distinct gates: AND, OR, NAND, NOR, XOR, XNOR, AND-with-one-inverting-input and OR-with-one-inverting-input.

There are 4 possibilities for connecting up each of the three input pins: low, high, incoming signal line A and incoming signal line B. I originally thought that connecting the output pin to one of the input pins might also be useful, but it turns out that it isn't - either the value of the input pin makes no difference (in which case it might just as well be connected to one of the other four possibilities) or it does. If it does, then either the output is the same as the input pin that it's connected to (in which case it's underconstrained, and not a function of the other two input pins), or the opposite (in which case it's overconstrained, and also not a function of the other two input pins).

So we have 256 possible 3-input logic gates, times four possibilities for each of the three input pins, times two possible states for each of the two incoming signal lines - that's 65536 circuit evaluations to try, which a computer program can run through in a timespan that is indistinguishable from instantaneous by unaided human senses.
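Here's a sketch of such a program in Python (my own reconstruction, not the original code): a 3-input gate is an 8-bit truth table, each input pin is wired to 0, 1, A or B, and we count how many of the 16 2-input functions each gate can produce:

```python
from itertools import product

def induced(gate, pins):
    """2-input function (4-bit truth table) produced by a 3-input gate
    (8-bit truth table) with each input pin wired to 0, 1, 'A' or 'B'."""
    fn = 0
    for a, b in product((0, 1), repeat=2):
        vals = [a if p == 'A' else b if p == 'B' else p for p in pins]
        out = (gate >> (vals[0] << 2 | vals[1] << 1 | vals[2])) & 1
        fn |= out << (a << 1 | b)
    return fn

# For each of the 256 gates, count the distinct 2-input functions reachable
# over all 4^3 pin configurations:
coverage = {}
for gate in range(256):
    fns = {induced(gate, pins) for pins in product((0, 1, 'A', 'B'), repeat=3)}
    coverage[gate] = len(fns)
```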

Running the program didn't find any 3-input gates which can be configured to act as any of the 16 2-input gates, but it did find something quite interesting - a dozen 3-input gates which can be configured to make 14 of them. Since swapping the inputs around gives equivalent gates, there are actually just two different gates, which for the purposes of this essay I'll call Harut and Marut.

Inputs Harut Marut
0 0 0 0 1
0 0 1 0 1
0 1 0 1 0
0 1 1 1 0
1 0 0 1 0
1 0 1 0 1
1 1 0 0 1
1 1 1 1 0

Note that the truth tables for Harut and Marut are quite similar to 74LVC1G97 and 74LVC1G98 respectively, just with two of the rows swapped or inverted. I haven't been able to find Harut and Marut gates on Mouser - it would be interesting to know if anybody makes them. One possible downside is that the 3-input gate isn't as useful in its own right (the 3-input gates for 74LVC1G97 and 74LVC1G98 are just a 2-input multiplexer and same with inverted output respectively).

I think it should be possible to design a 6-pin device that can yield any 2-input gate by making the supply lines take part in the configuration as well. For example, if you have a Harut gate and a Marut gate which give high-impedance outputs when power is not applied, you could put diodes on their supply lines and connect them up in parallel with the supply lines interchanged. Then applying power in the normal way would yield a Harut gate and reversing the polarity would yield a Marut gate. There's probably better ways to do it.

doitclient ported to Windows

September 27th, 2012

In order to write my cycle exact emulator, I need to run a lot of little "experiment" programs on my XT to determine exactly how long various things take. But the machine is bulky and noisy so I'd rather not keep it in the living room. In Seattle I kept it in the workshop downstairs. In Cornwall I'll probably keep it in the garage. I also have a modern PC next to it which is on the network and can reboot the XT, start programs on it and collect the results.

So, in order to remotely run programs on the XT I ultimately needed a way to remotely run programs on the modern PC. I had VNC and remote desktop on the modern PC so I could use that at a pinch, but I really wanted an automatable method - i.e. a command I could run on my laptop that would cause another command to be run on the modern PC. Under Linux this would be easy, but both the modern PC and my laptop run Windows.

I looked at a couple of ways of doing this, psexec and at least one other that I don't remember, but everything I tried was rubbish - either too slow, didn't work at all, or ran in a context which didn't include my network drive mappings (which made it kind of useless for running a program I had just built locally). I thought about writing a simple TCP client/server app to do what I wanted, then I decided to check to see if anyone had already written such a thing. Of course someone has, and I was not entirely surprised (nor at all displeased) to find out that the someone was Simon Tatham. Unfortunately DoIt has one minor problem which made it impossible to use as-is for my scenario - the client part (the part that runs on the machine which dispatches the commands) is a Unix program. Fortunately it's a pretty simple C program and I was able to port it to run on Windows with just a few changes. Here's the port, with a binary, diff, modified source code and a Visual Studio project file.

I suppose I could set up some wireless serial connections instead of using a separate PC which would make this entire post redundant. It'd probably even save money in electricity bills in the long run. For now, though, I decided to work with the hardware I have.

Buying up the junk

September 26th, 2012

I recently moved from Seattle to the UK. Part of the process of doing a big move like that is getting rid of stuff that you don't need any more. For some things that's just a question of throwing it away (we had one visit from the haulers, two trips to the tip and several weeks of extra garbage charges). But there's other stuff that, while we didn't need it any more, would have some value to someone else. Getting rid of all that stuff was a consistent feature of all our todo lists, and we still didn't manage to get rid of as much stuff as we wanted. We gave some things away to friends, sold some things on a neighborhood email list and had a (not particularly successful) garage sale.

There really ought to be a better way than this. There must be an enormous amount of value locked up in people's houses in the form of stuff that they don't use but which still has some value and therefore they don't want to throw away, but getting rid of it is an annoying, difficult, low priority task, so it never gets done.

I think somebody could make a fortune by setting up a company that takes away your unwanted stuff. You'd request a visit from them on their website (or maybe they could just show up on a regular basis) and they'd take away anything that you didn't want. They'd do the work of valuing it, selling it and shipping it, and then send you the proceeds (after taking their cut). If, after the valuation stage, you decided that the item was worth more than that, you could reject the offer and they'd bring it back with the next visit (perhaps for a small charge to avoid the service being abused as a free valuation service). Items that might leak liquids or emit odors would probably not be accepted (the small amount of value held in such items would probably not be worth the possible damage to other items).

They'd do all the work of making sure that items were packed sufficiently well for shipping, reusing packing materials as much as possible (and eliminating a large amount of waste). If they delivered items as well (instead of relying on UPS, Fedex or similar) they could take away the packing materials on delivery, helping the environment and saving the customer from another annoying job (my workshop in Seattle often used to get cluttered up with old cardboard boxes, packing peanuts and bubble wrap).

Another nice thing about this business is that it would be really easy to bootstrap - you could start it off in just one city with a couple of people, a van, a simple website and some insurance against breaking things. Deliveries to places too far away for the van to get to (or between two different cities where the company does have vans) could be done with the existing delivery services. After visiting one house for a requested pickup they could visit other nearby houses and ask if they have any items they want to get rid of.

eBay private bids

September 25th, 2012

There are all sorts of programs available for "eBay sniping" - automatically placing your bid in the last moments of the auction in order to minimize the amount of information available to your adversaries (the other bidders) and thereby maximize your chances of winning while minimizing the amount you expect to pay.

The trouble with this is that it creates two classes of eBay bidders - those who have paid extra (in money, effort or both) for the sniping software, and those who haven't. This makes the eBay user experience less friendly and more frustrating.

So I think eBay should offer its own (free) sniping software built right into the site - give bidders the opportunity to make a public or a private bid (or both). Only the highest public bid is shown but the highest bid overall (public or private) will actually win.

Why would anyone make a public bid if a private one is available? Wouldn't this turn all eBay auctions into silent (?) auctions? Not necessarily - there are some circumstances when a public bid is actually in the bidder's favour - for example if there are several auctions for equivalent items all ending around the same time, making a public bid on one of them is likely to push other bidders towards the other auctions.

Though that bit of game-theoretic oddness could also be eliminated with a feature closely related to private bids: the ability to (automatically) withdraw a private bid. This would allow you to bid on several auctions at once, while guaranteeing that you'll only win one of them. More complicated logic would also be possible, like "I want to buy either A or the combination of (B and any of C, D or E)". I'm not sure if this is currently possible with sniping software (I haven't used it). One could also set different bids for different auctions, if some are more desirable than others in some ways.

All these changes favor buyers rather than sellers, so eBay users who are primarily sellers probably wouldn't like them (after all, they help buyers save money!) But sellers already hate eBay - many of their other policies are drastically biased towards buyers. The only reason that sellers keep selling stuff on eBay is that is where the buyers are (and therefore that is where the best prices are, even after factoring out eBay's cut).

One other reason that eBay might want to do this would be that by having private bids go through the site, they get more accurate information about who is prepared to pay how much for what. I don't know if eBay currently does anything with this sort of information, but it surely must have some value to someone.

How to decide if a website is trustworthy

September 24th, 2012

Occasionally people will send me an email they have received or a link to a website they've heard about and ask me if it's genuine or a scam. Usually it's easy to tell, but sometimes (even for the savvy) it's actually quite hard. The example that prompted this entry was a program that purported to speed up your computer by cleaning up orphaned temporary files and registry entries. This is an area that's ripe for scams - a program that does absolutely nothing could still seem to be effectual through the placebo effect. Also, running such a program is a vector by which all manner of nasty things could be installed. Yet there are genuine programs which do this sort of thing, and slowdown due to a massive (and massively fragmented) temporary directory is certainly possible.

Here are some methods one can use to try to figure out if something like this is trustworthy or not:

  • Trust network. Is it trusted by people you trust to be honest and knowledgeable about such things? I've never used CCleaner myself (I just clean up manually) but people I trust (and know to be knowledgeable about such things) say it's genuine. Similarly, think about how you came to find out about a program. If it was via an advert then that lends no credence (scammers can place adverts quite easily). If it was via a review in a trustworthy publication, that does lend some credence.
  • Do you understand the business model? CCleaner's is quite clear (a functional free program with paid support). The program that prompted this entry had a free version which just detected problems - fixing the problems required buying the full version. This business model seems just like "scareware" - the free program always finds hundreds of problems (even on a perfectly clean system) because its purpose is to convince people to buy the full version. Being honest would be a disadvantage! Even if the program starts out honest, there's a tremendous incentive to gradually become less honest over time.
  • Does it seem too good to be true? If so, it almost certainly is. (Though exceptions exist.)
  • Is there a way to verify it? Availability of source code is certainly a good sign - it's something genuine programs can do to demonstrate their honesty. A scam almost certainly wouldn't bother, because anyone who could examine the source code would not be taken in by it anyway. Though of course, once this starts being a factor a lot of people look for, it'll start being gamed. As far as I can tell, that hasn't happened at the time of writing, though. I think I would have heard about it if it had.
  • What does the internet say about it? Especially known-trustworthy sites that the scammer has no control over. Remember that scammers can put up their own good reviews, but getting bad ones taken down is much more difficult. So if there's a lot of posts in different places by different people saying that it's a scam, that's a pretty good signal that it's bad (not infallible though - someone might have a grudge against the authors of the program for reasons unrelated to whether it does what it's supposed to).

The future of the past and the past of the future

September 23rd, 2012

Today is the 50th anniversary of the first broadcast of The Jetsons. It's always fascinating to look at how people in the past imagined the future, and to see how different their extrapolations were from how things actually turned out. There are certainly technologies and societal changes that have happened in the last 50 years that would have been impossible to predict.

Equally fascinating, I think, is to imagine how people in the future will think of our present. Okay, it's a rather different problem in that there will (hopefully!) be actual historical records of what life today is like (in fact, our present is probably the most well-documented historical period ever). Still, we surely have misconceptions today about what life was like in the past, and it's interesting to wonder what misconceptions the people of the future will have about us. What technologies that have yet to be invented will be so ubiquitous and game-changing that people will have real trouble imagining what life was like without them? What changes will happen to society which will make today seem unfathomably alien? Given enough time, I'm sure such changes are inevitable, so (despite the excellent records) I think it would be completely unsurprising if the people of tomorrow have some serious misconceptions about the people of today (especially amongst those who don't study the past for a living).

Scan doubler reverse engineered

September 22nd, 2012

My XT came with an unusual and interesting ISA card - a PGS scan doubler II. From the name, connections and the chips on it I determined that it was supposed to convert CGA RGBI TTL signals from 15.7KHz horizontal line rate to 31.4KHz, which would make it (timing wise) compatible with standard VGA monitors (it'd still need a DAC to convert from the TTL digital signals to RGB analogue ones).

Soon after I got it, I tried to make it work with my CGA card, but couldn't get anything to display on my VGA monitor. I didn't have an oscilloscope then so there wasn't really much I could do in the way of diagnosis (I do have one now, but I still haven't diagnosed the problem due to my XT being en route from Seattle to the UK). For debugging purposes (and just because I was really curious about how it works) I decided to reverse engineer the card to a schematic. Here is the resulting schematic.

Interestingly, it only uses half of its 2KB of RAM. There are four 1024x4 bit NMC2148HN-3 SRAM chips, but address line A9 of each chip is grounded, so only the first half of each chip is ever actually written to or read from. One might be inclined to wonder why they didn't use half the number of RAM chips. The answer is memory bandwidth: for each CGA pixel (i.e. at a rate of 14.318MHz) the card has to write a pixel to RAM and read back two. Each pixel is 4 bits, so that's an access rate of 229 megabits per second, which would be too fast for two such chips by a factor of two. So the solution is to increase the bandwidth by parallelization - it turns out that accessing 16 bits at each cycle is enough, but that means having four chips instead of two.

Most of the rest of the card is pretty straightforward - just sequencing the read and write operations in the right order to the different chips, detecting the hsync pulses for genlocking and parallelizing the input pixels. There is one bit which involves logic gates coupled by capacitors - this seems to be a clever hack to double the 14.318MHz clock to generate a 28.636MHz VGA pixel clock (I haven't simulated it because I can't quite read the capacitor values - I think I'll need to unsolder them to measure them). Technically such a clock doubling probably isn't necessary, since the left pixels could be emitted on the low half of the clock and the right pixels on the high (or possibly vice-versa) but maybe the logic delays cause the pixels to interfere, or maybe it was just easier this way.