Archive for the ‘emulation’ Category

PAL version of demo machine

Wednesday, July 2nd, 2008

The demo machine described the other day could easily be generalized to PAL output. There are some complexities though. Because of the missing quarter cycle of the carrier frequency per line, the PAL signal for a still image repeats every 4 frames. This means that in order to do the same "extremely simple, highly standards-compliant demo" that was possible on the original demo machine for NTSC, we need 2.7Mb of sample data. Let's run with this and round up the PAL machine's memory to 4Mb - rather than making the PAL machine as similar as possible to the NTSC machine, we should take the opportunity to introduce some variety.

Similarly, the CPU clock speed for the PAL machine should be (exactly) 17.734475MHz.
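The figures above can be checked with a little arithmetic. This is only a sketch, assuming the PAL machine samples at exactly four times the PAL colour subcarrier (4.43361875MHz) with one byte per sample, as the NTSC machine does; the variable names are mine.

```python
from fractions import Fraction

# PAL colour subcarrier frequency in Hz (4.43361875MHz), kept exact.
SUBCARRIER_HZ = Fraction(443361875, 100)
FRAME_RATE_HZ = 25          # interlaced PAL frames per second
FRAMES_PER_SEQUENCE = 4     # a still image repeats every 4 frames

# Assumption: one 8-bit sample at 4x the subcarrier, as on the NTSC machine.
sample_rate_hz = 4 * SUBCARRIER_HZ
samples_per_frame = sample_rate_hz / FRAME_RATE_HZ
bytes_per_still = samples_per_frame * FRAMES_PER_SEQUENCE

print(f"sample clock: {float(sample_rate_hz) / 1e6}MHz")
print(f"bytes for a still image: {int(bytes_per_still)}")
print(f"in megabytes: {float(bytes_per_still) / 2**20:.2f}")
```

Using exact fractions avoids any rounding doubt about whether the clock really is a whole number of Hz.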

Generating interesting standard PAL signals does have some complications that NTSC signals don't have. Because of the 25Hz offset, the colour carrier starts at a different phase on each line, meaning that a sprite needs to have different sample data depending on its vertical position. I expect that most demos written for the machine would use one of three simplifications of the PAL standard (as most if not all computers and consoles that generate PAL signals did):

  1. eliminate the 25Hz offset so that the colour carrier phase repeats every 4 lines
  2. use a whole number of subcarrier cycles per line (making the chroma patterns vertical)
  3. eliminate interlacing, doubling the frame rate at the expense of halving the vertical resolution

These simplifications change the horizontal and vertical retrace frequencies slightly from the standard 15.625kHz and 50Hz rates, but not so much that real hardware is likely to fail to display the intended image.
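To get a feel for how small those changes are, here is a rough calculation. It assumes the subcarrier frequency is held at its standard value and the line length is what changes; the choice of 284 cycles per line and 312 lines per field are just examples of mine, not the only possibilities.

```python
from fractions import Fraction

SUBCARRIER_HZ = Fraction(443361875, 100)    # PAL colour subcarrier, 4.43361875MHz
STANDARD_LINE_HZ = 15625

def line_rate(cycles_per_line):
    """Horizontal frequency if each line holds this many subcarrier cycles."""
    return SUBCARRIER_HZ / cycles_per_line

variants = {
    "standard (283.75 cycles plus 25Hz offset)": Fraction(1135, 4) + Fraction(1, 625),
    "no 25Hz offset (283.75 cycles)": Fraction(1135, 4),
    "whole cycles (284, an example choice)": Fraction(284),
}

for name, cycles in variants.items():
    hz = line_rate(cycles)
    error_ppm = float(hz / STANDARD_LINE_HZ - 1) * 1e6
    print(f"{name}: {float(hz):.3f}Hz ({error_ppm:+.1f}ppm)")

# Assumption: a non-interlaced field of 312 lines instead of 312.5.
non_interlaced_field_hz = Fraction(STANDARD_LINE_HZ, 312)
print(f"non-interlaced field rate: {float(non_interlaced_field_hz):.3f}Hz")
```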

NTSC decoder

Tuesday, July 1st, 2008

I want to write a piece of software that simulates a colour television (well, monitor really). It would really be a filter - the input is a sampled, quantized composite (CVBS) signal at some sample rate and the output is a series of video frames scaled to some resolution. I'd want the composite->RGB transformation to be done at the same time as the horizontal scaling to maximize quality whilst minimizing computation time. That means dynamically generating the appropriate filter kernel for a particular pixel width and sample rate.

Such a thing would be particularly useful for emulating old computers and consoles which generated a composite colour signal directly (including the demo machine of yesterday's post). In particular, it would render colour artifacts and interlacing perfectly. It would also be useful for simulating (for example) how an image would look on a TV for applications like DVD mastering.

This is the kind of image it will be able to generate (this mock-up was done at fixed resolution and not in real time):

No high-frequency chroma filtering was done on this image, so you can see the chroma artifacts. I wanted to include an animation showing the dot-crawl effects, but animated GIFs don't go up to 60fps. It does look uncannily like a TV screen though.

Writing this filter might even inspire me to fix up the NTSC emulation in MESS and improve the video emulation of machines like CGA, Apple II, CoCo and Atari 400/800.

Demo machine

Monday, June 30th, 2008

If one were to design a computer specifically for the purpose of having Oldskool-style demos written for it, what would it be like?

I came up with a few requirements:

  • Oldskool "feel"
  • Attractive output: sampled sound and realistic looking (still) images can be achieved very easily
  • The more skilled the programmer, the better effects can be achieved - very great skill is required for the best effects
  • Hardware as simple as possible
  • Easy to emulate
  • Fun to program

To get that Oldskool feel but still allow realistic looking still images, I decided to base it around a composite NTSC signal. It turns out there is a standard for digital composite NTSC signals - SMPTE 244M, so I just picked the 8-bit version of that. The video hardware is about as simple as video hardware could possibly be - just an 8-bit DAC that repeatedly cycles through a consecutive region of memory and dumps the bytes to the output at a frequency of 14.318MHz (4 times the color carrier frequency). This is also the CPU clock speed (the CPU is in sync with the pixel clock).

Technically, a still image in this format that is fully NTSC compliant requires 955,500 bytes (933Kb) of memory (2 frames*910*525) because the color carrier phase alternates between frames. Given that the machine's memory size should be a power of 2 (so that we can map 32-bit addresses to physical addresses just by clearing the top bits) I decided to give it 1Mb (any more and it stops being so Oldskool). This doesn't necessarily mean that you need to use almost all the memory to store the framebuffer - by tweaking the burst/sync patterns you can get a 912*525 (468Kb) interlaced mode where you only need one frame, or a 912*262 (233Kb) non-interlaced mode (again with only one frame), both of which should be compatible with any composite monitor. These have usable resolutions (including the overscan area) of about 720*480 and 720*240 respectively (with respective CGA-esque pixel aspect ratios of 5:6 and 5:12). So this gives a nice bit of flexibility to software with no extra hardware complexity.
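The sizes quoted above, and the address masking trick, can be sketched as follows. The names are illustrative, and the comment about 228 whole cycles per line is my reading of why the 912-sample modes need only one frame.

```python
SAMPLES_PER_LINE_STANDARD = 910   # 4 samples per colour carrier cycle
SAMPLES_PER_LINE_TWEAKED = 912    # 228 whole carrier cycles per line
RAM_BYTES = 1 << 20               # 1Mb, a power of 2

modes = {
    "fully compliant (2 frames)": 2 * SAMPLES_PER_LINE_STANDARD * 525,
    "912*525 interlaced": SAMPLES_PER_LINE_TWEAKED * 525,
    "912*262 non-interlaced": SAMPLES_PER_LINE_TWEAKED * 262,
}

for name, size in modes.items():
    share = 100 * size / RAM_BYTES
    print(f"{name}: {size} bytes ({round(size / 1024)}Kb), {share:.0f}% of RAM")

def physical_address(address):
    """Map a 32-bit address to RAM by clearing the top bits."""
    return address & (RAM_BYTES - 1)
```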

One disadvantage of this setup is that it may be possible for software to damage some composite monitors by creating sync pulses too often. So one would probably want to use an emulator when testing/debugging programs! Also, scrolling is fiddly because you have to move the image data independently of the sync/burst pulses. There are block move instructions which can move 4 pixels per clock to help with this, though.

Audio is done in a very similar way to video - set the beginning and end of the region of memory and the hardware cycles through that memory range turning bytes into samples and putting them through a DAC. Another register is used to set the sample rate (I decided a fixed rate would be too inflexible).

Programs are just 1Mb memory images which are dumped into RAM. The first two words of RAM are the instruction pointer and the stack pointer, so execution starts at the address pointed to by the word at 0. I/O registers are also at the start of the memory region.
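An emulator's loader could be about this simple. In this sketch I'm assuming a "word" is 32 bits and stored little-endian, neither of which is pinned down above, and load_program is a made-up name.

```python
import struct

RAM_BYTES = 1 << 20

def load_program(image):
    """Copy a memory image into RAM and fetch the initial IP and SP."""
    if len(image) != RAM_BYTES:
        raise ValueError("a program is exactly one 1Mb memory image")
    ram = bytearray(image)
    # Assumption: two 32-bit little-endian words at addresses 0 and 4.
    ip, sp = struct.unpack_from("<II", ram, 0)
    # Addresses wrap to the 1Mb physical range by clearing the top bits.
    return ram, ip & (RAM_BYTES - 1), sp & (RAM_BYTES - 1)

# Build a blank image whose first two words point at 0x100 and the top of RAM.
image = bytearray(RAM_BYTES)
struct.pack_into("<II", image, 0, 0x00000100, 0x000FFFFC)
ram, ip, sp = load_program(image)
print(hex(ip), hex(sp))
```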

Most instructions are 1 cycle long (with the exception of instructions like the equivalents of "REP MOVSB" and "REP STOSB") to enable effects that require cycle counting to be done as easily as possible. Most instructions are 1 byte long (with the exception of instructions that have a 1-byte, 2-byte or 4-byte immediate argument). The CPU is 32-bit (to make addressing that 1Mb of memory easy). The architecture is stack machine based (kind of like Java) - I have a soft spot for such architectures because they're really easy to write decent compilers for (registers are helpful at the hardware level for making processors fast, but that isn't a particular concern here). Devolving the CPU has some ideas for generating a good instruction set.

Because of the cycle counting and simplicity requirements, there are no hardware interrupts - all timing must be done by cycle counting. This also means that there are no instructions whose purpose is to determine some value N and take O(N) time to do it - so no "REP SCASB" in this architecture. I don't think that is very useful for demos anyway, and it's a non-goal of this design to be suitable for general purpose computing tasks like searching and parsing.

It would be pretty easy to generalize this architecture to make a games machine - just have a register whose bits reflect the up/down status of controller buttons.

Now, while I'd like to believe that building such a machine would revitalize the Oldskool demo scene by providing a new focal point, I suspect a more realistic point of view is that because programming such a machine would be such a non-applicable skill, because the limitations of the machine constrain the possible effects, and because there would be none of the nostalgia inspired by actual Oldskool hardware, nobody would actually be interested. But it's still an interesting exercise!

CRTC emulation for MESS

Friday, November 24th, 2006

Background

I am the author of the remastered version of Windmill Software's Digger. In creating this, I wanted to make the experience of running this game as close as possible to the experience of running the game on an original 4.77MHz CGA IBM PC.

I mostly succeeded, but there are a few rough edges. The sound is a little harsh when not using the PC speaker as I am not filtering out aliased high frequencies properly. Also, I never got the flashing effect on the "Enter your initials" screen quite perfect. For one thing, it is still CPU-speed dependent in the DOS version of Digger Remastered, as it was in the original.

Trouble is, I don't know exactly how this flashing effect is supposed to look. The palette changed (partway through lines) after every 2 or 3 scanlines on my PC1512, but I knew that wasn't exactly right as the PC1512 runs at 8MHz. I later found out that the effect was pretty similar on a 4.77MHz machine - it was not synchronized to the horizontal retrace or anything like that - the author describes it as "The rolling colors appeared as if the text was in a moving rainbow".

I realized that to see this effect as it was originally intended I would need a cycle-exact emulator. This seemed like rather a big job so I put it on the back burner.

Years later, I came across MESS and discovered that the hardest part of the work was done. With a few minor modifications (and a rewrite of the 8253 Programmable Interval Timer) Digger (both original and remastered) worked great - even the sound was better. However, the "rolling rainbow" raster effect still didn't appear. The raster effects in California Games also don't work. California Games flips the palette at the same place each frame in order to use multiple palettes (and more than 4 colours) at once:

In order to make these things work, I decided to embark on a complete rewrite of the video emulation for machines which use a CRTC (Cathode Ray Tube Controller) based on the Motorola 6845 and variants.

About the 6845 CRTC

You can tell if a machine uses a 6845 variant because its video system will have the following characteristics:

  1. A character-cell based display
  2. A text-mode hardware cursor whose position is controlled by registers at offset 14 and 15
  3. Hardware scrolling controlled by registers at offset 12 and 13

There's a lot more to it than that, but just about every other feature of the 6845 is missing or different in some implementation or other - that's about all that's common to all the variants.

The 6845 CRTC keeps track of the position of the CRT beam and generates:

  • horizontal and vertical sync pulses (to keep the real CRT in sync with the CRTC)
  • a "memory address" (a unique number for each character cell in the picture)
  • a "row address" (the scanline within the character cell)
  • a "display enable" bit, to indicate whether the beam is currently displaying memory-driven data or overscan/blanking/retrace
  • a "cursor" bit, indicating whether the beam is currently within the cursor

The 6845 can also be thought of as a 4-stage counter:

  • Stage 0: horizontal character counter
  • Stage 1: scanline counter within a character row
  • Stage 2: character row counter
  • Stage 3: frame counter (for cursor flashing)
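The first three stages can be modelled as nested loops. This is a much simplified sketch (no vertical total adjust, sync, cursor or frame counter), assuming the usual 6845 convention that totals are programmed as count minus one; crtc_frame and its parameter names are my own.

```python
def crtc_frame(h_total, h_displayed, v_total, v_displayed, max_scanline,
               start_address=0):
    """Yield (memory_address, row_address, display_enable) per character clock."""
    row_start = start_address
    for char_row in range(v_total + 1):                # stage 2
        for row_address in range(max_scanline + 1):    # stage 1
            for column in range(h_total + 1):          # stage 0
                enable = column < h_displayed and char_row < v_displayed
                # The memory address is 14 bits wide.
                yield (row_start + column) & 0x3FFF, row_address, enable
        # Every scanline of a character row rescans the same addresses;
        # only the next character row moves on through memory.
        row_start += h_displayed

# A toy frame: 4 characters per line (2 displayed), 3 rows (2 displayed),
# 2 scanlines per character row.
ticks = list(crtc_frame(h_total=3, h_displayed=2, v_total=2,
                        v_displayed=2, max_scanline=1))
print(len(ticks), "character clocks per frame")
```

Note how the same memory addresses come out once per scanline of a character row, with only the row address changing. That is the property the graphics modes described next take advantage of.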

Graphics modes and the 6845

The MC6845 only has a 7-bit character row (stage 2) counter, so if you make each character row one scanline high, you can only display ~100 scanlines (after overhead for vertical overscan and blanking). So most machines that use a 6845 and support high-resolution graphics use some of the row address bits as memory address bits:

  • In graphics modes, the CGA uses the least significant bit of the row address (R0) as the most significant bit of the memory address (M13). This explains why even scanlines are in the low 8Kb of RAM and the odd scanlines are in the high 8Kb.
  • The BBC Micro in modes 0-6 uses the lowest three row address bits R0-R2 as the low bits of the memory address (M0-M2). This explains the somewhat counter-intuitive memory layout of this architecture.
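As a rough illustration of the two mappings, assuming each CGA character clock fetches two bytes and ignoring details such as the BBC Micro's screen base address and wraparound (the function names are hypothetical):

```python
def cga_graphics_offset(memory_address, row_address):
    """Byte offset in CGA video RAM for a CRTC character (two bytes wide).

    R0 becomes address bit 13; the CRTC memory address supplies bits 1-12.
    """
    return ((row_address & 1) << 13) | ((memory_address << 1) & 0x1FFF)

def cga_scanline_offset(y):
    """Offset of the first byte of scanline y (40 characters per CRTC row)."""
    return cga_graphics_offset((y >> 1) * 40, y & 1)

def bbc_offset(memory_address, row_address):
    """BBC Micro modes 0-6: R0-R2 become the low three address bits."""
    return (memory_address << 3) | (row_address & 7)

for y in range(4):
    print(f"CGA scanline {y} starts at offset {cga_scanline_offset(y):#06x}")
```

On the BBC Micro the eight scanlines of one character cell therefore occupy eight consecutive bytes, which is where the counter-intuitive layout comes from.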

Other features supported by the 6845

Some variants of the 6845 support:

  1. Software-programmable timings. Most 6845 variants allow software to change the number of characters/scanlines per frame, the number of displayed characters/scanlines and the relative positions of the sync signals. This makes video hardware that uses these variants very flexible, but in some cases does make it possible for software to destroy hardware (some fixed-scan monitors can be damaged if the timings of the sync signals are out of range).
  2. A lightpen. When the CRT beam passes the lightpen sensor, a strobe signal is sent to the 6845 and the current memory address is latched. This value can then be read by software.
  3. Interlaced display. Even frames are advanced by a half-scanline and odd frames are retarded by a half-scanline. Even frames are therefore one scanline larger than odd frames, causing the beam to start the first visible line a half-scanline higher on even frames.
  4. Use as a memory controller. To avoid contention for display RAM between the display logic and the CPU, all memory access is done through the CRTC. Some variants support features such as fast video RAM to video RAM copy and fill via CRTC commands.
  5. A blanking bit separate from the display enable and sync bits, enabling a "two stage" overscan - an outer black region surrounding an inner solid-colour region.

List of 6845 variants (with references)

  • Motorola 6845 (Motorola 68A45 and 68B45 are software equivalent but have different maximum clock speeds)
  • Motorola 6845-1 (equivalents: Motorola 68A45-1/68B45-1)
  • Rockwell 6545 (equivalents: Rockwell 6545E)
  • Rockwell 6545-1 (equivalents: Commodore 6545-1)
  • Hitachi 46505
    • differences described here.
  • Synertek SY6545-1 / SY6845E
  • UMC UM6845 / Hitachi HD6845S (Amstrad CPC "type 0")
  • UMC UM6845R (Amstrad CPC "type 1")
  • Amstrad AMS40489 (Amstrad CPC "type 3" - ASIC in CPC464+, CPC6128+, GX4000)
  • Amstrad Pre-ASIC (Amstrad CPC "type 4" - used in "cost-down" CPC6128)
    • information about these types here and here.
  • Amstrad PC1512 - timings fixed, always displays 200 lines (regardless of character cell height)
  • EGA - only very loosely based on the 6845, has many new features
  • VGA - similar to the EGA but with even more new features
  • 8563 (used in Commodore C128) - Wikipedia article
  • 8568 (used in D[CR] models of the Commodore C128) - there don't appear to be any differences for software or emulation between this and the 8563 - the main difference is an extra (unused) interrupt line.
  • Chips & Tech 82c425 and 82c426 - used in some CGA clones.
  • Professional Graphics Controller
  • Other CGA clones - 3270 PC, Plantronics ColorPlus, Amstrad PPC/PC20, Olivetti M24, Olivetti Prodest PC1

How the new 6845 emulation in MESS will work

You can download the code (as it is so far) here.

When the screen is initialized, the video hardware implementation calls crtc6845_init() to allocate and set up a CRTC object. Multiple objects can be allocated (for example if we're emulating a PC with both a CGA and an MDA display). The CRTC object contains the entire state of the CRTC, including a bitmap containing the current state of the frame (as it would be displayed on the CRT).

Whenever the CPU reads from or writes to the CRTC, crtc6845_update() is called to bring the state of the CRTC up to date. The do_update() function does the real work, and is where the main emulation loop is.

The main emulation loop works in a way very similar to the actual hardware. There are internal counters which are incremented each cycle or scanline and compared to values derived from the register values. Each cycle, the CRTC calls a callback supplied by the video hardware emulation to actually draw the pixels. This callback is supplied with the output of the CRTC (such as row and memory address and coordinates at which to draw). This may be somewhat slow (in 80-column text mode on a CGA, this loop will have to execute 1.79 million times per second) but if it is too slow there are many possible ways to optimize it without reducing the accuracy of the emulation, such as having the callback process larger parts of scanlines at once.
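The "catch up before anything changes" pattern looks roughly like this. It is a toy model with only two counters, not the real code: LazyCrtc, its fields and the callback signature are all invented for illustration.

```python
class LazyCrtc:
    """Toy CRTC that only advances when somebody needs its state."""

    def __init__(self, ticks_per_scanline, draw_callback):
        self.ticks_per_scanline = ticks_per_scanline
        self.draw_callback = draw_callback
        self.registers = {}
        self.tick = 0          # ticks emulated so far
        self.column = 0        # horizontal character counter
        self.scanline = 0      # scanline counter

    def update(self, target_tick):
        """Run the per-tick loop until the CRTC has caught up with the CPU."""
        while self.tick < target_tick:
            self.draw_callback(self.column, self.scanline)
            self.column += 1
            if self.column == self.ticks_per_scanline:
                self.column = 0
                self.scanline += 1
            self.tick += 1

    def write_register(self, current_tick, index, value):
        # Draw everything up to "now" with the old settings first,
        # so a mid-scanline change only affects later pixels.
        self.update(current_tick)
        self.registers[index] = value

drawn = []
crtc = LazyCrtc(4, lambda column, scanline: drawn.append((column, scanline)))
crtc.write_register(current_tick=6, index=0x0C, value=1)
print(drawn)
```

Because nothing is drawn until a register access or the end of the field forces it, a program that never touches the CRTC costs only one catch-up per field.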

The machine's VIDEO_UPDATE function calls video_update_crtc6845(), which gets the CRTC up to date and then copies the CRTC's bitmap to the output bitmap.

The machine's VIDEO_EOF function calls video_eof_crtc6845(), which updates to the end of the current field and then does "per field" tasks. The main one of these is checking to see if the size of the generated image is the same as it was in the previous field. If it isn't, screen_configure() is called to enlarge the bitmap (if necessary - it never gets shrunk) and call video_screen_configure() to update the frame rate and size. This is done here rather than just by looking at the programmed field width and height parameters in the CRTC registers because some effects involve having multiple smaller CRTC fields per CRT field (i.e. "resetting" the CRTC part way through the frame). This works fine on real hardware as long as the timings of the horizontal and vertical sync pulses are correct, so we would like to be able to emulate these effects.

There are several different rectangles which the CRTC keeps track of:

  1. The display area (i.e. the area over which the "display enable" bit is set and the displayed pixels are driven by memory data)
  2. The display+overscan area (i.e. the area over which blanking is disabled and non-black pixels are actually drawn)
  3. The scanning area (i.e. the area over which both the horizontal and vertical sync pulses are low, and the cathode ray beam is progressing in the normal rightwards/downwards direction)
  4. The visible area (i.e. the area which is actually displayed by the MAME/MESS core. This must be within the scanning area but is independent of the blanking, overscan and display boundaries)

The visible area is set by the function crtc6845_set_visible_area(). The width and height parameters to this function are fractions of the width and height of the scanning area, which should be fairly close to what real monitors do. This function can also move the image horizontally and vertically as well as changing its size. It corresponds to the size and position controls of the monitor. It should probably be set once in the machine initialization function, tuned for the machine (so that the entire image and a small amount of overscan is visible in every mode) and then left alone.

Whenever a CRTC register that controls timing values is written to, the recalculate_timings() function is called. This function decodes the register values into a format more easily used by the do_update() function.

Each CRTC has an internal array of possible drawing callbacks that can be used. The element of this array that is actually called is set by crtc6845_set_callback(), which will usually be called by the machine-dependent code which sets the video mode. This is an optimization so that each callback can take care of one video mode and you don't have to switch on video mode in the callback function (which is called a *lot* so should be as simple as possible).

Each CRTC also has an internal array of possible clock frequencies. This is because some machines can supply multiple different clock rates to the CRTC. For example, the input clock to the CRTC on a CGA is 1.79MHz in 80-column text mode and 895kHz in graphics modes and 40-column text mode. Memory is scanned through horizontally at twice the rate in 80-column text mode as in other modes.
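For the CGA example, the table of clocks might be built like this. I'm assuming the two rates are a 14.318MHz master clock divided by 8 and by 16, which matches the figures above; ClockSelect is a hypothetical stand-in, not the structure used in the real code.

```python
MASTER_CLOCK_HZ = 14_318_180   # approximately 4x the NTSC colour carrier

# Assumption: the CRTC clocks are the master clock divided by 8 and by 16.
CRTC_CLOCKS_HZ = [MASTER_CLOCK_HZ / 8, MASTER_CLOCK_HZ / 16]

class ClockSelect:
    """Hypothetical holder for a CRTC's table of possible input clocks."""

    def __init__(self, clocks_hz):
        self.clocks_hz = list(clocks_hz)
        self.current = 0

    def select(self, index):
        self.current = index

    def ticks_in(self, seconds):
        """Whole CRTC ticks that elapse in the given emulated time."""
        return int(self.clocks_hz[self.current] * seconds)

clock = ClockSelect(CRTC_CLOCKS_HZ)
print("80-column text:", clock.ticks_in(1.0), "ticks per second")
clock.select(1)
print("other modes:", clock.ticks_in(1.0), "ticks per second")
```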

There are a few other functions that machines can call to change CRTC parameters:

  • crtc6845_set_pixels_per_tick() - set how much to increase the bitmap x coordinate by each clock tick. Usually this will be set at the same time as the clock frequency.
  • crtc6845_set_enable() - enable or disable the CRTC (equivalent to the enable pin on the actual chip). A black screen is drawn if the CRTC is disabled for any part of the field. This can help to prevent weirdness during mode changes.
  • crtc6845_set_refresh_limits() - sets the maximum and minimum refresh rate that the CRTC will attempt to pass to video_screen_configure(). This prevents emulated software from being able to do bad things to MESS by setting frame rates too high or low, and prevents weirdness during mode changes.
  • crtc6845_set_lightpen_position() - sets the position of the lightpen relative to the visible area.

Finally, there are two functions (crtc6845_get_display_enable_status() and crtc6845_get_vsync_status()) for obtaining values of the output pins of the CRTC. Some video hardware implementations (such as CGA) can return the values of these bits via an IO port.

Terminology and conventions

Within the CRTC code, many variables have prefixes which correspond to the units which that variable counts:

  • t_ - ticks of the input clock, horizontal characters, memory addresses
  • s_ - scanlines
  • x_ - horizontal pixels
  • y_ - vertical pixels
  • f_ - fields

The underscore (in general) represents "per" so (for example) a variable called "x_t" represents the number of horizontal pixels per tick of the input clock.
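One nice consequence of this convention is that units cancel visibly when quantities are combined. The numbers below are made up for illustration and are not taken from the CRTC code.

```python
# Each prefix names the unit being counted; "_" reads as "per".
x_t = 8          # horizontal pixels per tick of the input clock
t_s = 114        # ticks per scanline
s_f = 262        # scanlines per field

x_s = x_t * t_s  # pixels per scanline: (x/t) * (t/s) = x/s
t_f = t_s * s_f  # ticks per field:     (t/s) * (s/f) = t/f

print(f"{x_s} pixels per scanline, {t_f} ticks per field")
```

An expression like x_t * s_f would stand out as suspicious, because its units do not cancel to anything meaningful.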

This is a kind of Hungarian notation - Hungarian as it was meant to be used rather than the horrible misuse that is usually perpetrated whereby the prefix just duplicates the type information rather than telling you what the variable actually counts or measures.

6845 registers

The 6845 has these registers:

  • 0x00 - Horiz. total characters
  • 0x01 - Horiz. displayed characters per line
  • 0x02 - Horiz. sync position
  • 0x03 - Horiz. sync width in characters
  • 0x04 - Vert. total lines
  • 0x05 - Vert. total adjust (scan lines)
  • 0x06 - Vert. displayed rows
  • 0x07 - Vert. sync position (character rows)
  • 0x08 - Mode
  • 0x09 - Maximum scan line address
  • 0x0a - Cursor start (scan line)
  • 0x0b - Cursor end (scan line)
  • 0x0c - Start address (MSB)
  • 0x0d - Start address (LSB)
  • 0x0e - Cursor address (MSB)
  • 0x0f - Cursor address (LSB)
  • 0x10 - Light pen address (MSB) (read only)
  • 0x11 - Light pen address (LSB) (read only)
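The MSB/LSB pairs hold 14-bit memory addresses. A small sketch of packing and unpacking them, treating the register file as a plain list (the helper names are mine):

```python
START_ADDRESS_MSB = 0x0C
CURSOR_ADDRESS_MSB = 0x0E
LIGHT_PEN_MSB = 0x10

def read_address(registers, msb_index):
    """Combine an MSB/LSB register pair into a 14-bit memory address."""
    return ((registers[msb_index] & 0x3F) << 8) | registers[msb_index + 1]

def write_address(registers, msb_index, address):
    """Split a 14-bit memory address across an MSB/LSB register pair."""
    registers[msb_index] = (address >> 8) & 0x3F
    registers[msb_index + 1] = address & 0xFF

registers = [0] * 18
# Hardware scrolling: move the start address on by one 80-character row.
write_address(registers, START_ADDRESS_MSB, 80)
write_address(registers, CURSOR_ADDRESS_MSB, 0x1234)
print(read_address(registers, START_ADDRESS_MSB))
```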

The Amstrad PC1512 (40041 VDU) is equivalent to the 6845 without registers 0-8 inclusive. 200 scanlines are always displayed, even if this is not a multiple of the character height.

The Rockwell 6545 is the same as the 6845 with these additional registers:

  • 0x12 - Update register (MSB)
  • 0x13 - Update register (LSB)
  • 0x1f - Memory access (mapped to video memory location specified in update register)

The EGA and VGA are the same as the 6845 with these additional/changed registers:

  • 0x03 - End Horiz. Blank
  • 0x04 - Start Horiz. Retrace
  • 0x05 - End Horiz. Retrace
  • 0x06 - Vertical Total
  • 0x07 - CRTC Overflow
  • 0x08 - Preset Row Scan
  • 0x09 - Maximum Scan Line
  • 0x10 - Vert. Retrace Start
  • 0x11 - Vert. Retrace End
  • 0x12 - Vertical Display End
  • 0x13 - Offset
  • 0x14 - Underline Location
  • 0x15 - Start Vert. Blank
  • 0x16 - End Vertical Blank
  • 0x17 - CRTC Mode Control
  • 0x18 - Line Compare

The 8563 and 8568 are the same as the 6545 with these additional registers:

  • 0x14 - Attribute Start Address (MSB)
  • 0x15 - Attribute Start Address (LSB)
  • 0x16 - Hz Chr Pxl Ttl/IChar Spc
  • 0x17 - Vert. Character Pxl Spc
  • 0x18 - Block/Rvs Scr/V. Scroll
  • 0x19 - Diff. Mode Sw/H. Scroll
  • 0x1A - ForeGround/BackGround Col
  • 0x1B - Row/Adrs. Increment
  • 0x1C - Character Set Addrs/Ram
  • 0x1D - Underline Scan Line
  • 0x1E - Word Count (-1)
  • 0x1F - Data
  • 0x20 - Block Copy Source (MSB)
  • 0x21 - Block Copy Source (LSB)
  • 0x22 - Display Enable Begin
  • 0x23 - Display Enable End
  • 0x24 - DRAM Refresh Rate

Remaining work to do

  • Change enum constants to caps, and be less verbose (CRTC_MC6845 instead of crtc6845_personality_mc6845)
  • Struct types should be typedefed like those in mame.h
  • video_update_crtc6845() and video_eof_crtc6845() should be crtc6845_update() and crtc6845_eof() respectively to avoid confusion with the VIDEO_UPDATE/VIDEO_EOF macros. (I did it this way to emphasize the fact that these are the crtc6845's equivalent of VIDEO_UPDATE and VIDEO_EOF but I can appreciate that this could cause problems.)
  • Have a global variable "crtc6845 *crtc" that is in crtc6845.c but not used there. If that is not there, almost all consumers of crtc6845.h will have their own static variable.
  • Many 8-bit systems will simply use one CRTC. Some helpers implemented as VIDEO_UPDATE/VIDEO_EOF/READ8_HANDLER/WRITE8_HANDLER that simply use the crtc global described above may be helpful for these common cases.
  • Change memory_address to be of type offs_t
  • Free the bitmap that is kept in the crtc structure
  • Add save state support
  • Implement the drawing callbacks for each piece of video hardware that uses a 6845-variant CRTC
  • Switch over the video hardware implementations to use the new CRTC emulator
  • Implement the differences between different CRTC variants (read/write vs. write only registers etc.)
  • Cleanup (convert spaces to tabs, // comments to /**/ comments and make the brace style consistent with that used elsewhere in MESS)
  • Optimization:
    • Dirty flag (return UPDATE_HAS_NOT_CHANGED from video_update_crtc6845() if nothing has changed, otherwise return 0)
    • Per-character/scanline dirty flags to reduce CPU usage when nothing's changing
    • Per-character/scanline attributes (such as palette data) so that (e.g.) the California Games title screen doesn't need to be completely redrawn each frame
    • Change the drawing callback to do a (partial) scanline instead of just the intersection of one character with one scanline
    • Pre-decode graphics?
  • Implement CGA status register 0x3da (bit 0 = NOT(display enable), bit 3 = (vertical sync))
  • Implement remaining EGA/VGA CRTC registers:
    • register 3 bits 5-6: Display Enable Skew
    • register 5 bits 5-6: Horizontal Retrace Skew
    • register 8 bits 5-6: Byte panning
    • register 9 bit 7: Scan doubling
    • register 11 bits 5-6: Cursor skew
    • register 17 bit 4: Clear Vertical Interrupt
    • register 17 bit 5: Enable Vertical Interrupt
    • register 17 bit 6: Memory Refresh Bandwidth
    • register 20 bit 5: Divide Memory address clock by 4
    • register 20 bit 6: Double-Word Addressing
    • register 20 bits 0-4: Underline Location
    • register 23 bit 0: Map Display Address 13
    • register 23 bit 1: Map Display Address 14
    • register 23 bit 2: Divide Scan Line clock by 2
    • register 23 bit 3: Divide Memory Address clock by 2
    • register 23 bit 5: Address Wrap Select
    • register 23 bit 6: Word/Byte Mode Select
    • register 23 bit 7: Sync Enable
  • Finish light pen work
  • Implement phase-locked-loop effects (see below)
  • Improve NTSC composite emulation (see below)
  • Generalization: Much of this CRTC code is applicable to all machines and should probably be moved into the MAME core, replacing the raster-related functions in video.c and cpuexec.c. This should make it much easier for other machines to implement raster effects.

Phase-locked loop

 

A CRT contains an oscillator which pulses at the horizontal retrace frequency. Each pulse sends the electron beam flying back from the right edge of the screen to the leftmost point of the next line. This flyback is driven by the internal oscillator, not the horizontal sync pulse that is sent from the video source to the CRT.

The CRT also has a mechanism (called a phase-locked loop) which keeps the horizontal retrace oscillator in sync with the input horizontal retrace pulse. If the input pulse is ahead of the oscillator pulse then the oscillator phase is adjusted forward a little, and if the input pulse is behind the oscillator pulse then the oscillator phase is adjusted backward a little. You can see this graphically in the following screenshot, courtesy of Trixter:

This is in 320x200 mode, so each character clock corresponds to 8 pixels, 2 bytes or 1/40 of the display width. Every 10 scanlines the sync pulse is moved left or right (alternately) by 1 character clock (8 pixels). On the last scanline of the "7" the start of the beam is at the leftmost position on the screen. Immediately after that the retrace pulse is delayed by a clock, so if the retrace were driven directly by the input retrace pulse we would expect to see the following line start 8 pixels to the right.

However, it only moves a few pixels on that first scanline, then a few more the next scanline and so on. It slows down after a few scanlines, and by the end of the "8" the vertical lines are pretty much vertical again at the new position, just in time to start moving again at the top of the "9".

The resulting "wobble" isn't quite a sine wave - the graph (if you squint a bit) looks like a bunch of exponential curves. It's the same effect you'd see if you put an analogue voltmeter across a square wave source with a period of a second or so - it takes a moment for the needle to "settle" on the new position. In fact that's exactly what's happening to the phase-locked loop.
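A first-order model is enough to reproduce that shape. In this sketch the oscillator closes a fixed fraction of the phase error on each scanline; the gain of 0.2 is a guess for illustration, not a measured property of any monitor.

```python
def simulate_pll(input_phases, gain=0.2):
    """First-order loop: each line, close a fixed fraction of the phase error."""
    oscillator = input_phases[0]
    output = []
    for target in input_phases:
        oscillator += gain * (target - oscillator)
        output.append(oscillator)
    return output

# Sync pulse position in pixels: shifted by 8 pixels every 10 scanlines.
sync = ([0.0] * 10 + [8.0] * 10) * 2
beam = simulate_pll(sync)

for line, (wanted, actual) in enumerate(zip(sync, beam)):
    print(f"scanline {line:2d}: sync at {wanted:.0f}, beam starts at {actual:.2f}")
```

Each 10-line segment traces out an exponential approach towards the new sync position, which is the settling behaviour described above.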

This is an effect that would be neat to emulate in MESS.

A related effect occurs when you skip every other retrace pulse altogether. This may seem like the kind of abuse that might break a fixed-scan monitor, but because the retrace is driven by an oscillator, the real retrace happens at roughly the right time anyway, despite the lack of an input pulse. A phase-locked loop will often work just fine when driven at half of its resonant frequency. This makes it possible to create a CGA 320x100 mode where every other scanline is black - you just set the CRTC timing registers for half as many scanlines and make the scanlines twice as long (but only display in the first half of the scanline). This is actually very useful as you can create a fully graphical mode at half the resolution with two pages.

Composite output

The CGA composite output in MESS could use some improvement. Currently it takes each nybble and just treats it as a single colour over the entire span of that nybble. This does not reflect what is really going on with the composite output and the NTSC decoding. It looks like this:

Dosbox's composite output for the same game is a little better. Notice how much clearer the word "GEAR" is, for example. I've magnified a part of one of the instruments:

Dosbox does something much closer to a real NTSC decoding - sampling the signal once per pixel (640 times per line) and applying filters (multiplying by 1, a sine wave and a cosine wave before averaging over a colour clock cycle period) to get the Y (luminance), I (in-phase) and Q (quadrature) values for each pixel. Obtaining RGB values from these is a simple linear transformation.
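The idea can be sketched in a few lines. This is not Dosbox's code: it is a bare-bones version of the same averaging, assuming the carrier phase is zero at sample 0, and the RGB matrix uses the commonly quoted approximate YIQ coefficients.

```python
import math

def decode_yiq(samples, samples_per_cycle, start):
    """Average one colour-carrier cycle of samples beginning at index start."""
    y = i = q = 0.0
    for n in range(start, start + samples_per_cycle):
        phase = 2 * math.pi * n / samples_per_cycle
        y += samples[n]                      # multiply by 1
        i += samples[n] * math.cos(phase)    # multiply by a cosine wave
        q += samples[n] * math.sin(phase)    # multiply by a sine wave
    # The factor of 2 restores the amplitude of the demodulated carrier.
    return (y / samples_per_cycle,
            2 * i / samples_per_cycle,
            2 * q / samples_per_cycle)

def yiq_to_rgb(y, i, q):
    """Approximate linear transformation from YIQ to RGB."""
    return (y + 0.956 * i + 0.621 * q,
            y - 0.272 * i - 0.647 * q,
            y - 1.107 * i + 1.705 * q)

# Mid grey plus a carrier of amplitude 0.2, at 4 samples per carrier cycle.
signal = [0.5 + 0.2 * math.cos(2 * math.pi * n / 4) for n in range(16)]
y, i, q = decode_yiq(signal, samples_per_cycle=4, start=0)
print(yiq_to_rgb(y, i, q))
```

Sampling 640 times per line corresponds to samples_per_cycle=4 here, and 1280 times per line to samples_per_cycle=8.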

We can do slightly better if we sample the signal 1280 times per line instead of 640:

This sample rate smoothes out the transitions between colours a little more, but also has the advantage that it allows us to decode colours from modes other than 640x200 palette 15 accurately (something that Dosbox currently does not do). To do this, we need to understand a little about how the CGA generates its composite signal. I worked this out from looking at the CGA schematics.

The CGA's composite output consists of a signal which can be reconstructed completely by sampling at 28.64MHz with 2 bits of quantization. Neglecting blank, sync and color burst signals there are 4 voltage levels that can be generated:

  • Y=0, C=0: 0.416V
  • Y=1, C=0: 0.709V
  • Y=0, C=1: 1.160V
  • Y=1, C=1: 1.460V

I calculated these voltages with a circuit simulator using this circuit, with one refinement (a 75 ohm load - thanks rj for pointing out that omission!). These are a bit off from the usual NTSC levels, but any TV will do some kind of automatic gain adjustment anyway so the picture won't look too bad.

It's kind of interesting that the C (chroma) bit has roughly twice the effect of the Y (luminance) bit - I'm sure it's not a coincidence that on a digital monitor, changing the intensity bit has half the effect on gray level of changing the R, G and B bits.

The Y bit is just the intensity bit, so can only change on pixel boundaries (14.32MHz). The C bit works a little differently. The CGA actually generates 8 different C signals, 6 of which are square waves with frequencies of 3.58MHz (the color burst frequency) and different phases corresponding to different hues. Which one is actually output depends on the R, G and B values.
R=0, G=0, B=0, C=00000000 BLACK
R=0, G=0, B=1, C=00001111 BLUE
R=0, G=1, B=0, C=01111000 GREEN
R=0, G=1, B=1, C=00111100 CYAN
R=1, G=0, B=0, C=11000011 RED
R=1, G=0, B=1, C=10000111 MAGENTA
R=1, G=1, B=0, C=11110000 YELLOW (also used for colour burst)
R=1, G=1, B=1, C=11111111 WHITE

Note that there are two more phases (00011110 (aqua) and 11100001 (orange)) which cannot be generated directly in the chroma bit.

Note also that because the green and magenta signals are out of phase with the yellow and blue signals by 1/28.64MHz (=35ns) the full signal cannot be reconstructed by sampling at 14.32MHz (depending on the sampling phase, green would be confused with either yellow or cyan, and magenta with either blue or red).
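
To make the table and voltage levels above concrete, here is a small Python sketch that turns a row of pixels into composite samples at 28.64MHz. It is only an illustration: it assumes the bit strings in the table are read left to right in time, that each pixel is a 4-bit IRGB value (intensity in bit 3, red in bit 2, green in bit 1, blue in bit 0), and the `phase` parameter and all names are hypothetical. Blanking, sync and colour burst are ignored.

```python
# Chroma bit patterns from the table, indexed by the RGB bits (R=4, G=2, B=1).
# Each pattern covers one colour clock cycle: 8 samples at 28.64MHz.
CHROMA = {
    0b000: "00000000",  # black
    0b001: "00001111",  # blue
    0b010: "01111000",  # green
    0b011: "00111100",  # cyan
    0b100: "11000011",  # red
    0b101: "10000111",  # magenta
    0b110: "11110000",  # yellow
    0b111: "11111111",  # white
}

# Output voltage for each (Y, C) pair, from the list of levels above.
VOLTS = {(0, 0): 0.416, (1, 0): 0.709, (0, 1): 1.160, (1, 1): 1.460}

def composite_samples(pixels, phase=0):
    """Return two voltages per 14.32MHz pixel (i.e. samples at 28.64MHz)."""
    out = []
    for n, pixel in enumerate(pixels):
        y = (pixel >> 3) & 1                 # Y is just the intensity bit
        pattern = CHROMA[pixel & 7]
        for half in range(2):
            t = (2 * n + half + phase) % 8   # position within the colour cycle
            out.append(VOLTS[(y, int(pattern[t]))])
    return out

if __name__ == "__main__":
    print(composite_samples([1, 1, 1, 1]))   # four blue pixels = one colour cycle
```

With these patterns, keeping only every other sample of green gives the same bits as cyan (even samples) or yellow (odd samples), which is the 14.32MHz ambiguity described above.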

One more note: the CGA schematic is actually incorrect - it has RED and CYAN wired up the wrong way around. I'm pretty sure the colours on composite are reasonably close to those on an RGB monitor so I think it is the schematic rather than the real hardware which is incorrect.

I wrote a decoder which it might be possible to use with MESS (it's in ntsc_decode.c in crtc6845.zip). It can be used with a composite signal of any frequency, but does not (yet) resample between frequencies. A resampling decoder could be used in the scaling code to produce very high quality composite output. PAL and SECAM equivalents should be pretty similar for machines which output these standards (CGA does not).

Video timings

I'd like to share some of my findings about video timings. Where do all these strange numbers like 4.77MHz, 1.79MHz, 895KHz, 28.64MHz, 14.32MHz, 3.58MHz (and for that matter the 1.193182MHz that is the input frequency of the Programmable Interval Timer in every PC) come from?

This wikipedia article explains how the timings for the NTSC standard came to be as follows:
Frame rate = 59.94Hz (4500000/286/262.5 = 60000/1001)
Line rate = 15.734KHz (4500000/286 = 2250000/143)
Colour burst frequency = 3.58MHz (4500000*227.5/286 = 39375000/11)

Because of this, when the IBM PC was designed, crystals suitable for use in colour TV sets, with a frequency of 14.32MHz (3.58MHz*4 = 157500000/11), were cheap and easy to obtain, so the PC's designers used a crystal of this frequency for all the PC's timing:
CPU speed = 4.77MHz = 14.32MHz/3 (52500000/11)
PIT speed = 1.193MHz = 14.32MHz/12 (13125000/11)
6845 clock frequency (80-column text mode) = 1.79MHz = 14.32MHz/8 (19687500/11)
6845 clock frequency (other modes) = 895KHz = 14.32MHz/16 (9843750/11)
Composite sample frequency = 28.64MHz = 14.32MHz*2 (315000000/11)
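
The arithmetic above is easy to check with exact rational numbers. This is just a sketch using Python's standard `fractions` module; the variable names are my own.

```python
from fractions import Fraction

# NTSC colour burst frequency: 4.5MHz * 227.5 / 286
colour_burst = Fraction(4500000) * Fraction(455, 2) / 286
crystal = 4 * colour_burst                   # the 14.32MHz crystal

derived = {
    "CPU": crystal / 3,
    "PIT": crystal / 12,
    "6845 (80-column text)": crystal / 8,
    "6845 (other modes)": crystal / 16,
    "composite sample": crystal * 2,
}

if __name__ == "__main__":
    print(f"colour burst: {colour_burst} Hz")
    print(f"crystal: {crystal} Hz")
    for name, f in derived.items():
        print(f"{name}: {f} Hz = {float(f) / 1e6:.6f} MHz")
```

The PIT entry comes out as 13125000/11 Hz, which is the 1.193182MHz figure mentioned at the start of this section.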

Surprisingly, despite the fact that all these frequencies are based on NTSC frequencies, we still have timing problems when we wish to generate an NTSC video signal. Remember that there are exactly 227.5 cycles of the colour burst frequency per line in NTSC (the 0.5 is so that the phase changes by 180 degrees each line, which reduces artifacting). But the 6845 only counts in whole numbers - it can't generate 113.75 or 56.875 characters per line as it would need to to get a line rate of 15.734KHz.

Fortunately (because of the phase locked loop) monitors have some leeway in the exact frequencies they can accept, and will do just fine with a line rate of 15.7KHz (228 cycles of the colour burst frequency, or 114 narrow characters, or 57 wide characters - 4500000*227.5/286/228 = 3281250/209 Hz). In this setup, the colour burst frequency will have the same phase on every line, so you can't separate the chroma and luma signals by comparing the signals on successive lines (as some decoders do in order to improve quality).

Given that the CGA only has enough memory for a 200-scanline display, the CGA's designers decided to make the display non-interlaced, reducing flicker at the expense of resolution. This is done by having the 6845 generate only 262 scanlines instead of 262.5. Again, TVs have enough tolerance that they can display this non-standard signal. The frame rate is therefore 4500000*227.5/286/228/262 = 1640625/27379 = 59.92Hz, slightly less than the NTSC standard 59.94Hz.
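
The same kind of exact arithmetic shows how the CGA's line and frame rates compare with the NTSC standard, and why the colour burst phase no longer alternates from line to line. Again this is only a sketch with names of my own choosing.

```python
from fractions import Fraction

COLOUR_BURST = Fraction(39375000, 11)        # Hz

def rates(cycles_per_line, lines):
    """Return (line rate, vertical rate, colour phase advance per line in degrees).

    cycles_per_line: colour burst cycles per scanline
    lines: scanlines per vertical retrace
    """
    line_rate = COLOUR_BURST / cycles_per_line
    vertical_rate = line_rate / lines
    phase_step = (cycles_per_line % 1) * 360
    return line_rate, vertical_rate, phase_step

ntsc = rates(Fraction(455, 2), Fraction(525, 2))   # 227.5 cycles, 262.5 lines
cga = rates(Fraction(228), Fraction(262))          # 228 cycles, 262 lines

if __name__ == "__main__":
    for name, (line, vert, phase) in (("NTSC", ntsc), ("CGA", cga)):
        print(f"{name}: line {float(line):.2f}Hz, vertical {float(vert):.4f}Hz, "
              f"phase step {float(phase):.0f} degrees")
```

The phase step is 180 degrees for standard NTSC and 0 for the CGA, which is why a decoder that compares successive lines cannot separate chroma from luma on CGA output.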

The CGA contains circuitry to generate a non-interlaced output even if the 6845 generates sync pulses for interlaced images. This was probably done to simplify the colour-burst generation circuitry.

Other CGA quirks and emulation "todo"s

Here is a table of the CGA control register bits:
Port 0x3d8 bit 0 - "+HRES" - use 1.79MHz character clock instead of 895KHz character clock
Port 0x3d8 bit 1 - "+GRPH" - use graphics mode instead of text mode and turn on snow suppression
Port 0x3d8 bit 2 - "+BW" - disable the colour burst signal (making the composite output monochrome) and force 320x200 colour 2 to red
Port 0x3d8 bit 3 - "+VIDEO ENABLE" - enable memory driven output (if this is 0 it is as if the CGA memory is filled with 0s)
Port 0x3d8 bit 4 - "+1BPP" - generate high-resolution 1bpp graphics (0 bits force the output to black)
Port 0x3d8 bit 5 - "+ENABLE BLINK" - text attribute bit 7 means blinking if set, high intensity background if not set
Port 0x3d9 bit 4 - "+BACKGROUND I" - use intense version of palette for 320x200 mode
Port 0x3d9 bit 5 - "+COLOR SEL" - use cyan instead of green palette for 320x200 mode

A few other things I noticed while looking at the CGA schematics:

  • The CGA forces the text-mode cursor to blink even if the CRTC generates a non-blinking cursor.
  • The CGA suppresses blinking of blinking text when the cursor is on.
  • Text blinks at a rate of 16 frames on, 16 frames off.
  • There are several "improper" (and not particularly useful) video modes:
    • When "+1BPP" is set and "+GRPH" is clear, the 1bpp output is overlaid on top of the text mode output.
    • When "+HRES" and "+GRPH" are set and "+1BPP" is clear, the CGA will *almost* generate a 640-pixel wide, 2bpp mode (using the normal 2bpp palette). However, because of the snow suppression circuitry which is active in graphics modes, odd addresses are latched at the wrong time and some columns are repeated.
    • When "+HRES", "+GRPH" and "+1BPP" are all set you get a 640-pixel, 1bpp image with only the even bits displayed.
  • The 5160 technical reference manual implies that the "BACKGROUND I" bit has an effect in text mode. The CGA schematics imply that it does not, and are correct here.
  • There is an error in the CGA schematics (as well as the RED/CYAN one I mentioned earlier). On sheet 4, about 2/3 down the right-hand side the NOR gate marked as "U23 LS32" is actually an OR gate.

The control register works a little differently on the Amstrad PC1512 - there are no "improper" modes: "+1BPP" has no effect when "+GRPH" is clear and "+HRES" has no effect when "+GRPH" is set.
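
For emulation purposes, the bit table and the "improper" combinations described above can be summarised in a few lines of code. This is a sketch of my reading of those notes rather than code from MESS; the constant names, the `classify` function and its `pc1512` flag are all hypothetical.

```python
# Bits of the mode control register at port 0x3d8, as listed in the table.
HRES = 0x01
GRPH = 0x02
BW = 0x04
VIDEO_ENABLE = 0x08
ONE_BPP = 0x10
ENABLE_BLINK = 0x20

def classify(mode, pc1512=False):
    """Describe the video mode selected by a value written to port 0x3d8."""
    hres = bool(mode & HRES)
    grph = bool(mode & GRPH)
    one_bpp = bool(mode & ONE_BPP)
    if pc1512:
        # On the PC1512 the improper combinations are ignored.
        if not grph:
            one_bpp = False
        else:
            hres = False
    if not grph:
        return "text with 1bpp overlay (improper)" if one_bpp else "text"
    if hres:
        if one_bpp:
            return "640-pixel 1bpp, even bits only (improper)"
        return "almost 640-pixel 2bpp (improper)"
    return "640x200 1bpp" if one_bpp else "320x200 2bpp"

if __name__ == "__main__":
    for value in (0x29, 0x2A, 0x1E, HRES | GRPH | VIDEO_ENABLE):
        print(hex(value), classify(value), "/", classify(value, pc1512=True))
```

Whether the text is 40 or 80 columns (the "+HRES" bit in text mode) is deliberately left out to keep the example short.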

Other MESS-related work

There are several other improvements I would like to make to MESS, if nobody beats me to them:

  • Improve the sound emulation for machines which can generate sampled output and effects by rapidly toggling a 1-bit speaker and modulating the width of the pulse. This is used in some software on PCs and the Apple II, amongst others. It should be possible for emulated machines to generate a high frequency (e.g. 1.193MHz) 1-bit stream and have this resampled to the OSD sound output frequency by the core.
  • Implement the prefetch queue and memory bus delay for the 8088 and 8086 CPUs. Currently the 8088 emulation in MESS is about 25% too fast because these effects are not emulated. Emulating this will probably be necessary to make the scrolling in Super Zaxxon work properly. Trixter provided some good information about this, which I have reproduced here.
  • Fix the mouse code so that it doesn't keep moving the mouse to the top-left of the screen when two mice are connected (this makes MESS almost unusable for me until I remove the line:
    autoselect_analog_devices(inp, IPT_MOUSE_X, IPT_MOUSE_Y, 0, ANALOG_TYPE_MOUSE, "mouse");

    from wininput_init() in src\windows/input.c, but this is clearly not the right long-term fix.)

  • Find out why the floppy drive doesn't work on ibmxt.
  • Find a PC1512 hard drive ROM - according to this the genuine ROM gave the message "Hard disk controller error" instead of "1701" that the IBM one gave when no hard drive was present.
  • Add the Camputers Lynx to MESS (and any other as yet unemulated machines that I can find).

Contact me

Questions? Comments? Suggestions? Offers to help implement any of this stuff? Comment below or email me at andrew@reenigne.org and I will try to respond promptly.

More geekism

Sunday, July 11th, 2004

I finally got my 8253 timer project finished. One minor hurdle that I didn't anticipate is that MESS's abstraction of the host computer's soundcard and its emulated CPU aren't synchronized. The soundcard is supposed to request samples from the emulated CPU at 44.1KHz, but according to the emulated CPU it seems to be doing it slightly faster than that (or the emulated CPU is running slightly slow by the soundcard's standard) so my buffer keeps underrunning.

The way to fix this is to tell the CPU to generate samples at a slightly higher rate (according to its clock) so that the soundcard and the CPU stay in sync. The trouble is, I don't know exactly how much faster the soundcard is than the CPU. Besides which, it seems to fluctuate and the ratio may even be completely different on a different machine.

The next thing I tried was dynamically adjusting the sample rate 60 times a second to the value which was calculated for the previous time-period. The problem with this was that the variations in the sample rate could sometimes be quite large, and the resulting "flutter" of the emulated computer's soundtrack sounded absolutely awful. Also, the buffer might now be slightly less full than optimal (leaving little room for further error) or slightly more full than optimal (causing the soundtrack to "lag" behind the graphics a bit). Clearly the most elegant way to solve this would be to smooth out these variations in such a way that the average sample rate over a long time would be correct.

Next, I set up a running total of the "drift" time, and adjusted the sample rate in proportion to this. Of course, this caused the sample rate to oscillate around the correct value - I needed some damping to remove "energy" from the system. A couple of pages of calculations later I had modelled the system with a second order differential equation and solved for the critical damping value. After some tweaking of the buffer length and time-constant values it now works wonderfully - no buffer underruns, no noticeable flutter and the graphics and sound are fairly well synchronized.
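
For the curious, here is a toy Python simulation of the general idea. It is not the code I ended up with in MESS, just a sketch under simple assumptions: the buffer error e obeys e'' + kp*e' + ki*e = 0, where ki multiplies the running total of the error and kp is the damping term, and critical damping corresponds to kp = 2*omega, ki = omega^2. The numbers (735 samples per update, a soundcard running 0.1% fast, omega = 0.05 per update) are made up for illustration.

```python
def simulate(true_ratio=1.001, omega=0.05, steps=2000):
    """Return the buffer fill error (in samples) after each 1/60s update."""
    kp = 2.0 * omega              # damping term (critical damping)
    ki = omega * omega            # gain on the running total of the error
    nominal = 735.0               # samples per update at 44.1KHz and 60 updates/s
    consumed = nominal * true_ratio   # what the soundcard really takes (unknown to us)
    error = 0.0                   # buffer fill relative to its target level
    total = 0.0                   # running total of the error ("drift")
    history = []
    for _ in range(steps):
        total += error
        produced = nominal - kp * error - ki * total
        error += produced - consumed
        history.append(error)
    return history

if __name__ == "__main__":
    h = simulate()
    print("worst error: %.2f samples, final error: %.2e samples" % (min(h), h[-1]))
```

With the damping term removed (kp = 0) the error in this model swings back and forth around zero indefinitely, which is the oscillation described above. With it, the buffer dips by a few samples and then settles back to its target without overshooting.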

The only problem now is the aliasing - the 8253 timer in the PC can generate frequencies of up to almost 600KHz, and because of the rather unscientific way I'm downsampling that high-frequency waveform down to 44,100Hz (roughly) all the harmonics too high for the soundcard to play are aliased down to frequencies between 0 and 22,050Hz (roughly) so when you attempt to play a square wave, you get a whole pile of harmonics you weren't expecting and it sounds rather less than perfect. Since the sampling rate drifts up and down, playing a constant high-pitched tone makes all sorts of swooping noises. I wonder if it would be feasible to use a proper finite impulse response filter to downsample while removing all the frequencies that are too high-pitched to play...
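
Something like the following is what I have in mind - a sketch only, in Python rather than anything that would go into MESS. It assumes an integer downsampling factor (the real ratio of about 1.193MHz to 44.1KHz is not an integer and drifts, so a real implementation would need fractional kernel positions), and it uses a Hamming-windowed sinc as the low-pass filter. The function names and the default kernel length are arbitrary choices.

```python
import math

def lowpass_kernel(cutoff, taps):
    """Hamming-windowed sinc; cutoff is a fraction of the input sample rate (< 0.5)."""
    mid = (taps - 1) / 2.0
    kernel = []
    for n in range(taps):
        x = n - mid
        if x == 0:
            sinc = 2.0 * cutoff
        else:
            sinc = math.sin(2.0 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (taps - 1))
        kernel.append(sinc * window)
    total = sum(kernel)
    return [v / total for v in kernel]       # normalise to unity gain at DC

def decimate(signal, factor, taps=None):
    """Low-pass filter at the output Nyquist frequency, keeping every factor-th result."""
    if taps is None:
        taps = 8 * factor + 1
    kernel = lowpass_kernel(0.5 / factor, taps)
    out = []
    for start in range(0, len(signal) - taps + 1, factor):
        out.append(sum(signal[start + j] * kernel[j] for j in range(taps)))
    return out

if __name__ == "__main__":
    square = [1.0 if (n // 50) % 2 == 0 else -1.0 for n in range(5000)]
    print(decimate(square, 27)[:8])
```

Only the outputs that are actually kept get computed, so the cost per output sample is one multiply-add per kernel tap regardless of how high the input rate is.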

For those rare few of you who are not interested in all this geek stuff, here are some rather spectacular mazes.

Geekism and gunkism

Tuesday, June 22nd, 2004

One of my ongoing projects is getting old computer games (particularly the old Windmill Software games) to run properly on modern hardware. One technique for doing this that I have recently been experimenting with is emulation - specifically, the Multiple Emulator Super System. This is almost there, as it features cycle-exact emulation of an 8088 processor and hardware-level emulation of a CGA card, thus getting the graphics and speed just right for these old games.

The one place MESS falls down is its emulation of the 8253 Programmable Interval Timer. This small but very important circuit exists almost unchanged in every PC ever made (as well as a number of other computers such as the Telenova Compis, the INDATA DAI, the Sharp MZ-700 and the Tesla PMD-85).

In modern operating systems, the 8253 is used by the task scheduler to interrupt whatever the machine is currently doing so that another program can get a chance at running, but these old games used it for advanced sound effects and "random" number generation. As such, they generally pushed it to the limits of its capabilities, so a very accurate emulation is needed to get these games to work properly.

The 8253 is well documented but unfortunately (though understandably) all the documentation available is written for people who want to program the thing or build a computer which uses it, rather than for people who want to duplicate its functionality in an emulator. So there are a number of obscure edge-cases like "what happens if you do x, but half way through you do y" which aren't documented. In the interests of accuracy, I want to get these edge cases to work just as they would on the real hardware.

I'm finding myself having to do some real science to figure out the internals of this chip - formulating hypotheses and devising and running experiments to prove or disprove these hypotheses, and gradually constructing a "grand unified theory of the 8253" (if you will). Sometimes it's pretty straightforward, but other times the observations must be made indirectly. For example, some of these experiments require timing as accurate as a millionth of a second, but this is a fraction of the time it takes to actually read data from or write data to the device in question. So the only way to do these experiments is to fire a whole load of data at the chip ("tuned" to maximize the probability of creating the desired event), record what comes back and then sort through the results (sometimes using statistical methods) to figure out if the edge case that I'm trying to reproduce actually happened, and if so what the result was. If the experiment is run enough times, hopefully the event I'm looking for will actually happen. It's quite a lot like doing high energy particle physics, except I'm searching for logic gates instead of Higgs bosons.


The bathroom sink was draining really slowly, so I decided to remove the U-bend and clean it out. You would not believe the evil things that lurked in there - a foul-smelling black, grey and green sludge held together with human hair. I would have taken a picture of it, but it was so unholy I doubt it would show up in photographs (like vampires in that respect). Hours and a long hot shower later, I still feel dirty.