The last 100 seconds of 8088 MPH sound very different to the rest of the demo. The end tune is actually a 4-channel Amiga MOD file (which you can download here) composed by coda. Playing back a MOD through PC speaker on such a slow machine has never been done before. Here is how we did it.
We knew early on that we wanted to push the limits of audio from the machine as well as video, but also that we wanted to use the PC speaker for output. The first Sound Blaster card didn't arrive on the scene until 1989. Also, Galaxy Player (glx212) can play a MOD on a 4.77MHz 8088 with a Sound Blaster, so doing that wouldn't have broken any records. Galaxy Player is a very impressive piece of code - I once broke into it with a debugger to figure out how it works. It can also play back through the PC speaker, but that requires a much faster computer.
Most software that uses the PC speaker plays square wave beeps. This can be done with minimal CPU involvement as channel 2 of the PC's 8253 Programmable Interval Timer (PIT) is connected (via an AND gate) to the speaker. Setting that channel to square wave mode and programming an appropriate frequency (as a divisor of 105MHz/88 ~= 1.193MHz) is about all you need to do for an unattended beep. Changing the beep frequency 60 times a second (via an interrupt) is how the music is played for most of 8088 MPH. By rapidly switching between several notes (arpeggiation) you can give the impression of playing multiple notes at once (though it's a poor substitute for true chords) and by varying the duty cycles of arpeggios you can get a very limited impression of volume attenuation.
Playing back pre-recorded sampled sound over the PC speaker was less common, but still a well understood technique. As well as a square wave mode, the PIT has a "one shot" mode where the output goes low while the counter counts down and then goes high waiting for further input. By loading new values into the PIT's count register regularly, Pulse Width Modulation can be achieved. I remember being stunned the first time I heard this being done on the family Amstrad PC1512 - a crystal-clear (for the time) 6 bits of dynamic range at a sample rate of 18.6kHz! The "regularly" bit is the problem, though. Most (if not all) PC speaker software using this technique used a hardware timer interrupt (IRQ0, PIT channel 0) to trigger the CPU regularly to reload the channel 2 count with the next sample. However, interrupts on a 4.77MHz 8088 are really slow, and when playing back samples at ~20kHz there is basically no time for anything else (like the mixing required for a MOD player). All you can do is play back pre-rendered sound, and 100 seconds of 16.6kHz audio would have been 1.6MB of data - way too much for our 360kB disk space budget even with pklite helping out, not to mention our 640kB of RAM.
However, since we were targeting a specific CPU (and a specific clock speed) for this demo, we had a way of timing things without the timer interrupt - counting cycles. Conceivably, the same techniques could be used in a more portable player - having several predefined routines for the most common slow CPUs (and an IRQ-driven version for faster CPUs), or calibrating the speed on startup.
When writing highly-optimized 8088 code, there are two techniques which are particularly important - loop unrolling and self-modifying code. There is a tension between these two techniques, though - the more unrolling you do the lower your looping overhead but the more code you have to self-modify. Galaxy player plays these two techniques off each other quite nicely - picking an unrolled loop size which minimizes the total time the routine takes. Unfortunately this technique isn't a good fit for CPU-timed PC speaker audio because the samples aren't generated sequentially - it first renders a frame of samples for channel 0, then adds in a frame of samples for channel 1 and so on. It might be possible to statically intersperse the PIT count "out" instructions into the mixing code but that's really difficult code to write (it really needs to be written by a sophisticated tool that has knowledge of the exact 8088 timings to minimize jitter). Perhaps for a future project...
After playing about with (and measuring the execution speed of) a whole lot of different routines, I settled on one that doesn't unroll the sample loop at all (but does completely unroll the channel loop). The mixing code itself looks like this:
add bp,9999 mov bx,bp mov bl,99 mov al,[bx]
add si,9999 mov bx,si mov bl,99 add al,[bx]
add di,9999 mov bx,di mov bl,99 add al,[bx]
add dx,9999 mov bx,dx mov bl,99 add al,[bx]
out 0x42,al |
add bp,9999 mov bx,bp mov bl,99 mov al,[bx]
add si,9999 mov bx,si mov bl,99 add al,[bx]
add di,9999 mov bx,di mov bl,99 add al,[bx]
add dx,9999 mov bx,dx mov bl,99 add al,[bx]
out 0x42,al
This code can't play back arbitrary MODs because it has a limitation on sample length - rather than being of arbitrary length, samples are all exactly 256 bytes long. The idea is that samples repeat with one oscillation (i.e. an up and a down) over those 256 bytes. With a 16.6kHz output sample rate that translates to an output frequency range of 0 to 8.3kHz with a frequency resolution of 0.25Hz. So it's capabilities are somewhat closer to the C64's SID chip than the Amiga's Paula (in fact we codenamed the routine SID during development - we had others codenamed VIC and Paula which may get an airing in the future).
Our "SID" also has no volume tables - the volumes have to be baked into the samples themselves, so there may need to be several copies of any given sample at different volume levels (one for each volume level at which the sample is played). If we have too many samples, volume can be quantized (the program that converts from the source .mod to the player's internal format does the volume baking and optimizes the number of quantization levels). If we can squeeze it into 288 CPU cycles (spoiler: we can!), that's 72 PIT cycles which divides nicely into 4 - each channel's samples go from 1 to 18, and the final sample goes from 4 to 72 (we have to take care not to program 0 into the PIT or it'll count down for 65536 cycles and we'll get no audio for about 55ms). It also works out well for DRAM refresh - using the default DRAM refresh period of 18 we'll get exactly four refresh bus cycles per sample, and they'll be in the same places in the execution of the code every sample, so won't cause any jitter.
The four registers bp, si, di and dx each hold the respective channel's current position within the waveform (as an 8.8 bit fixed point number). The "9999"s are the respective channels' frequencies. The higher the frequency, the further the position gets advanced each sample (direct digital synthesis - similar to that which I used in the Physical Tone Matrix). These frequency values are modified right in the code itself during runtime (self-modifying code). This reduces register pressure (which is important as there are not a lot of registers to spare!)
Similarly, the "99"s (also patched at runtime) are the waveform numbers for each channel. The 256x256 waveform table is turned "sideways" (the low byte of the address is the waveform number and the high byte is the position) in order to avoid having to shift the high 8 bits from the position register to the low 8 bits of the sample pointer.
Now, if we run this code in a loop we'll get a chord playing from the PC speaker. But only a single chord - we want to play a whole song, where the note frequencies and waveform numbers change potentially 50 times per second. So we need to add a way to patch the frequencies and waveform numbers in the code. The fastest way to do that is to use the stack as our stream of patch data:
This means we need to run with interrupts disabled, but we need to do that anyway - the delays introduced by the timer interrupt would cause massive audio quality degradation.
Now we have enough ingredients to play an actual tune, but not a long one! If we're pulling 4 bytes off the stack for every sample we play, we're going to run out of stack in under a second. We might as well just pull preprocessed samples from the stack if we're going to do that - it would last 4 times longer!
However, most of the time we don't need to patch anything - we just want to leave the sample playing until we next want to change something, and then we probably want to change everything at once after 20ms (331 samples). So we want some kind of loop that counts down and then we do the patching once it reaches zero:
loopTop:
add bp,9999 mov bx,bp mov bl,99 mov al,[bx]
add si,9999 mov bx,si mov bl,99 add al,[bx]
add di,9999 mov bx,di mov bl,99 add al,[bx]
add dx,9999 mov bx,dx mov bl,99 add al,[bx]
out 0x42,al
loop loopTop
pop bx
pop word[cs:bx]
mov cl,99
jmp loopTop |
loopTop:
add bp,9999 mov bx,bp mov bl,99 mov al,[bx]
add si,9999 mov bx,si mov bl,99 add al,[bx]
add di,9999 mov bx,di mov bl,99 add al,[bx]
add dx,9999 mov bx,dx mov bl,99 add al,[bx]
out 0x42,al
loop loopTop
pop bx
pop word[cs:bx]
mov cl,99
jmp loopTop
The "99" in the "mov cl,99" instruction is (you guessed it) another value that is patched at runtime.
This code takes exactly 288 cycles to run in the case where we're patching, but it's quite a bit shorter in the no-patch case. We need it to run at 288 cycles every iteration (patch or no patch) to keep the samples coming regularly. Fortunately, there's a nice place to stick some NOPs where they will be executed only in the non-patch case, making two loops that mutually overlap without either being nested within the other:
v:
times 15 nop
loopTop:
add bp,9999 mov bx,bp mov bl,99 mov al,[bx]
add si,9999 mov bx,si mov bl,99 add al,[bx]
add di,9999 mov bx,di mov bl,99 add al,[bx]
add dx,9999 mov bx,dx mov bl,99 add al,[bx]
out 0x42,al
loop v
pop bx
pop word[cs:bx]
mov cl,99
jmp loopTop |
v:
times 15 nop
loopTop:
add bp,9999 mov bx,bp mov bl,99 mov al,[bx]
add si,9999 mov bx,si mov bl,99 add al,[bx]
add di,9999 mov bx,di mov bl,99 add al,[bx]
add dx,9999 mov bx,dx mov bl,99 add al,[bx]
out 0x42,al
loop v
pop bx
pop word[cs:bx]
mov cl,99
jmp loopTop
That's it - that's the entire inner loop as it is when the CPU executes its first instruction. All the remaining magic is in the data that's pointed to by the stack pointer. Let's think about how fast we're burning through that data now. 50 times a second we need to patch (worst case) 10 locations (4 frequencies, 4 sample numbers and the loop counter twice). That's 40 bytes, 50 times per second or 2000 bytes per second. That means we get through our 64kB of stack data in less than 33 seconds. That's better than 1 second, but still too short for our song by a factor of 3. We could use a larger stack, but then we'd need to update our stack segment somehow every 33 seconds at least. There's no code to do that, though, and nowhere to put such code that wouldn't execute far too often.
Or is there? Take another look at those 15 NOPs at the start of the routine. If you squint a bit, don't they sort of look like a blank canvas just waiting to be painted with some amazing work of art? (No? Maybe it's just me then). Yes, if we keep the "mov cl,99" line patched to be "mov cl,1" we can patch as many times as we like without the code between v and loopTop being executed at all, which means that we can patch some code into there and then switch the CL value to 2 for a sample in order to execute it. We have to make sure that these little "patched routines" (I call them v-instructions, hence the label) take exactly the same time as the 15 NOPs. This turns out to be possible for all the v-instructions we need for 8088 MPH. However, some of them are *smaller* than the 15 NOPs (in particular those which use some of their bus cycles to access memory or IO ports instead of fetch instruction bytes). This means we need to jump to a different place in the code to start them - that's easy enough, though, we can just patch the destination byte of the "loop v" instruction the same way we're patching everything else (almost half the bytes in this little routine get patched at some point!)
The next thing to notice is that the set of v-instructions that we can execute form a small (albeit verbose) bytecode interpreted language - we can do whatever we like in there provided we meet the space and time requirements. Our little mod player has become Turing complete! In particular, we can modify the stack pointer in order to do loops. That means instead of being hundreds of kilobytes, our v-instruction program can be relatively small (the one in 8088 MPH is just 652 bytes long). It's tricky to write, because it needs to have inside it the locations of the various points within the program that we need to patch, which might move around as we debug things. So rather than writing them directly I wrote some assembler macros to generate them for me. Oh, and because a v-instruction is made up of several CPU instructions, I ended up writing a sort-of mini assembler in the assembler's macro language! Here is one of the v-instructions from the CRTC update v-instruction routine in 8088 MPH:
forget 7
startAt 4
MOV_BX_DX
MOV_DX_iw
w 0x3d4
MOV_AX_iw
w 0x990d
OUT_DX_AX
MOV_DX_BX
runV 1 |
forget 7
startAt 4
MOV_BX_DX
MOV_DX_iw
w 0x3d4
MOV_AX_iw
w 0x990d
OUT_DX_AX
MOV_DX_BX
runV 1
This translates to the 8088 instructions:
mov bx,dx
mov dx,0x03d4
mov ax,0x990d
out dx,ax
mov dx,bx |
mov bx,dx
mov dx,0x03d4
mov ax,0x990d
out dx,ax
mov dx,bx
The "0x99" (AH value) is, you guessed it, patched to the desired CRTC start address byte (yes, I used self-modifying code in the v-instructions as well as in the 8088 instructions).
The main v-instruction routine runs 50 times per second and updates the frequency and waveform data in the actual mixing routine. It also fetches more song data from the pointer ES:0 (and increments ES). This means that we can burn through just 800 bytes of data per second for our song instead of 2000, dramatically reducing the memory usage. Only 12 of the 16 bytes in the paragraph are used for the actual musical data, the other 4 bytes hold the address of another subroutine to set the stack pointer to, and an argument for that routine. These "h-instruction" subroutines do things like printing a character on the screen, changing the cursor position for the print routine, changing the CRTC start address (for hardware scrolling) and finishing the routine (by patching the final "jmp looptop" instruction. So there's 3 layers of code here: the actual 8088 mixer code, the v-instruction code in the stack and the h-instruction code interspersed into the song data.
Somewhat surprisingly, there's still plenty of v-instruction time left - even the longest of these takes only 119 samples of the 331 available. So the routine actually ends up using only about 87% of the available time (less when just scrolling slowly without printing any extra characters). Getting the routine to do more is tricky, though - using more than 4 bytes per 20ms frame for the non-song stuff would probably involve moving through the song data at a non-integer number of segments per frame. Also, the fact that the only persistent registers available to v-instructions are ES and AH (though AL and BX can be stomped) makes writing new v-instructions tricky. More can certainly be done, though.
You might notice that the "mov cl,99" instruction directly follows the instruction that patches it ("pop word[cs:bx]"). This means that the 8088 will execute the unpatched version of the instruction (as it's already in the prefetch queue by the time it's patched). The v-instruction program generation macros take this into account. In theory this instruction could be anywhere between the "loop v" instruction and the "jmp loopTop" instruction, but in practice if I put it anywhere except where it is, the routine ends up taking more than 288 CPU cycles. For debugging in DOSBox (which executes the patched version) I have a debug switch which moves the instruction before the pop (the timing is wrong on DOSBox anyway, but I can at least test functional changes that way). Debugging is still tricky, though - breakpoints don't work so well when your entire program is executed from the same 15 byte memory region!
Some pre-processing of the source .mod file is performed on a modern computer before transferring it to the old PC - resampling the looping samples to 256 bytes and changing the note and effect data into frequencies and waveform numbers in the 800 bytes per second format. The .mod interpreter is based on a older version of PT2PLAY. The source is here and there is a compiled binary for Win32 here. Only mode 1 works at present. All Protracker effects should work with the exception of 9xx (set sample offset), E0x (hardware filter), E9x (retrigger) and CIA timer modes. The latter could be accomplished to some extent by adjusting the number of samples per tick.
As well as generating the data for the routine, mod_convert generates a .wav so that musicians can hear how their work sounds on the PC without actually having to have one. To do this, the program generates a 1-bit waveform at 1.193MHz and then resamples it to 44.1kHz, so even the high frequency carrier wave is reproduced. It's essentially a little emulator of the PC speaker circuit.
One nice feature of this routine that I didn't find a way to use in "8088 MPH" is ring modulation. If the second "mov bl,99" instruction is patched to be "mov bl,al" then the output from the first channel is used as the second channel's waveform number. If the waveform numbers in slots 1..18 are all the same basic waveform multiplied by a suitable set of amplitudes, then the output of the second channel is ring modulated by the first!
One other nice little finishing touch with the credits part is the way that it exits to DOS at the end, with the "A:\>_" prompt right after. This is a fully functioning DOS prompt, not a mockup. The idea is that the "What can you do?" line is immediately followed by the prompt to actually do it! In order to get this to work, the screen had to be in the "unscrolled" location at when the demo exits (DOS and the BIOS scroll the screen by shifting video RAM data, not by changing the CRTC start address). That's easy enough then, just subtract the number of character positions that we scroll from 8192 and program that value into the start address initially. We also need to move our initial write pointer and VileR's awesome background 40-column ANSI art so that they start in the right place. I accomplished this without needing to add code to handle wrapping, by taking advantage of the fact that the CGA ignores address bit 14, causing physical addresses 0xB8000-0xBBFFF to be mirrored in the address range 0xBC000-0xBFFFF. This turns out to be another emulator-breaking change, though (at least in DOSBox)!
The source for the player itself can be found on my github.