Archive for the ‘computer’ Category

Sometimes, doing things incrementally hurts more than it helps

Monday, October 12th, 2009

Usually the best way to make a major change to a piece of code is to try to break it down into small changes and to keep the code working the same after each such small change. The idea being that if you make too many changes and break the code too badly, you might never get it working again. Without working code it can be difficult to figure out what the next step should be.

But sometimes, incremental changes just don't work. In particular, if you're making major architectural changes, trying to construct something that is 90% original architecture and 10% new architecture involves a great deal of extra work just to make the incompatible pieces fit together. In these cases, sometimes the only thing you can do is take the whole thing to pieces and put it back together again the way you want it.

Scaling/scanlines algorithm for monitor emulation

Monday, October 12th, 2009

For my TV emulation, I wanted to render scanlines nicely and at any resolution. xanalogtv does vertical rescaling by duplicating rows of pixels, which unfortunately makes some scanlines appear wider than others. Blargg's NTSC filters don't do any vertical rescaling at all.

The first thing I tried was a sinc interpolation filter with the kernel scaled such that the scanline only covered 70% of the pixels vertically (essentially modelling the scanlines as long thin rectangles). This worked great except that it was far too slow because of the sinc function's infinite extent (I was doing a multiplication for each combination of horizontal position, vertical position and scanline). So I windowed the kernel with a Lanczos window. I got annoying aliasing effects using fewer than 3 lobes. With 3 lobes it was still too slow because each pixel was a weighted sum of 3-4 separate scanlines. Also, because of the negative lobes I needed extra headroom, which meant I either had to reduce my colour resolution or use more than 8 bits per sample (which would also be slow).

The next thing I tried was a Gaussian kernel. This has several nice features:

  1. The Fourier Transform of a Gaussian is also a Gaussian, and a Gaussian is a better approximation of a scanline than a rectangle (the focussing of the electron beam isn't perfect, so to a first approximation the distribution of electrons around the beam center is normal).
  2. It dies off much more quickly than the sinc function.

The Gaussian kernel also gave a good image, so I kept it.

The next thing I wanted to do was improve the speed. I still had several scanlines contributing to every pixel. However, that doesn't make much physical sense - the scanlines don't really overlap (in fact there is a small gap between them) so I figured I should be able to get away with only using the highest coefficient that applies to each pixel. I tried this and it worked beautifully - no difference in the image at large sizes and it sped the program up by a factor of several. The downside was at small sizes - the image was too dark. This is because the filter was set up so that each pixel would be the average of several scanlines, but if only one scanline is contributing then the brightness is 1/several. To fix this I just divided all the coefficients by the largest. There's no mathematical justification for this, but it looks fine (apart from the fact that some of the scanlines don't contribute to the picture at all).
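
To make that concrete, here's a rough sketch of the coefficient table - for each output row, take the Gaussian coefficient of the nearest scanline, then divide everything by the largest coefficient found. The function names and the sigma value are illustrative, not the actual crtsim code:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct ScanlineWeight { int scanline; double weight; };

    // For each output pixel row, find the nearest scanline and its Gaussian
    // coefficient, then normalize by the largest coefficient so that small
    // output sizes don't come out too dark.
    std::vector<ScanlineWeight> scanlineWeights(int outputHeight, int scanlines,
        double sigma /* beam width in scanline units */)
    {
        std::vector<ScanlineWeight> table(outputHeight);
        double largest = 0;
        for (int y = 0; y < outputHeight; ++y) {
            double pos = (y + 0.5)*scanlines/outputHeight;  // position in scanline units
            int nearest = static_cast<int>(pos);            // nearest scanline center
            if (nearest > scanlines - 1)
                nearest = scanlines - 1;
            double d = pos - (nearest + 0.5);               // distance from that center
            table[y].scanline = nearest;
            table[y].weight = std::exp(-d*d/(2*sigma*sigma));
            if (table[y].weight > largest)
                largest = table[y].weight;
        }
        for (int y = 0; y < outputHeight; ++y)
            table[y].weight /= largest;
        return table;
    }

    int main()
    {
        std::vector<ScanlineWeight> table = scanlineWeights(16, 4, 0.35);
        for (int y = 0; y < 16; ++y)
            std::printf("pixel %2d: scanline %d, weight %.3f\n",
                y, table[y].scanline, table[y].weight);
        return 0;
    }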

If each pixel is only in one scanline, lots more optimizations are possible - for example, one can generate the image progressively, a scanline at a time, which helps keep data in the caches.

Finally, I still needed it to be faster so I moved all the rescaling (vertical and horizontal) to the GPU. I came up with a devilishly clever hack to implement the same scanline algorithm on the GPU. No shader is needed - it can be done just using textures and alpha blending. There are two passes - the first draws the actual video data. The second alpha-blends a dark texture over the top for the scanlines. This texture is 1 texel wide and as many texels high as there are pixels vertically.

One other complication is that I wanted the video data texture to be linearly interpolated horizontally and nearest-neighbour interpolated vertically. This was done by drawing this texture on a geometry consisting of a number of horizontal stripes, each of which has the same v-texture-coordinate at its top as at its bottom.
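
Here's a rough CPU-side sketch of those two ingredients - the overlay texture's alpha values and the stripe geometry. The names and the use of a Gaussian for the overlay alpha are my own illustrative choices, not code from crtsim:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Alpha values for the 1-texel-wide dark overlay: one texel per vertical
    // output pixel, most opaque between scanline centers.
    std::vector<float> overlayAlpha(int pixelsHigh, int scanlines, double sigma)
    {
        std::vector<float> alpha(pixelsHigh);
        for (int y = 0; y < pixelsHigh; ++y) {
            double pos = (y + 0.5)*scanlines/pixelsHigh;
            double d = pos - (std::floor(pos) + 0.5);
            alpha[y] = 1.0f - static_cast<float>(std::exp(-d*d/(2*sigma*sigma)));
        }
        return alpha;
    }

    // One horizontal stripe per scanline for the video texture.  The v texture
    // coordinate is the same at the top and bottom of each stripe, so the GPU's
    // bilinear filter interpolates horizontally but not vertically.
    struct Stripe { float top, bottom, v; };

    std::vector<Stripe> videoStripes(int scanlines)
    {
        std::vector<Stripe> stripes(scanlines);
        for (int i = 0; i < scanlines; ++i) {
            stripes[i].top = static_cast<float>(i)/scanlines;
            stripes[i].bottom = static_cast<float>(i + 1)/scanlines;
            stripes[i].v = (i + 0.5f)/scanlines;   // center of scanline i in the texture
        }
        return stripes;
    }

    int main()
    {
        std::vector<float> alpha = overlayAlpha(8, 2, 0.35);
        for (int y = 0; y < 8; ++y)
            std::printf("overlay alpha %.3f\n", alpha[y]);
        return 0;
    }

At draw time the video texture goes onto the stripes first, then the overlay is alpha-blended on top of the whole image.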

Sometimes, figuring out the right architecture is half the battle

Sunday, October 11th, 2009

I went through quite a few design revisions to get to the pipeline architecture I described yesterday. Some ideas I tried out and then abandoned:

  • Giving the filters the responsibility of keeping the data they needed.
  • Filters telling other filters how many samples they should consume or produce.
  • A FilterGraph object which held all the buffers and which had methods to make and break connections.
  • All the reader and writer methods being on the buffers.
  • Lookahead reader methods for filters.
  • Filters encapsulating other filters that they communicate with.
  • Having separate consume() and produce() methods.
  • Having the Reader and Writer functionality as part of the Consumer and Producer classes (this introduced a surprisingly significant overhead)
  • A Connection object to encapsulate the buffer.

It's rather difficult to tell what's going to work well and what isn't until you actually write some code. And then it takes some trial and error work to hit upon the right pattern. Assumptions must be called into question. Prejudices must be discarded. Darlings must be killed. But you know when you've got it right because the rest then practically writes itself.

Pipeline architecture

Saturday, October 10th, 2009

My software TV/monitor emulator is best thought of as a bunch of filters for transforming a signal in certain ways, connected together in a pipeline:

  • Decoding composite signals to YIQ
  • Transforming YIQ signals to RGB
  • Horizontal rescaling
  • Ghosting due to signal reflection in the cable
  • Adding noise

Because that's the best way to think about it, that's how I'd like to implement it. Then it will be easier to remove/replace filters that I don't need or that I want to implement in a different way. The filters.h file in the crtsim source implements this architecture.
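
To give a flavour of the idea (this is only a sketch, not the actual interface in filters.h), a pipeline of replaceable filters might look something like this:

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // Each filter consumes a block of samples and produces a block of samples.
    class Filter
    {
    public:
        virtual ~Filter() { }
        virtual void process(const std::vector<float>& in,
            std::vector<float>& out) = 0;
    };

    // Example filter: add noise to the composite signal.
    class NoiseFilter : public Filter
    {
    public:
        NoiseFilter(float amplitude) : _amplitude(amplitude) { }
        void process(const std::vector<float>& in, std::vector<float>& out)
        {
            out.resize(in.size());
            for (size_t i = 0; i < in.size(); ++i)
                out[i] = in[i] +
                    _amplitude*(std::rand()/static_cast<float>(RAND_MAX) - 0.5f);
        }
    private:
        float _amplitude;
    };

    int main()
    {
        std::vector<Filter*> pipeline;
        pipeline.push_back(new NoiseFilter(0.05f));
        // pipeline.push_back(new GhostingFilter(...)); and so on.

        std::vector<float> a(910, 0.3f);   // one scanline of composite samples
        std::vector<float> b;
        for (size_t i = 0; i < pipeline.size(); ++i) {
            pipeline[i]->process(a, b);
            a.swap(b);
        }
        for (size_t i = 0; i < pipeline.size(); ++i)
            delete pipeline[i];
        return 0;
    }

Because each stage only sees blocks of samples going in and coming out, swapping a filter for a different implementation (or removing it entirely) doesn't disturb the rest of the chain.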

When you have a pipeline, there are two ways to drive it. One is "consumer pulls" and the other is "producer pushes". In this case, the consumer is the code that actually renders the window to the screen. In "consumer pulls" mode, this code will fire probably 60 times per second (potentially depending on the refresh rate of the monitor on the host machine) and each time it does, it will ask the filter which supplies its data for enough data to render one frame (or field, if we're doing interlaced signals). This filter will then in turn ask the next one along the chain for data and so on up the chain until we get to the code that actually generates the composite signal.

In "producer pushes" mode, the producer generates data at a constant rate (possibly fixed by some other source of time in the system such as the audio device outputting at the correct rate). This data is then pushed from each filter to the next in the chain until it gets to the consumer. When the consumer has collected enough data to render a frame, a frame is rendered.

So for the purposes of emulating a TV or monitor as part of a microcomputer system emulator, "consumer pulls" and "producer pushes" modes can be thought of as "video rate driven" and "audio rate driven" modes respectively. Most emulators are hard-coded to do one or the other. But which one is best is determined by the user's hardware and what they're doing with the emulated system (video driven mode will generally give smoother graphics while audio driven mode will generally give more stable audio). So ideally we'd like to make the choice user-selectable.

A third possibility is for the producer code to decide when to draw a frame and to call for a window redraw, which causes a data pull through the filter chain. However, I've discounted this idea because that is an incorrect placement of responsibility. The producer doesn't (and shouldn't) know about the state of the monitor. Even if it has just produced a vsync pulse it doesn't necessarily mean it's time for a new frame (if the monitor is "rolling" as it will do momentarily when the signal timebase changes, it won't be).

There is another factor in pipeline design which is how the data stream corresponds to function calls. The simplest way would be to have each sink call the corresponding source each time it needs a sample (in pull mode) or each source call its corresponding sink each time a sample is available (in push mode). However, there are potentially quite a few filters and (because they are all replaceable at run time) each call from one filter to the next will be a virtual function call. That means that the compiler can't inline the code and the CPU's pipelining will get screwed up. According to this one can expect a virtual function call overhead of maybe 13 nanoseconds compared to inlined code (a crude test, but sufficient for order-of-magnitude calculations). Since most of our samples will be at 14MHz (4 times the NTSC color carrier frequency) that's only about 5 virtual function calls per sample before we've used up all our CPU.

So each function call really needs to transfer a pile of samples, not just one, and we will need to have circular buffers in between the filters to keep the data. How many samples should we transfer at once? A good rule of thumb for figuring that out is that one's hottest code and data should fit in L1 cache (which is maybe 64KB on modern CPUs). Divide that up by the number of steps in the pipeline and we're looking at low-single-digit numbers of KB to be passed at once. A scanline's worth of data (910 samples, give or take) is probably about right. That reduces the virtual function call overhead by three orders of magnitude which puts it well into the negligible range. Conceivably one could try benchmarking with lots of different "samples per call" values and then pick the one with the best overall performance (taking into account both call overhead and cache misses). I'll probably do this at some point.
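
Something along these lines would do for the buffers between filters - a sketch only, with illustrative names and sizes rather than the crtsim code:

    #include <cstddef>
    #include <vector>

    // A circular buffer sitting between two filters.  At ~14MHz a sample lasts
    // about 71ns, so at ~13ns per virtual call only a handful of calls per
    // sample are affordable; passing a scanline (~910 samples) per call makes
    // the call overhead negligible.
    class SampleBuffer
    {
    public:
        SampleBuffer(size_t size) : _data(size), _read(0), _write(0), _count(0) { }

        // Returns the number of samples actually written (buffer may be full).
        size_t write(const float* samples, size_t n)
        {
            size_t written = 0;
            while (written < n && _count < _data.size()) {
                _data[_write] = samples[written++];
                _write = (_write + 1)%_data.size();
                ++_count;
            }
            return written;
        }
        // Returns the number of samples actually read (buffer may be empty).
        size_t read(float* samples, size_t n)
        {
            size_t got = 0;
            while (got < n && _count > 0) {
                samples[got++] = _data[_read];
                _read = (_read + 1)%_data.size();
                --_count;
            }
            return got;
        }
    private:
        std::vector<float> _data;
        size_t _read, _write, _count;
    };

    int main()
    {
        SampleBuffer buffer(4096);
        std::vector<float> line(910, 0.0f);    // transfer one scanline per call
        buffer.write(&line[0], line.size());
        buffer.read(&line[0], line.size());
        return 0;
    }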

One disadvantage of the pipeline architecture is that it introduces some variable amount of latency - not enough to normally be visible to end users, but this does complicate one thing that I want to emulate - light pens. A light pen is just a fast light sensor that can be placed anywhere on the screen. When the electron beam passes underneath it, it sends a signal to the computer. The computer knows where the beam is supposed to be at any given moment, so it can figure out where the light pen is. However, for an emulator to have proper light pen support, it needs to have very low latency between the screen and the machine emulation. For this reason, I might abandon the pipeline architecture and just hard-code all the signal munging effects I care about in the CRT simulator itself, processing a line at a time and stopping when the horizontal sync pulse is found. Then, if the light pen is anywhere on the next line the CRT will be able to tell the machine emulation exactly when the light pen is going to be triggered.

NTSC hacking

Friday, October 9th, 2009

Recently I've been playing about doing NTSC decoding in software, trying to build a software TV/monitor for emulation purposes. I originally wanted to do the decoding of sampled composite signals to RGB and the horizontal scaling in a single step (precomputing a finite impulse response filter which does it all). However, I have come to realize that while this would yield the fastest code, it's not sufficiently flexible for what I want to do.

Specifically, in the signals I want to decode, the horizontal sync pulses can happen at any point (within a certain range) which means that the relationship between samples and horizontal pixel positions is not fixed in advance. This means that it's better to do the decoding (at least composite to YIQ if not all the way to RGB) at a fixed frequency and then rescale the result to pixels in real time (possibly using linear or cubic rescaling).
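
A minimal sketch of such a linear rescale (illustrative only - the real thing would resample Y, I and Q together, and assumes at least two samples and two output pixels):

    #include <cstdio>
    #include <vector>

    // Linearly rescale one line of decoded samples to an arbitrary number of
    // output pixels.
    std::vector<float> rescaleLine(const std::vector<float>& samples, int pixels)
    {
        std::vector<float> out(pixels);
        for (int x = 0; x < pixels; ++x) {
            double pos = static_cast<double>(x)*(samples.size() - 1)/(pixels - 1);
            int i = static_cast<int>(pos);
            if (i > static_cast<int>(samples.size()) - 2)
                i = static_cast<int>(samples.size()) - 2;
            double frac = pos - i;
            out[x] = static_cast<float>(samples[i]*(1 - frac) + samples[i + 1]*frac);
        }
        return out;
    }

    int main()
    {
        std::vector<float> line;
        for (int i = 0; i < 8; ++i)
            line.push_back(static_cast<float>(i));
        std::vector<float> pixels = rescaleLine(line, 11);
        for (int x = 0; x < 11; ++x)
            std::printf("%.2f ", pixels[x]);
        std::printf("\n");
        return 0;
    }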

Having determined this, I looked to see what other NTSC software implementations do. Blargg's NES filter rescales at a ratio of 3:7 at the same time as it decodes, then it's up to the calling code to rescale this to the right width. xanalogtv converts composite to YIQ at 4 samples per color carrier cycle, uses linear rescaling on the YIQ samples and then converts the result to RGB. The resulting pixels may be doubled or tripled to get to the right width. This also allows for nice effects such as "blooming" (widening brighter lines).

My current simulator is here and the source is here (Windows only for now - sorry). This uses similar techniques to xanalogtv, but the rescaling is done by the GPU, in RGB space. The scanline effects are a bit more accurate (all the scanlines appear to be the same width, no matter what size the window is), and a phosphor mask is displayed. Most reasonably modern machines should be able to display the images at full speed (60Hz). If your machine is too slow or your monitor doesn't run at 60Hz there may be some odd effects (most LCD panels run at 60Hz). I believe this is the only software CRT simulator that correctly renders both interlaced and non-interlaced signals, and has physically correct phase-locked-loop line frequency behavior. If I can figure out how to add a pixel shader for light bloom, I should be able to get images as good as these (except with arbitrary scaling in real time).

One other rough edge in this version is that the horizontal sync pulse is currently only found to the nearest sample. This means that the phase locked loop isn't very tunable, and will cause problems for PAL signals (where the horizontal sync position is at a different subsample offset on every line). That should be quite easy to fix, though.
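
The fix would look something like this - find the sync edge to sub-sample precision by interpolating the threshold crossing between two adjacent samples (the threshold and names here are illustrative):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Returns the fractional sample position of the falling sync edge, or -1
    // if no edge is found in this block.
    double findSyncEdge(const std::vector<float>& samples, float threshold)
    {
        for (size_t i = 1; i < samples.size(); ++i)
            if (samples[i - 1] >= threshold && samples[i] < threshold) {
                double frac = (samples[i - 1] - threshold)/(samples[i - 1] - samples[i]);
                return (i - 1) + frac;
            }
        return -1;
    }

    int main()
    {
        float data[] = { 0.3f, 0.3f, 0.2f, -0.1f, -0.3f, -0.3f };
        std::vector<float> samples(data, data + 6);
        std::printf("sync edge at sample %.3f\n", findSyncEdge(samples, 0.0f));
        return 0;
    }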

This simulator is going to form the basis for my demo machine emulator. The emulator itself is trivial - in fact I have already written it. But I haven't tried it out yet because I have no program to run on it. First I have to write an assembler for it. I might tweak the instruction set a bit in doing so, so I don't want to release the emulator just yet. Watch this space!

Memory handling in high-level and low-level languages

Wednesday, October 7th, 2009

The usual distinction between high-level and low-level languages is that high-level languages provide more abstractions. Assembly language is clearly a low-level language, and provides very few abstractions over the raw binary format consumed by the processor. Javascript is very high-level because it abstracts away almost all the details about the particular machine and operating system that it runs on, making it much easier to write actual end-user applications in (but impossible to write, for example, device drivers in).

One particular place where the distinction is particularly obvious is how different languages handle memory. There seem to be three basic strategies for handling memory:

  1. Implicit or explicit memory management (assembly language, C, C++) - calls to malloc() and free() (or equivalent) are inserted into the code by the programmer or by the compiler.
  2. Garbage collection with explicit state (C#, Java, Python, millions more) - no malloc() or free() are possible - all memory is "owned" by the garbage collector. Memory management is abstracted away.
  3. Referential transparency (Haskell) - there is no state, the concept of memory itself is abstracted away.

The boundaries can blur a little - you can plug a garbage collector into C++ for example, and high-level languages often provide "unsafe" functionality that allows one to escape to lower-level features.

Larrabee: Devolving the CPU vindicated

Monday, October 5th, 2009

I was interested to read that the cores of Intel's new Larrabee graphics architecture are based on the old P54C design, originally used in the second generation of Pentium chips (dating back to before even MMX made its first appearance). The core has since been updated with the latest instructions (including x86-64) but is missing performance features such as out-of-order execution. This vindicates what I wrote in devolving the CPU - that individual cores will get simpler and slower again as they get more numerous.

Setting constraints when programming

Sunday, October 4th, 2009

When writing a program, I generally don't just write it in any old way that might work, I set myself some constraints. Generally these constraints are particular programming principles that I have found make my life easier in the long run, such as choosing appropriate names. Sometimes I'll try to maximize speed or minimize memory usage. For Argulator I tried to minimize the number of SQL queries I made to satisfy any given request (which led to some convolutions to do appropriate data caching on the PHP side). I also tried to make the page sent by the server as close as possible to the page that actually gets rendered (i.e. make Javascript not needed to display the page, just to interact with it). I mostly succeeded, at the expense of having some code duplication between the server side and the client side (though the argument creation UI is partially created by Javascript).

Sometimes I'll try to eschew certain operations, such as eliminating the square root operation in the algorithm to determine if a point is in the Mandelbrot cardioid. Often I'll try to eliminate subtraction and division and write algorithms in terms of addition and multiplication instead. This might seem to be a strange thing to do but the resulting algorithms are often more concise and it's easier to tell that they're correct. This can also eliminate division by zero errors.
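
For illustration, a well-known square-root-free form of the main-cardioid test looks like this (a sketch of the idea, not necessarily the exact code I used):

    #include <cstdio>

    // With q = (x - 1/4)^2 + y^2, the point (x, y) is inside the Mandelbrot
    // set's main cardioid when q*(q + (x - 1/4)) <= y^2/4 - no square root
    // (and no division) required.
    bool inMainCardioid(double x, double y)
    {
        double dx = x - 0.25;
        double q = dx*dx + y*y;
        return q*(q + dx) <= 0.25*y*y;
    }

    int main()
    {
        std::printf("%d %d\n", inMainCardioid(0.0, 0.0), inMainCardioid(-1.5, 0.0));
        return 0;
    }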

Tet4 is a game of luck

Saturday, October 3rd, 2009

I used to think it might be possible to come up with some kind of general algorithm for playing Tet4 indefinitely, but I now suspect that no matter how tall the pit is and how many pieces you can look ahead, there are always sequences of pieces that will guarantee a loss - in particular, I don't think there is any way to place the sequence SSZ without adding a row every 6 pieces. This means that ultimately, Tet4 is a game of luck. However, most sequences that occur in practice do seem to be solvable - many years ago I wrote a program to play Tet4 automatically, and it seems to be able to play indefinitely. So in practice speed and accuracy are more important than luck for winning the game.

Side note: On the most recent occasion that I tried to figure this out, I attempted to resurrect my old Tet4-auto-player program (not trivial, since it's a graphical DOS program and hence doesn't work on Vista) and modify it to see if it could solve the SSZ repeated sequence. Once I'd got it running, I was amused to see that it was already set up to solve the SSZ repeated sequence - I'd done the exact same thing before and forgotten all about it!

Talking of Tet4, it's had a couple of updates since I last mentioned it here. The original version has had a bug fix so that playback of games in which the curtain reaches the bottom works properly. Also, a second version has been added, which is exactly the same but with no undo (and which is therefore less forgiving). This may be one of very few occasions on which the second version of a piece of software is identical to the first version apart from the removal of a significant feature.

When I first wrote the Javascript version of Tet4, I took the opportunity to change the keys from the DOS version a bit, to make them more logical and faster (so that if there were 5 or fewer possible positions for a piece, either hand could be used, for example). This meant that I kept making mistakes because (although I hadn't played Tet4 for years) my muscle memory was still trained with the old keys. Hence I added the undo feature. But in making the game more forgiving, I changed its nature significantly.

There are two ways I could have added undo to Tet4. One way is to undo the random number generator one step when undoing. The other is not to. Undoing the random number generator essentially gives you unlimited lookahead capability. To look ahead n pieces, drop those n pieces (anywhere) and then undo n times. I didn't want to grant that power, so I wrote it in such a way that dropping n pieces and undoing them changes the random number generator state. However, this gives an even stronger ability - now you can essentially choose which piece comes next by dropping and undoing until you get the piece you want. I think this makes the game too easy (at least until it gets really fast), hence the new version.
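
The difference between the two policies can be illustrated with a sketch like this (not the actual Tet4 code - the generator and names are made up for the example):

    #include <cstdio>
    #include <vector>

    struct PieceGenerator
    {
        unsigned int state;
        PieceGenerator() : state(12345) { }
        int next() { state = state*1103515245u + 12345u; return (state >> 16)%7; }
    };

    struct Game
    {
        PieceGenerator generator;
        std::vector<unsigned int> history;   // generator state before each drop

        int drop()
        {
            history.push_back(generator.state);
            return generator.next();
        }
        // Policy A: undo rewinds the generator, so the same piece comes back -
        // dropping n pieces and undoing them amounts to n pieces of lookahead.
        void undoRewind() { generator.state = history.back(); history.pop_back(); }
        // Policy B: undo leaves the generator alone, so the next drop is a
        // fresh piece - the player can fish for the piece they want.
        void undoKeep() { history.pop_back(); }
    };

    int main()
    {
        Game game;
        int first = game.drop();
        game.undoRewind();
        std::printf("rewind: %d then %d\n", first, game.drop());  // same piece twice
        return 0;
    }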

Text editor as shell

Friday, October 2nd, 2009

The text editor interface I talked about yesterday is so good I find myself wanting to use it for things other than text editing. In particular, I'd love to be able to use it as my command line interface. Most command line shells are annoying because you can't use the keyboard to move the cursor up into a program's output to copy it - you generally have to resort to the mouse. I think it would be far better to treat the entire command-line session as a document. One way to do this would be to have a text editor with a special key combination (say Ctrl+Enter) which takes the contents of the current line, executes it as a command, and pastes the resulting output into the document immediately below, followed by a prompt, followed by the cursor.
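
The command-running half of that idea is the easy part; a minimal sketch might look like this (popen() is POSIX - on Windows it would be _popen() - and the editor integration itself is left out):

    #include <stdio.h>
    #include <string>

    // Run one line as a shell command and capture its output so the editor
    // could paste it into the document below the line.
    std::string runCommand(const std::string& line)
    {
        std::string output;
        FILE* pipe = popen(line.c_str(), "r");
        if (pipe == 0)
            return output;
        char buffer[256];
        while (fgets(buffer, sizeof(buffer), pipe) != 0)
            output += buffer;
        pclose(pipe);
        return output;
    }

    int main()
    {
        printf("%s", runCommand("echo hello").c_str());
        return 0;
    }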

One useful aspect of command line interfaces is tab completion - it would be nice if there were a way to make this continue to work in the editor interface. Perhaps tab could be interpreted to mean "tab completion" if the line above the cursor was a prompt (maybe prompt lines are "special" in some way, though that counteracts the principle that there should be no "hidden" information). If prompt lines are special, then maybe pressing Enter on the line below would suffice for running commands, instead of a special key combination for this.

A variation on this idea would be to allow editing of multiple files in a single buffer. Suppose you're using the editor interface and "cat" (Unix) or "type" (Windows) the contents of a file, causing it to be inserted into the buffer. I think it would be tremendously useful if that file was then "live" and could be edited by just moving the cursor up into its contents, changing things, and then hitting some other key (perhaps Alt+S?) to save it. Again this goes against the "no hidden information" principle, though. Perhaps one of the ASCII control codes that aren't usually found in text files could be repurposed to have special meaning to the editor, to signify a line that is a prompt or contains the path to an embedded file.