Sometimes, doing things incrementally hurts more than it helps

October 12th, 2009

Usually the best way to make a major change to a piece of code is to break it down into small changes and to keep the code working after each one. The idea is that if you make too many changes at once and break the code too badly, you might never get it working again. Without working code it can be difficult to figure out what the next step should be.

But sometimes, incremental changes just don't work. In particular, if you're making major architectural changes, trying to construct something that is 90% old architecture and 10% new architecture can involve more work fitting the incompatible pieces together than the incremental approach saves. In these cases, sometimes the only thing you can do is take the whole thing to pieces and put it back together again the way you want it.

Scaling/scanlines algorithm for monitor emulation

October 12th, 2009

For my TV emulation, I wanted to render scanlines nicely and at any resolution. xanalogtv does vertical rescaling by duplicating rows of pixels, which unfortunately makes some scanlines appear wider than others. Blargg's NTSC filters don't do any vertical rescaling at all.

The first thing I tried was a sinc interpolation filter with the kernel scaled such that the scanline only covered 70% of the pixels vertically (essentially modelling the scanlines as long thin rectangles). This worked great except that it was far too slow because of the sinc function's infinite extent (I was doing a multiplication for each combination of horizontal position, vertical position and scanline). So I windowed the kernel with a Lanczos window. I got annoying aliasing effects using fewer than 3 lobes. With 3 lobes it was still too slow because each pixel was a weighted sum of 3-4 separate scanlines. Also, because of the negative lobes I needed extra headroom which meant I either had to reduce my colour resolution or use more than 8 bits per sample (which would also be slow).

The next thing I tried was a Gaussian kernel. This has several nice features:

  1. The Fourier Transform of a Gaussian is also a Gaussian, and a Gaussian is a better approximation of a scanline's profile than a rectangle (the focussing of the electron beam isn't perfect, so to a first approximation the distribution of electrons around the beam center is normal).
  2. It dies off much more quickly than the sinc function.

The Gaussian kernel also gave a good image, so I kept it.

The next thing I wanted to do was improve the speed. I still had several scanlines contributing to every pixel. However, that doesn't make much physical sense - the scanlines don't really overlap (in fact there is a small gap between them) so I figured I should be able to get away with using only the highest coefficient that applies to each pixel. I tried this and it worked beautifully - no difference in the image at large sizes, and it sped the program up by a factor of several. The downside was at small sizes - the image was too dark. This is because the filter was set up so that each pixel would be the average of several scanlines, but if only one scanline contributes then the brightness is 1/several of what it should be. To fix this I just divided all the coefficients by the largest. There's no mathematical justification for this, but it looks fine (apart from the fact that some of the scanlines don't contribute to the picture at all).
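The coefficient computation might be sketched like this (my reconstruction, not the crtsim code - the function name and the beam-width constant are made up):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// For each output row, find the nearest scanline centre and compute a
// Gaussian weight for the distance to it, then divide everything by the
// largest weight so that single-scanline pixels aren't too dark at
// small sizes. "sigma" (in output rows) is an assumed beam width.
std::vector<double> scanlineCoefficients(int outputHeight, int scanlines)
{
    double rowsPerScanline = static_cast<double>(outputHeight)/scanlines;
    double sigma = 0.35*rowsPerScanline;
    std::vector<double> coefficients(outputHeight);
    double largest = 0;
    for (int y = 0; y < outputHeight; ++y) {
        // Distance from this output row to the centre of its scanline.
        double centre =
            (std::floor(y/rowsPerScanline) + 0.5)*rowsPerScanline;
        double d = y + 0.5 - centre;
        coefficients[y] = std::exp(-d*d/(2*sigma*sigma));
        largest = std::max(largest, coefficients[y]);
    }
    for (int y = 0; y < outputHeight; ++y)
        coefficients[y] /= largest;
    return coefficients;
}
```

Because each output row takes its weight from only the nearest scanline, the single-scanline-per-pixel optimizations described above fall out naturally.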

If each pixel is only in one scanline, lots more optimizations are possible - for example, one can generate the image progressively, a scanline at a time, which helps keep data in the caches.

Finally, I still needed it to be faster so I moved all the rescaling (vertical and horizontal) to the GPU. I came up with a devilishly clever hack to implement the same scanline algorithm on the GPU. No shader is needed - it can be done just using textures and alpha blending. There are two passes - the first draws the actual video data. The second alpha-blends a dark texture over the top for the scanlines. This texture is 1 texel wide and as many texels high as there are pixels vertically.

One other complication is that I wanted the video data texture to be linearly interpolated horizontally and nearest-neighbour interpolated vertically. This was done by drawing this texture on a geometry consisting of a number of horizontal stripes, each of which has the same v-texture-coordinate at its top as at its bottom.
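The stripe geometry can be sketched like this (the names and layout are my own, not taken from the crtsim source) - each stripe is a quad whose top and bottom edges share the v coordinate of the scanline's centre:

```cpp
#include <vector>

struct Vertex { float x, y, u, v; };

// Build one quad per scanline. Because the top and bottom edges of each
// quad carry the same v texture coordinate, the GPU's bilinear filter
// interpolates linearly in u (horizontally) but never blends two
// scanlines together in v - effectively nearest-neighbour vertically.
std::vector<Vertex> buildStripes(int scanlines)
{
    std::vector<Vertex> vertices;
    for (int i = 0; i < scanlines; ++i) {
        float top = static_cast<float>(i)/scanlines;
        float bottom = static_cast<float>(i + 1)/scanlines;
        float v = (i + 0.5f)/scanlines;  // centre of scanline i
        vertices.push_back({0.0f, top, 0.0f, v});
        vertices.push_back({1.0f, top, 1.0f, v});
        vertices.push_back({1.0f, bottom, 1.0f, v});
        vertices.push_back({0.0f, bottom, 0.0f, v});
    }
    return vertices;
}
```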

Sometimes, figuring out the right architecture is half the battle

October 11th, 2009

I went through quite a few design revisions to get to the pipeline architecture I described yesterday. Some ideas I tried out and then abandoned:

  • Giving the filters the responsibility of keeping the data they needed.
  • Filters telling other filters how many samples they should consume or produce.
  • A FilterGraph object which held all the buffers and which had methods to make and break connections.
  • All the reader and writer methods being on the buffers.
  • Lookahead reader methods for filters.
  • Filters encapsulating other filters that they communicate with.
  • Having separate consume() and produce() methods.
  • Having the Reader and Writer functionality as part of the Consumer and Producer classes (this introduced a surprisingly significant overhead).
  • A Connection object to encapsulate the buffer.

It's rather difficult to tell what's going to work well and what isn't until you actually write some code. And then it takes some trial and error work to hit upon the right pattern. Assumptions must be called into question. Prejudices must be discarded. Darlings must be killed. But you know when you've got it right because the rest then practically writes itself.

Pipeline architecture

October 10th, 2009

My software TV/monitor emulator is best thought of as a bunch of filters for transforming a signal in certain ways, connected together in a pipeline:

  • Decoding composite signals to YIQ
  • Transforming YIQ signals to RGB
  • Horizontal rescaling
  • Ghosting due to signal reflection in the cable
  • Adding noise

Because that's the best way to think about it, that's how I'd like to implement it. Then it will be easier to remove/replace filters that I don't need or that I want to implement in a different way. The filters.h file in the crtsim source implements this architecture.

When you have a pipeline, there are two ways to drive it. One is "consumer pulls" and the other is "producer pushes". In this case, the consumer is the code that actually renders the window to the screen. In "consumer pulls" mode, this code will probably fire 60 times per second (potentially depending on the refresh rate of the monitor on the host machine) and each time it does, it will ask the filter which supplies its data for enough data to render one frame (or field, if we're doing interlaced signals). This filter will then in turn ask the next one along the chain for data and so on up the chain until we get to the code that actually generates the composite signal.

In "producer pushes" mode, the producer generates data at a constant rate (possibly fixed by some other source of time in the system such as the audio device outputting at the correct rate). This data is then pushed from each filter to the next in the chain until it gets to the consumer. When the consumer has collected enough data to render a frame, a frame is rendered.
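In code, the two modes might look something like this (a sketch only - the class and method names are illustrative, not the ones in filters.h):

```cpp
typedef float Sample;

// "Consumer pulls": a sink asks its source to fill a buffer.
class Source
{
public:
    virtual ~Source() { }
    virtual void produce(Sample* destination, int count) = 0;
};

// "Producer pushes": a source hands a full buffer to its sink.
class Sink
{
public:
    virtual ~Sink() { }
    virtual void consume(const Sample* source, int count) = 0;
};

// A trivial source producing a constant signal, for illustration.
class ConstantSource : public Source
{
public:
    ConstantSource(Sample value) : _value(value) { }
    void produce(Sample* destination, int count)
    {
        for (int i = 0; i < count; ++i)
            destination[i] = _value;
    }
private:
    Sample _value;
};

// A filter written for pull mode: it pulls a block from upstream and
// transforms it in place before returning to the caller.
class GainFilter : public Source
{
public:
    GainFilter(Source* input, Sample gain) : _input(input), _gain(gain) { }
    void produce(Sample* destination, int count)
    {
        _input->produce(destination, count);
        for (int i = 0; i < count; ++i)
            destination[i] *= _gain;
    }
private:
    Source* _input;
    Sample _gain;
};
```

A push-mode filter is the mirror image: it implements Sink, transforms the incoming block and calls consume() on the next filter down the chain.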

So for the purposes of emulating a TV or monitor as part of a microcomputer system emulator, "consumer pulls" and "producer pushes" modes can be thought of as "video rate driven" and "audio rate driven" modes respectively. Most emulators are hard-coded to do one or the other. But which one is best is determined by the user's hardware and what they're doing with the emulated system (video driven mode will generally give smoother graphics while audio driven mode will generally give more stable audio). So ideally we'd like to make the choice user-selectable.

A third possibility is for the producer code to decide when to draw a frame and to call for a window redraw, which causes a data pull through the filter chain. However, I've discounted this idea because that is an incorrect placement of responsibility. The producer doesn't (and shouldn't) know about the state of the monitor. Even if it has just produced a vsync pulse it doesn't necessarily mean it's time for a new frame (if the monitor is "rolling" as it will do momentarily when the signal timebase changes, it won't be).

There is another factor in pipeline design which is how the data stream corresponds to function calls. The simplest way would be to have each sink call the corresponding source each time it needs a sample (in pull mode) or each source call its corresponding sink each time a sample is available (in push mode). However, there are potentially quite a few filters and (because they are all replaceable at run time) each call from one filter to the next will be a virtual function call. That means that the compiler can't inline the code and the CPU's pipelining will get screwed up. According to this one can expect a virtual function call overhead of maybe 13 nanoseconds compared to inlined code (a crude test, but sufficient for order-of-magnitude calculations). Since most of our samples will be at 14MHz (4 times the NTSC color carrier frequency) that's only about 5 virtual function calls per sample before we've used up all our CPU.

So each function call really needs to transfer a pile of samples, not just one, and we will need to have circular buffers in between the filters to hold the data. How many samples should we transfer at once? A good rule of thumb is that one's hottest code and data should fit in L1 cache (which is maybe 64KB on modern CPUs). Divide that up by the number of steps in the pipeline and we're looking at low-single-digit numbers of KB to be passed at once. A scanline's worth of data (910 samples, give or take) is probably about right. That reduces the virtual function call overhead by three orders of magnitude, which puts it well into the negligible range. Conceivably one could try benchmarking with lots of different "samples per call" values and then pick the one with the best overall performance (taking into account both call overhead and cache misses). I'll probably do this at some point.
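A minimal sketch of the kind of circular buffer that could sit between two filters (the names are invented for illustration, not from the crtsim source):

```cpp
#include <cstddef>
#include <vector>

// Fixed-size circular buffer of samples. A producer writes blocks in
// (e.g. a scanline's worth at a time) and the consumer reads blocks
// out; the read and write positions wrap around independently.
class CircularBuffer
{
public:
    CircularBuffer(int size) : _data(size), _read(0), _write(0), _count(0) { }
    int available() const { return _count; }
    int space() const { return static_cast<int>(_data.size()) - _count; }
    void write(const float* samples, int n)
    {
        for (int i = 0; i < n; ++i) {
            _data[_write] = samples[i];
            _write = (_write + 1) % _data.size();
        }
        _count += n;
    }
    void read(float* samples, int n)
    {
        for (int i = 0; i < n; ++i) {
            samples[i] = _data[_read];
            _read = (_read + 1) % _data.size();
        }
        _count -= n;
    }
private:
    std::vector<float> _data;
    size_t _read;
    size_t _write;
    int _count;
};
```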

One disadvantage of the pipeline architecture is that it introduces some variable amount of latency - not enough to normally be visible to end users, but this does complicate one thing that I want to emulate - light pens. A light pen is just a fast light sensor that can be placed anywhere on the screen. When the electron beam passes underneath it, it sends a signal to the computer. The computer knows where the beam is supposed to be at any given moment, so it can figure out where the light pen is. However, for an emulator to have proper lightpen support, it needs to have very low latency between the screen and the machine emulation. For this reason, I might abandon the pipeline architecture and just hard-code all the signal munging effects I care about in the CRT simulator itself, processing a line at a time and stopping when the horizontal sync pulse is found. Then, if the lightpen is anywhere on the next line the CRT will be able to tell the machine emulation exactly when the lightpen is going to be triggered.

NTSC hacking

October 9th, 2009

Recently I've been playing about doing NTSC decoding in software, trying to build a software TV/monitor for emulation purposes. I originally wanted to do the decoding of sampled composite signals to RGB and the horizontal scaling in a single step (precomputing a finite impulse response filter which does it all). However, I have come to realize that while this would yield the fastest code, it's not sufficiently flexible for what I want to do.

Specifically, in the signals I want to decode, the horizontal sync pulses can happen at any point (within a certain range) which means that the relationship between samples and horizontal pixel positions is not fixed in advance. This means that it's better to do the decoding (at least composite to YIQ if not all the way to RGB) at a fixed frequency and then rescale the result to pixels in real time (possibly using linear or cubic rescaling).

Having determined this, I looked to see what other NTSC software implementations do. Blargg's NES filter rescales at a ratio of 3:7 at the same time as it decodes, then it's up to the calling code to rescale this to the right width. xanalogtv converts composite to YIQ at 4 samples per color carrier cycle, uses linear rescaling on the YIQ samples and then converts the result to RGB. The resulting pixels may be doubled or tripled to get to the right width. This also allows for nice effects such as "blooming" (widening brighter lines).
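Linear rescaling of one channel of decoded samples to a target pixel width can be sketched as follows (a simplified stand-in for what xanalogtv does, not its actual code; it assumes at least two samples and at least two output pixels):

```cpp
#include <vector>

// Resample one channel of YIQ data (decoded at a fixed sample rate)
// to outputWidth pixels using linear interpolation.
std::vector<float> rescaleLinear(const std::vector<float>& samples,
    int outputWidth)
{
    std::vector<float> pixels(outputWidth);
    double step =
        static_cast<double>(samples.size() - 1)/(outputWidth - 1);
    for (int x = 0; x < outputWidth; ++x) {
        double position = x*step;
        int i = static_cast<int>(position);
        if (i >= static_cast<int>(samples.size()) - 1) {
            pixels[x] = samples.back();
            continue;
        }
        double f = position - i;  // fractional part selects the blend
        pixels[x] =
            static_cast<float>(samples[i]*(1 - f) + samples[i + 1]*f);
    }
    return pixels;
}
```

Cubic rescaling would replace the two-sample blend with a four-sample weighted sum, at proportionally higher cost.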

My current simulator is here and the source is here (Windows only for now - sorry). This uses similar techniques to xanalogtv, but the rescaling is done by the GPU, in RGB space. The scanline effects are a bit more accurate (all the scanlines appear to be the same width, no matter what size the window is), and a phosphor mask is displayed. Most reasonably modern machines should be able to display the images at full speed (60Hz). If your machine is too slow or your monitor doesn't run at 60Hz there may be some odd effects (most LCD panels run at 60Hz). I believe this is the only software CRT simulator that correctly renders both interlaced and non-interlaced signals, and has physically correct phase-locked-loop line frequency behavior. If I can figure out how to add a pixel shader for light bloom, I should be able to get images as good as these (except with arbitrary scaling in real time).

One other rough edge in this version is that the horizontal sync pulse is currently only found to the nearest sample. This means that the phase locked loop isn't very tunable, and will cause problems for PAL signals (where the horizontal sync position is at a different subsample offset on every line). That should be quite easy to fix, though.

This simulator is going to form the basis for my demo machine emulator. The emulator itself is trivial - in fact I have already written it. But I haven't tried it out yet because I have no program to run on it. First I have to write an assembler for it. I might tweak the instruction set somewhat in doing so, so I don't want to release the emulator just yet. Watch this space!

What does supernatural actually mean?

October 8th, 2009

I was discussing philosophy with a theist friend recently and the argument "this only applies to natural things and God is supernatural so this doesn't apply" came up. I've seen this argument in other debates as well, but I have to confess that I don't completely understand it. What does "supernatural" actually mean? The dictionary definition that seems to best apply is "unexplainable by natural law or phenomena".

There are two possible meanings to that. One is "unexplainable by the laws of physics as they are currently known" and the other is "unexplainable by the laws of physics even if we knew them all".

The existence of supernatural things of the first type is not denied by any sufficiently well-informed scientist - it's no secret that the laws of physics are incomplete. One possible example might be the details of the event horizon around a black hole - at small scales this requires a theory of quantum gravity, which we don't yet have. I think we'll eventually eliminate all such supernatural things by having a complete theory of physics.

I suspect theists would therefore prefer the second definition. But does this definition even make any sense? What would it mean for there to be phenomena in our universe which no physical theory could be devised to explain? Well, supernatural phenomena of that sort can be said to exist too - when a quantum variable is measured and the wavefunction collapses, the result isn't necessarily determined by anything in the universe. But I don't see any theists suggesting that God acts on the universe by deciding how each and every wavefunction collapse occurs, despite the apparent omnipotence that such power would grant. I suspect this is because it would eliminate any possibility of human free will - quantum wavefunction collapse is part of all physical processes, so controlling quantum wavefunction collapse would mean controlling all our thoughts and actions. No will other than the will of God would be possible, making all religion (and indeed everything) rather pointless.

It seems to me there are philosophical reasons to reject the concept of a non-random supernatural process - if something is non-random there is some sort of (at least partially) predictable pattern to it, which means one could come up with a law of physics to describe that pattern, which means it's no longer supernatural. "Predictable" only means predictable in principle, though, not predictable in practice. Chaitin's constant (the probability that a randomly chosen program will eventually halt), for example, isn't random - it has a single definite value - but it is uncomputable in the sense that only a finite number of its digits can be determined by any algorithm. Curiously, this number could be thought of as omniscient - it encompasses all mathematical knowledge (since it can be used as an oracle to solve the halting problem) - but a number (even a real, uncomputable one) doesn't seem like it could match the theists' descriptions of God as having properties like compassion and goodness.

As well as the gaps in our knowledge of physics and the gaps caused by quantum improbability, there are also gaps due to the fact that there are some real world phenomena which we just can't do experiments on for one reason or another. We can't do experiments on UFOs because we can't predict when and where they will show up (though I'm sure that if one did show up in a suitably equipped science lab, laws of physics could be found to describe it).

Another thing we can't do physics on is subjective experience, simply because it's subjective. We don't currently have any technology by which one person can experience what it's like to be another person (and even if we did, there is no objective way to be sure that it's the same experience for both people - one can't compare subjective things with objective things). All we can do is ask people to report on their subjective experiences, and a personal report isn't as reliable a piece of evidence as a repeatable experiment.

Each one of us can't even be truly sure that other human beings actually *have* subjective experiences - maybe they're just p-zombies who say they do. It's a useful working hypothesis to assume that they do, though (and the opposite assumption would be rather dangerous for all concerned).

So perhaps God is Himself a subjective experience. That certainly dovetails with some things that theists say, like "I know God is in my heart and I've experienced His love, but I have no way to prove that to you". And objective evidence of God does seem to be rather thin on the ground, to put it mildly. If this is what God is then I am a teeny bit jealous of theists for having that experience that I never had (even when I was a theist, went to church, prayed regularly etc.).

If neuroscientists are able to generate religious feelings in others by stimulation of their temporal lobes with magnetic fields then, objectively, that suggests that religious experiences (at least these ones) are "fake" in the sense that there is no "objective God" causing them. But on the other hand, why should we treat subjective experiences as less "real" than objective reality? What could be more real to someone than something they have experienced first hand? By that logic, hallucinations caused by drugs or schizophrenia should also be considered "real" in this subjective sense.

Objective reality, on the other hand, consists of the things that we can in principle (given sufficient experimental data and suitable application of logic) convince other rational human beings of - in other words the things that we can in principle (if we're honest) all agree upon. As such, only objectively verifiable things should be used as a basis for public policy.

Memory handling in high-level and low-level languages

October 7th, 2009

The usual distinction between high-level and low-level languages is that high-level languages provide more abstractions. Assembly language is clearly a low-level language, and provides very few abstractions over the raw binary format consumed by the processor. Javascript is very high-level because it abstracts away almost all the details about the particular machine and operating system that it runs on, making it much easier to write actual end-user applications in (but impossible to write, for example, device drivers in).

One particular place where the distinction is particularly obvious is how different languages handle memory. There seem to be three basic strategies for handling memory:

  1. Implicit or explicit memory management (assembly language, C, C++) - calls to malloc() and free() (or equivalent) are inserted into the code by the programmer or by the compiler.
  2. Garbage collection with explicit state (C#, Java, Python, millions more) - no malloc() or free() are possible - all memory is "owned" by the garbage collector. Memory management is abstracted away.
  3. Referential transparency (Haskell) - there is no state, the concept of memory itself is abstracted away.

The boundaries can blur a little - you can plug a garbage collector into C++ for example, and high-level languages often provide "unsafe" functionality that allows one to escape to lower-level features.
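Strategy 1 covers both programmer-inserted and compiler-inserted calls; in C++ the same allocation can be written either way (a toy example of my own, just to make the distinction concrete):

```cpp
#include <memory>

// Explicit memory management: the programmer writes both the
// allocation and the deallocation.
int explicitVersion()
{
    int* p = new int(42);
    int result = *p;
    delete p;
    return result;
}

// The same allocation with the deallocation inserted by the compiler:
// the unique_ptr's destructor deletes the int when it goes out of
// scope. This is still strategy 1 - ownership is tracked statically,
// with no garbage collector involved.
int compilerInsertedVersion()
{
    std::unique_ptr<int> p(new int(42));
    return *p;
}
```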

Taxing hard work versus taxing luck

October 6th, 2009

In the year 2000, John McCain was asked by a student, "Why is it that someone like my father who goes to school for 13 years gets penalized in a huge tax bracket because he's a doctor. Why is that - why does he have to pay higher taxes than everybody else? Just because he makes more money. How is that fair?". To his credit, McCain did actually defend the idea of progressive taxation (those who make more money should pay a higher percentage of their income as tax).

It seems to be a common misconception amongst conservatives that progressive taxation is "punishing hard work" (or perhaps it just makes a good soundbite). But I don't think that even the most diehard liberals want to punish anybody for working hard - what should be (and is) punished is good luck. Many extremely wealthy people haven't had to work particularly hard for their wealth - especially those who have inherited it, but also those who have had the good fortune to be selling the right product in the right place at the right time (for example Microsoft in its early days with MS-DOS).

Also, there are a great many people who work extremely hard, but don't make a lot of money (those who work two or three low-paying jobs).

Progressive taxation decreases the reward for being lucky and thereby increases the amount of that reward that is due to hard work. So a (well implemented) progressive taxation strategy actually encourages hard work rather than discouraging it.

Not all progressive taxes are well implemented, though. In the UK there is a VAT registration threshold of (currently) £68,000 per year. If your business's turnover exceeds that, you have to register for VAT and account for it on all your sales (not just the sales over the threshold). This greatly discourages people who are making just under the limit from working any harder, because if they did they could actually end up with less money.

I think one of the fairest taxes is the inheritance tax. This is much maligned as the "death tax", a moniker which doesn't make much sense to me as it isn't the dead person that pays the tax, it's the beneficiary. It's a fair tax because there's no hard work involved in inheriting money - you just have to have the good fortune to be born to wealthy parents. The downside of the inheritance tax is its effect on the "family business" - a child wouldn't necessarily be able to afford to take over a business from his or her parents if there are large inheritance taxes to be paid, and this may lead to the destruction of some otherwise profitable businesses. But why, as a society, should we give that child the gift of this business and not expect anything in return?

A business which would be affected by this must have a large intrinsic value to be subject to the inheritance tax, but also make a small enough profit that it would take an unreasonably long time for the beneficiary to pay it back. I think the answer to this conundrum is investors - the beneficiary ceases to be sole owner of the business but the business can continue to exist and benefit society. Inheritance taxes should be set up in such a way that this can work.

Larrabee: Devolving the CPU vindicated

October 5th, 2009

I was interested to read that the cores of Intel's new Larrabee graphics architecture are based on the old P54C design, originally used in the second generation of Pentium chips (dating back to before even MMX made its first appearance). The core has since been updated with the latest instructions (including x86-64) but is missing performance features such as out-of-order execution. This vindicates what I wrote in devolving the CPU - that individual cores will get simpler and slower again as they get more numerous.

Setting constraints when programming

October 4th, 2009

When writing a program, I generally don't just write it in any old way that might work, I set myself some constraints. Generally these constraints are particular programming principles that I have found make my life easier in the long run, such as choosing appropriate names. Sometimes I'll try to maximize speed or minimize memory usage. For Argulator I tried to minimize the number of SQL queries I made to satisfy any given request (which led to some convolutions to do appropriate data caching on the PHP side.) I also tried to make the page sent by the server as close as possible to the page that actually gets rendered (i.e. make Javascript not needed to display the page, just to interact with it). I mostly succeeded, at the expense of having some code duplication between the server side and the client side (though the argument creation UI is partially created by Javascript).

Sometimes I'll try to eschew certain operations, such as eliminating the square root operation in the algorithm to determine if a point is in the Mandelbrot cardioid. Often I'll try to eliminate subtraction and division and write algorithms in terms of addition and multiplication instead. This might seem to be a strange thing to do but the resulting algorithms are often more concise and it's easier to tell that they're correct. This can also eliminate division by zero errors.
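The cardioid test referred to is, I believe, the standard square-root-free check: with q = (x - 1/4)² + y², the point x + iy lies in the main cardioid when q·(q + x - 1/4) < y²/4.

```cpp
// Square-root-free test for membership of the Mandelbrot set's main
// cardioid, avoiding the square root a direct distance-to-boundary
// test would need.
bool inMainCardioid(double x, double y)
{
    double xq = x - 0.25;
    double q = xq*xq + y*y;
    return q*(q + xq) < 0.25*y*y;
}
```

Points in the period-2 bulb (the disc of radius 1/4 around -1) need a separate check, which is also free of square roots.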