VC++ inline assembly hack

July 8th, 2010

Here's a cute little hack I came up with when doing some fixed-point arithmetic stuff in Visual C++ a while ago. I wanted to make a templated C++ class with some inline assembly, the generation of which depends on a numeric template parameter. Doing the obvious thing doesn't work:

template<int N> class ArithmeticHelper
{
public:
    static Int5 MultiplyShiftRight(Int5 x, Int5 y)
    {
        __asm {
            mov eax, x
            imul y
            shrd eax, edx, N
        }
    }
};

However, one can introduce numeric parameters into inline assembly code by pretending that they are the length of something, and using the "length" operator:

template<int N> class ArithmeticHelper
{
private:
    static const int nn[N];  // VC's inline asm won't accept the template parameter N as an immediate, but "length nn" works...
public:
    static Int5 MultiplyShiftRight(Int5 x, Int5 y)
    {
        __asm {
            mov eax, x
            imul y
            shrd eax, edx, length nn
        }
    }
};

Most different colours

July 7th, 2010

For various visualization programming tasks, it's useful to have a set of colours that are as far apart as possible. I set out to write a program to generate such sets.

The first problem is that we must work in a perceptual colour space such as Lab rather than the usual sRGB, or our colours will be overly concentrated in the areas where sRGB over-represents.
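The conversion from sRGB to Lab goes through CIE XYZ; a straightforward implementation (D65 white point, the published standard constants - this is the textbook conversion, not the exact code from my program) looks something like this:

```cpp
#include <cmath>

// Convert an sRGB colour (components in 0..1) to CIE L*a*b* (D65 white).
struct Lab { double L, a, b; };

static double srgbToLinear(double c) {
    // Undo the sRGB transfer curve.
    return c <= 0.04045 ? c / 12.92 : std::pow((c + 0.055) / 1.055, 2.4);
}

static double labF(double t) {
    // Cube root with a linear segment near zero.
    const double d = 6.0 / 29.0;
    return t > d * d * d ? std::cbrt(t) : t / (3 * d * d) + 4.0 / 29.0;
}

Lab srgbToLab(double r, double g, double b) {
    r = srgbToLinear(r); g = srgbToLinear(g); b = srgbToLinear(b);
    // Linear RGB to XYZ (sRGB primaries, D65).
    double X = 0.4124 * r + 0.3576 * g + 0.1805 * b;
    double Y = 0.2126 * r + 0.7152 * g + 0.0722 * b;
    double Z = 0.0193 * r + 0.1192 * g + 0.9505 * b;
    // Normalize by the D65 white point and apply the Lab nonlinearity.
    double fx = labF(X / 0.95047), fy = labF(Y / 1.0), fz = labF(Z / 1.08883);
    return { 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz) };
}
```

Euclidean distance in the resulting (L, a, b) space is then a reasonable proxy for perceptual distance.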

Next, I set up a system of mutually repelling particles and let them arrange themselves (sliding along the sides and edges as required). This didn't work too well even just in sRGB space - the trouble is that I don't care about maximizing the distance to the most distant colours, only to the nearest ones. I thought about doing a Delaunay triangulation to find the nearest neighbours of each point, but it turns out that's overkill - all you need to do is repel the 6 nearest particles (and recalculate which particles those are after each frame). If you end up with the closest points all on one side, the particle will be repelled away until one of the closest points is on the other side.
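A minimal sketch of one relaxation step of this scheme (names and the force law are illustrative - the real version also has to keep the particles inside the gamut):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

// One relaxation step: push each particle away from its 6 nearest
// neighbours, recomputing which those are every frame.
void repelStep(std::vector<Vec3>& p, double strength) {
    std::vector<Vec3> force(p.size(), {0, 0, 0});
    for (size_t i = 0; i < p.size(); ++i) {
        // Find the 6 nearest neighbours of particle i.
        std::vector<std::pair<double, size_t>> d;
        for (size_t j = 0; j < p.size(); ++j) {
            if (j == i) continue;
            double dx = p[i].x - p[j].x, dy = p[i].y - p[j].y, dz = p[i].z - p[j].z;
            d.push_back({dx * dx + dy * dy + dz * dz, j});
        }
        size_t k = std::min<size_t>(6, d.size());
        std::partial_sort(d.begin(), d.begin() + k, d.end());
        // Inverse-square repulsion from each of those neighbours.
        for (size_t n = 0; n < k; ++n) {
            size_t j = d[n].second;
            double dx = p[i].x - p[j].x, dy = p[i].y - p[j].y, dz = p[i].z - p[j].z;
            double r2 = dx * dx + dy * dy + dz * dz + 1e-12;
            double f = strength / (r2 * std::sqrt(r2));
            force[i].x += f * dx; force[i].y += f * dy; force[i].z += f * dz;
        }
    }
    for (size_t i = 0; i < p.size(); ++i) {
        p[i].x += force[i].x; p[i].y += force[i].y; p[i].z += force[i].z;
    }
}
```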

Even with this fix, I was still getting strange results. After some head-scratching, I realized that it was just due to the "kinks" in some of the edges of the sRGB gamut in Lab space:

The particles tend to get "stuck" in these kinks.

I'm not quite sure what to do about this. Perhaps I can find optimal sets in a rectilinear gamut and then gradually morph this into sRGB while continuing to let the particles repel.

Arbitrary precision Mandelbrot sets

July 6th, 2010

I added arbitrary precision arithmetic to my Mandelbrot zoomer. Surprisingly, the difficult part wasn't the arbitrary precision arithmetic itself, it was deciding when and how to switch from double-precision to arbitrary precision arithmetic. Because my program reuses data computed at one zoom level to speed up the computation of more deeply zoomed images, some images contained pixels computed using both methods. The trouble is, the two methods don't give the same results. In fact, after some experimentation I discovered that the problem was even worse than that: even using just the double-precision code, you get different results depending on the number of digits you use. So the point that was represented in my program by 0xffa615ce + 0x00000002i (using 10.22 bit fixed point) escaped after 548 iterations but the point 0xffa615ce00000000 + 0x0000000200000000i (using 10.54 bit fixed point) escaped after 384 iterations. The problem is that after every multiplication we shift the result right by the number of fractional bits, which performs a rounding operation. The images generated look reasonable but are actually only approximations to the true images that would be calculated if infinite precision were employed.
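The culprit is easy to isolate. Here is the fixed-point multiply in question, sketched with the compiler's 128-bit type holding the full product - the shift right after the multiply truncates low bits, so the result depends on how many fractional bits the format has, even for inputs representing the same real number:

```cpp
#include <cstdint>

// Fixed-point multiply with f fractional bits. The right shift after the
// multiply truncates the low bits of the product, i.e. rounds towards
// negative infinity. (__int128 is a GCC/Clang extension.)
int64_t fxMul(int64_t a, int64_t b, int f) {
    return (int64_t)(((__int128)a * b) >> f);
}
```

For example, the value 2^-22 squares to zero in a 22-fractional-bit format, but the same value held with 54 fractional bits squares to 2^-44, which survives. Once results like these feed back into the z → z² + c iteration, the two precisions drift apart and can eventually give different escape counts, as above.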

Having determined this, I realized it would be necessary to throw away all the points computed with lower precision once we started using a higher precision. This isn't as bad as it sounds, since (when zooming in by a factor of 2) the recycled points only account for 1/4 of the computed points. But if we just threw them all out at once it would result in a visible artifact when passing through the precision boundary - the entire window would go black and we'd restart the calculation from scratch. I wanted to avoid this if possible, so I started thinking about convoluted schemes involving restarting a point if its 3 sibling points were of higher precision. Then I realized that I could just recycle the points but keep their colours when splitting blocks across a precision boundary, avoiding the glitch. There are still some artifacts while the image is being computed, but they are less jarring.

Finally there was the problem of interfacing with double-precision floating-point numbers. The Mandelbrot set is contained entirely within the circle |c|<2 and in this range a double has 52 bits of fractional precision. But if we replace the 10.22 and 10.54 bit fixed point numbers with doubles, we'll have a region of blockiness where we need 53 or 54 bits but only have 52. Rather than having two sorts of 64-bit precision numbers, I decided to simplify things and have 12 integer bits in my fixed point numbers. The program is very slow when it switches into arbitrary precision mode - it's barely optimized at all. The 96-bit precision code currently has a theoretical maximum speed of about 92 times slower than the SSE double-precision code (190 million iterations per second vs 2 million). It could be significantly faster with some hand-tuned assembly, though - I have a 96-bit squaring routine that should speed it up by an order of magnitude or so. All of the multiplications in the Mandelbrot inner loop can be replaced with squarings, since [latex]2xy = (x+y)^2 - x^2 - y^2[/latex]. Squaring is a bit faster than multiplication for arbitrary precision integers, since you only need to do [latex]\displaystyle \frac{n(n+1)}{2}[/latex] digit multiplications instead of [latex]n^2[/latex]. Given that we are calculating [latex]x^2[/latex] and [latex]y^2[/latex] anyway and the subtractions are very fast, this should be a win overall.
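The squaring identity drops straight into the inner loop. Here is a sketch in plain doubles - the arbitrary-precision version would do the same three squarings per iteration, just on 96-bit numbers:

```cpp
// Mandelbrot inner loop using only squarings: since 2xy = (x+y)^2 - x^2 - y^2,
// each iteration of z -> z^2 + c needs three squarings and no general
// multiplications. Returns the escape iteration, or maxIter if none.
int iterate(double a, double b, int maxIter) {
    double x = 0, y = 0;
    for (int i = 0; i < maxIter; ++i) {
        double x2 = x * x, y2 = y * y;   // two squarings
        if (x2 + y2 > 4.0) return i;     // escaped: |z| > 2
        double s = (x + y) * (x + y);    // third squaring
        y = s - x2 - y2 + b;             // 2xy + b
        x = x2 - y2 + a;                 // x^2 - y^2 + a
    }
    return maxIter;                      // presumed inside the set
}
```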

Waller's Torus: a four-dimensional musical instrument

July 5th, 2010

Consider the major and minor chords in Western music. There are 24 of them altogether (with the normal twelve-tone equal temperament tuning system usually used): one major chord and one minor chord for each of the 12 notes in the chromatic scale (A-G and 5 accidentals).

Each of these 24 chords consists of 3 notes. Pick one of the notes in the chord and throw it away, leaving two notes. Given these two notes, one can always find two chords (one major and one minor) which use those two notes. So given any major chord one can find three minor chords which share two notes with it, and given any minor chord one can find three major chords which share two notes with it.

That suggests that the 24 chords form a sort of network or graph, with vertices corresponding to chords and edges corresponding to pairs of notes. In three dimensions this network can be arranged on the surface of a torus (Waller's Torus). You can arrange the vertices in such a way that all the edges are the same length, but some vertices will have different sets of angles between edges than others.
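The three-neighbour structure is easy to verify by brute force. This sketch (the helper names are mine) counts, for each major triad, the minor triads sharing exactly two notes, treating chords as pitch-class sets mod 12:

```cpp
#include <vector>

// Triads as pitch-class sets mod 12: major {r, r+4, r+7}, minor {r, r+3, r+7}.
int sharedNotes(const std::vector<int>& a, const std::vector<int>& b) {
    int n = 0;
    for (int x : a)
        for (int y : b)
            if (x == y) ++n;
    return n;
}

// How many minor triads share exactly two notes with the major triad on root?
int minorNeighboursOfMajor(int root) {
    std::vector<int> maj = { root % 12, (root + 4) % 12, (root + 7) % 12 };
    int count = 0;
    for (int r = 0; r < 12; ++r) {
        std::vector<int> min = { r, (r + 3) % 12, (r + 7) % 12 };
        if (sharedNotes(maj, min) == 2) ++count;
    }
    return count;
}
```

The three neighbours found this way are exactly the parallel, relative and leading-tone exchanges of neo-Riemannian music theory - for C major they are C minor, A minor and E minor.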

In 4 dimensions, the symmetries of the network are more apparent - the vertices can be arranged such that each edge is the same length and all the vertices have the same set of angles between edges. The resulting object isn't a polychoron because only three edges meet at each vertex but it isn't a polyhedron either because it does extend into all 4 dimensions. The analogous figure in 3D might be something like a star shape - if you draw it in 2D the inner points will be closer to each other than the outer points but in 3D you can arrange the points on the surface of a cylinder (sort of like the points one might find at the top of a paper party hat from a Christmas cracker). Here's what it looks like:

Using this figure, one could generate a nonlinear mapping from points in a 4-dimensional space to chords. Not just major and minor chords either - other chords are represented by points other than the vertices. At each vertex you can move in the direction of one of the edges to change the pitch of one of the notes. Since there are only three degrees of freedom in a chord (one for the pitch of each note) there is also a direction at each point in 4-space which leaves the chord unchanged as you move along it.

I am not (just) a strange loop

July 4th, 2010

A while back, I read Douglas Hofstadter's book "I am a strange loop". As one might expect from Hofstadter, it's a fascinating book packed with good ideas. However, it happens that I disagree with some of them. Hofstadter believes that the human brain fundamentally works in an entirely mechanistic, deterministic manner and that all of the mysteries of consciousness can be explained in terms of symbols being triggered by other symbols in the brain. Our subjective "awareness" is, according to Hofstadter, just an illusion - a hallucination. I'm not convinced by this - the concept of a hallucination implies that there is something (someone) there to experience the hallucination. But if a hallucination has an experience, how can it be a hallucination? It's sort of like how the concept of "creating time" is meaningless, because the concept of creation implies a time before and a time after.

If "souls" (or whatever less loaded word you'd prefer to use to mean the part of us which has subjective experiences) are not made of particles or patterns of particles, how do they get distributed so that there's one per human brain? I think that's asking the wrong question, because it presupposes that souls are localized entities like particles. I think there are many other possibilities. Unfortunately because we have no way to do experiments on subjective experience, answering this question seems to be out of the reach of science (at least for the moment).

Floating-point faster than integers

July 3rd, 2010

In an attempt to get a little extra speed in my fractal program, I decided to try rewriting the innermost loop to use 32-bit fixed-point numbers instead of double-precision floating-point numbers. The precision would be lower, but I need to write some precision-switching code anyway. I had noticed that my CRT simulator got faster when I switched from floating-point to integer code, so I assumed (incorrectly, it turns out) that the same would be true in this program.

It turns out that the best way to multiply two signed 32-bit integers and then shift the 64-bit result right by 22 bits (the number of fractional bits I'm using) is to use the one-operand IMUL instruction followed by the SHRD instruction. Unfortunately the only register you get to pick with this combination is one of the input registers - both the output and the other input are in EAX. This is a problem because it means that you have to load EAX right before every multiplication and stash the result somewhere else right after, before starting the next multiplication. All this shuffling slows the code right down - my 2GHz Core Duo laptop peaks at ~150 million iterations per second for the double-precision routine and ~100 million iterations per second for the integer routine. To make matters worse, you also lose the use of the EDX register (which is stomped by the IMUL) so even with frame pointer omission you're down to just 5 registers (ESI, EDI, EBP, ECX and EBX).
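In portable terms, what the IMUL/SHRD pair computes is just a full-width signed product renormalized back to 10.22 format - it's the x86 register constraints, not the operation itself, that cause the trouble:

```cpp
#include <cstdint>

// C++ equivalent of the instruction sequence discussed above:
//     mov  eax, x
//     imul y              ; edx:eax = full 64-bit signed product
//     shrd eax, edx, 22   ; shift the 64-bit result right by 22 bits
int32_t fixMul(int32_t x, int32_t y) {
    return (int32_t)(((int64_t)x * y) >> 22);
}
```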

Another possible way is to use the MMX registers and the PMULUDQ instruction, but that has two problems: the multiplication is unsigned and there's no arithmetic right-shift in the MMX/SSE ISA so it seems unlikely that it could be made faster than the IMUL version.

This makes me wonder if floating-point instructions would also be faster for other uses where integers have traditionally reigned supreme. Bignums for example. Instead of storing 32 bits of precision in each 32-bit register, you can store 52 bits of precision in the mantissa part of each 64-bit register. There is a little extra complexity involved since the floating point units aren't designed for this (and for multiplication you need to do four 26-bit*26-bit->52-bit multiplies instead of one 52-bit*52-bit->104-bit multiply). However, the extra bits might make this a win. Poking around at the gmp source suggests that they aren't currently using any such tricks for x86, though there is some FPU hackery on other architectures.
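The 26-bit splitting can be sketched like this: cut each operand into two 26-bit halves so that every partial product fits exactly in a double's 52-bit mantissa. This is only the kernel of the idea - a real bignum would also propagate carries between limbs:

```cpp
#include <cmath>
#include <cstdint>

// Multiply two 52-bit non-negative integers held in doubles, exactly, by
// splitting each into 26-bit halves: every partial product is then at most
// 52 bits and so is represented exactly. Result is hi * 2^52 + lo (both
// exact integers; lo may itself exceed 2^52).
void mul52(double a, double b, double& hi, double& lo) {
    const double SPLIT = 67108864.0;                    // 2^26
    double ah = std::floor(a / SPLIT), al = a - ah * SPLIT;
    double bh = std::floor(b / SPLIT), bl = b - bh * SPLIT;
    double mid = ah * bl + al * bh;                     // exact: sum < 2^53
    double midh = std::floor(mid / SPLIT), midl = mid - midh * SPLIT;
    hi = ah * bh + midh;
    lo = midl * SPLIT + al * bl;
}
```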

Very low-level programming

July 2nd, 2010

In most computer programs where speed is critical, there are a small number of chunks of code in which the CPU spends most of its time - the innermost loops. For a fractal program, it will be the iteration function. For a high-precision number library, it will probably be the multiplication routine. For a video encoder, it might be the loop that searches for motion compensation vectors.

In each case, programmers will generally spend a lot of time optimizing these innermost loops for the most important CPU microarchitectures in order to squeeze out every last drop of performance where it really matters. This often involves hand-writing assembly code, modelling the execution pipelines of the target machine and poring over optimization documentation.

Doing this, programmers will often think things like "if only this CPU had an instruction that does X, which could be easily done in hardware, this routine could be so much faster". This is why every couple of years or so a new set of instructions is added to the x86 architecture: MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX. These instructions are getting more and more complicated with time, which makes the optimization process all the more difficult - one might have to look at dozens of ways of doing something to find the best one for your specific task.

All this makes me wonder if there might be a better way. With each new instruction set extension, new functional units are added along with new ways of hooking them up. Suppose that instead of programming these inner loops the usual way (deciding which registers represent which variables, choosing instructions and ordering them) we could instead set some state inside the CPU that directly connected the output of multiplier 1 to the input of adder 2 (for example), in effect creating a special purpose machine for the duration of the inner loop. The wiring up probably takes longer than running the original code would, but once you've wired it up each iteration through the loop can run extremely quickly, since there's no single execution thread causing a bottleneck. It would be essentially like having an FPGA built into your CPU.

I can think of several reasons why this isn't currently done. One is that writing applications at such a "low level" creates a dependency between the software and the most intimate details of the hardware. Tomorrow's CPU might be able to run today's software more quickly by having some extra multiplication units (for example) available to each core. But if the applications are addressing the multiplication units directly, that won't help unless the software is rewritten to take advantage of them. One possible answer to this might be to create a programming interface that involves a very large number of virtual functional units and allow the CPU to decide how to map that on to the actual hardware in the optimal way (possibly turning parts into more traditional code if there aren't sufficient functional units) - a sort of "JIT Verilog" if you will.

A second reason is that when you increase the amount of state on the CPU affiliated with a particular execution thread, you make context switching more difficult, because there are more registers that need to get swapped out to memory. FPGAs are generally not reprogrammed extremely frequently because they contain a lot of state and it takes time to reprogram them. This could also be solved by virtualizing functional units, or it could be solved by keeping track of whether the CPU was reprogrammed since the last time the thread ran, and only reprogramming when necessary. That would solve the common case at the expense of speed when doing dumb things like running two different video encoders at the same time (something that is also likely to be rather sub-optimal with today's CPUs, since it isn't generally something that is optimized for).

What would be even cleverer is if the CPU could examine the code sequence and wire things up automatically to maximize the speed of that inner loop. To some extent, CPUs are starting to do such things. Modern CPUs keep a queue of decoded instructions and if an inner loop fits into this instruction queue, no decoding needs to be done while it is running. Cache hints are another example of a transparent feature designed to allow optimization when specific hardware details are known.

Random number generation on 8-bit AVR

July 1st, 2010

For my physical tone matrix I wanted a mode where at the start (or end) of each cycle the machine would pick a random "on" light and turn it off, and a random "off" light and turn it on. With 5-10 lights on, this makes a fairly nice little bit of random music - continually changing so it doesn't get too repetitive.

To implement this I decided the best way was to pick a frame (the second one of the cycle turned out to be the most convenient) and decrement one of two variables "randomOn" or "randomOff" depending on whether the current LED within the frame was on or off. If the variable reached zero the LED would be toggled. That doesn't take too many cycles. But in the sample before this frame, we need to generate random numbers and initialize randomOn and randomOff with them. randomOn needs to be in the range 0..lightsLit-1 (where lightsLit is the number of "on" LEDs) and randomOff needs to be in the range 0..(255-lightsLit). So we need to generate two random numbers in the range 0..n-1 where n is in the range 1..255 as quickly as possible. I wanted to try to generate these numbers in a single sample as trying to spread the work across multiple samples makes the program much more complicated. That left me with about 300 cycles to generate two random numbers.

The usual way of generating a random number in the range 0..n-1 is to generate a random number in a much larger range (say 0..65535) and use the modulo operation ("remainder after division by n") to get it into the range 0..n-1.

Generating a 16-bit random number is done in a fairly standard way using a Linear Congruential Generator (I used the X = 214013*X + 2531011 variant because it cuts out a few multiplications). That by itself takes nine 8-bit by 8-bit multiplications and 56 cycles. I rewrote it myself in assembly language rather than using the built-in generator from avr-libc, because the latter is not very optimized (it uses a full 32-bit multiplication which is not necessary).
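A C++ sketch of the generator (these are the same constants Microsoft's CRT rand() uses), together with the naive modulo reduction that the rest of this post is about replacing with something faster:

```cpp
#include <cstdint>

// 32-bit LCG state, seeded with 1 here. The upper 16 bits are used as the
// random value, since the low bits of an LCG are much less random.
static uint32_t lcgState = 1;

uint16_t random16() {
    lcgState = lcgState * 214013u + 2531011u;
    return (uint16_t)(lcgState >> 16);
}

// Reduce into 0..n-1 the straightforward way (n in 1..255). On the AVR this
// generic modulo is the expensive part.
uint8_t randomBelow(uint8_t n) {
    return (uint8_t)(random16() % n);
}
```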

If you use the 16-bit modulo function from libgcc it takes something like 215 cycles, which was too many. Even unrolled it's something like 144, which is still too many. I was about to embark on the work to spread this loop across several samples when I read this which describes a way of turning a division by a constant into a multiplication by a constant. That's not very useful for arbitrary division, since computing the multiplier constant is more work than just doing the division. But we're not doing arbitrary division here - we know something about our divisor n - it is no larger than 255. So it becomes practical to precalculate a table of multipliers and look up an entry in this table at runtime.

The next problem is that that method only works when the division is exact (has remainder zero) which is no good for this application since it's the remainder we're interested in. But the classic bit-twiddling reference book Hacker's Delight describes a variation which does work (for some divisors the dividend needs to be added back after the multiplication, and for some the result needs to be shifted right by a number of bits depending on the divisor). So our mod algorithm looks like this:

  1. Look up multiplier in table indexed by divisor
  2. Multiply dividend by multiplier
  3. Look up helper routine in table indexed by divisor (there are 18 variations - 9 possible shifts and either "add" or "don't add", but not all of them are used) and jump to it.
  4. Multiply by the divisor and subtract the result from the original dividend - the result is the remainder.

The resulting mod routine takes 40-53 cycles (depending on divisor) giving 96-109 cycles for the entire random number routine. With various other bits of overhead, the random routine takes 253 cycles on this "init" sample and up to 29 per sample on the first frame of the cycle.
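The steps above can be sketched in portable C++. With a full 32-bit-wide multiply available, a single magic constant per divisor gives an exact quotient for any 16-bit dividend and any divisor 1..255, with no add-back or variable shift - it's the AVR's 8-bit multiplier that forces the table of 18 helper variants:

```cpp
#include <cstdint>

// Magic multiplier for divisor n (1..255); on the AVR this would be a
// precomputed table indexed by n.
uint64_t magic(uint8_t n) {
    return (UINT64_C(1) << 32) / n + 1;   // round-up reciprocal of n
}

// Remainder of x / n via multiplication: exact for all x < 2^16, n in 1..255,
// because the error term x * (magic(n) * n - 2^32) stays below 2^32.
uint16_t mod(uint16_t x, uint8_t n) {
    uint32_t q = (uint32_t)((x * magic(n)) >> 32);   // q = x / n exactly
    return (uint16_t)(x - q * n);                    // x - q*n = x mod n
}
```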

Physical tone matrix screen construction details

June 30th, 2010

I think the most unusual thing about the physical tone matrix I posted about yesterday is the screen - I hadn't seen one like it before I made it, and having made one I now know why - they're very fiddly to make. I'd like to go into some more details about how it is made and how it works.

Step 1 is to cut out a 6" square piece of 1/8" thick acrylic. I bought a 12"x24" piece from Delvie's plastics (the cheapest shipped price I could find for such a small quantity), which is enough for 8 such screens (though I doubt I'll make another 7 since threading the wire takes so long - I hope I'll be able to make something else fun out of the rest, though). Cutting this stuff is really easy - just score it a few times with a knife and a straightedge where you want it to break, clamp it to a table so that the scored line runs along the edge of the table and then push down hard on the bit that protrudes over the edge of the table - it'll break cleanly along the scored line. Leave the backing paper on for now.

Step 2 is to drill the holes. I deliberated a bit about the hole pattern to use. I originally thought of threading the wires just horizontally and vertically, so that the switch was formed at the place where the two wires "cross over" (but don't quite touch) in the following picture (red is top, blue is bottom):

But then I decided I'd rather have the central point of the switch not have wire over it, so that the LED would be able to shine through at that point. I also realized that I'd be able to make the switches longer (and therefore easier to press) if the wires were arranged diagonally and fairly close to each other. Working backwards from where I wanted the wires to be on the top, I came up with this pattern:

This requires about 67 feet of wire altogether - because of the zig-zagging, each horizontal and vertical wire is about 15" on the matrix itself, with 2" spare on the top, left and right sides and 14" spare on the bottom to connect to the PCB. Use 22AWG solid-core hook-up wire - this should work nicely.

Here is a page showing where all the holes need to be drilled. Print this out and glue it on to one side of the acrylic sheet. Use a drill press to drill all 1,024 holes. I used this which is the cheapest one I could find. It's a bit flimsy but good enough for the purpose (as well as for PCBs, which is what it's meant for). It doesn't quite have enough clearance to reach the center holes if you put it together in the obvious way, but it does come with an attachment which lets you move the drill further away from the metal bar, if you jam a piece of plastic or something in the gap. I used these bits which seemed to work fine. The positioning doesn't have to be perfect but the closer the better. I think I used a number 68 drill bit or thereabouts - make sure your wire fits through the first hole you drill before drilling the rest. The plastic swarf from the drill will accumulate under the glued paper a bit but that doesn't really matter.

While you've got the drill press handy, make a hole of ~2mm diameter in each corner for the screws to hold it onto the PCB. The way I built it, the switch matrix is screwed to the PCB using 1-1/2" long screws and then the PCB is screwed to the bottom box, so the entire thing is rigid (the top of the box is also screwed to the bottom of the box - only these screws need to be removed to change the batteries). Choose the positions of these holes carefully, since you will need to make holes in the same position in the PCB.

Step 3 is to remove the backing paper and sand the acrylic on the bottom side using a piece of sandpaper. Shine an LED through it to make sure it's diffusing properly. If you don't sand it the screen will have a very narrow viewing angle (depending on the LEDs used - cheap high brightness ones tend to have a very narrow viewing angle though) and when you can see them they will dazzle you. I think I used a #100 sandpaper or thereabouts - I don't think it matters much but try on a scrap piece first if you're worried. An orbital sander will probably get you a more homogeneous finish, but I just did it by hand (you can see the swirl patterns I made if you look at the screen very closely).

Step 4 is to cut and strip the wire. Cut 16 lengths of wire 19" long and 16 lengths of wire 31" long. Avoid kinking/bending the wire at this stage, to the extent you can. Use wire strippers to strip all but 2" of insulation from the 19" lengths and all but 14" of insulation from the 31" lengths. You'll need to do this about 6" at a time. You might need to grip the 2" piece of insulation with pliers when stripping the last bit, otherwise you'll remove the wrong bit. Keep the pieces of insulation for step 6.

Step 5 is to thread the wire. Take each piece of wire in turn and thread it into the acrylic, starting at the bottom (31" sections) and the right (19" sections). Push the wire in so that the remaining insulation is right up against the bottom of the acrylic. Follow the pattern carefully - the top wire of each switch goes horizontally and the bottom goes vertically. Bear in mind that the pattern alternates each row, so if you start with the top one in the first row, you'll start with the bottom one in the second. The PCB has holes for soldering the top and left sides directly underneath the switch matrix, so make sure you pick the alternating pattern that gets it to line up. Make sure none of the wires touch each other - you can always pull them apart slightly with needlenose pliers if they do.

There is a bit of a knack to getting the wires flat and tight with no kinks. Here is how I did it. Suppose you have one segment threaded on the bottom and you're doing the next one on the top side.
a) Thread the loose end of the wire through the next hole. Pull it through. As you are doing so, make sure the wire is in the plane that is perpendicular to the acrylic sheet and that passes through the two holes. If the wire starts to twist over, adjust it so that it is back in this plane. If you don't do this, you'll get a kink in the wire when you pull it tight, which makes it difficult to get it to go where you want it to.
b) Get a flat piece of metal (like the side of a pair of pliers) and push against the threaded segment on the bottom. This will prevent the bottom segment from being pushed up in the next step.
c) Get another smaller flat piece of metal (like the end of another pair of pliers) and push against the newly threaded segment on the top. Start pushing at the "old" end (bending the wire into a right angle) and work your way along to the "new" end until it's totally flat against the acrylic sheet. If you don't do this there will be slack in the wire which will cause the switches to move when you touch them.

It gets more difficult once you get more of the wires in place, because you've got to navigate around the protruding ends with your flat edges. It might make it easier to do the outermost wires first and then work in towards the middle, but I didn't think that far ahead.

Once I got into practice, it took me about 20 minutes to do each wire, so about 11 hours of threading altogether (this is why I'm not planning to make any more). It's not too bad if you do a couple a day for a couple of weeks - one can carry on a conversation or watch TV at the same time.

Step 6 is to re-insulate the uninsulated loose ends. Cut 1" sections of insulation from the leftovers from step 4 and push them back onto the ends of the wire. If there are any bends or kinks in these parts of the wires you'll have to straighten them out with pliers first. Twist the insulation slightly as you push it back on and it'll go on more easily. This will help avoid short circuits when the circuit is partially assembled and you're testing it.

The switch matrix works by strobing each row and then testing each column. The rows are connected to the output of the "row" 74HC595 shift registers. They are connected via 10K resistors so that if something metal touches the switches it won't short anything out. The "active" row (which changes about 1,000 times per second) is brought to logic high (+5V), the others to logic low.

The switch columns are connected (again via 10K resistors) to 74HC4051 analogue multiplexers so that the microcontroller can specify (with three output bits) which of the 8 columns in each half of the matrix it wants to read. This column selection changes 15,625 times per second. The outputs of the two 4051s are connected to Darlington pairs which amplify the signal (which is very weak after it's passed through a finger) and pass it on to two input pins of the microcontroller (one for the left half, one for the right). Immediately after reading these pins, the microcontroller changes the multiplexer selection bits to maximize the time the circuit has to settle into the correct state for the next switch.

The Darlington pair bases are each connected to ground through a capacitor - they won't register as "on" until this capacitor charges up. The larger the capacitor the less sensitive the switches. If the capacitor is too small, you'll get "false on" readings from switches that you aren't actually touching (if the effect could be controlled this might make an interesting proximity sensor sort of input device but it's probably too finicky). If the capacitor is too large then you'll have to press the switches really hard or have damp fingers for the touch to register.

Physical tone matrix

June 29th, 2010

Inspired by André Michelle's Tone Matrix flash applet I decided to make a physical version - i.e. a box with buttons, flashing lights and knobs to turn. I wanted 16x16 - the Bliptronic 5000 just doesn't have enough lights. All the existing 16x16 equivalents were more money than I wanted to pay: A Tenori-on costs something like $1000. Four Bliptronic 5000s cost $200. A monome two fifty six costs $1400. I was thinking more along the lines of $50. Besides which, I thought to myself, it would be fun to make it myself.

I decided to base it around the ATmega328 microcontroller used in the Arduino, after reading how easy it is to get started hardware hacking with the Arduino. Yes, I know a Cortex M0 is a cheaper and more powerful chip, but I didn't know about these until I'd already started with the ATmega328. Also, they're more difficult to solder.

Because of my assembly programming roots, I wanted to push the CPU to its limits. Sketching out the core audio and video assembly routines, I realized that with the Arduino's 16MHz clock rate, I could do 16 channel PWM audio at a sample rate of 15.625KHz (i.e. a sample every 1024 cycles) with a 256-element 8-bit waveform and arbitrary volumes and frequencies for each channel. At the same time I could control a 16x16 LED matrix with a refresh rate of 61Hz (line rate of 976Hz) with each individual LED having an independent brightness (duty cycle of 0-16 sample periods per frame, i.e. up to 1/16 in 1/256 increments) using 32 bits of shift registers and the SPI lines (at maximum speed it takes a significant number of cycles to output so many bits, so this is interleaved with other code). Unfortunately (because of the large gamma of LEDs) only a few of these are distinguishable. With all of this going on in the background (written in heavily optimized assembly language) I still had enough spare CPU cycles to do some interesting foreground things.
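The audio side of that background work amounts to a classic wavetable mixer. A hypothetical C++ sketch of what the hand-written AVR assembly computes once per sample (the struct layout and names are illustrative, not the actual code):

```cpp
#include <cstdint>

// One audio sample: 16 wavetable channels, each with a 16-bit phase
// accumulator whose top 8 bits index a 256-entry signed waveform, scaled
// by a per-channel volume and summed.
struct Channel {
    uint16_t phase;      // current position within the waveform
    uint16_t frequency;  // phase increment per sample (sets the pitch)
    uint8_t volume;      // 0..255
};

uint16_t mixSample(Channel ch[16], const int8_t waveform[256]) {
    int16_t sum = 0;
    for (int i = 0; i < 16; ++i) {
        ch[i].phase += ch[i].frequency;            // advance (wraps mod 2^16)
        int8_t s = waveform[ch[i].phase >> 8];     // look up waveform sample
        sum += (int16_t)((s * ch[i].volume) >> 8); // scale by channel volume
    }
    return (uint16_t)(sum + 32768u);               // recentre for PWM output
}
```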

The next problem was how to make the switch matrix and LED matrix. I thought about buying 4 Bliptronic 5000s and tearing them apart, or using sixteen Sparkfun 4x4 button pads with PCBs, but both of those options cost more than I wanted to pay. The cheapest high-brightness LEDs I could find cost $0.06 in quantity from Mouser, giving a screen cost of $15.36 - much more like it. (Even cheaper prices are possible in greater volumes from Transistor Parts Wholesale). I think this is the cheapest way of making a 6"x6" screen.

I had trouble thinking of a way to set up 256 switches over the LEDs without obscuring them or spending too much, until I remembered an effect I had noticed messing about with transistors years ago - you can make a touch switch out of a couple of transistors arranged in a Darlington configuration. I could arrange them in a matrix like the LEDs so that the only per-switch cost was a couple of small sections of wire. This could be threaded through an acrylic sheet which would serve a dual purpose - to diffuse the LEDs and to hold the switch wires. I originally thought I would have 4 solder connections to the PCB per switch, but then I realized that that would be too difficult to solder and that it would be better to just connect my switch matrix at the edges.

The row strobes used for the switches are the same as the ones used for the LEDs, and the column strobes are accomplished with a couple of 8-bit analogue multiplexers, so that only two Darlington pairs are needed for the entire matrix (the final thing includes a third for a "menu" switch). One tricky thing here turns out to be the capacitance - since the finger resistance is about 10 megohms and we move on to the next switch 15,625 times per second, we need a capacitance of no more than about 6pF - roughly the stray capacitance of just a few centimeters of wire in close proximity. Fortunately that constraint only applies between the multiplexer and the Darlington pair - the switch matrix itself only changes configuration 976 times per second, so we can get away with a larger capacitance there. Even so, I think this is at about the limit of practical resolution for such a matrix. It proved necessary to put a capacitor between the base of the Darlington pair and ground to counteract the admittance of the switch matrix and reduce its sensitivity.
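To see where the 6pF figure comes from: the node has to charge through the finger within one scan slot, so the RC time constant can't much exceed the slot length. A back-of-the-envelope check (variable names are mine; this is an order-of-magnitude argument, not a precise circuit model):

```cpp
#include <cassert>

// Values from the text.
const double kFingerResistanceOhms = 10e6;  // ~10 megohms through a finger
const double kScanRateHz = 15625.0;         // we move to the next switch every sample period
const double kSlotSeconds = 1.0 / kScanRateHz;  // 64 microseconds per switch

// Requiring RC <= one slot gives the maximum tolerable node capacitance:
const double kMaxCapacitanceFarads = kSlotSeconds / kFingerResistanceOhms;  // ~6.4pF
```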

I used a double sided circuit board (in order that I could have LED row wires on one side and column wires on the other), but I think if I were doing it again I'd use a single sided board and just connect the columns by soldering the LED anodes to straight pieces of wire laid across the top of the board. Between aligning the two sides correctly, doing very fiddly soldering under the components on the top side, making lots of vias and not being able to test most of the board until almost everything was soldered (due to some of the component legs acting as vias) it was more trouble than it was worth.

For debugging purposes, I made a connector so that the device could be connected via an Arduino and a USB port to a computer. There is essentially a "non-stop" debug interface built into the program - as it's running, one can send commands over the ATmega's UART to peek and poke memory.
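A peek/poke protocol like this can be very small. Here's a host-side sketch of the idea against a fake RAM - the single-letter commands and hex format are my invention, not necessarily what the real firmware speaks:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Stand-in for the ATmega328's 2K of SRAM.
uint8_t ram[2048];

// Execute one command line: "r AAAA" reads a byte, "w AAAA DD" writes one.
// Returns the byte read or written, or -1 on a parse error.
int debugCommand(const char* line)
{
    unsigned addr, value;
    if (sscanf(line, "r %x", &addr) == 1 && addr < sizeof(ram))
        return ram[addr];  // peek
    if (sscanf(line, "w %x %x", &addr, &value) == 2 && addr < sizeof(ram)) {
        ram[addr] = (uint8_t)value;  // poke
        return ram[addr];
    }
    return -1;
}
```

On the real device the same parsing would run in the UART receive path, so memory can be inspected while the audio and video loops keep going.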

The sound quality isn't great at the moment - it was okay on the breadboard but the breadboarded version of the circuit had a "screen" of only 4 LEDs. With 256 LEDs the ripples on the power supply are much bigger and there's a lot of noise on the speaker when lots of the pixels are lit. I did have some decoupling capacitors in the circuit but I drastically underestimated the amount of capacitance I would need - I'll replace the capacitors with larger ones after my next Mouser order.

The final design has 4 potentiometers: tuning, volume, decay/sustain and tempo.

The software I wrote for it has lots of features:

  • Random mode: after each cycle through the pattern, extinguish one LED at random and light another at random. This keeps the pattern varying.
  • Game of Life mode: after each cycle through the pattern, transform the pattern according to Conway's rules.
  • Various waveforms: choose from sine wave, square wave, triangle wave (which unfortunately sounds indistinguishable from the sine wave) or two different types of random noise. There are also a couple of different waveform editors so you can make up your own.
  • Tuning editor: the default scale is pentatonic but you can change it to use whichever frequencies you like.
  • Overrides for decay, tempo and tuning so they can be set via either digital or analogue controls.
  • Microtone mode: sets the matrix up as a 256-key keyboard spanning 7.5 octaves with a 34-TET tuning. LEDs corresponding to the notes of the C major scale are lit (which makes a pretty pattern). Because of the way the switch matrix works, only chording within a row or a column is possible without introducing spurious notes.
  • Saving and loading patterns, waveforms, tunings and other settings to/from EEPROM. Unfortunately there is a bug in the software at the moment which causes it to crash after saving.
  • Multi-pattern mode: loads a new pattern from EEPROM each time the current one finishes.
  • Sync in and sync out sockets: I believe these should be compatible with those on the Bliptronic 5000, but I don't have one to try it out with.
  • Ability to have fewer than 16 beats before repeat (for making rhythms with different time signatures).
  • Just for fun, a red/green/blue LED triplet. This displays a hue corresponding to the beat currently being played within the pattern. It seems totally frivolous, but was quite handy for debugging this problem when the program was so broken that the serial code didn't even work.
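The Game of Life mode is a nice fit for the hardware, since each row of the pattern is just a 16-bit word. A sketch of one generation under Conway's rules - I'm assuming toroidal wrap-around at the edges here, which may not match the firmware:

```cpp
#include <cassert>
#include <cstdint>

// Advance the 16x16 pattern one generation. Each row is a 16-bit word;
// bit x of grid[y] is the cell at column x, row y. Edges wrap.
void lifeStep(uint16_t grid[16])
{
    uint16_t next[16];
    for (int y = 0; y < 16; ++y) {
        next[y] = 0;
        for (int x = 0; x < 16; ++x) {
            int neighbours = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0)
                        continue;
                    int ny = (y + dy) & 15, nx = (x + dx) & 15;
                    neighbours += (grid[ny] >> nx) & 1;
                }
            bool alive = (grid[y] >> x) & 1;
            // Born with exactly 3 neighbours; survives with 2 or 3.
            if (neighbours == 3 || (alive && neighbours == 2))
                next[y] |= 1 << x;
        }
    }
    for (int y = 0; y < 16; ++y)
        grid[y] = next[y];
}
```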

There are a few more that I've thought of but haven't implemented yet. Almost half the flash is unused currently - enough to add a few games as well.

The machine runs great on 4 AA batteries (6.52V according to my multimeter) or from a 5V supply. I think the ICs are rated up to 15V or so, but the LED current limiting resistors would need to be increased to use a higher voltage.
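The reason the resistors need increasing is just Ohm's law: the resistor drops whatever is left of the supply after the LED's forward voltage, at the desired current. A rough illustration - the forward voltage and target current below are my guesses, not values from the parts list:

```cpp
#include <cassert>

const double kLedForwardVolts = 2.0;  // typical red LED (assumption)
const double kTargetAmps = 0.020;     // 20mA per lit LED (assumption)

// Current-limiting resistor for a given supply voltage.
double limitingResistorOhms(double supplyVolts)
{
    return (supplyVolts - kLedForwardVolts) / kTargetAmps;
}
```

At 5V this gives around 150 ohms; at 15V the same current would need around 650 ohms, so running the existing board from a higher supply would overdrive the LEDs.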

One tricky thing about making this is threading all the wire through the acrylic sheet - I used fairly thick wire (22 AWG) for strength so I had to be careful to avoid kinks and make sure the wire was tight each time. Using thinner wire would be easier but flimsier. I imagine that it could be mass produced more easily using techniques similar to making a double sided PCB (i.e. using tracks and through-hole plated vias instead of a solid piece of wire).

In the end the parts cost for this adds up to $64.83 (not counting time, broken tools and supplies like solder, toner, paper, acetone, steel wool, glue and ferric chloride), though not all of the parts I actually used were bought new (I salvaged the speaker and a capacitor from an old alarm clock, for example). It could probably be mass produced for significantly less (particularly the case I imagine).

The schematics, source code, PCB layout and parts list are available here. If you make a copy or derivative I'd love to hear about it.