Audio codec idea

August 16th, 2011

I was writing some audio code recently and it gave me an idea for an audio compression strategy. I don't know enough about audio compression to know for sure if it's been done before, but here goes anyway. The idea is that you have a selection of short, repeating waveforms (e.g. square wave, sine wave, sawtooth wave, white noise) and multiple channels each of which can independently play one of these waveforms at some particular frequency and volume. This is almost exactly the way that my physical tone matrix renders audio.

I was thinking that with enough channels, enough possible waveforms to choose from, enough frequencies, enough volumes and changing the parameters quickly enough, one could make a (perceptually) good approximation to any input waveform. And because the model sort of reflects how most of the things we listen to are created in the first place (a piece of music consists of multiple instruments playing at different volumes and pitches but each generally consisting of a fundamental tone with overtones of various intensities which might change throughout the course of any given note) this might actually give a pretty good approximation in a very small space.

One big problem is how to choose the right pitch, volume and waveform for each channel. The first thing I tried was using a genetic algorithm - the set of waveforms and the pitch/volume/channel information being the "genome" and the rendered waveform being the "phenome". The fitness function is just mean squared error (though eventually I'd like to switch to a more sophisticated psycho-acoustic model like that used by MP3 encoders). There is a population of genomes and the best ones survive, mutate and recombine to make the new population. Recomputing the phenome is not too expensive as the computations are very easy and each pitch/volume/waveform datum only affects a small amount of the phenome. Changing a bit in one of the waveforms is more expensive though as you have to go through all the pieces of the phenome that use that waveform.
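
A minimal sketch of that search loop, in Python with made-up parameter shapes (fixed 32-sample wavetables, four channels, a single frame; the real encoder would also mutate the waveform tables themselves and work piece by piece):

```python
import random

WAVE_LEN = 32
N_CHANNELS = 4

def render(genome, waveforms, n_samples):
    """Mix each channel's waveform at its pitch and volume (the "phenome")."""
    out = [0.0] * n_samples
    for wave_idx, pitch, volume in genome:
        table = waveforms[wave_idx]
        for i in range(n_samples):
            out[i] += volume * table[(i * pitch) % WAVE_LEN]
    return out

def fitness(genome, waveforms, target):
    """Negative mean squared error against the input waveform."""
    rendered = render(genome, waveforms, len(target))
    return -sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(target)

def mutate(genome, n_waveforms):
    """Tweak one channel's waveform choice, pitch or volume."""
    g = list(genome)
    i = random.randrange(len(g))
    w, p, v = g[i]
    choice = random.randrange(3)
    if choice == 0:
        w = random.randrange(n_waveforms)
    elif choice == 1:
        p = random.randrange(1, WAVE_LEN)
    else:
        v = random.random()
    g[i] = (w, p, v)
    return g

def evolve(target, waveforms, population=16, generations=200, seed=0):
    """Keep the fittest half of the population, refill with mutants."""
    random.seed(seed)
    pop = [[(random.randrange(len(waveforms)),
             random.randrange(1, WAVE_LEN),
             random.random()) for _ in range(N_CHANNELS)]
           for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(g, waveforms, target), reverse=True)
        survivors = pop[:population // 2]
        pop = survivors + [mutate(g, len(waveforms)) for g in survivors]
    return max(pop, key=lambda g: fitness(g, waveforms, target))
```

Because survivors are carried over unchanged, the best fitness never decreases from one generation to the next.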

Unfortunately it doesn't seem to be converging on the input waveform at all yet (it probably would if I left it long enough, it's just far too slow). The next thing I want to try is seeding the initial genomes with some kind of reasonable approximation to the initial waveform, and seeding the waveforms themselves with some random set of overtones modulated by a range of power laws. To seed the initial genomes, I'll need to break the input waveform into pieces, do an FFT on each piece to find the frequencies, pick the best waveform/pitch/volume combination for that piece, then subtract it and repeat until we've picked a waveform for each channel.
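
The seeding step is essentially a greedy spectral subtraction. A sketch assuming NumPy is available, simplified to fit pure sinusoids rather than searching the waveform table: find the strongest FFT bin, fit its amplitude and phase, subtract, repeat.

```python
import numpy as np

def seed_channels(piece, n_channels, sample_rate):
    """Greedily pick the strongest remaining frequency for each channel:
    find the FFT peak, fit a sinusoid at that bin, subtract it, repeat."""
    residual = piece.astype(float).copy()
    n = len(residual)
    channels = []
    for _ in range(n_channels):
        spectrum = np.fft.rfft(residual)
        bin_idx = int(np.argmax(np.abs(spectrum[1:])) + 1)  # skip DC
        freq = bin_idx * sample_rate / n
        # Recover amplitude and phase of the winning bin.
        amplitude = 2 * np.abs(spectrum[bin_idx]) / n
        phase = np.angle(spectrum[bin_idx])
        t = np.arange(n)
        component = amplitude * np.cos(2 * np.pi * bin_idx * t / n + phase)
        residual -= component
        channels.append((freq, amplitude))
    return channels, residual
```

The residual that's left after all channels are seeded is what the genetic algorithm then has to chip away at.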

Even if this codec doesn't beat existing codecs on quality per bitrate metrics, it could still be extremely useful because very little CPU power is required to play it back (I've already proved that a 16MHz Atmel AVR ATmega32 can do 16 channels of this kind of rendering in real time at 16KHz using less than half its capacity). If you premultiply each waveform by each possible volume you can play it back on hardware that doesn't even have a multiplier (at significant cost to quality and/or bitrate).
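
The premultiplication trick might look like this (a sketch with hypothetical table shapes) - playback becomes one table lookup and one add per channel:

```python
N_VOLUMES = 16

def premultiply(waveforms, n_volumes=N_VOLUMES):
    """Bake every waveform at every volume level so playback needs no multiplies."""
    return [[[int(s * v / (n_volumes - 1)) for s in wave]
             for v in range(n_volumes)]
            for wave in waveforms]

def mix_sample(tables, channels, phase):
    """One output sample: a table lookup and an add per channel."""
    acc = 0
    for wave_idx, pitch, vol in channels:
        table = tables[wave_idx][vol]
        acc += table[(phase * pitch) % len(table)]
    return acc
```

The bitrate cost comes from quantizing volume to one of `n_volumes` levels; the quality cost from the rounding in the baked tables.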

Another fun thing about this system is that you could train it on a large number of sounds at once and come up with a set of waveforms that are good for compressing lots of different things. These could then be baked into the encoder and decoder instead of being transmitted with the rest of the data, leading to further savings. I wonder what the optimal waveforms would sound like.

This idea owes something to Mahoney's c64mp3 and cubase64 demos (unfortunately Mahoney's site breaks inbound links but the cubase64 link is in the 2nd of October 2010 section under "Recent Changes"). However, I think the sound quality of this scheme is likely to be much better than c64mp3 since we're not limited to one channel plus noise.

Perceptual colour space update

August 15th, 2011

Last year, I wrote about some code I had written to find maximally distributed sets of colours in perceptual colour space. I was having some problems with the code at the time, but since then I have got it working. I fixed it by only repelling each particle from the nearest one to it - then particles quickly settle into the points where they are equidistant from the nearest particles on each side.
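
A one-dimensional sketch of that relaxation rule (hypothetical step sizes, with the unit interval standing in for the gamut boundary) - each particle moves directly away from its single nearest neighbour until the spacing equalizes:

```python
def nearest_neighbour_step(points, step=0.05):
    """Push each point away from its nearest neighbour, clamped to [0, 1]
    (the clamp stands in for the boundary of the colour solid)."""
    new_points = []
    for i, p in enumerate(points):
        nearest = min((q for j, q in enumerate(points) if j != i),
                      key=lambda q: abs(q - p))
        direction = 1.0 if p >= nearest else -1.0
        new_points.append(min(1.0, max(0.0, p + direction * step)))
    return new_points

def relax(points, iterations=200, step=0.01):
    for _ in range(iterations):
        points = nearest_neighbour_step(points, step)
    return points
```

Three points started close together spread out to roughly 0, 0.5 and 1 - equidistant from their nearest neighbours on each side, as in the post.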

Here are the colours it came up with in LUV space:

[Colour swatch images]

And here are the colours it came up with in LAB space:

[Colour swatch images]

I also did a variation where it just chooses a different hue for each colour, maximizing the saturation (so arranging colours in a ring, rather than throughout a volume) - this is the LAB version but the LUV version is very similar:

[Colour swatch images]

I want to make a little Flash applet so you can watch the colour particles repel each other and rotate the solids around in 3D space - it's a very good way to visualize the perceptual colour solids.

Run a random file

August 14th, 2011

I have a collection of children's TV shows on a computer connected to the family TV. It's a good way to allow the children to watch some TV for a limited period of time without adverts. When one show finishes and they ask me to put another one on, I often say "Okay, but you have to wait 20 minutes". Usually in that time they find something else to do and forget about the TV.

Picking what to put on can be a bother though - if I try to play them in order I forget which one I put on last, and if I try to pick one at random I'll invariably pick some more often than others. So I wrote run_random (binary). This program takes its arguments, expands wildcards and recurses directories to obtain a list of files and then picks one at random and calls ShellExecute() on it (which, if it's a video file, will play the video).
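
The core of it is just "gather files recursively, pick one". A sketch of the same logic using Python's standard library rather than the original Win32 code (the original also expands wildcards and hands the winner to ShellExecute()):

```python
import os
import random

def gather_files(paths):
    """Expand directories recursively into a flat list of files."""
    files = []
    for path in paths:
        if os.path.isdir(path):
            for root, _dirs, names in os.walk(path):
                files.extend(os.path.join(root, name) for name in names)
        else:
            files.append(path)
    return files

def pick_random_file(paths, rng=random):
    files = gather_files(paths)
    if not files:
        raise ValueError("no files found")
    return rng.choice(files)
```

Flattening the tree first means every file is equally likely, rather than every directory - otherwise shows with fewer episodes would come up disproportionately often.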

It's a handy little utility but it's also a good testcase for some path and directory handling routines I wrote for various other purposes.

Oscilloscope waveform rendering

August 13th, 2011

One thing that always annoyed me about Cool Edit Pro (now Adobe Audition, which seems to be much more annoying to use in several respects) is the quality of the waveform visualization. What it seems to do is find the highest and lowest signals at each horizontal pixel and draw a vertical line between them (interpolating as necessary if you're zoomed way in). That means that when you're zoomed out, the waveform is a big blob of green with very little useful detail in it. Only green and black pixels are used - no intermediate colours smooth the image. Other waveform editors I've tried seem to work in similar ways.

I think we should be able to do much better. Suppose we rendered the waveform at an extremely high resolution (one pixel per sample horizontally or better) and then downsampled it to our window size. There's a problem with doing it that way, though - unless the waveform only covers a few pixels vertically, the waveform is going to be spread out amongst too many pixels and will be very dark. Imagine an analog oscilloscope with the beam intensity set at normal for a horizontal trace and then visualizing a signal which oscillates over the entire display at high frequency - most of the signal will be invisible with the exception of the peaks.

The solution to this with the analog oscilloscope is to increase the beam intensity. We can do exactly the same thing with a digital visualizer too - we're not limited to 100% intensity for our intermediate calculations (if a pixel ends up at more than 100% in the final image, we can clamp it or use an exposure function). Increasing the intensity to infinity gives us the Cool Edit Pro visualization again - any pixel the waveform passes through is fully lit.
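
A toy sketch of the supersample-and-expose idea (one output column per group of input samples; hypothetical parameter names). Setting the intensity very high recovers the Cool Edit Pro-style blob:

```python
def render_waveform(samples, width, height, intensity=4.0):
    """Each sample lights the pixel row it passes through; averaging within
    a column gives coverage, the intensity factor scales it, clamping at 1.0
    plays the role of the exposure function."""
    per_column = max(1, len(samples) // width)
    image = []
    for x in range(width):
        column = [0.0] * height
        group = samples[x * per_column:(x + 1) * per_column]
        for s in group:  # samples assumed in [-1, 1]
            row = min(height - 1, int((s + 1) / 2 * height))
            column[row] += 1.0 / len(group)
        image.append([min(1.0, c * intensity) for c in column])
    return image
```

A signal that sits on one row gets a bright trace; a fast full-scale oscillation spreads its energy between rows, which is exactly the dimness the intensity knob compensates for.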

What does it look like? Watch this space to find out!

Edit 14th July 2013:

Oona Räisänen beat me to it.

Cleartype for images

August 12th, 2011

It occurs to me that one could use sub-pixel techniques similar to those used by ClearType to improve the resolution of graphics as well as text. One way to do this would be to downsample an image using a kernel with different horizontal phases for red, green and blue. However, this wouldn't take into account the fact that when making one pixel red, you need to make a nearby one cyan to avoid annoying fringes. With text, the geometry can be controlled so that the widths of horizontal spans are always a multiple of 3 subpixels, but if you're starting with a bitmap you can't really adjust the geometry. Perhaps it wouldn't matter in practice: the same effect happens with NTSC television signals - if you get a black and white pattern with horizontal frequency components in the chroma band you'll get colour fringes for exactly the same reason, but you usually don't notice it because for most images it evens out on average.
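
A toy sketch of the phase idea on a one-dimensional grayscale row with a plain box filter (a real implementation would use a proper kernel and deal with the fringing described above): red, green and blue each sample the source a third of an output pixel apart.

```python
def subpixel_downsample(gray, factor):
    """Downsample a 1-D grayscale row by `factor`, sampling the red, green
    and blue channels at offsets of 0, 1/3 and 2/3 of an output pixel."""
    width = len(gray) // factor
    out = []
    for x in range(width):
        pixel = []
        for channel in range(3):  # R, G, B phases, left to right
            start = x * factor + channel * factor // 3
            window = gray[start:start + factor]
            pixel.append(sum(window) / len(window))
        out.append(tuple(pixel))
    return out
```

At a hard edge the green and blue channels "see" the bright side before red does - which is the extra resolution, and also exactly where the colour fringes come from.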

I'll have to do some experiments and see what happens.

Blog updates

August 11th, 2011

I've been doing a little spring-cleaning of the old blog now that I've started writing in it again. I installed the WP-Syntax and WP LaTeX plugins and went back through my old posts to prettify up the code blocks and equations. I think I've got most of them (with the exception of a few bits of code that are in languages that I made up, so the plugin doesn't understand them, and a few odd ones here and there).

There are so many LaTeX and syntax highlighting plugins available for WordPress, though - I have no idea if the ones I'm using are the best. I ended up just using the "most popular" search on the WordPress plugins page and picked the first results for "syntax" and "latex" that looked like they would do what I wanted, on the theory that the most popular ones would be the best. They seem to work well enough but I do worry a bit that at some point I will find they're missing some important feature and I'll have to switch plugins and go back and update all my old posts again.

I also went back and fixed any typos and mistakes in equations that I noticed while doing the updates, and updated with new developments such as the mention of the Eclipse of Mars illusion in The undiscovered colour and the 75 ohm load in CRTC emulation for MESS.

I also want to change my theme (Default is okay but I think it belies how much work I put into this thing). I'll probably change it to Twenty Eleven with a custom header unless someone with better design taste than me has a better suggestion. I need to find a better way to do centering of large images as well. The current way still mostly works but occasionally collides with the links over on the right, and there are a lot of pictures without the correct stylesheet class. I also want to fix the ordered list style - I'd much prefer nice round bullets to those chevrons.

To make the banner I wanted to take a big pile of pictures that I had made especially for blog entries and just sort of stick them all together at random in a big jumbled collage with some nice blending between them. I thought it would be easy to find software to do this but there's about a million "collage maker" programs, almost all of them "trial-ware" and all of them completely useless for the purpose. It seems that to do a jumbled collage in any of them I'd have to individually add, place and rotate each picture. I found one with a wizard that did an automatic layout with blending, but you can't change the blend radius to anything other than "hardly noticeable at all", the pictures are forced into a grid layout and the program is buggy anyway (and won't let you save in the trial version, but doesn't seem to mind you taking a screenshot of the finished result).

So I think I'm going to write my own automatic collage maker. Shouldn't be too hard, and that way it'll be easy to update the header as I add new images.

One other thought that came to mind as I was going through all my old posts - this blog used to be a lot more picture heavy than it is at the moment. I have a big backlog of pictures (over 2 years worth) that I do eventually want to put up here - I'll probably do a month a day at some point. Most of our pictures these days are taken by Gennie, though, and she tends to put them up on her flickr. There will be some overlap with that, though I don't think I'll be including all the food pictures here (I know I have in the past but that was before Gennie started using flickr).

PIC12 architecture review

August 10th, 2011

There are probably very few ways of getting to know an architecture better than trying to optimize the hell out of a piece of assembly code written for it, so I think I know PIC12 quite well by now. It's a cute little architecture - a very small number of instructions (though it could be even smaller - why are TRIS and OPTION instructions rather than memory locations?). Most operations work on the W register and/or a memory location, so it's either like having a single register or 32 registers (depending on whether you consider the memory locations registers or not).

As you might expect, the W register does become a bit of a bottleneck on occasion (though not completely - you can set, clear and test individual bits of any memory location without affecting W, amongst a few other things). The upside is that there are so few instructions that it's easy to commit the architecture to memory.

It's a bit unfortunate that returning from a function always sets W to a constant - I very rarely found use for this feature and it seemed to get in the way fairly often. At the very least a RET variant which doesn't set W would have been helpful.

Most instructions take 1 cycle (or 2 for jumps), just as in the AVR8 architecture; however, each of these "cycles" is actually 4 clock cycles. This makes me suspect that "1 cycle operation" is mostly a marketing feature - they might have been able to make the instructions 2-5 clock cycles, but making them 4 or 8 (while slower) makes cycle counting easier. It may also simplify the internal circuitry, and it may improve code density (not so many NOPs needed to make different code paths take the same time).

It may be that the 8-bit AVR devices do something similar but also have an internal frequency multiplier so that they really do 1 cycle operation. I suspect that "multiply frequency by 4" circuitry is much more complicated and finicky than "divide frequency by 4" circuitry.

The architecture has call and return instructions, but it's prudent to avoid them in the most time-critical code, since they cost 4 cycles (2 for the call, 2 for the return). Also, the stack depth is very limited (just two return addresses) so often a continuation-stashing style can be useful (especially since an indirect jump to W is just a single cycle).

A PIC12 programmer from an Arduino

August 9th, 2011

In order to program the microcontrollers for interval bars, I needed a programmer. This is a device which connects to the chip and to a computer, and which allows one to transfer a program from the computer to the chip. You can buy them for $40 or so but I decided to make my own - since I already have an Arduino to act as a low level computer interface, I could make it very cheaply. The only complicated bit is the 13V power supply - I used this little boost converter circuit.

The Arduino program is fairly straightforward - it just reads a hex file over the serial line, checks the checksums and converts it to the sequence of signals according to the PIC12's programming specifications. It has facilities for reading back the microcontroller's contents and passing them back to the computer as well, and also a way of calibrating the clock rate. I added this not because I accidentally overwrote the OSCCAL and backup OSCCAL values (well, I did, but I had already read them at that point so I knew what they were) but because I wanted to know what the effect of the OSCCAL value was and if the factory programmed one was optimal.

I discovered that it wasn't quite the best (the preprogrammed value of 0x24 gave a within-spec clock speed of 996.5KHz but 0x26 gave a clock speed of 1002.0KHz). I also discovered that (with the one part I tested) the low bit of the OSCCAL was ignored and the top 7 bits interpreted as a signed quantity between -64 and 63 gave an extremely linear relationship to frequencies between 587.1KHz and 1230.1KHz with a resolution of about 5KHz. I also discovered that the frequency stability was very good - a standard deviation of only about 30Hz.

Here's the schematic for the programmer:

To build the programmer, edit the second line of build_programmer.bat to point to your Arduino installation (if it's somewhere other than C:\Program Files\Arduino) and run it. To upload it to the Arduino you'll need WinAVR - modify upload.bat to point to your WinAVR installation and run it. We can't use the avrdude from the Arduino installation here because it won't toggle the reset line (that's normally done by the Arduino software).

To use the programmer, connect to it with a serial monitor (like the one in the Arduino IDE). There are several one-character commands:

  • W - Write program from Arduino to PIC.
  • O - Write program from Arduino to PIC, including OSCCAL value (necessary if write failed, or if we want to use a different OSCCAL value).
  • P - Write program from Arduino to PIC, including OSCCAL value and backup OSCCAL value - only use if the backup OSCCAL value got screwed up somehow.
  • R - Read program from PIC to Arduino.
  • U - Upload program from PC to Arduino (follow this with the contents of your hex file).
  • L - Download program from Arduino to PC.
  • T - Start timing mode.
  • Q - Stop timing mode.
  • A - Perform automatic calibration.
  • B - Output a nybble to port B (for debugging purposes).

The returned status codes are as follows:

  • K - Success.
  • EF - Error: config fuses didn't verify.
  • EV - Error: Program, OSCCAL, user IDs or backup OSCCAL didn't verify.
  • ER - Address in hex file out of range.
  • EU - Unknown record type in hex file.
  • EZ - Non-zero byte count in end-of-file marker in hex file.
  • EC - Hex file checksum failure.
  • EH - Incorrect hex digit.
  • E: - Colon expected but not received.

In automatic calibration mode, the Arduino also sends calibration information as three numbers per line: OSCCAL value (3 digits), frequency in Hz (7 digits) and standard deviation of frequency in Hz (5 digits).

Don't overdo it with the automatic calibration - it reprograms the PIC with each possible OSCCAL value so you'll wear out the flash if you do it too much.

Final interval bars code (I hope)

August 8th, 2011

I think I've now got the optimal implementation of the PIC12 code for interval bars. It's gone through many rewrites since my last post on the subject. I decided to get rid of the "heartbeat" after all in favor of a burst system which sends and receives 9 bits (all the information for one bar) at a time every 100 clock cycles or so and synchronizes once in each direction during that period, right before data transmission. This means we can use 3-pin connectors instead of 4-pin connectors. A 2-pin connector isn't really practical since extra electronics would have to be added to separate and recombine the power and data signals.

The downside to this approach is that the microcontroller in the "root" box now has two time-critical tasks - reading the bits as they come in (can't just request one when we're ready for it any more) and outputting audio samples. But I think that is manageable, especially if there are queues so that the interrupt routines can just add to or remove from their respective queue and the real work can be done in the foreground. The average data transmission rate is one bar every 317 cycles - the other 217 are spent in the "prime" phase where each bar discovers which bars are connected to it and cycles are broken to turn the graph into a tree. The data rate is about 3,154 bars per second or about 32ms per update with 100 bars - a workable latency for musical purposes.

The program takes 496 instruction slots of the available 512 and 21 bytes of RAM of the available 25. The source file is just 354 lines.

I realized early on that there wasn't going to be space (in flash or RAM) or time for any debugging code, so this program would be impossible to debug on real hardware. I knew I'd never get it right the first time, so it was necessary to write a simulator. There is an existing simulator included with the Microchip tools, but I couldn't get it to work properly and in any case it certainly wouldn't support programmatically plugging and unplugging as many as 100 instances at random. So I wrote my own cycle-exact simulator. Actually it had to be rather better than cycle-exact to simulate the fact that the microcontrollers run at slightly different speeds. My simulated timebase is about 39 picoseconds, giving a frequency resolution of about 39Hz - 512 steps between 0.99MHz and 1.01MHz.

After getting the simulator working, I spent a long time cycling through a process that looks like this:

  1. Run the program in the simulator.
  2. Discover that after a while it locks up or produces incorrect results for an extended period.
  3. Look at various diagnostics I added (up to and including full interleaved instruction traces) to figure out what went wrong.
  4. Adjust the program to avoid the problem, and repeat.

Most of the time, a trip through this loop increases the time-to-failure by a factor of between 2 and 10, but occasionally, it's turned out that there was no simple fix to the problem - the program required substantial rewrites to avoid the situation. These rewrites in turn have their own bugs, and the time-to-failure again becomes very small. It got easier, though - the same sorts of problems kept cropping up and I got better at recognizing them with time. Also, at the beginning I kept having to interrupt the cycle to write more diagnostic code when my existing techniques proved insufficient.

With the current version the simulation ran for more than a week of real time (91 hours of simulated time), went through 15,371,546 configurations with a worst settling time of 92ms.

The best version before this one ran for 774,430 reconfigurations and 9 hours of real time (about 4.5 hours of simulated time) before getting itself into a state from which some of the bars stopped responding. That problem took a week to track down because it happens so rarely. The story of how it happens is kind of like a distilled version of a theatrical farce. There is one signal line for communication which can be in one of two states. As the program progresses, signals of different meanings need to be exchanged (there are about 27 different meanings in the latest version). The two communicating bars need to be "on the same page" about what the meaning of a 0 and 1 will be. But because bars can be connected or disconnected at any time, these meanings can become confused. The farce happens when one signal is confused for another and (due to coincidences that would be amazing if I wasn't conspiring to arrange them) this causes a worse confusion later on and so on, escalating until we get to a state from which the system can't recover.

The way out of this mess is by making some of the messages more complicated than a single bit. For example, the "prime" signal which initiates the data transfer is a 6-cycle low, a 31-cycle high and another 6-cycle low. The receiving code checks the line twice (37 cycles apart) for lows and in the middle somewhere for a high. This means that it can't be confused with either the 9-bit data signal (which is at most 36 cycles of low in a row) or for a single long low signal. The response to this is an 8-cycle low, an 8-cycle high and then a low of variable length (in previous versions it was a single long low of varying length). This increases the number of "this can't happen" signals. When we detect one of these we can put the program into a state that is robust against further unexpected input.
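
The receive-side check might be sketched like this (hypothetical sampling offsets reconstructed from the description above; the real code of course samples the line inside a cycle-counted loop, not a list):

```python
# 6-cycle low, 31-cycle high, 6-cycle low - the "prime" pulse shape.
PRIME = [0] * 6 + [1] * 31 + [0] * 6

def looks_like_prime(line, start):
    """Check the line twice, 37 cycles apart, for lows, and in the middle
    for a high - a pattern that neither a 9-bit data burst (at most 36 low
    cycles in a row) nor a single long low can reproduce."""
    try:
        return (line[start] == 0
                and line[start + 37] == 0
                and line[start + 18] == 1)
    except IndexError:
        return False
```

The point is exactly the one made above: the more structure a signal has, the more "this can't happen" patterns exist, and each one is a chance to drop into a recovery state instead of escalating the farce.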

A continual battle with this program has been making it as fast as possible whilst still being reliable and fitting in the available space. There isn't enough space to duplicate any significant part of the program for each of the 12 combinations of input and output pin, so I initially divided it up into "read from pin x" and "write to pin x" subroutines. The "write to pin x" subroutines can then be folded together by means of a couple of variables whose values can be written to the IO port to signify a low and a high respectively. Since reading from a memory location takes the same time as loading a constant, there's no cost to this indirection (apart from the setup code which has to initialize these variables depending on the pin we need to write to). The read subroutines can't be factored this way because the bit to read from is encoded in the opcode of the instruction which tests a bit. Using an "extract bit x" subroutine would have slowed the program down too much.

Phew. I think that (per line and per byte) this was the most difficult-to-write program I've ever written. Brian Kernighan is often quoted as saying "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." However, there is a corollary to this - if you write a program that's as clever as you can make it and then force yourself to debug it, you become twice as smart in the process.

Edit 14th July 2013:

LFT said it better than I could.

Rethinking memory

August 7th, 2011

There are currently two major ways in which programming languages deal with memory. One is explicit deallocation (as used by C, C++ and assembly language) and the other is garbage collection, which is used by almost everything else.

Explicit deallocation is generally thought of as being very error prone and difficult to get right, but if your language has good features to support it (like destructors in C++) it can be very nearly as easy as garbage collection. It also has some other advantages over garbage collection:

  1. Less actual memory is needed to accomplish any given task (most garbage collected languages need about twice the memory for any given task as a non-garbage collected language).
  2. Non-memory resources can be cleared up in the same way as memory.
  3. Everything is deterministic (which is not the case if you have a garbage collector running on a separate thread).
  4. There are no collection pauses (most garbage collectors need to pause your program to work, making them unsuitable for hard real-time systems).
  5. Locality of reference is better preserved, leading to more cache-friendly memory access patterns.

Garbage collection also has its own advantages:

  1. You never need to deallocate memory explicitly - the computer essentially emulates infinite memory for you.
  2. You don't need to explicitly break (or avoid making) cycles of references - the garbage collector will find them.
  3. With a moving collector, allocation can be much faster (requiring just a pointer increment).
  4. If you can always arrange for garbage collection to happen in time that would otherwise be idle, the system can be faster overall.
  5. With a moving collector, memory fragmentation can be avoided.

Generally I prefer explicit deallocation to garbage collection. This might be because my parents drummed it into me as a child that I should put away my toys when I finished playing with them, not leave them lying around to be tidied away by someone else. The one thing that does bother me about explicit deallocation, though, is the possibility of memory fragmentation. Lately I have been wondering if it would be possible to fix this problem whilst retaining the advantages of explicit deallocation. Specifically what got me thinking about this is the crazy memory tricks I implemented for my fractal plotter, and how to generalize it to other programs. The result is something I call ravioli allocation (code here although it doesn't actually do anything at the time of writing).

The idea there is that you have a pile of objects which each take the same amount of memory. So there is a simple algorithm for keeping them defragmented - whenever you deallocate an object, move the highest object in memory into the space it vacated.
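
A sketch of that compaction rule with fixed-size slots (handles standing in for the pointer fix-ups, which the real scheme does by walking each object's pointer list):

```python
class RavioliHeap:
    """Fixed-size-slot heap that stays contiguous: freeing a slot moves the
    highest allocated object down into the hole."""
    def __init__(self):
        self.slots = []           # object payloads, always contiguous
        self.handles = []         # handle -> slot index, or None if freed
        self.slot_to_handle = []  # slot index -> owning handle

    def allocate(self, payload):
        self.slots.append(payload)
        self.handles.append(len(self.slots) - 1)
        self.slot_to_handle.append(len(self.handles) - 1)
        return len(self.handles) - 1

    def get(self, handle):
        return self.slots[self.handles[handle]]

    def free(self, handle):
        hole = self.handles[handle]
        self.handles[handle] = None
        last = len(self.slots) - 1
        if hole != last:
            # Move the highest object into the vacated slot and fix its
            # handle (standing in for fixing up every incoming pointer).
            self.slots[hole] = self.slots[last]
            moved = self.slot_to_handle[last]
            self.handles[moved] = hole
            self.slot_to_handle[hole] = moved
        self.slots.pop()
        self.slot_to_handle.pop()
```

After any sequence of allocations and frees, the live objects occupy a contiguous prefix of memory - there is never a hole to fragment around.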

There are two difficulties with doing this more generally:

  1. To move object X, you have to know where all the pointers pointing to it are so you can fix them up. This caused many long and painful debugging sessions with my fractal program.
  2. All the objects have to be the same size.

The second problem can be solved if the objects are large enough to hold two pointers, by simulating arrays and structures using binary trees (though access time for elements increases a bit - from O(1) to O(log n)).

The first problem can be solved by linking together in a linked list all the pointers which point to the same thing. The list needs to be a doubly linked list because we need to be able to remove a pointer from the list it's in. So, each "high level" pointer really consists of three "low level" pointers - one to the referent itself and two for the previous and next "high level" pointers in the list. An object itself will also need space for two other "low level" pointers - one to the first element of the linked list of pointers pointing to that object, and one to tell the allocator whether it contains 0, 1 or 2 "high level" pointers (this can also be used as a vtable pointer for other purposes).
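
That bookkeeping can be sketched directly - each object keeps a doubly linked list of the pointers that refer to it, so retargeting after a move is a walk of that list (a sketch with hypothetical class names; the real scheme packs these links into the object's fixed-size slots):

```python
class Pointer:
    """A "high level" pointer: the referent plus prev/next links in the
    referent's list of incoming pointers."""
    def __init__(self, target=None):
        self.target = None
        self.prev = None
        self.next = None
        if target is not None:
            self.point_at(target)

    def point_at(self, obj):
        self.unlink()
        self.target = obj
        self.next = obj.first_pointer
        if obj.first_pointer is not None:
            obj.first_pointer.prev = self
        obj.first_pointer = self

    def unlink(self):
        """Remove this pointer from its referent's incoming-pointer list."""
        if self.target is None:
            return
        if self.prev is None:
            self.target.first_pointer = self.next
        else:
            self.prev.next = self.next
        if self.next is not None:
            self.next.prev = self.prev
        self.prev = self.next = self.target = None

class Obj:
    def __init__(self, value):
        self.value = value
        self.first_pointer = None  # head of the incoming-pointer list

    def moved_to(self, new_obj):
        """Retarget every incoming pointer after the object has been moved."""
        p = self.first_pointer
        while p is not None:
            p.target = new_obj
            p = p.next
        new_obj.first_pointer = self.first_pointer
        self.first_pointer = None
```

The cost of a move is proportional to the number of incoming pointers, which is the expense discussed below.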

To avoid splitting objects across cache lines, we want the size of each object to be a power of two, so they will need to be able to hold 8 "low level" pointers (i.e. be 32 bytes on 32-bit systems, 64 bytes on 64-bit systems). That means that each 32-bit object can store 6 32-bit integers or 24 characters. On a 64-bit machine, an object could store 12 32-bit integers or 48 characters. So the small string optimization is very natural with ravioli allocation.

Once we have the ability to move objects around for defragmentation purposes, we might want to move them around for other purposes too. For example, if we expect that whenever object A is used, object B is also very likely to be used (and vice versa) we might want to put them on the same cache line. We can also set up separate memory arenas for separate purposes and move objects between them to improve locality for paging purposes.

One big downside of this system seems to be that deallocation (or moving an object in general) can be quite an expensive operation - chasing one pointer (and one potential cache miss) for each pointer that points to the moved object. However, I suspect that this wouldn't be as expensive as it sounds most of the time, since the objects that we move will tend to be ones which were allocated recently (because they will be in higher memory locations) and won't have been around long enough to have accumulated too many pointers. The objects that tend to have a lot of objects pointing to them will tend to be long lived and will migrate towards the beginning of memory (even if they don't start off there). However, walking this linked list should probably be done with non-temporal loads to avoid polluting the cache with a bunch of objects that you don't really care about. Also, consider the average case - each object can only point to two other objects, which means that the average number of pointers pointing to an object will be at most 2. And there is the nice property that a routine which doesn't deallocate data that it didn't allocate also won't move data that it didn't allocate, so performance of small routines can be quantified nicely.

There is a fair bit of overhead to all these linked lists - 44% in the best case (one big array) and 300% in the worst case (a linked list of small records). However, some of this is offset against existing overhead (heap information, vtable pointers, fragmentation, existing binary tree pointers, reference counts[1]) so for real programs the costs might be much smaller or even negative.

Multithreading might seem to be a problem at first - one thread can't hold on to a pointer to an object in another thread's heap. However, you can transfer data to or from another thread if the other thread is in a known (waiting) state. You have to be careful transferring data between threads, but that's true anyway.

Local (stack) variables present a bit of a problem - they can't hold pointers (unless you walk the entire stack looking for pointers whenever you deallocate something). One way would be to store "how to get there" (from a global root object which is allocated first and never moves) instead of "where it is" on the stack. Another would be for the stack itself to be allocated on the heap. This is probably the faster solution, and has some nice implications for implementing things like arbitrary closures and continuations. Hardware support would definitely help here though.

[1] There is a nice synergy between ravioli allocation and reference counting - whenever you change a pointer you remove it from its linked list, and if the linked list then contains zero entries you know you can remove the object you just stopped referencing. Cycles must be explicitly broken, though - you can't have a raw/weak pointer (because it would go stale when the object it referred to moved).

Edit 14th July 2013:

Here's a good rant against garbage collection.