I was writing some audio code recently and it gave me an idea for an audio compression strategy. I don't know enough about audio compression to know for sure if it's been done before, but here goes anyway. The idea is that you have a selection of short, repeating waveforms (e.g. square wave, sine wave, sawtooth wave, white noise) and multiple channels each of which can independently play one of these waveforms at some particular frequency and volume. This is almost exactly the way that my physical tone matrix renders audio.
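To make the model concrete, here's a minimal sketch of that kind of renderer (Python/NumPy; the table length, sample rate and parameter values are illustrative, not what the tone matrix actually uses):

```python
import numpy as np

SAMPLE_RATE = 16000   # assumed playback rate in Hz
WAVE_LEN = 256        # assumed length of each repeating waveform table

def mix(waveforms, channels, n_samples):
    """Mix several channels, each playing one short repeating waveform
    at its own frequency and volume (a wavetable oscillator per channel).

    waveforms : (n_waves, WAVE_LEN) array of repeating tables
    channels  : list of (wave_index, frequency_hz, volume) per channel
    """
    t = np.arange(n_samples)
    out = np.zeros(n_samples)
    for wave_index, freq, vol in channels:
        # Step through the table at a rate proportional to the frequency.
        idx = (t * freq * WAVE_LEN / SAMPLE_RATE).astype(int) % WAVE_LEN
        out += vol * waveforms[wave_index][idx]
    return out
```

For example, `mix(waveforms, [(0, 440.0, 0.5), (2, 220.0, 0.25)], SAMPLE_RATE)` would render one second with a 440Hz tone on one channel and a quieter 220Hz tone on another.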
I was thinking that with enough channels, a large enough selection of waveforms, frequencies and volumes to choose from, and the ability to change the parameters quickly enough, one could make a (perceptually) good approximation to any input waveform. And because the model roughly reflects how most of the things we listen to are created in the first place (a piece of music consists of multiple instruments playing at different volumes and pitches, but each generally consists of a fundamental tone with overtones of various intensities which might change throughout the course of any given note), this might actually give a pretty good approximation in a very small space.
One big problem is how to choose the right pitch, volume and waveform for each channel. The first thing I tried was using a genetic algorithm - the set of waveforms plus the per-channel pitch/volume/waveform selections being the "genome" and the rendered waveform being the "phenome". The fitness function is just mean squared error (though eventually I'd like to switch to a more sophisticated psycho-acoustic model like that used by MP3 encoders). There is a population of genomes and the best ones survive, mutate and recombine to make the new population. Recomputing the phenome is not too expensive as the computations are very simple and each pitch/volume/waveform datum only affects a small part of the phenome. Changing a bit in one of the waveforms is more expensive, though, as you have to go through all the pieces of the phenome that use that waveform.
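In rough outline the encoder looks something like this (a Python/NumPy sketch; the frame length, parameter ranges, mutation scheme and selection strategy are all illustrative rather than what my code actually does, and a real implementation would update the phenome incrementally after each mutation instead of re-rendering it from scratch):

```python
import numpy as np

rng = np.random.default_rng(0)

N_WAVES, WAVE_LEN = 4, 256      # illustrative sizes
N_CHANNELS, FRAME = 8, 256
SAMPLE_RATE = 16000

def random_genome(n_frames):
    """A genome: the waveform tables plus per-frame, per-channel
    (waveform index, frequency, volume) parameters."""
    waves = rng.uniform(-1, 1, (N_WAVES, WAVE_LEN))
    params = np.empty((n_frames, N_CHANNELS, 3))
    params[..., 0] = rng.integers(N_WAVES, size=(n_frames, N_CHANNELS))
    params[..., 1] = rng.uniform(50, 2000, (n_frames, N_CHANNELS))
    params[..., 2] = rng.uniform(0, 1, (n_frames, N_CHANNELS))
    return waves, params

def render(genome, n_frames):
    """Render the 'phenome'.  Phase is reset at each frame for simplicity."""
    waves, params = genome
    out = np.zeros(n_frames * FRAME)
    t = np.arange(FRAME)
    for f in range(n_frames):
        for w, freq, vol in params[f]:
            idx = (t * freq * WAVE_LEN / SAMPLE_RATE).astype(int) % WAVE_LEN
            out[f * FRAME:(f + 1) * FRAME] += vol * waves[int(w)][idx]
    return out

def fitness(genome, target):
    """Negated mean squared error against the input waveform."""
    n_frames = len(target) // FRAME
    return -np.mean((render(genome, n_frames) - target[:n_frames * FRAME]) ** 2)

def mutate(genome):
    """Randomise one channel's parameters in one frame."""
    waves, params = genome[0].copy(), genome[1].copy()
    f, c = rng.integers(params.shape[0]), rng.integers(N_CHANNELS)
    params[f, c] = [rng.integers(N_WAVES), rng.uniform(50, 2000), rng.uniform(0, 1)]
    return waves, params

def crossover(a, b):
    """Recombine two genomes by splicing their parameter arrays."""
    cut = rng.integers(a[1].shape[0])
    return a[0].copy(), np.concatenate([a[1][:cut], b[1][cut:]])

def evolve(target, pop_size=30, generations=100):
    n_frames = len(target) // FRAME
    pop = [random_genome(n_frames) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda g: fitness(g, target), reverse=True)
        survivors = pop[:pop_size // 2]                  # best genomes survive
        children = []
        while len(survivors) + len(children) < pop_size:
            a = survivors[rng.integers(len(survivors))]
            b = survivors[rng.integers(len(survivors))]
            children.append(mutate(crossover(a, b)))     # recombine and mutate
        pop = survivors + children
    return pop[0]
```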
Unfortunately it doesn't seem to be converging on the input waveform at all yet (it probably would if I left it running long enough; it's just far too slow). The next thing I want to try is seeding the initial genomes with some kind of reasonable approximation to the input waveform, and seeding the waveforms themselves with some random set of overtones modulated by a range of power laws. To seed the initial genomes, I'll need to break the input waveform into pieces, do an FFT on each piece to find the frequencies, pick the best waveform/pitch/volume combination for that piece, then subtract it and repeat until we've picked a waveform for each channel.
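Something like the following is what I have in mind for the seeding step (a sketch only; the greedy peak-picking and the least-squares volume fit are my assumptions about how "pick the best combination" might work):

```python
import numpy as np

def seed_frame(frame, waveforms, n_channels, sample_rate=16000):
    """Greedy seeding for one piece of the input: repeatedly find the
    strongest remaining frequency with an FFT, fit the best
    waveform/volume at that pitch, subtract it, and move on to the
    next channel.

    frame     : 1-D array, one piece of the input waveform
    waveforms : (n_waves, wave_len) array of repeating tables
    """
    residual = frame.astype(float)
    wave_len = waveforms.shape[1]
    t = np.arange(len(frame))
    picks = []
    for _ in range(n_channels):
        spectrum = np.abs(np.fft.rfft(residual))
        spectrum[0] = 0                               # ignore DC
        peak_bin = np.argmax(spectrum)
        freq = peak_bin * sample_rate / len(frame)    # strongest frequency in Hz
        best = None
        for w, table in enumerate(waveforms):
            idx = (t * freq * wave_len / sample_rate).astype(int) % wave_len
            candidate = table[idx]
            # Least-squares volume for this waveform at this pitch.
            vol = np.dot(candidate, residual) / (np.dot(candidate, candidate) + 1e-12)
            err = np.sum((residual - vol * candidate) ** 2)
            if best is None or err < best[0]:
                best = (err, w, freq, vol, candidate)
        err, w, freq, vol, candidate = best
        picks.append((w, freq, vol))
        residual -= vol * candidate                   # subtract and repeat
    return picks
```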
Even if this codec doesn't beat existing codecs on quality-per-bitrate metrics, it could still be extremely useful because very little CPU power is required to play it back (I've already demonstrated that a 16MHz Atmel AVR ATMega32 can do 16 channels of this kind of rendering in real time at 16kHz using less than half its capacity). If you premultiply each waveform by each possible volume, you can play it back on hardware that doesn't even have a multiplier (at significant cost to quality and/or bitrate).
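The premultiplication trick looks something like this (sketched in Python just to show the shape of the tables; the real thing would be fixed-point C or assembly on the AVR, and the number of volume levels here is an assumption):

```python
import numpy as np

N_VOLUMES = 16   # assumed number of discrete volume levels

def premultiply(waveforms):
    """Build tables[wave][volume] so playback needs only lookups and adds."""
    volumes = np.arange(N_VOLUMES) / (N_VOLUMES - 1)
    return waveforms[:, None, :] * volumes[None, :, None]

def mix_sample(tables, channel_state):
    """One output sample: sum one table entry per channel, no multiplies.

    channel_state : list of (wave_index, volume_index, phase_index) per channel
    """
    total = 0.0
    for wave, volume, phase in channel_state:
        total += tables[wave, volume, phase]
    return total
```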
Another fun thing about this system is that you could train it on a large number of sounds at once and come up with a set of waveforms that are good for compressing lots of different things. These could then be baked into the encoder and decoder instead of being transmitted with the rest of the data, leading to further savings. I wonder what the optimal waveforms would sound like.
This idea owes something to Mahoney's c64mp3 and cubase64 demos (unfortunately Mahoney's site breaks inbound links but the cubase64 link is in the 2nd of October 2010 section under "Recent Changes"). However, I think the sound quality of this scheme is likely to be much better than c64mp3's, since we're not limited to one channel plus noise.