I have been poking again at my fractal plotting program recently. The latest version is available here (source). I've added multi-threading (so all available cores will be used for calculation) and Direct3D rendering (which makes zooming much smoother and also allows real-time rotation with the Z and X keys - something I've not seen any other fractal program do). There is still some jerkiness when zooming in or out (especially past a power-of-two boundary), but it's greatly reduced (especially when there's lots of detail), and with a little more work (breaking up the global matrix operations) it should be possible to eliminate it altogether.
The linked grid data structure in the last version I posted worked reasonably well, but it used too much memory. At double-precision, every block needed 72 bytes even after it had been completed. This means that an image that covered a 1600x1200 screen with detail at 16 texels per pixel for anti-aliasing would use more than 2Gb. I could use a different block type for completed points but there would still need to be at least 4 pointers (one for each direction) at each point, which could take 500Mb by themselves.
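To make the arithmetic concrete (the 72-byte and 4-pointer figures are the ones above; 4-byte pointers on a 32-bit build are assumed):

```cpp
#include <cstdio>

int main()
{
    // A 1600x1200 screen with detail at 16 texels per pixel for anti-aliasing:
    long long points = 1600LL * 1200 * 16;       // 30,720,000 points

    // Linked grid structure: 72 bytes per block, even once completed.
    long long linkedGridBytes = points * 72;     // ~2.2e9 bytes - more than 2Gb

    // A slimmed-down completed-point type would still need 4 neighbour
    // pointers per point (assuming 4 bytes each on a 32-bit build).
    long long pointerBytes = points * 4 * 4;     // ~4.9e8 bytes - about 500Mb

    printf("linked grid total: %lld bytes\n", linkedGridBytes);
    printf("pointers alone:    %lld bytes\n", pointerBytes);
    return 0;
}
```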
So I decided to switch to a data structure based on a quadtree, but with some extra optimizations - at each node, one can have not just 2x2 children but 2^n by 2^n for any positive integer n. When a node contains 4 similar subtrees (each having the same size and subnode type) it is consolidated into a single 2^(n+1) by 2^(n+1) grid. If we need to change the type of one of the subnodes of this grid, we do the reverse operation (a split). In this way, if we have a large area of completed detail it will use only 4 bytes per point (storing the final number of iterations). This works much better - memory usage can still be quite high during processing (but seems to top out at ~700Mb rather than ~2Gb for the linked grid data structure) and goes right back down to ~200Mb when the image is completed (only about 35Mb of which is the data matrix, the rest is Direct3D textures).
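Roughly, a node in this scheme might look something like the following (a simplified sketch rather than the program's actual code - the names are invented and the real thing needs more bookkeeping - but it shows why a consolidated grid of completed points costs only 4 bytes per point: the cell type is stored once per grid, not once per cell):

```cpp
#include <cstdint>
#include <vector>

// Every cell in a grid has the same type, recorded once for the whole grid.
enum class CellType { Completed, Subtree };

// A grid node holds 2^n by 2^n cells rather than the 2x2 of a plain quadtree.
struct Grid
{
    int logSize;                       // n: the grid is (1 << n) cells on a side
    CellType cellType;                 // what the cells of this grid hold

    // Exactly one of these is used, depending on cellType:
    std::vector<uint32_t> iterations;  // Completed: final iteration counts, 4 bytes/point
    std::vector<Grid*> children;       // Subtree: pointers to smaller grids

    int index(int x, int y) const { return (y << logSize) + x; }
};

// Consolidation: when a node's 4 subtrees are similar (same size, same cell
// type), they are merged into one grid twice the size on each side. Changing
// the type of one subnode of a consolidated grid requires the reverse
// operation - a split back into smaller grids.
```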
One disadvantage of the grid tree method is that it doesn't naturally enforce the rule that prevents 4 or more blocks from lying along one side of any single block. That rule is important for ensuring that we don't lose details that protrude through a block edge. However, this turns out to be pretty easy to add back in explicitly. This gives a more efficient solid-guessing method than I've seen in any other fractal program (by which I mean it calculates the fewest possible points in areas of solid colour).
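Since block sizes are powers of two, adding the rule back in explicitly boils down to a size comparison between a block and its smallest neighbour along each side - something like this (again just an illustrative sketch, not the program's code):

```cpp
// If the smallest block along one side of a larger block is 4 or more times
// smaller, then 4 or more blocks fit along that side and detail could protrude
// through the edge unnoticed - so the larger block has to be split.
bool violatesEdgeRule(int blockSize, int smallestNeighbourSize)
{
    return blockSize >= 4 * smallestNeighbourSize;
}
```

Splitting the larger block can in turn make one of its own neighbours violate the rule, so the check has to ripple outwards until nothing violates it.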
This data structure was surprisingly difficult to get right. The reason is that with all the splitting and consolidating of grids, one must avoid keeping grid pointers around (since they can be invalidated when a split or consolidation happens in a nearby part of the tree). That means many algorithms that would be natural to write recursively need to be written iteratively instead (since the recursive version would keep grid pointers on the call stack).
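One way to do that is to describe a position in the tree as the path of child coordinates from the root and re-resolve it whenever it's needed, instead of caching a grid pointer. A sketch of the idea, reusing the hypothetical Grid type from the earlier listing (again an illustration, not the actual code):

```cpp
#include <utility>
#include <vector>

// A position recorded as (x, y) child coordinates at each level. Splits,
// consolidations and deletions change which Grid objects exist, but the
// coordinates of a point in the image stay meaningful.
struct TreePath
{
    std::vector<std::pair<int, int>> steps;
};

// Walk down from the root every time the position is needed, rather than
// holding a Grid* across an operation that might invalidate it.
Grid* resolve(Grid* root, const TreePath& path)
{
    Grid* g = root;
    for (auto [x, y] : path.steps)
    {
        if (g->cellType != CellType::Subtree)
            break;                            // the area below is already completed
        Grid* next = g->children[g->index(x, y)];
        if (next == nullptr)
            break;                            // nothing further down this path
        g = next;
    }
    return g;
}
```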
Another tricky thing to get right was the memory allocation. My first attempt just used malloc() and free() to allocate and deallocate memory for the grids. However, the way grids are used is that we allocate a whole bunch of them and then deallocate most of them. They are also quite small, and the ones that don't get deallocated end up spread throughout memory. This means that very little memory can be reclaimed by the operating system, since nearly every memory page has at least one remaining block (or part of one) on it.
The problem here is that the malloc()/free() interface assumes that a block, once allocated, never moves. However, grids turn out to be quite easy to move compared to general data structures. This is because the grid tree is only modified by one thread at a time (it must be, since modifying it can invalidate any part of it) and also because we know where all the grid pointers are (we have to, for the split and consolidate operations). Only a dozen or so different grid sizes are currently used in practice, so it's quite practical to have a separate heap for each possible grid size. If all the objects in a heap are the same size, there's an extremely simple compaction algorithm - whenever we delete an item, just move the highest item in the heap down into the newly freed space. So a heap consists of items allocated in chunks of about 1Mb, with the chunks kept in a singly-linked list. Each chunk is a contiguous region of memory, but chunks within a heap don't have to be contiguous with each other. This allows the heaps to interleave and make the best use of address space, which is scarce on 32-bit machines.
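A minimal sketch of that kind of heap - one item size, roughly 1Mb chunks, and compaction by moving the highest item into the hole. It treats items as raw bytes (the grids have to be trivially movable for any of this to work), and the 'moved' callback stands in for the real pointer fixup, which is possible because the tree knows where all the grid pointers are:

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <functional>
#include <vector>

// A heap for objects of one fixed size. Deleting an item moves the highest
// item in the heap down into the hole, so the heap stays fully compacted.
class FixedSizeHeap
{
public:
    FixedSizeHeap(size_t itemSize, size_t chunkBytes = 1 << 20)
      : _itemSize(itemSize), _itemsPerChunk(chunkBytes / itemSize) { }

    ~FixedSizeHeap() { for (void* c : _chunks) free(c); }

    void* allocate()
    {
        if (_count == _chunks.size() * _itemsPerChunk)
            _chunks.push_back(malloc(_itemsPerChunk * _itemSize));
        return itemAt(_count++);
    }

    // moved(from, to) lets the caller update any pointers to the item that
    // gets relocated into the freed slot.
    void deallocate(void* item, std::function<void(void*, void*)> moved)
    {
        void* highest = itemAt(--_count);
        if (highest != item) {
            memcpy(item, highest, _itemSize);   // compact: fill the hole
            moved(highest, item);
        }
        // If the last chunk is now empty, hand it back to the OS.
        if (!_chunks.empty() && _count <= (_chunks.size() - 1) * _itemsPerChunk) {
            free(_chunks.back());
            _chunks.pop_back();
        }
    }

private:
    void* itemAt(size_t i)
    {
        return static_cast<char*>(_chunks[i / _itemsPerChunk])
            + (i % _itemsPerChunk) * _itemSize;
    }

    size_t _itemSize, _itemsPerChunk, _count = 0;
    std::vector<void*> _chunks;
};
```

Because each heap only ever holds objects of one size there is no fragmentation within a chunk, and a chunk that empties out completely can be released - which is what lets memory usage drop back down once the image is finished.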
Sounds simple enough, but again there turned out to be complications. Now it's not just split and consolidate that can invalidate grid pointers - deleting a grid can do so too. This means more recursive algorithms that need to be made iterative (such as deleting an entire subtree), and certain operations need to be done in a very specific order to avoid keeping more grid pointers around than necessary. I now understand why such techniques are so rarely used - they're really difficult to get right. I do think that a general garbage-collected heap would be slower and use more memory than this special-purpose allocator, but I'm not sure whether the difference would be large enough to make all this effort worthwhile. Perhaps someone will port my code to a garbage-collected language so the performance can be compared.
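For example, deleting an entire subtree can be done with a loop that re-walks from the root each time, remembering only child slot indices, and that unlinks each grid from its parent before freeing it. This is only my illustration of that pattern (built on the hypothetical Grid sketch above, with a plain delete standing in for the compacting heaps), not the program's actual code:

```cpp
#include <vector>

// Iteratively delete the subtree hanging off root->children[startSlot].
// 'root' is assumed to live outside the compacting heaps; no other Grid*
// is kept across a free, because freeing a grid may move another one.
void deleteSubtree(Grid* root, int startSlot)
{
    while (root->cellType == CellType::Subtree && root->children[startSlot] != nullptr)
    {
        // Descend to a grid with no remaining subtree children, recording
        // only the child slots taken, not pointers.
        std::vector<int> path{startSlot};
        Grid* g = root->children[startSlot];
        while (g->cellType == CellType::Subtree)
        {
            bool found = false;
            for (int i = 0; i < (int)g->children.size(); ++i)
                if (g->children[i] != nullptr)
                {
                    path.push_back(i);
                    g = g->children[i];
                    found = true;
                    break;
                }
            if (!found)
                break;
        }
        // Re-resolve the leaf's parent from the root, unlink the leaf first
        // and only then free it, so the tree never points at freed (or, with
        // the real allocator, possibly relocated) memory.
        Grid* parent = root;
        for (int i = 0; i + 1 < (int)path.size(); ++i)
            parent = parent->children[path[i]];
        Grid* leaf = parent->children[path.back()];
        parent->children[path.back()] = nullptr;
        delete leaf;    // in the real program: return it to its per-size heap
    }
}
```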