I added arbitrary precision arithmetic to my Mandelbrot zoomer. Surprisingly, the difficult part wasn't the arbitrary precision arithmetic itself - it was deciding when and how to switch from double-precision to arbitrary precision arithmetic. Because my program reuses data computed at one zoom level to speed up the computation of more deeply zoomed images, some images contained pixels computed using both methods. The trouble is, the two methods don't give the same results. In fact, after some experimentation I discovered that the problem was even worse than that: even using just the fixed-precision code, you get different results depending on the number of digits you use. So the point that was represented in my program by 0xffa615ce + 0x00000002i (using 10.22-bit fixed point) escaped after 548 iterations, but the point 0xffa615ce00000000 + 0x0000000200000000i (using 10.54-bit fixed point) escaped after 384 iterations. The problem is that after every multiplication we shift the result right by the number of fractional bits, which performs a rounding (truncation) operation. The images generated look reasonable, but they are actually only approximations to the true images that would be calculated if infinite precision were employed.
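To make that concrete, here is a minimal sketch of the kind of fixed-point arithmetic involved. The names mul22 and iterations22 are mine, not from the program, and I'm assuming a compiler with arithmetic right shift on signed types, as gcc, clang and MSVC all provide:

    #include <stdint.h>

    /* Hypothetical names; a sketch, not the zoomer's actual code. */
    typedef int32_t fix22_t;  /* 10.22 fixed point: 10 integer bits, 22 fractional */

    /* Multiply two 10.22 values. The 64-bit product has 44 fractional bits;
       shifting right by 22 discards the low 22 of them, and that truncation
       is why the same point can escape after different iteration counts at
       different precisions. */
    static fix22_t mul22(fix22_t a, fix22_t b)
    {
        return (fix22_t)(((int64_t)a * b) >> 22);
    }

    /* Escape-time iteration of z -> z^2 + c entirely in 10.22 fixed point. */
    static int iterations22(fix22_t cr, fix22_t ci, int max_iter)
    {
        fix22_t zr = 0, zi = 0;
        int n;
        for (n = 0; n < max_iter; n++) {
            fix22_t zr2 = mul22(zr, zr);
            fix22_t zi2 = mul22(zi, zi);
            if (zr2 + zi2 > (4 << 22))     /* |z|^2 > 4: the point escaped */
                break;
            zi = mul22(zr + zr, zi) + ci;  /* 2*zr*zi + ci */
            zr = zr2 - zi2 + cr;
        }
        return n;
    }

The same loop over 10.54-bit values keeps 32 more fractional bits at every truncation, so the two versions gradually drift apart and eventually disagree about the escape count.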
Having determined this, I realized it would be necessary to throw away all the points computed at lower precision once we started using a higher precision. This isn't as bad as it sounds, since (when zooming in by a factor of 2) the recycled points account for only 1/4 of the computed points. But if we just threw them all out at once, we'd get a visible artifact when passing through the precision boundary - the entire window would go black and calculation would restart from scratch. I wanted to avoid this if possible, so I started thinking about convoluted schemes involving restarting a point if its three sibling points were of higher precision. Then I realized that I could just recompute the points but keep their old colours on screen when splitting blocks across the precision boundary, avoiding the glitch. There are still some artifacts while the image is being computed, but they are less jarring.
Finally there was the problem of interfacing with double-precision floating-point numbers. The Mandelbrot set is contained entirely within the circle |c|<2, and in this range a double has 52 bits of fractional precision. But if we replace the 10.22 and 10.54-bit fixed-point numbers with doubles, we'll have a region of blockiness where we need 53 or 54 bits but only have 52. Rather than having two sorts of 64-bit precision numbers, I decided to simplify things and have 12 integer bits in my fixed-point numbers. The program is very slow when it switches into arbitrary precision mode - it's barely optimized at all. The 96-bit precision code currently has a theoretical maximum speed roughly 95 times slower than the SSE double-precision code (2 million iterations per second vs 190 million). It could be significantly faster with some hand-tuned assembly, though - I have a 96-bit squaring routine that should speed it up by an order of magnitude or so. All of the multiplications in the Mandelbrot inner loop can be replaced with squarings, since [latex]2xy = (x+y)^2 - x^2 - y^2[/latex]. Squaring is a bit faster than multiplication for arbitrary precision integers, since you only need to do [latex]\displaystyle \frac{n(n+1)}{2}[/latex] digit multiplications instead of [latex]n^2[/latex]. Given that we are calculating [latex]x^2[/latex] and [latex]y^2[/latex] anyway and the subtractions are very fast, this should be a win overall.
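To illustrate the identity, here's the inner loop arranged so that the only products are squarings - shown with doubles for brevity; in the real code each of the three squarings would be a call to the 96-bit bignum squaring routine:

    /* Sketch only: z -> z^2 + c using three squarings and no general
       multiplication. The 2xy term comes from (x+y)^2 - x^2 - y^2. */
    static int iterate(double cx, double cy, int max_iter)
    {
        double x = 0.0, y = 0.0;
        int n;
        for (n = 0; n < max_iter; n++) {
            double x2 = x * x;              /* needed for the escape test anyway */
            double y2 = y * y;
            if (x2 + y2 > 4.0)
                break;
            double s2 = (x + y) * (x + y);  /* the third squaring */
            y = s2 - x2 - y2 + cy;          /* 2xy + cy */
            x = x2 - y2 + cx;               /* x^2 - y^2 + cx */
        }
        return n;
    }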
Oh, god, this is two YEARS late, and I'm sorry for that, but you should use faster multiplication methods! My favorite is the O(n^{log2 3}) ≈ O(n^{1.585}) method, which is significantly better than naive squaring and is also remarkably simple.
If you are trying to square A*2^n + B, the answer is A^2 2^{2n} + 2 A B 2^n + B^2. This can be done by computing A^2, A*B, and B^2, but that is two squarings and a multiplication - basically what you did up there. Instead, compute A^2, B^2, and (A+B)^2, and then assemble A^2 * 2^{2n} + ((A+B)^2 - A^2 - B^2)*2^n + B^2. Let T(n) be the number of digit operations needed to square an n-digit number. Since each step does three half-size squarings plus some linear-time additions and subtractions, T(n) = 3 T(n/2) + O(n), and by the master theorem T(n) = O(n^{log2 3}) ≈ O(n^{1.585}).
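Here's a sketch of one level of this scheme on a 64-bit operand, assuming a compiler with unsigned __int128 (gcc or clang) for the 128-bit result. Applied recursively to wider and wider operands, this three-squarings structure is what gives the O(n^{log2 3}) bound:

    #include <stdint.h>

    /* Split x = A*2^32 + B; then
       x^2 = A^2*2^64 + ((A+B)^2 - A^2 - B^2)*2^32 + B^2,
       i.e. three half-width squarings instead of four half-width products. */
    static unsigned __int128 sqr64(uint64_t x)
    {
        uint64_t a = x >> 32, b = (uint32_t)x;
        unsigned __int128 a2  = (unsigned __int128)a * a;
        unsigned __int128 b2  = (unsigned __int128)b * b;
        unsigned __int128 s2  = (unsigned __int128)(a + b) * (a + b);
        unsigned __int128 mid = s2 - a2 - b2;  /* = 2ab, at most 65 bits */
        return (a2 << 64) + (mid << 32) + b2;
    }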
More generally, if N = 2^n (for the sake of clearer notation), to find (AN+B)(CN+D), first compute AC, BD, and (A+B)(C+D); then (AN+B)(CN+D) = AC N^2 + ((A+B)(C+D) - AC - BD) N + BD.
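A matching sketch of the general product, again assuming unsigned __int128 - three half-width multiplications instead of four:

    static unsigned __int128 mul64(uint64_t x, uint64_t y)
    {
        uint64_t a = x >> 32, b = (uint32_t)x;
        uint64_t c = y >> 32, d = (uint32_t)y;
        unsigned __int128 ac  = (unsigned __int128)a * c;
        unsigned __int128 bd  = (unsigned __int128)b * d;
        unsigned __int128 s   = (unsigned __int128)(a + b) * (c + d);
        unsigned __int128 mid = s - ac - bd;  /* = ad + bc */
        return (ac << 64) + (mid << 32) + bd;
    }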
I love this method because it's significantly faster than schoolbook multiplication, and it's so easy to implement.
Really? Easy to implement? It's Karatsuba multiplication you're describing, and it's not at all simple to implement at the level of assembler optimization that matters here. Go ahead and try it.