In an attempt to get a little extra speed in my fractal program, I decided to try rewriting the innermost loop to use 32-bit fixed-point numbers instead of double-precision floating-point numbers. The precision would be lower, but I need to write some precision-switching code anyway. I had noticed that my CRT simulator got faster when I switched from floating-point to integer code, so I assumed (incorrectly, it turns out) that the same would be true in this program.
It turns out that the best way to multiply two signed 32-bit integers and then shift the 64-bit result right by 22 bits (the number of fractional bits I'm using) is to use the one-operand IMUL instruction followed by the SHRD instruction. Unfortunately the only register you get to pick with this combination is one of the input registers - the other input has to be in EAX, the 64-bit product lands in EDX:EAX, and the shifted result ends up back in EAX. This is a problem because it means that you have to load EAX right before every multiplication and stash the result somewhere else right after, before starting the next multiplication. All this shuffling slows the code right down - my 2 GHz Core Duo laptop peaks at ~150 million iterations per second for the double-precision routine and ~100 million iterations per second for the integer routine. To make matters worse, you also lose the use of the EDX register (which is stomped by the IMUL), so even with frame pointer omission you're down to just 5 free registers (ESI, EDI, EBP, ECX and EBX).
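For reference, here is roughly what that multiply looks like when written in C rather than assembly. This is only a sketch: the names `fixed_mul` and `FRAC_BITS` are made up for illustration, and it assumes the compiler implements `>>` on a negative signed value as an arithmetic shift (true of the mainstream compilers, but implementation-defined in the C standard) and that the shifted result fits in 32 bits.

```c
#include <stdint.h>

/* Number of fractional bits in the fixed-point format (22, as in the text). */
#define FRAC_BITS 22

/* Multiply two signed fixed-point values with FRAC_BITS fractional bits.
 * The operands are widened so the full 64-bit product is kept, then the
 * product is shifted back down to the same fixed-point format.
 * Assumes >> on a negative int64_t is an arithmetic shift and that the
 * result fits in an int32_t (no overflow handling). */
static inline int32_t fixed_mul(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)(product >> FRAC_BITS);
}
```

For example, with 1.0 represented as `1 << 22`, `fixed_mul(3 << 21, 2 << 22)` multiplies 1.5 by 2.0 and gives `3 << 22`, i.e. 3.0. What instructions a compiler actually emits for this will vary; the point is just to pin down the operation the IMUL/SHRD pair is performing.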
Another possible way is to use the MMX registers and the PMULUDQ instruction, but that has two problems: the multiplication is unsigned, and there's no 64-bit arithmetic right-shift in the MMX/SSE ISA (PSRAW and PSRAD only handle 16-bit and 32-bit elements). So it seems unlikely that it could be made faster than the IMUL version.
This makes me wonder if floating-point instructions would also be faster for other uses where integers have traditionally reigned supreme - bignums, for example. Instead of storing 32 bits of precision in each 32-bit register, you can store 52 bits of precision in the mantissa part of each 64-bit register. There is a little extra complexity involved since the floating-point units aren't designed for this (and for multiplication you need to do four 26-bit*26-bit->52-bit multiplies instead of one 52-bit*52-bit->104-bit multiply). However, the extra bits might make this a win. Poking around at the GMP source suggests that they aren't currently using any such tricks for x86, though there is some FPU hackery on other architectures.
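To make the four-multiply scheme concrete, here is a minimal sketch in C of multiplying two 52-bit limbs held in doubles. Everything here is hypothetical (the names `limb_pair`, `limb_mul` and `floor_nonneg` are invented for illustration, and this is not how GMP or any other library does it). It assumes IEEE 754 doubles, and inputs that are non-negative integers below 2^52. The splitting step uses a cast to `int64_t` purely to keep the example short and portable; a real implementation would presumably want to stay in the FPU.

```c
#include <stdint.h>

/* A 104-bit product held as two 52-bit limbs: value = hi * 2^52 + lo. */
typedef struct { double hi, lo; } limb_pair;

/* Floor of a non-negative double below 2^63, via truncation. */
static double floor_nonneg(double x)
{
    return (double)(int64_t)x;
}

/* Multiply two non-negative integers a, b < 2^52 stored in doubles.
 * Every intermediate value is an integer below 2^53, so each step is
 * exact in double precision. */
static limb_pair limb_mul(double a, double b)
{
    const double two26 = 67108864.0;          /* 2^26 */
    const double two52 = 4503599627370496.0;  /* 2^52 */

    /* Split each operand into 26-bit halves: x = xh * 2^26 + xl. */
    double ah = floor_nonneg(a / two26), al = a - ah * two26;
    double bh = floor_nonneg(b / two26), bl = b - bh * two26;

    /* The four 26-bit * 26-bit -> 52-bit partial products. */
    double hh = ah * bh;
    double ll = al * bl;
    double mid = ah * bl + al * bh;           /* below 2^53, still exact */

    /* Split the middle term at 2^26 so each part lines up with a limb. */
    double mid_hi = floor_nonneg(mid / two26);
    double mid_lo = mid - mid_hi * two26;

    /* Low limb, with at most one carry into the high limb. */
    double lo = ll + mid_lo * two26;          /* below 2^53 */
    double carry = floor_nonneg(lo / two52);  /* 0 or 1 */
    lo -= carry * two52;

    limb_pair result = { hh + mid_hi + carry, lo };
    return result;
}
```

The extra splitting and carry handling is the "little extra complexity" mentioned above; whether the 52-bit limbs pay for it would have to be measured.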