Archive for the ‘computer’ Category

Floating-point faster than integers

Saturday, July 3rd, 2010

In an attempt to get a little extra speed in my fractal program, I decided to try rewriting the innermost loop to use 32-bit fixed-point numbers instead of double-precision floating-point numbers. The precision would be lower, but I need to write some precision-switching code anyway. I had noticed that my CRT simulator got faster when I switched from floating-point to integer code, so I assumed (incorrectly, it turns out) that the same would be true in this program.

It turns out that the best way to multiply two signed 32-bit integers and then shift the 64-bit result right by 22 bits (the number of fractional bits I'm using) is to use the one-operand IMUL instruction followed by the SHRD instruction. Unfortunately the only register you get to pick with this combination is one of the input registers - the other input has to be loaded into EAX, and (after the SHRD) the result ends up in EAX as well. This is a problem because it means that you have to load EAX right before every multiplication and stash the result somewhere else right after, before starting the next multiplication. All this shuffling slows the code right down - my 2GHz Core Duo laptop peaks at ~150 million iterations per second for the double-precision routine and ~100 million iterations per second for the integer routine. To make matters worse, you also lose the use of the EDX register (which is stomped by the IMUL), so even with frame pointer omission you're down to just 5 registers (ESI, EDI, EBP, ECX and EBX).
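For reference, this is roughly what that sequence looks like wrapped in a C helper - a minimal sketch assuming a 32-bit x86 target and GCC extended inline assembly; the function name and wrapper are just for illustration, not code from the actual program:

#include <stdint.h>

/* Sketch only: multiply two fixed-point values with 22 fractional bits
   using the one-operand IMUL followed by SHRD. */
static inline int32_t fixmul(int32_t a, int32_t b)
{
    int32_t result;
    __asm__ ("imull %2\n\t"              /* EDX:EAX = EAX * b (signed 64-bit product) */
             "shrdl $22, %%edx, %%eax"   /* shift the 64-bit product right by 22 bits */
             : "=a" (result)             /* the result comes back in EAX */
             : "0" (a), "rm" (b)         /* a has to be loaded into EAX first */
             : "edx", "cc");             /* EDX is stomped by the IMUL */
    return result;
}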

Another possible way is to use the MMX registers and the PMULUDQ instruction, but that has two problems: the multiplication is unsigned and there's no 64-bit arithmetic right-shift in the MMX/SSE ISA, so it seems unlikely that it could be made faster than the IMUL version.

This makes me wonder if floating-point instructions would also be faster for other uses where integers have traditionally reigned supreme. Bignums for example. Instead of storing 32 bits of precision in each 32-bit register, you can store 52 bits of precision in the mantissa part of each 64-bit register. There is a little extra complexity involved since the floating point units aren't designed for this (and for multiplication you need to do four 26-bit*26-bit->52-bit multiplies instead of one 52-bit*52-bit->104-bit multiply). However, the extra bits might make this a win. Poking around at the gmp source suggests that they aren't currently using any such tricks for x86, though there is some FPU hackery on other architectures.
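As a sketch of the idea (not tested, and the splitting helper is made up for illustration): each partial product of two 26-bit halves fits exactly in a double, so a 52-bit by 52-bit multiply could look something like this, with the carry folding left out:

#include <stdint.h>

/* Sketch only: multiply two non-negative integer values of up to 52 bits,
   each held exactly in a double, by splitting them into 26-bit halves so
   that every partial product fits exactly in the 52-bit mantissa.
   Folding 'mid' into 'hi' and 'lo' with proper carries is omitted. */
static void mul52(double a, double b, double *hi, double *mid, double *lo)
{
    const double split = 67108864.0;            /* 2^26 */
    double aHi = (double)(int64_t)(a / split);  /* top 26 bits of a */
    double aLo = a - aHi * split;               /* bottom 26 bits of a */
    double bHi = (double)(int64_t)(b / split);
    double bLo = b - bHi * split;

    *hi  = aHi * bHi;                           /* bits 52..103 of the product */
    *mid = aHi * bLo + aLo * bHi;               /* bits 26..78, still exact */
    *lo  = aLo * bLo;                           /* bits 0..51 */
}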

Very low-level programming

Friday, July 2nd, 2010

In most computer programs where speed is critical, there are a small number of chunks of code in which the CPU spends most of its time - the innermost loops. For a fractal program, it will be the iteration function. For a high-precision number library, it will probably be the multiplication routine. For a video encoder, it might be the loop that searches for motion compensation vectors.

In each case, programmers will generally spend a lot of time optimizing these innermost loops for the most important CPU microarchitectures in order to squeeze out every last drop of performance where it really matters. This often involves hand-writing assembly code, modelling the execution pipelines of the target machine and poring over optimization documentation.

Doing this, programmers will often think things like "if only this CPU had an instruction that does X, which could be easily done in hardware, this routine could be so much faster". This is why every couple of years or so a new set of instructions is added to the x86 architecture: MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX. These instructions are getting more and more complicated with time, which makes the optimization process all the more difficult - one might have to look at dozens of ways of doing something to find the best one for a specific task.

All this makes me wonder if there might be a better way. With each new instruction set extension, new functional units are added along with new ways of hooking them up. Suppose that instead of programming these inner loops the usual way (deciding which registers represent which variables, choosing instructions and ordering them) we could instead set some state inside the CPU that directly connected the output of multiplier 1 to the input of adder 2 (for example), in effect creating a special purpose machine for the duration of the inner loop. The wiring up probably takes longer than running the original code would, but once you've wired it up each iteration through the loop can run extremely quickly, since there's no single execution thread causing a bottleneck. It would be essentially like having an FPGA built into your CPU.

I can think of several reasons why this isn't currently done. One is that writing applications at such a "low level" creates a dependency between the software and the most intimate details of the hardware. Tomorrow's CPU might be able to run today's software more quickly by having some extra multiplication units (for example) available to each core. But if the applications are addressing the multiplication units directly, that won't help unless the software is rewritten to take advantage of them. One possible answer to this might be to create a programming interface that involves a very large number of virtual functional units and allow the CPU to decide how to map that onto the actual hardware in the optimal way (possibly turning parts into more traditional code if there aren't sufficient functional units) - a sort of "JIT Verilog" if you will.

A second reason is that when you increase the amount of state on the CPU affiliated with a particular execution thread, you make context switching more difficult, because there are more registers that need to get swapped out to memory. FPGAs are generally not reprogrammed extremely frequently because they contain a lot of state and it takes time to reprogram them. This could also be solved by virtualizing functional units, or it could be solved by keeping track of whether the CPU was reprogrammed since the last time the thread ran, and only reprogramming when necessary. That would solve the common case at the expense of speed when doing dumb things like running two different video encoders at the same time (something that is also likely to be rather sub-optimal with today's CPUs, since it isn't generally something that is optimized for).

What would be even cleverer is if the CPU could examine the code sequence and wire things up automatically to maximize the speed of that inner loop. To some extent, CPUs are starting to do such things. Modern CPUs keep a queue of decoded instructions and if an inner loop fits into this instruction queue, no decoding needs to be done while it is running. Cache hints are another example of a transparent feature designed to allow optimization when specific hardware details are known.

Random number generation on 8-bit AVR

Thursday, July 1st, 2010

For my physical tone matrix I wanted a mode where at the start (or end) of each cycle the machine would pick a random "on" light and turn it off, and a random "off" light and turn it on. With 5-10 lights on, this makes a fairly nice little bit of random music - continually changing so it doesn't get too repetitive.

To implement this I decided the best way was to pick a frame (the second one of the cycle turned out to be the most convenient) and decrement one of two variables "randomOn" or "randomOff" depending on whether the current LED within the frame was on or off. If the variable reached zero the LED would be toggled. That doesn't take too many cycles. But in the sample before this frame, we need to generate random numbers and initialize randomOn and randomOff with them. randomOn needs to be in the range 0..lightsLit-1 (where lightsLit is the number of "on" LEDs) and randomOff needs to be in the range 0..(255-lightsLit). So we need to generate two random numbers in the range 0..n-1, where n is in the range 1..255, as quickly as possible. I wanted to generate these numbers in a single sample, as spreading the work across multiple samples makes the program much more complicated. That left me with about 300 cycles to generate two random numbers.

The usual way of generating a random number in the range 0..n-1 is to generate a random number in a much larger range (say 0..65535) and use the modulo operation ("remainder after division by n") to get it into the range 0..n-1.

Generating a 16-bit random number is done in a fairly standard way using a Linear Congruential Generator (I used the X = 214013*X + 2531011 variant because it cuts out a few multiplications). That by itself takes nine 8-bit by 8-bit multiplications and 56 cycles. I rewrote it myself in assembly language rather than using the built-in generator from avr-libc, because the latter is not very optimized (it uses a full 32-bit multiplication, which is not necessary).
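For reference, the generator itself is only a couple of lines of C (which 16 bits of the 32-bit state to hand back is a detail I'm guessing at here; the AVR version is hand-written assembly):

#include <stdint.h>

/* The LCG described above. The constants are the ones from the post;
   returning the high half of the state is a guess at a reasonable choice,
   since the low-order bits of an LCG are the weakest. */
static uint32_t rngState = 1;

static uint16_t rand16(void)
{
    rngState = rngState * 214013UL + 2531011UL;
    return (uint16_t)(rngState >> 16);
}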

If you use the 16-bit modulo function from libgcc it takes something like 215 cycles, which was too many. Even unrolled it's something like 144, which is still too many. I was about to embark on the work to spread this loop across several samples when I read this, which describes a way of turning a division by a constant into a multiplication by a constant. That's not very useful for arbitrary division, since computing the multiplier constant is more work than just doing the division. But we're not doing arbitrary division here - we know something about our divisor n: it is no larger than 255. So it becomes practical to precalculate a table of multipliers and look up an entry in this table at runtime.

The next problem is that that method only works when the division is exact (has remainder zero) which is no good for this application since it's the remainder we're interested in. But the classic bit-twiddling reference book Hacker's Delight describes a variation which does work (for some divisors the dividend needs to be added back after the multiplication, and for some the result needs to be shifted right by a number of bits depending on the divisor). So our mod algorithm looks like this:

  1. Look up multiplier in table indexed by divisor
  2. Multiply dividend by multiplier
  3. Look up helper routine in table indexed by divisor (there are 18 variations - 9 possible shifts and either "add" or "don't add", but not all of them are used) and jump to it.
  4. Multiply by the divisor and subtract the result from the original dividend - the result is the remainder (see the sketch below).
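This isn't the actual AVR code, but here's a rough C model of the scheme; it cheats by using a 24-bit reciprocal per divisor, which needs none of the 18 helper variations but shows the same multiply, shift and subtract-back structure:

#include <stdint.h>

/* Illustrative model only: recipTable[n] holds ceil(2^24 / n), which gives
   an exact quotient for any 16-bit dividend and any divisor from 2 to 255
   (the real routine uses narrower multipliers plus add-back/shift helpers). */
static uint32_t recipTable[256];

static void initRecipTable(void)
{
    for (uint32_t n = 2; n < 256; n++)
        recipTable[n] = (0x1000000UL + n - 1) / n;
}

static uint8_t mod8(uint16_t dividend, uint8_t n)
{
    if (n < 2)
        return 0;                                /* anything mod 1 is 0 */
    uint16_t quotient = (uint16_t)(((uint64_t)dividend * recipTable[n]) >> 24);
    return (uint8_t)(dividend - quotient * n);   /* subtract back to get the remainder */
}

The point of the narrower multipliers and the helper variations is that they keep the multiply down to something an 8-bit AVR can do in a handful of cycles.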

The resulting mod routine takes 40-53 cycles (depending on divisor) giving 96-109 cycles for the entire random number routine. With various other bits of overhead, the random routine takes 253 cycles on this "init" sample and up to 29 per sample on the first frame of the cycle.

Volatile registers in GCC

Monday, June 14th, 2010

When writing microcontroller code in assembler, it's nice to be able to keep some commonly used variables in registers. The 8-bit AVR devices in particular have a nice big set of registers, most of which are rarely used in compiled code. Usually it's a waste of time to write all the code in assembler, though - it's much better to write the non time-critical bits in C, compile them with GCC and then link with the assembly code bits.

When accessing the register variables from the C code, the natural thing to do is just to make a global register variable:

register uint8_t frame __asm__ ("r3");

That has two effects - it allows you to access the variable as if it were a normal variable:

void initFrame()
{
    frame = 0;
}

And it prevents the compiler from using r3 for something else (though one also has to be careful that the register isn't used by any other linked-in libraries, including the C library and the compiler support functions in libgcc).

The trouble comes when you try to read those register variables. If optimizations are turned on, then the following code might just be an infinite loop:

void waitForFrame(int frameToWaitFor)
{
    while (frame != frameToWaitFor);
}

The compiler hoists the read of frame outside the loop, and never sees the updates. If frame was a normal variable we could fix this just by adding volatile, but using volatile with a register variable doesn't work. This seems odd until we think about what volatile actually means. A read from or write to a volatile variable is considered an observable effect (like doing IO) so the compiler won't optimize it away. But the compiler has no concept of a "read from" or "write to" a register - registers are just used or not used and the optimizations around them are unaffected by the notion of volatile.
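For contrast, if the variable lived in memory instead of r3, volatile alone would be enough (frameInMemory is just a made-up name for illustration):

#include <stdint.h>

/* With an ordinary memory variable, volatile forces the compiler to re-read
   it on every iteration of the loop. */
volatile uint8_t frameInMemory;

void waitForFrameInMemory(int frameToWaitFor)
{
    while (frameInMemory != frameToWaitFor);
}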

There is a reasonably easy and not-too-invasive way to fix this, though, through the use of inline assembly. If you write:

#define getFrame() ({ \
    __asm__ volatile ("" : "=r"(frame)); \
    frame; \
})
 
void waitForFrame(int frameToWaitFor)
{
    while (getFrame() != frameToWaitFor);
}

The compiler will treat the "" as a block of assembly code which writes to r3 and which has unknown side effects (so it can't be hoisted out of a loop, for example). The code doesn't actually do anything (so the generated code won't be adversely affected), but it essentially provides a read barrier on the register. Unfortunately you can't use getFrame() to write back to frame, so to increment it, for example, you have to write frame = getFrame() + 1; - but that's actually kind of helpful, because it makes the possibility of a race condition (for example, an interrupt routine also incrementing frame at the same time) more obvious.
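If an interrupt routine really is incrementing frame as well, the read-modify-write needs protecting in the usual AVR way - something like this sketch (incrementFrame is a name I've made up here):

#include <avr/io.h>
#include <avr/interrupt.h>

/* Sketch only: increment the register variable without racing against an
   interrupt handler that also modifies it. */
static inline void incrementFrame(void)
{
    uint8_t sreg = SREG;    /* remember whether interrupts were enabled */
    cli();                  /* disable interrupts around the read-modify-write */
    frame = getFrame() + 1;
    SREG = sreg;            /* restore the previous interrupt state */
}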

What are all those pins for?

Wednesday, October 28th, 2009

I recently built myself a new computer using an Intel Core i7 920 CPU. This CPU has more pins (well, "lands" actually, since they are just flat conducting areas that touch pins in the socket) than any other yet produced, 1366 of them to be precise. I was wondering why so many were needed, so I grabbed the datasheet and made a map:

[Pin map legend: power lands (VSS, VCC, VCCPLL, VTTA, VTTD, VDDQ); memory lands (DDR0, DDR1 and DDR2 data plus other memory signals); other lands (QPI data and other QPI signals, miscellaneous, reserved).]

Idle speculation follows (I don't have any background in CPU or motherboard design):

The pins roughly divide into six sections: two for memory data, one for other memory-related signals, one for power, one for the QPI bus and one that is mostly reserved.

That there are a lot of power pins is not surprising - this CPU can use as much as 145A of current, which is enough to vaporize any one of those tiny connections, so it has to be spread out amongst ~300 of them for each of power and ground. Having two very big pins for power would probably make the mechanical engineering of the CPU much more difficult and would push the responsibility for branching out that power onto the CPU, whereas it is better done by the motherboard.

It's interesting that the ground lands are mostly spread out but the power lands are mostly together. I'm not sure why that should be - I would expect them both to be spread out. Perhaps the 8 or 9 big groups of VCC on the north edge each correspond to a single "power line" on the motherboard (and hence are grouped together) while the distributed ground lands are needed to supply electrons for the signal lands.

Three DDR3 channels also use a lot of lands - 192 for data alone and almost as many again for addresses, strobes and clocks.

Another thing that surprised me is that there are so many reserved lands (~250 of them). Initially I thought that this was because the socket was designed before the designers knew how many pins they would actually need, so they made sure to design for the absolute maximum. However, a good chunk of the reserved lands are used by the Xeon 5500 CPUs, which use the same socket - in particular for memory error detection/correction and the second QPI bus (which is presumably in the northwest corner).

Edit 14th July 2013:

Here are some more nice pin maps.

Top posting

Saturday, October 24th, 2009

Before I started working at Microsoft, I used to always reply to emails by quoting them, breaking up the quoted text into pieces and then replying to each of the pieces directly below, for example:

From: andrew@reenigne.org
To: xyz@example.com
Subject: Re: Hello

Xyz <xyz@example.com> wrote:
> Hello,

Hello!

> how are you today

I'm fine, thank you.

This style is called inline replying with trimming. This is a fine system because the person I'm replying to gets reminded of what they wrote, and I don't have to write things like "In regards to the part of your email where you asked me how I was today,".

The most common other system is top posting, which looks like this:

From: andrew@reenigne.org
To: xyz@example.com
Subject: Re: Hello

Hello! I'm fine, thank you.

Xyz <xyz@example.com> wrote:
> Hello, how are you today

This is the natural default with Microsoft Outlook. In the geek circles I had moved in before working at Microsoft, this style was greatly frowned upon. However, it is ubiquitous at Microsoft. I'm not sure whether this is because it's the default style in Outlook or whether it's the default style in Outlook because it is ubiquitous at Microsoft. However, once I had forced myself to "do as the Romans do" and top post, I found that it does actually make more sense in that environment. This is for two reasons:

  1. When the conversation falters due to lack of knowledge about something, it's very common to "loop in" an expert to give their two cents by adding them to the CC line. In order for the expert to have some context, it's useful to have the previous history of the conversation right there in the email, so he or she can read it (bottom to top).
  2. With each email carrying the entire thread, emails can get pretty long. It's inconvenient to have to scroll all the way to the bottom of each email to see the latest reply (especially if you're just a spectator rather than a contributor to a busy thread) so it's better for the replies to be at the top than at the bottom.

It's still useful to reply inline as well sometimes - at Microsoft this is done by quoting the email you're replying to twice - once, in its entirety, at the bottom and once (suitably chopped and trimmed) inline. I used to do this quite frequently as it's the best way I've found (pre-Argulator) of addressing each point individually. However, one of my managers once told me that if the conversation got sufficiently complex that I felt it was best to do that, I should "take it offline" and schedule a face-to-face meeting to hash out the issues. However, I felt (and still feel) that inline email replies are better than face-to-face meetings for such complicated issues - in face-to-face meetings there's less time to think about your answer, and points can get lost: as the conversation progresses it can only follow one "branch" of the argument tree, and without explicitly maintaining a stack it's very easy for branches to get forgotten about.

Does the human brain tap into a third form of computing?

Wednesday, October 21st, 2009

There are two forms of computing currently thought to be possible in our universe. One is the classical, deterministic computing that we all know and love. Many people think the human brain is a kind of (very large and complicated) classical computer. However, it is still unknown whether (and if so, how) a classical computer can give rise to consciousness and subjective experience.

The second form of computing is quantum computing, where you essentially run a pile of classical computers in superposition and allow their outputs to interfere in order to obtain the result. Anything quantum computers can do can also be done by classical computers (albeit much more slowly). The human brain might be a quantum computer, but (unless there's something about quantum computing that we don't yet understand) that still doesn't solve the problem of consciousness.

A third form of computing is possible if you have a time machine. I've speculated before that the human brain could be a time travelling computer. These computers are faster still than quantum computers, but still can't compute anything that can't in principle (given long enough) be computed by a classical computer, so this still doesn't solve the consciousness problem.

Could it be that by accident of evolution the human brain has tapped into a form of computing that is qualitatively different from classical computing, much as birds and bees have tapped into a qualitatively different method of flying (flapping) than the method used in our aeroplanes? While this smells of dualism, I think it's a possibility that can't be fully discounted without a complete theory of physics.

One such qualitatively different form of computing is the infinity machine. This can verify true things in finite time even if there is no finite proof that those things are true. Thus it can find completely new truths that are not provable by conventional mathematics.

It seems rather unlikely that the infinity machine is possible in our universe (quantum mechanics puts an absolute limit on clock speed) but there could be other forms of computation that we've just never thought of.

Penrose's Orchestrated Objective Reduction theory is one such possibility.

CGA: Why the 80-column text mode requires the border color to be set

Saturday, October 17th, 2009

The original IBM Color Graphics Adapter has a curious quirk - it won't by default display colour on the composite output in 80-column text mode. By looking at the schematics, I've figured out why this is, and what the CGA's designers could have done differently to avoid this bug. The following diagram illustrates the structure of the various horizontal and vertical sync pulses, overscan and visible areas in the CGA.

There are two horizontal sync pulses - there's the one generated by the 6845 (the 160-pixel wide red/grey/yellow band in the diagram) and there's the one output to the monitor (the 64-pixel wide grey band within it). The CGA takes the 6845's hsync pulse and puts it through various flip flops to generate the output hsync pulse (delayed by 2 LCLKs and with a width of 4 LCLKs) and also the color burst pulse (in yellow, delayed by 7 LCLKs and with a width of 2 LCLKs).

The 6845 can generate an hsync pulse anywhere from 1 to 16 clock ticks in width. The IBM's BIOS sets it up at 10 ticks (as shown in the diagram). However, in 80-column text mode those ticks are only half as wide, so they only extend 3/4 of the way through the output hsync pulse. The 6845's hsync pulse ends before the color burst pulse gets a chance to start, so the burst never happens and the display will show a monochrome image.

By changing the overscan color to brown, one can create one's own color burst signal at the right point in the signal, and this was the usual way of working around the problem (possibly the only way that works reliably).
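In practice that just means writing brown (colour 6) into the CGA colour select register at port 0x3d9 - something like the following, assuming a DOS compiler that provides outportb:

#include <dos.h>   /* outportb(), as provided by Borland-style DOS compilers */

/* Set the border/overscan colour to brown so that the overscan area itself
   supplies a colour burst in 80-column text mode. */
void enableColourBurstWorkaround(void)
{
    outportb(0x3d9, 6);
}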

By changing the 6845's pulse width to the maximum of 16, one could generate the first half of the color burst pulse (I think) and some monitors might recognize this as a color burst.

If the CGA's designers had started the output hsync pulse at the beginning of the 6845's hsync pulse (or delayed by only 1 LCLK instead of 2) then using the maximum pulse width would have been sufficient to generate the correct color burst. I guess they were just trying to center the output hsync pulse and the color burst within the 6845 pulse, without thinking of the high-res case.

The diagram also shows why interlaced mode doesn't work on the CGA - the output vertical sync pulse is generated in a similar way to the output horizontal sync pulse, only it's 3 lines instead of 4 LCLKs. It always starts at the beginning of an output hsync pulse, so a field can't start halfway through a scanline.

CGA: Reading the current beam position with the lightpen latch

Friday, October 16th, 2009

Here is a little-known trick that a genuine IBM Color Graphics Adapter can play, that I noticed when looking at its schematic recently. There are two ports (0x3db and 0x3dc) which are related to the light pen. A read from or write to 0x3db clears the light pen strobe (which you need to do after reading the light pen position so that you'll be able to read a different position next time). A read from or write to 0x3dc sets the light pen strobe - what's the point of that? One possibility might be to implement a light pen that signals the computer in a different way (via an interrupt) rather than being connected directly to the CGA card. That wouldn't work very well though - the interrupt latency of the original IBM PCs was extremely high.

Another possibility is to allow the programmer to directly find the position of the beam at any moment, to an accuracy of 2 scanlines (in graphics modes) and one character width (1/40th of the visible screen width in graphics modes and 40-column text modes, 1/80th of the visible screen width in 80-column text modes). Read from 0x3db and 0x3dc and then read the light pen CRTC registers to find out where the beam was when you read from 0x3dc. This technique is so obscure it probably won't work on non-IBM CGA cards, so its usefulness is rather limited. Might be useful for an oldskool demo, though. I'll be sure to implement this technique when I finally get around to making my extremely accurate PC emulator.
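A sketch of what that looks like in practice, again assuming Borland-style outportb/inportb and the standard CRTC index/data ports at 0x3d4/0x3d5 (the light pen address lives in CRTC registers 16 and 17):

#include <dos.h>   /* outportb()/inportb() */

/* Latch the current beam position into the 6845's light pen registers and
   read it back as a CRTC character address. */
unsigned int readBeamPosition(void)
{
    unsigned int high, low;
    (void)inportb(0x3db);    /* clear the light pen strobe */
    (void)inportb(0x3dc);    /* set the strobe: latches the current address */
    outportb(0x3d4, 0x10);   /* CRTC register 16: light pen address, high byte */
    high = inportb(0x3d5);
    outportb(0x3d4, 0x11);   /* CRTC register 17: light pen address, low byte */
    low = inportb(0x3d5);
    return (high << 8) | low;
}

The value that comes back is a character address, so turning it into a row and column means accounting for the CRTC start address and the number of characters per row.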

Building 3D chips

Thursday, October 15th, 2009

In the not-too-distant future, we'll hit a limit on how small we can make transistors. The logical next step from there will be to start building up - moving from chips that are almost completely 2D to fully 3D chips. When that happens, we'll have to figure out a way to cool them. Unlike with a 2D chip, you can't just stick a big heatsink and fan on top because it would only cool one surface, leaving the bulk of the chip to overheat. What you need is a network of cooling pipes distributed throughout the chip, almost like a biological system.

I suspect these pipes would work best if they go straight through the chip and out the other side. At small scales, fluid is very viscous and trying to turn a corner would probably slow down the flow too much. So suppose you have a cubic chip with lots of tiny pipes going in one face and coming out the opposite face. The next problem is that, if the fluid is all moving the same way, one side of the chip (the "incoming fluid" side) would get much hotter than the other. The effect could be mitigated somewhat by having some of the pipes flowing in the opposite direction. Ideally you'd want fluid coming in on all 6 faces to maximize cooling. Another possibility is pipes that split up within the chip. A wide pipe of cold fluid will have a similar effect as several smaller pipes of warmer fluid (the increase in fluid temperature is offset by the extra surface area). It would be an interesting puzzle to try to model the heat flows and come up with optimal pipe configurations. In doubling the side of the chip, one probably has to increase the proportion of chip volume dedicated to cooling by some factor 2^n - I wonder what this fractal dimension is.

For most efficient cooling, one would probably want to take the cooling fluid from the CPU and any other hot parts of the system and compress it (just like the coolant in a fridge), allowing it to expand inside the CPU. Then rather than having lots of noisy fans one has one noisy compressor (which would probably be easier to acoustically isolate - maybe even by putting it outside). Fans are a big problem for noise and reliability - my main desktop machine (at the time of writing) has five of them, of which two have failed and a third is on its last legs.

Another major problem that will need to be solved is pluggable cooling lines. People expect to be able to build their own computers, which means that it must be possible to plug together a CPU, motherboard, graphics card and cooling system without an expensive machine. That means we'll need some kind of connector for plugging the coolant lines from the CPU (and other hot components) into the cooling system. Ideally it will be easy to connect up and disconnect without the possibility of introducing dirt or air into the coolant lines, and without the possibility of coolant leaks. I suspect that whoever invents such a connector will make a lot of money.