Archive for the ‘computer’ Category

Tainted files

Monday, May 14th, 2007

A lot of work has gone into making Windows in general, and Internet Explorer specifically, more secure over the past few years. One of my favorite Windows security features is that when you download a file with IE and save it to disk, it is marked with some extra metadata saying "hey, this file is not to be trusted". Then, when you later forget where the file came from and try to run it, a big scary warning pops up saying "you might not want to run this unless you're sure it's safe". Unlike some security features, it doesn't prevent you from doing anything - it's just a perfectly sensible tainting system.
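
For the curious: the metadata lives in an NTFS alternate data stream named "Zone.Identifier" attached to the downloaded file. Here's a minimal sketch of writing such a taint mark yourself - the stream name and zone number are the real ones, but treat the rest as illustrative.

#include <fstream>
#include <string>

// Marks a file as untrusted the way IE does: by attaching an NTFS
// alternate data stream called "Zone.Identifier". NTFS volumes only.
void mark_as_untrusted(const std::string& path)
{
    std::ofstream ads((path + ":Zone.Identifier").c_str());
    ads << "[ZoneTransfer]\r\nZoneId=3\r\n";    // zone 3 = Internet
}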

It's a shame that this technique hasn't been generalized to files that come from other untrusted sources like CDs, floppy disks, USB keys, etc. A lot of attacks could be mitigated that way.

Cases

Sunday, May 13th, 2007

C++ is hard to parse. One reason for this is that it's difficult to tell what is supposed to be a variable and what is supposed to be the name of a user-defined type. This means that information from later stages of parsing has to be fed back into earlier stages, making the whole thing a big tangled mess.
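
Two classic examples (the names are my own):

// Without knowing whether "a" and "c" name types, these two lines
// cannot be parsed:

a * b;      // if "a" is a type: declares "b" as a pointer to "a"
            // if "a" is a variable: multiplies "a" by "b"

a b(c);     // if "c" is a type: declares a function "b" taking a "c"
            // if "c" is a variable: declares an "a" called "b",
            //                       initialized from "c"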

It's not just compilers that have problems either - the problematic constructs can also be difficult for humans to understand (especially if taken out of context).

I think C++ would be much easier to parse (for both humans and machines) if a very simple rule were added - if a name starts with a capital letter, it is the name of a type, and if it starts with a lower-case letter, it is the name of a variable. Sort of a 1-bit Hungarian notation (except that it has meaning to the compiler).

This is already a de facto standard in many programs (often by virtue of being a consequence of an even stricter scheme). Though of course if it were added to C++ now it would break many many programs...

A nice side effect of this is that the syntax for types could be made much more regular, whilst still retaining the "definition mirrors use" paradigm. Given that we have found a capital letter at the start of an identifier, we know we are parsing a type name. So if we come across an open parenthesis ("(") we can parse that as "a function returning the type to the left of the parenthesis". So for example we can name the type "array of N pointers to functions returning pointers to functions returning pointers to characters" just by reading it backwards: Char*()*()*[N]. This is much easier than the normal C++ declaration "char*(*(*a[N])())()", which is very hard to understand and which also buries the name of the variable you are declaring in the middle of the type name rather than putting it on the right where it belongs. SPECS solves the same problem, but at the cost of a less familiar syntax and of losing "definition mirrors use".
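
For comparison, the conventional way to tame the standard declaration is to decompose it with typedefs, which makes the structure explicit but is rather more verbose:

const int N = 4;    // illustrative

// Equivalent to char*(*(*a[N])())(), built up one layer at a time:
typedef char* (*PtrToFuncReturningCharPtr)();
typedef PtrToFuncReturningCharPtr (*PtrToFuncReturningThat)();
PtrToFuncReturningThat a[N];    // array of N such pointers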

The true meaning of Reenigne

Saturday, May 12th, 2007

The time has come for me to tell the story of how I came to own the domain reenigne.org, and hence the reason why this blog is called what it is called.

The story I tell on the site is that I reverse-engineered Digger, hence "reverse engineer" or "reenigne". That's only part of the story though. My experience doing this manual reverse-engineering project (and a couple of others) convinced me that building a tool to aid in decompilation would be an interesting project. I decided to call this tool Reenigne and bought the domain name for the project. I set up my personal website there as a temporary measure but when I got a job with Microsoft, Reenigne (the program) got put on the back burner.

Decompilation (or, more generally, analysis of programs) is an interesting problem in general. An important result in theoretical computer science (the undecidability of the halting problem) says that it's impossible to write a program that even determines if a given input program halts or not, let alone analyses it to the extent that Reenigne would have to in order to be useful.

But that theorem only applies to machines with unbounded memory. If you know in advance the maximum memory that the program can use, you can simulate a machine running the program and detect if it gets into a looping state. That's not tremendously useful by itself, since the simulation could still require many orders of magnitude more resources than today's computers have, even for the simplest programs. Does it matter for real-world programs? Well, maybe not for early 80s games...
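
To make the bounded-memory argument concrete: a machine with finite memory has only finitely many possible states, so a run either halts or eventually revisits a state, and a revisited state means it will loop forever. The classic two-pointer cycle-finding trick detects this without storing every state seen. A toy sketch (the 256-state "machine" here is a stand-in, not a real simulator):

#include <cstdio>

typedef unsigned char State;               // toy machine: 256 possible states

bool halted(State s) { return s == 0; }               // arbitrary halt state
State step(State s) { return (State)(s * 5 + 1); }    // arbitrary transition

// Returns true if the machine starting from s halts, false if it loops.
// Guaranteed to terminate because the state space is finite.
bool halts(State s)
{
    State slow = s;
    State fast = s;
    for (;;) {
        // fast visits every state on the path, so it sees a halt first
        if (halted(fast)) return true;
        fast = step(fast);
        if (halted(fast)) return true;
        fast = step(fast);                 // fast moves two steps...
        slow = step(slow);                 // ...slow moves one
        if (slow == fast) return false;    // cycle found: it never halts
    }
}

int main() { printf("%s\n", halts(7) ? "halts" : "loops"); }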

Another advantage we have over the Turing machines in the halting problem is an oracle - a human operator. If Reenigne gets stuck analyzing a particular program, it can ask for human help. A human may be able to spot paths of execution or quickly prove things that the computer working alone cannot. Together (the idea goes) Reenigne and a human operator form an unbeatable team that can reverse-engineer any real-world program to source code quickly and (relatively) easily (assuming some knowledge of what the program does and how it might work). The human inputs to Reenigne can in principle be recorded to form a script which can be played back, enabling anyone who has a copy of the same binary program to obtain the source code. The source code itself could not be distributed (it's copyrighted) but as far as I can tell the Reenigne scripts could be distributed arbitrarily - they're not a derivative work of the binary program; they are more like a commentary on it (I even thought about having the program save these scripts with the extension ".commentary" to hammer this point home extra hard).

Assuming that .commentary files for any piece of software are easily available, you can obtain source code for any binary you have - effectively making any program open in the same way that GPL software is (although with the minor restriction that you cannot legally redistribute the program itself). As I'm sure you can imagine, such a state of affairs would scare the bejeezus out of Microsoft. That's not the reason I haven't written it though - it's just lack of time.

The game doesn't end there though - for any possible Reenigne it's possible to write a program in such a way as to maximize the difficulty of reverse-engineering it with that particular Reenigne (by twisting up the logic to make it hard for a human to see what's going on, and by making the program do things like prove Fermat's Last Theorem just to find all the code in amongst random distracting junk). This would make programs slower and could never eliminate the threat of reverse-engineering completely (as I said, it's always possible in theory to simulate the program on a bigger computer, so any roadblocks would just pose engineering problems, not theoretical problems). But given the lengths that some software misguidedly goes to in order to keep its secrets from the owner of a computer, whilst spilling those same secrets wide open to the CPU, I have no doubt that if a useful Reenigne were to become available, all these tricks would be tried.

We would then enter a period of "arms racing" between the Reenigne-creators (which, I am assuming, would be more than just me by then) and those trying to keep their algorithms and content decryption keys secret. They add a Reenigne-defeating trick, then a new version of Reenigne appears which understands and works around that trick, and so on.

Given the propensity of the US government to pass laws like the DMCA (aka the "Snake Oil protection act") and arrest computer programmers for doing their jobs, I suspect it would be unwise for me to pursue Reenigne seriously whilst living in the US (at least while the political climate is still so pro-"intellectual property").

Compiler compilers

Friday, May 11th, 2007

I think it would be cool if a language had a parser generator built in so that you could write BNF grammars right in the source code and the compiler would be able to turn these into parsing functions (this would also be really easy if the grammars were interpreted as Parsing Expression Grammars). What would be even better is if you could write text in the languages these grammars describe and have the compiler compile them right into the program alongside the code generated from the language itself.
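
To illustrate what the compiler would be generating, here's a hand-written version of the parsing functions that a tiny embedded grammar like "sum <- num ('+' num)*" might compile to (PEG semantics: ordered choice, greedy repetition; all names here are my own invention):

#include <cctype>
#include <cstdio>

// Each generated function tries to match at *p, advancing p on success.
bool parse_num(const char*& p)
{
    if (!isdigit((unsigned char)*p))
        return false;
    while (isdigit((unsigned char)*p))
        ++p;
    return true;
}

// sum <- num ('+' num)*
bool parse_sum(const char*& p)
{
    if (!parse_num(p))
        return false;
    while (*p == '+') {          // greedy repetition, as in PEGs
        const char* save = p;    // backtracking point for this repetition
        ++p;
        if (!parse_num(p)) {
            p = save;            // undo the partial match and stop
            break;
        }
    }
    return true;
}

int main()
{
    const char* p = "1+23+456";
    printf("%s\n", parse_sum(p) && *p == 0 ? "matched" : "no match");
}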

For example, if you were writing an assembler, you might define a language for defining mnemonics and how they (and their arguments) translate into opcodes. Then you would add text in this language describing the architecture that your assembler will target. Rather than this parser running every time you ran your assembler, the output of the parser could be embedded right into the program so the result is highly efficient.

An efficient code generator that could be called at runtime would also be useful. This would make applications like formula-driven fractal generators, graphing programs, electrical circuit simulators and numerical analysis packages fast and easy to write. Combined with the parser generators, it would also make it very simple to write a compiler for this hypothetical language - just define the syntax and the semantics, plug in the code generator and there's your compiler. It would be almost like having a compiled homoiconic language, except with a proper built-in syntax (unlike Lisp). Unfortunately, such a compiler would compete with itself (writing a rival compiler with it would be almost trivial), so it is unlikely ever to be written as a commercial product.

High density mouse pointer

Wednesday, May 9th, 2007

Many techniques have been invented for making the mouse pointer more visible. One is a little animation of circles radiating out from it when you press a "locate" button; another is just making it bigger. Yet another is adding "mouse trails" showing the position the pointer was at over the last few frames (though this has the disconcerting effect of making your pointer appear to be a snake). One which I think has been inadequately explored is making it "high density". Normally, when you move the mouse, the operating system erases the pointer from the old location and plots it at the new location, doing nothing in between. If you move the pointer around quickly in a circle, it seems to appear at a few discrete locations.

I think it would be better if, in any given frame, the operating system plotted the pointer at the position it was at in the last frame, the position it is at in the current frame and everywhere in between. This would give the impression of a "streak" as you moved the pointer, or a solid circle if you moved it in a rapid circle, as if the pointer is plotted much more often than once per frame - more "densely" in other words. It would be kind of like mouse trails done properly.
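
Here's a minimal sketch of the idea, assuming a draw_pointer() routine that plots the pointer image at a point (the routine and its signature are my stand-ins for whatever the OS actually uses):

#include <cstdlib>

// Stand-in for the OS's existing routine that blits the pointer image.
void draw_pointer(int x, int y);

// Plot the pointer at last frame's position (x0, y0), this frame's
// position (x1, y1), and every pixel in between.
void draw_pointer_dense(int x0, int y0, int x1, int y1)
{
    int steps = abs(x1 - x0);
    if (abs(y1 - y0) > steps)
        steps = abs(y1 - y0);
    if (steps == 0) {
        draw_pointer(x1, y1);
        return;
    }
    for (int i = 0; i <= steps; ++i)        // linear interpolation
        draw_pointer(x0 + (x1 - x0) * i / steps,
                     y0 + (y1 - y0) * i / steps);
}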

Recursive subdivision

Tuesday, May 8th, 2007

Recently I was experimenting with drawing some Mandelbrot sets. Rather than scanning across the window pixel by pixel and doing one escape per pixel I generated a sequence of complex numbers within the window by repeatedly adding a complex number with irrational real and imaginary parts (the tangent of the argument of which was also irrational) and wrapping the real and imaginary parts into the window. This scatters points across the window evenly in such a way that no point is ever hit twice (within the bounds of precision) and that the pixel grid becomes gradually more dense. Thus the image sort of "rezzes in" and is gradually refined. Anti-aliasing is sort of built in - as the distance between samples gradually decreases, the effective resolution gradually increases. The program never completely stops but you can tell when it's done as the picture seems perfectly anti-aliased and stops changing. I didn't build this in, but the generated points could potentially be redrawn in a larger window to do a sort of (limited) real-time zoom (a la XaoS but hopefully with less artifacting - though it might take a lot of memory to store all that data).
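
The generator itself is only a few lines - something like the following sketch. The particular constants are my choice (they are related to the "plastic number"); the only requirement is that they be suitably irrational, with an irrational ratio:

#include <cmath>
#include <cstdio>

// Stand-in for "map (x, y) into the complex window, run the escape-time
// iteration and plot the result".
void sample(double x, double y) { printf("%f %f\n", x, y); }

// Generates a stream of points that gradually fills the unit square
// evenly: repeatedly add an irrational offset and wrap.
int main()
{
    const double ax = 0.7548776662466927;
    const double ay = 0.5698402909980532;
    double x = 0.5, y = 0.5;
    for (int i = 0; i < 100; ++i) {
        x += ax; x -= floor(x);             // wrap into [0, 1)
        y += ay; y -= floor(y);
        sample(x, y);
    }
}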

This is kind of neat but I think we can do much better:

  • The program doesn't take advantage of the natural structure of the Mandelbrot set (large "lakes" of high iteration count, which have detail only at the edges)
  • The iteration limit is fixed (it would look wrong if some pixels had different iteration limits) - it would be nice if the iteration limit could be gradually increased (for points near the boundary) at the same time as resolution was gradually increased.
  • There is no way of reusing the calculation for one point at a nearby point (as in Synchronous Orbit Iteration, which can really speed things up for very deep zooms)

I think all these things are possible with the right data structure. My idea is to form a sort of 2-dimensional linked list. The image is divided into rectangles. Each rectangle has iteration data (corresponding to its center) and 8 pointers to other rectangles, two on each edge. The two pointers on an edge may point to the same rectangle or to different rectangles. Suppose both right-pointers of rectangle A point to rectangle B. Then either both left-pointers of rectangle B point back at rectangle A, or only one of them does. Thus, each rectangle is bordered on each side by either two rectangles, one rectangle or half a rectangle. And by a process analogous to linked-list insertion, any rectangle can be subdivided horizontally (if it doesn't have half rectangles on either of the horizontal edges) and/or vertically (if it doesn't have half rectangles on either of the vertical edges).
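
In code, the structure might look something like this (the names and the contents of the iteration data are mine, just to make the shape concrete):

struct IterationData {
    double zx, zy;   // current orbit position for this rectangle's centre
    int count;       // iterations performed so far
    bool escaped;    // true once the orbit has left the escape radius
};

struct Rect {
    IterationData data;
    // Two neighbour pointers per edge. If both point at the same Rect,
    // that neighbour is the same size or larger; if they point at
    // different Rects, this Rect borders two smaller neighbours there.
    Rect* left[2];
    Rect* right[2];
    Rect* up[2];
    Rect* down[2];
};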

Thus at any point we can increase resolution (by splitting rectangles) or iteration count (by iterating on existing rectangles). If a rectangle is bounded by lake on all sides it is probably also full of lake and we can avoid subdividing or iterating, meaning that we do less work (or defer work) in the boring areas. We also have a natural way of doing SOI (copying the iteration data when we split a rectangle).

We can also do real-time zooming, by noticing that the relationship between rectangle dimension and pixel dimension need not be fixed, and (if necessary) throwing away or paging out rectangles that are outside the window.

There are some fiddly parts, like "how do you divide your priorities between iterating and subdividing" and "what do you do at the edges", which are the reason why I haven't explored this any further, but it's an interesting project.

Foundry

Monday, May 7th, 2007

It would be really awesome if there were a source control/versioning system that integrated with a text editor so that every change you made was persisted as a new version - you'd never even need to save, let alone check in, your changes. Infinite-level undo would always be available, even across power failures.

This version-control system would have to be multi-headed so that your changes didn't break the version of the product that other people were using or working on until they were ready and tested.

Furthermore, I think this version-control system would be extra awesome if (in the background, or on a remote machine) it rebuilt your program and ran tests on it every time you made a change, so that you were notified of bugs as soon as possible after they occurred.

Obviously there wouldn't be time to build after every keystroke, so you'd have to assign a priority to the builds - perhaps the "Save" command could be repurposed to make the current state higher priority, states you mark as "this should be good" higher priority still, and states marked as "this has been reviewed" higher still. If a bug does turn up, the system can go back and do a binary search on earlier versions to track down exactly when it was introduced.
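
The bug-hunting search itself is just a binary search over version numbers. A sketch, assuming a test(version) callback that rebuilds and runs the tests for a given version, and assuming that the bug, once introduced, stays:

// Finds the version that introduced a bug, given a version known to
// pass ("good") and a later one known to fail ("bad"). O(log n) builds.
int find_culprit(int good, int bad, bool (*test)(int version))
{
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (test(mid))
            good = mid;     // bug appeared after mid
        else
            bad = mid;      // bug already present at mid
    }
    return bad;             // the first failing version
}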

There could be a database of all the changes, and another database holding all of the intermediates:

  • particular versions of particular files
  • built executable files
  • intermediate build files

This should allow incremental building and testing to happen pretty quickly (especially if the system knows which changes can affect which executables and which testcases), as long as "rebuild the world" files like compilers and global header files aren't modified.

The system could keep track of which files affect which build steps by watching which files are accessed during a build and keeping this information in the "intermediates" database too.

To eliminate false-positive and false-negative test results due to a dirty machine state, a refinement of the system could run all the tests on a virtual machine, the state of which is reset after each test. This could be done by storing diffs from one virtual machine image to another in the intermediates database. The system could then either build VM images from these diffs (and store them as intermediates) before the tests start or have the VM software choose the right bits dynamically as particular virtual disk sectors are requested (whichever is faster - probably some combination of the two techniques).

Interesting blog

Friday, May 4th, 2007

Scott Aaronson's lecture series Quantum Computing since Democritus is brilliant - fascinating and hilarious. If I had been taught by this guy at university I might still be in academia. A quote:
Oracles were apparently first studied by Turing, in his 1938 PhD thesis. Obviously, anyone who could write a whole thesis about these fictitious entities would have to be an extremely pure theorist, someone who wouldn't be caught dead doing anything relevant. This was certainly true in Turing's case -- indeed he spent the years after his PhD, from 1939 to 1943, studying certain abstruse symmetry transformations on a 26-letter alphabet.
The sheer range of material is just stunning too, and there are lots of in-jokes for those with a passing familiarity with the subjects covered. Plus, really smart people like Greg Kuperberg and Bram Cohen read his blog. (And he uses the same hosting provider as me.)

More keywords

Thursday, May 3rd, 2007

I think the C and C++ computer languages don't have enough keywords. The current set may have been appropriate for the earliest C compilers, but there are many keywords whose absence I often curse. Unfortunately, it's very difficult to add new keywords to C++ because doing so tends to break a lot of existing programs. This is a curious difference between human languages and computer languages - although it seems like computer languages evolve much faster than human languages, in this respect the opposite is true, because you can add words to a human language without changing the meaning of existing documents.

Anyway, here are some of the keywords and constructs I'd like to be able to use in my programs:

Two part loops:

Often I find myself wanting a loop with the test in the middle. I usually end up writing these as an infinite loop with a "break" statement:

do {
    x;
    if (y)
        break;
    z;
} while (true);

I'd prefer to write it like this:

do {
    x;
} while (!y) {
    z;
}

[I realized sometime after writing this that I actually got this idea from here.]

until and unless:

I'd also like to be able to use "until" instead of "while" to reverse the sense of the test, like this:

do {
    x;
} until (y) {
    z;
}

Similarly, I'd like to be able to use "unless (x)" instead of "if (!x)". While this doesn't really make programs any shorter, I think being able to eliminate the "!" makes programs easier to understand and helps with the "say what you mean" principle.

Grouping if statements:

As an alternative to using the "switch" statement, I sometimes write:

if (a)
    b;
else
    if (c)
        d;
    else
        if (e)
            f;
        else
            g;

But this has an unfortunate tendency to lean to the right. I'd prefer to write:

if (a)
    b;
elseif (c)
    d;
elseif (e)
    f;
else
    g;

and leave the "else if" construct for situations like:

if (a)
    if (b)
        c;
    else
        d;
else
    if (b)
        e;
    else
        f;

Again, it doesn't make a big difference syntactically, but it does tend to shed more light on the programmer's intention, which is always a good thing. Similarly, there should be an "elseunless" keyword, meaning "elseif" with the sense of the test reversed.

[These ideas came from here, though his "unless" is a bit different to mine, and I've never missed "when".]

Done clauses:

Sometimes I'm using a loop for a search and want to do something in particular if the thing I was searching for was not found. Normally I'd have to write something like this:

bool found = false;
Item current;       // declared outside the loop so the condition can see it
do {
    current = get_next_item();
    if (current == target) {
        process(current);
        found = true;
        break;
    }
} while (current != last);
if (!found)
    fail(target);

I'd prefer it if loops had something analogous to an "else" clause which is called only if the loop condition fails. I call this the "done" clause. This would make the example look like this:

Item current;
do {
    current = get_next_item();
    if (current == target) {
        process(current);
        break;
    }
} while (current != last); done
    fail(target);

Much neater.

Multiple break:

I'd like to break out of multiple loops at once. Suppose I'm searching a 2D array:

for (int x = 0; x < 10; ++x) {
    int y = 0;
    for (; y < 10; ++y)
        if (array[y][x] == target) {
            foobar(x, y);
            break;
        }
    if (y < 10)     // inner loop exited early: target found, stop searching
        break;
}

I'd much rather write it as:

for (int x = 0; x < 10; ++x)
    for (int y = 0; y < 10; ++y)
        if (array[y][x] == target) {
            foobar(x, y);
            break break;
        }

The syntax "break continue" could be employed to mean "break out of the innermost loop and then do a 'continue' on the next loop out".

[I also posted this idea on comp.std.c++ here but it was shot down. I still think it's a good idea, though. That entire thread is a gold mine of good (and some not so good) programming language ideas.]

Exponentiation operator:

I'd like to change the meaning of the ^ operator to be exponentiation and have binary ~ mean XOR (giving a nice symmetry between unary/binary - (negative/minus) and unary/binary ~ (NOT/XOR)).

bool argument to if and while:

It is surprising how many of the criticisms levelled at C and C++ can be traced back to a simple misfeature - allowing conditional statements ("if" and "while") to have arguments of any integral type. These statements should only accept bool values, thus sidestepping the whole "confusing = and ==" business altogether (well, at least until you want to compare two bool variables). The usual boolean operators !, &&, ||, ==, !=, <, >, <= and >= should all return bool values, and !, && and || should also require their arguments to be bool. This would break a lot of programs, but they would be easy to fix (just sprinkle "!= 0" everywhere you see a "bool argument required" error).
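
For example, under this rule the classic typo would simply fail to compile (do_something() is a placeholder):

int a = 0, b = 1;

if (a = b)          // compiles today: assigns b to a, then tests the int
    do_something();

if (a == b)         // what was almost certainly meant
    do_something();

// Under the proposed rule the first test is an error ("bool argument
// required"); if the assign-and-test really was intended, say so:
if ((a = b) != 0)
    do_something();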

Assignment not an operator:

I think that having the assignment operator return a value is a mistake. I try to avoid using constructs like "if (a = b)" and "a = b = c;" in my code because they are overly terse and make refactoring more difficult. I would also eliminate the post increment and decrement operators (no more "c++"!) and make all assignments, increments and decrements statements instead of expressions. I'm not sure I would go so far as to separate functions into pure functions returning values which can be used in expressions and procedures which have no return values but can have side effects, but it is tempting.

Compile-time checked exceptions

Wednesday, May 2nd, 2007

C++ offers the ability to specify which exceptions might be thrown by a given function or method. Unfortunately, because of its terrible implementation (the exception specification is only checked at runtime), it is rarely used. I (and many others) think that this would have been much more useful as a compile-time feature.

Here is how I think it could work. Every function has an exception specification, which can be explicit or implicit. If it is implicit, it is calculated by taking the union of the exception specifications of all called functions and methods together with the set of exceptions the function actually throws itself, and then removing from that set any of the exceptions that the function actually catches. That way, exception specifications are optional, but are used if they are there.
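
For example (using today's throw-specification syntax, but with the proposed compile-time semantics):

struct A {}; struct B {}; struct C {};

void f() throw(A, B);   // explicit specification: f may throw A or B

void g()                // no specification, so the compiler infers one
{
    try {
        f();            // contributes {A, B}
    }
    catch (A&) {        // ...but A is caught here, removing it
    }
    throw C();          // adds C
}                       // inferred specification for g: throw(B, C)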

Then, certain functions (like the main() function, functions callable from C code, destructors and functions implemented in C code) can have an implicit exception specification of the empty set, eliminating an entire class of bugs.