Archive for the ‘language’ Category

Lexical scoping and closures

Thursday, May 29th, 2008

A closure is just a piece of code with some context - data on which it works. Such things are common in the functional programming community but were left out of the C language because they're actually quite difficult to implement properly.

The C++ language has sufficiently powerful constructs that one can create closures, but the syntax is very unwieldy. One must write:

#include <algorithm>
#include <vector>

class Totalizer
{
    int _total;
public:
    Totalizer() : _total(0) { }
    void operator()(int x) { _total += x; }
    int total() const { return _total; }
};

int total(std::vector<int> x)
{
    // for_each takes the functor by value and returns a copy, so we
    // capture its return value to get at the accumulated total.
    Totalizer t = std::for_each(x.begin(), x.end(), Totalizer());
    return t.total();
}

where one would really like to write something like:

int total(Vector<int> x)
{
    int t = 0;
    for_each (x, lambda(int y) {t += y;} );
    return t;
}

Here the code is the "{t += y;}" bit and the context is the "t" variable, which is local to the "total" function.

C# 3.0 provides functionality like this.

What's really going on here is that an object is being created which encapsulates some subset of the local variables (the "t" variable in this case) and which is passed to the for_each function (just like in the C++ version). Closures and objects are really just different syntaxes for the same thing.

If we want to return a closure from a function we can do so, but we must then allocate the object in such a way that the "t" variable will still be valid when the function exits. This is where things get really fiddly. This is one of those things that is easier in a language with garbage collection (you can just allocate the closure on the heap and the GC will take care of cleaning it up when it's finished with) but with care I think it should be possible to do it in a non-GC language as well.
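For illustration, here's a rough sketch (in present-day C++, with invented names) of what such a heap-allocated closure might look like if written out by hand, using a reference-counted std::shared_ptr to stand in for the garbage collector:

#include <memory>

class Counter
{
    std::shared_ptr<int> _total;  // the captured "t", kept alive on the heap
public:
    Counter() : _total(std::make_shared<int>(0)) { }
    int operator()(int y) { *_total += y; return *_total; }
};

Counter makeCounter()
{
    return Counter();  // the captured state safely outlives this function
}

The reference count plays the role of the garbage collector here: the captured variable is freed when the last copy of the closure is destroyed.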

Stringy identifiers

Wednesday, May 28th, 2008

In most computer languages, identifiers can only contain alphanumeric characters (plus, often, the underscore '_') and can't start with a number. There are very good reasons for this, but there are occasions when it would be desirable to be able to use arbitrary strings as identifiers. For example, suppose that we have a language that has a parser generator built in. Suppose also that we give it a parsing rule like this:

WhileStatement := ("while" | "until") "(" Expression ")" Statement

This is really just another way of writing a parsing class:

class WhileStatement { ... }

We would like the members of this class to be automatically generated by the parser generator in a predictable way. With string-identifiers, the generated class might look something like this:

  class WhileStatement
  {
    Enum { `"while"`, `"until"` } `"while" | "until"`;
    Expression expression;
    Statement statement;
  }

i.e. the member variable which holds the piece of data about whether this is a "while" loop or an "until" loop is called `"while" | "until"`. The type of this member is

Enum { `"while"`, `"until"` }

which means that it can take only the values `"while"` and `"until"`.

This class can be used as follows:

  WhileStatement whileStatement = WhileStatement::Parse(file);
  if (whileStatement.`"while" | "until"` == `"while"`)
    ...

This reads very naturally, and avoids the need for anyone (human or compiler) to make up any variable names. It's good to avoid making the programmer make up names for things when not absolutely necessary.
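For contrast, here is roughly what the generated class has to look like in a language without string identifiers - a sketch in C++, where somebody had to invent the names WhileKeyword and keyword:

struct Expression;
struct Statement;

enum class WhileKeyword { While, Until };

struct WhileStatement
{
    WhileKeyword keyword;     // stands in for `"while" | "until"`
    Expression* expression;
    Statement* statement;
};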

Stack measuring

Wednesday, May 14th, 2008

We generally do what we can to make programs reliable, but there is one particular cause of software failure which seems to have been surprisingly under-attacked, probably because it is fairly rare in real-world situations: stack overflow.

Many real-world programs have recursive algorithms where the recursion depth depends on user input and is unchecked, and where the activation records are allocated on the stack. Current computer systems do not recover well from stack overflow (you generally get an access violation which the program cannot recover from because it is in an unknown state). This means that it is possible to supply input which crashes the program.

The compiler for my ideal language will detect when a program uses a potentially unbounded recursion and allocate the activation records for such functions on the heap instead of the stack. This means that calling such a function can potentially throw an out-of-memory exception, which is much easier to recover from than a stack overflow (since we still have a stack to work with, and because the state of the program is well-defined - as far as the calling function is concerned the called function threw an exception, and as far as the called function is concerned it never got called).
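Here is a sketch of what that transformation might look like if performed by hand in C++ - the compiler would generate something more general, but a vector works as the heap-allocated stack of activation records:

#include <cstdint>
#include <vector>

std::uint64_t factorial(std::uint64_t n)
{
    std::vector<std::uint64_t> frames;  // activation records on the heap
    while (n > 1) {
        frames.push_back(n);  // "calling": may throw std::bad_alloc
        --n;
    }
    std::uint64_t result = 1;
    while (!frames.empty()) {  // "returning": pop a frame
        result *= frames.back();
        frames.pop_back();
    }
    return result;
}

If memory runs out, push_back throws std::bad_alloc while the real stack is still intact, so a caller further up can catch the exception and carry on.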

A nice side effect of this is that now every function with an empty exception specification has a well-defined maximum stack size, which means that when you create a thread you know exactly how much stack it needs. This means stack guard pages are no longer necessary (the stack can be treated just like any other area of memory) and potentially means that threads that perform small tasks can be created with very little overhead, possibly making thread pools unnecessary.

A programmer can track the maximum stack size of every thread in their application. If a change causes the stack size to increase dramatically this may be due to calling a very high-level function from a very low-level function, which may indicate a bug. Similarly, adding recursion without realizing it also indicates a possible bug, so it would be nice if the compiler could also alert the programmer to that.

ALFE build system

Tuesday, May 13th, 2008

When I finally get around to building my ideal language, it will have a linker and build system built in. It seems very silly to write your code in one language, your makefile in another, your configure script in a third, and so on.

The compiler will be invoked with the name of one source file on the command line, and that source file includes all the other required source files by means of an "include" statement which works similarly to the "#include" of C and C++, but never includes any given actual file more than once.

The program can also specify the name of the output file to generate, which can be a binary (.exe or .dll for Windows, or equivalent on other platforms) or intermediate file to be used with a C or C++ linker (.obj or .lib, or equivalent). Similarly, object and library files can be included just as source files can.

On a first pass through the program, the compiler will enumerate the complete set of input files and check their timestamps. If no input files are more recent than the output file, compilation will stop early.
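The up-to-date check itself is straightforward - a sketch in present-day C++ (the function name and signature here are just for illustration):

#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

bool needsRebuild(const std::vector<fs::path>& inputs, const fs::path& output)
{
    if (!fs::exists(output))
        return true;  // no output yet, so we must compile
    fs::file_time_type outputTime = fs::last_write_time(output);
    for (const fs::path& input : inputs)
        if (fs::last_write_time(input) > outputTime)
            return true;  // an input is newer than the output; recompile
    return false;  // nothing has changed; stop early
}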

Cases

Sunday, May 13th, 2007

C++ is hard to parse. One reason for this is that it's difficult to tell what is supposed to be a variable and what is supposed to be the name of a user-defined type. This means that information from later stages of parsing has to be fed back into earlier stages, making the whole thing a big tangled mess.

It's not just compilers that have problems either - the problematic constructs can also be difficult for humans to understand (especially if taken out of context).

I think C++ would be much easier to parse (for both humans and machines) if a very simple rule were added - if a name starts with a capital letter, it is the name of a type, and if it starts with a lower-case letter it is the name of a variable. Sort of a 1-bit Hungarian notation (except that it has meaning to the compiler).

This is already a de facto standard in many programs (often by virtue of being a consequence of an even stricter scheme). Though of course if it were added to C++ now it would break many many programs...

A nice side effect of this is that the syntax for types could be made much more regular, whilst still retaining the "definition mirrors use" paradigm. Given that we found a capital letter at the start of an identifier, we know we are parsing a type name. So if we come across an open parenthesis ("(") we can parse that as "a function returning the type to the left of the parenthesis". For example, we can name the type "array of N pointers to functions returning pointers to functions returning pointers to characters" just by reading it backwards: Char*()*()*[N]. This is much easier than the normal C++ declaration "char*(*(*a[N])())()", which is very hard to understand and which also buries the name of the variable you are declaring in the middle of the type name rather than putting it on the right where it belongs. SPECS solves the same problem but at the cost of a less familiar syntax and losing "definition mirrors use".
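Standard C++ does let you unpick such declarations with typedefs, at the cost of inventing a name for every layer - a sketch, using 10 for N:

typedef char* PChar;          // pointer to char
typedef PChar (*FPChar)();    // pointer to function returning PChar
typedef FPChar (*FFPChar)();  // pointer to function returning FPChar
FFPChar a[10];                // array of 10 such pointers

The capital-letter convention would give the same left-to-right readability without forcing you to name the intermediate types.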

Compiler compilers

Friday, May 11th, 2007

I think it would be cool if a language had a parser generator built in so that you could write BNF grammars right in the source code and the compiler would be able to turn these into parsing functions (this would also be really easy if the grammars were interpreted as Parsing Expression Grammars). What would be even better is if you could write text in the languages these grammars describe and have the compiler compile them right into the program alongside the code generated from the language itself.
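For a taste of the output, here is a hand-written parsing function of the kind such a generator might emit for the trivial rule Number := [0-9]+ (the signature is just one plausible choice):

#include <cctype>
#include <cstddef>
#include <string>

// Number := [0-9]+
bool parseNumber(const std::string& s, std::size_t& pos, int& value)
{
    std::size_t p = pos;
    int v = 0;
    while (p < s.size() && std::isdigit((unsigned char)s[p]))
        v = v * 10 + (s[p++] - '0');
    if (p == pos)
        return false;  // PEG-style failure: consume no input
    pos = p;
    value = v;
    return true;
}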

For example, if you were writing an assembler, you might define a language for defining mnemonics and how they (and their arguments) translate into opcodes. Then you would add text in this language describing the architecture that your assembler will target. Rather than this parser running every time you ran your assembler, the output of the parser could be embedded right into the program so the result is highly efficient.

An efficient code generator that could be called at runtime would also be useful. This would make applications like formula-driven fractal generators, graphing programs, electrical circuit simulators and numerical analysis packages fast and easy to write. Combined with the parser generators, it would also make it very simple to write a compiler for this hypothetical language - just define the syntax and the semantics, plug in the code generator and there's your compiler. It would be almost like having a compiled homoiconic language, except with a proper built-in syntax (unlike Lisp). Unfortunately it would compete with itself so is unlikely ever to be written as a commercial compiler.

More keywords

Thursday, May 3rd, 2007

I think the C and C++ computer languages don't have enough keywords. The current set may have been appropriate for the earliest C compilers but there are many keywords whose absence I often curse. Unfortunately it's very difficult to add new keywords to C++ because doing so tends to break a lot of existing programs. This is a curious difference between human languages and computer languages - although it seems like computer languages move much faster than human languages, the opposite is in fact true because you can add words to human languages without changing the meaning of existing documents.

Anyway, here are some of the keywords and constructs I'd like to be able to use in my programs:

Two part loops:

Often I find myself wanting a loop with the test in the middle. I usually end up writing these as an infinite loop with a "break" statement:

do {
    x;
    if (y)
        break;
    z;
} while (true);

I'd prefer to write it like this:

do {
    x;
} while (!y) {
    z;
}

[I realized sometime after writing this that I actually got this idea from here.]

until and unless:

I'd also like to be able to use "until" instead of "while" to reverse the sense of the test, like this:

do {
    x;
} until (y) {
    z;
}

Similarly, I'd like to be able to use "unless (x)" instead of "if (!x)". While this doesn't really make programs any shorter, I think being able to eliminate the "!" makes programs easier to understand and helps with the "say what you mean" principle.

Grouping if statements:

As an alternative to using the "switch" statement, I sometimes write:

if (a)
    b;
else
    if (c)
        d;
    else
        if (e)
            f;
        else
            g;

But this has an unfortunate tendency to lean to the right. I'd prefer to write:

if (a)
    b;
elseif (c)
    d;
elseif (e)
    f;
else
    g;

and leave the "else if" construct for situations like:

if (a)
    if (b)
        c;
    else
        d;
else
    if (b)
        e;
    else
        f;

Again, it doesn't make a big difference syntactically but does tend to shed more light on the programmer's intention, which is always a good thing. Similarly there should be an "elseunless" keyword meaning elseif with the sense reversed.

[These ideas came from here, though his "unless" is a bit different to mine, and I've never missed "when".]

Done clauses:

Sometimes I'm using a loop for a search and want to do something in particular if the thing I was searching for was not found. Normally I'd have to write something like this:

bool found = false;
Item current;
do {
    current = get_next_item();
    if (current == target) {
        process(current);
        found = true;
        break;
    }
} while (current != last);
if (!found)
    fail(target);

I'd prefer it if loops had something analogous to an "else" clause which is called only if the loop condition fails. I call this the "done" clause. This would make the example look like this:

Item current;
do {
    current = get_next_item();
    if (current == target) {
        process(current);
        break;
    }
} while (current != last); done
    fail(target);

Much neater.

Multiple break:

I'd like to break out of multiple loops at once. Suppose I'm searching a 2D array:

for (int x = 0; x < 10; ++x) {
    int y = 0;
    for (; y < 10; ++y)
        if (array[y][x] == target) {
            foobar(x, y);
            break;
        }
    if (y < 10)
        break;
}

I'd much rather write it as:

for (int x = 0; x < 10; ++x)
    for (int y = 0; y < 10; ++y)
        if (array[y][x] == target) {
            foobar(x, y);
            break break;
        }

The syntax "break continue" could be employed to mean "break out of the innermost loop and then do a 'continue' on the next loop out".

[I also posted this idea on comp.std.c++ here but it was shot down. I still think it's a good idea, though. That entire thread is a gold mine of good (and some not so good) programming language ideas.]

Exponentiation operator:

I'd like to change the meaning of the ^ operator to be exponentiation and have binary ~ mean XOR (giving a nice symmetry between unary/binary - (negative/minus) and unary/binary ~ (NOT/XOR)).

bool argument to if and while:

It is surprising how many of the criticisms levelled at C and C++ can be traced back to a simple misfeature - allowing conditional statements ("if" and "while") to have arguments of any integral type. These statements should only accept bool values, thus sidestepping the whole "confusing = and ==" business altogether (well, at least until you want to compare two bool variables). The usual boolean operators !, &&, ||, ==, !=, <, >, <= and >= should all return bool values and !, && and || should also require their arguments to be bool. This would break a lot of programs, but they would be easy to fix (just sprinkle "!= 0" everywhere you see a "bool argument required" error).
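To illustrate the effect of the rule (a sketch - the variable names are arbitrary):

void example(int a, int b)
{
    if (a = b) { }         // legal C++ today, and a bug more often than not
    if ((a = b) != 0) { }  // under the bool-only rule you must say this
    if (a == b) { }        // or, more likely, this is what was meant
}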

Assignment not an operator:

I think that having the assignment operator return a value is a mistake. I try to avoid using constructs like "if (a = b)" and "a = b = c;" in my code because they are overly terse and make refactoring more difficult. I would also eliminate the post increment and decrement operators (no more "c++"!) and make all assignments, increments and decrements statements instead of expressions. I'm not sure I would go so far as to separate functions into pure functions returning values which can be used in expressions and procedures which have no return values but can have side effects, but it is tempting.

Compile-time checked exceptions

Wednesday, May 2nd, 2007

C++ offers the ability to specify which exceptions might be thrown by a given function or method. Unfortunately, because of its terrible implementation (the exception specification is only checked at runtime), it is rarely used. I (and many others) think that this would have been much more useful as a compile-time feature.

Here is how I think it could work. Every function has an exception specification which can be explicit or implicit. If it is implicit it is calculated by creating the union of the exception specifications of all called functions and methods and the set of exceptions the function actually throws itself, and removing from that set any of the exceptions that that function actually catches. That way, exception specifications are optional but are used if they are there.
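A small example of how the calculation would go, using the old C++98 throw(...) syntax for the explicit case (the exception types here are made up):

struct FileError {};
struct ParseError {};

void g() throw(FileError);        // explicit specification: {FileError}
void h() { throw ParseError(); }  // implicit specification: {ParseError}

void f()
{
    // Implicit specification of f: the union {FileError, ParseError}
    // from the calls to g() and h(), minus {ParseError} which is
    // caught below, leaving just {FileError}.
    try {
        g();
        h();
    }
    catch (ParseError&) { }
}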

Then, certain functions (like the main() function, functions callable from C code, destructors and functions implemented in C code) can have an implicit exception specification of the empty set, eliminating an entire class of bugs.

Negative-overhead exceptions

Tuesday, May 1st, 2007

There are two schools of thought about how to handle errors in computer programs. By "errors" here I mean things that are out of the programmer's control, like "out of memory" or "file not found" rather than mistakes in the actual program (which is a whole other topic). Ideally we want programs which don't just crash when they hit such unexpected circumstances but rather tell the user exactly what the problem is and give them the opportunity to fix it and try again or do something different instead.

The first school of thought is that functions which fail should return some kind of "status code" (which may be "success" or one of several other values indicating different types of error). This is how most of the functions comprising the Windows API signal failure.

The problems with status codes are:

  1. Whenever you call a function you need to check its status code or risk missing an error (which usually makes things worse). It's tedious and error prone (mistake prone) to write the code to do this, and sometimes programmers forget.
  2. The information about an error you can return is quite limited - you can say "file not found" but you can't say which file wasn't found or what the program was trying to do at the time (you have to determine this from context).
  3. If you're using the return value of your function for a status code you can't use it for something more natural. For example, if you're writing a square root function that needs to return an error code when given an input that is negative, it can't also return the answer in a natural way (see the sketch after this list).
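Here is the square root example from point 3, sketched both ways:

#include <cmath>
#include <stdexcept>

// Status-code style: the return value is spent on the code, so the
// answer has to come back through an out-parameter.
int sqrtStatus(double x, double* result)
{
    if (x < 0)
        return -1;  // error code; the caller must remember to check it
    *result = std::sqrt(x);
    return 0;       // success
}

// Exception style: the return value is free to be the answer.
double sqrtThrow(double x)
{
    if (x < 0)
        throw std::domain_error("square root of a negative number");
    return std::sqrt(x);
}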

The second school of thought is that failures should be indicated by "throwing an exception", which is technically rather more complicated. This involves looking at the function that called the failing function, and the function that called that one, and so on up the stack until you find a function that says "hey, I know about that exception - pass control to me and I'll deal with it". This eliminates the need for all the functions in between to know about the possible failure at all (other than cleaning up their resources).

Exceptions are new and spiffy and are generally considered to be the "right way" of doing things in new programs. Unfortunately, exceptions also have drawbacks:

  1. Your program is less "explicit" about what it actually does - certain control paths are "hidden away" in the exception mechanism. Proponents of exceptions argue that this is a good thing.
  2. Certain programming styles cannot be used in the presence of exceptions. For example "do X, then do Y, then do Z, then do W if Y failed." Proponents of exceptions argue that these styles are mistake prone (it's all too easy to forget to do Z) and should be avoided anyway.
  3. While many programming languages in widespread use support exceptions, not everything does. In particular, certain "glue" for interfacing between bits of code written in different languages generally doesn't support exceptions, because it needs to be compatible with languages which don't support them (the old "lowest common denominator" problem).

Another criticism often made of exceptions is that they are slow. This may have been true in the past, but in practice it turns out that exceptions are no slower than status codes in the success case (they are slower in the error case, but that doesn't matter too much because errors should be pretty rare).

I think that it is possible to make programs that use exceptions even faster than programs that use status codes. With the right set of tables, the error case code can be completely separated from the normal case code. No code that deals with or tests for errors even needs to be loaded into memory until an exception is actually thrown. Then control passes to a "throw" function which loads the tables and error handlers from disk and examines the stack to determine which functions the exception passes through and which function will actually handle it. The code is faster than the "status code" version because it doesn't have to have all those "if failed" tests sprinkled throughout.

I haven't tried implementing it, but as far as I can tell there are very few places where this scheme would have any overhead at all. One is this situation - State0::RunState() and State1::RunState() can't be folded together with my scheme because they have different stack unwinding information.

The other limitation is that the stack layout and set of constructed objects have to be completely determined by the instruction pointer in any given function. This is usually the case, but it does mean that functions can't allocate dynamically sized arrays on the stack. These don't tend to be used in practice anyway because they have the serious problem that it is very easy to overflow the stack. If you do know in advance a maximum possible size for this array you may as well just allocate the maximum size.

Bootstrapping a compiler from nothing

Wednesday, September 14th, 2005

Two posts today 'cos I missed yesterday due to being disorganized.

Recently I've been working on bootstrapping a compiler from nothing. Just for fun. I know it's been done before but I wanted to learn about parsing and optimizing and how compilers are constructed.

The first stage of my compiler is a pretty clever hack, even if I do say so myself. I didn't want to use any external tools to get my compiler started, but that left me with a problem - how do I generate the first executable file? Well, one way to generate an arbitrary file on Windows is to just use an "echo" statement in the Windows command prompt and redirect the output to a file. But that only works reliably for ASCII characters (and not even all of those). This poses a problem, because the opcodes for even simple "MOV" instructions are all non-ASCII characters. But it turns out that the "constrained machine code" for x86 consisting of only ASCII bytes is actually Turing-complete and can be used to do useful things (non-ASCII opcodes such as the one for "INT" can be constructed using self-modifying code). So I put together an ASCII program that takes two characters from the command line, combines them into a byte and outputs the resulting byte (which can then be redirected to a file). Calls to this program can be strung together to make (almost) arbitrary binary files, which can be used to compile more complex languages.
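Here is the byte-combining step sketched in C++ rather than in ASCII-only machine code, assuming the two characters are hex digits (the actual encoding I used may have differed, and binary-mode output concerns are glossed over):

#include <cstdio>

static int hexDigit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;  // not a hex digit
}

int main(int argc, char** argv)
{
    if (argc != 2)
        return 1;
    int hi = hexDigit(argv[1][0]);
    int lo = hi < 0 ? -1 : hexDigit(argv[1][1]);
    if (lo < 0)
        return 1;
    std::putchar((hi << 4) | lo);  // emit the combined byte for redirection
    return 0;
}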

In this way (13 iterations later) I have built up a simple but effective 2-pass 16-bit DOS assembler which outputs .com files. I have also written a recursive descent parser for simple infix expressions on unsigned 16-bit integers, and am working on writing a code generator which can output binary code for these expressions.

Eventually I hope to evolve this into a fast and powerful language to rival C++. It will be a language which combines very low-level and very high-level concepts, and will therefore be an ideal language for writing compilers (such as itself) in. I could then use it for writing all sorts of other fun things - maybe I'll tackle an OS when I've finished the language. But for now I'm just having fun learning about things.