Archive for the ‘language’ Category

Static scoping improved

Thursday, September 11th, 2008

Many programming languages have a facility (usually called "static") to allow the programmer to declare a variable which is visible only to some particular object but has storage at the program's scope - i.e. its value is the same for all instances of that object, and when it changes for one instance it changes for all the other instances too.

One programming language feature I've never seen (but which I think would be useful) is a generalization of this - the ability to declare a variable which is only visible in a particular object but whose scope is the (lexical) parent object. I call this "outer". For top-level objects, this would be the same as static but for nested classes the scope would be that of the outer class.

One could even use the "outer" keyword multiple times to put the variable in any particular level in the object nesting tree. This doesn't violate encapsulation, since members can still only be declared inside their classes.
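To illustrate, here is a sketch in a C++-flavoured syntax (the "outer" keyword is hypothetical and the names are mine), together with a rough emulation using today's nested classes:

    // Hypothetical syntax - "outer" does not exist in C++:
    //
    //     class Outer {
    //         class Inner {
    //             outer int count;        // one copy per Outer instance, visible only in Inner
    //             outer outer int total;  // two levels out: one copy per program, like static
    //         };
    //     };
    //
    // A rough emulation in today's C++ puts the storage in Outer and only
    // touches it from Inner, at the cost of the declaration being lexically
    // outside Inner:
    class Outer {
        int count = 0;                       // storage lives at Outer scope
    public:
        class Inner {
        public:
            void increment(Outer& o) { ++o.count; }   // only Inner reads or writes it
        };
    };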

If you have "outer" instead of "static" (and maybe a few other more minor tweaks) any program can be turned into an isolated object inside another program - i.e. you can easily turn a program into a multi-instance version of that program with all the instances running in the same process.

String storage

Sunday, August 17th, 2008

Most applications store strings as a consecutive sequence of characters. Sometimes when the string is copied the characters are copied too. Sometimes strings are reference counted or garbage collected to minimize this copying, but copies are still made when concatenating and performing other "string building" operations (otherwise the characters would no longer be consecutive in memory).

An alternative that might work better (especially for something like a compiler) would be to do the concatenation lazily. Actual character data comes from just a few places (the input files which are kept in memory in their entirety, static character data, and program argument data). There are two subtypes of string - one consists of a pointer to the first character and an integer recording the number of characters in the string. The other subtype consists of a vector of strings which are to be concatenated together. Integers (and maybe also formatting information) could be kept in other subtypes. The resulting tree-like data structure has a lot in common with the one I described in Lispy composable compiler.
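Here is a rough sketch of the two subtypes in C++ (type and member names are illustrative, not a finished design):

    #include <cstddef>
    #include <string>
    #include <variant>
    #include <vector>

    // A Slice points into character data that already exists elsewhere - an
    // input file held in memory, a static literal, a program argument - and a
    // Concatenation defers joining its parts until the text is actually needed.
    struct String;

    struct Slice {
        const char* first;     // not owned; the underlying buffer outlives the string
        std::size_t length;
    };

    struct Concatenation {
        std::vector<String> parts;
    };

    struct String {
        std::variant<Slice, Concatenation> value;

        // Characters are only copied here, when a flat string is finally required.
        void appendTo(std::string& out) const {
            if (const Slice* s = std::get_if<Slice>(&value)) {
                out.append(s->first, s->length);
            } else {
                for (const String& part : std::get_if<Concatenation>(&value)->parts)
                    part.appendTo(out);
            }
        }
    };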

I'm not sure if this actually saves much (if anything) in terms of memory space or speed over the usual methods (I suppose it depends on how long the average basic string chunk is), but it does have at least one potential advantage - Vectors (especially if they grow by doubling) will have many fewer possible lengths than strings, so memory fragmentation may be reduced. I think it's also kind of neat (especially if you have such data structures lying around anyway).

Parsing expression grammar grammar

Friday, August 15th, 2008

I have found that a fun thing to do is make up grammars for computer languages - figure out what syntax rules work well and what is ambiguous (to both humans and computers - it seems the two are more closely related in this respect than I would initially have imagined).

The language I eventually want to write will have a parser generator (probably generating packrat parsers from Parsing Expression Grammars) built in, so I thought I would write a grammar for the grammars accepted by that - a rather self-referential exercise. I keep going back and forth on some of the syntax details, but this is how it looks at the moment:

// Characters

Character = `\n` | `\r` | ` `..`~`;

EndOfLine = `\r\n` | `\n\r` | `\n` | `\r`;

AlphabeticCharacter = `a`..`z` | `A`..`Z` | `_`;

AlphanumericCharacter = AlphabeticCharacter | `0`..`9`;

EscapedCharacter = `\\` (`\\` | `\`` | `n` | `r` | `"`);


// Space

MultilineComment :=
  `/*` (
      MultilineComment
    | !`*/` Character
  )* "*/"
// Note that this is recursive because multi-line comments nest!
// To match C-style (non-nesting comments), use 
// CStyleMultilineComment := `/*` (!`*/` Character)* "*/";

Space =
  (
      ` `
    | EndOfLine
    | `//` (!EndOfLine Character)*
    | MultilineComment
  )*;

_ := !AlphanumericCharacter [Space];


// Tokens

Identifier := AlphabeticCharacter AlphanumericCharacter*;

CharacterLiteral := `\`` ( Character-(`\n` | `\\` | `\``) | EscapedCharacter )* "`";
  // No spaces matched afterwards

StringLiteral := `"` ( Character-(`\n` | `\\` | `"`) | EscapedCharacter )* "\"";
  // Optionally matches _ afterwards

// Productions and rules

CharacterRange := CharacterLiteral ".." CharacterLiteral;

Rule :=
  (
    (
      (
          Identifier
        | "[" Rule "]"
        | "!" Rule
        | "&" Rule
        | "(" Rule ")"
        | "EndOfFile"
        | StringLiteral
        | CharacterRange
        | CharacterLiteral
      ) / "|" / "-" / "\\" / "/"
    ) ["+" | "*"]
  )*;

Production := [Identifier] (":=" | "=") Rule ";";

= [_] Production* EndOfFile;

The rules are as follows:

Rule1 | Rule2 - prioritized alternative
Rule1 Rule2 - sequence
Rule* - Kleene star (zero or more Rules)
Rule+ - Rule Rule* (one or more Rules)
[Rule] - optionally matches Rule (zero or one)
!Rule - matches only if Rule does not match here; consumes nothing
&Rule - matches only if Rule does match here, but consumes nothing
(Rule) - grouping (order of operations)
Rule1-Rule2 - matches Rule1 but not Rule2
Rule1/Rule2 - a sequence of strings matching Rule1 separated by strings matching Rule2 - left-associative (i.e. X := Y/Z => X := Y (Z Y)*)
Rule1\Rule2 - a sequence of strings matching Rule1 separated by strings matching Rule2 - right-associative (i.e. X := Y\Z => X := Y [Z X])
Char1..Char2 - matches any character between the character in Char1 and the character in Char2, inclusive

Having a single grammar for both Parser and Lexer is nice in some respects but does introduce some additional complications. Some strings (those I've called CharacterLiterals here) must match exactly (no whitespace is consumed after them) and some (those I've called StringLiterals here) must consume any whitespace that appears after them (done by optionally matching the _ production). Similarly with productions - those created with ":=" optionally match _ at the end.

The root production has no name.

The "/" and "\" delimiters makes it really easy to write grammars for expressions with infix operators. For example, the core of the C++ expression production is:

LogicalOrExpression := CastExpression
  / (".*" | "->*")
  / ("*" | "/" | "%")
  / ("+" | "-")
  / ("<<" | ">>")
  / ("<" | ">" | "<=" | ">=")
  / ("==" | "!=")
  / "&"
  / "^"
  / "|"
  / "&&"
  / "||";

English as a language for programming

Friday, August 8th, 2008

Programmers write programs in computer languages but the comments and identifiers (which are important, but not meaningful to the computer) are written in a human language.

Usually this human language is English, but not always - I have occasionally run across pieces of source code in French, German and Hebrew. I guess it makes sense for a programmer to write code in their first language if they are not expecting to collaborate with someone who doesn't speak that language (or if that piece of code is very specific to that language - like a natural language parser).

On the other hand, it seems kind of short-sighted to write a program in anything other than English these days. There can't be many programmers who don't speak some amount of English (since most of the technical information they need to read is written in English), and it seems likely that all but the most obscure hobby programs will eventually be examined or modified by someone who doesn't speak the first language of the original author (if that language isn't English).

There are other advantages to standardizing on English - a common vocabulary can be developed for particular programming constructs which makes programs easier to understand for those who are not familiar with their internal workings. The aim is, of course, that any programmer should be able to understand and work on any program.

That there is a particular subset of the English language that is used by programmers is already evident to some extent - I think it will be interesting in the next few years and decades to see how this subset solidifies into a sub-language in its own right.

I should point out that I'm not advocating legal or arbitrary technical barriers to prevent programs from being written in other languages - more that it might be useful to have tools which can help out with programming tasks for programs written in English.

Having said all that, I think that in years to come a higher proportion of programming will be done to solve particular one-off problems rather than to create lasting programs - there's no reason why these throw-away programs shouldn't be written in languages other than English. Tool support for this can be very minimal, though - perhaps just treating the UTF-8 bytes 0x80-0xbf and 0xc2-0xf4 as alphabetic characters and the sequence 0xef, 0xbb, 0xbf as whitespace.
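The byte ranges above translate into a character classifier along these lines (a sketch; the function names are mine, and the ASCII letter/digit/underscore cases are the usual identifier characters assumed for comparison):

    #include <cstdint>

    // ASCII letters, digits and underscore stay alphabetic as usual; UTF-8
    // continuation bytes (0x80-0xbf) and lead bytes (0xc2-0xf4) are also treated
    // as alphabetic, so identifiers may contain words in any language.
    bool isIdentifierByte(std::uint8_t b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') ||
               (b >= '0' && b <= '9') || b == '_' ||
               (b >= 0x80 && b <= 0xbf) ||
               (b >= 0xc2 && b <= 0xf4);
    }

    // The UTF-8 byte-order mark EF BB BF is treated as whitespace.
    bool isByteOrderMark(const std::uint8_t* p) {
        return p[0] == 0xef && p[1] == 0xbb && p[2] == 0xbf;
    }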

Concrete data type

Friday, July 25th, 2008

A handy thing to have around (and something I want to put into the UnityALFE language) is a data type for representing physical quantities. This keeps track of a real number plus powers of metres, seconds, kilograms, amps, Kelvins (and maybe others, including user-defined ones). Two values of type Concrete can always be multiplied or divided but can only be added or subtracted if their dimensions match, and can only be converted to other types if they are dimensionless.

Common units of measurement and physical constants can be given as constants. Because the units are tracked automatically you can do things like:

(2*metre+5*inch)/metre

and get the right answer.

Usually the compiler should be able to check the dimensions at compile time and elide them like other type information, or give a compile error for expressions like:

(2*metre+5)/metre
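A minimal sketch of how such a type could be approximated with C++ templates (the names and the choice of dimensions are illustrative, not UnityALFE's): the dimension exponents live in the type, so mismatched additions fail to compile and only the underlying double survives to run time.

    // Dimension exponents for metres, seconds and kilograms.
    template<int M, int S, int KG>
    struct Concrete {
        double value;
    };

    template<int M, int S, int KG>
    constexpr Concrete<M, S, KG> operator+(Concrete<M, S, KG> a, Concrete<M, S, KG> b) {
        return {a.value + b.value};   // addition only compiles for matching dimensions
    }

    template<int M, int S, int KG>
    constexpr Concrete<M, S, KG> operator*(double s, Concrete<M, S, KG> c) {
        return {s * c.value};         // scaling by a dimensionless number
    }

    template<int M1, int S1, int KG1, int M2, int S2, int KG2>
    constexpr Concrete<M1 + M2, S1 + S2, KG1 + KG2>
    operator*(Concrete<M1, S1, KG1> a, Concrete<M2, S2, KG2> b) {
        return {a.value * b.value};   // multiplication adds exponents
    }

    template<int M1, int S1, int KG1, int M2, int S2, int KG2>
    constexpr Concrete<M1 - M2, S1 - S2, KG1 - KG2>
    operator/(Concrete<M1, S1, KG1> a, Concrete<M2, S2, KG2> b) {
        return {a.value / b.value};   // division subtracts exponents
    }

    constexpr Concrete<1, 0, 0> metre{1.0};
    constexpr Concrete<1, 0, 0> inch{0.0254};

    constexpr auto ok = (2.0 * metre + 5.0 * inch) / metre;   // dimensionless: fine
    // constexpr auto bad = (2.0 * metre + Concrete<0,0,0>{5.0}) / metre;  // compile error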

Along similar lines, the log() function could be considered to return a value of type Concrete with some special non-zero dimension. Then you can (and indeed must) specify to which base the logarithm should be taken by dividing by another logarithm (e.g. log(x)/log(e), log(x)/log(10) or log(x)/log(2)). This syntax is rather more verbose than the usual method of having separate functions for common bases (log(), lg(), ln() etc.) but I find that this is more than made up for by the fact that one doesn't have to remember which function corresponds to which base - it's self-describing.

Another useful type for scientific/engineering work would be a value with confidence interval (e.g. 5±1 meaning "distributed normally with a mean of 5 and a standard deviation of 1"). There are well-defined rules for doing arithmetic with these. A generalization of this to other distribution functions might also be useful.
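The usual propagation rules for independent, normally distributed quantities are easy to sketch (illustrative names; this assumes independence and non-zero values when forming relative errors):

    #include <cmath>

    struct Measured {
        double value, sigma;   // mean and standard deviation
    };

    Measured operator+(Measured a, Measured b) {
        // absolute uncertainties add in quadrature
        return {a.value + b.value, std::sqrt(a.sigma * a.sigma + b.sigma * b.sigma)};
    }

    Measured operator*(Measured a, Measured b) {
        // relative uncertainties add in quadrature
        double v = a.value * b.value;
        double ra = a.sigma / a.value;
        double rb = b.sigma / b.value;
        return {v, std::fabs(v) * std::sqrt(ra * ra + rb * rb)};
    }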

My first made-up language

Wednesday, July 23rd, 2008

Years ago (more than a dozen), I tried to write (in C) an interpreter/compiler for a dialect of BASIC that I made up. It was based on BBC BASIC and the BBC's graphics facilities (which I was very familiar with at the time) but it would have run on the PC and had some additional features:

  • Operators and control structures from C
  • More string and array processing facilities
  • More graphics commands
  • Interactive editor and debugger
  • Commands for accessing PC facilities (interrupts, calling native code etc.)
  • Built-in help
  • Facilities for storing chunks of code as a library
  • Complex numbers
  • Self-modification and introspection facilities

The last of these is what I want to write about today.

As a child I used many of the 8-bit BASIC variants which had line numbers and it always irked me that there were some things you could do in immediate mode that you couldn't do from a running program, such as editing the program. Why was it that typing:

PRINT "HELLO"

printed "HELLO" immediately and:

10 PRINT "HELLO"

printed "HELLO" when the program was run but

10 10 PRINT "HELLO"

didn't create a program that replaced itself with the program '10 PRINT "HELLO"' when it was run? While doing so didn't seem terribly useful, it seemed to me that an interpreter would be just as easy to write (if not easier) with this facility as without it, and that it was an unnatural omission.

Along similar lines, my dialect had an "INTERPRET" command which took a string and ran it as BASIC code and a "PROGRAM$" array which contained the lines of the program.

I got some way into writing a parser for it, but as I didn't know how to write a parser back then, I got horribly tangled trying to write code for each possible combination of current state and next piece of input.

The similarities between this and the UnityALFE language that I'm in the (very slow and gradual) process of writing haven't escaped me.

Lispy composable compiler

Friday, July 4th, 2008

When I finally write the compiler for my super-language, I want to make it nice and modular - i.e. make it very easy to compile different languages, target different architectures and include different optimizations.

In order to achieve this I will need to have some kind of intermediate form in which pieces of code in various states of compilation can be transmitted from one module to another.

This form should be as simple as possible but should also make it possible to express all the complexities that occur in real code. In short, it will need to be a tree structure.

While I'm not convinced that the Lisp idea of "code without syntax" is a great one (I think every available visual/syntactical clue makes it easier to understand unfamiliar code) I have to admit that the lisp data structures (SExps and Lists) are perfect for this role. I wouldn't expect people to be reading large quantities of this code but it's handy to be able to print it out for debugging purposes.

A Symbol (what lispers call an SExp, S-Expression or Symbolic Expression) is either an Integer, a String, an Atom, a List or a Vector. Integers are what you'd expect them to be. Strings are used to hold things like variable names. Atoms are displayed as strings but treated as unique, opaque integers internally. All you do with an Atom is compare it to another Atom and see if they are the same. See Stringy Identifiers for another perspective on these. This is not quite the same as the Lisp usage of the word "atom", where integers are also atoms.

A List (delimited with parentheses) is effectively an object. It consists of one or more Symbols, the first of which must be an Atom. It has a type, which is represented by its Atom and the number and types of the following Symbols.

A Vector [delimited with square brackets] is effectively an array of length possibly not known until run-time. This can be used for a sequence of statements (for example) or for the bytes in the compiled program before it is output to disk. Lisp doesn't distinguish between lists and vectors but I think it is useful to keep them separate.
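A rough sketch of such a Symbol type in C++ (the names are mine, not a real implementation):

    #include <cstdint>
    #include <string>
    #include <variant>
    #include <vector>

    // An Atom is just an interned id that can only be compared for equality; a
    // List is an Atom tag plus its arguments; a Vector is a variable-length
    // sequence of Symbols.
    struct Atom {
        int id;
        bool operator==(const Atom& other) const { return id == other.id; }
    };

    struct Symbol;

    struct List   { Atom head; std::vector<Symbol> args; };
    struct Vector { std::vector<Symbol> items; };

    struct Symbol {
        std::variant<std::int64_t, std::string, Atom, List, Vector> value;
    };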

It is not a coincidence that this is quite similar to the RTL of GCC. Many of the same problems are being solved, and I was learning about RTL when I came up with this. However, this structure is a little more general-purpose and a little simpler. In particular it can also be used for front-ends.

These expressions will generally be manipulated by pattern matchers - that is, we search through the program for an object matching some characteristics, perform some transformation on it and then substitute the result for the original object.

For example, here is a part of an x86 assembler written in a hypothetical language which has support for these Symbols built in:

    with(x) {
        match (`jmp` d:Int):
            if (isSByte(d))
                becomes (`db` [0xeb d]);
            else
                becomes (`db` [0xe9 bytesFromDWord(d)]);
        ...
    }

This matches a List which consists of the `jmp` atom followed by an integer (which we will call d). We pass that integer to a function and (depending on the result) we emit one of two possible forms of the x86 JMP instruction. As "bytesFromDWord" is known to be a function and not a Symbol, it is evaluated and the result placed into the vector rather than two Symbols (bytesFromDWord and (d)) being created and placed into the vector.
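For comparison, here is roughly what that rule would look like written out by hand, reusing the Symbol, List and Atom types sketched above (isSByte becomes an explicit range check, and atomJmp stands in for an interned `jmp` atom; the hypothetical language's matcher would generate code along these lines):

    const Atom atomJmp{1};   // illustrative interned id for `jmp`

    bool assembleJmp(const Symbol& s, std::vector<std::uint8_t>& out) {
        const List* list = std::get_if<List>(&s.value);
        if (list == nullptr || !(list->head == atomJmp) || list->args.size() != 1)
            return false;
        const std::int64_t* d = std::get_if<std::int64_t>(&list->args[0].value);
        if (d == nullptr)
            return false;
        if (*d >= -128 && *d < 128) {                 // short form: EB rel8
            out.push_back(0xeb);
            out.push_back(static_cast<std::uint8_t>(*d));
        } else {                                      // near form: E9 rel32, little-endian
            std::uint32_t rel = static_cast<std::uint32_t>(*d);
            out.push_back(0xe9);
            for (int i = 0; i < 4; ++i)
                out.push_back(static_cast<std::uint8_t>(rel >> (8 * i)));
        }
        return true;
    }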

This works just as well for the front end. The "unless" statement that I have written about before can be broken down using this code:

    match (`unless` condition): becomes (`if` (`!` condition));

And we should have this optimization as well, in case condition is itself a negated condition:

    match (`!` (`!` x)): becomes (x);

Quite a lot of the compiler can be written very easily and tersely this way. There are some complications to do with the order in which transformations are applied, though - we would rather avoid having to specify that explicitly and let the computer figure it out. But the order might depend in quite a complicated way on the context. One way to deal with this might be to keep the pre-transformed Symbol around and continue to match against it even after it has been replaced. Another might be for transformations to transform other transformations as they are applied to the Symbols. I haven't quite figured this bit out yet. This seems like one of those problems that might be technically intractable but always works out to be fast enough in practice.

With some cleverness, it ought to be possible to write a very good compiler this way. Great speed is possible as no string handling is done after the initial parsing phase and symbol table lookups (e.g. we don't write out assembler instructions and rely on a separate assembler to assemble them, though we could still easily obtain assembly code if necessary by having a separate output module.) Many speed improvements are possible (e.g. by performing some transformations when the compiler is compiled).

But the really great thing about this scheme is that code for the compiler is completely self-describing - it's exactly the code one would hope to be able to write.

Having a set format for code also enables the scenario I wrote about in Compiler compilers.

Finally, this structure works just as well for disassemblers and decompilers as it does for compilers and assemblers. Some of the algorithms will be different and most of the transformations probably cannot be reversed directly, but many of the same patterns show up.

Plug-ins should not be in the same address space as the main process

Friday, June 27th, 2008

After some thought I have come to the conclusion that a software architecture which allows third-party code to run in the same process as the main program is a bad idea.

The first problem with such an architecture is reliability - how do you protect yourself against bugs in the third-party code? The short answer is that you can't - the plugin has the ability to stomp all over your process's invariants. The best you can do is terminate the process when such a situation is detected.

The second problem is that a compiler can't reason about the third-party code (it might not even exist at compile time). This means that there are all kinds of checks and optimizations that cannot be performed (like some of the stack tricks I mentioned a while ago).

The third problem is that one cannot tell what code the plug-in will rely on - if it sticks to documented interfaces that's okay, but (either deliberately or accidentally) a plug-in might rely on undocumented behavior which causes the plug-in to break if the program is updated. This forces the program to have ugly kludges to keep old plug-ins working.

If plug-ins are run as separate processes communicating with the main program via well-defined IPC protocols, all these problems are solved. The down-side is that these protocols are likely to be a little more difficult to code against and (depending on the efficiency of the operating system's IPC implementation) may also be significantly slower. The speed problem seems unlikely to be insurmountable though - current OSes can play back HD video which involves the uncompressed video stream crossing the user/kernel boundary - few applications are likely to need IPC bandwidth greater than that.

Compile-time symbolic computations

Saturday, May 31st, 2008

A compiler ought to be able to perform computations at compile time. Unlike the generated code, these computations don't have to be blazingly fast (since they only happen at compile time), and don't have to conform to any particular machine's architecture (since the language should be the same for different architectures anyway), so some nice things can be done.

Arbitrary-precision integer arithmetic is a nice easy one, and one that is done by many scripting languages already. This allows you to write expressions such as:

const int a = 1000000000000/1000;

and have a initialized to one billion even on 32-bit machines.

Rational arithmetic is also easy and useful. This allows one to write:

const int a = (1000/3)*3;

and have a initialized to exactly 1000 even on machines lacking any sort of floating point facility.
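In today's C++ the rational case can already be sketched with constexpr, which illustrates the intent (my sketch, not necessarily how such a compiler would work internally):

    struct Rational {
        long long num, den;
        constexpr Rational(long long n, long long d = 1) : num(n), den(d) {}
    };
    constexpr Rational operator*(Rational a, Rational b) { return {a.num * b.num, a.den * b.den}; }
    constexpr Rational operator/(Rational a, Rational b) { return {a.num * b.den, a.den * b.num}; }
    constexpr long long toInt(Rational r) { return r.num / r.den; }  // exact here: the 3s cancel

    constexpr long long a = toInt(Rational{1000} / 3 * 3);
    static_assert(a == 1000, "evaluated exactly at compile time, not rounded to 999");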

Then there are closed forms such as:

const int a = sqrt(1000)^2;  // a==1000, not 961

One could even make a compiler understand pi as a symbolic constant of type "Real" and evaluate it to the appropriate number of significant figures for the expression and type that is being initialized. So:

const int a = truncate_cast<int>(pi*10000);

would initialize a to 31415.

Once these things are in place, a compiler can perform quite sophisticated mathematical transformations to allow programmers to write what they really mean and still obtain optimal code. Even to the level of Maple/Mathematica if the compiler writers are sophisticated enough.

Compile time metalanguages

Friday, May 30th, 2008

The C programming language actually contains two different languages - C itself (the code for which runs at run-time) and the preprocessor language which runs at compile time and has completely different syntax. It's also not very powerful - often programmers resort to writing programs which generate C code as output instead.

It would be really nice to be able to make both these approaches unnecessary, and have a compile-time language that uses the same syntax as the run-time language. A compiler ought to be able to detect parts of the program which run on constant data, run them during compilation and emit the output directly into the generated binary.

Replacing #ifdefs is easy - just use the normal "if" and ensure that your compiler will optimize away false clauses for constant expressions. GCC does this, and indeed the GNU coding standards recommend it.
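For example, the two styles side by side (names here are illustrative): with ENABLE_LOGGING an ordinary compile-time constant, an optimizing compiler removes the dead branch, so the generated code matches the #ifdef version.

    #include <cstdio>

    enum { ENABLE_LOGGING = 0 };

    void doWork() {
    #ifdef ENABLE_LOGGING_MACRO          // preprocessor: a second, separate language
        std::printf("working\n");
    #endif
        if (ENABLE_LOGGING)              // ordinary code, same syntax as everything else
            std::printf("working\n");
    }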

Replacing code that performs complex transformations on data is more complicated, but not intractably so as long as the compiler can reliably determine that all the functions used in these transformations are pure.