Archive for the ‘computer’ Category

Parsing expression grammar grammar

Friday, August 15th, 2008

I have found that a fun thing to do is make up grammars for computer languages - figure out what syntax rules work well and what is ambiguous (to both humans and computers - it seems the two are more closely related in this respect than I would initially have imagined).

The language I eventually want to write will have a parser generator (probably generating packrat parsers from Parsing Expression Grammars) built in, so I thought I would write a grammar for the grammars accepted by that - a rather self-referential exercise. I keep going back and forth on some of the syntax details, but this is how it looks at the moment:

// Characters

Character = `\n` | `\r` | ` `..`~`;

EndOfLine = `\r\n` | `\n\r` | `\n` | `\r`;

AlphabeticCharacter = `a`..`z` | `A`..`Z` | `_`;

AlphanumericCharacter = AlphabeticCharacter | `0`..`9`;

EscapedCharacter = `\\` (`\\` | `\`` | `n` | `r` | `"`);


// Space

MultilineComment :=
  `/*` (
      MultilineComment
    | !`*/` Character
  )* "*/"
// Note that this is recursive because multi-line comments nest!
// To match C-style (non-nesting) comments, use
// CStyleMultilineComment := `/*` (!`*/` Character)* "*/";

Space =
  (
      ` `
    | EndOfLine
    | `//` (!EndOfLine Character)*
    | MultilineComment
  )*;

_ := !AlphanumericCharacter [Space];


// Tokens

Identifier := AlphabeticCharacter AlphanumericCharacter*;

CharacterLiteral := `\`` ( Character-(`\n` | `\\` | `\``) | EscapedCharacter )* "`";
  // No spaces matched afterwards

StringLiteral := `"` ( Character-(`\n` | `\\` | `"`) | EscapedCharacter )* "\"";
  // Optionally matches _ afterwards

// Productions and rules

CharacterRange := CharacterLiteral ".." CharacterLiteral;

Rule :=
  (
    (
      (
          Identifier
        | "[" Rule "]"
        | "!" Rule
        | "&" Rule
        | "(" Rule ")"
        | "EndOfFile"
        | StringLiteral
        | CharacterRange
        | CharacterLiteral
      ) / "|" / "-" / "\\" / "/"
    ) ["+" | "*"]
  )*;

Production := [Identifier] (":=" | "=") Rule ";";

= [_] Production* EndOfFile;

The rules are as follows:

Rule1 | Rule2  - prioritized alternative (try Rule1 first, then Rule2)
Rule1 Rule2    - sequence
Rule*          - Kleene star (zero or more matches of Rule)
Rule+          - one or more matches of Rule (equivalent to Rule Rule*)
[Rule]         - optionally matches Rule
!Rule          - matches only if Rule does not match here; consumes nothing
&Rule          - matches only if Rule matches here, but consumes nothing
(Rule)         - grouping (order of operations)
Rule1-Rule2    - matches Rule1 but not Rule2
Rule1/Rule2    - a sequence of strings matching Rule1 separated by strings matching Rule2, left-associative (i.e. X := Y/Z => X := Y (Z Y)*)
Rule1\Rule2    - a sequence of strings matching Rule1 separated by strings matching Rule2, right-associative (i.e. X := Y\Z => X := Y [Z X])
Char1..Char2   - matches any single character between the character in Char1 and the character in Char2, inclusive

Having a single grammar for both Parser and Lexer is nice in some respects but does introduce some additional complications. Some strings (those I've called CharacterLiterals here) must match exactly (no whitespace is consumed after them) and some (those I've called StringLiterals here) must consume any whitespace that appears after them (done by optionally matching the _ production). Similarly with productions - those created with ":=" optionally match _ at the end.

The root production has no name.

The "/" and "\" delimiters makes it really easy to write grammars for expressions with infix operators. For example, the core of the C++ expression production is:

LogicalOrExpression := CastExpression
  / (".*" | "->*")
  / ("*" | "/" | "%")
  / ("+" | "-")
  / ("<<" | ">>")
  / ("<" | ">" | "<=" | ">=")
  / ("==" | "!=")
  / "&"
  / "^"
  / "|"
  / "&&"
  / "||";

Deleting from the database

Wednesday, August 13th, 2008

Continuing on from yesterday's post - a commenter asked what happens if you want to delete something from the database irrevocably. I was going to reply with a comment but my reply got kind of long and it's an important concept so I decided to make a post out of it.

I have thought about this a bit and I think there are two reasons why you would want to delete something irrevocably. One is to save disk space (if the thing you want to delete is very big and you're sure you won't need it again). Current storage capacities are sufficiently high that this is a much less common problem than it used to be (especially for programmers - I suppose astronomers and particle physicists tend to have larger chunks of data, though).

The other reason is if the data should never have been checked into the version control system in the first place - for example if you accidentally checked your tax return into your company's source control repository.

Either way, any reasonable VCS should be able to cope with such requirements. To prevent abuse, this would probably have to be done by the owner or administrator of the system storing the data. I know at Microsoft there was a process for this (I never had to use it myself, but I saw an embarrassing bug report which had had its edit history erased so I know it can be done).

This is one area where distributed version control has a bit of a disadvantage - you have to contact everyone who has downloaded a copy of the repository and persuade them to erase the unwanted data and pass on the message to everybody who might have downloaded it from them and so on. For a popular project, this might be as impossible as unpublishing something that's been published on the internet.

As for the technical details of obliterating from the accumulate/cache database in particular, it's easy to do (as long as you can remove data from the accumulate table) except for any changes which happen on top of the undesirable change. These either have to be reparented or deleted - if there have been a lot of changes on top of the unwanted one, reparenting might be very labour-intensive and deleting might be unacceptable, particularly if those changes are to the data you're trying to remove.

Deconstructing the database

Tuesday, August 12th, 2008

This post elaborates on some database organization ideas that I mentioned in passing in Foundry.

The usual interface presented to a database is one based on "query/update" - the database contains some set of information which can be queried and updates can be made when this information needs to be changed. At any time there is a single "most up to date" state of the database consisting of the data after all "update" operations known so far have been performed.

The usefulness of this format cannot be denied but there is another possible way of organizing things - I call it "accumulate/cache" (I haven't read much in the way of database theory literature so this might already exist by some other name). In such a database, you (at some fundamental level) never change things, only add new things. "Update" operations consist of sending a "delta" to the database saying what needs to change, what it needs to change to and the timestamp of the change that is being used as the "parent" of this change. These deltas are then themselves timestamped and dumped onto the end of one big file (or table, if you're implementing this on top of a traditional relational database, though because of the limited nature of the operations on this table, writing special purpose code for this could probably improve efficiency).
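
As a rough sketch of this in Python (the names and representation are invented; a record's index in the table serves as its timestamp):

class AccumulateDB:
    def __init__(self):
        self.table = []    # the append-only accumulate table
        self.cache = {}    # maps (timestamp, key) to a value

    def add(self, key, value, parent):
        """Append a delta; the new record's index is its timestamp."""
        self.table.append((key, value, parent))
        return len(self.table) - 1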

To extract information from this database, you need a timestamp as well as the key of the piece of information you want to look up. You can look at the database as it was at any point in history just by supplying the appropriate timestamp. Thus you get what is essentially revision control for free (except for the fact that this kind of database probably uses much more space than the traditional sort - lack of disk space in previous generations may be why this sort of database isn't the way things are usually done).

Given a timestamp and a key, the algorithm for looking up a piece of information is quite simple: first look in the cache and see if that timestamp/key combination is there. If it is, you're done - just return that value. If it's not, look up that timestamp in the accumulation table. If that change affects the value corresponding to your key, add the result of the query to the cache and you're done. Otherwise pick the parent timestamp of that change and repeat.
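
That algorithm transcribes almost line for line into a method of the AccumulateDB sketch above:

    def lookup(self, key, timestamp):
        """Return the value of key as it was at timestamp."""
        t = timestamp
        while t is not None:
            if (t, key) in self.cache:
                value = self.cache[(t, key)]
                break
            changed_key, changed_value, parent = self.table[t]
            if changed_key == key:
                value = changed_value
                break
            t = parent                         # walk back to the parent change
        else:
            raise KeyError(key)                # key was never written
        self.cache[(timestamp, key)] = value   # cache the answer for next time
        return value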

The caching is necessary to avoid bad asymptotic behavior as the size of the database grows. I think (but haven't proved) that for a database of size N it should be possible to get back to O(log N) amortized performance with an O(N log N) sized cache. The amount of cache used can be tuned by throwing away the least recently accessed entries depending on the database's access patterns (patterns which look back at history a lot will probably require larger caches than patterns which only look at recent timestamps).

Only the accumulation table needs to be backed up as the cache table is automatically reconstructed as necessary when queries are made. The fact that the accumulation table is only ever appended to simplifies this.

The next piece of infrastructure needed on top of this is a "head" mechanism. The last timestamp in the accumulation table is the latest physical change made but this might not be the latest logical change. We would like the ability to have "atomic" operations made up of several changes. This can be done without any additional locks (so scales very well). The "head" value is just a timestamp filed under some well-known key. To get the most recent "head" just get the value of "head" corresponding to the most recent timestamp, and do all your other queries with the head value timestamp.
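
In terms of the same sketch (the "head" convention is exactly as described - a timestamp stored under a well-known key):

def current_head(db):
    latest = len(db.table) - 1         # the latest physical timestamp
    return db.lookup("head", latest)   # gives the latest logical timestamp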

To perform an atomic update, you need to make your changes and then merge them with the latest head. This is trivial if the current head is the parent of your first change. Otherwise something else has written to the database in the meantime and a merge will need to be done, creating a new "parent of first change". Eventually (as long as the time between commits is long compared to the time it takes to do a commit) you'll get to the trivial case, and the only thing that needs to be done atomically is the trivial "compare parent to the value of head and swap with the new head if they're equal" (I think most CPUs can do this atomically with a single instruction, but in any case it is one of the simpler lock-free algorithms).
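
Here is roughly what that loop looks like in the same sketch: parent is the head the first change was based on, tip is the timestamp of the last change, and merge is application-specific. compare_and_swap is written out non-atomically just to pin down its semantics - a real implementation would rely on the storage layer (or a single CPU instruction) to make it atomic:

def compare_and_swap(db, expected, new_head):
    # Must be made atomic by the storage layer. Note that updating the
    # head is itself just an ordinary append to the accumulate table.
    if current_head(db) != expected:
        return False
    db.add("head", new_head, parent=expected)
    return True

def commit(db, parent, tip, merge):
    """Publish tip (the timestamp of our last change) as the new head."""
    while True:
        head = current_head(db)
        if head != parent:
            # Someone else committed in the meantime: merge on top of them.
            tip = merge(db, tip, head)
            parent = head
            continue
        if compare_and_swap(db, expected=parent, new_head=tip):
            return tip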

Some databases already work like this behind the scenes to some extent (log replaying etc) but I think that making this part of the interface simplifies the atomicity logic considerably, not to mention the usefulness of being able to query versions of the database from any point in history.

One complication occurs if it becomes necessary to remove from the database some information which should never have been placed there (e.g. some copyrighted work). Since you want to obliterate history in this case, an update needs to be made on the accumulation table. The cache (or the portion of it from after that point) must then be regenerated. One also needs to decide if subsequent updates to the copyrighted value are "derived works" or not, and either obliterate or leave alone the updates to that value.

One limiting factor of this architecture is access times - even with the fastest current disks one can only traverse a chain at maybe 500 links per second. It will only be a few years before we're all using solid state flash disks, though, and that will dramatically speed up access times.

Edit 14th July 2013:

Since writing this I came across an interesting article and talk on the same subject.

Computer algebra system

Sunday, August 10th, 2008

At some point in the future I'd like to write a computer algebra system, like Maple or Mathematica, just because I think it would be a really fun piece of software to write. The basic idea is that you can tell the computer things you know (e.g. "x+2=3") and ask it questions (like "x?") and the computer would attempt to give the simplest possible answer by searching its database of facts. When printing formulae on the screen it would use algorithms from TeX to give attractive output.
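
Just to give the flavour of that interaction (using the existing sympy library purely for illustration - the program described here would of course be its own thing):

from sympy import Eq, solve, symbols

x = symbols('x')
fact = Eq(x + 2, 3)      # tell the computer: "x+2=3"
print(solve(fact, x))    # ask it: "x?"  - prints [1]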

Another nice feature would be the ability to directly manipulate formulae, for example rearranging terms of an equation by dragging them with the mouse or expanding multiplications by clicking on them (the program, of course, would prevent manipulations that aren't mathematically correct). These kinds of manipulations can be very tedious to do by hand.

Yet another feature that I want is the ability to create animated, interactive proofs. Rather than just presenting a static sequence of "X implies Y implies Z" on a page, the program could actually create an animation of X turning into Y. And if, at some stage, a derivation is unclear, the user could right-click on it, select "Why?" and the program would attempt to explain. That sounds difficult to do but I think much of this is really quite mechanical. When studying mathematics at university, I often wished that the proofs were presented like this - it would have made things much easier.

As well as an interactive algebra and mathematical presentation system, this program would also contain a big database of mathematical facts, both to aid in the program's own proofs and as an (interactive) mathematics textbook in its own right. Mathematicians using the software could contribute their discoveries to this database/textbook in addition to (or even instead of) the traditional distribution method of writing papers.

English as a language for programming

Friday, August 8th, 2008

Programmers write programs in computer languages but the comments and identifiers (which are important, but not meaningful to the computer) are written in a human language.

Usually this human language is English, but not always - I have occasionally run across pieces of source code in French, German and Hebrew. I guess it makes sense for a programmer to write code in their first language if they are not expecting to collaborate with someone who doesn't speak that language (or if that piece of code is very specific to that language - like a natural language parser).

On the other hand, it seems kind of short-sighted to write a program in anything other than English these days. There can't be many programmers who don't speak some amount of English (since most of the technical information they need to read is written in English), and it seems likely that all but the most obscure hobby programs will eventually be examined or modified by someone who doesn't speak the first language of the original author (if that language isn't English).

There are other advantages to standardizing on English - a common vocabulary can be developed for particular programming constructs which makes programs easier to understand for those who are not familiar with their internal workings. The aim is, of course, that any programmer should be able to understand and work on any program.

That there is a particular subset of the English language that is used by programmers is already evident to some extent - I think it will be interesting in the next few years and decades to see how this subset solidifies into a sub-language in its own right.

I should point out that I'm not advocating legal or arbitrary technical barriers to prevent programs being written in other languages - more that it might be useful to have tools which can help out with programming tasks for programs written in English.

Having said all that, I think that in years to come a higher proportion of programming will be done to solve particular one-off problems rather than to create lasting programs - there's no reason why these throw-away programs shouldn't be written in languages other than English. Tool support for this can be very minimal, though - perhaps just treating the UTF-8 bytes 0x80-0xbf and 0xc2-0xf4 as alphabetic characters and the sequence 0xef, 0xbb, 0xbf as whitespace.
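
That really is about all a lexer needs. A sketch (the byte ranges are the ones from the previous sentence - UTF-8 continuation bytes, valid lead bytes and the byte order mark; everything else is invented):

import itertools

BOM = b'\xef\xbb\xbf'    # treated as whitespace by the lexer

def is_alphabetic_byte(b):
    return (0x80 <= b <= 0xbf        # UTF-8 continuation bytes
            or 0xc2 <= b <= 0xf4     # valid UTF-8 lead bytes
            or 0x41 <= b <= 0x5a     # A-Z
            or 0x61 <= b <= 0x7a     # a-z
            or b == 0x5f)            # underscore

source = 'приветствие = 42'.encode('utf-8')
if source.startswith(BOM):
    source = source[len(BOM):]       # skip the byte order mark
name = bytes(itertools.takewhile(is_alphabetic_byte, source))
print(name.decode('utf-8'))          # prints: приветствие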

Webifying old games

Tuesday, August 5th, 2008

One of these days I'd like to put some of my old DOS programs on the web to make them a bit more accessible. I used to think that I would do that in Java but these days Flash seems to be the most popular way to do interactive thingies on the web so maybe I should learn that. Or even just plain old HTML with Javascript - if someone can implement Lemmings in DHTML then I'm sure Cool Dude, and probably some of the VGA demos, could be implemented that way as well.

Digger adventure

Monday, August 4th, 2008

Lately I haven't really had any interest in improving Digger, but I still occasionally think it might be fun to write a sequel. "Digger 2" or "Digger's Adventure" would have a 2D continuously scrolling playfield, impenetrable walls, new monsters, bosses, new items to collect, more weapons, shields, warps, speed-boosters, and maybe the possibility of digging to the surface, where it would be possible to jump.

Another 2D game

Sunday, August 3rd, 2008

Another 2D game I'd like to write (which might even be the same game as that one) is a bouncing ball game.

The player would be in control of a ball which could be accelerated by means of the direction keys. So holding down right would give a velocity increase in the positive x direction proportional to the length of time the key was held down (there have been a few games that work like this, but in my experience in most non-simulation games the player only has one or two speeds and acceleration to/from these is instantaneous). Wind resistance would be implemented to avoid the possibility of the ball reaching unlimited speeds.

Once the player has got used to these controls the next complication is maze traversal. While you could in principle just manoeuvre around the maze slowly, it is much more fun to play at speeds close to the ball's terminal velocity. At these speeds, the ball's turning radius will be much larger than the scale of features in the maze, so the ball will hit the walls. When it does so, the obvious thing to happen is that it should bounce off (unless the wall is dangerous or special in some way).

Another possible complication is areas of the maze with gravity fields. The strength of the gravity could be smaller than the control acceleration (in which case it just changes the controls a bit) or could be greater (in which case entire directions become impossible to travel in). Gravity fields can be constant or central (or may have even more complicated configurations).

Then there could be places in the maze where the drag force is proportional to the velocity plus some constant field (i.e. windy places).
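
One way all of this might fit into a single update step (a sketch with made-up constants; modelling drag as a pull toward the local wind velocity covers both ordinary still-air drag, with wind = (0, 0), and the windy places):

def step(pos, vel, keys, gravity, wind, dt):
    """One physics tick; pos, vel, gravity and wind are (x, y) tuples."""
    CONTROL = 50.0    # control acceleration while a key is held (made up)
    DRAG = 0.8        # drag coefficient (made up)

    # Direction keys give a constant acceleration while held down.
    ax = CONTROL * (keys.get('right', 0) - keys.get('left', 0)) + gravity[0]
    ay = CONTROL * (keys.get('down', 0) - keys.get('up', 0)) + gravity[1]

    # Drag pulls the velocity toward the wind velocity, so the terminal
    # speed relative to the air is roughly CONTROL / DRAG.
    ax += DRAG * (wind[0] - vel[0])
    ay += DRAG * (wind[1] - vel[1])

    vel = (vel[0] + ax * dt, vel[1] + ay * dt)
    pos = (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)
    return pos, vel   # a bounce just negates the component of vel
                      # normal to the wall that was hit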

Some parts of the maze could be flooded with various different kinds of fluid, giving the opportunity to show off fun 2D fluid dynamics simulations.

Then of course there are the usual game elements - things to collect, power ups, space warps, spikes, enemies with various degrees of artificial intelligence and so on.

Emulation for fun and profit

Saturday, August 2nd, 2008

There's much more that you can do with an emulator than just play old computer games. In fact, I think that the usefulness of emulators is seriously underrated. Here are some useful things I can think of doing with an emulator that has some appropriate extensibility hooks:

  • Debugging. A debugger with an integrated emulator might be able to do the following:
    • Debug a program without the program being able to tell that it is running in a debugger - handy for investigating malware like viruses and DRM.
    • Save (delta) states at each step to make it possible to undo steps or perform backwards-in-time debugging (see the sketch after this list).
    • Debug multi-threaded programs deterministically by simulating multiple threads on a single thread and allowing the user to decide when to context switch.
  • Reverse engineering. The problem of finding the actual code in the binary is undecidable in general, but if you can find most of the important code by actually running it you can get most of the way there.
  • Static analysis. Finding bugs in code after it's been compiled by running it and (as it runs) checking things that would be difficult to check at compile time (code invariants). For example, assertions might not be compiled into an optimized binary but could be provided as metadata that could be understood by the analyzer. This would be a great help for tracking down those tricky bugs that disappear when you switch to debug binaries.
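
For the delta-state idea in the list above, a sketch of the mechanism (invented names, and memory only - a real emulator would record register and device state in the deltas too):

class ReversibleMachine:
    def __init__(self, size):
        self.memory = bytearray(size)
        self.history = []                # one delta (list of writes) per step

    def write(self, addr, value):
        self.history[-1].append((addr, self.memory[addr]))  # save old byte
        self.memory[addr] = value

    def step(self, instruction):
        self.history.append([])          # start a fresh delta
        instruction(self)                # all stores go through self.write

    def step_back(self):
        for addr, old in reversed(self.history.pop()):
            self.memory[addr] = old      # undo this step's writes in reverse

m = ReversibleMachine(64)
m.step(lambda m: m.write(0, 42))
m.step_back()
assert m.memory[0] == 0                  # back to the state before the step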

Keep your laws off my internet

Friday, August 1st, 2008

Do there need to be laws for the internet like there are laws in real life? The two realms are very different, so attempts to directly translate real-life rules to the internet are bound to fail.

For example, most laws in real life are to protect A from bad things that B might want to do to him (such as murder or theft). But all such forms of law are unnecessary on the internet because all that B can do to A is send packets to him. A is under no obligation to do anything with these packets - if he doesn't like them he can just ignore them.

So most of the laws that apply on the internet are the more nebulous sort that protect some third party C from B sending packets to A (for example copyright, C being the copyright holder) - censorship, essentially. But since C usually can't even tell when A and B are communicating, such laws are rather difficult to enforce and there is something rather troubling about having unenforceable laws on the books.

Another sort of law is that which protects commons (like a village square) from having houses built on it, or having rubbish dumped on it or otherwise being made unusable for its intended purpose.

Commons as such don't exist on the internet in quite the same way - for every computer on the internet, somebody has to pay for the electricity to keep it running if nothing else. There are "intellectual commons" - protocols and standards which we use to communicate. They can't really be stolen but they can be polluted in some sense - email is quite heavily polluted by spam now, for example.

Combating internet pollution does appear on the face of it to be something that requires laws (and indeed laws have been passed that try to do something about spam - the state of my junk mail folder suggests they haven't been very effective, though). But there is a technical way to deal with such things, and that is through the use of trust networks. The Sovereign Computing guys have some interesting ideas about trust networks. It does seem to me that a trust network layer built on top of the internet, which could provide services to applications like email, would be a very useful thing.

Some social network sites are quite popular but these tend to be walled gardens, lacking the distributedness that would be needed to create real trust networks. They all also seem to have little if any concept of transitivity of trust, meaning that they are little more than glorified lists of friends as far as their users are concerned.