Startups

August 16th, 2008

I've been reading a lot about startups lately, specifically in the essays of Paul Graham. There is a part of me that regrets never having tried to start my own startup - unfortunately the bursting of the bubble and the intricacies of US immigration policy meant that there was never really a practical time. While being filthy rich does have a certain appeal, the more compelling reason is the one at the end of this essay - getting the business of earning enough money for life over with as quickly as possible, so that I can then work on whatever I want to. The summer/autumn before I started at Microsoft, when I had several months of mostly uninterrupted time to myself to work on whatever I wanted, was one of the most intellectually satisfying (not to mention relaxing) periods of my life.

I have always thought of myself as having progressive politics, but this is the best argument I have seen yet against taxing the rich. I guess one possible counterargument is that maybe taking very big risks isn't really necessary to develop new technologies (nor necessarily particularly beneficial to society as a whole) - the only reason this has been true in the past is that the developers and funders of new technologies have had very little information about whether those technologies were likely to be successful. With better information, perhaps we will be able to pick the winners before a lot of money has been spent, making the big risks unnecessary.

One other thought: as startups become cheaper than ever to create and run, perhaps we will see more of them following the model of craigslist - staying small (in the financial sense) and private and concentrating on what users want, instead of aiming for explosive growth, trying to become as profitable as possible and then either going public or getting acquired. While the latter is more lucrative, I suspect the former might be more intellectually and spiritually satisfying.

Parsing expression grammar grammar

August 15th, 2008

I have found that a fun thing to do is make up grammars for computer languages - figure out what syntax rules work well and what is ambiguous (to both humans and computers - it seems the two are more closely related in this respect than I would initially have imagined).

The language I eventually want to write will have a parser generator (probably generating packrat parsers from Parsing Expression Grammars) built in, so I thought I would write a grammar for the grammars accepted by that - a rather self-referential exercise. I keep going back and forth on some of the syntax details, but this is how it looks at the moment:

// Characters

Character = `\n` | `\r` | ` `..`~`;

EndOfLine = `\r\n` | `\n\r` | `\n` | `\r`;

AlphabeticCharacter = `a`..`z` | `A`..`Z` | `_`;

AlphanumericCharacter = AlphabeticCharacter | `0`..`9`;

EscapedCharacter = `\\` (`\\` | `\`` | `n` | `r` | `"`);


// Space

MultilineComment :=
  `/*` (
      MultilineComment
    | !`*/` Character
  )* "*/";
// Note that this is recursive because multi-line comments nest!
// To match C-style (non-nesting) comments, use
// CStyleMultilineComment := `/*` (!`*/` Character)* "*/";

Space =
  (
      ` `
    | EndOfLine
    | `//` (!EndOfLine Character)*
    | MultilineComment
  )*;

_ := !AlphanumericCharacter [Space];


// Tokens

Identifier := AlphabeticCharacter AlphanumericCharacter*;

CharacterLiteral := `\`` ( Character-(`\n` | `\\` | `\``) | EscapedCharacter )* "`";
  // No spaces matched afterwards

StringLiteral := `"` ( Character-(`\n` | `\\` | `"`) | EscapedCharacter )* "\"";
  // Optionally matches _ afterwards

// Productions and rules

CharacterRange := CharacterLiteral ".." CharacterLiteral;

Rule :=
  (
    (
      (
          Identifier
        | "[" Rule "]"
        | "!" Rule
        | "&" Rule
        | "(" Rule ")"
        | "EndOfFile"
        | StringLiteral
        | CharacterRange
        | CharacterLiteral
      ) / "|" / "-" / "\\" / "/"
    ) ["+" | "*"]
  )*;

Production := [Identifier] (":=" | "=") Rule ";";

= [_] Production* EndOfFile;

The rules are as follows:

Rule1 | Rule2    prioritized alternative
Rule1 Rule2      sequence
Rule*            zero or more Rules (Kleene star)
Rule+            one or more Rules (equivalent to Rule Rule*)
!Rule            does not match Rule
&Rule            matches Rule but is not consumed
(Rule)           order of operations
Rule1-Rule2      matches Rule1 but not Rule2
Rule1/Rule2      a sequence of strings matching Rule1 separated by strings matching Rule2 - left-associative (i.e. X := Y/Z => X := Y (Z Y)*)
Rule1\Rule2      a sequence of strings matching Rule1 separated by strings matching Rule2 - right-associative (i.e. X := Y\Z => X := Y [Z X])
Char1..Char2     matches a character between the character in Char1 and the character in Char2 (inclusive)
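
To make the last two rules concrete, here is a small Python sketch (not generated from the grammar above - the function and token names are just made up for illustration) showing the shape of the two expansions, using subtraction over a flat token list:

# X := Y/Z expands to Y (Z Y)*: fold as you go, so 1-2-3 parses as (1-2)-3.
def parse_left(tokens):
    pos = 0
    result = tokens[pos]                                   # Y
    pos += 1
    while pos + 1 < len(tokens) and tokens[pos] == '-':    # (Z Y)*
        result = ('-', result, tokens[pos + 1])
        pos += 2
    return result

# X := Y\Z expands to Y [Z X]: recurse, so 1-2-3 parses as 1-(2-3).
def parse_right(tokens, pos=0):
    left = tokens[pos]                                     # Y
    if pos + 1 < len(tokens) and tokens[pos + 1] == '-':   # [Z X]
        return ('-', left, parse_right(tokens, pos + 2))
    return left

tokens = [1, '-', 2, '-', 3]
print(parse_left(tokens))    # ('-', ('-', 1, 2), 3)
print(parse_right(tokens))   # ('-', 1, ('-', 2, 3))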

Having a single grammar for both Parser and Lexer is nice in some respects but does introduce some additional complications. Some strings (those I've called CharacterLiterals here) must match exactly (no whitespace is consumed after them) and some (those I've called StringLiterals here) must consume any whitespace that appears after them (done by optionally matching the _ production). Similarly with productions - those created with ":=" optionally match _ at the end.

The root production has no name.

The "/" and "\" delimiters make it really easy to write grammars for expressions with infix operators. For example, the core of the C++ expression production is:

LogicalOrExpression := CastExpression
  / (".*" | "->*")
  / ("*" | "/" | "%")
  / ("+" | "-")
  / ("<<" | ">>")
  / ("<" | ">" | "<=" | ">=")
  / ("==" | "!=")
  / "&"
  / "^"
  / "|"
  / "&&"
  / "||";

It always comes down to economics

August 14th, 2008

What do Bruce Schneier, Mark Chu-Carroll and Paul Graham have in common? They're all bloggers (though Graham prefers to describe himself as an essayist) who seem to have gradually transitioned towards writing about economics, starting from their more specialized subjects (cryptography, mathematics and Lisp respectively).

I guess this is because economics is a very powerful subject - if you want to change the world you have to make it worth people's while to change it, which means setting up the right economic conditions. Conversely, if you know (or think you know) how the world is likely to change in the next few years, you can use economic principles to figure out what to place bets on (i.e. what to invest in).

Deleting from the database

August 13th, 2008

Continuing on from yesterday's post - a commenter asked what happens if you want to delete something from the database irrevocably. I was going to reply with a comment but my reply got kind of long and it's an important concept so I decided to make a post out of it.

I have thought about this a bit and I think there are two reasons why you would want to delete something irrevocably. One is to save disk space (if the thing you want to delete is very big and you're sure you won't need it again). Current storage capacities are sufficiently high that this is a much less common problem than it used to be (especially for programmers - I suppose astronomers and particle physicists tend to have larger chunks of data, though).

The other reason is if the data should never have been checked into the version control system in the first place - for example if you accidentally checked your tax return into your company's source control repository.

Either way, any reasonable VCS should be able to cope with such requirements. To prevent abuse, this would probably have to be done by the owner or administrator of the system storing the data. I know at Microsoft there was a process for this (I never had to use it myself, but I saw an embarrassing bug report which had had its edit history erased so I know it can be done).

This is one area where distributed version control has a bit of a disadvantage - you have to contact everyone who has downloaded a copy of the repository and persuade them to erase the unwanted data and pass on the message to everybody who might have downloaded it from them and so on. For a popular project, this might be as impossible as unpublishing something that's been published on the internet.

As for the technical details of obliterating from the accumulate/cache database in particular, it's easy to do (as long as you can remove data from the accumulate table), except for any changes which happen on top of the undesirable change - these either have to be reparented or deleted. If there have been a lot of changes on top of the unwanted one, this might be very labour-intensive (if reparenting) or undesirable (if deleting), particularly if those changes touch the data you're trying to remove.
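
For what it's worth, here is a rough Python sketch of the reparenting option. The delta structure - a timestamp, a parent timestamp, a key and a value - follows the accumulate table described in yesterday's post; the field and function names are just mine:

def obliterate_and_reparent(deltas, bad_timestamp):
    # deltas is a list of dicts with 'timestamp', 'parent', 'key' and 'value'.
    bad = next(d for d in deltas if d['timestamp'] == bad_timestamp)
    survivors = []
    for d in deltas:
        if d['timestamp'] == bad_timestamp:
            continue                              # drop the unwanted change
        if d['parent'] == bad_timestamp:
            d = dict(d, parent=bad['parent'])     # reattach to its grandparent
        survivors.append(d)
    return survivors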

Deconstructing the database

August 12th, 2008

This post elaborates on some database organization ideas that I mentioned in passing in Foundry.

The usual interface presented to a database is one based on "query/update" - the database contains some set of information which can be queried and updates can be made when this information needs to be changed. At any time there is a single "most up to date" state of the database consisting of the data after all "update" operations known so far have been performed.

The usefulness of this format cannot be denied but there is another possible way of organizing things - I call it "accumulate/cache" (I haven't read much in the way of database theory literature so this might already exist by some other name). In such a database, you (at some fundamental level) never change things, only add new things. "Update" operations consist of sending a "delta" to the database saying what needs to change, what it needs to change to, and the timestamp of the change that is being used as the "parent" of this change. These deltas are then themselves timestamped and appended to the end of one big file (or table, if you're implementing this on top of a traditional relational database - though because of the limited nature of the operations on this table, special-purpose code could probably be more efficient).

To extract information from this database, you need a timestamp as well as the key of the piece of information you want to look up. You can look at the database as it was at any point in history just by supplying the appropriate timestamp. Thus you get what is essentially revision control for free (except for the fact that this kind of database probably uses much more space than the traditional sort - lack of disk space in previous generations may be why this sort of database isn't the way things are usually done).

Given a timestamp and a key, the algorithm for looking up a piece of information is quite simple: first look in the cache and see if that timestamp/key combination is there. If it is, you're done - just return that value. If it's not, look up that timestamp in the accumulation table. If that accumulation changes the value corresponding to your key, add the result of the query to the cache and you're done. Otherwise pick the parent timestamp of that change and repeat.
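
Here is a minimal Python sketch of that algorithm. The table layouts and names are my own invention for illustration: the accumulation table maps each timestamp to a (key, value, parent timestamp) delta, the cache maps (timestamp, key) pairs to values, and a parent of None marks the first change.

accumulation = {}   # timestamp -> (key, value, parent_timestamp)
cache = {}          # (timestamp, key) -> value

def add_delta(timestamp, key, value, parent):
    accumulation[timestamp] = (key, value, parent)

def lookup(timestamp, key):
    query = (timestamp, key)
    current = timestamp
    while current is not None:
        if (current, key) in cache:                # found in the cache
            cache[query] = cache[(current, key)]
            return cache[query]
        delta_key, delta_value, parent = accumulation[current]
        if delta_key == key:                       # this change set our key
            cache[query] = delta_value
            return delta_value
        current = parent                           # otherwise follow the chain
    raise KeyError(key)                            # key never set by this timestamp

add_delta(1, 'colour', 'red', None)
add_delta(2, 'size', 'large', 1)
add_delta(3, 'colour', 'blue', 2)
print(lookup(3, 'colour'))   # blue
print(lookup(2, 'colour'))   # red - the value as it was at timestamp 2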

The caching is necessary to avoid bad asymptotic behavior as the size of the database grows. I think (but haven't proved) that for a database of size N it should be possible to get back to O(log N) amortized performance with an O(N log N) sized cache. The amount of cache used can be tuned by throwing away the least recently accessed entries depending on the database's access patterns (patterns which look back at history a lot will probably require larger caches than patterns which only look at recent timestamps).

Only the accumulation table needs to be backed up as the cache table is automatically reconstructed as necessary when queries are made. The fact that the accumulation table is only ever appended to simplifies this.

The next piece of infrastructure needed on top of this is a "head" mechanism. The last timestamp in the accumulation table is the latest physical change made but this might not be the latest logical change. We would like the ability to have "atomic" operations made up of several changes. This can be done without any additional locks (so scales very well). The "head" value is just a timestamp filed under some well-known key. To get the most recent "head" just get the value of "head" corresponding to the most recent timestamp, and do all your other queries with the head value timestamp.

To perform an atomic update, you need to make your changes and then merge them with the latest head. This is trivial if the current head is the parent of your first change. Otherwise something else has written to the database in the meantime and a merge will need to be done, creating a new "parent of first change". Eventually (as long as the time between commits is long compared to the time it takes to do a commit) you'll get to the trivial case, and the only thing that needs to be done atomically is the trivial "compare parent to the value of head and swap with the new head if they're equal" (I think most CPUs can do this atomically with a single instruction, but in any case it is one of the simpler lock-free algorithms).
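
A rough Python sketch of that commit loop, simplified: the head lives in a plain variable rather than under a well-known key in the table, a lock stands in for the single atomic compare-and-swap instruction, and make_changes and merge are placeholders supplied by the caller:

import threading

head = None                             # timestamp of the latest logical change
_lock = threading.Lock()                # stands in for the atomic instruction

def compare_and_swap(expected, new):
    global head
    with _lock:
        if head == expected:
            head = new
            return True
        return False

def commit(make_changes, merge):
    # make_changes(parent) appends deltas and returns the new tip timestamp;
    # merge(tip, new_parent) reconciles with changes committed in the meantime.
    parent = head
    tip = make_changes(parent)
    while not compare_and_swap(parent, tip):
        parent = head                   # someone else committed first
        tip = merge(tip, parent)        # merge onto the new head and retry
    return tip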

Some databases already work like this behind the scenes to some extent (log replaying etc) but I think that making this part of the interface simplifies the atomicity logic considerably, not to mention the usefulness of being able to query versions of the database from any point in history.

One complication occurs if it becomes necessary to remove from the database some information which should never have been placed there (e.g. some copyrighted work). Since you want to obliterate history in this case, an update needs to be made on the accumulation table. The cache (or the portion of it from after that point) must then be regenerated. One also needs to decide if subsequent updates to the copyrighted value are "derived works" or not, and either obliterate or leave alone the updates to that value.

One limiting factor of this architecture is access times - even with the fastest current disks one can only traverse a chain at maybe 500 links per second. It will only be a few years before we're all using solid state flash disks, though, and that will dramatically speed up access times.

Edit 14th July 2013:

Since writing this I came across an interesting article and talk on the same subject.

Different kinds of truth

August 11th, 2008

I used to think that the truth was just that - The Truth, singular. That there was just one "Platonic" set of true mathematical facts. I no longer subscribe to this point of view - what's true depends on who you ask.

First there are some basic truths that we have to agree on to have a discussion about anything, like "if A is true, and if A implies B, then B is also true". If we don't accept these basic logical principles as true, the consequences are simply that we can't deduce anything, or that we have to accept that everything is true, or that nothing is true. We accept these truths because if we didn't, what we would get is a rather limited and boring set of mathematics, useless for doing anything interesting (like modelling the real world). Those who would deny them can't be disproven, but they can't be reasoned with either. So these truths just have to be admitted as axioms.

Next there are empirical truths like "the sky is blue" and "2+2=4". These can be thought of as facts about the universe we live in. We know they are true because we can see that they are. One could in principle do mathematics without such facts (just using pure logic) but most mathematicians generally accept these truths as well, since doing so makes mathematics more interesting (and definitely more useful).

Sometimes mathematicians envisage mathematical objects which cannot exist in our universe - objects which are infinite in some sense (not necessarily infinitely big - a perfect sphere is infinitely smooth, for example, and the real number line contains infinitely many points). Infinity is a very slippery thing to deal with precisely because infinities are never directly observed in the universe. How can we say anything about infinity then? Well, mathematicians have developed techniques like "epsilon-delta" arguments (for every epsilon you can name, no matter how small, I can name a delta with such and such a property). These arguments break down in physics (nothing can be smaller than the Planck length, or the concentration of energy required to confine it in that interval would cause a black hole), so they are purely mathematical in nature. Nevertheless they form a consistent and beautiful theory, and they do turn out to be useful for approximating physics, so we accept them.
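
The standard textbook example is the definition of a limit - a statement about an infinite process made using only finite, nameable quantities (written here in the usual convention, with epsilon bounding the output and delta the input):

\lim_{x \to a} f(x) = L \quad\Longleftrightarrow\quad \forall \varepsilon > 0 \;\; \exists \delta > 0 : 0 < |x - a| < \delta \implies |f(x) - L| < \varepsilon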

But when infinities start to get involved, things get very weird - you start to find that there are multiple different versions of mathematics (multiple different sets of "true facts") which are internally consistent, consistent with our universe and interesting. Two of these come from accepting or denying the "Axiom of Choice" (AC). If we accept the AC it allows us to prove things about infinities without actually constructing or defining them. This has some very weird results (like being able to disassemble a sphere into 5 pieces, move and rotate them and end up with 2 identical spheres of the same size as the original with no gaps). But denying the AC also gives you some weird results (for example, there is then a set that cannot be well-ordered, and a vector space with no basis). Each is just as "true" but gives a different set of mathematics. Currently mathematics including the AC is more popular, as it seems to provide greater richness of intellectual territory.

As mathematics develops, it seems likely that more of these "interesting" axioms will be discovered (some of which might already have been assumed in some proofs) and that mathematics will fracture into increasing numbers of "branches" depending on which axioms one chooses to accept and which to deny. In fact, Gödel's Incompleteness Theorem says that any consistent axiomatic system rich enough to describe arithmetic contains true statements that can't be proved from its axioms - in other words, the "bulk of mathematics" (though not necessarily the bulk of interesting mathematics) is found at the leaves of this metamathematical tree.

There are other branches of mathematics whose "truth value" is currently unknown to human mathematicians. For example, many theorems have been proven under the assumption that the Riemann hypothesis is true. We think it probably is but nobody has been able to prove it yet. The volume of work which assumes it makes it one of the most important unsolved problems.

Computer algebra system

August 10th, 2008

At some point in the future I'd like to write a computer algebra system, like Maple or Mathematica, just because I think it would be a really fun piece of software to write. The basic idea is that you can tell the computer things you know (e.g. "x+2=3") and ask it questions (like "x?"), and the computer would attempt to give the simplest possible answer by searching its database of facts. When printing formulae on the screen it would use algorithms from TeX to give attractive output.
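
The existing sympy library for Python gives a flavour of the "tell it a fact, ask it a question" interaction - it isn't the program I have in mind, but it shows the idea:

from sympy import symbols, Eq, solve, latex

x = symbols('x')
fact = Eq(x + 2, 3)            # tell the computer something you know: x+2=3
print(solve(fact, x))          # ask it a question: x?  ->  [1]
print(latex(fact))             # TeX-style output: x + 2 = 3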

Another nice feature would be the ability to directly manipulate formulae, for example rearranging terms of an equation by dragging them with the mouse or expanding multiplications by clicking on them (the program, of course, would prevent manipulations that aren't mathematically correct). These kinds of manipulations can be very tedious to do by hand.

Yet another feature that I want is the ability to create animated, interactive proofs. Rather than just presenting a static sequence of "X implies Y implies Z" on a page, the program could actually create an animation of X turning into Y. And if, at some stage, a derivation is unclear, the user could right-click on it, select "Why?" and the program would attempt to explain. That sounds difficult to do but I think much of this is really quite mechanical. When studying mathematics at university, I often wished that the proofs were presented like this - it would have made things much easier.

As well as an interactive algebra and mathematical presentation system, this program would also contain a big database of mathematical facts, both to aid in the program's own proofs and as an (interactive) mathematics textbook in its own right. Mathematicians using the software could contribute their discoveries to this database/textbook in addition to (or even instead of) the traditional distribution method of writing papers.

Lost: The most complex story ever told?

August 9th, 2008

I am a big fan of Lost. It has beautiful scenery, wonderful, well-paced storytelling, an incredibly compelling plot and some terrific acting. It always keeps me guessing and often surprises me. Many of its concepts ("The Others", "Constants") seem to be destined to become cultural symbols like "Big Brother" and "Room 101" from 1984 (it always used to amuse me when the TV shows of these names were on at the same time in the UK - what would George Orwell have thought about that?).

Lost's most notable feature, though, might be that it is (as far as I can tell) the most complex story ever told. Most fictional television shows (like Star Trek and Buffy the Vampire Slayer) tell a story over the course of an episode or two - the only point of the continuity is to avoid having to introduce all the characters for each story. But you can't really watch a single episode of Lost outside of the context of the show and have it make much sense. Nothing ever seems to happen on the show without a reason - every detail seems to have some purpose (even if it isn't revealed until much later in the story).

Most novels seem to take 2-4 hours to tell on screen, and even multi-novel series like Harry Potter or the Belgariad/Malloreon are at most only 8-10 times as complex as this. "War and Peace" was made into a 7 hour film. "Lord of the Rings" was a little over 11 hours in total for the extended editions. Wagner's "Ring Cycle" is about 15 hours. Lost will be about 85 hours (without adverts) by the time it is finished. The only thing that even comes close is Babylon 5 at 77 hours, and much of that consists of standalone episodes.

I'm not counting soap operas or the Xanth series (a guilty pleasure of my youth) here because these are really separate stories that happen to occur in the same setting, involving overlapping subsets of characters, without an overall story arc.

English as a language for programming

August 8th, 2008

Programmers write programs in computer languages but the comments and identifiers (which are important, but not meaningful to the computer) are written in a human language.

Usually this human language is English, but not always - I have occasionally run across pieces of source code in French, German and Hebrew. I guess it makes sense for a programmer to write code in their first language if they are not expecting to collaborate with someone who doesn't speak that language (or if that piece of code is very specific to that language - like a natural language parser).

On the other hand, it seems kind of short-sighted to write a program in anything other than English these days. There can't be many programmers who don't speak some amount of English (since most of the technical information they need to read is written in English), and it seems likely that all but the most obscure hobby programs will eventually be examined or modified by someone who doesn't speak the first language of the original author (if that language isn't English).

There are other advantages to standardizing on English - a common vocabulary can be developed for particular programming constructs which makes programs easier to understand for those who are not familiar with their internal workings. The aim is, of course, that any programmer should be able to understand and work on any program.

That there is a particular subset of the English language that is used by programmers is already evident to some extent - I think it will be interesting in the next few years and decades to see how this subset solidifies into a sub-language in its own right.

I should point out that I'm not advocating putting up legal or arbitrary technical barriers to prevent programs from being written in other languages - more that it might be useful to have tools which can help out with programming tasks for programs written in English.

Having said all that, I think that in years to come a higher proportion of programming will be done to solve particular one-off problems rather than to create lasting programs - and there's no reason why these throw-away programs shouldn't be written in languages other than English. Tool support for this can be very minimal, though - perhaps just treating the UTF-8 bytes 0x80-0xbf and 0xc2-0xf4 as alphabetic characters and the byte sequence 0xef, 0xbb, 0xbf (the byte order mark) as whitespace.
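
A minimal Python sketch of that byte-level rule (the function names are mine; a real lexer would fold this into its identifier and whitespace rules):

BOM = bytes([0xef, 0xbb, 0xbf])

def is_alphabetic_utf8_byte(b):
    # ASCII letters and underscore, plus UTF-8 continuation bytes (0x80-0xbf)
    # and lead bytes (0xc2-0xf4), so non-English identifiers hold together.
    return (ord('a') <= b <= ord('z') or ord('A') <= b <= ord('Z')
            or b == ord('_') or 0x80 <= b <= 0xbf or 0xc2 <= b <= 0xf4)

def strip_bom(source):
    # Treat the byte order mark as whitespace - here, just skip it.
    return source[3:] if source.startswith(BOM) else source

source = strip_bom('\ufeffGröße = 42'.encode('utf-8'))
end = 0
while end < len(source) and is_alphabetic_utf8_byte(source[end]):
    end += 1
print(source[:end].decode('utf-8'))   # Größe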

Economy in ubiquity

August 7th, 2008

I think that at some point in the future we will develop nano-replication technology (a la Star Trek: The Next Generation) and there will be perfect open-source recipes for these machines for everything from roast beef to nano-replication machines.

Shortly after that will come the complete collapse of society, because we won't need anyone else for our basic necessities any more. We will just replicate power generators (solar, wind or both depending on climate) and devices for collecting rainwater and making it drinkable. Our waste will be recycled into the raw materials the replicators need (as disgusting as that sounds, I'm sure we'll learn to deal with it) and our internet connections will be made out of wi-fi meshes and long-range point-to-point radio connections.

How can you have any sort of economy with such an absence of scarcity? Well, presumably there will still be some things that are scarce and desirable (like land in nice locations). And presumably there will still be people doing particularly creative things that improve peoples' lives (making movies or inventing new recipes) - we'd just need some way to connect the two.

I'm not advocating for intellectual property as such here (that makes the accounting far more complicated than it really needs to be and introduces artificial barriers to creativity) - instead I'm imagining something more like Nielsen ratings (but for everything) to determine who gets the scarce wealth.

I guess we'll probably always have some form of money, and I'm sure the arguments about how it is accounted for will be as heated as ever. How do you decide on the relative value of the recipe your dinner was made from versus the movie you watched?