Algorithm for finding "hot" records in a database

October 15th, 2008

Suppose you have a database, and (as often happens with databases) records change from time to time. Suppose also that you'd like to maintain a list of the "hottest" records, that is the ones which have been changing a lot lately.

The first thing you have to determine is whether you want to put the emphasis more on "a lot" or "lately" - i.e. you need to have a characteristic time tc such that n changes tc ago are equivalent to n times e changes now. This time determines how quickly changes "decay" into irrelevance. Depending on your application, this might be a day or so.

The next thing you might try is to keep a table of all the changes made, along with a time for each. Then you can just weight the change times according to how long ago they are and add them up. That's going to be a big table and an expensive operation, though.

A clever trick is to use a running average and "last changed" timestamp in each row of the original table. The running average starts off at 0. Each time the row is modified, calculate the number of characteristic times since the last change N = (tnow-tlast)/tc, update the average by multiplying it by e-N and adding 1 and then update the old "last changed" timestamp to tnow for the next change.

To show that this works, suppose the running average was a=1+e-N1+e-N1-N2+e-N1-N2-N3+... (one term for each change, weighted by how long ago they happened). When we update the running average it becomes 1+e-N(1+e-N1+e-N1-N2+e-N1-N2-N3+...) = 1+e-N+e-N-N1+e-N-N1-N2+... which is just what we want.

That isn't quite the end of the story though because the running averages in the table are not directly comparable to each other - if a record had a burst of activity a long time ago but then hasn't been touched since, it will have a similar activity to a record which had a similar burst of activity which has only just ended. To compute the "current" value of the running average we need to multiply a by the e-N corresponding to the time since it was last updated (without adding one this time, since we haven't added another unit of activity). This requires looking at all the records in the table though, which will be faster than the table of changes approach but might still be rather slow for a big database.

If we only care about the top (10, say) "hottest" records, we can speed it up by caching the results in a small table, and noting that scaling all the activity values by the same factor doesn't affect the ordering of the list. Suppose we have a singleton value tupdate which is the time we last updated the small table and a10 which is the activity of the 10th hottest record the last time it was changed. Whenever we change a record, take the new activity value a, multiply it by eN (note no minus sign here) where N=(tnow-tupdate)/tc and compare it to a10. If it's larger the new record should be inserted into the "top ten" table and the old 10th hottest record shuffled out (if the new record wasn't already in the table) - think of a high score table for a game. When this happens, set tupdate=tnow, multiply all the activity values in the small table by e-N and update a10 with the new value. Then when you need to display the hottest records just display this table.

There is one more complication which comes about from deleting records. If you delete a record it probably shouldn't appear in the "hottest" records list, even it was updated quite recently. But if you delete a record from the small table when it is deleted from the big table, you will only have 9 records in the small table and you'd have to go searching through the entire big table to find the new 10th record.

If records don't get deleted from your database too often, a simple workaround to this problem is to keep maybe 20 records instead of 10 in the small table so that there are plenty of "substitutes" around, and only display the top 10 of them.

The algorithms used by Digg, Reddit, StackOverflow etc. are a little more complicated than this because the records of those sites also have a "rating" which is factored in (higher rated records are considered "hotter") but which can change with time. There might be a way to deal with this by scaling the running average according to the rating and updating the hot records table when the rating changes.

Gödel's incompleteness theorem

October 14th, 2008

Gödel's incompleteness theorem is an interesting little piece of mathematics but probably gets talked about more than it really deserves to be. I'm going to add to the problem here by talking about it.

It's a pretty simple idea really - if you have a formal system of mathematics F sufficiently rich to do anything interesting, you can write down a statement G that says, essentially, "statement G cannot be proved in formal system F". G must be true since if it was false it would be provable and therefore true, which is a contradiction. So given any formal system you can either use it to prove a contradiction (in which case it can be used to be prove anything, and is therefore useless) or you can find a true statement that cannot be proved within the formal system. Mathematics (being a non-contradictory formal system) can therefore never be used to prove all the statements that are true. "But hang on a minute," I hear you say, "that proof that G is true looked an awful lot like mathematics to me." Really all this means is that the kind of things we think about as being mathematical proofs can't be enumerated into some nice tidy finite set, since whatever you put in that set you can use to find another element that needs to be added.

There is an analogy here with Cantor's diagonal proof that the real numbers form a greater class of infinity than the integers. While they're both interesting proofs there is a fundamental "infinite-ness" and "strange loop-iness" about both of them. Infinite things are strange and non-intuitive, and also useless for describing the real universe (since we can never measure anything with infinite precision).

JavaScript vs PHP

October 13th, 2008

In order to implement Tet4 I learnt two new languages - JavaScript (or JScript, or ECMAScript - the language has a bit of an identity crisis) and PHP. Why PHP? It's installed on my web hosting server and seems to have a huge community of people writing code in it and pre-written scripts. It may not be the ideal language for writing web server apps, but it does seem to be the most well-supported.

JavaScript seems to be a very clean, pretty language. The whole closure thing seemed a bit weird at first but once I understood that "class" is spelled "function" and "public" is spelled "this." I got to rather liking it. I especially like how each scope has access
to the variables from all the outer scopes - that saves a lot of messing about. It's very well integrated with the browser - manipulating the DOM feels very natural and not tacked on.

PHP on the other hand is a bit of a mess. It is as if its designers had a little spinner with markings "C, C++, Perl" which they spun each day to decide what languages features to copy that day. If JavaScript was sent by God, surely PHP was sent by the devil.

W3Schools has been an excellent reference for learning all this.

I have to say though that automatically promoting integers to double-precision floating point numbers on overflow is weird. On IE7, computing the value of 1111111111*1111111111 gives 1234567900987654400 (you can easily see this is wrong because it's even). This caused a rather hard-to-debug problem with my random number generator (which assumed that when multiplying two 32-bit integers together, at least the low 32 bits of the result should be correct). If you're going to automatically promote numbers, at least have the decency to use a multiple-precision integer library - there are lots around.

Mornington Crescent data mining

October 12th, 2008

Over on alt.games.mornington.cresent, a game has been running for a very long time. Many years ago (so the legend goes) the newgroup received a spam with the subject line "10,000 celebrity CDs". People started replying to the thread, decrementing the number and adding (vaguely) related text each time. Here is an example:

>> >>>> >>>>>>> 4840 Film scores that include 'Suspicious minds'
>> >>>> >>>>>>4839 Jukeboxes playing 'Suspicion'
>> >>>> >>>>> 4838 Suspicious boxes leading to evacuated buildings
>> >>>> >>>>4837 Other excuses for time of work
>> >>>> >>> 4836 hours to retirement
>> >>>> >>4835 hours to many
>> >>>> > 4834 second rate customer services
>> >>>> 4833 Vodaphone helpdeskers
>> >>>4832 Desk Tidies (A real help)
>> >> 4831 Deck reshuffles
>> >4830 Aces up my sleeve
>> 4829 dead rabbits
>4828 1.5v cells
4827 years before Red got his redemption

At some point, the subject line got changed to celerity and it stuck (or maybe it was misspelled in the original spam, I'm not sure).

I first learnt of this at university, some 10 years ago. I check back on this thread every once in a while to see where it's got to. It seems to have slowed down a bit in recent years - I think when I looked at it a few years ago I calculated that it should be over by now.

It would be an interesting task (maybe for a Googler?) to run a program over the history of this thread and plot a scattergraph showing how the number has changed over time. This would show the various rival threads as they fork off, skip ahead (or backwards) and sometimes die, and show how interest in the thread has waxed and waned over time.

Touch typing is irrelevant for programmers

October 11th, 2008

Steve Yegge argues that touch typing is a fundamental skill that all programmers should have.

I disagree. Being able to type at a reasonable speed is important, but touch typing isn't. I never learnt to do "proper" touch typing, but I type pretty fast and I don't look at the keyboard when I type. I also don't always use the same finger for each key - I use the finger which happens to be nearest to the key I want to press. I'm not overly terse when writing comments, emails and documentation. Improving my typing speed wouldn't make me more productive - the bottleneck is my brain, not my fingers.

Touch typing was invented for typing dictated text on mechanical typewriters which required a lot more force to operate than a computer keyboard. Programming is very different. Working on code there is much more punctuation, much more moving around the document and much more reading. Keeping hands in the touch typing position means that they are not in good places for punctuation and cursor movement, which is slower and leads to RSI.

Touch typists also dislike laptop keyboards. I prefer laptop keyboards to full-size keyboards - the keys are closer together and travel less far, which makes typing faster.

Manifesto

October 10th, 2008

I seem to have developed opinions about a lot of political things over the past few years. Here is a rundown of some of them. This is rather US-centric but much of it applies anywhere. In alphabetical order:

Copyright & patents

  • Put the copyright term back to 14 years with an option for another 14.
  • Eliminate work for hire - copyright remains with creators. For a large project like a movie, this won't make much difference since it would be difficult for someone else to come along and obtain licences from all the people involved. Recording studios will work for musicians, not the other way around.
  • Only work which is available to the general public in the preferred form for making modifications to it is considered to be copyrighted.
  • Introduce compulsory licensing - copyright should not give you a say about how your work is used, just that you get reimbursed for it. Yes, this would break the GPL but I think these reforms would also make it unnecessary.
  • Abolish software patents.
  • Disband the RIAA, MPAA and BSA for racketeering.

Defense & foreign policy

  • Reduce military spending. Investigate abusive recruitment tactics - allow people to opt out of being contacted by military recruiters.
  • Get out of Iraq as soon as possible - civil war seems to be inevitable there anyway. Get out of Afghanistan. Stop picking sides in conflicts between other countries. Stop supporting regimes with abusive human rights records. Adopt a general policy of non-interference with foreign governments except by UN resolution. Promote mediation between conflicting parties rather than invasion and occuption as a way to reduce conflict.

Drugs

  • Reduce the drinking age to 18.
  • Make all currently illegal drugs legal for people 18 and over. Tax them. Stop interfering with drug-producing countries. Ensure anybody who is addicted to drugs and does not which to be has access to rehabilitation facilities. Make rehabilitation mandatory for convicted criminals before their prison sentence can start.
  • Pardon anyone convicted for non-violent drug crimes not involving under-18s (i.e. anyone who would have done nothing wrong under these new rules).

Education

  • Empower teachers to give students the guidance they need. Education is critical for avoiding hereditary poverty and making the society work - you can't have real democracy without an informed electorate, or a working free market without informed consumers. Education is also critical for a country to be competitive in the global market, and for progressing science and technology as rapidly as possible.
  • Teenagers must be taught how various contraceptives work and how to use them, what STDs are and how to avoid them, how the human reproductive system works and that they should never feel obliged to have sex with anyone that they don't want to have sex with. Parents should not get to opt their children out of this, but are welcome to tell children that abstinence is best outside of school.
  • Certain other life skills like how to do household budgeting and how to raise children should also be taught in school.

Healthcare

  • There is a big problem with ever having someone that is uninsured - if they become ill while uninsured, they can't get insurance "because you don't insure a house that's on fire". Nobody should ever get bankrupted as a result of illness or injury. To fix this, ensure everyone has free-at-point-of-use access to a basic level of healthcare including preventative care and life-saving care.
  • Keep abortion legal up to a certain gestational point as it is today, but do what we can to make it as rare as possible. Provide free contraception. Provide more help for mothers who choose not to abort. Promote adoption as an alternative, including removing roadblocks preventing gay couples adopting. These things will reduce abortion more than banning it.
  • Ban circumcising children. This might be rather controversial, especially in the US but there is no medical reason for it. People might object on religious grounds, but we don't allow mutilation of girls for religious reasons - why should boys be any different? Adults can still be circumcised if they want to be.

Homeland Security & Immigration

  • Get rid of US-VISIT. Expand the visa waiver program. Stop mandating RFIDs in passports. Be more welcoming to tourists.
  • Make it quicker and easier for people to become permanent residents and citizens.
  • Scrap security theater in favor of actual security. Affirm the right of people to take photographs in public places (even of police officers, infrastructure and government buildings).
  • Scrap the no fly list - it's racist and ineffective. If anyone on it is actually too dangerous to be let onto an aeroplane, prosecute them.
  • Shut down Guantanamo Bay.

Justice & crime

  • Repeal the death penalty - it has no place in a civilized society.
  • Start a program to allow early parole of some prisoners into the military for rehabilitation purposes.
  • Nationalize the prisons - they never should have been privatized in the first place.
  • Get rid of the PATRIOT act. No searches, taps or seizures without a warrant.
  • Any seized property that is legal to possess must be returned within a week (enough time to copy the hard disks of seized computers).
  • Investigate unconstitutional restrictions on freedom of speech and assembly (e.g. at the GOP convention in 2008). Put safeguards in place to stop such abuses happening again.
  • Make it easier to impeach government officials for condoning torture or suspension of Habeas Corpus.
  • Many injustices seem to be caused by pleading guilty in exchange for a reduced sentence. This practice should be prohibited - the purpose of the court system is to find the truth, not to make it easy for prosecutors to find somebody to blame.
  • I don't think it's practical to ban guns in the US, but close loopholes allowing people to obtain guns without licences and background checks. Investigate introducing technologies that prevent guns from being fired by people other than their legal owners, and eventually phase out guns that don't have such a mechanism.
  • Decouple police funding from traffic fines so that speeding tickets etc. are handed out to improve safety rather than raise money.

Religion & culture

  • Repeal the tax exemption for religious entities. If churches want to do charitable work they can fork off a charitable arm which must not promote religious beliefs, and donate money to these organizations tax-free as corporations can do.
  • Allow creationism/intelligent design to be taught in schools but not in science lessons - schools should have religious education lessons which teach what people believe, why they believe it, what they practice etc. This should include a discussion about what creationism/ID is and why it exists.
  • Provide help for people who want to leave a church or cult but who fear retribution.
  • The Archbishop of Canterbury was involved in some controversy a while back for saying something to the effect that aspects of Sharia law in the UK were unavoidable. If you read what he says, it's actually quite sensible - if Muslims (or any other group) wish to have their own courts for dispute resolution they should be welcome to do so (but can fall back to the state courts if they do not wish to abide by the private court's findings.)
  • Ensure homosexuals have the same civil rights as heterosexuals. Don't care whether you call it civil unions vs marriages or marriages vs church marriages or get rid of all marriage-based civil rights altogether - that's just semantics. But make it fair. Churches should not be forced to perform or recognize any unions they don't like (not that anybody is suggesting that they should as far as I can tell.

Tax & economics

  • Tax as little as possible, but recognize that some taxation is necessary to ensure a decent standard of living for the most vulnerable members of society.
  • Taxes should generally be fair, progressive and cheap to collect. Simplify the tax code. Allow everyone to file taxes online for free.
  • Taxation policy can be used to encourage behavior that benefits society. Taxing luxuries (especially unhealthy ones like tobacco, alcohol and recreational drugs) is a good idea, as are tax breaks for entrepreneurial enterprises.
  • Eliminate the deficit and put safeguards in place to prevent future deficits.
  • Most things can be handled by the market with some regulation to ensure the market is diverse and well-informed - the recent financial crisis was caused by deregulation.
  • Companies that get "too big to fail" should be broken up into smaller companies before they fail rather than get bailed out at taxpayer expense when they inevitably do.
  • Unsecured loans are a bad idea - discourage them by making bankruptcy easier. Ultimately it would be great if this would lead to the demise of the credit reporting agencies.
  • Overhaul zoning laws to make it easier for people to run businesses from their homes and to make cities more livable/walkable by mixing commercial/residential areas more.

Science, technology & the environment

  • Progress is good. Promote cheap bandwidth.
  • Much research can be done in the private sector, but there is a place for government funding research whose payoff is too small financially or too far off in the future for the private sector to be concerned about it.
  • Environments are good. We should have one. Sign the Kyoto protocol. Amory Lovins has some good ideas about how to reduce and ultimately eliminate use of fossil fuels.
  • Public transport is good. So is investment in infrastructure.

Unemployment & poverty

  • Provide training and job-seeking help for the employed in return for unemployment benefits. Help families escape the cycle of poverty by giving them the resources they need to raise children to become successful and productive adults.

Voting & politics

  • Everyone over the age of majority should be allowed to vote, even prison inmates. Disenfranchising people is wrong.
  • Only a vote on a paper ballot is a true vote (though machines can be used to fill out the ballots and generate exit polls). Wanting the results on election day is not a good excuse for making voting easy to tamper with.
  • Severly limit campaign spending and lobbyist influence. Reduce the amount of money in politics.
  • Switch to approval voting or similar to improve the diversity of views in government beyond the two party system.

Taking an interest in politics

October 9th, 2008

It bothers me a bit when people say they have no interest in politics. Almost everyone has some things that they are passionate about, and most of those things can be helped by the right government policies or hindered by the wrong ones. Everyone who has the right to vote should exercise it to help shape the world into the place they most want it to be.

But I don't think that everyone should just get out and vote for some random candidate, or vote based on (for example) which candidate you'd most like to have dinner with - that's as bad as (if not worse than) not voting at all. Ideally you should have an informed opinion on every issue. If there is an issue on your ballot that you don't know or genuinely don't care about, don't vote on that issue.

I am guilty of not following my own advice here by not maintaining my right to vote in UK elections (my excuse is that it's an annoying amount of paperwork which has to be done every year, and I'm ill-informed about UK politics these days anyway).

Programs to write programs

October 8th, 2008

When one needs to write a computer program but that computer program would be particularly difficult and (especially) repetitive to write, a useful pattern is to instead write a program to write the program for you. Programming in assembly language is very tedious so we have compilers. Configure scripts are fiddly and tend to be very similar so we have autoconf. Makefiles are hard to get right so we have automake. When building GCC, there are many files written in domain-specific languages (such as machine description files and opt files) which are used to used to generate the final C source code that is compiled to produce the final GCC. Even the C preprocessor is best thought of as a separate language from C itself, to allow for some limited compile-time code generation.

I wonder how far it is sensible to take this. If the program that writes the program is itself difficult to write, can we write a program to write that one too? At some point it seems that you will reach a point of diminishing returns - adding layers of code generation makes things more complicated, not less. Perhaps the kinds of programs we can understand (and therefore write and debug) are limited by the number of layers of code generation which need to be used. Perhaps with better tools for understanding these layers (such as this), we can write programs that are more sophisticated and less buggy.

Another factor is that in many languages (especially compiled languages) it is quite difficult to write code that writes code - you have to make sure that a suitable compiler is installed, invoke it etc. I hope that future compiled languages will include
code generation libraries which make this easier.

Things cyclists should be allowed to do

October 7th, 2008

I used to cycle to work sometimes. Not all the way to work, but to a bus stop 2-3 miles away. It's very good exercise but I stopped doing it after Alexander was born for several reasons:

  • The commute was 40 minutes to an hour that way at best, by car it was 20 minutes at best.
  • I always ended up very sweaty on the bus - not very pleasant for my fellow passengers.
  • Sometimes the bus was standing room only, meaning the time I spent on the bus had to be wasted.
  • The only (not too hilly) route between my house and the bus stop is along a fairly busy but rather narrow road. The speed limit is 25 mph there but cars expect to be able to go at speeds of up to about 45 mph. They get frustrated when they are stuck behind a cyclist who is going at less than 25mph and there isn't room to safely overtake. Sometimes they will overtake when there isn't enough room, putting the cyclist in danger.
  • In Seattle, cyclists are technically supposed to stay as far right in the lane as possible to allow cars to overtake. However, in order to minimize the danger to myself from overtaking cars, I would often ride in the middle of the lane when it wasn't safe to overtake. This was both a clear signal to the cars that they should not attempt to overtake, and to give me a space to dive into on the right in case they tried anyway. One time, a driver got very angry at me for doing this and started shouting at me. I couldn't hear exactly what he said over the sound of his engine but he gave me the impression that he thought roads were for cars and that cyclists need to stay off of them. He eventually overtook me (unsafely) but then I overtook him again while he was waiting at the traffic light (so I hadn't actually delayed him at all). This seemed to make him even madder and he shouted at me even more (but I still couldn't hear him, and was too out of breath to say anything myself). I switched to the sidewalk as the traffic was too close to the curb for me to get past, which made him even madder still. The whole incident put a very bad taste in my mouth and put me off cycling altogether for a while. I never quite felt safe cycling along that road after that - perhaps the same guy would see me late one night and decide to run me off the road. The irony is that because of that I drove more, which caused more traffic and probably made him even later (if he was a regular user of that road).

Cycling has a problem in Seattle (and in many other cities as well, especially in the US) in that it lacks critical mass (hence Critical Mass). Cycling is great - it's good exercise, good for the environment and reduces traffic. But there are so many cars on the road, going so fast and paying so little attention that in many places cycling is very dangerous. With enough cyclists or few enough cars using that road, it would be safe for cyclists but the danger keeps the number of cyclists down and the number of cars up, as my story above illustrates.

In order to promote cycling, we need to make sure the road rules favor cyclists. For example:

  • Cyclists should be able to go to the front of the line at red lights (and stop signs if there is a long queue). In many places this is legal but there is not always room in front of the traffic, and sometimes no room to undertake at the approach to the light. Improved road markings would help here. These exist in some places (not in Seattle, though).
  • Cyclists should not have to come to a complete stop at stop signs. Currently they are supposed to (at least in Seattle) but rarely do. For a cyclist, coming to a complete stop is much more of a problem than it is for a car - not only do you have to build up your momentum again but you can't balance when stopped. Cyclists should just need to slow down enough to ensure that they don't ride out in front of a car or another cyclist coming from a different direction, or a pedestrian trying to cross the street, and only stop if they need to wait for other traffic. Pedestrians do not have to stop at stop signs if there is no traffic - I think cyclists should have that advantage too.
  • Cyclists should be able to ride in the middle of the lane when it is not safe for them to be overtaken, as I used to do.

One of the complaints car drivers seem to have about cyclists is that sometimes they follow pedestrian road rules and sometimes car road rules. This seems to me to be an advantage of the bicycle as a form of transportation - in some ways it is like being in a car and in some ways it is like walking. Cyclists take up much less space than cars and are much less dangerous to pedestrians, so it makes sense that they should be allowed to ride on the sidewalk (pavement) when it is wide enough and when they are going at walking speed (though of course they should give way to pedestrians when doing so). On level ground and downhill they can go fast enough that it makes more sense to use space shared with cars than space shared with pedestrians.

I recently started cycling again on a regular basis - now that I don't have to commute I can do so for exercise, and choose non-dangerous routes. I found it to be a much more pleasant experience without cars honking behind me - there are lots of other cyclists on this route and they often say hello as they whizz past me.

It's not in the last place you'll look

October 6th, 2008

When I was looking for something my parents would often quip "it'll be in the last place you look". Brilliant. And not at all helpful. In order to prove them wrong, I now make a habit of looking in some more places after I've found something that I was looking for. This isn't as pointless as it sounds - sometimes I find things I wasn't looking for.