| Home | laurie@tratt.net |
A Proposal for Error HandlingDecember 14 2009 A while ago I wrote about my experience of writing extsmail, and how surprised I was that highly reliable and fault tolerant programs could be written in C. In large part I attributed this to the lack of exceptions in C. In this article, I expand upon this point, consider some of the practical issues with exceptions based language, and present a candidate language design proposal that might partly mitigate the problem. I don't promise that this is a good design; but it does present some of the issues in a different way than I've previously seen and if it encourages a debate on this issue, that might be use enough.The two approachesThere are two main approaches to trapping and propagating errors which, to avoid possible ambiguity, I define as follows:
Outline of the problemThe fundamental problem as I see it can be concisely expressed: exceptions allow predictable systems to be written with much less code than with error checking; but exceptions make writing highly fault tolerant code difficult. The former point is, I'm fairly sure, uncontentious; with exceptions, one needn't explicitly check and propagate most errors, negating the need for vast wodges of code. For many systems, this is fine - indeed, it is desirable. In general, most of the small and medium sized systems I write have little to no exception checking; if something odd happens (e.g. a file is missing) I prefer such systems to immediately exit rather than limp on to an unpredictable end.The problem comes when one is trying to write highly fault tolerant code, by which I mean systems which try to keep on running even when undesirable events or situations arise. I documented my experiences when writing the (tiny) e-mail sending program Some readers might interpret the above as being a position based either on ignorance, or a hatred of exceptions. While I generally plead guilty to charges of ignorance, I can provide some evidence that in this rare situation it's not the case. The language that I designed - Converge - is an exception-based language. To the charge of hatred I say that, except when writing highly fault-tolerant systems, I believe exceptions are the best way of dealing with errors. If I don't hate exceptions, what's the problem? Here's the rub. In theory there is no real problem with exceptions; they're clearly a more pleasing solution to the problem than error checking. The problem comes because they seem to inevitably lead to a certain style of programming that is not suitable for highly fault tolerant code. This style permeates every exception based language and library I have seen. Before I go into more detail as to what the problem is, it's best to take a step back and look at the competing solutions. Error checkingLet's consider error checking in a C-like environment (I say C-like because I wish to avoid the horror that is is
int err;
if ((err = write(...)) != 0) {
switch (err) {
case EINTR:
...
case EAGAIN:
...
case ...:
...
}
}
In other words, a function write is called and, if it returns an error, all the possible errors that the function can return are explicitly dealt with. There are three important things implicit in the above. First, functions uniformly denote success or error (most Unix C functions denote success by returning 0; values other than 0 indicate failure). Second, errors are simple integer values. This means that they can easily be stored and compared against pre-defined error codes. Third, functions carefully document which errors they can return so that the caller need only handle those errors.
In general, it's unusual to explicitly deal with each different error that a function can return. At the other end of extreme, one can choose to simply ignore the error returned by a function: write(...);which is equivalent to the exception-based code try { write } catch (Exception) { } in, say, Java - although noticeably terser. In practice, one will also see many instances of the following idiom:
if (write(...) != 0) err(1, "Fatal error.\n");This is a rough approximation of the concept in exception-based systems of exiting the whole program if an exception is not caught at some level. This particular idiom in error-checking systems is a pain in the rear end: it's tedious and verbose to write; and if one has several points that share the error message Fatal error then it may not be possible to easily distinguish which one of them triggered.
As the above hopefully suggests, error checking code suffers several problems including: verbosity; the ease with which minor typos escape notice; and the difficulty of retrospectively changing APIs (extending the errors that a function returns would, in general, necessitate changing - or at least checking - all callers of that function). However there is one point which, while it might initially seem a problem, turns out to have interesting positive consequences. Since error checking relies entirely on convention, every function that follows this idiom must carefully document what errors it can return; without that, the idiom would be unusable. We'll return to this shortly. ExceptionsMost readers are probably familiar with exception based systems as the majority of modern languages utilise them in one form or another. It is this familiarity which I suspect numbs us to one of the major practical problems with exceptions. Let's recast the initial error checking example into exceptions:
try {
err = write(...);
}
catch Interrupt_Exception e {
...
}
catch Again_Exception e {
...
}
catch ... {
...
}
As this suggests, it's possible to almost exactly emulate the error checking approach in an exception based language. However, one will almost never see the above idiom in an exception based language (I don't think I've ever seen it). Exceptions are typically organised into a hierarchy so one is much more likely to see:
try {
err = write(...);
}
catch IO_Exception e {
...
}
Oddly enough, this organisation of exceptions into hierarchies, while convenient, is not something I'm keen on for fault tolerant systems. The reason is that, by abstracting away from the specific error that occurred, it tends to give a false sense of security that one is dealing with a given exception in the right way. For example, most systems have a multitude of IO related exceptions, yet it is rare for anyone to deal with anything other than the top-level IO exception class; in a truly fault tolerant system many of those sub-exceptions are likely to be best dealt with differently than others.
Of course, the beauty of exceptions is that they don't need to be explicitly handled. Furthermore, there is much less potential to silently, and accidentally, swallow an error (as can easily happen in error-checking systems; and assuming one doesn't have a brain-dead checked exception system as in Java). A system which uses exceptions is entirely predictable: it will run reliably and tend to fail reliably and quickly. In contrast, a buggy error-checking system will often fall over long after the real error occurred, in a way which makes debugging painful at best. The problem with exceptionsIronically it is the ease of exceptions which I believe is their downfall.
For average programs, none of these points is a huge problem; in fact, they arguably make a programmers life easier. But for fault-tolerant systems all are genuine issues: it is impossible to write a fault tolerant system if you are unsure what errors you need to deal with; if the number of errors you are required to deal with becomes too great, the sheer magnitude of task will overwhelm you; if you're unclear what errors are the result of your mistakes and which aren't, debugging becomes a philosophical minefield; and if you tend to be too crude in the granularity with which you check errors, mistakes will happen. A proposalAt a high-level, what I believe might be an improvement on the current situation is a feature which, to some extent, merges the better parts of error-checking and exceptions. Let us therefore consider what the best parts of each approach are. Error checking forces programmers to be considerate both of the errors that they have to deal with, and the quantity of errors they return to the caller. Exceptions make code far smaller and are particularly good at dealing with programmer or cock-ups and very unusual situations (e.g. out of memory errors) which virtually no-one wants to deal with explicitly.What both error-checking and exceptions provide is a way of returning two things from a function: an error code / exception, and the results of the function. The beginning part of my proposal is for an operator which pulls these two things apart in one go. For arguments sake, let's use back-slash bytes_written \ err = write(...); if (err == EINTR) ...On the other side of the coin, we need a way for functions to return errors (or not). Mirroring the above syntax, return x returns x as per normal and sets the returned error code as 0 (or whatever the no errorvalue is). return x \ e returns x and sets the returned error code to e (note that it would be valid, though syntactically redundant to explicitly return the no errorvalue). Although it's not directly related, an obvious problem with error checking is the difficulty with extending an API; any errors codes not explicitly checked for will be silently swallowed. To some extent this is a deliberate design decision: it should be culturally difficult (not impossible; but definitely difficult) to extend the range of errors that a function returns (exceptions provide no encouragement to follow this rule whatsoever). However, I also assume that any language with this feature in will have an equivalent of Converge's So what has all this bought us? Well, at first glance, we've just got an error-checking system which formalises the way in which errors are returned. Indeed, this is a large part of the story. We now have a simple, uniform way of performing the error-checking approach. But, if this was all we'd got, we wouldn't have advanced much beyond C. By formalising how errors are returned, we can also define what happens if the error code is not explicitly checked for. In other words, what does the following code do? bytes_written = write(...);Answer: if write raises an error and it is not dealt with, the error turns from an integer error code into a standard exception. In general, I would not expect such exceptions to be caught; they would terminate the program, leaving behind a simple to debug line-by-line backtrace.
Finally, I expect exceptions to be maintained almost exactly as they exist in current languages, including the IntentionsThe intention of the proposal is twofold:
construct, what's left is a standard (if slightly simplified) exception-based system. In other words, the error-checking approach is a bonus which doesn't interfere with normal exception-based practices; it degrades gracefully into using standard exceptions.
ConclusionIn one way, what this article has done is to note problems with the practice of exceptions, and then suggest a partial return to the stone-age of explicit error checking. In that sense, one counter-argument is that I am aiming to throw the baby out with the bath water; and there is merit in that argument. There are also a couple of obvious questions which it raises. First, has this scheme been proposed elsewhere? Not that I know of, but there are a vast amount of relatively obscure languages lurking around, so it's not impossible. Second, would this work in a practical language? I don't know - this is a classic paper design which may well not survive its first pass through a compiler. One day I hope to find out.My thanks to Martin Berger for commenting on a draft of this article. Any errors and infelicities are my own, as are all the bad ideas. The Missing Level of Abstraction?September 15 2009 Levels of abstractionsI feel fairly confident in stating that everyone who is familiar with the details of computing will have encountered the phraseshigh level of abstractionand low level of abstraction- probably rather often. Abstraction is one of those words which is used rather more frequently than it is considered. In short, an X that is an abstraction of Y implies three things in our context:
Abstractions are typically relative things. From the statements Regrettably, my professional life has taught me that the notion of relative levels of abstraction is not universally shared, probably because it is harder to understand than the notion of absolute levels of abstraction. This fallacy is commonplace; I first encountered it in the MDA world where the terms PIM (Platform Independent Model) and PSM (Platform Specific Model) are widely, and incorrectly, abused to imply absolute levels of abstraction. Similarly one often hears talk of Objective-C, alas, I find a more difficult language to like. In small part this is because it generally implies (and did in my case) working under OS X, an operating system beloved of those who do not truly love computers - its idiot-proof lack of customisation and innumerable small bugs came close to breaking me. Objective-C grafts higher-level Smalltalk-like features onto a lower-level C base. The resulting language is not only more verbose than either of its parents - I particularly enjoyed needing to type class attributes into three different places - but distorts many of their defining features. For example, Smalltalk's collection hierarchy (i.e. lists, sets etc.) is a thing of genuine beauty and utility, allowing a syntactically minimalist language to succinctly express complex constraints; since Objective-C doesn't allow blocks (small anonymous functions), not to mention that its collection hierarchy is not as well designed, this is impossible. Another aspect is that C is a statically (if weakly) typed language, catching errors at compile-time that would otherwise cause hard to debug segfaults; however Objective-C's message sending is, like Smalltalk, dynamically typed. In Smalltalk this is not an issue - type errors are simply triggered, lead to predictable backtraces, and are easily fixed. While some dynamic typing errors in Objective-C lead to similar backtraces, others fall foul of C's free-wheeling approach to memory management and cause horrible run-time results, leading to little more than a splat sound; debugging those is a chore. Memory managementOne of C's many lower-level delights is its approach to memory management: put simply, the user is responsible for all memory allocation and deallocation. The virtue of C's approach is that it's simple to understand (for users) and implement (for compiler and library writers). The disadvantage is that it's easy to use incorrectly; in particular, it's very easy to forget to free memory, causing hard to debug memory leaks. A less common, but more dangerous, error is to try and free an already freed chunk of memory (the so-called double free). It is thus reasonable to say that C's memory management is low-level. In contrast, high-level memory management has, in my mind, been largely synonymous with garbage collection, which automatically frees memory (and implicitly prevents double frees). Garbage collection is not without some negative implications - for example, it is notoriously difficult to work out when garbage collection pauses will strike - but overall is generally a good thing. Indeed, garbage collection is now largely ubiquitous outside of systems and embedded programming; anyone weened on Java (or even much older languages such as Lisp and Smalltalk) will know nothing else.As I understand things, modern Objective-C under OS X uses garbage collection. On some other platforms, such as the one I was targeting, there is no garbage collection. I initially assumed that in such cases Objective-C would revert to a standard C-style system of Memory pools are an attempt to allow implicit memory allocation (something which is, for good reason, deeply frowned on in traditional C libraries) with implicit memory deallocation. In practice, they resemble a poor-mans attempt at garbage collection or, perhaps, garbage collection as designed by someone who hadn't fully grasped the idea. For the life of me, I can not fully prevent memory leaks in a medium-sized Objective-C system; ironically, I have found it much easier to debug memory leaks in pure C applications. Different parts of code can The middle-level of abstractionA good question at this point is: what does all this have to do with abstractions? Well, the two examples above are instances of something that I've seen once or twice before, but which is hardly common. In normal computing conversation, C is at a low-level of abstraction, and Smalltalk is at a high-level. As I wrote earlier, levels of abstraction are relative, but that doesn't mean we can't talk about the gaps between two levels. Objective-C, which is a child of these two languages, explicitly pitches itself as being the middle-level of abstraction between C and Smalltalk. Similarly, its memory management is the middle-level of abstraction betweenmalloc/free and garbage collection.
The interesting thing to me is not that Objective-C is the middle-level of abstraction between C and Smalltalk as such - after all, given two points on the abstraction scale, there's a high chance that there are other levels in-between - but that it appears to me to have been deliberately designed to slot into this place. This made me wonder about two things. First, why are there not more systems designed to slot between two existing levels of abstractions? Second, why have I never heard the term A couple of answers immediately suggested themselves to me; no doubt others can think of better ones. Perhaps systems designed to slot between two existing levels are more likely than not (as Objective-C) to be worse than the things either side of it? Alternatively, perhaps under normal circumstances we only need to think of one higher level of abstraction thing and one lower-level thing in order to work satisfactorily, and what comes in the middle is generally irrelevant? Whatever the reason may be, I suspect we're unlikely to ever hear many uses of the term Good Programmers are Good Sysadmins are Good ProgrammersMarch 20 2009 It is human nature to assume that what is familiar to us, is familiar to all. I can still clearly remember, when I was around 12, going to a friends house, and being astonished at the fact that around my dinner plate were three pairs of knives and forks. In fact, petrified would probably be a better word - not only had I never personally seen anything like it, I had never imagined that anyone would, or indeed could, use more than one knife and fork in a single meal.Of course, my life thus far has not just involved my surprise at other peoples habits; on occasions (less rare than my ego might have preferred), other people have been surprised at mine. Recently a non-computing friend saw my main computer workspace - a Unix setup with 4 xterms displayed - and asked, jokingly, if I was plotting to take over the world (I blame the media for this particular image). I'm so used to my own setup that I no longer think of it as odd but, when suitably prompted, I can see why other people might think so. In comparison to a Windows or Mac machine, full of little visual goodies, and perhaps with only a web browser or word processor loaded, a number of tiny xterms filled with half-executed commands does look odd. It's not really surprising that a non-computing person would find my setup odd - after all, I spend a lot of time on computers, so it's to be expected that some things that I have come to find natural scare casual users. What has surprised, and continues to surprise me, is how many computing people I come across find my setup odd - sufficiently odd that it attracts comment. Some people are baffled as to why my systems are as they are, some are curious as to how it works, and some people sneer at the way I do things. There is no good answer to the sneer, nor any great reason to answer that person (although, possibly due to a deep character flaw, I find the sneer rather amusing). However the Here's the broad setup I use. I have a desktop machine (because it's fast and comfortable), a laptop (which I use only when out and about, because laptops are slow and ergonomically disastrous), a main server (where you're probably reading this from), and a backup server (for the next time the water company cuts through the cable powering the main server; in an attempt to salve my environmental qualms, the backup server is a very low power device that also serves some domestic purposes). Though various people have some sort of access to the servers, I personally administer all 4 machines. This raises two immediate problems: how to keep the administration overhead to a minimum; and how to keep files synchronised between each machine. The answer to the administration overhead question is, for me, simple: use the same operating system for all machines. That way, the lessons learned on one machine apply trivially to the others (and, when things go belly up, machines can relatively easily stand in for one another). As someone who (through a quirk of geography and history) never passed through the DOS / Windows world, I eventually gravitated towards Unix operating systems and, after a brief flirtation with Linux, I've exclusively used OpenBSD for nearly 10 years. This immediately scares most people off, or confirms their worst suspicions of me - to give you an idea of the popularity of this OS, at the time of writing, I've met precisely 1 OpenBSD user in real life. Why did I chose OpenBSD? Simple: it's simple. OpenBSD does very little by default, and what it does, it does well, consistently, with minimal configuration, good documentation, and is easily administered remotely. I can have a blank box turned into a complete OpenBSD install with everything I want, setup how I need, in a couple of hours (most of which is automatic downloading of stuff, and doesn't require my presence). I keep an open mind about OS replacements, but so far none of them appears an improvement, or even a sideways step. Of course, using what is often dismissively called a The file synchronization problem is a little more subtle, but arguably more important. I outlined my mechanism for this a while back and while it's changed in detail quite a bit, in spirit it's still the same: I use a version control system (git these days) for my important files and Unison for large files that I can recreate via other mechanisms. A corollary of using a decent Unix, and synchronizing files automatically, is that virtually everything is configured by simple text files, so to a large extent my configuration also propagates across machines. I have also tried over the years to accept, whenever possible, the default configuration on a machine. The reason for this is simple: the less I feel the need to change, the easier it is to move between machines (and different OS's). Of course, there's a limit to how far I'm prepared to accept someone else's choices, and so I do change a reasonable number of settings; but, compared to most people I know, I change relatively little. As well as trying to use the default configuration as far as is practicable, I also try to maximise my use of tools supplied with the OS and, failing that, to use the simplest tool that does the task I require. A decent Unix comes with a wide variety of little tools, most of them neglected by most users; it continues to amaze me as to how many tasks can be expressed in terms of these little tools. Using tools that are standard across many different machines and OS's again lowers the barriers to moving between machines. It also generally implies a greater consistency of user experience, since tools from the same providers tend to be more consistent; some providers (particularly commercial) seem to delight in perverse user interface choices, which means that installing and learning new tools can be an uncomfortable experience. I also try, whenever possible, to use command-line tools not because - despite what some of my friends think - I like being obscure (my formative years were spent on RISC OS, where the GUI was King and the default assumption was that the command-line was for the mentally unsound) but because it's easier to control command-line tools and maintain consistency across platforms. Most of my time on a computer is spent either doing e-mail (I use mutt because it's the least annoying mail client I've yet found, despite its obvious limitations), web browsing, programming, or writing. The latter two tasks are the most interesting. If for you, as for me, an average day is a wild trip of programming in several languages, and working on several papers of dubious literary merit, then you'll know how much time one can spend editing text; I often have 20 or 30 files open for editing. The sad truth of the matter is that, as far as I can tell, the modern computing world does not contain a single decent text editor (whereas RISC OS, which I mentioned earlier, had at least 2 excellent text editors). Most text editors are either arcane (e.g. vi and emacs) or bloated (e.g. Eclipse). Since I am not clever enough for the former, and far too impatient to wait for the latter to load (I had a massive shock 4 or 5 years back when, on a powerful machine, I found to my horror that if I typed at full speed in a well known IDE, there was a noticeable lag in text appearing on screen), I use a half-way option, NEdit. NEdit has many limitations and flaws, but it's simple, loads almost immediately, and its syntax highlighting is just about powerful enough to satisfy me. Let's return to the 4 xterms I mentioned at the beginning of the article, which scared my non-computing friend - it's both worse and better than it seems. I setup KDE so that it has 12 virtual desktops (one of the main reasons I used KDE in the early days was because it binds sensible keys to virtual desktop selection by default), of which I typically use 7 to 8 at any given point. The first is my In the above I've tried to give a brief outline of So, in conclusion, my computer setup is an attempt to emulate, in my own small way, the best habits I've been able to pick up from those more able than myself. It's a continual work in progress, but it does the trick for me. How can C Programs be so Reliable?November 11 2008 C is, today, a unique programming language. Surprisingly few people can really program in C and yet many of us have quite strong opinions about it. Buffer overflows, stack smashing, integer overflows - C has many well publicised flaws, and these terms are often bandied about confidently, even by those unfamiliar with C. Personally I shied away from C for a decade, for one reason or another: originally, compilers were expensive (this being the days before free UNIX clones were readily available) and slow; the culture was intimidatory; and, of course, all the C scare stories made me think that a mere mortal programmer such as myself would never be able to write a reliable C program.Discounting a couple of tiny C modules that I created largely by blindly cutting and pasting from other places, the first C program I wrote was the Converge VM. Two things from this experience surprised me. First, writing C programs turned out not to be that difficult. With hindsight, I should have realised that a youth misspent writing programs in assembler gave me nearly all the mental tools I needed - after all, C is little more than a high-level assembly language. Once one has understood a concept such as pointers (arguably the trickiest concept in low-level languages, having no simple real-world analogy) in one language, one has understood it in every language. Second, the Converge VM hasn't been riddled with bugs as I expected. In fact, ignoring logic errors that would have happened in any language, only two C-specific errors have thus far caused any real problem in the Converge VM (please note, I'm sure there are lots of bugs lurking - but I'm happy not to have hit too many of them yet). One was a list which wasn't correctly NULL terminated (a classic C error); that took a while to track down. The other was much more subtle, and took several days, spread over a couple of months, to solve. The Converge garbage collector can My experience with the Converge VM didn't really fit my previous prejudices. I had implicitly bought into the idea that C programs segfault at random, eat data, and generally act like Vikings on a day trip to Lindisfarne; in contrast, programs written in After a dark period of paper writing, I've recently been doing a little bit of C programming. As someone who, at some points, spends far too much time away from home, reliably sending e-mail has always been an issue. For several years I have sent e-mail by piping messages to a The first observation is semi-obvious. Because software written in C can fail in so many ways, I was much more careful than normal when writing it. In particular, anything involved in manipulating chunks of memory raises the prospect of off-by-one type errors - which are particularly dangerous in C. Whereas in a higher-level language I might be lazy and think The second observation is something I had not previously considered. In C there is no exception handling. If, as in the case of extsmail, one wants to be robust against errors, one has to handle all possible error paths oneself. This is extremely painful in one way - a huge proportion (I would guess at least 40%) of extsmail is dedicated to detecting and recovering from errors - although made easier by the fact that UNIX functions always carefully detail how and when they will fail. In other words, when one calls a function like What I realised is that neither exception-based approach is appropriate when one wishes to make software as robust as possible. What one needs is to know exactly which errors / exceptions a function can return / raise, and then deal with each on a case-by-case basis. While it is possible that modern IDEs could (indeed, they may well do, for all I know) automatically show you some of the exceptions that a given function can raise, this can only go so far. Theoretically speaking, sub-classing and polymorphism in OO languages means that pre-compiled libraries can not be sure what exceptions a given function call may raise (since subclasses may overload functions, which can then raise different exceptions). From a practical point of view, I suspect that many functions would claim to raise so many different exceptions that the user would be overwhelmed: in contrast, the UNIX functions are very aware that they need to minimise the amount of errors that they return to the user, either by recovering from internal failure, or by grouping errors. I further suspect that many libraries that rely on exception handling would need to be substantially rewritten to reduce the number of exceptions they raise to a reasonable number. Furthermore, it is the caller of a function who needs to determine which errors are minor and can be recovered from, and which cause more fundamental problems, possibly resulting in the program exiting; checked exceptions, by forcing the caller to deal with certain exceptions, miss the point here. Henry Spencer said, Free Text GeocodingSeptember 1 2008 I think nearly everyone would agree that maps are useful things; I've only met a couple of people who labour under the illusion that they know the way from anywhere to anywhere. Personally I find maps more than interesting - they're often fascinating. Maps are one thing the British do immensely well (judging by some of the inaccurate doodles I've seen abroad). By looking at the detailed Ordnance Survey map of the area where I grew up, I can easily see layers of historical information such as: the Roman road a mile away; the long-abandoned railway lines upon which roads were later built; the now-drained marshes and the canal network later built upon them. Of course, these are fairlystandardmaps with a well-understood relationship to the real world; sites such as Strange Maps and Mark Easton's blog show how maps can present data in many other forms. My headmaster at school thought that geography as a subject had lost its way; instead of focusing on rock strata, it should focus on teaching children where places where. At the time, we all thought he was a touch eccentric; in retrospect, he was absolutely right. Huge tracts of politics, economics, and history only make sense in the context of a given place(s). I could go on, but I hope my point is clear: maps are important, and not just for finding our way around. A big problem with traditional maps is finding things: places, areas, and so on. Even for a medium sized paper map, the size of the index is huge; and the index for a good quality London A-Z is often almost as big as the map itself. Paper map indexes have their own funny little language, trying to squeeze as much data in as possible. However paper indexes have two problems. First, how many times have you forgotten what square you were supposed to look at when you get to the proscribed page? Second, indexes can only grow to a certain size before they are unusable. Computer geocodingThe advent of computer mapping has been a real boon, and not to just map fiends such as I. The first site I used regularly was Street Map. Having the map data for the whole UK was incredibly useful and the search facilities, while crude, were a huge improvement on a paper index. When later sites such as Google Maps became available, I was stunned. Suddenly freely viewable map data (though note I do not use the phrase free map data) for much of the world was coupled with a new way of searching. No paper index, or Streetmap's cloying need to be told what type of search was being performed (e.g. place name, street name, post code etc.). Instead is what I call (for want of a better name) free text geocoding: that is, where one types in something in the format used in real-life, and the search engine finds the right place automatically. One doesn't need to tell the search engine that the search is for a postcode or a place-name, or even which country one is searching - it magically does the right thing. Well, of course - not always the right thing. For example, at the time of writing, if I do a search for There are several reasons why sites such as Google Maps are not as widely used as they might be. Regrettably the chief reason is legal: there are restrictions on the way that map data can be used. Fortunately an initiative - OpenStreetMap - to create much freer map data was started a few years back (and started by Brits - we are, it appears, a nation of map lovers) and is now at the stage where, despite many huge holes in its data, it is semi-usable: in central London, for example, it already has arguably the highest quality maps. Some of the uses of OpenStreetMap are already quite astonishing (the UK postcode layer is a simple, but effective, example of what can be done - click the little One part of OpenStreetMap that frustrates me a little is its search (the Name Finder). Although it does a reasonable job in many respects, the results it returns are hard to interpret (type in Free text geocodingSince one of the huge benefits of using computer mapping is its search ability, I thought a fun little summer project would be to create my own free text geocoder. I started with only a vague idea of what a free text geocoder should (or could) do. While I don't now claim to have thought of everything, I do now have a much clearer idea of what a good free text geocoder should do. I've split these into
FetegeoTo this end I created, and have now released, Fetegeo with a BSD / MIT licence. Using Fetegeo's included client / server interface, queries can be performed on the command-line: $ fetegeoc geo London Match #1 ID: 719913 Name: London Latitude: 51.508415 Longitude: -0.125533 Country ID: 233 Parent ID: 1262 Population: 7421209 PP: London, United Kingdom Dangling text:Of course, there are a lot of London's in the world and I haven't copied all of Fetegeo's output. Notice though that, since the preferred country of the user wasn't specified, it's chosen what most people are likely to consider to be theLondon as the first match. If the user specifies that their country is Canada then London in Ontario is the first match: $ fetegeoc -c ca geo london Match #1 ID: 2984878 Name: London Latitude: 42.983389283 Longitude: -81.233042387 Country ID: 39 Parent ID: 540 Population: 346765 PP: London, Ontario Dangling text:Fetegeo can be instructed to allow dangling (i.e. unmatched) text in matches: $ fetegeoc -d geo Museums in London Match #1 ID: 719913 Name: London Latitude: 51.508415 Longitude: -0.125533 Country ID: 233 Parent ID: 1262 Population: 7421209 PP: London, United Kingdom Dangling text: Museums in If you're interested, there's a slightly more thorough description of the ways that Fetegeo can be used, and a simple demo which geocodes results and shows them on an OpenStreetMap map. How Fetegeo worksInternally, Fetegeo's search is fairly simple and its approach is easily described. Strings in Fetegeo are always normalised; in particular punctuation is removed, and strings are lower-cased. String queries to the database are always on hashes of normalised words. Given a normalised string How Fetegeo comparesHow does Fetegeo score on the must / should chart? It's too early to say how accurate Fetegeo's results are. First, Fetegeo has only been relatively lightly tested so far: it's inevitable that there will be bugs and oversights. Second, any free text geocoder is subject to something I'm tentatively calling Tratt's First Law of Free Text Geocoding (though I doubt I'm the first to think of it): the upper bound for results quality is determined by your dataset. The best free text geocoder in the world can only give iffy results with an iffy dataset. Fetegeo's initial dataset is based on Geonames data (and postcode data from various other sources). While Geonames should be saluted as the first serious attempt to collate freely-available place data, the structure of the data is less than ideal, and the data itself is of variable quality, suffering from frequent inaccuracies and duplication. Because of this, Fetegeo has been designed to be relatively independent of any particular dataset; I hope one day in the not too distant future that OpenStreetMap's data will be sufficiently broad in scope to replace Geoname's (OpenStreetMap's data is already deeper in the sense that it includes roads, which Geonames doesn't). Fetegeo is already reasonably fast, given that it's only been semi-optimised. On my 3-ish year old desktop machine, using the stock install of PostgreSQL (a few tweaks would, I suspect, make it perform much better - if only I could work out what those tweaks were amongst the mass of overlapping configuration options!), typical queries are answered in less than 0.1s. Fetegeo makes use of simple caching internally to speed things up. Someone who understands databases better than I could almost certainly make things run much faster. Fetegeo has the beginnings of being usable for more than just longitude / latitude searches, but there is some way to go yet to prove this is feasible. In particular I would like to see it capable of being used by applications to classify things as being in particular areas. Imagine you have a website listing X's in Britain (where X could be just about anything), where each X is located at a particular latitude / longitude. This allows one to easily search for Fetegeo is usable in a number of different ways. As such, Fetegeo is just a Python library which can be included and used in any application. Fetegeo also comes with a standard internet server and (command-line) client, which can receive and answer XML queries (as an aside, the XML parser used is often the slowest part of querying). This means that even a simple web-site can query a single Fetegeo server and make use of its caching facilities and so on. ConclusionI have no idea whether anyone will find Fetegeo useful. It seems to me that, even in its current embryonic form, it fills an unoccupied niche, at least in terms of its licence if not its functionality (yet). I hope that other people might find it interesting, and start to extend its functionality to make it more widely applicable. If you want to find out more, and contribute, please waltz on over to fetegeo.org.
|
| Home | Copyright © 1995-2010 Laurence Tratt laurie@tratt.net |