| Home | laurie@tratt.net |
Free Text GeocodingSeptember 1 2008 I think nearly everyone would agree that maps are useful things; I've only met a couple of people who labour under the illusion that they know the way from anywhere to anywhere. Personally I find maps more than interesting - they're often fascinating. Maps are one thing the British do immensely well (judging by some of the inaccurate doodles I've seen abroad). By looking at the detailed Ordnance Survey map of the area where I grew up, I can easily see layers of historical information such as: the Roman road a mile away; the long-abandoned railway lines upon which roads were later built; the now-drained marshes and the canal network later built upon them. Of course, these are fairlystandardmaps with a well-understood relationship to the real world; sites such as Strange Maps and Mark Easton's blog show how maps can present data in many other forms. My headmaster at school thought that geography as a subject had lost its way; instead of focusing on rock strata, it should focus on teaching children where places where. At the time, we all thought he was a touch eccentric; in retrospect, he was absolutely right. Huge tracts of politics, economics, and history only make sense in the context of a given place(s). I could go on, but I hope my point is clear: maps are important, and not just for finding our way around. A big problem with traditional maps is finding things: places, areas, and so on. Even for a medium sized paper map, the size of the index is huge; and the index for a good quality London A-Z is often almost as big as the map itself. Paper map indexes have their own funny little language, trying to squeeze as much data in as possible. However paper indexes have two problems. First, how many times have you forgotten what square you were supposed to look at when you get to the proscribed page? Second, indexes can only grow to a certain size before they are unusable. Computer geocodingThe advent of computer mapping has been a real boon, and not to just map fiends such as I. The first site I used regularly was Street Map. Having the map data for the whole UK was incredibly useful and the search facilities, while crude, were a huge improvement on a paper index. When later sites such as Google Maps became available, I was stunned. Suddenly freely viewable map data (though note I do not use the phrase free map data) for much of the world was coupled with a new way of searching. No paper index, or Streetmap's cloying need to be told what type of search was being performed (e.g. place name, street name, post code etc.). Instead is what I call (for want of a better name) free text geocoding: that is, where one types in something in the format used in real-life, and the search engine finds the right place automatically. One doesn't need to tell the search engine that the search is for a postcode or a place-name, or even which country one is searching - it magically does the right thing. Well, of course - not always the right thing. For example, at the time of writing, if I do a search for There are several reasons why sites such as Google Maps are not as widely used as they might be. Regrettably the chief reason is legal: there are restrictions on the way that map data can be used. Fortunately an initiative - OpenStreetMap - to create much freer map data was started a few years back (and started by Brits - we are, it appears, a nation of map lovers) and is now at the stage where, despite many huge holes in its data, it is semi-usable: in central London, for example, it already has arguably the highest quality maps. Some of the uses of OpenStreetMap are already quite astonishing (the UK postcode layer is a simple, but effective, example of what can be done - click the little One part of OpenStreetMap that frustrates me a little is its search (the Name Finder). Although it does a reasonable job in many respects, the results it returns are hard to interpret (type in Free text geocodingSince one of the huge benefits of using computer mapping is its search ability, I thought a fun little summer project would be to create my own free text geocoder. I started with only a vague idea of what a free text geocoder should (or could) do. While I don't now claim to have thought of everything, I do now have a much clearer idea of what a good free text geocoder should do. I've split these into
FetegeoTo this end I created, and have now released, Fetegeo with a BSD / MIT licence. Using Fetegeo's included client / server interface, queries can be performed on the command-line: $ fetegeoc geo London Match #1 ID: 719913 Name: London Latitude: 51.508415 Longitude: -0.125533 Country ID: 233 Parent ID: 1262 Population: 7421209 PP: London, United Kingdom Dangling text:Of course, there are a lot of London's in the world and I haven't copied all of Fetegeo's output. Notice though that, since the preferred country of the user wasn't specified, it's chosen what most people are likely to consider to be theLondon as the first match. If the user specifies that their country is Canada then London in Ontario is the first match: $ fetegeoc -c ca geo london Match #1 ID: 2984878 Name: London Latitude: 42.983389283 Longitude: -81.233042387 Country ID: 39 Parent ID: 540 Population: 346765 PP: London, Ontario Dangling text:Fetegeo can be instructed to allow dangling (i.e. unmatched) text in matches: $ fetegeoc -d geo Museums in London Match #1 ID: 719913 Name: London Latitude: 51.508415 Longitude: -0.125533 Country ID: 233 Parent ID: 1262 Population: 7421209 PP: London, United Kingdom Dangling text: Museums in If you're interested, there's a slightly more thorough description of the ways that Fetegeo can be used, and a simple demo which geocodes results and shows them on an OpenStreetMap map. How Fetegeo worksInternally, Fetegeo's search is fairly simple and its approach is easily described. Strings in Fetegeo are always normalised; in particular punctuation is removed, and strings are lower-cased. String queries to the database are always on hashes of normalised words. Given a normalised string How Fetegeo comparesHow does Fetegeo score on the must / should chart? It's too early to say how accurate Fetegeo's results are. First, Fetegeo has only been relatively lightly tested so far: it's inevitable that there will be bugs and oversights. Second, any free text geocoder is subject to something I'm tentatively calling Tratt's First Law of Free Text Geocoding (though I doubt I'm the first to think of it): the upper bound for results quality is determined by your dataset. The best free text geocoder in the world can only give iffy results with an iffy dataset. Fetegeo's initial dataset is based on Geonames data (and postcode data from various other sources). While Geonames should be saluted as the first serious attempt to collate freely-available place data, the structure of the data is less than ideal, and the data itself is of variable quality, suffering from frequent inaccuracies and duplication. Because of this, Fetegeo has been designed to be relatively independent of any particular dataset; I hope one day in the not too distant future that OpenStreetMap's data will be sufficiently broad in scope to replace Geoname's (OpenStreetMap's data is already deeper in the sense that it includes roads, which Geonames doesn't). Fetegeo is already reasonably fast, given that it's only been semi-optimised. On my 3-ish year old desktop machine, using the stock install of PostgreSQL (a few tweaks would, I suspect, make it perform much better - if only I could work out what those tweaks were amongst the mass of overlapping configuration options!), typical queries are answered in less than 0.1s. Fetegeo makes use of simple caching internally to speed things up. Someone who understands databases better than I could almost certainly make things run much faster. Fetegeo has the beginnings of being usable for more than just longitude / latitude searches, but there is some way to go yet to prove this is feasible. In particular I would like to see it capable of being used by applications to classify things as being in particular areas. Imagine you have a website listing X's in Britain (where X could be just about anything), where each X is located at a particular latitude / longitude. This allows one to easily search for Fetegeo is usable in a number of different ways. As such, Fetegeo is just a Python library which can be included and used in any application. Fetegeo also comes with a standard internet server and (command-line) client, which can receive and answer XML queries (as an aside, the XML parser used is often the slowest part of querying). This means that even a simple web-site can query a single Fetegeo server and make use of its caching facilities and so on. ConclusionI have no idea whether anyone will find Fetegeo useful. It seems to me that, even in its current embryonic form, it fills an unoccupied niche, at least in terms of its licence if not its functionality (yet). I hope that other people might find it interesting, and start to extend its functionality to make it more widely applicable. If you want to find out more, and contribute, please waltz on over to fetegeo.org. Extended BacktracesJune 2 2008 I can quite distinctly remember when, as a teenager, I realised that I would spend much of my life debugging. I was programming in assembler, where loops are simply branch statements to defined labels. Because loops are so common, one would use the same label for each loop; later loops would redefine the label, meaning that no problems could occur. Being a literal person I used the labelloopfor all such loops. When one of my programs mysteriously failed, I could not work out why. Eventually I realised that one of my label definitions had spelt looop(three O's instead of two) instead of loop, so my loop had branched back to the previous loop in the file. Spotting that took me a couple of days. Later, I realised that most programming errors fit into two broad categories: the obvious and the subtle. Obvious errors are those whose source can be easily pinpointed (even if fixing the problem takes a while). The subtle are typically those where cause and effect are separated, making identification of the root of the problem difficult (often, when eventually located, such problems are easily fixed). The There are, to my mind, two tricks to debugging. The first is to try and turn subtle problems into obvious problems; however subtle problems are typically inherently subtle and unamenable to such treatment. The second is to try and speed up the solving of obvious problems. For me, the main tool for solving obvious problems is the humble backtrace which, when an exception occurs, shows one (in some manner or another) the call stack and, hopefully, what file and line number each entry therein is associated with. Given the following trivial program:
func head(l):
return [l[0], l[1 : ]]
func main():
head([1])
head([])
a standard looking backtrace would be:
Traceback (most recent call at bottom): 1: File "/tmp/head.cv", line 6 2: File "/tmp/head.cv", line 2 3: (internal), in List.get Bounds_Exception: 0 exceeds upper bound 0Using this we can fairly quickly see that the cause of our error is passing an empty list to a function which assumes that there is at least one element in the list. [As a side note, this example is in one way unrepresentative: in the vast majority of cases, it's typically the bottom one or two lines of the backtrace that pinpoint the real source of the error.] Backtraces like the above can be found in most modern programming languages like Java. They are immensely useful and form precisely half of my debugging toolkit, the other half being printf - in my view of the world, these two tools obviate the need for debuggers. The power of backtraces is most obviously felt in those languages that don't have them. C programs typically need to be run through a debugger to get a backtrace, meaning that errors in programs running in production can be extremely difficult to diagnose. The first Haskell program I wrote had the Python was the first language I saw that took backtraces a little bit further, printing (when possible) the line of source code associated with each part of the backtrace. A Python-esque backtrace looks roughly as follows:
Traceback (most recent call at bottom):
1: File "/tmp/head.cv", line 6
head([])
2: File "/tmp/head.cv", line 2
return [l[0], l[1 : ]]
3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
This simple innovation is a real boon: as in this case, one often doen't even need to open a source file in a text editor to see the error made. Python-esque backtraces help make obvious errors quicker to solve than traditional backtraces.
I realised early on in Converge's development that knowing merely the line number of an error was only part of the problem. Often a specific sub-expression within a certain line is the relevant part of the backtrace, and the rest of the line is noise. Converge therefore recorded the column (i.e. offset within a line) where each error is associated with, meaning that backtraces looked like the following: Traceback (most recent call at bottom): 1: File "/tmp/head.cv", line 6, column 4 2: File "/tmp/head.cv", line 2, column 13 3: (internal), in List.get Bounds_Exception: 0 exceeds upper bound 0This extra information is very helpful: it means that I can accurately pinpoint which of the two list lookups in line 2 is responsible for calling List.get incorrectly. As a useful advantage, Converge's approach also means that errors that happen within multi-line statements (i.e. logicallines of source split over multiple physical lines in a source file to aid presentation) work properly. Converge's backtraces stayed like the above for quite some time, until recently when I realised that knowing the start column associated with an error is only part of the story. What one really wants to know is the start and end of the associated expression. A small tweak to the parser, and a huge (but mechanical) change to the compiler, and Converge backtraces could tell one how many characters in the line an error was associated with:
Traceback (most recent call at bottom):
1: File "/tmp/head.cv", line 6, column 4, length 8
head([])
2: File "/tmp/head.cv", line 2, column 12, length 4
return [l[0], l[1 : ]]
3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
This is almost helpful, but in practice I find it surprisingly hard to count n characters within a line on screen, which hinders interpretation of the above data.
A short while later, the answer hit me: what the backtraces need to do is to highlight the relevant sub-expression within the line. Here's a screenshot of the above error running in an xterm with the latest version of Converge:
As you might notice, the tiny little difference here is that the part of each line pertinent to the error is in bold and underlined. Knowing that, one can instantly see that the first of the two list lookups on line 2 is responsible for calling As I explained in a previous entry, when Converge DSLs are translated into Converge ASTs, individual call stack entries can be associated with more than one source location. This means that backtraces tend to be rather long, which previously made tracking down the cause of an error tedious - loading multiple files into text editors and continually flipping back and forth to xterms is not fun.
Looking at this backtrace, an experienced programmer will be able to quickly surmise that, given the exception message, the most likely candidate for this error is in the From a practical point of view, Converge's extended backtraces have no run-time penalty for correct code, and users don't have to do anything to enable them - they're a standard part of the system. Extended backtraces can be found in -current versions of Converge (at the time of writing, Converge's support for Curses under Windows is weak, so underlining doesn't work there - it's a quick, fun little project for someone who's interested). So, going back to the start of this entry, how do Converge's extended backtraces help with debugging? Well, they might help turn the odd subtle error into an obvious error, but that's an incidental benefit. What I think they do is make solving obvious errors much quicker than previously. In the sort time since I've had extended backtraces, I've noticed that I've often been able to almost instantly fix errors that before might have taken me a couple of minutes. Given the number of programming errors I make, the cumulative time saving is most welcome. In summary, I think that Converge's extended backtraces are a real boon to programming. To the best of my knowledge, Converge is the first language with such backtraces in - I hope it won't be the last! Designing Sane Scoping RulesMarch 3 2008 If there's one thing which unites pretty much every post-assembly programming language it is the use of the humble variable. Variables are such a common feature that we tend to take them for granted; perhaps I show my background in assembly by being explicitly aware of them. However where programming languages often differ is in the way that they allow one to reference variables: the dreadedscopingrules. In this article I'm going to outline why Converge has the scoping rules it does. The first thing I need to do is to outline the problem. Although all my examples are framed in terms of a x := 2 ... y := xIn every language I know this says assign. So after this code is run both x and y will have the value 2.
The first major issue with regards to variables is global vs. local variables. Take the following code which is intended to represent the top-level of a source file:
x := 2
func f():
x := 3
f()
In a lot of BASIC-type languages there is only one underlying x variable in this program, so after this code is run the outer x will have the value 3. In essence we have a flat variable scope: all variables belong to the same, single, namespace. This not only makes writing programs error-prone (e.g. one function accidentally corrupts another's x), but makes certain styles of programming largely impractical (e.g. recursive functions). It's probably no coincidence that the first mainstream programming language to make a virtue out of recursion - Algol 60 - also was among the first to get its scoping rules in reasonable order.
Retro-fitting sane scoping rules is not easy if the first version of your language used the above scoping rule (note the deliberate use of the singular) - any change of scoping rule(s) has a high chance of breaking programs in nasty ways. Some of the BASIC-type languages I first used solved the backwards compatibility problem by making variables
x := 2
func f():
local x
x := 3
f()
This code now has two distinct x variables: one at the top-level and one in the f function. Every time f is called - even recursively - it will be given space for a new, fresh x. This feature makes programming a lot easier, even if it defaults to the insane globalscoping rule by default. As an aside, surprisingly (to me at least), you can still see the global by defaultscoping rule in the modern language Lua. This goes to show how fundamental scoping rules are: once they're in a language, users will resist nearly all meaningful change to them. Most programming languages adopt a slight variant of the above rules which are fairly easy to understand in practice. Essentially variables with the same name as a top-level variable reference that top-level variable directly, while all other variables are local. So in a C-type language the following code contains two variables: a top-level
x := 2
func f():
x := 3
y := 4
f()
Running the above code means that the top-level x is set to 3, while the y variable is local to f. Variations on this set of scoping rules underly many programming languages in use today.
The next major design challenge for scoping rules is much more subtle and confuses many of us to this day. Knowing that the
x = 2
def f():
x = 3
def g():
global x
print x
x = 4
print x
g()
f()
print x
What do you think this will print out? Let's try it out with Python 2.5:
$ python scope.py 3 2 4 $In other words, we get a result which is a long way from what we might have expected: the print statement in g prints 2 instead of the expected 3 and the final print statement prints 4 instead of the expected 2. What is it doing? What's happening is that most languages don't have nested scopes (as one might expect) but two scopes: a top-level (a.k.a. global or module) scope and a function scope. What this means is that the assignment to x in g references the top-level x, not the x in f; you might want to read that twice to check that you've really understood it.
It might at first seem that Python's scoping rules are simply silly; actually, they're not unreasonable and they're shared by most programming languages (e.g. Java). Why? The problem is that the function
func f():
x := 2
func g():
return x
return g
f()()
In other words, f returns a reference to g; when g is executed, the value of x known to f will have disappeared as, in most programming languages, variables are stored on the stack. This means that variables only exist for the duration of a function call. Since f's variables will have disappeared when g is executed, all sorts of bad things could happen.
Scheme was the first language that presented a practical solution to this problem of nested scopes in the form of closures. The standard way that closures are defined is guaranteed to confuse and I'm not going to repeat it. They're actually very simple: essentially each function allocates heap memory to store variables on. Thus if an inner function outlives an outer function there is no problem in referencing variables in the outer function even if the stack space has long since disappeared, since a function calls variables can outlive the function call itself. [As an aside, the fact that closures need to allocate heap memory (although it's often possible to statically analyse such allocations away) has been used as an argument against them in languages such as Java. That's the chief reason that Java has all sorts of complications like inner classes, final variables and so on: Java resisted closures, and then had to resort to hacks to get a poor facsimile of its functionality. It's hard to imagine any decent programming language being built now that doesn't implement closures (Converge certainly does), so closures are gradually losing their When I was designing Converge, I put some effort into deciding what its scoping rules were going to be. I wanted to make things safe by default (e.g. no
x := 2
func f():
x := 3
func g():
nonlocal x
Sys::println(x)
x := 4
Sys::println(x)
g()
func main():
f()
Sys::println(x)
Which when run does the expected:
$ converge s.cv 3 3 2 $The interesting thing here, in my mind at least, is the nonlocal keyword. Although it's a tad awkward sounding, it was the best combination of brevity and accuracy that I could think of. Unlike Python it would be incorrect to declare a variable as global since there are more than 2 scopes: in fact, there are an arbitrary number. What nonlocal is saying is when you see an . It's not a commonly used feature, but when you need it is priceless.
I said above that Converge has - or had - the simplest of any imperative programming language (at least as far as I'm aware). Some time after the first publications and release of Converge, the Python team decided to fix their scoping rules for their backwards-compatibility-breaking Python 3000 release. PEP 3104 contains the eventual proposal they came up with. Interestingly it is identical to Converge's scoping rules even down to using the The open question is this: why has it taken us, as a community, 50 years or more to define two simple scoping rules (assignment is local; Some Lessons Learned from IconDecember 3 2007
One of Converge's lesser known influences is Icon. Although it's a relatively obscure language itself, a surprising number of people have heard of Icon's direct predecessor SNOBOL. I first heard of Icon through Tim Peters and Python, where Icon was the inspiration for Python's generator feature (basically procedures that whose execution can be resumed to produce multiple values). However Icon has a much richer, and definitely unique, approach to expression evaluation than any other imperative programming language. When I started working on Converge, I decided to go back to This article is a personal look back at Icon's influence on Converge, and the successes and failures resulting from that influence. It's quite probable that much of it reflects my slightly superficial understanding of Icon - apart from its innovative approach to expression evaluation, much of the language has an old fashioned feel to it, sometimes appearing like a dynamically typed version of Pascal, and this prevented me from ever wanting to write anything large in it. This isn't a criticism of Icon as such - it was ahead of its time in many ways - but is merely an observation that modern programmers expect something slightly different from their programming languages. Hopefully, even with the above caveat, this article contains something of value to those interested in programming languages. Why is Icon interestingIn a nutshell, Icon is interesting due to what it calls goal-directed evaluation. This is an evaluation mechanism which gives to an imperative programming language some of the flexibility more often associated with languages like Prolog, such as a limited form of backtracking. It is built around the concepts of success and failure, and generators (see the Converge documentation for more details). Expressions can succeed or fail; if they succeed, they produce a value; if they fail, they do not produce a value and transmit failure to their containing scope. Generators are functions which can produce more than one value.Icon's syntax is derived from Algol, so the following complete program prints 1 to 9 inclusive:
procedure upto(x)
i := 0
while i < x do {
suspend i
i := i + 1
}
end
procedure main()
every x := upto(10) do write(x)
end
Most of this is probably reasonably intuitive, apart from the every and suspend keywords. every can be considered to be equivalent to the typical understanding of a for statement (indeed, in Converge, the same construct is called for). suspend is equivalent to yield in Python or Converge, returning a value, but allowing the procedure calls execution to be resumed directly after the suspend statement.
Failure and
If I had to pick one thing from Icon that I would not be without in Converge, it would be the concept of failure in |
| Earlier articles |
| Home | Copyright © 1995-2008 Laurence Tratt laurie@tratt.net |