Laurence Tratt's Technical Articles http://tratt.net/laurie/tech_articles/ Laurence Tratt's Technical Articles en-gb Problems with Software 3: Creating Crises Where There Aren't Any [June 28 2011] http://tratt.net/laurie/tech_articles/articles/creating_crises_where_there_arent_any [This is number 3 in an occasional series ‘Problems with Software’; read the previous article here.]

As an undergraduate in the late 1990s, the part of my course I liked the least was Software Engineering, which started with obviously silly 1970s practices, and went downhill from there. As someone who was already familiar with both programming and, to some extent, the software industry, I pooh-poohed the course; my arrogance was duly rewarded with the lowest mark of any part of my degree, which taught me a useful life lesson.

One previously unfamiliar concept which this dreadful course introduced to me was the ‘software crisis’. First promulgated in the late 1960s and early 1970s, it captured the notion that software was nearly always over-budget, behind schedule, and of low quality. Those with an appreciation of history will know that most of the software technologies (programming languages, operating systems, and so on) we now take for granted took form around this time. As with any new field, this was an exciting time, with no technical precedent: none of its participants would have been able to predict the path forward. Rather, experimentation with new ideas and tools led to successes and failures; the latter received greater attention, leading to the idea of a ‘software crisis’.

To the best of my knowledge, the ‘software crisis’ was the first wide-spread moment of self-doubt in the history of computing. It reflected the idea that the problems of the day were huge and, quite possibly, perpetual. It is this latter point that, with the benefit of hindsight, looks hopelessly naive: by the early 1990s (and, arguably, many years before), software was demonstrably and continually improving in quality and scope, and has continued doing so ever since. While the massive improvements in hardware certainly helped (by the mid 1990s, hardware imposed constraints only on the most demanding of software), it was surely inevitable that, as a community, we would become better at creating software, given experience.

Mercifully, I have not found myself in a discussion about the ‘software crisis’ for 5 or more years; it seems finally to have been put to rest, albeit many, many years after it had any possible relevance. However it seems that, in software, we abhor a crisis vacuum: as soon as one has disappeared, others arise in its place. I've seen a few over the last few years including a ‘skills crisis’ and an ‘outsourcing crisis’, all minor problems at best.

However, one ‘crisis’ has gained particularly widespread attention in the last few years: the ‘multi-core crisis’. A quick historical detour is instructive. Although – in accordance with Gordon Moore's famous observation – the number of transistors on a CPU has doubled every 2 years for the last 50 years, by the mid part of this decade, it was proving increasingly difficult to make use of extra transistors to improve the performance of single-core (i.e. traditional) processors. In short, the easy performance increases that the software industry had indirectly relied upon for many decades largely disappeared. Instead, hardware manufacturers used the extra transistors to produce multi-core processors. The thesis of the ‘multi-core crisis’ is that software has to make use of the extra cores, yet we currently have no general techniques or languages for writing good parallel programs.

Assuming, for the time being, that we accept that this might be a crisis, there are two things which, I think, we can all agree upon. First, virtually all software ever written (including virtually all software written today) has been created with the assumption of sequential execution. Second, although we feel intuitively that many programs should be paralellisable, we do not currently know how to make them so. In a sense, both these statements are closely related: most software is sequential precisely because we don't know how to parallelise it.

One question, which those behind the ‘multi-core crisis’ do not seem to have asked themselves, is whether this will change. For it to do so, we would need to develop languages or techniques which would allow us to easily write parallel software. Broadly speaking, only two approaches have so far seen wide-spread use: processes and threads. The processes approach breaks software up into chunks, each of which executes in its own largely sealed environment, communicating with other chunks via well-defined channels. It has the advantage that it is relatively easy to understand each process; and that processes can not accidentally cause each other to fail. On the downside, processes must be large: the approach is not suited to fine levels of granularity, ruling out most potential uses. The threads approach is based on parallel execution within a single process, using atomic machine instructions to ensure synchronisation between multiple threads. It has the advantage of allowing fine grained parallelisation. On the downside it is notoriously difficult to comprehend multi-threaded programs, and they are typically highly unreliable because of incorrect assumptions surrounding state shared between threads.

In short, processes and threads are not without their uses, but they will never get us to parallel nirvana. A number of other approaches have been proposed but have yet to prove themselves on a wide scale. Transactional memory, for example, has serious efficiency problems and is conceptually troubling with respect to common operations such as input / output. There also seems to be an interesting tradeoff between parallel and sequential suitability: languages such as Erlang, which are, relatively speaking, good at parallelism, often tend to be rather lacking when it comes to the mundane sequential processing which is still a large part of any ‘parellel’ system.

However, all this is incidental detail. The point about a crisis is not about whether we have solutions to it or not, but whether it is a real, pressing problem. The simple answer is, for the vast majority of the population, “no”. Most peoples computing requirements extend little beyond e-mail, web, and perhaps word processing or similar. For such tasks – be they on a desktop computer or running on a remote server – computers have been more than powerful enough since the turn of the century. If we were able to increase sequential execution speed by an order of magnitude, most people wouldn't notice very much: they are already happy with the performance of their processors, so increasing it again won't make much difference (most users will see far more benefit from an SSD than a faster processor). In those areas where better parallel performance would be a big win – scientific computing, for example, or computer games – people are prepared to pay the considerable premium currently needed to create parallel softwre. Ultimately, we may get better at creating parallel software, but most users' lives will tick on merrily enough either way.

The ‘software crisis’ was bound to solve itself given time and the ‘multi-core crisis’ is no such thing: just because we've been given an extra tool doesn't mean that we're in crisis if we don't use it. Eventually (and this may take many years), I suspect people will realise that. But, I have little doubt, it will be replaced by another ‘crisis’, because that seems to be the way of our subject: no issue is too small not to be thought of as a crisis. The irony of our fixation on artificially created crises is that deep, long-term problems – which arguably should be considered crises – are ignored. The most important of these surely relates to security — too much of our software is susceptible to attack, and too few users understand how to use software in a way that minimises security risks. I have yet to come across any major organisation whose computing systems give me the impression that they are truly secure (indeed, I have occasionally been involved in uncovering trivial flaws in real systems). That to me is a real crisis — and it's one that's being exploited every minute of every day. But it's not sexy, it's not new, and it's hard: none of which makes it worthy, to most people, of being thought of as a crisis. ]]> Problems with Software 2: Failing to Use the Computing Lever [June 7 2011] http://tratt.net/laurie/tech_articles/articles/failing_to_use_the_computing_lever [This is number 2 in an occasional series ‘Problems with Software’; read the previous article here.]

We all know at least one and I make no apologies for the picture I now paint. Hunch backed, tongue out, index fingers locked rigidly into an uncomfortable ‘r’ shape, and with no sense of the appropriate level of force to put into their actions. The neanderthal figures I refer to are, of course, two fingered typists. In my experience, a typical two fingered typist will struggle to type 30 Words Per Minute (WPM). An average touch typist will manage at least 70 WPM; and for those prepared to put in a bit of effort in, 90-100 WPM is eminently achievable.

Whether high typing speeds are useful depends on the person and the tasks they need to achieve. Agricultural workers, for example, probably won’t see a huge return on investment in typing skills. What astonishes me is that I know people who spend their entire working day in front of a computer, day in, day out, yet who still use this grossly inefficient technique. The slow typists I know easily lose a few hours every week due to their poor technique — yet suggesting that a few days spent learnt touch typing will quickly pay off in increased productivity is inevitably met with a reply of “I don’t have the time”.

Poor typing technique is a concrete example of a common human tendency — to stop learning as soon as one has a way of performing a task, regardless of whether that is the best way. In typing terms, the slowest typist is perhaps 5 times slower than the fastest: in other areas of human activity, the ratio can be much greater. One of those fields is computing and, by implication, software.

At this point, it is instructive to ask oneself a simple question: what is a computer? The standard answers run along the lines of “a CPU, some RAM, a disk”, “a keyboard and a monitor”, or “an operating system and user software.” While correct in their own way, such answers report the mechanisms involved without considering their purpose. The answer I prefer to this question is more abstract: computers are levers. Archimedes, the first person to explain how levers work, famously said “Give me a place to stand, and I will move the Earth.” When asked to prove this by his King, Archimedes used a lever to move a ship single-handedly, something impossible for even the strongest man to do unaided.

One of the first real uses of a computer shows how effective a lever they can be. The semi-programmable British Colossus computer of the mid 1940s was able to perform thousands of comparisons per second, cracking the seemingly unbreakable German Enigma code. What the code-breakers of Bletchley Park had realised was that a computer could perform simple, repetitive tasks at a speed impossible for humans. Operations that had taken a group of people several weeks to complete could be done by Colossus in half an hour: such was Colossus’s speed that, once operational, it played a significant part in shortening the course of the war.

It is little exaggeration to say that computers have developed into the longest levers available to man. Since Colossus, the rate of progress has been little short of astonishing, and computers have been used to do things that were previously unimaginable: from weather prediction to visual special effects, from DNA sequencing to search engines.

Yet, a lever is only truly effective if used from its end: gripped near the pivot, a lever is, in effect, shortened, and its force magnifying effect reduced. Regrettably, most people grip the computing lever extremely close to the pivot. There are many reasons for this and enumerating them all would take a small book. However, a few examples give an indication of the problem.

  1. Doing tasks manually instead of automatically.
    Computers excel at the simple, repetitive tasks which humans are terrible at. I remember seeing someone change a document which used the American English idiom of placing a comma after e.g. (i.e. “e.g., X”) to the less stilted British English idiom without the comma (i.e. “e.g. X”). The person editing did the change by scanning the (rather large) document and manually changing each occurrence, a task which took a couple of rather dispiriting hours. Inevitably, in a large document, some occurrences were missed. I fixed the remaining occurrences with 100% accuracy by using the ‘search and replace’ function in my text editor — which took less than 5 seconds.
  2. Permanent short-term thinking.
    Whenever I am confronted by a computing task, I ask myself “will I need to do this again?” If yes, I generally spend time working out how to automate the task. Consequently I have built up a considerable suite of small tools (a few of the more polished are publicly available) and techniques over the years to call upon. Most computing professionals I know have not a single such tool (and shockingly few techniques); their thinking never extends beyond trying to bumble through the task at hand. A task that, by hand, takes 30 minutes might take 90 minutes to automate; over the years, as that task recurs every 6 months or so, the automated solution pays for itself many times over.
  3. Using one tool for every task.
    Many people are prepared to invest time in learning one tool, but one tool only; it then becomes the hammer used to treat every task encountered as a nail. While this sometimes works well in the sort term, in the medium and long term, the effort involved in contorting the task to the tool can be huge. Perhaps the most common example is the misuse of spreadsheets to store database-like information — if I had a pound for every time an individual’s name was spelt differently on separate ‘sheets’ within a single Excel file, I would be a wealthy man.
  4. Not knowing the field and not staying upto date.
    The first hour of my day is generally spent checking computing newsgroups and websites, to ensure that I have some knowledge of the breadth of my subject, and the latest developments. Sometimes, I confess, I wish I didn’t have to do this; but if I didn’t, I would miss out on those surprisingly frequent moments when someone raises a problem and I hear myself saying “I was reading about a new program X a few weeks back which sounds like it might do what you need.”
  5. Not searching for help.
    I learnt the rudimentaries of programming before I had a modem; whatever problems I encountered had to be solved with the help of the one manual and two books I owned. Memorably, one extremely trivial problem took me nearly a full month to solve, working on my own. Now, my first instinct is to search for the problem on Google; 95% of the time, someone else will have had the same problem, and someone else will have pointed them to a solution. Yet, when I ask someone who presents me with a trivial problem “did you try looking it up on Google” 95% of the time the answer is “I didn’t think of doing that.”
To some extent, all of us will recognise some parts of ourselves in the above, or can pinpoint other areas where we fail to use the computing lever effectively. There is no shame in that: the shame comes in not addressing the problem. Unfortunately, ignorance in the basic use of computers is considered not only acceptable but the norm amongst the majority of computing and software professionals. The resulting loss in efficiency is vast and a rather sad indictment of our field. ]]>
Problems with Software 1: Confusing Problems Whose Solutions Are Easy to State With Problems Whose Solutions Are Easy to Realise [April 19 2011] http://tratt.net/laurie/tech_articles/articles/problems_whose_solution_are_easy_to_state Problems with Software, but I will do my best. The spirit in which I write these is a positive one, perhaps akin to a parent admonishing a child for his own long‐term benefit. Of course, I am, and always be, a student of software, not its parent; searching, learning, and in awe of this ever‐evolving subject. With that in mind, let us begin.

I start with what I consider to be the most common mistake in software; a mistake that no other subject I know of commits so frequently. It is the confusion of problems whose solutions are easy to state with problems whose solutions are easy to realise. A non‐computing example may help explain this. The problem is war: around the world, it ruins the lives of countless people. The easily stated solution is that we should stop people from fighting each other. Regretfully, we know that this solution, while correct, is not worth the space it takes on the page. Telling people they should stop fighting changes nothing, and physically stopping them requires resources far beyond that which can be mustered. None of this is to say that it is not worth tackling the problem of war: but most of us have enough common sense to realise that solutions (for there will surely need to be more than one) will be expensive, long‐term, and come with no guarantee of success.

Alas, in software, such common sense is an uncommon thing. In our subject, unworkable solutions are frequently proposed to deep problems; funding is acquired; manpower deployed; and, after an exhausting death march, failure guaranteed. The problem most frequently invoked is the difficulty of creating good software. This clearly is a problem: too much software is expensive, unreliable, and ill‐suited to the task it was created for. The many easily‐stated solutions proposed have had, at best, a tiny effect; at worst, they have impeded progress by suppressing less exciting, but more realistic, truths. Two concrete examples exemplify this.

The first example is Aspect Orientated Programming (AOP). The problem that AOP addresses is this: developing a program is hard because all of its constituents aspects — even those which are in no way related — must be considered together at all times. Imagine a health‐care system which records patient data and has separate modules for doctors and patients. One aspect is server communication; both doctor and patient units need to communicate with a central server and cope with certain error conditions. This aspect is jumbled up with, and scattered across, the rest of the program, making it hard to check that all server interactions will be performed as expected. The AOP solution is then simple: rather than developing software monolithically, it should be developed as separate aspects, which can be worked on in isolation, before being composed together to produce the end result.

There is no doubt that the AOP solution is intuitive and appealing: the jumbled and scattered nature of current software is a curse, with many deleterious effects. I can remember when I first came across AOP. My response then is my response now (though my argument, I hope, is somewhat better stated these days): it can't work. Furthermore, a simple analogy shows why it can't work.

Imagine a hill, composed, horizontally, of different rock strata. Strata can be taken out and put back in, without ever changing the inherent ‘hillness’ of the hill—a hill is still identifiably a hill even if one strata is taken out (though it might have a distinct kink in it) and each strata makes sense in and of itself, whether it is in the hill or not. In this analogy, aspects are the strata; for AOP to work, it is necessary to imagine being able to take a piece of software, pull out an aspect, and have that aspect be coherent in and of itself. It is a beautiful vision, and one which would change programming. Let us now take a different analogy. Instead of a hill, imagine a gravel drive, consisting of hundreds of thousands of tiny stones of varying colours: grey, honey, black, white, and so on. Even Don Quixote would blanch if I asked him to extract all the honey coloured stones, put them to one side, and then later to put them all back in the same place. The unfortunate reality is that software necessarily resembles a gravel drive—just as different coloured stones exist alongside each other, so do `aspects' of a program. The server communication code in the health care system is likely, for example, to form parts of nested AND expressions—there is no general way to pull such an entangled aspect out in such cases (or to have developed it from scratch) and expect to be able to put it back in.

The second example is Model Driven Architecture (MDA). For my sins, I played a small part in this monstrosity; in my defence I was young, naive, and needed the money, though such pitiful excuses are unlikely to help me come the day of reckoning. The problem that MDA aims to address is the following: developing software is hard because programming languages force one to work at a level of abstraction that is not much above that of machine code. Indeed, this is a problem; while a high‐level description of a business's requirements for its new system may be a few pages long, the corresponding software may have tens or hundreds of thousands of lines of code. Relating the mess of a low‐level program back to the high‐level requirements is a skill that few people successfully acquire. The MDA solution is then simple: development should be done with high‐level (mostly UML) diagrams. Common sense quickly suggests that this approach can never work. Diagrams are wonderful for expressing static relationships, but, overall, terrible for expressing most dynamic behaviour. Looping behaviour is elegantly captured by a textual FOR loop but, as anyone who has ever seen a `visual' programming language in action will know that, is a daunting, unmanageable, mess when represented graphically—and a language which makes looping behaviour difficult is not very useful.

What's most frustrating about AOP and MDA is that while both correctly identify real problems, the solutions they present are appealing in their simplicity yet self‐evidently unviable. Huge amounts of money and time have been invested in both, with little prospect of meaningful reward. Indeed, I am not sure that a good solution could exist for either problem: they seem to me largely intractable and inevitable. Of course, this is not to say that people shouldn't try to come up to solutions to these and other such problems—if whinging old codgers like me always had our way, we'd still be living in dark, damp caves and laughing at Barry from the cave next door while he works on his silly sounding ‘wheel’ concept. Ours is a ‘new’ subject, a vast, unexplored continent, and optimism is a vital component of the makeup of successful settlers. But as vital as optimism is realism—expending precious resources on hopeless causes is a guaranteed way to doom settlement. A little dose of common sense — thinking through a couple of common cases, at the very least — when prevented with a wonderful sounding solution to a hard problem would do our subject a lot of good. Other subjects seem to manage it, and I don't see why in software we can't too. ]]> Parsing: The Solved Problem That Isn't [March 15 2011] http://tratt.net/laurie/tech_articles/articles/parsing_the_solved_problem_that_isnt Bill hits Ben conforms to the part of the English grammar noun verb noun. Parsing concerns itself with uncovering structure; although this gives a partial indication of the meaning of a sentence, the full meaning is only uncovered by later stages of processing. Parseable, but obviously nonsensical, sentences like Bill evaporates Ben highlight this (the sentence is still noun verb noun, but finding two people who agree on what it means will be a struggle). As humans we naturally parse text all the time, without even thinking about it; indeed, we even have a fairly good ability to parse constructs that we've never seen before.

In computing, parsing is also common; while the grammars are synthetic (e.g. of a specific programming language), the overall idea is the same as for human languages. Although different communities have different approaches to the practicalities of parsing - C programmers reach for lex / yacc; functional programmers to parser combinators; others for tools like ANTLR or a Packrat / PEG-based approach - they typically rely on the same underlying area of knowledge.

After the creation of programming languages themselves, parsing was one of the first major areas tackled by theoretical computer science and, in many peoples eyes, one of its greatest successes. The 1960s saw a concerted effort to uncover good theories and algorithms for parsing. Parsing in the early days seems to have shot off in many directions before, largely, converging. Context Free Grammars (CFGs) eventually won, because they are fairly expressive and easy to reason about, both for practitioners and theorists.

Unfortunately, given the extremely limited hardware of 1960s computers (not helped by the lack of an efficient algorithm), the parsing of an arbitrary CFG was too slow to be practical. Parsing algorithms such as LL, LR, and LALR identified subsets of the full class of CFGs that could be efficiently parsed. Later, relatively practical algorithms for parsing any CFG appeared, most notably Earley's 1973 parsing algorithm. It is easy to overlook the relative difference in performance between then and now: the fastest computer in the world from 1964-1969 was the CDC6600 which executed at around 10 MIPS; my 2010 mobile phone has a processor which runs at over 2000 MIPS. By the time computers had become fast enough for Earley's algorithm, LL, LR, and friends had established a cultural dominance which is only now being seriously challenged - many of the most widely used tools still use those algorithms (or variants) for parsing. Nevertheless in tools such as ACCENT / ENTIRE and recent versions of bison, one has access to performant parsers which can parse any CFG, if that is needed.

The general consensus, therefore, is that parsing is a solved problem. If you've got a parsing problem for synthetic languages, one of the existing tools should do the job. A few heroic people - such as Terence Parr, Adrian Johnstone, and Elizabeth Scott - continue working away to ensure that parsing becomes even more efficient but, ultimately, this will be transparently adopted by tools without overtly changing the way that parsing is typically done.

Language composition

One of the things that's become increasingly obvious to me over the past few years is that the general consensus breaks down for one vital emerging trend: language composition. Composition is one of those long, complicated, but often vague terms that crops up a lot in theoretical work. Fortunately, for our purposes it means something simple: grammar composition, which is where we add one grammar to another and have the combined grammar parse text in the new language (exactly the sort of thing we want to do with Domain Specific Languages (DSLs)). To use a classic example, imagine that we wish to extend a Java-like language with SQL so that we can directly write:
for (String s : SELECT name FROM person WHERE age > 18) {
  ...
}
Let's assume that someone has provided us with two separate grammars: one for the Java-like language and one for SQL. Grammar composition seems like it should be fairly easy. In practice, it turns out to be rather frustrating, and I'll now explain some of the reasons why.

Grammar composition

While grammar composition is theoretically trivial, simply squashing two grammars together is rarely useful in practice. Typically, grammars have a single start rule; one therefore needs to choose which of the two grammars has the start rule. More messy is the fact that the chances of the two grammars referencing each other is slight; in practice, one needs to specify a third tranche of data - often referred to, perhaps slightly misleadingly, as glue - which actually links the two grammars together. In our running example, the Java-like language has the main grammar; the glue will specify where, within the Java-like expressions, SQL statements can be referenced.

For those using old parsing algorithms such as LR (and LL etc.), there is a more fundamental problem. If one takes two LR-compatible grammars and combines them, the resulting grammar is not guaranteed to be LR-compatible (i.e. an LR parser may not be able to parse using it). Therefore such algorithms are of little use for grammar composition.

At this point, users of algorithms such as Earley's have a rather smugger look on their face. Since we know from grammar theory that unioning two CFGs always leads to a valid CFG, such algorithms can always parse the result of grammar composition. But, perhaps inevitably, there are problems.

Tokenization

Parsing is generally a two-phase process: first we break the input up into tokens (tokenization); and then we parse the tokens. Tokens are what we call words in everyday language. In English, words are easily defined (roughly: a word starts and ends with a space or punctuation character). Different computer languages, however, have rather different notions of what their tokens are. Sometimes, tokenization rules are easily combined; however since tokenization is done in ignorance of how the token will later be used, sometimes it is difficult. For example, in SQL SELECT might be a keyword but in Java it is also a valid identifier; it is often hard, if not impossible, to combine such tokenization rules in traditional parsers.

Fortunately there is a solution: scannerless parsing (e.g. SDF2 scannerless parsing). For our purposes, it might perhaps better be called tokenless parsing; the different names reflect the naming conventions of different parsing schools. Scannerless parsing does away with a separate tokenization phase; the grammar now contains the information necessary to dynamically tokenize text. Combining grammars with markedly different tokenization rules is now possible.

Fine-grained composition

In practice, the simple glue mentioned earlier used to combine two grammars is often not enough. There can be subtle conflicts between the grammars, in the sense that the combined language might not give the result that was expected. Consider combining two grammars that have different keywords. Scannerless parsing allows us to combine the two grammars, but we may wish to ensure that the combined languages do not allow users to use keywords in the other language as identifiers. There is no easy way to express this in normal CFGs. The SDF2 paper referenced earlier allows reject productions as a solution to this; unfortunately this then makes SDF2 grammars mildly context sensitive. As far as I know, the precise consequences of this haven't been explored, but it does mean that at least some of the body of CFG theory won't be applicable; it's enough to make one a little nervous, at the very least (not withstanding the excellent work that has been created using the SDF2 formalism by Eeclo Visser and others).

A recent, albeit relatively unknown, alternative are boolean grammars. These are a generalization of CFGs that include conjunction and negation, which, at first glance, are exactly the constructs needed to make grammar composition practical (allowing one to say things like identifiers are any sequence of ASCII characters except SELECT). Boolean grammars, to me at least, seem to have a lot of promise, and Alexander Okhotin is making an heroic effort on them. However, there hasn't yet been any practical use of them that I know of, so wrapping ones head around the practicalities is far from trivial. There are also several open questions about boolean grammars, some of which, until they are answered one way or the other, may preclude wide-scale uptake. In particular, one issue relates to ambiguity, of which more now needs to be said.

Ambiguity

By severely restricting what CFGs they accept, grammars which are compatible with traditional parsing algorithms (LL, LR etc.) are always unambiguous (though, as we shall see, this does not mean that all the incompatible grammars are ambiguous: many are unambiguous). Grammar ambiguity is thus less widely understood than it might otherwise have been. Consider the following grammar of standard arithmetic:
E ::= E "+" E
    | E "-" E
    | E "/" E
    | E "*" E
Using this grammar, a string such as 2 + 3 * 4 can be parsed ambiguously in two ways: as equivalent to (2 + 3) * 4; or as equivalent to 2 + (3 * 4). Parsing algorithms such as Earley's will generate all possibilities even though we often only want one of them (due to arithmetic conventions, in this case we want the latter parse). There are several different ways of disambiguating grammars, such as precedences (in this example, higher precedences win in the face of ambiguity):
E ::= E "+" E  %precedence 1
    | E "-" E  %precedence 1
    | E "/" E  %precedence 2
    | E "*" E  %precedence 3
This might suggest that we can tame ambiguity relatively easily: unfortunately, parsing theory tells us that the reality is rather tricky. The basic issue is that, in general, we can not statically analyse a CFG and determine if it is ambiguous or not. To discover whether a given CFG is ambiguous or not we have to try every possible input: if no input triggers an ambiguous parse, the CFG is not ambiguous. However this is, in general, impractical: most CFGs describe infinite languages and can not be exhaustively tested. There are various techniques which aim to give good heuristics for ambiguity (see Bas Basten's masters thesis for a good summary; I am also collaborating with a colleague on a new approach, though it's far too early to say if it will be useful or not). However, these heuristics are inherently limited: if they say a CFG is ambiguous, it definitely is; but if they can not find ambiguity, all they can say is that the CFG might be unambiguous.

Since theoretical problems are not always practical ones, a good question is the following: is this a real problem? In my experience thus far, defining stand-alone grammars for programming languages using Earley parsing (i.e. a parsing algorithm in which ambiguity is possible), it's not been a huge problem: as the grammar designer, I often understand where dangerous ambiguity might exist, and can nip it in the bud. I've been caught out a couple of times, but not enough to really worry about.

However, I do not think that my experience will hold in the face of wide-spread grammar composition. The theoretical reason is easily stated: combining two unambiguous grammars may result in an ambiguous grammar (which, as previously stated, we are unlikely to be able to statically determine in general). Consider combining two grammars from different authors, neither of whom could have anticipated the particular composition: it seems to me that ambiguity is much more likely to crop up in such cases. It will then remain undetected until an unfortunate user finds an input which triggers the ambiguity. Compilers which fail on seemingly valid input are unlikely to be popular.

PEGs

As stated earlier, unambiguous parsing algorithms such as LL and LR aren't easily usable in grammar composition. More recently, a rediscovered parsing approach has gathered a lot of attention: Packrat / PEG parsing (which I henceforth refer to as PEGs). PEGs are different than everything mentioned previously: they have no formal relation to CFGs. The chief reason for this is PEGs ordered choice operator, which removes any possibility for ambiguity in PEGs. PEGs are interesting because, unlike LL and LR, they're closed under composition: in other words, if you have two PEGs and compose them, you have a valid PEG.

Are PEGs the answer to our problems? Alas - at least as things stand - I now doubt it. First, PEGs are rather inexpressive: like LL and LR parsing, PEGs are often frustrating to use in practise. This is, principally, because they don't support left recursion; Alex Warth proposed an approach which adds left recursion but I discovered what appear to be problems with it, though I should note that there is not yet a general consensus on this (and I am collaborating with a colleague to try and reach an understanding of precisely what left recursion in PEGs should mean). Second, while PEGs are always unambiguous, depending on the glue one uses during composition, the ordered choice operator may cause strings that were previously accepted in the individual languages not to be accepted in the combined language - which, to put it mildly, is unlikely to be the desired behaviour.

Conclusions

If you've got this far, well done. This article has ended up much longer than I originally expected - though far shorter than it could be if I really went into detail on some of these points! It is important to note that I am not a parsing expert: I only ever wanted to be a user of parsing, not - as I currently am - someone who knows bits and pieces about its inner workings. What's happened is that, in wanting to make greater use of parsing, I have gradually become aware of the limitations of what I have been able to find. The emphasis is on gradually: knowledge about parsing is scattered over several decades (from the 60s right up to the present day); many publications (some of them hard to get hold of); and many peoples heads (some of whom no longer work in computing, let alone in the area of parsing). It is therefore hard to get an understanding of the range of approaches or their limitations. This article is my attempt to write down my current understanding and, in particular, the limitations of current approaches when composing grammars; I welcome corrections from those more knowledgeable than myself. Predicting the future is a mugs game, but I am starting to wonder whether, if we fail to come up with more suitable parsing algorithms, programming languages of the future that wish to allow syntax extension will bypass parsing altogether, and use syntax directed editing instead. Many people think parsing is a solved problem - I think it isn't. ]]>
In Praise of the Imperfect [August 25 2010] http://tratt.net/laurie/tech_articles/articles/in_praise_of_the_imperfect programmers like X because it is like programming in the sense Y. Often the only thing relating X - be it painting, poetry, guns, or red squirrels - to programming is the person writing about it. Good luck to them I say - but generalising from a case of one is frequently unenlightening. With that warning to myself - and the reader - in my ears, off I go.

I recently read an article on music that, boiled down to its essentials, talked about the lack of a meaningful relationship between technical virtuosity (roughly ‘how well can you play’) and musical quality (roughly ‘how enjoyable is this piece of music’). Sometimes 3 chords and a 3 note melody played slowly and imperfectly will be exactly what is required; sometimes precise fast scales and subtle melodic variations will do the trick. When a song that should be played slowly, sloppily, and minimally is played fast, precisely, and with extraneous notes - or vice versa - the balance between technical ability and musical quality is lost. Remembering this balance is a problem many musicians have. As their technical abilities grow over time, it is common to equate technical difficulty with musical quality. Despite my extremely limited musical abilities (it is difficult to dispute one persons assessment of my playing as ham-fisted), I have on occasion suffered this problem as I have got slightly more proficient: fortunately, since many of my favourite songs are - not to put too fine a point on it - brainless three and a half minute rockers, I generally get brought back down to earth at some point.

Reading this made me realise that a similar balance applies to programming. Sometimes a heavily engineered solution is right for the job; sometimes something much simpler is just the ticket. Let me give an example of each.

First, an example of something heavily engineered. One of my favourite things that I've done is extsmail which is a traditionally constructed, robust Unix daemon for mail delivery. In extsmail, I cared about every possible detail: the program is carefully decomposed into its constituent parts; every error condition is explicitly handled and dealt with appropriately; the documentation is complete; I audit the code regularly; and I've run it through every static analyser I've been able to get my hands on. In short, its 1,600 lines of code are probably the highest quality I've ever personally written. Given its tiny potential audience, you could well argue that it is grossly over-engineered, but it solves a real problem that I had for years. It performs its task - delivering e-mail to a remote machine via SSH - admirably and hasn't crashed in almost 2 years. Anything less than that wouldn't have been right for what I needed - it has to do the job as near to perfect as is possible, every time. Heavy engineering was the right choice here.

Second, an example of something simple. A colleague once dismissed a program I'd written with an unrepeatable 4 letter word, saying that it was poorly organised and unreadable. He was right in one way. The program in question was about 4000 lines of code, in an area of which both of us had previously been entirely ignorant - the fact that I had to learn a new programming language was one of the lesser challenges in the whole thing - and the result was slapped together in varying stages of incomprehension over roughly 3 or 4 man weeks. I pointed out to him two things about the result: first, it showed that what we wanted to do was possible and plausible, when we both had doubts originally; second, that he'd written precisely zero lines of code in the same period. That code, incidentally, is now part of one of the more widely used pieces of software I've been involved with. No user will ever know that it's the product of someone learning on the job, and that, internally, things could have been done much better. For the job it does it works reliably enough and - more importantly - it exists. If I'd taken the high-road, it still wouldn't exist now, and nobody would have been able to benefit from it. Quick and dirty was the right choice.

As my tactless colleague shows, there is a bias in programming against imperfect solutions. I can understand where this bias comes from: the worst of all worlds is an inappropriate imperfect solution. Trying to ensure that this doesn't happen has some odd side-effects. One is a tendency to assume that one language, one tool, one technique should be appropriate for all tasks. Different communities fixate on different things. For the last 10 years, industrial management has often thought Java was the answer; web-sites PHP; and academics strongly typed pure functional languages. In my opinion, they are all wrong today - and, I suspect, they will all be wrong in the future, even as their favourite things change. There is no one size fits all. Different tasks demand different approaches. For example:

  • If I need to do a sysadmin job across my servers, I'll use the Unix shell or Python. If it goes wrong, it's not a huge problem. What is important is having something now. Having it in a form that can be easily altered to suit changing needs is also useful.
  • If I'm writing a Unix daemon I'll use C. I prefer it not to go wrong, but I can tolerate very occasional failures. If it takes a while to create, I can live with that. Having it in a form that can be tweaked a bit is useful, though I don't expect it to change a great deal in the future.
  • If you're writing software for a plane or nuclear reactor, I hope you use the strongest statically typed language you can, every static analyzer available, and as many formal techniques as you can shake a stick at. Failure is not an option, since the result will be loss of human life. Nor is cost - if it takes hundreds of man years of effort, it's probably worth it. Since such software must be precisely specified up-front, there is relatively little need for it to amenable to change.
As this suggests, when it comes to the needed quality of software, there are several shades of grey. There are many people out there who promulgate the need for highly-engineered solutions. I'd like to wave a little banner for imperfect software. In many cases, having something, even if it's not quite perfect, is better than having nothing. In academia, in particular, we often fall into this trap - we know that, given sufficient time and resources, we can produce high quality (albeit not quite perfect) software. The problem is that the time and resources needed are nearly always prohibitive and, often, complete overkill.

No one in their right mind would say that the highly practiced virtuosity of a classical orchestra could replace the raw emotion of a simple folk singer - both have their own merits and their place in the musical landscape. So too should we give imperfect, pragmatic software the respect that it is due. ]]> A Modest Attempt to Help Prevent Unnecessary Static / Dynamic Typing Debates [Updated April 8 2010] http://tratt.net/laurie/tech_articles/articles/a_modest_attempt_to_help_prevent_unnecessary_static_dynamic_typing_debates PC (meaning DOS / Windows) vs. Acorn (a British computer manufacturer, once fairly successful, but now unknown by anyone under 21), groups of teenage boys would spend countless hours arguing that the other side were, at best, misguided and, at worst, in league with the devil. Boys like to argue, and these were fairly harmless topics, but by the time I'd started university, I was long past the point of wanting to engage in heated debate about such topics.

Several times recently, memories of wasted hours have flooded back to me as I have been an unwilling participant in one of the longest-standing debates in programming languages - statically vs. dynamically typed languages. Frankly, such debates haven't become any less painful since the passing of my teenage years. The problem in such debates remains the same: both sides over-inflate the advantages of their favoured approach while wilfully maintaining their ignorance of the other. Add to this the common tendency of any side in a debate to assume the worst of their opposition, and you have a situation that resembles Northern Europe in WWI: grim, unrelenting trench warfare, with neither side making any real advances.

To make matters worse, the debate in this particular case is asymmetric. For decades now, the computing intelligentsia have been overwhelmingly in favour of static typing - in academia, this bias is almost total. While the intellectual basis of static typing is both well established and frequently promulgated, the dynamic typing community has been far less intellectually confident. Please note, I'm not saying that dynamic typing has a less firm intellectual basis - simply that it is infrequently articulated. Let me give two simple, but different, examples:

  • The first is terminological. Dynamically typed languages are frequently dismissed as scripting languages, which is at best an odd classification (languages such as Python have all the modularisation-type features of equivalent statically typed languages) and, at worst, deliberately derogatory. But labels are just labels: there's no point getting too worried about them. What's more frustrating is a lack of understanding of the difference between statically and dynamically typed languages. Many (though, in fairness, not all) static typing advocates confuse dynamic typing with no typing. Dynamic typing doesn't mean that programs have no types - rather the typing discipline is enforced at run-time rather than at compile-time. Compare this with strong and weak typing. In a dynamically, strongly typed language like Python, types have no compile-time effect, but they can not be overruled at run-time (adding an integer to a string causes a run-time type exception); conversely, a statically, weakly typed language such as C enforces types at compile-time, but allows them to be overridden at run-time (to sometimes hilarious, though generally annoying, effect).
  • The second is practical. Refactoring is an umbrella term (in my world, at least) for the activity of reworking a program to improve its internal quality. Any program that's been around a while will have had unanticipated changes made to it, and will require some refactoring. Small exploratory tweaks are the hallmark of refactoring in dynamically typed languages; they often temporarily break the system as a whole (so program execution tends to terminate prematurely), but give the programmer confidence that the particular part of the program they are concentrating on is not adversely impacted by the larger change they intend making. Unfortunately, static typing inhibits this type of program refactoring, since even the smallest of tweaks must respect the static typing system. If the whole thing doesn't compile, it can't be run; and if a small part of the change can't be tested on a running system, the programmer will often not have the confidence to spend days (or longer) changing the whole system, only to find that the original idea was a bad one.

To my mind, both statically and dynamically typed languages have a place. If you're building software for a nuclear reactor, I want you to use every tool at your disposal to reduce errors, even if that increases the cost of producing the software by a factor of ten or more - and statically typed languages will undoubtedly catch some errors that might otherwise go undiscovered, so they're the right tool for the job. If you're building a website, I want you to make it as easy as possible for yourself to modify it to reflect rapidly changing requirements - which probably means using a dynamically typed language.

One day I hope that someone will put together a comprehensive and unbiased comparison of the two paradigms since, to the best of my knowledge, nothing currently exists which does the job. Until then, I immodestly offer a pre-print of a modest book chapter I wrote last year (and, if you're really interested, its corresponding BibTeX entry) where I tried to give an introduction to dynamically typed languages. Inevitably this involves a comparison to statically typed languages, defining terminology, and trying to enumerate the relative strengths and weaknesses of each approach. As mentioned in the chapter, I personally dislike the terms statically typed and dynamically typed because they mislead so many people; unfortunately, changing them to something less misleading is a battle I could never win, so I start from the assumption that we're stuck with those terms. In retrospect, the chapter is far from perfect: while I tried to suppress my own biases, I didn't fully succeed; there are many parts which I'd like to expand; and even more parts which I'd like to change. Yet, despite this, I can't help feel that some of the points it makes (even though none of them are particularly original) might have helped avoid some of the more tedious aspects of the static vs. dynamically typed debates that I was forcibly involved in. Until something better comes along, it might fill a useful niche.

Updated (April 8 2010): Fred Blasdel points me at Chris Smith's What To Know Before Debating Type Systems which covers much (though not all) of the same ground as my chapter, although more briefly, and from a statically-typed perspective. You may find it interesting to compare and contrast. ]]> A Proposal for Error Handling [December 14 2009] http://tratt.net/laurie/tech_articles/articles/a_proposal_for_error_handling extsmail, and how surprised I was that highly reliable and fault tolerant programs could be written in C. In large part I attributed this to the lack of exceptions in C. In this article, I expand upon this point, consider some of the practical issues with exceptions based language, and present a candidate language design proposal that might partly mitigate the problem. I don't promise that this is a good design; but it does present some of the issues in a different way than I've previously seen and if it encourages a debate on this issue, that might be use enough.

The two approaches

There are two main approaches to trapping and propagating errors which, to avoid possible ambiguity, I define as follows:
  1. Error checking is where functions return a code denoting success or, otherwise, an error; callers check the error code and either ignore it, perform a specific action based on that code, or propagate it to their caller. Error checking does not require any support from language, compiler, or VM; it depends entirely on programmers following the convention. Languages which use this idiom include C.
  2. Exceptions imply a secondary means of controlling program flow other than function calls and returns. When an exception is raised, this secondary control flow immediately propagates the exception to the functions caller; if a try block exists, the exception can be caught and handled (or rethrown); if no such block exists, the exception continues propagating up the call stack. Exceptions require certain features in the language, and support from both compiler and VM. Languages which use this idiom include Java, Python, and Converge.

Outline of the problem

The fundamental problem as I see it can be concisely expressed: exceptions allow predictable systems to be written with much less code than with error checking; but exceptions make writing highly fault tolerant code difficult. The former point is, I'm fairly sure, uncontentious; with exceptions, one needn't explicitly check and propagate most errors, negating the need for vast wodges of code. For many systems, this is fine - indeed, it is desirable. In general, most of the small and medium sized systems I write have little to no exception checking; if something odd happens (e.g. a file is missing) I prefer such systems to immediately exit rather than limp on to an unpredictable end.

The problem comes when one is trying to write highly fault tolerant code, by which I mean systems which try to keep on running even when undesirable events or situations arise. I documented my experiences when writing the (tiny) e-mail sending program extsmail earlier. Highly fault tolerant systems need to deal explicitly with almost every error that can occur; with judicious thought, it is amazing how many errors can either be partly or fully recovered from. In extsmail, the only fatal error (i.e. the system immediately exits) is when memory is exhausted. I would guess that about 40% of extsmail's code is to do with error checking and handling - such fault tolerant systems are much more expensive to produce than the alternative. extsmail is written in C, a language which much folklore would suggest is fundamentally ill-suited to the task. The fact that extsmail - and indeed, most of the other C software I use - works and, I hope, works reasonably well, continues to grate against my long-held prejudices; yet I honestly don't think I could write a system which is as fault tolerant in any other comparable language - in particular, in any exception-based language of my acquaintance.

Some readers might interpret the above as being a position based either on ignorance, or a hatred of exceptions. While I generally plead guilty to charges of ignorance, I can provide some evidence that in this rare situation it's not the case. The language that I designed - Converge - is an exception-based language. To the charge of hatred I say that, except when writing highly fault-tolerant systems, I believe exceptions are the best way of dealing with errors.

If I don't hate exceptions, what's the problem? Here's the rub. In theory there is no real problem with exceptions; they're clearly a more pleasing solution to the problem than error checking. The problem comes because they seem to inevitably lead to a certain style of programming that is not suitable for highly fault tolerant code. This style permeates every exception based language and library I have seen. Before I go into more detail as to what the problem is, it's best to take a step back and look at the competing solutions.

Error checking

Let's consider error checking in a C-like environment (I say C-like because I wish to avoid the horror that is is errno; let us also assume that this C-like environment allows functions to return multiple values). A possible example is the following:

int err;
if ((err = write(...)) != 0) {
  switch (err) {
    case EINTR:
      ...
    case EAGAIN:
      ...
    case ...:
      ...
  }
}
In other words, a function write is called and, if it returns an error, all the possible errors that the function can return are explicitly dealt with. There are three important things implicit in the above. First, functions uniformly denote success or error (most Unix C functions denote success by returning 0; values other than 0 indicate failure). Second, errors are simple integer values. This means that they can easily be stored and compared against pre-defined error codes. Third, functions carefully document which errors they can return so that the caller need only handle those errors.

In general, it's unusual to explicitly deal with each different error that a function can return. At the other end of extreme, one can choose to simply ignore the error returned by a function:

write(...);
which is equivalent to the exception-based code try { write } catch (Exception) { } in, say, Java - although noticeably terser. In practice, one will also see many instances of the following idiom:
if (write(...) != 0)
  err(1, "Fatal error.\n");
This is a rough approximation of the concept in exception-based systems of exiting the whole program if an exception is not caught at some level. This particular idiom in error-checking systems is a pain in the rear end: it's tedious and verbose to write; and if one has several points that share the error message Fatal error then it may not be possible to easily distinguish which one of them triggered.

As the above hopefully suggests, error checking code suffers several problems including: verbosity; the ease with which minor typos escape notice; and the difficulty of retrospectively changing APIs (extending the errors that a function returns would, in general, necessitate changing - or at least checking - all callers of that function). However there is one point which, while it might initially seem a problem, turns out to have interesting positive consequences. Since error checking relies entirely on convention, every function that follows this idiom must carefully document what errors it can return; without that, the idiom would be unusable. We'll return to this shortly.

Exceptions

Most readers are probably familiar with exception based systems as the majority of modern languages utilise them in one form or another. It is this familiarity which I suspect numbs us to one of the major practical problems with exceptions. Let's recast the initial error checking example into exceptions:
try {
  err = write(...);
}
catch Interrupt_Exception e {
  ...
}
catch Again_Exception e {
  ...
}
catch ... {
  ...
}
As this suggests, it's possible to almost exactly emulate the error checking approach in an exception based language. However, one will almost never see the above idiom in an exception based language (I don't think I've ever seen it). Exceptions are typically organised into a hierarchy so one is much more likely to see:
try {
  err = write(...);
}
catch IO_Exception e {
  ...
}
Oddly enough, this organisation of exceptions into hierarchies, while convenient, is not something I'm keen on for fault tolerant systems. The reason is that, by abstracting away from the specific error that occurred, it tends to give a false sense of security that one is dealing with a given exception in the right way. For example, most systems have a multitude of IO related exceptions, yet it is rare for anyone to deal with anything other than the top-level IO exception class; in a truly fault tolerant system many of those sub-exceptions are likely to be best dealt with differently than others.

Of course, the beauty of exceptions is that they don't need to be explicitly handled. Furthermore, there is much less potential to silently, and accidentally, swallow an error (as can easily happen in error-checking systems; and assuming one doesn't have a brain-dead checked exception system as in Java). A system which uses exceptions is entirely predictable: it will run reliably and tend to fail reliably and quickly. In contrast, a buggy error-checking system will often fall over long after the real error occurred, in a way which makes debugging painful at best.

The problem with exceptions

Ironically it is the ease of exceptions which I believe is their downfall.

  1. The first problem is theoretical in nature. Most languages don't include exceptions in their typing system. This means that there is no way of knowing what exceptions a function will throw. As such, this isn't a problem: I'm not known as a big fan of static typing anyway. The problem then becomes a cultural one: the exception based languages with which I am familiar only lightly document what exceptions a function can throw. Even when they document the exceptions, it is rarely clear exactly what circumstances will lead to the exception being raised; and it is generally equally ambiguous as to exactly what the exception being raised means. Compare that with the terse, but invariably precise, descriptions of the C Unix man pages. Each function religiously documents the error codes it will return and what they mean. Clearly this is a cultural problem at heart - there's no real reason why exception based systems have to have such poor documentation. But, I would argue, that because most programs have no need to precisely know what exceptions a function could throw, there is no cultural pressure to document such things. And, as soon as one function is imprecisely documented, any function which calls it is inevitably going to be imprecise in its documentation (even if it doesn't realise it). Poor documentation of exceptions is thus difficult to avoid.
  2. The second problem is due to the ease with which exceptions can be thrown. Most functions, if carefully analysed, can potentially throw a vast number of exceptions. This is hardly surprising. Every look-up of an element in a list; every division of an integer; not to mention the fun that polymorphism can bring to the table; all of these things can potentially raise an exception. The culture of most exception based languages is not to see this as problem and thus not to be defensive: exceptions safely catch problems, so little to nothing is done to check in advance that a given action will not raise an exception. In contrast, error-checking languages have to be defensive: if the pre-conditions for an action are not checked in advance, the system is likely to go wrong in unpredictable fashion. Therefore error-checking systems force programmers to explicitly consider every error that a function might have to throw; and most of them are either dealt with explicitly or, at least, documented. In general, functions in error-checking languages return far fewer distinct errors than an exception-based function can raise exceptions.
  3. The third problem is related to the second. Philosophically speaking, I believe that there are two types of exceptions:
    1. Programmer cock-ups (e.g. pop'ing an empty list; division by zero).
    2. The inability of an external thing to fulfil its contract (e.g. writing to a network socket which has been closed by the other end).
    Programmer cock-ups are frequent (at least, they are in the programs I write), and mean that one or more assumptions underlying the program in question have been violated. These to me are very bad things: in general, I wish the program to be immediately terminated so that the program doesn't limp on to an even worse death later, and so that I can easily pinpoint the underlying cause. Furthermore, by definition, I as the programmer don't mean to make these cock-ups, so I don't put try ... catch statements in to handle them. In other words, I think that exceptions cover programmer cock-ups very well. In contrast, the inability of an external thing to fulfil its contract is different. We all know that network sockets can close at any time; so whenever reading from a network socket, it is reasonable to expect to check for read errors. In fault tolerant systems, one is likely to try and deal with all such errors explicitly.
  4. The fourth problem relates to the try { ... } catch { ... } construct. Any code in the try block which raises an exception will be dealt with appropriately. The difficulty here is the frequency with which too much code is put into this try block. The classic case I see looks as follows:
    try {
      f(g(...));
    }
    catch (Exception) {
      ...
    }
    
    The intention here is that if f throws an exception, the system still soldiers on. The problem is that the ease with which code can be put in the try block means that f's parameters' evaluations are also included; and they are rarely meant to be. In other words, any exceptions which g raises will be silently - and in this case unintentionally - swallowed. In an error checking approach, the call to g would have to be explicitly checked for errors, largely preventing this incorrect idiom.

For average programs, none of these points is a huge problem; in fact, they arguably make a programmers life easier. But for fault-tolerant systems all are genuine issues: it is impossible to write a fault tolerant system if you are unsure what errors you need to deal with; if the number of errors you are required to deal with becomes too great, the sheer magnitude of task will overwhelm you; if you're unclear what errors are the result of your mistakes and which aren't, debugging becomes a philosophical minefield; and if you tend to be too crude in the granularity with which you check errors, mistakes will happen.

A proposal

At a high-level, what I believe might be an improvement on the current situation is a feature which, to some extent, merges the better parts of error-checking and exceptions. Let us therefore consider what the best parts of each approach are. Error checking forces programmers to be considerate both of the errors that they have to deal with, and the quantity of errors they return to the caller. Exceptions make code far smaller and are particularly good at dealing with programmer or cock-ups and very unusual situations (e.g. out of memory errors) which virtually no-one wants to deal with explicitly.

What both error-checking and exceptions provide is a way of returning two things from a function: an error code / exception, and the results of the function. The beginning part of my proposal is for an operator which pulls these two things apart in one go. For arguments sake, let's use back-slash \ as this operator. On the left hand side of \ are the functions normal return values; on the right-hand side its error code. Error codes are integers (reflecting one of my prejudices; but the type of the error code is not hugely important), with 0 indicating success. We can then thus write code such as:

bytes_written \ err = write(...);
if (err == EINTR)
  ...
On the other side of the coin, we need a way for functions to return errors (or not). Mirroring the above syntax, return x returns x as per normal and sets the returned error code as 0 (or whatever the no error value is). return x \ e returns x and sets the returned error code to e (note that it would be valid, though syntactically redundant to explicitly return the no error value).

Although it's not directly related, an obvious problem with error checking is the difficulty with extending an API; any errors codes not explicitly checked for will be silently swallowed. To some extent this is a deliberate design decision: it should be culturally difficult (not impossible; but definitely difficult) to extend the range of errors that a function returns (exceptions provide no encouragement to follow this rule whatsoever). However, I also assume that any language with this feature in will have an equivalent of Converge's ndif statement which raises an exception if none of its branches match.

So what has all this bought us? Well, at first glance, we've just got an error-checking system which formalises the way in which errors are returned. Indeed, this is a large part of the story. We now have a simple, uniform way of performing the error-checking approach. But, if this was all we'd got, we wouldn't have advanced much beyond C. By formalising how errors are returned, we can also define what happens if the error code is not explicitly checked for. In other words, what does the following code do?

bytes_written = write(...);
Answer: if write raises an error and it is not dealt with, the error turns from an integer error code into a standard exception. In general, I would not expect such exceptions to be caught; they would terminate the program, leaving behind a simple to debug line-by-line backtrace.

Finally, I expect exceptions to be maintained almost exactly as they exist in current languages, including the throws / raise construct (which can also be called with an error code, so that if a calling function doesn't know what to do with a particular error code, it can be turned into an exception by that caller). The only difference I would make is that I would forbid exception hierarchies. This is because in this proposal, it's not really expected that exceptions will ever be caught (with one caveat): they're really just a debugging aid and, as such, exception hierarchies only muddy the waters. Because there are fault-tolerant systems such as network servers - where staying alive is the number one priority - I would also maintain try ... catch, though I would expect it to be little used other than to surround a top-level call, with the catch block restarting the server (or a similar recovery mechanism) in the event of an exception.

Intentions

The intention of the proposal is twofold:

  1. Programmer cock-ups are handled as normal exceptions, don't require any extra effort on the part of the user, and lead to immediate and easy-to-debug program failure.
  2. It allows programs which want to do explicit error-checking to do so, and to do so in a framework in which error-checking is culturally practical; in other words, it doesn't suffer from the practical, mostly cultural, problems noted with exceptions.
What's worth noting is that use of the error-checking facility is entirely optional: if one doesn't use the \ construct, what's left is a standard (if slightly simplified) exception-based system. In other words, the error-checking approach is a bonus which doesn't interfere with normal exception-based practices; it degrades gracefully into using standard exceptions.

Conclusion

In one way, what this article has done is to note problems with the practice of exceptions, and then suggest a partial return to the stone-age of explicit error checking. In that sense, one counter-argument is that I am aiming to throw the baby out with the bath water; and there is merit in that argument. There are also a couple of obvious questions which it raises. First, has this scheme been proposed elsewhere? Not that I know of, but there are a vast amount of relatively obscure languages lurking around, so it's not impossible. Second, would this work in a practical language? I don't know - this is a classic paper design which may well not survive its first pass through a compiler. One day I hope to find out.

My thanks to Martin Berger for commenting on a draft of this article. Any errors and infelicities are my own, as are all the bad ideas. ]]> The Missing Level of Abstraction? [September 15 2009] http://tratt.net/laurie/tech_articles/articles/the_missing_level_of_abstraction Levels of abstractions I feel fairly confident in stating that everyone who is familiar with the details of computing will have encountered the phrases high level of abstraction and low level of abstraction - probably rather often. Abstraction is one of those words which is used rather more frequently than it is considered. In short, an X that is an abstraction of Y implies three things in our context:

  1. That X is at a higher level of abstraction than Y (alternatively one could say that X is more abstract than Y).
  2. That X does not introduce anything fundamentally new over Y; indeed, it may well remove fundamental things, or present them in an easier fashion.
  3. Assuming that one does not need any of the fundamental things in Y that may have been lost in the abstraction X, then X is in some sense easier to use than Y.
Abstractions abound in computing, as they do in life in general. To take a simple example, the first computers were programmed in machine code (zeros and ones); the second generation, in assembly code which easily translated into zeros and ones, but provided an easier syntax for humans; the third generation and beyond, in languages such as C whose translation into machine code became increasingly complex. Though they are often nebulous and, indeed, often rather hard to spot and understand, abstractions are what make modern computing possible; life without them would be like trying to walk from Lands End to John o' Groats with ones eyes pointed downwards and half an inch above the ground.

Abstractions are typically relative things. From the statements X is at a higher level of abstraction than Y and Y is at a higher level of abstraction than Z we can deduce that X is at a higher level of abstraction than Z, but we can not say that Z is the lowest level of abstraction of all - it may well be at a higher level of abstraction than some other thing of which we are currently unaware.

Regrettably, my professional life has taught me that the notion of relative levels of abstraction is not universally shared, probably because it is harder to understand than the notion of absolute levels of abstraction. This fallacy is commonplace; I first encountered it in the MDA world where the terms PIM (Platform Independent Model) and PSM (Platform Specific Model) are widely, and incorrectly, abused to imply absolute levels of abstraction. Similarly one often hears talk of high-level versus low-level languages as if they were absolutes when they clearly are not. Provided one bears in mind that these notions are always relative and that omitting the relative qualification is a simple brevity aid, then many things make a good deal more sense.

Objective-C

I have recently been doing some programming in Objective-C for reasons that most people can probably easily guess. Learning a new language is rarely a wasted opportunity, and this has certainly been an interesting experience. As I have noted before, I have a definite soft spot for C as a language. If it wasn't for its brain-dead approach to arrays (which don't know their size, and therefore can't automatically resize themselves, bloating code and causing untold bugs) and, to a lesser extent, the baroque system of standard types that has accreted during decades of porting to different platforms (and consequent misunderstandings), I would not have a bad word to say about it. This praise is of course conditional on the fact that C has a particular niche: it is commonly called a low-level language, because it is little more than a slim layer above assembly code.

Objective-C, alas, I find a more difficult language to like. In small part this is because it generally implies (and did in my case) working under OS X, an operating system beloved of those who do not truly love computers - its idiot-proof lack of customisation and innumerable small bugs came close to breaking me. Objective-C grafts higher-level Smalltalk-like features onto a lower-level C base. The resulting language is not only more verbose than either of its parents - I particularly enjoyed needing to type class attributes into three different places - but distorts many of their defining features. For example, Smalltalk's collection hierarchy (i.e. lists, sets etc.) is a thing of genuine beauty and utility, allowing a syntactically minimalist language to succinctly express complex constraints; since Objective-C doesn't allow blocks (small anonymous functions), not to mention that its collection hierarchy is not as well designed, this is impossible. Another aspect is that C is a statically (if weakly) typed language, catching errors at compile-time that would otherwise cause hard to debug segfaults; however Objective-C's message sending is, like Smalltalk, dynamically typed. In Smalltalk this is not an issue - type errors are simply triggered, lead to predictable backtraces, and are easily fixed. While some dynamic typing errors in Objective-C lead to similar backtraces, others fall foul of C's free-wheeling approach to memory management and cause horrible run-time results, leading to little more than a splat sound; debugging those is a chore.

Memory management

One of C's many lower-level delights is its approach to memory management: put simply, the user is responsible for all memory allocation and deallocation. The virtue of C's approach is that it's simple to understand (for users) and implement (for compiler and library writers). The disadvantage is that it's easy to use incorrectly; in particular, it's very easy to forget to free memory, causing hard to debug memory leaks. A less common, but more dangerous, error is to try and free an already freed chunk of memory (the so-called double free). It is thus reasonable to say that C's memory management is low-level. In contrast, high-level memory management has, in my mind, been largely synonymous with garbage collection, which automatically frees memory (and implicitly prevents double frees). Garbage collection is not without some negative implications - for example, it is notoriously difficult to work out when garbage collection pauses will strike - but overall is generally a good thing. Indeed, garbage collection is now largely ubiquitous outside of systems and embedded programming; anyone weened on Java (or even much older languages such as Lisp and Smalltalk) will know nothing else.

As I understand things, modern Objective-C under OS X uses garbage collection. On some other platforms, such as the one I was targeting, there is no garbage collection. I initially assumed that in such cases Objective-C would revert to a standard C-style system of malloc and free - to my surprise, it does not. Instead, Objective-C uses an unusual system of sometimes-implicit, sometimes-explicit memory management. For someone used to traditional high-level garbage collection or low-level C-style memory management, it's rather hard to get your head around: there don't seem to be hard and fast rules; some of the documentation is ambiguous about the users responsibilities; and there is a good deal of obfuscating cruft which I assume has gathered over time. However, the basic idea is as follows. If a user explicitly allocates memory (via an alloc message), he is responsible for freeing it. If another function allocates memory, it will (or, more accurately, should) be put into a memory pool; when the memory pool is freed, the objects in it are freed too. New memory pools can be created, and they are kept in a stack so objects are put into the most recent memory pool.

Memory pools are an attempt to allow implicit memory allocation (something which is, for good reason, deeply frowned on in traditional C libraries) with implicit memory deallocation. In practice, they resemble a poor-mans attempt at garbage collection or, perhaps, garbage collection as designed by someone who hadn't fully grasped the idea. For the life of me, I can not fully prevent memory leaks in a medium-sized Objective-C system; ironically, I have found it much easier to debug memory leaks in pure C applications. Different parts of code can retain and release objects, which I assumed was a synonym for reference counting, but in reality doesn't always seem to quite work: what, according to the documentation, are valid combinations of calls can cause crashes or leaks. Was this due to bugs in my code, someone else's, or a flaw in the memory management design? Who knows - at some point, I got the leaks down to the level of a few bytes a minute and gave up.

The middle-level of abstraction

A good question at this point is: what does all this have to do with abstractions? Well, the two examples above are instances of something that I've seen once or twice before, but which is hardly common. In normal computing conversation, C is at a low-level of abstraction, and Smalltalk is at a high-level. As I wrote earlier, levels of abstraction are relative, but that doesn't mean we can't talk about the gaps between two levels. Objective-C, which is a child of these two languages, explicitly pitches itself as being the middle-level of abstraction between C and Smalltalk. Similarly, its memory management is the middle-level of abstraction between malloc/free and garbage collection.

The interesting thing to me is not that Objective-C is the middle-level of abstraction between C and Smalltalk as such - after all, given two points on the abstraction scale, there's a high chance that there are other levels in-between - but that it appears to me to have been deliberately designed to slot into this place. This made me wonder about two things. First, why are there not more systems designed to slot between two existing levels of abstractions? Second, why have I never heard the term middle-level of abstraction before?

A couple of answers immediately suggested themselves to me; no doubt others can think of better ones. Perhaps systems designed to slot between two existing levels are more likely than not (as Objective-C) to be worse than the things either side of it? Alternatively, perhaps under normal circumstances we only need to think of one higher level of abstraction thing and one lower-level thing in order to work satisfactorily, and what comes in the middle is generally irrelevant? Whatever the reason may be, I suspect we're unlikely to ever hear many uses of the term middle-level of abstraction - it may remain forever the missing level of abstraction. ]]> Good Programmers are Good Sysadmins are Good Programmers [March 20 2009] http://tratt.net/laurie/tech_articles/articles/good_programmers_are_good_sysadmins_are_good_programmers Of course, my life thus far has not just involved my surprise at other peoples habits; on occasions (less rare than my ego might have preferred), other people have been surprised at mine. Recently a non-computing friend saw my main computer workspace - a Unix setup with 4 xterms displayed - and asked, jokingly, if I was plotting to take over the world (I blame the media for this particular image). I'm so used to my own setup that I no longer think of it as odd but, when suitably prompted, I can see why other people might think so. In comparison to a Windows or Mac machine, full of little visual goodies, and perhaps with only a web browser or word processor loaded, a number of tiny xterms filled with half-executed commands does look odd.

It's not really surprising that a non-computing person would find my setup odd - after all, I spend a lot of time on computers, so it's to be expected that some things that I have come to find natural scare casual users. What has surprised, and continues to surprise me, is how many computing people I come across find my setup odd - sufficiently odd that it attracts comment. Some people are baffled as to why my systems are as they are, some are curious as to how it works, and some people sneer at the way I do things. There is no good answer to the sneer, nor any great reason to answer that person (although, possibly due to a deep character flaw, I find the sneer rather amusing). However the how and why are interesting questions, which raise interesting points, and are more closely integrated than they may first appear.

Here's the broad setup I use. I have a desktop machine (because it's fast and comfortable), a laptop (which I use only when out and about, because laptops are slow and ergonomically disastrous), a main server (where you're probably reading this from), and a backup server (for the next time the water company cuts through the cable powering the main server; in an attempt to salve my environmental qualms, the backup server is a very low power device that also serves some domestic purposes). Though various people have some sort of access to the servers, I personally administer all 4 machines. This raises two immediate problems: how to keep the administration overhead to a minimum; and how to keep files synchronised between each machine.

The answer to the administration overhead question is, for me, simple: use the same operating system for all machines. That way, the lessons learned on one machine apply trivially to the others (and, when things go belly up, machines can relatively easily stand in for one another). As someone who (through a quirk of geography and history) never passed through the DOS / Windows world, I eventually gravitated towards Unix operating systems and, after a brief flirtation with Linux, I've exclusively used OpenBSD for nearly 10 years. This immediately scares most people off, or confirms their worst suspicions of me - to give you an idea of the popularity of this OS, at the time of writing, I've met precisely 1 OpenBSD user in real life. Why did I chose OpenBSD? Simple: it's simple. OpenBSD does very little by default, and what it does, it does well, consistently, with minimal configuration, good documentation, and is easily administered remotely. I can have a blank box turned into a complete OpenBSD install with everything I want, setup how I need, in a couple of hours (most of which is automatic downloading of stuff, and doesn't require my presence). I keep an open mind about OS replacements, but so far none of them appears an improvement, or even a sideways step. Of course, using what is often dismissively called a server operating system does involve some compromises, although less than you might think - the increasing diversity of real-world OS's (thanks indirectly, I think, to OS X) has meant that running a minority platform involves fewer compromises than it did 5 or 6 years back.

The file synchronization problem is a little more subtle, but arguably more important. I outlined my mechanism for this a while back and while it's changed in detail quite a bit, in spirit it's still the same: I use a version control system (git these days) for my important files and Unison for large files that I can recreate via other mechanisms.

A corollary of using a decent Unix, and synchronizing files automatically, is that virtually everything is configured by simple text files, so to a large extent my configuration also propagates across machines. I have also tried over the years to accept, whenever possible, the default configuration on a machine. The reason for this is simple: the less I feel the need to change, the easier it is to move between machines (and different OS's). Of course, there's a limit to how far I'm prepared to accept someone else's choices, and so I do change a reasonable number of settings; but, compared to most people I know, I change relatively little.

As well as trying to use the default configuration as far as is practicable, I also try to maximise my use of tools supplied with the OS and, failing that, to use the simplest tool that does the task I require. A decent Unix comes with a wide variety of little tools, most of them neglected by most users; it continues to amaze me as to how many tasks can be expressed in terms of these little tools. Using tools that are standard across many different machines and OS's again lowers the barriers to moving between machines. It also generally implies a greater consistency of user experience, since tools from the same providers tend to be more consistent; some providers (particularly commercial) seem to delight in perverse user interface choices, which means that installing and learning new tools can be an uncomfortable experience. I also try, whenever possible, to use command-line tools not because - despite what some of my friends think - I like being obscure (my formative years were spent on RISC OS, where the GUI was King and the default assumption was that the command-line was for the mentally unsound) but because it's easier to control command-line tools and maintain consistency across platforms.

Most of my time on a computer is spent either doing e-mail (I use mutt because it's the least annoying mail client I've yet found, despite its obvious limitations), web browsing, programming, or writing. The latter two tasks are the most interesting. If for you, as for me, an average day is a wild trip of programming in several languages, and working on several papers of dubious literary merit, then you'll know how much time one can spend editing text; I often have 20 or 30 files open for editing. The sad truth of the matter is that, as far as I can tell, the modern computing world does not contain a single decent text editor (whereas RISC OS, which I mentioned earlier, had at least 2 excellent text editors). Most text editors are either arcane (e.g. vi and emacs) or bloated (e.g. Eclipse). Since I am not clever enough for the former, and far too impatient to wait for the latter to load (I had a massive shock 4 or 5 years back when, on a powerful machine, I found to my horror that if I typed at full speed in a well known IDE, there was a noticeable lag in text appearing on screen), I use a half-way option, NEdit. NEdit has many limitations and flaws, but it's simple, loads almost immediately, and its syntax highlighting is just about powerful enough to satisfy me.

Let's return to the 4 xterms I mentioned at the beginning of the article, which scared my non-computing friend - it's both worse and better than it seems. I setup KDE so that it has 12 virtual desktops (one of the main reasons I used KDE in the early days was because it binds sensible keys to virtual desktop selection by default), of which I typically use 7 to 8 at any given point. The first is my main work area; one of the xterms has mutt permanently loaded (I occasionally load other mutts in different xterms to simultaneously read multiple folders; an advantage of using a simple tool), one is mostly used for downloading e-mail, and the other two are for random commands and ssh sessions. Desktop 2 is for web browsing and desktop 3 for web page editing. Desktop 4 is my calendar. Desktops 5 and 6 are for programming. Desktops 7 and 8 are for paper writing, with 9 and 10 being used for secondary paper writing. The remaining desktops are spares. This may seem complex or unduly pernickety. The answer to both points is essentially the same: I evolved this setup organically over time so it seems natural, to me at least. Because of virtual desktops, I need only a single 19" monitor, although these days it's a struggle to find a sensibly sized monitor. Unfortunately, computer people are generally gadget people, and gadget people are easily fooled by bigger, faster, better, so monitors these days have largely useless resolutions. Wide-screen monitors, for example, are (I assume) good for watching films, but they're fairly useless for text editing, where screen height is more important than width. Furthermore, the pointless fixation on resolution means that many fixed size things (some fonts, icons etc.) appear tiny, so I increasingly see people having to put their nose virtually to their screen to read things. 1280x1024 works for me and, until someone doubles the resolution without increasing the screen size (since then one can imagine that old apps could be transparently run with two physical pixels for every logical pixel they perceive, while new apps will have access to the genuine high resolution, making everything a bit smoother and sharper), I will try to resist the siren call of the resolution junkies.

In the above I've tried to give a brief outline of how I use computers - a modern reader may well detect a certain Luddite tendency in some of the choices, but hopefully I've also provided some small justification for each of the detailed choices. Of course, none of the above is really a high-level why. Why go to all this effort? Why these particular choices? Although I didn't explicitly think of it this way when I first started down the path that led me to my current mode of operation, there is a solid reason behind it. When I look at the really good programmers I've come across (directly and indirectly), then, with only one exception I can really think of, they seem to share one thing in common: they're also good sysadmins. Their machine(s) are in good order, simple, with the right hardware for the job, and ready for the task at hand; and they can whip a new machine into shape quickly. When the muse strikes, there is little to get in the way of good work: they know their machines inside out, the tools inside out, all their files are easily available, everything used frequently loads quickly, and they can flick rapidly between the sub-tasks that constitute a larger job. Similarly, the best sysadmins I've seen are also good programmers. I don't think it's possible to understand a modern OS without being a decent programmer - more importantly, it's certainly not possible to tame and control an OS in the desired way without programming being involved. There's a symbiosis between these two activities that seems to me undeniable; being really good in one requires being at least fairly good in the other.

So, in conclusion, my computer setup is an attempt to emulate, in my own small way, the best habits I've been able to pick up from those more able than myself. It's a continual work in progress, but it does the trick for me. ]]> How can C Programs be so Reliable? [November 11 2008] http://tratt.net/laurie/tech_articles/articles/how_can_c_programs_be_so_reliable Discounting a couple of tiny C modules that I created largely by blindly cutting and pasting from other places, the first C program I wrote was the Converge VM. Two things from this experience surprised me. First, writing C programs turned out not to be that difficult. With hindsight, I should have realised that a youth misspent writing programs in assembler gave me nearly all the mental tools I needed - after all, C is little more than a high-level assembly language. Once one has understood a concept such as pointers (arguably the trickiest concept in low-level languages, having no simple real-world analogy) in one language, one has understood it in every language. Second, the Converge VM hasn't been riddled with bugs as I expected.

In fact, ignoring logic errors that would have happened in any language, only two C-specific errors have thus far caused any real problem in the Converge VM (please note, I'm sure there are lots of bugs lurking - but I'm happy not to have hit too many of them yet). One was a list which wasn't correctly NULL terminated (a classic C error); that took a while to track down. The other was much more subtle, and took several days, spread over a couple of months, to solve. The Converge garbage collector can conservatively garbage collect arbitrary malloc'd chunks of memory, looking for pointers. In all modern architectures, pointers have to live on word-aligned boundaries. However, malloc'd chunks of memory are often not word-aligned in length. Thus sometimes the garbage collector would try and read the 4 bytes of memory starting at position 4 in a chunk - even if that chunk that was only 5 bytes long. In other words, the garbage collector tried to read in 1 byte of proper data and 3 bytes of possibly random stuff in an area of memory it didn't theoretically have access to. The rare, and subtle, errors this led to were almost impossible to reason about. But let's be honest - in how many languages can one retrospectively add a garbage collector?

My experience with the Converge VM didn't really fit my previous prejudices. I had implicitly bought into the idea that C programs segfault at random, eat data, and generally act like Vikings on a day trip to Lindisfarne; in contrast, programs written in higher level languages supposedly fail in nice, predictable patterns. Gradually it occurred to me that virtually all of the software that I use on a daily basis - that to which I entrust my most important data - is written in C. And I can't remember the last time there was a major problem with any of this software - it's reliable in the sense that it doesn't crash, and also reliable in the sense that it handles minor failures gracefully. Granted, I am extremely fussy about the software I use (I've been an OpenBSD user for 9 years or so, and software doesn't get much better than that), and there are some obvious reasons as to why it might be so reliable: it's used by (relatively) large numbers of people, who help shake out bugs; the software has been developed over a long period of time, so previous generations bore the brunt of the bugs; and, if we're being brutally honest, only fairly competent programmers tend to use C in the first place. But still, the fundamental question remained: why is so much of the software I use in C so reliable?

After a dark period of paper writing, I've recently been doing a little bit of C programming. As someone who, at some points, spends far too much time away from home, reliably sending e-mail has always been an issue. For several years I have sent e-mail by piping messages to a sendmail process on a remote machine via ssh. While this solves several problems (e.g. blacklisting), it has the problem that on many networks (particularly wireless networks) a surprising number of network connections get dropped. Checking that each e-mail has been sent is a frustrating process. So, having mulled on its design for a little while, I decided to create a simple utility to robustly send e-mail via ssh. The resulting program - extsmail - has more features than I'd originally expected, but the basic idea is simply to retry sending messages via an external command such as ssh, until the message has been successfully sent. I also wanted the utility to be as frugal with resources as practical, and to be as portable as possible. This inevitably led to extsmail being written in C. I then decided, as an experiment, to try and write this, as far as possible, in the traditional UNIX way: only to rely on features found in all sensible UNIX clones and to be robust against failure. In so doing, I made two observations, new to me, about writing software in C.

The first observation is semi-obvious. Because software written in C can fail in so many ways, I was much more careful than normal when writing it. In particular, anything involved in manipulating chunks of memory raises the prospect of off-by-one type errors - which are particularly dangerous in C. Whereas in a higher-level language I might be lazy and think hmm, do I need to subtract 1 from this value when I index into the array? Let's run it and find out, in C I thought OK, let's sit down and reason about this. Ironically, the time taken to run-and-discover often seems not to be much different to sit-down-and-think - except the latter is a lot more mentally draining.

The second observation is something I had not previously considered. In C there is no exception handling. If, as in the case of extsmail, one wants to be robust against errors, one has to handle all possible error paths oneself. This is extremely painful in one way - a huge proportion (I would guess at least 40%) of extsmail is dedicated to detecting and recovering from errors - although made easier by the fact that UNIX functions always carefully detail how and when they will fail. In other words, when one calls a function like stat in C, the documentation lists all the failure conditions; the user can then easily choose which errors conditions he wishes his program to recover from, and which are fatal to further execution (in extsmail, out of memory errors are about the only fatal errors). This is a huge difference in mind-set from exception based languages, where the typical philosophy is to write code as normal, only rarely inserting try ... catch blocks to recover from specific errors (which are only sporadically documented). Java, with its checked exceptions, takes a different approach telling the user you must try and catch these specific exceptions when you call this function.

What I realised is that neither exception-based approach is appropriate when one wishes to make software as robust as possible. What one needs is to know exactly which errors / exceptions a function can return / raise, and then deal with each on a case-by-case basis. While it is possible that modern IDEs could (indeed, they may well do, for all I know) automatically show you some of the exceptions that a given function can raise, this can only go so far. Theoretically speaking, sub-classing and polymorphism in OO languages means that pre-compiled libraries can not be sure what exceptions a given function call may raise (since subclasses may overload functions, which can then raise different exceptions). From a practical point of view, I suspect that many functions would claim to raise so many different exceptions that the user would be overwhelmed: in contrast, the UNIX functions are very aware that they need to minimise the amount of errors that they return to the user, either by recovering from internal failure, or by grouping errors. I further suspect that many libraries that rely on exception handling would need to be substantially rewritten to reduce the number of exceptions they raise to a reasonable number. Furthermore, it is the caller of a function who needs to determine which errors are minor and can be recovered from, and which cause more fundamental problems, possibly resulting in the program exiting; checked exceptions, by forcing the caller to deal with certain exceptions, miss the point here.

Henry Spencer said, Those who don't understand UNIX are doomed to reinvent it, poorly. And that's probably why so many of the programs written in C are more reliable than our prejudices might suggest - the UNIX culture, the oldest and wisest in mainstream computing, has found ways of turning some of C's limitations and flaws into advantages. As my experience shows, I am yet another person to slowly realise this. All that said, I don't recommend using C unless much thought has been given to the decision - the resulting software might be reliable, but it will have taken a significant human effort to produce it. ]]> Free Text Geocoding [September 1 2008] http://tratt.net/laurie/tech_articles/articles/free_text_geocoding standard maps with a well-understood relationship to the real world; sites such as Strange Maps and Mark Easton's blog show how maps can present data in many other forms. My headmaster at school thought that geography as a subject had lost its way; instead of focusing on rock strata, it should focus on teaching children where places where. At the time, we all thought he was a touch eccentric; in retrospect, he was absolutely right. Huge tracts of politics, economics, and history only make sense in the context of a given place(s). I could go on, but I hope my point is clear: maps are important, and not just for finding our way around.

A big problem with traditional maps is finding things: places, areas, and so on. Even for a medium sized paper map, the size of the index is huge; and the index for a good quality London A-Z is often almost as big as the map itself. Paper map indexes have their own funny little language, trying to squeeze as much data in as possible. However paper indexes have two problems. First, how many times have you forgotten what square you were supposed to look at when you get to the proscribed page? Second, indexes can only grow to a certain size before they are unusable.

Computer geocoding

The advent of computer mapping has been a real boon, and not to just map fiends such as I. The first site I used regularly was Street Map. Having the map data for the whole UK was incredibly useful and the search facilities, while crude, were a huge improvement on a paper index. When later sites such as Google Maps became available, I was stunned. Suddenly freely viewable map data (though note I do not use the phrase free map data) for much of the world was coupled with a new way of searching. No paper index, or Streetmap's cloying need to be told what type of search was being performed (e.g. place name, street name, post code etc.). Instead is what I call (for want of a better name) free text geocoding: that is, where one types in something in the format used in real-life, and the search engine finds the right place automatically. One doesn't need to tell the search engine that the search is for a postcode or a place-name, or even which country one is searching - it magically does the right thing. Well, of course - not always the right thing. For example, at the time of writing, if I do a search for Penrith in Google Maps UK, I get taken to a suburb of Sydney in Australia. The poor residents of Penrith in the North of England don't even have their town mentioned as a possible match for the search; one must instead search for something like Penrith Cumbria or Penrith UK. There are a range of related minor infelicities in both Google Maps and Yahoo Maps. However on average, both do an adequate job.

There are several reasons why sites such as Google Maps are not as widely used as they might be. Regrettably the chief reason is legal: there are restrictions on the way that map data can be used. Fortunately an initiative - OpenStreetMap - to create much freer map data was started a few years back (and started by Brits - we are, it appears, a nation of map lovers) and is now at the stage where, despite many huge holes in its data, it is semi-usable: in central London, for example, it already has arguably the highest quality maps. Some of the uses of OpenStreetMap are already quite astonishing (the UK postcode layer is a simple, but effective, example of what can be done - click the little + icon in the top-right of the map for more fun), and its accelerating progress is impressive.

One part of OpenStreetMap that frustrates me a little is its search (the Name Finder). Although it does a reasonable job in many respects, the results it returns are hard to interpret (type in Penrith and then try and work out which result is the UK town - I got this wrong on my first go and I'm English!), and it's not easy to use it outside of a website (because it's written in PHP). Furthermore the version running on OpenStreetMap's front page is painfully slow (possibly because it's running on overloaded servers). [At the time of writing, I can't even work out how to get it to find Penrith in the UK automatically. Queries like Penrith, UK crash it. Penrith - please don't take the cold-shoulder from the map search engines personally!]

Free text geocoding

Since one of the huge benefits of using computer mapping is its search ability, I thought a fun little summer project would be to create my own free text geocoder. I started with only a vague idea of what a free text geocoder should (or could) do. While I don't now claim to have thought of everything, I do now have a much clearer idea of what a good free text geocoder should do. I've split these into must and should haves. A free text geocoder must:

  1. give accurate results (meaning it always, somewhere in the list of matches, gives the place the user is looking for). While this might seem obvious, some of the existing free text geocoders don't do this, as we saw earlier.
  2. require as little formatting from the user as practicable (e.g. London SW1 and SW1 London are both likely searches).
  3. return any unmatched text (thus allowing searches like cafes in Pimlico to return Pimlico as the place matched and cafes in as unmatched text which an application can then use as it pleases).
  4. be fast: results shouldn't take more than a second on average (and preferably should be much quicker).
  5. be localisable. This is both linguistic and cultural. Searching should take place disregarding the users input language, but results should be localised when possible. Results should be formatted relative to the users local cultural expectations (e.g. in the US, state names are always shown; in the UK county names are nearly always shown).
If possible, it should:
  • be possible to use it do more than just find the latitude and longitude of a place.
  • try and weight the results so that the most likely match(es) are given higher priority (e.g. an Englishman searching for Penrith should see the English town as the first match, while an Australian should see the Sydney suburb).
  • be usable in different contexts (e.g. in websites, or in applications).
  • be amenable to (possibly quite low-level) customization.

Fetegeo

To this end I created, and have now released, Fetegeo with a BSD / MIT licence. Using Fetegeo's included client / server interface, queries can be performed on the command-line:

$ fetegeoc geo London
Match #1
  ID: 719913
  Name: London
  Latitude: 51.508415
  Longitude: -0.125533
  Country ID: 233
  Parent ID: 1262
  Population: 7421209
  PP: London, United Kingdom
  Dangling text:
Of course, there are a lot of London's in the world and I haven't copied all of Fetegeo's output. Notice though that, since the preferred country of the user wasn't specified, it's chosen what most people are likely to consider to be the London as the first match. If the user specifies that their country is Canada then London in Ontario is the first match:
$ fetegeoc -c ca geo london
Match #1
  ID: 2984878
  Name: London
  Latitude: 42.983389283
  Longitude: -81.233042387
  Country ID: 39
  Parent ID: 540
  Population: 346765
  PP: London, Ontario
  Dangling text:
Fetegeo can be instructed to allow dangling (i.e. unmatched) text in matches:
$ fetegeoc -d geo Museums in London
Match #1
  ID: 719913
  Name: London
  Latitude: 51.508415
  Longitude: -0.125533
  Country ID: 233
  Parent ID: 1262
  Population: 7421209
  PP: London, United Kingdom
  Dangling text: Museums in

If you're interested, there's a slightly more thorough description of the ways that Fetegeo can be used, and a simple demo which geocodes results and shows them on an OpenStreetMap map.

How Fetegeo works

Internally, Fetegeo's search is fairly simple and its approach is easily described. Strings in Fetegeo are always normalised; in particular punctuation is removed, and strings are lower-cased. String queries to the database are always on hashes of normalised words. Given a normalised string S, Fetegeo breaks it into a list of words. It then works right-to-left in the list, trying to find all possible matches. Whenever a match within S occurs, a counter is decremented meaning that subsequent matching takes place n elements from the end of the list. Matches are greedy; they always try to first match the maximum number of words possible, before gradually trying fewer words. Matches are exhaustive in the sense that all possibilities are tried; however only the longest matches are eventually returned to the user (some obvious optimisations are used here, so that some possibilities aren't tried if it's obvious they won't work). Fetegeo first of all tries to match the entirety of S as being a place within the users country; if that fails, it tries to match a country name at the right-hand side of the string. It then tries to match places and postcodes; if a place is found, it is considered to be a parent area; subsequent matches can only match places within that area or (recursively) a sub-area. Postcodes can occur at any point in the match. Of course, there's a lot more detail than this in the code, but this is the essence of Fetegeo's fast free text geocoding.

How Fetegeo compares

How does Fetegeo score on the must / should chart?

It's too early to say how accurate Fetegeo's results are. First, Fetegeo has only been relatively lightly tested so far: it's inevitable that there will be bugs and oversights. Second, any free text geocoder is subject to something I'm tentatively calling Tratt's First Law of Free Text Geocoding (though I doubt I'm the first to think of it): the upper bound for results quality is determined by your dataset. The best free text geocoder in the world can only give iffy results with an iffy dataset. Fetegeo's initial dataset is based on Geonames data (and postcode data from various other sources). While Geonames should be saluted as the first serious attempt to collate freely-available place data, the structure of the data is less than ideal, and the data itself is of variable quality, suffering from frequent inaccuracies and duplication. Because of this, Fetegeo has been designed to be relatively independent of any particular dataset; I hope one day in the not too distant future that OpenStreetMap's data will be sufficiently broad in scope to replace Geoname's (OpenStreetMap's data is already deeper in the sense that it includes roads, which Geonames doesn't).

Fetegeo is already reasonably fast, given that it's only been semi-optimised. On my 3-ish year old desktop machine, using the stock install of PostgreSQL (a few tweaks would, I suspect, make it perform much better - if only I could work out what those tweaks were amongst the mass of overlapping configuration options!), typical queries are answered in less than 0.1s. Fetegeo makes use of simple caching internally to speed things up. Someone who understands databases better than I could almost certainly make things run much faster.

Fetegeo has the beginnings of being usable for more than just longitude / latitude searches, but there is some way to go yet to prove this is feasible. In particular I would like to see it capable of being used by applications to classify things as being in particular areas. Imagine you have a website listing X's in Britain (where X could be just about anything), where each X is located at a particular latitude / longitude. This allows one to easily search for all X's near place P. However users often want to perform area searches such as all X's in London or all X's in Rutland. Exposing the identifiers of places, counties, states (and so on) makes this latter type of query feasible.

Fetegeo is usable in a number of different ways. As such, Fetegeo is just a Python library which can be included and used in any application. Fetegeo also comes with a standard internet server and (command-line) client, which can receive and answer XML queries (as an aside, the XML parser used is often the slowest part of querying). This means that even a simple web-site can query a single Fetegeo server and make use of its caching facilities and so on.

Conclusion

I have no idea whether anyone will find Fetegeo useful. It seems to me that, even in its current embryonic form, it fills an unoccupied niche, at least in terms of its licence if not its functionality (yet). I hope that other people might find it interesting, and start to extend its functionality to make it more widely applicable. If you want to find out more, and contribute, please waltz on over to fetegeo.org. ]]> Extended Backtraces [June 2 2008] http://tratt.net/laurie/tech_articles/articles/extended_backtraces loop for all such loops. When one of my programs mysteriously failed, I could not work out why. Eventually I realised that one of my label definitions had spelt looop (three O's instead of two) instead of loop, so my loop had branched back to the previous loop in the file. Spotting that took me a couple of days.

Later, I realised that most programming errors fit into two broad categories: the obvious and the subtle. Obvious errors are those whose source can be easily pinpointed (even if fixing the problem takes a while). The subtle are typically those where cause and effect are separated, making identification of the root of the problem difficult (often, when eventually located, such problems are easily fixed). The looop problem above was, to a relative novice programmer, a very subtle problem (years of experience have taught me that looking for daft spellings in my own programs is a good initial target when debugging).

There are, to my mind, two tricks to debugging. The first is to try and turn subtle problems into obvious problems; however subtle problems are typically inherently subtle and unamenable to such treatment. The second is to try and speed up the solving of obvious problems. For me, the main tool for solving obvious problems is the humble backtrace which, when an exception occurs, shows one (in some manner or another) the call stack and, hopefully, what file and line number each entry therein is associated with. Given the following trivial program:

func head(l):
    return [l[0], l[1 : ]]

func main():
    head([1])
    head([])
a standard looking backtrace would be:
Traceback (most recent call at bottom):
  1: File "/tmp/head.cv", line 6
  2: File "/tmp/head.cv", line 2
  3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
Using this we can fairly quickly see that the cause of our error is passing an empty list to a function which assumes that there is at least one element in the list. [As a side note, this example is in one way unrepresentative: in the vast majority of cases, it's typically the bottom one or two lines of the backtrace that pinpoint the real source of the error.]

Backtraces like the above can be found in most modern programming languages like Java. They are immensely useful and form precisely half of my debugging toolkit, the other half being printf - in my view of the world, these two tools obviate the need for debuggers. The power of backtraces is most obviously felt in those languages that don't have them. C programs typically need to be run through a debugger to get a backtrace, meaning that errors in programs running in production can be extremely difficult to diagnose. The first Haskell program I wrote had the head error in it. The resulting message just said Prelude.head: empty list with nothing else to help me - no line numbers or even file names. Needless to say, it took me a long while to work out what this meant, and how it happened (if I remember correctly, I passed an empty list to a library function which, in turn, called head). Unsurprisingly that was also pretty much the last program I wrote in Haskell - languages that turn should-be-obvious errors into subtle errors are of no use to me. [Apparently Haskell now has some in-built, though hardly easy to use, support for backtraces. The implications relating to Haskell's prioritisation of features are, to my mind, highly amusing.]

Python was the first language I saw that took backtraces a little bit further, printing (when possible) the line of source code associated with each part of the backtrace. A Python-esque backtrace looks roughly as follows:

Traceback (most recent call at bottom):
  1: File "/tmp/head.cv", line 6
       head([])
  2: File "/tmp/head.cv", line 2
       return [l[0], l[1 : ]]
  3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
This simple innovation is a real boon: as in this case, one often doen't even need to open a source file in a text editor to see the error made. Python-esque backtraces help make obvious errors quicker to solve than traditional backtraces.

I realised early on in Converge's development that knowing merely the line number of an error was only part of the problem. Often a specific sub-expression within a certain line is the relevant part of the backtrace, and the rest of the line is noise. Converge therefore recorded the column (i.e. offset within a line) where each error is associated with, meaning that backtraces looked like the following:

Traceback (most recent call at bottom):
  1: File "/tmp/head.cv", line 6, column 4
  2: File "/tmp/head.cv", line 2, column 13
  3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
This extra information is very helpful: it means that I can accurately pinpoint which of the two list lookups in line 2 is responsible for calling List.get incorrectly. As a useful advantage, Converge's approach also means that errors that happen within multi-line statements (i.e. logical lines of source split over multiple physical lines in a source file to aid presentation) work properly.

Converge's backtraces stayed like the above for quite some time, until recently when I realised that knowing the start column associated with an error is only part of the story. What one really wants to know is the start and end of the associated expression. A small tweak to the parser, and a huge (but mechanical) change to the compiler, and Converge backtraces could tell one how many characters in the line an error was associated with:

Traceback (most recent call at bottom):
  1: File "/tmp/head.cv", line 6, column 4, length 8
       head([])
  2: File "/tmp/head.cv", line 2, column 12, length 4
       return [l[0], l[1 : ]]
  3: (internal), in List.get
Bounds_Exception: 0 exceeds upper bound 0
This is almost helpful, but in practice I find it surprisingly hard to count n characters within a line on screen, which hinders interpretation of the above data.

A short while later, the answer hit me: what the backtraces need to do is to highlight the relevant sub-expression within the line. Here's a screenshot of the above error running in an xterm with the latest version of Converge:

As you might notice, the tiny little difference here is that the part of each line pertinent to the error is in bold and underlined. Knowing that, one can instantly see that the first of the two list lookups on line 2 is responsible for calling List.get incorrectly. Interestingly, my first attempt at this put the offending fragments only in bold, but since whitespace can sometimes be a significant part of an error, underlining can be a useful aid. In the case where the associated source code is split over multiple lines, the first relevant line of source code is printed with ... added to the end of the line to inform the user that the printed line is not the end of the story.

As I explained in a previous entry, when Converge DSLs are translated into Converge ASTs, individual call stack entries can be associated with more than one source location. This means that backtraces tend to be rather long, which previously made tracking down the cause of an error tedious - loading multiple files into text editors and continually flipping back and forth to xterms is not fun. Extended backtraces become a real life saver in this regard. Here's an example where a DSL incorrectly tries to subtract a string from an integer:

Looking at this backtrace, an experienced programmer will be able to quickly surmise that, given the exception message, the most likely candidate for this error is in the ex4.cv file (which, as the eagle-eyed may notice, is code written in the DSL - Converge's errors work with both the base language and with embedded DSLs). Imagine trying to debug this with a traditional backtrace: there's a lot of information in the backtrace, and there would be no indication as to which part of it is more likely to be responsible for the error.

From a practical point of view, Converge's extended backtraces have no run-time penalty for correct code, and users don't have to do anything to enable them - they're a standard part of the system. Extended backtraces can be found in -current versions of Converge (at the time of writing, Converge's support for Curses under Windows is weak, so underlining doesn't work there - it's a quick, fun little project for someone who's interested).

So, going back to the start of this entry, how do Converge's extended backtraces help with debugging? Well, they might help turn the odd subtle error into an obvious error, but that's an incidental benefit. What I think they do is make solving obvious errors much quicker than previously. In the sort time since I've had extended backtraces, I've noticed that I've often been able to almost instantly fix errors that before might have taken me a couple of minutes. Given the number of programming errors I make, the cumulative time saving is most welcome.

In summary, I think that Converge's extended backtraces are a real boon to programming. To the best of my knowledge, Converge is the first language with such backtraces in - I hope it won't be the last! ]]> Designing Sane Scoping Rules [March 3 2008] http://tratt.net/laurie/tech_articles/articles/designing_sane_scoping_rules scoping rules. In this article I'm going to outline why Converge has the scoping rules it does.

The first thing I need to do is to outline the problem. Although all my examples are framed in terms of a typical imperative programming language, the underlying concepts all translate fairly directly to functional languages too. Here's the simplest example possible:

x := 2
...
y := x
In every language I know this says assign 2 to x and the assign the value of x to y. So after this code is run both x and y will have the value 2.

The first major issue with regards to variables is global vs. local variables. Take the following code which is intended to represent the top-level of a source file:

x := 2

func f():
    x := 3

f()
In a lot of BASIC-type languages there is only one underlying x variable in this program, so after this code is run the outer x will have the value 3. In essence we have a flat variable scope: all variables belong to the same, single, namespace. This not only makes writing programs error-prone (e.g. one function accidentally corrupts another's x), but makes certain styles of programming largely impractical (e.g. recursive functions). It's probably no coincidence that the first mainstream programming language to make a virtue out of recursion - Algol 60 - also was among the first to get its scoping rules in reasonable order.

Retro-fitting sane scoping rules is not easy if the first version of your language used the above scoping rule (note the deliberate use of the singular) - any change of scoping rule(s) has a high chance of breaking programs in nasty ways. Some of the BASIC-type languages I first used solved the backwards compatibility problem by making variables global by default but allowing variables in functions to be declared as local. This allows one to rewrite the above example as follows:

x := 2

func f():
    local x
    x := 3

f()
This code now has two distinct x variables: one at the top-level and one in the f function. Every time f is called - even recursively - it will be given space for a new, fresh x. This feature makes programming a lot easier, even if it defaults to the insane global scoping rule by default. As an aside, surprisingly (to me at least), you can still see the global by default scoping rule in the modern language Lua. This goes to show how fundamental scoping rules are: once they're in a language, users will resist nearly all meaningful change to them.

Most programming languages adopt a slight variant of the above rules which are fairly easy to understand in practice. Essentially variables with the same name as a top-level variable reference that top-level variable directly, while all other variables are local. So in a C-type language the following code contains two variables: a top-level x and a y local to f:

x := 2

func f():
    x := 3
    y := 4

f()
Running the above code means that the top-level x is set to 3, while the y variable is local to f. Variations on this set of scoping rules underly many programming languages in use today.

The next major design challenge for scoping rules is much more subtle and confuses many of us to this day. Knowing that the global keyword in Python declares that assignment to a specified variable doesn't make it local to that scope, consider the following Python code:

x = 2

def f():
    x = 3
    
    def g():
        global x
        print x
        x = 4
    
    print x
    g()

f()
print x
What do you think this will print out? Let's try it out with Python 2.5:
$ python scope.py
3
2
4
$
In other words, we get a result which is a long way from what we might have expected: the print statement in g prints 2 instead of the expected 3 and the final print statement prints 4 instead of the expected 2. What is it doing? What's happening is that most languages don't have nested scopes (as one might expect) but two scopes: a top-level (a.k.a. global or module) scope and a function scope. What this means is that the assignment to x in g references the top-level x, not the x in f; you might want to read that twice to check that you've really understood it.

It might at first seem that Python's scoping rules are simply silly; actually, they're not unreasonable and they're shared by most programming languages (e.g. Java). Why? The problem is that the function g might outlive f. Here's a simple example:

func f():
    x := 2
    func g():
        return x
    return g

f()()
In other words, f returns a reference to g; when g is executed, the value of x known to f will have disappeared as, in most programming languages, variables are stored on the stack. This means that variables only exist for the duration of a function call. Since f's variables will have disappeared when g is executed, all sorts of bad things could happen.

Scheme was the first language that presented a practical solution to this problem of nested scopes in the form of closures. The standard way that closures are defined is guaranteed to confuse and I'm not going to repeat it. They're actually very simple: essentially each function allocates heap memory to store variables on. Thus if an inner function outlives an outer function there is no problem in referencing variables in the outer function even if the stack space has long since disappeared, since a function calls variables can outlive the function call itself.

[As an aside, the fact that closures need to allocate heap memory (although it's often possible to statically analyse such allocations away) has been used as an argument against them in languages such as Java. That's the chief reason that Java has all sorts of complications like inner classes, final variables and so on: Java resisted closures, and then had to resort to hacks to get a poor facsimile of its functionality. It's hard to imagine any decent programming language being built now that doesn't implement closures (Converge certainly does), so closures are gradually losing their exotic tag (which, I suspect, is based on their definition and not their utility, which is far from exotic).]

When I was designing Converge, I put some effort into deciding what its scoping rules were going to be. I wanted to make things safe by default (e.g. no global by default-type nonsense), and to make closures easy to deal with. Converge's scoping rules are (or, at least, were), I believe, the simplest of any imperative programming language. They boil down to this: assigning to a variable makes it local to a function; unless a variable is declared nonlocal, in which case scopes are searched from inner to outer to find the matching variable. That's it. Rewriting the Python example into Converge yields the following:

x := 2

func f():
    x := 3
    
    func g():
        nonlocal x
        Sys::println(x)
        x := 4
    
    Sys::println(x)
    g()

func main():
    f()
    Sys::println(x)
Which when run does the expected:
$ converge s.cv
3
3
2
$
The interesting thing here, in my mind at least, is the nonlocal keyword. Although it's a tad awkward sounding, it was the best combination of brevity and accuracy that I could think of. Unlike Python it would be incorrect to declare a variable as global since there are more than 2 scopes: in fact, there are an arbitrary number. What nonlocal is saying is when you see an x search each successive outer scope in order to find where it was originally defined. It's not a commonly used feature, but when you need it, it is priceless.

I said above that Converge has - or had - the simplest of any imperative programming language (at least as far as I'm aware). Some time after the first publications and release of Converge, the Python team decided to fix their scoping rules for their backwards-compatibility-breaking Python 3000 release. PEP 3104 contains the eventual proposal they came up with. Interestingly it is identical to Converge's scoping rules even down to using the nonlocal keyword. Please note that I'm not suggesting that the Python team copied Converge uncredited (and even if they did, I wouldn't mind - I actively encourage people to take the good ideas from Converge and put them in other languages). However I think it shows that these are good scoping rules and that eventually imperative programming languages will evolve towards something like them.

The open question is this: why has it taken us, as a community, 50 years or more to define two simple scoping rules (assignment is local; nonlocal successively searches outer scopes) and one simple implementation technique (closures), and why have we taken so many wrong turns (in this article I haven't enumerated the silliest, such as dynamic scoping)? I suspect the answer, if it could be definitively uncovered, would give one a very interesting insight to the subject of computing in general. ]]> Some Lessons Learned from Icon [Updated October 26 2010] http://tratt.net/laurie/tech_articles/articles/some_lessons_learned_from_icon Updated (October 26 2007): Before reading this article, please be aware that it has been superseded by the paper Experiences with an Icon-like expression evaluation system.

One of Converge's lesser known influences is Icon. Although it's a relatively obscure language itself, a surprising number of people have heard of Icon's direct predecessor SNOBOL. I first heard of Icon through Tim Peters and Python, where Icon was the inspiration for Python's generator feature (basically procedures that whose execution can be resumed to produce multiple values). However Icon has a much richer, and definitely unique, approach to expression evaluation than any other imperative programming language. When I started working on Converge, I decided to go back to the source, and see how much of Icon's unique approach I could incorporate into Converge.

This article is a personal look back at Icon's influence on Converge, and the successes and failures resulting from that influence. It's quite probable that much of it reflects my slightly superficial understanding of Icon - apart from its innovative approach to expression evaluation, much of the language has an old fashioned feel to it, sometimes appearing like a dynamically typed version of Pascal, and this prevented me from ever wanting to write anything large in it. This isn't a criticism of Icon as such - it was ahead of its time in many ways - but is merely an observation that modern programmers expect something slightly different from their programming languages. Hopefully, even with the above caveat, this article contains something of value to those interested in programming languages.

Why is Icon interesting

In a nutshell, Icon is interesting due to what it calls goal-directed evaluation. This is an evaluation mechanism which gives to an imperative programming language some of the flexibility more often associated with languages like Prolog, such as a limited form of backtracking. It is built around the concepts of success and failure, and generators (see the Converge documentation for more details). Expressions can succeed or fail; if they succeed, they produce a value; if they fail, they do not produce a value and transmit failure to their containing scope. Generators are functions which can produce more than one value.

Icon's syntax is derived from Algol, so the following complete program prints 1 to 9 inclusive:

procedure upto(x)
  i := 0
  while i < x do {
    suspend i
    i := i + 1
  }
end

procedure main()
  every x := upto(10) do write(x)
end
Most of this is probably reasonably intuitive, apart from the every and suspend keywords. every can be considered to be equivalent to the typical understanding of a for statement (indeed, in Converge, the same construct is called for). suspend is equivalent to yield in Python or Converge, returning a value, but allowing the procedure calls execution to be resumed directly after the suspend statement.

Failure and if: a beautiful combination

If I had to pick one thing from Icon that I would not be without in Converge, it would be the concept of failure in if statements. There is something beautiful about saying if this function succeeded, assign the value to this variable and then do xyz. This idiom is used to build a very useful convention that is present throughout Converge's libraries. As in most languages, container items such as dictionaries have a get function which returns the value associated with a certain key; geting a key which doesn't exist generates an error. There is a very common variation on this use case, which is to check whether a dictionary has a certain key, and if it does to do something with its value; otherwise a different code path is taken. In most languages this idiom is expressed roughly as follows:
d := Dict{"a" : 2, "b" : 8}
if d.has_key("a"):
  Sys::println(d["a"])
Not only is the double lookup of a an eyesore, and a maintenance accident waiting to happen, but it can be a significant overhead in tight loops. In Python (and probably other languages), it's common to see the following idiom:
d := {"a" : 2, "b" : 8}
try:
  v = d.has_key("a")
  print v
except KeyError:
  pass
This idiom makes use of the fact that it's nearly always quicker to catch an exception if a key isn't found than to perform two lookups. In my opinion, this idiom is far from pleasant, for reasons that I'm sure most readers will intuitively understand.

In Converge one can express this common idiom thus:

x := Dict{"a" : 2, "b" : 8}
if v := x.find("a"):
  Sys::println(v)
In this example, v only gets assigned a value if find succeeds. There are some other advantages to this approach (e.g. there's no fiddling around checking for null), but the symmetry between get and find, and the consistency of this convention across libraries has proved a real success in Converge. For me, this feature alone justifies the Icon experiment in Converge.

Variables should not spring into existence with a default value

The idiom mentioned in the previous section is, unfortunately, somewhat dangerous in Icon itself, as all variables have a default value of sorts. What this means is that the following Icon code runs without error:
procedure main()
  if 1 < 0 then x := 0
  write(x)
end
To me, this is odd, because x never has a value assigned to it (as the expression in the if statement is clearly false); any errors resulting from such a mistake turn out to be very difficult to debug. In Converge I therefore made it so that reading from a currently unassigned variable raises an error, which protects against programming slips such as the following:
x := Dict{"a" : 2, "b" : 8}
if v := x.find("c"):
  ...
else:
  Sys::println(v)
As the lookup of c fails, v doesn't have a value assigned to it, so reading from v in the else branch of the if statement raises an Unassigned_Var_Exception, pinpointing the offending code.

Procedures shouldn't fail by default

In Icon, the default return value of a procedure is failure. This makes quite a lot of sense for generators such as the following:
procedure upto(x)
  i := 0
  while i < x do {
    suspend i
    i := i + 1
  }
end
As this suggests, the typical action for a generator is (via a loop) to generate several values; once that loop is finished, the generator fails. Thus each Icon procedure effectively has return &fail at its end. Very early versions of Converge inherited this behaviour.

What became quickly apparent is that while this is reasonable behaviour for generators, it can be a disaster for normal procedures. Look at this code and see if you can work out what will happen when it's executed:

procedure f(x)
  if x > 0 then return 1
end

procedure main()
  write(f(-1))
end
If you guessed that nothing would be printed to screen, congratulations. If you didn't, then don't worry, because this frequently surprised me. This happens since the failure of the if's condition means that the procedure executes its default return action, which is to fail. Thus the procedure doesn't return a value, and the call to write is never even evaluated.

Although this might seem trivial, the problem becomes significantly worse when it's embedded in a large body of code. f is an example of what I informally think of as the dangling if return problem - in other words, it's quite easy to write a procedure where different values should be returned, but where one branch of an if incorrectly neglects to actually return anything. I find this occurs quite often during the early stages of development when one is fleshing out functions. On several occasions I spent several hours tracking down instances of this problem.

The question I asked myself was whether this feature was worth the pain. A quick analysis of real world code quickly showed that the vast majority of functions aren't generators, and indeed only ever return a single value. Therefore optimising for the exceptional case (generators) is an odd design choice.

I considered two solutions to the problem. First, one could syntactically differentiate generators and procedures, with the former failing by default and the latter not. Second, one can make all procedures not fail by default. Adding new syntax is a decision that should never be taken lightly, and since generators aren't that common, I couldn't justify the added conceptual complexity. Therefore Converge functions became similar to Python functions, with their default return action being to return the null object.

Generators are useful, but easily hidden

Generators are an integral part of Icon and Converge, but as well as having no specific syntax to define them, there is no specific syntax to call them. Assuming that g is a generator, I have grown quite fond of the following idiom:
l := []
...
for l.append(g())
which appends every element generated by g to l. However the problem in such cases is that it can be difficult to work out what part of the expression is the generator. Converge now tries to solve this problem by prefixing generator function names, whenever it makes sense, with iter_ which highlights that the function in question is a generator. This makes it easier to look at a piece of code and work out what it does without having to understand every function called.

Backtracking is rarely useful

One of the major features of goal-directed evaluation is that backtracking can be performed. The most common way for this to be achieved is by linking expressions together with &. The resulting conjunction of these expressions only succeeds if all of its component expressions succeed. The expressions are evaluated in order. If one of them fails, and a previous expression is a generator, then backtracking occurs. The previous generator is resumed to produce a new value, and the conjunction then resumes execution from that point.

Combining conjunction with for allows compact expressions such as the following to be expressed:

x := "abcabaacvcabcbab"
for Sys::println(i := 0.iter_to(x.len()) & i % 2 == 0 & x[i] == "a" & i)
This prints out every index position in a which is a multiple of two, and which contains the character a (0 6 10 14 for those who are interested).

There are two problems with the above. Most obviously, it is incredibly difficult to read from a human perspective; indeed, it would be far better written as a couple of if statements within a for body. The second problem is that the form of backtracking used is often too weak to be useful. Icon allows variables in conjunctions to have assignments undone during backtracking, but even this isn't really enough (I didn't even bother implementing such functionality in Converge, because I couldn't really see its use). Full backtracking often involves undoing assignments within objects and so on, and no imperative language can ever hope to do this automatically.

Out of all the systems I've written in Converge, only one has made any practical use of the built-in backtracking, and even then it needed some manual help to be truly useful.

The fail variable causes bizarre behaviour

One way in which Icon shows its age is its differentiation between values and references. In keeping with most object orientated languages, Converge makes no such distinction to the user (although within the VM it does so for efficiencies sake). One of Icon's main oddities is that fail is a synonym for return &fail, while &fail is a sort-of reference to an invisible object.

Converge simplified this, so that fail was a variable pointing to a semi-special object (in similar fashion to the null object). However the fail variable proved, in admittedly rare cases, to be conceptually troubling. It's troubling in Icon too, although in slightly different ways. What do you think happens in Icon if I define l as the following and then iterate over each of its elements and print them out?

l := [1, &fail, 3]
If you guessed that an error is raised because l didn't get assigned a value (because evaluating &fail cause the whole thing to fail), pat yourself on the back. Now try this:
l := []
put(l, 1)
put(l, &fail)
put(l, 3)
If you guessed that 1, then 2 get printed out, you're doing very well. And finally consider this:
l := []
put(l, 1)
t := &fail
put(l, t)
put(l, 3)
If you guessed that 1, then a blank line, and then 2 get printed out, you're doing much better than I ever managed to.

Converge's attempts to simplify thus were somewhat successful, but left one hole. While one could successfully evaluate the following list in Converge:

l := [1, fail, 2]
the following mysteriously printed nothing:
Sys::println(l[1])
since l[1] is a synonym for l.get(1), which of course tried to evaluate return fail, which then caused get to fail. While this might seem a contrived example, it unavoidably cropped up when getting a module to return the value of a specific definition, since one of those definitions was fail.

fail was edged out of Converge in stages. A while ago, the recommendation became that the only safe idiom involving fail was return fail. Perhaps surprisingly, this was not an onerous restriction. Nevertheless it still failed to prevent the problem of fail being a definition in a module. Therefore the fail variable was finally banished entirely from Converge recently (although it lives on internally in the VM). fail is now an expression in Converge's grammar, and is equivalent to return &fail in Icon. Thus finally, all of the bizarre behaviour associated with the fail variable is forgotten, and I now sleep easier at night.

Do something different

When I was looking for information in Icon, I stumbled across a few interviews with Ralph Griswold (Icon's main designer), and a paper on its history. One thing that became clear to me was that a major design goal of Icon was not to be just another (boring) synthesis of a few existing language features - like the vast majority of programming languages - but to try out genuinely new things. Have a look at e.g. p38 of this interview or p6 and 13 of this paper looking back at Icon's history. Whatever you might think of Icon, it undoubtedly fulfilled this aim: its expression evaluation system alone has no precedent in contemporary programming languages (and there are other interesting parts of the language).

When I was starting Converge, I agreed whole heartedly with this aim, but had no idea how to achieve it. Indeed, when Converge started, I only had the vague intention of chucking a few seemingly useful influences into the melting pot (initially Python, Icon, ObjVLisp, and ultimately Template Haskell when I'd got the mix of the former three languages about right). I had that vague feeling that either everything interesting had been done before, or that the remaining interesting things were outside of my grasp. To be honest, I sort of gave up on the aim of being innovative, and just concentrated on trying to mix Converge's influences together in a coherent way. What has been interesting is that as Converge has developed, new challenges have presented themselves, and I've had to find answers (some better than others) to such challenges. And in so doing, I think Converge has grown some genuinely innovative features, such as DSL blocks, and its approach to error information for macros amongst others. And in Converge's case - unlike Icon, as I understand it - the journey isn't over. Indeed, since I think Converge is currently nearing the end of the beginning, there's bound to be new challenges ahead, some of which will lead to new innovations of one sort or another, some ultimately successful, some ultimately not.

When I read about Icon's aim, and looked at the end result, I assumed that the innovative ideas in Icon had sprung out of nowhere. I, on the other hand, had no idea how to bring such ideas, fully formed, into the world. What's become clear to me is a restating of Edison's famous rule. Innovation is often a matter of perspiration over inspiration. Now my advice to those who wish to do innovative work is simple: if you put the miles in, you'll raise your head one day and discover that you're in a place where no-one else has ever found themselves in before. It's a good feeling, and one that benefits us all, as collectively we gradually discover what works and what doesn't.

Updated (December 10 2007): Used a clearer example in the Failure and if: a beautiful combination section.

Updated (October 26 2007): This article has been superseded by the paper Experiences with an Icon-like expression evaluation system. ]]> How Difficult is it to Write a Compiler? [August 9 2007] http://tratt.net/laurie/tech_articles/articles/how_difficult_is_it_to_write_a_compiler Recently I was discussing Converge with someone, and mentioned how little time the core compiler had taken to implement (no compile-time meta-programming, limited error checking, but a functioning compiler nonetheless) - only a few days. The chap I was talking to looked at me and told me that he didn't believe me. I laughed. Actually, he really didn't believe me. Initially that surprised me, and I wondered why he might think that I was pulling his leg. But then I thought back, and only a few years ago I would probably have had the same reaction. This article is my attempt to explain his reaction, and why it's no longer the case.

When I was younger, there were three things in computing that seemed so complex that I had no idea how anyone had ever created them; certainly, I didn't think I would ever be capable of working on such things. The imposing triumvirate were hardware, operating systems, and programming languages. Although electricity has always baffled me, the hardware people have done an amazing job on interoperability of components in recent years, such that I've never felt the need to gain a low-level understanding of how electrons whizz around my systems. Operating systems were partially demystified for me as I moved into the world of open source UNIX-like operating systems (for the record, I've been an OpenBSD user for 8 years or so), where one could poke and prod virtually any aspect - although kernels retain an aura of mystery for most of us. Despite the fact that I was familiar with several languages, and had even implemented a primitive assembler language (which, to my surprise, found its way into a real system - my first large-ish programming success), real programming languages remained a mystery to me for much longer. The obvious question is: why?

Programming languages are, for most of us, an integral part of the computing infrastructure. Only a small handful of languages have ever had a wide impact, and by the time a programming language becomes popular it already probably has a decade or more of history behind it. This means that when the average programmer first comes across a new language, it has had substantial development behind it, libraries and documentation, developed its own culture, and undoubtedly had several flaws uncovered; not to mention lots of little platform portability hacks, and often complex optimisations. Faced with this wodge of sheer stuff, most peoples reaction is either to shrug their shoulders and accept it at face value (the majority), try to understand it but not know where to start (a minority), or spot how similar it is to every other language (a tiny handful of people). Because the first two categories massively outnumber the third category, a general feeling - which we're not necessarily consciously aware of - has developed that programming languages are special, and therefore that only special people can create them (a veiled insult, if ever I saw one). Thus if I suggest to people that, despite having created a fairly fully featured programming language, I'm no more capable a programmer than the next man, they dismiss it as false modesty.

So let's start looking at things in more detail. If you break a programming language down, it only has a handful of things to it. What's more, the variance between different languages is - despite the unfortunate zealotry which often clouds debate - exceedingly small. In fact, at a high level there are really only two things which any language must have: a definition and an implementation. Sometimes a paper definition is produced up front, and only when this is complete is an implementation created. Sometimes the latter defines the former; but conceptually these are two separate things. Basically the definition says things like statement S does X, and the implementation actually takes in a source file and compiles S such that it really does do X. Defining a language is very simple: any fool can do it (the skill comes in defining a good language, but that's a different story). We can then break the implementation down into three components: a compiler, libraries, and a run-time system. Again, these are conceptually separate things, although sometimes they are combined (e.g. most Python programmers do not distinguish the compiler from the run-time system). Libraries are not difficult to create, although they are time consuming. Sometimes the run-time system will be a VM, sometimes it will be bundled up with the executable created by a compiler; regardless, we'll assume for the purposes of this article that a reasonable VM is already available.

Most people don't have much trouble understanding the high level view, but it's when they start trying to understand the compiler - the thing that ultimately they interact with - that things start getting more complex. The problem is that looking at implementations like Python (quite complex), Java (very complex), or GCC (something beyond merely very complex), reveals too much detail to easily uncover the big picture. The Converge compiler, at the moment at least, is rather simpler (although, like any other program, it is growing slowly but surely as it accretes features) and is a nice snapshot of a compiler. It has three basic stages:

  1. Read in a source file, and create a parse tree.
  2. Turn the parse tree into an abstract syntax tree.
  3. Turn the abstract syntax tree into object code.
Although the terminology might be a little different from language to language, these three stages are fairly common (although they might be augmented by extra stages e.g. for optimisation). Unfortunately my experience is that even at this stage, people look at those three codes and think magic. Let's break them down a little more.

Parsing

Parsing is the process of reading in a big wodge of text, discovering its underlying structure, and then creating a parse tree which is easy to operate on. For English, this is a bit like taking in a sentence and discovering what its subject, its verb, and so on is. Parsing doesn't really uncover meaning as such: that's for later stages.

If we have an input file such as the following:

import Sys

func main():
  Sys::println(2 + 3)

What is its parse tree? Well, in order to uncover that, we use a parsing system, which given a description of a language structure - a grammar - can automatically create a parse tree. Parsing has a bad reputation, often deservedly so: commonly used parsing technologies are often horrible to use, although it is perfectly possible to make much more pleasant parsing technologies. Regardless of that, this is the only thing in a compiler where you will need to check an external source to understand a bit more about grammars (Wikipedia's page on formal grammars is a decent start). An indicative chunk of the Converge grammar is as follows:

top_level ::= definition ( "NEWLINE" definition )*
          ::=

definition  ::= class_def
            ::= func_def
            ::= import
            ::= var_def ( "," var_def )* ":=" expr
            ::= splice
            ::= insert

import      ::= "IMPORT" import_name import_as ( "," import_name import_as )*
import_name ::= "ID" ( "::" "ID" )*
import_as   ::= "AS" "ID"
            ::=
In essence this says: a program consists of zero or more definitions; a definition is a class, a function etc.; an import is one or more module name imports, each of which can assign to a different variable name than the module name (using as). Given that and our input program, the following parse tree is produced (using only a couple of lines of code to call a library function or two):
top_level
  -> definition
    -> import
      -> <IMPORT import>
      -> import_name
        -> <ID Sys>
      -> import_as
  -> <NEWLINE>
  -> definition
    -> func_def
      -> func_type
        -> <FUNC func>
      -> func_name
        -> <ID main>
      -> <(>
      -> func_params
      -> <)>
      -> <:>
      -> <INDENT>
      -> func_decls
      -> expr_body
        -> expr
          -> application
            -> expr
              -> module_lookup
                -> expr
                  -> var_lookup
                    -> <ID Sys>
                -> <::>
                -> <ID println>
            -> <(>
            -> expr
              -> binary
                -> expr
                  -> number
                    -> <INT 2>
                -> binary_op
                  -> <+>
                -> expr
                  -> number
                    -> <INT 3>
            -> <)>
      -> <DEDENT>
This might look a bit imposing at first, but it's trivial to convert this back into the original input (although information about blank lines and comments would disappear reversing the transformation). If you want to see what the parse tree is for other Converge inputs, Converge helpfully includes a little program called convergep which given a source file input automatically prints out its parse tree (the above output is cut 'n' paste straight from convergep).

If you've understood this part, congratulations - you've got over the main stumbling block to creating a compiler.

From parse tree to abstract syntax tree

As stated earlier, a parse tree captures structure, but it doesn't capture meaning as such, beyond that which is captured in the structure. This can be seen by the fact that the parse tree still contains things like the : token, despite the fact this is for the users' visual benefit, rather than effecting the meaning of the program. What the second stage of a compiler does is to understand the meaning of the program (in some sense), strip out all the stuff which doesn't effect that, and convert it into another, very similar, data structure - the Abstract Syntax Tree (AST) - which captures this. For example the text 2 + 3 is converted from its parse tree equivalent:
-> expr
  -> binary
    -> expr
      -> number
        -> <INT 2>
    -> binary_op
      -> <+>
    -> expr
      -> number
        -> <INT 3>
into an AST along the lines of Add(Int(2), Int(3)). Notice that typically the parse tree and the AST have almost identical structure. Because of this, the conversion from parse tree to AST is generally very simple. For Converge, this conversion is captured in the compiler/Compiler/IModule_Generator.cv file. As the language grows, clearly this conversion gets larger, but because it is largely mechanical (and therefore often a simple cut 'n' paste job), it is not a difficult task. For example in Converge's early days, when it had approximately the same expressivity as Python, this conversion took around 2-3 days to code from scratch.

From abstract syntax tree to object code

Object code is an intentionally vague term - it might be machine code, VM bytecode, or even an intermediary programming language. Basically here we take an AST like Add(Int(2), Int(3)) and turn it into instructions such as:
INT 2
INT 3
ADD
In a language such as Converge, where the VM is designed explicitly to be used with the language, then there is a strong relationship between the AST and the object code. In other words, this means that the conversion to object code is about the same size and complexity as the conversion from parse tree to AST. If you're converting to, say, assembly code then this will be a trickier task, but even then there will still be a largely mechanical element to the conversion. You can see this in Converge's compiler/Compiler/Bytecode_Generator.cv file.

Conclusion

That's it. You don't need to know, or do, any more to create a compiler. When it's broken down in this simple way, I hope that it partly demystifies what a compiler is. There's nothing particularly magical about a compiler, and with the exception of parsing (which regretfully is often more involved than it should be), nothing particularly complex. I hope you get some sense of how little work there is in creating a compiler for a simple language. Of course, if you want to do something a little more complex, then things become rapidly more work, but you'd be amazed how few languages require such complexity.

In conclusion, the main difficulty for most of us in creating a compiler is overcoming the cultural absurdity which tells us that mere mortals can't create compilers. They can. I did. You can too. ]]>