RSS / XML feed

Laurence Tratt's Technical Articles


Problems with Software 3: Creating Crises Where There Aren't Any

June 28 2011

[This is number 3 in an occasional series ‘Problems with Software’; read the previous article here.]

As an undergraduate in the late 1990s, the part of my course I liked the least was Software Engineering, which started with obviously silly 1970s practices, and went downhill from there. As someone who was already familiar with both programming and, to some extent, the software industry, I pooh-poohed the course; my arrogance was duly rewarded with the lowest mark of any part of my degree, which taught me a useful life lesson.

One previously unfamiliar concept which this dreadful course introduced to me was the ‘software crisis’. First promulgated in the late 1960s and early 1970s, it captured the notion that software was nearly always over-budget, behind schedule, and of low quality. Those with an appreciation of history will know that most of the software technologies (programming languages, operating systems, and so on) we now take for granted took form around this time. As with any new field, this was an exciting time, with no technical precedent: none of its participants would have been able to predict the path forward. Rather, experimentation with new ideas and tools led to successes and failures; the latter received greater attention, leading to the idea of a ‘software crisis’.

To the best of my knowledge, the ‘software crisis’ was the first wide-spread moment of self-doubt in the history of computing. It reflected the idea that the problems of the day were huge and, quite possibly, perpetual. It is this latter point that, with the benefit of hindsight, looks hopelessly naive: by the early 1990s (and, arguably, many years before), software was demonstrably and continually improving in quality and scope, and has continued doing so ever since. While the massive improvements in hardware certainly helped (by the mid 1990s, hardware imposed constraints only on the most demanding of software), it was surely inevitable that, as a community, we would become better at creating software, given experience.

Mercifully, I have not found myself in a discussion about the ‘software crisis’ for 5 or more years; it seems finally to have been put to rest, albeit many, many years after it had any possible relevance. However it seems that, in software, we abhor a crisis vacuum: as soon as one has disappeared, others arise in its place. I've seen a few over the last few years including a ‘skills crisis’ and an ‘outsourcing crisis’, all minor problems at best.

However, one ‘crisis’ has gained particularly widespread attention in the last few years: the ‘multi-core crisis’. A quick historical detour is instructive. Although – in accordance with Gordon Moore's famous observation – the number of transistors on a CPU has doubled every 2 years for the last 50 years, by the mid part of this decade, it was proving increasingly difficult to make use of extra transistors to improve the performance of single-core (i.e. traditional) processors. In short, the easy performance increases that the software industry had indirectly relied upon for many decades largely disappeared. Instead, hardware manufacturers used the extra transistors to produce multi-core processors. The thesis of the ‘multi-core crisis’ is that software has to make use of the extra cores, yet we currently have no general techniques or languages for writing good parallel programs.

Assuming, for the time being, that we accept that this might be a crisis, there are two things which, I think, we can all agree upon. First, virtually all software ever written (including virtually all software written today) has been created with the assumption of sequential execution. Second, although we feel intuitively that many programs should be paralellisable, we do not currently know how to make them so. In a sense, both these statements are closely related: most software is sequential precisely because we don't know how to parallelise it.

One question, which those behind the ‘multi-core crisis’ do not seem to have asked themselves, is whether this will change. For it to do so, we would need to develop languages or techniques which would allow us to easily write parallel software. Broadly speaking, only two approaches have so far seen wide-spread use: processes and threads. The processes approach breaks software up into chunks, each of which executes in its own largely sealed environment, communicating with other chunks via well-defined channels. It has the advantage that it is relatively easy to understand each process; and that processes can not accidentally cause each other to fail. On the downside, processes must be large: the approach is not suited to fine levels of granularity, ruling out most potential uses. The threads approach is based on parallel execution within a single process, using atomic machine instructions to ensure synchronisation between multiple threads. It has the advantage of allowing fine grained parallelisation. On the downside it is notoriously difficult to comprehend multi-threaded programs, and they are typically highly unreliable because of incorrect assumptions surrounding state shared between threads.

In short, processes and threads are not without their uses, but they will never get us to parallel nirvana. A number of other approaches have been proposed but have yet to prove themselves on a wide scale. Transactional memory, for example, has serious efficiency problems and is conceptually troubling with respect to common operations such as input / output. There also seems to be an interesting tradeoff between parallel and sequential suitability: languages such as Erlang, which are, relatively speaking, good at parallelism, often tend to be rather lacking when it comes to the mundane sequential processing which is still a large part of any ‘parellel’ system.

However, all this is incidental detail. The point about a crisis is not about whether we have solutions to it or not, but whether it is a real, pressing problem. The simple answer is, for the vast majority of the population, “no”. Most peoples computing requirements extend little beyond e-mail, web, and perhaps word processing or similar. For such tasks – be they on a desktop computer or running on a remote server – computers have been more than powerful enough since the turn of the century. If we were able to increase sequential execution speed by an order of magnitude, most people wouldn't notice very much: they are already happy with the performance of their processors, so increasing it again won't make much difference (most users will see far more benefit from an SSD than a faster processor). In those areas where better parallel performance would be a big win – scientific computing, for example, or computer games – people are prepared to pay the considerable premium currently needed to create parallel softwre. Ultimately, we may get better at creating parallel software, but most users' lives will tick on merrily enough either way.

The ‘software crisis’ was bound to solve itself given time and the ‘multi-core crisis’ is no such thing: just because we've been given an extra tool doesn't mean that we're in crisis if we don't use it. Eventually (and this may take many years), I suspect people will realise that. But, I have little doubt, it will be replaced by another ‘crisis’, because that seems to be the way of our subject: no issue is too small not to be thought of as a crisis. The irony of our fixation on artificially created crises is that deep, long-term problems – which arguably should be considered crises – are ignored. The most important of these surely relates to security — too much of our software is susceptible to attack, and too few users understand how to use software in a way that minimises security risks. I have yet to come across any major organisation whose computing systems give me the impression that they are truly secure (indeed, I have occasionally been involved in uncovering trivial flaws in real systems). That to me is a real crisis — and it's one that's being exploited every minute of every day. But it's not sexy, it's not new, and it's hard: none of which makes it worthy, to most people, of being thought of as a crisis.

Link to this entry


Problems with Software 2: Failing to Use the Computing Lever

June 7 2011

[This is number 2 in an occasional series ‘Problems with Software’; read the previous article here.]

We all know at least one and I make no apologies for the picture I now paint. Hunch backed, tongue out, index fingers locked rigidly into an uncomfortable ‘r’ shape, and with no sense of the appropriate level of force to put into their actions. The neanderthal figures I refer to are, of course, two fingered typists. In my experience, a typical two fingered typist will struggle to type 30 Words Per Minute (WPM). An average touch typist will manage at least 70 WPM; and for those prepared to put in a bit of effort in, 90-100 WPM is eminently achievable.

Whether high typing speeds are useful depends on the person and the tasks they need to achieve. Agricultural workers, for example, probably won’t see a huge return on investment in typing skills. What astonishes me is that I know people who spend their entire working day in front of a computer, day in, day out, yet who still use this grossly inefficient technique. The slow typists I know easily lose a few hours every week due to their poor technique — yet suggesting that a few days spent learnt touch typing will quickly pay off in increased productivity is inevitably met with a reply of “I don’t have the time”.

Poor typing technique is a concrete example of a common human tendency — to stop learning as soon as one has a way of performing a task, regardless of whether that is the best way. In typing terms, the slowest typist is perhaps 5 times slower than the fastest: in other areas of human activity, the ratio can be much greater. One of those fields is computing and, by implication, software.

At this point, it is instructive to ask oneself a simple question: what is a computer? The standard answers run along the lines of “a CPU, some RAM, a disk”, “a keyboard and a monitor”, or “an operating system and user software.” While correct in their own way, such answers report the mechanisms involved without considering their purpose. The answer I prefer to this question is more abstract: computers are levers. Archimedes, the first person to explain how levers work, famously said “Give me a place to stand, and I will move the Earth.” When asked to prove this by his King, Archimedes used a lever to move a ship single-handedly, something impossible for even the strongest man to do unaided.

One of the first real uses of a computer shows how effective a lever they can be. The semi-programmable British Colossus computer of the mid 1940s was able to perform thousands of comparisons per second, cracking the seemingly unbreakable German Enigma code. What the code-breakers of Bletchley Park had realised was that a computer could perform simple, repetitive tasks at a speed impossible for humans. Operations that had taken a group of people several weeks to complete could be done by Colossus in half an hour: such was Colossus’s speed that, once operational, it played a significant part in shortening the course of the war.

It is little exaggeration to say that computers have developed into the longest levers available to man. Since Colossus, the rate of progress has been little short of astonishing, and computers have been used to do things that were previously unimaginable: from weather prediction to visual special effects, from DNA sequencing to search engines.

Yet, a lever is only truly effective if used from its end: gripped near the pivot, a lever is, in effect, shortened, and its force magnifying effect reduced. Regrettably, most people grip the computing lever extremely close to the pivot. There are many reasons for this and enumerating them all would take a small book. However, a few examples give an indication of the problem.

  1. Doing tasks manually instead of automatically.
    Computers excel at the simple, repetitive tasks which humans are terrible at. I remember seeing someone change a document which used the American English idiom of placing a comma after e.g. (i.e. “e.g., X”) to the less stilted British English idiom without the comma (i.e. “e.g. X”). The person editing did the change by scanning the (rather large) document and manually changing each occurrence, a task which took a couple of rather dispiriting hours. Inevitably, in a large document, some occurrences were missed. I fixed the remaining occurrences with 100% accuracy by using the ‘search and replace’ function in my text editor — which took less than 5 seconds.
  2. Permanent short-term thinking.
    Whenever I am confronted by a computing task, I ask myself “will I need to do this again?” If yes, I generally spend time working out how to automate the task. Consequently I have built up a considerable suite of small tools (a few of the more polished are publicly available) and techniques over the years to call upon. Most computing professionals I know have not a single such tool (and shockingly few techniques); their thinking never extends beyond trying to bumble through the task at hand. A task that, by hand, takes 30 minutes might take 90 minutes to automate; over the years, as that task recurs every 6 months or so, the automated solution pays for itself many times over.
  3. Using one tool for every task.
    Many people are prepared to invest time in learning one tool, but one tool only; it then becomes the hammer used to treat every task encountered as a nail. While this sometimes works well in the sort term, in the medium and long term, the effort involved in contorting the task to the tool can be huge. Perhaps the most common example is the misuse of spreadsheets to store database-like information — if I had a pound for every time an individual’s name was spelt differently on separate ‘sheets’ within a single Excel file, I would be a wealthy man.
  4. Not knowing the field and not staying upto date.
    The first hour of my day is generally spent checking computing newsgroups and websites, to ensure that I have some knowledge of the breadth of my subject, and the latest developments. Sometimes, I confess, I wish I didn’t have to do this; but if I didn’t, I would miss out on those surprisingly frequent moments when someone raises a problem and I hear myself saying “I was reading about a new program X a few weeks back which sounds like it might do what you need.”
  5. Not searching for help.
    I learnt the rudimentaries of programming before I had a modem; whatever problems I encountered had to be solved with the help of the one manual and two books I owned. Memorably, one extremely trivial problem took me nearly a full month to solve, working on my own. Now, my first instinct is to search for the problem on Google; 95% of the time, someone else will have had the same problem, and someone else will have pointed them to a solution. Yet, when I ask someone who presents me with a trivial problem “did you try looking it up on Google” 95% of the time the answer is “I didn’t think of doing that.”
To some extent, all of us will recognise some parts of ourselves in the above, or can pinpoint other areas where we fail to use the computing lever effectively. There is no shame in that: the shame comes in not addressing the problem. Unfortunately, ignorance in the basic use of computers is considered not only acceptable but the norm amongst the majority of computing and software professionals. The resulting loss in efficiency is vast and a rather sad indictment of our field.

Link to this entry


Problems with Software 1: Confusing Problems Whose Solutions Are Easy to State With Problems Whose Solutions Are Easy to Realise

April 19 2011

I love computing, in particular software: since I was young, I have devoted the major portion of my life to it. In return, I have gained access to a never‐ending source of challenges; met many great people; and had my ego smashed on the jagged rocks that are off‐by‐one errors. None of this makes me very well qualified to write an occasional series of articles on Problems with Software, but I will do my best. The spirit in which I write these is a positive one, perhaps akin to a parent admonishing a child for his own long‐term benefit. Of course, I am, and always be, a student of software, not its parent; searching, learning, and in awe of this ever‐evolving subject. With that in mind, let us begin.

I start with what I consider to be the most common mistake in software; a mistake that no other subject I know of commits so frequently. It is the confusion of problems whose solutions are easy to state with problems whose solutions are easy to realise. A non‐computing example may help explain this. The problem is war: around the world, it ruins the lives of countless people. The easily stated solution is that we should stop people from fighting each other. Regretfully, we know that this solution, while correct, is not worth the space it takes on the page. Telling people they should stop fighting changes nothing, and physically stopping them requires resources far beyond that which can be mustered. None of this is to say that it is not worth tackling the problem of war: but most of us have enough common sense to realise that solutions (for there will surely need to be more than one) will be expensive, long‐term, and come with no guarantee of success.

Alas, in software, such common sense is an uncommon thing. In our subject, unworkable solutions are frequently proposed to deep problems; funding is acquired; manpower deployed; and, after an exhausting death march, failure guaranteed. The problem most frequently invoked is the difficulty of creating good software. This clearly is a problem: too much software is expensive, unreliable, and ill‐suited to the task it was created for. The many easily‐stated solutions proposed have had, at best, a tiny effect; at worst, they have impeded progress by suppressing less exciting, but more realistic, truths. Two concrete examples exemplify this.

The first example is Aspect Orientated Programming (AOP). The problem that AOP addresses is this: developing a program is hard because all of its constituents aspects — even those which are in no way related — must be considered together at all times. Imagine a health‐care system which records patient data and has separate modules for doctors and patients. One aspect is server communication; both doctor and patient units need to communicate with a central server and cope with certain error conditions. This aspect is jumbled up with, and scattered across, the rest of the program, making it hard to check that all server interactions will be performed as expected. The AOP solution is then simple: rather than developing software monolithically, it should be developed as separate aspects, which can be worked on in isolation, before being composed together to produce the end result.

There is no doubt that the AOP solution is intuitive and appealing: the jumbled and scattered nature of current software is a curse, with many deleterious effects. I can remember when I first came across AOP. My response then is my response now (though my argument, I hope, is somewhat better stated these days): it can't work. Furthermore, a simple analogy shows why it can't work.

Imagine a hill, composed, horizontally, of different rock strata. Strata can be taken out and put back in, without ever changing the inherent ‘hillness’ of the hill—a hill is still identifiably a hill even if one strata is taken out (though it might have a distinct kink in it) and each strata makes sense in and of itself, whether it is in the hill or not. In this analogy, aspects are the strata; for AOP to work, it is necessary to imagine being able to take a piece of software, pull out an aspect, and have that aspect be coherent in and of itself. It is a beautiful vision, and one which would change programming. Let us now take a different analogy. Instead of a hill, imagine a gravel drive, consisting of hundreds of thousands of tiny stones of varying colours: grey, honey, black, white, and so on. Even Don Quixote would blanch if I asked him to extract all the honey coloured stones, put them to one side, and then later to put them all back in the same place. The unfortunate reality is that software necessarily resembles a gravel drive—just as different coloured stones exist alongside each other, so do `aspects' of a program. The server communication code in the health care system is likely, for example, to form parts of nested AND expressions—there is no general way to pull such an entangled aspect out in such cases (or to have developed it from scratch) and expect to be able to put it back in.

The second example is Model Driven Architecture (MDA). For my sins, I played a small part in this monstrosity; in my defence I was young, naive, and needed the money, though such pitiful excuses are unlikely to help me come the day of reckoning. The problem that MDA aims to address is the following: developing software is hard because programming languages force one to work at a level of abstraction that is not much above that of machine code. Indeed, this is a problem; while a high‐level description of a business's requirements for its new system may be a few pages long, the corresponding software may have tens or hundreds of thousands of lines of code. Relating the mess of a low‐level program back to the high‐level requirements is a skill that few people successfully acquire. The MDA solution is then simple: development should be done with high‐level (mostly UML) diagrams. Common sense quickly suggests that this approach can never work. Diagrams are wonderful for expressing static relationships, but, overall, terrible for expressing most dynamic behaviour. Looping behaviour is elegantly captured by a textual FOR loop but, as anyone who has ever seen a `visual' programming language in action will know that, is a daunting, unmanageable, mess when represented graphically—and a language which makes looping behaviour difficult is not very useful.

What's most frustrating about AOP and MDA is that while both correctly identify real problems, the solutions they present are appealing in their simplicity yet self‐evidently unviable. Huge amounts of money and time have been invested in both, with little prospect of meaningful reward. Indeed, I am not sure that a good solution could exist for either problem: they seem to me largely intractable and inevitable. Of course, this is not to say that people shouldn't try to come up to solutions to these and other such problems—if whinging old codgers like me always had our way, we'd still be living in dark, damp caves and laughing at Barry from the cave next door while he works on his silly sounding ‘wheel’ concept. Ours is a ‘new’ subject, a vast, unexplored continent, and optimism is a vital component of the makeup of successful settlers. But as vital as optimism is realism—expending precious resources on hopeless causes is a guaranteed way to doom settlement. A little dose of common sense — thinking through a couple of common cases, at the very least — when prevented with a wonderful sounding solution to a hard problem would do our subject a lot of good. Other subjects seem to manage it, and I don't see why in software we can't too.

Link to this entry


Parsing: The Solved Problem That Isn't

March 15 2011

Parsing is the act of taking a stream of characters and deducing if and how they conform to an underlying grammar. For example the sentence Bill hits Ben conforms to the part of the English grammar noun verb noun. Parsing concerns itself with uncovering structure; although this gives a partial indication of the meaning of a sentence, the full meaning is only uncovered by later stages of processing. Parseable, but obviously nonsensical, sentences like Bill evaporates Ben highlight this (the sentence is still noun verb noun, but finding two people who agree on what it means will be a struggle). As humans we naturally parse text all the time, without even thinking about it; indeed, we even have a fairly good ability to parse constructs that we've never seen before.

In computing, parsing is also common; while the grammars are synthetic (e.g. of a specific programming language), the overall idea is the same as for human languages. Although different communities have different approaches to the practicalities of parsing - C programmers reach for lex / yacc; functional programmers to parser combinators; others for tools like ANTLR or a Packrat / PEG-based approach - they typically rely on the same underlying area of knowledge.

After the creation of programming languages themselves, parsing was one of the first major areas tackled by theoretical computer science and, in many peoples eyes, one of its greatest successes. The 1960s saw a concerted effort to uncover good theories and algorithms for parsing. Parsing in the early days seems to have shot off in many directions before, largely, converging. Context Free Grammars (CFGs) eventually won, because they are fairly expressive and easy to reason about, both for practitioners and theorists.

Unfortunately, given the extremely limited hardware of 1960s computers (not helped by the lack of an efficient algorithm), the parsing of an arbitrary CFG was too slow to be practical. Parsing algorithms such as LL, LR, and LALR identified subsets of the full class of CFGs that could be efficiently parsed. Later, relatively practical algorithms for parsing any CFG appeared, most notably Earley's 1973 parsing algorithm. It is easy to overlook the relative difference in performance between then and now: the fastest computer in the world from 1964-1969 was the CDC6600 which executed at around 10 MIPS; my 2010 mobile phone has a processor which runs at over 2000 MIPS. By the time computers had become fast enough for Earley's algorithm, LL, LR, and friends had established a cultural dominance which is only now being seriously challenged - many of the most widely used tools still use those algorithms (or variants) for parsing. Nevertheless in tools such as ACCENT / ENTIRE and recent versions of bison, one has access to performant parsers which can parse any CFG, if that is needed.

The general consensus, therefore, is that parsing is a solved problem. If you've got a parsing problem for synthetic languages, one of the existing tools should do the job. A few heroic people - such as Terence Parr, Adrian Johnstone, and Elizabeth Scott - continue working away to ensure that parsing becomes even more efficient but, ultimately, this will be transparently adopted by tools without overtly changing the way that parsing is typically done.

Language composition

One of the things that's become increasingly obvious to me over the past few years is that the general consensus breaks down for one vital emerging trend: language composition. Composition is one of those long, complicated, but often vague terms that crops up a lot in theoretical work. Fortunately, for our purposes it means something simple: grammar composition, which is where we add one grammar to another and have the combined grammar parse text in the new language (exactly the sort of thing we want to do with Domain Specific Languages (DSLs)). To use a classic example, imagine that we wish to extend a Java-like language with SQL so that we can directly write:
for (String s : SELECT name FROM person WHERE age > 18) {
  ...
}
Let's assume that someone has provided us with two separate grammars: one for the Java-like language and one for SQL. Grammar composition seems like it should be fairly easy. In practice, it turns out to be rather frustrating, and I'll now explain some of the reasons why.

Grammar composition

While grammar composition is theoretically trivial, simply squashing two grammars together is rarely useful in practice. Typically, grammars have a single start rule; one therefore needs to choose which of the two grammars has the start rule. More messy is the fact that the chances of the two grammars referencing each other is slight; in practice, one needs to specify a third tranche of data - often referred to, perhaps slightly misleadingly, as glue - which actually links the two grammars together. In our running example, the Java-like language has the main grammar; the glue will specify where, within the Java-like expressions, SQL statements can be referenced.

For those using old parsing algorithms such as LR (and LL etc.), there is a more fundamental problem. If one takes two LR-compatible grammars and combines them, the resulting grammar is not guaranteed to be LR-compatible (i.e. an LR parser may not be able to parse using it). Therefore such algorithms are of little use for grammar composition.

At this point, users of algorithms such as Earley's have a rather smugger look on their face. Since we know from grammar theory that unioning two CFGs always leads to a valid CFG, such algorithms can always parse the result of grammar composition. But, perhaps inevitably, there are problems.

Tokenization

Parsing is generally a two-phase process: first we break the input up into tokens (tokenization); and then we parse the tokens. Tokens are what we call words in everyday language. In English, words are easily defined (roughly: a word starts and ends with a space or punctuation character). Different computer languages, however, have rather different notions of what their tokens are. Sometimes, tokenization rules are easily combined; however since tokenization is done in ignorance of how the token will later be used, sometimes it is difficult. For example, in SQL SELECT might be a keyword but in Java it is also a valid identifier; it is often hard, if not impossible, to combine such tokenization rules in traditional parsers.

Fortunately there is a solution: scannerless parsing (e.g. SDF2 scannerless parsing). For our purposes, it might perhaps better be called tokenless parsing; the different names reflect the naming conventions of different parsing schools. Scannerless parsing does away with a separate tokenization phase; the grammar now contains the information necessary to dynamically tokenize text. Combining grammars with markedly different tokenization rules is now possible.

Fine-grained composition

In practice, the simple glue mentioned earlier used to combine two grammars is often not enough. There can be subtle conflicts between the grammars, in the sense that the combined language might not give the result that was expected. Consider combining two grammars that have different keywords. Scannerless parsing allows us to combine the two grammars, but we may wish to ensure that the combined languages do not allow users to use keywords in the other language as identifiers. There is no easy way to express this in normal CFGs. The SDF2 paper referenced earlier allows reject productions as a solution to this; unfortunately this then makes SDF2 grammars mildly context sensitive. As far as I know, the precise consequences of this haven't been explored, but it does mean that at least some of the body of CFG theory won't be applicable; it's enough to make one a little nervous, at the very least (not withstanding the excellent work that has been created using the SDF2 formalism by Eeclo Visser and others).

A recent, albeit relatively unknown, alternative are boolean grammars. These are a generalization of CFGs that include conjunction and negation, which, at first glance, are exactly the constructs needed to make grammar composition practical (allowing one to say things like identifiers are any sequence of ASCII characters except SELECT). Boolean grammars, to me at least, seem to have a lot of promise, and Alexander Okhotin is making an heroic effort on them. However, there hasn't yet been any practical use of them that I know of, so wrapping ones head around the practicalities is far from trivial. There are also several open questions about boolean grammars, some of which, until they are answered one way or the other, may preclude wide-scale uptake. In particular, one issue relates to ambiguity, of which more now needs to be said.

Ambiguity

By severely restricting what CFGs they accept, grammars which are compatible with traditional parsing algorithms (LL, LR etc.) are always unambiguous (though, as we shall see, this does not mean that all the incompatible grammars are ambiguous: many are unambiguous). Grammar ambiguity is thus less widely understood than it might otherwise have been. Consider the following grammar of standard arithmetic:
E ::= E "+" E
    | E "-" E
    | E "/" E
    | E "*" E
Using this grammar, a string such as 2 + 3 * 4 can be parsed ambiguously in two ways: as equivalent to (2 + 3) * 4; or as equivalent to 2 + (3 * 4). Parsing algorithms such as Earley's will generate all possibilities even though we often only want one of them (due to arithmetic conventions, in this case we want the latter parse). There are several different ways of disambiguating grammars, such as precedences (in this example, higher precedences win in the face of ambiguity):
E ::= E "+" E  %precedence 1
    | E "-" E  %precedence 1
    | E "/" E  %precedence 2
    | E "*" E  %precedence 3
This might suggest that we can tame ambiguity relatively easily: unfortunately, parsing theory tells us that the reality is rather tricky. The basic issue is that, in general, we can not statically analyse a CFG and determine if it is ambiguous or not. To discover whether a given CFG is ambiguous or not we have to try every possible input: if no input triggers an ambiguous parse, the CFG is not ambiguous. However this is, in general, impractical: most CFGs describe infinite languages and can not be exhaustively tested. There are various techniques which aim to give good heuristics for ambiguity (see Bas Basten's masters thesis for a good summary; I am also collaborating with a colleague on a new approach, though it's far too early to say if it will be useful or not). However, these heuristics are inherently limited: if they say a CFG is ambiguous, it definitely is; but if they can not find ambiguity, all they can say is that the CFG might be unambiguous.

Since theoretical problems are not always practical ones, a good question is the following: is this a real problem? In my experience thus far, defining stand-alone grammars for programming languages using Earley parsing (i.e. a parsing algorithm in which ambiguity is possible), it's not been a huge problem: as the grammar designer, I often understand where dangerous ambiguity might exist, and can nip it in the bud. I've been caught out a couple of times, but not enough to really worry about.

However, I do not think that my experience will hold in the face of wide-spread grammar composition. The theoretical reason is easily stated: combining two unambiguous grammars may result in an ambiguous grammar (which, as previously stated, we are unlikely to be able to statically determine in general). Consider combining two grammars from different authors, neither of whom could have anticipated the particular composition: it seems to me that ambiguity is much more likely to crop up in such cases. It will then remain undetected until an unfortunate user finds an input which triggers the ambiguity. Compilers which fail on seemingly valid input are unlikely to be popular.

PEGs

As stated earlier, unambiguous parsing algorithms such as LL and LR aren't easily usable in grammar composition. More recently, a rediscovered parsing approach has gathered a lot of attention: Packrat / PEG parsing (which I henceforth refer to as PEGs). PEGs are different than everything mentioned previously: they have no formal relation to CFGs. The chief reason for this is PEGs ordered choice operator, which removes any possibility for ambiguity in PEGs. PEGs are interesting because, unlike LL and LR, they're closed under composition: in other words, if you have two PEGs and compose them, you have a valid PEG.

Are PEGs the answer to our problems? Alas - at least as things stand - I now doubt it. First, PEGs are rather inexpressive: like LL and LR parsing, PEGs are often frustrating to use in practise. This is, principally, because they don't support left recursion; Alex Warth proposed an approach which adds left recursion but I discovered what appear to be problems with it, though I should note that there is not yet a general consensus on this (and I am collaborating with a colleague to try and reach an understanding of precisely what left recursion in PEGs should mean). Second, while PEGs are always unambiguous, depending on the glue one uses during composition, the ordered choice operator may cause strings that were previously accepted in the individual languages not to be accepted in the combined language - which, to put it mildly, is unlikely to be the desired behaviour.

Conclusions

If you've got this far, well done. This article has ended up much longer than I originally expected - though far shorter than it could be if I really went into detail on some of these points! It is important to note that I am not a parsing expert: I only ever wanted to be a user of parsing, not - as I currently am - someone who knows bits and pieces about its inner workings. What's happened is that, in wanting to make greater use of parsing, I have gradually become aware of the limitations of what I have been able to find. The emphasis is on gradually: knowledge about parsing is scattered over several decades (from the 60s right up to the present day); many publications (some of them hard to get hold of); and many peoples heads (some of whom no longer work in computing, let alone in the area of parsing). It is therefore hard to get an understanding of the range of approaches or their limitations. This article is my attempt to write down my current understanding and, in particular, the limitations of current approaches when composing grammars; I welcome corrections from those more knowledgeable than myself. Predicting the future is a mugs game, but I am starting to wonder whether, if we fail to come up with more suitable parsing algorithms, programming languages of the future that wish to allow syntax extension will bypass parsing altogether, and use syntax directed editing instead. Many people think parsing is a solved problem - I think it isn't.

Link to this entry


In Praise of the Imperfect

August 25 2010

In this entry, I‘m going to break - in spirit at least - one of my golden rules: I‘m going to mention one of my hobbies. This is a slippery slope. Intelligent, interesting people (as well as me) are generally interested in more than one thing in one life - no matter the relationship between those things. Some of the people who most elegantly write about programming have, in my opinion, gone off the rails when they‘ve made fatuous comparisons between their hobby and computing of the type programmers like X because it is like programming in the sense Y. Often the only thing relating X - be it painting, poetry, guns, or red squirrels - to programming is the person writing about it. Good luck to them I say - but generalising from a case of one is frequently unenlightening. With that warning to myself - and the reader - in my ears, off I go.

I recently read an article on music that, boiled down to its essentials, talked about the lack of a meaningful relationship between technical virtuosity (roughly ‘how well can you play’) and musical quality (roughly ‘how enjoyable is this piece of music’). Sometimes 3 chords and a 3 note melody played slowly and imperfectly will be exactly what is required; sometimes precise fast scales and subtle melodic variations will do the trick. When a song that should be played slowly, sloppily, and minimally is played fast, precisely, and with extraneous notes - or vice versa - the balance between technical ability and musical quality is lost. Remembering this balance is a problem many musicians have. As their technical abilities grow over time, it is common to equate technical difficulty with musical quality. Despite my extremely limited musical abilities (it is difficult to dispute one persons assessment of my playing as ham-fisted), I have on occasion suffered this problem as I have got slightly more proficient: fortunately, since many of my favourite songs are - not to put too fine a point on it - brainless three and a half minute rockers, I generally get brought back down to earth at some point.

Reading this made me realise that a similar balance applies to programming. Sometimes a heavily engineered solution is right for the job; sometimes something much simpler is just the ticket. Let me give an example of each.

First, an example of something heavily engineered. One of my favourite things that I've done is extsmail which is a traditionally constructed, robust Unix daemon for mail delivery. In extsmail, I cared about every possible detail: the program is carefully decomposed into its constituent parts; every error condition is explicitly handled and dealt with appropriately; the documentation is complete; I audit the code regularly; and I've run it through every static analyser I've been able to get my hands on. In short, its 1,600 lines of code are probably the highest quality I've ever personally written. Given its tiny potential audience, you could well argue that it is grossly over-engineered, but it solves a real problem that I had for years. It performs its task - delivering e-mail to a remote machine via SSH - admirably and hasn't crashed in almost 2 years. Anything less than that wouldn't have been right for what I needed - it has to do the job as near to perfect as is possible, every time. Heavy engineering was the right choice here.

Second, an example of something simple. A colleague once dismissed a program I'd written with an unrepeatable 4 letter word, saying that it was poorly organised and unreadable. He was right in one way. The program in question was about 4000 lines of code, in an area of which both of us had previously been entirely ignorant - the fact that I had to learn a new programming language was one of the lesser challenges in the whole thing - and the result was slapped together in varying stages of incomprehension over roughly 3 or 4 man weeks. I pointed out to him two things about the result: first, it showed that what we wanted to do was possible and plausible, when we both had doubts originally; second, that he'd written precisely zero lines of code in the same period. That code, incidentally, is now part of one of the more widely used pieces of software I've been involved with. No user will ever know that it's the product of someone learning on the job, and that, internally, things could have been done much better. For the job it does it works reliably enough and - more importantly - it exists. If I'd taken the high-road, it still wouldn't exist now, and nobody would have been able to benefit from it. Quick and dirty was the right choice.

As my tactless colleague shows, there is a bias in programming against imperfect solutions. I can understand where this bias comes from: the worst of all worlds is an inappropriate imperfect solution. Trying to ensure that this doesn't happen has some odd side-effects. One is a tendency to assume that one language, one tool, one technique should be appropriate for all tasks. Different communities fixate on different things. For the last 10 years, industrial management has often thought Java was the answer; web-sites PHP; and academics strongly typed pure functional languages. In my opinion, they are all wrong today - and, I suspect, they will all be wrong in the future, even as their favourite things change. There is no one size fits all. Different tasks demand different approaches. For example:

  • If I need to do a sysadmin job across my servers, I'll use the Unix shell or Python. If it goes wrong, it's not a huge problem. What is important is having something now. Having it in a form that can be easily altered to suit changing needs is also useful.
  • If I'm writing a Unix daemon I'll use C. I prefer it not to go wrong, but I can tolerate very occasional failures. If it takes a while to create, I can live with that. Having it in a form that can be tweaked a bit is useful, though I don't expect it to change a great deal in the future.
  • If you're writing software for a plane or nuclear reactor, I hope you use the strongest statically typed language you can, every static analyzer available, and as many formal techniques as you can shake a stick at. Failure is not an option, since the result will be loss of human life. Nor is cost - if it takes hundreds of man years of effort, it's probably worth it. Since such software must be precisely specified up-front, there is relatively little need for it to amenable to change.
As this suggests, when it comes to the needed quality of software, there are several shades of grey. There are many people out there who promulgate the need for highly-engineered solutions. I'd like to wave a little banner for imperfect software. In many cases, having something, even if it's not quite perfect, is better than having nothing. In academia, in particular, we often fall into this trap - we know that, given sufficient time and resources, we can produce high quality (albeit not quite perfect) software. The problem is that the time and resources needed are nearly always prohibitive and, often, complete overkill.

No one in their right mind would say that the highly practiced virtuosity of a classical orchestra could replace the raw emotion of a simple folk singer - both have their own merits and their place in the musical landscape. So too should we give imperfect, pragmatic software the respect that it is due.

Link to this entry

Earlier articles

All articles
 
Last 10 articles
Problems with Software 3: Creating Crises Where There Aren't Any
Problems with Software 2: Failing to Use the Computing Lever
Problems with Software 1: Confusing Problems Whose Solutions Are Easy to State With Problems Whose Solutions Are Easy to Realise
Parsing: The Solved Problem That Isn't
In Praise of the Imperfect
A Modest Attempt to Help Prevent Unnecessary Static / Dynamic Typing Debates
A Proposal for Error Handling
The Missing Level of Abstraction?
Good Programmers are Good Sysadmins are Good Programmers
How can C Programs be so Reliable?
 
 
DSLs
Tony Clark
Zef Hemel
 
Modelling
Steve Cook
Mark Delgano
Steven Kelly
Stuart Kent
Jim Steel
Alan Cameron Wills
 
OS
Marc Balmer
Mike Erdely
Peter Hansteen
KernelTrap
OpenBSD Journal
 
Programming
Peter Bell
Gilad Bracha
Tony Clark
Cliff Click
William Cook
Jonathan Edwards
Daniel Ehrenberg
Fabien Fleutot
Martin Fowler
John Goerzen
Grace
James Hague
Jeremy Hylton
Ralph Johnson
JOT
Ralf Laemmel
Lambda the Ultimate
Daniel Lemire
Michael Lucas
Bertrand Meyer
Keith Packard
Havoc Pennington
PyPy
John Regehr
Software Engineering Radio
Diomidis Spinellis
Shin Tai
Markus Voelter
Phil Wadler
Russel Winder
Steve Yegge