Tuesday, June 29, 2010

Alien computer system

Abduction

Suppose you are abducted by aliens at a surprisingly convenient time, and they show you their alien computer (which they have localized for humans) and how they use it. Fascinated, you see that "files" and "directories" as you know them simply don't exist on the alien system. Perhaps they're working with "lists", "collections", and "cells" instead. Instead of creating documents in applications, they're putting together machines from interchangeable parts.

Most importantly, you notice that the aliens automate just about everything they do. They build mini-programs to do things rather than just doing them. Why? Well, this layer of abstraction means they never have to "undo"; they just tack on more machinery to short-circuit what they had before, and if they want to "redo", they just short-circuit the short-circuit. As you watch, you ponder the distinction between a "user" and a "programmer", wondering if it even exists on the alien homeworld.

On day two (after having eaten the best replicated chicken teryaki dinner orbiting earth and well-rested from zero-gravity sleep), you continue watching them play around with their on-screen contraptions, and you find that certain things show up over and over again. Sort of like how Earth computers have windows, icons, menus, and the pointer, you see a lot of the same things over and over again on the alien system, but they seem to have much more to do with semantics than with presentation.

By the end of the week, you've been using the alien computer and have gotten the hang of it. You learn to do some really advanced things, and the aliens are astounded by your progress. Moreover, some of the things you're doing on the alien system in minutes would be month-long projects on Earth computers.

However, you still can't do the vast majority of things you could do on your Earth computer, such as browsing the web, writing documents, managing files, etc.. Also, if other humans saw you working on the alien computer, they'd be just as clueless as when you looked over the aliens' shoulders the first day. Practically speaking, these alien computers would be completely useless to most humans.

Back on Earth

You are returned to Earth (with some cash and leftover teryaki chicken for compensation) and go back to your daily activities. However, you can't shake the feeling that there might be a radically better way to do computing that only takes a week to learn (if you're a fast learner). But, bringing "computer programming" into the user experience just seems like the wrong way to go in terms of making something both user friendly and easy to learn. Then again, you cannot deny the fact that computers are really hard to learn anyway.

I believe that learning how to become a computer "user" isn't easier than learning a programming language, but actually much harder (granted, you'll want to learn how to use a text editor and web browser before learning how to program).

Why, then, don't we teach grandma Python? I think it's because (most) users don't use Python to send and receive e-mail and browse the web. However, even grandma might find Python's interactive command line to be a handy desktop calculator... that is, until she tries to divide integers. The thing is, programming languages (well, except maybe the shell) typically don't make it easy to automate the day-to-day tasks of people who haven't devoted their lives to computing.

Higher-level programming-ish facilities integrated into operating systems would be useful to a lot of people, in my opinion. However, it would involve a learning curve, but users have to learn how to use computers anyway. Something can be simpler, yet harder to learn than something that is familiar.

Monday, May 3, 2010

Guppy: trying to make parsing simple, fun, and powerful

I've been working on designing a parser definition language called Guppy. I have not implemented it yet, as the syntax and semantics of it are a moving target subject to my whims.

The mission statement of Guppy is to make parsing simple, fun, and powerful. You give Guppy a grammar definition and an input string, and Guppy gives you an AST whose nodes are named after production rule symbols. Want to write a calculator? Define a simple, 20-line grammar to get started, and test it out with Guppy. Want to make your own extension to the C programming language? Import the "c" grammar (like you would import a library in scripting languages), and write only the parts that make your language different.

Here are a few examples of Guppy as I currently have it. The first example is the beginnings of a "standard library" for Guppy; a collection of rules that can be used without having to define them by hand:

alnum: alpha | digit
alpha: [A-Za-z]
ascii: [\x00-\x7F]
digit: [0-9]
extended: [\x80-\xFF]
lower: [a-z]
space: [\t\n\v\f\r ]
upper: [A-Z]
xdigit: [0-9A-Fa-f]

identifier: (alpha | '_') (alnum | '_')*

number: int | float
int:
dec: [1-9] [0-9]*
hex: '0' [Xx] xdigit*
oct: '0' [0-7]*
float: dec | hex
e: [Ee] ('+'|'-')? [0-9]+
p: [Pp] ('+'|'-')? [0-9]+
suffix: [FLfl]?

dec:
[0-9]* '.' [0-9]+ e? suffix
[0-9]+ '.' e? suffix
[0-9]+ e suffix
hex:
'0' [Xx] xdigit* '.' xdigit+ p suffix
'0' [Xx] xdigit+ '.' p suffix
'0' [Xx] xdigit+ p suffix

The key features to note here are the use of nested production rules and lexical scope. If a series of rules are listed indented under a parent rule, the names of rules become available to the definition of the actual parent (as in the case of number: int | float). If the parent rule's definition is empty, it automatically defaults to |-ing its child rules (as in the case of dec:). Note that in Guppy (unlike other grammar definition languages like Yacc), empty strings must be explicit (as in, '').

Two more candidate rules to add to Guppy's standard library are:

//C-style string literal
string(start_quote = '"', end_quote = start_quote)
: start_quote char..end_quote
char:
literal: .
escape:
hex: '\\' [Xx] xdigit{1,2}
oct:
'\\' [0-3][0-7][0-7]
'\\' [0-7][0-7]
'\\' [0-7]
unicode: '\\u' xdigit{4}
backslash: '\\' .

//infix-delimited list with at least one item
list(item, delimiter = ','):
item (delimiter item)*

This introduces functions. In this case, they're just used as templates for common tasks. However, using functions recursively makes it possible to define non-context-free grammars (perhaps even Turing-complete ones). Whether Guppy will ultimately allow that or not, I'm not sure.

Isolated production rules


I've also been thinking about how the traditional notions of lexer and parser can be united under one grammar formalism. I thought about how a formal grammar defines a program that outputs valid strings in a language, rather than defining the steps needed to parse that grammar. Taking that further, if you wanted to write a program to output code in a programming language like C, you'd first emit the start symbol, then apply grammar rules. After that (as in, you'd stop applying rules from the grammar), you'd apply lexical rules to convert things like IDENTIFIER to complete strings.

What I came up with involves enabling the user to isolate production rules. The syntax is rather ad hoc, but it lets you take a symbol or string and apply production rules wrapped in parenthesis to it by saying text -> ( rules... ).

expr -> (
expr: add_expr

primary_expr:
number
identifier
( expr )

multiply_expr:
primary_expr
multiply_expr '*' primary_expr
multiply_expr '/' primary_expr

add_expr:
multiply_expr
add_expr '+' multiply_expr
add_expr '-' multiply_expr

/* Note that number, and identifier aren't defined
* in this scope, so they are "quasi-terminals".
* Once we expand them using the production rules
* below, there's no turning back. */
) -> (
'': white
) -> (
number:
[0-9]+
[0-9]+ '.' [0-9]*
[0-9]* '.' [0-9]+

identifier:
[A-Za-z_][0-9A-Za-z_]*

white:
[\t\n ]
)

Ambiguity


There are a number of things I'm having trouble deciding on or understanding. For one, should ambiguity be allowed? Consider this much easier way of defining expr:

expr:
number
identifier
( expr )
expr '*' expr
expr '/' expr
expr '+' expr
expr '-' expr

The issue here isn't just that the order of operations isn't defined. Even if it could be indicated with a hint (as is done in Bison with %left, %right, etc.), the grammar here just isn't correct. If you say "1+2*3+4", you could follow the expr '*' expr expansion first, but it would yield an incorrect derivation. Another case is the ambiguity of + versus ++ in languages with that pre/postincrement operator:

(A|B)* -> (
'' -> ' '
) -> (
A: '+'
B: '++'
)

The ambiguity is typically resolved by making the lexer greedy; "+++" means "++ +". However, this grammar allows us to derive "+++" as "+ ++", which is incorrect. Making this grammar unambiguous isn't exactly simple, so requiring the user to do so goes against Guppy's mission statement.

Indentation-based syntax


Another issue is: should Guppy use an indentation-based syntax or not? Should Guppy offer a non-indentation-based alternative? Consider nice and simple code like this:

number: int | float
int:
dec: [1-9] [0-9]*
hex: '0' [Xx] xdigit*
oct: '0' [0-7]*
...

If we had to do without indentation-based syntax, it might look like this:

number: { (int | float);
int: {
dec: [1-9] [0-9]* ;
hex: '0' [Xx] xdigit* ;
oct: '0' [0-7]* ;
};
...

Although indentation-based syntax makes the code look nice, it has a number of drawbacks. One major drawback is that the grammar of Guppy would not be context-free (I think), complicating the matter of defining Guppy in Guppy. Note that indentation can be invoked in Guppy using a recursive function:

node(indent = ''):
indent value '\n'
indent key ':' value? '\n' node(indent [\t ]+)

Again, whether or not this will be allowed, I'm not sure.