Monday, May 3, 2010

Guppy: trying to make parsing simple, fun, and powerful

I've been working on designing a parser definition language called Guppy. I have not implemented it yet, as the syntax and semantics of it are a moving target subject to my whims.

The mission statement of Guppy is to make parsing simple, fun, and powerful. You give Guppy a grammar definition and an input string, and Guppy gives you an AST whose nodes are named after production rule symbols. Want to write a calculator? Define a simple, 20-line grammar to get started, and test it out with Guppy. Want to make your own extension to the C programming language? Import the "c" grammar (like you would import a library in scripting languages), and write only the parts that make your language different.

Here are a few examples of Guppy as I currently have it. The first example is the beginnings of a "standard library" for Guppy; a collection of rules that can be used without having to define them by hand:

alnum: alpha | digit
alpha: [A-Za-z]
ascii: [\x00-\x7F]
digit: [0-9]
extended: [\x80-\xFF]
lower: [a-z]
space: [\t\n\v\f\r ]
upper: [A-Z]
xdigit: [0-9A-Fa-f]

identifier: (alpha | '_') (alnum | '_')*

number: int | float
dec: [1-9] [0-9]*
hex: '0' [Xx] xdigit*
oct: '0' [0-7]*
float: dec | hex
e: [Ee] ('+'|'-')? [0-9]+
p: [Pp] ('+'|'-')? [0-9]+
suffix: [FLfl]?

[0-9]* '.' [0-9]+ e? suffix
[0-9]+ '.' e? suffix
[0-9]+ e suffix
'0' [Xx] xdigit* '.' xdigit+ p suffix
'0' [Xx] xdigit+ '.' p suffix
'0' [Xx] xdigit+ p suffix

The key features to note here are the use of nested production rules and lexical scope. If a series of rules are listed indented under a parent rule, the names of rules become available to the definition of the actual parent (as in the case of number: int | float). If the parent rule's definition is empty, it automatically defaults to |-ing its child rules (as in the case of dec:). Note that in Guppy (unlike other grammar definition languages like Yacc), empty strings must be explicit (as in, '').

Two more candidate rules to add to Guppy's standard library are:

//C-style string literal
string(start_quote = '"', end_quote = start_quote)
: start_quote char..end_quote
literal: .
hex: '\\' [Xx] xdigit{1,2}
'\\' [0-3][0-7][0-7]
'\\' [0-7][0-7]
'\\' [0-7]
unicode: '\\u' xdigit{4}
backslash: '\\' .

//infix-delimited list with at least one item
list(item, delimiter = ','):
item (delimiter item)*

This introduces functions. In this case, they're just used as templates for common tasks. However, using functions recursively makes it possible to define non-context-free grammars (perhaps even Turing-complete ones). Whether Guppy will ultimately allow that or not, I'm not sure.

Isolated production rules

I've also been thinking about how the traditional notions of lexer and parser can be united under one grammar formalism. I thought about how a formal grammar defines a program that outputs valid strings in a language, rather than defining the steps needed to parse that grammar. Taking that further, if you wanted to write a program to output code in a programming language like C, you'd first emit the start symbol, then apply grammar rules. After that (as in, you'd stop applying rules from the grammar), you'd apply lexical rules to convert things like IDENTIFIER to complete strings.

What I came up with involves enabling the user to isolate production rules. The syntax is rather ad hoc, but it lets you take a symbol or string and apply production rules wrapped in parenthesis to it by saying text -> ( rules... ).

expr -> (
expr: add_expr

( expr )

multiply_expr '*' primary_expr
multiply_expr '/' primary_expr

add_expr '+' multiply_expr
add_expr '-' multiply_expr

/* Note that number, and identifier aren't defined
* in this scope, so they are "quasi-terminals".
* Once we expand them using the production rules
* below, there's no turning back. */
) -> (
'': white
) -> (
[0-9]+ '.' [0-9]*
[0-9]* '.' [0-9]+


[\t\n ]


There are a number of things I'm having trouble deciding on or understanding. For one, should ambiguity be allowed? Consider this much easier way of defining expr:

( expr )
expr '*' expr
expr '/' expr
expr '+' expr
expr '-' expr

The issue here isn't just that the order of operations isn't defined. Even if it could be indicated with a hint (as is done in Bison with %left, %right, etc.), the grammar here just isn't correct. If you say "1+2*3+4", you could follow the expr '*' expr expansion first, but it would yield an incorrect derivation. Another case is the ambiguity of + versus ++ in languages with that pre/postincrement operator:

(A|B)* -> (
'' -> ' '
) -> (
A: '+'
B: '++'

The ambiguity is typically resolved by making the lexer greedy; "+++" means "++ +". However, this grammar allows us to derive "+++" as "+ ++", which is incorrect. Making this grammar unambiguous isn't exactly simple, so requiring the user to do so goes against Guppy's mission statement.

Indentation-based syntax

Another issue is: should Guppy use an indentation-based syntax or not? Should Guppy offer a non-indentation-based alternative? Consider nice and simple code like this:

number: int | float
dec: [1-9] [0-9]*
hex: '0' [Xx] xdigit*
oct: '0' [0-7]*

If we had to do without indentation-based syntax, it might look like this:

number: { (int | float);
int: {
dec: [1-9] [0-9]* ;
hex: '0' [Xx] xdigit* ;
oct: '0' [0-7]* ;

Although indentation-based syntax makes the code look nice, it has a number of drawbacks. One major drawback is that the grammar of Guppy would not be context-free (I think), complicating the matter of defining Guppy in Guppy. Note that indentation can be invoked in Guppy using a recursive function:

node(indent = ''):
indent value '\n'
indent key ':' value? '\n' node(indent [\t ]+)

Again, whether or not this will be allowed, I'm not sure.