Sunday, April 22, 2012

Movimentum - A severe grammar problem

When I looked through the grammar, I saw that I had made a small mistake: The rule for the X and Y selectors is as follows:

    scalarexpr4
      : ...
      | anchor
        ( X
        | Y
        )
      ...
      ;

This means that I can only attach .x and .y to anchors. However, I certainly also would like to write things like

    (Engine.P + [a,b]).x

I.e., I want to attach .x and .y to any vector expression. Not a big problem, I thought, and replaced anchor with vectorexpr4, which includes anchor as well as parenthesized vector expressions. But now ANTLR complains as follows:

[fatal] rule scalarexpr4 has non-LL(*) decision due to recursive rule invocations reachable from alts 1,8.

And after a minute of head scratching, the problem is clear: Assume that we are parsing along in a scalarexpr, and we see a few left parentheses and some stuff behind it:

    (((Identifier....

Now: Is this going to be
  • a scalar subexpression - e.g. just variables and numbers added = the first alternative of scalarexpr4;
  • or a vector expression, which then is ended with .x = the eighth alternative of scalarexpr4;
ANTLR3 tries to solve that puzzle by running a regular expression over the following symbols (which is much more powerful than classic LL(k) parsing, where only a fixed number k of symbol were "looked ahead"), but with the recursive nature of  my expression, a regular grammar cannot decide this! What can we do? Here are seven(!) possibilities to solve the problem - the first three are suggested by ANTLR after the error message above, the others are standard:
  1. Resolve by left-factoring. This is essentially impossible if I want to keep vector expressions and scalar expressions separate: It would require that all those "left sides" where it is not yet clear whether we are "scalar" or "vector" need to be put into a "generic expression" rule. I drop that option.
  2. Resolve by using backtrack=true option. Here, ANTLR runs the alternatives of a rule from top to bottom in a sort of "test mode" without actions and without error output until it succeeds with parsing.
  3. Resolve by using syntactic predicates. This is almost the same as 2., but for this, you must add a hand-crafted piece of "selection grammar" that moves the parser into the right direction. This is more work than 2., but may be much more efficient if there is e.g. a simple context-free grammar to decide on the issue. One can also say that option 2. is like option 3. with the specified alternatives used as "selection grammar". In our small grammar and - hopefully - small expressions, I don't see a reason to hand-craft a "selection grammar", so I drop this option.
  4. Use LR parsing. With LR parsing, the postfix .x might help to decide whether the elements below are going to be collapsed into a vectorexpr or a scalarepxr. However, I stick with ANTLR ... so I drop this option.
  5. Use type-free expressions. If our expressions were "just parenthesized expressions" without a syntactic distinction between vector and scalar expressions, the problem would vanish on the syntactic level. Of course, one could then write expressions like ((4)).x, i.e., add .x to a scalar - but this would have to be handled in a semantic layer. Almost all programming languages do this. This is essentially a more abstract view on the advice "left factor". Still, I would like to keep that syntactic distinction. I drop this also.
  6. Use .x as prefix. Prefixes are obviously great in LL parsing - but I don't want to write .x(Engine.P), so I drop this.
  7. Use different sort of parentheses, e.g. {...}.x. This will also help the parser to distinguish vector and scalar expressions on the left side - and also humans! It seems a worthwhile option, as it would force the user to always clearly state whether an expression is a vector or a scalar.
So the two alternatives that survive are 2. and 7.: Should I clearly distinguish the expression types in the language? Or should ANTLR "try what works"? I lean heavily towards 7., I must say - still, using parentheses (and not e.g. braces) for vector expressions is so common that it's had to break with that convention. Ok - let's take 2. for this time; but maybe reverse that (pure syntactical) decision if there arise more problems later - e.g. when the grammar is extended.

To keep the problem as local as possible, I extract the relevant alternatives to separate rules. For the vectorexpr4, I extract the following to vectorexpr5 and allow the .x, .y and .l suffixes only on these:

    vectorexpr4
      : INTEGRAL
          '(' vectorexpr ')'
      | DIFFERENTIAL
          '(' vectorexpr ')'
      | vectorexpr5
      ;

    vectorexpr5
      : '('
        vectorexpr
        ')'
      | vector
      | anchor
      ;
For the scalarexpr4, I extract a scalarexpr4Ambiguous rule which gets the backtrack option. In the backtracking rule, we put the vector expression with its trailing operators first. The reason is that syntactically, the second alternative can be a prefix of the first one - for example, (a) might be a scalarexpr or a vectorexpr5. When we put the second alternative first, it will not look at those .x or .y or .l suffixes and hence parse into a parenthesized scalarexpr - and then fail on the suffix. We will definitelyhave to test this thouroghly!

     scalarexpr4
      : scalarexpr4Ambiguous
      | ...
      ;
    scalarexpr4Ambiguous options { backtrack = true; }
     :  vectorexpr5
        ( X
        | Y
        | LENGTH
        )
      | '('
        scalarexpr
        ')'
      ;

The modified grammar file can be seen here!

No comments:

Post a Comment