JavaCC [tm]: Grammar Files

This page contains the complete syntax of Java Compiler Compiler [tm] grammar files with detailed explanations of each construct.

Tokens in the grammar files follow the same conventions as for the Java programming language. Hence identifiers, strings, characters, etc. used in the grammars are the same as Java identifiers, Java strings, Java characters, etc.

White space in the grammar files also follows the same conventions as for the Java programming language. This includes the syntax for comments. Most comments present in the grammar files are generated into the generated parser/lexical analyzer.

Grammar files are preprocessed for Unicode escapes just as Java files are (i.e., occurrences of strings such as \uxxxx - where xxxx is a hex value - are converted the the corresponding Unicode character before lexical analysis).

Exceptions to the above rules: The Java operators "<<", ">>", ">>>", "<<=", ">>=", and ">>>=" are left out of JavaCC's input token list in order to allow convenient nested use of token specifications. Finally, the following are the additional reserved words in the Java Compiler Compiler [tm] grammar files.

EOF IGNORE_CASE JAVACODE LOOKAHEAD
MORE options PARSER_BEGIN PARSER_END
SKIP SPECIAL_TOKEN TOKEN TOKEN_MGR_DECLS

Any Java entities used in the grammar rules that follow appear italicized with the prefix java_ (e.g., java_compilation_unit).


javacc_input ::= javacc_options
"PARSER_BEGIN" "(" <IDENTIFIER> ")"
java_compilation_unit
"PARSER_END" "(" <IDENTIFIER> ")"
( production )*
<EOF>

The grammar file starts with a list of options (which is optional). This is then followed by a Java compilation unit enclosed between "PARSER_BEGIN(name)" and "PARSER_END(name)". After this is a list of grammar productions. Options and productions are described later.

The name that follows "PARSER_BEGIN" and "PARSER_END" must be the same and this identifies the name of the generated parser. For example, if name is "MyParser", then the following files are generated:

MyParser.java: The generate parser.
MyParserTokenManager.java: The generated token manager (or scanner/lexical analyzer).
MyParserConstants.java: A bunch of useful constants.

Other files such as "Token.java", "ParseError.java", etc. are also generated. However, these files contain boilerplate code and are the same for any grammar and may be reused across grammars.

Between the PARSER_BEGIN and PARSER_END constructs is a regular Java compilation unit (a compilation unit in Java lingo is the entire contents of a Java file). This may be any arbitrary Java compilation unit so long as it contains a class declaration whose name is the same as the name of the generated parser ("MyParser" in the above example). Hence, in general, this part of the grammar file looks like:

    PARSER_BEGIN(parser_name)
    . . .
    class parser_name . . . {
      . . .
    }
    . . .
    PARSER_END(parser_name)

JavaCC does not perform detailed checks on the compilation unit, so it is possible for a grammar file to pass through JavaCC and generate Java files that produce errors when they are compiled.

If the compilation unit includes a package declaration, this is included in all the generated files. If the compilation unit includes imports declarations, this is included in the generated parser and token manager files.

The generated parser file contains everything in the compilation unit and, in addition, contains the generated parser code that is included at the end of the parser class. For the above example, the generated parser will look like:

    . . .
    class parser_name . . . {
      . . .
      // generated parser is inserted here.
    }
    . . .

The generated parser includes a public method declaration corresponding to each non-terminal (see javacode_production and bnf_production) in the grammar file. Parsing with respect to a non-terminal is achieved by calling the method corresponding to that non-terminal. Unlike yacc, there is no single start symbol in JavaCC - one can parse with respect to any non-terminal in the grammar.

The generated token manager provides one public method:

    Token getNextToken() throws ParseError;

For more details on how this method may be used, please read the description of the Java Compiler Compiler API.


javacc_options ::= [ "options" "{" ( option_binding )* "}" ]

The options if present, starts with the reserved word "options" followed by a list of one or more option bindings within braces. Each option binding specifies the setting of one option. The same option may not be set multiple times.

Options may be specified either here in the grammar file, or from the command line. If the option is set from the command line, that takes precedence.

Option names are not case-sensitive.


option_binding ::= "LOOKAHEAD" "=" java_integer_literal ";"
| "CHOICE_AMBIGUITY_CHECK" "=" java_integer_literal ";"
| "OTHER_AMBIGUITY_CHECK" "=" java_integer_literal ";"
| "STATIC" "=" java_boolean_literal ";"
| "DEBUG_PARSER" "=" java_boolean_literal ";"
| "DEBUG_LOOKAHEAD" "=" java_boolean_literal ";"
| "DEBUG_TOKEN_MANAGER" "=" java_boolean_literal ";"
| "OPTIMIZE_TOKEN_MANAGER" "=" java_boolean_literal ";"
| "ERROR_REPORTING" "=" java_boolean_literal ";"
| "JAVA_UNICODE_ESCAPE" "=" java_boolean_literal ";"
| "UNICODE_INPUT" "=" java_boolean_literal ";"
| "IGNORE_CASE" "=" java_boolean_literal ";"
| "USER_TOKEN_MANAGER" "=" java_boolean_literal ";"
| "USER_CHAR_STREAM" "=" java_boolean_literal ";"
| "BUILD_PARSER" "=" java_boolean_literal ";"
| "BUILD_TOKEN_MANAGER" "=" java_boolean_literal ";"
| "TOKEN_MANAGER_USES_PARSER" "=" java_boolean_literal ";"
| "SANITY_CHECK" "=" java_boolean_literal ";"
| "FORCE_LA_CHECK" "=" java_boolean_literal ";"
| "COMMON_TOKEN_ACTION" "=" java_boolean_literal ";"
| "CACHE_TOKENS" "=" java_boolean_literal ";"
| "OUTPUT_DIRECTORY" "=" java_string_literal ";"


production ::= javacode_production
| regular_expr_production
| bnf_production
| token_manager_decls

There are four kinds of productions in JavaCC. javacode_production and bnf_production are used to define the grammar from which the parser is generated. regular_expr_production is used to define the grammar tokens - the token manager is generated from this information (as well as from inline token specifications in the parser grammar). token_manager_decls is used to introduce declarations that get inserted into the generated token manager.


javacode_production ::= "JAVACODE"
java_access_modifier java_return_type java_identifier "(" java_parameter_list ")"
java_block

The JAVACODE production is a way to write Java code for some productions instead of the usual EBNF expansion. This is useful when there is the need to recognize something that is not context-free or for whatever reason is very difficult to write a grammar for. An example of the use of JAVACODE is shown below. In this example, the non-terminal "skip_to_matching_brace" consumes tokens in the input stream all the way up to a matching closing brace (the opening brace is assumed to have been just scanned):

    JAVACODE
    void skip_to_matching_brace() {
      Token tok;
      int nesting = 1;
      while (true) {
        tok = getToken(1);
        if (tok.kind == LBRACE) nesting++;
        if (tok.kind == RBRACE) {
          nesting--;
          if (nesting == 0) break;
        }
        tok = getNextToken();
      }
    }

Care must be taken when using JAVACODE productions. While you can say pretty much what you want with these productions, JavaCC simply considers it a black box (that somehow performs its parsing task). This becomes a problem when JAVACODE productions appear at choice points. For example, if the above JAVACODE production was referred to from the following production:

  void NT() :
  {}
  {
    skip_to_matching_brace()
  |
    some_other_production()
  }

Then JavaCC would not know how to choose between the two choices. On the other hand, if the JAVACODE production is used at a non-choice point as in the following example, there is no problem:

  void NT() :
  {}
  {
    "{" skip_to_matching_brace()
  |
    "(" parameter_list() ")"
  }

When JAVACODE productions are used at choice points, JavaCC will print a warning message stating this fact. You will then have to insert some explicit LOOKAHEAD specifications to help JavaCC. See the minitutorial on LOOKAHEAD for a detailed guide on such issues.

The default access modifier for JAVACODE productions is package private.


bnf_production ::= java_access_modifier java_return_type java_identifier "(" java_parameter_list ")" ":"
java_block
"{" expansion_choices "}"

The BNF production is the standard production used in specifying JavaCC grammars. Each BNF production has a left hand side which is a non-terminal specification. The BNF production then defines this non-terminal in terms of BNF expansions on the right hand side. The non-terminal is written exactly like a declared Java method. Since each non-terminal is translated into a method in the generated parser, this style of writing the non-terminal makes this association obvious. The name of the non-terminal is the name of the method, and the parameters and return value declared are the means to pass values up and down the parse tree. As will be seen later, non-terminals on the right hand sides of productions are written as method calls, so the passing of values up and down the tree are done using exactly the same paradigm as method call and return. The default access modifier for BNF productions is public.

There are two parts on the right hand side of an BNF production. The first part is a set of arbitrary Java declarations and code (the Java block). This code is generated at the beginning of the method generated for the Java non-terminal. Hence, every time this non-terminal is used in the parsing process, these declarations and code are executed. The declarations in this part are visible to all Java code in actions in the BNF expansions. JavaCC does not do any processing of these declarations and code, except to skip to the matching ending brace, collecting all text encountered on the way. Hence, a Java compiler can detect errors in this code that has been processed by JavaCC.

The second part of the right hand side are the BNF expansions. This is described later.


regular_expr_production ::= [ lexical_state_list ]
regexpr_kind [ "[" "IGNORE_CASE" "]" ] ":"
"{" regexpr_spec ( "|" regexpr_spec )* "}"

A regular expression production is used to define lexical entities that get processed by the generated token manager. A detailed description of how the token manager works is provided in this minitutorial (click here). This page describes the syntactic aspects of specifying lexical entities, while the minitutorial describes how these syntactic constructs tie in with how the token manager actually works.

A regular expression production starts with a specification of the lexical states for which it applies (the lexical state list). There is a standard lexical state called "DEFAULT". If the lexical state list is omitted, the regular expression production applies to the lexical state "DEFAULT".

Following this is a description of what kind of regular expression production this is (see below for what this means).

After this is an optional "[IGNORE_CASE]". If this is present, the regular expression production is case insensitive - it has the same effect as the IGNORE_CASE option, except that in this case it applies locally to this regular expression production.

This is then followed by a list of regular expression specifications that describe in more detail the lexical entities of this regular expression production.


token_manager_decls ::= "TOKEN_MGR_DECLS" ":" java_block

The token manager declarations starts with the reserved word "TOKEN_MGR_DECLS" followed by a ":" and then a set of Java declarations and statements (the Java block). These declarations and statements are written into the generated token manager and are accessible from within lexical actions. See the minitutorial on the token manager for more details.

There can only be one token manager declaration in a JavaCC grammar file.


lexical_state_list ::= "<" "*" ">"
| "<" java_identifier ( "," java_identifier )* ">"

The lexical state list describes the set of lexical states for which the corresponding regular expression production applies. If this is written as "<*>", the regular expression production applies to all lexical states. Otherwise, it applies to all the lexical states in the identifier list within the angular brackets.


regexpr_kind ::= "TOKEN"
| "SPECIAL_TOKEN"
| "SKIP"
| "MORE"

This specifies the kind of regular expression production. There are four kinds:


regexpr_spec ::= regular_expression [ java_block ] [ ":" java_identifier ]

The regular expression specification begins the actual description of the lexical entities that are part of this regular expression production. Each regular expression production may contain any number of regular expression specifications.

Each regular expression specification contains a regular expression followed by a Java block (the lexical action) which is optional. This is then followed by an identifier of a lexical state (which is also optional). Whenever this regular expression is matched, the lexical action (if any) gets executed, followed by any common token actions. Then the action depending on the regular expression production kind is taken. Finally, if a lexical state is specified, the token manager moves to that lexical state for further processing (the token manager starts initially in the state "DEFAULT").


expansion_choices ::= expansion ( "|" expansion )*

Expansion choices are written as a list of one or more expansions separated by "|"s. The set of legal parses allowed by an expansion choice is a legal parse of any one of the contained expansions.


expansion ::= ( expansion_unit )*

An expansion is written as a sequence of expansion units. A concatenation of legal parses of the expansion units is a legal parse of the expansion.

For example, the expansion "{" decls() "}" consists of three expansion units - "{", decls(), and "}". A match for the expansion is a concatenation of the matches for the individual expansion units - in this case, that would be any string that begins with a "{", ends with a "}", and contains a match for decls() in between.


expansion_unit ::= local_lookahead
| java_block
| "(" expansion_choices ")" [ "+" | "*" | "?" ]
| "[" expansion_choices "]"
| [ java_assignment_lhs "=" ] regular_expression
| [ java_assignment_lhs "=" ] java_identifier "(" java_expression_list ")"

An expansion unit can be a local LOOKAHEAD specification. This instructs the generated parser on how to make choices at choice points. For details on how LOOKAHEAD specifications work and how to write LOOKAHEAD specifications, click here to visit the minitutorial on LOOKAHEAD.

An expansion unit can be a set of Java declarations and code enclosed within braces (the Java block). These are also called parser actions. This is generated into the method parsing the non-terminal at the appropriate location. This block is executed whenever the parsing process crosses this point successfully. When JavaCC processes the Java block, it does not perform any detailed syntax or semantic checking. Hence it is possible that the Java compiler will find errors in your actions that have been processed by JavaCC. Actions are not executed during lookahead evaluation.

An expansion unit can be a parenthesized set of one or more expansion choices. In which case, a legal parse of the expansion unit is any legal parse of the nested expansion choices. The parenthesized set of expansion choices can be suffixed (optionally) by:

An expansion unit can be a regular expression. Then a legal parse of the expansion unit is any token that matches this regular expression. When a regular expression is matched, it creates an object of type Token. This object can be accessed by assigning it to a variable by prefixing the regular expression with "variable =". In general, you may have any valid Java assignment left-hand side to the left of the "=". This assignment is not performed during lookahead evaluation.

An expansion unit can be a non-terminal (the last choice in the syntax above). In which case, it takes the form of a method call with the non-terminal name used as the name of the method. A successful parse of the non-terminal causes the parameters placed in the method call to be operated on and a value returned (in case the non-terminal was not declared to be of type "void"). The return value can be assigned (optionally) to a variable by prefixing the regular expression with "variable =". In general, you may have any valid Java assignment left-hand side to the left of the "=". This assignment is not performed during lookahead evaluation. Non-terminals may not be used in an expansion in a manner that introduces left-recursion. JavaCC checks this for you.


local_lookahead ::= "LOOKAHEAD" "(" [ java_integer_literal ] [ "," ] [ expansion_choices ] [ "," ] [ "{" java_expression "}" ] ")"

A local lookahead specification is used to influence the way the generated parser makes choices at the various choice points in the grammar. A local lookahead specification starts with the reserved word "LOOKAHEAD" followed by a set of lookahead constraints within parentheses. There are three different kinds of lookahead constraints - a lookahead limit (the integer literal), a syntactic lookahead (the expansion choices), and a semantic lookahead (the expression within braces). At least one lookahead constraint must be present. If more than one lookahead constraint is present, they must be separated by commas.

For a detailed description of how lookahead works, please click here to visit the minitutorial on LOOKAHEAD. A brief description of each kind of lookahead constraint is given below:

Default values for lookahead constraints: If a local lookahead specification has been provided, but not all lookahead constraints have been included, then the missing ones are assigned default values as follows:


regular_expression ::= java_string_literal
| "<" [ [ "#" ] java_identifier ":" ] complex_regular_expression_choices ">"
| "<" java_identifier ">"
| "<" "EOF" ">"

There are two places in a grammar files where regular expressions may be written:

The complete details of regular expression matching by the token manager is available in the minitutorial on the token manager. The description of the syntactic constructs follows.

The first kind of regular expression is a string literal. The input being parsed matches this regular expression if the token manager is in a lexical state for which this regular expression applies and the next set of characters in the input stream is the same (possibly with case ignored) as this string literal.

A regular expression may also be a more complex regular expression using which more involved regular expression (than string literals can be defined). Such a regular expression is placed within angular brackets "<...>", and may be labeled optionally with an identifier. This label may be used to refer to this regular expression from expansion units or from within other regular expressions. If the label is preceded by a "#", then this regular expression may not be referred to from expansion units, but only from within other regular expressions. When the "#" is present, the regular expression is referred to as a "private regular expression".

A regular expression may be a reference to some other labeled regular expression in which case it is written as the label enclosed in angular brackets "<...>".

Finally, a regular expression may be a reference to the predefined regular expression "<EOF>" which is matched by the end of file.

Private regular expressions are not matched as tokens by the token manager. Their purpose is solely to facilitate the definition of other more complex regular expressions.

Consider the following example defining Java floating point literals:

TOKEN :
{
  < FLOATING_POINT_LITERAL:
        (["0"-"9"])+ "." (["0"-"9"])* (<EXPONENT>)? (["f","F","d","D"])?
      | "." (["0"-"9"])+ (<EXPONENT>)? (["f","F","d","D"])?
      | (["0"-"9"])+ <EXPONENT> (["f","F","d","D"])?
      | (["0"-"9"])+ (<EXPONENT>)? ["f","F","d","D"]
  >
|
  < #EXPONENT: ["e","E"] (["+","-"])? (["0"-"9"])+ >
}

In this example, the token FLOATING_POINT_LITERAL is defined using the definition of another token, namely, EXPONENT. The "#" before the label EXPONENT indicates that this exists solely for the purpose of defining other tokens (FLOATING_POINT_LITERAL in this case). The definition of FLOATING_POINT_LITERAL is not affected by the presence or absence of the "#". However, the token manager's behavior is. If the "#" is omitted, the token manager will erroneously recognize a string like E123 as a legal token of kind EXPONENT (instead of IDENTIFIER in the Java grammar).


complex_regular_expression_choices ::= complex_regular_expression ( "|" complex_regular_expression )*

Complex regular expression choices is made up of a list of one or more complex regular expressions separated by "|"s. A match for a complex regular expression choice is a match of any of its constituent complex regular expressions.


complex_regular_expression ::= ( complex_regular_expression_unit )*

A complex regular expression is a sequence of complex regular expression units. A match for a complex regular expression is a concatenation of matches to the complex regular expression units.


complex_regular_expression_unit ::= java_string_literal
| "<" java_identifier ">"
| character_list
| "(" complex_regular_expression_choices ")" [ "+" | "*" | "?" ]

A complex regular expression unit can be a string literal, in which case there is exactly one match for this unit, namely, the string literal itself.

A complex regular expression unit can be a reference to another regular expression. The other regular expression has to be labeled so that it can be referenced. The matches of this unit are all the matches of this other regular expression. Such references in regular expressions cannot introduce loops in the dependency between tokens.

A complex regular expression unit can be a character list. A character list is a way of defining a set of characters. A match for this kind of complex regular expression unit is any character that is allowed by the character list.

A complex regular expression unit can be a parenthesized set of complex regular expression choices. In this case, a legal match of the unit is any legal match of the nested choices. The parenthesized set of choices can be suffixed (optionally) by:

Note that unlike the BNF expansions, the regular expression "[...]" is not equivalent to the regular expression "(...)?". This is because the [...] construct is used to describe character lists in regular expressions.


character_list ::= [ "~" ] "[" [ character_descriptor ( "," character_descriptor )* ] "]"

A character list describes a set of characters. A legal match for a character list is any character in this set. A character list is a list of character descriptors separated by commas within square brackets. Each character descriptor describes a single character or a range of characters (see character descriptor below), and this is added to the set of characters of the character list. If the character list is prefixed by the "~" symbol, the set of characters it represents is any UNICODE character not in the specified set.


character_descriptor ::= java_string_literal [ "-" java_string_literal ]

A character descriptor can be a single character string literal, in which case it describes a singleton set containing that character; or it is two single character string literals separated by a "-", in which case, it describes the set of all characters in the range between and including these two characters.