The JavaCC lexical specification is organized into a set of "lexical states", each named with an identifier. There is a standard lexical state called DEFAULT. The generated token manager is, at any moment, in exactly one of these lexical states. When the token manager is initialized, it starts in the DEFAULT state; a different starting lexical state can be specified as a parameter when constructing the token manager object.
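As a sketch (the class names below are the ones JavaCC conventionally generates for a grammar whose parser class is named MyParser; the reader variable and the state constant IN_COMMENT are hypothetical), the starting state can be chosen at construction time:

```
// Illustrative only: MyParserTokenManager and SimpleCharStream are the
// conventionally generated class names; IN_COMMENT stands for a generated
// lexical state constant from your own grammar.
SimpleCharStream stream = new SimpleCharStream(reader);

// Starts in the DEFAULT lexical state:
MyParserTokenManager tm = new MyParserTokenManager(stream);

// Starts in the IN_COMMENT lexical state instead:
MyParserTokenManager tm2 = new MyParserTokenManager(stream, IN_COMMENT);
```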
Each lexical state contains an ordered list of regular expressions; the order is derived from the order of occurrence in the input file. There are four kinds of regular expressions: SKIP, MORE, TOKEN, and SPECIAL_TOKEN.
All regular expressions that occur as expansion units in the grammar are considered to be in the DEFAULT lexical state and their order of occurrence is determined by their position in the grammar file.
A token is matched as follows: all regular expressions in the current lexical state are considered as potential match candidates. The token manager consumes the maximum possible number of characters from the input stream that match one of these regular expressions; that is, it prefers the longest possible match. If multiple regular expressions produce a longest match of the same length, the one that occurs earliest in the grammar file wins.
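As an illustration (a sketch; the token names WHILE and IDENTIFIER are made up):

```
TOKEN : { <WHILE: "while"> }
TOKEN : { <IDENTIFIER: ["a"-"z"] (["a"-"z","0"-"9"])*> }
```

On the input "while", both regular expressions match five characters; since the matches have equal length, WHILE wins because it occurs earlier in the file. On the input "while2", IDENTIFIER matches six characters and wins as the longer match.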
As noted above, the token manager is in exactly one lexical state at any moment, and it considers only the regular expressions defined in that state for matching purposes. After a match, one can specify an action to be executed as well as a new lexical state to move to. If no new lexical state is specified, the token manager remains in the current state.
The regular expression kind specifies what to do when a regular expression has been successfully matched:

SKIP: simply throws away the matched string.
MORE: continues on to the next match, retaining the matched string as a prefix of the new match (the accumulated image is available in lexical actions through the variable image).
TOKEN: creates a token from the matched string and returns it to the parser (or whatever caller invoked the token manager).
SPECIAL_TOKEN: creates a special token that does not participate in parsing but remains accessible from neighboring regular tokens.
(The mechanism for accessing special tokens is described at the end of this page.)
Whenever the end of file <EOF> is detected, it causes the creation of an <EOF> token (regardless of the current state of the lexical analyzer). However, if an <EOF> is detected in the middle of a match for a regular expression, or immediately after a MORE regular expression has been matched, an error is reported.
After the regular expression is matched, the lexical action is executed. All the variables (and methods) declared in the TOKEN_MGR_DECLS region (see below) are available here for use. In addition, the variables and methods listed below are also available for use.
Immediately after this, the token manager changes state to that specified (if any).
After that the action specified by the kind of the regular expression is taken (SKIP, MORE, ... ). If the kind is TOKEN, the matched token is returned. If the kind is SPECIAL_TOKEN, the matched token is saved to be returned along with the next TOKEN that is matched.
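Putting these three steps together in one sketch (the state name IN_COMMENT and the counter commentNesting are hypothetical; the counter would be declared in a TOKEN_MGR_DECLS region, described below):

```
<DEFAULT>
SKIP :
{
  "/*" { commentNesting = 1; }  // 1. the lexical action runs first
        : IN_COMMENT            // 2. then the token manager switches to IN_COMMENT
  // 3. finally, because the kind is SKIP, the matched "/*" is thrown away
}
```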
The following variables (and the method SwitchTo) are available for use within lexical actions:

StringBuffer image: the image of the match so far, including characters accumulated across all preceding MORE matches since the last SKIP, TOKEN, or SPECIAL_TOKEN. It may be inspected and modified; changes persist into subsequent MORE matches.
int lengthOfMatch: the length of the current match only (it does not accumulate across MORE matches). It should not be modified.
int curLexState: the index of the current lexical state. It should not be modified.
inputStream: the character stream object; methods such as getEndLine() and getEndColumn() may be called on it to obtain position information, but it should not be manipulated directly.
Token matchedToken: available only in actions for TOKEN and SPECIAL_TOKEN regular expressions; it is set to the token about to be returned.
void SwitchTo(int state): switches the token manager to the specified lexical state.
<DEFAULT>
MORE : { "a" : S1 }

<S1>
MORE :
{
  "b"
    { /* ^1 */
      int l = image.length() - 1;
      image.setCharAt(l, Character.toUpperCase(image.charAt(l)));
      /* ^2 */
    } : S2
}

<S2>
TOKEN :
{
  "cd" { /* ^3 */ x = image; } : DEFAULT
}

In the above example, on the input "abcd", the value of image at the three points marked ^1, ^2, and ^3 is:
At ^1: "ab"
At ^2: "aB"
At ^3: "aBcd"
The value of lengthOfMatch at the same three points is:

At ^1: 1 (the length of "b")
At ^2: 1 (lengthOfMatch is not affected by lexical actions)
At ^3: 2 (the length of "cd")
If the last lexical specification above is instead written as:

<S2>
TOKEN :
{
  "cd" { matchedToken.image = image.toString(); } : DEFAULT
}

then the token returned to the parser has its ".image" field set to "aBcd". Without this assignment, the ".image" field remains "abcd".
Lexical actions have access to a set of class level declarations. These declarations are introduced within the JavaCC file using the following syntax:
token_manager_decls ::= "TOKEN_MGR_DECLS" ":" "{" java_declarations_and_code "}"
These declarations are accessible from all lexical actions.
The following example skips C-style comments using a dedicated lexical state:

SKIP : { "/*" : WithinComment }

<WithinComment> SKIP : { "*/" : DEFAULT }

<WithinComment> MORE : { <~[]> }
The next example counts the size of a string literal using a variable declared in the TOKEN_MGR_DECLS region:

TOKEN_MGR_DECLS :
{
  int stringSize;
}

MORE : { "\"" { stringSize = 0; } : WithinString }

<WithinString> TOKEN :
{
  <STRLIT: "\""> { System.out.println("Size = " + stringSize); } : DEFAULT
}

<WithinString> MORE :
{
  <~["\n","\r"]> { stringSize++; }
}
Special tokens are like tokens, except that they are permitted to appear anywhere in the input file (between any two tokens). Special tokens can be specified in the grammar input file using the reserved word "SPECIAL_TOKEN" instead of "TOKEN" as in:
SPECIAL_TOKEN : { <SINGLE_LINE_COMMENT: "//" (~["\n","\r"])* ("\n"|"\r"|"\r\n")> }
Any regular expression defined as a SPECIAL_TOKEN may be accessed in a special manner from user actions in the lexical and grammar specifications. This allows such tokens to be recovered during parsing even though they do not participate in it.
JavaCC has been bootstrapped to use this feature to automatically copy relevant comments from the input grammar file into the generated files.
Details:
The class Token now has an additional field:
Token specialToken;
This field points to the special token immediately prior to the current token (special or otherwise). If the token immediately prior to the current token is a regular token (and not a special token), this field is set to null. The "next" fields of regular tokens keep their usual meaning: they point to the next regular token, except for the EOF token, whose "next" field is null. The "next" field of a special token points to the special token immediately following it; if the token immediately following is a regular token, the "next" field is set to null.
This is clarified by the following example. Suppose you wish to print all special tokens prior to the regular token "t" (but only those that are after the regular token before "t"):
// If there are no special tokens before t, return control to the caller.
if (t.specialToken == null) return;

// Walk back the special-token chain until we reach the first special
// token after the previous regular token.
Token tmp_t = t.specialToken;
while (tmp_t.specialToken != null) tmp_t = tmp_t.specialToken;

// Now walk the special-token chain forward, printing each token.
while (tmp_t != null) {
  System.out.println(tmp_t.image);
  tmp_t = tmp_t.next;
}
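To see the chain invariants concretely, here is a self-contained sketch using a minimal stand-in for the generated Token class (only the fields the traversal needs; the class and token contents are illustrative, and the traversal collects images into a list instead of printing them):

```java
import java.util.ArrayList;
import java.util.List;

public class SpecialTokenDemo {

    // Minimal stand-in for the generated Token class.
    static class Token {
        String image;
        Token next;          // for a special token: the following special token,
                             // or null if a regular token follows
        Token specialToken;  // the special token immediately before this one, or null
        Token(String image) { this.image = image; }
    }

    // Collect the images of all special tokens between the previous
    // regular token and t, in order of appearance.
    static List<String> specialTokensBefore(Token t) {
        List<String> images = new ArrayList<>();
        if (t.specialToken == null) return images;
        // Walk back to the first special token after the previous regular token.
        Token tmp = t.specialToken;
        while (tmp.specialToken != null) tmp = tmp.specialToken;
        // Walk forward, collecting each special token's image.
        while (tmp != null) {
            images.add(tmp.image);
            tmp = tmp.next;
        }
        return images;
    }

    public static void main(String[] args) {
        // Model the input: two single-line comments followed by the token "x".
        Token c1 = new Token("// first");
        Token c2 = new Token("// second");
        Token x  = new Token("x");
        c1.next = c2;           // special token followed by another special token
        c2.specialToken = c1;   // c2's predecessor is c1
        c2.next = null;         // c2 is followed by a regular token
        x.specialToken = c2;    // x's immediately preceding special token

        System.out.println(specialTokensBefore(x)); // prints [// first, // second]
    }
}
```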