Designing Grammar Rules¶
Writing syntactically correct BNF is only the first step. A working parser requires deliberate decisions about rule organization, PSI tree shape, and error handling hooks. As the Grammar-Kit documentation puts it: "Writing a grammar doesn't mean the generated parser will work. The tricky part is to tune some raw grammar into a working grammar."
This page covers the architectural principles, common patterns, and pitfalls that turn a raw grammar into a robust parser. It assumes you already know the BNF syntax covered in Grammar Syntax.
Grammar Architecture¶
A Grammar-Kit grammar has a predictable top-down structure. The decisions you make about that structure determine what the generated parser produces and what the resulting PSI tree looks like.
Top-Down Structure¶
The first rule in the file is the grammar root. Grammar-Kit treats it as implicitly private, so no PSI class is generated for it. The root typically loops over a private helper rule that matches one top-level construct:
root ::= item *
Each public rule produces one PSI node and one IElementType constant. Private rules produce neither. Their matched content merges into the parent node instead. This distinction is the primary tool for controlling what appears in your PSI tree.
The following template shows a well-organized grammar. Read the inline comments for the role of each section:
// 1. Global attributes
{
parserClass="com.example.lang.parser.MyLangParser"
parserUtilClass="com.example.lang.parser.MyLangParserUtil"
psiClassPrefix="MyLang"
psiImplClassSuffix="Impl"
psiPackage="com.example.lang.psi"
psiImplPackage="com.example.lang.psi.impl"
elementTypeHolderClass="com.example.lang.psi.MyLangTypes"
elementTypeClass="com.example.lang.psi.MyLangElementType"
tokenTypeClass="com.example.lang.psi.MyLangTokenType"
// Pattern-based attributes
extends(".*_expr")=expr
name(".*_expr")='expression'
consumeTokenMethod(".*_recover")="consumeTokenFast"
// 2. Token declarations
tokens=[
SEMI=';'
COMMA=','
EQ='='
LP='('
RP=')'
LB='{'
RB='}'
space='regexp:\s+'
comment='regexp://.*'
number='regexp:\d+(\.\d*)?'
id='regexp:\p{Alpha}\w*'
string="regexp:('([^'\\]|\\.)*'|\"([^\"\\]|\\.)*\")"
]
}
// 3. Root rule (implicitly private)
root ::= item *
// 4. Top-level dispatch (private — no PSI node needed)
private item ::= !<<eof>> statement ';' {
pin=1
recoverWhile=item_recover
}
private item_recover ::= !(';' | id)
// 5. Statement alternatives (private dispatch)
private statement ::= assignment | function_call
// 6. Public rules (each generates a PSI node)
assignment ::= id '=' expr {pin=2}
function_call ::= id '(' [!')' expr (',' expr) *] ')' {pin(".*")=1}
// 7. Expression hierarchy
expr ::= add_group | mul_group | primary_group
private add_group ::= plus_expr | minus_expr
private mul_group ::= mul_expr | div_expr
private primary_group ::= literal_expr | ref_expr | paren_expr
plus_expr ::= expr '+' expr
minus_expr ::= expr '-' expr
mul_expr ::= expr '*' expr
div_expr ::= expr '/' expr
literal_expr ::= number | string
ref_expr ::= id
paren_expr ::= '(' expr ')' {pin=1}
The key principles at work here: global attributes and tokens go at the top; the root rule loops over a private helper; the !<<eof>> guard prevents an infinite loop at end of file; private rules dispatch to public rules so no unnecessary PSI nodes are created; and recovery rules sit near the rules they protect.
Choosing Private vs. Public¶
Rules are public by default. Every public rule generates a PSI class and an IElementType constant, which means every public rule creates a node in the PSI tree. Use private when a rule is structural plumbing that should not appear in the tree.
Common uses for private rules:
- Root loop items (private item ::= ...)
- Dispatch rules that group alternatives (private statement ::= select | delete | ...)
- Recovery predicates (private item_recover ::= !(';' | id))
- Operator priority groups in expressions (private mul_group ::= mul_expr | div_expr)
The Grammar-Kit guidance is direct: "Specify private attribute on any rule if you don't want it to be present in AST as early as possible."
Public helper rules create noise
If you leave helper rules public, the PSI tree fills with nodes that carry no semantic meaning. Compare:
// Unnecessary PSI nodes for 'item' and 'statement'
root ::= item *
item ::= !<<eof>> statement ';'
statement ::= assignment | function_call
// Clean tree — only 'assignment' and 'function_call' appear
root ::= item *
private item ::= !<<eof>> statement ';' {pin=1 recoverWhile=item_recover}
private item_recover ::= !(';' | id)
private statement ::= assignment | function_call
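The effect of private on tree shape can be modeled outside Grammar-Kit. The Python sketch below is an illustration only: `Match`, `build_tree`, and the tuple-based node format are hypothetical names invented for this example, not Grammar-Kit types. It shows the content of a private rule being spliced into the nearest public ancestor.

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    """One successful rule match (hypothetical model, not a Grammar-Kit type)."""
    rule: str
    private: bool = False
    children: list = field(default_factory=list)  # nested Match objects or token text


def build_tree(match):
    """Return a list of nodes: a public rule yields one node, a private rule yields its children."""
    nodes = []
    for child in match.children:
        if isinstance(child, Match):
            nodes.extend(build_tree(child))
        else:
            nodes.append(child)  # plain token text
    if match.private:
        return nodes  # no node of its own; content merges into the parent
    return [(match.rule, nodes)]


literal = Match("literal_expr", children=["1"])
assignment = Match("assignment", children=["x", "=", literal])
statement = Match("statement", private=True, children=[assignment])
item = Match("item", private=True, children=[statement, ";"])
root = Match("root", children=[item])  # stands in for the file node

tree = build_tree(root)
```

In this model `tree` contains `assignment` directly under the root, with no `item` or `statement` layer in between.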
Flattening with extends¶
Without extends, expression grammars produce deeply nested PSI trees. A simple literal like 42 might be wrapped in several layers:
FileNode
  Expr
    PlusExpr
      LiteralExpr
        number: '42'
Adding extends(".*_expr")=expr collapses the redundant wrapping nodes. The root expression rule never appears in the tree, and the AST becomes flat:
FileNode
  LiteralExpr
    number: '42'
Here is the before and after in grammar form:
// BEFORE: no extends — deep AST
expr ::= factor plus_expr *
left plus_expr ::= ('+' | '-') factor
private factor ::= primary mul_expr *
left mul_expr ::= ('*' | '/') primary
private primary ::= literal_expr
literal_expr ::= number
// AFTER: with extends — flat AST
{
extends(".*_expr")=expr
}
expr ::= factor plus_expr *
left plus_expr ::= ('+' | '-') factor
private factor ::= primary mul_expr *
left mul_expr ::= ('*' | '/') primary
private primary ::= literal_expr
literal_expr ::= number
For the full expression parsing framework, including priority tables and associativity, see Expression Parsing.
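As a rough mental model of the flattening, the Python sketch below drops a wrapper node when its only child is a node whose type extends the wrapper's type. This collapse rule is an assumption made for illustration, and `collapse` and the `extends` dictionary are hypothetical names; the real behavior lives in Grammar-Kit's generated parser and runtime.

```python
def collapse(node, extends):
    """Drop a wrapper whose single child is a node of a type that extends the wrapper's type."""
    kind, children = node
    children = [collapse(c, extends) if isinstance(c, tuple) else c for c in children]
    if len(children) == 1 and isinstance(children[0], tuple):
        child_kind = children[0][0]
        if extends.get(child_kind) == kind:
            return children[0]  # the wrapper adds nothing, keep only the child
    return (kind, children)


# Mirrors extends(".*_expr")=expr for three expression rules.
extends = {"plus_expr": "expr", "mul_expr": "expr", "literal_expr": "expr"}

deep = ("file", [("expr", [("literal_expr", ["42"])])])
flat = collapse(deep, extends)

binary = ("expr", [("plus_expr", [("literal_expr", ["1"]), "+", ("literal_expr", ["2"])])])
flat_binary = collapse(binary, extends)
```

Under this model the `expr` wrapper disappears in both cases, while `plus_expr` stays because it has more than one child.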
Common Patterns¶
These patterns are reusable building blocks. Adapt them to your language rather than designing from scratch.
Lists and Separators¶
A comma-separated list appears in almost every grammar. Define it once as a meta rule and reuse it:
// Define once as a meta rule
meta comma_list ::= <<param>> (',' <<param>>) *
// Use for any comma-separated construct
import_list ::= <<comma_list import_item>>
param_list ::= <<comma_list param_decl>>
arg_list ::= <<comma_list expr>>
For parenthesized lists that need error recovery, use this pattern:
list ::= '(' [!')' item (',' item) *] ')' {pin(".*")=1}
item ::= number {recoverWhile=item_recover}
private item_recover ::= !(',' | ')')
The pin(".*")=1 pins every sub-sequence at its first item. The !')' lookahead before item keeps an empty list from being reported as an error. The recoverWhile on each item skips unrecognized tokens until , or ).
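The skipping behavior of recoverWhile can be pictured with a small Python model. This is a simplified sketch, not Grammar-Kit's algorithm: `recover_while`, `item_recover`, and `parse_list` are hypothetical helpers, tokens are plain strings, and the opening parenthesis is assumed to have been consumed already.

```python
def recover_while(tokens, pos, predicate):
    """Skip tokens while `predicate` holds and return the new position."""
    while pos < len(tokens) and predicate(tokens[pos]):
        pos += 1
    return pos


def item_recover(token):
    # Mirrors: private item_recover ::= !(',' | ')')
    return token not in (",", ")")


def parse_list(tokens):
    """Collect number items from '(' ... ')' and record one error per bad item."""
    items, errors = [], []
    pos = 1  # assumption: tokens[0] == "(" was matched, which is where the pin applies
    while pos < len(tokens) and tokens[pos] != ")":
        if tokens[pos].isdigit():
            items.append(tokens[pos])
            pos += 1
        else:
            errors.append(f"number expected, got {tokens[pos]!r}")
        pos = recover_while(tokens, pos, item_recover)  # resynchronize at ',' or ')'
        if pos < len(tokens) and tokens[pos] == ",":
            pos += 1
    return items, errors


items, errors = parse_list(["(", "1", ",", "?", "?", ",", "3", ")"])
```

The two unrecognized tokens are skipped as a unit, so the item after the next comma is still parsed.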
You can simplify the recovery predicate by using #auto, which computes !FOLLOWS(item) automatically:
list ::= '(' [!')' item (',' item) *] ')' {pin(".*")=1}
item ::= number {recoverWhile="#auto"}
To allow trailing commas, use an and-predicate that accepts ) after the last comma:
element_list ::= '(' element (',' (element | &')'))* ')' {pin(".*")=1}
After each comma the parser either matches another element or, through &')', checks that ) comes next without consuming it, so a comma directly before the closing parenthesis is accepted.
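A hand-written recognizer makes the trailing-comma logic concrete. The Python sketch below mirrors the shape of element_list under simplifying assumptions: `accepts` is a hypothetical function, every element is the single token "e", and pin handling is ignored.

```python
def accepts(tokens):
    """Recognize: '(' element (',' (element | &')'))* ')' where element is the token "e"."""
    pos = 0

    def eat(tok):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == tok:
            pos += 1
            return True
        return False

    if not (eat("(") and eat("e")):
        return False
    while eat(","):
        if eat("e"):
            continue
        if pos < len(tokens) and tokens[pos] == ")":
            continue  # and-predicate: succeeds without consuming ')'
        return False  # neither an element nor ')' after the comma
    return eat(")") and pos == len(tokens)
```

A single trailing comma is accepted, two consecutive commas are not, and the rule as written still requires at least one element.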
For full details on pin, recoverWhile, and #auto, see Error Recovery.
Declarations and Blocks¶
The property or assignment pattern pins on the operator. Once the = is seen, the rule is committed. A missing right-hand side produces an error, but the PSI node is still created:
property ::= id '=' expr {pin=2}
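The commit point that pin introduces can be sketched in a few lines of Python. This is a simplified model rather than the generated parser: `parse_sequence` is a hypothetical helper, tokens are represented by their type names, and expr is reduced to a single "number" token.

```python
def parse_sequence(tokens, expected, pin):
    """Match `expected` token types in order; `pin` is 1-based like the grammar attribute.

    Returns (node_created, tokens_consumed, errors).
    """
    pos = 0
    pinned = False
    for index, want in enumerate(expected, start=1):
        if pos < len(tokens) and tokens[pos] == want:
            pos += 1
            if index == pin:
                pinned = True  # from here on the rule is committed
        elif pinned:
            return True, pos, [f"{want} expected"]  # node is kept, error is reported
        else:
            return False, 0, []  # plain failure, nothing consumed
    return True, pos, []


PROPERTY = ["id", "=", "number"]  # property ::= id '=' expr {pin=2}

complete = parse_sequence(["id", "=", "number"], PROPERTY, pin=2)
missing_rhs = parse_sequence(["id", "="], PROPERTY, pin=2)
no_operator = parse_sequence(["id", "number"], PROPERTY, pin=2)
```

Only the input that reaches the `=` produces a node with an error; the input without the operator fails outright and leaves the tokens for another rule.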
Block structures pin on the opening delimiter:
block ::= '{' statement * '}' {pin=1}
For optional elements, use the ? quantifier or bracket syntax (they are equivalent):
// Using ? quantifier (default_value is a separate rule)
field_decl ::= type_ref id default_value? ';'
default_value ::= '=' expr
// Using bracket syntax (same optionality, written inline)
field_decl2 ::= type_ref id ['=' expr] ';'
Nested structures use lookahead negation to avoid consuming the closing delimiter. The JSON grammar demonstrates this clearly with !'}' and !']' guards before list items.
Statement-Level Design¶
Statement-oriented languages follow a consistent pattern: a root loop, a dispatch rule, and individual pinned statements. Here is a complete example:
{
extends(".*_statement")=statement
pin("create_.*")=2
consumeTokenMethod(".*_recover")="consumeTokenFast"
tokens=[
SEMI=';'
LP='('
RP=')'
space='regexp:\s+'
id='regexp:\p{Alpha}\w*'
number='regexp:\d+'
]
}
// Root with loop
script ::= script_item *
// Private loop item with recovery
private script_item ::= !<<eof>> statement ';' {
pin=1
recoverWhile=statement_recover
}
private statement_recover ::= !(';' | CREATE | DROP | SELECT)
// Statement dispatch (public: it is the base type named by the extends pattern)
statement ::= create_statement | drop_statement | select_statement
// Individual statements (public — each gets a PSI node)
create_statement ::= CREATE TABLE id '(' column_list ')' {pin=2}
drop_statement ::= DROP TABLE id {pin=2}
select_statement ::= SELECT column_list FROM id {pin=1}
// Shared sub-rules
column_list ::= column_def (',' column_def) *
column_def ::= id id {pin=1} // name type
The design decisions here: extends(".*_statement")=statement creates a PSI hierarchy with Statement as the base type. The pattern-based pin pin("create_.*")=2 pins all create statements at position 2. Statement recovery stops at ; or any statement-starting keyword. Each statement type gets its own PSI class through public rules, and column_list is public because it carries semantic meaning in the tree.
JSON Grammar Walkthrough¶
The JSON grammar from Grammar-Kit's test data demonstrates several patterns working together. Read the annotations for each design decision:
{
tokens = [
space='regexp:\s+'
string = "regexp:\"[^\"]*\"|'[^']*'"
number = "regexp:(\+|\-)?\p{Digit}*"
id = "regexp:\p{Alpha}\w*"
comma = ","
colon = ":"
brace1 = "{"
brace2 = "}"
brack1 = "["
brack2 = "]"
]
extends("array|object|json")=value
}
root ::= json
json ::= array | object
value ::= string | number | json {name="value"}
// Array: pinned bracketed list with item recovery
array ::= '[' [!']' item (!']' ',' item) *] ']' {pin(".*")=1 extends=json}
private item ::= json {recoverWhile=recover}
// Object: pinned braced list with property recovery
object ::= '{' [!'}' prop (!'}' ',' prop) *] '}' {pin(".*")=1 extends=json}
prop ::= [] name ':' value {pin=1 recoverWhile=recover}
name ::= id | string {name="name"}
// Shared recovery predicate — stops at structural delimiters
private recover ::= !(',' | ']' | '}' | '[' | '{')
Several patterns are at work here. The item rule is private, so array elements attach directly to the array node without a wrapper. One recover rule serves both array items and object properties. The !']' and !'}' lookahead guards prevent consuming closing delimiters.
The prop ::= [] trick with pin=1 makes name optional in error scenarios because the empty optional always matches, so the pin is always reached. The name attribute on value improves error messages from a raw token list to the readable <value> expected. The nested extends chain (array and object extend json, which extends value) builds a clean PSI hierarchy.
Reducing Repetition¶
As grammars grow, repeated attributes and structural patterns become a maintenance burden. Grammar-Kit provides two mechanisms to address this: pattern-based attributes and meta rules.
Pattern-based attributes apply a single attribute to every rule whose name matches a regex. Compare the verbose approach to the concise one:
// BEFORE: attributes repeated on each rule
plus_expr ::= expr '+' expr {extends=expr}
minus_expr ::= expr '-' expr {extends=expr}
mul_expr ::= expr '*' expr {extends=expr}
div_expr ::= expr '/' expr {extends=expr}
// AFTER: single pattern attribute covers all rules
{
extends(".*_expr")=expr
}
plus_expr ::= expr '+' expr
minus_expr ::= expr '-' expr
mul_expr ::= expr '*' expr
div_expr ::= expr '/' expr
The same approach works for pin, name, and consumeTokenMethod:
- pin(".*_list(?:_\\d+)*")=1 pins all list rules and their sub-expressions.
- name(".*_expr")='expression' makes error messages say <expression> expected instead of listing every token.
- consumeTokenMethod(".*_recover")="consumeTokenFast" skips error-reporting overhead in recovery predicates.
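To check which rules a pattern will reach before generating code, the patterns can be tried against rule names by hand. The Python sketch below assumes that a pattern must match the whole rule name; `PATTERN_ATTRS` and `attrs_for` are hypothetical names, and Python's `re` stands in for the Java regex engine.

```python
import re

# (attribute, pattern, value) triples taken from the examples above.
PATTERN_ATTRS = [
    ("pin", r".*_list(?:_\d+)*", 1),
    ("name", r".*_expr", "expression"),
    ("consumeTokenMethod", r".*_recover", "consumeTokenFast"),
]


def attrs_for(rule_name):
    """Return the pattern-based attributes whose regex matches the whole rule name."""
    return {
        attr: value
        for attr, pattern, value in PATTERN_ATTRS
        if re.fullmatch(pattern, rule_name)
    }
```

With these patterns, `arg_list` and the generated-style name `arg_list_1_0` both pick up the pin, while a name such as `list_item` matches nothing.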
Cleaner error messages with name
Without the name attribute, a failed expression match produces something like: '+', '-', '*', '/', number, id, '(' expected. With name(".*_expr")='expression', the message becomes: <expression> expected.
Meta rules extract reusable structural patterns. The comma_list meta rule shown in the Lists and Separators section is the most common example.
For large grammars that generate thousands of lines of code, split the parser across multiple classes using ;{ parserClass="..." } section separators:
// Main parser handles top-level structure
{
parserClass="com.example.MainParser"
// Needed because binary_expr below is left-recursive
extends(".*_expr")=expression
tokens=[...]
}
root ::= (statement | expression) *
statement ::= var_decl | assignment
// Expression rules go into a separate parser class
;{
parserClass="com.example.ExpressionParser"
}
meta comma_list ::= <<param>> (',' <<param>>) *
expression ::= binary_expr | unary_expr | atom_expr
binary_expr ::= expression ('+' | '-' | '*' | '/') expression
unary_expr ::= ('-' | '!') expression
atom_expr ::= id | number | '(' expression ')'
// Utility parsing goes into a third class
;{
parserClass="com.example.UtilityParser"
}
type_ref ::= id ('.' id) *
qualified_name ::= id ('.' id) *
Each ;{parserClass="..."} section generates methods in the named class.
Pitfalls and Inspections¶
Grammar-Kit's IDE integration catches many common mistakes through inspections. Knowing these pitfalls upfront saves debugging time.
Left Recursion¶
Left recursion causes a StackOverflowError in recursive descent parsers, because the rule calls itself before consuming any input. The BnfLeftRecursion inspection detects it with the message: '<ruleName>' employs left-recursion unsupported by generator.
// BAD: left recursion causes StackOverflowError
expr ::= expr '+' term | term
There are two fixes. Refactor to an iterative form:
// GOOD: iterative
expr ::= term ('+' term) *
Or use the expression parsing framework, where left recursion is handled automatically through the extends mechanism:
// GOOD: expression parsing framework handles left recursion
{
extends(".*_expr")=expr
}
expr ::= plus_expr | literal_expr
plus_expr ::= expr '+' expr
literal_expr ::= number
Left recursion is only valid within the expression parsing framework (rules with extends pointing to a common root). See Expression Parsing for details.
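The difference between the left-recursive rule and its iterative rewrite can be reproduced with a hand-written recursive descent parser. The Python sketch below is an analogy, not generated code: the function names are hypothetical, term is assumed to be a single token, and Python's RecursionError plays the role of Java's StackOverflowError.

```python
def parse_term(tokens, pos):
    """term: assume a single number token."""
    return tokens[pos], pos + 1


def parse_expr_left_recursive(tokens, pos=0):
    # expr ::= expr '+' term | term
    # The first step re-enters the same rule at the same position, so it never returns.
    left, pos = parse_expr_left_recursive(tokens, pos)
    right, pos = parse_term(tokens, pos + 1)
    return ("+", left, right), pos


def parse_expr_iterative(tokens, pos=0):
    # expr ::= term ('+' term) *
    left, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        left = ("+", left, right)  # fold to the left to keep left associativity
    return left, pos


tree, end = parse_expr_iterative(["1", "+", "2", "+", "3"])
```

The iterative version consumes a term before looping, so every step makes progress, and folding inside the loop still yields the left-associative tree that the left-recursive rule was meant to describe.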
Unreachable Choice Branches¶
Grammar-Kit uses PEG ordered choice: the first matching branch wins. A branch preceded by one that can match empty input is never reached. The BnfUnreachableChoiceBranch and BnfIdenticalChoiceBranches inspections detect these problems:
// BAD: identical branches
value ::= number | string | number // third branch is identical to first
// BAD: unreachable branch (preceded by branch matching empty)
item ::= optional_thing? | concrete_thing // first branch always matches (empty)
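Ordered choice is easy to model with parser combinators. In the Python sketch below, `token`, `optional`, and `choice` are hypothetical helpers written for this example; each parser returns the new position on success or None on failure, and `choice` records which branches it tried.

```python
def token(tok):
    """Match one token and return the next position, or None on failure."""
    def parse(tokens, pos):
        return pos + 1 if pos < len(tokens) and tokens[pos] == tok else None
    return parse


def optional(parser):
    """Match `parser` if possible; otherwise succeed without consuming anything."""
    def parse(tokens, pos):
        result = parser(tokens, pos)
        return result if result is not None else pos
    return parse


def choice(*branches):
    """PEG ordered choice: the first branch that succeeds wins."""
    tried = []

    def parse(tokens, pos):
        for index, branch in enumerate(branches):
            tried.append(index)
            result = branch(tokens, pos)
            if result is not None:
                return result
        return None

    parse.tried = tried
    return parse


# item ::= optional_thing? | concrete_thing
bad_item = choice(optional(token("optional_thing")), token("concrete_thing"))
bad_end = bad_item(["concrete_thing"], 0)

# Reordered so the branch that can match empty comes last.
good_item = choice(token("concrete_thing"), optional(token("optional_thing")))
good_end = good_item(["concrete_thing"], 0)
```

In the first ordering the choice succeeds at position 0 without ever trying the second branch; in the second ordering the token is consumed.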
Public Recovery Predicates¶
Recovery predicates should always be private. A public recovery rule creates an unwanted PSI node. The BnfUnusedRuleInspection warns with "Non-private recovery rule."
// BAD: creates unwanted PSI node
item_recover ::= !(',' | ')')
// GOOD: recovery predicates must be private
private item_recover ::= !(',' | ')')
Missing Pin with recoverWhile¶
A recoverWhile attribute without a corresponding pin (on the rule itself or somewhere in its sub-rules) means recovery runs but the rule never commits. The node is not created in the PSI tree.
// BAD: recoverWhile without pin — recovery runs but rule never commits
item ::= number {recoverWhile=item_recover}
// GOOD: pin must be present somewhere
list ::= '(' [!')' item (',' item) *] ')' {pin(".*")=1}
item ::= number {recoverWhile=item_recover}
private item_recover ::= !(',' | ')')
Rule Name Conflicts¶
Grammar-Kit generates internal methods for sub-expressions, named rule_name_0 and, for nested sub-expressions, rule_name_N1_N2_..._NX. A rule of your own whose name follows this pattern can collide with them:
// BAD: conflicts with generated sub-expression name
my_rule ::= a b c
my_rule_0 ::= d e f
Avoid naming rules with the rule_name_N1_N2_..._NX pattern.
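A quick check over the rule names can catch such collisions early. The Python sketch below is a hypothetical lint, not a Grammar-Kit inspection: `conflicting_rule_names` flags a name only when stripping a purely numeric suffix leaves the name of another rule in the same grammar.

```python
def conflicting_rule_names(rule_names):
    """Return names of the form <other_rule>_N1_..._NX, where <other_rule> is also a rule."""
    names = set(rule_names)
    conflicts = []
    for name in rule_names:
        parts = name.split("_")
        # Try every split point where the remaining suffix is all-numeric.
        for i in range(1, len(parts)):
            suffix_is_numeric = all(p.isdigit() for p in parts[i:])
            if suffix_is_numeric and "_".join(parts[:i]) in names:
                conflicts.append(name)
                break
    return conflicts


found = conflicting_rule_names(["my_rule", "my_rule_0", "utf_8", "list", "list_1_0"])
```

Names such as `utf_8` are left alone in this sketch because no rule called `utf` exists to generate a colliding method.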
Token Conflicts¶
Prefer declared tokens over text-matched (quoted) tokens. Text-matched tokens are compared by text at parse time and can span multiple lexer tokens, which is slower and can produce unexpected behavior. The BnfSuspiciousToken inspection highlights tokens that look like they should be rule references.
IDE inspections
Grammar-Kit provides inspections for left recursion, unused rules, duplicate rules, unresolved references, identical and unreachable choice branches, suspicious tokens, and unused attributes. Use Quick Documentation (Ctrl+Q / Cmd+J) to see FIRST/FOLLOWS sets, recovery predicate expansions, and expression priority tables for any rule. Use Live Preview (Ctrl+Alt+P) to test your grammar interactively without generating code. See Live Preview for the full workflow.
Rule names follow snake_case (property_recover, root_item, literal_expr). Generated PSI class names derive from rule names via CamelCase conversion, with any psiClassPrefix prepended: literal_expr becomes LiteralExpr, or MyLangLiteralExpr with psiClassPrefix="MyLang". Generated parser methods follow the rule name directly: static boolean literal_expr(..).
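The name mapping can be approximated in a few lines of Python. This sketch covers only the simple snake_case case described above: `psi_class_name` and `element_type_name` are hypothetical helpers, the upper-case form of the element type constant is an assumption, and Grammar-Kit's own conversion may treat edge cases differently.

```python
def psi_class_name(rule_name, prefix="", suffix=""):
    """Convert a snake_case rule name to a CamelCase class name with optional prefix and suffix."""
    camel = "".join(part[0].upper() + part[1:] for part in rule_name.split("_") if part)
    return f"{prefix}{camel}{suffix}"


def element_type_name(rule_name):
    """Assumed form of the element type constant: the rule name in upper case."""
    return rule_name.upper()


interface_name = psi_class_name("literal_expr", prefix="MyLang")
impl_name = psi_class_name("literal_expr", prefix="MyLang", suffix="Impl")
```

The prefix and suffix arguments correspond to the psiClassPrefix and psiImplClassSuffix attributes from the template at the top of this page.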
For the complete attribute catalog, including extends, pin, name, and consumeTokenMethod, see the Attributes System. For error recovery mechanics, see Error Recovery.