ANTLR4 TokenStreamRewriter.getText() loses spaces in custom DOCTYPE parser rule

I am using ANTLR4 to parse XML files and then output the parsed content using TokenStreamRewriter.getText().
For normal XML tags (e.g., <foo> and <gm-A-1/>), the output is correct: spaces and line breaks are preserved.

However, for a custom DOCTYPE parser rule I added, the output loses all spaces, even though the original XML contains spaces and line breaks.

Expected behavior:
I want the DOCTYPE output to preserve the original formatting, including spaces and line breaks, like this:

<!DOCTYPE doc [ <!ENTITY e0 ''> <!ENTITY e1 '&e0;'> ]>

Actual behavior:
The output produced by TokenStreamRewriter.getText() looks like this:

<!DOCTYPEdoc[<!ENTITYe0''><!ENTITYe1'&e0;'>]>

All spaces are removed, and everything is concatenated.

Parser rules used:

dtd : DOCTYPE_OPEN Name LBRACK misc* entityDecl* RBRACK CLOSE misc* entityRef? ;
entityDecl : ENTITY_OPEN Name STRING misc* CLOSE ;
entityRef : '<' Name '>' EntityRef '<' '/' Name '>' ;

Problem description:
I suspect this issue comes from the lexer layer. TokenStreamRewriter.getText() simply concatenates the text of the tokens in the token stream, so if the lexer discards whitespace tokens (for example with -> skip) inside the DOCTYPE section, that whitespace never reaches the output.
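To make that suspicion concrete, here is a toy Java sketch of the mechanism. It does not use ANTLR: the class WhitespaceChannelSketch, the Tok type, and the methods lex, concatAll, and parserVisibleCount are hypothetical stand-ins, and splitting on whitespace runs is a deliberate simplification of a real lexer. It only models the difference between a whitespace rule that drops its tokens (like -> skip) and one that keeps them on a hidden channel (like -> channel(HIDDEN)).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are hypothetical stand-ins, not ANTLR's classes.
final class WhitespaceChannelSketch {
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 1;

    static final class Tok {
        final String text;
        final int channel;

        Tok(String text, int channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    // Splits the input into alternating whitespace and non-whitespace runs.
    // keepWhitespace = false models "-> skip" (the token is never created);
    // keepWhitespace = true models "-> channel(HIDDEN)" (created, but off-channel).
    static List<Tok> lex(String input, boolean keepWhitespace) {
        List<Tok> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            boolean ws = Character.isWhitespace(input.charAt(i));
            int j = i;
            while (j < input.length() && Character.isWhitespace(input.charAt(j)) == ws) {
                j++;
            }
            String text = input.substring(i, j);
            if (!ws) {
                tokens.add(new Tok(text, DEFAULT_CHANNEL));
            } else if (keepWhitespace) {
                tokens.add(new Tok(text, HIDDEN_CHANNEL));
            }
            i = j;
        }
        return tokens;
    }

    // Models a rewriter-style getText(): joins every buffered token, whatever its channel.
    static String concatAll(List<Tok> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Tok t : tokens) {
            sb.append(t.text);
        }
        return sb.toString();
    }

    // Models what the parser matches against: default-channel tokens only.
    static long parserVisibleCount(List<Tok> tokens) {
        return tokens.stream().filter(t -> t.channel == DEFAULT_CHANNEL).count();
    }

    public static void main(String[] args) {
        String src = "<!DOCTYPE doc [ <!ENTITY e0 ''> ]>";
        System.out.println(concatAll(lex(src, false))); // whitespace dropped
        System.out.println(concatAll(lex(src, true)));  // whitespace kept
    }
}
```

In this model the parser sees the same default-channel tokens either way, so the parser rules would not need misc-style whitespace handling just to keep the text. Only the concatenated output changes. Whether this maps onto the actual grammar depends on how the lexer (not shown here) treats whitespace in the mode used for the DOCTYPE section.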

I want to know:

How can I modify the ANTLR4 lexer/parser rules so that DOCTYPE output preserves spaces?

Is there a recommended approach to ensure TokenStreamRewriter.getText() outputs text that is almost identical to the original XML?

What I have tried:

Manually adding spaces in a visitor works, but it is cumbersome.

Using rewriter.getText(ctx.start, ctx.stop) does not help if whitespace tokens are skipped.
