I am using ANTLR4 to parse XML files and then output the parsed content using TokenStreamRewriter.getText().
For normal XML tags (e.g., <foo> and <gm-A-1/>), the output is correct: spaces and line breaks are preserved.
However, for a custom DOCTYPE parser rule I added, the output loses all spaces, even though the original XML contains spaces and line breaks.
Expected behavior:
I want the DOCTYPE output to preserve the original formatting, including spaces and line breaks, like this:
Actual behavior:
The output produced by TokenStreamRewriter.getText() looks like this:
All spaces are removed, and everything is concatenated.
Parser rules used:
dtd : DOCTYPE_OPEN Name LBRACK misc* entityDecl* RBRACK CLOSE misc* entityRef? ;
entityDecl : ENTITY_OPEN Name STRING misc* CLOSE ;
entityRef : '<' Name '>' EntityRef '<' '/' Name '>' ;
Problem description:
I suspect the issue originates in the lexer. TokenStreamRewriter.getText() concatenates the text of the tokens held in the token stream, so whitespace that the lexer discards (for example with -> skip) never reaches the stream and cannot appear in the output.
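To make the suspicion concrete, here is a toy model in Python. It is not ANTLR's implementation: the regex, `tokenize`, and `get_text` are hypothetical stand-ins, and the sample DOCTYPE string is made up for illustration. It only shows that if the output is built by concatenating the tokens in the stream, whitespace dropped by the lexer is gone, while whitespace kept on a hidden channel survives.

```python
import re

# Toy model only: a hand-written tokenizer standing in for a generated lexer.
TOKEN_RE = re.compile(r'\s+|<!DOCTYPE|<!ENTITY|"[^"]*"|[\[\]<>/]|[^\s\[\]<>/"]+')

def tokenize(text, skip_whitespace):
    """Return (text, channel) pairs; whitespace is dropped or kept as hidden."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        piece = match.group()
        if piece.isspace():
            if skip_whitespace:
                continue  # models '-> skip': the token never enters the stream
            tokens.append((piece, "HIDDEN"))  # models '-> channel(HIDDEN)'
        else:
            tokens.append((piece, "DEFAULT"))
    return tokens

def get_text(tokens):
    """Concatenate every token in the stream, regardless of channel."""
    return "".join(piece for piece, _channel in tokens)

# Made-up sample input, not taken from my real XML.
source = '<!DOCTYPE doc [\n  <!ENTITY a "b">\n]>'
skipped = get_text(tokenize(source, skip_whitespace=True))
hidden = get_text(tokenize(source, skip_whitespace=False))
```

In this model, `skipped` comes out with everything run together, which matches what I see for DOCTYPE, while `hidden` is identical to `source`. A parser would only look at the `DEFAULT` tokens in either case, so the grammar rules would not need to mention whitespace.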
I want to know:
How can I modify the ANTLR4 lexer/parser rules so that DOCTYPE output preserves spaces?
Is there a recommended approach to ensure TokenStreamRewriter.getText() outputs text that is almost identical to the original XML?
What I have tried:
Manually adding spaces in a visitor works, but it is cumbersome.
Using rewriter.getText(ctx.start, ctx.stop) does not help if whitespace tokens are skipped.
