ANTLR4 TokenStreamRewriter.getText() loses spaces in custom DOCTYPE parser rule

I am using ANTLR4 to parse XML files and then output the parsed content using TokenStreamRewriter.getText().
For normal XML tags (e.g., <foo> and <gm-A-1/>), the output is correct: spaces and line breaks are preserved.

However, for a custom DOCTYPE parser rule I added, the output loses all spaces, even though the original XML contains spaces and line breaks.

Expected behavior:
I want the DOCTYPE output to preserve the original formatting, including spaces and line breaks, like this:

<!DOCTYPE doc [ <!ENTITY e0 ''> <!ENTITY e1 '&e0;'> ]>

Actual behavior:
The output produced by TokenStreamRewriter.getText() looks like this:

<!DOCTYPEdoc[<!ENTITYe0''><!ENTITYe1'&e0;'>]>

All spaces are removed, and everything is concatenated.

Parser rules used:

dtd        : DOCTYPE_OPEN Name LBRACK misc* entityDecl* RBRACK CLOSE misc* entityRef? ;
entityDecl : ENTITY_OPEN Name STRING misc* CLOSE ;
entityRef  : '<' Name '>' EntityRef '<' '/' Name '>' ;

Problem description:
I suspect the issue comes from the lexer layer. TokenStreamRewriter.getText() concatenates the text of the tokens held in the token stream, so whitespace that the lexer discards (for example with a -> skip command) never becomes a token and cannot appear in the output.
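
To make that suspicion concrete, here is a toy model in plain Python. It does not use the ANTLR runtime, and the names `lex`, `get_text`, `parser_view` and the regular expression are made up for illustration. It assumes, as I understand the rewriter, that the text is rebuilt from every token in the buffer regardless of channel. The "skip" branch stands in for a lexer rule ending in `-> skip`; the "hidden" branch stands in for one ending in `-> channel(HIDDEN)`.

```python
import re
from dataclasses import dataclass

# Channel numbers chosen to mirror ANTLR's default and hidden channels.
DEFAULT, HIDDEN = 0, 1


@dataclass
class Token:
    text: str
    channel: int


# Hypothetical, simplified token pattern: whitespace first, then everything else.
TOKEN_RE = re.compile(
    r"(?P<WS>\s+)|(?P<TOK><!DOCTYPE|<!ENTITY|'[^']*'|[A-Za-z_][\w.-]*|.)"
)


def lex(source, ws_action):
    """Tokenize source; ws_action is "skip" or "hidden"."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup == "WS":
            if ws_action == "skip":
                continue  # the whitespace token is never created
            tokens.append(Token(match.group(), HIDDEN))
        else:
            tokens.append(Token(match.group(), DEFAULT))
    return tokens


def get_text(tokens):
    """Rebuild text from every buffered token, whatever its channel."""
    return "".join(token.text for token in tokens)


def parser_view(tokens):
    """Only default-channel tokens are visible to the parser rules."""
    return [token.text for token in tokens if token.channel == DEFAULT]


SOURCE = "<!DOCTYPE doc [ <!ENTITY e0 ''> <!ENTITY e1 '&e0;'> ]>"
skipped = lex(SOURCE, "skip")
hidden = lex(SOURCE, "hidden")

print(get_text(skipped))  # whitespace is gone
print(get_text(hidden))   # identical to SOURCE
```

In this model both variants hand the parser the same default-channel tokens, so the dtd and entityDecl rules would not need to change. Only the variant that keeps whitespace on a hidden channel can reproduce the original text.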

I want to know:

How can I modify the ANTLR4 lexer/parser rules so that DOCTYPE output preserves spaces?

Is there a recommended approach to ensure TokenStreamRewriter.getText() outputs text that is almost identical to the original XML?

What I have tried:

Manually adding spaces in a visitor works, but it is cumbersome.

Using rewriter.getText(ctx.start, ctx.stop) does not help if whitespace tokens are skipped.
