I have a Python pipeline that processes legal opinions into structured JSON and rendered HTML. The extraction mostly works, but page numbers are inconsistent across source formats.
I’m trying to determine the fastest deterministic recovery strategy before resorting to OCR or model reruns.
I need help with this specific sub-problem:
Source HTML sometimes contains inline page markers
Rendered HTML sometimes drops or misplaces them
Plain text and PDF text extraction may still contain usable markers
I want to distinguish:
renderer-only loss
extraction loss
cases where fallback recovery is needed
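To make the three cases concrete, here is a minimal sketch of how they might be told apart. It uses only the standard library and takes the three texts as arguments. The marker pattern (`*12`-style star pagination), the function names, and the rule that a gap in the page sequence means "fallback needed" are all assumptions for illustration, not an established method.

```python
import re
from html.parser import HTMLParser

# Assumption: markers look like "*12". Footnote asterisks followed by digits
# would be false positives and need a stricter pattern.
MARKER_RE = re.compile(r"\*(\d+)")


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def marker_pages(text):
    """Return the set of page numbers that appear as markers in text."""
    return {int(n) for n in MARKER_RE.findall(text)}


def classify_loss(source_html, rendered_html, pdf_text):
    """Label every page between the lowest and highest marker seen anywhere."""
    in_source = marker_pages(visible_text(source_html))
    in_rendered = marker_pages(visible_text(rendered_html))
    in_pdf = marker_pages(pdf_text)
    seen = in_source | in_rendered | in_pdf
    report = {}
    if not seen:
        return report
    for page in range(min(seen), max(seen) + 1):
        if page in in_rendered:
            report[page] = "ok"
        elif page in in_source:
            report[page] = "renderer-only loss"
        elif page in in_pdf:
            report[page] = "extraction loss"
        else:
            # Hypothetical rule: a gap in the sequence means no input has it.
            report[page] = "fallback needed"
    return report
```

The gap rule only makes sense when pagination is expected to be contiguous, which may not hold for excerpts or consolidated opinions.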
Given inputs like these:
    source_html = """
    <p>Text before <span class="page-number">*12</span> text after.</p>
    """
    rendered_html = """
    <p>Text before text after.</p>
    """
    pdf_text = """
    Text before *12 text after
    """

What is the most reliable Python approach to:
detect whether the page marker was lost only during rendering
recover it deterministically from source HTML or extracted PDF text
reinsert it into rendered HTML without corrupting paragraph/footnote structure
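For the reinsertion step, one conservative option is to anchor each marker on the text that precedes it and to touch the rendered HTML only when that anchor occurs exactly once inside a single text node. The sketch below assumes that rule. The helper names, the 40-character context width, and the `page-number` class default are illustrative, and the anchor is compared against the raw text between tags, so entities such as `&amp;` would need normalizing first.

```python
import re
from html import escape

MARKER_RE = re.compile(r"\*\d+")          # assumed marker shape, e.g. "*12"
TAG_SPLIT_RE = re.compile(r"(<[^>]+>)")   # naive tag splitter, not a full parser


def contexts_from_text(text, width=40):
    """Yield (marker, preceding text) for each marker found in plain text."""
    for m in MARKER_RE.finditer(text):
        yield m.group(0), text[max(0, m.start() - width):m.start()]


def reinsert_marker(rendered_html, before_context, marker, css_class="page-number"):
    """Insert a marker span after a unique anchor; return (html, success)."""
    anchor = " ".join(before_context.split())  # collapse whitespace and newlines
    if not anchor:
        return rendered_html, False
    # With a capturing group, split() alternates text (even) and tags (odd).
    pieces = TAG_SPLIT_RE.split(rendered_html)
    hits = []
    for i in range(0, len(pieces), 2):
        for m in re.finditer(re.escape(anchor), pieces[i]):
            hits.append((i, m.end()))
    if len(hits) != 1:
        # Missing or ambiguous anchor: leave the HTML untouched.
        return rendered_html, False
    index, offset = hits[0]
    span = f' <span class="{css_class}">{escape(marker)}</span>'
    pieces[index] = pieces[index][:offset] + span + pieces[index][offset:]
    return "".join(pieces), True
```

Because only the inside of one text node is modified, paragraph and footnote tags stay balanced. The cost is that anchors spanning inline tags such as `<em>` are reported as failures rather than guessed.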
Constraints:
needs to work at large scale
should avoid OCR unless necessary
should avoid paid model reruns where possible
Libraries I’m considering: PyMuPDF, pdfplumber, pdftotext, lxml/BeautifulSoup, regex-based alignment.
I’d especially value answers that discuss:
text alignment strategies
marker anchoring
how to handle inline markers safely
failure cases in messy legal documents
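As a starting point for the alignment discussion, here is a minimal sketch of word-level alignment using the standard library's `difflib.SequenceMatcher`. It assumes whitespace tokenization and that a placement is trustworthy only when the tokens on both sides of the marker align to adjacent tokens in the rendered text. The function name and that acceptance rule are illustrative assumptions.

```python
import difflib
import re

MARKER_TOKEN_RE = re.compile(r"^\*\d+$")  # assumed marker shape, e.g. "*12"


def locate_markers(reference_text, rendered_text):
    """Return (marker, rendered_token_index or None) for each lost marker.

    The index is the rendered token the marker should be placed before.
    """
    ref_tokens = reference_text.split()
    out_tokens = rendered_text.split()
    matcher = difflib.SequenceMatcher(a=ref_tokens, b=out_tokens, autojunk=False)

    # Map each aligned reference token index to its rendered token index.
    ref_to_out = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            ref_to_out[block.a + k] = block.b + k

    results = []
    for i, token in enumerate(ref_tokens):
        if not MARKER_TOKEN_RE.match(token):
            continue
        if i in ref_to_out:
            continue  # marker survived rendering, nothing to recover
        prev_out = ref_to_out.get(i - 1)
        next_out = ref_to_out.get(i + 1)
        # Accept only if both neighbours aligned and sit next to each other.
        if prev_out is not None and next_out == prev_out + 1:
            results.append((token, next_out))
        else:
            results.append((token, None))
    return results
```

A `None` result marks the cases that would need a different strategy, for example when the renderer also dropped or rewrote the surrounding words, or when the marker has punctuation attached and never tokenizes cleanly.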
