How can I reliably recover and preserve page numbers from legal-document HTML/PDF text in Python at scale?


I have a Python pipeline that processes legal opinions into structured JSON and rendered HTML. The extraction mostly works, but page numbers are inconsistent across source formats.

I’m trying to determine the fastest deterministic recovery strategy before resorting to OCR or model reruns.

I need help with this specific sub-problem:

- Source HTML sometimes contains inline page markers
- Rendered HTML sometimes drops or misplaces them
- Plain-text and PDF text extraction may still contain usable markers

I want to distinguish:

- renderer-only loss
- extraction loss
- cases where fallback recovery is needed

Given inputs like these:

```python
source_html = """<p>Text before <span class="page-number">*12</span> text after.</p>"""
rendered_html = """<p>Text before text after.</p>"""
pdf_text = """Text before *12 text after"""
```
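To make the detection part concrete, here is a minimal stdlib-only sketch of what I mean by classifying where the marker survives. The regexes and category names are my own placeholders, and a production version would parse the HTML with lxml/BeautifulSoup rather than regex-match it:

```python
import re

# Placeholder patterns: "*12"-style inline markers, and the specific
# span wrapper seen in my source HTML.
MARKER_RE = re.compile(r"\*(\d+)")
SPAN_RE = re.compile(r'<span class="page-number">\*(\d+)</span>')

def classify_loss(source_html: str, rendered_html: str, pdf_text: str) -> str:
    """Classify where a page marker was lost.

    Returns one of:
      'intact'          - marker still present in rendered HTML
      'renderer_only'   - in source HTML but dropped by the renderer
      'extraction_loss' - absent from source HTML, recoverable from PDF text
      'needs_fallback'  - absent everywhere deterministic (OCR/model territory)
    """
    in_source = bool(SPAN_RE.search(source_html))
    in_rendered = bool(MARKER_RE.search(rendered_html))
    in_pdf = bool(MARKER_RE.search(pdf_text))

    if in_rendered:
        return "intact"
    if in_source:
        return "renderer_only"
    if in_pdf:
        return "extraction_loss"
    return "needs_fallback"
```

With the sample inputs above, this classifies the case as `renderer_only`, which is the situation I most want to detect cheaply.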

What is the most reliable Python approach to:

- detect whether the page marker was lost only during rendering
- recover it deterministically from source HTML or extracted PDF text
- reinsert it into rendered HTML without corrupting paragraph/footnote structure
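For the reinsertion step, the only approach I've thought of is anchoring on context words taken from the source and splicing the marker into the matching gap in the rendered HTML. A hypothetical sketch (function name and span markup are my own; real documents would need whitespace normalization first):

```python
import re

def reinsert_marker(rendered_html: str, marker: str,
                    before_words: str, after_words: str) -> str:
    """Splice `marker` back between two known context strings.

    `before_words`/`after_words` are text anchors copied from the source
    document on either side of the lost marker. Only the whitespace gap
    between them is rewritten, so surrounding tags are left untouched.
    """
    pattern = re.compile(re.escape(before_words) + r"\s+" + re.escape(after_words))
    replacement = (
        f'{before_words} <span class="page-number">{marker}</span> {after_words}'
    )
    # count=1: only patch the first occurrence of the anchor pair.
    return pattern.sub(replacement, rendered_html, count=1)
```

My worry is that this breaks when the anchor text itself spans a footnote or an inline tag boundary, which is part of what I'm asking about.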

Constraints:

- needs to work at large scale
- should avoid OCR unless necessary
- should avoid paid model reruns where possible

Libraries I’m considering: PyMuPDF, pdfplumber, pdftotext, lxml/BeautifulSoup, regex-based alignment.

I’d especially value answers that discuss:

- text alignment strategies
- marker anchoring
- how to handle inline markers safely
- failure cases in messy legal documents
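On the alignment point specifically: so far the best deterministic idea I have is word-level alignment with `difflib.SequenceMatcher`, treating `delete` opcodes (text in the source but not the rendered output) as candidate lost markers. A sketch, assuming both inputs have already been reduced to plain text:

```python
import difflib

def find_missing_spans(source_text: str, rendered_text: str):
    """Return (missing_words, rendered_word_index) pairs: word runs that
    appear in the source text but were dropped from the rendered text."""
    src_words = source_text.split()
    ren_words = rendered_text.split()
    sm = difflib.SequenceMatcher(None, src_words, ren_words, autojunk=False)

    missing = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "delete":
            # Words src_words[i1:i2] vanished; they belong at position j1
            # in the rendered word sequence.
            missing.append((" ".join(src_words[i1:i2]), j1))
    return missing
```

On the sample inputs this yields `[("*12", 2)]`, i.e. the marker plus the rendered-text position where it belongs. I don't know how well this scales or how it behaves when the renderer also reorders or rewrites text, which is exactly the failure-case discussion I'm hoping for.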
