Title: Improving OCR accuracy for mixed-layout PDFs (Python): preprocessing & model suggestions?
I’m building an automation tool that monitors a folder, detects new PDFs, runs OCR, and extracts specific fields into structured output. The pipeline uses Python with PyMuPDF, Tesseract/PaddleOCR, and Watchdog, plus some custom regex-based text extraction.
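Roughly how the watcher side is wired (simplified; the `inbox` directory name and the handler body are placeholders for my real code):

```python
# Minimal sketch of the folder-watching piece of the pipeline.
# Heavy imports are kept inside main() so the filter is testable on its own.
import time

def is_pdf(path: str) -> bool:
    """Filter used by the event handler: only react to new PDFs."""
    return path.lower().endswith(".pdf")

def main(watch_dir: str = "inbox") -> None:  # "inbox" is a placeholder path
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers.polling import PollingObserver

    class PdfHandler(FileSystemEventHandler):
        def on_created(self, event):
            if not event.is_directory and is_pdf(event.src_path):
                # Real pipeline runs OCR + field extraction here.
                print(f"queueing {event.src_path}")

    observer = PollingObserver()
    observer.schedule(PdfHandler(), watch_dir, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()

if __name__ == "__main__":
    main()
```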
I’m running into a few issues:
- Certain PDFs return very noisy OCR output even though the pages look clean visually.
- Mixed layouts (tables, multi-column text) break extraction or produce text in a jumbled reading order.
- I'm not sure which OCR model/setup gives the most reliable accuracy for documents like these.
- Preprocessing is a guessing game: which combination of deskewing, denoising, thresholding, etc. actually helps?
Current setup:
- PyMuPDF for PDF → image conversion
- PaddleOCR (primary) with Tesseract as a fallback
- Regex-based field extraction
- Automatic folder processing via Watchdog's PollingObserver
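To make that concrete, a trimmed-down sketch of the OCR + extraction stages (the field names and regexes are placeholders for my real patterns, and the result indexing assumes the PaddleOCR 2.x API):

```python
# Sketch: render pages with PyMuPDF, OCR them, then regex out fields.
import re

# Hypothetical target fields -- stand-ins for my real patterns.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"invoice\s*(?:no\.?|#)\s*[:\-]?\s*(\S+)", re.I),
    "total": re.compile(r"total\s*[:\-]?\s*\$?([\d,]+\.\d{2})", re.I),
}

def extract_fields(text: str) -> dict:
    """Regex pass over the OCR text; None where a field wasn't found."""
    return {name: (m.group(1) if (m := pat.search(text)) else None)
            for name, pat in FIELD_PATTERNS.items()}

def ocr_pdf(path: str, dpi: int = 300) -> str:
    """PyMuPDF render -> PaddleOCR, with Tesseract as fallback on empty output."""
    import fitz  # PyMuPDF
    import numpy as np
    import pytesseract
    from paddleocr import PaddleOCR
    from PIL import Image

    engine = PaddleOCR(lang="en")
    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            img = np.frombuffer(pix.samples, dtype=np.uint8)
            img = img.reshape(pix.height, pix.width, pix.n)
            # Result layout below matches PaddleOCR 2.x; newer versions differ.
            result = engine.ocr(img)
            lines = [entry[1][0] for entry in (result[0] or [])]
            if not lines:  # fall back to Tesseract when Paddle finds nothing
                lines = [pytesseract.image_to_string(Image.fromarray(img))]
            pages.append("\n".join(lines))
    return "\n".join(pages)
```

The "fallback on empty output" heuristic is crude; part of my question is whether there's a better way to decide when one engine's result should be distrusted.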
I’m looking for suggestions on:
- Effective Python preprocessing pipelines to run before OCR
- OCR engines/models that handle mixed, structured documents well
- Layout-aware tools such as LayoutParser, DocTR, or Donut for extraction
- Anything else that boosts accuracy before text parsing
If anyone has practical experience or benchmarks for these scenarios, I’d appreciate your insights. Thanks!
