I am using PdfPig to extract text from user-uploaded PDFs, to save time on manual re-entry. The files are consistent enough that I can search for identifiable keywords and take substrings relative to their positions.
However, a non-negligible share of the uploaded PDFs are broken: when the text comes through, it is complete nonsense. I can't upload one of the PDFs, but in my testing so far the way they are broken appears to be consistent.
For example, "sĞƌŝĚŝĂŶ,ŽŵĞƐ:Žď^ĐŚĞĚƵůĞ" should say "Veridian Homes Job Schedule".
When I copy that same line from different files, I get the same garbled text.
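Since only some of the uploads are affected, it may help to flag garbled output before deciding how to handle a file. The following is a minimal sketch, not a PdfPig feature: the helper name `LooksGarbled` and the 0.3 threshold are assumptions for illustration, and it presumes the documents are mostly plain English, so legitimately accented text would also trip it.

```csharp
using System;
using System.Linq;

// Sample strings taken from the question.
Console.WriteLine(LooksGarbled("sĞƌŝĚŝĂŶ,ŽŵĞƐ:Žď^ĐŚĞĚƵůĞ"));      // True
Console.WriteLine(LooksGarbled("Veridian Homes Job Schedule"));   // False

// Hypothetical heuristic: treat text as garbled when a large share of its
// letters fall outside ASCII. The threshold is an assumed value to tune.
static bool LooksGarbled(string text, double threshold = 0.3)
{
    int letters = text.Count(c => char.IsLetter(c));
    if (letters == 0)
    {
        return false; // nothing to judge
    }

    int nonAsciiLetters = text.Count(c => char.IsLetter(c) && c > 127);
    return (double)nonAsciiLetters / letters > threshold;
}
```

In the garbled sample, 20 of the 21 letters are non-ASCII, so it sits far above any reasonable threshold, while the clean sample has none.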
I used the following code to pull the text from the PDF:

    using System;
    using System.IO;
    using UglyToad.PdfPig;
    using UglyToad.PdfPig.Content;
    using UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor;

    string fileName = "fileName";
    string fullText = "";
    var filePath = "path";

    // Extract the text of every page in reading order and concatenate it.
    using (var doc = PdfDocument.Open(filePath))
    {
        foreach (Page page in doc.GetPages())
        {
            string pageText = ContentOrderTextExtractor.GetText(page);
            fullText += pageText;
        }
    }

    // Dump the extracted text to a .txt file for inspection.
    string path = @"C:\Users\mry10\Downloads\" + fileName + ".txt";
    File.WriteAllText(path, fullText);
    Console.WriteLine("Wrote to " + fileName);

The full text of the document is longer than the snippet I mentioned above, but it is garbled in the same way.
Is there a standard way to fix that text? I've tried switching between different encodings and the ideas mentioned in this thread on PdfPig, namely using UTF.Unknown, but I haven't been able to get any of that to work.
As long as the garbling is consistent, I could hopefully just implement a conversion by hand (that is, compare the character found against the character expected for every relevant character), but that seems pretty inelegant.
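For reference, the by-hand conversion could be sketched as a substitution table learned from known garbled/expected pairs rather than typed out character by character. This is only an illustration under assumptions: the helpers `LearnPair` and `Decode` are hypothetical names, the mapping is assumed to be one character to one character, and the only pair used is the one quoted above. The garbled sample contains no spaces, so spaces are stripped from the expected text before the two are aligned.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Known pair taken from the question.
const string garbledSample = "sĞƌŝĚŝĂŶ,ŽŵĞƐ:Žď^ĐŚĞĚƵůĞ";
const string expectedSample = "Veridian Homes Job Schedule";

var table = new Dictionary<char, char>();
LearnPair(table, garbledSample, expectedSample);

// Decode other text that uses the same garbled characters.
Console.WriteLine(Decode(table, "^ĐŚĞĚƵůĞ:Žď")); // prints "ScheduleJob"

// Hypothetical helper: align a garbled string with its expected text
// (whitespace removed) and record each garbled -> expected character.
static void LearnPair(Dictionary<char, char> map, string garbled, string expected)
{
    string target = new string(expected.Where(c => !char.IsWhiteSpace(c)).ToArray());
    if (garbled.Length != target.Length)
    {
        throw new ArgumentException("Samples do not align one-to-one.");
    }

    for (int i = 0; i < garbled.Length; i++)
    {
        // A conflict means the one-to-one assumption does not hold.
        if (map.TryGetValue(garbled[i], out char known) && known != target[i])
        {
            throw new InvalidOperationException($"Conflicting mapping for '{garbled[i]}'.");
        }
        map[garbled[i]] = target[i];
    }
}

// Hypothetical helper: replace every mapped character and pass
// unmapped characters through unchanged.
static string Decode(Dictionary<char, char> map, string text)
{
    var sb = new StringBuilder(text.Length);
    foreach (char c in text)
    {
        sb.Append(map.TryGetValue(c, out char plain) ? plain : c);
    }
    return sb.ToString();
}
```

Feeding more known pairs into `LearnPair` would extend the table, and any characters left unmapped after decoding point at gaps in it. The table cannot restore the missing spaces, so word boundaries would have to come from elsewhere, for example from positional data such as PdfPig's `page.GetWords()`. It is also possible that the mapping differs per font, in which case one table per font would be needed.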
I could switch to OCR, such as Tesseract, but that has implementation issues of its own, and I think relying on text extraction alone should be faster and more accurate.
Also, I'm aware there are tools for fixing broken PDFs that could be run to clean the files before my code runs, but that isn't a viable solution from a workflow perspective.
