Reprocessing PDF Content Streams Results in Malformed PDF


I'm using pdf-lib to reprocess an existing PDF by modifying its content streams. The goal of my project is to bring an existing document into PDF/UA compliance (and possibly PDF/A-1 later) on the frontend.

The following code is intended to brute-force MCIDs (marked-content identifiers) into the decoded text of each PDF content stream object.

async reprocessPDF(input: Uint8Array): Promise<Uint8Array> {
  const pdfDoc: PDFDocument = await PDFDocument.load(input);
  const pdfContext: PDFContext = pdfDoc.context;

  const catalogRef = pdfContext.trailerInfo.Root; // PDFRef
  const catalog = pdfContext.lookup(catalogRef); // PDFDict
  if (!(catalog instanceof PDFDict)) {
    throw new Error("Invalid PDF: Catalog is not a dictionary");
  }

  const pagesRef = catalog.get(PDF_NAME_PAGES);
  if (!pagesRef || !(pagesRef instanceof PDFRef)) {
    throw new Error("Invalid PDF: Catalog missing /Pages reference");
  }

  // let's get the /Kids and walk them to get PDFRef for a /Type /Page
  const pagesDict: PDFDict = pdfContext.lookup(pagesRef) as PDFDict;
  const pageRefs: PDFRef[] = getAllPageRefs(pagesDict, pdfContext);

  // okay, so we're not getting the references of the page content objects, so
  // we're not running them through the transformers
  for (const pageRef of pageRefs) {
    const pageObj: PDFObject | undefined = pdfContext.lookup(pageRef);
    if (!pageObj || !(pageObj instanceof PDFDict)) {
      continue;
    }

    // Get the /Contents entry from the page dictionary
    const contentsEntry = pageObj.get(PDFName.of('Contents'));
    if (!contentsEntry) {
      continue; // Page has no content
    }

    // Contents can be a single stream reference or an array of stream references
    const contentRefs: PDFRef[] = [];
    if (contentsEntry instanceof PDFRef) {
      contentRefs.push(contentsEntry);
    } else if (contentsEntry instanceof PDFArray) {
      // Handle array of content streams
      for (let i = 0; i < contentsEntry.size(); i++) {
        const ref = contentsEntry.get(i);
        if (ref instanceof PDFRef) {
          contentRefs.push(ref);
        }
      }
    }

    const mcidCounter: MCIDCounter = new MCIDCounter(0);
    for (const contentRef of contentRefs) {
      const contentStream = pdfContext.lookup(contentRef);
      if (!(contentStream instanceof PDFStream)) {
        continue;
      }

      const filters = contentStream.dict.get(PDFName.of('Filter'));
      const filterNames: PDFName[] = this.normalizeFilters(filters);

      // Build the stream transform pipeline
      const pipeline = [];
      const hasFlateDecodeFilter = this.hasFilter('FlateDecode', filterNames);
      if (hasFlateDecodeFilter) {
        pipeline.push(inflateStream);
      }
      console.log("MCID counter before pipeline:", mcidCounter.current());
      pipeline.push(createMcidStreamTransformer(mcidCounter));
      console.log("MCID counter after pipeline:", mcidCounter.current());
      if (hasFlateDecodeFilter) {
        pipeline.push(deflateStream);
      }

      // Process the stream through the pipeline
      let processedStream = contentStream.getContents();
      for (const transformFn of pipeline) {
        processedStream = await transformFn(processedStream);
      }

      // Create new stream with processed contents
      const newDict = contentStream.dict.clone(pdfContext);
      newDict.set(PDFName.of('Length'), PDFNumber.of(processedStream.length));
      const newStream = PDFRawStream.of(newDict, processedStream);

      // Update the content stream reference in the context
      pdfContext.assign(contentRef, newStream);
    }
  }

  return pdfDoc.save();
}
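The `MCIDCounter` and `createMcidStreamTransformer` helpers are not shown in the code above. For context, here is a minimal sketch of what a brute-force tagger of this kind might look like once a content stream has been decoded to text. `SketchMcidCounter` and `tagTextObjects` are hypothetical names, the `/P` tag is an arbitrary choice, and the regex naively assumes that `BT` and `ET` never appear inside string literals.

```typescript
// Hypothetical stand-in for the MCIDCounter used in the question;
// the real implementation is not shown there.
class SketchMcidCounter {
  private value: number;

  constructor(start: number) {
    this.value = start;
  }

  // The next MCID that would be handed out.
  current(): number {
    return this.value;
  }

  // Hand out the current MCID and advance the counter.
  next(): number {
    const id = this.value;
    this.value += 1;
    return id;
  }
}

// Wrap every BT ... ET text object in a marked-content sequence:
//   /P <</MCID n>> BDC ... EMC
// Assumption: BT/ET occur only as operators, never inside strings.
function tagTextObjects(content: string, counter: SketchMcidCounter): string {
  return content.replace(/\bBT\b[\s\S]*?\bET\b/g, (textObject) => {
    return `/P <</MCID ${counter.next()}>> BDC\n${textObject}\nEMC`;
  });
}

// Usage: two text objects receive MCID 0 and MCID 1.
const demoCounter = new SketchMcidCounter(0);
const tagged = tagTextObjects(
  "q\nBT /F1 12 Tf (Hello) Tj ET\nQ\nBT (World) Tj ET\n",
  demoCounter
);
console.log(tagged);
```

Because the counter only advances when the transformer actually runs, reading `current()` right after pushing the transformer onto a pipeline (and before executing it) would still report the starting value.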

Unfortunately, this results in a malformed PDF where only the stream objects exist in the output file. I believed I was replacing the original object in place, reusing the same object reference, via the following call:

pdfContext.assign(contentRef, newStream);

When I call PDFDocument.save() to convert the PDFDocument back into a Uint8Array, which I then return to the client as a Blob, it appears to destroy everything except the rebuilt stream objects. I confirmed this by inspecting the generated PDF directly.

Why is this happening?

Here are the first lines of the generated PDF. There's no /Contents, no /Pages, and no /Page pointing to the object containing the content stream.

%PDF-1.7
%\81\81\81\81
4 0 obj
<< /Length 1114 >>
stream
(stream contents)
endstream
endobj
10 0 obj
<< /Type /XObject /Subtype /Image /Width 25 /Height 25 /BitsPerComponent 8 /ColorSpace /DeviceGray /Filter [ /FlateDecode ] /Length 206 >>
stream
x\9C\85R\89 (stream contents)
endstream
endobj
11 0 obj
<< /Length 8 >>
stream
x\9C\00\00\00\00
endstream
endobj
13 0 obj
<< /Length 1567 >>
stream
(stream contents)
endstream
endobj
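One possibility worth ruling out, offered as an assumption rather than a confirmed diagnosis: pdf-lib's `save()` has a `useObjectStreams` option that defaults to `true`. With it enabled, non-stream objects such as the catalog, `/Pages`, and `/Page` dictionaries are packed into compressed `/Type /ObjStm` objects, so they would not be visible as plain text, while streams (which cannot live inside object streams) still appear at the top level. The sketch below scans the file text for top-level objects and reports whether any object stream is present. `summarizeTopLevelObjects` and `usesObjectStreams` are hypothetical helpers, and the scan naively assumes `endobj` never occurs inside binary stream data.

```typescript
// Summary of one top-level "N G obj ... endobj" entry found in the file text.
interface ObjectSummary {
  objectNumber: number;
  generation: number;
  type: string | null; // value of /Type in the object's dictionary, if any
  isStream: boolean;
}

// Naive scan over the PDF text (decoded one byte per character).
// Assumption: "endobj" does not occur inside binary stream data.
function summarizeTopLevelObjects(pdfText: string): ObjectSummary[] {
  const summaries: ObjectSummary[] = [];
  const objectPattern = /(\d+)\s+(\d+)\s+obj\b([\s\S]*?)\bendobj\b/g;
  let match: RegExpExecArray | null;
  while ((match = objectPattern.exec(pdfText)) !== null) {
    const body = match[3] || "";
    // Only look at the dictionary part, not the stream payload.
    const streamStart = body.search(/\bstream\b/);
    const dictPart = streamStart === -1 ? body : body.slice(0, streamStart);
    const typeMatch = /\/Type\s*\/([A-Za-z]+)/.exec(dictPart);
    summaries.push({
      objectNumber: parseInt(match[1] || "0", 10),
      generation: parseInt(match[2] || "0", 10),
      type: typeMatch ? typeMatch[1] || null : null,
      isStream: streamStart !== -1,
    });
  }
  return summaries;
}

// True if at least one top-level object is a compressed object stream.
function usesObjectStreams(pdfText: string): boolean {
  return summarizeTopLevelObjects(pdfText).some((o) => o.type === "ObjStm");
}

// Usage with a tiny hand-written sample (not real pdf-lib output).
const sampleText = [
  "%PDF-1.7",
  "4 0 obj",
  "<< /Length 5 >>",
  "stream",
  "hello",
  "endstream",
  "endobj",
  "14 0 obj",
  "<< /Type /ObjStm /N 9 /First 50 /Length 3 >>",
  "stream",
  "abc",
  "endstream",
  "endobj",
].join("\n");
console.log(summarizeTopLevelObjects(sampleText), usesObjectStreams(sampleText));
```

If the check reports an object stream near the end of the file, saving with `pdfDoc.save({ useObjectStreams: false })` should write every object uncompressed with a classic cross-reference table, which makes the page tree inspectable in a text editor.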
