Reprocessing PDF Content Streams Results in Malformed PDF


I'm attempting to use pdf-lib to reprocess an existing PDF. Specifically, I'm modifying the content streams inside that PDF. The goal of my project is to retrofit PDF/UA compliance (and possibly PDF/A-1 later) onto an existing document on the frontend.

The following code is intended to add PDF MCIDs via brute force to the decoded text stream from a PDF content stream object.

async reprocessPDF(input: Uint8Array): Promise<Uint8Array> {
  const pdfDoc: PDFDocument = await PDFDocument.load(input);
  const pdfContext: PDFContext = pdfDoc.context;

  const catalogRef = pdfContext.trailerInfo.Root; // PDFRef
  const catalog = pdfContext.lookup(catalogRef); // PDFDict
  if (!(catalog instanceof PDFDict)) {
    throw new Error("Invalid PDF: Catalog is not a dictionary");
  }

  const pagesRef = catalog.get(PDF_NAME_PAGES);
  if (!pagesRef || !(pagesRef instanceof PDFRef)) {
    throw new Error("Invalid PDF: Catalog missing /Pages reference");
  }

  // let's get the /Kids and walk them to get PDFRef for a /Type /Page
  const pagesDict: PDFDict = pdfContext.lookup(pagesRef) as PDFDict;
  const pageRefs: PDFRef[] = getAllPageRefs(pagesDict, pdfContext);

  // okay, so we're not getting the references of the page content objects, so
  // we're not running them through the transformers
  for (const pageRef of pageRefs) {
    const pageObj: PDFObject | undefined = pdfContext.lookup(pageRef);
    if (!pageObj || !(pageObj instanceof PDFDict)) {
      continue;
    }

    // Get the /Contents entry from the page dictionary
    const contentsEntry = pageObj.get(PDFName.of('Contents'));
    if (!contentsEntry) {
      continue; // Page has no content
    }

    // Contents can be a single stream reference or an array of stream references
    const contentRefs: PDFRef[] = [];
    if (contentsEntry instanceof PDFRef) {
      contentRefs.push(contentsEntry);
    } else if (contentsEntry instanceof PDFArray) {
      // Handle array of content streams
      for (let i = 0; i < contentsEntry.size(); i++) {
        const ref = contentsEntry.get(i);
        if (ref instanceof PDFRef) {
          contentRefs.push(ref);
        }
      }
    }

    const mcidCounter: MCIDCounter = new MCIDCounter(0);
    for (const contentRef of contentRefs) {
      const contentStream = pdfContext.lookup(contentRef);
      if (!(contentStream instanceof PDFStream)) {
        continue;
      }

      const filters = contentStream.dict.get(PDFName.of('Filter'));
      const filterNames: PDFName[] = this.normalizeFilters(filters);

      // Build the stream transform pipeline
      const pipeline = [];
      const hasFlateDecodeFilter = this.hasFilter('FlateDecode', filterNames);
      if (hasFlateDecodeFilter) {
        pipeline.push(inflateStream);
      }
      console.log("MCID counter before pipeline:", mcidCounter.current());
      pipeline.push(createMcidStreamTransformer(mcidCounter));
      console.log("MCID counter after pipeline:", mcidCounter.current());
      if (hasFlateDecodeFilter) {
        pipeline.push(deflateStream);
      }

      // Process the stream through the pipeline
      let processedStream = contentStream.getContents();
      for (const transformFn of pipeline) {
        processedStream = await transformFn(processedStream);
      }

      // Create new stream with processed contents
      const newDict = contentStream.dict.clone(pdfContext);
      newDict.set(PDFName.of('Length'), PDFNumber.of(processedStream.length));
      const newStream = PDFRawStream.of(newDict, processedStream);

      // Update the content stream reference in the context
      pdfContext.assign(contentRef, newStream);
    }
  }

  return pdfDoc.save();
}
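
The body of createMcidStreamTransformer is not shown in the question. For readers following along, here is a naive sketch of what a brute-force MCID pass might look like, assuming each BT ... ET text object in the decoded content stream is wrapped in its own /P marked-content sequence. The names McidCounter and tagTextObjects are hypothetical and not part of pdf-lib, and the regex does not cope with BT or ET appearing inside string literals or inline image data.

```typescript
// Hypothetical stand-in for the MCIDCounter class used in the question.
class McidCounter {
  private value: number;

  constructor(start: number = 0) {
    this.value = start;
  }

  // Returns the next MCID that would be handed out, without consuming it.
  current(): number {
    return this.value;
  }

  // Hands out the current MCID and advances the counter.
  next(): number {
    return this.value++;
  }
}

// Naive brute-force tagging: wraps every BT ... ET text object in a
// "/P <</MCID n>> BDC ... EMC" marked-content sequence.
// Assumption: BT and ET never occur as standalone words inside strings.
function tagTextObjects(content: string, counter: McidCounter): string {
  return content.replace(
    /\bBT\b[\s\S]*?\bET\b/g,
    (textObject: string) =>
      `/P <</MCID ${counter.next()}>> BDC\n${textObject}\nEMC`
  );
}

// Usage: two text objects receive MCIDs 0 and 1.
const demoCounter = new McidCounter(0);
const tagged = tagTextObjects("BT (A) Tj ET\nBT (B) Tj ET", demoCounter);
console.log(tagged);
```

Because the counter is passed in rather than created inside the function, numbering can continue across several content streams of the same page, which matches how mcidCounter is shared across contentRefs in the loop above.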

Unfortunately, this results in a malformed PDF where only the stream objects exist in the output file. What I believed I was doing was replacing the original object in place, using the same object reference via the following call:

pdfContext.assign(contentRef, newStream);

When I call PDFDocument.save() to convert the PDFDocument back into a Uint8Array, which I then return to the client as a Blob, it appears to destroy everything except the rebuilt stream objects. I confirmed this by inspecting the generated PDF directly.

Why is this happening?

Here are the first lines from the generated PDF. There's no /Contents, no /Pages, and no /Page pointing to the object containing the content stream.

%PDF-1.7
%\81\81\81\81

4 0 obj
<< /Length 1114 >>
stream
(stream contents)
endstream
endobj

10 0 obj
<< /Type /XObject /Subtype /Image /Width 25 /Height 25 /BitsPerComponent 8 /ColorSpace /DeviceGray /Filter [ /FlateDecode ] /Length 206 >>
stream
x\9C\85R\89 (stream contents)
endstream
endobj

11 0 obj
<< /Length 8 >>
stream
x\9C\00\00\00\00
endstream
endobj

13 0 obj
<< /Length 1567 >>
stream
(stream contents)
endstream
endobj
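
One possibility worth ruling out before concluding that the page tree is gone: pdf-lib's save() enables object streams by default (the useObjectStreams save option), in which case non-stream objects such as the catalog, /Pages and each /Page dictionary are packed into compressed /Type /ObjStm objects and are not readable as plain text, while stream objects stay at the top level. The gaps in the object numbers above (4, 10, 11, 13) would be consistent with that. The sketch below is a minimal check, not a PDF parser; inspectSavedPdf and toLatin1 are hypothetical helper names, and the check only looks for the dictionary markers in the raw bytes.

```typescript
// Hypothetical report type for this sketch.
interface CompressionReport {
  objectStreams: number; // count of "/Type /ObjStm" dictionaries found
  xrefStream: boolean;   // true if a "/Type /XRef" dictionary is present
}

// Decodes one character per byte so binary stream data cannot break decoding.
function toLatin1(bytes: Uint8Array): string {
  let text = "";
  for (let i = 0; i < bytes.length; i++) {
    text += String.fromCharCode(bytes[i]);
  }
  return text;
}

// Scans the saved file for object streams and a cross-reference stream.
// Assumption: stream dictionaries themselves are written uncompressed,
// so their /Type entries are visible in the raw bytes.
function inspectSavedPdf(pdfBytes: Uint8Array): CompressionReport {
  const text = toLatin1(pdfBytes);
  const objStmMatches = text.match(/\/Type\s*\/ObjStm\b/g) || [];
  return {
    objectStreams: objStmMatches.length,
    xrefStream: /\/Type\s*\/XRef\b/.test(text),
  };
}
```

If the report shows object streams, saving with pdfDoc.save({ useObjectStreams: false }) should write every object uncompressed, which makes it easier to confirm in a text editor whether the catalog and page tree actually survived.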