How to generate a clean PDF from a web page without print dialog artifacts (sticky headers, ads, broken images)?

I'm trying to build a Python script that takes a URL, extracts the article content, converts it to a clean PDF, and uploads it to Google Drive. The goal is to archive 10-20 articles per day for research, accessible across devices.

My current approach:

import requests
from bs4 import BeautifulSoup
from weasyprint import HTML
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

resp = requests.get(url, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, 'html.parser')

# Try to find the article content via a fallback chain of common containers
article = (
    soup.find('article')
    or soup.find('div', class_='article-body')
    or soup.find('div', class_='post-content')
    or soup.find('main')
)
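For context, here is a selector-free heuristic I've been experimenting with as an alternative to the fallback chain: walk the DOM and score each container by how much text it holds, discounting link-heavy blocks (roughly what readability-style tools do, as I understand them). Everything here (`DensityParser`, `best_container`, the 2x link penalty) is my own stdlib-only sketch, not a real library:

```python
from html.parser import HTMLParser

class DensityParser(HTMLParser):
    """Score container elements by the text they hold, penalizing link text."""
    CONTAINERS = {'article', 'main', 'div', 'section'}

    def __init__(self):
        super().__init__()
        self.stack = []       # open container elements: (tag, node_id)
        self.scores = {}      # node_id -> total character count
        self.link_chars = {}  # node_id -> characters inside <a> tags
        self.labels = {}      # node_id -> human-readable label
        self.in_link = 0
        self._next_id = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_link += 1
        if tag in self.CONTAINERS:
            self._next_id += 1
            attr = dict(attrs)
            label = tag + ('#' + attr['id'] if 'id' in attr else '')
            self.labels[self._next_id] = label
            self.scores[self._next_id] = 0
            self.link_chars[self._next_id] = 0
            self.stack.append((tag, self._next_id))

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_link:
            self.in_link -= 1
        if tag in self.CONTAINERS:
            # pop to the matching open tag, tolerating mis-nested HTML
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i][0] == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            # attribute text to the *nearest* enclosing container only,
            # so page-wide wrapper divs don't automatically win
            _, nid = self.stack[-1]
            self.scores[nid] += n
            if self.in_link:
                self.link_chars[nid] += n

def best_container(html):
    """Return (label, score) of the highest-scoring container."""
    p = DensityParser()
    p.feed(html)
    best, best_score = None, 0.0
    for nid, chars in p.scores.items():
        score = chars - 2 * p.link_chars[nid]  # link-heavy blocks score low
        if score > best_score:
            best, best_score = nid, score
    return p.labels.get(best), best_score
```

Even this naive version picks the content div over navigation blocks on simple pages, but it feels fragile, which is part of why I'm asking whether a maintained library does this properly.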

The problems I'm hitting:

1. No universal selector for article content. Every site uses different class names (.article-body, .post-content, .entry-content, .story-body, etc.). I've written ~30 selectors and still miss half the sites I need. Government reports and legal sites are especially inconsistent.

2. Dynamic content doesn't load. requests.get() returns the initial HTML before any JavaScript runs, so SPAs, lazy-loaded images, and infinite-scroll content are all missing. I tried Selenium, but it's slow, flaky, and a pain to deploy.

3. The extracted HTML is still full of junk. Even when I find the right container, it includes inline ads, newsletter signup forms, related-article widgets, and social share buttons. Stripping these is another round of selector whack-a-mole.

4. WeasyPrint formatting is rough. Images break across pages, headings get orphaned at the bottom of a page, and there's no automatic table of contents.
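For the WeasyPrint problems specifically, this is the direction of print CSS I've been trying. It's a sketch: break-inside, break-after, orphans/widows, and @page margin boxes are standard paged-media CSS that WeasyPrint advertises support for, but the exact values here are my guesses, not tested recommendations:

```css
@page {
  size: A4;
  margin: 2cm;
  @bottom-center { content: counter(page); }  /* page number in the footer */
}

/* Keep images intact on one page and scaled to the page width */
img { max-width: 100%; break-inside: avoid; }

/* Don't leave a heading stranded at the bottom of a page */
h1, h2, h3 { break-after: avoid; }

/* Avoid single stray lines at page boundaries */
p { orphans: 3; widows: 3; }
```

This addresses the breaks and orphans but not the table of contents; as far as I can tell WeasyPrint can generate PDF bookmarks from headings, but an in-document ToC page would still have to be built into the HTML myself.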

Is there a more robust approach to article extraction than hand-written BeautifulSoup selectors? Or is server-side scraping fundamentally the wrong architecture for this?
