How can I return an array of strings that matches the web data it's scraping off of?


For starters, it's important to share the actual site you're scraping so answerers can reproduce the problem and be certain they're solving the right issue. Searching the text in your screenshot gives https://letterboxd.com/itscharlibb/film/erupcja/2/, so I assume that's the page you're scraping*.

A general tip in web scraping: don't think like a human user**. The human user knows nothing about programming and needs to click things to see the full review. But you're a programmer, so you can look at the page source and extract information directly. Searching for hidden text in the raw HTML response like "go see it, it’s special." shows the data is statically available in an element with class .js-review-body. Use that.

```javascript
const playwright = require("playwright"); // ^1.58.0

let browser;
(async () => {
  browser = await playwright.chromium.launch({headless: false});
  const page = await browser.newPage();
  const url = "https://letterboxd.com/itscharlibb/film/erupcja/2/";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  console.log(await page.locator(".js-review-body").textContent());
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
```

Since JS and extra network resources aren't necessary, you can block them entirely:

```javascript
let browser;
let context;
(async () => {
  browser = await playwright.chromium.launch({headless: false});
  context = await browser.newContext({javaScriptEnabled: false});
  const page = await context.newPage();
  const url = "https://letterboxd.com/itscharlibb/film/erupcja/2/";
  await page.route("**", route => {
    if (route.request().url().startsWith(url)) {
      route.continue();
    } else {
      route.abort();
    }
  });
  await page.goto(url, {waitUntil: "domcontentloaded"});
  console.log(await page.locator(".js-review-body").textContent());
})()
  .catch(err => console.error(err))
  .finally(async () => {
    await context?.close();
    await browser?.close();
  });
```

Note that the above uses Playwright in headful mode. This is important: the site has antibot protection and will reject headless browsers and plain fetch calls.

There are a number of ways around this, such as using a real Chrome browser or a browser-impersonation tool like cuimp-ts:

```javascript
import * as cheerio from "cheerio"; // ^1.2.0
import { get } from "cuimp"; // ^1.10.0

const url = "https://letterboxd.com/itscharlibb/film/erupcja/2/";
const response = await get(url);
console.log(cheerio.load(response.data)(".js-review-body").text());
```

The above is fully headless and much more lightweight than a headful browser. This is a common refactor once you find the data is available directly through an API or in static HTML.


*: After writing this answer, I realize you're actually probably scraping the list page https://letterboxd.com/film/erupcja/.

Extending the above cuimp solution to scrape all reviews is straightforward. On the review list page, extract the URLs of each review detail and then request them one by one:

```javascript
const url = "https://letterboxd.com/film/erupcja/";
const response = await get(url);
const $ = cheerio.load(response.data);
const reviews = [];

for (const el of [...$("[title^='Read']")]) {
  const base = "https://letterboxd.com/";
  const response = await get(base + $(el).attr("href"));
  reviews.push(cheerio.load(response.data)(".js-review-body").text().trim());
}

console.log(reviews);
```
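One subtlety in the loop above: depending on the markup, the extracted `href` values may be absolute paths starting with `/`, and naive concatenation against a base ending in `/` then yields a double slash. Node's built-in `URL` constructor resolves either form correctly; a small sketch (the example paths are illustrative):

```javascript
// Resolve an href against a base URL; handles both "/path" and "path" forms.
const resolve = (href, base) => new URL(href, base).toString();

console.log(resolve("/itscharlibb/film/erupcja/2/", "https://letterboxd.com/"));
// → https://letterboxd.com/itscharlibb/film/erupcja/2/
console.log(resolve("itscharlibb/film/erupcja/2/", "https://letterboxd.com/"));
// → https://letterboxd.com/itscharlibb/film/erupcja/2/
```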

If you do want to stick with Playwright, keep in mind that clicking the 'more' button is not a synchronous render. This triggers a server request that races against Playwright reading the text, so you'll need to wait for a network response, or wait until the text changes.
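The "wait until the text changes" option can be sketched generically. This is a hand-rolled poller, assuming you can pass it an async getter for the current text (in Playwright that could be `() => page.locator(".js-review-body").textContent()`); Playwright's own web-first assertions can serve the same purpose:

```javascript
// Poll an async text getter until its value differs from the initial
// snapshot, or give up after the timeout.
async function waitForChange(getText, {timeout = 5000, interval = 100} = {}) {
  const initial = await getText();
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    await new Promise(resolve => setTimeout(resolve, interval));
    const current = await getText();
    if (current !== initial) return current;
  }
  throw new Error("Timed out waiting for text to change");
}
```

Click the 'more' button first, then `await waitForChange(...)` before reading the full review text.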

That's a bit of a pain and can easily become flaky. As above, I'd just make separate requests to each sub-page, and only resort to Playwright if I'm hard-blocked and adding sleeps/throttling to respect the server doesn't help.
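A minimal way to throttle the request loop is a promise-based sleep between requests (the one-second delay here is an arbitrary polite default, not a documented rate limit):

```javascript
// Promise-based sleep; await it between requests to space them out.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// In the scraping loop, e.g.:
// for (const href of hrefs) {
//   await sleep(1000); // be polite to the server
//   const response = await get(base + href);
//   ...
// }
```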


**: Note that if you're testing your own app with Playwright rather than scraping, then you want to follow the opposite principle: behave like your users do as much as possible so you can gain confidence that your app works as intended.
