Varying anchors when scraping markup

15 hours ago 2


I'm building a web crawler and need to parse anchor tags to extract URLs. However, I'm having trouble telling whether an href attribute contains a full URL or a relative/internal path. For example, when crawling Wikipedia, I encounter hrefs like:

https://en.wikipedia.org/wiki/Page (full URL)
/wiki/Saint_Lucia_Labour_Party (absolute path, relative to domain)
wiki/Saint_Lucia_Labour_Party (relative to current directory)
#section (anchor link)

What's the most reliable way to identify if an href is absolute or relative and convert relative URLs to absolute URLs so I can crawl them?
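
One possible approach, sketched here on the assumption that the crawler is written in Python, is to let the standard library's urllib.parse do the work: urlparse can tell you whether an href carries its own scheme and host, and urljoin resolves any href against the URL of the page it was found on. The helper names classify_href and resolve_href and the sample base URL are made up for illustration; they are not from any particular crawler.

```python
from urllib.parse import urljoin, urlparse, urldefrag


def classify_href(href):
    """Return a rough label for an href value (hypothetical helper)."""
    parts = urlparse(href)
    if parts.scheme and parts.netloc:
        return "absolute"           # e.g. https://en.wikipedia.org/wiki/Page
    if parts.scheme:
        return "other-scheme"       # e.g. mailto: or javascript: links
    if parts.netloc:
        return "protocol-relative"  # e.g. //example.com/page
    if href.startswith("#"):
        return "fragment"           # same-page anchor such as #section
    if href.startswith("/"):
        return "root-relative"      # path relative to the domain root
    return "relative"               # path relative to the current directory


def resolve_href(base_url, href):
    """Resolve href against the page it was found on and drop the fragment."""
    absolute = urljoin(base_url, href)
    # The fragment only points inside a page, so a crawler can usually discard it.
    absolute, _fragment = urldefrag(absolute)
    return absolute


if __name__ == "__main__":
    # Assumed URL of the page currently being crawled.
    base = "https://en.wikipedia.org/wiki/Saint_Lucia"
    for href in [
        "https://en.wikipedia.org/wiki/Page",
        "/wiki/Saint_Lucia_Labour_Party",
        "wiki/Saint_Lucia_Labour_Party",
        "#section",
    ]:
        print(classify_href(href), resolve_href(base, href))
```

Because urljoin returns an already-absolute href unchanged, the classification step is optional: calling resolve_href on every href and then filtering by scheme and host may be enough for a crawler. Note that the base URL should be the URL of the page actually fetched (after redirects), and a <base href> tag in the page, if present, overrides it.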


