How would I parse HTML href values?

18 hours ago 4


I'm writing a basic web scraper, and for that I need to parse links. I can extract the href attribute from the <a> elements, and I've written a basic parser for the values. Here's the code:

    links: list[str]  # List of Href links, extracted from <a> elements
    url: str  # URL that these Hrefs came from
    urldomain: str  # Domain of the URL that these Hrefs came from

    for link in links:
        if link.startswith('/'):
            if link.startswith('//'):
                newlinks.append('http://' + link)  # HTML
            newlinks.append(urldomain + link)
        elif link.startswith("http"):
            newlinks.append(link)
        elif link.startswith("https"):
            newlinks.append(link)
        if not link.startswith('http'):
            if url.endswith('/'):
                newlinks.append(url + '/' + link)
            else:
                newlinks.append('/'.join(url.split('/')[:-1]))
        else:
            newlinks.append(link)

It's incredibly hacky, I know, but I just can't figure out how to get it right. I keep finding stuff that doesn't work and just doesn't make sense. What am I missing?
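
For comparison, here is a minimal sketch of the same job using the standard library's `urllib.parse.urljoin`, which resolves a reference against a base URL and so handles the `/path`, `//host/path`, relative, and already-absolute cases that the branches above handle by hand. The function name `resolve_links` and the choice to drop fragments and non-HTTP schemes (such as `mailto:` or `javascript:`) are assumptions made for illustration, not something the question specifies.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def resolve_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve raw href values against the URL of the page they came from.

    Hypothetical helper: skips empty hrefs, strips #fragments, and keeps
    only http/https results.
    """
    resolved = []
    for href in hrefs:
        href = href.strip()
        if not href:
            continue  # nothing to resolve

        # urljoin covers '/path', '//host/path', 'relative/path', '../up'
        # and leaves already-absolute URLs unchanged.
        absolute = urljoin(base_url, href)

        # Drop the fragment so 'page#a' and 'page#b' map to the same URL.
        absolute, _fragment = urldefrag(absolute)

        # Assumption: a scraper only wants links it can fetch over HTTP(S).
        if urlsplit(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved
```

Called as `resolve_links(url, links)`, this would replace the whole loop. Note that it takes the full page URL as the base rather than a separate `urldomain`, since relative links depend on the page's path and not only on its domain.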
