How would I parse HTML href values?


I'm writing a basic web scraper, and for that I need to parse links. I can extract the href attribute from the <a> elements, and I've written a basic parser that turns each href into an absolute URL. Here's the code:

    links: list[str]  # List of Href links, extracted from <a> elements
    url: str  # URL that these Hrefs came from
    urldomain: str  # Domain of the URL that these Hrefs came from

    for link in links:
        if link.startswith('/'):
            if link.startswith('//'):
                newlinks.append('http://' + link)  # HTML
            newlinks.append(urldomain + link)
        elif link.startswith("http"):
            newlinks.append(link)
        elif link.startswith("https"):
            newlinks.append(link)
        if not link.startswith('http'):
            if url.endswith('/'):
                newlinks.append(url + '/' + link)
            else:
                newlinks.append('/'.join(url.split('/')[:-1]))
        else:
            newlinks.append(link)

It's incredibly hacky, I know, but I just can't figure out how to get it right. I keep finding links that it resolves incorrectly, in ways that don't make sense to me. What am I missing?
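
For comparison, here is a minimal sketch of the same step done with the standard library's urllib.parse.urljoin, which resolves an href against the page URL instead of checking prefixes by hand. The function name resolve_links, its parameters, and the example.com URLs are hypothetical, chosen only for illustration. Dropping fragments and keeping only http/https links are assumptions about what a scraper would likely want, not requirements stated above.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def resolve_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve raw href values against the URL of the page they came from.

    Hypothetical helper: the name and the filtering policy are assumptions.
    """
    resolved: list[str] = []
    for href in hrefs:
        href = href.strip()
        if not href:
            continue  # Skip empty href attributes.
        # urljoin covers absolute URLs, '/root-relative', '//host/path',
        # 'relative/path', '../parent' and '?query' forms in one call.
        absolute = urljoin(base_url, href)
        # Drop the '#fragment' part so in-page anchors map to the page URL.
        absolute, _fragment = urldefrag(absolute)
        # Assumption: only web links are wanted, so mailto: and similar
        # schemes are skipped.
        if urlsplit(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


page = "https://example.com/docs/page.html"  # Illustrative URL only.
print(resolve_links(page, ["/about", "guide.html", "//cdn.example.com/lib.js"]))
```

With this approach the '//' case inherits the scheme of the page it was found on rather than a hard-coded 'http://', and a relative href such as guide.html is resolved against the directory of the page URL.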
