PHP, $regex to obtain each URL of videos about $pattern



Use a DOM parser if you can; it will be safer than a regular expression.

If you do use a regular expression, I would suggest the x flag (extended notation, which allows spaces and comments in the pattern) and the i flag (case-insensitive matching):

    /
    \s                        # href should be preceded by a space.
    href="(?<href>[^"]+)"     # Capture the href value (⚠️ simple solution).
    [^>]*                     # Any other attribute after href.
    >                         # Closing of opening <a> tag.
    (?<link_text>             # Capture the text of the link.
      [^<]*?                  # Anything before "Greta".
      \bGreta\b               # The word "Greta".
      [^<]*                   # Anything after "Greta".
    )
    /gix

I'm capturing the href value and the link text in two named groups. This can be done with the (?<group_name> ... ) syntax.
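As a minimal sketch of how named groups surface in PHP, the snippet below runs a simplified version of the pattern against a single link. The sample markup and the example.com URL are assumptions made for illustration, not data from the question.

```php
<?php
// Hypothetical one-link sample, not taken from the question.
$html = '<a href="https://example.com/v1">Greta speaks</a>';

// Two named groups: "href" and "link_text".
$pattern = '/href="(?<href>[^"]+)"[^>]*>(?<link_text>[^<]*)/i';

if (preg_match($pattern, $html, $m)) {
    // Named groups can be read by name (they also exist under numeric keys).
    echo $m['href'], "\n";
    echo $m['link_text'], "\n";
}
```

Reading the matches by name keeps the calling code readable if the pattern later gains or loses a group.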

Notice that href="..." could also be written href = "..." or href='...', which I don't take into consideration, to keep my answer simple. If you want to be sure it works, you should use a DOM parser. Otherwise, you end up with a long and complicated regular expression, like this one: https://regex101.com/r/8VKVxO/1

That complexity is the price of handling HTML with a regular expression, because the same markup can be written in many forms.
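To give an idea of what covering those notations involves, here is a small sketch that accepts optional spaces around = and either quote style. It is not the pattern behind the regex101 link; the sample strings and example.com URLs are assumptions, and unquoted attribute values are still not handled.

```php
<?php
// Hypothetical samples covering the three notations mentioned above.
$samples = [
    '<a href="https://example.com/a">A</a>',
    '<a href = "https://example.com/b">B</a>',
    "<a href='https://example.com/c'>C</a>",
];

// (["\']) captures the opening quote; \1 requires the same quote to close.
$pattern = '/\shref\s*=\s*(["\'])(?<href>.*?)\1/i';

$found = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m)) {
        $found[] = $m['href'];
    }
}

print_r($found);
```

Each extra variation adds more of this kind of machinery, which is why the pattern grows so quickly.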

Other attributes may follow the href attribute, so take them into account with [^>]*, which matches any character other than the closing > of the tag.

For "Greta" itself, use the word boundary \b before and after it. Otherwise, you could match something like "gretard", which isn't what you are looking for.
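The effect of the word boundary can be checked with a short sketch. The titles are taken from the sample data used later in this answer; the variable names are only illustrative.

```php
<?php
$titles = [
    'Greta is fine',
    'Not Gretard Thunberg',
    'Neither Hogreta',
    'a video of greta thunberg',
];

// Without \b, "greta" is also found inside longer words.
$without_boundary = array_values(preg_grep('/greta/i', $titles));

// With \b on both sides, only the standalone word matches.
$with_boundary = array_values(preg_grep('/\bgreta\b/i', $titles));

echo count($without_boundary), " matches without \\b\n";
echo count($with_boundary), " matches with \\b\n";
print_r($with_boundary);
```

preg_grep() preserves the original array keys, so array_values() is used here only to renumber the results.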

You can see the result here: https://regex101.com/r/9wYKYK/1

In PHP, you can run it here: https://onlinephp.io/c/87b952
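The code behind that link is not reproduced here. As a rough sketch of how the pattern could be applied in PHP, assuming preg_match_all() with PREG_SET_ORDER and a shortened copy of the sample data: PHP has no g flag, so only i and x remain, and the warning emoji is dropped from the comment.

```php
<?php
// Shortened sample markup (an assumption, based on the data used in this answer).
$data = <<<'HTML'
<a href="https://www.youtube.com/watch?v=TzrtLsQbaok">some another title</a>
<a href="https://www.youtube.com/watch?v=TMrtLsQbaok">Greta Thunberg talk about...</a>
<a href="https://www.youtube.com/watch?v=TMrtLsQbaok">Not Gretard Thunberg</a>
<a href="https://www.youtube.com/watch?v=qJv-upsvMfM">Israel and Greta</a>
HTML;

// A nowdoc keeps the backslashes and quotes of the pattern untouched.
$regex = <<<'REGEX'
/
\s                      # href should be preceded by a space.
href="(?<href>[^"]+)"   # Capture the href value (simple solution).
[^>]*                   # Any other attribute after href.
>                       # Closing of the opening <a> tag.
(?<link_text>           # Capture the text of the link.
  [^<]*?                # Anything before "Greta".
  \bGreta\b             # The word "Greta".
  [^<]*                 # Anything after "Greta".
)
/ix
REGEX;

// PREG_SET_ORDER yields one array per match, holding both named groups.
preg_match_all($regex, $data, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo $match['href'], ' => ', $match['link_text'], "\n";
}
```

With this sample, the two links whose text contains "Greta" as a whole word should be listed, and "Not Gretard Thunberg" should be skipped.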

The DOM parser solution, which is safer

In PHP, we can use DOMDocument with an XPath query to find all the <a> tags containing a specific text:

    <?php

    $data = <<<DATA
    <a href="https://www.youtube.com/watch?v=TzrtLsQbaok">some another title</a>
    <a href="https://www.youtube.com/watch?v=TzrtLsQbaok">some another title</a>
    <a href="https://www.youtube.com/watch?v=TzrtLsQbaok">some another title</a>
    <a href="https://www.youtube.com/watch?v=TMrtLsQbaok">Greta Thunberg talk about...</a>
    <a href="https://www.youtube.com/watch?v=TzrtLsQbaok">some another title</a>
    <a href="https://www.youtube.com/watch?v=TMrtLsQbaok">a video of greta thunberg</a>
    <a href="https://www.youtube.com/watch?v=TMrtLsQbaok">Not Gretard Thunberg</a>
    <a href="https://www.youtube.com/watch?v=TzrtLsQbaok">some another title</a>
    <a href="https://www.youtube.com/watch?v=TzrtLsQbaok">Greta is fine</a>
    <a href="https://www.youtube.com/watch?v=qJv-upsvMfM">Israel and Greta</a>
    <a href="https://www.youtube.com/watch?v=TzrtLsQbaok">Neither Hogreta</a>
    <a href="https://www.youtube.com/watch?v=TzrtLsQbaok">some another video</a>
    ...
    DATA;

    $dom = new DOMDocument();
    // Suppress warnings for malformed HTML.
    libxml_use_internal_errors(true);
    $dom->loadHTML($data);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);

    // The word to search, in lowercase.
    $word_to_search = 'greta';
    // The same word to search, in uppercase. Used in the translate() function of
    // the XPath query, later below.
    $word_to_search_uppercase = strtoupper($word_to_search);
    // A regular expression, used as a safety check, to match "greta" as a word only.
    // Positive lookbehind for the beginning of the string or a space char.
    // Positive lookahead for the end of the string or a space char.
    // The second argument of preg_quote() also escapes the "/" delimiter.
    $regex_word_to_search = '/(?<=^|\s)' . preg_quote($word_to_search, '/') . '(?=\s|$)/i';

    // Let's use an XPath query to get all <a> tags with a text containing
    // the word to search in lowercase, after having transformed all
    // uppercase letters of this word into the corresponding lowercase letter.
    // In XPath 2.0, you could use the lower-case() function instead of this
    // use of the translate() function.
    $query = <<<XPATH_QUERY
    //a[contains(
        translate(text(), '$word_to_search_uppercase', '$word_to_search'),
        '$word_to_search'
    )]
    XPATH_QUERY;

    echo "XPath query:\n$query\n\n";

    $links = $xpath->query($query);
    $urls = [];

    echo "XPath results:\n";
    foreach ($links as $link) {
        // Display the found link and href value.
        echo "Found: " . $link->nodeValue . "\thref: " . $link->getAttribute('href');
        // Check that it's really a word and not part of a word.
        if (preg_match($regex_word_to_search, $link->nodeValue)) {
            echo "\tOK\n";
            // Store the URL and link title in a dictionary, in case the same
            // href value appears in several different links.
            $urls[$link->getAttribute('href')][] = $link->nodeValue;
        } else {
            echo "\tNot a word!\n";
        }
    }
    echo "\n";

    echo "\$urls = " . var_export($urls, true) . "\n";

Run it live here: https://onlinephp.io/c/839c1

Output:

    XPath query:
    //a[contains(
        translate(text(), 'GRETA', 'greta'),
        'greta'
    )]

    XPath results:
    Found: Greta Thunberg talk about...    href: https://www.youtube.com/watch?v=TMrtLsQbaok    OK
    Found: a video of greta thunberg    href: https://www.youtube.com/watch?v=TMrtLsQbaok    OK
    Found: Not Gretard Thunberg    href: https://www.youtube.com/watch?v=TMrtLsQbaok    Not a word!
    Found: Greta is fine    href: https://www.youtube.com/watch?v=TzrtLsQbaok    OK
    Found: Israel and Greta    href: https://www.youtube.com/watch?v=qJv-upsvMfM    OK
    Found: Neither Hogreta    href: https://www.youtube.com/watch?v=TzrtLsQbaok    Not a word!

    $urls = array (
      'https://www.youtube.com/watch?v=TMrtLsQbaok' =>
      array (
        0 => 'Greta Thunberg talk about...',
        1 => 'a video of greta thunberg',
      ),
      'https://www.youtube.com/watch?v=TzrtLsQbaok' =>
      array (
        0 => 'Greta is fine',
      ),
      'https://www.youtube.com/watch?v=qJv-upsvMfM' =>
      array (
        0 => 'Israel and Greta',
      ),
    )
