mcnd
Too busy to read RFC 3986 or W3C about the characters allowed.
It is not just about the URL. Plus, on page 50 of the RFC (Appendix B) you can see the regex for breaking a URL into its parts.
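For reference, the expression in Appendix B of RFC 3986 splits a URI reference into its components; it does not validate it, so it matches almost any string. A minimal sketch of how it behaves follows. Python is used here only for illustration, and `split_uri` is a hypothetical helper name, not something from the RFC.

```python
import re

# Regular expression from RFC 3986, Appendix B.
# Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Split a URI reference into its five components (hypothetical helper)."""
    match = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


# The example URI used in the RFC itself.
print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))

# A string that is clearly not a URL still "matches": it is treated as a path.
print(split_uri("not a url at all"))
```

Because the second call also succeeds, this expression on its own cannot tell you whether a string found in a page is a legitimate URL.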
mcnd
But you are right. You can find anything in an HTML page.
I'm not trying to show willfully broken pages, if that's what you mean. That URL from before is legitimate AFAICT.
mcnd
But since it is easier to do it the hard way,
I'm not sure what that means
mcnd
it's probably faster to do a little previous cleanup to handle those cases.
What about the following? I know it's not perfect, but it's amusing:
<\/?(?!img)\w(\s+(?:\w+\s*=\s*\x22[^\x22]*\x22|\w+\s*=\s*[^ ]+|\w+)){0,}[^>]*>
Nope. (X)HTML is not a regular language, so a regular expression on its own won't help you, short of Perl/.NET extensions (which makes them irregular expressions, I guess). Take this document, for example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>foo</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
<!-- I like turtles -->
<img src="example.png"/>
<![CDATA[<img src="example.png"/>]]>
</body>
</html>
This is one way that will work:
Dim Document : Set Document = CreateObject("htmlfile")
Document.Write htmlString
Dim Images : Set Images = Document.getElementsByTagName("img")
Dim Image
For Each Image In Images
    WScript.Echo Image.OuterHtml
Next
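The same idea, letting a real parser report the tags instead of matching them with a regular expression, can be sketched outside VBScript too. The following is only an illustration in Python using the standard library's `html.parser`; the class name `ImgCollector` and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the attributes of every <img> element the parser reports."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Called for <img ...> and, via handle_startendtag, for <img ... />.
        # Tag and attribute names arrive lowercased.
        if tag == "img":
            self.images.append(dict(attrs))


# Hypothetical input: one commented-out image and two real ones.
html_string = """
<html><body>
<!-- <img src="commented-out.png"> -->
<p>Text <img src="example.png" alt="demo"/> more text</p>
<IMG SRC='second.png'>
</body></html>
"""

collector = ImgCollector()
collector.feed(html_string)
collector.close()

for image in collector.images:
    print(image.get("src"))
```

The commented-out image is skipped because the parser hands comments to a separate callback, which is the kind of case a plain pattern match tends to get wrong.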
Another way is to use
CreateObject("MSXML2.DomDocument.3.0") to load the file as XML (assuming it is well-formed) and query it with XPath. It's about the same amount of code.
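To make the XML route concrete without guessing at the exact MSXML calls, here is a hedged sketch in Python using `xml.etree.ElementTree`, which supports a limited subset of XPath. It runs against the XHTML sample above and compares the result with a naive pattern match. The pattern `<img\b[^>]*>` and the prefix `x` are assumptions chosen for the illustration.

```python
import re
import xml.etree.ElementTree as ET

xhtml = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>foo</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
<!-- I like turtles -->
<img src="example.png"/>
<![CDATA[<img src="example.png"/>]]>
</body>
</html>"""

# Naive pattern match: it also hits the text inside the CDATA section.
regex_hits = re.findall(r"<img\b[^>]*>", xhtml)

# XML parse: the CDATA content is plain character data, so only the real
# element is found. XHTML elements live in a namespace, which has to be
# bound to a prefix for the path expression.
root = ET.fromstring(xhtml)
namespaces = {"x": "http://www.w3.org/1999/xhtml"}
elements = root.findall(".//x:img", namespaces)

print(len(regex_hits))         # 2
print(len(elements))           # 1
print(elements[0].get("src"))  # example.png
```

The namespace binding is the part that is easy to miss: an unqualified `//img` would find nothing in this document, and the same concern applies to any namespace-aware XPath engine.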