VB Regex help

Author Message
TheQuestor

  • Total Posts : 5
  • Scores: 0
  • Reward points : 0
  • Joined: 1/18/2011
  • Status: offline
VB Regex help Tuesday, January 18, 2011 1:20 AM
I am currently trying to figure out how to filter out all a href links [and whatever other HTML I need to remove later] from XML that I have converted to standard HTML and stored in a database. I can use the following:
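Function RemoveHTML( strText )
 ' Strip every tag: "<", then anything that is not ">", then ">"
 Dim RegEx
 Set RegEx = New RegExp
 RegEx.Pattern = "<[^>]*>"
 RegEx.Global = True ' replace all matches, not just the first
 RemoveHTML = RegEx.Replace(strText, "")
End Function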

to strip out ALL HTML, but I would like to keep the image tags intact, which I cannot do with the above code. Can somebody give me a good regular expression that removes just, say, the following:
<a href="someurl">Some words here</a>
and, better yet, the above and also something like:
<div style="blah;blah;">Leave the words alone </div> 
I guess for the above I would just need to remove <div style="blah;blah;"> and then remove </div> once all the opening div tags are gone.
Anyway, before I get even more confused, maybe just start with the a href one :)
 
HELP!
 
#1
    TNO

    • Total Posts : 2094
    • Scores: 36
    • Reward points : 0
    • Joined: 12/18/2004
    • Location: Earth
    • Status: offline
    Re:VB Regex help Saturday, January 29, 2011 9:48 AM
    It is probably easier to use the DOM (XML/HTML) and the innerText property of the elements in question to get the content you need. Don't parse XML/HTML with regular expressions:
     
    http://stackoverflow.com/...d-tags/1732454#1732454
     
    To iterate is human, to recurse divine. -- L. Peter Deutsch
     
    #2
      mcnd

      • Total Posts : 50
      • Scores: 2
      • Reward points : 0
      • Joined: 4/27/2008
      • Status: offline
      Re:VB Regex help Wednesday, March 23, 2011 12:28 AM
      RegEx.Pattern = "</?(?!img)[^>]*>"
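Dropped into the RemoveHTML function from the first post, the pattern above could look roughly like the sketch below. The function names RemoveHTMLKeepImages and RemoveTags are made up for illustration, and IgnoreCase is an added assumption so that <IMG> is kept as well. The second function is a hypothetical variant that removes only the named tags (for example a and div) and leaves their contents alone. Neither one copes with a ">" inside an attribute value.

```vbscript
' Sketch only: function names are hypothetical, not from the thread.

' Remove every tag except <img ...>, using the negative lookahead pattern.
Function RemoveHTMLKeepImages(strText)
    Dim RegEx
    Set RegEx = New RegExp
    RegEx.Pattern = "</?(?!img)[^>]*>"
    RegEx.IgnoreCase = True   ' assumption: treat <IMG> like <img>
    RegEx.Global = True       ' replace all matches, not just the first
    RemoveHTMLKeepImages = RegEx.Replace(strText, "")
End Function

' Remove only the opening and closing tags named in tagNames,
' e.g. "a|div"; the text between the tags is left in place.
Function RemoveTags(strText, tagNames)
    Dim RegEx
    Set RegEx = New RegExp
    RegEx.Pattern = "</?(?:" & tagNames & ")\b[^>]*>"
    RegEx.IgnoreCase = True
    RegEx.Global = True
    RemoveTags = RegEx.Replace(strText, "")
End Function

Dim html
html = "<div style=""blah;blah;"">Leave <a href=""someurl"">the words</a> alone <img src=""foo.png""></div>"

' Both calls print: Leave the words alone <img src="foo.png">
WScript.Echo RemoveHTMLKeepImages(html)
WScript.Echo RemoveTags(html, "a|div")
```

The \b after the tag list keeps "a" from also matching tags such as <abbr>.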
       
      #3
        TNO

        Re:VB Regex help Wednesday, March 23, 2011 8:22 AM
        mcnd


        RegEx.Pattern = "</?(?!img)[^>]*>"



        That fails on input like this, where an attribute value itself contains < and >:
        <a href="http://example.com/hello<world>/littlejohnny.html"><img id="foo" src="foo.png"/></a>


        Writing the HTML to a CreateObject("htmlfile") document and then calling .getElementsByTagName("img") is probably wiser.
         
        #4
          mcnd

          Re:VB Regex help Wednesday, March 23, 2011 9:23 PM
          Too busy to read RFC 3986 or W3C about the characters allowed.
          But you are right. You can find anything in an HTML page.
          But since it is easier to do it the hard way, it's probably faster to do a little preliminary cleanup to handle those cases.
          What about
          <\/?(?!img)\w(\s+(?:\w+\s*=\s*\x22[^\x22]*\x22|\w+\s*=\s*[^ ]+|\w+)){0,}[^>]*>
          ? I know it's not perfect, but it's amusing
           
           
          #5
            TNO

            Re:VB Regex help Thursday, March 24, 2011 8:33 AM
            mcnd

            Too busy to read RFC 3986 or W3C about the characters allowed.

             
            It is not just about the URL. Besides, on page 50 of the RFC you can see a regex for parsing a URI.
             
            mcnd

            But you are right. You can find anything in an HTML page.

             
            I'm not trying to show willfully broken pages, if that's what you mean. That URL from before is legitimate AFAICT.
             
            mcnd

            But since it is easier to do it the hard way,

             
            I'm not sure what that means
             
            mcnd

            it's probably faster to do a little preliminary cleanup to handle those cases.
            What about
            <\/?(?!img)\w(\s+(?:\w+\s*=\s*\x22[^\x22]*\x22|\w+\s*=\s*[^ ]+|\w+)){0,}[^>]*>
            ? I know it's not perfect, but it's amusing

             
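            Nope, (X)HTML is not a regular language, so a regular expression won't help you on its own, short of Perl/.NET extensions (which makes them irregular expressions, I guess). Consider a document like this one, where the same img markup appears once as a real element and once as literal text inside a CDATA section: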
            <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
            <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
                <head>
                    <title>foo</title>
                    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
                </head>
                <body>
                    <!-- I like turtles -->
                    <img src="example.png"/>
                    <![CDATA[<img src="example.png"/>]]>
                </body>
            </html>

             
            This is one way which will work:
            Dim Document : Set Document = CreateObject("htmlfile")
            Document.Write htmlString
            Dim Images : Set Images = Document.getElementsByTagName("img")
            Dim Image
            For Each Image In Images
                WScript.Echo Image.OuterHtml
            Next
             

             
            Another way is to use CreateObject("MSXML2.DomDocument.3.0") to load the file as XML (assuming it is valid) and query it with XPath. It's about the same amount of code.
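A rough sketch of what the MSXML route might look like, assuming the markup is well-formed XML. The variable xmlString and its sample content are made up for illustration. The local-name() test is one way around the XHTML default namespace, which would otherwise stop a plain //img from matching under XPath.

```vbscript
' Sketch only: xmlString is a hypothetical stand-in for the stored markup.
Dim xmlString
xmlString = "<html xmlns=""http://www.w3.org/1999/xhtml""><body>" & _
            "<p>Some words <img src=""example.png""/></p></body></html>"

Dim Xml, Node
Set Xml = CreateObject("MSXML2.DOMDocument.3.0")
Xml.async = False              ' load synchronously
Xml.validateOnParse = False    ' well-formedness is enough here
Xml.resolveExternals = False   ' do not fetch an external DTD
Xml.setProperty "SelectionLanguage", "XPath"

If Xml.loadXML(xmlString) Then
    ' local-name() matches img regardless of its namespace
    For Each Node In Xml.selectNodes("//*[local-name()='img']")
        WScript.Echo Node.xml
    Next
Else
    WScript.Echo "Not well-formed: " & Xml.parseError.reason
End If
```

If loadXML fails, parseError.reason says why, which is a quick way to find out whether the stored markup is really XML.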
             
            #6
              mcnd

              Re:VB Regex help Friday, March 25, 2011 12:00 AM
              TNO

              I'm not trying to show willfully broken pages if that's what you mean. That url from before is legitimate AFAICT.

              I know. That is not what I meant. I was just talking about (X)HTML in general, not your answer.
               
              TNO

              mcnd

              But since it is easier to do it the hard way,

              I'm not sure what that means

              Just a not-so-good translation of a Spanish proverb: "es más fácil hacerlo de la forma más difícil"
               
              TNO

              Nope, (X)HTML is not a regular language, so a regular expression won't help you on its own short of Perl/.NET extensions (which makes them irregular expressions I guess).
              ....

              As I said before, do a little preliminary cleanup first (remove anything not needed).
               
              TNO
               
              This is one way which will work:
              Dim Document : Set Document = CreateObject("htmlfile")
              Document.Write htmlString
              Dim Images : Set Images = Document.getElementsByTagName("img")
              Dim Image
              For Each Image In Images
                  WScript.Echo Image.OuterHtml
              Next
               

              Another way is to use CreateObject("MSXML2.DomDocument.3.0") to load the file as XML (assuming it is valid) and query it with XPath. It's about the same amount of code.

              Sure, this retrieves all the image tags, but
              TheQuestor

              ... strip out ALL HTML but I would like to keep the image tags intact ...

               
              Regular expressions ALONE do not solve the problem. Maybe I'm wrong, but using htmlfile or DomDocument seems less efficient.
              BUT
              well, the original post says
              TheQuestor

              ... from XML that I have converted to standard html and stored in a database ...

              If the original source is XML, the solution probably does not involve dealing with HTML or regexps at all.
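For what it's worth, the htmlfile approach quoted above could probably be stretched to cover the original request (drop every tag, keep the text and the img tags) by walking the parsed DOM instead of only listing the images. The sketch below is untested guesswork along those lines. KeepTextAndImages is a made-up name and the sample markup is invented. nodeValue returns decoded text, so entities would need re-encoding (in ASP, Server.HTMLEncode) before the result is written back out as HTML.

```vbscript
' Sketch only: KeepTextAndImages is a hypothetical helper.
' Walks the DOM recursively, keeping text nodes and <img> elements.
Function KeepTextAndImages(node)
    Dim child, result
    result = ""
    For Each child In node.childNodes
        Select Case child.nodeType
            Case 3   ' text node: keep the text
                result = result & child.nodeValue
            Case 1   ' element node
                If LCase(child.nodeName) = "img" Then
                    result = result & child.outerHTML
                Else
                    ' drop the tag itself, keep whatever is inside it
                    result = result & KeepTextAndImages(child)
                End If
        End Select
    Next
    KeepTextAndImages = result
End Function

Dim Doc : Set Doc = CreateObject("htmlfile")
Doc.write "<div style=""blah;blah;"">Leave <a href=""someurl"">the words</a> alone <img src=""foo.png""></div>"
Doc.close
WScript.Echo KeepTextAndImages(Doc.body)
```

The parser may normalize the img tag (case, attribute order), so the output is not guaranteed to be byte-for-byte what went in. A real version would probably also skip script and style elements.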
               
               
               
              #7
