Regular Expressions

Matching a word / characters outside of html tags

Posted on January 4, 2008. Filed under: PHP, Regular Expressions | Tags: , , , |

Today I spent a good 2 hours on this very simple regEx problem. I tried googling just about every set of search terms I could think of, and didn’t find anything useful … basically I wanted to replace a certain word inside a string with another word, but not within html.

To do this I used a negative lookahead to see if there were any > characters after the string I wanted to replace, preceded by any non < characters [if any]. The beauty of the look around functions are that they don’t match text … they instead match what’s positioned around the text, similar to how $, ^ and \b function.

So in English, the regEx I came up with, word(?!([^<]+)?>), could be interpreted as:

  • Match the characters "word" literally word
  • Assert that it is impossible to match the regex below starting at this position (negative lookahead) (?!([^)
    • Match the regular expression below and capture its match into backreference number 1 ([^<]+)?
      • Between zero and one times, as many times as possible, giving back as needed (greedy) ?
      • Match any character that is not a "<" [^<]+
        • Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
    • Match the character ">" literally >

[Thanks RegexBuddy]

For example, if we wanted to replace all instances of word with repl which exist outside of any HTML:

word <a href="word">word</word>word word

would become:

repl <a href="word">repl</word>repl repl

The regular expression I used to do this, word(?!([^<]+)?>), fits nicely into preg_replace();

    $str = "word <a href=\"word\">word</word>word word";
    $str = preg_replace("/word(?!([^<]+)?>)/i","repl",$str);
    echo $str;
    # repl <word word="word">repl</word>

I know this is really a one-liner, but I have it in its expanded form to simplify the steps.

Read Full Post | Make a Comment ( 29 so far )

Liked it here?
Why not try sites on the blogroll...