Matching a word / characters outside of html tags

Posted on January 4, 2008. Filed under: PHP, Regular Expressions

Today I spent a good 2 hours on this very simple regEx problem. I tried googling just about every set of search terms I could think of and didn’t find anything useful. Basically, I wanted to replace a certain word inside a string with another word, but not where it appears inside HTML tags.

To do this I used a negative lookahead that checks whether a > character follows the string I want to replace, optionally preceded by a run of non-< characters. If it does, the word sits inside a tag and is skipped. The beauty of the lookaround assertions is that they don’t consume text; they instead match a position based on what surrounds it, similar to how $, ^ and \b function.

So in English, the regEx I came up with, word(?!([^<]+)?>), could be interpreted as:

  • Match the characters "word" literally: word
  • Assert that it is impossible to match the regex below starting at this position (negative lookahead): (?!([^<]+)?>)
    • Match the regular expression below and capture its match into backreference number 1: ([^<]+)?
      • Between zero and one times, as many times as possible, giving back as needed (greedy): ?
      • Match any character that is not a "<": [^<]+
        • Between one and unlimited times, as many times as possible, giving back as needed (greedy): +
    • Match the character ">" literally: >

[Thanks RegexBuddy]
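To see the breakdown above in action, here is a minimal sketch that lists where the pattern matches in the sample string used in this post. It assumes nothing beyond PHP's built-in PCRE functions; the variable names are only illustrative. PREG_OFFSET_CAPTURE reports the byte offset of each match, so the occurrences inside the tags (offsets 14 and 26) are visibly missing.

```php
<?php
// Sample input: "word" appears in plain text, in an attribute value,
// and in a closing tag name.
$html = 'word <a href="word">word</word>word word';

// The pattern from this post: "word" not followed by a ">" that is
// reachable without crossing a "<".
$pattern = '/word(?!([^<]+)?>)/i';

preg_match_all($pattern, $html, $matches, PREG_OFFSET_CAPTURE);

// Each full match is a [text, offset] pair; keep only the offsets.
$offsets = array_column($matches[0], 1);

echo implode(', ', $offsets), "\n";
// 0, 20, 31, 36
```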

For example, if we wanted to replace all instances of word with repl which exist outside of any HTML:


word <a href="word">word</word>word word

would become:


repl <a href="word">repl</word>repl repl

The regular expression I used to do this, word(?!([^<]+)?>), fits nicely into preg_replace();


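<?php
    // Sample input: "word" appears in text, in an attribute value and in a tag name.
    $str = "word <a href=\"word\">word</word>word word";
    // Replace "word" unless it is followed by a ">" with no "<" in between.
    $str = preg_replace("/word(?!([^<]+)?>)/i", "repl", $str);
    echo $str;
    # repl <a href="word">repl</word>repl repl
?>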

I know this is really a one-liner, but I have it in its expanded form to simplify the steps.
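If the search term comes from a variable, it probably needs escaping before it goes into the pattern. The following is only a sketch: replace_outside_tags() is a hypothetical helper name, not something from PHP itself, and it assumes PHP 7 or later for the type declarations.

```php
<?php
/**
 * Hypothetical helper: replace $search with $replacement wherever it is
 * not followed by a ">" that can be reached without crossing a "<".
 */
function replace_outside_tags(string $search, string $replacement, string $html): string
{
    // preg_quote() escapes regex metacharacters in the search term;
    // its second argument also escapes the "/" delimiter.
    $pattern = '/' . preg_quote($search, '/') . '(?!([^<]+)?>)/i';

    $result = preg_replace($pattern, $replacement, $html);

    // preg_replace() returns null on error; fall back to the input.
    return $result === null ? $html : $result;
}

$out = replace_outside_tags('word', 'repl', 'word <a href="word">word</word>word word');
echo $out, "\n";
// repl <a href="word">repl</word>repl repl
```

Note that preg_replace() still interprets sequences such as $1 in the replacement string, so a replacement taken from user input would need its own handling.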


32 Responses to “Matching a word / characters outside of html tags”


I’m a regex noob, and this is exactly what I was looking for. Thanks!

Thanks for posting this! This is the exact RegEx I was looking for, so you saved me a lot of time!

Thanks but this fails if you have a closing angle bracket following the text you want to replace without having a preceding opening bracket.

e.g.
word > word word

becomes

word > repl repl

To answer that, simply use the entity &gt; in your text for such cases. Then it will work.

Just what I needed! Thank you so much !

True, this doesn’t work for word> word word, but in HTML text a > is represented by &gt; so this won’t be a problem. A lone > is also normally escaped in well-formed HTML.
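A short sketch of the edge case discussed in these comments, using the same pattern as the post. The first call reproduces the failure with a raw > in the text; the second shows the same text with the character written as an entity.

```php
<?php
$pattern = '/word(?!([^<]+)?>)/i';

// A raw ">" in the text looks like the end of a tag to the lookahead,
// so the first "word" is wrongly skipped.
$raw = preg_replace($pattern, 'repl', 'word > word word');
echo $raw, "\n";
// word > repl repl

// Written as the entity &gt; there is no literal ">" ahead,
// and every occurrence is replaced.
$escaped = preg_replace($pattern, 'repl', 'word &gt; word word');
echo $escaped, "\n";
// repl &gt; repl repl
```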

Very simple and very functional. I’m using it in javascript ‘replace’ function and it works perfectly too. Thanks!

I am trying to use it in a JavaScript ‘replace’ call, but so far I am getting errors. Can someone post a working example? It probably has to do with escaping or something in JS. I can use it in PHP with no problems; I guess JS is slightly different.

I also tried:
1)-

var filteredWord = "/(?!(?:[^<]+>|[^>]+<\/.*?>))\\b(' + keyword + ')\\b/";

The above comes back with errors when I do the final search and replace call as:

content = content.replace(filteredWord," <span class='word'>$1<\/span> ");

OR:
2)

content = content.replace(filteredWord," <span class='keywordcss'>'+keyword+'<\/span> ");

I also want to make sure that only text nodes outside tags are searched and replaced. My output does the search and replace, but also inside href and image links and other HTML-enclosed text, which I want to prevent. In PHP, number one works OK when you replace the keyword bit with $keyword.

A second request for the gurus here, for a PHP or JS function: how do you limit replacement to a certain number of occurrences? Basically, I want to search and replace up to a limit of 5 keywords/phrases on a page.

You’ll have to create a RegExp object [for JS] like so:

var filteredWord = new RegExp("(?!(?:[^<]+>|[^>]+<\/.*?>))\\b(" + keyword + ")\\b","g");

As for limiting the number of words replaced, you can remove the “g” flag from the code above and do something like this:

for (var _i=0;_i<5;_i++) {
	content = content.replace(filteredWord," <span class='word'>$1<\/span> ");
}
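On the PHP side no loop should be needed, since preg_replace() takes an optional fourth argument that caps the number of replacements and a fifth that receives the count actually made. A minimal sketch with the pattern from the post and a made-up sample string:

```php
<?php
// Made-up sample: seven occurrences outside tags, one inside an attribute.
$content = 'word word <b title="word">word</b> word word word word';
$pattern = '/word(?!([^<]+)?>)/i';

// Fourth argument: maximum number of replacements.
// Fifth argument: filled with the number of replacements performed.
$limited = preg_replace($pattern, 'repl', $content, 5, $count);

echo $limited, "\n";
// repl repl <b title="word">repl</b> repl repl word word
echo $count, "\n";
// 5
```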

Exactly what I was after!
You are a lifesaver!

Thanks

This is the equivalent for Vim regex:

:%s/word\(\([^<]\+\)\?>\)\@!/repl/g

P.S.: Sorry for the previous comment.

God damn, I have been trying to fix this problem for hours, and then I just came across this lovely little regex. Thank you so much for posting this dude!!!!

Thank you so much for your article. This helped me a great saving me several hours. I had already spent multiple hours trying to come up with what you have explained here.

Hey thanks a lot, u just saved me a lot of time, was trying to do the exact same thing.

T-H-A-N-K-S!

Like others who’ve commented, I just want to let you know that this helped me out immensely! Thank you!

Sorry you will have to view source of my above mentioned comment. It parsed html sample text :(

Thank you for this little snippet. Just saved me a ton of work with our search engine :-)

Just one problem… what if you have “the worded information lacked a stop for words that you didn’t want associated with word”?
How do you affect only the word “word” and not “wordplay”?
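One way to handle this, sketched here as a suggestion rather than anything from the original post, is to wrap the term in \b word-boundary assertions so that longer words such as “worded” or “words” are left alone:

```php
<?php
$text = "the worded information lacked a stop for words that you didn't want associated with word";

// \b on both sides requires "word" to stand alone as a whole word;
// the lookahead from the post still keeps the match outside tags.
$pattern = '/\bword\b(?!([^<]+)?>)/i';

$result = preg_replace($pattern, 'repl', $text);
echo $result, "\n";
// the worded information lacked a stop for words that you didn't want associated with repl
```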

Hi. I have a small problem. I want to match multi-word phrases, e.g. “Some Word”, that are not inside a given element.
E.g. “This some word is outside and this some word is inside.” I want to match only the outside one. Could you help?
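The lookahead alone cannot tell whether a phrase sits between an opening and a closing tag, so a different approach is probably needed here. The sketch below splits the markup into tag and text tokens and only touches text that is outside a chosen element. Everything in it is an assumption for illustration: replace_outside_element() is a hypothetical name, the element to skip is taken to be <a> because the tag in the comment above was lost, and the markup is assumed to be reasonably well formed.

```php
<?php
/**
 * Hypothetical helper: replace $phrase in text that is outside all tags
 * and also outside any <$skipTag>...</$skipTag> element.
 * Not a full HTML parser; assumes reasonably well-formed markup.
 */
function replace_outside_element(string $phrase, string $replacement, string $html, string $skipTag): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // even indexes are text, odd indexes are the captured tags.
    $tokens = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $tag    = preg_quote($skipTag, '/');
    $depth  = 0;   // number of $skipTag elements currently open
    $out    = '';

    foreach ($tokens as $i => $token) {
        if ($i % 2 === 1) {
            // A tag: track opening and closing of the element to skip.
            if (preg_match('/^<\s*' . $tag . '\b/i', $token)) {
                $depth++;
            } elseif (preg_match('/^<\s*\/\s*' . $tag . '\s*>/i', $token)) {
                $depth = max(0, $depth - 1);
            }
            $out .= $token;
        } elseif ($depth === 0) {
            // Text outside the skipped element: case-insensitive replace.
            $out .= str_ireplace($phrase, $replacement, $token);
        } else {
            // Text inside the skipped element: leave untouched.
            $out .= $token;
        }
    }

    return $out;
}

$sample = 'This some word is outside and <a href="#">this some word</a> is inside.';
$out = replace_outside_element('some word', 'REPL', $sample, 'a');
echo $out, "\n";
// This REPL is outside and <a href="#">this some word</a> is inside.
```

For anything beyond simple markup, a real parser such as PHP's DOM extension is likely the safer choice.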

Thank you, I was looking for something to delete spaces =)

Works great, thanks a lot.

Nice piece of code! I modified it slightly, because I want to test for a match of ANY character in an HTML block that is outside of the tags.

Testing for any character outside of HTML tags looks like this:

preg_match_all('/[a-z0-9-](?!([^<]+)?>)/is', '<b style="background:red;">hello</b>', $result);

The matching result for this instance will be:

Array
(
    [0] => Array
        (
            [0] => h
            [1] => e
            [2] => l
            [3] => l
            [4] => o
        )

    [1] => Array
        (
            [0] => 
            [1] => 
            [2] => 
            [3] => 
            [4] => 
        )
)

If there are no matches, preg_match_all returns 0 and the sub-arrays in $result are empty.

You could also use preg_match instead of preg_match_all to get just the first match, which in this case would be “h”.
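A small sketch of the preg_match variant, using the same pattern and sample markup as above. The second call, on a made-up input with no text outside tags, is only there to show the return value when nothing matches.

```php
<?php
$html    = '<b style="background:red;">hello</b>';
$pattern = '/[a-z0-9-](?!([^<]+)?>)/is';

// preg_match() stops at the first match and returns 1 (0 if none).
$found = preg_match($pattern, $html, $first);
echo $found, ' ', $first[0], "\n";
// 1 h

// Input with no characters outside tags: the return value is 0
// and the list of full matches is empty.
$count = preg_match_all($pattern, '<br>', $all);
echo $count, ' ', count($all[0]), "\n";
// 0 0
```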

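[…] I tried writing several sets of matching patterns, and all of them failed. In the end I turned to the almighty Google for help. Here the search keywords matter a lot; it is best to translate the terms you want to search for into the corresponding English words, and the results will be much more satisfying. That is how I found the solution: Matching A Word / Characters Outside Of Html Tags. […]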

