Matching a word / characters outside of html tags

Posted on January 4, 2008. Filed under: PHP, Regular Expressions

Today I spent a good two hours on a seemingly simple regex problem. I tried googling just about every set of search terms I could think of and didn’t find anything useful. Basically, I wanted to replace a certain word inside a string with another word, but only where it appears outside of HTML tags.

To do this I used a negative lookahead that checks whether a > character follows the string I want to replace, optionally preceded by a run of non-< characters. If it does, the word sits inside a tag and is left alone. The beauty of lookaround assertions is that they don’t match text; they instead match what’s positioned around the text, similar to how $, ^ and \b function.

So in English, the regEx I came up with, word(?!([^<]+)?>), could be interpreted as:

  • Match the characters "word" literally word
  • Assert that it is impossible to match the regex below starting at this position (negative lookahead) (?!([^<]+)?>)
    • Match the regular expression below and capture its match into backreference number 1 ([^<]+)?
      • Between zero and one times, as many times as possible, giving back as needed (greedy) ?
      • Match any character that is not a "<" [^<]+
        • Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
    • Match the character ">" literally >

[Thanks RegexBuddy]
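
To see the "doesn’t match text" part of that breakdown in action, here is a minimal sketch that lists where the pattern matches and how long each match is. The sample string is made up for illustration and is not from the original problem; the point is that every match is exactly four characters long, because the lookahead only inspects what follows and consumes nothing.

```php
<?php
// Hypothetical sample input: "word" in text, in an attribute, and in text again.
$html = 'word <b class="word">word</b>';

// PREG_OFFSET_CAPTURE makes each match an array of [matched text, byte offset].
preg_match_all('/word(?!([^<]+)?>)/', $html, $matches, PREG_OFFSET_CAPTURE);

// Collect the offset and length of every full match.
$offsets = array_map(function ($m) { return $m[1]; }, $matches[0]);
$lengths = array_map(function ($m) { return strlen($m[0]); }, $matches[0]);

print_r($offsets); // offsets 0 and 21; the "word" inside class="..." is skipped
print_r($lengths); // each match is 4 bytes: just "word", nothing from the lookahead
```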

For example, if we wanted to replace all instances of word with repl which exist outside of any HTML:


word <a href="word">word</word>word word

would become:


repl <a href="word">repl</word>repl repl

The regular expression I used to do this, word(?!([^<]+)?>), fits nicely into preg_replace():


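<?php
    // Sample input: "word" appears in plain text and inside tags/attributes.
    $str = "word <a href=\"word\">word</word>word word";
    // Replace "word" (case-insensitively) unless a ">" follows before any "<".
    $str = preg_replace("/word(?!([^<]+)?>)/i", "repl", $str);
    echo $str;
    # repl <a href="word">repl</word>repl repl
?>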

I know this is really a one-liner, but I have it in its expanded form to simplify the steps.

24 Responses to “Matching a word / characters outside of html tags”

I’m a regex noob, and this is exactly what I was looking for. Thanks!

Thanks for posting this! This is the exact RegEx I was looking for, so you saved me a lot of time!

Thanks but this fails if you have a closing angle bracket following the text you want to replace without having a preceding opening bracket.

e.g.
word > word word

becomes

word > repl repl

To answer that, simply use &gtl; in your code for such cases. Then it will work.

my bad it is >

I accidentally typed an l in there.

Oh it auto parses the code? This is what you should use… (ampersand)gt(semicolon)

Just what I needed! Thank you so much !

True, this doesn’t work for word> word word, but in HTML text a > is represented by &gt; so this won’t be a problem. Also, a lone > should not be found in valid HTML.
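
A minimal sketch of that workaround, assuming the stray > lives in a plain-text fragment that you control and can encode before it is mixed with markup (running htmlspecialchars() over text that already contains tags would escape the tags too). The variable names are made up for illustration.

```php
<?php
// Plain-text fragment (not markup) containing a literal ">".
$text = 'word > word word';

// Without encoding, the first "word" is skipped because a ">" follows it.
$raw = preg_replace('/word(?!([^<]+)?>)/i', 'repl', $text);

// Encode the text first so ">" becomes "&gt;", then run the same replacement.
$encoded = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
$fixed = preg_replace('/word(?!([^<]+)?>)/i', 'repl', $encoded);

echo $raw, "\n";   // word > repl repl
echo $fixed, "\n"; // repl &gt; repl repl
```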

Very simple and very functional. I’m using it in javascript ‘replace’ function and it works perfectly too. Thanks!

I am trying to use it in a JavaScript ‘replace’ call, but so far I am getting errors. Can someone post a working example? It probably has to do with escaping or something in JS. I can use it in PHP no problem; in JS I guess it is slightly different.

I also tried:
1)-

var filteredWord = "/(?!(?:[^<]+>|[^>]+<\/.*?>))\\b(' + keyword + ')\\b/";

The above comes back with errors when I do the final search and replace call as:

content = content.replace(filteredWord," <span class='word'>$1<\/span> ");

OR:
2)

content = content.replace(filteredWord," <span class='keywordcss'>'+keyword+'<\/span> ");

I also want to make sure only text nodes outside tags are searched and replaced. My output does the search and replace, but also inside href and image links and other text enclosed in HTML tags, which I want to prevent. In PHP, number one works OK when you replace the keyword bit with $keyword.

A second request for the gurus here, for either PHP or JS: how do you limit the replacement to a certain number of occurrences? Basically, I want to search and replace up to a limit of 5 keywords/phrases on a page.

You’ll have to create a RegExp object [for JS] like so:

var filteredWord = new RegExp("(?!(?:[^<]+>|[^>]+<\/.*?>))\\b(" + keyword + ")\\b","g");

As far as limiting the number of words replaced, you can remove the “g” flag from the code above and do something like this:

for (var _i=0;_i<5;_i++) {
	content = content.replace(filteredWord," <span class='word'>$1<\/span> ");
}
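
On the PHP side no loop should be needed, since preg_replace() accepts an optional fourth argument that caps the number of replacements and a fifth that reports how many were made. A minimal sketch, with a made-up input string and a limit of 3 chosen only to keep the example short:

```php
<?php
// Hypothetical input with more matches than we want to replace.
$content = 'word word <i title="word">word</i> word word word';

// Fourth argument: maximum number of replacements (-1 means no limit).
// Fifth argument: filled with the number of replacements actually made.
$limit = 3;
$result = preg_replace('/word(?!([^<]+)?>)/i', 'repl', $content, $limit, $count);

echo $result, "\n"; // repl repl <i title="word">repl</i> word word word
echo $count, "\n";  // 3
```

Because preg_replace() makes a single pass over the subject, the replacement text itself is never rescanned, so the cap applies to occurrences in the original string.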

Exactly what I was after!
You are a lifesaver!

Thanks

This is the equivalent for Vim regex:

:%s/word\(\([^<]\+\)\?>\)\@!/repl/g

P.S.: Sorry for the previous comment.

God damn, I have been trying to fix this problem for hours, and then I just came across this lovely little regex. Thank you so much for posting this dude!!!!

Thank you so much for your article. This helped me a great deal, saving me several hours. I had already spent multiple hours trying to come up with what you have explained here.

Hey thanks a lot, u just saved me a lot of time, was trying to do the exact same thing.

T-H-A-N-K-S!

Like others who’ve commented, I just want to let you know that this helped me out immensely! Thank you!

Sorry you will have to view source of my above mentioned comment. It parsed html sample text :(

Thank you for this little snippet. Just saved me a ton of work with our search engine :-)

Just one problem… what about text like “the worded information lacked a stop for words that you didn’t want associated with word”?
How do you affect only the word “word” and not “worded” or “wordplay”?
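
One way to handle this, sketched here rather than taken from the original post, is to wrap the word in \b word-boundary assertions so that it only matches when it stands alone. The rest of the pattern is unchanged.

```php
<?php
// The sentence from the comment above, with a straight apostrophe.
$text = "the worded information lacked a stop for words that you didn't want associated with word";

// \b on both sides: "worded" and "words" no longer match, only the standalone "word".
$result = preg_replace('/\bword\b(?!([^<]+)?>)/i', 'repl', $text);

echo $result, "\n";
// the worded information lacked a stop for words that you didn't want associated with repl
```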

