Matching a word / characters outside of html tags
Today I spent a good 2 hours on this very simple regEx problem. I tried googling just about every set of search terms I could think of, and didn’t find anything useful … basically I wanted to replace a certain word inside a string with another word, but not within html.
To do this I used a negative lookahead to check whether the text after the string I wanted to replace runs into a > character without meeting a < first [with any number of non-< characters in between], which would mean the string sits inside a tag. The beauty of the lookaround functions is that they don’t consume text … they instead assert what’s positioned around the text, similar to how $, ^ and \b function.
So in English, the regEx I came up with, word(?!([^<]+)?>), could be interpreted as:
- Match the characters "word" literally: word
- Assert that it is impossible to match the regex below starting at this position (negative lookahead): (?!([^<]+)?>)
  - Match the regular expression below and capture its match into backreference number 1: ([^<]+)?
    - Between zero and one times, as many times as possible, giving back as needed (greedy): ?
    - Match any character that is not a "<": [^<]+
      - Between one and unlimited times, as many times as possible, giving back as needed (greedy): +
  - Match the character ">" literally: >
[Thanks RegexBuddy]
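The same breakdown can be written into the pattern itself. The following is a minimal sketch, assuming PCRE's free-spacing modifier (x), which ignores whitespace and # comments inside the pattern; the variable names $pattern and $input are only for illustration.

```php
<?php
// Free-spacing (x) version of the same pattern; whitespace and # comments inside it are ignored.
$pattern = '/
    word            # the literal text to find
    (?!             # negative lookahead: fail if the following can be matched here
        ([^<]+)?    #   optionally, a run of characters that are not "<" (group 1)
        >           #   then a literal ">", which means we are still inside a tag
    )
/ix';

$input = 'word <a href="word">word</word>word word';
echo preg_replace($pattern, 'repl', $input), "\n";
// repl <a href="word">repl</word>repl repl
```

The comments must not contain the pattern delimiter (a slash here), otherwise PHP would end the pattern early.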
For example, if we wanted to replace all instances of word with repl which exist outside of any HTML:
word <a href="word">word</word>word word
would become:
repl <a href="word">repl</word>repl repl
The regular expression I used to do this, word(?!([^<]+)?>), fits nicely into preg_replace():
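<?php
// Sample input: "word" appears in plain text, in an attribute value and in a tag name.
$str = "word <a href=\"word\">word</word>word word";
// Replace "word" (case-insensitively) only where it is not inside a tag.
$str = preg_replace("/word(?!([^<]+)?>)/i", "repl", $str);
echo $str;
# repl <a href="word">repl</word>repl repl
?>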
I know this is really a one-liner, but I have it in its expanded form to simplify the steps.
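If the word to replace comes from a variable, it probably needs escaping before it goes into the pattern. Here is a rough sketch of how that could look; the helper name replace_outside_tags() is hypothetical and not part of the original snippet.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the post above).
function replace_outside_tags(string $word, string $replacement, string $html): string
{
    // preg_quote() escapes regex metacharacters in $word, including the "/" delimiter.
    $pattern = '/' . preg_quote($word, '/') . '(?!([^<]+)?>)/i';
    // preg_replace() returns null on error; fall back to the untouched input in that case.
    return preg_replace($pattern, $replacement, $html) ?? $html;
}

echo replace_outside_tags('a.b', 'X', 'a.b <i title="a.b">axb</i>'), "\n";
// X <i title="a.b">axb</i>
```

Without preg_quote() the dot would match any character and "axb" would be replaced too. Note that $replacement is still treated as a replacement string, so sequences like $1 in it are read as backreferences.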
I’m a regex noob, and this is exactly what I was looking for. Thanks!
Matt Kantor
August 13, 2008
Thanks for posting this! This is the exact RegEx I was looking for, so you saved me a lot of time!
Dave Wooldridge
August 29, 2008
Thanks but this fails if you have a closing angle bracket following the text you want to replace without having a preceding opening bracket.
e.g.
word > word word
becomes
word > repl repl
Steve
December 28, 2008
To answer that, simply use >l; in your code for such cases. Then it will work.
Matt DeKok
December 28, 2011
my bad it is >
I accidentally typed an l in there.
Matt DeKok
December 28, 2011
Oh it auto parses the code? This is what you should use… (ampersand)gt(semicolon)
Matt DeKok
December 28, 2011
Just what I needed! Thank you so much !
Martijn
March 11, 2009
True, this doesn’t work for word> word word – but in HTML text a > is represented by &gt; so this won’t be a problem – also a lone > should not be found in valid HTML.
Charleh
October 9, 2009
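The case Steve describes, and the entity workaround, can be checked with a quick sketch like this one. It assumes the text has already been entity-encoded the way valid HTML would be.

```php
<?php
$pattern = '/word(?!([^<]+)?>)/i';

// A bare ">" in the text makes the first "word" look as if it sits inside a tag.
echo preg_replace($pattern, 'repl', 'word > word word'), "\n";
// word > repl repl

// With ">" written as the entity &gt;, every occurrence is replaced.
echo preg_replace($pattern, 'repl', 'word &gt; word word'), "\n";
// repl &gt; repl repl
```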
Very simple and very functional. I’m using it in javascript ‘replace’ function and it works perfectly too. Thanks!
Mike
January 8, 2010
I am trying to use it in a javascript ‘replace’ function, but so far I am getting errors. Can someone post a working example? It probably has to do with escaping or something in js. I can use it in PHP no probs; in js I guess it is slightly different.
I also tried:
1)-
var filteredWord = "/(?!(?:[^<]+>|[^>]+<\/.*?>))\\b(' + keyword + ')\\b/";
The above comes back with errors when I do the final search and replace call as:
OR:
2)
I also want to make sure only text nodes are searched and replaced outside tags. My output does the search and replace, but also inside href and image links and other html enclosed text, which I want to prevent. In PHP, number one works OK when you replace the keyword bit with $keyword.
A second request for the gurus here, PHP or JS function: how do you limit replacing to only a certain number of occurrences? Basically, I want to search and replace up to a limit of 5 keywords/phrases on a page.
Tim
January 29, 2010
You’ll have to create a RegExp object [for JS] like so:
var filteredWord = new RegExp("(?!(?:[^<]+>|[^>]+<\/.*?>))\\b(" + keyword + ")\\b","g");
As far as limiting the number of words replaced, you can remove the “g” flag from the code above and do something like this:
for (var _i=0;_i<5;_i++) { content = content.replace(filteredWord," <span class='word'>$1<\/span> "); }
pureform
January 29, 2010
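For the PHP side of Tim's question, preg_replace() takes an optional fourth argument that caps the number of replacements, and a fifth that reports how many were made. A minimal sketch, assuming the lookahead from the post rather than the JS pattern above; $keyword and $content are made-up sample values.

```php
<?php
$keyword = 'word'; // hypothetical keyword
$content = 'word <a href="word">word</a> word word word word word'; // hypothetical page text

// Whole-word match of the keyword, skipped when it sits inside a tag.
$pattern = '/\b(' . preg_quote($keyword, '/') . ')\b(?!([^<]+)?>)/i';

// The 4th argument limits the number of replacements; $count receives how many were done.
$result = preg_replace($pattern, '<span class="word">$1</span>', $content, 5, $count);

echo $count, "\n";
// 5
echo $result, "\n";
```

Because preg_replace() works through the original string in one pass, the inserted span tags are never re-scanned, so no loop is needed.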
How is the java-version of this code?
I tried:
String $str = "word wordword word";
System.out.println($str.replaceAll("/word(?!([^<]+)?>)/i","repl"));
But no success…
Thanks,
Celso.
celso
July 21, 2010
Exactly what I was after!
You are a lifesaver!
Thanks
Kev
August 8, 2010
1
israelch
January 9, 2011
This is the equivalent for Vim regex:
:%s/word\(\([^\)\@!/repl/g
P.S.: Sorry for the previous comment.
israelch
January 9, 2011
God damn, I have been trying to fix this problem for hours, and then I just came across this lovely little regex. Thank you so much for posting this dude!!!!
Cal Leeming
March 17, 2011
Thank you so much for your article. This helped me a great deal, saving me several hours. I had already spent multiple hours trying to come up with what you have explained here.
Adnan
May 7, 2011
Hey thanks a lot, u just saved me a lot of time, was trying to do the exact same thing.
vxcriss
May 27, 2011
T-H-A-N-K-S!
tnk
September 29, 2011
Like others who’ve commented, I just want to let you know that this helped me out immensely! Thank you!
EB
October 15, 2011
Can you enhance this one to skip tags?
I am new to regex and I have tried almost every thing. It will be very nice of you if you could come up with a solution to my problem.
Like from the text below, I want to get all occurrences of hi which are 1: not attributes of an html tag and 2: not between and . Your regex does task 1, it just needs tuning for 2
hi
he says hi dear
this is hi hi test
Adeel Nawaz
December 12, 2011
Sorry you will have to view source of my above mentioned comment. It parsed html sample text :(
Adeel Nawaz
December 12, 2011
Thank you for this little snippet. Just saved me a ton of work with our search engine :-)
Michael Horne
July 6, 2012