Matching a word / characters outside of html tags
Today I spent a good 2 hours on this very simple regEx problem. I tried googling just about every set of search terms I could think of, and didn’t find anything useful. Basically, I wanted to replace a certain word inside a string with another word, but not where it appears inside an HTML tag.
To do this I used a negative lookahead to check whether the string I wanted to replace is followed by a > character, with only non-< characters [if any] in between. The beauty of the lookaround functions is that they don’t match text … they instead match what’s positioned around the text, similar to how $, ^ and \b function.
So in English, the regEx I came up with, word(?!([^<]+)?>), could be interpreted as:
- Match the characters "word" literally: word
- Assert that it is impossible to match the regex below starting at this position (negative lookahead): (?!([^<]+)?>)
  - Match the regular expression below and capture its match into backreference number 1: ([^<]+)?
    - Between zero and one times, as many times as possible, giving back as needed (greedy): ?
    - Match any character that is not a "<": [^<]+
      - Between one and unlimited times, as many times as possible, giving back as needed (greedy): +
  - Match the character ">" literally: >
[Thanks RegexBuddy]
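The breakdown above can also be written into the pattern itself. The following is a minimal sketch that assumes PCRE's extended mode (the x flag), which ignores whitespace and allows # comments inside the pattern; the variable names are only illustrative.

```php
<?php
// Same regex as above, spread over several lines with the x (extended) flag.
// Note: the comments inside the pattern must not contain the delimiter character.
$pattern = '/
    word            # the literal text to replace
    (?!             # negative lookahead: reject the match if what follows can match
        ([^<]+)?    #   optionally, one or more characters that are not "<"
        >           #   then a literal ">"
    )
/xi';

$str = 'word <a href="word">word</word>word word';
$result = preg_replace($pattern, 'repl', $str);
echo $result;
// repl <a href="word">repl</word>repl repl
```

The behaviour should be the same as the one-line form; the only difference is readability.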
For example, if we wanted to replace all instances of word with repl which exist outside of any HTML:
word <a href="word">word</word>word word
would become:
repl <a href="word">repl</word>repl repl
The regular expression I used to do this, word(?!([^<]+)?>), fits nicely into preg_replace();
<?php
$str = "word <a href=\"word\">word</word>word word";
$str = preg_replace("/word(?!([^<]+)?>)/i","repl",$str);
echo $str;
# repl <a href="word">repl</word>repl repl
?>
I know this is really a one-liner, but I have it in its expanded form to simplify the steps.
I’m a regex noob, and this is exactly what I was looking for. Thanks!
Matt Kantor
August 13, 2008
Thanks for posting this! This is the exact RegEx I was looking for, so you saved me a lot of time!
Dave Wooldridge
August 29, 2008
Thanks but this fails if you have a closing angle bracket following the text you want to replace without having a preceding opening bracket.
e.g.
word > word word
becomes
word > repl repl
Steve
December 28, 2008
To answer that, simply use >l; in your code for such cases. Then it will work.
Matt DeKok
December 28, 2011
my bad it is >
I accidentally typed an l in there.
Matt DeKok
December 28, 2011
Oh it auto parses the code? This is what you should use… (ampersand)gt(semicolon)
Matt DeKok
December 28, 2011
Just what I needed! Thank you so much !
Martijn
March 11, 2009
True, this doesn’t work for word> word word – but in HTML text a > is represented by &gt; so this won’t be a problem – also a lone > should not be found in valid HTML.
Charleh
October 9, 2009
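To illustrate the lone > case discussed in the comments above, here is a small sketch. It assumes the text is plain text that has not been HTML-escaped yet, and it uses PHP's htmlspecialchars() to do the escaping; the sample string is the one from Steve's comment.

```php
<?php
$pattern = '/word(?!([^<]+)?>)/i';

// A lone ">" in plain text makes the lookahead treat the first "word"
// as if it were inside a tag, so it is skipped.
$raw = 'word > word word';
$unescaped = preg_replace($pattern, 'repl', $raw);
// $unescaped is 'word > repl repl'

// Escaping first turns ">" into "&gt;", so every "word" is replaced.
$escaped = preg_replace($pattern, 'repl', htmlspecialchars($raw));
// $escaped is 'repl &gt; repl repl'
```

Escaping like this only makes sense for the text portions of a document; running htmlspecialchars() over a string that already contains markup would escape the tags as well.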
Very simple and very functional. I’m using it in javascript ‘replace’ function and it works perfectly too. Thanks!
Mike
January 8, 2010
I am trying to use it on a javascript ‘replace’ function, so far I am having errors, can someone post a working example, it probably is to do escaping or something in js. I can use it in PHP no probs, in js I guess is slightly different.
I also tried:
1)-
The above comes back with errors when I do the final search and replace call as:
OR:
2)
I, also want to make sure only text nodes are searched and replaced outside tags, my output does the search and replace but also inside href and image links and other html enclosed text which I want to prevent. In PHP, number one works OK when you replace the keyword bit with $keyword.
A second request for the gurus here, for either a PHP or JS function: how do you limit the replacement to a certain number of occurrences? Basically, I want to search and replace up to a limit of 5 keywords/phrases on a page.
Tim
January 29, 2010
You’ll have to create a RegExp object [for JS] like so:
As far as limiting the amount of words replaced, you can remove the “g” flag from the code above and do something like this:
pureform
January 29, 2010
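On the PHP side of the question about limiting replacements, preg_replace() accepts an optional fourth argument that caps the number of replacements and a fifth that receives the count. A minimal sketch, using a limit of 2 instead of 5 to keep the sample short (the sample string is made up for illustration):

```php
<?php
$pattern = '/word(?!([^<]+)?>)/i';
$html = 'word <a href="word">word</a> word word';

// Fourth argument: maximum number of replacements.
// Fifth argument: receives how many replacements were actually made.
$limited = preg_replace($pattern, 'repl', $html, 2, $count);

echo $limited;
// repl <a href="word">repl</a> word word
```

Only the first two matches outside of tags are replaced; the word inside the href attribute is skipped and does not count towards the limit.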
How is the java-version of this code?
I tried:
String $str = "word <a href=\"word\">word</word>word word";
System.out.println($str.replaceAll("/word(?!([^<]+)?>)/i","repl"));
But no success…
Thanks,
Celso.
celso
July 21, 2010
Exactly what I was after!
You are a lifesaver!
Thanks
Kev
August 8, 2010
1
israelch
January 9, 2011
This is the equivalent for Vim regex:
:%s/word\(\([^<]\+\)\?>\)\@!/repl/g
P.S.: Sorry for the previous comment.
israelch
January 9, 2011
God damn, I have been trying to fix this problem for hours, and then I just came across this lovely little regex. Thank you so much for posting this dude!!!!
Cal Leeming
March 17, 2011
Thank you so much for your article. This helped me a great saving me several hours. I had already spent multiple hours trying to come up with what you have explained here.
Adnan
May 7, 2011
Hey thanks a lot, u just saved me a lot of time, was trying to do the exact same thing.
vxcriss
May 27, 2011
T-H-A-N-K-S!
tnk
September 29, 2011
Like others who’ve commented, I just want to let you know that this helped me out immensely! Thank you!
EB
October 15, 2011
Can you enhance this one to skip tags?
I am new to regex and I have tried almost every thing. It will be very nice of you if you could come up with a solution to my problem.
Like from the text below, I want to get all occurrences of hi which are 1: not attributes of an html tag and 2: not between and . Your regex does task 1, it just needs tuning for 2
hi
he says hi dear
this is hi hi test
Adeel Nawaz
December 12, 2011
Sorry you will have to view source of my above mentioned comment. It parsed html sample text :(
Adeel Nawaz
December 12, 2011
Thank you for this little snippet. Just saved me a ton of work with our search engine :-)
Michael Horne
July 6, 2012
Just one problem… what about if you have “the worded information lacked a stop for words that you didn’t want associated with word”
How to only affect the word, word and not wordplay
John
May 21, 2014
use boundaries \b around the needle to ensure it matches only the whole word or phrase,
in the example below it will ignore “wordly”, “wordword”, and “wordsmith”:
$haystack = "wordly word wordword word wordword wordsmith";
$needle = 'word';
$replacement = 'BOOM';
$str = preg_replace("/\\b" . $needle . "\\b(?!([^<]+)?>)/i", $replacement, $haystack);
echo $str;
daveheslop
February 19, 2016
$haystack = "word word wordword 53 word wordword";
$needle = 'word';
$replacement = 'BOOM';
$str = preg_replace("/\\b" . $needle . "\\b(?!([^<]+)?>)/i", $replacement, $haystack);
echo $str;
daveheslop
February 19, 2016
hmm, code filter. You get the idea anyway. Wrap the word/phrase with \b
daveheslop
February 19, 2016
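The word-boundary idea from the comments above can be wrapped in a small helper. This is only a sketch: the function name replace_outside_tags is hypothetical, and it adds preg_quote() on the assumption that the needle may contain characters that are special in a regex.

```php
<?php
// Hypothetical helper: replace whole-word occurrences of $needle that are
// not inside an HTML tag.
function replace_outside_tags(string $haystack, string $needle, string $replacement): string
{
    // preg_quote() escapes regex metacharacters in the needle;
    // the second argument also escapes the "/" delimiter.
    $quoted = preg_quote($needle, '/');
    $pattern = '/\b' . $quoted . '\b(?!([^<]+)?>)/i';

    return preg_replace($pattern, $replacement, $haystack);
}

$out = replace_outside_tags(
    'wordly word <a title="word">wordword</a> word wordsmith',
    'word',
    'BOOM'
);
echo $out;
// wordly BOOM <a title="word">wordword</a> BOOM wordsmith
```

Note that \b only behaves as expected when the needle starts and ends with a word character, and that $ or \ followed by digits in the replacement string are treated as backreferences by preg_replace().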
Hi. I have a small problem. I want to match many words, e.g. “Some Word”, which isn’t inside
e.g. This some word is outside and this some word is inside. And I want to match the outside one. Could you help?
Lukas
August 7, 2014
Thank you, I was looking for something to delete spaces =)
JordyC++n
November 21, 2015
works great – tnx a lot
Robert
January 19, 2018
Nice piece of code! I modified it slightly. I want to test for a match of ANY character within an html block outside of elements.
So, to test for any character outside of HTML elements is as follows:
The matching result for this instance will be:
No array will be returned if no matches.
You could also use preg_match instead of preg_match_all to get just the first match, which in this case would be “h”.
Eric P
May 29, 2018
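The code from the comment above did not survive the comment filter. As a rough illustration of the same idea, and not a reconstruction of that code, the sketch below matches any single character outside of tags with preg_match_all(); the pattern and the sample string are assumptions.

```php
<?php
$html = '<p>hi</p>';

// [^<>] matches one character that is not an angle bracket; the lookahead
// rejects characters that sit inside a tag. The inner group is non-capturing
// so that $matches only contains the full matches.
$count = preg_match_all('/[^<>](?!(?:[^<]+)?>)/', $html, $matches);

print_r($matches[0]);
// $count is 2 and $matches[0] is ['h', 'i']
```

preg_match_all() returns the number of matches; when nothing matches it returns 0 and $matches[0] is an empty array.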
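[…] I tried writing several matching schemes, and all of them failed. In the end I turned to Google for help. The search keywords matter a lot here: it is best to translate the keywords you want to search for into the corresponding English words, and the results will be much more satisfying. That is how I found the solution: Matching A Word / Characters Outside Of Html Tags. […]
A short note: the development journey of adding a new feature to an open source project – 技术成就梦想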
August 22, 2018