Matching a word / characters outside of html tags

Posted on January 4, 2008. Filed under: PHP, Regular Expressions | Tags: html, regular expression, replace, search |

Today I spent a good 2 hours on this very simple regEx problem. I tried googling just about every set of search terms I could think of, and didn’t find anything useful … basically I wanted to replace a certain word inside a string with another word, but not where it appears inside HTML tags.

To do this I used a negative lookahead to check whether a > character follows the string I want to replace, with only non-< characters [if any] in between. The beauty of the lookaround constructs is that they don’t consume text … they instead assert what’s positioned around the text, similar to how $, ^ and \b function.

So in English, the regEx I came up with, word(?!([^<]+)?>), could be interpreted as:

Match the characters "word" literally: word
Assert that it is impossible to match the regex below starting at this position (negative lookahead): (?!([^<]+)?>)
- Match the regular expression below and capture its match into backreference number 1: ([^<]+)?
  - Between zero and one times, as many times as possible, giving back as needed (greedy): ?
  - Match any character that is not a "<": [^<]+
    - Between one and unlimited times, as many times as possible, giving back as needed (greedy): +
- Match the character ">" literally: >

[Thanks RegexBuddy]
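
To see the breakdown above in action, here is a minimal sketch that lists the byte offsets at which the pattern matches in the sample string used in this post. The variable names are illustrative and not part of the original code.

```php
<?php
// Sample string from this post: "word" appears six times,
// twice inside tags (the href value and the closing tag name).
$html = 'word <a href="word">word</word>word word';

// The pattern from the post, unchanged.
$pattern = '/word(?!([^<]+)?>)/i';

// PREG_OFFSET_CAPTURE makes every match a [text, offset] pair.
preg_match_all($pattern, $html, $matches, PREG_OFFSET_CAPTURE);

// Keep only the offsets of the full matches.
$offsets = array_column($matches[0], 1);

echo implode(', ', $offsets), "\n"; // 0, 20, 31, 36
```

The occurrences at offsets 14 (inside the href attribute) and 26 (inside the closing tag) are skipped, because a > can be reached from them without crossing a <.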

For example, if we wanted to replace all instances of word with repl which exist outside of any HTML:


word <a href="word">word</word>word word

would become:


repl <a href="word">repl</word>repl repl

The regular expression I used to do this, word(?!([^<]+)?>), fits nicely into preg_replace();


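<?php
    $str = "word <a href=\"word\">word</word>word word";
    // Replace "word" unless a ">" follows it with no "<" in between (i.e. unless it sits inside a tag)
    $str = preg_replace("/word(?!([^<]+)?>)/i","repl",$str);
    echo $str;
    # repl <a href="word">repl</word>repl repl
?>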

I know this is really a one-liner, but I have it in its expanded form to simplify the steps.
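
If the word to replace comes from a variable, the same idea can be wrapped in a small function. The sketch below is only one possible way to do it: the name replace_outside_tags is made up for illustration, and the lookahead is written as (?![^<]*>), which should behave the same as (?!([^<]+)?>) but without creating a capture group.

```php
<?php
/**
 * Hypothetical helper (not from the original post): replace $needle with
 * $replacement wherever it is not followed by a ">" that closes a tag.
 */
function replace_outside_tags(string $needle, string $replacement, string $html): string
{
    // preg_quote() escapes regex metacharacters in the needle,
    // including the "/" used as the pattern delimiter.
    $pattern = '/' . preg_quote($needle, '/') . '(?![^<]*>)/i';

    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace($pattern, $replacement, $html) ?? $html;
}

echo replace_outside_tags('word', 'repl', 'word <a href="word">word</a> word'), "\n";
// repl <a href="word">repl</a> repl
```

Note that preg_replace() still interprets references such as $1 or \1 in the replacement string, so a replacement taken from user input would need extra care.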


32 Responses to “Matching a word / characters outside of html tags”


I’m a regex noob, and this is exactly what I was looking for. Thanks!

Matt Kantor
August 13, 2008

Thanks for posting this! This is the exact RegEx I was looking for, so you saved me a lot of time!

Dave Wooldridge
August 29, 2008

Thanks but this fails if you have a closing angle bracket following the text you want to replace without having a preceding opening bracket.

e.g.
word > word word

becomes

word > repl repl

Steve
December 28, 2008

To answer that, simply use &gtl; in your code for such cases. Then it will work.

Matt DeKok
December 28, 2011

my bad it is >

I accidentally typed an l in there.

Matt DeKok
December 28, 2011

Oh it auto parses the code? This is what you should use… (ampersand)gt(semicolon)

Matt DeKok
December 28, 2011

Just what I needed! Thank you so much !

Martijn
March 11, 2009

True, this doesn’t work for word> word word – but in HTML text a > is represented by &gt; so this won’t be a problem – also, a lone > should not be found in valid HTML.

Charleh
October 9, 2009
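
The stray > case discussed in the comments above can be reproduced with a short sketch. It assumes the text is plain text that has not been escaped yet, and uses htmlspecialchars() to turn the > into &gt; before the replacement runs; the variable names are illustrative.

```php
<?php
// The pattern from the post, unchanged.
$pattern = '/word(?!([^<]+)?>)/i';

// A raw ">" makes the first "word" look as if it sits inside a tag.
$raw = 'word > word word';
$rawResult = preg_replace($pattern, 'repl', $raw);
echo $rawResult, "\n"; // word > repl repl

// After escaping, the ">" becomes "&gt;" and every occurrence is replaced.
$escaped = htmlspecialchars($raw);
$escapedResult = preg_replace($pattern, 'repl', $escaped);
echo $escapedResult, "\n"; // repl &gt; repl repl
```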

Very simple and very functional. I’m using it in javascript ‘replace’ function and it works perfectly too. Thanks!

Mike
January 8, 2010

I am trying to use it in a JavaScript ‘replace’ call, but so far I am getting errors. Can someone post a working example? It probably has to do with escaping or something in JS. I can use it in PHP no probs; in JS I guess it is slightly different.

I also tried:
1)-

var filteredWord = "/(?!(?:[^<]+>|[^>]+<\/.*?>))\\b(' + keyword + ')\\b/";

The above comes back with errors when I do the final search and replace call as:

content = content.replace(filteredWord," <span class='word'>$1<\/span> ");

OR:
2)

content = content.replace(filteredWord," <span class='keywordcss'>'+keyword+'<\/span> ");

I also want to make sure only text nodes outside tags are searched and replaced. My output does the search and replace, but also inside href and image links and other text enclosed in HTML tags, which I want to prevent. In PHP, number one works OK when you replace the keyword bit with $keyword.

A second request for the gurus here, for either a PHP or JS function: how do you limit the replacement to a certain number of occurrences? Basically, I want to search and replace up to a limit of 5 keywords/phrases on a page.

Tim
January 29, 2010

You’ll have to create a RegExp object [for JS] like so:

var filteredWord = new RegExp("(?!(?:[^<]+>|[^>]+<\/.*?>))\\b(" + keyword + ")\\b","g");

As far as limiting the amount of words replaced, you can remove the “g” flag from the code above and do something like this:

for (var _i=0;_i<5;_i++) {
	content = content.replace(filteredWord," <span class='word'>$1<\/span> ");
}

pureform
January 29, 2010
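
On the PHP side of the question about limiting replacements, preg_replace() accepts a limit as its fourth argument and reports the number of replacements through its fifth. A minimal sketch, with an illustrative input string and a limit of 2 to keep the output short:

```php
<?php
// The pattern from the post, unchanged.
$pattern = '/word(?!([^<]+)?>)/i';
$content = 'word <a href="word">word</a> word word';

// 4th argument: maximum number of replacements.
// 5th argument: receives how many replacements were actually made.
$result = preg_replace($pattern, 'repl', $content, 2, $count);

echo $result, "\n"; // repl <a href="word">repl</a> word word
echo $count, "\n";  // 2
```

The occurrence inside the href attribute is never a match, so it does not count toward the limit.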

How is the java-version of this code?

I tried:

String $str = "word <a href=\"word\">word</word>word word";
System.out.println($str.replaceAll("/word(?!([^<]+)?>)/i","repl"));

But no success…
Thanks,
Celso.

celso
July 21, 2010

Exactly what I was after!
You are a lifesaver!

Thanks

Kev
August 8, 2010

israelch
January 9, 2011

This is the equivalent for Vim regex:

:%s/word\(\([^<]\+\)\?>\)\@!/repl/g

P.S.: Sorry for the previous comment.

israelch
January 9, 2011

God damn, I have been trying to fix this problem for hours, and then I just came across this lovely little regex. Thank you so much for posting this dude!!!!

Cal Leeming
March 17, 2011

Thank you so much for your article. This helped me a great saving me several hours. I had already spent multiple hours trying to come up with what you have explained here.

Adnan
May 7, 2011

Hey thanks a lot, u just saved me a lot of time, was trying to do the exact same thing.

vxcriss
May 27, 2011

T-H-A-N-K-S!

tnk
September 29, 2011

Like others who’ve commented, I just want to let you know that this helped me out immensely! Thank you!

EB
October 15, 2011

Can you enhance this one to skip tags?
I am new to regex and I have tried almost every thing. It will be very nice of you if you could come up with a solution to my problem.
Like from the text below, I want to get all occurrences of hi which are 1: not attributes of an html tag and 2: not between and . Your regex does task 1, it just needs tuning for 2

hi
he says hi dear
this is hi hi test

Adeel Nawaz
December 12, 2011

Sorry you will have to view source of my above mentioned comment. It parsed html sample text :(

Adeel Nawaz
December 12, 2011

Thank you for this little snippet. Just saved me a ton of work with our search engine :-)

Michael Horne
July 6, 2012

Just one problem… what about if you have “the worded information lacked a stop for words that you didn’t want associated with word”
How to only affect the word, word and not wordplay

John
May 21, 2014

Use boundaries (\b) around the needle to ensure it matches only the whole word or phrase.
In the example below it will ignore “wordly”, “wordword”, and “wordsmith”:

$haystack = "wordly word wordword word wordword wordsmith";
$needle = 'word';
$replacement = 'BOOM';
$str = preg_replace("/\\b" . $needle . "\\b(?!([^<]+)?>)/i", $replacement, $haystack);
echo $str;

daveheslop
February 19, 2016

$haystack = "word word wordword 53 word wordword";
$needle = 'word';
$replacement = 'BOOM';
$str = preg_replace("/\\b" . $needle . "\\b(?!([^<]+)?>)/i", $replacement, $haystack);
echo $str;

daveheslop
February 19, 2016

hmm, code filter. You get the idea anyway. Wrap the word/phrase with \b

daveheslop
February 19, 2016

Hi. I have a small problem. I want to match many words, e.g. “Some Word”, which aren’t inside
e.g. This some word is outside and this some word is inside. And I want to match the outside one. Could you help?

Lukas
August 7, 2014

Thank you, I was looking for something to delete spaces =)

JordyC++n
November 21, 2015

Works great – thanks a lot

Robert
January 19, 2018

Nice piece of code! I modified it slightly. I want to test for a match of ANY character within an html block outside of elements.

So, to test for any character outside of HTML elements is as follows:

preg_match_all('/[a-z0-9-](?!([^<]+)?>)/is', '<b style="background:red;">hello</b>', $result);

The matching result for this instance will be:

Array
(
    [0] => Array
        (
            [0] => h
            [1] => e
            [2] => l
            [3] => l
            [4] => o
        )

    [1] => Array
        (
            [0] => 
            [1] => 
            [2] => 
            [3] => 
            [4] => 
        )
)

If there are no matches, preg_match_all returns 0 and the sub-arrays in $result are empty.

You could also use preg_match instead of preg_match_all to get just the first match, which in this case would be “h”.

Eric P
May 29, 2018
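
The empty [1] array in the result above comes from the capturing group inside the lookahead. As a sketch of one way to avoid it, the lookahead can be written without a capture as (?![^<]*>), which should match the same positions; the variable names below are illustrative.

```php
<?php
// Same test as in the comment above, but with a non-capturing lookahead,
// so $result only contains the full matches.
$html = '<b style="background:red;">hello</b>';
$found = preg_match_all('/[a-z0-9-](?![^<]*>)/i', $html, $result);

echo $found, "\n";                  // 5
echo implode('', $result[0]), "\n"; // hello

// With no matches, the return value is 0 and $result[0] is an empty array.
$none = preg_match_all('/[a-z0-9-](?![^<]*>)/i', '<br>', $empty);
echo $none, "\n";                   // 0
```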

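[…] I tried writing several different matching approaches, and all of them failed. In the end I had to turn to the almighty Google for help. Here, the search keywords matter a lot: it is best to translate the terms you want to search for into the corresponding English words, so that the results are more satisfying. In the end I found the solution: Matching A Word / Characters Outside Of Html Tags. […]

Notes: the development journey of adding a new feature to an open-source project – 技术成就梦想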
August 22, 2018
