How can I create a regular expression to match a line that doesn't contain a specific word?

anjuyadav.1398 · June 28, 2024, 5:54pm

I know it’s possible to match a word and then reverse the matches using other tools (e.g. grep -v). However, is it possible to match lines that do not contain a specific word, e.g. hede, using a regular expression?

alveera.khn · July 1, 2024, 5:50pm

The idea that regular expressions don’t support inverse matching isn’t entirely accurate. While it’s not their primary purpose, you can simulate this behavior using negative lookarounds. For example, to match a line that doesn’t contain the word “hede”, you can use the following regex pattern:

^((?!hede).)*$

This pattern will match any string or line (without a line break) that does not contain the substring “hede”. If you need to match line breaks as well, you can use the DOT-ALL modifier (the trailing s in the pattern):

/^((?!hede).)*$/s

Alternatively, you can use it inline:

/(?s)^((?!hede).)*$/

If the DOT-ALL modifier is unavailable, you can achieve the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/
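As a quick check of the variants above, here is a minimal sketch in Python, assuming its built-in re module (the post itself does not name a regex flavor). The constant names and the lacks_hede helper are made up for illustration; re.DOTALL plays the role of the trailing s.

```python
import re

# Single-line form from the post: '.' stops at a line break.
NO_HEDE_LINE = re.compile(r"^((?!hede).)*$")

# Three ways to let the match span line breaks.
NO_HEDE_FLAG = re.compile(r"^((?!hede).)*$", re.DOTALL)  # like the trailing /s
NO_HEDE_INLINE = re.compile(r"(?s)^((?!hede).)*$")       # inline modifier
NO_HEDE_CLASS = re.compile(r"^((?!hede)[\s\S])*$")       # no modifier needed


def lacks_hede(pattern, text):
    """Return True when the whole text matches, i.e. it contains no 'hede'."""
    return pattern.match(text) is not None


print(lacks_hede(NO_HEDE_LINE, "hoho"))          # True
print(lacks_hede(NO_HEDE_LINE, "ABhedeCD"))      # False
print(lacks_hede(NO_HEDE_LINE, "foo\nbar"))      # False: '.' cannot cross the line break
print(lacks_hede(NO_HEDE_FLAG, "foo\nbar"))      # True
print(lacks_hede(NO_HEDE_CLASS, "foo\nhede"))    # False
```

Other engines may spell the modifier differently or not accept the inline form, so it is worth checking the documentation of the flavor you use.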

Explanation: A string consists of a list of n characters. Before and after each character, there’s an empty string. So, a list of n characters will have n+1 empty strings. For example, in the string “ABhedeCD”:

    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘

Here, the e’s represent empty strings. The regex (?!hede). first looks ahead to check that the substring “hede” does not begin at the current position; only if that check passes does the . (dot) match one character (any character except a line break). Lookarounds are called zero-width assertions because they don’t consume any characters; they only assert or validate something.

In this example, each empty string is checked to ensure “hede” is not ahead before a character is consumed by the . (dot). This check is repeated zero or more times to cover the entire input, and the start and end of the input are anchored to ensure the entire input is processed: ^((?!hede).)*$.

If the input contains “ABhedeCD”, the regex will fail because at e3, the (?!hede) check fails (indicating “hede” is ahead).
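The position-by-position check can be traced with a small sketch, again assuming Python's re module. The helper name first_blocked_position is hypothetical; it runs the bare lookahead at each index and reports the first one where it fails. Index 2 corresponds to e3 in the diagram, since e1 sits at index 0.

```python
import re


def first_blocked_position(text, word="hede"):
    """Return the first index where the negative lookahead fails, or None.

    The lookahead fails exactly where `word` begins in `text`.
    """
    lookahead = re.compile(r"(?!" + re.escape(word) + r")")
    for index in range(len(text)):
        # match(text, index) tries the zero-width assertion at that index only.
        if lookahead.match(text, index) is None:
            return index
    return None


print(first_blocked_position("ABhedeCD"))  # 2, i.e. the empty string e3
print(first_blocked_position("ABCD"))      # None, the check passes everywhere
```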

dipen-soni · July 2, 2024, 7:48am

Contrary to the notion that regex is inefficient at inverse matching, it can be quite effective and convenient for such tasks. The performance impact compared to a programmatic search is often negligible.

To match a line that does not start with the word “hede”, you can use the pattern:

^(?!hede).*$

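This pattern asserts only that the string does not start with “hede”; a line with “hede” further along still matches, so it answers a different question from the original one. It is generally cheaper than the pattern for matching lines that do not contain “hede” anywhere: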

^((?!hede).)*$

The key difference is that the former runs the lookahead once, at the beginning of the input string, while the latter runs it at every position. That makes the former cheaper, but the two patterns are not interchangeable.
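To make the difference between the two patterns concrete, here is a small sketch assuming Python's re module; the sample strings and variable names are invented for illustration.

```python
import re

NOT_STARTING = re.compile(r"^(?!hede).*$")      # lookahead runs once, at the start
NOT_CONTAINING = re.compile(r"^((?!hede).)*$")  # lookahead runs before every character

samples = ["hede first", "ends with hede", "plain line"]

# Map each sample to (does not start with hede, does not contain hede).
results = {
    s: (bool(NOT_STARTING.match(s)), bool(NOT_CONTAINING.match(s)))
    for s in samples
}

for sample, (not_starting, not_containing) in results.items():
    print(f"{sample!r}: not starting={not_starting}, not containing={not_containing}")
```

Only the second sample separates them: it does not start with "hede", so the first pattern matches it, but it does contain "hede", so the second pattern rejects it.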

Rashmihasija · July 2, 2024, 7:55am

Adding to Dipen’s answer:

The regular expression ^((?!hede).)*$ is used to match a line that does not contain the word “hede”. Here’s a breakdown of how it works:

^ asserts the start of the string.
( begins a group that will be captured into \1.
(?!hede) is a negative lookahead that asserts that “hede” does not begin at the current position.
. matches any character except for a newline.
)* repeats the group zero or more times. Note that because the quantifier is applied to the group, only the last repetition of the captured pattern will be stored in \1.
$ asserts the end of the string.

In summary, this regex will match a string that does not contain the substring “hede” anywhere in it.
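The breakdown above, including the note about \1, can be checked with a short sketch, assuming Python's re module. The helper lines_without_hede is a hypothetical name; it applies the pattern line by line, much like grep -v would.

```python
import re

NO_HEDE = re.compile(r"^((?!hede).)*$")

# Group 1 is overwritten on every repetition, so only the last character remains.
last_repetition = NO_HEDE.match("abc").group(1)
print(last_repetition)  # c

# With zero repetitions the group never takes part in the match.
empty_capture = NO_HEDE.match("").group(1)
print(empty_capture)  # None


def lines_without_hede(text):
    """Return the lines of `text` that do not contain 'hede'."""
    return [line for line in text.splitlines() if NO_HEDE.match(line)]


print(lines_without_hede("hoho\nhihi\nhaha\nhede"))  # ['hoho', 'hihi', 'haha']
```

If the captured character is not needed, a non-capturing group, (?:(?!hede).)*, avoids storing it.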