
I'm new to sed and experimenting with it to learn. However, I've run into a problem I can't solve when using sed to remove duplicate words:

echo "abc abc def ghi ijk ijk" | sed 's/\([a-z][a-z]*\) \1/\1/g'

outputs

abc def ghijk ijk

and it does that every time a word ends with the same letter that the following word starts with. What am I doing wrong?

Mr. Smith

1 Answer


The problem is that, as written, the regex can match partial words. In the example you show, it matches the i at the end of ghi with the i at the beginning of ijk, so "i i" is treated as a duplicate and collapsed to "i". The solution is to insist that the regex match whole words:

$ echo "abc abc def ghi ijk ijk" | sed 's/\<\([a-z][a-z]*\)\> \<\1\>/\1/g'
abc def ghi ijk

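In GNU sed, \< matches at the beginning of a word and \> matches at the end of a word. Both are GNU extensions rather than part of POSIX basic regular expressions, so other sed implementations may not support them.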
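
As a quick check of the whole-word fix, here is a minimal sketch that wraps the command in a shell function and runs it on the original input and on an input that contains no duplicates. The function name dedup_words is hypothetical, and GNU sed is assumed:

```shell
# Hypothetical helper, not part of sed itself.
# Assumes GNU sed, because \< and \> are GNU extensions.
dedup_words() {
    # \<...\> anchors the captured word and its repeat to whole words.
    sed 's/\<\([a-z][a-z]*\)\> \<\1\>/\1/g'
}

# Real duplicates are collapsed:
echo "abc abc def ghi ijk ijk" | dedup_words   # prints: abc def ghi ijk

# Words that merely share a letter are left alone:
echo "ghi ijk" | dedup_words                   # prints: ghi ijk
```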

More complex matches

In the example in the question, the regex was matching on a single repeated character, i i. Here is an example where it matches oat oat:

$ echo "smoat oats" | sed 's/\([a-z][a-z]*\) \1/\1/g'
smoats

This is, again, fixed by insisting on whole words:

$ echo "smoat oats" | sed 's/\<\([a-z][a-z]*\)\> \<\1\>/\1/g'
smoat oats

Simplification

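Because the captured group and its back-reference consist only of letters, the characters on either side of the space are always letters, and a transition between a letter and a space always marks a word boundary. The \> \< in the middle of the regex above is therefore redundant, and only the outer \< and \> are needed. Thus, we could use: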

$ echo "smoat oats" | sed 's/\<\([a-z][a-z]*\) \1\>/\1/g'
smoat oats
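
The same simplified pattern can probably be written a little more compactly with extended regular expressions. This is a sketch that assumes GNU sed, where -E enables extended syntax and back-references such as \1 are still accepted; the function name dedup_words_ere is hypothetical:

```shell
# Hypothetical helper; assumes GNU sed.
# With -E, grouping parentheses are unescaped and + replaces [a-z][a-z]*.
dedup_words_ere() {
    sed -E 's/\<([a-z]+) \1\>/\1/g'
}

echo "smoat oats" | dedup_words_ere                # prints: smoat oats
echo "abc abc def ghi ijk ijk" | dedup_words_ere   # prints: abc def ghi ijk
```

The behavior should match the basic-regex version above; only the escaping differs.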

Documentation

For more information on the subtleties of sed and its regular expressions, I recommend the Grymoire tutorial. The ultimate reference for GNU sed is the GNU sed manual.

John1024
  • Thanks. Is it just me, or is the sed documentation a bit light? I can't seem to find much... – Mr. Smith Aug 09 '15 at 23:06
  • @Mr.Smith Yes, the sed man page is light. I updated the answer with links to two references that I find to be much more informative. – John1024 Aug 09 '15 at 23:11