Sed word duplication problem

Question

I'm new to sed and experimenting to learn it. However, I've run into a problem I can't solve when using sed to remove duplicate words:

echo "abc abc def ghi ijk ijk" | sed 's/\([a-z][a-z]*\) \1/\1/g'

outputs

abc def ghijk ijk

and it does that every time a word ends with the same letter that the following word starts with. What am I doing wrong?

For your information, [a-z][a-z]* can be written more concisely as [a-z]+ — Stig Hemmer, Aug 10 '15 at 08:11
@StigHemmer $ echo "abc abc def ghi ijk ijk" | sed 's/<([a-z]+-)> <\1>/\1/g' => abc abc def ghi ijk ijk — Mr. Smith, Aug 10 '15 at 12:04
It was just supposed to be [a-z]+, the minus/hyphen is just part of the site layout. — Stig Hemmer, Aug 11 '15 at 07:01
For advanced regular expressions, the -r option may be useful: sed -r 's/\b([a-z]+)\b \b\1\b/\1/g' — , Jan 17 '18 at 08:34

John1024 · Accepted Answer · 2015-08-09T23:19:50.877

The problem is that, as is, the regex can match partial words. In the example you show, it is matching the i at the end of one word with the i at the beginning of the next. The solution is to insist that the regex match whole words:

$ echo "abc abc def ghi ijk ijk" | sed 's/\<\([a-z][a-z]*\)\> \<\1\>/\1/g'
abc def ghi ijk

In GNU sed, \< matches at the beginning of a word and \> matches at the end of a word.
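
As a small illustration of the two anchors, here is a sketch that assumes GNU sed (other seds may not support \< and \>). The function names mark_word_starts and mark_word_ends are made up for this example; each one upper-cases an i only when it sits at the named edge of a word:

```shell
# Assumption: GNU sed. \< and \> are GNU extensions and may be
# unavailable or spelled differently in other sed implementations.
# The function names are hypothetical, chosen only for this illustration.

mark_word_starts() {
    # Upper-case an "i" only where it begins a word.
    printf '%s\n' "$1" | sed 's/\<i/I/g'
}

mark_word_ends() {
    # Upper-case an "i" only where it ends a word.
    printf '%s\n' "$1" | sed 's/i\>/I/g'
}

mark_word_starts "ghi ijk"   # prints: ghi Ijk
mark_word_ends "ghi ijk"     # prints: ghI ijk
```

Each anchor matches a position rather than a character, which is why it can be added to a regex without changing what text gets replaced.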

More complex matches

In the example in the question, the regex was matching on a single repeated character, i i. Here is an example where it matches oat oat:

$ echo "smoat oats" | sed 's/\([a-z][a-z]*\) \1/\1/g'
smoats

This is, again, fixed by insisting on whole words:

$ echo "smoat oats" | sed 's/\<\([a-z][a-z]*\)\> \<\1\>/\1/g'
smoat oats
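
If it helps to see exactly which text the unanchored regex grabs, one debugging trick is to wrap the match in brackets instead of collapsing it. This is only a sketch; show_match is a hypothetical helper name, and the bracket markers are arbitrary:

```shell
# show_match is a hypothetical helper, not part of sed.
# It uses the unanchored regex from the question, but replaces each
# match with a bracketed copy so the matched span is visible.
show_match() {
    printf '%s\n' "$1" | sed 's/\([a-z][a-z]*\) \1/[\1 \1]/g'
}

show_match "smoat oats"                # prints: sm[oat oat]s
show_match "abc abc def ghi ijk ijk"   # prints: [abc abc] def gh[i i]jk ijk
```

The brackets show that the second match in the question's input is just i i, spanning the end of ghi and the start of ijk.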

Simplification

Since a transition between an alphabetic character and a space always marks a word boundary, the \> \< in the middle of the regex above is unnecessary: the regex already requires that the characters on both sides of the space be alphabetic. Thus, we could use:

$ echo "smoat oats" | sed 's/\<\([a-z][a-z]*\) \1\>/\1/g'
smoat oats
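
As a quick sanity check that the simplified form behaves like the fully anchored one, the sketch below runs both over a few inputs and compares the results. It assumes GNU sed, the function names are invented for this example, and three sample strings are of course not a proof:

```shell
# Assumption: GNU sed (for \< and \>).
# dedupe_full and dedupe_simplified are hypothetical names for this sketch.

dedupe_full() {
    # Word anchors around both the captured word and its repeat.
    printf '%s\n' "$1" | sed 's/\<\([a-z][a-z]*\)\> \<\1\>/\1/g'
}

dedupe_simplified() {
    # Only the outer anchors; the space already implies the inner ones.
    printf '%s\n' "$1" | sed 's/\<\([a-z][a-z]*\) \1\>/\1/g'
}

for input in "abc abc def ghi ijk ijk" "smoat oats" "oat oats"; do
    full=$(dedupe_full "$input")
    simple=$(dedupe_simplified "$input")
    if [ "$full" = "$simple" ]; then
        echo "same: $simple"
    else
        echo "different for '$input': '$full' vs '$simple'"
    fi
done
```

With GNU sed this should report "same" for all three inputs, leaving the last two unchanged because neither contains a repeated whole word.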

Documentation

For more information on the subtleties of sed and its regular expressions, I recommend the Grymoire tutorial. The ultimate reference for GNU sed is the GNU sed manual.

Thanks. Is it just me, or is the sed documentation a bit light? I can't seem to find much... — Mr. Smith, Aug 09 '15 at 23:06
@Mr.Smith Yes, the man sed is light. I updated the answer with links to two references that I find to be much more informative. — John1024, Aug 09 '15 at 23:11
