How can I get lines where a specific word is repeated exactly N times?

Question

For this given input:

How to get This line that this word repeated 3 times in THIS line?
But not this line which is THIS word repeated 2 times.
And I will get This line with this here and This one
A test line with four this and This another THIS and last this

I want this output:

How to get This line that this word repeated 3 times in THIS line?
And I will get This line with this here and This one

That is, I want the whole lines that contain exactly three occurrences of the word "this" (case-insensitive match).

To the too broad voter: how can a question possibly get more specific? — Jacob Vlijm, Jan 05 '15 at 17:43
@JacobVlijm In that there are "too many possible answers". Pick $RANDOM_LANGUAGE - somebody will be able to come up with a solution in it. — muru, Jan 06 '15 at 00:18
@muru I would say the contrary, limiting it to one language would make it a programming (language) centred question. Now it is a problem centred question. There are maybe many possible solutions (languages), but not so many obvious ones. — Jacob Vlijm, Jan 06 '15 at 07:09

score 13 · Accepted Answer · edited May 23 '17 at 12:39

In Perl, replace "this" with itself case-insensitively and count the number of replacements:

$ perl -ne 's/(this)/$1/ig == 3 && print' <<EOF
How to get This line that this word repeated 3 times in THIS line?
But not this line which is THIS word repeated 2 times.
And I will get This line with this here and This one
A test line with four this and This another THIS and last this
EOF
How to get This line that this word repeated 3 times in THIS line?
And I will get This line with this here and This one
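
For comparison, the same replace-and-count idea can be sketched in Python, the language used elsewhere in this thread. This is only an illustration, not part of the original answer: the names `SAMPLE` and `replacement_count` are made up here, and `re.subn` stands in for Perl's `s///` returning the number of substitutions.

```python
import re

# The four sample lines from the question.
SAMPLE = [
    "How to get This line that this word repeated 3 times in THIS line?",
    "But not this line which is THIS word repeated 2 times.",
    "And I will get This line with this here and This one",
    "A test line with four this and This another THIS and last this",
]


def replacement_count(line, word="this"):
    """Replace `word` with itself (case-insensitively) and return how many
    replacements were made."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    # subn returns a tuple: (new_string, number_of_substitutions).
    # The lambda puts back exactly what was matched, so case is preserved.
    _, count = pattern.subn(lambda m: m.group(0), line)
    return count


# Keep only the lines where exactly three replacements happened.
selected = [line for line in SAMPLE if replacement_count(line) == 3]
for line in selected:
    print(line)
```

As in the Perl version, the replacement leaves the line unchanged; only the count is of interest.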

Using a count of matches instead:

perl -ne 'my $c = () = /this/ig; $c == 3 && print'

If you have GNU awk, a very simple way:

gawk -F'this' -v IGNORECASE=1 'NF == 4'

The number of fields will be one more than the number of separators.

edited May 23 '17 at 12:39

Community

answered Jan 04 '15 at 18:13

muru

Why replace? Can't we count it directly, without replacing? – αғsнιη Jan 04 '15 at 18:33
Indeed we can count, the code is slightly longer: http://stackoverflow.com/questions/9538542/counting-number-of-occurrences-of-a-string-inside-another-perl – muru Jan 04 '15 at 18:46
Upvote for the gawk command. – Sri Jan 04 '15 at 20:07

Jacob Vlijm · Answer 2 · 2015-01-05T23:08:48.040

In python, this would do the job:

#!/usr/bin/env python3

s = """How to get This line that this word repeated 3 times in THIS line?
But not this line which is THIS word repeated 2 times.
And I will get This line with this here and This one
A test line with four this and This another THIS and last this"""

for line in s.splitlines():
    if line.lower().count("this") == 3:
        print(line)

outputs:

How to get This line that this word repeated 3 times in THIS line?
And I will get This line with this here and This one

Or to read in from a file, with the file as argument:

#!/usr/bin/env python3
import sys

file = sys.argv[1]

with open(file) as src:
    lines = [line.strip() for line in src.readlines()]

for line in lines:
    if line.lower().count("this") == 3:
        print(line)

Paste the script into an empty file, save it as find_3.py, and run it with the command:
```
python3 /path/to/find_3.py <file_withlines>
```

Of course, the word "this" can be replaced by any other word (or any other string or line section), and the number of occurrences per line can be set to any other value, by changing this line:

    if line.lower().count("this") == 3:
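
Note that a plain substring count also matches "this" inside longer words such as "thistle". If that matters, a small parametrized helper could look roughly like the sketch below. The function names and the `whole_word` switch are hypothetical, introduced only for this example, and the regex-based count is a swapped-in technique rather than what the answer uses.

```python
import re


def count_occurrences(line, word, whole_word=False):
    """Count case-insensitive, non-overlapping occurrences of `word` in `line`.

    With whole_word=True, only matches delimited by word boundaries count.
    """
    pattern = re.escape(word)
    if whole_word:
        pattern = r"\b" + pattern + r"\b"
    return len(re.findall(pattern, line, flags=re.IGNORECASE))


def lines_with_exactly(lines, word, n, whole_word=False):
    """Return the lines in which `word` occurs exactly `n` times."""
    return [
        line for line in lines
        if count_occurrences(line, word, whole_word) == n
    ]


# "thistle" contains "this" as a substring, but not as a whole word.
example = "this thistle THIS"
print(count_occurrences(example, "this"))
print(count_occurrences(example, "this", whole_word=True))
```

Both the word and the count are ordinary arguments here, so nothing needs to be edited in the script to search for something else.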

Edit

If the file is large (hundreds of thousands or millions of lines), the code below would be faster; it reads the file line by line instead of loading the whole file at once:

#!/usr/bin/env python3
import sys
file = sys.argv[1]

with open(file) as src:
    for line in src:
        if line.lower().count("this") == 3:
            print(line.strip())
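
The same line-by-line idea can also be wrapped in a generator, so the filter works on any iterable of lines (an open file, a list, `sys.stdin`). This is a minimal sketch under that assumption; `matching_lines` and `print_matches` are names invented for the example.

```python
def matching_lines(lines, word="this", n=3):
    """Lazily yield the lines that contain `word` exactly `n` times
    (case-insensitive substring match), without their trailing newline."""
    needle = word.lower()
    for line in lines:
        if line.lower().count(needle) == n:
            yield line.rstrip("\n")


def print_matches(path, word="this", n=3):
    """Stream a file through matching_lines; only one line is held at a time."""
    with open(path) as src:
        for hit in matching_lines(src, word, n):
            print(hit)


# The generator accepts any iterable of lines, for example a plain list:
sample = ["This this THIS\n", "only this once\n"]
print(list(matching_lines(sample)))
```

Calling `print_matches(sys.argv[1])` (after `import sys`) would give the same command-line behaviour as the script above.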

Just curious: Why did you not use a generator in the second code snippet? — muru, Jan 06 '15 at 00:16

Sri · Answer 3 · 2015-01-04T20:35:12.497

9

Assuming your source file is tmp.txt,

grep -iv '.*this.*this.*this.*this' tmp.txt | grep -i '.*this.*this.*this.*'

The left grep outputs all lines that do not have 4 or more case-insensitive occurrences of "this" in tmp.txt.

The result is piped to the right grep, which outputs all lines with 3 or more occurrences in the left grep result.

Update: Thanks to @Muru, here is a better version of this solution:

grep -Eiv '(.*this){4,}' tmp.txt | grep -Ei '(.*this){3}'

For a general n, replace 4 with n+1 and 3 with n.
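
To illustrate how the two patterns are built for an arbitrary n, here is a rough Python sketch of the same two-stage filter. It is not a replacement for the grep pipeline; `exactly_n_filter` and `SAMPLE` are hypothetical names, and Python's `re` module is assumed to behave like `grep -E` for these simple patterns.

```python
import re

# The four sample lines from the question.
SAMPLE = [
    "How to get This line that this word repeated 3 times in THIS line?",
    "But not this line which is THIS word repeated 2 times.",
    "And I will get This line with this here and This one",
    "A test line with four this and This another THIS and last this",
]


def exactly_n_filter(lines, word, n):
    """Two-stage filter: drop lines with n+1 or more matches, then keep
    the survivors that have at least n matches."""
    w = re.escape(word)
    # Stage 1 pattern, like: grep -Eiv '(.*this){n+1,}'
    too_many = re.compile("(.*" + w + "){" + str(n + 1) + ",}", re.IGNORECASE)
    # Stage 2 pattern, like: grep -Ei '(.*this){n}'
    enough = re.compile("(.*" + w + "){" + str(n) + "}", re.IGNORECASE)

    survivors = [line for line in lines if not too_many.search(line)]
    return [line for line in survivors if enough.search(line)]


for line in exactly_n_filter(SAMPLE, "this", 3):
    print(line)
```

Patterns of this shape can backtrack heavily on long lines with large n, so this is best treated as a quick solution for simple cases.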

edited Jan 04 '15 at 20:35

answered Jan 04 '15 at 18:54

Sri

This would fail for N > 4. And the first grep needs to end in *. – xyz Jan 04 '15 at 19:02
I mean you cannot write this for N = 50. And the question is for exactly three so you need another grep which discards all outputs containing less than or equal to two this. grep -iv '.*this.*this.*this.*this.*' tmp.txt | grep -i '.*this.*this.*this.*' | grep -iv '.*this.*this.' – xyz Jan 04 '15 at 19:09
@prakharsingh95 It didn't fail for n > 4 and * is not required in first grep. – Sri Jan 04 '15 at 19:13
My bad. * not needed in first grep. But you cannot write this for N=50, and you need the third grep. – xyz Jan 04 '15 at 19:16
@prakharsingh95 Yes, I agree it's not practical for something like n = 50, but it is a quick, one-command solution for simple requirements. And look at the logic: the left grep excludes all lines with 4+ occurrences, and the grep on that result keeps lines with 3+ occurrences, which naturally excludes lines with 2 occurrences. – Sri Jan 04 '15 at 19:18
@KasiyA what's your take on my answer? – Sri Jan 04 '15 at 19:20
@Sri ok, I was reading on tablet in a hurry, and indeed, two greps would work, but the time and code complexity issues remain :P – xyz Jan 04 '15 at 20:09
Simplify it a bit: grep -Eiv '(.*this){4,}' | grep -Ei '(.*this){3}' - this might make it practical for N=50. – muru Jan 04 '15 at 20:20

fedorqui · Answer 4 · 2015-01-05T16:53:45.783

6

You can play a bit with awk for this:

awk -F"this" 'BEGIN{IGNORECASE=1} NF==4' file

This returns:

How to get This line that this word repeated 3 times in THIS line?
And I will get This line with this here and This one

Explanation

We set the field separator to the word this itself. This way, a line has one more field than the number of times this appears in it.
To make the match case-insensitive, we use IGNORECASE = 1 (a GNU awk feature). See reference: Case Sensitivity in Matching.
Then, it is just a matter of saying NF==4 to get all those lines having this exactly three times. No more code is needed, since {print $0} (that is, print the current line) is the default behaviour of awk when an expression evaluates to True.
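
The field-counting logic described above can be mimicked in Python to see why NF == 4 corresponds to three occurrences. This is only an illustrative sketch; `field_count` is a made-up name, and `re.split` plays the role of awk's field splitting.

```python
import re


def field_count(line, separator="this"):
    """Split `line` on `separator` (case-insensitively) and return the
    number of resulting fields, analogous to awk's NF."""
    fields = re.split(re.escape(separator), line, flags=re.IGNORECASE)
    return len(fields)


lines = [
    "How to get This line that this word repeated 3 times in THIS line?",
    "But not this line which is THIS word repeated 2 times.",
    "And I will get This line with this here and This one",
    "A test line with four this and This another THIS and last this",
]

# Three separators cut a line into four fields, hence the comparison with 4.
for line in lines:
    if field_count(line) == 4:
        print(line)
```

A line without the word at all yields a single field, and every additional occurrence adds one more.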

edited Jan 05 '15 at 16:53

answered Jan 05 '15 at 14:15

fedorqui

Already posted, but good explanation. – muru Jan 05 '15 at 14:18
@muru oh, I didn't see it! My apologies and +1 for you. – fedorqui Jan 05 '15 at 14:20

score 5 · Answer 5 · edited Jan 04 '15 at 18:57

Assuming the lines are stored in a file named FILE:

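# Print every line of FILE that contains "this" exactly 3 times (case-insensitive).
while IFS= read -r line; do
    # grep -o prints each match on its own line; wc -w counts them.
    if [ "$(grep -oi "this" <<< "$line" | wc -w)" -eq 3 ]; then
        printf '%s\n' "$line"
    fi
done < FILE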

edited Jan 04 '15 at 18:57

αғsнιη

answered Jan 04 '15 at 18:03

xyz

Thank you, you can remove your sed ... command and add -o option for grep -oi ... instead. – αғsнιη Jan 04 '15 at 18:25
Simpler: $(grep -ic "this" <<<"$line") – muru Jan 04 '15 at 18:43
@muru No, the -c option counts the number of lines that match "this", not the number of "this" words in each line. – αғsнιη Jan 04 '15 at 18:50
@KasiyA Ah, yes. My bad. – muru Jan 04 '15 at 18:51
@KasiyA, wouldn't -l and -w be equivalent in this case? – xyz Jan 04 '15 at 19:00
In this case YES, they are equivalent: because we used grep's -o option, each matched "this" is printed on its own line, so with -l you are just counting newlines. It is still good to use the -w option to count words instead. – αғsнιη Jan 04 '15 at 19:04

score 4 · Answer 6 · answered Jan 05 '15 at 05:44

If you're in Vim:

g/./if len(split(getline('.'), 'this\c', 1)) == 4 | print | endif

This will just print matched lines.

Bohr

Nice example to search for lines with n occurrences of word, when using Vim. – Sri Jan 05 '15 at 06:05

score 0 · Answer 7 · answered Jan 07 '17 at 10:37

Ruby one-liner solution:

$ ruby -ne 'print $_ if $_.chomp.downcase.scan(/this/).count == 3' < input.txt
How to get This line that this word repeated 3 times in THIS line?
And I will get This line with this here and This one

It works in a fairly simple fashion: we redirect the file into ruby's stdin, ruby reads each line from stdin and normalizes it with chomp and downcase, and scan().count gives the number of occurrences of the substring.
