Find text file containing a given text ignoring new lines and spaces?

Question

I have a string like: "thisissometext". I want to find all text files inside a given directory (recursively) that contain this string, or any variations of it with white spaces and/or newlines in the middle of it. For example, a text file containing "this is sometext", "this\n issometext", or "this\n isso metext" should show up in the search. How can I do this?

Can the spaces / newlines be found only at fixed positions (i.e. only after / before a word?)? Or can they be found in any position (like "th\isisso metext")? — kos, May 27 '15 at 19:39
@kos It is in the question: this string, or any variations of it with white spaces and/or newlines in the middle of it — Jacob Vlijm, May 27 '15 at 20:09
@JacobVlijm I'm on it. So far heemayl's answer seems to work fine. — a06e, May 29 '15 at 18:08

heemayl · Accepted Answer · 2015-05-28T13:19:19.053

12

With newer versions of GNU grep (those that have the -z option) you can use this one-liner:

find . -type f -exec grep -lz 'this[[:space:]]*is[[:space:]]*some[[:space:]]*text' {} +

This assumes that whitespace can appear only between the words.

If you just want to search all files recursively starting from the current directory, you don't need find; grep -r (recursive) is enough. find is useful when you want to be selective about which files to search, e.g. to exclude certain directories. So, simply:

grep -rlz 'this[[:space:]]*is[[:space:]]*some[[:space:]]*text' .

The main trick here is -z: it makes grep treat the input as records terminated by ASCII NUL instead of newline. A text file normally contains no NUL bytes, so the whole file becomes a single record and newlines can be matched like any other character.
The [[:space:]] character class matches any whitespace character, including space, tab, CR and LF, so we can use it to match whatever whitespace appears between the words.
grep -l prints only the names of the files that contain a match. If you also want to print the matches, use -H instead of -l (with -z, the "line" that gets printed is the whole NUL-terminated record, which is typically the entire file).
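To illustrate why -z matters, here is a minimal sketch in Python rather than grep (the sample content and variable names are made up for the illustration). Matching line by line misses a string that is split across a newline, while matching the whole content as one record finds it.

```python
import re

# Same idea as the grep pattern: optional whitespace between the words.
pattern = re.compile(r"this\s*is\s*some\s*text")

# Hypothetical file content with the target string split across two lines.
content = "header\nthis\n is sometext\nfooter\n"

# Line-by-line matching (how grep works without -z) cannot see across "\n".
per_line = any(pattern.search(line) for line in content.splitlines())

# Whole-content matching (roughly what -z gives for files without NUL bytes).
whole = pattern.search(content) is not None

print(per_line, whole)
```

The first value is the per-line result and the second is the whole-content result; only the second approach can see a match that spans lines.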

On the other hand, if whitespace can appear anywhere, not just between the words, the pattern loses its good looks (keep it on a single line, since a backslash-newline inside single quotes is not a line continuation):

grep -rlz 't[[:space:]]*h[[:space:]]*i[[:space:]]*s[[:space:]]*i[[:space:]]*s[[:space:]]*s[[:space:]]*o[[:space:]]*m[[:space:]]*e[[:space:]]*t[[:space:]]*e[[:space:]]*x[[:space:]]*t' .

With the -P (PCRE) option you can replace [[:space:]] with \s, which looks much nicer:

grep -rlzP 't\s*h\s*i\s*s\s*i\s*s\s*s\s*o\s*m\s*e\s*t\s*e\s*x\s*t' .

Using @steeldriver's suggestion to get sed to generate the pattern for us would be the best option:

grep -rlzP "$(sed 's/./\\s*&/2g' <<< "thisissometext")" .
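The sed command inserts \s* before every character from the second one onward, which works as long as the search string contains no regex metacharacters. As a rough sketch of the same idea in Python (the helper name spaced_pattern is hypothetical, not part of any tool mentioned here), each character can be escaped before joining:

```python
import re

def spaced_pattern(needle):
    # Escape each character so regex metacharacters are matched literally,
    # then allow any run of whitespace between consecutive characters.
    return r"\s*".join(re.escape(ch) for ch in needle)

regex = re.compile(spaced_pattern("thisissometext"))

# Show the generated pattern for a short string, then try a match.
print(spaced_pattern("abc"))
print(bool(regex.search("th is\nisso  metext")))
```

Escaping is the only real difference from the sed one-liner: a search string such as "a.b" would otherwise let the dot match any character.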

edited May 28 '15 at 13:19

answered May 27 '15 at 21:07

Again, this does not work on this string, or any variations of it with white spaces and/or newlines in the middle of it, Only if they appear on whole words. – Jacob Vlijm May 27 '15 at 21:12
@JacobVlijm You were right..i had misunderstood it..fixed.. – heemayl May 27 '15 at 21:28
@heemayl you could maybe do something like grep -zP "$(sed 's/./\\s*&/2g' <<< "thisissometext")" to take some of the tedium out of extending your approach to arbitrary amounts of whitespace between any characters – steeldriver May 27 '15 at 21:28
@steeldriver Thanks....yeah..that would do great..i have added that.. – heemayl May 27 '15 at 21:42
Can you specify which version of grep do I need for this to work? (And how do I find by grep version?) Thanks. – a06e May 27 '15 at 22:10
@becko You can find grep version by grep --version .. i am using 2.16..i can't recall from which version it included -z..use man grep | grep -- '--null-data', you will get it if your grep supports it.. – heemayl May 27 '15 at 22:13
Why is this wrapped in a find-exec? Why not just use grep's -r recursive flag? – Oli May 28 '15 at 11:14
@Oli if the situation is as it is right now, grep -r is the way to go..find is used if OP wants to be more selective on the files.. :) – heemayl May 28 '15 at 11:22
the spaces need not be between words. In fact, the text I want to search for is a sequence of characters. There are no "words". The last command using sed seems to work fine. – a06e May 29 '15 at 17:45
@becko Well..then check the last three solutions..i have mentioned that clearly too.. :) – heemayl May 29 '15 at 17:46
Yes, I saw that, I +1 your answer ;). It seems to work fine. If I have no issues I'll accept it. – a06e May 29 '15 at 17:48

score 7 · Answer 2 · answered May 27 '15 at 20:16

You can delete all whitespace and grep it:

tr -d '[:space:]' < foo | grep thisissometext

Extended to every file under the current directory:

find . -type f -exec bash -c 'for i; do tr -d "[:space:]" < "$i" | grep -q thisissometext && printf "%s\n" "$i"; done' _ {} +

The bash command, expanded:

for i
do
    tr -d "[:space:]" < "$i" |
      grep -q thisissometext &&
      printf "%s\n" "$i"
done

This loops over all the file names that find passes as arguments and applies the test above to each one.
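The same strip-then-search test can be sketched in Python. This is only an illustration of the idea, and the function name contains_ignoring_whitespace is made up for the example:

```python
def contains_ignoring_whitespace(text, needle):
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes spaces, tabs and newlines alike.
    return needle in "".join(text.split())

print(contains_ignoring_whitespace("th is\n isso\tmetext", "thisissometext"))
print(contains_ignoring_whitespace("this is other text", "thisissometext"))
```

Like the tr pipeline, this treats the search string literally, so no regex escaping is needed.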

Jacob Vlijm · Answer 3 · 2015-05-27T20:48:57.030

The code below searches a directory recursively for files and removes all occurrences of " " and "\n" from each file's text. If the string exists in the remaining text, there is a match. This means the spaces/newlines can be at any position in the string inside your file(s).

What it does

If it finds matching files, they will be printed in the terminal, including their paths, like:

/home/jacob/Bureaublad/testmap/test2.txt
/home/jacob/Bureaublad/testmap/Naamloze map 2/test1.txt

I built in the try / except to prevent the script from breaking if it runs into an unreadable file.

The script

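#!/usr/bin/env python3
import os
import sys

# Usage: find_string.py <directory> <string_to_find>
s = sys.argv[2]
for root, dirs, files in os.walk(sys.argv[1]):
    for file in files:
        file = os.path.join(root, file)
        try:
            # Strip spaces and newlines, then look for the string
            if s in open(file).read().replace(" ", "").replace("\n",""):
                print(file)
        except (OSError, UnicodeDecodeError):
            # Skip files that cannot be read or decoded as text
            pass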

How to use

Copy the script into an empty file, save it as find_string.py

Run it with the directory and the string as arguments:

python3 /path/to/find_string.py <directory> <string_to_find>

If either the string or the directory contains spaces, use quotes:

python3 /path/to/find_string.py '<directory>' '<string_to_find>'

Note

As it stands, the script finds files that contain the string with spaces and/or newlines inside it. It can be extended to ignore other characters or strings (e.g. tabs) by adding further .replace() calls to the line:

if s in open(file).read().replace(" ", "").replace("\n",""):
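As an alternative to chaining .replace() calls, a translation table can delete a whole set of characters in one pass. The sketch below is an assumption about how such an extension might look, not part of the original script; the names IGNORED, DELETE_TABLE and normalized are hypothetical.

```python
# Characters to ignore; tab and carriage return are additions beyond
# the original script's space and newline.
IGNORED = " \n\t\r"
DELETE_TABLE = str.maketrans("", "", IGNORED)

def normalized(text):
    # Remove every ignored character in a single pass.
    return text.translate(DELETE_TABLE)

print(normalized("th\tis is\r\nsome text") == "thisissometext")
```

Adding another character to ignore then only means appending it to IGNORED.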

Sergiy Kolodyazhnyy · Answer 4 · 2015-05-27T19:56:57.577

0

You could use grep -i --recursive 'word1\|word2' *, and awk '/word1/,/word2/' can be used to deal with the newline.

edited May 27 '15 at 19:56

answered May 27 '15 at 19:47

This does not work on this string, or any variations of it with white spaces and/or newlines in the middle of it. – Jacob Vlijm May 27 '15 at 19:50
