I duplicated your two 'input' lines to a file of size 3867148288 bytes (3.7 GiB), and grep could process it in 8 minutes and 24 seconds (reading from and writing to an HDD; it should be faster with an SSD or a ramdisk).
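A sketch of how such a file can be made by repeated doubling (the seed file name 2lines is a placeholder; 22 doublings of 2 lines give 2^23 = 8388608 lines):

cp 2lines infile                  # seed file with the two sample lines
for i in $(seq 22)                # 22 doublings: 2*2^22 = 8388608 lines
do
  cat infile infile > tmp && mv tmp infile
done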
To minimize the time used, I used only standard features of grep and did no post-processing, so the output format is not what you specified, but it might be useful anyway. You can test this command:
time grep -oE -e 'eventtime=[0-9]* ' \
     -e 'srcip=[[:alnum:]]+\.[[:alnum:]]+\.[[:alnum:]]+\.[[:alnum:]]+' \
     -e 'dstip=[[:alnum:]]+\.[[:alnum:]]+\.[[:alnum:]]+\.[[:alnum:]]+' \
     infile > outfile
Output from your two lines:
$ cat outfile
eventtime=1548531298
srcip=X.X.X.Y
dstip=X.X.X.X
eventtime=1548531299
srcip=X.X.X.Z
dstip=X.X.X.Y
The output file contains 25165824 lines, three matched fields for each of the 8388608 (8.4 million) lines in the input file.
$ wc -l outfile
25165824 outfile
$ <<< '25165824/3' bc
8388608
My test indicates that grep can process approximately 1 million lines per minute. Unless your computer is much faster than mine, this is not fast enough, and I think you have to consider something several times faster: probably filtering before the log file is written, though it would be best to avoid writing unnecessary output in the first place (and avoid filtering altogether).
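A sketch of the 'filter before writing' idea, assuming the log messages can be intercepted as a stream; logsource is a placeholder for whatever emits them:

# hypothetical: filter the stream so only the wanted fields ever reach the disk
logsource |
    grep --line-buffered -oE \
         -e 'eventtime=[^ ]*' -e 'srcip=[^ ]*' -e 'dstip=[^ ]*' \
         > filtered.log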
The input file was made by duplication, and maybe the system 'remembers' that it has seen the same lines before and speeds things up, so I don't know how fast it will work with a real big file and all its unpredictable variation. You have to test it.
Edit 1: I ran the same task on a Dell M4800 with an Intel 4th-generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, almost double the speed: 1.82 million lines per minute.
$ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
1.82
Still too slow.
Edit 2: I simplified the grep patterns and ran the test again on the Dell.
time grep -oE -e 'eventtime=[^ ]*' \
     -e 'srcip=[^ ]*' \
     -e 'dstip=[^ ]*' \
     infile > out
It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute.
$ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
2.00
Edit 3: @JJoao's perl extension speeds up grep to 39 seconds, corresponding to 12.90 million lines per minute on the computer where ordinary grep processes 1 million lines per minute (reading from and writing to an HDD).
$ time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile >out-grep-JJoao
real 0m38,699s
user 0m31,466s
sys 0m2,843s
This perl extension is experimental according to info grep, but it works in my Lubuntu 18.04.1 LTS:
‘-P’ ‘--perl-regexp’
Interpret the pattern as a Perl-compatible regular expression
(PCRE). This is experimental, particularly when combined with the
‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
unimplemented features. *Note Other Options::.
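The \K in @JJoao's pattern resets the start of the reported match, so only the value after the = sign is output. Note that this means the output no longer contains the key names:

$ echo 'eventtime=1548531298 srcip=X.X.X.Y dstip=X.X.X.X' | grep -oP '\b(eventtime|srcip|dstip)=\K\S+'
1548531298
X.X.X.Y
X.X.X.X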
I also compiled a C program according to @JJoao's flex method, and it finished in 53 seconds, corresponding to 9.49 million lines per minute on the computer where ordinary grep processes 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep with the perl extension is fastest.
$ time ./filt.jjoao < infile > out-flex-JJoao
real 0m53,440s
user 0m48,789s
sys 0m3,104s
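For reference, a minimal flex filter of this kind can look like the sketch below. It is a simplified stand-in for @JJoao's program (it lacks the \b word-boundary check of the grep pattern) and is compiled with the flex and cc commands shown in Edit 3.2:

/* filt.flex -- simplified sketch, not @JJoao's exact source */
%option main
%%
(eventtime|srcip|dstip)=[^ \n]*   printf("%s\n", yytext);
.|\n                              ;
%%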
Edit 3.1: On the Dell M4800 with an SSD I had the following results:
time ./filt.jjoao < infile > out-flex-JJoao
real 0m25,611s
user 0m24,794s
sys 0m0,592s
time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile >out-grep-JJoao
real 0m18,375s
user 0m17,654s
sys 0m0,500s
This corresponds to
- 19.66 million lines per minute for the flex application
- 27.35 million lines per minute for grep with the perl extension
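With the times rounded to 25.6 and 18.4 seconds, the same bc check as before gives these figures:

$ <<< 'scale=2;25165824/3/25.6*60/10^6' bc
19.66
$ <<< 'scale=2;25165824/3/18.4*60/10^6' bc
27.35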
Edit 3.2: On the Dell M4800 with an SSD I had the following results when I used the option -f with the flex scanner generator (-f requests a full, uncompressed scanner table, which is larger but faster):
flex -f -o filt.c filt.flex
cc -O2 -o filt.jjoao filt.c
The result improved, and now the flex application shows the highest speed:
flex -f ...
$ time ./filt.jjoao < infile > out-flex-JJoao
real 0m15,952s
user 0m15,318s
sys 0m0,628s
This corresponds to
- 31.55 million lines per minute for the flex application.