1

I have a syslog file of about 4 GB per minute, and below are 2 of the roughly 4.5 million lines in each minute's file,

and I want to generate a new file with only a few columns, eventtime|srcip|dstip, so the result will be as follows:

1548531299|X.X.X.X|X.X.X.X

Please note that the position of the fields is random.
I've tried some filters, but they still take more than 40 minutes to handle one file on a powerful VM with 4 cores and 16 GB RAM.

So is there a fast method to handle such a big file and extract the required columns?

{Jan 26 22:35:00 172.20.23.148 date=2019-01-26 time=22:34:59 devname="ERB-03" devid="5KDTB18800169" logid="0000000011" type="traffic" subtype="forward" level="warning" vd="Users" eventtime=1548531299 srcip=X.X.X.X srcport=3XXXX srcintf="GGI-cer.405" srcintfrole="undefined" dstip=X.X.X.X dstport=XX dstintf="hh-BB.100" dstintfrole="undefined" sessionid=xxxxxxx proto=6 action="ip" policyid=5 policytype="policy" service="HTTP" appcat="unscanned" crscore=5 craction=xxxxxx crlevel="low"

Jan 26 22:35:00 172.20.23.148 date=2019-01-26 time=22:34:59 devname="XXX-XX-FGT-03" devid="XX-XXXXXXXX" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="Users" eventtime=1548531299 srcip=X.X.X.X srcport=XXXXX srcintf="XXX-Core.123" srcintfrole="undefined" dstip=X.X.X.X dstport=XX dstintf="sXX-CC.100" dstintfrole="undefined" sessionid=1234567 cvdpkt=0 appcat="unscanned" crscore=5 craction=123456 crlevel="low"}
pa4080
Ubai salih
  • 1
    I assume that grep + regexp will be the fastest – cmak.fr Jan 28 '19 at 08:39
  • 1
    If syslog-ng is generating 4G/min of firewall data into syslog, either you've got a real problem, or you need to pump that firewall data into a separate file. Please explain why you're doing this... inquiring minds want to know :-) – heynnema Jan 28 '19 at 16:03
  • 2
    postnote... if your "powerful VM" is VirtualBox, and you've got it set to use all of your cores and all of your RAM... you're killing your host. – heynnema Jan 28 '19 at 16:10
  • Do you need to log the raw data? If not, have you considered using rsyslog's regex features like re_extract to filter the messages as they are coming in? – doneal24 Jan 30 '19 at 17:19

4 Answers

2

Perl to the rescue

Save the following script as filter.pl and make it executable (chmod +x):

#!/usr/bin/env perl

use strict;
use warnings;

while( <> ) {
    if ( /^(?=.*eventtime=(\S+))(?=.*srcip=(\S+))(?=.*dstip=(\S+)).*$/ ) {
        print "$1|$2|$3\n";
    }
}

Then run

pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt

real    0m44,984s
user    0m43,965s
sys     0m0,973s

The regex uses a lookaround pattern, in this case a positive lookahead, to match the three values eventtime, srcip, and dstip in any order.
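
As a quick sanity check (with made-up IP values, not taken from the log), the fields can indeed appear in any order in the input line and the output columns still come out as eventtime|srcip|dstip:

$ printf 'dstip=10.0.0.2 foo=bar eventtime=1548531299 srcip=10.0.0.1\n' | ./filter.pl
1548531299|10.0.0.1|10.0.0.2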

I duplicated your two input lines until I got a file with 4 GB and approximately 9 million lines. I ran the code on an SSD.

PerlDuck
  • +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep. On my main computer, which greps 1 million lines per minute, I 'perl' 6.89 million lines per minute with your script, and this should be fast enough to solve the problem. – sudodus Jan 28 '19 at 19:19
  • 1
    Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-) – PerlDuck Jan 28 '19 at 19:20
  • You are right about that ;-) – sudodus Jan 28 '19 at 19:21
  • @PerlDuck, really shocking speed, thank you very much – Ubai salih Jan 29 '19 at 10:59
  • 2
    @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day! – PerlDuck Jan 29 '19 at 11:56
  • @PerlDuck, just for curiosity: could you please compare the time of the solution I propose with yours? –  Jan 30 '19 at 12:27
  • @PerlDuck, thank you. (probably 90% of the time is writing the output file). your lookaround solution is brilliant! –  Jan 30 '19 at 14:07
  • @JJoao Now I'm back on the computer I used when writing my answer. Perl: 43 seconds, Flex/C: 36 seconds. Indeed, your solution is a bit faster. Cool. I guess it could even further be sped up when the input is split into, say, 4 parts and have the filter program process those parts in parallel to take advantage of multiple cores. – PerlDuck Jan 30 '19 at 18:51
  • 1
    @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2), and we can also reduce some seconds in the Perl solution, but definitely the log policy needs revision. –  Jan 30 '19 at 19:15
2

If you want a really fast solution, I suggest the flex tool. Flex generates C. The following is capable of processing examples like the one presented, accepting the fields in any order. So create a file named f.fl with the following content:

%option main
%%
  char e[100] = "", s[100] = "", d[100] = "";  /* initialised empty so the first line cannot print garbage */

eventtime=[^ \n]*   { strcpy(e,yytext+10); }
srcip=[^ \n]*       { strcpy(s,yytext+6);  }
dstip=[^ \n]*       { strcpy(d,yytext+6);  }
\n                  { if (e[0] && s[0] && d[0]) printf("%s|%s|%s\n",e,s,d);
                      e[0]=s[0]=d[0]=0; }
.                   {}
%%

To test try:

$ flex -f -o f.c f.fl 
$ cc -O2 -o f f.c
$ ./f < input > output

Here is the time comparison:

$ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
real    0m35.689s
user    0m34.705s
sys     0m0.908s
1

I duplicated your two 'input' lines to a file of size 3867148288 bytes (3.7 GiB), and I could process it with grep in 8 minutes and 24 seconds (reading from and writing to an HDD; it should be faster using an SSD or a ramdrive).

In order to minimize the time used, I used only standard features of grep and did not post-process the output, so the output format is not what you specified, but it might be useful anyway. You can test this command:

time grep -oE -e 'eventtime=[0-9]* ' \
 -e 'srcip=[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]' \
 -e 'dstip=[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]' \
infile > outfile

Output from your two lines:

$ cat outfile
eventtime=1548531298 
srcip=X.X.X.Y
dstip=X.X.X.X
eventtime=1548531299 
srcip=X.X.X.Z
dstip=X.X.X.Y

The output file contains 25165824 lines, corresponding to 8388608 (8.4 million) lines in the input file.

$ wc -l outfile
25165824 outfile
$ <<< '25165824/3' bc
8388608

My test indicates that grep can process approximately 1 million lines per minute.

Unless your computer is much faster than mine, this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file. It would be best to completely avoid output of what is not necessary (and so avoid filtering altogether).


The input file is made by duplication, so maybe the system 'remembers' that it has seen the same lines before and runs faster. I don't know how fast it will work with a real big file with all its unpredictable variations. You have to test it.


Edit 1: I ran the same task on a Dell M4800 with an Intel 4th generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, at almost double the speed, 1.82 million lines per minute.

$ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
1.82

Still too slow.


Edit 2: I simplified the grep patterns and ran the command again on the Dell.

time grep -oE -e 'eventtime=[^\ ]*' \
 -e 'srcip=[^\ ]*' \
 -e 'dstip=[^\ ]*' \
infile > out

It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute.

$ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
2.00

Edit 3: @JJoao's suggestion to use grep's perl extension speeds up grep to 39 seconds, corresponding to 12.90 million lines per minute on the computer where ordinary grep reads 1 million lines per minute (reading from and writing to an HDD).

$ time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile >out-grep-JJoao

real    0m38,699s
user    0m31,466s
sys     0m2,843s

This perl extension is experimental according to info grep, but it works in my Lubuntu 18.04.1 LTS.

‘-P’ ‘--perl-regexp’ Interpret the pattern as a Perl-compatible regular expression (PCRE). This is experimental, particularly when combined with the ‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of unimplemented features. *Note Other Options::.

I also compiled a C program according to @JJoao's flex method, and it finished in 53 seconds, corresponding to 9.49 million lines per minute on the computer where ordinary grep reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep with the perl extension is the fastest.

$ time ./filt.jjouo < infile > out-flex-JJoao

real    0m53,440s
user    0m48,789s
sys     0m3,104s

Edit 3.1: On the Dell M4800 with an SSD I had the following results:

time ./filt.jjouo < infile > out-flex-JJoao

real    0m25,611s
user    0m24,794s
sys     0m0,592s

time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile >out-grep-JJoao

real    0m18,375s
user    0m17,654s
sys     0m0,500s

This corresponds to

  • 19.66 million lines per minute for the flex application
  • 27.35 million lines per minute for grep with the perl extension

Edit 3.2: On the Dell M4800 with an SSD I had the following results when I used the option -f with the flex preprocessor,

flex -f -o filt.c filt.flex
cc -O2 -o filt.jjoao filt.c

The result was improved, and now the flex application shows the highest speed:

flex -f ...

$ time ./filt.jjoao < infile > out-flex-JJoao

real    0m15,952s
user    0m15,318s
sys     0m0,628s

This corresponds to

  • 31.55 million lines per minute for the flex application.
sudodus
  • thank you very much for such effort, so is it possible to split the file into many chunks and run the filter simultaneously against these chunks? – Ubai salih Jan 28 '19 at 20:00
  • @Ubaisalih, yes, it might be possible and worthwhile to split the file into many chunks if you can use several processors in parallel for the task (see the sketch after these comments). But you should really try to avoid writing such a huge file with columns of data that you will not use. Instead, you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now. Check how the current syslog file is created and how you can create the file that you need directly (not via a filter). – sudodus Jan 28 '19 at 20:19
  • @sudodus, (+1); for curiosity: could you please time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile –  Jan 30 '19 at 12:33
  • @JJoao, see Edit 3. – sudodus Jan 30 '19 at 17:14
  • @sudodus, cool! grep is a really good tool. One of the reasons for the improvement is that it writes fewer bytes. Just one final curiosity: was the flex binary built with the optimizations (flex -f ...; cc -O2 ...)? –  Jan 30 '19 at 19:06
  • @JJoao, Yes, I used the -O2 optimization flag as you prescribed. I only changed the file names. I used the same executable file filt.jjouo on both computers (transferred it with sftp). – sudodus Jan 30 '19 at 19:45
  • @sudodus, Thanks again! (using flex -f would reduce 30% more but that is absolutely irrelevant) –  Jan 30 '19 at 22:00
  • @JJoao, I copied and pasted from the previous version of your answer, without the option -f. Now I have tried with that option, and the improvement was significant. The flex application shows the highest speed, 31.55 million lines per minute :-) – sudodus Jan 31 '19 at 07:24
  • 1
    @Ubaisalih, With the flex application you might get something that is fast enough from the huge syslog file, but I still suggest that you find a method to write a file with only the data that you need from the beginning. – sudodus Jan 31 '19 at 07:26
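
A minimal sketch of the chunk-splitting idea discussed in these comments, assuming GNU parallel is installed, PerlDuck's script is saved as filter.pl, and in-file/out-file are placeholder names (the block size and job count are examples, not tuned values):

# split the input into ~100 MB chunks on line boundaries and run
# 4 filter processes in parallel, keeping the output order
parallel --pipepart -a in-file --block 100M -j4 -k ./filter.pl > out-file

Whether this actually beats a single filter process depends on how fast the disk can feed four readers at once; on a single HDD it may not help at all.
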
0

Here is one possible solution, based on this answer, provided by @PerlDuck a while ago:

#!/bin/bash
while IFS= read -r LINE
do
    if [[ ! -z ${LINE} ]]
    then
        eval $(echo "$LINE" | sed -e 's/\({\|}\)//g' -e 's/ /\n/g' | sed -ne '/=/p')
        echo "$eventtime|$srcip|$dstip"
    fi
done < "$1"

I do not know how it will behave on such a large file; IMO an awk solution would be much faster (a rough sketch is appended at the end of this answer). Here is how it works with the provided input file example:

$ ./script.sh in-file
1548531299|X.X.X.X|X.X.X.X
1548531299|X.X.X.X|X.X.X.X

Here is the result of a timing test, performed on a regular i7 equipped with an SSD and 16 GB RAM:

$ time ./script.sh 160000-lines-in-file > out-file

real    4m49.620s
user    6m15.875s
sys     1m50.254s
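
For comparison, the awk variant mentioned at the top of this answer could look roughly like this (untested at the 4 GB scale; the substr offsets assume the exact eventtime=/srcip=/dstip= key names from the question, and in-file/out-file are placeholder names):

awk '{
    e = s = d = ""
    for (i = 1; i <= NF; i++) {
        if      (index($i, "eventtime=") == 1) e = substr($i, 11)
        else if (index($i, "srcip=")     == 1) s = substr($i, 7)
        else if (index($i, "dstip=")     == 1) d = substr($i, 7)
    }
    if (e != "" && s != "" && d != "") print e "|" s "|" d
}' in-file > out-file

As with the other approaches, whether it is fast enough to keep up with one file per minute has to be measured.
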
pa4080
  • thank you very much for the reply. I've tried this but it still consumes too much time to filter one file, so I'm not sure if there is something faster, probably with AWK as you mentioned – Ubai salih Jan 28 '19 at 09:16
  • Hi, @Ubaisalih, I think awk will be faster according to this time comparison. Anyway, is a new 4 GB log created each minute, or are only a few hundred lines appended each minute to an existing file? – pa4080 Jan 28 '19 at 09:31
  • Huh, I just saw you are the same OP as on the linked PerlDuck's answer :) – pa4080 Jan 28 '19 at 09:51
  • 2
    @Ubaisalih, what service writes these logs? I'm asking because, for example, Apache2 can pipe its logs to a script, and that script can then write both a regular log and a processed log, etc. If such a solution could be applied, it would be the best way, I think. – pa4080 Jan 28 '19 at 10:41
  • it's a new 4 GB file each minute; the problem now is the speed of filtering the file, so is there a faster method to capture a few columns in a timely manner? – Ubai salih Jan 28 '19 at 11:47
  • it's a firewall log and it's captured by syslog-ng. – Ubai salih Jan 28 '19 at 12:22
  • @Ubaisalih, I can't investigate it right now. I think keywords that you are searching for are: syslog-ng pipe to script. – pa4080 Jan 28 '19 at 12:32
  • @PerlDuck, nice method to extract the relevant parts of the file, and nice catch, pa4080, to use it here. But the bash shell is way too slow. I think you must call one single program once and let it do the whole thing. Someone who knows awk well can test how fast it can do the job. I am testing a method with grep, which will be much faster, but I don't know if it will be fast enough. – sudodus Jan 28 '19 at 14:32
  • @Ubaisalih, What is the time limit (process 4 GB within 1 minute)? Or can the task be done 'off-line' in a longer time? In that case how long? Would it be possible as pa4080 suggests, to filter directly and avoid writing all the data, that we try to get rid of with the filter? – sudodus Jan 28 '19 at 14:35
  • the time limit should not exceed the 1 minute limit, to be able to filter the next file; otherwise the files will pile up and the filtering process will not be able to keep up with the files – Ubai salih Jan 28 '19 at 15:13
  • @pa4080, very good answer (+1)! –  Jan 30 '19 at 12:24
  • @pa4080, for curiosity: could you please compare the time of the solution I propose with yours? –  Jan 30 '19 at 12:25