Fastest method to filter a huge file
I have a syslog file that grows by about 4 GB (roughly 4.5 million lines) every minute; below are 2 of those lines.
I want to generate a new file with only a few columns, eventtime|srcip|dstip, so the result will be as follows:
1548531299|X.X.X.X|X.X.X.X
Please note that the position of the fields is random.
I've tried some filters, but they still take more than 40 minutes to handle one file on a powerful VM with 4 cores and 16 GB of RAM.
So, is there a fast method to handle such a big file and extract only the required columns?
{Jan 26 22:35:00 172.20.23.148 date=2019-01-26 time=22:34:59 devname="ERB-03" devid="5KDTB18800169" logid="0000000011" type="traffic" subtype="forward" level="warning" vd="Users" eventtime=1548531299 srcip=X.X.X.X srcport=3XXXX srcintf="GGI-cer.405" srcintfrole="undefined" dstip=X.X.X.X dstport=XX dstintf="hh-BB.100" dstintfrole="undefined" sessionid=xxxxxxx proto=6 action="ip" policyid=5 policytype="policy" service="HTTP" appcat="unscanned" crscore=5 craction=xxxxxx crlevel="low"
Jan 26 22:35:00 172.20.23.148 date=2019-01-26 time=22:34:59 devname="XXX-XX-FGT-03" devid="XX-XXXXXXXX" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="Users" eventtime=1548531299 srcip=X.X.X.X srcport=XXXXX srcintf="XXX-Core.123" srcintfrole="undefined" dstip=X.X.X.X dstport=XX dstintf="sXX-CC.100" dstintfrole="undefined" sessionid=1234567 cvdpkt=0 appcat="unscanned" crscore=5 craction=123456 crlevel="low"}
command-line networking text-processing
asked Jan 28 at 8:12 by Ubai salih, edited Jan 28 at 10:37 by pa4080
I assume that grep + regexp will be the fastest.
– cmak.fr
Jan 28 at 8:39
If syslog-ng is generating 4G/min of firewall data into syslog, either you've got a real problem, or you need to pump that firewall data into a separate file. Please explain why you're doing this... inquiring minds want to know :-)
– heynnema
Jan 28 at 16:03
Post note... if your "powerful VM" is VirtualBox, and you've got it set to use all of your cores and all of your RAM... you're killing your host.
– heynnema
Jan 28 at 16:10
Do you need to log the raw data? If not, have you considered using rsyslog's regex features like re_extract to filter the messages as they are coming in?
– Doug O'Neal
Jan 30 at 17:19
4 Answers
Perl to the rescue
Save the following script as filter.pl and make it executable (chmod +x):
#!/usr/bin/env perl
use strict;
use warnings;

while ( <> ) {
    # three lookaheads match the fields regardless of their order in the line
    if ( /^(?=.*eventtime=(\S+))(?=.*srcip=(\S+))(?=.*dstip=(\S+)).*$/ ) {
        print "$1|$2|$3\n";
    }
}
Then run
pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt
real 0m44,984s
user 0m43,965s
sys 0m0,973s
The regex uses a lookaround pattern, in this case a positive lookahead, to match the three values eventtime, srcip, and dstip in any order.
I duplicated your two input lines until I got a file with 4 GB and
approximately 9 million lines. I ran the code on an SSD.
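If you prefer not to save a script, a roughly equivalent one-liner is sketched below. This is an untested variant of the same idea, not the benchmarked script above: it performs three independent matches per line instead of one combined lookahead pattern.
# Sketch: each list-context match captures one field (or leaves the variable
# undefined); a line is printed only when all three fields were found,
# regardless of their order in the line.
perl -ne '($e) = /eventtime=(\S+)/; ($s) = /srcip=(\S+)/; ($d) = /dstip=(\S+)/;
          print "$e|$s|$d\n" if $e && $s && $d;' input.txt > output.txt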
+1. Congratulations :-) Your perl script is a lot faster than what I could do with grep. On my main computer, which greps 1 million lines per minute, I get 6.89 million lines per minute with your script, and this should be fast enough to solve the problem.
– sudodus
Jan 28 at 19:19
Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)
– PerlDuck
Jan 28 at 19:20
You are right about that ;-)
– sudodus
Jan 28 at 19:21
@Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!
– PerlDuck
Jan 29 at 11:56
@PerlDuck, 43 vs 36 is about the same. We can optimize it more (flex -f; cc -O2), and we can also shave some seconds off the Perl solution, but the log policy definitely needs revision.
– JJoao
Jan 30 at 19:15
If you want a really fast solution, I suggest the flex tool. Flex generates C. The following is capable of processing examples like the one presented, accepting fields in any order. Create a file named f.fl with the following content:
%option main
%%
  char e[100], s[100], d[100];

eventtime=[^ \n]*   { strcpy(e, yytext+10); }
srcip=[^ \n]*       { strcpy(s, yytext+6); }
dstip=[^ \n]*       { strcpy(d, yytext+6); }
\n                  { if (e[0] && s[0] && d[0]) printf("%s|%s|%s\n", e, s, d);
                      e[0] = s[0] = d[0] = 0; }
.                   {}
%%
To test try:
$ flex -f -o f.c f.fl
$ cc -O2 -o f f.c
$ ./f < input > output
Here is the time comparison:
$ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
real 0m35.689s
user 0m34.705s
sys 0m0.908s
I duplicated your two 'input' lines to a file of size 3867148288 bytes (3.7 GiB) and could process it with grep in 8 minutes and 24 seconds (reading from and writing to an HDD; it should be faster using an SSD or ramdisk).
In order to minimize the time used, I used only standard features of grep and did not post-process the output, so the output format is not what you specify, but it might be useful anyway. You can test this command:
time grep -oE -e 'eventtime=[0-9]* ' \
     -e 'srcip=[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]' \
     -e 'dstip=[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]' \
     infile > outfile
Output from your two lines:
$ cat outfile
eventtime=1548531298
srcip=X.X.X.Y
dstip=X.X.X.X
eventtime=1548531299
srcip=X.X.X.Z
dstip=X.X.X.Y
The output file contains 25165824 lines corresponding to 8388608 (8.3 million) lines in the input file.
$ wc -l outfile
25165824 outfile
$ <<< '25165824/3' bc
8388608
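If the pipe-separated eventtime|srcip|dstip format is needed, this one-field-per-line output can be folded back together. A sketch, assuming every record matched all three patterns in that order (outfile.psv is just an illustrative name):
# Sketch: strip the "key=" prefixes, then join every three consecutive lines
# with '|' separators.
sed 's/^[^=]*=//' outfile | paste -d'|' - - - > outfile.psv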
My test indicates that grep can process approximately 1 million lines per minute.
Unless your computer is much faster than mine, this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file, but it would be best to completely avoid output of what is not necessary (and so avoid filtering).
The input file is made by duplication, and maybe the system 'remembers' that it has seen the same lines before and makes things faster, so I don't know how fast it will work with a real big file with all the unpredicted variations. You have to test it.
Edit 1: I ran the same task on a Dell M4800 with an Intel 4th generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, almost double the speed, at 1.82 million lines per minute.
$ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
1.82
Still too slow.
Edit 2: I simplified the grep patterns and ran it again on the Dell.
time grep -oE -e 'eventtime=[^ ]*' \
     -e 'srcip=[^ ]*' \
     -e 'dstip=[^ ]*' \
     infile > out
It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute.
$ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
2.00
Edit 3: @JJoao's perl-regexp extension speeds up grep to 39 seconds, corresponding to 12.90 million lines per minute on the computer where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD).
$ time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile > out-grep-JJoao
real 0m38,699s
user 0m31,466s
sys 0m2,843s
This perl extension is experimental according to info grep, but it works in my Lubuntu 18.04.1 LTS.
‘-P’ ‘--perl-regexp’
Interpret the pattern as a Perl-compatible regular expression
(PCRE). This is experimental, particularly when combined with the
‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
unimplemented features. *Note Other Options::.
I also compiled a C program according to @JJoao's flex method, and it finished in 53 seconds, corresponding to 9.49 million lines per minute on the computer where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep with the perl extension is fastest.
$ time ./filt.jjoao < infile > out-flex-JJoao
real 0m53,440s
user 0m48,789s
sys 0m3,104s
Edit 3.1: In the Dell M4800 with an SSD I had the following results:
time ./filt.jjoao < infile > out-flex-JJoao
real 0m25,611s
user 0m24,794s
sys 0m0,592s
time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile > out-grep-JJoao
real 0m18,375s
user 0m17,654s
sys 0m0,500s
This corresponds to
- 19.66 million lines per minute for the flex application
- 27.35 million lines per minute for grep with the perl extension
Edit 3.2: In the Dell M4800 with an SSD I had the following results when I used the option -f to the flex preprocessor,
flex -f -o filt.c filt.flex
cc -O2 -o filt.jjoao filt.c
The result was improved, and now the flex application shows the highest speed:
flex -f ...
$ time ./filt.jjoao < infile > out-flex-JJoao
real 0m15,952s
user 0m15,318s
sys 0m0,628s
This corresponds to
- 31.55 million lines per minute for the flex application.
Thank you very much for such effort. Is it possible to split the file into many chunks and run the filter simultaneously against these chunks?
– Ubai salih
Jan 28 at 20:00
@Ubaisalih, Yes, it might be possible and worthwhile to split the file into many chunks, if you can use several processors in parallel for the task. But you should really try to avoid writing such a huge file with columns of data that you will not use. Instead you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now. Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).
– sudodus
Jan 28 at 20:19
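For reference, one way to try that chunked, parallel approach without splitting the file on disk is GNU parallel's --pipepart mode. A sketch, assuming GNU parallel is installed and that ./f is the compiled flex filter from JJoao's answer (which processes its input line by line):
# Sketch: hand out ~100 MB chunks of the file to one filter process per CPU
# core; note that the output order may differ from the input order.
parallel --pipepart -a infile --block 100M ./f > outfile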
@sudodus, (+1); out of curiosity: could you please time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile
– JJoao
Jan 30 at 12:33
@JJoao, see Edit 3.
– sudodus
Jan 30 at 17:14
@Ubaisalih, With the flex application you might get something that is fast enough from the huge syslog file, but I still suggest that you find a method to write a file with only the data that you need from the beginning.
– sudodus
Jan 31 at 7:26
Here is one possible solution, based on this answer, provided by @PerlDuck a while ago:
#!/bin/bash
while IFS= read -r LINE
do
    if [[ ! -z ${LINE} ]]
    then
        # split the line into one key=value per line (dropping the braces),
        # keep only the assignments, and eval them into shell variables
        eval $(echo "$LINE" | sed -e 's/\({\|}\)//g' -e 's/ /\n/g' | sed -ne '/=/p')
        echo "$eventtime|$srcip|$dstip"
    fi
done < "$1"
I do not know how it will behave on such a large file; IMO an awk solution would be much faster (a sketch of such an awk filter is shown after the timing below). Here is how it works with the provided input file example:
$ ./script.sh in-file
1548531299|X.X.X.X|X.X.X.X
1548531299|X.X.X.X|X.X.X.X
Here is the result of a productivity time test, performed on a regular i7, equipped with an SSD and 16GB RAM:
$ time ./script.sh 160000-lines-in-file > out-file
real 4m49.620s
user 6m15.875s
sys 1m50.254s
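As for the awk idea mentioned above, here is a minimal sketch of what such a filter could look like. It is untested on a real 4 GB file and assumes every wanted value is a single whitespace-delimited key=value token:
# Sketch: scan each whitespace-separated field, remember the three wanted
# values wherever they occur, and print them pipe-separated per line.
awk '{
    e = s = d = ""
    for (i = 1; i <= NF; i++) {
        if      (index($i, "eventtime=") == 1) e = substr($i, 11)
        else if (index($i, "srcip=") == 1)     s = substr($i, 7)
        else if (index($i, "dstip=") == 1)     d = substr($i, 7)
    }
    if (e != "" && s != "" && d != "") print e "|" s "|" d
}' in-file > out-file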
Thank you very much for the reply. I've tried this, but it still consumes too much time to filter one file, so I'm not sure if there is something faster, probably with AWK as you mentioned.
– Ubai salih
Jan 28 at 9:16
Hi, @Ubaisalih, I think awk will be faster according to this time comparison. Anyway, is a new 4GB log created each minute, or are only a few hundred lines appended each minute to an existing file?
– pa4080
Jan 28 at 9:31
Huh, I just saw you are the same OP on the linked PerlDuck's answer :)
– pa4080
Jan 28 at 9:51
@Ubaisalih, what service writes these logs? I'm asking because, for example, Apache2 can pipe its logs to a script, and that script can write a regular log and also a processed log, etc. If such a solution could be applied, it would be the best way, I think.
– pa4080
Jan 28 at 10:41
It's a new 4GB file each minute; the problem now is the speed of filtering the file, so is there a faster method to capture a few columns in a timely manner?
– Ubai salih
Jan 28 at 11:47
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "89"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2faskubuntu.com%2fquestions%2f1113457%2ffastest-method-to-filter-a-huge-file%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
Perl to the rescue
Save the following script as filter.pl
and make it executable (chmod +x
):
#!/usr/bin/env perl
use strict;
use warnings;
while( <> ) {
if ( /^(?=.*eventtime=(S+))(?=.*srcip=(S+))(?=.*dstip=(S+)).*$/ ) {
print "$1|$2|$3n";
}
}
Then run
pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt
real 0m44,984s
user 0m43,965s
sys 0m0,973s
The regex uses a lookaround pattern, in this case a positive lookahead,
to match the three values eventtime
, srcip
, and dstip
in any order.
I duplicated your two input lines until I got a file with 4 GB and
approximately 9 million lines. I ran the code on an SSD.
+1. Congratulations :-) Yourperl
script is a lot faster than what I could do withgrep
, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.
– sudodus
Jan 28 at 19:19
1
Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)
– PerlDuck
Jan 28 at 19:20
You are right about that ;-)
– sudodus
Jan 28 at 19:21
2
@Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!
– PerlDuck
Jan 29 at 11:56
1
@PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.
– JJoao
Jan 30 at 19:15
|
show 4 more comments
Perl to the rescue
Save the following script as filter.pl
and make it executable (chmod +x
):
#!/usr/bin/env perl
use strict;
use warnings;
while( <> ) {
if ( /^(?=.*eventtime=(S+))(?=.*srcip=(S+))(?=.*dstip=(S+)).*$/ ) {
print "$1|$2|$3n";
}
}
Then run
pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt
real 0m44,984s
user 0m43,965s
sys 0m0,973s
The regex uses a lookaround pattern, in this case a positive lookahead,
to match the three values eventtime
, srcip
, and dstip
in any order.
I duplicated your two input lines until I got a file with 4 GB and
approximately 9 million lines. I ran the code on an SSD.
+1. Congratulations :-) Yourperl
script is a lot faster than what I could do withgrep
, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.
– sudodus
Jan 28 at 19:19
1
Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)
– PerlDuck
Jan 28 at 19:20
You are right about that ;-)
– sudodus
Jan 28 at 19:21
2
@Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!
– PerlDuck
Jan 29 at 11:56
1
@PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.
– JJoao
Jan 30 at 19:15
|
show 4 more comments
Perl to the rescue
Save the following script as filter.pl
and make it executable (chmod +x
):
#!/usr/bin/env perl
use strict;
use warnings;
while( <> ) {
if ( /^(?=.*eventtime=(S+))(?=.*srcip=(S+))(?=.*dstip=(S+)).*$/ ) {
print "$1|$2|$3n";
}
}
Then run
pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt
real 0m44,984s
user 0m43,965s
sys 0m0,973s
The regex uses a lookaround pattern, in this case a positive lookahead,
to match the three values eventtime
, srcip
, and dstip
in any order.
I duplicated your two input lines until I got a file with 4 GB and
approximately 9 million lines. I ran the code on an SSD.
Perl to the rescue
Save the following script as filter.pl
and make it executable (chmod +x
):
#!/usr/bin/env perl
use strict;
use warnings;
while( <> ) {
if ( /^(?=.*eventtime=(S+))(?=.*srcip=(S+))(?=.*dstip=(S+)).*$/ ) {
print "$1|$2|$3n";
}
}
Then run
pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt
real 0m44,984s
user 0m43,965s
sys 0m0,973s
The regex uses a lookaround pattern, in this case a positive lookahead,
to match the three values eventtime
, srcip
, and dstip
in any order.
I duplicated your two input lines until I got a file with 4 GB and
approximately 9 million lines. I ran the code on an SSD.
answered Jan 28 at 19:08
PerlDuckPerlDuck
6,73711535
6,73711535
+1. Congratulations :-) Yourperl
script is a lot faster than what I could do withgrep
, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.
– sudodus
Jan 28 at 19:19
1
Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)
– PerlDuck
Jan 28 at 19:20
You are right about that ;-)
– sudodus
Jan 28 at 19:21
2
@Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!
– PerlDuck
Jan 29 at 11:56
1
@PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.
– JJoao
Jan 30 at 19:15
|
show 4 more comments
+1. Congratulations :-) Yourperl
script is a lot faster than what I could do withgrep
, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.
– sudodus
Jan 28 at 19:19
1
Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)
– PerlDuck
Jan 28 at 19:20
You are right about that ;-)
– sudodus
Jan 28 at 19:21
2
@Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!
– PerlDuck
Jan 29 at 11:56
1
@PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.
– JJoao
Jan 30 at 19:15
+1. Congratulations :-) Your
perl
script is a lot faster than what I could do with grep
, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.– sudodus
Jan 28 at 19:19
+1. Congratulations :-) Your
perl
script is a lot faster than what I could do with grep
, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.– sudodus
Jan 28 at 19:19
1
1
Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)
– PerlDuck
Jan 28 at 19:20
Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)
– PerlDuck
Jan 28 at 19:20
You are right about that ;-)
– sudodus
Jan 28 at 19:21
You are right about that ;-)
– sudodus
Jan 28 at 19:21
2
2
@Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!
– PerlDuck
Jan 29 at 11:56
@Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!
– PerlDuck
Jan 29 at 11:56
1
1
@PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.
– JJoao
Jan 30 at 19:15
@PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.
– JJoao
Jan 30 at 19:15
|
show 4 more comments
If you want a really fast solution I suggest flex tool. Flex generates C. The following is capable of processing examples like the one presented accepting free order fields. So create a file, named f.fl
with the following content:
%option main
%%
char e[100], s[100], d[100];
eventtime=[^ n]* { strcpy(e,yytext+10); }
srcip=[^ n]* { strcpy(s,yytext+6); }
dstip=[^ n]* { strcpy(d,yytext+6); }
n { if (e[0] && s[0] && d[0] )printf("%s|%s|%sn",e,s,d);
e[0]=s[0]=d[0]=0 ;}
. {}
%%
To test try:
$ flex -f -o f.c f.fl
$ cc -O2 -o f f.c
$ ./f < input > output
Here is the time
comparison:
$ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
real 0m35.689s
user 0m34.705s
sys 0m0.908s
add a comment |
If you want a really fast solution I suggest flex tool. Flex generates C. The following is capable of processing examples like the one presented accepting free order fields. So create a file, named f.fl
with the following content:
%option main
%%
char e[100], s[100], d[100];
eventtime=[^ n]* { strcpy(e,yytext+10); }
srcip=[^ n]* { strcpy(s,yytext+6); }
dstip=[^ n]* { strcpy(d,yytext+6); }
n { if (e[0] && s[0] && d[0] )printf("%s|%s|%sn",e,s,d);
e[0]=s[0]=d[0]=0 ;}
. {}
%%
To test try:
$ flex -f -o f.c f.fl
$ cc -O2 -o f f.c
$ ./f < input > output
Here is the time
comparison:
$ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
real 0m35.689s
user 0m34.705s
sys 0m0.908s
add a comment |
If you want a really fast solution I suggest flex tool. Flex generates C. The following is capable of processing examples like the one presented accepting free order fields. So create a file, named f.fl
with the following content:
%option main
%%
char e[100], s[100], d[100];
eventtime=[^ n]* { strcpy(e,yytext+10); }
srcip=[^ n]* { strcpy(s,yytext+6); }
dstip=[^ n]* { strcpy(d,yytext+6); }
n { if (e[0] && s[0] && d[0] )printf("%s|%s|%sn",e,s,d);
e[0]=s[0]=d[0]=0 ;}
. {}
%%
To test try:
$ flex -f -o f.c f.fl
$ cc -O2 -o f f.c
$ ./f < input > output
Here is the time
comparison:
$ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
real 0m35.689s
user 0m34.705s
sys 0m0.908s
If you want a really fast solution I suggest flex tool. Flex generates C. The following is capable of processing examples like the one presented accepting free order fields. So create a file, named f.fl
with the following content:
%option main
%%
char e[100], s[100], d[100];
eventtime=[^ n]* { strcpy(e,yytext+10); }
srcip=[^ n]* { strcpy(s,yytext+6); }
dstip=[^ n]* { strcpy(d,yytext+6); }
n { if (e[0] && s[0] && d[0] )printf("%s|%s|%sn",e,s,d);
e[0]=s[0]=d[0]=0 ;}
. {}
%%
To test try:
$ flex -f -o f.c f.fl
$ cc -O2 -o f f.c
$ ./f < input > output
Here is the time
comparison:
$ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
real 0m35.689s
user 0m34.705s
sys 0m0.908s
edited Jan 30 at 22:11
answered Jan 30 at 12:22
JJoaoJJoao
1,40069
1,40069
add a comment |
add a comment |
I duplicated your two 'input' lines to a file size 3867148288 bytes (3.7 GiB) and I could process it with grep
in 8 minutes and 24 seconds (reading from and writing to a HDD. It should be faster using an SSD or ramdrive).
In order to minimize the time used, I used only standard features of grep
, and did not post-process it, so the output format is not what you specify, but might be useful anyway. You can test this command
time grep -oE -e 'eventtime=[0-9]* '
-e 'srcip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
-e 'dstip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
infile > outfile
Output from your two lines:
$ cat outfile
eventtime=1548531298
srcip=X.X.X.Y
dstip=X.X.X.X
eventtime=1548531299
srcip=X.X.X.Z
dstip=X.X.X.Y
The output file contains 25165824 lines corresponding to 8388608 (8.3 million) lines in the input file.
$ wc -l outfile
25165824 outfile
$ <<< '25165824/3' bc
8388608
My test indicates that grep
can process approximately 1 million lines per minute.
Unless your computer is much faster than mine. this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file, but it would be best to completely avoid output of what is not necessary (and avoid filtering).
The input file is made by duplication, and maybe the system 'remembers' that it has seen the same lines before and makes things faster, so I don't know how fast it will work with a real big file with all the unpredicted variations. You have to test it.
Edit1: I ran the same task in a Dell M4800 with an Intel 4th generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, at almost double speed, 1.82 million lines per minute.
$ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
1.82
Still too slow.
Edit2: I simplified the grep
patterns and ran it again in the Dell.
time grep -oE -e 'eventtime=[^ ]*'
-e 'srcip=[^ ]*'
-e 'dstip=[^ ]*'
infile > out
It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute
$ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
2.00
Edit 3: @JJoao's, perl extension speeds up grep
to 39 seconds corresponding to 12.90 million lines per minute in the computer, where the ordinary grep
reads 1 million lines per minute (reading from and writing to an HDD).
$ time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao
real 0m38,699s
user 0m31,466s
sys 0m2,843s
This perl extension is experiental according to info grep
but works in my Lubuntu 18.04.1 LTS.
‘-P’ ‘--perl-regexp’
Interpret the pattern as a Perl-compatible regular expression
(PCRE). This is experimental, particularly when combined with the
‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
unimplemented features. *Note Other Options::.
I also compiled a C program according to @JJoao's flex
method, and it finshed in 53 seconds corresponding to 9.49 million lines per minute in the computer, where the ordinary grep
reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep
with the perl extension is fastest.
$ time ./filt.jjouo < infile > out-flex-JJoao
real 0m53,440s
user 0m48,789s
sys 0m3,104s
Edit 3.1: In the Dell M4800 with an SSD I had the following results,
time ./filt.jjouo < infile > out-flex-JJoao
real 0m25,611s
user 0m24,794s
sys 0m0,592s
time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao
real 0m18,375s
user 0m17,654s
sys 0m0,500s
This corresponds to
- 19.66 million lines per minute for the
flex
application - 27.35 million lines per minute for
grep
with the perl extension
Edit 3.2: In the Dell M4800 with an SSD I had the following results, when I used the option -f
to the flex preprocessor,
flex -f -o filt.c filt.flex
cc -O2 -o filt.jjoao filt.c
The result was improved, and now the flex
application shows the highest speed
flex -f ...
$ time ./filt.jjoao < infile > out-flex-JJoao
real 0m15,952s
user 0m15,318s
sys 0m0,628s
This corresponds to
- 31.55 million lines per minute for the
flex
application.
thank you very much for such effort, so is it possible to split the file to many chunks and run the filter simultaneously against these chunks ?
– Ubai salih
Jan 28 at 20:00
@Ubaisalih, Yes, it might be possible and worthwhile to split the file in many chunks, if you can use use several processors in parallel for the task. But you should really try to avoid writing such a huge file with columns of data, that you will not use. Instead you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now, Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).
– sudodus
Jan 28 at 20:19
@sudodus, (+1); for curiosity: could you please timegrep -oP 'b(eventtime|srcip|dstip)=KS+' infile
– JJoao
Jan 30 at 12:33
@JJoao, see Edit 3.
– sudodus
Jan 30 at 17:14
1
@Ubaisalih, With theflex
application you might get something that is fast enough from the huge sys log file, but I still suggest that you should find a method to write a file with only the data that you need from the beginning.
– sudodus
Jan 31 at 7:26
|
show 4 more comments
I duplicated your two 'input' lines to a file size 3867148288 bytes (3.7 GiB) and I could process it with grep
in 8 minutes and 24 seconds (reading from and writing to a HDD. It should be faster using an SSD or ramdrive).
In order to minimize the time used, I used only standard features of grep
, and did not post-process it, so the output format is not what you specify, but might be useful anyway. You can test this command
time grep -oE -e 'eventtime=[0-9]* '
-e 'srcip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
-e 'dstip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
infile > outfile
Output from your two lines:
$ cat outfile
eventtime=1548531298
srcip=X.X.X.Y
dstip=X.X.X.X
eventtime=1548531299
srcip=X.X.X.Z
dstip=X.X.X.Y
The output file contains 25165824 lines corresponding to 8388608 (8.3 million) lines in the input file.
$ wc -l outfile
25165824 outfile
$ <<< '25165824/3' bc
8388608
My test indicates that grep
can process approximately 1 million lines per minute.
Unless your computer is much faster than mine. this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file, but it would be best to completely avoid output of what is not necessary (and avoid filtering).
The input file is made by duplication, and maybe the system 'remembers' that it has seen the same lines before and makes things faster, so I don't know how fast it will work with a real big file with all the unpredicted variations. You have to test it.
Edit1: I ran the same task in a Dell M4800 with an Intel 4th generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, at almost double speed, 1.82 million lines per minute.
$ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
1.82
Still too slow.
Edit2: I simplified the grep
patterns and ran it again in the Dell.
time grep -oE -e 'eventtime=[^ ]*'
-e 'srcip=[^ ]*'
-e 'dstip=[^ ]*'
infile > out
It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute
$ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
2.00
Edit 3: @JJoao's, perl extension speeds up grep
to 39 seconds corresponding to 12.90 million lines per minute in the computer, where the ordinary grep
reads 1 million lines per minute (reading from and writing to an HDD).
$ time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao
real 0m38,699s
user 0m31,466s
sys 0m2,843s
This perl extension is experiental according to info grep
but works in my Lubuntu 18.04.1 LTS.
‘-P’ ‘--perl-regexp’
Interpret the pattern as a Perl-compatible regular expression
(PCRE). This is experimental, particularly when combined with the
‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
unimplemented features. *Note Other Options::.
I also compiled a C program according to @JJoao's flex
method, and it finshed in 53 seconds corresponding to 9.49 million lines per minute in the computer, where the ordinary grep
reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep
with the perl extension is fastest.
$ time ./filt.jjouo < infile > out-flex-JJoao
real 0m53,440s
user 0m48,789s
sys 0m3,104s
Edit 3.1: In the Dell M4800 with an SSD I had the following results,
time ./filt.jjouo < infile > out-flex-JJoao
real 0m25,611s
user 0m24,794s
sys 0m0,592s
time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao
real 0m18,375s
user 0m17,654s
sys 0m0,500s
This corresponds to
- 19.66 million lines per minute for the
flex
application - 27.35 million lines per minute for
grep
with the perl extension
Edit 3.2: In the Dell M4800 with an SSD I had the following results, when I used the option -f
to the flex preprocessor,
flex -f -o filt.c filt.flex
cc -O2 -o filt.jjoao filt.c
The result was improved, and now the flex
application shows the highest speed
flex -f ...
$ time ./filt.jjoao < infile > out-flex-JJoao
real 0m15,952s
user 0m15,318s
sys 0m0,628s
This corresponds to
- 31.55 million lines per minute for the
flex
application.
thank you very much for such effort, so is it possible to split the file to many chunks and run the filter simultaneously against these chunks ?
– Ubai salih
Jan 28 at 20:00
@Ubaisalih, Yes, it might be possible and worthwhile to split the file in many chunks, if you can use use several processors in parallel for the task. But you should really try to avoid writing such a huge file with columns of data, that you will not use. Instead you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now, Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).
– sudodus
Jan 28 at 20:19
@sudodus, (+1); for curiosity: could you please timegrep -oP 'b(eventtime|srcip|dstip)=KS+' infile
– JJoao
Jan 30 at 12:33
@JJoao, see Edit 3.
– sudodus
Jan 30 at 17:14
1
@Ubaisalih, With theflex
application you might get something that is fast enough from the huge sys log file, but I still suggest that you should find a method to write a file with only the data that you need from the beginning.
– sudodus
Jan 31 at 7:26
|
show 4 more comments
I duplicated your two 'input' lines to a file size 3867148288 bytes (3.7 GiB) and I could process it with grep
in 8 minutes and 24 seconds (reading from and writing to a HDD. It should be faster using an SSD or ramdrive).
In order to minimize the time used, I used only standard features of grep
, and did not post-process it, so the output format is not what you specify, but might be useful anyway. You can test this command
time grep -oE -e 'eventtime=[0-9]* '
-e 'srcip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
-e 'dstip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
infile > outfile
Output from your two lines:
$ cat outfile
eventtime=1548531298
srcip=X.X.X.Y
dstip=X.X.X.X
eventtime=1548531299
srcip=X.X.X.Z
dstip=X.X.X.Y
The output file contains 25165824 lines corresponding to 8388608 (8.3 million) lines in the input file.
$ wc -l outfile
25165824 outfile
$ <<< '25165824/3' bc
8388608
My test indicates that grep
can process approximately 1 million lines per minute.
Unless your computer is much faster than mine. this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file, but it would be best to completely avoid output of what is not necessary (and avoid filtering).
The input file is made by duplication, and maybe the system 'remembers' that it has seen the same lines before and makes things faster, so I don't know how fast it will work with a real big file with all the unpredicted variations. You have to test it.
Edit1: I ran the same task in a Dell M4800 with an Intel 4th generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, at almost double speed, 1.82 million lines per minute.
$ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
1.82
Still too slow.
Edit2: I simplified the grep
patterns and ran it again in the Dell.
time grep -oE -e 'eventtime=[^ ]*'
-e 'srcip=[^ ]*'
-e 'dstip=[^ ]*'
infile > out
It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute
$ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
2.00
Edit 3: @JJoao's, perl extension speeds up grep
to 39 seconds corresponding to 12.90 million lines per minute in the computer, where the ordinary grep
reads 1 million lines per minute (reading from and writing to an HDD).
$ time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao
real 0m38,699s
user 0m31,466s
sys 0m2,843s
This perl extension is experiental according to info grep
but works in my Lubuntu 18.04.1 LTS.
‘-P’ ‘--perl-regexp’
Interpret the pattern as a Perl-compatible regular expression
(PCRE). This is experimental, particularly when combined with the
‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
unimplemented features. *Note Other Options::.
I also compiled a C program according to @JJoao's flex
method, and it finshed in 53 seconds corresponding to 9.49 million lines per minute in the computer, where the ordinary grep
reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep
with the perl extension is fastest.
$ time ./filt.jjouo < infile > out-flex-JJoao
real 0m53,440s
user 0m48,789s
sys 0m3,104s
Edit 3.1: In the Dell M4800 with an SSD I had the following results,
time ./filt.jjouo < infile > out-flex-JJoao
real 0m25,611s
user 0m24,794s
sys 0m0,592s
time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao
real 0m18,375s
user 0m17,654s
sys 0m0,500s
This corresponds to
- 19.66 million lines per minute for the
flex
application - 27.35 million lines per minute for
grep
with the perl extension
Edit 3.2: In the Dell M4800 with an SSD I had the following results, when I used the option -f
to the flex preprocessor,
flex -f -o filt.c filt.flex
cc -O2 -o filt.jjoao filt.c
The result was improved, and now the flex
application shows the highest speed
flex -f ...
$ time ./filt.jjoao < infile > out-flex-JJoao
real 0m15,952s
user 0m15,318s
sys 0m0,628s
This corresponds to
- 31.55 million lines per minute for the
flex
application.
I duplicated your two 'input' lines to a file size 3867148288 bytes (3.7 GiB) and I could process it with grep
in 8 minutes and 24 seconds (reading from and writing to a HDD. It should be faster using an SSD or ramdrive).
In order to minimize the time used, I used only standard features of grep
, and did not post-process it, so the output format is not what you specify, but might be useful anyway. You can test this command
time grep -oE -e 'eventtime=[0-9]* '
-e 'srcip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
-e 'dstip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
infile > outfile
Output from your two lines:
$ cat outfile
eventtime=1548531298
srcip=X.X.X.Y
dstip=X.X.X.X
eventtime=1548531299
srcip=X.X.X.Z
dstip=X.X.X.Y
The output file contains 25165824 lines corresponding to 8388608 (8.3 million) lines in the input file.
$ wc -l outfile
25165824 outfile
$ <<< '25165824/3' bc
8388608
My test indicates that grep
can process approximately 1 million lines per minute.
Unless your computer is much faster than mine. this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file, but it would be best to completely avoid output of what is not necessary (and avoid filtering).
The input file is made by duplication, and maybe the system 'remembers' that it has seen the same lines before and makes things faster, so I don't know how fast it will work with a real big file with all the unpredicted variations. You have to test it.
Edit1: I ran the same task in a Dell M4800 with an Intel 4th generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, at almost double speed, 1.82 million lines per minute.
$ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
1.82
Still too slow.
Edit2: I simplified the grep
patterns and ran it again in the Dell.
time grep -oE -e 'eventtime=[^ ]*'
-e 'srcip=[^ ]*'
-e 'dstip=[^ ]*'
infile > out
It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute
$ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
2.00
Edit 3: @JJoao's, perl extension speeds up grep
to 39 seconds corresponding to 12.90 million lines per minute in the computer, where the ordinary grep
reads 1 million lines per minute (reading from and writing to an HDD).
$ time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao
real 0m38,699s
user 0m31,466s
sys 0m2,843s
This perl extension is experiental according to info grep
but works in my Lubuntu 18.04.1 LTS.
‘-P’ ‘--perl-regexp’
Interpret the pattern as a Perl-compatible regular expression
(PCRE). This is experimental, particularly when combined with the
‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
unimplemented features. *Note Other Options::.
I also compiled a C program according to @JJoao's flex
method, and it finshed in 53 seconds corresponding to 9.49 million lines per minute in the computer, where the ordinary grep
reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep
with the perl extension is fastest.
$ time ./filt.jjouo < infile > out-flex-JJoao
real 0m53,440s
user 0m48,789s
sys 0m3,104s
Edit 3.1: In the Dell M4800 with an SSD I had the following results,
time ./filt.jjouo < infile > out-flex-JJoao
real 0m25,611s
user 0m24,794s
sys 0m0,592s
time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao
real 0m18,375s
user 0m17,654s
sys 0m0,500s
This corresponds to
- 19.66 million lines per minute for the
flex
application - 27.35 million lines per minute for
grep
with the perl extension
Edit 3.2: In the Dell M4800 with an SSD I had the following results, when I used the option -f
to the flex preprocessor,
flex -f -o filt.c filt.flex
cc -O2 -o filt.jjoao filt.c
The result was improved, and now the flex
application shows the highest speed
flex -f ...
$ time ./filt.jjoao < infile > out-flex-JJoao
real 0m15,952s
user 0m15,318s
sys 0m0,628s
This corresponds to
- 31.55 million lines per minute for the
flex
application.
edited Jan 31 at 7:21
answered Jan 28 at 15:23
sudodussudodus
24.3k32875
24.3k32875
thank you very much for such effort, so is it possible to split the file to many chunks and run the filter simultaneously against these chunks ?
– Ubai salih
Jan 28 at 20:00
@Ubaisalih, Yes, it might be possible and worthwhile to split the file in many chunks, if you can use use several processors in parallel for the task. But you should really try to avoid writing such a huge file with columns of data, that you will not use. Instead you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now, Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).
– sudodus
Jan 28 at 20:19
@sudodus, (+1); for curiosity: could you please timegrep -oP 'b(eventtime|srcip|dstip)=KS+' infile
– JJoao
Jan 30 at 12:33
@JJoao, see Edit 3.
– sudodus
Jan 30 at 17:14
1
@Ubaisalih, With theflex
application you might get something that is fast enough from the huge sys log file, but I still suggest that you should find a method to write a file with only the data that you need from the beginning.
– sudodus
Jan 31 at 7:26
|
show 4 more comments
thank you very much for such effort, so is it possible to split the file to many chunks and run the filter simultaneously against these chunks ?
– Ubai salih
Jan 28 at 20:00
@Ubaisalih, Yes, it might be possible and worthwhile to split the file in many chunks, if you can use use several processors in parallel for the task. But you should really try to avoid writing such a huge file with columns of data, that you will not use. Instead you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now, Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).
– sudodus
Jan 28 at 20:19
@sudodus, (+1); for curiosity: could you please timegrep -oP 'b(eventtime|srcip|dstip)=KS+' infile
– JJoao
Jan 30 at 12:33
@JJoao, see Edit 3.
– sudodus
Jan 30 at 17:14
1
@Ubaisalih, With theflex
application you might get something that is fast enough from the huge sys log file, but I still suggest that you should find a method to write a file with only the data that you need from the beginning.
– sudodus
Jan 31 at 7:26
thank you very much for such effort, so is it possible to split the file to many chunks and run the filter simultaneously against these chunks ?
– Ubai salih
Jan 28 at 20:00
thank you very much for such effort, so is it possible to split the file to many chunks and run the filter simultaneously against these chunks ?
– Ubai salih
Jan 28 at 20:00
@Ubaisalih, Yes, it might be possible and worthwhile to split the file in many chunks, if you can use use several processors in parallel for the task. But you should really try to avoid writing such a huge file with columns of data, that you will not use. Instead you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now, Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).
– sudodus
Jan 28 at 20:19
@Ubaisalih, Yes, it might be possible and worthwhile to split the file in many chunks, if you can use use several processors in parallel for the task. But you should really try to avoid writing such a huge file with columns of data, that you will not use. Instead you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now, Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).
– sudodus
Jan 28 at 20:19
@sudodus, (+1); for curiosity: could you please time
grep -oP 'b(eventtime|srcip|dstip)=KS+' infile
– JJoao
Jan 30 at 12:33
@sudodus, (+1); for curiosity: could you please time
grep -oP 'b(eventtime|srcip|dstip)=KS+' infile
– JJoao
Jan 30 at 12:33
@JJoao, see Edit 3.
– sudodus
Jan 30 at 17:14
@JJoao, see Edit 3.
– sudodus
Jan 30 at 17:14
1
1
@Ubaisalih, With the
flex
application you might get something that is fast enough from the huge sys log file, but I still suggest that you should find a method to write a file with only the data that you need from the beginning.– sudodus
Jan 31 at 7:26
@Ubaisalih, With the
flex
application you might get something that is fast enough from the huge sys log file, but I still suggest that you should find a method to write a file with only the data that you need from the beginning.– sudodus
Jan 31 at 7:26
|
show 4 more comments
Here is one possible solution, based on this answer, provided by @PerlDuck a while ago:
#!/bin/bash
while IFS= read -r LINE
do
if [[ ! -z ${LINE} ]]
then
eval $(echo "$LINE" | sed -e 's/({|})//g' -e 's/ /n/g' | sed -ne '/=/p')
echo "$eventtime|$srcip|$dstip"
fi
done < "$1"
I do not know how it will behave on such large file, IMO an awk
solution will be much faster. Here is how it works with the provided input file example:
$ ./script.sh in-file
1548531299|X.X.X.X|X.X.X.X
1548531299|X.X.X.X|X.X.X.X
Here is the result of a productivity time
test, performed on a regular i7, equipped with SSD and 16GB RAM:
$ time ./script.sh 160000-lines-in-file > out-file
real 4m49.620s
user 6m15.875s
sys 1m50.254s
thank you very much for reply,. i've tried this but still consuming too much time to filter one file, so i'm not sure if there is something faster probably with AWK as you mentioned
– Ubai salih
Jan 28 at 9:16
Hi, @Ubaisalih, I thinkawk
will be faster according to this time comparison. Whatever, does a new 4GB log is created each minute or only few hundred lines are appended each minute to an existing file?
– pa4080
Jan 28 at 9:31
Huh, I just saw you are the same OP on the linked PerlDuck"s answer :)
– pa4080
Jan 28 at 9:51
2
@Ubaisalih, what service writes these logs? I'm asking, because I remembered, for example Apache2 can pipe its logs to a script and then that script can write a regular log and also processed log, etc. - if such solution could be applied this will be the best way, I think.
– pa4080
Jan 28 at 10:41
it's a new file 4GB each minute, the problem now is the speed of filtering the file, so is there a faster method to capture few column with a timely manner ? –
– Ubai salih
Jan 28 at 11:47
edited Jan 28 at 9:41
answered Jan 28 at 9:04
pa4080
Do you need to log the raw data? If not, have you considered using rsyslog's regex features like re_extract to filter the messages as they are coming in?
– Doug O'Neal
Jan 30 at 17:19
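A minimal sketch of what that rsyslog approach could look like, based on the documented re_extract() RainerScript function. It is untested; the config file name, output path and variable names are invented for illustration, and the regular expressions only cover the sample lines from the question:
# /etc/rsyslog.d/50-fortigate-extract.conf  (illustrative only)
# pull the three values out of each matching message and write them
# pipe-separated to a much smaller file
template(name="pipefmt" type="string" string="%$!etime%|%$!sip%|%$!dip%\n")

if $msg contains "eventtime=" then {
    set $!etime = re_extract($msg, "eventtime=([0-9]+)", 0, 1, "");
    set $!sip   = re_extract($msg, "srcip=([0-9.]+)", 0, 1, "");
    set $!dip   = re_extract($msg, "dstip=([0-9.]+)", 0, 1, "");
    action(type="omfile" file="/var/log/fortigate-min.log" template="pipefmt")
}
Filtering at ingest time like this would avoid writing (and then re-reading) the 4 GB raw file altogether, which is the point of the comment above.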