Fastest method to filter a huge file
I have a syslog file of about 4 GB every minute (roughly 4.5 million lines per file); two of those lines are shown below.

I want to generate a new file with only a few columns, eventtime|srcip|dstip, so the result will look like the following:

1548531299|X.X.X.X|X.X.X.X

Please note that the position of the fields within a line is random.

I've tried some filters, but each file still takes more than 40 minutes to process on a powerful VM with 4 cores and 16 GB of RAM.

So, is there a fast method to handle such a big file and extract the required columns?

{Jan 26 22:35:00 172.20.23.148 date=2019-01-26 time=22:34:59 devname="ERB-03" devid="5KDTB18800169" logid="0000000011" type="traffic" subtype="forward" level="warning" vd="Users" eventtime=1548531299 srcip=X.X.X.X srcport=3XXXX srcintf="GGI-cer.405" srcintfrole="undefined" dstip=X.X.X.X dstport=XX dstintf="hh-BB.100" dstintfrole="undefined" sessionid=xxxxxxx proto=6 action="ip" policyid=5 policytype="policy" service="HTTP" appcat="unscanned" crscore=5 craction=xxxxxx crlevel="low"

Jan 26 22:35:00 172.20.23.148 date=2019-01-26 time=22:34:59 devname="XXX-XX-FGT-03" devid="XX-XXXXXXXX" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="Users" eventtime=1548531299 srcip=X.X.X.X srcport=XXXXX srcintf="XXX-Core.123" srcintfrole="undefined" dstip=X.X.X.X dstport=XX dstintf="sXX-CC.100" dstintfrole="undefined" sessionid=1234567 cvdpkt=0 appcat="unscanned" crscore=5 craction=123456 crlevel="low"}









  • i assume that grep + regexp will be the fastest

    – cmak.fr, Jan 28 at 8:39

  • If syslog-ng is generating 4G/min of firewall data into syslog, either you've got a real problem, or you need to pump that firewall data into a separate file. Please explain why you're doing this... inquiring minds want to know :-)

    – heynnema, Jan 28 at 16:03

  • postnote... if your "powerful VM" is VirtualBox, and you've got it set to use all of your cores and all of your RAM... you're killing your host.

    – heynnema, Jan 28 at 16:10

  • Do you need to log the raw data? If not, have you considered using rsyslog's regex features like re_extract to filter the messages as they are coming in?

    – Doug O'Neal, Jan 30 at 17:19
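
Following up on the last two comments: if the raw 4 GB-per-minute stream is not actually needed, the cheapest option is to filter it as it is written instead of post-processing each finished file. A minimal sketch, assuming the logging daemon can be pointed at a named pipe (syslog-ng and rsyslog can both write to a pipe; check their docs); the paths are made up, and the filter could be the Perl script from the answers below:

mkfifo /var/log/fw.pipe                                              # create the named pipe
nohup ./filter.pl < /var/log/fw.pipe > /var/log/fw-filtered.log &    # keep a filter reading it
# then point the firewall log destination at /var/log/fw.pipe instead of a 4 GB file per minute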
















command-line networking text-processing






asked Jan 28 at 8:12 by Ubai salih · edited Jan 28 at 10:37 by pa4080
4 Answers
































Perl to the rescue



Save the following script as filter.pl and make it executable (chmod +x):



#!/usr/bin/env perl

use strict;
use warnings;

while ( <> ) {
    if ( /^(?=.*eventtime=(\S+))(?=.*srcip=(\S+))(?=.*dstip=(\S+)).*$/ ) {
        print "$1|$2|$3\n";
    }
}


Then run



pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt

real 0m44,984s
user 0m43,965s
sys 0m0,973s


The regex uses a lookaround pattern, in this case a positive lookahead,
to match the three values eventtime, srcip, and dstip in any order.
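
The same extraction can also be written without the nested lookaheads, as three independent matches per line; this is only a sketch of a variant to benchmark, not a guaranteed speed-up:

perl -ne 'print "$1|$2|$3\n" if /eventtime=(\S+)/ && /srcip=(\S+)/ && /dstip=(\S+)/' input.txt > output.txt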



I duplicated your two input lines until I got a file with 4 GB and
approximately 9 million lines. I ran the code on an SSD.
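
If you want to build a similar test file yourself, a quick way (a sketch; the file names are assumed) is to double a small sample repeatedly:

cp two-sample-lines.txt input.txt            # start from the two example lines (~900 bytes)
for i in $(seq 1 22); do                     # 22 doublings of ~900 bytes gives roughly 4 GB
    cat input.txt input.txt > input.tmp && mv input.tmp input.txt
done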






  • +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep. In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.

    – sudodus, Jan 28 at 19:19

  • Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

    – PerlDuck, Jan 28 at 19:20

  • You are right about that ;-)

    – sudodus, Jan 28 at 19:21

  • @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

    – PerlDuck, Jan 29 at 11:56

  • @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in the Perl solution, but definitely the log policy needs revision.

    – JJoao, Jan 30 at 19:15
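
As a quick sanity check on the volume mentioned in these comments, in the same bc style used elsewhere on this page:

$ <<< '4*60*24' bc        # GB per day at 4 GB per minute
5760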




















If you want a really fast solution, I suggest the flex tool (flex generates C). The following scanner can process lines like the ones presented, accepting the fields in any order. Create a file named f.fl with the following content:

%option main
%%
  char e[100], s[100], d[100];

eventtime=[^ \n]*  { strcpy(e,yytext+10); }
srcip=[^ \n]*      { strcpy(s,yytext+6); }
dstip=[^ \n]*      { strcpy(d,yytext+6); }
\n                 { if (e[0] && s[0] && d[0]) printf("%s|%s|%s\n",e,s,d);
                     e[0]=s[0]=d[0]=0; }
.                  {}
%%


To test try:



$ flex -f -o f.c f.fl 
$ cc -O2 -o f f.c
$ ./f < input > output




Here is the time comparison:



$ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
real 0m35.689s
user 0m34.705s
sys 0m0.908s
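
Since a new file appears every minute, the compiled filter can be wired into a small watch loop. A sketch, assuming inotify-tools is installed and that the log files land in a directory such as /var/log/fw/ (both assumptions, not part of the original answer):

inotifywait -m -e close_write --format '%w%f' /var/log/fw/ |
while read -r f; do
    ./f < "$f" > "$f.filtered"               # run the flex-generated filter on each finished file
done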





    I duplicated your two 'input' lines to a file of size 3867148288 bytes (3.7 GiB), and I could process it with grep in 8 minutes and 24 seconds (reading from and writing to an HDD; it should be faster using an SSD or a ramdisk).

    In order to minimize the time used, I used only standard features of grep and did not post-process the output, so the output format is not what you specify, but it might be useful anyway. You can test this command:

    time grep -oE -e 'eventtime=[0-9]* ' \
         -e 'srcip=[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]' \
         -e 'dstip=[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]' \
         infile > outfile


    Output from your two lines:



    $ cat outfile
    eventtime=1548531298
    srcip=X.X.X.Y
    dstip=X.X.X.X
    eventtime=1548531299
    srcip=X.X.X.Z
    dstip=X.X.X.Y
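
    If the three fields happen to appear in the order eventtime, srcip, dstip on every line (they do in the two sample lines, although the question says the positions can vary), the grep output can be folded back into the requested pipe format with a small post-processing step; a sketch, using the simplified field patterns from Edit 2 further down:

    grep -oE -e 'eventtime=[^ ]*' -e 'srcip=[^ ]*' -e 'dstip=[^ ]*' infile \
        | sed 's/^[a-z]*=//' \
        | paste -d'|' - - - > outfile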


    The output file contains 25165824 lines, corresponding to 8388608 (8.4 million) lines in the input file.



    $ wc -l outfile
    25165824 outfile
    $ <<< '25165824/3' bc
    8388608


    My test indicates that grep can process approximately 1 million lines per minute.



    Unless your computer is much faster than mine, this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file; but it would be best to completely avoid output of what is not necessary (and avoid the filtering altogether).





    The input file is made by duplication, and maybe the system 'remembers' that it has seen the same lines before and makes things faster, so I don't know how fast it will work with a real big file with all the unpredicted variations. You have to test it.





    Edit 1: I ran the same task on a Dell M4800 with an Intel 4th-generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, at almost double the speed: 1.82 million lines per minute.



    $ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
    1.82


    Still too slow.





    Edit 2: I simplified the grep patterns and ran the test again on the Dell.

    time grep -oE -e 'eventtime=[^ ]*' \
         -e 'srcip=[^ ]*' \
         -e 'dstip=[^ ]*' \
         infile > out

    It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute.



    $ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
    2.00


    Edit 3: @JJoao's grep -P suggestion speeds up grep to 39 seconds, corresponding to 12.90 million lines per minute on the computer where ordinary grep reads 1 million lines per minute (reading from and writing to an HDD).

    $ time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile >out-grep-JJoao

    real 0m38,699s
    user 0m31,466s
    sys 0m2,843s

    This Perl-regexp option of grep is experimental according to info grep, but it works in my Lubuntu 18.04.1 LTS.




    ‘-P’ ‘--perl-regexp’
    Interpret the pattern as a Perl-compatible regular expression
    (PCRE). This is experimental, particularly when combined with the
    ‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
    unimplemented features. *Note Other Options::.




    I also compiled a C program according to @JJoao's flex method, and it finished in 53 seconds, corresponding to 9.49 million lines per minute on the computer where ordinary grep reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep with the Perl extension is fastest.



    $ time ./filt.jjouo < infile > out-flex-JJoao

    real 0m53,440s
    user 0m48,789s
    sys 0m3,104s


    Edit 3.1: In the Dell M4800 with an SSD I had the following results,



    time ./filt.jjouo < infile > out-flex-JJoao

    real 0m25,611s
    user 0m24,794s
    sys 0m0,592s

    time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile >out-grep-JJoao

    real 0m18,375s
    user 0m17,654s
    sys 0m0,500s


    This corresponds to




    • 19.66 million lines per minute for the flex application

    • 27.35 million lines per minute for grep with the perl extension


    Edit 3.2: In the Dell M4800 with an SSD I had the following results, when I used the option -f to the flex preprocessor,



    flex -f -o filt.c filt.flex
    cc -O2 -o filt.jjoao filt.c


    The result was improved, and now the flex application shows the highest speed



    flex -f ...

    $ time ./filt.jjoao < infile > out-flex-JJoao

    real 0m15,952s
    user 0m15,318s
    sys 0m0,628s


    This corresponds to




    • 31.55 million lines per minute for the flex application.







    • thank you very much for such effort, so is it possible to split the file into many chunks and run the filter simultaneously against those chunks?

      – Ubai salih, Jan 28 at 20:00

    • @Ubaisalih, Yes, it might be possible and worthwhile to split the file into many chunks, if you can use several processors in parallel for the task (a sketch with GNU parallel follows after these comments). But you should really try to avoid writing such a huge file with columns of data that you will not use. Instead, you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now. Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).

      – sudodus, Jan 28 at 20:19

    • @sudodus, (+1); for curiosity: could you please time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile

      – JJoao, Jan 30 at 12:33

    • @JJoao, see Edit 3.

      – sudodus, Jan 30 at 17:14

    • @Ubaisalih, With the flex application you might get something that is fast enough from the huge syslog file, but I still suggest that you find a method to write a file with only the data that you need from the beginning.

      – sudodus, Jan 31 at 7:26
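
    Regarding the chunk-splitting idea discussed in these comments: GNU parallel can do the splitting itself with --pipepart, carving the file into blocks and feeding each block to its own copy of the filter. A sketch, assuming GNU parallel is installed and using the flex binary ./f from the other answer (filter.pl would work the same way); since every line is processed independently, the per-block output order does not matter:

    parallel --pipepart -a infile --block 500M -j 4 ./f > outfile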



















    Here is one possible solution, based on this answer, provided by @PerlDuck a while ago:





    #!/bin/bash
    while IFS= read -r LINE
    do
        if [[ ! -z ${LINE} ]]
        then
            eval $(echo "$LINE" | sed -e 's/\({\|}\)//g' -e 's/ /\n/g' | sed -ne '/=/p')
            echo "$eventtime|$srcip|$dstip"
        fi
    done < "$1"


    I do not know how it will behave on such a large file; IMO an awk solution would be much faster (a hedged awk sketch follows after the timing below). Here is how it works with the provided input file example:



    $ ./script.sh in-file
    1548531299|X.X.X.X|X.X.X.X
    1548531299|X.X.X.X|X.X.X.X




    Here is the result of a productivity time test, performed on a regular i7, equipped with SSD and 16GB RAM:



    $ time ./script.sh 160000-lines-in-file > out-file

    real 4m49.620s
    user 6m15.875s
    sys 1m50.254s
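
    As for the awk idea mentioned above, here is a minimal sketch of what such a filter could look like (mawk, a fast awk implementation, is assumed to be installed; plain awk/gawk works with the same program). It walks the whitespace-separated key=value fields of each line, so the field order does not matter:

    mawk '{
        e = s = d = ""
        for (i = 1; i <= NF; i++) {                 # scan every key=value token
            if      ($i ~ /^eventtime=/) e = substr($i, 11)
            else if ($i ~ /^srcip=/)     s = substr($i, 7)
            else if ($i ~ /^dstip=/)     d = substr($i, 7)
        }
        if (e != "" && s != "" && d != "") print e "|" s "|" d
    }' in-file > out-file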





    • thank you very much for the reply. I've tried this, but it still takes too much time to filter one file, so I'm not sure if there is something faster, probably with awk as you mentioned

      – Ubai salih, Jan 28 at 9:16

    • Hi, @Ubaisalih, I think awk will be faster according to this time comparison. Anyway, is a new 4 GB log created each minute, or are only a few hundred lines appended each minute to an existing file?

      – pa4080, Jan 28 at 9:31

    • Huh, I just saw you are the same OP on the linked PerlDuck's answer :)

      – pa4080, Jan 28 at 9:51

    • @Ubaisalih, what service writes these logs? I'm asking because, for example, Apache2 can pipe its logs to a script, and then that script can write a regular log and also a processed log, etc. If such a solution could be applied, that would be the best way, I think.

      – pa4080, Jan 28 at 10:41

    • it's a new 4 GB file each minute; the problem now is the speed of filtering the file, so is there a faster method to capture a few columns in a timely manner?

      – Ubai salih, Jan 28 at 11:47











    Your Answer








    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "89"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2faskubuntu.com%2fquestions%2f1113457%2ffastest-method-to-filter-a-huge-file%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    4 Answers
    4






    active

    oldest

    votes








    4 Answers
    4






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    3














    Perl to the rescue



    Save the following script as filter.pl and make it executable (chmod +x):



    #!/usr/bin/env perl

    use strict;
    use warnings;

    while( <> ) {
    if ( /^(?=.*eventtime=(S+))(?=.*srcip=(S+))(?=.*dstip=(S+)).*$/ ) {
    print "$1|$2|$3n";
    }
    }


    Then run



    pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt

    real 0m44,984s
    user 0m43,965s
    sys 0m0,973s


    The regex uses a lookaround pattern, in this case a positive lookahead,
    to match the three values eventtime, srcip, and dstip in any order.



    I duplicated your two input lines until I got a file with 4 GB and
    approximately 9 million lines. I ran the code on an SSD.






    share|improve this answer
























    • +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.

      – sudodus
      Jan 28 at 19:19








    • 1





      Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

      – PerlDuck
      Jan 28 at 19:20













    • You are right about that ;-)

      – sudodus
      Jan 28 at 19:21






    • 2





      @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

      – PerlDuck
      Jan 29 at 11:56








    • 1





      @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.

      – JJoao
      Jan 30 at 19:15
















    3














    Perl to the rescue



    Save the following script as filter.pl and make it executable (chmod +x):



    #!/usr/bin/env perl

    use strict;
    use warnings;

    while( <> ) {
    if ( /^(?=.*eventtime=(S+))(?=.*srcip=(S+))(?=.*dstip=(S+)).*$/ ) {
    print "$1|$2|$3n";
    }
    }


    Then run



    pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt

    real 0m44,984s
    user 0m43,965s
    sys 0m0,973s


    The regex uses a lookaround pattern, in this case a positive lookahead,
    to match the three values eventtime, srcip, and dstip in any order.



    I duplicated your two input lines until I got a file with 4 GB and
    approximately 9 million lines. I ran the code on an SSD.






    share|improve this answer
























    • +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.

      – sudodus
      Jan 28 at 19:19








    • 1





      Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

      – PerlDuck
      Jan 28 at 19:20













    • You are right about that ;-)

      – sudodus
      Jan 28 at 19:21






    • 2





      @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

      – PerlDuck
      Jan 29 at 11:56








    • 1





      @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.

      – JJoao
      Jan 30 at 19:15














    3












    3








    3







    Perl to the rescue



    Save the following script as filter.pl and make it executable (chmod +x):



    #!/usr/bin/env perl

    use strict;
    use warnings;

    while( <> ) {
    if ( /^(?=.*eventtime=(S+))(?=.*srcip=(S+))(?=.*dstip=(S+)).*$/ ) {
    print "$1|$2|$3n";
    }
    }


    Then run



    pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt

    real 0m44,984s
    user 0m43,965s
    sys 0m0,973s


    The regex uses a lookaround pattern, in this case a positive lookahead,
    to match the three values eventtime, srcip, and dstip in any order.



    I duplicated your two input lines until I got a file with 4 GB and
    approximately 9 million lines. I ran the code on an SSD.






    share|improve this answer













    Perl to the rescue



    Save the following script as filter.pl and make it executable (chmod +x):



    #!/usr/bin/env perl

    use strict;
    use warnings;

    while( <> ) {
    if ( /^(?=.*eventtime=(S+))(?=.*srcip=(S+))(?=.*dstip=(S+)).*$/ ) {
    print "$1|$2|$3n";
    }
    }


    Then run



    pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt

    real 0m44,984s
    user 0m43,965s
    sys 0m0,973s


    The regex uses a lookaround pattern, in this case a positive lookahead,
    to match the three values eventtime, srcip, and dstip in any order.



    I duplicated your two input lines until I got a file with 4 GB and
    approximately 9 million lines. I ran the code on an SSD.







    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Jan 28 at 19:08









    PerlDuckPerlDuck

    6,73711535




    6,73711535













    • +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.

      – sudodus
      Jan 28 at 19:19








    • 1





      Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

      – PerlDuck
      Jan 28 at 19:20













    • You are right about that ;-)

      – sudodus
      Jan 28 at 19:21






    • 2





      @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

      – PerlDuck
      Jan 29 at 11:56








    • 1





      @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.

      – JJoao
      Jan 30 at 19:15



















    • +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.

      – sudodus
      Jan 28 at 19:19








    • 1





      Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

      – PerlDuck
      Jan 28 at 19:20













    • You are right about that ;-)

      – sudodus
      Jan 28 at 19:21






    • 2





      @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

      – PerlDuck
      Jan 29 at 11:56








    • 1





      @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.

      – JJoao
      Jan 30 at 19:15

















    +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.

    – sudodus
    Jan 28 at 19:19







    +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.

    – sudodus
    Jan 28 at 19:19






    1




    1





    Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

    – PerlDuck
    Jan 28 at 19:20







    Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

    – PerlDuck
    Jan 28 at 19:20















    You are right about that ;-)

    – sudodus
    Jan 28 at 19:21





    You are right about that ;-)

    – sudodus
    Jan 28 at 19:21




    2




    2





    @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

    – PerlDuck
    Jan 29 at 11:56







    @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

    – PerlDuck
    Jan 29 at 11:56






    1




    1





    @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.

    – JJoao
    Jan 30 at 19:15





    @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.

    – JJoao
    Jan 30 at 19:15













    2














    If you want a really fast solution I suggest flex tool. Flex generates C. The following is capable of processing examples like the one presented accepting free order fields. So create a file, named f.fl with the following content:



    %option main
    %%
    char e[100], s[100], d[100];

    eventtime=[^ n]* { strcpy(e,yytext+10); }
    srcip=[^ n]* { strcpy(s,yytext+6); }
    dstip=[^ n]* { strcpy(d,yytext+6); }
    n { if (e[0] && s[0] && d[0] )printf("%s|%s|%sn",e,s,d);
    e[0]=s[0]=d[0]=0 ;}
    . {}
    %%


    To test try:



    $ flex -f -o f.c f.fl 
    $ cc -O2 -o f f.c
    $ ./f < input > output




    Here is the time comparison:



    $ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
    real 0m35.689s
    user 0m34.705s
    sys 0m0.908s





    share|improve this answer






























      2














      If you want a really fast solution I suggest flex tool. Flex generates C. The following is capable of processing examples like the one presented accepting free order fields. So create a file, named f.fl with the following content:



      %option main
      %%
      char e[100], s[100], d[100];

      eventtime=[^ n]* { strcpy(e,yytext+10); }
      srcip=[^ n]* { strcpy(s,yytext+6); }
      dstip=[^ n]* { strcpy(d,yytext+6); }
      n { if (e[0] && s[0] && d[0] )printf("%s|%s|%sn",e,s,d);
      e[0]=s[0]=d[0]=0 ;}
      . {}
      %%


      To test try:



      $ flex -f -o f.c f.fl 
      $ cc -O2 -o f f.c
      $ ./f < input > output




      Here is the time comparison:



      $ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
      real 0m35.689s
      user 0m34.705s
      sys 0m0.908s





      share|improve this answer




























        2












        2








        2







        If you want a really fast solution I suggest flex tool. Flex generates C. The following is capable of processing examples like the one presented accepting free order fields. So create a file, named f.fl with the following content:



        %option main
        %%
        char e[100], s[100], d[100];

        eventtime=[^ n]* { strcpy(e,yytext+10); }
        srcip=[^ n]* { strcpy(s,yytext+6); }
        dstip=[^ n]* { strcpy(d,yytext+6); }
        n { if (e[0] && s[0] && d[0] )printf("%s|%s|%sn",e,s,d);
        e[0]=s[0]=d[0]=0 ;}
        . {}
        %%


        To test try:



        $ flex -f -o f.c f.fl 
        $ cc -O2 -o f f.c
        $ ./f < input > output




        Here is the time comparison:



        $ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
        real 0m35.689s
        user 0m34.705s
        sys 0m0.908s





        share|improve this answer















        If you want a really fast solution I suggest flex tool. Flex generates C. The following is capable of processing examples like the one presented accepting free order fields. So create a file, named f.fl with the following content:



        %option main
        %%
        char e[100], s[100], d[100];

        eventtime=[^ n]* { strcpy(e,yytext+10); }
        srcip=[^ n]* { strcpy(s,yytext+6); }
        dstip=[^ n]* { strcpy(d,yytext+6); }
        n { if (e[0] && s[0] && d[0] )printf("%s|%s|%sn",e,s,d);
        e[0]=s[0]=d[0]=0 ;}
        . {}
        %%


        To test try:



        $ flex -f -o f.c f.fl 
        $ cc -O2 -o f f.c
        $ ./f < input > output




        Here is the time comparison:



        $ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
        real 0m35.689s
        user 0m34.705s
        sys 0m0.908s






        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Jan 30 at 22:11

























        answered Jan 30 at 12:22









        JJoaoJJoao

        1,40069




        1,40069























            2














            I duplicated your two 'input' lines to a file size 3867148288 bytes (3.7 GiB) and I could process it with grep in 8 minutes and 24 seconds (reading from and writing to a HDD. It should be faster using an SSD or ramdrive).



            In order to minimize the time used, I used only standard features of grep, and did not post-process it, so the output format is not what you specify, but might be useful anyway. You can test this command



            time grep -oE -e 'eventtime=[0-9]* ' 
            -e 'srcip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
            -e 'dstip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
            infile > outfile


            Output from your two lines:



            $ cat outfile
            eventtime=1548531298
            srcip=X.X.X.Y
            dstip=X.X.X.X
            eventtime=1548531299
            srcip=X.X.X.Z
            dstip=X.X.X.Y


            The output file contains 25165824 lines corresponding to 8388608 (8.3 million) lines in the input file.



            $ wc -l outfile
            25165824 outfile
            $ <<< '25165824/3' bc
            8388608


            My test indicates that grep can process approximately 1 million lines per minute.



            Unless your computer is much faster than mine. this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file, but it would be best to completely avoid output of what is not necessary (and avoid filtering).





            The input file is made by duplication, and maybe the system 'remembers' that it has seen the same lines before and makes things faster, so I don't know how fast it will work with a real big file with all the unpredicted variations. You have to test it.





            Edit1: I ran the same task in a Dell M4800 with an Intel 4th generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, at almost double speed, 1.82 million lines per minute.



            $ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
            1.82


            Still too slow.





            Edit2: I simplified the grep patterns and ran it again in the Dell.



            time grep -oE -e 'eventtime=[^ ]*' 
            -e 'srcip=[^ ]*'
            -e 'dstip=[^ ]*'
            infile > out


            It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute



            $ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
            2.00


            Edit 3: @JJoao's, perl extension speeds up grep to 39 seconds corresponding to 12.90 million lines per minute in the computer, where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD).



            $ time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao

            real 0m38,699s
            user 0m31,466s
            sys 0m2,843s


            This perl extension is experiental according to info grep but works in my Lubuntu 18.04.1 LTS.




            ‘-P’ ‘--perl-regexp’
            Interpret the pattern as a Perl-compatible regular expression
            (PCRE). This is experimental, particularly when combined with the
            ‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
            unimplemented features. *Note Other Options::.




            I also compiled a C program according to @JJoao's flex method, and it finshed in 53 seconds corresponding to 9.49 million lines per minute in the computer, where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep with the perl extension is fastest.



            $ time ./filt.jjouo < infile > out-flex-JJoao

            real 0m53,440s
            user 0m48,789s
            sys 0m3,104s


            Edit 3.1: In the Dell M4800 with an SSD I had the following results,



            time ./filt.jjouo < infile > out-flex-JJoao

            real 0m25,611s
            user 0m24,794s
            sys 0m0,592s

            time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao

            real 0m18,375s
            user 0m17,654s
            sys 0m0,500s


            This corresponds to




            • 19.66 million lines per minute for the flex application

            • 27.35 million lines per minute for grep with the perl extension


            Edit 3.2: In the Dell M4800 with an SSD I had the following results, when I used the option -f to the flex preprocessor,



            flex -f -o filt.c filt.flex
            cc -O2 -o filt.jjoao filt.c


            The result was improved, and now the flex application shows the highest speed



            flex -f ...

            $ time ./filt.jjoao < infile > out-flex-JJoao

            real 0m15,952s
            user 0m15,318s
            sys 0m0,628s


            This corresponds to




            • 31.55 million lines per minute for the flex application.






            share|improve this answer


























            • thank you very much for such effort, so is it possible to split the file to many chunks and run the filter simultaneously against these chunks ?

              – Ubai salih
              Jan 28 at 20:00











            • @Ubaisalih, Yes, it might be possible and worthwhile to split the file in many chunks, if you can use use several processors in parallel for the task. But you should really try to avoid writing such a huge file with columns of data, that you will not use. Instead you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now, Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).

              – sudodus
              Jan 28 at 20:19











            • @sudodus, (+1); for curiosity: could you please time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile

              – JJoao
              Jan 30 at 12:33













            • @JJoao, see Edit 3.

              – sudodus
              Jan 30 at 17:14






            • 1





              @Ubaisalih, With the flex application you might get something that is fast enough from the huge sys log file, but I still suggest that you should find a method to write a file with only the data that you need from the beginning.

              – sudodus
              Jan 31 at 7:26
















            2














            I duplicated your two 'input' lines to a file size 3867148288 bytes (3.7 GiB) and I could process it with grep in 8 minutes and 24 seconds (reading from and writing to a HDD. It should be faster using an SSD or ramdrive).



            In order to minimize the time used, I used only standard features of grep, and did not post-process it, so the output format is not what you specify, but might be useful anyway. You can test this command



            time grep -oE -e 'eventtime=[0-9]* ' 
            -e 'srcip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
            -e 'dstip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
            infile > outfile


            Output from your two lines:



            $ cat outfile
            eventtime=1548531298
            srcip=X.X.X.Y
            dstip=X.X.X.X
            eventtime=1548531299
            srcip=X.X.X.Z
            dstip=X.X.X.Y


            The output file contains 25165824 lines corresponding to 8388608 (8.3 million) lines in the input file.



            $ wc -l outfile
            25165824 outfile
            $ <<< '25165824/3' bc
            8388608


            My test indicates that grep can process approximately 1 million lines per minute.



            Unless your computer is much faster than mine. this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file, but it would be best to completely avoid output of what is not necessary (and avoid filtering).





            The input file is made by duplication, and maybe the system 'remembers' that it has seen the same lines before and makes things faster, so I don't know how fast it will work with a real big file with all the unpredicted variations. You have to test it.





            Edit1: I ran the same task in a Dell M4800 with an Intel 4th generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, at almost double speed, 1.82 million lines per minute.



            $ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
            1.82


            Still too slow.





            Edit2: I simplified the grep patterns and ran it again in the Dell.



            time grep -oE -e 'eventtime=[^ ]*' 
            -e 'srcip=[^ ]*'
            -e 'dstip=[^ ]*'
            infile > out


            It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute



            $ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
            2.00


            Edit 3: @JJoao's, perl extension speeds up grep to 39 seconds corresponding to 12.90 million lines per minute in the computer, where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD).



            $ time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao

            real 0m38,699s
            user 0m31,466s
            sys 0m2,843s


            This perl extension is experiental according to info grep but works in my Lubuntu 18.04.1 LTS.




            ‘-P’ ‘--perl-regexp’
            Interpret the pattern as a Perl-compatible regular expression
            (PCRE). This is experimental, particularly when combined with the
            ‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
            unimplemented features. *Note Other Options::.




            I also compiled a C program according to @JJoao's flex method, and it finshed in 53 seconds corresponding to 9.49 million lines per minute in the computer, where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep with the perl extension is fastest.



            $ time ./filt.jjouo < infile > out-flex-JJoao

            real 0m53,440s
            user 0m48,789s
            sys 0m3,104s


            Edit 3.1: In the Dell M4800 with an SSD I had the following results,



            time ./filt.jjouo < infile > out-flex-JJoao

            real 0m25,611s
            user 0m24,794s
            sys 0m0,592s

            time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao

            real 0m18,375s
            user 0m17,654s
            sys 0m0,500s


            This corresponds to




            • 19.66 million lines per minute for the flex application

            • 27.35 million lines per minute for grep with the perl extension


            Edit 3.2: In the Dell M4800 with an SSD I had the following results, when I used the option -f to the flex preprocessor,



            flex -f -o filt.c filt.flex
            cc -O2 -o filt.jjoao filt.c


            The result was improved, and now the flex application shows the highest speed



            flex -f ...

            $ time ./filt.jjoao < infile > out-flex-JJoao

            real 0m15,952s
            user 0m15,318s
            sys 0m0,628s


            This corresponds to




            • 31.55 million lines per minute for the flex application.






            share|improve this answer


























            • thank you very much for such effort, so is it possible to split the file to many chunks and run the filter simultaneously against these chunks ?

              – Ubai salih
              Jan 28 at 20:00











            • @Ubaisalih, Yes, it might be possible and worthwhile to split the file in many chunks, if you can use use several processors in parallel for the task. But you should really try to avoid writing such a huge file with columns of data, that you will not use. Instead you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now, Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).

              – sudodus
              Jan 28 at 20:19











            • @sudodus, (+1); for curiosity: could you please time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile

              – JJoao
              Jan 30 at 12:33













            • @JJoao, see Edit 3.

              – sudodus
              Jan 30 at 17:14






            • 1





              @Ubaisalih, With the flex application you might get something that is fast enough from the huge sys log file, but I still suggest that you should find a method to write a file with only the data that you need from the beginning.

              – sudodus
              Jan 31 at 7:26














            2












            2








            2







            I duplicated your two 'input' lines to a file size 3867148288 bytes (3.7 GiB) and I could process it with grep in 8 minutes and 24 seconds (reading from and writing to a HDD. It should be faster using an SSD or ramdrive).



            In order to minimize the time used, I used only standard features of grep, and did not post-process it, so the output format is not what you specify, but might be useful anyway. You can test this command



            time grep -oE -e 'eventtime=[0-9]* ' 
            -e 'srcip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
            -e 'dstip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
            infile > outfile


            Output from your two lines:



            $ cat outfile
            eventtime=1548531298
            srcip=X.X.X.Y
            dstip=X.X.X.X
            eventtime=1548531299
            srcip=X.X.X.Z
            dstip=X.X.X.Y


            The output file contains 25165824 lines corresponding to 8388608 (8.3 million) lines in the input file.



            $ wc -l outfile
            25165824 outfile
            $ <<< '25165824/3' bc
            8388608


            My test indicates that grep can process approximately 1 million lines per minute.



            Unless your computer is much faster than mine. this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file, but it would be best to completely avoid output of what is not necessary (and avoid filtering).





            The input file is made by duplication, and maybe the system 'remembers' that it has seen the same lines before and makes things faster, so I don't know how fast it will work with a real big file with all the unpredicted variations. You have to test it.





            Edit1: I ran the same task in a Dell M4800 with an Intel 4th generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, at almost double speed, 1.82 million lines per minute.



            $ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
            1.82


            Still too slow.





            Edit2: I simplified the grep patterns and ran it again in the Dell.



            time grep -oE -e 'eventtime=[^ ]*' 
            -e 'srcip=[^ ]*'
            -e 'dstip=[^ ]*'
            infile > out


            It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute



            $ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
            2.00


            Edit 3: @JJoao's, perl extension speeds up grep to 39 seconds corresponding to 12.90 million lines per minute in the computer, where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD).



            $ time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao

            real 0m38,699s
            user 0m31,466s
            sys 0m2,843s


            This perl extension is experiental according to info grep but works in my Lubuntu 18.04.1 LTS.




            ‘-P’ ‘--perl-regexp’
            Interpret the pattern as a Perl-compatible regular expression
            (PCRE). This is experimental, particularly when combined with the
            ‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
            unimplemented features. *Note Other Options::.




            I also compiled a C program according to @JJoao's flex method, and it finshed in 53 seconds corresponding to 9.49 million lines per minute in the computer, where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep with the perl extension is fastest.



            $ time ./filt.jjouo < infile > out-flex-JJoao

            real 0m53,440s
            user 0m48,789s
            sys 0m3,104s


            Edit 3.1: In the Dell M4800 with an SSD I had the following results,



            time ./filt.jjouo < infile > out-flex-JJoao

            real 0m25,611s
            user 0m24,794s
            sys 0m0,592s

            time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao

            real 0m18,375s
            user 0m17,654s
            sys 0m0,500s


            This corresponds to




            • 19.66 million lines per minute for the flex application

            • 27.35 million lines per minute for grep with the perl extension


            Edit 3.2: In the Dell M4800 with an SSD I had the following results, when I used the option -f to the flex preprocessor,



            flex -f -o filt.c filt.flex
            cc -O2 -o filt.jjoao filt.c


            The result was improved, and now the flex application shows the highest speed



            flex -f ...

            $ time ./filt.jjoao < infile > out-flex-JJoao

            real 0m15,952s
            user 0m15,318s
            sys 0m0,628s


            This corresponds to




            • 31.55 million lines per minute for the flex application.






            share|improve this answer















answered Jan 28 at 15:23, edited Jan 31 at 7:21 – sudodus

• Thank you very much for such effort; is it possible to split the file into many chunks and run the filter simultaneously against these chunks?

  – Ubai salih
  Jan 28 at 20:00

• @Ubaisalih, Yes, it might be possible and worthwhile to split the file into many chunks, if you can use several processors in parallel for the task (see the sketch after these comments). But you should really try to avoid writing such a huge file with columns of data that you will not use. Instead you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now. Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).

  – sudodus
  Jan 28 at 20:19

• @sudodus, (+1); for curiosity: could you please time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile

  – JJoao
  Jan 30 at 12:33

• @JJoao, see Edit 3.

  – sudodus
  Jan 30 at 17:14

• @Ubaisalih, With the flex application you might get something that is fast enough from the huge syslog file, but I still suggest that you should find a method to write a file with only the data that you need from the beginning.

  – sudodus
  Jan 31 at 7:26
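
As a sketch of the chunk-splitting idea from the comment above, assuming GNU parallel is installed and reusing the grep -oP pattern from Edit 3:

# Split infile into ~100 MB chunks on line boundaries and run one grep
# per chunk, four jobs at a time (matching the 4 cores in the question).
parallel --pipepart -a infile --block 100M -j4 \
    "grep -oP '\b(eventtime|srcip|dstip)=\K\S+'" > outfile

The chunks are processed independently, so the output is grouped per chunk rather than in strict input order, and the disk still has to feed all four readers at once.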



















            Here is one possible solution, based on this answer, provided by @PerlDuck a while ago:





#!/bin/bash
# Read the log line by line; for each non-empty line, strip the braces,
# put every key=value pair on its own line, eval the pairs as shell
# variables, and print the three wanted fields.
while IFS= read -r LINE
do
    if [[ ! -z ${LINE} ]]
    then
        eval $(echo "$LINE" | sed -e 's/\({\|}\)//g' -e 's/ /\n/g' | sed -ne '/=/p')
        echo "$eventtime|$srcip|$dstip"
    fi
done < "$1"


I do not know how it will behave on such a large file; IMO an awk solution will be much faster (a sketch follows the timing below). Here is how it works with the provided input file example:



            $ ./script.sh in-file
            1548531299|X.X.X.X|X.X.X.X
            1548531299|X.X.X.X|X.X.X.X




Here is the result of a timing test, performed on a regular i7 with an SSD and 16 GB RAM:



            $ time ./script.sh 160000-lines-in-file > out-file

            real 4m49.620s
            user 6m15.875s
            sys 1m50.254s
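
For comparison, here is a minimal awk sketch of the faster approach hinted at above. It assumes the key=value pairs are space-separated as in the sample lines, and it prints the requested eventtime|srcip|dstip format directly:

awk '{
  et = s = d = ""
  for (i = 1; i <= NF; i++) {           # fields may appear in any order
    if      ($i ~ /^eventtime=/) et = substr($i, 11)
    else if ($i ~ /^srcip=/)     s  = substr($i, 7)
    else if ($i ~ /^dstip=/)     d  = substr($i, 7)
  }
  if (et != "") print et "|" s "|" d    # skip lines without an eventtime
}' infile > outfile

Whether this beats the grep variants depends on the awk implementation (mawk is usually much faster than gawk for this kind of field scanning), so it would have to be timed on the real data.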





answered Jan 28 at 9:04, edited Jan 28 at 9:41 – pa4080

• Thank you very much for the reply. I've tried this, but it still consumes too much time to filter one file, so I'm not sure if there is something faster, probably with awk as you mentioned.

  – Ubai salih
  Jan 28 at 9:16

• Hi, @Ubaisalih, I think awk will be faster according to this time comparison. Anyway, is a new 4GB log created each minute, or are only a few hundred lines appended each minute to an existing file?

  – pa4080
  Jan 28 at 9:31

• Huh, I just saw you are the same OP on the linked PerlDuck's answer :)

  – pa4080
  Jan 28 at 9:51

• @Ubaisalih, what service writes these logs? I'm asking because, for example, Apache2 can pipe its logs to a script, and that script can write a regular log and also a processed log, etc. If such a solution could be applied, this would be the best way, I think (a rough sketch follows these comments).

  – pa4080
  Jan 28 at 10:41

• It's a new file of 4GB each minute; the problem now is the speed of filtering the file, so is there a faster method to capture a few columns in a timely manner?

  – Ubai salih
  Jan 28 at 11:47
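
As a rough sketch of the "process the data while it is being written" idea from the comments (the exact hook depends on what actually produces the log; the FIFO path below is made up for illustration):

# Let the producer write to a named pipe instead of the 4 GB file, and run
# the filter on the stream, so only the reduced data ever reaches the disk.
mkfifo /tmp/rawlog.fifo
grep -oP '\b(eventtime|srcip|dstip)=\K\S+' < /tmp/rawlog.fifo > filtered.log &

If the logging daemon cannot write to a pipe directly, the same filter can be attached wherever the stream is available, for example as an output script of the log forwarder.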















