Fastest method to filter a huge file



























I have a syslog file of 4 GB in size every minute, and below are 2 of the roughly 4.5 million lines written each minute.



I want to generate a new file with only a few columns, eventtime|srcip|dstip, so the result will be as follows:



1548531299|X.X.X.X|X.X.X.X


Please note that the position of the fields is random.

I've tried some filters, but they still take more than 40 minutes to handle one file on a powerful VM with 4 cores and 16 GB RAM.



So is there a fast method to handle such a big file and extract the required columns?



Jan 26 22:35:00 172.20.23.148 date=2019-01-26 time=22:34:59 devname="ERB-03" devid="5KDTB18800169" logid="0000000011" type="traffic" subtype="forward" level="warning" vd="Users" eventtime=1548531299 srcip=X.X.X.X srcport=3XXXX srcintf="GGI-cer.405" srcintfrole="undefined" dstip=X.X.X.X dstport=XX dstintf="hh-BB.100" dstintfrole="undefined" sessionid=xxxxxxx proto=6 action="ip" policyid=5 policytype="policy" service="HTTP" appcat="unscanned" crscore=5 craction=xxxxxx crlevel="low"

Jan 26 22:35:00 172.20.23.148 date=2019-01-26 time=22:34:59 devname="XXX-XX-FGT-03" devid="XX-XXXXXXXX" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="Users" eventtime=1548531299 srcip=X.X.X.X srcport=XXXXX srcintf="XXX-Core.123" srcintfrole="undefined" dstip=X.X.X.X dstport=XX dstintf="sXX-CC.100" dstintfrole="undefined" sessionid=1234567 cvdpkt=0 appcat="unscanned" crscore=5 craction=123456 crlevel="low"





























  • I assume that grep + regexp will be the fastest.

    – cmak.fr
    Jan 28 at 8:39






  • If syslog-ng is generating 4G/min of firewall data into syslog, either you've got a real problem, or you need to pump that firewall data into a separate file. Please explain why you're doing this... inquiring minds want to know :-)

    – heynnema
    Jan 28 at 16:03








  • Postnote... if your "powerful VM" is VirtualBox, and you've got it set to use all of your cores and all of your RAM... you're killing your host.

    – heynnema
    Jan 28 at 16:10











  • Do you need to log the raw data? If not, have you considered using rsyslog's regex features like re_extract to filter the messages as they are coming in?

    – Doug O'Neal
    Jan 30 at 17:19
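For what it's worth, here is a rough sketch of the idea in the last comment, i.e. extracting the three fields at ingest time instead of post-processing a 4 GB file. It is an untested, assumption-laden illustration and not something from the question: the conditional, the property names, the output path and the regexes are all made up, and the exact re_extract() arguments and regex dialect should be checked against the rsyslog documentation.

# /etc/rsyslog.d/30-fw-slim.conf -- hypothetical sketch, adjust to the real setup
template(name="pipeFmt" type="string" string="%$!etime%|%$!sip%|%$!dip%\n")

if ($msg contains "eventtime=") then {
    # re_extract(expression, regex, match, submatch, value-if-not-found)
    set $!etime = re_extract($msg, "eventtime=([0-9]+)", 0, 1, "");
    set $!sip   = re_extract($msg, "srcip=([^ ]+)", 0, 1, "");
    set $!dip   = re_extract($msg, "dstip=([^ ]+)", 0, 1, "");
    action(type="omfile" file="/var/log/fw-slim.log" template="pipeFmt")
}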
















4 Answers
Perl to the rescue



Save the following script as filter.pl and make it executable (chmod +x):



#!/usr/bin/env perl

use strict;
use warnings;

while ( <> ) {
    if ( /^(?=.*eventtime=(\S+))(?=.*srcip=(\S+))(?=.*dstip=(\S+)).*$/ ) {
        print "$1|$2|$3\n";
    }
}


Then run



pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt

real 0m44,984s
user 0m43,965s
sys 0m0,973s


The regex uses a lookaround pattern, in this case a positive lookahead,
to match the three values eventtime, srcip, and dstip in any order.



I duplicated your two input lines until I got a file with 4 GB and
approximately 9 million lines. I ran the code on an SSD.
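If all 4 cores of the VM should be used, one possible way to spread this filter over them is GNU parallel's --pipepart mode. The following is an untested sketch, not part of the original answer; the block size and job count are illustrative, and the output order is not guaranteed to match the input order:

# hypothetical invocation -- requires GNU parallel to be installed
# --pipepart chops input.txt into chunks on line boundaries, --block 200M sets the
# chunk size, and -j4 runs four filter.pl processes whose outputs are concatenated
parallel --pipepart -a input.txt --block 200M -j4 ./filter.pl > output.txt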






























  • +1. Congratulations :-) Your Perl script is a lot faster than what I could do with grep. On my main computer, which greps 1 million lines per minute, I get 6.89 million lines per minute with your script, and this should be fast enough to solve the problem.

    – sudodus
    Jan 28 at 19:19








  • Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

    – PerlDuck
    Jan 28 at 19:20













  • You are right about that ;-)

    – sudodus
    Jan 28 at 19:21






  • @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

    – PerlDuck
    Jan 29 at 11:56








  • @PerlDuck, 43 vs 36 is about the same. We can optimize it more (flex -f; cc -O2), and we can also shave some seconds off the Perl solution, but the log policy definitely needs revision.

    – JJoao
    Jan 30 at 19:15

































If you want a really fast solution, I suggest the flex tool. Flex generates C. The following is capable of processing input like the example presented, accepting the fields in any order. Create a file named f.fl with the following content:



%option main
%%
  char e[100], s[100], d[100];

eventtime=[^ \n]*  { strcpy(e, yytext + 10); }
srcip=[^ \n]*      { strcpy(s, yytext + 6); }
dstip=[^ \n]*      { strcpy(d, yytext + 6); }
\n                 { if (e[0] && s[0] && d[0]) printf("%s|%s|%s\n", e, s, d);
                     e[0] = s[0] = d[0] = 0; }
.                  {}
%%


To test try:



$ flex -f -o f.c f.fl 
$ cc -O2 -o f f.c
$ ./f < input > output




Here is the time comparison:



$ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
real 0m35.689s
user 0m34.705s
sys 0m0.908s



















    I duplicated your two 'input' lines to a file of size 3867148288 bytes (3.7 GiB), and I could process it with grep in 8 minutes and 24 seconds (reading from and writing to an HDD; it should be faster using an SSD or a ramdisk).



    In order to minimize the time used, I used only standard features of grep and did not post-process the output, so the output format is not what you specify, but it might be useful anyway. You can test this command:



    time grep -oE -e 'eventtime=[0-9]* ' \
    -e 'srcip=[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]' \
    -e 'dstip=[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]' \
    infile > outfile


    Output from your two lines:



    $ cat outfile
    eventtime=1548531298
    srcip=X.X.X.Y
    dstip=X.X.X.X
    eventtime=1548531299
    srcip=X.X.X.Z
    dstip=X.X.X.Y


    The output file contains 25165824 lines, corresponding to 8388608 (8.4 million) lines in the input file.



    $ wc -l outfile
    25165824 outfile
    $ <<< '25165824/3' bc
    8388608


    My test indicates that grep can process approximately 1 million lines per minute.
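    For reference, that figure follows from the numbers above; here is a quick check in the same bc style as the later edits (the expression is mine, not from the original answer):

    $ <<< 'scale=2; 8388608/(8*60+24)*60/10^6' bc
    .99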



    Unless your computer is much faster than mine, this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file; but it would be best to completely avoid output of what is not necessary (and avoid filtering).





    The input file is made by duplication, and maybe the system 'remembers' that it has seen the same lines before, which makes things faster, so I don't know how fast it will work with a real big file with all its unpredicted variations. You have to test it.





    Edit 1: I ran the same task on a Dell M4800 with an Intel 4th-generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, almost double the speed, at 1.82 million lines per minute.



    $ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
    1.82


    Still too slow.





    Edit 2: I simplified the grep patterns and ran the test again on the Dell.



    time grep -oE -e 'eventtime=[^ ]*' \
    -e 'srcip=[^ ]*' \
    -e 'dstip=[^ ]*' \
    infile > out


    It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute.



    $ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
    2.00


    Edit 3: @JJoao's Perl-regexp grep (the -P option) speeds things up to 39 seconds, corresponding to 12.90 million lines per minute on the computer where ordinary grep reads 1 million lines per minute (reading from and writing to an HDD).



    $ time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile >out-grep-JJoao

    real 0m38,699s
    user 0m31,466s
    sys 0m2,843s


    This Perl extension is experimental according to info grep, but it works in my Lubuntu 18.04.1 LTS.




    ‘-P’ ‘--perl-regexp’
    Interpret the pattern as a Perl-compatible regular expression
    (PCRE). This is experimental, particularly when combined with the
    ‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
    unimplemented features. *Note Other Options::.




    I also compiled a C program according to @JJoao's flex method, and it finished in 53 seconds, corresponding to 9.49 million lines per minute on the computer where ordinary grep reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep with the Perl extension is the fastest here.



    $ time ./filt.jjouo < infile > out-flex-JJoao

    real 0m53,440s
    user 0m48,789s
    sys 0m3,104s


    Edit 3.1: On the Dell M4800 with an SSD I had the following results:



    time ./filt.jjouo < infile > out-flex-JJoao

    real 0m25,611s
    user 0m24,794s
    sys 0m0,592s

    time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile >out-grep-JJoao

    real 0m18,375s
    user 0m17,654s
    sys 0m0,500s


    This corresponds to




    • 19.66 million lines per minute for the flex application

    • 27.35 million lines per minute for grep with the perl extension


    Edit 3.2: On the Dell M4800 with an SSD I had the following results when I used the option -f with the flex preprocessor:



    flex -f -o filt.c filt.flex
    cc -O2 -o filt.jjoao filt.c


    The result improved, and now the flex application shows the highest speed:



    flex -f ...

    $ time ./filt.jjoao < infile > out-flex-JJoao

    real 0m15,952s
    user 0m15,318s
    sys 0m0,628s


    This corresponds to




    • 31.55 million lines per minute for the flex application.
































    • Thank you very much for such effort. So is it possible to split the file into many chunks and run the filter simultaneously against these chunks?

      – Ubai salih
      Jan 28 at 20:00











    • @Ubaisalih, yes, it might be possible and worthwhile to split the file into many chunks, if you can use several processors in parallel for the task. But you should really try to avoid writing such a huge file with columns of data that you will not use. Instead you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now. Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).

      – sudodus
      Jan 28 at 20:19











    • @sudodus, (+1); out of curiosity: could you please time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile

      – JJoao
      Jan 30 at 12:33













    • @JJoao, see Edit 3.

      – sudodus
      Jan 30 at 17:14






    • @Ubaisalih, with the flex application you might get something that is fast enough from the huge syslog file, but I still suggest that you find a method to write a file with only the data that you need from the beginning.

      – sudodus
      Jan 31 at 7:26

































    Here is one possible solution, based on this answer, provided by @PerlDuck a while ago:





    #!/bin/bash
    while IFS= read -r LINE
    do
        if [[ ! -z ${LINE} ]]
        then
            eval $(echo "$LINE" | sed -e 's/\({\|}\)//g' -e 's/ /\n/g' | sed -ne '/=/p')
            echo "$eventtime|$srcip|$dstip"
        fi
    done < "$1"


    I do not know how it will behave on such a large file; IMO an awk solution would be much faster (a rough awk sketch follows the timing below). Here is how it works with the provided example input:



    $ ./script.sh in-file
    1548531299|X.X.X.X|X.X.X.X
    1548531299|X.X.X.X|X.X.X.X




    Here is the result of a timing test, performed on a regular i7 equipped with an SSD and 16 GB RAM:



    $ time ./script.sh 160000-lines-in-file > out-file

    real 4m49.620s
    user 6m15.875s
    sys 1m50.254s
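    As a rough illustration of the awk idea mentioned above, here is a minimal sketch (mine, not part of the original answer). It splits each line on spaces, collects the key=value pairs and prints the three wanted fields; it assumes, as in the sample lines, that eventtime, srcip and dstip never contain spaces, and it is untested against the real 4 GB files:

    # hypothetical awk sketch -- not from the original answer
    awk '{
        split("", kv)                  # clear the key=value map for this line
        for (i = 1; i <= NF; i++) {
            p = index($i, "=")
            if (p) kv[substr($i, 1, p - 1)] = substr($i, p + 1)
        }
        if ("eventtime" in kv && "srcip" in kv && "dstip" in kv)
            print kv["eventtime"] "|" kv["srcip"] "|" kv["dstip"]
    }' in-file > out-file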































    • Thank you very much for the reply. I've tried this, but it still consumes too much time to filter one file, so I'm not sure if there is something faster, probably with awk as you mentioned.

      – Ubai salih
      Jan 28 at 9:16













    • Hi @Ubaisalih, I think awk will be faster according to this time comparison. Anyway, is a new 4 GB log created each minute, or are only a few hundred lines appended to an existing file each minute?

      – pa4080
      Jan 28 at 9:31













    • Huh, I just saw you are the same OP as on the linked PerlDuck's answer :)

      – pa4080
      Jan 28 at 9:51






    • @Ubaisalih, what service writes these logs? I'm asking because, for example, Apache2 can pipe its logs to a script, and that script can then write both a regular log and a processed log, etc. If such a solution could be applied, it would be the best way, I think.

      – pa4080
      Jan 28 at 10:41













    • It's a new 4 GB file each minute. The problem now is the speed of filtering the file, so is there a faster method to capture a few columns in a timely manner?

      – Ubai salih
      Jan 28 at 11:47











    Your Answer








    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "89"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2faskubuntu.com%2fquestions%2f1113457%2ffastest-method-to-filter-a-huge-file%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    4 Answers
    4






    active

    oldest

    votes








    4 Answers
    4






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    3














    Perl to the rescue



    Save the following script as filter.pl and make it executable (chmod +x):



    #!/usr/bin/env perl

    use strict;
    use warnings;

    while( <> ) {
    if ( /^(?=.*eventtime=(S+))(?=.*srcip=(S+))(?=.*dstip=(S+)).*$/ ) {
    print "$1|$2|$3n";
    }
    }


    Then run



    pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt

    real 0m44,984s
    user 0m43,965s
    sys 0m0,973s


    The regex uses a lookaround pattern, in this case a positive lookahead,
    to match the three values eventtime, srcip, and dstip in any order.



    I duplicated your two input lines until I got a file with 4 GB and
    approximately 9 million lines. I ran the code on an SSD.






    share|improve this answer
























    • +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.

      – sudodus
      Jan 28 at 19:19








    • 1





      Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

      – PerlDuck
      Jan 28 at 19:20













    • You are right about that ;-)

      – sudodus
      Jan 28 at 19:21






    • 2





      @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

      – PerlDuck
      Jan 29 at 11:56








    • 1





      @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.

      – JJoao
      Jan 30 at 19:15
















    3














    Perl to the rescue



    Save the following script as filter.pl and make it executable (chmod +x):



    #!/usr/bin/env perl

    use strict;
    use warnings;

    while( <> ) {
    if ( /^(?=.*eventtime=(S+))(?=.*srcip=(S+))(?=.*dstip=(S+)).*$/ ) {
    print "$1|$2|$3n";
    }
    }


    Then run



    pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt

    real 0m44,984s
    user 0m43,965s
    sys 0m0,973s


    The regex uses a lookaround pattern, in this case a positive lookahead,
    to match the three values eventtime, srcip, and dstip in any order.



    I duplicated your two input lines until I got a file with 4 GB and
    approximately 9 million lines. I ran the code on an SSD.






    share|improve this answer
























    • +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.

      – sudodus
      Jan 28 at 19:19








    • 1





      Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

      – PerlDuck
      Jan 28 at 19:20













    • You are right about that ;-)

      – sudodus
      Jan 28 at 19:21






    • 2





      @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

      – PerlDuck
      Jan 29 at 11:56








    • 1





      @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.

      – JJoao
      Jan 30 at 19:15














    3












    3








    3







    Perl to the rescue



    Save the following script as filter.pl and make it executable (chmod +x):



    #!/usr/bin/env perl

    use strict;
    use warnings;

    while( <> ) {
    if ( /^(?=.*eventtime=(S+))(?=.*srcip=(S+))(?=.*dstip=(S+)).*$/ ) {
    print "$1|$2|$3n";
    }
    }


    Then run



    pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt

    real 0m44,984s
    user 0m43,965s
    sys 0m0,973s


    The regex uses a lookaround pattern, in this case a positive lookahead,
    to match the three values eventtime, srcip, and dstip in any order.



    I duplicated your two input lines until I got a file with 4 GB and
    approximately 9 million lines. I ran the code on an SSD.






    share|improve this answer













    Perl to the rescue



    Save the following script as filter.pl and make it executable (chmod +x):



    #!/usr/bin/env perl

    use strict;
    use warnings;

    while( <> ) {
    if ( /^(?=.*eventtime=(S+))(?=.*srcip=(S+))(?=.*dstip=(S+)).*$/ ) {
    print "$1|$2|$3n";
    }
    }


    Then run



    pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt

    real 0m44,984s
    user 0m43,965s
    sys 0m0,973s


    The regex uses a lookaround pattern, in this case a positive lookahead,
    to match the three values eventtime, srcip, and dstip in any order.



    I duplicated your two input lines until I got a file with 4 GB and
    approximately 9 million lines. I ran the code on an SSD.







    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Jan 28 at 19:08









    PerlDuckPerlDuck

    6,73711535




    6,73711535













    • +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.

      – sudodus
      Jan 28 at 19:19








    • 1





      Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

      – PerlDuck
      Jan 28 at 19:20













    • You are right about that ;-)

      – sudodus
      Jan 28 at 19:21






    • 2





      @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

      – PerlDuck
      Jan 29 at 11:56








    • 1





      @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.

      – JJoao
      Jan 30 at 19:15



















    • +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.

      – sudodus
      Jan 28 at 19:19








    • 1





      Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

      – PerlDuck
      Jan 28 at 19:20













    • You are right about that ;-)

      – sudodus
      Jan 28 at 19:21






    • 2





      @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

      – PerlDuck
      Jan 29 at 11:56








    • 1





      @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.

      – JJoao
      Jan 30 at 19:15

















    +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.

    – sudodus
    Jan 28 at 19:19







    +1. Congratulations :-) Your perl script is a lot faster than what I could do with grep, In my main computer, that greps 1 million lines per second, I perl 6.89 million lines per second with your script, and this should be fast enough to solve the problem.

    – sudodus
    Jan 28 at 19:19






    1




    1





    Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

    – PerlDuck
    Jan 28 at 19:20







    Thank you, @sudodus. Yes, it may solve the problem ... if it is not an XY problem ;-)

    – PerlDuck
    Jan 28 at 19:20















    You are right about that ;-)

    – sudodus
    Jan 28 at 19:21





    You are right about that ;-)

    – sudodus
    Jan 28 at 19:21




    2




    2





    @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

    – PerlDuck
    Jan 29 at 11:56







    @Ubaisalih Thank you, glad I could help. But still, I don't think it's normal to have such a huge amount of log messages. I mean, 4 GB a minute is 5.5 Terabytes a day. A day!

    – PerlDuck
    Jan 29 at 11:56






    1




    1





    @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.

    – JJoao
    Jan 30 at 19:15





    @PerlDuck, 43 vs 36 is the same. We can optimize it more (flex -f; cc -O2) and we can also reduce some seconds in Perl solution but definitely Log polity needs revision.

    – JJoao
    Jan 30 at 19:15













    2














    If you want a really fast solution I suggest flex tool. Flex generates C. The following is capable of processing examples like the one presented accepting free order fields. So create a file, named f.fl with the following content:



    %option main
    %%
    char e[100], s[100], d[100];

    eventtime=[^ n]* { strcpy(e,yytext+10); }
    srcip=[^ n]* { strcpy(s,yytext+6); }
    dstip=[^ n]* { strcpy(d,yytext+6); }
    n { if (e[0] && s[0] && d[0] )printf("%s|%s|%sn",e,s,d);
    e[0]=s[0]=d[0]=0 ;}
    . {}
    %%


    To test try:



    $ flex -f -o f.c f.fl 
    $ cc -O2 -o f f.c
    $ ./f < input > output




    Here is the time comparison:



    $ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
    real 0m35.689s
    user 0m34.705s
    sys 0m0.908s





    share|improve this answer






























      2














      If you want a really fast solution I suggest flex tool. Flex generates C. The following is capable of processing examples like the one presented accepting free order fields. So create a file, named f.fl with the following content:



      %option main
      %%
      char e[100], s[100], d[100];

      eventtime=[^ n]* { strcpy(e,yytext+10); }
      srcip=[^ n]* { strcpy(s,yytext+6); }
      dstip=[^ n]* { strcpy(d,yytext+6); }
      n { if (e[0] && s[0] && d[0] )printf("%s|%s|%sn",e,s,d);
      e[0]=s[0]=d[0]=0 ;}
      . {}
      %%


      To test try:



      $ flex -f -o f.c f.fl 
      $ cc -O2 -o f f.c
      $ ./f < input > output




      Here is the time comparison:



      $ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
      real 0m35.689s
      user 0m34.705s
      sys 0m0.908s





      share|improve this answer




























        2












        2








        2







        If you want a really fast solution I suggest flex tool. Flex generates C. The following is capable of processing examples like the one presented accepting free order fields. So create a file, named f.fl with the following content:



        %option main
        %%
        char e[100], s[100], d[100];

        eventtime=[^ n]* { strcpy(e,yytext+10); }
        srcip=[^ n]* { strcpy(s,yytext+6); }
        dstip=[^ n]* { strcpy(d,yytext+6); }
        n { if (e[0] && s[0] && d[0] )printf("%s|%s|%sn",e,s,d);
        e[0]=s[0]=d[0]=0 ;}
        . {}
        %%


        To test try:



        $ flex -f -o f.c f.fl 
        $ cc -O2 -o f f.c
        $ ./f < input > output




        Here is the time comparison:



        $ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
        real 0m35.689s
        user 0m34.705s
        sys 0m0.908s





        share|improve this answer















        If you want a really fast solution I suggest flex tool. Flex generates C. The following is capable of processing examples like the one presented accepting free order fields. So create a file, named f.fl with the following content:



        %option main
        %%
        char e[100], s[100], d[100];

        eventtime=[^ n]* { strcpy(e,yytext+10); }
        srcip=[^ n]* { strcpy(s,yytext+6); }
        dstip=[^ n]* { strcpy(d,yytext+6); }
        n { if (e[0] && s[0] && d[0] )printf("%s|%s|%sn",e,s,d);
        e[0]=s[0]=d[0]=0 ;}
        . {}
        %%


        To test try:



        $ flex -f -o f.c f.fl 
        $ cc -O2 -o f f.c
        $ ./f < input > output




        Here is the time comparison:



        $ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
        real 0m35.689s
        user 0m34.705s
        sys 0m0.908s






        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Jan 30 at 22:11

























        answered Jan 30 at 12:22









        JJoaoJJoao

        1,40069




        1,40069























            2














            I duplicated your two 'input' lines to a file size 3867148288 bytes (3.7 GiB) and I could process it with grep in 8 minutes and 24 seconds (reading from and writing to a HDD. It should be faster using an SSD or ramdrive).



            In order to minimize the time used, I used only standard features of grep, and did not post-process it, so the output format is not what you specify, but might be useful anyway. You can test this command



            time grep -oE -e 'eventtime=[0-9]* ' 
            -e 'srcip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
            -e 'dstip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
            infile > outfile


            Output from your two lines:



            $ cat outfile
            eventtime=1548531298
            srcip=X.X.X.Y
            dstip=X.X.X.X
            eventtime=1548531299
            srcip=X.X.X.Z
            dstip=X.X.X.Y


            The output file contains 25165824 lines corresponding to 8388608 (8.3 million) lines in the input file.



            $ wc -l outfile
            25165824 outfile
            $ <<< '25165824/3' bc
            8388608


            My test indicates that grep can process approximately 1 million lines per minute.



            Unless your computer is much faster than mine. this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file, but it would be best to completely avoid output of what is not necessary (and avoid filtering).





            The input file is made by duplication, and maybe the system 'remembers' that it has seen the same lines before and makes things faster, so I don't know how fast it will work with a real big file with all the unpredicted variations. You have to test it.





            Edit1: I ran the same task in a Dell M4800 with an Intel 4th generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, at almost double speed, 1.82 million lines per minute.



            $ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
            1.82


            Still too slow.





            Edit2: I simplified the grep patterns and ran it again in the Dell.



            time grep -oE -e 'eventtime=[^ ]*' 
            -e 'srcip=[^ ]*'
            -e 'dstip=[^ ]*'
            infile > out


            It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute



            $ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
            2.00


            Edit 3: @JJoao's, perl extension speeds up grep to 39 seconds corresponding to 12.90 million lines per minute in the computer, where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD).



            $ time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao

            real 0m38,699s
            user 0m31,466s
            sys 0m2,843s


            This perl extension is experiental according to info grep but works in my Lubuntu 18.04.1 LTS.




            ‘-P’ ‘--perl-regexp’
            Interpret the pattern as a Perl-compatible regular expression
            (PCRE). This is experimental, particularly when combined with the
            ‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
            unimplemented features. *Note Other Options::.




            I also compiled a C program according to @JJoao's flex method, and it finshed in 53 seconds corresponding to 9.49 million lines per minute in the computer, where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep with the perl extension is fastest.



            $ time ./filt.jjouo < infile > out-flex-JJoao

            real 0m53,440s
            user 0m48,789s
            sys 0m3,104s


            Edit 3.1: In the Dell M4800 with an SSD I had the following results,



            time ./filt.jjouo < infile > out-flex-JJoao

            real 0m25,611s
            user 0m24,794s
            sys 0m0,592s

            time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao

            real 0m18,375s
            user 0m17,654s
            sys 0m0,500s


            This corresponds to




            • 19.66 million lines per minute for the flex application

            • 27.35 million lines per minute for grep with the perl extension


            Edit 3.2: In the Dell M4800 with an SSD I had the following results, when I used the option -f to the flex preprocessor,



            flex -f -o filt.c filt.flex
            cc -O2 -o filt.jjoao filt.c


            The result was improved, and now the flex application shows the highest speed



            flex -f ...

            $ time ./filt.jjoao < infile > out-flex-JJoao

            real 0m15,952s
            user 0m15,318s
            sys 0m0,628s


            This corresponds to




            • 31.55 million lines per minute for the flex application.






            share|improve this answer


























            • thank you very much for such effort, so is it possible to split the file to many chunks and run the filter simultaneously against these chunks ?

              – Ubai salih
              Jan 28 at 20:00











            • @Ubaisalih, Yes, it might be possible and worthwhile to split the file in many chunks, if you can use use several processors in parallel for the task. But you should really try to avoid writing such a huge file with columns of data, that you will not use. Instead you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now, Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).

              – sudodus
              Jan 28 at 20:19











            • @sudodus, (+1); for curiosity: could you please time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile

              – JJoao
              Jan 30 at 12:33













            • @JJoao, see Edit 3.

              – sudodus
              Jan 30 at 17:14






            • 1





              @Ubaisalih, With the flex application you might get something that is fast enough from the huge sys log file, but I still suggest that you should find a method to write a file with only the data that you need from the beginning.

              – sudodus
              Jan 31 at 7:26
















            2














            I duplicated your two 'input' lines to a file size 3867148288 bytes (3.7 GiB) and I could process it with grep in 8 minutes and 24 seconds (reading from and writing to a HDD. It should be faster using an SSD or ramdrive).



            In order to minimize the time used, I used only standard features of grep, and did not post-process it, so the output format is not what you specify, but might be useful anyway. You can test this command



            time grep -oE -e 'eventtime=[0-9]* ' 
            -e 'srcip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
            -e 'dstip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
            infile > outfile


            Output from your two lines:



            $ cat outfile
            eventtime=1548531298
            srcip=X.X.X.Y
            dstip=X.X.X.X
            eventtime=1548531299
            srcip=X.X.X.Z
            dstip=X.X.X.Y


            The output file contains 25165824 lines corresponding to 8388608 (8.3 million) lines in the input file.



            $ wc -l outfile
            25165824 outfile
            $ <<< '25165824/3' bc
            8388608


            My test indicates that grep can process approximately 1 million lines per minute.



            Unless your computer is much faster than mine. this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file, but it would be best to completely avoid output of what is not necessary (and avoid filtering).





            The input file is made by duplication, and maybe the system 'remembers' that it has seen the same lines before and makes things faster, so I don't know how fast it will work with a real big file with all the unpredicted variations. You have to test it.





            Edit1: I ran the same task in a Dell M4800 with an Intel 4th generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, at almost double speed, 1.82 million lines per minute.



            $ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
            1.82


            Still too slow.





            Edit2: I simplified the grep patterns and ran it again in the Dell.



            time grep -oE -e 'eventtime=[^ ]*' 
            -e 'srcip=[^ ]*'
            -e 'dstip=[^ ]*'
            infile > out


            It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute



            $ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
            2.00


            Edit 3: @JJoao's, perl extension speeds up grep to 39 seconds corresponding to 12.90 million lines per minute in the computer, where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD).



            $ time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao

            real 0m38,699s
            user 0m31,466s
            sys 0m2,843s


            This perl extension is experiental according to info grep but works in my Lubuntu 18.04.1 LTS.




            ‘-P’ ‘--perl-regexp’
            Interpret the pattern as a Perl-compatible regular expression
            (PCRE). This is experimental, particularly when combined with the
            ‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
            unimplemented features. *Note Other Options::.




            I also compiled a C program according to @JJoao's flex method, and it finshed in 53 seconds corresponding to 9.49 million lines per minute in the computer, where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep with the perl extension is fastest.



            $ time ./filt.jjouo < infile > out-flex-JJoao

            real 0m53,440s
            user 0m48,789s
            sys 0m3,104s


            Edit 3.1: In the Dell M4800 with an SSD I had the following results,



            time ./filt.jjouo < infile > out-flex-JJoao

            real 0m25,611s
            user 0m24,794s
            sys 0m0,592s

            time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao

            real 0m18,375s
            user 0m17,654s
            sys 0m0,500s


            This corresponds to




            • 19.66 million lines per minute for the flex application

            • 27.35 million lines per minute for grep with the perl extension


            Edit 3.2: In the Dell M4800 with an SSD I had the following results, when I used the option -f to the flex preprocessor,



            flex -f -o filt.c filt.flex
            cc -O2 -o filt.jjoao filt.c


            The result was improved, and now the flex application shows the highest speed



            flex -f ...

            $ time ./filt.jjoao < infile > out-flex-JJoao

            real 0m15,952s
            user 0m15,318s
            sys 0m0,628s


            This corresponds to




            • 31.55 million lines per minute for the flex application.






            share|improve this answer


























            • thank you very much for such effort, so is it possible to split the file to many chunks and run the filter simultaneously against these chunks ?

              – Ubai salih
              Jan 28 at 20:00











            • @Ubaisalih, Yes, it might be possible and worthwhile to split the file in many chunks, if you can use use several processors in parallel for the task. But you should really try to avoid writing such a huge file with columns of data, that you will not use. Instead you should write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now, Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).

              – sudodus
              Jan 28 at 20:19











            • @sudodus, (+1); for curiosity: could you please time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile

              – JJoao
              Jan 30 at 12:33













            • @JJoao, see Edit 3.

              – sudodus
              Jan 30 at 17:14






            • 1





              @Ubaisalih, With the flex application you might get something that is fast enough from the huge sys log file, but I still suggest that you should find a method to write a file with only the data that you need from the beginning.

              – sudodus
              Jan 31 at 7:26














            2












            2








            2







            I duplicated your two 'input' lines to a file size 3867148288 bytes (3.7 GiB) and I could process it with grep in 8 minutes and 24 seconds (reading from and writing to a HDD. It should be faster using an SSD or ramdrive).



            In order to minimize the time used, I used only standard features of grep, and did not post-process it, so the output format is not what you specify, but might be useful anyway. You can test this command



            time grep -oE -e 'eventtime=[0-9]* ' 
            -e 'srcip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
            -e 'dstip=[[:alnum:]].[[:alnum:]].[[:alnum:]].[[:alnum:]]'
            infile > outfile


            Output from your two lines:



            $ cat outfile
            eventtime=1548531298
            srcip=X.X.X.Y
            dstip=X.X.X.X
            eventtime=1548531299
            srcip=X.X.X.Z
            dstip=X.X.X.Y


            The output file contains 25165824 lines corresponding to 8388608 (8.3 million) lines in the input file.



            $ wc -l outfile
            25165824 outfile
            $ <<< '25165824/3' bc
            8388608


            My test indicates that grep can process approximately 1 million lines per minute.



            Unless your computer is much faster than mine. this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before writing the log file, but it would be best to completely avoid output of what is not necessary (and avoid filtering).





            The input file is made by duplication, and maybe the system 'remembers' that it has seen the same lines before and makes things faster, so I don't know how fast it will work with a real big file with all the unpredicted variations. You have to test it.





            Edit1: I ran the same task in a Dell M4800 with an Intel 4th generation i7 processor and an SSD. It finished in 4 minutes and 36 seconds, at almost double speed, 1.82 million lines per minute.



            $ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
            1.82


            Still too slow.





            Edit2: I simplified the grep patterns and ran it again in the Dell.



            time grep -oE -e 'eventtime=[^ ]*' 
            -e 'srcip=[^ ]*'
            -e 'dstip=[^ ]*'
            infile > out


            It finished after 4 minutes and 11 seconds, a small improvement to 2.00 million lines per minute



            $ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
            2.00


            Edit 3: @JJoao's, perl extension speeds up grep to 39 seconds corresponding to 12.90 million lines per minute in the computer, where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD).



            $ time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao

            real 0m38,699s
            user 0m31,466s
            sys 0m2,843s


            This perl extension is experiental according to info grep but works in my Lubuntu 18.04.1 LTS.




            ‘-P’ ‘--perl-regexp’
            Interpret the pattern as a Perl-compatible regular expression
            (PCRE). This is experimental, particularly when combined with the
            ‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of
            unimplemented features. *Note Other Options::.




            I also compiled a C program according to @JJoao's flex method, and it finshed in 53 seconds corresponding to 9.49 million lines per minute in the computer, where the ordinary grep reads 1 million lines per minute (reading from and writing to an HDD). Both methods are fast, but grep with the perl extension is fastest.



            $ time ./filt.jjouo < infile > out-flex-JJoao

            real 0m53,440s
            user 0m48,789s
            sys 0m3,104s


            Edit 3.1: In the Dell M4800 with an SSD I had the following results,



            time ./filt.jjouo < infile > out-flex-JJoao

            real 0m25,611s
            user 0m24,794s
            sys 0m0,592s

            time grep -oP 'b(eventtime|srcip|dstip)=KS+' infile >out-grep-JJoao

            real 0m18,375s
            user 0m17,654s
            sys 0m0,500s


            This corresponds to




            • 19.66 million lines per minute for the flex application

            • 27.35 million lines per minute for grep with the perl extension


            Edit 3.2: In the Dell M4800 with an SSD I had the following results, when I used the option -f to the flex preprocessor,



            flex -f -o filt.c filt.flex
            cc -O2 -o filt.jjoao filt.c


            The result was improved, and now the flex application shows the highest speed



            flex -f ...

            $ time ./filt.jjoao < infile > out-flex-JJoao

            real 0m15,952s
            user 0m15,318s
            sys 0m0,628s


            This corresponds to




            • 31.55 million lines per minute for the flex application.






– sudodus, answered Jan 28 at 15:23, last edited Jan 31 at 7:21













• Thank you very much for such effort. Is it possible to split the file into many chunks and run the filter simultaneously against these chunks?

              – Ubai salih
              Jan 28 at 20:00











• @Ubaisalih, Yes, it might be possible and worthwhile to split the file into many chunks if you can use several processors in parallel for the task (see the sketch after these comments). But you should really try to avoid writing such a huge file with columns of data that you will not use. Instead, write a file with only the data that you need from the beginning. Such a file will contain 10 % or less compared to the file that you create now. Check how the current syslog file is created and how you can create the file that you need directly (not via a filter).

              – sudodus
              Jan 28 at 20:19











• @sudodus, (+1); out of curiosity: could you please time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile

              – JJoao
              Jan 30 at 12:33













            • @JJoao, see Edit 3.

              – sudodus
              Jan 30 at 17:14






• @Ubaisalih, With the flex application you might get something that is fast enough from the huge syslog file, but I still suggest that you find a method to write a file with only the data that you need from the beginning.

              – sudodus
              Jan 31 at 7:26
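
One way to try the chunking idea from these comments without creating temporary files is GNU parallel's --pipepart, which carves the input file into blocks and feeds each block to its own filter process. This is only a sketch, assuming GNU parallel is installed, and it reuses the grep -P pattern from the answer:

parallel -k --pipepart -a infile --block 100M \
    grep -oP '\b(eventtime|srcip|dstip)=\K\S+' > outfile

Each grep receives one roughly 100 MB slice of the file on stdin, the number of simultaneous jobs defaults to the number of CPU cores, and -k keeps the combined output in file order.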



















Here is one possible solution, based on this answer provided by @PerlDuck a while ago:





#!/bin/bash
# For each non-empty input line: strip the braces, turn the space-separated
# field=value pairs into one-per-line shell assignments, eval them, and print
# the three wanted variables.
while IFS= read -r LINE
do
    if [[ ! -z ${LINE} ]]
    then
        eval $(echo "$LINE" | sed -e 's/\({\|}\)//g' -e 's/ /\n/g' | sed -ne '/=/p')
        echo "$eventtime|$srcip|$dstip"
    fi
done < "$1"


I do not know how it will behave on such a large file; IMO an awk solution would be much faster (a sketch follows after the timing below). Here is how it works with the provided input file example:



            $ ./script.sh in-file
            1548531299|X.X.X.X|X.X.X.X
            1548531299|X.X.X.X|X.X.X.X




Here is the result of a timing test, performed on a regular i7 machine equipped with an SSD and 16 GB RAM:



            $ time ./script.sh 160000-lines-in-file > out-file

            real 4m49.620s
            user 6m15.875s
            sys 1m50.254s
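
For comparison, here is a sketch of the awk variant hinted at above. The field names are taken from the sample lines in the question, it copes with the fields appearing at random positions, and infile/outfile are placeholders:

awk '{
    t = s = d = ""                      # reset the three wanted values per line
    for (i = 1; i <= NF; i++) {         # scan every whitespace-separated field
        if ($i ~ /^eventtime=/)      t = substr($i, 11)
        else if ($i ~ /^srcip=/)     s = substr($i, 7)
        else if ($i ~ /^dstip=/)     d = substr($i, 7)
    }
    if (t != "") print t "|" s "|" d    # print only lines that carried an eventtime
}' infile > outfile

It produces the eventtime|srcip|dstip format directly and avoids the per-line eval and the two sed calls that make the bash loop slow.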





– pa4080, answered Jan 28 at 9:04, last edited Jan 28 at 9:41













• Thank you very much for the reply. I've tried this, but it still consumes too much time to filter one file, so I'm not sure if there is something faster, probably with awk as you mentioned.

              – Ubai salih
              Jan 28 at 9:16













• Hi, @Ubaisalih, I think awk will be faster according to this time comparison. Anyway, is a new 4GB log created each minute, or are only a few hundred lines appended each minute to an existing file?

              – pa4080
              Jan 28 at 9:31













• Huh, I just saw you are the same OP on the linked PerlDuck's answer :)

              – pa4080
              Jan 28 at 9:51






• @Ubaisalih, what service writes these logs? I'm asking because, for example, Apache2 can pipe its logs to a script, and that script can write a regular log and also a processed log, etc. If such a solution could be applied, this would be the best way, I think.

              – pa4080
              Jan 28 at 10:41













• It's a new 4GB file each minute; the problem now is the speed of filtering the file, so is there a faster method to capture a few columns in a timely manner?

              – Ubai salih
              Jan 28 at 11:47


















