Extracting specific lines from log file with grep and awk

I have a huge log file (20 million lines) telling me whether some URLs are responding "200 OK" or not.



I'd like to extract every URL with status "200 OK", plus the filename attached to it.



Input example:



Spider mode enabled. Check if remote file exists.
--2019-02-06 07:38:43-- https://www.example/download/123456789
Reusing existing connection to website.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Content-Type: application/zip
Connection: keep-alive
Status: 200 OK
Content-Disposition: attachment; filename="myfile123.zip"
Last-Modified: 2019-02-06 01:38:44 +0100
Access-Control-Allow-Origin: *
Cache-Control: private
X-Runtime: 0.312890
X-Frame-Options: SAMEORIGIN
Access-Control-Request-Method: GET,OPTIONS
X-Request-Id: 99920e01-d308-40ba-9461-74405e7df4b3
Date: Wed, 06 Feb 2019 00:38:44 GMT
X-Powered-By: Phusion Passenger 5.1.11
Server: nginx + Phusion Passenger 5.1.11
X-Powered-By: cloud66
Length: unspecified [application/zip]
Last-modified header invalid -- time-stamp ignored.
Remote file exists.

Spider mode enabled. Check if remote file exists.
--2019-02-06 07:38:43-- https://www.example/download/234567890
Reusing existing connection to website.
HTTP request sent, awaiting response...
HTTP/1.1 404 Not Found
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Status: 404 Not Found
Cache-Control: no-cache
Access-Control-Allow-Origin: *
X-Runtime: 0.020718
X-Frame-Options: SAMEORIGIN
Access-Control-Request-Method: GET,OPTIONS
X-Request-Id: bc20626b-095f-4b28-8322-ad3f294e4ee2
Date: Wed, 06 Feb 2019 00:37:42 GMT
X-Powered-By: Phusion Passenger 5.1.11
Server: nginx + Phusion Passenger 5.1.11
Remote file does not exist -- broken link!!!


Desired Output:



https://www.example/download/123456789 myfile123.zip


I have a friend who became an expert at this kind of extraction with grep and awk, and I'd love to finally understand the logic behind it.



If I do this:



awk '/: 200 OK/{print $0}' file.log
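On the sample above, this prints only the matching line itself:

Status: 200 OK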


I get all the lines containing Status: 200 OK, but without the surrounding context.



If I do this:



grep -C4 "1 200 OK" file.log


I get the context, but with "noise". I'd like to rearrange the output so that only the relevant information ends up on one line.










Tags: awk, grep, logs






asked Feb 6 at 1:40, edited Feb 6 at 1:51 · Yoric






















2 Answers


















You can do this with awk as below: store the URL in a variable first, and then, on the Status line, if the status is OK, pull the filename from the subsequent line. This needs GNU awk, because match() is called with a third argument to store the captured group in an array.



awk '/^--/{ url = $NF }
/^[[:space:]]*Status/ && $NF == "OK" { getline nextline; match(nextline, /filename="(.+)"/, arr); print url, arr[1] }' file





answered Feb 6 at 3:16 · Inian



















Truly amazing. I stared for 20 minutes in awe. I think I got the logic. This is pure power. Thanks a million for writing this. – Yoric, Feb 6 at 3:51
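As an aside (not part of either answer): if GNU awk isn't available, the same idea can be sketched in POSIX awk with the two-argument match() and substr(), under the same assumption that the Content-Disposition line immediately follows the Status line:

awk '
  /^--/ { url = $NF }                                 # remember the URL from the request line
  /^[[:space:]]*Status/ && $NF == "OK" {              # a "Status: 200 OK" line
    if ((getline line) > 0 && match(line, /filename="[^"]*"/)) {
      # POSIX match() only sets RSTART/RLENGTH, so cut the name out of filename="..."
      print url, substr(line, RSTART + 10, RLENGTH - 11)
    }
  }' file.log

On the sample input this prints the same https://www.example/download/123456789 myfile123.zip line.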



















# first pass: collect the filename that follows each "Status: 200 OK" line
i=`awk '/Status: 200 OK/{x=NR+1}(NR<x){getline;print $NF}' filename | awk -F "=" '{print $NF}'| sed 's/"//g'`

# second pass: replay the 8 lines ending at each "Status: 200 OK" line, keep the URL line and append the filename
awk '{a[++i]=$0}/Status: 200 OK/{for(x=NR-7;x<=NR;x++)print a[x]}' filename | awk -v i="$i" '/https:/{$1=$2="";print $0 " " i}'

Output:

https://www.example/download/123456789 myfile123.zip





answered Feb 7 at 19:03 · Praveen Kumar BS






















