Extracting specific lines from log file with grep and awk
I have a huge log file (20 million lines) that records whether each URL responds with "200 OK" or not.
I'd like to extract every URL whose status is "200 OK", together with the filename attached to it.
Input example:
Spider mode enabled. Check if remote file exists.
--2019-02-06 07:38:43-- https://www.example/download/123456789
Reusing existing connection to website.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Content-Type: application/zip
Connection: keep-alive
Status: 200 OK
Content-Disposition: attachment; filename="myfile123.zip"
Last-Modified: 2019-02-06 01:38:44 +0100
Access-Control-Allow-Origin: *
Cache-Control: private
X-Runtime: 0.312890
X-Frame-Options: SAMEORIGIN
Access-Control-Request-Method: GET,OPTIONS
X-Request-Id: 99920e01-d308-40ba-9461-74405e7df4b3
Date: Wed, 06 Feb 2019 00:38:44 GMT
X-Powered-By: Phusion Passenger 5.1.11
Server: nginx + Phusion Passenger 5.1.11
X-Powered-By: cloud66
Length: unspecified [application/zip]
Last-modified header invalid -- time-stamp ignored.
Remote file exists.
Spider mode enabled. Check if remote file exists.
--2019-02-06 07:38:43-- https://www.example/download/234567890
Reusing existing connection to website.
HTTP request sent, awaiting response...
HTTP/1.1 404 Not Found
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Status: 404 Not Found
Cache-Control: no-cache
Access-Control-Allow-Origin: *
X-Runtime: 0.020718
X-Frame-Options: SAMEORIGIN
Access-Control-Request-Method: GET,OPTIONS
X-Request-Id: bc20626b-095f-4b28-8322-ad3f294e4ee2
Date: Wed, 06 Feb 2019 00:37:42 GMT
X-Powered-By: Phusion Passenger 5.1.11
Server: nginx + Phusion Passenger 5.1.11
Remote file does not exist -- broken link!!!
Desired Output:
https://www.example/download/123456789 myfile123.zip
A friend of mine became an expert at this kind of extraction with grep and awk, and I'd love to finally understand the logic behind it.
If I do this:
awk '/: 200 OK/{print $0}' file.log
I get all the lines containing Status: 200 OK, but without any context.
If I do this:
grep -C4 "1 200 OK" file.log
I get the context, but with a lot of noise. I'd like to rearrange the output so that only the relevant information ends up on one line.
2 Answers
You can do this with awk as below: store the URL in a variable first, and then, when the Status line ends in OK, take the filename from the subsequent line. Note that this requires GNU awk, because the match() function needs a third argument to store the captured group in an array.
awk '/^--/ { url = $NF }                            # remember the URL from the "--<date> <time>--  <url>" line
     /^[[:space:]]*Status/ && $NF == "OK" {         # on a "Status: ... OK" line (indented or not)
         getline nextline                           # read the following Content-Disposition line
         match(nextline, /filename="(.+)"/, arr); print url, arr[1] }' file
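If GNU awk isn't available, the same idea can be written in POSIX awk by trimming the Content-Disposition line with two sub() calls instead of the three-argument match(). This is just a minimal sketch, assuming (as in the sample) that the header directly follows the Status line and always quotes the filename:
awk '/^--/ { url = $NF }
     /^[[:space:]]*Status/ && $NF == "OK" {
         getline line                        # the Content-Disposition header
         sub(/.*filename="/, "", line)       # drop everything up to the opening quote
         sub(/".*/, "", line)                # drop the closing quote and anything after it
         print url, line
     }' file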
Truly amazing. I stared for 20 minutes in awe. I think I got the logic. This is pure power. Thanks a million for writing this.
– Yoric
Feb 6 at 3:51
# First pass: grab the last field of the line after each "Status: 200 OK" and strip the filename="..." quoting.
i=`awk '/Status: 200 OK/{x=NR+1}(NR<x){getline;print $NF}' filename | awk -F "=" '{print $NF}' | sed 's/"//g'`
# Second pass: replay the 8 lines up to each "Status: 200 OK", keep the URL line, and append the extracted filename.
awk '{a[++i]=$0}/Status: 200 OK/{for(x=NR-7;x<=NR;x++)print a[x]}' filename | awk -v i="$i" '/https:/{$1=$2="";print $0 " " i}'
Output:
https://www.example/download/123456789 myfile123.zip
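A single-pass alternative that avoids the shell variable, and keeps each URL paired with its own filename when there are many matches, is to let awk treat each "Spider mode enabled" block as one record. This is only a sketch, assuming GNU awk (a regular expression as the record separator is a gawk extension) and that every block starts with that line, as in the sample:
awk 'BEGIN { RS = "Spider mode enabled[^\n]*\n"; FS = "\n" }   # one record per wget block, one field per line
/Status: 200 OK/ {
    url = fname = ""
    for (n = 1; n <= NF; n++) {
        if ($n ~ /^--/)        { k = split($n, f, " "); url = f[k] }   # URL is the last word of the "--<date>--" line
        if ($n ~ /filename="/) { fname = $n; sub(/.*filename="/, "", fname); sub(/".*/, "", fname) }
    }
    if (url != "" && fname != "") print url, fname
}' file.log
Because each block is a single record, the pairing no longer depends on a fixed line offset between the URL line and the Status line.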