Extracting specific lines from log file with grep and awk

I have a huge log file (20 million lines) telling me whether some URLs are responding "200 OK" or not.



I'd like to extract every URL with status "200 OK", plus the filename attached to it.



Input example:



Spider mode enabled. Check if remote file exists.
--2019-02-06 07:38:43-- https://www.example/download/123456789
Reusing existing connection to website.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Content-Type: application/zip
Connection: keep-alive
Status: 200 OK
Content-Disposition: attachment; filename="myfile123.zip"
Last-Modified: 2019-02-06 01:38:44 +0100
Access-Control-Allow-Origin: *
Cache-Control: private
X-Runtime: 0.312890
X-Frame-Options: SAMEORIGIN
Access-Control-Request-Method: GET,OPTIONS
X-Request-Id: 99920e01-d308-40ba-9461-74405e7df4b3
Date: Wed, 06 Feb 2019 00:38:44 GMT
X-Powered-By: Phusion Passenger 5.1.11
Server: nginx + Phusion Passenger 5.1.11
X-Powered-By: cloud66
Length: unspecified [application/zip]
Last-modified header invalid -- time-stamp ignored.
Remote file exists.

Spider mode enabled. Check if remote file exists.
--2019-02-06 07:38:43-- https://www.example/download/234567890
Reusing existing connection to website.
HTTP request sent, awaiting response...
HTTP/1.1 404 Not Found
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Status: 404 Not Found
Cache-Control: no-cache
Access-Control-Allow-Origin: *
X-Runtime: 0.020718
X-Frame-Options: SAMEORIGIN
Access-Control-Request-Method: GET,OPTIONS
X-Request-Id: bc20626b-095f-4b28-8322-ad3f294e4ee2
Date: Wed, 06 Feb 2019 00:37:42 GMT
X-Powered-By: Phusion Passenger 5.1.11
Server: nginx + Phusion Passenger 5.1.11
Remote file does not exist -- broken link!!!


Desired Output:



https://www.example/download/123456789 myfile123.zip


I have a friend who became an expert at this kind of extraction with grep and awk, and I'd love to finally understand the logic behind it.



If I do this:



awk '/: 200 OK/{print $0}' file.log
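On the sample above, this prints only the matching line itself:

Status: 200 OK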


I get all the lines containing Status: 200 OK, but without the surrounding context.



If I do this:



grep -C4 "1 200 OK" file.log


I get the context, but with "noise". I'd like to rearrange the output so that only the relevant information ends up on one line.










Tags: awk, grep, logs






asked Feb 6 at 1:40, edited Feb 6 at 1:51 · Yoric






















2 Answers


















You can do this with awk as below: store the URL in a variable first, and then, on the Status line, if the status is OK, pull the filename from the subsequent line. This needs GNU awk, because match() is called with a third argument to store the captured group in an array.



awk '/^--/{ url = $NF }
/^[[:space:]]*Status/ && $NF == "OK" { getline nextline; match(nextline, /filename="(.+)"/, arr); print url, arr[1] }' file





answered Feb 6 at 3:16 · Inian



















Truly amazing. I stared for 20 minutes in awe. I think I got the logic. This is pure power. Thanks a million for writing this. – Yoric, Feb 6 at 3:51
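As an aside (not part of either answer): if GNU awk isn't available, the same idea can be sketched in POSIX awk with the two-argument match() and substr(), under the same assumption that the Content-Disposition line immediately follows the Status line:

awk '
  /^--/ { url = $NF }                                 # remember the URL from the request line
  /^[[:space:]]*Status/ && $NF == "OK" {              # a "Status: 200 OK" line
    if ((getline line) > 0 && match(line, /filename="[^"]*"/)) {
      # POSIX match() only sets RSTART/RLENGTH, so cut the name out of filename="..."
      print url, substr(line, RSTART + 10, RLENGTH - 11)
    }
  }' file.log

On the sample input this prints the same https://www.example/download/123456789 myfile123.zip line.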



















# first pass: collect the filename that follows each "Status: 200 OK" line
i=`awk '/Status: 200 OK/{x=NR+1}(NR<x){getline;print $NF}' filename | awk -F "=" '{print $NF}'| sed 's/"//g'`

# second pass: replay the 8 lines ending at each "Status: 200 OK" line, keep the URL line and append the filename
awk '{a[++i]=$0}/Status: 200 OK/{for(x=NR-7;x<=NR;x++)print a[x]}' filename | awk -v i="$i" '/https:/{$1=$2="";print $0 " " i}'

Output:

https://www.example/download/123456789 myfile123.zip





answered Feb 7 at 19:03 · Praveen Kumar BS






















