awk to match and cut out fields with alternating delimiter



























example bookmarks file



I would like to use awk (or something similar) to match patterns in a Chrome bookmarks file and, depending on the match, cut out a specific field using different field delimiters.



I have attached a sample picture; I still haven't figured out how to attach it as a file.



I want the folder name whenever the string H3 is matched, and the URL whenever the string HREF is encountered.



The following two commands do the job for their respective matches:



awk -F'[<>]' '/H3/{print $5}' bookmarks.htm
awk -F'"' '/HREF/{print $2}' bookmarks.html


My goal is to combine the two statements above so the output becomes:



UNIX
url-1
url-2
OCE
url-3
url-4
url-5
ANDROID
url-6
url-7


I have tried awk's if/else constructs, but without conclusive results.



How do I achieve this as a one-liner? Are there better candidates than awk? Python or Perl would both be fine; however, a one-liner is an absolute requirement, since it would be an easy task to write a shell script that does the job.

asked Feb 19 '17 at 21:14 – HenrikJson
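A minimal sketch of one way to combine the two matches in a single awk pass, assuming the line layout of the sample bookmarks file posted further down in this thread (the field positions are the same as in the two commands above):

awk '
    /H3/   { split($0, a, "[<>]"); print a[5] }   # folder name: 5th field when the line is split on < and >
    /HREF/ { split($0, b, "\"");   print b[2] }   # URL: the first double-quoted string on the line
' bookmarks.html

On the sample file this prints each folder name followed by its URLs, in document order.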






























  • Don't post screenshots of text, paste the actual text... – jasonwryan, Feb 19 '17 at 21:16

  • The text is very long and the formatting is ugly. It can't be added because it contains URLs, and as a beginner I am not allowed more than one URL in my post. – HenrikJson, Feb 19 '17 at 21:18

  • Use the {} button to format the text as code. – Gilles, Feb 19 '17 at 21:37

  • The {} produced something that looks like only a partial code extract -> no good. – HenrikJson, Feb 19 '17 at 21:49

  • You don't need a one-liner to make it easy to script, but here is one anyway: awk -F'[<>]' '/<H3/{print $5} /HREF="/{sub(/[^"]*"/,"");sub(/".*/,"");print}' bookmark.html – dave_thompson_085, Feb 21 '17 at 2:53


















4 Answers
Processing HTML files with sed/awk/… is the wrong way to do this; there are dedicated parsers for that. But as a temporary substitution:

sed '
/\n/{P;d;}
/<H3/s/[><]/\n/4g
/HREF/s/"/\n/g
D
' bookmarks.htm

For non-GNU versions of sed:

sed '
/\n/{P;d;}          # if there is more than one line, «P»rint the 1st line then «d»elete all
/<\/H3/s//\n/       # replace «</H3» by a «n»ewline
/\n/s/">/\n/        # replace «">» by a «n»ewline if the previous command was executed
/HREF/s/"/\n/g      # put «n»ewlines around the url if «HREF» is in the line
D                   # «D»elete up to the first newline, go back to the start
' bookmarks.htm

– Costas, answered Feb 19 '17 at 21:53, edited Feb 21 '17 at 10:03





  • Thanks, that gives the urls but not the headers; trying to adapt the part: /<H3/s/[><]/\n/4g – HenrikJson, Feb 20 '17 at 21:33

  • @HenrikJson That is possible if you use a non-GNU sed: the 4g construction is not recognized. In that case you have to substitute it by /<\/H3/s//\n/;/\n/s/">/\n/ – Costas, Feb 21 '17 at 6:17

  • Costas, if you add that comment + your original command as an answer I will mark it as correct. Please, if you have time, also add annotations on how to read the sed command; parts of it are clear to me but not all. – HenrikJson, Feb 21 '17 at 8:11

  • @HenrikJson see the update. – Costas, Feb 21 '17 at 10:06

































Using an XML/HTML parser or processor has some advantages; XPath expressions are the standard way to select specific parts.

XML + xmlstarlet + XPath

If the input is well-formed XML, we can use xmlstarlet with an XPath expression:

xmlstarlet sel -t -v '//h3|//a/@href' -nl bookmarks.html

HTML + xmllint : XML

If the input is just valid HTML, we can convert it to XML (using xmllint) and use the previous command:

xmllint -html -xmlout ex.html | xmlstarlet sel -t -v '//h3|//a/@href' -nl -

xmllint + XPath

We can also use xmllint with an XPath expression directly:

xmllint -html -xpath '//h3/text()|//a/@href' bookmarks.html

... but the output format is not the same.

– JJoao, answered Feb 20 '17 at 0:17, edited Feb 20 '17 at 15:33
































  • Could you explain what this is? – LukeM, Feb 20 '17 at 0:53

  • @DarkHeart, I added some more information. – JJoao, Feb 20 '17 at 12:58

  • On the cygwin I am running, neither xmllint nor xpath is available. – HenrikJson, Feb 20 '17 at 21:29

  • @HenrikJson, you can install both xmllint (setup-x86_64 -qP libxml2) and xmlstarlet in cygwin. – JJoao, Feb 20 '17 at 23:41

































One last answer: this time a Perl one-liner:

perl -nE 'say $1 if (/<h3.*?>(.*?)<\/h3>/i or /href="(.*?)"/i)' ex.html

(I believe that XML-parser-based solutions are better, but since you have a tool-generated file, the amount of surprises should not be very high.)

– JJoao, answered Feb 21 '17 at 8:27




















For now I have discarded the demand for a one-liner and done it as a script instead.

I had to post this as a response because it would have been too long for a comment. Still, feel free to respond.

This script does the job but is too sluggish; can anyone speed it up or alternatively suggest a one-liner?

    #!/bin/sh
    file=$1
    while IFS= read -r line
    do
        hdr=$(echo "$line" | awk -F'[<>]' '/H3/{print $5}')
        url=$(echo "$line" | awk -F'"' '/HREF/{print $2}')
        if [ -n "$url" ]; then
            echo "$url"
        elif [ -n "$hdr" ]; then
            echo "$hdr"
        fi
    done <"$file"

Here the file (finally got it):



    <html xmlns="http://www.w3.org/1999/xhtml">
    <body>
    <h1>Bookmarks</h1>
    <dl>
    <dd>
    <DT><H3 ADD_DATE="1484311924" LAST_MODIFIED="1485532328">UNIX</H3>
    <dl>
    <dt><a HREF="http://unix.stackexchange.com/questions/223182/how-to-replace-spaces-in-all-file-names-with-underscore-in-linux-using-shell-scr" add_date="1484311897">url-1</a></dt>
    <dt><a HREF="http://unix.stackexchange.com/questions/81349/how-do-i-use-find-when-the-filename-contains-spaces" add_date="1484738308">url-2</a></dt>
    </dl>
    </dd>
    <dd>
    <DT><H3 ADD_DATE="1486550854" LAST_MODIFIED="1487228526">OCE</H3>
    <dl>
    <dt><a HREF="http://www.oraclecertificationprep.com/apex/f?p=OCPSG%3AEXAM_DETAILS%3A%3A%3ANO%3A%3AP2_EXAM%3A1Z0-061" add_date="1486550866">url-3</a></dt>
    <dt><a HREF="http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getpage?page_id=303&amp;p_certName=SQ1Z0_047" add_date="1486550898">url-4</a></dt>
    <dt><a HREF="https://www.quora.com/How-do-you-prepare-for-an-Oracle-Database-SQL-exam" add_date="1486550950">url-5</a></dt>
    </dl>
    </dd>
    <dd>
    <DT><H3 ADD_DATE="1487084050" LAST_MODIFIED="1487228595">ANDROID</H3>
    <dl>
    <dt><a HREF="https://material.io/guidelines/style/color.html#" add_date="1487228526">url-6</a></dt>
    <dt><a HREF="https://developer.android.com/index.html" add_date="1487228539">url-7</a></dt>
    </dl>
    </dd>
    </dl>
    </body>
    </html>

– HenrikJson (the asker), answered Feb 20 '17 at 21:59
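The loop above is slow mainly because it spawns two awk processes for every input line. As a minimal sketch (assuming the uppercase H3/HREF layout of the sample file above and a POSIX shell), the same loop can be written with the shell's own pattern matching so that no external command is started per line:

    #!/bin/sh
    file=$1
    while IFS= read -r line
    do
        case $line in
            *HREF=\"*)                   # bookmark line: the URL sits between the first pair of quotes after HREF=
                url=${line#*HREF=\"}
                url=${url%%\"*}
                echo "$url" ;;
            *\<H3*)                      # folder line: the name sits between the last > and the closing </H3>
                hdr=${line%'</H3>'*}
                echo "${hdr##*>}" ;;
        esac
    done <"$file"

A single awk invocation over the whole file (as in the comment by dave_thompson_085 above) avoids the line-by-line loop entirely and should be faster still.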




