awk to match and cut out fields with alternating delimiter

[image: example bookmarks file]



I would like to use awk (or a similar tool) to match patterns in a Chrome bookmarks file and, depending on the match, cut out a specific field using a different field delimiter.



I have attached a sample picture; I still haven't figured out how to attach the file itself.



I want the folder name whenever the string H3 is matched, and the URL whenever the string HREF is encountered.



The following two commands do the job for their respective matches:



awk -F'[<>]' '/H3/{print $5}' bookmarks.htm
awk -F'"' '/HREF/{print $2}' bookmarks.html


My goal is to combine the two statements above so the output becomes:



UNIX
url-1
url-2
OCE
url-3
url-4
url-5
ANDROID
url-6
url-7


I have tried awk's if/else constructs, but without conclusive results.



How do I achieve this as a one-liner? Are there better candidates than awk? Python or Perl would both be fine; however, a one-liner is a must, since writing a shell script that does the job would be easy.
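For illustration, a minimal sketch of one way to combine the two commands: keep awk's default field splitting and re-split each matching line with split(), using a different delimiter per pattern. The field positions are taken from the two commands above, and the line layout is assumed to be that of the sample bookmarks file posted in the answers below.

awk '/<H3/  { split($0, f, /[<>]/); print f[5] }    # folder name: 5th field when split on < and >
     /HREF/ { split($0, f, /"/);    print f[2] }    # URL: first double-quoted string on the line
' bookmarks.html

split() is used instead of changing FS per line because an assignment to FS only takes effect on the next record, not on the line that triggered it.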

Tags: shell-script, awk

edited Jan 13 at 21:52 by Rui F Ribeiro
asked Feb 19 '17 at 21:14 by HenrikJson

  • Don't post screenshots of text, paste the actual text... – jasonwryan, Feb 19 '17 at 21:16

  • The text is very long and has ugly formatting. It can't be added as it contains URLs, and as a beginner I am not allowed more than one URL in my post. – HenrikJson, Feb 19 '17 at 21:18

  • Use the {} button to format the text as code. – Gilles, Feb 19 '17 at 21:37

  • The {} produced something that looks like only a partial code extract -> no good. – HenrikJson, Feb 19 '17 at 21:49

  • You don't need a one-liner to make it easy to script, but here is one anyway: awk -F'[<>]' '/<H3/{print $5} /HREF="/{sub(/[^"]*"/,"");sub(/".*/,"");print}' bookmark.html – dave_thompson_085, Feb 21 '17 at 2:53

4 Answers

This is the wrong way to process HTML files with sed/awk/…; there are dedicated parsers for that. But as a temporary substitute:



sed '
/\n/{P;d;}
/<H3/s/[><]/\n/4g
/HREF/s/"/\n/g
D
' bookmarks.htm


For non-GNU versions of sed:



sed '
/\n/{P;d;}      #if there is more than one line, «P»rint the first line, then «d»elete all
/<\/H3/s//\n/   #replace «</H3» by a «n»ewline
/\n/s/">/\n/    #replace «">» by a «n»ewline if the previous command was executed
/HREF/s/"/\n/g  #put «n»ewlines around the url if «HREF» is in the line
D               #«D»elete the first line, go back to the start
' bookmarks.htm
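To see the mechanics, here is the first (GNU sed) script run over a single, shortened folder line; the sample line is modelled on the file posted in the last answer:

# The 4th and later < or > become newlines, D drops everything up to the first
# newline and restarts the cycle, and P then prints the folder name that is now
# on a line of its own.
printf '%s\n' '<DT><H3 ADD_DATE="1" LAST_MODIFIED="2">UNIX</H3>' |
sed '
/\n/{P;d;}
/<H3/s/[><]/\n/4g
/HREF/s/"/\n/g
D
'
# prints: UNIX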
answered Feb 19 '17 at 21:53 by Costas (edited Feb 21 '17 at 10:03)
  • Thanks, that gives the urls but not the headers; trying to adapt the part /<H3/s/[><]/\n/4g – HenrikJson, Feb 20 '17 at 21:33

  • @HenrikJson It is possible you use a non-GNU sed: the 4g construction is not recognized. In that case you have to substitute it with /<\/H3/s//\n/;/\n/s/">/\n/ – Costas, Feb 21 '17 at 6:17

  • Costas, if you add that comment plus your original command as an answer, I will mark it as correct. Please, if you have time, also add annotations on how to read the sed command; parts of it are clear to me but not all. – HenrikJson, Feb 21 '17 at 8:11

  • @HenrikJson See the updated answer. – Costas, Feb 21 '17 at 10:06

Using an XML/HTML parser or processor has some advantages; XPath expressions are the standard way to select specific parts.



xml + xmlstarlet + xpath



If the input is well-formed XML, we can use xmlstarlet with an XPath expression:



xmlstarlet sel -t -v '//h3|//a/@href' -nl bookmarks.html


html + xmllint : xml



If the input is just valid HTML, we can convert it to XML (using xmllint) and use the previous approach:



xmllint -html -xmlout ex.html | xmlstarlet sel -t -v '//h3|//a/@href' -nl -


xmllint + xpath



We can use xmllint with an XPath expression directly:



xmllint -html -xpath '//h3/text()|//a/@href' bookmarks.html


... but the output format is not the same...

answered Feb 20 '17 at 0:17 by JJoao (edited Feb 20 '17 at 15:33)
  • Could you explain what this is? – LukeM, Feb 20 '17 at 0:53

  • @DarkHeart, I added some more information. – JJoao, Feb 20 '17 at 12:58

  • On the Cygwin I am running, neither xmllint nor xpath is available. – HenrikJson, Feb 20 '17 at 21:29

  • @HenrikJson, you can install both xmllint (setup-x86_64 -qP libxml2) and xmlstarlet in Cygwin. – JJoao, Feb 20 '17 at 23:41

One last answer: this time a Perl one-liner.



perl -nE 'say $1 if (/<h3.*?>(.*?)<\/h3>/i or /href="(.*?)"/i)' ex.html


(I believe that XML-parser-based solutions are better, but since you have a tool-generated file, the number of surprises should not be very high.)

answered Feb 21 '17 at 8:27 by JJoao

For now I have dropped the demand for a one-liner and did it as a script instead.



I had to post this as an answer, as it would have been too long for a comment. Still, feel free to respond.



This script does the job but is too sluggish; can anyone speed it up or, alternatively, suggest a one-liner?



#!/bin/sh
file=$1
while IFS= read -r line
do
    # folder name: 5th field when the line is split on < and >
    hdr=$(echo "$line" | awk -F'[<>]' '/H3/{print $5}')
    # URL: 2nd field when the line is split on double quotes
    url=$(echo "$line" | awk -F'"' '/HREF/{print $2}')
    if [ -n "$url" ]; then
        echo "$url"
    elif [ -n "$hdr" ]; then
        echo "$hdr"
    fi
done <"$file"
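The sluggishness comes mostly from starting two awk processes for every input line. A minimal single-pass sketch in plain sh, assuming the one-tag-per-line layout of the sample file shown below, avoids that:

#!/bin/sh
# Single pass, no per-line subprocesses: do the matching and cutting in the shell itself.
while IFS= read -r line; do
    case $line in
        *'HREF="'*)                 # link line: keep what sits between HREF=" and the next "
            url=${line#*HREF=\"}
            url=${url%%\"*}
            printf '%s\n' "$url"
            ;;
        *'<H3'*)                    # folder line: keep what sits between the H3 tag's > and the next <
            hdr=${line#*H3*\>}
            hdr=${hdr%%\<*}
            printf '%s\n' "$hdr"
            ;;
    esac
done <"$1"

Like the awk and sed approaches, this relies on the regular layout of the generated file rather than on real HTML parsing.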


Here is the file (finally got it):



    <html xmlns="http://www.w3.org/1999/xhtml">
    <body>
    <h1>Bookmarks</h1>
    <dl>
    <dd>
    <DT><H3 ADD_DATE="1484311924" LAST_MODIFIED="1485532328">UNIX</H3>
    <dl>
    <dt><a HREF="http://unix.stackexchange.com/questions/223182/how-to-replace-spaces-in-all-file-names-with-underscore-in-linux-using-shell-scr" add_date="1484311897">url-1</a></dt>
    <dt><a HREF="http://unix.stackexchange.com/questions/81349/how-do-i-use-find-when-the-filename-contains-spaces" add_date="1484738308">url-2</a></dt>
    </dl>
    </dd>
    <dd>
    <DT><H3 ADD_DATE="1486550854" LAST_MODIFIED="1487228526">OCE</H3>
    <dl>
    <dt><a HREF="http://www.oraclecertificationprep.com/apex/f?p=OCPSG%3AEXAM_DETAILS%3A%3A%3ANO%3A%3AP2_EXAM%3A1Z0-061" add_date="1486550866">url-3</a></dt>
    <dt><a HREF="http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getpage?page_id=303&amp;p_certName=SQ1Z0_047" add_date="1486550898">url-4</a></dt>
    <dt><a HREF="https://www.quora.com/How-do-you-prepare-for-an-Oracle-Database-SQL-exam" add_date="1486550950">url-5</a></dt>
    </dl>
    </dd>
    <dd>
    <DT><H3 ADD_DATE="1487084050" LAST_MODIFIED="1487228595">ANDROID</H3>
    <dl>
    <dt><a HREF="https://material.io/guidelines/style/color.html#" add_date="1487228526">url-6</a></dt>
    <dt><a HREF="https://developer.android.com/index.html" add_date="1487228539">url-7</a></dt>
    </dl>
    </dd>
    </dl>
    </body>
    </html>
answered Feb 20 '17 at 21:59 by HenrikJson