awk to match and cut out fields with alternating delimiter
I would like to use awk
or similar to match patterns of a chrome bookmarks file and depending on match, cut out a specific field based on different field delimiters.
I have attached a sample picture. I still haven't figured out how to attach as a file.
I want the folder names in case the string H3
is matched and the URL in case the string HREF
is encountered.
the following two commands do the job for the respective matches:
awk -F'[<>]' '/H3/{print $5}' bookmarks.htm
awk -F'"' '/HREF/{print $2}' bookmarks.html
My goal is to combine the two statements above so the output becomes:
UNIX
url-1
url-2
OCE
url-3
url-4
url-5
ANDROID
url-6
url-7
I have tried awk
's if, then, else but wasn't conclusive.
How do I achieve this as a one-liner? are there better candidates than awk
? python, perl would both be great, however, one-liner is an absolute as it would be an easy task writing a shell script that does the job.
shell-script awk
|
show 1 more comment
I would like to use awk
or similar to match patterns of a chrome bookmarks file and depending on match, cut out a specific field based on different field delimiters.
I have attached a sample picture. I still haven't figured out how to attach as a file.
I want the folder names in case the string H3
is matched and the URL in case the string HREF
is encountered.
the following two commands do the job for the respective matches:
awk -F'[<>]' '/H3/{print $5}' bookmarks.htm
awk -F'"' '/HREF/{print $2}' bookmarks.html
My goal is to combine the two statements above so the output becomes:
UNIX
url-1
url-2
OCE
url-3
url-4
url-5
ANDROID
url-6
url-7
I have tried awk
's if, then, else but wasn't conclusive.
How do I achieve this as a one-liner? are there better candidates than awk
? python, perl would both be great, however, one-liner is an absolute as it would be an easy task writing a shell script that does the job.
shell-script awk
4
Don't post screenshots of text, paste the actual text...
– jasonwryan
Feb 19 '17 at 21:16
text is very long and ugly formatting. i tcan't be added as it contians URLs and as a beginner i am not allowed > 1 URL in my post
– HenrikJson
Feb 19 '17 at 21:18
Use the{}
button to format the text as code.
– Gilles
Feb 19 '17 at 21:37
the {} produced something that looks like only a partial code extract -> no good
– HenrikJson
Feb 19 '17 at 21:49
1
You don't need a one-liner to make it easy to script, but here is one anyway:awk -F'[<>]' '/<H3/{print $5} /HREF="/{sub(/[^"]*"/,"");sub(/".*/,"");print}' bookmark.html
– dave_thompson_085
Feb 21 '17 at 2:53
|
show 1 more comment
I would like to use awk
or similar to match patterns of a chrome bookmarks file and depending on match, cut out a specific field based on different field delimiters.
I have attached a sample picture. I still haven't figured out how to attach as a file.
I want the folder names in case the string H3
is matched and the URL in case the string HREF
is encountered.
the following two commands do the job for the respective matches:
awk -F'[<>]' '/H3/{print $5}' bookmarks.htm
awk -F'"' '/HREF/{print $2}' bookmarks.html
My goal is to combine the two statements above so the output becomes:
UNIX
url-1
url-2
OCE
url-3
url-4
url-5
ANDROID
url-6
url-7
I have tried awk
's if, then, else but wasn't conclusive.
How do I achieve this as a one-liner? are there better candidates than awk
? python, perl would both be great, however, one-liner is an absolute as it would be an easy task writing a shell script that does the job.
shell-script awk
I would like to use awk
or similar to match patterns of a chrome bookmarks file and depending on match, cut out a specific field based on different field delimiters.
I have attached a sample picture. I still haven't figured out how to attach as a file.
I want the folder names in case the string H3
is matched and the URL in case the string HREF
is encountered.
the following two commands do the job for the respective matches:
awk -F'[<>]' '/H3/{print $5}' bookmarks.htm
awk -F'"' '/HREF/{print $2}' bookmarks.html
My goal is to combine the two statements above so the output becomes:
UNIX
url-1
url-2
OCE
url-3
url-4
url-5
ANDROID
url-6
url-7
I have tried awk
's if, then, else but wasn't conclusive.
How do I achieve this as a one-liner? are there better candidates than awk
? python, perl would both be great, however, one-liner is an absolute as it would be an easy task writing a shell script that does the job.
shell-script awk
shell-script awk
edited Jan 13 at 21:52
Rui F Ribeiro
39.5k1479133
39.5k1479133
asked Feb 19 '17 at 21:14
HenrikJsonHenrikJson
308
308
4
Don't post screenshots of text, paste the actual text...
– jasonwryan
Feb 19 '17 at 21:16
text is very long and ugly formatting. i tcan't be added as it contians URLs and as a beginner i am not allowed > 1 URL in my post
– HenrikJson
Feb 19 '17 at 21:18
Use the{}
button to format the text as code.
– Gilles
Feb 19 '17 at 21:37
the {} produced something that looks like only a partial code extract -> no good
– HenrikJson
Feb 19 '17 at 21:49
1
You don't need a one-liner to make it easy to script, but here is one anyway:awk -F'[<>]' '/<H3/{print $5} /HREF="/{sub(/[^"]*"/,"");sub(/".*/,"");print}' bookmark.html
– dave_thompson_085
Feb 21 '17 at 2:53
|
show 1 more comment
4
Don't post screenshots of text, paste the actual text...
– jasonwryan
Feb 19 '17 at 21:16
text is very long and ugly formatting. i tcan't be added as it contians URLs and as a beginner i am not allowed > 1 URL in my post
– HenrikJson
Feb 19 '17 at 21:18
Use the{}
button to format the text as code.
– Gilles
Feb 19 '17 at 21:37
the {} produced something that looks like only a partial code extract -> no good
– HenrikJson
Feb 19 '17 at 21:49
1
You don't need a one-liner to make it easy to script, but here is one anyway:awk -F'[<>]' '/<H3/{print $5} /HREF="/{sub(/[^"]*"/,"");sub(/".*/,"");print}' bookmark.html
– dave_thompson_085
Feb 21 '17 at 2:53
4
4
Don't post screenshots of text, paste the actual text...
– jasonwryan
Feb 19 '17 at 21:16
Don't post screenshots of text, paste the actual text...
– jasonwryan
Feb 19 '17 at 21:16
text is very long and ugly formatting. i tcan't be added as it contians URLs and as a beginner i am not allowed > 1 URL in my post
– HenrikJson
Feb 19 '17 at 21:18
text is very long and ugly formatting. i tcan't be added as it contians URLs and as a beginner i am not allowed > 1 URL in my post
– HenrikJson
Feb 19 '17 at 21:18
Use the
{}
button to format the text as code.– Gilles
Feb 19 '17 at 21:37
Use the
{}
button to format the text as code.– Gilles
Feb 19 '17 at 21:37
the {} produced something that looks like only a partial code extract -> no good
– HenrikJson
Feb 19 '17 at 21:49
the {} produced something that looks like only a partial code extract -> no good
– HenrikJson
Feb 19 '17 at 21:49
1
1
You don't need a one-liner to make it easy to script, but here is one anyway:
awk -F'[<>]' '/<H3/{print $5} /HREF="/{sub(/[^"]*"/,"");sub(/".*/,"");print}' bookmark.html
– dave_thompson_085
Feb 21 '17 at 2:53
You don't need a one-liner to make it easy to script, but here is one anyway:
awk -F'[<>]' '/<H3/{print $5} /HREF="/{sub(/[^"]*"/,"");sub(/".*/,"");print}' bookmark.html
– dave_thompson_085
Feb 21 '17 at 2:53
|
show 1 more comment
4 Answers
4
active
oldest
votes
This is wrong way to process html-files with sed/awk/… There are few special parsers but as temporary substitution
sed '
/n/{P;d;}
/<H3/s/[><]/n/4g
/HREF/s/"/n/g
D
' bookmarks.htm
For non-GNU versions of sed:
sed '
/n/{P;d;} #if there is more then 1 line «P»rint 1st line then «d»elete all
/</H3/s//n/ #replace «</H3» by «n»ewline
/n/s/">/n/ #replace «">» by «n»ewline if previous command is executed
/HREF/s/"/n/g #put «n»ewline» around url if «HREF» in line
D #«D»elete 1 first line, go to start
' bookmarks.htm
Thanks, that gives the urls but not the headers, trying to adapt the part: /<H3/s/[><]/n/4g
– HenrikJson
Feb 20 '17 at 21:33
1
@HenrikJson It possible if you use non-GNU sed:4g
construction is not recognized. In the case you have to substitute it by/</H3/s//n/;/n/s/">/n/
– Costas
Feb 21 '17 at 6:17
Costas, if you add that comment + your original command as answer i will mark it as correct. please if you have time also add annotations how to read the sed command. parts of are clear to me but not all
– HenrikJson
Feb 21 '17 at 8:11
@HenrikJson see updated
– Costas
Feb 21 '17 at 10:06
add a comment |
Using a xml / html parser / processor has some advantages. Xpath expressions are the standard way to select specific parts.
xml + xmlstarlet + xpath
If the input is well formed xml we can use xmlstarlet + xpath expression:
xmlstarlet sel -t -v '//h3|//a/@href' -nl bookmarks.html
html + xmllint : xml
If the input is just valid html, we can convert it to xml (using xmllint) and use the previous:
xmllint -html -xmlout ex.html | xmlstarlet sel -t -v '//h3|//a/@href' -nl -
xmllint + xpath
We can use xmllint + xpath expression, directly
xmllint -html -xpath '//h3/text()|//a/@href' bookmarks.html
... but the output format is not the same...
Could you explain what this is ?
– LukeM
Feb 20 '17 at 0:53
@DarkHeart, I added some more information.
– JJoao
Feb 20 '17 at 12:58
on the cygwin I am running, neither xmllint nor xpath available
– HenrikJson
Feb 20 '17 at 21:29
1
@HenrikJson, you can install both xmllint (setup-x86_64 -qP libxml2
) and xmlstarlet in cygwin.
– JJoao
Feb 20 '17 at 23:41
add a comment |
One last answer: this time a one-ligner perl
perl -nE 'say $1 if (/<h3.*?>(.*?)</h3>/i or /href="(.*?)"/i)' ex.html
(I believe that xml parser based solutions are better, but since you have a
tool-generated file, the amount of surprises should not be very high)
add a comment |
For now I discarded demand for one-liner and did it as a script instead.
I had to post this as a response as it would have been too long for a comment. Still, feel free to respond.
This script does the job but is too sluggish, can anyone speed it up or alternatively suggest a one-liner?
#!/bin/sh
file=$1
while IFS= read -r line
do
hdr=$(echo $line | awk -F'[<>]' '/H3/{print $5}')
url=$(echo $line | awk -F'"' '/HREF/{print $2}')
if [ ${url} ]; then
echo $url
elif [ ${hdr} ]; then
echo $hdr
fi
done <"$file"
Here the file: (finally got it)
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1>Bookmarks</h1>
<dl>
<dd>
<DT><H3 ADD_DATE="1484311924" LAST_MODIFIED="1485532328">UNIX</H3>
<dl>
<dt><a HREF="http://unix.stackexchange.com/questions/223182/how-to-replace-spaces-in-all-file-names-with-underscore-in-linux-using-shell-scr" add_date="1484311897">url-1</a></dt>
<dt><a HREF="http://unix.stackexchange.com/questions/81349/how-do-i-use-find-when-the-filename-contains-spaces" add_date="1484738308">url-2</a></dt>
</dl>
</dd>
<dd>
<DT><H3 ADD_DATE="1486550854" LAST_MODIFIED="1487228526">OCE</H3>
<dl>
<dt><a HREF="http://www.oraclecertificationprep.com/apex/f?p=OCPSG%3AEXAM_DETAILS%3A%3A%3ANO%3A%3AP2_EXAM%3A1Z0-061" add_date="1486550866">url-3</a></dt>
<dt><a HREF="http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getpage?page_id=303&p_certName=SQ1Z0_047" add_date="1486550898">url-4</a></dt>
<dt><a HREF="https://www.quora.com/How-do-you-prepare-for-an-Oracle-Database-SQL-exam" add_date="1486550950">url-5</a></dt>
</dl>
</dd>
<dd>
<DT><H3 ADD_DATE="1487084050" LAST_MODIFIED="1487228595">ANDROID</H3>
<dl>
<dt><a HREF="https://material.io/guidelines/style/color.html#" add_date="1487228526">url-6</a></dt>
<dt><a HREF="https://developer.android.com/index.html" add_date="1487228539">url-7</a></dt>
</dl>
</dd>
</dl>
</body>
</html>
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f346175%2fawk-to-match-and-cut-out-fields-with-alternating-delimiter%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
This is wrong way to process html-files with sed/awk/… There are few special parsers but as temporary substitution
sed '
/n/{P;d;}
/<H3/s/[><]/n/4g
/HREF/s/"/n/g
D
' bookmarks.htm
For non-GNU versions of sed:
sed '
/n/{P;d;} #if there is more then 1 line «P»rint 1st line then «d»elete all
/</H3/s//n/ #replace «</H3» by «n»ewline
/n/s/">/n/ #replace «">» by «n»ewline if previous command is executed
/HREF/s/"/n/g #put «n»ewline» around url if «HREF» in line
D #«D»elete 1 first line, go to start
' bookmarks.htm
Thanks, that gives the urls but not the headers, trying to adapt the part: /<H3/s/[><]/n/4g
– HenrikJson
Feb 20 '17 at 21:33
1
@HenrikJson It possible if you use non-GNU sed:4g
construction is not recognized. In the case you have to substitute it by/</H3/s//n/;/n/s/">/n/
– Costas
Feb 21 '17 at 6:17
Costas, if you add that comment + your original command as answer i will mark it as correct. please if you have time also add annotations how to read the sed command. parts of are clear to me but not all
– HenrikJson
Feb 21 '17 at 8:11
@HenrikJson see updated
– Costas
Feb 21 '17 at 10:06
add a comment |
This is wrong way to process html-files with sed/awk/… There are few special parsers but as temporary substitution
sed '
/n/{P;d;}
/<H3/s/[><]/n/4g
/HREF/s/"/n/g
D
' bookmarks.htm
For non-GNU versions of sed:
sed '
/n/{P;d;} #if there is more then 1 line «P»rint 1st line then «d»elete all
/</H3/s//n/ #replace «</H3» by «n»ewline
/n/s/">/n/ #replace «">» by «n»ewline if previous command is executed
/HREF/s/"/n/g #put «n»ewline» around url if «HREF» in line
D #«D»elete 1 first line, go to start
' bookmarks.htm
Thanks, that gives the urls but not the headers, trying to adapt the part: /<H3/s/[><]/n/4g
– HenrikJson
Feb 20 '17 at 21:33
1
@HenrikJson It possible if you use non-GNU sed:4g
construction is not recognized. In the case you have to substitute it by/</H3/s//n/;/n/s/">/n/
– Costas
Feb 21 '17 at 6:17
Costas, if you add that comment + your original command as answer i will mark it as correct. please if you have time also add annotations how to read the sed command. parts of are clear to me but not all
– HenrikJson
Feb 21 '17 at 8:11
@HenrikJson see updated
– Costas
Feb 21 '17 at 10:06
add a comment |
This is wrong way to process html-files with sed/awk/… There are few special parsers but as temporary substitution
sed '
/n/{P;d;}
/<H3/s/[><]/n/4g
/HREF/s/"/n/g
D
' bookmarks.htm
For non-GNU versions of sed:
sed '
/n/{P;d;} #if there is more then 1 line «P»rint 1st line then «d»elete all
/</H3/s//n/ #replace «</H3» by «n»ewline
/n/s/">/n/ #replace «">» by «n»ewline if previous command is executed
/HREF/s/"/n/g #put «n»ewline» around url if «HREF» in line
D #«D»elete 1 first line, go to start
' bookmarks.htm
This is wrong way to process html-files with sed/awk/… There are few special parsers but as temporary substitution
sed '
/n/{P;d;}
/<H3/s/[><]/n/4g
/HREF/s/"/n/g
D
' bookmarks.htm
For non-GNU versions of sed:
sed '
/n/{P;d;} #if there is more then 1 line «P»rint 1st line then «d»elete all
/</H3/s//n/ #replace «</H3» by «n»ewline
/n/s/">/n/ #replace «">» by «n»ewline if previous command is executed
/HREF/s/"/n/g #put «n»ewline» around url if «HREF» in line
D #«D»elete 1 first line, go to start
' bookmarks.htm
edited Feb 21 '17 at 10:03
answered Feb 19 '17 at 21:53
CostasCostas
12.6k1129
12.6k1129
Thanks, that gives the urls but not the headers, trying to adapt the part: /<H3/s/[><]/n/4g
– HenrikJson
Feb 20 '17 at 21:33
1
@HenrikJson It possible if you use non-GNU sed:4g
construction is not recognized. In the case you have to substitute it by/</H3/s//n/;/n/s/">/n/
– Costas
Feb 21 '17 at 6:17
Costas, if you add that comment + your original command as answer i will mark it as correct. please if you have time also add annotations how to read the sed command. parts of are clear to me but not all
– HenrikJson
Feb 21 '17 at 8:11
@HenrikJson see updated
– Costas
Feb 21 '17 at 10:06
add a comment |
Thanks, that gives the urls but not the headers, trying to adapt the part: /<H3/s/[><]/n/4g
– HenrikJson
Feb 20 '17 at 21:33
1
@HenrikJson It possible if you use non-GNU sed:4g
construction is not recognized. In the case you have to substitute it by/</H3/s//n/;/n/s/">/n/
– Costas
Feb 21 '17 at 6:17
Costas, if you add that comment + your original command as answer i will mark it as correct. please if you have time also add annotations how to read the sed command. parts of are clear to me but not all
– HenrikJson
Feb 21 '17 at 8:11
@HenrikJson see updated
– Costas
Feb 21 '17 at 10:06
Thanks, that gives the urls but not the headers, trying to adapt the part: /<H3/s/[><]/n/4g
– HenrikJson
Feb 20 '17 at 21:33
Thanks, that gives the urls but not the headers, trying to adapt the part: /<H3/s/[><]/n/4g
– HenrikJson
Feb 20 '17 at 21:33
1
1
@HenrikJson It possible if you use non-GNU sed:
4g
construction is not recognized. In the case you have to substitute it by /</H3/s//n/;/n/s/">/n/
– Costas
Feb 21 '17 at 6:17
@HenrikJson It possible if you use non-GNU sed:
4g
construction is not recognized. In the case you have to substitute it by /</H3/s//n/;/n/s/">/n/
– Costas
Feb 21 '17 at 6:17
Costas, if you add that comment + your original command as answer i will mark it as correct. please if you have time also add annotations how to read the sed command. parts of are clear to me but not all
– HenrikJson
Feb 21 '17 at 8:11
Costas, if you add that comment + your original command as answer i will mark it as correct. please if you have time also add annotations how to read the sed command. parts of are clear to me but not all
– HenrikJson
Feb 21 '17 at 8:11
@HenrikJson see updated
– Costas
Feb 21 '17 at 10:06
@HenrikJson see updated
– Costas
Feb 21 '17 at 10:06
add a comment |
Using a xml / html parser / processor has some advantages. Xpath expressions are the standard way to select specific parts.
xml + xmlstarlet + xpath
If the input is well formed xml we can use xmlstarlet + xpath expression:
xmlstarlet sel -t -v '//h3|//a/@href' -nl bookmarks.html
html + xmllint : xml
If the input is just valid html, we can convert it to xml (using xmllint) and use the previous:
xmllint -html -xmlout ex.html | xmlstarlet sel -t -v '//h3|//a/@href' -nl -
xmllint + xpath
We can use xmllint + xpath expression, directly
xmllint -html -xpath '//h3/text()|//a/@href' bookmarks.html
... but the output format is not the same...
Could you explain what this is ?
– LukeM
Feb 20 '17 at 0:53
@DarkHeart, I added some more information.
– JJoao
Feb 20 '17 at 12:58
on the cygwin I am running, neither xmllint nor xpath available
– HenrikJson
Feb 20 '17 at 21:29
1
@HenrikJson, you can install both xmllint (setup-x86_64 -qP libxml2
) and xmlstarlet in cygwin.
– JJoao
Feb 20 '17 at 23:41
add a comment |
Using a xml / html parser / processor has some advantages. Xpath expressions are the standard way to select specific parts.
xml + xmlstarlet + xpath
If the input is well formed xml we can use xmlstarlet + xpath expression:
xmlstarlet sel -t -v '//h3|//a/@href' -nl bookmarks.html
html + xmllint : xml
If the input is just valid html, we can convert it to xml (using xmllint) and use the previous:
xmllint -html -xmlout ex.html | xmlstarlet sel -t -v '//h3|//a/@href' -nl -
xmllint + xpath
We can use xmllint + xpath expression, directly
xmllint -html -xpath '//h3/text()|//a/@href' bookmarks.html
... but the output format is not the same...
Could you explain what this is ?
– LukeM
Feb 20 '17 at 0:53
@DarkHeart, I added some more information.
– JJoao
Feb 20 '17 at 12:58
on the cygwin I am running, neither xmllint nor xpath available
– HenrikJson
Feb 20 '17 at 21:29
1
@HenrikJson, you can install both xmllint (setup-x86_64 -qP libxml2
) and xmlstarlet in cygwin.
– JJoao
Feb 20 '17 at 23:41
add a comment |
Using a xml / html parser / processor has some advantages. Xpath expressions are the standard way to select specific parts.
xml + xmlstarlet + xpath
If the input is well formed xml we can use xmlstarlet + xpath expression:
xmlstarlet sel -t -v '//h3|//a/@href' -nl bookmarks.html
html + xmllint : xml
If the input is just valid html, we can convert it to xml (using xmllint) and use the previous:
xmllint -html -xmlout ex.html | xmlstarlet sel -t -v '//h3|//a/@href' -nl -
xmllint + xpath
We can use xmllint + xpath expression, directly
xmllint -html -xpath '//h3/text()|//a/@href' bookmarks.html
... but the output format is not the same...
Using a xml / html parser / processor has some advantages. Xpath expressions are the standard way to select specific parts.
xml + xmlstarlet + xpath
If the input is well formed xml we can use xmlstarlet + xpath expression:
xmlstarlet sel -t -v '//h3|//a/@href' -nl bookmarks.html
html + xmllint : xml
If the input is just valid html, we can convert it to xml (using xmllint) and use the previous:
xmllint -html -xmlout ex.html | xmlstarlet sel -t -v '//h3|//a/@href' -nl -
xmllint + xpath
We can use xmllint + xpath expression, directly
xmllint -html -xpath '//h3/text()|//a/@href' bookmarks.html
... but the output format is not the same...
edited Feb 20 '17 at 15:33
answered Feb 20 '17 at 0:17
JJoaoJJoao
7,1691928
7,1691928
Could you explain what this is ?
– LukeM
Feb 20 '17 at 0:53
@DarkHeart, I added some more information.
– JJoao
Feb 20 '17 at 12:58
on the cygwin I am running, neither xmllint nor xpath available
– HenrikJson
Feb 20 '17 at 21:29
1
@HenrikJson, you can install both xmllint (setup-x86_64 -qP libxml2
) and xmlstarlet in cygwin.
– JJoao
Feb 20 '17 at 23:41
add a comment |
Could you explain what this is ?
– LukeM
Feb 20 '17 at 0:53
@DarkHeart, I added some more information.
– JJoao
Feb 20 '17 at 12:58
on the cygwin I am running, neither xmllint nor xpath available
– HenrikJson
Feb 20 '17 at 21:29
1
@HenrikJson, you can install both xmllint (setup-x86_64 -qP libxml2
) and xmlstarlet in cygwin.
– JJoao
Feb 20 '17 at 23:41
Could you explain what this is ?
– LukeM
Feb 20 '17 at 0:53
Could you explain what this is ?
– LukeM
Feb 20 '17 at 0:53
@DarkHeart, I added some more information.
– JJoao
Feb 20 '17 at 12:58
@DarkHeart, I added some more information.
– JJoao
Feb 20 '17 at 12:58
on the cygwin I am running, neither xmllint nor xpath available
– HenrikJson
Feb 20 '17 at 21:29
on the cygwin I am running, neither xmllint nor xpath available
– HenrikJson
Feb 20 '17 at 21:29
1
1
@HenrikJson, you can install both xmllint (
setup-x86_64 -qP libxml2
) and xmlstarlet in cygwin.– JJoao
Feb 20 '17 at 23:41
@HenrikJson, you can install both xmllint (
setup-x86_64 -qP libxml2
) and xmlstarlet in cygwin.– JJoao
Feb 20 '17 at 23:41
add a comment |
One last answer: this time a one-ligner perl
perl -nE 'say $1 if (/<h3.*?>(.*?)</h3>/i or /href="(.*?)"/i)' ex.html
(I believe that xml parser based solutions are better, but since you have a
tool-generated file, the amount of surprises should not be very high)
add a comment |
One last answer: this time a one-ligner perl
perl -nE 'say $1 if (/<h3.*?>(.*?)</h3>/i or /href="(.*?)"/i)' ex.html
(I believe that xml parser based solutions are better, but since you have a
tool-generated file, the amount of surprises should not be very high)
add a comment |
One last answer: this time a one-ligner perl
perl -nE 'say $1 if (/<h3.*?>(.*?)</h3>/i or /href="(.*?)"/i)' ex.html
(I believe that xml parser based solutions are better, but since you have a
tool-generated file, the amount of surprises should not be very high)
One last answer: this time a one-ligner perl
perl -nE 'say $1 if (/<h3.*?>(.*?)</h3>/i or /href="(.*?)"/i)' ex.html
(I believe that xml parser based solutions are better, but since you have a
tool-generated file, the amount of surprises should not be very high)
answered Feb 21 '17 at 8:27
JJoaoJJoao
7,1691928
7,1691928
add a comment |
add a comment |
For now I discarded demand for one-liner and did it as a script instead.
I had to post this as a response as it would have been too long for a comment. Still, feel free to respond.
This script does the job but is too sluggish, can anyone speed it up or alternatively suggest a one-liner?
#!/bin/sh
file=$1
while IFS= read -r line
do
hdr=$(echo $line | awk -F'[<>]' '/H3/{print $5}')
url=$(echo $line | awk -F'"' '/HREF/{print $2}')
if [ ${url} ]; then
echo $url
elif [ ${hdr} ]; then
echo $hdr
fi
done <"$file"
Here the file: (finally got it)
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1>Bookmarks</h1>
<dl>
<dd>
<DT><H3 ADD_DATE="1484311924" LAST_MODIFIED="1485532328">UNIX</H3>
<dl>
<dt><a HREF="http://unix.stackexchange.com/questions/223182/how-to-replace-spaces-in-all-file-names-with-underscore-in-linux-using-shell-scr" add_date="1484311897">url-1</a></dt>
<dt><a HREF="http://unix.stackexchange.com/questions/81349/how-do-i-use-find-when-the-filename-contains-spaces" add_date="1484738308">url-2</a></dt>
</dl>
</dd>
<dd>
<DT><H3 ADD_DATE="1486550854" LAST_MODIFIED="1487228526">OCE</H3>
<dl>
<dt><a HREF="http://www.oraclecertificationprep.com/apex/f?p=OCPSG%3AEXAM_DETAILS%3A%3A%3ANO%3A%3AP2_EXAM%3A1Z0-061" add_date="1486550866">url-3</a></dt>
<dt><a HREF="http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getpage?page_id=303&p_certName=SQ1Z0_047" add_date="1486550898">url-4</a></dt>
<dt><a HREF="https://www.quora.com/How-do-you-prepare-for-an-Oracle-Database-SQL-exam" add_date="1486550950">url-5</a></dt>
</dl>
</dd>
<dd>
<DT><H3 ADD_DATE="1487084050" LAST_MODIFIED="1487228595">ANDROID</H3>
<dl>
<dt><a HREF="https://material.io/guidelines/style/color.html#" add_date="1487228526">url-6</a></dt>
<dt><a HREF="https://developer.android.com/index.html" add_date="1487228539">url-7</a></dt>
</dl>
</dd>
</dl>
</body>
</html>
add a comment |
For now I discarded demand for one-liner and did it as a script instead.
I had to post this as a response as it would have been too long for a comment. Still, feel free to respond.
This script does the job but is too sluggish, can anyone speed it up or alternatively suggest a one-liner?
#!/bin/sh
file=$1
while IFS= read -r line
do
hdr=$(echo $line | awk -F'[<>]' '/H3/{print $5}')
url=$(echo $line | awk -F'"' '/HREF/{print $2}')
if [ ${url} ]; then
echo $url
elif [ ${hdr} ]; then
echo $hdr
fi
done <"$file"
Here the file: (finally got it)
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1>Bookmarks</h1>
<dl>
<dd>
<DT><H3 ADD_DATE="1484311924" LAST_MODIFIED="1485532328">UNIX</H3>
<dl>
<dt><a HREF="http://unix.stackexchange.com/questions/223182/how-to-replace-spaces-in-all-file-names-with-underscore-in-linux-using-shell-scr" add_date="1484311897">url-1</a></dt>
<dt><a HREF="http://unix.stackexchange.com/questions/81349/how-do-i-use-find-when-the-filename-contains-spaces" add_date="1484738308">url-2</a></dt>
</dl>
</dd>
<dd>
<DT><H3 ADD_DATE="1486550854" LAST_MODIFIED="1487228526">OCE</H3>
<dl>
<dt><a HREF="http://www.oraclecertificationprep.com/apex/f?p=OCPSG%3AEXAM_DETAILS%3A%3A%3ANO%3A%3AP2_EXAM%3A1Z0-061" add_date="1486550866">url-3</a></dt>
<dt><a HREF="http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getpage?page_id=303&p_certName=SQ1Z0_047" add_date="1486550898">url-4</a></dt>
<dt><a HREF="https://www.quora.com/How-do-you-prepare-for-an-Oracle-Database-SQL-exam" add_date="1486550950">url-5</a></dt>
</dl>
</dd>
<dd>
<DT><H3 ADD_DATE="1487084050" LAST_MODIFIED="1487228595">ANDROID</H3>
<dl>
<dt><a HREF="https://material.io/guidelines/style/color.html#" add_date="1487228526">url-6</a></dt>
<dt><a HREF="https://developer.android.com/index.html" add_date="1487228539">url-7</a></dt>
</dl>
</dd>
</dl>
</body>
</html>
add a comment |
For now I discarded demand for one-liner and did it as a script instead.
I had to post this as a response as it would have been too long for a comment. Still, feel free to respond.
This script does the job but is too sluggish, can anyone speed it up or alternatively suggest a one-liner?
#!/bin/sh
file=$1
while IFS= read -r line
do
hdr=$(echo $line | awk -F'[<>]' '/H3/{print $5}')
url=$(echo $line | awk -F'"' '/HREF/{print $2}')
if [ ${url} ]; then
echo $url
elif [ ${hdr} ]; then
echo $hdr
fi
done <"$file"
Here the file: (finally got it)
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1>Bookmarks</h1>
<dl>
<dd>
<DT><H3 ADD_DATE="1484311924" LAST_MODIFIED="1485532328">UNIX</H3>
<dl>
<dt><a HREF="http://unix.stackexchange.com/questions/223182/how-to-replace-spaces-in-all-file-names-with-underscore-in-linux-using-shell-scr" add_date="1484311897">url-1</a></dt>
<dt><a HREF="http://unix.stackexchange.com/questions/81349/how-do-i-use-find-when-the-filename-contains-spaces" add_date="1484738308">url-2</a></dt>
</dl>
</dd>
<dd>
<DT><H3 ADD_DATE="1486550854" LAST_MODIFIED="1487228526">OCE</H3>
<dl>
<dt><a HREF="http://www.oraclecertificationprep.com/apex/f?p=OCPSG%3AEXAM_DETAILS%3A%3A%3ANO%3A%3AP2_EXAM%3A1Z0-061" add_date="1486550866">url-3</a></dt>
<dt><a HREF="http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getpage?page_id=303&p_certName=SQ1Z0_047" add_date="1486550898">url-4</a></dt>
<dt><a HREF="https://www.quora.com/How-do-you-prepare-for-an-Oracle-Database-SQL-exam" add_date="1486550950">url-5</a></dt>
</dl>
</dd>
<dd>
<DT><H3 ADD_DATE="1487084050" LAST_MODIFIED="1487228595">ANDROID</H3>
<dl>
<dt><a HREF="https://material.io/guidelines/style/color.html#" add_date="1487228526">url-6</a></dt>
<dt><a HREF="https://developer.android.com/index.html" add_date="1487228539">url-7</a></dt>
</dl>
</dd>
</dl>
</body>
</html>
For now I discarded demand for one-liner and did it as a script instead.
I had to post this as a response as it would have been too long for a comment. Still, feel free to respond.
This script does the job but is too sluggish, can anyone speed it up or alternatively suggest a one-liner?
#!/bin/sh
file=$1
while IFS= read -r line
do
hdr=$(echo $line | awk -F'[<>]' '/H3/{print $5}')
url=$(echo $line | awk -F'"' '/HREF/{print $2}')
if [ ${url} ]; then
echo $url
elif [ ${hdr} ]; then
echo $hdr
fi
done <"$file"
Here the file: (finally got it)
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1>Bookmarks</h1>
<dl>
<dd>
<DT><H3 ADD_DATE="1484311924" LAST_MODIFIED="1485532328">UNIX</H3>
<dl>
<dt><a HREF="http://unix.stackexchange.com/questions/223182/how-to-replace-spaces-in-all-file-names-with-underscore-in-linux-using-shell-scr" add_date="1484311897">url-1</a></dt>
<dt><a HREF="http://unix.stackexchange.com/questions/81349/how-do-i-use-find-when-the-filename-contains-spaces" add_date="1484738308">url-2</a></dt>
</dl>
</dd>
<dd>
<DT><H3 ADD_DATE="1486550854" LAST_MODIFIED="1487228526">OCE</H3>
<dl>
<dt><a HREF="http://www.oraclecertificationprep.com/apex/f?p=OCPSG%3AEXAM_DETAILS%3A%3A%3ANO%3A%3AP2_EXAM%3A1Z0-061" add_date="1486550866">url-3</a></dt>
<dt><a HREF="http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getpage?page_id=303&p_certName=SQ1Z0_047" add_date="1486550898">url-4</a></dt>
<dt><a HREF="https://www.quora.com/How-do-you-prepare-for-an-Oracle-Database-SQL-exam" add_date="1486550950">url-5</a></dt>
</dl>
</dd>
<dd>
<DT><H3 ADD_DATE="1487084050" LAST_MODIFIED="1487228595">ANDROID</H3>
<dl>
<dt><a HREF="https://material.io/guidelines/style/color.html#" add_date="1487228526">url-6</a></dt>
<dt><a HREF="https://developer.android.com/index.html" add_date="1487228539">url-7</a></dt>
</dl>
</dd>
</dl>
</body>
</html>
answered Feb 20 '17 at 21:59
HenrikJsonHenrikJson
308
308
add a comment |
add a comment |
Thanks for contributing an answer to Unix & Linux Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f346175%2fawk-to-match-and-cut-out-fields-with-alternating-delimiter%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
4
Don't post screenshots of text, paste the actual text...
– jasonwryan
Feb 19 '17 at 21:16
text is very long and ugly formatting. i tcan't be added as it contians URLs and as a beginner i am not allowed > 1 URL in my post
– HenrikJson
Feb 19 '17 at 21:18
Use the
{}
button to format the text as code.– Gilles
Feb 19 '17 at 21:37
the {} produced something that looks like only a partial code extract -> no good
– HenrikJson
Feb 19 '17 at 21:49
1
You don't need a one-liner to make it easy to script, but here is one anyway:
awk -F'[<>]' '/<H3/{print $5} /HREF="/{sub(/[^"]*"/,"");sub(/".*/,"");print}' bookmark.html
– dave_thompson_085
Feb 21 '17 at 2:53