Differences in raw text between downloaded PDF and the same PDF embedded in XML [closed]
I am looking at publicly available company filings on the SEC's EDGAR database. For each filing, there is a .txt file containing detailed information about the filing in what looks like XML format (I'm a beginner). Sometimes this is immediately useful text, but in a few cases, the information is a PDF file that appears to be embedded in a raw format that looks like ASCII. For example,
<PDF>
begin 644 filename1.pdf
M)5!$1BTQ+C4-)>+CS],-"C(X(#`@;V)J#3P+TQI;F5A<FEZ960@,2],(#0T
M-34Y+T@,S`O12`R-S@T,B].(#0O5"`T-#,P,B]((%L@-#0Q(#(P.%T^/@UE
M;F1O8FH-("`@("`@("`@("`@("`@("`@#0HS."`P(&]B:@T/"],96YG=&@@
M-C,O4F]O="`R.2`P(%(O241;/$1#0S%%,T$W,S9%0S8V-#`R-C-$.3DS1C(R
...
[...lots of text like this...]
...
)#0HE)45/1@T*
`
end
</PDF>
The fact that it's raw isn't surprising. What's surprising to me as a novice is that (1) if I try to copy/paste that raw text into Notepad++ and save as .pdf, Acrobat can't read the file, and (2) when I download (using Chrome) the actual .pdf from the filing, which is available elsewhere on the EDGAR system, and open it up in Notepad++, the raw text looks much different from the XML-file raw text, even though I expect them to encode the same file. For example,
%PDF-1.5
%âãÏÓ
28 0 obj
<</Linearized 1/L 44559/O 30/E 27842/N 4/T 44302/H [ 441 208]>>
endobj
38 0 obj
<</Length 63/Root 29 0 R/ID[<DCC1E3A736EC6640263D993F227A4DC8><71A0C1AA5F566D44A5466B14A0F219D4>]/Info 27 0 R/Filter/FlateDecode/W[1 2 1]/Index[28 23]/DecodeParms<</Columns 4/Predictor 12>>/Size 51/Prev 44303/Type/XRef>>stream
xÚbbd``b`ª@‚± H0{ ¶‡@‚»Ä
Ö§a¬Ÿ˜Vƒt00’Fügœõ
À =¸ ê
endstream
endobj
...
The files I'm talking about can be found here:
The .txt file
The .pdf file
Why can't Acrobat read the raw text from the XML .txt file? Is there a way to alter that easily so it's readable? Why does the raw text look so different when I download the actual PDF? Are they different representations of the same file, or is the published .pdf actually a much different file from what could be extracted from the .txt file?
I tried searching for information on ASCII and different types of Unicode, and found The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), but didn't see how to apply it to PDF files. I tried searching for how to extract PDF files embedded in XML, but did not find an answer that helped. I tried converting among encoding types in Notepad++, which was not fruitful.
pdf notepad++ adobe-acrobat xml
closed as too broad by DavidPostill♦ Jan 23 at 21:39
Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.
This would be an answer if the question hadn't been closed already: the "text-like" PDF you showed is really a uuencoded version. When you download it from the PDF-link using Chrome, it's downloading the actual binary PDF, not the uuencoded version. If you saved the text-like version in Notepad++ as some.pdf.uu, then ran a uudecode on some.pdf.uu, it would extract the PDF in something Acrobat could read.
– PeterCJ
Jan 29 at 18:08
The above comment answered my question. I can't accept an answer because the question is closed, and I can't upvote the comment due to insufficient reputation.
– Attila the Fun
Jan 29 at 19:26
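For anyone without a uudecode tool at hand, the same step can be sketched in Python. This is only a minimal sketch under some assumptions: the function name `extract_uuencoded` is made up for illustration, the input is the text of the .txt filing read into a string, and only the first `begin ... end` block is decoded. It relies on the standard library's `binascii.a2b_uu`, which decodes one uuencoded line at a time.

```python
import binascii


def extract_uuencoded(text):
    """Return (filename, data) for the first uuencoded block in text.

    Hypothetical helper: looks for a 'begin MODE NAME' header, decodes
    every following line, and stops at the closing 'end' line.
    """
    filename = None
    chunks = []
    for line in text.splitlines():
        if filename is None:
            # Header looks like: begin 644 filename1.pdf
            parts = line.split(None, 2)
            if len(parts) == 3 and parts[0] == "begin" and parts[1].isdigit():
                filename = parts[2].strip()
            continue
        stripped = line.strip()
        if stripped == "end":
            return filename, b"".join(chunks)
        if stripped in ("", "`"):
            # Zero-length terminator line (or blank line): carries no data.
            continue
        # Each data line starts with a length character followed by
        # groups of 4 printable characters that encode 3 bytes each.
        chunks.append(binascii.a2b_uu(line.encode("ascii")))
    raise ValueError("no complete uuencoded block found")
```

Writing the returned bytes to disk with `open(filename, "wb")` (binary mode, not text mode) should give a file a PDF reader can open, provided the pasted text was not altered along the way. As a sanity check on the excerpt in the question, the first eight data characters `)5!$1BTQ` decode to the bytes `%PDF-1`, which is how the downloaded binary file starts.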
asked Jan 23 at 17:13
Attila the Fun