Strange character in a file












I have a UTF-8 file that contains a strange character -- visible to me just as

<96>


This is how it appears in vi:

[screenshot]

and how it appears in gedit:

[screenshot]

and how it appears under LibreOffice:

[screenshot]



and that makes a series of basic Unix tools misbehave, including:





  1. cat file makes the character disappear, and more as well

  2. I cannot copy and paste it inside vi/vim -- a search for it will not even find itself

  3. grep fails to display anything as well, as if the character did not exist.


The "file" program works fine and recognizes it as a UTF-8 file. I also know that, because of the nature of the file, it most likely came from a copy and paste from the web, and that the character originally represented an em dash.



My basic questions are:
1. Is there anything wrong with this file?
2. How can I search for other occurrences of it inside the same file?
3. How can I grep for other files that may contain the same problem/character?



The file can be found here: file.txt










unicode character-encoding






asked 4 hours ago by Paulo Ney, edited 3 hours ago




  • First step is to hexdump -C filename and look at the encoding of what is "visible" to you as <96>. Context should help to pinpoint it.

    – dirkt
    4 hours ago











  • (1) What does «visible to me just as “<96>”» mean?  (2) What grep command are you using?

    – G-Man
    4 hours ago











  • @dirkt, the context points to the character being an EMDASH and hexdump -C shows c2 96. How can I search for other occurrences of the same thing?

    – Paulo Ney
    4 hours ago











  • @G-Man, you can download the file, the character shows like that in vi/vim, for example, and I am using stock "grep" on Ubuntu 18.04.

    – Paulo Ney
    4 hours ago











  • Is there any tool that can manage this character in a good way? I'm thinking of a word processor like LibreOffice Writer or a simple text editor like gedit, once you have set it to manage your language and UTF-8. In that case you can remove the character.

    – sudodus
    3 hours ago
















2 Answers
This file contains bytes C2 96, which are the UTF-8 encoding of codepoint U+0096. That codepoint is one of the C1 control characters commonly called SPA "Start of Guarded Area" (or "Protected Area"). That isn't a useful character for any modern system, but it's unlikely to be harmful that it's there.



The original source for this was likely a byte 0x96 in some single-byte 8-bit encoding that was transcoded incorrectly somewhere along the way. Probably it was originally a Windows CP1252 en dash "–", which has byte value 0x96 in that encoding (most other plausible candidates have the C1 control set at positions 80-9F). It was then translated to UTF-8 as though it were Latin-1 (ISO/IEC 8859-1), which is a not uncommon mistake. That would lead to the byte being interpreted as the control character and translated accordingly, as you've seen.
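To make the mistranslation concrete, here is a minimal sketch, assuming an iconv that knows the CP1252 and ISO-8859-1 charset names (glibc's does) and using a made-up sample string. It decodes the same byte 0x96 once as Latin-1 and once as CP1252, then dumps the resulting UTF-8 bytes in hex:

```shell
# Sample input: "1", byte 0x96 (octal 226), "95" (a hypothetical page range).
sample() { printf '1\22695'; }

# Wrong guess: treat 0x96 as Latin-1, where it is the C1 control U+0096.
as_latin1=$(sample | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

# Right guess: treat 0x96 as CP1252, where it is the en dash U+2013.
as_cp1252=$(sample | iconv -f CP1252 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "decoded as Latin-1: $as_latin1"   # 31c2963935   (C2 96 in the middle)
echo "decoded as CP1252:  $as_cp1252"   # 31e280933935 (E2 80 93 in the middle)
```

The first result contains exactly the C2 96 pair seen in the question's hexdump, which is what makes the CP1252-read-as-Latin-1 theory plausible.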





You can fix this file with the iconv tool, which is part of glibc.



iconv -f utf-8 -t iso-8859-1 < mwe.txt | iconv -f cp1252 -t utf-8


produces a correct version of your minimal example for me. It works by first converting the UTF-8 to Latin-1 (inverting the earlier mistranslation), and then reinterpreting that as CP1252 to convert it back to UTF-8 correctly.



It does depend on what else is in the real file, however. If you have characters outside Latin-1 elsewhere it will fail because it can't encode those correctly at the first step.



If you don't have iconv, or it doesn't work for the real file, you can replace the bytes directly using sed:



LC_ALL=C sed -e $'s/\xc2\x96/\xe2\x80\x93/g' < mwe.txt

This replaces C2 96 with the UTF-8 en dash encoding E2 80 93. You could also replace it with e.g. a hyphen or two by changing \xe2\x80\x93 into --.
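As a rough way to check that the substitution did what you expect, the sketch below runs it on a throwaway file and confirms that no C2 96 pair is left. The temporary file and the sample line are invented for the example, and the byte sequences are built with printf octal escapes so that it also works in shells without $'...' quoting:

```shell
# Throwaway file containing one bad sequence (C2 96) between "1" and "95".
tmp=$(mktemp)
printf 'pages={1\302\22695},\n' > "$tmp"

bad=$(printf '\302\226')        # UTF-8 encoding of U+0096
dash=$(printf '\342\200\223')   # UTF-8 encoding of the en dash, U+2013

# Byte-wise replacement, written to a new file so the original stays intact.
LC_ALL=C sed -e "s/$bad/$dash/g" "$tmp" > "$tmp.fixed"

# grep -c exits non-zero when the count is 0, hence the "|| true".
remaining=$(LC_ALL=C grep -c -e "$bad" "$tmp.fixed" || true)
fixed_hex=$(od -An -tx1 "$tmp.fixed" | tr -d ' \n')
echo "lines still containing C2 96: $remaining"

rm -f "$tmp" "$tmp.fixed"
```

Writing to a separate file first means you can diff the two versions before overwriting anything.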





You can grep in a similar fashion. We're using LC_ALL=C to make sure we're reading the actual bytes, and not having grep interpret things:



LC_ALL=C grep -R $'\xc2\x96' .


will list out everywhere under this directory those bytes appear. You may want to limit it to just text files if you have mixed content around, since binary files will include any pair of bytes fairly often.
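One way to restrict the search to text files is to let find pick the candidates by name and have grep print only the names of matching files. This is a sketch under assumptions: the .bib extension and the scratch directory are placeholders for whatever your real files look like.

```shell
# Scratch directory with one affected and one clean file (hypothetical names).
dir=$(mktemp -d)
printf 'pages={1\302\22695},\n' > "$dir/bad.bib"
printf 'pages={1--95},\n' > "$dir/good.bib"

bad=$(printf '\302\226')   # the C2 96 byte pair

# find selects files by name; grep -l prints only the names of files that match.
# LC_ALL=C is inherited by grep, so the pattern is compared byte by byte.
matches=$(LC_ALL=C find "$dir" -type f -name '*.bib' -exec grep -l -e "$bad" {} +)
echo "$matches"

rm -r "$dir"
```

Only bad.bib is listed. Swap the -name test for whatever identifies your text files.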







answered 3 hours ago by Michael Homer, edited 3 hours ago













  • Is there a way I can search other files for the same occurrence, with something like grep?

    – Paulo Ney
    3 hours ago






  • Yes, you can use grep $'\xc2\x96' (last section).

    – Michael Homer
    3 hours ago











  • Is the file a "valid" UTF-8 file?

    – Paulo Ney
    3 hours ago






  • Yes, it's a perfectly correct encoding of a not-very-useful character.

    – Michael Homer
    3 hours ago



















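It's most likely meant to be an en dash: byte 0x96 is the en dash in Windows CP1252 (ASCII itself stops at 0x7F). The c2 byte preceding it is the lead byte of a two-byte UTF-8 sequence; together, c2 96 encodes U+0096. Someone else could explain more precisely about it.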



To search for other occurrences in vim, put your cursor over it in command mode, hit yl (yank one character), then type / followed by Ctrl+r and ". (Ctrl+r lets you insert the contents of a register into the command line, and the " register holds whatever was last yanked.)



Just replace it with two hyphens if you want it to render in your terminal. If that is a bibtex file that you have, then two hyphens are the appropriate way to key it in.



To show how you can find occurrences of the character, you can pipe it through a hexdump tool like xxd.



$ cat tmp | xxd | grep c296
00000000: 7061 6765 733d 7b31 c296 3935 7d2c 0a70 pages={1..95},.p
00000020: 6765 733d 7b31 c296 3935 7d2c 0a70 6167 ges={1..95},.pag
00000040: 733d 7b31 c296 3935 7d2c 0a70 6167 6573 s={1..95},.pages
00000060: 7b31 c296 3935 7d2c 0a70 6167 6573 3d7b {1..95},.pages={
00000080: c296 3935 7d2c 0a70 6167 6573 3d7b 31c2 ..95},.pages={1.
00000090: 9639 357d 2c0a 7061 6765 733d 7b31 c296 .95},.pages={1..
000000b0: 357d 2c0a 7061 6765 733d 7b31 c296 3935 5},.pages={1..95
000000d0: 2c0a 7061 6765 733d 7b31 c296 3935 7d2c ,.pages={1..95},
000000f0: 7061 6765 733d 7b31 c296 3935 7d2c 0a70 pages={1..95},.p
00000110: 6765 733d 7b31 c296 3935 7d2c 0a70 6167 ges={1..95},.pag
00000130: 733d 7b31 c296 3935 7d2c 0a70 6167 6573 s={1..95},.pages
00000150: 7b31 c296 3935 7d2c 0a70 6167 6573 3d7b {1..95},.pages={
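One caveat with grepping a hex dump: xxd groups bytes in pairs by default, so a C2 96 that starts at an odd offset, or straddles the end of a dump line, is printed with a space or line break in the middle and the pattern c296 misses it. Here is a sketch of a byte-accurate count instead, assuming a grep that supports the -a and -o options (GNU and BSD grep do) and using invented sample data:

```shell
bad=$(printf '\302\226')   # the C2 96 byte pair

# Sample data with two occurrences, both starting at odd byte offsets (1 and 5).
# -a: treat the input as text; -o: print each match on its own line.
count=$(printf 'a\302\226b\n\302\226c\n' | LC_ALL=C grep -a -o -e "$bad" | wc -l | tr -d ' ')
echo "occurrences: $count"   # occurrences: 2
```

To count in a real file, replace the printf with the file name as an argument to grep.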


























  • Nice. Is there a way I can search for occurrences in other files with something like grep?

    – Paulo Ney
    3 hours ago






  • 2





    There is no 0x96 value in ASCII - presumably it was in an 8-bit encoding originally (I've speculated cp1252, but there are other options).

    – Michael Homer
    3 hours ago











  • @MichaelHomer thanks for the correction.

    – jlovegren
    3 hours ago











  • @PauloNey you can pass it through a hex dump util like xxd. See my updated answer.

    – jlovegren
    3 hours ago


























































