How to do a regex search in a UTF-16LE file while in a UTF-8 locale?












3















EDIT: A comment by Warren Young made me realize that I was not clear on one quite relevant point. My search string is already in UTF-16LE byte order (not in Unicode code point order, which corresponds to UTF-16BE), so perhaps the Unicode issue is somewhat moot.



Perhaps my issue is really a question of how to grep for bytes (not chars) in groups of 2 bytes, i.e. so that UTF-16LE \x09\x0A is not treated as TAB, newline, but just as 2 bytes which happen to be UTF-16LE? Note: I do not need to be concerned about UTF-16 surrogate pairs, so 2-byte blocks are fine.



Here is a sample pattern for the 3-character string ऊपर:

  • \x09\x0A\x09\x2A\x09\x30

    but it returns nothing, though the string is in the file.




(here is the original post)

When searching a UTF-16LE file with a pattern in \x00\x01\x... format, I have encountered problems for some values. I've been using sed (and experimented with grep), but because I am in a UTF-8 locale they recognize some UTF-16LE values as ASCII characters. I'm locked in to using UTF-16, so recoding to UTF-8 is not an option.



E.g. in this text, ऊ (U+090A), though it is a single character, is perceived as two ASCII chars, \x09 and \x0A.
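To make the clash concrete, here is a minimal sketch (assuming iconv and od are installed) that dumps the UTF-16LE bytes of ऊ. The low byte comes first, so the file actually holds 0x0A (the newline byte) followed by 0x09 (TAB), and a line-oriented tool working on raw bytes will split the line in the middle of the character.

```shell
# Encode the single character U+090A as UTF-16LE and dump its bytes in hex.
# tr strips the spaces and newline that od adds around the hex digits.
bytes=$(printf 'ऊ' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$bytes"   # low byte first: 0a09
```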



grep has a -P (perl) option which can search for \x00\x... patterns, but I'm getting the same ASCII interpretation.



Is there some way to use grep -P to search in a UTF-16 mode, or, perhaps better, how can this be done in perl or some other script?



grep seems to be the most appealing because of its compactness, but whatever gets the job done will override that preference.



PS: My example uses a literal string, but my actual usage needs a regex-style search. So this perl example is not quite what I'm after, though it does process the file as UTF-16. I'd prefer not to have to open and close the file explicitly; I think perl has more compact ways to do basic things like a regex search, and I'm after something with that kind of compact syntax.










text-processing grep regular-expression perl unicode






asked Jun 9 '12 at 10:44 by Peter.O
edited Apr 13 '17 at 12:36 by Community













  • I'm not so sure regexp machinery is really up to snuff with respect to UTF-8, much less other Unicode encodings. It will mostly work on UTF-8, as long as characters that are represented by several bytes do not appear in character sets or as arguments to repetition. E.g., [ña-z] will probably do surprising stuff, and so will gñ* or g[ñn]u, but g(ñ)* and g(n|ñ)u should work fine (it just means something different than you see ;-). The machinery is 8-bit clean nowadays, and swallows the UTF-8 bytes without complaint, but doesn't combine them into characters.

    – vonbrand
    Jan 23 '13 at 14:30

































3 Answers


















8














My answer is essentially the same as in your other question on this topic:



$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern


As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
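If the result has to end up as UTF-16LE again, the same idea can be extended with a second iconv at the end of the pipeline. The following is only a sketch: the file names and the sample text are made up for illustration, and it assumes the pattern is written as ordinary characters rather than as \xHH byte pairs.

```shell
# Build a small UTF-16LE sample file (two lines, tab-separated fields).
printf 'ऊपर\tup\nनीचे\tdown\n' | iconv -f UTF-8 -t UTF-16LE > sample-utf16le.txt

# Decode to UTF-8, search with ordinary regex syntax, then re-encode the
# matching lines so that the output file is UTF-16LE again.
iconv -f UTF-16LE -t UTF-8 sample-utf16le.txt |
  grep -E '^ऊपर' |
  iconv -f UTF-8 -t UTF-16LE > matches-utf16le.txt
```

The file on disk is never modified; the conversion only happens inside the pipe, and the output keeps the original encoding.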






  • Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in \xXX UTF-16 format and that would mean converting them also, plus I need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred, and also, I'd just like to learn how to do it without re-encoding.

    – Peter.O
    Jun 9 '12 at 13:34













  • I think you may be borrowing trouble. If you provide grep a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte swapped as compared to how grep sees the data. Keep in mind that internally, grep is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works.

    – Warren Young
    Jun 9 '12 at 14:50













  • Just as the code point for @ is 0x0040, the code point for ऊ is 0x090A (U+090A). My patterns are flipped into little-endian order, \x0A\x09, which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the code point(s) clashes with grep's UTF-8 interpretation of the pattern and data, especially with the \x0A\x09 combination, which I do encounter.

    – Peter.O
    Jun 9 '12 at 16:38













  • Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode data.. (I'm currently diving into perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grep raw byte data, but I'm not sure yet.

    – Peter.O
    Jun 9 '12 at 16:39













  • I don't see any virtue in not re-coding the data. The Perl answer to your other question also re-coded it on the fly. It's not like I'm asking you to change your files on disk; we're just performing a bit of a transform to the data in order to get it into the form we need to process it. This is what computers are best at. Input-process-output.

    – Warren Young
    Jun 9 '12 at 20:56



















1














I believe that Warren's answer is a better general *nix solution, but this perl script works exactly as I wanted (for my somewhat non-standard situation). It does require that I change the search-pattern's current format slightly:

from \x09\x0A\x09\x2A\x09\x30\x00\x09

     to \x{090A}\x{092A}\x{0930}\x{0009}



It does everything in one process which is particularly what I was after.



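#!/usr/bin/env perl
# Copy the lines of a UTF-16LE file that start with a regex to a UTF-16LE output file.
use strict;
use warnings;
die "usage: $0 infile outfile pattern\n" if @ARGV != 3;
my ($if, $of, $pat) = @ARGV;
# The :encoding layers decode on read and encode on write, so the regex sees characters.
open(my $ifh, '<:encoding(UTF-16LE)', $if) or die "Can't open $if: $!";
open(my $ofh, '>:encoding(UTF-16LE)', $of) or die "Can't open $of: $!";
while (<$ifh>) { print $ofh $_ if /^$pat/; }
close($ofh) or die "Can't close $of: $!";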
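The pattern rewrite mentioned above (from \xHH\xHH pairs to \x{HHHH}) does not have to be done by hand. Here is a rough sketch using sed; the variable names are made up, and it assumes every pair is written high byte first, as in the example pattern, and that the pattern contains nothing but such pairs.

```shell
# Join each \xHH\xHH pair into a single \x{HHHH} escape.
pat='\x09\x0A\x09\x2A\x09\x30\x00\x09'
newpat=$(printf '%s\n' "$pat" |
  sed -E 's/\\x([0-9A-Fa-f]{2})\\x([0-9A-Fa-f]{2})/\\x{\1\2}/g')
printf '%s\n' "$newpat"   # \x{090A}\x{092A}\x{0930}\x{0009}
```

If the pairs were stored low byte first instead, the replacement would need \2\1 rather than \1\2.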





  • Your main loop can be rewritten as: while (<$ifh>) { print $ofh $_ if /^$pat/; } You won't get the diagnostic on a bad readline, but that's not going to happen on a modern OS unless the hardware is failing while you read the file.

    – Warren Young
    Jun 9 '12 at 23:52











  • @Warren, thanks for the help. I've changed the script to the simpler loop.

    – Peter.O
    Jun 10 '12 at 0:28



















0














Install the ripgrep utility, which supports UTF-16.



For example:



rg pattern filename



ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)




To print all lines, run: rg -N . filename.
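As far as I know, the automatic UTF-16 detection relies on a byte-order mark, so for a BOM-less UTF-16LE file it is probably safer to name the encoding with -E. A minimal sketch, assuming ripgrep is installed as rg (the file names and sample text are made up); ripgrep prints its matches as UTF-8, so the last step converts them back:

```shell
# Build a BOM-less UTF-16LE sample file.
printf 'ऊपर\tup\nनीचे\tdown\n' | iconv -f UTF-8 -t UTF-16LE > sample-bomless.txt

if command -v rg >/dev/null 2>&1; then
  # -E/--encoding names the input encoding; -N suppresses line numbers.
  rg -N -E utf-16le '^ऊपर' sample-bomless.txt |
    iconv -f UTF-8 -t UTF-16LE > rg-matches-utf16le.txt
fi
```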






    The iconv answer was answered Jun 9 '12 at 13:12 by Warren Young
    and edited Apr 13 '17 at 12:36 by Community













    • Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in 'xXX` UTF-16 format and that would mean converting them also, plus I need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding...

      – Peter.O
      Jun 9 '12 at 13:34













    • I think you may be borrowing trouble. If you provide grep a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte swapped as compared to how grep sees the data. Keep in mind that internally, grep is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works.

      – Warren Young
      Jun 9 '12 at 14:50













    • As the Codepoint for @ is 0x0040, the Codepoint for is 0x090A (U+090A). My patterns are flipped into Little-Endian order x0Ax09 which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the the codepoint(s) clashes with grep's UTF-8 interpretation of the pattern and data; especially with the x0Ax09 combination, which I do encounter.

      – Peter.O
      Jun 9 '12 at 16:38













    • Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode data.. (I'm currently diving into perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grep raw byte data, but I'm not sure yet.

      – Peter.O
      Jun 9 '12 at 16:39













    • I don't see any virtue in not re-coding the data. The Perl answer to your other question also re-coded it on the fly. It's not like I'm asking you to change your files on disk; we're just performing a bit of a transform to the data in order to get it into the form we need to process it. This is what computers are best at. Input-process-output.

      – Warren Young
      Jun 9 '12 at 20:56



















    • Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in 'xXX` UTF-16 format and that would mean converting them also, plus I need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding...

      – Peter.O
      Jun 9 '12 at 13:34













    • I think you may be borrowing trouble. If you provide grep a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte swapped as compared to how grep sees the data. Keep in mind that internally, grep is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works.

      – Warren Young
      Jun 9 '12 at 14:50













    • As the Codepoint for @ is 0x0040, the Codepoint for is 0x090A (U+090A). My patterns are flipped into Little-Endian order x0Ax09 which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the the codepoint(s) clashes with grep's UTF-8 interpretation of the pattern and data; especially with the x0Ax09 combination, which I do encounter.

      – Peter.O
      Jun 9 '12 at 16:38













    • Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode data.. (I'm currently diving into perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grep raw byte data, but I'm not sure yet.

      – Peter.O
      Jun 9 '12 at 16:39













    • I don't see any virtue in not re-coding the data. The Perl answer to your other question also re-coded it on the fly. It's not like I'm asking you to change your files on disk; we're just performing a bit of a transform to the data in order to get it into the form we need to process it. This is what computers are best at. Input-process-output.

      – Warren Young
      Jun 9 '12 at 20:56

















    Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in 'xXX` UTF-16 format and that would mean converting them also, plus I need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding...

    – Peter.O
    Jun 9 '12 at 13:34







    Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in 'xXX` UTF-16 format and that would mean converting them also, plus I need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding...

    – Peter.O
    Jun 9 '12 at 13:34















    I think you may be borrowing trouble. If you provide grep a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte swapped as compared to how grep sees the data. Keep in mind that internally, grep is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works.

    – Warren Young
    Jun 9 '12 at 14:50







    As the code point for @ is 0x0040, the code point for ऊ is 0x090A (U+090A). My patterns are flipped into little-endian order, \x0A\x09, which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the code point(s) clashes with grep's UTF-8 interpretation of the pattern and data, especially with the \x0A\x09 combination, which I do encounter.

    – Peter.O
    Jun 9 '12 at 16:38
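To make that clash concrete, here is a minimal sketch that is not from the thread itself. It assumes a hypothetical temporary file holding ऊपर plus a newline as UTF-16LE with no BOM, and uses only `printf`, `od` and `wc`. The low byte of U+090A is 0x0A, so any byte-oriented tool sees an extra line break inside the first character.

```shell
# Hypothetical sample: "ऊपर" followed by a newline, as UTF-16LE without a BOM.
# Bytes: 0A 09 (U+090A), 2A 09 (U+092A), 30 09 (U+0930), 0A 00 (U+000A).
sample=$(mktemp)
printf '\012\011\052\011\060\011\012\000' > "$sample"

# Show the raw bytes.
od -An -tx1 "$sample"

# wc -l counts 0x0A bytes, so it reports two lines for one line of text.
lines=$(wc -l < "$sample" | tr -d ' ')
echo "byte-oriented line count: $lines"

rm -f "$sample"
```

The same splitting is presumably what a line-based tool such as grep or sed does with this data when it reads it as bytes, which would explain why a pattern containing that byte pair never matches within a single line.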







    Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode data.. (I'm currently diving into perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grep raw byte data, but I'm not sure yet.

    – Peter.O
    Jun 9 '12 at 16:39







    1














    I believe that Warren's answer is a better general *nix solution, but this perl script works exactly as I wanted (for my somewhat non-standard situation). It does require that I change the search-pattern's current format slightly:

    from \x09\x0A\x09\x2A\x09\x30\x00\x09

         to \x{090A}\x{092A}\x{0930}\x{0009}



    It does everything in one process which is particularly what I was after.



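    #!/usr/bin/env perl
    # Copy the lines of a UTF-16LE file that start with a given regex to a UTF-16LE output file.
    use strict;
    use warnings;
    die "3 args are required: infile outfile pattern\n" if @ARGV != 3;
    my ($if, $of, $pat) = @ARGV;
    # Decode on read and encode on write, so the regex sees characters rather than bytes.
    open(my $ifh, '<:encoding(UTF-16LE)', $if) or die "Can't open $if: $!";
    open(my $ofh, '>:encoding(UTF-16LE)', $of) or die "Can't open $of: $!";
    # $pat uses code-point escapes such as \x{090A}; ^ anchors it to the start of each line.
    while (<$ifh>) { print $ofh $_ if /^$pat/; }
    close($ofh) or die "Can't close $of: $!";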
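If many existing patterns are written as \xHH byte pairs, the format change described above could probably be automated rather than done by hand. The following is only a sketch: `to_codepoints` is a hypothetical helper name, and it assumes every character in the pattern is written as exactly two \xHH escapes, high byte first, with no other regex syntax between the two halves of a pair.

```shell
# Rewrite each \xHH\xHH byte pair as a single \x{HHHH} code-point escape.
# Assumption: pairs are written high byte first. For low-byte-first input,
# swap \1 and \2 in the replacement.
to_codepoints() {
    printf '%s\n' "$1" |
        sed -E 's/\\x([0-9A-Fa-f]{2})\\x([0-9A-Fa-f]{2})/\\x{\1\2}/g'
}

to_codepoints '\x09\x0A\x09\x2A\x09\x30\x00\x09'
```

For the sample input this prints \x{090A}\x{092A}\x{0930}\x{0009}, which can then be passed as the third argument to the script.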





    edited Jun 10 '12 at 0:25

























    answered Jun 9 '12 at 23:19









    Peter.O













    • Your main loop can be rewritten as while (<$ifh>) { print $ofh $_ if /^$pat/; } You won't get the diagnostic on bad readline, but that's not going to happen on a modern OS unless the hardware is failing while you read the file.

      – Warren Young
      Jun 9 '12 at 23:52











    • @Warren, thanks for the help. I've changed the script to the simpler loop.

      – Peter.O
      Jun 10 '12 at 0:28



















    0














    Install the ripgrep utility, which supports UTF-16.



    For example:



    rg pattern filename



    ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)




    To print all lines, run: rg -N . filename.
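As far as I know, the automatic UTF-16 detection relies on a byte-order mark, so a file without one needs the encoding named explicitly. A minimal sketch, assuming `rg` is on the PATH and using a hypothetical temporary sample file (ऊपर plus a newline as UTF-16LE, no BOM):

```shell
# Hypothetical sample: "ऊपर" followed by a newline, as UTF-16LE without a BOM.
sample=$(mktemp)
printf '\012\011\052\011\060\011\012\000' > "$sample"

# With no BOM to sniff, name the encoding with -E/--encoding.
# The pattern uses code-point escapes; rg prints matching lines as UTF-8.
if command -v rg >/dev/null 2>&1; then
    match=$(rg -N -E utf-16le '\x{090A}\x{092A}\x{0930}' "$sample")
    printf '%s\n' "$match"
fi

rm -f "$sample"
```

Note that the output is transcoded to UTF-8, so if the result has to stay in UTF-16LE it would still need converting back afterwards.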






        answered Jan 17 at 14:23









        kenorb





























