How to do a regex search in a UTF-16LE file while in a UTF-8 locale?

EDIT: A comment by Warren Young made me realize that I was not clear on one quite relevant point. My search string is already in UTF-16LE order (not in Unicode code point order, which is UTF-16BE), so perhaps the Unicode issue is somewhat moot.

Perhaps my issue is really a question of how to grep for bytes (not chars) in groups of 2 bytes, i.e. so that UTF-16LE \x09\x0A is not treated as TAB, newline, but just as 2 bytes which happen to be UTF-16LE ऊ? ... Note: I do not need to be concerned about UTF-16 surrogate pairs, so 2-byte blocks are fine.

Here is a sample pattern for this 3-character string ऊपर:

\x09\x0A\x09\x2A\x09\x30

but it returns nothing, though the string is in the file.

(here is the original post)

When searching a UTF-16LE file with a pattern in \x00\x01\x... etc. format, I have encountered problems for some values. I've been using sed (and experimented with grep), but being in the UTF-8 locale they recognize some UTF-16LE values as ASCII characters. I'm locked in to using UTF-16, so recoding to UTF-8 is not an option.

E.g. in this text, ऊ (Unicode U+090A), though it is a single character, is perceived as the two ASCII chars \x09 and \x0A.
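To make the clash concrete, here is a small sketch that prints the bytes of ऊ in both UTF-16 byte orders. Python is not used anywhere in the original post; it is chosen here only for illustration.

```python
# Illustration only: how U+090A (ऊ) is laid out in the two UTF-16 byte orders.
ch = "\u090A"  # ऊ

le = ch.encode("utf-16-le")  # low byte first
be = ch.encode("utf-16-be")  # high byte first

print(le.hex(" "))  # 0a 09
print(be.hex(" "))  # 09 0a

# A byte-oriented tool in an ASCII-compatible locale sees control characters here:
# 0x0A is the line feed that line-based tools split on, and 0x09 is a TAB.
print(le == b"\n\t")  # True
```

Either byte order contains a 0x0A byte, which is why line-oriented tools such as grep and sed cut the data in the middle of the character.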

grep has a -P (perl) option which can search for \x00\x... patterns, but I'm getting the same ASCII interpretation.

Is there some way to use grep -P to search in a UTF-16 mode, or perhaps better, how can this be done in perl or some other script?

grep seems to be the most appealing because of its compactness, but whatever gets the job done will override that preference.

PS: My ऊ example uses a literal string, but my actual usage needs a regex-style search. So this perl example is not quite what I'm after, though it does process the file as UTF-16... I'd prefer not to have to open and close the file... I think perl has more compact ways for basic things like a regex search. I'm after something with that type of compact syntax.

edited Apr 13 '17 at 12:36

Community♦

asked Jun 9 '12 at 10:44

Peter.O

18.9k1791144

I'm not so sure regexp machinery is really up to snuff with respect to UTF-8, much less other Unicode encodings. It will mostly work on UTF-8, as long as characters that are represented by several bytes do not appear in character sets or as arguments to repetition. E.g., [ña-z] will probably do surprising stuff, and so will gñ* or g[ñn]u, but g(ñ)* and g(n|ñ)u should work fine (it just means something different than you see ;-). The machinery is 8-bit clean nowadays, and swallows the UTF-8 bytes without complaint, but doesn't combine them into characters.

– vonbrand
Jan 23 '13 at 14:30
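The repetition case from this comment can be reproduced with a short sketch. It uses Python's re module purely as a stand-in for a byte-oriented regex engine; it is an illustration, not a statement about how any particular grep build behaves.

```python
import re

N_TILDE = "\u00f1"                 # ñ, written as an escape to avoid normalization surprises
text = "g" + N_TILDE * 2           # "gññ"
raw = text.encode("utf-8")         # b'g\xc3\xb1\xc3\xb1'

# Character-aware matching: '*' repeats the whole character ñ.
char_match = re.fullmatch("g" + N_TILDE + "*", text)

# Byte-wise matching with the same pattern encoded as UTF-8:
# '*' now repeats only the last byte (0xB1) of ñ, so "gññ" no longer matches.
byte_match = re.fullmatch(("g" + N_TILDE + "*").encode("utf-8"), raw)

# Grouping the multi-byte sequence restores the intended meaning at byte level.
grouped_match = re.fullmatch(("g(?:" + N_TILDE + ")*").encode("utf-8"), raw)

print(bool(char_match), bool(byte_match), bool(grouped_match))  # True False True
```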

text-processing grep regular-expression perl unicode


3 Answers

My answer is essentially the same as in your other question on this topic:

$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern

As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
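For readers who want the same decode, search, re-encode round trip in a single process, here is a minimal sketch in Python. The function name grep_utf16le and the sample data are hypothetical, and the regex is written against code points rather than bytes.

```python
import re

def grep_utf16le(data: bytes, pattern: str) -> bytes:
    """Return the lines of UTF-16LE data that match pattern, re-encoded as UTF-16LE."""
    text = data.decode("utf-16-le")            # same role as: iconv -f UTF-16LE -t UTF-8
    regex = re.compile(pattern)
    # splitlines() also splits on a few less common separators such as U+2028.
    hits = [line for line in text.splitlines(keepends=True) if regex.search(line)]
    return "".join(hits).encode("utf-16-le")   # convert the result back

# Two sample lines; the first starts with ऊपर (U+090A U+092A U+0930).
sample = "\u090A\u092A\u0930 one\nabc two\n".encode("utf-16-le")
result = grep_utf16le(sample, r"^\u090A\u092A")
print(result.decode("utf-16-le"), end="")
```

Nothing on disk changes here: the conversion happens in memory, and the output bytes are UTF-16LE again.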

edited Apr 13 '17 at 12:36

Community♦

answered Jun 9 '12 at 13:12

Warren Young

55k11143147

Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in \xXX UTF-16 format and that would mean converting them also, plus I need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding...

– Peter.O
Jun 9 '12 at 13:34

I think you may be borrowing trouble. If you provide grep a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte swapped as compared to how grep sees the data. Keep in mind that internally, grep is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works.

– Warren Young
Jun 9 '12 at 14:50

Just as the code point for @ is 0x0040, the code point for ऊ is 0x090A (U+090A). My patterns are flipped into little-endian order, \x0A\x09, which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the code point(s) clashes with grep's UTF-8 interpretation of the pattern and data, especially with the \x0A\x09 combination, which I do encounter.

– Peter.O
Jun 9 '12 at 16:38

Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just holding out for a method which doesn't need to re-encode data... (I'm currently diving into perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grep raw byte data, but I'm not sure yet.

– Peter.O
Jun 9 '12 at 16:39
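The "grep raw byte data" idea can be sketched as follows, under the assumption stated in the question that every character is exactly one 2-byte unit (no surrogate pairs). The helper name find_aligned and the sample bytes are hypothetical; the point is only that matches are accepted at even offsets and that no byte is treated as a line terminator.

```python
import re

def find_aligned(data: bytes, byte_pattern: bytes) -> list:
    """Offsets where byte_pattern matches starting on a 2-byte (code unit) boundary."""
    # DOTALL makes '.' match every byte, including 0x0A, so nothing is line-special.
    regex = re.compile(byte_pattern, re.DOTALL)
    return [pos for pos in range(0, len(data), 2) if regex.match(data, pos)]

# 41 0A | 09 2A | 0A 09 | 2A 09 | 30 09  -> U+0A41 U+2A09 followed by ऊपर
data = bytes.fromhex("410a 092a 0a09 2a09 3009")

# ऊ, then any single 2-byte unit, then र, all written as UTF-16LE bytes.
pattern = rb"\x0a\x09..\x30\x09"

print(data.find(b"\x0a\x09"))       # 1: an unaligned false hit spanning two code units
print(find_aligned(data, pattern))  # [4]
```

The pattern itself still has to be written in 2-byte units (here `..` stands for one arbitrary character); the alignment check only guards the start of the match.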

I don't see any virtue in not re-coding the data. The Perl answer to your other question also re-coded it on the fly. It's not like I'm asking you to change your files on disk; we're just performing a bit of a transform to the data in order to get it into the form we need to process it. This is what computers are best at. Input-process-output.

– Warren Young
Jun 9 '12 at 20:56


I believe that Warren's answer is a better general *nix solution, but this perl script works exactly as I wanted (for my somewhat non-standard situation). It does require that I change the search-pattern's current format slightly:

from \x09\x0A\x09\x2A\x09\x30\x00\x09

to \x{090A}\x{092A}\x{0930}\x{0009}

It does everything in one process which is particularly what I was after.

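#!/usr/bin/env perl
# Copy the lines of a UTF-16LE file that start with a pattern to a UTF-16LE output file.
use strict;
use warnings;

die "Usage: $0 infile outfile pattern\n" if @ARGV != 3;
my ($if, $of, $pat) = @ARGV;

# The :encoding layers decode on read and encode on write, so the regex sees characters.
open(my $ifh, '<:encoding(UTF-16LE)', $if) or die "Can't open $if: $!";
open(my $ofh, '>:encoding(UTF-16LE)', $of) or die "Can't open $of: $!";

while (<$ifh>) { print $ofh $_ if /^$pat/; }
close($ofh) or die "Can't close $of: $!";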
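The pattern rewrite described in this answer (pairs of byte escapes turned into one code point escape each) is mechanical, so it can be scripted. Below is a rough sketch in Python, assuming the byte escapes are written as \xHH; the function name pairs_to_codepoints and its little_endian switch are hypothetical additions, not part of the original answer.

```python
import re

def pairs_to_codepoints(byte_pattern: str, little_endian: bool = False) -> str:
    r"""Rewrite consecutive \xHH\xHH byte escapes as \x{HHHH} code point escapes."""
    def join(m):
        first, second = m.group(1), m.group(2)
        # In little-endian order the second byte of each pair is the high byte.
        hi, lo = (second, first) if little_endian else (first, second)
        return "\\x{" + (hi + lo).upper() + "}"
    return re.sub(r"\\x([0-9A-Fa-f]{2})\\x([0-9A-Fa-f]{2})", join, byte_pattern)

print(pairs_to_codepoints(r"\x09\x0A\x09\x2A\x09\x30\x00\x09"))
# \x{090A}\x{092A}\x{0930}\x{0009}
print(pairs_to_codepoints(r"\x0A\x09", little_endian=True))
# \x{090A}
```

Any regex metacharacters between the escapes pass through unchanged, so this only suits patterns whose literal parts are all written as byte escapes.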

edited Jun 10 '12 at 0:25

answered Jun 9 '12 at 23:19

Peter.O

18.9k1791144

Your main loop can be rewritten as while (<$ifh>) { print $ofh $_ if /^$pat/; } You won't get the diagnostic on bad readline, but that's not going to happen on a modern OS unless the hardware is failing while you read the file.

– Warren Young
Jun 9 '12 at 23:52

@Warren, thanks for the help. I've changed the script to the simpler loop.

– Peter.O
Jun 10 '12 at 0:28


Install the ripgrep utility, which supports UTF-16.

For example:

rg pattern filename

ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)

To print all lines, run: rg -N . filename.
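As far as I know, the automatic detection mentioned above works by looking for a byte order mark (BOM) at the start of the file, so a UTF-16LE file without a BOM would still need the -E/--encoding flag. The sketch below shows what such a check looks like in Python; it illustrates the idea only, is not ripgrep's code, and the function name sniff_utf16 is hypothetical.

```python
import codecs

def sniff_utf16(data: bytes):
    """Return 'utf-16-le' or 'utf-16-be' if data starts with a UTF-16 BOM, else None."""
    # A UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM, so rule it out first.
    if data.startswith(codecs.BOM_UTF32_LE):
        return None
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None

with_bom = codecs.BOM_UTF16_LE + "\u090A".encode("utf-16-le")
without_bom = "\u090A".encode("utf-16-le")

print(sniff_utf16(with_bom))     # utf-16-le
print(sniff_utf16(without_bom))  # None
```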

answered Jan 17 at 14:23

kenorb

8,541370106


