Filtering invalid UTF-8












46















I have a text file in an unknown or mixed encoding. I want to see the lines that contain a byte sequence that is not valid UTF-8 (by piping the text file into some program). Equivalently, I want to filter out the lines that are valid UTF-8. In other words, I'm looking for grep [notutf8].



An ideal solution would be portable, short and generalizable to other encodings, but if you feel the best way is to bake in the definition of UTF-8, go ahead.

































  • See also keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex for a possible regex.

    – Mikel
    Jan 27 '11 at 6:54











  • @Mikel: or unix.stackexchange.com/questions/6460/… … I was hoping for something less clumsly.

    – Gilles
    Jan 27 '11 at 8:18


















































































































































command-line text-processing character-encoding unicode












































asked Jan 27 '11 at 0:13









Gilles



































































































6 Answers


















30














If you want to use grep, you can do:



grep -axv '.*' file


in UTF-8 locales to get the lines that have at least one invalid UTF-8 sequence (this works with GNU grep at least).

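As a quick sanity check, here is a sketch of that invocation on a throwaway file (the file name and the `C.UTF-8` locale are assumptions; substitute any UTF-8 locale your system provides):

```shell
# Create a sample file: the third line contains a lone 0x80
# continuation byte, which is never valid UTF-8 on its own.
printf 'good ascii line\ncaf\xc3\xa9 is fine\nbad \x80 byte here\n' > sample.txt

# -a: force text mode, -x: whole-line match, -v: invert.
# In a UTF-8 locale, '.*' only matches lines that decode cleanly,
# so inverting prints only the lines with invalid sequences.
LC_ALL=C.UTF-8 grep -axv '.*' sample.txt
```

Only `bad \x80 byte here` should be printed; the accented but valid `café` line is correctly left out.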































  • Except for -a, that's required to work by POSIX. However GNU grep at least fails to spot the UTF-8 encoded UTF-16 surrogate non-characters or codepoints above 0x10FFFF.

    – Stéphane Chazelas
    Dec 19 '14 at 23:10






  • 1





    @StéphaneChazelas Conversely, the -a is needed by GNU grep (which isn't POSIX compliant, I assume). Concerning the surrogate area and the codepoints above 0x10FFFF, this is a bug then (which could explain that). For this, adding -P should work with GNU grep 2.21 (but is slow); it is buggy at least in Debian grep/2.20-4.

    – vinc17
    Dec 19 '14 at 23:37











  • Sorry, my bad, the behaviour is unspecified in POSIX since grep is a text utility (only expected to work on text input), so I suppose GNU grep's behaviour is as valid as any here.

    – Stéphane Chazelas
    Dec 19 '14 at 23:50













  • @StéphaneChazelas I confirm that POSIX says: "The input files shall be text files." (though not in the description part, which is a bit misleading). This also means that in case of invalid sequences, the behavior is undefined by POSIX. Hence the need to know the implementation, such as GNU grep (whose intent is to regard invalid sequences as non-matching), and possible bugs.

    – vinc17
    Dec 20 '14 at 0:12






  • 1





    I'm switching the accepted answer to this one (sorry, Peter.O) because it's simple and works well for my primary use case, which is a heuristic to distinguish UTF-8 from other common encodings (especially 8-bit encodings). Stéphane Chazelas and Peter.O provide more accurate answers in terms of UTF-8 compliance.

    – Gilles
    Dec 23 '14 at 2:11



















29














I think you probably want iconv. It's for converting between codesets and supports an absurd number of formats. For example, to strip anything not valid in UTF-8 you could use:



iconv -c -t UTF-8 < input.txt > output.txt



Without the -c option it'll report problems in converting to stderr, so with process redirection you could save a list of these. Another way would be to strip the non-UTF-8 stuff and then



diff input.txt output.txt



for a list of where changes were made.
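A minimal end-to-end sketch of the iconv-plus-diff idea (the file names are placeholders; `-f UTF-8` is added here so the result doesn't depend on the current locale):

```shell
# The second line contains a 0xFF byte, which can never occur in UTF-8.
printf 'clean line\nbroken \xff line\n' > input.txt

# -c drops any bytes that cannot be converted instead of failing.
iconv -c -f UTF-8 -t UTF-8 < input.txt > output.txt

# Any line that differs between the two files contained invalid bytes.
diff input.txt output.txt
```

Note that diff exits non-zero when the files differ, which also makes this usable as a yes/no check in scripts.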






























  • Ok, that's iconv -c -t UTF-8 <input.txt | diff input.txt - | sed -ne 's/^< //p'. It won't work as a pipeline, though, since you need to read the input twice (no, tee won't do, it might block depending on how much buffering iconv and diff do).

    – Gilles
    Jan 27 '11 at 8:22






  • 2





    Random note: input & output may not be the same file or you will end up with an empty file

    – drahnr
    Aug 24 '13 at 8:02






  • 1





    Or use process substitution if your shell supports it diff <(iconv -c -t UTF-8 <input.txt) input.txt

    – Karl
    Nov 11 '15 at 23:41













  • How do you do this and send the output to the same file as the input? I just did this and got a blank file: iconv -c -t UTF-8 < input.txt > input.txt

    – Costas Vrahimis
    Nov 8 '16 at 0:24






  • 1





    Thanks. This allows restoring a broken UTF-8 PostgreSQL dump, but not discarding valid UTF-8

    – Superbiji
    Dec 16 '16 at 10:29



















19














Edit: I've fixed a typo-bug in the regex. It needed a `\x80`, not `80`.



The regex to filter out invalid UTF-8 forms, for strict adherence to UTF-8, is as follows:



perl -l -ne '/
 ^( ([\x00-\x7F])              # 1-byte pattern
  |([\xC2-\xDF][\x80-\xBF])    # 2-byte pattern
  |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))  # 3-byte pattern
  |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))        # 4-byte pattern
 )*$ /x or print'


Output (of key lines from Test 1):



Codepoint
=========
00001000 Test=1 mode=strict valid,invalid,fail=(1000,0,0)
0000E000 Test=1 mode=strict valid,invalid,fail=(D800,800,0)
0010FFFF mode=strict test-return=(0,0) valid,invalid,fail=(10F800,800,0)




Q. How does one create test data to test a regex which filters invalid Unicode?

A. Create your own UTF-8 test algorithm, and break its rules...

Catch-22.. But then, how do you test your test algorithm?

The regex above has been tested (using iconv as the reference) for every integer value from 0x00000 to 0x10FFFF. This upper value is the maximum integer value of a Unicode codepoint.





  • In November 2003 UTF-8 was restricted by RFC 3629 to four bytes covering only the range U+0000 to U+10FFFF, in order to match the constraints of the UTF-16 character encoding.


According to the Wikipedia UTF-8 page:




  • UTF-8 encodes each of the 1,112,064 code points in the Unicode character set, using one to four 8-bit bytes


This number (1,112,064) equates to a range 0x000000 to 0x10F7FF, which is 0x0800 shy of the actual maximum integer value for the highest Unicode codepoint: 0x10FFFF



This block of integers is missing from the Unicode codepoint spectrum because of the need for the UTF-16 encoding to step beyond its original design intent via a system called surrogate pairs. A block of 0x0800 integers has been reserved for use by UTF-16. This block spans the range 0x00D800 to 0x00DFFF. None of these integers are legal Unicode values, and they are therefore invalid UTF-8 values.
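Both the count and the surrogate rule can be checked mechanically; here is a small sketch using iconv as the referee, in the same spirit as the test rig below:

```shell
# 0x110000 total code point values, minus the 0x800 surrogates reserved
# for UTF-16 pairs, leaves the 1,112,064 encodable code points.
echo $(( 0x110000 - 0x800 ))

# ED A0 80 is the 3-byte pattern that *would* encode the surrogate
# U+D800; a strict UTF-8 decoder such as glibc iconv must reject it.
printf '\xed\xa0\x80' | iconv -f UTF-8 -t UTF-32BE >/dev/null 2>&1 \
  && echo accepted || echo rejected
```

This prints `1112064` and then `rejected` (assuming a strict iconv such as glibc's; a laxer implementation could differ).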



In Test 1, the regex has been tested against every number in the range of Unicode codepoints, and it matches exactly the results of iconv, i.e. 0x10F800 valid values and 0x000800 invalid values.



However, the issue now arises: how does the regex handle out-of-range UTF-8 values above 0x10FFFF (UTF-8 can extend to 6 bytes, with a maximum integer value of 0x7FFFFFFF)?

To generate the necessary non-Unicode UTF-8 byte values, I've used the following command:



  perl -C -e 'print chr 0x'$hexUTF32BE


To test their validity (in some fashion), I've used Gilles' UTF-8 regex...



  perl -l -ne '/
 ^( [\000-\177]                # 1-byte pattern
  | [\300-\337][\200-\277]     # 2-byte pattern
  | [\340-\357][\200-\277]{2}  # 3-byte pattern
  | [\360-\367][\200-\277]{3}  # 4-byte pattern
  | [\370-\373][\200-\277]{4}  # 5-byte pattern
  | [\374-\375][\200-\277]{5}  # 6-byte pattern
 )*$ /x or print'


The output of perl's print chr matches the filtering of Gilles' regex; each reinforces the validity of the other.
I can't use iconv because it only handles the valid-Unicode-Standard subset of the broader (original) UTF-8 standard...



The numbers involved are rather large, so I've tested top-of-range, bottom-of-range, and several scans stepping by increments such as 11111, 13579, 33333, 53441... The results all match, so now all that remains is to test the regex against these out-of-range UTF-8-style values (invalid for Unicode, and therefore also invalid for strict UTF-8 itself)..





Here are the test modules:



[[ "$(locale charmap)" != "UTF-8" ]] && { echo "ERROR: locale must be UTF-8, but it is $(locale charmap)"; exit 1; }

# Testing the UTF-8 regex
#
# Tests to check that the observed byte-ranges (above) have
# been accurately observed and included in the test code and final regex.
# =========================================================================
: 2 bytes; B2=0 # run-test=1 do-not-test=0
: 3 bytes; B3=0 # run-test=1 do-not-test=0
: 4 bytes; B4=0 # run-test=1 do-not-test=0

: regex; Rx=1 # run-test=1 do-not-test=0

((strict=16)); mode[$strict]=strict # iconv -f UTF-16BE then iconv -f UTF-32BE beyond 0xFFFF)
(( lax=32)); mode[$lax]=lax # iconv -f UTF-32BE only)

# modebits=$strict
# UTF-8, in relation to UTF-16 has invalid values
# modebits=$strict automatically shifts to modebits=$lax
# when the tested integer exceeds 0xFFFF
# modebits=$lax
# UTF-8, in relation to UTF-32, has no restrictions


# Test 1 Sequentially tests a range of Big-Endian integers
# * Unicode Codepoints are a subset of Big-Endian integers
# ( based on 'iconv' -f UTF-32BE -f UTF-8 )
# Note: strict UTF-8 has a few quirks because of UTF-16
# Set modebits=16 to "strictly" test the low range

Test=1; modebits=$strict
# Test=2; modebits=$lax
# Test=3
mode3wlo=$(( 1*4)) # minimum chars * 4 ( '4' is for UTF-32BE )
mode3whi=$((10*4)) # minimum chars * 4 ( '4' is for UTF-32BE )


#########################################################################

# 1 byte UTF-8 values: Nothing to do; no complexities.

#########################################################################

# 2 Byte UTF-8 values: Verifying that I've got the right range values.
if ((B2==1)) ; then
echo "# Test 2 bytes for Valid UTF-8 values: ie. values which are in range"
# =========================================================================
time \
for d1 in {194..223} ;do
# bin oct hex dec
# lo 11000010 302 C2 194
# hi 11011111 337 DF 223
B2b1=$(printf "%0.2X" $d1)
#
for d2 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B2b2=$(printf "%0.2X" $d2)
#
echo -n "${B2b1}${B2b2}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B2b1}${B2b2}"; exit 20; }
#
done
done
echo

# Now do a negated test.. This takes longer, because there are more values.
echo "# Test 2 bytes for Invalid values: ie. values which are out of range"
# =========================================================================
# Note: 'iconv' will treat a leading \x00-\x7F as a valid leading single,
# so this negated test primes the first UTF-8 byte with values starting at \x80
time \
for d1 in {128..193} {224..255} ;do
#for d1 in {128..194} {224..255} ;do # force a valid UTF-8 (needs $B2b2)
B2b1=$(printf "%0.2X" $d1)
#
for d2 in {0..127} {192..255} ;do
#for d2 in {0..128} {192..255} ;do # force a valid UTF-8 (needs $B2b1)
B2b2=$(printf "%0.2X" $d2)
#
echo -n "${B2b1}${B2b2}" |
xxd -p -u -r |
iconv -f UTF-8 2>/dev/null && {
echo "ERROR: VALID UTF-8 found: ${B2b1}${B2b2}"; exit 21; }
#
done
done
echo
fi

#########################################################################

# 3 Byte UTF-8 values: Verifying that I've got the right range values.
if ((B3==1)) ; then
echo "# Test 3 bytes for Valid UTF-8 values: ie. values which are in range"
# ========================================================================
time \
for d1 in {224..239} ;do
# bin oct hex dec
# lo 11100000 340 E0 224
# hi 11101111 357 EF 239
B3b1=$(printf "%0.2X" $d1)
#
if [[ $B3b1 == "E0" ]] ; then
B3b2range="$(echo {160..191})"
# bin oct hex dec
# lo 10100000 240 A0 160
# hi 10111111 277 BF 191
elif [[ $B3b1 == "ED" ]] ; then
B3b2range="$(echo {128..159})"
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10011111 237 9F 159
else
B3b2range="$(echo {128..191})"
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
fi
#
for d2 in $B3b2range ;do
B3b2=$(printf "%0.2X" $d2)
echo "${B3b1} ${B3b2} xx"
#
for d3 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B3b3=$(printf "%0.2X" $d3)
#
echo -n "${B3b1}${B3b2}${B3b3}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 30; }
#
done
done
done
echo

# Now do a negated test.. This takes longer, because there are more values.
echo "# Test 3 bytes for Invalid values: ie. values which are out of range"
# =========================================================================
# Note: 'iconv' will treat a leading \x00-\x7F as a valid leading single,
# so this negated test primes the first UTF-8 byte with values starting at \x80
#
# real 26m28.462s
# user 27m12.526s | stepping by 2
# sys 13m11.193s /
#
# real 239m00.836s
# user 225m11.108s | stepping by 1
# sys 120m00.538s /
#
time \
for d1 in {128..223..1} {240..255..1} ;do
#for d1 in {128..224..1} {239..255..1} ;do # force a valid UTF-8 (needs $B2b2,$B3b3)
B3b1=$(printf "%0.2X" $d1)
#
if [[ $B3b1 == "E0" ]] ; then
B3b2range="$(echo {0..159..1} {192..255..1})"
#B3b2range="$(> {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
elif [[ $B3b1 == "ED" ]] ; then
B3b2range="$(echo {0..127..1} {160..255..1})"
#B3b2range="$(echo {0..128..1} {160..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
else
B3b2range="$(echo {0..127..1} {192..255..1})"
#B3b2range="$(echo {0..128..1} {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
fi
for d2 in $B3b2range ;do
B3b2=$(printf "%0.2X" $d2)
echo "${B3b1} ${B3b2} xx"
#
for d3 in {0..127..1} {192..255..1} ;do
#for d3 in {0..128..1} {192..255..1} ;do # force a valid UTF-8 (needs $B2b1)
B3b3=$(printf "%0.2X" $d3)
#
echo -n "${B3b1}${B3b2}${B3b3}" |
xxd -p -u -r |
iconv -f UTF-8 2>/dev/null && {
echo "ERROR: VALID UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 31; }
#
done
done
done
echo

fi

#########################################################################

# Brute force testing in the Astral Plane will take a VERY LONG time..
# Perhaps selective testing is more appropriate, now that the previous tests
# have panned out okay...
#
# 4 Byte UTF-8 values:
if ((B4==1)) ; then
echo "# Test 4 bytes for Valid UTF-8 values: ie. values which are in range"
# ==================================================================
# real 58m18.531s
# user 56m44.317s |
# sys 27m29.867s /
time \
for d1 in {240..244} ;do
# bin oct hex dec
# lo 11110000 360 F0 240
# hi 11110100 364 F4 244 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
B4b1=$(printf "%0.2X" $d1)
#
if [[ $B4b1 == "F0" ]] ; then
B4b2range="$(echo {144..191})" ## f0 90 80 80 to f0 bf bf bf
# bin oct hex dec 010000 -- 03FFFF
# lo 10010000 220 90 144
# hi 10111111 277 BF 191
#
elif [[ $B4b1 == "F4" ]] ; then
B4b2range="$(echo {128..143})" ## f4 80 80 80 to f4 8f bf bf
# bin oct hex dec 100000 -- 10FFFF
# lo 10000000 200 80 128
# hi 10001111 217 8F 143 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
else
B4b2range="$(echo {128..191})" ## fx 80 80 80 to f3 bf bf bf
# bin oct hex dec 0C0000 -- 0FFFFF
# lo 10000000 200 80 128 0A0000
# hi 10111111 277 BF 191
fi
#
for d2 in $B4b2range ;do
B4b2=$(printf "%0.2X" $d2)
#
for d3 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b3=$(printf "%0.2X" $d3)
echo "${B4b1} ${B4b2} ${B4b3} xx"
#
for d4 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b4=$(printf "%0.2X" $d4)
#
echo -n "${B4b1}${B4b2}${B4b3}${B4b4}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B4b1}${B4b2}${B4b3}${B4b4}"; exit 40; }
#
done
done
done
done
echo "# Test 4 bytes for Valid UTF-8 values: END"
echo
fi

########################################################################
# There is no test (yet) for negated range values in the astral plane. #
# (all negated range values must be invalid) #
# I won't bother; This was mainly for me to get the general feel of #
# the tests, and the final test below should flush anything out.. #
# Traversing the entire UTF-8 range takes quite a while... #
# so no need to do it twice (albeit in a slightly different manner) #
########################################################################

################################
### The construction of: ####
### The Regular Expression ####
### (de-construction?) ####
################################

# BYTE 1                   BYTE 2       BYTE 3       BYTE 4
# 1: [\x00-\x7F]
#    ===========
#    ([\x00-\x7F])
#
# 2: [\xC2-\xDF]           [\x80-\xBF]
#    =================================
#    ([\xC2-\xDF][\x80-\xBF])
#
# 3: [\xE0]                [\xA0-\xBF]  [\x80-\xBF]
#    [\xED]                [\x80-\x9F]  [\x80-\xBF]
#    [\xE1-\xEC\xEE-\xEF]  [\x80-\xBF]  [\x80-\xBF]
#    ==============================================
#    ((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))
#
# 4: [\xF0]                [\x90-\xBF]  [\x80-\xBF]  [\x80-\xBF]
#    [\xF1-\xF3]           [\x80-\xBF]  [\x80-\xBF]  [\x80-\xBF]
#    [\xF4]                [\x80-\x8F]  [\x80-\xBF]  [\x80-\xBF]
#    ===========================================================
#    ((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))
#
# The final regex
# ===============
# 1-4: (([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))
# 4-1: (((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|([\xC2-\xDF][\x80-\xBF])|([\x00-\x7F]))


#######################################################################
# The final Test; for a single character (multi chars to follow) #
# Compare the return code of 'iconv' against the 'regex' #
# for the full range of 0x000000 to 0x10FFFF #
# #
# Note; this script has 3 modes: #
# Run this test TWICE, set each mode Manually! #
# #
# 1. Sequentially test every value from 0x000000 to 0x10FFFF #
# 2. Throw a spanner into the works! Force random byte patterns #
# 2. Throw a spanner into the works! Force random longer strings #
# ============================== #
# #
# Note: The purpose of this routine is to determine if there is any #
# difference how 'iconv' and 'regex' handle the same data #
# #
#######################################################################
if ((Rx==1)) ; then
# real 191m34.826s
# user 158m24.114s
# sys 83m10.676s
time {
invalCt=0
validCt=0
failCt=0
decBeg=$((0x00110000)) # increment by decimal integer
decMax=$((0x7FFFFFFF)) # increment by decimal integer
#
for ((CPDec=decBeg;CPDec<=decMax;CPDec+=13247)) ;do
((D==1)) && echo "=========================================================="
#
# Convert decimal integer '$CPDec' to Hex-digits; 6-long (dec2hex)
hexUTF32BE=$(printf '%0.8X\n' $CPDec) # hexUTF32BE

# progress count
if (((CPDec%$((0x1000)))==0)) ;then
((Test>2)) && echo
echo "$hexUTF32BE Test=$Test mode=${mode[$modebits]} "
fi
if ((Test==1 || Test==2 ))
then # Test 1. Sequentially test every value from 0x000000 to 0x10FFFF
#
if ((Test==2)) ; then
bits=32
UTF8="$( perl -C -e 'print chr 0x'$hexUTF32BE |
perl -l -ne '/^( [\000-\177]
 | [\300-\337][\200-\277]
 | [\340-\357][\200-\277]{2}
 | [\360-\367][\200-\277]{3}
 | [\370-\373][\200-\277]{4}
 | [\374-\375][\200-\277]{5}
)*$/x and print' |xxd -p )"
UTF8="${UTF8%0a}"
[[ -n "$UTF8" ]] &&
rcIco32=0 || rcIco32=1
rcIco16=

elif ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
bits=16
UTF8="$( echo -n "${hexUTF32BE:4}" |
xxd -p -u -r |
iconv -f UTF-16BE -t UTF-8 2>/dev/null)" &&
rcIco16=0 || rcIco16=1
rcIco32=
else
bits=32
UTF8="$( echo -n "$hexUTF32BE" |
xxd -p -u -r |
iconv -f UTF-32BE -t UTF-8 2>/dev/null)" &&
rcIco32=0 || rcIco32=1
rcIco16=
fi
# echo "1 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((${rcIco16}${rcIco32}!=0)) ;then
# 'iconv -f UTF-16BE' failed to produce a reliable UTF-8
if ((bits==16)) ;then
((D==1)) && echo "bits-$bits rcIconv: error $hexUTF32BE .. 'strict' failed, now trying 'lax'"
# iconv failed to create a 'strict' UTF-8 so
# try UTF-32BE to get a 'lax' UTF-8 pattern
UTF8="$( echo -n "$hexUTF32BE" |
xxd -p -u -r |
iconv -f UTF-32BE -t UTF-8 2>/dev/null)" &&
rcIco32=0 || rcIco32=1
#echo "2 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
if ((rcIco32!=0)) ;then
((D==1)) && echo -n "bits-$bits rcIconv: Cannot gen UTF-8 for: $hexUTF32BE"
rcIco32=1
fi
fi
fi
# echo "3 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((rcIco16==0 || rcIco32==0)) ;then
# 'strict(16)' OR 'lax(32)'... 'iconv' managed to generate a UTF-8 pattern
((D==1)) && echo -n "bits-$bits rcIconv: pattern* $hexUTF32BE"
((D==1)) && if [[ $bits == "16" && $rcIco32 == "0" ]] ;then
echo " .. 'lax' UTF-8 produced a pattern"
else
echo
fi
# regex test
if ((modebits==strict)) ;then
#rxOut="$(echo -n "$UTF8" |perl -l -ne '/^(([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))*$/ or print' )"
rxOut="$(echo -n "$UTF8" |
perl -l -ne '/^( ([\x00-\x7F])              # 1-byte pattern
 |([\xC2-\xDF][\x80-\xBF])                  # 2-byte pattern
 |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))  # 3-byte pattern
 |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))        # 4-byte pattern
)*$ /x or print' )"
else
if ((Test==2)) ;then
rx="$(echo -n "$UTF8" |perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ and print')"
[[ "$UTF8" != "$rx" ]] && rxOut="$UTF8" || rxOut=
rx="$(echo -n "$rx" |sed -e "s/\(..\)/\1 /g")"
else
rxOut="$(echo -n "$UTF8" |perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print' )"
fi
fi
if [[ "$rxOut" == "" ]] ;then
((D==1)) && echo " rcRegex: ok"
rcRegex=0
else
((D==1)) && echo -n "bits-$bits rcRegex: error $hexUTF32BE .. 'strict' failed,"
((D==1)) && if [[ "12" == *$Test* ]] ;then
echo # " (codepoint) Test $Test"
else
echo
fi
rcRegex=1
fi
fi
#
elif [[ $Test == 2 ]]
then # Test 2. Throw a randomizing spanner into the works!
# Then test the arbitrary bytes AS-IS
#
hexLineRand="$(echo -n "$hexUTF32BE" |
sed -re "s/(.)(.)(.)(.)(.)(.)(.)(.)/\1\n\2\n\3\n\4\n\5\n\6\n\7\n\8/" |
sort -R |
tr -d '\n')"
#
elif [[ $Test == 3 ]]
then # Test 3. Test single UTF-16BE bytes in the range 0x00000000 to 0x7FFFFFFF
#
echo "Test 3 is not properly implemented yet.. Exiting"
exit 99
else
echo "ERROR: Invalid mode"
exit
fi
#
#
if ((Test==1 || Test==2)) ;then
if ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
((rcIconv=rcIco16))
else
((rcIconv=rcIco32))
fi
if ((rcRegex!=rcIconv)) ;then
[[ $Test != 1 ]] && echo
if ((rcRegex==1)) ;then
echo "ERROR: 'regex' ok, but NOT 'iconv': ${hexUTF32BE} "
else
echo "ERROR: 'iconv' ok, but NOT 'regex': ${hexUTF32BE} "
fi
((failCt++));
elif ((rcRegex!=0)) ;then
# ((invalCt++)); echo -ne "$hexUTF32BE exit-codes ${rcIco16}${rcIco32}=,$rcRegex: $(printf "%0.8X\n" $invalCt)\t$hexLine$(printf "%$(((mode3whi*2)-${#hexLine}))s")\r"
((invalCt++))
else
((validCt++))
fi
if ((Test==1)) ;then
echo -ne "$hexUTF32BE " "mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) valid,invalid,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt)) \r"
else
echo -ne "$hexUTF32BE $rx mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) val,inval,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt))\r"
fi
fi
done
} # End time
fi
exit































  • The main problem with my regexp is that it allowed some forbidden sequences such as \300\200 (really bad: that's code point 0 not expressed with a null byte!). I think your regexp rejects them correctly.

    – Gilles
    May 8 '11 at 12:38



















6





+100









I find uconv (in icu-devtools package in Debian) useful to inspect UTF-8 data:



$ print '\xE9 \xe9 \u20ac \ud800\udc00 \U110000' |
uconv --callback escape-c -t us
\xE9 \xE9 \u20ac \xED\xA0\x80\xED\xB0\x80 \xF4\x90\x80\x80


(The \x's help spotting the invalid characters (except for the false positive voluntarily introduced with a literal \xE9 above)).



(plenty of other nice usages).
































  • I think recode can be used similarly - except that I think it should fail if asked to translate an invalid multibyte sequence. I'm not sure though; it won't fail for print...|recode u8..u8/x4 for example (which just does a hexdump as you do above) because it doesn't do anything but iconv data data, but it does fail like recode u8..u2..u8/x4 because it translates then prints. But I don't know enough about it to be sure - and there are a lot of possibilities.

    – mikeserv
    Dec 20 '14 at 19:29











  • If I have a file, say test.txt, how am I supposed to find the invalid characters using your solution? What does us in your code mean?

    – jdhao
    Dec 26 '17 at 7:51











  • @Hao, us means United States, that is short for ASCII. It converts the input into an ASCII one where the non-ASCII characters are converted to \uXXXX notation and the non-characters to \xXX.

    – Stéphane Chazelas
    Dec 26 '17 at 14:19











  • Where should I put my file to use your script? Is the last line in the code block the output of your code? It is a little confusing to me.

    – jdhao
    Dec 26 '17 at 14:27



















3














Python has had a built-in unicode function since version 2.0.



#!/usr/bin/env python2
import sys
for line in sys.stdin:
    try:
        unicode(line, 'utf-8')
    except UnicodeDecodeError:
        sys.stdout.write(line)


In Python 3, unicode has been folded into str. It needs to be passed a bytes-like object, here the underlying buffer objects for the standard descriptors.



#!/usr/bin/env python3
import sys
for line in sys.stdin.buffer:
    try:
        str(line, 'utf-8')
    except UnicodeDecodeError:
        sys.stdout.buffer.write(line)
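For reference, a sketch of how the Python 3 filter can be used in a pipeline (inlined with -c here so the example is self-contained; saving it to a file as above works the same way):

```shell
# Two lines in: the second contains a stray 0x80 byte.
# Only the line that fails to decode as UTF-8 is passed through.
printf 'valid line\ninvalid \x80 line\n' |
python3 -c '
import sys
for line in sys.stdin.buffer:
    try:
        str(line, "utf-8")
    except UnicodeDecodeError:
        sys.stdout.buffer.write(line)
'
```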





























  • The python 2 one fails to flag UTF-8 encoded UTF-16 surrogate non-characters (at least with 2.7.6).

    – Stéphane Chazelas
    Dec 19 '14 at 23:12











  • @StéphaneChazelas Dammit. Thanks. I've only run nominal tests so far, I'll run Peter's test battery later.

    – Gilles
    Dec 19 '14 at 23:14



















0














I came across a similar problem (details in the "Context" section) and arrived at the following ftfy_line_by_line.py solution:



#!/usr/bin/env python3
import ftfy, sys

with open(sys.argv[1], mode='rt', encoding='utf8', errors='replace') as f:
    for line in f:
        sys.stdout.buffer.write(ftfy.fix_text(line).encode('utf8', 'replace'))
        #print(ftfy.fix_text(line).rstrip().decode(encoding="utf-8", errors="replace"))


It uses encode with errors='replace', plus ftfy to auto-fix mojibake and make other corrections.



Context



I've collected >10 GiB of CSV of basic filesystem metadata using the following gen_basic_files_metadata.csv.sh script, which essentially runs:



find "${path}" -type f -exec stat --format="%i,%Y,%s,${hostname},%m,%n" "{}" \;


The trouble I had was with inconsistent encoding of filenames across file systems, causing UnicodeDecodeError when processing further with python applications (csvsql to be more specific).



Therefore I applied the ftfy script above. Please note that ftfy is pretty slow; processing those >10 GiB took:



real    147m35.182s
user 146m14.329s
sys 2m8.713s


while sha256sum for comparison:



real    6m28.897s
user 1m9.273s
sys 0m6.210s


on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz + 16GiB RAM (and data on external drive)
































  • And yes, I know that this find command will not properly encode filenames containing quotes according to csv standard

    – Grzegorz Wierzowiecki
    May 25 '17 at 14:12














6 Answers

30














If you want to use grep, you can do:



grep -axv '.*' file


in UTF-8 locales to get the lines that have at least one invalid UTF-8 sequence (this works with GNU grep at least).
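To see what this filter catches, here is a rough Python equivalent of `grep -axv '.*'` (my own sketch, not part of the answer): it yields the lines whose bytes do not decode as strict UTF-8.

```python
def invalid_utf8_lines(data: bytes):
    """Yield the lines of 'data' that are not valid UTF-8,
    roughly what grep -axv '.*' prints in a UTF-8 locale."""
    for line in data.split(b"\n"):
        try:
            line.decode("utf-8")  # strict decoding; raises on invalid bytes
        except UnicodeDecodeError:
            yield line

sample = b"good line\nbad \x80 line\n\xc3\xa9 ok\n"
print(list(invalid_utf8_lines(sample)))  # prints [b'bad \x80 line']
```

Unlike the grep version, this does not depend on the locale, which makes it handy as a cross-check.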






  • Except for -a, that's required to work by POSIX. However GNU grep at least fails to spot the UTF-8 encoded UTF-16 surrogate non-characters or codepoints above 0x10FFFF.

    – Stéphane Chazelas
    Dec 19 '14 at 23:10






  • 1





    @StéphaneChazelas Conversely, the -a is needed by GNU grep (which isn't POSIX compliant, I assume). Concerning the surrogate area and the codepoints above 0x10FFFF, this is a bug then (which could explain that). For this, adding -P should work with GNU grep 2.21 (but is slow); it is buggy at least in Debian grep/2.20-4.

    – vinc17
    Dec 19 '14 at 23:37











  • Sorry, my bad, the behaviour is unspecified in POSIX since grep is a text utility (only expected to work on text input), so I suppose GNU grep's behaviour is as valid as any here.

    – Stéphane Chazelas
    Dec 19 '14 at 23:50













  • @StéphaneChazelas I confirm that POSIX says: "The input files shall be text files." (though not in the description part, which is a bit misleading). This also means that in case of invalid sequences, the behavior is undefined by POSIX. Hence the need to know the implementation, such as GNU grep (whose intent is to regard invalid sequences as non-matching), and possible bugs.

    – vinc17
    Dec 20 '14 at 0:12






  • 1





    I'm switching the accepted answer to this one (sorry, Peter.O) because it's simple and works well for my primary use case, which is a heuristic to distinguish UTF-8 from other common encodings (especially 8-bit encodings). Stéphane Chazelas and Peter.O provide more accurate answers in terms of UTF-8 compliance.

    – Gilles
    Dec 23 '14 at 2:11
















answered Dec 19 '14 at 22:14 by vinc17; edited Jun 30 '15 at 20:27 by Stéphane Chazelas














29














I think you probably want iconv. It's for converting between codesets and supports an absurd number of formats. For example, to strip anything not valid in UTF-8 you could use:



iconv -c -t UTF-8 < input.txt > output.txt



Without the -c option it'll report problems in converting to stderr, so with process redirection you could save a list of these. Another way would be to strip the non-UTF-8 stuff and then



diff input.txt output.txt



for a list of where changes were made.
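The strip-and-compare idea can be sketched in Python (my own illustration; `bytes.decode(..., errors='ignore')` plays the role of `iconv -c`): each line is cleaned of undecodable bytes and compared with the original, and the lines that changed are the ones that contained invalid sequences.

```python
def lines_with_invalid_utf8(data: bytes):
    """Mimic the iconv -c + diff approach: clean each line by dropping
    undecodable bytes, then report the lines that changed."""
    flagged = []
    for line in data.split(b"\n"):
        # errors='ignore' silently drops invalid bytes, like iconv -c
        cleaned = line.decode("utf-8", errors="ignore").encode("utf-8")
        if cleaned != line:  # something was stripped -> invalid bytes present
            flagged.append(line)
    return flagged

sample = b"plain ascii\nbroken \xff byte\n"
print(lines_with_invalid_utf8(sample))  # prints [b'broken \xff byte']
```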







  • Ok, that's iconv -c -t UTF-8 <input.txt | diff input.txt - | sed -ne 's/^< //p'. It won't work as a pipeline, though, since you need to read the input twice (no, tee won't do, it might block depending on how much buffering iconv and diff do).

    – Gilles
    Jan 27 '11 at 8:22






  • 2





    Random note: input & output may not be the same file or you will end up with an empty file

    – drahnr
    Aug 24 '13 at 8:02






  • 1





    Or use process substitution if your shell supports it diff <(iconv -c -t UTF-8 <input.txt) input.txt

    – Karl
    Nov 11 '15 at 23:41













  • How do you do this and make the output to the same file as the input. I just did this and got a blank file iconv -c -t UTF-8 < input.txt > input.txt

    – Costas Vrahimis
    Nov 8 '16 at 0:24






  • 1





    Thanks.. This allows restoring a broken utf-8 postgresql dump, without discarding the valid utf-8

    – Superbiji
    Dec 16 '16 at 10:29
















answered Jan 27 '11 at 3:43 by frabjous














19














Edit: I've fixed a typo-bug in the regex. It needed a `\x80`, not `80`.



The regex to filter out invalid UTF-8 forms, for strict adherance to UTF-8, is as follows



perl -l -ne '/
 ^( ([\x00-\x7F])               # 1-byte pattern
  |([\xC2-\xDF][\x80-\xBF])     # 2-byte pattern
  |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))  # 3-byte pattern
  |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))        # 4-byte pattern
 )*$ /x or print'


Output (of key lines, from Test 1):



Codepoint
=========
00001000 Test=1 mode=strict valid,invalid,fail=(1000,0,0)
0000E000 Test=1 mode=strict valid,invalid,fail=(D800,800,0)
0010FFFF mode=strict test-return=(0,0) valid,invalid,fail=(10F800,800,0)




Q. How does one create test data to test a regex which filters invalid Unicode?

A. Create your own UTF-8 test algorithm, and break its rules...

Catch-22.. But then, how do you test your test algorithm?



The regex above has been tested (using iconv as the reference) for every integer value from 0x000000 to 0x10FFFF. This upper value is the maximum integer value of a Unicode Codepoint.





  • In November 2003 UTF-8 was restricted by RFC 3629 to four bytes covering only the range U+0000 to U+10FFFF, in order to match the constraints of the UTF-16 character encoding.


According to the Wikipedia UTF-8 page:




  • UTF-8 encodes each of the 1,112,064 code points in the Unicode character set, using one to four 8-bit bytes


This number (1,112,064) equates to the range 0x000000 to 0x10F7FF, which is 0x0800 shy of the actual maximum integer value of the highest Unicode Codepoint: 0x10FFFF



This block of integers is missing from the Unicode Codepoints spectrum because of the need for the UTF-16 encoding to step beyond its original design intent, via a system called surrogate pairs. A block of 0x0800 integers has been reserved for use by UTF-16. This block spans the range 0x00D800 to 0x00DFFF. None of these integers are legal Unicode values, and they are therefore invalid UTF-8 values.
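Both the arithmetic and the invalidity of the surrogate block can be double-checked in a few lines of Python (my own sketch, not part of the answer):

```python
# 0x110000 codepoints minus the 0x800 surrogates = 1,112,064 encodable values
assert 0x110000 - 0x800 == 1112064

# The UTF-8-style byte sequence for the surrogate U+D800 (ED A0 80)
# must be rejected by a strict decoder
surrogate_bytes = b"\xed\xa0\x80"
try:
    surrogate_bytes.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
print(valid)  # prints False
```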



In Test 1, the regex has been tested against every number in the range of Unicode Codepoints, and it matches exactly the results of iconv: i.e. 0x10F800 valid values, and 0x000800 invalid values.



However, the issue now arises: how does the regex handle out-of-range UTF-8 values above 0x10FFFF? (UTF-8 can extend to 6 bytes, with a maximum integer value of 0x7FFFFFFF.)

To generate the necessary non-Unicode UTF-8 byte values, I've used the following command:



  perl -C -e 'print chr 0x'$hexUTF32BE


To test their validity (in some fashion), I've used Gilles' UTF-8 regex...



  perl -l -ne '/
 ^( [\00-\177]                  # 1-byte pattern
  |[\300-\337][\200-\277]       # 2-byte pattern
  |[\340-\357][\200-\277]{2}    # 3-byte pattern
  |[\360-\367][\200-\277]{3}    # 4-byte pattern
  |[\370-\373][\200-\277]{4}    # 5-byte pattern
  |[\374-\375][\200-\277]{5}    # 6-byte pattern
 )*$ /x or print'


The output of perl's print chr matches the filtering of Gilles' regex; one reinforces the validity of the other.
I can't use iconv here because it only handles the valid-Unicode subset of the broader (original) UTF-8 standard...



The numbers involved are rather large, so I've tested top-of-range, bottom-of-range, and several scans stepping by increments such as 11111, 13579, 33333, 53441... The results all match, so now all that remains is to test the regex against these out-of-range UTF-8-style values (invalid for Unicode, and therefore also invalid for strict UTF-8 itself)..
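As an independent cross-check on the out-of-range claim (my own sketch, using Python's strict decoder rather than the regex): a modern strict UTF-8 decoder must reject the old 5- and 6-byte forms, while accepting the highest legal codepoint, U+10FFFF.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if 'data' decodes as strict (RFC 3629) UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Old-style 5-byte sequence (F8 90 80 80 80): rejected as non-Unicode UTF-8
print(is_valid_utf8(b"\xf8\x90\x80\x80\x80"))  # prints False
# The highest codepoint, U+10FFFF (F4 8F BF BF): accepted
print(is_valid_utf8(b"\xf4\x8f\xbf\xbf"))      # prints True
# The first value above 0x10FFFF (F4 90 80 80): rejected
print(is_valid_utf8(b"\xf4\x90\x80\x80"))      # prints False
```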





Here are the test modules:



[[ "$(locale charmap)" != "UTF-8" ]] && { echo "ERROR: locale must be UTF-8, but it is $(locale charmap)"; exit 1; }

# Testing the UTF-8 regex
#
# Tests to check that the observed byte-ranges (above) have
# been accurately observed and included in the test code and final regex.
# =========================================================================
: 2 bytes; B2=0 # run-test=1 do-not-test=0
: 3 bytes; B3=0 # run-test=1 do-not-test=0
: 4 bytes; B4=0 # run-test=1 do-not-test=0

: regex; Rx=1 # run-test=1 do-not-test=0

((strict=16)); mode[$strict]=strict # iconv -f UTF-16BE then iconv -f UTF-32BE beyond 0xFFFF)
(( lax=32)); mode[$lax]=lax # iconv -f UTF-32BE only)

# modebits=$strict
# UTF-8, in relation to UTF-16 has invalid values
# modebits=$strict automatically shifts to modebits=$lax
# when the tested integer exceeds 0xFFFF
# modebits=$lax
# UTF-8, in relation to UTF-32, has no restrictions


# Test 1 Sequentially tests a range of Big-Endian integers
# * Unicode Codepoints are a subset of Big-Endian integers
# ( based on 'iconv' -f UTF-32BE -f UTF-8 )
# Note: strict UTF-8 has a few quirks because of UTF-16
# Set modebits=16 to "strictly" test the low range

Test=1; modebits=$strict
# Test=2; modebits=$lax
# Test=3
mode3wlo=$(( 1*4)) # minimum chars * 4 ( '4' is for UTF-32BE )
mode3whi=$((10*4)) # minimum chars * 4 ( '4' is for UTF-32BE )


#########################################################################

# 1 byte UTF-8 values: Nothing to do; no complexities.

#########################################################################

# 2 Byte UTF-8 values: Verifying that I've got the right range values.
if ((B2==1)) ; then
echo "# Test 2 bytes for Valid UTF-8 values: ie. values which are in range"
# =========================================================================
time
for d1 in {194..223} ;do
# bin oct hex dec
# lo 11000010 302 C2 194
# hi 11011111 337 DF 223
B2b1=$(printf "%0.2X" $d1)
#
for d2 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B2b2=$(printf "%0.2X" $d2)
#
echo -n "${B2b1}${B2b2}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B2b1}${B2b2}"; exit 20; }
#
done
done
echo

# Now do a negated test.. This takes longer, because there are more values.
echo "# Test 2 bytes for Invalid values: ie. values which are out of range"
# =========================================================================
# Note: 'iconv' will treat a leading \x00-\x7F as a valid leading single,
# so this negated test primes the first UTF-8 byte with values starting at \x80
time
for d1 in {128..193} {224..255} ;do
#for d1 in {128..194} {224..255} ;do # force a valid UTF-8 (needs $B2b2)
B2b1=$(printf "%0.2X" $d1)
#
for d2 in {0..127} {192..255} ;do
#for d2 in {0..128} {192..255} ;do # force a valid UTF-8 (needs $B2b1)
B2b2=$(printf "%0.2X" $d2)
#
echo -n "${B2b1}${B2b2}" |
xxd -p -u -r |
iconv -f UTF-8 2>/dev/null && {
echo "ERROR: VALID UTF-8 found: ${B2b1}${B2b2}"; exit 21; }
#
done
done
echo
fi

#########################################################################

# 3 Byte UTF-8 values: Verifying that I've got the right range values.
if ((B3==1)) ; then
echo "# Test 3 bytes for Valid UTF-8 values: ie. values which are in range"
# ========================================================================
time
for d1 in {224..239} ;do
# bin oct hex dec
# lo 11100000 340 E0 224
# hi 11101111 357 EF 239
B3b1=$(printf "%0.2X" $d1)
#
if [[ $B3b1 == "E0" ]] ; then
B3b2range="$(echo {160..191})"
# bin oct hex dec
# lo 10100000 240 A0 160
# hi 10111111 277 BF 191
elif [[ $B3b1 == "ED" ]] ; then
B3b2range="$(echo {128..159})"
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10011111 237 9F 159
else
B3b2range="$(echo {128..191})"
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
fi
#
for d2 in $B3b2range ;do
B3b2=$(printf "%0.2X" $d2)
echo "${B3b1} ${B3b2} xx"
#
for d3 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B3b3=$(printf "%0.2X" $d3)
#
echo -n "${B3b1}${B3b2}${B3b3}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 30; }
#
done
done
done
echo

# Now do a negated test.. This takes longer, because there are more values.
echo "# Test 3 bytes for Invalid values: ie. values which are out of range"
# =========================================================================
# Note: 'iconv' will treat a leading \x00-\x7F as a valid leading single,
# so this negated test primes the first UTF-8 byte with values starting at \x80
#
# real 26m28.462s
# user 27m12.526s | stepping by 2
# sys 13m11.193s /
#
# real 239m00.836s
# user 225m11.108s | stepping by 1
# sys 120m00.538s /
#
time
for d1 in {128..223..1} {240..255..1} ;do
#for d1 in {128..224..1} {239..255..1} ;do # force a valid UTF-8 (needs $B2b2,$B3b3)
B3b1=$(printf "%0.2X" $d1)
#
if [[ $B3b1 == "E0" ]] ; then
B3b2range="$(echo {0..159..1} {192..255..1})"
#B3b2range="$(> {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
elif [[ $B3b1 == "ED" ]] ; then
B3b2range="$(echo {0..127..1} {160..255..1})"
#B3b2range="$(echo {0..128..1} {160..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
else
B3b2range="$(echo {0..127..1} {192..255..1})"
#B3b2range="$(echo {0..128..1} {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
fi
for d2 in $B3b2range ;do
B3b2=$(printf "%0.2X" $d2)
echo "${B3b1} ${B3b2} xx"
#
for d3 in {0..127..1} {192..255..1} ;do
#for d3 in {0..128..1} {192..255..1} ;do # force a valid UTF-8 (needs $B2b1)
B3b3=$(printf "%0.2X" $d3)
#
echo -n "${B3b1}${B3b2}${B3b3}" |
xxd -p -u -r |
iconv -f UTF-8 2>/dev/null && {
echo "ERROR: VALID UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 31; }
#
done
done
done
echo

fi

#########################################################################

# Brute force testing in the Astral Plane will take a VERY LONG time..
# Perhaps selective testing is more appropriate, now that the previous tests
# have panned out okay...
#
# 4 Byte UTF-8 values:
if ((B4==1)) ; then
echo "# Test 4 bytes for Valid UTF-8 values: ie. values which are in range"
# ==================================================================
# real 58m18.531s
# user 56m44.317s |
# sys 27m29.867s /
time
for d1 in {240..244} ;do
# bin oct hex dec
# lo 11110000 360 F0 240
# hi 11110100 364 F4 244 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
B4b1=$(printf "%0.2X" $d1)
#
if [[ $B4b1 == "F0" ]] ; then
B4b2range="$(echo {144..191})" ## f0 90 80 80 to f0 bf bf bf
# bin oct hex dec 010000 -- 03FFFF
# lo 10010000 220 90 144
# hi 10111111 277 BF 191
#
elif [[ $B4b1 == "F4" ]] ; then
B4b2range="$(echo {128..143})" ## f4 80 80 80 to f4 8f bf bf
# bin oct hex dec 100000 -- 10FFFF
# lo 10000000 200 80 128
# hi 10001111 217 8F 143 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
else
B4b2range="$(echo {128..191})" ## fx 80 80 80 to f3 bf bf bf
# bin oct hex dec 0C0000 -- 0FFFFF
# lo 10000000 200 80 128 0A0000
# hi 10111111 277 BF 191
fi
#
for d2 in $B4b2range ;do
B4b2=$(printf "%0.2X" $d2)
#
for d3 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b3=$(printf "%0.2X" $d3)
echo "${B4b1} ${B4b2} ${B4b3} xx"
#
for d4 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b4=$(printf "%0.2X" $d4)
#
echo -n "${B4b1}${B4b2}${B4b3}${B4b4}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B4b1}${B4b2}${B4b3}${B4b4}"; exit 40; }
#
done
done
done
done
echo "# Test 4 bytes for Valid UTF-8 values: END"
echo
fi

########################################################################
# There is no test (yet) for negated range values in the astral plane. #
# (all negated range values must be invalid) #
# I won't bother; This was mainly for me to get the general feel of #
# the tests, and the final test below should flush anything out.. #
# Traversing the entire UTF-8 range takes quite a while... #
# so no need to do it twice (albeit in a slightly different manner) #
########################################################################

################################
### The construction of: ####
### The Regular Expression ####
### (de-construction?) ####
################################

# BYTE 1                  BYTE 2       BYTE 3      BYTE 4
# 1: [\x00-\x7F]
#    ===========
#    ([\x00-\x7F])
#
# 2: [\xC2-\xDF]           [\x80-\xBF]
#    =================================
#    ([\xC2-\xDF][\x80-\xBF])
#
# 3: [\xE0]                [\xA0-\xBF]  [\x80-\xBF]
#    [\xED]                [\x80-\x9F]  [\x80-\xBF]
#    [\xE1-\xEC\xEE-\xEF]  [\x80-\xBF]  [\x80-\xBF]
#    ==============================================
#    ((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))
#
# 4: [\xF0]                [\x90-\xBF]  [\x80-\xBF]  [\x80-\xBF]
#    [\xF1-\xF3]           [\x80-\xBF]  [\x80-\xBF]  [\x80-\xBF]
#    [\xF4]                [\x80-\x8F]  [\x80-\xBF]  [\x80-\xBF]
#    ===========================================================
#    ((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))
#
# The final regex
# ===============
# 1-4: (([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))
# 4-1: (((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|([\xC2-\xDF][\x80-\xBF])|([\x00-\x7F]))


#######################################################################
# The final Test; for a single character (multi chars to follow) #
# Compare the return code of 'iconv' against the 'regex' #
# for the full range of 0x000000 to 0x10FFFF #
# #
# Note; this script has 3 modes: #
# Run this test TWICE, set each mode Manually! #
# #
# 1. Sequentially test every value from 0x000000 to 0x10FFFF #
# 2. Throw a spanner into the works! Force random byte patterns #
#   3. Throw a spanner into the works! Force random longer strings     #
# ============================== #
# #
# Note: The purpose of this routine is to determine if there is any #
# difference how 'iconv' and 'regex' handle the same data #
# #
#######################################################################
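Before the loop, it may help to see the iconv oracle in isolation. A hedged sketch (the helper name is hypothetical, and the behaviour assumes GNU iconv, which rejects surrogates and out-of-range values when converting from UTF-32BE):

```shell
# Sketch of the 'iconv' validity oracle the test below relies on.
# 'is_valid_codepoint' is a hypothetical helper, not part of this script.
is_valid_codepoint() {   # $1 = four octal-escaped UTF-32BE bytes
    printf "$1" | iconv -f UTF-32BE -t UTF-8 >/dev/null 2>&1
}
is_valid_codepoint '\000\000\000\101' && echo 'U+0041: valid'     # 'A'
is_valid_codepoint '\000\000\330\000' || echo 'U+D800: invalid'   # lone surrogate
```

The main loop below does the same thing inline, using `xxd -p -u -r` to turn a hex string into the raw bytes fed to iconv.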
if ((Rx==1)) ; then
# real 191m34.826s
# user 158m24.114s
# sys 83m10.676s
time {
invalCt=0
validCt=0
failCt=0
decBeg=$((0x00110000))   # increment by decimal integer
decMax=$((0x7FFFFFFF))   # increment by decimal integer
#
for ((CPDec=decBeg;CPDec<=decMax;CPDec+=13247)) ;do
((D==1)) && echo "=========================================================="
#
# Convert decimal integer '$CPDec' to Hex-digits; 8-long (dec2hex)
hexUTF32BE=$(printf '%0.8X\n' $CPDec)  # hexUTF32BE

# progress count
if (((CPDec%$((0x1000)))==0)) ;then
((Test>2)) && echo
echo "$hexUTF32BE Test=$Test mode=${mode[$modebits]} "
fi
if ((Test==1 || Test==2 ))
then # Test 1. Sequentially test every value from 0x000000 to 0x10FFFF
#
if ((Test==2)) ; then
bits=32
UTF8="$( perl -C -e 'print chr 0x'$hexUTF32BE |
  perl -l -ne '/^( [\000-\177]
                 | [\300-\337][\200-\277]
                 | [\340-\357][\200-\277]{2}
                 | [\360-\367][\200-\277]{3}
                 | [\370-\373][\200-\277]{4}
                 | [\374-\375][\200-\277]{5}
                 )*$/x and print' |xxd -p )"
UTF8="${UTF8%0a}"
[[ -n "$UTF8" ]] && rcIco32=0 || rcIco32=1
rcIco16=

elif ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
bits=16
UTF8="$( echo -n "${hexUTF32BE:4}" |
  xxd -p -u -r |
  iconv -f UTF-16BE -t UTF-8 2>/dev/null)" && rcIco16=0 || rcIco16=1
rcIco32=
else
bits=32
UTF8="$( echo -n "$hexUTF32BE" |
  xxd -p -u -r |
  iconv -f UTF-32BE -t UTF-8 2>/dev/null)" && rcIco32=0 || rcIco32=1
rcIco16=
fi
# echo "1 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((${rcIco16}${rcIco32}!=0)) ;then
# 'iconv -f UTF-16BE' failed to produce a reliable UTF-8
if ((bits==16)) ;then
((D==1)) && echo "bits-$bits rcIconv: error $hexUTF32BE .. 'strict' failed, now trying 'lax'"
# iconv failed to create a 'strict' UTF-8, so
# try UTF-32BE to get a 'lax' UTF-8 pattern
UTF8="$( echo -n "$hexUTF32BE" |
  xxd -p -u -r |
  iconv -f UTF-32BE -t UTF-8 2>/dev/null)" && rcIco32=0 || rcIco32=1
#echo "2 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
if ((rcIco32!=0)) ;then
((D==1)) && echo -n "bits-$bits rcIconv: Cannot gen UTF-8 for: $hexUTF32BE"
rcIco32=1
fi
fi
fi
# echo "3 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((rcIco16==0 || rcIco32==0)) ;then
# 'strict(16)' OR 'lax(32)'... 'iconv' managed to generate a UTF-8 pattern
((D==1)) && echo -n "bits-$bits rcIconv: pattern* $hexUTF32BE"
((D==1)) && if [[ $bits == "16" && $rcIco32 == "0" ]] ;then
echo " .. 'lax' UTF-8 produced a pattern"
else
echo
fi
# regex test
if ((modebits==strict)) ;then
#rxOut="$(echo -n "$UTF8" |perl -l -ne '/^(([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))*$/ or print' )"
rxOut="$(echo -n "$UTF8" |
  perl -l -ne '/^( ([\x00-\x7F])             # 1-byte pattern
                  |([\xC2-\xDF][\x80-\xBF])  # 2-byte pattern
                  |((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))  # 3-byte pattern
                  |((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))        # 4-byte pattern
                  )*$/x or print' )"
else
if ((Test==2)) ;then
rx="$(echo -n "$UTF8" |perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ and print')"
[[ "$UTF8" != "$rx" ]] && rxOut="$UTF8" || rxOut=
rx="$(echo -n "$rx" |sed -e "s/\(..\)/\1 /g")"
else
rxOut="$(echo -n "$UTF8" |perl -l -ne '/^([\000-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print' )"
fi
fi
if [[ "$rxOut" == "" ]] ;then
((D==1)) && echo " rcRegex: ok"
rcRegex=0
else
((D==1)) && echo -n "bits-$bits rcRegex: error $hexUTF32BE .. 'strict' failed,"
((D==1)) && if [[ "12" == *$Test* ]] ;then
echo # " (codepoint) Test $Test"
else
echo
fi
rcRegex=1
fi
fi
#
elif [[ $Test == 2 ]]
then # Test 2. Throw a randomizing spanner into the works!
     # Then test the arbitrary bytes AS-IS
#
hexLineRand="$(echo -n "$hexUTF32BE" |
  sed -re "s/(.)(.)(.)(.)(.)(.)(.)(.)/\1\n\2\n\3\n\4\n\5\n\6\n\7\n\8/" |
  sort -R |
  tr -d '\n')"
#
elif [[ $Test == 3 ]]
then # Test 3. Test single UTF-16BE bytes in the range 0x00000000 to 0x7FFFFFFF
#
echo "Test 3 is not properly implemented yet.. Exiting"
exit 99
else
echo "ERROR: Invalid mode"
exit
fi
#
#
if ((Test==1 || Test==2)) ;then
if ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
((rcIconv=rcIco16))
else
((rcIconv=rcIco32))
fi
if ((rcRegex!=rcIconv)) ;then
[[ $Test != 1 ]] && echo
if ((rcRegex==1)) ;then
echo "ERROR: 'regex' ok, but NOT 'iconv': ${hexUTF32BE} "
else
echo "ERROR: 'iconv' ok, but NOT 'regex': ${hexUTF32BE} "
fi
((failCt++));
elif ((rcRegex!=0)) ;then
# ((invalCt++)); echo -ne "$hexUTF32BE exit-codes ${rcIco16}${rcIco32},$rcRegex: $(printf "%0.8X\n" $invalCt)\t$hexLine$(printf "%$(((mode3whi*2)-${#hexLine}))s")\r"
((invalCt++))
else
((validCt++))
fi
if ((Test==1)) ;then
echo -ne "$hexUTF32BE " "mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) valid,invalid,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt)) \r"
else
echo -ne "$hexUTF32BE $rx mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) val,inval,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt))\r"
fi
fi
done
} # End time
fi
exit
  • The main problem with my regexp is that it allowed some forbidden sequences such as \300\200 (really bad: that's code point 0 not expressed with a null byte!). I think your regexp rejects them correctly.

    – Gilles
    May 8 '11 at 12:38
# so this negated test primes the first UTF-8 byte with values starting at x80
#
# real 26m28.462s
# user 27m12.526s | stepping by 2
# sys 13m11.193s /
#
# real 239m00.836s
# user 225m11.108s | stepping by 1
# sys 120m00.538s /
#
time
for d1 in {128..223..1} {240..255..1} ;do
#for d1 in {128..224..1} {239..255..1} ;do # force a valid UTF-8 (needs $B2b2,$B3b3)
B3b1=$(printf "%0.2X" $d1)
#
if [[ $B3b1 == "E0" ]] ; then
B3b2range="$(echo {0..159..1} {192..255..1})"
#B3b2range="$(> {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
elif [[ $B3b1 == "ED" ]] ; then
B3b2range="$(echo {0..127..1} {160..255..1})"
#B3b2range="$(echo {0..128..1} {160..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
else
B3b2range="$(echo {0..127..1} {192..255..1})"
#B3b2range="$(echo {0..128..1} {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
fi
for d2 in $B3b2range ;do
B3b2=$(printf "%0.2X" $d2)
echo "${B3b1} ${B3b2} xx"
#
for d3 in {0..127..1} {192..255..1} ;do
#for d3 in {0..128..1} {192..255..1} ;do # force a valid UTF-8 (needs $B2b1)
B3b3=$(printf "%0.2X" $d3)
#
echo -n "${B3b1}${B3b2}${B3b3}" |
xxd -p -u -r |
iconv -f UTF-8 2>/dev/null && {
echo "ERROR: VALID UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 31; }
#
done
done
done
echo

fi

#########################################################################

# Brute force testing in the Astral Plane will take a VERY LONG time..
# Perhaps selective testing is more appropriate, now that the previous tests
# have panned out okay...
#
# 4 Byte UTF-8 values:
if ((B4==1)) ; then
echo "# Test 4 bytes for Valid UTF-8 values: ie. values which are in range"
# ==================================================================
# real 58m18.531s
# user 56m44.317s |
# sys 27m29.867s /
time
for d1 in {240..244} ;do
# bin oct hex dec
# lo 11110000 360 F0 240
# hi 11110100 364 F4 244 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
B4b1=$(printf "%0.2X" $d1)
#
if [[ $B4b1 == "F0" ]] ; then
B4b2range="$(echo {144..191})" ## f0 90 80 80 to f0 bf bf bf
# bin oct hex dec 010000 -- 03FFFF
# lo 10010000 220 90 144
# hi 10111111 277 BF 191
#
elif [[ $B4b1 == "F4" ]] ; then
B4b2range="$(echo {128..143})" ## f4 80 80 80 to f4 8f bf bf
# bin oct hex dec 100000 -- 10FFFF
# lo 10000000 200 80 128
# hi 10001111 217 8F 143 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
else
B4b2range="$(echo {128..191})" ## fx 80 80 80 to f3 bf bf bf
# bin oct hex dec 0C0000 -- 0FFFFF
# lo 10000000 200 80 128 0A0000
# hi 10111111 277 BF 191
fi
#
for d2 in $B4b2range ;do
B4b2=$(printf "%0.2X" $d2)
#
for d3 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b3=$(printf "%0.2X" $d3)
echo "${B4b1} ${B4b2} ${B4b3} xx"
#
for d4 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b4=$(printf "%0.2X" $d4)
#
echo -n "${B4b1}${B4b2}${B4b3}${B4b4}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B4b1}${B4b2}${B4b3}${B4b4}"; exit 40; }
#
done
done
done
done
echo "# Test 4 bytes for Valid UTF-8 values: END"
echo
fi

########################################################################
# There is no test (yet) for negated range values in the astral plane. #
# (all negated range values must be invalid) #
# I won't bother; This was mainly for me to ge the general feel of #
# the tests, and the final test below should flush anything out.. #
# Traversing the intire UTF-8 range takes quite a while... #
# so no need to do it twice (albeit in a slightly different manner) #
########################################################################

################################
### The construction of: ####
### The Regular Expression ####
### (de-construction?) ####
################################

# BYTE 1 BYTE 2 BYTE 3 BYTE 4
# 1: [x00-x7F]
# ===========
# ([x00-x7F])
#
# 2: [xC2-xDF] [x80-xBF]
# =================================
# ([xC2-xDF][x80-xBF])
#
# 3: [xE0] [xA0-xBF] [x80-xBF]
# [xED] [x80-x9F] [x80-xBF]
# [xE1-xECxEE-xEF] [x80-xBF] [x80-xBF]
# ==============================================
# ((([xE0][xA0-xBF])|([xED][x80-x9F])|([xE1-xECxEE-xEF][x80-xBF]))([x80-xBF]))
#
# 4 [xF0] [x90-xBF] [x80-xBF] [x80-xBF]
# [xF1-xF3] [x80-xBF] [x80-xBF] [x80-xBF]
# [xF4] [x80-x8F] [x80-xBF] [x80-xBF]
# ===========================================================
# ((([xF0][x90-xBF])|([xF1-xF3][x80-xBF])|([xF4][x80-x8F]))([x80-xBF]{2}))
#
# The final regex
# ===============
# 1-4: (([x00-x7F])|([xC2-xDF][x80-xBF])|((([xE0][xA0-xBF])|([xED][x80-x9F])|([xE1-xECxEE-xEF][x80-xBF]))([x80-xBF]))|((([xF0][x90-xBF])|([xF1-xF3][x80-xBF])|([xF4][x80-x8F]))([x80-xBF]{2})))
# 4-1: (((([xF0][x90-xBF])|([xF1-xF3][x80-xBF])|([xF4][x80-x8F]))([x80-xBF]{2}))|((([xE0][xA0-xBF])|([xED][x80-x9F])|([xE1-xECxEE-xEF][x80-xBF]))([x80-xBF]))|([xC2-xDF][x80-xBF])|([x00-x7F]))


#######################################################################
# The final Test; for a single character (multi chars to follow) #
# Compare the return code of 'iconv' against the 'regex' #
# for the full range of 0x000000 to 0x10FFFF #
# #
# Note; this script has 3 modes: #
# Run this test TWICE, set each mode Manually! #
# #
# 1. Sequentially test every value from 0x000000 to 0x10FFFF #
# 2. Throw a spanner into the works! Force random byte patterns #
# 2. Throw a spanner into the works! Force random longer strings #
# ============================== #
# #
# Note: The purpose of this routine is to determine if there is any #
# difference how 'iconv' and 'regex' handle the same data #
# #
#######################################################################
if ((Rx==1)) ; then
# real 191m34.826s
# user 158m24.114s
# sys 83m10.676s
time {
invalCt=0
validCt=0
failCt=0
decBeg=$((0x00110000)) # incement by decimal integer
decMax=$((0x7FFFFFFF)) # incement by decimal integer
#
for ((CPDec=decBeg;CPDec<=decMax;CPDec+=13247)) ;do
((D==1)) && echo "=========================================================="
#
# Convert decimal integer '$CPDec' to Hex-digits; 6-long (dec2hex)
hexUTF32BE=$(printf '%0.8Xn' $CPDec) # hexUTF32BE

# progress count
if (((CPDec%$((0x1000)))==0)) ;then
((Test>2)) && echo
echo "$hexUTF32BE Test=$Test mode=${mode[$modebits]} "
fi
if ((Test==1 || Test==2 ))
then # Test 1. Sequentially test every value from 0x000000 to 0x10FFFF
#
if ((Test==2)) ; then
bits=32
UTF8="$( perl -C -e 'print chr 0x'$hexUTF32BE |
perl -l -ne '/^( [00-177]
| [300-337][200-277]
| [340-357][200-277]{2}
| [360-367][200-277]{3}
| [370-373][200-277]{4}
| [374-375][200-277]{5}
)*$/x and print' |xxd -p )"
UTF8="${UTF8%0a}"
[[ -n "$UTF8" ]]
&& rcIco32=0 || rcIco32=1
rcIco16=

elif ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
bits=16
UTF8="$( echo -n "${hexUTF32BE:4}" |
xxd -p -u -r |
iconv -f UTF-16BE -t UTF-8 2>/dev/null)"
&& rcIco16=0 || rcIco16=1
rcIco32=
else
bits=32
UTF8="$( echo -n "$hexUTF32BE" |
xxd -p -u -r |
iconv -f UTF-32BE -t UTF-8 2>/dev/null)"
&& rcIco32=0 || rcIco32=1
rcIco16=
fi
# echo "1 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((${rcIco16}${rcIco32}!=0)) ;then
# 'iconv -f UTF-16BE' failed produce a reliable UTF-8
if ((bits==16)) ;then
((D==1)) && echo "bits-$bits rcIconv: error $hexUTF32BE .. 'strict' failed, now trying 'lax'"
# iconv failed to create a 'srict' UTF-8 so
# try UTF-32BE to get a 'lax' UTF-8 pattern
UTF8="$( echo -n "$hexUTF32BE" |
xxd -p -u -r |
iconv -f UTF-32BE -t UTF-8 2>/dev/null)"
&& rcIco32=0 || rcIco32=1
#echo "2 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
if ((rcIco32!=0)) ;then
((D==1)) && echo -n "bits-$bits rcIconv: Cannot gen UTF-8 for: $hexUTF32BE"
rcIco32=1
fi
fi
fi
# echo "3 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((rcIco16==0 || rcIco32==0)) ;then
# 'strict(16)' OR 'lax(32)'... 'iconv' managed to generate a UTF-8 pattern
((D==1)) && echo -n "bits-$bits rcIconv: pattern* $hexUTF32BE"
((D==1)) && if [[ $bits == "16" && $rcIco32 == "0" ]] ;then
echo " .. 'lax' UTF-8 produced a pattern"
else
echo
fi
# regex test
if ((modebits==strict)) ;then
#rxOut="$(echo -n "$UTF8" |perl -l -ne '/^(([x00-x7F])|([xC2-xDF][x80-xBF])|((([xE0][xA0-xBF])|([xED][x80-x9F])|([xE1-xECxEE-xEF][x80-xBF]))([x80-xBF]))|((([xF0][x90-xBF])|([xF1-xF3][x80-xBF])|([xF4][x80-x8F]))([x80-xBF]{2})))*$/ or print' )"
rxOut="$(echo -n "$UTF8" |
perl -l -ne '/^( ([x00-x7F]) # 1-byte pattern
|([xC2-xDF][x80-xBF]) # 2-byte pattern
|((([xE0][xA0-xBF])|([xED][x80-x9F])|([xE1-xECxEE-xEF][x80-xBF]))([x80-xBF])) # 3-byte pattern
|((([xF0][x90-xBF])|([xF1-xF3][x80-xBF])|([xF4][x80-x8F]))([x80-xBF]{2})) # 4-byte pattern
)*$ /x or print' )"
else
if ((Test==2)) ;then
rx="$(echo -n "$UTF8" |perl -l -ne '/^([00-177]|[300-337][200-277]|[340-357][200-277]{2}|[360-367][200-277]{3}|[370-373][200-277]{4}|[374-375][200-277]{5})*$/ and print')"
[[ "$UTF8" != "$rx" ]] && rxOut="$UTF8" || rxOut=
rx="$(echo -n "$rx" |sed -e "s/(..)/1 /g")"
else
rxOut="$(echo -n "$UTF8" |perl -l -ne '/^([00-177]|[300-337][200-277]|[340-357][200-277]{2}|[360-367][200-277]{3}|[370-373][200-277]{4}|[374-375][200-277]{5})*$/ or print' )"
fi
fi
if [[ "$rxOut" == "" ]] ;then
((D==1)) && echo " rcRegex: ok"
rcRegex=0
else
((D==1)) && echo -n "bits-$bits rcRegex: error $hexUTF32BE .. 'strict' failed,"
((D==1)) && if [[ "12" == *$Test* ]] ;then
echo # " (codepoint) Test $Test"
else
echo
fi
rcRegex=1
fi
fi
#
elif [[ $Test == 2 ]]
then # Test 2. Throw a randomizing spanner into the works!
# Then test the arbitary bytes ASIS
#
hexLineRand="$(echo -n "$hexUTF32BE" |
sed -re "s/(.)(.)(.)(.)(.)(.)(.)(.)/1n2n3n4n5n6n7n8/" |
sort -R |
tr -d 'n')"
#
elif [[ $Test == 3 ]]
then # Test 3. Test single UTF-16BE bytes in the range 0x00000000 to 0x7FFFFFFF
#
echo "Test 3 is not properly implemented yet.. Exiting"
exit 99
else
echo "ERROR: Invalid mode"
exit
fi
#
#
if ((Test==1 || Test=2)) ;then
if ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
((rcIconv=rcIco16))
else
((rcIconv=rcIco32))
fi
if ((rcRegex!=rcIconv)) ;then
[[ $Test != 1 ]] && echo
if ((rcRegex==1)) ;then
echo "ERROR: 'regex' ok, but NOT 'iconv': ${hexUTF32BE} "
else
echo "ERROR: 'iconv' ok, but NOT 'regex': ${hexUTF32BE} "
fi
((failCt++));
elif ((rcRegex!=0)) ;then
# ((invalCt++)); echo -ne "$hexUTF32BE exit-codes $${rcIco16}${rcIco32}=,$rcRegext: $(printf "%0.8Xn" $invalCt)t$hexLine$(printf "%$(((mode3whi*2)-${#hexLine}))s")r"
((invalCt++))
else
((validCt++))
fi
if ((Test==1)) ;then
echo -ne "$hexUTF32BE " "mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) valid,invalid,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt)) r"
else
echo -ne "$hexUTF32BE $rx mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) val,inval,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt))r"
fi
fi
done
} # End time
fi
exit





share|improve this answer















Edit: I've fixed a typo-bug in the regex. It needed a `\x80`, not `80`.



The regex to filter out invalid UTF-8 forms, for strict adherence to UTF-8, is as follows:



perl -l -ne '/
^( ([\x00-\x7F])   # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF])   # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))   # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))   # 4-byte pattern
)*$ /x or print'


Output (of key lines from Test 1):



Codepoint
=========
00001000 Test=1 mode=strict valid,invalid,fail=(1000,0,0)
0000E000 Test=1 mode=strict valid,invalid,fail=(D800,800,0)
0010FFFF mode=strict test-return=(0,0) valid,invalid,fail=(10F800,800,0)




Q. How does one create test data to test a regex which filters invalid Unicode?

A. Create your own UTF-8 test algorithm, and break its rules...

Catch-22... But then, how do you test your test algorithm?

The regex above has been tested (using iconv as the reference) for every integer value from 0x000000 to 0x10FFFF, this upper value being the maximum integer value of a Unicode codepoint.





According to the Wikipedia UTF-8 page:

  • In November 2003 UTF-8 was restricted by RFC 3629 to four bytes covering only the range U+0000 to U+10FFFF, in order to match the constraints of the UTF-16 character encoding.

  • UTF-8 encodes each of the 1,112,064 code points in the Unicode character set, using one to four 8-bit bytes.

This number (1,112,064) equates to a range of 0x000000 to 0x10F7FF, which is 0x0800 shy of the actual maximum integer value for the highest Unicode codepoint: 0x10FFFF.



This block of integers is missing from the Unicode codepoint spectrum because of the need for the UTF-16 encoding to step beyond its original design intent, via a system called surrogate pairs. A block of 0x0800 integers has been reserved for use by UTF-16. This block spans the range 0x00D800 to 0x00DFFF. None of these integers are legal Unicode values, and they are therefore invalid UTF-8 values.
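As a quick sanity check on the arithmetic above, plain bash can confirm that removing the 0x0800 surrogate block from the 0x110000 integers leaves exactly the 1,112,064 figure quoted from Wikipedia:

```shell
# Total integers 0x000000..0x10FFFF, minus the UTF-16 surrogate block
# 0xD800..0xDFFF (0x800 values), gives the count of encodable codepoints.
total=$(( 0x110000 ))
surrogates=$(( 0x800 ))
echo $(( total - surrogates ))   # 1112064
```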



In Test 1, the regex has been tested against every number in the range of Unicode codepoints, and it matches exactly the results of iconv: i.e. 0x10F800 valid values, and 0x000800 invalid values.



However, the issue now arises of how the regex handles out-of-range UTF-8 values above 0x10FFFF (UTF-8 can extend to 6 bytes, with a maximum integer value of 0x7FFFFFFF).

To generate the necessary non-Unicode UTF-8 byte values, I've used the following command:



  perl -C -e 'print chr 0x'$hexUTF32BE


To test their validity (in some fashion), I've used Gilles' UTF-8 regex...



  perl -l -ne '/
^( [\0-\177]   # 1-byte pattern
|[\300-\337][\200-\277]   # 2-byte pattern
|[\340-\357][\200-\277]{2}   # 3-byte pattern
|[\360-\367][\200-\277]{3}   # 4-byte pattern
|[\370-\373][\200-\277]{4}   # 5-byte pattern
|[\374-\375][\200-\277]{5}   # 6-byte pattern
)*$ /x or print'


The output of perl's print chr matches the filtering of Gilles' regex; each reinforces the validity of the other. I can't use iconv here, because it only handles the valid-Unicode subset of the broader (original) UTF-8 standard...



The numbers involved are rather large, so I've tested top-of-range, bottom-of-range, and several scans stepping by increments such as 11111, 13579, 33333, 53441... The results all match, so now all that remains is to test the regex against these out-of-range UTF-8-style values (invalid for Unicode, and therefore also invalid for strict UTF-8 itself).





Here are the test modules:



[[ "$(locale charmap)" != "UTF-8" ]] && { echo "ERROR: locale must be UTF-8, but it is $(locale charmap)"; exit 1; }

# Testing the UTF-8 regex
#
# Tests to check that the observed byte-ranges (above) have
# been accurately observed and included in the test code and final regex.
# =========================================================================
: 2 bytes; B2=0 # run-test=1 do-not-test=0
: 3 bytes; B3=0 # run-test=1 do-not-test=0
: 4 bytes; B4=0 # run-test=1 do-not-test=0

: regex; Rx=1 # run-test=1 do-not-test=0

((strict=16)); mode[$strict]=strict # iconv -f UTF-16BE then iconv -f UTF-32BE beyond 0xFFFF)
(( lax=32)); mode[$lax]=lax # iconv -f UTF-32BE only)

# modebits=$strict
# UTF-8, in relation to UTF-16, has invalid values
# modebits=$strict automatically shifts to modebits=$lax
# when the tested integer exceeds 0xFFFF
# modebits=$lax
# UTF-8, in relation to UTF-32, has no restrictions


# Test 1 Sequentially tests a range of Big-Endian integers
# * Unicode Codepoints are a subset of Big-Endian integers
# ( based on 'iconv' -f UTF-32BE -f UTF-8 )
# Note: strict UTF-8 has a few quirks because of UTF-16
# Set modebits=16 to "strictly" test the low range

Test=1; modebits=$strict
# Test=2; modebits=$lax
# Test=3
mode3wlo=$(( 1*4)) # minimum chars * 4 ( '4' is for UTF-32BE )
mode3whi=$((10*4)) # minimum chars * 4 ( '4' is for UTF-32BE )


#########################################################################

# 1 byte UTF-8 values: Nothing to do; no complexities.

#########################################################################

# 2 Byte UTF-8 values: Verifying that I've got the right range values.
if ((B2==1)) ; then
echo "# Test 2 bytes for Valid UTF-8 values: ie. values which are in range"
# =========================================================================
time
for d1 in {194..223} ;do
# bin oct hex dec
# lo 11000010 302 C2 194
# hi 11011111 337 DF 223
B2b1=$(printf "%0.2X" $d1)
#
for d2 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B2b2=$(printf "%0.2X" $d2)
#
echo -n "${B2b1}${B2b2}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B2b1}${B2b2}"; exit 20; }
#
done
done
echo

# Now do a negated test.. This takes longer, because there are more values.
echo "# Test 2 bytes for Invalid values: ie. values which are out of range"
# =========================================================================
# Note: 'iconv' will treat a leading \x00-\x7F as a valid leading single,
# so this negated test primes the first UTF-8 byte with values starting at \x80
time
for d1 in {128..193} {224..255} ;do
#for d1 in {128..194} {224..255} ;do # force a valid UTF-8 (needs $B2b2)
B2b1=$(printf "%0.2X" $d1)
#
for d2 in {0..127} {192..255} ;do
#for d2 in {0..128} {192..255} ;do # force a valid UTF-8 (needs $B2b1)
B2b2=$(printf "%0.2X" $d2)
#
echo -n "${B2b1}${B2b2}" |
xxd -p -u -r |
iconv -f UTF-8 2>/dev/null && {
echo "ERROR: VALID UTF-8 found: ${B2b1}${B2b2}"; exit 21; }
#
done
done
echo
fi

#########################################################################

# 3 Byte UTF-8 values: Verifying that I've got the right range values.
if ((B3==1)) ; then
echo "# Test 3 bytes for Valid UTF-8 values: ie. values which are in range"
# ========================================================================
time
for d1 in {224..239} ;do
# bin oct hex dec
# lo 11100000 340 E0 224
# hi 11101111 357 EF 239
B3b1=$(printf "%0.2X" $d1)
#
if [[ $B3b1 == "E0" ]] ; then
B3b2range="$(echo {160..191})"
# bin oct hex dec
# lo 10100000 240 A0 160
# hi 10111111 277 BF 191
elif [[ $B3b1 == "ED" ]] ; then
B3b2range="$(echo {128..159})"
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10011111 237 9F 159
else
B3b2range="$(echo {128..191})"
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
fi
#
for d2 in $B3b2range ;do
B3b2=$(printf "%0.2X" $d2)
echo "${B3b1} ${B3b2} xx"
#
for d3 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B3b3=$(printf "%0.2X" $d3)
#
echo -n "${B3b1}${B3b2}${B3b3}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 30; }
#
done
done
done
echo

# Now do a negated test.. This takes longer, because there are more values.
echo "# Test 3 bytes for Invalid values: ie. values which are out of range"
# =========================================================================
# Note: 'iconv' will treat a leading \x00-\x7F as a valid leading single,
# so this negated test primes the first UTF-8 byte with values starting at \x80
#
# real 26m28.462s
# user 27m12.526s | stepping by 2
# sys 13m11.193s /
#
# real 239m00.836s
# user 225m11.108s | stepping by 1
# sys 120m00.538s /
#
time
for d1 in {128..223..1} {240..255..1} ;do
#for d1 in {128..224..1} {239..255..1} ;do # force a valid UTF-8 (needs $B2b2,$B3b3)
B3b1=$(printf "%0.2X" $d1)
#
if [[ $B3b1 == "E0" ]] ; then
B3b2range="$(echo {0..159..1} {192..255..1})"
#B3b2range="$(echo {0..160..1} {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
elif [[ $B3b1 == "ED" ]] ; then
B3b2range="$(echo {0..127..1} {160..255..1})"
#B3b2range="$(echo {0..128..1} {160..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
else
B3b2range="$(echo {0..127..1} {192..255..1})"
#B3b2range="$(echo {0..128..1} {192..255..1})" # force a valid UTF-8 (needs $B3b1,$B3b3)
fi
for d2 in $B3b2range ;do
B3b2=$(printf "%0.2X" $d2)
echo "${B3b1} ${B3b2} xx"
#
for d3 in {0..127..1} {192..255..1} ;do
#for d3 in {0..128..1} {192..255..1} ;do # force a valid UTF-8 (needs $B2b1)
B3b3=$(printf "%0.2X" $d3)
#
echo -n "${B3b1}${B3b2}${B3b3}" |
xxd -p -u -r |
iconv -f UTF-8 2>/dev/null && {
echo "ERROR: VALID UTF-8 found: ${B3b1}${B3b2}${B3b3}"; exit 31; }
#
done
done
done
echo

fi

#########################################################################

# Brute force testing in the Astral Plane will take a VERY LONG time..
# Perhaps selective testing is more appropriate, now that the previous tests
# have panned out okay...
#
# 4 Byte UTF-8 values:
if ((B4==1)) ; then
echo "# Test 4 bytes for Valid UTF-8 values: ie. values which are in range"
# ==================================================================
# real 58m18.531s
# user 56m44.317s |
# sys 27m29.867s /
time
for d1 in {240..244} ;do
# bin oct hex dec
# lo 11110000 360 F0 240
# hi 11110100 364 F4 244 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
B4b1=$(printf "%0.2X" $d1)
#
if [[ $B4b1 == "F0" ]] ; then
B4b2range="$(echo {144..191})" ## f0 90 80 80 to f0 bf bf bf
# bin oct hex dec 010000 -- 03FFFF
# lo 10010000 220 90 144
# hi 10111111 277 BF 191
#
elif [[ $B4b1 == "F4" ]] ; then
B4b2range="$(echo {128..143})" ## f4 80 80 80 to f4 8f bf bf
# bin oct hex dec 100000 -- 10FFFF
# lo 10000000 200 80 128
# hi 10001111 217 8F 143 -- F4 encodes some values greater than 0x10FFFF;
# such a sequence is invalid.
else
B4b2range="$(echo {128..191})" ## fx 80 80 80 to f3 bf bf bf
# bin oct hex dec 0C0000 -- 0FFFFF
# lo 10000000 200 80 128 0A0000
# hi 10111111 277 BF 191
fi
#
for d2 in $B4b2range ;do
B4b2=$(printf "%0.2X" $d2)
#
for d3 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b3=$(printf "%0.2X" $d3)
echo "${B4b1} ${B4b2} ${B4b3} xx"
#
for d4 in {128..191} ;do
# bin oct hex dec
# lo 10000000 200 80 128
# hi 10111111 277 BF 191
B4b4=$(printf "%0.2X" $d4)
#
echo -n "${B4b1}${B4b2}${B4b3}${B4b4}" |
xxd -p -u -r |
iconv -f UTF-8 >/dev/null || {
echo "ERROR: Invalid UTF-8 found: ${B4b1}${B4b2}${B4b3}${B4b4}"; exit 40; }
#
done
done
done
done
echo "# Test 4 bytes for Valid UTF-8 values: END"
echo
fi

########################################################################
# There is no test (yet) for negated range values in the astral plane. #
# (all negated range values must be invalid) #
# I won't bother; This was mainly for me to get the general feel of #
# the tests, and the final test below should flush anything out.. #
# Traversing the entire UTF-8 range takes quite a while... #
# so no need to do it twice (albeit in a slightly different manner) #
########################################################################

################################
### The construction of: ####
### The Regular Expression ####
### (de-construction?) ####
################################

# BYTE 1 BYTE 2 BYTE 3 BYTE 4
# 1: [x00-x7F]
# ===========
# ([x00-x7F])
#
# 2: [xC2-xDF] [x80-xBF]
# =================================
# ([xC2-xDF][x80-xBF])
#
# 3: [xE0] [xA0-xBF] [x80-xBF]
# [xED] [x80-x9F] [x80-xBF]
# [xE1-xECxEE-xEF] [x80-xBF] [x80-xBF]
# ==============================================
# ((([xE0][xA0-xBF])|([xED][x80-x9F])|([xE1-xECxEE-xEF][x80-xBF]))([x80-xBF]))
#
# 4 [xF0] [x90-xBF] [x80-xBF] [x80-xBF]
# [xF1-xF3] [x80-xBF] [x80-xBF] [x80-xBF]
# [xF4] [x80-x8F] [x80-xBF] [x80-xBF]
# ===========================================================
# ((([xF0][x90-xBF])|([xF1-xF3][x80-xBF])|([xF4][x80-x8F]))([x80-xBF]{2}))
#
# The final regex
# ===============
# 1-4: (([x00-x7F])|([xC2-xDF][x80-xBF])|((([xE0][xA0-xBF])|([xED][x80-x9F])|([xE1-xECxEE-xEF][x80-xBF]))([x80-xBF]))|((([xF0][x90-xBF])|([xF1-xF3][x80-xBF])|([xF4][x80-x8F]))([x80-xBF]{2})))
# 4-1: (((([xF0][x90-xBF])|([xF1-xF3][x80-xBF])|([xF4][x80-x8F]))([x80-xBF]{2}))|((([xE0][xA0-xBF])|([xED][x80-x9F])|([xE1-xECxEE-xEF][x80-xBF]))([x80-xBF]))|([xC2-xDF][x80-xBF])|([x00-x7F]))


#######################################################################
# The final Test; for a single character (multi chars to follow) #
# Compare the return code of 'iconv' against the 'regex' #
# for the full range of 0x000000 to 0x10FFFF #
# #
# Note; this script has 3 modes: #
# Run this test TWICE, set each mode Manually! #
# #
# 1. Sequentially test every value from 0x000000 to 0x10FFFF #
# 2. Throw a spanner into the works! Force random byte patterns #
# 3. Throw a spanner into the works! Force random longer strings #
# ============================== #
# #
# Note: The purpose of this routine is to determine if there is any #
# difference how 'iconv' and 'regex' handle the same data #
# #
#######################################################################
if ((Rx==1)) ; then
# real 191m34.826s
# user 158m24.114s
# sys 83m10.676s
time {
invalCt=0
validCt=0
failCt=0
decBeg=$((0x00110000)) # increment by decimal integer
decMax=$((0x7FFFFFFF)) # increment by decimal integer
#
for ((CPDec=decBeg;CPDec<=decMax;CPDec+=13247)) ;do
((D==1)) && echo "=========================================================="
#
# Convert decimal integer '$CPDec' to Hex-digits; 8-long (dec2hex)
hexUTF32BE=$(printf '%0.8X\n' $CPDec) # hexUTF32BE

# progress count
if (((CPDec%$((0x1000)))==0)) ;then
((Test>2)) && echo
echo "$hexUTF32BE Test=$Test mode=${mode[$modebits]} "
fi
if ((Test==1 || Test==2 ))
then # Test 1. Sequentially test every value from 0x000000 to 0x10FFFF
#
if ((Test==2)) ; then
bits=32
UTF8="$( perl -C -e 'print chr 0x'$hexUTF32BE |
perl -l -ne '/^( [\0-\177]
| [\300-\337][\200-\277]
| [\340-\357][\200-\277]{2}
| [\360-\367][\200-\277]{3}
| [\370-\373][\200-\277]{4}
| [\374-\375][\200-\277]{5}
)*$/x and print' |xxd -p )"
UTF8="${UTF8%0a}"
[[ -n "$UTF8" ]] && rcIco32=0 || rcIco32=1
rcIco16=

elif ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
bits=16
UTF8="$( echo -n "${hexUTF32BE:4}" |
xxd -p -u -r |
iconv -f UTF-16BE -t UTF-8 2>/dev/null)" && rcIco16=0 || rcIco16=1
rcIco32=
else
bits=32
UTF8="$( echo -n "$hexUTF32BE" |
xxd -p -u -r |
iconv -f UTF-32BE -t UTF-8 2>/dev/null)" && rcIco32=0 || rcIco32=1
rcIco16=
fi
# echo "1 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((${rcIco16}${rcIco32}!=0)) ;then
# 'iconv -f UTF-16BE' failed to produce a reliable UTF-8
if ((bits==16)) ;then
((D==1)) && echo "bits-$bits rcIconv: error $hexUTF32BE .. 'strict' failed, now trying 'lax'"
# iconv failed to create a 'strict' UTF-8 so
# try UTF-32BE to get a 'lax' UTF-8 pattern
UTF8="$( echo -n "$hexUTF32BE" |
xxd -p -u -r |
iconv -f UTF-32BE -t UTF-8 2>/dev/null)" && rcIco32=0 || rcIco32=1
#echo "2 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
if ((rcIco32!=0)) ;then
((D==1)) && echo -n "bits-$bits rcIconv: Cannot gen UTF-8 for: $hexUTF32BE"
rcIco32=1
fi
fi
fi
# echo "3 mode=${mode[$modebits]}-$bits rcIconv: (${rcIco16},${rcIco32}) $hexUTF32BE "
#
#
#
if ((rcIco16==0 || rcIco32==0)) ;then
# 'strict(16)' OR 'lax(32)'... 'iconv' managed to generate a UTF-8 pattern
((D==1)) && echo -n "bits-$bits rcIconv: pattern* $hexUTF32BE"
((D==1)) && if [[ $bits == "16" && $rcIco32 == "0" ]] ;then
echo " .. 'lax' UTF-8 produced a pattern"
else
echo
fi
# regex test
if ((modebits==strict)) ;then
#rxOut="$(echo -n "$UTF8" |perl -l -ne '/^(([\x00-\x7F])|([\xC2-\xDF][\x80-\xBF])|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})))*$/ or print' )"
rxOut="$(echo -n "$UTF8" |
perl -l -ne '/^( ([\x00-\x7F])   # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF])   # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF]))   # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2}))   # 4-byte pattern
)*$ /x or print' )"
else
if ((Test==2)) ;then
rx="$(echo -n "$UTF8" |perl -l -ne '/^([\0-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ and print')"
[[ "$UTF8" != "$rx" ]] && rxOut="$UTF8" || rxOut=
rx="$(echo -n "$rx" |sed -e "s/\(..\)/\1 /g")"
else
rxOut="$(echo -n "$UTF8" |perl -l -ne '/^([\0-\177]|[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}|[\360-\367][\200-\277]{3}|[\370-\373][\200-\277]{4}|[\374-\375][\200-\277]{5})*$/ or print' )"
fi
fi
if [[ "$rxOut" == "" ]] ;then
((D==1)) && echo " rcRegex: ok"
rcRegex=0
else
((D==1)) && echo -n "bits-$bits rcRegex: error $hexUTF32BE .. 'strict' failed,"
((D==1)) && if [[ "12" == *$Test* ]] ;then
echo # " (codepoint) Test $Test"
else
echo
fi
rcRegex=1
fi
fi
#
elif [[ $Test == 2 ]]
then # Test 2. Throw a randomizing spanner into the works!
# Then test the arbitrary bytes as-is
#
hexLineRand="$(echo -n "$hexUTF32BE" |
sed -re "s/(.)(.)(.)(.)(.)(.)(.)(.)/\1\n\2\n\3\n\4\n\5\n\6\n\7\n\8/" |
sort -R |
tr -d '\n')"
#
elif [[ $Test == 3 ]]
then # Test 3. Test single UTF-16BE bytes in the range 0x00000000 to 0x7FFFFFFF
#
echo "Test 3 is not properly implemented yet.. Exiting"
exit 99
else
echo "ERROR: Invalid mode"
exit
fi
#
#
if ((Test==1 || Test==2)) ;then
if ((modebits==strict && CPDec<=$((0xFFFF)))) ;then
((rcIconv=rcIco16))
else
((rcIconv=rcIco32))
fi
if ((rcRegex!=rcIconv)) ;then
[[ $Test != 1 ]] && echo
if ((rcRegex==1)) ;then
echo "ERROR: 'regex' ok, but NOT 'iconv': ${hexUTF32BE} "
else
echo "ERROR: 'iconv' ok, but NOT 'regex': ${hexUTF32BE} "
fi
((failCt++));
elif ((rcRegex!=0)) ;then
# ((invalCt++)); echo -ne "$hexUTF32BE exit-codes ${rcIco16}${rcIco32},$rcRegex: $(printf "%0.8X\n" $invalCt)\t$hexLine$(printf "%$(((mode3whi*2)-${#hexLine}))s")\r"
((invalCt++))
else
((validCt++))
fi
if ((Test==1)) ;then
echo -ne "$hexUTF32BE " "mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) valid,invalid,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt)) \r"
else
echo -ne "$hexUTF32BE $rx mode=${mode[$modebits]} test-return=($rcIconv,$rcRegex) val,inval,fail=($(printf "%X" $validCt),$(printf "%X" $invalCt),$(printf "%X" $failCt))\r"
fi
fi
done
} # End time
fi
exit






share|improve this answer






















edited May 6 '11 at 8:29

























answered May 3 '11 at 18:50









Peter.OPeter.O

18.9k1791144




18.9k1791144













  • The main problem with my regexp is that it allowed some forbidden sequences such as \300\200 (really bad: that's code point 0 not expressed with a null byte!). I think your regexp rejects them correctly.

    – Gilles
    May 8 '11 at 12:38
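The overlong case mentioned in that comment can be checked directly. Below is a simplified Python sketch of the idea (it is not the script's full 'strict' expression, just the plain byte-class pattern), showing that byte classes alone accept the overlong sequence \300\200 while a strict decoder rejects it:

```python
import re

# Lax UTF-8 byte-class pattern: leading byte followed by the right
# number of continuation bytes. It does NOT exclude overlong forms.
lax = re.compile(rb'\A(?:[\x00-\x7f]'
                 rb'|[\xc0-\xdf][\x80-\xbf]'
                 rb'|[\xe0-\xef][\x80-\xbf]{2}'
                 rb'|[\xf0-\xf7][\x80-\xbf]{3})*\Z')

overlong = b'\xc0\x80'  # overlong encoding of U+0000 (\300\200 in octal)

lax_accepts = bool(lax.match(overlong))  # the byte classes accept it
try:
    overlong.decode('utf-8')
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False               # a strict decoder rejects it

print(lax_accepts, strict_accepts)
```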



















6





+100









I find uconv (in icu-devtools package in Debian) useful to inspect UTF-8 data:



$ print '\\xE9 \xe9 \u20ac \ud800\udc00 \U110000' |
   uconv --callback escape-c -t us
\\xE9 \xE9 \u20ac \xED\xA0\x80\xED\xB0\x80 \xF4\x90\x80\x80


(The \x's help spotting the invalid characters (except for the false positive voluntarily introduced with a literal \xE9 above)).



(plenty of other nice usages).






share|improve this answer


























  • I think recode can be used similarly - except that I think it should fail if asked to translate an invalid multibyte sequence. I'm not sure though; it won't fail for print...|recode u8..u8/x4 for example (which just does a hexdump as you do above) because it doesn't do anything but iconv data data, but it does fail like recode u8..u2..u8/x4 because it translates then prints. But I don't know enough about it to be sure - and there are a lot of possibilities.

    – mikeserv
    Dec 20 '14 at 19:29











  • If I have a file, say, test.txt. How should I suppose to find the invalid character using your solution? What does us in your code mean?

    – jdhao
    Dec 26 '17 at 7:51











  • @Hao, us means United States, that is short for ASCII. It converts the input into an ASCII one where the non-ASCII characters are converted to \uXXXX notation and the non-characters to \xXX.

    – Stéphane Chazelas
    Dec 26 '17 at 14:19
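As a rough illustration of the \xXX notation described in that comment, Python's errors='backslashreplace' handler renders undecodable bytes the same way (this is an analogy on my part, not uconv's actual mechanism, and it does not reproduce the \uXXXX escaping of valid non-ASCII characters):

```python
# A stray latin-1 byte (0xE9) followed by a valid UTF-8 euro sign.
data = b'caf\xe9 \xe2\x82\xac'

# Undecodable bytes come out as \xXX escapes instead of raising.
escaped = data.decode('utf-8', errors='backslashreplace')
print(escaped)  # caf\xe9 €
```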











  • Where should I put my file to use your script? Is the last line in the code block the output of your code? It is a little confusing to me.

    – jdhao
    Dec 26 '17 at 14:27
















edited Dec 20 '14 at 19:09

























answered Dec 19 '14 at 23:46









Stéphane Chazelas

3














Python has had a built-in unicode function since version 2.0.



#!/usr/bin/env python2
import sys
for line in sys.stdin:
    try:
        unicode(line, 'utf-8')
    except UnicodeDecodeError:
        sys.stdout.write(line)


In Python 3, unicode has been folded into str. It needs to be passed a bytes-like object, here the underlying buffer objects for the standard descriptors.



#!/usr/bin/env python3
import sys
for line in sys.stdin.buffer:
    try:
        str(line, 'utf-8')
    except UnicodeDecodeError:
        sys.stdout.buffer.write(line)





share|improve this answer
























  • The python 2 one fails to flag UTF-8 encoded UTF-16 surrogate non-characters (at least with 2.7.6).

    – Stéphane Chazelas
    Dec 19 '14 at 23:12











  • @StéphaneChazelas Dammit. Thanks. I've only run nominal tests so far, I'll run Peter's test battery later.

    – Gilles
    Dec 19 '14 at 23:14
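The surrogate case raised in that comment can be demonstrated on Python 3, where the strict UTF-8 decoder does flag it (the bytes \xed\xa0\x80 are the UTF-8-style encoding of the lone surrogate U+D800):

```python
# UTF-8-encoded UTF-16 surrogate, which is not well-formed UTF-8.
surrogate = b'\xed\xa0\x80'

try:
    surrogate.decode('utf-8')
    flagged = False
except UnicodeDecodeError:
    flagged = True  # Python 3's strict decoder rejects it

print(flagged)  # True
```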
















3














Python has had a built-in unicode function since version 2.0.



#!/usr/bin/env python2
import sys
for line in sys.stdin:
try:
unicode(line, 'utf-8')
except UnicodeDecodeError:
sys.stdout.write(line)


In Python 3, unicode has been folded into str. It needs to be passed a bytes-like object, here the underlying buffer objects for the standard descriptors.



#!/usr/bin/env python3
import sys
for line in sys.stdin.buffer:
try:
str(line, 'utf-8')
except UnicodeDecodeError:
sys.stdout.buffer.write(line)





share|improve this answer
























  • The python 2 one fails to flag UTF-8 encoded UTF-16 surrogate non-characters (at least with 2.7.6).

    – Stéphane Chazelas
    Dec 19 '14 at 23:12











  • @StéphaneChazelas Dammit. Thanks. I've only run nominal tests so far, I'll run Peter's test battery later.

    – Gilles
    Dec 19 '14 at 23:14














3












3








3







Python has had a built-in unicode function since version 2.0.



#!/usr/bin/env python2
import sys
for line in sys.stdin:
try:
unicode(line, 'utf-8')
except UnicodeDecodeError:
sys.stdout.write(line)


In Python 3, unicode has been folded into str. It needs to be passed a bytes-like object, here the underlying buffer objects for the standard descriptors.



#!/usr/bin/env python3
import sys
for line in sys.stdin.buffer:
try:
str(line, 'utf-8')
except UnicodeDecodeError:
sys.stdout.buffer.write(line)





share|improve this answer













Python has had a built-in unicode function since version 2.0.



#!/usr/bin/env python2
import sys
for line in sys.stdin:
try:
unicode(line, 'utf-8')
except UnicodeDecodeError:
sys.stdout.write(line)


In Python 3, unicode has been folded into str. It needs to be passed a bytes-like object, here the underlying buffer objects for the standard descriptors.



#!/usr/bin/env python3
import sys
for line in sys.stdin.buffer:
try:
str(line, 'utf-8')
except UnicodeDecodeError:
sys.stdout.buffer.write(line)






share|improve this answer












share|improve this answer



share|improve this answer










answered Dec 19 '14 at 23:05









GillesGilles

533k12810721594




533k12810721594













  • The python 2 one fails to flag UTF-8 encoded UTF-16 surrogate non-characters (at least with 2.7.6).

    – Stéphane Chazelas
    Dec 19 '14 at 23:12











  • @StéphaneChazelas Dammit. Thanks. I've only run nominal tests so far, I'll run Peter's test battery later.

    – Gilles
    Dec 19 '14 at 23:14



















  • The python 2 one fails to flag UTF-8 encoded UTF-16 surrogate non-characters (at least with 2.7.6).

    – Stéphane Chazelas
    Dec 19 '14 at 23:12











  • @StéphaneChazelas Dammit. Thanks. I've only run nominal tests so far, I'll run Peter's test battery later.

    – Gilles
    Dec 19 '14 at 23:14

















The python 2 one fails to flag UTF-8 encoded UTF-16 surrogate non-characters (at least with 2.7.6).

– Stéphane Chazelas
Dec 19 '14 at 23:12





The python 2 one fails to flag UTF-8 encoded UTF-16 surrogate non-characters (at least with 2.7.6).

– Stéphane Chazelas
Dec 19 '14 at 23:12













@StéphaneChazelas Dammit. Thanks. I've only run nominal tests so far, I'll run Peter's test battery later.

– Gilles
Dec 19 '14 at 23:14





@StéphaneChazelas Dammit. Thanks. I've only run nominal tests so far, I'll run Peter's test battery later.

– Gilles
Dec 19 '14 at 23:14











0














I came across a similar problem (details in the "Context" section below) and arrived at the following ftfy_line_by_line.py solution:



#!/usr/bin/env python3
import ftfy, sys
with open(sys.argv[1], mode='rt', encoding='utf8', errors='replace') as f:
    for line in f:
        sys.stdout.buffer.write(ftfy.fix_text(line).encode('utf8', 'replace'))
        #print(ftfy.fix_text(line).rstrip().decode(encoding="utf-8", errors="replace"))


It uses decoding/encoding with errors='replace' plus ftfy to auto-fix mojibake and make other corrections.
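To show the errors='replace' step on its own (without ftfy; the input bytes here are a made-up example): each undecodable byte becomes the replacement character U+FFFD instead of raising UnicodeDecodeError, so downstream tools never see invalid UTF-8.

```python
# Two invalid bytes embedded in otherwise-ASCII text.
raw = b'name \xff\xfe end'

# 'replace' substitutes U+FFFD for each byte that cannot be decoded.
text = raw.decode('utf-8', errors='replace')
print(text)  # name �� end
```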



Context



I collected a >10GiB CSV of basic filesystem metadata using the following gen_basic_files_metadata.csv.sh script, which essentially runs:



find "${path}" -type f -exec stat --format="%i,%Y,%s,${hostname},%m,%n" "{}" \;


The trouble I had was with inconsistent encoding of filenames across file systems, causing UnicodeDecodeError when processing the data further with Python applications (csvsql, to be more specific).



Therefore I applied the above ftfy script.

Please note ftfy is pretty slow; processing those >10GiB took:



real    147m35.182s
user 146m14.329s
sys 2m8.713s


while sha256sum for comparison:



real    6m28.897s
user 1m9.273s
sys 0m6.210s


on an Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz with 16GiB RAM (and data on an external drive)






share|improve this answer


























  • And yes, I know that this find command will not properly encode filenames containing quotes according to the CSV standard

    – Grzegorz Wierzowiecki
    May 25 '17 at 14:12
















edited May 25 '17 at 18:33

answered May 25 '17 at 14:11

Grzegorz Wierzowiecki

















Thanks for contributing an answer to Unix & Linux Stack Exchange!