s*******3 posts: 6 | 1 The UTF-8 encoding scheme is described below:
0XXXXXXX = this is the entire rune
10XXXXXX = this is a continuation of the rune from the previous byte
110XXXXX = this is the start of a 2-byte rune.
1110XXXX = this is the start of a 3-byte rune.
11110XXX = this is the start of a 4-byte rune.
111110XX = this is the start of a 5-byte rune.
1111110X = this is the start of a 6-byte rune.
11111110 = this is the start of a 7-byte rune.
11111111 = this is the start of an 8-byte rune.
For example, a 3-byte rune would be 1110XXXX 10XXXXXX 10XXXXXX.
Write a function that decides whether a given byte array (or string) is
valid UTF-8 encoded text.
Looking for discussion and guidance. |
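As a concrete instance of the 3-byte pattern above, here is a minimal sketch that prints the bit pattern of the euro sign (U+20AC) as produced by Java's standard UTF-8 encoder. The class and method names (`RuneBits`, `bits`) are made up for illustration and are not part of the question.

```java
import java.nio.charset.StandardCharsets;

final class RuneBits {
    // Render one byte as eight binary digits, e.g. (byte) 0xE2 -> "11100010".
    static String bits(byte b) {
        String s = Integer.toBinaryString(b & 0xFF); // & 0xFF drops the sign extension
        return String.format("%8s", s).replace(' ', '0');
    }

    // Join the bit patterns of all bytes with single spaces.
    static String bits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(bits(b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] euro = "\u20AC".getBytes(StandardCharsets.UTF_8);
        // One 1110XXXX start byte followed by two 10XXXXXX continuation bytes:
        System.out.println(bits(euro)); // 11100010 10000010 10101100
    }
}
```

The first byte has three leading 1 bits, which matches the rule that the count of leading 1s in a start byte equals the total length of the rune.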
x***j posts: 75 | 2 I don't quite follow this. |
l*****a posts: 14598 | 3 Then go read up on the definition of UTF-8.
[x***j wrote]: I don't quite follow this.
|
l*****a posts: 14598 | 4 Use this to identify each byte's type, then check whether the sequence satisfies the rules?
public int getType(byte input) {
    int target = 1 << 7;
    int type = 0;
    // count the leading 1 bits, stopping at the first 0
    while (target > 0) {
        if ((input & target) == 0) return type; // parentheses needed: == binds tighter than &
        type++;
        target >>= 1;
    }
    return type; // all 8 bits were set
}
[In reply to s*******3]
|
l*******a posts: 16 | 5 Nice~
public boolean validUTF8(byte[] array) {
    if (array == null || array.length == 0) {
        return false;
    }
    int leftByte = 0;
    for (int i = 0; i < array.length; i++) {
        int val = getType(array[i]);
        if (leftByte == 0) { // new start
            if (val == 1) { // continue byte
                return false;
            } else if (val == 0) {
                continue;
            } else {
                leftByte = val - 1;
            }
        } else { // shall continue
            if (val != 1) {
                return false;
            } else {
                leftByte--;
            }
        }
    }
    return leftByte == 0;
}
public int getType(byte input) {
    int target = 1 << 7;
    int type = 0;
    while (target > 0) {
        if ((input & target) == 0) return type;
        type++;
        target = target >> 1;
    }
    return type;
}
[In reply to l*****a]
|
p**p posts: 742 | 6 This doesn't look optimal. There's no need to loop over all 8 bits.
You can return as soon as you hit the first 0:
private int getType(byte b) {
    int mask = 1 << 7;
    int type = 0;
    while ((mask & b) != 0) {
        type++;
        mask >>= 1;
    }
    return type;
}
[In reply to l*****a]
|
p**p posts: 742 | 7 I misread it. Your method also returns at the first 0.
[In reply to p**p]
|
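Both versions of getType above stop at the first 0 bit. For comparison, here is a loop-free sketch of the same leading-ones count built on `Integer.numberOfLeadingZeros` from the JDK. The class name `LeadingOnes` is hypothetical; nobody in the thread proposed this variant.

```java
final class LeadingOnes {
    // Count the leading 1 bits of a byte: 0 = single-byte rune,
    // 1 = continuation byte, n >= 2 = start of an n-byte rune.
    static int getType(byte b) {
        // Flip the bits so leading 1s become leading 0s, then keep only
        // the low 8 bits to discard the sign extension of the byte.
        int inverted = ~b & 0xFF;
        // numberOfLeadingZeros counts over 32 bits; the top 24 are always 0 here.
        return Integer.numberOfLeadingZeros(inverted) - 24;
    }

    public static void main(String[] args) {
        System.out.println(getType((byte) 0x41)); // 0
        System.out.println(getType((byte) 0x80)); // 1
        System.out.println(getType((byte) 0xE2)); // 3
        System.out.println(getType((byte) 0xFF)); // 8
    }
}
```

Whether this is faster than the loop in practice is not something the thread measured; the point is only that the count can be expressed without iterating.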
p*****2 posts: 21240 | 8 I'll write one up when I get home.
[In reply to l*****a]
|
p**p posts: 742 | 9
private int getType(byte b) {
    int mask = 1 << 7;
    int type = 0;
    while ((mask & b) != 0) {
        type++;
        mask >>= 1;
    }
    return type;
}
public boolean isValidUTF8(byte[] bytes) {
    if (bytes == null || bytes.length == 0) {
        return false;
    }
    int bytesLeft = 0;
    for (byte b : bytes) {
        int type = getType(b);
        if (type == 0) {
            if (bytesLeft != 0) {
                return false;
            }
            continue;
        } else if (type == 1) {
            if (bytesLeft == 0) {
                return false;
            }
            bytesLeft--;
        } else {
            if (bytesLeft != 0) {
                return false;
            }
            bytesLeft = type - 1;
        }
    }
    return bytesLeft == 0;
}
[In reply to s*******3]
|
n**p posts: 1150 | 10
while (s != NULL && *s != '\0')
{
    // == binds tighter than &, so every mask test needs its own parentheses
    if ((*s & 0x80) == 0) { s++; continue; }   // 0XXXXXXX: single-byte rune
    if ((*s & 0xC0) == 0x80) return false;     // stray continuation byte
    char* t = s + 1;
    // each 1 bit after the top bit of the start byte demands one continuation byte
    for (int bitmask = 0x40;
         bitmask > 0 && (*s & bitmask) != 0;
         bitmask /= 2, t++)
        if (*t == '\0' || (*t & 0xC0) != 0x80) return false;
    s = t;   // t already points at the first byte of the next rune
}
return true;
[In reply to s*******3]
|
p*****2 posts: 21240 | 11
// each element of xs is one byte written as a string of bits, e.g. "11100010"
def isUTF(xs: List[String]): Boolean = {
  xs match {
    case Nil => true
    case head :: tail => head.toList match {
      case '0' :: _ => isUTF(xs.tail)
      case '1' :: '0' :: _ => false
      case _ =>
        val len = (x: String) => x.prefixLength(_ == '1')
        val l = len(head)
        val rest = tail.take(l - 1)
        rest.size == l - 1 && rest.forall(len(_) == 1) && isUTF(xs.drop(l))
    }
  }
} |
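l*********8 posts: 4642 | 12
#include <algorithm>
#include <string>
using namespace std;

bool isUTF8(const string& str) {
    int ct = 0;  // continuation bytes still expected
    for (char ch : str) {
        // Count the leading 1 bits: start at 8 and subtract one per shift
        // until the inverted byte is 0. The inverted byte must be unsigned:
        // where char is signed, ~ch is negative for ASCII bytes and an
        // arithmetic right shift never reaches 0.
        int type = 8;
        for (unsigned char inv = static_cast<unsigned char>(~ch); inv; inv >>= 1)
            --type;
        if (ct > 0) {
            if (type != 1) return false;
            --ct;
        } else {
            if (type == 1) return false;
            ct = max(0, type - 1);
        }
    }
    return ct == 0;
} |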
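All of the solutions above check the scheme exactly as the question states it, including the 5- to 8-byte forms. Standard UTF-8 (RFC 3629) is stricter: it stops at 4 bytes per code point and also rejects overlong encodings and surrogates. As a rough cross-check against the standard rather than the question's scheme, the following sketch leans on the JDK's own decoder. The class name `StrictUtf8` is an assumption for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Returns true if the JDK's UTF-8 decoder accepts the bytes without
    // reporting malformed input. This follows standard UTF-8, not the
    // extended 8-byte scheme from the interview question.
    static boolean isValid(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(new byte[] {(byte) 0xE2, (byte) 0x82, (byte) 0xAC})); // true
        System.out.println(isValid(new byte[] {(byte) 0xE2, (byte) 0x82}));              // false
        System.out.println(isValid(new byte[] {(byte) 0xFF}));                           // false
    }
}
```

A lone 0xFF passes none of the standard's rules, while under the question's scheme it would be a legal start byte of an 8-byte rune, so the two checkers are expected to disagree on such inputs.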