Nokia - Bell Labs
Nokia - Bell Labs, The C Programming Language, AKA K & R, but nothing here
UTF-8 - Wikipedia
Strings, bytes, runes and characters in Go
The Unicode Consortium - OMG, they're talking about Emoji!
So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
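To make the byte order mark convention concrete, here is a minimal sketch in C that inspects the first two bytes of a buffer. The names `detect_bom` and `enum byte_order` are made up for this post, not taken from any library, and the sketch assumes the buffer holds 16-bit Unicode text as in the quote above.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; not part of any standard library. */
enum byte_order { ORDER_UNKNOWN, ORDER_BIG_ENDIAN, ORDER_LITTLE_ENDIAN };

/* Look at the first two bytes of a 16-bit Unicode buffer for a BOM. */
enum byte_order detect_bom(const unsigned char *buf, size_t len) {
    if (len < 2) return ORDER_UNKNOWN;
    /* FE FF: the bytes arrive in the order they were written (big-endian). */
    if (buf[0] == 0xFE && buf[1] == 0xFF) return ORDER_BIG_ENDIAN;
    /* FF FE: high and low bytes are swapped (little-endian). */
    if (buf[0] == 0xFF && buf[1] == 0xFE) return ORDER_LITTLE_ENDIAN;
    /* No BOM: the caller has to guess or be told the byte order. */
    return ORDER_UNKNOWN;
}
```

The `ORDER_UNKNOWN` case matters because, as the quote says, not every string in the wild starts with a byte order mark.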
Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
---|---|---|---|---|---|---|---|---|---|
7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | | | |
11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | | | |
16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | | | |
21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | | |
The following sequences are not part of the UTF-8 standard, only part of the original proposal (the standard also stops at U+10FFFF, below the 4-byte maximum of U+1FFFFF) | | | | | | | | | |
26 | U+200000 | U+3FFFFFF | 5 | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | |
31 | U+4000000 | U+7FFFFFFF | 6 | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
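The bit layouts in the table translate almost directly into shifts and masks. Here is a rough sketch of an encoder for the 1 to 4 byte rows; `utf8_encode` is a name I made up, and it assumes the standard's limits (nothing above U+10FFFF, no surrogates U+D800 to U+DFFF) rather than the original 5 and 6 byte proposal.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
 * Writes up to 4 bytes into out and returns how many were written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* outside the standard's range */
}
```

For example, the euro sign U+20AC falls in the 3-byte row and comes out as E2 82 AC.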
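Here's a C procedure from magaiti that tells you how many bytes the next character occupies, given its lead byte:
int get_mbchar_length(unsigned char lb) {
    /* 110xxxxx: lead byte of a 2-byte sequence */
    if ((lb & 0xE0) == 0xC0) return 2;
    /* 1110xxxx: lead byte of a 3-byte sequence */
    if ((lb & 0xF0) == 0xE0) return 3;
    /* 11110xxx: lead byte of a 4-byte sequence */
    if ((lb & 0xF8) == 0xF0) return 4;
    /* 0xxxxxxx (ASCII); continuation and invalid bytes also end up here */
    return 1;
}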
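As a rough sketch of how that procedure might be used, the function below counts the characters in a NUL-terminated UTF-8 string by hopping from lead byte to lead byte. `utf8_char_count` is a name I made up, the procedure is repeated (taking `unsigned char`) so the snippet compiles on its own, and the whole thing assumes well-formed input; it does no validation.

```c
#include <stddef.h>

/* Copy of the procedure above so this sketch stands alone. */
static int get_mbchar_length(unsigned char lb) {
    if ((lb & 0xE0) == 0xC0) return 2;
    if ((lb & 0xF0) == 0xE0) return 3;
    if ((lb & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Hypothetical helper: count characters (not bytes) in a UTF-8 string.
 * Assumes well-formed input. */
size_t utf8_char_count(const char *s) {
    size_t count = 0;
    while (*s != '\0') {
        int len = get_mbchar_length((unsigned char)*s);
        /* Advance one byte at a time so a string truncated in the middle
         * of a sequence never walks past the terminator. */
        for (int i = 0; i < len && *s != '\0'; i++) {
            s++;
        }
        count++;
    }
    return count;
}
```

With this, `utf8_char_count("h\xC3\xA9llo")` returns 5 even though the string is 6 bytes long, because the two bytes C3 A9 are counted as one character.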
For the Snake Encoding problem on Codingame dot com, I used unicode/utf8 as suggested by Rob Pike.