Pages, some stolen, some original

Thursday, May 5, 2016

Characters

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Nokia - Bell Labs
Nokia - Bell Labs, The C Programming Language, AKA K&R, but nothing here
UTF-8 - Wikipedia
Strings, bytes, runes and characters in Go
The Unicode Consortium - OMG, they're talking about Emoji!
So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
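The byte order mark check described in that quote can be sketched in a few lines of C. This is only an illustration under assumptions: detect_utf16_bom and the bom_kind enum are names made up for this post, not part of any library, and the check looks only at the two-byte mark mentioned above.

```c
#include <stddef.h>

/* Possible results of inspecting the first two bytes of a buffer. */
enum bom_kind { BOM_NONE, BOM_BIG_ENDIAN, BOM_LITTLE_ENDIAN };

/* detect_utf16_bom is a hypothetical helper name, not from any library.
   It reports which byte order the leading mark, if any, indicates. */
enum bom_kind detect_utf16_bom(const unsigned char *buf, size_t len) {
    if (len < 2) return BOM_NONE;                 /* too short to hold a mark */
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_BIG_ENDIAN;                    /* FE FF: high byte first */
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_LITTLE_ENDIAN;                 /* FF FE: bytes are swapped */
    return BOM_NONE;                              /* no mark: order must be known some other way */
}
```

A BOM_NONE result does not mean the text is invalid, only that the byte order has to come from somewhere else, which matches the remark that not every string in the wild carries a mark.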
Bits of     First       Last         Bytes in
code point  code point  code point   sequence  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
 7          U+0000      U+007F       1         0xxxxxxx
11          U+0080      U+07FF       2         110xxxxx  10xxxxxx
16          U+0800      U+FFFF       3         1110xxxx  10xxxxxx  10xxxxxx
21          U+10000     U+1FFFFF     4         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
The following sequences are not part of the UTF-8 standard, only part of the original proposal:
26          U+200000    U+3FFFFFF    5         111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31          U+4000000   U+7FFFFFFF   6         1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
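Going the other way, from a code point to bytes, follows the same table row by row. Here is a rough sketch in C, covering only the one- to four-byte forms. The name utf8_encode is made up for illustration. One assumption goes beyond the table: the sketch stops at U+10FFFF and rejects the surrogate range U+D800 to U+DFFF, as RFC 3629 requires, even though the four-byte pattern has room for values up to U+1FFFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* utf8_encode is a hypothetical name. It writes the UTF-8 form of cp into
   out and returns the number of bytes written, or 0 if cp is a surrogate
   or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                              /* 7 bits: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                             /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                            /* 16 bits: 1110xxxx + 2 continuation bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                          /* 21 bits: 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                      /* outside the Unicode range */
}
```

For example, U+20AC (the euro sign) falls in the 16-bit row and comes out as the three bytes E2 82 AC.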

Here's a 'C' procedure from magaiti that will tell you how many bytes the next character occupies:
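/* Returns the length in bytes of the UTF-8 sequence that starts with lead byte lb. */
int get_mbchar_length(char lb) {
    if (( lb & 0xE0 ) == 0xC0 ) return 2;   /* 110xxxxx */
    if (( lb & 0xF0 ) == 0xE0 ) return 3;   /* 1110xxxx */
    if (( lb & 0xF8 ) == 0xF0 ) return 4;   /* 11110xxx */
    return 1;                               /* 0xxxxxxx (ASCII), and any other byte */
}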
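One thing that procedure makes easy is counting characters rather than bytes. A minimal sketch, assuming the input is a NUL-terminated string of valid UTF-8: the names lead_byte_length and utf8_count_chars are invented here, and the lead-byte test is restated (taking an unsigned char) only so the block compiles without anything else.

```c
#include <stddef.h>

/* Same lead-byte test as the procedure above, restated so this block
   is self-contained. lead_byte_length is a hypothetical name. */
static size_t lead_byte_length(unsigned char lb) {
    if ((lb & 0xE0) == 0xC0) return 2;
    if ((lb & 0xF0) == 0xE0) return 3;
    if ((lb & 0xF8) == 0xF0) return 4;
    return 1;
}

/* utf8_count_chars is a hypothetical name. It counts code points in a
   NUL-terminated string that is assumed to be valid UTF-8. */
size_t utf8_count_chars(const char *s) {
    size_t count = 0;
    while (*s != '\0') {
        size_t n = lead_byte_length((unsigned char)*s);
        /* Skip the whole sequence, but never step past the terminating
           NUL if the last sequence is truncated. */
        while (n > 0 && *s != '\0') {
            s++;
            n--;
        }
        count++;
    }
    return count;
}
```

Because invalid input is not detected, a stray continuation byte is simply counted as one character; a stricter version would need to validate each sequence.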

For the Snake Encoding problem on Codingame dot com, I used Go's unicode/utf8 package, as suggested by Rob Pike.
