Silicon Forest

Thursday, May 5, 2016

Characters

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Nokia - Bell Labs, The C Programming Language, AKA K & R, but nothing here
UTF-8 - Wikipedia
Strings, bytes, runes and characters in Go
The Unicode Consortium - OMG, they're talking about Emoji!
So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
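The byte order mark check described above can be sketched in a few lines of C. This is only an illustration, assuming the buffer holds 16-bit Unicode text; the names bom_kind, detect_utf16_bom and read_utf16_unit are made up for this example and do not come from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte order implied by the first two bytes of a 16-bit Unicode buffer.
 * These names are hypothetical, chosen for this sketch only. */
typedef enum { BOM_NONE, BOM_BIG_ENDIAN, BOM_LITTLE_ENDIAN } bom_kind;

/* FE FF means the bytes are in big-endian order; FF FE means the high
 * and low bytes are swapped (little-endian). Anything else: no mark. */
bom_kind detect_utf16_bom(const unsigned char *buf, size_t len) {
    if (len < 2) return BOM_NONE;
    if (buf[0] == 0xFE && buf[1] == 0xFF) return BOM_BIG_ENDIAN;
    if (buf[0] == 0xFF && buf[1] == 0xFE) return BOM_LITTLE_ENDIAN;
    return BOM_NONE;
}

/* Read the i-th 16-bit unit of buf, swapping bytes when the order is
 * little-endian. The caller must make sure 2*i+1 is inside the buffer.
 * Unit 0 is the mark itself, so text starts at unit 1 when a mark exists. */
uint16_t read_utf16_unit(const unsigned char *buf, size_t i, bom_kind order) {
    const unsigned char *p = buf + 2 * i;
    if (order == BOM_LITTLE_ENDIAN)
        return (uint16_t)(p[0] | (p[1] << 8));
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

Since not every string has a mark, a real reader still needs a fallback for BOM_NONE; which byte order to assume in that case is a choice this sketch leaves to the caller.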
Bits of      First        Last         Bytes in
code point   code point   code point   sequence   Byte 1     Byte 2     Byte 3     Byte 4     Byte 5     Byte 6
 7           U+0000       U+007F       1          0xxxxxxx
11           U+0080       U+07FF       2          110xxxxx   10xxxxxx
16           U+0800       U+FFFF       3          1110xxxx   10xxxxxx   10xxxxxx
21           U+10000      U+1FFFFF     4          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

The following sequences are not part of the UTF-8 standard, only part of the original proposal:

26           U+200000     U+3FFFFFF    5          111110xx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx
31           U+4000000    U+7FFFFFFF   6          1111110x   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx
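Reading the table from left to right gives a recipe for encoding: pick the row by the size of the code point, then spread its bits across the x positions. Here is a rough sketch in C of the first four rows. The function name utf8_encode is made up for this example, and I have assumed the Unicode limit of U+10FFFF for the four-byte row rather than the full 21 bits shown in the table.

```c
#include <stdint.h>

/* Encode one code point into out, which must have room for 4 bytes.
 * Returns the number of bytes written, or 0 if cp is above U+10FFFF.
 * Sketch only: it does not reject the surrogate range U+D800..U+DFFF. */
int utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                 /* 7 bits: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {               /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {             /* 21 bits: 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                         /* outside the range handled here */
}
```

For example, the euro sign U+20AC falls in the 16-bit row and comes out as the three bytes E2 82 AC.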

Here's a 'C' procedure from magaiti that will tell you how many bytes the next character occupies:
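/* lb is the lead (first) byte of a UTF-8 sequence; returns the sequence length in bytes. */
int get_mbchar_length(char lb) {
    if (( lb & 0xE0 ) == 0xC0 ) return 2;   /* 110xxxxx */
    if (( lb & 0xF0 ) == 0xE0 ) return 3;   /* 1110xxxx */
    if (( lb & 0xF8 ) == 0xF0 ) return 4;   /* 11110xxx */
    return 1;   /* ASCII (0xxxxxxx); continuation and invalid bytes also end up here */
}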
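One thing that procedure is handy for is counting characters instead of bytes. The sketch below repeats the procedure so it compiles on its own; count_utf8_chars is a name I made up, and it assumes the input is a NUL-terminated string that starts on a lead byte.

```c
#include <stddef.h>

/* Same lead-byte test as above, repeated so this sketch stands alone. */
static int get_mbchar_length(char lb) {
    if ((lb & 0xE0) == 0xC0) return 2;
    if ((lb & 0xF0) == 0xE0) return 3;
    if ((lb & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Count characters (not bytes) in a NUL-terminated UTF-8 string.
 * Hypothetical helper; it does not validate continuation bytes. */
size_t count_utf8_chars(const char *s) {
    size_t count = 0;
    while (*s != '\0') {
        int n = get_mbchar_length(*s);
        /* Step over the whole sequence, but never past the terminator
         * in case the string was cut off in the middle of a character. */
        while (n > 0 && *s != '\0') {
            s++;
            n--;
        }
        count++;
    }
    return count;
}
```

With this, a string such as "héllo" stored as UTF-8 is six bytes long according to strlen but counts as five characters.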

For the Snake Encoding problem on Codingame dot com, I used unicode/utf8 as suggested by Rob Pike.
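Go's unicode/utf8 package does the decoding for you. To stay with C, here is a rough sketch of what decoding one character involves. It is not a port of the Go package and does not behave the same way on bad input; utf8_decode is a made-up name, and the function skips some checks a real decoder needs, such as rejecting overlong encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the code point that starts at s, with len bytes available.
 * Stores it in *cp and returns the number of bytes consumed, or 0 if
 * the sequence is truncated or does not follow the bit patterns.
 * Sketch only: overlong forms and surrogates are not rejected. */
int utf8_decode(const unsigned char *s, size_t len, uint32_t *cp) {
    int n;
    uint32_t v;

    if (len == 0) return 0;
    if (s[0] < 0x80)                { n = 1; v = s[0]; }         /* 0xxxxxxx */
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; v = s[0] & 0x1F; }  /* 110xxxxx */
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; v = s[0] & 0x0F; }  /* 1110xxxx */
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; v = s[0] & 0x07; }  /* 11110xxx */
    else return 0;                  /* continuation byte or invalid lead byte */

    if ((size_t)n > len) return 0;  /* sequence runs past the buffer */

    for (int i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* must look like 10xxxxxx */
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *cp = v;
    return n;
}
```

Calling it in a loop and advancing by the return value walks a buffer one character at a time.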