Intel's Ronler Acres Plant

Silicon Forest

Thursday, May 5, 2016


The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Nokia - Bell Labs
Nokia - Bell Labs, The C Programming Language, AKA K & R, but nothing here
UTF-8 - Wikipedia
Strings, bytes, runes and characters in Go
The Unicode Consortium - OMG, they're talking about Emoji!
So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
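To make the byte order mark concrete, here is a minimal sketch in C of the check described in that passage. The function name detect_utf16_bom and its return values are hypothetical, made up for illustration, and it assumes the buffer holds two-byte (UTF-16 style) text.

```c
#include <stddef.h>

/* Hypothetical helper: look at the first two bytes of a two-byte-per-unit
   Unicode buffer.
   Returns  1 if the mark reads FE FF (bytes are in the expected order),
           -1 if it reads FF FE (every pair of bytes must be swapped),
            0 if there is no byte order mark, so the order is unknown. */
int detect_utf16_bom(const unsigned char *buf, size_t len) {
    if (len < 2) return 0;                            /* too short to hold a mark */
    if (buf[0] == 0xFE && buf[1] == 0xFF) return 1;   /* FE FF */
    if (buf[0] == 0xFF && buf[1] == 0xFE) return -1;  /* FF FE: swapped */
    return 0;                                         /* no mark present */
}
```

A result of 0 is the "in the wild" case from the quote: the reader has to guess or be told the byte order some other way.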
UTF-8 byte layout table (from the Wikipedia article). Columns: Bits of code point, First code point, Last code point, Bytes in sequence, Byte 1 through Byte 6. The 5- and 6-byte sequences are not part of the UTF-8 standard, only part of the original proposal.
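The layout that table describes can be sketched as a small encoder in C. This is only an illustration: the name utf8_encode is hypothetical, and it covers just the 1- to 4-byte forms that are part of the standard, rejecting surrogates and anything above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out[].
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate (U+D800..U+DFFF) or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                      /* 7 bits:  0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                     /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                    /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 21 bits: 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, the euro sign U+20AC falls in the 3-byte range and comes out as E2 82 AC.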

Here's a 'C' procedure from magaiti that will tell you how many bytes the next character occupies:
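int get_mbchar_length(char lb) {
    /* lb is the lead byte of a character; its high bits give the length */
    if (( lb & 0xE0 ) == 0xC0 ) return 2;   /* 110xxxxx */
    if (( lb & 0xF0 ) == 0xE0 ) return 3;   /* 1110xxxx */
    if (( lb & 0xF8 ) == 0xF0 ) return 4;   /* 11110xxx */
    return 1;                               /* ASCII, or not a valid lead byte */
}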
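As a usage sketch, the procedure above can be used to step through a string one character at a time. The helper utf8_count_chars below is hypothetical and assumes NUL-terminated, reasonably well-formed UTF-8. The lead-byte test is repeated so the block compiles on its own.

```c
#include <stddef.h>

/* Same lead-byte test as the procedure above, repeated here so that
   this sketch is self-contained. */
static int get_mbchar_length(char lb) {
    if ((lb & 0xE0) == 0xC0) return 2;
    if ((lb & 0xF0) == 0xE0) return 3;
    if ((lb & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Hypothetical helper: count the characters (code points) in a
   NUL-terminated UTF-8 string by hopping from lead byte to lead byte. */
size_t utf8_count_chars(const char *s) {
    size_t count = 0;
    while (*s != '\0') {
        int n = get_mbchar_length(*s);
        /* Advance n bytes, but never past the terminator if the last
           character is truncated. */
        for (int i = 0; i < n && *s != '\0'; i++) {
            s++;
        }
        count++;
    }
    return count;
}
```

Note that the count is in characters, not bytes, so a string containing one accented letter is one byte longer than the count suggests.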

For the Snake Encoding problem on Codingame dot com, I used unicode/utf8 as suggested by Rob Pike.
