Pergelator: Characters

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Nokia - Bell Labs
Nokia - Bell Labs, The C Programming Langugage, AKA K & R, but nothing here
UTF-8 - Wikipedia
Strings, bytes, runes and characters in Go
The Unicode Consortium - OMG, they're talking about Emoji!

So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.

Bits of code point	First code point	Last code point	Bytes in sequence	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6
7	U+0000	U+007F	1	`0xxxxxxx`
11	U+0080	U+07FF	2	`110xxxxx`	`10xxxxxx`
16	U+0800	U+FFFF	3	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
21	U+10000	U+1FFFFF	4	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`
The following sequences are not part of the UTF-8 standard, only part of the original proposal
26	U+200000	U+3FFFFFF	5	`111110xx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`
31	U+4000000	U+7FFFFFFF	6	`1111110x`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`

Here's a 'C' procedure from magaiti that will tell you how many bytes the next character occupies:

int get_mbchar_length(char lb) {
    if (( lb & 0xE0 ) == 0xC0 ) return 2;
    if (( lb & 0xF0 ) == 0xE0 ) return 3;
    if (( lb & 0xF8 ) == 0xF0 ) return 4;
    return 1;
}

For the Snake Encoding problem on Codingame dot com, I used unicode/utf8 as suggested by Rob Pike.

Pergelator

Pages, some stolen, some original

Thursday, May 5, 2016

Characters

No comments:

Post a Comment