Intel's Ronler Acres Plant

Silicon Forest
If the type is too small, Ctrl+ is your friend

Saturday, April 24, 2010

When is a text file not a text file?


When it's a UTF-8 file, that's when! Dad-burn whipper-snappers anywho. Wrote a program yesterday that reads in some numbers from a text file. Works fine. Tried the program again today and it blows up and dies. Can't read any data from the file. Look at the file with Notepad++ and it looks fine. What's going on here? Change the input function so it reads the input character by character, and I see we are getting data that is not ordinary ASCII. Oh, yeah? Fine. Pull out my trusty 30-year-old text editor and look at the file, and by gum, there they are: three little garbage characters at the beginning of the file. Now who would put those there? What have I done to this file that would cause someone to stick garbage like that at the beginning of it?

So I do a little Googling and I find the answer in a post by Mika Halttunen on an Allegro.cc forum:
Those "garbage characters" are actually the byte-order mark of UTF-8, that specify the text is encoded in UTF-8. See UTF-8 BOM [en.wikipedia.org] for more info.
According to the Wikipedia entry, the three bytes have hex values of EF, BB, and BF, and look like this: 

Just remember that Bill Gates loves you and wants you to be happy.

Windows is Wonderful.
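For what it's worth, a reader that shrugs off those three bytes is not hard to write. Here is a minimal sketch in Python, which is not the language my program was written in. The function names and the sample file are made up for illustration, and it assumes the file holds whitespace-separated numbers. The first function checks the opening bytes for EF BB BF, and the second reads the numbers using Python's "utf-8-sig" codec, which drops a leading BOM if there is one and otherwise behaves like plain UTF-8.

```python
import codecs
import os
import tempfile


def starts_with_utf8_bom(path):
    """Return True if the file begins with the bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def read_numbers(path):
    """Read whitespace-separated numbers, ignoring a leading UTF-8 BOM."""
    # "utf-8-sig" decodes UTF-8 and silently skips a BOM at the start.
    with open(path, encoding="utf-8-sig") as f:
        return [float(token) for token in f.read().split()]


if __name__ == "__main__":
    # Hypothetical sample file: three numbers with a BOM stuck on the front.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
        tmp.write(codecs.BOM_UTF8 + b"1 2.5 3\n")
    print(starts_with_utf8_bom(tmp.name))
    print(read_numbers(tmp.name))
    os.remove(tmp.name)
```

A file saved without the BOM goes through the same reader unchanged, so the program does not need to care which editor touched the file last.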

If you are wondering about the picture, you need to go back and read the history of UTF-8, which will lead you, eventually, to Plan 9.

Update, September 2016: replaced missing picture.
