Rails and Unicode and Postgres DB (Oh My!): Understanding how UTF-8 works

January 5, 2008 - 2 minutes read - 380 words

Examine the Unicode standard’s code charts for “Latin small letter a with macron”.
Nets U0100.pdf
“Latin small letter a with macron” appears on the chart as 0101. This is a hexadecimal number, so the code point is U+0101. Converting 0101 to decimal gets you 257, which is the number used in the decimal HTML entity. Thus one can enter either &#x0101; or &#257; and get the right glyph [ā|ā]
Put the ā character into a view via Rails that is back-ended by a Postgres database.
Using script/console, write the collection of models that contain this accented character to a YAML file.
“Latin small letter a with macron” is stored in a YAML dump of accented characters as: \xC4\x81
Hm, OK, that’s a start. Somehow 0101 or 257 is linked to C4 81. How? BTW, I know the database that holds that entry is in UTF-8, because psql -l shows it.
C4: 196
81: 129
196 + 129 = 325, which is neither 257 nor hex 0101. Hm, look at documentation.
Be stumped.
Send mail to mailing lists for help.
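The observations so far can be reproduced in irb. This is only a sketch, assuming Ruby 1.9 or later (where strings know their encoding); utf8_hex_bytes is a hypothetical helper name, not something from Rails.

```ruby
# Hypothetical helper: the UTF-8 bytes of a code point, as hex strings.
def utf8_hex_bytes(code_point)
  # pack("U") builds a UTF-8 string from an integer code point
  [code_point].pack("U").bytes.map { |b| format("%02X", b) }
end

puts "0101".hex                        # 257, the decimal HTML entity number
puts utf8_hex_bytes(0x0101).join(" ")  # C4 81, as seen in the YAML dump
puts 0xC4 + 0x81                       # 325, so adding the bytes gets nowhere
```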

In the immortal words of Sid Meier’s “Civilization”: “Time Passes…”

\xC4\x81 is the UTF-8 encoding for the Unicode code point U+0101.
[Q:] Which table does U+0101 fall into?

[A:] “So the first 128 characters (US-ASCII) need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the BMP characters use three bytes, and additional characters are encoded in four bytes.”
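The ranges in that answer can be written down as a small lookup. This is a rough sketch, assuming Ruby 1.9 or later; utf8_byte_length is a hypothetical helper, and it does not treat the surrogate range specially.

```ruby
# Hypothetical helper: how many bytes UTF-8 needs for a code point.
def utf8_byte_length(code_point)
  case code_point
  when 0x0000..0x007F     then 1  # US-ASCII, 128 characters
  when 0x0080..0x07FF     then 2  # the next 1920 characters
  when 0x0800..0xFFFF     then 3  # the rest of the BMP
  when 0x10000..0x10FFFF  then 4  # characters beyond the BMP
  else raise ArgumentError, "not a Unicode code point: #{code_point}"
  end
end

puts utf8_byte_length(0x0101)          # 2
puts [0x0101].pack("U").bytesize       # 2, Ruby's own count agrees
```

So U+0101 lands in the two-byte range, which is why the dump shows exactly two bytes.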

OK this means that the code point will be of the form: “110yyyyy 10zzzzzz”
We will now work to fill in the “y” and “z” values:
Hexadecimal “U+0101” converts to binary: “100000001”
There are 5 y’s and 6 z’s. So let’s split the above number to match that form: “[00]100-000001”. Note that we fill from the right: the low six bits become the z’s and whatever is left over becomes the y’s. Where leading 0’s were required to turn 100 into 00100, they were prepended.
Integrate and produce: “110” + “00100” : “10” + “000001” => 11000100 : 10000001
Take THESE numbers and convert them back to hex => c4 81
String notation for this is \xc4\x81 - voilà!
Figuring this out letter by letter is a major pain in the keister. A good resource is Fileformat.info or, handily, a URI of the form:
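The steps above can be sketched in Ruby. This is a minimal illustration, assuming Ruby 1.9 or later; encode_two_byte_utf8 is a hypothetical helper that only handles the two-byte range, and in real code [code_point].pack("U") does the whole job.

```ruby
# Hypothetical helper: encode a code point in U+0080..U+07FF by hand,
# following the "110yyyyy 10zzzzzz" pattern.
def encode_two_byte_utf8(code_point)
  unless (0x0080..0x07FF).cover?(code_point)
    raise ArgumentError, "two-byte form only covers U+0080..U+07FF"
  end

  bits = code_point.to_s(2).rjust(11, "0")  # left-pad to 5 + 6 = 11 bits
  y = bits[0, 5]                            # high five bits
  z = bits[5, 6]                            # low six bits

  ["110" + y, "10" + z].map { |byte| byte.to_i(2) }
end

bytes = encode_two_byte_utf8(0x0101)
puts bytes.map { |b| b.to_s(2) }.join(" ")        # 11000100 10000001
puts bytes.map { |b| format("\\x%02x", b) }.join  # \xc4\x81
puts bytes.pack("C*").force_encoding("UTF-8")     # ā
```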

http://www.fileformat.info/info/unicode/char//index.htm