This caused me a bit of confusion, so I'm hoping to save you from the same. It has to do with Unicode code points, and how they relate back to interesting things like bits in memory.
Code points are arbitrary numbers that represent Unicode data. You see them expressed in the form U+XXXX, where each X is a hex digit (there are at least four digits, and up to six for the highest code points). So, for instance, the letter A is represented by U+0041. These code points are just abstractions that allow us to assign a unique ID to something approaching a "letter" (the concept of a letter is just an easy way for English-speaking people to understand code points, but it isn't 100% accurate).
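To make the "code points are just numbers" idea concrete, here is a small Python sketch. The helper name code_point_label is made up for this post; ord() and chr() are Python built-ins that convert between a one-character string and its code point.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation.

    Hypothetical helper for illustration; pads to at least four hex digits.
    """
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))       # U+0041
print(code_point_label("\u2665"))  # U+2665
print(chr(0x41))                   # A
```

Notice that nothing here says anything about bytes yet. The number 0x41 is only an ID at this stage.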
So how do you take a code point, which is an abstract concept, and turn it into something useful that the computer can deal with? That's where encodings come in -- an encoding is an actual concrete representation of the Unicode code point. The encoding is what determines the bits in memory.
For example, U+0041 is represented in ASCII encoding as 0x41. It's represented in UTF-8 by 0x41. It's represented in UCS-2 by 0x00 0x41 (or 0x41 0x00, depending on the endianness of your platform). As you can see, the code point can have different encodings in memory or on disk, but all those encodings represent the abstract concept of U+0041. A more complex example would be U+2665 (Black Heart Suit). In ASCII, there is no representation, because ASCII only covers U+0000 through U+007F. In UTF-16, it is 0x26 0x65 (or swapped, as mentioned earlier). In UTF-8, it is 0xE2 0x99 0xA5.
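If you want to check those byte values yourself, here is a rough Python sketch. The helper hex_bytes is a made-up name for this post. One assumption to flag: Python has no codec named UCS-2, so the sketch uses the utf-16-be and utf-16-le codecs instead, which produce the same bytes as UCS-2 for these two code points.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and show the resulting bytes as 0xNN values.

    Hypothetical helper for illustration only.
    """
    return " ".join(f"0x{b:02X}" for b in text.encode(encoding))


# U+0041, the letter A
print(hex_bytes("A", "ascii"))      # 0x41
print(hex_bytes("A", "utf-8"))      # 0x41
print(hex_bytes("A", "utf-16-be"))  # 0x00 0x41
print(hex_bytes("A", "utf-16-le"))  # 0x41 0x00

# U+2665, Black Heart Suit
heart = "\u2665"
print(hex_bytes(heart, "utf-16-be"))  # 0x26 0x65
print(hex_bytes(heart, "utf-16-le"))  # 0x65 0x26
print(hex_bytes(heart, "utf-8"))      # 0xE2 0x99 0xA5

# ASCII cannot represent U+2665 at all, so encoding raises an error.
try:
    heart.encode("ascii")
except UnicodeEncodeError:
    print("no ASCII representation")
```

Same code point going in each time, different bytes coming out depending on the encoding you ask for.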
So the key point to take home is this: code points are abstract until you apply an encoding to them. The encoding determines the actual bits.