A quick tour of Unicode

Unicode is a character encoding system used by computers for the storage and interchange of textual data. It provides a unique number (a code point) for every character of the major writing systems of the world. It also includes technical symbols, punctuation, and many other characters used in writing text.

In addition to being a character map, Unicode includes algorithms for collation and for handling bidirectional text in scripts such as Arabic, as well as specifications for the normalization of text forms.
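As a rough illustration of what normalization means in practice, the sketch below uses Python's standard unicodedata module to compare two spellings of the same accented letter. The variable names are only illustrative, and the example covers just the NFC and NFD forms.

```python
import unicodedata

# Two ways to spell the same letter: one precomposed character,
# or a plain "e" followed by a combining accent.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

# The two strings look alike on screen but compare unequal until normalized.
same_before = precomposed == decomposed
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(same_before, same_after_nfc, same_after_nfd)  # False True True
```

Normalizing both sides to the same form before comparing or sorting text is the usual way to avoid this kind of mismatch.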

This topic provides an overview of Unicode. For a more complete explanation and for a list of specific languages that can be encoded with Unicode, see the Unicode Consortium website.

Code points

Characters are units of information that roughly correspond to a unit of text in the written form of a natural language. Unicode defines how characters are interpreted, not how they are rendered.

A glyph, which is the rendering or visual representation of a character, is the mark made on the computer screen or printed page. In some writing systems, one character may correspond to several glyphs, or several characters to one glyph. For example, many fonts draw the two characters "f" and "i" as a single ligature glyph.

In Unicode, a character maps to a code point. Code points are the numbers assigned by the Unicode Consortium to every character in every writing system. Code points are written as U+ followed by four to six hexadecimal digits. The following are examples of code points for four different characters: a lowercase l, a lowercase u with an umlaut, a lowercase Greek beta, and a lowercase e with an acute accent.

l = U+006C

ü = U+00FC

β = U+03B2

é = U+00E9
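The mapping between characters and code points can be checked with a few lines of Python. This is a minimal sketch: the helper names code_point_label and char_from_label are made up for this example, while ord and chr are Python built-ins. The last sample, a musical symbol, is included only to show that code points above U+FFFF need more than four hexadecimal digits.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for one character (at least four hex digits)."""
    return f"U+{ord(ch):04X}"


def char_from_label(label: str) -> str:
    """Return the character for a label such as 'U+00E9'."""
    return chr(int(label[2:], 16))


# l, u with umlaut, e with acute accent, and MUSICAL SYMBOL G CLEF
for ch in ["l", "\u00fc", "\u00e9", "\U0001D11E"]:
    print(ch, code_point_label(ch))

print(char_from_label("U+00E9"))
```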

The Unicode code space contains 1,114,112 code points. As of Unicode 16.0, characters are assigned to more than 150,000 of them.

Planes

The Unicode code space for characters is divided into 17 planes, and each plane has 65,536 code points.

The first plane, plane 0, is the Basic Multilingual Plane (BMP). The majority of commonly used characters fit into the BMP. It contains code points for almost all characters in modern languages and many special characters. There are approximately 6,300 unused code points in the BMP; these will be used to add more characters in the future.

The next plane—plane 1—is the Supplementary Multilingual Plane (SMP). SMP is used for historic scripts and musical and mathematical symbols.
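Because every plane holds 65,536 (hexadecimal 10000) code points, the plane of a code point can be found by integer division. The sketch below assumes nothing beyond that arithmetic; the function name plane_of and the constants are illustrative only.

```python
PLANE_COUNT = 17
PLANE_SIZE = 0x10000  # 65,536 code points per plane


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not 0 <= code_point < PLANE_COUNT * PLANE_SIZE:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point // PLANE_SIZE


print(PLANE_COUNT * PLANE_SIZE)  # 1114112, the size of the code space
print(plane_of(0x00E9))          # 0, the BMP
print(plane_of(0x1D11E))         # 1, the SMP (a musical symbol)
```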

Character encoding

Character encoding defines each character, its code point, and how the code point is represented in bits. Without knowing what encoding is being used, you cannot correctly interpret a string of characters.

Numerous encoding schemes exist, but they may not easily convert between one another, and few take into account characters for more than a few different languages. For example, if your PC was set to use OEM—Latin II by default and you browsed to a Web site that used IBM EBCDIC—Cyrillic, any characters present in the Cyrillic that are not in the Latin II encoding scheme would not be displayed properly; they would be replaced with other characters, such as a question mark or a square.
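The effect of decoding text with the wrong encoding can be reproduced in a few lines. The sketch below is a stand-in for the scenario above, not a reproduction of it: it assumes Python's "cp852" codec for the OEM Latin II side and substitutes "koi8_r", a different Cyrillic encoding, for the EBCDIC one.

```python
# Cyrillic sample text, written with escapes so the source stays ASCII.
text = "\u041f\u0440\u0438\u0432\u0435\u0442"

# Store the text as bytes in a Cyrillic encoding.
raw = text.encode("koi8_r")

# Read the same bytes back under a Latin II code page: the byte values are
# reinterpreted as unrelated characters.
wrong = raw.decode("cp852", errors="replace")

# Read them back with the encoding that was actually used.
right = raw.decode("koi8_r")

print(wrong == text)  # False
print(right == text)  # True
```

The bytes themselves never change; only the table used to interpret them does, which is why the encoding has to travel with the data.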

Because Unicode contains code points for the majority of characters in all modern languages, using a Unicode character encoder will allow your computer to interpret nearly every known character.

There are three main Unicode character encoding schemes in use: UTF-8, UTF-16, and UTF-32. UTF stands for Unicode Transformation Format. The numbers that follow UTF indicate the size, in bits, of units used for encoding.

  • UTF-8 uses 8-bit code units in a variable-width encoding. It uses between 1 and 4 bytes to encode a character (the original design allowed sequences of up to 6 bytes, but the current standard limits them to 4); it may use fewer, the same, or more bytes than UTF-16 to encode the same character. In UTF-8, every code point from 0 to 127 (U+0000 to U+007F) is stored in a single byte. Code points 128 (U+0080) and above are stored using 2 to 4 bytes.
  • UTF-16 uses 16-bit code units in a variable-width encoding. It is relatively compact: every character in the BMP fits into a single 16-bit code unit. Characters in the other planes are encoded as a pair of 16-bit code units, known as a surrogate pair.
  • UTF-32 requires 4 bytes to encode any character. In most cases, a document encoded in UTF-32 will be nearly twice as large as the same document encoded in UTF-16. Each character is encoded in a single, fixed-width, 32-bit code unit. You would use UTF-32 if memory space is not an issue and you want to be able to use a single code unit for every character.
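The size differences between the three schemes are easy to measure. The following sketch counts the bytes Python produces for a few sample characters; the helper name encoded_sizes is hypothetical, and the little-endian codec variants are chosen only so that no byte order mark is added to the counts.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch: str) -> tuple:
    """Return the byte length of ch in UTF-8, UTF-16, and UTF-32."""
    return tuple(len(ch.encode(enc)) for enc in ENCODINGS)


samples = [
    "l",           # U+006C, ASCII range
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001D11E",  # U+1D11E, a musical symbol outside the BMP
]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

For these samples, UTF-8 is smaller than UTF-16 for the ASCII letter, the same size for the accented letter, and larger for the euro sign, while UTF-32 always takes 4 bytes.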

All three encoding forms can represent every Unicode character, so text can be converted from one to another without loss of data.
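A quick way to convince yourself of the lossless conversion is to pass a string through all three encodings and compare the result with the starting bytes. This is a minimal sketch using Python's built-in codecs; the sample string is arbitrary.

```python
# Sample text mixing ASCII, accented Latin, Greek, and a non-BMP character.
text = "l\u00fc\u03b2\u00e9\U0001D11E"

utf8 = text.encode("utf-8")

# Convert UTF-8 -> UTF-16 -> UTF-32 -> UTF-8, decoding at each step.
utf16 = utf8.decode("utf-8").encode("utf-16")
utf32 = utf16.decode("utf-16").encode("utf-32")
back_to_utf8 = utf32.decode("utf-32").encode("utf-8")

print(back_to_utf8 == utf8)           # True
print(utf32.decode("utf-32") == text)  # True
```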

Other Unicode character encodings include UTF-7 and UTF-EBCDIC. There is also GB18030, a Chinese national standard encoding that, like the UTF encodings, can represent every Unicode code point and supports both simplified and traditional Chinese characters.