Character encoding (Unicode, UTF-8, ASCII)

Posted on October 23, 2012. Filed under: SANS Dev 541 |

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world’s writing systems.

Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 and UTF-16.

UTF-8 uses one byte for each ASCII character, which has the same code value in both UTF-8 and ASCII, and up to four bytes for other characters.
The UTF-8 method maps code points to a sequence of bytes ranging in length from 1 to 4 bytes. Each byte within the sequence contains both control bits and non-control bits. The control bits indicate how many bytes there are in a given sequence, and whether a given byte is the first in the sequence or one of the “trailing” bytes. The figure below illustrates how these control bits are interpreted: a single-byte sequence has the form 0xxxxxxx, the first byte of a two-, three- or four-byte sequence starts with 110, 1110 or 11110 respectively, and every trailing byte starts with 10. The non-control bits of each byte in a sequence (the x positions) are used to record the character code value (i.e., code point) assigned by the Unicode Standard.
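
To make the control-bit layout concrete, here is a minimal sketch of an encoder written by hand in Python. The function name utf8_encode is made up for this illustration; in real code you would simply call str.encode("utf-8"), which the sketch is checked against at the end.

```python
def utf8_encode(code_point: int) -> bytes:
    """Hand-rolled UTF-8 encoder (illustrative only; name is hypothetical)."""
    # Reject values outside Unicode and the surrogate range, which UTF-8 forbids.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")

    if code_point < 0x80:
        # 0xxxxxxx: 7 payload bits, identical to ASCII
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx: 11 payload bits
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 21 payload bits
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against Python's built-in encoder for one sample of each length.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

Because each branch picks the shortest form that fits the code point, this sketch never produces an overlong sequence.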

The standard specifies that the correct encoding of a codepoint use only the minimum number of bytes required to hold the significant bits of the codepoint. Longer encodings are called overlong and are not valid UTF-8 representations of the codepoint. This rule maintains a one-to-one correspondence between codepoints and their valid encodings, so that there is a unique valid encoding for each codepoint. Allowing multiple encodings would make testing for string equality difficult to define.
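
As a quick illustration of the overlong rule, the sketch below feeds Python's built-in strict UTF-8 decoder the normal one-byte encoding of "/" (U+002F) and two padded, overlong forms of the same code point. The helper name is_valid_utf8 is an assumption made for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


shortest = b"\x2f"             # 00101111: the only valid encoding of "/"
overlong_two = b"\xc0\xaf"     # 11000000 10101111: same payload bits, padded to 2 bytes
overlong_three = b"\xe0\x80\xaf"  # padded to 3 bytes

print(is_valid_utf8(shortest))        # True
print(is_valid_utf8(overlong_two))    # False
print(is_valid_utf8(overlong_three))  # False
```

Rejecting the padded forms matters for security checks: a filter that looks for the byte 2F would otherwise miss a "/" hidden inside an overlong sequence.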

The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.
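
The ASCII compatibility can be checked in a couple of lines of Python; the sample string below is arbitrary.

```python
text = "Hello, world!"  # arbitrary sample containing only ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Same bytes either way, one byte per character.
assert ascii_bytes == utf8_bytes
assert len(utf8_bytes) == len(text)

# Bytes produced as ASCII decode cleanly as UTF-8.
assert ascii_bytes.decode("utf-8") == text
```

The reverse does not hold in general: UTF-8 text containing any character above U+007F is not valid ASCII.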

At the top of Figure 2, the binary representation of the code value for the trademark symbol is given. The trademark symbol (™) has Unicode value U+2122 and requires three bytes for its UTF-8 representation: E2 84 A2.
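
The trademark example can be reproduced in Python. This short sketch encodes U+2122, prints each byte in binary so the control bits are visible, and then strips the control bits by hand to recover the code point. The variable names are chosen just for this illustration.

```python
trademark = "\u2122"
encoded = trademark.encode("utf-8")

# Each byte as 8 binary digits: a 1110 lead byte followed by two 10 trailing bytes.
bits = [format(b, "08b") for b in encoded]
print(encoded.hex(" ").upper())  # E2 84 A2
print(bits)                      # ['11100010', '10000100', '10100010']

# Mask off the control bits and concatenate the remaining 4 + 6 + 6 payload bits.
lead, trail1, trail2 = encoded
code_point = ((lead & 0x0F) << 12) | ((trail1 & 0x3F) << 6) | (trail2 & 0x3F)
print(f"U+{code_point:04X}")     # U+2122
```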

The following video compares ASCII, Unicode and UTF-8.

This video discusses UTF-8 from a historical perspective.

