unicode transformation format 8 (UTF-8)

In Unicode Transformation Format 8 (UTF-8), the standard Latin letters and digits are encoded with one byte; special characters and umlauts take two or three bytes.
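These byte counts can be checked with Python's built-in `str.encode`. The following is a minimal sketch; the sample characters are illustrative choices and are not taken from the text above.

```python
# Sample characters, chosen only for illustration:
# a Latin letter, a digit, an umlaut and the euro sign.
samples = ["A", "7", "\u00e4", "\u20ac"]  # "\u00e4" is a-umlaut, "\u20ac" is the euro sign

# Map each character to the length of its UTF-8 encoding in bytes.
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_counts.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")
```

The letter and the digit each occupy one byte, the umlaut two and the euro sign three.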

The 128 characters of the ASCII character set are thus carried over unchanged: each is a single byte whose Most Significant Bit (MSB) is "0". If a byte begins with a "1", it belongs to a multi-byte sequence that encodes a character outside the ASCII range. Character sets whose encodings use two bytes are called Double Byte Character Sets (DBCS); those using more than two bytes are called Multibyte Character Sets (MBCS).
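The MSB test can be sketched with a single bit mask. The helper name `is_ascii_byte` and the sample word are assumptions made for this illustration.

```python
def is_ascii_byte(b: int) -> bool:
    """Return True if the most significant bit of the byte is 0."""
    return (b & 0x80) == 0


# Sample word (illustrative): "Z", u-umlaut, "r", "i", "c", "h".
data = "Z\u00fcrich".encode("utf-8")

# One flag per byte: True for ASCII bytes, False for bytes
# that are part of a multi-byte sequence.
flags = [is_ascii_byte(b) for b in data]
print(flags)
```

Only the two bytes that encode the umlaut have their MSB set; every other byte is identical to its ASCII value.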

UTF-8 encoding with the formation of byte strings

Characters outside the ASCII range are encoded as byte sequences. The position of each byte within a sequence is marked by a bit pattern placed at the beginning of the byte. The first byte of a two-byte sequence always starts with 110, the first byte of a three-byte sequence with 1110, and the first byte of a four-byte sequence with 11110. The following bytes always start with 10. The number of ones before the first "0" in the first byte thus indicates the number of bytes in the whole character.
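The bit patterns described above can be sketched in Python as follows. The function names `sequence_length` and `encode_code_point` are hypothetical helpers written for this illustration. The sketch checks only the leading bit pattern and does not implement every validity rule of RFC 3629 (for example, it does not reject the lead bytes 0xC0 and 0xC1).

```python
def sequence_length(lead: int) -> int:
    """Return the sequence length announced by the first byte."""
    if (lead & 0x80) == 0x00:  # 0xxxxxxx: single ASCII byte
        return 1
    if (lead & 0xE0) == 0xC0:  # 110xxxxx: start of two bytes
        return 2
    if (lead & 0xF0) == 0xE0:  # 1110xxxx: start of three bytes
        return 3
    if (lead & 0xF8) == 0xF0:  # 11110xxx: start of four bytes
        return 4
    raise ValueError("not a start byte (continuation or invalid pattern)")


def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by assembling the bit patterns."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")


# Compare the hand-built bytes with Python's own encoder.
for ch in ["A", "\u00e4", "\u20ac", "\U0001F600"]:
    encoded = encode_code_point(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}:", " ".join(f"{b:08b}" for b in encoded))
```

Printing the bytes in binary makes the start patterns 110, 1110 and 11110 and the 10 prefix of the following bytes directly visible.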

UTF-8 is specified in RFC 3629 from 2003, "UTF-8, a transformation format of ISO 10646".

English:	unicode transformation format 8 - UTF-8
Updated at:	19.12.2011