6 Types of Character Set

John Spacey, updated on May 19, 2017

A character set is a system for representing languages in data. Where binary data can include any sequence of 0s and 1s, text data is restricted to a set of binary sequences, each of which is interpreted as a character from a language. The following are common types of character set.

ASCII

An encoding for English characters that uses 7 bits to represent 128 characters. ASCII is an American standard that dates back to the early 1960s. It has been revised several times throughout its history, and vendors later expanded it into competing 8-bit versions. It is extremely limited, even for English, which led to a number of character sets being developed for each language.
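As a rough illustration of the 7-bit limit, the following Python sketch prints the ASCII codes for a short string and checks whether a string fits within the 128 codes. The helper name is_ascii and the sample strings are assumptions made for this example.

```python
# Every ASCII character has a code from 0 to 127, so it fits in 7 bits.
text = "Hello"
codes = [ord(ch) for ch in text]                 # numeric code of each character
bits = [format(code, "07b") for code in codes]   # the same codes as 7-bit binary

ascii_bytes = text.encode("ascii")               # stored as one byte per character


def is_ascii(s: str) -> bool:
    """Return True if every character in s is one of the 128 ASCII codes."""
    return all(ord(ch) < 128 for ch in s)


print(codes)
print(bits)
print(is_ascii("naive"), is_ascii("na\u00efve"))  # the second string contains an i with diaeresis
```

An accented letter already falls outside ASCII, which is the kind of gap the later character sets were created to fill.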

Language Specific Character Sets

Beginning in the mid-1960s, a large number of standards organizations and commercial entities developed their own character sets. By the early 1980s, there were many competing character sets for each major language. As a result, character encoding had become a major problem for the computing industry and a common usability issue.
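The practical problem can be sketched in a few lines of Python: the same byte means a different character depending on which character set the reader assumes. The three character sets used here (Latin-1, IBM code page 437 and KOI8-R) are picked only for illustration and are not named in the text above.

```python
# One byte, three legacy character sets, three different characters.
raw = bytes([0xE9])

decoded = {name: raw.decode(name) for name in ("latin-1", "cp437", "koi8_r")}

for name, ch in decoded.items():
    # Show the character together with the Unicode code point it maps to.
    print(f"{name:8} -> {ch} (U+{ord(ch):04X})")
```

If the sender and the receiver disagree about the character set, the text arrives garbled even though no byte was changed.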

Unicode

Unicode is an international standard first released in the early 1990s to unify the world's modern and historic writing systems under a common character set. As of version 9.0, it covers 135 scripts and 128,237 characters, and more are added with each release. Its upper limit is 1,114,112 code points, which correspond to the hexadecimal numbers 0 to 10FFFF. Unicode is a standard that defines characters and their numeric codes. How those codes are stored as bits is left to encodings such as UTF-8, UTF-16 and UTF-32.
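The arithmetic behind the limit can be checked with a short Python sketch. It uses the standard unicodedata module. The sample characters and variable names are assumptions chosen for this example.

```python
import unicodedata

# The Unicode code space runs from 0 to 10FFFF (hexadecimal), inclusive.
MAX_CODE_POINT = 0x10FFFF
total_code_points = MAX_CODE_POINT + 1
print(total_code_points)

# Each character has a number (its code point) and an official name.
samples = ["A", "\u00e9", "\u65e5", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Python refuses to create a character beyond the limit.
try:
    chr(MAX_CODE_POINT + 1)
    beyond_limit_allowed = True
except ValueError:
    beyond_limit_allowed = False
print(beyond_limit_allowed)
```

Nothing here says how a code point is stored as bytes; that depends on the encoding.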

UTF-8

UTF-8 is a character encoding that implements Unicode. Despite its name, UTF-8 isn't a fixed 8-bit encoding but a variable-length encoding that uses 8, 16, 24 or 32 bits per character. It encodes the most common characters, such as digits and unaccented English letters, with 8 bits. This makes it efficient for most data. Another advantage of UTF-8 is that text containing only ASCII characters is byte-for-byte identical to ASCII.
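A minimal Python sketch of the variable length, using four sample characters chosen for this example (a Latin letter, an accented letter, a CJK ideograph and an emoji):

```python
# Sample characters written as escapes so the code points are unambiguous.
samples = {
    "A": "A",                    # U+0041
    "e-acute": "\u00e9",         # U+00E9
    "CJK sun": "\u65e5",         # U+65E5
    "emoji": "\U0001F600",       # U+1F600
}

# Number of bytes each character occupies in UTF-8.
utf8_lengths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}
print(utf8_lengths)

# Text made only of ASCII characters produces the same bytes either way.
same_as_ascii = "Hello, world".encode("utf-8") == "Hello, world".encode("ascii")
print(same_as_ascii)
```

The byte count grows from one to four as the code point gets larger, while plain ASCII text is left unchanged.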

UTF-16

UTF-16 is a character encoding that implements Unicode. Like UTF-8, it is a variable-length encoding that uses up to 32 bits. It encodes the most common characters with 16 bits and less common characters with 32 bits, stored as a pair of 16-bit units.
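The 16-bit and 32-bit cases can be sketched in Python as follows. The function surrogate_pair is a hypothetical helper written for this example; it applies the standard UTF-16 rule for code points above FFFF and is compared against Python's built-in encoder. The big-endian variant "utf-16-be" is used so that no byte order mark is added to the output.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above 0xFFFF into two 16-bit UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a pair")
    offset = code_point - 0x10000          # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


common = "\u65e5"        # U+65E5, fits in one 16-bit unit
rare = "\U0001F600"      # U+1F600, needs two 16-bit units

common_size = len(common.encode("utf-16-be"))
rare_size = len(rare.encode("utf-16-be"))
print(common_size, rare_size)

high, low = surrogate_pair(ord(rare))
print(f"{high:04X} {low:04X}", rare.encode("utf-16-be").hex().upper())
```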

UTF-32

UTF-32 is a character encoding that implements Unicode as a fixed 32-bit code. Unicode only requires 21 bits to encode its limit of 1,114,112 code points. As a result, UTF-32 pads each code with leading zeros. This is inefficient: data is never larger in UTF-8 or UTF-16 than in UTF-32, and it is usually much smaller. For English data, UTF-32 is typically about 4 times larger than UTF-8.
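A short Python sketch of the size difference. The sample strings and the helper name sizes are assumptions for this example, and the little-endian variants are used so that no byte order mark is counted.

```python
# The largest code point needs 21 bits, so 11 of the 32 bits are always zero.
bits_needed = (0x10FFFF).bit_length()
print(bits_needed)


def sizes(text: str) -> dict[str, int]:
    """Return the encoded size in bytes of text in each Unicode encoding."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


english = "The quick brown fox"
emoji = "\U0001F600" * 5

english_sizes = sizes(english)
emoji_sizes = sizes(emoji)
print(english_sizes)
print(emoji_sizes)
```

For the English sample, UTF-32 is four times the size of UTF-8. For the emoji sample, all three encodings use four bytes per character, which is the case where UTF-32 carries no size penalty.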

Overview: Character Set
Type	Data Files
Definition	A system for representing text with binary codes.
Related Concepts	Data Files, Everything is a File, Compression, Encryption, Computing