6 Types of Character Set

A character set is a system for representing the characters of a language as data. Whereas binary data can contain any sequence of 0s and 1s, text data is restricted to binary sequences that are each interpreted as a character. The following are common types of character set.

ASCII

A 7-bit encoding for English that maps 128 codes to characters. ASCII is an American standard that dates back to the early 1960s. It has been revised numerous times throughout its history, and it was later extended into competing 8-bit versions. It is extremely limited, even for English, which led to a number of character sets being developed for each language.
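As a minimal sketch of the 7-bit mapping, the following Python snippet prints the code and 7-bit pattern for each character of a sample string. The helper name `is_ascii` and the sample strings are assumptions made for illustration.

```python
# Each ASCII character maps to a number from 0 to 127, which fits in 7 bits.
text = "Hello"
codes = [ord(ch) for ch in text]            # numeric code for each character
bits = [format(c, "07b") for c in codes]    # the same codes as 7-bit patterns

for ch, code, pattern in zip(text, codes, bits):
    print(ch, code, pattern)


def is_ascii(s: str) -> bool:
    """Return True if every character in s can be encoded as ASCII."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(is_ascii("Hello"))       # plain English letters are within the 128 codes
print(is_ascii("caf\u00e9"))   # "e" with an acute accent is outside ASCII
```

The second check illustrates the limitation described above: a single accented letter is enough to fall outside the 128 available codes.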

Language Specific Character Sets

Beginning in the mid-1960s, a large number of standards organizations and commercial entities developed their own character sets. By the early 1980s, there were many character sets for each major language. As a result, handling character encoding became a major issue in the computing industry and a common usability problem.

Unicode

Unicode is an international standard first released in the early 1990s to unify modern and historic writing systems under a common character set. As of this writing, it covers 135 scripts and 128,237 characters, a number that grows with each release. Its maximum is 1,114,112 code points, which corresponds to the hexadecimal numbers 0 to 10FFFF. Unicode defines which characters exist and the number assigned to each; how those numbers are stored as bits is defined by encodings such as UTF-8, UTF-16 and UTF-32.
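The arithmetic behind the 1,114,112 limit can be checked with a short Python sketch. The sample characters are arbitrary choices for illustration; `sys.maxunicode` and `unicodedata.name` are part of the Python standard library.

```python
import sys
import unicodedata

MAX_CODE_POINT = 0x10FFFF                 # highest Unicode code point
total_code_points = MAX_CODE_POINT + 1    # code points are counted from 0

print(total_code_points)                  # 1114112
print(sys.maxunicode == MAX_CODE_POINT)   # Python uses the same upper limit

# A code point is just a number assigned to a character.
samples = ["A", "\u00e9", "\u20ac"]       # A, e with acute accent, euro sign
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Note that the code points printed here are independent of any byte representation, which is why the same character can have different sizes in different encodings.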

UTF-8

UTF-8 is a character set that implements Unicode. Despite its name, UTF-8 isn't a fixed 8-bit encoding but a variable-length encoding that uses 8, 16, 24 or 32 bits per character. It encodes the most common characters, such as digits and basic English letters, in 8 bits. This makes it efficient for most data. Another advantage is that UTF-8 is identical to ASCII for the 128 ASCII characters, so plain English text is the same in both.
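To make the variable length concrete, here is a small Python sketch that encodes four sample characters and counts the resulting bytes. The characters chosen and the variable names are assumptions for illustration.

```python
# Characters from different ranges of Unicode need different numbers of bytes.
samples = {
    "A": "Latin letter",
    "\u00e9": "e with acute accent",
    "\u20ac": "euro sign",
    "\U0001F600": "grinning face emoji",
}

utf8_lengths = {}
for ch, description in samples.items():
    encoded = ch.encode("utf-8")
    utf8_lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X} {description}: {len(encoded)} byte(s), hex {encoded.hex()}")

# For text that only uses ASCII characters, the UTF-8 bytes equal the ASCII bytes.
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(same_as_ascii)
```

The four samples take 1, 2, 3 and 4 bytes respectively, which corresponds to the 8 to 32 bits described above.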

UTF-16

UTF-16 is a character set that implements Unicode. Like UTF-8, it is a variable-length encoding that uses up to 32 bits. It encodes the most common characters in 16 bits and less common characters in 32 bits.
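The split between 16-bit and 32-bit characters can be sketched in Python as follows. The function name `surrogate_pair` is hypothetical; it reproduces the standard arithmetic UTF-16 uses for code points above U+FFFF. The `utf-16-be` codec is used here because the plain `utf-16` codec prepends a byte order mark, which would distort the byte counts.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into two 16-bit UTF-16 code units."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low


common = "\u20ac"      # euro sign, encoded in 16 bits
rare = "\U0001F600"    # grinning face emoji, encoded in 32 bits

common_bytes = common.encode("utf-16-be")
rare_bytes = rare.encode("utf-16-be")

print(len(common_bytes) * 8, "bits:", common_bytes.hex())
print(len(rare_bytes) * 8, "bits:", rare_bytes.hex())

high, low = surrogate_pair(ord(rare))
print(f"{high:04x}{low:04x}")   # matches the hex of rare_bytes
```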

UTF-32

UTF-32 is a character set that implements Unicode as a fixed 32-bit code. Unicode only requires 21 bits to encode its limit of 1,114,112 code points. As such, every UTF-32 code is padded with leading zeros. This is inefficient: data is never larger in UTF-8 or UTF-16 than in UTF-32, and is usually much smaller. For English data, UTF-32 is typically about 4 times larger than UTF-8.
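A rough Python sketch of the size comparison follows. The helper `encoded_sizes` and the sample string are assumptions for illustration; the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
MAX_CODE_POINT = 0x10FFFF
bits_needed = MAX_CODE_POINT.bit_length()   # bits required for the largest code point
print(bits_needed)                          # 21


def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes text occupies in each encoding."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


sizes = encoded_sizes("character set")      # 13 English characters
print(sizes)

ratio = sizes["utf-32-le"] / sizes["utf-8"]
print(ratio)                                # UTF-32 size relative to UTF-8
```

For this English sample the sizes are 13, 26 and 52 bytes, so UTF-32 is 4 times larger than UTF-8 and twice as large as UTF-16.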
Overview: Character Set
Definition: A system for representing text with binary codes.