Encoding #

Encoding is a way of representing one thing with something else. Computers can only store 0s and 1s, so characters, for example, must be encoded as sequences of 0s and 1s before they can be stored or transmitted.

Encoding Types #

ASCII #

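A 7-bit encoding scheme which represents 128 distinct characters (values 0 to 127). These are mostly the English letters, digits, punctuation, and control characters. Each character is usually stored in 1 byte, and even with all 8 bits in use a single byte cannot support more than 256 different combinations.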
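A minimal Python sketch of the ASCII range. The helper name `is_ascii_encodable` is made up for illustration; `ord` and `str.encode` are standard built-ins.

```python
# Each ASCII character corresponds to a number between 0 and 127.
text = "Hi!"
codes = [ord(ch) for ch in text]      # numeric value of each character
ascii_bytes = text.encode("ascii")    # one byte per character


def is_ascii_encodable(s: str) -> bool:
    """Hypothetical helper: True if every character in s fits in ASCII."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        # Raised for any character outside the 0..127 range.
        return False


print(codes)
print(is_ascii_encodable("plain text"))
print(is_ascii_encodable("caf\u00e9"))  # contains U+00E9, which is not ASCII
```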

Unicode #

Attempts to build one character set for all characters. The Unicode code space has room for over 1 million code points (U+0000 to U+10FFFF), of which well over 100,000 are currently assigned to characters.

Unicode is not actually an encoding. It is just a map from characters to code points. Encodings such as UTF-8, UTF-16, and UTF-32 then specify how those code points are turned into bytes.

Four bytes (32 bits, 2^32 possible values) are more than enough to encode every potential Unicode value at a fixed width, which is what UTF-32 does, but this would waste a lot of space for the most common letters.

Thus, UTF-8 and UTF-16 were developed as variable-length encodings to address this problem. UTF-8 is ASCII compatible: any ASCII text is already valid UTF-8. UTF-16 is not ASCII compatible and thus can break things when a parser, etc. expects an ASCII-compatible encoding.
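To make the size trade-off concrete, here is a small Python sketch that counts the bytes each encoding needs for a few sample characters. The function name `encoded_lengths` and the choice of samples are illustrative assumptions. The little-endian codec variants are used so that no byte-order mark is added to the counts.

```python
# Sample characters, written as escapes so the source file's own encoding
# does not matter: "A", e-acute, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def encoded_lengths(ch: str) -> dict:
    """Hypothetical helper: bytes needed for ch in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le: no byte-order mark
        "utf-32": len(ch.encode("utf-32-le")),
    }


for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# ASCII compatibility: UTF-8 leaves ASCII bytes unchanged, UTF-16 does not.
print("A".encode("utf-8"))
print("A".encode("utf-16-le"))
```

UTF-8 grows from 1 to 4 bytes across the samples, UTF-16 uses 2 or 4, and UTF-32 always uses 4.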

Characters in Unicode are referred to by their code points, which are written in hexadecimal prefixed by U+.

e.g. U+1E001
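In Python, the built-ins `ord` and `chr` convert between a character and its code point. The sketch below formats a code point in the U+ notation; `code_point_label` is a made-up name for illustration.

```python
def code_point_label(ch: str) -> str:
    """Hypothetical helper: format a single character as U+XXXX."""
    # At least four hex digits, upper case, as in the usual notation.
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))
print(code_point_label("\u20ac"))  # euro sign

# Going the other way: from a code point number back to a character.
ch = chr(0x1E001)
print(code_point_label(ch))
```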

Note:

Since Unicode can represent all characters, you should always be using a Unicode-based encoding.

UTF-16 #

Base64 #

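Base64 represents arbitrary bytes using only 64 printable ASCII characters, so the data survives channels that only handle text reliably. The process for safely transporting text is thus as follows: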

Sender:

  1. Encode the text string in the encoding of choice (e.g. UTF-8) to turn the text into bytes.
  2. Base64 encode the bytes and send the resulting string to the other computer.

Receiver:

  1. Base64 decode the string back into bytes.
  2. Knowing which text encoding was used (e.g. UTF-8), decode the bytes back into text.
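The sender and receiver steps above can be sketched in Python with the standard `base64` module. The function names `send` and `receive` are hypothetical, and both sides are assumed to have agreed on UTF-8 in advance.

```python
import base64


def send(text: str, encoding: str = "utf-8") -> str:
    """Hypothetical sender: text -> bytes -> Base64 string."""
    raw = text.encode(encoding)           # step 1: text to bytes
    encoded = base64.b64encode(raw)       # step 2: bytes to Base64 (ASCII bytes)
    return encoded.decode("ascii")        # safe: Base64 output is pure ASCII


def receive(payload: str, encoding: str = "utf-8") -> str:
    """Hypothetical receiver: Base64 string -> bytes -> text."""
    raw = base64.b64decode(payload)       # step 1: Base64 back to bytes
    return raw.decode(encoding)           # step 2: bytes back to text


message = "h\u00e9llo \u20ac"
payload = send(message)
print(payload)
print(receive(payload) == message)
```

If the receiver assumes a different text encoding than the sender used, the Base64 step still succeeds but the final decode produces wrong characters or raises an error, which is why the encoding has to be known.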

Percent Encoding (URL Encoding) #