
Encoding

Unicode

Code Point

The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. The codespace consists of the integers from 0 to 10FFFF (hexadecimal), usually written U+0000 to U+10FFFF.

When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.
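As a small illustration, the sketch below uses Python's built-in `ord` and `chr` to move between a character and its code point. The sample characters and variable names are arbitrary choices for this example, not part of the notes above.

```python
# Sample characters chosen only for illustration.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    cp = ord(ch)              # character -> code point (an integer)
    assert chr(cp) == ch      # code point -> character
    print(f"U+{cp:04X}")      # conventional U+XXXX notation

# The codespace ends at 0x10FFFF; a larger integer is not a code point.
MAX_CODE_POINT = 0x10FFFF
try:
    chr(MAX_CODE_POINT + 1)
except ValueError:
    out_of_range_rejected = True
else:
    out_of_range_rejected = False
```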

The design of the Unicode Standard is independent of the design of any particular basic text-processing algorithm.

Sorting and string comparison algorithms cannot assume that the assignment of Unicode character code numbers provides an alphabetical ordering for lexicographic string comparison.
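A quick way to see this is to sort a few words by raw code point, which is what Python's default string comparison does. The word list below is made up for the example; a proper alphabetical sort would need a collation algorithm rather than this comparison.

```python
# Hypothetical word list; "\u00e9clair" starts with U+00E9 (e with acute).
words = ["zebra", "Apple", "\u00e9clair", "banana"]

# Default comparison orders strings by code point, not alphabetically.
by_code_point = sorted(words)

# U+00E9 (233) is greater than "z" (122), so the accented word lands last,
# even though a reader would expect it between "banana" and "zebra".
print(by_code_point)
```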

The Unicode Standard identifies more than 100 different character properties, including numeric, casing, combination, and directionality properties.

Unicode Collation Algorithm - defines a set of default collation weights that can be used with a standard algorithm. Tailorings for each language are provided in the Unicode Common Locale Data Repository.

Unicode Bidirectional Algorithm - defines the conversion of Unicode text from logical order to the order of readable (displayed) text so as to ensure consistent legibility.

Whereas the code point is what we store, an encoding (such as UTF-8) deals with how we store it: an encoding is an implementation.
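To make the "what" versus "how" distinction concrete, the following sketch encodes one code point with three different encodings. The choice of the euro sign and the big-endian variants is only for illustration.

```python
ch = "\u20ac"  # EURO SIGN, code point U+20AC

# Same code point, three different byte-level representations.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")   # big-endian, no byte order mark
utf32 = ch.encode("utf-32-be")   # big-endian, no byte order mark

for name, data in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    print(name, data.hex(" "), f"({len(data)} bytes)")

# Decoding any of them yields the same code point again.
assert utf8.decode("utf-8") == utf16.decode("utf-16-be") == ch
```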

Unicode Plane

In the Unicode Standard, a plane is a contiguous group of 65,536 (2^16) code points. There are 17 planes, numbered 0 to 16.
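Because each plane holds 2^16 code points, the plane number is simply the code point shifted right by 16 bits. The helper name `plane_of` below is hypothetical, introduced only for this sketch.

```python
PLANE_SIZE = 1 << 16          # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF

def plane_of(ch: str) -> int:
    """Return the plane number (0 to 16) of a single character."""
    return ord(ch) >> 16

# Total number of planes follows from the size of the codespace.
plane_count = (MAX_CODE_POINT + 1) // PLANE_SIZE

print(plane_of("A"))             # a character in plane 0
print(plane_of("\U0001F600"))    # an emoji outside plane 0
print(plane_count)
```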

Code Unit

The minimal bit combination that can represent a unit of encoded text for processing or interchange. For example, a code unit is 8 bits in UTF-8 and 16 bits in UTF-16.
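One way to see code units in practice is to count how many of them each encoding needs for the same character. The function name `code_units` is an assumption made for this sketch; it derives the counts from the encoded byte length.

```python
def code_units(text: str) -> dict:
    """Count code units needed to represent `text` in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 1 byte per unit
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 4 bytes per unit
    }

ascii_units = code_units("A")
euro_units = code_units("\u20ac")
emoji_units = code_units("\U0001F600")
print(ascii_units, euro_units, emoji_units)
```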

Grapheme Cluster

A grapheme cluster is a sequence of one or more code points that is displayed as a single graphical unit, which a reader recognizes as a single element of the writing system.
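The sketch below shows why counting code points is not the same as counting what a reader sees. It only counts code points; Python's standard library does not segment text into grapheme clusters, so real segmentation would need a third-party library. The sample strings are chosen for illustration.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: one visible unit, two code points.
decomposed = "e\u0301"
decomposed_len = len(decomposed)

# NFC normalization composes this pair into the single code point U+00E9,
# but not every cluster has a precomposed form.
composed = unicodedata.normalize("NFC", decomposed)
composed_len = len(composed)

# A flag is a pair of regional indicator symbols: one visible unit,
# two code points, and normalization does not merge them.
flag = "\U0001F1EE\U0001F1F3"
flag_len = len(unicodedata.normalize("NFC", flag))

print(decomposed_len, composed_len, flag_len)
```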

Glyphs

Glyphs represent the shapes that characters can have when they are rendered or displayed. A repertoire of glyphs makes up a font.

A single glyph may correspond to a single character or to a number of characters, or multiple glyphs may result from a single character.

Glyph shape and methods of identifying and selecting glyphs are the responsibility of individual font vendors and of appropriate standards and are not part of the Unicode Standard.

Email Content Transfer Encoding

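SMTP guarantees only a 7-bit channel. The 8th bit of each byte may be ignored or stripped in transit, so non-ASCII content has to be encoded into 7-bit-safe text before sending.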
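To illustrate what is at stake, the sketch below simulates a channel that clears the 8th bit of every byte. This is a simplified model assumed for the example, not a description of how any particular mail server behaves.

```python
# UTF-8 bytes for "café"; the last character needs two bytes above 0x7F.
original = "caf\u00e9".encode("utf-8")

# Simulate a 7-bit channel by clearing the high bit of each byte.
stripped = bytes(b & 0x7F for b in original)

# ASCII bytes survive unchanged, the non-ASCII bytes are corrupted.
print(original)
print(stripped)
```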

Quoted-Printable

  1. The mail gem already uses this when the content contains at least one non-ASCII character.

  2. Encodes only non-ASCII bytes (plus a few special characters such as "="), each as "=" followed by two hex digits.

  3. Plain English text (printable ASCII characters) is left as it is.

  4. Email content remains readable in most cases.

  5. Content size can grow to three times the original for text in other languages, because every non-ASCII byte becomes three characters.
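The points above can be checked with Python's standard `quopri` module, used here only as a convenient Quoted-Printable implementation (the mail gem itself is Ruby and is not involved). The sample strings are assumptions made for the example.

```python
import quopri

# Printable ASCII passes through unchanged.
ascii_text = b"Hello, world"
ascii_encoded = quopri.encodestring(ascii_text)

# Mostly-ASCII text stays readable; only the non-ASCII bytes are escaped.
mixed_text = "caf\u00e9".encode("utf-8")
mixed_encoded = quopri.encodestring(mixed_text)

# Text made entirely of non-ASCII bytes triples in size.
cjk_text = "\u65e5\u672c\u8a9e".encode("utf-8")   # 9 bytes of UTF-8
cjk_encoded = quopri.encodestring(cjk_text)

print(ascii_encoded, mixed_encoded, cjk_encoded)
print(len(cjk_text), "->", len(cjk_encoded))
```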

Base64

  1. Encodes every 3 bytes of input as 4 ASCII characters.

  2. Results in fully unreadable content.

  3. The increase in email size is consistent (about 33%), regardless of the language of the content.
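The 3-to-4 ratio can be verified with Python's standard `base64` module. The payload below is arbitrary filler; note that this sketch measures the raw encoding only, and line wrapping in an actual email adds a little on top of the 33%.

```python
import base64

# Three input bytes become exactly four output characters.
triple = base64.b64encode(b"abc")

# Even mostly-ASCII text becomes unreadable once encoded.
mixed = base64.b64encode("caf\u00e9".encode("utf-8"))

# Arbitrary 300-byte payload: output is 4/3 of the input size.
raw = bytes(range(256)) + bytes(44)
encoded = base64.b64encode(raw)
growth = len(encoded) / len(raw)

print(triple, mixed)
print(len(raw), "->", len(encoded))
```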