
About Unicode Normalization

Unicode normalization is the process of decomposing and composing characters. Some Unicode characters look the same but have multiple representations. For example, "â" can be represented either as the single code point U+00E2, or as the two code points "a" (U+0061) followed by " ̂" (U+0302), i.e. a base character plus a combining character. The former is called a precomposed character and the latter a combining character sequence (CCS).
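A minimal Python sketch using the standard `unicodedata` module shows that the two representations render identically yet compare unequal until normalized:

```python
import unicodedata

precomposed = "\u00e2"      # "â" as a single precomposed code point
combining = "\u0061\u0302"  # "a" + combining circumflex (a CCS)

# The two strings display the same glyph but are different code point sequences.
print(precomposed == combining)            # False
print([hex(ord(c)) for c in precomposed])  # ['0xe2']
print([hex(ord(c)) for c in combining])    # ['0x61', '0x302']

# Normalizing both to the same form makes them compare equal.
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```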

Unicode defines the following four normalization forms:

Normalization Form D (NFD): Characters are decomposed by canonical equivalence.
    Example: "â" (U+00E2) -> "a" (U+0061) + " ̂" (U+0302)

Normalization Form KD (NFKD): Characters are decomposed by compatibility.
    Example: "fi" (U+FB01) -> "f" (U+0066) + "i" (U+0069)

Normalization Form C (NFC): Characters are decomposed, then re-composed by canonical equivalence.
    Example: "â" (U+00E2) -> "a" (U+0061) + " ̂" (U+0302) -> "â" (U+00E2)

Normalization Form KC (NFKC): Characters are decomposed by compatibility, then re-composed by canonical equivalence.
    Example: "fi" (U+FB01) -> "f" (U+0066) + "i" (U+0069) -> "f" (U+0066) + "i" (U+0069)

Canonical equivalence relates representations that are visually and functionally identical, e.g. "â" <-> "a" + " ̂".

Compatibility equivalence is broader: in addition to canonically equivalent sequences, it also equates characters that share the same meaning but may differ in appearance, e.g. the ligature "fi" -> "f" + "i". Note that compatibility decomposition is lossy: once "fi" is decomposed, the original ligature form cannot be recovered.
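The four forms can be compared side by side with Python's `unicodedata.normalize`. Note how the canonical forms (NFD/NFC) leave the "fi" ligature untouched, while the compatibility forms (NFKD/NFKC) split it apart:

```python
import unicodedata

def code_points(s):
    """Render a string's code points as U+XXXX labels."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# "â" (precomposed) and the "fi" ligature as test inputs.
for text in ["\u00e2", "\ufb01"]:
    for form in ("NFD", "NFKD", "NFC", "NFKC"):
        result = unicodedata.normalize(form, text)
        print(f"{form}({code_points(text)}) -> {code_points(result)}")
```

Running this prints, among other lines, `NFD(U+00E2) -> U+0061 U+0302` and `NFKC(U+FB01) -> U+0066 U+0069`, while `NFD(U+FB01) -> U+FB01` stays unchanged because canonical decomposition ignores compatibility mappings.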