- UTF-16LE
- UTF-32LE
- US-ASCII
- ISO-8859-1 (Latin-1)
- ISO-8859-15 (Latin-9)
- Windows-1252
- ISO-8859-2 (Latin-2)
- Windows-1250
- ISO-8859-3 (Latin-3)
- ISO-8859-4 (Latin-4)
- ISO-8859-13 (Latin-7)
- Windows-1257
- Shift_JIS
- EUC-JP
- ISO-2022-JP (JIS)
- GB2312 (EUC-CN)
- GB18030
- Big5-HKSCS
- EUC-KR (KS X 1001)
- ISO-2022-KR
- ISO-8859-5
- Windows-1251
- KOI8-R
- KOI8-U
- ISO-8859-6
- Windows-1256
- ISO-8859-7
- Windows-1253
- ISO-8859-8
- Windows-1255
- ISO-8859-9 (Latin-5)
- Windows-1254
- TIS-620
- Windows-874
- Windows-1258
Decoded
Form | Result |
---|---|
Unicode NFD | |
Unicode NFKD | |
Encoded
Form | Result |
---|---|
Unicode NFC | |
Unicode NFKC | |
About Unicode Normalization
Unicode normalization converts text to a standard form by decomposing and composing characters. Some Unicode characters look identical but have more than one representation. For example, "â" can be represented either as the single code point U+00E2, or as two code points: the base character "a" (U+0061) followed by the combining character " ̂" (U+0302). The former is called a precomposed character and the latter is called a combining character sequence (CCS).
Unicode defines four normalization forms:
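The two representations of "â" can be compared directly. The following is a minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00e2"   # "â" as a single precomposed code point
combining = "a\u0302"    # "a" followed by the combining circumflex accent

# The strings render the same but are different code point sequences.
print(precomposed == combining)           # False
print(len(precomposed), len(combining))   # 1 2

# Normalizing both sides to the same form makes them compare equal.
print(unicodedata.normalize("NFC", combining) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == combining)   # True
```

This is why text is usually normalized to a single form before comparing, searching, or sorting it.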
Normalization Form | Description | Example |
---|---|---|
Normalization Form D (NFD) | Characters are decomposed by canonical equivalence | "â" (U+00E2) -> "a" (U+0061) + " ̂" (U+0302) |
Normalization Form KD (NFKD) | Characters are decomposed by compatibility equivalence | "ﬁ" (U+FB01) -> "f" (U+0066) + "i" (U+0069) |
Normalization Form C (NFC) | Characters are decomposed and then recomposed by canonical equivalence | "â" (U+00E2) -> "a" (U+0061) + " ̂" (U+0302) -> "â" (U+00E2) |
Normalization Form KC (NFKC) | Characters are decomposed by compatibility equivalence, then recomposed by canonical equivalence | "ﬁ" (U+FB01) -> "f" (U+0066) + "i" (U+0069) -> "f" (U+0066) + "i" (U+0069) |
Canonical equivalence relates sequences that represent the same character and are visually and functionally equivalent. e.g. "â" <-> "a" + " ̂"
Compatibility equivalence is broader: in addition to canonical equivalents, it also covers characters that represent the same abstract character but may differ in appearance or behavior, such as ligatures. e.g. "ﬁ" -> "f" + "i"
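The difference between the four forms can be checked with a short script. This is a sketch using Python's standard `unicodedata` module; the helper `code_points` and the sample labels are hypothetical names introduced for illustration.

```python
import unicodedata

def code_points(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Sample inputs: precomposed "â" (U+00E2) and the "fi" ligature (U+FB01).
samples = {"a-circumflex": "\u00e2", "fi-ligature": "\ufb01"}

results = {}
for label, text in samples.items():
    for form in ("NFD", "NFKD", "NFC", "NFKC"):
        results[(label, form)] = code_points(unicodedata.normalize(form, text))
        print(f"{label:13} {form:5} {results[(label, form)]}")

# a-circumflex  NFD   U+0061 U+0302
# a-circumflex  NFKD  U+0061 U+0302
# a-circumflex  NFC   U+00E2
# a-circumflex  NFKC  U+00E2
# fi-ligature   NFD   U+FB01
# fi-ligature   NFKD  U+0066 U+0069
# fi-ligature   NFC   U+FB01
# fi-ligature   NFKC  U+0066 U+0069
```

The canonical forms (NFD, NFC) leave the ligature untouched, while the compatibility forms (NFKD, NFKC) replace it with plain "f" and "i". That replacement cannot be reversed, so the compatibility forms lose information about the original text.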