About Unicode escape sequences
A Unicode escape sequence represents a single character by its code point, written as 4 hexadecimal digits in a format such as \uXXXX. For example, "A" becomes "\u0041".
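As a minimal sketch of the conversion described above, the following Python function encodes each character as \uXXXX. The function name `escape_bmp` is hypothetical and is not part of DenCode. It assumes every character fits in 4 hexadecimal digits and rejects anything above U+FFFF.

```python
def escape_bmp(text: str) -> str:
    # Encode each character as a backslash, "u", and 4 uppercase hex digits.
    # Assumption: only code points up to U+FFFF are accepted here.
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:X} does not fit in 4 hex digits")
        out.append(f"\\u{cp:04X}")
    return "".join(out)


print(escape_bmp("A"))    # \u0041
print(escape_bmp("ABC"))  # \u0041\u0042\u0043
```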
In DenCode, in addition to the \uXXXX format, the following notation formats are also supported.
Format | Conversion result of "ABC" | Description / Programming language |
---|---|---|
\uXXXX | \u0041\u0042\u0043 | Common Unicode escape sequences |
\u{X} | \u{41}\u{42}\u{43} | Lua |
\x{X} | \x{41}\x{42}\x{43} | Perl |
\X | \41\42\43 | CSS |
&#xX; | &#x41;&#x42;&#x43; | HTML, XML |
%uXXXX | %u0041%u0042%u0043 | Percent-encoding (Non-standard) |
U+XXXX | U+0041 U+0042 U+0043 | Unicode standard notation for code points (Space separated) |
0xX | 0x41 0x42 0x43 | Hexadecimal notation of code points (Space separated) |
Some of the above formats are described in RFC 5137 (ASCII Escaping of Unicode Characters), a Best Current Practice document, but there is no international standard for Unicode escape notation.
The %uXXXX format is supported by Microsoft IIS, but it is non-standard. In C#, System.Web.HttpUtility.UrlEncodeUnicode encodes to the %u format, but this method has been obsolete since .NET Framework 4.5.
Please note that in the \X format, a trailing single space is treated as a delimiter and ignored when decoding, as specified by CSS. In the U+XXXX and 0xX formats, each character is separated by a single space when encoded, and a trailing single space is ignored when decoded, as in the \X format.
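The delimiter rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not DenCode's actual implementation: the function names and regular expressions are hypothetical, and hex values above 10FFFF are not handled.

```python
import re


def encode_u_plus(text: str) -> str:
    # U+XXXX: at least 4 hex digits per code point, joined by single spaces.
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Each pattern matches one escape plus an optional single trailing space,
# so the space acts only as a delimiter and is dropped when decoding.
_U_PLUS = re.compile(r"U\+([0-9A-Fa-f]{4,6}) ?")
_CSS = re.compile(r"\\([0-9A-Fa-f]{1,6}) ?")


def _to_char(match):
    # Convert the captured hex digits to the corresponding character.
    return chr(int(match.group(1), 16))


def decode_u_plus(text: str) -> str:
    return _U_PLUS.sub(_to_char, text)


def decode_css(text: str) -> str:
    return _CSS.sub(_to_char, text)


print(encode_u_plus("ABC"))                   # U+0041 U+0042 U+0043
print(decode_u_plus("U+0041 U+0042 U+0043"))  # ABC
print(decode_css("\\41\\42\\43"))             # ABC
print(decode_css("\\41 BC"))                  # ABC
```

The last call shows why the delimiter matters: without the space, "\41BC" would be read as the single code point U+41BC, because B and C are themselves hexadecimal digits.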
Escaping by Unicode Name
Escaping a character by its Unicode name is also supported.
Format | Conversion result of "A" | Description / Programming language |
---|---|---|
\N{name} | \N{LATIN CAPITAL LETTER A} | C++23, Python, Perl |
Unicode names can be found at Names List Charts - Unicode or NamesList.txt - Unicode.
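For illustration, Python's standard `unicodedata` module can look up these names in both directions. The sketch below is one possible approach rather than DenCode's implementation; `escape_by_name` and `unescape_by_name` are hypothetical helper names. Note that `unicodedata.name` raises `ValueError` for characters that have no name, such as control characters.

```python
import re
import unicodedata


def escape_by_name(text: str) -> str:
    # Build \N{NAME} for each character from its Unicode character name.
    return "".join("\\N{" + unicodedata.name(ch) + "}" for ch in text)


_NAME_ESCAPE = re.compile(r"\\N\{([^}]+)\}")


def unescape_by_name(text: str) -> str:
    # Replace each \N{NAME} with the character that carries that name.
    return _NAME_ESCAPE.sub(lambda m: unicodedata.lookup(m.group(1)), text)


print(escape_by_name("A"))                            # \N{LATIN CAPITAL LETTER A}
print(unescape_by_name("\\N{LATIN CAPITAL LETTER A}"))  # A
```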
Non-BMP characters in Unicode escape sequences
Characters outside the Basic Multilingual Plane (BMP) have code points above U+FFFF, which do not fit in 4 hexadecimal digits, so each programming language represents them with one of the following notation formats.
For example, "😀" (U+1F600), a non-BMP character, is converted as follows.
Format | Conversion result of "😀" (U+1F600) | Programming language |
---|---|---|
\uXXXX | \uD83D\uDE00 | Java, Kotlin, Scala |
\u{X} | \u{1F600} | C++23, Rust, Swift, JavaScript, PHP, Ruby, Dart, Lua |
\U00XXXXXX | \U0001F600 | C, C++, Objective-C, C#, Go, Python, R |
\x{X} | \x{1F600} | Perl |
\X | \1F600 | CSS |
&#xX; | &#x1F600; | HTML, XML |
%uXXXX | %uD83D%uDE00 | - |
U+XXXX | U+1F600 | - |
0xX | 0x1F600 | - |
\N{name} | \N{GRINNING FACE} | C++23, Python, Perl |
In the \uXXXX and %uXXXX formats, non-BMP characters are represented by two code units as UTF-16 surrogate pairs. In other formats, a character is represented by a single code point.
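The surrogate pair calculation can be sketched in Python as below. The function names and the `prefix` parameter are assumptions made for this example. The arithmetic itself follows the UTF-16 definition: subtract 0x10000 from the code point, then split the remaining 20 bits into a high and a low half.

```python
def to_utf16_code_units(ch: str) -> list:
    # BMP characters are a single code unit; others become a surrogate pair.
    cp = ord(ch)
    if cp <= 0xFFFF:
        return [cp]
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # top 10 bits
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits
    return [high, low]


def escape_utf16(text: str, prefix: str = "\\u") -> str:
    # One 4-digit escape per UTF-16 code unit (\uXXXX or %uXXXX style).
    return "".join(
        f"{prefix}{unit:04X}"
        for ch in text
        for unit in to_utf16_code_units(ch)
    )


def escape_code_point(text: str) -> str:
    # \U00XXXXXX style: one 8-digit escape per code point.
    return "".join(f"\\U{ord(ch):08X}" for ch in text)


print(escape_utf16("😀"))               # \uD83D\uDE00
print(escape_utf16("😀", prefix="%u"))  # %uD83D%uDE00
print(escape_code_point("😀"))          # \U0001F600
```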