Unicode Escape



About Unicode escape sequences

A Unicode escape sequence represents a single character by its code point, written as a 4-digit hexadecimal number in the form \uXXXX. For example, "A" becomes "\u0041".

In addition to the \uXXXX format, DenCode also supports the following notation formats.

| Format | Conversion result of "ABC" | Description / Programming language |
|---|---|---|
| \uXXXX | \u0041\u0042\u0043 | Common Unicode escape sequences |
| %uXXXX | %u0041%u0042%u0043 | Percent-encoding (Non-standard) |
| U+XXXX | U+0041 U+0042 U+0043 | Unicode standard notation for code points (Space separated) |
| 0xX | 0x41 0x42 0x43 | Hexadecimal notation of code points (Space separated) |
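
The four notations listed for "ABC" can be reproduced with a few lines of Python. This is only an illustrative sketch, not DenCode's implementation: the function name `escape_formats` is hypothetical, and the code assumes the input contains only BMP characters (code points up to U+FFFF), so each character fits in four hexadecimal digits.

```python
def escape_formats(text):
    """Return the four notations for BMP-only text, keyed by format name.

    Hypothetical helper for illustration; non-BMP input is rejected because
    the \\uXXXX and %uXXXX forms would need surrogate pairs for it.
    """
    code_points = [ord(ch) for ch in text]
    if any(cp > 0xFFFF for cp in code_points):
        raise ValueError("this sketch handles BMP characters only")
    return {
        "\\uXXXX": "".join(f"\\u{cp:04X}" for cp in code_points),
        "%uXXXX": "".join(f"%u{cp:04X}" for cp in code_points),
        # U+XXXX and 0xX are space separated, one entry per character.
        "U+XXXX": " ".join(f"U+{cp:04X}" for cp in code_points),
        "0xX": " ".join(f"0x{cp:X}" for cp in code_points),
    }


if __name__ == "__main__":
    for name, value in escape_formats("ABC").items():
        print(f"{name:8} {value}")
```

Note that the 0xX form is not zero-padded, while the other three pad to four digits.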

Some of the formats above are described in RFC 5137 (ASCII Escaping of Unicode Characters), a Best Current Practice document, but no international standard defines them.

The %uXXXX format is supported by Microsoft IIS but is non-standard. In C#, System.Web.HttpUtility.UrlEncodeUnicode encodes to the %u format, but this method has been marked obsolete since .NET Framework 4.5.

Note that in the \X format, a single trailing space is treated as a delimiter and ignored when decoding, as specified by CSS. In the U+XXXX and 0xX formats, characters are separated by a single space when encoding and, as in the \X format, a single trailing space after each sequence is ignored when decoding.
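
The trailing-space rule can be illustrated for the U+XXXX format with a small decoder. This is a sketch of one possible reading of the rule, not DenCode's actual decoder: the name `decode_u_plus` and the choice to accept 4 to 6 hexadecimal digits are assumptions made for the example.

```python
import re

# One U+XXXX sequence followed by at most one space, which acts as a delimiter.
# Accepting 4 to 6 hex digits is an assumption of this sketch.
_U_PLUS = re.compile(r"U\+([0-9A-Fa-f]{4,6}) ?")


def _replace(match):
    code_point = int(match.group(1), 16)
    # Leave the text unchanged if the value is outside the Unicode range.
    if code_point > 0x10FFFF:
        return match.group(0)
    return chr(code_point)


def decode_u_plus(text):
    """Decode U+XXXX sequences, dropping one trailing space after each."""
    return _U_PLUS.sub(_replace, text)


if __name__ == "__main__":
    print(decode_u_plus("U+0041 U+0042 U+0043"))
    print(repr(decode_u_plus("U+0041  B")))
```

Because only one space is consumed per sequence, a literal space between two characters survives when the input contains two consecutive spaces.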

Escaping by Unicode Name

Escaping by Unicode character name is also supported as a form of Unicode escape sequence.

| Format | Conversion result of "A" | Programming language |
|---|---|---|
| \N{name} | \N{LATIN CAPITAL LETTER A} | C++23, Python, Perl |

Unicode names can be found at Names List Charts - Unicode or NamesList.txt - Unicode.
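
In Python, the standard `unicodedata` module maps between characters and their Unicode names, which is enough to sketch name-based escaping. The helper names `escape_by_name` and `unescape_name` are hypothetical, and the available names depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def escape_by_name(text):
    """Escape each character as \\N{name}.

    unicodedata.name raises ValueError for characters without a name
    (for example most control characters); this sketch lets that propagate.
    """
    return "".join("\\N{" + unicodedata.name(ch) + "}" for ch in text)


def unescape_name(name):
    """Return the character for a Unicode name such as 'GRINNING FACE'."""
    return unicodedata.lookup(name)


if __name__ == "__main__":
    print(escape_by_name("A"))
    print(unescape_name("GRINNING FACE"))
```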

Non-BMP characters in Unicode escape sequences

Non-BMP characters (code points above U+FFFF) do not fit in a 4-digit hexadecimal code point, so each programming language represents them with one of the following notation formats.

The result of converting "😀" (U+1F600), which is a Unicode non-BMP character, is as follows.

| Format | Conversion result of "😀" (U+1F600) | Programming language |
|---|---|---|
| \uXXXX | \uD83D\uDE00 | Java, Kotlin, Scala |
| \u{X} | \u{1F600} | C++23, Rust, Swift, JavaScript, PHP, Ruby, Dart, Lua |
| \U00XXXXXX | \U0001F600 | C, C++, Objective-C, C#, Go, Python, R |
| \N{name} | \N{GRINNING FACE} | C++23, Python, Perl |

In the \uXXXX and %uXXXX formats, non-BMP characters are represented by two code units as UTF-16 surrogate pairs. In other formats, a character is represented by a single code point.
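
The surrogate pair for a non-BMP code point follows from the UTF-16 definition: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves added to 0xD800 and 0xDC00. The sketch below applies that arithmetic to U+1F600; the function names are hypothetical and chosen only for this example.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low


def escape_utf16(text):
    """\\uXXXX form: non-BMP characters become two UTF-16 code units."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            high, low = to_surrogate_pair(cp)
            parts.append(f"\\u{high:04X}\\u{low:04X}")
        else:
            parts.append(f"\\u{cp:04X}")
    return "".join(parts)


def escape_braced(text):
    """\\u{X} form: one escape per code point, no padding."""
    return "".join(f"\\u{{{ord(ch):X}}}" for ch in text)


def escape_u8(text):
    """\\U00XXXXXX form: one escape per code point, padded to 8 digits."""
    return "".join(f"\\U{ord(ch):08X}" for ch in text)


if __name__ == "__main__":
    smile = "\U0001F600"
    print(escape_utf16(smile))
    print(escape_braced(smile))
    print(escape_u8(smile))
```

For U+1F600 the offset is 0xF600, whose upper 10 bits are 0x3D and lower 10 bits are 0x200, giving the pair D83D and DE00 shown in the table.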