csvjson

Unicode Converter

Type or paste any text to inspect each character's codepoint, UTF-8 bytes, UTF-16 encoding, HTML entities, JavaScript escape, CSS escape, and URL encoding. Click any character to see all representations.

Encoding representations

Unicode codepoint
U+1F44B

The canonical identifier for a character in the Unicode standard. Written as U+ followed by a hex number. There are 1,114,112 possible codepoints (U+0000 to U+10FFFF).
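As a rough illustration, the U+ notation can be produced from a string in JavaScript. This is a minimal sketch; `toCodepointLabel` is a hypothetical helper name, not part of any library, and it only inspects the first character of its input.

```javascript
// Format the first codepoint of a string in U+XXXX notation.
// `toCodepointLabel` is a hypothetical helper used only for illustration.
function toCodepointLabel(text) {
  // codePointAt reads a whole codepoint, including those above U+FFFF.
  const cp = text.codePointAt(0);
  if (cp === undefined) {
    throw new RangeError("expected a non-empty string");
  }
  // By convention the hex number is uppercase and at least four digits long.
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

console.log(toCodepointLabel("\u{1F44B}")); // U+1F44B
console.log(toCodepointLabel("A"));         // U+0041
```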

UTF-8 bytes
0xF0 0x9F 0x91 0x8B

The variable-length byte encoding. ASCII characters use 1 byte; accented Latin, Cyrillic, Arabic, and Hebrew characters use 2; CJK and most other BMP scripts use 3; most emoji and all supplementary characters use 4 bytes.
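One way to check these byte counts is the standard `TextEncoder` API, which is available in modern browsers and Node.js and always encodes to UTF-8. The sketch below assumes that environment; `utf8Bytes` is a hypothetical helper name.

```javascript
// List the UTF-8 bytes of a string in 0xNN form.
// `utf8Bytes` is a hypothetical helper; TextEncoder is a standard global.
function utf8Bytes(text) {
  const bytes = new TextEncoder().encode(text); // Uint8Array of UTF-8 bytes
  // Array.from with a mapping function returns a plain array of strings.
  return Array.from(bytes, (b) => "0x" + b.toString(16).toUpperCase().padStart(2, "0"));
}

console.log(utf8Bytes("\u{1F44B}").join(" ")); // 0xF0 0x9F 0x91 0x8B
console.log(utf8Bytes("A").length);            // 1 (ASCII)
console.log(utf8Bytes("\u00E9").length);       // 2 (é, U+00E9)
console.log(utf8Bytes("\u4E2D").length);       // 3 (中, U+4E2D)
```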

UTF-16 code units
0xD83D 0xDC4B

The encoding used by JavaScript, Java, and Windows internally. Characters above U+FFFF require two 16-bit code units called a surrogate pair (high surrogate + low surrogate).
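The surrogate pair for a supplementary codepoint can be derived with a little arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A minimal sketch, where `toSurrogatePair` is a hypothetical helper name:

```javascript
// Split a supplementary codepoint into its UTF-16 surrogate pair.
// `toSurrogatePair` is a hypothetical helper used only for illustration.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError("only codepoints above U+FFFF need a surrogate pair");
  }
  const offset = cp - 0x10000;            // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [hi, lo] = toSurrogatePair(0x1F44B);
console.log(hi.toString(16), lo.toString(16)); // d83d dc4b
```

Passing both code units to `String.fromCharCode(hi, lo)` yields the original character again.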

HTML entities
&#128075; or &#x1F44B;

HTML numeric character references in decimal (&#128075;) or hexadecimal (&#x1F44B;) form. Safe to use in HTML even when the character can't be typed directly.
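Both forms are just the codepoint written in a different base between `&#` and `;`. A small sketch of how they might be generated; `htmlNumericRefs` is a hypothetical helper name and it only looks at the first character of its input.

```javascript
// Build the decimal and hexadecimal numeric character references
// for the first codepoint of a string.
// `htmlNumericRefs` is a hypothetical helper used only for illustration.
function htmlNumericRefs(text) {
  const cp = text.codePointAt(0);
  if (cp === undefined) {
    throw new RangeError("expected a non-empty string");
  }
  return {
    decimal: "&#" + cp.toString(10) + ";",
    hex: "&#x" + cp.toString(16).toUpperCase() + ";",
  };
}

console.log(htmlNumericRefs("\u{1F44B}"));
// { decimal: '&#128075;', hex: '&#x1F44B;' }
```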

JavaScript / JSON escape
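\u{1F44B} (JavaScript) or \uD83D\uDC4B (JSON)

For characters in the BMP (U+0000–U+FFFF): \uXXXX. For supplementary plane characters, JavaScript (ES6+) accepts \u{XXXXX}; JSON has no braced form, so the character is written as its surrogate pair in two \uXXXX escapes. Escapes are optional when the source file is saved as UTF-8, but they keep string literals ASCII-only and make invisible or look-alike characters explicit.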
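The two escape styles can be sketched as follows. JSON only defines the four-digit `\uXXXX` form, so the JSON variant writes supplementary characters as a surrogate pair. The names `hex4`, `jsEscape`, and `jsonEscape` are hypothetical helpers, not library functions.

```javascript
// Four uppercase hex digits, zero-padded (hypothetical helper).
function hex4(n) {
  return n.toString(16).toUpperCase().padStart(4, "0");
}

// JavaScript escape: \uXXXX in the BMP, \u{X...} above it (ES6+).
function jsEscape(cp) {
  if (cp <= 0xFFFF) return "\\u" + hex4(cp);
  return "\\u{" + cp.toString(16).toUpperCase() + "}";
}

// JSON escape: always \uXXXX, using a surrogate pair above U+FFFF.
function jsonEscape(cp) {
  if (cp <= 0xFFFF) return "\\u" + hex4(cp);
  const offset = cp - 0x10000;
  const high = 0xD800 + (offset >> 10);
  const low = 0xDC00 + (offset & 0x3FF);
  return "\\u" + hex4(high) + "\\u" + hex4(low);
}

console.log(jsEscape(0x1F44B));   // \u{1F44B}
console.log(jsonEscape(0x1F44B)); // \uD83D\uDC4B
// The JSON form parses back to the original character.
console.log(JSON.parse('"' + jsonEscape(0x1F44B) + '"') === "\u{1F44B}"); // true
```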

CSS escape
\1F44B

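Used in CSS content property values and selectors. A backslash followed by the hex codepoint (one to six hex digits). Useful when inserting characters via CSS ::before / ::after without depending on the stylesheet's file encoding. If the character that follows could be read as another hex digit, end the escape with a single space or pad it to six digits.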
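A sketch of generating this escape in JavaScript. The helper name `cssHexEscape` is hypothetical (it is unrelated to the browser's built-in `CSS.escape`), and appending a terminating space is a conservative choice made here so that a following hex digit is never absorbed into the escape.

```javascript
// Build a CSS hex escape for a codepoint.
// `cssHexEscape` is a hypothetical helper used only for illustration.
function cssHexEscape(cp) {
  if (cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("codepoint out of range");
  }
  // A single trailing space ends the escape and is not part of the value.
  return "\\" + cp.toString(16).toUpperCase() + " ";
}

console.log(JSON.stringify(cssHexEscape(0x1F44B))); // "\\1F44B "
console.log('content: "' + cssHexEscape(0x1F44B) + '";');
```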

UTF-8 byte count by codepoint range

How many bytes each character takes in UTF-8 storage

| Range | Bytes | Examples |
|---|---|---|
| U+0000 – U+007F | 1 | ASCII: A, 0, space, !… |
| U+0080 – U+07FF | 2 | Latin Extended, Cyrillic, Arabic, Hebrew |
| U+0800 – U+FFFF | 3 | Devanagari, CJK, emoji in BMP |
| U+10000 – U+10FFFF | 4 | Emoji 🎉, supplementary CJK, historic scripts |
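The ranges in the table translate directly into a lookup by codepoint. A minimal sketch; `utf8ByteCount` and `utf8StringByteCount` are hypothetical helper names.

```javascript
// Number of UTF-8 bytes needed for one codepoint, following the table above.
// Note: surrogate codepoints (U+D800–U+DFFF) fall numerically in the 3-byte
// range but are not valid in well-formed UTF-8; this sketch does not check.
function utf8ByteCount(cp) {
  if (cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("codepoint out of range");
  }
  if (cp <= 0x7F) return 1;
  if (cp <= 0x7FF) return 2;
  if (cp <= 0xFFFF) return 3;
  return 4;
}

// Total UTF-8 size of a string; for...of iterates by codepoint.
function utf8StringByteCount(text) {
  let total = 0;
  for (const ch of text) {
    total += utf8ByteCount(ch.codePointAt(0));
  }
  return total;
}

console.log(utf8ByteCount(0x41));    // 1
console.log(utf8ByteCount(0x1F44B)); // 4
console.log(utf8StringByteCount("A\u00E9\u4E2D\u{1F44B}")); // 10
```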

Frequently asked questions

What is the difference between Unicode, UTF-8, and UTF-16?

Unicode is the standard that assigns a number (codepoint) to every character from every writing system. UTF-8 and UTF-16 are encoding schemes that represent those codepoints as bytes for storage and transmission. UTF-8 uses 1–4 bytes per character and is ASCII-compatible — it dominates the web. UTF-16 uses 2 or 4 bytes and is used internally by JavaScript, Java, and Windows.

Why do emoji need 4 bytes in UTF-8?

UTF-8 uses 1 byte for U+0000–U+007F (ASCII), 2 for U+0080–U+07FF, 3 for U+0800–U+FFFF, and 4 for U+10000–U+10FFFF. A 3-byte sequence carries only 16 bits of codepoint data, so it tops out at U+FFFF. Most emoji live in the Supplementary Multilingual Plane (Plane 1, U+10000–U+1FFFF), which lies above that limit, so they require 4 bytes. In UTF-16, supplementary plane characters require a surrogate pair: two 16-bit code units.
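To make the 4-byte case concrete: the lead byte has the bit pattern 11110xxx and each of the three continuation bytes has 10xxxxxx, giving 3 + 6 + 6 + 6 = 21 bits of payload. The sketch below packs a supplementary codepoint by hand; `encodeSupplementaryUtf8` is a hypothetical helper name.

```javascript
// Hand-encode a supplementary codepoint (U+10000–U+10FFFF) as 4 UTF-8 bytes.
// `encodeSupplementaryUtf8` is a hypothetical helper used only for illustration.
function encodeSupplementaryUtf8(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError("expected a codepoint above U+FFFF");
  }
  return [
    0xF0 | (cp >> 18),           // 11110xxx: top 3 bits
    0x80 | ((cp >> 12) & 0x3F),  // 10xxxxxx: next 6 bits
    0x80 | ((cp >> 6) & 0x3F),   // 10xxxxxx: next 6 bits
    0x80 | (cp & 0x3F),          // 10xxxxxx: lowest 6 bits
  ];
}

const waveBytes = encodeSupplementaryUtf8(0x1F44B);
console.log(waveBytes.map((b) => b.toString(16).toUpperCase()).join(" "));
// F0 9F 91 8B
```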

What is a surrogate pair in UTF-16?

UTF-16 encodes characters above U+FFFF as two 16-bit code units: a high surrogate (U+D800–U+DBFF) and a low surrogate (U+DC00–U+DFFF). Together they encode a single codepoint. JavaScript strings are UTF-16 internally, so a single supplementary-plane emoji such as 👋 has .length === 2 (emoji built from several codepoints are longer still). Use Array.from() or the spread operator to iterate by codepoints rather than code units.
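The difference between code units and codepoints is easy to see in a console. The snippet below is a small sketch; `fromSurrogatePair` is a hypothetical helper that reverses the pairing arithmetic.

```javascript
const wave = "\u{1F44B}"; // the waving hand emoji, U+1F44B

console.log(wave.length);             // 2 (UTF-16 code units)
console.log(Array.from(wave).length); // 1 (codepoints)
console.log([...wave].length);        // 1 (codepoints)

// charCodeAt sees individual code units; codePointAt joins the pair.
console.log(wave.charCodeAt(0).toString(16));  // d83d
console.log(wave.charCodeAt(1).toString(16));  // dc4b
console.log(wave.codePointAt(0).toString(16)); // 1f44b

// Recombine a high and low surrogate into the original codepoint.
// `fromSurrogatePair` is a hypothetical helper used only for illustration.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

console.log(fromSurrogatePair(0xD83D, 0xDC4B).toString(16)); // 1f44b
```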

What is the Unicode BMP (Basic Multilingual Plane)?

The BMP is the first 65,536 codepoints (U+0000–U+FFFF). It covers most modern scripts: Latin, Cyrillic, Arabic, Hebrew, Devanagari, CJK, and many more. Characters outside the BMP (emoji, historic scripts, supplementary CJK) are in the supplementary planes (U+10000–U+10FFFF) and require extra handling in UTF-16.
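Because each plane holds exactly 65,536 codepoints, the plane number is simply the codepoint shifted right by 16 bits. A brief sketch, with `planeOf` and `isBmp` as hypothetical helper names:

```javascript
// Plane number of a codepoint: 0 is the BMP, 1 through 16 are supplementary.
// `planeOf` and `isBmp` are hypothetical helpers used only for illustration.
function planeOf(cp) {
  if (cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("codepoint out of range");
  }
  return cp >> 16;
}

function isBmp(cp) {
  return planeOf(cp) === 0;
}

console.log(planeOf(0x0041));   // 0 (Latin letter A)
console.log(planeOf(0x1F44B));  // 1 (emoji)
console.log(planeOf(0x10FFFF)); // 16 (last plane)
console.log(isBmp(0x4E2D));     // true (CJK ideograph)
```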

How do I use a Unicode character in HTML?

Three options: (1) Type the character directly if your editor and file encoding support it. (2) Use a decimal numeric entity: &#128075; (3) Use a hex numeric entity: &#x1F44B; All three are equivalent in HTML. Named entities like &amp; or &lt; only exist for the most common characters.