UTF-8 Encoder & Decoder — Convert Text to UTF-8 Bytes Online

Convert text to UTF-8 byte representation in hex, decimal, binary, or percent-encoded format. Decode UTF-8 byte sequences back to readable text. See character count, byte count, and encoding details.
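The same conversions can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's own implementation; the helper names `format_utf8` and `decode_hex` are made up for this example, and the percent-encoded form here leaves unreserved ASCII characters (letters, digits, `-._~`) unescaped, which may differ from the tool's output.

```python
from urllib.parse import quote


def format_utf8(text: str) -> dict[str, str]:
    """Return the UTF-8 bytes of `text` in four display formats."""
    data = text.encode("utf-8")
    return {
        "hex": " ".join(f"{b:02X}" for b in data),
        "decimal": " ".join(str(b) for b in data),
        "binary": " ".join(f"{b:08b}" for b in data),
        # quote() encodes the string as UTF-8 before percent-escaping it.
        "percent": quote(text, safe=""),
    }


def decode_hex(hex_bytes: str) -> str:
    """Decode space-separated hex bytes back to text (strict UTF-8)."""
    return bytes.fromhex(hex_bytes).decode("utf-8")


formats = format_utf8("Aé")
print(formats["hex"])          # 41 C3 A9
print(formats["percent"])      # A%C3%A9
print(decode_hex("41 C3 A9"))  # Aé
```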

How UTF-8 encoding works

UTF-8 is the dominant character encoding for the web, used by over 98% of websites. It encodes each Unicode code point in one to four bytes, which keeps it backward-compatible with ASCII while covering every character in the Unicode standard, including emoji, CJK characters, and mathematical symbols.

ASCII characters (U+0000 to U+007F) use a single byte, identical to their ASCII values. Code points outside this range use two to four bytes: the high bits of the first byte give the length of the sequence, and every following (continuation) byte starts with the bits 10. This variable-length encoding keeps English text compact while supporting all world scripts.

UTF-8 byte ranges

  • 1 byte (0xxxxxxx): ASCII characters U+0000–U+007F (A-Z, 0-9, basic punctuation)
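  • 2 bytes (110xxxxx 10xxxxxx): accented Latin, Greek, Cyrillic, Arabic, Hebrew U+0080–U+07FF
  • 3 bytes (1110xxxx 10xxxxxx 10xxxxxx): most CJK, many symbols, a few older emoji U+0800–U+FFFF (excluding the surrogate range U+D800–U+DFFF, which is not valid in UTF-8)
  • 4 bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx): most emoji, rare CJK ideographs, historic scripts U+10000–U+10FFFF (a flag emoji is a pair of such code points, so it takes 8 bytes)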
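The bit layouts above translate directly into shifts and masks. The following is a minimal sketch of a hand-written encoder for a single code point, checked against Python's built-in `str.encode`; the function name `encode_code_point` is hypothetical and exists only for illustration.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using the bit layouts above."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp!r}")
    if cp <= 0x7F:      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare against the built-in encoder for one character of each length.
for ch in "Aé€😀":
    assert encode_code_point(ord(ch)) == ch.encode("utf-8")
```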

Common use cases

  • Debug encoding issues when text appears garbled (mojibake)
  • Inspect byte-level representation for network protocols
  • Verify correct encoding in databases and file systems

FAQs

What is the difference between UTF-8 and Unicode?

Unicode is a character set that assigns a unique number (code point) to every character. UTF-8 is an encoding that defines how those code points are stored as bytes. Unicode defines what characters exist; UTF-8 defines how to represent them in binary.
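To make the distinction concrete, this short sketch prints the code point (the Unicode side) next to the encoded bytes (the UTF-8 side) for a few characters. The helper name `describe` is an assumption made for this example.

```python
def describe(ch: str) -> tuple[str, str, int]:
    """Return (code point, UTF-8 bytes in hex, byte count) for one character."""
    encoded = ch.encode("utf-8")
    return f"U+{ord(ch):04X}", encoded.hex(" ").upper(), len(encoded)


for ch in ["A", "é", "€", "😀"]:
    code_point, hex_bytes, size = describe(ch)
    print(f"{ch}  {code_point}  ->  {hex_bytes}  ({size} bytes)")
```

Each character here is a single code point, yet the byte count ranges from 1 to 4, which is why a character count and a byte count can differ for the same text.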

Why do some characters use more bytes than others in UTF-8?

UTF-8 uses variable-length encoding for efficiency. ASCII characters (the most common in English) use just 1 byte, keeping such text compact. Code points above U+007F use 2-4 bytes, with higher code points needing more bytes. This design makes UTF-8 backward-compatible with ASCII while supporting all Unicode characters.

How can I tell if text is UTF-8 encoded?

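Look at the byte patterns: UTF-8 multi-byte sequences always start with a lead byte whose high bits are 110, 1110, or 11110, followed by one, two, or three continuation bytes starting with 10. If every byte above 0x7F fits these patterns, the text is likely UTF-8. An invalid sequence suggests a different encoding or corrupted data. Pure ASCII text is a special case: it is valid UTF-8 but also valid in many other encodings, so the bytes alone cannot settle the question.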
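In practice, the simplest check is to attempt a strict decode and see whether it fails. The sketch below assumes Python's built-in decoder, which also rejects overlong forms and encoded surrogates; `is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # error handling is 'strict' by default
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"caf\xc3\xa9"))  # True: C3 A9 is a valid 2-byte sequence
print(is_valid_utf8(b"caf\xe9"))      # False: E9 is a lead byte with no continuation bytes
```

A passing check means the bytes are well-formed UTF-8, not that UTF-8 was the encoding the author intended.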

What causes garbled text (mojibake) and how do I fix it?

Mojibake occurs when text encoded in one encoding (e.g., UTF-8) is decoded using a different one (e.g., Latin-1). For example, "é" stored as the UTF-8 bytes C3 A9 shows up as "Ã©" when read as Latin-1. To fix it, identify the original encoding by examining the byte sequence, then decode the original bytes with the correct encoding. This tool helps you inspect bytes to diagnose encoding issues.
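The round trip can be reproduced and reversed in code. This is a minimal sketch that assumes the wrong decoder was Latin-1; real-world cases often involve Windows-1252 or several layers of mis-decoding, and the repair only works if no bytes were lost along the way. The function names `garble` and `repair` are illustrative.

```python
def garble(text: str) -> str:
    """Simulate mojibake: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo the misreading: recover the original bytes, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


broken = garble("café")
print(broken)          # cafÃ©
print(repair(broken))  # café
```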
