
Unicode System

Java was designed from the start to handle text in any human language. At the heart of that design is Unicode, a universal character encoding standard that assigns a unique number (called a code point) to every character, symbol, and emoji it covers.

What Is Unicode?

Before Unicode, different regions used different encoding systems — ASCII for English, ISO-8859 for Western European languages, Shift-JIS for Japanese, and so on. This made it a nightmare to write software that worked across languages.

Unicode solves that by giving each character a single, globally agreed-upon number. As of version 15.0, the Unicode standard covers 149,186 characters across 161 scripts, including Latin, Arabic, Chinese, Devanagari, and even ancient scripts like Cuneiform.

Every character gets a code point, written as U+XXXX in hex notation. For example:

Character   Unicode Code Point   Description
A           U+0041               Latin capital letter A
à           U+00E0               Latin small letter a with grave
中          U+4E2D               CJK (Chinese) character
😊          U+1F60A              Smiling face emoji
ह           U+0939               Devanagari letter HA
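
As a small sketch of the notation, the class below prints the code points of the table's characters in U+XXXX form. The class and method names (CodePointNotation, toNotation) are made up for this illustration; String.format and String.codePoints() are standard library calls.

```java
class CodePointNotation {
    // Formats a code point in the conventional U+ notation,
    // padded to at least four hexadecimal digits.
    static String toNotation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // The four BMP characters from the table, written as escapes
        String sample = "A\u00E0\u4E2D\u0939";
        sample.codePoints().forEach(cp -> System.out.println(toNotation(cp)));

        // The emoji lies outside the BMP, so it needs five hex digits
        System.out.println(toNotation(0x1F60A)); // U+1F60A
    }
}
```

Code points above U+FFFF simply print with five or six hex digits, which matches how the Unicode standard writes them.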

Java and Unicode: The char Type

Java’s char data type is a 16-bit unsigned integer that holds one UTF-16 code unit. A single char can therefore represent any code point from U+0000 to U+FFFF, a range called the Basic Multilingual Plane (BMP), but nothing beyond it.

public class UnicodeBasics {
    public static void main(String[] args) {
        char letter = 'A';
        char heart  = '\u2764';  // ❤ written as a Unicode escape
        char devanagari = '\u0928'; // न (na)

        System.out.println(letter);      // A
        System.out.println(heart);       // ❤
        System.out.println(devanagari);  // न

        // char is just a number under the hood
        int codePoint = letter;
        System.out.println(codePoint);   // 65
    }
}

Output:

A
❤
न
65
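
Because char is an unsigned 16-bit number, arithmetic on it wraps at 65,535 rather than going negative. The following is a minimal sketch of that behavior; the CharRange class and its shift helper are hypothetical names used only here.

```java
class CharRange {
    // Adds an offset to a char; the cast back to char keeps only the low 16 bits.
    static char shift(char c, int offset) {
        return (char) (c + offset);
    }

    public static void main(String[] args) {
        System.out.println((int) Character.MIN_VALUE); // 0
        System.out.println((int) Character.MAX_VALUE); // 65535

        System.out.println(shift('A', 1));                       // B
        System.out.println((int) shift(Character.MAX_VALUE, 1)); // 0 (wraps around)
    }
}
```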

Unicode Escape Sequences

You can write any Unicode character using the \uXXXX escape sequence (exactly four hex digits) anywhere in Java source code — even in identifiers and string literals.

public class EscapeDemo {
    public static void main(String[] args) {
        String greeting = "\u0048\u0065\u006C\u006C\u006F"; // "Hello"
        System.out.println(greeting); // Hello

        // You can even use Unicode in variable names (not recommended, but valid!)
        int café = 5;
        System.out.println(café); // 5
    }
}

Output:

Hello
5

Tip: Unicode escapes are translated by the compiler before the source is split into tokens, so \u0022 (a double quote) inside a string literal will actually terminate the string. Write \" for a quote inside a string, and keep \uXXXX escapes for non-ASCII characters.
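
To see the early translation of escapes in action, this sketch compares an escaped literal with one where the backslash itself is escaped. The class and constant names are illustrative assumptions, not part of any API.

```java
class EscapeTranslation {
    // The compiler translates the escape first, so this literal holds four characters.
    static final String ESCAPED = "caf\u00E9";

    // A doubled backslash is an ordinary string escape for one backslash,
    // so the "u00E9" that follows stays as plain text: nine characters in total.
    static final String LITERAL = "caf\\u00E9";

    // A quote inside a string is written with the normal backslash-quote escape.
    static final String QUOTED = "say \"hi\"";

    public static void main(String[] args) {
        System.out.println(ESCAPED.length()); // 4
        System.out.println(LITERAL.length()); // 9
        System.out.println(QUOTED);           // say "hi"
    }
}
```

One practical consequence: a malformed Unicode escape is a compile error even inside a comment, because the translation happens before comments are recognized.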

Characters Beyond the BMP: Supplementary Characters

The BMP covers U+0000 to U+FFFF (65,536 code points). But the Unicode code space extends to U+10FFFF, for a total of 1,114,112 code points. Characters above U+FFFF are called supplementary characters.

Since char is only 16 bits, a single char cannot hold a supplementary character. Java represents these using surrogate pairs: two consecutive char values (a high surrogate + a low surrogate) that together encode one supplementary code point.

public class SupplementaryDemo {
    public static void main(String[] args) {
        // 𝄞 (Musical G-Clef) is at U+1D11E — outside the BMP
        String clef = "\uD834\uDD1E"; // surrogate pair for 𝄞
        System.out.println(clef);                       // 𝄞
        System.out.println("char count : " + clef.length());         // 2
        System.out.println("code points: " + clef.codePointCount(0, clef.length())); // 1
    }
}

Output:

𝄞
char count : 2
code points: 1

Note: String.length() returns the number of char values, not the number of code points. Use codePointCount() when you need the code point count for strings that might contain supplementary characters.
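
The surrogate pair for a supplementary code point follows from simple bit arithmetic defined by UTF-16. The sketch below does the split by hand and checks it against Character.highSurrogate, Character.lowSurrogate, and Character.toCodePoint; the SurrogateMath class and its encode method are hypothetical names for this example.

```java
class SurrogateMath {
    // Splits a supplementary code point into its UTF-16 high and low surrogates.
    static char[] encode(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // musical G clef
        char[] pair = encode(clef);

        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD1E
        System.out.println(pair[0] == Character.highSurrogate(clef));   // true
        System.out.println(pair[1] == Character.lowSurrogate(clef));    // true
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == clef); // true
    }
}
```

In real code the library methods are the better choice; the manual version is only there to show where the two char values come from.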

The Character Class

The Wrapper Classes chapter covers boxing in general, but Character also packs rich Unicode utility methods:

public class CharacterMethods {
    public static void main(String[] args) {
        char ch = 'A';

        System.out.println(Character.isLetter(ch));       // true
        System.out.println(Character.isDigit(ch));        // false
        System.out.println(Character.isUpperCase(ch));    // true
        System.out.println(Character.toLowerCase(ch));    // a
        System.out.println(Character.getNumericValue('9')); // 9

        // Unicode category
        int type = Character.getType('A');
        System.out.println(type == Character.UPPERCASE_LETTER); // true

        // Code point version — works for supplementary characters too
        int cp = 0x1D11E; // 𝄞
        System.out.println(Character.isValidCodePoint(cp)); // true
        System.out.println(Character.charCount(cp));        // 2
    }
}

Output:

true
false
true
a
9
true
true
2
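
Most Character methods also come in an int overload that accepts a full code point, which matters for letters outside the BMP. As a hedged illustration, the sketch below uses U+10400 (a capital letter from the Deseret script) and a made-up describe helper built on Character.getName and Character.UnicodeScript.of.

```java
class CodePointQueries {
    // Illustrative helper: code point notation, official name, and script.
    static String describe(int codePoint) {
        return String.format("U+%04X %s (%s)",
                codePoint,
                Character.getName(codePoint),
                Character.UnicodeScript.of(codePoint));
    }

    public static void main(String[] args) {
        int deseret = 0x10400; // a supplementary capital letter

        // The int overloads handle code points that do not fit in a char
        System.out.println(Character.isLetter(deseret));    // true
        System.out.println(Character.isUpperCase(deseret)); // true
        System.out.printf("U+%04X%n", Character.toLowerCase(deseret)); // U+10428

        System.out.println(describe(0x41));   // U+0041 LATIN CAPITAL LETTER A (LATIN)
        System.out.println(describe(0x0928)); // U+0928 DEVANAGARI LETTER NA (DEVANAGARI)
    }
}
```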

Iterating Over Code Points

When processing text that might contain supplementary characters, iterate over code points rather than char values:

public class CodePointIteration {
    public static void main(String[] args) {
        String text = "Hi 𝄞!";

        System.out.println("By char (length = " + text.length() + "):");
        for (int i = 0; i < text.length(); i++) {
            System.out.print(text.charAt(i) + " ");
        }

        System.out.println("\nBy code point:");
        text.codePoints().forEach(cp ->
            System.out.print(new String(Character.toChars(cp)) + " ")
        );
    }
}

Output:

By char (length = 6):
H i   ? ? !
By code point:
H i   𝄞 !

Tip: The String.codePoints() stream (available since Java 8) is the cleanest way to iterate over code points in modern Java.
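
When a stream is inconvenient (for example, when the loop needs the char index as well), the same traversal can be written as an index-based loop with codePointAt() and Character.charCount(). This is one possible sketch; CodePointLoop and codePointsOf are names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointLoop {
    // Collects the code points of a string with an explicit index loop.
    static List<Integer> codePointsOf(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);  // reads one or two char values
            result.add(cp);
            i += Character.charCount(cp);  // advance by 1 or 2 accordingly
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Hi \uD834\uDD1E!";
        for (int cp : codePointsOf(text)) {
            System.out.printf("U+%04X%n", cp);
        }
        // Prints U+0048, U+0069, U+0020, U+1D11E, U+0021 on separate lines
    }
}
```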

Unicode in Java Source Files

Java source files are expected to be encoded in UTF-8 by default since Java 18 (JEP 400). On earlier versions the default was platform-dependent, which caused “mojibake” bugs when source files with non-ASCII literals were compiled on a different OS.

When compiling with javac, you can explicitly specify the encoding:

javac -encoding UTF-8 MyClass.java

Warning: If you save a .java file with non-ASCII characters in an encoding other than UTF-8 (e.g., Windows-1252) and then compile without the -encoding flag, you’ll get garbled output or a compile error. Always save source files as UTF-8.
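
The same kind of mismatch can be reproduced at runtime by decoding UTF-8 bytes with the wrong charset. The sketch below uses ISO-8859-1 as a stand-in for Windows-1252 because every Java platform is guaranteed to provide it; for the two bytes involved here, both charsets decode to the same characters. MojibakeDemo and decode are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Decodes raw bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // "café"
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 5 (the accented letter takes two bytes)

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.length());         // 5 (one char per byte)
        // The two bytes 0xC3 0xA9 turn into two separate characters
        System.out.println(wrong.equals("caf\u00C3\u00A9")); // true
    }
}
```

Passing an explicit charset wherever bytes become text, as done here, avoids depending on the platform default.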

Under the Hood: How Java Stores Strings

Through Java 8, a String was backed internally by a char[] holding UTF-16 code units. Each char is 2 bytes, so a 10-character ASCII string needed 20 bytes of character data.

Java 9 introduced Compact Strings (JEP 254). If a string contains only Latin-1 characters (U+0000–U+00FF), it is stored as a byte[] with one byte per character, halving the character data for typical English text. Because strings are immutable, an existing string never changes layout; any string that contains at least one non-Latin-1 character, including the result of a concatenation, is created with the 2-byte-per-char UTF-16 layout.

The layout is an internal detail, so the public API behaves the same either way:

public class CompactStringHint {
    public static void main(String[] args) {
        // JVM stores this as 1 byte/char internally (Latin-1)
        String ascii = "Hello";

        // JVM stores this as 2 bytes/char (UTF-16)
        String mixed = "Hello 中";

        // Both behave identically from the API perspective
        System.out.println(ascii.length()); // 5
        System.out.println(mixed.length()); // 7
    }
}

This optimization is transparent to application code but has a real impact on memory usage in applications that process large volumes of ASCII text, such as JSON parsers or HTTP servers.
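
The JVM does not expose which layout a string uses, but the condition it depends on is easy to express. The helper below is a hypothetical approximation of that condition (every char at or below U+00FF), not the JVM's actual implementation.

```java
class Latin1Check {
    // Hypothetical helper: true if every char fits in one Latin-1 byte.
    // This mirrors the Compact Strings condition; it does not inspect the JVM.
    static boolean fitsInLatin1(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(fitsInLatin1("Hello"));        // true
        System.out.println(fitsInLatin1("caf\u00E9"));    // true  (U+00E9 is Latin-1)
        System.out.println(fitsInLatin1("Hello \u4E2D")); // false (U+4E2D needs UTF-16)
    }
}
```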

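Note: String.chars() returns an IntStream of char values (UTF-16 code units), each widened to int without sign extension; it is an IntStream only because Java has no CharStream. Surrogate pairs are passed through uncombined, so use codePoints() when you need whole code points.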
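
A quick way to see the difference between the two streams is to count their elements for a string holding a single supplementary character. The class and method names below are assumptions made for this sketch.

```java
class CharsVsCodePoints {
    // Number of UTF-16 code units in the string (same as length()).
    static long codeUnits(String s) {
        return s.chars().count();
    }

    // Number of Unicode code points in the string.
    static long codePoints(String s) {
        return s.codePoints().count();
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // one supplementary character

        System.out.println(codeUnits(clef));  // 2
        System.out.println(codePoints(clef)); // 1

        // chars() yields the two surrogates individually
        clef.chars().forEach(u -> System.out.printf("%04X ", u)); // D834 DD1E
        System.out.println();
    }
}
```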

Quick Reference: Common Unicode Ranges

Range             Script / Usage
U+0000–U+007F     Basic Latin (ASCII: English letters, digits, punctuation)
U+0080–U+00FF     Latin-1 Supplement (accented chars, £, ©, ®)
U+0900–U+097F     Devanagari (Hindi, Sanskrit)
U+4E00–U+9FFF     CJK Unified Ideographs (Chinese/Japanese/Korean)
U+1F600–U+1F64F   Emoticons (😊, 😂 …)
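
Rather than hard-coding ranges like these, code can ask the JDK which block a code point belongs to through Character.UnicodeBlock.of. The BlockLookup class and blockOf wrapper below are illustrative names; the sample code points are taken from the ranges in the table.

```java
class BlockLookup {
    // Thin wrapper around the JDK lookup, kept separate for readability.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        // One sample code point from each range in the table
        int[] samples = { 0x41, 0xE9, 0x0928, 0x4E2D, 0x1F60A };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s%n", cp, blockOf(cp));
        }

        System.out.println(blockOf(0x1F60A) == Character.UnicodeBlock.EMOTICONS); // true
    }
}
```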

For internationalization, Java’s java.text and java.util.Locale APIs build on this Unicode foundation to handle locale-aware formatting, collation, and text segmentation.
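
As one small taste of locale-sensitive behavior, case conversion itself depends on the locale: Turkish distinguishes a dotted and a dotless i. The sketch below shows this with String.toUpperCase(Locale) and toLowerCase(Locale); it covers only case mapping, not the formatting or collation APIs, and the class and method names are hypothetical.

```java
import java.util.Locale;

class LocaleCaseDemo {
    // Upper-cases text using the rules of the given locale.
    static String upper(String text, Locale locale) {
        return text.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Locale-neutral rules: i becomes I
        System.out.println(upper("title", Locale.ROOT)); // TITLE

        // Turkish rules: i becomes U+0130 (capital I with dot above)
        System.out.println(upper("title", turkish).equals("T\u0130TLE")); // true

        // And in the other direction, I becomes U+0131 (dotless i)
        System.out.println("TITLE".toLowerCase(turkish).equals("t\u0131tle")); // true
    }
}
```

For protocol keywords, file extensions, and other text that is not meant for human readers, passing Locale.ROOT keeps the result independent of the user's default locale.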

  • Data Types — learn how char fits into Java’s primitive type system
  • Variables — declaring and using char variables in practice
  • Strings — how Java’s String class is built on top of Unicode char arrays
  • Wrapper Classes — the Character wrapper class and its full API
  • Internationalization — building locale-aware, multilingual Java applications
  • String Methods — methods like codePointAt(), codePoints(), and charAt() in action
Last updated June 13, 2026