Unicode System

Java was designed from the start to handle text in many languages. At the heart of that design is Unicode, a universal character encoding standard that assigns a unique number (called a code point) to every character, symbol, and emoji it covers.

What Is Unicode?

Before Unicode, different regions used different encoding systems — ASCII for English, ISO-8859 for Western European languages, Shift-JIS for Japanese, and so on. This made it a nightmare to write software that worked across languages.

Unicode solves that by giving each character a single, globally agreed-upon number. The current Unicode standard (version 15+) covers more than 149,000 characters across 161 scripts, including Latin, Arabic, Chinese, Devanagari, and even ancient scripts like Cuneiform.

Every character gets a code point, written as U+XXXX in hex notation. For example:

Character	Unicode Code Point	Description
`A`	U+0041	Latin capital letter A
`à`	U+00E0	Latin small letter a with grave
`中`	U+4E2D	CJK (Chinese) character
`😊`	U+1F60A	Smiling face emoji
`ह`	U+0939	Devanagari letter HA
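
To make the U+XXXX notation concrete, here is a minimal sketch that converts between numeric code points and that notation. The class name and the helpers `toNotation` and `fromNotation` are hypothetical names chosen for illustration; only `String.format`, `Integer.parseInt`, and `Character.toChars` come from the JDK.

```java
public class CodePointNotation {
    // Formats a code point as U+XXXX: at least four hex digits,
    // more when the value does not fit in four.
    static String toNotation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Parses the notation back to a number by reading the hex digits after "U+".
    static int fromNotation(String notation) {
        return Integer.parseInt(notation.substring(2), 16);
    }

    public static void main(String[] args) {
        // The five characters from the table, given as numeric code points
        int[] samples = {0x41, 0xE0, 0x4E2D, 0x1F60A, 0x0939};
        for (int cp : samples) {
            // Character.toChars turns a code point into one or two char values
            String text = new String(Character.toChars(cp));
            System.out.println(text + " -> " + toNotation(cp));
        }
        System.out.println(fromNotation("U+4E2D") == 0x4E2D); // true
    }
}
```

Whether the glyphs themselves display correctly depends on the console's encoding and font, so the notation column is the reliable part of the output.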

Java and Unicode: The `char` Type

Java’s char data type is a 16-bit unsigned integer that holds one UTF-16 code unit. A single char can therefore represent any code point from U+0000 to U+FFFF, a range called the Basic Multilingual Plane (BMP), except for U+D800 to U+DFFF, which is reserved for encoding characters beyond the BMP.

public class UnicodeBasics {
    public static void main(String[] args) {
        char letter = 'A';
        char heart  = '\u2764';     // ❤ written as a Unicode escape
        char devanagari = '\u0928'; // न (na)

        System.out.println(letter);      // A
        System.out.println(heart);       // ❤
        System.out.println(devanagari);  // न

        // char is just a number under the hood
        int codePoint = letter;
        System.out.println(codePoint);   // 65
    }
}

Output:

A
❤
न
65
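
Because char is an unsigned 16-bit number, ordinary integer arithmetic applies to it. The sketch below is illustrative only; `successor` is a hypothetical helper, not a JDK method.

```java
public class CharRange {
    // Returns the char that follows c. The cast keeps only the low 16 bits,
    // so the value wraps around after Character.MAX_VALUE.
    static char successor(char c) {
        return (char) (c + 1);
    }

    public static void main(String[] args) {
        // char is unsigned: its numeric range is 0 to 65535
        System.out.println((int) Character.MIN_VALUE); // 0
        System.out.println((int) Character.MAX_VALUE); // 65535

        System.out.println(successor('A'));                       // B
        System.out.println((int) successor(Character.MAX_VALUE)); // 0
    }
}
```

The explicit cast is required because `c + 1` is evaluated as an int.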

Unicode Escape Sequences

You can write any BMP character using the \uXXXX escape sequence (exactly four hex digits) anywhere in Java source code, including identifiers and string literals. A character beyond U+FFFF takes two such escapes in a row.

public class EscapeDemo {
    public static void main(String[] args) {
        String greeting = "\u0048\u0065\u006C\u006C\u006F"; // "Hello"
        System.out.println(greeting); // Hello

        // You can even use Unicode in variable names (not recommended, but valid!)
        int café = 5;
        System.out.println(café); // 5
    }
}

Output:

Hello
5

Tip: Unicode escapes are translated by the Java compiler before any other processing step, so \u0022 (a double quote) inside a string literal will actually terminate the string, and \u000A (a line feed) ends the line even inside a comment. Use \" and \n for those characters, and keep \uXXXX escapes for non-ASCII characters in string values.
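
The early translation of Unicode escapes can be observed from the resulting string lengths. This is a small sketch; the class and method names are made up for illustration.

```java
public class EscapeTranslation {
    // The compiler translates both escapes before it reads the literal,
    // so this is a two-character string: capital A, then e with acute accent.
    static String translated() {
        return "\u0041\u00E9";
    }

    // A doubled backslash is the ordinary string escape for one backslash,
    // so no Unicode escape is recognized here and six characters remain.
    static String notTranslated() {
        return "\\u0041";
    }

    // To embed a double quote, use the string escape rather than a Unicode escape.
    static String quoted() {
        return "say \"hi\"";
    }

    public static void main(String[] args) {
        System.out.println(translated().length());    // 2
        System.out.println(notTranslated().length()); // 6
        System.out.println(quoted());                 // say "hi"
    }
}
```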

Characters Beyond the BMP: Supplementary Characters

The BMP covers U+0000 to U+FFFF (65,536 code points), but the Unicode code space extends to U+10FFFF, for 1,114,112 code points in total. Characters above U+FFFF are called supplementary characters.

Since char is only 16 bits, a single char cannot hold a supplementary character. Java represents these using surrogate pairs: two consecutive char values (a high surrogate + a low surrogate) that together encode one supplementary code point.

public class SupplementaryDemo {
    public static void main(String[] args) {
        // 𝄞 (Musical G-Clef) is at U+1D11E — outside the BMP
        String clef = "\uD834\uDD1E"; // surrogate pair for 𝄞
        System.out.println(clef);                       // 𝄞
        System.out.println("char count : " + clef.length());         // 2
        System.out.println("code points: " + clef.codePointCount(0, clef.length())); // 1
    }
}

Output:

𝄞
char count : 2
code points: 1

Note: String.length() returns the number of char values, not the number of Unicode code points. Use codePointCount() when you need the code point count for strings that might contain supplementary characters.
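
The arithmetic behind a surrogate pair can be worked through by hand. The sketch below applies the standard UTF-16 formulas to U+1D11E and checks the result against the JDK's own `Character.highSurrogate`, `Character.lowSurrogate`, and `Character.toCodePoint`. The `encode` and `decode` helpers are hypothetical names, written out only to show the steps.

```java
public class SurrogateMath {
    // Splits a supplementary code point (above U+FFFF) into its
    // high and low surrogate using the UTF-16 arithmetic.
    static char[] encode(int codePoint) {
        int offset = codePoint - 0x10000;               // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] {high, low};
    }

    // Recombines a surrogate pair into the original code point.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int clef = 0x1D11E;
        char[] pair = encode(clef);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD1E

        // The JDK helpers agree with the manual arithmetic
        System.out.println(pair[0] == Character.highSurrogate(clef)); // true
        System.out.println(pair[1] == Character.lowSurrogate(clef));  // true
        System.out.println(decode(pair[0], pair[1]) == Character.toCodePoint(pair[0], pair[1])); // true
    }
}
```

In real code the JDK methods are the better choice; the manual version is here only to show where the two char values come from.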

The `Character` Class

The Wrapper Classes chapter covers boxing in general, but Character also packs rich Unicode utility methods:

public class CharacterMethods {
    public static void main(String[] args) {
        char ch = 'A';

        System.out.println(Character.isLetter(ch));       // true
        System.out.println(Character.isDigit(ch));        // false
        System.out.println(Character.isUpperCase(ch));    // true
        System.out.println(Character.toLowerCase(ch));    // a
        System.out.println(Character.getNumericValue('9')); // 9

        // Unicode category
        int type = Character.getType('A');
        System.out.println(type == Character.UPPERCASE_LETTER); // true

        // Code point version — works for supplementary characters too
        int cp = 0x1D11E; // 𝄞
        System.out.println(Character.isValidCodePoint(cp)); // true
        System.out.println(Character.charCount(cp));        // 2
    }
}

Output:

true
false
true
a
9
true
true
2
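
Most Character methods have an overload that takes an int code point, which is what makes them usable for supplementary characters. As a rough illustration, the sketch below counts letters by code point; `countLetters` is a hypothetical helper, and U+1D400 (MATHEMATICAL BOLD CAPITAL A) is picked only as an example of a letter outside the BMP.

```java
public class CodePointClassification {
    // Counts letters by code point, so a supplementary letter counts once
    // even though it occupies two char values.
    static long countLetters(String text) {
        return text.codePoints().filter(Character::isLetter).count();
    }

    public static void main(String[] args) {
        int boldA = 0x1D400; // a letter outside the BMP
        String text = "ab" + new String(Character.toChars(boldA)) + "1";

        System.out.println(text.length());              // 5 (char values)
        System.out.println(countLetters(text));         // 3 (a, b, bold A)
        System.out.println(Character.isLetter(boldA));  // true
        System.out.println(Character.getName(0x1D11E)); // MUSICAL SYMBOL G CLEF
    }
}
```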

Iterating Over Code Points

When processing text that might contain supplementary characters, iterate over code points rather than char values:

public class CodePointIteration {
    public static void main(String[] args) {
        String text = "Hi 𝄞!";

        System.out.println("By char (length = " + text.length() + "):");
        for (int i = 0; i < text.length(); i++) {
            System.out.print(text.charAt(i) + " ");
        }

        System.out.println("\nBy code point:");
        text.codePoints().forEach(cp ->
            System.out.print(new String(Character.toChars(cp)) + " ")
        );
    }
}

Output:

By char (length = 6):
H i   ? ? !
By code point:
H i   𝄞 !

Tip: The String.codePoints() stream (added in Java 8) is the cleanest way to iterate Unicode-aware code points in modern Java.
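
Where a stream is not convenient, the same traversal can be written as an index loop with `codePointAt` and `Character.charCount`. This is a sketch of that pattern; `codePointsOf` is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.List;

public class ManualCodePointLoop {
    // Walks the string by code point, advancing the index by
    // one or two char positions at a time.
    static List<Integer> codePointsOf(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Hi \uD834\uDD1E!"; // the clef written as two escapes
        for (int cp : codePointsOf(text)) {
            System.out.printf("U+%04X ", cp);
        }
        System.out.println();
        // prints: U+0048 U+0069 U+0020 U+1D11E U+0021
    }
}
```

The loop deliberately has no increment clause, since the step size depends on the code point just read.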

Unicode in Java Source Files

Java source files are expected to be encoded in UTF-8 by default since Java 18 (JEP 400). On earlier versions the default was platform-dependent, which caused “mojibake” bugs when source files with non-ASCII literals were compiled on a different OS.

When compiling with javac, you can explicitly specify the encoding:

javac -encoding UTF-8 MyClass.java

Warning: If you save a .java file with non-ASCII characters in an encoding other than UTF-8 (e.g., Windows-1252) and then compile without the -encoding flag, you’ll get garbled output or a compile error. Always save source files as UTF-8.
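
The same kind of mismatch can be reproduced at runtime by decoding bytes with the wrong charset. In this sketch ISO-8859-1 stands in for Windows-1252, because it is guaranteed to be available through `StandardCharsets`; `misdecode` is a hypothetical helper.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes text as UTF-8, then decodes the bytes with a different charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // the escape keeps this file pure ASCII
        String garbled = misdecode(original);

        System.out.println(original.length()); // 4
        System.out.println(garbled.length());  // 5: the accented letter became two chars

        // Decoding with the matching charset round-trips cleanly
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(original)); // true
    }
}
```

The two-byte UTF-8 encoding of é (C3 A9) is read back as two separate one-byte characters, which is the classic mojibake pattern.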

Under the Hood: How Java Stores Strings

Before Java 9, String stored its text internally as an array of char (UTF-16 encoded), at 2 bytes per char. That meant a 10-character ASCII string still took 20 bytes of character data.

Java 9 introduced Compact Strings (JEP 254). If a string contains only Latin-1 characters (U+0000–U+00FF), it is stored as a byte[] with one byte per character, cutting memory in half for typical English text. Strings are immutable, so nothing switches in place: any string that contains at least one non-Latin-1 character is created with the 2-byte-per-char UTF-16 layout.

The internal layout is not visible through the public API, as this example shows:

public class CompactStringHint {
    public static void main(String[] args) {
        // JVM stores this as 1 byte/char internally (Latin-1)
        String ascii = "Hello";

        // JVM stores this as 2 bytes/char (UTF-16)
        String mixed = "Hello 中";

        // Both behave identically from the API perspective
        System.out.println(ascii.length()); // 5
        System.out.println(mixed.length()); // 7
    }
}

This optimization is transparent to application code but has a real impact on memory usage in applications that process large volumes of ASCII text, such as JSON parsers or HTTP servers.
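
Since the internal layout cannot be inspected directly, the following sketch only approximates the decision from the outside: it checks whether every char of a string fits in the Latin-1 range described above. `fitsLatin1` is a hypothetical helper, not a JDK method, and the encoded byte counts are shown just to give a feel for the size difference.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Check {
    // True when every char fits in one byte (U+0000 to U+00FF).
    // This mirrors the condition described above; it does not inspect JVM internals.
    static boolean fitsLatin1(String text) {
        return text.chars().allMatch(c -> c <= 0xFF);
    }

    public static void main(String[] args) {
        String ascii = "Hello";
        String mixed = "Hello \u4E2D"; // U+4E2D is outside Latin-1

        System.out.println(fitsLatin1(ascii)); // true
        System.out.println(fitsLatin1(mixed)); // false

        // Encoded sizes for the two layouts: 1 byte per char versus 2 bytes per char
        System.out.println(ascii.getBytes(StandardCharsets.ISO_8859_1).length); // 5
        System.out.println(mixed.getBytes(StandardCharsets.UTF_16BE).length);   // 14
    }
}
```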

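Note: The String.chars() method returns an IntStream of char values, each widened to int without sign extension, because Java has no CharStream type. A supplementary character still appears as two surrogate values in that stream; use codePoints() when you need whole code points.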
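
A short sketch of how chars() and codePoints() differ on a supplementary character. The counting helpers are hypothetical and exist only to make the difference easy to check.

```java
public class CharsVersusCodePoints {
    static long charCount(String text) {
        return text.chars().count();
    }

    static long codePointCount(String text) {
        return text.codePoints().count();
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E as a surrogate pair

        // chars() yields each 16-bit char value widened to int
        clef.chars().forEach(c -> System.out.printf("%04X ", c)); // D834 DD1E
        System.out.println();

        // codePoints() joins the surrogate pair into one value
        clef.codePoints().forEach(cp -> System.out.printf("%X ", cp)); // 1D11E
        System.out.println();

        System.out.println(charCount(clef));      // 2
        System.out.println(codePointCount(clef)); // 1
    }
}
```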

Quick Reference: Common Unicode Ranges

Range	Script / Usage
U+0000–U+007F	Basic ASCII (English letters, digits, punctuation)
U+0080–U+00FF	Latin-1 Supplement (accented chars, £, ©, ®)
U+0900–U+097F	Devanagari (Hindi, Sanskrit)
U+4E00–U+9FFF	CJK Unified Ideographs (Chinese/Japanese/Korean)
U+1F600–U+1F64F	Emoticons (😊, 😂 …)
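
The JDK can report which block a code point belongs to through `Character.UnicodeBlock.of`. The sketch below looks up one sample code point from each range in the table; `blockOf` is a thin hypothetical wrapper added for readability.

```java
public class BlockLookup {
    // Returns the Unicode block containing the code point,
    // or null when the code point belongs to no defined block.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(blockOf(0x41)    == Character.UnicodeBlock.BASIC_LATIN);            // true
        System.out.println(blockOf(0xE9)    == Character.UnicodeBlock.LATIN_1_SUPPLEMENT);     // true
        System.out.println(blockOf(0x0939)  == Character.UnicodeBlock.DEVANAGARI);             // true
        System.out.println(blockOf(0x4E2D)  == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS); // true
        System.out.println(blockOf(0x1F60A) == Character.UnicodeBlock.EMOTICONS);              // true
    }
}
```

A lookup like this is usually more robust than hard-coding range boundaries, since the block definitions ship with the JDK's Unicode data.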

For internationalization, Java’s java.text and java.util.Locale APIs build on this Unicode foundation to handle locale-aware formatting, collation, and text segmentation.

Related Topics

Data Types — learn how char fits into Java’s primitive type system
Variables — declaring and using char variables in practice
Strings — how Java’s String class is built on top of Unicode char arrays
Wrapper Classes — the Character wrapper class and its full API
Internationalization — building locale-aware, multilingual Java applications
String Methods — methods like codePointAt(), codePoints(), and charAt() in action
