Unicode System
Java was designed from the start to handle text in any written language. At the heart of that design is Unicode, a universal character encoding standard that assigns a unique number (called a code point) to every character it covers, from letters and symbols to emoji.
What Is Unicode?
Before Unicode, different regions used different encoding systems — ASCII for English, ISO-8859 for Western European languages, Shift-JIS for Japanese, and so on. This made it a nightmare to write software that worked across languages.
Unicode solves that by giving each character a single, globally agreed-upon number. The current Unicode standard (version 15+) covers more than 149,000 characters across 161 scripts, including Latin, Arabic, Chinese, Devanagari, and even ancient scripts like Cuneiform.
Every character gets a code point, written as U+XXXX in hex notation. For example:
| Character | Unicode Code Point | Description |
|---|---|---|
| A | U+0041 | Latin capital letter A |
| à | U+00E0 | Latin small letter a with grave |
| 中 | U+4E2D | CJK (Chinese) character |
| 😊 | U+1F60A | Smiling face emoji |
| ह | U+0939 | Devanagari letter HA |
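To see the U+XXXX notation produced by a program, here is a minimal sketch that formats the code points from the table. The class name CodePointNotation and the helper toNotation are made up for illustration; the sample string uses escapes so the file compiles regardless of source encoding.

```java
class CodePointNotation {
    // Hypothetical helper: formats a code point in U+XXXX notation
    // (at least four hex digits, more for supplementary characters).
    static String toNotation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // A, à, 中, 😊 (as a surrogate pair), ह
        String sample = "A\u00E0\u4E2D\uD83D\uDE0A\u0939";
        sample.codePoints().forEach(cp ->
            System.out.println(new String(Character.toChars(cp)) + " -> " + toNotation(cp)));
    }
}
```

How the characters themselves appear depends on your console's encoding; the U+XXXX part is plain ASCII and prints the same everywhere.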
Java and Unicode: The char Type
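Java’s char data type is a 16-bit unsigned integer that stores one UTF-16 code unit. A single char can therefore represent any code point from U+0000 to U+FFFF, a range called the Basic Multilingual Plane (BMP). The one exception is U+D800 to U+DFFF, which is reserved for the surrogates described later and never stands for a character on its own.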
public class UnicodeBasics {
    public static void main(String[] args) {
        char letter = 'A';
        char heart = '\u2764';      // ❤ written as a Unicode escape
        char devanagari = '\u0928'; // न (na)
        System.out.println(letter);     // A
        System.out.println(heart);      // ❤
        System.out.println(devanagari); // न
        // char is just a number under the hood
        int codePoint = letter;
        System.out.println(codePoint);  // 65
    }
}
Output:
A
❤
न
65
Unicode Escape Sequences
You can write any Unicode character using the \uXXXX escape sequence (exactly four hex digits) anywhere in Java source code — even in identifiers and string literals.
public class EscapeDemo {
    public static void main(String[] args) {
        String greeting = "\u0048\u0065\u006C\u006C\u006F"; // "Hello"
        System.out.println(greeting); // Hello
        // You can even use Unicode in variable names (not recommended, but valid!)
        int caf\u00E9 = 5;             // declares the identifier café
        System.out.println(caf\u00E9); // 5
    }
}
Output:
Hello
5
Tip: The compiler translates Unicode escapes before it tokenizes the source, so \u0022 (a double quote) inside a string literal actually terminates the string. Write \" for a literal quote, and reserve \uXXXX escapes for non-ASCII characters in string values.
Characters Beyond the BMP: Supplementary Characters
The BMP covers U+0000 to U+FFFF (65,536 code points). But Unicode has grown beyond that to U+10FFFF — that’s over 1 million code points. Characters above U+FFFF are called supplementary characters.
Since char is only 16 bits, a single char cannot hold a supplementary character. Java represents these using surrogate pairs: two consecutive char values (a high surrogate + a low surrogate) that together encode one supplementary code point.
public class SupplementaryDemo {
    public static void main(String[] args) {
        // 𝄞 (Musical G-Clef) is at U+1D11E, outside the BMP
        String clef = "\uD834\uDD1E"; // surrogate pair
        System.out.println(clef); // 𝄞
        System.out.println("char count : " + clef.length()); // 2
        System.out.println("code points: " + clef.codePointCount(0, clef.length())); // 1
    }
}
Output:
𝄞
char count : 2
code points: 1
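The surrogate pair for a code point can also be derived by hand, which makes the encoding less mysterious. The sketch below applies the standard UTF-16 arithmetic (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves) and checks the result against the JDK's Character helpers. The class name SurrogateMath and the method toSurrogates are made up for illustration.

```java
class SurrogateMath {
    // Hypothetical helper: splits a supplementary code point
    // (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20 significant bits remain
        char high = (char) (0xD800 + (offset >>> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int clef = 0x1D11E;
        char[] pair = toSurrogates(clef);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D834 DD1E

        // The JDK's own helpers should agree with the manual arithmetic
        System.out.println(pair[0] == Character.highSurrogate(clef));        // true
        System.out.println(pair[1] == Character.lowSurrogate(clef));         // true
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == clef); // true
    }
}
```

In real code, prefer Character.toChars(codePoint) over hand-rolled arithmetic; the manual version is here only to show what the library does.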
Note: String.length() returns the number of char values, not the number of Unicode characters. Use codePointCount() when you need the number of code points in a string that might contain supplementary characters.
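One place where the difference bites is truncation: cutting a string at a char index can split a surrogate pair. A minimal sketch of a code-point-aware truncation follows. The class SafeTruncate and its truncate method are hypothetical names, and the sketch assumes maxCodePoints is non-negative.

```java
class SafeTruncate {
    // Hypothetical helper: returns at most maxCodePoints code points from the
    // start of text, never cutting a surrogate pair in half.
    static String truncate(String text, int maxCodePoints) {
        if (text.codePointCount(0, text.length()) <= maxCodePoints) {
            return text;
        }
        // Translate a code point count into a char index
        int end = text.offsetByCodePoints(0, maxCodePoints);
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String text = "a\uD834\uDD1Eb"; // a, U+1D11E, b

        // Naive cut: two chars, the second being a lone high surrogate
        System.out.println(text.substring(0, 2).length()); // 2

        // Code-point-aware cut: 'a' plus the complete pair
        System.out.println(truncate(text, 2).length());    // 3
    }
}
```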
The Character Class
The Wrapper Classes chapter covers boxing in general, but Character also packs rich Unicode utility methods:
public class CharacterMethods {
    public static void main(String[] args) {
        char ch = 'A';
        System.out.println(Character.isLetter(ch));          // true
        System.out.println(Character.isDigit(ch));           // false
        System.out.println(Character.isUpperCase(ch));       // true
        System.out.println(Character.toLowerCase(ch));       // a
        System.out.println(Character.getNumericValue('9'));  // 9
        // Unicode category
        int type = Character.getType('A');
        System.out.println(type == Character.UPPERCASE_LETTER); // true
        // Code point overloads work for supplementary characters too
        int cp = 0x1D11E; // 𝄞
        System.out.println(Character.isValidCodePoint(cp)); // true
        System.out.println(Character.charCount(cp));        // 2
    }
}
Output:
true
false
true
a
9
true
true
2
Iterating Over Code Points
When processing text that might contain supplementary characters, iterate over code points rather than char values:
public class CodePointIteration {
    public static void main(String[] args) {
        String text = "Hi 𝄞!";
        System.out.println("By char (length = " + text.length() + "):");
        for (int i = 0; i < text.length(); i++) {
            System.out.print(text.charAt(i) + " ");
        }
        System.out.println("\nBy code point:");
        text.codePoints().forEach(cp ->
            System.out.print(new String(Character.toChars(cp)) + " ")
        );
    }
}
Output:
By char (length = 6):
H i ? ? !
By code point:
H i 𝄞 !
Tip: The String.codePoints() stream (available since Java 8) is the cleanest way to iterate over code points in modern Java.
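When you also need each code point's position in the string, an index-based loop is a common alternative to the stream. The sketch below advances by Character.charCount(cp) so that a surrogate pair is consumed in one step. The class CodePointLoop and the method describe are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointLoop {
    // Hypothetical helper: collects "charIndex:U+XXXX" for every code point in text.
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.add(i + ":" + String.format("U+%04X", cp));
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return out;
    }

    public static void main(String[] args) {
        // "Hi 𝄞!" with the clef written as a surrogate pair
        System.out.println(describe("Hi \uD834\uDD1E!"));
        // [0:U+0048, 1:U+0069, 2:U+0020, 3:U+1D11E, 5:U+0021]
    }
}
```

Note how the index jumps from 3 to 5: the clef occupies two char positions.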
Unicode in Java Source Files
Since Java 18 (JEP 400), javac reads source files as UTF-8 by default. On earlier versions the default was the platform's charset, which caused “mojibake” bugs when source files with non-ASCII literals were compiled on a different OS.
When compiling with javac, you can explicitly specify the encoding:
javac -encoding UTF-8 MyClass.java
Warning: If you save a .java file with non-ASCII characters in an encoding other than the one javac expects (e.g., Windows-1252 when javac assumes UTF-8) and compile without a matching -encoding flag, you’ll get garbled string literals or a compile error. Always save source files as UTF-8.
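The same kind of mismatch can be reproduced at runtime, which helps when diagnosing it. This sketch encodes a string as UTF-8 and then decodes the bytes with the wrong charset. ISO-8859-1 is used here as a stand-in for Windows-1252 because it is guaranteed to exist on every Java platform; the class MojibakeDemo and the method misdecode are illustrative names.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hypothetical helper: decodes UTF-8 bytes as ISO-8859-1 to mimic
    // what happens when text is read with the wrong encoding.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // café: 4 chars, 5 bytes in UTF-8
        String garbled = misdecode(original);

        // The two UTF-8 bytes of é (C3 A9) come back as two separate chars
        System.out.println(garbled.length());                  // 5
        System.out.println(garbled.equals("caf\u00C3\u00A9")); // true
    }
}
```

If you see pairs like U+00C3 followed by another Latin-1 character where one accented letter should be, UTF-8 bytes read as a single-byte encoding is a likely cause.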
Under the Hood: How Java Stores Strings
Before Java 9, String stored its text in a char[] (UTF-16 encoded), two bytes per char. That means a 10-character ASCII string needed 20 bytes of character data.
Java 9 introduced Compact Strings (JEP 254). If a string contains only Latin-1 characters (U+0000–U+00FF), it is stored as a byte[] with one byte per character, cutting memory roughly in half for typical English text. A string that contains any character outside Latin-1 uses the 2-byte-per-char UTF-16 layout instead. Because strings are immutable, the layout is chosen when the string is created.
The layout is not observable through the public API, as this example shows:
public class CompactStringHint {
    public static void main(String[] args) {
        // JVM stores this as 1 byte/char internally (Latin-1)
        String ascii = "Hello";
        // JVM stores this as 2 bytes/char (UTF-16)
        String mixed = "Hello 中";
        // Both behave identically from the API perspective
        System.out.println(ascii.length()); // 5
        System.out.println(mixed.length()); // 7
    }
}
This optimization is transparent to application code but has a real impact on memory usage in applications that process large volumes of ASCII text, such as JSON parsers or HTTP servers.
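The internal byte[] cannot be inspected from application code, but the arithmetic behind the savings can be approximated by encoding the text explicitly. The sketch below is only an analogy for what the JVM does, not a view into it. The class EncodedSize and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class EncodedSize {
    // True when every char is within U+0000..U+00FF, which is the condition
    // described above for the one-byte-per-char layout.
    static boolean fitsLatin1(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }

    // Size of the text when encoded one byte per char (only meaningful if fitsLatin1)
    static int latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Size of the text when encoded as UTF-16 without a byte order mark
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String ascii = "Hello";
        String mixed = "Hello \u4E2D"; // "Hello 中"

        System.out.println(fitsLatin1(ascii));  // true
        System.out.println(latin1Bytes(ascii)); // 5
        System.out.println(utf16Bytes(ascii));  // 10

        System.out.println(fitsLatin1(mixed));  // false
        System.out.println(utf16Bytes(mixed));  // 14
    }
}
```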
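Note: The String.chars() method returns an IntStream of char values widened to int, because Java has no CharStream. It yields UTF-16 code units, so a supplementary character shows up as two surrogate values. Use codePoints() when you need whole code points.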
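A short comparison of the two streams on the same string, as a sketch. The class CharsVsCodePoints and the helper surrogateCount are made-up names.

```java
class CharsVsCodePoints {
    // Hypothetical helper: counts how many values from chars() are surrogates.
    static long surrogateCount(String s) {
        return s.chars().filter(c -> Character.isSurrogate((char) c)).count();
    }

    public static void main(String[] args) {
        String text = "Hi \uD834\uDD1E!"; // "Hi 𝄞!"

        System.out.println(text.chars().count());      // 6 (UTF-16 code units)
        System.out.println(text.codePoints().count()); // 5 (code points)
        System.out.println(surrogateCount(text));      // 2 (the two halves of 𝄞)

        // Only codePoints() ever produces a value above 0xFFFF
        System.out.println(text.codePoints().max().getAsInt() == 0x1D11E); // true
        System.out.println(text.chars().max().getAsInt() <= 0xFFFF);       // true
    }
}
```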
Quick Reference: Common Unicode Ranges
| Range | Script / Usage |
|---|---|
| U+0000–U+007F | Basic ASCII (English letters, digits, punctuation) |
| U+0080–U+00FF | Latin-1 Supplement (accented chars, £, ©, ®) |
| U+0900–U+097F | Devanagari (Hindi, Sanskrit) |
| U+4E00–U+9FFF | CJK Unified Ideographs (Chinese/Japanese/Korean) |
| U+1F600–U+1F64F | Emoticons (😊, 😂, 😉 …) |
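Java can tell you which block a code point belongs to through Character.UnicodeBlock, so ranges like these rarely need to be hard-coded. A minimal sketch follows; the class BlockLookup and the helper inBlock are illustrative names, while the block constants are the JDK's own.

```java
class BlockLookup {
    // Hypothetical helper: true if codePoint falls in the given Unicode block.
    // UnicodeBlock.of returns null for code points outside any known block.
    static boolean inBlock(int codePoint, Character.UnicodeBlock block) {
        return block.equals(Character.UnicodeBlock.of(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(inBlock('A', Character.UnicodeBlock.BASIC_LATIN));                // true
        System.out.println(inBlock(0x00E0, Character.UnicodeBlock.LATIN_1_SUPPLEMENT));      // true
        System.out.println(inBlock(0x0939, Character.UnicodeBlock.DEVANAGARI));              // true
        System.out.println(inBlock(0x4E2D, Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS));  // true
        System.out.println(inBlock(0x1F60A, Character.UnicodeBlock.EMOTICONS));              // true
        System.out.println(inBlock('A', Character.UnicodeBlock.DEVANAGARI));                 // false
    }
}
```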
For internationalization, Java’s java.text and java.util.Locale APIs build on this Unicode foundation to handle locale-aware formatting, collation, and text segmentation.
Related Topics
- Data Types: learn how char fits into Java’s primitive type system
- Variables: declaring and using char variables in practice
- Strings: how Java’s String class is built on top of Unicode char arrays
- Wrapper Classes: the Character wrapper class and its full API
- Internationalization: building locale-aware, multilingual Java applications
- String Methods: methods like codePointAt(), codePoints(), and charAt() in action