Unicode & Encoding
JavaScript strings are sequences of UTF-16 code units, not abstract characters. This works fine for everyday text, but the abstraction leaks the moment you touch emoji, combining accents, or scripts outside the Basic Multilingual Plane. Understanding the difference between code units and code points is the key to slicing, counting, and comparing strings without corrupting them.
How JavaScript stores text
Internally, a JavaScript string is an immutable sequence of 16-bit values called code units. The .length property, bracket indexing, and charCodeAt() all operate on these units. For characters in the Basic Multilingual Plane (BMP, code points U+0000 to U+FFFF) one code unit equals one code point, so things behave intuitively.
const word = "café";
console.log(word.length); // 4
console.log(word.charCodeAt(3)); // 233 (é)
Output:
4
233
The trouble starts above U+FFFF. Those characters cannot fit in a single 16-bit unit, so UTF-16 encodes them as a surrogate pair: two code units that together represent one code point.
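The arithmetic behind a surrogate pair is small enough to sketch by hand. The helper below, toSurrogatePair, is a hypothetical name used for illustration and not a built-in. It assumes its input is a code point above U+FFFF and follows the UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```javascript
// Hypothetical helper (not a built-in): splits an astral code point
// (U+10000 to U+10FFFF) into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("expected a code point between U+10000 and U+10FFFF");
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xd800 + (offset >> 10);    // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600); // U+1F600 is 😀
console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === "\u{1F600}"); // true
```

In practice String.fromCodePoint() does this for you. The sketch only shows why one astral character occupies two code units.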
Surrogate pairs and the emoji length trap
A code point is the number Unicode assigns to a character (roughly what you think of as “a character”). A code unit is the storage cell. An emoji like 😀 (U+1F600) sits outside the BMP, so it is stored as two surrogate code units. As a result, .length overcounts and indexing returns broken halves.
const smile = "😀";
console.log(smile.length); // 2 ← two code units, one character
console.log(smile[0]); // "\uD83D" lone high surrogate, displayed as �
console.log(smile.charCodeAt(0)); // 55357
console.log([...smile].length); // 1 ← correct character count
Output:
2
�
55357
1
Warning: never use .length to count “characters” for user-facing text, and never slice a string at an arbitrary index. You can cut a surrogate pair in half and produce an invalid string.
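One way to avoid cutting a pair in half is to slice by code point position instead of code unit index. The following is a minimal sketch; sliceByCodePoints is a hypothetical helper name, and it assumes the string is small enough that copying it into an array is acceptable.

```javascript
// Hypothetical helper: slices by code point positions, so surrogate
// pairs are never split. Array.from() uses the string iterator.
function sliceByCodePoints(str, start, end) {
  return Array.from(str).slice(start, end).join("");
}

const greeting = "hi😀!";
const naive = greeting.slice(0, 3);             // code unit slice, cuts the pair
const safe = sliceByCodePoints(greeting, 0, 3); // code point slice, keeps the emoji

console.log(naive.length, naive.charCodeAt(2).toString(16)); // 3 d83d
console.log(safe);                                           // hi😀
console.log(safe.length);                                    // 4
```

This protects surrogate pairs only. It can still split a multi-code-point grapheme such as a base letter plus a combining accent.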
Working with code points
ES2015 added code-point-aware APIs that respect surrogate pairs. codePointAt(index) returns the full code point starting at a position, and the static String.fromCodePoint() builds a string from one or more code points.
const heart = "💙";
console.log(heart.codePointAt(0)); // 128153
console.log(heart.codePointAt(0).toString(16)); // "1f499"
console.log(String.fromCodePoint(0x1f499)); // "💙"
// Unicode code point escape (ES2015):
console.log("\u{1F499}"); // "💙"
Output:
128153
1f499
💙
💙
The table below contrasts the older code-unit APIs with their code-point-aware counterparts.
| Code unit (legacy) | Code point (Unicode-aware) | Notes |
|---|---|---|
| str.charAt(i) / str[i] | [...str][i] | Indexing by character |
| str.charCodeAt(i) | str.codePointAt(i) | Returns full scalar value |
| String.fromCharCode(n) | String.fromCodePoint(n) | Builds from scalar value |
| str.length | [...str].length | Counts characters, not units |
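One caveat worth illustrating: the index passed to codePointAt() is still a code unit offset, so pointing it at the second half of a pair returns the lone low surrogate. The sketch below uses a hypothetical helper, codePointsOf, to show how an index-based loop could step correctly.

```javascript
// Hypothetical helper: collects code points with an index-based loop,
// advancing by two code units whenever an astral code point is read.
function codePointsOf(str) {
  const result = [];
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    result.push(cp);
    i += cp > 0xffff ? 2 : 1; // astral code points occupy two code units
  }
  return result;
}

const sample = "a💙";
console.log(sample.codePointAt(1).toString(16)); // 1f499 (index 1 starts the pair)
console.log(sample.codePointAt(2).toString(16)); // dc99 (index 2 is the low surrogate)
console.log(codePointsOf(sample).map((cp) => cp.toString(16)).join(" ")); // 61 1f499
```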
Iterating correctly
The string iterator is Unicode-aware: it yields whole code points. That means for...of and the spread operator ([...str]) both step over surrogate pairs as single characters, while a classic for loop over indexes does not.
const text = "a😀b";
// Wrong: walks code units
let units = [];
for (let i = 0; i < text.length; i++) units.push(text[i]);
console.log(units.length); // 4 (😀 split)
// Right: walks code points
const chars = [...text];
console.log(chars.length); // 3
console.log(chars); // ["a", "😀", "b"]
Output:
4
3
["a", "😀", "b"]
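String reversal is a common place where the difference shows up. As a small sketch (the variable names are illustrative), splitting on the empty string walks code units and swaps the two halves of a surrogate pair, while spreading walks code points and keeps the pair intact.

```javascript
const label = "a😀b";

// split("") produces code units, so reversing swaps the surrogate halves.
const brokenReverse = label.split("").reverse().join("");

// Spread uses the string iterator, so each emoji stays in one piece.
const reversed = [...label].reverse().join("");

console.log(reversed);                                 // b😀a
console.log(brokenReverse === reversed);               // false
console.log(brokenReverse.charCodeAt(1).toString(16)); // de00 (low surrogate now comes first)
```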
Tip: even [...str] does not count grapheme clusters (like 👨‍👩‍👧, a family emoji joined by zero-width joiners, or a base letter plus combining accent). For true user-perceived character counts, use Intl.Segmenter with { granularity: "grapheme" }.
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // man + ZWJ + woman + ZWJ + girl
console.log([...family].length); // 5 (code points)
console.log([...seg.segment(family)].length); // 1 (grapheme)
Output:
5
1
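The same segmenter can enforce a length limit without cutting a grapheme apart. The following is a sketch under stated assumptions: truncateGraphemes is a hypothetical helper name, the "en" default locale is arbitrary, and the runtime is assumed to ship Intl.Segmenter.

```javascript
// Hypothetical helper: keeps at most `limit` user-perceived characters.
function truncateGraphemes(str, limit, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let out = "";
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count === limit) break;
    out += segment;
    count++;
  }
  return out;
}

const school = "e\u0301cole"; // "école" with a decomposed é (e + U+0301)
console.log(truncateGraphemes(school, 2) === "e\u0301c");    // true: é and c
console.log([...school].slice(0, 2).join("") === "e\u0301"); // true: code points give only é
```

Counting by code point would spend two of the allowed positions on a single visible letter, which is why grapheme segmentation is the better fit for user-facing limits.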
Normalization
The same visible text can have multiple binary representations. “é” can be a single precomposed code point (U+00E9) or “e” plus a combining acute accent (U+0065 U+0301). These look identical but are not equal with ===. The normalize() method rewrites a string into a canonical form so comparisons and storage are consistent.
const precomposed = "\u00E9"; // é as one code point (U+00E9)
const decomposed = "e\u0301"; // e + combining acute accent (U+0301)
console.log(precomposed === decomposed); // false
console.log(precomposed.length, decomposed.length); // 1 2
console.log(
precomposed.normalize("NFC") === decomposed.normalize("NFC")
); // true
Output:
false
1 2
true
The four normalization forms are "NFC" (default, canonical composition), "NFD" (canonical decomposition), "NFKC" and "NFKD" (compatibility forms that also fold things like ligatures). Use NFC for general comparison and storage.
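To make the difference between the forms concrete, here is a short sketch. sameText is a hypothetical helper name; the ligature U+FB01 (ﬁ) is used as an example of a character that only the compatibility forms fold.

```javascript
// Hypothetical helper: treats canonically equivalent strings as equal.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(sameText("\u00E9", "e\u0301"));     // true

// NFD splits the precomposed letter; NFC joins it back together.
console.log("\u00E9".normalize("NFD").length);  // 2
console.log("e\u0301".normalize("NFC").length); // 1

// Canonical forms leave the ligature alone; compatibility forms fold it.
const ligature = "\uFB01";
console.log(ligature.normalize("NFC") === ligature); // true
console.log(ligature.normalize("NFKC"));             // fi
```

Because NFKC and NFKD discard distinctions such as ligatures, they suit search or matching better than storing text that must round-trip unchanged.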
Best Practices
- Treat .length, str[i], and charCodeAt() as code-unit operations, never as character counts for international text.
- Use for...of, the spread operator, or Array.from() to iterate or split strings by code point.
- Reach for codePointAt() and String.fromCodePoint() instead of the charCode pair when emoji or astral characters are possible.
- Normalize to NFC before comparing, deduplicating, or persisting user input.
- Use Intl.Segmenter when you need true user-perceived character counts (graphemes), such as for text-length limits.
- Prefer the \u{...} code point escape over surrogate-pair escapes for readability.
- Validate or sanitize input to avoid lone surrogates, which are invalid as standalone characters.
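The last point can be sketched with a regular expression. The names LONE_SURROGATE, replaceLoneSurrogates, and hasLoneSurrogate are hypothetical, and replacing with U+FFFD is one possible policy among several (rejecting the input is another). The pattern has no u flag, so it matches individual code units. Newer engines also provide String.prototype.isWellFormed() and toWellFormed(), which may be preferable where available.

```javascript
// Matches a high surrogate not followed by a low one, or a low surrogate
// not preceded by a high one. No `u` flag, so matching is per code unit.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

// Hypothetical helper: swaps each lone surrogate for U+FFFD.
function replaceLoneSurrogates(str) {
  return str.replace(LONE_SURROGATE, "\uFFFD");
}

// Hypothetical helper: reports whether any lone surrogate is present.
function hasLoneSurrogate(str) {
  return replaceLoneSurrogates(str) !== str;
}

const damaged = "a😀b".slice(0, 2); // "a" plus a lone high surrogate
console.log(hasLoneSurrogate(damaged));                     // true
console.log(replaceLoneSurrogates(damaged) === "a\uFFFD");  // true
console.log(hasLoneSurrogate("a😀b"));                      // false
```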