Unicode & Encoding

JavaScript strings are sequences of UTF-16 code units, not abstract characters. That model works for everyday text, but it breaks down as soon as you handle emoji, combining accents, or scripts outside the Basic Multilingual Plane. Understanding the difference between code units and code points is the key to slicing, counting, and comparing strings without corrupting them.

How JavaScript stores text

Internally, a JavaScript string is an immutable sequence of 16-bit values called code units. The .length property, bracket indexing, and charCodeAt() all operate on these units. For characters in the Basic Multilingual Plane (BMP, code points U+0000 to U+FFFF), one code unit holds one code point, so strings behave intuitively.

const word = "café";
console.log(word.length);          // 4
console.log(word.charCodeAt(3));   // 233 (é)

Output:

4
233

The trouble starts above U+FFFF. Those characters cannot fit in a single 16-bit unit, so UTF-16 encodes them as a surrogate pair: two code units that together represent one code point.
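
The arithmetic behind a surrogate pair is short enough to sketch by hand. The helper below, toSurrogatePair, is a hypothetical name used only for illustration. It assumes a code point in the range U+10000 to U+10FFFF and follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```javascript
// Illustrative helper (not a built-in): split an astral code point
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("expected a code point between U+10000 and U+10FFFF");
  }
  const offset = codePoint - 0x10000;    // 20 significant bits remain
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === "\u{1F600}"); // true
```

In practice String.fromCodePoint() does this work for you; the sketch only shows why one code point can occupy two code units.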

Surrogate pairs and the emoji length trap

A code point is the number Unicode assigns to a character (roughly what you think of as “a character”). A code unit is the 16-bit storage cell. An emoji like 😀 (U+1F600) sits outside the BMP, so it is stored as two surrogate code units. As a result, .length reports 2 for one visible character, and indexing returns one half of the pair.

const smile = "😀";
console.log(smile.length);        // 2  ← two code units, one character
console.log(smile[0]);            // "\uD83D"  ← lone high surrogate, not a valid character
console.log(smile.charCodeAt(0)); // 55357
console.log([...smile].length);   // 1  ← correct character count

Output:

Warning: never use .length to count “characters” for user-facing text, and never slice a string at an arbitrary index — you can cut a surrogate pair in half and produce an invalid string.
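
One way to avoid that failure is to truncate by code point rather than by code-unit index. The sketch below uses a hypothetical helper, truncateCodePoints, built on Array.from(). It keeps surrogate pairs intact, but it does not try to keep longer sequences together, such as emoji joined with zero-width joiners.

```javascript
// Illustrative helper (not a built-in): keep at most maxCodePoints
// code points, never cutting a surrogate pair in half.
function truncateCodePoints(text, maxCodePoints) {
  return Array.from(text).slice(0, maxCodePoints).join("");
}

const label = "a\u{1F600}b"; // "a😀b"
console.log(label.slice(0, 2) === "a\uD83D");               // true: the pair was cut in half
console.log(truncateCodePoints(label, 2) === "a\u{1F600}"); // true: the emoji survives intact
```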

Working with code points

ES2015 added code-point-aware APIs that respect surrogate pairs. codePointAt(index) returns the full code point that starts at the given code-unit index (if the index lands on the second half of a pair, it returns only that lone surrogate), and the static String.fromCodePoint() builds a string from one or more code points.

const heart = "💙";
console.log(heart.codePointAt(0));            // 128153
console.log(heart.codePointAt(0).toString(16)); // "1f499"
console.log(String.fromCodePoint(0x1f499));   // "💙"

// Unicode code point escape (ES2015):
console.log("\u{1F499}");                      // "💙"

Output:

128153
1f499
💙
💙

The table below contrasts the older code-unit APIs with their code-point-aware counterparts.

Code unit (legacy)	Code point (Unicode-aware)	Notes
`str.charAt(i)` / `str[i]`	`[...str][i]`	Indexes by code point instead of code unit
`str.charCodeAt(i)`	`str.codePointAt(i)`	Returns the full code point; `i` is still a code-unit index
`String.fromCharCode(n)`	`String.fromCodePoint(n)`	Accepts code points above U+FFFF
`str.length`	`[...str].length`	Counts code points, not code units
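
One subtlety is worth seeing in code: codePointAt() still takes a code-unit index, so it is not a drop-in replacement for indexing by character. The following sketch uses a hypothetical helper, codePointsOf, to list the code points of a string.

```javascript
// Illustrative helper (not a built-in): list a string's code points.
// Array.from iterates by code point, so each `ch` is a whole character.
function codePointsOf(text) {
  return Array.from(text, (ch) => ch.codePointAt(0));
}

const sample = "a\u{1F499}b"; // "a💙b"
console.log(sample.codePointAt(1)); // 128153 (full code point, pair starts at unit 1)
console.log(sample.codePointAt(2)); // 56473  (0xDC99, the low surrogate on its own)
console.log([...sample][2]);        // "b"
console.log(codePointsOf(sample));  // [97, 128153, 98]
```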

Iterating correctly

The string iterator is Unicode-aware: it yields whole code points. That means for...of and the spread operator ([...str]) both step over surrogate pairs as single characters, while a classic for loop over indexes does not.

const text = "a😀b";

// Wrong: walks code units
let units = [];
for (let i = 0; i < text.length; i++) units.push(text[i]);
console.log(units.length); // 4 (😀 split)

// Right: walks code points
const chars = [...text];
console.log(chars.length);  // 3
console.log(chars);         // ["a", "😀", "b"]

Output:

4
3
["a", "😀", "b"]
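
The same distinction matters for any transformation built on iteration. As a small illustration, reversing a string by code units scrambles surrogate pairs, while reversing by code points keeps them intact. reverseByCodePoint is a hypothetical helper name.

```javascript
// Illustrative helper (not a built-in): reverse by code point so
// surrogate pairs stay in high-then-low order.
function reverseByCodePoint(text) {
  return [...text].reverse().join("");
}

const mixed = "a\u{1F600}b"; // "a😀b"
const naive = mixed.split("").reverse().join(""); // split("") walks code units
console.log(naive === "b\uDE00\uD83Da");                  // true: low surrogate now precedes high
console.log(reverseByCodePoint(mixed) === "b\u{1F600}a"); // true
```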

Tip: even [...str] does not count grapheme clusters (like 👨‍👩‍👧, a family emoji joined by zero-width joiners, or a base letter plus combining accent). For true user-perceived character counts, use Intl.Segmenter with { granularity: "grapheme" }.

const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
const family = "👨‍👩‍👧";
console.log([...family].length);            // 5 (code points)
console.log([...seg.segment(family)].length); // 1 (grapheme)

Output:

5
1
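
Building on this, a grapheme-aware length limit might look like the sketch below. countGraphemes and truncateGraphemes are hypothetical helper names, the "en" locale is an assumption, and the sketch assumes a runtime that ships Intl.Segmenter.

```javascript
// One shared segmenter; the "en" locale is an assumption of this sketch.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Count user-perceived characters.
function countGraphemes(text) {
  let count = 0;
  for (const _ of segmenter.segment(text)) count++;
  return count;
}

// Keep at most maxGraphemes user-perceived characters.
function truncateGraphemes(text, maxGraphemes) {
  let result = "";
  let taken = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (taken === maxGraphemes) break;
    result += segment;
    taken++;
  }
  return result;
}

// Family emoji (three emoji joined by two zero-width joiners) followed by "ok".
const familyThenText = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}ok";
console.log(countGraphemes(familyThenText));              // 3
console.log(truncateGraphemes(familyThenText, 1).length); // 8 code units, still one grapheme
```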

Normalization

The same visible text can have multiple binary representations. “é” can be a single precomposed code point (U+00E9) or “e” plus a combining acute accent (U+0065 U+0301). These look identical but are not equal with ===. The normalize() method rewrites a string into a canonical form so comparisons and storage are consistent.

const precomposed = "\u00E9";   // é as a single code point
const decomposed  = "e\u0301";  // e + combining acute accent

console.log(precomposed === decomposed);                 // false
console.log(precomposed.length, decomposed.length);      // 1 2
console.log(
  precomposed.normalize("NFC") === decomposed.normalize("NFC")
);                                                       // true

Output:

false
1 2
true

The four normalization forms are "NFC" (default, canonical composition), "NFD" (canonical decomposition), "NFKC" and "NFKD" (compatibility forms that also fold things like ligatures). Use NFC for general comparison and storage.
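
To make the difference between the canonical and compatibility forms concrete, here is a brief sketch. It uses the ﬁ ligature (U+FB01) as the compatibility example, and equalsNormalized is a hypothetical helper that compares two strings after NFC normalization.

```javascript
// Illustrative helper (not a built-in): compare after NFC normalization.
function equalsNormalized(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

const ligature = "\uFB01"; // "ﬁ" as a single ligature code point
console.log(ligature.normalize("NFC") === ligature); // true: canonical forms keep the ligature
console.log(ligature.normalize("NFKC") === "fi");    // true: compatibility forms fold it to "f" + "i"
console.log("\u00E9".normalize("NFD").length);       // 2 (e + combining acute accent)
console.log(equalsNormalized("\u00E9", "e\u0301"));  // true
```

Because the compatibility forms discard distinctions like this one, they suit search and matching better than storage of the user's original text.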

Best Practices

Treat .length, str[i], and charCodeAt() as code-unit operations — never as character counts for international text.
Use for...of, the spread operator, or Array.from() to iterate or split strings by code point.
Reach for codePointAt() and String.fromCodePoint() instead of the charCode pair when emoji or astral characters are possible.
Normalize to NFC before comparing, deduplicating, or persisting user input.
Use Intl.Segmenter when you need true user-perceived character counts (graphemes), such as for text-length limits.
Prefer the \u{...} code point escape over surrogate-pair escapes for readability.
Validate or sanitize input to avoid lone surrogates, which are invalid as standalone characters.
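
For the last point, a lone-surrogate check can be sketched as follows. The helper names are hypothetical. The sketch prefers String.prototype.isWellFormed() and toWellFormed(), which were added in ES2024 and may be missing in older runtimes, and otherwise falls back to a Unicode-mode regular expression, which matches only surrogates that are not part of a valid pair.

```javascript
// In Unicode mode (the "u" flag) a valid pair is read as one code point,
// so this class can only match unpaired surrogates.
const LONE_SURROGATE = /[\uD800-\uDFFF]/u;
const LONE_SURROGATES = /[\uD800-\uDFFF]/gu;

// Illustrative helper (not a built-in): detect unpaired surrogates.
function hasLoneSurrogate(text) {
  if (typeof text.isWellFormed === "function") return !text.isWellFormed();
  return LONE_SURROGATE.test(text); // fallback for older runtimes
}

// Illustrative helper (not a built-in): swap unpaired surrogates for U+FFFD.
function replaceLoneSurrogates(text) {
  if (typeof text.toWellFormed === "function") return text.toWellFormed();
  return text.replace(LONE_SURROGATES, "\uFFFD"); // fallback for older runtimes
}

console.log(hasLoneSurrogate("\u{1F600}"));                    // false
console.log(hasLoneSurrogate("a\uD83Db"));                     // true
console.log(replaceLoneSurrogates("a\uD83Db") === "a\uFFFDb"); // true
```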
