Page Cache & Zero-Copy
Kafka routinely pushes millions of messages per second through commodity hardware, even though it runs on the JVM and persists everything to disk. The trick is that Kafka does almost no clever caching of its own. Instead it leans on four well-understood operating-system and hardware behaviours: the OS page cache, sequential disk access, zero-copy data transfer, and aggressive batching with compression. Understanding these is essential for capacity planning, because they dictate how you size memory, choose disks, and tune the JVM heap in production.
Reliance on the OS page cache
Most data systems build an in-process cache (a buffer pool) on top of the disk. Kafka deliberately does not. When a producer writes a record, it is appended to a log segment file and the bytes land in the operating system’s page cache — the kernel’s in-memory copy of recently touched file pages. Reads served to consumers usually come straight from that same page cache, never touching the physical disk.
This design has several consequences:
- The JVM heap stays small. Kafka does not hold message payloads on the heap, so garbage-collection pressure is low even at high throughput.
- Free RAM becomes the cache automatically. The kernel manages eviction, and a broker that has been running for a while will show most of its RAM as page cache, not application memory.
- A broker restart does not cold-start the cache. Because the cache lives in the kernel, it survives a JVM restart (though not a machine reboot).
Tip: Give the JVM a modest heap (commonly 5-6 GB is plenty) and leave the rest of physical memory to the OS for page cache. Over-sizing the heap starves the page cache and can actually reduce throughput.
Sequential disk I/O
Kafka’s log is an append-only structure. Producers only ever add records to the end of the active segment, and consumers read forward from an offset. This turns nearly all disk activity into sequential reads and writes, which are dramatically faster than random access — often by two or three orders of magnitude on spinning disks, and still meaningfully faster on SSDs because of read-ahead and large contiguous transfers.
Because writes are sequential appends, Kafka can hand them to the page cache and let the kernel flush them to disk in large, efficient batches via its normal writeback machinery. Kafka does not call fsync on every write by default; durability comes primarily from replication across brokers rather than from forcing each write to platter.
| Behaviour | Random I/O | Kafka sequential I/O |
|---|---|---|
| Access pattern | Seeks all over the disk | Append to tail / read forward |
| Throughput on HDD | A few MB/s | Hundreds of MB/s |
| Page-cache friendliness | Poor (scattered pages) | Excellent (contiguous pages) |
| Durability model | Often per-write fsync | Replication + periodic flush |
You can tune flush behaviour if you have stricter requirements, but doing so trades throughput for synchronous durability:
# Flush after N messages or after T milliseconds (defaults are effectively "never" by count)
log.flush.interval.messages=10000
log.flush.interval.ms=1000
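The append-and-read-forward pattern described in this section can be sketched with the JDK's FileChannel. This is a minimal illustration, not Kafka's segment format: the class name AppendOnlySegment and the length-prefixed record layout are assumptions made for the example. Writes land in the page cache and return; only the explicit flush() call forces them to the device.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Kafka's on-disk format.
final class AppendOnlySegment implements AutoCloseable {
    private final FileChannel channel;

    AppendOnlySegment(Path file) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    /** Appends one length-prefixed record at the tail and returns its byte position. */
    long append(String value) throws IOException {
        byte[] payload = value.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + payload.length);
        buf.putInt(payload.length).put(payload).flip();
        long start = channel.size();   // always write at the current end of the file
        long pos = start;
        while (buf.hasRemaining()) {
            pos += channel.write(buf, pos);
        }
        return start;                  // no force() here: the kernel writes back later
    }

    /** Reads forward from a record's byte position to the end of the segment. */
    List<String> readFrom(long position) throws IOException {
        List<String> out = new ArrayList<>();
        long end = channel.size();
        long pos = position;
        while (pos < end) {
            ByteBuffer header = ByteBuffer.allocate(Integer.BYTES);
            readFully(header, pos);
            int len = header.getInt();
            ByteBuffer body = ByteBuffer.allocate(len);
            readFully(body, pos + Integer.BYTES);
            out.add(new String(body.array(), StandardCharsets.UTF_8));
            pos += Integer.BYTES + len;
        }
        return out;
    }

    private void readFully(ByteBuffer buf, long pos) throws IOException {
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos + buf.position());
            if (n < 0) throw new IOException("unexpected end of segment");
        }
        buf.flip();
    }

    /** Forces dirty pages to the device; the costly step Kafka skips per write by default. */
    void flush() throws IOException {
        channel.force(true);
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        try (AppendOnlySegment seg = new AppendOnlySegment(file)) {
            seg.append("a");
            long second = seg.append("b");
            seg.append("c");
            System.out.println(seg.readFrom(second)); // prints [b, c]
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Calling flush() after every append would correspond roughly to a per-write fsync, which is the trade-off the flush settings above control.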
Zero-copy transfer with sendfile
The biggest single win for consumer reads is zero-copy. In a naive server, sending a file to a network socket copies data four times and crosses the user/kernel boundary repeatedly: disk to kernel page cache, page cache to a user-space buffer, user buffer back into a kernel socket buffer, and finally socket buffer to the NIC.
Kafka avoids the two user-space copies by using the sendfile system call (exposed in Java as FileChannel.transferTo). The kernel transfers bytes directly from the page cache to the socket without ever materialising them in the JVM, and on modern NICs with scatter-gather DMA the data can go straight from the page cache to the network card.
Traditional copy path:                      Zero-copy (sendfile) path:
disk -> page cache                          disk -> page cache
  -> application buffer (JVM)                 -> socket / NIC (DMA)
  -> socket buffer
  -> NIC
4 copies, 2 context-switch round trips      2 copies, no JVM involvement
There is an important caveat: zero-copy only works when the broker can hand the on-disk bytes to the socket unchanged. If the broker has to encrypt or re-format data (for example when TLS is terminated at the broker, or when an older consumer forces a message-format down-conversion), the kernel cannot use sendfile and the data must pass through user space.
Warning: Enabling broker-to-consumer TLS disables sendfile zero-copy, since the bytes must be encrypted in user space. Budget extra CPU for encrypted clusters, and keep all clients on the current message format to avoid down-conversion.
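The FileChannel.transferTo path mentioned above can be tried out with a small loopback sketch. The class name ZeroCopySend and the temp-file setup are assumptions for illustration, and whether the JDK actually uses sendfile underneath depends on the platform; the point is only that the file bytes are never read into a Java buffer on the sending side.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; not taken from the Kafka code base.
final class ZeroCopySend {

    /** Sends a whole file to the socket with transferTo and returns the bytes sent. */
    static long sendFile(Path file, SocketChannel socket) throws IOException {
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = in.size();
            long sent = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (sent < size) {
                sent += in.transferTo(sent, size - sent, socket);
            }
            return sent;
        }
    }

    /** Loopback demo: serves a temp file to a local client and counts received bytes. */
    static long demo(int fileBytes) throws Exception {
        Path file = Files.createTempFile("segment", ".log");
        Files.write(file, new byte[fileBytes]);
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            long[] received = new long[1];
            Thread consumer = new Thread(() -> {
                try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                    ByteBuffer buf = ByteBuffer.allocate(8192);
                    int n;
                    while ((n = client.read(buf)) >= 0) {
                        received[0] += n;
                        buf.clear();
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            consumer.start();
            try (SocketChannel accepted = server.accept()) {
                sendFile(file, accepted);
            } // closing the accepted socket signals end-of-stream to the client
            consumer.join();
            return received[0];
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws Exception {
        // The received count is expected to equal the file size.
        System.out.println(demo(1 << 20));
    }
}
```

Wrapping the socket in a TLS layer would force the bytes through a user-space buffer for encryption, which is why the warning above applies.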
Batching and compression
Kafka amortises per-message overhead by grouping records into batches on the producer before they are sent, stored, and replicated. A batch is compressed once and travels through the entire pipeline (producer, broker disk, replica, and consumer) in its compressed form. As long as the broker's compression.type is left at its default of producer, this keeps recompression work off the broker and preserves zero-copy, because the broker does not need to decompress a batch to serve a fetch.
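To see why compressing a whole batch pays off, the following sketch compares compressing records one at a time with compressing them as a single batch. It uses the JDK's GZIP stream as a stand-in codec (Kafka also supports gzip, while lz4 and zstd need extra libraries), and the class name, record shape, and record count are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo; the record contents are invented sample data.
final class BatchCompressionDemo {

    /** Returns the gzip-compressed size of the given bytes. */
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.size();
    }

    /** Builds small, similar-looking JSON records, as is typical for event streams. */
    static List<byte[]> sampleRecords(int count) {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String json = "{\"orderId\":" + i + ",\"status\":\"CREATED\",\"currency\":\"EUR\"}";
            records.add(json.getBytes(StandardCharsets.UTF_8));
        }
        return records;
    }

    /** Total size when every record is compressed on its own. */
    static int perRecordSize(List<byte[]> records) throws IOException {
        int total = 0;
        for (byte[] r : records) {
            total += gzipSize(r);
        }
        return total;
    }

    /** Size when all records are concatenated and compressed once. */
    static int batchedSize(List<byte[]> records) throws IOException {
        ByteArrayOutputStream batch = new ByteArrayOutputStream();
        for (byte[] r : records) {
            batch.write(r);
        }
        return gzipSize(batch.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> records = sampleRecords(1000);
        System.out.println("per-record bytes: " + perRecordSize(records));
        System.out.println("batched bytes:    " + batchedSize(records));
    }
}
```

The batched figure comes out much smaller because the codec can exploit repetition across records and pays its framing overhead once rather than per record.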
Tune batching on the producer to trade a little latency for much higher throughput:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// Batch up to 64 KB, and wait up to 10 ms to let a batch fill before sending.
props.put("batch.size", 64 * 1024);
props.put("linger.ms", 10);
// Compress each batch once on the producer.
props.put("compression.type", "lz4");
// try-with-resources closes the producer, which flushes any batch still waiting to be sent.
try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    producer.send(new ProducerRecord<>("orders", "k1", "v1"));
}
The equivalent in a Spring Boot 3.x application is set through application.yaml:
spring:
  kafka:
    producer:
      batch-size: 65536
      properties:
        linger.ms: 10
      compression-type: lz4
You can check that the page cache is doing its job by looking at how much of a busy broker's memory the kernel is using as cache:
free -h
Output:
               total        used        free      shared  buff/cache   available
Mem:            62Gi       6.1Gi       1.2Gi        12Mi        55Gi        55Gi
The buff/cache column (here 55 GiB) is mostly page cache holding hot log segments, which is what lets consumer fetches return without disk reads.
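As a rough sizing aid, you can estimate how far behind a consumer can fall before its reads drop out of the page cache. The sketch below is a back-of-the-envelope model under stated assumptions: the class and method names are hypothetical, the 50 MiB/s ingress rate is an invented example figure, and it treats the cache as holding only the most recently written log bytes, which real kernel eviction only approximates.

```java
// Hypothetical helper for capacity estimates; not part of Kafka.
final class PageCacheWindow {

    /** Seconds of recent writes that fit in the page cache at a steady ingress rate. */
    static double secondsCached(long cacheBytes, long ingressBytesPerSecond) {
        if (ingressBytesPerSecond <= 0) {
            throw new IllegalArgumentException("ingress rate must be positive");
        }
        return (double) cacheBytes / ingressBytesPerSecond;
    }

    /** True if a consumer lagging by the given time is probably still reading cached data. */
    static boolean likelyServedFromCache(double lagSeconds, long cacheBytes, long ingressBytesPerSecond) {
        return lagSeconds <= secondsCached(cacheBytes, ingressBytesPerSecond);
    }

    public static void main(String[] args) {
        long cacheBytes = 55L << 30;  // 55 GiB of buff/cache, as in the free output above
        long ingress = 50L << 20;     // assumed 50 MiB/s of produced data on this broker
        double window = secondsCached(cacheBytes, ingress);
        // About 1126 seconds, i.e. a little under 19 minutes of data.
        System.out.println("cache window (s): " + window);
        System.out.println("10 min lag cached? " + likelyServedFromCache(600, cacheBytes, ingress));
        System.out.println("1 h lag cached?    " + likelyServedFromCache(3600, cacheBytes, ingress));
    }
}
```

Consumers that lag further than this window are the ones most likely to cause physical disk reads.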
Best Practices
- Keep the JVM heap small (around 5-6 GB) and leave the vast majority of RAM free for the OS page cache.
- Use fast, sequential-friendly storage; multiple disks with one log directory each spread I/O and increase aggregate sequential throughput.
- Rely on replication for durability rather than aggressive per-write fsync, unless a specific workload demands synchronous flushing.
- Keep producers and consumers on the current message format to preserve zero-copy and avoid broker-side down-conversion.
- Account for the CPU cost of TLS, since broker-side encryption disables sendfile zero-copy.
- Enable producer batching and a lightweight codec like lz4 or zstd so compression happens once and survives end to end.
- Monitor buff/cache and broker disk-read rates; rising physical reads signal that the working set no longer fits in the page cache.