Cooperative Rebalancing

When a consumer joins or leaves a group, Kafka must redistribute partitions across the group's current members. The original “eager” protocol does this with a stop-the-world pause: every consumer revokes all of its partitions, then the group reassigns them from scratch. Incremental cooperative rebalancing fixes the most painful part of that design by revoking only the partitions that actually need to move, so every other partition keeps flowing throughout. In high-throughput production systems this can be the difference between a brief hiccup and a multi-second consumption stall on every deployment.

Why eager rebalancing hurts

Under the eager protocol, the rebalance is a single synchronization barrier. As soon as a rebalance is triggered, every member gives up its entire assignment before rejoining the group. No consumer in the group can fetch records until every member has rejoined and the new assignment has been computed and distributed. The duration of that pause scales with the slowest member and the size of the group.

The cost is wasted work and added latency. A consumer that owned partitions 0-9 and will still own partitions 0-8 after the rebalance is forced to give up all ten and re-acquire nine of them, flushing in-flight processing and local state for no benefit. During a rolling restart of N instances you pay this penalty at least N times.

The two-phase cooperative protocol

Cooperative rebalancing splits the work into two consecutive rebalances so partitions move without a global pause:

Phase 1: revoke only what moves. Every member reports its current ownership when it joins the rebalance. The assignor computes the target assignment, but in this round each member keeps only the partitions it already owns and will continue to own. Any partition that must migrate to a different member is revoked by its current owner. No partition is handed to a new owner yet.
Phase 2: assign the freed partitions. The revocation automatically triggers a second rebalance. The partitions that were revoked in phase 1 are now unowned, so they can be assigned to their new owners.

The key property is that partitions you keep are never revoked, so poll() continues to return records for them across both phases. Only the migrating partitions experience a gap, and only on the two members involved in that specific move.

Because of the two-phase handshake, a cooperative rebalance that moves partitions takes two rebalance rounds, each driven from poll(), before the group is stable. This is expected and is still far cheaper than the global pause of eager mode.
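The two phases can be sketched as plain set arithmetic. The following is a minimal simulation, not Kafka's implementation: the class TwoPhaseRebalanceSketch, its method names, the member names A, B and C, and the target assignment are all hypothetical, and the target is written by hand rather than computed by a real assignor.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class TwoPhaseRebalanceSketch {

    // Per member: partitions present in 'left' but absent from 'right'.
    // Members with an empty difference are omitted from the result.
    static Map<String, Set<Integer>> perMemberDifference(
            Map<String, Set<Integer>> left, Map<String, Set<Integer>> right) {
        Map<String, Set<Integer>> diff = new TreeMap<>();
        for (Map.Entry<String, Set<Integer>> e : left.entrySet()) {
            Set<Integer> remaining = new TreeSet<>(e.getValue());
            remaining.removeAll(right.getOrDefault(e.getKey(), Set.of()));
            if (!remaining.isEmpty()) {
                diff.put(e.getKey(), remaining);
            }
        }
        return diff;
    }

    // Ownership after phase 1: what each member owned before AND keeps in the target.
    static Map<String, Set<Integer>> ownershipAfterPhaseOne(
            Map<String, Set<Integer>> current, Map<String, Set<Integer>> target) {
        Map<String, Set<Integer>> owned = new TreeMap<>();
        for (Map.Entry<String, Set<Integer>> e : target.entrySet()) {
            Set<Integer> kept = new TreeSet<>(current.getOrDefault(e.getKey(), Set.of()));
            kept.retainAll(e.getValue());
            owned.put(e.getKey(), kept);
        }
        return owned;
    }

    public static void main(String[] args) {
        // Hypothetical scenario: member C joins a group of two.
        Map<String, Set<Integer>> current = Map.of(
                "A", Set.of(0, 1, 2), "B", Set.of(3, 4, 5), "C", Set.of());
        Map<String, Set<Integer>> target = Map.of(
                "A", Set.of(0, 1), "B", Set.of(3, 4), "C", Set.of(2, 5));

        // Phase 1: owners revoke only the partitions that move.
        System.out.println("revoked in phase 1:  " + perMemberDifference(current, target));
        Map<String, Set<Integer>> owned = ownershipAfterPhaseOne(current, target);
        System.out.println("owned after phase 1: " + owned);

        // Phase 2: the freed partitions go to their new owners.
        System.out.println("assigned in phase 2: " + perMemberDifference(target, owned));
    }
}
```

In this scenario partitions 0, 1, 3 and 4 never leave their owners, which is the property that lets poll() keep returning records for them during both rounds.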

Enabling it with CooperativeStickyAssignor

Cooperative rebalancing is selected through the partition assignment strategy. Set partition.assignment.strategy to CooperativeStickyAssignor, which combines sticky (minimal-movement) assignment with the cooperative protocol.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrdersConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processor");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Switch the group to incremental cooperative rebalancing
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                List.of(CooperativeStickyAssignor.class.getName()));

        // try-with-resources closes the consumer, leaving the group cleanly on exit
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Rebalances are driven from poll(); retained partitions keep returning records
                consumer.poll(Duration.ofMillis(500))
                        .forEach(record -> System.out.printf(
                                "p=%d off=%d %s%n",
                                record.partition(), record.offset(), record.value()));
            }
        }
    }
}

In Spring Boot, configure the same strategy declaratively:

spring:
  kafka:
    consumer:
      group-id: orders-processor
      properties:
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor

Rebalance listener semantics change

With the cooperative protocol, ConsumerRebalanceListener.onPartitionsRevoked is invoked only for the partitions actually being taken away, not your whole assignment. A third callback, onPartitionsLost, fires when partitions are reclaimed without a clean revoke (for example after a session timeout). Its default implementation simply delegates to onPartitionsRevoked, so override it if your revoke handler commits offsets. Commit offsets for revoked partitions before releasing them:

consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(java.util.Collection<org.apache.kafka.common.TopicPartition> revoked) {
        // 'revoked' contains ONLY the partitions moving away from this member
        consumer.commitSync();
    }

    @Override
    public void onPartitionsAssigned(java.util.Collection<org.apache.kafka.common.TopicPartition> assigned) {
        // 'assigned' contains ONLY the newly granted partitions
    }

    @Override
    public void onPartitionsLost(java.util.Collection<org.apache.kafka.common.TopicPartition> lost) {
        // Do NOT commit here — ownership is already gone
    }
});
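Because the revoked collection is only a subset, a listener that tracks its own progress can commit just those partitions. The sketch below shows that selection step using plain collections; the class RevokedOffsetsSketch, the offsetsToCommit helper, and the sample offsets are hypothetical. In a real listener the keys would be TopicPartition objects and the values would be wrapped in OffsetAndMetadata before being passed to the commitSync overload that takes a map.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RevokedOffsetsSketch {

    // lastProcessed maps partition number -> offset of the last fully processed record.
    // Returns the offsets to commit for the revoked partitions only.
    static Map<Integer, Long> offsetsToCommit(
            Map<Integer, Long> lastProcessed, Collection<Integer> revoked) {
        Map<Integer, Long> toCommit = new TreeMap<>();
        for (Integer partition : revoked) {
            Long last = lastProcessed.get(partition);
            if (last != null) {
                // A committed offset names the NEXT record to read, hence the +1.
                toCommit.put(partition, last + 1);
            }
        }
        return toCommit;
    }

    public static void main(String[] args) {
        // Hypothetical progress for three owned partitions.
        Map<Integer, Long> lastProcessed = Map.of(0, 41L, 1, 17L, 2, 99L);

        // Only partition 2 is moving away, so only its offset is selected.
        System.out.println(offsetsToCommit(lastProcessed, List.of(2))); // prints {2=100}
    }
}
```

Partitions 0 and 1 stay with this member, so their offsets can keep following the normal commit schedule.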

Downtime compared

The table below summarizes the practical differences between the two protocols for a group of N consumers.

Aspect	Eager (RangeAssignor / RoundRobinAssignor)	Cooperative (CooperativeStickyAssignor)
Partitions revoked per rebalance	All partitions, every member	Only partitions that change owner
Consumption during rebalance	Fully paused (stop-the-world)	Continues for unaffected partitions
Rebalance rounds	One	Two (revoke, then assign)
Rolling restart impact	Pause on each of N restarts	Minimal, localized movement
onPartitionsRevoked scope	Entire assignment	Only the moving partitions

A rolling deploy of a 12-instance group illustrates the difference. With eager, each instance restart pauses the entire group, so you see at least twelve global stalls. With cooperative, only the partitions migrating off or onto the restarting instance pause, and only briefly; the other eleven instances keep processing the partitions they retain.
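The arithmetic behind that comparison can be made concrete with a deliberately simplified model. It assumes one rebalance per restart and an even spread of partitions, and it counts "partition pauses" (one partition interrupted once). The class name, the method names, and the figure of 48 partitions are hypothetical; only the 12 instances come from the example above.

```java
public class RollingRestartPauseSketch {

    // Eager: every restart interrupts every partition in the group.
    static long eagerPartitionPauses(int instances, int partitions) {
        return (long) instances * partitions;
    }

    // Cooperative: every restart interrupts only the restarting instance's share.
    static long cooperativePartitionPauses(int instances, int partitions) {
        long sharePerInstance = (partitions + instances - 1) / instances; // ceiling division
        return (long) instances * sharePerInstance;
    }

    public static void main(String[] args) {
        int instances = 12;
        int partitions = 48; // hypothetical topic size

        System.out.println("eager:       " + eagerPartitionPauses(instances, partitions));       // 576
        System.out.println("cooperative: " + cooperativePartitionPauses(instances, partitions)); // 48
    }
}
```

Under these assumptions the cooperative total stays close to the partition count, while the eager total grows with the product of instances and partitions.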

Migrating an existing group safely

You cannot flip a live group from eager to cooperative in a single step, because the two protocols use incompatible rebalance semantics. Use a rolling, two-pass upgrade:

Pass 1: strategy = [CooperativeStickyAssignor, RangeAssignor]   # both listed
Pass 2: strategy = [CooperativeStickyAssignor]                  # eager removed

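During pass 1 every member advertises both assignors, so restarted and not-yet-restarted members can always agree on a common strategy. Because RangeAssignor supports only the eager protocol, a member keeps rebalancing eagerly for as long as that assignor remains in its list. Once the whole group is on pass 1, a second rolling restart to pass 2 removes it and promotes the group to pure cooperative.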

Sample consumer log output during a rebalance:

[Consumer clientId=c1, groupId=orders-processor] Notifying assignor about the new Assignment
[Consumer clientId=c1, groupId=orders-processor] Adding newly assigned partitions: orders-3
[Consumer clientId=c1, groupId=orders-processor] Revoke previously assigned partitions
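Why the upgrade needs two passes can be modeled in a few lines. The sketch below is a simplified model of how a single consumer settles on a rebalance protocol: it considers only protocols supported by every assignor in its configured list and picks the most advanced of those. The class MigrationProtocolSketch, the Protocol enum, and the SUPPORTED table are hypothetical stand-ins written for illustration; they are not Kafka classes.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MigrationProtocolSketch {

    // Declared from least to most advanced; iteration order relies on this.
    enum Protocol { EAGER, COOPERATIVE }

    // Assumed protocol support per assignor (illustrative table, short names).
    static final Map<String, Set<Protocol>> SUPPORTED = Map.of(
            "RangeAssignor", EnumSet.of(Protocol.EAGER),
            "CooperativeStickyAssignor", EnumSet.of(Protocol.EAGER, Protocol.COOPERATIVE));

    // The protocol a member uses: the most advanced one common to ALL its assignors.
    static Protocol memberProtocol(List<String> strategies) {
        Set<Protocol> common = EnumSet.allOf(Protocol.class);
        for (String strategy : strategies) {
            Set<Protocol> supported = SUPPORTED.get(strategy);
            if (supported == null) {
                throw new IllegalArgumentException("unknown assignor: " + strategy);
            }
            common.retainAll(supported);
        }
        Protocol best = null;
        for (Protocol p : common) {
            best = p; // EnumSet iterates in declaration order, so the last one wins
        }
        if (best == null) {
            throw new IllegalStateException("no protocol common to " + strategies);
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(memberProtocol(List.of("RangeAssignor")));                              // EAGER
        System.out.println(memberProtocol(List.of("CooperativeStickyAssignor", "RangeAssignor"))); // EAGER
        System.out.println(memberProtocol(List.of("CooperativeStickyAssignor")));                  // COOPERATIVE
    }
}
```

The middle case is pass 1: listing both assignors keeps the member on eager, which is what makes it safe to mix restarted and not-yet-restarted instances in the same group.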

Best practices

Prefer CooperativeStickyAssignor for any group where rebalance pauses are visible to users or cause lag spikes. Recent Kafka clients (3.0 and later) already list it in the default partition.assignment.strategy alongside RangeAssignor.
Migrate from eager in two rolling passes ([CooperativeSticky, Range] then [CooperativeSticky]); never change all members at once.
Commit offsets in onPartitionsRevoked, but never in onPartitionsLost, because in the latter case the partitions are already gone.
Pair cooperative rebalancing with static group membership (group.instance.id) so that restarts which finish within session.timeout.ms do not trigger a rebalance at all.
Keep max.poll.interval.ms generous enough that the second rebalance round of the two-phase handshake never pushes a member past the timeout.
Treat onPartitionsRevoked as partial: write your listener so it correctly handles a subset of your assignment, not the whole set.
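Several of these practices come down to a handful of consumer settings. The sketch below collects them using plain string keys so that it needs nothing beyond the JDK. The class name, the instance ID, and the timeout values are hypothetical placeholders to tune for your own deployment; the property names are the standard consumer configuration keys.

```java
import java.util.Properties;

public class StaticMembershipConfigSketch {

    // Builds consumer settings for one instance. instanceId must be unique
    // within the group and stay the same across restarts of that instance.
    static Properties consumerProps(String instanceId) {
        Properties props = new Properties();
        props.setProperty("group.id", "orders-processor");

        // Cooperative protocol (second pass of the migration, eager assignor removed)
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");

        // Static membership: a restart that completes within session.timeout.ms
        // lets the instance rejoin under the same identity.
        props.setProperty("group.instance.id", instanceId);
        props.setProperty("session.timeout.ms", "60000");     // illustrative value

        // Headroom for slow batches plus the second rebalance round
        props.setProperty("max.poll.interval.ms", "300000");  // illustrative value
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical instance name, e.g. derived from a pod or host name
        consumerProps("orders-processor-0").forEach(
                (key, value) -> System.out.println(key + "=" + value));
    }
}
```

Deriving group.instance.id from something stable, such as a StatefulSet ordinal or a host name, is one way to keep the identity constant across restarts.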