Distributed Communication Architecture
✅ Required: Compare communication protocols (WebSockets, SSE, gRPC, custom UDP)
✅ Required: How do you ensure message ordering, delivery guarantees, and idempotency?
✅ Required: Design your approach to handling partial failures and network issues
✅ Optional: How do you implement backpressure and flow control?
✅ Optional: What's your strategy for connection management and load balancing?
Decision
- WebSockets for bidirectional communication between clients and servers.
- gRPC for internal services communication.
1. Bidirectional communication protocol between Clients and Servers
1.1 Why WebSockets?
WebSocket is the solution that best aligns with the requirements. The protocol establishes a low-latency, bidirectional connection, which fits this use case where clients need to publish and subscribe to synchronization events simultaneously. It is also straightforward to implement, and there are mature abstractions for Node.js, for example the Socket.io package.
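To make the bidirectional pattern concrete, here is a minimal transport-agnostic sketch in TypeScript. SyncEvent, Hub, and the payload field are illustrative names, not part of the design; an in-memory hub stands in for the WebSocket server so the publish/subscribe flow can be shown without a network:

```typescript
interface SyncEvent {
  docId: string;   // document the event belongs to
  payload: string; // serialized CRDT delta (illustrative)
}

type Handler = (event: SyncEvent) => void;

// In-memory stand-in for the WebSocket server: clients both publish
// events and subscribe to events published by other clients.
class Hub {
  private subscribers = new Map<string, Set<Handler>>();

  subscribe(docId: string, handler: Handler): void {
    if (!this.subscribers.has(docId)) {
      this.subscribers.set(docId, new Set());
    }
    this.subscribers.get(docId)!.add(handler);
  }

  publish(event: SyncEvent): void {
    // Fan the event out to every subscriber of that document.
    for (const handler of this.subscribers.get(event.docId) ?? []) {
      handler(event);
    }
  }
}
```

With a real WebSocket transport, publish would be an inbound message from one client and the handlers would be outbound pushes to the other connected clients.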
1.2 SSE limitations
Server-Sent Events (SSE) could work well for synchronizing clients' local data, but it establishes only a unidirectional connection: the client receives updates from the server but cannot publish new events over the same channel.
1.3 gRPC limitations
gRPC could become too complex to develop and maintain, especially for this use case, since browsers cannot speak native gRPC; using it between clients and servers would require an additional layer such as gRPC-Web and a translating proxy.
1.4 Custom UDP limitations
A custom UDP protocol offers a low-latency connection but would be very complex to implement and has significant limitations for this use case: for example, UDP does not guarantee that packets reach the destination, lost packets are not retransmitted automatically, and there is no feedback mechanism from the receiver to confirm successful delivery.
2. Communication between internal services
2.1 Why gRPC for internal services communication?
- gRPC is built on HTTP/2, enabling multiplexed streams, header compression, and persistent connections, which significantly reduce latency and bandwidth usage.
- Binary serialization via Protocol Buffers (Protobuf) makes data exchange very efficient.
- gRPC supports multiple languages, making it suitable for language-agnostic architectures.
- Teams can build services in different languages while maintaining consistent communication protocols.
- gRPC is well suited to low-latency, high-throughput environments such as internal service meshes, where client and server trust is established and network reliability is high.
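As a sketch of what an internal contract could look like, here is an illustrative Protobuf service definition; the service, method, and field names are assumptions, not part of the decided design:

```proto
syntax = "proto3";

// Illustrative internal contract for propagating CRDT deltas.
service SyncService {
  rpc PushDelta (DeltaRequest) returns (DeltaAck);
}

message DeltaRequest {
  string doc_id = 1; // document the delta belongs to
  bytes delta = 2;   // serialized CRDT delta
}

message DeltaAck {
  bool accepted = 1;
}
```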
3. Kafka
3.1 How Kafka could be useful to this solution?
To ensure message ordering, delivery guarantees, and idempotency we can use Kafka. It delivers messages in order within each partition and supports three delivery guarantees: at most once, at least once, and exactly once. For idempotency we can rely on the exactly-once guarantee, which in Kafka is built on the idempotent producer and transactions. We can ensure scalability by scaling horizontally with topic partitions.
Characteristics that justify this decision:
- Durability: Messages are written to disk and replicated across brokers.
- Replayability: Consumers can reprocess events from any committed offset.
- Scalability: Topics can be partitioned and processed concurrently by multiple consumers.
- Ordering: Message order is preserved within each partition.
- Fault tolerance: If a service fails, another consumer can resume processing from the last committed offset.
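Because exactly-once delivery in Kafka is scoped to transactional produce/consume, a common complementary pattern is to make consumers idempotent, detecting redeliveries by event id under at-least-once delivery. A minimal in-memory sketch (IdempotentConsumer and the id scheme are illustrative; in production the seen-id set would live in a shared store such as Redis):

```typescript
// Consumer that applies each event at most once, even if the broker
// redelivers it after a rebalance or retry.
class IdempotentConsumer {
  private seen = new Set<string>();
  public applied: string[] = [];

  // Returns true when the event was applied, false when it was a duplicate.
  handle(eventId: string, payload: string): boolean {
    if (this.seen.has(eventId)) {
      return false; // already processed, skip the redelivery
    }
    this.seen.add(eventId);
    this.applied.push(payload); // apply the event exactly once
    return true;
  }
}
```

Combined with at-least-once delivery, this gives effectively-once processing without requiring transactions end to end.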
4. Handling backpressure
Backpressure can be mitigated across multiple layers of the system.
4.1 Client layer
At the client layer, it can be controlled by applying debounce or throttling techniques before emitting new messages.
The debounce technique delays message emission until a defined period has passed without new inputs. In this example, a message is sent only if the user stops triggering new events for 1 second:
```typescript
// Message and kafkaClient are assumed to be defined elsewhere.
let timeout: NodeJS.Timeout | undefined;
const DEBOUNCE_INTERVAL = 1000; // ms of silence required before sending

function sendMessage(message: Message): void {
  // Reset the timer on every call; only the last message in the
  // quiet window is actually emitted.
  clearTimeout(timeout);
  timeout = setTimeout(async () => {
    await kafkaClient.emit(message);
  }, DEBOUNCE_INTERVAL);
}
```
The throttling technique limits how frequently messages can be sent, regardless of how many events occur. In this example, messages can be emitted at most once every 2 seconds:
```typescript
// Message and kafkaClient are assumed to be defined elsewhere.
let lastSent = 0;
const THROTTLE_INTERVAL = 2000; // minimum ms between emissions

async function sendMessage(message: Message): Promise<void> {
  const now = Date.now();
  // Drop the message unless enough time has passed since the last send.
  if (now - lastSent >= THROTTLE_INTERVAL) {
    lastSent = now;
    await kafkaClient.emit(message);
  }
}
```
4.2 Backend layer
At the backend layer, it can be addressed by scaling out consumers, optimizing processing throughput, or implementing flow control mechanisms to balance producer and consumer rates.
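One simple flow-control mechanism at this layer is a token bucket that caps the rate at which messages are pulled from the pipeline: each processed message consumes a token, and tokens refill at a fixed rate. A minimal sketch (class and parameter names are illustrative; the clock is injectable for testing):

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,        // burst size
    private refillPerSecond: number, // sustained rate
    private now: () => number = () => Date.now(),
  ) {
    this.tokens = capacity;
    this.lastRefill = this.now();
  }

  tryConsume(): boolean {
    // Refill proportionally to the time elapsed since the last check.
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond,
    );
    this.lastRefill = current;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // capacity available, process the message
    }
    return false;  // over the limit, caller should pause or buffer
  }
}
```

A consumer would call tryConsume() before processing each record and pause its fetch loop when it returns false.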
4.3 Kafka layer
Kafka has native backpressure support on both the producer and consumer sides.
- On the producer side, if the broker or the network becomes slow, the producer's internal buffer fills up; send() then blocks for up to max.block.ms and finally throws an exception if space does not free up. Settings such as buffer.memory, linger.ms, batch.size, and max.request.size control the sending rate and batching behavior.
- On the consumer side, if the consumer does not commit offsets fast enough, consumer lag increases; the consumer can be paused and resumed, or max.poll.records tuned, to match processing capacity.
5. Connection management & Load balancing
Hybrid solution: WebSockets + Kafka
5.1 WebSockets connection management
WebSockets establish a bidirectional connection between clients and servers.
- The connection remains open as long as the client is online.
- You can send and receive messages in real time without opening new HTTP requests.
- It is ideal for this use case of continuous collaboration, since operations (CRDT deltas) need to be propagated with low latency (under 200 ms in this case).
When clients detect a network failure, they can store events locally and then, once they come back online, send all stored events to the server to synchronize and ensure eventual consistency.
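The offline-buffering behavior described above can be sketched as follows (OfflineBuffer and the callback shape are illustrative names):

```typescript
type Send = (event: string) => void;

// While the connection is down, events are queued locally; on reconnect
// they are flushed in order so the server can converge.
class OfflineBuffer {
  private queue: string[] = [];
  private online = true;

  constructor(private send: Send) {}

  emit(event: string): void {
    if (this.online) {
      this.send(event);
    } else {
      this.queue.push(event); // store locally while offline
    }
  }

  setOnline(online: boolean): void {
    this.online = online;
    if (online) {
      for (const event of this.queue) this.send(event); // replay in order
      this.queue = [];
    }
  }
}
```

Because the state is CRDT-based, replaying buffered deltas in order is enough for the replicas to converge without a conflict-resolution step.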
WebSockets act as the transport layer for state synchronization but are not responsible for event durability.
5.2 WebSockets load balancing
WebSockets use the TCP protocol, so a Layer 4 Network Load Balancer (NLB) is required to distribute connections across service instances.
NLB provides high throughput and can support millions of simultaneous connections.
Since we are using a CRDT algorithm to handle synchronization and store state locally on clients, sticky sessions are not needed.
In fact, sticky sessions in this scenario could create hot document situations, concentrating load on specific instances.
Instead, a round-robin strategy can be used to automatically distribute connections and events evenly across all available service instances.
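The round-robin strategy can be sketched as a simple selector over available instances (illustrative only; in practice the NLB performs this distribution):

```typescript
// Cycles through service instances, assigning each new connection to
// the next instance in turn.
class RoundRobin<T> {
  private index = 0;

  constructor(private targets: T[]) {}

  next(): T {
    const target = this.targets[this.index];
    this.index = (this.index + 1) % this.targets.length;
    return target;
  }
}
```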
5.3 Kafka load balancing
For Kafka we need to use partitions and consumer groups to handle load balancing.
It is important to choose an appropriate partition key to prevent hot partitions and ensure the load is evenly balanced across all partitions within a topic.
Consumer group coordination automatically distributes partitions among consumers to balance the load.
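Keyed partitioning can be sketched as hashing the document id, so that all events for one document land on the same partition (preserving per-document ordering) while different documents spread across partitions. The hash function below is illustrative; Kafka's default partitioner uses murmur2 on the key:

```typescript
// Maps a document id to a partition deterministically: same key,
// same partition, hence per-document ordering.
function partitionFor(docId: string, numPartitions: number): number {
  let hash = 0;
  for (const ch of docId) {
    hash = (hash * 31 + ch.charCodeAt(0)) | 0; // simple 32-bit rolling hash
  }
  return Math.abs(hash) % numPartitions;
}
```

Using the document id as the key works as long as no single document dominates traffic; a very hot document would still concentrate load on one partition.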
5.4 Event durability
Kafka serves as the event backbone between services, providing critical guarantees that WebSocket alone cannot offer:
Kafka provides durability because messages are written to disk and replicated across brokers. If a service fails, another consumer can resume processing from the last committed offset.
This combination ensures that while WebSockets handle real-time delivery, Kafka guarantees persistence, reliability, and replay across distributed services.