For private documents, only users with permission should be able to access the document.
Is expected to have in-flight and at-rest encryption for documents.
Is expected to have in-flight encryption for operations.
Is expected to implement firewall rules.
Is expected to mitigate the most common attacks (DDoS, XSS, CSRF, and SQL injection).
Is expected to provide authentication and authorization.
Should implement local throttling or debouncing when sending operations to the sync service, to avoid overloading it (an unintentional DDoS).
All document access and modifications should be logged with user identity, timestamp, and operation type.
Audit logs should be retained according to retention policies.
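The local throttling requirement above could be sketched as a small client-side buffer that releases operations in batches, at most once per interval. This is a minimal sketch, not the system's actual client: the `Operation` shape, the `OperationThrottle` class, and the `send` callback are hypothetical names introduced for illustration, and the clock is passed in explicitly so the behaviour is deterministic.

```typescript
// Hypothetical shape of an edit operation; the requirements do not define one.
type Operation = { docId: string; payload: string };

// Buffers local operations and hands them to `send` in batches,
// at most once every `intervalMs` milliseconds.
class OperationThrottle {
  private buffer: Operation[] = [];
  private lastFlush = Number.NEGATIVE_INFINITY;
  private intervalMs: number;
  private send: (batch: Operation[]) => void; // assumed transport to the sync service

  constructor(intervalMs: number, send: (batch: Operation[]) => void) {
    this.intervalMs = intervalMs;
    this.send = send;
  }

  // Queue an operation and flush right away if the interval has elapsed.
  push(op: Operation, now: number): void {
    this.buffer.push(op);
    this.flushIfDue(now);
  }

  // Also meant to be called from a timer so trailing operations are not stranded.
  flushIfDue(now: number): void {
    if (this.buffer.length === 0) return;
    if (now - this.lastFlush < this.intervalMs) return;
    this.send(this.buffer);
    this.buffer = [];
    this.lastFlush = now;
  }
}
```

In a real client, `now` would typically come from `Date.now()`, and a periodic timer would call `flushIfDue` so that the last keystrokes of a burst are still sent.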
2. Availability
2.1 Balancing high availability and offline support
The system should allow offline users to keep editing without interruption; for them, availability comes down to synchronization and conflict resolution on reconnect.
Availability requirements focus on backend services for real-time synchronization and collaboration, rather than local offline editing.
Multi-region replication with automatic conflict resolution ensures consistency when offline changes are merged.
2.1.1 Recovery Time Objective (RTO)
Should be minimized to ensure quick system restoration after incidents.
2.1.2 Recovery Point Objective (RPO)
Should aim to prevent loss of collaborative operations, using snapshots and replication strategies.
2.2 Distributed architecture
Handle network partitions, latency, and failure scenarios.
Database replication.
Document snapshots and restoration.
Global load balancer.
Global DNS.
2.3 Avoid data loss (CRDT)
Is expected to provide offline support for collaboration, even without internet connectivity.
Is expected to use LSEQ to handle sequential, ordered operations.
The LSEQ algorithm is a good fit for ensuring sequential, ordered, and idempotent operations while keeping the data structure compact enough to support larger documents.
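To make the LSEQ requirement more concrete, the following is a simplified sketch of LSEQ-style identifier allocation: each element gets a variable-length identifier, the number of slots doubles at each depth, and each depth is randomly assigned a "boundary+" or "boundary-" strategy that is then reused. The names `Id`, `LseqAllocator`, `BASE_BITS`, and `BOUNDARY`, and their values, are assumptions made for illustration. A full implementation would also attach a site identifier and counter to each identifier so that concurrent allocations stay unique; that part is omitted here.

```typescript
type Id = number[]; // one digit per depth of the identifier tree

const BASE_BITS = 4; // depth 0 has 2^4 = 16 slots; each deeper level doubles
const BOUNDARY = 10; // maximum distance from the chosen edge of a gap

const BEGIN: Id = [0];                 // sentinel before the first element
const END: Id = [2 ** BASE_BITS - 1];  // sentinel after the last element

// Lexicographic order; a strict prefix sorts before its extensions.
function compareIds(a: Id, b: Id): number {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return a.length - b.length;
}

class LseqAllocator {
  private strategies = new Map<number, boolean>(); // depth -> true for boundary+
  private rng: () => number;

  constructor(rng: () => number = Math.random) {
    this.rng = rng;
  }

  // Pick a strategy for a depth once, then keep it.
  private boundaryPlus(depth: number): boolean {
    let s = this.strategies.get(depth);
    if (s === undefined) {
      s = this.rng() < 0.5;
      this.strategies.set(depth, s);
    }
    return s;
  }

  // Returns an identifier strictly between p and q (requires p < q).
  allocate(p: Id, q: Id): Id {
    const id: Id = [];
    let qBounds = true; // false once the prefix is already below q
    for (let depth = 0; ; depth++) {
      const base = 2 ** (BASE_BITS + depth);
      const lo = depth < p.length ? p[depth] : 0;
      const hi = qBounds && depth < q.length ? q[depth] : base;
      const free = hi - lo - 1; // unused digits strictly between lo and hi
      if (free >= 1) {
        const step = Math.min(BOUNDARY, free);
        const offset = 1 + Math.floor(this.rng() * step);
        id.push(this.boundaryPlus(depth) ? lo + offset : hi - offset);
        return id;
      }
      // No room at this depth: follow p's digit and go one level deeper.
      id.push(lo);
      if (hi > lo) qBounds = false;
    }
  }
}
```

Because identifiers are totally ordered and never reused, replicas that receive the same set of insertions can sort them into the same sequence regardless of delivery order.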
2.4 Fault tolerant
Should be able to handle network partitions, latency, and failure scenarios.
Is expected to use a multi-region strategy.
Is expected to use a fail-over strategy.
Is expected to use data replication strategy.
3. Scalability
Is expected to support from 3 users to 100,000+ concurrent users.
Should enable horizontal scalability.
Should enable distributed sessions across instances.
Should provide partitioned storage for documents.
Vertical scaling may be used as a fallback.
Is expected to implement rate limiting to ensure scalable concurrency.
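One common way to meet the rate-limiting requirement is a token bucket per client. The sketch below is only an illustration of that technique; the requirements do not name an algorithm, and `TokenBucket`, `RateLimiter`, and their parameters are hypothetical. Time is passed in as milliseconds so the logic does not depend on a real clock.

```typescript
// A bucket holds up to `capacity` tokens and refills continuously.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private capacity: number;
  private refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, now: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so short bursts are allowed
    this.lastRefill = now;
  }

  // Returns true and spends one token if the request is allowed.
  tryConsume(now: number): boolean {
    if (now > this.lastRefill) {
      const elapsedSec = (now - this.lastRefill) / 1000;
      this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
      this.lastRefill = now;
    }
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Keeps one bucket per client so that one noisy client cannot starve others.
class RateLimiter {
  private buckets = new Map<string, TokenBucket>();
  private capacity: number;
  private refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  allow(clientId: string, now: number): boolean {
    let bucket = this.buckets.get(clientId);
    if (bucket === undefined) {
      bucket = new TokenBucket(this.capacity, this.refillPerSecond, now);
      this.buckets.set(clientId, bucket);
    }
    return bucket.tryConsume(now);
  }
}
```

With several service instances, the bucket state would need to live in a shared store rather than in process memory; this sketch keeps it local for brevity.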
3.1 Memory usage
The synchronization service (for conflict merging) is expected to use Node.js streams to reduce memory overhead in each instance.
Is expected to use Kafka persisted streams to keep track of checkpoints to avoid data loss in case of any failure or interruption during stream processing.
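The checkpointing idea above can be illustrated without any broker at all. The sketch below uses an in-memory array in place of a Kafka topic and a hypothetical `CheckpointStore` in place of committed offsets; it does not use or imitate the Kafka client API. It shows the core rule: commit the checkpoint only after a record has been processed, so that a restart resumes from the first unprocessed record.

```typescript
type LogRecord = { offset: number; value: string };

// Stand-in for a durable checkpoint store (assumed; in-memory here).
class CheckpointStore {
  private committed = new Map<string, number>();

  // Next offset to read for a consumer group; 0 if nothing was committed.
  load(group: string): number {
    return this.committed.get(group) ?? 0;
  }

  commit(group: string, nextOffset: number): void {
    this.committed.set(group, nextOffset);
  }
}

// Processes records from the last checkpoint onward and returns how many
// were handled. If `handle` throws, the checkpoint stays at the failed record.
function consume(
  log: string[],
  store: CheckpointStore,
  group: string,
  handle: (record: LogRecord) => void,
): number {
  let offset = store.load(group);
  let processed = 0;
  while (offset < log.length) {
    handle({ offset, value: log[offset] });
    offset += 1;
    store.commit(group, offset); // commit only after successful handling
    processed += 1;
  }
  return processed;
}
```

This gives at-least-once delivery: a record whose handler failed partway through is seen again after restart, so handlers should be idempotent.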
3.2 Horizontal scaling
System should support horizontal scaling across multiple instances to handle peak loads.
Is expected to use autoscaling strategy.
Is expected to use Node.js cluster mode to use all available CPU cores on each instance.
4. Performance
Is expected to provide local operations with sub-50ms response times and instantaneous rendering.
Should ensure online synchronization with p99 response times under 200ms for collaborative operations.
Is expected to use a Content Delivery Network (CDN) to optimize delivery to users at the edge.
Is expected to provide edge caching.
Is expected to provide low latency for global users through a multi-region strategy.
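To check the p99 target above against measured latencies, a nearest-rank percentile is enough. The sketch below is one possible way to do it; `percentile` and `meetsLatencyTarget` are hypothetical helper names, and the 200ms default mirrors the synchronization target stated in this section.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// True when the 99th percentile latency is below the target.
function meetsLatencyTarget(samplesMs: number[], targetMs: number = 200): boolean {
  return percentile(samplesMs, 99) < targetMs;
}
```

Note that with 100 samples a single slow request does not break the p99 target, but two do, which is why p99 is usually tracked over large windows.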
5. Consistency
Should maintain data integrity across distributed components.
Is expected to have eventual consistency across multi-region databases and services.
6. Traceability: Version control
Should provide checkpoints with snapshots to enable restorations.
Should provide version control with audit mode (metadata and author).
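The two requirements above fit together naturally: keep an append-only history of operations with author metadata, take a snapshot every so often, and restore any version by replaying operations on top of the nearest earlier snapshot. The following is a minimal sketch under simplifying assumptions: operations are plain text insertions, and `EditOp`, `Snapshot`, `VersionedDocument`, and the `snapshotEvery` parameter are hypothetical names, not part of the specified system.

```typescript
// One recorded edit, with the metadata needed for audit mode.
type EditOp = { seq: number; author: string; timestamp: string; insertAt: number; text: string };
type Snapshot = { seq: number; content: string };

function applyOp(content: string, op: EditOp): string {
  return content.slice(0, op.insertAt) + op.text + content.slice(op.insertAt);
}

class VersionedDocument {
  private ops: EditOp[] = [];
  private snapshots: Snapshot[] = [{ seq: 0, content: "" }];
  private content = "";
  private snapshotEvery: number;

  constructor(snapshotEvery: number = 100) {
    this.snapshotEvery = snapshotEvery;
  }

  // Record an insertion and take a snapshot every `snapshotEvery` operations.
  append(author: string, insertAt: number, text: string, timestamp: string = new Date().toISOString()): EditOp {
    const op: EditOp = { seq: this.ops.length + 1, author, timestamp, insertAt, text };
    this.ops.push(op);
    this.content = applyOp(this.content, op);
    if (op.seq % this.snapshotEvery === 0) {
      this.snapshots.push({ seq: op.seq, content: this.content });
    }
    return op;
  }

  // Rebuild the content as of version `seq`: start from the latest snapshot
  // at or before `seq`, then replay the operations that came after it.
  restore(seq: number): string {
    let base = this.snapshots[0];
    for (const s of this.snapshots) {
      if (s.seq <= seq) base = s;
    }
    let content = base.content;
    for (const op of this.ops) {
      if (op.seq > base.seq && op.seq <= seq) content = applyOp(content, op);
    }
    return content;
  }

  // Full operation history, including author and timestamp of each change.
  history(): readonly EditOp[] {
    return this.ops;
  }
}
```

The snapshot interval trades storage for restore time: more frequent snapshots mean fewer operations to replay.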
7. Observability
Should provide logs, distributed traces, metrics, alarms, and dashboards to monitor the system and support incident response.