Distributed Cloud Computing

Tuesday, April 15, 2025 · 7 min read

Why?

We need cost-effective, failure-resistant systems.

What?

Exception handling: Gracefully restart (catch or swallow)
Communication: Out of order should be independent (catch in order)

Cloud Computing is about embracing failure.

Developer: Unhandled Exception
DevOps: Scaling the number of service instances down
DevOps: Updating service code to a new version
Orchestrators: Moving service code from one region to another
Force majeure: Hardware failures (hard disk, etc.)
(Additional scenarios are implied but not listed)

Conclusion

So, it will happen—embrace it ✔️

Orchestration & Orchestrators

IaaS, FaaS (usually called clusters)
These orchestrators manage VMs for us (their lifecycle):
- Networking
- Health
- Upgrades
- Scaling
- Environment

Examples include:

Azure Functions
Kubernetes

Regions, Availability Zones & Fault Domain

Diagram

flowchart TD
    A[Region]
    AZ1[AZ 1] --> Rack1[Rack # 1]
    Rack1 --> PC1[PC # 1]
    Rack1 --> PC2[PC # 2]
    PC2 --> VM1[VM # 1]
    PC2 --> VM2[VM # 2]
    AZ2[AZ 2] --> Rack2[Rack # 2]
    Rack2 --> VM3[VM # 3]
    B[Apps - Public Endpoint] --> AZ1
    AZ1 --> AZ2
    AZ2 --> B

Definition

A fault domain is a unit of failure.

Hierarchy: Planet / Region / Availability Zone / Rack / PC / VM / Container
Intra-service communication: More fault tolerance = higher latency.

Microservices

Concept

One service is broken down into many services. Each service has its own database.

Diagram

flowchart TD
    Service1[Service 1]
    Service1 --> LocalDatabase[Local Database]
    Service2[Service 2]
    Service2 --> Database1[Internal #1]
    Service2 --> Database2[Internal #2]
    Service3[Service 3]

SLA (Service Level Agreement)

Uptime Percentages

99.99% uptime = $(h \times 260)\ \text{s} / \text{month}$ of downtime
99.999% uptime = $(h \times 26)\ \text{s} / \text{month}$ of downtime
where $h$ = number of (dependent) services.
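The 260 and 26 figures are roughly the seconds of downtime per month that each uptime level allows. A minimal sketch of that arithmetic, assuming a 30-day month; the function names are illustrative, and multiplying by the number of dependent services assumes the worst case where their outages never overlap:

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month


def downtime_seconds_per_month(uptime_percent: float) -> float:
    """Allowed downtime per month for one service at the given uptime."""
    return SECONDS_PER_MONTH * (1 - uptime_percent / 100)


def worst_case_downtime(uptime_percent: float, dependent_services: int) -> float:
    """Worst case for a chain of services: no two outages overlap."""
    return dependent_services * downtime_seconds_per_month(uptime_percent)
```

For example, `downtime_seconds_per_month(99.99)` comes out at about 259 seconds, and `worst_case_downtime(99.99, 5)` is five times that.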

Auto-Scaling Service Instances

Periodically check queue length:
- If growing → scale up, else down
- Or scheduled scaling
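The queue-length rule above can be sketched as a small decision function. This is only an illustration: the one-instance step and the `min_instances`/`max_instances` bounds are assumptions, not part of the notes.

```python
def desired_instances(current: int, previous_queue_length: int,
                      queue_length: int, min_instances: int = 1,
                      max_instances: int = 10) -> int:
    """Scale up by one if the queue grew since the last check, else down by one."""
    if queue_length > previous_queue_length:
        target = current + 1
    else:
        target = current - 1
    # Clamp so we never drop below the floor or exceed the (assumed) ceiling.
    return max(min_instances, min(max_instances, target))
```

An orchestrator would call something like this on a timer and start or stop instances until the count matches the result.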

Diagram

flowchart TD
    Client[Client]
    Client --> Queue[Queue]
    Queue --> Service1[Service 1]
    Queue --> Service2[Service 2]

Periodically check resource usage (e.g., memory or compute; choose a metric):

Diagram

flowchart TD
    LoadBalancer[Load Balancer]
    LoadBalancer --> Service1[Service 1]
    LoadBalancer --> Service2[Service 2]

12 Factor Applications/Services

Single code repo; don't share code with another service
Deploy dependent libs with service
No config in code; read from environment vars
Handle unresponsive service dependencies smartly
Strictly separate build, release & run steps:
- Build: Turn the code repo into a deployable package
- Release: Combine the package with environment config
- Run: Run service in execution environment
Service is one or more stateless processes that share nothing
Service listens on ports; avoid using (web) hosts
Use processes for isolation; multiple processes for concurrency
Processes can crash/be killed quickly & start fast
Keep dev, staging & prod environments similar
Log to stdout (dev: console; prod: file, then archive it)
Deploy & run admin tasks (scripts) as processes
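As an illustration of the "no config in code" factor, here is a minimal sketch that builds a config object from environment variables. The variable names `PORT` and `QUEUE_URL` and the `ServiceConfig` fields are hypothetical, chosen only for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServiceConfig:
    port: int
    queue_url: str


def load_config(env: Optional[Mapping[str, str]] = None) -> ServiceConfig:
    """Build the config from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    return ServiceConfig(
        port=int(env.get("PORT", "8080")),  # optional, with a default
        queue_url=env["QUEUE_URL"],         # required: fail fast if missing
    )
```

The same build can then run unchanged in dev, staging, and prod; only the environment differs.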

Philosophy

Make it simple
Lightweight
Reproducible builds

Container Images & Containers

Definition

Container image: Immutable; defines a version of a single service with its dependencies.
Container: Runs an image in an isolated environment.

Purpose

They let you trade off isolation against density.

Comparison Table

Hardware	Isolation	Density
PC	❌	❌
VM	✅	❌
Hybrid VM/Containers	✅	✅
Containers	✅	✅
Bare Metal	✅	✅

Service Endpoints

Original:

IP:Port → PC: Service

Now:

PC with many VMs
VM with many containers

Virtualization (hooks) is required to make this work:

Examples:
- Routing Tables
- SNAT/DNAT
- Modification to client code

We can't modify existing network cards, routers, switches, DNS protocols, etc., which makes service addressing/discovery complicated.

What to do now?

Even more problems arise:

We run multiple service instances, so they can go up/down anytime.
Client code shouldn't have to deal with this.

Solution: Reverse Proxy

Examples:
- WebServers
- Load Balancers
- API Gateway

The client stores a known endpoint which is responsible for redirecting to the actual service using DNS.

Diagram:

flowchart LR
    Public -->|Load Balancers| RP
    RP -->|Worker 1| RP-I
    RP -->|Worker 2| RP-O
    RP -->|Worker 3| RP-O
    RP-I -->|Instance 1| Instance1
    RP-I -->|Instance 2| Instance2
    RP-O -->|Instance 1| Instance1
    RP-O -->|Instance 2| Instance2
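A minimal sketch of what the reverse proxy does for the client: it keeps track of which instances are up and picks one per request, so client code never has to. The round-robin policy and the `ReverseProxy` class are assumptions for illustration; real proxies offer many routing policies.

```python
import itertools
from typing import Dict, Iterator, List


class ReverseProxy:
    """Toy instance picker; addresses are plain strings in this sketch."""

    def __init__(self, instances: List[str]):
        self._healthy: Dict[str, bool] = {addr: True for addr in instances}
        self._cycle: Iterator[str] = itertools.cycle(instances)
        self._count = len(instances)

    def mark(self, address: str, healthy: bool) -> None:
        """Record the result of a health check for one instance."""
        self._healthy[address] = healthy

    def pick(self) -> str:
        """Round-robin over instances, skipping any that are down."""
        for _ in range(self._count):
            address = next(self._cycle)
            if self._healthy[address]:
                return address
        raise RuntimeError("no healthy instance available")
```

The client only ever knows the proxy's endpoint; instances can come and go behind it.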

Messaging Fundamentals

Queue-Based Communication:

Queue-based communication is better as it is non-blocking.
Note: Service communicating with clients needs to have a blocking connection with the queue still.

How to make messaging fault-tolerant?

Diagram:

flowchart LR
    A[Service 1] --> Queue
    B[Service 2] --> Queue
    Queue -->|3, 2, 1| Service

Steps:
- Increase the message's dequeue count.
- Hide the message for n seconds.
- If count > threshold, log the bad message and delete it.
- Else, process the message and delete it.
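The steps above can be sketched with a toy in-memory queue. `InMemoryQueue`, `handle_next`, and the default threshold are hypothetical stand-ins for a real queue service, which would do the counting and hiding on the server side.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)  # compare messages by identity, not by content
class Message:
    body: str
    dequeue_count: int = 0
    visible_at: float = 0.0  # hidden from consumers until this time


@dataclass
class InMemoryQueue:
    """Toy stand-in for a real queue service (illustration only)."""
    clock: Callable[[], float] = time.monotonic
    messages: List[Message] = field(default_factory=list)

    def send(self, body: str) -> None:
        self.messages.append(Message(body, visible_at=self.clock()))

    def receive(self, hide_seconds: float) -> Optional[Message]:
        now = self.clock()
        for msg in self.messages:
            if msg.visible_at <= now:
                msg.dequeue_count += 1               # step 1: increase count
                msg.visible_at = now + hide_seconds  # step 2: hide for n seconds
                return msg
        return None

    def delete(self, msg: Message) -> None:
        self.messages.remove(msg)


def handle_next(queue: InMemoryQueue, process: Callable[[str], None],
                poison_log: List[str], threshold: int = 3,
                hide_seconds: float = 30.0) -> bool:
    """Handle one message; returns False if nothing was visible."""
    msg = queue.receive(hide_seconds)
    if msg is None:
        return False
    if msg.dequeue_count > threshold:
        poison_log.append(msg.body)  # step 3: log the bad message, delete it
        queue.delete(msg)
        return True
    process(msg.body)   # step 4: if this raises, the message is not deleted
    queue.delete(msg)   # and becomes visible again after hide_seconds
    return True
```

If the service crashes while processing, the message simply reappears once its hide period ends, and the count stops a bad message from being retried forever.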

Service Upgrade & Config

Versioning:

Blue-Green Deployment (Netflix calls it Red-Black).

Diagram:

flowchart LR
    Cluster1[Cluster] --> RollingUpdate[Rolling Update]
    Cluster2[Cluster] --> RollingUpdate
    RollingUpdate --> V1
    RollingUpdate --> V2
    V1 -->|Controlled Migration| V2

Steps:
- Delete & Upload.
- Downtime: At some instant, nothing may be running.
- Reduce speed during migration.

How to Do Shutdown Gracefully?

Steps:

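Use an integer representing "in-flight" requests; initialize it to 1.
Increment it when a request starts; decrement it when a request completes.
Answer future LB probes with "not ready."
Wait some time (e.g., 30s); decrement the integer. When it reaches 0, exit.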

Shutdown Modes:

SIGTERM: Graceful.
SIGKILL: Forceful.
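A minimal sketch of the in-flight counter, wired to SIGTERM. The `GracefulShutdown` class and its method names are hypothetical; it also assumes the counter goes up when a request starts and down when it completes, and that the process exits once the counter reaches 0.

```python
import signal
import threading
import time


class GracefulShutdown:
    def __init__(self, drain_seconds: float = 30.0):
        self._lock = threading.Lock()
        self._in_flight = 1            # starts at 1 so it cannot hit 0 early
        self._ready = True
        self._drain_seconds = drain_seconds
        self.done = threading.Event()  # set when it is safe to exit

    def is_ready(self) -> bool:
        """What the load balancer's health probe should report."""
        return self._ready

    def request_started(self) -> None:
        with self._lock:
            self._in_flight += 1

    def request_finished(self) -> None:
        self._decrement()

    def begin_shutdown(self) -> None:
        self._ready = False              # future LB probes see "not ready"
        time.sleep(self._drain_seconds)  # let the LB stop sending traffic
        self._decrement()                # release the initial 1

    def _decrement(self) -> None:
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self.done.set()

    def install(self) -> None:
        """On SIGTERM, start draining in a thread so the handler returns at once."""
        signal.signal(
            signal.SIGTERM,
            lambda signum, frame: threading.Thread(target=self.begin_shutdown).start(),
        )
```

The main thread would call `install()` at startup and then block on `done.wait()` before exiting. SIGKILL cannot be caught, so none of this runs in the forceful case.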

Reconfiguration

Notes:

Very hard to maintain.

Recommendations:

If the config is held in a single object, ask the orchestrator to signal the processes when the config changes, then rebuild that object.
Otherwise, shut down & restart.

Leader Election

Examples:

- Consensus algorithms (Raft, Paxos, etc.).
- Via lease.
- Via queue message.
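The lease option can be sketched as follows: whoever holds an unexpired lease is the leader and must keep renewing it. `LeaseStore` is a hypothetical in-memory stand-in; a real lease would live in a shared, consistent store, and this sketch ignores clock skew between machines.

```python
import time
from typing import Callable, Optional


class LeaseStore:
    """Toy in-memory lease (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._holder: Optional[str] = None
        self._expires_at = 0.0

    def try_acquire(self, candidate: str, duration: float) -> bool:
        """Acquire or renew the lease; fails while another holder's lease is live."""
        now = self._clock()
        if self._holder in (None, candidate) or now >= self._expires_at:
            self._holder = candidate
            self._expires_at = now + duration
            return True
        return False

    def leader(self) -> Optional[str]:
        """Current leader, or None if the lease has expired."""
        return self._holder if self._clock() < self._expires_at else None
```

If the leader crashes, it stops renewing, the lease expires, and another instance takes over on its next attempt.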

Data Storage Service Considerations

Notes:

Building & managing reliable & scalable services with state is very hard because of:
1. Data Size.
2. Replication.
3. Security.

Storage Services Overview

Notes:

We already have existing robust hardened storage services.

Trade-offs in Multiple Storage Services:

I. Cache

Fast but can be redundant.
Example: Azure Cache.

II. File (Blob & Object) Storage Service

Fast & inexpensive.
Example: Storage Account.

III. Database Storage Services

Relational (SQL).
Non-relational (NoSQL).

Requirements:

Partitioning & Replication:
- Partitioning is driven by size & speed requirements.
- Replication is driven by reliability requirements.
Consistency:
- Strong = ACID.
- Weak = Eventual (BASE).

BASE:

Basically Available, Soft State, Eventual Consistency.

Schemas for Data Storage Service

Use formal, language-agnostic data schemas.
All data must specify version info starting with v1.
New services must be infinitely backward compatible.
During rolling updates, v1 & v2 instances run together.
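A minimal sketch of versioned data, showing how a v2 instance can still read what a v1 instance wrote during a rolling update. JSON and the `name`/`email` fields are hypothetical choices for the example; a formal schema language would normally define them.

```python
import json


def encode_v2(name: str, email: str) -> str:
    """v2 writer: adds an 'email' field on top of v1's 'name'."""
    return json.dumps({"version": 2, "name": name, "email": email})


def decode(raw: str) -> dict:
    """v2 reader that still accepts v1 payloads."""
    data = json.loads(raw)
    version = data.get("version")
    if version == 1:
        # The field did not exist in v1, so fill in a neutral default.
        return {"name": data["name"], "email": None}
    if version == 2:
        return {"name": data["name"], "email": data["email"]}
    raise ValueError(f"unsupported schema version: {version!r}")
```

Because every payload carries its version, the reader never has to guess which shape it is looking at.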

Backup & Restore

Potential issues include code bugs or backup attacks.
1. Periodically back up data in order to restore it to a known good state.
2. Restoration may still result in some data loss.

Disaster Recovery

Batch up changes in the data storage service.
- Replicate to other cluster(s).
- More clusters → More resilient/extensive/fast.

Diagram

flowchart TD
    A[Batch up changes in data storage service]
    B[Replicate to other clusters]
    C[More clusters → More resilient/extensive/fast]
    A --> B --> C
