Why?
We need cost-effective, failure-resistant systems.
What?
- Exception handling: Gracefully restart (catch or swallow)
- Communication: Out of order should be independent (catch in order)
Cloud computing is about embracing failure. Service instances stop for many reasons:
- Developer: Unhandled Exception
- DevOps: Scaling the number of service instances down
- DevOps: Updating service code to a new version
- Orchestrators: Moving service code from one region to another
- Force majeure: Hardware failures (hard disk, etc.)
Conclusion
Failure will happen, so embrace it ✔️
Orchestration & Orchestrators
- IaaS, FaaS (usually called clusters)
- These orchestrators manage VMs for us (their lifecycle):
- Networking
- Health
- Upgrades
- Scaling
- Environment
Examples include:
- Azure Functions
- Kubernetes
Regions, Availability Zones & Fault Domain
Diagram
flowchart TD
A[Region]
AZ1[AZ 1] --> Rack1[Rack # 1]
Rack1 --> PC1[PC # 1]
Rack1 --> PC2[PC # 2]
PC2 --> VM1[VM # 1]
PC2 --> VM2[VM # 2]
AZ2[AZ 2] --> Rack2[Rack # 2]
Rack2 --> VM3[VM # 3]
A --> AZ1
A --> AZ2
B[Apps - Public Endpoint] --> A
Definition
A fault domain is a unit of failure.
- Hierarchy: Planet / Region / Availability Zone / Rack / PC / VM / Container
- Intra-service communication: spreading instances across more fault domains gives more fault tolerance but higher latency.
Microservices
Concept
One service is broken down into many services. Each service has its own database.
Diagram
flowchart TD
Service1[Service 1]
Service1 --> LocalDatabase[Local Database]
Service2[Service 2]
Service2 --> Database1[Internal #1]
Service2 --> Database2[Internal #2]
Service3[Service 3]
SLA (Service Level Agreement)
Uptime Percentages
- 99.99% =
- 99.999% =
Where h = number of (dependent) services.
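The arithmetic behind uptime percentages can be sketched as follows. This is a minimal illustration, not the formula from the original notes: it assumes a 365-day year, and it assumes that the `h` dependent services fail independently, so their availabilities multiply. The function names are made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for an availability such as 0.9999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def composite_availability(availability: float, h: int) -> float:
    """Availability of a request that passes through h dependent services.

    Assumption: failures are independent, so availabilities multiply.
    """
    return availability ** h


if __name__ == "__main__":
    for sla in (0.9999, 0.99999):
        print(f"{sla:.3%}: {downtime_minutes_per_year(sla):.2f} min/year")
    print(f"5 services at 99.99% each: {composite_availability(0.9999, 5):.4%}")
```

Under these assumptions, 99.99% allows roughly 52.6 minutes of downtime per year and 99.999% roughly 5.3 minutes, and every extra dependent service lowers the end-to-end figure.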
Auto-Scaling Service Instances
- Periodically check queue length:
- If growing → scale up, else down
- Or scheduled scaling
Diagram
flowchart TD
Client[Client]
Client --> Queue[Queue]
Queue --> Service1[Service 1]
Queue --> Service2[Service 2]
- Periodically check resource usage (e.g., memory or compute; choose a metric)
Diagram
flowchart TD
LoadBalancer[Load Balancer]
LoadBalancer --> Service1[Service 1]
LoadBalancer --> Service2[Service 2]
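The queue-length rule above ("if growing, scale up, else down") could look roughly like this. It is a sketch only: the one-instance step size, the `min_instances`/`max_instances` bounds, and the function name are assumptions, and a real orchestrator would apply its own smoothing and cooldowns.

```python
def scaling_decision(samples, current_instances, min_instances=1, max_instances=10):
    """Return the new instance count from recent queue-length samples.

    Rule from the notes: if the queue is growing, scale up, else scale down.
    The bounds and the step of one instance are illustrative assumptions.
    """
    if len(samples) < 2:
        return current_instances  # not enough data to see a trend
    growing = samples[-1] > samples[0]
    target = current_instances + 1 if growing else current_instances - 1
    # Clamp so we never drop below the minimum or exceed the maximum.
    return max(min_instances, min(max_instances, target))
```

A periodic job would collect the last few queue lengths, call `scaling_decision`, and ask the orchestrator for that many instances. The same shape works for resource metrics behind a load balancer.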
12 Factor Applications/Services
- Single code repo; don't share code with another service
- Deploy dependent libs with service
- No config in code; read from environment vars
- Handle unresponsive service dependencies smartly
- Strictly separate build, release & run steps:
- Build
- Release
- Run: Run service in execution environment
- Services are stateless processes that share nothing
- Service listens on ports; avoid using (web) hosts
- Use processes for isolation; run multiple processes for concurrency
- Processes can crash/be killed quickly & start fast
- Keep dev, staging & prod environments similar
- Log to stdout (dev: console; prod: file, then archive it)
- Deploy & run admin tasks (scripts) as processes
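As an illustration of "no config in code; read from environment vars", here is a minimal sketch. The variable names `PORT`, `QUEUE_URL`, and `LOG_LEVEL` and the `ServiceConfig` type are hypothetical, chosen only for this example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    port: int
    queue_url: str
    log_level: str


def load_config(env=None) -> ServiceConfig:
    """Build the config from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    return ServiceConfig(
        port=int(env.get("PORT", "8080")),       # optional, with a default
        queue_url=env["QUEUE_URL"],              # required: fail fast if missing
        log_level=env.get("LOG_LEVEL", "INFO"),  # optional, with a default
    )
```

The same build can then run unchanged in dev, staging, and prod; only the environment differs. Passing a plain dict as `env` keeps the function easy to test.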
Philosophy
- Make it simple
- Lightweight
- Reproducible builds
Container Images & Containers
Definition
- Container image: immutable; defines a version of a single service with its dependencies.
- Container: runs an image in an isolated environment.
Purpose
They let us trade off isolation against density.
Comparison Table
| Hardware | Isolation | Density |
|---|---|---|
| PC | ❌ | ❌ |
| VM | ✅ | ❌ |
| Hybrid VM/Containers | ✅ | ✅ |
| Containers | ✅ | ✅ |
| Bare Metal | ✅ | ✅ |
Service Endpoints
Original:
IP:Port → PC: Service
Now:
IP with many VMs; VM with many containers
Virtualization (hooks) is required to make this work:
- Examples:
- Routing Tables
- SNAT/DNAT
- Modification to client code
We can't modify existing network cards, routers, switches, DNS protocols, etc., which makes service addressing/discovery complicated.
What to do now?
Even more problems arise:
- We run multiple service instances, so they can go up/down anytime.
- Client code shouldn't have to deal with this.
Solution: Reverse Proxy
- Examples:
- WebServers
- Load Balancers
- API Gateway
The client stores one well-known endpoint (found via DNS); the reverse proxy behind that endpoint forwards each request to an actual service instance.
Diagram:
flowchart LR
Public -->|Load Balancers| RP
RP -->|Worker 1| RP-I
RP -->|Worker 2| RP-O
RP -->|Worker 3| RP-O
RP-I -->|Instance 1| Instance1
RP-I -->|Instance 2| Instance2
RP-O -->|Instance 1| Instance1
RP-O -->|Instance 2| Instance2
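The reverse-proxy idea can be sketched in a few lines: clients only ever call the proxy, while instances register and deregister behind it. This is a toy, in-memory illustration with round-robin selection, which is an assumption made here; the class and method names are hypothetical, and real proxies add health checks, retries, and connection handling.

```python
class ReverseProxy:
    """Toy reverse proxy: one known endpoint, many changing instances."""

    def __init__(self):
        self._instances = []  # addresses of instances that are currently up
        self._next = 0        # round-robin cursor

    def register(self, address):
        """Called when a service instance comes up."""
        if address not in self._instances:
            self._instances.append(address)

    def deregister(self, address):
        """Called when a service instance goes down."""
        if address in self._instances:
            self._instances.remove(address)

    def route(self):
        """Pick the instance that should receive the next request."""
        if not self._instances:
            raise RuntimeError("no service instance available")
        address = self._instances[self._next % len(self._instances)]
        self._next += 1
        return address
```

Client code never sees instances appear or disappear; it keeps calling the one endpoint, and only the proxy's instance list changes.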
Messaging Fundamentals
Queue-Based Communication:
- Queue-based communication is better as it is non-blocking.
- Note: Service communicating with clients needs to have a blocking connection with the queue still.
How to make messaging fault-tolerant?
Diagram:
flowchart LR
A[Service 1] --> Queue
B[Service 2] --> Queue
Queue -->|3, 2, 1| Service
- Steps:
- Increase the message's dequeue count.
- Hide the message for n seconds.
- If count > threshold, log the bad message and delete it.
- Else, process the message and delete it.
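The steps above can be sketched against a toy in-memory queue. Everything here is a stand-in: `InMemoryQueue`, `handle_next`, and the default `threshold` and `hide_seconds` values are assumptions for illustration, and the clock is passed in as `now` so the behavior is easy to follow. A real cloud queue provides the dequeue count and hiding itself.

```python
class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0
        self.visible_at = 0.0  # the message is hidden until this time


class InMemoryQueue:
    """Toy stand-in for a cloud queue with a visibility timeout."""

    def __init__(self):
        self.messages = []

    def put(self, body):
        self.messages.append(Message(body))

    def get(self, now, hide_seconds):
        """Return the first visible message, bump its count, and hide it."""
        for msg in self.messages:
            if msg.visible_at <= now:
                msg.dequeue_count += 1
                msg.visible_at = now + hide_seconds
                return msg
        return None

    def delete(self, msg):
        self.messages.remove(msg)


def handle_next(queue, process, now, hide_seconds=30, threshold=3, bad_log=None):
    """Run one iteration of the steps; return a label for what happened."""
    msg = queue.get(now, hide_seconds)
    if msg is None:
        return "empty"
    if msg.dequeue_count > threshold:
        if bad_log is not None:
            bad_log.append(msg.body)  # log the bad ("poison") message
        queue.delete(msg)
        return "poison"
    try:
        process(msg.body)
    except Exception:
        return "failed"  # not deleted: it reappears after hide_seconds
    queue.delete(msg)
    return "processed"
```

If the service crashes mid-processing, the message simply becomes visible again and another instance retries it. The threshold stops one bad message from being retried forever.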
Service Upgrade & Config
Versioning:
- Blue-Green deployment (Netflix calls it Red-Black).
Diagram:
flowchart LR
Cluster1[Cluster] --> RU[Rolling Update]
Cluster2[Cluster] --> RU
RU --> V1
RU --> V2
V1 -->|Controlled Migration| V2
- Steps:
- Delete & Upload.
- Downtime: at some instant, nothing is running.
- Reduce speed during migration.
How to Do Shutdown Gracefully?
Steps:
- Use an integer representing "in-flight" requests; initialize it to 1.
- On shutdown, answer future LB probes with "not ready."
- Wait some time (e.g., 30s), then decrement the integer; exit once it reaches 0.
Shutdown Modes:
- SIGTERM: Graceful.
- SIGKILL: Forceful.
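One way to picture the in-flight counter is the sketch below. It is an assumption-laden illustration rather than a full server: `InFlightTracker` and its method names are hypothetical, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import threading
import time


class InFlightTracker:
    """Counts in-flight requests; starts at 1 so it cannot reach 0 early."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 1  # the extra 1 is released by begin_shutdown()
        self._ready = True
        self.drained = threading.Event()  # set once it is safe to exit

    def is_ready(self):
        """What the load balancer's health probe should report."""
        return self._ready

    def request_started(self):
        with self._lock:
            self._count += 1

    def request_finished(self):
        self._decrement()

    def begin_shutdown(self, wait_seconds=30.0, sleep=time.sleep):
        """Graceful path: stop attracting new traffic, then drain."""
        self._ready = False  # future LB probes see "not ready"
        sleep(wait_seconds)  # give the LB time to notice
        self._decrement()    # release the initial 1

    def _decrement(self):
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self.drained.set()
```

A SIGTERM handler could trigger `begin_shutdown()` while the main thread waits on `drained` before exiting. SIGKILL cannot be caught, so none of this runs on a forceful shutdown.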
Reconfiguration
Notes:
- Very hard to maintain.
Recommendations:
- If the config is held in a single constant object, ask the orchestrator to signal the processes when the config changes, and have them rebuild that object.
- Shutdown & restart.
Leader Election
Examples:
- RAFT, Paxos, etc.
- Via Lease.
- Queue Message.
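Of the options above, election via lease is the simplest to sketch. The `LeaseStore` below is a hypothetical in-memory stand-in, and time is passed in as `now`. In practice the lease would live in shared storage that can perform this check-and-update atomically, which this toy does not attempt.

```python
class LeaseStore:
    """Toy in-memory lease; a real one lives in shared, atomic storage."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, candidate, now, duration):
        """Acquire or renew the lease; return True if candidate is leader."""
        free = self.holder is None or now >= self.expires_at
        if free or self.holder == candidate:
            self.holder = candidate
            self.expires_at = now + duration
            return True
        return False
```

Each instance calls `try_acquire` periodically. The current leader keeps renewing; if it crashes, the lease expires and another instance takes over.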
Data Storage Service Considerations
Notes:
- Building & managing reliable & scalable services with state is very hard because of:
- Data Size.
- Replication.
- Security.
Storage Services Overview
Notes:
- We already have existing robust hardened storage services.
Trade-offs in Multiple Storage Services:
I. Cache
- Fast but can be redundant.
- Example: Azure Cache.
II. File (Blob & Object) Storage Service
- Fast & inexpensive.
- Example: Storage Account.
III. Database Storage Services
- Relational (SQL).
- Non-relational (NoSQL).
Requirements:
- Partitioning & Replication:
- Partitioning is required for size & speed.
- Replication is required for reliability.
- Consistency:
- Strong = ACID.
- Weak (eventual) = BASE.
BASE:
- Basically Available, Soft State, Eventual Consistency.
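Partitioning and replication can be illustrated with a small placement sketch. It assumes simple hash partitioning and replicas on consecutive nodes; both are choices made for this example, not something the notes prescribe, and the function names are hypothetical.

```python
import hashlib


def partition_for(key: str, partition_count: int) -> int:
    """Map a key to a partition using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def replicas_for(partition: int, nodes: list, replica_count: int) -> list:
    """Place a partition's replicas on consecutive, distinct nodes."""
    if replica_count > len(nodes):
        raise ValueError("not enough nodes for the requested replicas")
    return [nodes[(partition + i) % len(nodes)] for i in range(replica_count)]
```

Partitioning spreads data and load across nodes (size and speed). Replication keeps several copies of each partition so that losing one node does not lose the data (reliability).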
Schemas for Data Storage Service
- Use formal, language-agnostic data schemas.
- All data must specify version info starting with v1.
- New services must be infinitely backward compatible.
- During rolling updates, v1 & v2 instances run together.
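A versioned record reader might look like the sketch below. The schema is invented for illustration: v1 stores a single `name` field and v2 splits it into `first_name` and `last_name`. JSON is used only as one example of a language-agnostic format.

```python
import json


def read_customer(raw: str) -> dict:
    """Parse a stored record of any known version into the current shape.

    Hypothetical schema: v1 has "name"; v2 has "first_name"/"last_name".
    """
    record = json.loads(raw)
    version = record["version"]  # every record must carry its version
    if version == 1:
        # Upgrade old data on read so v2 code stays backward compatible.
        first, _, last = record["name"].partition(" ")
        return {"version": 2, "first_name": first, "last_name": last}
    if version == 2:
        return record
    raise ValueError(f"unsupported schema version: {version}")
```

Because the version travels with the data, new code can keep reading old records. During a rolling update, v1 instances may still encounter v2 data, so they need their own strategy for it as well.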
Backup & Restore
- Data may be corrupted by a code bug or an attack.
- Periodically back up data in order to restore it to a known good state.
- Restoration may still result in some data loss.
Disaster Recovery
- Batch up changes in the data storage service.
- Replicate to other cluster(s).
- More clusters → More resilient/extensive/fast.
Diagram Placeholder
flowchart TD
A[Batch up changes in data storage service]
B[Replicate to other clusters]
C[More clusters → More resilient/extensive/fast]
A --> B --> C