Deep Architecture · Feynman Edition

What actually happens when you run a container?

A ruthlessly honest, kernel-to-cluster breakdown of Docker and Kubernetes — not the docs version, the real version. Every syscall, every reconciliation loop, every deployment manifest.

namespaces
cgroups
overlay FS
OCI
etcd
kube-apiserver
kubelet
control plane
CI/CD → GitOps
01 · The Foundation

Linux Kernel: The Real Container Engine

💡 Feynman Says

Docker didn't invent containers. Linux did — years before Docker existed. Docker is really just a user-friendly wrapper around kernel features. If you can't explain why a container is NOT a VM, you don't understand containers yet.

When you type docker run nginx, you're asking the kernel to do three specific things simultaneously. Let's dissect each one. No hand-waving.

Linux Kernel — Isolation Primitives Used by Containers
User Space
Your app lives here
nginx process
python app
redis-server
your code
Namespaces
What you can see
pid ns → own process tree
net ns → own network stack
mnt ns → own filesystem view
uts ns → own hostname
ipc ns → own IPC resources
user ns → own UID/GID mapping
cgroups v2
What you can use
cpu.max → CPU throttle
memory.max → RAM cap
io.max → disk I/O limit
pids.max → process count
cpuset.cpus → CPU pinning
seccomp / LSM
What you can do
seccomp-bpf → syscall filter
AppArmor / SELinux → MAC
capabilities → drop root
Linux Kernel
syscalls, VFS, TCP/IP
clone()
unshare()
setns()
pivot_root()
mount()
seccomp()

Namespaces — "What You Can See"

A namespace wraps a global resource and presents processes inside it with the illusion that they have their own isolated instance. The key syscall is clone(CLONE_NEWPID | CLONE_NEWNET | ...).

When a PID namespace is created, the first process inside it gets PID 1 — even though the host kernel sees it as, say, PID 7823. Two different realities, same machine. That's the trick.

💡 Feynman Says

Think of namespaces like one-way mirrors in interrogation rooms. Your container thinks it's alone in the universe. The host kernel sees everything.
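Namespace membership is directly observable from userspace: every process's namespaces appear as symlinks under /proc/&lt;pid&gt;/ns/, and two processes share a namespace exactly when the link targets match. A minimal, Linux-only sketch (no privileges needed):

```python
import os

# Each entry under /proc/self/ns/ is a symlink whose target names the
# namespace type and its inode, e.g. "pid:[4026531836]". Processes that
# share a namespace see identical targets here.
for ns in ("pid", "net", "mnt", "uts", "ipc", "user"):
    print(f"{ns:>4} -> {os.readlink(f'/proc/self/ns/{ns}')}")
```

Run this inside a container and compare with the host's view of the same process: the inode numbers differ for every namespace the container was given.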

cgroups v2 — "What You Can Use"

Control groups enforce resource accounting and limits. The cgroup hierarchy is exposed via a virtual filesystem at /sys/fs/cgroup/. Every container gets its own subtree.

Inspect a container's cgroup limits
# Find container's cgroup path
cat /proc/$(docker inspect \
  --format='{{.State.Pid}}' nginx)/cgroup

# Read memory limit directly
cat /sys/fs/cgroup/system.slice/\
  docker-<id>.scope/memory.max
# → 268435456 (256 MiB)
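The same limits can be read from inside the process itself, with no docker CLI at all. A hedged sketch (cgroup v2 layout assumed; on v1 hosts, or at the root cgroup, the memory.max file won't exist):

```python
import os

# /proc/self/cgroup has lines of "hierarchy-id:controllers:path".
# On a pure cgroup v2 host there is a single line: "0::/some/path".
with open("/proc/self/cgroup") as f:
    for line in f:
        hier, _, path = line.strip().split(":", 2)
        if hier == "0":  # the v2 unified hierarchy
            mem_max = f"/sys/fs/cgroup{path}/memory.max"
            if os.path.exists(mem_max):
                print(open(mem_max).read().strip())  # bytes, or "max"
            else:
                print("no memory.max here (root cgroup or cgroup v1)")
```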
02 · Docker Architecture

The 5-Component Runtime Stack

Most people think "Docker" is one thing. It's actually a layered stack of 5 distinct components, each with a specific contract. Understanding their boundaries is how you debug production issues at 3am.

Docker Component Stack — Who Calls Who
🖥️
docker CLI
REST API calls to Docker daemon over Unix socket
↓ HTTP REST /var/run/docker.sock
⚙️
dockerd (Docker Daemon)
Image management, network, volumes, build cache. Delegates container lifecycle to containerd.
↓ gRPC /run/containerd/containerd.sock
📦
containerd
Industry-standard container runtime. Pulls images, manages snapshots, creates containers via shim.
↓ OCI Runtime Spec exec()
🔧
runc (OCI Runtime)
Calls clone(), mount(), pivot_root(), execve(). Sets up namespaces + cgroups. Then exits.
↓ Linux syscalls
🐧
Linux Kernel
Namespaces + cgroups + OverlayFS. The actual isolation happens here.
💡 Feynman Says

Why does runc exit after starting the container? Because the container's main process (PID 1 in its namespace) is now running. runc's job is done. It's like a rocket stage — it fires, does its job, detaches. The payload keeps flying.

The containerd-shim: The Unsung Hero

Between containerd and your container lives a tiny process called containerd-shim-runc-v2, one shim per container. It survives even if containerd restarts; without it, containers would die whenever containerd did. The shim owns the container's stdin/stdout pipes and reports exit codes back to containerd when it reconnects.
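You can verify this parentage yourself by climbing /proc. A Linux-only sketch that builds a process's ancestor chain (field layout per proc(5)); run it against a containerized PID on the host and containerd-shim-runc-v2 should appear as a parent:

```python
import os

def ancestor_chain(pid: int) -> list[int]:
    """Climb the process tree by reading the ppid from /proc/<pid>/stat."""
    chain = [pid]
    while pid > 1:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
        # The comm field may contain spaces; fields after its closing ')'
        # are fixed-position. The second of those fields is the ppid.
        ppid = int(stat.rsplit(")", 1)[1].split()[1])
        if ppid == 0:
            break
        chain.append(ppid)
        pid = ppid
    return chain

print(ancestor_chain(os.getpid()))  # e.g. [1234, 880, 1]
```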

03 · Images & OverlayFS

How Layers Actually Work on Disk

Docker images are stored as a stack of read-only layers using OverlayFS — a Linux union filesystem. When a container starts, one writable layer is added on top (the upperdir).

OverlayFS Mount Structure
merged/ (container view)
What the container sees — unified view of all layers
upperdir/ (container layer)
Writable. New files written here. Deleted files marked with "whiteout" entries.
lowerdir[n] — Image Layer N (top)
Read-only. e.g. your app code (COPY layer)
lowerdir[n-1] — Image Layer N-1
Read-only. e.g. apt-get install packages
lowerdir[0] — Base Image Layer
e.g. ubuntu:22.04 filesystem
The copy-on-write mechanism: when a container modifies a file from a lower layer, the entire file is first copied up to upperdir, then modified. This is why the first write to a large file inherited from the image is slow: the whole file is copied before a single byte changes.
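The lookup rule itself can be modeled in a few lines. This is a toy model (plain dicts, not a real mount) of how the merged view resolves a name: upperdir wins, whiteouts hide, lower layers are searched top-down:

```python
# Toy model of OverlayFS name resolution — not a real mount.
WHITEOUT = object()  # stands in for overlayfs's whiteout device entries

def resolve(name, upper, lowers):
    """upper is the writable layer; lowers[0] is the topmost image layer."""
    if name in upper:
        entry = upper[name]
        return None if entry is WHITEOUT else entry  # whiteout hides the file
    for layer in lowers:
        if name in layer:
            return layer[name]
    return None

base = {"/etc/motd": "ubuntu 22.04", "/bin/sh": "dash"}
app = {"/app/main.py": "v1"}
upper = {"/app/main.py": "v2 (copied up, then modified)",
         "/etc/motd": WHITEOUT}  # file was rm'd inside the container

print(resolve("/app/main.py", upper, [app, base]))  # the modified copy wins
print(resolve("/etc/motd", upper, [app, base]))     # None — whited out
print(resolve("/bin/sh", upper, [app, base]))       # dash — read from base
```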
04 · Kubernetes Architecture

The Reconciliation Engine

Kubernetes is not a container manager. It's a desired-state reconciliation engine that happens to manage containers.
— The most important sentence in this document
☸ Kubernetes Cluster
Control Plane
kube-apiserver
REST API gateway
AuthN / AuthZ / Admission
Writes to etcd only
Every kubectl command hits this. No business logic — just validate, persist, return.
etcd
Distributed KV store
Raft consensus
Source of truth
All cluster state lives here. If etcd goes down, nothing can change, though existing pods keep running. Back this up.
kube-scheduler
Pod → Node binding
Filters + Scorers
Watches unbound pods
Watches for pods with no nodeName. Scores nodes. Writes the binding back through the API server.
controller-manager
ReplicaSet controller
Deployment controller
Node controller
~30 controllers in one binary, each running its own reconciliation loop.
Worker Nodes
kubelet
Node agent
Pod lifecycle via CRI
Reports node status
pod: nginx
pod: api
kube-proxy
Service networking
iptables / IPVS rules
Load balancing
Programs iptables/IPVS to route ClusterIP traffic to pod endpoints.
container runtime
containerd (CRI)
runc (OCI)
Same as standalone Docker
The exact same stack as Docker — kubelet just calls containerd's gRPC API directly.

The Journey of kubectl apply

① kubectl sends HTTP PATCH to kube-apiserver

kubectl serializes your YAML to JSON, sends a REST request. The API server runs: Authentication → Authorization (RBAC) → Admission controllers (webhooks can mutate/validate) → Validation against the OpenAPI schema → Persist to etcd.

② Deployment Controller wakes up

The Deployment controller in controller-manager has a watch on Deployment resources. The API server delivers a watch event. The controller reconciles: desired replicas = 3, current ReplicaSets = 0. It creates a new ReplicaSet with the pod template plus a pod-template-hash label.

③ ReplicaSet Controller creates Pod objects

The ReplicaSet controller sees it owns 0 pods but wants 3. It creates 3 Pod objects in etcd with status.phase: Pending and no spec.nodeName. These pods don't exist anywhere yet — they're just records in a database.

④ Scheduler binds pods to nodes

The scheduler watches for unscheduled pods. It runs filter plugins (nodeSelector, taints, resource availability) and score plugins (spreading, affinity), picks a winner, and writes the binding (spec.nodeName: worker-2) through the API server.

⑤ kubelet on worker-2 notices its pod

kubelet watches pods assigned to its node. It calls containerd's CRI gRPC API: RunPodSandbox (create the pause container and network namespace), PullImage, CreateContainer, StartContainer. It then updates the pod status through the API server.

⑥ CNI plugin sets up networking

kubelet calls the CNI plugin (Calico/Cilium/Flannel) to assign an IP from the pod CIDR, set up veth pairs, configure routing. Now the pod has an IP and can communicate with the cluster network.

⑦ Pod is Running — Endpoints updated

The Endpoints controller detects the pod is ready (readinessProbe passes) and adds its IP:port to the Service's Endpoints object. kube-proxy programs iptables rules. Traffic can now reach the pod.
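Every controller in the journey above runs the same shape of loop. A minimal sketch of reconciliation (illustrative names, not the real controller code): compare desired state with observed state and take the smallest step toward convergence.

```python
# Minimal sketch of a reconcile loop: observe current state, compare with
# desired state, and issue the smallest set of create/delete actions.
def reconcile(desired: int, current: list[str]) -> list[str]:
    pods = list(current)
    while len(pods) < desired:
        pods.append(f"pod-{len(pods)}")  # "create" a missing pod
    while len(pods) > desired:
        pods.pop()                       # "delete" a surplus pod
    return pods

print(reconcile(3, []))                  # ['pod-0', 'pod-1', 'pod-2']
print(reconcile(2, ["a", "b", "c"]))     # ['a', 'b']
print(reconcile(3, reconcile(3, [])))    # converged state is left unchanged
```

The key property is idempotence: running the loop against an already-converged state changes nothing, which is why controllers can simply re-run on every watch event.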

Deep Concepts

The Pause Container — Why does every pod have a mystery container?


etcd Watch API — How controllers get notified without polling


Services, ClusterIP and iptables — The magic of virtual IPs


Rolling Updates — Zero-downtime deployment mechanics

05 · CI/CD Pipeline

From git push to Running Pod

CI/CD is not just "automated testing". At the architectural level, it's a trust escalation pipeline. Code starts untrusted. Each gate adds trust. By the time it reaches production, it's been verified more thoroughly than any human review could achieve.

PASS ① Source Trigger — git push to main
PASS ② Static Analysis + Linting
PASS ③ Unit + Integration Tests
PASS ④ Docker Build + Push (BuildKit)
PASS ⑤ Trivy Security Scan
GATE ⑥ Deploy to Staging (Kustomize)
PASS ⑦ Smoke Tests + Performance Baseline
GATE ⑧ Production Deploy (GitOps Trigger)
💡 Feynman Says

Notice that the Docker image tag (the SHA) is the communication protocol between CI and CD. CI produces an immutable artifact tagged with a SHA. CD deploys that exact SHA. At any moment you can ask "what's running in prod?" and get back a git commit you can examine. That traceability is the whole point.
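A sketch of that contract (the registry name is hypothetical and the digest scheme illustrative, not Docker's exact one): the tag is derived from immutable inputs, so the same inputs always name the same artifact.

```python
import hashlib

# CI side: derive the image tag from the commit SHA so whatever is running
# can always be traced back to an exact git commit.
git_sha = "3f9c2e7a1b04"                      # illustrative short commit SHA
image_ref = f"registry.example.com/app:{git_sha}"

# A content digest makes the artifact itself verifiable, independent of tags:
artifact = b"FROM nginx:1.25\nCOPY app /app\n"
digest = "sha256:" + hashlib.sha256(artifact).hexdigest()

print(image_ref)
print(digest[:19] + "...")  # registries pin by digest, not by mutable tag
```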

06 · GitOps & Observability

GitOps, ArgoCD, and the Observability Stack

GitOps — Git as Source of Truth

In traditional CD, the pipeline runs kubectl apply directly. If someone does kubectl edit manually, drift is invisible.

GitOps flips this: ArgoCD runs inside the cluster and continuously watches a Git repo. If the cluster drifts from Git, the controller reconciles back. Git is the only way to change production.
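The reconciliation ArgoCD performs can be sketched in a few lines (toy dicts standing in for manifests; the real controller diffs live Kubernetes objects against rendered manifests):

```python
# Toy sketch of GitOps sync: Git manifests are the desired state; anything
# in the cluster that differs is drift and gets corrected or pruned.
def diff(git: dict, cluster: dict) -> list[str]:
    actions = []
    for name, spec in git.items():
        if name not in cluster:
            actions.append(f"create {name}")
        elif cluster[name] != spec:
            actions.append(f"patch {name}")   # manual drift, corrected from Git
    for name in cluster:
        if name not in git:
            actions.append(f"prune {name}")   # deleted in Git, deleted in cluster
    return actions

git = {"deploy/api": {"image": "api:abc123", "replicas": 3}}
cluster = {"deploy/api": {"image": "api:abc123", "replicas": 5},  # kubectl edit drift
           "deploy/old": {"image": "old:1"}}                      # removed from Git
print(diff(git, cluster))  # ['patch deploy/api', 'prune deploy/old']
```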

ArgoCD Application Controller internals


Observability: The Three Pillars

📊
Metrics (Prometheus)

Prometheus scrapes /metrics endpoints. TSDB stores time-series with labels. RED method: Rate (req/s), Errors (error %), Duration (latency histograms). AlertManager routes: severity levels, inhibition rules, Slack/PagerDuty.

📝
Logs (Loki / ELK)

Fluent Bit DaemonSet tails /var/log/containers/. Enriches with k8s metadata. Ships to Loki (label-indexed) or Elasticsearch. Use structured logging and query on demand — don't turn logs into dashboards.

🔍
Traces (OpenTelemetry)

The OTel SDK instruments your code. Each request gets a trace ID propagated through all services via HTTP headers. Spans record each hop; backends like Jaeger or Tempo store and visualize them. Traces answer 'why was THIS request slow?' — metrics can't.
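The propagation mechanics are simple enough to sketch. This uses simplified header names; the real OTel SDK propagates the W3C traceparent header and does far more:

```python
import uuid

# Simplified trace-context propagation: every service reuses the incoming
# trace id and mints a new span id for its own hop.
def start_span(headers: dict) -> dict:
    return {"trace_id": headers.get("x-trace-id") or uuid.uuid4().hex,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": headers.get("x-span-id")}

def inject(span: dict) -> dict:
    """Headers to attach to an outgoing request."""
    return {"x-trace-id": span["trace_id"], "x-span-id": span["span_id"]}

# Service A handles a request, then calls service B:
a = start_span({})           # root span: no incoming context
b = start_span(inject(a))    # child span in service B
print(a["trace_id"] == b["trace_id"], b["parent_id"] == a["span_id"])  # True True
```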

Prometheus PromQL — The Language of Metrics

Production-grade alerting rules (PrometheusRule)
# API error rate > 1% over 5 minutes → PagerDuty
alert: HighErrorRate
expr: |
  sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  /
  sum(rate(http_requests_total[5m])) by (service)
  > 0.01
for: 2m  # must be true for 2 min (avoid flaps)
labels:
  severity: critical

# p95 latency degradation
alert: SlowRequests
expr: |
  histogram_quantile(0.95,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
  ) > 0.5  # 500ms SLO
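histogram_quantile is doing nothing magical: it finds the cumulative bucket that contains the target rank and interpolates linearly inside it. A simplified re-implementation (Prometheus's handling of the first and +Inf buckets is more involved):

```python
# Simplified version of what histogram_quantile(q, ...) computes.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]          # the +Inf bucket holds the total count
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound   # cap at the largest finite bound
            # Linear interpolation inside the bucket containing the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 100 requests: 60 under 100ms, 30 between 100ms and 500ms, 10 up to 1s.
buckets = [(0.1, 60), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

The p95 lands in the (0.5, 1.0] bucket, 5 requests past its lower edge out of the 10 it holds, hence 0.5 + 0.5 × 0.5 = 0.75s.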
⚡ The Complete Mental Model

From git push to Prometheus alert — The Unified Picture

git
Code Commit
→
CI
Build + Test + Scan
→
OCI
Image:SHA → ECR
→
GitOps
Commit SHA
→
ArgoCD
Sync to Cluster
→
K8s
Reconcile Pods
→
Prom
Observe + Alert
ISOLATION LAYER
namespaces + cgroups + OverlayFS = container. Not a VM. Same kernel, isolated view.
ORCHESTRATION LAYER
etcd + controllers + scheduler = desired-state reconciliation. Kubernetes is a database with side effects.
AUTOMATION LAYER
Git SHA = the contract. CI produces it. CD deploys it. GitOps enforces it. Prometheus watches it.
07 · Knowledge Check

Test Your Understanding

18 questions covering every section. Answers are verified server-side. Explanations go deep — read them even when you get it right.
