Deep Architecture · Feynman Edition

What actually happens when you run a container?

A ruthlessly honest, kernel-to-cluster breakdown of Docker and Kubernetes — not the docs version, the real version. Every syscall, every reconciliation loop, every deployment manifest.

namespaces
cgroups
overlay FS
OCI
etcd
kube-apiserver
kubelet
control plane
CI/CD → GitOps
01 · The Foundation

Linux Kernel: The Real Container Engine

💡 Feynman Says

Docker didn't invent containers. Linux did — years before Docker existed. Docker is really just a user-friendly wrapper around kernel features. If you can't explain why a container is NOT a VM, you don't understand containers yet.

When you type docker run nginx, you're asking the kernel to do three specific things simultaneously. Let's dissect each one. No hand-waving.

Linux Kernel — Isolation Primitives Used by Containers
User Space
Your app lives here
nginx process
python app
redis-server
your code
Namespaces
What you can see
pid ns → own process tree
net ns → own network stack
mnt ns → own filesystem view
uts ns → own hostname
ipc ns → own IPC resources
user ns → own UID/GID mapping
cgroups v2
What you can use
cpu.max → CPU throttle
memory.max → RAM cap
io.max → disk I/O limit
pids.max → process count
cpuset.cpus → CPU pinning
seccomp / LSM
What you can do
seccomp-bpf → syscall filter
AppArmor / SELinux → MAC
capabilities → drop root
Linux Kernel
syscalls, VFS, TCP/IP
clone()
unshare()
setns()
pivot_root()
mount()
seccomp()

Namespaces — "What You Can See"

A namespace wraps a global resource and presents processes inside it with the illusion that they have their own isolated instance. The key syscall is clone(CLONE_NEWPID | CLONE_NEWNET | ...).

When a PID namespace is created, the first process inside it gets PID 1 — even though the host kernel sees it as, say, PID 7823. Two different realities, same machine. That's the trick.

💡 Feynman Says

Think of namespaces like one-way mirrors in interrogation rooms. Your container thinks it's alone in the universe. The host kernel sees everything.
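Namespace membership is directly observable from userspace: every process's namespaces appear as symlinks under /proc/&lt;pid&gt;/ns/, and two processes share a namespace exactly when the link targets match. A minimal, Linux-only sketch (no privileges needed):

```python
import os

# Each entry under /proc/self/ns/ is a symlink whose target names the
# namespace type and its inode, e.g. "pid:[4026531836]". Processes that
# share a namespace see identical targets here.
for ns in ("pid", "net", "mnt", "uts", "ipc", "user"):
    print(f"{ns:>4} -> {os.readlink(f'/proc/self/ns/{ns}')}")
```

Run this inside a container and compare with the host's view of the same process: the inode numbers differ for every namespace the container was given.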

cgroups v2 — "What You Can Use"

Control groups enforce resource accounting and limits. The cgroup hierarchy is exposed via a virtual filesystem at /sys/fs/cgroup/. Every container gets its own subtree.

Inspect a container's cgroup limits
# Find container's cgroup path
cat /proc/$(docker inspect \
  --format='{{.State.Pid}}' nginx)/cgroup

# Read memory limit directly
cat /sys/fs/cgroup/system.slice/\
  docker-<id>.scope/memory.max
# → 268435456 (256 MiB)
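The same limits can be read from inside the process itself, with no docker CLI at all. A hedged sketch (cgroup v2 layout assumed; on v1 hosts, or at the root cgroup, the memory.max file won't exist):

```python
import os

# /proc/self/cgroup has lines of "hierarchy-id:controllers:path".
# On a pure cgroup v2 host there is a single line: "0::/some/path".
with open("/proc/self/cgroup") as f:
    for line in f:
        hier, _, path = line.strip().split(":", 2)
        if hier == "0":  # the v2 unified hierarchy
            mem_max = f"/sys/fs/cgroup{path}/memory.max"
            if os.path.exists(mem_max):
                print(open(mem_max).read().strip())  # bytes, or "max"
            else:
                print("no memory.max here (root cgroup or cgroup v1)")
```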
02 · Docker Architecture

The 5-Component Runtime Stack

Most people think "Docker" is one thing. It's actually a layered stack of 5 distinct components, each with a specific contract. Understanding their boundaries is how you debug production issues at 3am.

Docker Component Stack — Who Calls Who
🖥️
docker CLI
REST API calls to Docker daemon over Unix socket
↓ HTTP REST /var/run/docker.sock
⚙️
dockerd (Docker Daemon)
Image management, network, volumes, build cache. Delegates container lifecycle to containerd.
↓ gRPC /run/containerd/containerd.sock
📦
containerd
Industry-standard container runtime. Pulls images, manages snapshots, creates containers via shim.
↓ OCI Runtime Spec exec()
🔧
runc (OCI Runtime)
Calls clone(), mount(), pivot_root(), execve(). Sets up namespaces + cgroups. Then exits.
↓ Linux syscalls
🐧
Linux Kernel
Namespaces + cgroups + OverlayFS. The actual isolation happens here.
💡 Feynman Says

Why does runc exit after starting the container? Because the container's main process (PID 1 in its namespace) is now running. runc's job is done. It's like a rocket stage — it fires, does its job, detaches. The payload keeps flying.

The containerd-shim: The Unsung Hero

Between containerd and your container lives a tiny process called containerd-shim-runc-v2, one shim per container. It survives even if containerd restarts; without it, containers would die whenever containerd did. The shim owns the container's stdin/stdout pipes and reports exit codes back to containerd when it reconnects.
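You can verify this parentage yourself by climbing /proc. A Linux-only sketch that builds a process's ancestor chain (field layout per proc(5)); run it against a containerized PID on the host and containerd-shim-runc-v2 should appear as a parent:

```python
import os

def ancestor_chain(pid: int) -> list[int]:
    """Climb the process tree by reading the ppid from /proc/<pid>/stat."""
    chain = [pid]
    while pid > 1:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
        # The comm field may contain spaces; fields after its closing ')'
        # are fixed-position. The second of those fields is the ppid.
        ppid = int(stat.rsplit(")", 1)[1].split()[1])
        if ppid == 0:
            break
        chain.append(ppid)
        pid = ppid
    return chain

print(ancestor_chain(os.getpid()))  # e.g. [1234, 880, 1]
```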

03 · Images & OverlayFS

How Layers Actually Work on Disk

Docker images are stored as a stack of read-only layers using OverlayFS — a Linux union filesystem. When a container starts, one writable layer is added on top (the upperdir).

OverlayFS Mount Structure
merged/ (container view)
What the container sees — unified view of all layers
upperdir/ (container layer)
Writable. New files written here. Deleted files marked with "whiteout" entries.
lowerdir[n] — Image Layer N (top)
Read-only. e.g. your app code (COPY layer)
lowerdir[n-1] — Image Layer N-1
Read-only. e.g. apt-get install packages
lowerdir[0] — Base Image Layer
e.g. ubuntu:22.04 filesystem
The copy-on-write mechanism: when a container modifies a file from a lower layer, the entire file is first copied up to upperdir, then modified. This is why the first write to a large file inherited from the image is slow: the whole file is copied before a single byte changes.
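The lookup rule itself can be modeled in a few lines. This is a toy model (plain dicts, not a real mount) of how the merged view resolves a name: upperdir wins, whiteouts hide, lower layers are searched top-down:

```python
# Toy model of OverlayFS name resolution — not a real mount.
WHITEOUT = object()  # stands in for overlayfs's whiteout device entries

def resolve(name, upper, lowers):
    """upper is the writable layer; lowers[0] is the topmost image layer."""
    if name in upper:
        entry = upper[name]
        return None if entry is WHITEOUT else entry  # whiteout hides the file
    for layer in lowers:
        if name in layer:
            return layer[name]
    return None

base = {"/etc/motd": "ubuntu 22.04", "/bin/sh": "dash"}
app = {"/app/main.py": "v1"}
upper = {"/app/main.py": "v2 (copied up, then modified)",
         "/etc/motd": WHITEOUT}  # file was rm'd inside the container

print(resolve("/app/main.py", upper, [app, base]))  # the modified copy wins
print(resolve("/etc/motd", upper, [app, base]))     # None — whited out
print(resolve("/bin/sh", upper, [app, base]))       # dash — read from base
```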
04 · Kubernetes Architecture

The Reconciliation Engine

Kubernetes is not a container manager. It's a desired-state reconciliation engine that happens to manage containers.
— The most important sentence in this document
☸ Kubernetes Cluster
Control Plane
kube-apiserver
REST API gateway
AuthN / AuthZ / Admission
Writes to etcd only
Every kubectl command hits this. No business logic — just validate, persist, return.
etcd
Distributed KV store
Raft consensus
Source of truth
All cluster state lives here. If etcd goes down, nothing can change, though existing pods keep running. Back this up.
kube-scheduler
Pod → Node binding
Filters + Scorers
Watches unbound pods
Watches for pods with no nodeName. Scores nodes. Writes the binding back through the API server.
controller-manager
ReplicaSet controller
Deployment controller
Node controller
~30 controllers in one binary, each running its own reconciliation loop.
Worker Nodes
kubelet
Node agent
Pod lifecycle via CRI
Reports node status
pod: nginx
pod: api
kube-proxy
Service networking
iptables / IPVS rules
Load balancing
Programs iptables/IPVS to route ClusterIP traffic to pod endpoints.
container runtime
containerd (CRI)
runc (OCI)
Same as standalone Docker
The exact same stack as Docker — kubelet just calls containerd's gRPC API directly.

The Journey of kubectl apply

① kubectl sends HTTP PATCH to kube-apiserver

kubectl serializes your YAML to JSON, sends a REST request. The API server runs: Authentication → Authorization (RBAC) → Admission controllers (webhooks can mutate/validate) → Validation against the OpenAPI schema → Persist to etcd.

② Deployment Controller wakes up

The Deployment controller in controller-manager has a watch on Deployment resources. The API server delivers a watch event. The controller reconciles: desired replicas = 3, current ReplicaSets = 0. It creates a new ReplicaSet with the pod template plus a pod-template-hash label.

③ ReplicaSet Controller creates Pod objects

The ReplicaSet controller sees it owns 0 pods but wants 3. It creates 3 Pod objects in etcd with status.phase: Pending and no spec.nodeName. These pods don't exist anywhere yet — they're just records in a database.

④ Scheduler binds pods to nodes

The scheduler watches for unscheduled pods. It runs filter plugins (nodeSelector, taints, resource availability) and score plugins (spreading, affinity), picks a winner, and writes the binding (spec.nodeName: worker-2) through the API server.

⑤ kubelet on worker-2 notices its pod

kubelet watches pods assigned to its node. It calls containerd's CRI gRPC API: RunPodSandbox (create the pause container and network namespace), PullImage, CreateContainer, StartContainer. It then updates the pod status through the API server.

⑥ CNI plugin sets up networking

kubelet calls the CNI plugin (Calico/Cilium/Flannel) to assign an IP from the pod CIDR, set up veth pairs, configure routing. Now the pod has an IP and can communicate with the cluster network.

⑦ Pod is Running — Endpoints updated

The Endpoints controller detects the pod is ready (readinessProbe passes) and adds its IP:port to the Service's Endpoints object. kube-proxy programs iptables rules. Traffic can now reach the pod.
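Every controller in the journey above runs the same shape of loop. A minimal sketch of reconciliation (illustrative names, not the real controller code): compare desired state with observed state and take the smallest step toward convergence.

```python
# Minimal sketch of a reconcile loop: observe current state, compare with
# desired state, and issue the smallest set of create/delete actions.
def reconcile(desired: int, current: list[str]) -> list[str]:
    pods = list(current)
    while len(pods) < desired:
        pods.append(f"pod-{len(pods)}")  # "create" a missing pod
    while len(pods) > desired:
        pods.pop()                       # "delete" a surplus pod
    return pods

print(reconcile(3, []))                  # ['pod-0', 'pod-1', 'pod-2']
print(reconcile(2, ["a", "b", "c"]))     # ['a', 'b']
print(reconcile(3, reconcile(3, [])))    # converged state is left unchanged
```

The key property is idempotence: running the loop against an already-converged state changes nothing, which is why controllers can simply re-run on every watch event.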

Deep Concepts

The Pause Container — Why does every pod have a mystery container?


etcd Watch API — How controllers get notified without polling


Services, ClusterIP and iptables — The magic of virtual IPs


Rolling Updates — Zero-downtime deployment mechanics

05 · CI/CD Pipeline

From git push to Running Pod

CI/CD is not just "automated testing". At the architectural level, it's a trust escalation pipeline. Code starts untrusted. Each gate adds trust. By the time it reaches production, it's been verified more thoroughly than any human review could achieve.

PASS ① Source Trigger — git push to main
PASS ② Static Analysis + Linting
PASS ③ Unit + Integration Tests
PASS ④ Docker Build + Push (BuildKit)
PASS ⑤ Trivy Security Scan
GATE ⑥ Deploy to Staging (Kustomize)
PASS ⑦ Smoke Tests + Performance Baseline
GATE ⑧ Production Deploy (GitOps Trigger)
💡 Feynman Says

Notice that the Docker image tag (the SHA) is the communication protocol between CI and CD. CI produces an immutable artifact tagged with a SHA. CD deploys that exact SHA. At any moment you can ask "what's running in prod?" and get back a git commit you can examine. That traceability is the whole point.
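A sketch of that contract (the registry name is hypothetical and the digest scheme illustrative, not Docker's exact one): the tag is derived from immutable inputs, so the same inputs always name the same artifact.

```python
import hashlib

# CI side: derive the image tag from the commit SHA so whatever is running
# can always be traced back to an exact git commit.
git_sha = "3f9c2e7a1b04"                      # illustrative short commit SHA
image_ref = f"registry.example.com/app:{git_sha}"

# A content digest makes the artifact itself verifiable, independent of tags:
artifact = b"FROM nginx:1.25\nCOPY app /app\n"
digest = "sha256:" + hashlib.sha256(artifact).hexdigest()

print(image_ref)
print(digest[:19] + "...")  # registries pin by digest, not by mutable tag
```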

06 · GitOps & Observability

GitOps, ArgoCD, and the Observability Stack

GitOps — Git as Source of Truth

In traditional CD, the pipeline runs kubectl apply directly. If someone does kubectl edit manually, drift is invisible.

GitOps flips this: ArgoCD runs inside the cluster and continuously watches a Git repo. If the cluster drifts from Git, the controller reconciles back. Git is the only way to change production.
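The reconciliation ArgoCD performs can be sketched in a few lines (toy dicts standing in for manifests; the real controller diffs live Kubernetes objects against rendered manifests):

```python
# Toy sketch of GitOps sync: Git manifests are the desired state; anything
# in the cluster that differs is drift and gets corrected or pruned.
def diff(git: dict, cluster: dict) -> list[str]:
    actions = []
    for name, spec in git.items():
        if name not in cluster:
            actions.append(f"create {name}")
        elif cluster[name] != spec:
            actions.append(f"patch {name}")   # manual drift, corrected from Git
    for name in cluster:
        if name not in git:
            actions.append(f"prune {name}")   # deleted in Git, deleted in cluster
    return actions

git = {"deploy/api": {"image": "api:abc123", "replicas": 3}}
cluster = {"deploy/api": {"image": "api:abc123", "replicas": 5},  # kubectl edit drift
           "deploy/old": {"image": "old:1"}}                      # removed from Git
print(diff(git, cluster))  # ['patch deploy/api', 'prune deploy/old']
```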

ArgoCD Application Controller internals


Observability: The Three Pillars

📊
Metrics (Prometheus)

Prometheus scrapes /metrics endpoints. TSDB stores time-series with labels. RED method: Rate (req/s), Errors (error %), Duration (latency histograms). AlertManager routes: severity levels, inhibition rules, Slack/PagerDuty.

📝
Logs (Loki / ELK)

Fluent Bit DaemonSet tails /var/log/containers/. Enriches with k8s metadata. Ships to Loki (label-indexed) or Elasticsearch. Use structured logging and query on demand — don't turn logs into dashboards.

🔍
Traces (OpenTelemetry)

The OTel SDK instruments your code. Each request gets a trace ID propagated through all services via HTTP headers. Spans record each hop; backends like Jaeger or Tempo store and visualize them. Traces answer 'why was THIS request slow?' — metrics can't.
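The propagation mechanics are simple enough to sketch. This uses simplified header names; the real OTel SDK propagates the W3C traceparent header and does far more:

```python
import uuid

# Simplified trace-context propagation: every service reuses the incoming
# trace id and mints a new span id for its own hop.
def start_span(headers: dict) -> dict:
    return {"trace_id": headers.get("x-trace-id") or uuid.uuid4().hex,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": headers.get("x-span-id")}

def inject(span: dict) -> dict:
    """Headers to attach to an outgoing request."""
    return {"x-trace-id": span["trace_id"], "x-span-id": span["span_id"]}

# Service A handles a request, then calls service B:
a = start_span({})           # root span: no incoming context
b = start_span(inject(a))    # child span in service B
print(a["trace_id"] == b["trace_id"], b["parent_id"] == a["span_id"])  # True True
```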

Prometheus PromQL — The Language of Metrics

Production-grade alerting rules (PrometheusRule)
# API error rate > 1% over 5 minutes → PagerDuty
alert: HighErrorRate
expr: |
  sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  /
  sum(rate(http_requests_total[5m])) by (service)
  > 0.01
for: 2m  # must be true for 2 min (avoid flaps)
labels:
  severity: critical

# p95 latency degradation
alert: SlowRequests
expr: |
  histogram_quantile(0.95,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
  ) > 0.5  # 500ms SLO
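histogram_quantile is doing nothing magical: it finds the cumulative bucket that contains the target rank and interpolates linearly inside it. A simplified re-implementation (Prometheus's handling of the first and +Inf buckets is more involved):

```python
# Simplified version of what histogram_quantile(q, ...) computes.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]          # the +Inf bucket holds the total count
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound   # cap at the largest finite bound
            # Linear interpolation inside the bucket containing the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 100 requests: 60 under 100ms, 30 between 100ms and 500ms, 10 up to 1s.
buckets = [(0.1, 60), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

The p95 lands in the (0.5, 1.0] bucket, 5 requests past its lower edge out of the 10 it holds, hence 0.5 + 0.5 × 0.5 = 0.75s.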
⚡ The Complete Mental Model

From git push to Prometheus alert — The Unified Picture

git
Code Commit
→
CI
Build + Test + Scan
→
OCI
Image:SHA → ECR
→
GitOps
Commit SHA
→
ArgoCD
Sync to Cluster
→
K8s
Reconcile Pods
→
Prom
Observe + Alert
ISOLATION LAYER
namespaces + cgroups + OverlayFS = container. Not a VM. Same kernel, isolated view.
ORCHESTRATION LAYER
etcd + controllers + scheduler = desired-state reconciliation. Kubernetes is a database with side effects.
AUTOMATION LAYER
Git SHA = the contract. CI produces it. CD deploys it. GitOps enforces it. Prometheus watches it.
07 · Knowledge Check

Test Your Understanding

18 questions covering every section. Answers are verified server-side. Explanations go deep — read them even when you get it right.
