Docker didn't invent containers. Linux did, years before Docker existed. Docker is a very good, user-friendly wrapper around kernel features. If you can't explain why a container is NOT a VM, you don't understand containers yet.
When you type docker run nginx, you're asking the kernel to do three specific things simultaneously. Let's dissect each one. No hand-waving.
A namespace wraps a global resource and presents processes inside it with the illusion that they have their own isolated instance. The key syscall is clone(CLONE_NEWPID | CLONE_NEWNET | ...).
When a PID namespace is created, the first process inside it gets PID 1, even though the host kernel sees it as PID 7823. Two different realities, same machine. That's the trick.
Think of namespaces like one-way mirrors in interrogation rooms. Your container thinks it's alone in the universe. The host kernel sees everything.
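You can pull the same trick yourself with util-linux's unshare, which wraps these clone() flags. A minimal sketch, assuming a Linux host with unprivileged user namespaces enabled:

```shell
# -U -r: new user namespace, mapped to root inside
# -p -f: new PID namespace, forking so the child lands inside it
# --mount-proc: remount /proc so tools see the new PID namespace
unshare --user --map-root-user --pid --fork --mount-proc sh -c 'echo "my PID: $$"'
```

The shell prints `my PID: 1`: it is PID 1 in its own namespace, while the host kernel tracks it under an ordinary high PID. Same two realities as the docker run case.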
Control groups enforce resource accounting and limits. The cgroup hierarchy is exposed via a virtual filesystem at /sys/fs/cgroup/. Every container gets its own subtree.
```shell
# Find the container's cgroup path
cat /proc/$(docker inspect \
  --format='{{.State.Pid}}' nginx)/cgroup

# Read the memory limit directly
cat /sys/fs/cgroup/system.slice/\
docker-<id>.scope/memory.max
# → 268435456 (256MB)
```
Most people think "Docker" is one thing. It's actually a layered stack of 5 distinct components, each with a specific contract. Understanding their boundaries is how you debug production issues at 3am.
Why does runc exit after starting the container? Because the container's main process (PID 1 in its namespace) is now running. runc's job is done. It's like a rocket stage: it fires, does its job, detaches. The payload keeps flying.
Between containerd and your container lives a tiny process called containerd-shim-runc-v2, one shim per container. It survives even if containerd restarts; without it, every container would die whenever containerd did. The shim owns the container's stdin/stdout pipes and reports exit codes back to containerd once it comes back up.
Docker images are stored as a stack of read-only layers using OverlayFS, a Linux union filesystem. When a container starts, one writable layer is added on top (the upperdir).
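You can assemble the same sandwich by hand. A sketch, assuming a kernel new enough (roughly 5.11+) to allow overlay mounts inside an unprivileged user namespace; every path is a scratch directory:

```shell
unshare --user --map-root-user --mount sh -c '
  d=$(mktemp -d)
  mkdir "$d/lower" "$d/upper" "$d/work" "$d/merged"
  echo "from the image" > "$d/lower/config"

  # The same kind of mount Docker sets up per container
  mount -t overlay overlay \
    -o lowerdir=$d/lower,upperdir=$d/upper,workdir=$d/work "$d/merged"

  echo "container edit" > "$d/merged/config"  # copy-up: the write lands in upperdir
  cat "$d/lower/config"                       # read-only layer is untouched
  cat "$d/upper/config"                       # writable layer holds the change
'
```

Delete upper/ and the container's changes vanish while the image layer survives, which is exactly why docker rm loses data but the image stays intact.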
Kubernetes is not a container manager. It's a desired-state reconciliation engine that happens to manage containers. That's the most important sentence in this document.
kubectl serializes your YAML to JSON and sends a REST request. The API server runs: Authentication → Authorization (RBAC) → Admission controllers (webhooks can mutate/validate) → Validation against the OpenAPI schema → Persist to etcd.
The Deployment controller in controller-manager has a watch on Deployment resources. The API server streams it a watch event. The controller reconciles: desired replicas = 3, current ReplicaSets = 0, so it creates a new ReplicaSet from the pod template, labeled with pod-template-hash.
The ReplicaSet controller sees it owns 0 pods but wants 3. It creates 3 Pod objects in etcd with status.phase: Pending and no spec.nodeName. These pods don't exist anywhere yet; they're just records in a database.
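Each controller here follows the same pattern: observe current state, diff against desired state, act until they match. A toy sketch of that loop (illustrative shell, not real controller code):

```shell
desired=3   # spec.replicas
current=0   # pods the ReplicaSet currently owns

# Reconcile: keep acting until observed state equals desired state
while [ "$current" -lt "$desired" ]; do
  current=$((current + 1))
  echo "created pod $current (phase: Pending, no nodeName yet)"
done
echo "reconciled: $current/$desired replicas"
```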
The scheduler watches for unscheduled pods. It runs filter plugins (nodeSelector, taints, resource availability) and score plugins (spreading, affinity), picks a winner, and writes spec.nodeName: worker-2 back to etcd.
kubelet watches pods assigned to its node. Calls containerd CRI gRPC API: RunPodSandbox (create pause container + network namespace), PullImage, CreateContainer, StartContainer. Updates pod status in etcd.
kubelet calls the CNI plugin (Calico/Cilium/Flannel) to assign an IP from the pod CIDR, set up veth pairs, configure routing. Now the pod has an IP and can communicate with the cluster network.
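The veth pair itself is plain Linux plumbing; you can create one in a fresh network namespace without any CNI plugin. A sketch, assuming iproute2 and unprivileged user namespaces (10.244.0.5/24 is just an example pod-CIDR address):

```shell
unshare --user --map-root-user --net sh -c '
  ip link add veth0 type veth peer name veth1  # the pair a CNI plugin creates
  ip addr add 10.244.0.5/24 dev veth0          # the "pod IP"
  ip link set veth0 up
  ip -brief addr show veth0
'
```

A real plugin then moves one end into the pod's namespace and wires the other into the host's bridge or routing table.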
The Endpoints controller detects that the pod is ready (its readinessProbe passes) and adds its IP:port to the Service's Endpoints object. kube-proxy programs iptables rules. Traffic can now reach the pod.
CI/CD is not just "automated testing". At the architectural level, it's a trust escalation pipeline. Code starts untrusted. Each gate adds trust. By the time it reaches production, it's been verified more thoroughly than any human review could achieve.
Notice that the Docker image tag (the SHA) is the communication protocol between CI and CD. CI produces an immutable artifact tagged with a SHA. CD deploys that exact SHA. At any moment you can ask "what's running in prod?" and get back a git commit you can examine. That traceability is the whole point.
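A minimal sketch of that contract using a throwaway git repo; the registry and image name (registry.example.com/myapp) are placeholders, and the docker build command is printed rather than run:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git -c user.name=ci -c user.email=ci@example.com \
    commit -q --allow-empty -m "feature: something deployable"

# CI: tag the immutable artifact with the exact commit it was built from
sha=$(git rev-parse --short HEAD)
image="registry.example.com/myapp:$sha"
echo "CI would run: docker build -t $image ."

# Later, in prod: the running tag maps straight back to a commit you can read
git log --oneline -1 "$sha"
```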
In traditional CD, the pipeline runs kubectl apply directly. If someone does kubectl edit manually, drift is invisible.
GitOps flips this: ArgoCD runs inside the cluster and continuously watches a Git repo. If the cluster drifts from Git, the controller reconciles back. Git is the only way to change production.
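A sketch of the Application object that turns this on; the repo URL, paths, and names are placeholders, but the syncPolicy fields are what enforce Git as the source of truth:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-repo   # placeholder
    targetRevision: main
    path: k8s/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
  syncPolicy:
    automated:
      prune: true     # delete resources that were removed from Git
      selfHeal: true  # revert manual kubectl edits back to Git state
```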
Prometheus scrapes /metrics endpoints. TSDB stores time-series with labels. RED method: Rate (req/s), Errors (error %), Duration (latency histograms). AlertManager routes: severity levels, inhibition rules, Slack/PagerDuty.
Fluent Bit DaemonSet tails /var/log/containers/. Enriches with k8s metadata. Ships to Loki (label-indexed) or Elasticsearch. Use structured logging and query on demand β don't turn logs into dashboards.
OTel SDK instruments your code. Each request gets a trace ID propagated through all services via HTTP headers. Spans record each hop. Jaeger/Tempo. Traces answer 'why was THIS request slow?' β metrics can't.
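That propagation is concretely just an HTTP header. A sketch building a W3C Trace Context traceparent (version-traceid-spanid-flags) by hand:

```shell
# 16-byte trace ID and 8-byte span ID as lowercase hex
trace_id=$(head -c16 /dev/urandom | od -An -tx1 | tr -d ' \n')
span_id=$(head -c8 /dev/urandom | od -An -tx1 | tr -d ' \n')

# Each hop forwards the header with a fresh span ID but the same trace ID,
# which is what lets Jaeger/Tempo stitch the hops into one trace
echo "traceparent: 00-${trace_id}-${span_id}-01"
```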
```yaml
# API error rate > 1% over 5 minutes → PagerDuty
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      /
    sum(rate(http_requests_total[5m])) by (service) > 0.01
  for: 2m              # must be true for 2 min (avoid flaps)
  labels:
    severity: critical

# p95 latency degradation
- alert: SlowRequests
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
    ) > 0.5            # 500ms SLO
```