GitOps Principles and Models
- GitOps Principle 1: Declarative
- The entire system must be expressed declaratively. Desired state is described in configuration files (YAML), not procedural scripts.
- GitOps Principle 2: Versioned and Immutable
- Desired state is stored in Git, providing a canonical version history. Previous versions are immutable — changes create new commits, not edits.
- GitOps Principle 3: Pulled Automatically
- Approved changes are automatically pulled and applied by a software agent in the cluster. The cluster pulls from Git — NOT the pipeline pushing to the cluster.
- GitOps Principle 4: Continuously Reconciled
- Software agents continuously observe actual state and attempt to achieve the desired state. Drift is automatically detected and corrected.
- Push vs Pull Deployment
- Push: CI pipeline pushes to cluster with kubectl/Helm. Pull (GitOps): cluster agent pulls from Git. GitOps pull model provides drift detection and correction that push lacks.
- Drift Detection
- When manual kubectl changes make cluster state diverge from Git, the GitOps operator detects drift and reconciles the cluster back to the Git-defined desired state.
- GitOps Rollback
- Revert the Git commit → push → GitOps operator detects change → reconciles cluster to previous state automatically. No kubectl rollout undo needed.
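The drift-correction and rollback flow above can be sketched in a few lines of Python. This is a toy model, not how Argo CD or Flux are implemented: `GitRepo`, `reconcile`, and the `web` workload are hypothetical names, the "cluster" is a plain dictionary, and the revert assumes at least two commits exist.

```python
import copy


class GitRepo:
    """Toy stand-in for a Git repository: an append-only list of desired states."""

    def __init__(self):
        self.commits = []

    def commit(self, desired):
        self.commits.append(copy.deepcopy(desired))

    def revert_head(self):
        # A revert is a NEW commit that restores the previous state;
        # existing history is never rewritten.
        self.commit(self.commits[-2])

    def head(self):
        return self.commits[-1]


def reconcile(repo, cluster):
    """One pass of a pull-based agent: make the cluster match Git HEAD.

    Returns True if a difference (drift or a new commit) was found and corrected.
    """
    desired = repo.head()
    if cluster == desired:
        return False
    cluster.clear()
    cluster.update(copy.deepcopy(desired))
    return True


repo = GitRepo()
cluster = {}

repo.commit({"web": {"image": "web:1.0", "replicas": 2}})
reconcile(repo, cluster)                    # initial deploy

cluster["web"]["replicas"] = 5              # manual change, e.g. kubectl scale
drift_fixed = reconcile(repo, cluster)      # agent restores replicas to 2

repo.commit({"web": {"image": "web:1.1", "replicas": 2}})
reconcile(repo, cluster)                    # rollout of web:1.1

repo.revert_head()                          # rollback = revert commit
reconcile(repo, cluster)                    # cluster returns to web:1.0
```

Note that the rollback leaves three commits in history: the revert is recorded as a change, which is what keeps the audit trail intact.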
Argo CD and Flux
- Argo CD Application
- Core Argo CD resource that links a Git repo path to a Kubernetes cluster/namespace. Continuously syncs the cluster to match the Git source.
- Argo CD Sync Policies
- Manual sync (the default): a user triggers each sync. Automated sync: Argo CD applies changes on its own when a new Git commit leaves the cluster out of sync. Self-heal (an option of automated sync): also reverts manual changes made directly in the cluster.
- Argo CD App-of-Apps
- Pattern where a parent Argo CD Application manages child Application manifests in Git, enabling declarative management of multiple applications from a single root.
- Argo CD Projects
- Logical grouping of applications in Argo CD for multi-tenancy. Control which repos, clusters, and namespaces each team can deploy to.
- Flux GitOps Toolkit
- Modular set of controllers: source-controller (fetches from Git/OCI), kustomize-controller (applies Kustomize), helm-controller (reconciles Helm releases), notification-controller.
- Flux Kustomization Resource
- Flux CRD that points to a path in a Git repository and applies it using Kustomize. Supports health checks and dependency ordering between resources.
- Argo CD vs Flux
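- Argo CD: ships as a single integrated application with a rich web UI and RBAC scoped through Projects. Flux: a modular toolkit of controllers with no built-in UI; reconciles Helm releases natively through helm-controller, and typically handles multi-tenancy with Kubernetes RBAC and per-tenant service accounts.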
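The three sync behaviours in the Argo CD Sync Policies entry can be modelled as a small decision function. This is a simplified sketch, not Argo CD's code: `SyncPolicy` and `should_auto_sync` are hypothetical names, and the model assumes that automated sync without self-heal reacts only to Git changes, not to changes made directly in the cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPolicy:
    automated: bool = False   # False models the default manual sync
    self_heal: bool = False   # only meaningful when automated is True


def should_auto_sync(policy, git_changed, live_drifted):
    """Return True if the controller would sync without a human trigger."""
    if not policy.automated:
        return False              # manual: wait for a user to press Sync
    if git_changed:
        return True               # automated: a new commit triggers a sync
    # Drift in the live cluster only triggers a sync when self-heal is on.
    return policy.self_heal and live_drifted


manual = SyncPolicy()
automated = SyncPolicy(automated=True)
self_healing = SyncPolicy(automated=True, self_heal=True)
```

For example, `should_auto_sync(automated, git_changed=False, live_drifted=True)` is `False`, while the same call with `self_healing` is `True`.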
Kubernetes Core Concepts
- Declarative Apply
- kubectl apply -f manifest.yaml — applies desired state. Kubernetes reconciles current state to match. Idempotent: running twice produces the same result.
- Reconciliation Loop
- Watch current state → compare to desired state → compute diff → apply changes to close gap → repeat. Used by built-in Kubernetes controllers, custom operators, and GitOps agents alike.
- Custom Resource Definition (CRD)
- Extends the Kubernetes API with a new resource type. Defines the schema (spec fields). Does NOT provide behavior — a controller must watch instances and act on them.
- kubectl apply CRD
- kubectl apply -f my-crd.yaml — installs the CRD schema. After this, kubectl apply -f my-instance.yaml creates an instance. The controller handles the instance.
- Kubernetes Operator Pattern
- A controller that encodes operational knowledge for a complex application. Watches a CRD and manages the application lifecycle: install, configure, upgrade, backup, recover.
- Kubernetes RBAC
- Role/ClusterRole: defines permissions on resources. RoleBinding/ClusterRoleBinding: assigns the role to a subject (user, group, service account). Principle of least privilege.
- Network Policy Default
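- By default ALL pod traffic is allowed in Kubernetes. A pod that no NetworkPolicy selects accepts all traffic. Once a policy selects a pod, only traffic that some policy explicitly allows is permitted; policies are additive allow-lists with no deny rules. Enforcement requires a CNI plugin that supports NetworkPolicy.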
- Pod Security Admission
- Kubernetes built-in admission controller enforcing pod security. Three profiles: Privileged (no restrictions), Baseline (common threats blocked), Restricted (hardened). Apply via namespace labels.
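The default-allow behaviour from the Network Policy Default entry is easy to get backwards, so here is a minimal sketch of the ingress decision. It is a deliberately reduced model: `selects`, `ingress_allowed`, and the `pod_selector` / `allow_from` keys are hypothetical, and namespaces, ports, and egress rules are ignored.

```python
def selects(selector, labels):
    """A label selector matches when every key/value pair is present on the pod.

    An empty selector ({}) matches every pod.
    """
    return all(labels.get(k) == v for k, v in selector.items())


def ingress_allowed(policies, dst_labels, src_labels):
    """Decide whether traffic from src to dst is allowed within one namespace.

    policies: list of dicts with 'pod_selector' and 'allow_from'
              (a list of label selectors for permitted sources).
    """
    selecting = [p for p in policies if selects(p["pod_selector"], dst_labels)]
    if not selecting:
        return True  # no policy selects the pod: everything is allowed
    # Once selected, the pod only accepts what some policy explicitly allows.
    return any(
        selects(rule, src_labels)
        for p in selecting
        for rule in p["allow_from"]
    )


db_policy = [{"pod_selector": {"app": "db"}, "allow_from": [{"app": "api"}]}]
deny_all = [{"pod_selector": {}, "allow_from": []}]
```

With `db_policy`, the `db` pod accepts traffic from `api` only, while pods not selected by the policy remain open. `deny_all` selects every pod and allows nothing, which is the usual starting point for a locked-down namespace.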
Observability: Metrics, Logs, Traces
- Three Observability Pillars
- Metrics: quantitative aggregated measurements over time. Logs: timestamped discrete events. Traces: end-to-end request paths through distributed services.
- Prometheus Pull Model
- Prometheus SCRAPES (pulls) metrics from targets at configured intervals. Pushgateway exists for short-lived jobs. Do not confuse with push-model metrics systems.
- PromQL Basics
- rate(http_requests_total[5m]) — per-second rate over 5 minutes. histogram_quantile(0.99, ...) — p99 latency. up == 0 — targets that are down.
- Grafana
- Visualization layer. Creates dashboards from Prometheus (and other) data sources. Does NOT collect metrics — it queries and visualizes them.
- Loki (Logs)
- Log aggregation system from Grafana Labs. Indexes only metadata (labels), not log content — makes it lightweight. Queries use LogQL. Often paired with Promtail for log collection.
- Distributed Tracing (Jaeger/Tempo)
- Traces show the end-to-end path of a request across microservices, with timing for each service call. Identify which service is causing latency or failures.
- SLI / SLO / SLA
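- SLI: the metric measured (e.g., success rate). SLO: internal reliability target (e.g., 99.9% success). SLA: contractual commitment to customers, usually with penalties if missed. The SLO is set stricter than the SLA by design, so the team gets warning before the contract is breached.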
- Error Budget
- Error budget = 1 - SLO target. At 99.9% SLO, error budget is 0.1% (about 43 minutes/month). When budget is exhausted, prioritize reliability work over new features.
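The error budget arithmetic above can be checked directly. The sketch below assumes a 30-day window and treats the budget as pure downtime; the function names are illustrative.

```python
def error_budget_fraction(slo):
    """Error budget as a fraction, e.g. 0.999 -> 0.001 (0.1%)."""
    return 1.0 - slo


def allowed_downtime_minutes(slo, window_days=30):
    """Minutes of full outage the budget permits within the window."""
    return error_budget_fraction(slo) * window_days * 24 * 60


minutes = allowed_downtime_minutes(0.999)
print(f"{minutes:.1f} minutes per 30-day window")  # 43.2 minutes per 30-day window
```

Each extra nine divides the budget by ten: at 99.99% the same window allows roughly 4.3 minutes.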
Security: Policy Engines and mTLS
- OPA/Gatekeeper
- Policy engine using Rego language. Runs as ValidatingAdmissionWebhook. Audit mode: reports violations on existing resources. Enforce mode: blocks new/updated resources violating policy.
- Kyverno
- Kubernetes-native policy engine using YAML policies. Supports validate, mutate, and generate operations. Easier learning curve than Rego. Enforce and audit modes.
- Admission Webhook Order
- Mutating webhooks run FIRST (can modify resources), then Validating webhooks run (can allow/deny). Validators see the resource AFTER mutation.
- mTLS (Mutual TLS)
- Both client and server authenticate each other via certificates and encrypt the connection. Service meshes implement mTLS transparently without application code changes.
- Service Mesh (Istio/Linkerd)
- Infrastructure layer managing service-to-service communication. Injects sidecar proxies (Envoy in Istio, linkerd2-proxy in Linkerd) that handle mTLS, traffic management, and observability automatically.
- Kubernetes Secrets
- Base64-encoded (NOT encrypted by default) Kubernetes objects for sensitive data. Enable encryption-at-rest for secrets. Use external secret managers (Vault, AWS Secrets Manager) for production.
- Supply Chain Security (Sigstore/cosign)
- Sign container images after build with cosign. Policy engines (Kyverno/Gatekeeper) verify signatures at admission time. Prevents running unsigned or tampered images.
- SLSA (Supply Chain Levels)
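- Framework defining levels of software supply chain security (Build track levels 0-3 in SLSA v1.0). Higher levels require stronger, harder-to-forge build provenance and a hardened, hosted build platform. Hermetic builds were a requirement of the older v0.1 Level 4 and are not part of v1.0.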
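The ordering in the Admission Webhook Order entry matters because a validator may depend on a default that a mutator fills in. The following sketch models that pipeline with plain functions; `add_default_labels`, `require_team_label`, and `admit` are hypothetical names, and real webhooks are HTTP services called by the API server rather than in-process functions.

```python
import copy


def add_default_labels(pod):
    """Mutating step: fill in a 'team' label when it is missing."""
    pod = copy.deepcopy(pod)
    pod.setdefault("labels", {}).setdefault("team", "unassigned")
    return pod


def require_team_label(pod):
    """Validating step: returns (allowed, reason)."""
    if "team" in pod.get("labels", {}):
        return True, ""
    return False, "missing required label: team"


def admit(pod, mutators, validators):
    """Run all mutators first, then all validators on the mutated object."""
    for mutate in mutators:
        pod = mutate(pod)
    for validate in validators:
        allowed, reason = validate(pod)
        if not allowed:
            return None, reason   # request rejected
    return pod, ""                # request admitted, possibly modified


admitted, _ = admit({"name": "web"}, [add_default_labels], [require_team_label])
rejected, reason = admit({"name": "web"}, [], [require_team_label])
```

With the mutator in place the pod is admitted carrying `team: unassigned`; without it, the same pod is rejected by the validator.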
CI/CD Pipelines and Deployment Strategies
- CI Pipeline Stages
- Commit → Checkout → Lint → Unit Test → Build Image → Scan Image for CVEs → Push to Registry → Update GitOps Manifests. Each stage is a gate — failure blocks promotion.
- Immutable Artifacts
- Build the container image ONCE in CI. Promote the SAME image through dev → staging → production, ideally referenced by digest because tags can be repointed. Never rebuild per environment: a rebuild produces a different artifact and breaks the immutability guarantee.
- Rolling Update
- Kubernetes default deployment strategy. Gradually replaces old pods with new. Zero downtime if health checks are configured. Rollback takes time (reverse rolling update).
- Blue-Green Deployment
- Two identical environments (blue=current, green=new). Switch traffic instantly by updating the load balancer. Instant rollback by switching back. High resource cost (double capacity needed).
- Canary Deployment
- Route small % of traffic (e.g., 5%) to new version. Monitor error rates and latency. Gradually increase traffic to new version if metrics are healthy. Rollback = redirect traffic back.
- Argo Rollouts (Progressive Delivery)
- Kubernetes controller for advanced deployment strategies (canary, blue-green) with automated analysis. Automatically promotes or rolls back based on metric thresholds.
- CI vs CD Separation
- CI: builds and validates artifacts (test, scan, sign). CD/GitOps: deploys validated artifacts. CI pipeline should update the GitOps manifest (image tag); GitOps operator does the deployment.
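The canary and progressive delivery entries describe a loop: shift some traffic, check metrics, then either continue or roll back. A minimal sketch of that loop follows. It is not the Argo Rollouts API; `run_canary`, the step weights, and the 1% error threshold are assumptions chosen for illustration.

```python
def run_canary(steps, error_rate_at, max_error_rate=0.01):
    """Walk through traffic weights and abort on the first unhealthy step.

    steps: increasing traffic percentages for the new version, e.g. [5, 25, 50, 100]
    error_rate_at: callable(weight) -> error rate observed at that weight
    Returns (final_weight, promoted).
    """
    if not steps:
        raise ValueError("steps must not be empty")
    for weight in steps:
        if error_rate_at(weight) > max_error_rate:
            return 0, False       # rollback: all traffic back to the stable version
    return steps[-1], True        # every step was healthy: fully promoted


healthy = run_canary([5, 25, 50, 100], lambda weight: 0.001)
degraded = run_canary([5, 25, 50, 100], lambda weight: 0.05 if weight >= 25 else 0.001)
```

In the second run the new version looks fine at 5% but exceeds the threshold at 25%, so traffic returns to the stable version before most users are affected.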
Platform APIs and Infrastructure Provisioning
- CRD Schema Definition
- apiVersion: apiextensions.k8s.io/v1, kind: CustomResourceDefinition. Defines group, version, names, and OpenAPI schema for the new resource type.
- CRD + Controller = Operator
- CRD defines WHAT the resource looks like. Controller defines WHAT HAPPENS to instances of that resource. You need both for a functional operator.
- Operator Reconciliation Idempotency
- Reconcile loops must be idempotent — safe to run multiple times. On controller restart, it reconciles from current cluster state. Do not assume reconcile runs only once.
- Crossplane Provider
- A Crossplane component that connects to a cloud provider API (AWS, GCP, Azure). Manages Managed Resources (e.g., RDSInstance CRD maps to an AWS RDS database).
- Crossplane Composite Resource
- A higher-level abstraction combining multiple Managed Resources. Developers create a DatabaseClaim; Crossplane creates the RDS instance, subnet group, and security group behind the scenes.
- Infrastructure as Code (IaC)
- Declare infrastructure in version-controlled configuration files. Terraform: HCL-based, external state file. Crossplane: Kubernetes-native, state in etcd, uses reconciliation loop.
- Terraform vs Crossplane
- Terraform: standalone CLI tool, HCL language, state file. Crossplane: Kubernetes-native, CRDs, reconciliation loop, no external state. Crossplane is better for Kubernetes-centric platforms.
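The idempotency requirement from the Operator Reconciliation Idempotency entry can be illustrated with a reconcile function that derives every action from a fresh comparison of desired and actual state. This is a sketch under simplifying assumptions: `reconcile_resources` and the resource names are hypothetical, and real controllers call cloud or Kubernetes APIs instead of editing a dictionary.

```python
def reconcile_resources(desired, actual):
    """Bring `actual` in line with `desired`; return the list of actions taken.

    Both arguments map resource name -> spec. `actual` is mutated in place.
    """
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = dict(spec)
            actions.append(("create", name))
        elif actual[name] != spec:
            actual[name] = dict(spec)
            actions.append(("update", name))
    for name in list(actual):
        if name not in desired:
            del actual[name]
            actions.append(("delete", name))
    return actions


desired = {"db": {"size": "small"}, "subnet-group": {"subnets": 2}}
actual = {"orphan": {}}

first = reconcile_resources(desired, actual)    # creates two resources, deletes one
second = reconcile_resources(desired, actual)   # nothing left to do
```

Because no state is remembered between calls, a controller restart simply triggers another pass that finds nothing to change.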
Internal Developer Platforms (IDPs) and Developer Experience
- IDP Definition
- Internal Developer Platform: a self-service platform built by platform engineers that abstracts infrastructure complexity. Provides golden paths, self-service tooling, and reduces developer cognitive load.
- Backstage Software Catalog
- Central registry of all services, APIs, libraries, and teams with ownership data. Enables discoverability — answers 'who owns this?' and 'what APIs exist?'
- Backstage Software Templates
- Scaffolding system (golden paths) that creates new projects with opinionated defaults: repo setup, CI/CD pipeline, monitoring, and documentation pre-configured.
- Backstage TechDocs
- Documentation-as-code: write docs in Markdown alongside code, published automatically to Backstage. Every service's docs live in the same place.
- Golden Paths
- Recommended, pre-built, opinionated workflows for common tasks (create a service, provision a database). 'Golden paths, not golden cages' — deviation is allowed but costs more effort.
- Cognitive Load Reduction
- Platform engineering goal: developers should not need deep infrastructure knowledge to deploy and operate their services. The platform abstracts away complexity through self-service.
- Platform as a Product
- Platform teams treat the IDP as a product with developers as customers. Requires product management, user research, feedback loops, and treating platform usability as a first-class concern.
DORA Metrics and Platform Measurement
- Deployment Frequency
- How often code is deployed to production. Elite: multiple times per day. High: once per day to once per week. Measures delivery pipeline maturity.
- Lead Time for Changes
- Time from code commit to running in production. Elite: under 1 hour. High: 1 day to 1 week. Measures end-to-end pipeline efficiency.
- Change Failure Rate
- Percentage of deployments that cause an incident requiring rollback or hotfix. Elite: < 5%. High: 5-10%. Exact thresholds shift between annual DORA reports. Measures deployment quality, not just speed.
- MTTR (Mean Time to Restore)
- Average time to restore service after a production incident. Elite: < 1 hour. High: < 1 day. Measures platform resilience and incident response effectiveness.
- DORA Elite vs Low Performer
- Elite: high frequency AND low failure rate AND low MTTR. High frequency + high failure rate is NOT elite — stability must improve WITH speed, not instead of it.
- Platform Adoption Metrics
- Self-service request volume, golden path adoption rate, developer portal active users, ticket volume reduction. Measure whether the IDP is actually reducing friction.
- Developer Experience (DevEx) Metrics
- Developer satisfaction surveys (NPS/SPACE framework), time to first deployment for new developers, onboarding time reduction, self-reported cognitive load scores.
- Error Budget Burn Rate
- Rate at which the SLO error budget is consumed. High burn rate signals an incident in progress. Budget exhaustion triggers freeze on new features until reliability is restored.
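The four DORA metrics above can all be derived from a simple deployment log. The sketch below uses invented sample data and hypothetical field names (`committed`, `deployed`, `restore_time`); a real platform would pull these from its CI/CD and incident tooling.

```python
from datetime import datetime, timedelta


def average(deltas):
    """Mean of a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def dora_metrics(deployments, window_days):
    """Compute the four DORA metrics from a non-empty list of deployment records."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d["deployed"] - d["committed"] for d in deployments]
    # A deployment counts as failed when it needed a restore.
    failures = [d for d in deployments if d["restore_time"] is not None]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "lead_time": average(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "time_to_restore": average([d["restore_time"] for d in failures]) if failures else None,
    }


deployments = [
    {"committed": datetime(2024, 1, 1, 9, 0), "deployed": datetime(2024, 1, 1, 9, 30), "restore_time": None},
    {"committed": datetime(2024, 1, 1, 13, 0), "deployed": datetime(2024, 1, 1, 14, 0), "restore_time": timedelta(minutes=20)},
    {"committed": datetime(2024, 1, 2, 10, 0), "deployed": datetime(2024, 1, 2, 10, 30), "restore_time": None},
    {"committed": datetime(2024, 1, 2, 15, 0), "deployed": datetime(2024, 1, 2, 16, 0), "restore_time": None},
]

metrics = dora_metrics(deployments, window_days=2)
```

For this sample the result is 2 deployments per day, a 45-minute average lead time, a 25% change failure rate, and a 20-minute time to restore.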
Platform Engineering Fundamentals
- Platform Engineering vs DevOps
- DevOps: culture and practices for dev-ops collaboration. Platform Engineering: builds the IDPs that enable teams to practice DevOps at scale. Platform Engineering = 'DevOps as a product'.
- Team Topologies Alignment
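- Team Topologies defines four team types: stream-aligned, enabling, complicated-subsystem, and platform. Platform teams are their own type, distinct from enabling teams; they provide self-service capabilities to stream-aligned (feature) teams. The platform reduces cognitive load so stream-aligned teams can focus on product features.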
- Application Environments
- Development (dev): fast iteration, not stable. Staging/Pre-production: production-like validation. Production: live traffic. Promotion from dev → staging → prod is the standard path.
- Continuous Integration (CI)
- Automated: build, unit test, lint, integration test, image scan on every commit. Goal: fast feedback. If CI fails, the developer fixes it immediately — do not let a broken build sit.
- Continuous Delivery vs Continuous Deployment
- Delivery: every commit is deployable; production deployment is triggered manually. Deployment: every passing commit is automatically deployed to production. Continuous Deployment = Continuous Delivery without the manual approval gate.
- Infrastructure as Code (IaC) Principles
- Version control all infrastructure definitions. Infrastructure is reproducible and auditable. Enables environment parity (dev matches prod). Changes go through code review.
- CNCF Project Maturity Levels
- Sandbox: early stage, experimental. Incubating: growing adoption with production use. Graduated: widely adopted, production-proven. Archived is a separate status for projects that are no longer maintained, not a maturity level.