CNCF / Linux Foundation | CNPA | 6 domains

CNPA Exam Notes

Last-minute traps, must-know facts, and scenario tips for the Certified Cloud Native Platform Engineering Associate exam.

General Exam Tips

1. Read ALL answer options before selecting: many questions have two plausible answers, but one is 'most aligned with the platform engineering goal'. The platform mindset (self-service, declarative, least effort) usually picks the winner.
2. Watch for trigger words in question stems: 'declarative', 'secure by default', 'self-service', 'least effort', and 'without code changes'. These almost always point toward the GitOps/operator/service mesh answer.
3. Despite the official 'Beginner' label, the exam is genuinely intermediate difficulty. Do not under-prepare just because it is labeled foundational.
4. 60 questions in 120 minutes = exactly 2 minutes per question. This is generous time, so you can afford to read carefully. Mark and skip questions where you are unsure and return to them after completing the rest.
5. No penalty for wrong answers. Always answer every question, even if guessing. Never leave a question blank.
6. The exam is closed-book and online proctored via PSI. Your environment must be quiet and your desk clear. Prepare your ID and test space the day before.
7. Think like a platform team building products for developers, not like a Kubernetes operator. The exam rewards architectural reasoning over syntax recall.
8. When a scenario describes 'without modifying application code', 'transparently', or 'without developer involvement', the answer is almost always a service mesh, admission webhook, or policy engine, not an application-level change.
9. When the question is about rollback in a GitOps environment, the answer is always 'revert the Git commit', never kubectl rollout undo.
10. The exam covers OpenTelemetry as the standardized observability framework: know it is a CNCF project that provides unified APIs and SDKs for metrics, logs, AND traces. It is distinct from Prometheus or Jaeger individually.

Quick Navigation

Platform Engineering Core Fundamentals
Platform Observability, Security, and Conformance
Continuous Delivery and Platform Engineering
Platform APIs and Provisioning Infrastructure
IDPs and Developer Experience
Measuring your Platform

Domain 1 (36% of exam)

Platform Engineering Core Fundamentals

Must-Know Facts

This single domain is 36% of the exam — more than one-third. You CANNOT pass without mastering it. Spend disproportionate study time here.
The four OpenGitOps principles VERBATIM: (1) Declarative, (2) Versioned and Immutable, (3) Pulled Automatically, (4) Continuously Reconciled. The exam uses exact OpenGitOps terminology.
Continuous Delivery: every commit is in a deployable state, production deployment requires a MANUAL trigger. Continuous Deployment: every passing commit is AUTOMATICALLY deployed to production with no human gate.
Platform engineering is 'DevOps as a product' — platform engineers build platforms that enable dev teams to practice DevOps without deep infrastructure expertise. It is a distinct discipline, not just rebranded DevOps or SRE.
Declarative management = describe desired state, let the reconciliation loop handle 'how'. Imperative management = issue step-by-step commands. Kubernetes is declarative by design.
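Team Topologies applies here: the platform team is one of four team types (alongside stream-aligned, enabling, and complicated-subsystem teams), and its purpose is to reduce cognitive load for stream-aligned (feature) teams. It is related to, but distinct from, an enabling team, which coaches other teams rather than running a product for them. This framing appears in exam questions about why organizations build IDPs.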
The CNCF Platform Maturity Model describes platform evolution: Provisional (reactive, manual) → Operational → Scalable → Optimizing. Platform maturity is not just about technology — it includes adoption, processes, and team topology.
GitOps workflow: developer pushes to Git → CI pipeline builds and validates artifact → CI updates the GitOps manifest (image tag) → GitOps operator detects change → operator reconciles cluster to new desired state. The CI system does NOT deploy to the cluster directly.
Application environment promotion order: Development → Staging/Pre-production → Production. The same immutable artifact is promoted through environments — never rebuilt.
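
The declarative, pull-and-reconcile behaviour described in the facts above can be sketched in a few lines. This is a minimal illustration and not how Argo CD or Flux are implemented: `desired` stands in for manifests read from Git, `live` for cluster state, and the function name `reconcile` and the resource names are hypothetical.

```python
from copy import deepcopy


def reconcile(desired: dict, live: dict) -> dict:
    """Return the new live state after one reconciliation pass.

    Hypothetical model: keys are resource names, values are their specs.
    Git (desired) is the authority, so live state is overwritten to match
    it; nothing is ever written back to `desired`.
    """
    new_live = {}
    for name, spec in desired.items():
        # Create missing resources and overwrite drifted ones.
        new_live[name] = deepcopy(spec)
    # Resources that exist in the cluster but not in Git are pruned
    # simply by not being copied over (assumption: pruning is enabled).
    return new_live


desired = {"web": {"image": "web:1.4.2", "replicas": 3}}
# A manual change scaled replicas to 10 and added an extra pod.
live = {
    "web": {"image": "web:1.4.2", "replicas": 10},
    "debug-pod": {"image": "busybox"},
}

live = reconcile(desired, live)
```

After the pass, `live` equals `desired` again and the manual edits are gone. Pruning is modelled here for simplicity; in real tools it is typically a configurable option.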

Common Traps

Trap: Treating Continuous Delivery and Continuous Deployment as synonyms

Reality: Continuous Delivery means you CAN deploy at any time with a manual trigger; a human still approves production releases. Continuous Deployment means you DO deploy automatically on every commit that passes tests. Most organizations practice Delivery but not Deployment. The exam distinguishes them precisely.

Trap: Thinking GitOps is a specific tool

Reality: GitOps is a set of practices defined by the OpenGitOps standard. Argo CD and Flux are tools that IMPLEMENT GitOps. A question asking 'which best describes GitOps' is asking about principles (pull-based, Git as source of truth, continuous reconciliation), not tool names.

Trap: Thinking platform engineering is just DevOps or SRE with a new name

Reality: DevOps is a culture. SRE is a reliability operations role. Platform engineering is a product-oriented discipline that builds self-service platforms as internal products. Platform engineers have developers as customers: they do user research, build golden paths, and measure adoption.

Trap: Assuming the GitOps operator updates Git when cluster state diverges

Reality: When drift is detected (cluster ≠ Git), the GitOps operator reconciles the CLUSTER back to match Git, overwriting manual changes. Git is always the authority. The operator never updates Git to match the cluster.

Trap: Conflating CI and CD responsibilities

Reality: CI is responsible for producing and VALIDATING a container image artifact (build, test, scan, sign). CD/GitOps is responsible for DEPLOYING that validated artifact. CI pipelines should update the GitOps manifest's image tag but should never directly deploy to the cluster using kubectl or Helm.

Confusing Pairs

Continuous Delivery vs. Continuous Deployment

Delivery = CAN deploy; requires manual approval for production. Deployment = DOES deploy; automated all the way to production with no human gate. Key signal in questions: if there is a manual approval step anywhere in the pipeline, it is Delivery, not Deployment.

GitOps (Pull-based) vs. Push-based CI/CD

GitOps: cluster agent continuously PULLS from Git and reconciles state; provides drift detection and correction. Push-based: CI pipeline PUSHES to the cluster via kubectl/Helm; no drift detection. The pull model and automatic reconciliation are what define GitOps — both words matter.

Platform Engineering vs. DevOps vs. SRE

DevOps = culture and collaboration practices (not a role). SRE = operational role focused on reliability engineering. Platform Engineering = discipline of building self-service internal platforms (IDPs) with developers as customers. Platform engineers build the tooling that enables DevOps practices at scale.

Argo CD vs. Flux

Both implement identical GitOps principles — pull-based, Git as source of truth, continuous reconciliation. Argo CD is a single integrated application with a rich UI and RBAC via Projects. Flux is a modular GitOps Toolkit (source-controller, kustomize-controller, helm-controller). The exam will not ask you to pick one over the other — it will ask you to distinguish their architectural approaches.

Scenario Tips

If the question asks about:

A developer commits to Git and the question asks what happens next in a GitOps workflow

Answer:

CI pipeline triggers (builds, tests, scans), CI updates the GitOps manifest image tag, GitOps operator detects the change, operator reconciles cluster to match new desired state. The key sequence to get right: CI DOES NOT deploy directly; it updates the manifest, and the operator does the deploying.

Distractor to avoid:

Saying 'CI pipeline deploys to the cluster' is wrong in a GitOps context. CI produces the artifact and updates the manifest. Deployment is the operator's job.
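
As a rough sketch of the hand-off described above, the last CI step edits the manifest in the GitOps repository and never touches the cluster. The manifest layout mirrors a Kubernetes Deployment, but the helper name `set_image`, the registry host, and the version tags are made up for illustration.

```python
from copy import deepcopy


def set_image(manifest: dict, container: str, image: str) -> dict:
    """Return a copy of a Deployment manifest with one container image replaced.

    Hypothetical helper: in a real pipeline the result would be committed
    to the GitOps repository; the GitOps operator, not CI, applies it.
    """
    updated = deepcopy(manifest)
    for c in updated["spec"]["template"]["spec"]["containers"]:
        if c["name"] == container:
            c["image"] = image
            return updated
    raise KeyError(f"container {container!r} not found in manifest")


deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "web", "image": "registry.example.com/web:1.4.1"},
    ]}}},
}

new_manifest = set_image(deployment, "web", "registry.example.com/web:1.4.2")
```

The function returns a new manifest rather than calling the cluster API, which is the point of the separation: CI changes what Git says, and the operator makes the cluster match.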

If the question asks about:

A manual kubectl change is applied to production. The question asks what the GitOps operator does.

Answer:

The operator detects drift (current state ≠ desired state in Git), then reconciles the cluster BACK to match Git — the manual change is overwritten. This is by design: Git is the single source of truth.

Distractor to avoid:

Choosing 'operator sends an alert and preserves the change' is wrong. Default GitOps behavior is to reconcile drift, not preserve manual changes. Argo CD with self-heal enabled will immediately revert it.

If the question asks about:

A team deploys 20 times per day but has a 25% change failure rate. How should you characterize them using DORA?

Answer:

High in deployment frequency but low in stability — NOT an elite performer. DORA elite requires BOTH high frequency AND low failure rate (< 5%). High frequency with high failure rate means the team is shipping broken changes rapidly — a dangerous pattern.

Distractor to avoid:

Choosing 'elite performer because of high frequency' misses that DORA measures all four metrics holistically. Speed without stability is explicitly not the elite profile.

Last-Minute Facts

1. OpenGitOps 4 principles: Declarative, Versioned and Immutable, Pulled Automatically, Continuously Reconciled

2. Exam weight: 36%, the single largest domain and almost double the next largest domain (Observability at 20%). Failing this domain alone can sink your total score.

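3. CNCF project maturity levels: Sandbox → Incubating → Graduated. Archived is a separate status for retired projects, not a fourth maturity level. Argo (which includes Argo CD) and Flux are Graduated; project levels change over time, so confirm the current level of Backstage and Crossplane on the CNCF project list rather than assuming they are Graduated. If a question asks 'which is a production-proven CNCF project?', Graduated is the answer, not Sandbox.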

4. Platform maturity stages: Provisional → Operational → Scalable → Optimizing

5. Team Topologies: the platform team is its own team type (distinct from an enabling team) and exists to serve stream-aligned teams; product team = stream-aligned team

Domain 2 (20% of exam)

Platform Observability, Security, and Conformance

Must-Know Facts

Three observability pillars: Metrics (quantitative aggregated measurements), Logs (discrete timestamped events), Traces (end-to-end request paths through distributed services). Each answers a different question: metrics = 'how much and how often', logs = 'what happened', traces = 'where in the request path and why'.
OpenTelemetry is the CNCF standard for collecting all three pillars — metrics, logs, AND traces — through unified APIs and SDKs. It is vendor-neutral and separate from specific backends like Prometheus or Jaeger.
Error budget = 1 minus SLO target. At 99.9% SLO, error budget is 0.1% (about 43.8 minutes per month). When error budget is exhausted, the team should freeze feature work and prioritize reliability.
Network Policies are DEFAULT ALLOW in Kubernetes. If no NetworkPolicy selects a pod, all ingress and egress traffic is permitted. Adding a NetworkPolicy creates restrictions — it does not grant additional permissions.
Admission webhook execution order: Mutating webhooks run FIRST (can modify the resource). Validating webhooks run SECOND (can allow or deny). Validators see the resource AFTER mutations have been applied.
Pod Security Admission replaced PodSecurityPolicy (PSP) in Kubernetes 1.25. PSP is removed — never choose PSP as a modern answer. The three Pod Security profiles are Privileged, Baseline, and Restricted.
OPA/Gatekeeper policy modes: Audit mode logs violations on EXISTING resources (reports but does not block). Enforce mode blocks NEW and UPDATED resources that violate policy. Existing violating resources are NOT automatically deleted or corrected in enforce mode.
mTLS in a service mesh is transparent to application code — the sidecar proxy handles TLS termination and origination. Applications communicate as if using plaintext; the mesh encrypts and mutually authenticates at the proxy layer.
Kubernetes Secrets are base64-encoded by default, NOT encrypted. Enable encryption at rest via EncryptionConfiguration. For production secrets, use an external secrets manager (HashiCorp Vault, AWS Secrets Manager) and synchronize to Kubernetes Secrets.
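
The error-budget arithmetic in the facts above is easy to check. The sketch below assumes availability is measured over wall-clock time and uses an average calendar month (365.25 / 12 days) by default, which is what gives roughly 43.8 minutes; the function name is illustrative.

```python
def error_budget_minutes(slo: float, days: float = 365.25 / 12) -> float:
    """Allowed downtime in minutes for an availability SLO over a period.

    `slo` is a fraction (0.999 for 99.9%). The default period is an
    average calendar month; pass days=30 for a 30-day window.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    error_budget = 1.0 - slo          # e.g. 0.001 for a 99.9% SLO
    return error_budget * days * 24 * 60


monthly = error_budget_minutes(0.999)          # average month
thirty_day = error_budget_minutes(0.999, 30)   # fixed 30-day window
```

With these inputs `monthly` rounds to 43.8 minutes and `thirty_day` to 43.2 minutes, so the exact figure depends on which window the SLO is defined over.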

Common Traps

Trap: Confusing monitoring with observability

Reality: Monitoring tells you WHEN something is wrong (threshold alerts). Observability tells you WHY; it requires that systems are designed to emit structured telemetry (metrics, logs, traces). Observability is hard to bolt on after the fact, so it should be designed in. The exam distinguishes between 'we got an alert' (monitoring) and 'we can explain the failure' (observability).

Trap: Assuming Kyverno only validates resources

Reality: Kyverno supports three operations: Validate (allow/deny), Mutate (modify resources at admission), and Generate (create related resources automatically). OPA/Gatekeeper primarily validates but has limited mutation support. The key Kyverno differentiator is Generate: for example, automatically creating a NetworkPolicy whenever a new Namespace is created.

Trap: Thinking the SLO should equal or be looser than the SLA

Reality: The SLO must be STRICTER than the SLA. If your SLA is 99.5%, your SLO should be 99.9% so that SLO violations trigger internal engineering action before the SLA breach threshold is reached. An SLA violation has contractual consequences (customer credits); an SLO violation should be an internal alert only.

Trap: Thinking policy engines in enforce mode clean up existing violations

Reality: Enforce mode only blocks future resource creation or updates. It does NOT retroactively delete or modify existing resources that violate the policy. To discover existing violations, use audit mode. Cleanup of existing violations requires separate remediation action.

Trap: Selecting 'require TLS in application code' as the answer to encrypting service-to-service traffic

Reality: The cloud native answer for transparent mutual TLS without code changes is a service mesh (Istio or Linkerd). When the question includes 'without modifying application code' or 'transparently', the answer is a service mesh with mTLS, never requiring developers to implement TLS themselves.

Confusing Pairs

OPA/Gatekeeper vs. Kyverno

OPA/Gatekeeper: Rego policy language (powerful, general-purpose, steeper learning curve). Kyverno: YAML-native policies (Kubernetes-familiar, easier to adopt). Critical difference: Kyverno can GENERATE resources (e.g., auto-create NetworkPolicy on new Namespace); OPA/Gatekeeper primarily validates. Both support audit and enforce modes.

SLO (Service Level Objective) vs. SLA (Service Level Agreement)

SLO = internal reliability target, set by engineering team, stricter threshold, violation triggers internal action, no contractual penalty. SLA = external contractual commitment, looser threshold, violation has consequences (credits, penalties). SLO must be ABOVE SLA — the SLO is your early warning system.

Metrics (Prometheus) vs. Traces (Jaeger/Tempo)

Metrics: aggregated numbers over time, answer 'how much' and 'how often', efficient for alerting. Traces: distributed request paths with per-service timing, answer 'which service failed and why'. Use metrics for alerting, traces for root cause analysis after an alert fires.

Mutating Admission Webhook vs. Validating Admission Webhook

Mutating: runs first, CAN modify the resource (e.g., inject sidecar, set default values, add labels). Validating: runs second (sees final mutated resource), can ONLY allow or deny — cannot modify. This order matters: if a mutating webhook adds a required label, the validating webhook can then check for that label.
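
The ordering can be illustrated with a toy admission chain. This is a simplified model, not the Kubernetes API server's implementation; the function names and the `team` label are hypothetical.

```python
from copy import deepcopy
from typing import Callable

Mutator = Callable[[dict], dict]
Validator = Callable[[dict], bool]


def admit(resource: dict, mutators: list[Mutator], validators: list[Validator]) -> dict:
    """Run all mutators first, then validate the already-mutated object."""
    obj = deepcopy(resource)
    for mutate in mutators:
        obj = mutate(obj)            # mutating phase: may change the object
    for validate in validators:
        if not validate(obj):        # validating phase: allow or deny only
            raise PermissionError("admission denied")
    return obj


def add_team_label(obj: dict) -> dict:
    """Mutator: default the 'team' label when it is missing."""
    obj.setdefault("metadata", {}).setdefault("labels", {}).setdefault("team", "unassigned")
    return obj


def require_team_label(obj: dict) -> bool:
    """Validator: deny objects that have no 'team' label."""
    return "team" in obj.get("metadata", {}).get("labels", {})


pod = {"kind": "Pod", "metadata": {"name": "demo"}}
admitted = admit(pod, [add_team_label], [require_team_label])
```

The pod is admitted only because the mutator ran before the validator; calling `admit(pod, [], [require_team_label])` raises `PermissionError`.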

Scenario Tips

If the question asks about:

The question asks how to prevent root containers in a namespace without modifying each deployment manifest

Answer:

Apply Pod Security Admission label to the namespace (enforce the Restricted profile) OR use a Kyverno/OPA policy in enforce mode. PSA is built-in to Kubernetes (no extra tooling needed). Kyverno provides more granular control. Both are valid — if the question says 'namespace-level control', PSA is the cleanest answer.

Distractor to avoid:

Choosing PodSecurityPolicy (PSP) is wrong — PSP was removed in Kubernetes 1.25. Never select PSP for a modern platform engineering question.

If the question asks about:

The question says 'encrypt and mutually authenticate ALL service-to-service traffic without application code changes'

Answer:

Deploy a service mesh (Istio or Linkerd) with mTLS enabled. The sidecar proxies handle TLS transparently. Network Policies control traffic flow but do NOT encrypt it. Ingress TLS only covers external traffic.

Distractor to avoid:

Network Policies are wrong — they restrict which pods can communicate but do not encrypt traffic. TLS termination at Ingress is wrong — that only protects external-to-cluster communication.

If the question asks about:

An SLO is 99.9% and actual availability was 99.7% for the month. The SLA is 99.5%. What is the impact?

Answer:

The SLO is violated (99.7% < 99.9%) but the SLA is not violated (99.7% > 99.5%). This is the intended design — the SLO violation triggers internal engineering action, and the SLA is preserved. The SLO is the early warning system.

Distractor to avoid:

Saying 'both SLO and SLA are violated' is a common mistake when candidates confuse which is stricter. SLO is always the stricter internal target.
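
The comparison in this scenario reduces to two inequalities. A minimal sketch, with a hypothetical function name and availability expressed in percent:

```python
def assess(availability: float, slo: float, sla: float) -> dict:
    """Compare measured availability (in percent) with the SLO and SLA."""
    if slo <= sla:
        raise ValueError("SLO should be stricter (higher) than the SLA")
    return {
        "slo_violated": availability < slo,   # internal target missed
        "sla_violated": availability < sla,   # contractual commitment missed
    }


result = assess(availability=99.7, slo=99.9, sla=99.5)
```

For the numbers in the scenario, `result` reports the SLO as violated and the SLA as intact, which is the early-warning gap the notes describe.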

Last-Minute Facts

1. Prometheus uses a PULL model (scrapes targets). Pushgateway exists for short-lived jobs that cannot be scraped.

2. OpenTelemetry = unified CNCF standard for metrics + logs + traces. Vendor-neutral. Not a backend; it sends to Prometheus, Jaeger, etc.

3. Pod Security profiles: Privileged (no restrictions) < Baseline (prevents known privilege escalations) < Restricted (hardened, most restrictive)

4. Error budget at 99.9% SLO = 0.1% = ~43.8 minutes per month of allowed downtime

5. Network Policy default = permissive (all traffic allowed). The exam trap: a question saying 'restrict pod-to-pod traffic' requires ADDING a NetworkPolicy; it is not on by default. Deny-all baseline = apply an ingress+egress NetworkPolicy that selects all pods but specifies no allowed peers.
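
A deny-all baseline as described in the last fact might look like the manifest below, built here as a Python dict and serialized to JSON (kubectl accepts JSON manifests as well as YAML). The fields follow the networking.k8s.io/v1 NetworkPolicy schema; the helper name, policy name, and the `payments` namespace are illustrative.

```python
import json


def default_deny_all(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod and allows no peers."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},                     # empty selector = all pods in the namespace
            "policyTypes": ["Ingress", "Egress"],  # both directions, with no allow rules listed
        },
    }


manifest_json = json.dumps(default_deny_all("payments"), indent=2)
```

Because no `ingress` or `egress` rules are present, every selected pod is isolated in both directions until further policies allow specific traffic.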

Domain 3 (16% of exam)

Continuous Delivery and Platform Engineering

Must-Know Facts

CI pipeline stage order: Commit → Checkout → Lint → Unit Test → Build Container Image → Scan Image for CVEs → Push to Registry → Update GitOps Manifest. Each stage is a gate — failure blocks progression. Vulnerability scanning MUST be a blocking gate, not just a report.
Immutable artifact principle: build the container image ONCE in CI, then PROMOTE that exact same image tag through dev → staging → production. Never rebuild the image per environment — rebuilding defeats the immutability guarantee and means staging tests a different artifact than production runs.
Deployment strategies at a glance: Rolling Update = Kubernetes default, gradual pod replacement, rollback takes time. Blue-Green = two full environments, instant traffic switch, doubles resource cost. Canary = small percentage of traffic to new version, gradual increase, best for risk-managed rollouts with real production traffic.
GitOps rollback = revert the Git commit and push. The operator detects the revert and reconciles the cluster back to the previous desired state. This is faster, more auditable, and more reliable than kubectl rollout undo.
Supply chain security in CI: container image signing with cosign/Sigstore, SBOM (Software Bill of Materials) generation, SLSA levels for build provenance. Policy engines can enforce that only signed images are admitted to the cluster.
Progressive delivery with Argo Rollouts: advanced canary/blue-green with automated metric analysis. Argo Rollouts automatically promotes or rolls back based on Prometheus metric thresholds — this is the 'automated analysis' answer for advanced deployment scenarios.

Common Traps

Trap: Thinking canary deployment routes traffic to a separate test environment

Reality: Canary routes a PERCENTAGE of real production traffic to the new version while the majority still hits the stable version. This is real production traffic, not a test environment. Blue-Green uses TWO environments. Canary uses ONE environment with traffic splitting.

Trap: Choosing 'kubectl rollout undo' as the correct rollback in a GitOps workflow

Reality: In GitOps, the correct rollback is to revert the Git commit. kubectl rollout undo bypasses Git and creates drift: the cluster no longer matches Git's desired state. The GitOps operator will then try to re-apply the new version, fighting the rollback. Always revert Git, never kubectl rollout undo in a GitOps workflow.

Trap: Treating vulnerability scan results as a warning instead of a gate

Reality: Critical CVEs discovered in CI should FAIL the pipeline and block the image from being pushed to the registry. The exam treats supply chain security as a hard gate, not a soft advisory. Options that say 'continue and notify' are wrong for critical vulnerabilities.

Trap: Thinking rolling updates provide instant rollback

Reality: Rolling update rollback is a reverse rolling update; it takes time proportional to the number of pods. Blue-Green provides INSTANT rollback by switching the load balancer back to the blue environment. If instant rollback is a stated requirement, the answer is blue-green, not rolling update.

Confusing Pairs

Blue-Green Deployment vs. Canary Deployment

Blue-Green: two full identical environments, instant 100% traffic switch, instant rollback, double resource cost. Best for: when you need instant rollback capability. Canary: one environment with traffic percentage splitting, gradual traffic increase, minimal resource overhead. Best for: risk-managed rollout with real production validation. Key signal: if the question mentions 'route 5% of traffic' or 'gradually increase traffic', it is canary.

Rolling Update vs. Recreate Deployment

Rolling Update (Kubernetes default): gradually replaces old pods with new, zero downtime if readiness probes pass. Recreate: terminates ALL old pods before starting new pods — causes downtime. Recreate is only appropriate when the old and new versions cannot run simultaneously (e.g., database schema migration incompatibilities).

Supply Chain Security (SLSA) vs. Runtime Security (Falco)

Supply chain security (SLSA, cosign, SBOM) secures the BUILD and DISTRIBUTION pipeline — ensuring artifacts are what they claim to be and came from a trusted build process. Runtime security (Falco, eBPF-based tools) monitors RUNNING containers for suspicious behavior. Both are needed; they operate at different lifecycle stages.

Scenario Tips

If the question asks about:

A question describes routing 5% of traffic to a new version, monitoring metrics, and gradually increasing if healthy

Answer:

This is canary deployment. The key signals are: percentage of traffic, monitoring metrics, gradual increase. Choose canary every time these signals appear together.

Distractor to avoid:

Rolling update is wrong — it replaces all pods gradually but does not split traffic percentages. Blue-green is wrong — it switches all traffic at once.
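
The canary signals (traffic percentage, metric check, gradual increase) can be sketched as a small loop. This is a toy model and not the Argo Rollouts API: the step weights, the error-rate threshold, and the `error_rate_at` callback standing in for a metrics query are all assumptions.

```python
from typing import Callable


def run_canary(steps: list[int], error_rate_at: Callable[[int], float],
               max_error_rate: float) -> tuple[str, int]:
    """Walk through traffic weights, aborting if the canary looks unhealthy.

    Returns ("promoted", 100) or ("rolled_back", weight_at_failure).
    `error_rate_at(weight)` stands in for a metrics query at that weight.
    """
    for weight in steps:
        if error_rate_at(weight) > max_error_rate:
            return ("rolled_back", weight)   # stop and send traffic back to stable
    return ("promoted", 100)


healthy = run_canary([5, 25, 50, 100], lambda w: 0.002, max_error_rate=0.01)
unhealthy = run_canary([5, 25, 50, 100], lambda w: 0.05 if w >= 25 else 0.002,
                       max_error_rate=0.01)
```

In the second run the canary looks fine at 5% but is rolled back at 25%, so most users never reach the faulty version.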

If the question asks about:

The question asks what CI should do when a critical CVE is found during image scanning

Answer:

Fail the pipeline and block the image from being pushed to the registry. Critical CVEs are a hard gate.

Distractor to avoid:

Any option that says 'continue', 'add a warning label', or 'notify and allow developer to decide' is wrong for critical severity. The pipeline must fail.
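
A blocking gate amounts to turning scan findings into a non-zero exit code. The sketch below assumes findings arrive as a list of dicts with `id` and `severity` keys, which is a made-up shape rather than any particular scanner's output format; the CVE identifiers are placeholders.

```python
BLOCKING_SEVERITIES = {"CRITICAL"}


def scan_gate(findings: list[dict]) -> int:
    """Return a process exit code: 0 lets the pipeline continue, 1 fails it."""
    blocking = [f for f in findings if f["severity"].upper() in BLOCKING_SEVERITIES]
    for f in blocking:
        print(f"blocking: {f['id']} ({f['severity']})")
    return 1 if blocking else 0


findings = [
    {"id": "CVE-0000-0001", "severity": "LOW"},        # placeholder identifier
    {"id": "CVE-0000-0002", "severity": "CRITICAL"},   # placeholder identifier
]
exit_code = scan_gate(findings)
```

A CI step would pass this value to `sys.exit`, so the push-to-registry stage never runs when a critical finding is present.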

If the question asks about:

A production incident occurs and the team uses GitOps with Argo CD. How do they roll back fastest?

Answer:

Revert the Git commit and push. Argo CD detects the revert and reconciles the cluster to the previous state automatically. This keeps Git as the source of truth, so the rollback is not overwritten by the next reconciliation cycle.

Distractor to avoid:

kubectl rollout undo is wrong in a GitOps context — it creates drift and will be overwritten by the next reconciliation cycle.

Last-Minute Facts

1. SLSA v1.0 Build levels: L0 (no guarantees) → L1 (provenance exists, documented build) → L2 (hosted build platform + signed provenance, prevents post-build tampering) → L3 (hardened build platform, prevents insider tampering during build). L3 is the highest build level in v1.0. Higher = stronger supply chain security.

2. cosign/Sigstore: signs container images after CI build. Policy engines verify signatures at admission time.

3. Argo Rollouts: Kubernetes controller for canary and blue-green with automated metric-based promotion/rollback.

4. Image immutability: same image digest promoted through all environments. Never use 'latest' tag in GitOps manifests; always pin to a specific digest or immutable tag.

5. CI pipeline responsibility: produce artifact. GitOps operator responsibility: deploy artifact. Never overlap these concerns.
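
The pinning rule in fact 4 can be checked mechanically. The sketch below is a simplified parser for image references, not a full implementation of the reference grammar; the function names are illustrative.

```python
import re

DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned(image: str) -> bool:
    """True if the reference ends in a sha256 digest."""
    return bool(DIGEST_RE.search(image))


def uses_latest(image: str) -> bool:
    """True if the reference has no tag or the tag 'latest' (and no digest)."""
    if is_pinned(image):
        return False
    last_segment = image.rsplit("/", 1)[-1]   # drop registry host:port and path
    if ":" not in last_segment:
        return True                            # no tag means 'latest' is implied
    return last_segment.rsplit(":", 1)[1] == "latest"
```

A pre-merge check on the GitOps repository could reject any manifest where `uses_latest` returns true, and optionally require `is_pinned` for production overlays.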

Domain 4 (12% of exam)

Platform APIs and Provisioning Infrastructure

Must-Know Facts

CRD (Custom Resource Definition) defines the SCHEMA — the API shape and accepted fields for a new resource type. A CRD alone provides zero behavior. Behavior is provided by a custom controller or operator that watches instances of the CRD and acts on them. CRD + controller = operator.
The operator pattern encodes operational knowledge for a complex stateful application: install, configure, upgrade, backup, and failover logic is baked into the controller. Use operators for complex stateful apps (databases, message queues), not for simple stateless deployments.
Reconciliation loop design requirement: controllers MUST be idempotent — the reconcile function must be safe to run multiple times without unintended side effects. When a controller restarts, it re-reconciles everything from current cluster state. It cannot assume 'reconcile ran once already'.
Crossplane architecture: Provider (connects to cloud API) → Managed Resource (CRD mapping one-to-one with a cloud resource like RDSInstance) → Composite Resource (abstract higher-level API combining multiple managed resources) → Composition (template that defines how a composite resource is realized). Developers create Composite Resource Claims, not raw cloud API calls.
Terraform vs Crossplane: Terraform is standalone CLI, HCL language, external state file (.tfstate). Crossplane is Kubernetes-native, uses CRDs, state lives in etcd, leverages the reconciliation loop continuously. Crossplane continuously reconciles cloud resources like any Kubernetes controller — Terraform only reconciles on explicit `plan/apply`.

Common Traps

Trap: Thinking a CRD provides behavior when installed

Reality: A CRD is only a schema definition; it tells Kubernetes what fields the new resource accepts. When you create an instance of the CRD, Kubernetes stores it in etcd but NOTHING HAPPENS until a controller watches for that resource type and acts on it. CRD alone = data store. CRD + controller = functional operator.

Trap: Using operators for all Kubernetes applications

Reality: Operators are appropriate for complex stateful applications where lifecycle management (backup, upgrade, failover) requires specialized operational logic. For simple stateless microservices, a Deployment and Helm chart is sufficient. Operators add complexity, so only use them when the operational knowledge they encode is genuinely needed.

Trap: Thinking Crossplane replaces Terraform for all IaC use cases

Reality: Crossplane is the Kubernetes-native choice for teams that want cloud infrastructure managed through the Kubernetes API. Terraform has broader ecosystem support, is cloud-agnostic without Kubernetes, and has a larger module ecosystem. They have different strengths: the exam will not ask you to say one is universally better, but CNPA favors Crossplane for Kubernetes-centric platforms.

Trap: Assuming reconciliation logic runs exactly once per change

Reality: Kubernetes controllers process events from a work queue and may process the same event multiple times due to retries, crashes, and restarts. Reconcile logic MUST be idempotent, that is, safe to run repeatedly. Designing reconcile to 'create if not exists' and 'update if different' (not 'always create') is the correct pattern.

Confusing Pairs

CRD (Custom Resource Definition) vs. Kubernetes Operator

CRD = schema only (defines API shape, accepted fields, validation rules). Operator = CRD + custom controller (adds operational behavior for the lifecycle of the custom resource). Think of CRD as a database table schema, and the operator as the application logic that acts on rows in that table.

Crossplane Managed Resource vs. Crossplane Composite Resource

Managed Resource: maps 1:1 to a specific cloud resource (e.g., RDSInstance CRD = one AWS RDS database). Composite Resource: abstracts multiple Managed Resources behind a developer-friendly API (e.g., DatabaseClaim creates RDS + subnet group + security group). Developers use Composite Resources; platform engineers define the Compositions that realize them.
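
The relationship between one abstract request and several managed resources can be pictured as a plain function. This is only a conceptual sketch of what a Composition does, not Crossplane's API or a real provider schema: the `compose` function, the `size` field, and the resource kinds and names are hypothetical.

```python
def compose(claim: dict) -> list[dict]:
    """Expand one abstract database claim into several managed resources.

    Hypothetical model of a Composition; kinds and fields are
    illustrative, not a real provider's schema.
    """
    name = claim["metadata"]["name"]
    size = claim["spec"]["size"]
    # The platform team decides what 'small' and 'large' mean.
    instance_class = {"small": "db.t3.micro", "large": "db.m5.large"}[size]
    return [
        {"kind": "SubnetGroup", "name": f"{name}-subnets"},
        {"kind": "SecurityGroup", "name": f"{name}-sg"},
        {"kind": "RDSInstance", "name": f"{name}-db", "instanceClass": instance_class},
    ]


claim = {"kind": "DatabaseClaim", "metadata": {"name": "orders"}, "spec": {"size": "small"}}
managed = compose(claim)
```

The developer only writes the three-field claim; the mapping to cloud-specific resources stays with the platform team that owns the composition.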

Terraform vs. Crossplane

Terraform: standalone CLI, HCL language, external state file, plan/apply workflow, runs outside Kubernetes. Crossplane: Kubernetes-native, CRD-based, state in etcd, continuous reconciliation loop, kubectl interface. When a question says 'Kubernetes-native infrastructure provisioning' or 'developers use kubectl to request cloud resources', the answer is Crossplane.

Scenario Tips

If the question asks about:

Developers need to request a managed PostgreSQL database using a Kubernetes YAML manifest without cloud provider knowledge

Answer:

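A Crossplane Composite Resource Definition (XRD) that exposes a claim such as DatabaseClaim. Developers submit a DatabaseClaim resource (an instance of the custom type, not the CRD itself); the Composition expands it into managed resources, and the Crossplane provider's controller provisions the actual cloud database. This is the Kubernetes-native infrastructure abstraction pattern.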

Distractor to avoid:

Giving developers IAM credentials to provision cloud resources directly bypasses all platform abstraction. Deploying a PostgreSQL pod in Kubernetes is not a managed database service.

If the question asks about:

A custom controller crashes and restarts. The question asks what property its reconciliation logic must have.

Answer:

Idempotency — the reconcile function must be safe to run multiple times without unintended side effects. On restart, the controller re-processes current state from scratch.

Distractor to avoid:

Storing state externally in a database adds unnecessary coupling. Kubernetes does not preserve in-memory controller state across pod restarts.
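
Idempotent 'create-or-update' logic can be sketched with an in-memory dict standing in for the cluster. The names below are hypothetical and no Kubernetes client library is involved.

```python
def reconcile(cluster: dict, name: str, desired_spec: dict) -> str:
    """Create-or-update: safe to call any number of times for the same input."""
    current = cluster.get(name)
    if current is None:
        cluster[name] = dict(desired_spec)   # create only if missing
        return "created"
    if current != desired_spec:
        cluster[name] = dict(desired_spec)   # update only if different
        return "updated"
    return "unchanged"                       # repeat runs are no-ops


cluster: dict = {}
spec = {"engine": "postgres", "version": "16"}
# Simulate the same event being processed three times (retry, restart, resync).
actions = [reconcile(cluster, "orders-db", spec) for _ in range(3)]
```

Only the first call creates anything; the second and third return "unchanged", so a restarted controller that replays the event does not produce duplicate resources.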

Last-Minute Facts

1. CRD apiVersion: apiextensions.k8s.io/v1, the extension API group for defining new resource types

2. Crossplane Composition = template that specifies how a Composite Resource is realized as Managed Resources

3. Operator reconciliation must be idempotent: 'create-or-update' semantics, never 'always create'

4. Kubernetes API groups: core (apiVersion: v1), apps (apps/v1), batch (batch/v1), custom (your-group.domain/v1)

Domain 5 (8% of exam)

IDPs and Developer Experience

Must-Know Facts

An IDP (Internal Developer Platform) is the ENTIRE self-service platform: CI/CD, secret management, monitoring, databases, namespaces, and all platform services. Backstage is ONE component — the developer portal UI layer — not the whole IDP.
Backstage four core components: (1) Software Catalog (central registry of all services, APIs, libraries, and teams with ownership), (2) Software Templates (scaffolding/golden paths for creating new projects), (3) TechDocs (documentation-as-code — markdown docs in repo, published to Backstage), (4) Plugins (extensibility ecosystem). The Software Catalog is the core — everything else plugs into it.
Golden paths are RECOMMENDED workflows, not mandatory constraints. Developers CAN deviate from golden paths, but they take on the additional operational burden themselves (security hardening, monitoring setup, etc.) that the golden path would have handled. This is the 'golden path, not golden cage' principle.
The primary value of the Software Catalog (Backstage's service catalog) is discoverability and ownership. It answers: 'who owns this service?', 'what APIs are available?', 'what does this service depend on?'. Ownership data is essential for incident response.
Self-service is the core IDP value proposition: developers should be able to create standard resources (namespaces, databases, CI pipelines) without filing tickets or waiting for the platform team. Ticket volume reduction is a key IDP success metric.

Common Traps

Trap: Treating Backstage as a ready-to-use finished product

Reality: Backstage is an open-source FRAMEWORK that organizations customize, configure, and maintain. It requires ongoing engineering investment: plugin development, catalog entity maintenance, template authoring. A platform team that installs Backstage is starting a product, not finishing one.

Trap: Confusing the developer portal (Backstage) with the IDP

Reality: The IDP is the entire self-service platform. Backstage is the portal UI that surfaces the IDP's capabilities to developers. The IDP includes the CI/CD system, secret manager, monitoring, database provisioning, and Kubernetes namespaces, that is, all the infrastructure services. Backstage is the interface to all of these.

Trap: Thinking golden paths are enforced constraints developers cannot bypass

Reality: Golden paths are opinionated recommendations, not hard restrictions. The platform should make the golden path easier than alternatives: 'paving the road', not 'building a cage'. Developers who deviate accept more responsibility. Forcing only golden path options would reduce developer autonomy and create bottlenecks.

Confusing Pairs

IDP (Internal Developer Platform) vs. Developer Portal (Backstage)

IDP = the full self-service platform (all tools, services, pipelines, and infrastructure). Developer Portal = the UI/interface layer where developers interact with the IDP. Backstage is the most common developer portal framework. The portal is one component of the IDP, not the IDP itself.

Software Catalog vs. Software Templates

Software Catalog: DISCOVER existing services, APIs, teams, and their ownership. Software Templates: CREATE new services from opinionated golden-path scaffolding. Use catalog to find, use templates to build. These are the two most commonly confused Backstage components.

Scenario Tips

If the question asks about:

A new developer needs to create a microservice with CI pipeline, database, and monitoring already configured on day one

Answer:

Software Templates (golden paths) in Backstage scaffold new services with all standard integrations pre-configured. This is the self-service IDP model — developers get a fully wired service from a template without manual setup.

Distractor to avoid:

The Software Catalog helps you FIND existing services, not create new ones. TechDocs provides documentation but does not automate setup.

If the question asks about:

The PRIMARY value of Backstage's Software Catalog

Answer:

A central registry where all services, APIs, and teams can be discovered with ownership information. The primary value is discoverability and ownership tracking.

Distractor to avoid:

Scaffolding (Software Templates) and documentation (TechDocs) are separate Backstage components. If the question says 'catalog', the answer is about discoverability and ownership, not creation.

If the question asks about:

A platform team complains that developers still open tickets instead of using the IDP. What is the most likely root cause?

Answer:

The golden paths are not easy enough — the self-service option requires more effort than filing a ticket. Platform adoption requires that the golden path is demonstrably faster and simpler. The fix is to improve self-service UX and discoverability, not to restrict the ticket system.

Distractor to avoid:

Mandating IDP usage by disabling the ticket system is the 'golden cage' anti-pattern — it reduces developer autonomy and creates resentment. The right answer is always to make the platform more compelling, not more coercive.

Last-Minute Facts

1. Backstage was created by Spotify and donated to the CNCF, where it is an Incubating project (not Graduated)

2. IDP vs Developer Portal: IDP is everything; portal is the UI. Backstage = portal framework, not full IDP.

3. Golden paths: recommended, not mandatory. Deviation = developer accepts extra operational burden.

4. TechDocs: markdown docs in the repo, auto-published to Backstage. 'Docs as code': docs live in the same repo as the service.

Domain 6 · 8% of exam

Measuring your Platform

Must-Know Facts

Four DORA metrics by name and definition: (1) Deployment Frequency — how often code is deployed to production. (2) Lead Time for Changes — time from code commit to running in production. (3) Change Failure Rate — percentage of deployments that cause an incident requiring rollback or hotfix. (4) MTTR (Mean Time to Restore) — average time to restore service after a production incident.
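DORA elite performer thresholds: Deployment Frequency = multiple deploys per day. Lead Time for Changes = under 1 hour. Change Failure Rate = under 5%. MTTR = under 1 hour. The published cut-offs shift between annual DORA reports (some editions list elite lead time as under one day), so treat these as the values to memorize for the exam rather than as fixed constants.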
DORA metrics measure holistically — high speed with high failure rate is NOT elite. Elite performers are fast AND stable simultaneously. The four metrics must improve together.
MTTR is Mean Time to RESTORE (recovering service, not fixing root cause). Sometimes called Mean Time to Recovery. The exam uses 'restore' — service back to normal, not the underlying bug being fixed.
Platform adoption metrics measure IDP effectiveness: self-service request volume vs manual ticket volume over time, golden path adoption rate, developer portal active users. Measuring adoption is as important as technical metrics — a platform nobody uses provides no value.
Error budget burn rate is the rate at which the error budget is being consumed relative to the monthly allocation. A high burn rate signals an active incident. When the error budget is exhausted, stop feature work and focus on reliability.
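The four DORA definitions above can be turned into arithmetic over deployment records. The following is a minimal sketch, not a standard tool: the `Deployment` record, its field names, and the sample timestamps are all hypothetical, and it assumes an incident starts at the moment of the failing deployment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # caused an incident needing rollback or hotfix
    restored_at: Optional[datetime] = None  # when service was back to normal


def hours(delta):
    return delta.total_seconds() / 3600


def dora_metrics(deployments, period_days):
    failures = [d for d in deployments if d.failed]
    # Assumption: time to restore is measured from the failing deployment.
    restores = [hours(d.restored_at - d.deployed_at) for d in failures if d.restored_at]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "lead_time_hours": mean(hours(d.deployed_at - d.committed_at) for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr_hours": mean(restores) if restores else 0.0,
    }


t0 = datetime(2024, 1, 1, 9, 0)
h = timedelta(hours=1)
deployments = [
    Deployment(t0, t0 + 0.5 * h),
    Deployment(t0 + 1 * h, t0 + 1.5 * h),
    Deployment(t0 + 2 * h, t0 + 3 * h, failed=True, restored_at=t0 + 3.5 * h),
    Deployment(t0 + 4 * h, t0 + 4.5 * h),
]
metrics = dora_metrics(deployments, period_days=1)
print(metrics)
```

Note that MTTR here stops at `restored_at`, the moment service is back to normal, not at the later point where the root cause is fixed.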

Common Traps

Trap: Thinking high deployment frequency alone makes an elite DORA performer

Reality: DORA measures BOTH velocity (deployment frequency, lead time) AND stability (change failure rate, MTTR). A team deploying 20 times per day with a 30% change failure rate is NOT elite; it is shipping broken changes rapidly. Elite performers must improve all four metrics together.

Trap: Confusing MTTR with MTTF or MTBF

Reality: MTTR (Mean Time to RESTORE) = how quickly you recover from an incident. MTTF (Mean Time to Failure) = how long until the next failure. MTBF (Mean Time BETWEEN Failures) = average time between incidents. DORA uses MTTR. Lowering MTTR requires good monitoring, runbooks, rollback capability, and on-call processes.

Trap: Assuming DORA metrics measure individual engineers

Reality: DORA metrics measure TEAM and PROCESS performance. They are indicators of organizational DevOps capability and delivery pipeline health. Using DORA metrics to evaluate individual engineers is an anti-pattern that the research explicitly warns against.

Trap: Thinking a positive error budget means you can skip stability work

Reality: Remaining error budget means you have capacity to take some risk, but it should inform deployment decisions, not eliminate reliability investment. The budget encourages a data-driven conversation between product (ship features) and platform (maintain reliability); it is not a license to ignore stability.

Confusing Pairs

Deployment Frequency vs. Lead Time for Changes

Deployment Frequency = how often you deploy (frequency of releases). Lead Time = how long a change takes from commit to production (end-to-end pipeline speed). Both measure delivery velocity but from different angles. High frequency can coexist with long lead time when each change waits in slow review, test, or approval stages before it ships. Elite performers have both high frequency AND low lead time.

Change Failure Rate vs. MTTR

Change Failure Rate = quality metric (what % of your deployments break things). MTTR = resilience metric (how quickly you recover when things break). CFR measures prevention; MTTR measures response. Improving CFR = better testing, canary deployments, quality gates. Improving MTTR = better monitoring, runbooks, rollback capability.

Platform Adoption Metrics vs. DORA Metrics

DORA metrics measure software delivery performance (speed + stability). Platform adoption metrics measure whether developers are actually using the IDP (self-service volume, ticket reduction, portal users). A platform with great DORA metrics but low adoption may mean the platform team is the only user. Both sets of metrics are needed.

Scenario Tips

If the question asks about:

A team deploys 15 times per day but 30% of deployments cause incidents. How should this be characterized using DORA?

Answer:

High deployment frequency but low stability — NOT an elite performer. Change failure rate of 30% far exceeds the elite threshold of under 5%. The recommendation is to improve deployment quality (better testing, canary deployments, quality gates), not to reduce deployment frequency.

Distractor to avoid:

Choosing 'elite because of high frequency' ignores that DORA requires all four metrics to be strong simultaneously. Reducing frequency is also a wrong answer — fix quality, not speed.
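The 'all four metrics together' rule in this scenario can be sketched as a simple check. The thresholds below are the elite values quoted in these notes, not an official API, and the team's lead time and MTTR are hypothetical values added so the example runs (the scenario only gives frequency and failure rate).

```python
def failing_metrics(team):
    """Return the names of the metrics that miss the elite thresholds from these notes."""
    checks = {
        "deployment_frequency": team["deploys_per_day"] > 1,       # multiple deploys per day
        "lead_time": team["lead_time_hours"] < 1,                  # under 1 hour
        "change_failure_rate": team["change_failure_rate"] < 0.05, # under 5%
        "mttr": team["mttr_hours"] < 1,                            # under 1 hour
    }
    return [name for name, ok in checks.items() if not ok]


team = {
    "deploys_per_day": 15,        # from the scenario
    "change_failure_rate": 0.30,  # from the scenario
    "lead_time_hours": 0.5,       # hypothetical
    "mttr_hours": 0.75,           # hypothetical
}

gaps = failing_metrics(team)
is_elite = not gaps
print(is_elite, gaps)
```

A single failing metric is enough to rule out elite status, and the list of gaps points at what to fix: here it is deployment quality, not deployment speed.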

If the question asks about:

A platform team wants to measure whether their IDP is reducing manual work for developers. Which metric is most appropriate?

Answer:

Self-service request volume versus manual ticket volume over time. As the IDP matures, self-service requests increase and ticket volume decreases. This directly measures the IDP's core value proposition.

Distractor to avoid:

DORA metrics (deployment frequency, change failure rate, etc.) measure delivery performance, not IDP self-service effectiveness. The question is asking about platform adoption, not delivery velocity.
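As a rough illustration of the adoption metric in this scenario, the share of requests handled through self-service can be tracked month by month. The function name and the monthly counts are invented for the example; a real platform would pull these numbers from its portal and ticketing system.

```python
def self_service_share(self_service_requests, manual_tickets):
    """Fraction of all requests that went through self-service instead of a ticket."""
    total = self_service_requests + manual_tickets
    return self_service_requests / total if total else 0.0


# (month, self-service requests, manual tickets), hypothetical data
monthly = [
    ("2024-01", 40, 160),
    ("2024-02", 90, 110),
    ("2024-03", 150, 50),
]

trend = [(month, self_service_share(s, t)) for month, s, t in monthly]
for month, share in trend:
    print(f"{month}: {share:.0%} self-service")
```

A rising share alongside falling ticket volume is the signal that the IDP is delivering its core value; a flat share suggests the golden path is still harder than filing a ticket.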

If the question asks about:

SLO is 99.9% and actual availability was 99.95% this quarter. The team wants to ship a risky large feature. How should error budget inform this?

Answer:

They have remaining error budget (0.05% unused of the 0.1% total). They can afford some risk but should use canary deployment and close monitoring to minimize the chance of budget exhaustion.

Distractor to avoid:

Saying 'no error budget because SLO was met' reverses the logic. Achieving ABOVE the SLO means you have budget remaining. Waiting for a quarterly reset wastes available innovation capacity.
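The budget arithmetic in this scenario is short enough to write out. This is a sketch with an illustrative function name; availability values are expressed as fractions (0.999 for 99.9%).

```python
def error_budget_remaining(slo, actual_availability):
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = 1.0 - slo                     # allowed unavailability
    consumed = 1.0 - actual_availability   # unavailability actually incurred
    return (budget - consumed) / budget


remaining = error_budget_remaining(slo=0.999, actual_availability=0.9995)
print(f"{remaining:.0%} of the error budget remains")
```

With a 99.9% SLO and 99.95% measured availability, half of the budget is unspent, which supports shipping the feature behind a canary rather than waiting for the window to reset.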

Last-Minute Facts

1. DORA elite: deploy multiple times/day, lead time < 1 hour, CFR < 5%, MTTR < 1 hour

2. DORA low performer: deploy less than once/month, lead time > 6 months, CFR 46-60%, MTTR > 6 months

3. MTTR = Mean Time to RESTORE (service recovery), not root cause fix

4. Error budget = 1 - SLO. 99.9% SLO → 0.1% budget → ~43.8 minutes/month

5. DORA metrics measure team/process performance, never individual engineers
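The ~43.8 minutes figure and the burn-rate idea from the Must-Know Facts can both be reproduced with a few lines. The sketch assumes an average month of 365.25 / 12 days, and it uses one common definition of burn rate (observed error ratio divided by the allowed error ratio); the function names and the 1% error ratio are illustrative.

```python
MINUTES_PER_AVG_MONTH = 365.25 * 24 * 60 / 12  # assumes an average-length month


def error_budget_minutes(slo, window_minutes=MINUTES_PER_AVG_MONTH):
    """Allowed downtime in the window: (1 - SLO) * window length."""
    return (1.0 - slo) * window_minutes


def burn_rate(observed_error_ratio, slo):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return observed_error_ratio / (1.0 - slo)


budget_avg_month = error_budget_minutes(0.999)            # about 43.8 minutes
budget_30_days = error_budget_minutes(0.999, 30 * 24 * 60)  # about 43.2 minutes
incident_burn = burn_rate(observed_error_ratio=0.01, slo=0.999)  # about 10x

print(round(budget_avg_month, 1), round(budget_30_days, 1), round(incident_burn, 1))
```

A burn rate of 1 spends the budget exactly over the window; a rate near 10 would exhaust a monthly budget in roughly three days, which is the kind of signal that indicates an active incident.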
