General Exam Tips
- 1. Read ALL answer options before selecting: many questions have two plausible answers, but one is 'most aligned with the platform engineering goal'. The platform mindset (self-service, declarative, least effort) usually picks the winner.
- 2. Watch for trigger words in question stems: 'declarative', 'secure by default', 'self-service', 'least effort', and 'without code changes'. These almost always point toward the GitOps/operator/service mesh answer.
- 3. Despite the official 'Beginner' label, the exam is genuinely intermediate difficulty. Do not under-prepare just because it is labeled foundational.
- 4. 60 questions in 120 minutes = exactly 2 minutes per question. This is generous time, so you can afford to read carefully. Mark and skip questions where you are unsure and return to them after completing the rest.
- 5. No penalty for wrong answers. Always answer every question, even if guessing. Never leave a question blank.
- 6. The exam is closed-book and online proctored via PSI. Your environment must be quiet and your desk clear. Prepare your ID and test space the day before.
- 7. Think like a platform team building products for developers, not like a Kubernetes operator. The exam rewards architectural reasoning over syntax recall.
- 8. When a scenario describes 'without modifying application code', 'transparently', or 'without developer involvement', the answer is almost always a service mesh, admission webhook, or policy engine, not an application-level change.
- 9. When the question is about rollback in a GitOps environment, the answer is always 'revert the Git commit'. Never kubectl rollout undo.
- 10. The exam covers OpenTelemetry as the standardized observability framework: know it is a CNCF project that provides unified APIs and SDKs for metrics, logs, AND traces. It is distinct from Prometheus or Jaeger individually.
Platform Engineering Core Fundamentals
Must-Know Facts
- This single domain is 36% of the exam — more than one-third. You CANNOT pass without mastering it. Spend disproportionate study time here.
- The four OpenGitOps principles VERBATIM: (1) Declarative, (2) Versioned and Immutable, (3) Pulled Automatically, (4) Continuously Reconciled. The exam uses exact OpenGitOps terminology.
- Continuous Delivery: every commit is in a deployable state, production deployment requires a MANUAL trigger. Continuous Deployment: every passing commit is AUTOMATICALLY deployed to production with no human gate.
- Platform engineering is 'DevOps as a product' — platform engineers build platforms that enable dev teams to practice DevOps without deep infrastructure expertise. It is a distinct discipline, not just rebranded DevOps or SRE.
- Declarative management = describe desired state, let the reconciliation loop handle 'how'. Imperative management = issue step-by-step commands. Kubernetes is declarative by design.
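- Team Topologies applies here: the platform team is one of its four team types (alongside stream-aligned, enabling, and complicated-subsystem teams), and its purpose is to reduce cognitive load for stream-aligned (feature) teams. Do not confuse it with an 'enabling team', which coaches other teams rather than running a platform. This framing appears in exam questions about why organizations build IDPs.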
- The CNCF Platform Maturity Model describes platform evolution: Provisional (reactive, manual) → Operational → Scalable → Optimizing. Platform maturity is not just about technology — it includes adoption, processes, and team topology.
- GitOps workflow: developer pushes to Git → CI pipeline builds and validates artifact → CI updates the GitOps manifest (image tag) → GitOps operator detects change → operator reconciles cluster to new desired state. The CI system does NOT deploy to the cluster directly.
- Application environment promotion order: Development → Staging/Pre-production → Production. The same immutable artifact is promoted through environments — never rebuilt.
Scenario Tips
A developer commits to Git and the question asks what happens next in a GitOps workflow
CI pipeline triggers (builds, tests, scans), CI updates the GitOps manifest image tag, GitOps operator detects the change, operator reconciles cluster to match new desired state. The key sequence to get right: CI DOES NOT deploy directly; it updates the manifest, and the operator does the deploying.
Saying 'CI pipeline deploys to the cluster' is wrong in a GitOps context. CI produces the artifact and updates the manifest. Deployment is the operator's job.
A manual kubectl change is applied to production. The question asks what the GitOps operator does.
The operator detects drift (current state ≠ desired state in Git), then reconciles the cluster BACK to match Git — the manual change is overwritten. This is by design: Git is the single source of truth.
Choosing 'operator sends an alert and preserves the change' is wrong. Default GitOps behavior is to reconcile drift, not preserve manual changes. Argo CD with self-heal enabled will immediately revert it.
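The drift behaviour in this scenario can be sketched as a toy reconciliation pass. This is a simplified model, not how Argo CD or Flux are implemented: the `reconcile` function, the dictionary-based state, and the `shop/api` image name are all illustrative assumptions.

```python
def reconcile(desired, actual):
    """Report fields that drifted from Git, then overwrite the cluster state."""
    drift = {
        key: {"git": desired.get(key), "cluster": actual.get(key)}
        for key in desired.keys() | actual.keys()
        if desired.get(key) != actual.get(key)
    }
    # Git is the source of truth, so the cluster is reset to match it.
    actual.clear()
    actual.update(desired)
    return drift


git_state = {"image": "shop/api:1.4.2", "replicas": 3}
cluster_state = dict(git_state)

# Someone applies a manual change directly against production.
cluster_state["replicas"] = 10

drift = reconcile(git_state, cluster_state)
print(drift)          # {'replicas': {'git': 3, 'cluster': 10}}
print(cluster_state)  # {'image': 'shop/api:1.4.2', 'replicas': 3}
```

The manual change is detected and discarded; a second `reconcile` call would report no drift.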
A team deploys 20 times per day but has a 25% change failure rate. How should you characterize them using DORA?
High in deployment frequency but low in stability — NOT an elite performer. DORA elite requires BOTH high frequency AND low failure rate (< 5%). High frequency with high failure rate means the team is shipping broken changes rapidly — a dangerous pattern.
Choosing 'elite performer because of high frequency' misses that DORA measures all four metrics holistically. Speed without stability is explicitly not the elite profile.
Platform Observability, Security, and Conformance
Must-Know Facts
- Three observability pillars: Metrics (quantitative aggregated measurements), Logs (discrete timestamped events), Traces (end-to-end request paths through distributed services). Each answers a different question: metrics = 'is something wrong, and how much?', logs = 'what happened?', traces = 'where in the request path did it happen, and why?'.
- OpenTelemetry is the CNCF standard for collecting all three pillars — metrics, logs, AND traces — through unified APIs and SDKs. It is vendor-neutral and separate from specific backends like Prometheus or Jaeger.
- Error budget = 1 minus SLO target. At 99.9% SLO, error budget is 0.1% (about 43.8 minutes in an average month of roughly 30.44 days, or 43.2 minutes over a flat 30-day window). When error budget is exhausted, the team should freeze feature work and prioritize reliability.
- Network Policies are DEFAULT ALLOW in Kubernetes. If no NetworkPolicy selects a pod, all ingress and egress traffic is permitted. Adding a NetworkPolicy creates restrictions — it does not grant additional permissions.
- Admission webhook execution order: Mutating webhooks run FIRST (can modify the resource). Validating webhooks run SECOND (can allow or deny). Validators see the resource AFTER mutations have been applied.
- Pod Security Admission replaced PodSecurityPolicy (PSP) in Kubernetes 1.25. PSP is removed — never choose PSP as a modern answer. The three Pod Security profiles are Privileged, Baseline, and Restricted.
- OPA/Gatekeeper policy modes: Audit mode logs violations on EXISTING resources (reports but does not block). Enforce mode blocks NEW and UPDATED resources that violate policy. Existing violating resources are NOT automatically deleted or corrected in enforce mode.
- mTLS in a service mesh is transparent to application code — the sidecar proxy handles TLS termination and origination. Applications communicate as if using plaintext; the mesh encrypts and mutually authenticates at the proxy layer.
- Kubernetes Secrets are base64-encoded by default, NOT encrypted. Enable encryption at rest via EncryptionConfiguration. For production secrets, use an external secrets manager (HashiCorp Vault, AWS Secrets Manager) and synchronize to Kubernetes Secrets.
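The error budget arithmetic in the list above is easy to check by hand. The sketch below assumes a time-based availability SLO; the function name and the two window lengths are illustrative choices, not part of any standard library.

```python
def error_budget_minutes(slo_percent, window_days):
    """Minutes of allowed unavailability for a time-based availability SLO."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60


AVERAGE_MONTH_DAYS = 365.25 / 12  # about 30.44 days

print(round(error_budget_minutes(99.9, 30), 1))                  # 43.2
print(round(error_budget_minutes(99.9, AVERAGE_MONTH_DAYS), 1))  # 43.8
```

The commonly quoted 43.8-minute figure corresponds to an average calendar month; a flat 30-day window gives 43.2 minutes.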
Scenario Tips
The question asks how to prevent root containers in a namespace without modifying each deployment manifest
Apply a Pod Security Admission label to the namespace to enforce the Restricted profile (`pod-security.kubernetes.io/enforce: restricted`), OR use a Kyverno/OPA policy in enforce mode. PSA is built into Kubernetes (no extra tooling needed). Kyverno provides more granular control. Both are valid; if the question says 'namespace-level control', PSA is the cleanest answer.
Choosing PodSecurityPolicy (PSP) is wrong — PSP was removed in Kubernetes 1.25. Never select PSP for a modern platform engineering question.
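The admission ordering from the Must-Know Facts (mutating first, validating second) explains why a namespace-level control can work without touching each manifest. The following is a toy model of that ordering only; the function names and the pod dictionaries are assumptions for illustration and do not reflect the real Kubernetes admission API or how PSA is implemented.

```python
import copy


def mutate_default_non_root(pod):
    """Mutating step: fill in runAsNonRoot when the author left it unset."""
    patched = copy.deepcopy(pod)
    patched.setdefault("securityContext", {}).setdefault("runAsNonRoot", True)
    return patched


def validate_non_root(pod):
    """Validating step: allow or deny, without modifying the object."""
    ctx = pod.get("securityContext", {})
    if ctx.get("runAsUser") == 0:
        return False, "runAsUser 0 is root"
    if ctx.get("runAsNonRoot") is not True:
        return False, "runAsNonRoot must be true"
    return True, "allowed"


def admit(pod):
    mutated = mutate_default_non_root(pod)        # mutating runs first
    allowed, reason = validate_non_root(mutated)  # validator sees the mutated object
    return allowed, reason, mutated


print(admit({"name": "api"})[:2])
# (True, 'allowed')
print(admit({"name": "legacy", "securityContext": {"runAsUser": 0}})[:2])
# (False, 'runAsUser 0 is root')
```

The first pod passes only because the validator evaluates the object after the default was injected; the second is rejected regardless of the mutation.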
The question says 'encrypt and mutually authenticate ALL service-to-service traffic without application code changes'
Deploy a service mesh (Istio or Linkerd) with mTLS enabled. The sidecar proxies handle TLS transparently. Network Policies control traffic flow but do NOT encrypt it. Ingress TLS only covers external traffic.
Network Policies are wrong — they restrict which pods can communicate but do not encrypt traffic. TLS termination at Ingress is wrong — that only protects external-to-cluster communication.
An SLO is 99.9% and actual availability was 99.7% for the month. The SLA is 99.5%. What is the impact?
The SLO is violated (99.7% < 99.9%) but the SLA is not violated (99.7% > 99.5%). This is the intended design — the SLO violation triggers internal engineering action, and the SLA is preserved. The SLO is the early warning system.
Saying 'both SLO and SLA are violated' is a common mistake when candidates confuse which is stricter. SLO is always the stricter internal target.
Continuous Delivery and Platform Engineering
Must-Know Facts
- CI pipeline stage order: Commit → Checkout → Lint → Unit Test → Build Container Image → Scan Image for CVEs → Push to Registry → Update GitOps Manifest. Each stage is a gate — failure blocks progression. Vulnerability scanning MUST be a blocking gate, not just a report.
- Immutable artifact principle: build the container image ONCE in CI, then PROMOTE that exact same image tag through dev → staging → production. Never rebuild the image per environment — rebuilding defeats the immutability guarantee and means staging tests a different artifact than production runs.
- Deployment strategies at a glance: Rolling Update = Kubernetes default, gradual pod replacement, rollback takes time. Blue-Green = two full environments, instant traffic switch, doubles resource cost. Canary = small percentage of traffic to new version, gradual increase, best for risk-managed rollouts with real production traffic.
- GitOps rollback = revert the Git commit and push. The operator detects the revert and reconciles the cluster back to the previous desired state. This is faster, more auditable, and more reliable than kubectl rollout undo.
- Supply chain security in CI: container image signing with cosign/Sigstore, SBOM (Software Bill of Materials) generation, SLSA levels for build provenance. Policy engines can enforce that only signed images are admitted to the cluster.
- Progressive delivery with Argo Rollouts: advanced canary/blue-green with automated metric analysis. Argo Rollouts automatically promotes or rolls back based on Prometheus metric thresholds — this is the 'automated analysis' answer for advanced deployment scenarios.
Scenario Tips
A question describes routing 5% of traffic to a new version, monitoring metrics, and gradually increasing if healthy
This is canary deployment. The key signals are: percentage of traffic, monitoring metrics, gradual increase. Choose canary every time these signals appear together.
Rolling update is wrong — it replaces all pods gradually but does not split traffic percentages. Blue-green is wrong — it switches all traffic at once.
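The canary signals in this scenario (traffic percentage, metric check, gradual increase) can be sketched as a simple loop. This is a conceptual model and not the Argo Rollouts API: the step list, the error-rate callbacks, and the 1% threshold are hypothetical values chosen for illustration.

```python
def run_canary(steps, error_rate_at, max_error_rate):
    """Walk through traffic weights and abort at the first unhealthy step."""
    for weight in steps:
        observed = error_rate_at(weight)
        if observed > max_error_rate:
            # Analysis failed: send all traffic back to the stable version.
            return {"result": "rolled_back", "canary_weight": 0, "failed_at": weight}
    return {"result": "promoted", "canary_weight": 100, "failed_at": None}


def healthy(weight):
    return 0.002


def degrades_under_load(weight):
    return 0.002 if weight < 50 else 0.08


steps = [5, 25, 50, 100]
print(run_canary(steps, healthy, max_error_rate=0.01))
# {'result': 'promoted', 'canary_weight': 100, 'failed_at': None}
print(run_canary(steps, degrades_under_load, max_error_rate=0.01))
# {'result': 'rolled_back', 'canary_weight': 0, 'failed_at': 50}
```

The second run shows why canary limits blast radius: the regression is caught at 50% of traffic instead of after a full switch.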
The question asks what CI should do when a critical CVE is found during image scanning
Fail the pipeline and block the image from being pushed to the registry. Critical CVEs are a hard gate.
Any option that says 'continue', 'add a warning label', or 'notify and allow developer to decide' is wrong for critical severity. The pipeline must fail.
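A blocking gate means later stages never run once the scan fails. The sketch below models that with plain functions; the stage names mirror the stage order from the Must-Know Facts, while the finding IDs, the severity labels, and the `shop/api` tag are made-up placeholders and not output from a real scanner.

```python
def scan_image(findings, blocking_severities=("CRITICAL",)):
    """Return True only when no finding has a blocking severity."""
    return not any(f["severity"] in blocking_severities for f in findings)


def run_pipeline(stages):
    """Run stages in order and stop at the first one that fails."""
    completed = []
    for name, stage in stages:
        if not stage():
            return completed, name
        completed.append(name)
    return completed, None


findings = [
    {"id": "EXAMPLE-0001", "severity": "LOW"},
    {"id": "EXAMPLE-0002", "severity": "CRITICAL"},
]
pushed = []  # stands in for the registry

stages = [
    ("lint", lambda: True),
    ("unit-test", lambda: True),
    ("build-image", lambda: True),
    ("scan-image", lambda: scan_image(findings)),
    ("push-image", lambda: pushed.append("shop/api:1.4.2") is None),
    ("update-manifest", lambda: True),
]

completed, failed_at = run_pipeline(stages)
print(completed)  # ['lint', 'unit-test', 'build-image']
print(failed_at)  # scan-image
print(pushed)     # []
```

Because the scan fails, the image is never pushed and the GitOps manifest is never updated.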
A production incident occurs and the team uses GitOps with Argo CD. How do they roll back fastest?
Revert the Git commit and push. Argo CD detects the revert and reconciles the cluster to the previous state automatically. Faster than any kubectl command and maintains Git as the source of truth.
kubectl rollout undo is wrong in a GitOps context — it creates drift and will be overwritten by the next reconciliation cycle.
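The difference between the two rollback approaches can be modelled with a list standing in for Git history and a dictionary standing in for the cluster. This is a deliberately simplified sketch; `sync`, `revert_head`, and the image tags are illustrative assumptions, not Argo CD behaviour in detail.

```python
def sync(history, cluster):
    """Operator pass: make the cluster match the manifest at HEAD."""
    cluster["image"] = history[-1]["image"]


def revert_head(history):
    """Append a commit that restores the manifest from before HEAD, like `git revert`."""
    bad, previous = history[-1], history[-2]
    history.append({"message": "Revert: " + bad["message"], "image": previous["image"]})


history = [
    {"message": "release 1.4.1", "image": "shop/api:1.4.1"},
    {"message": "release 1.4.2", "image": "shop/api:1.4.2"},
]
cluster = {}
sync(history, cluster)  # 1.4.2 is live and causing the incident

# Out-of-band rollback (stands in for `kubectl rollout undo`): Git still says 1.4.2.
cluster["image"] = "shop/api:1.4.1"
sync(history, cluster)
after_manual_undo = cluster["image"]
print(after_manual_undo)  # shop/api:1.4.2

# GitOps rollback: change the desired state, then let the operator converge.
revert_head(history)
sync(history, cluster)
print(cluster["image"])  # shop/api:1.4.1
print(len(history))      # 3
```

The manual undo is overwritten on the next sync, while the revert sticks and leaves an auditable third commit in history.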
Platform APIs and Provisioning Infrastructure
Must-Know Facts
- CRD (Custom Resource Definition) defines the SCHEMA — the API shape and accepted fields for a new resource type. A CRD alone provides zero behavior. Behavior is provided by a custom controller or operator that watches instances of the CRD and acts on them. CRD + controller = operator.
- The operator pattern encodes operational knowledge for a complex stateful application: install, configure, upgrade, backup, and failover logic is baked into the controller. Use operators for complex stateful apps (databases, message queues), not for simple stateless deployments.
- Reconciliation loop design requirement: controllers MUST be idempotent — the reconcile function must be safe to run multiple times without unintended side effects. When a controller restarts, it re-reconciles everything from current cluster state. It cannot assume 'reconcile ran once already'.
- Crossplane architecture: Provider (connects to cloud API) → Managed Resource (CRD mapping one-to-one with a cloud resource like RDSInstance) → Composite Resource (abstract higher-level API combining multiple managed resources) → Composition (template that defines how a composite resource is realized). Developers create Composite Resource Claims, not raw cloud API calls.
- Terraform vs Crossplane: Terraform is a standalone CLI that uses the HCL language and keeps state in an external state file (.tfstate). Crossplane is Kubernetes-native: it uses CRDs, stores state in etcd, and runs inside the reconciliation loop, so it continuously reconciles cloud resources like any Kubernetes controller. Terraform only converges infrastructure when someone (or a pipeline) explicitly runs `terraform plan` and `terraform apply`.
Scenario Tips
Developers need to request a managed PostgreSQL database using a Kubernetes YAML manifest without cloud provider knowledge
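Define a Crossplane Composite Resource Definition (XRD) that offers a claim kind such as DatabaseClaim. Developers submit a DatabaseClaim resource (an instance of that kind, not a CRD); a Composition maps it to managed resources, and the Crossplane provider's controller provisions the actual cloud database. This is the Kubernetes-native infrastructure abstraction pattern.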
Giving developers IAM credentials to provision cloud resources directly bypasses all platform abstraction. Deploying a PostgreSQL pod in Kubernetes is not a managed database service.
A custom controller crashes and restarts. The question asks what property its reconciliation logic must have.
Idempotency — the reconcile function must be safe to run multiple times without unintended side effects. On restart, the controller re-processes current state from scratch.
Storing state externally in a database adds unnecessary coupling. Kubernetes does not preserve in-memory controller state across pod restarts.
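Idempotency is easiest to see by calling a reconcile function twice, as happens when a controller restarts. The two functions below are hypothetical toy controllers that manage a list of pod names; they are not based on any real controller framework.

```python
def reconcile(desired_replicas, pods):
    """Idempotent: converge on the desired count from whatever state exists."""
    while len(pods) < desired_replicas:
        pods.append("pod-" + str(len(pods)))
    del pods[desired_replicas:]  # remove any surplus
    return pods


def reconcile_naive(desired_replicas, pods):
    """Not idempotent: assumes it is only ever called once."""
    for _ in range(desired_replicas):
        pods.append("pod-" + str(len(pods)))
    return pods


pods = []
reconcile(3, pods)
reconcile(3, pods)  # controller restarted and reconciled again
print(pods)  # ['pod-0', 'pod-1', 'pod-2']

naive_pods = []
reconcile_naive(3, naive_pods)
reconcile_naive(3, naive_pods)  # same restart, but now there are duplicates
print(len(naive_pods))  # 6
```

The idempotent version compares observed state with desired state on every call, so it needs no memory of earlier runs.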
IDPs and Developer Experience
Must-Know Facts
- An IDP (Internal Developer Platform) is the ENTIRE self-service platform: CI/CD, secret management, monitoring, databases, namespaces, and all platform services. Backstage is ONE component — the developer portal UI layer — not the whole IDP.
- Backstage's four core components: (1) Software Catalog (central registry of all services, APIs, libraries, and teams with ownership), (2) Software Templates (scaffolding/golden paths for creating new projects), (3) TechDocs (documentation-as-code: markdown docs in the repo, published to Backstage), (4) Plugins (extensibility ecosystem). The Software Catalog is the core; everything else plugs into it.
- Golden paths are RECOMMENDED workflows, not mandatory constraints. Developers CAN deviate from golden paths, but they take on the additional operational burden themselves (security hardening, monitoring setup, etc.) that the golden path would have handled. This is the 'golden path, not golden cage' principle.
- The primary value of the Software Catalog is discoverability and ownership. It answers: 'who owns this service?', 'what APIs are available?', 'what does this service depend on?'. Ownership data is essential for incident response.
- Self-service is the core IDP value proposition: developers should be able to create standard resources (namespaces, databases, CI pipelines) without filing tickets or waiting for the platform team. Ticket volume reduction is a key IDP success metric.
Scenario Tips
A new developer needs to create a microservice with CI pipeline, database, and monitoring already configured on day one
Software Templates (golden paths) in Backstage scaffold new services with all standard integrations pre-configured. This is the self-service IDP model — developers get a fully wired service from a template without manual setup.
The Software Catalog helps you FIND existing services, not create new ones. TechDocs provides documentation but does not automate setup.
A question asks what the PRIMARY value of Backstage's Software Catalog is
A central registry where all services, APIs, and teams can be discovered with ownership information. The primary value is discoverability and ownership tracking.
Scaffolding (Software Templates) and documentation (TechDocs) are separate Backstage components. If the question says 'catalog', the answer is about discoverability and ownership, not creation.
A platform team complains that developers still open tickets instead of using the IDP. What is the most likely root cause?
The golden paths are not easy enough — the self-service option requires more effort than filing a ticket. Platform adoption requires that the golden path is demonstrably faster and simpler. The fix is to improve self-service UX and discoverability, not to restrict the ticket system.
Mandating IDP usage by disabling the ticket system is the 'golden cage' anti-pattern — it reduces developer autonomy and creates resentment. The right answer is always to make the platform more compelling, not more coercive.
Measuring your Platform
Must-Know Facts
- Four DORA metrics by name and definition: (1) Deployment Frequency — how often code is deployed to production. (2) Lead Time for Changes — time from code commit to running in production. (3) Change Failure Rate — percentage of deployments that cause an incident requiring rollback or hotfix. (4) MTTR (Mean Time to Restore) — average time to restore service after a production incident.
- DORA elite performer thresholds: Deployment Frequency = multiple deploys per day. Lead Time for Changes = under 1 hour. Change Failure Rate = under 5%. MTTR = under 1 hour.
- DORA metrics measure holistically — high speed with high failure rate is NOT elite. Elite performers are fast AND stable simultaneously. The four metrics must improve together.
- MTTR is Mean Time to RESTORE (recovering service, not fixing root cause). Sometimes called Mean Time to Recovery. The exam uses 'restore' — service back to normal, not the underlying bug being fixed.
- Platform adoption metrics measure IDP effectiveness: self-service request volume vs manual ticket volume over time, golden path adoption rate, developer portal active users. Measuring adoption is as important as technical metrics — a platform nobody uses provides no value.
- Error budget burn rate is the rate at which the error budget is being consumed relative to the monthly allocation. A high burn rate signals an active incident. When the error budget is exhausted, stop feature work and focus on reliability.
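Burn rate is often expressed as a multiple of the 'exactly on budget' pace. The sketch below uses one common definition (observed error rate divided by the error budget fraction) and assumes a 30-day SLO window; the sample error rates are invented for illustration.

```python
def burn_rate(observed_error_rate, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_rate / (1.0 - slo)


def hours_to_exhaustion(rate, window_hours):
    """Time until a full budget is spent if the current burn rate continues."""
    return window_hours / rate


WINDOW_HOURS = 30 * 24  # assumed 30-day SLO window

steady = burn_rate(0.001, slo=0.999)     # errors arriving exactly at budget pace
incident = burn_rate(0.0144, slo=0.999)  # 1.44% of requests failing

print(round(steady, 1))                                       # 1.0
print(round(incident, 1))                                     # 14.4
print(round(hours_to_exhaustion(incident, WINDOW_HOURS), 1))  # 50.0
```

A burn rate of 1 spends the budget exactly over the window; at 14.4 a full month's budget would be gone in about 50 hours, which is why a high burn rate is treated as an active incident.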
Scenario Tips
A team deploys 15 times per day but 30% of deployments cause incidents. How should this be characterized using DORA?
High deployment frequency but low stability — NOT an elite performer. Change failure rate of 30% far exceeds the elite threshold of under 5%. The recommendation is to improve deployment quality (better testing, canary deployments, quality gates), not to reduce deployment frequency.
Choosing 'elite because of high frequency' ignores that DORA requires all four metrics to be strong simultaneously. Reducing frequency is also a wrong answer — fix quality, not speed.
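The 'all four metrics together' rule can be written as a single check. The thresholds below are the elite figures quoted in this guide; the function name is hypothetical, and because the scenario does not state lead time or restore time, the values used for them are assumptions.

```python
def elite_check(deploys_per_day, lead_time_hours, change_failure_rate, restore_hours):
    """Return (is_elite, failing_metrics) using the thresholds quoted in this guide."""
    checks = {
        "deployment_frequency": deploys_per_day > 1,   # multiple deploys per day
        "lead_time_for_changes": lead_time_hours < 1,  # under 1 hour
        "change_failure_rate": change_failure_rate < 0.05,
        "time_to_restore": restore_hours < 1,          # under 1 hour
    }
    failing = [name for name, ok in checks.items() if not ok]
    return len(failing) == 0, failing


# Scenario: 15 deploys per day, 30% change failure rate.
# Lead time and restore time of 0.5 hours are assumed, not given in the scenario.
print(elite_check(15, 0.5, 0.30, 0.5))
# (False, ['change_failure_rate'])
print(elite_check(15, 0.5, 0.03, 0.5))
# (True, [])
```

Even with every other metric at an elite level, the single failing stability metric keeps the team out of the elite profile.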
A platform team wants to measure whether their IDP is reducing manual work for developers. Which metric is most appropriate?
Self-service request volume versus manual ticket volume over time. As the IDP matures, self-service requests increase and ticket volume decreases. This directly measures the IDP's core value proposition.
DORA metrics (deployment frequency, change failure rate, etc.) measure delivery performance, not IDP self-service effectiveness. The question is asking about platform adoption, not delivery velocity.
SLO is 99.9% and actual availability was 99.95% this quarter. The team wants to ship a risky large feature. How should error budget inform this?
They have remaining error budget (0.05% unused of the 0.1% total). They can afford some risk but should use canary deployment and close monitoring to minimize the chance of budget exhaustion.
Saying 'no error budget because SLO was met' reverses the logic. Achieving ABOVE the SLO means you have budget remaining. Waiting for a quarterly reset wastes available innovation capacity.
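The remaining-budget reasoning in this scenario is a short calculation. The helper below is an illustrative sketch with a hypothetical name; it treats the budget as a fraction of the measurement window.

```python
def remaining_budget_fraction(slo_percent, actual_percent):
    """Share of the error budget still unspent (negative means overspent)."""
    budget = 100.0 - slo_percent
    consumed = 100.0 - actual_percent
    return 1.0 - consumed / budget


print(round(remaining_budget_fraction(99.9, 99.95), 2))  # 0.5
print(round(remaining_budget_fraction(99.9, 99.7), 2))   # -2.0
```

At 99.95% availability half of the budget is still available for risk; at 99.7% the budget has been overspent by twice its size.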