DatabricksData Analyst Associate62 concepts

Data Analyst Associate Cheat Sheet

Quick reference for the Databricks Certified Data Analyst Associate exam.

Quick Navigation

SQL Fundamentals & Query Structure Aggregate Functions & GROUP BY Window Functions CTEs, Subqueries & Advanced Patterns Date, String & Array Functions Delta Lake & Table Management Unity Catalog & Governance Dashboards, Alerts & Visualization Analytics Ecosystem & Platform Features

SQL Fundamentals & Query Structure

SQL Execution Order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT: Logical processing order — WHERE runs before SELECT, so you cannot reference column aliases in WHERE
SELECT col1, col2 FROM schema.table WHERE col1 > 100 GROUP BY col1, col2 HAVING COUNT(*) > 5 ORDER BY col1 DESC LIMIT 10: Complete query template — filtering rows (WHERE), grouping (GROUP BY), filtering groups (HAVING), sorting (ORDER BY), limiting (LIMIT)
SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.id: Inner join — only matching rows from both tables
SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id: Left join — all rows from t1, matching rows from t2 (NULL if no match)
SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id: Full outer join — all rows from both tables, NULLs where no match exists
SELECT * FROM t1 CROSS JOIN t2: Cross join — every row in t1 paired with every row in t2 (Cartesian product)
INSERT INTO my_table VALUES (1, 'Alice', 100.00): Insert — appends rows to existing data. Does NOT replace existing rows
INSERT OVERWRITE my_table SELECT * FROM staging_table: Insert overwrite — replaces all existing data with new data. Unlike INSERT INTO, this is destructive

Aggregate Functions & GROUP BY

SELECT region, category, COUNT(*) AS total_orders, SUM(amount) AS total_revenue, AVG(amount) AS avg_order_value, MIN(amount) AS min_order, MAX(amount) AS max_order FROM orders GROUP BY region, category: Multi-dimension aggregation — one row per region-category combination with summary metrics
SELECT department, COUNT(DISTINCT employee_id) AS unique_employees FROM payroll GROUP BY department: COUNT DISTINCT — counts unique values only, ignoring duplicates
SELECT region, SUM(revenue) AS total FROM sales GROUP BY region HAVING SUM(revenue) > 100000: HAVING filters groups after aggregation — unlike WHERE, which filters rows before grouping
SELECT COALESCE(region, 'Unknown') AS region, COUNT(*) AS orders FROM sales GROUP BY region: COALESCE replaces NULL with a default value — essential for clean aggregate reports
SELECT product, ROUND(AVG(price), 2) AS avg_price FROM products GROUP BY product: ROUND for controlling decimal precision in aggregate results

Window Functions

SELECT *, ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY order_date DESC ) AS rn FROM orders: ROW_NUMBER — assigns unique sequential numbers within each partition. Use for deduplication (keep rn = 1)
SELECT *, RANK() OVER (ORDER BY score DESC) AS rank, DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank FROM students: RANK skips positions after ties (1,2,2,4). DENSE_RANK does not skip (1,2,2,3)
SELECT *, LAG(revenue, 1) OVER (ORDER BY month) AS prev_month, LEAD(revenue, 1) OVER (ORDER BY month) AS next_month FROM monthly_sales: LAG looks back, LEAD looks forward — use for period-over-period comparisons
SELECT *, SUM(amount) OVER ( PARTITION BY customer_id ORDER BY order_date ) AS running_total FROM orders: Running total — SUM with ORDER BY computes cumulative sum. Without ORDER BY, it computes a partition-wide total
SELECT *, AVG(amount) OVER ( PARTITION BY region ) AS region_avg FROM sales: Partition-wide average — no ORDER BY means the aggregate covers the entire partition, not a running calculation
SELECT * FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY id ORDER BY ts DESC ) AS rn FROM my_table ) WHERE rn = 1: Deduplication pattern — window function in subquery, filter in outer WHERE. Cannot use window function directly in WHERE
SELECT *, ROW_NUMBER() OVER ( PARTITION BY id ORDER BY ts DESC ) AS rn FROM my_table QUALIFY rn = 1: QUALIFY — Databricks shortcut that filters on window function results without needing a subquery

CTEs, Subqueries & Advanced Patterns

WITH regional_totals AS ( SELECT region, SUM(amount) AS total FROM sales GROUP BY region ) SELECT * FROM regional_totals WHERE total > 50000: CTE (Common Table Expression) — named temporary result set defined with WITH clause. More readable than nested subqueries
WITH step1 AS (SELECT ... FROM raw_data), step2 AS (SELECT ... FROM step1), step3 AS (SELECT ... FROM step2 JOIN dim) SELECT * FROM step3: Chained CTEs — each step can reference previous CTEs. Great for multi-step transformations
SELECT * FROM orders WHERE customer_id IN ( SELECT customer_id FROM vip_customers ): Subquery in WHERE — filters rows based on values from another query
SELECT CASE WHEN amount >= 1000 THEN 'High' WHEN amount >= 100 THEN 'Medium' ELSE 'Low' END AS tier FROM orders: CASE WHEN — conditional logic in SQL. Evaluated top-to-bottom, first match wins
SELECT * FROM sales PIVOT ( SUM(amount) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4') ): PIVOT — transforms rows into columns. Creates one column per distinct value in the pivot column
SELECT * FROM quarterly_sales UNPIVOT ( amount FOR quarter IN (Q1, Q2, Q3, Q4) ): UNPIVOT — transforms columns into rows. Reverses a pivot operation for normalized analysis
CAST(column AS DATE) CAST(column AS INT) CAST(column AS STRING): CAST — explicit type conversion. Common in analytics for dates, numeric formatting, and string operations

Date, String & Array Functions

SELECT CURRENT_DATE() AS today, CURRENT_TIMESTAMP() AS now, DATE_ADD(order_date, 30) AS plus_30_days, DATEDIFF(end_date, start_date) AS days_between: Date functions — current date/time, date arithmetic, and day-difference calculations
SELECT YEAR(order_date) AS yr, MONTH(order_date) AS mo, DAY(order_date) AS dy, DATE_TRUNC('MONTH', order_date) AS month_start: Date extraction — pull year/month/day components. DATE_TRUNC rounds down to the start of a period
SELECT UPPER(name) AS upper_name, LOWER(name) AS lower_name, TRIM(name) AS trimmed, LENGTH(name) AS char_count: String functions — case conversion, whitespace trimming, and length calculation
SELECT CONCAT(first_name, ' ', last_name) AS full_name, SUBSTRING(phone, 1, 3) AS area_code, REPLACE(email, '@old.com', '@new.com') AS updated_email: String manipulation — concatenation, substring extraction, and text replacement
SELECT *, EXPLODE(tags) AS tag FROM products: EXPLODE — flattens an array column into separate rows. One output row per array element
SELECT * FROM products WHERE 'waterproof' = ANY(features): ANY() — checks if a value exists within an array. Alternative to ARRAY_CONTAINS

Delta Lake & Table Management

CREATE TABLE my_table ( id INT, name STRING, amount DECIMAL(10,2) ) USING DELTA: Create a managed Delta table — data stored in Unity Catalog managed location. DROP TABLE will delete data
CREATE TABLE my_table LOCATION 's3://bucket/path': Create an external table — data stays in your cloud storage. DROP TABLE removes metadata only, files remain
DESCRIBE HISTORY my_table: View table version history — shows operations, timestamps, users, and version numbers for audit
SELECT * FROM my_table VERSION AS OF 3: Time travel by version — query a specific version of the table
SELECT * FROM my_table TIMESTAMP AS OF '2026-01-15': Time travel by timestamp — query the table state at a specific point in time
DESCRIBE EXTENDED my_table: Show detailed table metadata — including Location property that reveals if the table is managed or external
CREATE VIEW monthly_summary AS SELECT month, SUM(amount) AS total FROM sales GROUP BY month: View — saved query definition that persists in the metastore across sessions. Recomputes on each access
CREATE TEMPORARY VIEW temp_data AS SELECT * FROM raw WHERE status = 'active': Temporary view — exists only for the current session. Automatically dropped on logout/disconnect
DROP TABLE IF EXISTS my_table: Drop table safely — IF EXISTS prevents errors if the table doesn't exist. Behavior depends on managed vs external

Unity Catalog & Governance

3-level namespace: catalog.schema.table Example: prod.sales.orders: Unity Catalog organizes data in a three-level hierarchy. All data access uses this fully qualified name
GRANT USE CATALOG ON CATALOG my_catalog TO `analysts`: Required first — users must have USE CATALOG before accessing any schema or table within the catalog
GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `analysts`: Required second — users must have USE SCHEMA before accessing tables within the schema
GRANT SELECT ON TABLE my_catalog.my_schema.orders TO `analysts`: Grant read access — but only works if the user already has USE CATALOG and USE SCHEMA permissions
REVOKE ALL PRIVILEGES ON TABLE my_table FROM `user@email.com`: Remove all permissions from a user on a specific table
ALTER TABLE reporting_data SET TAGS ('system.Certified' = 'true'): Mark a table as certified — system.Certified is a reserved Unity Catalog governance tag for production-ready assets
Managed table: DROP TABLE → deletes data + metadata External table: DROP TABLE → deletes metadata only: Critical distinction — the most tested concept on the exam. Check DESCRIBE EXTENDED Location to determine table type

Dashboards, Alerts & Visualization

Dashboard Components: - Query-based visualizations (bar, line, scatter, pie, table, counter) - Markdown text boxes (section headers, labels) - Filter widgets (dropdown, date range) - Parameters ({{ param_name }} in queries): Four building blocks of a Databricks SQL dashboard — visualizations, text, filters, and parameters
Markdown text box: Add section headers and descriptions Use ## for headings, **bold**, and bullet lists: Markdown text boxes organize dashboards visually — the correct way to add section labels (not query results or SQL comments)
Dashboard Refresh Scheduling: - Built-in schedule (hourly, daily, custom cron) - Requires an active SQL warehouse - Refresh frequency bounded by data pipeline interval: Schedule automatic dashboard refreshes — no external tools needed. Serverless warehouse avoids startup latency
Alert Statuses: - OK: threshold NOT exceeded - TRIGGERED: threshold exceeded - UNKNOWN: not yet evaluated or query failed: Three alert states — TRIGGERED means the condition was met, NOT that the alert is enabled/activated
Alert Configuration: 1. Write a SQL query that returns a numeric value 2. Set a threshold condition (>, <, =, etc.) 3. Choose notification target (email, Slack, webhook) 4. Set refresh schedule (how often to evaluate): Four steps to create a SQL alert — the query must return a single numeric value for threshold comparison
Chart Selection Guide: - Trends over time → Line chart - Category comparison → Bar chart - Part of whole → Pie chart - Distribution → Histogram - Correlation → Scatter plot - Single KPI → Counter: Choose the right visualization type based on the analytical question being answered

Analytics Ecosystem & Platform Features

SQL Warehouse Types: - Classic: manual config, full control - Pro: governance features (Query Profile, system tables) - Serverless: instant startup, auto-scale, pay-as-you-go: Three warehouse types — Serverless eliminates startup latency and is the answer for cost optimization and instant availability
Partner Connect: Automatically provisions SQL warehouse + establishes connection to Fivetran, dbt, Tableau, Power BI, etc.: One-click integration setup — the most efficient method. Manual configuration is never the recommended approach on the exam
Genie Spaces: AI-powered natural language interface - Ask questions in plain English - Generates SQL automatically - Respects Unity Catalog permissions: Natural language data exploration for business users who don't write SQL. Does NOT bypass access controls
Databricks Marketplace: - Third-party datasets, models, insights - Accessed via Delta Sharing (live, read-only) - No data replication or ETL needed - Governed by Unity Catalog: External data marketplace — live access without copying data. The exam tests that it uses Delta Sharing, not replication
Data Explorer: - Browse tables/columns/metadata without SQL - Preview data samples - View and manage permissions - Transfer ownership (click Owner field): Graphical metadata browser — the go-to tool for discovering data assets and managing table ownership
Query History: - Review past queries and execution metrics - View query plans and timing - Diagnose performance issues - Cannot schedule or automate queries: Diagnostic tool for analyzing query performance — read-only, does not support scheduling or workflow automation
Databricks SQL positioning: - Complement to BI tools (Tableau, Power BI) - Best for ad-hoc analytics and rapid prototyping - NOT a full replacement for enterprise BI: Key exam concept — Databricks SQL is complementary, not a replacement. Choose this positioning when asked

Ready to test yourself?

Start a timed Data Analyst Associate mock exam or review practice questions by domain.