DatabricksData Analyst Associate62 concepts
Data Analyst Associate Cheat Sheet
Quick reference for the Databricks Certified Data Analyst Associate exam.
Quick Navigation
SQL Fundamentals & Query Structure
- SQL Execution Order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT
- Logical processing order — WHERE runs before SELECT, so you cannot reference column aliases in WHERE
- SELECT col1, col2 FROM schema.table WHERE col1 > 100 GROUP BY col1, col2 HAVING COUNT(*) > 5 ORDER BY col1 DESC LIMIT 10
- Complete query template — filtering rows (WHERE), grouping (GROUP BY), filtering groups (HAVING), sorting (ORDER BY), limiting (LIMIT)
- SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.id
- Inner join — only matching rows from both tables
- SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id
- Left join — all rows from t1, matching rows from t2 (NULL if no match)
- SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id
- Full outer join — all rows from both tables, NULLs where no match exists
- SELECT * FROM t1 CROSS JOIN t2
- Cross join — every row in t1 paired with every row in t2 (Cartesian product)
- INSERT INTO my_table VALUES (1, 'Alice', 100.00)
- Insert — appends rows to existing data. Does NOT replace existing rows
- INSERT OVERWRITE my_table SELECT * FROM staging_table
- Insert overwrite — replaces all existing data with new data. Unlike INSERT INTO, this is destructive
Aggregate Functions & GROUP BY
- SELECT region, category, COUNT(*) AS total_orders, SUM(amount) AS total_revenue, AVG(amount) AS avg_order_value, MIN(amount) AS min_order, MAX(amount) AS max_order FROM orders GROUP BY region, category
- Multi-dimension aggregation — one row per region-category combination with summary metrics
- SELECT department, COUNT(DISTINCT employee_id) AS unique_employees FROM payroll GROUP BY department
- COUNT DISTINCT — counts unique values only, ignoring duplicates
- SELECT region, SUM(revenue) AS total FROM sales GROUP BY region HAVING SUM(revenue) > 100000
- HAVING filters groups after aggregation — unlike WHERE, which filters rows before grouping
- SELECT COALESCE(region, 'Unknown') AS region, COUNT(*) AS orders FROM sales GROUP BY region
- COALESCE replaces NULL with a default value — essential for clean aggregate reports
- SELECT product, ROUND(AVG(price), 2) AS avg_price FROM products GROUP BY product
- ROUND for controlling decimal precision in aggregate results
Window Functions
- SELECT *, ROW_NUMBER() OVER ( PARTITION BY customer_id ORDER BY order_date DESC ) AS rn FROM orders
- ROW_NUMBER — assigns unique sequential numbers within each partition. Use for deduplication (keep rn = 1)
- SELECT *, RANK() OVER (ORDER BY score DESC) AS rank, DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank FROM students
- RANK skips positions after ties (1,2,2,4). DENSE_RANK does not skip (1,2,2,3)
- SELECT *, LAG(revenue, 1) OVER (ORDER BY month) AS prev_month, LEAD(revenue, 1) OVER (ORDER BY month) AS next_month FROM monthly_sales
- LAG looks back, LEAD looks forward — use for period-over-period comparisons
- SELECT *, SUM(amount) OVER ( PARTITION BY customer_id ORDER BY order_date ) AS running_total FROM orders
- Running total — SUM with ORDER BY computes cumulative sum. Without ORDER BY, it computes a partition-wide total
- SELECT *, AVG(amount) OVER ( PARTITION BY region ) AS region_avg FROM sales
- Partition-wide average — no ORDER BY means the aggregate covers the entire partition, not a running calculation
- SELECT * FROM ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY id ORDER BY ts DESC ) AS rn FROM my_table ) WHERE rn = 1
- Deduplication pattern — window function in subquery, filter in outer WHERE. Cannot use window function directly in WHERE
- SELECT *, ROW_NUMBER() OVER ( PARTITION BY id ORDER BY ts DESC ) AS rn FROM my_table QUALIFY rn = 1
- QUALIFY — Databricks shortcut that filters on window function results without needing a subquery
CTEs, Subqueries & Advanced Patterns
- WITH regional_totals AS ( SELECT region, SUM(amount) AS total FROM sales GROUP BY region ) SELECT * FROM regional_totals WHERE total > 50000
- CTE (Common Table Expression) — named temporary result set defined with WITH clause. More readable than nested subqueries
- WITH step1 AS (SELECT ... FROM raw_data), step2 AS (SELECT ... FROM step1), step3 AS (SELECT ... FROM step2 JOIN dim) SELECT * FROM step3
- Chained CTEs — each step can reference previous CTEs. Great for multi-step transformations
- SELECT * FROM orders WHERE customer_id IN ( SELECT customer_id FROM vip_customers )
- Subquery in WHERE — filters rows based on values from another query
- SELECT CASE WHEN amount >= 1000 THEN 'High' WHEN amount >= 100 THEN 'Medium' ELSE 'Low' END AS tier FROM orders
- CASE WHEN — conditional logic in SQL. Evaluated top-to-bottom, first match wins
- SELECT * FROM sales PIVOT ( SUM(amount) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4') )
- PIVOT — transforms rows into columns. Creates one column per distinct value in the pivot column
- SELECT * FROM quarterly_sales UNPIVOT ( amount FOR quarter IN (Q1, Q2, Q3, Q4) )
- UNPIVOT — transforms columns into rows. Reverses a pivot operation for normalized analysis
- CAST(column AS DATE) CAST(column AS INT) CAST(column AS STRING)
- CAST — explicit type conversion. Common in analytics for dates, numeric formatting, and string operations
Date, String & Array Functions
- SELECT CURRENT_DATE() AS today, CURRENT_TIMESTAMP() AS now, DATE_ADD(order_date, 30) AS plus_30_days, DATEDIFF(end_date, start_date) AS days_between
- Date functions — current date/time, date arithmetic, and day-difference calculations
- SELECT YEAR(order_date) AS yr, MONTH(order_date) AS mo, DAY(order_date) AS dy, DATE_TRUNC('MONTH', order_date) AS month_start
- Date extraction — pull year/month/day components. DATE_TRUNC rounds down to the start of a period
- SELECT UPPER(name) AS upper_name, LOWER(name) AS lower_name, TRIM(name) AS trimmed, LENGTH(name) AS char_count
- String functions — case conversion, whitespace trimming, and length calculation
- SELECT CONCAT(first_name, ' ', last_name) AS full_name, SUBSTRING(phone, 1, 3) AS area_code, REPLACE(email, '@old.com', '@new.com') AS updated_email
- String manipulation — concatenation, substring extraction, and text replacement
- SELECT *, EXPLODE(tags) AS tag FROM products
- EXPLODE — flattens an array column into separate rows. One output row per array element
- SELECT * FROM products WHERE 'waterproof' = ANY(features)
- ANY() — checks if a value exists within an array. Alternative to ARRAY_CONTAINS
Delta Lake & Table Management
- CREATE TABLE my_table ( id INT, name STRING, amount DECIMAL(10,2) ) USING DELTA
- Create a managed Delta table — data stored in Unity Catalog managed location. DROP TABLE will delete data
- CREATE TABLE my_table LOCATION 's3://bucket/path'
- Create an external table — data stays in your cloud storage. DROP TABLE removes metadata only, files remain
- DESCRIBE HISTORY my_table
- View table version history — shows operations, timestamps, users, and version numbers for audit
- SELECT * FROM my_table VERSION AS OF 3
- Time travel by version — query a specific version of the table
- SELECT * FROM my_table TIMESTAMP AS OF '2026-01-15'
- Time travel by timestamp — query the table state at a specific point in time
- DESCRIBE EXTENDED my_table
- Show detailed table metadata — including Location property that reveals if the table is managed or external
- CREATE VIEW monthly_summary AS SELECT month, SUM(amount) AS total FROM sales GROUP BY month
- View — saved query definition that persists in the metastore across sessions. Recomputes on each access
- CREATE TEMPORARY VIEW temp_data AS SELECT * FROM raw WHERE status = 'active'
- Temporary view — exists only for the current session. Automatically dropped on logout/disconnect
- DROP TABLE IF EXISTS my_table
- Drop table safely — IF EXISTS prevents errors if the table doesn't exist. Behavior depends on managed vs external
Unity Catalog & Governance
- 3-level namespace: catalog.schema.table Example: prod.sales.orders
- Unity Catalog organizes data in a three-level hierarchy. All data access uses this fully qualified name
- GRANT USE CATALOG ON CATALOG my_catalog TO `analysts`
- Required first — users must have USE CATALOG before accessing any schema or table within the catalog
- GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `analysts`
- Required second — users must have USE SCHEMA before accessing tables within the schema
- GRANT SELECT ON TABLE my_catalog.my_schema.orders TO `analysts`
- Grant read access — but only works if the user already has USE CATALOG and USE SCHEMA permissions
- REVOKE ALL PRIVILEGES ON TABLE my_table FROM `user@email.com`
- Remove all permissions from a user on a specific table
- ALTER TABLE reporting_data SET TAGS ('system.Certified' = 'true')
- Mark a table as certified — system.Certified is a reserved Unity Catalog governance tag for production-ready assets
- Managed table: DROP TABLE → deletes data + metadata External table: DROP TABLE → deletes metadata only
- Critical distinction — the most tested concept on the exam. Check DESCRIBE EXTENDED Location to determine table type
Dashboards, Alerts & Visualization
- Dashboard Components: - Query-based visualizations (bar, line, scatter, pie, table, counter) - Markdown text boxes (section headers, labels) - Filter widgets (dropdown, date range) - Parameters ({{ param_name }} in queries)
- Four building blocks of a Databricks SQL dashboard — visualizations, text, filters, and parameters
- Markdown text box: Add section headers and descriptions Use ## for headings, **bold**, and bullet lists
- Markdown text boxes organize dashboards visually — the correct way to add section labels (not query results or SQL comments)
- Dashboard Refresh Scheduling: - Built-in schedule (hourly, daily, custom cron) - Requires an active SQL warehouse - Refresh frequency bounded by data pipeline interval
- Schedule automatic dashboard refreshes — no external tools needed. Serverless warehouse avoids startup latency
- Alert Statuses: - OK: threshold NOT exceeded - TRIGGERED: threshold exceeded - UNKNOWN: not yet evaluated or query failed
- Three alert states — TRIGGERED means the condition was met, NOT that the alert is enabled/activated
- Alert Configuration: 1. Write a SQL query that returns a numeric value 2. Set a threshold condition (>, <, =, etc.) 3. Choose notification target (email, Slack, webhook) 4. Set refresh schedule (how often to evaluate)
- Four steps to create a SQL alert — the query must return a single numeric value for threshold comparison
- Chart Selection Guide: - Trends over time → Line chart - Category comparison → Bar chart - Part of whole → Pie chart - Distribution → Histogram - Correlation → Scatter plot - Single KPI → Counter
- Choose the right visualization type based on the analytical question being answered
Analytics Ecosystem & Platform Features
- SQL Warehouse Types: - Classic: manual config, full control - Pro: governance features (Query Profile, system tables) - Serverless: instant startup, auto-scale, pay-as-you-go
- Three warehouse types — Serverless eliminates startup latency and is the answer for cost optimization and instant availability
- Partner Connect: Automatically provisions SQL warehouse + establishes connection to Fivetran, dbt, Tableau, Power BI, etc.
- One-click integration setup — the most efficient method. Manual configuration is never the recommended approach on the exam
- Genie Spaces: AI-powered natural language interface - Ask questions in plain English - Generates SQL automatically - Respects Unity Catalog permissions
- Natural language data exploration for business users who don't write SQL. Does NOT bypass access controls
- Databricks Marketplace: - Third-party datasets, models, insights - Accessed via Delta Sharing (live, read-only) - No data replication or ETL needed - Governed by Unity Catalog
- External data marketplace — live access without copying data. The exam tests that it uses Delta Sharing, not replication
- Data Explorer: - Browse tables/columns/metadata without SQL - Preview data samples - View and manage permissions - Transfer ownership (click Owner field)
- Graphical metadata browser — the go-to tool for discovering data assets and managing table ownership
- Query History: - Review past queries and execution metrics - View query plans and timing - Diagnose performance issues - Cannot schedule or automate queries
- Diagnostic tool for analyzing query performance — read-only, does not support scheduling or workflow automation
- Databricks SQL positioning: - Complement to BI tools (Tableau, Power BI) - Best for ad-hoc analytics and rapid prototyping - NOT a full replacement for enterprise BI
- Key exam concept — Databricks SQL is complementary, not a replacement. Choose this positioning when asked