DBA to Databricks
This book gives the perspective of a (SQL Server) DBA making the transition to databricks.
Databricks is a loud based data platform for lakehouse in the big three (Azure, Amazon, GCP). Based on the open source Delta table format.
About me: briandill.com (opens in a new window)
Glossary
Glossary of Terms associated with databricks
ADLS2 - Azure Data Lake Storage Gen2; hierarchical namespace–enabled blob storage used for analytics.
autoloader - Databricks feature that incrementally ingests new files from a cloud storage path.
autoscaling - automatic resizing of cluster worker count based on workload demand.
Azure Key Vault - Azure-managed secret store used for storing credentials, keys, and SAS tokens for DBX access.
Azure Managed Identity - Identity assigned to a DBX resource to securely access Azure services without secrets.
barcode - internal Databricks deployment identifier for clusters, notebooks, and jobs.
batch processing - processing data in grouped chunks rather than continuous streams.
blob storage - Azure term for inexpensive storage of any type of file. Amazon term is S3
bronze layer - raw ingestion layer in the medallion architecture; contains minimally processed source data.
catalog - top-level container in Unity Catalog that organizes schemas and tables.
checkpoint - directory used by Structured Streaming and Autoloader to track state for incremental processing.
cluster mode - controls how local vs remote driver processes run (single-node, standard, high-concurrency).
cluster policies - governance rules that restrict cluster configurations that users are allowed to create.
cloudFiles - Autoloader's file discovery mechanism API for streaming ingestion.
compute plane - where the VM nodes actually run—notebooks, jobs, worker tasks, shuffle operations.
concurrency - number of simultaneous queries a cluster can process; high-concurrency clusters optimize this.
control plane - the "brains" of DBX. The DBX UI is in the control plane and issues commands to the node(s) in the data plane
copy-on-write - Delta Lake mechanism where updates create new versions of modified Parquet files.
data lineage - Unity Catalog-tracked history of data flows between tables, notebooks, and jobs.
data plane - location of the user's data - typically in blob storage container in an Azure storage account
data skipping - Delta optimization that uses statistics in data file metadata to prune unnecessary files during reads.
Databricks Connect - tool allowing you to run local IDE code against a remote Databricks cluster.
Databricks Runtime - versioned environment that defines Spark version, Delta version, and libraries on a cluster.
DBX - abbreviation of databricks
Delta Lake - storage layer that adds ACID transactions, schema enforcement, and time travel to Parquet.
Delta Live Tables - declarative ETL framework in DBX for building pipelines with quality checks and event tracking.
Delta Log - _delta_log folder containing JSON transaction files tracking commits and table versions.
driver node - orchestrates Spark tasks, maintains metadata, and coordinates work among executors/workers.
ETL - extract, transform, load; standard data integration pipeline pattern.
event grid - Azure service that publishes notifications for blob storage events to trigger ingestion.
executor node - worker process in a cluster that performs the actual computation for tasks.
Hive metastore - legacy metadata catalog used prior to Unity Catalog, scoped per workspace.
instance pool - pre-warmed VMs used to speed up Databricks cluster startup times and reduce cost.
job cluster - ephemeral cluster automatically spun up for a job, then terminated after completion.
lakehouse - unified architecture combining data lake storage and data warehouse capabilities.
manifest file - JSON export of a Delta table for external tools not natively Delta-aware.
materialized view - table whose results are precomputed and refreshed to improve query performance.
medallion architecture - bronze → silver → gold tiered data modeling design for incrementally refined datasets.
metastore - metadata service containing catalogs, schemas, tables, permissions, and lineage.
MLflow - open-source model tracking and experiment management system native to Databricks.
multi-cluster warehouse - SQL warehouse that can automatically scale out with multiple clusters for concurrency.
notebook - interactive coding environment for Scala, SQL, Python, or R inside the Databricks workspace.
optimize - Delta table maintenance operation to coalesce small files and improve performance.
parquet files - columnar data file format that is compressed. Contains column headers, data types, and some metadata.
photon - next-generation Databricks execution engine written in C++ for faster SQL performance.
pipelines - automated workflows in DBX Jobs or Delta Live Tables for scheduled ETL processes.
power BI connector - direct connection option between Power BI and Databricks SQL warehouses.
query profile - graphical explanation of query stages, tasks, and performance characteristics.
RBAC - role-based access control; permissions model in Unity Catalog for fine-grained governance.
schema - logical grouping of tables within a catalog (similar to a database schema in SQL Server).
schema enforcement - Delta Lake’s ability to prevent writes that violate expected column types or names.
schema evolution - Delta Lake’s automated capability to add new columns when enabled.
serverless SQL - Databricks-managed SQL compute with instant start and no cluster management.
shallow clone - lightweight metadata-only clone of a Delta table referencing the same underlying data files.
silver layer - cleaned and conformed data layer in the medallion architecture.
spark - distributed compute engine used under the hood by Databricks for parallel processing.
SQL warehouse - compute resource optimized for SQL workloads, formerly called SQL endpoints.
table ACLs - access controls that regulate who can query, modify, or manage tables.
table history - list of previous Delta table versions with timestamps and operations.
task - unit of work within a Spark stage executed on a worker.
time travel - Delta feature that lets you query older versions of a table via version number or timestamp.
token - workspace-specific personal access credential used to authenticate external tools to DBX.
Unity Catalog - centralized governance layer for permissions, metadata, auditing, and lineage.
UDF - user-defined function; custom Python/Scala logic registered for use in SQL.
UDI (Update/Delete/Insert) - mutation operations that modify Delta tables atomically.
vacuum - Delta operation that permanently deletes old versions and files older than a retention threshold.
view - saved SQL query definition that appears as a table but does not store its own data.
volume - UC-governed directory for unstructured files, supporting data and code assets.
widget - Notebook UI control (dropdowns, text boxes) enabling parameterization of jobs and dashboards.
workflow - job-based orchestrated set of tasks, dependencies, and triggers.
worker node - cluster node that executes Spark tasks and holds shuffled data.
Z-order - file-level clustering technique in Delta to co-locate related values for faster reads.
Building SELECT lists with SQL
If you are using SQL for ingestion and you want to rename columns
PascalCase to snake_case
-- PascalCase to snake_case
-- , OtherNFLStats AS other_nfl_stats
SELECT
CONCAT(', ', column_name, ' AS ', LOWER(
REGEXP_REPLACE(
REGEXP_REPLACE(column_name, '([a-z0-9])([A-Z])', '$1_$2'), -- Pass 1: Lower to Upper
'([A-Z])([A-Z][a-z])', '$1_$2' -- Pass 2: Acronym to Word
)
)) AS formatted_column
FROM information_schema.columns
WHERE table_name = 'my_table'
AND table_schema = 'my_schema'
ORDER BY ordinal_position;
PascalCase to snake_case with substitution
-- PascalCase to snake_case
-- , OverageAmount AS overage_amt
WITH RawColumns AS (
SELECT column_name
, LOWER(
REGEXP_REPLACE(
REGEXP_REPLACE(column_name, '([a-z0-9])([A-Z])', '$1_$2')
, '([A-Z])([A-Z][a-z])', '$1_$2'
)
) AS snake_name
FROM information_schema.columns
WHERE table_name = 'my_table'
AND table_schema = 'my_schema'
)
SELECT CONCAT(', ', column_name, ' AS ',
AGGREGATE( -- AGGREGATE is Spark-specific (thus applicable to databricks)
-- Define the list of [pattern, replacement]
ARRAY(
STRUCT('_amount$' AS pat, '_amt' AS rep)
, STRUCT('_percent$' AS pat, '_pct' AS rep)
, STRUCT('_flag$' AS pat, '_flg' AS rep)
, STRUCT('_code$' AS pat, '_cd' AS rep)
, STRUCT('_number$' AS pat, '_nbr' AS rep)
, STRUCT('_name$' AS pat, '_nm' AS rep)
, STRUCT('_description$' AS pat, '_dsc' AS rep)
, STRUCT('_desc$' AS pat, '_dsc' AS rep)
, STRUCT('_date$' AS pat, '_dts' AS rep)
)
, snake_name -- Initial value
, (acc, x) -> REGEXP_REPLACE(acc, x.pat, x.rep) -- Iterative function
)
) AS formatted_column
FROM RawColumns
ORDER BY column_name;
snake_case to PascalCase + custom rules + suffix
-- snake_case to PascalCase with suffix substitution
WITH base AS (
SELECT
column_name
, CASE
WHEN column_name LIKE '%\_%' THEN ARRAY_JOIN( TRANSFORM(SPLIT(LOWER(column_name), '_'), x -> INITCAP(x) ), '' ) -- underscore to PascalCase
ELSE INITCAP(column_name)
END AS pascal_name
, ordinal_position
FROM my_catalog.information_schema.columns
WHERE table_schema = 'my_schema'
AND table_name = 'my_table'
)
SELECT
CASE
-- Parsed Excel files
WHEN column_name = '_FILENAME' THEN CONCAT(', ', 'regexp_extract(', column_name, ', \'[^/]+$\', 0)', ' AS FileNM')
WHEN column_name = '_ROW_NUM' THEN CONCAT(', ', column_name, ' AS RowNBR')
WHEN column_name = '_SHEETNAME' THEN CONCAT(', ', column_name, ' AS SheetNM')
-- metadata cols
WHEN column_name = 'hcsys_file_name' THEN CONCAT(', ', 'regexp_extract(', column_name, ', \'[^/]+$\', 0)', ' AS FileNM') -- SourceHCSystemFileNM?
WHEN column_name = 'hcsys_row_num' THEN CONCAT(', ', column_name, ' AS SourceHCSystemRowNBR')
WHEN column_name = 'meta_updated' THEN CONCAT(', ', column_name, ' AS EDWLastModifiedDTS')
WHEN column_name = 'meta_deleted' THEN CONCAT(', ', column_name, ' AS MetaDeletedFLG')
WHEN column_name = 'meta_surrogate_key' THEN CONCAT(', ', column_name, ' AS MetaSurrogateKeyID')
WHEN column_name = 'meta_checksum' THEN CONCAT(', ', column_name, ' AS MetaChecksumID')
WHEN column_name = 'meta_location' THEN CONCAT(', ', column_name, ' AS MetaLocationDSC')
-- Replace end word w/ suffix
WHEN pascal_name RLIKE '(?i)Amount$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Amount$', 'AMT'))
WHEN pascal_name RLIKE '(?i)Percent$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Percent$', 'PCT'))
WHEN pascal_name RLIKE '(?i)Flag$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Flag$', 'FLG'))
WHEN pascal_name RLIKE '(?i)Code$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Code$', 'CD'))
WHEN pascal_name RLIKE '(?i)Number$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Number$', 'NBR'))
WHEN pascal_name RLIKE '(?i)Name$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Name$', 'NM'))
WHEN pascal_name RLIKE '(?i)Description$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Description$', 'DSC'))
WHEN pascal_name RLIKE '(?i)Desc$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Desc$', 'DSC'))
WHEN pascal_name RLIKE '(?i)Date$' THEN CONCAT(', try_to_timestamp(', column_name, ', \'yyyy-MM-dd HH:mm:ss\') AS ', REGEXP_REPLACE(pascal_name, '(?i)Date$', 'DTS')) -- chg to try_to_date if no time data
WHEN pascal_name RLIKE '(?i)Count$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Count$', 'CNT'))
WHEN pascal_name RLIKE '(?i)Id$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Id$', 'ID'))
WHEN pascal_name RLIKE '(?i)Address1$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Address1$', 'Address01TXT'))
WHEN pascal_name RLIKE '(?i)Address2$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Address2$', 'Address02TXT'))
-- Appended suffixes
WHEN pascal_name RLIKE '(?i)Gender$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'CD'))
WHEN pascal_name RLIKE '(?i)Age$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'NBR'))
WHEN pascal_name RLIKE '(?i)Address$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'TXT'))
WHEN pascal_name RLIKE '(?i)City$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'NN'))
WHEN pascal_name RLIKE '(?i)State$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'NN'))
WHEN pascal_name RLIKE '(?i)Zip$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'CD'))
WHEN pascal_name RLIKE '(?i)Phone$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'NBR'))
WHEN pascal_name RLIKE '(?i)Fax$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'NBR'))
WHEN pascal_name RLIKE '(?i)Active$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'FLG'))
ELSE CONCAT(', ', column_name, ' AS ', pascal_name, 'TXT')
END AS c
FROM base
ORDER BY ordinal_position;