# DBA to Databricks

# Glossary

## Glossary of Terms associated with databricks

**ADLS2** - Azure Data Lake Storage Gen2; hierarchical namespace–enabled blob storage used for analytics.  
**autoloader** - Databricks feature that incrementally ingests new files from a cloud storage path.  
**autoscaling** - automatic resizing of cluster worker count based on workload demand.  
**Azure Key Vault** - Azure-managed secret store used for storing credentials, keys, and SAS tokens for DBX access.  
**Azure Managed Identity** - Identity assigned to a DBX resource to securely access Azure services without secrets.  
**barcode** - internal Databricks deployment identifier for clusters, notebooks, and jobs.  
**batch processing** - processing data in grouped chunks rather than continuous streams.  
**blob storage** - Azure term for inexpensive storage of any type of file. Amazon term is S3  
**bronze layer** - raw ingestion layer in the medallion architecture; contains minimally processed source data.  
**catalog** - top-level container in Unity Catalog that organizes schemas and tables.  
**checkpoint** - directory used by Structured Streaming and Autoloader to track state for incremental processing.  
**cluster mode** - controls how local vs remote driver processes run (single-node, standard, high-concurrency).  
**cluster policies** - governance rules that restrict cluster configurations that users are allowed to create.  
**cloudFiles** - Autoloader's file discovery mechanism API for streaming ingestion.  
**compute plane** - where the VM nodes actually run—notebooks, jobs, worker tasks, shuffle operations.  
**concurrency** - number of simultaneous queries a cluster can process; high-concurrency clusters optimize this.  
**control plane** - the "brains" of DBX. The DBX UI is in the control plane and issues commands to the node(s) in the data plane  
**copy-on-write** - Delta Lake mechanism where updates create new versions of modified Parquet files.  
**data lineage** - Unity Catalog-tracked history of data flows between tables, notebooks, and jobs.  
**data plane** - location of the user's data - typically in blob storage container in an Azure storage account **data skipping** - Delta optimization that uses statistics in data file metadata to prune unnecessary files during reads.  
**Databricks Connect** - tool allowing you to run local IDE code against a remote Databricks cluster.  
**Databricks Runtime** - versioned environment that defines Spark version, Delta version, and libraries on a cluster.  
**DBX** - abbreviation of databricks  
**Delta Lake** - storage layer that adds ACID transactions, schema enforcement, and time travel to Parquet.  
**Delta Live Tables** - declarative ETL framework in DBX for building pipelines with quality checks and event tracking.  
**Delta Log** - `_delta_log` folder containing JSON transaction files tracking commits and table versions.  
**driver node** - orchestrates Spark tasks, maintains metadata, and coordinates work among executors/workers.  
**ETL** - extract, transform, load; standard data integration pipeline pattern.  
**event grid** - Azure service that publishes notifications for blob storage events to trigger ingestion.  
**executor node** - worker process in a cluster that performs the actual computation for tasks.  
**Hive metastore** - legacy metadata catalog used prior to Unity Catalog, scoped per workspace.  
**instance pool** - pre-warmed VMs used to speed up Databricks cluster startup times and reduce cost.  
**job cluster** - ephemeral cluster automatically spun up for a job, then terminated after completion.  
**lakehouse** - unified architecture combining data lake storage and data warehouse capabilities.  
**manifest file** - JSON export of a Delta table for external tools not natively Delta-aware.  
**materialized view** - table whose results are precomputed and refreshed to improve query performance.  
**medallion architecture** - bronze → silver → gold tiered data modeling design for incrementally refined datasets.  
**metastore** - metadata service containing catalogs, schemas, tables, permissions, and lineage.  
**MLflow** - open-source model tracking and experiment management system native to Databricks.  
**multi-cluster warehouse** - SQL warehouse that can automatically scale out with multiple clusters for concurrency.  
**notebook** - interactive coding environment for Scala, SQL, Python, or R inside the Databricks workspace.  
**optimize** - Delta table maintenance operation to coalesce small files and improve performance.  
**parquet files** - columnar data file format that is compressed. Contains column headers, data types, and some metadata.  
**photon** - next-generation Databricks execution engine written in C++ for faster SQL performance.  
**pipelines** - automated workflows in DBX Jobs or Delta Live Tables for scheduled ETL processes.  
**power BI connector** - direct connection option between Power BI and Databricks SQL warehouses.  
**query profile** - graphical explanation of query stages, tasks, and performance characteristics.  
**RBAC** - role-based access control; permissions model in Unity Catalog for fine-grained governance.  
**schema** - logical grouping of tables within a catalog (similar to a database schema in SQL Server).  
**schema enforcement** - Delta Lake’s ability to prevent writes that violate expected column types or names.  
**schema evolution** - Delta Lake’s automated capability to add new columns when enabled.  
**serverless SQL** - Databricks-managed SQL compute with instant start and no cluster management.  
**shallow clone** - lightweight metadata-only clone of a Delta table referencing the same underlying data files.  
**silver layer** - cleaned and conformed data layer in the medallion architecture.  
**spark** - distributed compute engine used under the hood by Databricks for parallel processing.  
**SQL warehouse** - compute resource optimized for SQL workloads, formerly called SQL endpoints.  
**table ACLs** - access controls that regulate who can query, modify, or manage tables.  
**table history** - list of previous Delta table versions with timestamps and operations.  
**task** - unit of work within a Spark stage executed on a worker.  
**time travel** - Delta feature that lets you query older versions of a table via version number or timestamp.  
**token** - workspace-specific personal access credential used to authenticate external tools to DBX.  
**Unity Catalog** - centralized governance layer for permissions, metadata, auditing, and lineage.  
**UDF** - user-defined function; custom Python/Scala logic registered for use in SQL.  
**UDI (Update/Delete/Insert)** - mutation operations that modify Delta tables atomically.  
**vacuum** - Delta operation that permanently deletes old versions and files older than a retention threshold.  
**view** - saved SQL query definition that appears as a table but does not store its own data.  
**volume** - UC-governed directory for unstructured files, supporting data and code assets.  
**widget** - Notebook UI control (dropdowns, text boxes) enabling parameterization of jobs and dashboards.  
**workflow** - job-based orchestrated set of tasks, dependencies, and triggers.  
**worker node** - cluster node that executes Spark tasks and holds shuffled data.  
**Z-order** - file-level clustering technique in Delta to co-locate related values for faster reads.

# Building SELECT lists with SQL

If you are using SQL for ingestion and you want to rename columns

### PascalCase to snake\_case

```sql
-- PascalCase to snake_case
-- , OtherNFLStats AS other_nfl_stats

SELECT
    CONCAT(', ', column_name, ' AS ', LOWER(
      REGEXP_REPLACE(
        REGEXP_REPLACE(column_name, '([a-z0-9])([A-Z])', '$1_$2'), -- Pass 1: Lower to Upper
        '([A-Z])([A-Z][a-z])', '$1_$2'                             -- Pass 2: Acronym to Word
      )
    )) AS formatted_column
FROM information_schema.columns
WHERE table_name = 'my_table'
    AND table_schema = 'my_schema'
ORDER BY ordinal_position;
```

### PascalCase to snake\_case with substitution

```sql
-- PascalCase to snake_case
-- , OverageAmount AS overage_amt
WITH RawColumns AS (
    SELECT  column_name
        , LOWER(
            REGEXP_REPLACE(
                REGEXP_REPLACE(column_name, '([a-z0-9])([A-Z])', '$1_$2')
                , '([A-Z])([A-Z][a-z])', '$1_$2'
            )
        ) AS snake_name
    FROM information_schema.columns
    WHERE table_name = 'my_table'
    AND table_schema = 'my_schema'
)
SELECT CONCAT(', ', column_name, ' AS ', 
        AGGREGATE( -- AGGREGATE is Spark-specific (thus applicable to databricks)
            -- Define the list of [pattern, replacement]
            ARRAY(
                  STRUCT('_amount$'      AS pat, '_amt' AS rep)
                , STRUCT('_percent$'     AS pat, '_pct' AS rep)
                , STRUCT('_flag$'        AS pat, '_flg' AS rep)
                , STRUCT('_code$'        AS pat, '_cd'  AS rep)
                , STRUCT('_number$'      AS pat, '_nbr' AS rep)
                , STRUCT('_name$'        AS pat, '_nm'  AS rep)
                , STRUCT('_description$' AS pat, '_dsc' AS rep)
                , STRUCT('_desc$'        AS pat, '_dsc' AS rep)
                , STRUCT('_date$'        AS pat, '_dts' AS rep)
            )
            , snake_name -- Initial value
            , (acc, x) -> REGEXP_REPLACE(acc, x.pat, x.rep) -- Iterative function
        )
    ) AS formatted_column
FROM RawColumns
ORDER BY column_name;
```

### snake\_case to PascalCase + custom rules + suffix

```sql
-- snake_case to PascalCase with suffix substitution
WITH base AS (
    SELECT
          column_name
        , CASE 
            WHEN column_name LIKE '%\_%' THEN  ARRAY_JOIN( TRANSFORM(SPLIT(LOWER(column_name), '_'), x -> INITCAP(x) ), '' )  -- underscore to PascalCase
            ELSE INITCAP(column_name)
          END AS pascal_name
        , ordinal_position
    FROM my_catalog.information_schema.columns
    WHERE table_schema = 'my_schema'
        AND table_name = 'my_table'
)
SELECT 
    CASE  
        -- Parsed Excel files 
        WHEN column_name = '_FILENAME' THEN CONCAT(', ', 'regexp_extract(', column_name, ', \'[^/]+$\', 0)', ' AS FileNM') 
        WHEN column_name = '_ROW_NUM' THEN CONCAT(', ', column_name, ' AS RowNBR')  
        WHEN column_name = '_SHEETNAME' THEN CONCAT(', ', column_name, ' AS SheetNM') 
        -- metadata cols 
        WHEN column_name = 'hcsys_file_name' THEN CONCAT(', ', 'regexp_extract(', column_name, ', \'[^/]+$\', 0)', ' AS FileNM') -- SourceHCSystemFileNM? 
        WHEN column_name = 'hcsys_row_num' THEN CONCAT(', ', column_name, ' AS SourceHCSystemRowNBR') 
        WHEN column_name = 'meta_updated' THEN CONCAT(', ', column_name, ' AS  EDWLastModifiedDTS') 
        WHEN column_name = 'meta_deleted' THEN CONCAT(', ', column_name, ' AS MetaDeletedFLG') 
        WHEN column_name = 'meta_surrogate_key' THEN CONCAT(', ', column_name, ' AS MetaSurrogateKeyID') 
        WHEN column_name = 'meta_checksum' THEN CONCAT(', ', column_name, ' AS MetaChecksumID') 
        WHEN column_name = 'meta_location' THEN CONCAT(', ', column_name, ' AS MetaLocationDSC')        
        -- Replace end word w/ suffix 
        WHEN pascal_name RLIKE '(?i)Amount$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Amount$', 'AMT')) 
        WHEN pascal_name RLIKE '(?i)Percent$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Percent$', 'PCT')) 
        WHEN pascal_name RLIKE '(?i)Flag$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Flag$', 'FLG')) 
        WHEN pascal_name RLIKE '(?i)Code$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Code$', 'CD')) 
        WHEN pascal_name RLIKE '(?i)Number$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Number$', 'NBR')) 
        WHEN pascal_name RLIKE '(?i)Name$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Name$', 'NM')) 
        WHEN pascal_name RLIKE '(?i)Description$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Description$', 'DSC')) 
        WHEN pascal_name RLIKE '(?i)Desc$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Desc$', 'DSC')) 
        WHEN pascal_name RLIKE '(?i)Date$' THEN CONCAT(', try_to_timestamp(', column_name, ', \'yyyy-MM-dd HH:mm:ss\') AS ', REGEXP_REPLACE(pascal_name, '(?i)Date$', 'DTS'))  -- chg to try_to_date if no time data 
        WHEN pascal_name RLIKE '(?i)Count$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Count$', 'CNT')) 
        WHEN pascal_name RLIKE '(?i)Id$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Id$', 'ID')) 
        WHEN pascal_name RLIKE '(?i)Address1$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Address1$', 'Address01TXT')) 
        WHEN pascal_name RLIKE '(?i)Address2$' THEN CONCAT(', ', column_name, ' AS ', REGEXP_REPLACE(pascal_name, '(?i)Address2$', 'Address02TXT')) 
        -- Appended suffixes 
        WHEN pascal_name RLIKE '(?i)Gender$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'CD')) 
        WHEN pascal_name RLIKE '(?i)Age$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'NBR')) 
        WHEN pascal_name RLIKE '(?i)Address$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'TXT')) 
        WHEN pascal_name RLIKE '(?i)City$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'NN')) 
        WHEN pascal_name RLIKE '(?i)State$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'NN')) 
        WHEN pascal_name RLIKE '(?i)Zip$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'CD'))   
        WHEN pascal_name RLIKE '(?i)Phone$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'NBR')) 
        WHEN pascal_name RLIKE '(?i)Fax$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'NBR')) 
        WHEN pascal_name RLIKE '(?i)Active$' THEN CONCAT(', ', column_name, ' AS ', CONCAT(pascal_name, 'FLG')) 
        ELSE CONCAT(', ', column_name, ' AS ', pascal_name, 'TXT') 
    END AS c 
FROM base 
ORDER BY ordinal_position; 
```