Glossary

Glossary of terms associated with Databricks

ADLS2 - Azure Data Lake Storage Gen2; hierarchical namespace–enabled blob storage used for analytics.
autoloader - Databricks feature that incrementally ingests new files from a cloud storage path.
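For illustration, a minimal Auto Loader stream in PySpark, with hypothetical paths and table name (the checkpoint and cloudFiles entries below describe the options used here):

    # Incrementally ingest new JSON files from a cloud storage path.
    stream = (spark.readStream
        .format("cloudFiles")                                # invokes Auto Loader
        .option("cloudFiles.format", "json")                 # format of incoming files
        .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")
        .load("abfss://landing@myaccount.dfs.core.windows.net/orders/"))

    (stream.writeStream
        .option("checkpointLocation", "/tmp/checkpoints/orders")  # tracks ingestion state
        .trigger(availableNow=True)                               # process available files, then stop
        .toTable("bronze_orders"))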
autoscaling - automatic resizing of cluster worker count based on workload demand.
Azure Key Vault - Azure-managed secret store used for storing credentials, keys, and SAS tokens for DBX access.
Azure Managed Identity - Identity assigned to a DBX resource to securely access Azure services without secrets.
barcode - internal Databricks deployment identifier for clusters, notebooks, and jobs.
batch processing - processing data in grouped chunks rather than continuous streams.
blob storage - Azure's low-cost object storage service for files of any type; the AWS equivalent is S3.
bronze layer - raw ingestion layer in the medallion architecture; contains minimally processed source data.
catalog - top-level container in Unity Catalog that organizes schemas and tables.
checkpoint - directory used by Structured Streaming and Autoloader to track state for incremental processing.
cluster mode - configuration that determines how driver and worker processes are arranged and shared across users (single-node, standard, high-concurrency).
cluster policies - governance rules that restrict cluster configurations that users are allowed to create.
cloudFiles - the stream source format that invokes Autoloader's file discovery mechanism for streaming ingestion (see the sketch under autoloader above).
compute plane - where the VM nodes actually run—notebooks, jobs, worker tasks, shuffle operations.
concurrency - number of simultaneous queries a cluster can process; high-concurrency clusters optimize this.
control plane - the "brains" of DBX. The DBX UI lives in the control plane and issues commands to the node(s) in the data plane.
copy-on-write - Delta Lake mechanism where updates write new Parquet files rather than modifying existing ones; prior file versions remain available for time travel.
data lineage - Unity Catalog-tracked history of data flows between tables, notebooks, and jobs.
data plane - location of the user's data, typically a blob storage container in an Azure storage account.
data skipping - Delta optimization that uses statistics in data file metadata to prune unnecessary files during reads.
Databricks Connect - tool allowing you to run local IDE code against a remote Databricks cluster.
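A sketch of running local code against a remote cluster with the databricks-connect package (v13+ API); the host, token, and cluster ID are placeholders:

    from databricks.connect import DatabricksSession

    # Build a SparkSession whose queries execute on a remote Databricks cluster.
    spark = DatabricksSession.builder.remote(
        host="https://<workspace-url>",
        token="<personal-access-token>",
        cluster_id="<cluster-id>",
    ).getOrCreate()

    spark.range(5).show()  # executes remotely, prints locally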
Databricks Runtime - versioned environment that defines Spark version, Delta version, and libraries on a cluster.
DBX - informal abbreviation for Databricks.
Delta Lake - storage layer that adds ACID transactions, schema enforcement, and time travel to Parquet.
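A minimal round trip in PySpark, assuming an existing DataFrame df and a writable path:

    # Write a DataFrame as a Delta table; the commit is recorded in _delta_log.
    df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

    # Read it back; schema and versions come from the table metadata.
    events = spark.read.format("delta").load("/tmp/delta/events")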
Delta Live Tables - declarative ETL framework in DBX for building pipelines with quality checks and event tracking.
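A sketch of a Delta Live Tables definition, assuming a pipeline with a hypothetical bronze_orders source; the expectation drops rows that fail the quality check:

    import dlt
    from pyspark.sql.functions import col

    @dlt.table(comment="Orders with basic quality checks applied")
    @dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
    def silver_orders():
        return dlt.read("bronze_orders").where(col("amount") > 0)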
Delta Log - _delta_log folder containing JSON transaction files tracking commits and table versions.
driver node - orchestrates Spark tasks, maintains metadata, and coordinates work among executors/workers.
ETL - extract, transform, load; standard data integration pipeline pattern.
event grid - Azure service that publishes notifications for blob storage events to trigger ingestion.
executor node - worker process in a cluster that performs the actual computation for tasks.
Hive metastore - legacy metadata catalog used prior to Unity Catalog, scoped per workspace.
instance pool - pre-warmed VMs used to speed up Databricks cluster startup times and reduce cost.
job cluster - ephemeral cluster automatically spun up for a job, then terminated after completion.
lakehouse - unified architecture combining data lake storage and data warehouse capabilities.
manifest file - generated listing of a Delta table's data files that lets external tools that are not natively Delta-aware read the table.
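Delta can generate a symlink-format manifest for such tools (table name hypothetical):

    # Writes manifest files under the table path for non-Delta readers (e.g., Presto/Trino).
    spark.sql("GENERATE symlink_format_manifest FOR TABLE events")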
materialized view - table whose results are precomputed and refreshed to improve query performance.
medallion architecture - bronze → silver → gold tiered data modeling design for incrementally refined datasets.
metastore - metadata service containing catalogs, schemas, tables, permissions, and lineage.
MLflow - open-source experiment tracking and model management system created by Databricks and integrated into the platform.
multi-cluster warehouse - SQL warehouse that can automatically scale out with multiple clusters for concurrency.
notebook - interactive coding environment for Scala, SQL, Python, or R inside the Databricks workspace.
optimize - Delta table maintenance operation to coalesce small files and improve performance.
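For example, compacting a hypothetical events table (see Z-order below for clustered compaction):

    # Coalesce many small files into fewer large ones to reduce scan overhead.
    spark.sql("OPTIMIZE events")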
parquet files - compressed, columnar data file format; each file carries column names, data types, and some metadata alongside the data.
photon - next-generation Databricks execution engine written in C++ for faster SQL performance.
pipelines - automated workflows in DBX Jobs or Delta Live Tables for scheduled ETL processes.
Power BI connector - direct connection option between Power BI and Databricks SQL warehouses.
query profile - graphical explanation of query stages, tasks, and performance characteristics.
RBAC - role-based access control; permissions model in Unity Catalog for fine-grained governance.
schema - logical grouping of tables within a catalog (similar to a database schema in SQL Server).
schema enforcement - Delta Lake’s ability to prevent writes that violate expected column types or names.
schema evolution - Delta Lake's ability, when enabled, to automatically add new columns to a table's schema during writes (see the sketch below).
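A sketch of appending a DataFrame whose schema gained a column, assuming df and an events table already exist:

    # mergeSchema lets the write add new columns instead of failing schema enforcement.
    (df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .saveAsTable("events"))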
serverless SQL - Databricks-managed SQL compute with instant start and no cluster management.
shallow clone - lightweight metadata-only clone of a Delta table referencing the same underlying data files.
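For example (table names hypothetical):

    # The clone gets its own transaction log but points at prod.events' existing data files.
    spark.sql("CREATE TABLE dev.events_clone SHALLOW CLONE prod.events")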
silver layer - cleaned and conformed data layer in the medallion architecture.
spark - distributed compute engine used under the hood by Databricks for parallel processing.
SQL warehouse - compute resource optimized for SQL workloads, formerly called SQL endpoints.
table ACLs - access controls that regulate who can query, modify, or manage tables.
table history - list of previous Delta table versions with timestamps and operations (see the sketch under time travel below).
task - unit of work within a Spark stage executed on a worker.
time travel - Delta feature that lets you query older versions of a table via version number or timestamp.
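Combining table history and time travel, a sketch against a hypothetical events table:

    # List prior versions with timestamps and operations.
    spark.sql("DESCRIBE HISTORY events").show(truncate=False)

    # Read the table as of a specific version, or as of a point in time.
    v5 = spark.read.option("versionAsOf", 5).table("events")
    jan = spark.read.option("timestampAsOf", "2024-01-01").table("events")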
token - workspace-specific personal access credential used to authenticate external tools to DBX.
Unity Catalog - centralized governance layer for permissions, metadata, auditing, and lineage.
UDF - user-defined function; custom Python/Scala logic registered for use in SQL.
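A sketch registering a hypothetical Python helper for use from SQL:

    from pyspark.sql.types import StringType

    def mask_email(email):
        # Keep the first character and the domain; hide the rest.
        if email is None:
            return None
        local, domain = email.split("@", 1)
        return local[0] + "***@" + domain

    # Register the function so SQL queries can call it by name.
    spark.udf.register("mask_email", mask_email, StringType())
    spark.sql("SELECT mask_email('alice@example.com')").show()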
UDI (Update/Delete/Insert) - mutation operations that modify Delta tables atomically.
vacuum - Delta operation that permanently deletes old versions and files older than a retention threshold.
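For example, keeping the default seven days of history (table name hypothetical):

    # Permanently delete unreferenced files older than 168 hours (7 days).
    spark.sql("VACUUM events RETAIN 168 HOURS")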
view - saved SQL query definition that appears as a table but does not store its own data.
volume - Unity Catalog-governed directory for unstructured files, supporting data and code assets.
widget - Notebook UI control (dropdowns, text boxes) enabling parameterization of jobs and dashboards.
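A sketch using the dbutils object available inside Databricks notebooks; the widget name and default are hypothetical:

    # Define a text widget, then read its current value for parameterized runs.
    dbutils.widgets.text("run_date", "2024-01-01", "Run date")
    run_date = dbutils.widgets.get("run_date")
    print(f"Processing data for {run_date}")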
workflow - job-based orchestrated set of tasks, dependencies, and triggers.
worker node - cluster node that executes Spark tasks and holds shuffled data.
Z-order - file-level clustering technique in Delta to co-locate related values for faster reads.
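For example, clustering a hypothetical events table on user_id while compacting:

    # Co-locate rows with similar user_id values so data skipping prunes more files.
    spark.sql("OPTIMIZE events ZORDER BY (user_id)")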