Glossary

Glossary of terms associated with Databricks

ADLS2 - Azure Data Lake Storage Gen2; blob storage with a hierarchical namespace, used for analytics.
Auto Loader (autoloader) - Databricks feature that incrementally ingests new files as they arrive in a cloud storage path.
autoscaling - automatic resizing of a cluster's worker count based on workload demand.
Azure Key Vault - Azure-managed secret store for the credentials, keys, and SAS tokens used for DBX access.
Azure Managed Identity - identity assigned to a DBX resource so that it can securely access Azure services without stored secrets.
barcode - internal Databricks deployment identifier for clusters, notebooks, and jobs.
batch processing - processing data in grouped chunks rather than as a continuous stream.
blob storage - Azure object storage; inexpensive storage for any type of file. The Amazon equivalent is S3.
bronze layer - raw ingestion layer in the medallion architecture; contains minimally processed source data.
catalog - top-level container in Unity Catalog that organizes schemas and tables.
checkpoint - directory used by Structured Streaming and Auto Loader to track state for incremental processing.
cloudFiles - Structured Streaming source that implements Auto Loader; specified as the stream format ("cloudFiles") for incremental file ingestion.
cluster mode - legacy cluster setting (single-node, standard, or high-concurrency) that determines whether the cluster has workers and whether it is tuned for one user or for many users sharing it.
cluster policies - governance rules that restrict the cluster configurations users are allowed to create.
compute plane - where the cluster VMs run and where notebook commands, jobs, worker tasks, and shuffle operations execute.
concurrency - number of simultaneous queries a cluster can process; high-concurrency clusters are tuned for this.
control plane - the "brains" of DBX. The DBX UI is in the control plane, which issues commands to the node(s) in the compute plane.
copy-on-write - Delta Lake mechanism in which an update rewrites each affected Parquet file as a new file instead of modifying it in place.
data lineage - Unity Catalog-tracked history of data flows between tables, notebooks, and jobs.
data plane - earlier Databricks name for the compute plane, where data is processed; the user's data itself typically resides in a blob storage container in an Azure storage account.
data skipping - Delta optimization that uses per-file column statistics (such as minimum and maximum values) recorded in the Delta log to skip files that cannot match a query.
Databricks Connect - client library that lets code in a local IDE run against a remote Databricks cluster.
Databricks Runtime - versioned environment that defines the Spark version, Delta version, and libraries on a cluster.
DBX - abbreviation for Databricks used in this glossary.
Delta Lake - open-source storage layer that adds ACID transactions, schema enforcement, and time travel to Parquet-based tables.
Delta Live Tables - declarative ETL framework in DBX for building pipelines with quality checks and event tracking.
Delta Log - the _delta_log folder of a Delta table; contains JSON transaction files (plus periodic Parquet checkpoint files) that track commits and table versions.
driver node - orchestrates Spark tasks, maintains metadata, and coordinates work among executors/workers.
ETL - extract, transform, load; standard data integration pipeline pattern.
Event Grid - Azure service that publishes event notifications, such as blob-created events in a storage account, which can be used to trigger ingestion.
executor node - more precisely an executor: the Spark process running on a worker node that performs the actual computation for tasks.
Hive metastore - legacy metadata catalog used prior to Unity Catalog, scoped per workspace.
instance pool - set of idle, ready-to-use VMs that shortens cluster startup and autoscaling times; idle pooled VMs incur no DBU charges, but cloud provider charges still apply.
job cluster - ephemeral cluster automatically spun up for a job, then terminated after completion.
lakehouse - unified architecture combining data lake storage and data warehouse capabilities.
manifest file - text file listing the data files that make up the current version of a Delta table, generated so that external engines without native Delta support (for example, Presto or Athena) can read it.
materialized view - view whose results are precomputed, stored, and refreshed to improve query performance.
medallion architecture - bronze → silver → gold tiered data modeling design for incrementally refined datasets.
metastore - metadata service containing catalogs, schemas, tables, permissions, and lineage.
MLflow - open-source platform for experiment tracking and model management, integrated into Databricks.
multi-cluster warehouse - SQL warehouse that can automatically scale out to multiple clusters to handle concurrent queries.
notebook - interactive coding environment for Scala, SQL, Python, or R inside the Databricks workspace.
OPTIMIZE - Delta table maintenance command that compacts small files into larger ones to improve read performance.
Parquet files - files in a compressed, columnar data format; each file embeds its schema (column names and data types) and metadata such as row-group statistics.
Photon - Databricks vectorized execution engine written in C++ for faster SQL performance.
pipelines - automated workflows in DBX Jobs or Delta Live Tables for scheduled ETL processes.
Power BI connector - direct connection option between Power BI and Databricks SQL warehouses.
query profile - graphical breakdown of a query's stages, tasks, and performance characteristics.
RBAC - role-based access control; permissions model in Unity Catalog for fine-grained governance.
schema - logical grouping of tables within a catalog (similar to a database schema in SQL Server).
schema enforcement - Delta Lake’s ability to reject writes whose column names or data types do not match the table schema.
schema evolution - Delta Lake option that, when enabled (for example, with mergeSchema), automatically adds new columns found in incoming data to the table schema.
serverless SQL - Databricks-managed SQL compute with fast startup and no cluster management.
shallow clone - lightweight metadata-only clone of a Delta table that references the same underlying data files.
silver layer - cleaned and conformed data layer in the medallion architecture.
Spark - Apache Spark; distributed compute engine used under the hood by Databricks for parallel processing.
SQL warehouse - compute resource optimized for SQL workloads, formerly called a SQL endpoint.
table ACLs - access controls that regulate who can query, modify, or manage tables.
table history - list of previous Delta table versions with timestamps and operations.
task - smallest unit of work in Spark; each task processes one partition within a stage and runs on an executor.
time travel - Delta feature that lets you query older versions of a table by version number or timestamp.
token - personal access token (PAT); workspace-specific credential used to authenticate external tools to DBX.
UDF - user-defined function; custom Python/Scala logic that can be used in DataFrame code or registered for use in SQL.
UDI (Update/Delete/Insert) - mutation operations that modify Delta tables atomically.
Unity Catalog - centralized governance layer for permissions, metadata, auditing, and lineage.
VACUUM - Delta command that permanently deletes data files that are no longer referenced by the current table version and are older than the retention threshold (7 days by default); time travel to versions that needed those files stops working.
view - saved SQL query definition that appears as a table but does not store its own data.
volume - Unity Catalog (UC) object that governs access to a directory of non-tabular files, such as raw data files and code assets.
widget - notebook UI control (dropdown, text box, and similar) for parameterizing notebooks, including notebooks run as jobs or used in dashboards.
worker node - cluster node that hosts the executors running Spark tasks and holds shuffle data.
workflow - job-based orchestrated set of tasks, dependencies, and triggers.
Z-order - file-level clustering technique in Delta that co-locates related values of chosen columns in the same files, so that data skipping prunes more files on reads.
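
Several Delta entries above (Delta Log, copy-on-write, time travel, VACUUM) describe one mechanism seen from different angles, so a small sketch may help tie them together. The code below is a toy in-memory model in plain Python, not the Delta Lake API: the class ToyDeltaTable and its methods (append, update, live_files, read, vacuum) are hypothetical names invented for illustration, rows are plain tuples standing in for Parquet data, and file age is measured from the moment a file was removed from the table, which simplifies how the real VACUUM command works.

```python
class ToyDeltaTable:
    """Toy in-memory model of a Delta table (hypothetical, not the Delta API)."""

    def __init__(self):
        self.log = []         # log[v] holds the actions committed as version v
        self.files = {}       # file name -> tuple of rows (stands in for Parquet)
        self.removed_at = {}  # file name -> time its "remove" action was committed
        self._next_id = 0

    def _write_file(self, rows):
        name = f"part-{self._next_id:04d}.parquet"
        self._next_id += 1
        self.files[name] = tuple(rows)
        return name

    def append(self, rows):
        """Insert: write one new file and commit an 'add' action."""
        self.log.append([("add", self._write_file(rows))])
        return len(self.log) - 1

    def update(self, predicate, change, now):
        """Copy-on-write: rewrite every live file that holds a matching row."""
        actions = []
        for name in sorted(self.live_files()):
            rows = self.files[name]
            if any(predicate(r) for r in rows):
                new_rows = [change(r) if predicate(r) else r for r in rows]
                actions.append(("remove", name))
                actions.append(("add", self._write_file(new_rows)))
                self.removed_at[name] = now
        self.log.append(actions)
        return len(self.log) - 1

    def live_files(self, version=None):
        """Replay the log up to `version` (default: latest) to find live files."""
        last = len(self.log) - 1 if version is None else version
        live = set()
        for actions in self.log[: last + 1]:
            for op, name in actions:
                if op == "add":
                    live.add(name)
                else:
                    live.discard(name)
        return live

    def read(self, version=None):
        """Time travel: return the rows as of `version`.

        Raises KeyError if a file needed by that version was vacuumed.
        """
        rows = []
        for name in sorted(self.live_files(version)):
            rows.extend(self.files[name])
        return sorted(rows)

    def vacuum(self, now, retention):
        """Delete files removed from the table more than `retention` ago."""
        doomed = sorted(
            name for name, t in self.removed_at.items()
            if name in self.files and now - t > retention
        )
        for name in doomed:
            del self.files[name]
        return doomed


table = ToyDeltaTable()
v0 = table.append([(1, "a"), (2, "b")])
v1 = table.append([(3, "c")])
v2 = table.update(lambda r: r[0] == 2, lambda r: (r[0], "B"), now=2)

latest = table.read()               # rows after the update
as_of_v1 = table.read(version=1)    # time travel to before the update
kept = table.vacuum(now=5, retention=7)      # old file still inside retention
deleted = table.vacuum(now=10, retention=7)  # old file now past retention
```

In this sketch the update never touches part-0000.parquet in place; it commits a remove action for it and an add action for a rewritten copy, which is why version 1 stays readable until the second vacuum call deletes the old file. After that call, table.read(version=1) raises KeyError in the toy model, mirroring how vacuuming limits time travel.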