Databricks Unity Catalog OCF Connector: Install and Configure

Prerequisites

  • Unity Catalog is enabled in Databricks.

  • The workspaces have been assigned to the Unity Catalog metastore, and there is a running Unity Catalog-compatible interactive cluster or SQL warehouse that Alation can connect to for metadata extraction.

  • For Lineage Extraction (Beta) and Query Log Ingestion (Beta), the system schema system.access is enabled.

Network Connectivity

Open outbound TCP port 443 from Alation to the Databricks Unity Catalog server.

Extraction of Complex Data Types

Complex data types, such as map, array, and struct, are extracted.

To enable their representation in the Alation user interface as a tree structure, on your Alation instance, set the alation_conf parameter alation.feature_flags.enable_generic_nosql_support to True.

Additionally, you can use the parameter alation.feature_flags.docstore_tree_table_depth to define the depth of the display. By default, three levels are displayed.
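
For example, assuming the standard alation_conf syntax for setting values, you could set both parameters from the Alation shell as follows (the depth value of 4 is illustrative):

alation_conf alation.feature_flags.enable_generic_nosql_support -s True
alation_conf alation.feature_flags.docstore_tree_table_depth -s 4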

Important

After changing the values of these parameters, restart Alation Supervisor from the Alation shell: alation_supervisor restart all.

For information about using alation_conf, refer to Using alation_conf.

Create Service Account

  • Create a Databricks account-level user to be used as a service account in Alation.

  • Assign the service account the USAGE and SELECT permissions on all catalog, schema, and table objects that you want to catalog in Alation. For an example of such grants, see the sketch after this list.

  • Lineage extraction requires additional permissions. See Permissions for Lineage Extraction below.

  • Query log ingestion (QLI) requires additional permissions. See Permissions for QLI below.

  • Assign the service account to workspace(s) using the information in Manage users, service principals, and groups. The service account must be assigned to the same workspace(s) as the cluster or SQL warehouse.

  • Assign the service account the Can Use permission on the cluster or SQL warehouse.
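
As a sketch, the catalog-level grants from the list above could look like the following in SQL, assuming a hypothetical catalog my_catalog, schema my_schema, table my_table, and service account alation_svc@example.com (recent Unity Catalog releases use the privilege names USE CATALOG and USE SCHEMA in place of USAGE):

GRANT USE CATALOG ON CATALOG my_catalog TO `alation_svc@example.com`;
GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `alation_svc@example.com`;
GRANT SELECT ON TABLE my_catalog.my_schema.my_table TO `alation_svc@example.com`;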

Permissions for Lineage Extraction

The Unity Catalog lineage feature is currently in Public Preview in Databricks and may require separate access enablement.

Contact your Databricks administrator about enabling access to this feature. Lineage extraction in Alation uses this functionality and is currently a beta feature.

Lineage extraction requires access to the system catalog, system.access schema, and the tables in this schema. Grant the Alation service account these permissions:

  • USE CATALOG on catalog system

  • USE SCHEMA on schema system.access

  • SELECT on table system.access.table_lineage

  • SELECT on table system.access.column_lineage
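
In SQL, these grants could look like the following (the service account name alation_svc@example.com is hypothetical):

GRANT USE CATALOG ON CATALOG system TO `alation_svc@example.com`;
GRANT USE SCHEMA ON SCHEMA system.access TO `alation_svc@example.com`;
GRANT SELECT ON TABLE system.access.table_lineage TO `alation_svc@example.com`;
GRANT SELECT ON TABLE system.access.column_lineage TO `alation_svc@example.com`;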

The service account does not require USE or SELECT privileges on the catalogs, schemas, and tables captured in the lineage records in the system.access lineage tables: all lineage will be extracted regardless. Any objects that exist in the system.access tables but are not cataloged in Alation will be marked as temporary (TMP) on Lineage diagrams unless temporary objects have been disabled.

Permissions for QLI

The Unity Catalog audit log feature is currently in Public Preview in Databricks and may require separate access enablement.

Contact your Databricks administrator about enabling access to this feature. Query history extraction and ingestion (QLI) in Alation uses this functionality and is currently a beta feature.

QLI requires access to the system catalog, system.access schema, and the system.access.audit table in this schema. Grant the Alation service account these permissions:

  • USE CATALOG on catalog system

  • USE SCHEMA on schema system.access

  • SELECT on table system.access.audit
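
The USE CATALOG and USE SCHEMA grants are the same as for lineage extraction; the additional grant could look like this (the service account name is hypothetical):

GRANT SELECT ON TABLE system.access.audit TO `alation_svc@example.com`;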

Authentication

The connector supports token-based authentication.

Follow the steps in Generate a personal access token in Databricks documentation to generate a personal access token.

JDBC URI

If you are using a Databricks cluster, refer to Databricks documentation for instructions on getting the JDBC URI of a cluster.

If you are using a Databricks SQL warehouse (formerly SQL endpoints), refer to Databricks documentation for instructions on getting the JDBC URI of a SQL warehouse.

The JDBC URI string you provide in Alation depends on the connector version: connector versions 2.0.2.6259 and newer use the Databricks JDBC driver connection string format, while older versions use the Spark JDBC driver format. Both formats are described below.

Note

When specifying the JDBC URI in Alation, remove the jdbc: prefix.

Connection String for Databricks JDBC Driver

Find more information in Databricks JDBC driver in Databricks documentation.

Format

databricks://<hostname>:443/default;transportMode=http;ssl=1;httpPath=<databricks_http_path_prefix>/<databricks_cluster_id>;AuthMech=3;

Examples

Compute Cluster

databricks://dbc-32ak8401-ac16.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/2479012801311837/0612-093241-z79vbfjk;AuthMech=3;

SQL Warehouse

databricks://dbc-32am8401-ac16.cloud.databricks.com:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/9f5d50hhsaeb0k23;

Connection String for Spark JDBC Driver

Find more information in JDBC Spark driver in Databricks documentation.

Format

spark://<hostname>:443/default;transportMode=http;ssl=1;httpPath=<databricks_http_path_prefix>/<databricks_cluster_id>;AuthMech=3;

Example

spark://adb-58175503737864.5.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=/sql/1.0/endpoints/0f38f55be5cbd786;AuthMech=3;

JDBC URI Properties

The connector adds some JDBC properties to the connection. You do not need to include these properties explicitly in the JDBC URI connection string in Alation:

  • RowsFetchedPerBlock—Limits the number of rows returned in each fetch call. This regulates the amount of memory used by the connector and prevents out-of-memory (OOM) errors. Set to 500. The memory utilization of the MDE job is captured in the connector logs when debug logging is enabled.

  • UserAgentEntry—Identifies driver request calls from the connector in Databricks. Set to alation+unity_catalog.

Configuration in Alation

STEP 1: Install the Connector

Alation On-Prem

Important

Installation of OCF connectors requires Alation Connector Manager to be installed as a prerequisite.

To install an OCF connector:

  1. If this has not been done on your instance, install the Alation Connector Manager: Install Alation Connector Manager.

  2. Ensure that the OCF connector Zip file that you received from Alation is available on your local machine.

  3. Install the connector on the Connectors Dashboard page using the steps in Manage Connectors.

Alation Cloud Service

Note

On Alation Cloud Service instances, Alation Connector Manager is available by default.

  1. Ensure that the OCF connector Zip file that you received from Alation is available on your local machine.

  2. Install the connector on the Connectors Dashboard page using the steps in Manage Connectors.

STEP 2: Create and Configure a New Data Source

In Alation, add a new data source:

  1. Log in to Alation as a Server Admin.

  2. Expand the Apps menu on the right of the main toolbar and select Sources.

  3. On the Sources page, click +Add on the top right of the page and in the list that opens, click Data Source. This will open the Add a Data Source wizard.

  4. On the first screen of the wizard, specify a name for your data source, assign additional Data Source Admins, if necessary, and click the Continue Setup button on the bottom. The Add a Data Source screen will open.

  5. On the Add a Data Source screen, the only field you should populate is Database Type. From the Database Type dropdown, select the connector name. After that you will be navigated to the Settings page of your new data source.

The name of this connector is Databricks Unity Catalog OCF Connector.

Access

On the Access tab, set the data source visibility using these options:

  • Public Data Source—The data source will be visible to all users of the catalog.

  • Private Data Source—The data source will be visible to the users allowed access to the data source by Data Source Admins.

You can add new Data Source Admin users in the Data Source Admins section.

General Settings

Note

This section describes configuring settings for credentials and connection information stored in the Alation database. If your organization has configured Azure KeyVault or AWS Secrets Manager to hold such information, the user interface for the General Settings page will change to include the following icons to the right of most options:

../../../_images/VaultOrDB.png

By default, the database icon is selected, as shown. In the vault case, instead of the actual credential information, you enter the ID of the secret. See Configure Secrets for OCF Connector Settings for details.

Perform the configuration on the General Settings tab.

Application Settings

Specify Application Settings if applicable. Click Save to save the changes after providing the information.

BI Connection Info—This parameter is used to generate lineage between the current data source and another source in the catalog, for example a BI source that retrieves data from the underlying database. The parameter accepts the host and port information of the corresponding BI data source connection.

Use the following format: host:port

You can provide multiple values as a comma-separated list:

10.13.71.216:1541,server.com:1542

Find more details in BI Connection Info.

Disable Automatic Lineage Generation—(Present in the user interface in connector versions before 2.0.3.6564) This checkbox enables or disables automatic lineage generation from QLI, MDE, and Compose queries.

If you are using connector version 2.0.2.6259, you must disable auto-lineage generation before running QLI by selecting this checkbox.

On connector version 2.0.3.6564 or newer, you don't have to disable automatic lineage manually, as it's disabled programmatically.

Connector Settings

Populate the data source connection information and save the values by clicking Save in this section.

Data Source Connection

JDBC URI—Specify the JDBC URI in the required format.

Username—For token-based authentication, use the value token.

Password—Paste the personal access token for the service account.

Logging Configuration

Select the logging level for the connector logs and save the values by clicking Save in this section. The available log levels are based on the Log4j framework.

Log level—Select the log level to generate logs. The available options are INFO, DEBUG, WARN, TRACE, ERROR, FATAL, and ALL.

Obfuscate Literals

Skip this section as it’s not applicable to Databricks Unity Catalog data sources.

Test Connection

Under Test Connection, click Test to validate network connectivity.

If the connection test fails, make sure the JDBC URI and service account credentials are correct.

Metadata Extraction

You can configure metadata extraction (MDE) for an OCF data source on the Metadata Extraction tab of the Settings page. For Databricks Unity Catalog data sources, Alation supports full and selective default MDE. Custom query-based MDE is not supported.

Refer to Configure Metadata Extraction for OCF Data Sources for information about the available configuration options.

Lineage Extraction Job

Metadata extraction will trigger a dependent direct lineage extraction job. If you have enabled the system.access tables in Unity Catalog and have provided the service account with access to tables in the system.access schema, lineage data will be extracted into Alation. See Capturing Lineage for more details. If you have not enabled the system.access tables, the downstream lineage extraction job will fail.

You can track the result of the lineage extraction job in the Details log accessible from the Job History table at the bottom of the Metadata Extraction tab. The screenshot below shows an example of the Details log, where records relevant to the direct lineage extraction job start with the DirectLineageExtraction Info Log record. More detailed logs will be available in the connector log.

../../../_images/OCF_Databricks_UC_InstallConfig_Details.png

Lineage Extraction (Beta)

For Databricks Unity Catalog data sources, Alation calculates lineage based on lineage metadata extracted from the system lineage tables in Databricks and on data definition language (DDL) query statements run in Compose. Lineage extraction is available as a beta feature: lineage is extracted directly from the system lineage tables (Direct Lineage Extraction).

Direct Lineage Extraction

Direct lineage extraction—The connector extracts lineage directly from the system tables in Databricks that store lineage. Direct lineage extraction is triggered automatically as a downstream job dependent on the metadata extraction job. Query history, joins, filters, and popularity information will not be available as a result of direct lineage extraction, as they require query log ingestion (QLI).

For lineage data to be generated, the service account needs access to the system tables that store lineage. See Permissions for Lineage Extraction.

After you have granted the service account these permissions, lineage will be extracted automatically. No additional configuration is required on the Alation side.

Direct Lineage Feature Flags

Direct lineage extraction from OCF data sources is enabled by default. It is controlled by two alation_conf flags:

  • alation.ocf.mde.direct_lineage.enable_extraction—Enables or disables the direct lineage feature for all data sources in the catalog that support it.

  • alation.ocf.mde.direct_lineage.incremental_extraction—Enables or disables incremental lineage extraction. This flag only applies if the main feature flag alation.ocf.mde.direct_lineage.enable_extraction is set to True.
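
For example, assuming the standard alation_conf syntax for setting values, you could keep direct lineage extraction enabled while turning off incremental extraction:

alation_conf alation.ocf.mde.direct_lineage.enable_extraction -s True
alation_conf alation.ocf.mde.direct_lineage.incremental_extraction -s False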

For more on alation_conf, see Using alation_conf.

For more details on incremental lineage, see Incremental Lineage Extraction below.

Capturing Lineage

On the Databricks side, DDL query runs generate lineage records in the lineage tables in the Unity Catalog metastore: system.access.table_lineage and system.access.column_lineage.
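
For illustration, a CTAS statement like the following (with hypothetical object names) produces records in both lineage tables, linking the source table to the new target table at both the table and column level:

CREATE TABLE main.sales.monthly_totals AS
SELECT customer_id, SUM(amount) AS total_amount
FROM main.sales.transactions
GROUP BY customer_id;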

In Alation, direct lineage extraction is triggered as a downstream job after metadata extraction (MDE). The direct lineage extraction job reads the system lineage tables and extracts and ingests lineage information into Alation. Ingested lineage will become available on the Lineage tab of the catalog pages of data objects (tables and views) under the Databricks Unity Catalog OCF data source.

Direct lineage extraction depends on the lineage capture feature in Databricks, inheriting its requirements and limitations. If specific records are not available in the system lineage tables due to limitations on the Databricks side, they will not be available in Alation either. Review the requirements and limitations in Databricks documentation:

Note

  • Lineage records are stored in the system lineage tables in Databricks for 30 days. Dropping a view or table with lineage does not immediately remove its associated lineage: Alation will still show the object on the Lineage diagram, while the corresponding catalog page displays the message This object appears to have been removed from the data source.

  • Altering the columns of a table or view after lineage has already been created does not change the existing lineage records.

Dataflow Content From Direct Lineage Extraction

Dataflow objects generated by lineage extraction will not show the SQL queries. The Dataflow Content field will contain the URL of the Databricks entity that generated the lineage (a notebook, a dashboard, a workflow, or a Databricks SQL query).

../../../_images/OCF_Databricks_UC_InstallConfig_Dataflow.png

For more on Dataflow objects, see Dataflow Objects.

Lineage from Compose Queries

In addition to lineage extracted from Databricks, Alation will capture lineage from DDL SQL queries executed in Compose. The Compose queries will be available in the Dataflow Content field on the Lineage diagram.

Important

Use the multipart schema names when referencing schemas, tables, and views in Compose for lineage to be captured—catalog.schema.table.
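
For example, a view created in Compose with fully qualified three-part names (hypothetical names below) will have its lineage captured:

CREATE VIEW main.sales.active_customers AS
SELECT customer_id, status
FROM main.sales.customers
WHERE status = 'active';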

Incremental Lineage Extraction

Incremental lineage extraction is available from connector version 1.1.0.4393 and Alation release 2023.1.4.

Incremental lineage extraction is supplemental to direct lineage extraction. The lineage extraction job creates a timestamp “bookmark” and stores it in Postgres. The bookmark is the latest created_at value found in the system lineage tables. During the next MDE, Alation only extracts the lineage records whose created_at value is later than the value stored in the bookmark. For example, if an initial MDE job extracts 50 lineage records into the Alation catalog on day one and creates a bookmark, the next MDE only extracts lineage records with a created_at value later than that bookmark, thus adding lineage records incrementally. The same extraction job creates a new bookmark to be used during the subsequent MDE.
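
Conceptually, the incremental filter resembles the following query; this is an illustration of the bookmark logic, not the exact query the connector runs:

SELECT *
FROM system.access.table_lineage
WHERE created_at > '2024-05-01T10:15:00'; -- the last bookmark stored in Alation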

If the incremental lineage feature flag is disabled, the MDE job will extract all available lineage records but only ingest the records that were not previously extracted and are not present in Alation. This may increase the time of the MDE extraction job, depending on how much metadata you are extracting.

See Direct Lineage Feature Flags for information on the feature flag that controls incremental lineage extraction.

Sampling and Profiling

Sampling and profiling, including dynamic profiling, are supported from connector version 1.0.2.3423, compatible with Alation version 2022.4 or newer.

For details, see Configure Sampling and Profiling for OCF Data Sources.

Query Log Ingestion (Beta)

Query Log Ingestion (QLI) is available from connector version 2.0.2 or newer. The connector supports query history extraction from the system audit log table system.access.audit.

Prerequisites

Before you can perform QLI from your data source in Alation, you will need to:

  1. Enable the system.access schema.

  2. Configure audit log delivery on your Databricks account and workspace.

  3. Enable verbose audit logging to capture additional events in the audit log table. Your logs will include the following additional events:

    • For notebooks: runCommand

    • For SQL warehouse: commandStart, commandFinish

    Alation requires these events to be available in the audit table to successfully extract and ingest query history.

  4. Grant Permissions for QLI to the Alation service account.

  5. In Alation, ensure that you run metadata extraction before running QLI.

QLI Query

Alation runs the following query on the system.access.audit table to retrieve query history:

SELECT request_event_time AS startAt,
  username,
  queryText,
  TIMESTAMPDIFF(SECOND, request_event_time, response_event_time) AS executionTime,
  statusCode,
  errorMessage,
  session_id
FROM (
  SELECT
    user_identity.email AS username,
    request_params.commandText AS queryText,
    event_time AS request_event_time,
    request_params.commandId
  FROM system.access.audit
  WHERE action_name = 'commandSubmit'
  ) submitTable
  LEFT JOIN (
    SELECT
      response.status_code AS statusCode,
      response.error_message AS errorMessage,
      session_id,
      event_time AS response_event_time,
      request_params.commandId
    FROM system.access.audit
    WHERE action_name = 'commandFinish'
    ) finishTable
  ON submitTable.commandId = finishTable.commandId
WHERE submitTable.request_event_time >= "FROM_TIME"
  AND submitTable.request_event_time <= "TO_TIME"
  AND finishTable.statusCode = 200
  AND finishTable.errorMessage IS NULL
UNION ALL
SELECT
  event_time AS startAt,
  user_identity.email AS username,
  request_params.commandText AS queryText,
  request_params.executionTime AS executionTime,
  response.status_code AS statusCode,
  response.error_message AS errorMessage,
  session_id
FROM system.access.audit
WHERE action_name in ('runCommand')
  AND event_time >= "FROM_TIME"
  AND event_time <= "TO_TIME"
  AND response.status_code = 200
  AND response.error_message IS NULL;

Perform QLI

You can either run QLI manually on demand or configure it to run automatically on a schedule.

Note

Alation supports a maximum SQL query length of 100K rows. There is also a 50 KB limitation on query file size. Queries with more than 100K rows or queries larger than 50 KB cannot be parsed and ingested by QLI jobs.

Run QLI Manually

To perform QLI manually on demand:

  1. If you are using connector version 2.0.2.6259, you must disable lineage auto-generation before running QLI. Open the General Settings tab, select the Disable Automatic Lineage Generation checkbox under Application Settings, and click Save. If you are using a newer connector version (2.0.3.6564 or later), you don't have to disable automatic lineage manually: from this connector version, it's disabled programmatically, and the corresponding control is not present in the user interface.

    Note

    After every QLI job, the QLI framework in Alation runs a downstream sub-job to calculate lineage based on the query history metadata ingested during QLI. However, for a Databricks Unity Catalog data source, lineage is extracted directly from the system tables. Disabling automatic lineage calculation disables the lineage downstream job and prevents creating duplicate lineage paths from both direct lineage extraction and QLI.

  2. On the Query Log Ingestion tab of the Settings page of your data source, under Automated and Manual Query Log Ingestion ensure that the Enable Automated Query Log Ingestion toggle is disabled.

  3. Specify the desired date range using the Date Range calendar widgets. You will need to specify the start date and the end date separately.

  4. Click the Preview button to preview the queries that will be ingested. This will run a preview job that will fetch a subset of the queries based on your QLI configuration. The result will be displayed in a table.

  5. Click the Import button. This will start a query log ingestion job.

View Job Status

The status of preview or QLI jobs is logged in the Query Log Ingestion Job Status table at the bottom of the page. In this table, click the status link or the View Details link for a job to view the details on its progress.

Schedule QLI

To schedule QLI:

  1. On the Query Log Ingestion tab, under Automated and Manual Query Log Ingestion, enable the Enable Automated Query Log Ingestion toggle.

  2. In the Automated Query Log Ingestion Time panel that will appear, specify the values for the job recurrence and time. The values are set in your local time.

    Note

    The hourly schedule for automated QLI is not supported.

  3. The next QLI will run on the schedule you set.

View Query History in the Catalog

After a successful QLI run, you will be able to find and view the following information under your data source:

Note

Joins, filters, and popularity are calculated by a scheduled task that runs asynchronously after query log ingestion. After running QLI for the first time, you will see popularity, join, and filter information in the catalog on the next day.

Query History

The query history information that Alation ingests during QLI is made available on the Queries tab on the catalog pages for schemas and tables.

To view query history:

  1. Navigate to the catalog page of a data source object, such as a schema or a table.

  2. Open the Queries tab.

  3. Select the History section from the list on the left. All the queries ingested for this object during QLI will be listed here.

  4. You can expand the query panels to see the full query statement and the related properties.

Filters, Joins, and Expressions

When ingesting query history, Alation parses SQL statements and catalogs joins and filter expressions. The joins and filters can be viewed on the Joins and Filters tabs on catalog pages of the relevant table objects. Filter expressions are captured under the Expressions tab on catalog pages of the relevant column objects.

To view the joins and filter information:

  1. Navigate to the catalog page of a table object under your data source.

  2. Open the Joins tab to view the relevant joins.

  3. Open the Filters tab to view the relevant filters.

To view filter expressions:

  1. Navigate to the catalog page of a column object under your data source.

  2. Open the Expressions tab to view the relevant filter expressions.

Popularity

Popularity of objects is calculated from the number of mentions an object has in the queries parsed during query history ingestion. Tables that are more frequently accessed by users will have a greater popularity value than tables that are queried rarely. The popularity value is displayed on catalog pages of schema, table, and column objects.

Compose

Compose is supported from connector version 1.0.2.3423, compatible with Alation version 2022.4 or newer.

For details about configuring the Compose tab of the Settings, refer to Configure Compose for OCF Data Sources.

Note

To establish a connection between Compose and Unity Catalog, Compose users need their own personal access tokens or knowledge of the service account's token.

Incremental MDE from Compose

When users create tables and views in Compose, Alation triggers a background extraction job to make the new table and view objects available in the catalog. As a result, users will see the corresponding table or view metadata in the data catalog without re-running MDE on the Settings page of the data source. When users run DDL SQL queries in Compose, incremental MDE also adds lineage information to the affected objects under the data source.

Troubleshooting

MDE

Problem: MDE fails with error “iterating schema children”

The issue may occur with older connector versions that use the legacy Spark driver. This driver sometimes cannot keep the connection alive long enough for MDE to complete.

Solution

This issue was resolved starting with connector version 2.0.2.6259, where the driver was changed to the Databricks JDBC driver. See more in JDBC URI.

Lineage

Problem: After upgrading the connector to version 1.0.3, MDE fails at ingestion

Connector logs will show the error “system.access.table_lineage table does not exist”.

Solution

Direct lineage extraction from system tables is a beta feature. The lineage system tables are not enabled by default and need to be enabled by your Databricks admin. See more in Lineage Extraction (Beta).

Problem: No lineage is generated after MDE

Solution

Verify that lineage data is available in the system lineage tables in Databricks.
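
For example, you can run a check like the following as a user with access to the system tables (the created_at column is the one referenced in Incremental Lineage Extraction above; verify the exact column names against your workspace):

SELECT source_table_full_name, target_table_full_name, created_at
FROM system.access.table_lineage
ORDER BY created_at DESC
LIMIT 10;

If this query returns no rows, no lineage-generating queries have been captured yet, and there is nothing for Alation to extract.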

QLI

Problem: After running QLI, some queries don’t appear in the History section of the Queries tab

Explanation

Some SQL statements are currently not supported by the Databricks SQL parser that is used to parse SQL statements during QLI, for example:

  • Partially qualified names in SQL statements are not parsed. Object names in query history must be fully qualified to be ingested.

  • The USE statements are not parsed.

  • Queries containing CTEs are not parsed.

  • Queries containing comments (COMMENT, COMMENT ON) are not parsed.

This is a known limitation of the current GSP (general SQL parser).

Problem: After QLI, parts of a query appear as separate queries in the catalog

Explanation

To support proper query parsing, accurate execution times count, and lineage resolution, batched queries are split into individual query statements before ingestion.

Problem: Queries that appeared in Preview don’t appear in the catalog after QLI

Explanation

The preview is not directly tied to the queries that will appear in the catalog: it returns 100 statements, which may be only a subset of all imported queries. For queries to appear in the catalog after QLI:

  • Queries must contain fully qualified object names with three-level namespace (catalog.schema.table).

  • The objects referenced in queries must have been cataloged with MDE.

  • Queries must not duplicate identical queries that were ingested previously. Duplicate queries are not resolved.