Amazon EMR OCF Connector: Install and Configure

Alation Cloud Service Applies to Alation Cloud Service instances of Alation

Customer Managed Applies to customer-managed instances of Alation

Network Connectivity

Open outbound TCP port 8889 to the Amazon EMR Presto server.

Service Account

Sample SQL to Create a Role

CREATE ROLE moderator WITH ADMIN USER alation;

Permissions for Profiling and Query Log Ingestion

GRANT SELECT ON <catalog>.<schema>.<tablename> TO alation WITH GRANT OPTION;

Permissions

The service account must be granted read access to the tables in the system.jdbc schema.

JDBC URI

Use the following format to build the JDBC URI:

presto://<hostname_or_ip>:<port>/<service_name>

Example

presto://ec2-3-220-155-20.compute.amazonaws.com:8889/hive

Basic Authentication

You’ll need the following credentials for basic authentication:

  • Username of the service account

  • Password of the service account

JDBC URI for Basic Authentication

Use the following JDBC URI in Alation to connect to the Presto instance:

presto://<hostname_or_ip>:<port>/<service_name>
Example
presto://ip-10-13-61-118.alation.com:7778/hive

Kerberos Authentication

You’ll need the following credentials and resources for kerberos authentication:

  • Username of the service account

  • Password of the service account

  • The krb5.conf file

  • The Presto key tab file

Keytab Authentication

You’ll need the following credentials and resources for the keytab authentication:

  • Username of the service account

  • The krb5.conf file

  • The presto.jks key file
    • Download this file from the EMR Presto server: /etc/presto/presto.jks

  • The keytab file

  • The Presto keytab file for the service account

JDBC URI for Kerberos Authentication

Use the following JDBC URI in Alation to connect to the Presto instance:

presto://<hostname_or_ip>:<port>/<service_name>/default?SSL=true&KerberosRemoteServiceName=presto&KerberosPrincipal=presto/<hostname_or_ip>&KerberosUseCanonicalHostname=false
Example
presto://ip-10-13-61-118.alation.com:7778/hive/default?SSL=true&KerberosRemoteServiceName=presto&KerberosPrincipal=presto/ip-10-13-61-118.alation.com@ALATION.COM&KerberosUseCanonicalHostname=false

JDBC Parameters and Usage

Parameter

Description

KerberosRemoteServiceName

The Presto coordinator Kerberos service name. This parameter is required for Kerberos authentication.

KerberosPrincipal

The principal to use when you authenticate to the Presto coordinator.

KerberosUseCanonicalHostname

The canonical hostname of the Presto coordinator for the Kerberos service principal. You must resolve the hostname to an IP address and then do a reverse DNS lookup for that IP address. This is enabled by default.

For more information, refer to the JDBC driver parameter reference in the Presto documentation.

SSL Authentication

You’ll need the following credentials and resources for authentication:
  • Username of the service account

  • Password of the service account

  • The SSL certificate file
    • Export the SSL certificate in the CRT file format from the EMR Presto server. You’ll need to upload it to Alation.

  • The TrustStore password
    • The default password is changeit

JDBC URI for SSL Authentication

Use the following JDBC URI in Alation to connect to the Presto instance:

presto://<hostname>:<port>/<service_name>/?SSL=true
Example
presto://ip-10-13-61-118.alation.com:7778/hive/default?SSL=true

LDAP Authentication

You’ll need the following credentials and resources for authentication:

  • Username of the service account

  • Password of the service account

  • SSL Certificate file
    • Download the keystore.jks file from EMR Presto server: /usr/share/aws/emr/security/conf/Keystore.jks

  • TrustStore password
    • Copy truststore password from EMR Presto server: In config.properties : etc/presto/conf

Configuration in Alation

STEP 1: Install the Connector

Alation On-Premise

Important

Installation of OCF connectors requires Alation Connector Manager to be installed as a prerequisite.

To install an OCF connector:

  1. If this has not been done on your instance, install the Alation Connector Manager: Install Alation Connector Manager.

  2. Ensure that the OCF connector Zip file is available on your local machine.

  3. Install the connector on the Connectors Dashboard page using the steps in Manage Connectors.

Alation Cloud Service

Note

On Alation Cloud Service instances, Alation Connector Manager is available by default.

Depending on your network configuration, you may need to use Alation Agent to connect to databases.

Connection via Alation Agent
  1. Ensure that Alation Agent is enabled on your Alation instance. If necessary, create a Support ticket with Alation for an Alation representative to enable the Alation Agent feature on your instance and to receive the Alation Agent installer.

  2. Install the Alation Agent.

  3. Install the OCF connector on Alation Agent.

Connection Without Agent

To install an OCF connector:

  1. Ensure that the OCF connector Zip file is available on your local machine.

  2. Install the connector on the Connectors Dashboard page using the steps in Manage Connectors.

STEP 2: Create and Configure a New Data Source

In Alation, add a new data source:

  1. Log in to Alation as a Server Admin.

  2. Expand the Apps menu on the right of the main toolbar and select Sources.

  3. On the Sources page, click +Add on the top right of the page and in the list that opens, click Data Source. This will open the Add a Data Source wizard.

  4. On the first screen of the wizard, specify a name for your data source, assign additional Data Source Admins, if necessary, and click the Continue Setup button on the bottom. The Add a Data Source screen will open.

  5. On the Add a Data Source screen, the only field you should populate is Database Type. From the Database Type dropdown, select the connector name. After that you will be navigated to the Settings page of your new data source.

    Note

    Agent-based connectors will have the Agent name appended to the connector name.

The name of this connector is EMR Presto OCF Connector.

Access

On the Access tab, set the data source visibility using these options:

  • Public Data Source—The data source will be visible to all users of the catalog.

  • Private Data Source—The data source will be visible to the users allowed access to the data source by Data Source Admins.

You can add new Data Source Admin users in the Data Source Admins section.

General Settings

Application Settings

Specify Application Settings if applicable. Click Save to save the changes after providing the information.

Parameter

Description

BI Connection Info

This parameter is used to generate lineage between the current data source and another source in the catalog, for example a BI source that retrieves data from the underlying database. The parameter accepts host and port information of the corresponding BI data source connection.

Use the following format: host:port

You can provide multiple values as a comma-separated list:

10.13.71.216:1541,sever.com:1542

Find more details in BI Connection Info.

Disable Automatic Lineage Generation

Select this checkbox to disable automatic lineage generation from QLI, MDE, and Compose queries. By default, automatic lineage generation is enabled.

Connector Settings

Data Source Connection

Populate the data source connection information and save the values by clicking Save in this section.

Parameter

Description

JDBC URI

Specify the JDBC URI in the required format.

Username

Specify the service account username.

Password

Specify the service account password.

Enable Kerberos authentication

Select this checkbox if using Kerberos authentication and upload the krb5.conf file using the upload link under the checkbox.

Use Presto keytab

If you enable Kerberos authentication, select Use Presto Keytab checkbox and upload the Presto keytab file.

Use keytab

Select this checkbox if using keytabs and upload the keytab file for the service account using the upload link under the checkbox.

Enable LDAP

Select this checkbox to use LDAP authentication. Also, select Enable SSL if you’ve enabled the LDAP authentication.

Enable SSL

Enable or disable SSL authentication by selecting or clearing the Enable SSL checkbox.

If the Enable SSL checkbox is enabled, upload the SSL certificate using the upload link below.

Truststore password

Specify the password for the SSL certificate.

The password will be deleted if the data source connection is deleted.

Logging Configuration

Select the logging level for the connector logs and save the values by clicking Save in this section. The available log levels are based on the Log4j framework.

Parameter

Description

Log level

Select the log level to generate logs. The available options are INFO, DEBUG, WARN, TRACE, ERROR, FATAL, ALL.

Obfuscate Literals

Obfuscate Literals—Enable this toggle to hide actual values in the query statements that are ingested during query log ingestion or executed in Compose. This toggle is disabled by default.

Test Connection

Under Test Connection, click Test to validate network connectivity.

If the connection test fails, make sure the JDBC URI and service account credentials are correct.

Metadata Extraction

You can configure metadata extraction (MDE) for an OCF data source on the Metadata Extraction tab of the Settings page. Refer to Configure Metadata Extraction for OCF Data Sources for information about the available configuration options.

This connector supports default query-based MDE. Custom query-based extraction is not supported.

Compose

For details about configuring the Compose tab of the Settings page, refer to Configure Compose for OCF Data Sources.

Sampling and Profiling

Sampling and profiling is supported. For details, see Configure Sampling and Profiling for OCF Data Sources.

Query Log Ingestion

To use Query Log Ingestion, run the custom query-based QLI.

Custom Query-Based QLI

For custom query-based QLI to succeed, ensure that the service account has enough permissions to select from the query log table.

Find an example of a custom query for QLI below. You can customize it by adding, removing, or changing the filter, but the columns and their aliases must remain as is since the connector expects this query structure.

Note

When using the QLI query template, do not substitute the STARTTIME and ENDTIME parameters in the WHERE filter. These parameters are not actual column names and should stay as is. They are expected by the connector and will be substituted with the start and end date of the QLI range selected in the user interface when QLI is run manually or on schedule.

SELECT user AS username,
    query AS queryString,
    source AS defaultDatabases,
    False AS sessionid,
    created AS sessionstarttime,
    started AS starttime,
    False AS cancelled,
    date_diff('second',"started", "end") AS seconds
FROM system.runtime.queries
WHERE state ='FINISHED'
    AND date(started) BETWEEN date STARTTIME
    AND date ENDTIME
ORDER BY sessionId, startTime;

To configure query-based QLI:

  1. Go to the Query Log Ingestion tab of the Settings page of your OCF data source.

  2. Under Connector Settings > Query Extraction, in the Custom QLI Query field, provide the QLI query.

  3. Click Save.

Automated and Manual Query Log Ingestion

You can either perform QLI manually on demand or enable automated QLI:

To perform manual QLI, under the Automated and Manual Query Log Ingestion section of the Query Log Ingestion tab, ensure that the Enable Automated Query Log Ingestion toggle is disabled.

Note

Metadata extraction must be completed first before running QLI.

  1. Click Preview to get a sample of the query history data to be ingested.

  2. Click the Import button to perform QLI on demand.

  3. To schedule QLI, enable the Enable Automated Query Log Ingestion toggle.

  4. Set a schedule under Automated Query Log Ingestion Time by specifying values in the week, day, hour, and time fields. The next QLI job will run on the schedule you have specified.

Troubleshooting

Refer to Troubleshooting for information about logs.

Map FQDN in the Presto Server Host File

When you use Kerberos authentication, you need to map the EMR Presto server domain name with the Kerberos Principal domain name. If you see the following error while authenticating via Kerberos, you may need to do the mapping:

KrbException: Server not found in Kerberos database (7) - LOOKING_UP_SERVER

To map the domain name, follow these steps to edit the EMR Presto host file:

  1. Log in to the EMR Presto server using SSH.

  2. Navigate to the etc/ path and open the host file.

  3. Add the following IP mapping to the hostname:

    • IP address—Example: ​10.13.63.118

    • Fully qualified name—Example: ip-10-13-63-118.alation.com

    • Alias name—Example: ip-10-13-63-118