Set Up Incremental MDE

Applies to: Alation Cloud Service instances and customer-managed instances of Alation

You can configure metadata extraction (MDE) to be incremental. This means that each consecutive MDE job will extract new metadata and metadata updates but not the full scope of selected metadata.

Important

Incremental extraction is a resource-intensive operation. Before you proceed, review the prerequisites listed in this section and the additional cost you may incur within your AWS account, if you have not already done so.

Alation recommends using this feature only when the volume of incremental changes is small compared to the total number of objects extracted. Example: 10K incremental changes over 5 million objects.

If you enable incremental sync for MDE, the very first MDE job will extract all accessible metadata. Subsequent MDE jobs will extract only new metadata and metadata updates, not the full scope of selected metadata.

Note

If you are using Alation’s Terraform script to configure the inventory, the same script can be used to set up incremental MDE. For more information, see the Configure Inventory with a Terraform Script section in Prerequisites.

Configure Incremental Sync

Before enabling incremental sync for MDE, review these considerations:

  • Incremental extraction is recommended when the bucket size is very large but only a few changes happen to its contents daily.

  • The time required for incremental extraction depends on the number of incremental events and the number of affected objects. The more incremental events there are, the longer the extraction may take. According to Alation’s internal MDE performance analysis:

    • It may take about 85 minutes to process 100K incremental events.

    • Full extraction (non-incremental) may take five to six hours to process 50M objects.

  • Incremental MDE has a number of limitations:

    • As part of the configuration for incremental extraction, you must create event notifications for the source buckets. If you already have event notifications set up on the source buckets, you will not be able to use incremental extraction.

    • The last modification time will not be displayed for the folder objects.

    • The last modified timestamp for files for incremental events will be displayed as the time.
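The performance figures above can be turned into a rough planning estimate. The sketch below is only illustrative: it assumes that processing time scales linearly with the number of events or objects, which Alation's figures do not guarantee, and all function and constant names are hypothetical.

```python
from typing import Tuple

# Figures quoted from Alation's internal MDE performance analysis.
# Assumption (not stated by Alation): time scales linearly with volume.
INCREMENTAL_MINUTES_PER_EVENT = 85 / 100_000           # ~85 minutes per 100K events
FULL_HOURS_PER_OBJECT = (5 / 50_000_000, 6 / 50_000_000)  # 5 to 6 hours per 50M objects


def change_ratio(incremental_events: int, total_objects: int) -> float:
    """Fraction of the extracted objects touched by incremental events."""
    return incremental_events / total_objects


def estimate_incremental_minutes(incremental_events: int) -> float:
    """Linear extrapolation of incremental extraction time, in minutes."""
    return incremental_events * INCREMENTAL_MINUTES_PER_EVENT


def estimate_full_hours(total_objects: int) -> Tuple[float, float]:
    """Linear extrapolation of full extraction time, as a (low, high) range in hours."""
    low, high = FULL_HOURS_PER_OBJECT
    return total_objects * low, total_objects * high


if __name__ == "__main__":
    # The example from the recommendation: 10K changes over 5 million objects.
    events, objects = 10_000, 5_000_000
    low, high = estimate_full_hours(objects)
    print(f"change ratio: {change_ratio(events, objects):.2%}")
    print(f"incremental:  about {estimate_incremental_minutes(events):.1f} minutes")
    print(f"full:         about {low:.1f} to {high:.1f} hours")
```

Under these assumptions, the 10K-over-5-million example corresponds to a change ratio of 0.2%, which is the kind of workload the recommendation describes.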

To configure incremental sync for MDE, you will need to perform configurations in Amazon S3 and in Alation:

Perform Configuration in Amazon S3

Use the steps in this section to configure incremental MDE manually.

Note

You can also use a Terraform script provided by Alation: Use Terraform to Set Up Inventory and Incremental MDE.

Create an IAM Role for a Lambda Function

Create an IAM role for the Lambda function you’ll create later. Attach the AWS managed policy AWSLambdaBasicExecutionRole, and add a policy like the example below with the Resource value set to the destination bucket created in Create an S3 Bucket to Store Inventory Reports. The role provides write access to the destination bucket.

Refer to Lambda execution role in AWS documentation for more details.

Policy example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::alation-destination-bucket/*"
        }
    ]
}
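If you prefer to script this step instead of using the console, the role could be created with boto3 along the following lines. This is a minimal sketch, not Alation's procedure: the function names, the inline policy name, and the Sid value are hypothetical, and the trust policy for `lambda.amazonaws.com` is the standard one a Lambda execution role needs.

```python
import json


def build_put_object_policy(destination_bucket: str) -> dict:
    """Build the write-access policy from the example above for a given bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowWriteToDestinationBucket",  # hypothetical Sid
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{destination_bucket}/*",
            }
        ],
    }


# Trust policy that lets the Lambda service assume the role.
LAMBDA_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def create_lambda_role(role_name: str, destination_bucket: str) -> str:
    """Create the role, attach the managed policy, and add the inline write policy."""
    import boto3  # imported here so the policy builder works without AWS access

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(LAMBDA_TRUST_POLICY),
    )
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="write-events-to-destination-bucket",  # hypothetical name
        PolicyDocument=json.dumps(build_put_object_policy(destination_bucket)),
    )
    return role["Role"]["Arn"]
```

Calling `create_lambda_role` requires AWS credentials with IAM permissions. `build_put_object_policy` can be used on its own to generate the JSON to paste into the console.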

Create a Lambda Function

You’ll need to create a Lambda function to write event notifications to the destination bucket. The function must be created in the same AWS region as the destination bucket:

  1. Open the Lambda service in the AWS Management Console and follow the steps below to create the Lambda function in the same region as the destination bucket.

  2. Select Create Function.

  3. Select the Use a blueprint option and the s3-get-object-python template. Click Configure.

  4. Enter the Function name. Example: capture-events-for-alation.

  5. In Execution role, choose Use an existing role and select the IAM role created in Create an IAM Role for a Lambda Function.

  6. Click Create function.

  7. Replace the code in the function’s code editor with the code provided here. Make sure to use the correct destination bucket name.

    Code to replace:

    """
    Copyright (C) Alation - All Rights Reserved
    """
    
    import json
    import boto3
    import hashlib
    from datetime import datetime
    
    s3 = boto3.client('s3')
    
    def lambda_handler(event, context):
        print("Received event: " + json.dumps(event, indent=2))
    
        bucket = event['Records'][0]['s3']['bucket']['name']
        # Using md5 hash of a filepath to store as a key
        filePath = bucket + "/" + event['Records'][0]['s3']['object']['key']
        print("File Path: " + filePath)
        key = hashlib.md5(filePath.encode()).hexdigest()
        print("Key: " + key)
        date = datetime.utcnow().strftime("%Y-%m-%d")
        try:
            response = s3.put_object(
                        Body=json.dumps(event),
                        # Update the destination bucket name
                        Bucket='alation-destination-bucket',
                        Key='incremental_sync/' + bucket +'/' + date + '/' + key + '.json',
                    )
            print(response)
        except Exception as e:
            print(e)
            raise e
    
  8. Click Deploy after replacing the code.
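To see where the function writes each notification, the key derivation from the code above can be reproduced locally without touching AWS. The sketch below mirrors that logic. The helper name `destination_key`, the explicit `now` parameter, and the sample event values are illustrative assumptions, and the sample event contains only the fields the function reads.

```python
import hashlib
from datetime import datetime, timezone


def destination_key(event: dict, now: datetime) -> str:
    """Mirror the key the Lambda function uses when writing to the destination bucket."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # The md5 hash of "<bucket>/<object key>" becomes the file name.
    file_path = bucket + "/" + record["object"]["key"]
    digest = hashlib.md5(file_path.encode()).hexdigest()
    date = now.strftime("%Y-%m-%d")
    return "incremental_sync/" + bucket + "/" + date + "/" + digest + ".json"


# Hypothetical, trimmed-down S3 event with only the fields used above.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-source-bucket"},
                "object": {"key": "reports/2024/sales.csv"},
            }
        }
    ]
}

if __name__ == "__main__":
    print(destination_key(sample_event, datetime.now(timezone.utc)))
```

Because the key depends only on the bucket name, the object key, and the UTC date, several events for the same object on the same day are written to the same destination key, so a later event overwrites an earlier one.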

Create Event Notifications

Important

Perform this configuration for each source bucket.

To create an event notification:

  1. Open the source buckets where you want to create an event configuration. For each bucket, perform the following:

  2. Go to Properties > Event Notifications and click Create event notification.

  3. Provide the Event name.

  4. Select the type of events that need to be captured from this source bucket.

  5. Under Destination, select Lambda function.

  6. Under Specify Lambda function, select Choose from your Lambda functions.

  7. From the dropdown list, select the Lambda function that you created in Create a Lambda Function.

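When there are many source buckets, the same notification could be created with boto3 rather than through the console. The following is a hedged sketch only. The function names, the configuration Id, and the choice of event types (`s3:ObjectCreated:*` and `s3:ObjectRemoved:*`) are assumptions, so select the event types that match your needs. Note that `put_bucket_notification_configuration` replaces the bucket's entire notification configuration.

```python
from typing import Iterable


def build_notification_configuration(
    lambda_arn: str,
    events: Iterable[str],
    config_id: str = "alation-incremental-mde",  # hypothetical Id
) -> dict:
    """Build the NotificationConfiguration payload for one Lambda destination."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "Id": config_id,
                "LambdaFunctionArn": lambda_arn,
                "Events": list(events),
            }
        ]
    }


def configure_source_bucket(
    bucket: str, lambda_arn: str, function_name: str, events: Iterable[str]
) -> None:
    """Allow S3 to invoke the function, then set the bucket's event notification."""
    import boto3  # imported here so the payload builder works without AWS access

    # S3 needs permission to invoke the function; the console adds this for you.
    boto3.client("lambda").add_permission(
        FunctionName=function_name,
        StatementId="s3-invoke-" + bucket.replace(".", "-"),
        Action="lambda:InvokeFunction",
        Principal="s3.amazonaws.com",
        SourceArn=f"arn:aws:s3:::{bucket}",
    )
    # This call overwrites any notification configuration already on the bucket.
    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_configuration(lambda_arn, events),
    )


# Assumed event types; adjust to the events you want captured.
ASSUMED_EVENTS = ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]
```

Run `configure_source_bucket` once per source bucket, passing the ARN and name of the function created in Create a Lambda Function.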

Enable Configuration in Alation

To enable incremental sync, perform these steps:

From Alation version 2023.3.5 and connector version 3.9.0

  1. On the Settings page of your Amazon S3 file system source, go to the Metadata Extraction tab.

  2. Expand the Incremental sync section and turn on the Incremental sync toggle.

Note

Incremental sync applies to both Metadata Extraction and Column Extraction.