Move the Alation Database to AWS RDS

Customer Managed Applies to customer-managed instances of Alation

Available from version 2023.3.5

Overview

Alation uses a PostgreSQL database to store customer metadata and other information, such as some event data. Due to the amount of data stored in Alation, this internal database can grow quite large, which can result in performance issues for backups or migrations. To free up space on an Alation customer-managed server, Alation versions 2023.3.5 and later offer server administrators the ability to extract the database from the server and store it on a separate Amazon Web Service (AWS) Relational Database Service (RDS) instance.

We recommend that the RDS instance be located in the same VPC and region as your Alation instance.

Alation has two PostgreSQL databases that can be moved in this way. Many of the steps below must be done for both databases:

  • rosemeta, the main database

  • the lineage database

To move the internal Postgres databases to RDS, the databases are first copied to the RDS instance. Alation and its internal Postgres databases continue to function as normal during this process. We use a logical replication process that continues to stream database changes to the new copy until all data has been moved. Once a database copy on RDS is ready, you point Alation to that database, and the internal database becomes unused.

We recommend taking a backup before starting this procedure. As a precaution, keep the backup after completing this procedure.

This procedure was designed to cause as little disruption as possible. Most of the procedure can be done with no downtime, and users can continue using Alation without trouble. When a step does require downtime, it will be noted.

The overall process goes like this:

Step 1: Delete Old Data

Step 2: Find and Remove Corrupted Data

Step 3: Gather Information for the RDS Instance

Step 4: Create the RDS Instance

Step 5: Migrate the Data to the RDS Instance

Step 6: Switch Alation to Use the RDS Instance

You can also Troubleshoot the Migration and Maintain Alation with the Database on an AWS RDS Instance.

Each step is described in detail below.

Step 1: Delete Old Data

Note

This step can be done without any downtime. Alation users can continue using Alation as usual.

This step is optional but highly recommended. To make the migration to RDS faster and more reliable, we recommend deleting old data from certain large tables. We have provided a script to delete specific data from specific tables. The following table shows which database tables are affected and how far back data is deleted.

Table

Delete Data Older Than

  • rosemeta_executionevent

  • rosemeta_executioneventexpressionmention

  • rosemeta_executioneventmention

  • rosemeta_executioneventunresolvedmention

  • rosemeta_executionsession

6 months

  • rosemeta_metadatachangestats

5 months

  • metrics_event

  • metrics_metric

30 days

To delete old data:

  1. Use SSH to connect to the Alation server.

  2. Enter the Alation shell using the following command:

    sudo /etc/init.d/alation shell
    
  1. Switch to the alation user:

    sudo su alation
    
  2. Run the pre-validation SQL queries and save the results in a separate file:

    psql -d rosemeta -U alation -f /opt/alation/ops/postgres_rds_migration/pruning_scripts/pre-validation.sql > /tmp/pre-validation.out
    
  3. Run the scan_prune.pyc script to delete the old data:

    python /opt/alation/django/rosemeta/one_off_scripts/scan_prune.pyc -p purge -i metadata metrics qli > /tmp/scan_prune.log 2>&1
    
  4. Run the post-validation SQL queries and save the results in a separate file:

    psql -d rosemeta -U alation -f /opt/alation/ops/postgres_rds_migration/pruning_scripts/post-dbvalidation.sql > /tmp/post-dbvalidation.out
    
  5. Search the log file at /tmp/scan_prune.log for the following message:

    --- Finished execution of scan problematic tables ---
    
    • If you find this message, it indicates that everything is good, and you can skip to Step 2: Find and Remove Corrupted Data.

    • If you don’t find the finished message, do the following:

      1. Search scan_prune.log for the following message:

        The difference between one of pointer table is greater than tolerated difference of {} , Please create support ticket".format(TOLERATED_PROCESSED))
        
        • If you find this message in the log, something has gone wrong. Stop here, and contact Alation Support. Send the pre-validation.out, scan_prune.log, and post-dbvalidation.out files.

        • If you don’t find this message, do the next step.

      2. Search scan_prune.log for the following message:

        At this point, The QLI archival did not clear any data. Please review
        

        If you find that message in the log, check for the following messages:

        INFO - There's close to 0 records to process for rosemeta_executionevent.
        INFO - There's close to 0 records to process for rosemeta_executioneventmention.
        INFO - There's close to 0 records to process for rosemeta_executioneventexpressionmention.
        
        • If there are close to 0 records to process, you can proceed to Step 2: Find and Remove Corrupted Data.

        • If not, stop here, and contact Alation Support. Send the pre-validation.out, scan_prune.log, and post-dbvalidation.out files.

Step 2: Find and Remove Corrupted Data

Note

This step can be done without any downtime. Alation users can continue using Alation as usual.

To ensure the migration works successfully, you have to make sure there’s no corrupted data in Alation’s PostgreSQL databases. If there is corrupted data, it has to be removed.

To find and remove corrupted data:

  1. If you’re not on the Alation server already:

    1. Use SSH to connect to the Alation server.

    2. Enter the Alation shell using the following command:

      sudo /etc/init.d/alation shell
      
  2. Switch to the alation user:

    sudo su alation
    
  3. Run the script db_scan_parallel.pyc against the rosemeta database:

    python /opt/alation/ops/postgres_rds_migration/scanning_script/db_scan_parallel.pyc -b /bin/ -h /tmp -p 5432 -U alation -d rosemeta
    

    Note

    This may take some time. For example, on a t3.2xlarge instance with a 364 GB PostgreSQL database, it may take about 30 minutes.

  4. Examine the output from db_scan_parallel.pyc.

    • If the scan detected no corruption, you’ll see this at the end of the output:

      Total tables successfully dumped: <#>, failed_tables: []
      
    • If you see a message indicating that some tables failed, it means some rows in those tables are corrupted and need to be removed. Make a list of each failed table.

      Example failure message:

      (ERROR) PID=31625: b'pg_dump: error: Dumping the contents of table "data_storage_blobaccesspostgres" failed: PQgetResult() failed.\npg_dump: error: Error message from server: ERROR: missing chunk number 0 for toast value 3523697 in pg_toast_3131713\npg_dump: error: The command was: COPY public.data_storage_blobaccesspostgres (id, model_code, blob_key, blob_value) TO stdout;'
      

      From the message above, you would take note that the table data_storage_blobaccesspostgres failed.

  5. Run the script db_scan_parallel.pyc against the lineage database:

    python /opt/alation/ops/postgres_rds_migration/scanning_script/db_scan_parallel.pyc -b /bin/ -h /tmp -p 5432 -U alation -d lineage
    
  6. Examine the output from db_scan_parallel.pyc.

    • If the scan detected no corruption, you’ll see this at the end of the output:

      Total tables successfully dumped: <#>, failed_tables: []
      
    • If you see a message indicating that some tables failed, it means some rows in those tables are corrupted and need to be removed. Make a list of each failed table.

      Example failure message:

      (ERROR) PID=31625: b'pg_dump: error: Dumping the contents of table "data_storage_blobaccesspostgres" failed: PQgetResult() failed.\npg_dump: error: Error message from server: ERROR: missing chunk number 0 for toast value 3523697 in pg_toast_3131713\npg_dump: error: The command was: COPY public.data_storage_blobaccesspostgres (id, model_code, blob_key, blob_value) TO stdout;'
      
  7. Exit from the alation user:

    exit
    
  8. Did the scans find corruption?

  9. If the scan found corruption in one or more tables, you need to identify exactly which rows were corrupted. To do this, we have provided a find_bad_rows function with your Alation installation. To use the function, first switch to the postgres user:

    sudo su postgres
    
  10. Add the find_bad_rows function to Postgres:

    psql -h /tmp -d rosemeta < /opt/alation/ops/postgres_rds_migration/corruption_check/find_bad_rows.sql
    
  11. Execute find_bad_rows as shown below, replacing <FAILED_TABLE_NAME> with the name of a failed table:

    psql -h /tmp -d rosemeta -c "select find_bad_rows('public.<FAILED_TABLE_NAME>');"
    

    The function will output the ctid for each corrupted row. Example output:

    NOTICE: Corrupted ctid: (56,70)
    NOTICE: XX001: missing chunk number 0 for toast value 3523697 in pg_toast_3131713
    NOTICE: Corrupted ctid: (57,109)
    NOTICE: XX001: missing chunk number 0 for toast value 3523702 in pg_toast_3131713
    

    Note

    This step may take a long time, depending on the size of the tables.

  12. Double check each row to make sure it’s corrupted. Run the following command for each corrupted row:

    psql -h /tmp -d rosemeta -c "SELECT \* FROM public.<FAILED_TABLE_NAME> where ctid = '(<ROW_CTID>)'"
    

    Example output:

    missing chunk number 0 for toast value 3523697 in pg_toast_3131713
    
  13. Delete the corrupted rows. Run the following command for each corrupted row:

    psql -h /tmp -d rosemeta -c "DELETE FROM public.<FAILED_TABLE_NAME> where ctid = '(<ROW_CTID>)'"
    

    Example output:

    DELETE 1
    
  14. Repeat the prior two steps for each corrupted row in the table.

  15. Repeat the prior three steps for each failed table.

  16. Exit from the postgres user:

    exit
    

Step 3: Gather Information for the RDS Instance

Note

This step can be done without any downtime. Alation users can continue using Alation as usual.

In this step, you’ll make sure you have all the information needed to create the RDS instance.

  1. Ensure that you have AWS keys to create the RDS instance using AWS APIs.

  2. Ensure that the machine you’re using has access to the AWS VPC where you are creating the RDS instance. We recommend that the RDS instance be located in the same VPC and region as your Alation instance.

  3. Determine how much storage your RDS instance will require.

    1. If you’re not on the Alation server already:

      1. Use SSH to connect to the Alation server.

      2. Enter the Alation shell using the following command:

        sudo /etc/init.d/alation shell
        
    2. Enter the PostgreSQL shell:

      alation_psql
      
    3. Display the size of the lineage and rosemeta databases

      \l+
      

      The output will look something like this:

                                                                                    List of databases
         Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   |   Size    | Tablespace |                Description
      -----------+----------+----------+-------------+-------------+-----------------------+-----------+------------+--------------------------------------------
       lineage   | alation  | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                       | 5774 MB   | pg_default |
       postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                       | 8221 kB   | pg_default | default administrative connection database
       rdsadmin  | rdsadmin | UTF8     | en_US.UTF-8 | en_US.UTF-8 | rdsadmin=CTc/rdsadmin+| No Access | pg_default |
                 |          |          |             |             | rdstopmgr=Tc/rdsadmin |           |            |
       rosemeta  | alation  | UTF8     | en_US.UTF-8 | en_US.UTF-8 |                       | 177 GB    | pg_default |
       template0 | rdsadmin | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/rdsadmin          +| 8041 kB   | pg_default | unmodifiable empty database
                 |          |          |             |             | rdsadmin=CTc/rdsadmin |           |            |
       template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +| 8213 kB   | pg_default | default template for new databases
                 |          |          |             |             | postgres=CTc/postgres |           |            |
      (6 rows)
      
    4. Look in the Size column, and add the sizes for the lineage and rosemeta databases together. Then double that number.

      Example: In the example output above, lineage is 5774 MB and rosemeta is 177 GB, so the total is about 183 GB. Double that number is 366 GB.

    5. If your final number is less than 20 GB, plan to use the minimum of 20 GB. The maximum is 16384 GB.

    6. Exit the PostgreSQL shell:

      \q
      
  4. Determine the RDS instance class that’s required to hold your PostgreSQL database. The table below shows some recommendations based on the size of your database, concurrent users, and Alation instance size.

    PostgreSQL Object Count

    Concurrent User Load

    Alation Host Size on AWS

    RDS Instance Type Required

    Up to 5 million

    150 users

    m5a.4xl

    db.m6g.large

    Up to 20 million

    150 users

    m5a.8xl

    db.m5.2xlarge

    Over 20 million

    150 to 800 users

    m5a.12xl

    db.m5.4xlarge

  5. Figure out how many CPUs you have on your Alation server. This will determine how many migration workers you can use. In the Alation shell, run this command:

    lscpu | grep -E '^CPU\('
    

    Example output:

    CPU(s):              8
    

    Save this number for use in a later step.

  6. Generate two passwords:

    1. One for the RDS administration, referred to as ALATION_RDS_ADMIN_PASSWORD in these instructions.

    2. One for the PostgreSQL user that Alation will use to connect to the migrated PostgreSQL database. This password is referred to as ALATION_PASSWORD in these instructions.

    You can generate these passwords however you like, or you can use this command on Linux to generate two random 16-digit passwords:

    cat /dev/urandom | tr -dc A-Za-z | head -c 16 ; echo ''
    

    Save these two passwords in a secure location. You will need them in later steps.

Step 4: Create the RDS Instance

Note

This step can be done without any downtime. Alation users can continue using Alation as usual.

You’ll now create the AWS RDS instance that you’ll move the Alation database to. We have provided some Terraform scripts in the Alation installation that will automatically create the RDS instance.

To create the RDS instance from the Alation server:

  1. If you’re not on the Alation server already:

    1. Use SSH to connect to the Alation server.

    2. Enter the Alation shell using the following command:

      sudo /etc/init.d/alation shell
      
  2. Install Terraform if it’s not already installed. You can find installers on Terraform’s installation page. As an example, on an AMD64 Linux machine, you could install Terraform like this:

    sudo curl -O https://releases.hashicorp.com/terraform/1.0.6/terraform_1.0.6_linux_amd64.zip
    
    sudo unzip terraform_1.0.6_linux_amd64.zip
    
    sudo mv terraform /usr/bin/
    
  3. Switch to the alation user:

    sudo su alation
    
  4. Using terminal, set some environment variables that the Terraform scripts rely on. Replace the angle brackets and their contents with your own values:

    export AWS_ACCESS_KEY_ID=<ADMIN_API_KEY>
    export AWS_SECRET_ACCESS_KEY=<ADMIN_SECRET_ACCESS_KEY>
    export TF_VAR_alation_region=<ALATION_AWS_REGION>
    export TF_VAR_on_prem_alation_ec2_id=<ALATION_EC2_INSTANCE_ID such as i-0f76425cf32057e81>
    export TF_VAR_rds_admin_password=<ALATION_RDS_ADMIN_PASSWORD>
    export TF_VAR_allocated_storage=<STORAGE_SIZE_IN_GB>
    export TF_VAR_multi_az=true
    export TF_VAR_rds_instance_class=<RDS_INSTANCE_CLASS>
    export TF_VAR_storage_type=gp3
    
  5. Go into the folder containing the Terraform scripts:

    cd /opt/alation/ops/postgres_rds_migration/rds_setup_terraform/
    
  6. Initialize Terraform:

    terraform init
    
  7. Validate that the Terraform configuration is valid:

    terraform validate
    

    If you notice any errors in the output, contact Alation Support.

  8. Verify that the Terraform plan is accurate:

    terraform plan
    

    If you notice any errors in the output, contact Alation Support.

  9. Run the Terraform script:

    terraform apply --auto-approve
    

    This will automatically create an RDS instance and display its RDS endpoint, which you will need in later steps. This step may take several minutes. Near the end of the output you will see the RDS endpoint. Example output:

    data.dns_a_record_set.rds_endpoint: Read complete after 0s [id=onprem-alation-i-02ad22d793b9d6dfa-rds.calwlq0balzb.us-west-2.rds.amazonaws.com]
    

    The value after id= is the RDS endpoint. In this example, it is onprem-alation-i-02ad22d793b9d6dfa-rds.calwlq0balzb.us-west-2.rds.amazonaws.com.

Now that the RDS instance exists, you can migrate the Alation database to it. Proceed to Step 5: Migrate the Data to the RDS Instance below.

Step 5: Migrate the Data to the RDS Instance

Important

This step requires about 10 minutes of downtime.

If this process fails for any reason, you can reset the RDS instance using the steps under Reset the RDS Instance. Then try this section again.

In the following steps, you will copy the rosemeta and lineage databases to your RDS instance.

  1. If you’re not in the Alation shell already:

    1. Use SSH to connect to the Alation server.

    2. Enter the Alation shell using the following command:

      sudo /etc/init.d/alation shell
      
  2. Switch to the alation user:

    sudo su alation
    
  3. Use a text editor to replace the contents of setup_internal_postgres.pyc with the following:

    # /opt/alation/env/bin/python
    
    import socket
    import sys
    import pty
    import subprocess
    
    EXPECTED_POSTGRES_VERSIONS = {"13.1", "13.6", "13.8", "13.11"}
    PGLOGICAL_PACKAGE = "pglogical"
    
    def get_ip(host):
       return socket.gethostbyname(host)
    
    def execute_command(bash_command, shell_or_die=False):
       master_fd, slave_fd = pty.openpty()
       sp = subprocess.Popen(
          bash_command, stdin=slave_fd, stderr=subprocess.PIPE, stdout=subprocess.PIPE, shell=True
       )
       out, err = sp.communicate()
       if shell_or_die and sp.returncode != 0:
          raise Exception("Failed: %s", out.decode(), err.decode())
       return sp.returncode, out.decode(), err.decode()
    
    def version_check():
       cmd = 'sudo -u postgres psql -qtAX -h /tmp -d rosemeta -c "SHOW server_version;"'
       code, stdout, stderr = execute_command(cmd)
       if code != 0:
          raise Exception("Cannot get postgres version")
       elif stdout.strip() not in EXPECTED_POSTGRES_VERSIONS:
          raise Exception(
                "Its version %s is not equal to %s" % (stdout.strip(), str(EXPECTED_POSTGRES_VERSIONS))
          )
       print("## Its version is %s" % str(EXPECTED_POSTGRES_VERSIONS))
    
    def setup_postgresql_auto_conf(num_of_wall_senders):
       cmd = "sudo -u postgres cat /data1/pgsql/13/data/postgresql.auto.conf"
       code, stdout, stderr = execute_command(cmd)
       if code != 0:
          raise Exception("Cannot cat /data1/pgsql/13/data/postgresql.auto.conf")
       if "listen_addresses = '*'" not in stdout:
          cmd = "sudo -u postgres -- sh -c $'echo \"listen_addresses = \\'*\\'\" >> /data1/pgsql/13/data/postgresql.auto.conf'"
          execute_command(cmd, shell_or_die=True)
       if "wal_level = logical" not in stdout:
          cmd = "sudo -u postgres -- sh -c $'echo \"wal_level = logical\" >> /data1/pgsql/13/data/postgresql.auto.conf'"
          execute_command(cmd, shell_or_die=True)
       if "max_replication_slots = %s" % num_of_wall_senders not in stdout:
          cmd = (
                "sudo -u postgres -- sh -c $'echo \"%s\" >> /data1/pgsql/13/data/postgresql.auto.conf'"
                % ("max_replication_slots = %s" % num_of_wall_senders)
          )
          execute_command(cmd, shell_or_die=True)
       if "max_wal_senders = %s" % num_of_wall_senders not in stdout:
          cmd = (
                "sudo -u postgres -- sh -c $'echo \"%s\" >> /data1/pgsql/13/data/postgresql.auto.conf'"
                % ("max_wal_senders = %s" % num_of_wall_senders)
          )
          execute_command(cmd, shell_or_die=True)
       if "shared_preload_libraries = 'pglogical'" not in stdout:
          cmd = "sudo -u postgres -- sh -c $'echo \"shared_preload_libraries = \\'pglogical\\'\" >> /data1/pgsql/13/data/postgresql.auto.conf'"
          execute_command(cmd, shell_or_die=True)
       print("## Completed postgresql.auto.conf update")
    
    def setup_pg_hba_conf(rds_private_ip):
       cmd = "sudo -u postgres cat /data1/pgsql/13/data/pg_hba.conf"
       code, stdout, stderr = execute_command(cmd)
       if code != 0:
          raise Exception("Cannot cat /data1/pgsql/13/data/pg_hba.conf")
       if rds_private_ip not in stdout:
          cmd = (
                "sudo -u postgres -- sh -c 'echo \"host     all     all      %s/32           trust\" >> /data1/pgsql/13/data/pg_hba.conf'"
                % rds_private_ip
          )
          execute_command(cmd, shell_or_die=True)
       if "host     all     all      127.0.0.1/0           trust" not in stdout:
          cmd = "sudo -u postgres -- sh -c $'echo \"host     all     all      127.0.0.1/0           trust\" >> /data1/pgsql/13/data/pg_hba.conf'"
          execute_command(cmd, shell_or_die=True)
       if "host    all   alation      ::1/128      trust" not in stdout:
          cmd = "sudo -u postgres -- sh -c $'echo \"host    all   alation      ::1/128      trust\" >> /data1/pgsql/13/data/pg_hba.conf'"
          execute_command(cmd, shell_or_die=True)
       print("## Completed pg_hba.conf update")
    
    def restart_postgres():
       cmd = "alation_action stop_postgres"
       execute_command(cmd, shell_or_die=True)
       cmd = "alation_action start_postgres"
       execute_command(cmd, shell_or_die=True)
       print("## Restart completed")
    
    def main(argv):
       if len(argv) < 2:
          raise Exception("python setup_internal_postgres.py <rds_endpoint> <num_of_wall_senders>")
       rds_endpoint, num_of_wall_senders = argv[0], argv[1]
       rds_private_ip = get_ip(rds_endpoint)
       print(rds_private_ip)
       version_check()
       setup_postgresql_auto_conf(num_of_wall_senders)
       setup_pg_hba_conf(rds_private_ip)
       restart_postgres()
    
    if __name__ == "__main__":
       main(sys.argv[1:])
    

Important

The next step requires Alation to be down for about ten minutes.

  1. Still as the alation user, run the setup_internal_postgres.pyc script and provide the RDS endpoint and number of CPUs you identified earlier:

    python /opt/alation/ops/postgres_rds_migration/rds_migration_scripts/setup_internal_postgres.pyc <RDS_ENDPOINT> <#_CPUs>
    

    Example output:

    ## Its version is {'13.11', '13.6', '13.1', '13.8'}
    ## Completed postgresql.auto.conf update
    ## Completed pg_hba.conf update
    ## Restart completed
    

Note

Items 5–7 apply to the rosemeta database. Items 8–10 apply to lineage. These steps can take a long time. You can follow the items for rosemeta in parallel to the items for lineage.

  1. To migrate the rosemeta database to your new RDS instance, run the migrate_postgres_to_rds.pyc script as shown below. This will sync the data to your RDS instance. We use a logical replication process that continues to stream database changes to the new copy until all data has been moved. This ensures no data will be lost.

    Important

    This process runs asynchronously. It may take a long time. As an example, if you have:

    • An Alation instance with 8 cores and 32 GB of memory

    • An RDS instance with 4 cores and 16 GB of memory

    It may take about four to five hours to migrate a 175 GB rosemeta database.

    To run the script, use the command below. Replace <RDS_ENDPOINT> with the endpoint of your RDS instance, <ALATION_RDS_ADMIN_PASSWORD> with the RDS admin password you created earlier, and <ALATION_PASSWORD> with the Alation password you created earlier.

    python /opt/alation/ops/postgres_rds_migration/rds_migration_scripts/migrate_postgres_to_rds.pyc --rds_port 5432 --rds_endpoint <RDS_ENDPOINT> --rds_admin_password <ALATION_RDS_ADMIN_PASSWORD> --new_alation_password <ALATION_PASSWORD> --remove_indexes true --target_database rosemeta
    

    Successful output will look like this:

    initialized RateSampler, sample 100% of traces
    initialized RateSampler, sample 100% of traces
    Connecting to DogStatsd(udp://localhost:8125)
    {"data": {"module": "alation_conf", "version": null, "requestid": null, "msg": "Reloading conf. current time: 1702581142.2829413"}, "header": {"timestamp": "2023-12-14T19:12:22.282994", "appname": null, "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""}
    {"data": {"module": "alation_conf", "version": null, "requestid": null, "msg": "alation_conf is initialized. current time: 1702581142.2829413"}, "header": {"timestamp": "2023-12-14T19:12:22.467949", "appname": null, "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""}
    Check pg_hba entry for replication...
    Check pg_hba entry for replication...Done
    Check listen_addresses...
    Check listen_addresses...Done
    Check if migration_rosemeta already created
    Drop file_fdw extension
    Start export postgres schema only...
    Create indexes backup file ...Done
    Removed indexes from schema file ...Done
    Start export postgres schema only...Done
    Check if alation exists
    Create new alation user in RDS...
    Prepare alation user in RDS...Done
    Check if alation_db_user exists
    Create new alation_db_user user in RDS...
    Prepare alation user in RDS...Done
    Check if alation_db exists
    Create new alation_db user in RDS...
    Prepare alation user in RDS...Done
    Check if datadog exists
    Create new datadog user in RDS...
    Prepare alation user in RDS...Done
    Check if rosemeta exists
    No rosemeta exists
    Create new alation user...
    Build existing extension set
    Create extensions intarray,hstore,amcheck,pglogical,pg_visibility if not exists ...
    Create extensions intarray,hstore,amcheck,pglogical,pg_visibility ...Done
    Import /tmp/rosemeta_schema_only.dump ...
    Import /tmp/rosemeta_schema_only.dump ...Done
    List target tables own by public and alation
    Check if publication migration_rosemeta is created
    migration_rosemeta is created
    Validate publication table by table
    All tables are included in public
    Check if subscription migration_rosemeta already created
    ## create subscription migration_rosemeta connection 'host=<RDS IP> port=5432 dbname=rosemeta user=alation' publication migration_rosemeta
    Stopping AgentWriter thread
    
  2. Check on the progress of migrating the rosemeta database by running get_subscription_status.pyc. Replace <RDS_ENDPOINT> with the endpoint of your RDS instance, <ALATION_PASSWORD> with the Alation password you created earlier.

    python /opt/alation/ops/postgres_rds_migration/rds_migration_scripts/get_subscription_status.pyc --rds_endpoint <RDS_ENDPOINT> --rds_port 5432 --new_alation_password <ALATION_PASSWORD> --target_database rosemeta
    
    • While the process is still running, you’ll see the following in the output:

      ## res: NEED_TO_WAIT
      
    • When the process is finished, you’ll see the following in the output:

      ## res: STOP_WAITING
      
  3. When get_subscription_status.pyc shows ## res: STOP_WAITING, it automatically starts another script, create_indexes.py. This script creates indexes for all the tables in the rosemeta database. It runs in the background. This script must finish before you can move on to Step 6: Switch Alation to Use the RDS Instance below. You can check the status of this script by tailing the log file:

    tail -f /tmp/create_indexes.log
    

    When the process is done, you’ll see this in the log:

    Created 2090/2091 indexes.
    Created 2091/2091 indexes.
    Creation of indexes /tmp/rosemeta_postdata_only.dump ...Done
    Removed indexes dump file /tmp/rosemeta_postdata_only.dump since indexes have been applied.
    

    Note

    Items 8–10 below apply to the lineage database. You can follow these steps in parallel to items 5–7.

  4. To migrate the lineage database to your new RDS instance, run the migrate_postgres_to_rds.pyc script as shown below. Replace <RDS_ENDPOINT> with the endpoint of your RDS instance, <ALATION_RDS_ADMIN_PASSWORD> with the RDS admin password you created earlier, and <ALATION_PASSWORD> with the Alation password you created earlier.

    python /opt/alation/ops/postgres_rds_migration/rds_migration_scripts/migrate_postgres_to_rds.pyc --rds_port 5432 --rds_endpoint <RDS_ENDPOINT> --rds_admin_password <ALATION_RDS_ADMIN_PASSWORD> --new_alation_password <ALATION_PASSWORD> --remove_indexes true --target_database lineage
    

    This process runs asynchronously. Successful output will look like this:

    initialized RateSampler, sample 100% of traces
    initialized RateSampler, sample 100% of traces
    Connecting to DogStatsd(udp://localhost:8125)
    {"data": {"module": "alation_conf", "version": null, "requestid": null, "msg": "Reloading conf. current time: 1702581232.9394724"}, "header": {"timestamp": "2023-12-14T19:13:52.939548", "appname": null, "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""}
    {"data": {"module": "alation_conf", "version": null, "requestid": null, "msg": "alation_conf is initialized. current time: 1702581232.9394724"}, "header": {"timestamp": "2023-12-14T19:13:53.135925", "appname": null, "instanceurl": null, "tenantid": null, "loglevel": "DEBUG", "traceid": "0"}, "dd.trace_id": "0", "dd.span_id": "0", "dd.env": ""}
    Check pg_hba entry for replication...
    Check pg_hba entry for replication...Done
    Check listen_addresses...
    Check listen_addresses...Done
    Check if migration_lineage already created
    Drop file_fdw extension
    Start export postgres schema only...
    Create indexes backup file ...Done
    Removed indexes from schema file ...Done
    Start export postgres schema only...Done
    Check if alation exists
    Reset alation user password in RDS...
    Prepare alation user in RDS...Done
    Check if alation_db_user exists
    Reset alation_db_user user password in RDS...
    Prepare alation user in RDS...Done
    Check if alation_db exists
    Reset alation_db user password in RDS...
    Prepare alation user in RDS...Done
    Check if datadog exists
    Reset datadog user password in RDS...
    Prepare alation user in RDS...Done
    Check if lineage exists
    No lineage exists
    Create new alation user...
    Build existing extension set
    Create extensions pg_visibility,intarray,hstore,pglogical,amcheck if not exists ...
    Create extensions pg_visibility,intarray,hstore,pglogical,amcheck ...Done
    Import /tmp/lineage_schema_only.dump ...
    Import /tmp/lineage_schema_only.dump ...Done
    List target tables own by public and alation
    Check if publication migration_lineage is created
    migration_lineage is created
    Validate publication table by table
    All tables are included in public
    Check if subscription migration_lineage already created
    ## create subscription migration_lineage connection 'host=<RDS IP> port=5432 dbname=lineage user=alation' publication migration_lineage
    Stopping AgentWriter thread
    
  5. Check on the progress of migrating the lineage database by running get_subscription_status.pyc. Replace <RDS_ENDPOINT> with the endpoint of your RDS instance, <ALATION_PASSWORD> with the Alation password you created earlier.

    python /opt/alation/ops/postgres_rds_migration/rds_migration_scripts/get_subscription_status.pyc --rds_endpoint <RDS_ENDPOINT> --rds_port 5432 --new_alation_password <ALATION_PASSWORD> --target_database lineage
    

    While the script is still running, you’ll see the following in the output:

    ## res: NEED_TO_WAIT
    

    When the script is finished, you’ll see the following in the output:

    ## res: STOP_WAITING
    
  6. When get_subscription_status.pyc shows ## res: STOP_WAITING, it automatically starts another script, create_indexes.py. This script creates indexes for all the tables in the lineage database. It runs in the background. This script must finish before you can move on to Step 6: Switch Alation to Use the RDS Instance below. You can check the status of this script by tailing the log file:

    tail -f /tmp/create_indexes.log
    

    When the process is done, you’ll see this in the log:

    Created 2090/2091 indexes.
    Created 2091/2091 indexes.
    Creation of indexes /tmp/rosemeta_postdata_only.dump ...Done
    Removed indexes dump file /tmp/rosemeta_postdata_only.dump since indexes have been applied.
    
  7. Wait until the create_indexes.py script has finished indexing the tables for both rosemeta (item 8 above) and lineage (item 10 above). Then move on to the next step.

Step 6: Switch Alation to Use the RDS Instance

Important

This step requires about an hour of downtime.

If this process fails for any reason, you can reset the RDS instance using the steps under Reset the RDS Instance. Then try the migration again.

Now that all the data has been migrated to the RDS instance, you need to tell Alation to use the RDS instance.

Important

The first three steps require about one hour of downtime.

  1. When the create_indexes.py script has finished indexing the tables for both rosemeta and lineage, you are ready to direct Alation to start using the copies that are now on the RDS instance.

    Still as the alation user on the Alation server, stop Alation from running:

    alation_supervisor stop celery:* ingestion customer-portal-agent java:* web:* lineage logical-metadata:*
    

Important

The next step requires downtime.

  1. Run the post_postgres_migration.pyc script to tell Alation to start using the rosemeta database on the RDS instance. Replace <RDS_ENDPOINT> with the endpoint of your RDS instance, <ALATION_RDS_ADMIN_PASSWORD> with the RDS admin password you created earlier, and <ALATION_PASSWORD> with the Alation password you created earlier.

    python /opt/alation/ops/postgres_rds_migration/rds_migration_scripts/post_postgres_migration.pyc --rds_port 5432 --rds_endpoint <RDS_ENDPOINT> --rds_admin_password <ALATION_RDS_ADMIN_PASSWORD> --new_alation_password <ALATION_PASSWORD> --num_of_threads <# of cores> --target_database rosemeta
    

    The output will be very lengthy. If it’s successful, you’ll see this at the end of the output:

    Switching to RDS <YOUR RDS ENDPOINT>...Done
    

Important

The next step requires downtime.

  1. Run the post_postgres_migration.pyc script to tell Alation to start using the lineage database on the RDS instance. Replace <RDS_ENDPOINT> with the endpoint of your RDS instance, <ALATION_RDS_ADMIN_PASSWORD> with the RDS admin password you created earlier, and <ALATION_PASSWORD> with the Alation password you created earlier.

    python /tmp/postgres_rds_migration_scripts/rds_migration_scripts/post_postgres_migration.pyc --rds_port 5432 --rds_endpoint <RDS_ENDPOINT> --rds_admin_password <ALATION_RDS_ADMIN_PASSWORD> --new_alation_password <ALATION_PASSWORD> --num_of_threads <# of cores> --target_database lineage
    

    If it’s successful, you’ll see this at the end of the output:

    Switching to RDS <YOUR RDS ENDPOINT>...Done
    
  2. Start Alation back up:

    alation_supervisor start celery:* ingestion customer-portal-agent java:* web:* lineage logical-metadata:*
    

    Alation should now be using the external RDS instance.

  3. Stop the internal Postgres:

    sudo -u postgres /usr/pgsql-13/bin/pg_ctl -D /data1/pgsql/13/data stop -w
    
  4. Confirm that Alation is using the RDS instance:

    alation_conf pgsql.config.host
    

    The parameters lineage-service.pgsql.config.host and pgsql.config.host should show your RDS endpoint.

The process is now complete. You should be able to log into Alation and use it as usual.

After you’ve verified that the migration was successful and Alation is using the RDS instance without problems, you can now drop the data from the internal rosemeta and lineage databases. As a precaution, we recommend keeping the latest backups from before the migration.

Troubleshoot the Migration

If you encounter problems with migrating the Alation databases to an AWS RDS instance, please contact Alation Support.

Reset the RDS Instance

If you run into trouble during Step 5: Migrate the Data to the RDS Instance, you can reset the RDS instance and try again. To reset the RDS instance:

  1. If you’re not on the Alation server already:

    1. Use SSH to connect to the Alation server.

    2. Enter the Alation shell using the following command:

      sudo /etc/init.d/alation shell
      
  2. Connect to the rosemeta database on the RDS instance:

    export PGPASSWORD=<ALATION_PASSWORD>; psql -h <RDS endpoint> -U alation -d rosemeta
    

    Then drop the subscription migration:

    rosemeta=> drop subscription migration;
    
  3. Connect to the lineage database on the RDS instance:

    export PGPASSWORD=<ALATION_PASSWORD>; psql -h <RDS endpoint> -U alation -d lineage
    

    Then drop the subscription migration:

    lineage=> drop subscription migration;
    
  4. Connect to the database on the RDS instance as the postgres user:

    export PGPASSWORD=<ALATION_RDS_ADMIN_PASSWORD>; psql -h <RDS_ENDPOINT> -U postgres
    

    Change the owner of the rosemeta database to the postgres user, then drop the database:

    postgres=> alter database rosemeta owner to postgres;
    
    postgres=> drop database rosemeta;
    

    Change the owner of the lineage database to the postgres user, then drop the database:

    postgres=> alter database lineage owner to postgres;
    
    postgres=> drop database lineage;
    
  5. If rosemeta or lineage is still being used, stop Alation and terminate the sessions:

    alation_action stop_alation
    
    postgres=> select pg_terminate_backend(pid) from pg_stat_activity where datname='rosemeta';
    
    postgres=> select pg_terminate_backend(pid) from pg_stat_activity where datname='lineage';
    
  6. Roll back the changes to alation_conf:

    alation_conf pgsql.config.remote -s False
    
    alation_conf pgsql.config.host -s /tmp
    
    alation_conf pgsql.config.port -s 5432
    
    alation_conf pgsql.config.password -c
    
    alation_conf lineage-service.pgsql.config.host -s /tmp
    
    alation_conf lineage-service.pgsql.password -c
    

You should now be ready to retry Step 5: Migrate the Data to the RDS Instance.

Maintain Alation with the Database on an AWS RDS Instance

Moving the database to an AWS RDS instance will not change how you upgrade, back up, or restore the Alation applications. The commands for upgrading, backing up, and restoring the Alation applications remain the same. However, there are some differences in the backup and restore behavior.

Upgrade

There are no changes to how you upgrade Alation with the database(s) on an AWS RDS instance.

Back Up

After moving the database to AWS RDS, it will no longer be included directly in the Alation backup tar file. Instead, the Alation backup process creates an RDS snapshot in AWS using the existing RDS functionality. The snapshot is mapped to the Alation backup so it can be restored later along with Alation, if needed.

You can also use the Amazon RDS backup functionality separately from Alation backups.

Restore

When restoring Alation from a backup, the destructive_restore_all action will automatically retrieve the AWS RDS snapshot that is mapped to the backup. The old AWS RDS instance will be stopped, and the new one will become available. In the AWS console, you will see both the old and newly restored RDS instances. The old one can be deleted if you want to save costs.