columnar_migrator

columnar_migrator — a script designed to migrate data from a PostgreSQL database with the Citus Columnar extension to a PostgreSQL database with the Hydra Columnar extension, ensuring data integrity and compatibility.

Documentation for PostgreSQL Data Migration Tool

Overview

This script facilitates the migration of data from a PostgreSQL 15 database running the Citus Columnar extension to a PostgreSQL 16 database equipped with the Hydra Columnar extension. It ensures compatibility and integrity of data during the migration process, addressing changes in metadata structures between the two columnar extensions.

Purpose

The main objectives of this script are:

  • To adapt columnar table structures and their associated metadata from Citus to Hydra format.

  • To efficiently transfer large datasets using batch and parallel processing to minimize operational downtime.

  • To maintain data integrity and consistency throughout the migration process.

Description

Written in Python, the script utilizes asyncpg for asynchronous database operations, enhancing performance during large-scale data transfers. It incorporates detailed logging for transparency and troubleshooting. The migration approach includes generating appropriate DDL commands for the new database environment and leveraging concurrency for faster data replication.

Key Features

  • Concurrency Control: Uses semaphores to limit the number of concurrent data transfer tasks (see the sketch after this list).

  • Batch Processing: Transfers data in batches to efficiently handle large datasets.

  • Reconnection Handling: Ensures that data transfer continues seamlessly after connection interruptions.

  • Row Count Verification: Compares the row counts between source and target tables after transfer to ensure data integrity.

  • User Prompts for Existing Tables: Checks for existing tables in the target database and prompts the user to delete them if necessary.
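
To illustrate the concurrency control mentioned above, here is a minimal, self-contained sketch of the semaphore pattern. The function and table names (transfer_table, "events", and so on) are hypothetical placeholders, not the script's actual API:

import asyncio

async def transfer_table(sem, table):
    # The semaphore gates entry: at most `concurrency` transfers run at once.
    async with sem:
        print(f"transferring {table} ...")
        await asyncio.sleep(1)  # stand-in for the real batched copy

async def main(tables, concurrency=2):
    sem = asyncio.Semaphore(concurrency)
    # One task per table is scheduled, but the semaphore, not the task
    # count, bounds how many actually run at the same time.
    await asyncio.gather(*(transfer_table(sem, t) for t in tables))

asyncio.run(main(["events", "metrics", "logs"], concurrency=2))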

Usage

Input Flags

The script accepts various command-line arguments to specify the source and target database configurations (an argparse sketch follows the list):

  • --source-db-host: Hostname or IP address of the source database (default: localhost).

  • --source-db-port: Port number for the source database (required).

  • --source-db-name: Name of the source database (default: postgres).

  • --source-db-user: Username for the source database (default: postgres).

  • --source-db-password: Password for the source database user (default: empty).

  • --target-db-host: Hostname or IP address of the target database (default: localhost).

  • --target-db-port: Port number for the target database (required).

  • --target-db-name: Name of the target database (default: postgres).

  • --target-db-user: Username for the target database (default: postgres).

  • --target-db-password: Password for the target database user (default: empty).

  • --batch-size: Number of rows per batch in the data transfer (default: 100000).

  • --concurrency: Number of concurrent tasks for parallel data transfer (default: 2).
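
The flags map naturally onto an argparse parser. The sketch below mirrors the defaults listed above; it is illustrative, and the script's actual parser may differ in details such as help text:

import argparse

parser = argparse.ArgumentParser(
    description="Migrate columnar tables from Citus to Hydra")
parser.add_argument("--source-db-host", default="localhost")
parser.add_argument("--source-db-port", type=int, required=True)
parser.add_argument("--source-db-name", default="postgres")
parser.add_argument("--source-db-user", default="postgres")
parser.add_argument("--source-db-password", default="")
parser.add_argument("--target-db-host", default="localhost")
parser.add_argument("--target-db-port", type=int, required=True)
parser.add_argument("--target-db-name", default="postgres")
parser.add_argument("--target-db-user", default="postgres")
parser.add_argument("--target-db-password", default="")
parser.add_argument("--batch-size", type=int, default=100000)
parser.add_argument("--concurrency", type=int, default=2)
args = parser.parse_args()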

Example Command

Run the migration using the following command:

python columnar_migrator.py \
--source-db-host 192.168.1.100 \
--source-db-port 5432 \
--source-db-user admin \
--source-db-password securepass \
--target-db-host 192.168.1.101 \
--target-db-port 5433 \
--target-db-user admin \
--target-db-password securepass \
--batch-size 100000 \
--concurrency 4

Process Flow

  • Initialization: Establish connection pools for the source and target databases. Check for existing tables in the target database and prompt the user for action if any are found.

  • Execute DDL and Copy Data: Create necessary extensions and execute DDL commands to set up the target database schema. Schedule data transfer tasks using a semaphore to limit concurrency. Transfer data in batches, handling reconnection if the connection is lost.

  • Reconnection Handling: If a connection is lost during the transfer, attempt to reconnect and resume the transfer.

  • Row Count Verification: After transferring each table, compare the row counts between the source and target databases. Log the results of the comparison.
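
The verification step can be pictured with the simplified sketch below. The helper name verify_row_counts is hypothetical, and the actual script logs through its logging setup rather than print:

async def verify_row_counts(source_pool, target_pool, table):
    # Count rows on both sides; a mismatch flags an incomplete transfer.
    async with source_pool.acquire() as src, target_pool.acquire() as tgt:
        source_count = await src.fetchval(f'SELECT count(*) FROM "{table}"')
        target_count = await tgt.fetchval(f'SELECT count(*) FROM "{table}"')
    status = "OK" if source_count == target_count else "MISMATCH"
    print(f"{table}: source={source_count} target={target_count} {status}")
    return source_count == target_count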

Requirements

To ensure smooth execution of this migration script, the following requirements must be met:

  • Python 3.7 or newer.

  • The asyncpg and colorlog libraries, which provide asynchronous database operations and colored logging, respectively.

Install the necessary Python libraries using pip:

pip install asyncpg colorlog

Ensure that both source and target PostgreSQL instances are accessible and that the user has the necessary permissions to execute commands and handle data.

Conclusion

This script is essential for database administrators upgrading PostgreSQL installations with new columnar technologies. It ensures an efficient, reliable migration process, minimizing downtime and maintaining data integrity between two actively running servers.

Detailed Code Review and Documentation

Initialization and Configuration:

  • async def create_pools(source_db_config, target_db_config): Establishes connection pools for the source and target databases (a minimal sketch follows this list).

  • async def check_existing_tables(conn, tables): Checks if tables already exist in the target database.

  • async def prompt_user_for_deletion(existing_tables): Prompts the user to confirm deletion of existing tables.

  • async def drop_existing_tables(conn, tables): Drops existing tables from the target database if the user confirms.
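
A plausible shape for create_pools, assuming the config dicts carry asyncpg-compatible keywords (host, port, user, password, database); the actual implementation may also set pool sizes, timeouts, and other options:

import asyncpg

async def create_pools(source_db_config, target_db_config):
    # Each config dict holds connection keywords, so it can be
    # unpacked straight into asyncpg.create_pool.
    source_pool = await asyncpg.create_pool(**source_db_config)
    target_pool = await asyncpg.create_pool(**target_db_config)
    return source_pool, target_pool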

Main Execution Function:

  • async def execute_ddl_and_copy_data(target_db_config, ddl_commands, source_db_config, batch_size, concurrency): Creates connection pools and checks for existing tables. Executes DDL commands to set up the target database schema. Manages the data transfer process using concurrency control and batch processing.
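
The DDL phase of this function might look roughly like the sketch below. apply_ddl is a hypothetical helper name, and the extension name "columnar" is an assumption about the Hydra setup:

async def apply_ddl(target_pool, ddl_commands):
    async with target_pool.acquire() as conn:
        # Assumes the Hydra columnar extension is named "columnar".
        await conn.execute("CREATE EXTENSION IF NOT EXISTS columnar;")
        # Run the generated DDL in one transaction so a failure
        # leaves no half-built schema behind.
        async with conn.transaction():
            for ddl in ddl_commands:
                await conn.execute(ddl)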

Data Transfer and Reconnection Handling:

  • async def transfer_data_copy_to_tables(source_pool, target_pool, obj, batch_size, task_id, total_rows): Transfers data in batches. Handles reconnection if the connection is lost and resumes the transfer process.
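
Stripped of progress tracking (task_id, total_rows) and reconnection handling, the batched copy follows this general pattern. This is a sketch only: LIMIT/OFFSET paging without ORDER BY is stable only while the source table is not being written to, and the real function may page differently:

async def transfer_in_batches(source_pool, target_pool, table, batch_size):
    offset = 0
    while True:
        async with source_pool.acquire() as src:
            rows = await src.fetch(
                f'SELECT * FROM "{table}" LIMIT $1 OFFSET $2',
                batch_size, offset)
        if not rows:
            break  # every batch has been copied
        async with target_pool.acquire() as tgt:
            # copy_records_to_table issues a fast COPY for the whole batch.
            await tgt.copy_records_to_table(table, records=rows)
        offset += len(rows)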

Reconnection Logic:

  • async def reconnect_new(pool): Attempts to reconnect to the database several times with a delay between attempts.
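
The retry logic presumably resembles the following sketch; the attempt count, delay, and exception types shown here are illustrative assumptions rather than the script's exact values:

import asyncio
import asyncpg

async def reconnect_new(pool, attempts=5, delay=3.0):
    # Try to acquire a fresh connection, sleeping between failed attempts.
    for attempt in range(1, attempts + 1):
        try:
            return await pool.acquire()
        except (asyncpg.PostgresError, OSError) as exc:
            print(f"reconnect attempt {attempt}/{attempts} failed: {exc}")
            await asyncio.sleep(delay)
    raise ConnectionError("could not re-establish a database connection")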