
Database Migration From Microsoft Azure to Snowflake on AWS: Part 2

In part one of this blog series, we looked at our approach to the migration: leveraging Azure Data Factory, tuning the performance of the Data Factory pipeline, using Parquet files and increasing compute power.

In this blog, we’ll look at some of the challenges encountered during the migration, the cost-saving actions we took and our approach to data validation.

Challenges Encountered

  • Data Factory export failed for tables whose column names contained spaces or special characters. In such cases, aliases were used for the column names (replacing spaces with underscores) and a query was used as the source.
  • Data Factory export failed for the “geography” column data type; such columns had to be cast to varchar before being exported to Blob storage.
  • Some column names in SQL Server were reserved keywords in Snowflake, which blocked those columns from migrating. For example, LOCALTIME appeared as a column name in SQL Server but is a keyword in Snowflake; this was addressed by enclosing the column name in double quotes (“”) when creating the Snowflake table (see the sketch after this list).
  • During ingestion into Snowflake, the flat files were named after their tables, and Snowflake lists staged files by name pattern. For example, when listing the file for the table test_1, Snowflake would return every file whose name starts with test_1. It is essential to make sure that each file name is unique and does not overlap with another table’s name prefix.
  • If the wrong file was ingested into a table, Snowflake did not return any error message or warning. It simply ingested the file and inserted NULL wherever a column mapping was not found.
  • While creating the tables, if a column name had spaces, the name was quoted so that the spaces were preserved. For ingestion of such columns, however, the column mapping from the flat file had to use the same aliases that were used in the source query.
  • Triggers and computed columns are not supported in Snowflake and had to be eliminated while creating the tables.
  • Data types that do not exist in Snowflake had to be mapped to equivalent types while preserving the correctness of the data.
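To illustrate a few of these cases, below is a minimal sketch of a source query with aliased and cast columns, and of a Snowflake table definition with a quoted keyword column. The table and column names (dbo.Locations, Store Name, GeoPoint, LOCALTIME) are hypothetical placeholders, not objects from the actual migration.

-- SQL Server source query used in the Data Factory copy activity (hypothetical table):
-- spaces in column names are replaced with underscores via aliases,
-- and the geography column is cast to varchar (STAsText() is an alternative).
SELECT
    [Store Name] AS Store_Name,
    CAST([GeoPoint] AS varchar(max)) AS GeoPoint,
    [LOCALTIME]
FROM dbo.Locations;

-- Snowflake target table: the keyword column name is escaped with double quotes.
CREATE TABLE LOCATIONS (
    "Store Name" VARCHAR,
    GEOPOINT     VARCHAR,
    "LOCALTIME"  TIMESTAMP_NTZ
);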

Cost-Saving Actions

  • Use of Parquet files facilitated a compression rate of up to 75%: the 800GB of data was compressed to 234GB across multiple files, which reduced Blob storage costs.
  • A detailed POC and analysis showed that Snowflake ingestion was optimal with a Small warehouse for moderately sized tables and a Medium warehouse for large tables, which kept Snowflake costs in check (see the sizing sketch after this list).
  • The self-hosted integration runtime (IR) saved on data pipeline operating costs, as the compute power of the Azure VM was used.
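As a rough illustration of that sizing approach, the warehouse size can be set per workload in Snowflake and scaled up only for the large tables. The warehouse name LOAD_WH and the settings below are hypothetical placeholders; the exact configuration used in the migration may have differed.

-- Create a small warehouse for moderately sized tables (hypothetical name and settings).
CREATE WAREHOUSE IF NOT EXISTS LOAD_WH
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 60      -- suspend after 60 seconds of inactivity to save credits
  AUTO_RESUME    = TRUE;

-- Scale up to Medium only while loading the large tables, then scale back down.
ALTER WAREHOUSE LOAD_WH SET WAREHOUSE_SIZE = 'MEDIUM';
-- ... run COPY INTO for the large tables ...
ALTER WAREHOUSE LOAD_WH SET WAREHOUSE_SIZE = 'SMALL';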

Cost Estimation

Data Validation

A vital part of the project was ensuring that the integrity and correctness of the data were maintained during the migration process. As part of standard DataOps best practices, a thorough data validation exercise was performed to make sure that the data stayed meaningful and remained useful for driving business decisions.

The approach to data validation was two-fold:

In the first approach, all the distinct data types in SQL Server that were part of the database migration were listed, and one row per data type was sampled at random. The same row was then fetched from Snowflake and matched.

List of tables based on data type:

SELECT table_catalog, table_schema, table_name, column_name
FROM INFORMATION_SCHEMA.COLUMNS
WHERE data_type = 'money' AND table_name NOT LIKE 'sys%';
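Once the candidate tables were listed, a single row per data type was spot-checked on both sides. A minimal sketch of such a check follows; the table name (dbo.Orders), key column (OrderID) and money column (TotalDue) are hypothetical placeholders, and the key value is simply carried over from the SQL Server sample.

-- SQL Server: sample one row at random for the data type under test (hypothetical table).
SELECT TOP 1 OrderID, TotalDue
FROM dbo.Orders
ORDER BY NEWID();

-- Snowflake: fetch the same row by its key and compare the values.
SELECT ORDERID, TOTALDUE
FROM ORDERS
WHERE ORDERID = 12345;  -- key value taken from the SQL Server sample above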

Issues uncovered in this validation process were corrected for all instances of the affected data type.

The second approach involved writing an automated script that accomplished the following (more details can be found here):

  • The script touched each table of each database
  • A set of five records was randomly sampled from both sides and matched (a sampling sketch follows the sample report below)
  • If a table was empty, it was recorded as a table with zero records
  • If the sample sets did not match, the script wrangled the sets, sorting them one column at a time; if the samples then matched, the result was recorded as a pass. This wrangle took care of the scenario in which the first n columns contained duplicate values
  • Any failed validation attempt would be visible in a generated log report:

Validation Started for Table: Test1
----> Validation Passed for table Test1
Validation Started for Table: Test2
----> Validation Passed for table Test2
Validation Started for Table: Test3
----> Table has 0 records
Validation Started for Table: Test4
----> Validation Failed for table Test4
Validation Started for Table: Test5
----> Validation Passed for table Test5

============= Report =============
# Total Number of Tables Scanned: 5
# Tables with 0 records: 1
# Passed: 3
# Failed: 1
# Percentage 60.5556123
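For reference, here is a minimal sketch of how five records might be sampled on each side for the comparison. The table name Test1 is taken from the report above; the actual script’s sampling logic may differ.

-- SQL Server side: five random records from the source table.
SELECT TOP 5 * FROM dbo.Test1 ORDER BY NEWID();

-- Snowflake side: five random records from the migrated table.
SELECT * FROM TEST1 ORDER BY RANDOM() LIMIT 5;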

Amol Gawas

Amol Gawas is a DevOps Engineer at Sigmoid and loves to code and talk to data. He is passionate about learning new technologies and the impact they can bring to DevOps projects. In his free time he is a competitive gamer who loves to watch F1 and is an avid Hot Wheels car collector.
