Migrate, transform, and backfill data with zero downtime

Oftentimes database tables begin with the best of intentions. They can start out narrow and clearly named with well-defined content, but as our applications and teams grow, we may start to notice that those same tables begin to take on a different shape. It’s possible that they grow wider and are harder to work with than you remember. They could be holding data around multiple concepts, which might mean your models are breaking the single responsibility principle. Perhaps that table didn’t start out concise at all. There are many common scenarios that cause tables to exist in a cumbersome form, but it doesn’t mean they have to stay that way.

This post aims to explain how to migrate data from one table to another in a zero downtime, safe manner. The following process can be used for any number of data migration scenarios by using each step as a general guideline. We love Ruby at ezCater, so this post will be within the Rails ecosystem.

Scenario

In this example, we’ve noticed that the Caterer table has grown wider than we’d like. Over the years, the number of columns has grown as our application has matured. It’s not surprising to see the expansion of data, but it would be best if we could stop the expansion and simplify one of our core tables. To start, a truncated view of the caterers table looks like this:

  create_table "caterers", do |t|

    t.integer      "id"
    t.string       "name"
    t.string       "email"
    t.string       "phone"
    t.string       "street1"
    t.string       "street2"
    t.string       "city"
    t.string       "zip"
    t.string       "state"

    t.datetime     "created_at", null: false
    t.datetime     "updated_at", null: false

  end

As this isn’t the full view of the caterers table, it may seem pretty clear where there is an opportunity to trim down the table to a narrower state. There are six address related fields in this selection of columns. This is a great opportunity to make a new addresses table to store all of those attributes and simplify caterers.

With the extractible columns identified, we can start to think about the process ahead. These are the steps we will take in order to migrate over to the new data model for addresses:

Create the new model in the application and in the database
Dual write data to the original table and the new table
Backfill the existing data to the new table
Update reads from the old table to the new table
Update writes to the old table to write to the new table
Drop the old table and delete any code that is no longer needed

Zero downtime and data accuracy

This may seem like a lot of steps for straightforward data migration, but they are all necessary to protect our app from downtime and inaccurate data. There are several points in this process that if done in an incorrect order could result in problematic releases. For example, setting up the model first and dual writing between both the old and new tables allows us to swap the reads and writes incrementally. If we were to do it all at once, we would run the risk of caterers not having an address or having stale data.

It is important that dual writing happens before the data backfill. Having the Caterer model write to both locations protects the accuracy of the address fields. If we were to backfill first and then set up dual writing, we could potentially miss data changes that happen during the backfill.

Having the new model in place first and deleting the old model last ensures the process runs smoothly and avoids bugs.

Create the new model

Before we can do any data migration or backfill, we need to set up our model. This will consist of a Rails migration, a barebones model, and an update to the Caterer model:

  class CreateAddresses < ActiveRecord::Migration[5.2]
    def change
      create_table :addresses do |t|
        t.belongs_to :caterer, foreign_key: true, null: false, index: true
        t.string :street1, null: false
        t.string :street2
        t.string :city, null: false
        t.string :state, null: false
        t.string :zip, null: false

        t.timestamps null: false
      end
    end
  end

  class Address < ApplicationRecord
    belongs_to :caterer

    validates :street1, presence: true
    validates :city, presence: true
    validates :zip, presence: true
    validates :state, presence: true
  end

  class Caterer < ApplicationRecord
    has_one :address, dependent: :destroy
  end

Dual write to the new model

We need to capture all of the changes to the data in the caterer’s original address fields. This includes creation events, updates, and deletions. Having the dependent_destroy option for address on the Caterer model handles deleting events for us. We can use an after_commit callback to handle the remaining create, and update events we want to track. For example:

  class Caterer < ApplicationRecord
    has_one :address, dependent: :destroy

    after_commit :sync_address

    def sync_address
      <WRITE TO THE NEW TABLE>
    end
  end

The after_commit hook is a succinct way to trigger a second write to the new data model as it happens after the completion of a database transaction. This covers our two use cases. As for the logic that performs the writing, we are going to use a Sidekiq job to handle the work.

A Sidekiq job allows us to perform this work in the background and save a bit on performance. Because the dual writing is happening before we have any reads on the new table, it can be safely processed in the background without any repercussions. It also allows us the opportunity to do any sort of data transformation outside of the original model. We don’t need to normalize any data in this example, but it is common to want to update data to make it easier to work with a new data type or format. Our job looks like this:

  class UpdateAddressJob < ApplicationJob
    queue_as :low

    def perform(caterer:)
     address = Address.find_or_initialize_by(caterer: caterer)

     address.update(
       street1: caterer.street1,
       street2: caterer.street2,
       city: caterer.city, state: caterer.state,
       zip: caterer.zip
     )
    end
  end

When it is combined with our model:

  class Caterer < ApplicationRecord
    has_one :address, dependent: :destroy

    after_commit :sync_address

    def sync_address
      UpdateAddressJob.perform_later(caterer: self)
    end
  end

It’s important to note here that this is a great place to write some tests, especially if the data is going through a transformation between the original write and the write to the new model. This will give you confidence that the original data ends up in the correct state and table. These tests will allow you to confirm that all of the pipes are hooked up correctly and that the data flowing through completes its journey in the shape that is intended.

Backfill the data

At this point, it is safe to backfill the original address data into the new table. We can run this backfill with the same job that is run in the after_commit hook with a Rake task.

  desc "Backfill address records"
  task backfill_address_records: :environment do
    caterers = Caterer.all
 
    caterers.each_with_index do |id, index|
      UpdateAddressJob.set(
        queue: :backfill, wait: index * 0.25
      ).perform_later(caterer: caterer)
    end
  end

This Rake task will fan out the jobs by using set with the wait argument. This safely enqueues all the jobs over a span of time making sure they will complete in a secure, managed timeframe. These jobs are also being added to the backfill Sidekiq queue, which ensures that they will not block any other priority job. We’ve setup Sidekiq to use different queues for different types of jobs. This allows us the ability to send high priority jobs into a queue that can handle them quickly. It also allows us to send backfill jobs into a queue specifically for them and protects the resources of the higher priority queues.

After the Rake task has been run, monitor your Sidekiq queue to make sure they are completing successfully. After all jobs have been completed, confirm that the data is correct by spot-checking the new address table. Some good things to check would be the number of caterers compared to the number of addresses — they should be the same! It’s also good practice to take a look at a few rows and make sure the data in each is correct.

Update reads

Our model is in place, the data has been backfilled, and the new table is being populated with all changes to the original model. It’s now safe to start swapping out any reads from the original columns to use the new data structure. This entails moving code that looks like this:

caterer.city

To:

caterer.address.city

It is very important to replace all the reads before replacing the writes. If these two steps are done concurrently, you will end up with data that is out of sync and inaccurate.

Update Writes

Once all reads have been updated to use the new model, update all writes to the new model as well.

caterer.update(city: 'new city') becomes caterer.address.update(city: 'new city')

With all writes updated, it is safe to remove the dual writing after_commit as it is no longer needed. Lastly, ignore the columns on the original table and monitor for errors in case a read or write has been missed:

  class Caterer < ApplicationRecord
    self.ignored_columns = %w(street1 street2 city state zip)

    has_one :address, dependent: :destroy

    def sync_address
      UpdateAddressJob.perform_later(caterer: self)
    end
  end

Clean up dead code and data

Cleaning up no longer needed code and data is our final step. At this point we should be able to do the following safely:

Remove the UpdateAddressJob and the backfill Rake task
Drop the original columns from the table
Remove the ignored columns from the Caterer model

Conclusion

While there are several steps to get to the conclusion of this data migration, the end result is a positive one. The caterers table is more succinct with five fewer columns and the new addresses table has accurate and focused data. It’s a nice refactor that improves the relationship between a caterer and their address.

Migrate, transform, and backfill data with zero downtime

Scenario

Zero downtime and data accuracy

Create the new model

Dual write to the new model

Backfill the data

Update reads

Update Writes

Clean up dead code and data

Conclusion

Work With Us

About ezCater

ezCater is hiring!

Recent Posts

Posts by Topic

Subscribe to Email Updates