Oftentimes database tables begin with the best of intentions. They can start out narrow and clearly named with well-defined content, but as our applications and teams grow, we may start to notice that those same tables begin to take on a different shape. It’s possible that they grow wider and are harder to work with than you remember. They could be holding data around multiple concepts, which might mean your models are breaking the single responsibility principle. Perhaps that table didn’t start out concise at all. There are many common scenarios that cause tables to exist in a cumbersome form, but it doesn’t mean they have to stay that way.
This post aims to explain how to migrate data from one table to another in a zero downtime, safe manner. The following process can be used for any number of data migration scenarios by using each step as a general guideline. We love Ruby at ezCater, so this post will be within the Rails ecosystem.
In this example, we’ve noticed that the caterers table has grown wider than we’d like. Over the years, the number of columns has grown as our application has matured. It’s not surprising to see this kind of growth, but it would be best if we could stop the sprawl and simplify one of our core tables. To start, a truncated view of the caterers table looks like this:
create_table "caterers", do |t|
t.integer "id"
t.string "name"
t.string "email"
t.string "phone"
t.string "street1"
t.string "street2"
t.string "city"
t.string "zip"
t.string "state"
t.datetime "created_at", null: false
t.datetime "updated_at", null: false
end
As this isn’t the full view of the caterers table, it may seem pretty clear where there is an opportunity to trim down the table to a narrower state. There are five address-related fields in this selection of columns. This is a great opportunity to make a new addresses table to store all of those attributes and simplify caterers.
With the extractible columns identified, we can start to think about the process ahead. These are the steps we will take, in order, to migrate over to the new data model for addresses:

1. Create the new addresses table and a barebones Address model.
2. Dual write: have the Caterer model write address changes to both the old columns and the new table.
3. Backfill the existing address data into the new table.
4. Swap all reads over to the new model.
5. Swap all writes over to the new model.
6. Remove the dual writing and ignore the old columns.
7. Clean up the old columns and any leftover code.
This may seem like a lot of steps for a straightforward data migration, but they are all necessary to protect our app from downtime and inaccurate data. There are several points in this process that, if done in the wrong order, could result in problematic releases. For example, setting up the model first and dual writing to both the old and new tables allows us to swap the reads and writes incrementally. If we were to do it all at once, we would run the risk of caterers not having an address or having stale data.
It is important that dual writing happens before the data backfill. Having the Caterer model write to both locations protects the accuracy of the address fields. If we were to backfill first and then set up dual writing, we could potentially miss data changes that happen during the backfill.
Putting the new model in place first and removing the old columns last means every step in between has a working fallback, which keeps the process running smoothly and avoids bugs.
Before we can do any data migration or backfill, we need to set up our model. This will consist of a Rails migration, a barebones model, and an update to the Caterer model:
class CreateAddresses < ActiveRecord::Migration[5.2]
  def change
    create_table :addresses do |t|
      t.belongs_to :caterer, foreign_key: true, null: false, index: true
      t.string :street1, null: false
      t.string :street2
      t.string :city, null: false
      t.string :state, null: false
      t.string :zip, null: false

      t.timestamps null: false
    end
  end
end
class Address < ApplicationRecord
  belongs_to :caterer

  validates :street1, presence: true
  validates :city, presence: true
  validates :zip, presence: true
  validates :state, presence: true
end

class Caterer < ApplicationRecord
  has_one :address, dependent: :destroy
end
We need to capture all of the changes to the data in the caterer’s original address fields. This includes creation events, updates, and deletions. Having the dependent: :destroy option for address on the Caterer model handles the delete events for us. We can use an after_commit callback to handle the remaining create and update events we want to track. For example:
class Caterer < ApplicationRecord
  has_one :address, dependent: :destroy

  after_commit :sync_address, on: %i(create update)

  def sync_address
    # <WRITE TO THE NEW TABLE>
  end
end
The after_commit hook is a succinct way to trigger a second write to the new data model as it happens after the completion of a database transaction. This covers our two use cases. As for the logic that performs the writing, we are going to use a Sidekiq job to handle the work.
A Sidekiq job allows us to perform this work in the background and keep it off the critical path of the original request. Because the dual writing is happening before we have any reads on the new table, it can be safely processed in the background without any repercussions. It also gives us the opportunity to do any data transformation outside of the original model. We don’t need to normalize any data in this example, but it is common to want to reshape data into a new type or format that is easier to work with. Our job looks like this:
class UpdateAddressJob < ApplicationJob
  queue_as :low

  def perform(caterer:)
    address = Address.find_or_initialize_by(caterer: caterer)

    address.update(
      street1: caterer.street1,
      street2: caterer.street2,
      city: caterer.city,
      state: caterer.state,
      zip: caterer.zip
    )
  end
end
When it is combined with our model:
class Caterer < ApplicationRecord
  has_one :address, dependent: :destroy

  after_commit :sync_address, on: %i(create update)

  def sync_address
    UpdateAddressJob.perform_later(caterer: self)
  end
end
It’s important to note here that this is a great place to write some tests, especially if the data is going through a transformation between the original write and the write to the new model. This will give you confidence that the original data ends up in the correct state and table. These tests will allow you to confirm that all of the pipes are hooked up correctly and that the data flowing through completes its journey in the shape that is intended.
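Here is a minimal sketch of what such a test could look like. It assumes rspec-rails and the :test ActiveJob queue adapter in the test environment, and the attribute values and expectations are illustrative rather than taken from our real suite:

require "rails_helper"

RSpec.describe Caterer do
  include ActiveJob::TestHelper

  it "keeps the new addresses table in sync with the caterer's address fields" do
    caterer = nil

    # Run the sync job inline as the after_commit hook enqueues it.
    perform_enqueued_jobs do
      caterer = Caterer.create!(
        name: "Test Caterer",
        street1: "1 Main St",
        city: "Boston",
        state: "MA",
        zip: "02110"
      )

      caterer.update!(city: "Cambridge")
    end

    expect(caterer.reload.address).to have_attributes(
      street1: "1 Main St",
      city: "Cambridge",
      state: "MA",
      zip: "02110"
    )
  end
end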
At this point, it is safe to backfill the original address data into the new table. We can run this backfill from a Rake task, using the same job that the after_commit hook enqueues.
desc "Backfill address records"
task backfill_address_records: :environment do
caterers = Caterer.all
caterers.each_with_index do |id, index|
UpdateAddressJob.set(
queue: :backfill, wait: index * 0.25
).perform_later(caterer: caterer)
end
end
This Rake task fans out the jobs by using set with the wait argument, enqueueing them over a span of time so they complete at a controlled pace instead of flooding the workers. The jobs are also added to the backfill Sidekiq queue, which ensures they will not block higher-priority work. We’ve set up Sidekiq to use different queues for different types of jobs: high-priority jobs go into a queue that can handle them quickly, while backfill jobs go into a queue of their own, protecting the resources of the higher-priority queues.
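For context, that queue separation is just Sidekiq configuration. A minimal sidekiq.yml sketch along these lines would do it; the queue names and weights here are assumptions rather than our exact setup:

# config/sidekiq.yml (illustrative). Higher-weighted queues are checked more
# often, so backfill jobs only run when there is spare capacity.
:queues:
  - [critical, 10]
  - [default, 5]
  - [low, 2]
  - [backfill, 1]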
After the Rake task has been run, monitor your Sidekiq queues to make sure the jobs are completing successfully. Once they have all finished, confirm that the data is correct by spot-checking the new addresses table. A good first check is that the number of addresses matches the number of caterers; they should be the same. It’s also good practice to take a look at a few rows and make sure the data in each is correct.
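A Rails console session works well for these spot checks. Here is a sketch of the kinds of checks we mean; the exact queries are up to you and your data:

# Run from a Rails console once the backfill queue has drained.

# Every caterer should have exactly one address row.
Caterer.count == Address.count
# => true

# No caterer should be missing an address.
Caterer.left_joins(:address).where(addresses: { id: nil }).count
# => 0

# Compare a handful of rows against the original columns.
Caterer.limit(5).each do |caterer|
  puts [caterer.street1, caterer.address.street1].inspect
  puts [caterer.city, caterer.address.city].inspect
end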
Our model is in place, the data has been backfilled, and the new table is being populated with all changes to the original model. It’s now safe to start swapping out any reads from the original columns to use the new data structure. This entails moving code that looks like this:
caterer.city becomes caterer.address.city
It is very important to replace all of the reads before replacing the writes. If the writes are moved first, the original columns will go stale while code is still reading from them, and you will end up with data that is out of sync and inaccurate.
Once all reads have been updated to use the new model, update all writes to the new model as well.
caterer.update(city: 'new city') becomes caterer.address.update(city: 'new city')
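For call sites that assign several address fields at once, the change looks much the same. The sketch below is illustrative, with address_params standing in for wherever the attributes come from; building through the association covers both existing caterers, which already have an Address from the backfill, and caterers created after the cutover.

# Before the swap, address fields were written directly on the caterer:
#   caterer.update(street1: "1 Main St", city: "Boston", ...)

# After the swap, write through the association instead. Reuse the existing
# address when there is one, and build a new one for caterers that were
# created after the cutover.
address = caterer.address || caterer.build_address
address.update(address_params)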
With all writes updated, it is safe to remove the dual writing after_commit as it is no longer needed. Lastly, ignore the columns on the original table and monitor for errors in case a read or write has been missed:
class Caterer < ApplicationRecord
  self.ignored_columns = %w(street1 street2 city state zip)

  has_one :address, dependent: :destroy

  def sync_address
    UpdateAddressJob.perform_later(caterer: self)
  end
end
Cleaning up code and data that are no longer needed is our final step. At this point we should be able to do the following safely:

1. Drop the old address columns from the caterers table (a migration for this is sketched below).
2. Remove the ignored_columns declaration from Caterer once those columns are gone.
3. Delete any leftover plumbing, such as the sync_address method, UpdateAddressJob, and the backfill Rake task.
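The column removal itself is just one more migration. A sketch of what that might look like follows; the class name is illustrative, and it should only ship after the ignored_columns change has been deployed everywhere:

class RemoveAddressColumnsFromCaterers < ActiveRecord::Migration[5.2]
  def change
    # Passing the column type keeps the migration reversible.
    remove_column :caterers, :street1, :string
    remove_column :caterers, :street2, :string
    remove_column :caterers, :city, :string
    remove_column :caterers, :state, :string
    remove_column :caterers, :zip, :string
  end
end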
While there are several steps to get to the conclusion of this data migration, the end result is a positive one. The caterers table is more succinct with five fewer columns and the new addresses table has accurate and focused data. It’s a nice refactor that improves the relationship between a caterer and their address.