Do the hard one second


                            April 14, 2021



Technical migrations
This post is an excerpt from a work-in-progress post about running technical migrations. If all goes well, that one will go up on the blog at some point, but I wanted to try out this section first, since I feel unsure about it: I’m not entirely sure how much to generalize from my experiences at Stripe here, and I worry about saying something either too specific or so broad as to be meaningless. I’d love feedback on whether this post lands for you and how it compares with others’ experience.
Incremental migrations
If you work at a technical organization long enough, you’ll be involved in a lot of technical migrations. Technologies and pieces of infrastructure need to be upgraded and replaced with newer, improved versions. Perhaps you’re migrating to a new observability platform, or moving to Kubernetes. Maybe you’re moving from PostgreSQL to MySQL, or from GFS to Colossus.
Occasionally you will be able to do a migration in one fell swoop, where you move all users of the old system to the new system on one smooth flag day. More commonly, though, large migrations are incremental; you’ll move over piecemeal, moving one system or component or subset of your traffic at a time.
When you’re migrating something incrementally, you’ll divide the users of your system somehow, and then face the question of which order to migrate them in. What this division looks like depends on the nature of your system and the migration, but some common breakdowns include:

You might migrate one table or database model class at a time
You might migrate one service at a time in a microservice or SOA architecture
You might migrate one endpoint or URL route at a time
You might migrate one repository at a time, if your organization divides code into many repositories

I’ll refer to these downstream systems that you are trying to cut over to your new system as your “users,” as viewed from the perspective of the infrastructure team.
When considering this breakdown, it’s pretty common to find a small number of users (maybe even just one) that are both the most important by far and also the most complex and challenging¹. At the same time you’ll find a long tail of other users that are smaller and simpler and also must be migrated eventually. A key choice in running such a migration is the order in which you approach these. Do you start with the long tail, testing out ideas and learning on the easy cases, and work your way smoothly up? Or do you jump straight into the hard but important cases at the head of the distribution?
Start small, then jump to the hard one
I worked on and ran a lot of migrations at Stripe, and the heuristic that I eventually settled on for most projects was this: Start with an easy user, or maybe two², but after you have an early win and a working end-to-end proof of concept, jump straight to the top of the distribution and attempt the most important migration(s).
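As a rough illustration, here is one way the ordering could be sketched in Python. Everything in it is an assumption made for the example: the `User` record, the numeric `importance` and `difficulty` scores, the `plan_order` helper, and the user names are all hypothetical. In a real migration these are judgment calls, not numbers in a table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """A downstream system to migrate (hypothetical model for this sketch)."""
    name: str
    importance: int  # made-up score: higher means more value once migrated
    difficulty: int  # made-up score: higher means harder to migrate


def plan_order(users, warmups=1):
    """Pick `warmups` easy users first, then go in descending importance."""
    # Easiest users first; ties broken by lower importance (lower risk).
    by_ease = sorted(users, key=lambda u: (u.difficulty, u.importance))
    first = by_ease[:warmups]
    # Everything else is ordered by importance, most important first.
    rest = [u for u in users if u not in first]
    rest.sort(key=lambda u: u.importance, reverse=True)
    return first + rest


users = [
    User("charges", importance=100, difficulty=90),
    User("settings", importance=10, difficulty=5),
    User("sessions", importance=20, difficulty=20),
    User("audit_log", importance=5, difficulty=2),
]

print([u.name for u in plan_order(users)])
# ['audit_log', 'charges', 'sessions', 'settings']
```

The point of the sketch is the shape of the resulting list: one cheap warm-up, then straight to the head of the distribution, with the long tail left for afterwards.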
Starting with one or two easy cases is, I think, intuitive. There are often a lot of fiddly details (“plumbing”) to get something up and running in the first place. There’s figuring out how to package and configure everything in your environment and all the small details of how things work together end-to-end. It’s typically easiest to do that in a low-risk environment where you don’t have to be too careful or worry about that much data. Your goals are to get an end-to-end proof-of-concept, and to develop initial familiarity with the new system in production, and with the basic mechanics of migrating a user from the old system to the new.
My (potentially) surprising claim is that, once you have something up and running, you are usually well-served by moving quickly on to the hardest and most important user systems, instead of working your way gradually towards them. There are a few reasons I think this is usually advantageous:
Fastest learning
It can be tempting to work your way up the chain, tackling ever-more-important systems and letting the integrations “bake-in” over time, in order to learn more about operating the system in production. And it’s true that you’ll gain some experience by doing so.
In practice, however, there will always be gaps between what you’ve learned and the ways in which the top few users will stress your system. If you really want to learn as much as possible about your new system, and about how the important users will interact with it, you have to work on that question directly. And you do that by diving in and working directly on migrating those big, important, hairy users and learning where the bodies are buried, not by working on something else and hoping you eventually learn by osmosis.
Note that moving on to these big users early does not mean being reckless! You’ll want to be careful and incremental (likely even more so than you would be working on the long tail), using tools like test environments, incremental traffic shifting, feature flags, dark launches, load tests, and gameday exercises to move safely and learn before you cut over for good. But by deploying those tools directly on the most important systems for the migration, you’ll learn much more, and faster, about the hard parts of the project and how you can cope with them than by avoiding them until you feel “ready.”
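To make two of those tools concrete, here is a minimal Python sketch of percentage-based traffic shifting combined with a dark launch. It is not how any particular system does it; the `bucket`, `use_new_system`, and `read` helpers, their parameters, and the idea of passing the old and new systems in as plain callables are all assumptions for the example.

```python
import hashlib


def bucket(key: str) -> int:
    """Map a key to a stable bucket in [0, 100) using a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_new_system(key: str, rollout_percent: int) -> bool:
    """True if this key falls inside the current rollout percentage."""
    return bucket(key) < rollout_percent


def read(key, old_read, new_read, rollout_percent,
         dark_launch=False, mismatches=None):
    """Serve a read from the old or new system (hypothetical callables)."""
    if use_new_system(key, rollout_percent):
        return new_read(key)

    result = old_read(key)
    if dark_launch:
        # Shadow the request to the new system, but never let it affect
        # the response: only record disagreements or failures.
        try:
            shadow_ok = new_read(key) == result
        except Exception:
            shadow_ok = False
        if not shadow_ok and mismatches is not None:
            mismatches.append(key)
    return result


old_store = {"acct_1": "a"}
new_store = {"acct_1": "b"}  # deliberately different, to show a mismatch
log = []
value = read("acct_1", old_store.get, new_store.get,
             rollout_percent=0, dark_launch=True, mismatches=log)
print(value, log)
# a ['acct_1']
```

Because the bucket is derived from the key instead of a random draw, a given key stays on the same side as long as the percentage does not change, and raising the percentage only ever moves keys from the old system to the new one. That makes it practical to ramp the biggest user up a few percent at a time while the dark launch surfaces disagreements before any real traffic depends on the new system.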
That’s where the value is
Hopefully, you’re not running this migration because it’s fun or out of a sense of technical idealism or purity. You’re running it because the new system is superior in some way, and will provide better value to the organization or its customers. Typically, this value will be realized in proportion to the importance of the internal users that are successfully migrated. In order to get the most value from the migration, you need to migrate the most important users. Focusing on the most important users means you’re focusing on where you’re going to get the value out of the migration, and not just fumbling around forever at the long tail.
In some cases, you get the most value out of a migration when you get to 100% and can actually turn off the old system entirely. Even in those cases, starting with the important ones is valuable since that’s where you’ll learn the most the fastest, and also because those systems run the largest risks of posing insurmountable difficulties.
You make sure it happens
I’ve seen far too many migrations succeed at migrating a handful of small, easy users, and then stall out for various reasons. Those failures can be the worst possible outcome for migration projects; you haven’t generated much value, but you’ve now created a second system that’s too entrenched to easily roll back, and which the organization now needs to support going forward.
By focusing your efforts on doing the hardest and most important users early, you are forced to grapple with them, and you make sure that, if you move forward, you actually do those hard problems, instead of perpetually deferring them. And if you fail, for some reason, you will fail earlier, when you still have an easier path to rolling back completely, if necessary.
Closing thoughts
I saw some version of this situation play out repeatedly at Stripe, and I grew quite confident in this heuristic there. In that case, the “big, complex user” was one of the same handful each time: the Charges table, the primary API service, or the Ruby monorepo. I feel confident it’s a useful heuristic in other cases, but I worry about overly generalizing, and about how to define the set of “similar enough” situations. I do think it’s probably a valuable perspective to have on hand when you’re looking at a migration, as long as you remember that it’s a heuristic or a suggestion, and not a hard-and-fast rule.
One place where I struggle is the question of defining “important” or “valuable” in rapidly-growing organizations. If an organization is growing, most of the value it produces is in the future, and sometimes it’s the right answer not to invest in existing systems, relatively speaking, but to build new systems that will replace them, or at least grow to be more important. Such a high-growth environment can lead to situations like the one I alluded to earlier, where the “easy” cases (building green-field applications, or investing in small-but-growing systems) are also the highest-value ones.
At the same time, growth is always speculative and uncertain, and it can be a seductive trap to convince yourself that your new system will become more valuable than the existing system, when you’re actually just using that as a convenient excuse to avoid the hardest problems. At Stripe I saw numerous new systems proposed that had no story for how the existing API monolith would ever use them, defended on the grounds that we were growing and they would become important enough to justify their costs. In my judgment, in 9 out of 10 cases what happened instead was that the API monolith kept growing in traffic and importance, and the new system never took off.
But that wasn’t always the case, and it’s always hard to know ahead of time what the future holds. And even in cases where you’re growing, growth isn’t forever and you will always eventually have to do some “true” migrations — potentially even in the form of full rewrites — to keep systems running indefinitely.
Have you run large migrations within an organization? Do my model here and this heuristic match your experience, or did your problems look very different, or did you have success with some other approach? I’d really love to hear from you.

¹ If it’s ever not the case that “value” and “complexity” line up, then your job is easier: you can start with the most important systems, get most of the value fast, and then only later tackle the truly hard cases.

² Maybe the post title should be “do the hard one at-most third” or “do the hard one early,” but those sound less catchy.
