At the end of last year I was tasked with moving an infrastructure spread across various providers (cloud and physical servers) to a new cloud. My role was to map the entire infrastructure and how its services communicate, come up with a strategy for the move, then manage and supervise the sysadmin who actually moved the machines. This was a production environment, so minimal downtime was desired. Most of the work happened at night.
It was a challenge from day zero: a legacy multi-component system with no clearly defined flow. I had high-level knowledge of the project, but subtle details surfaced every day. Along the way I learned and used a few things that I’d like to mention.
Planning everything ahead
I took each component (or group of components where required) individually and wrote down on a whiteboard its role, technologies, firewall rules, running processes, storage info, communication flow with the other components, and a step-by-step process of how its migration should go.
Can the machine be turned off? Should it be cloned or can it be recreated from scratch and services deployed in a fresh environment?
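The whiteboard entries can also be kept as structured records, which makes gaps easier to spot. The following is only a minimal sketch of that idea: the `Component` fields mirror the items listed above, while the field names, the `undocumented_peers` helper and the example components are all hypothetical and not taken from the real system.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One whiteboard entry: what is known about a component before it moves."""
    name: str
    role: str
    technologies: list[str] = field(default_factory=list)
    open_ports: list[int] = field(default_factory=list)   # firewall rules to recreate
    processes: list[str] = field(default_factory=list)    # what must run afterwards
    storage: str = ""                                      # disks, volumes, data size
    talks_to: list[str] = field(default_factory=list)     # names of components it calls
    can_power_off: bool = False                            # can the machine be turned off?
    strategy: str = "clone"                                # "clone" or "recreate"
    steps: list[str] = field(default_factory=list)        # ordered migration steps


def undocumented_peers(components: list[Component]) -> set[str]:
    """Return names used in a communication flow that have no entry of their own."""
    known = {c.name for c in components}
    referenced = {peer for c in components for peer in c.talks_to}
    return referenced - known


# Hypothetical inventory: "rabbitmq" is referenced but not yet written down.
inventory = [
    Component(name="api", role="public HTTP API", talks_to=["mongo", "rabbitmq"]),
    Component(name="mongo", role="datastore", strategy="recreate"),
]
print(undocumented_peers(inventory))  # {'rabbitmq'}
```

A check like this is a cheap way to notice a service that something depends on but nobody has planned for yet.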
Make sure that, if disaster strikes, you’re not going to be sorry about lost data.
Always start with the firewall on the new machines; don’t assume you’re going to be fast enough to add it later. We lost some data (it was backed up) because ports were being scanned while they were still exposed.
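One way to confirm that the firewall does what you expect is to probe the new machine from outside its network and compare what answers with what should answer. This is a rough sketch using only the Python standard library; the function names and the idea of an "allowed" list are assumptions for illustration, and it only detects ports that have a service listening behind them.

```python
import socket


def reachable_ports(host: str, ports: list[int], timeout: float = 1.0) -> list[int]:
    """Return the ports on `host` that accept a TCP connection."""
    found = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                found.append(port)
        except OSError:
            pass  # refused, filtered, or timed out: treated as not reachable
    return found


def unexpected_ports(host: str, candidates: list[int], allowed: set[int],
                     timeout: float = 1.0) -> list[int]:
    """Return reachable ports that are not in the set that is meant to be public."""
    return [p for p in reachable_ports(host, candidates, timeout) if p not in allowed]
```

For example, calling `unexpected_ports(host, [22, 80, 443, 6379, 27017], allowed={80, 443})` from an external machine should return an empty list if only HTTP and HTTPS are meant to be public; the host and port numbers here are placeholders.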
MongoDB makes this kind of move easy if it runs as a replica set (if it doesn’t, you should set one up at least until you’re done migrating). You start a new instance on the new host and add it to the replica set; the sync process starts, and the new member becomes available when the sync is done. Repeat this for each member. Before moving the primary (master) node, make another member the primary.
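In the mongo shell these steps correspond to helpers such as `rs.add()`, `rs.stepDown()` and `rs.remove()`. To show what changes underneath, here is a small sketch that edits a replica set configuration document in plain Python, without talking to a server. The host names and the `rs0` set name are made up, and the version bump reflects my understanding that a reconfiguration sent as a raw command should carry a newer version number.

```python
import copy


def add_member(config: dict, host: str) -> dict:
    """Return a copy of the replica set config with `host` appended as a member."""
    new_config = copy.deepcopy(config)
    next_id = max(m["_id"] for m in new_config["members"]) + 1
    new_config["members"].append({"_id": next_id, "host": host})
    new_config["version"] += 1  # assumption: each reconfig carries a higher version
    return new_config


def remove_member(config: dict, host: str) -> dict:
    """Return a copy of the config without `host` (do this only after the sync is done)."""
    new_config = copy.deepcopy(config)
    new_config["members"] = [m for m in new_config["members"] if m["host"] != host]
    new_config["version"] += 1
    return new_config


# Hypothetical two-member set being moved one member at a time.
current = {
    "_id": "rs0",
    "version": 3,
    "members": [
        {"_id": 0, "host": "old-1.example:27017"},
        {"_id": 1, "host": "old-2.example:27017"},
    ],
}
with_new = add_member(current, "new-1.example:27017")          # new member starts syncing
without_old = remove_member(with_new, "old-1.example:27017")   # retire the old member
```

The helpers return new documents instead of mutating the input, so each intermediate configuration can be reviewed before it is applied.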
If you are migrating a file-based database such as Redis (the rdb file) and you work with Docker, remember to start the container only after copying the old database file to the new machine; otherwise you might lose some data.
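The ordering matters because a Redis container started on an empty data directory comes up empty and may later write its own dump over a file copied in afterwards. A sketch of "copy first, start second" could look like the following. It assumes the official `redis` image with its default settings (data directory `/data`, dump file named `dump.rdb`); the function name, container name and paths are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def restore_then_start(rdb_copy: Path, data_dir: Path, run: bool = False) -> list[str]:
    """Put the old dump into the data directory, then build (and optionally run) docker."""
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / "dump.rdb"
    shutil.copy2(rdb_copy, target)

    # Refuse to continue if the copy is obviously incomplete.
    if target.stat().st_size != rdb_copy.stat().st_size:
        raise RuntimeError("dump.rdb was not copied completely")

    # Assumption: official redis image, which keeps its data under /data.
    command = [
        "docker", "run", "-d", "--name", "redis",
        "-v", f"{data_dir.resolve()}:/data",
        "redis",
    ]
    if run:
        subprocess.run(command, check=True)  # start only after the file is in place
    return command
```

With `run=False` the function only copies the file and returns the command, which is handy for a dry run before the real migration night.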
This part is easy if you are behind your own load balancer. This system was using Cloudflare’s DNS load balancer. Luckily, the IP addresses could also be migrated, so it took only a few moments to stop the interface on the old host and start it on the new one. While one machine was down, the other was up.
If you aren’t using an automated configuration system yet, now is the time to start. Firewalls and SSH keys are a burden when managed manually.
There were also some worker machines. I simply started the most important workers on another machine while the old machines were being migrated. Short downtimes were acceptable because data was queued in RabbitMQ.
Know your infrastructure as well as you can, even if it’s not your job. It’s a win!