Rancher 101: How To Make Your Rancher Server Immortal



Hang on just
a second… [media ejected] Done. This morning I was going through an old box
of technology bits and I found an old flash drive. On that flash drive was a Bitcoin wallet from
2011. In that wallet was more than a thousand Bitcoin. A thousand Bitcoin! That’s like… [doing math] A LOT. I just transferred all of the data over to
this flash drive. This is an ultra-secure flash drive with a
small lithium ion battery that powers a microphone and a chip with my voice imprint on it. The drive is encrypted, and if I’m ever in
a situation where someone might steal the drive, I only have to say the word “alabaster”
and… [small popping sound] [sound of fire] Well that sucks. [music playing] Hey welcome back to another episode
of my visual documentation series where today we’ll be continuing through the Rancher documentation,
helping you get the knowledge you need as quickly as possible so that you can do more,
make more, and live a better life. Despite the fact that it sounds super corny, it’s the truth. First off, that flash drive didn’t really
have 1000 Bitcoin on it, but if it did, I’d be a fool for not having a backup of it. That’s what we’re going to talk about today
– backing up and restoring Rancher so that you don’t have to experience the loss of your
Kubernetes management solution. [sad music, crows calling] [crying] I’ll miss him…. [music playing] In my other videos I showed you how to launch
both a standalone Rancher server and a highly-available Rancher server. In the standalone video I said that I like
to use a bind-mounted volume from the host because it makes it super easy to back up
the server during an upgrade. Scratch out the “upgrade” part and let’s just
leave it at “it makes it super easy to back up the server.” Full stop. This is one of those times that I’m gonna deviate from the Rancher documentation. Their method has you stop the server, make
a data container, make a tarball from the data container and then restart the server. That’s a lot of steps and a lot of containers
and volumes to keep track of. Instead, let’s stop the server, tar up the directory, and start the server. [typing] Easy peasy. Be sure that you use “-p” so that tar retains the file permissions. Restoring a backup is just as easy. Stop the server, move away the bad directory, untar your backup, and start the server. Again, be sure to use “-p” so that the file permissions are restored. [typing]
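Here’s a minimal sketch of what that looks like in practice, assuming a standalone server container named rancher-server with its data bind-mounted at /opt/rancher and a /backups directory for the tarballs (all of those names are mine, not from the docs, so adjust them for your setup):

    # Back up: stop the server, tar the data directory (keeping permissions with -p), start it again
    docker stop rancher-server
    tar -pzcf /backups/rancher-data-$(date +%F).tar.gz /opt/rancher
    docker start rancher-server

    # Restore: stop the server, move the bad directory aside, untar a backup (again with -p), start it again
    docker stop rancher-server
    mv /opt/rancher /opt/rancher.broken
    tar -pzxf /backups/rancher-data-<date>.tar.gz -C /    # substitute the backup you want
    docker start rancher-server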
I recommend that you script this and run it with cron, and of course, move that backup off of the server. Write it out to an NFS share, an S3 endpoint, or just SCP it somewhere. Don’t leave it around on the server.
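As a rough sketch, assuming a hypothetical /usr/local/bin/backup-rancher.sh script that does the stop/tar/start dance above, plus a made-up backup host, the cron entry could look like this (note that % has to be escaped in a crontab):

    # Nightly at 2:30: take a backup, then copy it off the server over SSH
    30 2 * * * /usr/local/bin/backup-rancher.sh && scp /backups/rancher-data-$(date +\%F).tar.gz backup@backup-host:/srv/rancher-backups/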
For an HA installation the steps are different. Rancher’s using etcd as its datastore, so we need to make a snapshot of it and write it out to the filesystem. This can be done as a one-off, such as before an upgrade, or as a scheduled task like the cron job would be for a single-node install. For production environments you always want to go with the recurring option, and you’ll set that up in the RKE config file, sketched below. This tells RKE to make a snapshot every six hours and keep each snapshot for one day. Beginning with RKE 0.2.0 you can add configuration to ship the backup off to an S3-compatible endpoint. Now, note that I’m saying “S3 compatible,” which means you can use compatible object storage from providers other than Amazon. You can even run your own S3 endpoint using something like Minio.
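I can’t reproduce the exact file shown on screen, but with RKE 0.2.0 or newer the recurring-snapshot section of cluster.yml looks roughly like this; the retention count and every S3 value below are examples only (point the endpoint at your Minio server if you’re running your own):

    services:
      etcd:
        backup_config:
          interval_hours: 6      # take a snapshot every six hours
          retention: 4           # keep the last four snapshots, roughly one day's worth
          s3backupconfig:        # optional: also ship each snapshot to an S3-compatible endpoint
            access_key: <ACCESS_KEY>
            secret_key: <SECRET_KEY>
            bucket_name: rancher-etcd-backups
            region: us-east-1
            endpoint: s3.amazonaws.com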
How often should you make backups? That’s a great question, and only you can answer it. It depends on how busy your Rancher server
is and how often you’re making changes to it. I like to ask myself how much data I can lose before I feel unreasonable pain. Maybe that’s a day or six hours or one hour
or five minutes. You decide. You’ll do one-off backups before upgrades
or major changes where things might go sideways. You do these with the RKE command from the
machine where you have the RKE config file. Backups will be written to /opt/rke/etcd-snapshots
on each node. If you don’t have an NFS share mounted there,
you’ll have to move the backups off each node to a safe location. If you pass S3 access credentials to the RKE command when you make the snapshot, then in addition to writing out a local copy of the backup, RKE will also write it to the S3 endpoint.
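With RKE 0.2.0 or newer, a one-off snapshot could look something like this (the snapshot name, bucket, and credentials are all placeholders, and the S3 flags are only needed if you want the copy shipped off the nodes):

    # Take a named snapshot on every etcd node; it lands in /opt/rke/etcd-snapshots
    rke etcd snapshot-save --config cluster.yml --name before-upgrade-$(date +%F)

    # Or pass S3 credentials so RKE also pushes the snapshot to an S3-compatible endpoint
    rke etcd snapshot-save --config cluster.yml --name before-upgrade-$(date +%F) \
      --s3 --access-key <ACCESS_KEY> --secret-key <SECRET_KEY> \
      --bucket-name rancher-etcd-backups --s3-endpoint s3.amazonaws.com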
If you’re using a version of RKE before 0.2.0, be sure to also copy off the pki.bundle file with the main backup file. This contains the certificate information
for the server, and you can’t do a restore without it. The restore process uses the cluster.yml
and cluster.rkestate files from the machine where you ran RKE to build the server cluster,
so it’s imperative that you also have those backed up and stored in a safe location. Always test your backups, because
the only thing worse than not having a backup is having a backup that you can’t restore. When working with an HA installation of Rancher,
restoring a backup is a little more complicated, and honestly, a little bit scarier. This is all the more reason that you should
run through the scenario before you actually need it. You want to be super comfortable with the
process and also be super confident that it works. The process goes like this: We’re going to power off the old nodes. Then we’re going to fire up new, clean nodes. We’re going to pick one of those nodes to
be the restore point, and if we’re not using S3, we’re going to put the backup file on
that node. Then we’re going to do a restore, after which
we’re going to bring up a new cluster with that single node, wait for it to stabilize,
and then add in the new nodes. When everything is back up and running, we’ll
delete the old nodes. It sounds worse than it is. It’s actually not that hard and kinda magical. Let’s go through it step by step. Power down all of the nodes, but don’t delete
them yet. You need the old nodes out of the way so that
you can initialize a new etcd data plane. Fire up three new nodes. Those will become the
new Rancher server cluster. On the host where you have the RKE binary,
copy the cluster.rkestate file to cluster-restore.rkestate. Then copy the cluster.yml file to cluster-restore.yml and open it up. Change one node’s IP address to your new restore
node and comment out the other two nodes entirely. If you’re working with public and private
addresses, change them both for the restore node.
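A quick sketch of that prep work, run from the directory where the RKE files live (the editor is up to you, and which node you keep is your call):

    # Work from copies so the original cluster files stay intact
    cp cluster.rkestate cluster-restore.rkestate
    cp cluster.yml cluster-restore.yml

    # In cluster-restore.yml: point one node entry at the new restore node's
    # address (and internal_address if you use one), then comment out the
    # other two node entries entirely
    vi cluster-restore.yml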
If you are using S3, good for you. We’ll do the restore directly from S3 in a second. If you’re not using S3, put your backup file
into “/opt/rke/etcd-snapshots” on the node. If you’re using an RKE version less than 0.2.0,
you’ll also need to put the pki.bundle file from the backup into the same directory. Now do a restore of the backup. This will spin up a single etcd container
and restore the data, but it will stop before it goes further in the process of launching a new
cluster. Add the flags for S3 here if you need to. Once the restore is done, run “rke up” to bring up the rest of the services on this single node.
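Roughly, assuming RKE 0.2.0 or newer and a made-up snapshot name (add the same S3 flags shown earlier if the snapshot lives in S3 rather than in /opt/rke/etcd-snapshots):

    # Restore the etcd data from the snapshot onto the single restore node
    rke etcd snapshot-restore --config cluster-restore.yml --name before-upgrade-2019-08-30

    # Then bring up the rest of the Kubernetes services on that node
    rke up --config cluster-restore.yml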
Alright, so at this point you have a single-node RKE cluster. Point your KUBECONFIG environment variable
to kube_config_cluster-restore.yml and do a “kubectl get nodes”. You should see your single node with a status
of Ready. If not, then wait and check it again. Once the node is ready, if you see your old
nodes with a status of Not Ready, go ahead and delete them with kubectl.
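That check might look like this; RKE writes the kubeconfig next to the config file, named after it, and the old node name below is a placeholder:

    # Talk to the new single-node cluster
    export KUBECONFIG=$(pwd)/kube_config_cluster-restore.yml
    kubectl get nodes

    # Once the restore node shows Ready, delete any old nodes stuck in NotReady
    kubectl delete node <old-node-name>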
The next thing is to reboot the node. This makes sure that it comes up clean and has no lingering issues. After it comes back up, check the ingress-nginx and kube-system namespaces and wait for all the system pods to stabilize. If you haven’t added the new nodes into the
load balancer, then you’ll see errors on cattle-node-agent and cattle-cluster-agent pods. That’s fine. They’ll start properly after you update the
load balancer. Once they’re all running, edit your RKE config
file. Change the addresses for the other two nodes
and uncomment them. The last step is to run “rke up” with the modified
config file to add the other two nodes into the cluster.
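That last step is just another run against the same modified file (assuming you’ve kept working in cluster-restore.yml):

    # With the other two nodes uncommented and pointing at their new addresses,
    # grow the single-node cluster back out to three nodes
    rke up --config cluster-restore.yml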
RKE will rebuild the cluster, and while it’s doing that, you can add the new nodes into the load balancer. Once everything is up and running, remember
to remove the old nodes from the load balancer and also delete them. That was easy, right? If nothing else, this demonstrates how flexible
and robust RKE is and how awesome Rancher Labs is for building software that quickly
recovers from failure, because you know what? I make these videos to help people like you
learn how to use technology to make your life better. If you enjoyed it, please subscribe, but more
importantly, please share it with your friends and colleagues. If there’s anything about Rancher or Kubernetes
that you’d like to see in a future video, let me know on Twitter or in the comments
below, and I’ll see about getting it in the queue. Thanks so much for watching! I’ll see you next time.
