Designing for Update/Fault Domains

[Music] In this topic, we are going to look
at update and fault domains. So this is pretty important to understand when you are designing
your Azure solution. We are going to look at that from two different perspectives. So you
know, me being kind of an IaaS kind of guy with virtual machines, I will be talking about
kind of machines and of course Cale with his application side will kind of put his little
spin and twist to that side of the house as well from PaaS perspective. So when you think
about it, if I am going to have a couple machines that I fire up, we have to think about how
Azure updates these things over time. So for example, if I have two different machines
and they are both kind of being updated by Azure and patched and maintained at the same
time, so we get a service update and they both go down at once and that kind of creates
a problem, right? So that is why we talked before about availability sets. So for example
I do a lot of AD FS installations, so when I do a pair of AD FS servers, I will do an
availability set and then that will spread them across these domains. So we have a great little diagram here that
shows how this all kind of comes into play with the fault domains in designing without
availability sets. So the fault domain essentially, the easiest way to describe it, it is the
physical rack. It is the physical aspects of all these servers that are within Azure
in one specified place. So if we had two, three, four servers in that one fault domain
and that whole rack went down, we lose service to all the applications and all the servers
right, but if we can spread those across as I mentioned in my case of AD FS servers, if
I build two AD FS servers, I create an availability set, which is an option when we configure
that, it will then spread that across those two fault domains. At the very bottom here,
I am just going to highlight this here for you. So you will see here and actually I did
this when I was initially playing around, I just grabbed that screenshot that was pretty
good, so I tried to create an availability set with just one machine in it and it actually
gives you what I call good GUI. It gives you a warning that says the availability set with
this virtual machine has only one running instance, which affects the service level
agreement. So from a business perspective, this is a very important point because if
you have an availability set, guess what, you get an SLA of 99.95% uptime, okay. If you do
not or if you make an availability set with only one server in it, you do not get that
SLA. So it is something that is very important that you want to think about okay. So I am going to show here also where you
can see some of these options here. So if I go into my portal and I see I have a couple
AD FS servers right and I’m looking, actually this is the Cloud Service and you see I have three AD
FS servers and this lists the update domains and the fault domains. The fault domains distribute
these loads across the physical racks, right? The update domains tell us how it is going
to be patched and updated when Azure does an update to those machines, and you see in
this case, I have three servers. So ADFSDemo01 and ADFSDemo02 are in
different update domains and fault domains, right. In order to get them into different fault
domains, when I built those I created an availability set. So the numbering starts at zero, and
then the third one here, so I have ADFSDemo03. It’s on its own, so I have no SLA, right, and
until I put that into an availability set with another server, I do
not get that SLA. So probably a better design, what I should have done when I made those,
is put all three in an availability set right to improve that availability with those servers
okay.

>> Yeah, the way I think about this, especially from an IaaS perspective, which Mark is describing
here, is that it is kind of like clustering, except that we do not do the synchronization. So when
you set up a traditional Windows cluster or SQL cluster, any of those type of clusters
would basically have a quorum disk and we kind of keep these things in sync, but this
does not do any of that synchronization. It is just there to protect those physical
assets. So the fault domain's primary use case is that a rack went down. Something bad
happened in that rack in Azure, either the hardware broke, something broke in the switch,
something bad happened in that server. That thing will be repaired, but the Azure fabric
can detect that and go ahead and move that workload over to someplace else. As long as
you have another one in another fault domain, you are golden because your app is still running.
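
To make the fault domain point concrete, here is a minimal Python sketch. It is not an Azure API: the function names and the placement dictionaries are hypothetical, made up for illustration. It assumes each VM sits in exactly one numbered fault domain and simply checks whether at least one VM is still running when any single rack goes down.

```python
def survivors_after_rack_failure(placements, failed_fault_domain):
    """Return the VM names still running if one fault domain (rack) fails.

    placements maps a VM name to its fault domain number (hypothetical model).
    """
    return [name for name, fd in placements.items() if fd != failed_fault_domain]


def tolerates_any_single_rack_failure(placements):
    """True if at least one VM survives the loss of any one fault domain."""
    if not placements:
        return False
    fault_domains = set(placements.values())
    return all(survivors_after_rack_failure(placements, fd) for fd in fault_domains)


# Two servers spread across two fault domains, as an availability set would do.
spread = {"ADFSDemo01": 0, "ADFSDemo02": 1}
# Two servers that happen to share one rack.
stacked = {"ADFSDemo01": 0, "ADFSDemo02": 0}

print(tolerates_any_single_rack_failure(spread))   # True
print(tolerates_any_single_rack_failure(stacked))  # False
```

The check only models the rack failure case; it says nothing about planned updates, which is what the update domains cover.
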
Remember we have those VIPs across the top or those virtual IPs across the top of our
availability set so that it will continue to let the clients keep writing into or accessing
the servers that they need to. Those servers should be identical specifically because we
are going to move those things around. They are durable VMs, but we are still going to
move them around. The other thing was the upgrade domain. So with IaaS, it is important
to understand that we are not going to patch the guest OS, so if you spin up an IaaS VM
with Windows on it and you do not turn on automatic updates or any of those features,
we are not going to…Azure is not going to come along and patch your OS at any point,
but the thing that it will be patching is the hypervisor. So underneath the covers right,
that virtual machine is running through some hypervisor somewhere. That hypervisor has
to get patched at some point and so I hear customers sometimes say well, I am not taking
my VM down. I know when the VM is going down, so I do not need to have multiple. No, you
do need to have multiple because we are going to patch the hypervisor on that rack or that server at some
point, and it is going to bring that VM down or it is going to move it. So
you need to make sure that you have both of those in there in order for that to meet the
SLA like Mark talked about.

Now when we are talking about PaaS services, it is slightly different. PaaS does patch the guest operating system,
so the upgrade domain is a little bit more important. So if we take a look at the diagram
of this and let me pull this up, I will draw this out here a little bit, but basically
what we have is, we have our application here that is running, we will call it a Cloud Service.
This is a PaaS app. So we have our PaaS application running here. Now like he mentioned we could
have multiple of these and maybe this is in update domain 0 and this is in update domain
1, this is in fault domain 0 and this is in fault domain 1, and that is the way it would
be by default and then if you spun up a third instance, it would go into update domain 2
and it would go back into probably fault domain 0. There are only two fault domains for them
to go into, but the upgrade domains will go up to five by default. So what happens is, the fault domains work exactly the same as what we talked
about with Infrastructure as a Service. If one of these fails, we move the instance,
everybody is happy because we at least have one more that is running, but with the upgrade
domains or update domains, if we bring an app update in here, so let us say we wanted
to patch our app right and sorry about that. Let me re-draw that. So if we brought an app
update in here, we said we want to patch our application or we have an update for our application.
We can leverage the same concept of these update domains in order to do that. So we
can tell Azure when we bring this package up here. Remember, we have a CS package here that
we are gonna bring up. When we bring that up to Azure, we can tell it specifically that
we want to do an update as opposed to replace. So if we do an update, it is going to go in
and patch this server first, the one that is in 0. It is going to do whatever it needs
to do to make sure that that thing becomes healthy again. If it is a web endpoint, it
is going to make sure the web endpoint is healthy and then it is going to move on to
this one, and subsequently it is going to walk that stack, updating these one update domain at a time. It is not just
going to blast it out there all at once. Or we can do a replace, which basically means do
it all at once, do not walk the stack to do it. So that is the difference
between the two.

Now web apps are a totally different concept. So in a web app, we do not have to
think about this as much. With web application architecture, let me draw that out really
quickly here. So if we were talking about web apps, remember we have our web app here
and essentially the VM that is running the web app is being controlled by this
SQL Server and the IaaS instance that is running ARR for web apps. So if we come along and
patch this one, if Microsoft determines we need to patch this and they will because it
is a PaaS instance, they will take that VM down. The other VMs would be over here. There
is still a pool of VMs running our web apps, and what happens is, if our app was
running on this one and it goes down, the next request that comes in is going to say,
"Oh, that VM is down." Whenever this guy recognizes, "Hey, he is not there anymore," he is going
to say, "SQL, give me that package, I am going to go ahead and deploy it over here," and we
can continue to run the application. So the user notices no downtime, but it is slightly
different than our traditional PaaS.

>> Okay. And another little tip I want to mention too,
so by default you get five update domains right and we can go up to 20, but five may
be perfectly sufficient. So let us say we had seven instances, so what it will do is
once we hit those first five, it will layer the other two. So as long as it is okay and
acceptable with your application or your server for two to be down at a time and these two
to be down at a time then that is fine, but otherwise you can actually go up to 20
update domains. So that concludes our section today on fault and update domains.
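
The layering in that last tip can be sketched in a few lines of Python. This is only an illustration under an assumption: that instances are placed round-robin into update domains, which matches the seven-instance example above but is not a statement of how Azure actually places them. The function name and instance numbering are hypothetical.

```python
def layer_into_update_domains(instance_count, update_domain_count=5):
    """Group instance indexes by update domain, assuming round-robin placement."""
    groups = {ud: [] for ud in range(update_domain_count)}
    for instance in range(instance_count):
        groups[instance % update_domain_count].append(instance)
    return groups


# Seven instances over the default five update domains.
groups = layer_into_update_domains(7)
for update_domain, members in groups.items():
    print(f"update domain {update_domain}: instances {members}")

# The largest group is the most instances that go down together in one update step.
print("max down at once:", max(len(members) for members in groups.values()))
```

With seven instances and five update domains, two of the domains hold two instances each, so two instances can be down together during an update. Passing a larger `update_domain_count` (up to the 20 mentioned above) gives each instance its own domain.
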
