High Performance Computing (HPC) – Computerphile

This is two-factor authentication, so basically to get in I need my card and then I need a PIN. This is a scrambler pad, so every time you look at it the numbers are in a different order. This is the High Performance Computing facility for the University of Nottingham.

SEAN>>What do you use it for?

All sorts of things. It’s
basically to do with high-compute research. So, for example, students and researchers will use this for doing calculations based on things like fluid dynamics, aerospace, genomics, astronomy… all sorts of things, anything that requires a large amount of compute.

SEAN>>And you’ve got earplugs in today, for obvious reasons.

Yes, it’s a little bit noisy in here, yes.

SEAN>>So we will do some talking outside (LINK IN DESCRIPTION), but can you show us a bit of it before we go outside?

Certainly, yes. The main HPC facility, which we call
Minerva is… …and then we’ve got some extensions on the racks here.

SEAN>>All of these blinking lights, what’s going on? Is this data activity or processing? What’s going on there?

Both! The actual lights that you can see there, the brighter ones, those are the storage, the disk storage. The actual compute nodes don’t blink very much. The ones at the bottom there, that’s the network activity. We do shut it down for maintenance once a year for a day or so. This at the moment is the third generation of HPC. The first one… …which was installed about eleven years ago, and then we regularly refresh this.

SEAN>>So this one’s been going for how long?

This one’s been going for about four years.

SEAN>>Okay, and then I hear rumors of a
new one on the horizon?

Yes, we’re in procurement at the moment to put a replacement in.

SEAN>>And will that mean this gets ripped out and a whole new one just gets put in?

Good question. We would like to utilize as much as possible because, although it is old, there is still life left in it, and we do try to “sweat the assets”, as they say. But certainly some of this will be replaced.

SEAN>>What’s it running? Would we recognise any of the operating system or any of that?

Yes, most of the nodes are running a version of Linux and the storage is fairly standard, but above that we use PBS as our main scheduler.

SEAN>>How many people might be using this at one time?

At any one time they’re probably running hundreds of jobs.

SEAN>>Do they run for a long time? Might they be running for years? How does it work?

We wouldn’t have jobs that run for years, but certainly we could have jobs which are running for months. Most of the jobs, you know, are probably only running for days.

SEAN>>Okay, and so when you look at a system like this, can you put a figure on how much it costs?

Capital cost for a system like this
we’re probably talking in terms of about one and a half to two million pounds ($2.1m to $2.8m). The ongoing costs: we have about 250 kilowatts of air conditioning. When we run this flat out, this particular block here pulls about 70 kilowatts of power, and you’re drawing that all the time. So to run this whole facility you’re talking about thousands of pounds purely in power costs, and then of course there’s all the ongoing licensing and the support for that… So it’s not insignificant.

SEAN>>So that’s a lot of power. Is there a big red switch somewhere that someone has to pull to turn it on?

Yes, there is, and no, I’m not going to press it for you.

SEAN>>So it’s obviously a lot of equipment and looks like it might be quite complicated. Does it ever go horribly wrong? Does it ever have big problems?

Generally speaking it is pretty reliable. Individual nodes will fail. Individual disks will fail. But generally speaking the equipment itself is relatively… …modern computer equipment is inherently reliable. We probably have more problems with the air conditioning than we do with the actual compute itself.

SEAN>>So the other thing I was thinking about when you look at this: is this totally bespoke, or is it like a
template? How does it work in terms of how you buy one of these? How would you go and buy a high-performance computer?

That’s the $64,000 question. Basically you have to start to think “What do we need it for?”, because there is no one generic high performance compute job. Different departments and different research have different computing requirements. Some are very, very high performance computing, you know, it’s a lot of number crunching. For others it’s about manipulating data, so there’s a lot of data movement. For other things it’s about visualization. So the first thing you’ve got to do is to say “What is our mix of jobs?”, because the way you set it up for high analytics is a different hardware set to what you set up for visualization and things like that. So that’s the first thing you’ve got to do. You’ve then got to say “Okay, these are the jobs that we want to run.” Once you’ve actually got that, you then go to a supplier to say “Right, this is what we want to do, this is how much money we’ve got to spend. What can you give us?”

Although this is fairly old now, there is still quite a lot of life left in here. Okay, it’s not cutting edge, but it’ll still do a lot of the jobs, because a lot
of the jobs are purely about number crunching. This is perfect for that. So basically we will put the new one in. We will try and keep as much as we can of the old one so that we “sweat our assets”, and that also means that we’ve got additional capacity for our researchers to use as well. Then we will go for a gradual replacement, so as new processors come online and as new research projects come, the balance of the jobs will change. That means we may have to strip out a particular type of node and replace it with a different type of node, so that will be far more organic in the future. We’re not expecting in the future to do a complete rip and shred, unless something comes up and, you know, we build a new data center, but that’s not on the cards at the moment.

The equipment itself is fairly generic. These are standard blade enclosures. The storage is standard
storage. We have about two hundred and forty terabytes in this block here. It’s all connected up by InfiniBand.

SEAN>>Is InfiniBand a speed of network?

It’s a standard. This is 40 gigabit InfiniBand.

SEAN>>So at home you might have Gigabit, and this is 40 of those?

Yes, 40 Gigabit, yes. And of course it’s also multipath as well… …because there’s no point in doing a lot of calculations if you can’t then get the results of those calculations off. There are effectively two types of jobs. There are parallel jobs, where you’ve got a job running on multiple nodes, and then you’ve got single node jobs, where it’s all running on one node. So again, with the parallel jobs you need network connectivity to make sure you’re not processing the same bit twice.

SEAN>>So for a researcher or someone who’s part of a project, what’s the big benefit
of doing this rather than letting their office computer do it? Is it the speed of compute? Is it the fact that they can set it off and come back another day? What’s the main benefit?

Yes, it’s the capacity. Basically the job will start to run and it will then continue to run. So, for example, Christmas is a very, very busy time for us because a lot of researchers will start a job going, then come back after Christmas and pick up the data. As I say, you could do these things at home, it’s just that it would take you months or years to do what this can do in days or hours.

SEAN>>Are they ‘hot’ swappable then?

Yes, they are.

SEAN>>(Joking) Come on then, let’s pull one out…

No! They’re all single-phase power, but because the phase on this rack is different to the phase on this rack
there is the possibility of having a potential difference of more than 400 volts across the two racks. It’s unlikely because each of the… but from a “health and safety” point… And it’s exactly the same reason why you’ll see a lot of these have got laser [warning stickers], because we use laser optics.

SEAN>>For your networking?

Er, yes, the fibre…

SEAN>>And what is that, the aircon?

Nope, that is the fire suppression.

SEAN>>Oh, let’s go and have a look at that then.

The fire suppression system that we have in here is an IG55 system, which is an inert gas. It’s 50% argon, 50% nitrogen. Basically, if there is a fire in here, all of the gas in there is released in one go. That replaces about half the atmosphere in here, which takes the oxygen level down to a point where it doesn’t support combustion. It is just about breathable, but you wouldn’t want to run a marathon in it. It’s like trying to run at the top of Mount Everest.

SEAN>>So it suppresses the fire without damaging the kit?

Yes. The gas is released through these nozzles here.

SEAN>>They look like sprinklers but they’re actually gas…

Gas nozzles, yes.

SEAN>>And how does it work with the cooling? Does it go in hot one side and out
cold the other?

Yes, basically we use aisle containment. So this is the cold aisle, where we put cold air in. It then goes through the equipment. We’d expect to see a delta T of 20-odd degrees, and on the other side it basically gets vented through…

SEAN>>So through that glass is going to be 20 degrees warmer? Can we go in?

Yeah. OK, I think I’d like to spend my time on this side… If you come down here you can definitely feel the temperature difference. So these are compute nodes.

SEAN>>…and how many computers are in each one of those blocks then?

In this particular one you’ve got 1, 2, 3, 4… …8 individual blades in this blade enclosure here. You asked about the big red button? That’s the big red button.

SEAN>>That would turn it off and on?

No, that would turn it off.

SEAN>>Ah, that’s like a “Danger, danger!” Press that?

Basically, if I press that then everything will die immediately.

SEAN>>Let’s stay away from the big red button then…

But that is the big red button, yes…
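
Editorial aside: the power figure quoted in the interview lends itself to a quick back-of-the-envelope calculation. The Python sketch below is only an illustration and is not something from the video. It takes the roughly 70 kilowatts mentioned for one block running flat out and multiplies it by an electricity price. The price (`ASSUMED_PRICE_GBP_PER_KWH`) and the function names are assumptions made up for this example, so the result only shows the order of magnitude and should not be read as the facility's actual bill.

```python
# Rough running-cost sketch for the power figure quoted in the interview.
# The 70 kW (one block, flat out) figure comes from the transcript; the
# electricity price is an ASSUMED placeholder, not a figure from the source.

ASSUMED_PRICE_GBP_PER_KWH = 0.10  # hypothetical flat tariff


def energy_kwh(power_kw: float, hours: float) -> float:
    """Energy used by a constant load: kilowatts multiplied by hours."""
    return power_kw * hours


def cost_gbp(power_kw: float, hours: float,
             price_per_kwh: float = ASSUMED_PRICE_GBP_PER_KWH) -> float:
    """Cost of running a constant load for a given time at a flat tariff."""
    return energy_kwh(power_kw, hours) * price_per_kwh


if __name__ == "__main__":
    block_kw = 70.0  # one block running flat out (from the transcript)

    per_day_kwh = energy_kwh(block_kw, 24)
    per_day_gbp = cost_gbp(block_kw, 24)
    per_year_gbp = cost_gbp(block_kw, 24 * 365)

    print(f"Energy per day: {per_day_kwh:.0f} kWh")
    print(f"Cost per day at the assumed tariff: £{per_day_gbp:,.0f}")
    print(f"Cost per year at the assumed tariff: £{per_year_gbp:,.0f}")
```

At the assumed 10p per kWh, one block alone comes to 1,680 kWh and about £168 a day, which is consistent with the "thousands of pounds" remark once the rest of the facility and the cooling are included. Changing the tariff constant is the only thing needed to try a different price.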
