Ep 1. BAIM Roundtable: Research Computing & HPC at Purdue

[MUSIC] I’m really happy to be here today.
I have two guests: Geoff Lentner, a data scientist in
Research Computing, and Preston Smith,
the Director of Research Computing. So, before we begin, I just want
to remind everybody a little bit about our Master of Science in Business Analytics
and Information Management program. I’m the Academic Director for that. It’s an 11-month program. It starts in June and ends in May. It’s a STEM-certified program, meaning
that if you’re an international student, it gives you a three-year OPT extension. Basically, the program is designed to
give students a great experience with state-of-the-art information
technologies and analytical methods. Currently students are using
software tools such as MySQL, Python, R, SAS, Tableau and Hadoop. And there’s other technologies
that we’re using as well. But one of the things that a lot of
students really like about our program is these experiential
learning projects with companies, and this year we’ve actually
had quite a bit of success. We had 19 posters accepted at Antwerp’s
Business Analytics Conference, and there were only 33 spots for
student posters, so that was insane. Our teams won first place and third place
in the poster competition. We had another team that got selected for the top eight at SAS Global Forum, and
we’ve done a lot of great stuff as far as experiential learning
projects with companies but we could not do what we do here at Purdue
without these two guys right here. So these are the guys that kind of make
sure we have the architecture and platform to be successful. Preston or Geoff, could you tell me a little bit about
Research Computing and your role there?>>That’s good for you.>>Is that for me?
>>Yeah.>>So the Research Computing group here at
Purdue provides central supercomputing services and support for
researchers all over the Purdue campus. We have approximately 185 faculty
investors who invest their own startup or grant money into using our supercomputing
resources, and over 1,200 active users who have used those systems
to get some kind of science done.>>One of the platforms or
one of the, I guess, capabilities that we use from you guys,
not only for our classroom projects but also for some of these industry
practicum projects, is Scholar. Can you tell the audience a little
bit about what Scholar’s all about?>>So Scholar is effectively a small
version of one of our big supercomputers that’s dedicated entirely to
instructional use. Over recent years we’ve seen more
and more instructors wanting to teach the concepts of high-performance computing
or data analytics relevant to their domain. Whether they’re teaching bioinformatics,
or weather modeling, or business analytics, everybody at some
level needs high-end data storage and high-end computing to be able to do
computational science in their field. So since it started in about 2012, it’s
grown from just a couple of classes to over 50 classes using it in each of
the last couple of semesters; over 3,000 students used Scholar
during the spring semester of 2019.>>I can tell you 82 of those
3,000 students were our students, and I know we use it quite a bit. So last year we purchased and tested a high-performance database
server from you guys, that we use for both classroom projects, and
projects with industry partners. We felt we needed a secure
environment that was our own resource. Were you guys surprised at all about
the usage from Krannert and, say, our Master’s in
Business Analytics students?>>I wouldn’t say we
were surprised per se, but historically our user community has been
from the physical sciences and engineering. So their computations and
their needs are all of a certain kind; they’re doing modeling
and simulation mostly. We’ve been well aware of the amount of
data-intensive data science work that happens over here in Krannert, and
they’re driven by databases, they’re driven by
processing large text files which aren’t necessarily things that
we’ve historically done a lot of. But as the scale of data
analytics needs grows, they’ll start to edge closer to
high performance computing and high performance computing is going
to move closer to data analytics. So it seems like a logical intersection.>>Yeah I totally agree. So some of the projects we worked on this
year, we had great performance, but there were some projects where I felt like maybe
we needed a little more under the hood. So this coming year we’re actually
going to buy into one of your community clusters, and it’s
going to be dedicated just to our program. Can you tell the audience a little
bit about your community clusters?>>So the community cluster program
is how we’ve been able to build large supercomputers in a cost-effective
manner for the university to use. Many of them are broadly
usable supercomputers that 85% of the users on campus
can take advantage of. But several of them are kind of
tailored to the particular community that needs them. For example,
we have one system that was designed with every node having large amounts of memory. So if you need to solve a very big
analytics problem in bioinformatics, that’s a great system for that. We just built one named Gilbreth
here in 2019, which is a 50-node system with 100 GPU accelerators in it,
which is perfectly suited for doing large machine learning and
artificial intelligence applications.>>Say you’re working at a company,
we have some partners that maybe they’re not as analytically
mature as, say, a company like Amazon. Most companies aren’t there yet, but
I know a lot of our partners are trying to get to that point, they’re trying to grow
their data science capabilities, and to do that they’re considering investing
in some of these high-performance systems. How would you say Research
Computing’s architecture and resources compare to what’s
state of the art out in the industry?>>Yeah, so,
one of the key differences there: a lot of companies that are growing
and developing a data science team or a data analytics practice
will get the resources and
infrastructure they need from
something like Amazon or Azure. And the key difference
here is that there’s so much less complexity involved; there are
no credits, no trying to stitch together a
spun-up virtual machine with some other kind of data system,
whether that’s storage or a database.
Here on campus at Purdue,
these are real physical machines that are on campus, that students and
researchers get direct access to. There’s no virtualization. It’s bare metal, and it’s always up. It’s cost effective too
because it’s not per CPU hour. You buy into a shared resource at
a quite frankly very affordable price in comparison to some of
the commercial cloud options. So students get direct access; they can
log in directly on the command line or from an application like
RStudio Server or JupyterHub, and there’s no wait time;
it’s always right there.>>Yeah, and just from my experience
mentoring a lot of these different projects this year, not only do we have
great access to RStudio Server, Jupyter, and Python,
we were also able to interface quite seamlessly with other tools such as SAS,
Tableau, and other things. So if there are companies out there
considering a big data analytics kind of project with us,
something that might require high-performance computing, just know that
we do have the resources here because of Research Computing to scope out those
kinds of projects with our students. Is there anything else that you’d like to
let people know about Research Computing that kind of separates you or
highlights what you guys can do?>>Yeah.>>Just to build on what Geoff was
saying: if somebody goes out to create an environment like this
on their own, say by building something on the Amazon cloud,
they stand up their own resources and it’s all up to them to solve
the problems themselves. What I always tell people on campus
is the key value that we provide in research computing is not
just the infrastructure. Like Geoff says, it’s very large,
it’s always on, and it’s very reliable, but we also have a team of
dedicated staff like Geoff, from a variety of scientific domains,
who are there to consult, help solve your problems,
translate your science needs into data analytics solutions, and
help you get science done.>>Okay, well, thanks a lot, guys. The MSBA program is so
thankful to work with you, and I’m looking forward to continuing to grow
our relationship in the future.>>Yeah. [MUSIC]
