Exascale Computing – fact or fiction?

Hello, my name is Shekhar Borkar. I'm from Intel Corporation. The title of my talk today is Exascale Computing – fact or fiction. Here's an outline of my talk. I'll start with the complete roadmap and technology outlook, followed by challenges and solutions for the compute, memory, and interconnect stacks, as well as the software stack. And then I'll summarize.

Let's look at the compute roadmap. We want
to go to Exa. Between Giga and Exa there were two stops, called Tera and Peta. What I'm showing here is a compute roadmap. In the 80s we were at Gigascale systems, in the 90s we were at Terascale systems, in the last decade we have been at Petascale systems, and at this rate we want to have an Exascale system by the end of this decade. What this shows is where the relative performance of the system comes from. From Giga to Tera, around 32x of the performance came from transistors alone, and another 32x came from parallelism. From Tera to Peta, 8x came from transistors and 128x came from parallelism. From Peta to Exa, expect only about 50% more to come from transistors, with the dominant 670x coming from parallelism. As a result, we believe that system performance in the future will really come from parallelism, especially considering that we need 1000x performance every decade.
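
As a back-of-the-envelope check (a sketch, using only the factors quoted above), the decomposition multiplies out to roughly the required 1000x per decade:

    # Per-decade performance decomposition quoted above: each decade needs ~1000x.
    decades = {
        "Giga -> Tera": (32, 32),     # transistor gain x parallelism gain
        "Tera -> Peta": (8, 128),
        "Peta -> Exa":  (1.5, 670),   # ~50% more from transistors, rest from parallelism
    }
    for step, (transistors, parallelism) in decades.items():
        print(f"{step}: {transistors} x {parallelism} ~= {transistors * parallelism:.0f}x")
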
Let's look at the energy. Where is the energy consumed? Let's take a Teraflop system today. It consumes around 50 pJ of energy per operation for the compute alone – that's about 50 W. Then you add the memory – about 1.5 nJ per byte – which amounts to about 150 W. Then you add the communication, you add the disk, and the total system is of the order of 1 kW for a Teraflop system. Where does the remaining 600 W go? It goes into instruction decode, the control logic, address translation, and mostly into the bloated features that were added in over the years. Our goal here is to bring this power down to something like 20 W.
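
To make that budget concrete, here is a minimal back-of-the-envelope sketch; the 0.1 bytes of memory traffic per operation is my assumption, chosen so that 1.5 nJ per byte reproduces the roughly 150 W quoted above.

    # Back-of-envelope power budget for a 1-Teraflop (10^12 op/s) system.
    ops_per_s = 1e12
    compute_pj_per_op = 50.0   # 50 pJ per operation for the compute
    mem_nj_per_byte = 1.5      # 1.5 nJ per byte of memory access
    bytes_per_op = 0.1         # assumed memory traffic: ~0.1 byte per operation

    compute_w = ops_per_s * compute_pj_per_op * 1e-12                 # ~50 W
    memory_w = ops_per_s * bytes_per_op * mem_nj_per_byte * 1e-9      # ~150 W
    print(f"compute ~{compute_w:.0f} W, memory ~{memory_w:.0f} W")
    # Communication, disk, and other overheads push the total to roughly 1 kW;
    # the target is to bring the compute down to ~20 W, i.e. 20 pJ per operation.
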
That's a tall order, and that was exactly the challenge posed by the DARPA program called UHPC – Ubiquitous High Performance Computing. What it said was: “give me an Exascale system under 20 MW in a data center; if you can do that, you can have a Petascale system in a cabinet that you can put in an airplane, or a 20 W Terascale system, or a Gigascale system embedded in a toy helicopter; more importantly, you could have a Megascale system consuming micro-watts of power, which can be embedded in your body.” All of this is possible if, at the system level, we bring the energy consumption down to 20 pJ per operation.
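
The arithmetic behind those scales is simply energy per operation times operation rate; a small sketch, assuming a flat 20 pJ per operation throughout:

    # 20 pJ/operation applied across the scales of the UHPC challenge.
    ENERGY_PER_OP = 20e-12   # joules

    def pretty_watts(w):
        # crude unit formatting for readability
        for unit, scale in (("MW", 1e6), ("kW", 1e3), ("W", 1.0), ("mW", 1e-3), ("uW", 1e-6)):
            if w >= scale:
                return f"{w / scale:g} {unit}"
        return f"{w:g} W"

    scales = {
        "Exascale  (10^18 op/s, data center)": 1e18,
        "Petascale (10^15 op/s, cabinet)":     1e15,
        "Terascale (10^12 op/s, module)":      1e12,
        "Gigascale (10^9 op/s, toy)":          1e9,
        "Megascale (10^6 op/s, implanted)":    1e6,
    }
    for name, rate in scales.items():
        print(f"{name}: {pretty_watts(ENERGY_PER_OP * rate)}")
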
How do we do this? Let's look at voltage scaling. If we have a circuit and we reduce the supply voltage, we know that the frequency of operation goes down almost linearly, the power goes down cubically, and the leakage increases a little; but overall, if you look at the energy efficiency of such a system, it increases by almost an order of magnitude.
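
A minimal toy model of why efficiency peaks near threshold (the constants VT, CEFF, and ILEAK are illustrative assumptions, not measurements): dynamic energy per operation scales roughly as V squared, while leakage contributes a power that must be paid over an ever-longer cycle time as frequency drops.

    # Toy model: energy per operation vs. supply voltage (illustrative constants only).
    VT = 0.35       # assumed threshold voltage, volts
    CEFF = 4e-11    # assumed effective switched capacitance per operation, farads
    ILEAK = 2e-4    # assumed leakage current, amps

    def energy_per_op(vdd):
        freq = 1e9 * max(vdd - VT, 0.01)   # frequency falls roughly linearly with Vdd
        dynamic = CEFF * vdd * vdd         # dynamic energy per op ~ C * V^2
        leak = (ILEAK * vdd) / freq        # leakage power paid over one (longer) cycle
        return dynamic + leak

    for vdd in (1.1, 0.9, 0.7, 0.5, 0.45, 0.4, 0.38):
        e = energy_per_op(vdd)
        print(f"Vdd={vdd:.2f} V: {e * 1e12:5.1f} pJ/op, {1e-9 / e:6.1f} Gops/J")

With these made-up constants the minimum energy lands just above VT and the gain is roughly 6x; the point is the shape of the curve, which is what the measurements discussed next show.
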
This was demonstrated experimentally and discussed at ISSCC 2008, with a very simple accelerator. What you see is that as the supply voltage is reduced, the maximum frequency of operation reduces, and the total power consumption also reduces. But notice the energy efficiency on the graph to the right: it peaks near the threshold voltage of the transistor, at almost an order of magnitude higher.

To demonstrate this further, there was another
paper that was discussed at ISSCC 2012. Here, they showed a Pentium processor redesigned on a 32nm High-K Metal Gate process; you can see the die photo to the left, with the package and the custom interposer. This processor was put into a legacy Socket 7 motherboard running Windows and Linux operating systems, and the results are very interesting.

This showed that if you can do the design
from day 1 for a wide dynamic range, starting from the maximum supply voltage down to close to zero, the energy efficiency of such a design starts increasing near the threshold voltage by almost 5x. And here are the results. In the high-performance mode at full voltage, this processor runs at about 1 GHz, consumes about 1 W, and delivers about 1200 MIPS/W in energy efficiency. The same processor, when used in the energy-efficient mode, runs at 60 MHz under 10 mW. Compare this to the original Pentium circa 1992, which also ran at 60 MHz but consumed 15 W. And if you want really low power, you can run the same circuit in an ultra-low-power mode at 280 mV, where it consumes only about 2 mW. So we know how to design computational circuits with extreme energy efficiency.
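
A quick comparison using only the numbers quoted above, with one stated assumption: that instruction throughput scales roughly with clock frequency.

    # Efficiency comparison for the NTV Pentium, using the figures quoted above.
    hp_mips_per_w = 1200.0              # high-performance mode: ~1 GHz at ~1 W
    hp_mips = hp_mips_per_w * 1.0       # ~1200 MIPS at 1 W
    ee_mips = hp_mips * (60e6 / 1e9)    # assume MIPS scale with frequency: ~72 MIPS at 60 MHz
    ee_mips_per_w = ee_mips / 10e-3     # energy-efficient mode runs under 10 mW

    print(f"Energy-efficient mode: ~{ee_mips_per_w:.0f} MIPS/W, "
          f"~{ee_mips_per_w / hp_mips_per_w:.0f}x the full-voltage efficiency")
    print(f"Power at 60 MHz vs. the original Pentium: ~{15.0 / 10e-3:.0f}x lower")

The roughly 6x figure from this crude scaling is in the same ballpark as the ~5x efficiency gain quoted above, and the power comparison at equal frequency works out to about 1500x.
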
We can also integrate the power delivery. What this shows is the integration of a voltage regulator on the chip: you have a load chip, which is the processor, and you have a converter, which is the voltage regulator. Such a voltage regulator, when integrated using deep sub-micron technologies, also has the advantage of very good efficiency. So power delivery brought closer to the load can improve the efficiency of the system, and you can also do very fine-grain power management.

This was discussed in a paper in 2010, where
what you have is an experimental 80-core chip with three modes: a normal mode, where everything is active; a standby mode, where the logic is off and the memory is on – you get almost a 50% power saving and the wake-up time is also very fast; and a sleep mode, where the logic and memory are both turned off and the power savings are about 80%. By using these modes dynamically, along with 21 dynamic sleep regions within each tile, you can put different cores into different modes. Of course, you need software to play with this, but overall, at the system level, it improves your system efficiency by almost 60%. Please read this paper for further details.
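
A minimal sketch of how such mode mixing translates into chip-level savings; the per-mode power fractions come from the figures above, while the particular mix of cores is a made-up example.

    # Toy chip-power model for dynamic per-core power modes.
    MODE_POWER = {"active": 1.0, "standby": 0.5, "sleep": 0.2}   # fraction of active-core power

    def chip_power(mode_counts):
        # mode_counts: number of cores in each mode (hypothetical mix of the 80 cores)
        return sum(MODE_POWER[mode] * count for mode, count in mode_counts.items())

    all_active = chip_power({"active": 80})
    mixed = chip_power({"active": 20, "standby": 20, "sleep": 40})
    saving = 1.0 - mixed / all_active
    print(f"all active: {all_active:.0f}, mixed: {mixed:.0f} ({saving:.0%} saving)")

The ~60% system-level figure in the paper comes from driving these modes from the workload; this sketch only shows the static arithmetic.
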
What about memory? 3D integration of DRAM and logic – what we call heterogeneous integration – has a lot of merit. Think about a package with a logic buffer chip on it, where the logic technology is optimized for high-speed signaling, implements energy-efficient logic circuits, and can also implement intelligence. On top of that, you integrate a DRAM stack, where the DRAM technology is optimized for memory density and low cost. So heterogeneous 3D integration provides the best of both worlds.

Let's quickly discuss interconnect. First, here is
the on-die interconnect. This shows technology generations on the x-axis and the relative scaling on the y-axis. If you look at the compute energy, starting at the 90nm technology node and projected down to 7nm, it goes down by almost 6x. But what about the interconnect? The on-die interconnect energy per mm (why per mm? because the size of the system doesn't change) reduces by only 60%. So, at the system level, the interconnect energy per mm reduces more slowly than the compute energy, and as a result, on-die data movement energy will start to dominate over the compute energy.
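
In relative terms (a back-of-the-envelope restatement of the two factors just quoted):

    # Relative scaling from 90nm to 7nm, using the factors quoted above.
    compute_scale = 1.0 / 6.0   # compute energy drops ~6x
    wire_scale = 1.0 - 0.60     # on-die wire energy per mm drops by only ~60%

    print(f"wire-to-compute energy ratio grows by ~{wire_scale / compute_scale:.1f}x")
    # If moving a word 1 mm cost about as much as computing on it at 90nm,
    # it costs ~2.4x as much at 7nm -- data movement begins to dominate.
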
This was discussed in the context of networks on chip, in two papers in 2010. The first is the 80-core Teraflop test chip, discussed in 2006, with a mesh interconnect: the network itself consumed almost 30% of the power. The second is the 48-core Single-chip Cloud Computer, discussed in 2009, where the network consumed only about 10% of the power, simply because we were smarter about the interconnect: two cores were clustered per router, giving a 6×4 mesh rather than a 6×8 mesh. So we have to be really clever about building networks on a chip.

One more innovation that was shown was a
circuit-switched network on a chip. Here you have a narrow, high-frequency, packet-switched mesh network that is used to establish a circuit. Once the circuit is established, the data is transferred over a wide, slower, circuit-switched bus. Such a circuit-switched bus is differential and low-swing, and as a result improves energy efficiency. This paper, at ISSCC 2010, showed a 2-3x increase in energy efficiency over a traditional packet-switched network.
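
A hypothetical sketch of why the split pays off (not the test chip's actual protocol): a narrow request reserves a path hop by hop over the packet network, and the payload then streams over the reserved wide, low-swing circuit. The energy constants and hop count below are invented for illustration; the paper's measured gain was the 2-3x quoted above.

    # Hypothetical energy comparison: pure packet switching vs. circuit switching.
    HOPS = 8                  # routers/hops traversed across the mesh (assumed)
    E_ROUTER_PER_FLIT = 1.0   # energy units per flit per router on the packet network (assumed)
    E_CIRCUIT_PER_FLIT = 0.3  # energy per flit per hop on the wide low-swing circuit (assumed)

    def packet_switched(flits):
        return flits * HOPS * E_ROUTER_PER_FLIT

    def circuit_switched(flits):
        setup = 1 * HOPS * E_ROUTER_PER_FLIT        # one narrow request flit reserves the path
        stream = flits * HOPS * E_CIRCUIT_PER_FLIT  # data streams over the established circuit
        return setup + stream

    for flits in (4, 16, 64):
        ps, cs = packet_switched(flits), circuit_switched(flits)
        print(f"{flits:3d} flits: packet={ps:5.0f}, circuit={cs:5.1f}, gain {ps / cs:.1f}x")

The gain grows with transfer size, since the one-time setup cost is amortized over the low-energy bulk transfer.
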
So in the future, you have to be a little clever about interconnect structures: use buses over short distances, shared memory as a shared switch, or crossbar switches for longer distances. If none of these work, then you can start thinking about packet-switched networks, extended to the board, cabinet, and full-system level. So there is hierarchy and heterogeneity throughout the system when it comes to the network.
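
As a hypothetical illustration of that hierarchy (the tiers and their names are mine, not from the talk):

    # Hypothetical mapping from communication scale to interconnect choice,
    # following the hierarchy described above.
    FABRIC_BY_SCALE = {
        "within a cluster of cores": "bus, or shared memory acting as the switch",
        "across the die":            "crossbar, or circuit-switched links",
        "board":                     "packet-switched network",
        "cabinet":                   "packet-switched network",
        "full system":               "packet-switched network",
    }

    for scale, fabric in FABRIC_BY_SCALE.items():
        print(f"{scale:>26}: {fabric}")
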
Finally, the system must be harmonized through hardware-software co-design. Look at the entire stack: at the top you have the applications, and at the bottom you have the circuits, the design, and the process technology. We need to start thinking about how to get guidance from the applications and the software stack for efficient system design. At the same time, we, the circuit designers, have to study the limitations and the issues, identify the opportunities that can be exploited in the future, and feed that guidance all the way to the top of the stack.

Putting all these things together, to summarize
this: the power and energy challenge will continue, so we must opportunistically employ near-threshold voltage (NTV) operation; 3D integration of memory has a lot of promise; communication energy will far exceed computation energy, meaning that data locality will be paramount; and finally, at the system level, a revolutionary software stack will be needed. All of this together will make Exascale real, and not fiction. So thank you very much for your attention.
