# **Computing Beyond Moore's Law**

## John Shalf

Department Head for Computer Science, Lawrence Berkeley National Laboratory



### Technology Scaling Trends


BERKELEY LAB

*Exascale in 2021... and then what?* 



# **Moore's Law IS Ending**

#### Hennessy / Patterson



[Figure era labels: Multiple chips in minicomputers; Single microprocessors; Multicore microprocessors]



Many unproven candidates yet to be invested at scale. Most are disruptive to our current ecosystem.


# The Future Direction for Post-Exascale Computing



# **Specialization:**

#### Nature's Way of Extracting More Performance in a Resource-Limited Environment

#### **Powerful General Purpose**





#### Xeon, Power

KNL, AMD, Cavium/Marvell, GPU

#### Many Different Specialized (Post-Moore Scarcity)



#### Apple, Google, Amazon



## **Extreme Hardware Specialization is Happening Now!**



#### Large-Scale Datacenters Also Moving to Specialized Acceleration: The Google TPU



## **Amazon AWS Graviton Custom ARM SoC** (and others)

#### **AWS Graviton2 processor**

- 4x the vCPUs
- 7x CPU performance
- ~2x performance/vCPU
- ~30 Billion transistors
- 7nm





#### **AWS CEO Andy Jassy:**

"AWS isn't going to wait for the tech supply chain to innovate for it and is making a statement with performance comparisons against an Intel Xeon-based instance. The EC2 team was clear that Graviton2 sends a message to vendors that they need to move faster and AWS is not going to hold back its cadence based on suppliers."

# **Hardware Generators:** Enabling Technology for Exploring Design Space

Together with Close Collaborations with Applied Math & Applications

Co-Develop Hardware and Algorithm



#### Research platform: 96-core Tiled CPU on FPGA SC2016 Demo (accidentally Sunway-like architecture emulation)







## 2 people spent 2 months to create

- Z-Scale processors connected in a Concentrated Mesh
- 4 Z-scale processors
- 2x2 Concentrated mesh with 2 virtual channels
- Micron HMC Memory

http://www.codexhpc.org/?p=367



# **Putting Architecture Specialization to work for**

- But what are the right specializations to include?
- What is the cost model? (We know we cannot afford to spin our own chips from scratch.)
- Leverage the Open Source and ARM IP Ecosystem:
  - IP is the commodity (not the chip)!!!
- What is the right partnership/economic model for the future of HPC?





# Project 38 -- Background

DOD and DOE recognize the imperative to develop new mechanisms for engagement with the vendor community, particularly on architectural innovations with strategic value to USG HPC.

Project 38 (P38) is a set of vendor-agnostic architectural explorations involving DOD, the DOE Office of Science, and NNSA (these latter two organizations are referred to in this document as "DOE"). These explorations should accomplish the following:

- **Near-term goal:** Quantify the performance value and identify the potential costs of specific architectural concepts against a limited set of applications of interest to both the DOE and DOD.
- **Long-term goal:** *Develop an enduring capability for DOE and DOD to jointly explore architectural innovations and quantify their value.*
- **Stretch goal:** Specification of a shared, purpose-built architecture to drive future DOE-DOD collaborations and investments (purpose-built HPC by 2025).



# Recapping Key P38 Technology Features

- Fixed Function Accelerators & COTS IP (Extreme Heterogeneity)
  - RISC-V and ARM cores
  - Fixed function FFT (Generated by SPIRAL)
- Word Granularity Scratchpad Memory (Gather Scatter):
  - Gather-scatter within processor tile
  - more effective SIMD
- Recoding engine (Efficient programmable FSM & data reorg.)
  - Sub-word granularity and high control irregularity
  - Handles branch-heavy code (avg. 20x improvement over processor core)
  - One lane is 1/100<sup>th</sup> the size of an x86 processor core
- Hardware Message Queues (Lightweight Interprocessor Communication)
  - Gather-scatter between processor tiles
  - Async between tiles to eliminate overhead of barriers

George Fann & Yuan Zheng

#### **General-Purpose:** Tensor Contractions on Word Granularity SPM



# Create Hardware Features to Accelerate Broadly used Numerical Algorithm Primitives

[Figure: block diagram with memory, slice, and ARB blocks]

- Accelerate commonly used primitives for interprocessor communication
  - Queues & DAGs commonly used in pseudocode
  - Why not make them REAL? (in design library)



### **Sparse Matrix Trisolve (refresher)**

#### Currently Use OMP Atomic to track dependencies
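The dependency tracking mentioned here can be illustrated with a minimal sketch. It assumes a simple dict-of-rows sparse format and a single-threaded ready queue; the function name and data layout are hypothetical and are not SuperLU's. Each row keeps a counter of unresolved dependencies, and the decrement marked below is the step that an OpenMP version would protect with an atomic.

```python
from collections import deque

def sparse_lower_trisolve(rows, diag, b):
    """Solve L x = b for a sparse lower-triangular L.

    rows[i] -- dict {j: L[i][j]} of strictly-lower entries (j < i)
    diag[i] -- diagonal entry L[i][i]
    """
    n = len(b)
    # pending[i]: number of x[j] values row i is still waiting for
    pending = [len(rows[i]) for i in range(n)]
    # dependents[j]: rows that consume x[j]
    dependents = [[] for _ in range(n)]
    for i in range(n):
        for j in rows[i]:
            dependents[j].append(i)

    rhs = list(b)
    x = [0.0] * n
    # Rows with no dependencies can be solved immediately
    ready = deque(i for i in range(n) if pending[i] == 0)
    while ready:
        j = ready.popleft()
        x[j] = rhs[j] / diag[j]
        for i in dependents[j]:
            rhs[i] -= rows[i][j] * x[j]
            pending[i] -= 1  # the atomic decrement in a threaded version
            if pending[i] == 0:
                ready.append(i)
    return x

# L = [[2, 0, 0], [1, 1, 0], [0, 3, 4]], b = [2, 3, 10]
x = sparse_lower_trisolve([{}, {0: 1.0}, {1: 3.0}], [2.0, 1.0, 4.0], [2.0, 3.0, 10.0])
```

In this sketch the `ready` queue plays the role that a message queue between tiles could play: a finished row notifies its dependents instead of having them poll a shared counter.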





### Example of CoDevelopment of Hardware and Software: SuperLU Dependency Tracking




# Benefit of MsgQ's on KNL-like architecture





# Recoding Engine, Chien (ANL)

#### **Recode:** Regex 1-lane Performance and Energy Efficiency



- 7x faster per lane than x86, 64 lanes => ~450x faster than single x86 thread
- Recode engine (UDP) scales to ~150 Gbps for a 64-lane Recode engine (<<1 watt total)
- 128 tile chip could achieve 20 Tbps total line rate; 256 tiles => 40 Tbps
- Large pattern sets supported with NFA, and scale-out
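The headline numbers in this list can be reproduced with simple arithmetic. The sketch below assumes linear scaling and one 64-lane engine per tile; the slide does not state the per-tile configuration, so that mapping is an assumption.

```python
LANES_PER_ENGINE = 64
PER_LANE_SPEEDUP = 7   # vs. one x86 thread (from the slide)
ENGINE_GBPS = 150      # line rate of a 64-lane engine (from the slide)

# 7x per lane across 64 lanes: 448x, which the slide rounds to ~450x
speedup_64_lanes = PER_LANE_SPEEDUP * LANES_PER_ENGINE

def chip_tbps(tiles, engine_gbps=ENGINE_GBPS):
    """Aggregate line rate in Tbps, assuming one engine per tile (assumption)."""
    return tiles * engine_gbps / 1000.0

rate_128 = chip_tbps(128)  # 19.2 Tbps, rounded to ~20 Tbps on the slide
rate_256 = chip_tbps(256)  # 38.4 Tbps, rounded to ~40 Tbps on the slide
```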


Extreme, Scalable Regex at 10-40 Tbps

# **SNAPPY: Sparse Matrix Compression Accelerator**



## Fixed Function Accelerators Design Study: Dark Silicon

- Adopt smartphone SoC strategy: mix fixed-function accelerators with programmable cores
- Target commonly used scientific primitives/libraries
  - BLAS (level 1,2,3)
  - FFT (FFTW or SPIRAL interface)








## FFT Example With FFTX (Franchetti, Popovici, Canning)



#### For FFT of size N

- Storage = N \* operand\_size
- Compute = 5/2 \* N \* log2(N) FLOPs
- Use Pseudo-2D algorithm for large FFTs
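As a worked instance of the two formulas above, the sketch below evaluates them for the benchmark case stated on this slide (2M points, single-precision complex). Treating "2M" as 2^21 and using 8 bytes per operand are assumptions made for illustration.

```python
import math

def fft_storage_bytes(n, operand_size):
    """Storage = N * operand_size."""
    return n * operand_size

def fft_flops(n):
    """Compute = 5/2 * N * log2(N) floating-point operations."""
    return 2.5 * n * math.log2(n)

N = 2 * 1024 * 1024   # "2M point" taken as 2**21 (assumption)
OPERAND_SIZE = 8      # single-precision complex: 2 x 4 bytes

storage = fft_storage_bytes(N, OPERAND_SIZE)  # 16 MiB
flops = fft_flops(N)                          # about 1.1e8 FLOPs per transform
```

At 16 MiB the operands do not fit in a 16k-point on-chip engine, which is consistent with the slide's "data off-chip" assumption and its use of a pseudo-2D algorithm.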

#### **Single FFT Accelerator Resource**

#### Assumptions: Spiral HW Generator

- 1GHz @ 14nm technology node
- 2M point transform (data off-chip)
- HPC Challenge Benchmark: Single precision (Float32) complex, out-of-place

#### Limit: 100 GB/s off-chip memory

- 16k points on-chip engine
- Analytic model for FP limit ~1.5 TFLOPs SP
- 4.5 mm<sup>2</sup> area for compute @ 14nm

#### Limit: 1 TB/s off-chip memory

- ~10k MADD + ~5k add -> 15k FP @ 1GHz
- Analytic model for FP limit ~15 TFLOPs SP
- 47 mm<sup>2</sup> area for compute @ 14nm
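As a rough check of the 1 TB/s case, assume each of the ~15k floating-point units retires one operation per cycle (this appears to be how the slide's 15k FP figure is counted; it is an assumption here). The peak rate at 1 GHz is then:

$$15 \times 10^{3}\ \tfrac{\text{FP ops}}{\text{cycle}} \times 1 \times 10^{9}\ \tfrac{\text{cycles}}{\text{s}} = 1.5 \times 10^{13}\ \tfrac{\text{FP ops}}{\text{s}} = 15\ \text{TFLOP/s}$$

The area figures scale similarly: $47 / 4.5 \approx 10.4$, in line with the 10x increase in off-chip bandwidth and FP limit.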



# FFT Radix 2 RTL generated by SPIRAL – @14nm

Run RTL through synthesis to get accurate power/area/timing



#### Chip-layout at 14nm using Mentor Design Synthesis Flow

- Shows 2x density improvement over the analytic model, but 2x lower clock
- Floating-point multiplier is the critical path at around 1900 ps, leading to
  - 500 MHz design for standard-cell-based synthesis
  - Improved StdCell library (better than OpenSDK) could result in further improvements
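The clock rate follows from the critical path. Ignoring setup time and clock skew (a simplifying assumption), the maximum frequency is the reciprocal of the critical-path delay:

$$f_{\max} \approx \frac{1}{t_{\text{crit}}} = \frac{1}{1900\ \text{ps}} \approx 526\ \text{MHz}$$

This rounds to the ~500 MHz quoted above, half of the 1 GHz assumed in the analytic model.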



# Results for RISC-V FFT Accelerator for CryoEM

Benchmarking FFT Accelerator for CryoEM (Donofrio, Fard)





Detector / Microscope Installation Year



| Instruction | opcode[3:2] | Description                     |
|-------------|-------------|---------------------------------|
| fft_config  | 10b         | Configures FFT parameters       |
| fft_status  | 01b         | Reads FFTAccel status registers |
| fft_start   | 11b         | Starts FFT processing           |
| fft_stop    | 00b         | Stops FFT processing            |
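The table can be read as a two-bit selector inside the opcode. The sketch below decodes bits [3:2] into the instruction names listed above. The helper name `decode_fft_op` is hypothetical, and the remaining instruction bits are ignored because the slide does not give the rest of the encoding.

```python
# Selector values taken from the opcode[3:2] column of the table
FFT_OPS = {
    0b10: "fft_config",
    0b01: "fft_status",
    0b11: "fft_start",
    0b00: "fft_stop",
}

def decode_fft_op(opcode):
    """Return the FFT instruction name selected by opcode bits [3:2]."""
    selector = (opcode >> 2) & 0b11
    return FFT_OPS[selector]

# 0b1100 has bits [3:2] == 0b11, which selects fft_start
name = decode_fft_op(0b1100)
```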





#### **Created RISC-V Core with FFT ISA Extension**

#### **RISC-V+FFT Accel 126x faster than x86 host**

- FFT on Intel Core i7-5930K @ 3.50GHz: ~265 ms
- FFTAccel (Floating): ~2.10 ms
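The 126x figure follows directly from the two timings on this slide:

$$\text{speedup} = \frac{265\ \text{ms}}{2.10\ \text{ms}} \approx 126$$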



# **Full Measure**

Full Custom Acceleration for Targeted Science (Industrializing use of Anton or GRAPE-like technology)



# **FPGA vs. ASIC**



- Cost for first FPGA (NRE): \$2,500-\$7,500
- Cost for 20,000<sup>th</sup>: \$2,500-\$7,500
- Clock rate: 0.1-0.3 GHz



10x-100x FPGA

**Energy Efficiency :** 

### **Example Algorithm-Driven Design of Hardware Accelerators**

#### Example: LS3DF/Density Functional Theory (DFT)

- What: Design the hardware accelerator around the target algorithm/application
  - Purpose-built acceleration
  - Lab-led reference design
- Why: Huge opportunities to improve performance density and efficiency
  - FFT hardware accelerator 50x-100x higher performance density than GPU or CPU+SIMD (using SPIRAL generator)
- How: Use Density Functional Theory (DFT) as the target for this experiment
  1. Large fraction of the DOE workload
  2. Mature code base and algorithm
  3. LS3DF formulation minimizes off-chip communication and scales O(N)



## The DFT kernel for each fragment

*Communication Avoiding LS3DF Formulation – Scales O(N)* 



$$\sum_{i,j,k} \left\{ F_{222} + F_{211} + F_{121} + F_{112} - F_{221} - F_{212} - F_{122} - F_{111} \right\}$$
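The signed combination of fragments above can be sanity-checked with a small counting experiment. This sketch assumes a periodic grid and that a fragment with subscript $s_1 s_2 s_3$ spans $s_1 \times s_2 \times s_3$ cells starting at each grid corner; the function name is hypothetical. It confirms that the positive and negative terms cancel so that every cell is counted exactly once.

```python
from itertools import product

def ls3df_coverage(m1, m2, m3):
    """Net signed count of how often each cell of a periodic grid is covered.

    For every corner (i, j, k), fragments of size s1 x s2 x s3 with each s in
    {1, 2} are added with sign (-1)**(number of dimensions where s == 1),
    which matches the signs of the eight F terms in the formula.
    """
    dims = (m1, m2, m3)
    count = {cell: 0 for cell in product(range(m1), range(m2), range(m3))}
    for corner in product(range(m1), range(m2), range(m3)):
        for size in product((1, 2), repeat=3):
            sign = (-1) ** size.count(1)
            # Visit every cell inside this fragment, wrapping periodically
            for offset in product(*(range(s) for s in size)):
                cell = tuple((c + o) % m for c, o, m in zip(corner, offset, dims))
                count[cell] += sign
    return count

coverage = ls3df_coverage(3, 3, 3)
all_once = all(v == 1 for v in coverage.values())
```

Per dimension, the size-2 fragments cover a cell twice and the size-1 fragments once with opposite sign, so the net count is 2 - 1 = 1 in each dimension and 1 overall.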

#### **Von-Neumann Instruction Processors vs. Hardware Circuits**

(must redesign for static dataflow and deep flow-through pipelines)



# **Algorithm Reformulated as Custom Circuit**



### **Architecture Specialization for Science**

(Hardware is designed around the algorithms; can't design effective hardware without math)








# **Post CMOS Device Technology**

Accelerating the pace of discovery for the future of microelectronics



## Many Options for New Device Technology

but few satisfy Borkar-Shalf Criteria (2013-2015 viewpoint)

1. Gain
2. Signal to Noise
3. Scalability
4. Manufacturability





**OSTP Report 2015:** John Shalf, Robert Leland, and Shekhar Borkar

**Table 1.** Summary of technology options for extending digital electronics.

| Improvement Class                        | Technology                                         | Timescale     | Complexity | Risk   | Opportunity |
|------------------------------------------|----------------------------------------------------|---------------|------------|--------|-------------|
| Architecture and software advances       | Advanced energy management                         | Near-Term     | Medium     | Low    | Low         |
|                                          | Advanced circuit design                            | Near-Term     | High       | Low    | Medium      |
|                                          | System-on-chip specialization                      | Near-Term     | Low        | Low    | Medium      |
|                                          | Logic specialization/dark silicon                  | Mid-Term      | High       | High   | High        |
|                                          | Near threshold voltage (NTV) operation             | Near-Term     | Medium     | High   | High        |
| 3D integration and packaging             | Chip stacking in 3D using thru-silicon vias (TSVs) | Near-Term     | Medium     | Low    | Medium      |
|                                          | Metal layers                                       | Mid-Term      | Medium     | Medium | Medium      |
|                                          | Active layers (epitaxial or other)                 | Mid-Term      | High       | Medium | High        |
| Resistance reduction                     | Superconductors                                    | Far-Term      | High       | Medium | High        |
|                                          | Crystalline metals                                 | Far-Term      | Unknown    | Low    | Medium      |
| Millivolt switches (a better transistor) | Tunnel field-effect transistors (TFETs)            | Mid-Term      | Medium     | Medium | High        |
|                                          | Heterogeneous semiconductors/strained silicon      | Mid-Term      | Medium     | Medium | Medium      |
|                                          | Carbon nanotubes and graphene                      | Far-Term      | High       | High   | High        |
|                                          | Piezo-electric transistors (PFETs)                 | Far-Term      | High       | High   | High        |
| Beyond transistors (new logic paradigms) | Spintronics                                        | Far-Term      | Medium     | High   | High        |
|                                          | Topological insulators                             | Far-Term      | Medium     | High   | High        |
|                                          | Nanophotonics                                      | Near/Far-Term | Medium     | Medium | High        |
|                                          | Biological and chemical computing                  | Far-Term      | High       | High   | High        |



# **Comparing CMOS Technology Alternatives**



## **Multiscale Modeling to Accelerate Post-CMOS Development**

Characterizing materials, analyzing devices, understanding impacts on circuits, architectures, systems and applications.



A holistic end-to-end modeling approach is required


# **Gap: Connecting and Scaling**

#### Accelerated feedback path to focus device and material discovery process



## **Integrated Plan to Accelerate Microelectronics Discovery**





#### **PARADISE:** Post-Moore Architecture and Accelerator Design Space Exploration

George Michelogiannakis & Dilip Vasudevan






#### The Sum of the Parts is Greater than the Whole





# Skyrmion "Bags" for Multi-Valued Logic



# **Skyrmion-based Spiking Neural Networks**



# Conclusions

- Think more seriously about how to put specialization productively to use for science
  - Requires deep understanding of applied mathematics and the underlying algorithms to be successful
- Reevaluate the business/economic model for the design and acquisition of HPC systems
- Accelerate the development of materials, devices, and systems for post-CMOS electronics



### **Beyond-Moore Computing Directions**



#### Heterogeneous Architectures

Specialized accelerators for performance / energy

#### Post CMOS Devices/Materials

Evaluate new devices using simulation across scales

#### New Models of Computation

Quantum algorithms, tools and testbeds, for science applications

Workload Analysis, Testbeds, Deployment




# **Data Movement Challenge**

Photonics and Advanced Packaging http://www.padalworkshop.org/



## **Data Movement Costs:**

Energy to move data is proportional to distance.

#### Power is near chip thermal limits



# **Package Performance is Pin Limited**



## **Diverse Node Configurations for Datacenter Workloads**



# **Disaggregated Node/Rack Architecture**



Most current disaggregation solutions use interconnect bandwidth (1 to 10 GB/s), but this is significantly inferior to RAM bandwidth (100 GB/s to 1 TB/s).
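To put the gap in numbers, the sketch below computes the ratio between the two bandwidth ranges quoted on this slide. The variable names are illustrative only.

```python
GB = 1e9  # bytes per second per GB/s

interconnect_bw = (1 * GB, 10 * GB)   # low and high end, from the slide
ram_bw = (100 * GB, 1000 * GB)        # low and high end, from the slide

# Smallest gap: slowest RAM vs. fastest interconnect
min_gap = ram_bw[0] / interconnect_bw[1]
# Largest gap: fastest RAM vs. slowest interconnect
max_gap = ram_bw[1] / interconnect_bw[0]
```

Under these figures, disaggregated memory reached over the interconnect is between 10x and 1000x slower than local RAM.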

# Photonic MCM (Multi-Chip Module)








# **Case for Disaggregation from a Workload Perspective**





#### Intra-node bandwidth steering

- Introduce low-radix optical circuit switches to the OC-MCM topology
  - 4x4 to 8x8 realizable with today's technology
  - Tens of switches can be collocated on a single chip
- Slower reconfiguration compared to packet switching
  - Reconfiguration takes microseconds
  - But traffic patterns are persistent for long periods (minutes to hours!)
- But transparent for packets
  - No buffering for point-to-point means Time-of-Flight latencies
  - Extremely energy efficient to reconfigure
  - Minimize marooned resources
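The argument that slow reconfiguration is acceptable can be made concrete with a minimal sketch. The 10 microsecond and 60 second values are hypothetical picks within the "microseconds" and "minutes to hours" ranges above; they are not measurements from the talk.

```python
def reconfig_overhead(reconfig_s, persistence_s):
    """Fraction of time spent reconfiguring, assuming the optical switch is
    reprogrammed once per traffic phase of length persistence_s."""
    return reconfig_s / (reconfig_s + persistence_s)

# Hypothetical: 10 microseconds to reconfigure, traffic stable for 1 minute
overhead = reconfig_overhead(10e-6, 60.0)
```

With these assumed values the overhead is well under one part in a million, which is why persistent traffic patterns make circuit switching attractive here.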





# **ML : Inference Configuration**







# **ML : Training Configuration**







### **PINE: Photonic Integrated Networked Energy Efficient Datacenters**

Resource Disaggregation to custom-assemble diverse accelerators for diverse workload requirements






# **Beyond Moore Computing Taxonomy**





# Hardware Specialization and the Move Towards Extreme Heterogeneous Acceleration

Make Heterogeneous Acceleration Productive for Science


