Computing platforms have limited resources, such as memory, cache, bandwidth, and processing power. Traditionally, algorithms have been designed for optimal computational complexity, and are thus assumed to use processing power efficiently, while memory, cache, and bandwidth limitations have often been ignored. However, these resources frequently become the primary factors limiting overall performance, especially (but not only) when using accelerators such as GPUs. GPUs offer greater processing capability and superior energy efficiency compared to CPUs, which has made them a crucial element of many computing systems over the past decade.
In this course, we will present, on the one hand, recently proposed algorithmic approaches that make efficient use of all resources and, on the other hand, how to implement these efficient algorithms on real hardware platforms. The typical use case will be linear algebra computations (matrix operations), which are the basis of both "traditional" high-performance computing applications and recent neural network computations.
Figures: a sequential machine; a distributed memory machine; the Nvidia Volta 100 architecture.
Outline
Introduction
Course details and evaluation methods
Relevance of this course
Models and algorithms for CPU computations
Matrix multiplications
I/O complexity bounds
Data transfer cost models
Communication avoiding sequential and parallel algorithms
2D/2.5D/3D algorithms
Directed acyclic graph (DAG) scheduling
Memory aware computations
Exploratory research topics
GPU fundamentals
GPU vs CPU architecture
CUDA basics
Memory model, threads, blocks
Memory optimizations using shared memory
Coalescing and memory alignment
Tensor cores
Exploratory research topics
Implementation on CPUs and GPUs
Parallelization using OpenMP and MPI
Parallelization using CUDA
In this course, we will look at several research projects related to parallel computations in high-performance computing, machine learning, and data analytics.
Lectures
Prerequisite
Experience with C/C++ is expected. Knowledge of parallel algorithms will be helpful, but not required.
Evaluation
The evaluation will be based on the following weightings:
2 pen-and-paper-based assignments (20% weight)
3 small programming assignments (30% weight)
Project (50% weight): Each student will select an article based on their interests and work on it. The output will be assessed based on a written report and an oral presentation.
Course project for the final evaluation
Each student will prepare a 5-6 page report and give a presentation on one of the articles below. The project report is due by Jan 15.
For all projects, the student must include the following in the report: the context, the state of the art, an overview of the contributions, your feedback, and a detailed analysis or implementation of selected parts (if the article does not specify which parts, focus on those you consider most important). All project presentations will take place on Jan 30 in the Salle du conseil LIP, 394 nord. This room is adjacent to the entrance of Amphithéâtre B on the third floor.
Project presentation schedule for Jan 30
Time    Name
8h      PLAID
8h30    KEJIKIAN
9h      FONTAINE
9h30    DZIKI
10h     DURAND
10h30   REMY
11h     GOYET
13h     FLYGAR
13h30   COURTOIS
14h     GUMRAN
14h30   LORIFERNE
Research articles
Various hardness results in different red-blue pebbling variants. Assigned to Isaline PLAID. On the Hardness of Red-Blue Pebble Games (Specific part: proof of Theorem 2 (NP-hardness))
Rules of the pebble game. Assigned to Alexandre KEJIKIAN. Move rules and trade-offs in the pebble game (Specific part: proof of Theorem D (removing one pebble leads to exponential time))