Resource-aware computations on CPUs and GPUs

Instructors: Suraj Kumar, Loris Marchal & Frédéric Vivien (ROMA team, LIP, ENS Lyon), 2025-2026



Computing platforms have limited resources, such as memory, cache, bandwidth, and processing power. Historically, algorithms have been designed for optimal computational complexity and are thus supposed to make efficient use of processing power, while memory, cache, and bandwidth limitations have often been ignored. However, these resources frequently become the primary factors limiting overall performance, especially (but not only) when using accelerators such as GPUs. GPUs offer greater processing capabilities and superior energy efficiency compared to CPUs, which has made them a crucial element of many computing systems over the past decade.

In this course, we will present, on the one hand, algorithmic approaches that have recently been proposed to use all resources efficiently, and, on the other hand, we will focus on how to implement these efficient algorithms on real hardware platforms. The typical use case will be linear algebra computations (matrix operations), which are the basis of both "traditional" high-performance computing applications and recent neural network computations.
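As a first taste of memory-aware computation, here is a minimal C sketch (not course material; the dimension N, tile size B, and function name are illustrative choices) of a cache-blocked matrix product. Once a B x B tile is loaded into cache, it is reused B times, which reduces traffic between main memory and cache compared to the naive triple loop.

```c
#include <assert.h>
#include <string.h>

#define N 64   /* matrix dimension (illustrative) */
#define B 16   /* tile size; in practice tuned to the cache size */

/* Cache-blocked matrix product: C += A * Bm.
 * The outer loops walk over B x B tiles; the inner loops multiply
 * one pair of tiles, so each loaded tile element is reused B times. */
static void matmul_blocked(const double A[N][N], const double Bm[N][N],
                           double C[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                /* multiply tile (ii,kk) of A by tile (kk,jj) of Bm */
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++)
                        for (int j = jj; j < jj + B; j++)
                            C[i][j] += A[i][k] * Bm[k][j];
}
```

The result is identical to the naive i-j-k loops; only the order in which the partial products are accumulated changes, which is exactly the kind of reordering the course studies under memory, cache, and bandwidth constraints.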

[Figure: a sequential machine]
[Figure: a distributed-memory machine]
[Figure: the Nvidia Volta 100 (V100) GPU architecture]

Outline


In this course, we will look at several interesting research projects related to parallel computations in high-performance computing, machine learning, and data analytics.


Lectures

  • Lecture 1 (Nov 20, by F. Vivien): Introduction to Scheduling Under Memory Constraints and Pebble Game Models slides notes.
  • Lecture 2 (Nov 21, by S. Kumar): Overview of the GPU part slides and parallelization approaches on CPUs slides.
  • Lecture 3 (Nov 27, by F. Vivien): Pebble Game Models (2/2) slides notes.
  • Lecture 4 (Nov 28, by S. Kumar): Introduction to CUDA programming slides.
  • Lecture 5 (Dec 04, by L. Marchal): Communication-avoiding algorithms and parallel I/O-efficient algorithms for matrix product slides.
  • Lecture 6 (Dec 05, by S. Kumar): Occupancy analysis for a Volta SM slides.
  • Lecture 7 (Dec 11, by S. Kumar): Some basic CUDA APIs slides.
  • Lecture 8 (Dec 12, by S. Kumar): Research articles for the projects slides and low rank approximations of matrices and tensors slides.
  • Lecture 9 (Dec 18, by F. Vivien): Memory-Aware DAG Scheduling slides.
  • Lecture 10 (Dec 19, by S. Kumar): Error handling on GPUs slides.
  • No class on Jan 08.
  • Lecture 11 (Jan 09, by F. Vivien): See the slides for Lecture 9 (Dec 18).
  • Lecture 12 (Jan 15, by L. Marchal): Computing for LLMs: associated memory and IO challenges. Related articles: FlexGen, MoE optimization and slides for this paper.
  • Lecture 13 (Jan 16, by S. Kumar & S. Singh): Parallel scan algorithms suggested reading; Parallel graph algorithms on GPUs slides.
  • Lecture 14 (Jan 21, by S. Kumar): Parallel scan and sorting algorithms slides.
  • Lecture 15 (Jan 22, by L. Marchal): External Memory Model and Cache-Oblivious Algorithms slides.
  • Lecture 16 (Jan 23, by S. Kumar): Popular CUDA libraries slides.
Prerequisite

Experience with C/C++ is expected. Knowledge of parallel algorithms will be helpful, but not required.

Evaluation

The evaluation will be based on the following weightings:

Course project for the final evaluation

Each student will prepare a 5-6 page report and give a presentation on one of the articles below. The project report is due by Jan 15. For all projects, the report must include the following: the context, the state of the art, an overview of the contributions, your feedback, and a detailed analysis or implementation of selected parts (if the article does not specify which parts, focus on those you consider most important). All project presentations will take place on Jan 30 in the Salle du conseil (LIP, 394 nord). This room is adjacent to the entrance of Amphithéâtre B on the third floor.
Project presentation schedule for Jan 30

Time    Name
8h      PLAID
8h30    KEJIKIAN
9h      FONTAINE
9h30    DZIKI
10h     DURAND
10h30   REMY
11h     GOYET
13h     FLYGAR
13h30   COURTOIS
14h     GUMRAN
14h30   LORIFERNE


Research articles

If you want to propose another article, we would be happy to discuss it with you.

Bibliography

The course is based on recent research articles in the area. For GPU programming, we will follow this textbook:

  • Programming Massively Parallel Processors: A Hands-on Approach (4th Edition)
    by Wen-mei W. Hwu, David B. Kirk and Izzat El Hajj