The SemiAnalysis Blackwell Hackathon was a tremendous success – attracting a huge turnout!

A BIG congratulations to the winners! Be sure to check out their remarkable work below, along with the presentation decks.


🎤 Talks 🎤

Mark Saroufim

GPU MODE

How to Make an Impact in ML Systems

The Present and Future of CUTLASS Tensor Core Programming

Vijay Thakkar

NVIDIA CUTLASS

Building Machine Learning Systems for the Age of Really Big Models

Blackwell Programming for the Masses With OpenAI Triton

Optimizing Attention for Modern Hardware


🏆 Winners 🏆

FIRST PLACE

Optimizing NVIDIA Blackwell’s Split L2

Arun Demeure

A kernel for NVIDIA’s Blackwell GPU that minimizes cross-chip memory access. In a reduction kernel, it achieves a more than 100x reduction in data transfer and the associated power consumption, while also improving memory latency.


SECOND PLACE

WebGPU Backend for PyTorch

Edward Wang, Albert Yang

A prototype of a custom PyTorch backend that leverages WebGPU shaders, enabling models to run on virtually any GPU, even the ones in your smart fridge!


THIRD PLACE

Distributed GEMM with Triton

Subrata Goswami, Vishal Goklani

A distributed GEMM for Blackwell, leveraging Triton (where applicable) to optimize performance, scalability, and load balancing across multiple systems.


SPECIAL MENTION

Laser Fault Injection at H100 Root of Trust

Jonathan Happel, Luc Chartier

An investigation into attacking the Microchip CEC1736 with laser fault injection, focusing on its role as the Root of Trust for Confidential Computing Flash within the NVIDIA H100.


SPECIAL MENTION

Mains Magnificent Multi-GPU Muon Method

Main Horse, Jack Min Ong, Sami Jaghouar, Aaron Pazdera

Improved Newton-Schulz iteration speed by fusing iterations and exploiting symmetry, while leveraging round-robin gather-scatter for efficient multi-GPU distributed computation.
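
For readers unfamiliar with the technique, here is a minimal single-device sketch of a Newton-Schulz orthogonalization step. It is not the team's implementation: it uses the classic cubic iteration rather than whatever variant the project used, and it omits the fused iterations and multi-GPU gather-scatter entirely. The function names (`matmul`, `transpose`, `newton_schulz_orthogonalize`) and the step count are illustrative assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the semi-orthogonal factor of g with the classic cubic
    Newton-Schulz iteration X <- 1.5 X - 0.5 (X X^T) X.

    Hypothetical helper for illustration; assumes g has full rank.
    """
    # Work on the wide orientation so the Gram matrix X X^T is the small one.
    tall = len(g) > len(g[0])
    x = transpose(g) if tall else [row[:] for row in g]

    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    norm = math.sqrt(sum(v * v for row in x for v in row)) or 1.0
    x = [[v / norm for v in row] for row in x]

    for _ in range(steps):
        # X X^T is symmetric, so an optimized kernel only needs to compute
        # one triangle of it. This sketch computes the full product for clarity.
        gram = matmul(x, transpose(x))
        gram_x = matmul(gram, x)
        x = [
            [1.5 * xv - 0.5 * gv for xv, gv in zip(x_row, g_row)]
            for x_row, g_row in zip(x, gram_x)
        ]

    return transpose(x) if tall else x


if __name__ == "__main__":
    q = newton_schulz_orthogonalize([[2.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
    # Rows of q should be approximately orthonormal.
    print(matmul(q, transpose(q)))
```

Each iteration pushes every singular value of the matrix toward 1 while leaving the singular vectors unchanged, which is why the result approaches a matrix with orthonormal rows (or columns, for tall inputs).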