# Confirmed Speakers

### Misha Belkin

UC San Diego

### Michael Bronstein

University of Oxford

### Beidi Chen

Meta

### Ivan Dokmanić

University of Basel

### Simon Du

University of Washington

### Laurent El Ghaoui

UC Berkeley

### Michael Elad

Technion

### Reinhard Heckel

TU Munich

### Gitta Kutyniok

LMU Munich

### Qi Lei

NYU

### Rina Panigrahy

### Ohad Shamir

Weizmann Institute

### Daniel Soudry

Technion

### Weijie Su

UPenn

### René Vidal

Johns Hopkins

### Yu-Xiang Wang

UCSB

### Bihan Wen

NTU

### Fanny Yang

ETH Zurich

# Talk Details

### Misha Belkin

UC San Diego

#### Title: Why do neural models need so many parameters?

#### Abstract

One of the striking aspects of modern neural networks is their extreme size reaching billions or even trillions parameters. Why are so many parameters needed? To attempt an answer to this question, I will discuss an algorithm and distribution independent non-asymptotic trade-off between the model size, excess test loss, and training loss for linear predictors and feature maps. Specifically, we show that models that perform well on the test data (have low excess loss) are either “classical” – have training loss close to the noise level, or are “modern” – have a much larger number of parameters compared to the minimum needed to fit the training data exactly. Furthermore, I will provide empirical evidence that optimal performance of realistic models is typically achieved in the “modern” regime, when they are trained below the noise level.

### Michael Bronstein

University of Oxford

#### Title: TBA

#### Abstract

TBA

### Beidi Chen

Meta

#### Title: Hardware-aware Sparsity: Accurate and Efficient Foundation Model Training

#### Abstract

Foundation models trained on complex and rapidly growing data consume enormous computational resources. In this talk, I will describe our recent work on exploiting model and activation sparsity to accelerate foundation model training in both data and model parallel settings. We show that adapting algorithms on current hardware leads to efficient model training with no drop in accuracy. I will start by describing Pixelated Butterfly and Monarch, simple yet efficient sparse model training frameworks on GPUs. They use simple static block-sparse patterns based on butterfly and low-rank matrices, taking into account GPU block-oriented efficiency. They train up to 2.5x faster (wall-clock) than the dense Vision Transformer and GPT-2 counterparts with no drop in accuracy. Next, I will present AC-SGD, communication-efficient pipeline parallelism training frameworks over slow networks. Based on an interesting observation that model weights change slowly during the training, AC-SGD compresses activations or the change of activations with guarantees. It trains or fine-tunes DeBERTa and GPT2-1.5B 4.3x faster in slower networks without sacrificing model quality. I will conclude by outlining three future research directions - data efficiency,software-hardware codesign, and ML for science, and several ongoing projects including linear-time algorithm for large optimal transport problems, efficient autoregressive model (gpt3-style) inference, and ML for new material discovery.

### Ivan Dokmanić

University of Basel

#### Title: Statistical Mechanics of Graph Neural Networks

#### Abstract

Graph convolution networks are excellent models for relational data but their success is not well understood. I will show how ideas from statistical physics and random matrix theory allow us to precisely characterize GCN generalization on the contextual stochastic block model—a community-structured graph model with features. The resulting curves are rich: they predict double descent thus far unseen in graph learning and explain the qualitative distinction between learning on homophilic graphs (such as friendship networks) and heterophilic graphs (such as protein interaction networks). Earlier approaches based on VC-dimension or Rademacher complexity are too blunt to yield similar mechanistic insight. Our findings pleasingly translate to real “production-scale” networks and datasets and suggest simple redesigns which improve performance of state-of-the-art networks on heterophilic datasets. They further suggest intriguing connections with spectral graph theory, signal processing, and iterative methods for the Helmholtz equation. Joint work with Cheng Shi, Liming Pan, and Hong Hu.

### Simon Du

University of Washington

#### Title: Passive and Active Multi-Task Representation Learning

#### Abstract

Representation learning has been widely used in many applications. In this talk, I will present our work which uncovers when and why representation learning provably improves the sample efficiency, from a statistical learning point of view. Furthermore, I will talk about how to actively select the most relevant task to boost the performance.

### Laurent El Ghaoui

UC Berkeley

#### Title: Implicit rules in deep learning

#### Abstract

Recently, prediction rules based on so-called implicit models have emerged as a new, high-potential paradigm in deep learning. These models rely on an “equilibrium” equation to define the prediction, instead of a recurrence through multiple layers. Currently, even very complex deep learning models are based on a “feedforward” structure, without loops, and as such the popular term “neural” applied to such models is not fully warranted, since the brain itself possesses loops. Allowing for loops may be the key to describing complex higher-level reasoning, which has so far eluded the deep learning paradigms. However, it raises the fundamental issue of well-posedness, since there may be no or multiple solutions to the corresponding equilibrium equation. In this talk, I will review some aspects of implicit models, starting from a unifying “state-space” representation that greatly simplifies notation. I will also introduce various training problems of implicit models, including one that allows convex optimization, and their connections to topics such as architecture optimization, model compression, and robustness. In addition, the talk will show the potential of implicit models through experimental results on various problems such as parameter reduction, feature elimination, and mathematical reasoning tasks.

### Michael Elad

Technion

#### Title: Image Denoising - Not What You Think

#### Abstract

Image denoising – removal of white additive Gaussian noise from an image – is one of the oldest and most studied problems in image processing. An extensive work over several decades has led to thousands of papers on this subject, and to many well-performing algorithms for this task. As expected, the era of deep learning has brought yet another revolution to this subfield, and took the lead in today’s ability for noise suppression in images. All this progress has led some researchers to believe that “denoising is dead”, in the sense that all that can be achieved is already done. Exciting as all this story might be, this talk IS NOT ABOUT IT! Our story focuses on recently discovered abilities and vulnerabilities of image denoisers. In a nut-shell, we expose the possibility of using image denoisers for serving other problems, such as regularizing general inverse problems and serving as the engine for image synthesis. We also unveil the (strange?) idea that denoising (and other inverse problems) might not have a unique solution, as common algorithms would have you believe. Instead, we will describe constructive ways to produce randomized and diverse high perceptual quality results for inverse problems.

### Reinhard Heckel

TU Munich

#### Title: The role of data and models for deep-learning based image reconstruction

#### Abstract

Deep-learning methods give state-of-the-art performance for a variety of imaging tasks, including accelerated magnetic resonance imaging. In this talk we discuss whether improved models and algorithms or training data are the most promising way forward. First, we ask whether increasing the model size and the training data improves performance in a similar fashion as it has in domains such as language modeling. We find that scaling beyond relatively few examples yields only marginal performance gains. Second, we discuss the robustness of deep learning based image reconstruction methods. Perhaps surprisingly, we find no evidence for neural networks being any less robust than classical reconstruction methods (such as l1 minimization). However, we find that both classical and deep learning based approaches perform significantly worse under distribution shifts, i.e., when trained (or tuned) and tested on slightly different data. Finally, we show that the out-of-distribution performance can be improved through more diverse training data, or through an algorithmic intervention called test-time-training.

### Gitta Kutyniok

LMU Munich

#### Title: The Next Generation of Reliable AI: From Digital to Analog Hardware

#### Abstract

Artificial intelligence is currently leading to one breakthrough after the other, in the sciences, in industry, and in public life. However, one current major drawback is the lack of reliability of such methodologies, which, in particular, also concerns almost any application of deep neural networks. In this lecture, we will first provide a short introduction into the world of reliability of deep neural networks, with one main focus being on explainability. We will, in particular, present a novel approach based on information theory, coined ShearletX, which allows to not only provide higher level explanations, but also reveals the reason for wrong decisions. We will then delve deeper and discuss fundamental limitations of numerous current deep learning-based approaches, showing that there do exist severe problems in terms of computability on any type of digital hardware, which seriously affects their reliability. But theory also shows a way out, pointing towards a future on analog hardware such as neuromorphic computing or quantum computing to achieve true reliability.

### Qi Lei

NYU

#### Title: Reconstructing Training Data from Model Gradient, Provably

#### Abstract

Understanding when and how much a model gradient leaks information about the training sample is an important question in privacy. In this talk, we present a surprising result: even without training or memorizing the data, we can fully reconstruct the training samples from a single gradient query at a randomly chosen parameter value. We prove the identifiability of the training data under mild conditions: with shallow or deep neural networks and a wide range of activation functions. We also present a statistically and computationally efficient algorithm based on low-rank tensor decomposition to reconstruct the training data. As a provable attack that reveals sensitive training data, our findings suggest potential severe threats to privacy, especially in federated learning.

### Rina Panigrahy

#### Title: How to learn a Table of Concepts?

#### Abstract

An idealized view of an intelligent system is one where there is a table of concepts from simple to complex and on each raw input (image or text) is processed to identify the concepts present or triggered by that input. While today’s ML systems may use a Mixture of Experts or Table of entities/mentions that are trying to map concepts to experts/entities/mention, there is hardly a clear picture of how concepts build upon each other automatically. In the domain of images, we can think of commonly occurring shapes, types of motions as concepts and ask how would a visual system create entries for different shape types in a table of concepts. How does one convert raw data into concepts given by something like their latent representation. A trivial transformation like random projection doesn’t work. But we will see how simple networks similar to ones used in practice can be viewed as “sketching operators” that can be used to create representations for images with simple shapes that are isomorphic to their polynomial representations. By putting such polynomial representations in a locality sensitive table, we obtain a dictionary of curves. And then by recursively applying another layer of the sketching operator on the curves one gets a dictionary of concepts such as shapes.

### Tomaso Poggio

MIT

#### Title: A key principle underlying deep networks: compositional sparsity

#### Abstract

A key question is whether there exist a theoretical explanation — a common motif — to the various network architectures, including the human brain, that perform so well in learning tasks. I will discuss the conjecture that this is compositional sparsity of effectively computable functions: all functions of many variables must effectively be compositionally sparse that is with constituent functions each depending on a small number of variables.

### Ohad Shamir

Weizmann Institute

#### Title: Implicit bias in machine learning

#### Abstract

Most practical algorithms for supervised machine learning boil down to optimizing the average performance over a training dataset. However, it is increasingly recognized that although the optimization objective is the same, the manner in which it is optimized plays a decisive role in the properties of the resulting predictor. For example, when training large neural networks, there are generally many weight combinations that will perfectly fit the training data. However, gradient-based training methods somehow tend to reach those which, for example, do not overfit; are brittle to adversarially crafted examples; or have other unusual properties. In this talk, I’ll describe several recent theoretical and empirical results related to this question.

### Mahdi Soltanolkotabi

USC

#### Title: Demystifying Feature learning via gradient descent with applications to medical image reconstruction

#### Abstract

In this talk I will discuss the challenges and opportunities for using deep learning in medical image reconstruction. Contemporary techniques in this field rely on convolutional architectures that are limited by the spatial invariance of their filters and have difficulty modeling long-range dependencies. To remedy this, I will discuss our work on designing new transformer-based architectures called HUMUS-Net that lead to state of the art performance and do not suffer from these limitations. A key component in the success of the above approach is a unique feature learning capability of unrolled neural networks trained based on end-to-end training. In the second part of the talk I will demystify this feature learning capability of neural networks in this context as well as more broadly for other problems. Our result is based on an intriguing spectral bias phenomena for gradient descent, that puts the iterations on a particular trajectory towards solutions that learn good features that generalize well. Notably this analysis overcomes a major theoretical bottleneck in the existing literature and goes beyond the “lazy” training regime which requires unrealistic hyperparameter choices (e.g. very small step sizes, large initialization or wide models).

### Daniel Soudry

Technion

#### Title: How catastrophic can catastrophic forgetting be in linear regression?

#### Abstract

Deep neural nets typically forget old tasks when trained on new tasks. This phenomenon, called “catastrophic forgetting” is not well understood, even in the most basic setting of linear regression. Therefore, we study catastrophic forgetting when fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method. In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas. In particular, when T tasks in d dimensions are presented cyclically for k iterations, we prove an upper bound of T^2 * min{1/sqrt(k), d/k} on the forgetting. This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results. We further show that the T^2 factor can be lifted when tasks are presented in a random ordering. Joint work with Itay Evron, Edward Moroshko, Rachel Ward, and Nati Srebro, published in COLT 22.

### Weijie Su

UPenn

#### Title: Geometrization of Real-World Deep Neural Networks

#### Abstract

In this talk, we will investigate the emergence of geometric patterns in well-trained deep learning models by making use of a layer-peeled model and the law of equi-separation. The former is a nonconvex optimization program that models the last-layer features and weights. We use the model to shed light on the neural collapse phenomenon of Papyan, Han, and Donoho, and to predict a hitherto-unknown phenomenon that we term minority collapse in imbalanced training. This is based on joint work with Cong Fang, Hangfeng He, and Qi Long (arXiv:2101.12699). The law of equi-separation is a pervasive empirical phenomenon that describes how data are separated according to their class membership from the bottom to the top layer in a well-trained neural network. We will show that, through extensive computational experiments, neural networks improve data separation through layers in a simple exponential manner. This law leads to roughly equal ratios of separation that a single layer is able to improve, thereby showing that all layers are created equal. We will conclude the talk by discussing the implications of this law on the interpretation, robustness, and generalization of deep learning, as well as on the inadequacy of some existing approaches toward demystifying deep learning. This is based on joint work with Hangfeng He (arXiv:2210.17020).

### René Vidal

Johns Hopkins

#### Title: TBA

#### Abstract

TBA

### Yu-Xiang Wang

UCSB

#### Title: Deep Learning meets Nonparametric Regression: Are Weight-decayed DNNs locally adaptive?

#### Abstract

They say deep learning is just curve fitting. But how good is it in curve fitting exactly? Are DNNs as good as, or even better than, classical curve-fitting tools, e.g., splines and wavelets? In this talk, I will cover my group’s recent paper on this topic and some interesting insight. Specifically, I will provide new answers to “Why is DNN stronger than kernels?” “Why are deep NNs stronger than shallow ones?”, “Why ReLU?”, “How do DNNs generalize under overparameterization?”, “What is the role of sparsity in deep learning?”, “Is lottery ticket hypothesis real?” All in one package. Intrigued? Come to my talk and find out!

### Bihan Wen

NTU

#### Title: Beyond Classic Image Priors: Deep Reinforcement Learning and Disentangling for Image Restoration

#### Abstract

The key to solving various ill-posed inverse problems, including many computational imaging and image restoration tasks, is to exploit effective image priors. The classic signal processing theory laid the foundation on constructing analytical image models such as sparse coding and low-rank approximation. Recent advances in machine learning, especially deep learning technologies have made incredible progress in the past few years, which enables people to rethink how the image prior can be formulated more effective. Despite those many deep learning methods achieved the state-of-the-art results in image restoration tasks, there are still limitations in practice, such as adversarial robustness, customization, generalization, and data-efficiency. In this talk, I will share some of our recent works on deep disentangling and deep reinforcement learning in image restoration tasks. We argue that learning more than just the classic image priors are needed for solving inverse problems to alleviate some of the limitations. We show promising results in several image restoration tasks, including denoising, adversarial purification, low-light image enhancement and computational imaging.

### Fanny Yang

ETH Zurich

#### Title: Strong inductive biases provably prevent harmless interpolation

#### Abstract

Interpolating models have recently gained popularity in the statistical learning community due to common practices in modern machine learning: complex models achieve good generalization performance despite interpolating high-dimensional training data. In this talk, we prove generalization bounds for high-dimensional linear models that interpolate noisy data generated by a sparse ground truth. In particular, we first show that minimum-l1-norm interpolators achieve high-dimensional asymptotic consistency at a logarithmic rate. Further, as opposed to the regularized or noiseless case, for min-lp-norm interpolators with 1<p<2 we surprisingly obtain polynomial rates. Our results suggest a new trade-off for interpolating models: a stronger inductive bias encourages a simpler structure better aligned with the ground truth at the cost of an increased variance. We finally discuss our latest results where we show that this phenomenon also holds for nonlinear models.