# Confirmed Speakers

Clicking a speaker’s photo will jump to their talk information below.

### Misha Belkin

UC San Diego

### Beidi Chen

Meta

### Ivan Dokmanić

University of Basel

### Simon Du

University of Washington

### Reinhard Heckel

TU Munich

### Gitta Kutyniok

LMU Munich

### Jason Lee

Princeton

### Qi Lei

NYU

### Yi Ma

UC Berkeley

### Rina Panigrahy

### Qing Qu

UMich

### Ohad Shamir

Weizmann Institute

### Daniel Soudry

Technion

### Weijie Su

UPenn

### René Vidal

Johns Hopkins

### Yu-Xiang Wang

UCSB

### Bihan Wen

NTU

### Fanny Yang

ETH Zurich

# Talk Details

### Misha Belkin

UC San Diego

#### Title: Why do neural models need so many parameters?

#### Abstract

One of the striking aspects of modern neural networks is their extreme size reaching billions or even trillions parameters. Why are so many parameters needed? To attempt an answer to this question, I will discuss an algorithm and distribution independent non-asymptotic trade-off between the model size, excess test loss, and training loss for linear predictors and feature maps. Specifically, we show that models that perform well on the test data (have low excess loss) are either “classical” – have training loss close to the noise level, or are “modern” – have a much larger number of parameters compared to the minimum needed to fit the training data exactly. Furthermore, I will provide empirical evidence that optimal performance of realistic models is typically achieved in the “modern” regime, when they are trained below the noise level.

### Beidi Chen

Meta

#### Title: Hardware-aware Sparsity: Accurate and Efficient Foundation Model Training

#### Abstract

Foundation models trained on complex and rapidly growing data consume enormous computational resources. In this talk, I will describe our recent work on exploiting model and activation sparsity to accelerate foundation model training in both data and model parallel settings. We show that adapting algorithms on current hardware leads to efficient model training with no drop in accuracy. I will start by describing Pixelated Butterfly and Monarch, simple yet efficient sparse model training frameworks on GPUs. They use simple static block-sparse patterns based on butterfly and low-rank matrices, taking into account GPU block-oriented efficiency. They train up to 2.5x faster (wall-clock) than the dense Vision Transformer and GPT-2 counterparts with no drop in accuracy. Next, I will present AC-SGD, communication-efficient pipeline parallelism training frameworks over slow networks. Based on an interesting observation that model weights change slowly during the training, AC-SGD compresses activations or the change of activations with guarantees. It trains or fine-tunes DeBERTa and GPT2-1.5B 4.3x faster in slower networks without sacrificing model quality. I will conclude by outlining three future research directions - data efficiency,software-hardware codesign, and ML for science, and several ongoing projects including linear-time algorithm for large optimal transport problems, efficient autoregressive model (gpt3-style) inference, and ML for new material discovery.

### Ivan Dokmanić

University of Basel

#### Title: Statistical Mechanics of Graph Neural Networks

#### Abstract

Graph convolution networks are excellent models for relational data but their success is not well understood. I will show how ideas from statistical physics and random matrix theory allow us to precisely characterize GCN generalization on the contextual stochastic block model—a community-structured graph model with features. The resulting curves are rich: they predict double descent thus far unseen in graph learning and explain the qualitative distinction between learning on homophilic graphs (such as friendship networks) and heterophilic graphs (such as protein interaction networks). Earlier approaches based on VC-dimension or Rademacher complexity are too blunt to yield similar mechanistic insight. Our findings pleasingly translate to real “production-scale” networks and datasets and suggest simple redesigns which improve performance of state-of-the-art networks on heterophilic datasets. They further suggest intriguing connections with spectral graph theory, signal processing, and iterative methods for the Helmholtz equation. Joint work with Cheng Shi, Liming Pan, and Hong Hu.

### Simon Du

University of Washington

#### Title: Passive and Active Multi-Task Representation Learning

#### Abstract

Representation learning has been widely used in many applications. In this talk, I will present our work which uncovers when and why representation learning provably improves the sample efficiency, from a statistical learning point of view. Furthermore, I will talk about how to actively select the most relevant task to boost the performance.

### Reinhard Heckel

TU Munich

#### Title: The role of data and models for deep-learning based image reconstruction

#### Abstract

Deep-learning methods give state-of-the-art performance for a variety of imaging tasks, including accelerated magnetic resonance imaging. In this talk we discuss whether improved models and algorithms or training data are the most promising way forward. First, we ask whether increasing the model size and the training data improves performance in a similar fashion as it has in domains such as language modeling. We find that scaling beyond relatively few examples yields only marginal performance gains. Second, we discuss the robustness of deep learning based image reconstruction methods. Perhaps surprisingly, we find no evidence for neural networks being any less robust than classical reconstruction methods (such as l1 minimization). However, we find that both classical and deep learning based approaches perform significantly worse under distribution shifts, i.e., when trained (or tuned) and tested on slightly different data. Finally, we show that the out-of-distribution performance can be improved through more diverse training data, or through an algorithmic intervention called test-time-training.

### Gitta Kutyniok

LMU Munich

#### Title: The Next Generation of Reliable AI: From Digital to Analog Hardware

#### Abstract

Artificial intelligence is currently leading to one breakthrough after the other, in the sciences, in industry, and in public life. However, one current major drawback is the lack of reliability of such methodologies, which, in particular, also concerns almost any application of deep neural networks. In this lecture, we will first provide a short introduction into the world of reliability of deep neural networks, with one main focus being on explainability. We will, in particular, present a novel approach based on information theory, coined ShearletX, which allows to not only provide higher level explanations, but also reveals the reason for wrong decisions. We will then delve deeper and discuss fundamental limitations of numerous current deep learning-based approaches, showing that there do exist severe problems in terms of computability on any type of digital hardware, which seriously affects their reliability. But theory also shows a way out, pointing towards a future on analog hardware such as neuromorphic computing or quantum computing to achieve true reliability.

### Jason Lee

Princeton

#### Title: Feature Learning with gradient descent

#### Abstract

Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two layer neural network outside the kernel regime by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form f⋆(x)=g(Ux) where U: \R^d \to \R^r with d≫r. When the degree of f⋆ is p, it is known that n≍d^p samples are necessary to learn f⋆ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to f⋆. This results in an improved sample complexity of n≍rd^2+pd^r. Furthermore, in a transfer learning setup where the data distributions in the source and target domain share the same representation U but have different polynomial heads we show that a popular heuristic for transfer learning has a target sample complexity independent of d. This is joint work with Alex Damian and Mahdi Soltanolkotabi.

### Qi Lei

NYU

#### Title: Reconstructing Training Data from Model Gradient, Provably

#### Abstract

Understanding when and how much a model gradient leaks information about the training sample is an important question in privacy. In this talk, we present a surprising result: even without training or memorizing the data, we can fully reconstruct the training samples from a single gradient query at a randomly chosen parameter value. We prove the identifiability of the training data under mild conditions: with shallow or deep neural networks and a wide range of activation functions. We also present a statistically and computationally efficient algorithm based on low-rank tensor decomposition to reconstruct the training data. As a provable attack that reveals sensitive training data, our findings suggest potential severe threats to privacy, especially in federated learning.

### Yi Ma

UC Berkeley

#### Title: On the Principles of Parsimony and Self-Consistency – Structured Compressive Closed-Loop Transcription

#### Abstract

Ten years into the revival of deep networks and artificial intelligence, we propose a theoretical framework that sheds light on understanding deep networks within a bigger picture of intelligence in general. We introduce two fundamental principles, Parsimony and Self-consistency, that address two fundamental questions regarding Intelligence – what to learn and how to learn, respectively. We argue that these two principles can be realized in entirely measurable and computable ways for an important family of structures and models, known as a linear discriminative representation (LDR). The two principles naturally lead to an effective and efficient computational framework, known as a compressive closed-loop transcription, that unifies and explains the evolution of modern deep networks and modern practices of artificial intelligence. Within this framework, we will see how fundamental ideas in information theory, control theory, game theory, sparse coding, and optimization are closely integrated in such a closed-loop system, all as necessary ingredients to learn autonomously and correctly. We demonstrate the power of this framework for learning discriminative, generative, and autoencoding models for large-scale real-world visual data, with entirely white-box deep networks, under all settings (supervised, incremental, and unsupervised). We believe that these two principles are the cornerstones for the emergence of intelligence, artificial or natural, and the compressive closed-loop transcription is a universal learning engine that serves as the basic learning units for all autonomous intelligent systems, including the brain.

### Rina Panigrahy

#### Title: How to learn a Table of Concepts?

#### Abstract

An idealized view of an intelligent system is one where there is a table of concepts from simple to complex and on each raw input (image or text) is processed to identify the concepts present or triggered by that input. While today’s ML systems may use a Mixture of Experts or Table of entities/mentions that are trying to map concepts to experts/entities/mention, there is hardly a clear picture of how concepts build upon each other automatically. In the domain of images, we can think of commonly occurring shapes, types of motions as concepts and ask how would a visual system create entries for different shape types in a table of concepts. How does one convert raw data into concepts given by something like their latent representation. A trivial transformation like random projection doesn’t work. But we will see how simple networks similar to ones used in practice can be viewed as “sketching operators” that can be used to create representations for images with simple shapes that are isomorphic to their polynomial representations. By putting such polynomial representations in a locality sensitive table, we obtain a dictionary of curves. And then by recursively applying another layer of the sketching operator on the curves one gets a dictionary of concepts such as shapes.

### Tomaso Poggio

MIT

#### Title: A key principle underlying deep networks: compositional sparsity

#### Abstract

A key question is whether there exist a theoretical explanation — a common motif — to the various network architectures, including the human brain, that perform so well in learning tasks. I will discuss the conjecture that this is compositional sparsity of effectively computable functions: all functions of many variables must effectively be compositionally sparse that is with constituent functions each depending on a small number of variables.

### Qing Qu

UMich

#### Title: Understanding Deep Neural Networks via Neural Collapse

#### Abstract

Recently, an intriguing phenomenon in the final stages of network training has been discovered and caught great interest, in which the last-layer features and classifiers collapse to simple but elegant mathematical structures: all training inputs are mapped to class-specific points in feature space, and the last-layer classifier converges to the dual of the features’ class means while attaining the maximum possible margin. This phenomenon, dubbed Neural Collapse, persists across a variety of different network architectures, datasets, and even data domains. Moreover, a progressive neural collapse occurs from shallow to deep layers. This talk leverages the symmetry and geometry of Neural Collapse, and develops a rigorous mathematical theory to explain when and why it happens under the so-called unconstrained feature model. Based upon this, we show how it can be used to provide guidelines to understand and improve transferability with more efficient fine-tuning.

### Ohad Shamir

Weizmann Institute

#### Title: Implicit bias in machine learning

#### Abstract

Most practical algorithms for supervised machine learning boil down to optimizing the average performance over a training dataset. However, it is increasingly recognized that although the optimization objective is the same, the manner in which it is optimized plays a decisive role in the properties of the resulting predictor. For example, when training large neural networks, there are generally many weight combinations that will perfectly fit the training data. However, gradient-based training methods somehow tend to reach those which, for example, do not overfit; are brittle to adversarially crafted examples; or have other unusual properties. In this talk, I’ll describe several recent theoretical and empirical results related to this question.

### Mahdi Soltanolkotabi

USC

#### Title: Demystifying Feature learning via gradient descent with applications to medical image reconstruction

#### Abstract

In this talk I will discuss the challenges and opportunities for using deep learning in medical image reconstruction. Contemporary techniques in this field rely on convolutional architectures that are limited by the spatial invariance of their filters and have difficulty modeling long-range dependencies. To remedy this, I will discuss our work on designing new transformer-based architectures called HUMUS-Net that lead to state of the art performance and do not suffer from these limitations. A key component in the success of the above approach is a unique feature learning capability of unrolled neural networks trained based on end-to-end training. In the second part of the talk I will demystify this feature learning capability of neural networks in this context as well as more broadly for other problems. Our result is based on an intriguing spectral bias phenomena for gradient descent, that puts the iterations on a particular trajectory towards solutions that learn good features that generalize well. Notably this analysis overcomes a major theoretical bottleneck in the existing literature and goes beyond the “lazy” training regime which requires unrealistic hyperparameter choices (e.g. very small step sizes, large initialization or wide models).

### Daniel Soudry

Technion

#### Title: On catastrophic forgetting in linear regression and the implicit bias of minima stability

#### Abstract

This is a two-part talk. 1) Deep neural nets typically forget old tasks when trained on new tasks. This phenomenon, called “catastrophic forgetting” is not well understood, even in the most basic setting of linear regression. Therefore, we study catastrophic forgetting when fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. In particular, when T tasks in d dimensions are presented cyclically for k iterations, we prove an upper bound of T^2 * min{1/sqrt(k), d/k} on the forgetting. This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results. We further show that the T^2 factor can be lifted when tasks are presented in a random ordering. 2) We prove that the tendency of SGD to converge to dynamically stable (“flat”) minima leads to a step-size dependent bound on the smoothness of the learned function, in shallow ReLU nets. This implies that, although these networks are universal approximators, stable shallow networks are not. Namely, there is a function (that can be realized as a stable two hidden-layer ReLU network) but that cannot be approximated by stable single hidden-layer ReLU networks trained with a non-vanishing step size. Moreover, we prove that if a function is sufficiently smooth (Sobolev) then it can be approximated arbitrarily well using single hidden-layer ReLU networks that correspond to stable solutions of gradient descent.

### Weijie Su

UPenn

#### Title: Geometrization of Real-World Deep Neural Networks

#### Abstract

In this talk, we will investigate the emergence of geometric patterns in well-trained deep learning models by making use of a layer-peeled model and the law of equi-separation. The former is a nonconvex optimization program that models the last-layer features and weights. We use the model to shed light on the neural collapse phenomenon of Papyan, Han, and Donoho, and to predict a hitherto-unknown phenomenon that we term minority collapse in imbalanced training. This is based on joint work with Cong Fang, Hangfeng He, and Qi Long (arXiv:2101.12699). The law of equi-separation is a pervasive empirical phenomenon that describes how data are separated according to their class membership from the bottom to the top layer in a well-trained neural network. We will show that, through extensive computational experiments, neural networks improve data separation through layers in a simple exponential manner. This law leads to roughly equal ratios of separation that a single layer is able to improve, thereby showing that all layers are created equal. We will conclude the talk by discussing the implications of this law on the interpretation, robustness, and generalization of deep learning, as well as on the inadequacy of some existing approaches toward demystifying deep learning. This is based on joint work with Hangfeng He (arXiv:2210.17020).

### René Vidal

Johns Hopkins

#### Title: Principled Defenses and Reverse Engineering of Adversarial Attacks in a Union of Subspaces

#### Abstract

Deep neural network-based classifiers have been shown to be vulnerable to imperceptible perturbations to their input, such as ℓp-bounded norm adversarial attacks. This has motivated the development of many defense methods, which are then broken by new attacks, and so on. Recent work has also focused on the problem of reverse engineering adversarial attacks, which requires both recovering the clean signal and determining the type of attack (ℓ1, ℓ2 or ℓ∞). However, existing methods either do not come with provable guarantees, or they can certify the accuracy of the classifier only for very small perturbations. In this work, we assume that the data lies approximately in a union of low-dimensional linear subspaces and exploit this low-dimensional structure to develop a theory of adversarial robustness for subspace-sparse classifiers. We first derive geometric conditions on the subspaces under which any attacked signal can be decomposed as the sum of a clean signal plus an attack. We then derive norm-independent certification regions, showing that we can provably defend against specific unrestricted adversarial attacks.

### Yu-Xiang Wang

UCSB

#### Title: Deep Learning meets Nonparametric Regression: Are Weight-decayed DNNs locally adaptive?

#### Abstract

They say deep learning is just curve fitting. But how good is it in curve fitting exactly? Are DNNs as good as, or even better than, classical curve-fitting tools, e.g., splines and wavelets? In this talk, I will cover my group’s recent paper on this topic and some interesting insight. Specifically, I will provide new answers to “Why is DNN stronger than kernels?” “Why are deep NNs stronger than shallow ones?”, “Why ReLU?”, “How do DNNs generalize under overparameterization?”, “What is the role of sparsity in deep learning?”, “Is lottery ticket hypothesis real?” All in one package. Intrigued? Come to my talk and find out!

### Bihan Wen

NTU

#### Title: Beyond Classic Image Priors: Deep Reinforcement Learning and Disentangling for Image Restoration

#### Abstract

The key to solving various ill-posed inverse problems, including many computational imaging and image restoration tasks, is to exploit effective image priors. The classic signal processing theory laid the foundation on constructing analytical image models such as sparse coding and low-rank approximation. Recent advances in machine learning, especially deep learning technologies have made incredible progress in the past few years, which enables people to rethink how the image prior can be formulated more effective. Despite those many deep learning methods achieved the state-of-the-art results in image restoration tasks, there are still limitations in practice, such as adversarial robustness, customization, generalization, and data-efficiency. In this talk, I will share some of our recent works on deep disentangling and deep reinforcement learning in image restoration tasks. We argue that learning more than just the classic image priors are needed for solving inverse problems to alleviate some of the limitations. We show promising results in several image restoration tasks, including denoising, adversarial purification, low-light image enhancement and computational imaging.

### Fanny Yang

ETH Zurich

#### Title: Strong inductive biases provably prevent harmless interpolation

#### Abstract

Interpolating models have recently gained popularity in the statistical learning community due to common practices in modern machine learning: complex models achieve good generalization performance despite interpolating high-dimensional training data. In this talk, we prove generalization bounds for high-dimensional linear models that interpolate noisy data generated by a sparse ground truth. In particular, we first show that minimum-l1-norm interpolators achieve high-dimensional asymptotic consistency at a logarithmic rate. Further, as opposed to the regularized or noiseless case, for min-lp-norm interpolators with 1<p<2 we surprisingly obtain polynomial rates. Our results suggest a new trade-off for interpolating models: a stronger inductive bias encourages a simpler structure better aligned with the ground truth at the cost of an increased variance. We finally discuss our latest results where we show that this phenomenon also holds for nonlinear models.