Workshop on Functional Inference and Machine Intelligence
Virtual, Anywhere on the Earth. March 2-3, 2021.
The Workshop on Functional Inference and Machine Intelligence (FIMI) is an international workshop on machine learning and statistics, with a particular focus on theoretical and algorithmic aspects. It consists of invited talks on topics including (but not limited to):
Mathematical Analysis of Deep Learning
Kernel and Probabilistic Models in Machine Learning
Statistical Learning Theory
The workshop will be held virtually (via Zoom) on March 2-3, 2021. All times are in Japan Standard Time (GMT+9).
Please register! Links for the talks and the event will be emailed to registered participants.
Denny Wu (University of Toronto)
Title: Explicit and Implicit Regularization in Overparameterized Least Squares Regression
We study the generalization properties of the generalized ridge regression estimator in the overparameterized regime. We derive the exact prediction risk (generalization error) in the proportional asymptotic limit, and determine the optimal weighted L2 penalty. Our result provides a rigorous characterization of the surprising phenomenon that the optimal ridge regularization strength can be *negative*. We then connect the ridgeless limit of this estimator to the implicit bias of preconditioned gradient descent (e.g., natural gradient descent); this allows us to compare the generalization performance of first- and second-order optimizers, and to identify different factors that affect this comparison. Our theoretical findings also align with empirical observations in various neural network experiments.
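For readers unfamiliar with the estimator named in the abstract above, a generalized ridge estimator is commonly written as (X^T X + lam * W)^{-1} X^T y, where W is a positive definite weighting matrix. The sketch below is only an illustration under that assumption: the symbol W, the function names, and the absence of any 1/n scaling are choices made here, not taken from the talk, whose exact definition may differ.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def generalized_ridge(features, targets, lam, weight):
    """Return (X^T X + lam * W)^{-1} X^T y for a row-major design matrix X.

    `weight` (W) and the unscaled form are illustrative assumptions.
    """
    d = len(features[0])
    gram = [
        [sum(r[i] * r[j] for r in features) + lam * weight[i][j] for j in range(d)]
        for i in range(d)
    ]
    moment = [sum(r[i] * t for r, t in zip(features, targets)) for i in range(d)]
    return solve_linear_system(gram, moment)


# Overparameterized toy problem: one sample, two features.
identity = [[1.0, 0.0], [0.0, 1.0]]
beta = generalized_ridge([[1.0, 1.0]], [2.0], lam=1.0, weight=identity)
print(beta)
```

In this toy problem X^T X is singular, so the penalty term is what makes the system solvable. The function also accepts a negative `lam`, provided the resulting matrix remains invertible.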
Masaaki Imaizumi (The University of Tokyo)
Title: Generalization Analysis of Deep Models with Loss Surface and Likelihood Models
Data analysis using large models such as those in deep learning achieves high generalization performance, but the principles behind it are still unclear. In this talk, we propose two theoretical frameworks to explain these principles. First, we develop a regularization theory based on the shape of loss surfaces. Generalization error bounds based on uniform convergence have been questioned for the validity of their assumptions. To resolve this question, we show that a minimum of the loss surface in the population sense realizes regularization. Second, we develop a theory of the double descent phenomenon, in which the generalization error decreases in the limit of a large number of parameters. Despite its generality, its applicability has been limited to shallow neural networks. In this study, we show that it can be applied to a wide range of maximum likelihood models, including deep models.
Taiji Suzuki (The University of Tokyo)
Title: Optimization and statistical efficiency of neural network in mean field regimes
In this talk, I discuss the optimization of neural networks in mean field regimes and show their statistical efficiency with optimization guarantees. First, I present a deep learning optimization framework based on noisy gradient descent in an infinite dimensional Hilbert space (gradient Langevin dynamics), and show generalization error and excess risk bounds for the solution obtained by the optimization procedure. I will show that deep learning can avoid the curse of dimensionality in a teacher-student setting, and eventually achieve better excess risk than kernel methods. Next, I discuss the identifiability of neural networks trained with a gradient descent method in a teacher-student setting. I will show that, with a sparse regularization, the measure representation of the student network converges to that of the teacher network, while convergence in terms of parameters is not guaranteed. Finally, I will briefly introduce a dual averaging method to optimize neural networks with a mean field representation and discuss its convergence. The proposed method utilizes gradient Langevin dynamics and is guaranteed to converge to the globally optimal solution.
Ryo Karakida (National Institute of Advanced Industrial Science and Technology)
Title: Analysis of Fisher information and natural gradient descent in infinitely wide neural networks
Deep neural networks with random weights give us some insight into the typical behavior of networks or learning dynamics around random initialization. In particular, the neural tangent kernel (NTK) regime can hold for sufficiently wide networks and enables us to solve the learning dynamics explicitly. Following this line of analysis, we first investigate the local geometric structure of the loss surface characterized by Fisher information. Fisher information and NTK share the same eigenvalues, which determine some convergence properties of training. Second, we analyze the dynamics of natural gradient descent (NGD) in the NTK regime. While NGD is known to accelerate the convergence of training, it incurs a high computational cost, and thus we usually use an approximation of it in practice. We prove that typical approximation methods such as K-FAC and unit-wise Fisher can achieve the same convergence rate as exact NGD.
Lénaïc Chizat (Université Paris-Saclay)
Title: Analysis of Gradient Descent on Wide Two-Layer ReLU Neural Networks
Artificial neural networks are a family of parametric models which, given a training set of sample/label pairs, can be trained to predict the labels of new samples. To do so, the training algorithm updates the parameters using variants of the gradient descent method on a well-chosen objective function (the empirical risk, with potentially a regularization term). In this talk, we propose an analysis of gradient descent on wide two-layer ReLU neural networks (networks with many parameters but only one simple "positive part" non-linearity) that leads to sharp characterizations of the learned predictor. The main idea is to study the dynamics when the width of the neural network goes to infinity, which can be written as a Wasserstein gradient flow. While this dynamics evolves on a non-convex landscape, we show that when the parameters are initialized properly, its limit is a global minimizer. We also study the "implicit bias" of this algorithm in various situations: among all the minimizers, we show that it selects a specific one which depends, among other things, on the initialization and the choice of objective function. Along the way, we discuss what these results tell us about the statistical performance of these models. This is based on joint work with Francis Bach.
Online Networking Event
Wednesday 3rd March.
Yusuke Mukuta (The University of Tokyo)
Title: Feature coding using invariance and kernel approximation
Feature coding is a method that uses the statistics of local features as a global feature, which can be used to enhance the performance of Convolutional Neural Networks on image recognition tasks. We introduce two novel feature coding methods. The first exploits invariances of the input image, such as rotation invariance and flip invariance, in the feature coding function using the idea of tensor product representation. The second constructs a compact approximation of an existing coding function using the technique of kernel approximation. These methods achieve high recognition accuracy with a small feature dimension.
Invertible neural networks (INNs) are neural network architectures with invertibility by design. Thanks to their invertibility and the tractability of their Jacobians, INNs have found various machine learning applications such as probabilistic modeling and feature extraction. However, their attractive properties come at the cost of restricting the layer designs, which raises a question about their representation power: can we use these models to approximate sufficiently diverse functions? In this research, we developed a general theoretical framework to investigate the representation power of INNs, building on a structure theorem of differential geometry. More specifically, the framework allows us to show the universal approximation properties of INNs for approximating a large class of diffeomorphisms. We applied the framework to two representative examples of INNs, namely Coupling-Flow-based INNs (CF-INNs) and Neural Ordinary Differential Equations (NODEs), and elucidated their high representation power despite their restricted architectures.
Kenji Fukumizu (The Institute of Statistical Mathematics)
Title: Robust Topological Data Analysis using Reproducing Kernels
Persistent homology has become an important tool for extracting geometric and topological features from data, whose multi-scale features are summarized in a persistence diagram. From a statistical perspective, however, persistence diagrams are very sensitive to perturbations in the input space. In this work, we develop a framework for constructing robust persistence diagrams from superlevel filtrations of robust density estimators constructed using reproducing kernels. Using an analogue of the influence function on the space of persistence diagrams, we establish that the proposed framework is less sensitive to outliers. The robust persistence diagrams are shown to be consistent estimators in bottleneck distance, with the convergence rate controlled by the smoothness of the kernel; this in turn allows us to construct uniform confidence bands in the space of persistence diagrams. Finally, we demonstrate the superiority of the proposed approach on benchmark datasets.
Pierre Alquier (RIKEN)
Title: Parametric estimation via MMD optimization: robustness to outliers and dependence
In this talk, I will study the properties of parametric estimators based on the Maximum Mean Discrepancy (MMD) defined by Briol et al. (2019). First, I will show that these estimators are universal in the i.i.d. setting: even in case of misspecification, they converge to the best approximation of the distribution of the data in the model, without ANY assumption on this model. This leads to very strong robustness properties. Second, I will show that these results remain valid when the data are not independent, but instead satisfy a weak-dependence condition. This condition is based on a new dependence coefficient, which is itself defined in terms of the MMD. I will show through examples that this new notion of dependence is actually quite general. This talk is based on the following papers and software, with Badr-Eddine Chérief Abdellatif (Oxford University), Mathieu Gerber (University of Bristol), Jean-David Fermanian (ENSAE Paris) and Alexis Derumigny (University of Twente):
http://arxiv.org/abs/1912.05737
http://proceedings.mlr.press/v118/cherief-abdellatif20a.html
http://arxiv.org/abs/2006.00840
https://arxiv.org/abs/2010.00408
https://cran.r-project.org/web/packages/MMDCopula/
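To make the idea of minimum-MMD estimation in the abstract above concrete, here is a minimal sketch, not the speakers' implementation. It assumes a Gaussian kernel, a biased (V-statistic) estimate of the squared MMD, a one-dimensional location model represented by a small fixed set of points, and a grid search over the parameter. All function names and the toy data are hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two real numbers."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def mmd_location_estimate(data, candidates, model_points, bandwidth=1.0):
    """Grid search for the shift theta minimising MMD^2(data, theta + model_points)."""
    return min(
        candidates,
        key=lambda theta: mmd_squared(
            data, [theta + z for z in model_points], bandwidth
        ),
    )


# Toy data: five points centred at 0 plus one gross outlier at 50.
data = [-1.0, -0.5, 0.0, 0.5, 1.0, 50.0]
model_points = [-1.0, -0.5, 0.0, 0.5, 1.0]  # stand-in for the model at theta = 0
candidates = [0.5 * k for k in range(-4, 105)]  # grid from -2.0 to 52.0

theta_hat = mmd_location_estimate(data, candidates, model_points)
sample_mean = sum(data) / len(data)
print(theta_hat, sample_mean)
```

On this toy data the sample mean is pulled toward the outlier, while the MMD-based estimate stays at the centre of the bulk of the data. This illustrates, in a very small example, the kind of robustness the abstract describes.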
Krikamol Muandet (Max Planck Institute for Intelligent Systems)
Title: Maximum Moment Restriction and Its Applications
In this talk, I will discuss statistical inference and hypothesis testing on models that are specified by conditional moment restrictions (CMR). Our approach employs a maximum moment restriction (MMR) which is constructed by maximising the interaction between the generalised residual function and functions of the conditioning variables that belong to a unit ball of a vector-valued reproducing kernel Hilbert space (vv-RKHS). The reproducing kernel induces the information geometry on the parameter space from which we can choose the parameter estimates so that the sample-based MMR is zero. The MMR allows for an infinite continuum of moment restrictions to be used, while permitting a tractable objective function for both parameter estimation and hypothesis testing.
Associate Professor, The University of Tokyo
Associate Professor, The University of Tokyo
Professor, The Institute of Statistical Mathematics