Kernel methods are powerful learning methodologies that provide a simple way to construct nonlinear algorithms from linear ones. Despite their popularity, they suffer from poor scalability in big data scenarios. Various approximation methods, including random feature approximation, have been proposed to alleviate the problem. However, the statistical consistency of most of these approximate kernel methods is not well understood except for kernel ridge regression wherein it has been shown that the random feature approximation is not only computationally efficient but also statistically consistent with a minimax optimal rate of convergence. In this work, we investigate the efficacy of random feature approximation in the context of kernel principal component analysis (KPCA) by studying the statistical behavior of approximate KPCA in terms of the convergence of eigenspaces and the reconstruction error.
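As a rough illustration of the idea (not the estimator analysed in the talk), approximate KPCA with random Fourier features can be sketched as follows. The Gaussian kernel, the function name `rff_kpca` and all parameter values are assumptions made for this example.

```python
import numpy as np

def rff_kpca(X, n_features=200, n_components=2, sigma=1.0, seed=0):
    """Approximate Gaussian-kernel PCA using random Fourier features (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Frequencies drawn from the spectral measure of exp(-||x - y||^2 / (2 sigma^2)).
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    # Random feature map: Z @ Z.T approximates the exact Gram matrix.
    Z = np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
    Zc = Z - Z.mean(axis=0)                 # centre in the random feature space
    cov = Zc.T @ Zc / n                     # m x m covariance instead of the n x n Gram matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    scores = Zc @ eigvecs[:, order]         # projections onto the leading approximate eigenspace
    return scores, eigvals[order], Z

# Hypothetical usage on synthetic data.
X_demo = np.random.default_rng(1).normal(size=(50, 3))
scores_demo, eigvals_demo, _ = rff_kpca(X_demo, n_features=500, sigma=1.5)
```

The computational saving comes from the eigendecomposition of an m x m matrix, where m is the number of random features, rather than of the n x n Gram matrix.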
Motonobu Kanagawa (Max Planck Institute for Intelligent Systems)
Title: Convergence Analysis of Deterministic Kernel-Based Quadrature Rules in Misspecified Settings
In this talk, we present convergence analysis of kernel-based quadrature rules in misspecified settings, focusing on deterministic quadrature in Sobolev spaces. In particular, we deal with misspecified settings where a test integrand is less smooth than a Sobolev RKHS based on which a quadrature rule is constructed. We provide convergence guarantees based on two different assumptions on a quadrature rule: one on quadrature weights, and the other on design points. More precisely, we show that convergence rates can be derived (i) if the sum of absolute weights remains constant (or does not increase quickly), or (ii) if the minimum distance between design points does not decrease very quickly. As a consequence of the latter result, we derive a rate of convergence for Bayesian quadrature in misspecified settings. We reveal a condition on design points to make Bayesian quadrature robust to misspecification, and show that, under this condition, it may adaptively achieve the optimal rate of convergence in the Sobolev space of a lesser order (i.e., of the unknown smoothness of a test integrand), under a slightly stronger regularity condition on the integrand. (Joint work with Bharath K. Sriperumbudur and Kenji Fukumizu)
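To make the two quantities in (i) and (ii) concrete, here is a minimal sketch of Bayesian quadrature on [0, 1] with the uniform measure. The Matérn-3/2 kernel (whose RKHS is norm-equivalent to a Sobolev space of order 2 in one dimension), the length scale, the equispaced design and the kinked integrand are all assumptions chosen for illustration; they are not taken from the talk.

```python
import numpy as np

def matern32_bq_weights(x, length_scale=0.2):
    """Bayesian quadrature weights w = K^{-1} z for the Matern-3/2 kernel on [0, 1]."""
    a = np.sqrt(3.0) / length_scale
    r = np.abs(x[:, None] - x[None, :])
    K = (1.0 + a * r) * np.exp(-a * r)          # Gram matrix on the design points

    def partial_integral(u):
        # Closed form of the integral of (1 + a t) exp(-a t) over t in [0, u].
        return 2.0 * (1.0 - np.exp(-a * u)) / a - u * np.exp(-a * u)

    z = partial_integral(x) + partial_integral(1.0 - x)   # kernel mean at each design point
    return np.linalg.solve(K, z), K, z

x = np.linspace(0.0, 1.0, 50)                   # assumed equispaced design points
w, K, z = matern32_bq_weights(x)

f = lambda t: np.abs(t - 0.3)                   # kinked integrand, rougher than the RKHS
estimate = w @ f(x)                             # quadrature estimate of the integral (exact value 0.29)
abs_weight_sum = np.abs(w).sum()                # quantity controlled in assumption (i)
min_separation = np.diff(np.sort(x)).min()      # quantity controlled in assumption (ii)
```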
Krikamol Muandet (Mahidol University)
Title: Eigendecompositions of Transfer Operators in Reproducing Kernel Hilbert Spaces
Transfer operators such as the Perron-Frobenius or Koopman operator play an important role in the global analysis of complex dynamical systems. The eigenfunctions of these operators can be used to detect metastable sets, to project the dynamics onto the dominant slow processes, or to separate superimposed signals. We extend transfer operator theory to reproducing kernel Hilbert spaces and show that these operators are related to Hilbert space representations of conditional distributions, known as conditional mean embeddings in the machine learning community. Moreover, numerical methods to compute empirical estimates of these embeddings are akin to data-driven methods for the approximation of transfer operators such as extended dynamic mode decomposition and its variants. In fact, most of the existing methods can be derived from our framework, providing a unifying view on the approximation of transfer operators. One main benefit of the presented kernel-based approaches is that these methods can be applied to any domain where a similarity measure given by a kernel is available. We illustrate the results with the aid of guiding examples and highlight potential applications in molecular dynamics as well as video and text data analysis.
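The connection to data-driven methods can be illustrated with a toy kernel-EDMD-style eigenvalue problem. This is a sketch under simplifying assumptions (a deterministic linear map, a degree-2 polynomial kernel, a pseudo-inverse via least squares), not the framework presented in the talk; the names `kernel_edmd_eigenvalues` and `poly2_kernel` are hypothetical.

```python
import numpy as np

def kernel_edmd_eigenvalues(X, Y, kernel, rcond=1e-10):
    """Eigenvalues of the empirical Koopman-type problem G_yx a = lambda G_xx a."""
    G_xx = kernel(X, X)                      # G_xx[i, j] = k(x_i, x_j)
    G_yx = kernel(Y, X)                      # G_yx[i, j] = k(y_i, x_j), with y_i = F(x_i)
    # Minimum-norm least-squares solution, i.e. pinv(G_xx) @ G_yx with a singular value cutoff.
    A = np.linalg.lstsq(G_xx, G_yx, rcond=rcond)[0]
    eigvals = np.linalg.eigvals(A)
    return eigvals[np.argsort(-np.abs(eigvals))]

def poly2_kernel(A, B):
    """Degree-2 polynomial kernel on scalars: k(x, y) = (1 + x y)^2."""
    return (1.0 + np.outer(A, B)) ** 2

X = np.linspace(-1.0, 1.0, 20)               # sampled states
Y = 0.5 * X                                  # images under the toy map F(x) = 0.5 x
leading = kernel_edmd_eigenvalues(X, Y, poly2_kernel)[:3].real
```

For this toy map the functions 1, x and x^2 are Koopman eigenfunctions with eigenvalues 1, 0.5 and 0.25, and they span the feature space of the chosen kernel, so the three leading eigenvalues should match those values up to numerical error.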
Tuesday 20 Feb.
Song Liu (Bristol University)
Title: Change-Point Detection Using Density Ratio Estimation, Revisited
Lizhen Lin (The University of Notre Dame)
Title: Geometry and Statistics: Nonparametric Statistical Inference of Non-Euclidean Data
This talk presents some recent advances in nonparametric inference on manifolds and other non-Euclidean spaces. The focus is on nonparametric inference based on Frechet means. In particular, we present omnibus central limit theorems for Frechet means for inference, which can be applied to general metric spaces including stratified spaces, greatly expanding the current scope of inference. Applications are also provided to the space of symmetric positive definite matrices arising in diffusion tensor imaging. A robust framework based on the classical idea of median-of-means is also proposed, which yields estimates with provable robustness and improved concentration. In addition to inference for i.i.d. data, we also consider nonparametric regression problems where predictors or responses lie on manifolds. Various simulated and real data examples are considered.
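For readers unfamiliar with Frechet means, the sample version on the unit sphere can be computed by a simple fixed-point iteration with the exponential and logarithm maps. The sketch below assumes data concentrated in a small cap, where this iteration is well behaved; the function names and the synthetic data are illustrative assumptions only.

```python
import numpy as np

def sphere_log(p, Q):
    """Log map at p on the unit sphere, applied to each row of Q."""
    cos_t = np.clip(Q @ p, -1.0, 1.0)
    theta = np.arccos(cos_t)                         # geodesic distance from p to each q
    tangent = Q - cos_t[:, None] * p                 # component of q orthogonal to p
    norm = np.linalg.norm(tangent, axis=1)           # equals sin(theta)
    scale = np.where(norm > 1e-12, theta / np.maximum(norm, 1e-12), 0.0)
    return scale[:, None] * tangent

def sphere_exp(p, v):
    """Exp map at p on the unit sphere for a tangent vector v."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * v / norm

def frechet_mean_sphere(Q, n_iter=100, tol=1e-10):
    """Sample Frechet mean: minimiser of the mean squared geodesic distance."""
    p = Q.mean(axis=0)
    p /= np.linalg.norm(p)                           # start from the projected Euclidean mean
    for _ in range(n_iter):
        step = sphere_log(p, Q).mean(axis=0)         # proportional to the negative Riemannian gradient
        if np.linalg.norm(step) < tol:
            break
        p = sphere_exp(p, step)
    return p

# Hypothetical data: noisy points around the north pole, projected to the sphere.
rng = np.random.default_rng(0)
Q_demo = np.array([0.0, 0.0, 1.0]) + 0.2 * rng.normal(size=(200, 3))
Q_demo /= np.linalg.norm(Q_demo, axis=1, keepdims=True)
mean_demo = frechet_mean_sphere(Q_demo)
```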
We consider variable selection based on $n$ observations from a high-dimensional linear regression model. The unknown parameter of the model is assumed to belong to the class $S$ of all $s$-sparse vectors in $R^p$ whose non-zero components are greater than $a > 0$. Variable selection in this context is an extensively studied problem and various methods of recovering the sparsity pattern have been suggested. However, not much is known in theory beyond the consistency of selection. For Gaussian design, which is of major importance in the context of compressed sensing, necessary and sufficient conditions of consistency for some configurations of $n,p,s,a$ are available. They are known to be achieved by the exhaustive search decoder, which is not realizable in polynomial time and requires the knowledge of $s$. This talk will focus on the issue of optimality in variable selection based on the Hamming risk criterion. The benchmark behavior is characterized by the minimax risk on the class $S$. We propose an adaptive algorithm, independent of $s$, $a$, and the noise level, that nearly attains the value of the minimax risk. This algorithm is the first method that is both realizable in polynomial time and consistent under the same (minimal) sufficient conditions as the exhaustive search decoder.
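The Hamming criterion itself is easy to state in code. The sketch below uses a plain thresholding selector purely to show how the loss is counted; it is not the adaptive algorithm proposed in the talk, and the vectors and threshold are made-up values.

```python
import numpy as np

def threshold_selector(estimate, t):
    """Select coordinate j when |estimate_j| exceeds the threshold t (illustrative rule only)."""
    return (np.abs(estimate) > t).astype(int)

def hamming_loss(selected, truth):
    """Number of coordinates where the selected pattern differs from the true sparsity pattern."""
    return int(np.sum(selected != truth))

beta = np.array([0.0, 2.0, 0.0, -1.5, 0.0])      # hypothetical sparse parameter
noisy = np.array([0.3, 1.8, -1.1, -0.4, 0.1])    # hypothetical noisy estimate of beta
truth = (beta != 0).astype(int)
selected = threshold_selector(noisy, t=1.0)
loss = hamming_loss(selected, truth)             # one false positive plus one false negative
```

The Hamming risk is the expectation of this loss, and the minimax risk takes the worst case over the class $S$.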
Mladen Kolar (University of Chicago)
Title: Estimation and Inference for Differential Networks
We present a recent line of work on estimating differential networks and conducting statistical inference about parameters in a high-dimensional setting. First, we consider a Gaussian setting and show how to directly learn the difference between the graph structures. A debiasing procedure will be presented for the construction of an asymptotically normal estimator of the difference. Next, building on the first part, we show how to learn the difference between two graphical models with latent variables. A linear convergence rate is established for an alternating gradient descent procedure with correct initialization. Simulation studies illustrate the performance of the procedure. We also illustrate the procedure on an application in neuroscience. Finally, we will discuss how to do statistical inference on the differential networks when data are not Gaussian.
Kenji Fukumizu (The Institute of Statistical Mathematics)
Wednesday 21 Feb.
Taiji Suzuki (The University of Tokyo)
Title: Connecting Model Compression and Generalization Analysis for Deep Neural Network
In this talk, we consider a model compression problem for deep neural network models and show its connection to generalization error analysis. The generalization analysis is based on the eigenvalue distribution of the kernel functions defined in the internal layers. It gives a fast learning rate, and the obtained convergence rate is almost independent of the network size, unlike previous analyses. Based on the analysis, we develop a simple compression algorithm for neural networks which is applicable to a wide range of network models.
Yarin Gal (University of Oxford)
Title: Bayesian Deep Learning
Bayesian models are rooted in Bayesian statistics and easily benefit from the vast literature in the field. In contrast, deep learning lacks a solid mathematical grounding. Instead, empirical developments in deep learning are often justified by metaphors, evading the unexplained principles at play. These two fields are perceived as fairly antipodal to each other in their respective communities. It is perhaps astonishing then that most modern deep learning models can be cast as performing approximate inference in a Bayesian setting. The implications of this are profound: we can use the rich Bayesian statistics literature with deep learning models, explain away many of the curiosities with this technique, combine results from deep learning into Bayesian modeling, and much more. In this talk I will review a new theory linking Bayesian modeling and deep learning and demonstrate the practical impact of the framework with a range of real-world applications. I will also explore open problems for future research, problems that stand at the forefront of this new and exciting field.
Johannes Schmidt-Hieber (Leiden University)
Title: Statistical Theory for Deep Neural Networks with ReLU Activation Function
The universal approximation theorem states that neural networks are capable of approximating any continuous function up to a small error that depends on the size of the network. The expressive power of a network does not, however, guarantee that deep networks perform well on data. For that, control of the statistical estimation risk is needed. In the talk, we derive statistical theory for fitting deep neural networks to data generated from the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to logarithmic factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks with the number of potential parameters being much bigger than the sample size. Interestingly, the depth (number of layers) of the neural network architectures plays an important role and our theory suggests that scaling the network depth with the logarithm of the sample size is natural.
Masaaki Imaizumi (Institute of Statistical Mathematics)
Title: Statistical Estimation for Non-Smooth Functions by Deep Neural Networks
We investigate statistical properties of an estimation problem for non-smooth functions by multi-layer neural networks with a ReLU activation function. Despite the empirical success of deep neural networks, clarifying the source of their performance is still a developing problem. In particular, for the estimation of smooth functions, it is hard to identify advantages of deep neural networks, since other commonly used nonparametric methods are known to be optimal in the minimax sense. In this study, we consider estimation for piecewise smooth functions, which have a support divided into several regions and are smooth within each of the regions. We construct a neural network with ReLU functions for estimating the functions and derive a convergence rate of an estimator by the network. Based on the result, we discuss the efficiency of deep neural networks and practical methods for applications.
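One standard way to see why ReLU units cope with discontinuities is that two of them already form a steep ramp approximating a step function. The sketch below shows only this textbook building block, under assumed parameter values; it is not the network construction from the talk.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_step(x, threshold=0.0, sharpness=100.0):
    """Difference of two ReLU units: a ramp from 0 to 1 of width 1 / sharpness around the threshold."""
    s = sharpness * (np.asarray(x, dtype=float) - threshold)
    return relu(s + 0.5) - relu(s - 0.5)

def piecewise_constant(x, left=-1.0, right=2.0, threshold=0.5, sharpness=100.0):
    """Approximates a function equal to `left` below the threshold and `right` above it."""
    return left + (right - left) * relu_step(x, threshold, sharpness)

values = piecewise_constant(np.array([0.0, 0.5, 1.0]))
```

Increasing `sharpness` narrows the transition region, so the approximation error is confined to an arbitrarily small neighbourhood of the jump.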