Arthur Gretton (University College London)
Title: Conditional Densities and Efficient Models in Infinite Exponential Families
The exponential family is one of the most powerful and widely used classes of models in statistics. A method was recently developed to fit this model when the natural parameter and sufficient statistic are infinite dimensional, using a score matching approach. The infinite exponential family is a natural generalisation of the finite case, much like the Gaussian and Dirichlet processes generalise their respective finite models. In this talk, I'll describe two recent results which make this model more applicable in practice, by reducing the computational burden and improving performance for high-dimensional data. The first is a Nyström-like approximation to the full solution. We prove that this approximate solution has the same consistency and convergence rates as the full-rank solution (exactly in Fisher distance, and nearly in other distances), with guarantees on the degree of cost and storage reduction. The second result is a generalisation of the model family to the conditional case, again with consistency guarantees. In experiments, the conditional model generally outperforms a competing approach with consistency guarantees, and is competitive with a deep conditional density model on datasets that exhibit abrupt transitions and heteroscedasticity.
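The Nyström idea the abstract refers to can be sketched generically: approximate a full n×n kernel matrix from a small set of landmark points, trading a little accuracy for large savings in cost and storage. This is a minimal numpy illustration with an arbitrary Gaussian kernel, bandwidth, and landmark count, not the paper's infinite-exponential-family estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))

def gauss_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

K = gauss_kernel(X, X)                        # full n x n kernel matrix

# Nystrom: pick m << n landmarks and approximate K by K_nm K_mm^+ K_mn,
# reducing storage from O(n^2) to O(nm).
m = 40
idx = rng.choice(len(X), size=m, replace=False)
K_nm = gauss_kernel(X, X[idx])
K_mm = K_nm[idx]
K_approx = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
```

Because the Gaussian kernel's spectrum decays quickly, a small number of landmarks already gives a small relative error.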
Kernel methods are powerful learning methodologies that provide a simple way to construct nonlinear algorithms from linear ones. Despite their popularity, they suffer from poor scalability in big data scenarios. Various approximation methods, including random feature approximation, have been proposed to alleviate the problem. However, the statistical consistency of most of these approximate kernel methods is not well understood except for kernel ridge regression wherein it has been shown that the random feature approximation is not only computationally efficient but also statistically consistent with a minimax optimal rate of convergence. In this work, we investigate the efficacy of random feature approximation in the context of kernel principal component analysis (KPCA) by studying the statistical behavior of approximate KPCA in terms of the convergence of eigenspaces and the reconstruction error.
(Chair: Kenji Fukumizu)
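The random feature approximation studied in the KPCA abstract above can be sketched as follows: replace the implicit kernel feature map by explicit random Fourier features, then run ordinary linear PCA in that feature space. A minimal numpy sketch with arbitrary data, bandwidth, and feature count; not the analysis from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 2))

# Random Fourier features for the Gaussian kernel (Rahimi-Recht):
# k(x, y) = exp(-||x - y||^2 / 2) ~ z(x) . z(y) with random cosine features.
D = 500
W = rng.standard_normal((2, D))               # frequencies ~ N(0, I) for sigma = 1
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)      # (n, D) explicit feature matrix

# Approximate KPCA = linear PCA in the random feature space.
Zc = Z - Z.mean(0)
U, S, Vt = np.linalg.svd(Zc, full_matrices=False)
top_scores = Zc @ Vt[:2].T                    # projections onto top-2 components

# Sanity check of the kernel approximation on one pair of points.
k_exact = np.exp(-((X[0] - X[1]) ** 2).sum() / 2)
k_approx = Z[0] @ Z[1]
```

The statistical question in the talk is precisely how well the eigenspaces of this cheap surrogate track those of exact KPCA as D grows.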
Motonobu Kanagawa (Max Planck Institute for Intelligent Systems)
Title: Convergence Analysis of Deterministic Kernel-Based Quadrature Rules in Misspecified Settings
In this talk, we present convergence analysis of kernel-based quadrature rules in misspecified settings, focusing on deterministic quadrature in Sobolev spaces. In particular, we deal with misspecified settings where a test integrand is less smooth than a Sobolev RKHS based on which a quadrature rule is constructed. We provide convergence guarantees based on two different assumptions on a quadrature rule: one on quadrature weights, and the other on design points. More precisely, we show that convergence rates can be derived (i) if the sum of absolute weights remains constant (or does not increase quickly), or (ii) if the minimum distance between design points does not decrease very quickly. As a consequence of the latter result, we derive a rate of convergence for Bayesian quadrature in misspecified settings. We reveal a condition on design points to make Bayesian quadrature robust to misspecification, and show that, under this condition, it may adaptively achieve the optimal rate of convergence in the Sobolev space of a lesser order (i.e., of the unknown smoothness of a test integrand), under a slightly stronger regularity condition on the integrand. (Joint work with Bharath K. Sriperumbudur and Kenji Fukumizu)
(Chair: Arthur Gretton)
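A kernel quadrature rule of the kind analysed above can be built in a few lines: the weights solve a linear system between the kernel matrix at the design points and the kernel mean embedding of the integration measure. A minimal sketch using the Brownian-motion kernel (whose RKHS is a first-order Sobolev space) and the uniform measure on [0, 1]; the design, kernel, and test integrands are illustrative choices, not the talk's constructions:

```python
import numpy as np

# Brownian-motion kernel k(x, y) = min(x, y): its RKHS on [0, 1] is a
# first-order Sobolev space of functions with f(0) = 0.
n = 20
x = np.arange(1, n + 1) / n                   # design points 1/n, ..., 1
K = np.minimum(x[:, None], x[None, :])

# Kernel mean embedding of the uniform measure: z_i = int_0^1 min(x_i, t) dt.
z = x - x**2 / 2

# Quadrature weights solve K w = z; the rule is int f dP ~ sum_i w_i f(x_i).
w = np.linalg.solve(K, z)

est_smooth = w @ x                            # f(x) = x lies in the RKHS; integral = 1/2
est_rough = w @ np.sqrt(x)                    # f(x) = sqrt(x) is NOT in this RKHS
abs_weight_sum = np.abs(w).sum()              # condition (i) in the abstract
```

Here `abs_weight_sum` stays close to 1, and the rule still integrates the misspecified integrand sqrt(x) (true value 2/3) accurately, which is the kind of robustness the talk quantifies.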
Krikamol Muandet (Mahidol University)
Title: Eigendecompositions of Transfer Operators in Reproducing Kernel Hilbert Spaces
Transfer operators such as the Perron-Frobenius or Koopman operator play an important role in the global analysis of complex dynamical systems. The eigenfunctions of these operators can be used to detect metastable sets, to project the dynamics onto the dominant slow processes, or to separate superimposed signals. We extend transfer operator theory to reproducing kernel Hilbert spaces and show that these operators are related to Hilbert space representations of conditional distributions, known as conditional mean embeddings in the machine learning community. Moreover, numerical methods to compute empirical estimates of these embeddings are akin to data-driven methods for the approximation of transfer operators such as extended dynamic mode decomposition and its variants. In fact, most of the existing methods can be derived from our framework, providing a unifying view on the approximation of transfer operators. One main benefit of the presented kernel-based approaches is that these methods can be applied to any domain where a similarity measure given by a kernel is available. We illustrate the results with the aid of guiding examples and highlight potential applications in molecular dynamics as well as video and text data analysis.
(Chair: Arthur Gretton)
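The data-driven approximation of transfer operators mentioned above reduces, in its simplest form, to (extended) dynamic mode decomposition: regress next states on current states in a feature space and eigendecompose the resulting matrix. A minimal sketch with a linear dictionary on a linear system, where the estimated Koopman matrix recovers the dynamics exactly; the kernelised versions in the talk generalise this to arbitrary similarity measures:

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0.9, 0.1],
              [0.0, 0.5]])                    # stable linear dynamics

# Snapshot pairs (x_t, x_{t+1}) for x_{t+1} = A x_t.
X = rng.standard_normal((2, 300))
Y = A @ X

# EDMD with a linear dictionary (plain DMD): the finite-dimensional Koopman
# matrix is the least-squares solution of K_edmd X ~ Y.
K_edmd = Y @ np.linalg.pinv(X)
eigvals = np.sort(np.linalg.eigvals(K_edmd).real)
```

For this linear system the eigenvalues of the estimated operator coincide with those of A (0.5 and 0.9), i.e., the dominant (slow) modes the eigenfunctions are meant to expose.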
Tuesday 20 Feb.
Song Liu (Bristol University)
Title: Density Ratio Estimation Using the Stein Method and Its Applications
In this research, we estimate the ratio between the data-generating probability density function and a model density function with the help of the Stein operator. The estimated density ratio allows us to approximate the Kullback-Leibler divergence from a model to the data efficiently. We explore applications using such a goodness-of-fit measure, including parameter fitting, Bayesian learning and change point detection. This is joint work with Wittawat Jitkrittum and Carl Henrik Ek.
(Chair: Bharath K Sriperumbudur)
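The key property behind Stein-operator methods like the one above is the Stein identity: applying the model's Stein operator to a test function has expectation zero under the model, but not under a mismatched data distribution. A minimal Monte Carlo illustration for a standard normal model (the test function and shift are arbitrary choices, not the talk's estimator):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stein operator for a model density p: (T_p f)(x) = f'(x) + f(x) d/dx log p(x).
# For p = N(0, 1), d/dx log p(x) = -x, and E_p[(T_p f)(X)] = 0 for nice f.
x = rng.standard_normal(200_000)

f, fprime = np.sin, np.cos                    # arbitrary smooth test function
stein_mean_p = (fprime(x) - x * f(x)).mean()  # ~ 0: data matches the model

# Under a *different* data distribution q = N(1, 1) the mean is nonzero,
# which is what Stein-based discrepancies exploit to compare data with p.
xq = x + 1.0
stein_mean_q = (fprime(xq) - xq * f(xq)).mean()
```

Note that the operator only needs the score d/dx log p, so the model's normalising constant never enters.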
Kenji Fukumizu (The Institute of Statistical Mathematics)
Title: Machine Learning Approach to Topological Data Analysis
Topological data analysis (TDA) is a recent methodology for extracting topological and geometrical features from complex geometric data structures. Persistent homology, a new mathematical notion proposed by Edelsbrunner (2002), provides a multiscale descriptor for the topology of data, and has recently been applied to a variety of data analyses. In this talk I will introduce a machine learning framework for TDA by combining persistent homology and kernel methods. As an expression of persistent homology, persistence diagrams are widely used to represent the lifetimes of generators of homology groups. While they serve as a compact representation of data, it is not straightforward to apply standard data-analysis methodology, since they consist of a set of points in 2D space expressing the lifetimes. We introduce a method of kernel embedding of the persistence diagrams to obtain their vector representation, which enables one to apply any kernel methods in topological data analysis, and propose a persistence weighted Gaussian kernel as a suitable kernel for vectorization of persistence diagrams. Some theoretical properties including Lipschitz continuity of the embedding are discussed. I will also present applications to change point detection and time series analysis in the fields of materials science and biochemistry.
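The persistence weighted Gaussian kernel proposed above admits a short direct implementation: a weighted sum of Gaussian similarities between diagram points, with weights that grow with a point's lifetime so that noise near the diagonal is damped. A minimal sketch with a common arctan weighting and hypothetical parameter values; details may differ from the talk's definition:

```python
import numpy as np

def pwg_kernel(D1, D2, sigma=1.0, C=1.0, d=1.0):
    """Persistence weighted Gaussian kernel between two persistence diagrams.

    Each diagram is an (m, 2) array of (birth, death) pairs; a point with
    lifetime t gets weight arctan(C * t^d), damping points near the diagonal.
    """
    w1 = np.arctan(C * (D1[:, 1] - D1[:, 0]) ** d)
    w2 = np.arctan(C * (D2[:, 1] - D2[:, 0]) ** d)
    sq = ((D1[:, None, :] - D2[None, :, :]) ** 2).sum(-1)
    return (w1[:, None] * w2[None, :] * np.exp(-sq / (2 * sigma**2))).sum()

D_a = np.array([[0.0, 1.0], [0.1, 0.2]])      # one prominent + one noisy point
D_b = np.array([[0.0, 0.9]])
k_ab = pwg_kernel(D_a, D_b)
k_aa = pwg_kernel(D_a, D_a)
k_bb = pwg_kernel(D_b, D_b)
```

Since this is an inner product of weighted measure embeddings, it is a valid positive-definite kernel and can be plugged into any standard kernel method.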
We consider variable selection based on $n$ observations from a high-dimensional linear regression model. The unknown parameter of the model is assumed to belong to the class $S$ of all $s$-sparse vectors in $R^p$ whose non-zero components are greater than $a > 0$. Variable selection in this context is an extensively studied problem, and various methods of recovering the sparsity pattern have been suggested. However, beyond the consistency of selection, not much is known theoretically. For Gaussian design, which is of major importance in the context of compressed sensing, necessary and sufficient conditions of consistency for some configurations of $n,p,s,a$ are available. They are known to be achieved by the exhaustive search decoder, which is not realizable in polynomial time and requires the knowledge of $s$. This talk will focus on the issue of optimality in variable selection based on the Hamming risk criterion. The benchmark behavior is characterized by the minimax risk on the class $S$. We propose an adaptive algorithm independent of $s,a$, and of the noise level that nearly attains the value of the minimax risk. This algorithm is the first method that is both realizable in polynomial time and consistent under the same (minimal) sufficient conditions as the exhaustive search decoder.
(Chair: Song Liu)
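The Hamming-risk setting above can be illustrated in its simplest orthogonal-design analogue, the Gaussian sequence model, where the classical (non-adaptive) benchmark is universal thresholding at sigma * sqrt(2 log p). This sketch is not the talk's adaptive algorithm (which does not know s, a, or the noise level); all parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
p, s, a, sigma = 1000, 10, 3.0, 0.3

# s-sparse signal with non-zero components of magnitude a > 0.
theta = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
theta[support] = a

# Gaussian sequence model y = theta + sigma * noise (orthogonal-design
# analogue of the high-dimensional linear regression in the talk).
y = theta + sigma * rng.standard_normal(p)

# Universal thresholding: keep coordinates too large to be pure noise.
t = sigma * np.sqrt(2 * np.log(p))
selected = np.abs(y) > t
hamming = int((selected != (theta != 0)).sum())  # Hamming loss of the selection
```

When the signal magnitude a is comfortably above the threshold, the Hamming loss is (near) zero; the hard regime studied in the talk is when a sits close to the detection boundary.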
Mladen Kolar (University of Chicago)
Title: Estimation and Inference for Differential Networks
We present a recent line of work on estimating differential networks and conducting statistical inference about parameters in a high-dimensional setting. First, we consider a Gaussian setting and show how to directly learn the difference between the graph structures. A debiasing procedure will be presented for construction of an asymptotically normal estimator of the difference. Next, building on the first part, we show how to learn the difference between two graphical models with latent variables. A linear convergence rate is established for an alternating gradient descent procedure with correct initialization. Simulation studies illustrate the performance of the procedure. We also illustrate the procedure on an application in neuroscience. Finally, we will discuss how to do statistical inference on the differential networks when data are not Gaussian.
(Chair: Song Liu)
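The target object above, the differential network, is the difference of two precision matrices. In low dimensions it can be illustrated by a naive plug-in estimate; the talk's direct and debiased estimators exist precisely because this plug-in route (inverting each covariance separately) breaks down in high dimensions. All numbers below are a toy setup:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 4, 20000

# Two Gaussian graphical models whose precision matrices differ in one edge.
Omega1 = np.eye(p)
Omega2 = np.eye(p)
Omega2[0, 1] = Omega2[1, 0] = 0.4             # the differential edge

def sample(Omega, n):
    # Draw n samples from N(0, Omega^{-1}).
    return rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)

X1, X2 = sample(Omega1, n), sample(Omega2, n)

# Naive plug-in estimate of the differential network Delta = Omega2 - Omega1,
# followed by thresholding to read off the differential edges.
Delta_hat = np.linalg.inv(np.cov(X2.T)) - np.linalg.inv(np.cov(X1.T))
edges = np.abs(Delta_hat) > 0.2
```

With ample samples the single differential edge (0, 1) is recovered and all other entries are thresholded away.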
Wittawat Jitkrittum (Max Planck Institute for Intelligent Systems)
Title: A Linear-Time Kernel Goodness-of-Fit Test
We propose a novel adaptive test of goodness of fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are interpretable, indicating where the model does not fit the samples well. The features are constructed via Stein's method, meaning that it is not necessary to compute the normalising constant of the model. We analyse the asymptotic Bahadur efficiency of the new test, and prove that under a mean-shift alternative, our test always has greater relative efficiency than a previous linear-time kernel test, regardless of the choice of parameters for that test. In experiments, the performance of our method exceeds that of the earlier linear-time test, and matches or exceeds the power of a quadratic-time kernel test. In high dimensions and where model structure may be exploited, our goodness of fit test performs far better than a quadratic-time two-sample test based on the Maximum Mean Discrepancy, with samples drawn from the model.
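The linear-time statistic above evaluates a Stein witness function at a small number of test locations, which costs O(n) rather than the O(n^2) of a full kernel test. A minimal numpy sketch for a 1D standard normal model with fixed (not optimized) test locations; the talk's test additionally learns the locations by maximizing power:

```python
import numpy as np

rng = np.random.default_rng(6)

# Model p = N(0, 1), with score d/dx log p(x) = -x. The Stein witness at a
# location v averages the Steined kernel over the sample: O(n) per location.
def stein_witness(x, v, sigma=1.0):
    # xi(v) = mean_i [ score(x_i) k(x_i, v) + d/dx k(x_i, v) ], Gaussian k.
    k = np.exp(-(x[:, None] - v[None, :]) ** 2 / (2 * sigma**2))
    dk = -(x[:, None] - v[None, :]) / sigma**2 * k
    return ((-x[:, None]) * k + dk).mean(axis=0)

V = np.array([-1.0, 0.0, 1.0])                # fixed test locations

x_model = rng.standard_normal(20000)          # samples drawn from the model
x_shift = x_model + 0.7                       # mean-shifted alternative

stat_model = (stein_witness(x_model, V) ** 2).mean()
stat_shift = (stein_witness(x_shift, V) ** 2).mean()
```

Under the model the witness is near zero everywhere; under the mean-shift alternative it is visibly nonzero at the test locations, and those locations indicate *where* the model fails to fit.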
Taiji Suzuki (The University of Tokyo)
Title: Connecting Model Compression and Generalization Analysis for Deep Neural Network
In this talk, we consider a model compression problem for deep neural network models and show its connection to generalization error analysis. The generalization analysis is based on the eigenvalue distribution of the kernel functions defined in the internal layers. It gives a fast learning rate, and the obtained convergence rate is almost independent of the network size, unlike previous analyses. Based on the analysis, we develop a simple compression algorithm for the neural network which is applicable to a wide range of network models.
(Chair: Masaaki Imaizumi)
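The link between spectral decay and compressibility described above can be illustrated with the simplest spectral compression of a single layer: truncated SVD of its weight matrix. When the spectrum decays fast, a low-rank factorization preserves the layer's outputs with far fewer parameters. A generic sketch, not the talk's algorithm; the sizes and rank are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)

# A layer whose weight matrix is near low rank (fast-decaying spectrum).
n_in, n_out, rank = 256, 256, 8
W = rng.standard_normal((n_out, rank)) @ rng.standard_normal((rank, n_in)) / rank
W += 0.01 * rng.standard_normal((n_out, n_in))   # small full-rank noise

# Compress by truncated SVD: keep only the top-r singular directions.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 8
W_small = U[:, :r] @ np.diag(S[:r]) @ Vt[:r]

x = rng.standard_normal(n_in)
rel_out_err = np.linalg.norm(W @ x - W_small @ x) / np.linalg.norm(W @ x)

params_full = W.size                              # n_out * n_in
params_small = U[:, :r].size + r + Vt[:r].size    # two thin factors
```

The compressed layer stores the two thin factors instead of the dense matrix, cutting parameters by more than an order of magnitude here while barely changing the layer's output.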
Yarin Gal (University of Oxford)
Title: Bayesian Deep Learning
Bayesian models are rooted in Bayesian statistics and easily benefit from the vast literature in the field. In contrast, deep learning lacks a solid mathematical grounding. Instead, empirical developments in deep learning are often justified by metaphors, evading the unexplained principles at play. These two fields are perceived as fairly antipodal to each other in their respective communities. It is perhaps astonishing then that most modern deep learning models can be cast as performing approximate inference in a Bayesian setting. The implications of this are profound: we can use the rich Bayesian statistics literature with deep learning models, explain away many of the curiosities with this technique, combine results from deep learning into Bayesian modeling, and much more. In this talk I will review a new theory linking Bayesian modeling and deep learning and demonstrate the practical impact of the framework with a range of real-world applications. I will also explore open problems for future research—problems that stand at the forefront of this new and exciting field.
(Chair: Masaaki Imaizumi)
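One concrete instance of casting a deep model as approximate Bayesian inference is Monte Carlo dropout: keep dropout active at test time and read the spread of stochastic forward passes as predictive uncertainty. A minimal numpy sketch with a tiny untrained network standing in for a trained model (the weights and architecture are placeholders):

```python
import numpy as np

rng = np.random.default_rng(8)

# A tiny fixed two-layer ReLU network; the weights stand in for a trained model.
W1 = rng.standard_normal((64, 1))
W2 = rng.standard_normal((1, 64)) / 8

def predict_with_dropout(x, p_drop=0.5):
    # One stochastic forward pass: Bernoulli dropout on the hidden units,
    # kept *at test time*; this is what casts dropout as approximate
    # Bayesian inference in this framework.
    h = np.maximum(W1 * x, 0.0).ravel()
    mask = rng.random(64) >= p_drop
    return (W2.ravel() * h * mask).sum() / (1 - p_drop)

x0 = 1.5
samples = np.array([predict_with_dropout(x0) for _ in range(2000)])
pred_mean, pred_std = samples.mean(), samples.std()  # predictive mean and uncertainty
```

The sample mean approximates the predictive mean and the sample spread approximates model uncertainty, at the cost of a few extra forward passes per prediction.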
Johannes Schmidt-Hieber (Leiden University)
Title: Statistical Theory for Deep Neural Networks with ReLU Activation Function
The universal approximation theorem states that neural networks are capable of approximating any continuous function up to a small error that depends on the size of the network. The expressive power of a network does, however, not guarantee that deep networks perform well on data. For that, control of the statistical estimation risk is needed. In the talk, we derive statistical theory for fitting deep neural networks to data generated from the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to logarithmic factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks with number of potential parameters being much bigger than the sample size. Interestingly, the depth (number of layers) of the neural network architectures plays an important role and our theory suggests that scaling the network depth with the logarithm of the sample size is natural.
(Chair: Taiji Suzuki)
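The estimation setting above, fitting a ReLU network to data from a nonparametric regression model with a composition structure, can be sketched end to end in numpy. This is a toy one-hidden-layer network trained by full-batch gradient descent on a hand-picked composite target, not the sparse deep architectures of the theory; all hyperparameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(9)

# Regression data from a composition of simple functions, echoing the
# talk's structural assumption: f(x) = g(h(x)) with h(x) = |x|, g(u) = u^2.
n = 256
X = rng.uniform(-1, 1, size=(n, 1))
y = (np.abs(X) ** 2).ravel()

# One-hidden-layer ReLU network trained by full-batch gradient descent.
width = 32
W1 = rng.standard_normal((1, width)) * 0.5
b1 = np.zeros(width)
w2 = rng.standard_normal(width) * 0.5

def forward(X):
    H = np.maximum(X @ W1 + b1, 0.0)          # ReLU hidden layer
    return H, H @ w2

_, pred0 = forward(X)
loss0 = ((pred0 - y) ** 2).mean()

lr = 0.05
for _ in range(2000):
    H, pred = forward(X)
    g = 2 * (pred - y) / n                    # d(mean squared error)/d(pred)
    gw2 = H.T @ g                             # backprop through the output layer
    gH = np.outer(g, w2) * (H > 0)            # ReLU gradient mask
    W1 -= lr * (X.T @ gH)
    b1 -= lr * gH.sum(0)
    w2 -= lr * gw2

_, pred1 = forward(X)
loss1 = ((pred1 - y) ** 2).mean()             # training loss after fitting
```

The theory in the talk concerns what happens to the *estimation risk* of such fits when the network is deep, sparsely connected, and much larger than the sample size.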
Masaaki Imaizumi (Institute of Statistical Mathematics)
Title: Statistical Estimation for Non-Smooth Functions by Deep Neural Networks
We investigate the statistical properties of estimators constructed by deep neural networks (DNNs) to help explain why DNNs perform better than other methods. It has recently been shown empirically that DNNs outperform other common methods; however, the mechanism behind this performance is still not well understood. From the viewpoint of statistical theory, it is difficult to identify theoretical advantages of DNNs, since other common methods are known to be optimal as long as the data are generated by smooth functions. To fill the gap, in this study, we consider a certain class of non-smooth functions and derive convergence rates of the estimators by DNNs and by the other methods. Consequently, we show that the estimators by DNNs are nearly optimal, while some of the other methods are inefficient for estimating the non-smooth functions. Furthermore, our theoretical result provides guidelines for selecting an appropriate number of layers and edges of DNNs. We provide numerical experiments supporting the theoretical results.
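The gap described above, smoothness-oriented methods struggling on non-smooth targets while locally adaptive fits do not, can be seen in a toy comparison: a global polynomial versus a piecewise-constant fit (the kind of shape a ReLU network expresses cheaply) on a step function. This is an illustrative contrast only, not the talk's estimators or rates:

```python
import numpy as np

rng = np.random.default_rng(10)

# A non-smooth regression target: a step function with a jump at 0.
n = 400
x = np.sort(rng.uniform(-1, 1, size=n))
y = (x > 0).astype(float) + 0.05 * rng.standard_normal(n)

# Global degree-3 polynomial: a "smooth" estimator, hampered by the jump.
coef = np.polyfit(x, y, deg=3)
mse_poly = ((np.polyval(coef, x) - y) ** 2).mean()

# Piecewise-constant fit with a data-chosen split point, which can place
# its discontinuity at the jump.
best_mse = np.inf
for i in range(10, n - 10):
    left, right = y[:i], y[i:]
    mse = (((left - left.mean()) ** 2).sum()
           + ((right - right.mean()) ** 2).sum()) / n
    best_mse = min(best_mse, mse)
```

The smooth fit pays an irreducible bias around the discontinuity, while the adaptive fit's error is essentially the noise level, mirroring the inefficiency result in the abstract.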