What is machine learning?

Machine learning is a branch of artificial intelligence concerned with the design of algorithms that can automatically identify patterns or regularities in data. An example application is the problem of telling whether a person in an image is a “man” or a “woman”. To solve this task, we could manually write a computer program based on rules such as “if the subject has long hair, then it is likely to be a woman”. However, the number of rules needed to solve the problem this way would be too large to design and code by hand. Instead, machine learning techniques can use a dataset of images with associated labels “man” or “woman” to automatically identify patterns or regularities shared by images with the same label. These patterns can then be used to make predictions about new images with no associated labels.
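To make the idea concrete, here is a minimal sketch of learning from labelled data: a 1-nearest-neighbour classifier that predicts the label of a new example from its closest labelled example. The numeric features and the dataset are entirely made up for illustration; they are not from any real image dataset.

```python
# Minimal illustration of learning from labelled examples: a 1-nearest-
# neighbour classifier. The two numbers per example stand in for made-up
# numeric summaries extracted from an image (purely hypothetical features).

def nearest_neighbour_predict(train, new_point):
    """Return the label of the training example closest to new_point."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(train, key=lambda ex: distance(ex[0], new_point))
    return best[1]

# Labelled dataset: (features, label) pairs.
train = [((0.9, 0.2), "woman"), ((0.8, 0.3), "woman"),
         ((0.1, 0.7), "man"), ((0.2, 0.8), "man")]

print(nearest_neighbour_predict(train, (0.85, 0.25)))  # → woman
print(nearest_neighbour_predict(train, (0.15, 0.75)))  # → man
```

The classifier never sees an explicit rule; the "pattern" it exploits is simply proximity to previously labelled examples.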

What is model-based machine learning and the Bayesian framework?

Model-based machine learning is a simple and general recipe for designing machine learning algorithms that are specifically tailored to each new application. The central idea is to propose, for each application, a custom-made model that captures, in a computer-readable form, all our knowledge about the data-generating process. Different algorithms are then used to combine the proposed model with the observed data in order to make predictions. The Bayesian framework is a powerful implementation of this model-based approach. Within this framework, we assume that the observed data is produced by drawing samples from a probabilistic model. At the same time, any uncertainty about the unknown variables in the probabilistic model is encoded using probability distributions. Bayes’ theorem is then used to combine these probability distributions with the observed data in a consistent way. The updated distributions can finally be used to make reliable predictions. Bayesian machine learning was included by MIT Technology Review in its list of 10 emerging technologies that will change our world.
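The Bayesian updating step described above can be illustrated with the textbook Beta-Bernoulli example (a generic sketch, not taken from the work described on this page): a Beta prior over the unknown bias of a coin combined with observed flips yields a Beta posterior in closed form.

```python
# Bayes' theorem in its simplest conjugate form: a Beta prior on a coin's
# bias, updated with Bernoulli observations, gives a Beta posterior.

def beta_posterior(alpha, beta, heads, tails):
    """Beta(alpha, beta) prior + Bernoulli likelihood -> Beta posterior."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

# Start from a uniform prior Beta(1, 1) and observe 7 heads and 3 tails.
a, b = beta_posterior(1.0, 1.0, heads=7, tails=3)
print(beta_mean(a, b))  # posterior mean 8/12 ≈ 0.667
```

The posterior distribution, rather than a single point estimate, is what the Bayesian framework carries forward to make predictions with quantified uncertainty.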

What is my research about?

My work in Bayesian machine learning includes the design and implementation of scalable methods for approximate inference and the construction, evaluation and refinement of probabilistic models that successfully describe the statistical patterns present in the data. In recent years I have designed new Bayesian machine learning methods with applications to the prediction of customer purchases in online stores, the modeling of price changes in financial markets, the analysis of the connectivity of genes in biological systems, the discovery of new materials with optimal properties, and the design of more efficient hardware. I have focused on approaches based on probabilistic models, relying on methods for approximate inference that scale to large datasets. The results of this research have been published in top machine learning journals (Journal of Machine Learning Research) and conferences (NIPS and ICML). Below I include a description of a selection of my research, classified into different topics and with links to relevant publications.

Bayesian neural networks

Neural networks (NNs) have recently achieved impressive empirical results on a wide range of supervised learning problems. Part of the success of NNs is due to the fact that they can be trained on massive datasets using stochastic optimization and GPU accelerators. However, in many cases the network weights are poorly constrained by the available data, and it is desirable to produce uncertainty estimates along with predictions. These uncertainty estimates can be very useful for the efficient collection of new data, with applications to optimization in engineering design and exploration in reinforcement learning. In my work I have proposed novel algorithms for approximate Bayesian inference in neural networks that produce accurate uncertainty estimates in a computationally efficient way. An example is probabilistic backpropagation (PBP). Like classical backpropagation, PBP works by first propagating the data forward through the network and then performing a backward computation of gradients. However, PBP propagates probability distributions rather than deterministic quantities. Another algorithm for approximate inference is black-box alpha, which is based on the minimization of alpha-divergences. By changing the alpha parameter in the alpha-divergence, black-box alpha can interpolate between well-known methods for approximate Bayesian inference such as variational Bayes and an expectation-propagation-like algorithm.
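The core idea of propagating distributions rather than point values can be sketched for a single linear unit: if the weights and inputs are independent Gaussians, the mean and variance of their weighted sum are available in closed form. This is only a one-layer moment-matching sketch with toy numbers, not the full PBP algorithm (which also handles non-linearities and the backward pass).

```python
import numpy as np

# Sketch of forward moment propagation: push the mean and variance of
# Gaussian beliefs through a linear unit, instead of a single point value.

def linear_layer_moments(x_mean, x_var, w_mean, w_var):
    """Moments of sum_i w_i * x_i for independent Gaussian w_i and x_i."""
    out_mean = w_mean @ x_mean
    # For each term: Var(w*x) = v_w*v_x + v_w*m_x^2 + m_w^2*v_x, then sum.
    out_var = (w_var @ (x_var + x_mean ** 2)) + ((w_mean ** 2) @ x_var)
    return out_mean, out_var

x_mean = np.array([1.0, 2.0])
x_var = np.array([0.1, 0.1])        # uncertain inputs
w_mean = np.array([0.5, -0.5])
w_var = np.array([0.2, 0.2])        # uncertain weights

m, v = linear_layer_moments(x_mean, x_var, w_mean, w_var)
print(m, v)  # mean -0.5; variance reflects both sources of uncertainty
```

Chaining such moment computations layer by layer is what lets the network output a predictive distribution instead of a point prediction.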

  • Depeweg S., Hernández-Lobato J. M., Doshi-Velez F. and Udluft S.
    Learning and Policy Search in Stochastic Dynamical Systems with Bayesian Neural Networks,
    arXiv:1605.07127, 2016. [pdf]
  • Hernández-Lobato J. M., Li Y., Rowland M., Bui T. D., Hernández-Lobato D. and Turner R. E.
    Black-Box Alpha Divergence Minimization,
    In ICML, 2016. [pdf] [python code]
  • Hernández-Lobato J. M. and Adams R.
    Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks,
    In ICML, 2015. [pdf] [supp. material] [C and theano code]

Bayesian optimization

Some of my latest work is in the area of Bayesian optimization (BO). These methods reduce the number of function evaluations required to solve optimization problems with expensive objective functions. BO methods work by fitting a probabilistic model to the available data (evaluation locations for the objective and the corresponding function values). The predictions of the probabilistic model are then used to make intelligent decisions about where to evaluate the objective function next, so that its optimum is found using a reduced number of function evaluations. BO methods can be used to speed up optimal design problems in engineering. In my work, I have proposed a new information-theoretic method for BO called Predictive Entropy Search (PES). At each iteration, PES evaluates the objective function at the location expected to provide the largest amount of information about the solution of the optimization problem. The extensions of PES to BO problems with constraints and with multiple objectives are called PES with constraints (PESC) and PESMO, respectively. PES, PESC and PESMO are all state-of-the-art methods for BO problems. I am currently working with researchers from the Computer Architecture Group at Harvard on applications of PESMO to the design of low-power hardware. PESMO can be used to design hardware accelerators for deep neural networks that simultaneously achieve low prediction error, power consumption, prediction time and chip area.
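The generic BO loop described above can be sketched in a few lines. This toy example uses a Gaussian-process surrogate with an RBF kernel and a simple upper-confidence-bound acquisition rather than PES (whose information-theoretic acquisition is considerably more involved); the objective function and all settings are made up for illustration.

```python
import numpy as np

# A minimal Bayesian-optimisation loop (illustrative sketch, not PES):
# fit a GP surrogate to past evaluations, then evaluate the objective
# where the acquisition function (here UCB) is largest.

def rbf(A, B, ls=0.5):
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Posterior mean and std of a zero-mean GP at test points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mean = Ks.T @ np.linalg.solve(K, y)
    var = np.diag(rbf(Xs, Xs) - Ks.T @ np.linalg.solve(K, Ks))
    return mean, np.sqrt(np.clip(var, 0.0, None))

def objective(x):                        # pretend this is expensive
    return -np.sin(3 * x) - x ** 2 + 0.7 * x

rng = np.random.default_rng(0)
X = rng.uniform(-1, 2, size=3)           # a few initial random evaluations
y = objective(X)
grid = np.linspace(-1, 2, 200)           # candidate evaluation locations

for _ in range(15):
    mean, std = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(mean + 2.0 * std)]   # UCB acquisition
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

print(X[np.argmax(y)])  # location of the best evaluation found
```

The acquisition function is what distinguishes BO methods from one another: UCB trades off mean and uncertainty directly, whereas PES picks the point expected to yield the most information about the optimizer's location.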

  • Hernández-Lobato J. M., Gelbart M. A., Adams R. P., Hoffman M. W. and Ghahramani Z.
    A General Framework for Constrained Bayesian Optimization using Information-based Search,
    Journal of Machine Learning Research (in press), 2016. [pdf]
  • Hernández-Lobato D., Hernández-Lobato J. M., Shah A. and Adams R. P.
    Predictive Entropy Search for Multi-objective Bayesian Optimization,
    In ICML, 2016. [pdf] [Spearmint code]
  • Hernández-Lobato J. M., Gelbart M. A., Hoffman M. W., Adams R. and Ghahramani Z.
    Predictive Entropy Search for Bayesian Optimization with Unknown Constraints,
    In ICML, 2015. [pdf] [supp. material] [python code]
  • Hernández-Lobato J. M., Hoffman M. W. and Ghahramani Z.
    Predictive Entropy Search for Efficient Global Optimization of Black-box Functions,
    In NIPS, 2014. [pdf] [supp. material] [matlab code]

Copulas

Copulas are general models of dependence. They provide a powerful framework for the construction of multivariate probabilistic models by allowing us to separate the modelling of the univariate marginal distributions from the modelling of the dependencies between random variables. An important part of my work on copulas focuses on conditional copulas. This type of copula is useful when we have access to a vector of covariates that carries information about the dependence structure of the variables of interest. Conditional copulas have important applications in machine learning whenever we are required to estimate conditional or unconditional distributions from data. My research has produced new probabilistic methods for the estimation of conditional copulas. These methods are based on semi-parametric copula models in which the conditional dependence structure is parametric but the connection between this dependence structure and the conditioning variables is non-parametric. The resulting semi-parametric conditional copulas produce state-of-the-art results on the problem of predicting the temporal evolution of dependencies between different financial assets. Another important application of conditional copula models is the construction of high-dimensional copulas. Vine factorization methods allow us to construct high-dimensional copulas using two-dimensional conditional copulas as building blocks. However, existing vine techniques ignore any conditional dependencies in the copula functions and use only unconditional copulas in the vine factorization, which is equivalent to ignoring some of the dependencies present in the data. An alternative based on my research on conditional copulas employs fully conditional copulas in the vine factorization. The resulting conditional vine copulas achieve state-of-the-art performance in different prediction tasks.
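The separation of marginals from dependence can be illustrated with the simplest case, a bivariate Gaussian copula (a generic textbook sketch, not one of the semi-parametric models described above): correlated normals are mapped to dependent uniforms with the normal CDF, and any marginals can then be imposed via their inverse CDFs.

```python
import numpy as np
from math import erf, sqrt

# Sketch of the copula idea: a Gaussian copula couples two arbitrary
# marginals (here both Exponential(1)) with a chosen dependence level rho.

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def sample_gaussian_copula(n, rho, rng):
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    u = np.vectorize(normal_cdf)(z)   # dependent, but uniform marginals
    return -np.log(1.0 - u)           # inverse CDF of Exponential(1)

rng = np.random.default_rng(1)
x = sample_gaussian_copula(5000, rho=0.8, rng=rng)
print(np.corrcoef(x[:, 0], x[:, 1])[0, 1])  # clearly positive dependence
```

Swapping the last line for a different inverse CDF changes the marginals without touching the dependence structure, which is exactly the modularity that makes copulas attractive.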

  • Hernández-Lobato J. M., Lloyd J. R. and Hernández-Lobato D.
    Gaussian Process Conditional Copulas with Applications to Financial Time Series,
    In NIPS, 2013. [pdf] [data and R code] [R code only]
  • López-Paz D., Hernández-Lobato J. M. and Ghahramani Z.
    Gaussian Process Vine Copulas for Multivariate Dependence,
    In ICML, 2013. [pdf]
  • Hernández-Lobato J. M. and Suárez A.
    Semiparametric Bivariate Archimedean Copulas,
    Computational Statistics & Data Analysis, 55(6), 2038–2058, 2011. [pdf]

Matrix factorization methods

Many modern online services, such as social networking, on-demand streaming media, news, and cloud computing and storage, generate a large amount of user interaction data (e.g. social connections, product ratings, clicks or purchases). This type of data can often be encoded in the form of a big matrix whose rows represent users and whose columns encode different objects. Each entry in the matrix then contains the outcome of an interaction between the corresponding user (row) and object (column). Matrix factorization (MF) approaches are probably the most successful machine learning methods for modelling this type of data because of their simplicity and often superior predictive performance. These methods assume that the data matrix X (which can be partially observed) is well approximated by a low-rank product UV', and the objective is then to find the two low-rank matrices U and V given the data in X. My research has produced novel probabilistic MF methods for describing user interaction data. One of these methods learns pairwise preferences expressed by multiple users on different pairs of products, where non-linear interactions between users and products are captured using Gaussian processes. Approximate probabilistic inference is performed in this case with standard batch variational techniques. However, when the number of observed entries in X is massive, batch inference methods become infeasible. The alternative is to use stochastic inference techniques, which are more computationally efficient. I have developed new stochastic inference methods for probabilistic MF models. These methods exhibit faster convergence than more expensive batch approaches and also have better predictive performance than other scalable alternatives.
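The low-rank assumption X ≈ UV' can be demonstrated with a small non-Bayesian sketch: plain gradient descent on the squared error over the observed entries of a synthetic partially observed matrix. This illustrates only the factorization objective, not the probabilistic inference methods described above.

```python
import numpy as np

# Sketch of matrix factorisation on a partially observed matrix: find
# low-rank factors U, V so that U @ V.T matches X on the observed entries.
# Toy synthetic data; simple gradient descent, no Bayesian inference.

rng = np.random.default_rng(0)
n_users, n_items, rank = 20, 15, 3
X = rng.normal(size=(n_users, rank)) @ rng.normal(size=(n_items, rank)).T
mask = rng.random(X.shape) < 0.6            # observe ~60% of the entries

U = 0.1 * rng.normal(size=(n_users, rank))  # small random initialisation
V = 0.1 * rng.normal(size=(n_items, rank))
lr = 0.01
for _ in range(5000):
    R = mask * (U @ V.T - X)                # residual on observed entries
    U, V = U - lr * (R @ V), V - lr * (R.T @ U)

err = U @ V.T - X
rmse_obs = np.sqrt((err ** 2)[mask].mean())
rmse_missing = np.sqrt((err ** 2)[~mask].mean())
print(rmse_obs, rmse_missing)  # missing entries are predicted by the factors
```

The key point is the last line: entries never observed during training are filled in by the learned factors, which is how MF methods predict unseen user-object interactions.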

Bayesian sparsity and linear models

Many machine learning problems are characterized by a reduced number of available data instances and a very high-dimensional feature space. In these cases one often assumes a sparse linear model in which most of the coefficients are exactly zero except for a few that are significantly different from zero. In a Bayesian learning framework, the coefficients of the linear model can be encouraged to be zero or very close to zero by using sparsity-enforcing priors. These priors are characterized by density functions that are peaked at zero and also have large probability mass over a wide range of non-zero values. This structure tends to produce a bi-separation in the model coefficients: the posterior distribution of most coefficients is strongly peaked at zero; simultaneously, a small subset of coefficients have a large posterior probability of being significantly different from zero. This bi-separation effect is known as selective shrinkage. Ideally, the posterior mean of truly zero coefficients is shrunk towards zero, while at the same time the posterior mean of non-zero coefficients remains unaffected by the assumed prior. Different sparsity-enforcing priors have been proposed in the machine learning and statistics literature. Some examples are Laplace, Student’s t, horseshoe and spike-and-slab priors. My research has shown that, among these distributions, spike-and-slab priors usually have the best selective shrinkage properties, closely followed by horseshoe priors. I have also developed computationally efficient algorithms for approximate inference in linear models with spike-and-slab priors. These new algorithms are based on expectation propagation, a method that produces very good approximations to the exact posterior distribution. Finally, I have also designed new sparsity-enforcing priors that can take into account prior information about dependencies in the selective-shrinkage process.
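The spike-and-slab prior itself is easy to sketch: each coefficient is exactly zero with some probability (the "spike") and drawn from a broad Gaussian otherwise (the "slab"). The probabilities and scales below are arbitrary toy values, and this shows only the prior, not posterior inference.

```python
import numpy as np

# Draws from a spike-and-slab prior: a point mass at zero mixed with a
# broad Gaussian "slab". Samples are exactly sparse, unlike Laplace draws,
# which are merely concentrated near zero.

def sample_spike_and_slab(n, p_slab=0.1, slab_std=2.0, rng=None):
    rng = rng or np.random.default_rng(0)
    active = rng.random(n) < p_slab      # which coefficients get the slab
    return active * rng.normal(0.0, slab_std, size=n)

w = sample_spike_and_slab(1000, p_slab=0.1)
print((w == 0).mean())  # roughly 0.9: most coefficients are exactly zero
```

The exact zeros in the prior are what drive the bi-separation in the posterior: coefficients with weak evidence collapse onto the spike, while the rest are left to the slab and remain essentially unshrunk.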

  • Hernández-Lobato J. M., Hernández-Lobato D. and Suárez A.
    Expectation Propagation in Linear Regression Models with Spike-and-slab Priors,
    Machine Learning, 99.3: 437-487, 2015. [pdf] [R code]
  • Hernández-Lobato D. and Hernández-Lobato J. M.
    Learning Feature Selection Dependencies in Multi-task Learning,
    In NIPS, 2013. [pdf]
  • Hernández-Lobato J. M., Hernández-Lobato D. and Suárez A.
    Network-based Sparse Bayesian Classification,
    Pattern Recognition, 44(4), 886–900, 2011. [pdf]
  • Hernández-Lobato J. M.
    Balancing Flexibility and Robustness in Machine Learning: Semi-parametric Methods and Sparse Linear Models,
    Ph.D. Thesis, Universidad Autónoma de Madrid, 2010. [pdf]

Robust probabilistic models

I have designed robust probabilistic models that reduce the negative impact of their failure to perfectly describe the data. My contributions include robust Gaussian process (GP) models for binary and multi-class classification. GPs encode probability distributions over non-linear functions and are one of the cornerstones of non-parametric Bayesian machine learning. They can be understood as infinite-dimensional multivariate Gaussian distributions. In particular, any GP is fully determined by a mean function and a covariance function, which are the infinite-dimensional equivalents of the mean vector and covariance matrix that fully specify a multivariate Gaussian distribution in the finite-dimensional case. GPs are a popular method for Bayesian non-linear non-parametric regression and classification. In my research, GPs are a key element in the construction of non-linear probabilistic models that represent novel contributions to other areas of machine learning not necessarily related to the field of Gaussian processes. However, I have also made specific novel contributions to that field. In particular, I have developed new GP methods for robust non-linear binary and multi-class classification. These techniques are less affected than standard GP solutions by outliers in the data. I have also designed robust models for multi-task feature selection, where one often assumes that several learning tasks share relevant and irrelevant features. My techniques obtain improved results in these problems by allowing for outlier tasks and outlier features that do not satisfy the previous assumption.
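The statement that a GP is fully determined by its mean and covariance functions can be made concrete: evaluating those two functions on a finite grid yields an ordinary multivariate Gaussian from which random functions can be drawn. The zero mean, RBF covariance, and grid below are generic textbook choices, not tied to any specific model above.

```python
import numpy as np

# Sampling random functions from a GP prior: the mean function and the
# covariance function evaluated on a grid fully specify a finite-
# dimensional multivariate Gaussian over the function's values there.

def rbf_cov(x, lengthscale=0.5):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

x = np.linspace(0.0, 5.0, 100)
mean = np.zeros_like(x)                   # zero mean function
cov = rbf_cov(x) + 1e-8 * np.eye(len(x))  # small jitter for stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=3)  # three random functions
print(samples.shape)  # (3, 100)
```

Each row of `samples` is a smooth random function; choosing different covariance functions changes the kinds of functions the prior considers plausible.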

  • Hernández-Lobato D., Hernández-Lobato J. M. and Ghahramani Z.
    A Probabilistic Model for Dirty Multi-task Feature Selection,
    In ICML, 2015. [pdf] [supp. material] [R code]
  • Hernández-Lobato D., Hernández-Lobato J. M. and Dupont P.
    Robust Multi-Class Gaussian Process Classification,
    In NIPS, 2011. [pdf] [R code and sup. material]
  • Hernández-Lobato D. and Hernández-Lobato J. M.
    Bayes Machines for Binary Classification,
    Pattern Recognition Letters, 29(10), 1466–1473, 2008. [pdf]