Research

My research interests broadly span the areas of Machine Learning, Statistical Inference, and Crowdsourcing.



Publications


Abstracts

Maximum Likelihood Estimation for Learning Populations of Parameters

  • Ramya Korlakai Vinayak, Weihao Kong, Gregory Valiant, Sham Kakade.
  • Consider a setting with $N$ independent individuals, each with an unknown parameter, $p_i \in [0, 1]$ drawn from some unknown distribution $P^\star$. After observing the outcomes of $t$ independent Bernoulli trials, i.e., $X_i \sim \text{Binomial}(t, p_i)$ per individual, our objective is to accurately estimate $P^\star$. This problem arises in numerous domains, including the social sciences, psychology, health-care, and biology, where the size of the population under study is usually large while the number of observations per individual is often limited. Our main result shows that, in the regime where $t <<N$, the maximum likelihood estimator (MLE) is both statistically minimax optimal and efficiently computable. Precisely, for sufficiently large $N$, the MLE achieves the information theoretic optimal error bound of $\mathcal{O}(\frac{1}{t})$ for $t < c\log{N}$, with regards to the earth mover's distance (between the estimated and true distributions). More generally, in an exponentially large interval of $t$ beyond $c \log{N}$, the MLE achieves the minimax error bound of $\mathcal{O}(\frac{1}{\sqrt{t\log N}})$. In contrast, regardless of how large $N$ is, the naive "plug-in" estimator for this problem only achieves the sub-optimal error of $\Theta(\frac{1}{\sqrt{t}})$.

The Illusion of Change: Correcting for Bias when Inferring Changes in Sparse, Societal-Scale Data

  • Gabriel Cadamuro, Ramya Korlakai Vinayak, Joshua Blumenstock, Sham Kakade, Jacob N. Shapiro.
  • The Web Conference 2019, San Francisco, CA, USA.
  • Societal-scale data is playing an increasingly prominent role in social science research; examples from research on geopolitical events include questions on how emergency events impact the diffusion of information or how new policies change patterns of social interaction. Such research often draws critical inferences from observing how an exogenous event changes meaningful metrics like network degree or network entropy. However, as we show in this work, standard estimation methodologies make systematically incorrect inferences when the event also changes the sparsity of the data. To address this issue, we provide a general framework for inferring changes in social metrics when dealing with non-stationary sparsity. We propose a plug-in correction that can be applied to any estimator, including several recently proposed procedures. Using both simulated and real data, we demonstrate that the correction significantly improves the accuracy of the estimated change under a variety of plausible data generating processes. In particular, using a large dataset of calls from Afghanistan, we show that whereas traditional methods substantially overestimate the impact of a violent event on social diversity, the plug-in correction reveals the true response to be much more modest.

Crowdsourced Clustering: Querying Edges vs Triangles

  • Ramya Korlakai Vinayak, Babak Hassibi.
  • Neural Information Processing Systems (NIPS) 2016, Barcelona, Spain.
  • We consider the task of clustering items using answers from non-expert crowd workers. In such cases, the workers are often not able to label the items directly, however, it is reasonable to assume that they can compare items and judge whether they are similar or not. An important question is what queries to make, and we compare two types: random edge queries, where a pair of items is revealed, and random triangles, where a triple is. Since it is far too expensive to query all possible edges and/or triangles, we need to work with partial observations subject to a fixed query budget constraint. When a generative model for the data is available (and we consider a few of these) we determine the cost of a query by its entropy; when such models do not exist we use the average response time per query of the workers as a surrogate for the cost. In addition to theoretical justification, through several simulations and experiments on two real data sets on Amazon Mechanical Turk, we empirically demonstrate that, for a fixed budget, triangle queries uniformly outperform edge queries. Even though, in contrast to edge queries, triangle queries reveal dependent edges, they provide more reliable edges and, for a fixed budget, many more of them. We also provide a sufficient condition on the number of observations, edge densities inside and outside the clusters and the minimum cluster size required for the exact recovery of the true adjacency matrix via triangle queries using a convex optimization-based clustering algorithm.

Tensor-based crowdsourced clustering via triangle queries

  • Ramya Korlakai Vinayak, Tijana Zrnic, Babak Hassibi.
  • 42nd International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2017, New Orleans.
  • We consider the problem of crowdsourced clustering of a set of items based on queries of the similarity of triple of objects. Such an approach, called triangle queries, where it was shown that, for a fixed query budget, it outperforms clustering based on edge queries (i.e, comparing pairs of objects). In [1] the clustering algorithm for triangle and edge queries was identical and each triangle query response was treated as 3 separate edge query responses. In this paper we directly exploit the triangle structure of the responses by embedding them into a 3-way tensor. Since there are 5 possible responses to each triangle query, it is a priori not clear how best to embed them into the tensor. We give sufficient conditions on non-trivial embedding such that the resulting tensor has a rank equal to the underlying number of clusters (akin to what happens with the rank of the adjacency matrix). We then use an alternating least squares tensor decomposition algorithm to cluster a noisy and partially observed tensor and show, through extensive numerical simulations, that it significantly outperforms methods that make use only of the adjacency matrix.

Similarity Clustering in the Presence of Outliers: Exact Recovery via Convex Program

  • Ramya Korlakai Vinayak, Babak Hassibi.
  • International Symposium on Information Theory (ISIT) 2016, Barcelona, Spain.
  • We study the problem of clustering a set of data points based on their similarity matrix, each entry of which represents the similarity between the corresponding pair of points. We propose a convex-optimization-based algorithm for clustering using the similarity matrix, which has provable recovery guarantees. It needs no prior knowledge of the number of clusters and it behaves in a robust way in the presence of outliers and noise. Using a generative stochastic model for the similarity matrix (which can be thought of as a generalization of the classical Stochastic Block Model) we obtain \emph{precise bounds} (not orderwise) on the sizes of the clusters, the number of outliers, the noise variance, separation between the mean similarities inside and outside the clusters and the values of the regularization parameter that guarantee the exact recovery of the clusters with high probability. The theoretical findings are corroborated with extensive evidence from simulations.

Graph Clustering With Missing Data: Convex Algorithms and Analysis

  • Ramya Korlakai Vinayak, Samet Oymak, Babak Hassibi.
  • Neural Information Processing Systems (NIPS) 2014, Montreal, Canada.
  • We consider the problem of finding clusters in graphs which are partially observed. We analyze two programs based on regularized nuclear norm minimization, with a goal to recover the low- rank part of the adjacency matrix. For the commonly used Stochastic Block Model, we provide explicit conditions in terms of the size and sparsity of the clusters, amount of observed data, and the regularization parameter of the convex program that guarantees successful recovery of the clusters. We also corroborate our theoretical findings with extensive simulations, and also run our algorithms on a real data set obtained from crowdsourcing an image classification task on Amazon Mechanical Turk.

Sharp Performance Bounds for Graph Clustering via Convex Optimization

  • Ramya Korlakai Vinayak, Samet Oymak, Babak Hassibi.
  • 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2014, Florence, Italy.
  • We analyze a popular approach towards identifying clusters in graphs, that of a sparse plus low-rank decomposition of the adjacency matrix of a graph via a simple and intuitive convex program. We sharply characterize the conditions for the success and failure of this program in identifying clusters. We also present extensive simulations that corroborate our theoretical findings.