Estimating the support size of a distribution is a classical problem in
probability and statistics, dating back to the early work of Fisher and of
Good and Turing, and the famous "how many words did Shakespeare know" paper
of Efron and Thisted. In the sublinear regime, where the alphabet cardinality
far exceeds the sample size, the challenge lies in extrapolating the
number of unseen symbols from what has been seen so far.
We consider estimating the support size of a discrete distribution from independent samples, where the minimum non-zero probability mass is at least 1/k. We show that, to achieve an additive error of εk (for small ε) with probability at least 0.9, the optimal sample complexity is within universal constant factors of (k/log k)·log²(1/ε), which improves the state-of-the-art bound of (k/log k)·(1/ε)² due to Valiant and Valiant. The apparatus of best polynomial approximation, together with its duality to the moment-matching problem, plays a key role both in the impossibility result and in the construction of linear-time estimators based on Chebyshev polynomials with provable optimality. We also discuss the closely related species problem, where the goal is to estimate the number of distinct colors in a ball-urn model from repeated draws. We characterize the sharp sample complexity and show that one can strictly outperform general support size estimation using discrete orthogonal polynomials. Extensions to other distributional properties and experimental results will also be discussed.
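A basic fact underlying such polynomial-based estimators is that, under Poisson sampling, the falling factorial of an observed count is an unbiased estimate of the corresponding power of the Poisson mean; hence any polynomial in the unknown probabilities (e.g. a Chebyshev approximation of an indicator) can be estimated without bias from counts alone. A minimal numerical sketch of this fact (the parameters `lam`, `m`, and `trials` below are illustrative choices, not values from the talk):

```python
import math
import random

def falling_factorial(n, m):
    """Compute (n)_m = n * (n-1) * ... * (n-m+1)."""
    out = 1
    for i in range(m):
        out *= (n - i)
    return out

def sample_poisson(lam):
    """Draw one Poisson(lam) sample via Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

# Monte Carlo check: if N ~ Poisson(lam), then E[(N)_m] = lam**m,
# so (N)_m is an unbiased estimator of the m-th power of the mean.
random.seed(0)
lam, m, trials = 3.0, 4, 200_000
est = sum(falling_factorial(sample_poisson(lam), m) for _ in range(trials)) / trials
# est should be close to lam**m = 81.0
```

Linearity then lets one trade bias for variance: a degree-L polynomial approximant of the target functional translates, coefficient by coefficient, into a linear combination of falling factorials of the empirical counts.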
Prof. Yihong Wu received his PhD from Princeton University in 2011 under the supervision of Prof. Sergio Verdú. He was a postdoctoral fellow at the University of Pennsylvania before joining the University of Illinois at Urbana-Champaign. He has won the Marconi Society Paul Baran Young Scholar Award and the ISIT Best Student Paper Award twice. His research interests include information theory, high-dimensional statistics, fractal geometry, additive combinatorics, optimal transportation, and stochastic control.