Probability estimation from small samples

This page contains expressions in MathML and is best viewed with FireFox 3.5 or higher.

1 Introduction

Let us assume that in an experiment we have conducted n independent trials, of which there are r successes. The rest of the trials (n-r) are failures. Examples of such experiments are tossing a coin (e.g. success is a head), throwing a dice (e.g. success is 6 on top), shooting a free throw in basketball (success is a scored point), etc.

In this report we will discuss the following question: how to estimate the probability of success in the next (n+1) trial. In addition, we would like to devote special attention to the cases when the sample size is effectively small. In this context we will discuss and compare three methods: relative frequency, Laplace's law of succession and m-estimate.

What exactly do we mean by effectively small sample size? Good (1965) states then when there are more than three successes and more than three failures in our examples set, there is little difference between any of the described methods for many practical purposes. So, we will basically be dealing with the cases when either the number of successes or the number of failures are small (e.g. 0, 1, 2). Note that such situations can occur also when the actual sample size is large, but we divide the trial set to subsets that fulfill certain conditions for the purpose of estimating conditional probabilities in such subsets.

2 Estimates for success in the next trial

2.1 Relative frequency

Relative frequency is sometimes called also maximum likelihood estimate. Probability of success in the nest trial is computed according to the following formula:

p = r / n

[ $p = \frac{r}{n}$ ]

The estimation of probabilities can be regarded as a relatively simple task when the sample size is large enough. In such case we would hardly require any theory. Bernoulli's theorem states that when n is large enough, the probability of success in the next trial can be reliably estimated by the relative frequency p = r / n [ $p = \frac{r}{n}$ ]. More formally, for any arbitrary small $ε$ and $δ$ such $n_{0}$ can be found that for every $n \geq n_{0}$ the following inequation holds:

$P (| \frac{r}{n} - p | < ε) > 1 - δ$

However, after completing just one trial, which was a failure, the relative frequency probability estimate of a success in the next trial would be 0.

2.2 Laplace's law of succession

In order to alleviate such zero probability estimations, a modified formula was proposed:

$p = \frac{r + 1}{n + 2}$

In this formula a uniform prior probability is assumed (Good, 1965). In fact, the rationale behind adding 1 in the numerator and 2 in denominator is the following: before performing the actual experiment, we assume that there were two trials, one successful and one failure.

2.3 Bayesian m-estimate

More general Bayesian probability estimate is described in (Cestnik, 1990, 1991). To calculate a general Bayesian probability estimate on the basis of evidence, a prior probability distribution has to be assumed first. Then, given the evidence, the prior distribution is updated to a posterior one, from which the expectation can be taken as a point estimate of p. The task of determining a prior probability distribution has been identified as an intrinsic difficulty of the Bayesian approach.

$p = \frac{r + p_{a} m}{n + m}$

3 Estimation of conditional probabilities

The basic idea behind the proposed m-estimate (Cestnik, 1990, 1991) for estimating conditional probabilities is that the prior probabilities can be estimated from an unconditional sample. The remaining parameter m, which is related to the variance, has to be determined also. The prior variance is computed by the following formula:

$Var (p) = \frac{p_{a} (1 - p_{a})}{m + 1}$

The parameter m is inversely proportional to the variance of the prior distribution. It also leverages the tradeoff between relative frequency and prior probability, as can be observed from the following form of m-estimate:

$p = \frac{n}{n + m} \frac{r}{n} + \frac{m}{n + m} p_{a}$

4 DmB program

DmB is an acronym for "Data mining using Bayesian probability estimate". It is useful for experimenting with various estimation parameters while generating simple rules for a given problem domain.

DmB is implemented in Borland Delphi and runs on Microsoft Windows platforms. It's use is free of charge. The zip file contains three files:

DmB.exe - Executable program file
DmB.pdf - DmB Instruction for use
titanic.csv - demo data file

You can download it here.

5 Literature

B. Cestnik: Revisiting the optimal probability estimator from small samples for data mining. International journal of applied mathematics and computer science, 2019, vol. 29, no. 4, str. 783-796. ISSN 1641-876X. DOI: 10.2478/amcs-2019-0058.

B. Cestnik: Estimating probabilities in machine learning, Ph.D. thesis, University of Ljubljana, Faculty of Computer and Information Science, 1991.

B. Cestnik: Estimating probabilities: A crucial task in machine learning. In: Carlucci Aiello, Luigia (ed.). ECAI 90. London: Pitman, 1990, str. 147-149.

B. Cestnik, I. Bratko: On estimating probabilities in tree pruning. In: Kodratoff, Yves. Machine learning - EWSL-91 : European working session on learning, Porto, Portugal, March 6-8, 1991: proceedings, (Lecture notes in computer science, Lecture notes in artificial intelligence, 482). Berlin [etc.]: Springer-Verlag, 1991, str. 138-150.

S. Džeroski, B. Cestnik, I. Petrovski: Using the m-estimate in rule induction. CIT. J. Comput. Inf. Technol., 1993, vol. 1, str. 37-46

I. J. Good: The Estimation of Probabilities: An Essay on Modern Bayesian Methods, Cambridge, Mass., M.I.T. Press, 1965.

Page visited 5617 times since 1.9.2017

Page updated: 1. 1. 2019