ML Wiki

Sampling

Statistical Inference - making conclusions and decisions incomplete information in based on data. This is the main goal of Statistics

Population - the group we're interested in making conclusions about.
Census - collection of data from the entire population
- Census is almost impossible or very expensive to obtain
Sample - a subset of the population, typically a small fraction

Goals

So, the goal of sampling (data collection):

based on a sample make conclusions about the population
this is done at the Data Collection step in the process of statistical investigation (see Statistics)

For ML models there are other goals

how to reduce data to speed up computation?
select a subset of rows - a sample

Types of Sampling

We need a representative sample to be able to generalize from the statistics calculated on a sample to the population parameters

For that we need to use randomization - and have #Random Sampling
otherwise can have #Non-Random Sampling

Random Sampling

Random sampling (especially SRS - simple random sampling) is very important

in Inferential Statistics - when making the independence assumption about the observations
doesn't introduce bias

Replacements

without replacement
- when item is selected for a sampling, it's taken out of the population
sampling with replacement
- an item can be sampled several times
- used in the Bootstrap method - for resampling
also see Simulation Basics in R#Sampling

Simple Random Sampling

Randomly pick up items from the population

the original distribution of data is not always kept

Stratified Sampling

divide the population into non-overlapping groups (called strata)
and use SRS within each stratum
so the original distribution is kept

Also called

Sampling with proportional allocation
Under-sampling of the majority class
etc

Cluster Sampling

use Cluster Analysis to divide the population into clusters
select a cluster at random and use all the items from that cluster
Use then it's easer to select a group than an item

Examples

Example 1

1 mln elements
5% True, 95% False
want to sample 100 examples
proportional (stratified): 5 True, 95 False
without proportional (uniform): 50 True, 50 False

Reason to use proportional

suppose you need to be good at detecting TRUE
but you'll have only 5 records to train your classified - not enough!
so it's better to use stratified sampling

Stratified Sampling Example

Assume a company with the following allocation of staff

	Full Time	Part Time
Male	90	18
Female	9	63

How to build a sample of 40 staff?

Stratified with proportional allocation: according to the distribution
total number: $N = 180$
calculate the percentage in each group

	Full Time	Part Time
Male	90 / 180 = 50%	18 / 180 = 10%
Female	9 / 180 = 5%	63 / 180 = 35%

So we know that

50% in out sample of 40 should be males, full time

	Full Time	Part Time
Male	20	4
Female	2	14

Non-Random Sampling

Systematic sampling
- every $n$th individual, non-representative is there's a structure
Convenience / Volunteer sampling
- select first $n$ who are available or volunteer to participate. Also non-representative
all these may introduce bias into the samples

Bias

A sample is biased if it's differs from a population in a systematic way
That can result in a statistics that's consistently larger or smaller

Types of Biases

Selection bias - when you systematically exclude or under-represent a part of population
Measurement/Response bias - when data is collected with systematic error
Non-response bias - when responses aren't obtained from all individuals selected for inclusion in sampling

Sources

Sampling

Contents

Sampling

Goals

Types of Sampling

Random Sampling

Replacements

Simple Random Sampling

Stratified Sampling

Cluster Sampling

Examples

Example 1

Stratified Sampling Example

Non-Random Sampling

Bias

Types of Biases

Sources