ML Wiki
Machine Learning Wiki - A collection of ML concepts, algorithms, and resources.

Sampling

Statistical Inference - making conclusions and decisions based on data, under incomplete information. This is the main goal of Statistics.

  • Population - the group we're interested in making conclusions about
  • Census - collection of data from the entire population
    • a census is usually impossible or very expensive to obtain
  • Sample - a subset of the population, typically a small fraction

Goals

So, the goal of sampling (data collection):

  • based on a sample make conclusions about the population
  • this is done at the Data Collection step in the process of statistical investigation (see Statistics)

For ML models there are other goals:

  • reduce the amount of data to speed up computation
  • to do that, select a subset of rows - a sample

Types of Sampling

We need a representative sample to be able to generalize from the statistics calculated on a sample to the population parameters.

Random Sampling

Random sampling (especially SRS - simple random sampling) is very important:

  • in Inferential Statistics it underlies the independence assumption about the observations
  • it doesn't introduce systematic bias into the sample

Replacements

  • sampling without replacement
    • when an item is selected for the sample, it's taken out of the population
  • sampling with replacement
    • an item can be sampled several times
    • used in the Bootstrap method - for resampling
  • also see Simulation Basics in R#Sampling
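
The two modes can be sketched with Python's standard `random` module. This is only an illustration: the ten-item population and the fixed seed are assumptions made for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(10))  # hypothetical population of 10 items

# Without replacement: each item can appear at most once in the sample.
without_replacement = rng.sample(population, k=5)

# With replacement: the same item may be drawn several times.
with_replacement = rng.choices(population, k=5)

# One bootstrap resample: same size as the data, drawn with replacement.
bootstrap_resample = rng.choices(population, k=len(population))
```

`random.sample` with equal chances for every item is also a simple random sample in the sense used below.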

Simple Random Sampling

Pick items from the population at random, so that every item has the same chance of being selected.

Stratified Sampling

  • divide the population into non-overlapping groups (called strata)
  • and use SRS within each stratum
  • if each stratum is sampled in proportion to its size, the original distribution across strata is kept in the sample
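
A minimal sketch of these steps, assuming the stratum of each item can be read off by a function. The helper name `stratified_sample`, the `stratum_of` argument and the 900/100 toy data are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, fraction, seed=0):
    """Draw an SRS of roughly `fraction` of the items within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)  # non-overlapping groups
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # proportional allocation
        sample.extend(rng.sample(members, k))  # SRS without replacement
    return sample

# Hypothetical data: 900 "False" rows and 100 "True" rows.
rows = [("False", i) for i in range(900)] + [("True", i) for i in range(100)]
sample = stratified_sample(rows, stratum_of=lambda r: r[0], fraction=0.1)
counts = {label: sum(1 for r in sample if r[0] == label) for label in ("False", "True")}
```

Here `counts` keeps the 9:1 ratio of the original data, because every stratum is sampled at the same rate.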

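Related names

  • Sampling with proportional allocation - each stratum is sampled in proportion to its size
  • Under-sampling of the majority class - a related variant where the largest stratum is deliberately sampled at a lower rate than the others
  • etc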

Cluster Sampling

  • use Cluster Analysis to divide the population into clusters
  • select a cluster at random and use all the items from that cluster
  • use it when it's easier to select a group than an individual item
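
The selection step might look as follows. This sketch assumes the clusters are already known (it does not run any Cluster Analysis). The function name `cluster_sample` and the school data are made up for illustration.

```python
import random

def cluster_sample(clusters, n_clusters=1, seed=0):
    """Pick whole clusters at random and return every item in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    items = [item for name in chosen for item in clusters[name]]
    return chosen, items

# Hypothetical population that is already grouped, e.g. pupils by school.
clusters = {
    "school_a": ["a1", "a2", "a3"],
    "school_b": ["b1", "b2"],
    "school_c": ["c1", "c2", "c3", "c4"],
}
chosen, items = cluster_sample(clusters, n_clusters=1)
```

Only the cluster is chosen at random. Once it is chosen, all of its items go into the sample.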

Examples

Example 1

  • 1 mln elements
  • 5% True, 95% False
  • want to sample 100 examples
  • proportional (stratified): 5 True, 95 False
  • without proportional (uniform): 50 True, 50 False

Choosing the allocation

  • suppose you need to be good at detecting True
  • with proportional allocation you'll have only 5 True records to train your classifier - not enough
  • so it's better to sample each class separately and take equally from both (50 True, 50 False)

Stratified Sampling Example

Assume a company with the following allocation of staff

            Full Time   Part Time
  Male      90          18
  Female    9           63

How to build a sample of 40 staff?

  • Stratified with proportional allocation: according to the distribution
  • total number: $N = 180$
  • calculate the percentage in each group

            Full Time         Part Time
  Male      90 / 180 = 50%    18 / 180 = 10%
  Female    9 / 180 = 5%      63 / 180 = 35%

So we know that

  • 50% of our sample of 40 should be male, full time: $0.5 \cdot 40 = 20$
  • the other groups are calculated the same way

            Full Time   Part Time
  Male      20          4
  Female    2           14
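
The same arithmetic as a short sketch, using the staff counts from the table. The variable names are illustrative. With these numbers every share comes out as a whole number, but in general the rounded counts may not add up to the sample size and would need adjusting.

```python
# Staff counts per group, taken from the table above.
staff = {
    ("Male", "Full Time"): 90,
    ("Male", "Part Time"): 18,
    ("Female", "Full Time"): 9,
    ("Female", "Part Time"): 63,
}
sample_size = 40
total = sum(staff.values())  # N = 180

# Proportional allocation: each group's share of the sample equals
# its share of the population.
allocation = {
    group: round(count * sample_size / total)
    for group, count in staff.items()
}
```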

Non-Random Sampling

  • Systematic sampling
    • take every $n$th individual; non-representative if there's a structure in the ordering
  • Convenience / Volunteer sampling
    • select first $n$ who are available or volunteer to participate. Also non-representative
  • all these may introduce bias into the samples
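
A small sketch of how structure in the ordering breaks systematic sampling. The data is a hypothetical list of 28 consecutive weekdays, and `systematic_sample` is an illustrative helper.

```python
def systematic_sample(items, step, start=0):
    """Take every `step`-th item, beginning at index `start`."""
    return items[start::step]

# Hypothetical records ordered by day: four full weeks.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] * 4

# The step matches the weekly period, so only one weekday is ever picked.
weekly = systematic_sample(days, step=7)

# A step that doesn't line up with the period covers more of the weekdays.
every_third = systematic_sample(days, step=3)
```

With `step=7` the sample contains only Mondays, so anything that varies by weekday is misrepresented.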

Bias

  • A sample is biased if it differs from the population in a systematic way
  • That can result in a statistic that's consistently larger or smaller than the population parameter

Types of Biases

  • Selection bias - when you systematically exclude or under-represent a part of population
  • Measurement/Response bias - when data is collected with systematic error
  • Non-response bias - when responses aren’t obtained from all individuals selected for inclusion in sampling
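
Selection bias can be illustrated with a toy simulation. The population (the values 1 to 100), the rule that excludes values above 50, and the sample sizes are all assumptions made for the example.

```python
import random
import statistics

population = list(range(1, 101))  # hypothetical population values
population_mean = statistics.mean(population)

# Selection bias: everything above 50 is systematically excluded
# from the list we actually sample from.
frame = [x for x in population if x <= 50]

rng = random.Random(0)
sample_means = [statistics.mean(rng.sample(frame, 10)) for _ in range(100)]
```

Each sample is drawn at random, but only from the restricted list, so every sample mean falls below the population mean. Taking more samples doesn't fix this.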
