Sampling
Statistical Inference: making conclusions and decisions based on data when only incomplete information is available. This is the main goal of Statistics
 Population: the group we're interested in making conclusions about.
 Census: collection of data from the entire population
 A census is usually very expensive or practically impossible to obtain
 Sample: a subset of the population, typically a small fraction
Goals
So, the goal of sampling (data collection):
 based on a sample make conclusions about the population
 this is done at the Data Collection step in the process of statistical investigation (see Statistics)
For ML models there are other goals
 reduce the amount of data to speed up computation
 select a subset of rows: a sample
Types of Sampling
We need a representative sample to be able to generalize from the statistics calculated on a sample to the population parameters
Random Sampling
Random sampling (especially SRS: simple random sampling) is very important
 in Inferential Statistics: it justifies the independence assumption about the observations
 it doesn't introduce systematic bias
Replacements
 sampling without replacement
 when an item is selected for the sample, it's taken out of the population
 sampling with replacement
 an item can be sampled several times
 used in the Bootstrap method for resampling
 also see Simulation Basics in R#Sampling
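The difference between the two modes can be sketched in Python with the standard `random` module. The population of 10 integers and the fixed seed are assumptions made only for illustration.

```python
import random

rng = random.Random(0)        # fixed seed, only so the sketch is reproducible
population = list(range(10))  # hypothetical population of 10 items

# Without replacement: each item can appear at most once in the sample.
without_replacement = rng.sample(population, k=5)

# With replacement: the same item may be drawn several times,
# so the sample can even be larger than the population.
with_replacement = rng.choices(population, k=15)

print(without_replacement)
print(with_replacement)
```

Drawing 15 items from a population of 10 is only possible with replacement, which is why bootstrap resampling relies on it.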
Simple Random Sampling
Randomly pick items from the population, so that every item has the same chance of being selected
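A minimal sketch of how a statistic computed on a simple random sample approximates the population parameter. The normally distributed population, its size and the sample size are hypothetical values chosen for the illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed for reproducibility

# Hypothetical population: 100,000 values drawn from a normal distribution.
population = [rng.gauss(50, 10) for _ in range(100_000)]

# Simple random sample: every item has the same chance of being selected.
sample = rng.sample(population, k=1_000)

population_mean = statistics.fmean(population)  # the parameter (usually unknown)
sample_mean = statistics.fmean(sample)          # the statistic we can compute

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With 1,000 observations the two means should typically differ by well under one unit, although the exact gap depends on the draw.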
Stratified Sampling
Stratified sampling:
 divide the population into non-overlapping groups (called strata)
 and use SRS within each stratum
 with proportional allocation, the original distribution across strata is kept
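Related terms
 Sampling with proportional allocation: each stratum is sampled in proportion to its size
 Undersampling of the majority class: a non-proportional allocation that takes relatively fewer items from the largest stratum
 etc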
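Stratified sampling with proportional allocation can be sketched as below. The helper `stratified_sample`, its parameters and the two-label population are hypothetical names introduced for this illustration, not part of any library.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, n, rng):
    """Draw about n items, sampling each stratum in proportion to its size."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for members in strata.values():
        # Proportional allocation; rounding means the total may differ slightly from n.
        k = round(n * len(members) / len(items))
        sample.extend(rng.sample(members, k))  # SRS within the stratum
    return sample

rng = random.Random(0)
# Hypothetical population: 700 items labelled "a", 300 labelled "b".
population = [("a", i) for i in range(700)] + [("b", i) for i in range(300)]
sample = stratified_sample(population, key=lambda item: item[0], n=50, rng=rng)

counts = {label: sum(1 for s in sample if s[0] == label) for label in ("a", "b")}
print(counts)  # {'a': 35, 'b': 15}
```

The 70% / 30% split of the population is reproduced exactly in the sample, which a plain SRS would only achieve on average.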
Cluster Sampling
Cluster sampling:
 divide the population into clusters, e.g. with Cluster Analysis
 select a cluster at random and use all the items from that cluster
 use it when it's easier to select a group than an individual item
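A rough sketch of the selection step, assuming the population has already been divided into clusters. The cluster names and members below are made up for the example.

```python
import random

rng = random.Random(1)

# Hypothetical population already divided into clusters (e.g. one cluster per school).
clusters = {
    "school_a": ["a1", "a2", "a3"],
    "school_b": ["b1", "b2"],
    "school_c": ["c1", "c2", "c3", "c4"],
}

# Pick one cluster at random, then keep every item in it.
chosen = rng.choice(sorted(clusters))
sample = list(clusters[chosen])

print(chosen, sample)
```

Only the cluster is chosen at random here; no randomness is applied to the items inside it.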
Examples
Example 1
 1 mln elements
 5% True, 95% False
 want to sample 100 examples
 proportional allocation: 5 True, 95 False
 equal (non-proportional) allocation: 50 True, 50 False
Reason not to use proportional allocation here
 suppose you need to be good at detecting True
 with proportional allocation you'll have only 5 True records to train your classifier: not enough!
 so it's better to stratify by class and undersample the majority class
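The two allocations from Example 1 can be compared in a short sketch. Representing the records as integer ids, with the first 5% labelled True, is an assumption made to keep the example small. The helper `draw` is hypothetical.

```python
import random

rng = random.Random(0)
n_total, n_sample = 1_000_000, 100

# Hypothetical labels: the first 5% of record ids are True, the rest False.
true_ids = range(0, 50_000)
false_ids = range(50_000, n_total)

def draw(n_true, n_false):
    """SRS within each class (stratum), with the given allocation."""
    return rng.sample(true_ids, n_true), rng.sample(false_ids, n_false)

# Proportional allocation: keeps the 5% / 95% class distribution.
prop_true, prop_false = draw(round(0.05 * n_sample), round(0.95 * n_sample))

# Equal allocation: undersamples the majority class to get a balanced sample.
eq_true, eq_false = draw(n_sample // 2, n_sample // 2)

print(len(prop_true), len(prop_false))  # 5 95
print(len(eq_true), len(eq_false))      # 50 50
```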
Stratified Sampling Example
Assume a company with the following allocation of staff

| | Full Time | Part Time |
|---|---|---|
| Male | 90 | 18 |
| Female | 9 | 63 |

How to build a sample of 40 staff?
 Stratified with proportional allocation: according to the distribution
 total number: $N = 180$
 calculate the percentage in each group

| | Full Time | Part Time |
|---|---|---|
| Male | 90 / 180 = 50% | 18 / 180 = 10% |
| Female | 9 / 180 = 5% | 63 / 180 = 35% |

So we know that
 50% of our sample of 40 should be full-time males, i.e. $0.5 \cdot 40 = 20$ people; the other groups are computed the same way

| | Full Time | Part Time |
|---|---|---|
| Male | 20 | 4 |
| Female | 2 | 14 |
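The arithmetic of this example can be checked with a few lines of Python. The dictionary layout and variable names are just one possible way to encode the staff table.

```python
# Staff counts from the company table in this example.
staff = {
    ("Male", "Full Time"): 90,
    ("Male", "Part Time"): 18,
    ("Female", "Full Time"): 9,
    ("Female", "Part Time"): 63,
}
sample_size = 40
total = sum(staff.values())  # N = 180

# Each group gets sample_size * (group size / N) places in the sample.
# The shares divide evenly here; in general, rounding may shift the total slightly.
allocation = {group: round(sample_size * count / total) for group, count in staff.items()}

for group, k in allocation.items():
    print(group, k)
```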

Non-Random Sampling
 Systematic sampling
 select every $n$th individual; non-representative if there's a periodic structure in the ordering
 Convenience / Volunteer sampling
 select the first $n$ who are available or volunteer to participate. Also non-representative
 all these may introduce bias into the samples
 A sample is biased if it differs from the population in a systematic way
 That can result in a statistic that's consistently larger or smaller than the population parameter
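A small sketch of how systematic sampling becomes biased when the data has a periodic structure. The weekly sales figures are invented for the illustration.

```python
import statistics

# Hypothetical daily sales for 52 weeks: weekdays sell 100, weekend days sell 200.
week = [100, 100, 100, 100, 100, 200, 200]  # Mon..Sun
population = week * 52

# Systematic sample: every 7th day, starting from the first Saturday (index 5).
systematic = population[5::7]

print(statistics.fmean(population))  # true mean, about 128.57
print(statistics.fmean(systematic))  # 200.0: every sampled day is a Saturday
```

Because the sampling step matches the weekly period, the sample statistic is consistently larger than the population mean no matter how many weeks are included.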
Types of Biases
 Selection bias: when you systematically exclude or underrepresent a part of the population
 Measurement/Response bias: when data is collected with systematic error
 Nonresponse bias: when responses aren't obtained from all individuals selected for inclusion in the sample
Sources