Sampling Methods and Core Statistical Concepts for Data Analysis
Posted by Anonymous and classified in Mathematics
Sampling Methods
- Simple Random Sampling: every unit in the sampling frame has an equal probability of selection -> good representation, but may still suffer from non-response bias.
- Systematic Sampling: choose a random starting point among the first k units, then select every k-th unit after it; equal probability of selection -> simple to implement, but may give poor representation if the ordering of subjects has a periodic pattern that lines up with the interval k.
- Stratified Sampling: divide the sampling frame into non-overlapping strata; strata can differ in size; apply simple random sampling within each stratum; selection probabilities are equal only when each stratum is sampled in proportion to its size -> good representation, but requires information about the sampling frame and the strata.
- Cluster Sampling: divide the sampling frame into clusters; select a fixed number of clusters using simple random sampling and include the units in the selected clusters; each cluster has an equal probability of selection -> less tedious, but may give poor representation if the clusters differ markedly from one another.
- Convenience Sampling (not recommended): choose subjects who are easily available to participate -> selection bias and non-response bias; not representative.
- Volunteer Sampling (not recommended): subjects self-select into the sample and often hold strong opinions on the research question -> selection bias and non-response bias; not representative.
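The four probability sampling methods above can be sketched with Python's standard `random` module. This is only an illustration under stated assumptions: the sampling frame of 100 subject IDs, the two strata, the ten clusters, and all function names are hypothetical. The stratified sampler assumes proportional allocation (the same fraction from every stratum), and the cluster sampler assumes every unit in a chosen cluster is kept.

```python
import random

def simple_random_sample(frame, n, rng):
    """Every unit has the same chance; draw n units without replacement."""
    return rng.sample(frame, n)

def systematic_sample(frame, k, rng):
    """Random start among the first k units, then every k-th unit."""
    start = rng.randrange(k)
    return frame[start::k]

def stratified_sample(strata, fraction, rng):
    """Simple random sample of the same fraction within each stratum."""
    sample = []
    for units in strata.values():
        n = max(1, round(len(units) * fraction))
        sample.extend(rng.sample(units, n))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame, strata, and clusters (not from the notes)
frame = list(range(1, 101))
strata = {"year1": frame[:60], "year2": frame[60:]}
clusters = {f"class{i}": frame[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(frame, 10, rng)
sys_s = systematic_sample(frame, 10, rng)
strat = stratified_sample(strata, 0.1, rng)
clus = cluster_sample(clusters, 2, rng)

print("simple random:", sorted(srs))
print("systematic:   ", sys_s)
print("stratified:   ", sorted(strat))
print("cluster:      ", clus)
```

Each call returns a list of subject IDs. Note how the systematic sample is fully determined by its starting point, which is why an ordering pattern in the frame can hurt it.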
Variables, Mean, Variance & Standard Deviation
- Independent Variable: the factor whose effect is being studied; it is intentionally manipulated in an experimental study and only observed in an observational study.
- Dependent Variable: the outcome hypothesized to change depending on how the independent variable is manipulated.
- Categorical: variables that take on a limited number of distinct categories or groups.
- Numerical: variables that represent measurable quantities (continuous or discrete).
- Mean: the arithmetic average of numerical data.
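- Variance: the average squared deviation from the mean; a measure of spread. The population variance divides the sum of squared deviations by n, while the sample variance divides by n − 1.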
- Standard Deviation: the square root of the variance; measures dispersion in the same units as the data.
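As a small worked example of the mean, variance, and standard deviation, the sketch below computes them by hand and checks the results against Python's `statistics` module. The data values are made up for illustration.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
n = len(data)

# Mean: sum of the values divided by how many there are
mean = sum(data) / n

# Sum of squared deviations from the mean
ss = sum((x - mean) ** 2 for x in data)

pop_var = ss / n          # population variance (divide by n)
samp_var = ss / (n - 1)   # sample variance (divide by n - 1)
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

# The standard library gives the same numbers
assert math.isclose(mean, statistics.mean(data))
assert math.isclose(pop_var, statistics.pvariance(data))
assert math.isclose(samp_var, statistics.variance(data))

print(f"mean={mean}, population variance={pop_var}, population SD={pop_sd}")
print(f"sample variance={samp_var:.3f}, sample SD={samp_sd:.3f}")
```

For this data the mean is 5, the population variance is 4, and the population standard deviation is 2. The sample versions are slightly larger because they divide by n − 1.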
Median, Quartiles, IQR & Mode
- Median: the middle value after arranging observations in ascending order. If there are two middle values, take their average.
- Q1 and Q3: Q1 is the 25th percentile (middle of the lower half); Q3 is the 75th percentile (middle of the upper half).
- IQR: interquartile range = Q3 − Q1; the IQR is always non-negative.
- Mode: the value that appears most often; interpreted as the peak of the distribution.
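The definitions above can be sketched as follows, using the "middle of each half" convention for quartiles. This is one of several conventions in use, so other tools (including `statistics.quantiles`) may return slightly different quartiles for the same data. The function name and the data are hypothetical; for an odd number of observations this sketch assumes the median is left out of both halves.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median of each half."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For odd n, skip the middle value so it belongs to neither half
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return (statistics.median(lower),
            statistics.median(xs),
            statistics.median(upper))

data = [7, 3, 10, 1, 8, 3, 9, 6]  # hypothetical observations

q1, med, q3 = quartiles_by_halves(data)
iqr = q3 - q1
mode = statistics.mode(data)

print(f"Q1={q1}, median={med}, Q3={q3}, IQR={iqr}, mode={mode}")
```

Sorted, the data are 1, 3, 3, 6, 7, 8, 9, 10, so the median is the average of 6 and 7, Q1 is the median of the lower four values, and Q3 is the median of the upper four.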
Experimental vs Observational Studies
- Experimental (controlled) study: intentionally applies a treatment, i.e. manipulates the independent variable, to observe its effect on the dependent variable. Experimental studies can provide evidence for cause-and-effect relationships. The treatment group is exposed to the treatment being tested; the control group receives either no treatment or a placebo. With a placebo, participants do not know whether they are in the treatment or control group (single blinding). Double blinding occurs when the assessors also do not know which group they are assessing.
- Observational study: observes individuals and measures variables of interest without manipulating the independent variable. Observational studies can provide evidence of association but not cause-and-effect relationships because confounders may be present.
Association and Related Topics
- Association: relationships between variables (does not necessarily imply causation).
- Rules on rates: considerations when comparing rates (numerator, denominator, time period, and population at risk).
- Simpson's Paradox: a trend that appears in different groups of data can disappear or reverse when the groups are combined; watch for lurking variables and confounders.
- Confounders: variables that are associated with both the independent and dependent variables and can distort observed associations.
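Simpson's Paradox is easiest to see with numbers. The sketch below uses entirely made-up counts for two hypothetical treatments, A and B, split by a hypothetical confounder (case severity). Treatment A has the higher success rate within each subgroup, yet B has the higher rate once the subgroups are combined, because A was given mostly to severe cases.

```python
# Hypothetical counts: (successes, total) for each treatment in each subgroup
counts = {
    "mild":   {"A": (90, 100),   "B": (800, 1000)},
    "severe": {"A": (300, 1000), "B": (20, 100)},
}

def subgroup_rate(group, treatment):
    """Success rate of one treatment within one subgroup."""
    successes, total = counts[group][treatment]
    return successes / total

def combined_rate(treatment):
    """Success rate of one treatment with the subgroups pooled."""
    successes = sum(counts[g][treatment][0] for g in counts)
    total = sum(counts[g][treatment][1] for g in counts)
    return successes / total

for group in counts:
    a, b = subgroup_rate(group, "A"), subgroup_rate(group, "B")
    print(f"{group:>8}: A={a:.2f}  B={b:.2f}")

print(f"combined: A={combined_rate('A'):.3f}  B={combined_rate('B'):.3f}")
```

Severity here is associated with both the treatment received and the outcome, which is what makes it a confounder. Each combined rate also lies between that treatment's two subgroup rates, pulled toward the subgroup with more subjects.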
Data Visualization and Regression
- Histogram (univariate): visualizes the distribution of a single numerical variable.
- Boxplot (univariate): visual summary showing median, quartiles, and potential outliers.
- Scatter plot (bivariate): displays the relationship between two numerical variables.
- Correlation Coefficient: measures the strength and direction of the linear association between two numerical variables; it ranges from −1 to 1.
- Linear Regression: models the relationship between a dependent variable and one or more independent variables using a linear equation.
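A minimal sketch of the correlation coefficient and simple linear regression with one independent variable, written out from the usual sums of squares rather than a library call. The function names and the five data points are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def least_squares_line(xs, ys):
    """Slope and intercept of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical paired observations
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
slope, intercept = least_squares_line(xs, ys)

print(f"r = {r:.4f}")
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
```

The sign of the slope always matches the sign of r, since both share the same numerator; r describes how tightly the points follow a line, while the slope describes how steep that line is.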
Probability, Inference and Hypothesis Testing
- Probability: quantifies uncertainty and the likelihood of events.
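- Conjunction Fallacy & Base Rate Fallacy: common errors in probabilistic reasoning. The conjunction fallacy is judging two events occurring together to be more likely than one of them alone; the base rate fallacy is ignoring how common an event is overall (its base rate) when interpreting a conditional probability.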
- Random Variable: a variable that takes numeric values according to outcomes of a random phenomenon.
- Statistical Inference & Confidence Intervals: methods for drawing conclusions about populations from sample data and quantifying uncertainty.
- Hypothesis Testing: procedures to assess evidence against a null hypothesis using sample data.
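Three of the ideas above can be sketched in a few lines: Bayes' rule applied to a base rate problem, a confidence interval for a proportion, and a two-sided test of a proportion. All numbers are hypothetical, the function names are made up for this sketch, and both the interval and the test assume the large-sample normal approximation, which is only one of several possible approaches.

```python
import math
from statistics import NormalDist

def posterior_given_positive(base_rate, sensitivity, false_positive_rate):
    """P(condition | positive test) by Bayes' rule."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * false_positive_rate
    return true_pos / (true_pos + false_pos)

def proportion_ci(successes, n, level=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def proportion_z_test(successes, n, p0):
    """Two-sided z-test of the null hypothesis that the proportion equals p0."""
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

# Base rate problem: hypothetical 1% prevalence, 95% sensitivity,
# 5% false positive rate
posterior = posterior_given_positive(0.01, 0.95, 0.05)
print(f"P(condition | positive) = {posterior:.3f}")

# Hypothetical sample: 120 "yes" answers out of 300 respondents
low, high = proportion_ci(120, 300)
z, p_value = proportion_z_test(120, 300, p0=0.5)
print(f"95% CI for the proportion: ({low:.3f}, {high:.3f})")
print(f"z = {z:.3f}, p-value = {p_value:.5f}")
```

Even with an accurate test, the posterior probability stays low because the base rate is low; ignoring that is the base rate fallacy. In the second example the interval excludes 0.5 and the p-value is small, so both approaches point to the same conclusion about the null hypothesis.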