Core Statistical Concepts and Methods
Statistical Goals
- Describe: Summarize what's happening in the data (e.g., mean, median, mode, minimum, variation).
- Explore: Understand how different variables relate to each other.
- Draw Inference: Test hypotheses or theories to make generalizations. Important: Correlation doesn't equal causation.
- Predict: Forecast future outcomes (e.g., a weather network predicting tomorrow's temperature).
- Draw Causal Inference: Determine cause-and-effect relationships, which requires experiments.
Variables in Statistics
Variable Types
- Categorical (e.g., color, name, religion) vs. Numerical (Discrete: whole-number counts OR Continuous: can take decimal values, e.g., an average movie rating of 7.4).
- Nominal: Categories with no inherent order.
- Ordinal: Categories with a meaningful order.
- Interval: Ordered, equal intervals, but zero is arbitrary (e.g., 0 degrees Celsius doesn't mean no temperature).
- Ratio: Ordered, equal intervals, with a true zero point (e.g., number of children, snacks consumed - zero means none).
Roles in Experiments
- Independent Variable: The variable that is manipulated or changed by the researcher.
- Dependent Variable: The outcome variable that is measured to see the effect of the independent variable.
- Example: Exploring the influence of sad mood (independent) on music preference (dependent).
Common Statistical Graphs
- Histograms: For numerical, continuous data; bars represent frequency within consecutive intervals (bins), plotted in numeric order.
- Bar Charts: For categorical (or discrete numerical) data; bars represent counts or amounts, and category order can be arbitrary or meaningful.
- Pie Charts: For categorical data showing proportions of a whole; less common in formal analysis.
- Scatter Plots: For numerical data; shows the relationship between two variables, with each point representing an observation.
- Line Graphs: For numerical data, often over time; emphasizes trends and averages but may obscure variation.
- Box and Whisker Plots (Box Plots): For numerical data; summarizes distribution using median, quartiles, and range.
- Stem and Leaf Plots: For numerical data; displays distribution while retaining individual data values.
Data Distribution Shapes
- Normal Distribution: Symmetric (bell-shaped), with most data points clustered around the center (mean/median/mode).
- Left-Skewed Distribution: Asymmetric, with a long tail extending to the left (negative skew). Mean < Median.
- Right-Skewed Distribution: Asymmetric, with a long tail extending to the right (positive skew). Mean > Median.
- Bimodal Distribution: Has two distinct peaks or clusters, suggesting two different subgroups within the data.
Avoiding Misleading Graphs
- Omitting the Baseline: Not starting the vertical (Y) axis at zero can exaggerate differences.
- Manipulating the Y-axis: Using an inappropriate scale or range to distort changes.
- Cherry-Picking Data: Selectively showing data that supports a specific narrative while ignoring contradictory data.
- Using the Wrong Graph Type: Choosing a graph unsuitable for the type of data being presented.
- Going Against Conventions: Violating standard practices for graph creation, leading to confusion.
Measures of Central Tendency
Mean
The average of all data points. Formula: M = ΣX / N (Sum of all values / Total number of values).
Median
The middle value in an ordered dataset.
- Sort the numbers from lowest to highest.
- If an odd number of values, the median is the middle one.
- If an even number of values, the median is the average of the two middle numbers (e.g., for 1, 2, 4, 5: (2+4)/2 = 3. Median = 3).
Mode
The value that appears most frequently in the dataset. A dataset can have one mode (unimodal), two modes (bimodal), or more (multimodal).
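As a minimal sketch, all three measures can be computed with Python's standard library (the data here is invented for illustration):

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 8, 8]   # invented example data

mean = statistics.mean(data)      # sum(X) / N = 44 / 8 = 5.5
median = statistics.median(data)  # even N: average of the two middle values (5 and 7) = 6.0
mode = statistics.mode(data)      # most frequent value = 8

print(mean, median, mode)
```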
Bimodal Distribution
A distribution with two modes, indicating no single clear center.
Measures of Variability (Spread)
Describes how spread out the data points are in a distribution.
- Range: Difference between the maximum and minimum values (Max - Min). Simple but sensitive to outliers and doesn't describe the distribution well; often insufficient.
- Interquartile Range (IQR): The range of the middle 50% of the data (Q3 - Q1). Less sensitive to outliers than the range, but still potentially insufficient alone.
- Q1: 25th percentile (median of the lower half of the data).
- Q3: 75th percentile (median of the upper half of the data).
- Variance (σ² or s²): The average of the squared deviations from the mean.
- Standard Deviation (σ or s): The square root of the variance; represents the typical deviation of scores from the mean, expressed in the original units of measurement.
Deviation
The difference between an individual data score and the mean of the dataset.
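These definitions translate directly into a short Python sketch (invented data; population formulas, dividing by N):

```python
data = [4, 8, 6, 2]                             # invented example data
n = len(data)
mean = sum(data) / n                            # 20 / 4 = 5.0

deviations = [x - mean for x in data]           # [-1.0, 3.0, 1.0, -3.0]
variance = sum(d ** 2 for d in deviations) / n  # (1 + 9 + 1 + 9) / 4 = 5.0
std_dev = variance ** 0.5                       # √5 ≈ 2.24, in the original units

print(deviations, variance, std_dev)
```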
Calculating IQR
- Order the data from lowest to highest.
- Find the overall median (Q2).
- Identify the first quartile (Q1) - the median of the lower half of the data (excluding the overall median if N is odd).
- Identify the third quartile (Q3) - the median of the upper half of the data (excluding the overall median if N is odd).
- Calculate IQR = Q3 - Q1.
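A minimal Python sketch of these steps, using the median-of-halves convention described above (note that statistical software may use slightly different interpolation rules):

```python
import statistics

def iqr(values):
    """IQR via the median-of-halves method described above."""
    data = sorted(values)        # step 1: order the data
    n = len(data)
    half = n // 2
    lower = data[:half]          # lower half (overall median excluded if n is odd)
    upper = data[half + n % 2:]  # upper half
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return q3 - q1

print(iqr([1, 3, 5, 7, 9, 11, 13]))  # Q1 = 3, Q3 = 11 -> IQR = 8
```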
Z-Scores
A standardized score indicating how many standard deviations a data point (X) is from the mean (μ). Formula: z = (X - μ) / σ (where σ is the population standard deviation).
- Transforms a raw score into a standardized value relative to its distribution.
- Allows comparison of scores from different scales or distributions.
- Helps in assessing the probability of a score occurring.
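A worked sketch of the formula, using invented values:

```python
mu, sigma = 100, 15   # hypothetical population mean and SD (e.g., an IQ-style scale)
x = 130               # raw score

z = (x - mu) / sigma  # (130 - 100) / 15 = 2.0
print(z)              # the score lies 2 standard deviations above the mean
```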
Degrees of Freedom (df)
The number of values in a calculation that are free to vary. Often calculated as n - 1 (sample size minus one) in statistical tests.
Steps in Hypothesis Testing
- State the Null Hypothesis (H0) and the Alternative Hypothesis (HA).
- Set the significance level (alpha, α), the criterion for rejecting H0 (e.g., α = 0.05).
- Choose the appropriate statistical test (e.g., t-test, z-test).
- Conduct the test using sample data and calculate the test statistic and the p-value.
- Make a decision: Reject H0 if p-value < α, or Fail to Reject H0 if p-value ≥ α.
- Calculate and interpret the effect size (e.g., Cohen's d) to understand the magnitude of the finding.
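As an illustrative sketch of these steps, here is a one-sample t-test in Python with SciPy (the sample data and hypothesized mean are invented):

```python
from scipy import stats

# H0: population mean = 50; HA: population mean != 50 (two-tailed)
sample = [52, 48, 55, 60, 51, 49, 57, 54]  # invented sample data
alpha = 0.05

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

if p_value < alpha:
    print(f"Reject H0 (t = {t_stat:.2f}, p = {p_value:.3f})")
else:
    print(f"Fail to reject H0 (t = {t_stat:.2f}, p = {p_value:.3f})")
```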
Effect Size (Cohen's d)
Indicates the magnitude of an effect or difference, independent of sample size.
- Small effect: d ≈ 0.20
- Medium effect: d ≈ 0.50
- Large effect: d ≈ 0.80
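A minimal sketch of one common way to compute Cohen's d for two independent samples, using the pooled standard deviation (data invented):

```python
import statistics

def cohens_d(group1, group2):
    """Cohen's d for two independent samples, using the pooled SD."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = statistics.stdev(group1), statistics.stdev(group2)  # sample SDs (n - 1)
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd

d = cohens_d([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])
print(d)  # mean difference of 2 over pooled SD ≈ 1.58 -> d ≈ 1.26 (a large effect)
```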
Decision Rules in Hypothesis Testing
- Reject Null Hypothesis (H0): Occurs when the p-value is less than the chosen significance level (p < 0.05). Suggests statistically significant evidence against H0.
- Fail to Reject Null Hypothesis (H0): Occurs when the p-value is greater than or equal to the significance level (p ≥ 0.05). Suggests insufficient evidence to reject H0.
Types of Hypotheses
- Directional (One-tailed): Predicts the specific direction of the effect or relationship (e.g., H0: μ ≤ X, HA: μ > X).
- Non-directional (Two-tailed): Predicts that there will be an effect or difference, but not its direction (e.g., H0: μ = X, HA: μ ≠ X). Open to results in either direction; compare the absolute value of the test statistic (ignoring its sign) to the critical value.
Probability Concepts
Theoretical Probability
Based on reasoning or known parameters. Formula: P(event) = (Number of outcomes meeting event criteria) / (Total number of possible outcomes).
Example: Out of 10 game players, 2 are impostors. The theoretical probability of randomly selecting an impostor is P(impostor) = 2/10 = 0.20.
Experimental Probability
Based on observed outcomes from trials or experiments. Formula: P(event) = (Number of times event occurred) / (Total number of trials).
Example: If 10 coins are flipped and 3 land on heads, the experimental probability of heads is P(heads) = 3/10 = 0.30.
Theoretical vs. Experimental Probability
These are not always the same, especially with a small number of trials. As the number of trials increases (Law of Large Numbers), experimental probability tends to converge towards theoretical probability.
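A quick simulation sketch of this convergence for a fair coin (seeded so the illustration is reproducible):

```python
import random

random.seed(42)

# Experimental probability of heads drifts toward the theoretical 0.5
# as the number of trials grows.
for n_trials in (10, 100, 10_000):
    heads = sum(random.random() < 0.5 for _ in range(n_trials))
    print(n_trials, heads / n_trials)
```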
Joint Probability
The probability of two or more events occurring together. Example (assuming independence or replacement): Probability of picking an impostor twice = P(impostor) * P(impostor) = (2/10) * (2/10) = 0.04.
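The same worked example in Python, assuming independent picks (i.e., with replacement):

```python
from fractions import Fraction

p_impostor = Fraction(2, 10)      # theoretical P(impostor) = 0.20
p_both = p_impostor * p_impostor  # independent picks (with replacement)
print(float(p_both))              # (2/10) * (2/10) = 0.04
```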
Confidence Intervals (CI)
A range of values estimated from sample data that is likely to contain the true population parameter (e.g., population mean) with a certain level of confidence (e.g., 95%).
- Expresses the uncertainty around a point estimate (like the sample mean).
- Calculated using: Point Estimate ± Margin of Error
- Margin of Error often involves a critical value (like a t-value) and the standard error: Margin of Error = t-value * (s / √n)
- Example: 30 cats, sample mean weight = 15.4, sample SD (s) = 1.9, critical t-value for 95% CI = 2.045.
- Margin of Error = 2.045 * (1.9 / √30) ≈ 0.71
- CI = [Mean - Margin of Error, Mean + Margin of Error]
- CI Lower Bound = 15.4 - 0.71 = 14.69
- CI Upper Bound = 15.4 + 0.71 = 16.11
- Result: We are 95% confident that the true average weight of cats in the population is between 14.69 and 16.11. 95% CI = [14.69, 16.11].
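A sketch reproducing the cat example with SciPy, where the critical t-value is looked up rather than given:

```python
import math
from scipy import stats

n, mean, s = 30, 15.4, 1.9   # values from the cat example above
confidence = 0.95

t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)  # ≈ 2.045 for df = 29
margin = t_crit * (s / math.sqrt(n))                      # ≈ 0.71

print(f"{confidence:.0%} CI = [{mean - margin:.2f}, {mean + margin:.2f}]")
# 95% CI = [14.69, 16.11]
```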
Sampling Concepts
Sampling Error
The natural difference between a sample statistic and the corresponding population parameter, occurring due to random chance in sample selection. Minimized by collecting more data (larger sample size).
Sampling Bias
Systematic error in the sampling process that results in a non-representative sample. Minimized by using random sampling techniques (e.g., simple random sampling, stratified sampling) and removing barriers to participation.
Errors in Hypothesis Testing
Type 1 Error (False Positive)
Rejecting the null hypothesis (H0) when it is actually true. The probability of making a Type 1 error is equal to the significance level (α). (e.g., concluding a drug is effective when it isn't).
Type 2 Error (False Negative)
Failing to reject the null hypothesis (H0) when it is actually false. The probability of making a Type 2 error is denoted by beta (β). (e.g., concluding a drug is not effective when it actually is).
Replication studies (repeating research) help lower the chance of false positives (Type 1 errors) being accepted as true findings.
Standard Deviation Formulas
Sample Standard Deviation (s)
Estimates the population standard deviation based on sample data. Used in inferential statistics. Formula: s = √[Σ(x - M)² / (n - 1)] (where M is the sample mean).
Population Standard Deviation (σ)
Calculates the standard deviation for an entire population. Used in descriptive statistics when all population data is known. Formula: σ = √[Σ(x - μ)² / N] (where μ is the population mean).
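Python's standard library distinguishes the two formulas directly, as a quick sketch shows (invented data):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # invented example data

s = statistics.stdev(data)       # sample SD: divides by n - 1 -> ≈ 2.14
sigma = statistics.pstdev(data)  # population SD: divides by N -> 2.0
print(s, sigma)
```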
Additional Concepts
Binomial Data
Data with only two possible outcomes (e.g., Yes/No, Success/Failure), often coded as 1 and 0. The mean of binomial data represents the proportion of outcomes coded as '1'.
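A one-line illustration of this point with invented responses:

```python
responses = [1, 0, 1, 1, 0, 1, 0, 1]          # invented: 1 = Yes, 0 = No
proportion = sum(responses) / len(responses)  # mean of 0/1 data
print(proportion)                             # 0.625 -> 62.5% answered Yes
```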
Central Limit Theorem (CLT)
States that if you take sufficiently large random samples from a population (even if the population distribution is not normal), the distribution of the sample means will approximate a normal distribution.
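A small simulation sketch: draw many samples from a decidedly non-normal (uniform) population and look at the distribution of their means.

```python
import random
import statistics

random.seed(1)  # seeded for a reproducible illustration

# Population: uniform on [0, 1) -- flat, not normal at all.
# Collect the means of 1,000 random samples of size n = 50.
sample_means = [
    statistics.mean(random.random() for _ in range(50))
    for _ in range(1_000)
]

print(statistics.mean(sample_means))   # close to the population mean, 0.5
print(statistics.stdev(sample_means))  # close to sigma / sqrt(n) ≈ 0.29 / 7.07 ≈ 0.04
```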
Statistical Power
The probability of correctly rejecting a false null hypothesis (i.e., avoiding a Type 2 error). Power = 1 - β.
Factors influencing power:
- Sample Size (n): Larger n increases power.
- Effect Size: Larger effects are easier to detect, increasing power.
- Significance Level (α): Higher α (e.g., 0.10 vs 0.05) increases power but also increases the Type 1 error rate.
- Standard Error: Smaller standard error (from larger n or lower data variability) increases power.
Power Calculations
Used to:
- Determine the statistical power of a completed study.
- Perform a priori sample size analysis to determine the necessary sample size before conducting a study.
- Conduct sensitivity analysis to determine the minimum effect size detectable with a given sample size and power level.
Underpowered studies (low power) are more likely to produce misleading results and commit Type 2 errors.
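As a sketch of an a priori sample-size calculation (one of the uses listed above), assuming the statsmodels package is available:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect a medium effect (d = 0.50)
# with 80% power at alpha = 0.05 (two-tailed, independent samples).
n_per_group = analysis.solve_power(effect_size=0.50, alpha=0.05, power=0.80)
print(round(n_per_group))  # roughly 64 per group
```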
Role of the Null Hypothesis
The null hypothesis (H0) represents the default assumption or statement of no effect/no difference. It is assumed to be true until sufficient statistical evidence contradicts it. Example: When testing a new drug, the null hypothesis is typically that the drug is not effective compared to a placebo or existing treatment.