Essential Concepts in Statistical Modeling and Optimization Methods

Probability Distributions for Discrete Events

The following table matches common scenarios to their appropriate probability distributions:

Scenario Description | Distribution Type
Number of people clicking an online banner ad each hour | Poisson
Number of arrivals to a flu-shot clinic each minute | Poisson
Number of hits to a real estate website each minute | Poisson
Number of arrivals to the ID-check queue at an airport each minute | Poisson
Number of people entering a grocery store each minute | Poisson
Number of penalty kicks taken until one is saved | Geometric
Number of faces correctly identified by Deep Learning (DL) software until an error occurs | Geometric
Of the first 100 people viewing a house listing, the number who tour it | Binomial
Number of days in a year with temperature 3+ degrees above forecast | Binomial
Time between arrivals to a flu-shot clinic | Exponential
Time between hits on a real estate website | Exponential
Time between people entering a grocery store | Exponential
Time from the start of a World Cup soccer match until a goal is scored | Weibull
Time from when a house is on the market until the first offer | Weibull
Time from the beginning of Fall until the first snowflake is seen | Weibull
Time from when a generator is turned on until it fails | Weibull
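
The relationships in the table can be explored with a small sampling sketch. This is only an illustration using Python's standard `random` module; the helper names (`poisson_count`, `geometric_trials`, `binomial_count`) and all rates, probabilities, and Weibull parameters are hypothetical values chosen for the example, not taken from the scenarios above.

```python
import random

def poisson_count(rate, rng):
    """Count arrivals in one time unit by adding up exponential inter-arrival gaps."""
    elapsed, count = 0.0, 0
    while True:
        elapsed += rng.expovariate(rate)  # exponential time between arrivals
        if elapsed > 1.0:
            return count
        count += 1

def geometric_trials(p, rng):
    """Number of trials up to and including the first 'success' (e.g. first saved kick)."""
    trials = 1
    while rng.random() >= p:
        trials += 1
    return trials

def binomial_count(n, p, rng):
    """Number of successes in n independent trials (e.g. tours among 100 viewers)."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(42)

# Hypothetical parameters: 3 arrivals per minute, 25% save chance, 10% tour chance.
poisson_mean = sum(poisson_count(3.0, rng) for _ in range(10000)) / 10000
geometric_mean = sum(geometric_trials(0.25, rng) for _ in range(10000)) / 10000
binomial_mean = sum(binomial_count(100, 0.10, rng) for _ in range(2000)) / 2000

# Weibull time-to-event; scale 10 and shape 1.5 are assumed values.
time_to_failure = rng.weibullvariate(10.0, 1.5)

print(poisson_mean, geometric_mean, binomial_mean, time_to_failure)
```

The sample means should land near the theoretical values (rate, 1/p, and n·p respectively), which also shows why exponential gaps between arrivals go together with Poisson counts per time unit.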

Strategies for Handling Missing Data and Imputation

Five models proposed for handling missing school rating data:

  1. Model 1: Imputation using the average school rating derived from the rest of the dataset.
  2. Model 2: Imputation using a regression model based on other available variables.
  3. Model 3: Two-step approach: First, classify whether the school was built due to population growth, then select the appropriate regression model based on that classification.
  4. Model 4: Use a binary variable to explicitly identify locations where information is missing.
  5. Model 5: Use a categorical variable approach with three categories: "data available," "missing, population growth," and "missing, other reason."
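
As a rough sketch of Models 1, 4, and 5, the snippet below imputes a mean, builds a missing-data indicator, and assigns the three categories. The ratings and the `built_for_growth` flags are made-up data for illustration only.

```python
# Hypothetical school ratings; None marks a missing value.
ratings = [7.0, None, 5.0, 9.0, None, 3.0]
# Hypothetical flag: was the school built because of population growth?
built_for_growth = [False, True, False, False, False, True]

# Model 1: replace each missing rating with the mean of the observed ratings.
observed = [r for r in ratings if r is not None]
mean_rating = sum(observed) / len(observed)
imputed = [r if r is not None else mean_rating for r in ratings]

# Model 4: binary variable that flags where the rating is missing.
is_missing = [1 if r is None else 0 for r in ratings]

# Model 5: categorical variable with three levels.
def category(rating, growth):
    if rating is not None:
        return "data available"
    return "missing, population growth" if growth else "missing, other reason"

categories = [category(r, g) for r, g in zip(ratings, built_for_growth)]

print(imputed)
print(is_missing)
print(categories)
```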

Model Feasibility Notes

  • Model 3: Ratings can be used / Reasons can be inferred.
  • Model 2: Can be used / Cannot be used (depending on context).
  • Model 5: Cannot be used / Can be used (depending on context).
  • Model 4: Cannot be used / Cannot be used (depending on context).

Formulating Optimization Constraints with Binary Variables

Let y_item be a binary variable (1 if the item is eaten, 0 if not) and x_item a continuous variable for the amount eaten. M is a large constant, chosen to be at least the largest amount that could be eaten.

Mutual Exclusivity Constraints

  • Out of peanut butter and cheese sauce, exactly one must be eaten:

    y_peanutbutter + y_cheesesauce = 1

    or, equivalently,

    y_peanutbutter = 1 - y_cheesesauce
  • Neither peanut butter nor cheese sauce can be eaten:

    y_peanutbutter + y_cheesesauce = 0
  • Either peanut butter or cheese sauce, but not both, must be eaten (Exclusive OR):

    y_peanutbutter = 1 - y_cheesesauce

Conditional Constraints (Broccoli)

  • Either cheese sauce or peanut butter (or both) must be eaten with broccoli:

    y_broccoli ≤ y_cheesesauce + y_peanutbutter

    (Note: If broccoli is eaten (y_broccoli = 1), the sum of the other two variables must be at least 1.)
  • Broccoli can only be eaten if either cheese sauce or peanut butter (or both) is also eaten:

    y_broccoli ≤ y_cheesesauce + y_peanutbutter
  • If cheese sauce and peanut butter are not eaten, then broccoli can't be eaten:

    y_broccoli ≤ y_cheesesauce + y_peanutbutter

Limiting Total Items

  • No more than two of broccoli, cheese sauce, and peanut butter may be eaten:

    y_broccoli + y_cheesesauce + y_peanutbutter ≤ 2
  • Broccoli, cheese sauce, and peanut butter all cannot be eaten together:

    y_broccoli + y_cheesesauce + y_peanutbutter ≤ 2

Linking Continuous and Binary Variables (Big M Formulation)

  • No amount of cheese sauce may be eaten unless its binary variable is 1 (if any amount of cheese sauce is eaten, then its binary variable must be 1):

    x_cheesesauce ≤ M · y_cheesesauce
  • Cheese sauce must be eaten:

    y_cheesesauce = 1
  • Unless peanut butter is eaten, no amount of broccoli can be eaten:

    x_broccoli ≤ M · y_peanutbutter
  • If any amount of broccoli is eaten, then peanut butter must also be eaten:

    x_broccoli ≤ M · y_peanutbutter
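
One way to sanity-check these formulations is to enumerate every 0/1 assignment and compare each linear constraint with the logical statement it is meant to encode. The sketch below does this by brute force. The helper names and the value M = 1000 are assumptions made for the example.

```python
from itertools import product

def implies(a, b):
    """Logical implication: a -> b."""
    return (not a) or b

feasible = []
for y_b, y_c, y_p in product((0, 1), repeat=3):
    # "Broccoli only with cheese sauce or peanut butter" as a linear constraint.
    linear_ok = y_b <= y_c + y_p
    assert linear_ok == implies(y_b == 1, y_c == 1 or y_p == 1)

    # "Not all three together" is the same as "at most two".
    assert (y_b + y_c + y_p <= 2) == (not (y_b and y_c and y_p))

    # "Exactly one of peanut butter and cheese sauce" in both equivalent forms.
    assert (y_p == 1 - y_c) == (y_c + y_p == 1)

    if linear_ok:
        feasible.append((y_b, y_c, y_p))

def big_m_ok(x, y, M=1000.0):
    """Big M link: a positive amount x is allowed only when the binary y is 1."""
    return 0.0 <= x <= M * y

print(len(feasible), "of 8 assignments satisfy the broccoli constraint")
print(big_m_ok(2.5, 1), big_m_ok(2.5, 0), big_m_ok(0.0, 0))
```

Only the assignment with broccoli but neither sauce is excluded by the broccoli constraint, so seven of the eight assignments remain feasible.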

Discrete-Event Simulation and Replication

Stochastic Discrete-Event Simulation

When a company creates a stochastic discrete-event simulation, many replications are needed because of the inherent variability and randomness in the system being modeled.
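
The need for replications can be seen in a minimal single-server queue sketch. The model below (exponential inter-arrival and service times, one server, first come first served) and its rates are assumptions chosen for illustration, not a description of any particular company's simulation.

```python
import random
import statistics

def simulate_mean_wait(arrival_rate, service_rate, n_customers, seed):
    """One replication: average time customers wait before service starts."""
    rng = random.Random(seed)
    clock = 0.0            # arrival time of the current customer
    server_free_at = 0.0   # time at which the server finishes its current job
    total_wait = 0.0
    for _ in range(n_customers):
        clock += rng.expovariate(arrival_rate)
        start = max(clock, server_free_at)
        total_wait += start - clock
        server_free_at = start + rng.expovariate(service_rate)
    return total_wait / n_customers

# 30 replications that differ only in their random seed.
waits = [simulate_mean_wait(0.8, 1.0, 500, seed) for seed in range(30)]
mean_wait = statistics.mean(waits)
spread = statistics.stdev(waits)

print(f"mean of replications: {mean_wait:.2f}")
print(f"std dev across replications: {spread:.2f}")
print(f"min / max: {min(waits):.2f} / {max(waits):.2f}")
```

Each replication gives a different average wait even though the inputs are identical, which is why a single run says little about the expected value.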

Interpreting Simulation Run Results

  • A simulation could stop after 300 or 400 events, but it could not stop after only 5 events (implying a minimum run length requirement).
  • The simulated wait time was not 50 or less just once out of all the runs (implying consistency).
  • The expected wait time of simulated runs (replications) is likely to be between 65 and 75 (Confidence Interval interpretation).
  • The expected wait time of simulated runs (replications) is not likely to be between 75 and 85.
  • There is significant (not negligible) variability in the simulated wait time across the runs (replications).

Simulation Validation

If the simulated wait time is 50% higher than observed reality, one must investigate to see what is wrong with the simulation, as it indicates a poor match to reality.

Classification of Optimization Problem Types

Optimization problems are classified based on the structure of their objective function and constraints:

  • Linear Programming (LP)

    Objective: ∑_i c_i x_i

    Constraints: ∑_i a_ij x_i ≥ b_j

  • Convex Quadratic Programming (CQP)

    Objective: ∑_i c_i x_i^2 (convex when minimizing with c_i ≥ 0)

    Constraints: ∑_i a_ij x_i ≥ b_j

  • Convex Programming (CP)

    Objective: ∑_i c_i |x_i − 6| (convex when minimizing with c_i ≥ 0)

    Constraints: ∑_i a_ij x_i ≥ b_j

  • Integer Programming (IP)

    Objective: ∑_i c_i x_i

    Constraints: ∑_i a_ij x_i ≥ b_j, where x_i ∈ {0, 1} (Binary/Integer variables)

  • General Non-Convex Programming (GNCP)

    Objective: ∑_i c_i sin x_i

    Constraints: (Linear or non-linear)

  • General Non-Convex Programming (GNCP)

    Objective: ∑_i c_i x_i

    Constraints: ∑_i ∑_k a_ikj x_i x_k ≤ b_j (Non-linear, non-convex constraints)

  • Linear Programming (LP)

    Objective: ∑_i (log c_i) x_i (log c_i is a constant coefficient, so the objective is still linear in x_i)

    Constraints: ∑_i a_ij x_i ≥ b_j
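
The difference between the convex and non-convex objectives above can be probed numerically with the midpoint inequality f((x + y)/2) ≤ (f(x) + f(y))/2. The sketch below is a heuristic check on a grid, not a proof of convexity. The function `looks_convex` and the grid are assumptions made for this example.

```python
import math

def looks_convex(f, points, tol=1e-9):
    """Return False as soon as one pair of points violates the midpoint inequality."""
    for x in points:
        for y in points:
            if f((x + y) / 2) > (f(x) + f(y)) / 2 + tol:
                return False
    return True

# Grid from -10 to 10 in steps of 0.5 (arbitrary choice for the illustration).
grid = [i * 0.5 for i in range(-20, 21)]

quadratic_ok = looks_convex(lambda x: x ** 2, grid)       # term of the CQP objective
abs_shift_ok = looks_convex(lambda x: abs(x - 6), grid)   # term of the CP objective
sine_ok = looks_convex(math.sin, grid)                    # term of the GNCP objective

print(quadratic_ok, abs_shift_ok, sine_ok)
```

Passing the test on a grid only suggests convexity, while a single violation (as with sin x) is enough to rule it out.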

Queuing Theory and Markov Chain Properties

  1. To check system stability (utilization): Multiply the service rate of one server (μ) by the number of service lines (c) and check whether this total service capacity is greater than the arrival rate (λ). If cμ > λ, that is, if the utilization ρ = λ/(cμ) is below 1, the system is stable.
  2. If a process is not memoryless, the standard Markov chain model would not be well-defined or appropriate for modeling the system state transitions.
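
A stability check along these lines takes only a few lines of code. The sketch assumes the standard multi-server condition λ < cμ (utilization below 1). The function names and the example rates are hypothetical.

```python
def utilization(arrival_rate, service_rate, servers):
    """rho = lambda / (c * mu): fraction of total service capacity in use."""
    return arrival_rate / (servers * service_rate)

def is_stable(arrival_rate, service_rate, servers):
    """The queue does not grow without bound only if utilization is below 1."""
    return utilization(arrival_rate, service_rate, servers) < 1.0

# Example: 10 arrivals per hour, each server handles 4 per hour.
print(utilization(10.0, 4.0, 3), is_stable(10.0, 4.0, 3))  # three servers
print(utilization(10.0, 4.0, 2), is_stable(10.0, 4.0, 2))  # two servers
```

With three servers the capacity (12 per hour) exceeds the arrival rate, while with two servers (8 per hour) it does not.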

Decision Making and Statistical Measures

Exploration vs. Exploitation

  • Use more exploration if observed rates are similar.
  • Use exploitation if observed rates are very different (choose the lowest or highest rate depending on the context of the problem).

Choosing Appropriate Measures

  • Binomial-based data: Use the highest rate or fraction.
  • Parametric data: Use the average or mean.
  • Non-parametric data: Use the median.

Regression Regularization Techniques

Regularization Constraints and Objective Functions

These techniques minimize the sum of squared errors (SSE) subject to constraints on the coefficients (aj):

  • Standard Linear Regression (No Regularization)

    Minimize: ∑_{i=1}^{n} (y_i − (a_0 + ∑_{j=1}^{m} a_j x_ij))^2

  • Lasso Regression (L1 Penalty)

    Constraint: ∑_j |a_j| ≤ T (T is the tuning parameter)

  • Ridge Regression (L2 Penalty)

    Constraint: ∑_j (a_j)^2 ≤ T

  • Elastic Net

    Penalty Term: λ ∑_j |a_j| + (1 − λ) ∑_j (a_j)^2 (Combines L1 and L2 penalties)
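
The quantities in these formulas can be written out directly. The sketch below only evaluates the SSE and the three penalty terms for given coefficients and does not fit a model. The function names and the tiny dataset are made up for illustration.

```python
def sse(a0, a, X, y):
    """Sum of squared errors for intercept a0 and coefficients a."""
    return sum(
        (yi - (a0 + sum(aj * xij for aj, xij in zip(a, row)))) ** 2
        for row, yi in zip(X, y)
    )

def l1_penalty(a):
    """Lasso term: sum of absolute coefficient values."""
    return sum(abs(aj) for aj in a)

def l2_penalty(a):
    """Ridge term: sum of squared coefficient values."""
    return sum(aj ** 2 for aj in a)

def elastic_net_penalty(a, lam):
    """Weighted combination of the L1 and L2 terms, with lam between 0 and 1."""
    return lam * l1_penalty(a) + (1 - lam) * l2_penalty(a)

# Made-up data: two observations, two predictors.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, -4.0]
a = [3.0, -4.0]

print(sse(0.0, a, X, y))            # perfect fit with intercept 0
print(sse(1.0, a, X, y))            # shifting the intercept adds error
print(l1_penalty(a), l2_penalty(a), elastic_net_penalty(a, 0.5))
```

A coefficient vector satisfies the Lasso or Ridge constraint when the corresponding penalty is at most T. Because the L1 term grows linearly rather than quadratically, small coefficients are pushed all the way to zero.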

Variable Selection Properties

  • Lasso: Selects the fewest variables (performs feature selection by driving coefficients to zero).
  • Elastic Net (EN): Selects a medium number of variables.
  • Linear Regression (LR) / Ridge Regression (RR): Selects the most variables (all variables are retained, though coefficients may be small).

Model Complexity and Overfitting

  • Should we seek a simpler model? YES. There isn't enough data to avoid overfitting a model with many factors, and a simpler model also improves interpretability and generalization.
  • Should we seek a more complex model? NO. (Unless complexity is justified by data volume and performance gains).

Optimization vs. Regression Perspective

  • Optimization Perspective: Coefficients (a) are variables; inputs (x) are constants.
  • Regression Perspective: Inputs (x) are variables; coefficients (a) are constants (parameters to be estimated).

Data Science Modeling Workflow

A typical workflow for developing and validating predictive models:

  1. Remove outliers from the dataset.
  2. Impute missing data values and scale the data appropriately.
  3. Fit a Lasso regression model on all available variables (for feature selection).
  4. Fit alternative models (e.g., linear regression, regression tree, and random forest) using only the variables chosen by the Lasso regression model.
  5. Pick the best model to use based on performance metrics evaluated on a dedicated validation dataset.
  6. Test the final chosen model on a separate, unseen test dataset to estimate its true generalization quality.
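
Steps 5 and 6 rely on three disjoint subsets of the data. One possible way to create them is sketched below. The 60/20/20 proportions, the function name `split_dataset`, and the seed are assumptions for this example, not part of the workflow above.

```python
import random

def split_dataset(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the rows and cut them into training, validation, and test sets."""
    shuffled = rows[:]                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # whatever remains is held out
    return train, valid, test

rows = list(range(100))                     # stand-in for 100 data points
train, valid, test = split_dataset(rows)

print(len(train), len(valid), len(test))
```

Models are fit on `train`, compared on `valid`, and only the chosen model is scored once on `test`, so the final quality estimate is not biased by the selection step.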

Advanced Analytical Methods for Decision Making

Problem Description | Appropriate Analytical Method
Find the best airline schedule with uncertain delays | Stochastic Optimization
Find the best portfolio with uncertain investment returns | Stochastic Optimization
Determine the best route for delivery given uncertainties in traffic | Stochastic Optimization
Decide how many products to manufacture with uncertain demand | Stochastic Optimization
Estimate the required number of workers for a call center | Queuing Theory
Determine how many checkout lanes are needed in a supermarket or tables in a restaurant | Queuing Theory
Compare the median age of MSA students across campus and online programs | Non-parametric Test
Determine if the median home price is lower in one city versus another | Non-parametric Test
Identify which month has a higher median temperature | Non-parametric Test
Identify which sets of electives or recipes share common elements | Louvain Algorithm (Community Detection)
Find groups of electives that are often taken by the same students | Louvain Algorithm (Community Detection)
Find sets of terrorists (network community detection) | Louvain Algorithm (Community Detection)
Determine how much to bid in competitive situations | Game-Theoretic Analysis
Determine the best marketing strategy, given competitor reaction | Game-Theoretic Analysis
