

Chapter 17: Inferences when SD is unknown

The t test is used when the population SD (sigma) is unknown

There are fewer conditions for inference about a mean
Data are an SRS from a larger population
Observations follow a Normal distribution
We estimate the standard error using s/√n
s is the sample SD
s estimates the variation between individuals
s/√n estimates how much the sample means vary

T Test
t = (x_bar - mu)/(s/√n)
The t statistic is more variable than the z statistic because we have to estimate sigma with s
This means the t statistic does not follow a Normal distribution; the t distribution is more variable (wider)
Higher df = closer to Normal

Check that the conditions are met
Calculate the t test stat using x_bar, s, n and the null value of mu
Compute the probability of observing the test stat t or more extreme under the null hypothesis (the p-value)
Interpret the p-value

For a 95% CI with 25 observations
t_star <- qt(p = 0.975, df = 24)
x_bar + or - t_star * s/√n
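
A minimal sketch of the one-sample t test and CI done by hand and then checked against t.test(). The 25 values and the null value mu_null = 5 are made up for illustration, they are not from the course.

```r
# Hypothetical sample of 25 measurements (illustrative values only)
x <- c(5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9, 5.4, 5.0,
       4.6, 5.5, 5.1, 4.9, 5.2, 5.3, 4.8, 5.0, 5.1, 5.4,
       4.7, 5.2, 5.0, 4.9, 5.3)
mu_null <- 5                      # assumed null value

n     <- length(x)
x_bar <- mean(x)
s     <- sd(x)
se    <- s / sqrt(n)              # estimated standard error

# Test stat and two-sided p-value from the t distribution with n - 1 df
t_stat  <- (x_bar - mu_null) / se
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# 95% CI: x_bar + or - t_star * s/sqrt(n)
t_star <- qt(p = 0.975, df = n - 1)
ci     <- c(x_bar - t_star * se, x_bar + t_star * se)

# Same thing in one call
check <- t.test(x, mu = mu_null)
```

The by-hand t_stat, p_value and ci should agree with check$statistic, check$p.value and check$conf.int.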

A procedure is robust if the CI or p-value does not change much when its conditions are violated
t procedures are quite robust against non-Normality, except when outliers or strong skew are present
t procedures with outliers are OK if the sample is big enough

Plot the data to see if there are outliers and if there is skew
SRS is more important than Normality
If n < 15, use t procedures only if the data appear close to Normal
If n is at least 15, use t unless there are outliers or strong skew
If n is at least 40, use t even for clearly skewed data

Chapter 17 Pt2 Paired T test

Used when observations are matched by design
This is a test of the mean difference within subjects
Test stat: T = (X_bar_d - 0)/(S_d/√n), where d is the within-subject difference
Observed value of the test stat is t = (x_bar_d - 0)/(s_d/√n)

pull() vs select()
select() keeps the columns that you want, but the result is still a data frame
pull() extracts a single column as a plain vector, without the data frame

Find the quantile using q <- qt(p = 0.975, lower.tail = TRUE, df = 10)
paired_t <- t.test(chol_dat %>% pull(B), chol_dat %>% pull(A), 
                   alternative = "two.sided", mu = 0, paired = TRUE)
Use this code to run the test; make sure the output says "Paired t-test"
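
A small sketch showing that the paired t test is just a one-sample t test on the within-subject differences. The chol_dat values are invented (11 subjects, so df = 10), and base R `$` is used instead of pull() so the block runs without dplyr.

```r
# Hypothetical readings for 11 subjects under conditions B and A (illustrative only)
chol_dat <- data.frame(
  B = c(210, 225, 198, 240, 215, 232, 205, 221, 228, 219, 236),
  A = c(202, 218, 200, 229, 210, 220, 199, 215, 224, 211, 225)
)

d <- chol_dat$B - chol_dat$A          # within-subject differences
n <- length(d)

# By hand: t = (x_bar_d - 0) / (s_d / sqrt(n)) with n - 1 df
t_stat  <- (mean(d) - 0) / (sd(d) / sqrt(n))
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Paired test on the two columns, and one-sample test on the differences
paired_t   <- t.test(chol_dat$B, chol_dat$A,
                     alternative = "two.sided", mu = 0, paired = TRUE)
one_sample <- t.test(d, mu = 0)
```

All three routes should give the same test stat and p-value.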

Paired t tests are good for removing confounding
Must make sure to give the next treatment only after a wash-out period so effects don’t carry over

Chapter 18: Comparing 2 pop means
So far we have used one-sample tests
One sample and one variable
Now we will compare two populations
H null: mu1 - mu2 = 0
H alt: mu1 - mu2 ≠ 0
Compare graphically
Make a histogram, one for each sample
Compare their shapes, centers and spreads
Or make two boxplots and compare their medians and IQRs
Conditions for the two-sample t test
We have two SRSs from two populations
The samples are independent
The same quantitative variable is measured in both samples
Both populations are Normally distributed, with no outliers

Standard deviation of (x_bar1 - x_bar2) is √(sigma1^2/n1 + sigma2^2/n2)
Our estimate is SE = √(s1^2/n1 + s2^2/n2)

Two sample t-test is:
t = [(x_bar1 - x_bar2) - (mu1 - mu2)] / SE
Under the null (mu1 - mu2 = 0): t = (x_bar1 - x_bar2) / SE
The degrees of freedom formula is long
df = (s1^2/n1 + s2^2/n2)^2 / [ (1/(n1-1))*(s1^2/n1)^2 + (1/(n2-1))*(s2^2/n2)^2 ]
Confidence Interval
(x_bar1 - x_bar2) + or - t_star * √(s1^2/n1 + s2^2/n2)

Infection of chickens with the avian flu is a threat to both poultry production and human health. A research team created transgenic chickens resistant to avian flu infection. Could the modification affect the chicken in other ways? The researchers compared the hatching weights (in grams) of 45 transgenic chickens and 54 independently selected commercial chickens of the same breed.
R does all of the above in one call
t.test(commercial_weight, transgenic_weight, alternative = "two.sided")
Two-sample t procedures are more robust than one-sample procedures, especially if the data are skewed
When the samples are the same size, they can work for samples as small as 5
When the two populations have different shapes, you need larger samples
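
A sketch of the two-sample SE, test stat, df and CI formulas computed by hand and compared with t.test(). The weights here are made-up numbers, not the 45 and 54 chickens from the study.

```r
# Hypothetical hatching weights in grams (illustrative values, not the study data)
commercial_weight <- c(45.2, 44.1, 46.8, 43.9, 45.5, 44.7, 46.1, 45.0)
transgenic_weight <- c(44.0, 45.3, 43.2, 44.8, 46.0, 43.7, 44.4)

n1 <- length(commercial_weight)
n2 <- length(transgenic_weight)
v1 <- var(commercial_weight) / n1      # s1^2 / n1
v2 <- var(transgenic_weight) / n2      # s2^2 / n2
diff_means <- mean(commercial_weight) - mean(transgenic_weight)

se     <- sqrt(v1 + v2)
t_stat <- diff_means / se
# The long degrees of freedom formula
df     <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% CI for mu1 - mu2
t_star <- qt(p = 0.975, df = df)
ci     <- diff_means + c(-1, 1) * t_star * se

# One call (unequal variances is the default in t.test)
welch <- t.test(commercial_weight, transgenic_weight, alternative = "two.sided")
```

welch$parameter holds the df, which is usually not a whole number.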
Chapter 19: Inference about a pop proportion

This is about binary data, as opposed to the continuous data in previous chapters
Large sample CI
p_hat + or - z_star * √(p_hat(1-p_hat)/n)
Not as accurate as Plus 4
Plus 4 Method
Used because the large sample CI does not perform as well on binary data
Add 2 fake successes and 2 fake failures
p_tilde = (number of successes + 2) / (n + 4)
SE = √(p_tilde(1-p_tilde) / (n + 4))
CI = p_tilde + or - z_star * SE
Use when n is at least 10 and the confidence level is at least 90%
Use this when doing it by hand
Wilson Score [prop.test]
Similar to Plus 4 but with a continuity correction
Basically the R version of Plus 4
Clopper Pearson or Exact [binom.test]
Statistically conservative
Actual coverage is at least as high as the stated confidence level

Example: Suppose that 500 elderly individuals suffered hip fractures, of which 100 died within a year of their fracture. Compute the 95% CI for the proportion who died using:

Method              95% CI            How
Large sample        16.5% to 23.5%    by hand
Clopper Pearson*    16.6% to 23.8%    binom.test
Wilson Score**      16.6% to 23.8%    prop.test
Plus four           16.7% to 23.7%    by hand

Note that only the large sample CI is symmetric around 0.20
We do not need a symmetric CI with binary data
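
A sketch of the four intervals for the hip fracture example (100 deaths out of 500). The two by-hand intervals follow the formulas in these notes; the other two use binom.test() and prop.test() with their default settings, which is an assumption about how the numbers in the example were produced.

```r
x <- 100; n <- 500                 # counts from the hip fracture example
z_star <- qnorm(0.975)

# Large sample CI: p_hat + or - z_star * sqrt(p_hat(1 - p_hat)/n)
p_hat <- x / n
large_sample <- p_hat + c(-1, 1) * z_star * sqrt(p_hat * (1 - p_hat) / n)

# Plus four CI: add 2 fake successes and 2 fake failures
p_tilde <- (x + 2) / (n + 4)
plus_four <- p_tilde + c(-1, 1) * z_star * sqrt(p_tilde * (1 - p_tilde) / (n + 4))

# Clopper Pearson (exact) and Wilson score (continuity corrected)
clopper_pearson <- as.numeric(binom.test(x, n, conf.level = 0.95)$conf.int)
wilson <- as.numeric(prop.test(x, n, conf.level = 0.95, correct = TRUE)$conf.int)

# All four intervals as percentages, one row per method
round(100 * rbind(large_sample, clopper_pearson, wilson, plus_four), 1)
```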

Finding a specific sample size
Let m = margin of error desired
m = z_star * √(p_hat(1-p_hat) / n)
Before collecting data we do not have p_hat, so we have to guess p using p_star
If you have no clue use p_star = 0.5
But if the true p is less than 0.3 or more than 0.7 then the resulting sample size will be bigger than needed
n = sample size wanted
n = (z_star/m)^2 * p_star * (1-p_star)
Ex. Suppose after the midterm vote, you were interested in estimating the proportion of STEM undergraduate students who voted. First you need to decide what margin of error you desire. Suppose it is 4 percentage points or m = 0.04 for a 95% CI.
If we guessed that the proportion was 25% then the formula gives
(1.96/0.04)^2 * 0.25 * (1-0.25) = 450.19, which rounds up to 451
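
The sample size formula as a small helper. The function name sample_size and its arguments are made up for this sketch; it uses qnorm() for z_star instead of the rounded 1.96.

```r
# n = (z_star/m)^2 * p_star * (1 - p_star), always rounded up
sample_size <- function(m, p_star = 0.5, conf_level = 0.95) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  ceiling((z_star / m)^2 * p_star * (1 - p_star))
}

n_guess_25 <- sample_size(m = 0.04, p_star = 0.25)   # the example from the notes
n_no_clue  <- sample_size(m = 0.04)                  # conservative guess p_star = 0.5
```

With the guess of 25% this gives 451, matching the hand calculation; with p_star = 0.5 the required n is larger.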

For a hypothesis test about a proportion, plug the z stat you get into pnorm to find the upper tail p-value
pnorm(q = 3.06413, lower.tail = FALSE)

Chapter 20: Inference for comparing 2 proportions

Large sample CI for the difference of 2 proportions
Use when the numbers of successes and failures are more than 10 in both samples
(p_hat1 - p_hat2) + or - z_star * √(p_hat1(1-p_hat1)/n1 + p_hat2(1-p_hat2)/n2)
Actual coverage can be lower than the stated confidence level
Example: Patients in a randomized controlled trial who were severely immobilized were randomly assigned to receive either Fragamin (to prevent blood clots) or a placebo. The number of patients experiencing deep vein thrombosis (DVT) was recorded:

[Table: counts of DVT, no DVT, Total and p_hat for the Fragamin and placebo groups]

Both samples have more than 10 successes and more than 10 failures
The estimate of the difference is 2.19%

Plus 4
Use when the large sample condition is not satisfied
Add 4 observations: 1 success and 1 failure to each sample
p_tilde1 = (successes in sample 1 + 1) / (n1 + 2)
p_tilde2 = (successes in sample 2 + 1) / (n2 + 2)
(p_tilde1 - p_tilde2) + or - z_star * √(p_tilde1(1-p_tilde1)/(n1+2) + p_tilde2(1-p_tilde2)/(n2+2))
Use when each sample has at least 5 observations; can be used even when the number of successes or failures is 0
Much more accurate when sample sizes are small
May be conservative (higher coverage than advertised)

Hypothesis test for the difference of 2 proportions
z = (p_hat1 - p_hat2) / √(p_hat(1-p_hat)(1/n1 + 1/n2)), where p_hat is the pooled proportion (total successes / total observations)
Use this only when the counts of successes and failures are more than 5 in both samples
Use pnorm to find the p-value
pnorm(q = 3.112881, lower.tail = FALSE)*2
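
A sketch of the large sample CI, the plus four CI and the z test for two proportions. The counts are invented (not the DVT trial data), and prop.test() with correct = FALSE is used only as a cross-check, since its X-squared stat equals z^2.

```r
# Hypothetical counts (illustrative only, not the trial data)
x1 <- 42; n1 <- 1500      # successes and sample size, group 1
x2 <- 73; n2 <- 1480      # successes and sample size, group 2
z_star <- qnorm(0.975)

# Large sample CI
p_hat1 <- x1 / n1
p_hat2 <- x2 / n2
large_ci <- (p_hat1 - p_hat2) + c(-1, 1) * z_star *
  sqrt(p_hat1 * (1 - p_hat1) / n1 + p_hat2 * (1 - p_hat2) / n2)

# Plus four CI: 1 fake success and 1 fake failure in each sample
p_tilde1 <- (x1 + 1) / (n1 + 2)
p_tilde2 <- (x2 + 1) / (n2 + 2)
plus4_ci <- (p_tilde1 - p_tilde2) + c(-1, 1) * z_star *
  sqrt(p_tilde1 * (1 - p_tilde1) / (n1 + 2) + p_tilde2 * (1 - p_tilde2) / (n2 + 2))

# z test with the pooled proportion
p_pool  <- (x1 + x2) / (n1 + n2)
z       <- (p_hat1 - p_hat2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_value <- pnorm(q = abs(z), lower.tail = FALSE) * 2

# Cross-check against R's built-in test
check <- prop.test(c(x1, x2), c(n1, n2), correct = FALSE)
```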

Chapter 21:The chi squared goodness of fit test

One categorical variable with more than 2 categories
Estimate how many observations we expect in each category
Compare the number of observations in each category to the expected value
Suppose that the following number of people were selected for jury duty in the previous year, in a county where jury selection was supposed to be random.
Ethnicity   White   Black   Latinx   Asian   Other   Total
Count       1920    347     19       84      130     2500
You want to compare each group's share of those selected with its share of the population, to make sure each group is represented proportionately
You can use a line graph to show the difference between expected and observed

Chi Squared Stat
Equal to the sum, over all categories, of
(observed - expected)^2 / expected
Chi squared distribution 
Like the t distribution, the only parameter is degrees of freedom
df = number of groups minus 1
For this example it would be 5-1 = 4
As df increases, the center of the distribution moves to the right
Chi-square is never negative, so always take the upper tail
Once you find the chi squared value plug it into
pchisq(q = 1606.454, df = 4, lower.tail = FALSE)
Chi Square function
chisq.test(x = c(1920, 347, 19, 84, 130),
p = c(.422, .103, .251, .171, .053))
Conditions for Chi Squared
Fixed number of n observations
All observations are independent of each other
Each observation falls into exactly one of the k mutually exclusive categories
At least 80% of the cells have expected counts of 5 or more
All k cells have expected counts of more than 1
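
The jury example worked by hand, using the counts and population shares from the chisq.test() call in these notes. The variable names are just for this sketch.

```r
observed <- c(White = 1920, Black = 347, Latinx = 19, Asian = 84, Other = 130)
p_null   <- c(.422, .103, .251, .171, .053)    # population shares under the null

# Expected counts and the chi squared stat
expected <- sum(observed) * p_null
chi_sq   <- sum((observed - expected)^2 / expected)
df       <- length(observed) - 1
p_value  <- pchisq(q = chi_sq, df = df, lower.tail = FALSE)

# Conditions on the expected counts
cond_80_percent <- mean(expected >= 5) >= 0.80
cond_all_gt_1   <- all(expected > 1)

# Same test in one call
check <- chisq.test(x = observed, p = p_null)
```

chi_sq comes out at about 1606.454, the value plugged into pchisq() in the notes.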

Chapter 22: Inference two way tables

Last chapter was about one categorical variable, this one is about 2
For example, what is the conditional probability of vaping among teens exposed to a JUUL advertisement vs. teens unexposed?
Group          Lung Cancer   No Lung Cancer   Row total
Column total   19            981              1000

Then find the expected counts under the null hypothesis that the two variables are independent (the condition you are observing has no effect on the other)
Shortcut for expected counts
Expected = row total * col total / overall total
Calculate the chi squared stat the same way
Sum of (observed - expected)^2 / expected over all cells
Calculate degrees of freedom
For this it is (number of rows - 1) times (number of columns - 1)
For this example it is (2-1)(2-1) = 1
Use the same R function to find the p-value
pchisq(q = 15.04015, df = 1, lower.tail = FALSE) #df = (2-1)(2-1) = 1
Chi Squared test of independence 
chisq.test(two_way, correct = FALSE)
chisq.test(two_way, correct = TRUE)
Use the correction whenever n < 100 or any observed count is less than 10
Conditions for the chi square test of independence
Expected count is at least 5 for at least 80% of the cells
All expected counts are greater than 1
If the table is 2x2 then all four cells need expected counts of at least 5
Assumptions for the chi squared test of independence
Must have data from independent SRSs from at least 2 populations, with mutually exclusive categories
Or a single SRS with each individual classified according to each of two categorical variables
For a 2x2 table, z^2 from the two-proportion z test is the same as the chi squared stat
The p-values for the two sided z test and the chi squared test are the same
With a 2x2 table you may want to use a z test when you need a one sided alternative, because you cannot do a one sided test with chi squared
Use dodged bar charts to compare the conditional distributions of one variable across levels of another variable
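
A sketch of the expected count shortcut, the chi squared stat and the z^2 = chi squared fact for a 2x2 table. The counts and the labels in two_way are invented for illustration.

```r
# Hypothetical 2x2 table of counts (illustrative only)
two_way <- matrix(c(30, 170,
                    15, 285),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(group = c("exposed", "unexposed"),
                                  outcome = c("yes", "no")))

# Expected = row total * col total / overall total, for every cell at once
expected <- outer(rowSums(two_way), colSums(two_way)) / sum(two_way)
chi_sq   <- sum((two_way - expected)^2 / expected)
df       <- (nrow(two_way) - 1) * (ncol(two_way) - 1)
p_value  <- pchisq(q = chi_sq, df = df, lower.tail = FALSE)

# Two-proportion z test on the same counts
n_row  <- rowSums(two_way)
p_hat  <- two_way[, "yes"] / n_row
p_pool <- sum(two_way[, "yes"]) / sum(two_way)
z      <- unname((p_hat[1] - p_hat[2]) / sqrt(p_pool * (1 - p_pool) * sum(1 / n_row)))

# Same test in one call, without the continuity correction
check <- chisq.test(two_way, correct = FALSE)
```

z^2 and chi_sq should match, and both should match check$statistic.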
Chapter 23: Inference for Regression

Recap of regression from pt1
Graph the data. Does the relationship look linear? What is the correlation coefficient?
Calculate the line of best fit with lm()
Use glance() and tidy() from library(broom) to summarize model findings
Interpret the slope (b_hat) and intercept (a_hat) estimates
Interpret the r squared value

Assumptions to check for regression inference
The relationship between x and y is linear in the pop
y varies normally around the line of best fit. That is, the residuals vary normally around the line of best fit
Residuals refer to the vertical distance between the line of best fit and the observed y value
Observations are independent
This cannot be checked on the plot, we need to know the study design
The sd of the responses is the same for all values of x
Observed value: y
Fitted value: y_hat = a_hat + b_hat*x
Estimated residual: r_hat = observed value - fitted value = y - (a_hat + b_hat*x)

Graphs used to check
Scatter Plot
Shows the fitted regression line and the data. The estimated residuals are shown by the dashed lines. We want to see residuals that are both positive and negative with no trend
Check whether the residuals are Normally distributed
Fitted v. Residuals
Check for random scatter
Amount Explained
Boxplot of the distribution of y vs the distribution of the residuals. If x does a good job of describing y, then the box plot for the residuals will be much shorter

Regression procedures are not too sensitive to lack of normality
Outliers are important since they can have a large effect

Chapter 23 Pt.2 Inference for Regression

tidy(your_lm) presents the coefficient output of the linear model
glance(your_lm) takes a quick one line look at fit stats
augment(your_lm) creates an augmented data frame that contains a column for the fitted y-values (y_hat) and the residuals (e_hat = y - y_hat) among other columns

New terminology: SSE
Sum of squared estimates of error
The SSE is the sum of the squared distances between each individual's y value and the fitted value from the line of best fit
For the same data, the higher the SSE, the worse the fit

Regression standard error
Used to measure how well a model fits
s = √(SSE / (n-2))
A well fitting model should have a low regression standard error
Look at s after running a linear model to assess the model’s fit to the data
s is on the same scale as y, same units
glance(your_lm) will print s, labelled sigma
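
A sketch of fitted values, residuals, SSE and the regression standard error computed by hand. The x and y values are made up, and summary(fit)$sigma from base R is used as the check so the block runs without broom (glance() should report the same number as sigma).

```r
# Hypothetical data (illustrative only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.3, 4.1, 5.8, 8.4, 9.9, 12.2, 13.8, 16.5, 17.9, 20.1)

fit   <- lm(y ~ x)
a_hat <- unname(coef(fit)[1])       # intercept estimate
b_hat <- unname(coef(fit)[2])       # slope estimate

y_hat <- a_hat + b_hat * x          # fitted values
e_hat <- y - y_hat                  # residuals = observed - fitted
sse   <- sum(e_hat^2)               # sum of squared residuals
s     <- sqrt(sse / (length(y) - 2))

sigma_from_lm <- summary(fit)$sigma
```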

Hypothesis testing for regression
We would like to know if the slope is different from 0
H_null: b = 0 
There is no association between x and y
H_alt: b is not equal to 0 for a two sided test
There is an association

Know how to use R to find these values in the tidy() output
estimate is the estimated slope coefficient b_hat
std.error is the standard error, SE_b 
statistic is the t test stat b_hat / SE_b
The test always has n-2 df
Use pt to find the p-value
pt(q = 6.7211302, df = 18, lower.tail = FALSE)*2
We can also use the tidy(lm) output to build a CI for the regression coefficient
b_hat + or - t_star * SE_b
t_star <- qt(p = 0.975, df = 18)
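
A sketch of the slope t test and CI done from the coefficient table. The data are simulated with n = 20 so that df = 18 as in the notes; summary() and confint() from base R stand in for tidy() so the block runs without broom.

```r
# Simulated data with n = 20, so df = n - 2 = 18 (illustrative only)
set.seed(1)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20, sd = 2)

fit   <- lm(y ~ x)
coefs <- summary(fit)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)

b_hat <- coefs["x", "Estimate"]
se_b  <- coefs["x", "Std. Error"]
df    <- fit$df.residual             # n - 2

# Test stat and two-sided p-value for H_null: b = 0
t_stat  <- b_hat / se_b
p_value <- pt(q = abs(t_stat), df = df, lower.tail = FALSE) * 2

# 95% CI for the slope: b_hat + or - t_star * SE_b
t_star <- qt(p = 0.975, df = df)
ci     <- b_hat + c(-1, 1) * t_star * se_b
```

The p_value should match the Pr(>|t|) column and ci should match confint(fit)["x", ].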

Test for the lack of correlation
There is no correlation if and only if there is no linear association between the explanatory and response variables
Thus if your hypothesis test does not reject the null (b = 0), you would also not reject the hypothesis of no correlation between x and y

Chapter 24: ANOVA

Analysis of variance
When the ratio of between vs within variation is large enough, then we detect a difference between the groups
When the ratio is not large enough we do not detect the difference
The ratio is our test stat, denoted by F
Use a box plot for each level of the grouping variable
Make a density plot for each level of the grouping variable
Histogram for each level of the grouping variable
Null: mu1 = mu2 = ... = muk
Alt: not all mu are equal
At least one mean differs from the rest
High-grade glioma is an aggressive type of brain cancer with a low long-term survival rate. Cannabinoids, chemical compounds found in cannabis, are thought to inhibit glioma cell growth. Researchers transplanted glioma cells into otherwise-healthy mice, and then randomly assigned these mice to 4 cancer treatments: irradiation alone, cannabinoids alone, irradiation combined with cannabinoids, or no treatment. The treatments were administered for 21 days, after which the glioma tumor volume (in cubic millimeters) was assessed in each mouse using brain imaging.
The Test Stat (ANOVA F)
F = variation among group means / variation among individuals in the same group
F = mean squares for groups / mean squares for error
Numerator: MSG
Let x_bar represent the overall sample mean
MSG = [n1(x_bar1-x_bar)^2 + … + nk(x_bark-x_bar)^2] / (k-1)
Denominator: MSE
MSE = [(n1-1)s1^2 + … + (nk-1)sk^2] / (Ntotal - k)
If the stat is high then there is relatively more variation among groups than there is within groups
If the stat is less than 1, then there is more variation across individuals in the same group than there is among groups
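
A sketch of MSG, MSE and F computed by hand and checked against aov(). The tumor volumes are invented (5 mice per group), not the study data; the column names follow the aov() call used in these notes.

```r
# Hypothetical tumor volumes for 4 treatments (illustrative only, not the study data)
cancer_data <- data.frame(
  treatment = rep(c("none", "irradiation", "cannabinoids", "combined"), each = 5),
  tumor_volume = c(52, 48, 55, 50, 47,
                   41, 44, 39, 45, 42,
                   43, 40, 46, 44, 41,
                   22, 25, 19, 27, 24)
)

y <- cancer_data$tumor_volume
g <- factor(cancer_data$treatment)
k <- nlevels(g)                      # number of groups
N <- length(y)                       # total sample size

n_i    <- tapply(y, g, length)       # group sizes
xbar_i <- tapply(y, g, mean)         # group means
s2_i   <- tapply(y, g, var)          # group variances

msg <- sum(n_i * (xbar_i - mean(y))^2) / (k - 1)
mse <- sum((n_i - 1) * s2_i) / (N - k)
f_stat  <- msg / mse
p_value <- pf(q = f_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

# Same thing with aov()
cancer_anova <- aov(formula = tumor_volume ~ treatment, data = cancer_data)
anova_table  <- summary(cancer_anova)[[1]]
```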
Anova in R
cancer_anova <- aov(formula = tumor_volume ~ treatment, data = cancer_data)
Use tidy() to display the results
df displays the numerator and denominator degrees of freedom
sumsq displays the sum of squares for groups and sum of squares for error, meansq displays the MSG and MSE respectively
statistic is the F test stat
p.value is the p-value

Finding which group is different from the rest
Tukey’s honestly significant differences (HSD)
This test maintains a 5% experimentwise error rate
The error rate is 5% overall no matter how many tests we do
diffs <- TukeyHSD(cancer_anova, conf.level = 0.95) %>% tidy()
Each row in the table corresponds to a pairwise test
Note that when you have an adjusted test, you cannot use the CI to infer the value of the p-value
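
A sketch of what the Tukey output looks like, using three invented groups. It reads the matrix that TukeyHSD() returns directly instead of piping into tidy(), so it runs without broom.

```r
# Hypothetical data, three groups of six (illustrative only)
dat <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 6)),
  y = c(10, 12, 11, 13,  9, 11,
        14, 15, 13, 16, 14, 15,
        10, 11, 12, 10, 13, 11)
)

fit   <- aov(y ~ group, data = dat)
# One row per pair of groups; columns are diff, lwr, upr and p adj
diffs <- TukeyHSD(fit, conf.level = 0.95)$group

# With k groups there are choose(k, 2) pairwise comparisons
n_pairs <- choose(nlevels(dat$group), 2)
```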
Conditions for ANOVA
1: independent SRSs, one from each of k populations
The most important assumption, because this method, unlike others from Pt.3, depends on having a random sample
2: Each of the populations has a Normal distribution with an unknown mean
This is less necessary
The ANOVA test is robust to non-Normality
Normality of the sample means is more important
If the sample size is small, 4-5 individuals per group, then we need data that are roughly symmetric with no outliers
3: All of the populations have the same SD, whose value is unknown
Hardest to satisfy and check
If this is not exactly satisfied it is usually OK
Use group_by() and summarize() to calculate the sample SDs to see if they are similar, which suggests that the population parameters are too
Rule of thumb
Want the largest sample SD to be less than 2x the smallest one
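
A sketch of the equal SD check. The data are invented, and tapply() is used here as a base R stand-in for group_by() plus summarize().

```r
# Hypothetical data, three groups of five (illustrative only)
dat <- data.frame(
  group = rep(c("a", "b", "c"), each = 5),
  y = c(10, 12, 11, 13,  9,
        14, 18, 13, 16, 11,
        10, 11, 12, 10, 13)
)

group_sds <- tapply(dat$y, dat$group, sd)      # sample SD in each group
sd_ratio  <- max(group_sds) / min(group_sds)   # largest SD over smallest SD
rule_ok   <- sd_ratio < 2                      # want this to be TRUE
```

In this made-up data the ratio is a bit above 2, so the check fails and the equal SD condition would be questionable.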
No_Chapter Bootstrap Confidence Intervals

We will use this to find a CI when the data are not Normally distributed
We can also find the CI for the median, a quartile or some other parameter
This method takes repeated samples with replacement from our sample
1. Find the median of the original sample. Denote this as m
2. Resample with replacement from the original sample to get a new sample of the same size (54 in this example)
3. Calculate the median based on resample #1. Call this median m1*
4. Resample again and calculate the median based on resample #2. Call this median m2*. Repeat this thousands of times
5. Make a histogram of all the m*. This histogram will approximate the sampling distribution for the median
6. Calculate the bounds such that the middle 95% of the m* are between the lower and upper bounds. In R
quantile(sample_median, 0.025) and
quantile(sample_median, 0.975)
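
The six steps as a short sketch. The original sample here is simulated (54 skewed values) because the course data are not in these notes, and 5000 resamples is an arbitrary choice.

```r
set.seed(42)
# Hypothetical right-skewed sample of size 54 (illustrative only)
original <- rexp(54, rate = 1 / 10)

# Step 1: median of the original sample
m <- median(original)

# Steps 2 to 4: resample with replacement, same size, thousands of times
reps <- 5000
sample_median <- replicate(
  reps,
  median(sample(original, size = length(original), replace = TRUE))
)

# Step 5 would be hist(sample_median)

# Step 6: the middle 95% of the bootstrap medians
lower <- quantile(sample_median, 0.025)
upper <- quantile(sample_median, 0.975)
```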
When to use Bootstrap
When we do not have a nice formula to calculate the CI or do not know what the formula is
When the underlying assumptions of the large sample formulas are not satisfied
We can make bootstrap CIs around any statistic we’ve learnt about
No_chapter Permutation Tests

Permutation tests are used when we do not have a large enough sample and/or our data are not from an SRS
Like bootstrapping but for hypothesis testing
Background: Malaria and alcohol consumption both represent major public health problems. Alcohol consumption is rising in developing countries and, as efforts to manage malaria are expanded, understanding the links between malaria and alcohol consumption becomes crucial. Our aim was to ascertain the effect of beer consumption on human attractiveness to malaria mosquitoes in semi field conditions in Burkina Faso
Volunteers are randomly assigned to beer or water
We COULD use a t test to determine if there is a difference in mosquito attraction between the beer and water drinkers, OR we could shuffle the treatment labels and recompute the difference between the groups many times
Permutation requires you to load library(infer)

We’ll use specify(), hypothesize(), generate() and calculate() from infer

null_distn <- mosq_data %>% 
  specify(response = num_mosquitos, explanatory = treatment) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("beer", "water"))

Use get_pvalue() to get the p-value
null_distn %>% get_pvalue(obs_stat = 23.6-19.22, direction = "two_sided")

If the null is true then the distribution of the response variable is the same for each level of the explanatory variable, so the data should look the same after shuffling
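
A base R sketch of the same shuffle-the-labels idea, to show what the infer pipeline is doing under the hood. The mosquito counts are invented, the helper diff_in_means is made up for this sketch, and the two-sided p-value here (share of shuffles at least as extreme in absolute value) is one common definition that may differ slightly from what get_pvalue() reports.

```r
set.seed(7)
# Hypothetical mosquito counts (illustrative only, not the study data)
mosq_data <- data.frame(
  treatment = rep(c("beer", "water"), times = c(10, 8)),
  num_mosquitos = c(27, 20, 21, 26, 27, 31, 24, 21, 20, 19,
                    21, 22, 15, 12, 21, 16, 19, 15)
)

# Difference in means, beer minus water
diff_in_means <- function(y, labels) {
  mean(y[labels == "beer"]) - mean(y[labels == "water"])
}

obs_stat <- diff_in_means(mosq_data$num_mosquitos, mosq_data$treatment)

# Shuffle the labels many times and recompute the stat each time
null_distn <- replicate(
  1000,
  diff_in_means(mosq_data$num_mosquitos, sample(mosq_data$treatment))
)

# Share of shuffled differences at least as extreme as the observed one
p_value <- mean(abs(null_distn) >= abs(obs_stat))
```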

Bonus_chapter Regression model with cat exposure

qt, pt, qnorm, pnorm, pchisq
Testing functions: t.test, binom.test, prop.test, chisq.test
broom: tidy, glance, augment; modeling: lm, predict, confint, aov, TukeyHSD
ggplot2, dplyr; know what R code does, including <-
Check the Inference Formulas PDF
A study is designed to test whether there is a difference in mean daily calcium intake in adults with normal bone density, adults with osteopenia (a low bone density which may lead to osteoporosis) and adults with osteoporosis. Adults 60 years of age with normal bone density, osteopenia and osteoporosis are selected at random from hospital records and invited to participate in the study. Each participant’s daily calcium intake is measured based on reported food intake and supplements. The data are shown below.
