Applied Statistics course notes
Preface
I Preparing Data for Analysis
1 Workflow and Data Cleaning
1.1 Generating a reproducible workflow
1.1.1 Literate programming
1.1.2 Reproducible Research + Literate Programming
1.2 Import data
1.2.1 Renaming variables for sanity's sake
1.3 Data Management
1.3.1 Missing data
1.3.2 Identifying Variable Types
1.3.3 Outliers
1.3.4 Creating secondary variables
1.4 Saving your changes
1.5 Wide vs. Long data
1.6 Dealing with missing data post-analysis
2 Visualizing Data
2.1 The syntax of ggplot
2.1.1 Required arguments
2.1.2 Optional but helpful arguments
2.2 The Data
2.3 Univariate Visualizations
2.3.1 Categorical variables
2.3.2 Continuous Measures
2.4 Bivariate Visualizations
2.4.1 Categorical v. Categorical
2.4.2 Grouped bar charts with percentages
2.4.3 Continuous v. Continuous
2.4.4 Continuous v. Categorical
2.4.5 Joy plots / Ridgelines
2.5 Faceting / paneling
2.6 Multiple plots per window
2.7 Multivariate (3+ variables)
2.7.1 Three continuous
2.7.2 Scatterplot matrix
2.7.3 Two categorical and one continuous
2.7.4 Two continuous and one categorical
2.8 Paneling on two variables
2.9 Troubleshooting
2.10 But what about…
2.10.1 Other plots not mentioned
2.11 Additional Resources
3 Selecting Appropriate Analyses
II Statistical Inference
4 Foundations for Inference
5 Bivariate Analysis
5.1 Assumption of Independent Observations
5.2 Choosing appropriate bivariate analysis
5.3 (Q~B) Two means: T-Test
5.3.1 Assumptions
5.3.2 Sampling Distribution for the difference
5.3.3 Example: Smoking and BMI
5.4 (Q~C) Multiple means: ANOVA
5.4.1 Formulation of the One-way ANOVA model
5.4.2 Analysis of Variance Table
5.4.3 The F-distribution
5.4.4 Assumptions
5.4.5 Example: A comparison of plant species under low water conditions
5.4.6 Coefficient of determination \(R^{2}\)
5.4.7 Multiple Comparisons
5.5 (C~C) Multiple Proportions: \(\chi^{2}\)
5.5.1 Conditions for the sampling distribution to be normal
5.5.2 Example: Are Mammograms effective?
5.5.3 Example: Smoking and General Health
5.5.4 Multiple Comparisons
5.6 (Q~Q) Correlation
5.6.1 Example: Federal spending per capita and poverty rate
III Regression Modeling
6 Introduction
7 Simple Linear Regression
7.1 Mathematical Model
7.2 Parameter Estimates
7.3 Least Squares Regression
7.4 Interval estimation
7.5 Correlation Coefficient
7.6 Assumptions
7.7 Example
7.7.1 Confidence and Prediction Intervals
7.8 ANOVA for regression
8 Multiple Linear Regression
8.1 Types of X variables
8.2 Mathematical Model
8.3 Parameter Estimation
8.4 Example
8.5 Binary predictors
8.6 Categorical Predictors
8.6.1 Factor variable coding
8.7 Model Diagnostics
8.8 Multicollinearity
8.9 What to watch out for
9 Model Building
9.1 Stratification
9.2 Moderation
9.2.1 Example 1: Sepal vs Petal Length
9.2.2 Example 2: Simpson’s Paradox
9.3 Interactions
9.3.1 Example 1
9.3.2 Example 2
9.3.3 Example 3: The relationship between income, employment status and depression
9.4 Confounding
9.4.1 Example: Does smoking affect pulse rate?
9.5 Wald test (General F)
9.6 Variable Selection Process
9.6.1 Automated selection procedures
9.6.2 Implementation in R
9.6.3 Lasso
9.7 Comparing between models
9.7.1 RSS: Residual Sum of Squares
9.7.2 Likelihood function
9.7.3 General F Test
9.7.4 Multiple \(R^{2}\)
9.7.5 Adjusted \(R^{2}\)
9.7.6 Mallows \(C_{p}\)
9.7.7 Akaike Information Criterion (AIC)
9.7.8 Bayesian Information Criterion (BIC)
9.7.9 AIC vs BIC
9.8 General Advice
9.9 What to watch out for
10 Generalized Linear Models
10.1 Fitting GLMs
10.1.1 R
10.1.2 SPSS
10.1.3 Stata
10.2 Binary outcome data
10.3 Logistic Regression
10.3.1 Interpreting Odds Ratios
10.3.2 Example: The effect of gender on Depression
10.3.3 Multiple Logistic Regression
10.3.4 Effect of a k unit change
10.3.5 Example: Predictors of smoking status
10.3.6 Reporting
10.3.7 Model Fit
10.4 Log-linear models
10.4.1 Example
10.5 Count outcome data
10.6 Categorical outcome data
11 Classification of Binary outcomes
11.1 Calculating predictions
11.2 Confusion Matrix
11.3 Vocabulary terms
11.4 ROC Curves
11.5 Model Performance
IV Multivariate Analysis
12 Introduction
13 Principal Component Analysis
13.1 When is Principal Components Analysis (PCA) used?
13.2 Basic Idea - change of coordinates
13.3 More Generally
13.4 Generating PCs using R
13.5 Data Reduction
13.6 Standardizing
13.7 Example
13.8 Use in Multiple Regression
13.8.1 Example: Modeling acute illness
13.9 Things to watch out for
13.10 Additional References
14 Factor Analysis
14.1 Introduction
14.1.1 Latent Constructs
14.1.2 Comparison with PCA
14.1.3 EFA vs CFA
14.2 Factor Model
14.2.1 Components of Variance
14.2.2 Two big steps
14.3 Example data setup
14.4 Factor Extraction Methods
14.4.1 Principal components (PC Factor model)
14.4.2 Iterated components
14.4.3 Maximum Likelihood
14.4.4 R code
14.5 Rotating Factors
14.5.1 Varimax Rotation
14.5.2 Oblique rotation
14.6 Factor Scores
14.7 What to watch out for
14.8 Additional Resources
15 Cluster Analysis
15.1 When is cluster analysis used?
15.1.1 Data used in this chapter
15.1.2 Packages used in this chapter
15.2 Graphical Methods
15.3 Distance Measures
15.3.1 Other measures of distance
15.3.2 Gower's dissimilarity measure as a distance measure for binary data
15.3.3 Creating the distance matrix
15.4 Hierarchical clustering
15.4.1 Linkages
15.4.2 Comparing Linkage methods
15.4.3 Dendrogram extras
15.5 Non-hierarchical clustering
15.5.1 Visualizing k-means
15.6 Choosing K
15.6.1 Visually
15.6.2 Elbow method
15.6.3 Gap statistic
15.7 Assigning Cluster labels
15.8 Exploring clusters
15.8.1 Univariate distribution of clusters
15.8.2 Bivariate plots
15.8.3 Multivariate plots
15.8.4 Still too many records / variables
15.9 What to watch out for
15.10 Additional References
V Multi-level Modeling
16 Introduction
16.1 Example School Data
16.2 Multi-level models
17 Random Intercept Models
17.1 Pooling
17.2 Mathematical Models
17.2.1 Complete Pooling
17.2.2 No Pooling
17.2.3 Partial Pooling (RI)
17.3 Components of Variance
17.4 Fitting models in R
17.4.1 Comparison of estimates
17.5 Estimation Methods
17.6 Including Covariates
17.7 More Random Effects
17.8 Centering terms
17.8.1 A generic dplyr approach to centering
17.9 Specifying Correlation Structures
17.9.1 Changing covariance structures in R
17.10 Additional References
17.10.1 Lecture notes from other classes found on the interwebs
17.10.2 Package Vignettes
VI Other Topics
18 Missing Data
18.1 Identifying missing data
18.1.1 Visualize missing patterns
18.2 Effects of Nonresponse
18.3 Missing Data Mechanisms
18.3.1 Demonstration via Simulation
18.4 General strategies
18.4.1 Complete case analysis
18.4.2 Available-case analysis
18.4.3 Imputation
18.5 Imputation Methods
18.6 Multiple Imputation (MI)
18.6.1 Goals
18.6.2 Technique
18.6.3 MI as a paradigm
18.6.4 Inference on MI
18.7 Multiple Imputation using Chained Equations (MICE)
18.7.1 Overview
18.7.2 Process / Algorithm
18.7.3 Convergence
18.7.4 Imputation Methods
18.8 Diagnostics
18.9 Example: Prescribed amount of missing data
18.9.1 Multiply impute the missing data using mice()
18.9.2 Check the imputation method used on each variable
18.9.3 Check Convergence
18.9.4 Look at the values generated for imputation
18.9.5 Create a complete data set by filling in the missing data using the imputations
18.9.6 Visualize Imputations
18.9.7 Calculating bias
18.10 Final thoughts
18.11 Additional References
9.9 What to watch out for

- Multicollinearity among the candidate predictors
- Missing data, which shrinks the sample available during model selection
- Use previous research as a guide for which variables to consider
- Variables not included in the model can bias the results
- Significance levels are only a guide
- Perform model diagnostics after selection to check model fit (a short sketch follows below)
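
As a concrete reminder of the last two checks, here is a minimal R sketch (the data set and candidate model are illustrative, not taken from these notes): fit the selected model, inspect variance inflation factors with car::vif(), and look at the standard diagnostic plots.

```r
# Minimal sketch using R's built-in mtcars data; the candidate model below
# is hypothetical and stands in for whatever the selection procedure chose.
library(car)                 # provides vif()

fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Multicollinearity check: large VIFs (common rules of thumb: > 5 or > 10)
# flag predictors that are highly correlated with the other predictors.
vif(fit)

# Model diagnostics after selection: residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage plots.
par(mfrow = c(2, 2))
plot(fit)
```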