1.10 Wrangling Factors

For more help on renaming, releveling, lumping, and removing levels, see Math 130 lesson 06 for now, as well as the forcats vignette.

1.10.1 Collapsing categorical variables into fewer categories

Unbiased and accurate results from a statistical analysis require sufficient data. Once you start slicing and dicing the data to look only at certain groups, or to examine the behavior of certain variables across levels of another variable, you can quickly run into small sample size problems.
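One way to spot thin categories before modeling is to tabulate the variable and flag any level that falls below a chosen cutoff. The snippet below is only a sketch: `marital_toy` is a made-up stand-in whose counts mirror the marital status table shown later in this section, and the cutoff `min_n` is a hypothetical value, not a rule.

```r
# Toy stand-in for a categorical variable (not the real depress data);
# counts mirror the marital status table in this section.
marital_toy <- factor(
  rep(c("Never Married", "Married", "Divorced", "Separated", "Widowed"),
      times = c(73, 127, 43, 13, 38))
)

# Hypothetical cutoff; pick one that suits the planned analysis.
min_n <- 20

# Count records per level, including any missing values.
level_counts <- table(marital_toy, useNA = "ifany")

# Names of the levels whose count falls below the cutoff.
small_levels <- names(level_counts)[as.vector(level_counts) < min_n]
small_levels
```

With these counts and this cutoff, only the Separated level is flagged. The same idea extends to two-way tables when you are looking at one variable across levels of another.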

For example, consider marital status again. Only 13 people report being separated, which could be too small a group for valid statistical analysis. One way to deal with insufficient data within a category is to collapse categories. The following code uses the recode() function from the car package to create a new variable, which I am calling marital2, that combines the Divorced and Separated levels.

⚠️ Note: See Math 130 lesson 06 for a better method using forcats

# car:: prefix avoids a name clash with dplyr::recode() if dplyr is also loaded
marital2 <- car::recode(depress$marital, "'Divorced' = 'Sep/Div'; 'Separated' = 'Sep/Div'")

Always confirm your recodes. Check a table of the old variable (marital) against the new one (marital2).

table(depress$marital, marital2, useNA="always")
##                marital2
##                 Married Never Married Sep/Div Widowed <NA>
##   Never Married       0            73       0       0    0
##   Married           127             0       0       0    0
##   Divorced            0             0      43       0    0
##   Separated           0             0      13       0    0
##   Widowed             0             0       0      38    0
##   <NA>                0             0       0       0    0

This confirms that records where marital (rows) is Divorced or Separated have the value Sep/Div for marital2 (columns), and that no missing values crept in during the recode. Now I can drop the temporary marital2 variable and overwrite marital itself, which keeps the data set clean.

# Overwrite the original variable once the recode has been confirmed
depress$marital <- car::recode(depress$marital, "'Divorced' = 'Sep/Div'; 'Separated' = 'Sep/Div'")
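For comparison, here is a rough sketch of the forcats approach mentioned in the note above, using fct_collapse(). It is not necessarily the exact method from Math 130 lesson 06. The factor `marital_toy` is again a made-up stand-in built from the counts in the table above, and `marital_collapsed` is a hypothetical name, so adapt both to the real depress data.

```r
library(forcats)

# Toy stand-in for depress$marital (illustrative only).
marital_toy <- factor(
  rep(c("Never Married", "Married", "Divorced", "Separated", "Widowed"),
      times = c(73, 127, 43, 13, 38)),
  levels = c("Never Married", "Married", "Divorced", "Separated", "Widowed")
)

# Collapse Divorced and Separated into a single Sep/Div level.
marital_collapsed <- fct_collapse(marital_toy,
                                  "Sep/Div" = c("Divorced", "Separated"))

# Confirm the recode the same way as before: old (rows) against new (columns).
table(marital_toy, marital_collapsed, useNA = "always")

# The same check written as assertions: no new missing values,
# and the combined level holds 43 + 13 = 56 records.
stopifnot(
  sum(is.na(marital_collapsed)) == sum(is.na(marital_toy)),
  sum(marital_collapsed == "Sep/Div") == 56
)
```

One practical difference is that fct_collapse() takes the new level name and the old levels as ordinary R arguments, so there is no recode specification packed into a single quoted string.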