1.10 Wrangling Factors
For now, see Math 130 lesson 06 for more help on renaming, releveling, lumping, and removing levels. The forcats vignette is also a good reference.
1.10.1 Collapsing categorical variables into fewer categories
Unbiased and accurate results from a statistical analysis require sufficient data. Once you start slicing and dicing the data to look only at certain groups, or to examine how certain variables behave across levels of another variable, you can quickly run into small sample size problems.
For example, consider marital status again. There are only 13 people who report being separated. This group may be too small for valid statistical analysis.
One way to deal with insufficient data within a certain category is to collapse categories. The following code uses the recode() function from the car package to create a new variable that I am calling marital2 that combines the Divorced and Separated levels.
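A minimal sketch of such a recode is given here. It assumes depress$marital is a factor with the five levels shown in the confirmation table in this section. The small data frame built in the sketch is a stand-in with the same counts, not the real data set. The call is written as car::recode() so it cannot be confused with dplyr::recode().

```r
library(car)  # provides recode(); called as car::recode() to avoid clashing with dplyr::recode()

# Stand-in data (assumption): same marital counts as the confirmation table in this section
depress <- data.frame(
  marital = factor(
    rep(c("Never Married", "Married", "Divorced", "Separated", "Widowed"),
        times = c(73, 127, 43, 13, 38)),
    levels = c("Never Married", "Married", "Divorced", "Separated", "Widowed")
  )
)

# Map both Divorced and Separated to one combined level.
# Levels not mentioned in the recode string pass through unchanged.
marital2 <- car::recode(depress$marital, "c('Divorced', 'Separated') = 'Sep/Div'")

table(marital2, useNA = "always")
```

Because the input is a factor, the result is also a factor, and by default car::recode() sorts the new level names alphabetically.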
⚠️ Note: See Math 130 lesson 06 for a better method using forcats
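One forcats option for this task is fct_collapse(). The sketch below is an illustration under the same assumptions as above (a stand-in data frame with the counts from this section). It is not necessarily the approach taught in the Math 130 lesson.

```r
library(forcats)

# Stand-in data (assumption): same marital counts as the confirmation table in this section
depress <- data.frame(
  marital = factor(
    rep(c("Never Married", "Married", "Divorced", "Separated", "Widowed"),
        times = c(73, 127, 43, 13, 38)),
    levels = c("Never Married", "Married", "Divorced", "Separated", "Widowed")
  )
)

# New level name on the left, the old levels it absorbs on the right
marital2 <- fct_collapse(depress$marital, "Sep/Div" = c("Divorced", "Separated"))

# Cross-tabulate old against new to confirm the collapse
table(depress$marital, marital2, useNA = "always")
```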
Always confirm your recodes. Check a table of the old variable (marital) against the new one (marital2).
table(depress$marital, marital2, useNA="always")
##                marital2
##                 Married Never Married Sep/Div Widowed <NA>
##   Never Married       0            73       0       0    0
##   Married           127             0       0       0    0
##   Divorced            0             0      43       0    0
##   Separated           0             0      13       0    0
##   Widowed             0             0       0      38    0
##   <NA>                0             0       0       0    0
This confirms that records where marital (rows) is Divorced or Separated have the value of Sep/Div for marital2 (columns), and that no missing data crept in during the process. Now I can drop the temporary marital2 variable and actually fix marital, which keeps the workspace clean.
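A sketch of that cleanup step follows, under the same assumptions as before: a stand-in depress data frame with the counts from the table above, and car::recode() as the recoding function.

```r
library(car)

# Stand-in data (assumption): same marital counts as the confirmation table above
depress <- data.frame(
  marital = factor(
    rep(c("Never Married", "Married", "Divorced", "Separated", "Widowed"),
        times = c(73, 127, 43, 13, 38)),
    levels = c("Never Married", "Married", "Divorced", "Separated", "Widowed")
  )
)

# Temporary variable that was used to check the recode
marital2 <- car::recode(depress$marital, "c('Divorced', 'Separated') = 'Sep/Div'")

# The check passed, so remove the temporary variable from the workspace
rm(marital2)

# Apply the same recode to the original column, overwriting it in place
depress$marital <- car::recode(depress$marital, "c('Divorced', 'Separated') = 'Sep/Div'")

table(depress$marital, useNA = "always")
```

Overwriting marital only after the check has passed means a mistake in the recode string cannot silently damage the original column.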