1.16 Wide vs. Long data

Read more on tidy data here: https://r4ds.hadley.nz/data-tidy

The data on lung function was originally recorded in wide format, with separate variables for the mother’s and father’s FEV1 scores (MFEV1 and FFEV1). In this format, the data has one record per family.

fev <- read.delim("https://norcalbiostat.netlify.com/data/Lung_081217.txt", 
                    sep="\t", header=TRUE)
head(fev)
##   ID AREA FSEX FAGE FHEIGHT FWEIGHT FFVC FFEV1 MSEX MAGE MHEIGHT MWEIGHT MFVC
## 1  1    1    1   53      61     161  391  3.23    2   43      62     136  370
## 2  2    1    1   40      72     198  441  3.95    2   38      66     160  411
## 3  3    1    1   26      69     210  445  3.47    2   27      59     114  309
## 4  4    1    1   34      68     187  433  3.74    2   36      58     123  265
## 5  5    1    1   46      61     121  354  2.90    2   39      62     128  245
## 6  6    1    1   44      72     153  610  4.91    2   36      66     125  349
##   MFEV1 OCSEX OCAGE OCHEIGHT OCWEIGHT OCFVC OCFEV1 MCSEX MCAGE MCHEIGHT
## 1  3.31     2    12       59      115   296   2.79    NA    NA       NA
## 2  3.47     1    10       56       66   323   2.39    NA    NA       NA
## 3  2.65     1     8       50       59   114   1.11    NA    NA       NA
## 4  2.06     2    11       57      106   256   1.85     1     9       49
## 5  2.33     1    16       61       88   260   2.47     2    12       60
## 6  3.06     1    15       67      100   389   3.55     1    13       57
##   MCWEIGHT MCFVC MCFEV1 YCSEX YCAGE YCHEIGHT YCWEIGHT YCFVC YCFEV1
## 1       NA    NA     NA    NA    NA       NA       NA    NA     NA
## 2       NA    NA     NA    NA    NA       NA       NA    NA     NA
## 3       NA    NA     NA    NA    NA       NA       NA    NA     NA
## 4       56   159   1.30    NA    NA       NA       NA    NA     NA
## 5       85   268   2.34     2    10       50       53   154   1.43
## 6       87   276   2.37     2    10       55       72   195   1.69

To analyze the effect of gender on FEV1, the data need to be in long format, with a single variable for FEV1 and a separate variable for gender. The following code chunk demonstrates one method of stacking the height, gender, age and FEV1 measurements for fathers and mothers, so that each row holds one person rather than one family.

# Stack the father (F*) columns on top of the mother (M*) columns
fev2 <- data.frame(gender = c(fev$FSEX, fev$MSEX), 
                   fev = c(fev$FFEV1, fev$MFEV1), 
                   ht = c(fev$FHEIGHT, fev$MHEIGHT), 
                   age = c(fev$FAGE, fev$MAGE))
# Sex is coded 1 for fathers and 2 for mothers, so the labels map 1 = "M", 2 = "F"
fev2$gender <- factor(fev2$gender, labels=c("M", "F"))
head(fev2)
##   gender  fev ht age
## 1      M 3.23 61  53
## 2      M 3.95 72  40
## 3      M 3.47 69  26
## 4      M 3.74 68  34
## 5      M 2.90 61  46
## 6      M 4.91 72  44

Nearly all analysis procedures and most graphing procedures require the data to be in long format. Several R packages can help with this reshaping, including reshape2 and tidyr.
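
As a rough illustration of the tidyr route, the sketch below reshapes a small toy data frame built from the first three families shown above. The object names `wide` and `long`, the new `parent` column, and the choice to keep only the FEV1, height and age columns are assumptions made for this example, not part of the original analysis. It relies on `pivot_longer()` with the special `".value"` name, which requires tidyr 1.0.0 or later.

```r
library(tidyr)

# Toy wide data: one row per family, values copied from the first
# three rows of the output above (hypothetical subset of columns).
wide <- data.frame(
  ID      = 1:3,
  FFEV1   = c(3.23, 3.95, 3.47),
  FHEIGHT = c(61, 72, 69),
  FAGE    = c(53, 40, 26),
  MFEV1   = c(3.31, 3.47, 2.65),
  MHEIGHT = c(62, 66, 59),
  MAGE    = c(43, 38, 27)
)

# Split each column name into a one-letter prefix (F or M) and the
# measurement name. The prefix goes into a new `parent` column, and
# ".value" tells tidyr to keep the measurement name as a column name.
long <- pivot_longer(
  wide,
  cols          = -ID,
  names_to      = c("parent", ".value"),
  names_pattern = "^([FM])(.*)$"
)

long
```

The result has one row per parent (two rows per family) with columns `ID`, `parent`, `FEV1`, `HEIGHT` and `AGE`. Unlike the manual `data.frame()` approach, this keeps the family `ID`, so each record can still be traced back to its family.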