## 15.2 Graphical Methods

When there are only a moderate number of variables under consideration, a profile diagram or a parallel coordinates plot can be informative.

This diagram places the variables along the x-axis and plots each observation's standardized values on the y-axis, drawing one line per observation.

To do this in ggplot, we need to transform the data to long format. Here I use the pivot_longer function in tidyr. To match Table 16.2 and Figure 16.4 in the text, the signs of ROR5 and PAYOUTR1 are reversed to avoid lines crossing.

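library(dplyr)    # provides %>%, filter() and select()
library(ggplot2)
# Standardize the measures in columns 4-10; keep the first three columns as-is
stan.chem <- cbind(chem[,1:3], scale(chem[,4:10]))
# Reverse the signs of ROR5 and PAYOUTR1
stan.chem$ROR5 <- -stan.chem$ROR5
stan.chem$PAYOUTR1 <- -stan.chem$PAYOUTR1
# Reshape to long format: one row per observation-measure pair
stan.chem.long <- tidyr::pivot_longer(data=stan.chem, cols=ROR5:PAYOUTR1,
                    names_to = "measure", values_to = "value")
# Fix the order in which the measures appear along the x-axis
stan.chem.long$measure <- factor(stan.chem.long$measure,
                    levels=c("ROR5", "DE", "SALESGR5", "EPS5", "NPM1", "PE", "PAYOUTR1"))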
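To see what the scale() call above does, here is a minimal sketch on hypothetical toy data (the data frame `toy` and its columns `X1` and `X2` are made up for illustration and are not part of the chem data). After standardizing, each measure has mean 0 and standard deviation 1, so measures recorded in very different units can share one y-axis.

```r
# Hypothetical toy data: 4 firms, 2 measures on very different scales
toy <- data.frame(
  SYMBOL = c("a", "b", "c", "d"),
  X1 = c(10, 20, 30, 40),
  X2 = c(0.1, 0.4, 0.2, 0.3)
)

# scale() centers each column at its mean and divides by its standard deviation
toy.std <- cbind(toy["SYMBOL"], scale(toy[, c("X1", "X2")]))

# Check: column means are (numerically) 0 and standard deviations are 1
round(colMeans(toy.std[, -1]), 10)
apply(toy.std[, -1], 2, sd)
```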

stan.chem.long %>% filter(OBSNO %in% c(15:21)) %>%
  ggplot(aes(x=measure, y=value, group=SYMBOL, col=SYMBOL)) +
  geom_line(linewidth=1.5) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  theme_bw(base_size = 14) + ylim(-3.1, 4)

• hca, nme, and hum have similar profiles.
• lks and win have similar profiles.
• ahs is similar to lks & win for ROR5 through SALESGR5, but diverges after.
• ami is more similar to hca than the others, but not on all measures.

An obvious limitation of a profile plot is that patterns become difficult to pick out as the number of observations increases. The plot below shows all 25 observations in this data set.

ggplot(stan.chem.long, aes(x=measure, y=value, group=SYMBOL)) +
geom_line() + theme_bw() + ylim(-3, 4) 

When the x-axis is time, this type of plot is also known as a “spaghetti” plot. See Figure 4.18b in PMA6 for an example.

We can do better than looking at profile plots and trying to see which observations behave “similarly”. Let’s leverage some math concepts and introduce measures of “distance” between two observations.
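As a small illustration of what a “distance” between two observations can look like, the sketch below uses Euclidean distance, which is one common choice and is assumed here for illustration only. The matrix `toy.pts` and its values are hypothetical, chosen so the distances are easy to verify by hand.

```r
# Hypothetical toy matrix: 3 observations (rows) measured on 2 variables
toy.pts <- matrix(c(0, 0,
                    3, 4,
                    6, 8),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("a", "b", "c"), c("X1", "X2")))

# dist() returns all pairwise distances between rows
d <- dist(toy.pts, method = "euclidean")
d

# The same a-to-b distance computed by hand: sqrt((0-3)^2 + (0-4)^2) = 5
sqrt(sum((toy.pts["a", ] - toy.pts["b", ])^2))
```

Because the rows of `toy.pts` are named, the printed distance table is labelled a, b, c rather than 1, 2, 3.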

For cluster analysis, we need to use the wide data set, where each variable is its own column, and keep only the numeric variables containing the measurements, not identifying information such as the symbol. Here I set the row names of this numeric matrix to the symbols. This lets the clusters be ‘named’ in plots instead of being labeled by row number.

cluster.dta <- stan.chem %>% select(ROR5:PAYOUTR1)
rownames(cluster.dta) <- stan.chem$SYMBOL