15.2 Graphical Methods
When there are only a moderate number of variables under consideration, a profile diagram or a parallel coordinates plot can be informative.
This diagram places the variables along the x-axis and plots each observation's standardized values as one connected line.
To do this in ggplot, we need to transform the data to long format. Here I use the pivot_longer function in tidyr. To match Table 16.2 and Figure 16.4 in the text, the signs of ROR5 and PAYOUTR1 are reversed to avoid lines crossing.
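library(dplyr)    # provides %>% and filter()
library(ggplot2)

# Standardize the numeric measures in columns 4-10; keep the first three columns as-is
stan.chem <- cbind(chem[,1:3], scale(chem[,4:10]))
# Reverse the signs of ROR5 and PAYOUTR1
stan.chem$ROR5 <- -stan.chem$ROR5
stan.chem$PAYOUTR1 <- -stan.chem$PAYOUTR1
# Reshape to long format: one row per company-measure pair
stan.chem.long <- tidyr::pivot_longer(data=stan.chem, cols=ROR5:PAYOUTR1,
                                      names_to = "measure", values_to = "value")
# Fix the order of the measures along the x-axis
stan.chem.long$measure <- factor(stan.chem.long$measure,
    levels=c("ROR5", "DE", "SALESGR5", "EPS5", "NPM1", "PE", "PAYOUTR1"))
# Profile plot for observations 15-21, one line per company
stan.chem.long %>% filter(OBSNO %in% c(15:21)) %>%
  ggplot(aes(x=measure, y=value, group=SYMBOL, col=SYMBOL)) + geom_line(size=1.5) +
  theme_bw(base_size = 14) + ylim(-3.1, 4)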
Companies that follow similar patterns:
hca, nme, hum
lks, win
ahs is similar to lks & win for ROR5 through SALESGR5, but diverges after.
ami is more similar to hca than the others, but not on all measures.
An obvious limitation of a profile plot is that patterns become difficult to pick out as the number of observations increases. The plot below shows all 25 observations in this data set.
ggplot(stan.chem.long, aes(x=measure, y=value, group=SYMBOL)) +
geom_line() + theme_bw() + ylim(-3, 4)
We can do better than looking at profile plots and trying to see which observations behave “similarly”. Let’s leverage some math concepts and introduce measures of “distance” between two observations.
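To make the idea of "distance" concrete, here is a minimal sketch using Euclidean distance, which is one common choice and not necessarily the only measure used later. The two profiles `a` and `b` are made-up standardized values for two hypothetical companies, not rows from the data set.

```r
# Made-up standardized profiles for two hypothetical companies
a <- c(ROR5 = 0.5, DE = -1.0, SALESGR5 = 0.2)
b <- c(ROR5 = -0.3, DE = 0.4, SALESGR5 = 1.1)

# Euclidean distance by hand: square root of the summed squared differences
d.manual <- sqrt(sum((a - b)^2))

# Same quantity from base R's dist(), which measures distances between matrix rows
d.builtin <- as.numeric(dist(rbind(a, b), method = "euclidean"))

# The two calculations should agree
all.equal(d.manual, d.builtin)
```

The smaller this number, the more alike the two profiles are across all measures at once, which is what we were trying to judge by eye in the profile plot.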
For cluster analysis, we need to use the wide data set, where each variable is its own column, and keep only the numeric variables containing the measurements, not the row information such as the symbol. Here I set the row names of this numeric matrix to the symbols. This lets the clusters be ‘named’ in plots instead of labeled by row number.
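As a rough sketch of this step, the example below builds such a matrix from a small made-up data frame. The object names `toy` and `toy.mat` and all values are hypothetical stand-ins, not the chapter's data.

```r
# Hypothetical stand-in for the standardized wide data (values are made up)
toy <- data.frame(
  SYMBOL = c("aaa", "bbb", "ccc"),
  ROR5   = c(0.5, -0.3, 1.2),
  DE     = c(-1.0, 0.4, 0.1),
  stringsAsFactors = FALSE
)

# Keep only the numeric measurement columns
num.cols <- sapply(toy, is.numeric)
toy.mat <- as.matrix(toy[, num.cols])

# Label each row by its symbol so plots show names rather than row numbers
rownames(toy.mat) <- toy$SYMBOL

toy.mat
```

With the chapter's data, the analogous step would presumably select the standardized measurement columns of stan.chem and use stan.chem$SYMBOL for the row names.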