Both Nominal and Ordinal data types can be visualized using the same methods: tables, barcharts and pie charts.
Tables are the most common way to get summary statistics of a categorical variable. The
table() function produces a frequency table, where each entry represents the number of records in the data set holding the corresponding labeled value.
There are 27 Fair quality diamonds, 83 Good quality diamonds and 387 Ideal quality diamonds in this sample.
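A minimal sketch of the frequency table. I'm assuming here that dsmall is a random sample of 1,000 rows from the diamonds data in ggplot2; the seed and sample size are my own choices, so the counts will not match the ones quoted above.

```r
library(ggplot2)  # provides the diamonds data set

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

cut.counts <- table(dsmall$cut)  # one count per level of cut
cut.counts
```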
A Barchart or barplot takes these frequencies, and draws bars along the X-axis where the height of the bars is determined by the frequencies seen in the table.
To create a barplot/barchart in base graphics, the data must first be summarized in table form. Then the result of the table is plotted. The first argument is the table to be plotted, and the
main argument controls the title.
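Something like the following should work as a base graphics barplot. The definition of dsmall (a 1,000-row random sample of diamonds) and the title text are assumptions for illustration.

```r
library(ggplot2)  # only needed here for the diamonds data

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

# First argument: the summarized table. main: the plot title.
barplot(table(dsmall$cut), main = "Barplot of diamond cut")
```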
The geometry needed to draw a barchart in ggplot is geom_bar.
The biggest addition to a barchart is the numbers on top of the bars. This isn’t mandatory, but it does make the plot nicer to read.
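Here is one way to sketch a ggplot barchart with the counts printed on top of the bars. The dsmall definition, the title, and the vjust offset are assumptions; after_stat() needs ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

p <- ggplot(dsmall, aes(x = cut)) +
  geom_bar() +                                   # bar height = number of rows per cut
  geom_text(aes(label = after_stat(count)),      # print the computed count ...
            stat = "count", vjust = -0.5) +      # ... just above each bar
  ggtitle("Frequency of diamonds by cut type")
print(p)
```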
Often you don’t want to compare counts but percents. To accomplish this, we have to aggregate the data to calculate the proportions first, then plot the aggregated data using
geom_col to create the columns.
cut.props <- data.frame(prop.table(table(dsmall$cut)))
cut.props # what does this data look like?
## Var1 Freq
## 1 Fair 0.034
## 2 Good 0.099
## 3 Very Good 0.220
## 4 Premium 0.257
## 5 Ideal 0.390
ggplot(cut.props, aes(x=Var1, y=Freq)) + geom_col() +
ylab("Proportion") + xlab("Cut type") +
ggtitle("Proportion of diamonds by cut type")
Another way to visualize categorical data that takes up less ink than bars is a Cleveland dot plot. Here again we are plotting summary data instead of the raw data. This uses
geom_segment to draw the lines from x=0 to the value of the proportion (named
Freq because of the way data.frame() names the column when it converts a table).
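A rough sketch of a Cleveland dot plot built from the aggregated proportions. The dsmall definition, the grey line color, and the point size are my assumptions, not settings from the original plot.

```r
library(ggplot2)

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

# Aggregate first: data.frame() names the columns Var1 (level) and Freq (proportion).
cut.props <- data.frame(prop.table(table(dsmall$cut)))

p <- ggplot(cut.props, aes(x = Freq, y = Var1)) +
  geom_segment(aes(x = 0, xend = Freq, yend = Var1), color = "grey50") +  # line from 0 to the proportion
  geom_point(size = 3) +                                                   # dot at the proportion
  xlab("Proportion") + ylab("Cut type")
print(p)
```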
pie() takes a table object as its argument.
Pie charts are my least favorite plotting type. Human eyeballs can’t distinguish between angles as well as they can between heights. A mandatory piece needed to make the wedges readable is to add the percentages of each wedge.
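If you do make a pie chart, a sketch like this adds the percentage to each wedge label. The dsmall definition, the label format, and the title are assumptions for illustration.

```r
library(ggplot2)  # only needed here for the diamonds data

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

cut.counts <- table(dsmall$cut)
pct <- round(100 * as.numeric(prop.table(cut.counts)), 1)   # percent per wedge
wedge.labels <- paste0(names(cut.counts), " (", pct, "%)")  # e.g. "Ideal (xx.x%)"

pie(cut.counts, labels = wedge.labels, main = "Diamonds by cut type")
```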
And here I thought pie charts couldn’t get worse… I’m not a fan at all of the ggplot version, so I’m not even going to show it. Here’s a link to another great tutorial that does show you how to make one.
However – Never say never. Here’s an example of a good use of pie charts. http://www.storytellingwithdata.com/blog/2019/8/8/forty-five-pie-charts-never-say-never
Waffle charts are not natively found in the
ggplot2 package, but in their own
waffle package. These are great for infographics.
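A possible sketch using the waffle package, assuming it is installed. The dsmall definition, the scaling of one square per 10 diamonds, and the number of rows are all my own choices.

```r
library(ggplot2)  # for the diamonds data
library(waffle)   # assumption: installed separately, e.g. install.packages("waffle")

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

cut.counts <- table(dsmall$cut)
# waffle() draws one square per unit, so scale the counts down to keep the grid small.
parts <- setNames(as.numeric(round(cut.counts / 10)), names(cut.counts))

waffle(parts, rows = 5, title = "Diamonds by cut type (1 square = 10 diamonds)")
```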
Here we can look at the price, carat, and depth of the diamonds.
The base function
plot() creates a dotplot for a continuous variable. The value of the variable is plotted on the y axis, and the index, or row number, is plotted on the x axis. This gives you a nice, quick way to see the values of the data.
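For example, a quick dotplot of price might look like this. The dsmall definition and the axis text are assumptions.

```r
library(ggplot2)  # only needed here for the diamonds data

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

# A single numeric vector is plotted against its index (row number).
plot(dsmall$price, ylab = "Price", main = "Price of each sampled diamond")
```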
Often you are not interested in the individual values of each data point, but the distribution of the data. In other words, where is the majority of the data? Does it look symmetric around some central point? Around what values do the bulk of the data lie?
Rather than showing the value of each observation, we prefer to think of the value as belonging to a bin. The height of the bars in a histogram displays the frequency of values that fall into each of those bins. For example, if we cut a continuous variable such as price into 7 bins of equal width, the frequency table would look like this:
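A sketch of how such a binned frequency table could be produced. I'm assuming the variable being binned is price and that dsmall is a 1,000-row random sample of diamonds, so the cut points and counts are illustrative only.

```r
library(ggplot2)  # only needed here for the diamonds data

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

# base::cut() with a single number splits the range into that many equal-width bins.
price.bins <- table(cut(dsmall$price, 7))
price.bins
```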
In a histogram, the binned counts are plotted as bars. Note that the x-axis is continuous, so the bars touch. This is unlike the barchart that has a categorical x-axis, and vertical bars that are separated.
You can make a histogram in base graphics super easily.
And it doesn’t take too much to clean it up. Here you can specify the number of bins by specifying how many
breaks should be made in the data (the number of breaks controls the number of bins, and bin width) and use
col for the fill color.
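Both versions might look something like this. The dsmall definition, the choice of price, the number of breaks, and the color are assumptions. Note that hist() treats a single number for breaks as a suggestion and may pick a nearby "pretty" number of bins.

```r
library(ggplot2)  # only needed here for the diamonds data

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

hist(dsmall$price)  # the quick default version

# Cleaned up: suggest the number of breaks, set the fill color and labels.
h <- hist(dsmall$price, breaks = 25, col = "lightblue",
          xlab = "Price", main = "Distribution of diamond price")
```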
The binwidth here is set by looking at the cut points above that were used to create 7 bins. Notice that darkgrey is the default fill color, but it makes it hard to differentiate between the bars. So we’ll make the outline black using colour, and
fill the bars with white.
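A sketch of the ggplot version. Since I don't have the exact cut points, I'm assuming the binwidth is roughly the range of price divided by 7; the dsmall definition is also an assumption.

```r
library(ggplot2)

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

# Assumption: approximate the width of 7 equal bins from the range of the data.
bw <- diff(range(dsmall$price)) / 7

p <- ggplot(dsmall, aes(x = price)) +
  geom_histogram(binwidth = bw, colour = "black", fill = "white")  # black outline, white bars
print(p)
```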
Note I did not specify the
binwidth argument here. The size of the bins can hide features from your graph; the default for ggplot2 is 30 bins (a binwidth of about range/30), which is usually a good choice.
To get a better idea of the true shape of the distribution we can “smooth” out the bins and create what’s called a
density plot or curve. Notice that the shape of this distribution curve is much “wigglier” than the histogram may have implied.
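In base graphics, a density curve can be sketched by plotting the result of density(). The dsmall definition and the choice of price are assumptions. The default title is built from the function call.

```r
library(ggplot2)  # only needed here for the diamonds data

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

d <- density(dsmall$price)  # kernel density estimate with the default bandwidth
plot(d)                     # no main= given, so the default title is used
```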
Awesome title huh? (NOT)
Often it is more helpful to have the density (or kernel density) plot on top of a histogram plot.
Since the height of the bars in a histogram defaults to showing the frequency of records in the data set within that bin, we need to 1) scale the height so that it’s a relative frequency, and then 2) use the
lines() function to add a
density() line on top.
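Those two steps might be sketched like this in base graphics. The dsmall definition, the choice of price, and the line color are assumptions.

```r
library(ggplot2)  # only needed here for the diamonds data

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

# 1) freq = FALSE puts the bar heights on the density scale (total bar area is 1).
h <- hist(dsmall$price, freq = FALSE, xlab = "Price",
          main = "Distribution of diamond price")
# 2) Add the kernel density estimate on top of the existing plot.
lines(density(dsmall$price), col = "blue")
```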
The syntax starts the same, we’ll add a new geom,
geom_density and color the line blue. Then we add the histogram geom using
geom_histogram but must specify that the y axis should be on the density, not frequency, scale. Note that this has to go inside the aesthetic statement
aes(). I’m also going to get rid of the fill by using
NA so it doesn’t plot over the density line.
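A sketch of the ggplot version. The dsmall definition, the choice of price, and the black bar outline are assumptions; after_stat(density) needs ggplot2 3.3.0 or later (older code used ..density.. for the same thing).

```r
library(ggplot2)

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

p <- ggplot(dsmall, aes(x = price)) +
  geom_density(colour = "blue") +                  # smoothed density curve
  geom_histogram(aes(y = after_stat(density)),     # bars on the density scale, set inside aes()
                 colour = "black", fill = NA)      # no fill, so the curve stays visible
print(p)
```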
Another very common way to visualize the distribution of a continuous variable is using a boxplot. Boxplots are useful for quickly identifying where the bulk of your data lie. R specifically draws a “modified” boxplot where values that are considered outliers are plotted as dots.
Notice that the only axis labeled is the y-axis. Like a dotplot, the x-axis, or “width”, of the boxplot is meaningless here. We can make the axis more readable by flipping the plot on its side.
Horizontal is a bit easier to read in my opinion.
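Both orientations can be sketched in base graphics as follows. The dsmall definition, the choice of price, and the labels are assumptions.

```r
library(ggplot2)  # only needed here for the diamonds data

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

boxplot(dsmall$price, ylab = "Price")  # default: vertical

# horizontal = TRUE flips the box on its side, so price is now on the x-axis.
boxplot(dsmall$price, horizontal = TRUE, xlab = "Price",
        main = "Distribution of diamond price")
```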
What about ggplot? ggplot doesn’t really like to do univariate boxplots. We can get around that by specifying that we want the box placed at a specific x value.
To flip it horizontal you may think to simply swap x and y? Good thinking. Of course it wouldn’t be that easy. So let’s just flip the whole darned plot on its coordinate axis.
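A minimal sketch of that workaround, assuming dsmall is a 1,000-row random sample of diamonds and that price is the variable of interest. The x value of 1 is arbitrary.

```r
library(ggplot2)

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

p <- ggplot(dsmall, aes(x = 1, y = price)) +  # place the single box at x = 1
  geom_boxplot() +
  coord_flip()                                # flip the whole plot to make it horizontal
print(p)
```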
Overlaying a boxplot and a violin plot serves a similar purpose to Histograms + Density plots.
Better appearance - different levels of transparency of the box and violin.
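One way this overlay might be drawn. The dsmall definition, the fill color, the alpha values, and the box width are my assumptions, not the settings used for the original figure.

```r
library(ggplot2)

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

p <- ggplot(dsmall, aes(x = 1, y = price)) +
  geom_violin(fill = "blue", alpha = 0.1) +  # very transparent violin in the background
  geom_boxplot(width = 0.2, alpha = 0.5)     # narrower, less transparent box on top
print(p)
```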
The last useful plot that we will do on a single continuous variable is to assess the normality of the distribution. Basically, how closely the data follow a normal distribution.
The line I make red because it is a reference line. The closer the points are to following this line, the more “normal” the shape of the distribution is. Price has some pretty strong deviation away from that line. Below I have plotted what a normal distribution looks like as an example of a “perfect” fit.
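In base graphics this could be sketched with qqnorm() and qqline(). The dsmall definition is an assumption, and the simulated normal data z is a hypothetical stand-in for the "perfect" fit example.

```r
library(ggplot2)  # only needed here for the diamonds data

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

q <- qqnorm(dsmall$price)          # sample quantiles vs. theoretical normal quantiles
qqline(dsmall$price, col = "red")  # red reference line

# Hypothetical comparison: data that really are normal should hug the line.
z <- rnorm(1000)
qqnorm(z)
qqline(z, col = "red")
```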
qq (or qnorm) plots specifically plot the data against a theoretical distribution. That means in the
aes() aesthetic argument we don’t specify either x or y, but instead the
sample= is the variable we want to plot.
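A sketch of the ggplot version, assuming dsmall is a 1,000-row random sample of diamonds. stat_qq_line() needs ggplot2 3.0.0 or later.

```r
library(ggplot2)

# Assumption: dsmall is a 1,000-row random sample of diamonds; the seed is arbitrary.
set.seed(1)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

p <- ggplot(dsmall, aes(sample = price)) +  # sample=, not x= or y=
  stat_qq() +                               # the points
  stat_qq_line(colour = "red")              # the reference line
print(p)
```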
Additional references on making qqplots in ggplot: http://www.sthda.com/english/wiki/ggplot2-qq-plot-quantile-quantile-graph-quick-start-guide-r-software-and-data-visualization