Data Visualization in R
ggplot2 is a plotting system for R based on the Grammar of Graphics. It is built for making professional-looking plots quickly with minimal code. It takes care of many of the complicated details that make plotting difficult (like drawing legends) and provides a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
How to install the ggplot2 package:
ggplot2 can be installed from CRAN by typing:
install.packages("ggplot2")
Make sure that you are using the latest version of R to get the most recent version of ggplot2.
Application of ggplot2:
The grammar implemented in ggplot2 provides an infrastructure for composing a graphic from multiple elements. The main components are:
- Aesthetics, which refer to visual attributes that affect how data are displayed in a graphic, e.g., color, point size, or line type.
- Geometric objects for visual representation of observations such as points, lines, polygons, box plots, error bars, etc.
- Faceting, which applies the same type of graph to each defined subset of the data, usually indicated by the unique values of a categorical variable or factor.
- Annotation, which allows you to add text and/or external graphics to a ggplot.
- Positional adjustments, to reduce overplotting of points.
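The components listed above can be combined in a single plot. The following is a minimal sketch, assuming the ggplot2 package is installed; the object names cars and p_components, the jitter width, and the annotation text and position are illustrative choices, not part of the original examples.

```r
library(ggplot2)

# Work on a copy so the built-in mtcars data set is left unchanged.
cars <- mtcars
cars$cyl <- factor(cars$cyl)

p_components <- ggplot(cars, aes(x = wt, y = mpg, colour = cyl)) +   # aesthetics
  geom_point(position = position_jitter(width = 0.05, height = 0),   # geom + positional adjustment
             size = 2) +
  facet_wrap(~ am) +                                                  # faceting (one panel per value of am)
  annotate("text", x = 4.5, y = 30, label = "mtcars")                 # annotation

print(p_components)
```

Each element is added with +, so any one of them can be removed or swapped without touching the others.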
Examples of qplot:
The qplot() function can be used to create the most common graph types. It can create a very wide range of useful plots.
The format is:
qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=, facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)
Some of the examples are:
- qplot examples
- create factors with value labels
library(ggplot2)
mtcars$gear <- factor(mtcars$gear, levels=c(3,4,5), labels=c("3gears","4gears","5gears"))
mtcars$am <- factor(mtcars$am, levels=c(0,1), labels=c("Automatic","Manual"))
mtcars$cyl <- factor(mtcars$cyl, levels=c(4,6,8), labels=c("4cyl","6cyl","8cyl"))
- Kernel density plots for mpg
Grouped by number of gears (indicated by color)
qplot(mpg, data=mtcars, geom="density", fill=gear, alpha=I(.5), main="Distribution of Gas Mileage", xlab="Miles Per Gallon", ylab="Density")
- Scatterplot of mpg vs. hp for each combination of gears and cylinders
In each facet, transmission type is represented by shape and color
qplot(hp, mpg, data=mtcars, shape=am, color=am, facets=gear~cyl, size=I(3), xlab="Horsepower", ylab="Miles per Gallon")
- Separate regressions of mpg on weight for each number of cylinders
qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"), method="lm", formula=y~x, color=cyl, main="Regression of MPG on Weight", xlab="Weight", ylab="Miles per Gallon")
- Boxplots of mpg by number of gears
Observations (points) are overlaid and jittered
qplot(gear, mpg, data=mtcars, geom=c("boxplot", "jitter"), fill=gear, main="Mileage by Gear Number", xlab="", ylab="Miles per Gallon")
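In recent ggplot2 releases qplot() is deprecated, and the method= and formula= arguments used in the regression example may be ignored with a warning rather than passed on to the smoother. The sketch below shows one way to write the same regression plot with ggplot() and geom_smooth(). The names cars and p_reg are illustrative, and the factor labels are repeated here so that the block runs on its own.

```r
library(ggplot2)

# Work on a copy of the built-in data and label the cylinder counts.
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl, levels = c(4, 6, 8),
                   labels = c("4cyl", "6cyl", "8cyl"))

# One point layer plus one linear-model smoother per cylinder group.
p_reg <- ggplot(cars, aes(x = wt, y = mpg, colour = cyl)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(title = "Regression of MPG on Weight",
       x = "Weight", y = "Miles per Gallon")

print(p_reg)
```

Because colour is mapped to cyl in the top-level aes(), geom_smooth() fits a separate line for each cylinder group.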
Adding Aesthetics (Shape, Color and Size) and Faceting in the qplot function:
The aes() function creates a list of unevaluated expressions that describe how variables map to visual properties. It also performs partial name matching, converts color to colour, and converts old-style R names to ggplot2 names (e.g. pch to shape, cex to size). The first difference between qplot and plot appears when you want to assign different colors, sizes, or shapes to the points on your plot. With plot, you must convert a categorical variable in your data into something that plot knows how to use, such as a vector of color names. qplot does this automatically and provides a legend that maps the displayed attributes to the data values.
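As a small illustration of the name conversion and the automatic legend, the sketch below maps a categorical variable to both colour and shape. The names aes_names and p_mapped are illustrative only.

```r
library(ggplot2)

# aes() stores "color" under the name "colour" and the base R name "pch" as "shape".
aes_names <- names(aes(color = cyl, pch = gear))
print(aes_names)

# factor(cyl) is mapped inside aes(), so ggplot2 picks the colours and shapes
# and draws the matching legend without any manual conversion.
p_mapped <- ggplot(datasets::mtcars,
                   aes(x = wt, y = mpg,
                       colour = factor(cyl), shape = factor(cyl))) +
  geom_point(size = 3)

print(p_mapped)
```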
- Bar chart example
c <- ggplot(mtcars, aes(factor(cyl)))
- Default plotting
c + geom_bar()
- To change the interior coloring use fill aesthetic
c + geom_bar(fill = "red")
- Compare with the color aesthetic which changes just the bar outline
c + geom_bar(colour = "red")
- Combining both, you can see the changes more clearly
c + geom_bar(fill = "white", colour = "red")
Size should be specified with a numerical value (in millimetres), or mapped from a variable.
p <- ggplot(mtcars, aes(wt, mpg))
p + geom_point(size = 4)
p + geom_point(aes(size = qsec))
p + geom_point(size = 2.5) + geom_hline(yintercept = 25, size = 3.5)
Examples of shape in data visualizations:
Shape takes four types of values: an integer in [0, 25]; a single character, which is used as the plotting symbol; a . to draw the smallest rectangle that is visible (i.e., about one pixel); and NA to draw nothing.
p + geom_point()
p + geom_point(shape = 5)
p + geom_point(shape = "k", size = 3)
p + geom_point(shape = ".")
p + geom_point(shape = NA)
- Shape can also be mapped from a variable
p + geom_point(aes(shape = factor(cyl)))
In some circumstances we want to plot relationships between a set of variables in multiple subsets of the data, with the results appearing as panels in a larger figure. This is known as a facet plot, and it is a very useful feature of ggplot2. The faceting is defined by a categorical variable or variables.
With facet_grid, the data can be split up by one or two variables that vary along the horizontal and/or vertical direction.
facet_grid(facets, margins = FALSE, scales = "fixed", space = "fixed", shrink = TRUE, labeller = "label_value", as.table = TRUE, drop = TRUE)
p <- ggplot(mtcars, aes(mpg, wt)) + geom_point()
# With one variable
p + facet_grid(. ~ cyl)
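The same formula interface handles the two-variable case mentioned above. A minimal sketch, in which the names p_facets and p_two are illustrative:

```r
library(ggplot2)

p_facets <- ggplot(datasets::mtcars, aes(mpg, wt)) + geom_point()

# With two variables: gear defines the rows, cyl defines the columns.
p_two <- p_facets + facet_grid(gear ~ cyl)

print(p_two)
```

The variable on the left of ~ varies vertically and the one on the right varies horizontally; a . on either side means no split in that direction.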
With facet_wrap, a single categorical variable defines the subsets of the data. The panels are laid out in a one-dimensional ribbon that can be wrapped into multiple rows.
facet_wrap(facets, nrow = NULL, ncol = NULL, scales = "fixed", shrink = TRUE, as.table = TRUE, drop = TRUE)
d <- ggplot(diamonds, aes(carat, price, fill = ..density..)) +
  xlim(0, 2) +
  stat_binhex(na.rm = TRUE) +   # requires the hexbin package
  theme(aspect.ratio = 1)
d + facet_wrap(~ color)
Geometric objects (geoms) are the visual representations of (subsets of) observations. ggplot2 offers many geoms, including geom_point, geom_jitter, geom_text, and geom_segment. We will explain how to use a geom by taking geom_point as an example.
geom_point is a geom that draws a point at the given x and y coordinates.
This example shows a scatterplot. It represents a rather common configuration with some extra aesthetic parameters, such as size, shape, and color. The plot uses two aesthetic properties to represent the same aspect of the data: the gender column is mapped to a shape and to a color. The plot maps the continuous speed column onto the size aesthetic. To ensure that even observations with a "low" speed are still mapped to rather large points, the plot uses scale_size_continuous to define the range of point sizes to use.
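The configuration described above can be sketched as follows. The data frame runners and its values are hypothetical and generated only for illustration; the column names gender and speed follow the description, while the x and y columns and the size range of 3 to 8 are assumptions.

```r
library(ggplot2)

# Hypothetical data: values are randomly generated for illustration only.
set.seed(1)
runners <- data.frame(
  x      = runif(40),
  y      = runif(40),
  gender = factor(sample(c("female", "male"), 40, replace = TRUE)),
  speed  = runif(40, min = 5, max = 20)
)

p_points <- ggplot(runners, aes(x = x, y = y)) +
  # gender drives both shape and colour; speed drives point size
  geom_point(aes(shape = gender, colour = gender, size = speed)) +
  # range sets the smallest and largest point size, so low speeds stay visible
  scale_size_continuous(range = c(3, 8))

print(p_points)
```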
The lattice add-on package is an implementation of Trellis graphics for R. It is a powerful and elegant high-level data visualization system with an emphasis on multivariate data. It is designed to meet most typical graphics needs with minimal tuning, but can also be easily extended to handle most nonstandard requirements.
How to install Lattice package:
The lattice package is installed along with R, so no separate installation is needed. It can be loaded by typing
> library(package = "lattice")
The most recent version of lattice is available from CRAN. The latest development snapshot is available from R-Forge.
Application of Lattice package:
The following is the list of high level functions in the lattice package which are used in data visualization:
- barchart: Bar plots.
- bwplot: Box-and-whisker plots.
- densityplot: Kernel density estimates.
- dotplot: Cleveland dot plots.
- histogram: Histograms.
- qqmath: Theoretical quantile plots.
- stripplot: One-dimensional scatterplots
- qq: Quantile plots for comparing two distributions.
- xyplot: Scatterplots and time-series plots (and potentially a lot more)
- levelplot: Level plots (similar to image plots).
- contourplot: Contour plots.
- cloud: Three-dimensional scatter plots.
- wireframe: Three-dimensional surface plots (similar to persp plots).
- splom: Scatterplot matrices.
- parallelplot: Parallel coordinate plots (called parallel in older versions of lattice).
- rfs: Residual and fitted value plots (also see oneway).
- tmd: Tukey Mean-Difference plots.
Lattice also provides a collection of convenience functions that correspond to the primitives lines, points, etc. These are implemented using Grid graphics. They have names like llines or panel.lines and are often useful when writing nontrivial panel functions.
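As a rough illustration of such a panel function, the sketch below draws the points in each panel and then overlays a least-squares line and a dashed line at the panel mean. The name p_panel, the choice of variables, and the line styles are illustrative.

```r
library(lattice)

p_panel <- xyplot(mpg ~ wt | factor(cyl), data = datasets::mtcars,
                  layout = c(3, 1),
                  xlab = "Car Weight", ylab = "Miles per Gallon",
                  panel = function(x, y, ...) {
                    panel.xyplot(x, y, ...)           # default scatterplot
                    panel.lmline(x, y, col = "red")   # least-squares fit per panel
                    # panel.lines() is the lattice counterpart of lines()
                    panel.lines(range(x), rep(mean(y), 2), lty = 2)
                  })

print(p_panel)
```

The panel function is called once per panel with only that panel's x and y values, which is why the fitted line differs between panels.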
Examples of Lattice package:
Here are some examples of the lattice package, which use car data such as mileage, number of cylinders, and gears from the mtcars data frame.
- Lattice Examples
- Create factors with value labels
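library(lattice)
attach(datasets::mtcars)   # use the unmodified built-in data so gear and cyl are numeric
gear.f <- factor(gear, levels=c(3,4,5), labels=c("3gears","4gears","5gears"))
cyl.f <- factor(cyl, levels=c(4,6,8), labels=c("4cyl","6cyl","8cyl"))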
- Kernel density plot
densityplot(~mpg, main="Density Plot", xlab="Miles per Gallon")
- Kernel density plots by factor level
densityplot(~mpg|cyl.f, main="Density Plot by Number of Cylinders", xlab="Miles per Gallon")
- Kernel density plots by factor level (alternate layout)
densityplot(~mpg|cyl.f, main="Density Plot by Number of Cylinders", xlab="Miles per Gallon", layout=c(1,3))
- Boxplots for each combination of two factors
bwplot(cyl.f~mpg|gear.f, ylab="Cylinders", xlab="Miles per Gallon", main="Mileage by Cylinders and Gears", layout=c(1,3))
- Scatterplots for each combination of two factors
xyplot(mpg~wt|cyl.f*gear.f, main="Scatterplots by Cylinders and Gears", ylab="Miles per Gallon", xlab="Car Weight")
- 3-D scatterplot by factor level
cloud(mpg~wt*qsec|cyl.f, main="3D Scatterplot by Cylinders")
- Dotplot for each combination of two factors
dotplot(cyl.f~mpg|gear.f, main="Dotplot by Number of Gears and Cylinders", xlab="Miles Per Gallon")
- Scatterplot matrix
splom(mtcars[c(1,3,4,5,6)], main="MTCARS Data")
ggvis is a data visualization package for R that lets us describe data graphics with a syntax similar to ggplot2. It lets us view and interact with the graphics on our local computer.
How to install ggvis:
ggvis can be installed directly from GitHub. Make sure you have the latest version of devtools (at least 1.4) and run the following:
devtools::install_github(c("hadley/testthat", "rstudio/shiny", "rstudio/ggvis"))
Application of Ggvis:
Interactive plots can be generated using the ggvis package:
- It is still being developed.
- It can be used in Shiny application.
- It is similar to ggplot2 but is designed for dynamic web graphics.
- It uses the pipe operator %>% to chain multiple layers.
- ggvis covers a limited set of tasks, but the ones it does cover require little effort.
Examples of Ggvis:
library(ggvis)
# Density plot with interactive controls for line and fill color
mtcars %>% ggvis(x = ~wt) %>%
  layer_densities(
    stroke := input_radiobuttons(c("Purple", "Orange", "steelblue"), label = "Line color"),
    fill := input_select(c("Purple", "Orange", "steelblue"), label = "Fill color")
  )
# Density plot with interactive controls for bandwidth and kernel
mtcars %>% ggvis(x = ~wt) %>%
  layer_densities(
    adjust = input_slider(.1, 2, value = 1, step = .1, label = "Bandwidth adjustment"),
    kernel = input_select(
      c("Gaussian" = "gaussian", "Epanechnikov" = "epanechnikov", "Rectangular" = "rectangular",
        "Triangular" = "triangular", "Biweight" = "biweight", "Cosine" = "cosine", "Optcosine" = "optcosine"),
      label = "Kernel")
  )