Basic usage
To display plots in a notebook, as is always the case for Bokeh plots, we also need to import bokeh.io and execute bokeh.io.output_notebook(). We will use the automobile fuel efficiency sample data set that is included in Bokeh to demonstrate the usage of iqplot.
[1]:
import numpy as np
import pandas as pd
import iqplot
# Import the Bokeh submodules used below explicitly
import bokeh.io
import bokeh.layouts
import bokeh.palettes
import bokeh.sampledata.autompg
bokeh.io.output_notebook()
To get an understanding of the data set, we will take a look at it.
[2]:
df = bokeh.sampledata.autompg.autompg_clean
df.head()
[2]:
 | mpg | cyl | displ | hp | weight | accel | yr | origin | name | mfr |
---|---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | North America | chevrolet chevelle malibu | chevrolet |
1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | North America | buick skylark 320 | buick |
2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | North America | plymouth satellite | plymouth |
3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | North America | amc rebel sst | amc |
4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | North America | ford torino | ford |
Importantly, this data set is tidy; each row represents a single observation and each column a variable associated with an observation. iqplot assumes that any inputted data frame is in tidy format. In the fuel efficiency example, the columns have different character. For example, the 'mpg' column contains a quantitative measurement of the miles per gallon of each car. The 'origin' column is categorical in the sense that it is not quantitative, but is a descriptor of the automobile that takes on a few discrete values.
Quick start
In the most common usage, iqplot generates plots from tidy data frames where some columns may contain categorical data and the column of interest in the plot is quantitative.
There are seven types of plots that iqplot generates.
Box plots
Strip plots
Spike plots
Strip-box plots (strip and box plots overlaid)
Histograms
Strip-histogram plots (strip and histogram plots overlaid)
ECDFs
If you are unfamiliar with ECDFs, they are discussed below.
The first seven arguments are the same for all plots. They are:
data: A tidy data frame.
q: The column of the data frame to be treated as the quantitative variable.
cats: A list of columns in the data frame that are to be considered as categorical variables in the plot. If None, a single box, strip, histogram, or ECDF is plotted.
q_axis: The axis, 'x' or 'y', along which the quantitative variable varies. The default is 'x'.
palette: A list of hex colors to use for coloring the markers for each category. By default, it uses the Glasbey Category 10 color palette from colorcet.
order: If specified, the ordering of the categories to use on the categorical axis and legend (if applicable). Otherwise, the order of the inputted data frame is used.
p: If specified, the bokeh.plotting.Figure object to use for the plot. If not specified, a new figure is created.
If data is given as a Numpy array, it is the only required argument. If data is given as a Pandas DataFrame, q must also be supplied. All other arguments are optional and have reasonably set defaults.
The respective plots also have kwargs that are specific to them. Examples highlighting some, but not all, customizations are in the following sections.
Any extra kwargs not in the function call signature are passed to bokeh.plotting.figure() when the figure is instantiated.
Here are six of the seven default plots for cats = 'origin' and q = 'mpg'.
[3]:
p_box = iqplot.box(data=df, q="mpg", cats="origin", title="box")
p_strip = iqplot.strip(data=df, q="mpg", cats="origin", title="strip")
p_stripbox = iqplot.stripbox(data=df, q="mpg", cats="origin", title="strip-box")
p_striphistogram = iqplot.striphistogram(data=df, q="mpg", cats="origin", title="strip-histogram")
p_histogram = iqplot.histogram(data=df, q="mpg", cats="origin", title="histogram")
p_ecdf = iqplot.ecdf(data=df, q="mpg", cats="origin", title="ecdf")
bokeh.io.show(bokeh.layouts.gridplot([p_box, p_strip, p_stripbox, p_striphistogram, p_histogram, p_ecdf], ncols=1))
When data take on discrete values, spike plots are useful. To demonstrate a spike plot, we will round the mpg column to integer values.
[4]:
df['mpg_rounded'] = df['mpg'].round()
p_spike = iqplot.spike(data=df, q="mpg_rounded", cats="origin", title="spike")
bokeh.io.show(p_spike)
The height of each spike, in this case topped with a dot, is proportional to the number of vehicles with a given rounded miles per gallon.
Plots with a single data set
You can also generate plots from a single Numpy array without specifying categories and values. Note that when data is specified as a Numpy array, the string used for the q argument is used as the axis label.
[5]:
# MPG data for all cars as Numpy array
data = df["mpg"].values
data_rounded = np.round(data)
p_box = iqplot.box(data=data, q="mpg", title="box")
p_strip = iqplot.strip(data=data, q="mpg", title="strip")
p_spike = iqplot.spike(data=data_rounded, q="mpg", title="spike")
p_stripbox = iqplot.stripbox(data=data, q="mpg", title="strip-box")
p_striphistogram = iqplot.striphistogram(data=data, q="mpg", title="strip-histogram")
p_histogram = iqplot.histogram(data=data, q="mpg", title="histogram")
p_ecdf = iqplot.ecdf(data=data, q="mpg", title="ecdf")
bokeh.io.show(bokeh.layouts.gridplot([p_box, p_strip, p_spike, p_stripbox, p_striphistogram, p_histogram, p_ecdf], ncols=1))
Fine-tuning of plots
In the following, we investigate each of the seven kinds of plots and explore some, but not all, of the configuration options. Refer to the API reference for details about possible keyword arguments.
Strip plots
We can make a strip plot with dash markers and add some transparency. The marker keyword argument allows selection of glyphs, and the marker_kwargs keyword argument provides keyword arguments to be passed to p.dash() (or p.circle(), or whatever marker you choose), where p is the figure.
[6]:
p = iqplot.strip(
data=df,
q="mpg",
cats="origin",
marker="dash",
marker_kwargs=dict(alpha=0.3),
)
bokeh.io.show(p)
The problem with strip plots is that they can have trouble with overlapping data points. A common approach to deal with this is to “jitter,” or place the glyphs with small random displacements along the categorical axis. I do that here, using the spread="jitter" keyword argument. I also add tooltips so that hovering over a data point gives more information about it.
[7]:
p = iqplot.strip(
data=df,
q="mpg",
cats="origin",
spread="jitter",
marker_kwargs=dict(alpha=0.5),
tooltips=[("year", "@yr"), ("model", "@name")],
frame_width=500,
)
bokeh.io.show(p)
Note that in this plot, I used the frame_width kwarg to make the plot wider. Any kwargs that can be passed into bokeh.plotting.figure() can be used.
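To make the idea of jittering concrete, here is a minimal sketch using only the standard library. The helper name `jitter_positions`, the uniform distribution, and the default width are assumptions for illustration; iqplot's own jitter may be computed differently.

```python
import random


def jitter_positions(n_points, center=0.0, width=0.4, seed=None):
    """Return n_points positions scattered uniformly around `center`.

    Hypothetical helper for illustration only; not part of iqplot.
    """
    rng = random.Random(seed)
    half = width / 2
    return [center + rng.uniform(-half, half) for _ in range(n_points)]


# Five points in one category: they share a categorical position (1.0),
# and each gets its own small random displacement along that axis.
positions = jitter_positions(5, center=1.0, width=0.4, seed=3252)
print(positions)
```

The quantitative coordinate of each point is untouched; only the position along the categorical axis is perturbed, so the measurements themselves are still read correctly from the plot.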
The spread
keyword argument specify how glyphs are spread from the position along the categorical axis to enable visualizing all points. In addition to jittering the data points, iqplot also supports spreading to make a beeswarm plot, also called a swarm plot, in which the points are made not to clash with each other using spread="swarm"
.
[8]:
p = iqplot.strip(
data=df,
q="mpg",
cats="origin",
spread="swarm",
marker_kwargs=dict(alpha=0.5),
tooltips=[("year", "@yr"), ("model", "@name")],
frame_width=500,
)
bokeh.io.show(p)
/Users/bois/Dropbox/git/iqplot/iqplot/cat.py:1804: UserWarning: 7 data points exceed maximum height. Consider using spread='jitter' or increasing the frame height.
warnings.warn(
Note the warning: some data points for North America are overlapping because the spread of the data points exceeds the maximal height allocated to each category. This is in general a problem with swarm plots, and I recommend jittering as a spreading mechanism for non-small data sets (not even large ones, just non-small ones!), especially since Bokeh plots are zoomable.
Parallel coordinate plots
Sometimes there is a relationship between points in respective categories and we may wish to annotate the strip plot with a parallel coordinate plot wherein the points are connected by lines. This is not the case for the car data set we are using, but we can generate a data set that has a relationship between categories. Imagine an experiment was done on three different days. On each day, the experimenter did a set of trials. Each trial has three results: one from the control, one from experiment 1, and one from experiment 2. We can generate a data set reflecting this scenario.
[9]:
np.random.seed(3252)
df_pc = pd.DataFrame(
np.vstack(
(
np.random.normal(5, 1, size=39),
np.random.normal(4, 1, size=39),
np.random.normal(6, 1, size=39),
)
).transpose(),
columns=["control", "exp 1", "exp 2"],
)
df_pc["day"] = ["Mon"] * 14 + ["Tues"] * 10 + ["Wed"] * 15
df_pc["trial"] = np.arange(1, len(df_pc) + 1)
# Melt to make it in iqplot's preferred tidy format
df_pc = df_pc.melt(id_vars=["day", "trial"], var_name="exp", value_name="val")
# Take a look
df_pc.head()
[9]:
 | day | trial | exp | val |
---|---|---|---|---|
0 | Mon | 1 | control | 5.849320 |
1 | Mon | 2 | control | 5.620194 |
2 | Mon | 3 | control | 4.842355 |
3 | Mon | 4 | control | 6.627488 |
4 | Mon | 5 | control | 4.696773 |
The 'trial' column specifies the relationship among the results. We can make a strip plot with this data set and include the parcoord_column='trial' keyword argument to add the parallel coordinate annotation.
[10]:
p = iqplot.strip(
df_pc,
q="val",
cats=["day", "exp"],
q_axis="y",
frame_width=500,
color_column="exp",
parcoord_column="trial",
)
bokeh.io.show(p)
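The grouping behind a parallel coordinate annotation can be sketched in plain Python: every record sharing a value in the linking column contributes one point to the same line. The helper name `parcoord_lines` and the sample records are made up for illustration, and this is not how iqplot builds its lines internally.

```python
from collections import defaultdict


def parcoord_lines(records, group_key, cat_key, value_key, order):
    """Collect, for each group, its values in the order of the categories.

    Hypothetical helper for illustration only; not part of iqplot.
    """
    grouped = defaultdict(dict)
    for rec in records:
        grouped[rec[group_key]][rec[cat_key]] = rec[value_key]
    return {
        group: [vals[cat] for cat in order if cat in vals]
        for group, vals in grouped.items()
    }


# Made-up tidy records: two trials, three results each
records = [
    {"trial": 1, "exp": "control", "val": 5.8},
    {"trial": 2, "exp": "control", "val": 5.6},
    {"trial": 1, "exp": "exp 1", "val": 4.1},
    {"trial": 2, "exp": "exp 1", "val": 3.9},
    {"trial": 1, "exp": "exp 2", "val": 6.3},
    {"trial": 2, "exp": "exp 2", "val": 6.0},
]

lines = parcoord_lines(
    records, "trial", "exp", "val", order=["control", "exp 1", "exp 2"]
)
print(lines)
```

Each list holds the values joined by one line in the plot, running from control through experiment 2 for a single trial.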
Box plots
We can also make vertical box plots by specifying q_axis='y'. We also demonstrate the order kwarg to specify the ordering of the categorical variables.
[11]:
p = iqplot.box(
data=df,
q="mpg",
cats="origin",
q_axis="y",
order=["Asia", "Europe", "North America"],
)
bokeh.io.show(p)
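For readers who want to see the numbers a box plot summarizes, here is a minimal sketch of the conventional Tukey box plot statistics using only the standard library. The 1.5 × IQR whisker rule and the quantile interpolation method are assumptions made for this sketch; iqplot's conventions may differ, so consult the API reference. The helper name `box_stats` and the sample data are hypothetical.

```python
import statistics


def box_stats(values, whisker_factor=1.5):
    """Quartiles, whisker ends, and outliers under the Tukey convention.

    Hypothetical helper for illustration only; not part of iqplot.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    # Whiskers end at the most extreme data points inside the fences
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Made-up measurements with one extreme value
stats = box_stats([10, 12, 13, 14, 15, 16, 17, 18, 40])
print(stats)
```

In this sample, the value 40 lies beyond the upper fence and would be drawn as an individual outlier glyph rather than being covered by the whisker.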
We can independently specify properties of the marks using box_kwargs, whisker_kwargs, median_kwargs, and outlier_kwargs. For example, say we wanted our colors to be Betancourt red, and that we wanted the outliers to also be that color and use diamond glyphs. We can also put caps on the whiskers using whisker_caps=True.
[12]:
p = iqplot.box(
data=df,
q="mpg",
cats="origin",
whisker_caps=True,
outlier_marker="diamond",
box_kwargs=dict(fill_color="#7C0000"),
whisker_kwargs=dict(line_color="#7C0000", line_width=2),
)
bokeh.io.show(p)
We can have multiple categories by specifying cats as a list. We will also specify a custom palette.
[13]:
bkp = bokeh.palettes.d3["Category20c"][20]
palette = bkp[:3] + bkp[4:7] + bkp[8:11]
p = iqplot.box(
data=df,
q="mpg",
cats=["origin", "cyl"],
palette=palette,
y_axis_label="# of cylinders",
)
p.yaxis.axis_label_text_font_style = "bold"
bokeh.io.show(p)
Strip-box plots
The appearance of strip-box plots can be fine-tuned using the same keyword arguments as with strip and box plots. The defaults are set to make the data presentation clear; i.e., the boxes are not filled by default.
Histograms
We can plot normalized histograms using the density kwarg, and we’ll make the plot a little wider. We can also omit the rug plot in the histogram using the rug=False kwarg.
[14]:
p = iqplot.histogram(
data=df, q="mpg", cats="origin", density=True, rug=False, frame_width=550,
)
bokeh.io.show(p)
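As a reminder of what normalization means here, a density histogram scales each bar so that the total area of the bars is one. The following is a minimal standard-library sketch of that calculation; the helper name `density_heights`, the bin edges, and the sample values are made up, and it assumes every value lies within the edges.

```python
def density_heights(values, edges):
    """Bar heights such that the total bar area is one.

    Bins are half-open [left, right), except the last, which also
    includes its right edge. Hypothetical helper for illustration only.
    """
    n = len(values)
    heights = []
    for i, (left, right) in enumerate(zip(edges, edges[1:])):
        is_last = i == len(edges) - 2
        count = sum(
            left <= v < right or (is_last and v == right) for v in values
        )
        heights.append(count / (n * (right - left)))
    return heights


values = [1, 2, 2, 3, 7, 9]
edges = [0, 2, 4, 10]
heights = density_heights(values, edges)

# Total area: sum of height times bin width
area = sum(h * (r - l) for h, l, r in zip(heights, edges, edges[1:]))
print(heights, area)
```

Because each height is divided by the bin width, bins of unequal width remain comparable, which is not the case for raw counts.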
We can also plot histograms with different binning. By default, the Freedman-Diaconis rule is used. We could instead specify an integer number of evenly spaced bins, or even specify the bin edges, which we do below as a Numpy array.
[15]:
bins = np.arange(8, 50, 3)
p = iqplot.histogram(data=df, q="mpg", cats="origin", bins=bins, frame_width=550)
bokeh.io.show(p)
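The Freedman-Diaconis rule sets the bin width to 2 · IQR / n^(1/3), where IQR is the interquartile range and n is the number of data points. Here is a minimal standard-library sketch of the rule. The helper name `freedman_diaconis_bins` and the sample data are hypothetical, and the quartile interpolation method is an assumption, so the result may differ slightly from what iqplot or Numpy computes.

```python
import math
import statistics


def freedman_diaconis_bins(values):
    """Return (bin width, number of bins) from the Freedman-Diaconis rule.

    Hypothetical helper for illustration only; not part of iqplot.
    """
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    width = 2 * (q3 - q1) / n ** (1 / 3)
    if width == 0:
        # Degenerate case (zero IQR): fall back to a single bin
        return 0.0, 1
    n_bins = math.ceil((max(values) - min(values)) / width)
    return width, max(n_bins, 1)


# Made-up data: IQR is 3.5 and n is 8, so the width is 2 * 3.5 / 2 = 3.5
width, n_bins = freedman_diaconis_bins([1, 2, 3, 4, 5, 6, 7, 20])
print(width, n_bins)
```

Because the rule relies on the IQR rather than the standard deviation, a single extreme value such as 20 widens the range to be covered but has little effect on the bin width.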
In some cases, the data are discrete. In the case of integers, you can specify bins='integer', and the bars of the histogram will be centered on integer values.
[16]:
p = iqplot.histogram(
data=df,
q="cyl",
cats="origin",
bins="integer",
x_axis_label="number of cylinders",
frame_width=550,
)
bokeh.io.show(p)
Note that this is different from choosing bins='exact'. In this case, a bar is made in the histogram for each unique value in the data set. The width of the bars is chosen such that the exact value is included in the bar and the bars abut each other. In the case of the histogram of cylinders, this is not really what we want.
[17]:
p = iqplot.histogram(
data=df,
q="cyl",
cats="origin",
bins="exact",
x_axis_label="number of cylinders",
frame_width=550,
)
bokeh.io.show(p)
Note that because there were no 7-cylinder cars, the bars for the 6-cylinder cars and those for the 8-cylinder cars meet at 7 and have different widths.
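The difference between the two binning choices is easiest to see from the bin edges themselves. The sketch below uses only the standard library and hypothetical helper names. Placing the 'exact' edges at the midpoints between neighboring unique values matches the behavior described above, but the placement of the two outermost edges is an assumption of this sketch, not something taken from iqplot.

```python
def integer_bin_edges(values):
    """Edges at half-integers so that each bar is centered on an integer."""
    lo, hi = min(values), max(values)
    return [k - 0.5 for k in range(lo, hi + 2)]


def exact_bin_edges(values):
    """Edges at the midpoints between consecutive unique values.

    Assumption: each end bar extends outward as far as it extends inward.
    """
    unique = sorted(set(values))
    if len(unique) == 1:
        return [unique[0] - 0.5, unique[0] + 0.5]
    inner = [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    first = unique[0] - (inner[0] - unique[0])
    last = unique[-1] + (unique[-1] - inner[-1])
    return [first] + inner + [last]


# Made-up cylinder counts with no 7-cylinder entries
cyl = [3, 4, 4, 5, 6, 6, 8, 8]

print(integer_bin_edges(cyl))  # includes an empty bin centered on 7
print(exact_bin_edges(cyl))    # bars for 6 and 8 share an edge at 7
```

With integer bins, the missing value 7 simply produces an empty bar, whereas with exact bins the neighboring bars stretch to fill the gap.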
Histograms may also be overlaid in the same plot, instead of stacked on top of each other as they are by default, using the arrangement='overlay' kwarg.
[18]:
p = iqplot.histogram(
data=df, q="mpg", cats="origin", arrangement="overlay", density=True, rug=False, frame_width=550,
)
bokeh.io.show(p)
When overlaid, the histograms are by default not filled to avoid clutter; they may be filled using the style kwarg.
[19]:
p = iqplot.histogram(
data=df,
q="mpg",
cats="origin",
arrangement="overlay",
style="step_filled",
density=True,
rug=False,
frame_width=550,
)
bokeh.io.show(p)
Strip-histogram plots
Similar to strip-box plots, the appearance of strip-histogram plots can be fine-tuned using the same keyword arguments as with strip plots and histograms. The defaults are set to make the data presentation clear; i.e., the histograms are mirrored across a categorical value. Note also that the histograms are normalized, as would be the case using the density=True kwarg of iqplot.histogram(). The number of measurements is clear from the strip plot.
Spike plots
A traditional spike plot does not have dots at the top of the spikes. This is accomplished using the style='spike' keyword argument.
[20]:
p = iqplot.spike(data=df, q="mpg_rounded", cats="origin", style='spike')
bokeh.io.show(p)
I aesthetically prefer dots on top of the spikes, which is the default. Note, though, that a spike plot in the context of iqplot is different from a lollipop plot, which gives counts of categorical variables. Here, we are showing counts of quantitative variables which happen to take on discrete values.
If you want the height of the spikes to represent the fraction of measurements having a certain value within each category as opposed to the count, use the fraction=True keyword argument.
[21]:
p = iqplot.spike(data=df, q="mpg_rounded", cats="origin", style="spike", fraction=True)
bokeh.io.show(p)
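To make the relationship between counts and fractions concrete, here is a minimal standard-library sketch of the quantities a spike plot displays for a single category. The helper name `spike_heights` and the sample values are hypothetical, and this is not iqplot's implementation.

```python
from collections import Counter


def spike_heights(values, fraction=False):
    """Return {value: height} for a spike plot of discrete data.

    Hypothetical helper for illustration only; not part of iqplot.
    """
    counts = Counter(values)
    if not fraction:
        return dict(sorted(counts.items()))
    n = len(values)
    return {value: count / n for value, count in sorted(counts.items())}


# Made-up rounded mpg values for one category
mpg_rounded = [18, 15, 18, 16, 17, 18, 15, 16]

counts = spike_heights(mpg_rounded)
fractions = spike_heights(mpg_rounded, fraction=True)
print(counts)
print(fractions)
```

The fractions within a category sum to one, which makes categories with very different numbers of measurements easier to compare.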
ECDFs
An empirical cumulative distribution function, or ECDF, is a convenient way to visualize a univariate probability distribution. Consider a measurement x in a set of measurements X. The ECDF evaluated at x is defined as
ECDF(x) = fraction of data points in X that are ≤ x.
By default, the ECDFs are plotted as dots, where the y-value of a given dot is the fraction of data points that are less than or equal to the corresponding x-value. (While unconventional, plot-as-dots is the default because it is easier to see individual measurements in the plot and also to hover over them for tooltips.) We may wish to display ECDFs as staircases, as is more traditionally done. To do this, we use the style='staircase' kwarg. In the example below, we also include tooltips so that when you hover over a corner in the staircase corresponding to a data point, the year and model are displayed.
[22]:
p = iqplot.ecdf(
data=df,
q="mpg",
cats="origin",
style="staircase",
tooltips=[("year", "@yr"), ("model", "@name")],
)
bokeh.io.show(p)
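The definition of the ECDF translates directly into code. Here is a minimal standard-library sketch; the function names and sample values are hypothetical. How tied values are drawn as dots (stacked at successive heights, as here) is an assumption of this sketch and not necessarily what iqplot does.

```python
from bisect import bisect_right


def ecdf_at(x, data):
    """Fraction of data points that are less than or equal to x."""
    sorted_data = sorted(data)
    return bisect_right(sorted_data, x) / len(sorted_data)


def ecdf_dots(data):
    """(x, y) pairs for a plot-as-dots ECDF.

    The i-th smallest of n points gets y = i / n, so tied values appear
    as dots stacked at successive heights (an assumption of this sketch).
    """
    sorted_data = sorted(data)
    n = len(sorted_data)
    return [(x, (i + 1) / n) for i, x in enumerate(sorted_data)]


# Made-up mpg measurements
mpg = [18.0, 15.0, 18.0, 16.0, 17.0]

print(ecdf_at(16.0, mpg))  # two of the five points are <= 16.0
print(ecdf_dots(mpg))
```

A staircase ECDF passes through the same values; it simply draws the function as horizontal steps between successive data points instead of marking each point with a dot.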
We can also display empirical complementary cumulative distribution functions (ECCDFs) using the complementary kwarg.
ECCDF(x) = 1 - ECDF(x)
[23]:
p = iqplot.ecdf(data=df, q="mpg", cats="origin", complementary=True)
bokeh.io.show(p)
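Since ECCDF(x) = 1 - ECDF(x), the ECCDF evaluated at x is the fraction of data points strictly greater than x. A minimal standard-library sketch, with a hypothetical function name and made-up sample values:

```python
from bisect import bisect_right


def eccdf_at(x, data):
    """Fraction of data points that are strictly greater than x."""
    sorted_data = sorted(data)
    n = len(sorted_data)
    # bisect_right counts the points <= x; the rest are > x
    return (n - bisect_right(sorted_data, x)) / n


# Made-up mpg measurements
mpg = [18.0, 15.0, 18.0, 16.0, 17.0]

print(eccdf_at(16.0, mpg))  # three of the five points exceed 16.0
```

Counting the points above x directly, rather than subtracting the ECDF from one, avoids small floating-point differences in the result.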
Rather than overlaying the ECDFs of each category, we can arrange the ECDFs in separate plots stacked on top of each other using the arrangement="stack" kwarg. The return value is a bokeh.models.layouts.Column instance and not a bokeh.plotting.figure.Figure instance as for the other plots.
[24]:
p = iqplot.ecdf(data=df, q="mpg", cats="origin", arrangement="stack")
bokeh.io.show(p)
Instead of plotting a separate ECDF for each category, we can put all of the categories together on one ECDF and color the points by the categorical variable by using the kind='colored' kwarg.
[25]:
p = iqplot.ecdf(data=df, q="mpg", cats="origin", kind="colored")
bokeh.io.show(p)
Statistical calculations
iqplot allows for displaying confidence intervals acquired by bootstrapping for histograms, spike plots, and ECDFs using the conf_int kwarg. By default, a 95% confidence interval is shown.
[26]:
p_hist = iqplot.histogram(data=df, q="mpg", cats="origin", density=True, conf_int=True)
p_spike = iqplot.spike(data=df, q="mpg_rounded", cats="origin", fraction=True, conf_int=True)
p_ecdf = iqplot.ecdf(data=df, q="mpg", cats="origin", style="staircase", conf_int=True)
bokeh.io.show(bokeh.layouts.gridplot([p_hist, p_spike, p_ecdf], ncols=1))
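For readers unfamiliar with bootstrapping, the idea is to resample the data with replacement many times, recompute the quantity of interest for each resample, and report percentiles of the results. Here is a minimal standard-library sketch of a percentile bootstrap confidence interval for the value of an ECDF at one point. The function names, the number of replicates, and the sample data are all assumptions for illustration; iqplot's internal procedure may differ in its details.

```python
import random
from bisect import bisect_right


def ecdf_at(x, data):
    """Fraction of data points that are less than or equal to x."""
    return bisect_right(sorted(data), x) / len(data)


def bootstrap_conf_int(data, statistic, n_reps=2000, level=0.95, seed=None):
    """Percentile bootstrap confidence interval for `statistic`.

    Hypothetical helper for illustration only; not part of iqplot.
    """
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time
    reps = sorted(statistic(rng.choices(data, k=n)) for _ in range(n_reps))
    alpha = (1 - level) / 2
    low = reps[int(alpha * (n_reps - 1))]
    high = reps[int((1 - alpha) * (n_reps - 1))]
    return low, high


# Made-up measurements: 40 evenly spaced values from 10.0 to 29.5
data = [10 + 0.5 * i for i in range(40)]

point = ecdf_at(20.0, data)
low, high = bootstrap_conf_int(
    data, lambda sample: ecdf_at(20.0, sample), seed=3252
)
print(point, (low, high))
```

Repeating this calculation at each x-value traces out a confidence band around an ECDF; the same resampling idea applies to histogram bar heights and spike heights.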