Basic usage

As with any Bokeh plot, displaying iqplot output in a notebook requires importing bokeh.io and calling bokeh.io.output_notebook(). To demonstrate the usage of iqplot, we will use the automobile fuel efficiency sample data set that is included in Bokeh.

[1]:
import numpy as np
import pandas as pd

import iqplot

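import bokeh.sampledata.autompg

import bokeh.io
import bokeh.layouts   # used below for gridplot()
import bokeh.palettes  # used below for a custom palette
bokeh.io.output_notebook()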
Loading BokehJS ...

To get a feel for the data set, we take a look at its first few rows.

[2]:
df = bokeh.sampledata.autompg.autompg_clean

df.head()
[2]:
    mpg  cyl  displ   hp  weight  accel  yr         origin                       name        mfr
0  18.0    8  307.0  130    3504   12.0  70  North America  chevrolet chevelle malibu  chevrolet
1  15.0    8  350.0  165    3693   11.5  70  North America          buick skylark 320      buick
2  18.0    8  318.0  150    3436   11.0  70  North America         plymouth satellite   plymouth
3  16.0    8  304.0  150    3433   12.0  70  North America              amc rebel sst        amc
4  17.0    8  302.0  140    3449   10.5  70  North America                ford torino       ford

Importantly, this data set is tidy: each row represents a single observation and each column a variable associated with an observation. iqplot assumes that any data frame passed to it is in tidy format. In the fuel efficiency example, the columns differ in character. For example, 'mpg' contains quantitative measurements of the miles per gallon of each car, whereas 'origin' is categorical: it is not quantitative, but is a descriptor of the automobile that takes on a few discrete values.
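
As a minimal sketch of what "tidy" means in practice, the following reshapes a small wide-format table into tidy format with pandas. The table and its column names ('car', 'city_mpg', 'highway_mpg') are made up for illustration and are not part of the Bokeh sample data.

```python
import pandas as pd

# Hypothetical wide-format table: one row per car, one column per condition
wide = pd.DataFrame(
    {
        "car": ["a", "b"],
        "city_mpg": [18.0, 24.0],
        "highway_mpg": [26.0, 33.0],
    }
)

# Melt so that each row holds a single observation: one car under one
# condition, with the measured value in its own column
tidy = wide.melt(id_vars="car", var_name="condition", value_name="mpg")

print(tidy)
```

In the tidy version, 'mpg' is the quantitative column and 'condition' could serve as a categorical column.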

Quick start

In the most common usage, iqplot generates plots from tidy data frames where some columns may contain categorical data and the column of interest in the plot is quantitative.

There are seven types of plots that iqplot generates.

  • Box plots

  • Strip plots

  • Spike plots

  • Strip-box plots (strip and box plots overlaid)

  • Histograms

  • Strip-histogram plots (strip and histogram plots overlaid)

  • ECDFs

If you are unfamiliar with ECDFs, they are discussed below.

The first seven arguments are the same for all plots. They are:

  • data: A tidy data frame

  • q: The column of the data frame to be treated as the quantitative variable.

  • cats: A list of columns in the data frame that are to be considered as categorical variables in the plot. If None, a single box, strip, histogram, or ECDF is plotted.

  • q_axis: The axis, 'x' or 'y', along which the quantitative variable varies. The default is 'x'.

  • palette: A list of hex colors to use for coloring the markers for each category. By default, it uses the Glasbey Category 10 color palette from colorcet.

  • order: If specified, the ordering of the categories to use on the categorical axis and legend (if applicable). Otherwise, the order in which the categories appear in the data frame is used.

  • p: If specified, the bokeh.plotting.Figure object to use for the plot. If not specified, a new figure is created.

If data is given as a NumPy array, it is the only required argument. If data is given as a Pandas DataFrame, q must also be supplied. All other arguments are optional and have reasonable defaults.

The respective plots also have kwargs that are specific to them. Examples highlighting some, but not all, customizations are in the following sections.

Any extra kwargs not in the function call signature are passed to bokeh.plotting.figure() when the figure is instantiated.

Here are six of the seven default plots for cats = 'origin' and q = 'mpg'.

[3]:
p_box = iqplot.box(data=df, q="mpg", cats="origin", title="box")
p_strip = iqplot.strip(data=df, q="mpg", cats="origin", title="strip")
p_stripbox = iqplot.stripbox(data=df, q="mpg", cats="origin", title="strip-box")
p_striphistogram = iqplot.striphistogram(data=df, q="mpg", cats="origin", title="strip-histogram")
p_histogram = iqplot.histogram(data=df, q="mpg", cats="origin", title="histogram")
p_ecdf = iqplot.ecdf(data=df, q="mpg", cats="origin", title="ecdf")

bokeh.io.show(bokeh.layouts.gridplot([p_box, p_strip, p_stripbox, p_striphistogram, p_histogram, p_ecdf], ncols=1))

When data take on discrete values, spike plots are useful. To demonstrate a spike plot, we will round the mpg column to integer values.

[4]:
df['mpg_rounded'] = df['mpg'].round()

p_spike = iqplot.spike(data=df, q="mpg_rounded", cats="origin", title="spike")

bokeh.io.show(p_spike)

The height of each spike, in this case topped with a dot, is proportional to the number of vehicles with a given rounded miles per gallon.
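
To make the meaning of the spike heights concrete, here is a minimal sketch of the underlying tally, using a handful of made-up values in place of the mpg column. It only illustrates the counting; it is not meant to reproduce iqplot's internals.

```python
import numpy as np

# Made-up measurements standing in for the mpg column
mpg = np.array([17.6, 18.2, 18.4, 21.0, 20.9, 18.0])

# Round to integers, then count how many times each distinct value occurs;
# each (value, count) pair corresponds to one spike
values, counts = np.unique(np.round(mpg), return_counts=True)

for v, c in zip(values, counts):
    print(f"rounded mpg {v:.0f}: {c} vehicles")
```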

Plots with a single data set

You can also generate plots from a single NumPy array without specifying any categories. Note that when data is given as a NumPy array, the string passed as the q argument is used as the axis label.

[5]:
# MPG data for all cars as Numpy array
data = df["mpg"].values
data_rounded = np.round(data)

p_box = iqplot.box(data=data, q="mpg", title="box")
p_strip = iqplot.strip(data=data, q="mpg", title="strip")
p_spike = iqplot.spike(data=data_rounded, q="mpg", title="spike")
p_stripbox = iqplot.stripbox(data=data, q="mpg", title="strip-box")
p_striphistogram = iqplot.striphistogram(data=data, q="mpg", title="strip-histogram")
p_histogram = iqplot.histogram(data=data, q="mpg", title="histogram")
p_ecdf = iqplot.ecdf(data=data, q="mpg", title="ecdf")

bokeh.io.show(bokeh.layouts.gridplot([p_box, p_strip, p_spike, p_stripbox, p_striphistogram, p_histogram, p_ecdf], ncols=1))

Fine-tuning of plots

In the following, we investigate each kind of plot and explore some, but not all, of the configuration options. Refer to the API reference for details about possible keyword arguments.

Strip plots

We can make a strip plot with dash markers and add some transparency. The marker keyword argument allows selection of glyphs, and the marker_kwargs keyword argument provides keyword arguments to be passed to p.dash() (or p.circle(), or whatever marker you choose), where p is the figure.

[6]:
p = iqplot.strip(
    data=df,
    q="mpg",
    cats="origin",
    marker="dash",
    marker_kwargs=dict(alpha=0.3),
)

bokeh.io.show(p)

The problem with strip plots is that data points can overlap. A common approach to deal with this is to “jitter,” that is, to place the glyphs with small random displacements along the categorical axis. I do that here using the spread="jitter" keyword argument. I also add tooltips so that hovering over a data point gives more information about it.

[7]:
p = iqplot.strip(
    data=df,
    q="mpg",
    cats="origin",
    spread="jitter",
    marker_kwargs=dict(alpha=0.5),
    tooltips=[("year", "@yr"), ("model", "@name")],
    frame_width=500,
)

bokeh.io.show(p)

Note that in this plot, I used the frame_width kwarg to make the plot wider. Any kwargs that can be passed into bokeh.plotting.figure() can be used.
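
The idea behind jittering can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the uniform distribution and the width of 0.2 are hypothetical choices, and the way iqplot draws its displacements may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up quantitative values that all belong to one category
q = np.array([18.0, 18.0, 18.0, 21.0, 21.0])

# Take the category to sit at position 0 on the categorical axis. Each glyph
# gets a small random offset around that position (width is an assumed value)
width = 0.2
offsets = rng.uniform(-width, width, size=len(q))

# Coordinates at which the glyphs would be drawn: (quantitative, categorical)
points = np.column_stack((q, offsets))
print(points)
```

Points with identical quantitative values now sit at different positions along the categorical axis, so they no longer hide each other.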

The spread keyword argument specifies how glyphs are spread from their position along the categorical axis so that all points can be seen. In addition to jittering the data points, iqplot also supports spreading to make a beeswarm plot, also called a swarm plot, in which the points are positioned so that they do not clash with each other, using spread="swarm".

[8]:
p = iqplot.strip(
    data=df,
    q="mpg",
    cats="origin",
    spread="swarm",
    marker_kwargs=dict(alpha=0.5),
    tooltips=[("year", "@yr"), ("model", "@name")],
    frame_width=500,
)

bokeh.io.show(p)
/Users/bois/Dropbox/git/iqplot/iqplot/cat.py:1804: UserWarning: 7 data points exceed maximum height. Consider using spread='jitter' or increasing the frame height.
  warnings.warn(

Note the warning: some data points for North America are overlapping because the spread of the data points exceeds the maximal height allocated to each category. This is in general a problem with swarm plots, and I recommend jittering as a spreading mechanism for non-small data sets (not even large ones, just non-small ones!), especially since Bokeh plots are zoomable.

Parallel coordinate plots

Sometimes there is a relationship between points in respective categories, and we may wish to annotate the strip plot with a parallel coordinate plot wherein the points are connected by lines. This is not the case for the car data set we are using, but we can generate a data set that has a relationship between categories. Imagine an experiment was done on three different days. On each day, the experimenter did a set of trials. Each trial has three results: one from control, one from experiment 1, and one from experiment 2. We can generate a data set reflecting this scenario.

[9]:
np.random.seed(3252)

df_pc = pd.DataFrame(
    np.vstack(
        (
            np.random.normal(5, 1, size=39),
            np.random.normal(4, 1, size=39),
            np.random.normal(6, 1, size=39),
        )
    ).transpose(),
    columns=["control", "exp 1", "exp 2"],
)
df_pc["day"] = ["Mon"] * 14 + ["Tues"] * 10 + ["Wed"] * 15
df_pc["trial"] = np.arange(1, len(df_pc) + 1)

# Melt to make it in iqplot's preferred tidy format
df_pc = df_pc.melt(id_vars=["day", "trial"], var_name="exp", value_name="val")

# Take a look
df_pc.head()
[9]:
   day  trial      exp       val
0  Mon      1  control  5.849320
1  Mon      2  control  5.620194
2  Mon      3  control  4.842355
3  Mon      4  control  6.627488
4  Mon      5  control  4.696773

The 'trial' column specifies the relationship among the results. We can make a strip plot with this data set and include the parcoord_column='trial' keyword argument to add the parallel coordinate annotation.

[10]:
p = iqplot.strip(
    df_pc,
    q="val",
    cats=["day", "exp"],
    q_axis="y",
    frame_width=500,
    color_column="exp",
    parcoord_column="trial",
)

bokeh.io.show(p)

Box plots

We can also make vertical box plots by specifying q_axis='y'. We also demonstrate the order kwarg to specify the ordering of the categorical variables.

[11]:
p = iqplot.box(
    data=df,
    q="mpg",
    cats="origin",
    q_axis="y",
    order=["Asia", "Europe", "North America"],
)

bokeh.io.show(p)

We can independently specify properties of the marks using box_kwargs, whisker_kwargs, median_kwargs, and outlier_kwargs. For example, say we wanted our colors to be Betancourt red, and that we wanted the outliers to also be that color and use diamond glyphs. We can also put caps on the whiskers using whisker_caps=True.

[12]:
p = iqplot.box(
    data=df,
    q="mpg",
    cats="origin",
    whisker_caps=True,
    outlier_marker="diamond",
    box_kwargs=dict(fill_color="#7C0000"),
    whisker_kwargs=dict(line_color="#7C0000", line_width=2),
)

bokeh.io.show(p)
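
For orientation, the quantities a box plot displays can be computed by hand. The sketch below assumes the common Tukey convention, in which whiskers extend to the most extreme points within 1.5 times the interquartile range of the box; whether iqplot uses exactly this convention should be checked in the API reference. The data are made up.

```python
import numpy as np

# Made-up measurements with one extreme value
x = np.array([12.0, 15.0, 16.0, 17.0, 18.0, 18.0, 19.0, 21.0, 40.0])

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: whiskers reach the most extreme data points that lie
# within 1.5 IQR of the edges of the box
inside = x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the whiskers are drawn individually as outliers
outliers = x[(x < whisker_low) | (x > whisker_high)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```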

We can have multiple categories by specifying cats as a list. We will also specify a custom palette.

[13]:
bkp = bokeh.palettes.d3["Category20c"][20]
palette = bkp[:3] + bkp[4:7] + bkp[8:11]

p = iqplot.box(
    data=df,
    q="mpg",
    cats=["origin", "cyl"],
    palette=palette,
    y_axis_label="# of cylinders",
)

p.yaxis.axis_label_text_font_style = "bold"

bokeh.io.show(p)

Strip-box plots

The appearance of strip-box plots can be fine-tuned using the same keyword arguments as with strip and box plots. The defaults are set to make the data presentation clear; i.e., the boxes are not filled by default.

Histograms

We can plot normalized histograms using the density kwarg, and we’ll make the plot a little wider. We can also omit the rug plot in the histogram using the rug=False kwarg.

[14]:
p = iqplot.histogram(
    data=df, q="mpg", cats="origin", density=True, rug=False, frame_width=550,
)

bokeh.io.show(p)
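
What normalization means can be checked with NumPy's own histogram function: with density=True, each bar height is the count divided by the number of points times the bin width, so the bars enclose a total area of one. The data and bin edges below are made up, and np.histogram stands in for whatever iqplot does internally.

```python
import numpy as np

# Made-up data and bin edges (the last bin is wider than the others)
x = np.array([10.0, 12.0, 13.0, 15.0, 18.0, 22.0, 25.0, 31.0])
edges = np.array([10.0, 15.0, 20.0, 25.0, 35.0])

counts, _ = np.histogram(x, bins=edges)
density, _ = np.histogram(x, bins=edges, density=True)

# Total area under the normalized histogram
area = np.sum(density * np.diff(edges))

print(counts, density, area)
```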

We can also plot histograms with different binning. By default, the Freedman-Diaconis rule is used. We could instead specify an integer number of evenly spaced bins, or even specify the bin edges, which we do below with a NumPy array.

[15]:
bins = np.arange(8, 50, 3)

p = iqplot.histogram(data=df, q="mpg", cats="origin", bins=bins, frame_width=550)

bokeh.io.show(p)
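
For reference, the Freedman-Diaconis rule sets the bin width to 2 IQR / n^(1/3), where IQR is the interquartile range and n is the number of data points. The sketch below applies the rule by hand to made-up data and compares the result with NumPy's bins="fd" option. How iqplot turns the width into bin edges is not covered here and may differ.

```python
import numpy as np

# Made-up data set with n = 8 points
x = np.array([12.0, 15.0, 16.0, 17.0, 18.0, 18.0, 19.0, 21.0])

# Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
q1, q3 = np.percentile(x, [25, 75])
h = 2 * (q3 - q1) / len(x) ** (1 / 3)

# Number of bins needed to cover the range of the data
n_bins = int(np.ceil((x.max() - x.min()) / h))

# NumPy implements the same rule for comparison
edges = np.histogram_bin_edges(x, bins="fd")

print(h, n_bins, edges)
```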

In some cases, the data are discrete. In the case of integers, you can specify bins='integer', and the bars of the histogram will be centered on integer values.

[16]:
p = iqplot.histogram(
    data=df,
    q="cyl",
    cats="origin",
    bins="integer",
    x_axis_label="number of cylinders",
    frame_width=550,
)

bokeh.io.show(p)
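
One way to get bars centered on integers, sketched here with NumPy and made-up data, is to place the bin edges at half-integers. This illustrates the idea only and is not necessarily how iqplot constructs its bins.

```python
import numpy as np

# Made-up cylinder counts
cyl = np.array([4, 4, 6, 8, 8, 8])

# Edges at half-integers put the center of every bin on an integer
edges = np.arange(cyl.min() - 0.5, cyl.max() + 1.5)
counts, _ = np.histogram(cyl, bins=edges)
centers = (edges[:-1] + edges[1:]) / 2

print(centers, counts)
```

Integers that do not occur in the data, here 5 and 7, still get a bin, just with zero height.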

Note that this is different from choosing bins='exact'. In this case, a bar is made in the histogram for each unique value in the data set. The width of the bars is chosen such that the exact value is included in the bar and the bars abut each other. In the case of the histogram of cylinders, this is not really what we want.

[17]:
p = iqplot.histogram(
    data=df,
    q="cyl",
    cats="origin",
    bins="exact",
    x_axis_label="number of cylinders",
    frame_width=550,
)

bokeh.io.show(p)

Note that because there were no 7-cylinder cars, the bars for the 6-cylinder cars and those for the 8-cylinder cars meet at 7 and have different widths.
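
The unequal widths follow from placing bin edges halfway between neighboring unique values. The sketch below reproduces that behavior for made-up cylinder counts. The placement of the two outermost edges is an assumption made for this illustration, since the text above does not specify it.

```python
import numpy as np

# Made-up cylinder counts with no 7-cylinder entries
cyl = np.array([3, 4, 4, 5, 6, 6, 8, 8, 8])

values = np.unique(cyl)

# Interior edges sit halfway between neighboring unique values
interior = (values[:-1] + values[1:]) / 2

# Assumed convention: outer edges mirror the nearest interior edge
left = values[0] - (interior[0] - values[0])
right = values[-1] + (values[-1] - interior[-1])

edges = np.concatenate(([left], interior, [right]))
widths = np.diff(edges)

print(edges, widths)
```

The bars for 6 and 8 meet at 7, and the gap in the data makes them wider than the bars for 3, 4, and 5.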

Histograms may also be overlaid in the same plot, instead of stacked on top of each other as they are by default, using the arrangement='overlay' kwarg.

[18]:
p = iqplot.histogram(
    data=df, q="mpg", cats="origin", arrangement="overlay", density=True, rug=False, frame_width=550,
)

bokeh.io.show(p)

When overlaid, the histograms are by default not filled to avoid clutter; they may be filled using the style kwarg.

[19]:
p = iqplot.histogram(
    data=df,
    q="mpg",
    cats="origin",
    arrangement="overlay",
    style="step_filled",
    density=True,
    rug=False,
    frame_width=550,
)

bokeh.io.show(p)

Strip-histogram plots

Similar to strip-box plots, the appearance of strip-histogram plots can be fine-tuned using the same keyword arguments as with strip plots and histograms. The defaults are set to make the data presentation clear; i.e., the histograms are mirrored across the position of each category. Note also that the histograms are normalized, as would be the case using the density=True kwarg of iqplot.histogram(). The number of measurements is clear from the strip plot.

Spike plots

A traditional spike plot does not have dots at the top of the spikes. This is accomplished using the style='spike' keyword argument.

[20]:
p = iqplot.spike(data=df, q="mpg_rounded", cats="origin", style='spike')

bokeh.io.show(p)

I aesthetically prefer dots on top of the spikes, which is the default. Note, though, that a spike plot in the context of iqplot is different from a lollipop plot, which gives counts of categorical variables. Here, we are showing counts of quantitative variables that happen to take on discrete values.

If you want the height of the spikes to represent the fraction of measurements having a certain value within each category as opposed to the count, use the fraction=True keyword argument.

[21]:
p = iqplot.spike(data=df, q="mpg_rounded", cats="origin", style="spike", fraction=True)

bokeh.io.show(p)
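
The difference between counts and fractions comes down to a normalization within each category, which the following sketch spells out for made-up data. It shows the arithmetic only and is not iqplot's implementation.

```python
import numpy as np

# Made-up categories and rounded measurements
origin = np.array(["Asia", "Asia", "Asia", "Europe", "Europe"])
mpg_rounded = np.array([30.0, 30.0, 32.0, 25.0, 30.0])

fractions = {}
for cat in np.unique(origin):
    # Count occurrences of each value within this category only
    values, counts = np.unique(mpg_rounded[origin == cat], return_counts=True)

    # Divide by the category total so the spike heights sum to one
    fractions[str(cat)] = dict(
        zip(values.tolist(), (counts / counts.sum()).tolist())
    )

print(fractions)
```

Because each category is normalized separately, categories with very different numbers of measurements can be compared on the same scale.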

ECDFs

An empirical cumulative distribution function, or ECDF, is a convenient way to visualize a univariate probability distribution. Consider a measurement x in a set of measurements X. The ECDF evaluated at x is defined as

ECDF(x) = fraction of data points in X that are ≤ x.

By default, the ECDFs are plotted as dots, where the y-value of a given dot is the fraction of data points that are less than or equal to the corresponding x-value. (While unconventional, plot-as-dots is the default because it is easier to see individual measurements in the plot and also to hover over them for tooltips.) We may wish to display ECDFs as staircases, as is more traditionally done. To do this, we use the style='staircase' kwarg. In the example below, we also include tooltips so that when you hover over a corner in the staircase corresponding to a data point, the year and model are displayed.

[22]:
p = iqplot.ecdf(
    data=df,
    q="mpg",
    cats="origin",
    style="staircase",
    tooltips=[("year", "@yr"), ("model", "@name")],
)

bokeh.io.show(p)

We can also display empirical complementary cumulative distribution functions (ECCDFs) using the complementary kwarg.

ECCDF(x) = 1 - ECDF(x)

[23]:
p = iqplot.ecdf(data=df, q="mpg", cats="origin", complementary=True)

bokeh.io.show(p)
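
Both definitions are short enough to compute by hand. The following sketch evaluates the ECDF and the ECCDF at each point of a small made-up data set with NumPy. The helper name ecdf_values is hypothetical and is not part of iqplot.

```python
import numpy as np


def ecdf_values(data):
    """Return the sorted data and the ECDF evaluated at each sorted point.

    With repeated values, the largest y among the duplicates is the one
    that matches the definition (fraction of points <= x).
    """
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Made-up measurements
x, y = ecdf_values(np.array([18.0, 15.0, 16.0, 17.0]))

# The complementary ECDF is one minus the ECDF
eccdf = 1 - y

print(x, y, eccdf)
```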

Rather than overlaying the ECDFs of each category, we can arrange the ECDFs in separate plots stacked on top of each other using the arrangement="stack" kwarg. The return value is a bokeh.models.layouts.Column instance and not a bokeh.plotting.figure.Figure instance as for the other plots.

[24]:
p = iqplot.ecdf(data=df, q="mpg", cats="origin", arrangement="stack")

bokeh.io.show(p)

Instead of plotting a separate ECDF for each category, we can put all of the categories together on one ECDF and color the points by the categorical variable by using the kind='colored' kwarg.

[25]:
p = iqplot.ecdf(data=df, q="mpg", cats="origin", kind="colored")

bokeh.io.show(p)
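
Conceptually, a colored ECDF pools all measurements into a single ECDF and keeps track of which category each point came from. A minimal sketch with made-up data follows; it illustrates the bookkeeping and is not iqplot's implementation.

```python
import numpy as np

# Made-up measurements and their categories
mpg = np.array([30.0, 18.0, 25.0, 15.0])
origin = np.array(["Asia", "North America", "Europe", "North America"])

# Sort the pooled data and carry the category labels along
order = np.argsort(mpg)
x = mpg[order]
y = np.arange(1, len(x) + 1) / len(x)
labels = origin[order]  # category of each point, used to choose its color

for xi, yi, label in zip(x, y, labels):
    print(xi, yi, label)
```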

Statistical calculations

iqplot can display confidence intervals acquired by bootstrapping for histograms, spike plots, and ECDFs using the conf_int kwarg. By default, a 95% confidence interval is shown.

[26]:
p_hist = iqplot.histogram(data=df, q="mpg", cats="origin", density=True, conf_int=True)
p_spike = iqplot.spike(data=df, q="mpg_rounded", cats="origin", fraction=True, conf_int=True)
p_ecdf = iqplot.ecdf(data=df, q="mpg", cats="origin", style="staircase", conf_int=True)

bokeh.io.show(bokeh.layouts.gridplot([p_hist, p_spike, p_ecdf], ncols=1))
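
To illustrate what a bootstrap confidence interval is, the sketch below computes one for the value of an ECDF at a single point. It uses the percentile method with 2000 replicates on made-up data. Both of those choices are assumptions made for this example, and the procedure iqplot uses may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(3252)

# Made-up measurements and the point at which to evaluate the ECDF
x = np.array([12.0, 15.0, 16.0, 17.0, 18.0, 18.0, 19.0, 21.0, 24.0, 30.0])
x0 = 18.0

# ECDF(x0) computed from the actual data
point = np.mean(x <= x0)

# Resample the data with replacement and recompute ECDF(x0) each time
n_reps = 2000  # assumed number of bootstrap replicates
samples = rng.choice(x, size=(n_reps, len(x)), replace=True)
reps = np.mean(samples <= x0, axis=1)

# Percentile bootstrap 95% confidence interval
low, high = np.percentile(reps, [2.5, 97.5])

print(point, low, high)
```

Repeating this calculation at every x-value gives a confidence band around the whole ECDF.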