Intentional omissions
iqplot is intentionally limited in scope. It is restricted to data sets with a single quantitative variable. It is further limited in that only five types of plots (albeit with allowance for a fair amount of configurability) are allowed. Nonetheless, there are a few kinds of plots that fall into the one-quantitative-variable class but are not included. Here, we address why they are omitted.
Why no plots with a second quantitative variable?
The most common questions I get about extending this package are:
Can I color points on a strip plot according to a quantitative variable?
Can I color points on an ECDF according to a quantitative variable?
The answer to both of these questions is no. The reason is that it is important to limit the scope of iqplot to plots with one and only one quantitative variable. The limited scope allows for a clearer logical framework for the package, thereby allowing for cleaner specification of plots.
However, it is possible to “hack” iqplot to use color to encode quantitative data for strip plots and ECDFs. Let us look at a few examples. In our examples, we will again work with the automobile fuel efficiency sample data set, which we will go ahead and load and look at (in addition to doing the requisite imports).
[1]:
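import numpy as np
import scipy.stats
import iqplot
import bokeh.sampledata.autompg
import bokeh.io
import bokeh.layouts
import bokeh.models
import bokeh.palettes
import bokeh.plotting
bokeh.io.output_notebook()
# Copy so that columns added later do not modify Bokeh's sample data
df = bokeh.sampledata.autompg.autompg_clean.copy()
df.head()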
[1]:
 | mpg | cyl | displ | hp | weight | accel | yr | origin | name | mfr |
---|---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | North America | chevrolet chevelle malibu | chevrolet |
1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | North America | buick skylark 320 | buick |
2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | North America | plymouth satellite | plymouth |
3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | North America | amc rebel sst | amc |
4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | North America | ford torino | ford |
Converting a quantitative value to a color
Central to representing a quantitative variable as color is the ability to convert a quantitative value to a color. The function below accomplishes this; given quantitative data `x` and a list of colors `colors`, the values of `x` are converted to corresponding colors.
[2]:
def q_to_color(
    x,
    colors,
    low=None,
    high=None,
    low_color="#808080",
    high_color="#808080",
    nan_color="#808080",
    scale="linear",
):
    """Convert a quantitative value to a color.

    Parameters
    ----------
    x : int, float or array_like
        Value(s) for which colors are needed.
    colors : list or tuple
        List of hex colors for mapping. E.g., bokeh.palettes.Viridis256.
    low : int or float or None, default None
        Lowest quantitative value in color range. If None, take smallest entry in `x`.
        If `x` is scalar, `low` must not be `None`.
    high : int or float or None, default None
        Highest quantitative value in color range. If None, take largest entry in `x`.
        If `x` is scalar, `high` must not be `None`.
    low_color : str, default '#808080'
        Hex value for color to be used for entries that are less than `low`.
        The default is gray.
    high_color : str, default '#808080'
        Hex value for color to be used for entries that are greater than `high`.
        The default is gray.
    nan_color : str, default '#808080'
        Hex value for color to be used for entries that are NaN.
        The default is gray.
    scale : str, default 'linear'
        Scale of color map. Must be either 'linear' or 'log'.

    Returns
    -------
    output : str or list
        If `x` is scalar, then a single hex color is returned. Otherwise,
        a list of hex colors corresponding to the entries in `x` is returned.
    """
    if scale == "linear":
        spacefun = np.linspace
    elif scale == "log":
        # np.geomspace takes the endpoints themselves; np.logspace expects exponents
        spacefun = np.geomspace
    else:
        raise ValueError("`scale` must be either 'linear' or 'log'.")

    if np.isscalar(x):
        if low is None or high is None:
            raise ValueError(
                "If `x` is scalar, then `low` and `high` must both be specified."
            )
        if np.isnan(x):
            return nan_color
        elif x < low:
            return low_color
        elif x > high:
            return high_color
        else:
            return colors[np.searchsorted(spacefun(low, high, len(colors)), x)]

    if low is None:
        low = np.nanmin(x)
    if high is None:
        high = np.nanmax(x)

    # It's faster to do one searchsorted call and then adjust for NaNs and
    # out-of-range values; clip so that every index is valid for `colors`
    inds = np.clip(
        np.searchsorted(spacefun(low, high, len(colors)), x), 0, len(colors) - 1
    )

    def color_select(i, x_val):
        if np.isnan(x_val):
            return nan_color
        elif x_val < low:
            return low_color
        elif x_val > high:
            return high_color
        else:
            return colors[i]

    return [color_select(i, x_val) for i, x_val in zip(inds, x)]
We will use this function as we hack our way into plotting two quantitative variables using iqplot.
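The core of the conversion can be sketched in a few lines: lay evenly spaced break points across the range and locate each value among them with `np.searchsorted`. This is a stripped-down illustration rather than a replacement for the function above; the five-color `toy_palette` and the helper name `value_to_color` are hypothetical and used only here.

```python
import numpy as np

# Hypothetical five-color palette, for illustration only
toy_palette = ["#440154", "#3b528b", "#21918c", "#5ec962", "#fde725"]


def value_to_color(value, palette, low, high):
    """Map a scalar to a palette entry using evenly spaced break points."""
    edges = np.linspace(low, high, len(palette))
    i = np.searchsorted(edges, value)
    # Clip so that out-of-range values fall back to the end colors
    return palette[int(np.clip(i, 0, len(palette) - 1))]


toy_colors = [value_to_color(v, toy_palette, 0.0, 10.0) for v in (0.0, 4.0, 10.0)]
```

With a real palette such as Viridis256 the break points are much finer, but the lookup is the same.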
Considering an ordinal variable as categorical
The `cyl` variable, which indicates the number of cylinders in a car, ranging from three to eight, could be considered a quantitative variable (or at least an ordinal one). `iqplot.strip()` takes a `color_column` keyword argument that specifies an ostensibly categorical variable to be used to color glyphs. We can specify `'cyl'` to be the `color_column` and then choose a quantitative palette, e.g., Viridis, and color the ordinal variable with it.
[3]:
palette = q_to_color(np.sort(df['cyl'].unique()), bokeh.palettes.Viridis256)
p = iqplot.strip(
    df,
    q="mpg",
    cats="origin",
    spread="jitter",
    color_column="cyl",
    palette=palette,
    show_legend=True,
)
p.legend.title = "cylinders"
bokeh.io.show(p)
In the above plot, we have one categorical and two quantitative variables. We cannot easily have this combination with ECDFs, since we would need separate ECDFs for each category (in this case region of origin), and they would be delineated by color. We could, however, have different markers for each region of origin and then color based on cylinders. This requires a bit more code, but is doable.
[4]:
# Make plot with different markers for each region of origin
p = None
markers = ["circle", "square", "diamond"]
for marker, (origin, g) in zip(markers, df.groupby("origin")):
    # Get palette for this group, maintaining the overall min and max
    palette = q_to_color(
        np.sort(g["cyl"].unique()),
        bokeh.palettes.Viridis256,
        low=df["cyl"].min(),
        high=df["cyl"].max(),
    )
    p = iqplot.ecdf(
        data=g,
        q="mpg",
        cats="cyl",
        kind="colored",
        marker=marker,
        palette=palette,
        p=p,
        show_legend=False,
    )
# Hand build legends, first for the markers
dummy_xy = (df["mpg"].median(), 0.5)
items_origin = []
for marker, origin in zip(markers, df.groupby("origin").groups):
    # Invisible dummy glyph that carries this group's marker into the legend
    glyph = p.scatter(*dummy_xy, marker=marker, color="gray", visible=False)
    items_origin.append((origin, [glyph]))
p.add_layout(bokeh.models.Legend(items=items_origin, location="bottom_right"), "center")
# Now the cylinders
items_cyl = []
low = df["cyl"].min()
high = df["cyl"].max()
for cyl in np.sort(df["cyl"].unique()):
    color = q_to_color(cyl, bokeh.palettes.Viridis256, low=low, high=high)
    items_cyl.append((str(cyl), [p.scatter(*dummy_xy, color=color, visible=False)]))
p.add_layout(
    bokeh.models.Legend(items=items_cyl, title="cylinders", location="center"), "right"
)
bokeh.io.show(p)
I find this plot dizzying and hard to read. A more common use case would be to have no categorical variables and plot an ECDF of MPG, coloring glyphs by the number of cylinders. This is more easily achieved.
[5]:
palette = q_to_color(np.sort(df['cyl'].unique()), bokeh.palettes.Viridis256)
p = iqplot.ecdf(
    df,
    q="mpg",
    cats="cyl",
    kind="colored",
    order=list(np.sort(df['cyl'].unique())),
    palette=palette,
)
p.legend.title = "cylinders"
bokeh.io.show(p)
This legend is also clickable.
If we wanted to make a separate ECDF for each number of cylinders, again ignoring the region of origin, we can do that.
[6]:
p = iqplot.ecdf(
    df,
    q="mpg",
    cats="cyl",
    order=list(np.sort(df['cyl'].unique())),
    palette=palette,
    style='staircase',
)
p.legend.title = "cylinders"
bokeh.io.show(p)
Directly specifying color
The above examples work if the second quantitative variable to be used for coloring takes on only a few discrete values. If there are many values, a legend is not suitable for displaying the relationship between color and quantitative values; a colorbar is better suited. Fortunately, if the column specified in the `color_column` keyword argument contains only hex codes for colors, these colors are applied directly to the glyphs. We can take advantage of this to hack our way into showing the weight of the vehicles with color.
[7]:
# Add a column to the data frame with hex values for colors
df["markercolor"] = q_to_color(df["weight"], bokeh.palettes.Viridis256)
# Make the strip plot using the new color column with hex values
p = iqplot.strip(df, q="mpg", cats="origin", spread="jitter", color_column="markercolor")
# Build a colorbar
color_bar = bokeh.models.ColorBar(
    color_mapper=bokeh.models.LinearColorMapper(
        bokeh.palettes.Viridis256, low=df["weight"].min(), high=df["weight"].max()
    ),
    border_line_color=None,
    location=(0, 0),
    title="weight (lbs)",
)
p.add_layout(color_bar, "right")
bokeh.io.show(p)
This is also possible with ECDFs, but not as directly because there is no `color_column` keyword argument for ECDFs. Instead, we first hand-compute the values of the ECDF.
[8]:
# Make values of the ECDF for each data point
df['ecdf_mpg'] = df['mpg'].rank(method='first') / len(df)
p = bokeh.plotting.figure(
    x_axis_label='mpg',
    y_axis_label='ECDF',
    **iqplot.utils._fig_dimensions({})
)
p.scatter(source=df, x='mpg', y='ecdf_mpg', color='markercolor')
p.add_layout(color_bar, "right")
bokeh.io.show(p)
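The hand-computed ECDF values in the cell above come from ranking the data and dividing by the number of points. As a small check of that idea, here is a NumPy-only sketch on a toy array; the helper name `ecdf_values` and the numbers are made up for illustration, and ties are broken by order of appearance to mirror `rank(method='first')`.

```python
import numpy as np


def ecdf_values(data):
    """Return the ECDF value of each entry of `data`, in the original order."""
    data = np.asarray(data)
    # A stable argsort gives the position each point takes once sorted,
    # with tied values keeping their order of appearance
    order = np.argsort(data, kind="stable")
    ranks = np.empty(len(data), dtype=int)
    ranks[order] = np.arange(1, len(data) + 1)
    return ranks / len(data)


toy_mpg = np.array([18.0, 15.0, 18.0, 16.0, 17.0])
toy_ecdf = ecdf_values(toy_mpg)
```

Each point gets its own height, so the two tied values of 18.0 are plotted at 0.8 and 1.0 rather than on top of each other.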
At any rate, a much better way of plotting these two quantitative variables with a single categorical variable is with a scatter plot.
[9]:
p = bokeh.plotting.figure(
    frame_height=300, frame_width=300, x_axis_label="weight", y_axis_label="mpg"
)
for c, (origin, g) in zip(
    bokeh.palettes.Category10_3, df.groupby("origin", sort=False)
):
    p.scatter(source=g, x="weight", y="mpg", color=c, legend_label=origin)
p.legend.click_policy = "hide"
bokeh.io.show(p)
Why no stacked bar graphs?
Stacked bar graphs are useful for displaying relative count data, but unfortunately, their utility is somewhat restricted to that. All five functions in iqplot handle arbitrary scalar-valued quantitative data (including negative values), and allow for arbitrarily many measurements per category. A stacked bar graph either requires one non-negative quantitative value per category or requires a count operation on the data points, which has a very specific, possibly ambiguous meaning. So, a stacked bar graph would necessitate restrictions on allowed data types beyond those allowed by the other kinds of plots.
Beyond that, there are often better choices than stacked bar. To demonstrate, consider making a stacked bar plot of the counts of cars with each number of cylinders from each region of origin.
[10]:
count_df = (
    df.groupby(["origin"])["cyl"]
    .value_counts()
    .unstack()
    .reset_index()
    .fillna(0)
)
count_df.columns = count_df.columns.astype(str)
stackers = ["3", "4", "5", "6", "8"]
p = bokeh.plotting.figure(
    frame_width=500,
    frame_height=250,
    y_range=["North America", "Europe", "Asia"],
    x_axis_label="count",
)
p.x_range.start = 0
p.hbar_stack(
    stackers=stackers,
    height=0.5,
    y="origin",
    color=bokeh.palettes.Category10_5,
    source=count_df,
    legend_label=stackers,
)
p.ygrid.grid_line_alpha = 0
p.legend.title = "cylinders"
bokeh.io.show(p)
To get the actual count of each category (number of cylinders) in the stacks, you need to assess the difference between the two ends of each segment. Compare that with a strip plot containing the same information.
[11]:
p = iqplot.strip(
    data=count_df.melt(id_vars="origin", value_name="count"),
    q="count",
    cats="origin",
    color_column="cyl",
    frame_width=500,
    show_legend=True,
    marker_kwargs=dict(size=10),
)
p.legend.title = "cylinders"
bokeh.io.show(p)
In this case, we can immediately read off the number of cars with the respective number of cylinders.
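By contrast, reading a count off a stacked bar involves a subtraction. The sketch below uses made-up counts (not taken from the data set) to show that each segment is drawn between consecutive cumulative sums, so a reader has to difference neighboring segment edges to get the counts back.

```python
import numpy as np

# Hypothetical counts for one stacked bar, in stacking order
toy_counts = np.array([2, 5, 3])

# Positions of the segment edges along the count axis
segment_edges = np.concatenate(([0], np.cumsum(toy_counts)))

# Differencing neighboring edges recovers each count
recovered_counts = np.diff(segment_edges)
```

In the strip plot, the position of each glyph is the count itself, so no such arithmetic is needed.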
Why no bar graphs?
I strongly prefer strip plots (with jitter) to box plots and ECDFs to histograms. Why? Because in the strip plots and ECDFs, you are plotting all of your data. In practice, these are the only two types of visualizations for data with a categorical axis I use (though I’ll sometimes overlay a jitter on a box plot to show some of the summary statistics).
A bar graph is the antithesis of plotting all of your data. You distill all of the information in the data set down to one or two summary statistics, and then use giant glyphs to show them. You should plot all of your data, so you shouldn’t make bar graphs. iqplot will not help you practice bad plotting.
So why does iqplot have box-and-whisker plots? One may argue that it is nonetheless valuable to plot summary statistics, which is what box plots do. In that case, at least five summary statistics are plotted (the ends of the whiskers, the ends of the box, and the median). While this is still not plotting all of the data, it is still better than a dynamite graph (bar graph with error bars), which shows at most three summary statistics (height of bar, and lower and upper bound of confidence interval). But still, why does iqplot enable box plots, but not bar graphs?
The answer is that there are many ways to specify the summary statistic used in bar graphs. We could choose the height of the bar to be the mean of the data and the error bars to have a length given by the standard error of the mean. We could have the height of the bar be the median and the error bars be a possibly asymmetric confidence interval obtained by bootstrapping the median. And there are many more possibilities.
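To illustrate the ambiguity, here is a minimal sketch of the two specifications just mentioned, computed for the same toy sample. The data values, the seed, and the number of bootstrap replicates are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3252)

# Hypothetical measurements, for illustration only
toy_data = np.array([12.0, 15.5, 14.0, 30.0, 13.5, 16.0, 14.5, 15.0])

# Choice 1: bar height is the mean; error bar is the standard error of the mean
mean = np.mean(toy_data)
sem = np.std(toy_data, ddof=1) / np.sqrt(len(toy_data))

# Choice 2: bar height is the median; error bar is a 95% percentile bootstrap
# confidence interval, which need not be symmetric about the median
median = np.median(toy_data)
bs_medians = np.array(
    [
        np.median(rng.choice(toy_data, size=len(toy_data), replace=True))
        for _ in range(2000)
    ]
)
ci_low, ci_high = np.percentile(bs_medians, [2.5, 97.5])
```

The two choices give bars of different heights (16.3125 versus 14.75 here) with different error bars, yet both would be drawn as "a bar graph with error bars."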
Conversely, if we stick to the widely-used (almost universally used, as far as I can tell) Tukey specification of a box-and-whisker plot, we are only plotting percentiles of the data. These assume no underlying statistical model, so the plots are unambiguous.
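As a sketch of what the Tukey specification involves, the function below computes the quantities drawn in such a plot. It assumes NumPy's default (linearly interpolated) percentiles and whiskers that extend to the most extreme data points within 1.5 times the interquartile range of the box; the function name is hypothetical, and this is not necessarily how iqplot computes them internally.

```python
import numpy as np


def tukey_box_stats(data):
    """Compute the values drawn in a Tukey box-and-whisker plot."""
    data = np.asarray(data, dtype=float)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme data points inside the fences
    inside = data[(data >= low_fence) & (data <= high_fence)]
    return {
        "bottom_whisker": inside.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "top_whisker": inside.max(),
        "outliers": data[(data < low_fence) | (data > high_fence)],
    }


toy_stats = tukey_box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Everything here follows from percentiles of the data alone; there is no model or resampling choice to be made.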
Regardless, making bar graphs is not particularly challenging, as shown below.
[12]:
# Compute mean MPG by region of origin
df_mean = df.groupby('origin')['mpg'].mean().reset_index()
# Build plot
p = bokeh.plotting.figure(
    frame_height=200,
    frame_width=400,
    x_axis_label='mean mpg',
    y_range=df_mean['origin'][::-1],
)
# Add bars
p.hbar(
    source=df_mean,
    y='origin',
    right='mpg',
    height=0.6
)
# Turn off gridlines on categorical axis
p.ygrid.grid_line_color = None
# Start axes at origin on quantitative axis
p.x_range.start = 0
bokeh.io.show(p)
Why no violin plots?
Similarly to histograms, violin plots are a way to visualize the probability density function (pdf) of a quantitative variable. Violin plots accomplish this using kernel density estimation (KDE), a procedure by which a smooth function is used to approximate the pdf of a random variable. Just as binning must be specified for a histogram, the kernel and its bandwidth must be specified to compute a KDE. So, like histograms, violin plots require specification of arbitrary parameters.
The two big shortcomings of histograms are:
They break the rule of “plot all of your data” (though we sort of deal with this by including a rug plot).
The choice of binning is arbitrary.
Violin plots suffer from both of these shortcomings and have the additional complication that they can assign nonzero density to values beyond the extremes of the measured data, even into unphysical territory.
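Both issues can be seen in a small hand-rolled Gaussian KDE. This is a sketch for illustration, not SciPy's implementation; the function name, the toy data, and the two bandwidths are arbitrary assumptions.

```python
import numpy as np


def gaussian_kde_by_hand(data, x, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `data` at points `x`."""
    data = np.asarray(data, dtype=float)
    x = np.asarray(x, dtype=float)
    # One Gaussian bump per data point, averaged over the data
    z = (x[:, np.newaxis] - data[np.newaxis, :]) / bandwidth
    bumps = np.exp(-(z**2) / 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)


# Hypothetical non-negative measurements, for illustration only
toy_data = np.array([0.5, 1.0, 1.2, 2.0, 3.5])
x = np.linspace(-2.0, 6.0, 161)

# Same data, two bandwidths, two different pictures of the pdf
pdf_narrow = gaussian_kde_by_hand(toy_data, x, bandwidth=0.2)
pdf_wide = gaussian_kde_by_hand(toy_data, x, bandwidth=1.0)

# Approximate probability mass assigned to negative values, even though
# every measurement is positive
mass_below_zero = np.sum(pdf_wide[x < 0]) * (x[1] - x[0])
```

The bandwidth plays the role that bin width plays for a histogram, and the wide kernel leaks density below zero, which would be unphysical for a quantity that cannot be negative.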
I view histograms as an auxiliary feature of iqplot to visualize pdfs, with ECDFs being far more powerful for visualizing distributions. As such, I did not extend the functionality to include another pdf visualizer which is, in my opinion, not any better and actually worse than histograms.
For this reason, I do not include any other KDE-based plots, such as ridgeline plots.
Nonetheless, if you wish to overlay KDEs on a strip plot, it is doable without too much code.
[13]:
# Make the strip plot
p = iqplot.strip(
    df, q="mpg", cats="origin", spread="jitter", cat_grid=True, x_range=[5, 50]
)
# x-values for mirrored KDE
x_upper = np.linspace(5, 50, 400)
x = np.concatenate((x_upper, x_upper[::-1]))
# Scale by 5 so that the KDE is more visible
scale_factor = 5
# Build the KDEs
for name, g in df.groupby("origin"):
    # Evaluate Gaussian KDE with default parameters (Scott's rule for bandwidth)
    pdf_fun = scipy.stats.gaussian_kde(g["mpg"])
    y_vals = scale_factor * pdf_fun(x_upper)
    # Construct y-values for above and below category on y-axis
    y_upper = [(name, y_val) for y_val in y_vals]
    y_lower = [(y[0], -y[1]) for y in y_upper][::-1]
    y = y_upper + y_lower
    # Add the KDE
    p.line(x, y, color="gray")
    p.patch(x, y, fill_color="gray", alpha=0.3, level="underlay")
# Add a histogram for comparison
p_hist = iqplot.striphistogram(
    df,
    q="mpg",
    cats="origin",
    spread="jitter",
    x_range=[5, 50],
    fill_kwargs=dict(fill_color="gray"),
    line_kwargs=dict(line_color="gray"),
)
bokeh.io.show(bokeh.layouts.row(p, bokeh.layouts.Spacer(width=30), p_hist))
Why no extended box plots?
The box in a box-and-whisker plot contains the middle two quartiles of the quantitative data. One can add more boxes containing different percentile ranges, and such plots are called extended box plots. I did not include this functionality because I view box plots as annotations of well-defined, visually interpretable summary statistics. Extending the box plots becomes challenging because annotation or textual description of the edges of all of the boxes is necessary. If more percentiles are needed, they may be added to a plot, e.g., with dashes. Here is an example where we want to add the 10th and 90th percentiles of the data in red.
[14]:
# Make a stripbox plot
p = iqplot.stripbox(
    data=df,
    q="mpg",
    cats="origin",
    spread="jitter",
)
# Add 10th and 90th percentiles
df_10 = df.groupby("origin")["mpg"].quantile(0.1).reset_index()
df_90 = df.groupby("origin")["mpg"].quantile(0.9).reset_index()
p.scatter(
    source=df_10,
    x="mpg",
    y="origin",
    marker="dash",
    angle=np.pi / 2,
    color="tomato",
    size=40,
    line_width=2,
)
p.scatter(
    source=df_90,
    x="mpg",
    y="origin",
    marker="dash",
    angle=np.pi / 2,
    color="tomato",
    size=40,
    line_width=2,
)
bokeh.io.show(p)
Why no lollipop plots?
Lollipop plots visually resemble spike plots that are included in iqplot. They consist of a spike with a dot on top of it, resembling a lollipop. There are generally two types of plots referred to as lollipop plots.
1. Plots that are the same as bar graphs, except lollipops replace the bars.
2. Plots where the lollipop represents the count of a given category.
Lollipop plots like those described in item 1 are not included in iqplot for the same reason bar graphs are excluded.
Lollipop plots like those described in item 2 are not included because there is no quantitative variable. Rather, the positioning of each lollipop is the count of the number of times a given category appears. Contrast this with a spike plot, which is the count of the number of times a given quantitative measurement was observed.
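The distinction between the two kinds of counts can be sketched with a few made-up records; the categories and values below are hypothetical.

```python
from collections import Counter

# Hypothetical records of (category, quantitative measurement)
toy_records = [("A", 4), ("A", 6), ("B", 4), ("B", 4), ("B", 8), ("C", 6)]

# Lollipop plot of the second type: how many times each category appears
category_counts = Counter(cat for cat, _ in toy_records)

# Spike plot: how many times each quantitative value was observed
value_counts = Counter(val for _, val in toy_records)
```

In the first tally the quantitative measurements are never used, which is why such a plot falls outside the one-quantitative-variable scope.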
With a little data frame manipulation, we can make a lollipop plot. Here is an example that gives the count of the total number of cars from each region in the data set.
[15]:
# With pandas >= 2.0, the resulting columns are named "origin" and "count"
df_lolly = (
    df["origin"]
    .value_counts()
    .reset_index()
)
df_lolly["count"] = df_lolly["count"].astype(int)
p = iqplot.strip(
    df_lolly,
    q="count",
    cats="origin",
    marker_kwargs=dict(size=10, color="#1f77b4"),
    x_range=[0, 1.05 * df_lolly["count"].max()],
)
p.segment(
    x0=0,
    y0=df_lolly["origin"],
    x1=df_lolly["count"],
    y1=df_lolly["origin"],
    color="#1f77b4",
    line_width=2,
)
bokeh.io.show(p)
Why is there such limited statistical functionality?
Many plotting packages refer to themselves as “statistical visualization packages” or the like, with the excellent Seaborn package being an example. iqplot is not meant to be that. It only provides bootstrap confidence intervals for ECDFs and histograms and quantile calculations necessary to make box-and-whisker plots. The quantile calculations for box-and-whisker plots are obvious inclusions. The confidence intervals for ECDFs and histograms are included because they are confidence intervals of the plots themselves. (One could also make confidence intervals for the quantiles in a box plot, thereby constituting confidence intervals of the plot itself, but it is unclear how to display the result in a clear way.) Other statistical inference should be done as needed in a bespoke manner for a given inference problem for a given data set.
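To make concrete what a confidence interval "of the plot itself" means for an ECDF, here is a minimal sketch of a pointwise percentile bootstrap band. It is an illustration under stated assumptions, not iqplot's internal implementation; the function name, its parameters, and the toy data are chosen for this sketch only.

```python
import numpy as np


def ecdf_conf_int(data, x, n_bs_reps=1000, ptiles=(2.5, 97.5), seed=None):
    """Pointwise percentile bootstrap band for the ECDF of `data` at points `x`."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    x = np.asarray(x, dtype=float)
    bs_ecdfs = np.empty((n_bs_reps, len(x)))
    for i in range(n_bs_reps):
        # Resample the data with replacement and sort the resample
        bs_sample = np.sort(rng.choice(data, size=len(data), replace=True))
        # Fraction of resampled points less than or equal to each x
        bs_ecdfs[i] = np.searchsorted(bs_sample, x, side="right") / len(data)
    return np.percentile(bs_ecdfs, ptiles, axis=0)


# Hypothetical measurements, for illustration only
toy_data = np.array([18.0, 15.0, 18.0, 16.0, 17.0, 22.0, 25.0, 31.0, 27.0, 24.0])
x = np.linspace(10, 35, 26)
ecdf_low, ecdf_high = ecdf_conf_int(toy_data, x, seed=3252)
```

The band is computed directly from resampled ECDFs, so it annotates the plotted curve rather than answering a separate inference question.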