Drawing scatterplots
====================

The ``incenp.plotting.scatterplot`` module provides a ``scatterplot``
function to facilitate the creation of scatter plots.

Note that what I call a “scatter plot” here may not match the most
common meaning of the term. I do *not* mean the 2-dimensional plotting
of two variables (one on the x-axis, the other on the y-axis). Rather, I
mean the plotting of a single variable on the y-axis, akin to a bar
chart, but with all data points depicted as scattered dots.
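
To make the distinction concrete, here is a minimal sketch of that kind
of plot written with plain Matplotlib. It is *not* how the
``scatterplot`` function is implemented; the group names, the random
data, and the jitter width are assumptions chosen only for illustration.

.. code-block:: python

   import matplotlib
   matplotlib.use('Agg')  # non-interactive backend, no display needed
   import matplotlib.pyplot as plt
   import numpy as np

   rng = np.random.default_rng(0)
   # One variable, measured in two groups (made-up data)
   groups = {'foo': rng.normal(0, 1, 40), 'bar': rng.normal(1, 1, 40)}

   fig, ax = plt.subplots()
   for position, (name, values) in enumerate(groups.items()):
       # Small random horizontal offsets keep the dots from overlapping
       jitter = rng.uniform(-0.15, 0.15, size=len(values))
       ax.plot(position + jitter, values, 'o', alpha=0.6, label=name)
   ax.set_xticks(list(range(len(groups))))
   ax.set_xticklabels(list(groups))

Each group occupies one position on the x-axis, as a bar would in a bar
chart, but every individual value stays visible.
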

.. figure:: scatterplot1.png

   A sample scatter plot.

The figure above is a sample “scatter plot”. The orange boxes are not
part of the plot; they have been added to illustrate what *tracks* and
*subtracks* are in the context of the ``incenp.plotting.scatterplot``
module.

Sample data
-----------

The module is intended to work with indexed `DataFrame` objects
(including multi-indexed `DataFrame`). Let’s create such an object,
which we will use throughout this page:

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Two-level index: 4 values on 'first', 2 values on 'second'
   index = pd.MultiIndex.from_arrays([
       ['foo'] * 40 + ['bar'] * 40 + ['baz'] * 40 + ['qux'] * 40,
       ['one', 'two'] * 80
       ],
       names=['first', 'second'])
   df = pd.DataFrame(np.random.randn(160, 4), index=index,
                     columns=['A', 'B', 'C', 'D'])

This creates a `DataFrame` with 4 columns (``A`` to ``D``) and 160
rows, indexed in two levels (level ``first``, with 4 distinct values
``foo``, ``bar``, ``baz``, and ``qux``; and level ``second``, with 2
distinct values ``one`` and ``two``).
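
The layout described above can be checked with a few pandas calls. The
sketch below rebuilds the same `DataFrame` so that it runs on its own;
the ``counts`` name is only an illustration.

.. code-block:: python

   import numpy as np
   import pandas as pd

   index = pd.MultiIndex.from_arrays([
       ['foo'] * 40 + ['bar'] * 40 + ['baz'] * 40 + ['qux'] * 40,
       ['one', 'two'] * 80
       ],
       names=['first', 'second'])
   df = pd.DataFrame(np.random.randn(160, 4), index=index,
                     columns=['A', 'B', 'C', 'D'])

   print(df.shape)              # number of rows and columns
   print(list(df.index.names))  # names of the two index levels

   # Number of rows for each (first, second) combination
   counts = df.groupby(level=['first', 'second']).size()
   print(counts)

Every combination of a ``first`` value and a ``second`` value holds the
same number of rows, so each subtrack in the examples of this page
receives the same number of points.
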

Quick start
-----------

As an initial example, here is the call to ``scatterplot`` that draws
the graph above (``ax`` is assumed to be a `matplotlib.axes.Axes`
object):

.. code-block:: python

   from incenp.plotting.scatterplot import scatterplot

   scatterplot(ax, df, columns='A',
               tracks=['foo', 'bar', 'baz'], trackname='first',
               subtracks=['one', 'two'], subtrackname='second')
   ax.legend(['one', 'two'])

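
For completeness, here is one common way of obtaining the ``ax`` object
with Matplotlib. The figure size, the axis label, and the in-memory
buffer are assumptions made for this sketch; none of them is required by
``scatterplot``.

.. code-block:: python

   import io

   import matplotlib
   matplotlib.use('Agg')  # non-interactive backend, no display needed
   import matplotlib.pyplot as plt

   # Create a figure with a single Axes object
   fig, ax = plt.subplots(figsize=(6, 4))

   # ... the scatterplot(ax, df, ...) call would go here ...
   ax.set_ylabel('A')

   # Render the figure as PNG (into memory here; a file name also works)
   buffer = io.BytesIO()
   fig.savefig(buffer, format='png')
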
The ``columns`` parameter indicates that the values to be plotted come
from the column named ``A``.

The ``tracks`` parameter gives the index values used to distribute the
values of column ``A`` into three different tracks (one track for rows
with index value ``foo``, one track for rows with index value ``bar``,
and so on); the associated ``trackname`` parameter indicates which index
level to use to look up the values specified in the previous parameter,
if ``df`` is a multi-indexed `DataFrame`.

The ``subtracks`` and ``subtrackname`` parameters are similar to the
``tracks`` and ``trackname`` parameters above, but for subtracks instead
of tracks. Here, they are used to say that values from rows with index
value ``one`` are to be plotted on one subtrack, while values from rows
with index value ``two`` are to be plotted on another subtrack.
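
The way rows are distributed between tracks and subtracks can be
pictured with plain pandas. The sketch below is only an illustration of
the selection described above, under the assumption that each
(track, subtrack) pair simply receives the matching rows; it does not
reproduce the internals of ``scatterplot``, and the ``cells`` dictionary
is a hypothetical name.

.. code-block:: python

   import numpy as np
   import pandas as pd

   index = pd.MultiIndex.from_arrays([
       ['foo'] * 40 + ['bar'] * 40 + ['baz'] * 40 + ['qux'] * 40,
       ['one', 'two'] * 80
       ],
       names=['first', 'second'])
   df = pd.DataFrame(np.random.randn(160, 4), index=index,
                     columns=['A', 'B', 'C', 'D'])

   # Index values of each row, for the two levels
   first = df.index.get_level_values('first')
   second = df.index.get_level_values('second')

   # One Series of column 'A' per (track, subtrack) pair
   cells = {}
   for track in ['foo', 'bar', 'baz']:
       for subtrack in ['one', 'two']:
           mask = (first == track) & (second == subtrack)
           cells[(track, subtrack)] = df.loc[mask, 'A']

   for key, values in cells.items():
       print(key, len(values))

Rows with the index value ``qux`` do not appear in any cell, because
``qux`` is not listed in ``tracks``.
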

Playing with tracks, subtracks, columns
---------------------------------------

The following code will plot the same values as above, but will invert
the tracks and the subtracks: the second-level index (``second``) will
be used to distribute values along tracks while the first-level index
(``first``) will be used to distribute values along subtracks:

.. code-block:: python

   scatterplot(ax, df, columns='A',
               tracks=['one', 'two'], trackname='second',
               subtracks=['foo', 'bar', 'baz'], subtrackname='first')
   ax.legend(['foo', 'bar', 'baz'])


.. figure:: scatterplot2.png

   A scatterplot with inverted tracks and subtracks.

Values from several columns in the source `DataFrame` can be plotted at
once, by giving a list of column names (instead of a single name) to the
``columns`` parameter. By default, values from each column are plotted
in a different track. In the following example, values from the columns
``A``, ``B``, and ``C`` are plotted; the first-level index is used to
distribute values along three different subtracks; the second-level
index is used to filter the `DataFrame` prior to plotting so that only
rows with the index value ``one`` are plotted.

.. code-block:: python

   scatterplot(ax, df.xs('one', level='second'),
               columns=['A', 'B', 'C'],
               subtracks=['foo', 'baz', 'qux'], subtrackname='first')
   ax.legend(['foo', 'baz', 'qux'])


.. figure:: scatterplot3.png

   A scatterplot with values from several columns of the source
   DataFrame, plotted as separate tracks.

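
The filtering step in that example relies on the pandas
``DataFrame.xs`` method. The sketch below shows what the call returns
for the sample data; the ``one_only`` name is only an illustration.

.. code-block:: python

   import numpy as np
   import pandas as pd

   index = pd.MultiIndex.from_arrays([
       ['foo'] * 40 + ['bar'] * 40 + ['baz'] * 40 + ['qux'] * 40,
       ['one', 'two'] * 80
       ],
       names=['first', 'second'])
   df = pd.DataFrame(np.random.randn(160, 4), index=index,
                     columns=['A', 'B', 'C', 'D'])

   # Keep only rows whose 'second' level is 'one'; that level is dropped
   one_only = df.xs('one', level='second')

   print(one_only.shape)
   print(one_only.index.name)

The result keeps only the ``first`` level as its index, which is the
level that ``subtrackname='first'`` then refers to.
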
To plot values from several columns as different subtracks rather than
different tracks, use the ``subtrackcolumns`` parameter as in the
example below. The ``tracks`` and ``trackname`` parameters may then be
used to define what goes into the tracks.

.. code-block:: python

   scatterplot(ax, df.xs('one', level='second'),
               columns=['A', 'B', 'C'], subtrackcolumns=True,
               tracks=['foo', 'baz', 'qux'], trackname='first')
   ax.legend(['A', 'B', 'C'])


.. figure:: scatterplot4.png

   A scatterplot with values from several columns of the source
   DataFrame, plotted as separate subtracks.


Miscellaneous features
----------------------

When plotting *two* subtracks, the ``testfunc`` parameter may be used to
have the ``scatterplot`` function draw the result of a statistical test
comparing the values from the two subtracks in each track.

The value of the ``testfunc`` parameter should be a function accepting
two `Series` and returning a P-value, such as the following wrapper
around SciPy’s ``mannwhitneyu`` function:

.. code-block:: python

   from scipy.stats import mannwhitneyu

   def do_mannwhitney(a, b):
       result = mannwhitneyu(a, b)
       return result.pvalue

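
Such a wrapper can be tried on its own before being handed to
``scatterplot``. In the sketch below, the two `Series` are made-up
samples drawn from normal distributions with different means; the
``a``, ``b``, and ``p_shifted`` names are only illustrations.

.. code-block:: python

   import numpy as np
   import pandas as pd
   from scipy.stats import mannwhitneyu

   def do_mannwhitney(a, b):
       result = mannwhitneyu(a, b)
       return result.pvalue

   rng = np.random.default_rng(42)
   a = pd.Series(rng.normal(0.0, 1.0, 30))  # sample centred on 0
   b = pd.Series(rng.normal(3.0, 1.0, 30))  # sample centred on 3

   # The function returns a single number between 0 and 1
   p_shifted = do_mannwhitney(a, b)
   print(p_shifted)

Any function with the same shape (two samples in, one P-value out)
should be usable in the same way.
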
Below is an example of using such a wrapper, with the resulting plot:
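
.. code-block:: python

   scatterplot(ax, df, columns='B',
               tracks=['foo', 'baz', 'qux'], trackname='first',
               subtracks=['one', 'two'], subtrackname='second',
               # one-letter color codes, one per subtrack (illustrative values)
               testfunc=do_mannwhitney, colors='rg')
   ax.legend(['one', 'two'])
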

.. figure:: scatterplot5.png

   A scatterplot with results of statistical tests between subtracks.

The example above also shows the ``colors`` parameter, used to change
the colors for the different subtracks. It can either be a string
containing one-letter color codes, or a list of Matplotlib colors. The
string or the list must be at least as long as the number of subtracks
to plot.
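
The two accepted forms can be compared with Matplotlib's own color
conversion. The ``expand_colors`` helper below is hypothetical and is
not part of ``incenp.plotting``; it only illustrates, under that
assumption, how a string of one-letter codes and a list of color names
both end up as one color per subtrack.

.. code-block:: python

   from matplotlib.colors import to_rgba

   def expand_colors(colors, n_subtracks):
       """Hypothetical helper: turn a color spec into RGBA tuples."""
       # A string such as 'rg' is read as one one-letter code per subtrack
       items = list(colors)
       if len(items) < n_subtracks:
           raise ValueError('need at least one color per subtrack')
       return [to_rgba(c) for c in items[:n_subtracks]]

   print(expand_colors('rg', 2))
   print(expand_colors(['tab:blue', '#ff7f0e'], 2))
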