Start Here: ``datascience`` Tutorial
====================================

This is a brief introduction to the functionality in
:py:mod:`datascience`.  For a complete reference guide, please see
:ref:`tables-overview`.

For other useful tutorials and examples, see:

- `The textbook introduction to Tables`_
- `Example notebooks`_

.. _The textbook introduction to Tables: http://www.inferentialthinking.com/chapter1/tables.html
.. _Example notebooks: https://github.com/deculler/TableDemos

.. contents:: Table of Contents
    :depth: 2
    :local:

Getting Started
---------------

The most important functionality in the package is is the :py:class:`Table`
class, which is the structure used to represent columns of data. First, load
the class:

.. ipython:: python

    from datascience import Table

In the IPython notebook, type ``Table.`` followed by the TAB-key to see a list
of members.

Note that for the Data Science 8 class we also import additional packages and
settings for all assignments and labs. This is so that plots and other available
packages mirror the ones in the textbook more closely. The exact code we use is:

.. code-block:: python

    # HIDDEN

    import matplotlib
    matplotlib.use('Agg')
    from datascience import Table
    %matplotlib inline
    import matplotlib.pyplot as plt
    import numpy as np
    plt.style.use('fivethirtyeight')

In particular, the lines involving ``matplotlib`` allow for plotting within the
IPython notebook.

Creating a Table
----------------

A Table is a sequence of labeled columns of data.

A Table can be constructed from scratch by extending an empty table with
columns.

.. ipython:: python

    t = Table().with_columns([
        'letter', ['a', 'b', 'c', 'z'],
        'count',  [  9,   3,   3,   1],
        'points', [  1,   2,   2,  10],
    ])

    print(t)

------

More often, a table is read from a CSV file (or an Excel spreadsheet).  Here's
the content of an example file:

.. ipython:: python

    cat sample.csv

And this is how we load it in as a :class:`Table` using
:meth:`~datascience.tables.Table.read_table`:

.. ipython:: python

    Table.read_table('sample.csv')

CSVs from URLs are also valid inputs to
:meth:`~datascience.tables.Table.read_table`:

.. ipython:: python

    Table.read_table('http://data8.org/textbook/notebooks/sat2014.csv')

------

It's also possible to add columns from a dictionary, but this option is
discouraged because dictionaries do not preserve column order.

.. ipython:: python

    t = Table().with_columns({
        'letter': ['a', 'b', 'c', 'z'],
        'count':  [  9,   3,   3,   1],
        'points': [  1,   2,   2,  10],
    })

    print(t)

Accessing Values
----------------

To access values of columns in the table, use
:meth:`~datascience.tables.Table.column`, which takes a column label or index
and returns an array. Alternatively, :meth:`~datascience.tables.Table.columns`
returns a list of columns (arrays).

.. ipython:: python

    t

    t.column('letter')
    t.column(1)

You can use bracket notation as a shorthand for this method:

.. ipython:: python

    t['letter'] # This is a shorthand for t.column('letter')
    t[1]        # This is a shorthand for t.column(1)

To access values by row, :meth:`~datascience.tables.Table.row` returns a
row by index. Alternatively, :meth:`~datascience.tables.Table.rows` returns an
list-like :class:`~datascience.tables.Table.Rows` object that contains
tuple-like :class:`~datascience.tables.Table.Row` objects.

.. ipython:: python

    t.rows
    t.rows[0]
    t.row(0)

    second = t.rows[1]
    second
    second[0]
    second[1]

To get the number of rows, use :attr:`~datascience.tables.Table.num_rows`.

.. ipython:: python

    t.num_rows


Manipulating Data
-----------------

Here are some of the most common operations on data. For the rest, see the
reference (:ref:`tables-overview`).

Adding a column with :meth:`~datascience.tables.Table.with_column`:

.. ipython:: python

    t
    t.with_column('vowel?', ['yes', 'no', 'no', 'no'])
    t # .with_column returns a new table without modifying the original

    t.with_column('2 * count', t['count'] * 2) # A simple way to operate on columns

Selecting columns with :meth:`~datascience.tables.Table.select`:

.. ipython:: python

    t.select('letter')
    t.select(['letter', 'points'])

Renaming columns with :meth:`~datascience.tables.Table.relabeled`:

.. ipython:: python

    t
    t.relabeled('points', 'other name')
    t
    t.relabeled(['letter', 'count', 'points'], ['x', 'y', 'z'])

Selecting out rows by index with :meth:`~datascience.tables.Table.take` and
conditionally with :meth:`~datascience.tables.Table.where`:

.. ipython:: python

    t
    t.take(2) # the third row
    t.take[0:2] # the first and second rows

.. ipython:: python

    t.where('points', 2) # rows where points == 2
    t.where(t['count'] < 8) # rows where count < 8

    t['count'] < 8 # .where actually takes in an array of booleans
    t.where([False, True, True, True]) # same as the last line

Operate on table data with :meth:`~datascience.tables.Table.sort`,
:meth:`~datascience.tables.Table.group`, and
:meth:`~datascience.tables.Table.pivot`

.. ipython:: python

    t
    t.sort('count')
    t.sort('letter', descending = True)

.. ipython:: python

    # You may pass a reducing function into the collect arg
    # Note the renaming of the points column because of the collect arg
    t.select(['count', 'points']).group('count', collect=sum)

.. ipython:: python

    other_table = Table().with_columns([
        'mar_status',  ['married', 'married', 'partner', 'partner', 'married'],
        'empl_status', ['Working as paid', 'Working as paid', 'Not working',
                        'Not working', 'Not working'],
        'count',       [1, 1, 1, 1, 1]])
    other_table

    other_table.pivot('mar_status', 'empl_status', 'count', collect=sum)

Visualizing Data
----------------

We'll start with some data drawn at random from two normal distributions:

.. ipython:: python

    normal_data = Table().with_columns([
        'data1', np.random.normal(loc = 1, scale = 2, size = 100),
        'data2', np.random.normal(loc = 4, scale = 3, size = 100)])

    normal_data

Draw histograms with :meth:`~datascience.tables.Table.hist`:

.. ipython:: python

    @savefig hist.png width=4in
    normal_data.hist()

.. ipython:: python

    @savefig hist_binned.png width=4in
    normal_data.hist(bins = range(-5, 10))

.. ipython:: python

    @savefig hist_overlay.png width=4in
    normal_data.hist(bins = range(-5, 10), overlay = True)

If we treat the ``normal_data`` table as a set of x-y points, we can
:meth:`~datascience.tables.Table.plot` and
:meth:`~datascience.tables.Table.scatter`:

.. ipython:: python

    @savefig plot.png width=4in
    normal_data.sort('data1').plot('data1') # Sort first to make plot nicer

.. ipython:: python

    @savefig scatter.png width=4in
    normal_data.scatter('data1')

.. ipython:: python

    @savefig scatter_line.png width=4in
    normal_data.scatter('data1', fit_line = True)

Use :meth:`~datascience.tables.Table.barh` to display categorical data.

.. ipython:: python

    t
    @savefig barh.png width=4in
    t.barh('letter')

Exporting
---------

Exporting to CSV is the most common operation and can be done by first
converting to a pandas dataframe with :meth:`~datascience.tables.Table.to_df`:

.. ipython:: python

    normal_data

    # index = False prevents row numbers from appearing in the resulting CSV
    normal_data.to_df().to_csv('normal_data.csv', index = False)

An Example
----------

We'll recreate the steps in `Chapter 3 of the textbook`_ to see if there is a
significant difference in birth weights between smokers and non-smokers using a
bootstrap test.

For more examples, check out `the TableDemos repo`_.

.. _Chapter 3 of the textbook: http://data8.org/text/3_inference.html#Using-the-Bootstrap-Method-to-Test-Hypotheses
.. _the TableDemos repo: https://github.com/deculler/TableDemos

From the text:

    The table ``baby`` contains data on a random sample of 1,174 mothers and
    their newborn babies. The column ``birthwt`` contains the birth weight of
    the baby, in ounces; ``gest_days`` is the number of gestational days, that
    is, the number of days the baby was in the womb. There is also data on
    maternal age, maternal height, maternal pregnancy weight, and whether or not
    the mother was a smoker.

.. ipython:: python

    baby = Table.read_table('http://data8.org/textbook/notebooks/baby.csv')
    baby # Let's take a peek at the table

    # Select out columns we want.
    smoker_and_wt = baby.select(['m_smoker', 'birthwt'])
    smoker_and_wt

Let's compare the number of smokers to non-smokers.

.. ipython:: python

    @savefig m_smoker.png width=4in
    smoker_and_wt.select('m_smoker').hist(bins = [0, 1, 2]);

We can also compare the distribution of birthweights between smokers and
non-smokers.

.. ipython:: python

    # Non smokers
    # We do this by grabbing the rows that correspond to mothers that don't
    # smoke, then plotting a histogram of just the birthweights.
    @savefig not_m_smoker_weights.png width=4in
    smoker_and_wt.where('m_smoker', 0).select('birthwt').hist()

    # Smokers
    @savefig m_smoker_weights.png width=4in
    smoker_and_wt.where('m_smoker', 1).select('birthwt').hist()

What's the difference in mean birth weight of the two categories?

.. ipython:: python

    nonsmoking_mean = smoker_and_wt.where('m_smoker', 0).column('birthwt').mean()
    smoking_mean = smoker_and_wt.where('m_smoker', 1).column('birthwt').mean()

    observed_diff = nonsmoking_mean - smoking_mean
    observed_diff

Let's do the bootstrap test on the two categories.

.. ipython:: python

    num_nonsmokers = smoker_and_wt.where('m_smoker', 0).num_rows
    def bootstrap_once():
        """
        Computes one bootstrapped difference in means.
        The table.sample method lets us take random samples.
        We then split according to the number of nonsmokers in the original sample.
        """
        resample = smoker_and_wt.sample(with_replacement = True)
        bootstrap_diff = resample.column('birthwt')[:num_nonsmokers].mean() - \
            resample.column('birthwt')[num_nonsmokers:].mean()
        return bootstrap_diff

    repetitions = 1000
    bootstrapped_diff_means = np.array(
        [ bootstrap_once() for _ in range(repetitions) ])

    bootstrapped_diff_means[:10]

    num_diffs_greater = (abs(bootstrapped_diff_means) > abs(observed_diff)).sum()
    p_value = num_diffs_greater / len(bootstrapped_diff_means)
    p_value


Drawing Maps
------------
To come.