"""
The GraphLab nearest neighbors toolkit is used to find the rows in a data table
that are most similar to a query row.

Finding nearest neighbors is a two-stage process, analogous to many other
GraphLab toolkits. First a
:py:class:`~graphlab.nearest_neighbors.NearestNeighborsModel` is created, using
a reference dataset contained in an :class:`~graphlab.SFrame`. Then this model
is queried to find the nearest reference data points to a set of new query
points, also stored in an SFrame. Here we download a toy dataset of house
attributes and prices.

.. sourcecode:: python

    >>> import graphlab as gl
    >>> sf = gl.SFrame('http://s3.amazonaws.com/GraphLab-Datasets/regression/houses.csv')

Because the features in this dataset have very different scales (e.g. price is
in the hundreds of thousands while the number of bedrooms is in the single
digits), it is important to standardize so that each feature is measured in
terms of standard deviations from the mean. In addition, both reference and
query datasets must have a string column with row labels. As downloaded, this
dataset does not have such a column, so we add a row index and cast it to a
string.

.. sourcecode:: python

    >>> for c in sf.column_names():         # standardize columns to have same scale
    ...     sf[c] = (sf[c] - sf[c].mean()) / sf[c].std()

    >>> sf.add_row_label(column_name='house_id')
    >>> sf['house_id'] = sf['house_id'].astype(str)
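
The effect of standardization can be checked without GraphLab. The following is
a minimal plain-Python sketch, not part of the toolkit: the ``standardize``
helper and the sample values are hypothetical, and it assumes the population
standard deviation (dividing by ``n``), which may differ from the convention
used by ``std()`` above.

```python
import math

def standardize(values):
    # Return z-scores: (x - mean) / population standard deviation.
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values]

# Hypothetical values on very different scales.
prices = [250000.0, 310000.0, 180000.0, 420000.0]
bedrooms = [3.0, 4.0, 2.0, 5.0]

z_prices = standardize(prices)
z_bedrooms = standardize(bedrooms)
```

After this step both columns have mean zero and unit variance, so neither
dominates a distance computation purely because of its units.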

The nearest neighbors model can be created with all of the columns in the
reference dataset, or a list of particular features.

.. sourcecode:: python

    >>> model = gl.nearest_neighbors.create(sf, label='house_id')
    >>> model = gl.nearest_neighbors.create(sf, label='house_id', features=['bedroom', 'bath', 'size'])

To retrieve the five closest neighbors for each query point, query the model.
The result is an SFrame with four columns: query label, reference label,
distance, and rank of the reference point among the query point's nearest
neighbors. Query points are also contained in an SFrame, which must contain
columns with the same names as those used to construct the model. Often, the
reference SFrame will be used as the query SFrame as well.

.. sourcecode:: python

    >>> knn = model.query(sf, label='house_id', features=['bedroom', 'bath', 'size'], k=5)
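
To make the shape of the result concrete, here is a minimal plain-Python sketch
of a k-nearest-neighbors query that emits the same four fields per row. It is
an illustration only, not the toolkit's implementation: ``brute_force_query``
and the tiny ``houses`` dictionary are hypothetical, Euclidean distance is
assumed, and ranks are assumed to start at 1.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_query(reference, queries, k):
    # reference, queries: dict mapping a string label to a feature tuple.
    # Returns rows of (query label, reference label, distance, rank).
    rows = []
    for q_label, q in queries.items():
        # Compare the query against every reference point, closest first.
        dists = sorted((euclidean(q, r), r_label)
                       for r_label, r in reference.items())
        for rank, (d, r_label) in enumerate(dists[:k], start=1):
            rows.append((q_label, r_label, d, rank))
    return rows

houses = {"0": (0.0, 0.0), "1": (1.0, 0.0), "2": (0.0, 2.0), "3": (3.0, 3.0)}

# Use the reference data as the query data as well.
knn = brute_force_query(houses, houses, k=2)
```

In this sketch, when the reference data doubles as the query data, each point
is returned as its own nearest neighbor at distance zero.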

Two choices are critical in computing nearest neighbors. The first is the
``distance`` function, which measures the dissimilarity between any pair of
observations. Currently, the options for this are ``euclidean``, ``manhattan``,
``jaccard``, ``cosine``, and ``auto``, which chooses the most reasonable
distance based on the type of features in the reference data. Select the
distance option when creating the model.

.. sourcecode:: python

    >>> model = gl.nearest_neighbors.create(sf, label='house_id',
    ...                                     features=['bedroom', 'bath', 'size'],
    ...                                     distance='manhattan')
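
For reference, the four named distances can be sketched in plain Python using
their standard textbook definitions. These helper functions are hypothetical
and are not the toolkit's code; in particular, the exact input types and edge
case handling in GraphLab may differ. Jaccard distance is shown on sets, the
others on numeric vectors.

```python
import math

def euclidean(a, b):
    # Straight-line distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    # One minus the cosine of the angle between the vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def jaccard(a, b):
    # One minus the ratio of shared items to all items.
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / float(len(a | b))

p, q = (0.0, 0.0), (3.0, 4.0)
d_euclidean = euclidean(p, q)
d_manhattan = manhattan(p, q)
```

Euclidean and Manhattan distances depend on the scale of each feature, which is
why the standardization step matters; cosine distance ignores vector length and
compares direction only.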

The second critical choice in model creation is the ``method``. The
``brute-force`` method computes the distance between a query point and *each* of
the reference points, with a run time linear in the number of reference points.
The ``ball-tree`` method takes longer to build the model, but can speed up
queries substantially by partitioning the reference data into successively
smaller balls and searching only those that are relatively close to the query.
The default method is ``auto``, which chooses a reasonable method based on both
the feature types and the selected distance function. The method parameter is
also specified when the model is created.

.. sourcecode:: python

    >>> model = gl.nearest_neighbors.create(sf, label='house_id',
    ...                                     features=['bedroom', 'bath', 'size'],
    ...                                     method='ball-tree', leaf_size=5)
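
The idea behind the ball tree can be sketched in a few lines of plain Python.
This is a simplified illustration under stated assumptions, not GraphLab's
implementation: the functions ``build_ball_tree`` and ``nearest`` are
hypothetical, Euclidean distance is assumed, only the single nearest neighbor
is returned, and ``leaf_size`` is assumed to cap the number of points stored in
a leaf.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ball_tree(points, leaf_size=5):
    # Each node is a ball: a center and a radius covering all its points.
    center = tuple(sum(col) / len(points) for col in zip(*points))
    radius = max(dist(center, p) for p in points)
    node = {"center": center, "radius": radius, "points": None, "children": ()}
    if len(points) <= leaf_size:
        node["points"] = points
        return node
    # Split at the median of the dimension with the widest spread.
    spread = [max(col) - min(col) for col in zip(*points)]
    dim = spread.index(max(spread))
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    node["children"] = (build_ball_tree(ordered[:mid], leaf_size),
                        build_ball_tree(ordered[mid:], leaf_size))
    return node

def nearest(node, query, best=(float("inf"), None)):
    # No point in the ball can be closer than dist(query, center) - radius,
    # so skip the whole ball if that bound cannot beat the current best.
    if dist(query, node["center"]) - node["radius"] >= best[0]:
        return best
    if node["points"] is not None:
        for p in node["points"]:
            d = dist(query, p)
            if d < best[0]:
                best = (d, p)
        return best
    # Visit the closer child first to tighten the bound early.
    for child in sorted(node["children"],
                        key=lambda c: dist(query, c["center"])):
        best = nearest(child, query, best)
    return best

points = [(float(x), float((x * 7) % 11)) for x in range(20)]
tree = build_ball_tree(points, leaf_size=5)
best_dist, best_point = nearest(tree, (4.2, 6.1))
```

Building the tree costs extra work up front, but each query can then discard
entire balls with one distance computation instead of visiting every
reference point.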

For more details, see the documentation for
:py:class:`~graphlab.nearest_neighbors.NearestNeighborsModel` and
:py:func:`~graphlab.nearest_neighbors.create`.
"""

from graphlab.toolkits.nearest_neighbors import nearest_neighbors