==========================
Cross-validation Examples
==========================

Optunity offers a simple interface to k-fold cross-validation_.

.. _cross-validation: http://en.wikipedia.org/wiki/Cross-validation_(statistics)

The fold generation procedure is aware of both strata and clusters.
Please refer to :doc:`/user/cross_validation` for an overview and 
:func:`optunity.cross_validated` for implementation and API details.
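
To make the distinction concrete, here is a minimal plain-Python sketch. It is not Optunity's
fold generator: ``sketch_folds`` is a hypothetical helper, it skips shuffling, and it assumes
that members of a stratum should land in different folds while members of a cluster should
share one fold (the user guide remains the authoritative definition)::

    def sketch_folds(num_instances, num_folds, strata=(), clusters=()):
        """Assign instance indices to folds (illustration only, no shuffling)."""
        folds = [[] for _ in range(num_folds)]
        assigned = set()
        position = 0  # fold that receives the next cluster or stratum member

        # Clusters: all members of a cluster go into one fold.
        for cluster in clusters:
            folds[position % num_folds].extend(cluster)
            assigned.update(cluster)
            position += 1

        # Strata: members of a stratum are dealt out over consecutive folds.
        for stratum in strata:
            for index in stratum:
                folds[position % num_folds].append(index)
                assigned.add(index)
                position += 1

        # Everything else goes to whichever fold is currently smallest.
        for index in range(num_instances):
            if index not in assigned:
                min(folds, key=len).append(index)
        return folds


    folds = sketch_folds(9, 3, strata=[[0, 1, 2]], clusters=[[7, 8]])

In this toy run instances 7 and 8 end up in the same fold, while 0, 1 and 2 each land in a different one.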

We will build examples step by step. The basic setup is a ``train`` and ``predict``
function along with some ``data`` to construct folds over::

    from __future__ import print_function
    import optunity as opt

    def train(x, y, filler=''):
        print(filler + 'Training data:')
        for instance, label in zip(x, y):
            print(filler + str(instance) + ' ' + str(label))

    def predict(x, filler=''):
        print(filler + 'Testing data:')
        for instance in x:
            print(filler + str(instance))


    data = list(range(9))
    labels = [0] * 9

The recommended way to perform cross-validation is using the 
:func:`optunity.cross_validation.cross_validated` function decorator::

    @opt.cross_validated(x=data, y=labels, num_folds=3)
    def cved(x_train, y_train, x_test, y_test):
        train(x_train, y_train)
        predict(x_test)
        return 0.0

    cved()
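
Conceptually, the decorated function is called once per fold and the returned scores are
aggregated. The sketch below mimics that flow in plain Python; ``sketch_cross_validated`` is a
hypothetical stand-in with fixed, unshuffled folds and a plain mean, not Optunity's implementation::

    def sketch_cross_validated(x, y, num_folds):
        """Simplified stand-in for a cross-validation decorator."""
        # Deterministic folds: instance i belongs to fold i % num_folds.
        folds = [list(range(k, len(x), num_folds)) for k in range(num_folds)]

        def decorator(f):
            def wrapper(**kwargs):
                scores = []
                for test_idx in folds:
                    held_out = set(test_idx)
                    train_idx = [i for i in range(len(x)) if i not in held_out]
                    # Call the wrapped function with this fold's partitions.
                    scores.append(f(x_train=[x[i] for i in train_idx],
                                    y_train=[y[i] for i in train_idx],
                                    x_test=[x[i] for i in test_idx],
                                    y_test=[y[i] for i in test_idx],
                                    **kwargs))
                # Aggregate the per-fold scores with their mean.
                return sum(scores) / len(scores)
            return wrapper
        return decorator


    @sketch_cross_validated(x=list(range(9)), y=[0] * 9, num_folds=3)
    def fold_ratio(x_train, y_train, x_test, y_test):
        # Score used for illustration: test size relative to train size.
        return len(x_test) / float(len(x_train))

    ratio = fold_ratio()

With 9 instances and 3 folds, every call sees 6 training and 3 test instances, so ``ratio`` is 0.5.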


Nested cross-validation
--------------------------

Nested cross-validation is a commonly used approach to estimate the generalization 
performance of a modeling process which includes model selection internally. 
A good summary is provided here_.

.. _here: http://stats.stackexchange.com/a/65156/25433

Nested cv consists of two cross-validation procedures wrapped around each other. The inner cv is
used for model selection, the outer cv estimates generalization performance.



This can be done in a straightforward manner using Optunity::

    @opt.cross_validated(x=data, y=labels, num_folds=3)
    def nested_cv(x_train, y_train, x_test, y_test):

        @opt.cross_validated(x=x_train, y=y_train, num_folds=3)
        def inner_cv(x_train, y_train, x_test, y_test):
            train(x_train, y_train, '...')
            predict(x_test, '...')
            return 0.0

        inner_cv()
        predict(x_test)
        return 0.0

    nested_cv()

The inner :func:`optunity.cross_validated` decorator builds its folds from the training partition
of the outer procedure (``x_train`` and ``y_train``), so the outer test fold (``x_test``) plays no part in model selection.
The labels are all-zero placeholders here, since the example only prints the contents of each fold.

.. note::
    The inner folds are regenerated in every iteration (since we are redefining ``inner_cv`` each time). 
    The inner folds will therefore be different each time. The outer folds remain static, unless ``regenerate_folds=True`` is passed.

Below we illustrate a more complete example of nested cv, which includes hyperparameter
optimization with :func:`optunity.maximize`. Assume we have access to the functions
``svm = svm_train(x, y, c, g)`` and ``predictions = svm_predict(svm, x)``, where ``c`` and ``g``
are hyperparameters to be optimized for accuracy::

    @opt.cross_validated(x=data, y=labels, num_folds=3)
    def nested_cv(x_train, y_train, x_test, y_test):

        @opt.cross_validated(x=x_train, y=y_train, num_folds=3)
        def inner_cv(x_train, y_train, x_test, y_test, c, g):
            svm = svm_train(x_train, y_train, c, g)
            predictions = svm_predict(svm, x_test)
            return opt.score_functions.accuracy(y_test, predictions)

        # maximize returns a tuple; its first element holds the optimal hyperparameters
        optimal_parameters, _, _ = opt.maximize(inner_cv, num_evals=100, c=[0, 10], g=[0, 10])
        optimal_svm = svm_train(x_train, y_train, **optimal_parameters)
        predictions = svm_predict(optimal_svm, x_test)
        return opt.score_functions.accuracy(y_test, predictions)

    overall_accuracy = nested_cv()
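
The control flow above can also be traced without an SVM library. The following self-contained
sketch swaps in a hypothetical one-parameter threshold classifier for the SVM and an exhaustive
grid search for :func:`optunity.maximize`; the data, grid and helper names are made up for illustration::

    def toy_accuracy(y_true, y_pred):
        return sum(a == b for a, b in zip(y_true, y_pred)) / float(len(y_true))

    def split(indices, num_folds):
        """Yield (train, test) index lists; element j goes to fold j % num_folds."""
        for k in range(num_folds):
            test = indices[k::num_folds]
            held_out = set(test)
            yield [i for i in indices if i not in held_out], test

    def threshold_predict(threshold, xs):
        # Hypothetical "model": predict 1 when x exceeds the threshold.
        return [int(x > threshold) for x in xs]

    toy_data = list(range(12))
    toy_labels = [int(x > 5) for x in toy_data]
    grid = [1.5, 5.5, 9.5]  # candidate hyperparameter values

    outer_scores = []
    for outer_train, outer_test in split(list(range(len(toy_data))), 3):

        # Inner cv: model selection restricted to the outer training fold.
        def inner_score(threshold):
            scores = [toy_accuracy([toy_labels[i] for i in test],
                                   threshold_predict(threshold, [toy_data[i] for i in test]))
                      for _, test in split(outer_train, 3)]
            return sum(scores) / len(scores)

        best = max(grid, key=inner_score)

        # Outer cv: score the selected model on data the selection never saw.
        outer_scores.append(toy_accuracy(
            [toy_labels[i] for i in outer_test],
            threshold_predict(best, [toy_data[i] for i in outer_test])))

    toy_overall_accuracy = sum(outer_scores) / len(outer_scores)

Because the threshold 5.5 separates the toy labels perfectly, every inner search selects it
and ``toy_overall_accuracy`` is 1.0.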

.. note::
    You are free to use different score and aggregation functions in the inner and outer cv.
