.. |GVar| replace:: :class:`gvar.GVar`
.. |nonlinear_fit| replace:: :class:`lsqfit.nonlinear_fit`
.. |BufferDict| replace:: :class:`gvar.BufferDict`

Overview and Tutorial
========================

Introduction
--------------------
The modules described here are designed to facilitate
least-squares fitting of noisy data by multi-dimensional, nonlinear
functions of arbitrarily many parameters. The central module is
:mod:`lsqfit`, which provides the fitting functions. :mod:`lsqfit` makes
heavy use of the auxiliary module :mod:`gvar`, which provides tools that
facilitate the analysis of error propagation, and also the creation of
complicated multi-dimensional Gaussian distributions.

The following (complete) code illustrates basic usage of :mod:`lsqfit`::
   
   import numpy as np
   import gvar as gv
   import lsqfit
   
   y = {                      # data for the dependent variable
       'data1' : gv.gvar([1.376, 2.010], [[ 0.0047, 0.01], [ 0.01, 0.056]]),
       'data2' : gv.gvar([1.329, 1.582], [[ 0.0047, 0.0067], [0.0067, 0.0136]]),
       'b/a'   : gv.gvar(2.0, 0.5)
       }
   x = {                      # independent variable
       'data1' : np.array([0.1, 1.0]),
       'data2' : np.array([0.1, 0.5])
       }
   prior = dict(a=gv.gvar(0.5, 0.5), b=gv.gvar(0.5, 0.5))
   
   def fcn(x, p):             # fit function of x and parameters p
      ans = {}
      for k in ['data1', 'data2']:
         ans[k] = gv.exp(p['a'] + x[k] * p['b'])
      ans['b/a'] = p['b'] / p['a']
      return ans
      
   # do the fit   
   fit = lsqfit.nonlinear_fit(data=(x, y), prior=prior, fcn=fcn)
   print(fit.format(100))     # print standard summary of fit
   
   p = fit.p                  # best-fit values for parameters
   outputs = dict(a=p['a'], b=p['b'])
   outputs['b/a'] = p['b']/p['a']
   inputs = dict(y=y, prior=prior)
   print(gv.fmt_values(outputs))              # tabulate outputs
   print(gv.fmt_errorbudget(outputs, inputs)) # print error budget for outputs
   
   # save best-fit values in file 'outputfile.p' for later use
   import pickle
   pickle.dump(fit.p, open('outputfile.p', 'wb'))

This code fits the function ``f(x,a,b)= exp(a+b*x)`` (see ``fcn(x,p)``)
to two sets of data, labeled ``data1`` and ``data2``, by varying parameters
``a`` and ``b`` until ``f(x['data1'],a,b)`` and ``f(x['data2'],a,b)``
equal ``y['data1']`` and ``y['data2']``, respectively, to within the
``y``\s' errors. The means and covariance matrices for the ``y``\s are
specified in the ``gv.gvar(...)``\s used to create them: for example, ::
   
   >>> print(y['data1'])
   [1.376(69) 2.01(24)]
   >>> print(y['data1'][0].mean, "+-", y['data1'][0].sdev)
   1.376 +- 0.068556546004
   >>> print(gv.evalcov(y['data1']))   # covariance matrix
   [[ 0.0047  0.01  ]
    [ 0.01    0.056 ]]
   
shows the means, standard deviations and covariance matrix for the data in
the first data set (0.0685565 is the square root of the 0.0047 in
the covariance matrix). The dictionary ``prior`` gives *a priori* estimates
for the two parameters, ``a`` and ``b``: each is assumed to be 0.5±0.5 
before fitting. The parameters ``p[k]`` in the fit function ``fcn(x, p)`` 
are stored in a dictionary having the same keys and layout as
``prior``. In addition, there is an extra piece of input data,
``y['b/a']``, which indicates that ``b/a`` is 2±0.5. The fit
function for this data is simply the ratio ``b/a`` (represented by
``p['b']/p['a']`` in fit function ``fcn(x,p)``). The fit function returns
a dictionary having the same keys and layout as the input data ``y``.

The output from the code sample above is:

.. literalinclude:: eg0.out

   
The best-fit values for ``a`` and ``b`` are 0.253(32) and
0.449(65), respectively; and the best-fit result for ``b/a`` is
1.78(30), which, because of correlations, is slightly more accurate
than might be expected from the separate errors for ``a`` and ``b``. The
error budget for each of these three quantities is tabulated at the end and
shows that the bulk of the error in each case comes from uncertainties in
the ``y`` data, with only small contributions from uncertainties in the
priors ``prior``. The fit results corresponding to each piece of input data
are also tabulated (``Fit: ...``); the agreement is excellent, as expected
given that the ``chi**2`` per degree of freedom is only 0.17.
   
The last section of the code uses Python's :mod:`pickle` module to save the
best-fit values of the parameters in a file for later use. They are recovered
using :mod:`pickle` again::
   
   >>> import pickle
   >>> p = pickle.load(open('outputfile.p', 'rb'))
   >>> print(p['a'])
   0.253(32)
   >>> print(p['b'])
   0.449(65)
   >>> print(p['b']/p['a'])
   1.78(30)
   
The recovered parameters are :class:`gvar.GVar`\s, with their full covariance
matrix intact. (:mod:`pickle` works here because the variables in ``fit.p``
are stored in a special dictionary of type :class:`gvar.BufferDict`;
:class:`gvar.GVar`\s cannot be pickled otherwise.)
   
Note that the constraint in ``y`` on ``b/a`` in this example is much tighter
than the constraints on ``a`` and ``b`` separately. This suggests a variation
on the previous code, where the tight restriction on ``b/a`` is built into the
prior rather than ``y``::

   ... as before ...
   
   y = {                      # data for the dependent variable
       'data1' : gv.gvar([1.376, 2.010], [[ 0.0047, 0.01], [ 0.01, 0.056]]),
       'data2' : gv.gvar([1.329, 1.582], [[ 0.0047, 0.0067], [0.0067, 0.0136]])
       }
   x = {                      # independent variable
       'data1' : np.array([0.1, 1.0]),
       'data2' : np.array([0.1, 0.5])
       }
   prior = dict(a=gv.gvar(0.5, 0.5))
   prior['b'] = prior['a']*gv.gvar(2.0, 0.5)

   def fcn(x, p):             # fit function of x and parameters p[k]
      ans = {}
      for k in ['data1', 'data2']:
         ans[k] = gv.exp(p['a'] + x[k]*p['b'])
      return ans
      
   ... as before ...

Here the dependent data ``y`` no longer has an entry for ``b/a``, and neither
do results from the fit function; but the prior for ``b`` is now 2±0.5 
times the prior for ``a``, thereby introducing a correlation that
limits the ratio ``b/a`` to be 2±0.5 in the fit. This code gives almost
identical results to the first one --- very slightly less accurate, since
there is less input data. We can often move information from the ``y`` data to
the prior or back since both are forms of input information.

There are several things worth noting from this example:

   * The input data (``y``) is expressed in terms of Gaussian random
     variables --- quantities with means and a covariance matrix. These are
     represented by objects of type :class:`gvar.GVar` in the code; module
     :mod:`gvar` has a variety of tools for creating and manipulating
     Gaussian random variables.
     
   * The input data is stored in a dictionary (``y``) whose values can
     be |GVar|\s or arrays of |GVar|\s. The use of a dictionary allows for
     far greater flexibility than, say, an array. The fit function 
     (``fcn(x, p)``) has to return a dictionary with the same layout as
     that of ``y`` (that is, with the same keys and where the value for
     each key has the same shape as the corresponding value in ``y``).
     :mod:`lsqfit` does allow ``y`` to be an array instead of a dictionary, 
     which might be preferable for very simple fits (but usually not 
     otherwise).
     
   * The independent data (``x``) can be anything; it is simply passed 
     through the fit code to the fit function ``fcn(x,p)``. It can 
     also be omitted altogether, in which case the fit function 
     depends only upon the parameters: ``fcn(p)``.
      
   * The fit parameters (``p`` in ``fcn(x,p)``) are also stored in a
     dictionary whose values are |GVar|\s or arrays of |GVar|\s. Again this
     allows for great flexibility. The layout of the parameter dictionary
     is copied from that of the prior (``prior``). Again ``p`` can be a
     single array instead of a dictionary, if that simplifies the code
     (which is usually not the case).
     
   * The best-fit values of the fit parameters (``fit.p[k]``) are also
     |GVar|\s and these capture statistical correlations between different
     parameters that are indicated by the fit. These output parameters can
     be combined in arithmetic expressions, using standard operators and
     standard functions, to obtain derived quantities. These operations
     take account of and track statistical correlations (see the sketch
     after this list).
     
   * Function :func:`gvar.fmt_errorbudget` is a useful tool for assessing
     the origins (``inputs``) of the statistical errors obtained in various
     final results (``outputs``). It is particularly useful for analyzing
     the impact of the *a priori* uncertainties encoded in the prior
     (``prior``).
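
The correlation tracking mentioned in the list above is automatic in
:mod:`gvar` arithmetic. The following short sketch (independent of the
fit code above) shows how correlations propagate through simple
expressions::

   import gvar as gv

   x = gv.gvar(1.0, 0.1)     # x = 1.0 ± 0.1
   y = 2 * x                 # y is 100% correlated with x
   z = y / x                 # correlated errors cancel in the ratio
   print(z.mean, z.sdev)     # prints: 2.0 0.0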
   
What follows is a brief tutorial that demonstrates in greater detail how to
use these modules in some standard variations on the data fitting problem.
As above, code for the examples is specified completely and so can be copied
into a file, and run as is. It can also be modified, allowing for
experimentation. At the very end, in an appendix, there is 
a very simple pedagogical example that illustrates the nature of priors
and demonstrates some of the simpler techniques supported by :mod:`lsqfit`.

*About Printing:* The examples in this tutorial use the ``print`` function
as it is used in Python 3. Drop the outermost parentheses in each ``print``
statement if using Python 2; or add ::

  from __future__ import print_function

at the start of your file. 

.. _making-fake-data:

Making Fake Data
----------------
We need data in order to demonstrate curve fitting. The easiest route
is to make fake data. The recipe is simple: 1) choose some well defined
function ``f(x)`` of the independent variable ``x``; 2) choose values for
the ``x``\s, and therefore the "correct" values for ``y=f(x)``; and 3) add
random noise to the ``y``\s, to simulate measurement errors. Here we will work
through a simple implementation of this recipe to illustrate how the
:mod:`gvar` module can be used to build complicated Gaussian distributions (in
this case for the correlated noise in the ``y``\s). A reader eager to fit
real data can skip this section on first reading.

For the function ``f`` we choose something familiar (to some people): 
a sum of exponentials
``sum_i=0..99 a[i] exp(-E[i]*x)``. We take as our exact values for the
parameters ``a[i]=0.4`` and ``E[i]=0.9*(i+1)``, which are easy to remember.
This is simple in Python::

   import numpy as np
   
   def f_exact(x, nexp=100):
       return sum(0.4 * np.exp(-0.9 * (i + 1) * x) for i in range(nexp))
   
For ``x``\s we take ``1,2,3..10,12,14..20``, and the exact ``y``\s are then
given by ``f_exact(x)``::

   >>> x = np.array([1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,12.,14.,16.,18.,20.])
   >>> y_exact = f_exact(x)
   >>> print(y_exact)               # correct/exact values for y
   [  2.74047100e-01   7.92134506e-02   2.88190008e-02 ... ]

Finally we need to add random noise to the ``y_exact``\s to obtain our
fit data. We do this by forming ``y_exact*noise`` where ::

   noise = 1 + sum_n=0..99 c[n] * (x / x_max) ** n

Here ``x_max`` is the largest ``x`` used, and the ``c[n]`` are Gaussian
random variables with means and standard deviations of order 0.01. This
is easy to implement in Python using the :mod:`gvar` module::

   import gvar as gv
   
   def make_data(nexp=100, eps=0.01):  # make x, y fit data
       x = np.array([1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,12.,14.,16.,18.,20.])
       cr = gv.gvar(0.0, eps)          # 0 ± eps (used as random number generator)
       c = [gv.gvar(cr(), eps) for n in range(100)]   # noise coefficients c[n]
       x_xmax = x / max(x)
       noise = 1 + sum(c[n] * x_xmax ** n for n in range(100))
       y = f_exact(x, nexp) * noise    # noisy y[i]s
       return x, y

Gaussian variable ``cr`` represents a Gaussian distribution with mean
0.0 and width 0.01, which we use here as a random number generator:
``cr()`` is a number drawn randomly from the distribution represented by
``cr``::

   >>> print(cr)
   0.000(10)
   >>> print(cr())
   0.00452180208286
   >>> print(cr())
   -0.00731564589737

We use ``cr()`` to generate mean values for the Gaussian distributions
represented by the ``c[n]``\s, each of which has width 0.01. The resulting
``y``\s fluctuate around the corresponding values of ``f_exact(x)`` and have 
statistical errors::

   >>> print(y)
   [0.2752(27) 0.07951(80) ... ]
   >>> print(y-f_exact(x))
   [0.0011(27) 0.00029(80) ... ]
   
Different ``y``\s are also correlated (by construction), which becomes clear
if we evaluate the covariance matrix for the ``y``\s::

   >>> print(gv.evalcov(y))
   [[  7.52900382e-06   2.18173029e-06   7.95744444e-07 ... ]
    [  2.18173029e-06   6.33815228e-07   2.31761675e-07 ... ]
    [  7.95744444e-07   2.31761675e-07   8.49651978e-08 ... ]
    ...
   ]

The diagonal elements of the covariance matrix are the variances of the
individual ``y``\s; the off-diagonal elements are a measure of the
correlations::

   < (y[i]-<y[i]>) * (y[j]-<y[j]>) >.

The Gaussian variables ``y[i]`` together with the numbers ``x[i]`` comprise
our fake data.
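
It is often easier to read correlations off a normalized correlation
matrix, whose entries are the covariances divided by the products of the
corresponding standard deviations --- a short sketch, using the ``y``\s
from ``make_data()`` above::

   import numpy as np
   import gvar as gv

   cov = gv.evalcov(y)                 # covariance matrix of the ys
   sdev = np.sqrt(np.diag(cov))        # standard deviations
   corr = cov / np.outer(sdev, sdev)   # correlation matrix
   # diagonal entries equal one; off-diagonal entries near one
   # indicate strongly correlated ys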


.. _basic-fits:

Basic Fits
----------
Now that we have fit data, ``x,y = make_data()``, we pretend ignorance
of the exact functional relationship between ``x`` and ``y`` (*i.e.*,
``y=f_exact(x)``). Typically we *do* know the functional form and have some
*a priori* idea about the parameter values. The point of the fit is to
improve our knowledge of the parameter values, beyond our *a priori*
impressions, by analyzing the fit data. Here we see how to do this using
the :mod:`lsqfit` module.

First we need code to represent the fit function. In this case we know
that a sum of exponentials is appropriate, so we define the following 
Python function to represent the relationship between ``x`` and ``y`` in 
our fit::

   import numpy as np
   
   def f(x, p):         # function used to fit x, y data
       a = p['a']       # array of a[i]s
       E = p['E']       # array of E[i]s
       return sum(ai * np.exp(-Ei * x) for ai, Ei in zip(a, E))

The fit parameters, ``a[i]`` and ``E[i]``, are stored in a
dictionary, using labels ``a`` and ``E`` to access them. These parameters
are varied in the fit to find the best-fit values ``p=p_fit`` for which
``f(x, p_fit)`` most closely approximates the ``y``\s in our fit data. The
number of exponentials included in the sum is specified implicitly in this
function, by the lengths of the ``p['a']`` and ``p['E']`` arrays.

Next we need to define priors that encapsulate our *a priori* knowledge 
about the parameter values. In practice we almost always have *a priori* 
knowledge about parameters; it is usually impossible to design a fit
function without some sense of the parameter sizes. Given such knowledge
it is important (usually essential) to include it in the fit. This is 
done by designing priors for the fit, which are probability distributions 
for each parameter that describe the *a priori* uncertainty in that 
parameter. As in the previous section, we use objects of type
:class:`gvar.GVar` to describe (Gaussian) probability distributions.
Let's assume that before the fit we suspect that each ``a[i]`` is of order
0.5±0.5, while ``E[i]`` is of order ``1+i±0.5``. A prior
that represents this information is built using the following code::

   import lsqfit
   import gvar as gv

   def make_prior(nexp):               # make priors for fit parameters
       prior = gv.BufferDict()         # prior -- any dictionary works
       prior['a'] = [gv.gvar(0.5, 0.5) for i in range(nexp)]
       prior['E'] = [gv.gvar(i+1, 0.5) for i in range(nexp)]
       return prior

where ``nexp`` is the number of exponential terms that will be used (and
therefore the number of ``a``\s and ``E``\s). With ``nexp=3``, for example,
one would then have::

   >>> print(prior['a'])
   [0.50(50) 0.50(50) 0.50(50)]
   >>> print(prior['E'])
   [1.00(50) 2.00(50) 3.00(50)]

We use dictionary-like class :class:`gvar.BufferDict` for the prior because it
allows us to save the prior if we wish (using Python's :mod:`pickle` module).
If saving is unnecessary, :class:`gvar.BufferDict` can be replaced by
``dict()`` or most any other Python dictionary class.
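
For example, a prior built with :class:`gvar.BufferDict` can be pickled
and recovered intact, correlations included --- a short sketch (the file
name ``prior.p`` is arbitrary)::

   import pickle

   prior = make_prior(3)
   with open('prior.p', 'wb') as pfile:
       pickle.dump(prior, pfile)        # works because prior is a BufferDict
   with open('prior.p', 'rb') as pfile:
       prior_copy = pickle.load(pfile)  # GVars restored with correlations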

With fit data, a fit function, and a prior for the fit parameters, we are 
finally ready to do the fit, which is now easy::

  fit = lsqfit.nonlinear_fit(data=(x, y), fcn=f, prior=prior)
  
So pulling together the entire code, from this section and the previous
one, our complete Python program for making fake data and fitting it is::

   import lsqfit
   import numpy as np
   import gvar as gv

   def f_exact(x, nexp=100):          # exact f(x)
       return sum(0.4*np.exp(-0.9*(i+1)*x) for i in range(nexp))

   def f(x, p):                        # function used to fit x, y data
       a = p['a']                      # array of a[i]s
       E = p['E']                      # array of E[i]s
       return sum(ai * np.exp(-Ei * x) for ai, Ei in zip(a, E))

   def make_data(nexp=100, eps=0.01): # make x, y fit data
       x = np.array([1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,12.,14.,16.,18.,20.])
       cr = gv.gvar(0.0, eps)
       c = [gv.gvar(cr(), eps) for n in range(100)]
       x_xmax = x/max(x)
       noise = 1 + sum(c[n] * x_xmax ** n for n in range(100))
       y = f_exact(x, nexp) * noise
       return x, y

   def make_prior(nexp):               # make priors for fit parameters
       prior = gv.BufferDict()         # prior -- any dictionary works
       prior['a'] = [gv.gvar(0.5, 0.5) for i in range(nexp)]
       prior['E'] = [gv.gvar(i+1, 0.5) for i in range(nexp)]
       return prior

   def main():
       gv.ranseed([2009, 2010, 2011, 2012]) # initialize random numbers (opt.)
       x, y = make_data()              # make fit data
       p0 = None                       # make larger fits go faster (opt.)
       for nexp in range(3, 20):
           print('************************************* nexp =', nexp)
           prior = make_prior(nexp)
           fit = lsqfit.nonlinear_fit(data=(x, y), fcn=f, prior=prior, p0=p0)
           print(fit)                  # print the fit results
           E = fit.p['E']              # best-fit parameters
           a = fit.p['a']
           print('E1/E0 =', E[1] / E[0], '  E2/E0 =', E[2] / E[0])
           print('a1/a0 =', a[1] / a[0], '  a2/a0 =', a[2] / a[0])
           print()
           if fit.chi2 / fit.dof < 1.:
               p0 = fit.pmean          # starting point for next fit (opt.)

   if __name__ == '__main__':
       main()

We are not sure *a priori* how many exponentials are needed to fit our
data; given that there are only fifteen ``y``\s, and these are noisy, there
may only be information in the data about the first few terms. Consequently
we wrote our code to try fitting with each of ``nexp=3,4,5..19`` terms.
(The pieces of the code involving ``p0`` are optional; they make the
more complicated fits go about 30 times faster since the output from one
fit is used as the starting point for the next fit --- see the discussion
of the ``p0`` parameter for :class:`lsqfit.nonlinear_fit`.) Running
this code produces the following output, which is reproduced here in some
detail in order to illustrate a variety of features:

.. literalinclude:: eg1.out
   
There are several things to notice here:

   * Clearly three exponentials (``nexp=3``) is not enough. The ``chi**2`` 
     per degree of freedom (``chi2/dof``) is much larger than one. The
     ``chi**2`` improves significantly for ``nexp=4`` exponentials and by
     ``nexp=6`` the fit is as good as it is going to get --- there is
     essentially no change when further exponentials are added.
   
   * The best-fit values for each parameter are listed for each of the
     fits, together with the prior values (in brackets, on the right).
     Values for each ``a[i]`` and ``E[i]`` are listed in order, starting at
     the points indicated by the labels ``a`` and ``E``. Asterisks are
     printed at the end of the line if the mean best-fit value differs from 
     the prior's mean by more than one standard deviation; the number
     of asterisks, up to a maximum of 5, indicates how many standard 
     deviations the difference is. Differences of one or two standard
     deviations are not uncommon; larger differences could indicate a 
     problem with the prior or the fit.
     
     Once the fit converges, the best-fit values for the various parameters
     agree well --- that is to within their errors, approximately --- with
     the exact values, which we know since we are using fake data. For
     example, ``a`` and ``E`` for the first exponential are 0.402(4)
     and 0.9003(5), respectively, from the fit where the exact answers
     are 0.4 and 0.9; and we get 0.45(5) and 2.73(4) for
     the third exponential where the exact values are 0.4 and 2.7.
     
   * Note in the ``nexp=7`` fit how the means and standard deviations for
     the parameters governing the seventh (and last) exponential are almost
     identical to the values in the corresponding priors: 0.46(49) from
     the fit for ``a`` and 7.0(5) for ``E``. This tells us that our fit
     data has little or no information to add to what we knew *a priori*
     about these parameters --- there isn't enough data and what we have
     isn't accurate enough. 
     
     This is even more true of the additional terms added in
     the ``nexp=8`` and later fits. This is why the fit results stop
     changing once we have ``nexp=6`` exponentials. There is no point in
     including further exponentials, beyond the need to verify that the fit
     has indeed converged.
     
   * The last fit includes ``nexp=19`` exponentials and therefore has 38
     parameters. This is in a fit to 15 ``y``\s. Old-fashioned fits, without
     priors, are impossible when the number of parameters exceeds the number
     of data points. That is clearly not the case here, where the number of
     terms and parameters can be made arbitrarily large, eventually (after
     ``nexp=6`` terms) with no effect at all on the results.
     
     The reason is that the prior that we include for each new parameter
     is, in effect, a new piece of data (the mean and standard deviation of
     the *a priori* expectation for that parameter); it leads to a new term
     in the ``chi**2`` function. We are fitting both the data and our *a
     priori* expectations for the parameters. So in the ``nexp=19`` fit,
     for example, we actually have 53 pieces of data to fit: the 15 ``y``\s
     plus the 38 prior values for the 38 parameters.
     
     The effective number of degrees of freedom (``dof`` in the output
     above) is the number of pieces of data minus the number of fit
     parameters, or 53-38=15 in this last case. With priors for every
     parameter, the number of degrees of freedom is always equal to the
     number of ``y``\s, irrespective of how many fit parameters there are.
     
   * The Gaussian Bayes Factor (whose logarithm is 
     ``logGBF`` in the output) is a measure of the likelihood that the actual
     data being fit could have come from a theory with the prior and
     fit function used in the
     fit. The larger this number, the more likely it is that prior/fit-function 
     and data
     could be related. Here it grows dramatically from the first fit
     (``nexp=3``) but then more-or-less stops changing around ``nexp=5``. The
     implication is that this data is much more likely to have come from a
     theory with ``nexp>=5`` than with ``nexp=3`` (which we know to be the
     actual case).
     
   * In the code, results for each fit are captured in a Python object
     ``fit``, which is of type :class:`lsqfit.nonlinear_fit`. A summary of the
     fit information is obtained by printing ``fit``. Also the best-fit
     results for each fit parameter can be accessed through ``fit.p``, as is
     done here to calculate various ratios of parameters.
     
     The errors in these last calculations automatically account for any
     correlations in the statistical errors for different parameters. This
     is obvious in the ratio ``a1/a0``, which would be 1.004(16) if
     there was no statistical correlation between our estimates for ``a1``
     and ``a0``, but in fact is 1.004(7) in this fit. The (positive)
     correlation is evident in the covariance matrix::

      >>> print(gv.evalcov([a[0], a[1]]))
      [[  1.61726195e-05   1.65492001e-05]
        [  1.65492001e-05   2.41547633e-05]]
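
     The strength of this correlation can be quantified by converting the
     covariance matrix into a correlation coefficient (roughly 0.84 here)
     --- a short sketch, using ``a`` from the fit above::

       import numpy as np

       cov = gv.evalcov([a[0], a[1]])
       corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])  # roughly 0.84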

Finally we inspect the fit's quality point by point. The input data are
compared with results from the fit function, evaluated with the best-fit
parameters, in the following table (obtained in the code by printing the
output from ``fit.format(100)``\)::

   Fit:
        x[k]               y[k]          f(x[k],p)
   -----------------------------------------------
           1        0.2752 (27)        0.2752 (20)
           2       0.07951 (80)       0.07952 (58)
           3       0.02891 (29)       0.02892 (21)
           4       0.01127 (11)      0.011272 (83)
           5      0.004502 (46)      0.004506 (34)
           6      0.001817 (19)      0.001819 (14)
           7     0.0007362 (79)     0.0007375 (57)
           8     0.0002987 (33)     0.0002994 (24)
           9     0.0001213 (14)    0.00012163 (99)
          10    0.00004926 (57)    0.00004943 (41)
          12       8.13(10)e-06      8.164(72)e-06
          14      1.342(19)e-06      1.348(13)e-06
          16      2.217(37)e-07      2.227(23)e-07
          18      3.661(85)e-08      3.679(40)e-08
          20       6.24(61)e-09      6.078(71)e-09

The fit is excellent over the entire eight orders of magnitude. This
information is presented again in the following plot, which shows the ratio
``y/f(x,p)``, as a function of ``x``, using the best-fit parameters ``p``.
The correct result for this ratio, of course, is one. The smooth variation
in the data --- smooth compared with the size of the statistical-error bars
--- is an indication of the statistical correlations between individual
``y``\s.

.. image:: fig1.*
   :width: 80%

This particular plot was made using the :mod:`matplotlib` module, with the 
following code added to the end of ``main()`` (outside the loop)::

      import matplotlib.pyplot as plt
      ratio = y / f(x, fit.pmean)
      plt.xlim(0, 21)
      plt.xlabel('x')
      plt.ylabel('y/f(x,p)')
      plt.errorbar(x=x, y=gv.mean(ratio), yerr=gv.sdev(ratio), fmt='ob')
      plt.plot([0.0, 21.0], [1.0, 1.0])
      plt.show()


Chained Fits
-------------
The priors in a fit represent knowledge that we have about the parameters 
before we do the fit. This knowledge might come from theoretical considerations
or experiment. Or it might come from another fit. Imagine that we want to add
new information to that extracted from the fit in the previous section. 
For example, we might learn from some other source that the ratio of 
amplitudes ``a[1]/a[0]`` equals 1±1e-5. The challenge is to combine 
this new information with information extracted from the fit above without rerunning
that fit. (We assume it is not possible to rerun the first fit, because, say, 
the input data for that fit has been lost or is unavailable.)

We can combine the new data with the old fit results by creating a new
fit using the best-fit parameters, ``fit.p``, from the old fit as the
priors for the new fit. To try this out, we add the following code 
onto the end of the ``main()`` subroutine in the previous section::

  def ratio(p):                       # new fit function
      a = p['a']
      return a[1] / a[0]

  prior = fit.p                       # prior = best-fit parameters from 1st fit
  data = gv.gvar(1, 1e-5)             # new data for the ratio

  newfit = lsqfit.nonlinear_fit(data=data, fcn=ratio, prior=prior)
  print(newfit)

The result of the new fit (to one piece of new data) is:

.. literalinclude:: eg1a.out

Parameters ``a[0]`` and ``E[0]`` are essentially unchanged by the new 
information, but ``a[i]`` and ``E[i]`` are more precise for ``i=2`` 
and ``i=3``, as is ``a[1]/a[0]``, of course. 
It might seem odd that ``E[1]``, for example, is changed at 
all, since the fit function, ``ratio(p)``, makes no mention of it. This 
is not surprising, however, since ``ratio(p)`` does depend upon ``a[1]``,
and ``a[1]`` is strongly correlated with ``E[1]`` through the prior. It
is important to include all parameters from the first fit as 
parameters in the new fit in order to capture the impact of the new 
information on parameters correlated with ``a[1]/a[0]``.

It would have been easy to change the fit code in the previous section to
incorporate the new information about ``a[1]/a[0]``. The approach presented
here is numerically equivalent to that approach 
insofar as the ``chi**2`` function for the
original fit can be well approximated by a quadratic function 
in the fit parameters --- that is, insofar as 
``exp(-chi**2/2)`` is well approximated
by a Gaussian distribution in the parameters, as 
specified by the best-fit means and covariance matrix (in ``fit.p``).
This is, of course, a fundamental assumption underlying the
use of :mod:`lsqfit` in the first place.

Obviously, we can include further fits in order to incorporate more data. The
prior for each new fit is the best-fit output (``fit.p``) from the previous
fit. The output from the chain's final fit is the cumulative result of all
of these fits.
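
A minimal sketch of adding a third link to the chain, assuming
hypothetical new data ``newdata2`` and a corresponding fit function
``fcn2``::

   prior2 = newfit.p        # output of 2nd fit becomes prior for the 3rd
   fit3 = lsqfit.nonlinear_fit(data=newdata2, fcn=fcn2, prior=prior2)
   print(fit3)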


``x`` has Error Bars
--------------------
We now consider variations on our basic fit analysis (described above). The 
first variation concerns what to do when the independent variables, the 
``x``\s, have errors, as well as the ``y``\s. This is easily handled by 
turning the ``x``\s into fit parameters, and otherwise dispensing 
with independent variables.

To illustrate this, we modify the basic analysis code above. 
First we need to add errors to the ``x``\s, which we do by 
changing ``make_data`` so that each ``x`` has a random value within about 
±0.001% of its original value and an error::

   def make_data(nexp=100, eps=0.01): # make x, y fit data
       x = np.array([1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,12.,14.,16.,18.,20.])
       cr = gv.gvar(0.0, eps)
       c = [gv.gvar(cr(), eps) for n in range(100)]
       x_xmax = x/max(x)
       noise = 1 + sum(c[n] * x_xmax ** n for n in range(100))
       y = f_exact(x, nexp) * noise            # noisy y[i]s
       xfac = gv.gvar(1.0, 0.00001)    # Gaussian distrib'n: 1±0.001%
       x = np.array([xi * gv.gvar(xfac(), xfac.sdev) for xi in x]) # noisy x[i]s
       return x, y
   
Here :class:`gvar.GVar` object ``xfac`` is used as a random number
generator: each time it is called, ``xfac()`` is a different random number
from the distribution with mean ``xfac.mean`` and standard deviation
``xfac.sdev`` (that is, 1±0.00001). The main program is modified so
that the (now random) ``x`` array is treated as a fit parameter. The prior
for each ``x`` is, obviously, specified by the mean and standard deviation
of that ``x``, which is read directly out of the array of ``x``\s produced 
by ``make_data()``::

   def make_prior(nexp, x):            # make priors for fit parameters
       prior = gv.BufferDict()         # prior -- any dictionary works
       prior['a'] = [gv.gvar(0.5, 0.5) for i in range(nexp)]
       prior['E'] = [gv.gvar(i+1, 0.5) for i in range(nexp)]
       prior['x'] = x                  # x now an array of parameters
       return prior

   def main():
       gv.ranseed([2009, 2010, 2011, 2012]) # initialize random numbers (opt.)
       x, y = make_data()              # make fit data
       p0 = None                       # make larger fits go faster (opt.)
       for nexp in range(3, 20):
           print('************************************* nexp =', nexp)
           prior = make_prior(nexp, x)
           fit = lsqfit.nonlinear_fit(data=y, fcn=f, prior=prior, p0=p0)
           print(fit)                  # print the fit results
           E = fit.p['E']              # best-fit parameters
           a = fit.p['a']
           print('E1/E0 =', E[1] / E[0], '  E2/E0 =', E[2] / E[0])
           print('a1/a0 =', a[1] / a[0], '  a2/a0 =', a[2] / a[0])
           print()
           if fit.chi2 / fit.dof < 1.:
               p0 = fit.pmean          # starting point for next fit (opt.)
   
The fit data now consists of just the ``y`` array (``data=y``), and the 
fit function loses its ``x`` argument and gets its ``x`` values from the 
fit parameters ``p`` instead::

   def f(p):
       a = p['a']
       E = p['E']
       x = p['x']
       return sum(ai * np.exp(-Ei * x) for ai, Ei in zip(a, E))

Running the new code gives, for ``nexp=6`` terms:

.. literalinclude:: eg2.out

This looks quite a bit like what we obtained before, except that now there 
are 15 more parameters, one for each ``x``, and also now all results are
a good deal less accurate. Note that one result from this analysis is new 
values for the ``x``\s. In some cases the errors on the ``x`` values have
been reduced --- by information in the fit data.
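
The new ``x`` values are read off the best-fit parameters just like any
other parameter --- for example::

   new_x = fit.p['x']       # array of GVars for the xs
   print(new_x)             # compare errors with those in prior['x']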


.. _correlated-parameters:

Correlated Parameters; Gaussian Bayes Factor
---------------------------------------------
:class:`gvar.GVar` objects are very useful for handling more complicated
priors, including situations where we know *a priori* of correlations 
between parameters. Returning to the :ref:`basic-fits` example above, 
imagine a situation where we still have a ±0.5 uncertainty about the
value of any individual ``E[i]``, but we know *a priori* that the 
separation between adjacent ``E[i]``\s is 0.9±0.01. We want to 
build the correlation between adjacent ``E[i]``\s into our prior.

We do this by introducing a :class:`gvar.GVar` object ``de[i]`` for each
separate difference ``E[i]-E[i-1]``, with ``de[0]`` being ``E[0]``::

   de = [gv.gvar(0.9, 0.01) for i in range(nexp)]
   de[0] = gv.gvar(1, 0.5)    # different distribution for E[0]
   
Then ``de[0]`` specifies the probability distribution for ``E[0]``,
``de[0]+de[1]`` the distribution for ``E[1]``, ``de[0]+de[1]+de[2]`` the
distribution for ``E[2]``, and so on. This can be implemented (slightly 
inefficiently) in a single line of Python::

   E = [sum(de[:i+1]) for i in range(nexp)]
   
For ``nexp=3``, this implies that ::

   >>> print(E)
   [1.00(50) 1.90(50) 2.80(50)] 
   >>> print(E[1] - E[0], E[2] - E[1])
   0.900(10) 0.900(10)

which shows that each ``E[i]`` separately has an uncertainty of ±0.5 
(approximately) but that differences are specified to within ±0.01.

In the code, we need only change the definition of the prior in order to
introduce these correlations::

   def make_prior(nexp):               # make priors for fit parameters
       prior = gv.BufferDict()         # prior -- any dictionary works
       prior['a'] = [gv.gvar(0.5, 0.5) for i in range(nexp)]
       de = [gv.gvar(0.9, 0.01) for i in range(nexp)]
       de[0] = gv.gvar(1, 0.5)     
       prior['E'] = [sum(de[:i + 1]) for i in range(nexp)]
       return prior
   
Running the code as before, but now with the correlated prior in place, we
obtain the following fit with ``nexp=7`` terms:

.. literalinclude:: eg3.out
   
   
The results are similar to before for the leading parameters, but
substantially more accurate for parameters describing the second and later
exponential terms, as might be expected given our enhanced knowledge about
the differences between ``E[i]``\s. The output energy differences are
particularly accurate: they range from ``E[1]-E[0] = 0.900(1)``, which is
ten times more precise than the prior, to ``E[6]-E[5] = 0.900(10)``, which
is just what was put into the fit through the prior (the fit data adds no
new information). The correlated prior allows us to merge our *a priori*
information about the energy differences with the new information carried
by the fit data ``x, y``.

Note that the Gaussian Bayes Factor (see ``logGBF`` in the output) is
significantly larger with the correlated prior (``logGBF = 227``) than it
was for the uncorrelated prior (``logGBF = 221``). Had we been
uncertain as to which prior was more appropriate, this difference says that
the data prefers the correlated prior. (More precisely, it says that we
would be ``exp(227-221)``, or roughly 400, times more likely to get this
data from a theory with the
correlated prior than from one with the uncorrelated prior.) This
difference is significant despite the fact that the ``chi**2``\s in the two
cases are almost the same. ``chi**2`` tests goodness of fit, but there are
usually more ways than one to get a good fit. Some are more plausible 
than others, and the Bayes factor helps sort out which.
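
In code, this comparison amounts to exponentiating the difference of the
two ``logGBF``\s --- a sketch, with ``fit_corr`` and ``fit_uncorr``
standing for hypothetical fits made with the two priors::

   import numpy as np

   rel_prob = np.exp(fit_corr.logGBF - fit_uncorr.logGBF)
   # here exp(227 - 221), or roughly 400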


Tuning Priors and the Empirical Bayes Criterion
------------------------------------------------
Given two choices of prior for a parameter, the one that results in a larger
Gaussian Bayes Factor after fitting (see ``logGBF`` in fit output or
``fit.logGBF``) is the one preferred by the data. We can use this fact to tune
a prior or set of priors in situations where we are uncertain about the
correct *a priori* value: we vary the widths and/or central values of the
priors of interest to maximize ``logGBF``. This leads to complete nonsense if
it is applied to all the priors, but it is useful for tuning (or testing)
limited subsets of the priors when other information is unavailable. In effect
we are using the data to get a feel for what is a reasonable prior. This
procedure for setting priors is called the *Empirical Bayes* method.

This method is implemented in a driver program ::
    
    fit, z = lsqfit.empbayes_fit(z0, fitargs)
    
which varies :mod:`numpy` array ``z``, starting at ``z0``, to maximize
``fit.logGBF`` where ::

    fit = lsqfit.nonlinear_fit(**fitargs(z))
    
Function ``fitargs(z)`` returns a dictionary containing the arguments for
:class:`lsqfit.nonlinear_fit`. These arguments, and the prior in particular, are
varied as some function of ``z``. The optimal fit (that is, the one for which
``fit.logGBF`` is maximum) and ``z`` are returned.
    
To illustrate, consider tuning the widths of the priors for the amplitudes,
``prior['a']``, in the example from the previous section. This is done by
adding the following code to the end of ``main()`` subroutine::

   def fitargs(z, nexp=nexp, prior=prior, f=f, data=(x, y), p0=p0):
       z = np.exp(z)
       prior['a'] = [gv.gvar(0.5, 0.5 * z[0]) for i in range(nexp)]
       return dict(prior=prior, data=data, fcn=f, p0=p0)

   z0 = [0.0]
   fit, z = lsqfit.empbayes_fit(z0, fitargs, tol=1e-3)
   print(fit)                  # print the optimized fit results
   E = fit.p['E']              # best-fit parameters
   a = fit.p['a']
   print('E1/E0 =', E[1] / E[0], '  E2/E0 =', E[2] / E[0])
   print('a1/a0 =', a[1] / a[0], '  a2/a0 =', a[2] / a[0])
   print("prior['a'] =", fit.prior['a'][0])
   print()

Function ``fitargs`` generates a dictionary containing the arguments for
:class:`lsqfit.nonlinear_fit`. These are identical to what we have been using
except that the width of the priors in ``prior['a']`` is adjusted according
to parameter ``z``. Function :func:`lsqfit.empbayes_fit` does fits for 
different values of ``z`` and selects the ``z`` that maximizes ``fit.logGBF``.
It returns the corresponding fit and the value of ``z``.

This code generates the following output when ``nexp=7``:

.. literalinclude:: eg4a.out

Reducing the width of the ``prior['a']``\s from 0.5 to 0.1 increased
``logGBF`` from 227 to 234. The error for ``a2/a0`` is 40%
smaller, but the other results are not much affected --- suggesting that the
details of ``prior['a']`` are not all that important, which is confirmed by
the error budgets generated in the next section. It is not surprising, of
course, that the optimal width is 0.1 since the mean values for the
``fit.p['a']``\s are clustered around 0.4, which is 0.1 below the mean
value of the priors ``prior['a']``.

The Bayes factor, ``exp(fit.logGBF)``, is useful for deciding about fit 
functions as well as priors. Consider the following two fits of the sort 
discussed in the previous section, one using just two terms in the fit
function and one using three terms:

.. literalinclude:: eg4GBF.out

Measured by their ``chi**2``\s, the two fits are almost equally good. The
Bayes factor for the first fit, however, is much larger than that for the 
second fit. It says that the probability that our fit data comes from an 
underlying theory with exactly two
terms is ``exp(254 - 243) = 59,874`` times larger than the probability
that it comes from a theory with three terms. In fact, the data comes from
a theory with only two terms since it was generated using the same code
as in the previous section but with ``x, y = make_data(2)`` instead of
``x, y = make_data()`` in the main program.


Partial Errors and Error Budgets
---------------------------------
We frequently want to know how much of the uncertainty in a fit result is
due to a particular input uncertainty or subset of input uncertainties
(from the input data and/or from the priors). We refer to such errors as
"partial errors" (or partial standard deviations) since each is only part
of the total uncertainty in the fit result. The collection of such partial
errors, each associated with a different input error, is called an "error
budget" for the fit result. The partial errors from all sources of input
error reproduce the total fit error when they are added in quadrature.

Given the ``fit`` object (an :class:`lsqfit.nonlinear_fit` object) from the
example in the section on :ref:`correlated-parameters`, for example, we can
extract such information using :meth:`gvar.GVar.partialsdev` --- for example:

.. literalinclude:: eg4c.out
   
This shows that the total uncertainty in ``E[1]/E[0]``, 0.00106, is 
the sum in quadrature of a contribution 0.00042 due to the priors 
specified by ``prior['E']``, 0.00016 due to ``prior['a']``, and 
0.00095 from the statistical errors in the input data ``y``. 
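
These numbers come from calls like the following (a sketch, using ``fit``
and ``y`` from the :ref:`correlated-parameters` example)::

   E = fit.p['E']
   E1E0 = E[1] / E[0]
   print(E1E0.sdev)                          # total error
   print(E1E0.partialsdev(fit.prior['E']))   # partial error from prior['E']
   print(E1E0.partialsdev(fit.prior['a']))   # partial error from prior['a']
   print(E1E0.partialsdev(y))                # partial error from the y data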

There are two utility functions for tabulating results and error budgets.
They require dictionaries of output results and inputs, and use the 
keys from the dictionaries to label columns and rows, respectively, in
an error-budget table::

   outputs = {'E1/E0':E[1] / E[0], 'E2/E0':E[2] / E[0],
              'a1/a0':a[1] / a[0], 'a2/a0':a[2] / a[0]}
   inputs = {'E':fit.prior['E'], 'a':fit.prior['a'], 'y':y}
   print(fit.fmt_values(outputs))
   print(fit.fmt_errorbudget(outputs, inputs))

This gives the following output:

.. literalinclude:: eg4b.out
   
This table shows, for example, that the 0.37% uncertainty in ``E2/E0``
comes from a 0.09% contribution due to ``prior['a']``, a 0.07% contribution
due to statistical errors in the fit data ``y``, and a 0.35%
contribution due to ``prior['E']``, where, again, the total error is the
sum in quadrature of the partial errors. This suggests that reducing the
statistical errors in the input ``y`` data would reduce the error in
``E2/E0`` only slightly. On the other hand, more accurate ``y`` data should
significantly reduce the errors in ``E1/E0`` and ``a1/a0``, where ``y`` is
the dominant source of uncertainty. In fact a four-fold reduction in the
``y`` errors reduces the ``E1/E0`` error to 0.02% (from 0.05%) while
leaving the ``E2/E0`` error at 0.37%.
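
The quadrature relationship is easily verified in code --- for example,
for ``E2/E0``, using the ``outputs`` and ``inputs`` dictionaries above::

   import numpy as np

   E2E0 = outputs['E2/E0']
   partial = [E2E0.partialsdev(val) for val in inputs.values()]
   total = np.sqrt(sum(p ** 2 for p in partial))
   # total agrees with the full error, E2E0.sdev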


``y`` has No Error Bars
-----------------------
Occasionally there are fit problems where values for the dependent
variable ``y`` are known exactly (to machine precision). This poses a 
problem for least-squares fitting since the ``chi**2`` function is 
infinite when standard deviations are zero. How does one assign errors 
to exact ``y``\s in order to define a ``chi**2`` function that can be 
usefully minimized?

It is almost always the case in physical applications of this sort that the
fit function has in principle an infinite number of parameters. It is, of
course, impossible to extract information about infinitely many parameters
from a finite number of ``y``\s. In practice, however, we generally care about
only a few of the parameters in the fit function. (If this isn't the case,
give up.) The goal for a least-squares fit is to figure out what a finite
number of exact ``y``\s can tell us about the parameters we want to know.

The key idea here is to use priors to model the part of the fit function 
that we don't care about, and to remove that part of the function from 
the analysis by subtracting or dividing it out from the input data. To
illustrate, consider again the example described in the section on
:ref:`correlated-parameters`. Let us imagine that we know the exact values
for ``y`` for each of ``x=1, 1.2, 1.4...2.6, 2.8``. We are fitting this
data with a sum of exponentials ``a[i]*exp(-E[i]*x)`` where now we will
assume that *a priori* we know that: ``E[0]=1.0(5)``,
``E[i+1]-E[i]=0.9(2)``, and ``a[i]=0.5(5)``. Suppose that our goal is to
find good estimates for ``E[0]`` and ``a[0]``.

We know that for some set of parameters ::

   y = sum_i=0..inf  a[i]*exp(-E[i]*x)
   
for each ``x``\-\ ``y`` pair in our fit data. Given that  
``a[0]`` and ``E[0]`` are all we want to know, we might imagine defining
a new, modified dependent variable ``ymod``, equal to just
``a[0]*exp(-E[0]*x)``::

   ymod = y - sum_i=1..inf a[i]*exp(-E[i]*x)
   
We know everything on the right-hand side of this equation: we have exact
values for ``y`` and we have *a priori* estimates for the ``a[i]`` and
``E[i]`` with ``i>0``. So given means and standard deviations for every
``i>0`` parameter, and the exact ``y``, we can in principle determine a
mean and standard deviation for ``ymod``. The strategy then is to compute
the corresponding ``ymod`` for every ``y`` and ``x`` pair, and then fit
``ymod`` versus ``x`` to the *single* exponential ``a[0]*exp(-E[0]*x)``.
That fit will give values for ``a[0]`` and ``E[0]`` that reflect the
uncertainties in ``ymod``, which in turn originate in uncertainties in our
knowledge about the parameters for the ``i>0`` exponentials. 

It turns out to be quite simple to implement such a strategy using
:class:`gvar.GVar`\s. We convert our code by first modifying the main
program so that it provides prior information to a subroutine that computes
``ymod``. We will vary the number of terms ``nexp`` that are kept in the
fit, putting the rest into ``ymod`` as above (up to a maximum of 20
terms, which is close enough to infinity)::

   def main():
       gv.ranseed([2009, 2010, 2011, 2012])  # initialize random numbers (opt.)
       max_prior = make_prior(20)         # maximum sized prior
       p0 = None                          # make larger fits go faster (opt.)
       for nexp in range(1, 7):
           print('************************************* nexp =', nexp)
        fit_prior = gv.BufferDict()    # part of max_prior used in fit
           ymod_prior = gv.BufferDict()   # part of max_prior absorbed in ymod
           for k in max_prior:
               fit_prior[k] = max_prior[k][:nexp]
               ymod_prior[k] = max_prior[k][nexp:]
           x, y = make_data(ymod_prior)   # make fit data
           fit = lsqfit.nonlinear_fit(data=(x, y), fcn=f, prior=fit_prior, p0=p0)
           print(fit.format(10))          # print the fit results
           print()
        if fit.chi2 / fit.dof < 1.:
               p0 = fit.pmean             # starting point for next fit (opt.)

We put all of our *a priori* knowledge about parameters into prior
``max_prior`` and then pull out the part we need for the fit --- that is,
the first ``nexp`` terms. The remaining part of ``max_prior`` is used to
correct the exact data, which comes from a new ``make_data``::

   def make_data(ymod_prior):          # make x, y fit data
       x = np.arange(1., 3., 0.2)      # x = 1, 1.2, 1.4 ... 2.8
       ymod = f_exact(x) - f(x, ymod_prior)        
       return x, ymod
   
Running the new code produces the following output, where again ``nexp`` is
the number of exponentials kept in the fit (and ``20-nexp`` is the number
pushed into the modified dependent variable ``ymod``):

.. literalinclude:: eg5a.out

Here we use ``fit.format(10)`` to print out a table of ``x`` and 
``y`` (actually ``ymod``) values, together with the value of the 
fit function using the best-fit parameters. There are several things
to notice:

   * Were we really only interested in ``a[0]`` and ``E[0]``, a 
     single-exponential fit would have been adequate. This is because we
     are in effect doing a 20-exponential fit even in that case, by
     including all but the first term as corrections to ``y``. The answers
     given by the first fit are correct (we know the exact values since we
     are using fake data).
     
     The ability to push uninteresting parameters into a ``ymod`` can be
     highly useful in practice since it is usually much cheaper to
     incorporate those fit parameters into ``ymod`` than it is to include
     them as fit parameters --- fits with smaller numbers of parameters are
     usually a lot faster.
    
   * The ``chi**2`` and best-fit parameter means and standard deviations
     are almost unchanged by shifting terms from ``ymod`` back into the
     fit function, as ``nexp`` increases. The final results for
     ``a[0]`` and ``E[0]``, for example, are nearly identical in the
     ``nexp=1`` and ``nexp=4`` fits.
     
     In fact it is straightforward to prove that best-fit parameter means
     and standard deviations, as well as ``chi**2``, should be exactly the
     same in such situations provided the fit function is linear in all fit
     parameters. Here the fit function is approximately linear, given our
     small standard deviations, and so results are only approximately
     independent of ``nexp``.
          
   * The uncertainty in ``ymod`` for a particular ``x`` decreases as 
     ``nexp`` increases and as ``x`` increases. Also the ``nexp``
     independence of the fit results depends upon capturing all of the
     correlations in the correction to ``y``. This is why
     :class:`gvar.GVar`\s are useful since they make the implementation of
     those correlations trivial.
     
   * Although we motivated this example by the need to deal with ``y``\s
     having no errors, it is straightforward to apply the same ideas to
     a situation where the ``y``\s have errors. Again one might want to
     do so since fitting uninteresting fit parameters is generally more
     costly than absorbing them into the ``y`` (which then has a modified
     mean and standard deviation); see the sketch after this list.
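
The sketch for that last variation is minimal: when the ``y``\s are
|GVar|\s with their own errors, the correction is simply subtracted from
the noisy data instead of from ``f_exact(x)`` (the function signature here
is illustrative)::

   def make_data(x, y, ymod_prior):    # x, y: measured data (y has errors)
       ymod = y - f(x, ymod_prior)     # absorb uninteresting terms into ymod
       return x, ymod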
     

SVD Cuts and Roundoff Error
-----------------------------
All of the fits discussed above have (default) *SVD* cuts of 1e-15. This
has little impact in most of the problems, but makes a big difference
in the problem discussed in the previous section. Had we run that fit,
for example, with an *SVD* cut of 1e-19, instead of 1e-15, we would have
obtained the following output:

.. literalinclude:: eg5b.out
   
The standard deviations quoted for ``E1/E0``, *etc.* are much too large
compared with the standard deviations shown for the individual parameters,
and much larger than what we obtained in the previous section.
This is due to roundoff error. The standard deviations quoted for the
parameters are computed differently from the standard deviations in
``fit.p`` (which was used to calculate ``E1/E0``). The former come directly
from the curvature of the ``chi**2`` function at its minimum; the latter
are related back to the standard deviations of the input data and priors
used in the fit. The two should agree, but they will not agree if the
covariance matrix for the input ``y`` data is too ill-conditioned.

The inverse of the ``y`` covariance matrix is used in the ``chi**2``
function that is minimized by :class:`lsqfit.nonlinear_fit`. Given the
finite precision of computer hardware, it is impossible to compute this
inverse accurately if the matrix is singular or almost singular, and in
such situations the reliability of the fit results is in question. The
eigenvalues of the covariance matrix in this example (for ``nexp=6``)
indicate that this is the case: they range from ``7.2e-5`` down to
``4.2e-26``, covering 21 orders of magnitude. This is likely too large a
range to be handled with the 16--18 digits of precision available in normal
double precision computation. The smallest eigenvalues and their
eigenvectors are likely to be quite inaccurate, as is any method for
computing the inverse matrix.

The standard solution to this common problem in least-squares fitting is 
to introduce an *SVD* cut, here called ``svdcut``::

   fit = nonlinear_fit(data=(x, ymod), fcn=f, prior=prior, p0=p0, svdcut=1e-15)
   
Then the inverse of the ``y`` covariance matrix is computed from its
eigenvalues and eigenvectors, but with any eigenvalue smaller than
``svdcut`` times the largest eigenvalue replaced by the cutoff (that is,
by ``svdcut`` times the largest eigenvalue). This limits the singularity of
the covariance matrix, leading to improved numerical stability. The cost is
less precision in the final results since we are in effect decreasing the
precision of the input ``y`` data. This is a conservative move, but numerical
stability is worth the tradeoff.

Note that taking ``svdcut=-1e-15``, with a
minus sign, causes the problematic modes to be dropped. This is a more
conventional implementation of *SVD* cuts, but here it results in much less
precision than using ``svdcut=1e-15`` (giving, for example, 1.993(69)
for ``E1/E0``, which is almost three times less precise). Dropping modes is
equivalent to setting the corresponding variances to infinity, which is
(obviously) much more conservative and less realistic than setting them equal
to the *SVD*\-cutoff variance.

The error budget is interesting in this case. There is no contribution from
the original ``y`` data since it was exact. So all statistical uncertainty
comes from the priors in ``max_prior``, and from the *SVD* cut, which
contributes since it modifies the effective variances of several eigenmodes of
the covariance matrix. The *SVD* contribution can be obtained from
``fit.svdcorrection`` so the full error budget is constructed by the following
code, ::

   outputs = {'E1/E0':E[1] / E[0], 'E2/E0':E[2] / E[0],
              'a1/a0':a[1] / a[0], 'a2/a0':a[2] / a[0]}
   inputs = {'E':max_prior['E'], 'a':max_prior['a'], 'svd':fit.svdcorrection}
   print(fit.fmt_values(outputs))
   print(fit.fmt_errorbudget(outputs, inputs))

which gives:

.. literalinclude:: eg5d.out
   
Here the contribution from the *SVD* cut is almost negligible.

Note that covariance matrices are rescaled so that all diagonal 
elements equal one before the *SVD* cut is applied. This means,
among other things, that uncorrelated errors --- that is, diagonal sub-matrices
of the covariance matrix --- are unaffected by *SVD* cuts. Applying an *SVD*
cut of 1e-4, for example, to the following singular covariance matrix, ::

  [[  1.0   1.0   0.0  ]
   [  1.0   1.0   0.0  ]
   [  0.0   0.0   1e-20]],

gives a new, non-singular matrix::

  [[  1.0001   0.9999   0.0  ]
   [  0.9999   1.0001   0.0  ]
   [  0.0      0.0      1e-20]]
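
The rescale-and-cut operation is easy to sketch with :mod:`numpy` --- this
illustrates the algorithm on an explicit covariance matrix, and is not
meant to reproduce the internal implementation of :mod:`lsqfit`::

   import numpy as np

   def apply_svd_cut(cov, svdcut):
       # rescale covariance matrix so all diagonal elements equal one
       d = np.sqrt(np.diag(cov))
       corr = cov / np.outer(d, d)
       # replace eigenvalues below the cutoff by the cutoff
       val, vec = np.linalg.eigh(corr)
       val = np.maximum(val, svdcut * np.max(val))
       corr = vec.dot(np.diag(val)).dot(vec.T)
       # undo the rescaling
       return corr * np.outer(d, d)

Applying ``apply_svd_cut`` with ``svdcut=1e-4`` to the singular matrix
above reproduces the non-singular matrix shown.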

:class:`lsqfit.nonlinear_fit` uses a default value for ``svdcut`` of 1e-15,
and applies *SVD* cuts to the covariance matrices from both the fit data and
the prior.  This default can be overridden as shown above, but for many
problems it is a good choice. Roundoff errors become more acute, however,
when there are strong positive correlations between different parts of the fit
data or prior.  Then much larger ``svdcut``\s may be needed.

The method :func:`lsqfit.nonlinear_fit.check_roundoff` can be used to check
for roundoff errors by adding the line ``fit.check_roundoff()`` after the
fit. It generates a warning if roundoff looks to be a problem. This check
is done automatically if ``debug=True`` is added to the argument list of
:class:`lsqfit.nonlinear_fit`.

Bootstrap Error Analysis
------------------------
Our analysis above assumes that every probability distribution relevant to
the fit is approximately Gaussian. For example, we characterize the input
data for ``y`` by a mean and a covariance matrix obtained from averaging
many random samples of ``y``. For large sample sizes it is almost certainly
true that the average values follow a Gaussian distribution, but in
practical applications the sample size could be too small. The *statistical
bootstrap* is an analysis tool for dealing with such situations.

The strategy is to: 1) make a large number of "bootstrap copies" of the
original input data that differ from each other by random amounts
characteristic of the underlying randomness in the original data; 2) repeat
the entire fit analysis for each bootstrap copy of the data, extracting
fit results from each; and 3) use the variation of the fit results from
bootstrap copy to bootstrap copy to determine an approximate probability
distribution (possibly non-Gaussian) for each result.
   
Consider the code from the previous section, where we might reasonably want 
another check on the error estimates for our results. That code can be
modified to include a bootstrap analysis by adding the following to the end of
the ``main()`` subroutine::
   
   Nbs = 40                                     # number of bootstrap copies
   outputs = {'E1/E0':[], 'E2/E0':[], 'a1/a0':[],       # results
              'a2/a0':[], 'E1':[], 'a1':[]}
   for bsfit in fit.bootstrap_iter(n=Nbs):
       E = bsfit.pmean['E']                     # best-fit parameter values
       a = bsfit.pmean['a']                     #   (ignore errors)
       outputs['E1/E0'].append(E[1] / E[0])       # accumulate results
       outputs['E2/E0'].append(E[2] / E[0])
       outputs['a1/a0'].append(a[1] / a[0])
       outputs['a2/a0'].append(a[2] / a[0])
       outputs['E1'].append(E[1])
       outputs['a1'].append(a[1])
   # extract "means" and "standard deviations" from the bootstrap output;
   # print using .fmt() to create compact representation of GVars
   outputs = gv.dataset.avg_data(outputs, bstrap=True)
   print('Bootstrap results:')
   print('E1/E0 =', outputs['E1/E0'].fmt(), '  E2/E0 =', outputs['E2/E0'].fmt())
   print('a1/a0 =', outputs['a1/a0'].fmt(), '  a2/a0 =', outputs['a2/a0'].fmt())
   print('E1 =', outputs['E1'].fmt(), '  a1 =', outputs['a1'].fmt())
   
The results are consistent with the results obtained directly from the fit
(when using ``svdcut=1e-15``):

.. literalinclude:: eg5e.out

In particular, the bootstrap analysis confirms our previous error estimates
(to within 10-30%, since ``Nbs=40``). When ``Nbs`` is small, it is often
safer to use the median instead of the mean as the estimator, which is 
what ``gv.dataset.avg_data`` does here since flag ``bstrap`` is set 
to ``True``. 
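
For reference, the following sketch shows roughly what the median-based
estimator computes (this mirrors, but is not, the actual implementation
in :func:`gv.dataset.avg_data`)::

   import numpy as np

   def bs_estimate(samples):
      # median as the central value; half the width of the central
      # 68% interval as the one-sigma uncertainty
      med = np.percentile(samples, 50.)
      err = 0.5 * (np.percentile(samples, 84.1) - np.percentile(samples, 15.9))
      return med, err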

Testing Fits with Simulated Data
--------------------------------
Ideally we would test a fitting protocol by fitting data similar to
our actual fit data, but where we know the correct values for the fit
parameters ahead of time. The :class:`lsqfit.nonlinear_fit` iterator ``simulated_fit_iter``
creates any number of such simulations of the original fit. Returning
again to the fits in the section on :ref:`correlated-parameters`, we 
can add three fit simulations to the end of the ``main`` program::

   def main():
       gv.ranseed([2009, 2010, 2011, 2012]) # initialize random numbers (opt.)
       x, y = make_data()              # make fit data
       p0 = None                       # make larger fits go faster (opt.)
       for nexp in range(3, 20):
           print('************************************* nexp =', nexp)
           prior = make_prior(nexp)
           fit = lsqfit.nonlinear_fit(data=(x, y), fcn=f, prior=prior, p0=p0)
           print(fit)                  # print the fit results
           E = fit.p['E']              # best-fit parameters
           a = fit.p['a']
           print('E1/E0 =', E[1] / E[0], '  E2/E0 =', E[2] / E[0])
           print('a1/a0 =', a[1] / a[0], '  a2/a0 =', a[2] / a[0])
           print()
           if fit.chi2 / fit.dof < 1.:
               p0 = fit.pmean          # starting point for next fit (opt.)
 
       # 3 fit simulations based upon last fit
       for sfit in fit.simulated_fit_iter(3):
           print(sfit)
           sE = sfit.p['E']             # best-fit parameters (simulation)
           sa = sfit.p['a']
           E = sfit.pexact['E']         # correct results for parameters
           a = sfit.pexact['a']
           print('E1/E0 =', sE[1] / sE[0], '  E2/E0 =', sE[2] / sE[0])
           print('a1/a0 =', sa[1] / sa[0], '  a2/a0 =', sa[2] / sa[0])
           print('\nSimulated Fit Values - Exact Values:')
           print(
               'E1/E0:', (sE[1] / sE[0]) - (E[1] / E[0]), 
               '  E2/E0:', (sE[2] / sE[0]) - (E[2] / E[0])
               )
           print(
               'a1/a0:', (sa[1] / sa[0]) - (a[1] / a[0]), 
               '  a2/a0:', (sa[2] / sa[0]) - (a[2] / a[0])
               )

           # compute chi**2 comparing selected fit results to exact results
           sim_results = [sE[0], sE[1], sa[0], sa[1]]
           exact_results = [E[0], E[1], a[0], a[1]]
           chi2 = gv.chi2(sim_results, exact_results)
           print(
               '\nParameter chi2/dof [dof] = %.2f' % (chi2 / gv.chi2.dof), 
               '[%d]' % gv.chi2.dof, 
               '  Q = %.1f' % gv.chi2.Q
               )

The fit data for each of the three simulations is the same as the original fit 
data except that the means have been adjusted (randomly) so the correct values 
for the fit parameters are in each case equal to ``pexact=fit.pmean``. 
Simulation fit results will typically differ from the correct values by 
an amount of order a standard deviation. With  sufficiently accurate data,
the results from a large number 
of simulations will be distributed in Gaussians centered on the correct 
values (``pexact``), with widths that equal the standard deviations 
given by the fit (``fit.psdev``). (With less accurate data, the 
distributions may become non-Gaussian, and the interpretation of fit
results more complicated.)

In the present example, the output from the three simulations is:

.. literalinclude:: eg4d.out

The simulations show that the fit values usually agree with the correct
values to within a standard deviation or so (the correct results here are 
the mean values from the last fit discussed in :ref:`correlated-parameters`).
Furthermore the error estimates for each parameter from 
the original fit are reproduced by the simulations. We also compute 
the ``chi**2`` for the difference between the leading fit parameters and
the exact values. This checks parameter values, standard deviations, and correlations.
The results are reasonable for four degrees of freedom. Here the first simulation
shows results that are off by a third of a standard deviation on average, but
this is not so unusual --- the ``Q=0.1`` indicates that it happens 10% of the time.

More thorough testing is possible: for example, one could run many simulations
(100?) to verify that the distribution of (simulation) fit results is Gaussian,
centered around ``pexact``. This is overkill in most 
situations, however. The three simulations above are enough to reassure 
us that the original fit estimates, including errors, are reliable. 
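
Such a test might look like the following sketch (reusing ``fit`` from
the ``main()`` above)::

   import numpy as np

   ratios = []
   for sfit in fit.simulated_fit_iter(100):
      sE = sfit.pmean['E']
      ratios.append(sE[1] / sE[0])
   ratios = np.array(ratios)
   # the sample mean should be close to the correct value, and the
   # sample standard deviation close to the error quoted by the fit
   print('simulations:', ratios.mean(), '+-', ratios.std())
   print('original fit:', fit.p['E'][1] / fit.p['E'][0])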

Positive Parameters
-------------------
The priors for |nonlinear_fit| are all Gaussian. There are situations,
however, where other distributions would be desirable. One such case is where
a parameter  is known to be positive, but is close to zero in value ("close"
being defined relative to the *a priori* uncertainty). For such cases we would
like to use  non-Gaussian priors that force positivity --- for example, priors
that  impose log-normal or exponential distributions on the parameter.
Ideally the decision to use such a distribution would be made on a
parameter-by-parameter basis, when creating the priors, and would have no
impact on the definition of the fit function itself.

:mod:`lsqfit` provides a decorator, :class:`lsqfit.transform_p`, for fit
functions that makes this possible. This decorator only works for fit
functions that use dictionaries for their parameters. Given a prior ``prior``
for  a fit, the decorator is used in the following way: for example, ::
   
   @lsqfit.transform_p(prior.keys(), 0)
   def fitfcn(p):
      ...

when the parameter argument is the first argument of the fit function, or ::

   @lsqfit.transform_p(prior.keys(), 1)
   def fitfcn(x, p):
      ...

when the parameter argument is the second argument of the fit function (see
the :class:`lsqfit.transform_p` documentation for more detail). Consider any
parameter ``p['XX']`` used in ``fitfcn``. The prior distribution for that
parameter can now be turned into a log-normal distribution by replacing
``prior['XX']`` with ``prior['logXX']`` (or ``prior['log(XX)']``) when
defining the prior, thereby assigning a Gaussian distribution to ``logXX``
rather than to ``XX``. Nothing need be changed in the fit function, other than
adding the decorator. The decorator automatically detects parameters whose
keys begin with ``'log'`` and adds new parameters to the parameter-dictionary
for ``fitfcn`` that are exponentials of those parameters.

To illustrate, consider a simple problem where an experimental quantity ``y`` is
known to be positive, but experimental errors mean that measured values can
often be negative:: 

   import gvar as gv
   import lsqfit

   y = gv.gvar([
      '-0.17(20)', '-0.03(20)', '-0.39(20)', '0.10(20)', '-0.03(20)', 
      '0.06(20)', '-0.23(20)', '-0.23(20)', '-0.15(20)', '-0.01(20)', 
      '-0.12(20)', '0.05(20)', '-0.09(20)', '-0.36(20)', '0.09(20)', 
      '-0.07(20)', '-0.31(20)', '0.12(20)', '0.11(20)', '0.13(20)'
      ])

We want to know the average value ``a`` of the ``y``\s and so could
use the following fitting code::

   prior = gv.BufferDict(a=gv.gvar(0.02, 0.02))      # a = avg value of y's

   def fcn(p, N=len(y)):
      return N * [p['a']]

   fit = lsqfit.nonlinear_fit(prior=prior, data=y, fcn=fcn)
   print(fit)
   print('a =', fit.p['a'].fmt())

where we are assuming *a priori* information that suggests 
the average is around 0.02. The output from this code is:

.. literalinclude:: eg6-a.out

This is not such a useful result since much of the one-sigma range for ``a``
is negative, and yet we know that ``a`` must be positive.

A better analysis is to use a log-normal distribution for ``a``::

   prior = gv.BufferDict(loga=gv.log(gv.gvar(0.02, 0.02))) # loga not a

   @lsqfit.transform_p(prior.keys(), 0)
   def fcn(p, N=len(y)):
      return N * [p['a']]

   fit = lsqfit.nonlinear_fit(prior=prior, data=y, fcn=fcn)
   print(fit)
   print('a =', fit.transformed_p['a'].fmt())    # exp(loga)

The fit parameter is now ``log(a)`` rather than ``a`` itself,
but we are able to use the identical fit function. Here ``fit.transformed_p`` 
is the same as ``fit.p`` but augmented to include the exponentials of
any log-normal variables --- that is, ``a`` as well as ``loga``.
Rather than including all keys, the decorator can be written with a list 
containing just the variables to be transformed: here, 
``@lsqfit.transform_p(['loga'], 0)``.

The result from this fit is

.. literalinclude:: eg6-loga.out

which is more compelling. The "correct" value for ``a`` here is 0.015
(from the method used to generate the ``y``\s).

:func:`lsqfit.transform_p` also allows parameters to be replaced by their
square roots as fit parameters --- for example, define ``prior['sqrt(a)']``
(or ``prior['sqrta']``) rather than ``prior['a']`` when creating the prior.
This again guarantees positive parameters. The prior for ``p['a']`` is an
exponential distribution if the mean of ``p['sqrt(a)']`` is zero. Using
``prior['sqrt(a)']`` in place of ``prior['a']`` in the example above leads to
``a = 0.010(13)``, which is almost identical to the result obtained from the
log-normal distribution.
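
A sketch of the square-root variant of the example above (the zero-mean
prior here is a hypothetical choice; it implies ``<a> = sdev**2 = 0.02``)::

   prior = gv.BufferDict()
   prior['sqrt(a)'] = gv.gvar(0.0, 0.02 ** 0.5)    # zero mean => <a> = 0.02

   @lsqfit.transform_p(prior.keys(), 0)
   def fcn(p, N=len(y)):
      return N * [p['a']]      # decorator supplies p['a'] = p['sqrt(a)'] ** 2

   fit = lsqfit.nonlinear_fit(prior=prior, data=y, fcn=fcn)
   print('a =', fit.transformed_p['a'].fmt())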


Troubleshooting
---------------
:class:`lsqfit.nonlinear_fit` error messages that come from inside the *gsl*
routines doing the fits are sometimes less than useful. They are usually due
to errors in one of the inputs to the fit (that is, the fit data, the prior,
or the fit function). Setting ``debug=True`` in the argument list of
:class:`lsqfit.nonlinear_fit` might result in more intelligible error
messages. This option also causes the fitter to check for significant
roundoff errors in the matrix inversions of the covariance matrices.

Occasionally :class:`lsqfit.nonlinear_fit` appears to go crazy, with gigantic
``chi**2``\s (*e.g.*, ``1e78``). This could be because there is a genuine
zero-eigenvalue mode in the covariance matrix of the data or prior. Such a
zero mode makes it impossible to invert the covariance matrix when evaluating
``chi**2``. One fix is to include *SVD* cuts in the fit by setting, for
example, ``svdcut=(1e-14,1e-14)`` in the call to :class:`lsqfit.nonlinear_fit`.
These cuts will exclude exact or nearly exact zero modes, while leaving
important modes mostly unaffected.

Even if the *SVD* cuts work in such a case, the question remains as to why one
of the covariance matrices has a zero mode. A common cause is if the same
:class:`gvar.GVar` was used for more than one prior. For example, one might
think that ::

   >>> import gvar as gv
   >>> z = gv.gvar(1, 1)
   >>> prior = gv.BufferDict(a=z, b=z)

creates a 1±1 prior for each of the parameters ``a`` and ``b``.
Indeed each parameter separately is of order 1±1, but in a fit the two
parameters would be forced equal to each other because their priors are both
set equal to the same :class:`gvar.GVar`, ``z``::

   >>> print(prior['a'], prior['b'])
   1.0(1.0) 1.0(1.0)
   >>> print(prior['a']-prior['b'])
   0(0)

That is, while parameters ``a`` and ``b`` fluctuate over a range of 
1±1, they fluctuate together, in exact lock-step. The covariance matrix
for ``a`` and ``b`` must therefore be singular, with a zero mode corresponding
to the combination ``a-b``; it is all 1\s in this case::

   >>> import numpy as np
   >>> cov = gv.evalcov(prior.flat)    # prior's covariance matrix
   >>> print(np.linalg.det(cov))       # determinant is zero
   0.0

This zero mode upsets |nonlinear_fit|. If ``a`` and ``b`` are meant to
fluctuate together then an *SVD* cut as above will give correct results (with
``a`` and ``b`` being forced equal to several decimal places, depending upon
the cut). Of course, simply replacing ``b`` by ``a`` in the fit function would
be even better. If, on the other hand, ``a`` and ``b`` were not meant to
fluctuate together, the prior should be redefined::

   >>> prior = gv.BufferDict(a=gv.gvar(1, 1), b=gv.gvar(1, 1))

where now each parameter has its own :class:`gvar.GVar`.   
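
If, instead, ``a`` and ``b`` are meant to be strongly (but not perfectly)
correlated, the prior can be built from an explicit covariance matrix,
which leaves it correlated but non-singular; a minimal sketch::

   a, b = gv.gvar([1., 1.], [[1., 0.99], [0.99, 1.]])
   prior = gv.BufferDict(a=a, b=b)
   # the covariance matrix now has determinant 1 - 0.99**2 = 0.0199, not zero
   print(np.linalg.det(gv.evalcov(prior.flat)))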
   

Appendix: A Simple Pedagogical Example
-------------------------------------------
Consider a problem where we have five pieces of uncorrelated data for a 
function ``y(x)``::

                 
            x[i]       y(x[i])
            ----------------------
            0.1        0.5351 (54)
            0.3        0.6762 (67)
            0.5        0.9227 (91)
            0.7        1.3803(131)
            0.95       4.0145(399)

We know that ``y(x)`` has a Taylor expansion in ``x``::

   y(x) = sum_n=0..inf p[n] x**n

The challenge is to extract a reliable estimate for ``y(0)=p[0]`` from the
data --- that is, the challenge is to extrapolate the data to ``x=0``.

One approach that is certainly wrong is to truncate the expansion of
``y(x)`` after five terms, because there are only five pieces of data. 
That gives the following fit, where the gray band shows the 1-sigma 
uncertainty in the fit function evaluated with the best-fit parameters:

.. image:: appendix_1.*
   :width: 80%

This fit was generated using the following code::

   import numpy as np
   import gvar as gv
   import lsqfit

   # fit data
   y = gv.gvar([
      '0.5351(54)', '0.6762(67)', '0.9227(91)', '1.3803(131)', '4.0145(399)'
      ])
   x = np.array([0.1, 0.3, 0.5, 0.7, 0.95])

   # fit function
   def f(x, p):                  
      return sum(pn * x ** n for n, pn in enumerate(p))

   p0 = np.ones(5)               # starting value for chi**2 minimization
   fit = lsqfit.nonlinear_fit(data=(x, y), p0=p0, fcn=f)
   print(fit.format(maxline=5))

Note that here the function ``gv.gvar`` converts the strings 
``'0.5351(54)'``, *etc.* into |GVar|\s. Running the code gives the 
following output::

   Least Square Fit (no prior):
     chi2/dof [dof] = 4.8e-27 [0]    Q = 0    logGBF = None    itns = 2

   Parameters:
                 0    0.742 (39)     [ 1.0 +- inf ]  
                 1    -3.86 (59)     [ 1.0 +- inf ]  
                 2    21.5 (2.4)     [ 1.0 +- inf ]  
                 3   -39.1 (3.7)     [ 1.0 +- inf ]  
                 4    25.8 (1.9)     [ 1.0 +- inf ]  

   Fit:
        x[k]           y[k]      f(x[k],p)
   ---------------------------------------
         0.1    0.5351 (54)    0.5351 (54)  
         0.3    0.6762 (67)    0.6762 (67)  
         0.5    0.9227 (91)    0.9227 (91)  
         0.7     1.380 (13)     1.380 (13)  
        0.95     4.014 (40)     4.014 (40)  

This is a "perfect" fit in that the fit function agrees exactly with 
the data; the ``chi**2`` for the fit is zero. The 5-parameter fit 
gives a fairly precise answer for ``p[0]``
(``0.74(4)``), but the curve looks oddly stiff. Also some of the 
best-fit values for the coefficients are quite 
large (*e.g.*, ``p[3]= -39(4)``), perhaps unreasonably large.

The problem with a 5-parameter fit is that there is no reason to neglect 
terms in the expansion of ``y(x)`` with ``n>4``. Whether or not
extra terms are important depends entirely on how large we 
expect the coefficients ``p[n]`` for ``n>4`` to be. The extrapolation
problem is impossible without some idea of the size of these
parameters; we need extra information.

In this case that extra information is obviously connected to questions
of convergence of the Taylor expansion we are using to model ``y(x)``.
Let's assume we know, from previous work, that the ``p[n]`` are of
order one. Then we would need to keep at least 91 terms in the 
Taylor expansion if we wanted the terms we dropped to be small compared
with the 1% data errors at ``x=0.95``. So a possible fitting function
would be::

   y(x; N) = sum_n=0..N p[n] x**n  

with ``N=90``. 
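
The count is easy to verify: with coefficients of order one, the first
term dropped is of order ``0.95 ** 91``, which is just below the 1% data
errors::

   >>> 0.95 ** 91
   0.00939...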

Fitting a 91-parameter formula to five pieces of data is also impossible.
Here, however, we have extra (*prior*) information: each coefficient is order
one, which we make specific by saying that they equal ``0(1)``. So we are 
actually fitting 91+5 pieces of data with 91 parameters. 

The prior information is introduced into the fit as a *prior*::

   import numpy as np
   import gvar as gv
   import lsqfit

   # fit data
   y = gv.gvar([
      '0.5351(54)', '0.6762(67)', '0.9227(91)', '1.3803(131)', '4.0145(399)'
      ])
   x = np.array([0.1, 0.3, 0.5, 0.7, 0.95])

   # fit function
   def f(x, p):                     
      return sum(pn * x ** n for n, pn in enumerate(p))

   # 91-parameter prior for the fit
   prior = gv.gvar(91 * ['0(1)'])  

   fit = lsqfit.nonlinear_fit(data=(x, y), prior=prior, fcn=f)
   print(fit.format(maxline=5))

Note that a starting value ``p0`` is not needed when a prior is specified. 
This code also gives an excellent fit, with a ``chi**2`` per degree of 
freedom of ``0.35`` (note that the data point at ``x=0.95`` is off the chart,
but agrees with the fit to within its 1% errors):

.. image:: appendix_2.*
   :width: 80%
   
The fit code output is::

   Least Square Fit:
     chi2/dof [dof] = 0.35 [5]    Q = 0.88    logGBF = -0.45508    itns = 2

   Parameters:
                 0   0.489 (17)     [  0.0 (1.0) ]  
                 1    0.40 (20)     [  0.0 (1.0) ]  
                 2    0.60 (64)     [  0.0 (1.0) ]  
                 3    0.44 (80)     [  0.0 (1.0) ]  
                 4    0.28 (87)     [  0.0 (1.0) ]  
                 5    0.19 (87)     [  0.0 (1.0) ]  
                 6    0.16 (90)     [  0.0 (1.0) ]  
                 7    0.16 (93)     [  0.0 (1.0) ]  
                 8    0.17 (95)     [  0.0 (1.0) ]  
                 9    0.18 (96)     [  0.0 (1.0) ]  
                10    0.19 (97)     [  0.0 (1.0) ]  
                11    0.19 (97)     [  0.0 (1.0) ]  
                12    0.19 (97)     [  0.0 (1.0) ]  
                13    0.19 (97)     [  0.0 (1.0) ]  
                14    0.18 (97)     [  0.0 (1.0) ]  
                15    0.18 (97)     [  0.0 (1.0) ]  
                  .
                  .
                  .
                88    0.0 (1.0)     [  0.0 (1.0) ]  
                89    0.0 (1.0)     [  0.0 (1.0) ]  
                90    0.0 (1.0)     [  0.0 (1.0) ]  

   Fit:
        x[k]           y[k]      f(x[k],p)
   ---------------------------------------
         0.1    0.5351 (54)    0.5349 (54)  
         0.3    0.6762 (67)    0.6768 (65)  
         0.5    0.9227 (91)    0.9219 (87)  
         0.7     1.380 (13)     1.381 (13)  
        0.95     4.014 (40)     4.014 (40)  

   Settings:
     svdcut = (1e-15,1e-15)   svdnum = (None,None)    reltol/abstol = 0.0001/0

This is a much more plausible fit than the 5-parameter fit, and gives an
extrapolated value of ``p[0]=0.489(17)``. The original data points were
created using  a Taylor expansion with random coefficients, but
with ``p[0]`` set equal to ``0.5``. So this fit to the five data points (plus
91 *a priori* values for the ``p[n]`` with ``n<91``) gives the correct
result.  Increasing the number of terms further would have no effect since the
last terms added are having no impact, and so end up equal to the prior value
---  the fit data are not sufficiently precise to add new information about
these parameters.

We can test our priors for this fit by re-doing the fit with broader and 
narrower priors. Setting ``prior = gv.gvar(91 * ['0(3)'])`` gives an excellent
fit, ::

  Least Square Fit:
    chi2/dof [dof] = 0.039 [5]    Q = 1    logGBF = -5.0993    itns = 2

  Parameters:
                0   0.490 (33)     [  0.0 (3.0) ]  
                1    0.38 (48)     [  0.0 (3.0) ]  
                2    0.6 (1.8)     [  0.0 (3.0) ]  
                ... 

but with a very small ``chi2/dof`` and somewhat larger errors on the best-fit
estimates for the parameters. The logarithm of the (Gaussian) Bayes Factor,
``logGBF``, can be used to compare fits with different priors. It is the 
logarithm of the probability that our data would come from parameters
generated at random using the prior. The exponential of ``logGBF`` is 
more than 100 times larger with the original priors of ``0(1)`` than with  
priors of ``0(3)``. This says that our data is more than 100 times more 
likely to come from a world with parameters of order one than from one with 
parameters of order three. Put another way, it says that the size of the
fluctuations in the data is more consistent with coefficients of order one
than with coefficients of order three. The ``logGBF`` values argue for the
original prior.
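
The factor quoted here comes directly from the difference of the two
``logGBF`` values shown in the fit outputs::

   >>> import math
   >>> math.exp(-0.45508 + 5.0993)    # Bayes factor for 0(1) versus 0(3)
   103.98...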

Narrower priors, ``prior = gv.gvar(91 * ['0.0(3)'])``, give a poor fit, 
and also a less optimal ``logGBF``::

  Least Square Fit:
    chi2/dof [dof] = 3.7 [5]    Q = 0.0024    logGBF = -3.3058    itns = 2

  Parameters:
                0   0.484 (11)     [  0.00 (30) ]  *
                1   0.454 (98)     [  0.00 (30) ]  *
                2    0.50 (23)     [  0.00 (30) ]  *
                ...

The priors are responsible for about half of the final error in our best
estimate of ``p[0]`` (with priors of ``0(1)``); the rest comes from the
uncertainty in the data. This can be  established by creating an error budget
using the code ::

    inputs = dict(prior=prior, y=y)
    outputs = dict(p0=fit.p[0])
    print(gv.fmt_errorbudget(inputs=inputs, outputs=outputs))

which prints the following table::

  Partial % Errors:
                              p0
  ------------------------------
                    y:      2.67
                prior:      2.23
  ------------------------------
                total:      3.48


There is a second, equivalent way of fitting this data that illustrates the
idea of *marginalization.* We really only care about parameter ``p[0]`` in 
our fit. This suggests that we remove ``n>0`` terms from the data *before*
we do the fit::

  ymod[i] = y[i] - sum_n=1..inf prior[n] * x[i] ** n

Before the fit, our best estimate for the parameters is from the priors. We
use these to create an estimate for the correction to each data point
coming from ``n>0`` terms in ``y(x)``. This new data, ``ymod[i]``, should be fit with 
a new fitting function, ``ymod(x) = p[0]`` --- that is, it should be fit
to a constant, independent of ``x[i]``. The last three lines of the code
above are easily modified to implement this idea::

   import numpy as np
   import gvar as gv
   import lsqfit

   # fit data
   y = gv.gvar([
      '0.5351(54)', '0.6762(67)', '0.9227(91)', '1.3803(131)', '4.0145(399)'
      ])
   x = np.array([0.1, 0.3, 0.5, 0.7, 0.95])

   # fit function
   def f(x, p):                     
      return sum(pn * x ** n for n, pn in enumerate(p))

   # prior for the fit
   prior = gv.gvar(91 * ['0(1)'])   

   # marginalize all but one parameter (p[0])
   priormod = prior[:1]                       # restrict fit to p[0]
   ymod = y - (f(x, prior) - f(x, priormod))  # correct y
   
   fit = lsqfit.nonlinear_fit(data=(x, ymod), prior=priormod, fcn=f)
   print(fit.format(maxline=5))

Running this code gives::

  Least Square Fit (input data correlated with prior):
    chi2/dof [dof] = 0.35 [5]    Q = 0.88    logGBF = -0.45508    itns = 2

  Parameters:
                0   0.489 (17)     [  0.0 (1.0) ]  

  Fit:
       x[k]         y[k]     f(x[k],p)
  ------------------------------------
        0.1    0.54 (10)    0.489 (17)  
        0.3    0.68 (31)    0.489 (17)  
        0.5    0.92 (58)    0.489 (17)  
        0.7    1.38 (98)    0.489 (17)  
       0.95    4.0 (3.0)    0.489 (17)  *

  Settings:
    svdcut = (1e-15,1e-15)   svdnum = (None,None)    reltol/abstol = 0.0001/0

Remarkably, this one-parameter fit gives results for ``p[0]`` that are identical (to
machine precision) to our 91-parameter fit above. The 90 parameters for
``n>0`` are said to have been *marginalized* in this fit. Marginalizing a parameter
in this way has no effect if the fit function is linear in that parameter. 
Marginalization has almost no effect for nonlinear fits as well, 
provided the fit data have small errors (in which case the parameters are 
effectively linear). The fit here is:

.. image:: appendix_3.*
   :width: 80%

The constant is consistent with all of the data in ``ymod[i]``,
even at ``x[i]=0.95``, because ``ymod[i]`` has much larger errors at
larger ``x[i]``, owing to the correction terms.

Fitting to a constant is equivalent to doing a weighted average of the 
data plus the prior, so our fit can be replaced by an average::

   lsqfit.wavg(list(ymod) + list(priormod))

This again gives ``0.489(17)`` for our final result. 
Note that the central value for this average is below the central 
values for every data point in ``ymod[i]``. This is a consequence of large
positive correlations introduced into ``ymod`` when we remove the
``n>0`` terms. These correlations are captured automatically in our code,
and are essential --- removing the correlations between different 
``ymod``\s results in a final answer, ``0.564(97)``, which has much 
larger errors.
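
The comparison just mentioned amounts to rebuilding ``ymod`` from its
means and standard deviations alone, which discards the correlations;
a minimal sketch::

   # strip correlations: each ymod[i] keeps its mean and sdev, nothing else
   ymod_uncorr = gv.gvar(gv.mean(ymod), gv.sdev(ymod))
   print(lsqfit.wavg(list(ymod_uncorr) + list(priormod)))  # gives 0.564(97)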




