======
filter
======

This framework is designed to filter and split Excel files using a
series of modular readers, filters, and writers. These components are
described below, followed by documentation for the currently available
readers, filters and writers. Right at the bottom are examples of how
to use the filter framework for various tasks.

Readers
=======

Readers are components that get data from a source and parse it into a
series of xlrd.Book objects before making calls to the appropriate
methods on the first filter in the chain. 

Because it is usually only the source of the data to be processed that
changes, a handy base class is provided for readers.

Here's an example reader that uses this base class to process the file
it is initialised with:

>>> import os
>>> from xlrd import open_workbook
>>> from xlutils.filter import BaseReader
>>> class MyReader(BaseReader):
...
...    def __init__(self,filename):
...       self.filename = filename
...
...    def get_filepaths(self):
...       return (os.path.join(test_files,self.filename),)

If you need to create a more unorthodox reader, such as one that reads
its data from a network socket or that needs to pass special
parameters to xlrd.open_workbook then read the implementation of
xlutils.filter.BaseReader

Filters
=======

Filters are the important bits of this framework. They are build up in
chains to achieve the results required for a particular task. A filter
must define certain methods, the full set of these is shown in the
example below. The implementation of these methods can do whatever the
filter requires, but generally they end up calling the appropriate
methods on the next filter. 

Here's an example filter that does nothing but print messages when its
methods are called and then call the next filter in the chain:

>>> class MyFilter:
...
...    def __init__(self,name):
...        self.name = name
... 
...    def start(self):
...        print self.name,'start'
...        self.next.start()
...
...    def workbook(self,rdbook,wtbook_name):
...        print self.name,'workbook',rdbook,wtbook_name
...        self.next.workbook(rdbook,wtbook_name)
...    
...    def sheet(self,rdsheet,wtsheet_name):
...        print self.name,'sheet',rdsheet,wtsheet_name
...        self.next.sheet(rdsheet,wtsheet_name)
...        
...    def set_rdsheet(self,rdsheet):
...        print self.name,'set_rdsheet',rdsheet
...        self.next.sheet(rdsheet,wtsheet_name)
...        
...    def row(self,rdrowx,wtrowx):
...        print self.name,'row',rdrowx,wtrowx
...        self.next.row(rdrowx,wtrowx)
...
...    def cell(self,rdrowx,rdcolx,wtrowx,wtcolx):
...        print self.name,'cell',rdrowx,rdcolx,wtrowx,wtcolx
...        self.next.cell(rdrowx,rdcolx,wtrowx,wtcolx)
...
...    def finish(self):
...        print self.name, 'finish'
...        self.next.finish()

For full details of when each of these methods are called, see the
docstrings of the xlutils.filter.BaseFilter

Writers
=======

Writers are components that get handle calls from the appropriate
methods on the last filter in the chain. It is the writer that
actually does the work of copying all the information from the
appropriate sources and writing them to the output files.

Because there is a lot of work involved in this and it is only usually
the method of writing the binary data to its destination that differs,
a handy base class is provided for writers.

Here's an example writer that just writes the data to a temporary
directory:

>>> from xlutils.filter import BaseWriter
>>> our_temp_dir = TempDirectory().path
>>> class MyWriter(BaseWriter):
...
...    def get_stream(self,filename):
...       return open(os.path.join(our_temp_dir,filename),'wb')

Available Readers
=================

  GlobReader
  ----------

  If you're processing files that are on disk, then this is probably
  the reader for you. It returns all files matching the path
  specification it's created with. Here's an example:

  >>> from xlutils.filter import GlobReader
  >>> r = GlobReader(os.path.join(test_files,'*.xls'))
  >>> [p[len(test_files)+1:] for p in r.get_filepaths()]
  ['test.xls', 'testall.xls', 'testnoformatting.xls']

  All the other functionality is provided by BaseReader:

  >>> isinstance(r,BaseReader)
  True

  XLRDReader
  ----------

  If you want to "save" an xlrd.Book object which you've already
  created in some other way:

  >>> from xlrd import open_workbook
  >>> wb = open_workbook(os.path.join(test_files,'testall.xls'))

  Then the XLRDReader is likely what you're after: 

  >>> from xlutils.filter import XLRDReader
  >>> r = XLRDReader(wb,'foo.xls')

  NB: You must supply a filename as shown above as the original
      filename is not stored in the xlrd.Book object

  Most functionality is provided by BaseReader:

  >>> isinstance(r,BaseReader)
  True

  But its get_workbooks method just returns the values it was
  instantiated with:
  
  >>> tuple(r.get_workbooks())
  ((<xlrd.Book object at ...>, 'foo.xls'),)
  >>> tuple(r.get_workbooks())[0][0] is wb
  True

  To show it working, here we send the book straight to a writer:

  >>> from xlutils.filter import DirectoryWriter,process
  >>> os.listdir(temp_dir)
  []
  >>> process(r,DirectoryWriter(temp_dir))
  >>> os.listdir(temp_dir)
  ['foo.xls']

  TestReader
  ----------

  This reader is specifically designed for testing filter
  implementations with known sets of cells. This example should give a
  good idea of how to use it and shows the filter created above in
  use:
  
  >>> from mock import Mock
  >>> from pprint import pprint
  >>> from xlutils.tests.test_filter import TestReader
  >>> from xlrd import XL_CELL_NUMBER
  >>> r = TestReader(
  ...         ('Sheet1',(('R0C0','R0C1'),
  ...                    ('R1C0','R1C1'))),
  ...         ('Sheet2',(((XL_CELL_NUMBER,0.0),),),)
  ...     )
  >>> c = Mock()
  >>> r(c)
  >>> pprint(c.method_calls)
  [('start', (), {}),
   ('workbook',
    (<xlutils.tests.fixtures.DummyBook...>, 'test.xls'),
    {}),
   ('sheet', (<xlrd.sheet.Sheet...>, 'Sheet1'), {}),
   ('row', (0, 0), {}),
   ('cell', (0, 0, 0, 0), {}),
   ('cell', (0, 1, 0, 1), {}),
   ('row', (1, 1), {}),
   ('cell', (1, 0, 1, 0), {}),
   ('cell', (1, 1, 1, 1), {}),
   ('sheet', (<xlrd.sheet.Sheet...>, 'Sheet2'), {}),
   ('row', (0, 0), {}),
   ('cell', (0, 0, 0, 0), {}),
   ('finish', (), {})]

Available Writers
=================

  DirectoryWriter
  ---------------

  If you're processing files that are on disk, then this is probably
  the writer for you. It stores files in the directory passed to it
  during creation. Here's an example:

  >>> from xlutils.filter import DirectoryWriter
  >>> temp_dir = TempDirectory().path
  >>> w = DirectoryWriter(temp_dir)

  Most of the functionality is provided by BaseWriter:

  >>> isinstance(w,BaseWriter)
  True

  But the get_stream method makes sure the files end up in the
  directory specified:

  >>> os.listdir(temp_dir)
  []
  >>> f = w.get_stream('test.xls')
  >>> f.write('some \r\n data')
  >>> f.close()
  >>> os.listdir(temp_dir)
  ['test.xls']
  >>> open(os.path.join(temp_dir,'test.xls'),'rb').read()
  'some \r\n data'

  StreamWriter
  ------------

  If you want to write exactly one workbook to a stream, then this is
  the writer for you:

  >>> from tempfile import TemporaryFile
  >>> from xlutils.filter import StreamWriter
  >>> tf = TemporaryFile()
  >>> w = StreamWriter(tf)

  Most of the functionality is provided by BaseWriter:

  >>> isinstance(w,BaseWriter)
  True

  But the get_stream method makes sure the excel data is written to
  the stream provided:

  >>> f = w.get_stream('test.xls')
  >>> f.write('xls data')
  >>> tf.seek(0)
  >>> tf.read()
  'xls data'

  NB: Only one file may be written to a StreamWriter, further attempts
  will result in an exception being raised:

  >>> w.get_stream('test2.xls')
  Traceback (most recent call last):
  ...
  Exception: Attempt to write more than one workbook

  StreamWriter also doesn't close any streams passed to it:

  >>> tf = TemporaryFile()
  >>> process(TestReader(('Sheet1',[['R0C0']])),StreamWriter(tf))
  >>> tf.seek(0)
  >>> len(tf.read())
  5632

  XLWTWriter
  ----------

  If you want to change cells after the filtering process is complete
  then an XLWTWriter can be used to obtain the xlwt.Workbook
  objects that result: 

  >>> from xlutils.filter import XLWTWriter
  >>> w = XLWTWriter()
  >>> process(TestReader(('Sheet1',[['R0C0']])),w)

  The objects can then be manipulated and saved as required:

  >>> w.output
  [('test.xls', <xlwt.Workbook.Workbook object at ...>)]

  >>> book = w.output[0][1]
  >>> book.get_sheet(0).write(0,1,'R0C1')

  >>> temp_dir = TempDirectory()
  >>> temp_dir.listdir()
  
  >>> book.save(os.path.join(temp_dir.path,w.output[0][0]))
  >>> temp_dir.listdir()
  test.xls

  As with previous writers, most of the functionality is provided by
  BaseWriter:

  >>> isinstance(w,BaseWriter)
  True

Available Filters
=================

  BaseFilter
  ----------

  This is a "do nothing" filter that makes a great base class for your
  own filters. All the required methods are implemented such that they
  just call the same method on the next filter in the chain.

  ColumnTrimmer
  -------------

  This filter will strip columns containing no useful data from the
  end of sheets. For example:

  >>> from xlutils.filter import process,ColumnTrimmer
  >>> r = TestReader(
  ...         ('Sheet1',(('','',''),
  ...                    ('R1C0','',''),
  ...                    ('R2C0','R2C1',' \t\r\n'))),
  ...         ('Sheet2',(('R0C0',),),),
  ...         ('Sheet3',(('R0C0','',(XL_CELL_NUMBER,0.0)),)),
  ...     )
  >>> c = Mock()
  >>> process(r,ColumnTrimmer(),c)
  >>> pprint(c.method_calls)
  [('start', (), {}),
   ('workbook',
    (<xlutils.tests.fixtures.DummyBook...>, 'test.xls'),
    {}),
   ('sheet', (<xlrd.sheet.Sheet...>, 'Sheet1'), {}),
   ('row', (0, 0), {}),
   ('row', (1, 1), {}),
   ('row', (2, 2), {}),
   ('cell', (0, 0, 0, 0), {}),
   ('cell', (0, 1, 0, 1), {}),
   ('cell', (1, 0, 1, 0), {}),
   ('cell', (1, 1, 1, 1), {}),
   ('cell', (2, 0, 2, 0), {}),
   ('cell', (2, 1, 2, 1), {}),
   ('sheet', (<xlrd.sheet.Sheet...>, 'Sheet2'), {}),
   ('row', (0, 0), {}),
   ('cell', (0, 0, 0, 0), {}),
   ('sheet', (<xlrd.sheet.Sheet...>, 'Sheet3'), {}),
   ('row', (0, 0), {}),
   ('cell', (0, 0, 0, 0), {}),
   ('finish', (), {})]

  When sheets are trimmed, a message is also logged to aid debugging:
 
  >>> from testfixtures import LogCapture
  >>> l = LogCapture()
  >>> process(r,ColumnTrimmer(),c)
  >>> print l
  xlutils.filter DEBUG
    Number of columns trimmed from 3 to 2 for sheet 'Sheet1'
  xlutils.filter DEBUG
    Number of columns trimmed from 3 to 1 for sheet 'Sheet3'

  The definition of 'no useful data' can also be controlled by passing
  in a function that returns True or False for each value:

  >>> def not_useful(cell):
  ...     if not cell.value or cell.value=='junk': return True

  >>> r = TestReader(
  ...         ('Sheet1',(('R1C0','','junk'),)),
  ...     )
  >>> c = Mock()
  >>> process(r,ColumnTrimmer(not_useful),c)
  >>> pprint(c.method_calls)
  [('start', (), {}),
   ('workbook',
    (<xlutils.tests.fixtures.DummyBook...>, 'test.xls'),
    {}),
   ('sheet', (<xlrd.sheet.Sheet...>, 'Sheet1'), {}),
   ('row', (0, 0), {}),
   ('cell', (0, 0, 0, 0), {}),
   ('finish', (), {})]

  ErrorFilter
  -----------

  This filter caches all method calls in a file on disk and will only
  pass them on to filters further down in the chain when its finish
  method has been called *and* no error messages have been logged to
  the python logging framework.

  NB: To be effective, this filter must be the last in the chain
      before the writer!

  Here's an example of how to set one up:

  >>> from xlutils.filter import process,ErrorFilter
  >>> f = ErrorFilter()

  To show the filter in action, we need a little helper:

  >>> import logging
  >>> from xlutils.filter import BaseFilter
  >>> class Log(BaseFilter):
  ...    def __init__(self,level):
  ...        self.level = level
  ...    def workbook(self,rdbook,wtbook_name):
  ...        if wtbook_name=='test.xls':
  ...            logging.getLogger('theLogger').log(self.level,'a message')
  ...        self.next.workbook(rdbook,wtbook_name)

  We also need to be able to see what messages are logged:

  >>> h = LogCapture()

  So, when we have errors logged, no methods other than finish are
  passed on to the next filter:
     
  >>> c = Mock()
  >>> process(MyReader('test.xls'),Log(logging.ERROR),f,c)
  >>> len(c.method_calls)
  0

  As well as the error message logged, we can also see the ErrorFilter
  logs an error to that that the method calls have not been passed on:

  >>> print h
  theLogger ERROR
    a message
  xlutils.filter ERROR
    No output as errors have occurred.
  >>> h.clear()
  
  This error message can be controlled when the ErrorFilter is
  instantiated:

  >>> f = ErrorFilter(message='wingnuts! errors have occurred!')
  >>> process(MyReader('test.xls'),Log(logging.ERROR),f,c)
  >>> print h
  theLogger ERROR
    a message
  xlutils.filter ERROR
    wingnuts! errors have occurred!

  However, when no errors are logged, all method calls are passed:

  >>> c = Mock()
  >>> process(MyReader('test.xls'),Log(logging.WARNING),f,c)
  >>> len(c.method_calls)
  17

  In addition to the logging of error messages, error cells will 
  also cause all methods to be filtered:

  >>> from xlrd import XL_CELL_ERROR
  >>> r = TestReader(
  ...         (u'Price(\xa3)',(((XL_CELL_ERROR,0),),)),
  ...         )
  >>> c = Mock()
  >>> h.clear()
  >>> process(r,ErrorFilter(),c)
  >>> len(c.method_calls)
  0
  >>> print h
  xlutils.filter ERROR
    Cell A1 of sheet 'Price(?)' contains a bad value: error (#NULL!)
  xlutils.filter ERROR
    No output as errors have occurred.
  
  You can also configure the log level at which messages prevent the
  ErrorFilter from passing its method calls on to the next filter in
  the chain:

  >>> f = ErrorFilter(logging.WARNING)

  Now, warnings will cause methods to not be passed on:

  >>> c = Mock()
  >>> process(MyReader('test.xls'),Log(logging.WARNING),f,c)
  >>> len(c.method_calls)
  0

  But if only debug messages are logged, the method calls will still
  be passed on:

  >>> c = Mock()
  >>> process(MyReader('test.xls'),Log(logging.DEBUG),f,c)
  >>> len(c.method_calls)
  17

  An example which may prove useful is how to set up a filter such
  that if any errors are logged while processing one workbook, that
  workbook is filtered out but other subsequent workbooks are not
  filtered out.

  This is done by inserting a filter such as the following earlier in
  the chain:

  >>> class BatchByWorkbook(BaseFilter):
  ...    started = False
  ...    def start(self): pass
  ...    def workbook(self,rdbook,wtbook_name):
  ...       if self.started:
  ...           self.next.finish()
  ...       self.next.start()
  ...       self.next.workbook(rdbook,wtbook_name)
  ...       self.started = True

  Here it is at work, starting with an empty output directory:

  >>> temp_dir = TempDirectory().path
  >>> os.listdir(temp_dir)
  []

  Now test.xls, testall.xls and testnoformatting.xls are processed,
  but errors are only logged while processing test.xls:

  >>> process(
  ...     GlobReader(os.path.join(test_files,'*.xls')),
  ...     BatchByWorkbook(),
  ...     Log(logging.ERROR),
  ...     ErrorFilter(),
  ...     DirectoryWriter(temp_dir)
  ... )

  So, the output directory contains testall.xls, but no test.xls:

  >>> os.listdir(temp_dir)
  ['testall.xls', 'testnoformatting.xls']

  Echo
  ----

  This filter will print calls to the methods configured when the
  filter is created along with the arguments passed.

  >>> from xlutils.filter import Echo,process
  >>> r = TestReader(
  ...         ('Sheet1',(('R0C0','R0C1'),
  ...                    ('R1C0','R1C1'))),
  ...     )
  >>> process(r,Echo(methods=('workbook',)),Mock())
  workbook:(<...DummyBook...>, 'test.xls')

  If 'True' is passed instead of a list of method names, then all
  methods called will be printed:

  >>> process(r,Echo(methods=True),Mock())
  start:()
  workbook:(<...DummyBook...>, 'test.xls')
  sheet:(<xlrd.sheet.Sheet...>, 'Sheet1')
  row:(0, 0)
  cell:(0, 0, 0, 0)
  cell:(0, 1, 0, 1)
  row:(1, 1)
  cell:(1, 0, 1, 0)
  cell:(1, 1, 1, 1)
  finish:()

  If you need to see what's happening at various points in a chain,
  you can also give an Echo a name:

  >>> process(r,Echo('first'),Echo('second'),Mock())
  'first' start:()
  'second' start:()
  'first' workbook:(<...DummyBook...>, 'test.xls')
  'second' workbook:(<...DummyBook...>, 'test.xls')
  'first' sheet:(<xlrd.sheet.Sheet...>, 'Sheet1')
  'second' sheet:(<xlrd.sheet.Sheet...>, 'Sheet1')
  'first' row:(0, 0)
  'second' row:(0, 0)
  'first' cell:(0, 0, 0, 0)
  'second' cell:(0, 0, 0, 0)
  'first' cell:(0, 1, 0, 1)
  'second' cell:(0, 1, 0, 1)
  'first' row:(1, 1)
  'second' row:(1, 1)
  'first' cell:(1, 0, 1, 0)
  'second' cell:(1, 0, 1, 0)
  'first' cell:(1, 1, 1, 1)
  'second' cell:(1, 1, 1, 1)
  'first' finish:()
  'second' finish:()


  MemoryLogger
  ------------

  This filter will dump stats to the path it was configured with using
  the heapy package if it is available. If it is not available, no
  operations are performed.

  For example, with a MemoryLogger configured as follows:

  >>> from xlutils.filter import MemoryLogger
  >>> m = MemoryLogger('/some/path',methods=('sheet','cell'))

  The equivalent of the following call::

    from guppy import hpy; hpy().heap().stat.dump('/some/path')

  will be performed whenever the MemoryLogger's sheet and cell methods
  are called.

  A MemoryLogger configured as followed will log memory usage whenever
  any of the MemoryLogger's methods are called:

  >>> m = MemoryLogger('/some/path',True)

  For more information on heapy, please see:

  http://guppy-pe.sourceforge.net/#Heapy

Example Usage
=============

Here's an example that makes use of all the example components
above to filter out odd numbered rows from an original workbook's
sheets. To do this we need one more filter:

>>> from xlutils.filter import BaseFilter
>>> class EvenFilter(BaseFilter):
...
...     def row(self,rdrowx,wtrowx):
...         if not rdrowx%2:
...             self.next.row(rdrowx,wtrowx)
...     
...     def cell(self,rdrowx,rdcolx,wtrowx,wtcolx):
...         if not rdrowx%2:
...             self.next.cell(rdrowx,rdcolx,wtrowx,wtcolx)

Now we can put it all together with a call to the process function:

>>> from xlutils.filter import process
>>> process(
...     MyReader('test.xls'),
...     MyFilter('before'),
...     EvenFilter(),
...     MyFilter('after'),
...     MyWriter()
...     )
before start
after start
before workbook <xlrd.Book object at ...> test.xls
after workbook <xlrd.Book object at ...> test.xls
before sheet <xlrd.sheet.Sheet object at ...> Sheet1
after sheet <xlrd.sheet.Sheet object at ...> Sheet1
before row 0 0
after row 0 0
before cell 0 0 0 0
after cell 0 0 0 0
before cell 0 1 0 1
after cell 0 1 0 1
before row 1 1
before cell 1 0 1 0
before cell 1 1 1 1
before sheet <xlrd.sheet.Sheet object at ...> Sheet2
after sheet <xlrd.sheet.Sheet object at ...> Sheet2
before row 0 0
after row 0 0
before cell 0 0 0 0
after cell 0 0 0 0
before cell 0 1 0 1
after cell 0 1 0 1
before row 1 1
before cell 1 0 1 0
before cell 1 1 1 1
before finish
after finish

As you can see if you've read this far, there's quite a lot of output,
but it's certainly informative! However, just to be on the safe side,
we can see that the output file was actually written:

>>> os.listdir(our_temp_dir)
['test.xls']
