==============================
  Nabu: Writing an Extractor
==============================

:Abstract:

   This document shows an example of writing an extractor for identifying and
   saving parts of documents from within a Nabu server.

.. contents::


Introduction
============

Nabu offers a framework for the extraction of meaningful portions of documents
into structured storage (e.g. database tables).  This extraction will generally
be customized for each end application and therefore we anticipate that people
will write their own extractors.

This document provides an example of that.  It aims at demonstrating the
simplicity of the task.

Note that we will be storing data with an SQL DBAPI-2.0 interface. You may want
to read `PEP 249`__ if you're not familiar with Python DBAPI-2.0.

__ http://www.python.org/peps/pep-0249.html


Example Application: Extracting Books from Field Lists
======================================================

For our example, we will examine the task of extracting information for book
references.  Imagine that we want to be able to put field lists that are book
references, scattered all over our documents, something that looks more or less
like BibTeX entries, for example::

   :book:
   :title: National Geographic Photography Field Guide 2nd Edition:
           Secrets to Making Great Pictures
   :author: Peter Burian, Bob Caputo
   :isbn: 079225676X
   :review:

     Excellent diversified advice from top photographers.  The book
     manages to pack lots of relevant content in a small format.  It
     contains a nice section on composition, which is what originally
     attracted me to it.  As per usual, I found the digital
     photography section useless, but for the most part, the
     information available in this book is of great value.  This is
     the best general book about photography that I've read.

Since the docutils parser works recursively, those field lists can be located
anywhere within the document [#]_, for example, in an item list, e.g.::

   Recent books read:

   - :title: Animal Farm
     :author: George Orwell
     :url: http://www.online-literature.com/orwell/animalfarm/
     :review:

        A classic work.

   - :title: Free Culture
     :author: Lawrence Lessig
     :url: http://www.amazon.com/o/ASIN/0143034650/

   - :title: A Scanner Darkly
     :author: Philip K. Dick
     :isbn: 0679736654

   ...

We would like to avoid having to say that the field list is for a book; it would
be nice if the parser could just figure that out by itself.  For example, the
empty ``:book:`` field in the first example above should not be mandatory for
the book reference to be detected.

  
.. note:: 

   One additional detail is that the field lists should be separate.  In
   reStructuredText, fields are allowed to have whitespace between them, so
   that::
   
      :title: A Scanner Darkly
   
      :author: Philip K. Dick
   
      :isbn: 0679736654
      :url: http://www.amazon.com/o/ASIN/0679736654/
   
   is a single field list.  This means that you cannot separate field lists with
   just whitespace.  I'm using an itemized list in the previous example, but
   other document structures could be used as well (see the `docutils document
   tree`_ a list of constructs that get recognized in reStructuredText).  This
   example exhibits a subtlety that may arise when you choose which structures
   will be parsed: you have to think about how the docutils parser will
   represent the data in the tree of nodes.  For example, the following would
   not work because it gets parsed as a single field list::
   
      :title: Free Culture
      :author: Lawrence Lessig
      :url: http://www.amazon.com/o/ASIN/0143034650/
    
      :title: A Scanner Darkly
      :author: Philip K. Dick
      :isbn: 0679736654

 
Looking at the Document Structure
=================================

Here is a pseudo-xml rendering of how the above example gets parsed by docutils
into a tree of nodes.  This is the tree that we will have to visit to extract
the references from.  You can use docutils' ``rst2pseudoxml.py`` tool to
generate such a tree to be able to visualize the document structure, and to help
design and write your extractor/visitor::

  <document source="/tmp/example.txt">
      <paragraph>
          Recent books read:
      <bullet_list bullet="-">
          <list_item>
              <field_list>
                  <field>
                      <field_name>
                          title
                      <field_body>
                          <paragraph>
                              Animal Farm
                  <field>
                      <field_name>
                          author
                      <field_body>
                          <paragraph>
                              George Orwell
                  <field>
                      <field_name>
                          url
                      <field_body>
                          <paragraph>
                              <reference refuri="http://www.online-literature.com/orwell/animalfarm/">
                                  http://www.online-literature.com/orwell/animalfarm/
                  <field>
                      <field_name>
                          review
                      <field_body>
                          <paragraph>
                              A classic work.
          <list_item>
              <field_list>
                  <field>
                      <field_name>
                          title
                      <field_body>
                          <paragraph>
                              Free Culture
                  <field>
                      <field_name>
                          author
                      <field_body>
                          <paragraph>
                              Lawrence Lessig
                  <field>
                      <field_name>
                          url
                      <field_body>
                          <paragraph>
                              <reference refuri="http://www.amazon.com/o/ASIN/0143034650/">
                                  http://www.amazon.com/o/ASIN/0143034650/
          <list_item>
              <field_list>
                  <field>
                      <field_name>
                          title
                      <field_body>
                          <paragraph>
                              A Scanner Darkly
                  <field>
                      <field_name>
                          author
                      <field_body>
                          <paragraph>
                              Philip K. Dick
                  <field>
                      <field_name>
                          isbn
                      <field_body>
                          <paragraph>
                              0679736654


The Extractor and the ExtractorStorage
======================================

In order to extract some stuff from our documents, we need to provide at least
two classes:

1. an extractor class, which is essentially a class derived from the docutils
   ``Transform`` class, whose role is to visit the parsed document tree and to
   find the stuff that it wants to find.  In our example, this class will be
   running a visitor to look for docutils field lists nodes and it will check
   the field names to find out if they are for books;

2. an extractor storage class, whose role is to put detected book references
   into whatever storage is desired.  It is decoupled from the extractor so that
   the same extractor algorithm can support different storage mechanisms.  This
   storage object is normally provided by the publisher handler script in its
   configuration.

   Typically, we would store the references in a database, or convert them into
   BibTeX format and store them sequentially in a file for later use.


Example code
============

All the code for implementing the book algorithm above fits neatly in a short
file.  Let's name it ``book.py``.

Imports
-------

We start our Python script with a brief description and some imports::

  #!/usr/bin/env python
  """
  Extract book entries.
  """

  # nabu imports
  from nabu import extract
  from nabu.extractors.flvis import FieldListVisitor

We import the ``nabu.extract`` module which contains the base class that we
need.  We also import a utility class that knows how to visit docutils field
lists [#]_.

The Extractor Class
-------------------

We then define the extractor class, which is a docutils ``Transform``::

  class BookExtractor(extract.Extractor):
      """
      Transform that looks at field lists and that heuristically attempts
      to find references to books.  For example,

        :title: National Geographic Photography Field Guide 2nd Edition:
                Secrets to Making Great Pictures
        :author: Peter Burian, Bob Caputo
        :url: http://www.amazon.com/o/ASIN/079225676X/
        :review:

          Excellent diversified advice from top photographers.  The book
          manages to pack lots of relevant content in a small format.  It
          contains a nice section on composition, which is what originally
          attracted me to it.  As per usual, I found the digital
          photography section useless, but for the most part, the
          information available in this book is of great value.  This is
          the best general book about photography that I've read.

      The heuristic looks at an empty :book: field, an :ISBN: field, or
      some list that has an author and a title. See the Book class below to
      find out which fields are stored.
      """

      default_priority = 900

Note that we put the full description in the docstring of the extractor class
rather than in the module docstring because the Nabu publisher program can
concatenate all the document strings of the extractors configured in the publish
handler CGI script and return that to the client, so that people who publish
content to a specific Nabu store can get an idea of what structures are
configured on the server (see the ``--help-transforms`` option to the Nabu
publisher program).

We also set the default priority that will specify the order in which the
extractors get to run.  This is important for extractors which modify the
document tree (the document tree can be stored and then served on a web server
just like extracted content--in fact, there is a trivial extractor provided that
does just that.  This can be easily leveraged to implement a Wiki or a Blog).

We then implement the ``apply()`` method which is called by docutils, after it
sets the ``.document`` attribute on the extractor (that is part of how
transforms are run by docutils)::

  ...
      def apply( self, **kwargs ):
          self.unid, self.storage = kwargs['unid'], kwargs['storage']

          v = FieldListVisitor(self.document)
          v.initialize()
          self.document.walk(v)
          v.finalize()


The keyword arguments are always the unique id for the source document and the
extractor storage object that is configured on the publisher handler.

Note that we use a special FieldListVisitor class that we've built to simplify
visiting generic field lists.  You could use any of the docutils visitors here
instead.  We run the visitor on the document by calling the ``walk()`` method.
This special visitor accumulates all the field lists in the document into a
dictionary that we then process and implement our heuristic to find the field
lists that match our criteria::

  ...
          for flist in v.getfieldlists():
              book = 0
              # if there is an empty book field, this is explicitly a book.
              if 'book' in flist and not flist['book'].strip():
                  book = 1
              # if there is an ISBN number, then it is *definitely* a book.
              elif 'isbn' in flist:
                  book = 1
	      # if there is an author and a title
              elif 'author' in flist and 'title' in flist:
                  book = 1

              if book:
                  self.store(flist) # store the book

Next, we implement a store method that converts the field value nodes returned
by the ``FieldListVisitor`` into Unicode text and calls our associated storage
object to actually put the data somewhere::

  ...
      def store( self, flist ):
          emap = {}
          for k, v in flist.iteritems():
              emap[k] = v.astext()

          self.storage.store(self.unid, emap)

That's it for the extractor class!

The ExtractorStorage Class
--------------------------

This class is in charge of storing the information identified by the extractor
and is called by it when it finds a new instance to store.  Different storage
classes can be created for different backends.

Note that we declared a ``unid`` field that is required not to be null.  This is
necessary to make it possible to remove the objects from a source document that
is being reprocessed.  We first remove all the old objects (this is done
automatically) and then the reprocessing adds the new set back into the
database.  All information extracted from a document is required to be
associated with the source document using this unique id field ``unid``.

Next, we write the extractor storage class::

  class BookStorage(extract.SQLExtractorStorage):
      """
      Book storage.
      """
      sql_tables = { 'book': '''

          CREATE TABLE book
          (
             unid TEXT NOT NULL,
             title TEXT,
             author TEXT,
             year TEXT,
             url TEXT,
             review TEXT
          )

          '''
          }

      def store( self, unid, *args ):
          data, = args

          cols = ('unid', 'title', 'author', 'year', 'url', 'review')
          values = [unid]
          for n in cols[1:]:
              values.append( data.get(n, '') )

          cursor = self.connection.cursor()
          cursor.execute("""
            INSERT INTO book (%s) VALUES (%%s, %%s, %%s, %%s, %%s, %%s)
            """ % ', '.join(cols), values)

          self.connection.commit()

Here we derive it from the special ``SQLExtractorStorage`` that Nabu provides
for storing stuff in databases using SQL.  Knowing which schemas the
storage class uses, this base class simply implements the protocol to initialize
the database connection for the wrapper objects, implements the protocol for
clearing objects extracted from a specific document, and to reset the tables
(see the source code if desired, it is very very simple).  The expected schema
classes are specified in the class attribute ``sql_tables``.

The class basically just creates a new book reference entry by instantiating the
``Book`` schema class.  It fills in default empty values for the entries that
have not been found by the extractor (we could have implemented this in the
extractor itself--this is by choice of contract between the extractor and the
storage object).

This completes the source code for our example.

Testing Your Extractor
======================

Before setting up your Nabu publisher handler with the new extractor, you can
test it using the ``nabu-test-extractor`` code provided with Nabu (installed
under ``nabu/lib/python/nabu/testextr.py``).

Run it like this, on a test document in reStructuredText format ``reading.txt``,
which presumably contains book references (create your own test document with
your favourite books)::

   $ ./testextr.py book.py ~/reading.txt


Publisher Handler Configuration Code
====================================

The only part that remains to be done in order to feed a database with book
references is to configure your Nabu publisher handler with the new extractor.
The Nabu publisher handler is the script that you should install on your web
server to receive the source documents and run the extractors on them.  There is
an example publisher handler under ``nabu/cgi-bin/nabu-publish-handler.cgi`` in
the source distribution.

First import your extractor code at the top of the file::

    ...
    import book
    ...

Then add the new extractor to the list of transforms that the server will be
configured with::

    transforms = (
        ...
        (book.BookExtractor, book.BookStorage(dbmodule, dbconn)),
        )


Complete Source Code
====================

The complete source code for the book example is reproduced here for
convenience::


  #!/usr/bin/env python
  """
  Extract book entries.
  """

  # nabu imports
  from nabu import extract
  from nabu.extractors.flvis import FieldListVisitor


  class BookExtractor(extract.Extractor):
      """
      Transform that looks at field lists and that heuristically attempts
      to find references to books.  For example,

        :title: National Geographic Photography Field Guide 2nd Edition:
                Secrets to Making Great Pictures
        :author: Peter Burian, Bob Caputo
        :url: http://www.amazon.com/o/ASIN/079225676X/
        :review:

          Excellent diversified advice from top photographers.  The book
          manages to pack lots of relevant content in a small format.  It
          contains a nice section on composition, which is what originally
          attracted me to it.  As per usual, I found the digital
          photography section useless, but for the most part, the
          information available in this book is of great value.  This is
          the best general book about photography that I've read.

      The heuristic looks at an empty :book: field, an :ISBN: field, or
      some list that has an author and a title. See the Book class below to
      find out which fields are stored.
      """

      default_priority = 900

      def apply( self, **kwargs ):
          self.unid, self.storage = kwargs['unid'], kwargs['storage']

          v = FieldListVisitor(self.document)
          v.initialize()
          self.document.walk(v)
          v.finalize()

          for flist in v.getfieldlists():
              book = 0
              # if there is an empty book field, this is explicitly a book.
              if 'book' in flist and not flist['book'].strip():
                  book = 1
              # if there is an ISBN number, then it is *definitely* a book.
              elif 'isbn' in flist:
                  book = 1
	      # if there is an author and a title
              elif 'author' in flist and 'title' in flist:
                  book = 1

              if book:
                  self.store(flist) # store the book

      def store( self, flist ):
          emap = {}
          for k, v in flist.iteritems():
              emap[k] = v.astext()

          self.storage.store(self.unid, emap)


  class BookStorage(extract.SQLExtractorStorage):
      """
      Book storage.
      """
      sql_tables = { 'book': '''

          CREATE TABLE book
          (
             unid TEXT NOT NULL,
             title TEXT,
             author TEXT,
             year TEXT,
             url TEXT,
             review TEXT
          )

          '''
          }

      def store( self, unid, *args ):
          data, = args

          cols = ('unid', 'title', 'author', 'year', 'url', 'review')
          values = [unid]
          for n in cols[1:]:
              values.append( data.get(n, '') )

          cursor = self.connection.cursor()
          cursor.execute("""
            INSERT INTO book (%s) VALUES (%%s, %%s, %%s, %%s, %%s, %%s)
            """ % ', '.join(cols), values)

          self.connection.commit()


Modfying the Document Tree
==========================

Your extractors do not have to modify the document tree, their goal is to find
information from the document tree and save it somewhere else (e.g. in the
database), and to leave the document tree alone (the information from the
document tree gets rendered if you then render the document itself).

However, if you like, you can make modifications to the document tree.  For
example, if your extractor discovers some pattern of nodes for storage, you may
also want to add a custom class to the parent node so that you can later render
those parts of the document specially marked with CSS rules.  You can also
remove nodes from the tree if you like.  

When Nabu processes the extractors on the server, the document tree gets stored
*after* your extractors have run.  Your modified tree will be stored in the
sources table in the database.


Final Notes
===========

Certainly, some improvements could be applied to the example we provide.  The
extractor could use a more sophisticated heuristic to determine if the field
list represents a book, parse various formats for the ISBN value so that it is
normalized and can then be used to automatically look up details from a URL on
some external website, we could convert and store the extracted values in BibTeX
format for later reuse, add more fields to the schema, parse not just generic
field lists, but also bibliographic fields of documents to allow them to enter a
book entry as well, etc.  This would not be difficult at all.

As you can see, writing extractors is very easy.  You can create as many
extractors as you want.  Nabu comes with a small library of extractors which
will grow as I leverage it more and more for my own applications, and with
contribution from other people if they care to send them to me.

You can find other extractors--including this one--in the Nabu source
distribution under ``nabu/lib/python/nabu/extractors``.

If you have found any part of this document unclear and you still do not
understand the nature of Nabu after reading this example, please feel free to
contact the author with questions.


.. [#] If the field list appears at the top of the document, it will have been
       parsed by the bibliographic fields parser and thus the ``:author:`` field
       will have been converted into an author node, and a tiny little bit more
       work would have to be done to look into those fields if we want a book
       reference to be specifiable at the top of the document as the
       bibliographic field.

.. [#] The code for this class is very simple and you could write similar
       utilities to enhance code reuse for detecting different generic patterns
       of document structure.  See the `docutils document tree`_ document for
       information about what kinds of nodes the document tree is made of.

.. _`docutils document tree`:
   http://docutils.sourceforge.net/docs/ref/doctree.html

