----------
ore.xapian
----------

The package provides facilities for indexing content exposed by a
zope application. It utilizes xapian for its indexing library, and 
operates as a thin wrapper around it. 

features 

 - processes all indexing operations asynchronously.

 - mechanisms for indexing/resolving content from multiple data stores.

 - easy to customize indexing behavior via adaptation.

Content
-------

Let's create some content to work with. The only responsibility on
content for purposes of integration with indexing is that they
implement the IIndexable marker interface. 

  >>> class Content( object ):
  ...    implements( interfaces.IIndexable )
  ...    __parent__ = None
  ...    @property 
  ...    def __name__( self ): return self.title
  ...    def __init__( self, **kw): self.__dict__.update(kw)
  ...    def __hash__( self ): return hash(self.title)
  >>>
  >>> rabbit = Content( title=u"rabbit", description="furry little creatures")
  >>> elephant = Content( title=u"elephant", description="large mammals with memory")
  >>> snake = Content( title=u"snake", description="reptile with scales")   
  >>>

Resolvers
---------

Resolvers allow us to index content from multiple data stores. ie. we
could have content from a relational database, and content from a
subversion, and content from the fs, that we'd like to index into
xapian. Resolvers allow us to unambigously identify object via an
identifier, and to retrieve an object given its identifier. Resolvers
are structured as named utilities, with the utility name corresponding
to the resolving strategy. 

One key requirement, is that we need to be able to load the content
asynchronously in a different thread in order for the indexing
machinery to work with it.

ore.xapian includes resolvers for both sqlalchemy mapped objects and
subversion objects (ore.svn nodes).

For the purposes of testing we'll construct a simple resolver scheme
and some sample content here:

  >>> class ContentResolver( object ):
  ...    implements( interfaces.IResolver )
  ...    scheme = "" # name of resolver utility ( optionally "" for default )
  ...    map = dict( rabbit=rabbit, elephant=elephant, snake=snake )
  ...
  ...    def id( self, object ): return object.title
  ...    def resolve( self, id ): return self.map[id]
  ...
  >>> component.provideUtility( ContentResolver() )

Catalog Definition
------------------

a core responsibility of any application utilizing this package, is to
define the application specific fields of interest to index. 

an application does this via constructing a xapian index connection
and adding additional fields:

  >>> import xappy
  >>> indexer = xappy.IndexerConnection('tmp.idx')
  >>> indexer.add_field_action('resolver', xappy.FieldActions.INDEX_EXACT )
  >>> indexer.add_field_action('resolver', xappy.FieldActions.STORE_CONTENT )
  >>> indexer.add_field_action('title', xappy.FieldActions.INDEX_FREETEXT )
  >>> indexer.add_field_action('title', xappy.FieldActions.STORE_CONTENT )
  >>> indexer.add_field_action('description', xappy.FieldActions.INDEX_FREETEXT )


Queue Processor
---------------

Now we can startup our asynchronous indexing thread, with this
index connection. Note we shouldn't attempt to perform any indexing
directly in the application threads with this indexer, as no locking is
performed by xapian. Instead, write operations are routed to the queue
processor which performs all modifications to the index in a separate
thread/process. For the purposes of testing, we'll also lower the time
threshold for index flushes (default 60s):
  
  >>> from ore.xapian import queue
  >>> queue.QueueProcessor.FLUSH_TIMEOUT = 0.5
  >>> queue.QueueProcessor.start( indexer )
  <ore.xapian.queue.QueueProcessor object at ...> 


Indexing
--------

Content indexing is automatically provided via event integration. Event 
subscribers for object modified, object added, and object removed are
utilized to generate index operations which are processed asynchronously 
by the queue processor. 

Operations
==========

However in order for the proper resolver to be associated with the
index operations for each object we need to construct an operation
factory thats associated to the resolver. The appropriate operation 
factory for an object will be found via adaptation:

  >>> from ore.xapian.operation import OperationFactory
  >>> class MyOperationFactory( OperationFactory ):
  ...      resolver_id = ContentResolver.scheme
  >>> component.provideAdapter( MyOperationFactory, (interfaces.IIndexable,) )

The operation factory is used by the various event handlers to create
operations for the index queue. The default implementation already
provides an appropriate generic implementation for the creation of
operations, our customization is only to ensure that the factory uses
the specified resolver.

Content Integration
===================

Applications will be typically be indexing many types of objects
corresponding to different interfaces and with different attribute
values. An index however tries to index object attributes into a
common set of fields appropriate for generic application usage and 
search interfaces. Therefore a common application need is to customize
the representation of an object that is indexed. 

  >>> class ContentIndexer( object ):
  ...      implements( interfaces.IIndexer )
  ...      def __init__( self, context): self.ob = context
  ...      def document( self, connection ):
  ...          doc = xappy.UnprocessedDocument()
  ...          doc.fields.append( xappy.Field( 'title', self.ob.title ))
  ...          doc.fields.append( xappy.Field( 'description', self.ob.description ))
  ...          return doc  
  >>> 
  >>> component.provideAdapter( ContentIndexer, (interfaces.IIndexable,) ) 

Now let's generate some events to kickstart the indexing:
   
  >>> from zope.event import notify
  >>> from zope.app.container.contained import ObjectAddedEvent
  >>>
  >>> notify( ObjectAddedEvent( rabbit ) )
  >>> notify( ObjectAddedEvent( elephant ) )
  >>> notify( ObjectAddedEvent( snake ) )
  >>> import time
  >>> time.sleep(0.6)

Searching
--------- 

Search Utilities are analagous to xapian search connections. To allow
for reuse of a connection and avoid passing constructor arguments, 
we construct a search gateway which functions as a container for 
pooling search connections and which we register as a utility for
easy access:
 
  >>> from ore.xapian import search
  >>> search_connections = search.ConnectionHub('tmp.idx')

We can get a search connection from a gateway by calling it:

  >>> searcher = search_connections.get()
  >>> query = searcher.query_parse('rabbit')
  >>> results = searcher.search( query, 0, 30)
  >>> len(results)
  1
  
We can retrieve the object indexed by calling the object() method on 
a individual search result:

  >>> results[0].object() is rabbit
  True
  
  >>> query = searcher.query_parse('mammals')
  >>> results = searcher.search( query, 0, 30 )
  >>> len(results) == 1
  True

Content Integration Redux
-------------------------

For verification let's test modification and deletion as well

TODO
 
  >>> from zope.lifecycleevent import ObjectModifiedEvent
  >>> from zope.app.container.contained import ObjectRemovedEvent
  
#>>> notify( o)

Cleanup
-------

To be a good testing citizen, we cleanup our queue processing thread.
  
  >>> queue.QueueProcessor.stop()

Caveats
-------

There are several caveats to using an indexing against relational
content, the primary one of concern is the use of non index aware
applications, performing modifications of the database structure.

there are additional ways to deal with this, if the index queue
is moved directly into the database, then modifying applications
can insert operations directly into the index queue. additionally
most databases support trigger operations that can perform this
functionality directly in the schema structure.

the additional constraint with using database based operations is
that additional properties of the domain model may be lost, or 
hard to capture for other appliacations or database triggers.


Todo 
----

to get a functioning implementation in short order.. several short
cuts were taken, to be revisited as time per permits. 

Caching SearchConnections
+++++++++++++++++++++++++

Ideally we'd be caching search connections for a request, instead of
constructing them on demand.

Multiple Catalogs
+++++++++++++++++

There is a single catalog per zope instance... local sites don't help
since there is no site setup mechanism for the async processing thread.

Batch and Collapse Operations
+++++++++++++++++++++++++++++

Many times an application will have multiple operations for a single
object, ideally an index should be to collapse redundant operations 
for the same object within a given timespan.

Multiple App Servers
++++++++++++++++++++

Multiple Application server processes was outside of the scope of the
initial implementation, and require the following.

This caveat also applies to multi process deployments.

Table Based Queue
+++++++++++++++++

storing operations into an rdbms queue will allow for multi
process/node usage of the index queue.

Remote Searcher
+++++++++++++++

with a multiprocess/node setup, a dedicated index server is needed.

xappy needs to be extended with remote searching capabilities.


