==================
    TODO: nabu
==================

.. contents::
..
    1  Misc
    2  Server Setup
    3  Directives
    4  Extractors
      4.1  Definitions
      4.2  Questions
    5  Jeff's Suggestions

Misc
====

- nabu/docutils: Create a web page on nabu for docutils proposals for
  establishing conventions for microformats, use my initial Nabu stuff
  and send an email to the mailing-list asking for feedback.
  Eventually we want to move some of Nabu's functionality directly
  into docutils.

- I want, like a wiki, to be able to serve documents from a local
  checkout. IMHO this whole thing would be simpler if the file came
  out of a Subversion repository and were accessible as such, e.g.
  references to local documents from the file would be translated to
  filenames from the subversion repository. I want to do this::

    .. image:: myfigure.png

  in a directory::

      /path/to/blog/losangeles/entry.txt
      /path/to/blog/losangeles/myfigure.png

  and for all the paths to be found automatically. This means that
  Nabu would have to become tied to a Subversion checkout. I think
  that might even be reasonable at some point, if it allows me to be
  able to access the files. I don't know.


- Implement a swish-e index on the database and link it up to walus.

- Check out links:
  - stikkit:
  - m[uo]ndial: db of place-names in the world
  - “The Predictors”, book, see ref.

- Interrupting the nabu client should be handled.

- Add a feature to directly email a brand-new peephole from the
  website (just enter the email addresses and send it from there).

- Add a field for some kinds of entries that allows us to store the order within
  a document in which the entry occurs. We should be able, for example, to extract
  the five first entries showing up in a source file.

- Produce a photo for my main website page, a series of photo-booth photographs
  stacked next to each other, vertically, for the main page.

- Birthday fields do not seem to extract properly, for the most pat

- Make my addrbook table available for my mobile phone

- You **REALLY** need to support multiple profiles for running nabu manually.
  It's really annoying to delete the entire server database by accident.

- We really need to detect a syntax for “Question”.  e.g. Q:

- We need to document the different input syntax extensions, in a single, very
  simple, document.

- All the astext() functions from the extractors should be replaced by a
  function that checks for the presence of lists, and concatenate if the arg is
  a list.  See book() for example.

- Skip symbolic links by default.

- Fix Nabu in that if there is an error, the client script should keep going
  to try to publish the other documents.

- Can I use views to hide the disclosure requirement in my queries entirely? e.g.::

    CREATE VIEW documents_public AS SELECT * FROM document;
    CREATE VIEW documents_shared AS SELECT * FROM document WHERE disclosure > 1;
    CREATE VIEW documents_private AS SELECT * FROM document WHERE disclosure > 2;

  And the same for all the remaining tables?  All the queries would then have to
  append the correct names from the tables they fetch from.

- Complete the movie extractor to actually store the list of movies, and to
  render the link using the catalog if there is a catalog number.

- We need to provide a way for the extractors to reliably report errors, some
  sort of function.

- You need to define a syntax for multiple links sharing the same description
  and keywords

- When an error occurs during upload, we should log the trace of the exception
  to the logging module, in order to be able to debug problems easily. (see
  gijsbert's email).

- We need to process the transforms within a single transaction, and to roll it
  back if there is an error, including rolling back the original deletions from
  all the table.  The point is that we want to maintain database consistency
  even if there is an error while processing the document.  This is very
  important!

- Make the source storage objects use a acquire/release pattern compatible with
  antipool, so that we could create them once in a mod_python handler rather
  than every time.

- When we change a document's unique Id, the old document remains in the
  database… is there anything we can do about that?

- We need to detect encoding errors better, I have too many documents which do
  not get detected/guessed automatically

- Write an extractor for all my google maps links and create a list of them, so
  I can find locations easily.  MapExtractor

- Write an extractor for dictionary definitions;  in this format:

    defn: unwonted
       Definition of unwontedness.


- Find how to suppress output of errors in HTML conversion, to avoid catching
  this output in the extractors and having this output pollute the contents of
  the extractors.

    e.g. photography.txt and contacts.

- Nabu ``-l`` is broken.

- Test and check with docutils 0.4

- Revise and fix ``client.py --dump``

- Add a "--config=" option to client rather than force setting NABURC.

- Make the html rendering not render output errors.

- support encrypted files from the publisher

  - problem: how do we identify documents to be published without decrypting?

    - emacs config must keep plain text at the top of the file when decrypting
      in a variable and add it back in when reencrypting

- shares some of the advantages of XML, structured hirarchical tree of nodes


Server Setup
============
- implement asynchronous document processing on the server

- check database locking issues in more detail?


Directives
==========

- Do something about the locator directive

Extractors
==========

- warn on files that have no titles (add minimum requirements extractor)

- there should be and .. end directive to stop the processing from that point on.
  - .. disclosure:: private directive could achieve something similar

- add a generic field list parser for creating new types of records without
  code: the entry type is the first field list, if it matches some pattern
  (e.g. if it's empty);

- url extractor: when right after a title, a list of URLs should use the title
  as the description string

Definitions
-----------

Support this:

  :paramour
    • noun archaic or derogatory a lover, especially the illicit partner of a
      married person.
    — ORIGIN from Old French par amour ‘by love’.

  :“la monstre tune”
    expr: beaucoup de fric.

- the first is a word definition
- the second, with either “, ", or ” is an expression



Questions
---------

Support the following syntax for questions:

    :Q: dhs dhs dhshd shdsd hsds as Emperor (Tsar) of Bulgaria from 1331 to 1371,
        during the Second Bulgarian Empire. The date of his birth is unknown. He died
        on February 17, 1371. The long reign of Ivan Alexander is considered a
        transitional period in Bulgarian medieval history. Ivan Alexander began his
        rule by dealing with internal problems and external threats from Bulgaria's
        neighbours, the Byzantine Empire and Serbia, as well as leading his empire
        into a period of economic recovery and cultural and religious
        renaissance. However, the emperor was later unable to cope with the mounting
        incursions of Ottoman forces, Hungarian invasions from the northwest and the
        Black Death. In an ill-fated attempt to combat these problems, he divided the
        country

           ...that competitions for the design of José Martí Memorial (pictured) in
           Havana, Cuba started in 1939, but the design that was finally constructed
           in 1953 was a variation on a design that had come in third .in the fourth
           competition?




Jeff's Suggestions
==================

Some things to be taken care of in here:

    > Hi Martin!  I've been playing with Nabu for the past couple of days, writing
    > my own extractors.  I'm looking at using it for the Python Advocacy effort,
    > where difference people can write documents on certain topics and Nabu links
    > them together in a way more structured than a wiki.  You can see a preliminary
    > page at advocacy.dfwpython.org, where the various drop-down links will go to
    > documents processed by Nabu.

    Oopsy I get an error.

    Bad Gateway
    The proxy server received an invalid response from an upstream server.




    > - In the extractor sources, I would add a comment warning people that only
    >    the first instance of a field in a block gets taken.

    You mean "in a field list".  That depends on the extractor.  The
    extractors are still changing a fair bit, and I'm not sure I consider
    them to be a part of Nabu just yet.  The ones I provide are the ones I
    use, but they really are just examples.  I'm thinking of separating
    them at some point.  Not sure it matters, there are so few users yet
    anyway.


    > - I wish there were a way for a date created/modified.  I can store _a date_
    >    in the store() method, but there is no distinction between create and modify
    >    at this level since Nabu entirely removes a record before adding/modifying
    >    it.  Admittedly this simplifies things, omitting the need for SQL MODIFY
    >    directive, but it loses the create/modify distinction.  Perhaps Nabu could
    >    stash the record from the __sources__ table and pass it to store(), so that
    >    store could know if this was the first or Nth time this document has been
    >    seen and have access to the date column in __sources__.

    Even that would not be sufficient: things extracted from the document
    themselves can be added, deleted or modified.  Unless there is a way
    to uniquely identify those "things" (whatever they are, bookmark,
    contact info, etc.), there is no way to figure that out.  IMO the
    problem is ill-defined.  However I think it would be possible to
    implement this for the documents themselves, as a special exception.
    I'm not sure I need it though.


    >    I know how to do this now, but it wasn't spelled out in the docs.

    thanks! :-)


    > - It would be useful for the extractors to have a method invoked at the
    >    start and end of processing a series of documents, for the purpose of
    >    generating summary views across a set of documents that might have changed.

    Great idea!


    > - It was unclear how the order in which extractors appear in the transforms
    >    list relate to the 'default_priority' field within each extractor.  They
    >    each control different aspects of processing but I don't quite get it.

    The default_priority defines the ordering.  This is more of a docutils
    issue that needs to be documented in the Nabu extractors developers
    guide.


    > I was at one point using multiple tables per extractor, for a many-to-many
    > relationship and hit the following:
    >
    > - In extract.py, class SQLExtractorStorage, the __init__ logic iterates over
    >    the sql_tables dictionary, creating them.  However if there are dependencies
    >    between the tables, such as one table for values and a second table for a
    >    many <-> many relationship, the second table may get created first,
    >    depending upon the vagueries of dictionary hashes.  I changed __sql_tables__
    >    to use an 'ordered dictionary' from the Python Cookbook so that the tables
    >    were created in the order listed.

    Hehe I also hit this problem not too long ago.  I changed the list to
    be a list of tuples instead, which now also includes the type of the
    object, e.g.

       sql_relations_unid = [
           ('document', 'TABLE', '''

               CREATE TABLE document
               (
                  unid TEXT PRIMARY KEY,
                  title TEXT,
                  author TEXT,
                  date DATE,
                  abstract TEXT,
                  location TEXT,

                  -- Disclosure is
                  --  0: public
                  --  1: shared
                  --  2: private
                  disclosure INT DEFAULT 2,

                  -- If this document is a tag index, the tag for which it is.
                  tagindex TEXT DEFAULT NULL

               )

           '''),

           ('tags', 'TABLE', '''

               CREATE TABLE tags
               (
                  unid TEXT NOT NULL,
                  tagname TEXT
               )

           '''),
           ]

       sql_relations = list(locresolv.schemas) + [
           ('tagindex_idx', 'INDEX',
            """CREATE INDEX tagindex_idx ON document (tagindex)""")
           ]



    > I also use table/column constraints in my database and had trouble controlling
    > the order in which tables were created/dropped, to satisfy those dependencies.
    >   I use 'CASCADE ON DELETE' on my tables to automate most of it.  I used the
    > 'ordered dictionary' trick to handle some more of it, and then wrestled with
    > the default_priority/transforms[] ordering.  However I still had these problems:

    (same as above.)


    > - I wanted my new tables to have their 'unid' dependent upon the Nabu master
    >    table which, I thought at first, was the doctree table (later I saw that
    >    it really is __sources__).  The problem is that I had to add a FinalDocTree

    yes, __sources__ is the single master resource.


    >    and InitialDocTree, to initially INSERT the unid into the doctree table
    >    at the start of processing, so that other table's later external references
    >    would work.  The FinalDocTree recorded the pickle of the document after all
    >    transforms have run.

    Personally I just store the final doctree and get rid of the initial
    doctree.  I don't have a constraint indeed, and I really should do
    that.



    > - When I switched to __sources__ I saw that you don't insert the document's
    >    info until the end of processing, breaking referential constraints.
    >    However because the various extractors each do their own database commits,
    >    this means the database is in an indeterminate state during processing.
    >    I'd like to see Nabu insert the info entry at the start of extractor
    >    processing and avoid doing any database commits until the end of all
    >    extractor processing.

    Yes (this is related to the above, I should fix it, great observation).





    > - The database security info is in the cgi-bin file connect.py.  I had to
    >    make sure that directory was execute-only re Apache.

    Bah, Apache stuff.  Always a PIA that stuff.


    > - The docs need a bit more Apache setup info, re cgi-bin, HTTP auth and
    >    perhaps even how to use HTTPS/SSL for non-public portions of the
    >    interaction.

    sure.


    > - For some reason, the cgi-bin program nabu-contents.cgi always displays
    >    a 404 Not Found message at _the bottom_ of the content.

    Oopsy. I think I fixed that a while ago.  Could you fetch the snapshot
    and confirm?


    > - If a 'user=' is not specified in the nabu.conf file, and one is not passed
    >    in on the command-line, nabu connects as user 'None'.  Perhaps it ought to
    >    print an error message.

    Ooppsy indeed.  Sounds like a bug.




    > - When there is an error in the CGI, the traces show the database login info.
    >    This may be a problem with the xmlrpclib, but it was disconcerning.

    Hmmm.  Sounds more like a CGI problem, not sure what I can do about that one.

    Note that I did a lot of work on the error reporting since your
    snapshot (dec 2005).  That used to be one of the weak areas, and
    sometimes still is when an extractor crashes.  One thing I'd like to
    do is handle xmlrpclib errors more gracefully.


    > - The docs need an expanded section on the "marker_regexp" and how document
    >    ids work.  First, the wording is unclear on whether _each_ document needs
    >    a unique id or whether a _set_ of documents need a group id.  As it turns
    >    out, the answer is both.  The docs give an example of a unicode spec
    >    preceding the :Id: field, but I didn't realize that you had to adjust your
    >    marker_regexp to _match_ that unicode spec.   So with the marker_regexp

    Hmm do you have to?  The client.py is doing re.search(), so you should
    not have to change it.  My .naburc does not customize the regexp.  I
    just tested with

    .. -*- coding: utf-8 -*-  :Id: testttt

    and it worked for me.


    >    field you can select, via a proper pattern, _which_ documents are examined
    >    and then you also set the pattern to extract the particular id you want.
    >    Some examples would help in the docs.  This also needs to clearly state

    Will do.


    >    the purpose of the (group) portion, and also what part of the pattern
    >    will be removed from the document before transmission.  I'm still getting
    >    :Id: transmitted but with a blank value and I'm not sure why.  I'm sure
    >    its to do with the pattern match.

    I'm not sure what you mean with the "group" portion.  What is this?










Hi Jeff

I did a little bit of work on Nabu while the laundry was going:

--------------------
    1. The nabu command uses a 4-argument call to process_source
      but the nabu server wants 5-arguments.  I added reporting_level
      to the client and it works now.

Hmm that's strange, I have this in client.py:223 ::

            errors, messages = server.process_source(
                pfile.unid, pfile.fn, xmlrpclib.Binary(pfile.contents),
                report_level)

and this in server.py:269 ::

    def process_source(self, unid, filename, contents_bin, report_level):

Do we have the same version?



--------------------
    2. In nabu-publish-handler.cgi, there is an invocation of the global method
      'create_server' but it isn't imported or defined in that .cgi file.
      Something got refactored incorrectly, I think.

FIXED


--------------------
    1. In the document extractor there is a reference to a 'walus' module, but
      no sign of it in a normal install.  I've commented it out.  Perhaps a
      conditional import test to drop out its functionality if not present?

The problem here was that I have a library called locresolv in another project
called 'walus' that depends on another project 'antiorm' that I do not want nabu
to depend on.  I have moved locresolv to nabu and make it a conditional import.
One remaining issue is that locresolv has some more work to be done on it in
order to be usable, so for now I have made it disabled.

BTW, locresolv is a cool library that allows you enter location names loosely,
given the assumption that the components of a location name are distinct, for
example, if this location has been seen at least once and is available in the
database::

   Farmer's Market, Beverly Hills, Los Angeles, California, USA

thereafter all of the following location entries will resolve to the same
place::

   Farmer's Market
   Farmer's Market, Beverly Hills
   Farmer's Market, Beverly Hills, USA
   Farmer's Market, California, USA
   Farmer's Market, Los Angeles, USA

Basically, it will resolve the name to its most complete form, so that when you
create a new blog entry you don't have to think about it too much.



--------------------
    3. Currently I'm tracking down an inability to do basic things with the
      server - I'm getting an XML-RPC error code of -1, with no traceback
      or error text.  This is when it is attempting to 'ping' the server
      right after creating the server proxy.

XML-RPC is notoriously terse when there is an exception on the server
side--almost nothing is brought back to the client.  I definitely need to do
something about this.

 FIXME


--------------------
    2. I chased what I thought was a problem, in that my old extractors
      stopped extracting data under the new snapshot.  No errors, they
      just didn't match anything.  The cause was a node.clear() call
      in the document extractor.  It removed the whole docinfo section,
      but I had a half dozen extractors plucking out various fields in
      that area.  Removed the node.clear() and they work again.

I remember fixing a bug with the docinfo but do not remember if that was what
you describe.   The node.clean() call in document.py::

        def depart_docinfo(self, node):
            self.in_docinfo = 0

            # Remove the bibliographic fields after processing.
            node.clear()

            raise nodes.StopTraversal

is necessary because we do not want to leave the docinfo fields visible in the
output rendered document.  I think that it would be wiser to do this at
rendering time instead.  So I removed the node.clear() call, and changed the
renderer in Walus instead to not render the docinfo fields.

--------------------
    > I thought I'd send along some feedback on the things I learned along the way.
    >   I was using the Dec 2005 release, not the intermediate SVN versions,
    > figuring the formal release is more stable.

    Actually, the latest is the most stable.  I fixed a lot of bugs since
    then.  Always use the snapshots.  I want to setup a public SVN server
    but I cant' right now because the SSL on my machine is hogged for my
    commercial application.

I spent some time today trying to setup a public Subversion repository for you
to use; it's taking time, I'm having troubles with DAV, trying to fix this with
my sysadmin ppl, I think the firewall rules do not allow it.  I have setup a
trac though:

  http://furius.ca/trac/public

The subversion will be at

  http://furius.ca/svn/public

when it's working (GET requests are already working, you still can't checkout or
ls, I don't know why)




--------------------
    3. Chased another false alarm; I missed changing one of my extractors
      from using sql_tables to sql_relations_unid in the storage class.
      However Nabu raised no alarm as a result of the SQL storage class
      not having any SQL tables.  Perhaps a sanity check would be useful,
      at least as a warning.


Well, there are defaults, and the defaults are empty.  You would like to force
the declarations instead?


--------------------
    > - I'd like to see some discussion in the docs about using the command-line
    >    switches.  For example, if I add an extractor on the server, but none of
    >    my documents have changed, how do I force a rerun of all my documents?

    Yes, I will add that in the docstring for the nabu client.  "nabu
    --help" has so many options that it isn't very clear how to use it.  A
    more human-readable document is in order.

Done (some).


--------------------
    > - At first I couldn't get nabu-contents.cgi to work, until I realized that
    >    it is tied to the specific set of tables used elsewhere.  I had book and
    >    link disabled as I wasn't using them, and was puzzled by the errors.

    Yes, those scripts are meant to be used as examples.
    In my current presentation layer, all that stuff is done wihtin
    resource handlers with my Ranvier framework.  I don't even use the CGI
    scripts.  They are just there for demo.  I should indicate that more
    clearly.


I add a big nasty warning in the cgi-bin publisher script.


--------------------
    > - In the docs, the section that talks about ~/.naburc or nabu.conf is unclear
    >    whether the client, server or both uses it.  I now know that only the
    >    client uses those files, and that the server uses settings in the cgi-bin
    >    files.

    Will clarify, thx.


Done.


--------------------
    > - All the tutorials are unclear about transforms.  Does simply "extracting"
    >    information from the tree necessarily alter it?  I didn't see much on why
    >    and how you'd actually _modify_ the tree.  Some expanded material on when
    >    and how to actually transform the tree would be cool.

    I've written a book extractor that modifies the tree recently.
    Basically, every time you want a list of things expressed as field
    lists the HTML rendering looks pretty bad (render a list of field
    lists with the HTML renderer and you'll see what I mean, it renders
    horrible, I think this is because of the way the HTML renderer works
    in most browsers).  My books get converted into DIVs now, it looks
    much better.

    Note that you may be able to make this change at render time rather
    than at extraction time.  Both are possible.


I added a section on this in the tutorial.
