================================================
   Nabu: A Text-Based Content Creation System
================================================

:Author: Martin Blais <blais@furius.ca>
:Date: 2005-04-25
:Abstract:

  We describe a system that aims at making it possible for users to create
  relevance in their content and allow building ways to serve it intelligently
  by providing a semantically rich access to this data.

.. contents::


Introduction
------------

We are considering a system for easing the management and publishing of various
kinds of information personal to a single producer with a system that is "as
simple as possible".

By personal information, we mean:

- contacts list, address book, birthday lists;
- list(s) of links (bookmarks);
- personal notes on specific themes (e.g. shopping for a new barbecue grill);
- essays, journal and blog entries;
- personal photographs;
- calendar and events;
- todo lists;


Motivation / Angles
-------------------

We present differing angles to various problems of information management"

* *Relevance of Information*.  Not all information that transits in a user's
  computer, or that is created by that user, is relevant.  There is a *great
  value* in getting rid of junk.

  We believe that at this point, this task can only be performed manually: there
  is no artificial intelligence algorithm that can be decide what is important
  for you.  This is a key assumption in the motivation for this system:
  ultimately, the user is not just a producer of information, but is also an
  editor.

  There is no search algorithm that will be able to automatically create value.
  However, we recognize that filtering technologies will play an important role
  in helping us become more efficient editors, but we do not believe that they
  of themselves will become able to create "the valuable" automatically anytime
  soon;


* *Disparate Storage and Unaccessible Source Data*.  Information is stored in
  various bits that have an associated meaning to them--let us call them
  "chunks" for now.  All information chunks are stored in different places, for
  example, all addresses are stored in an address book manager program, all
  email in a contacts list program.  Contacts lists are stored in PDAs, and
  despite the availability of synchronization programs (you're lucky if it even
  works), ultimately the data lives in different places and there are multiple
  copies of it.

  Also, these data storages used different methods, often specific to the data
  model that they choose, and only readable by the specific software that
  created them.  This makes them difficult to access this data, to build
  independent services on top.

  Services that allow you to enter some of that data online fall in the same
  trap: they store the data on their machines in a format that is not accessible
  for you, the user, to get it back.  You create value by publishing your data
  using their system but you do not have a way to get it back in a form suitable
  for reuse! ;


* *Data Entry*.  It is difficult to enter the information, for many reasons:

  - every time you need to enter a new type of information, you need to start a
    program specific to that information storage.  These programs change over
    time, and this means that you must learn a plethora of programs, just to
    enter the data;

  - more importantly, many times you would want to mix different types of data
    together in one logical unit.  For example, you might want to open a text
    file when you're researching a specific issue, for example, all information
    you find about a recently announced illness by your doctor, and you would
    want to store links to URLs of interest, contact information about local
    specialists may be able to help you, as well as text that you write yourself
    about the illness, or notes that you make on your condition, whatever.

    Personally, whenever I embark on any substantial task, I create a new small
    document for it--in the form of a text file-- and jot down notes as I
    discover more and more aspects of the problem that I'm working on.  I think
    many people do the same, or would do the same if they found a use for the
    document.  For example, you might want to share some of this knowledge.
    Right now, there is no easy way to do that;


* *Publishing is Difficult*.  Publishing much of the information you accumulate
  is still very hard.  There are many different systems out there which attempt
  to provide a way for you to publish certain types of data, in a specialized
  way.  These require much cusomization, and a market for specialized services
  (such as blogger) has emerged for specific uses of the data.

  Would it not be great if we could build services on top of all your data,
  rather than attempting to solve one specific use of the data?

  Also of note: online publishing systems store the data remotely, which make it
  difficult to build client software to edit it efficiently.  Making it possible
  to upload the data in a digested form once it's authored locally has some
  potential.


Simplicity
~~~~~~~~~~

One central value behind our views, and one that you must keep in mind when
considering the proposition that we're about to make, is that of the importance
of simplicity.  There is *great value* in keeping things *as simple as they need
to be*, because it allows the most flexible reuse of the information.

Much of the rewards of keeping design and data simple can be observed in the
power of the UNIX tools and operating system, which is built upon simple but
very powerful ideas, sockets and files that consist in generic streams of bytes,
and small tools that perform one task really well, and a simple and generic way
to connect those tools together (See "The Art of UNIX Programming",
E.S.Raymond).  This has made possible the creation of complex tools without
having to reinvent the small tools, but rather by improving those small tools in
a generic way, that would henceforth allow more possibilities for connecting
them in yet more different ways.  Keeping things generic and as simple as
possible is a potent idea.

This idea of designing systems as simple as they need be is also prevalent in
the practice of software development.  Over the past ten years, we are seeing
methodologies of development convergence towards this idea.  Extreme
programming, agile methodologies, and the growing adoption of dynamic languages
are a direct expression of the quest for reaching closer and closer to the
essence of the problems we're trying to solve while trying to get rid of
unneeded complications.  In many ways, software development is in the business
of creating complexity.  We are essentially recognizing that keeping our designs
and data models as simple as possible is the most efficient way of controlling
the growth of this complexity.


History
-------

This history behind the creation of this project stems from a long-standing need
from its author to maintain personal information in a way that is most useful
and that can be kept independent from specific software, over long periods of
time.  The sections below outline some of the problems I have tackled in the
past, and the partial solutions I have come to before creating the Nabu
extraction system. Nabu is meant to replace all these tricks to allow me to
extract, organize and selectively publish some of this information.


Maintaining an Address Book
~~~~~~~~~~~~~~~~~~~~~~~~~~~

I needed to maintain an address book.  At the time (circa 1993) on software was
decent that output a textual format which could be read for converting the data
into other formats.  Thus around 1997, I decided to transcribe all my physical
address books in a text file, following my supervisor's advice at university at
the time, used a paragraph-grep program to query it.  This worked great for many
years, except that there was no integration with my email programs.  I could
however grep and sed the address book file to generate a text file that could in
turn be imported by various email systems.  Over time, the one address book file
grew into many, and new contact information moved gradually into the documents
which provided context for them.

I think at some point I have started using the LDAP LDIF format to store the
files, but the naming was a bit too long or annoying to add entries with a text
editor, so I just created my own simple format, which looks like a list of
entries like this::

   n: New Navarino Bakery & Pastry Shop
   p: 514-279-7725
   a: 5563, avenue du parc, Montral, QC H2V 4H2


Cross-Browser Bookmarks
~~~~~~~~~~~~~~~~~~~~~~~

Another issue is that of maintaining a set of bookmarks.  One of the problems is
that every few years a new browser comes out, and I end up moving to it.  For
example, I started using the web with Xmosaic, and eventually moved to Netscape.
On Windows I eventually had to use IE, and eventually switched to Konqueror on a
Linux machine, and then Mozilla, which was very heavy, so eventually to Firefox.
Most of these browsers have slightly different bookmark storage formats which
are not conveniently edited within emacs.

A more important problem is that of the organization bookmarks.  Adding all
bookmarks in a linear list makes it nearly impossible to reuse them efficiently
(it is very hard to find a bookmark that you're looking for).  Tree structures
help alleviate this problem to some extent, but add another problem: when you
want to quickly add a bookmark (somehow, it always has to be quick), you have to
choose a single most appropriate place to put it, and if you're not very careful
with this you often have a hard time to find your bookmark back.

I found this problem really annoying, so I designed a very simple textual format
for bookmarks, where I would enter a description, url, and a list of keywords.
I wrote Tengis, a program that can read this format and can quickly query the
bookmarks with keywords.  Unfortunately, I never quite got used to using my own
software on top of the browser, and always end up grepping for the file within
emacs.

Here is an example excerpt of a bookmarks file::

    Babelfish
    http://babelfish.altavista.com
    search, languages, translation

    Amazon
    http://www.amazon.com
    search, books, music

    Abebooks
    http://www.abebooks.com/
    search, books


Another problem is that various links end up being stored in documents, text
files which I write when I accomplish some specific task. These do not make it
to the global bookmarks file.

For convenience, I wrote a script that could convert this file in a tree
structure and automatically generate bookmarks files for whatever browser I'm
using at the time.


Project Ideas, Mind-Mapping
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whenever I have an idea for a project, something that I find interesting enough,
I document it.  I would like to share these documents, but they change quite a
bit over time, and they don't necessarily belong together for the presentation
layer.


Task Notes
~~~~~~~~~~

There is much information to be acquired when using computers.  A good habit
that I have acquired is to start a text file to jot notes whenever I take on a
task that is going to take a few hours.  This helps keep my focus organized, and
serves as reference if I have to repeat that task in the future.  It is also
very useful to just send those instructions when someone asks me how I
accomplished this task in the past.  I also avoid wasting time when I need to
make a new iteration of the same task-- I can review my thoughts at the time,
the decisions I made, etc.


Paper and Book Reviews
~~~~~~~~~~~~~~~~~~~~~~

When you are surveying a lot of scientific papers, it is good to take notes on
ideas and to summarize the crux of each paper that you read.  This helps
organize your thinking by forcing you to write and express your thoughts.  I
always wrote short 5 or 6 paragraph reviews of the papers that I read.  These
live in separate files and can sometimes be reused by friends when they ask me
about specific subjects, when I point them to some paper or other. 

Also, I like to take down quotes from the books that I read.  Whenever I read a
book, I mark down interesting passages, and when I'm done with the reading, I
take 30 mins to copy these passages in text files.  I sometimes like to feed
from this body of quotations to add to my signature in email (although I must
admit that I have eliminated using signatures at all for many years now).  In
any case, I sometimes enjoy going back to those review files when I'm having an
idea that relates to a book that I have read.


Key Themes
~~~~~~~~~~

A key theme behind the problems described above, is that the software that you
use to manipulate your personal information or notes files, is going to change.
Therefore it is a bad idea to use closed formats like that produced by MS Word,
or similar software, if you want to be able to maintain and use these documents
for a long time.

I very much trust simple text files.  They will always be readable, and
interpretable, and they use little storage.  In this context, docutils_ is an
amazing tool because it allows you to extract meaningful structure from them, as
long as you follow minimal conventions.  One of the principal motivators behind
this system is to provide the ability to maintain all sorts of personal
information using simple text files.  This is a key aspect.


Goal
----

Simply stated, our goal is the following:

  *To make it possible for users to create relevant content and allow building
  ways to serve it intelligently by providing a semantically rich access to his
  data.*

We want to make it possible to build services on top of the user's valuable
resource: information.  In order to do this, we have to make it possible for any
user to build this meaningful source of his information, to add relevance to it.
We want to:

1. make it easy to enter the information in a way that allows an automated
   system to extract the meaningful chunks of data and associate them with
   pre-defined (and extensible) semantics.

   This may involve some form of simple markup (e.g. "create new document",
   "insert contact info", "insert bookmark").  Easy means simple.  The interface
   and data format has to be *simple*, if not trivial;

2. provide a service that will store this extracted information in a way that is
   accessible by various publishing services;

3. create services that will offer creative views on this data.

   You can think of a blog interface, image galleries, a birthday notifier
   system, a system to sync your data store with your PDA, to serve your
   personal bookmarks as RSS feeds, to publish your travel log, to show your
   calendar of events, etc.

   These views would create value by providing convenient access and novelty on
   top of the user's data source.  Each of these views would use as its basis
   the parsed data source, stored and access in an efficient manner (i.e. in a
   database).


Our aim is clearly **NOT** to:

- create yet another specialized personal information management system.  For
  example, we do not want to manage email data.  We want to enable the creation
  of *relevant* data, for the user to create value by identifying *relevance* in
  his info, and we need to make this easy;

- create a desktop search system.  We do not want to deal with unorganized
  information on a user's system, but to provide a convenient space where a user
  can consciously organize most of his textual data.  Using all the unorganized
  information is difficult and a mistake, because there is a lot of non-relevant
  junk data;

We believe that relevance in information is the result of a certain amount of
conscious effort from the part of the user, and that search technologies have an
inherent limit in the quality of the information that they can provide, in terms
of filtering and organizing the data that navigates in a user's system.  This is
a key aspect of this document and the scope of what we're trying to achieve.
Search can help in organizing, but cannot organize for you.  Better search can
alleviate some of the need for organization, but we recognize that ultimately,
to create high-quality content, a conscious effort has to be made.


Requirements
------------

The problem is threefold:

1. input and organization the information: the process of creating, editing,
   entering, storing the information in the system;

2. extracting semantic chunks from it: parsing the input data and extracting
   meaning from its various components, meaning "across" the main organizational
   structure of input (for example, various input files may contain bookmarks,
   these bookmarks should be accessible in a global list of bookmarks);

3. publishing the information: making selected views of the information
   accessible over the networks, with specialized interfaces.


Input
~~~~~

- we will need to be able to edit the data offline, this is often the case for
  people who work with laptops or who are on the road;

- the data lives in "files", where files consist in logical and convenient units
  of organization of information for the user to input and edit, e.g.

  * my bookmarks file specific to my workplace;
  * a contacts/address list that relates to a specific trip;
  * a blog entry, perhaps with a snippet of code in it and some links/bookmarks.
  * book reviews, which contain quotes, and a short public blurb and a link to
    the book (say, to amazon);

- the information needs to have levels of disclosure, including the possibility
  of being hidden completely (i.e. not published nor extracted at all), and the
  possibility of being entirely public, and various levels in between.  Each
  file, but also each entry must be able to specify its level of disclosure
  individually as well;

- we want to have a system that is as simple as possible, therefore we will
  prefer text files that can be created in a normal editor, like emacs or vi,
  but that does not prevent the creation of client programs to generate these
  input files;

- we assume that not all "data" a user produces and consumes is revelant, data
  that gets included in the system is subject to revision and a minimal effort
  has been made on the part of the user to clean it up and select it.  The user
  has to "write" it, or somehow take a conscious step to request that certain
  information be included in the system;

- optional: many people must be able to edit the data concurrently, or a single
  user be able to work using various independent copies of the data;


Extraction
~~~~~~~~~~

- we must extract meaningful chunks of information from the input files;

- all information chunks that are extracted must be tracked to their input file,
  so that we can implement an incremental extraction algorithm that looks at the
  input and figures out:

  * which chunks are obsolete;
  * a list of only new chunks to be integrated.

- the kinds of "chunks" must be extensible or generic, so that the system is
  very flexible;

- input "files" may change location, therefore we should not rely on their
  filename as unique identifiers for the chunks of information;

- the extraction should make the data available in a data store (e.g. a SQL
  database), in a way that makes it possible to perform incremental updates,
  full updates, and in a way that makes it flexibly accessible to various
  publishing interfaces.


Publishing
~~~~~~~~~~

- we need to be able to provide the various chunks of information in various
  ways, and to organize them in various ways, for example:

  * a blog, organized by dates and/or categories;
  * a travel journal;
  * lists of bookmarks, served up as RSS to browsers can integrate them;
  * a gallery of images, by trip;
  * notes taken about a certain task, e.g. setting up incremental backups,
    setting up software on a particular laptop;
  * a preferred wine list, a reading list;
  * project ideas, essays;

- a pluggable architecture should be developed to make it possible to render
  each type of info chunk with a specific rendering system.  We should be able
  to extend the system so that a new type of entry can be rendered in the
  existing publishing system;


Trading-off Mutability of Data for Easy Extraction
--------------------------------------------------
An important note about Nabu's design, from an email thread::

  On 7/11/06, rouadec wrote:
  > >From: "Martin Blais":
  > >
  > >Hehe, me too!  I think Nabu could be of great help for you then, if
  > >you can edit text files carefully.
  > >
  >
  > See that's the thing I'm not so keen on, not the editing part but the fact
  > that eventually you want to be able to interact with the final result : add
  > comments on a blog, click somewhere on the agenda and add an entry, change
  > the tag of an image. I'm having the same problem with ibrouteur right now.
  >
  > Nabu as I understood it doesn't have this feedback loop either yet and if I
  > start manipulating the database withtout passing through the text files then
  > it's data I'll have to care about too which I don't want.

  Yes, that is very much true.  The database in my system is a
  redundant, disposable copy of the data which lives in the text files.
  The disadvantage is that unless the data to be added can be separated
  easily, it becomes useless to add or modify the extracted data in the
  database.  That can be seen as an advantage however, because all of
  the valuable data lies in the text files, which are easy to replicate,
  modify, edit, etc.   It is a shortcoming which I have considered
  reasonable, but which may not be for other users.  For me text files
  are really easy to access and the extractors that take the data out
  very easy to write, while having to create specialized interfaces for
  data living in a database is a lot of error-prone work.  Nabu is very
  much an experiment in that sense, capitalizing on the flexibility of
  the data extraction capabilities it provides, traded off against the
  immutability of the extracted data.

  Note that blog comments can be added to the system in a reasonable
  way: all you have to do for those is to create them in a separate
  table, which is never cleared.  They can be key'ed by the unique id of
  the associated documents.



Conclusion
----------

The key ideas driving our design are:

- text files are forever.  Structure and some meaning can be extracted from text
  files by following simple reStructuredText_ guidelines and using docutils_;

- not all information a user produces is relevant, there is a lot of junk.  A
  user's editorial involvement can create a higher-quality, semantically rich
  and valuable source of information;

- creating views on top of this semantically rich source of data allows the
  possibility for value added services and easier personal publishing.


.. _docutils: http://docutils.sourceforge.net
.. _reStructuredText: http://docutils.sourceforge.net/rst.html


