Installation-free Usage
=======================

This example shows how to use the Hadoop Distributed Cache (DC) to
distribute Python packages, possibly including Pydoop itself, to all
cluster nodes at job launch time. This is useful whenever
installing on each node is not feasible (e.g., for lack of a shared mount
point). Hadoop itself must, of course, already be installed and
properly configured on all cluster nodes before you can run this.

The source code for this example is available under
``examples/self_contained``.


Example Application: Count Vowels
---------------------------------

The example MapReduce application, ``vowelcount``, is rather trivial: it counts
the occurrences of each vowel in the input text. Since the point here
is to show how a structured package can be distributed and imported,
the implementation is deliberately verbose and inefficient.
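
To make the expected result concrete, the counting logic can be
sketched in plain Python.  This is only an illustration, not the code
shipped with the example: it does not use the Pydoop API, and the
``count_vowels`` helper and ``VOWELS`` constant are hypothetical names
introduced here::

  from collections import Counter

  # Assumption: only the five unaccented vowels are counted,
  # case-insensitively.
  VOWELS = "aeiou"


  def count_vowels(text):
      """Return a dict mapping each vowel to its number of occurrences."""
      counts = Counter(c for c in text.lower() if c in VOWELS)
      # Report every vowel, including those that never occur.
      return {v: counts[v] for v in VOWELS}


  print(count_vowels("Hadoop Distributed Cache"))
  # {'a': 2, 'e': 2, 'i': 2, 'o': 2, 'u': 1}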


How it Works
------------

The DC supports automatic distribution of files and archives across
the cluster at job launch time.  This feature can be used to dispatch
Python packages to all nodes, eliminating the need to install
dependencies for your application, including Pydoop itself::

  pydoop submit --upload-archive-to-cache vowelcount.tgz \
                --upload-archive-to-cache pydoop.tgz [...]

The ``pydoop.tgz`` and ``vowelcount.tgz`` archives will be copied to
all slave nodes and unpacked; in addition, ``pydoop`` and
``vowelcount`` symlinks to the unpacked directories will be created in
the current working directory of each task before it is executed.
Each archive should therefore hold the *contents* of the corresponding
package at its top level (``__init__.py`` and so on), not the package
directory itself: the symlink then acts as the package directory, and
the package becomes available for import.
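
One way to build such an archive is with Python's standard ``tarfile``
module.  The following is a minimal sketch, not the packaging code
used by the example; the ``pack_package_contents`` helper and its
arguments are hypothetical names introduced here::

  import os
  import tarfile


  def pack_package_contents(pkg_dir, archive_path):
      """Archive the contents of pkg_dir, not the directory itself."""
      with tarfile.open(archive_path, "w:gz") as tar:
          for name in sorted(os.listdir(pkg_dir)):
              # arcname drops the leading package directory, so that
              # __init__.py ends up at the root of the archive.
              tar.add(os.path.join(pkg_dir, name), arcname=name)
      return archive_path

For instance, ``pack_package_contents("vowelcount", "vowelcount.tgz")``
would store ``__init__.py`` at the archive root rather than under a
``vowelcount/`` prefix, which matches the layout described above.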

The example is designed to work with Pydoop and vowelcount *not*
installed on the slave nodes. You do, however, need Pydoop on the
client machine used to run the example.
