=============================
ktable, simple k-mer counting
=============================

(This is not why most people want to use khmer -- see :doc:`guide` for that.)

khmer ktables are a simple C++ library for counting k-mers in DNA
sequences.  There's a complete Python wrapping and it should be pretty
darned fast; it's intended for genome-scale k-mer counting.

'ktable's are a table of 4**k counters.  khmer then maps each k-mer
into this table with a simple (and reversible) hash function.

Right now, only the Python interface is documented here.  The C++
interface is essentially identical; if you need to use it and want
it documented, drop me a line.

Counting Speed and Memory Usage
===============================

On the 5 mb *Shewanella oneidensis* genome, khmer takes less than a second
to count all k-mers, for any k between 6 and 12.  At 13 it craps out
because the table goes over my default stack size limit.

Approximate memory usage can be calculated by finding the size of a
``long long`` on your machine and then multiplying that by 4**k.
For a 12bp wordsize, this works out to 16384*1024; on an Intel-based
processor running Linux, ``long long`` is 8 bytes, so memory usage
is approximately 128 mb.

Python interface
================

Essentially everything requires a ``ktable``.

::

	import khmer
	ktable = khmer.new_ktable(L)

These commands will create a new ``ktable`` of size 4**L, suitable
for counting L-mers.

Each ``ktable`` object has a few accessor functions:

 * ``ktable.ksize()`` will return L.

 * ``ktable.max_hash()`` will return the max hash value in the table, 4**L - 1.

 * ``ktable.n_entries()`` will return the number of table entries, 4**L.

The forward and reverse hashing functions are directly accessible:

 * ``hashval = ktable.forward_hash(kmer)`` will return the hash value
    of the given kmer.

 * ``kmer = ktable.reverse_hash(hashval)`` will return the kmer that hashes
    to the given hashval.

There are also some counting functions:

 * ``ktable.count(kmer)`` will increment the count associated with the given kmer
   by one.

 * ``ktable.consume(sequence)`` will run through the sequence and count
   each kmer present.

 * ``n = ktable.get(kmer|hashval)`` will return the count associated with the
   given kmer string or the given hashval, whichever is passed in.

 * ``ktable.set(kmer|hashval, count)`` set the count for the given kmer
   string or hashval.

In all of the cases above, 'kmer' is an L-length string, 'hashval' is
a non-negative integer, and 'sequence' is a DNA sequence containing ONLY
A/C/G/T.

**Note:** 'N' is not a legal DNA character as far as khmer is concerned!

And, finally, there are some set operations:

  * ``ktable.clear()`` empties the ktable.

  * ``ktable.update(other)`` adds all of the entries in ``other`` into
    ``ktable``.  The wordsize must be the same for both ktables.

  * ``intersection = ktable.intersect(other)`` returns a ktable where
    only nonzero entries in both ktables are kept.  The count for ach
    entry is the sum of the counts in ``ktable`` and ``other``.

An Example
==========

This short code example will count all 6-mers present in the given
DNA sequence, and then print them all out along with their prevalence.

::

   # make a new ktable, L=6
   ktable = khmer.new_ktable(6)

   # count all k-mers in the given string
   ktable.consume("ATGAGAGACACAGGGAGAGACCCAATTAGAGAATTGGACC")

   # run through all entries. if they have nonzero presence, print.
   for i in range(0, ktable.n_entries()):
      n = ktable.get(i)
      if n:
         print ktable.reverse_hash(i), "is present", n, "times."

::

	CTB, 3/2005
