========================
Ready-made khmer scripts
========================

While khmer is research software (and will probably remain research
software for a while!) there are a bunch of scripts that can be used
out of the box.  Below is a quick-n-dirty guide to these scripts.

Basic k-mer filtering
=====================

filter-exact.py

Quality trimming
~~~~~~~~~~~~~~~~

quality-trim.py -- Filter Illumina data sets by removing reads
containing too many Ns, and truncate reads at QV=2 bases.
Preprocessing before any further steps.

Usage::

   %% quality-trim.py <fastq input filename> <fastq output filename>

This script can be applied to multiple data sets in parallel or in series.

Removing reads by k-mer match
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

filter-if-present.py -- Remove reads from a FASTA/FASTQ file that contain
any of the k-mers present in another FASTA/FASTQ file.  Used to remove
rRNA, repeats, Illumina primers, etc.

Usage::

   %% filter-if-present.py <mask FASTA/FASTQ filename> <FASTA/FASTQ to filter>

Output is placed in a file named after the FASTA/FASTQ file filtered,
with the suffix '.masked'.

This script can be applied to multiple data sets in parallel or in series,
as long as the same mask file is used.

Abundance filtering
-------------------

To use a probabilistic counting hash table to filter out reads
containing low-abundance k-mers, see the scripts 'filter-inexact-*.py'.

An "All k-mers" filter
~~~~~~~~~~~~~~~~~~~~~~

filter-inexact-all.py -- removes reads containing any k-mer with
an abundance of 1 (technically, < 2 ;).

This is an all-by-all process; filter **all your data** together in
one go, or else you will erroneously eliminate reads.  (Consider the
situation where a k-mer is present in precisely two reads, one in
`file1` and one in `file2`.  Both reads will be eliminated if `file1`
and `file2` are filtered separately.

A few points: 

 - this code uses only a few hundred megabytes beyond the hash table,
   so make the hashtable size as close to your available physical memory
   as you can.

 - the one-sided counting error means that some reads with low-abundance
   k-mers *might* be kept, but no reads will erroneously be removed.  You
   can run it iteratively, with different hash table sizes each time, to
   decrease the error rate.

 - increase MIN_ABUNDANCE beyond 2 to eliminate higher abundance reads.
   For example, setting MIN_ABUNDANCE to 5 will eliminate reads that
   contain k-mers with 4 or fewer abundance across the data set.

 - if you run this script iteratively on a data set, it will continue
   to eliminate reads.  Consider the situation where a k-mer A is
   present in the same read as a k-mer B, but k-mer B is also in
   another read.  If the first read (containing both k-mers A and B)
   is eliminated because k-mer A is low-abundance, then in the
   filtered data set, k-mer B will now be low-abundance and the second
   read will be eliminated in the next pass.  In comparison,
   'filter-inexact-any.py' (below) is invariant.
   
Other k-mer filters
~~~~~~~~~~~~~~~~~~~

We have not found these filters to be as useful as 'filter-inexact-all.py',
but they should work:

filter-inexact-any.py -- remove 

filter-inexact-once.py

filter-inexact-run.py

De Bruijn graph analysis and manipulation
=========================================

Graph size filtering
~~~~~~~~~~~~~~~~~~~~

graph-size-py.py - eliminate reads belonging to small de Bruijn graphs

Partitioning
~~~~~~~~~~~~

do-th-subset-calc.py (-save, -load below, but in memory)

Pipeline:

do-th-subset-save.py - to do local graph traversal and produce partition maps

do-subset-merge.py - to merge partition maps

(optional, inexact) filter-subsets-by-partsize.py - to eliminate small partitions

do-th-subset-load.py - to annotate sequences with their partition ID

(optional) combine-pe.py - to combine partitions that are joined by paired ends

extract-partitions.py - to extract reads into groups, grouped by partition

(utility) multi-abyss.py / multi-velvet.py - to assemble

(reporting) subset-report.py

Hashtable/tagset utilities
==========================

load-ht-and-tags.py


Miscellaneous utilities
=======================

assemstats.py

assemstats2.py

fastq-to-fasta.py

strip-partition.py

join_pe.py

extract-long-sequences.py

Miscellaneous read analysis
===========================

abundance-hist-by-position.py

hi-lo-abundance-by-position.py

Support code
============

test_scripts.py

velvet-assemble.sh

make-random.py


To remove?
==========

agglom.py

ctb-iterative-bench*

do-intertable-part.py

extract-surrender.py

To classify
===========

extract-single-partition.py

fasta-to-abundance-hist.py

get-occupancy.py

graph-partition-separate.py

graph-size-py-th.py
graph-size.py

kmer-abundance-hist.py

length-dist.py

occupy.py

partition-size-dist.py

test.sh

