Morphological/Inflection Engine for Croatian language
=====================================================
"text-hr" is a Morphological/Inflection Engine for the Croatian language, written
in Python. It includes stopwords and a Part-Of-Speech (POS) tagging engine that
uses an inverse inflection algorithm for detection.

Since the API is not yet frozen, this project is still in alpha.

TAGS 
----
    Croatian language, python, natural language processing (NLP),
    Part-of-speech (POS) tagging, stopwords, inverse inflection, 
    morphological lexicon


OZNAKE
------
    Hrvatski jezik, Python biblioteka, morfologija, infleksija, obrnuta
    infleksija, prepoznavanje vrsta riječi, računalna obrada govornog jezika,
    zaustavne riječi, morfološki leksikon

AUTHOR
======
Robert Lujo, Zagreb, Croatia (the e-mail address is in LICENCE)


FEATURES
========
The most important features are:
 - inflection system - for producing all forms of one word
 - detection of word types (POS tagging) - from existing list of word forms
 - list of stopwords

The system is based on Unicode strings. The default codepage for converting
to and from byte strings is cp1250.
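
As a minimal sketch of what that conversion means in plain Python (standard
codecs only, no text-hr calls, written for Python 3), a word with Croatian
diacritics round-trips through cp1250 as shown below. The helper names
``to_unicode`` and ``to_bytes`` are made up for illustration.

```python
# Hypothetical helpers, not part of text-hr: they only illustrate the
# cp1250 <-> Unicode conversion using Python's built-in codecs.
DEFAULT_CODEPAGE = "cp1250"

def to_unicode(raw, codepage=DEFAULT_CODEPAGE):
    """Decode a byte string (e.g. read from a cp1250 file) into Unicode."""
    return raw.decode(codepage)

def to_bytes(text, codepage=DEFAULT_CODEPAGE):
    """Encode Unicode text back into bytes for output."""
    return text.encode(codepage)

word = u"broje\u0107i"            # "brojeći"
raw = to_bytes(word)              # b'broje\xe6i' in cp1250
assert to_unicode(raw) == word    # lossless round trip
```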

Check `Getting started`_.

INSTALLATION
============
If you have pip installed (http://pypi.python.org/pypi/pip)::

    pip install text-hr

If not, install the old-fashioned way:
    - download zip from http://pypi.python.org/pypi/text-hr/
    - unzip
    - open shell
    - go to distribution directory
    - python setup.py install


GETTING STARTED
===============
There are three important parts that this project provides:
 - `Inflection system`_ - for producing all forms of one word
 - `Detection of word types (POS tagging)`_ - from existing list of word forms
 - `List of stopwords`_

Inflection system
-----------------
Usage example - start a Python shell::

    > python
    >>> from text_hr.verbs import Verb
    >>> v = Verb("platiti")
    >>> for k in sorted(v.forms.keys()):
    ...     print k, v.forms[k]
    ...
    AOR/P/1 [u'platismo']
    AOR/P/2 [u'platiste']
    AOR/P/3 [u'plati\u0161e']
    AOR/S/1 [u'platih']
    AOR/S/2 [u'plati']
    AOR/S/3 [u'plati']
    IMP/P/1 [u'platasmo', u'pla\u0107asmo', u'platijasmo']
    IMP/P/2 [u'plataste', u'pla\u0107aste', u'platijaste']
    IMP/P/3 [u'platahu', u'pla\u0107ahu', u'platijahu']
    ...
    VA_PA//P_O+S+V+N [u'pla\u0107eno']
    X_INF// [u'platiti']
    X_VAD_PAS// [u'plativ\u0161i']
    X_VAD_PRE// [u'plate\u0107i']
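
The keys printed above appear to consist of three ``/``-separated fields (for
instance ``AOR/P/1``, with the last two possibly empty as in ``X_INF//``).
Under that assumption, here is a small sketch that regroups such a dictionary
by its first field. The ``forms`` dict is hand-copied from the output above so
that the snippet runs without text-hr installed, and ``group_by_first_field``
is a made-up helper, not library API.

```python
from collections import defaultdict

# Hand-copied subset of the v.forms output shown above (not computed here).
forms = {
    "AOR/S/1": [u"platih"],
    "AOR/S/2": [u"plati"],
    "AOR/P/1": [u"platismo"],
    "X_INF//": [u"platiti"],
}

def group_by_first_field(forms):
    """Map the first key field (e.g. 'AOR') to {(2nd, 3rd field): forms}."""
    grouped = defaultdict(dict)
    for key, words in forms.items():
        # Assumption: every key has exactly three '/'-separated fields.
        first, second, third = key.split("/")
        grouped[first][(second, third)] = words
    return dict(grouped)

grouped = group_by_first_field(forms)
for first in sorted(grouped):
    print(first, sorted(grouped[first]))
```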

Detection of word types (POS tagging)
-------------------------------------
This section is still to be written. Until then, check test_detect.txt for
samples and detect.py for the logic.

The first example in test_detect.txt::

    >>> from text_hr.detect import WordTypeRecognizerExample
    >>> def test_it(word_list, word_types_filter=None, level=2):
    ...     wdh = WordTypeRecognizerExample(word_list, silent=True)
    ...     if word_types_filter is not None:
    ...         wdh.detect(word_types_filter=word_types_filter, level=level)  # e.g. word_types_filter=["N"]
    ...     else:
    ...         wdh.detect(level=level)  # all word types
    ...     lines_file = LinesFile()
    ...     wdh.dump_result(lines_file) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
    ...     print "\n".join(lines_file.lines)
    ...     return wdh

    >>> class LinesFile(object):
    ...     def __init__(self):
    ...         self.lines = []
    ...     def write(self, s):
    ...         self.lines.append(repr(s.rstrip()))

    >>> word_list = [
    ...   "Broj    84"
    ... , "broji   34"
    ... , "Brojila  28"
    ... , "broje   23"
    ... , "brojeći 22"
    ... , "brojim   7"
    ... , "brojimo  5"
    ... , "brojiš   4"
    ... , "brojahu  2"
    ... , "brojaše  1"
    ... , "brojite  1"
    ... , "-brijestovu 1"
    ... , "brijestovi 1"   #the only one checked with endswith, but all other will be checked with get_freq
    ... , "-brijestove 1"
    ... , "-brijestova 1"
    ... ]

    Lowest quality, but fastest
    >>> wdh = test_it(word_list, level=4) # doctest: +ELLIPSIS
    " 10/  183 -> brojati              (u'V-XX_-_JATI-je\\u0107i-0') 84/broj,34/broji,23/broje,22/broje\xe6i,7/brojim,5/brojimo,4/broji\x9a,2/brojahu,1/brojite,1/broja\x9ae"
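
To give a feel for the idea behind inverse inflection, here is a deliberately
tiny sketch. It is not the algorithm in detect.py: the suffix list is a
hand-picked, hypothetical set of present-tense endings and ``best_stem`` is an
illustrative name. It strips each known ending from the observed word forms
and picks the stem that explains the most occurrences.

```python
# Toy illustration only; NOT the text-hr detection algorithm.
# Hypothetical present-tense endings, hand-picked for this example.
SUFFIXES = [u"im", u"i\u0161", u"i", u"imo", u"ite", u"e"]

# Word form -> frequency, taken from the word_list above.
observed = {
    u"broji": 34, u"broje": 23, u"brojim": 7,
    u"brojimo": 5, u"broji\u0161": 4, u"brojite": 1,
}

def best_stem(observed, suffixes):
    """Return (stem, total frequency) of the stem covering most occurrences."""
    totals = {}
    for word, freq in observed.items():
        # A word may match several endings; use a set so that each
        # candidate stem is credited at most once per word.
        stems = {word[:-len(s)] for s in suffixes
                 if word.endswith(s) and len(word) > len(s)}
        for stem in stems:
            totals[stem] = totals.get(stem, 0) + freq
    # Highest total wins; ties go to the alphabetically first stem.
    return max(sorted(totals.items()), key=lambda item: item[1])

stem, total = best_stem(observed, SUFFIXES)
```

The real engine goes further: in the output above it reports the base form
``brojati`` together with a word-type key, while this toy stops at a bare stem.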

List of stopwords
-----------------
This example is not tested yet; it still needs to be simplified and explained
in detail.

Something like::

    from text_hr import word_types

    word_types_list = None
    wordobj_data = {}
    # previous (wordobj, l_key) pair, used to report each pair only once
    wordobj_old, l_key_old = None, None
    for wordobj, l_key, cnt, _suff_id, wform_key, wform in word_types.get_all_std_words(word_types_list):
        if not (wordobj==wordobj_old and l_key==l_key_old):
            wordobj_data["value_base"] = wordobj
            l_key_flds = l_key.split("#")
            # wordobj              l_key                wform_key                      form
            # ondje                FX#ADV#MJE.GDJE                                     ''
            # one                  CH#PRON.OSO#         #P/3F#|A#1                     'njih'
            assert len(l_key_flds)==3, l_key_flds
            is_changeable = (l_key_flds[0]=="CH")
            print "word_type", l_key_flds[1]
            print "subtype",   l_key_flds[2]
            wordobj_old, l_key_old = wordobj, l_key

        assert wordobj
        # TODO:
        # if wform:
        #     raise NotImplementedError("now wordforms don't hold wf/key, but wf/cnt - it is reduced. Here this is not implemented!!!")
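
The key handling in the loop above can be tried in isolation. The sketch below
relies only on what the snippet shows: a key has three ``#``-separated fields
and a first field of ``CH`` marks a changeable word. ``parse_l_key`` and the
``WordKey`` tuple are illustrative names, not text-hr API.

```python
from collections import namedtuple

# Illustrative container; the field names are made up for this sketch.
WordKey = namedtuple("WordKey", "is_changeable word_type subtype")

def parse_l_key(l_key):
    """Split a key such as 'CH#PRON.OSO#' into its three fields."""
    fields = l_key.split("#")
    if len(fields) != 3:
        raise ValueError("expected 3 '#'-separated fields: %r" % (l_key,))
    return WordKey(fields[0] == "CH", fields[1], fields[2])

# The two sample keys from the comments in the snippet above.
adverb = parse_l_key("FX#ADV#MJE.GDJE")
pronoun = parse_l_key("CH#PRON.OSO#")
```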


Further
-------
Since there is currently no good documentation, the best sources of
further information are the tests inside the modules and the
tests in the tests directory (dev version). More information in `Running tests`_.
And you can always read the source.


DOCUMENTATION
=============
Sorry, there is currently no good documentation. It is in progress.


SUPPORT
=======
Since this project is limited by my free time, support will
be limited.


REPORT BUG OR REQUEST FEATURE
-----------------------------
If you encounter a bug, the best way is to report it on the
Bitbucket project page http://bitbucket.org/trebor74hr/text-hr.

If there is interest in development for other inflection-rich
languages, I'd be glad to decouple the language-specific code
and create a new project that is capable of dealing with
multiple languages.

The best way to contact me is by e-mail (the address is in LICENCE).

The TODO list is in readme.txt (dev version).


CONTRIBUTION
============
Since the API of this project is not yet stable, contributions
should wait for a while.


RUNNING TESTS
=============
All tests are doctests (not unittests). There are three types of tests in the
package:

    1. doctests in each module - e.g. in verbs.py
    2. doctests in tests/test_*.txt - only in the development version
    3. tests which are not compared automatically - e.g. in a special call mode
       detect.py can produce an output file which needs to be compared
       manually with some existing file. Such tests are very slow. This needs
       to be changed so that the comparison is automatic.

Running a module directly runs the tests of type 1 and, when running from the
development version, also those of type 2. To get the development version
(http://bitbucket.org/trebor74hr/text-hr)::

 hg clone https://trebor74hr@bitbucket.org/trebor74hr/text-hr


Then create text_hr.pth in the Python site-packages directory, containing the path to the text-hr clone, e.g.::

    r:\hg-clones\python\text-hr
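
One possible way to script that step is sketched below. ``write_pth`` is a
hypothetical helper (not shipped with text-hr) that writes a one-line .pth
file into a given directory; ``sysconfig.get_paths()["purelib"]`` is the
standard-library way to locate site-packages for the running interpreter.

```python
import os
import sysconfig

def write_pth(clone_path, target_dir=None, name="text_hr.pth"):
    """Write a .pth file that puts clone_path on sys.path.

    target_dir defaults to the current interpreter's site-packages.
    """
    if target_dir is None:
        target_dir = sysconfig.get_paths()["purelib"]
    pth_file = os.path.join(target_dir, name)
    with open(pth_file, "w") as handle:
        # A .pth file holds one directory per line.
        handle.write(os.path.abspath(clone_path) + "\n")
    return pth_file
```

Call it with the path of your clone. Writing into site-packages may require
elevated rights, which is why ``target_dir`` can be overridden.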

To run all tests:
    - go to tests directory
    - run tests.py, for example (with sample output)::

        > python tests.py
        testing module   __init__
        testing module   adjectives
        ...
        testing module   word_types
        testing textfile R:\hg-clones\python\text-hr\tests\test_adj.txt
        ...
        testing textfile R:\hg-clones\python\text-hr\tests\test_verbs_type.txt
  
To run tests for just one module:
    - go to the text_hr directory
    - run tests by running module, e.g.::

        > py pronouns.py
        __main__: running doctests
        ..\tests\test_pronouns.txt: running doctests

    - if you are not running from the dev version, you will get output like
      this::

        > py pronouns.py
        __main__: running doctests
        ..\tests\test_pronouns.txt: Not found, skipping
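
The output above suggests that each module first runs its own doctests and
then looks for a matching text file. How text-hr implements this is not shown
here; the following is only a sketch of that pattern using the standard
``doctest`` module, with ``run_doctests`` and its arguments as made-up names.

```python
import doctest
import os

def run_doctests(module, textfile):
    """Run a module's doctests, then an optional doctest text file.

    Returns the total number of failed examples.
    """
    print("%s: running doctests" % module.__name__)
    failed, _attempted = doctest.testmod(module)
    if os.path.exists(textfile):
        print("%s: running doctests" % textfile)
        # module_relative=False lets us pass an ordinary filesystem path.
        file_failed, _ = doctest.testfile(textfile, module_relative=False)
        failed += file_failed
    else:
        print("%s: Not found, skipping" % textfile)
    return failed
```

A module could call this from its ``if __name__ == "__main__":`` block,
passing itself and the path of its text file in the tests directory.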

TODO
====
Various things; see readme.txt for details.

CHANGES
=======
0.12 
----
ulr1 100608 :
    - README
    - enabled tests from tests.py for all 
    - enabled running tests directly from each module

0.11 
----
ulr1 100607:
    - recreated repo at bitbucket
    - .suff_registry.pickle and testing_*.out are no longer put in the zip

0.10
----
ulr1 100605:
    - first installable release
