Metadata-Version: 1.0
Name: oice.langdet
Version: 1.0dev-r781
Summary: Automatic Language Detector
Home-page: http://www.uci.cu/
Author: Universidad de las Ciencias Informáticas
Author-email: UNKNOWN
License: GPL 3.0
Description: Language Detector
        -----------------
        
        This is a simple (yet powerful) automatic language detector. Currently
        the only languages it can detect are:
        
        * English
        * Spanish
        * French
        
        Installation and Usage
        ----------------------
        
        To install, just run the easy_install_ tool::
        
            easy_install oice.langdet
        
        .. _easy_install: http://peak.telecommunity.com/DevCenter/EasyInstall
        
        This will install a console script, ``langdet``. Run ``langdet``
        with a plain text filename as the first parameter. Example::
        
            langdet simple.txt
        
        This prints the two-letter `ISO 639-1`_ code of
        the detected language.
        
        .. _ISO 639-1: http://en.wikipedia.org/wiki/ISO_639
        
        You may also use ``oice.langdet`` in Python scripts like this::
        
            #!/usr/bin/env python2.5
            # -*- coding: utf-8 -*-
            from StringIO import StringIO
        
            from oice.langdet import langdet
            from oice.langdet import streams
            from oice.langdet import languages
        
            # The input must be wrapped in a Stream (see Caveats).
            text = streams.Stream(StringIO(u"Must be a Python Unicode text"))
            lang = langdet.LanguageDetector.detect(text)
            if lang == languages.spanish:
                print u'Texto en español'
            elif lang == languages.english:
                print u'English text'
            else:
                print u'France'  # I don't speak/write French
        
        Caveats
        ~~~~~~~
        
        Currently there are some restrictions:
        
        * ``langdet`` does not work properly with standard input or
          pipelines.
        
        * You cannot use a file-like object directly with
          ``LanguageDetector``, i.e., you must use the ``Stream`` wrapper.
        
          This is because we try to guess the text encoding and normalize
          it to a Python Unicode string. However, we plan to remove this
          normalization step and count the frequency of octets and pairs of
          octets instead.
        
        * If the piece of text is not written in any of the languages we can
          detect, the best match (see `How it works`_) is still selected.
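        
        The octet-based counting mentioned in the second caveat could look
        roughly like the sketch below. It is only an illustration written
        for Python 3, not code from the package; the ``octet_footprint``
        name is hypothetical::
        
            from collections import Counter
        
        
            def octet_footprint(data):
                """Count single octets and adjacent octet pairs in ``data``."""
                counts = Counter()
                # Slicing keeps every key a bytes object of length 1 or 2.
                counts.update(data[i:i + 1] for i in range(len(data)))
                counts.update(data[i:i + 2] for i in range(len(data) - 1))
                return counts
        
        Because the input is raw bytes, no encoding has to be guessed before
        counting.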
        
        Work in progress
        ~~~~~~~~~~~~~~~~
        
        In a sentence: trying to solve the first two caveats, and thinking
        about Python 2.6 and Python 3.0.
        
        
        How it works
        ------------
        
        Language detection is based on statistics on the frequency of letters
        and pairs of letters in the input text.
        
        Each module in the package ``oice.langdet.languages`` contains a
        "footprint" of text in the corresponding language.
        
        The texts used in the generation of the footprints were:
        
        * El ingenioso hidalgo Don Quijote de la Mancha
        
        * The Holy Bible
        
        * La Folle Journée, ou Le Mariage de Figaro
        
        When trying to detect the language of some piece of text, we first
        count the frequencies of letters and pairs of letters in the text and
        then compare the results with the footprint of each language; the
        best match is selected.
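        
        The counting step can be sketched as follows. This is a minimal
        illustration written for Python 3, not the code shipped in the
        package; the ``footprint`` name, the word-splitting rule and the
        normalization to relative frequencies are all assumptions::
        
            import re
            from collections import Counter
        
        
            def footprint(text):
                """Relative frequencies of letters and adjacent letter pairs."""
                counts = Counter()
                # Runs of letters only: digits and punctuation split words.
                for word in re.findall(r"[^\W\d_]+", text.lower()):
                    counts.update(word)  # single letters
                    counts.update(a + b for a, b in zip(word, word[1:]))  # pairs
                total = sum(counts.values())
                return {key: count / total for key, count in counts.items()}
        
        In this sketch pairs are counted inside words only, so a pair never
        spans a space or a punctuation mark.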
        
        We use the simple `cosine similarity`__ equation to compare the text
        with the footprint of each language.
        
        __ `Cosine Similarity Wikipedia`_
        
        .. _Cosine Similarity Wikipedia: http://en.wikipedia.org/wiki/Cosine_similarity
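        
        As an illustration, cosine similarity between two footprints stored
        as dictionaries, and the selection of the best match, might be
        written as below. This is a sketch for Python 3 under stated
        assumptions: the function names and the toy footprints are
        hypothetical and are not taken from the package::
        
            import math
        
        
            def cosine_similarity(a, b):
                """Cosine of the angle between two sparse vectors (dicts)."""
                dot = sum(value * b.get(key, 0.0) for key, value in a.items())
                norm_a = math.sqrt(sum(v * v for v in a.values()))
                norm_b = math.sqrt(sum(v * v for v in b.values()))
                if norm_a == 0 or norm_b == 0:
                    return 0.0  # an empty footprint matches nothing
                return dot / (norm_a * norm_b)
        
        
            def best_match(text_footprint, language_footprints):
                """Return the key of the footprint most similar to the text."""
                return max(
                    language_footprints,
                    key=lambda code: cosine_similarity(
                        text_footprint, language_footprints[code]),
                )
        
        
            # Toy footprints, made up for the example.
            footprints = {"en": {"t": 2.0, "th": 1.0}, "es": {"e": 2.0, "de": 1.0}}
            print(best_match({"t": 1.0, "th": 1.0}, footprints))
        
        Since ``max`` always returns some key, a text in an unsupported
        language still gets the closest supported one, which is the behaviour
        described in `Caveats`_.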
        
        
        Accuracy of the detection
        -------------------------
        
        To test the accuracy of this implementation we downloaded the full
        `European Parliament Proceedings Parallel Corpus 1996-2006`__ and ran
        the ``langdet`` script on the sets of English, Spanish and French
        documents.
        
        __ http://www.statmt.org/europarl/
        
        For each language we counted the times the correct `ISO 639-1`_ code was
        returned by ``langdet`` like this (here, counting the documents detected
        as Spanish)::
        
            find . -type f -exec langdet {} \; | grep es | wc -l
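        
        The same tally can also be computed for all codes at once. The sketch
        below is written for Python 3 and assumes that ``langdet`` prints one
        code per line; the ``detection_rates`` name is hypothetical and not
        part of the package::
        
            from collections import Counter
        
        
            def detection_rates(lines):
                """Map each detected code to its percentage of all lines."""
                # Whole-line comparison: blank lines are ignored.
                counts = Counter(line.strip() for line in lines if line.strip())
                total = sum(counts.values())
                return {code: 100.0 * n / total for code, n in counts.items()}
        
        
            print(detection_rates(["es\n", "es\n", "fr\n", "en\n"]))
        
        Passing ``sys.stdin`` instead of the list would let the function read
        the output of the ``find`` command directly.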
        
        The results are summarized in the following table:
        
        .. table:: Summary of accuracy test for ``langdet``
        
           ============= ======= ======= ======= ===========
           Real language English Spanish French  Errors [1]_
           ============= ======= ======= ======= ===========
           English       98.78%  0%      0%      1.22%
           Spanish       0%      100%    0%      0%
           French        0%      0%      100%    0%
           Danish        1.22%   16.08%  82.7%   0%
           German        1.97%   0.15%   97.88%  0%
           Finnish       0.65%   5.9%    93.45%  0%
           Italian       0%      99.54%  0.46%   0%
           ============= ======= ======= ======= ===========
        
        .. [1] Errors are generally produced when the detector cannot guess
           the encoding of the input text.
        
           In `Caveats`_ we propose a solution for this; however, its impact
           on the accuracy of detection is not clear.
        
        The results show that for documents in the languages that ``langdet``
        can detect, ``langdet`` behaves almost perfectly.
        
        However, the results for documents in other languages show how
        misleading ``langdet`` can be in such cases. We ran those tests for
        illustration purposes only.
        
        Nevertheless, these results also show that it would be very difficult
        for this simple algorithm to distinguish Spanish from Italian, and
        French from German.
        
        Changelog
        ---------
        
        1.0 - Unreleased
        ~~~~~~~~~~~~~~~~
        
        * Initial release
        
        
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python
Classifier: Topic :: Utilities
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: Intended Audience :: Developers
