edeposit_autoparser.py
======================

This script is used to ease the creation of new parsers.

Configuration file
------------------

The script expects a configuration file with patterns, specified with the ``-c``
parameter. Pattern files use YAML as the serialization format.

The pattern file should contain multiple pattern definitions. Here is an example
of a test pattern file::

    html: simple_xml.xml
    first:
        data: i wan't this
        required: true
        notfoundmsg: Can't find variable '$name'.
    second:
        data: and this
    ---
    html: simple_xml2.xml
    first:
        data: something wanted
        required: true
        notfoundmsg: Can't find variable '$name'.
    second:
        data: another wanted thing

As you can see, this file contains two examples separated by ``---``. Each
section of the file has to contain an ``html`` key pointing to either a file or
a URL resource.

After the ``html`` key, there may be any number of `variables`. Each
`variable` has to contain a ``data`` key, which defines the content that should
be parsed from the file the ``html`` key points to.

Optionally, you can also specify ``required`` and ``notfoundmsg``. If a
variable is required and the generated parser encounters data without this
variable, a ``UserWarning`` exception is raised with ``notfoundmsg`` as its
message. As the example shows, you can use ``$name`` as a placeholder that
holds the variable name (`first`, for example).
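
The ``$name`` placeholder can be illustrated with a minimal sketch. It assumes
that the substitution behaves like Python's ``string.Template``; the helper
``render_notfoundmsg`` is hypothetical and is not part of the autoparser API.

.. code-block:: python

    from string import Template


    def render_notfoundmsg(notfoundmsg, variable_name):
        # Hypothetical helper: expand `$name` to the name of the variable.
        # safe_substitute() leaves any other placeholder untouched instead
        # of raising KeyError.
        return Template(notfoundmsg).safe_substitute(name=variable_name)


    message = render_notfoundmsg("Can't find variable '$name'.", "first")
    print(message)

For the variable `first`, the rendered text matches the message used by the
generated ``get_first()`` parser in the live example.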

There is also a special keyword, ``tagname``, which can be used to further
specify the correct element in case more than one element matches.
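
The rules described in this section can be summarized by a small validation
sketch. It works on one example that was already loaded from YAML into a plain
dictionary. The function ``validate_example`` and the default values for the
optional keys are assumptions made for illustration; the real script may
handle them differently.

.. code-block:: python

    def validate_example(example):
        # Hypothetical check of one example (one YAML document) given as a dict.
        if "html" not in example:
            raise ValueError("Each example has to contain the 'html' key.")

        variables = {}
        for name, spec in example.items():
            if name == "html":
                continue

            # every variable has to define the `data` key
            if not isinstance(spec, dict) or "data" not in spec:
                raise ValueError(
                    "Variable '%s' has to contain the 'data' key." % name
                )

            variables[name] = {
                "data": spec["data"],
                # assumed defaults for the optional keys
                "required": bool(spec.get("required", False)),
                "notfoundmsg": spec.get("notfoundmsg"),
                "tagname": spec.get("tagname"),
            }

        return variables


    example = {
        "html": "simple_xml.xml",
        "first": {
            "data": "i wan't this",
            "required": True,
            "notfoundmsg": "Can't find variable '$name'.",
        },
        "second": {"data": "and this"},
    }
    variables = validate_example(example)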

How it works
------------
Autoparser first reads all examples and locates the elements whose content
matches the pattern defined in the ``data`` key. Whitespace at the beginning
and end of the pattern and of the element's content is ignored.
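
The matching step can be sketched as follows. The autoparser itself works with
``dhtmlparser``; this sketch uses ``xml.etree.ElementTree`` from the standard
library as a stand-in, and both ``find_matching`` and the sample document are
hypothetical. Note that ``el.text`` only covers the text before the first
child element, which is enough for this illustration.

.. code-block:: python

    import xml.etree.ElementTree as ET


    def find_matching(root, pattern):
        # Compare the stripped pattern with the stripped content of each element.
        wanted = pattern.strip()
        return [
            el for el in root.iter()
            if (el.text or "").strip() == wanted
        ]


    SAMPLE = """<root>
      <something>junk</something>
      <container id="mycontent">  and this  </container>
    </root>"""

    matches = find_matching(ET.fromstring(SAMPLE), " and this ")
    print([el.tag for el in matches])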

When the autoparser has collected all matching elements, it generates `DOM`
paths to each element.
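
One possible way to build such a path is to walk from the matched element up
to the root and record the tag name and position of each step. This is only a
sketch under the assumption that a path is a list of ``(tag, index)`` pairs;
the representation used by the autoparser may differ. ``path_to`` is a
hypothetical helper built on ``xml.etree.ElementTree``.

.. code-block:: python

    import xml.etree.ElementTree as ET


    def path_to(root, target):
        # ElementTree has no parent links, so build a child -> parent map first.
        parents = {child: parent for parent in root.iter() for child in parent}

        path = []
        el = target
        while el is not root:
            parent = parents[el]
            # remember the tag and the position among the parent's children
            path.append((el.tag, list(parent).index(el)))
            el = parent

        path.append((root.tag, 0))
        return list(reversed(path))


    root = ET.fromstring("<root><a/><xax><container>x</container></xax></root>")
    target = root.find("./xax/container")
    path = path_to(root, target)
    print(path)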

After that, the elimination process begins. In this step, the autoparser throws
away all paths that don't work for the corresponding variables in all examples.
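
The idea of the elimination step can be approximated by a set intersection:
a path survives only if it was found for the same variable in every example.
This is a simplification, not the actual implementation. The function
``eliminate`` and the path strings are hypothetical labels used only for
illustration.

.. code-block:: python

    def eliminate(candidates_per_example):
        # candidates_per_example: list of dicts {variable name: set of paths},
        # one dict per example.
        surviving = {}
        for candidates in candidates_per_example:
            for name, paths in candidates.items():
                if name in surviving:
                    # keep only paths that also work in this example
                    surviving[name] &= set(paths)
                else:
                    surviving[name] = set(paths)

        return surviving


    example_1 = {
        "first": {"root/xax/container", "root/[2]"},
        "second": {"container#mycontent"},
    }
    example_2 = {
        "first": {"root/xax/container"},
        "second": {"container#mycontent", "root/[0]"},
    }
    surviving = eliminate([example_1, example_2])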

When this is done, the paths with the best priority are selected and
:func:`.generate_parsers` is called.

The result of this call is a string printed to the output. This string contains
all necessary parsers for each variable and also a unit test.

You can then build the parser you need much more easily, because you now have
working `pickers` from the `DOM` and all you need to do is clean the data.

Live example::

    $ ./edeposit_autoparser.py -c autoparser/autoparser_data/example_data.yaml
    #! /usr/bin/env python
    # -*- coding: utf-8 -*-
    #
    # Interpreter version: python 2.7
    #
    # HTML parser generated by Autoparser
    # (https://github.com/edeposit/edeposit.amqp.harvester)
    #
    import os
    import os.path

    import httpkie
    import dhtmlparser


    # Utilities
    def _get_source(link):
        """
        Return source of the `link` whether it is filename or url.

        Args:
            link (str): Filename or URL.

        Returns:
            str: Content.

        Raises:
            UserWarning: When the `link` couldn't be resolved.
        """
        if link.startswith("http://") or link.startswith("https://"):
            down = httpkie.Downloader()
            return down.download(link)

        if os.path.exists(link):
            with open(link) as f:
                return f.read()

        raise UserWarning("html: '%s' is neither URL or data!" % link)


    def _get_encoding(dom, default="utf-8"):
        """
        Try to look for meta tag in given `dom`.

        Args:
            dom (obj): pyDHTMLParser dom of HTML elements.
            default (default "utr-8"): What to use if encoding is not found in
                                       `dom`.

        Returns:
            str/default: Given encoding or `default` parameter if not found.
        """
        encoding = dom.find("meta", {"http-equiv": "Content-Type"})

        if not encoding:
            return default

        encoding = encoding[0].params.get("content", None)

        if not encoding:
            return default

        return encoding.lower().split("=")[-1]


    def handle_encodnig(html):
        """
        Look for encoding in given `html`. Try to convert `html` to utf-8.

        Args:
            html (str): HTML code as string.

        Returns:
            str: HTML code encoded in UTF.
        """
        encoding = _get_encoding(
            dhtmlparser.parseString(
                html.split("</head>")[0]
            )
        )

        if encoding == "utf-8":
            return html

        return html.decode(encoding).encode("utf-8")


    def is_equal_tag(element, tag_name, params, content):
        """
        Check is `element` object match rest of the parameters.

        All checks are performed only if proper attribute is set in the HTMLElement.

        Args:
            element (obj): HTMLElement instance.
            tag_name (str): Tag name.
            params (dict): Parameters of the tag.
            content (str): Content of the tag.

        Returns:
            bool: True if everyhing matchs, False otherwise.
        """
        if tag_name and tag_name != element.getTagName():
            return False

        if params and not element.containsParamSubset(params):
            return False

        if content is not None and content.strip() != element.getContent().strip():
            return False

        return True


    def has_neigh(tag_name, params=None, content=None, left=True):
        """
        This function generates functions, which matches all tags with neighbours
        defined by parameters.

        Args:
            tag_name (str): Tag has to have neighbour with this tagname.
            params (dict): Tag has to have neighbour with this parameters.
            params (str): Tag has to have neighbour with this content.
            left (bool, default True): Tag has to have neigbour on the left, or
                                       right (set to ``False``).

        Returns:
            bool: True for every matching tag.

        Note:
            This function can be used as parameter for ``.find()`` method in
            HTMLElement.
        """
        def has_neigh_closure(element):
            if not element.parent \
               or not (element.isTag() and not element.isEndTag()):
                return False

            # filter only visible tags/neighbours
            childs = element.parent.childs
            childs = filter(
                lambda x: (x.isTag() and not x.isEndTag()) \
                          or x.getContent().strip() or x is element,
                childs
            )
            if len(childs) <= 1:
                return False

            ioe = childs.index(element)
            if left and ioe > 0:
                return is_equal_tag(childs[ioe - 1], tag_name, params, content)

            if not left and ioe + 1 < len(childs):
                return is_equal_tag(childs[ioe + 1], tag_name, params, content)

            return False

        return has_neigh_closure


    # Generated parsers
    def get_second(dom):
        el = dom.find(
            'container',
            {'id': 'mycontent'},
            fn=has_neigh(None, None, 'something something', left=False)
        )

        # pick element from list
        el = el[0] if el else None

        return el


    def get_first(dom):
        el = dom.wfind('root').childs

        if not el:
            raise UserWarning(
                "Can't find variable 'first'.\n" +
                'Tag name: root\n' +
                'El:' + str(el) + '\n' +
                'Dom:' + str(dom)
            )

        el = el[-1]

        el = el.wfind('xax').childs

        if not el:
            raise UserWarning(
                "Can't find variable 'first'.\n" +
                'Tag name: xax\n' +
                'El:' + str(el) + '\n' +
                'Dom:' + str(dom)
            )

        el = el[-1]

        el = el.wfind('container').childs

        if not el:
            raise UserWarning(
                "Can't find variable 'first'.\n" +
                'Tag name: container\n' +
                'El:' + str(el) + '\n' +
                'Dom:' + str(dom)
            )

        el = el[-1]

        return el


    # Unittest
    def test_parsers():
        # Test parsers against autoparser/autoparser_data/simple_xml.xml
        html = handle_encodnig(
            _get_source('autoparser/autoparser_data/simple_xml.xml')
        )
        dom = dhtmlparser.parseString(html)
        dhtmlparser.makeDoubleLinked(dom)

        second = get_second(dom)
        assert second.getContent().strip() == 'and this'

        first = get_first(dom)
        assert first.getContent().strip() == "i wan't this"

        # Test parsers against autoparser/autoparser_data/simple_xml2.xml
        html = handle_encodnig(
            _get_source('autoparser/autoparser_data/simple_xml2.xml')
        )
        dom = dhtmlparser.parseString(html)
        dhtmlparser.makeDoubleLinked(dom)

        second = get_second(dom)
        assert second.getContent().strip() == 'another wanted thing'

        first = get_first(dom)
        assert first.getContent().strip() == 'something wanted'


    # Run tests of the parser
    if __name__ == '__main__':
        test_parsers()

API
---
.. automodule:: harvester.edeposit_autoparser
    :members:
    :undoc-members:
    :private-members:
