====
Soup
====

The bop.soup module implements a subset of the BeautifulSoup API.
It is intended as a replacement for BeautifulSoup in cases where ill-formed
input data may cause infinite loops or other blocking edge cases.

Let's take the typical task of extracting a document title as an example.
Soup is called exactly like BeautifulSoup:

    >>> from bop.soup import Soup
    
    >>> def extractTitle(html, default=None):
    ...     # ``default`` is passed through as the encoding hint.
    ...     soup = Soup(html, fromEncoding=default)
    ...     title = soup.first('title')
    ...     if title is not None:
    ...         return title.string
    ...     # No <title>: fall back to headings, <strong> and <p>, in that order.
    ...     for tag in 'h1', 'h2', 'h3', 'h4', 'h5', 'strong', 'p':
    ...         found = soup.first(tag)
    ...         if found:
    ...             # Join only the direct text children of the tag.
    ...             texts = [s for s in found.contents if isinstance(s, unicode)]
    ...             return u''.join(texts).strip()
    ...     return u''

The above function returns the title as a unicode string:

    >>> html = '''<html>
    ... <head>
    ... <title>Person Template</title>
    ... <meta name="description" content="A short description">
    ... </head>
    ... <body><p>Content</p></body></html>'''
    
    >>> extractTitle(html)
    u'Person Template'

If we apply this function to an image, we see that the parser doesn't hang:

    >>> import bop, os.path
    >>> here = os.path.dirname(bop.__file__)
    >>> tiff = os.path.join(here, 'testdata', 'test.tiff')
    >>> extractTitle(open(tiff, 'rb').read())
    u''
    
A real-world example that failed in a former version:
    
    >>> jpeg = os.path.join(here, 'testdata', 'test.jpg')
    >>> extractTitle(open(jpeg, 'rb').read())
    u''
