Metadata-Version: 1.0
Name: newspaper
Version: 0.0.4
Summary: Simplified python article discovery & extraction.
Home-page: https://github.com/codelucas/newspaper/
Author: Lucas Ou-Yang
Author-email: lucasyangpersonal@gmail.com
License: The MIT License (MIT)

Copyright (c) 2013 Lucas Ou-Yang

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Description: Newspaper: Article scraping & curation
        ======================================
        
        .. image:: https://badge.fury.io/py/newspaper.png
            :target: http://badge.fury.io/py/newspaper
            :alt: Latest version
        
        Inspired by `requests`_ for its **simplicity** and powered by `lxml`_ for its **speed**, *newspaper*
        is a Python 2 library for extracting & curating articles from the web.
        
        Newspaper wants to change the way people handle article extraction with a new, more precise
        layer of abstraction. Newspaper caches whatever it can for speed. *Also, everything is in unicode.*
        
        Please refer to `The Documentation`_ for a quickstart tutorial!
        
        A Glance:
        ---------
        
        .. code-block:: pycon
        
            >>> import newspaper
        
            >>> cnn_paper = newspaper.build('http://cnn.com')
        
            >>> for article in cnn_paper.articles:
            ...     print article.url
            http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
            http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
            ...
        
            >>> for category in cnn_paper.category_urls():
            ...     print category
            http://lifestyle.cnn.com
            http://cnn.com/world
            http://tech.cnn.com
            ...
        
        .. code-block:: pycon
        
            >>> article = cnn_paper.articles[0]
        
        .. code-block:: pycon
        
            >>> article.download()
        
            >>> article.html
            u'<!DOCTYPE HTML><html itemscope itemtype="http://...'
        
        .. code-block:: pycon
        
            >>> article.parse()
        
            >>> article.authors
            [u'Leigh Ann Caldwell', u'John Honway']
        
            >>> article.text
            u"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."
        
        .. code-block:: pycon
        
            >>> article.nlp()
        
            >>> article.keywords
            ['New Years', 'resolution', ...]
        
            >>> article.summary
            u'The study shows that 93% of people ...'
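        
        The three steps above run in sequence: download, then parse, then nlp. A minimal sketch that
        wraps them in one helper follows. ``process_article`` is a hypothetical name, not part of
        newspaper, and it assumes only the methods and attributes shown in the examples above.
        
        .. code-block:: python
        
            def process_article(article):
                """Run download, then parse, then nlp, and collect the results in a dict."""
                article.download()  # fetch the raw HTML
                article.parse()     # extract authors and body text from the HTML
                article.nlp()       # derive keywords and a summary from the text
                return {
                    'url': article.url,
                    'authors': article.authors,
                    'text': article.text,
                    'keywords': article.keywords,
                    'summary': article.summary,
                }
        
        Calling ``process_article(cnn_paper.articles[0])`` would then return all extracted fields at once.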
        
        Documentation
        -------------
        
        Check out `The Documentation`_ for full and detailed guides using newspaper.
        
        Features
        --------
        
        - News URL identification
        - Text extraction from HTML
        - Keyword extraction from text
        - Summary extraction from text
        - Author extraction from text
        - Top image extraction from HTML
        - All image extraction from HTML
        - Multi-threaded article download framework
        - Google trending terms extraction
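        
        The multi-threaded download idea can be pictured with a generic thread pool. The sketch below
        is not newspaper's own framework, whose API is not shown here. ``download_all`` and its
        ``num_threads`` parameter are hypothetical, and the only thing assumed about each article is
        the ``download()`` method shown in the examples above.
        
        .. code-block:: python
        
            from multiprocessing.pool import ThreadPool
        
            def download_all(articles, num_threads=4):
                """Call download() on every article in a list, using worker threads."""
                if not articles:
                    return articles
                pool = ThreadPool(min(num_threads, len(articles)))
                try:
                    # map() blocks until every download() call has returned.
                    pool.map(lambda article: article.download(), articles)
                finally:
                    pool.close()
                    pool.join()
                return articles
        
        Threads suit this job because downloading is I/O-bound: each worker spends most of its time
        waiting on the network.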
        
        Get it now
        ----------
        ::
        
            $ pip install newspaper
        
            IMPORTANT
            If you plan to use the natural language features (nlp()), you must
            also download some separate nltk corpora with the command below.
            Run it under Python 2.6 or 2.7!
        
            $ curl https://raw.github.com/codelucas/newspaper/master/download_corpora.py | python2.7
        
        Todo List
        ---------
        
        - Add a "follow_robots.txt" option in the config object.
        - Bake in the CSSSelect and BeautifulSoup dependencies.
        
        .. _`Quickstart guide`: https://newspaper.readthedocs.org/en/latest/
        .. _`The Documentation`: http://newspaper.readthedocs.org
        .. _`lxml`: http://lxml.de/
        .. _`requests`: http://docs.python-requests.org/en/latest/
        
        0.0.4 - Fully integrated the python-goose library into newspaper. Article objects
                now have many more options. All configurations are now based on Configuration()
                objects, which can be passed into Source or Article objects. Default configuration
                setups make this easy. Added a simple multithreaded article download framework.
        
Platform: UNKNOWN
