Metadata-Version: 1.0
Name: pbot
Version: 1.4.0
Summary: A simple site crawler with proxy support
Home-page: http://bitbucket.org/zeus/pbot
Author: Pavel Zhukov
Author-email: gelios@gmail.com
License: GPL
Description: Description:
        ============
        Pbot contains two modules, Bot and Spider
        
        Bot is a simple helper created to preserve request state (cookies, referrer) between HTTP requests. It also provides additional methods for adding cookies. Since it has no dependencies, this module is easy to use when you need to simulate a browser.
        
        Spider is Bot armed with lxml (required). It provides additional methods for easy website crawling; see below.
        
        Bot is very easy to use::
        
            from pbot.pbot import Bot
            bot = Bot(proxies={'http': 'localhost:3128'})  # Proxies can be passed at creation or set later via bot.proxies
            bot.add_cookie({'name': 'sample', 'value': 1, 'domain': 'example.com'})
            response = bot.open('http://example.com')  # Open with cookies and an empty referrer
            bot.follow('http://google.com')  # Open google with example.com as the referrer
            response = bot.response  # The response is saved and can be read later
            bot.follow('http://example.com', post={'q': 'abc'})  # POST and GET data can be passed as keyword arguments
            bot.refresh_connector()  # Flush cookies and referrer
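        
        The state that Bot carries between requests can be pictured with the standard library alone. The sketch below is not pbot's implementation: ``StatefulClient`` and all of its methods are hypothetical names, written for Python 3, and only illustrate the idea of a shared cookie jar plus a remembered referrer.
        
        ```python
        import urllib.request
        from http.cookiejar import CookieJar
        
        
        class StatefulClient:
            """Hypothetical stand-in (not pbot's Bot): keeps cookies and the last URL."""
        
            def __init__(self):
                self.jar = CookieJar()
                # The cookie processor stores and resends cookies on every request.
                self.opener = urllib.request.build_opener(
                    urllib.request.HTTPCookieProcessor(self.jar))
                self.last_url = None
        
            def build_request(self, url, send_referrer):
                request = urllib.request.Request(url)
                if send_referrer and self.last_url:
                    request.add_header('Referer', self.last_url)
                return request
        
            def _send(self, url, send_referrer):
                response = self.opener.open(self.build_request(url, send_referrer))
                self.last_url = response.geturl()  # URL after redirects
                return response
        
            def open(self, url):
                # Assumed behaviour: send cookies but no referrer.
                return self._send(url, send_referrer=False)
        
            def follow(self, url):
                # Assumed behaviour: send the previous URL as the referrer.
                return self._send(url, send_referrer=True)
        
            def reset(self):
                # Forget all cookies and the referrer.
                self.jar.clear()
                self.last_url = None
        ```
        
        Building the request separately from sending it keeps the referrer logic easy to check without touching the network.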
        
        
        
        Spider gives you special features::
        
            from pbot.spider import Spider
            bot = Spider()  # or Spider(force_encoding='utf-8') to force an encoding for the parser
            bot.open('http://example.com')
            bot.tree.xpath('//a')  # The lxml tree is available as .tree; the response is read and parsed by lxml.html automatically
            form = bot.xpath('//form[@id="main"]')  # xpath is a shortcut for bot.tree.xpath
            bot.submit(form)  # Submit an lxml form
            #
            # Crawler: recursively crawls from the target page, yielding xml_tree, query_url, real_url
            # (real_url is the URL after all redirects).
            bot.crawl(
                url=None,  # Target URL to start crawling from
                check_base=True,  # Yield only pages on the same domain as url
                only_descendant=True,  # Yield only pages whose URLs start with url
                max_level=None,  # Maximum level
                allowed_protocols=('http:', 'https:'),
                ignore_errors=True,
                ignore_starts=(),  # Tuple/array; ignore URLs that start with any of these prefixes (to exclude parts of a site)
                check_mime=())
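        
        How the crawl filters might combine can be sketched as a single predicate. This is an illustration only, under the assumption that the options behave as their comments describe; ``should_yield`` is a hypothetical helper and not part of pbot, and the real crawler may apply its checks differently. The sketch targets Python 3.
        
        ```python
        from urllib.parse import urlparse
        
        
        def should_yield(start_url, candidate_url, check_base=True,
                         only_descendant=True,
                         allowed_protocols=('http:', 'https:'),
                         ignore_starts=()):
            """Hypothetical filter mirroring the crawl options; not pbot's code."""
            start = urlparse(start_url)
            candidate = urlparse(candidate_url)
            # allowed_protocols: compare the scheme in the 'http:' form used above.
            if candidate.scheme + ':' not in allowed_protocols:
                return False
            # check_base: stay on the domain of the start URL.
            if check_base and candidate.netloc != start.netloc:
                return False
            # only_descendant: the candidate must begin with the start URL.
            if only_descendant and not candidate_url.startswith(start_url):
                return False
            # ignore_starts: skip any URL beginning with an excluded prefix.
            if any(candidate_url.startswith(prefix) for prefix in ignore_starts):
                return False
            return True
        
        
        print(should_yield('http://example.com/docs/', 'http://example.com/docs/api.html'))
        print(should_yield('http://example.com/docs/', 'http://example.com/blog/'))
        ```
        
        With the defaults, the first call accepts the page under ``/docs/`` and the second rejects ``/blog/`` because it is not a descendant of the start URL.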
        
        
Keywords: crawling,bot
Platform: UNKNOWN
