Metadata-Version: 1.1
Name: screp
Version: 0.3
Summary: Command-line utility for easy scraping of HTML documents
Home-page: https://github.com/darfire/screp
Author: Doru Arfire
Author-email: doruarfire@gmail.com
License: Copyright (c) 2013, Doru Arfire
All rights reserved.

This file is part of screp.

screp is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published
 by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.
Description: ==================================
        screp, easy command-line scraping
        ==================================
        
        
        What is screp?
        ==============
        
        **screp** is a command-line utility that provides easy and flexible scraping of HTML documents. It
        works by finding a set of *anchors* (specified using a CSS selector) and then extracting information
        relative to those anchors, optionally post-processing it using a set of standard operations. For each
        anchor it outputs a record formatted according to one of the supported formats (CSV, JSON or
        general).
        
        
        Invoking screp
        ==============
        
        **screp** is invoked using the following syntax::
        
            $ screp [OPTION] FORMAT_SPEC PRIMARY_SELECTOR [FILES]
        
        where:
        
        * FORMAT_SPEC is a format specification, one of:
        
          - *-c CSV_FORMAT_SPEC*, which formats each record as a row of comma-separated values
          - *-j JSON_FORMAT_SPEC*, which formats each record as a JSON object and the whole output as a
            list of JSON objects
          - *-f GENERAL_FORMAT_SPEC*, which formats each record according to a general format in which
            term specifications are replaced by their computed values (similar to bash parameter
            substitution)
        
        * PRIMARY_SELECTOR is a CSS selector that specifies the *primary anchor*, as detailed below
        * FILES is a list of local files or absolute URLs; if no FILES are specified, the standard input
          is read
        
        
        How does screp work?
        ====================
        
        **screp** tries to automate many of the steps taken when writing your own scraper, such as:
        
        * fetching the HTML documents, if necessary
        * parsing HTML
        * locating areas of interest in the DOM of the document
        * locating interesting information around those areas
        * simple processing of these pieces of information
        * formatting of the information
        * outputting the information
        
        To use screp, you need to take a series of steps:
        
        * tell screp where to get the HTML documents; it works with multiple documents, from sources such
          as the web, the local file-system or STDIN
        * define the *primary anchor* using a CSS selector: the elements it matches are the ones through
          which you access records of interest in the HTML documents
        * specify the output format; this implies specifying:
        
          - *terms*, which are strings computed relative to the anchors
          - how these terms are combined to produce a record; currently screp supports three methods of
            specifying formats:
        
            - CSV
            - JSON
            - general format
        
        * optionally, you can also define *secondary anchors*, which are elements computed relative to the
          *primary anchor* that can be used to define *terms* in a more succinct way
        
        Defining terms
        ==============
        
        A *term* has the following format::
        
            anchor.accessor.accessor.accessor|filter|filter|filter
        
        In other words, a term is an anchor (primary or secondary) followed by zero or more accessors
        followed by zero or more filters.
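        
        As a rough illustration of this structure (not screp's actual parser), the following Python
        sketch splits a term into its anchor, accessors and filters. The helper names 'split_top_level'
        and 'split_term' are hypothetical, and the sketch assumes that any separator inside parentheses
        or quotes belongs to a parameter::
        
            def split_top_level(text, sep):
                """Split text on sep, ignoring separators inside parentheses or quotes."""
                parts, current, depth, quote = [], [], 0, None
                for ch in text:
                    if quote:
                        if ch == quote:
                            quote = None
                    elif ch in "\"'":
                        quote = ch
                    elif ch == "(":
                        depth += 1
                    elif ch == ")":
                        depth -= 1
                    elif ch == sep and depth == 0:
                        parts.append("".join(current).strip())
                        current = []
                        continue
                    current.append(ch)
                parts.append("".join(current).strip())
                return parts
        
            def split_term(term):
                """Return (anchor, accessors, filters) for a term string."""
                head, *filters = split_top_level(term, "|")
                anchor, *accessors = split_top_level(head, ".")
                return anchor, accessors, filters
        
            print(split_term('$.parent.parent.attr(title)|upper'))
            # ('$', ['parent', 'parent', 'attr(title)'], ['upper'])
        
        Tracking parentheses and quotes keeps a selector such as ".record" in one piece even though it
        contains a dot.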
        
        *Accessors* and *filters* (also collectively called *actions*) are functions that take the output
        value of the last function (or the anchor, if this is the first action) and output another value. In
        other words, they form a pipeline.  Accessors act on DOM elements and sets (actually ordered lists)
        of elements, whereas filters act on strings. Each action has an in_type and an out_type. For a term
        to be correctly defined the out_type of an action needs to match the in_type of the following
        action.
        
        The supported types are: 'string', 'element', 'element_set'.
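        
        The type rule can be sketched in Python as a check over a chain of action names. The signatures
        below are copied from the predefined actions listed later in this document; the function name
        'term_out_type' and the assumption that every anchor has type 'element' are illustrative and not
        taken from screp's source::
        
            # (in_type, out_type) for a few of the predefined actions.
            SIGNATURES = {
                "parent": ("element", "element"),
                "desc": ("element", "element_set"),
                "first": ("element_set", "element"),
                "text": ("element", "string"),
                "attr": ("element", "string"),
                "upper": ("string", "string"),
                "trim": ("string", "string"),
            }
        
            def term_out_type(actions, anchor_type="element"):
                """Return the out_type of a chain of action names, or raise TypeError."""
                current = anchor_type
                for name in actions:
                    in_type, out_type = SIGNATURES[name]
                    if in_type != current:
                        raise TypeError(f"{name} expects {in_type!r}, got {current!r}")
                    current = out_type
                return current
        
            print(term_out_type(["parent", "parent", "attr", "upper"]))  # string
            print(term_out_type(["desc", "first"]))                      # element
        
        A chain such as 'text' followed by 'first' fails the check, because 'text' outputs a 'string'
        while 'first' expects an 'element_set'.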
        
        Actions can have zero or more parameters. When the action takes parameters it is specified as a
        function::
        
            action(parameter1, parameter2, parameter3)
        
        When not, only the action name is specified (no parentheses).
        
        Finally, terms have restrictions on the out_type of their last action (also called the out_type of
        the term):
        
        * if a term is used inside a format specification, its out_type must be 'string'
        * if a term is used to define a secondary anchor, its out_type must be 'element'
        
        Examples of terms
        -----------------
        
        These are correct term definitions::
        
            '$.parent.parent.attr(title)|upper' outputs 'string'
            '@.desc(".record").first' outputs 'element'
            'anchor.ancestors(".box").children(".price")' outputs 'element_set'
        
        Predefined anchors and actions
        ==============================
        
        The following anchors are predefined:
        
        * **$** is the primary anchor defined by the primary anchor selector
        * **@** is the anchor representing the root of the current document
        
        The following accessors are predefined:
        
        * **first** [in_type='element_set', out_type='element']: returns the first element in an element_set
        * **last** [in_type='element_set', out_type='element']: returns the last element in an element_set
        * **nth(n)** [in_type='element_set', out_type='element']: returns the n-th element in an
          element_set; it also supports negative indexes, where -1 represents the last element, -2 the
          second-to-last element, and so on
        * **class** [in_type='element', out_type='string']: returns the value of the 'class' attribute 
        * **id** [in_type='element', out_type='string']: returns the value of the 'id' attribute 
        * **parent** [in_type='element', out_type='element']: returns the parent of the current element
        * **text** [in_type='element', out_type='string']: returns the text enclosed by the current element
        * **tag** [in_type='element', out_type='string']: returns the tag of the current element
        * **attr(attr_name)** [in_type='element', out_type='string']: returns the value of the current element's
          attribute with name 'attr_name'
        * **desc(css_sel)** [in_type='element', out_type='element_set']: returns the ordered list of
          descendants of the current element selected by the CSS selector specified by 'css_sel'
        * **fdesc(css_sel)** [in_type='element', out_type='element']: equivalent to
          .desc(css_sel).first
        * **ancestors(css_sel)** [in_type='element', out_type='element_set']: returns the list of ancestors
          of the current element that satisfy the CSS selector specified by 'css_sel'
        * **children(css_sel)** [in_type='element', out_type='element_set']: returns the list of children
          of the current element that satisfy the CSS selector specified by 'css_sel'
        * **psiblings(css_sel)** [in_type='element', out_type='element_set']: returns the list of preceding
          siblings of the current element that satisfy the CSS selector specified by 'css_sel'
        * **fsiblings(css_sel)** [in_type='element', out_type='element_set']: returns the list of following
          siblings of the current element that satisfy the CSS selector specified by 'css_sel'
        * **siblings(css_sel)** [in_type='element', out_type='element_set']: returns the list of siblings of
          the current element that satisfy the CSS selector specified by 'css_sel'
        * **matching(css_sel)** [in_type='element_set', out_type='element_set']: filters an element_set and
          returns all elements that match the CSS selector specified by 'css_sel'
        
        The following filters are predefined:
        
        * **upper** [in_type='string', out_type='string']: converts string to uppercase
        * **lower** [in_type='string', out_type='string']: converts string to lowercase
        * **trim** [in_type='string', out_type='string']: removes spaces at the beginning and end of the
          string
        * **strip(chars)** [in_type='string', out_type='string']: removes characters specified by 'chars'
          at the beginning and end of the string
        * **replace(old, new)** [in_type='string', out_type='string']: replaces all occurrences of 'old' with
          'new'
        * **resub(pattern, repl)** [in_type='string', out_type='string']: performs a regular expression
          substitution; *pattern* and *repl* have the formats taken by the **re.sub** function from the
          Python standard library
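        
        Most of these filters have direct counterparts among Python's string methods, so their combined
        effect can be sketched as follows. This is an illustration rather than screp's implementation:
        the names 'FILTERS' and 'apply_filters' are hypothetical, and 'trim' is assumed to remove all
        leading and trailing whitespace::
        
            import re
        
            FILTERS = {
                "upper": lambda s: s.upper(),
                "lower": lambda s: s.lower(),
                "trim": lambda s: s.strip(),  # assumption: strips any whitespace
                "strip": lambda s, chars: s.strip(chars),
                "replace": lambda s, old, new: s.replace(old, new),
                "resub": lambda s, pattern, repl: re.sub(pattern, repl, s),
            }
        
            def apply_filters(value, steps):
                """Apply (name, *args) steps left to right, like 'value|f1|f2(...)'."""
                for name, *args in steps:
                    value = FILTERS[name](value, *args)
                return value
        
            steps = [("trim",), ("strip", "$USD "), ("replace", ",", "")]
            print(apply_filters("  $1,299.00 USD ", steps))  # 1299.00
            print(apply_filters("12-2013", [("resub", r"(\d+)-(\d+)", r"\2/\1")]))  # 2013/12
        
        As with **re.sub**, the *repl* argument can refer to captured groups with backreferences such as
        \\1 and \\2.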
        
        Specifying output formats
        =========================
        
        CSV format
        ----------
        
        The CSV output format is specified using the -c option. Optionally, using the -H option you can
        specify a CSV header to output before outputting records.
        
        Example::
        
            -c '$.attr(title), $.parent.desc(".price").text | trim' -H 'name, price'
        
        
        JSON format
        -----------
        
        The JSON output format is defined using the -j option. It formats the output as a JSON list of
        objects, one for each record. The *--indent-json* flag tells screp to indent each object. The format
        is specified as a comma-separated list of *key=value* pairs, where the *key* represents the JSON key
        in the record object while *value* is a term specification.
        
        Example::
        
            -j 'text=$.text, ptext=$.parent.text | upper, gptext=$.parent.parent.text'
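        
        To make the *key=value* structure concrete, here is a Python sketch that splits such a
        specification into keys and term strings and then prints one record as JSON. The function
        'parse_json_spec' is hypothetical and is not how screp necessarily parses the option; it
        assumes commas inside parentheses belong to action parameters, and the record values and the
        indent width are made up for the illustration::
        
            import json
        
            def parse_json_spec(spec):
                """Split 'key=term, key=term' into a dict, ignoring commas inside (...)."""
                pairs, current, depth = [], [], 0
                for ch in spec:
                    if ch == "(":
                        depth += 1
                    elif ch == ")":
                        depth -= 1
                    if ch == "," and depth == 0:
                        pairs.append("".join(current))
                        current = []
                    else:
                        current.append(ch)
                pairs.append("".join(current))
                result = {}
                for pair in pairs:
                    key, _, term = pair.partition("=")  # split on the first '=' only
                    result[key.strip()] = term.strip()
                return result
        
            spec = 'text=$.text, ptext=$.parent.text | upper, gptext=$.parent.parent.text'
            print(parse_json_spec(spec))
        
            # Hypothetical computed values for a single record.
            records = [{"text": "a", "ptext": "B", "gptext": "c"}]
            output = json.dumps(records, indent=2)
            print(output)
        
        The sketch does not handle a closing parenthesis inside a quoted parameter; a real parser
        would need to track quotes as well.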
        
        
        General format
        --------------
        
        The general format is specified using the -f option, as a string containing term specifications.
        To distinguish them from the surrounding literal text, each term specification is surrounded by
        braces. When formatting a record each term specification is substituted with the computed value
        for that term.
        
        Example::
        
            -f 'some header {$.parent.text | replace("X", "Y")} some middle {$.tag} some tail'
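        
        The substitution step can be sketched in Python with a regular expression that finds each
        brace-delimited term and replaces it with a value. The function 'render' and the computed values
        are hypothetical, and the sketch does not cover how screp treats literal braces in the format
        string::
        
            import re
        
            def render(template, values):
                """Replace each {term} in template with values[term]."""
                return re.sub(r"\{([^{}]*)\}", lambda m: values[m.group(1).strip()], template)
        
            template = 'some header {$.parent.text | replace("X", "Y")} some middle {$.tag} some tail'
            values = {
                '$.parent.text | replace("X", "Y")': "YYZ",  # hypothetical computed value
                "$.tag": "a",                                # hypothetical computed value
            }
            print(render(template, values))  # some header YYZ some middle a some tail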
        
        
        Specifying secondary anchors
        ============================
        
        Secondary anchors are specified using the -a option. There can be any number of secondary anchor
        definitions. Each definition has the format **<name>=<term>**, where <name> is an identifier and
        <term> is a term definition relative to any of the previously defined anchors (primary or
        secondary) that outputs an element. A secondary anchor can be redefined in a later -a option;
        only the last definition is retained.
        
        Secondary anchors examples
        --------------------------
        
        These are examples of secondary anchor definitions::
        
            -a 'p=$.parent' -a 'gp=p.parent'
        
            -a 'interesting=$.fdesc(".interesting-class")' -a 'interesting=interesting.parent'
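        
        One way to picture how definitions build on each other is to evaluate them in order against a
        toy tree. The Python sketch below is an assumption about the semantics, not screp's code: the
        'Node' class and 'resolve_anchors' function are hypothetical, only the 'parent' accessor is
        modelled, and each definition is assumed to be evaluated as soon as it is read::
        
            class Node:
                """Minimal stand-in for a DOM element that knows its parent."""
                def __init__(self, name, parent=None):
                    self.name, self.parent = name, parent
        
            def resolve_anchors(primary, definitions):
                """Evaluate 'name=term' definitions in order, starting from the primary anchor."""
                anchors = {"$": primary}
                for definition in definitions:
                    name, _, term = definition.partition("=")
                    base, *accessors = term.split(".")
                    element = anchors[base]  # KeyError if the anchor is not defined yet
                    for accessor in accessors:
                        if accessor != "parent":
                            raise NotImplementedError(accessor)
                        element = element.parent
                    anchors[name] = element  # a later definition replaces an earlier one
                return anchors
        
            html = Node("html")
            body = Node("body", html)
            div = Node("div", body)
            link = Node("a", div)
        
            anchors = resolve_anchors(link, ["p=$.parent", "gp=p.parent", "p=gp.parent"])
            print(anchors["p"].name, anchors["gp"].name)  # html body
        
        Under this reading, a redefinition may refer to the anchor's own earlier value, as the second
        example above does.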
        
Platform: UNKNOWN
Classifier: License :: OSI Approved
Classifier: License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)
Classifier: Programming Language :: Python
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Topic :: Internet :: WWW/HTTP
