Metadata-Version: 1.1
Name: memsql-loader
Version: 2.0.0
Summary: MemSQL Loader helps you run complex ETL workflows against MemSQL
Home-page: https://github.com/memsql/memsql-loader
Author: MemSQL
Author-email: support@memsql.com
License: LICENSE.txt
Download-URL: https://github.com/memsql/memsql-loader/releases/latest
Description: =============
        MemSQL Loader
        =============
        
        MemSQL Loader is a tool that lets you load sets of files from Amazon S3, the Hadoop Distributed
        File System (HDFS), and the local filesystem into MemSQL (or MySQL) with just one command. You can
        specify all of the files you want to load with one command, and MemSQL Loader will take care of
        deduplicating files, parallelizing the workload, retrying files if they fail to load, and more.
        
        Background
        ==========
        
        One of the most common tasks with any database is loading large amounts of data into
        it from an external data store. Both MemSQL and MySQL provide the LOAD DATA command
        for this task; this command is very powerful, but by itself, it has a number of restrictions:
        
         * It can only read from the local filesystem, so loading data from a remote store like
           Amazon S3 requires first downloading the files you need.
         * Since it can only read from a single file at a time, if you want to load from multiple
           files, you need to issue multiple LOAD DATA commands. If you want to perform this work
           in parallel, you have to write your own scripts.
         * If you are loading multiple files, it’s up to you to make sure that you’ve deduplicated
           the files and their contents.
        
        At MemSQL, we’ve acutely felt all of these limitations. That’s why we developed MemSQL Loader,
        which solves all of the above problems and more.
        
        Basic Usage
        ===========
        
        Downloading the Loader
        ----------------------
        
        The loader is a standalone binary that you can download and run directly. We keep the latest
        version hosted at https://github.com/memsql/memsql-loader/releases. The binary is produced by compiling
        this python project with PyInstaller.
        
        You can download this repo and run the loader directly. If you do so, you will need to
        install virtualenv, libcurl, and libncurses. Once you've downloaded the repo,
        `cd` into its directory and run
        
            $ source activate
        
        You should see the prefix `(venv)` in your shell. You can run the loader with
        
            (venv) $ ./bin/memsql-loader --help
        
        Installing from PyPi
        --------------------
        
        MemSQL Loader is also released onto PyPi and thus is installable via tools such as pip.
        
            $ pip install memsql-loader
        
        When installing via pip, APSW will need to be installed separately.  Running memsql-loader will provide the correct installation command for APSW.
        
            $ memsql-loader
        
        Running the Loader
        ------------------
        
        The primary interface to the loader is the `memsql-loader load` command. The command takes arguments
        that specify the source, parsing options, and destination server. For example, to load some files
        from S3, you can run
        
            $ ./memsql-loader load -h 127.0.0.1 -u root --database db --table t \
                s3://memsql-loader-examples/sanity/*
        
        The loader automatically daemonizes and runs the load process in a background server. You can monitor its
        progress with
        
            $ ./memsql-loader ps --watch
        
        If you would like to run this example against MemSQL or MySQL, run
        
            memsql> CREATE DATABASE db;
            memsql> CREATE TABLE    db.t (a int, b int, primary key (a));
        
        File Pattern Syntax
        -------------------
        
        The loader supports loading files from Amazon S3, HDFS, and the local filesystem. The file's prefix
        determines the source. You can specify "s3://", "hdfs://", or "file://". If you omit the prefix,
        then the loader defaults to the local filesystem.
        
        The loader also supports glob syntax (with semantics similar to bash). A single `*` matches files
        in the current directory, and `**` matches files recursively. MemSQL Loader uses the glob2 library
        under the hood to facilitate this.
        
        File Parsing Options
        --------------------
        
        MemSQL Loader's command line options mirror the LOAD DATA command's syntax. See the `load data options`
        section in `./memsql-loader load --help` for reference.
        
        Automatic Deduplication
        -----------------------
        
        MemSQL Loader is designed to support one-time loads as well as synchronizing behavior. You can use this
        functionality to effectively sync a table's data to the set of files matching a path. The loader will automatically
        deduplicate files that it knows it does not need to load (by using the MD5 feature on S3), and transactionally
        delete and reload data when the contents of a file have changed.
        
        NOTE: This reload behavior requires specifying a column to use as a `file_id`.
        
        Spec Files
        ----------
        
        We found with usage that it was really convenient to be able to define a load job as a JSON file, instead of
        just command line options. MemSQL Loader lets you use "spec files" to accomplish this. To generate one, just
        append `--print-spec` to the `./memsql-loader load` command. It will generate a spec file that you can
        use with `--spec`. Any command line options that you provide along with `--spec` will override options
        in the spec file.
        
        TODO
        ====
        
        * We have a pretty big test suite for the loader, but it's tied closely to MemSQL's internal testing
          infrastructure. We're going to separate these tests out and add them to this repo.
        * Right now the loader supports MemSQL and MySQL (via the LOAD DATA command), but does not support
          other database systems. We would love for members of the community to add support for more systems.
        * Error reporting and job management is fairly undeveloped in the loader. We'll integrate this further
          into our MemSQL Ops platform over time, but it would be great to see some iteration on this here as well.
        
        Third-party code
        ================
        MemSQL Loader includes a fork of the python-glob2 project (https://github.com/miracle2k/python-glob2/).
        The code for this fork can be found in [memsql_loader/vendor/glob2](memsql_loader/vendor/glob2).
        
Platform: Linux
Platform: Mac OS X
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 2.7
