Copyright 2011,2012,2013 Ken Farmer
See the file "LICENSE" for the full license governing this code. 



=================================================================================================
Metadata Overview
=================================================================================================

The primary objectives of the gristle metadata include:
   - allow gristle tools to reference stored data structure info, rather than forcing the
     user to provide it all at run time or the tools to determine it at run time.
   - allow the tools to accumulate data analysis results in a central repository for
     time-series analysis or broad sharing of the results.

There are several main components to this solution:
    - metadata for data dictionary
    - metadata for analysis
    - metadata for transformation
    - metadata API



=================================================================================================
Data Dictionary Tables
=================================================================================================

All of this metadata is defined up-front by the user to describe data schemas, structures 
and elements, deployments or instances of these data structures, and inspection profiles.

- Schema archetypes - 4 primary tables that describe schemas in a general way, but do not 
  get into deployment specifics.  This separation is intended to allow a single definition
  with multiple instances that are described separately.  These tables include:
  - schema      - a set of structures - typically a database model, set of related csv
                  files, possibly an xml document.
  - collection  - a hierarchical structure that consists of fields, collections of 
                  fields, and collections of collections of fields.   Could be 
                  used to describe a csv file, database table, etc.
  - field       - a single discrete field or column of data within a collection.
  - element     - an attribute that appears more than once in a schema or structure
                  can be defined once here and that definition reused anywhere it is 
                  referenced.

- Schema instances - 1 table that describes a specific deployment of a schema.  This 
  allows a schema archetype to be deployed multiple times, possibly in different ways 
  (csv file & fixed-length file & database table) or with slightly different 
  characteristics in each implementation.  Even if the deployments need no difference
  in configuration, the analysis and determination of the data can be kept 
  separate through these instances.  This table includes:
  - instance    - Can be used to distinguish between multiple database deployments,
                  different environments (prod, uat, test, etc), or different
                  sets of files (Colorado files vs Utah files, etc).  If there's 
                  only a single instance of a schema - then it still needs a single
                  entry here. 
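
The tables above can be sketched as a small relational model.  This is only an illustration
and not the actual gristle DDL: the table names come from the list above, but every column
name, the sample rows, and the helper build_dictionary() are assumptions.  SQLite is used
only because it ships with Python.

```python
import sqlite3

# Table names follow the text above; every column name here is an assumption.
DDL = '''
CREATE TABLE "schema" (
    schema_id    INTEGER PRIMARY KEY,
    schema_name  TEXT NOT NULL UNIQUE
);
CREATE TABLE element (
    element_name TEXT PRIMARY KEY,
    element_desc TEXT
);
CREATE TABLE collection (
    collection_id        INTEGER PRIMARY KEY,
    schema_id            INTEGER NOT NULL REFERENCES "schema"(schema_id),
    parent_collection_id INTEGER REFERENCES collection(collection_id),
    collection_name      TEXT NOT NULL
);
CREATE TABLE field (
    field_id      INTEGER PRIMARY KEY,
    collection_id INTEGER NOT NULL REFERENCES collection(collection_id),
    field_name    TEXT NOT NULL,
    element_name  TEXT REFERENCES element(element_name)
);
CREATE TABLE instance (
    instance_id   INTEGER PRIMARY KEY,
    schema_id     INTEGER NOT NULL REFERENCES "schema"(schema_id),
    instance_name TEXT NOT NULL
);
'''


def build_dictionary():
    """Create the tables in memory and describe one schema deployed twice."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    # One archetype: a schema with a single collection holding two fields.
    conn.execute('INSERT INTO "schema" (schema_id, schema_name) VALUES (1, ?)',
                 ("customers",))
    conn.execute("INSERT INTO collection (collection_id, schema_id, collection_name) "
                 "VALUES (1, 1, ?)", ("customer_csv",))
    conn.executemany("INSERT INTO field (collection_id, field_name) VALUES (1, ?)",
                     [("cust_id",), ("state",)])
    # Two deployments of that same archetype, each described separately.
    conn.executemany("INSERT INTO instance (schema_id, instance_name) VALUES (1, ?)",
                     [("prod",), ("uat",)])
    return conn
```

Calling build_dictionary() returns a connection in which the single "customers" definition
is shared by both the prod and uat instance rows.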

=================================================================================================
Analysis Tables
=================================================================================================

- Schema inspection profiles - 1 table that describes inspection preferences.  This is used by
  tools that collect the extended metadata to identify settings such as which fields to inspect,
  which fields to sample vs analyze in detail, etc.
  - analysis_profile   - where analysis preferences are stored.  This has optional 
                         relationships to a number of tables to allow overrides.

All of this metadata is generated by data_gristle tools (such as gristle_determinator.py) 
and used by other data_gristle tools.   It provides actual data characteristics (value 
distributions, etc) for the basic metadata.

- analysis results - 4 tables that contain the results of inspections.   These include:
  - analysis            - One row for every inspection.  Primary key is inspect_id, also stores
                          timestamps, program involved, profile used, and return codes.
  - collection_analysis - Every inspection will be of a schema & structure.  This table contains
                          any schema-level information that results from the inspection. 
                          There might not be much since most data is about the structures.
  - field_analysis      - Almost every inspection will involve file|table or field|column
                          level results.  These will all involve entries in this table.  Examples
                          include mean, median, type, case, etc.
  - field_analysis_value- This table stores frequency distribution results.
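
As a rough illustration of what one inspection might contribute to the field-level tables,
the sketch below derives a summary row and a set of frequency-distribution rows from a list
of values.  The function names and the row layouts are assumptions; only inspect_id and the
kinds of statistic (mean, median, frequency distribution) come from the descriptions above.

```python
from collections import Counter
from statistics import mean, median


def field_analysis_row(inspect_id, field_name, values):
    """Summarise one numeric field for one inspection (hypothetical layout)."""
    return {
        "inspect_id": inspect_id,
        "field_name": field_name,
        "mean": mean(values),
        "median": median(values),
    }


def field_analysis_value_rows(inspect_id, field_name, values):
    """Return one (inspect_id, field_name, value, count) tuple per distinct value.

    Rows are ordered from most to least frequent value.
    """
    freq = Counter(values)
    return [(inspect_id, field_name, value, count)
            for value, count in freq.most_common()]
```

For example, field_analysis_value_rows(1, "state", ["CO", "UT", "CO"]) yields a row for CO
with a count of 2 followed by a row for UT with a count of 1.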
		


=================================================================================================
Examples:
=================================================================================================

1.   CSV Files:  User has 10 different CSV files to analyze, and wants to repeat the analysis of one
     of them weekly as new data files arrive.  This will allow the user to detect changes or new
     data quality problems in the new data.
     -  Schemas:  since no data is shared or referenced between any of these CSV files, each 
            is stored as a separate schema.
     -  Structures:  each of these CSV files is a simple data structure - so just one row per 
            CSV to be stored in the structure table (collection type), along with one row for each 
            field (atomic type).
     -  Instances:  all CSV files will be defined as just a single instance.  The structures won't
            have any rows defined since the profiles will be based on the schemas instead.
     -  Profiles: no detailed profiles will be created - just defaults at the schema-level.
     -  Inspect:  each of the 9 fairly static CSV files will have just a single inspection, but the
            last, most dynamic CSV file will have hundreds.
     -  Schema_inspect:  each of the CSV files will have 1 entry per inspection, which for one of 
            the CSV files means hundreds of rows.
     -  Structure_inspect:  each of the CSV files will have an entry in this table for each inspection,
            for each record and for each field.   The record entries will have attributes like record
            count.  The field entries will have entries like mean value.
     -  Field_value:  each field has an entry in this table for each inspection.  This is where types,
            means, medians, etc are stored.
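
The change detection that motivates the weekly CSV inspection could look something like the
sketch below, which compares the frequency distributions of one field from two successive
inspections.  The function distribution_changes() and the {value: count} input shape are
assumptions made for illustration, not part of the gristle tools.

```python
def distribution_changes(previous, current):
    """Compare two {value: count} frequency distributions from successive inspections.

    Returns the values that appeared and the values that disappeared.
    """
    return {
        "new_values": sorted(set(current) - set(previous)),
        "missing_values": sorted(set(previous) - set(current)),
    }


# Hypothetical distributions for a "state" field in two weekly files.
last_week = {"CO": 10, "UT": 5}
this_week = {"CO": 12, "WY": 1}
changes = distribution_changes(last_week, this_week)
```

Here changes reports WY as a new value and UT as a missing one, which is the kind of signal
that would prompt a closer look at the new file.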

2.  Database:  User has a database with 25 tables to analyze.  This is just a single database instance,
        but the data changes continuously and the analysis will be repeated weekly.
     -  Schemas:  just one row in this table to represent the database schema.
     -  Structures:   each of the 25 tables has a separate row, and each column also has a separate
            row in this table.  The columns reference their tables hierarchically within this
            table.
     -  Instances:  the schema has a single entry in the instance table, but the tables and columns
            defined in the structures table are not provided with specific instance entries since their
            tool profiles will be kept simple (at least for now).
     -  Profiles:  no detailed profiles will be created - just defaults at the schema level.
     -  Inspect:  each of the 25 tables will be analyzed every week.  So, over the course of the year
            there will be 50 x 25 entries produced for a total of 1250 rows.
     -  Schema_inspect: each of the 50 weekly inspections will get a row in this table.
     -  Structure_inspect: each of the 50 weekly inspections will get a row in this table for each
            of the 25 tables as well as for each column.
     -  Field_value:  each of the columns inspected will store its frequency distributions here.
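
The yearly volumes in the database example can be worked through as follows.  The 50 weekly
inspections and 25 tables come from the example; the number of columns per table is not
given, so columns_per_table is a made-up figure used only to show the arithmetic.

```python
def database_example_rows(weeks=50, tables=25, columns_per_table=10):
    """Estimate yearly row counts for the database example.

    columns_per_table is an assumption; the example does not state it.
    """
    total_columns = tables * columns_per_table
    return {
        # One Inspect row per table per weekly run: 50 x 25.
        "inspect": weeks * tables,
        # One Schema_inspect row per weekly run.
        "schema_inspect": weeks,
        # One Structure_inspect row per table and per column, per weekly run.
        "structure_inspect": weeks * (tables + total_columns),
    }


rows = database_example_rows()
```

With these inputs rows["inspect"] is 1250, matching the total given in the example.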


=================================================================================================
Metadata Issues
=================================================================================================
1.  command line set-inserts must include 100% of the columns involved - no leaving any to default
	- this is because of code that removes any columns with None value prior to calling set function
	- that code is there to help avoid unintentional set to None for updates
2.  
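
The None-stripping behaviour described in issue 1 can be illustrated with the sketch below.
It is not the gristle code: strip_none() and apply_update() are hypothetical names, and a
plain dict stands in for a table row.

```python
def strip_none(columns):
    """Drop every column whose value is None before it reaches the set function."""
    return {name: value for name, value in columns.items() if value is not None}


def apply_update(existing_row, columns):
    """Update a row (a dict here) using only the columns that were actually given."""
    updated = dict(existing_row)
    updated.update(strip_none(columns))
    return updated


# An option the user did not pass arrives as None, so the stored value survives.
row = {"schema_name": "old", "schema_desc": "keep me"}
new_row = apply_update(row, {"schema_name": "new", "schema_desc": None})
```

In this sketch new_row keeps its original schema_desc, which is the protection for updates.
The same stripping means that a column left out of an insert never reaches the set function
at all.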
