package Statistics::Data::Dichotomize;
use strict;
use warnings FATAL => 'all';
use base qw(Statistics::Data);
use Carp qw(croak);
use Number::Misc qw(is_numeric);
use Statistics::Lite qw(mean median mode);

$Statistics::Data::Dichotomize::VERSION = '0.04';

=head1 NAME

Statistics::Data::Dichotomize - Dichotomize one or more numerical or categorical sequences into a single two-valued one

=head1 VERSION

This is documentation for B<Version 0.04> of Statistics-Data-Dichotomize.

=head1 SYNOPSIS

 use Statistics::Data::Dichotomize 0.04;
 my $ddat = Statistics::Data::Dichotomize->new();
 my $aref;
 
 $ddat->load(23, 24, 7, 55); # numerical data
 $aref = $ddat->cut(value => 'median',); # - or by precise value or function
 $aref = $ddat->swing(); # by successive rises and falls of value
 $aref = $ddat->shrink(rule => sub { return $_[0]->[0] >= 20 ? 1 : 0 }, winlen => 1); # like "cut" if winlen only 1
 $aref = $ddat->binate(oneis => 7); # returns [0, 0, 1, 0]

 # - alternatively, call any method giving data directly, without prior load():
 $aref = $ddat->cut(data => [23, 24, 7, 55], value => 20);
 $aref = $ddat->pool(data => [$aref1, $aref2]);

 # or by a multi-sequence load: - by named arefs:
 $ddat->load(foodat =>[qw/c b c a a/], bardat => [qw/b b b c a/]); # arbitrary names
 $aref = $ddat->binate(data => 'foodat', oneis => 'c',); # returns [1, 0, 1, 0, 0]

 # - or by anonymous arefs:
 $ddat->load([qw/c b c a a/], [qw/b b b c a/]); # categorical (stringy) data
 $aref = $ddat->match(); # returns [0, 1, 0, 0, 1]
 
=head1 DESCRIPTION

A module for binary transformation of one or more sequences of numerical or categorical data (array of numbers or strings). That is, given an array, the methods return a binary, binomial, dichotomous, two-valued sequence. Each method returns the dichotomized sequence as a reference to an array of 0s and 1s.

There are methods to do this for: (1) I<a single numerical sequence>, either (a) dichotomized ("L<cut|Statistics::Data::Dichotomize/cut>") about a specified or function-returned value, or a central statistic (mean, median or mode), or (b) dichotomized according to successive rises and falls in value ("L<swing|Statistics::Data::Dichotomize/swing>"); (2) I<two numerical sequences>, collapsed ("L<pool|Statistics::Data::Dichotomize/pool>ed") into a single dichotomous sequence according to the rank order of their values; (3) I<a single categorical sequence> where one value is set to equal 1 and all others equal 0 ("L<binate|Statistics::Data::Dichotomize/binate>"); (4) I<two categorical sequences>, collapsed into a single dichotomous sequence according to their pairwise "L<match|Statistics::Data::Dichotomize/match>"; and (5) a I<single numerical or categorical sequence> dichotomized according to whether or not independent slices of the data meet a specified rule ("L<shrink, boolwin|Statistics::Data::Dichotomize/shrink, boolwin>").

All arguments are given as a list of key => value pairs, which (not shown in the examples) can also be given as a single hash reference.
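
As a rough sketch of how both calling styles can be accepted, the following stand-alone snippet normalizes its arguments to a hash reference. The C<_args> helper is a made-up name for this illustration only; the methods below apply the same idiom inline.

 use strict;
 use warnings;

 # Hypothetical helper (not part of this module): accept either a
 # list of key => value pairs or a single hash reference, and always
 # hand back a hash reference.
 sub _args {
     my @args = @_;
     return ref $args[0] ? $args[0] : {@args};
 }

 my $from_list = _args( value => 'median', equal => 'gt' );
 my $from_href = _args( { value => 'median', equal => 'gt' } );
 print "$from_list->{'value'} $from_href->{'value'}\n"; # median median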

=head1 SUBROUTINES/METHODS

=head2 new

Creates a class object directly from this module, inheriting all the L<Statistics::Data|Statistics::Data> methods.

=head2 load, add, access, unload

Methods for loading, updating and retrieving data are inherited from L<Statistics::Data|Statistics::Data>. See that manpage for details.

=cut

=head2 Numerical data: Single sequence dichotomization

=head3 cut

 ($aref, $val) = $ddat->cut(data => \@data, value => \&Statistics::Lite::median); # cut the given data at its median, getting back the median too
 $aref = $ddat->cut(value => 'median', equal => 'gt'); # cut the last previously loaded data at its median
 $aref = $ddat->cut(value => 23); # cut anonymously cached data at a specific value
 $aref = $ddat->cut(value => 'mean', data => 'blues'); # cut named data (previously loaded as such) at its mean

Returns a reference to an array of dichotomously transformed values of a given array of numbers, categorizing each value by whether it is numerically higher or lower than a particular cut-value: the median, mean or mode of the data, some given number, or the value returned by some function that is given the array (unreferenced). Called in list context, returns a reference to the transformed values, and then the cut-value itself.

So the following data, when cut over values greater than or equal to 5, yield the dichotomous (Boolean) sequence:

 @orig_data  = (4, 3, 3, 5, 3, 4, 5, 6, 3, 5, 3, 3, 6, 4, 4, 7, 6, 4, 7, 3);
 @cut_data = (0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0);

The order of the original values is reflected in the returned "cut data", but their order is not taken into account in making up the dichotomy - in contrast to the L<swing|Statistics::Data::Dichotomize/swing> method.
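
The example above can be reproduced in plain Perl, independently of this module, along the following lines. This is only a sketch: the C<cut_at> name is made up for the illustration, and it assumes that values equal to the cut-value are coded 1.

 use strict;
 use warnings;

 # Illustrative only (not this module's code): code each value as
 # 1 or 0 about a cut-value, counting values equal to it as 1.
 sub cut_at {
     my ( $val, @data ) = @_;
     return [ map { $_ >= $val ? 1 : 0 } @data ];
 }

 my @orig_data =
   ( 4, 3, 3, 5, 3, 4, 5, 6, 3, 5, 3, 3, 6, 4, 4, 7, 6, 4, 7, 3 );
 my $cut_data = cut_at( 5, @orig_data );
 print join( q{ }, @{$cut_data} ), "\n";
 # 0 0 0 1 0 0 1 1 0 1 0 0 1 0 0 1 1 0 1 0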

Optional arguments, as follows, specify what value or measure to cut by (default is the median), and how to handle ties with the cut-value (default is to code them together with the higher values).

=over 4

=item value => 'mean|median|mode' - or a specific numerical value, or code reference

Specifies the value at which the data will be cut. This could be the mean, median or mode (as calculated by L<Statistics::Lite|Statistics::Lite>), or a numerical value within the range of the data, or some appropriate subroutine - one that takes an array (not a reference to one) and returns a single value (presumably a descriptive statistic of the values in the array). The default is the I<median>. The cut-value, as specified by B<value>, is returned as the second element when the method is called in list context.

=item equal => 'I<gt>|I<lt>'

Specifies how to code data-values that are equal to the cut-value (as specified by B<value>). The default is B<equal =E<gt> 'I<gt>'>: all data-values I<greater than or equal to> the cut-value are coded 1, and all data-values less than the cut-value are coded 0. Alternatively, to code all values I<less than or equal to> the cut-value as 0, and all higher values as 1, use B<equal =E<gt> 'I<lt>'>. Any other setting (including 0) is treated like 'I<gt>'; observations equal to the cut-value are never skipped.

=back

=cut

sub cut {
    my ( $self, @args ) = @_;
    my $args = ref $args[0]        ? $args[0]        : {@args};
    my $dat  = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
    croak __PACKAGE__,
      '::cut All data must be numeric for dichotomizing about a cut-value'
      if !$self->all_numeric($dat);
    $args->{'value'} = 'median' if !defined $args->{'value'};
    $args->{'equal'} = 'gt'     if !defined $args->{'equal'};
    my ( $val, @seqs ) = ();

    # Get a cut-value:
    if ( !is_numeric( $args->{'value'} ) ) {
        my $code = \&{ delete $args->{'value'} };
        $val = $code->( @{$dat} );
    }
    else {
        $val = $args->{'value'};
    }

    # Categorize by number of observations above and below the cut_value:
    push @seqs,
        $_ > $val                ? 1
      : $_ < $val                ? 0
      : $args->{'equal'} eq 'gt' ? 1
      : $args->{'equal'} eq 'lt' ? 0
      : 1
      foreach @{$dat};
    return wantarray ? ( \@seqs, $val ) : \@seqs;
}

=head3 swing

 $aref = $ddat->swing(data => [3, 4, 7, 6, 5, 1, 2, 3, 2]); # "swing" these data
 $aref = $ddat->swing(label => 'reds'); # name a pre-loaded dataset for "swinging"
 $aref = $ddat->swing(); # use the last-loaded dataset

Returns a reference to an array of dichotomously transformed values of a single sequence of numerical values according to their consecutive rises and falls. Each value is subtracted from its successor, and the result is replaced with a 1 if the difference represents an increase, or 0 if it represents a decrease. For example (from Wolfowitz, 1943, p. 283), the following numerical sequence produces the subsequent dichotomous sequence.

 @values = (qw/3 4 7 6 5 1 2 3 2/);
 @dichot =   (qw/1 1 0 0 0 1 1 0/);

Dichotomously, the data commence with an ascending run of length 2 (from 3 to 4, and from 4 to 7), followed by a descending run of length 3 (from 7 to 6, 6 to 5, and 5 to 1), followed by an ascent of length 2 (from 1 to 2, from 2 to 3), and so on. The number of resulting dichotomous observations is 1 less than the original sample-size (elements in the given array).

=over 4

=item equal => 'I<gt>|I<lt>|I<rpt>|I<0>'

The default result when the difference between two successive values is zero is to skip the observation, and move on to the next succession (B<equal =E<gt> 0>). Alternatively, you may wish to repeat the result for the previous succession, skipping a difference of zero only if it occurs as the first result (B<equal =E<gt> 'rpt'>). Or, a difference greater than or equal to zero is counted as an increase (B<equal =E<gt> 'gt'>), or a difference less than or equal to zero is counted as a decrease (B<equal =E<gt> 'lt'>). For example,

 @values =    (qw/3 3 7 6 5 2 2/);
 @dicho_def = (qw/1 0 0 0/); # First and final results (of 3 - 3, and 2 - 2) are skipped
 @dicho_rpt = (qw/1 0 0 0 0/); # First result (of 3 - 3) is skipped, and final result repeats the former
 @dicho_gt =  (qw/1 1 0 0 0 1/); # Greater than or equal to zero is an increase
 @dicho_lt =  (qw/0 1 0 0 0 0/); # Less than or equal to zero is a decrease

=back
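
The handling of rises, falls and zero differences described above can be sketched as a stand-alone routine. The C<rises_and_falls> name and its positional arguments are made up for this illustration; it is meant to mirror the documented behaviour, not to replace the method.

 use strict;
 use warnings;

 # Illustrative only: 1 for a rise and 0 for a fall between
 # successive values; zero differences are handled per $equal.
 sub rises_and_falls {
     my ( $data, $equal ) = @_;
     $equal = 0 if !defined $equal;
     my @seq;
     for my $i ( 1 .. $#{$data} ) {
         my $diff = $data->[$i] - $data->[ $i - 1 ];
         if    ( $diff > 0 )       { push @seq, 1; }
         elsif ( $diff < 0 )       { push @seq, 0; }
         elsif ( $equal eq 'gt' )  { push @seq, 1; }
         elsif ( $equal eq 'lt' )  { push @seq, 0; }
         elsif ( $equal eq 'rpt' ) { push @seq, $seq[-1] if @seq; }
         # any other setting: the zero difference is skipped
     }
     return \@seq;
 }

 my @values = qw/3 3 7 6 5 2 2/;
 for my $equal ( 0, 'rpt', 'gt', 'lt' ) {
     my $seq = rises_and_falls( \@values, $equal );
     print "$equal: ", join( q{ }, @{$seq} ), "\n";
 }
 # 0: 1 0 0 0
 # rpt: 1 0 0 0 0
 # gt: 1 1 0 0 0 1
 # lt: 0 1 0 0 0 0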

=cut

sub swing {
    my ( $self, @args ) = @_;
    my $args = ref $args[0]        ? $args[0]        : {@args};
    my $dat  = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
    croak __PACKAGE__, '::swing All data must be numeric for dichotomizing'
      if !$self->all_numeric($dat);
    $args->{'equal'} = 0 if !defined $args->{'equal'};    # default: skip zero differences
    my ( $i, $res, @seqs ) = ();

    # Replace observations with the succession of rises and falls:
    for ( $i = 0 ; $i < ( scalar @{$dat} - 1 ) ; $i++ ) {
        $res = $dat->[ ( $i + 1 ) ] - $dat->[$i];
        if ( $res > 0 ) {
            push @seqs, 1;
        }
        elsif ( $res < 0 ) {
            push @seqs, 0;
        }
        else {
            for ( $args->{'equal'} ) {
                if (/^rpt/xsm) {
                    push @seqs, $seqs[-1] if scalar @seqs;
                }
                elsif (/^gt/xsm) {
                    push @seqs, 1;
                }
                elsif (/^lt/xsm) {
                    push @seqs, 0;
                }
                else {
                    next;
                }
            }
        }
    }
    return \@seqs;
}

=head2 Numerical data: Two sequence dichotomization

See also the methods for categorical data where it is ok to ignore any order and intervals in numerical data.

=head3 pool

 $aref = $ddat->pool(data => [$aref1, $aref2]); # give data directly to function
 $aref = $ddat->pool(data => [$ddat->access(index => 0), $ddat->access(index => 1)]); # after $ddat->load($aref1, $aref2);
 $aref = $ddat->pool(data => [$ddat->access(label => '1'), $ddat->access(label => '2')]); # after $ddat->load(1 => $aref1, 2 => $aref2);

Returns a reference to an array of dichotomously transformed values of two sequences of I<numerical> data as a ranked pool, i.e., by pooling the data from each sequence according to the magnitude of their values at each trial, from lowest to highest. Specifically, the values from both sequences are pooled and ordered from lowest to highest, and then dichotomized into runs according to the sequence from which neighbouring values come. Another run occurs wherever there is a change in the source of the values. A non-random effect of, say, higher or lower values consistently coming from one sequence rather than another would be reflected in fewer runs than expected by chance.

This is typically used for a Wald-Wolfowitz test of difference between two samples - ranking by median.
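
The pooling can be pictured with the following stand-alone sketch. The C<pool_sources> name and the two small samples are invented for the illustration; tied values are ordered with the second sample first, which is an assumption chosen to agree with the merge in the method's code.

 use strict;
 use warnings;

 # Illustrative only: pool two samples, order them by value, and
 # replace each value with the index (0 or 1) of its source sample.
 sub pool_sources {
     my ( $dat1, $dat2 ) = @_;
     my @tagged = (
         ( map { [ $_, 0 ] } @{$dat1} ),
         ( map { [ $_, 1 ] } @{$dat2} ),
     );
     my @sorted =
       sort { $a->[0] <=> $b->[0] || $b->[1] <=> $a->[1] } @tagged;
     return [ map { $_->[1] } @sorted ];
 }

 # Made-up samples: pooled order is 1 2 3 4 6 7 8
 my $seq = pool_sources( [ 1, 4, 6 ], [ 2, 3, 7, 8 ] );
 print join( q{ }, @{$seq} ), "\n"; # 0 1 1 0 0 1 1

Here the pooled sequence has 4 runs (0, 11, 00, 11).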

=cut

sub pool {
    my ( $self, @args ) = @_;
    my $args = ref $args[0]        ? $args[0]        : {@args};
    my $dat  = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
    croak __PACKAGE__, '::pool All data must be numeric for dichotomizing'
      if grep { !$self->all_numeric($_) } @{$dat};
    my ( $dat1, $dat2 ) = @{$dat};
    my $sum = scalar @{$dat1} + scalar @{$dat2};
    my @dat =
      ( [ sort { $a <=> $b } @{$dat1} ], [ sort { $a <=> $b } @{$dat2} ] );

    my ( $i, $x, $y, @seqs ) = (0);
    while ( scalar(@seqs) < $sum ) {
        $x = $dat[0]->[0];
        $y = $dat[1]->[0];
        $i = defined $x && defined $y ? $x < $y ? 0 : 1 : defined $x ? 0 : 1;
        shift @{ $dat[$i] };
        push @seqs, $i;
    }
    return \@seqs;
}
## DEV: consider: List::AllUtils::pairwise:
# @x = pairwise { $a + $b } @a, @b;   # returns index-by-index sums

=head2 Categorical data: Single sequence dichotomization

=head3 binate

 $aref = $ddat->binate(oneis => 'E'); # optionally specify a state in the sequence to be set as "1"
 $aref = $ddat->binate(data => \@ari, oneis => 'E'); # or give the data directly

Returns a reference to an array of dichotomously transformed values of an array by setting all occurrences of one of its values to 1, and all other values to 0. The value set to 1 is whatever is specified as B<oneis> or, by default, the first element in the array.
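
A minimal stand-alone sketch of this coding follows; the C<one_is> name is made up for the illustration, and the data are the "foodat" values from the synopsis.

 use strict;
 use warnings;

 # Illustrative only: code every occurrence of one state as 1 and
 # everything else as 0; the first element is the default state.
 sub one_is {
     my ( $data, $oneis ) = @_;
     $oneis = $data->[0] if !defined $oneis;
     return [ map { $_ eq $oneis ? 1 : 0 } @{$data} ];
 }

 my @foodat = qw/c b c a a/;
 print join( q{ }, @{ one_is( \@foodat ) } ), "\n";      # 1 0 1 0 0
 print join( q{ }, @{ one_is( \@foodat, 'a' ) } ), "\n"; # 0 0 0 1 1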

=cut

sub binate {
    my ( $self, @args ) = @_;
    my $args = ref $args[0]        ? $args[0]        : {@args};
    my $dat  = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
    my $oneis =
      defined $args->{'oneis'}
      ? delete $args->{'oneis'}
      : $dat->[0];    # What value set to 1 and others to zero?
    my $dats = [ map { $_ eq $oneis ? 1 : 0 } @{$dat} ]
      ;               # replace observations with 1s and 0s
    return $dats;
}

=head2 Categorical data: Two-sequence dichotomization

=head3 match

 $aref = $ddat->match(data => [\@aref1, \@aref2], lag => signed integer, loop => 0|1); # with optional crosslag of the two sequences
 $aref = $ddat->match(data => [$ddat->access(index => 0), $ddat->access(index => 1)]); # after $ddat->load(\@aref1, \@aref2);
 $aref = $ddat->match(data => [$ddat->access(label => '1'), $ddat->access(label => '2')]); # after $ddat->load(1 => \@aref1, 2 => \@aref2);

Returns a reference to an array of dichotomously transformed values of two paired arrays according to the match between the elements at each of their indices. Where the data-values are equal at a certain index, they are represented with a 1; otherwise a 0. Numerical or stringy data can be equated. For example, the following two arrays would be reduced to the third, where a 1 indicates a match (i.e., the values are "indexically equal").

 @foo_dat = (qw/1 3 3 2 1 5 1 2 4/);
 @bar_dat = (qw/4 3 1 2 1 4 2 2 4/);
 @bin_dat = (qw/0 1 0 1 1 0 0 1 1/);

The following options may be specified.

=over 4

=item lag => I<integer> (where the absolute value of I<integer> is less than the number of observations)

Match the two data-sets by shifting the first named set ahead or behind the other data-set by B<lag> observations. The default is zero. For example, one data-set might be targets, and another responses to the targets:

 targets   = cbbbdacdbd
 responses = daadbadcce

Matched as a single sequence of hits (1) and misses (0) where B<lag> = B<0> yields (for the match on "a" in the 6th index of both arrays):

 0000010000

With B<lag> => 1, however, each response is associated with the target one ahead of the trial for which it was observed; i.e., each target is shifted to its +1 index. So the first element in the above responses (I<d>) would be associated with the second element of the targets (I<b>), and so on. Now, matching the two data-sets with a B<+1> lag gives two hits, of the 4th and 7th elements of the responses to the 5th and 8th elements of the targets, respectively:

 000100100

making 5 runs. With B<lag> => 0, there are 3 runs. Lag values can be negative, so that B<lag> => -2 will give:

 00101010

Here, responses necessarily start at the third element (I<a>), the first hit occurring when the fifth response-element corresponds to the third target element (I<b>). The first two responses (I<d>, I<a>) and the last two targets (I<b>, I<d>) could not be used, so the hit/miss sequence has abs(B<lag>) fewer elements than the original sequences. This means that the absolute value of B<lag> can be at most one less than the size of the data-sets, or there would be no data to match.

=item loop => 0|1

Implements circularized lagging if B<loop> => 1, where all lagged data are preserved by looping any excess to the start or end of the criterion data. The number of observations will then always be the same, regardless of the lag; i.e., the size of the returned array is the same as that of the given data. For example, matching the data in the example above with a lag of +1, with looping, creates an additional match between the first response and the final target (both I<d>); i.e., the last element in the "target" array is looped around to be matched to the first element of the "response" array:

 1000100100

=back
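
The unlooped lagging in the target/response example can be checked with a stand-alone sketch like the one below. The C<lagged_hits> name and its positional arguments are made up for the illustration; it pairs each response with the target B<lag> indices away and drops pairs that fall off either end.

 use strict;
 use warnings;

 # Illustrative only: pair target[i + lag] with response[i], keep
 # only pairs where both exist (no looping), and code matches as 1.
 sub lagged_hits {
     my ( $targets, $responses, $lag ) = @_;
     my @hits;
     for my $i ( 0 .. $#{$responses} ) {
         my $j = $i + $lag;
         next if $j < 0 || $j > $#{$targets};
         push @hits, $targets->[$j] eq $responses->[$i] ? 1 : 0;
     }
     return \@hits;
 }

 my @targets   = split //, 'cbbbdacdbd';
 my @responses = split //, 'daadbadcce';
 for my $lag ( 0, 1, -2 ) {
     my $hits = lagged_hits( \@targets, \@responses, $lag );
     print "$lag: ", join( q{}, @{$hits} ), "\n";
 }
 # 0: 0000010000
 # 1: 000100100
 # -2: 00101010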

=cut

sub match {
    my ( $self, @args ) = @_;
    my $args = ref $args[0]        ? $args[0]        : {@args};
    my $dat  = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
    $dat = $self->crosslag(
        lag  => $args->{'lag'},
        data => [ $dat->[0], $dat->[1] ],
        loop => $args->{'loop'}
    ) if $args->{'lag'};
    my $lim =
        scalar @{ $dat->[0] } <= scalar @{ $dat->[1] }
      ? scalar @{ $dat->[0] }
      : scalar @{ $dat->[1] };    # ensure criterion data-set is smallest
    my (@seqs) = ();
    for my $i ( 0 .. $lim - 1 ) {
        next if !defined $dat->[0]->[$i] || !defined $dat->[1]->[$i];
        $seqs[$i] = $dat->[0]->[$i] eq $dat->[1]->[$i] ? 1 : 0;
    }
    return \@seqs;
}

=head2 Numerical or categorical data: Single sequence dichotomization

=head3 shrink, boolwin

 $aref = $ddat->shrink(winlen => INT, rule => CODE)

Returns a reference to an array of dichotomously transformed values of a numerical or categorical sequence by taking non-overlapping slices, or windows, as given in the argument B<winlen>, and making a true/false sequence out of them according to whether or not each slice passes a B<rule>. The B<rule> is a code reference that gets the data as a reference to an array (its first argument), and so might be something like this:

 sub { return Statistics::Lite::mean(@{$_[0]}) > 2 ? 1 : 0; }

If B<winlen> is set to 3, this rule, testing the mean of each window, would make the following numerical sequence of 9 elements shrink into the following dichotomous (Boolean) sequence of 3 elements:

 @data =  (1, 2, 3, 3, 3, 3, 4, 2, 1);
 @means = (2,       3,       2.33   );
 @dico =  (0,       1,       1      );

The B<rule> method must, of course, return dichotomous values to dichotomize the data, and B<winlen> should make up equally sized segments (no error is thrown if this isn't the case, the remainder just gets figured in the same way).
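
The windowing can be sketched in stand-alone Perl as follows. The C<windows_by_rule> name is made up for the illustration and, unlike the method itself, this sketch is assumed to clip a final window that would run past the end of the data.

 use strict;
 use warnings;

 # Illustrative only: apply a rule to successive non-overlapping
 # windows of $winlen elements; the rule gets an array reference.
 sub windows_by_rule {
     my ( $data, $winlen, $rule ) = @_;
     $winlen = 1 if !$winlen || $winlen < 1;
     my @seq;
     for ( my $i = 0 ; $i < @{$data} ; $i += $winlen ) {
         my $end = $i + $winlen - 1;
         $end = $#{$data} if $end > $#{$data};    # clip last window
         push @seq, $rule->( [ @{$data}[ $i .. $end ] ] );
     }
     return \@seq;
 }

 # Rule: is the mean of the window greater than 2?
 my $rule = sub {
     my @win = @{ $_[0] };
     my $sum = 0;
     $sum += $_ for @win;
     return $sum / @win > 2 ? 1 : 0;
 };
 my $dico = windows_by_rule( [ 1, 2, 3, 3, 3, 3, 4, 2, 1 ], 3, $rule );
 print join( q{ }, @{$dico} ), "\n"; # 0 1 1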

=cut

sub shrink {
    my ( $self, @args ) = @_;
    my $args = ref $args[0]        ? $args[0]        : {@args};
    my $dat  = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
    my $lim  = scalar @{$dat};
    my $len  = int( $args->{'winlen'} || 1 );    # avoid a fatal warning if no winlen is given
    $len ||= 1;
    my $code = delete $args->{'rule'};
    croak __PACKAGE__, '::shrink Need a code to Boolean shrink'
      if not $code
      or ref $code ne 'CODE';
    my ( $i, @seqs );

    for ( $i = 0 ; $i < $lim ; $i += $len )
    {    # C-style for clear greater-than 1 increments per loop
        push @seqs, $code->( [ @{$dat}[ $i .. ( $i + $len - 1 ) ] ] );
    }
    return \@seqs;
}
*boolwin = \&shrink;

=head2 Utilities

=head3 crosslag

 @lagged_arefs = $ddat->crosslag(data => [\@ari1, \@ari2], lag => signed integer, loop => 0|1);
 $aref_of_arefs = $ddat->crosslag(data => [\@ari1, \@ari2], lag => signed integer, loop => 0|1); # same but not "wanting array" 

Takes two arrays and returns them cross-lagged against each other, shifting and popping values according to the number of "lags". Typically used when wanting to L<match|Statistics::Data::Dichotomize/match> the two arrays against each other.

=over 4

=item lag => signed integer up to the number of elements

Takes the first array sent as "data" as the reference or "target" array for the second "response" array to be shifted so many lags before or behind it. With no looping of the lags, this means the returned arrays are "lag"-elements smaller than the original arrays. For example, with lag => +1 (and loop => 0, the default), and with data => [ [qw/c p w p s/], [qw/p s s w r/] ],

 (c p w p s) becomes (p w p s)
 (p s s w r) becomes (p s s w)

So, whereas the original data gave no matches across the two arrays, now, with the second of the two arrays shifted forward by one index, it has a match (of "p") at the first index with the first of the two arrays.

=item loop => 0|1

For circularized lagging, B<loop> => 1, and the size of the returned arrays is the same as that of the given data. For example, with a lag of +1, the last element of the "target" array is moved round to its start, where it is matched to the first element of the "response" array:

 (c p w p s) becomes (s c p w p) (looped with +1)
 (p s s w r) becomes (p s s w r) (no effect)

In this case, it might be more efficient to simply autolag the "target" sequence against itself.

=back

=cut

sub crosslag {
    my ( $self, @args ) = @_;
    my $args = ref $args[0] ? $args[0] : {@args};
    my $lag  = $args->{'lag'};
    my $dat1 = $args->{'data'}->[0];
    my $dat2 = $args->{'data'}->[1];
    my $loop = $args->{'loop'};
    return ( wantarray ? ( $dat1, $dat2 ) : [ $dat1, $dat2 ] )
      if not $lag
      or abs $lag >= scalar @{$dat1};

    my @dat1_lagged = @{$dat1};
    my @dat2_lagged = @{$dat2};

    if ( $lag > 0 ) {
        foreach ( 1 .. abs $lag ) {
            if ($loop) {
                unshift @dat1_lagged, pop @dat1_lagged;
            }
            else {
                shift @dat1_lagged;
                pop @dat2_lagged;
            }
        }
    }
    elsif ( $lag < 0 ) {
        foreach ( 1 .. abs $lag ) {
            if ($loop) {
                push @dat1_lagged, shift @dat1_lagged;
            }
            else {
                pop @dat1_lagged;
                shift @dat2_lagged;
            }
        }
    }
    return wantarray
      ? ( \@dat1_lagged, \@dat2_lagged )
      : [ \@dat1_lagged, \@dat2_lagged ];
}

=head1 AUTHOR

Roderick Garton, C<< <rgarton at cpan.org> >>

=head1 REFERENCES

Burdick, D. S., & Kelly, E. F. (1977). Statistical methods in parapsychological research. In B. B. Wolman (Ed.), I<Handbook of parapsychology> (pp. 81-130). New York, NY, US: Van Nostrand Reinhold. [Describes window-boolean reduction.]

Swed, F., & Eisenhart, C. (1943). Tables for testing randomness of grouping in a sequence of alternatives. I<Annals of Mathematical Statistics>, I<14>, 66-87. doi: L<10.1214/aoms/1177731494|http://dx.doi.org/10.1214/aoms/1177731494> [Describes pool method and test example.]

Wolfowitz, J. (1943). On the theory of runs with some applications to quality control. I<Annals of Mathematical Statistics>, I<14>, 280-288. doi: L<10.1214/aoms/1177731421|http://dx.doi.org/10.1214/aoms/1177731421> [Describes swings "runs up and down" and test example.]

=head1 BUGS

Please report any bugs or feature requests to C<bug-Statistics-Data-Dichotomize-0.04 at rt.cpan.org>, or through
the web interface at L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Statistics-Data-Dichotomize-0.04>.  I will be notified, and then you'll
automatically be notified of progress on your bug as I make changes.

=head1 SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Statistics::Data::Dichotomize

You can also look for information at:

=over 4

=item * RT: CPAN's request tracker (report bugs here)

L<http://rt.cpan.org/NoAuth/Bugs.html?Dist=Statistics-Data-Dichotomize-0.04>

=item * AnnoCPAN: Annotated CPAN documentation

L<http://annocpan.org/dist/Statistics-Data-Dichotomize-0.04>

=item * CPAN Ratings

L<http://cpanratings.perl.org/d/Statistics-Data-Dichotomize-0.04>

=item * Search CPAN

L<http://search.cpan.org/dist/Statistics-Data-Dichotomize-0.04/>

=back

=head1 LICENSE AND COPYRIGHT

Copyright 2012-2016 Roderick Garton.

This program is free software; you can redistribute it and/or modify it
under the terms of either: the GNU General Public License as published
by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.

=cut

1;    # End of Statistics::Data::Dichotomize
