A one-dimensional collection of values. It works like a column in a spreadsheet.
The quickest way to create a vector is Array#to_vector or Array#to_scale.
v=[1,2,3,4].to_vector(:scale)
v=[1,2,3,4].to_scale
Array of values treated as "today" in vectors of date type. By default, "NOW", "TODAY", :NOW and :TODAY are 'today' values.
Create a vector using (almost) any object
# File lib/statsample/vector.rb, line 104
def self.[](*args)
  values=[]
  args.each do |a|
    case a
    when Array
      values.concat a.flatten
    when Statsample::Vector
      values.concat a.to_a
    when Range
      values.concat a.to_a
    else
      values << a
    end
  end
  vector=new(values)
  vector.type=:scale if vector.can_be_scale?
  vector
end
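The flattening logic above can be tried without the library. The following is a minimal plain-Ruby sketch; `collect_values` is a hypothetical helper written for illustration, not part of Statsample, and it skips the Statsample::Vector branch because it works on plain arrays only.

```ruby
# Hypothetical helper mirroring the argument handling of Vector.[]
# for Arrays, Ranges and single objects.
def collect_values(*args)
  values = []
  args.each do |a|
    case a
    when Array then values.concat(a.flatten) # nested arrays are flattened
    when Range then values.concat(a.to_a)    # ranges are expanded
    else values << a                         # anything else is appended as is
    end
  end
  values
end

values = collect_values(1, [2, [3, 4]], 5..7)
# Same idea as can_be_scale?: every element is Numeric or nil.
scale_candidate = values.all? { |v| v.nil? || v.is_a?(Numeric) }
```

Here `values` holds the integers 1 through 7 in order, so the collected data would qualify for the :scale type.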
Creates a new Vector object.
data Any data which can be converted to an Array
type Level of measurement. See Vector#type
opts Hash of options
:missing_values Array of missing values. See Vector#missing_values
:today_values Array of 'today' values. See Vector#today_values
:labels Labels for data values
:name Name of vector
# File lib/statsample/vector.rb, line 71
def initialize(data=[], type=:nominal, opts=Hash.new)
  @data=data.is_a?(Array) ? data : data.to_a
  @type=type
  opts_default={
    :missing_values=>[],
    :today_values=>['NOW','TODAY', :NOW, :TODAY],
    :labels=>{},
    :name=>nil
  }
  @opts=opts_default.merge(opts)
  if @opts[:name].nil?
    @@n_table||=0
    @@n_table+=1
    @opts[:name]="Vector #{@@n_table}"
  end
  @missing_values=@opts[:missing_values]
  @labels=@opts[:labels]
  @today_values=@opts[:today_values]
  @name=@opts[:name]
  @valid_data=[]
  @data_with_nils=[]
  @date_data_with_nils=[]
  @missing_data=[]
  @has_missing_data=nil
  @scale_data=nil
  set_valid_data
  self.type=type
end
Creates a new scale-type vector. Parameters:
n Size of the vector
val Value assigned to each element
If a block is provided, it is used to set the values of the vector.
# File lib/statsample/vector.rb, line 127
def self.new_scale(n,val=nil, &block)
  if block
    vector=n.times.map {|i| block.call(i)}.to_scale
  else
    vector=n.times.map { val}.to_scale
  end
  vector.type=:scale
  vector
end
# File lib/statsample/vector.rb, line 424
def *(v)
  _vector_ari("*",v)
end
Vector sum.
If v is a scalar, this value is added to every element.
If v is an Array or a Vector, it should be the same size as this vector. Every item of this vector is added to the item at the same position in the other vector.
# File lib/statsample/vector.rb, line 410
def +(v)
  _vector_ari("+",v)
end
Vector subtraction.
If v is a scalar, this value is subtracted from every element.
If v is an Array or a Vector, it should be the same size as this vector. The item at the same position in the other vector is subtracted from every item of this vector.
# File lib/statsample/vector.rb, line 420
def -(v)
  _vector_ari("-",v)
end
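The scalar and element-wise behaviour described for these operators can be illustrated on plain arrays. This is only a sketch: `elementwise` is a hypothetical helper, and it does not reproduce whatever `_vector_ari` does with nils or missing values, since that method is not shown here.

```ruby
# Hypothetical helper: applies op (:+, :- or :*) either with a scalar
# or position by position with an array of the same size.
def elementwise(data, op, v)
  if v.is_a?(Array)
    raise ArgumentError, "sizes differ" unless v.size == data.size
    data.each_with_index.map { |x, i| x.public_send(op, v[i]) }
  else
    data.map { |x| x.public_send(op, v) }
  end
end

shifted = elementwise([1, 2, 3], :+, 10)           # scalar added to each item
diff    = elementwise([1, 2, 3], :-, [1, 1, 1])    # item minus item
scaled  = elementwise([1, 2, 3], :*, [2, 2, 2])    # item times item
```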
Vector equality. Two vectors are equal if their data, missing values, type and labels are equal.
# File lib/statsample/vector.rb, line 221
def ==(v2)
  raise TypeError,"Argument should be a Vector" unless v2.instance_of? Statsample::Vector
  @data==v2.data and @missing_values==v2.missing_values and @type==v2.type and @labels==v2.labels
end
Retrieves the i-th element of data
# File lib/statsample/vector.rb, line 367
def [](i)
  @data[i]
end
Sets the i-th element of data. Note: call set_valid_data afterwards if you include missing values
# File lib/statsample/vector.rb, line 372
def []=(i,v)
  @data[i]=v
end
Adds a value at the end of the vector. If the second argument is set to false, you should update the Vector using Vector#set_valid_data at the end of your insertion cycle
# File lib/statsample/vector.rb, line 287
def add(v,update_valid=true)
  @data.push(v)
  set_valid_data if update_valid
end
Population average deviation (denominator N)
# File lib/statsample/vector.rb, line 933
def average_deviation_population( m = nil )
  check_type :scale
  m ||= mean
  ( @scale_data.inject( 0 ) { |a, x| ( x - m ).abs + a } ).quo( n_valid )
end
Generates nr resamples (with replacement) of size s from the vector, computing each estimate from estimators over each resample. estimators can be:
a) A Hash with variable names as keys and lambdas as values
a.bootstrap({:log_s2=>lambda {|v| Math.log(v.variance)}},1000)
b) An Array with the names of the methods to bootstrap
a.bootstrap([:mean, :sd],1000)
c) A single method to bootstrap
a.bootstrap(:mean, 1000)
If s is nil, it defaults to the vector size.
Returns a dataset where each vector has length nr and contains the computed resample estimates.
# File lib/statsample/vector.rb, line 538
def bootstrap(estimators, nr, s=nil)
  s||=n
  h_est, es, bss= prepare_bootstrap(estimators)
  nr.times do |i|
    bs=sample_with_replacement(s)
    es.each do |estimator|
      # Add bootstrap
      bss[estimator].push(h_est[estimator].call(bs))
    end
  end
  es.each do |est|
    bss[est]=bss[est].to_scale
    bss[est].type=:scale
  end
  bss.to_dataset
end
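The resampling loop can be sketched on a plain Array, without Statsample. In this illustration `bootstrap_estimates` and its `rng:` keyword are assumptions introduced here; the estimators are passed as a Hash of lambdas, as in form a) above, and the result is a Hash of arrays rather than a dataset.

```ruby
# Hypothetical helper: draws nr resamples of size s with replacement
# and applies every estimator to each resample.
def bootstrap_estimates(data, estimators, nr, s = nil, rng: Random.new(42))
  s ||= data.size
  results = Hash.new { |h, k| h[k] = [] }
  nr.times do
    resample = Array.new(s) { data[rng.rand(data.size)] }
    estimators.each { |name, fn| results[name] << fn.call(resample) }
  end
  results
end

mean = ->(xs) { xs.sum.to_f / xs.size }
est = bootstrap_estimates([2, 4, 6, 8], { mean: mean }, 200)
```

`est[:mean]` then holds 200 resample means, each of which lies between the smallest and largest observation.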
Return true if all data is Date, “today” values or nil
# File lib/statsample/vector.rb, line 689
def can_be_date?
  if @data.find {|v| !v.nil? and !v.is_a? Date and !v.is_a? Time and (v.is_a? String and !@today_values.include? v) and (v.is_a? String and !(v=~/\d{4,4}[-\/]\d{1,2}[-\/]\d{1,2}/))}
    false
  else
    true
  end
end
Returns true if all data is Numeric, nil or a missing value
# File lib/statsample/vector.rb, line 698
def can_be_scale?
  if @data.find {|v| !v.nil? and !v.is_a? Numeric and !@missing_values.include? v}
    false
  else
    true
  end
end
Raises an exception if the type of the vector is lower than type t
# File lib/statsample/vector.rb, line 150
def check_type(t)
  Statsample::STATSAMPLE__.check_type(self,t)
end
Coefficient of variation. Calculated with the sample standard deviation
# File lib/statsample/vector.rb, line 1001
def coefficient_of_variation
  check_type :scale
  standard_deviation_sample.quo(mean)
end
Retrieves the number of cases that satisfy a condition. If a block is given, returns the number of items for which the block returns true. Otherwise, returns the frequency of the value x.
# File lib/statsample/vector.rb, line 662
def count(x=false)
  if block_given?
    r=@data.inject(0) {|s, i|
      r=yield i
      s+(r ? 1 : 0)
    }
    r.nil? ? 0 : r
  else
    frequencies[x].nil? ? 0 : frequencies[x]
  end
end
Returns the database type for the vector, according to its content
# File lib/statsample/vector.rb, line 676
def db_type(dbs='mysql')
  # first, detect any character not number
  if @data.find {|v| v.to_s=~/\d{2,2}-\d{2,2}-\d{4,4}/} or @data.find {|v| v.to_s=~/\d{4,4}-\d{2,2}-\d{2,2}/}
    return "DATE"
  elsif @data.find {|v| v.to_s=~/[^0-9e.-]/ }
    return "VARCHAR (255)"
  elsif @data.find {|v| v.to_s=~/\./}
    return "DOUBLE"
  else
    return "INTEGER"
  end
end
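To see how the regular expressions classify different inputs, the checks can be run on plain arrays. This is a sketch under assumptions: `guess_db_type` is a hypothetical name, it drops the unused dbs argument, and it uses `any?` where the listing uses `find`.

```ruby
# Hypothetical helper applying the same regular expressions, in the
# same order, as db_type.
def guess_db_type(data)
  strings = data.map(&:to_s)
  if strings.any? { |s| s =~ /\d{2}-\d{2}-\d{4}/ || s =~ /\d{4}-\d{2}-\d{2}/ }
    "DATE"
  elsif strings.any? { |s| s =~ /[^0-9e.-]/ } # any non-numeric character
    "VARCHAR (255)"
  elsif strings.any? { |s| s =~ /\./ }        # a decimal point
    "DOUBLE"
  else
    "INTEGER"
  end
end

types = [[1, 2, 3], [1.5, 2], ["abc", 1], ["2010-01-31"]].map { |d| guess_db_type(d) }
```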
Dichotomizes the vector into 0 and 1, based on the lowest value. If the parameter is defined, values equal to or lower than it become 0, and higher values become 1.
# File lib/statsample/vector.rb, line 257
def dichotomize(low=nil)
  fs=factors
  low||=factors.min
  @data_with_nils.collect{|x|
    if x.nil?
      nil
    elsif x>low
      1
    else
      0
    end
  }.to_scale
end
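A plain-Ruby sketch of the same rule follows. `dichotomize_values` is a hypothetical helper; it treats only nil as missing, whereas the library method works on its own nil-padded data.

```ruby
# Hypothetical helper: values above the cut point become 1, values at
# or below it become 0, and nils stay nil.
def dichotomize_values(data, low = nil)
  low ||= data.compact.min
  data.map { |x| x.nil? ? nil : (x > low ? 1 : 0) }
end

by_minimum = dichotomize_values([1, 2, nil, 3])    # cut point is the minimum, 1
by_cut     = dichotomize_values([1, 2, nil, 3], 2) # explicit cut point
```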
Creates a duplicate of the Vector. Note: data, missing_values and labels are duplicated, so changes to the original vector don't propagate to copies.
# File lib/statsample/vector.rb, line 139
def dup
  Vector.new(@data.dup,@type, :missing_values => @missing_values.dup, :labels => @labels.dup, :name=>@name)
end
Returns an empty duplicate of the vector. Maintains the type, missing values and labels.
# File lib/statsample/vector.rb, line 144
def dup_empty
  Vector.new([],@type, :missing_values => @missing_values.dup, :labels => @labels.dup, :name=> @name)
end
Iterate on each item. Equivalent to
@data.each{|x| yield x}
# File lib/statsample/vector.rb, line 273
def each
  @data.each{|x| yield(x) }
end
Iterate on each item, retrieving index
# File lib/statsample/vector.rb, line 278
def each_index
  (0...@data.size).each {|i|
    yield(i)
  }
end
Retrieves the unique values of the data, sorted.
# File lib/statsample/vector.rb, line 723
def factors
  if @type==:scale
    @scale_data.uniq.sort
  elsif @type==:date
    @date_data_with_nils.uniq.sort
  else
    @valid_data.uniq.sort
  end
end
Returns a hash with the distribution of frequencies for the sample
# File lib/statsample/vector.rb, line 735
def frequencies
  Statsample::STATSAMPLE__.frequencies(@valid_data)
end
Returns true if the data has one or more missing values
# File lib/statsample/vector.rb, line 338
def has_missing_data?
  @has_missing_data
end
With a Fixnum, creates that number of bins within the range of the data. With an Array, each value is a cut point.
# File lib/statsample/vector.rb, line 976
def histogram(bins=10)
  check_type :scale
  if bins.is_a? Array
    #h=Statsample::Histogram.new(self, bins)
    h=Statsample::Histogram.alloc(bins)
  else
    # ugly patch. The upper limit for a bin has the form
    # x < range
    #h=Statsample::Histogram.new(self, bins)
    min,max=Statsample::Util.nice(@valid_data.min,@valid_data.max)
    # fix last data
    if max==@valid_data.max
      max+=1e-10
    end
    h=Statsample::Histogram.alloc(bins,[min,max])
    # Fix last bin
  end
  h.increment(@valid_data)
  h
end
# File lib/statsample/vector.rb, line 719
def inspect
  self.to_s
end
Return true if a value is valid (not nil and not included on missing values)
# File lib/statsample/vector.rb, line 376
def is_valid?(x)
  !(x.nil? or @missing_values.include? x)
end
Returns a dataset with jackknife delete-k estimators. estimators can be:
a) A Hash with variable names as keys and lambdas as values
a.jacknife(:log_s2=>lambda {|v| Math.log(v.variance)})
b) An Array with the names of the methods to jackknife
a.jacknife([:mean, :sd])
c) A single method to jackknife
a.jacknife(:mean)
k represents the block size for the block jackknife. By default it is set to 1, for the classic delete-one jackknife.
Returns a dataset where each vector has length cases/k and contains the computed jackknife estimates.
Sawyer, S. (2005). Resampling Data: Using a Statistical Jacknife.
# File lib/statsample/vector.rb, line 599
def jacknife(estimators, k=1)
  raise "n should be divisible by k:#{k}" unless n%k==0
  nb=(n / k).to_i
  h_est, es, ps= prepare_bootstrap(estimators)
  est_n=es.inject({}) {|h,v|
    h[v]=h_est[v].call(self)
    h
  }
  nb.times do |i|
    other=@data_with_nils.dup
    other.slice!(i*k,k)
    other=other.to_scale
    es.each do |estimator|
      # Add pseudovalue
      ps[estimator].push( nb * est_n[estimator] - (nb-1) * h_est[estimator].call(other))
    end
  end
  es.each do |est|
    ps[est]=ps[est].to_scale
    ps[est].type=:scale
  end
  ps.to_dataset
end
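The pseudovalue formula in the listing, nb * estimate(all) - (nb-1) * estimate(without block i), can be checked on a plain Array. In this sketch `jackknife_pseudovalues` is a hypothetical helper that takes a single estimator as a block instead of the Hash, Array or Symbol forms accepted by the library.

```ruby
# Hypothetical helper: delete-k jackknife pseudovalues for one estimator.
def jackknife_pseudovalues(data, k = 1)
  raise ArgumentError, "n should be divisible by k" unless (data.size % k).zero?
  nb = data.size / k
  full = yield(data)                 # estimate on the complete data
  Array.new(nb) do |i|
    rest = data.dup
    rest.slice!(i * k, k)            # drop the i-th block of k cases
    nb * full - (nb - 1) * yield(rest)
  end
end

pseudo = jackknife_pseudovalues([1, 2, 3, 4]) { |xs| xs.sum.to_f / xs.size }
```

For the mean with k=1, the pseudovalues reproduce the original observations (up to floating-point rounding), which makes a convenient sanity check.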
Kurtosis of the sample
# File lib/statsample/vector.rb, line 960
def kurtosis(m=nil)
  check_type :scale
  m||=mean
  fo=@scale_data.inject(0){|a,x| a+((x-m)**4)}
  fo.quo((@scale_data.size)*sd(m)**4)-3
end
Retrieves the label for value x. Returns x itself if no label is defined. In both cases the result is a String.
# File lib/statsample/vector.rb, line 345
def labeling(x)
  @labels.has_key?(x) ? @labels[x].to_s : x.to_s
end
Maximum value
# File lib/statsample/vector.rb, line 852
def max
  check_type :ordinal
  @valid_data.max
end
The arithmetical mean of data
# File lib/statsample/vector.rb, line 898
def mean
  check_type :scale
  sum.to_f.quo(n_valid)
end
Returns the median (50th percentile)
# File lib/statsample/vector.rb, line 842
def median
  check_type :ordinal
  percentil(50)
end
Minimum value
# File lib/statsample/vector.rb, line 847
def min
  check_type :ordinal
  @valid_data.min
end
Sets missing_values. set_valid_data is called after the change
# File lib/statsample/vector.rb, line 381
def missing_values=(vals)
  @missing_values = vals
  set_valid_data
end
Returns the most frequent item.
# File lib/statsample/vector.rb, line 754
def mode
  frequencies.max{|a,b| a[1]<=>b[1]}.first
end
The number of items with valid data.
# File lib/statsample/vector.rb, line 758
def n_valid
  @valid_data.size
end
Returns the value at percentile q
# File lib/statsample/vector.rb, line 820
def percentil(q)
  check_type :ordinal
  sorted=@valid_data.sort
  v= (n_valid * q).quo(100)
  if(v.to_i!=v)
    sorted[v.to_i]
  else
    (sorted[(v-0.5).to_i].to_f + sorted[(v+0.5).to_i]).quo(2)
  end
end
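The two branches of the percentile rule are easier to follow with concrete numbers. The sketch below ports the computation to a plain Array; `percentile_of` is a hypothetical name, and nils stand in for all missing values.

```ruby
# Hypothetical helper following the same rule: when n*q/100 is not an
# integer, take the item at that (truncated) position; otherwise
# average the two neighbouring items.
def percentile_of(data, q)
  sorted = data.compact.sort
  v = (sorted.size * q).quo(100)
  if v.to_i != v
    sorted[v.to_i]
  else
    (sorted[(v - 0.5).to_i].to_f + sorted[(v + 0.5).to_i]).quo(2)
  end
end

odd_median  = percentile_of([1, 2, 3, 4, 5], 50) # 5*50/100 = 2.5, so item at index 2
even_median = percentile_of([1, 2, 3, 4], 50)    # 4*50/100 = 2, so average of items 1 and 2
```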
Product of all values on the sample
# File lib/statsample/vector.rb, line 969
def product
  check_type :scale
  @scale_data.inject(1){|a,x| a*x }
end
Proportion of a given value.
# File lib/statsample/vector.rb, line 770
def proportion(v=1)
  frequencies[v].quo(@valid_data.size)
end
# File lib/statsample/vector.rb, line 801
def proportion_confidence_interval_t(n_poblation,margin=0.95,v=1)
  Statsample::proportion_confidence_interval_t(proportion(v), @valid_data.size, n_poblation, margin)
end
# File lib/statsample/vector.rb, line 804
def proportion_confidence_interval_z(n_poblation,margin=0.95,v=1)
  Statsample::proportion_confidence_interval_z(proportion(v), @valid_data.size, n_poblation, margin)
end
Returns a hash with the distribution of proportions of the sample.
# File lib/statsample/vector.rb, line 763
def proportions
  frequencies.inject({}){|a,v|
    a[v[0]] = v[1].quo(n_valid)
    a
  }
end
# File lib/statsample/vector.rb, line 250
def push(v)
  @data.push(v)
  set_valid_data
end
The range of the data (max - min)
# File lib/statsample/vector.rb, line 888
def range;
  check_type :scale
  @scale_data.max - @scale_data.min
end
Returns a ranked vector.
# File lib/statsample/vector.rb, line 831
def ranked(type=:ordinal)
  check_type :ordinal
  i=0
  r=frequencies.sort.inject({}){|a,v|
    a[v[0]]=(i+1 + i+v[1]).quo(2)
    i+=v[1]
    a
  }
  @data.collect {|c| r[c] }.to_vector(type)
end
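The listing assigns tied values the average of the rank positions they occupy. The following plain-Ruby sketch shows that rule on an Array; `mid_ranks` is a hypothetical helper, and it treats only nil as missing.

```ruby
# Hypothetical helper: each distinct value gets the mean of the first
# and last rank position taken by its occurrences.
def mid_ranks(data)
  freq = data.compact.each_with_object(Hash.new(0)) { |x, h| h[x] += 1 }
  i = 0
  rank_of = {}
  freq.sort.each do |value, count|
    rank_of[value] = (i + 1 + i + count).quo(2) # (first + last position) / 2
    i += count
  end
  data.map { |x| rank_of[x] }
end

ranks = mid_ranks([10, 20, 20, 30])
```

The two occurrences of 20 occupy positions 2 and 3, so both receive rank 5/2; the results are Rationals because of `quo`.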
Returns a new vector, with data modified by the block. Equivalent to creating a Vector from the result of #collect on data
# File lib/statsample/vector.rb, line 236
def recode(type=nil)
  type||=@type
  @data.collect{|x|
    yield x
  }.to_vector(type)
end
Modifies the current vector in place, with data modified by the block. Equivalent to #collect! on @data
# File lib/statsample/vector.rb, line 244
def recode!
  @data.collect!{|x|
    yield x
  }
  set_valid_data
end
# File lib/statsample/vector.rb, line 773
def report_building(b)
  b.section(:name=>name) do |s|
    s.text _("n :%d") % n
    s.text _("n valid:%d") % n_valid
    s.text _("factors:%s") % factors.join(",")
    s.text _("mode: %s") % mode
    s.table(:name=>_("Distribution")) do |t|
      frequencies.sort.each do |k,v|
        key=labels.has_key?(k) ? labels[k]:k
        t.row [key, v , ("%0.2f%%" % (v.quo(n_valid)*100))]
      end
    end
    s.text _("median: %s") % median.to_s if(@type==:ordinal)
    if(@type==:scale)
      s.text _("mean: %0.4f") % mean
      s.text _("sd: %0.4f") % sd.to_s
    end
  end
end
Returns a random sample of the given size, with replacement, drawn only from valid data.
In every trial, each item has the same probability of being selected.
# File lib/statsample/vector.rb, line 636
def sample_with_replacement(sample=1)
  vds=@valid_data.size
  (0...sample).collect{ @valid_data[rand(vds)] }
end
Returns a random sample of the given size, without replacement, drawn only from valid data.
Every element can be selected only once.
A sample of the same size as the vector is the vector itself.
# File lib/statsample/vector.rb, line 647
def sample_without_replacement(sample=1)
  raise ArgumentError, "Sample size couldn't be greater than n" if sample>@valid_data.size
  out=[]
  size=@valid_data.size
  while out.size<sample
    value=rand(size)
    out.push(value) if !out.include?value
  end
  out.collect{|i| @data[i]}
end
Updates valid_data, missing_data, data_with_nils and gsl at the end of an insertion.
Use after Vector#add(v,false). Usage:
v=Statsample::Vector.new
v.add(2,false)
v.add(4,false)
v.data
=> [2,4]
v.valid_data
=> []
v.set_valid_data
v.valid_data
=> [2,4]
# File lib/statsample/vector.rb, line 306
def set_valid_data
  @valid_data.clear
  @missing_data.clear
  @data_with_nils.clear
  @date_data_with_nils.clear
  set_valid_data_intern
  set_scale_data if(@type==:scale)
  set_date_data if(@type==:date)
end
Size of total data
# File lib/statsample/vector.rb, line 361
def size
  @data.size
end
Skewness of the sample
# File lib/statsample/vector.rb, line 953
def skew(m=nil)
  check_type :scale
  m||=mean
  th=@scale_data.inject(0){|a,x| a+((x-m)**3)}
  th.quo((@scale_data.size)*sd(m)**3)
end
Returns a hash of Vectors, one for each distinct value found in the fields. Each vector holds 1 where the value is present in the case and 0 where it is not. Example:
a=Vector.new(["a,b","c,d","a,b"])
a.split_by_separator
=> {"a"=>Vector(type:nominal, n:3)[1,0,1],
"b"=>Vector(type:nominal, n:3)[1,0,1],
"c"=>Vector(type:nominal, n:3)[0,1,0],
"d"=>Vector(type:nominal, n:3)[0,1,0]}
# File lib/statsample/vector.rb, line 493
def split_by_separator(sep=Statsample::SPLIT_TOKEN)
  split_data=splitted(sep)
  factors=split_data.flatten.uniq.compact
  out=factors.inject({}) {|a,x|
    a[x]=[]
    a
  }
  split_data.each do |r|
    if r.nil?
      factors.each do |f|
        out[f].push(nil)
      end
    else
      factors.each do |f|
        out[f].push(r.include?(f) ? 1:0)
      end
    end
  end
  out.inject({}){|s,v|
    s[v[0]]=Vector.new(v[1],:nominal)
    s
  }
end
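The indicator coding can be reproduced on a plain Array of strings. In this sketch `split_indicators` is a hypothetical helper; it assumes a comma as separator (the value of Statsample::SPLIT_TOKEN is not shown here) and returns arrays instead of nominal Vectors.

```ruby
# Hypothetical helper: one 0/1 indicator array per distinct token;
# nil cases stay nil in every indicator.
def split_indicators(data, sep = ",")
  split = data.map { |x| x.nil? ? nil : x.split(sep) }
  factors = split.flatten.compact.uniq
  factors.each_with_object({}) do |f, out|
    out[f] = split.map { |r| r.nil? ? nil : (r.include?(f) ? 1 : 0) }
  end
end

indicators = split_indicators(["a,b", "c,d", "a,b"])
```

Every token that appears anywhere in the data gets its own key, so this input produces four indicators: "a" and "b" are present in the first and third case, "c" and "d" only in the second.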
# File lib/statsample/vector.rb, line 516
def split_by_separator_freq(sep=Statsample::SPLIT_TOKEN)
  split_by_separator(sep).inject({}) {|a,v|
    a[v[0]]=v[1].inject {|s,x| s+x.to_i}
    a
  }
end
Returns an array with the data split by a separator.
a=Vector.new(["a,b","c,d","a,b","d"])
a.splitted
=> [["a","b"],["c","d"],["a","b"],["d"]]
# File lib/statsample/vector.rb, line 469
def splitted(sep=Statsample::SPLIT_TOKEN)
  @data.collect{|x|
    if x.nil?
      nil
    elsif (x.respond_to? :split)
      x.split(sep)
    else
      [x]
    end
  }
end
Population Standard deviation (denominator N)
# File lib/statsample/vector.rb, line 927
def standard_deviation_population(m=nil)
  check_type :scale
  Math::sqrt( variance_population(m) )
end
Sample Standard deviation (denominator n-1)
# File lib/statsample/vector.rb, line 947
def standard_deviation_sample(m=nil)
  check_type :scale
  m||=mean
  Math::sqrt(variance_sample(m))
end
The sum of values for the data
# File lib/statsample/vector.rb, line 893
def sum
  check_type :scale
  @scale_data.inject(0){|a,x|x+a} ;
end
Sum of squared deviation
# File lib/statsample/vector.rb, line 912
def sum_of_squared_deviation
  check_type :scale
  @scale_data.inject(0) {|a,x| x.square+a} - (sum.square.quo(n_valid))
end
Sum of squares for the data around a value. By default, this value is the mean
ss= sum{(xi-m)^2}
# File lib/statsample/vector.rb, line 906
def sum_of_squares(m=nil)
  check_type :scale
  m||=mean
  @scale_data.inject(0){|a,x| a+(x-m).square}
end
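When m is the mean, sum_of_squares and sum_of_squared_deviation compute the same quantity in two ways: directly as the sum of (xi-m)^2, and through the shortcut sum(xi^2) - (sum xi)^2 / n. A small plain-Ruby check, using `**2` in place of the `square` method the listing relies on (the variable names are illustrative only):

```ruby
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = data.size
mean = data.sum.quo(n) # exact Rational mean, here 5

# Direct form: squared distances from the mean.
around_mean = data.sum { |x| (x - mean)**2 }
# Shortcut form: sum of squares minus (sum squared / n).
shortcut = data.sum { |x| x**2 } - (data.sum**2).quo(n)
```

Both expressions evaluate to 32 for this data.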
# File lib/statsample/rserve_extension.rb, line 6
def to_REXP
  Rserve::REXP::Wrapper.wrap(data_with_nils)
end
# File lib/statsample/vector.rb, line 396
def to_a
  if @data.is_a? Array
    @data.dup
  else
    @data.to_a
  end
end
Ugly name. Creates a Matrix, from the standard ‘matrix’ package, out of the vector data. dir can be :horizontal (a single row) or :vertical (a single column)
# File lib/statsample/vector.rb, line 711
def to_matrix(dir=:horizontal)
  case dir
  when :horizontal
    Matrix[@data]
  when :vertical
    Matrix.columns([@data])
  end
end
# File lib/statsample/vector.rb, line 706
def to_s
  sprintf("Vector(type:%s, n:%d)[%s]",@type.to_s,@data.size, @data.collect{|d| d.nil? ? "nil":d}.join(","))
end
Sets the values considered as “today” in date vectors
# File lib/statsample/vector.rb, line 386
def today_values=(vals)
  @today_values = vals
  set_valid_data
end
Set level of measurement.
# File lib/statsample/vector.rb, line 391
def type=(t)
  @type=t
  set_scale_data if(t==:scale)
  set_date_data if (t==:date)
end
Population variance (denominator N)
# File lib/statsample/vector.rb, line 918
def variance_population(m=nil)
  check_type :scale
  m||=mean
  squares=@scale_data.inject(0){|a,x| x.square+a}
  squares.quo(n_valid) - m.square
end
Variance of p, according to population size
# File lib/statsample/vector.rb, line 794
def variance_proportion(n_poblation, v=1)
  Statsample::proportion_variance_sample(self.proportion(v), @valid_data.size, n_poblation)
end
Sample Variance (denominator n-1)
# File lib/statsample/vector.rb, line 940
def variance_sample(m=nil)
  check_type :scale
  m||=mean
  sum_of_squares(m).quo(n_valid - 1)
end
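The population and sample variances differ only in the denominator (N versus n-1). The sketch below recomputes both on a plain Array with no missing values; the `_of` helper names are hypothetical, and `**2` replaces the `square` method used in the listings.

```ruby
# Hypothetical helper: mean of squares minus squared mean (denominator N).
def variance_population_of(data)
  n = data.size
  mean = data.sum.to_f / n
  data.sum { |x| x**2 }.quo(n) - mean**2
end

# Hypothetical helper: sum of squares around the mean over n - 1.
def variance_sample_of(data)
  n = data.size
  mean = data.sum.to_f / n
  data.sum { |x| (x - mean)**2 }.quo(n - 1)
end

data = [2, 4, 4, 4, 5, 5, 7, 9]
var_pop = variance_population_of(data)    # sum of squares 32 over N = 8
var_sample = variance_sample_of(data)     # sum of squares 32 over n - 1 = 7
sd_pop = Math.sqrt(var_pop)
```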
Variance of p, according to population size
# File lib/statsample/vector.rb, line 798
def variance_total(n_poblation, v=1)
  Statsample::total_variance_sample(self.proportion(v), @valid_data.size, n_poblation)
end
Return a centered vector
# File lib/statsample/vector.rb, line 184
def vector_centered
  check_type :scale
  m=mean
  return ([nil]*size).to_scale if mean.nil?
  vector=vector_centered_compute(m)
  vector.name=_("%s(centered)") % @name
  vector
end
Returns a Vector in which every value that has a label is replaced by that label.
# File lib/statsample/vector.rb, line 350
def vector_labeled
  d=@data.collect{|x|
    if @labels.has_key? x
      @labels[x]
    else
      x
    end
  }
  Vector.new(d,@type)
end
Returns a vector with each value replaced by its percentile
# File lib/statsample/vector.rb, line 197
def vector_percentil
  check_type :ordinal
  c=@valid_data.size
  vector=ranked.map {|i| i.nil? ? nil : (i.quo(c)*100).to_f }.to_vector(@type)
  vector.name=_("%s(percentil)") % @name
  vector
end
Returns a vector with the standardized values of the data, using the standard deviation with denominator n-1 (or the population standard deviation if use_population is true). If the standard deviation is 0 or the mean is nil, returns a vector of equal size filled with nils
# File lib/statsample/vector.rb, line 171
def vector_standarized(use_population=false)
  check_type :scale
  m=mean
  sd=use_population ? sdp : sds
  return ([nil]*size).to_scale if mean.nil? or sd==0.0
  vector=vector_standarized_compute(m,sd)
  vector.name=_("%s(standarized)") % @name
  vector
end
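A minimal plain-Ruby sketch of the standardization follows. `standardize` is a hypothetical helper; it assumes that vector_standarized_compute (not shown here) maps each valid value x to (x - mean) / sd and keeps nils in place, and it adds a guard for a single valid case that the listing does not have.

```ruby
# Hypothetical helper: z-scores with denominator n-1 by default,
# or N when use_population is true.
def standardize(data, use_population = false)
  valid = data.compact
  denom = use_population ? valid.size : valid.size - 1
  return Array.new(data.size) if valid.empty? || denom <= 0
  m = valid.sum.to_f / valid.size
  sd = Math.sqrt(valid.sum { |x| (x - m)**2 } / denom)
  return Array.new(data.size) if sd == 0.0 # no spread: all nils
  data.map { |x| x.nil? ? nil : (x - m) / sd }
end

z_pop = standardize([2, 4, 4, 4, 5, 5, 7, 9], true) # mean 5, population sd 2
z_flat = standardize([3, 3, 3])                      # zero variance
```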