php/Math   
Recreational Mathematics   
Updates to LognormalDistribution.php [Stats & Prob]
Posted on December 23, 2013 @ 01:15:18 PM by Paul Meagher

Božo Cvetkovski, who is working on a Master's degree and doing some work in PHP, contacted me about a couple of errors he found in the LognormalDistribution.php class.

These are Božo's findings and fixes:

The LognormalDistribution.php class gives correct results for the values mu = 0, sigma = 1 used in the example file I found in the test folder.

http://www.phpmath.com/build02/PDL/docs/tests.php?src=lognormal.php#source

But when you change the values of mu and sigma it does not give correct results.

I did a little bit of research and found that you made a mistake by using the variance instead of the standard deviation.

You need to change LognormalDistribution.php like this:


Find lines 29-31:

function LognormalDistribution($mu=0.0, $sigma=1.0) {
  $this->normal = new NormalDistribution($mu, $sigma * $sigma);
}

Replace with this:

function LognormalDistribution($mu=0.0, $sigma=1.0) {
  $this->normal = new NormalDistribution($mu, $sigma);
}


Find lines 43-45:

function getSigmaParameter() {
  return sqrt($this->normal->getVariance());
}

Replace with this:

function getSigmaParameter() {
  return $this->normal->getStandardDeviation();
}


Before the change, for mu = 4.15 and sigma = 0.37, it would return this:

LognormalDistribution(4.15, 0.37)
Methods Output
getMuParameter() 4.15
getSigmaParameter() 0.37
PDF(.2) 0.00025050224546219
CDF(.50) 1.9101344312285E-274
ICDF(0.95) 79.454176680834

After the change it returns these values:

LognormalDistribution(4.15, 0.37)
Methods Output
getMuParameter() 4.15
getSigmaParameter() 0.37
PDF(.2) 1.307667356577E-52
CDF(.50) 1.8882790329553E-39
ICDF(0.95) 116.58211120768


I tested the new class by comparing its results with the results I get in Excel and Matlab. The new, modified class works as expected.
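As an additional check, the lognormal PDF can be computed directly from its closed form, f(x) = exp(-(ln(x) - mu)^2 / (2 sigma^2)) / (x sigma sqrt(2 pi)), without going through the class at all. This is a quick verification sketch; the function name is mine:

```php
<?php
// Closed-form lognormal PDF:
// f(x) = exp(-(ln(x)-mu)^2 / (2*sigma^2)) / (x * sigma * sqrt(2*pi))
function lognormal_pdf($x, $mu, $sigma) {
  $z = (log($x) - $mu) / $sigma;
  return exp(-0.5 * $z * $z) / ($x * $sigma * sqrt(2 * M_PI));
}

// Should agree with the corrected class output for PDF(.2)
echo lognormal_pdf(0.2, 4.15, 0.37); // approx 1.3E-52
?>
```

If this agrees with the class output, the constructor is passing sigma (not sigma squared) through correctly.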

I'd like to thank Božo for finding and fixing these errors, and I have acknowledged him as another author of the LognormalDistribution.php class.

If anyone discovers any other errors in the Probability Distributions Library (PDL), please let me know. Also, if anyone has any new probability distribution classes, or wants to suggest changes to the parameterization of a class (more parameters, different parameters), then shoot me an email with your thoughts or code. I've not had much time for a while now to do PHP-based math programming, but I'd love to see some multivariate distributions added to the library. The PHP port of the JAMA linear algebra library might come in handy here.

Permalink 

Stratified Random Sampling [Stats & Prob]
Posted on March 20, 2008 @ 04:49:54 AM by Paul Meagher

Jan Steedman writes:

I was wondering if you could point me in the right direction with a somewhat off-topic sampling problem that I am currently having:

I have a set of people in the database with several coded variables (let's say: age group, gender & region). I'd like to draw a sample of 100 people in total from the database now with specific targets (i.e. constraints):

gender male: 50 people
gender female: 50 people

age group 1: 40 people
age group 2: 35 people
age group 3: 25 people

region 1: 10 people
region 2: 30 people
region 3: 60 people

The target size is 100 people in total and all constraints must be met. My problem now is how to select the "correct" set of people from the database in an efficient way? From all possible combinations of people there might only be one combination that will satisfy all constraints.

Here is a stratified_sample.php script that I came up with to solve Jan's problem. I'm not sure that it actually solves Jan's problem, but I thought I would publish it anyway.

<?php

// setup the gender, age, and region arrays with
// appropriate number of sampling codes in each
// array

for ($i = 0; $i < 100; $i++) {

  if ($i < 50)
    $gender[$i] = "GM";
  else
    $gender[$i] = "GF";

  if ($i < 40)
    $age[$i] = "A1";
  elseif ($i < 75)
    $age[$i] = "A2";
  else
    $age[$i] = "A3";

  if ($i < 10)
    $region[$i] = "R1";
  elseif ($i < 40)
    $region[$i] = "R2";
  else
    $region[$i] = "R3";

}

// randomize the position of sampling codes in
// each array

shuffle($gender);
shuffle($age);
shuffle($region);

// the sample array contains the sampling codes
// for selecting each sample unit

// to create the sample array loop through the
// gender, age, and region arrays and concatenate
// the sampling codes from each array together
// to create the sampling code for that sample unit

for ($i = 0; $i < 100; $i++) {
  $g = $gender[$i];
  $a = $age[$i];
  $r = $region[$i];
  $sample[$i] = "$g$a$r";
}

// the sample array should meet the sampling
// constraints while meeting the constraints
// of randomization as well

echo "<pre>";
print_r($sample);
echo "</pre>";

?> 

The abbreviated output of the script looks like this:

Array
(
    [0] => GFA1R2
    [1] => GFA1R3
    [2] => GFA1R3
    [3] => GFA2R2
    [4] => GMA2R1
    [5] => GMA2R3
    [6] => GFA1R2
    [7] => GFA2R2
    [8] => GMA2R3
    [9] => GFA1R2
    [10] => GMA1R2
 
    ... snip ...

    [90] => GMA1R2
    [91] => GMA2R1
    [92] => GFA3R3
    [93] => GFA2R3
    [94] => GMA3R2
    [95] => GFA1R3
    [96] => GMA1R2
    [97] => GFA3R2
    [98] => GMA2R3
    [99] => GFA1R1
)

For this solution to fully work, you would have to code each of your sample elements according to your 3-dimensional coding system (e.g., GFA1R3 meaning "Gender Female", "Age group 1", "Region 3"). Then, when selecting individuals matching these sampling codes, you would randomly select a person from among the pool of people matching each sampling code.
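A quick way to convince yourself that the shuffled code assignments preserve the marginal targets is to tally the codes in the generated sample. This is a small self-contained check that rebuilds the arrays rather than reusing the script above:

```php
<?php
// Rebuild the sampling-code arrays as in stratified_sample.php.
for ($i = 0; $i < 100; $i++) {
  $gender[$i] = ($i < 50) ? "GM" : "GF";
  $age[$i]    = ($i < 40) ? "A1" : (($i < 75) ? "A2" : "A3");
  $region[$i] = ($i < 10) ? "R1" : (($i < 40) ? "R2" : "R3");
}
shuffle($gender);
shuffle($age);
shuffle($region);

// Tally each code: shuffling permutes positions but cannot change
// the counts, so every marginal target is met by construction.
$counts = array("GM"=>0, "GF"=>0, "A1"=>0, "A2"=>0, "A3"=>0,
                "R1"=>0, "R2"=>0, "R3"=>0);
for ($i = 0; $i < 100; $i++) {
  $counts[$gender[$i]]++;
  $counts[$age[$i]]++;
  $counts[$region[$i]]++;
}
print_r($counts); // GM=50, GF=50, A1=40, A2=35, A3=25, R1=10, R2=30, R3=60
?>
```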

Permalink 

Population sampling [Stats & Prob]
Posted on January 11, 2008 @ 04:11:14 AM by Paul Meagher

Below is a simple but useful class for generating a population sample:

<?php
/**
* Generates a random sample (with or without replacement) 
* of size n from population array.
*
* @see http://www.math.uah.edu/stat/objects/distributions/Functions.xhtml
*/
class Population {

  const WITHOUT_REPLACEMENT = 0;
  const WITH_REPLACEMENT    = 1;

  function random($min=0, $max=1) {
    return ($min + lcg_value() * (abs($max - $min)));
  }

  /**
  * This method computes a sample of a specified size from a specified population
  * and of a specified type (with or without replacement).
  * @param int[] $p  the population
  * @param int   $n  the sample size
  * @param int   $t  the type (0 without replacement, 1 with replacement)
  * @return int[] $s 
  */
  public function getSample($p, $n, $t) {
    $m = count($p);
    if ($n < 1) $n = 1; elseif ($n > $m) $n = $m;
    // Define the sample
    if ($t == self::WITH_REPLACEMENT) {
      for ($i = 0; $i < $n; $i++) {
        $u = (int)($m * $this->random());
        $s[$i] = $p[$u];
      }
    } else {
      for ($i = 0; $i < $n; $i++) {
        // Select a random index from 0 to m - i - 1
        $k = $m - $i;
        $u = (int)($k * $this->random());
        // Define the sample element
        $s[$i] = $p[$u];
        // Interchange the sampled element p[u] with p[k - 1], at the end of the
        // population, so that it will not be sampled again.
        $temp    = $p[$k-1];
        $p[$k-1] = $p[$u];
        $p[$u]   = $temp;
      }
    }
    return $s;
  }

}

?>

Here is some code to demonstrate usage:

<?php 

include "Population.php";

$p = range(1, 9);
$n = 3;
$t = 0;

$pop = new Population;
$s   = $pop->getSample($p, $n, $t);

echo "<pre>";
print_r($s);
echo "</pre>";

?>

Here is what the output looks like:

Array
(
    [0] => 2
    [1] => 9
    [2] => 4
)

Perhaps you will find a use for this sampler some day...

Permalink 

Forms of uncertainty and their treatment [Stats & Prob]
Posted on January 6, 2007 @ 06:42:36 PM by Paul Meagher

There are many forms of uncertainty which require different modelling techniques for their treatment.

The form of uncertainty that one commonly deals with in undergraduate statistics courses is one of the simplest and most tractable; namely, the determination of probabilities for outcomes arising from a stationary random process. In Fuzzy Logic with Engineering Applications (2004), Timothy Ross uses three criteria to define a stationary random process (p. 10):

  1. The sample space on which the processes are defined cannot change from one experiment to another: that is, the outcome space cannot change.
  2. The frequency of occurrence, or probability, of an event within that sample space is constant and cannot change from trial to trial or experiment to experiment.
  3. The outcomes must be repeatable from experiment to experiment. The outcome of one trial does not influence the outcome of a previous or future trial.

Examples of stationary random processes are coin and dice tossing, card games, and selecting colored balls from an urn - typical problems encountered in undergraduate statistics and probability courses.

Consider the form of uncertainty we have about the weather. Is it appropriate to model the weather in terms of a stationary random process? In this case, there are other general classes of random process that are more appropriate to use for modelling our uncertainty about weather outcomes (e.g., non-stationary markov process).

Some forms of uncertainty, however, do not require us to invoke the concept of a random process. According to Lotfi Zadeh (1965), fuzzy set theory provides "a natural way of dealing with problems in which the source of imprecision is the absence of sharply defined criteria of class membership rather than the presence of random variables". In general, the forms of uncertainty arising from chance versus fuzziness require different modelling techniques: probability over crisp sets in the former case, fuzzy sets in the latter case.

Exercise

The following exercise is taken from Ross (p. 16) who in turn cites James Bezdek as the source:

Suppose you are seated at a table on which rest two glasses of liquid. The liquid in the first glass is described to you as having a 95% chance of being healthful and good. The liquid in the second glass is described as having a 0.95 membership in the class of "healthful and good" liquids. Which glass would you select? Why?

Permalink 

Probability bounds for a standard uniform random variate [Stats & Prob]
Posted on January 4, 2007 @ 02:45:34 PM by Paul Meagher

There is some inconsistency in the definition of a standard uniform random variate, particularly in terms of the bounds used to define it. Common sense might suggest that the bounds would be [0..1], meaning that the end points are included as possibilities (i.e., the event certainly won't occur 0 or certainly will occur 1). Another common definition defines the range as (0..1), meaning the end points are not included as possibilities (a definition I came across today in the book Discrete-Event System Simulation). The programming language Java defines the return value from Math.random as [0..1). The Java definition has always puzzled me until today, when I clued in on a possible argument for it: a standard uniform random variate is commonly generated as a series of random digits from the set {0,1,2,3,4,5,6,7,8,9} with a "0." prefixed to the random digit series. This method for generating a uniform random variate precludes a 1 from occurring. I'm not sure why Java defines the bounds for a standard uniform random variate in the way they do, but at least I now have one fairly compelling technical reason for agreeing with the bounds they use (i.e., this is a sensible way to generate a standard uniform random variate).

Note: Java also offers a java.util.Random class that has some useful documentation and may be more appropriate to use in certain situations.

Permalink 

Inversion formulas for generating non-uniform random variates [Stats & Prob]
Posted on December 26, 2006 @ 03:56:18 PM by Paul Meagher

Luc Devroye has written many useful and authoritative books and articles on probability theory and its applications. I also find his writing quite accessible which is one of the biggest compliments I can give to a mathematician. His recent chapter in the "Handbook of Simulation" has a nice list of probability density formulas and the corresponding inversion formulas that can be used to generate non-uniform random variates. That table is reproduced below:

Table 1: Some densities with distribution functions that are explicitly invertible.

Name              Density                      Distribution function    Random variate
Exponential       e^(-x), x > 0                1 - e^(-x)               log(1/U)
Weibull(a), a>0   a x^(a-1) e^(-x^a), x > 0    1 - e^(-x^a)             (log(1/U))^(1/a)
Gumbel            e^(-x) e^(-e^(-x))           e^(-e^(-x))              -log(log(1/U))
Logistic          1/(2 + e^x + e^(-x))         1/(1 + e^(-x))           -log((1-U)/U)
Cauchy            1/(π(1 + x^2))               1/2 + (1/π) arctan(x)    tan(πU)
Pareto(a), a>0    a/x^(a+1), x > 1             1 - 1/x^a                1/U^(1/a)

Note: U is a uniform random variate between 0 and 1.
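The first row of the table translates directly into code. Here is a minimal sketch (the function name is mine) that generates Exponential(1) variates by inversion and checks that the sample mean is near the theoretical mean of 1:

```php
<?php
// Exponential(1) variate by inversion: X = log(1/U), U uniform on (0,1).
function exponential_variate() {
  // Shift and scale mt_rand so U is strictly inside (0,1).
  $u = (mt_rand() + 1) / (mt_getrandmax() + 2);
  return log(1 / $u);
}

$n   = 100000;
$sum = 0.0;
for ($i = 0; $i < $n; $i++)
  $sum += exponential_variate();

echo "sample mean: " . ($sum / $n); // should be close to 1
?>
```

The other rows work the same way: draw U, plug it into the "Random variate" column.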

Permalink 

Hazard Rate [Stats & Prob]
Posted on November 17, 2006 @ 03:15:04 PM by Paul Meagher

I am doing some preliminary research on hazard rates and developed the simple PHP script below to illustrate how one is computed using a Gaussian probability model:

<?php

include "Math/PDL/NormalDistribution.php";

$norm = new NormalDistribution(10, 2);
$PDF  = $norm->PDF(11);
$CDF  = $norm->CDF(11);
$HAZ  = $PDF / (1 - $CDF);

echo "PDF: $PDF<br />";
echo "CDF: $CDF<br />";
echo "HAZ: $HAZ";

// Output:
// PDF: 0.17603266338215
// CDF: 0.69146246127401
// HAZ: 0.57053888518403

?>

One question I will be examining is whether the same process is always used to compute a hazard rate - if so then it should be easy to add a HAZ function to all the PDL probability distributions. I will also be looking into the computation of the cumulative hazard function (i.e., CHAZ).

Another question concerns which probability distributions are most commonly used when computing hazard rates. I know, for example, that the lognormal, exponential, and Weibull are commonly used in reliability studies.

As a concrete example of when a hazard function is useful, one often sees it used in the context of computing the risk associated with being a smoker versus a non-smoker - it gives you the instantaneous risk of dying at a given age, conditional on having survived to that age. Replace smoking with other "hazards" and you can see that there are probably lots of uses for computing hazard rates.
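For the exponential distribution the PDF/(1 - CDF) recipe has a known closed-form answer: λe^(-λx) / e^(-λx) = λ, a constant hazard rate (the "memoryless" property). A small sketch with hypothetical helper functions confirms this numerically:

```php
<?php
// Hazard rate of an exponential distribution: PDF(x) / (1 - CDF(x)).
function exponential_pdf($x, $lambda) { return $lambda * exp(-$lambda * $x); }
function exponential_cdf($x, $lambda) { return 1 - exp(-$lambda * $x); }

$lambda = 0.5;
foreach (array(1, 2, 5, 10) as $x) {
  $haz = exponential_pdf($x, $lambda) / (1 - exponential_cdf($x, $lambda));
  echo "HAZ($x): $haz<br />"; // always 0.5, regardless of x
}
?>
```

This suggests the same PDF/(1 - CDF) process would indeed generalize to a HAZ method across the PDL distributions, even though only the exponential yields a constant.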

Permalink 

Gompertz function for modelling sales [Stats & Prob]
Posted on November 6, 2006 @ 02:49:08 PM by Paul Meagher

The formula for the Gompertz function is:

y = k * a^(b^t)

The Gompertz function is often used to model product sales when the product is highly anticipated (i.e., sales start off rapidly and then level off). Think, for example, of a movie release, game release, or software release.

Here is a class that evaluates the Gompertz function:

<?php

class Gompertz {

  var $k;
  var $a;
  var $b;

  function __construct($k, $a, $b) {
    $this->k = $k;
    $this->a = $a;
    $this->b = $b;
  }

  function evaluate($t) {
    return ($this->k * pow($this->a, pow($this->b, $t)));
  }

}

?>
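Since evaluate() is just the closed form y = k * a^(b^t), a one-line function gives the same numbers. Here is a sketch using the parameters from Exercise 1 below (assuming t is in weeks and y is in millions of sales):

```php
<?php
// Closed form of the Gompertz function: y = k * a^(b^t).
function gompertz($k, $a, $b, $t) {
  return $k * pow($a, pow($b, $t));
}

echo gompertz(5, 0.1, 0.8, 2);  // about 1.15: roughly 1 million sales by week 2
echo "<br />";
echo gompertz(5, 0.1, 0.8, 50); // about 5: sales level off at the ceiling k
?>
```

Note how k acts as the saturation level: as t grows, b^t goes to 0, a^0 is 1, and y approaches k.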

Exercises

1. For parameters k=5, a=0.1, and b=0.8, use the Gompertz class to estimate when 1 million sales will be reached and the total sales expected. Note that time t is measured in weeks and that sales are measured in millions. The answer is the 2nd week and 5 million total sales; however, you will need to use the class to gain insight into how these answers were obtained and how the Gompertz function might be applied to modelling other sales curves. Source: Connally, E., et al. (1998) Functions Modeling Change, John Wiley & Sons, Inc., p. 193.

2. How can we estimate the k, a, and b parameters to use in the Gompertz function for arbitrary sales-by-week data?

Permalink 

Root mean square [Stats & Prob]
Posted on November 4, 2006 @ 10:44:28 PM by Paul Meagher

The formula to compute the Root Mean Square of a 1D array of numbers is:

x_rms = sqrt((1/n) * sum(x[i]^2))

Roy Lewallen noted that the idea behind RMS is revealed by the name:

...it's the square Root of the Mean of the Square of the function. Mean is the same as average, so we calculate the RMS value of a function or waveform by first squaring it, then taking the average of the squared function or waveform, then taking the square root of the average.

This definition mentions the idea of measuring a waveform using RMS and indeed the RMS measure is commonly used to compute the average power using RMS current and RMS voltage.

Here is some php code that computes RMS:

<?php
// Root Mean Square (RMS) of a 1D array 
function RMS($vec) {
  $n   = count($vec);
  $sum = 0.0;
  for ($i = 0; $i < $n; $i++)
    $sum += pow($vec[$i], 2);
  $sum /= $n;
  return sqrt($sum);
}

$vec = array(1, 2, 3);

echo "Answer: " . RMS($vec);

// Answer: 2.16024689947
?>

When we square the values we lose information about whether the magnitude is positive or negative. When we divide by n we average the squared values. When we take the square root of the result we get a value that is back on the scale of the original values. Conceptually, you should now be able to see why the Root Mean Square can be used to estimate a mean magnitude, particularly in the context of waveforms where we are not concerned about the sign of the waveform fluctuations.
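The waveform connection is easy to see numerically: the RMS of a sine wave over a full period is 1/sqrt(2) ≈ 0.7071 of its peak amplitude, which is why 120 V RMS mains power corresponds to roughly 170 V peak. A self-contained sketch (restating the RMS() function from above):

```php
<?php
// Root Mean Square of a 1D array, as in the post above.
function RMS($vec) {
  $n   = count($vec);
  $sum = 0.0;
  for ($i = 0; $i < $n; $i++)
    $sum += pow($vec[$i], 2);
  return sqrt($sum / $n);
}

// Sample one full period of a unit-amplitude sine wave.
$n = 1000;
for ($i = 0; $i < $n; $i++)
  $wave[$i] = sin(2 * M_PI * $i / $n);

echo RMS($wave); // approx 0.70711 = 1/sqrt(2)
?>
```

Note the mean of the raw sine values is 0, so a plain average tells you nothing about the wave's size; RMS is what captures its magnitude.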

Permalink 

Probability as an effective procedure [Stats & Prob]
Posted on November 3, 2006 @ 03:08:26 PM by Paul Meagher

Sometimes you develop a math program because you want to constructively explore a mathematical concept. Here is a simple PHP-based math program that expresses von Mises' concept of probability as an effective procedure.

<?php
/**
* Implements von Mises' empirical reduction of a 
* probability to the relative frequency of occurrence 
* of an $eventOfInterest in a $dataStream under a  
* $deadLine.
*/
class p {
  
  var $value = 0.0;
 
  // Evaluate p for an $eventOfInterest as its relative frequency of  
  // occurrence in a $dataStream under a $deadLine. 
  function p($eventOfInterest, &$dataStream, $deadLine) {
    $this->value = $this->getRelativeFrequency($eventOfInterest, $dataStream, $deadLine);
  }
  
  // Computes the relative frequency of the $eventOfInterest.
  function getRelativeFrequency($eventOfInterest, &$dataStream, $deadLine) {
    return $this->getFrequency($eventOfInterest, $dataStream, $deadLine) / $deadLine;
  }
  
  // Computes the frequency of the $eventOfInterest in the $dataStream 
  // until the specified $deadLine is reached.
  function getFrequency($eventOfInterest, &$dataStream, $deadLine) {
    $frequency = 0;
    for ($i = 0; $i < $deadLine; $i++) 
      if ($dataStream[$i] == $eventOfInterest)
        $frequency++;
    return $frequency; 
  }    
}

// Event we want to count the frequency of.
$eventOfInterest = 1; 

// Sequence of data stream events.
$dataStream      = array(0, 1, 1, 1, 2, 4, 3); 

// Terminate event counting when we reach this location 
// in the data stream sequence.
$deadLine        = 3; 

$p = new p($eventOfInterest, $dataStream, $deadLine);

echo $p->value;
?> 

The $p->value is 0.666666666667 and denotes the relative frequency of the event "1" in the data stream up to the point when observations of that $dataStream were terminated (i.e., the $deadLine).

Simple programs like this are actually very deep and if reflected upon and experimented with can quickly immerse you in the central concerns of computational learnability theory and epistemology. How, for example, should we construct our theories of induction and convergence to truth based upon this effective procedure definition of probability?
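One way to experiment with the convergence question is to extend the deadline on a simulated data stream and watch the relative frequency settle toward the generating probability. A self-contained sketch (the variable names echo the class above) using a fair coin-flip stream:

```php
<?php
// Relative frequency of the event "1" in a simulated coin-flip
// data stream, measured at successively longer deadlines.
$n = 100000;
for ($i = 0; $i < $n; $i++)
  $dataStream[$i] = mt_rand(0, 1);

foreach (array(10, 100, 1000, 100000) as $deadLine) {
  $frequency = 0;
  for ($i = 0; $i < $deadLine; $i++)
    if ($dataStream[$i] == 1)
      $frequency++;
  $relFreq = $frequency / $deadLine;
  echo "deadline $deadLine: $relFreq<br />";
}
?>
```

The early deadlines can wander well away from 0.5; by the last deadline the estimate is typically within a few thousandths of it, which is the frequentist convergence von Mises' definition leans on.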

Exercises

1. Extend this program to store a more complex $dataStream so that you can define and count a more complex $eventOfInterest. For example, generate a random $dataStream using a transition matrix and count the number of times a pair of values occurs.

2. Create an adaptive deadline that gets incremented if a convergence criterion is not met. For example, try to reliably estimate the probability transition matrix that generated the $dataStream in exercise 1.

Permalink 

Mathematical probability [Stats & Prob]
Posted on November 1, 2006 @ 02:53:21 PM by Paul Meagher

According to John Haigh (Probability Models. Springer, 2002, p. 4), a probability space consists of 3 elements (Ω, F, P):

  1. Ω is the set of outcomes.
  2. F is a collection of subsets of Ω called events E (or more technically, a σ-field).
  3. P is a function defined for each event E, satisfying three standard axioms.

The three axioms are:

  1. 0 ≤ P(E) ≤ 1
  2. P(Ω) = 1
  3. P(∪_i E_i) = Σ_i P(E_i) whenever E_i ∩ E_j = ∅ for i ≠ j.

The theorems, lemmas, and corollaries of probability theory are required to be consistent with this concept of a probability space and its three axioms. When these axioms are violated, the mathematics no longer belongs to probability theory. Information theory is another body of mathematics built on probability spaces that obey these three axioms.

Exercise

Consider a sample space consisting of the outcomes of a series of dice rolling experiments: {1,2,4,1,3,6,3,5,4,5,4,6,2,3,4,1,2,2,3,3}. Develop a PHP program that illustrates the application of the three laws to this particular case (or "universe of discourse" as a logician might say).
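One possible approach to the exercise is to use relative frequencies as the probability function P and check each axiom against it (a minimal sketch):

```php
<?php
// Relative-frequency probability function P over the dice-roll outcomes.
$rolls = array(1,2,4,1,3,6,3,5,4,5,4,6,2,3,4,1,2,2,3,3);
$n     = count($rolls);

foreach (range(1, 6) as $face) $P[$face] = 0;
foreach ($rolls as $outcome)   $P[$outcome] += 1 / $n;

// Axiom 1: 0 <= P(E) <= 1 holds for each elementary event by construction.
// Axiom 2: P(Omega) = 1, since every roll lands in the sample space.
echo "P(Omega) = " . array_sum($P) . "<br />"; // 1

// Axiom 3: additivity for disjoint events, e.g.
// P({1} u {2}) = P(1) + P(2) = 3/20 + 4/20 = 0.35
echo "P(1 or 2) = " . ($P[1] + $P[2]);
?>
```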

Permalink 

Poker hand probabilities [Stats & Prob]
Posted on October 31, 2006 @ 02:37:38 PM by Paul Meagher

The number of possible 5 card hands that can be dealt from a deck of 52 cards is:

# hands = nCr(52,5)
# hands = 52!/((52-5)! 5!)
# hands = (52x51x50x49x48)/(5x4x3x2x1)
# hands = 2,598,960
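The count is easy to reproduce with a small combinations function (the function name is mine); computing the product incrementally avoids evaluating huge factorials:

```php
<?php
// nCr(n, r) = n! / ((n-r)! r!), computed incrementally so that
// intermediate values stay small and exact.
function nCr($n, $r) {
  $result = 1;
  for ($i = 1; $i <= $r; $i++)
    $result = $result * ($n - $r + $i) / $i;
  return $result;
}

echo nCr(52, 5); // 2598960
?>
```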

The probabilities of the various poker hands are:

Poker Hand            # Combinations   Probability
Royal Straight Flush  4                4/2598960 = 0.00000154
Straight Flush        36               36/2598960 = 0.00001385
Four of a Kind        624              624/2598960 = 0.00024010
Full House            3,744            3744/2598960 = 0.00144058
Flush                 5,108            5108/2598960 = 0.00196540
Straight              10,200           10200/2598960 = 0.00392465
Three of a Kind       54,912           54912/2598960 = 0.02112846
Two Pairs             123,552          123552/2598960 = 0.04753904
One Pair              1,098,240        1098240/2598960 = 0.42256903
Remainder             1,302,540        1302540/2598960 = 0.50117740

Exercise

Figure out the math used to compute the # of combinations associated with each poker hand.

Permalink 

Statistical distance between two points [Stats & Prob]
Posted on October 6, 2006 @ 03:32:29 AM by Paul Meagher

Here is some code that calculates the general statistical distance between two points:

<?php 
// Statistical distance between points P and Q 
// when variables are correlated

// 2D Points P & Q

$P = array(1, 2); 
$Q = array(3, 3); 

// Coefficients to be estimated

$a11 = ?; 
$a12 = ?; 
$a22 = ?; 

$sum = $a11 * pow($P[0]-$Q[0], 2) + 
       $a12 * ($P[0]-$Q[0]) * ($P[1]-$Q[1]) + 
       $a22 * pow($P[1]-$Q[1], 2); 

$d = sqrt($sum); 

echo $d;
?>

Estimating the values of a11, a12, and a22 involves some further calculations. Here is one of them:

term1 = cos^2(θ) / (cos^2(θ)s11 + 2 sin(θ)cos(θ)s12 + sin^2(θ)s22)
term2 = sin^2(θ) / (cos^2(θ)s22 + 2 sin(θ)cos(θ)s12 + sin^2(θ)s11)
a11 = term1 + term2

Permalink 

Determining whether two vectors are orthogonal or not [Stats & Prob]
Posted on September 25, 2006 @ 04:10:11 AM by Paul Meagher

Show that these two vectors are orthogonal:

v = (3, 0, 1, 0 , 4, - 1)

w = (-2, 5, 0, 2, -3, -18)

To answer the "orthogonality" question, we compute the inner product (dot product) of the two vectors; if the sum is 0, or close to 0 relative to the size of the vector components, then the two vectors can be considered orthogonal:

(3 x -2) + (0 x 5) + (1 x 0) + (0 x 2) + (4 x -3) + (-1 x -18) = ?

-6 + 0 + 0 + 0 -12 + 18 = 0
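The same check in code (a minimal sketch; the function name is mine):

```php
<?php
// Dot product of two equal-length vectors; a result of zero
// means the vectors are orthogonal.
function dot($v, $w) {
  $sum = 0;
  for ($i = 0; $i < count($v); $i++)
    $sum += $v[$i] * $w[$i];
  return $sum;
}

$v = array(3, 0, 1, 0, 4, -1);
$w = array(-2, 5, 0, 2, -3, -18);

echo dot($v, $w); // 0, so v and w are orthogonal
?>
```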

Permalink 

Calculating ArrayDistance using the Mahalanobis distance metric [Stats & Prob]
Posted on July 24, 2006 @ 01:55:19 PM by Paul Meagher

I've been doing some research on how to calculate the "Mahalanobis distance" between two vectors. The internet resources I could find on this subject did not yield any definitive conclusions about how to implement this calculation. In the end, I decided to work from a conceptual definition that made sense to me (sum of z-score distances) and which is fully documented in the code below. I should note that another way to set up the calculation would be to "hand in" the mean and standard deviation values instead of calculating these estimates using one of the supplied vectors. Given the comparative nature of the ArrayDistance class (instantiated using X and Y Array data types), it made sense for me to use one of the supplied vectors to calculate the mean and standard deviation estimates to use in computing z scores.

<?php

/**
* @package Array_Distance
* @class Array_Distance.php
*/
class ArrayDistance {

  var $n = 0;
  
  var $X = array();
  var $Y = array();  

  function ArrayDistance($X, $Y) {
    $this->n = count($X);
    if (($this->n != 0) AND ($this->n == count($Y))) { 
      $this->X = $X;
      $this->Y = $Y;        
      return true;
    } else 
      die("Size of X and Y arrays must be the same");      
  }

  function mass($vec) { 
    $sum = 0; 
    for ($i = 0; $i < $this->n; $i++)  
      $sum += $vec[$i]; 
    return $sum; 
  } 

  function mean($vec) { 
    return $this->mass($vec) / $this->n; 
  } 

  function variance($vec) { 
    $mean = $this->mean($vec); 
    $sum  = 0.0; 
    for ($i = 0; $i < $this->n; $i++) 
      $sum += pow($vec[$i] - $mean, 2); 
    return $sum / ($this->n - 1); 
  } 

  function stdev($vec) { 
    return sqrt($this->variance($vec)); 
  } 
  
  /**
  * Mahalanobis distance interpreted as sum of z scores using 
  * X or Y vector to normalize the distances.
  *
  * If normalized using X, we calculate z-scores as follows:
  *
  * z-score[i] = (Y[i] - u(X)) / stdev(X)
  *
  * If normalized using Y, we calculate z-scores as follows:
  *
  * z-score[i] = (X[i] - u(Y)) / stdev(Y)
  *
  * A Mahalanobis distance is conceptually an estimator of the distance 
  * between a random vector and a Gaussian distribution specified using 
  * mass and dispersion indices (i.e., mean and stdev) - which are the set 
  * of "sufficient" statistics for specifying a Gaussian distribution for
  * estimation purposes.
  *
  * @see http://en.wikipedia.org/wiki/Mahalanobis_distance
  */
  function mahalanobis($normalize_using="y") {
    if ($normalize_using == "y") {
      $vec   = $this->X;          
      $mean  = $this->mean($this->Y); 
      $stdev = $this->stdev($this->Y);    
    } else {
      $vec   = $this->Y;
      $mean  = $this->mean($this->X); 
      $stdev = $this->stdev($this->X);    
    } 
    $sum = 0;
    for ($i = 0; $i < $this->n; $i++) 
      $sum += abs($vec[$i] - $mean) / $stdev;
    return $sum;
  }
    
  function mahal($normalize_using="y") {    
    return $this->mahalanobis($normalize_using);
  }
  
}

$X = array(1, 2, 3);
$Y = array(2, 4, 6);

$d = new ArrayDistance($X, $Y);

echo "The Mahalanobis distance between two vectors ";
echo "normalized using x is " . $d->mahalanobis("x") . "<br />";

echo "The Mahalanobis distance between two vectors ";
echo "normalized using y is " . $d->mahalanobis("y") . "<br />";

// The Mahalanobis distance between two vectors normalized using x is 6
// The Mahalanobis distance between two vectors normalized using y is 3
?>  

Perhaps something approaching a t-score estimate of distance would be better when supplying a small (<25) number of X and Y observations.

A convenience method called mahal is provided as shorthand for the mahalanobis method.

Many of the statistical methods invoked in this class started out life in Daniel Lemire and Mark Hale's JSci/math/ArrayMath.java class.

Permalink 

php/Math Project
© 2011. All rights reserved.