php/Math   
Recreational Mathematics   

Information Retrieval Class [Data Mining]
Posted on June 5, 2008 @ 02:21:07 PM by Paul Meagher

As I reflected on how I would begin computing summary statistics over an inverted index (which I discussed and implemented in my last post), I realized that I would have to turn my script into a class in order to easily and efficiently pass the inverted index around to the statistical functions. Here is what I have developed so far:

<?php

/**
* Information Retrieval
*
* Class used to explore information retrieval theory and concepts.
*/

define("DOC_ID", 0);
define("TERM_POSITION", 1);

class IR {

  public $num_docs = 0;

  public $corpus_terms = array();

  /*
  * Show Documents
  *
  * Helper function that shows the contents of your corpus documents.
  *
  * @param array $D document corpus as array of strings
  */
  function show_docs($D) {
    $ndocs = count($D);
    for ($doc_num = 0; $doc_num < $ndocs; $doc_num++) {
      ?>
      <p>
      Document #<?php echo ($doc_num + 1); ?>:<br />
      <?php echo $D[$doc_num]; ?>
      </p>
      <?php
    }
  }

  /*
  * Create Index
  *
  * Creates an inverted index from the supplied corpus documents.
  * Inverted index stored in corpus_terms array.
  *
  * @param array $D document corpus as array of strings
  */
  function create_index($D) {
    $this->num_docs = count($D);
    for ($doc_num = 0; $doc_num < $this->num_docs; $doc_num++) {
      // zero array containing document terms
      $doc_terms = array();
      // simplified word tokenization process
      $doc_terms = explode(" ", $D[$doc_num]);
      // here is where the indexing of terms to document locations happens
      $num_terms = count($doc_terms);
      for ($term_position = 0; $term_position < $num_terms; $term_position++) {
        $term = strtolower($doc_terms[$term_position]);
        $this->corpus_terms[$term][] = array($doc_num, $term_position);
      }
    }
  }

  /*
  * Show Index
  *
  * Helper function that outputs inverted index in a standard format.
  */
  function show_index() {
    // sort by key for alphabetically ordered output
    ksort($this->corpus_terms);
    // output a representation of the inverted index
    foreach ($this->corpus_terms AS $term => $doc_locations) {
      echo "<b>$term:</b> ";
      foreach ($doc_locations AS $doc_location)
        echo "{" . $doc_location[DOC_ID] . ", " . $doc_location[TERM_POSITION] . "} ";
      echo "<br />";
    }
  }

  /*
  * Term Frequency
  *
  * @param string $term
  * @return frequency of term in corpus
  */
  function tf($term) {
    $term = strtolower($term);
    return count($this->corpus_terms[$term]);
  }

  /*
  * Number Documents With
  *
  * @param string $term
  * @return number of documents with term
  */
  function ndw($term) {
    $term = strtolower($term);
    $doc_locations = $this->corpus_terms[$term];
    $num_locations = count($doc_locations);
    $docs_with_term = array();
    // key by document id so repeated postings from the same
    // document are only counted once
    for ($i = 0; $i < $num_locations; $i++)
      $docs_with_term[$doc_locations[$i][DOC_ID]] = 1;
    return count($docs_with_term);
  }

  /*
  * Inverse Document Frequency
  *
  * @param string $term
  * @return inverse document frequency of term
  */
  function idf($term) {
    return log($this->num_docs / $this->ndw($term));
  }

}

?>

Here is a script I developed to test the methods of the IR class:

<?php

include "IR.php";

$D[0] = "Shipment of gold delivered in a fire";
$D[1] = "Delivery of silver arrived in a silver truck";
$D[2] = "Shipment of gold arrived in a truck";

$ir = new IR();

echo "<p><b>Corpus:</b></p>";
$ir->show_docs($D);

$ir->create_index($D);

echo "<p><b>Inverted Index:</b></p>";
$ir->show_index();

$term = "silver";
$tf  = $ir->tf($term);
$ndw = $ir->ndw($term);
$idf = $ir->idf($term);
echo "<p>";
echo "Term Frequency of '$term' is $tf<br />";
echo "Number Of Documents with $term is $ndw<br />";
echo "Inverse Document Frequency of $term is $idf";
echo "</p>";

?>

Here is the output that the script generates:

Corpus:

Document #1:
Shipment of gold delivered in a fire

Document #2:
Delivery of silver arrived in a silver truck

Document #3:
Shipment of gold arrived in a truck

Inverted Index:

a: {0, 5} {1, 5} {2, 5}
arrived: {1, 3} {2, 3}
delivered: {0, 3}
delivery: {1, 0}
fire: {0, 6}
gold: {0, 2} {2, 2}
in: {0, 4} {1, 4} {2, 4}
of: {0, 1} {1, 1} {2, 1}
shipment: {0, 0} {2, 0}
silver: {1, 2} {1, 6}
truck: {1, 7} {2, 6}

Term Frequency of 'silver' is 2
Number Of Documents with silver is 1
Inverse Document Frequency of silver is 1.0986122886681

The next step in this exploration of Information Retrieval Theory will be to develop a table that better summarizes the term frequency and inverse document frequency values. I will use this table to verify that the results I'm generating match the tf-idf calculations found in the literature for these sample documents.
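As a preview of that table, here is a minimal standalone sketch of a tf-idf score computed from an inverted index of {doc_id, position} postings. The tfidf() helper and its inputs are my own illustration, not methods of the IR class:

```php
<?php
// Sketch: tf-idf for a term in one document, computed from an inverted
// index whose postings are array($doc_id, $term_position) pairs.
function tfidf($corpus_terms, $num_docs, $term, $doc_id) {
    $postings = isset($corpus_terms[$term]) ? $corpus_terms[$term] : array();
    $tf = 0;                    // occurrences of $term in document $doc_id
    $docs_with_term = array();  // distinct documents containing $term
    foreach ($postings as $posting) {
        if ($posting[0] == $doc_id) $tf++;
        $docs_with_term[$posting[0]] = 1;
    }
    $ndw = count($docs_with_term);
    if ($ndw == 0) return 0.0;
    return $tf * log($num_docs / $ndw);
}

// "silver" occurs twice in document 1 and in no other document (see the
// inverted index output above), so its tf-idf there is 2 * log(3/1).
$index = array("silver" => array(array(1, 2), array(1, 6)));
echo tfidf($index, 3, "silver", 1);  // ~2.1972
```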

Permalink 

Information Retrieval Theory: Implementing an Inverted Index [Data Mining]
Posted on June 3, 2008 @ 01:29:29 PM by Paul Meagher

Last night I downloaded a local copy of this book:

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.

Most introductions to Information Retrieval begin by talking about the "Inverted Index" data structure. I had come across this concept before; however, I hadn't explored it in PHP code because I got hung up on the issues of which corpus to use and, more importantly, which word tokenization approach to use for the documents. Through my surfing around on the topic of IR, I observed that you can bypass these issues by simplifying the corpus selection and word tokenization steps while retaining the essential representational and computational issues. The code below illustrates a common simplification of both the corpus concept and the word tokenization concept that might similarly allow you to get started exploring some IR concepts.

<?php

/**
* Creates an inverted index and shows what it looks like.  Useful for
* exploring TF-IDF calculations in Information Retrieval theory.
*
* @see http://en.wikipedia.org/wiki/Inverted_index
*/

define("DOC_ID", 0);
define("TERM_POSITION", 1);

$D[0] = "Shipment of gold delivered in a fire";
$D[1] = "Delivery of silver arrived in a silver truck";
$D[2] = "Shipment of gold arrived in a truck";

$corpus_terms = array();

$num_docs = count($D);

for ($doc_num = 0; $doc_num < $num_docs; $doc_num++) {

  // zero array containing document terms
  $doc_terms = array();

  // simplified word tokenization process
  $doc_terms = explode(" ", $D[$doc_num]);

  // here is where the indexing of terms to document locations happens
  $num_terms = count($doc_terms);
  for ($term_position = 0; $term_position < $num_terms; $term_position++)
    $corpus_terms[strtolower($doc_terms[$term_position])][] = array($doc_num, $term_position);

}

// sort by key for alphabetically ordered output
ksort($corpus_terms);

// output a representation of the inverted index
foreach ($corpus_terms AS $term => $doc_locations) {
  echo "<b>$term:</b> ";
  foreach ($doc_locations AS $doc_location)
    echo "{" . $doc_location[DOC_ID] . ", " . $doc_location[TERM_POSITION] . "} ";
  echo "<br />";
}

?>

The output of this code looks like this:

a: {0, 5} {1, 5} {2, 5}
arrived: {1, 3} {2, 3}
delivered: {0, 3}
delivery: {1, 0}
fire: {0, 6}
gold: {0, 2} {2, 2}
in: {0, 4} {1, 4} {2, 4}
of: {0, 1} {1, 1} {2, 1}
shipment: {0, 0} {2, 0}
silver: {1, 2} {1, 6}
truck: {1, 7} {2, 6}

Sometimes it is useful to jump ahead in your exploration of certain topics and avoid, for example, the gritty details of document tokenization. What may seem like avoidance of the problem might end up informing you about important subtleties in the requirements for your information retrieval system. Exploratory inverted index code can help you to see the computational consequences of representing an inverted index in a certain manner.

For me, the next step in exploring IR theory is to explore in code the Term Frequency–Inverse Document Frequency calculations, generally abbreviated as TF-IDF. Tim Munroe illustrates some of the calculations involved; however, I will need to get my calculations working with the inverted index data structure I've developed so far.

Permalink 

Correlation Matrix [Data Mining]
Posted on March 23, 2008 @ 12:16:14 PM by Paul Meagher

CorrelationMatrix.php is a class that can be used to compute a correlation matrix given a data matrix. A data matrix can be obtained by querying a database table or collection of tables and putting the resulting values into a two-dimensional data array. You could then run the class below on your two-dimensional data array.

<?php
/**
* Creates a correlation matrix from a given data matrix.
*
* @see http://lib.stat.cmu.edu/multi/pca.c
*
* @author   Paul Meagher
* @version  0.2
* @modified Mar 23, 2008
*/
class CorrelationMatrix {

  private $eps = 0.005;

  public $nrows = 0;
  public $ncols = 0;

  public $means   = array();
  public $stddevs = array();
  public $cormat  = array();

  function CorrelationMatrix($data) {

    // num rows
    $this->nrows = count($data);

    // num cols
    $this->ncols = count($data[0]);

    // Determine mean of column vectors of input data matrix
    for ($j = 0; $j < $this->ncols; $j++) {
      $this->means[$j] = 0.0;
      for ($i = 0; $i < $this->nrows; $i++)
        $this->means[$j] += $data[$i][$j];
      $this->means[$j] /= $this->nrows;
    }

    // Determine standard deviations of column vectors of data matrix.
    for ($j = 0; $j < $this->ncols; $j++) {
      $this->stddevs[$j] = 0.0;
      for ($i = 0; $i < $this->nrows; $i++)
        $this->stddevs[$j] += (($data[$i][$j] - $this->means[$j]) * ($data[$i][$j] - $this->means[$j]));
      $this->stddevs[$j] /= $this->nrows;
      $this->stddevs[$j] = sqrt($this->stddevs[$j]);
      // The following is an inelegant but usual way to handle
      // near-zero stddev values, which would cause a zero-
      // divide error.
      if ($this->stddevs[$j] <= $this->eps)
        $this->stddevs[$j] = 1.0;
    }

    // Center and reduce the column vectors.
    for ($i = 0; $i < $this->nrows; $i++) {
      for ($j = 0; $j < $this->ncols; $j++) {
        $data[$i][$j] -= $this->means[$j];
        $x = sqrt($this->nrows);
        $x *= $this->stddevs[$j];
        $data[$i][$j] /= $x;
      }
    }

    // Calculate the m * m correlation matrix.
    for ($j1 = 0; $j1 < $this->ncols - 1; $j1++) {
      $this->cormat[$j1][$j1] = 1.0;
      for ($j2 = $j1 + 1; $j2 < $this->ncols; $j2++) {
        $this->cormat[$j1][$j2] = 0.0;
        for ($i = 0; $i < $this->nrows; $i++)
          $this->cormat[$j1][$j2] += ($data[$i][$j1] * $data[$i][$j2]);
        $this->cormat[$j2][$j1] = $this->cormat[$j1][$j2];
      }
    }

    $this->cormat[$this->ncols - 1][$this->ncols - 1] = 1.0;

  }

  function printMeans() {
    printf("Column Means: <br />");
    for ($j = 0; $j < $this->ncols; $j++)
      printf("%7.2f", $this->means[$j]);
    printf("<br />");
  }

  function printStdDevs() {
    printf("Column Standard Deviations: <br />");
    for ($j = 0; $j < $this->ncols; $j++)
      printf("%7.2f", $this->stddevs[$j]);
    printf("<br />");
  }

  function printCorMat() {
    printf("Correlation Matrix: <br />");
    for ($i = 0; $i < $this->ncols; $i++) {
      for ($j = 0; $j < $this->ncols; $j++)
        printf("%7.4f", $this->cormat[$i][$j]);
      printf("<br />");
    }
  }

}

?>

Here is a test script that demonstrates usage and generates some output:

<?php

include "CorrelationMatrix.php";

$data[0] = array(1, 2, 3);
$data[1] = array(2, 3, 4);
$data[2] = array(3, 4, 5);
$data[3] = array(4, 6, 8);
$data[4] = array(5, 8, 10);

$cm = new CorrelationMatrix($data);
$cm->printMeans();
$cm->printStdDevs();
$cm->printCorMat();

?>

The output looks like this:

Column Means:
3.00 4.60 6.00
Column Standard Deviations:
1.41 2.15 2.61
Correlation Matrix:
1.0000 0.9848 0.9762
0.9848 1.0000 0.9970
0.9762 0.9970 1.0000
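To spot-check an entry in the matrix, a direct Pearson correlation of two data columns should agree with the class output. A minimal standalone sketch (the pearson() function is my own, not part of CorrelationMatrix):

```php
<?php
// Population Pearson correlation between two equal-length columns
// (deviations from the column means; normalization factors cancel).
function pearson($x, $y) {
    $n = count($x);
    $mx = array_sum($x) / $n;
    $my = array_sum($y) / $n;
    $sxy = 0.0; $sxx = 0.0; $syy = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $sxy += ($x[$i] - $mx) * ($y[$i] - $my);
        $sxx += ($x[$i] - $mx) * ($x[$i] - $mx);
        $syy += ($y[$i] - $my) * ($y[$i] - $my);
    }
    return $sxy / sqrt($sxx * $syy);
}

// Columns 1 and 2 of the test data above agree with cormat[0][1]:
echo round(pearson(array(1, 2, 3, 4, 5), array(2, 3, 4, 6, 8)), 4);  // 0.9848
```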

Permalink 

Computing binary item similarity [Data Mining]
Posted on December 5, 2006 @ 04:39:35 PM by Paul Meagher

An intermediate stage in computing binary item similarity is to cross-tabulate the number of 1-1, 1-0, 0-1, and 0-0 pairs denoted by a, b, c, and d respectively in the table below:

                  Item j
                  1     0
Item i    1       a     b
          0       c     d

Computing binary item similarity is not simply a matter of counting the number of a's and d's in the table above. For example, we may want to weight 1-1 (or "a") matches greater than 0-0 (or "d") matches or compute various ratios of matches to mismatches.

Richard A. Johnson & Dean W. Wichern (2002) have compiled a list of weighting schemes for binary item similarity which is reproduced below with minor changes (note that p=a+b+c+d):

   Weighting Scheme            Rationale
1. (a+d)/p                     Equal weights for 1-1 matches and 0-0 matches
2. 2(a+d)/(2(a+d)+b+c)         Double weights for 1-1 matches and 0-0 matches
3. (a+d)/(a+d+2(b+c))          Double weights for unmatched pairs
4. a/p                         No 0-0 matches in numerator
5. a/(a+b+c)                   No 0-0 matches in numerator or denominator. The 0-0 matches are treated as irrelevant.
6. 2a/(2a+b+c)                 No 0-0 matches in numerator or denominator. Double weight for 1-1 matches.
7. a/(a+2(b+c))                No 0-0 matches in numerator or denominator. Double weight for unmatched pairs.
8. a/(b+c)                     Ratio of matches to mismatches with 0-0 matches excluded.
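Schemes 1 and 5 in this list are the classical simple matching and Jaccard coefficients. As a standalone illustration (these helper functions are mine, not part of the class skeleton below):

```php
<?php
// Scheme 1: simple matching coefficient, (a+d)/p with p = a+b+c+d.
function simple_matching($a, $b, $c, $d) {
    return ($a + $d) / ($a + $b + $c + $d);
}

// Scheme 5: Jaccard coefficient, a/(a+b+c); 0-0 matches are ignored.
function jaccard($a, $b, $c, $d) {
    return $a / ($a + $b + $c);
}

// One 1-1 match, one 0-1 mismatch, one 0-0 match:
echo simple_matching(1, 0, 1, 1);  // 2/3
echo jaccard(1, 0, 1, 1);          // 0.5
```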

If we wanted to create a class that calculated binary item similarity according to these different weighting schemes then the class skeleton might look like this:

<?php
/**
* Class to compute binary item similarity according to
* eight different weighting schemes
*/
class BinaryItemSimilarity {

  // store the contingency table so all
  // methods can access it
  var $ContingencyTable = array();

  // instantiate class with contingency table values
  function BinaryItemSimilarity($ContingencyTable) {}

  function useEqualWeightForMatches() {}

  function useDoubleWeightForMatches() {}

  function useDoubleWeightForNonMatches() {}

  function useNoZeroMatchesInNumerator() {}

  function useNoZeroMatches() {}

  function useNoZeroMatchesWithDoubleWeightMatches() {}

  function useNoZeroMatchesWithDoubleWeightNonMatches() {}

  function useRatioOfMatchesToNonMatchesExcludingZeroMatches() {}

}
?>

Exercises

1. Complete the implementation of the BinaryItemSimilarity class.

2. Do you think the API for this class is too verbose? Is a self-documenting API a good thing even when method names get this long?

3. Are there alternate names for these weighting schemes? Would it be better to use these names instead of these long method names?

Permalink 

Binary similarity for items and attributes [Data Mining]
Posted on December 4, 2006 @ 04:05:13 PM by Paul Meagher

Binary similarity can be computed between items or attributes.

Binary item similarity can be measured by defining a collection of binary attributes for each item and comparing items on whether or not they share the same binary value for each attribute:

                   Attributes
         Attribute 1   Attribute 2   Attribute 3
Item 1        1             0             0
Item 2        1             1             0
Item 3        0             1             1

In this example, we compare item 1 and item 2 and observe there is a 1-1 match, a 0-0 match, and a 0-1 mismatch.

Binary attribute similarity measures similarity between data columns rather than data rows:

                   Attributes
         Attribute 1   Attribute 2   Attribute 3
Item 1        1             0             0
Item 2        1             1             0
Item 3        1             0             0

In this example, we compare attribute 1 and attribute 2 and observe there is a 1-1 match and two 1-0 mismatches.
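Counting the four pair types programmatically is straightforward. Here is a minimal sketch (the match_counts() helper is my own invention for illustration):

```php
<?php
// Count 1-1 (a), 1-0 (b), 0-1 (c), and 0-0 (d) pairs between two
// equal-length binary vectors.
function match_counts($u, $v) {
    $counts = array('a' => 0, 'b' => 0, 'c' => 0, 'd' => 0);
    for ($i = 0; $i < count($u); $i++) {
        if ($u[$i] == 1 && $v[$i] == 1) $counts['a']++;
        elseif ($u[$i] == 1 && $v[$i] == 0) $counts['b']++;
        elseif ($u[$i] == 0 && $v[$i] == 1) $counts['c']++;
        else $counts['d']++;
    }
    return $counts;
}

// Items 1 and 2 from the first table: a=1, b=0, c=1, d=1.
print_r(match_counts(array(1, 0, 0), array(1, 1, 0)));
```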

Permalink 

Machine independent comparison accuracy [Data Mining]
Posted on November 25, 2006 @ 03:17:45 AM by Paul Meagher

The Weka Data Mining library is perhaps the most impressive collection of open-source software for learning and doing Data Mining (along with the R project). Weka ships with an excellent textbook as well. If you examine the source, you will quickly notice that the authors don't rely upon Java's native numerical comparisons. Instead, they have chosen to "roll their own" comparison functions, which are located in the Utils.java class. I ported a subset of this class to PHP - the subset dealing with basic numerical comparisons:

<?php
/**
* Subset of the java-based Utils class from the Weka data mining library.
*
* @see http://www.koders.com/java/fidCCCC334D9F2AED4BBB7CF993FF7A5C4E99F4F05C.aspx
*
* You can use this class to ensure that your comparison results are
* not machine dependent.
*/
class Utils {

  /**
  * The small deviation allowed in floating point comparisons.
  */
  const SMALL = 1e-6;

  /**
  * Tests if a is equal to b.
  *
  * @param float $a
  * @param float $b
  * @return boolean
  */
  public static function eq($a, $b) {
    return ((float)$a - (float)$b < Utils::SMALL) && ((float)$b - (float)$a < Utils::SMALL);
  }

  /**
  * Tests if a is smaller or equal to b.
  *
  * @param float $a
  * @param float $b
  * @return boolean
  */
  public static function smOrEq($a, $b) {
    return ((float)$a - (float)$b < Utils::SMALL);
  }

  /**
  * Tests if a is greater or equal to b.
  *
  * @param float $a
  * @param float $b
  * @return boolean
  */
  public static function grOrEq($a, $b) {
    return ((float)$b - (float)$a < Utils::SMALL);
  }

  /**
  * Tests if a is smaller than b.
  *
  * @param float $a
  * @param float $b
  * @return boolean
  */
  public static function sm($a, $b) {
    return ((float)$b - (float)$a > Utils::SMALL);
  }

  /**
  * Tests if a is greater than b.
  *
  * @param float $a
  * @param float $b
  * @return boolean
  */
  public static function gr($a, $b) {
    return ((float)$a - (float)$b > Utils::SMALL);
  }

}
?>

Here is an example of usage:

<?php
include "Utils.php";

$x = 1.00001;

echo "$x == 1.0 ? ";
if (Utils::eq($x, 1.0)) echo "true"; else echo "false";
echo "<br />";

echo "$x > 1.0 ? ";
if (Utils::gr($x, 1.0)) echo "true"; else echo "false";
echo "<br />";

echo "$x >= 1.0 ? ";
if (Utils::grOrEq($x, 1.0)) echo "true"; else echo "false";
echo "<br />";

echo "$x < 1.0 ? ";
if (Utils::sm($x, 1.0)) echo "true"; else echo "false";
echo "<br />";

echo "$x <= 1.0 ? ";
if (Utils::smOrEq($x, 1.0)) echo "true"; else echo "false";
echo "<br />";

echo "==================<br />";

$x = 1.000001;

echo "$x == 1.0 ? ";
if (Utils::eq($x, 1.0)) echo "true"; else echo "false";
echo "<br />";

echo "$x > 1.0 ? ";
if (Utils::gr($x, 1.0)) echo "true"; else echo "false";
echo "<br />";

echo "$x >= 1.0 ? ";
if (Utils::grOrEq($x, 1.0)) echo "true"; else echo "false";
echo "<br />";

echo "$x < 1.0 ? ";
if (Utils::sm($x, 1.0)) echo "true"; else echo "false";
echo "<br />";

echo "$x <= 1.0 ? ";
if (Utils::smOrEq($x, 1.0)) echo "true"; else echo "false";
echo "<br />";
?>

Here is the output of the test script:

1.00001 == 1.0 ? false
1.00001 > 1.0 ? true
1.00001 >= 1.0 ? true
1.00001 < 1.0 ? false
1.00001 <= 1.0 ? false
==================
1.000001 == 1.0 ? true
1.000001 > 1.0 ? false
1.000001 >= 1.0 ? true
1.000001 < 1.0 ? false
1.000001 <= 1.0 ? true
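The motivation for the tolerance is easy to demonstrate with PHP's native comparisons: accumulated floating-point representation error makes exact equality unreliable. A small self-contained illustration using the same 1e-6 tolerance:

```php
<?php
// 0.1 + 0.2 is not exactly representable in binary floating point,
// so a native equality test fails...
var_dump(0.1 + 0.2 == 0.3);                  // bool(false)

// ...while a tolerance-based comparison succeeds.
$small = 1e-6;
var_dump(abs((0.1 + 0.2) - 0.3) < $small);   // bool(true)
```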

Exercise

If you were developing a data mining/machine learning library, do you think it would be a good idea to rely upon PHP's native comparison functions or would you use the above comparison functions?

Permalink 

Measuring interrater reliability [Data Mining]
Posted on November 14, 2006 @ 02:36:03 PM by Paul Meagher

The kappa function is commonly used as a measure of interrater reliability. The kappa functions below compute a kappa score given a 2x2 table of observed counts. These interrater "confusion" counts can be represented using either a one dimensional array (use the kappa1 function below):

array(count(no/no), count(no/yes), count(yes/no), count(yes/yes));

or a two dimensional array (use the kappa2 function below):

array(
  array(count(no/no), count(no/yes)), 
  array(count(yes/no), count(yes/yes))
);

<?php
/**
* Compute kappa for a 2x2 table.
*
* An index which compares the agreement against that which might be
* expected by chance. Possible values range from +1 (perfect agreement)
* to 0 (no agreement above that expected by chance) to -1 (complete
* disagreement).
*
* @see http://www.dmi.columbia.edu/homepages/chuangj/kappa/
*
* @param $table2x2 array of interrater no/yes agreements/disagreements
*
* @return $kappa statistic
*/
function kappa2($table2x2) {
  $row1 = $table2x2[0][0] + $table2x2[0][1];
  $row2 = $table2x2[1][0] + $table2x2[1][1];
  $col1 = $table2x2[0][0] + $table2x2[1][0];
  $col2 = $table2x2[0][1] + $table2x2[1][1];
  $n    = $col1 + $col2;
  $pObserved = ($table2x2[0][0] + $table2x2[1][1]) / $n;
  $pExpected = (($col1 * $row1) + ($col2 * $row2)) / ($n * $n);
  $kappa     = ($pObserved - $pExpected) / (1 - $pExpected);
  return $kappa;
}

/*
* @see http://bij.isi.uu.nl/ImageJ/api/index.html {BIJStats.java}
*/
function kappa1($table) {
  $g1 = $table[0] + $table[1];
  $g2 = $table[2] + $table[3];
  $f1 = $table[0] + $table[2];
  $f2 = $table[1] + $table[3];
  $n  = $f1 + $f2;
  $pObserved = ($table[0] + $table[3]) / $n;
  $pExpected = (($f1 * $g1) + ($f2 * $g2)) / ($n * $n);
  $kappa     = ($pObserved - $pExpected) / (1 - $pExpected);
  return $kappa;
}

$table2x2 = array(
  array(10, 7),
  array(0, 12)
);

echo kappa2($table2x2);
?>

The example values input and output by this script were taken from this discussion of the kappa statistic.
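Working that example through the formula by hand shows where the value comes from. The variable names here are mine; the arithmetic follows the kappa2 function above:

```php
<?php
// Hand-computation of kappa for the example 2x2 table above:
// a=10 (no/no), b=7 (no/yes), c=0 (yes/no), d=12 (yes/yes).
$a = 10; $b = 7; $c = 0; $d = 12;
$n = $a + $b + $c + $d;                    // 29 rated objects
$pObserved = ($a + $d) / $n;               // 22/29, observed agreement
$pExpected = ((($a + $b) * ($a + $c)) + (($c + $d) * ($b + $d))) / ($n * $n);
                                           // (17*10 + 12*19)/841, chance agreement
$kappa = ($pObserved - $pExpected) / (1 - $pExpected);
echo $kappa;                               // ~0.542
```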

There are two reasons why I am particularly interested in the kappa statistic:

1. The kappa statistic is a close cousin of the chi-square statistic, which I have written about on numerous occasions and developed math packages for. I have seen the kappa statistic used alongside the chi-square statistic to provide additional insight into tables of categorical count data.

2. An important aspect of the concept of "Collaborative Filtering" is the idea of "interrater reliability" - developing operational definitions of what it means to agree or disagree in your assessment of some object attribute. The kappa statistic is a good place to begin one's exploration of the mathematics of collaborative filtering.

Exercise

Generalize the algorithm for computing a "kappa" score so that it can be used on 3x3 tables and nxn tables.

Permalink 

Triangle Inequality Experiments using the Euclidean Metric [Data Mining]
Posted on July 27, 2006 @ 06:34:55 PM by Paul Meagher

Here is a stochastic proof that the Euclidean metric obeys the triangle inequality.

<?php

include "Array_Distance.php";

function randomIntegers($size, $lo, $hi) {
  $int_vector = array();
  for ($i = 0; $i < $size; $i++)
    $int_vector[] = mt_rand($lo, $hi);
  return $int_vector;
}

$num_experiments   = 10;
$num_dimensions    = 2;
$num_rule_matches  = 0;
$num_rule_failures = 0;
?>

<h1>Triangle Inequality Experiments using the Euclidean Metric</h1>
<pre>
d(x, z) <= d(x, y) + d(y, z) (triangle inequality)
</pre>

<?php
for ($e = 0; $e < $num_experiments; $e++) {

  $X = randomIntegers($num_dimensions, 0, 9);
  $Y = randomIntegers($num_dimensions, 0, 9);
  $Z = randomIntegers($num_dimensions, 0, 9);

  $d = new Array_Distance($X, $Z);
  $dXZ = $d->euclidean();

  $d = new Array_Distance($X, $Y);
  $dXY = $d->euclidean();

  $d = new Array_Distance($Y, $Z);
  $dYZ = $d->euclidean();

  if ($dXZ <= ($dXY + $dYZ)) {
    echo "Experiment #" . ($e + 1) . " obeys triangle inequality: ";
    echo $dXZ . "&lt;=(" . $dXY . " + " . $dYZ . ")<br/>";
    $num_rule_matches++;
  } else {
    echo "Does not obey triangle inequality<br />";
    $num_rule_failures++;
  }

}
?>

<h1>Results</h1>

<?php
echo "<b>Num Experiments: </b> $num_experiments <br />";
echo "<b>Num Rule Matches: </b> $num_rule_matches <br />";
echo "<b>Num Rule Failures: </b> $num_rule_failures <br />";
echo "<b>Rule Success Rate: </b> " . (($num_rule_matches / $num_experiments) * 100) . "%<br />";
?>

<h1>Results</h1>

<?php  
echo "<b>Num Experiments: </b> $num_experiments <br />";   
echo 
"<b>Num Rule Matches: </b> $num_rule_matches <br />";   
echo 
"<b>Num Rule Failures: </b> $num_rule_failures <br />";     
echo 
"<b>Rule Success Rate: </b> ".(($num_rule_matches/$num_experiments)*100)."%<br />";
?>

You can see the output of this script here.

I wonder what would happen if I substituted the minkowski($exp) method for the euclidean() method in the above test_triangle_inequality.php script?

Permalink 

Euclidean, cityblock, and minkowski distance metrics [Data Mining]
Posted on July 26, 2006 @ 05:41:51 PM by Paul Meagher

A vector distance function qualifies as a true metric if it satisfies the following relationships among the vectors:

1. d(x, y) >= 0 (non-negativity)
2. d(x, y) = 0  if and only if  x = y (identity)
3. d(x, y) = d(y, x) (symmetry)
4. d(x, z) <= d(x, y) + d(y, z)  (triangle inequality)

The Minkowski vector distance metric (i.e., minkowski($exp)) is a generalization of the Euclidean and Cityblock vector distance metrics achieved by passing in an "exponent" term (i.e., $exp).
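To make the generalization concrete, here is a small standalone sketch (plain functions over arrays, rather than the Array_Distance methods themselves) showing that Minkowski with exponents 1 and 2 reproduces the cityblock and Euclidean distances:

```php
<?php
// Standalone distance functions over two equal-length arrays.
function cityblock_d($X, $Y) {
    $sum = 0.0;
    for ($i = 0; $i < count($X); $i++)
        $sum += abs($X[$i] - $Y[$i]);
    return $sum;
}

function euclidean_d($X, $Y) {
    $sum = 0.0;
    for ($i = 0; $i < count($X); $i++)
        $sum += pow($X[$i] - $Y[$i], 2);
    return sqrt($sum);
}

function minkowski_d($X, $Y, $exp) {
    $sum = 0.0;
    for ($i = 0; $i < count($X); $i++)
        $sum += pow(abs($X[$i] - $Y[$i]), $exp);
    return pow($sum, 1 / $exp);
}

$X = array(1, 2, 3);
$Y = array(2, 4, 6);
echo minkowski_d($X, $Y, 1) . " vs " . cityblock_d($X, $Y) . "<br />";  // 6 vs 6
echo minkowski_d($X, $Y, 2) . " vs " . euclidean_d($X, $Y) . "<br />";  // both ~3.7417
```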

Here are the Array_Distance methods used to compute these three vector distance metrics:

<?php

function euclidean() {
  $sum = 0.0;
  for ($i = 0; $i < $this->n; $i++)
    $sum += pow($this->X[$i] - $this->Y[$i], 2);
  return sqrt($sum);
}

function cityblock() {
  $sum = 0.0;
  for ($i = 0; $i < $this->n; $i++)
    $sum += abs($this->X[$i] - $this->Y[$i]);
  return $sum;
}

function minkowski($exp) {
  $sum = 0.0;
  for ($i = 0; $i < $this->n; $i++)
    $sum += pow(abs($this->X[$i] - $this->Y[$i]), $exp);
  return pow($sum, 1 / $exp);
}

?>

Exercise 1: Construct a program that illustrates the idea/fact that the minkowski() function is a generalization of the cityblock() and euclidean() functions. Is the PHP program "proof" of this mathematical idea/fact?

Exercise 2: Construct a PHP program that tests whether the minkowski($exp) function satisfies the triangle inequality relationship. I will provide some code tomorrow that offers a stochastic answer.

Exercise 3: Discuss the relationship between space and distance. Is space "real" and/or "constructed"? Is distance "real" and/or "constructed"?

Permalink 

Chebychev and Canberra distance measures added [Data Mining]
Posted on July 25, 2006 @ 01:31:41 PM by Paul Meagher

Added the chebychev() and canberra() distance methods to the Array_Distance class.
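The released class has the definitive code; for reference, the standard formulas behind these two measures can be sketched as standalone functions. These helpers are my own illustration, not the Array_Distance methods themselves:

```php
<?php
// Chebychev distance: the maximum absolute coordinate difference.
function chebychev_d($X, $Y) {
    $max = 0.0;
    for ($i = 0; $i < count($X); $i++)
        $max = max($max, abs($X[$i] - $Y[$i]));
    return $max;
}

// Canberra distance: sum of |x-y| / (|x|+|y|), skipping terms where
// both coordinates are zero to avoid division by zero.
function canberra_d($X, $Y) {
    $sum = 0.0;
    for ($i = 0; $i < count($X); $i++) {
        $denom = abs($X[$i]) + abs($Y[$i]);
        if ($denom != 0)
            $sum += abs($X[$i] - $Y[$i]) / $denom;
    }
    return $sum;
}

echo chebychev_d(array(1, 2, 3), array(2, 4, 6)) . "<br />";  // 3
echo canberra_d(array(1, 2, 3), array(2, 4, 6)) . "<br />";   // 1/3 + 2/6 + 3/9 = 1
```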

Question: Are these two new distance methods "true" distance metrics? Do they obey the triangle inequality?

Partial Answer:

The triangle inequality is defined as:

d(u, v) <= d(u,w) + d(w, v) for all u, v, w vectors

Snagged this definition from The C Clustering Library (PDF) documentation which has a good discussion of distance metrics and applications.

Permalink 

Array_Distance download page [Data Mining]
Posted on July 24, 2006 @ 03:27:48 PM by Paul Meagher

I have set up an Array_Distance download page where the public can download and demo the Array_Distance class.

Prior to making this first public release of the class, I made sure I pre-assigned all values being computed to 0.0. Making these pre-assignments (e.g., $sum = 0.0) might help to ensure that PHP stores and manipulates these computed values using PHP's float data type.

Permalink 

More distance methods [Data Mining]
Posted on July 21, 2006 @ 12:17:24 PM by Paul Meagher

I added three more methods for measuring the "distance" between two vectors: covariance(), correlation(), and centroid(). To compute these distances I needed to add a few standard statistical computations (i.e., mass(), mean(), and variance()). These statistical methods will be useful when implementing further distance metrics.

There was also an error in the minkowski() method - I used a textbook with a wrong definition and corrected the issue when I did more research on this important distance metric. Note that the cityblock() and euclidean() metrics are special cases of the minkowski() metric (with exponents of 1 and 2 respectively).

Here is the latest version of the Distance object:

<?php

class Distance {

  var $n = 0;

  var $X = array();
  var $Y = array();

  function Distance($X, $Y) {
    $this->n = count($X);
    if (($this->n != 0) AND ($this->n == count($Y))) {
      $this->X = $X;
      $this->Y = $Y;
      return true;
    } else
      die("Size of X and Y arrays must be the same");
  }

  function mass($vec) {
    $sum = 0;
    for ($i = 0; $i < $this->n; $i++)
      $sum += $vec[$i];
    return $sum;
  }

  function mean($vec) {
    return $this->mass($vec) / $this->n;
  }

  function variance($vec) {
    $mean = $this->mean($vec);
    $sum  = 0.0;
    for ($i = 0; $i < $this->n; $i++)
      $sum += pow($vec[$i] - $mean, 2);
    return $sum / ($this->n - 1);
  }

  function euclidean() {
    $sum = 0;
    for ($i = 0; $i < $this->n; $i++)
      $sum += pow($this->X[$i] - $this->Y[$i], 2);
    return sqrt($sum);
  }

  function cityblock() {
    $sum = 0;
    for ($i = 0; $i < $this->n; $i++)
      $sum += abs($this->X[$i] - $this->Y[$i]);
    return $sum;
  }

  function minkowski($exp) {
    $sum = 0;
    for ($i = 0; $i < $this->n; $i++)
      $sum += pow(abs($this->X[$i] - $this->Y[$i]), $exp);
    return pow($sum, 1 / $exp);
  }

  function difference() {
    $sum = 0;
    for ($i = 0; $i < $this->n; $i++)
      if ($this->X[$i] != $this->Y[$i])
        $sum += 1;
    return $sum;
  }

  function covariance() {
    $x_mean = $this->mean($this->X);
    $y_mean = $this->mean($this->Y);
    $sum = 0.0;
    for ($i = 0; $i < $this->n; $i++)
      $sum += ($this->X[$i] - $x_mean) * ($this->Y[$i] - $y_mean);
    return $sum / ($this->n - 1);
  }

  function correlation() {
    $denom = sqrt($this->variance($this->X) * $this->variance($this->Y));
    if ($denom != 0)
      return ($this->covariance() / $denom);
    else {
      if (($this->variance($this->X) == 0) && ($this->variance($this->Y) == 0))
        return 1.0;
      else
        return 0.0;  // impossible to correlate a null signal with another
    }
  }

  function centroid() {
    return abs($this->mean($this->X) - $this->mean($this->Y));
  }

}

$X = array(1, 2, 3);
$Y = array(2, 4, 6);

$d = new Distance($X, $Y);

echo "The Cityblock distance between the vectors is " . $d->cityblock() . "<br />";
echo "The Minkowski distance between the vectors with exp=1 is " . $d->minkowski(1) . "<br />";
echo "The Euclidean distance between the vectors is " . $d->euclidean() . "<br />";
echo "The Minkowski distance between the vectors with exp=2 is " . $d->minkowski(2) . "<br />";
echo "The Difference distance between the vectors is " . $d->difference() . "<br />";
echo "The Covariance distance between the vectors is " . $d->covariance() . "<br />";
echo "The Correlation distance between the vectors is " . $d->correlation() . "<br />";
echo "The Centroid distance between the vectors is " . $d->centroid() . "<br />";

// The Cityblock distance between the vectors is 6
// The Minkowski distance between the vectors with exp=1 is 6
// The Euclidean distance between the vectors is 3.7416573867739
// The Minkowski distance between the vectors with exp=2 is 3.7416573867739
// The Difference distance between the vectors is 3
// The Covariance distance between the vectors is 2
// The Correlation distance between the vectors is 1
// The Centroid distance between the vectors is 2

?>

Permalink 

Some distance metrics [Data Mining]
Posted on July 18, 2006 @ 02:18:07 PM by Paul Meagher

Many data mining applications rely upon some form of distance metric to compute how similar one vector of numbers is to another. I decided to create a class to collect these distance metrics. So far I have added four distance metrics - three useful for continuous variables and the last one - the difference() method - useful for discrete variables.

<?php

class Distance {

  var $n = 0;

  var $X = array();
  var $Y = array();

  function Distance($X, $Y) {
    $this->n = count($X);
    if (($this->n != 0) AND ($this->n == count($Y))) {
      $this->X = $X;
      $this->Y = $Y;
      return true;
    } else
      die("Size of X and Y arrays must be the same");
  }

  function euclidean() {
    $sum = 0;
    for($i=0; $i<$this->n; $i++)
      $sum += pow($this->X[$i]-$this->Y[$i], 2);
    return sqrt($sum);
  }

  function cityblock() {
    $sum = 0;
    for($i=0; $i<$this->n; $i++)
      $sum += abs($this->X[$i]-$this->Y[$i]);
    return $sum;
  }

  function minkowski($exp) {
    $sum = 0;
    for($i=0; $i<$this->n; $i++)
      $sum += pow(abs($this->X[$i]-$this->Y[$i]), $exp);
    // the 1/$exp root is needed for a true Minkowski distance,
    // so that minkowski(2) agrees with euclidean()
    return pow($sum, 1/$exp);
  }

  function difference() {
    $sum = 0;
    for($i=0; $i<$this->n; $i++)
      if ($this->X[$i] != $this->Y[$i])
        $sum += 1;
    return $sum;
  }

}

$X = array(1, 2, 3);
$Y = array(2, 4, 6);

$d = new Distance($X, $Y);

echo "The Euclidean distance between the vectors is ".$d->euclidean()."<br />";
echo "The Cityblock distance between the vectors is ".$d->cityblock()."<br />";
echo "The Minkowski distance between the vectors with exp=1 is ".$d->minkowski(1)."<br />";
echo "The Difference distance between the vectors is ".$d->difference()."<br />";

// The Euclidean distance between the vectors is 3.7416573867739
// The Cityblock distance between the vectors is 6
// The Minkowski distance between the vectors with exp=1 is 6
// The Difference distance between the vectors is 3

?>  

There are a large number of distance metrics and I will add more to the class as I have the time. Let me know if you discover any errors in the above code or if there are any other vector distance metrics you would like to see added.
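One natural addition would be the Chebyshev (maximum coordinate) distance, which is the limiting case of the Minkowski distance as the exponent goes to infinity. Here is a sketch in the style of the class above (the `chebyshev` name and standalone-function form are my own; in the class it would loop over `$this->X` and `$this->Y`):

```php
<?php
// Chebyshev (maximum coordinate) distance between two equal-length
// vectors: the largest absolute coordinate difference. It is the
// limit of the Minkowski distance as $exp goes to infinity.
function chebyshev($X, $Y) {
  $max = 0;
  $n = count($X);
  for ($i = 0; $i < $n; $i++) {
    $diff = abs($X[$i] - $Y[$i]);
    if ($diff > $max)
      $max = $diff;
  }
  return $max;
}

echo chebyshev(array(1, 2, 3), array(2, 4, 6)); // prints 3
?>
```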

Permalink 

Exploratory Data Analysis in the 21st century [Data Mining]
Posted on June 21, 2006 @ 12:44:47 AM by Paul Meagher

Just finished reading a book called "Graphic Discovery" (2005) by Howard Wainer. It is a nice short book (the main content takes up 149 pages), and if you are interested in Edward Tufte's writings you will want to read this book as well for its gallery of interesting data graphics and commentary.

The final two chapters of the book are a look into the future of data graphics and were done in collaboration with data graphics guru John Tukey before he died in 2000. According to Tukey and Wainer, these are the four graphic tools that will be the focus of Exploratory Data Analysis (EDA) research in the 21st century:

  1. Rotatable Scatterplots
  2. Slicing Engine
  3. Nearness Engine
  4. Smoothing Engine

These are all interactive graphic discovery tools for multi-dimensional spaces. You will need to read the book to get a sense of how these tools might work; the exact nature of these tools and how they work is the subject of future research, so don't expect crisp definitions (otherwise I would have provided them above) but rather suggestive examples.
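To give a concrete flavor of just one of these tools, the simplest thing a smoothing engine might do is run a moving average over a noisy series. This is only a sketch of the underlying idea, under my own assumptions; a real smoothing engine would be interactive and offer far more sophisticated smoothers:

```php
<?php
// Minimal sketch of one smoothing idea: a centered moving average.
// A real smoothing engine would let you vary the window size and
// smoother type interactively while watching the plot update.
function moving_average($series, $window = 3) {
  $half = (int) floor($window / 2);
  $n = count($series);
  $smoothed = array();
  for ($i = 0; $i < $n; $i++) {
    // clamp the window at the edges of the series
    $lo = max(0, $i - $half);
    $hi = min($n - 1, $i + $half);
    $sum = 0;
    for ($j = $lo; $j <= $hi; $j++)
      $sum += $series[$j];
    $smoothed[] = $sum / ($hi - $lo + 1);
  }
  return $smoothed;
}

// moving_average(array(1, 5, 3)) returns array(3, 3, 4)
?>
```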

The book also contains numerous interesting and instructive tidbits, such as this one:

Abraham Wald, in some work he did during World War II, was trying to determine where to add extra armor to planes on the basis of the pattern of bullet holes in returning aircraft. His conclusion was to determine carefully where returning planes had been shot and put extra armor every place else! (p. 148)

Why do you think Wald did this?

Permalink 

An object-oriented and normalized "Survey" data type for php-based datamining applications [Data Mining]
Posted on June 13, 2006 @ 01:26:05 AM by Paul Meagher

I have done some implementation work on a "Survey" data frame object so my suggestion a few days ago about a fundamental "Survey" data type for php-based datamining can be fleshed out in some more detail.

Here are some of the "results" of my R & D on this concept to date.

I have implemented a "Survey" data object as a database-oriented data structure. A prototypical "Survey" data object has 1 database table that is used to hold survey data, and 3 other database tables that are used to hold metadata about this survey data table.

The names of these metadata tables are "Surveys", "SurveyQuestions", and "SurveyAnswers". This represents both an object-oriented decomposition of the domain of "Surveys" and a normalization of the database tables required to implement the data object. One might shorten the object names to "Surveys", "Questions", "Answers" to make everything appear more fundamental and elegant.

A "Survey" consists of a set of "Questions" where each "Question" has either a finite or infinite set of possible "Answers". The possible "Answers" to a given "Question" would be enumerated in the "Answers" table.

The "Survey" object would have methods for inserting/updating/deleting/selecting the Surveys/Questions/Answers metadata tables which would, in turn, modify the structure of the database table used to hold the raw data. If you add a question to a particular "Survey" this would have the side-effect of adding a column to the raw data table using the column name specified by the insert/updateQuestion method.

The "Survey" object would require more or less work to set up depending on how much Survey, Question and Answer metadata you wanted to specify for it. Once specified, your "Survey" object could potentially be used by many different PHP math applications that assume the data has a "Survey" structure and thus contains quite a bit of useful metadata that could be used 1) to reduce the burden of specifying the nature of the data set, 2) to provide useful labelling in reporting screens, and 3) to document the nature of the dataset for archival purposes.

Permalink 

Survey objects for data mining [Data Mining]
Posted on June 10, 2006 @ 01:27:26 AM by Paul Meagher

If you were developing a data mining platform using PHP, one question you might ask yourself is whether the platform needs to define a standard data object that could be used by all math library components to interrogate arbitrary data sets (in practice, the data would mostly reside in a MySQL database).

In R, an object called a "data.frame" serves this role.

I have been thinking of calling the PHP object that encapsulates arbitrary data sets a "Survey" object. The "Survey" metaphor is quite suggestive of many desirable features that a data structure useful for data mining should have. For example, a survey is a multidimensional information storage object, often with reflection/introspection features, that is designed to systematically collect and relate data about some phenomenon of interest. Surveys often use database tables as storage objects and have a strong columnar orientation/structure.

Permalink 

Setting up php-mysql datamining applications [Data Mining]
Posted on April 19, 2006 @ 03:16:52 AM by Paul Meagher

When sharing datamining code in an agile development context, a recurring issue is that you can't easily tar up your application because your configuration file contains sensitive information in the form of usernames and passwords for connecting to a real database. Inconvenient and inelegant archiving solutions need to be created to prevent your real usernames and passwords from shipping with your open-source datamining application.

In Protecting a MySQL user/password in a PHP script, Greg Beaver discusses various solutions, but the solution I like best is to use environment variables to configure the username and password values for your database connection:

  1. Add these settings to your .htaccess file (where mysqluser and mysqlpass are to be changed to appropriate values):

    SetEnv DBUSER mysqluser
    SetEnv DBPASS mysqlpass

  2. Test that the values of the environment variables can be accessed from a PHP script:

    <?php
    echo "<pre>";
    echo $_SERVER['DBUSER'];
    echo $_SERVER['DBPASS'];
    echo "</pre>";
    ?>

  3. You can then connect to your db like this:

    $db = mysqli_connect('localhost', $_SERVER['DBUSER'], $_SERVER['DBPASS']);

In the future, php/Math datamining applications will utilize a database connection setup like this to make sharing of php-mysql code easier.

Those of you who can't set .htaccess environment variables for your web directories could instead set the $_SERVER values via a config.php file included at the top of every datamining application script. This is less convenient to set up and tarball than using Apache environment variables to maintain your database username and password.
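A minimal config.php for that purpose might look like the following sketch. The DBUSER/DBPASS key names mirror the .htaccess example above; the placeholder values are, of course, to be replaced with your own:

```php
<?php
// config.php - a fallback for hosts where .htaccess SetEnv is not
// available. It sets the same $_SERVER keys the SetEnv approach
// provides; include this file at the top of every datamining script.
$_SERVER['DBUSER'] = 'mysqluser';   // change to your MySQL username
$_SERVER['DBPASS'] = 'mysqlpass';   // change to your MySQL password
?>
```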

Permalink 



php/Math Project
© 2011. All rights reserved.