php/Math   
Recreational Mathematics   
   home  |  library  |  contact
 Math Notes
 Math Programming [24]
 Regression [3]
 Data Mining [17]
 Notation [6]
 Linear Algebra [8]
 Stats & Prob [14]
 Math Cognition [5]
 Space & Physics [6]
 Formulas [5]
 Fun & Games [2]
 Haskell [1]
 Bayes Theory [1]
 Site News [0]
 Math Projects [4]
 Polynomials [1]
 Calculus [9]
 Number Theory [3]
 Optimization [2]

 Math Links
 PHP/ir
 Andrew Gelman
 Chance Wiki
 Daniel Lemire
 KD Knuggets
 Social Stats
 MySQL Performance
 Hunch.net
 Matthew Hurst
 JMLR
 JSS
 Hal Daume III
 Math Notes >> Recent

Elementwise operations [Math Programming
Posted on January 28, 2012 @ 06:33:46 AM by Paul Meagher

In my last post, I discussed how basic arithmetic operations might be vectorized using PHP. I wasn't very satisfied with the implemention. One issue was that I was duplicating alot of code and felt that the vectorization code could be reused across vectorized arithmetic operations better. Another issue was that I didn't study how matlab/octave elementwise operations worked in detail so didn't know what the allowable argument types were (scalar, vector, matrix) and what combination of argument types was allowed on both sides of the arithmetic operation. I'm happy to report that I've addressed both of these issues in the new implementation. While I could probably add more error handling, I think the code below could be used to easily vectorize many operations in PHP such as exp, sin, cos, pow, etc...

<?php
/**
* elementwise_operations.php
*
* Functions that mimic matlab/octave's elementwise operations (e.g., .+, .-, .*, ./):
*
*  .+  =  madd  =  element-by-element addition
*  .-  =  msub  =  element-by-element subtraction
*  .*  =  mmul  =  element-by-element multiplication
*  ./  =  mdiv  =  element-by-element division
*/

/**
* Determines whether a variable is a scalar, vector, or matrix
*/
function get_type($var) {  
  if (
is_numeric($var)) {
    return 
"scalar";
  } elseif (
is_array($var)) {
    if (
is_array($var[0])) 
      return 
"matrix";
    else 
      return 
"vector";
  } else 
    return 
false;
}

/**
* Function that vectorizes the elementary operations.
*/
function vectorize($a$b$op) {
  
  
$atype get_type($a);  
  
$btype get_type($b);

  if ((
$atype != false) AND ($btype != false))
    
$arguments $atype."-".$btype;
  else 
    die(
"Argument is not a scalar, vector or matrix.");
  
  switch (
$arguments) {
    
    case 
"scalar-scalar"
      return 
$op($a$b);
      break;

    case 
"vector-scalar"
      
$asize count($a);
      
$b     array_fill(0$asize$b);
      return 
array_map($op$a$b);      
      break;

    case 
"vector-vector":
      
$asize count($a);
      
$bsize count($b);      
      if (
$asize == $bsize
        return 
array_map($op$a$b);              
      else 
        die(
"Nonconformant arguments.");
      break; 
      
    case 
"matrix-scalar":        
      
$nrows count($a);
      
$ncols count($a[0]);      
      for(
$i=0$i<$nrows$i++) {
        
$bvec  array_fill(0$ncols$b);        
        
$c[$i] = array_map($op$a[$i], $bvec);           
      }  
      return 
$c;
      break;
      
    case 
"matrix-matrix":        
      
$arows count($a);
      
$acols count($a[0]); 
      
$brows count($b);
      
$bcols count($b[0]);       
      if ((
$arows == $brows) AND ($acols == $bcols)) {
        for(
$i=0$i<$arows$i++) 
          
$c[$i] = array_map($op$a[$i], $b[$i]);   
        return 
$c;                        
      } else 
        die(
"Nonconformant arguments.");      
      break;             
  }    
  
}


// Elementary arithmetic operations.

function _add($addend1$addend2) {
  return (
$addend1 $addend2);
}

function 
_sub($minuend$subtrahend) {
  return (
$minuend $subtrahend);
}

function 
_mul($factor1$factor2) {
  return (
$factor1 $factor2);
}

function 
_div($numerator$divisor) {
  return (
$numerator $divisor);
}


// Elementwise operators.

function madd($m$addend) {
  return 
vectorize($m$addend"_add");
}

function 
msub($m$subtrahend) {
  return 
vectorize($m$subtrahend"_sub");
}

function 
mmul($m$factor) {
  return 
vectorize($m$factor"_mul");
}

function 
mdiv($m$divisor) {
  return 
vectorize($m$divisor"_div");
}


// Test elementwise addition.

$s 5;
$v = array(123);
$m = array(array(123), array(456));

// scalar-scalar case
$ans madd($s$s);
print_r($ans);
echo 
"<br />";

// vector-scalar case
$ans madd($v$s);
print_r($ans);
echo 
"<br />";

// vector-vector case
$ans madd($v$v);
print_r($ans);
echo 
"<br />";

// matrix-scalar case
$ans madd($m$s);
print_r($ans);
echo 
"<br />";

// matrix-matrix case
$ans madd($m$m);
print_r($ans);

// What the output looks like:

// 10
// Array ( [0] => 6 [1] => 7 [2] => 8 )
// Array ( [0] => 2 [1] => 4 [2] => 6 )
// Array ( [0] => Array ( [0] => 6 [1] => 7 [2] => 8 ) [1] => Array ( [0] => 9 [1] => 10 [2] => 11 ) )
// Array ( [0] => Array ( [0] => 2 [1] => 4 [2] => 6 ) [1] => Array ( [0] => 8 [1] => 10 [2] => 12 ) ) 

?>
  

Permalink 

Vectorizing basic arithmetic operations [Math Programming
Posted on January 23, 2012 @ 12:45:37 AM by Paul Meagher

Continuing on with my research into adding better matrix/vector capabilities to PHP, I decided to implement the basic arithmetic functions (addition, subtraction, division, multiplication) as vectorized functions. To make these functions concise and easy to remember, I used a system of function naming that involves prefixing the function call with a "v" (for "vector") and then using the first three letters of the arithmetic operation name (add, sub, div, mul) as the suffix:

  1. vadd($v, $addend): vector addition
  2. vsub($v, $subtrahend): vector subtaction
  3. vdiv($v, $divisor): vector division
  4. vmul($v, $factor): vector multiplication

In all cases, the first argument is a vector of numbers, followed by the value to apply to each element of the vector. The code below is proof-of-concept of how to vectorize these functions and is not meant as production-level code:

<?php
/**
* Set of functions that vectorize the basic arithmetic functions:
* addition, subtraction, division, multiplication.  No error 
* checking.
*/

// function for vectorized addition

function _add($addend1$addend2) {
  return (
$addend1 $addend2);
}

function 
vadd($v$addend) {
  
$size      count($v);
  
$add_array array_fill(0$size$addend);
  return 
array_map("_add"$v$add_array);
}

// function for vectorized subtraction

function _sub($minuend$subtrahend) {
  return (
$minuend $subtrahend);
}

function 
vsub($v$subtrahend) {
  
$size      count($v);
  
$sub_array array_fill(0$size$subtrahend);
  return 
array_map("_sub"$v$sub_array);
}

// function for vectorized multiplication

function _mul($factor1$factor2) {
  return (
$factor1 $factor2);
}

function 
vmul($v$factor) {
  
$size      count($v);
  
$mul_array array_fill(0$size$factor);
  return 
array_map("_mul"$v$mul_array);
}

// function for vectorized division

function _div($numerator$divisor) {
  return (
$numerator $divisor);
}

function 
vdiv($v$divisor) {
  
$size      count($v);
  
$div_array array_fill(0$size$divisor);
  return 
array_map("_div"$v$div_array);
}

$v   = array(51015);
$arg 5;

$b vadd($v$arg);
print_r($b);
echo 
"<br />";

$b vsub($v$arg);
print_r($b);
echo 
"<br />";

$b vmul($v$arg);
print_r($b);
echo 
"<br />";

$b vdiv($v$arg);
print_r($b);
echo 
"<br />";

// Output:

// Array ( [0] => 10 [1] => 15 [2] => 20 )
// Array ( [0] => 0 [1] => 5 [2] => 10 )
// Array ( [0] => 25 [1] => 50 [2] => 75 )
// Array ( [0] => 1 [1] => 2 [2] => 3 ) 

?>

The alternative to vectorized functions is the ad-hoc creation of loops to perform similar vectorized operations in your code. Think how much programming time and effort has been wasted coding loops to do vectorized arithmetic when there could be vectorized arithmetic functions available to do it all in one simple command.

This only touches the surface of the number of mathematical functions that should be vectorized in PHP (e.g., vsin, vcos, vexp, etc...) to make PHP a much more efficient language to program numeric-based algorithms. We could easily add a "v" suffix to all these function to indicate that they are all vector-based.

Another option to consider would be to go even further and have these functions handle matrix operations, with vectors being just the special case of a one-dimensional matrix. In that case, we might want to instead prefix all the functions with an "m" instead of a "v". I had considered this, but was too lazy to implement functions to handle the matrix case. This is a good exercise for the reader.

At the end of the day we should look at languages like Matlab/Octave to give us guidance on how to proceed with respect to the range of functions to vectorize and whether we should handle vector or matrix arguments to these functions. While we may not be able to use Matlab/Octave's concise syntax to code vectorized algorithms, we can develop slightly more verbose functions like vadd, vsub, vdiv, and vmul to efficiently code numeric algorithms.

Permalink 

Matrix manipulation functions for PHP: Part 3 [Math Programming
Posted on January 14, 2012 @ 05:28:18 PM by Paul Meagher

Another useful function that matlab/octave supports is the linspace(start, end, num_points) function. The linspace function returns an array of n numbers that are equally spaced between the start and end points. A function like this is very useful when graphing an axis and when testing vectorized functions. You will see it used quite frequently in matlab/octave examples.

Below is an implementation that captures the behavior of Octave's linspace function. It also includes some code to test the output of the linspace function.

<?php

define
("PRECISION"4);

function 
linspace($start$end$num_points) {

  
$format "%01.".PRECISION."f";

  
$start      = (float)$start;
  
$end        = (float)$end;  
  
$num_points = (int)$num_points 1;
  
  
$difference abs($end-$start);
 
  if (
$num_points <= 1)
    return array(
$end);
    
  if (
$num_points == 2
    return array(
$start$end);
  
  if (
$num_points >= 3) {

    if (
$difference 0) {

      
      
$increment sprintf($format$difference/$num_points);  
      
$increment $difference/$num_points;        
      
      
$A[0] = sprintf($format$start);
     
      for (
$i=1$i $num_points$i++) {
        
$A[$i] = sprintf($format, ($A[$i-1] + $increment));
        
$next_num $A[$i];
      }
        
      
$A[] = sprintf($format$end);

      return 
$A;

    } else {
    
      return 
array_fill(0$num_points+1$start);
      
    }  

  }

}

// Testing the output of the linspace function.

echo "<pre>";
echo 
"linspace(1, 5, 4) \n";
print_r(linspace(154));
echo 
"\n";
echo 
"linspace(1, 1.8, 4) \n";
print_r(linspace(11.84));
echo 
"\n";
echo 
"linspace(1, 1, 4) \n";
print_r(linspace(114));
echo 
"\n";
echo 
"linspace(5, 7, 0)\n";
print_r(linspace(570));
echo 
"</pre>";
?>

The output of this script looks like this:

linspace(1, 5, 4) 
Array
(
    [0] => 1.0000
    [1] => 2.3333
    [2] => 3.6666
    [3] => 5.0000
)

linspace(1, 1.8, 4) 
Array
(
    [0] => 1.0000
    [1] => 1.2667
    [2] => 1.5334
    [3] => 1.8000
)

linspace(1, 1, 4) 
Array
(
    [0] => 1
    [1] => 1
    [2] => 1
    [3] => 1
)

linspace(5, 7, 0)
Array
(
    [0] => 7
)

An interesting aspect of the behavior of the linspace function is that it handles odd cases in ways that one might think should throw an error. Instead of throwing an error when you enter, for example, linspace(5, 7, 0) - because you can't return 0 points - it instead tries to do something sensible and returns the end point (i.e., 7). Whether this is sensible or not I will leave you to decide, however, the larger point is that Octave seems biased towards returning a numeric result of some sort rather than an error result. If you don't know the detailed behavior of these functions you could end up with some nasty surprises in terms of the result returned; perhaps it would be better if an error gets returned rather than a numeric result in some cases. Nevertheless, when doing mathematical programming you should consider whether it is better to throw an error result in response to odd inputs or try to generate a sensible numeric answer by, for example, anticipating the type of mistakes a user might make when passing arguments to a function and handling this mistakes appropriately (e.g., the low and hi values might be in reverse order from what is expected, so reverse them, and give the answer based upon the corrected ordering of arguments). A good way to master Octave is to port some of it's matrix manipulation functions to PHP. To do so you will need to observe how the Octave function behaves with boundary cases; a type of study you probably would not have engaged in otherwise.

On another note, it is my hope that I can begin to revive this blog again in 2012. I don't have time to blog as frequently as I did in the past, however, I would like to try for at least one blog post per week - morely like on weekends when I have more time for recreational math, than on weekdays. This is one of my resolutions in 2012 so stay tuned to see if this pans out :-)

Permalink 

Matrix manipulation functions for PHP: Part 2 [Math Programming
Posted on January 7, 2012 @ 01:41:26 PM by Paul Meagher

In the last post, I bemoaned the fact that PHP lacks some useful matrix manipulation functions that matlab/octave support. I decided to see how difficult it would be to create a script that would 1) parse a matlab/octave-like recipe for creating a matrix, and 2) generate a matrix based upon that parse. I initially thought such work would require some tricky regular expressions for parsing the recipe, but it appears that the language for specifying matrices was designed to make parsing relatively easy and that major parts of the parsing could be handled by exploding on various control characters. I'm not claiming that the code below implements all of the matlab/octave language for specifying matrices, however, it does support much of functionality you will see discussed in matlab/octave books.

<?php
/**
* @script matrix.php
*
* Script for creating matrices using matlab/octave like recipes.
*
* @author Paul Meagher
* @version 0.1
* @modified Jan 6, 2012
*/


/**
* Returns a parse of the recipe for a desired matrix.
* The parse that is returned contains a "valid" flag 
* that is set to false (i.e., 0) if the number of 
* columns in each row is unequal.
*/
function parse_matrix_recipe($recipe) {
  
  
// Initialize parse validity status to 1. Parse validity 
  // status will be set to 0 if expression is invalid.    
  
$parse['valid'] = 1;       
  
  
$num_rows 0;
  
$num_cols 0;

  
// Get rid of excess whitespace.
  
$recipe preg_replace('/\s\s+/'' '$recipe);
      
  
// Get all the rows.
  
$rows explode(';'$recipe);
  
  
$num_rows count($rows);
  
  
// Parse all the rows  
  
for($i=0$i<$num_rows$i++) {
  
    
// Check first if the row has the : range delimiter.
    
if (substr_count($rows[$i], ":")) {
  
      list(
$start$limit) = explode(":"$rows[$i]);
        
      if (
settype($start"int") AND settype($limit"int")) {
    
        
$parse[$i]['start'] = $start;
        
$parse[$i]['limit'] = $limit;        
              
        
// Counting number of elements generated by range function
        // is an inefficient way of checking that all rows have
        // same number of columns, but it work.  Optimize later :-)
        
if ($i == 0
          
$init_num_cols $row_num_cols count(range($start$limit));
        else
          
$row_num_cols count(range($start,$limit));
                        
      } else {        
        
        
// Cannot convert range values to integers.
        
$parse['valid'] = 0;
        break;
        
      }
    
    } else {          
            
      
$parse[$i]['array'] = explode(" "$rows[$i]);
            
      if (
$i == 0
        
$init_num_cols $row_num_cols count($parse[$i]['array']);
      else
        
$row_num_cols count($parse[$i]['array']);
            
    }

    
// Exit loop and set parse to false if one row has more columns 
    // than another.
    
if ($init_num_cols != $row_num_cols) {
      
$parse['valid'] = 0;       
      break;
    }    
    
  }
  
  
// If parse is valid, set matrix dimensions.  
  
if ($parse['valid'] == 1) {
    
$parse['num_rows'] = $num_rows;
    
$parse['num_cols'] = $init_num_cols;  
  }
  
  return 
$parse;

}

/**
* Uses parse elements to generate and return a matrix
*/
function generate_matrix($parse) {
 
  
$mat = array();
 
  
$num_rows $parse['num_rows'];
  
  if (
$num_rows == 1) {    

    
// Assign supplied array to vector.
    
if (isset($parse[0]['array']))
      
$vec $parse[0]['array'];
    
// Assign specified range of values to vector.  
    
else 
      
$vec range($parse[0]['start'], $parse[0]['limit']);
      
    return 
$vec;
    
  } else {    
    
    for(
$i=0$i<$num_rows$i++) {        

      
// Assign supplied array to row. 
      
if (isset($parse[$i]['array']))      
        
$mat[$i] = $parse[$i]['array'];
      
// Assign range of values to row.  
      
else
        
$mat[$i] = range($parse[$i]['start'], $parse[$i]['limit']);
    }
    
    return 
$mat;
    
  }

}

/**
* Main function for creating matrices.
*/
function m($recipe) {  
  
  
$parse parse_matrix_recipe($recipe);
  
  if (
$parse['valid'] == 1
    return 
generate_matrix($parse); 
  else
    return array();
    
}

?>

As you can see above, the main function used to specify a matrix is the m( ) function. I envision a complementary submatrix operator, sm( ), that would be used to access a submatrix of a supplied matrix (i.e., sm($recipe, $A)). Matlab/octaves language for specifying a submatrix is more complex than the language for specifying a matrix, so coding the sm( ) function will be a bit more challenging. You don't really know how challenging, however, until you try ...

Below is a test script that shows that the matrix.php script outputs the appropriate matrices given matlab/octave like recipes.

<?php

include "matrix.php";

$recipe1 '-1:1; 3:1';
$recipe2 '1;2;3;4';
$recipe3 '1 2  3;4 5   6';
$recipe4 '1';
$recipe5 '1 2 3;4:6';

$A m($recipe1);
$B m($recipe2);
$C m($recipe3);
$D m($recipe4);
$E m($recipe5);

echo 
"<pre>";
echo 
"\$A = m('".$recipe1."') \n";
echo 
"\$A = \n";
print_r($A);
echo 
"\n";

echo 
"\$B = m('".$recipe2."') \n";
echo 
"\$B = \n";
print_r($B);
echo 
"\n";

echo 
"\$C = m('".$recipe3."') \n";
echo 
"\$C = \n";
print_r($C);
echo 
"\n";

echo 
"\$D = m('".$recipe4."') \n";
echo 
"\$D = \n";
print_r($D);
echo 
"\n";

echo 
"\$E = m('".$recipe5."') \n";
echo 
"\$E = \n";
print_r($E);
echo 
"</pre>";

?>

Here is what the output of this script looks like:

$A = m('-1:1; 3:1') 
$A = 
Array
(
    [0] => Array
        (
            [0] => -1
            [1] => 0
            [2] => 1
        )
    [1] => Array
        (
            [0] => 3
            [1] => 2
            [2] => 1
        )
)

$B = m('1;2;3;4') 
$B = 
Array
(
    [0] => Array
        (
            [0] => 1
        )

    [1] => Array
        (
            [0] => 2
        )

    [2] => Array
        (
            [0] => 3
        )

    [3] => Array
        (
            [0] => 4
        )
)

$C = m('1 2  3;4 5   6') 
$C = 
Array
(
    [0] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
        )
    [1] => Array
        (
            [0] => 4
            [1] => 5
            [2] => 6
        )
)

$D = m('1') 
$D = 
Array
(
    [0] => 1
)

$E = m('1 2 3;4:6') 
$E = 
Array
(
    [0] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
        )
    [1] => Array
        (
            [0] => 4
            [1] => 5
            [2] => 6
        )
)

Originally I suggested that we use a mat( ) function to both create and access a matrix. I decided that if we are going to shorten "matrix" to "mat" we might as well go all the way and use an "m" function as the function name. While I think that we could still use one m( ) function for both constructing and accessing matrices, I think it would be better to develop a separate sm( ) function to make initial development of a matrix accessor easier. Finally, using m( ) as the function used to create/access a matrix reminds me of JQuery's $() operator so I don't think there is anything inherently wrong with using a one letter function name if the importance of the operator supports reserving a letter for it. This is arguably the case for an operator designed to create and access matrices.

Permalink 

Matrix manipulation functions for PHP: Part 1 [Math Programming
Posted on January 4, 2012 @ 01:12:29 PM by Paul Meagher

Currently reading some books on Matlab (so that I can program in it's opensource version, Octave, better). In Matlab/Octave you can construct a matrix like so:

mat = [1:3; 4:6]

which returns:

mat =

   1   2   3
   4   5   6

Perhaps in PHP we should be able to do something similiar like this:

$A = mat('1:3; 4:6');

Where mat(' replaces [ and '); replaces ]. Implementing a mat function like this would require implementing 1) a parser for the expression between single quotes (or double quotes), and 2) a code generator that uses the parsed expression to generate the appropriate array or nested arrays.

In the matrix constructor for the JAMA package, there are methods to "fill" the arrays with values but they are very primative compared to matlab/octaves notation for constructing filled matrices. The proposed "mat" function could work in a very complementary way with JAMA's matrix constructor.

A mat function is only one matrix manipulation function that php might implement to make matrix manipulation easier. Other functions that would be useful are linspace, reshape, and repmat. Matlab's/Octave's method for transposing a matrix involves using a single quote operator ' outside the right bracket (e.g., mat = [1:3; 4:6]') that might be functionalized as a transpose or trans operator. Finally, Matlab/Octave also supports an expressive language for accessing the contents of a matrix that would also be useful to have. One possibility is that the mat function might take a second argument, a matrix, that would be accessed via an expression between the single quotes of the first argument like so:

$b = mat('end, 1', $A);

Which equates to:

$b = 4;

I would argue that for PHP to be a usable language for matrix manipulation it should offer matrix manipulation functions like the Matlab/Octave ones proposed here. A few functions like these, inspired by the way Matlab/Octive implements these functions, would take PHP a long way towards being a useful matrix manipulation language. They would be complementary with the JAMA Linear Algebra Package and would make it easier to port Matlab/Octave-based matrix algorithms to PHP. Many vectorized algorithms require the ability to efficiently slice and dice matrices as the algorithm proceeds towards a solution which is what these methods provide.

Permalink 

Information Retrieval Class [Data Mining
Posted on June 5, 2008 @ 02:21:07 PM by Paul Meagher

As I reflected on how I would begin computing summary statistics over an inverted index (which I discussed and implemented in my last post), I realized that I would have to immediately turn my script into a class in order to easily and efficiently pass around the inverted index to the statistical functions. Here is what I have developed so far (see also):

<?php

/**
* Information Retrievel

* Class used to explore information retrieval theory and concepts.
*/

define("DOC_ID"0);
define("TERM_POSITION"1);

class 
IR {

  public 
$num_docs 0;
  
  public 
$corpus_terms = array();

  
/* 
  * Show Documents
  *
  * Helper function that shows the contents of your corpus documents.
  *
  * @param array $D document corpus as array of strings
  */ 
  
function show_docs($D) {
    
$ndocs count($D);
    for(
$doc_num=0$doc_num $ndocs$doc_num++) {
      
?>
      <p>
      Document #<?php echo ($doc_num+1); ?>:<br />
      <?php echo $D[$doc_num]; ?>
      </p>
      <?php
    
}
  }

  
/* 
  * Create Index
  *
  * Creates an inverted index from the supplied corpus documents.
  * Inverted index stored in corpus_terms array.
  *
  * @param array $D document corpus as array of strings
  */   
  
function create_index($D) {  
    
$this->num_docs count($D);    
    for(
$doc_num=0$doc_num $this->num_docs$doc_num++) {      
      
// zero array containing document terms
      
$doc_terms = array();      
      
// simplified word tokenization process
      
$doc_terms explode(" "$D[$doc_num]);      
      
// here is where the indexing of terms to document locations happens
      
$num_terms count($doc_terms);
      for(
$term_position=0$term_position $num_terms$term_position++) {
        
$term strtolower($doc_terms[$term_position]);
        
$this->corpus_terms[$term][]=array($doc_num$term_position);
      }      
    }   
  }

  
/* 
  * Show Index
  *
  * Helper function that outputs inverted index in a standard format.
  */         
  
function show_index() {
    
// sort by key for alphabetically ordered output
    
ksort($this->corpus_terms);
    
// output a representation of the inverted index
    
foreach($this->corpus_terms AS $term => $doc_locations) {
      echo 
"<b>$term:</b> ";
      foreach(
$doc_locations AS $doc_location
        echo 
"{".$doc_location[DOC_ID].", ".$doc_location[TERM_POSITION]."} ";
      echo 
"<br />";  
    }    
  }   

  
/*
  * Term Frequency
  *
  * @param string $term
  * @return frequency of term in corpus
  */
  
function tf($term) {
    
$term strtolower($term);
    return 
count($this->corpus_terms[$term]);
  }
   
  
/*
  * Number Documents With
  * 
  * @param string $term
  * @return number of documents with term
  */
  
function ndw($term) {
    
$term strtolower($term);   
    
$doc_locations $this->corpus_terms[$term];
    
$num_locations count($doc_locations);
    
$docs_with_term = array();
    for(
$doc_location=0$doc_location $num_locations$doc_location++) 
      
$docs_with_term[$i]++;
    return 
count($docs_with_term);     
  }
   
  
/*
  * Inverse Document Frequency
  *
  * @param string $term
  * @return inverse document frequency of term
  */
   
function idf($term) {
     return 
log($this->num_docs)/$this->ndw($term);    
   }       

}

?>

Here is a script I developed to test the methods of the IR class:

<?php

include "IR.php";

$D[0] = "Shipment of gold delivered in a fire";
$D[1] = "Delivery of silver arrived in a silver truck";
$D[2] = "Shipment of gold arrived in a truck";

$ir = new IR();

echo 
"<p><b>Corpus:</b></p>";
$ir->show_docs($D);

$ir->create_index($D);

echo 
"<p><b>Inverted Index:</b></p>";
$ir->show_index();

$term "silver"
$tf  $ir->tf($term);
$ndw $ir->ndw($term);
$idf $ir->idf($term);
echo 
"<p>";
echo 
"Term Frequency of '$term' is $tf<br />";
echo 
"Number Of Documents with $term is $ndw<br />";
echo 
"Inverse Document Frequency of $term is $idf";
echo 
"</p>";  

?>

Here is the output that the script generates:

Corpus:

Document #1:
Shipment of gold delivered in a fire

Document #2:
Delivery of silver arrived in a silver truck

Document #3:
Shipment of gold arrived in a truck

Inverted Index:

a: {0, 5} {1, 5} {2, 5}
arrived: {1, 3} {2, 3}
delivered: {0, 3}
delivery: {1, 0}
fire: {0, 6}
gold: {0, 2} {2, 2}
in: {0, 4} {1, 4} {2, 4}
of: {0, 1} {1, 1} {2, 1}
shipment: {0, 0} {2, 0}
silver: {1, 2} {1, 6}
truck: {1, 7} {2, 6}

Term Frequency of 'silver' is 2
Number Of Documents with silver is 1
Inverse Document Frequency of silver is 1.0986122886681

The next step in this exploration of Information Retrieval Theory will be to develop a table that better summarizes term frequency and inverse term frequencies. I will use this table to verify that the results I'm generating match with tf-idf calculations found in the literature for the sample documents used in my calculations.

Permalink 

Information Retrieval Theory: Implementing an Inverted Index [Data Mining
Posted on June 3, 2008 @ 01:29:29 PM by Paul Meagher

Last night I downloaded a local copy of this book:

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.

Most introductions to Information Retrieval begin by talking about the "Inverted Index" data structure. I had come accross this concept before, however, I haven't explored it in php code because I got hung up on the issues of what corpus to use and, more importantly, what word tokenization approach I should use for the documents. Through my surfing around on the topic of IR, I observed that you can bypass these issues by simplifying the corpus selection and word tokenization issues while retaining the essentials of the representational and computational issues. The code below illustrates a common simplification of the corpus concept and word tokenization concept that might similarly allow you to get started exploring some IR concepts.

<?php

/**
* Creates an inverted index and shows what it looks like.  Useful for
* explorating TF-IDF calculations in Information Retrieval theory.
*
* @see http://en.wikipedia.org/wiki/Inverted_index
*/

define("DOC_ID"0);
define("TERM_POSITION"1);

$D[0] = "Shipment of gold delivered in a fire";
$D[1] = "Delivery of silver arrived in a silver truck";
$D[2] = "Shipment of gold arrived in a truck";

$corpus_terms = array();

$num_docs count($D);

for(
$doc_num=0$doc_num $num_docs$doc_num++) {
  
  
// zero array containing document terms
  
$doc_terms = array();
  
  
// simplified word tokenization process
  
$doc_terms explode(" "$D[$doc_num]);
  
  
// here is where the indexing of terms to document locations happens
  
$num_terms count($doc_terms);
  for(
$term_position=0$term_position $num_terms$term_position++) 
    
$corpus_terms[strtolower($doc_terms[$term_position])][]=array($doc_num$term_position);

}

// sort by key for alphabetically ordered output
ksort($corpus_terms);

$num_terms count($corpus_terms);

// output a representation of the inverted index
foreach($corpus_terms AS $term => $doc_locations) {
  echo 
"<b>$term:</b> ";
  foreach(
$doc_locations AS $doc_location
    echo 
"{".$doc_location[DOC_ID].", ".$doc_location[TERM_POSITION]."} ";
  echo 
"<br />";  
}    
      
?>  

The output of this code looks like this:

a: {0, 5} {1, 5} {2, 5}
arrived: {1, 3} {2, 3}
delivered: {0, 3}
delivery: {1, 0}
fire: {0, 6}
gold: {0, 2} {2, 2}
in: {0, 4} {1, 4} {2, 4}
of: {0, 1} {1, 1} {2, 1}
shipment: {0, 0} {2, 0}
silver: {1, 2} {1, 6}
truck: {1, 7} {2, 6}

Sometimes it is useful to jump ahead in your exploration of certain topics and avoid, for example, the gritty details of document tokenization. What may seem like an avoidance of the problem, might end up informing you about important subtleties in the requirements for your information retrieval system. Exploratory inverted index code can help you to see the computational consequences of representing an inverted index in a certain manner.

For me, the next step in exploring IR theory is to explore in code the Term Frequency–Inverse Document Frequency calculations, generally abbreviated as TF-IDF. Tim Munroe illustrates some of the calculations involved, however, I will need to get my calculations working with the inverted index data structure I've developed so far.

Permalink 

Law of Surfing [Formulas
Posted on May 30, 2008 @ 03:31:42 PM by Paul Meagher

Bernardo Huberman formulated a few "laws of the web". One of these laws concerns the distribution of clickstream lengths.

The probability density function of L, the length of a clickstream, is, according to the "Law of Surfing", distributed according to an inverse guassian function:

f(L, v, λ) = sqrt(λ/2π) * L-3/2 * e-(λ/2*v*v*L)*(L-v)^2

What do the symbols L, v, λ mean?

L is session length measured in page clicks.

v is the expected value or mean. A fitted distribution of path length probabilities for Boston University students had a value of v=51.19 which the study referred to as "mean visits".

λ is related to the expected value and variance λ = v3/Var(L). The fitted value for λ was equal to 3.53.

Here is a php script to compute the probability densities for each clickstream length:

<?php
/**
* Raises a number to a floating point power
*
* a^b = e^(b log a) which is not 10log but the e-log (aka "ln")
* so instead of pow( $a , 0.6 ) use something like 
* exp( 0.6 * log($a))
*
* @see http://ca3.php.net/manual/en/function.pow.php#47297
*/
function fpow($base$fexp) {
  return 
exp($fexp log($base));
}

/** 
* Used to compute probability density associated with each clickstream
* length.
*/
function inverse_guassian($L$v$lambda) {
  return 
sqrt($lambda/(M_PI)) 
         * 
fpow($L, -3/2
         * 
exp(-1.0 * ($lambda/(pow($v2) * $L)) * pow($L-$v2)); 
}

$v 51.19;
$lambda 3.53;

for(
$L=1$L 20$L++) {
  
$p inverse_guassian($L$v$lambda);
  echo 
"f(".$L.", ".$v.", ".$lambda.") = ".$p."<br />";
}

?>

And these are the probability densites the program generates:

f(1, 51.19, 3.53) = 0.13738001323331
f(2, 51.19, 3.53) = 0.11731429052801
f(3, 51.19, 3.53) = 0.08563996238308
f(4, 51.19, 3.53) = 0.064395173972196
f(5, 51.19, 3.53) = 0.050294705235689
f(6, 51.19, 3.53) = 0.040551681574111
f(7, 51.19, 3.53) = 0.033538749822867
f(8, 51.19, 3.53) = 0.028310951045661
f(9, 51.19, 3.53) = 0.024298496373469
f(10, 51.19, 3.53) = 0.021143050105347
f(11, 51.19, 3.53) = 0.018610350025871
f(12, 51.19, 3.53) = 0.016541931634992
f(13, 51.19, 3.53) = 0.014827372300491
f(14, 51.19, 3.53) = 0.013387711569616
f(15, 51.19, 3.53) = 0.012165196593551
f(16, 51.19, 3.53) = 0.0111167387119
f(17, 51.19, 3.53) = 0.0102096203282
f(18, 51.19, 3.53) = 0.0094186073802048
f(19, 51.19, 3.53) = 0.0087239635554072

Permalink 

Powers of ten [Space & Physics
Posted on May 27, 2008 @ 11:05:40 PM by Paul Meagher

There is a celebrated book called "Powers of Ten" that offers up snapshots of what the universe looks like as you decrease or increase your magnification by successive powers of ten.

The cognitive scientist, Allan Newell, applied the "Powers of Ten" concept to the time dimension and observed that human action can be explained by biological, cognitive, rational and social constructs that generally operate at different time scales (Pirolli, p 19).

Table 1. Time scale on which human action occurs.

Scale (seconds)Time UnitBand
107MonthsSocial
106Weeks 
105Days 
104HoursRational
10310 Minutes 
102Minutes 
10110 SecondsCognitive
1001 Second 
10-1100 Milliseconds 
10-21 MillisecondBiological

Permalink 

Particle Swarm Optimization (PSO) : Part 1 [Optimization
Posted on May 26, 2008 @ 02:47:13 PM by Paul Meagher

An interesting area of research that I was not very familiar with is Computational Swarm Intelligence. A book by A.P. Engelbrecht (2005), "Fundamentals of Computational Swarm Intelligence", John Wiley & Sons, offers an excellent overview of the area. Andries breaks the field down into three main subfields:

  1. Genetic Algorithm Research
  2. Particle Swarm Optimization
  3. Ant Colony Optimization

The book takes an "engineering approach" to the topic areas: The computational tools used to study these areas can be applied to a host of engineering problems requiring optimization.

The book is 599 pages long. The section on Particle Swarm Optimization is around 275 pages long. It is a book within the book and probably the best coverage of this topic area if you have a specific interest in Particle Swarm Optimization PSO algorithms.

The section of the book on PSO presents this as the first formula to learn:

xi(t) = xi(t) + vi(t + 1)

This tells us that the position of particle i in a multidimensional space is a function of the position of the particle at time xi(t) and the velocity of the particle vi(t + 1).

You can use a uniform random number generator to initialize the positions of the particles.

xi(0) ~ U(xmin, xmax)

Permalink 

Basic Formulas for Optimal Foraging Theory [Formulas
Posted on May 23, 2008 @ 02:03:11 AM by Paul Meagher

I am currently reading the book "Optimal Foraging Theory" (Oxford: 2007) by Peter Pirolli.

Peter Pirolli lays the foundation for Optimal Foraging Theory (OFT) with this formula:

R = g(tw) / (tb + tw)

This is a rate equation. The equation tells us that the overall rate of gain R of food or information is equal to a gain function applied to the amount of time you have been foraging within the current patch g(tw) divided by the amount of time it takes to move between patches tb plus the time you have spent foraging in the current patch tw.

Charnov's Marginal Value Theorem can be used to find the optimal amount of time t* to spend in a patch.

To apply the theorem we first need to rewrite the equation above as an optimal rate R* equation:

R* = g(t*) / (tb + t*)

An optimal forager obeys the two rules below (see p. 8) to find an optimal t* value:

  • if (g'(tw) >= R*) then continue foraging in the patch.
  • if (g'(tw) < R*) then start looking for a new patch

Note that g'(tw) refers to the derivative of the gain function for within-patch foraging. The means we need to compute the slope of the gain function for various values of t. If the slope at t is greater than or equal to the maximal slope R*, we stay in the patch; otherwise, we move to a new patch.

This is an abstract formulation of the optimization problem that Optimal Foraging Theory is concerned with.

Applications to particular problems in web design require operationalizatizing the concept of a "patch" and what "foraging within a patch" and "moving between patches" consists of.

The book offers ideas about how to operationalize these concepts in particular domains, including web design and web usability. To learn how to apply optimal foraging theory from a usability guru, I recommend reading a recent article by Jakob Nielson (Nov. 2007: Long versus Short Content).

Permalink 

Game Artificial Intelligence [Fun & Games
Posted on May 4, 2008 @ 02:06:14 AM by Paul Meagher

The book Game Artificial Intelligence (2007), by John Ahlquist and Jeannie Novak, is an interesting read. It is not a difficult read and only includes small illustrative code snippets, nevertheless, I learned quite a bit from it. The book is very well designed and attractive with excellent illustrations and screen shots from a variety of games. It includes numerous short interviews and blurbs by senior game designers and developers from major studios that works well with the text. I'm not really a "gamer" myself and one of the reasons I found the book interesting was because it offered a high-level, knowledgeable, and insightful discussion of program-level game design using as examples some of the games my son plays plus many others.

The code snippet below is a PHP version of a thought-provoking snippet from the book showing how the "retreat" and "attack" behavior of a game AI might be controlled:

function OnSeePlayer($agent, $player) {
  if (Health($agent) < DefaultHealth($agent)*0.1) 
    Retreat($agent);
  else
    Attack($agent, $player);
}

So, if the agent sees the player, and the agent's health is less that 10% of it's normal health, the agent will retreat; otherwise the agent will attack. It is interesting to consider how one might go about building up the framework around this code snippet so that it would play a useful role in determining a Game AI's behavior. I suspect that much of the work on the Game AI component of game design could be done without any flashy graphics; mostly what you would need are simulations of your opponents tactics and strategies to see how your game AI reacts under different situations (that would eventually map onto the flashy game graphics depicting the actions and reactions).

I've not read the last two chapters on pathfinding yet so can't give a full review of this book. Nevertheless, if you are a geek looking for a good beach book this summer, I'd recommend taking this book along. If you can't play your videogames in the sand, then at least you can read an interesting discussion about the nature of the brains inside your games.

Permalink 

PHP Excel Classes 2007 [Math Projects
Posted on April 1, 2008 @ 10:10:35 AM by Paul Meagher

The opensource PHP Excel Classes 2007 Project aims to provide "a set of classes for the PHP programming language, which allow you to write to Excel 2007 files and read from Excel 2007 files. This project is built around Microsoft's OpenXML standard and PHP". What impresses me about this project is the inclusion of a formula parser and calculation engine that mimics much of Excel's functionality. I can see a use for the formula parser and the calculation engine in other projects that don't involve Excel.

Much of the functionality associated with the calculation engine is located in the PHPExcel_Calculation_Functions.php class which includes full implementations of the following 352 math functions:

  • ABS
  • ACCRINT
  • ACCRINTM
  • ACOS
  • ACOSH
  • ADDRESS
  • AMORDEGRC
  • AMORLINC
  • AND
  • AREAS
  • ASC
  • ASIN
  • ASINH
  • ATAN
  • ATAN2
  • ATANH
  • AVEDEV
  • AVERAGE
  • AVERAGEA
  • AVERAGEIF
  • AVERAGEIFS
  • BAHTTEXT
  • BESSELI
  • BESSELJ
  • BESSELK
  • BESSELY
  • BETADIST
  • BETAINV
  • BIN2DEC
  • BIN2HEX
  • BIN2OCT
  • BINOMDIST
  • CEILING
  • CELL
  • CHAR
  • CHIDIST
  • CHIINV
  • CHITEST
  • CHOOSE
  • CLEAN
  • CODE
  • COLUMN
  • COLUMNS
  • COMBIN
  • COMPLEX
  • CONCATENATE
  • CONFIDENCE
  • CONVERT
  • CORREL
  • COS
  • COSH
  • COUNT
  • COUNTA
  • COUNTBLANK
  • COUNTIF
  • COUNTIFS
  • COUPDAYBS
  • COUPDAYSNC
  • COUPNCD
  • COUPNUM
  • COUPPCD
  • COVAR
  • CRITBINOM
  • CUBEKPIMEMBER
  • CUBEMEMBER
  • CUBEMEMBERPROPER
  • CUBERANKEDMEMBER
  • CUBESET
  • CUBESETCOUNT
  • CUBEVALUE
  • CUMIPMT
  • CUMPRINC
  • DATE
  • DATEDIF
  • DATEVALUE
  • DAVERAGE
  • DAY
  • DAYS360
  • DB
  • DCOUNT
  • DCOUNTA
  • DDB
  • DEC2BIN
  • DEC2HEX
  • DEC2OCT
  • DEGREES
  • DELTA
  • DEVSQ
  • DGET
  • DISC
  • DMAX
  • DMIN
  • DOLLAR
  • DOLLARDE
  • DOLLARFR
  • DPRODUCT
  • DSTDEV
  • DSTDEVP
  • DSUM
  • DURATION
  • DVAR
  • DVARP
  • EDATE
  • EFFECT
  • EOMONTH
  • ERF
  • ERFC
  • ERROR.TYPE
  • EVEN
  • EXACT
  • EXP
  • EXPONDIST
  • FACT
  • FACTDOUBLE
  • FALSE
  • FDIST
  • FIND
  • FINDB
  • FINV
  • FISHER
  • FISHERINV
  • FIXED
  • FLOOR
  • FORECAST
  • FREQUENCY
  • FTEST
  • FV
  • FVSCHEDULE
  • GAMMADIST
  • GAMMAINV
  • GAMMALN
  • GCD
  • GEOMEAN
  • GESTEP
  • GETPIVOTDATA
  • GROWTH
  • HARMEAN
  • HEX2BIN
  • HEX2DEC
  • HEX2OCT
  • HLOOKUP
  • HOUR
  • HYPERLINK
  • HYPGEOMDIST
  • IF
  • IFERROR
  • IMABS
  • IMAGINARY
  • IMARGUMENT
  • IMCONJUGATE
  • IMCOS
  • IMDIV
  • IMEXP
  • IMLN
  • IMLOG10
  • IMLOG2
  • IMPOWER
  • IMPRODUCT
  • IMREAL
  • IMSIN
  • IMSQRT
  • IMSUB
  • IMSUM
  • INDEX
  • INDIRECT
  • INFO
  • INT
  • INTERCEPT
  • INTRATE
  • IPMT
  • IRR
  • ISBLANK
  • ISERR
  • ISERROR
  • ISEVEN
  • ISLOGICAL
  • ISNA
  • ISNONTEXT
  • ISNUMBER
  • ISODD
  • ISPMT
  • ISREF
  • ISTEXT
  • JIS
  • KURT
  • LARGE
  • LCM
  • LEFT
  • LEFTB
  • LEN
  • LENB
  • LINEST
  • LN
  • LOG
  • LOG10
  • LOGEST
  • LOGINV
  • LOGNORMDIST
  • LOOKUP
  • LOWER
  • MATCH
  • MAX
  • MAXA
  • MDETERM
  • MDURATION
  • MEDIAN
  • MID
  • MIDB
  • MIN
  • MINA
  • MINUTE
  • MINVERSE
  • MIRR
  • MMULT
  • MOD
  • MODE
  • MONTH
  • MROUND
  • MULTINOMIAL
  • N
  • NA
  • NEGBINOMDIST
  • NETWORKDAYS
  • NOMINAL
  • NORMDIST
  • NORMINV
  • NORMSDIST
  • NORMSINV
  • NOT
  • NOW
  • NPER
  • NPV
  • OCT2BIN
  • OCT2DEC
  • OCT2HEX
  • ODD
  • ODDFPRICE
  • ODDFYIELD
  • ODDLPRICE
  • ODDLYIELD
  • OFFSET
  • OR
  • PEARSON
  • PERCENTILE
  • PERCENTRANK
  • PERMUT
  • PHONETIC
  • PI
  • PMT
  • POISSON
  • POWER
  • PPMT
  • PRICE
  • ORICEDISC
  • PRICEMAT
  • PROB
  • PRODUCT
  • PROPER
  • PV
  • QUARTILE
  • QUOTIENT
  • RADIANS
  • RAND
  • RANDBETWEEN
  • RANK
  • RATE
  • RECEIVED
  • REPLACE
  • REPLACEB
  • REPT
  • RIGHT
  • RIGHTB
  • ROMAN
  • ROUND
  • ROUNDDOWN
  • ROUNDUP
  • ROW
  • ROWS
  • RSQ
  • RTD
  • SEARCH
  • SEARCHB
  • SECOND
  • SERIESSUM
  • SIGN
  • SIN
  • SINH
  • SKEW
  • SLN
  • SLOPE
  • SMALL
  • SQRT
  • SQRTPI
  • STANDARDIZE
  • STDEV
  • STDEVA
  • STDEVP
  • STDEVPA
  • STEYX
  • SUBSTITUTE
  • SUBTOTAL
  • SUM
  • SUMIF
  • SUMIFS
  • SUMPRODUCT
  • SUMSQ
  • SUMX2MY2
  • SUMX2PY2
  • SUMXMY2
  • SYD
  • T
  • TAN
  • TANH
  • TBILLEQ
  • TBILLPRICE
  • TBILLYIELD
  • TDIST
  • TEXT
  • TIME
  • TIMEVALUE
  • TINV
  • TODAY
  • TRANSPOSE
  • TREND
  • TRIM
  • TRIMMEAN
  • TRUE
  • TRUNC
  • TTEST
  • TYPE
  • UPPER
  • USDOLLAR
  • VALUE
  • VAR
  • VARA
  • VARP
  • VARPA
  • VDB
  • VERSION
  • VLOOKUP
  • WEEKDAY
  • WEEKNUM
  • WEIBULL
  • WORKDAY
  • XIRR
  • XNPV
  • YEAR
  • YEARFRAC
  • YIELD
  • YIELDDISC
  • YIELDMAT
  • ZTEST

If you are interested in the intersection of PHP and Math I encourage you to test drive the PHPExcel Project. The code is licensed under the GPL v2.

Permalink 

Correlation Matrix [Data Mining
Posted on March 23, 2008 @ 12:16:14 PM by Paul Meagher

CorrelationMatrix.php is a class that can be used to compute a correlation matrix given a data matrix. A data matrix can be obtained by querying a database table or collection of tables and putting the resulting values into a two-dimensional data array. You could then run the class below on your two-dimensional data array.

<?php
/**
* Creates a correlation matrix from a given data matrix. 
*
* @see http://lib.stat.cmu.edu/multi/pca.c
*
* @author   Paul Meagher
* @version  0.2
* @modified Mar 23, 2008
*/
class CorrelationMatrix {
    
    private 
$eps    0.005;

    public 
$nrows   0;
    public 
$ncols   0;    
    
    public 
$means   = array();
    public 
$stddevs = array();    
    public 
$cormat  = array();    
    
  function 
CorrelationMatrix($data) {
                
    
// num rows
    
$this->nrows count($data);    
    
    
// num cols
    
$this->ncols count($data[0]); 
    
    
// Determine mean of column vectors of input data matrix 
    
for ($j=0$j $this->ncols$j++) {
      
$this->means[$j] = 0.0;
      for(
$i=0$i $this->nrows$i++) 
        
$this->means[$j] += $data[$i][$j];
      
$this->means[$j] /= $this->nrows;
    }
          
    
// Determine standard deviations of column vectors of data matrix. 
    
for ($j=0$j $this->ncols$j++) {
      
$this->stddevs[$j] = 0.0;
      for (
$i=0$i $this->nrows$i++) 
        
$this->stddevs[$j] += (($data[$i][$j]-$this->means[$j])*($data[$i][$j]-$this->means[$j]));
      
$this->stddevs[$j] /= $this->nrows;
      
$this->stddevs[$j] = sqrt($this->stddevs[$j]);
      
// The following in an inelegant but usual way to handle
      // near-zero stddev values, which would cause a zero-
      // divide error. 
      
if ($this->stddevs[$j] <= $this->eps
        
$this->stddevs[$j] = 1.0;
    }
    
    
// Center and reduce the column vectors. 
    
for ($i=0$i $this->nrows$i++) {
      for (
$j=0$j $this->ncols$j++) {
        
$data[$i][$j] -= $this->means[$j];
        
$x sqrt($this->nrows);
        
$x *= $this->stddevs[$j];
        
$data[$i][$j] /= $x;
      }
    }
  
    
// Calculate the m * m correlation matrix. 
    
for ($j1=0$j1 $this->ncols-1$j1++) {
      
$this->cormat[$j1][$j1] = 1.0;
      for (
$j2=$j1+1$j2 $this->ncols$j2++) {
        
$this->cormat[$j1][$j2] = 0.0;
        for (
$i=0$i $this->nrows$i++)
          
$this->cormat[$j1][$j2] += ( $data[$i][$j1] * $data[$i][$j2]);
        
$this->cormat[$j2][$j1] = $this->cormat[$j1][$j2];
      }
    }
    
    
$this->cormat[$this->ncols-1][$this->ncols-1] = 1.0;
    
  }
  
  function 
printMeans() {  
    
printf("Column Means: <br />");
    for (
$j=0$j $this->ncols$j++)  
      
printf("%7.2f"$this->means[$j]);  
    
printf("<br />");  
  }

  function 
printStdDevs() {
    
printf("Column Standard Deviations: <br />");
    for (
$j=0$j $this->ncols$j++) 
      
printf("%7.2f"$this->stddevs[$j]); 
    
printf("<br />");
  }

  function 
printCorMat() {
    
printf("Correlation Matrix: <br />");
    for (
$i=0$i $this->ncols$i++) {     
      for (
$j=0$j $this->ncols$j++) 
        
//echo $this->cormat[$i][$j]." ";
        
printf("%7.4f"$this->cormat[$i][$j]); 
      
printf("<br />");
    }
  }
      
}

?>

Here is a test script that demonstrates usage and generates some output:

<?php

include "CorrelationMatrix.php";

$data[0] = array(123);
$data[1] = array(234);
$data[2] = array(345);
$data[3] = array(468);
$data[4] = array(5810);

$cm = new CorrelationMatrix($data);
$cm->printMeans();
$cm->printStdDevs();
$cm->printCorMat();

?>

The output looks like this:

Column Means:
3.00 4.60 6.00
Column Standard Deviations:
1.41 2.15 2.61
Correlation Matrix:
1.0000 0.9848 0.9762
0.9848 1.0000 0.9970
0.9762 0.9970 1.0000

Permalink 

Stratified Random Sampling [Stats & Prob
Posted on March 20, 2008 @ 04:49:54 AM by Paul Meagher

Jan Steedman writes:

I was wondering if you could point me into the right direction with a somewhat off-topic sampling problem that I am currently having:

I have a set of people in the database with several coded variables (let's say: age group, gender & region). I'd like to draw a sample of 100 people in total from the database now with specific targets (i.e. constraints):

gender male: 50 people
gender female: 50 people

age group 1: 40 people
age group 2: 35 people
age group 3: 25 people

region 1: 10 people
region 2: 30 people
region 3: 60 people

The target size is 100 people in total and all constraints must be met. My problem now is how to select the "correct" set of people from the database in an efficient way? From all possible combinations of people there might only be one combination that will satisfy all constraints.

Here is a stratified_sample.php script that I came up with to solve Jan's problem. I'm not sure that it actually solved Jan's problem but I though I would publish it anyway.

<?php

// setup the gender, age, and region arrays with
// appropriate number of sampling codes in each
// array

for($i=0$i<100$i++) {

  if (
$i<50)
    
$gender[$i] = "GM";
  else
    
$gender[$i] = "GF";

  if (
$i<40)
    
$age[$i] = "A1";
  elseif(
$i<75)
    
$age[$i] = "A2";
  else
    
$age[$i] = "A3";

  if (
$i<10)
    
$region[$i] = "R1";
  elseif(
$i<40)
    
$region[$i] = "R2";
  else
    
$region[$i] = "R3";

}

// randomize the position of sampling codes in
// each array

shuffle($gender);
shuffle($age);
shuffle($region);

// the sample array contains the sampling codes
// for selecting each sample unit

// to create the sample array loop through the
// gender, age, and region arrays and concatenate
// the sampling codes from each array together
// to create the sampling code for that sample unit

for($i=0$i<100$i++) {
  
$g $gender[$i];
  
$a $age[$i];
  
$r $region[$i];
  
$sample[$i] = "$g$a$r";
}

// the sample array should meet the sampling
// constraints while meeting the constraints
// of randomization as well

echo "<pre>";
print_r($sample);
echo 
"</pre>";

?> 

The abbreviated output of the script looks like this:

Array
(
    [0] => GFA1R2
    [1] => GFA1R3
    [2] => GFA1R3
    [3] => GFA2R2
    [4] => GMA2R1
    [5] => GMA2R3
    [6] => GFA1R2
    [7] => GFA2R2
    [8] => GMA2R3
    [9] => GFA1R2
    [10] => GMA1R2
 
    ... snip ...

    [90] => GMA1R2
    [91] => GMA2R1
    [92] => GFA3R3
    [93] => GFA2R3
    [94] => GMA3R2
    [95] => GFA1R3
    [96] => GMA1R2
    [97] => GFA3R2
    [98] => GMA2R3
    [99] => GFA1R1
)

For this solution to fully work you would have to code each of your sample elements according to your 3 dimensional coding system (e.g., GFA1R3 meaning "Gender Female", "Area 1", "Region 3") and when you are selecting individuals matching these sampling codes you would randomly select a person matching your sampling code from among the pool of people matching the sampling code.

Permalink 

Longest Common Subsequence [Optimization
Posted on February 21, 2008 @ 10:13:09 PM by Paul Meagher

The Longest Common Subsequence algorithm compares two strings and finds the longest subsequence shared by both strings. There are lots of applications, however, two are particularly noteworthy:

  1. Computing the similarity between sequences of genetic material.
  2. Computing differences between files (i.e., unix "diff" command).

So, without further ado, here is an algorithm for computing the Longest Common Subsequence:

<?php
/**
* Port of Sedgewick & Wayne's Longest Common Subsequence algorithm 
* found here:
*
* @see http://www.cs.princeton.edu/introcs/96optimization/LCS.java.html
* @see http://www.cs.princeton.edu/introcs/96optimization/
*
* @author Paul Meagher
* @ported Feb 21, 2008
*
* See also:
*
* @see http://www.ics.uci.edu/~eppstein/161/960229.html
*
* @param string $S1 first string 
* @param string $S2 second string 
* @returns string containing common subsequence or empty string
*/
function string_lcs($S1$S2) {

  
$m strlen($S1);
  
$n strlen($S2);

  
// A[i][j] = length of LCS of S1[i..m] and S2[j..n]
  
$A = array();

  
// compute length of LCS and all subproblems via dynamic programming
  
for ($i $m-1$i >= 0$i--) {
    for (
$j $n-1$j >= 0$j--) {
      if (
$S1[$i] == $S2[$j])
        
$A[$i][$j] = $A[$i+1][$j+1] + 1;
      else 
        
$A[$i][$j] = max($A[$i+1][$j], $A[$i][$j+1]);
    }
  }

  
// recover LCS itself and print it to standard output
  
$i 0;
  
$j 0;
  
$LCS "";
  while(
$i $m && $j $n) {
    if (
$S1[$i] == $S2[$j]) {
      
$LCS .= $S1[$i];
      
$i++;
      
$j++;
    } elseif (
$A[$i+1][$j] >= $A[$i][$j+1]) 
      
$i++;
    else                                 
      
$j++;
  }
  
  return 
$LCS;

}
  
?>  

Here is a test program with the output displayed in the source code:

<?php

include "string_lcs.php";

$S1 "abcdefghijk";
$S2 "ijklmnop";

$subsequence string_lcs($S1$S2);
echo 
$subsequence;

// Output: ijk

?>

I've taken a few liberties in my port of Sedgewick and Wayne's LCS class. I reimplemented the LCS class as a string_lcs function as a class seemed like overkill at this point. I also use the word "string_" as a prefix for the function name. I used this prefix in anticipation of eventually building a library of such string functions (in addition to PHP's collection of string functions). I also renamed some variables to be more mnemonic (e.g., $S1, $S2) or standard (e.g., $A, $m, $n) and changed the capitalization of variable names to conform with whether they are being treated as a scalar (lowercase) or vector/matrix quantity (uppercase).

Consult some of the links contained in the source to learn more about how the algorithm works and about Dynamic Programming more generally.

Permalink 

Order statistics [Notation
Posted on February 16, 2008 @ 01:36:53 PM by Paul Meagher

Order statistics are statistics derived from data ordered from smallest to largest. I've recently come across a nice notation for representing order statistics which I thought deserved a blog entry.

The notation involves putting a subscripted bracket around a positional index. Using this scheme, all the measurements in our ordered sample would be denoted as y[1],...,y[n]. In other words, the brackets around the subscripts tips us off that the elements denoted by these indexed y values are not simply the raw sample values in whatever order they were collected, but rather the sample values ordered from smallest to largest.

As an example of how this notation can be used, consider how we might express the formula for computing the first, second and third quartiles (Q1, Q2, Q3). Recall that Q1 is the value that 25% of our values are less than or equal to. Q2 is the value that 50% of our values are less than or equal to and Q3 is the value that 75% of our values are less than or equal to. Or, using our notation:

Q1 = y[(n+1)/4]

Q2 = y[(n+1)/2]

Q3 = y[3(n+1)/4]

If the subscripts are not integers, we average the two closest order statisitcs to obtain the quartile value.

The minimum value y[1], the three quartiles (Q1, Q2, Q3), and the maximum value y[n] are often reported together and are called the five number summary. The box-and-whisker plot is a graphical way to depict the five number summary.

As a final example of how our order notation can be used consider the formula for the trimmed mean. In the formula below mk is the trimmed mean with k values trimed from the top and bottom of an ordered list of numbers:

mk = ∑i=k+1 to n-k x [ i ] / (n - 2k)

Here is some PHP code for computing a trimmed mean where we remove the top and bottom 2 values from our computed mean (i.e., k=2):

<?php

function trimmed_mean($X$k) {
  
$n count($X);
  
sort($XSORT_NUMERIC); 
  for(
$i=$k$i $n-$k$i++) 
    
$sum += $X[$i];
  
$mean $sum / ($n $k);
  return 
$mean;
}


$X = array(551520122550);
echo 
"Trimmed mean is ".trimmed_mean($X2).".";

// Output: Trimmed mean is 20.

?>

The next time you compute a mean you might want to consider computing a trimmed mean using a few different values of k to see how affected by outliers your computed mean is.

Permalink 

On the way to NURBS [Calculus
Posted on January 28, 2008 @ 02:03:17 AM by Paul Meagher

NURBS is an acronym for "Non-Uniform Rational B-Spline". NURBS are used extensively in CAD systems to represent surfaces. I hope to someday understand the math and programming behind NURBS better. Towards that end, I have ported some code that manipulates Bernstein polynomials. Bernstein polynomials unify the math behind Bezier approximation schemes and Spline approximation schemes. NURBS are another approximation scheme which requires that you know something about Bernstein polynomials in order to appreciate the math involved.

The Bernstein class below can be used to generate the coordinates needed to draw a curve that interpolates and approximates a set of control points. Vector-based image manipulation programs offer a Bezier tool that allows you to create curves by moving around a set of bounding control points. As you move the curve control points, the shape of the Bezier curve changes to interpolate and approximate the new position of the control points. The class below implements the math required to interpolate and approximate the position of the control points. The math looks like this:

The implementation of this math looks like this:

<?php
/**
* Provides methods to manipulate the Bernstein polynomial.
*
* An nth-degree Berstein polynomial is given by:
*
* B(i,n,u) = n! * u^i * (1-u)^(n-i) / i! * (n-i)!
* with  0 <= i <= n
*
* @author R.P. from OpaleTeam
* @see http://opale.belios.de
* @license LPGL
* @date 05/2002
*
* Ported by Paul Meagher on Jan 20, 2008.
*/
class Bernstein {

  
/**
  * Returns the evaluation, for a given value, of a Bernstein
  * polynomal defined by the given arguments.
  *
  * @param int $i 
  * @param int $n the degree of the polynomial
  * @param double $u the value for which the polynomial will be computed
  * @return double the computed value
  */
  
public function getValue($i$n$u) {
  
    
$temp[$n-$i] = 1.0;
    
$u1          1.0 $u;

    for (
$k 1$k <= $n$k++) 
      for (
$j $n$j >= $k$j--) 
        
$temp[$j] = $u1 $temp[$j] + $u $temp[$j-1];       
    
    return 
$temp[$n];

  }
  
  
/**
  * Returns all Bernstein polynomials evaluations, for a given
  * value, of a Bernstein polynomal defined by the given arguments.
  *
  * @param int $n the degree of the polynomial
  * @param double $u the value for which the polynomial will be computed
  * @return double[], the computed value
  */
  
public function getAllValues($n$u) {

    
$b[0] = 1.0;
    
$u1   1.0 $u;

    for (
$i 1$i <= $n$i++) {
      
$saved 0.0;
      for (
$k 0$k $i$k++) {
        
$temp  $b[$k];
        
$b[$k] = $saved $u1 $temp;
        
$saved $u $temp;
      }
      
$b[$i] = $saved;
    }
    
    return 
$b;

  }  
  

}

?>

The bernstein_test.php example below illustrates usage. In this example, we use the Bernstein class to find the coefficients to multiply the $X[$i] (i.e., $X = array(1, 2, 4, 3)) and $Y[$i] coordinates (i.e., $Y = array(1, 3, 3, 1)) of 4 control points. After we separately multiply the $X[$i] and $Y[$i] terms by the Bernstein cofficients we get a new point at $t that approximates the control points.

The setup value of $n is equal to 1 minus the number of control points. $T is an array of values in the range [0,1] that identify the left (0) to right (1) positions along the approximating and interpolating curve where we want to compute it's coordinates.

<?php
/**
* Testing Bernstein output against worked example in:
*
* Rogers, David F. An Introduction to NURBS With Historical Perspective.  
* 2001. Morgan Kaufmann, San Francisco. pp. 22-24
*/

include "Bernstein.php";

$berstein = new Bernstein;

$n 3;

$X = array(1243);
$Y = array(1331);

?>
<table border='1' cellspacing='1' cellpadding='2'>
  <tr>
    <th>t</th>
    <th>B<sub>3,0</sub></th>
    <th>B<sub>3,1</sub></th>
    <th>B<sub>3,2</sub></th>
    <th>B<sub>3,3</sub></th>
    <th>P(t)</th>   
  </tr>
  <?php
  $T 
= array(00.150.350.500.650.851);  
  foreach (
$T as $t) {
    
$B $berstein->getAllValues($n$t);
    
?>
    <tr>
      <td align='center'><?php echo $t ?></td>
      <?php
      $i
=0;
      
$x=0.0;
      
$y=0.0;     
      foreach (
$B as $b) {
        
?>
        <td align='center'><?php echo $b ?></td>
        <?php
        $x 
+= $b *$X[$i];
        
$y += $b *$Y[$i];       
        
$i++;       
      }
      
?>
      <td align='center'><?php echo "($x, $y)"?></td>
    </tr>
    <?php
  

  
?>
</table>

The test script above produces the following HTML table as output:

t B3,0 B3,1 B3,2 B3,3 P(t)
0 1 0 0 0 (1, 1)
0.15 0.614125 0.325125 0.057375 0.003375 (1.504, 1.765)
0.35 0.274625 0.443625 0.238875 0.042875 (2.246, 2.365)
0.5 0.125 0.375 0.375 0.125 (2.75, 2.5)
0.65 0.042875 0.238875 0.443625 0.274625 (3.119, 2.365)
0.85 0.003375 0.057375 0.325125 0.614125 (3.261, 1.765)
1 0 0 0 1 (3, 1)

The P(t) column above contains a set of points which we can graph that will approximate the set of control points a user has specified for the curve they want drawn. The x coordinate of each point was obtained by multiplying the 4 Berstein coefficients by the 4 x coordinates. Similarly, the y coordinate of each point was obtained by multiplying the 4 Bernstein coefficients by the 4 y coordinates.

Permalink 

Population sampling [Stats & Prob
Posted on January 11, 2008 @ 04:11:14 AM by Paul Meagher

Below is a simple but useful class for generating a population sample:

<?php
/**
* Generates a random sample (with or without replacement) 
* of size n from population array.
*
* @see http://www.math.uah.edu/stat/objects/distributions/Functions.xhtml
*/
class Population {

  const 
WITHOUT_REPLACEMENT=0;
  const 
WITH_REPLACEMENT=1;

  function 
random($min=0$max=1) {
    return (
$min+lcg_value()*(abs($max-$min)));
  }

  
/**
  * This method computes a sample of a specified size from a specified population
  * and of a specified type (with or without replacement).
  * @param int[] $p  the population
  * @param int   $n  the sample size
  * @param int   $t  the type (0 without replacement, 1 with replacement);
  * @return int $s 
  */
  
public function getSample($p$n$t){
    
$m count($p);
    if (
$n 1$n 1; elseif ($n $m$n $m;
    
//Define the sample
    
if ($t == self::WITH_REPLACEMENT){
      for (
$i 0$i $n$i++){
        
$u = (int)($m $this->random());
        
$s[$i] = $p[$u];
      }
    } else{
      for (
$i=0$i $n$i++){
        
//Select a random index from 0 to m - i - 1;
        
$k $m $i;
        
$u = (int)($k $this->random());
        
//Define the sample element
        
$s[$i] = $p[$u];
        
// Interchange the sampled element p[u] with p[k - 1], at the end of the
        // population so that it will not be sampled again.
        
$temp    $p[$k-1];
        
$p[$k-1] = $p[$u];
        
$p[$u]   = $temp;
      }
    }
    return 
$s;
  }

}

?>

Here is some code to demonstrate usage:

<?php 

include "Population.php";

$p range(1,9);
$n 3;
$t 0;

$pop = new Population;
$s   $pop->getSample($p$n$t);

echo 
"<pre>";
print_r($s);
echo 
"</pre>";

?>

Here is what the output looks like:

Array
(
    [0] => 2
    [1] => 9
    [2] => 4
)

Perhaps you will find a use for this sampler some day...

Permalink 

Monte Carlo Integration [Calculus
Posted on January 9, 2008 @ 03:09:37 AM by Paul Meagher

Monte carlo integration is a way to find the area under a curve using random numbers. The program below provides a nice introductory example of a monte carlo method and how it can be used to solve a calculus problem.

<?php
/**
* Approximates area of the region under the graph of a
* monotonic function and above the x-axis beween a and b.
*
* Adaptation of Monte Carlo Method program by James M. Sconyers
* found in:
*
* Larson, Hostetler, Edwards (2006). Calculus I with Precalculus.
* 2nd Edition. Houghton Mifflin Co, Boston.  p. 490
*/
class MonteCarlo_Integration {

  
/**
  * @see http://ca.php.net/manual/en/function.mt-rand.php#75793
  */
  
function random_float($min$max) {
    return (
$min+lcg_value()*(abs($max-$min)));
  }
  
  
/**
  * Compute area under the curve using monte carlo method.
  * 
  * @param object $f function object
  * @param float $a lower limit
  * @param float $b upper limit
  * @param int $n number of samples
  * @return float area under the curve
  */
  
function integrate($f$a$b$n) {
  
    
$p 0;
  
    if (
$f->valueAt($a) > $f->valueAt($b))
      
$ymax $f->valueAt($a);
    else 
      
$ymax $f->valueAt($b);
    
    for (
$i=1$i $n$i++) {
      
$x $a + ($b $a) * $this->random_float(01);
      
$y $ymax $this->random_float(01);
      if (
$y $f->valueAt($x)) 
        
$p $p+1;
    }
     
    
$area = ($p/$n)*($b-$a)*$ymax;
    
    return 
$area;
  
  }

}   

?>    

Here is a program to test the MonteCarlo_Integration.php class:

<?php

include "MonteCarlo_Integration.php";

class 
{    
    function 
valueAt($x) {
    return 
pow($x2);
  }
}


$f = new F;
$a 0;
$b 2;
$n 10000;

$mc   = new MonteCarlo_Integration;
$area $mc->integrate($f$a$b$n);

echo 
"Approximate Area: $area";

?> 

The output of one execution of the program is:

Approximate Area: 2.6632

The exact answer is 8/3.

One programming project would involve histogramming and summarizing the area estimates produced by the above class where $n, the number of samples, is varied. Doing so would give you a feel for how variable the estimates are with different sample sizes and how the monte carlo estimates are distributed. A project like this might offer some insight into more advanced monte carlo applications.

Permalink 

 Archive 
 


php/Math Project
© 2011. All rights reserved.