|
|
Information Retrieval Class |
[Data Mining] |
|
Posted on June 5, 2008 @ 02:21:07 PM by Paul Meagher
As I reflected on how I would begin computing summary statistics over an inverted index (which I discussed and implemented in my last post), I realized that I would have to immediately turn my script into a class in order to easily and efficiently pass around the inverted index to the statistical functions. Here is what I have developed so far (see also):
<?php
/** * Information Retrievel * * Class used to explore information retrieval theory and concepts. */
define("DOC_ID", 0); define("TERM_POSITION", 1);
class IR {
public $num_docs = 0; public $corpus_terms = array();
/* * Show Documents * * Helper function that shows the contents of your corpus documents. * * @param array $D document corpus as array of strings */ function show_docs($D) { $ndocs = count($D); for($doc_num=0; $doc_num < $ndocs; $doc_num++) { ?> <p> Document #<?php echo ($doc_num+1); ?>:<br /> <?php echo $D[$doc_num]; ?> </p> <?php } }
/* * Create Index * * Creates an inverted index from the supplied corpus documents. * Inverted index stored in corpus_terms array. * * @param array $D document corpus as array of strings */ function create_index($D) { $this->num_docs = count($D); for($doc_num=0; $doc_num < $this->num_docs; $doc_num++) { // zero array containing document terms $doc_terms = array(); // simplified word tokenization process $doc_terms = explode(" ", $D[$doc_num]); // here is where the indexing of terms to document locations happens $num_terms = count($doc_terms); for($term_position=0; $term_position < $num_terms; $term_position++) { $term = strtolower($doc_terms[$term_position]); $this->corpus_terms[$term][]=array($doc_num, $term_position); } } }
/* * Show Index * * Helper function that outputs inverted index in a standard format. */ function show_index() { // sort by key for alphabetically ordered output ksort($this->corpus_terms); // output a representation of the inverted index foreach($this->corpus_terms AS $term => $doc_locations) { echo "<b>$term:</b> "; foreach($doc_locations AS $doc_location) echo "{".$doc_location[DOC_ID].", ".$doc_location[TERM_POSITION]."} "; echo "<br />"; } }
/* * Term Frequency * * @param string $term * @return frequency of term in corpus */ function tf($term) { $term = strtolower($term); return count($this->corpus_terms[$term]); } /* * Number Documents With * * @param string $term * @return number of documents with term */ function ndw($term) { $term = strtolower($term); $doc_locations = $this->corpus_terms[$term]; $num_locations = count($doc_locations); $docs_with_term = array(); for($doc_location=0; $doc_location < $num_locations; $doc_location++) $docs_with_term[$i]++; return count($docs_with_term); } /* * Inverse Document Frequency * * @param string $term * @return inverse document frequency of term */ function idf($term) { return log($this->num_docs)/$this->ndw($term); }
}
?>
Here is a script I developed to test the methods of the IR class:
<?php
include "IR.php";
$D[0] = "Shipment of gold delivered in a fire"; $D[1] = "Delivery of silver arrived in a silver truck"; $D[2] = "Shipment of gold arrived in a truck";
$ir = new IR();
echo "<p><b>Corpus:</b></p>"; $ir->show_docs($D);
$ir->create_index($D);
echo "<p><b>Inverted Index:</b></p>"; $ir->show_index();
$term = "silver"; $tf = $ir->tf($term); $ndw = $ir->ndw($term); $idf = $ir->idf($term); echo "<p>"; echo "Term Frequency of '$term' is $tf<br />"; echo "Number Of Documents with $term is $ndw<br />"; echo "Inverse Document Frequency of $term is $idf"; echo "</p>";
?>
Here is the output that the script generates:
Corpus:
Document #1:
Shipment of gold delivered in a fire
Document #2:
Delivery of silver arrived in a silver truck
Document #3:
Shipment of gold arrived in a truck
Inverted Index: a: {0, 5} {1, 5} {2, 5} arrived: {1, 3} {2, 3} delivered: {0, 3} delivery: {1, 0} fire: {0, 6} gold: {0, 2} {2, 2} in: {0, 4} {1, 4} {2, 4} of: {0, 1} {1, 1} {2, 1} shipment: {0, 0} {2, 0} silver: {1, 2} {1, 6} truck: {1, 7} {2, 6}
Term Frequency of 'silver' is 2 Number Of Documents with silver is 1 Inverse Document Frequency of silver is 1.0986122886681
The next step in this exploration of Information Retrieval Theory will be to develop a table that better summarizes term frequency and inverse term frequencies. I will use this table to verify that the results I'm generating match with tf-idf calculations found in the literature for the sample documents used in my calculations.
|
|
Permalink |
|
Information Retrieval Theory: Implementing an Inverted Index |
[Data Mining] |
|
Posted on June 3, 2008 @ 01:29:29 PM by Paul Meagher
Last night I downloaded a local copy of this book:
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
Most introductions to Information Retrieval begin by talking about the "Inverted Index" data structure. I had come accross this concept before, however, I haven't explored it in php code because I got hung up on the issues of what corpus to use and, more importantly, what word tokenization approach I should use for the documents. Through my surfing around on the topic of IR, I observed that you can bypass these issues by simplifying the corpus selection and word tokenization issues while retaining the essentials of the representational and computational issues. The code below illustrates a common simplification of the corpus concept and word tokenization concept that might similarly allow you to get started exploring some IR concepts.
<?php
/** * Creates an inverted index and shows what it looks like. Useful for * explorating TF-IDF calculations in Information Retrieval theory. * * @see http://en.wikipedia.org/wiki/Inverted_index */
define("DOC_ID", 0); define("TERM_POSITION", 1);
$D[0] = "Shipment of gold delivered in a fire"; $D[1] = "Delivery of silver arrived in a silver truck"; $D[2] = "Shipment of gold arrived in a truck";
$corpus_terms = array();
$num_docs = count($D);
for($doc_num=0; $doc_num < $num_docs; $doc_num++) { // zero array containing document terms $doc_terms = array(); // simplified word tokenization process $doc_terms = explode(" ", $D[$doc_num]); // here is where the indexing of terms to document locations happens $num_terms = count($doc_terms); for($term_position=0; $term_position < $num_terms; $term_position++) $corpus_terms[strtolower($doc_terms[$term_position])][]=array($doc_num, $term_position);
}
// sort by key for alphabetically ordered output ksort($corpus_terms);
$num_terms = count($corpus_terms);
// output a representation of the inverted index foreach($corpus_terms AS $term => $doc_locations) { echo "<b>$term:</b> "; foreach($doc_locations AS $doc_location) echo "{".$doc_location[DOC_ID].", ".$doc_location[TERM_POSITION]."} "; echo "<br />"; } ?>
The output of this code looks like this:
a: {0, 5} {1, 5} {2, 5} arrived: {1, 3} {2, 3} delivered: {0, 3} delivery: {1, 0} fire: {0, 6} gold: {0, 2} {2, 2} in: {0, 4} {1, 4} {2, 4} of: {0, 1} {1, 1} {2, 1} shipment: {0, 0} {2, 0} silver: {1, 2} {1, 6} truck: {1, 7} {2, 6}
Sometimes it is useful to jump ahead in your exploration of certain topics and avoid, for example, the gritty details of document tokenization. What may seem like an avoidance of the problem, might end up informing you about important subtleties in the requirements for your information retrieval system. Exploratory inverted index code can help you to see the computational consequences of representing an inverted index in a certain manner.
For me, the next step in exploring IR theory is to explore in code the
Term Frequency–Inverse Document Frequency calculations, generally abbreviated as TF-IDF. Tim Munroe illustrates some of the calculations involved, however, I will need to get my calculations working with the inverted index data structure I've developed so far.
|
|
Permalink |
|
Law of Surfing |
[Formulas] |
|
Posted on May 30, 2008 @ 03:31:42 PM by Paul Meagher
Bernardo Huberman formulated a few "laws of the web". One of these laws concerns the distribution of clickstream lengths.
The probability density function of L, the length of a clickstream, is, according to the "Law of Surfing", distributed according to an inverse guassian function:
f(L, v, λ) = sqrt(λ/2π) * L-3/2 * e-(λ/2*v*v*L)*(L-v)^2
What do the symbols L, v, λ mean?
L is session length measured in page clicks.
v is the expected value or mean. A fitted distribution of path length probabilities for Boston University students had a value of v=51.19 which the
study referred to as "mean visits".
λ is related to the expected value and variance
λ = v3/Var(L). The fitted value for λ was equal to 3.53.
Here is a php script to compute the probability densities for each clickstream length:
<?php /** * Raises a number to a floating point power * * a^b = e^(b log a) which is not 10log but the e-log (aka "ln") * so instead of pow( $a , 0.6 ) use something like * exp( 0.6 * log($a)) * * @see http://ca3.php.net/manual/en/function.pow.php#47297 */ function fpow($base, $fexp) { return exp($fexp * log($base)); }
/** * Used to compute probability density associated with each clickstream * length. */ function inverse_guassian($L, $v, $lambda) { return sqrt($lambda/(2 * M_PI)) * fpow($L, -3/2) * exp(-1.0 * ($lambda/(2 * pow($v, 2) * $L)) * pow($L-$v, 2)); }
$v = 51.19; $lambda = 3.53;
for($L=1; $L < 20; $L++) { $p = inverse_guassian($L, $v, $lambda); echo "f(".$L.", ".$v.", ".$lambda.") = ".$p."<br />"; }
?>
And these are the probability densites the program generates:
f(1, 51.19, 3.53) = 0.13738001323331 f(2, 51.19, 3.53) = 0.11731429052801 f(3, 51.19, 3.53) = 0.08563996238308 f(4, 51.19, 3.53) = 0.064395173972196 f(5, 51.19, 3.53) = 0.050294705235689 f(6, 51.19, 3.53) = 0.040551681574111 f(7, 51.19, 3.53) = 0.033538749822867 f(8, 51.19, 3.53) = 0.028310951045661 f(9, 51.19, 3.53) = 0.024298496373469 f(10, 51.19, 3.53) = 0.021143050105347 f(11, 51.19, 3.53) = 0.018610350025871 f(12, 51.19, 3.53) = 0.016541931634992 f(13, 51.19, 3.53) = 0.014827372300491 f(14, 51.19, 3.53) = 0.013387711569616 f(15, 51.19, 3.53) = 0.012165196593551 f(16, 51.19, 3.53) = 0.0111167387119 f(17, 51.19, 3.53) = 0.0102096203282 f(18, 51.19, 3.53) = 0.0094186073802048 f(19, 51.19, 3.53) = 0.0087239635554072
|
|
Permalink |
|
Powers of ten |
[Space & Physics] |
|
Posted on May 27, 2008 @ 11:05:40 PM by Paul Meagher
There is a celebrated book called "Powers of Ten" that offers up snapshots of what the universe looks like as you decrease or increase your magnification by successive powers of ten.
The cognitive scientist, Allan Newell, applied the "Powers of Ten" concept to the time dimension and observed that human action can be explained by biological, cognitive, rational and social constructs that generally operate at different time scales (Pirolli, p 19).
Table 1. Time scale on which human action occurs.
| Scale (seconds) | Time Unit | Band |
| 107 | Months | Social |
| 106 | Weeks | |
| 105 | Days | |
| 104 | Hours | Rational |
| 103 | 10 Minutes | |
| 102 | Minutes | |
| 101 | 10 Seconds | Cognitive |
| 100 | 1 Second | |
| 10-1 | 100 Milliseconds | |
| 10-2 | 1 Millisecond | Biological |
|
|
Permalink |
|
Particle Swarm Optimization (PSO) : Part 1 |
[Optimization] |
|
Posted on May 26, 2008 @ 02:47:13 PM by Paul Meagher
An interesting area of research that I was not very familiar with is Computational Swarm Intelligence. A book by A.P. Engelbrecht (2005), "Fundamentals of Computational Swarm Intelligence", John Wiley & Sons, offers an excellent overview of the area. Andries breaks the field down into three main subfields:
- Genetic Algorithm Research
- Particle Swarm Optimization
- Ant Colony Optimization
The book takes an "engineering approach" to the topic areas: The computational tools used to study these areas can be applied to a host of engineering problems requiring optimization.
The book is 599 pages long. The section on Particle Swarm Optimization is around 275 pages long. It is a book within the book and probably the best coverage of this topic area if you have a specific interest in Particle Swarm Optimization PSO algorithms.
The section of the book on PSO presents this as the first formula to learn:
xi(t) = xi(t) + vi(t + 1)
This tells us that the position of particle i in a multidimensional space is a function of the position of the particle at time xi(t) and the velocity of the particle vi(t + 1).
You can use a uniform random number generator to initialize the positions of the particles.
xi(0) ~ U(xmin, xmax)
|
|
Permalink |
|
Basic Formulas for Optimal Foraging Theory |
[Formulas] |
|
Posted on May 23, 2008 @ 02:03:11 AM by Paul Meagher
I am currently reading the book "Optimal Foraging Theory" (Oxford: 2007) by Peter Pirolli.
Peter Pirolli lays the foundation for Optimal Foraging Theory (OFT) with this formula:
R = g(tw) / (tb + tw)
This is a rate equation. The equation tells us that the overall rate of gain R of food or information is equal to a gain function applied to the amount of time you have been foraging within the current patch g(tw) divided by the amount of time it takes to move between patches tb plus the time you have spent foraging in the current patch tw.
Charnov's Marginal Value Theorem can be used to find the optimal amount of time t* to spend in a patch.
To apply the theorem we first need to rewrite the equation above as an optimal rate R* equation:
R* = g(t*) / (tb + t*)
An optimal forager obeys the two rules below (see p. 8) to find an optimal t* value:
-
if (g'(tw) >= R*) then continue foraging in the patch.
-
if (g'(tw) < R*) then start looking for a new patch
Note that g'(tw) refers to the derivative of the gain function for within-patch foraging. The means we need to compute the slope of the gain function for various values of t. If the slope at t is greater than or equal to the maximal slope R*, we stay in the patch; otherwise, we move to a new patch.
This is an abstract formulation of the optimization problem that Optimal Foraging Theory is concerned with.
Applications to particular problems in web design require operationalizatizing the concept of a "patch" and what "foraging within a patch" and "moving between patches" consists of.
The book offers ideas about how to operationalize these concepts in particular domains, including web design and web usability. To learn how to apply optimal foraging theory from a usability guru, I recommend reading a recent article by Jakob Nielson (Nov. 2007: Long versus Short Content).
|
|
Permalink |
|
Game Artificial Intelligence |
[Fun & Games] |
|
Posted on May 4, 2008 @ 02:06:14 AM by Paul Meagher
The book Game Artificial Intelligence (2007), by John Ahlquist and Jeannie Novak, is an interesting read. It is not a difficult read and only includes small illustrative code snippets, nevertheless, I learned quite a bit from it. The book is very well designed and attractive with excellent illustrations and screen shots from a variety of games. It includes numerous short interviews and blurbs by senior game designers and developers from major studios that works well with the text. I'm not really a "gamer" myself and one of the reasons I found the book interesting was because it offered a high-level, knowledgeable, and insightful discussion of program-level game design using as examples some of the games my son plays plus many others.
The code snippet below is a PHP version of a thought-provoking snippet from the book showing how the "retreat" and "attack" behavior of a game AI might be controlled:
function OnSeePlayer($agent, $player) {
if (Health($agent) < DefaultHealth($agent)*0.1)
Retreat($agent);
else
Attack($agent, $player);
}
So, if the agent sees the player, and the agent's health is less that 10% of it's normal health, the agent will retreat; otherwise the agent will attack. It is interesting to consider how one might go about building up the framework around this code snippet so that it would play a useful role in determining a Game AI's behavior. I suspect that much of the work on the Game AI component of game design could be done without any flashy graphics; mostly what you would need are simulations of your opponents tactics and strategies to see how your game AI reacts under different situations (that would eventually map onto the flashy game graphics depicting the actions and reactions).
I've not read the last two chapters on pathfinding yet so can't give a full review of this book. Nevertheless, if you are a geek looking for a good beach book this summer, I'd recommend taking this book along. If you can't play your videogames in the sand, then at least you can read an interesting discussion about the nature of the brains inside your games.
|
|
Permalink |
|
PHP Excel Classes 2007 |
[Math Projects] |
|
Posted on April 1, 2008 @ 10:10:35 AM by Paul Meagher
The opensource PHP Excel Classes 2007 Project aims to provide "a set of classes for the PHP programming language, which allow you to write to Excel 2007 files and read from Excel 2007 files. This project is built around Microsoft's OpenXML standard and PHP". What impresses me about this project is the inclusion of a formula parser and calculation engine that mimics much of Excel's functionality. I can see a use for the formula parser and the calculation engine in other projects that don't involve Excel.
Much of the functionality associated with the calculation engine is located in the PHPExcel_Calculation_Functions.php class which includes full implementations of the following 352 math functions:
If you are interested in the intersection of PHP and Math I encourage you to test drive the PHPExcel Project. The code is licensed under the GPL v2.
|
|
Permalink |
|
Correlation Matrix |
[Data Mining] |
|
Posted on March 23, 2008 @ 12:16:14 PM by Paul Meagher
CorrelationMatrix.php is a class that can be used to compute a correlation matrix given a data matrix. A data matrix can be obtained by querying a database table or collection of tables and putting the resulting values into a two-dimensional data array. You could then run the class below on your two-dimensional data array.
<?php
/**
* Creates a correlation matrix from a given data matrix.
*
* @see http://lib.stat.cmu.edu/multi/pca.c
*
* @author Paul Meagher
* @version 0.2
* @modified Mar 23, 2008
*/
class CorrelationMatrix {
private $eps = 0.005;
public $nrows = 0;
public $ncols = 0;
public $means = array();
public $stddevs = array();
public $cormat = array();
function CorrelationMatrix($data) {
// num rows
$this->nrows = count($data);
// num cols
$this->ncols = count($data[0]);
// Determine mean of column vectors of input data matrix
for ($j=0; $j < $this->ncols; $j++) {
$this->means[$j] = 0.0;
for($i=0; $i < $this->nrows; $i++)
$this->means[$j] += $data[$i][$j];
$this->means[$j] /= $this->nrows;
}
// Determine standard deviations of column vectors of data matrix.
for ($j=0; $j < $this->ncols; $j++) {
$this->stddevs[$j] = 0.0;
for ($i=0; $i < $this->nrows; $i++)
$this->stddevs[$j] += (($data[$i][$j]-$this->means[$j])*($data[$i][$j]-$this->means[$j]));
$this->stddevs[$j] /= $this->nrows;
$this->stddevs[$j] = sqrt($this->stddevs[$j]);
// The following in an inelegant but usual way to handle
// near-zero stddev values, which would cause a zero-
// divide error.
if ($this->stddevs[$j] <= $this->eps)
$this->stddevs[$j] = 1.0;
}
// Center and reduce the column vectors.
for ($i=0; $i < $this->nrows; $i++) {
for ($j=0; $j < $this->ncols; $j++) {
$data[$i][$j] -= $this->means[$j];
$x = sqrt($this->nrows);
$x *= $this->stddevs[$j];
$data[$i][$j] /= $x;
}
}
// Calculate the m * m correlation matrix.
for ($j1=0; $j1 < $this->ncols-1; $j1++) {
$this->cormat[$j1][$j1] = 1.0;
for ($j2=$j1+1; $j2 < $this->ncols; $j2++) {
$this->cormat[$j1][$j2] = 0.0;
for ($i=0; $i < $this->nrows; $i++)
$this->cormat[$j1][$j2] += ( $data[$i][$j1] * $data[$i][$j2]);
$this->cormat[$j2][$j1] = $this->cormat[$j1][$j2];
}
}
$this->cormat[$this->ncols-1][$this->ncols-1] = 1.0;
}
function printMeans() {
printf("Column Means: <br />");
for ($j=0; $j < $this->ncols; $j++)
printf("%7.2f", $this->means[$j]);
printf("<br />");
}
function printStdDevs() {
printf("Column Standard Deviations: <br />");
for ($j=0; $j < $this->ncols; $j++)
printf("%7.2f", $this->stddevs[$j]);
printf("<br />");
}
function printCorMat() {
printf("Correlation Matrix: <br />");
for ($i=0; $i < $this->ncols; $i++) {
for ($j=0; $j < $this->ncols; $j++)
//echo $this->cormat[$i][$j]." ";
printf("%7.4f", $this->cormat[$i][$j]);
printf("<br />");
}
}
}
?>
Here is a test script that demonstrates usage and generates some output:
<?php
include "CorrelationMatrix.php";
$data[0] = array(1, 2, 3); $data[1] = array(2, 3, 4); $data[2] = array(3, 4, 5); $data[3] = array(4, 6, 8); $data[4] = array(5, 8, 10);
$cm = new CorrelationMatrix($data); $cm->printMeans(); $cm->printStdDevs(); $cm->printCorMat();
?>
The output looks like this:
Column Means:
3.00 4.60 6.00
Column Standard Deviations:
1.41 2.15 2.61
Correlation Matrix:
1.0000 0.9848 0.9762
0.9848 1.0000 0.9970
0.9762 0.9970 1.0000
|
|
Permalink |
|
Stratified Random Sampling |
[Stats & Prob] |
|
Posted on March 20, 2008 @ 04:49:54 AM by Paul Meagher
Jan Steedman writes:
I was wondering if you could point me into the right direction with a somewhat off-topic sampling problem that I am currently having:
I have a set of people in the database with several coded variables (let's say: age group, gender & region). I'd like to draw a sample of 100 people in total from the database now with specific targets (i.e. constraints):
gender male: 50 people
gender female: 50 people
age group 1: 40 people
age group 2: 35 people
age group 3: 25 people
region 1: 10 people
region 2: 30 people
region 3: 60 people
The target size is 100 people in total and all constraints must be met. My problem now is how to select the "correct" set of people from the database in an efficient way? From all possible combinations of people there might only be one combination that will satisfy all constraints.
Here is a stratified_sample.php script that I came up with to solve Jan's problem. I'm not sure that it actually solved Jan's problem but I though I would publish it anyway.
<?php
// setup the gender, age, and region arrays with // appropriate number of sampling codes in each // array
for($i=0; $i<100; $i++) {
if ($i<50) $gender[$i] = "GM"; else $gender[$i] = "GF";
if ($i<40) $age[$i] = "A1"; elseif($i<75) $age[$i] = "A2"; else $age[$i] = "A3";
if ($i<10) $region[$i] = "R1"; elseif($i<40) $region[$i] = "R2"; else $region[$i] = "R3";
}
// randomize the position of sampling codes in // each array
shuffle($gender); shuffle($age); shuffle($region);
// the sample array contains the sampling codes // for selecting each sample unit
// to create the sample array loop through the // gender, age, and region arrays and concatenate // the sampling codes from each array together // to create the sampling code for that sample unit
for($i=0; $i<100; $i++) { $g = $gender[$i]; $a = $age[$i]; $r = $region[$i]; $sample[$i] = "$g$a$r"; }
// the sample array should meet the sampling // constraints while meeting the constraints // of randomization as well
echo "<pre>"; print_r($sample); echo "</pre>";
?>
The abbreviated output of the script looks like this:
Array
(
[0] => GFA1R2
[1] => GFA1R3
[2] => GFA1R3
[3] => GFA2R2
[4] => GMA2R1
[5] => GMA2R3
[6] => GFA1R2
[7] => GFA2R2
[8] => GMA2R3
[9] => GFA1R2
[10] => GMA1R2
... snip ...
[90] => GMA1R2
[91] => GMA2R1
[92] => GFA3R3
[93] => GFA2R3
[94] => GMA3R2
[95] => GFA1R3
[96] => GMA1R2
[97] => GFA3R2
[98] => GMA2R3
[99] => GFA1R1
)
For this solution to fully work you would have to code each of your sample elements according to your 3 dimensional coding system (e.g., GFA1R3 meaning "Gender Female", "Area 1", "Region 3") and when you are selecting individuals matching these sampling codes you would randomly select a person matching your sampling code from among the pool of people matching the sampling code.
|
|
Permalink |
|
Longest Common Subsequence |
[Optimization] |
|
Posted on February 21, 2008 @ 10:13:09 PM by Paul Meagher
The Longest Common Subsequence algorithm compares two strings and finds the longest subsequence shared by both strings. There are lots of applications, however, two are particularly noteworthy:
- Computing the similarity between sequences of genetic material.
- Computing differences between files (i.e., unix "diff" command).
So, without further ado, here is an algorithm for computing the Longest Common Subsequence:
<?php /** * Port of Sedgewick & Wayne's Longest Common Subsequence algorithm * found here: * * @see http://www.cs.princeton.edu/introcs/96optimization/LCS.java.html * @see http://www.cs.princeton.edu/introcs/96optimization/ * * @author Paul Meagher * @ported Feb 21, 2008 * * See also: * * @see http://www.ics.uci.edu/~eppstein/161/960229.html * * @param string $S1 first string * @param string $S2 second string * @returns string containing common subsequence or empty string */ function string_lcs($S1, $S2) {
$m = strlen($S1); $n = strlen($S2);
// A[i][j] = length of LCS of S1[i..m] and S2[j..n] $A = array();
// compute length of LCS and all subproblems via dynamic programming for ($i = $m-1; $i >= 0; $i--) { for ($j = $n-1; $j >= 0; $j--) { if ($S1[$i] == $S2[$j]) $A[$i][$j] = $A[$i+1][$j+1] + 1; else $A[$i][$j] = max($A[$i+1][$j], $A[$i][$j+1]); } }
// recover LCS itself and print it to standard output $i = 0; $j = 0; $LCS = ""; while($i < $m && $j < $n) { if ($S1[$i] == $S2[$j]) { $LCS .= $S1[$i]; $i++; $j++; } elseif ($A[$i+1][$j] >= $A[$i][$j+1]) $i++; else $j++; } return $LCS;
} ?>
Here is a test program with the output displayed in the source code:
<?php
include "string_lcs.php";
$S1 = "abcdefghijk"; $S2 = "ijklmnop";
$subsequence = string_lcs($S1, $S2); echo $subsequence;
// Output: ijk
?>
I've taken a few liberties in my port of Sedgewick and Wayne's LCS class. I reimplemented the LCS class as a string_lcs function as a class seemed like overkill at this point. I also use the word "string_" as a prefix for the function name. I used this prefix in anticipation of eventually building a library of such string functions (in addition to PHP's collection of string functions). I also renamed some variables to be more mnemonic (e.g., $S1, $S2) or standard (e.g., $A, $m, $n) and changed the capitalization of variable names to conform with whether they are being treated as a scalar (lowercase) or vector/matrix quantity (uppercase).
Consult some of the links contained in the source to learn more about how the algorithm works and about Dynamic Programming more generally.
|
|
Permalink |
|
Order statistics |
[Notation] |
|
Posted on February 16, 2008 @ 01:36:53 PM by Paul Meagher
Order statistics are statistics derived from data ordered from smallest to largest. I've recently come across a nice notation for representing order statistics which I thought deserved a blog entry.
The notation involves putting a subscripted bracket around a positional index. Using this scheme, all the measurements in our ordered sample would be denoted as y[1],...,y[n]. In other words, the brackets around the subscripts tips us off that the elements denoted by these indexed y values are not simply the raw sample values in whatever order they were collected, but rather the sample values ordered from smallest to largest.
As an example of how this notation can be used, consider how we might express the formula for computing the first, second and third quartiles (Q1, Q2, Q3). Recall that Q1 is the value that 25% of our values are less than or equal to. Q2 is the value that 50% of our values are less than or equal to and Q3 is the value that 75% of our values are less than or equal to. Or, using our notation:
Q1 = y[(n+1)/4]
Q2 = y[(n+1)/2]
Q3 = y[3(n+1)/4]
If the subscripts are not integers, we average the two closest order statisitcs to obtain the quartile value.
The minimum value y[1], the three quartiles (Q1, Q2, Q3), and the maximum value y[n] are often reported together and are called the five number summary. The box-and-whisker plot is a graphical way to depict the five number summary.
As a final example of how our order notation can be used consider the formula for the trimmed mean. In the formula below mk is the trimmed mean with k values trimed from the top and bottom of an ordered list of numbers:
mk = ∑i=k+1 to n-k x [ i ] / (n - 2k)
Here is some PHP code for computing a trimmed mean where we remove the top and bottom 2 values from our computed mean (i.e., k=2):
<?php
function trimmed_mean($X, $k) { $n = count($X); sort($X, SORT_NUMERIC); for($i=$k; $i < $n-$k; $i++) $sum += $X[$i]; $mean = $sum / ($n - 2 * $k); return $mean; }
$X = array(55, 15, 20, 1, 2, 25, 50); echo "Trimmed mean is ".trimmed_mean($X, 2).".";
// Output: Trimmed mean is 20.
?>
The next time you compute a mean you might want to consider computing a trimmed mean using a few different values of k to see how affected by outliers your computed mean is.
|
|
Permalink |
|
On the way to NURBS |
[Calculus] |
|
Posted on January 28, 2008 @ 02:03:17 AM by Paul Meagher
NURBS is an acronym for "Non-Uniform Rational B-Spline". NURBS are used extensively in CAD systems to represent surfaces. I hope to someday understand the math and programming behind NURBS better. Towards that end, I have ported some code that manipulates Bernstein polynomials. Bernstein polynomials unify the math behind Bezier approximation schemes and Spline approximation schemes. NURBS are another approximation scheme which requires that you know something about Bernstein polynomials in order to appreciate the math involved.
The Bernstein class below can be used to generate the coordinates needed to draw a curve that interpolates and approximates a set of control points. Vector-based image manipulation programs offer a Bezier tool that allows you to create curves by moving around a set of bounding control points. As you move the curve control points, the shape of the Bezier curve changes to interpolate and approximate the new position of the control points. The class below implements the math required to interpolate and approximate the position of the control points. The math looks like this:
The implementation of this math looks like this:
<?php /** * Provides methods to manipulate the Bernstein polynomial. * * An nth-degree Berstein polynomial is given by: * * B(i,n,u) = n! * u^i * (1-u)^(n-i) / i! * (n-i)! * with 0 <= i <= n * * @author R.P. from OpaleTeam * @see http://opale.belios.de * @license LPGL * @date 05/2002 * * Ported by Paul Meagher on Jan 20, 2008. */ class Bernstein {
/** * Returns the evaluation, for a given value, of a Bernstein * polynomal defined by the given arguments. * * @param int $i * @param int $n the degree of the polynomial * @param double $u the value for which the polynomial will be computed * @return double the computed value */ public function getValue($i, $n, $u) { $temp[$n-$i] = 1.0; $u1 = 1.0 - $u;
for ($k = 1; $k <= $n; $k++) for ($j = $n; $j >= $k; $j--) $temp[$j] = $u1 * $temp[$j] + $u * $temp[$j-1]; return $temp[$n];
} /** * Returns all Bernstein polynomials evaluations, for a given * value, of a Bernstein polynomal defined by the given arguments. * * @param int $n the degree of the polynomial * @param double $u the value for which the polynomial will be computed * @return double[], the computed value */ public function getAllValues($n, $u) {
$b[0] = 1.0; $u1 = 1.0 - $u;
for ($i = 1; $i <= $n; $i++) { $saved = 0.0; for ($k = 0; $k < $i; $k++) { $temp = $b[$k]; $b[$k] = $saved + $u1 * $temp; $saved = $u * $temp; } $b[$i] = $saved; } return $b;
}
}
?>
The bernstein_test.php example below illustrates usage. In this example, we use the Bernstein class to find the coefficients to multiply the $X[$i] (i.e., $X = array(1, 2, 4, 3)) and $Y[$i] coordinates (i.e., $Y = array(1, 3, 3, 1)) of 4 control points. After we separately multiply the $X[$i] and $Y[$i] terms by the Bernstein cofficients we get a new point at $t that approximates the control points.
The setup value of $n is equal to 1 minus the number of control points. $T is an array of values in the range [0,1] that identify the left (0) to right (1) positions along the approximating and interpolating curve where we want to compute it's coordinates.
<?php /** * Testing Bernstein output against worked example in: * * Rogers, David F. An Introduction to NURBS With Historical Perspective. * 2001. Morgan Kaufmann, San Francisco. pp. 22-24 */
include "Bernstein.php";
$berstein = new Bernstein;
$n = 3;
$X = array(1, 2, 4, 3); $Y = array(1, 3, 3, 1);
?> <table border='1' cellspacing='1' cellpadding='2'> <tr> <th>t</th> <th>B<sub>3,0</sub></th> <th>B<sub>3,1</sub></th> <th>B<sub>3,2</sub></th> <th>B<sub>3,3</sub></th> <th>P(t)</th> </tr> <?php $T = array(0, 0.15, 0.35, 0.50, 0.65, 0.85, 1); foreach ($T as $t) { $B = $berstein->getAllValues($n, $t); ?> <tr> <td align='center'><?php echo $t ?></td> <?php $i=0; $x=0.0; $y=0.0; foreach ($B as $b) { ?> <td align='center'><?php echo $b ?></td> <?php $x += $b *$X[$i]; $y += $b *$Y[$i]; $i++; } ?> <td align='center'><?php echo "($x, $y)"; ?></td> </tr> <?php } ?> </table>
The test script above produces the following HTML table as output:
| t |
B3,0 |
B3,1 |
B3,2 |
B3,3 |
P(t) |
| 0 |
1 |
0 |
0 |
0 |
(1, 1) |
| 0.15 |
0.614125 |
0.325125 |
0.057375 |
0.003375 |
(1.504, 1.765) |
| 0.35 |
0.274625 |
0.443625 |
0.238875 |
0.042875 |
(2.246, 2.365) |
| 0.5 |
0.125 |
0.375 |
0.375 |
0.125 |
(2.75, 2.5) |
| 0.65 |
0.042875 |
0.238875 |
0.443625 |
0.274625 |
(3.119, 2.365) |
| 0.85 |
0.003375 |
0.057375 |
0.325125 |
0.614125 |
(3.261, 1.765) |
| 1 |
0 |
0 |
0 |
1 |
(3, 1) |
The P(t) column above contains a set of points which we can graph that will approximate the set of control points a user has specified for the curve they want drawn. The x coordinate of each point was obtained by multiplying the 4 Berstein coefficients by the 4 x coordinates. Similarly, the y coordinate of each point was obtained by multiplying the 4 Bernstein coefficients by the 4 y coordinates.
|
|
Permalink |
|
Population sampling |
[Stats & Prob] |
|
Posted on January 11, 2008 @ 04:11:14 AM by Paul Meagher
Below is a simple but useful class for generating a population sample:
<?php /** * Generates a random sample (with or without replacement) * of size n from population array. * * @see http://www.math.uah.edu/stat/objects/distributions/Functions.xhtml */ class Population {
const WITHOUT_REPLACEMENT=0; const WITH_REPLACEMENT=1;
function random($min=0, $max=1) { return ($min+lcg_value()*(abs($max-$min))); }
/** * This method computes a sample of a specified size from a specified population * and of a specified type (with or without replacement). * @param int[] $p the population * @param int $n the sample size * @param int $t the type (0 without replacement, 1 with replacement); * @return int $s */ public function getSample($p, $n, $t){ $m = count($p); if ($n < 1) $n = 1; elseif ($n > $m) $n = $m; //Define the sample if ($t == self::WITH_REPLACEMENT){ for ($i = 0; $i < $n; $i++){ $u = (int)($m * $this->random()); $s[$i] = $p[$u]; } } else{ for ($i=0; $i < $n; $i++){ //Select a random index from 0 to m - i - 1; $k = $m - $i; $u = (int)($k * $this->random()); //Define the sample element $s[$i] = $p[$u]; // Interchange the sampled element p[u] with p[k - 1], at the end of the // population so that it will not be sampled again. $temp = $p[$k-1]; $p[$k-1] = $p[$u]; $p[$u] = $temp; } } return $s; }
}
?>
Here is some code to demonstrate usage:
<?php
include "Population.php";
$p = range(1,9); $n = 3; $t = 0;
$pop = new Population; $s = $pop->getSample($p, $n, $t);
echo "<pre>"; print_r($s); echo "</pre>";
?>
Here is what the output looks like:
Array
(
[0] => 2
[1] => 9
[2] => 4
)
Perhaps you will find a use for this sampler some day...
|
|
Permalink |
|
Monte Carlo Integration |
[Calculus] |
|
Posted on January 9, 2008 @ 03:09:37 AM by Paul Meagher
Monte carlo integration is a way to find the area under a curve using random numbers. The program below provides a nice introductory example of a monte carlo method and how it can be used to solve a calculus problem.
<?php /** * Approximates area of the region under the graph of a * monotonic function and above the x-axis beween a and b. * * Adaptation of Monte Carlo Method program by James M. Sconyers * found in: * * Larson, Hostetler, Edwards (2006). Calculus I with Precalculus. * 2nd Edition. Houghton Mifflin Co, Boston. p. 490 */ class MonteCarlo_Integration {
/** * @see http://ca.php.net/manual/en/function.mt-rand.php#75793 */ function random_float($min, $max) { return ($min+lcg_value()*(abs($max-$min))); } /** * Compute area under the curve using monte carlo method. * * @param object $f function object * @param float $a lower limit * @param float $b upper limit * @param int $n number of samples * @return float area under the curve */ function integrate($f, $a, $b, $n) { $p = 0; if ($f->valueAt($a) > $f->valueAt($b)) $ymax = $f->valueAt($a); else $ymax = $f->valueAt($b); for ($i=1; $i < $n; $i++) { $x = $a + ($b - $a) * $this->random_float(0, 1); $y = $ymax * $this->random_float(0, 1); if ($y < $f->valueAt($x)) $p = $p+1; } $area = ($p/$n)*($b-$a)*$ymax; return $area; }
}
?>
Here is a program to test the MonteCarlo_Integration.php class:
<?php
include "MonteCarlo_Integration.php";
class F { function valueAt($x) { return pow($x, 2); } }
$f = new F; $a = 0; $b = 2; $n = 10000;
$mc = new MonteCarlo_Integration; $area = $mc->integrate($f, $a, $b, $n);
echo "Approximate Area: $area";
?>
The output of one execution of the program is:
Approximate Area: 2.6632
The exact answer is 8/3.
One programming project would involve histogramming and summarizing the area estimates produced by the above class where $n, the number of samples, is varied. Doing so would give you a feel for how variable the estimates are with different sample sizes and how the monte carlo estimates are distributed. A project like this might offer some insight into more advanced monte carlo applications.
|
|
Permalink |
|
php/Math Project
© 2007. All rights reserved.
|