php/Math   
Recreational Mathematics   

Information Retrieval Class [Regression]
Posted on June 5, 2008 @ 02:21:07 PM by Paul Meagher

As I reflected on how to begin computing summary statistics over an inverted index (which I discussed and implemented in my last post), I realized that I would first have to turn my script into a class so that the inverted index could be passed easily and efficiently to the statistical functions. Here is what I have developed so far (see also):

<?php

/**
* Information Retrieval
*
* Class used to explore information retrieval theory and concepts.
*/

// indexes into each array(doc id, term position) entry of the inverted index
define("DOC_ID", 0);
define("TERM_POSITION", 1);

class IR {

  public $num_docs = 0;

  public $corpus_terms = array();
  
  /*
  * Show Documents
  *
  * Helper function that shows the contents of your corpus documents.
  *
  * @param array $D document corpus as array of strings
  */
  function show_docs($D) {
    $ndocs = count($D);
    for($doc_num=0; $doc_num < $ndocs; $doc_num++) {
      ?>
      <p>
      Document #<?php echo ($doc_num+1); ?>:<br />
      <?php echo $D[$doc_num]; ?>
      </p>
      <?php
    }
  }

  
  /*
  * Create Index
  *
  * Creates an inverted index from the supplied corpus documents.
  * Inverted index stored in corpus_terms array.
  *
  * @param array $D document corpus as array of strings
  */
  function create_index($D) {
    $this->num_docs = count($D);
    for($doc_num=0; $doc_num < $this->num_docs; $doc_num++) {
      // reset array containing document terms
      $doc_terms = array();
      // simplified word tokenization process
      $doc_terms = explode(" ", $D[$doc_num]);
      // here is where the indexing of terms to document locations happens
      $num_terms = count($doc_terms);
      for($term_position=0; $term_position < $num_terms; $term_position++) {
        $term = strtolower($doc_terms[$term_position]);
        $this->corpus_terms[$term][] = array($doc_num, $term_position);
      }
    }
  }

  
  /*
  * Show Index
  *
  * Helper function that outputs inverted index in a standard format.
  */
  function show_index() {
    // sort by key for alphabetically ordered output
    ksort($this->corpus_terms);
    // output a representation of the inverted index
    foreach($this->corpus_terms AS $term => $doc_locations) {
      echo "<b>$term:</b> ";
      foreach($doc_locations AS $doc_location)
        echo "{".$doc_location[DOC_ID].", ".$doc_location[TERM_POSITION]."} ";
      echo "<br />";
    }
  }

  
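  /*
  * Term Frequency
  *
  * @param string $term
  * @return frequency of term in corpus
  */
  function tf($term) {
    $term = strtolower($term);
    // a term that is not in the index has a frequency of zero
    if (!isset($this->corpus_terms[$term]))
      return 0;
    return count($this->corpus_terms[$term]);
  }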
   
  
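  /*
  * Number Documents With
  *
  * @param string $term
  * @return number of documents with term
  */
  function ndw($term) {
    $term = strtolower($term);
    // a term that is not in the index occurs in no documents
    if (!isset($this->corpus_terms[$term]))
      return 0;
    $doc_locations = $this->corpus_terms[$term];
    $num_locations = count($doc_locations);
    $docs_with_term = array();
    // key the array by document id so that each document is counted once
    for($doc_location=0; $doc_location < $num_locations; $doc_location++) {
      $doc_id = $doc_locations[$doc_location][DOC_ID];
      $docs_with_term[$doc_id] = true;
    }
    return count($docs_with_term);
  }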
   
  
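  /*
  * Inverse Document Frequency
  *
  * @param string $term
  * @return inverse document frequency of term
  */
  function idf($term) {
    // natural log of (number of documents / number of documents with term);
    // undefined when the term does not occur in the corpus
    return log($this->num_docs / $this->ndw($term));
  }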

}

?>
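
The explode(" ", ...) call in create_index is, as its comment says, a simplified tokenizer: punctuation stays attached to words, and repeated spaces produce empty terms. A minimal sketch of one alternative follows. The tokenize() function is a hypothetical helper, not part of the IR class, and the choice to split on anything that is not a letter or digit is an assumption rather than a rule taken from the IR literature.

```php
<?php
// Hypothetical helper (not part of the IR class): lowercase the text, then
// split on runs of characters that are neither letters nor digits.
function tokenize($text) {
  // PREG_SPLIT_NO_EMPTY drops the empty strings that leading, trailing,
  // or repeated separators would otherwise produce.
  return preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

print_r(tokenize("Shipment of gold, delivered in a fire."));
```

For the three sample documents used below, which contain no punctuation, this yields the same terms and positions as splitting on single spaces.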

Here is a script I developed to test the methods of the IR class:

<?php

include "IR.php";

$D[0] = "Shipment of gold delivered in a fire";
$D[1] = "Delivery of silver arrived in a silver truck";
$D[2] = "Shipment of gold arrived in a truck";

$ir = new IR();

echo "<p><b>Corpus:</b></p>";
$ir->show_docs($D);

$ir->create_index($D);

echo "<p><b>Inverted Index:</b></p>";
$ir->show_index();

$term = "silver";
$tf   = $ir->tf($term);
$ndw  = $ir->ndw($term);
$idf  = $ir->idf($term);
echo "<p>";
echo "Term Frequency of '$term' is $tf<br />";
echo "Number Of Documents with $term is $ndw<br />";
echo "Inverse Document Frequency of $term is $idf";
echo "</p>";

?>

Here is the output that the script generates:

Corpus:

Document #1:
Shipment of gold delivered in a fire

Document #2:
Delivery of silver arrived in a silver truck

Document #3:
Shipment of gold arrived in a truck

Inverted Index:

a: {0, 5} {1, 5} {2, 5}
arrived: {1, 3} {2, 3}
delivered: {0, 3}
delivery: {1, 0}
fire: {0, 6}
gold: {0, 2} {2, 2}
in: {0, 4} {1, 4} {2, 4}
of: {0, 1} {1, 1} {2, 1}
shipment: {0, 0} {2, 0}
silver: {1, 2} {1, 6}
truck: {1, 7} {2, 6}

Term Frequency of 'silver' is 2
Number Of Documents with silver is 1
Inverse Document Frequency of silver is 1.0986122886681
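
The last line of output can be checked by hand. Assuming the common textbook definition of inverse document frequency, with N the number of documents (num_docs) and n_t the number of documents containing term t (the value returned by ndw), and using the natural logarithm as PHP's log() does:

```latex
\[
\mathrm{idf}(t) = \ln\frac{N}{n_t},
\qquad
\mathrm{idf}(\text{silver}) = \ln\frac{3}{1} = \ln 3 \approx 1.0986
\]
```

Under the same assumption, a term that occurs in two of the three documents, such as "gold", would have an idf of ln(3/2), roughly 0.4055, and a term that occurs in every document, such as "of", would have an idf of ln(1) = 0.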

The next step in this exploration of Information Retrieval Theory will be to develop a table that better summarizes term frequencies and inverse document frequencies. I will use this table to verify that the results I'm generating match the tf-idf calculations found in the literature for the sample documents used in my calculations.
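
As a rough sketch of what such a summary table might look like, the standalone script below rebuilds the inverted index for the three sample documents and prints one row per term. The summarize_index() function is a hypothetical helper, not part of the IR class; it assumes the corpus-wide term frequency and the idf definition log(N / n), which may differ from the per-document weighting used in some tf-idf references.

```php
<?php
// Hypothetical helper (not part of the IR class): summarize an inverted index
// shaped like $corpus_terms, i.e. term => list of array(doc id, term position).
function summarize_index(array $corpus_terms, $num_docs) {
  $rows = array();
  ksort($corpus_terms);
  foreach ($corpus_terms as $term => $doc_locations) {
    // corpus-wide term frequency: one index entry per occurrence
    $tf = count($doc_locations);
    // number of documents with the term: distinct doc ids (column 0)
    $ndw = count(array_unique(array_column($doc_locations, 0)));
    $rows[$term] = array(
      "tf"  => $tf,
      "ndw" => $ndw,
      "idf" => log($num_docs / $ndw),
    );
  }
  return $rows;
}

// Rebuild the index for the sample documents, as create_index does.
$D = array(
  "Shipment of gold delivered in a fire",
  "Delivery of silver arrived in a silver truck",
  "Shipment of gold arrived in a truck",
);
$index = array();
foreach ($D as $doc_num => $doc) {
  foreach (explode(" ", strtolower($doc)) as $term_position => $term) {
    $index[$term][] = array($doc_num, $term_position);
  }
}

$rows = summarize_index($index, count($D));
foreach ($rows as $term => $row) {
  printf("%-10s tf=%d ndw=%d idf=%.4f\n", $term, $row["tf"], $row["ndw"], $row["idf"]);
}
```

Keeping the summary as an array of rows, rather than echoing HTML directly, would make it easier to compare the numbers against published tables before deciding how to display them.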

php/Math Project
© 2011. All rights reserved.