Posted on June 5, 2008 @ 02:21:07 PM by Paul Meagher
As I reflected on how I would begin computing summary statistics over an inverted index (which I discussed and implemented in my last post), I realized that I would have to immediately turn my script into a class in order to easily and efficiently pass around the inverted index to the statistical functions. Here is what I have developed so far (see also):
<?php
/**
 * Information Retrieval
 *
 * Class used to explore information retrieval theory and concepts.
 */
// indexes into each {document id, term position} pair stored in the inverted index
define("DOC_ID", 0);
define("TERM_POSITION", 1);
class IR {
    public $num_docs = 0;
    public $corpus_terms = array();
    /**
     * Show Documents
     *
     * Helper function that shows the contents of your corpus documents.
     *
     * @param array $D document corpus as array of strings
     */
    function show_docs($D) {
        $ndocs = count($D);
        for($doc_num=0; $doc_num < $ndocs; $doc_num++) {
            ?>
            <p>
            Document #<?php echo ($doc_num+1); ?>:<br />
            <?php echo $D[$doc_num]; ?>
            </p>
            <?php
        }
    }
    /**
     * Create Index
     *
     * Creates an inverted index from the supplied corpus documents.
     * Inverted index stored in corpus_terms array.
     *
     * @param array $D document corpus as array of strings
     */
    function create_index($D) {
        $this->num_docs = count($D);
        for($doc_num=0; $doc_num < $this->num_docs; $doc_num++) {
            // zero array containing document terms
            $doc_terms = array();
            // simplified word tokenization process
            $doc_terms = explode(" ", $D[$doc_num]);
            // here is where the indexing of terms to document locations happens
            $num_terms = count($doc_terms);
            for($term_position=0; $term_position < $num_terms; $term_position++) {
                $term = strtolower($doc_terms[$term_position]);
                $this->corpus_terms[$term][] = array($doc_num, $term_position);
            }
        }
    }
    /**
     * Show Index
     *
     * Helper function that outputs inverted index in a standard format.
     */
    function show_index() {
        // sort by key for alphabetically ordered output
        ksort($this->corpus_terms);
        // output a representation of the inverted index
        foreach($this->corpus_terms as $term => $doc_locations) {
            echo "<b>$term:</b> ";
            foreach($doc_locations as $doc_location) {
                echo "{".$doc_location[DOC_ID].", ".$doc_location[TERM_POSITION]."} ";
            }
            echo "<br />";
        }
    }
    /**
     * Term Frequency
     *
     * @param string $term
     * @return frequency of term in corpus
     */
    function tf($term) {
        $term = strtolower($term);
        // a term that never occurs has no entry in the index
        if (!isset($this->corpus_terms[$term])) return 0;
        return count($this->corpus_terms[$term]);
    }

    /**
     * Number Documents With
     *
     * @param string $term
     * @return number of documents with term
     */
    function ndw($term) {
        $term = strtolower($term);
        if (!isset($this->corpus_terms[$term])) return 0;
        $doc_locations = $this->corpus_terms[$term];
        $num_locations = count($doc_locations);
        $docs_with_term = array();
        // key on the document id so repeated occurrences in one document count once
        for($doc_location=0; $doc_location < $num_locations; $doc_location++)
            $docs_with_term[$doc_locations[$doc_location][DOC_ID]] = true;
        return count($docs_with_term);
    }

    /**
     * Inverse Document Frequency
     *
     * @param string $term
     * @return inverse document frequency of term (natural log)
     */
    function idf($term) {
        // log of the ratio N / ndw; the term is assumed to occur in the corpus
        return log($this->num_docs / $this->ndw($term));
    }
}
?>
Here is a script I developed to test the methods of the IR class:
<?php
include "IR.php";
$D[0] = "Shipment of gold delivered in a fire";
$D[1] = "Delivery of silver arrived in a silver truck";
$D[2] = "Shipment of gold arrived in a truck";
$ir = new IR();
echo "<p><b>Corpus:</b></p>";
$ir->show_docs($D);
$ir->create_index($D);
echo "<p><b>Inverted Index:</b></p>";
$ir->show_index();
$term = "silver";
$tf   = $ir->tf($term);
$ndw  = $ir->ndw($term);
$idf  = $ir->idf($term);
echo "<p>";
echo "Term Frequency of '$term' is $tf<br />";
echo "Number Of Documents with $term is $ndw<br />";
echo "Inverse Document Frequency of $term is $idf";
echo "</p>";
?>
Here is the output that the script generates:
Corpus:
Document #1:
Shipment of gold delivered in a fire
Document #2:
Delivery of silver arrived in a silver truck
Document #3:
Shipment of gold arrived in a truck
Inverted Index:
a: {0, 5} {1, 5} {2, 5}
arrived: {1, 3} {2, 3}
delivered: {0, 3}
delivery: {1, 0}
fire: {0, 6}
gold: {0, 2} {2, 2}
in: {0, 4} {1, 4} {2, 4}
of: {0, 1} {1, 1} {2, 1}
shipment: {0, 0} {2, 0}
silver: {1, 2} {1, 6}
truck: {1, 7} {2, 6}
Term Frequency of 'silver' is 2
Number Of Documents with silver is 1
Inverse Document Frequency of silver is 1.0986122886681
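As a check on the last line of output, the value for silver can be worked by hand. This assumes the common definition of inverse document frequency as the natural logarithm of the corpus size divided by the number of documents containing the term (PHP's log() is a natural logarithm by default). The symbols N and n_t are notation introduced here for illustration, not names from the code.

```latex
\[
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
\mathrm{idf}(\text{silver}) = \ln\frac{3}{1} \approx 1.0986
\]
\[
\mathrm{idf}(\text{gold}) = \ln\frac{3}{2} \approx 0.4055
\]
```

Because silver occurs in only one document, it makes no difference whether the division happens inside or outside the logarithm. A term such as gold, which occurs in two documents, is a more revealing test case: ln(3)/2 is roughly 0.5493, whereas ln(3/2) is roughly 0.4055.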
The next step in this exploration of Information Retrieval Theory will be to develop a table that better summarizes term frequencies and inverse document frequencies. I will use this table to verify that the results I'm generating match the tf-idf calculations found in the literature for the sample documents used in my calculations.
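To give a rough idea of what such a summary table might look like, here is a minimal, self-contained sketch. It is only one possible shape, not the final design: summarize_terms() is a hypothetical helper that does not exist in the IR class, it re-tokenizes the corpus itself rather than reusing the inverted index, and it assumes idf is the natural log of the number of documents divided by the number of documents containing the term.

```php
<?php
// Sample corpus copied from the test script above.
$docs = array(
    "Shipment of gold delivered in a fire",
    "Delivery of silver arrived in a silver truck",
    "Shipment of gold arrived in a truck",
);

/**
 * Hypothetical helper (not part of the IR class): builds one summary
 * row per term with its corpus frequency, document count and idf.
 */
function summarize_terms(array $docs) {
    $num_docs = count($docs);
    $tf   = array(); // term => number of occurrences in the whole corpus
    $seen = array(); // term => set of document ids that contain the term

    foreach ($docs as $doc_num => $doc) {
        // same simplified tokenization as the class: lowercase, split on spaces
        foreach (explode(" ", strtolower($doc)) as $term) {
            $tf[$term] = isset($tf[$term]) ? $tf[$term] + 1 : 1;
            $seen[$term][$doc_num] = true;
        }
    }

    ksort($tf); // alphabetical order, like show_index()
    $rows = array();
    foreach ($tf as $term => $count) {
        $ndw = count($seen[$term]);
        $rows[$term] = array(
            "tf"  => $count,
            "ndw" => $ndw,
            "idf" => log($num_docs / $ndw), // assumption: idf = ln(N / ndw)
        );
    }
    return $rows;
}

// Print the table as plain text, one term per line.
foreach (summarize_terms($docs) as $term => $row) {
    printf("%-10s tf=%d ndw=%d idf=%.4f\n",
        $term, $row["tf"], $row["ndw"], $row["idf"]);
}
```

For the sample corpus the silver row should report a term frequency of 2, a document count of 1 and an idf of about 1.0986, in line with the output shown earlier. Published worked examples often use a base-10 logarithm, so the idf column may need rescaling before comparing against them.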