php/Math   
Recreational Mathematics   
   home  |  library  |  contact
 Math Notes
 Math Programming [25]
 Regression [3]
 Data Mining [17]
 Notation [6]
 Linear Algebra [9]
 Stats & Prob [15]
 Math Cognition [5]
 Space & Physics [6]
 Formulas [5]
 Fun & Games [2]
 Haskell [1]
 Bayes Theory [1]
 Site News [0]
 Math Projects [5]
 Polynomials [1]
 Calculus [9]
 Number Theory [3]
 Optimization [2]
 Financial [1]

 Math Links
 PHP/ir
 Andrew Gelman
 Chance Wiki
 Daniel Lemire
 KD Knuggets
 Social Stats
 MySQL Performance
 Hunch.net
 Matthew Hurst
 JMLR
 JSS
 Hal Daume III
 Math Notes >> Permanent Link

Information Retrieval Theory: Implementing an Inverted Index [Regression
Posted on June 3, 2008 @ 01:29:29 PM by Paul Meagher

Last night I downloaded a local copy of this book:

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.

Most introductions to Information Retrieval begin by talking about the "Inverted Index" data structure. I had come accross this concept before, however, I haven't explored it in php code because I got hung up on the issues of what corpus to use and, more importantly, what word tokenization approach I should use for the documents. Through my surfing around on the topic of IR, I observed that you can bypass these issues by simplifying the corpus selection and word tokenization issues while retaining the essentials of the representational and computational issues. The code below illustrates a common simplification of the corpus concept and word tokenization concept that might similarly allow you to get started exploring some IR concepts.

<?php

/**
* Creates an inverted index and shows what it looks like.  Useful for
* explorating TF-IDF calculations in Information Retrieval theory.
*
* @see http://en.wikipedia.org/wiki/Inverted_index
*/

define("DOC_ID"0);
define("TERM_POSITION"1);

$D[0] = "Shipment of gold delivered in a fire";
$D[1] = "Delivery of silver arrived in a silver truck";
$D[2] = "Shipment of gold arrived in a truck";

$corpus_terms = array();

$num_docs count($D);

for(
$doc_num=0$doc_num $num_docs$doc_num++) {
  
  
// zero array containing document terms
  
$doc_terms = array();
  
  
// simplified word tokenization process
  
$doc_terms explode(" "$D[$doc_num]);
  
  
// here is where the indexing of terms to document locations happens
  
$num_terms count($doc_terms);
  for(
$term_position=0$term_position $num_terms$term_position++) 
    
$corpus_terms[strtolower($doc_terms[$term_position])][]=array($doc_num$term_position);

}

// sort by key for alphabetically ordered output
ksort($corpus_terms);

$num_terms count($corpus_terms);

// output a representation of the inverted index
foreach($corpus_terms AS $term => $doc_locations) {
  echo 
"<b>$term:</b> ";
  foreach(
$doc_locations AS $doc_location
    echo 
"{".$doc_location[DOC_ID].", ".$doc_location[TERM_POSITION]."} ";
  echo 
"<br />";  
}    
      
?>  

The output of this code looks like this:

a: {0, 5} {1, 5} {2, 5}
arrived: {1, 3} {2, 3}
delivered: {0, 3}
delivery: {1, 0}
fire: {0, 6}
gold: {0, 2} {2, 2}
in: {0, 4} {1, 4} {2, 4}
of: {0, 1} {1, 1} {2, 1}
shipment: {0, 0} {2, 0}
silver: {1, 2} {1, 6}
truck: {1, 7} {2, 6}

Sometimes it is useful to jump ahead in your exploration of certain topics and avoid, for example, the gritty details of document tokenization. What may seem like an avoidance of the problem, might end up informing you about important subtleties in the requirements for your information retrieval system. Exploratory inverted index code can help you to see the computational consequences of representing an inverted index in a certain manner.

For me, the next step in exploring IR theory is to explore in code the Term Frequency–Inverse Document Frequency calculations, generally abbreviated as TF-IDF. Tim Munroe illustrates some of the calculations involved, however, I will need to get my calculations working with the inverted index data structure I've developed so far.

Permalink 

No comments entered ...

 Archive 
 

php/Math Project
© 2011. All rights reserved.