org.galagosearch.core.parse
Class TagTokenizer

java.lang.Object
  extended by org.galagosearch.core.parse.TagTokenizer
All Implemented Interfaces:
org.galagosearch.tupleflow.Processor<Document>, org.galagosearch.tupleflow.Source<Document>, org.galagosearch.tupleflow.Step

@InputClass(className="org.galagosearch.core.parse.Document")
@OutputClass(className="org.galagosearch.core.parse.Document")
public class TagTokenizer
extends java.lang.Object
implements org.galagosearch.tupleflow.Source<Document>, org.galagosearch.tupleflow.Processor<Document>

This class processes document text into tokens that can be indexed.

The text is assumed to contain HTML or XML tags. The tokenizer extracts as much data as it can from each document, even when the markup is not well formed (for example, start tags with no matching end tags). The resulting Document object contains an array of terms and an array of tags.
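
The lenient behaviour described above can be illustrated with a small, self-contained sketch. It is not the TagTokenizer implementation: the class LenientTagSketch, its Parsed holder, and the rule that only opening tags are recorded are all assumptions made for illustration. It shows one way to keep collecting terms and tags when a start tag is never closed or a tag is left unterminated.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for illustration only; not the Galago implementation. */
class LenientTagSketch {

    /** Holds the two outputs the class description mentions: terms and tags. */
    static class Parsed {
        final List<String> terms = new ArrayList<String>();
        final List<String> tags = new ArrayList<String>();
    }

    /** Scans the text once, collecting words as terms and opening tag names as tags. */
    static Parsed parse(String text) {
        Parsed out = new Parsed();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '<') {
                flush(word, out.terms);
                int end = text.indexOf('>', i);
                if (end < 0) {
                    // Unterminated tag: stop scanning instead of throwing.
                    break;
                }
                String name = tagName(text.substring(i + 1, end));
                if (!name.isEmpty()) {
                    out.tags.add(name);
                }
                i = end + 1;
            } else if (Character.isLetterOrDigit(c)) {
                word.append(c);
                i++;
            } else {
                // Any other character ends the current word.
                flush(word, out.terms);
                i++;
            }
        }
        flush(word, out.terms);
        return out;
    }

    /** Returns the name of an opening tag, or "" for a closing tag (assumption: closing tags are not recorded). */
    private static String tagName(String inside) {
        if (inside.startsWith("/")) {
            return "";
        }
        int n = 0;
        while (n < inside.length() && Character.isLetterOrDigit(inside.charAt(n))) {
            n++;
        }
        return inside.substring(0, n);
    }

    /** Moves the pending word, if any, into the term list. */
    private static void flush(StringBuilder word, List<String> terms) {
        if (word.length() > 0) {
            terms.add(word.toString());
            word.setLength(0);
        }
    }
}
```

Calling LenientTagSketch.parse on input whose p and b tags are never closed still returns every word and both tag names, because the sketch never requires a start tag to be matched by an end tag.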

Author:
trevor

Nested Class Summary
static class TagTokenizer.Pair
           
 
Field Summary
 org.galagosearch.tupleflow.Processor<Document> processor
           
 
Constructor Summary
TagTokenizer()
           
 
Method Summary
 void close()
           
 java.lang.Class<Document> getInputClass()
           
 java.lang.Class<Document> getOutputClass()
           
 java.util.ArrayList<TagTokenizer.Pair> getTokenPositions()
           
 void onAmpersand()
           
 void process(Document document)
          Parses the text in the document.text attribute and fills in the document.terms and document.tags arrays, then passes that document to the next processing stage.
 void reset()
          Resets parsing in preparation for the next document.
 void setProcessor(org.galagosearch.tupleflow.Step processor)
           
 void tokenize(Document document)
          Parses the text in the document.text attribute and fills in the document.terms and document.tags arrays.
 Document tokenize(java.lang.String text)
          Parses the text in the input string and returns a document object.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

processor

public org.galagosearch.tupleflow.Processor<Document> processor
Constructor Detail

TagTokenizer

public TagTokenizer()
Method Detail

reset

public void reset()
Resets parsing in preparation for the next document.


onAmpersand

public void onAmpersand()

process

public void process(Document document)
             throws java.io.IOException
Parses the text in the document.text attribute and fills in the document.terms and document.tags arrays, then passes that document to the next processing stage.

Specified by:
process in interface org.galagosearch.tupleflow.Processor<Document>
Parameters:
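document - the document whose text attribute is parsed; its terms and tags arrays are filled in before it is passed to the next stage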
Throws:
java.io.IOException
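
The "fill in, then pass on" pattern that process describes might look like the following sketch. Every type here (StageSketch, DocSketch, TokenizingStageSketch, CollectingStageSketch) is a hypothetical stand-in rather than a TupleFlow or Galago class, the whitespace split stands in for real tokenization, and forwarding close() to the next stage is an assumption of the sketch.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a processing stage; not the TupleFlow Processor interface. */
interface StageSketch<T> {
    void process(T item) throws IOException;
    void close() throws IOException;
}

/** Hypothetical stand-in for a document with a text attribute and a term list. */
class DocSketch {
    String text;
    List<String> terms = new ArrayList<String>();

    DocSketch(String text) {
        this.text = text;
    }
}

/** Fills in the terms of each document, then hands the same object to the next stage. */
class TokenizingStageSketch implements StageSketch<DocSketch> {
    private final StageSketch<DocSketch> next;

    TokenizingStageSketch(StageSketch<DocSketch> next) {
        this.next = next;
    }

    public void process(DocSketch doc) throws IOException {
        // Placeholder tokenization: split on whitespace only.
        for (String t : doc.text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                doc.terms.add(t);
            }
        }
        next.process(doc);
    }

    public void close() throws IOException {
        // Assumption of this sketch: closing is forwarded downstream.
        next.close();
    }
}

/** Terminal stage that records what it receives, useful for checking the chain. */
class CollectingStageSketch implements StageSketch<DocSketch> {
    final List<DocSketch> seen = new ArrayList<DocSketch>();
    boolean closed = false;

    public void process(DocSketch doc) {
        seen.add(doc);
    }

    public void close() {
        closed = true;
    }
}
```

Wiring a TokenizingStageSketch in front of a CollectingStageSketch and calling process once leaves a single document in the collector with its terms already filled in.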

tokenize

public void tokenize(Document document)
Parses the text in the document.text attribute and fills in the document.terms and document.tags arrays.

Parameters:
document - the document whose text attribute is parsed and whose terms and tags arrays are filled in
Throws:
java.io.IOException

tokenize

public Document tokenize(java.lang.String text)
                  throws java.io.IOException
Parses the text in the input string and returns a document object. This method calls the other variant, tokenize(Document).

Returns:
A new document object containing the parsed text from the input string.
Throws:
java.io.IOException
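
The relationship between the two tokenize overloads can be sketched as below. TextDocSketch and OverloadSketch are hypothetical names, and the regular-expression split is a placeholder for the real parsing; the only point illustrated is that the String variant wraps its input in a new document and delegates to the in-place variant.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a document; not the Galago Document class. */
class TextDocSketch {
    String text;
    List<String> terms = new ArrayList<String>();
}

/** Illustrates a convenience overload that delegates to an in-place variant. */
class OverloadSketch {

    /** In-place variant: reads document.text and fills document.terms. */
    void tokenize(TextDocSketch document) {
        document.terms.clear();
        // Placeholder parsing: split on runs of non-alphanumeric characters.
        for (String t : document.text.split("[^A-Za-z0-9]+")) {
            if (!t.isEmpty()) {
                document.terms.add(t);
            }
        }
    }

    /** Convenience variant: builds a new document around the string, then delegates. */
    TextDocSketch tokenize(String text) {
        TextDocSketch document = new TextDocSketch();
        document.text = text;
        tokenize(document);
        return document;
    }
}
```

With this shape, a caller that only has a string gets back a fresh document each time, while a caller that already holds a document can have it filled in without an extra allocation.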

getTokenPositions

public java.util.ArrayList<TagTokenizer.Pair> getTokenPositions()

setProcessor

public void setProcessor(org.galagosearch.tupleflow.Step processor)
                  throws org.galagosearch.tupleflow.IncompatibleProcessorException
Specified by:
setProcessor in interface org.galagosearch.tupleflow.Source<Document>
Throws:
org.galagosearch.tupleflow.IncompatibleProcessorException

close

public void close()
           throws java.io.IOException
Specified by:
close in interface org.galagosearch.tupleflow.Processor<Document>
Throws:
java.io.IOException

getInputClass

public java.lang.Class<Document> getInputClass()

getOutputClass

public java.lang.Class<Document> getOutputClass()


Copyright © 2009. All Rights Reserved.