java.lang.Object
  org.galagosearch.core.parse.TagTokenizer

@InputClass(className="org.galagosearch.core.parse.Document")
@OutputClass(className="org.galagosearch.core.parse.Document")
public class TagTokenizer
This class processes document text into tokens that can be indexed.
The text is assumed to contain some HTML/XML tags. The tokenizer tries to extract as much data as possible from each document, even if it is not well formed (e.g. there are start tags with no ending tags). The resulting document object contains an array of terms and an array of tags.
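The tolerant behavior described above can be illustrated with a small, self-contained sketch. It is not the Galago implementation: the class `LenientTagSketch`, its `Result` holder, and the choices to lowercase terms and to record only start-tag names are all assumptions made for illustration. It shows one way a scanner can keep extracting terms and tags when a start tag has no end tag or when a tag is never terminated.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; this is NOT the Galago TagTokenizer source.
class LenientTagSketch {
    /** Hypothetical result holder, loosely mirroring the terms and tags arrays. */
    static class Result {
        final List<String> terms = new ArrayList<>();
        final List<String> tags = new ArrayList<>();
    }

    /** Scans the text once, collecting lowercase terms and start-tag names. */
    static Result tokenize(String text) {
        Result result = new Result();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '<') {
                flush(word, result.terms);
                int end = text.indexOf('>', i);
                if (end < 0) {
                    // Unterminated tag: stop scanning instead of throwing.
                    break;
                }
                String name = tagName(text.substring(i + 1, end));
                if (!name.isEmpty()) {
                    result.tags.add(name);
                }
                i = end + 1;
            } else if (Character.isLetterOrDigit(c)) {
                word.append(Character.toLowerCase(c));
                i++;
            } else {
                // Whitespace or punctuation ends the current term.
                flush(word, result.terms);
                i++;
            }
        }
        flush(word, result.terms);
        return result;
    }

    /** Returns the leading name of a start tag, or "" for end tags and comments. */
    static String tagName(String inner) {
        StringBuilder name = new StringBuilder();
        for (int k = 0; k < inner.length(); k++) {
            char c = inner.charAt(k);
            if (!Character.isLetterOrDigit(c)) {
                break;
            }
            name.append(Character.toLowerCase(c));
        }
        return name.toString();
    }

    private static void flush(StringBuilder word, List<String> terms) {
        if (word.length() > 0) {
            terms.add(word.toString());
            word.setLength(0);
        }
    }
}
```

In this sketch no matching of start and end tags is attempted, so a missing end tag never causes a failure; a real tokenizer that tracks tag extents would need extra bookkeeping for tags left open at the end of the document.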
| Nested Class Summary | |
|---|---|
| static class | TagTokenizer.Pair |
| Field Summary | |
|---|---|
| org.galagosearch.tupleflow.Processor<Document> | processor |
| Constructor Summary |
|---|
| TagTokenizer() |
| Method Summary | | |
|---|---|---|
| void | close() | |
| java.lang.Class<Document> | getInputClass() | |
| java.lang.Class<Document> | getOutputClass() | |
| java.util.ArrayList<TagTokenizer.Pair> | getTokenPositions() | |
| void | onAmpersand() | |
| void | process(Document document) | Parses the text in the document.text attribute and fills in the document.terms and document.tags arrays, then passes that document to the next processing stage. |
| void | reset() | Resets parsing in preparation for the next document. |
| void | setProcessor(org.galagosearch.tupleflow.Step processor) | |
| void | tokenize(Document document) | Parses the text in the document.text attribute and fills in the document.terms and document.tags arrays. |
| Document | tokenize(java.lang.String text) | Parses the text in the input string and returns a document object. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public org.galagosearch.tupleflow.Processor<Document> processor
| Constructor Detail |
|---|
public TagTokenizer()
| Method Detail |
|---|
public void reset()

public void onAmpersand()

public void process(Document document)
throws java.io.IOException
Specified by: process in interface org.galagosearch.tupleflow.Processor<Document>
Parameters: document -
Throws: java.io.IOException

public void tokenize(Document document)
Parameters: document -
Throws: java.io.IOException

public Document tokenize(java.lang.String text)
throws java.io.IOException
Throws: java.io.IOException

public java.util.ArrayList<TagTokenizer.Pair> getTokenPositions()

public void setProcessor(org.galagosearch.tupleflow.Step processor)
throws org.galagosearch.tupleflow.IncompatibleProcessorException
Specified by: setProcessor in interface org.galagosearch.tupleflow.Source<Document>
Throws: org.galagosearch.tupleflow.IncompatibleProcessorException

public void close()
throws java.io.IOException
Specified by: close in interface org.galagosearch.tupleflow.Processor<Document>
Throws: java.io.IOException

public java.lang.Class<Document> getInputClass()

public java.lang.Class<Document> getOutputClass()
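
The summary for process says the document is handed to the next processing stage once its terms and tags are filled in. The sketch below shows that chaining pattern in isolation. It does not use the TupleFlow classes: `Stage`, `Doc`, `SplitStage`, and `CollectStage` are hypothetical stand-ins, checked exceptions are omitted for brevity, and forwarding `close()` to the next stage is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; these types are stand-ins, not TupleFlow classes.
class PipelineSketch {
    /** Hypothetical stage interface, loosely modeled on a processor. */
    interface Stage<T> {
        void process(T item);
        void close();
    }

    /** Hypothetical document with raw text and a list of terms. */
    static class Doc {
        final String text;
        final List<String> terms = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }
    }

    /** Fills in doc.terms by whitespace splitting, then forwards the document. */
    static class SplitStage implements Stage<Doc> {
        private final Stage<Doc> next;

        SplitStage(Stage<Doc> next) {
            this.next = next;
        }

        @Override
        public void process(Doc doc) {
            for (String token : doc.text.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    doc.terms.add(token);
                }
            }
            // Hand the enriched document to the next stage.
            next.process(doc);
        }

        @Override
        public void close() {
            // Assumption: closing this stage also closes the downstream stage.
            next.close();
        }
    }

    /** Terminal stage that records every document it receives. */
    static class CollectStage implements Stage<Doc> {
        final List<Doc> seen = new ArrayList<>();
        boolean closed = false;

        @Override
        public void process(Doc doc) {
            seen.add(doc);
        }

        @Override
        public void close() {
            closed = true;
        }
    }
}
```

A caller would build the chain from the end, for example `new PipelineSketch.SplitStage(new PipelineSketch.CollectStage())`, and then call `process` once per document followed by a single `close`.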