Package org.galagosearch.core.parse

Interface Summary
DocumentStreamParser  

Class Summary
AdditionalTextCombiner Adds tuples of type AdditionalDocumentText to the end of the text field in a document.
AnchorTextCreator  
AnchorTextDocumentCreator From an IdentifiedLink object, this class constructs a document containing only anchor text.
ArcParser Parses ARC files, like those produced by the Heritrix web crawler.
CollectionLengthCounter  
DateExtractor A very crude extractor of dates from text.
Document  
DocumentDataExtractor Copies a few pieces of metadata about a document (identifier, url, length) from a document object and stores them in a DocumentData tuple.
DocumentDataNumberer Sequentially numbers document data objects.
DocumentFilter  
DocumentIndexReader  
DocumentIndexWriter Writes document text and metadata to an index file.
DocumentLinkData  
DocumentSource From a set of inputs, splits the input into many DocumentSplit records.
DocumentToKeyValuePair This is used in conjunction with KeyValuePairToDocument.
Extent  
ExtentExtractor Converts all tags from a document object into DocumentExtent tuples.
ExtentsNumberer  
FieldConflater  
IndexReaderSplitParser Reads Document data from an index file.
KeyValuePairToDocument This is used in conjunction with DocumentToKeyValuePair.
LinkCombiner  
LinkExtractor Extracts links from documents (anchor text, URLs).
Porter2Stemmer  
PositionPostingsNumberer  
PostingsPositionExtractor  
PriorParser  
StringPooler Replaces strings in document objects with already-used copies.
Tag This class represents a tag in an XML/HTML document.
TagTokenizer This class processes document text into tokens that can be indexed.
TagTokenizer.Pair  
TrecTextParser  
TrecWebParser  
UniversalParser  
WordCounter  
WordCountReducer  
WordFilter WordFilter filters out unnecessary words from documents.

Copyright © 2009. All Rights Reserved.