
Galago is now available in a pre-release version for download.
The system is functional but incomplete. The documentation is still rough, and there will probably be some major API changes before a stable 1.0 release.
Note: Galago is not yet available, but a version should be available sometime in Fall 2007.
Galago is a new search engine software package developed primarily for research into efficient query processing techniques with lots of query features. The system will be available in Fall 2007 under a BSD-like license.
There are two pieces to Galago: the indexing component and the query processing component. The indexing component is written in Java and is designed to be highly flexible. It can produce binned indexes (of the type used in the "Efficient Document Retrieval in Main Memory" paper, SIGIR 2007) or position-based indexes (like those used in Indri). Query processing can be done completely in Java, although there is also a high performance C++ retrieval engine that works with binned indexes only.
The Galago indexer is built on top of TupleFlow, which is based on ideas from Dean and Ghemawat's MapReduce. The MapReduce model considers key/value pairs, and supports a map operation followed by a reduce operation. Many users at Google, and also within the Nutch project, have found MapReduce to be a useful model for processing text data, because it conveniently abstracts the problems of out-of-core and distributed processing.
In the TupleFlow model, we think of data as large database tables. Just as in a database, tables are built from rows of data tuples. Each tuple contains some fixed number of data elements, each with its own type. Also like a database, it is possible to sort any table on any column (or set of columns) with a simple operation. A programmer can write a Processor object, which processes data in a table one row at a time, and writes a new table as output (possibly with a different table schema). Processor objects are chained together at runtime based on the user's preferences.
Inversion is simple in this model. First, run a parser that converts text documents into a table of words and counts, like this:
| Document | Word | Count |
|---|---|---|
| 0 | cat | 3 |
| 0 | hat | 2 |
| 0 | the | 5 |
| 1 | cat | 4 |
| 1 | the | 2 |
This table is ordered first by document, then by word. To invert it, just sort in alphabetical order by word, then by document:
| Document | Word | Count |
|---|---|---|
| 0 | cat | 3 |
| 1 | cat | 4 |
| 0 | hat | 2 |
| 0 | the | 5 |
| 1 | the | 2 |
This sorted data can be easily converted into a typical inverted index. Note that while this data looks space-inefficient, TupleFlow understands sorted order and uses sorting information to reduce space usage. For example, the encoding for the second table is:
("cat", 2) (0, 3) (1, 4) ("hat", 1) (0, 2) ("the", 2) (0, 5) (1, 2)
By understanding the order of the data, TupleFlow can store "cat" just once, while recording the fact that the next two rows contain the word "cat". It's this automatic compression that makes the TupleFlow model feasible.
TupleFlow automatically detects data dependencies between stages and uses this information to automatically schedule portions of the computation on a cluster of machines. Galago's indexer has been successfully used on a cluster of 70 CPUs.
The following XML fragment is a portion of a Galago indexing parameter file. This part describes how to convert TREC documents into postings. The TupleFlow code can use this description to parse documents in parallel on a cluster of processors.
<stage id="parsing">
<dependencies>
<input id="textfiles" />
<output id="counts" />
<output id="documents" />
<output id="postings" />
</dependencies>
<steps>
<step class="galago.parse.UniversalParser"
data="textfiles" />
<step class="galago.parse.TagTokenizer" />
<step class="galago.parse.Stopper">
<word>the</word>
<word>and</word>
<word>or</word>
<word>of</word>
<word>to</word>
<word>a</word>
</step>
<step class="galago.parse.Porter2Stemmer" />
<multi>
<!-- body -->
<group>
<step class="galago.parse.FieldPostingsCounter" />
<split id="postings" />
</group>
<!-- language model counts -->
<group>
<step class="galago.parse.WordCounter" />
<sort order="+word"
class="galago.types.WordCountType"
reducer="galago.parse.WordCountReducer"
object-limit="1000000" />
<write id="counts" />
</group>
<!-- identifier extractor -->
<group>
<step class="galago.parse.DocumentIdentifierExtractor" />
<sort order="+identifier"
class="galago.types.DocumentIdentifierType" />
<write id="documents" />
</group>
</multi>
</steps>
</stage>
This parameter file section only describes parsing; a real parameter file would describe many different stages of indexing. Some of these would be link analysis stages, e.g. PageRank computation, and others would be for statistics gathering. The final stage creates the index.
The indexing model in Galago isn't a single monolithic indexer, but really just a collection of index stages that can be arranged in a parameter file. The parameter file format makes it easy to add your own stages without having to modify any Galago code.
Galago includes a BinnedIndexWriter class which can create a Galago binned index from a list of (document, word, bin) tuples. This makes it very easy to create your own custom index builder which can work with Galago's query processor.
Query processing is done efficiently using binned indexes, as described in "Efficient Document Retrieval in Main Memory". Although the paper describes holding the entire index in memory, Galago can also use indexes that are larger than memory.
Galago will also include a document-at-a-time query processor that is less efficient but more flexible than the optimized method. This code hasn't been completed yet, and may not make it into the initial release.
Galago comes with a Java-based web interface that looks like a traditional web search engine. It is all Java, and needs a compliant application server (like Tomcat, etc.) to work properly.
TupleFlow can run on a single machine, or can use a cluster of machines by using Sun's Grid Engine via the DRMAA interface.
The indexer comes with typical indexing components, like document parsers, stemmers and stoppers. A simple snippet generator is includes for use with the web interface.