What's in a default Galago index?

postings: The postings list contains each token in each document, just as the TagTokenizer decoded it. Each posting is a triple: (word, document, position). You can see this data by using dump-index, which is described in the Quick Start tutorial.

stemmedPostings: This is the same data as the postings file, except all words have also been stemmed with the Porter stemmer. Galago has both stemmed and unstemmed lists so that you can choose between them at query time.

extents: This file stores tag information for all tags in a document. Each posting has four parts: (tag, document, start, end).

documentLengths: This file contains a 4-byte length for each document in the collection.

documentNames: This file contains the document identifier for each document in the collection. It is encoded with the DocumentNamesWriter class, and it's optimized for storing TREC-style document identifiers that end with a dash followed by numbers.

How can I look at data in an index?

Use the galago dump-index tool to look at any inverted index, including postings, stemmedPostings and extents. You'll see the data dumped in comma-separated values (CSV).

What are index parts?

An index part has these characteristics:

  • It is a file stored in the parts directory within a Galago structured index
  • It was written by IndexWriter
  • Its parameters section contains a readerClass name which is a subclass of StructuredIndexPartReader

The StructuredIndexPartReader interface allows an index part to add operators to the query language. So, if your index can produce extents, you would expose an #extents operator. Users can then type an expression like this to retrieve the body extents from myExtentsIndexPart.

#extents:body,part=myExtentsIndexPart()

You can add a new index part to a Galago index just by putting a file in the parts directory. If you have an instance of Galago running, you'll need to restart it so it will recognize the new index part.

TupleFlow

Galago is based on TupleFlow, which is a distributed computing framework like MapReduce, Hadoop or Dryad. TupleFlow is most similar to Dryad.

In MapReduce, each computation has two stages: the Map stage, where each input item is transformed into a key/value pair, and the Reduce stage, where those pairs are aggregated by key. TupleFlow extends this model in three ways. First, TupleFlow can have an arbitrary number of stages, not just two. Second, you're not restricted to key/value pairs; your data can be arbitrary tuples, and stages are strongly typed. Third, each stage can have an arbitrary number of inputs and outputs.

TupleFlow computations can be thought of as a directed acyclic graph of stages. It's called TupleFlow, because tuples flow along the edges. Since the graph is acyclic, there's guaranteed to be a reasonable ordering of the stages. The graph represents both data flow and order dependencies, just like a Makefile. Once a computation is expressed as a TupleFlow graph, it can be scaled up to run on any number of processors.

The current released version of Galago uses threads for parallelism and is therefore limited to a single machine. Galago has been used for over a year on clusters of computers, but we don't get have a clean solution that's easy for other people to set up. Once we have that, we'll release it.

galagotype

TupleFlow tuples are described in galagotype files. There are many examples of these in the Galago codebase. A type looks a lot like a Java class, but without any methods. A type begins with a set of field definitions, and ends with order definitions. Here's an example:

package org.myapp;

type WordCount {
    String word;
    int count;
  
    order: +word;
    order: +count;
    order: +count -word;
}

This definition says that we want a tuple type called WordCount which consists of a String called word and an integer called count. We'd like to be able to sort these tuples in ascending order by word, or ascending order by count, or ascending order by count with ties broken in descending order by word.

In Galago, the order is part of the type. Each stage specifies not just the kind of tuples it wants to receive, but the order they should come in. Order specifications define both the sorted order of the tuples and how they are distributed among machines or threads. For example, if you choose "+word" as the hash function for a particular stage, all WordCount tuples with the same word will appear at the same machine/thread.

galagotype files are converted into Java code by using the TypeBuilder in galagosearch-tupleflow-typebuilder. While TypeBuilder does have a command-line version, it's easier to just use it as a Maven plugin. If you put your galagotype files in the src/main/galagotype directory of your project, the TypeBuilder will automatically convert them into Java classes.

Modifying the index

Most people who use Galago won't need to modify the indexing process. If you're considering changing the indexing process, you may want to get the retrieval documentation first. It might be that you can get the effect you want without changing the index process.

If you want to modify the index, look at these classes to get started:

  • IndexReader/IndexWriter
  • StructuredIterator
  • StructuredIndexPartReader
  • Traversal (also see retrieval documentation)