Some of this text is out of date and refers to an older version of Galago. I've added it to the new website for reference.
Galago is a search engine toolkit primarily research and educational use. It is released under a BSD license, so it may be incorporated freely into commercial work, but most commercial users may want something with less of an experimental focus.
Galago differs from the other open source search packages in primarily its customizability and scalability. Galago allows most of the indexing and retrieval process to be changed, often without changing any code. The indexing process is based on TupleFlow, a MapReduce variant, which can efficiently distribute execution across many processors or machines. TupleFlow can be used separately from the indexing and retrieval parts of Galago.
Some of the indexes built in Galago can be processed by C++ code for faster retrieval performance.
There are two general ways to customize the retrieval process: generating custom indexes, and using the query language.
The custom index approach lets you use the flexible indexing tools in Galago to build specialized indexes that have your own ranking function built-in. Once you have built one of these highly customized indexes (usually called binned indexes), you can't change the query processing strategy. However, these binned indexes lead to very fast retrieval performance.
The slower, but more flexible approach is to use the query language. Galago includes a query language which is a descendent of the Inquery and Indri query languages. It's a little bit like SQL for text. It allows you to describe how you want documents to be ranked in a very flexible way. Since flexibility has a cost, this approach is quite a bit slower than the custom index approach. However, you may not notice the difference unless you have high query loads or over a million documents to search.
The indexing component of Galago is based on a framework called TupleFlow. It is probably best to think of TupleFlow as a mix between MapReduce, make/ant, and a database system. TupleFlow is like MapReduce in that it can efficiently parallelize a large computation. It is like make or ant in that it runs based on a file that looks much like a Makefile. And, it is like a database system because the glue that holds the computation together are lists of tuples, just like database tables.
If you are familiar with Makefiles, you know how a Makefile is a series of targets. Each target contains a series of rules that builds a particular target. The targets are linked together by dependence. The job of the make program (or ant) is to process the rules and dependence information to build some software. For example, to build a Java jar file, you must first compile Java source code files into Java class files, then combine those class files together into a jar file. Therefore, in a build file the final target would be a jar creation stage, but that target would depend on compilation steps. With a well designed Makefile, you simply ask for a jar file, and the build system realizes that code must be compiled first to give you what you want.
TupleFlow works in much the same way. Instead of targets, we have stages. A stage is, broadly, a series of steps that convert data of some type into data of another type. Unlike MapReduce, TupleFlow allows a single stage to have many kinds of inputs and many kinds of outputs. Also unlike most MapReduce-like systems, which tend to be code-based, TupleFlow expresses all of the important logic in an XML parameter file. An example of a TupleFlow stage description is shown below:
<stage id="parsing"> <dependencies> <input id="textfiles" /> <output id="counts" /> <output id="documents" /> <output id="postings" /> </dependencies> <connections> <input class="galago.types.FileName" id="filenames" order="+filename"/> <output class="galago.types.WordCount" order="+word" id="counts" /> <output class="galago.types.DocumentIdentifier" order="+identifier" id="documents" /> <output class="galago.types.DocumentWordPosition" order="+document +word +position" id="postings" /> <output class="galago.types.DocumentIdentifierExtent" order="+document" id="extents"/> </connections> <steps> <input id="filenames" /> <step class="galago.parse.UniversalParser" data="textfiles" /> <step class="galago.parse.TagTokenizer" /> <step class="galago.parse.Stopper"> <word>the</word> <word>and</word> <word>or</word> <word>of</word> <word>to</word> <word>a</word> </step> <step class="galago.parse.Porter2Stemmer" /> <multi> <!-- body --> <group> <step class="galago.parse.FieldPostingsCounter" /> <split id="postings" /> </group> <!-- language model counts --> <group> <step class="galago.parse.WordCounter" /> <step class="galago.tupleflow.Sorter"> <order>+word</order> <class>galago.types.WordCount</class> </step> <write id="counts" /> </group> <!-- identifier extractor --> <group> <step class="galago.parse.DocumentIdentifierExtractor" /> <step class="galago.tupleflow.Sorter"> <order>+identifier</order> <class>galago.types.DocumentIdentifierType</class> </step> <write id="documents" /> </group> </multi> </steps> </stage>
This stage is a parsing stage. Let's look at how this stage works by starting at the <steps> tag. The best way to think of this stage description is to think of data flowing through each step, starting at from the input tag and flowing to the output tags. From the input step, a list of filenames flows to the UniversalParser. The parser opens each file in turn and extracts a stream of documents. Those documents flow through the tokenizer, then the stopper, and then the stemmer. At this point, the data hits a <multi> tag. A copy of each processed document is sent to each of the three <group> tags. The first one stores word position information that will be stored in an inverted index. The second group counts the words in the document so these statistics can be used later for retrieval purposes. Finally, the last group just extracts the name of the document and stores it in a document names list.
Most of these tags are step tags. A step tag loads a Java class and connects it to a data pipeline. A Java class needs to implement the Step interface for this to work. The Processor interface shows how the data passes from one object to another: each object calls the process method of the next object. If you want, you can create your own Step classes and add them into your index process. Galago already comes with many of the Step classes you'll need.
Before we go further, notice that there is a special step class called Sorter. Sorter reads in all of its incoming tuples and then sorts them based on the order specified. Sorter uses temporary files to do this sorting if there is too much data to fit in memory. Soring is a powerful tool for this kind of text/data processing.
A sample job file is provided with the toolkit as an example. The job file shows how stages are connected together by connections. These connections correspond to the dependence information in a Makefile. The connections tell TupleFlow which order to run the stages, and whether stages can be distributed. By specifying the workflow in this way, TupleFlow can automatically distribute work between multiple processors on one machine, or many machines on a network (with some help from a job scheduler like Grid Engine or Condor).
Notice in the sample that you can define property variables that can be used later in the job file. This can be useful for repeatedly specifying pathnames.
There are three types of queries that Galago supports for StructuredRetrieval. The first is simple mode. Simple mode queries are parsed by SimpleQuery. The simple query parser is very permissive of excess punctuation, and allows the use of quotes to indicate phrases. An example might look like this:"white house" "press conference"
The second mode is explicit mode. Explicit mode queries require almost everything about query processing to be specified within the query. An example:
#combine( #feature:dirichlet:mu=1500( #ordered:width=1( #positions:white() #positions:house() ) ) #feature:dirichlet:mu=1500( #ordered:width=1( #positions:press() #positions:conference() ) ) )
The final mode is structured mode. The structured mode allows the use of query operators, but without the wordiness of the explicit mode. An example:#combine( #1(white house) #1(press conference) )
In the explicit query mode, each operator corresponds directly to an iterator used in the retrieval process. Each operator takes some number of iterators as arguments, and returns an iterator itself. There are three basic iterator types:
You can add your own iterator types if you want, but all queries must return a ScoreIterator object.
The #scores operator retrieves a list of document scores from the score index. A good example of this might be a static document prior like PageRank. You would refer to PageRank like this: #scores:pagerank().
The #extents operator is used for fields, like titles or headings. The operator #extents:title() retrieves a list of title field positions for use in the #inside operator, or all by itself.
The #positions operator retrieves word position information. This is useful when using a proximity operator, like this: #ordered( #positions:white() #positions:house() ). You can use the #counts operator for words as well, but the #counts operator returns only counts. Therefore, it can only be used with operators that only need term count information, like #feature:dirichlet.
To turn term or extent counts into scores (which may be probabilities), we use term weighting functions. The standard one is dirichlet, which performs well in practice. The dirichlet method accepts mu as a smoothing parameter. If none is given, the default value of 1500 is used. The dirichlet method also relies on a collectionProbability value which is automatically computed from the index, but can be overridden within the query.
#combine: Takes many ScoreIterators as arguments. Adds the score from each iterator to form a final document score. (Note that the probabilistic operators are in log space, so this addition is like multiplying probabilities).
#scale: Takes a single ScoreIterator argument, and multiplies its score by some constant factor. The #scale and #combine operators can be used together to get the same effect as the Indri #weight operator. Example: #scale:weight=0.5( #scores:pagerank() )
#ordered: Like the Inquery/Indri ordered window operator. Searches for occurrences of its arguments appearing in text less than width-1 words apart, and order matters. #ordered:width=1( #positions:white() #positions:house() ) searches for the phrase "white house".
#unordered: Like #ordered, except the order of the words doesn't matter, and the width argument is interpreted differently. For #unordered, the width is the entire width of the match (length from the beginning of the first extent to the end of the last extent).
#synonym: Returns the union of its ExtentIterator arguments.
#inside: Takes two ExtentIterator arguments. Returns for instances where the first extent appears inside the second extent. Example: #inside( #positions:dog() #extents:title() ) searches for the word dog appearing in the title of a document.
#feature: The extensible operator. This is the operator that can load other classes and iterators and add them to the retrieval network.
The #feature operator can be used to extend the query language with your own custom operators. The syntax to use is #feature:class=MyClass:arg1=val1( #counts:dog() ). This operator would load the Java class MyClass, passing arg1=val1 to it as a parameter, and #counts:dog() as a child iterator. The only restriction is that MyClass must implement the DocumentDataIterator interface (which currently contains no methods).
Example:#feature:class=edu.umass.MyProximityOperator:width=4( #positions:dog() #positions:cat() )
This example loads a class called edu.umass.MyProximityOperator. Its constructor is called with two arguments. The first argument is a Parameters object with the key "width" set to "4". The second argument is an ArrayList containing two ExtentIterators, one for dog and one for cat. If a two argument constructor doesn't exist, Galago will throw an exception. Galago also requires that MyProximityOperator implement the DocumentDataIterator interface.
Remember that your operator class must be in your classpath. This does not mean that your new operator needs to be in the Galago JAR file. You can distribute your custom operators in your own JAR. This makes it easy to mix add-on operators from many different people; just use many different JAR files to store the operators.
You can extend the query language by building your own DocumentDataIterator. You need to create a class that implements DocumentDataIterator, and preferably either ExtentIterator, CountIterator, or ScoreIterator. A ScoreIterator returns score/probability/belief values. An extent iterator returns word or phrase extents (begin and end positions in a document). A CountIterator returns the number of times a paricular event (like a word occurrence) appears in a document.
Your iterator class must have a constructor that takes a Parameters object as the first argument, and an ArrayList of DocumentDataIterators as the second argument. Look at OrderedWindowIterator as an example. The parameters passed between the colons right after the operator name are passed through the Parameters object, and the iterators between the parentheses are passed through the childIterators list. Your job is to use the data in those child iterators to make a new kind of data.
Once you've made your class, make sure that it is in your classpath, and reference it like this: #feature:class=galago.mypackage.MyOperator( #counts:dog() #counts:cat() )
The whole TupleFlow concept is based on data items, called tuples, flowing between steps. Any kind of Java object can be used to represent a tuple, as long as it is never sorted, hashed, or transmitted between stages. However, any tuple that is sorted, hashed, or transmitted needs to be a subclass of galago.tupleflow.Type.
The galago.tupleflow.Type interface forces the tuple class to provide Comparator objects for use in sorting, hash functions for hashing, and readers and writers for file serialization. While occasionally you might want to write all this code yourself, it is probably best to let Galago do it for you. The TemplateTypeBuilder class/program can help you do this.
Suppose you want a type that represents word counts. Clearly this tuple needs two elements: a string that holds a word, and an integer that holds a count. You can do it like this:
java galago.tupleflow.TemplateTypeBuilder \ WordCount.java WordCount org.mytypes \ String:word int:count
This command makes a Java class called WordCount in the file WordCount.java within the package ord.mytypes. The class contains a member variable called word which is a String, and count which is an integer.
However, that command didn't help us with ordering. The TemplateTypeBuilder program requires that you specify what orders you want this type to support. Here are some example orders:
java galago.tupleflow.TemplateTypeBuilder \ WordCount.java WordCount org.mytypes \ String:word int:count \ order:+word order:-word order:+word-count
Notice that there are no spaces in the order strings. Typically Galago order strings have spaces in them, but we take them out when using TemplateTypeBuilder.
TemplateTypeBuilder supports a few different column types:
The last option, bytes, is a special case. It represents a byte array that is treated like a UTF-8 string. The reason for using bytes instead of String is that bytes types are ordered exactly the way that the C++ retrieval expects them. All index writer classes need to take words in bytes mode instead of String, otherwise your C++ retrieval application may crash intermittently.
GalagoWeb allows you to quickly put Galago indexes on the web. GalagoWeb gives you a typical web-style search interface, as well as three REST-style web services that allow you to retrieve search results, highlighted snippets, and full document text.
The current web interface only supports BinnedIndexes, but that will change soon. The required servlet parameters are:
You will want to use the galago.store.DocumentStoreWriter class during index time, since it can automatically load your indexed documents into your database.
GalagoWeb needs to run in some kind of Java servlet container. Apache Tomcat is a popular option, but others will work too.
Make a structured index, then use the StructuredRetrieval program (more information to follow eventually)
Galago includes a complete C++ retrieval system for binned indexes, and will soon include an implementation for document-sorted score indexes. Neither of these support the explicit structured query language; this is primarily for short, uncomplicated queries that need to be processed at high speed.
The C++ code that comes with Galago is capable of reading the kinds of indexes that are built with the IndexWriter class. Therefore it should be fairly simple to adapt the existing C++ code to a new type of index that you create yourself.