ISBI Laboration 2: Effects of stemming in search engines

Introduction

In this lab exercise we will experiment with writing stemming rules for improving search results in a search engine. We will be focusing on improving recall, without sacrificing precision.

You should preferably work in groups of 3. This lab exercise is also available in Swedish, which is more challenging and therefore also a bit more interesting. To attempt the Swedish version of the lab you should, however, have at least one native Swedish speaker in the group, and the non-native Swedish group members must read the English lab description in order to understand the exercise. The native Swedish group members must at least read the Swedish lab instructions.


First

Skim through the whole laboration description before starting, especially the final section: Reporting your results

Background

In a traditional, string matching based search engine, precision (that is, not all correct documents are found) and recall (that is, not all relevant documents are found) may be very low or unreliable. This may be due to the fact that the documents contain inflected forms of a word. For instance, if the query is "work", documents containing the forms "working", "works", "worked" etc. may not be found.

Stemming is a relatively simple method (both with respect to development and processing time) which can be used for improving precision and recall in search engines. In this laboration, we will primarily focus on improving recall. In order to handle inflections, the words "working", "works" and "worked" will be normalized into the stem "work". It is, however, not trivial to capture all possible inflections with general stems. Some non-related words might have the same stems, such as "bull" and "bully" for instance (generating the same stem "bull"), which would lead to a decrease in precision.

The stemmer used in this laboration is written in PHP and is designed to be generic and to perform searches in real time. It is designed only to be an example, so do not use it as a measure of the true "cost" or impact of applying a stemmer to a search engine. If you are interested in stemming implementations, take a look at Snowball, which is designed to create stemming algorithms for information retrieval and has implementations for several languages.

Task description

Your assignment is to write stemming rules in order to improve recall in a search engine result. You will write these rules in a simple search engine designed for this laboration: Stemming Lab. In this search engine, you will write stemming rules according to a three-step-model in order to normalize the required words to appropriate stems. Six (and only six) documents with the query words in the file names can be found to each of the following queries:

1work office
2stock economy
3forge document
4forget document
5stocking economy

In the interface it is of course possible to enter any search queries, and also to use different inflections of the query words. However, it is these words that should be used for the stemming rules you develop for the task, and it is also these words that will be used for checking your reported stemming rules.

Getting started

Go to Stemming Lab and get acquainted with the interface. The interface consists of three fields of rules, named Rules steps1-3. These fields should be used for writing the rules. You can also use a separate text file. The steps represent the three steps the stemmer goes through when stemming a word to an appropriate stem (please observe that an appropriate stem does not have to be, and often is not, an actual word). The stemmer will apply the rules step by step, from top to bottom. As soon as it can apply a rule it will move on to next step, which is why it is important that you place your rules within the steps in order for them to interact correctly.

Under the rule fields there is a search field for writing queries and a list box with predefined queries. Your task is to use the stemmer in order to get the search engine to return exactly six documents for each predefined query. When you choose one of these queries it will automatically be added in the search field.
Moreover, there is a button which should be used for performing a search for the query stated, as well as a button for testing the current stemming rules on the words in the search field. This means that if you're unsure of what your current stemming rules actually do, you can write some different words in different inflections in order to see which stem the rules produce (if you also tick Print trace you can see which rules that were applied and which transformations that have been made in each step in the query and the files).

Rule syntax and some tips

The rules follow the following syntax: ^left context|morpheme$->replacement;

You can use the following operators:

#Comment
^NOT (i.e negation)
|Marks end of left context
$Marks end of matched word
->Transformation left to right
;Marks end of rule

All parts of a rule are optional extept -> and ;
It is, however, often easier to always also include $ which indicates that matching should be performed on the end of a word (suffix rules).
The following is an example of a very limited set of rules:

Step1Step2Step3
s$->; er$->; tt$->t;
ic$->;

The rules are applied top-down until a rule is applicable, after which the stemmer continues to the next step. This means that a maximum of three rules can be applied, one for each step. The rules you create are kept during searching, rule testing and also between the two interface views (see below). It should thus theoretically be possible to keep your rules throghout the laboration. However, all sorts of things may happen (the server does not respond, you accidentally press the wrong button or close the window etc), so it is recommendeded to save the rules from time to time. This can be done by clicking the small blue disk and choose save as text in the browser file-menu.1

If you want to load your rules in the interface again, you can click on the small orange symbol right above the disk. This will lead to a more compact view for loading rules. Locate your rule file which has been saved to disk and perform a query, rule test or change view to load the rules in the interface.2 In this view you can also get some information about the rule format. You can also look at the original set of rules, by clicking the small question mark on the right of the browse button.


Reporting your results

Your laboration report must contain two parts: a set of rules and a short paragraph containing thoughts and reflections regarding the laboration and stemming/searching. The lab report should be submitted to the FirstClass conference ISBI Laborations. Before submitting, make sure your rule file follows the correct syntax and that it produces all six documents for each of the predefined queries in the search engine. The rule file syntax can be tested by loading the set of rules in the search engine and test it against the predefined queries. Any rule files that cannot be loaded properly will be returned without taking any measures to correct them.

Any questions and thoughts should be asked to the laboration instructor during the laboration session. Please observe that the laboration is mandatory for passing the course.

Foot notes:
1) You can of course also copy the rule file and paste it in a text editor, e-mail client etc.
2) It is also possible to paste a complete rule file in the first rule field Rules step1 in the starting view. In this case any rules included in steps 2 and 3 will be ignored.



English queries and translation of the lab instructions by Sumithra Velupillai. Implementation of the search engine, definition of the rule sytax and initial design of the laboration by Martin Hassel.