Word Stemming

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. For example, a stemmer may reduce "argue", "argued", and "arguing" to "argu", which is not itself an English word.

Here we implement the Porter stemming algorithm and the Paice/Husk Lancaster stemming algorithm in Java. The Porter stemmer has been very widely used and became the de facto standard algorithm for English stemming. It is a context-sensitive suffix-removal algorithm, divided into a number of linear steps (five or six, depending on how a step is defined) that are applied in sequence to produce the final stem.
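To make "context-sensitive suffix removal" concrete, here is a minimal sketch of two rules from the published Porter algorithm: step 1a (plural endings) and the rule "(m > 0) EED -> EE" from step 1b, where m is Porter's "measure", the number of vowel-consonant sequences in the remaining stem. The class and method names (PorterSketch, step1a, eedRule, measure) are hypothetical and chosen for illustration only; this is not the implementation shipped here, and it covers only a small fragment of the full algorithm.

```java
// Illustrative sketch only: two rules of the Porter algorithm, not a full stemmer.
// All names in this class are hypothetical.
class PorterSketch {

    // Porter's definition: a, e, i, o, u are vowels; 'y' is a consonant
    // at the start of a word or when it follows a vowel, otherwise a vowel.
    static boolean isConsonant(String w, int i) {
        switch (w.charAt(i)) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return false;
            case 'y':
                return i == 0 || !isConsonant(w, i - 1);
            default:
                return true;
        }
    }

    // The measure m: how many times a vowel run is followed by a consonant.
    static int measure(String stem) {
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < stem.length(); i++) {
            boolean cons = isConsonant(stem, i);
            if (cons && prevVowel) {
                m++;
            }
            prevVowel = !cons;
        }
        return m;
    }

    // Step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (nothing).
    static String step1a(String w) {
        if (w.endsWith("sses")) return w.substring(0, w.length() - 2);
        if (w.endsWith("ies"))  return w.substring(0, w.length() - 2);
        if (w.endsWith("ss"))   return w;
        if (w.endsWith("s"))    return w.substring(0, w.length() - 1);
        return w;
    }

    // Part of step 1b: (m > 0) EED -> EE. The condition on the stem is
    // what makes the rule context sensitive.
    static String eedRule(String w) {
        if (w.endsWith("eed")) {
            String stem = w.substring(0, w.length() - 3);
            if (measure(stem) > 0) {
                return stem + "ee";
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(step1a("caresses")); // caress
        System.out.println(step1a("ponies"));   // poni
        System.out.println(eedRule("agreed"));  // agree
        System.out.println(eedRule("feed"));    // feed (stem "f" has m = 0)
    }
}
```

Note that "feed" is left alone even though it ends in "eed": the stem "f" has measure 0, so the condition fails. The complete algorithm applies many such conditioned rules, step by step.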

The Paice/Husk Lancaster stemming algorithm is known to be a very strong, aggressive stemmer: it tends to strip more of a word than the Porter stemmer does. It uses a single table of rules, each of which may specify the removal or replacement of an ending.
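The idea of a single rule table can be sketched as follows. This is a simplified illustration under stated assumptions, not the Lancaster rule set: the four rules, the Rule class, the minimum stem length of 3, and the class name RuleTableSketch are all made up for this example. The real algorithm uses a much larger table and additional acceptability conditions. Each rule names an ending, how many characters to remove, what to append in their place, and whether stemming should continue after the rule fires.

```java
// Illustrative sketch only: a tiny table-driven stemmer in the spirit of
// Paice/Husk. The rules below are hypothetical, not the Lancaster rule table.
class RuleTableSketch {

    static final class Rule {
        final String ending;   // suffix that triggers the rule
        final int remove;      // number of trailing characters to remove
        final String append;   // replacement text ("" means pure removal)
        final boolean cont;    // true: scan the table again; false: stop

        Rule(String ending, int remove, String append, boolean cont) {
            this.ending = ending;
            this.remove = remove;
            this.append = append;
            this.cont = cont;
        }
    }

    // Rules are tried in order; the first applicable one fires.
    static final Rule[] RULES = {
        new Rule("ies", 3, "y", false), // replacement, then stop
        new Rule("ing", 3, "",  true),  // removal, then continue
        new Rule("ed",  2, "",  true),
        new Rule("s",   1, "",  true),
    };

    // Assumed guard: never cut a word below this many characters.
    static final int MIN_STEM = 3;

    static String stem(String word) {
        String w = word;
        boolean again = true;
        while (again) {
            again = false;
            for (Rule r : RULES) {
                if (w.endsWith(r.ending) && w.length() - r.remove >= MIN_STEM) {
                    w = w.substring(0, w.length() - r.remove) + r.append;
                    again = r.cont;
                    break; // restart the scan from the top of the table
                }
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("ponies"));   // pony
        System.out.println(stem("walkings")); // walk
        System.out.println(stem("singing"));  // sing
    }
}
```

Because rules can chain ("walkings" loses "s", then "ing"), a table-driven stemmer of this kind can strip several endings in a row, which is one reason such stemmers tend to be aggressive.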