Mailing List Archive: Re: Industry Use of Lucene?

Hello,

Lingway (http://www.lingway.com) is a French company that specializes in the
design, development and implementation of linguistics-based software
solutions. We are using Lucene in one of our projects, which can be seen at
http://kant.lingway.com/LGfisc/index.html.

This demo provides access to French fiscal legal texts (Code General des
Impôts) through our linguistic technology, which analyses the user input,
extracts the most relevant terms and adds semantically related terms. This
helps to retrieve more documents related to the query. In addition, the
linguistic analysis automatically generates all possible forms of a word
(singular, plural, masculine, feminine) and corrects some typing errors
(such as a missing accent: impots for impôts).
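
To make the idea concrete, here is a minimal sketch of accent-insensitive
lookup combined with expansion to inflected forms. It is only an
illustration, not Lingway's implementation: the LEXICON table and the
function names fold_accents and expand are hypothetical, and a real system
would rely on a full morphological lexicon.

```python
import unicodedata


def fold_accents(text):
    """Lowercase the text and strip diacritics, e.g. 'Impôts' -> 'impots'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical toy lexicon: each lemma maps to its inflected forms.
LEXICON = {
    "impôt": ["impôt", "impôts"],
    "couple": ["couple", "couples"],
    "exonéré": ["exonéré", "exonérée", "exonérés", "exonérées"],
}

# Index from accent-folded form to lemma, so unaccented input still matches.
FOLDED_INDEX = {
    fold_accents(form): lemma
    for lemma, forms in LEXICON.items()
    for form in forms
}


def expand(word):
    """Return all known forms of a word, tolerating missing accents."""
    lemma = FOLDED_INDEX.get(fold_accents(word))
    if lemma is None:
        return [word]  # unknown word: keep it unchanged
    return list(LEXICON[lemma])
```

With this toy lexicon, expand("impots") yields both "impôt" and "impôts",
so the unaccented input still reaches the accented index terms.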

The analysis disambiguates homographic forms (e.g. the verb "to book" and
the noun "a book"). This is why the system proposes related terms only for
the form found in the user's sentence. Finally, the boolean operators used
in the query are computed according to the weight and role of each term in
the user query.
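
As a rough sketch of how weights could drive the boolean operators, the
function below builds a query string in Lucene's classic query parser syntax
("+" for a required clause, "^" for a boost, quotes for a phrase). The
weights, the 0.5 threshold and the name build_query are assumptions made for
illustration; the actual rules used by the system are not described here.

```python
def build_query(terms, required_threshold=0.5):
    """Build a Lucene-style query string from weighted term groups.

    terms: list of (forms, weight) pairs, where forms is a list of surface
    forms (variants and related terms) and weight is a float in [0, 1].
    Groups at or above the (hypothetical) threshold become required.
    """
    clauses = []
    for forms, weight in terms:
        # Multi-word forms are sent as phrases.
        quoted = ['"%s"' % f if " " in f else f for f in forms]
        if len(quoted) == 1:
            group = quoted[0]
        else:
            group = "(" + " OR ".join(quoted) + ")"
        prefix = "+" if weight >= required_threshold else ""
        clauses.append("%s%s^%.1f" % (prefix, group, weight))
    return " ".join(clauses)


query = build_query([
    (["réduction d'impôt", "réductions d'impôt"], 1.0),
    (["couple", "couples", "famille"], 0.4),
])
```

Here the compound term is required and boosted, while the second group is
optional and only influences ranking.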

Since the documentation of the demo is in French (by the way, it would be
interesting to know where Lucene users come from, and in what proportions),
I'll give you a brief overview of its features.

Suppose we type the following query: réduction d'impôts pour les couples
("tax reduction for couples").

1/ The number of documents found is indicated by:

24 documents trouvés sur (réduction d' impôt), couple

This line reads "24 documents found for (réduction d' impôt), couple". The
second element ( (réduction d' impôt), couple ) shows which terms (and their
related ones) have been sent to the query. You can try other analyses by
moving the mouse over this element, which displays a contextual menu with
all possible degraded versions of the original query. By default, the system
returns the results for the best-matching analysis.

2/ Information about the document

Article 200 sexies Section V : Calcul de l'impôt
Termes pertinents : réductions impôt - couple - couples - famille -
exonérés

Clicking on the document's reference (Article 200 sexies) opens the document
in a pop-up window. (see below)

Relevant terms ("Termes pertinents") are shown in grey on the second line.
These are the terms that were sent to Lucene in the query generated by the
system and that appear in the document. Note that the shade of each term
depends on its weight in the document (more relevant terms are darker). Here
we can see that the generated boolean query contains not only the words
present in the original query (réduction - impôt - couple) but also related
terms found by the linguistic analysis (famille - exonérés) and
morphological variants (singular and plural forms). We can also see that
"réduction d' impôts" has been recognized as a compound word.

This feature gives the user a rough idea of a document's content without
opening it.
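
The "darker means more relevant" display can be sketched as a simple mapping
from a term weight to a grey level. The weight range of 0 to 1, the two grey
bounds and the name grey_for_weight are assumptions for illustration only.

```python
def grey_for_weight(weight, lightest=200, darkest=60):
    """Map a weight in [0, 1] to a hex grey colour; higher weight is darker."""
    w = min(max(weight, 0.0), 1.0)  # clamp out-of-range weights
    level = round(lightest - w * (lightest - darkest))
    return "#{0:02x}{0:02x}{0:02x}".format(level)
```

A term with weight 1.0 is rendered with the darkest grey and a term with
weight 0.0 with the lightest one.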

3/ Displaying the document

A click on the document's reference opens it in a pop-up window. The system
highlights the words of the text that are present in the query. This
feature is partly based on Mark Schreiber's proposals
(http://www.iq-computing.de/lucene/highlight.htm), the difference being
that our highlighter recognizes compound words (e.g. it will highlight
"réduction d'impôts" as a whole and not "réduction" and "impôts"
separately).
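
One simple way to get this behaviour, shown here only as a sketch and not as
the highlighter described above, is to try longer terms before shorter ones
so that a compound wins over its parts. The function name highlight and the
<b> tags are hypothetical choices.

```python
import re


def highlight(text, terms, open_tag="<b>", close_tag="</b>"):
    """Wrap query terms in tags, preferring compounds over their parts."""
    # Longest terms first: regex alternation takes the first alternative
    # that matches, so "réduction d'impôts" is tried before "réduction".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, text)


sample = "La réduction d'impôts s'applique aux impôts du couple."
marked = highlight(
    sample, ["réduction", "impôts", "réduction d'impôts", "couple"]
)
```

In the sample sentence the compound is marked once as a whole, while the
later standalone "impôts" and "couple" are marked individually.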

-------------------------------------------------------

A set of examples (in French, of course) is available at
http://kant.lingway.com/LGfisc/about.html#exemples

Committers: we would be very happy to be mentioned on the "Powered by
Lucene" page (http://jakarta.apache.org/lucene/docs/powered.html). Is it possible?

A demo of our system in English is planned. We are waiting for your
suggestions: what would you like us to show you?

Any questions or comments are welcome. You can send them to
julien.nioche@lingway.com.
Please take a look at our site (www.lingway.com) for more information about
our activities.

Thank you

Julien Nioche / www.lingway.com