Hello, I'd better begin by identifying myself as a newbie.
I am investigating using Lucene as a search tool for a library of technical
documents, much of which consists of pieces of source code and discussion of
that code.
The standard analyzer does an adequate job with normal text but strips out
non-alphanumeric characters in code fragments; the whitespace analyzer does
an adequate job with source code but at the expense of treating punctuation
characters as significant text.
As a couple of trivial examples, the line "The !F1 key." ideally needs to be
analyzed as [the] [!f1] [key]. The standard analyzer turns it into [the]
[f1] [key] while the whitespace analyzer turns it into [the] [!f1] [key.].
Similarly "the abort() function, or the stop() function." ideally needs to
be analyzed as [the] [abort()] [function] [or] [the] [stop()] [function].
But neither of these analyzers will retain the parentheses while discarding
the comma and full stop.
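To make the rule I am after concrete, here is a rough sketch in plain Python.
It is not Lucene code and uses no Lucene API; the function name and the set
of characters treated as sentence punctuation are my own assumptions. It
splits on whitespace, lowercases, and strips only trailing sentence
punctuation, so leading symbols and parentheses survive.

```python
# Characters stripped when they trail a token. This set is an assumption
# chosen to fit the two examples above, not anything Lucene defines.
TRAILING_PUNCTUATION = ".,;:?!"


def tokenize(text):
    """Split on whitespace, lowercase, and drop trailing sentence punctuation."""
    tokens = []
    for raw in text.split():
        # rstrip only touches the end of the token, so "!f1" keeps its "!"
        # and "abort()," keeps its parentheses but loses the comma.
        token = raw.lower().rstrip(TRAILING_PUNCTUATION)
        if token:  # skip tokens that were nothing but punctuation
            tokens.append(token)
    return tokens


print(tokenize("The !F1 key."))
print(tokenize("the abort() function, or the stop() function."))
```

The obvious weakness is that code ending in one of those characters (for
example a statement terminated by ";") would lose it too, so a real analyzer
would presumably need something smarter than a fixed character set.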
Are there examples of analyzers for technical documentation around, or any
helpful pointers? Or am I barking up a rotten tree here?
cheers
T