Mailing List Archive: Serializing the analyzer

Dear all,

From the last discussion we had on this list, we've made the following
points:
* The normalizer I proposed should be frozen, hard-coded in the source of
the application, in order to make sure users don't change it by mistake.
* It seems a good idea to serialize the analyzer along with the index, in
order to encourage the developer to use it when accessing/writing the index.

I've been looking closely at writing a compiler for the normalizer language
I ended up with in the previous episode. However, my conclusion is that
this is not really worthwhile yet. Why, you may ask. Well, the main
advantage of having a compiler is the possible speed gain from turning the
set of transducers into a single, big, but deterministic finite-state
transducer. Unfortunately I just couldn't find a way of producing such a
thing in a reasonable amount of my time (I am probably just like you all,
doing all this in my spare time, and I can hardly find the motivation to go
into a task of several full-time weeks for something I am not completely
convinced of). The main problem is that a table-based deterministic FSA is
not feasible, because Unicode characters are 16 bits wide, so each state
would need a 65,536-entry transition row (not even talking about Unicode
3.1, which uses more than 16 bits!), and I just couldn't find a nice way to
build a deterministic FSA under that constraint (well, I figured out some,
but the reduction process is really painful). I thought of using a
non-deterministic FSA, but I am now quite sure this wouldn't be any faster
than the current solution, so that's not a solution I am interested in. Of
course, if I am wrong on any point above, I'd be glad to hear about it!
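
To put rough numbers on the table-size problem, here is a small
back-of-the-envelope sketch. The state count (1,000) and the entry size
(4 bytes) are purely illustrative assumptions, not measurements of the
actual normalizer; only the alphabet sizes come from Unicode itself.

```java
// Rough estimate of the memory needed by a dense (table-based)
// deterministic FSA: one table entry per (state, input symbol) pair.
final class DenseTableEstimate {

    // Number of values representable in 16 bits.
    static final long UTF16_UNITS = 1L << 16;        // 65,536

    // Size of the full Unicode code space (U+0000..U+10FFFF).
    static final long UNICODE_CODE_SPACE = 0x110000; // 1,114,112

    // Bytes for a dense transition table; throws on arithmetic overflow.
    static long denseTableBytes(long states, long alphabetSize, long bytesPerEntry) {
        return Math.multiplyExact(Math.multiplyExact(states, alphabetSize), bytesPerEntry);
    }

    public static void main(String[] args) {
        long states = 1_000;   // assumption, for illustration only
        long bytesPerEntry = 4; // assumption: one int per transition
        System.out.println(denseTableBytes(states, UTF16_UNITS, bytesPerEntry));
        System.out.println(denseTableBytes(states, UNICODE_CODE_SPACE, bytesPerEntry));
    }
}
```

Under these assumptions the 16-bit table alone takes 262,144,000 bytes
(250 MiB), and 17 times that for the full code space, which is why a
plain lookup table does not look like a realistic option here.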

However, you may argue that the main point of compiling the normalizer into
Java source code is to discourage developers from changing the analyzer
associated with an index, and that this has nothing to do with speed. I
agree. But this is why I referred to serialization in the second point:
serializing the analyzer with the index provides nearly the same level of
confidence that the user will not change it inadvertently (IMHO). That's
why, instead of writing a compiler, I'd rather rely on the analyzer object
being stored with the index, which strongly associates the two.

Regarding the analyzer serialization, roughly:

* I proposed the "indexreader.getAnalyzer()" method for retrieving the
original analyzer of an index. Does that seem OK to everyone?

* Serializing the analyzer obviously requires that analyzers be
serializable. I think Lucene should deal with non-serializable analyzers the
easy way: tolerate them, which means that creating an indexwriter with a
non-serializable analyzer should neither throw an exception nor invalidate
the index creation. In this case, indexreader.getAnalyzer() would just
return null (as this is still part of the nominal case, I don't think
throwing an exception is appropriate, but this is debatable, and I won't
fight for a millisecond on this topic).

* Maintenance: storing a serialized object also implies retrieving the
object later. Later may mean "later with the same class files" but also
"later with modified class files". This is the main drawback of
serialization: one has to take great care over compatibility when
maintaining the software. This is not necessarily hard or painful, but it
is mandatory, so it must be made clear to developers that anyone changing
the source code of an analyzer/stemmer/whatever that is stored in the
serialization data MUST maintain backward compatibility. Java
serialization is strict on this point, and using the default serialization
mechanism "as-is" would probably break compatibility, leaving a program
that no longer works because the object can't read its own data. Please note
that although this may be seen as a major risk in a project, it is rarely a
real problem in practice: once the developer sees that the object is not
compatible, adding backward compatibility should always be easy, except in
a few cases that we should handle with care. For example:

* Turning a StopAnalyzer with hard-coded words into a StopAnalyzer that
reads stop words from a file is typically a problem. In this case, the class
shouldn't be modified; a new class should be created instead. Adding a new
field whose value cannot be guessed by the new analyzer class is probably a
problem for our use case.

* Generally speaking, any in-depth change to the analyzer or to its
member objects would cause a problem. But again, one _shouldn't_ try to make
conceptual changes to an analyzer already used by an index. So accepting
this limitation and documenting it may be a valid option (IMHO).

However, there shouldn't be many cases in which analyzer/stemmer evolution
creates such structural changes; in fact the StopAnalyzer case is the
only one I could find.
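
To make the proposal above a bit more concrete, here is a minimal sketch
using only java.io. Everything in it is hypothetical: SketchAnalyzer,
LowerCaseSketchAnalyzer and AnalyzerSlot are stand-ins I made up for the
example, not Lucene classes, and the byte array stands in for wherever the
index would actually keep the data. It shows the tolerant behaviour (no
exception for a non-serializable analyzer, null on retrieval) and an
explicit serialVersionUID, which is the usual first step towards keeping
old serialized data readable after compatible changes such as adding a
field.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Locale;

// Hypothetical stand-in for an analyzer; NOT Lucene's Analyzer class.
interface SketchAnalyzer {
    String normalize(String text);
}

// A serializable analyzer. Declaring serialVersionUID explicitly avoids
// depending on the automatically computed value, which changes whenever
// the class layout changes.
final class LowerCaseSketchAnalyzer implements SketchAnalyzer, Serializable {
    private static final long serialVersionUID = 1L;

    @Override
    public String normalize(String text) {
        return text.toLowerCase(Locale.ROOT);
    }
}

// Hypothetical holder playing the role of the index's analyzer storage.
final class AnalyzerSlot {
    private byte[] stored; // null when no analyzer could be serialized

    // Plays the role of index creation: a non-serializable analyzer is
    // tolerated, nothing is stored and no exception is thrown for it.
    void store(SketchAnalyzer analyzer) throws IOException {
        stored = null;
        if (!(analyzer instanceof Serializable)) {
            return;
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(analyzer);
        } catch (NotSerializableException e) {
            return; // a member object was not serializable: tolerate it too
        }
        stored = bytes.toByteArray();
    }

    // Plays the role of the proposed indexreader.getAnalyzer():
    // returns null when no analyzer was stored with the index.
    SketchAnalyzer getAnalyzer() throws IOException, ClassNotFoundException {
        if (stored == null) {
            return null;
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(stored))) {
            return (SketchAnalyzer) in.readObject();
        }
    }
}
```

With this sketch, storing a LowerCaseSketchAnalyzer and calling
getAnalyzer() gives back an equivalent analyzer, while storing a
non-serializable one (a plain lambda, for instance) simply makes
getAnalyzer() return null. Deserialization failures caused by
incompatible class changes would still surface as exceptions from
readObject(); deciding how to report those is a separate question.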

OK, I think I've listed most of the pros and cons of serializing the
analyzers. Does anyone see additional points? So, what about serialization?
Does it seem acceptable?

Rodrigo
