On Tue, Nov 18, 2014 at 1:16 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Sat, Nov 15, 2014 at 3:22 AM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>
>> The analysis chain (attributes) is overly complex.
>
> If you were to start from scratch, what would the analysis chain look like?
Hi Marvin, long time no talk! I like the new Go bindings for Lucy!
Here are some things that bug me about Lucene's analysis APIs:
Lucene's attributes have a separate interface from impl, with default
impls, and this causes complex code in oal.util.Attribute*. It seems
like overkill: we should just have concrete core impls for the
attributes Lucene knows how to index.
There are 5 Java source files in that package related to attributes
(Attribute.java, AttributeFactory.java, AttributeImpl.java,
AttributeReflector.java, AttributeSource.java): too much.
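To make the complaint concrete, here's a toy sketch (invented names, not the real oal.util classes) of the interface/impl split versus the plain concrete class being argued for:

```java
// Toy sketch of the pattern, NOT real Lucene code: the interface and the
// separate impl class that Lucene's attribute machinery wires together
// via factories and reflection.
interface TermAttribute {
    String term();
    void setTerm(String term);
}

class TermAttributeImpl implements TermAttribute {
    private String term = "";
    public String term() { return term; }
    public void setTerm(String term) { this.term = term; }
}

// The simpler alternative: one concrete class per attribute the indexer
// understands, no factory or reflection layer in between.
final class TermAttr {
    String term = "";
}

public class AttributeSketch {
    public static void main(String[] args) {
        TermAttribute viaInterface = new TermAttributeImpl();
        viaInterface.setTerm("lucene");
        TermAttr concrete = new TermAttr();
        concrete.term = "lucene";
        // Both hold the same data; only the first needed two types plus a
        // factory to exist.
        System.out.println(viaInterface.term().equals(concrete.term));
    }
}
```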
There should not be a global AttributeFactory that owns all attributes
throughout the pipeline: that's too global. Rather, each stage should
be free to control what the next stage sees (LUCENE-2450) ... the
namespace should be private to that stage, and each stage can
delete/add/replace the incoming bindings it saw. This may seem more
complex but I think it'd be simpler in the end? And the first stage
should not have to be responsible for clearing things that later
stages had inserted: a common source of bugs is the first Tokenizer
failing to call clearAttributes.
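A toy sketch (a plain map, not Lucene's AttributeSource) of why the shared namespace is fragile: the first stage has to clear state that a *later* stage added on the previous token, or it leaks into the next token:

```java
import java.util.HashMap;
import java.util.Map;

public class SharedNamespaceSketch {
    // One namespace shared by every stage in the pipeline.
    static Map<String, String> atts = new HashMap<>();

    // A "tokenizer" that may or may not reset the shared namespace first.
    static void emitToken(String term, boolean clearFirst) {
        if (clearFirst) atts.clear();   // the clearAttributes-style step
        atts.put("term", term);
    }

    public static void main(String[] args) {
        emitToken("first", true);
        atts.put("synonym", "1st");     // a later filter adds its own attribute
        emitToken("second", false);     // bug: stale "synonym" survives
        System.out.println(atts.containsKey("synonym")); // leaked state
        emitToken("third", true);       // correct: cleared before emitting
        System.out.println(atts.containsKey("synonym"));
    }
}
```

With per-stage private namespaces, the tokenizer could never see, let alone leak, a binding a downstream filter created.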
Reuse of token streams was an "afterthought" that took a long time to
work its way down to simpler APIs, but now we have ReuseStrategy,
AnalyzerWrapper, DelegatingAnalyzerWrapper.
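The core idea ReuseStrategy papers over is just per-thread caching of heavy analysis components; a minimal sketch (toy types, not Lucene's API):

```java
public class ReuseSketch {
    // Stand-in for cached TokenStreamComponents: one instance per thread,
    // reset and handed back instead of being rebuilt for every document.
    static final ThreadLocal<StringBuilder> REUSED =
        ThreadLocal.withInitial(StringBuilder::new);

    static StringBuilder components() {
        StringBuilder sb = REUSED.get();
        sb.setLength(0);  // reset state instead of reallocating
        return sb;
    }

    public static void main(String[] args) {
        StringBuilder a = components();
        StringBuilder b = components();
        // Same instance both times on this thread: reuse, not reallocation.
        System.out.println(a == b);
    }
}
```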
Custom analyzers can't be (easily?) serialized, so ES and Solr have
their own layers to parse a custom chain from JSON/XML. Those layers
could do better error checking...
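The kind of layer ES and Solr each re-implement looks roughly like this hypothetical sketch (invented names, no JSON parsing shown): map declarative stage names to factories, compose them in order, and fail loudly on an unknown name, which is exactly the error checking that tends to be missing:

```java
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class ChainFromConfig {
    // Registry of known stages, as a config layer would hold factories.
    static final Map<String, UnaryOperator<String>> STAGES = Map.of(
        "lowercase", s -> s.toLowerCase(),
        "trim", String::trim);

    // Build a composed chain from a declarative list of stage names.
    static UnaryOperator<String> build(List<String> names) {
        UnaryOperator<String> chain = s -> s;
        for (String name : names) {
            UnaryOperator<String> stage = STAGES.get(name);
            if (stage == null)  // the error checking the config layer owes us
                throw new IllegalArgumentException("unknown stage: " + name);
            UnaryOperator<String> prev = chain;
            chain = s -> stage.apply(prev.apply(s));
        }
        return chain;
    }

    public static void main(String[] args) {
        UnaryOperator<String> analyzer = build(List.of("trim", "lowercase"));
        System.out.println(analyzer.apply("  HeLLo "));
    }
}
```

If custom analyzers were serializable in core, this registry-plus-composition layer would live in one place instead of two.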
Can we do something better with offsets, such that TokenFilters (not
just Tokenizers/CharReaders) would also be able to set correct
offsets?
The stuffing of things into "analysis" that really should have been a
"gentle schema" is annoying: KeywordAnalyzer, Numeric*.
Token filters that want to create graphs are nearly impossible to
write. E.g. you cannot put a WordDelimiterFilter in front of
SynonymFilter today because SynonymFilter can't handle an incoming
graph (LUCENE-5012).
Deleted tokens should still be present, just "marked" as deleted (so
IW doesn't index them). This would make it possible (to Rob's horror)
for tokenizers to preserve every single character they saw, with
things that are not tokens (punctuation, whitespace) marked as
deleted. Maybe this would make it possible for all stages to work with
offsets properly?
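A sketch of the "mark, don't drop" idea (toy types, not Lucene's API): every character span the tokenizer saw becomes a token, non-token spans are flagged deleted, and offsets therefore tile the input with no gaps for every downstream stage to reason about:

```java
import java.util.List;

public class MarkedTokens {
    record Token(String text, int startOffset, int endOffset, boolean deleted) {}

    public static void main(String[] args) {
        String input = "Hi, world";
        List<Token> tokens = List.of(
            new Token("Hi", 0, 2, false),
            new Token(",", 2, 3, true),    // punctuation: kept, marked deleted
            new Token(" ", 3, 4, true),    // whitespace: kept, marked deleted
            new Token("world", 4, 9, false));
        // Offsets are contiguous over the whole input, so a filter could
        // split or merge tokens and still compute correct offsets; the
        // indexer only consumes the non-deleted tokens:
        tokens.stream()
              .filter(t -> !t.deleted())
              .forEach(t -> System.out.println(t.text()));
    }
}
```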
There is probably more, and probably lots of people disagree that
these are even "problems" :)
Mike McCandless
http://blog.mikemccandless.com