Mailing List Archive

Developing experimental "more advanced" analyzers
Hi there,

I'm new to Lucene (in fact I'm interested in Elasticsearch, but in this case
it's related to Lucene) and I want to experiment with some enhanced
analyzers.

I have an external linguistic component that I want to connect to
Lucene/Elasticsearch, so before I produce a bunch of useless code I want
to make sure that I'm going the right way.

The linguistic component needs at least a whole sentence as input (ideally
it would get the whole text at once).

As far as I can see, I would need to create a custom Analyzer and
override "createComponents" and "normalize".

Is that correct or am I on the wrong track?

Best,
Chris
Re: Developing experimental "more advanced" analyzers
On Mon, May 29, 2017 at 8:36 AM, Christian Becker
<christian.freisen@gmail.com> wrote:
> [...]

There is a base class for tokenizers that want to see
sentences-at-a-time in order to divide into words:

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/SegmentingTokenizerBase.java#L197-L201

There are two examples that use it in the test class:

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestSegmentingTokenizerBase.java#L145
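
For a concrete flavor, here is a minimal sketch modeled on the
whole-sentence example in that test class (untested; names follow the
Lucene 6.x API, and the comments mark where an external linguistic
component would plug in):

import java.text.BreakIterator;
import java.util.Locale;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.SegmentingTokenizerBase;

// Emits each whole sentence as a single token. A real implementation
// would hand the sentence to the linguistic component inside
// incrementWord() and emit one token per word it returns.
public class WholeSentenceTokenizer extends SegmentingTokenizerBase {
  private int sentenceStart, sentenceEnd;
  private boolean hasSentence;

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  public WholeSentenceTokenizer() {
    super(BreakIterator.getSentenceInstance(Locale.ROOT));
  }

  @Override
  protected void setNextSentence(int sentenceStart, int sentenceEnd) {
    // Called with the bounds of the next sentence in the protected buffer.
    this.sentenceStart = sentenceStart;
    this.sentenceEnd = sentenceEnd;
    hasSentence = true;
  }

  @Override
  protected boolean incrementWord() {
    if (!hasSentence) {
      return false; // no more tokens in this sentence
    }
    hasSentence = false;
    clearAttributes();
    termAtt.copyBuffer(buffer, sentenceStart, sentenceEnd - sentenceStart);
    offsetAtt.setOffset(correctOffset(offset + sentenceStart),
        correctOffset(offset + sentenceEnd));
    return true;
  }
}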
Re: Developing experimental "more advanced" analyzers
I'm sorry - I didn't mention that my intention is to have linguistic
annotations like stems and maybe part-of-speech information. Tokenization
is certainly one of the things I want to do.

2017-05-29 19:02 GMT+02:00 Robert Muir <rcmuir@gmail.com>:

> [...]
Re: Developing experimental "more advanced" analyzers
If you used our products, which have Elastic plugins, POS, stems and
Leminisation, it would be much easier.




Kind Regards



Chris

VP International

E: cbrown@basistech.com

T: +44 208 622 2900

M: +44 7796946934

USA Number: +16173867107

Lakeside House, 1 Furzeground Way, Stockley Park, Middlesex, UB11 1BD, UK


On 29 May 2017 at 19:42, Christian Becker <christian.freisen@gmail.com>
wrote:

> [...]
Re: Developing experimental "more advanced" analyzers
I am glad that basistech has tools to bring lemmings back :-} I am guessing you also have lemmati[z|s]ation.


> On May 29, 2017, at 12:37 PM, Chris Brown <cbrown@basistech.com> wrote:
>
> If you used our products, which have Elastic plugins, POS, stems and
> Leminisation, it would be much easier.
>
> [...]
Re: Developing experimental "more advanced" analyzers
In this case it's more fun not going the easy way.


Chris Collins <chris_j_collins@yahoo.com.invalid> wrote on Mon, 29 May
2017 at 21:41:

> I am glad that basistech has tools to bring lemmings back :-} I am
> guessing you also have lemmati[z|s]ation.
>
> [...]
RE: Developing experimental "more advanced" analyzers
Hi,

as you are using Elasticsearch, there is no need to implement an Analyzer subclass. In general this is not needed in plain Lucene either, as there is the class CustomAnalyzer, which uses a builder pattern to construct an analyzer in the same way Elasticsearch or Solr do.
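
For illustration, a minimal sketch of that builder pattern ("standard", "lowercase" and "porterstem" are the SPI names of factories that ship with lucene-analyzers-common):

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class CustomAnalyzerDemo {
  public static void main(String[] args) throws IOException {
    // Declaratively wire up the analysis chain, just like a Solr schema
    // or an Elasticsearch mapping would.
    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .addTokenFilter("porterstem")
        .build();
    System.out.println(analyzer);
  }
}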

For your use case you need to implement a custom Tokenizer and/or several TokenFilters. In addition, you need to create the corresponding factory classes and bundle everything as an Elasticsearch plugin; I'd suggest asking on the Elasticsearch mailing lists about that part. After that you can define your analyzer in the Elasticsearch mapping/index config.
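
On the Lucene side such a factory is only a thin wrapper. A sketch, reusing the hypothetical WholeSentenceTokenizer from Robert's example above (to be discoverable by name, the class must also be listed in a META-INF/services/org.apache.lucene.analysis.util.TokenizerFactory file):

import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

public class WholeSentenceTokenizerFactory extends TokenizerFactory {
  public WholeSentenceTokenizerFactory(Map<String, String> args) {
    super(args);
    // Standard Lucene factory idiom: reject unconsumed parameters.
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public Tokenizer create(AttributeFactory factory) {
    // For brevity this ignores the attribute factory; a full
    // implementation would pass it through to the tokenizer.
    return new WholeSentenceTokenizer();
  }
}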

The Tokenizer and TokenFilters can be implemented, e.g., as Robert Muir described; the sentence handling can be done as a SegmentingTokenizerBase subclass. Keep in mind that many tasks can already be done with existing TokenFilters and/or Tokenizers.

Lucene has no index support for POS tags; they are only used in the analysis chain. To somehow add them to the index, you can use a TokenFilter as the last stage that adds the POS tag to the term (e.g., term "Windmill" with POS "subject" could be combined by that last TokenFilter into a term "Windmill#subject" and indexed like that). For keeping track of POS tags during analysis (between the tokenizer and the token filters), you may need to define custom attributes.
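
A sketch of that last stage (PartOfSpeechAttribute is a hypothetical custom attribute you would define yourself; Lucene finds its implementation reflectively through a sibling class named PartOfSpeechAttributeImpl extending AttributeImpl, not shown here):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Attribute;

// Hypothetical custom attribute that carries the POS tag between stages.
interface PartOfSpeechAttribute extends Attribute {
  void setPartOfSpeech(String pos);
  String getPartOfSpeech();
}

// Last filter in the chain: folds the POS tag into the indexed term.
final class PosConcatenatingFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);

  PosConcatenatingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false; // end of stream
    }
    String pos = posAtt.getPartOfSpeech();
    if (pos != null) {
      // e.g. "Windmill" with POS "subject" becomes "Windmill#subject"
      termAtt.append('#').append(pos);
    }
    return true;
  }
}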

Check the UIMA analysis module for more information on how to do this.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Christian Becker [mailto:christian.freisen@gmail.com]
> Sent: Monday, May 29, 2017 2:37 PM
> To: general@lucene.apache.org
> Subject: Developing experimental "more advanced" analyzers
>
> [...]