Mailing List Archive

Developing experimental "more advanced" analyzers
Hi there,

I'm new to Lucene (in fact I'm interested in Elasticsearch, but in this case
it's related to Lucene) and I want to experiment with some enhanced
analyzers.

I have an external linguistic component that I want to connect to
Lucene/Elasticsearch, so before I produce a bunch of useless code I want
to make sure that I'm going the right way.

The linguistic component needs at least a whole sentence as input (ideally
it would get the whole text at once).

As far as I can see, I would need to create a custom Analyzer and
override "createComponents" and "normalize".

Is that correct or am I on the wrong track?

Best,
Chris
Re: Developing experimental "more advanced" analyzers
On Mon, May 29, 2017 at 8:36 AM, Christian Becker
<christian.freisen@gmail.com> wrote:
> [...]

There is a base class for tokenizers that want to see
sentences-at-a-time in order to divide into words:

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/SegmentingTokenizerBase.java#L197-L201

There are two examples that use it in the test class:

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestSegmentingTokenizerBase.java#L145
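
For a concrete flavor, here is a minimal sketch modeled on the
whole-sentence example in that test class (untested; names follow the
Lucene 6.x API, and the comments mark where an external linguistic
component would plug in):

import java.text.BreakIterator;
import java.util.Locale;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.SegmentingTokenizerBase;

// Emits each whole sentence as a single token. A real implementation
// would hand the sentence to the linguistic component inside
// incrementWord() and emit one token per word it returns.
public class WholeSentenceTokenizer extends SegmentingTokenizerBase {
  private int sentenceStart, sentenceEnd;
  private boolean hasSentence;

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  public WholeSentenceTokenizer() {
    super(BreakIterator.getSentenceInstance(Locale.ROOT));
  }

  @Override
  protected void setNextSentence(int sentenceStart, int sentenceEnd) {
    // Called with the bounds of the next sentence in the protected buffer.
    this.sentenceStart = sentenceStart;
    this.sentenceEnd = sentenceEnd;
    hasSentence = true;
  }

  @Override
  protected boolean incrementWord() {
    if (!hasSentence) {
      return false; // no more tokens in this sentence
    }
    hasSentence = false;
    clearAttributes();
    termAtt.copyBuffer(buffer, sentenceStart, sentenceEnd - sentenceStart);
    offsetAtt.setOffset(correctOffset(offset + sentenceStart),
        correctOffset(offset + sentenceEnd));
    return true;
  }
}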
Re: Developing experimental "more advanced" analyzers
I'm sorry - I didn't mention that my intention is to have linguistic
annotations like stems and maybe part-of-speech information. Tokenization
is certainly one of the things I want to do.

2017-05-29 19:02 GMT+02:00 Robert Muir <rcmuir@gmail.com>:

> [...]
Re: Developing experimental "more advanced" analyzers
If you used our products, which have Elastic plugins, POS, stems and
Leminisation, it would be much easier.




Kind Regards



Chris

VP International

E: cbrown@basistech.com

T: +44 208 622 2900

M: +44 7796946934

USA Number: +16173867107

Lakeside House, 1 Furzeground Way, Stockley Park, Middlesex, UB11 1BD, UK


On 29 May 2017 at 19:42, Christian Becker <christian.freisen@gmail.com>
wrote:

> [...]
Re: Developing experimental "more advanced" analyzers
I am glad that basistech has tools to bring lemmings back :-} I am guessing you also have lemmati[z|s]ation.


> On May 29, 2017, at 12:37 PM, Chris Brown <cbrown@basistech.com> wrote:
>
> If you used our products, which have Elastic plugins, POS, stems and
> Leminisation, it would be much easier.
>
> [...]
Re: Developing experimental "more advanced" analyzers
In this case it's more fun not going the easy way.


Chris Collins <chris_j_collins@yahoo.com.invalid> wrote on Mon, 29 May
2017 at 21:41:

> I am glad that basistech has tools to bring lemmings back :-} I am
> guessing you also have lemmati[z|s]ation.
>
> [...]
RE: Developing experimental "more advanced" analyzers
Hi,

as you are using Elasticsearch, there is no need to implement an Analyzer subclass. In general this is not needed in plain Lucene either, as there is the class CustomAnalyzer, which uses a builder pattern to construct an analyzer in the same way Elasticsearch or Solr do.
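
For illustration, a minimal sketch of that builder pattern ("standard", "lowercase" and "porterstem" are the SPI names of factories that ship with lucene-analyzers-common):

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class CustomAnalyzerDemo {
  public static void main(String[] args) throws IOException {
    // Declaratively wire up the analysis chain, just like a Solr schema
    // or an Elasticsearch mapping would.
    Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .addTokenFilter("porterstem")
        .build();
    System.out.println(analyzer);
  }
}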

For your use case you need to implement a custom Tokenizer and/or several TokenFilters. In addition, you need to create the corresponding factory classes and bundle everything as an Elasticsearch plugin; I'd suggest asking on the Elasticsearch mailing lists about that part. After that you can define your analyzer in the Elasticsearch mapping/index config.
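
On the Lucene side such a factory is only a thin wrapper. A sketch, reusing the hypothetical WholeSentenceTokenizer from Robert's example above (to be discoverable by name, the class must also be listed in a META-INF/services/org.apache.lucene.analysis.util.TokenizerFactory file):

import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

public class WholeSentenceTokenizerFactory extends TokenizerFactory {
  public WholeSentenceTokenizerFactory(Map<String, String> args) {
    super(args);
    // Standard Lucene factory idiom: reject unconsumed parameters.
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public Tokenizer create(AttributeFactory factory) {
    // For brevity this ignores the attribute factory; a full
    // implementation would pass it through to the tokenizer.
    return new WholeSentenceTokenizer();
  }
}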

The Tokenizer and TokenFilters can be implemented, e.g., as Robert Muir described; the sentence handling can be done as a SegmentingTokenizerBase subclass. Keep in mind that many tasks can already be done with existing TokenFilters and/or Tokenizers.

Lucene has no index support for POS tags; they are only used in the analysis chain. To somehow add them to the index, you can use a TokenFilter as the last stage that adds the POS tag to the term (e.g., term "Windmill" with POS "subject" could be combined by that last TokenFilter into a term "Windmill#subject" and indexed like that). For keeping track of POS tags during analysis (between the tokenizer and the token filters), you may need to define custom attributes.
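
A sketch of that last stage (PartOfSpeechAttribute is a hypothetical custom attribute you would define yourself; Lucene finds its implementation reflectively through a sibling class named PartOfSpeechAttributeImpl extending AttributeImpl, not shown here):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Attribute;

// Hypothetical custom attribute that carries the POS tag between stages.
interface PartOfSpeechAttribute extends Attribute {
  void setPartOfSpeech(String pos);
  String getPartOfSpeech();
}

// Last filter in the chain: folds the POS tag into the indexed term.
final class PosConcatenatingFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);

  PosConcatenatingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false; // end of stream
    }
    String pos = posAtt.getPartOfSpeech();
    if (pos != null) {
      // e.g. "Windmill" with POS "subject" becomes "Windmill#subject"
      termAtt.append('#').append(pos);
    }
    return true;
  }
}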

Check the UIMA analysis module for more information on how to do this.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Christian Becker [mailto:christian.freisen@gmail.com]
> Sent: Monday, May 29, 2017 2:37 PM
> To: general@lucene.apache.org
> Subject: Developing experimental "more advanced" analyzers
>
> [...]