Mailing List Archive

Manipulate stored string in Lucene
Dear all,

Currently I am reading text fields that contain XML text. Hence, the Solr input may look like this:

<field name="tagged_text">&lt;sec sec-type="Introduction" id="SECID0E4F"&gt;
&lt;title&gt;Introduction&lt;/title&gt;
&lt;/sec&gt;
</field>

With all “<” and “>” escaped.
I wrote a tokenizer that indexes the tag attributes (e.g. sec-type="Introduction") at the position of the tagged word (“Introduction” in this case), so I need the HTML tags when indexing. However, I want to strip the HTML from the stored string that is shown to the user in query results. So far, I have figured out that the index and the stored string are kept separately. Thus, I thought it should be possible to manipulate the stored string after indexing.

Is there a way to do so? I would prefer to manipulate the stored string and not introduce a second field with the plain text in the input file.

I would be grateful for any help!

Best Regards,

Adrian

-------------------------------------------------------
Adrian Pachzelt
- Fachinformationsdienst Biodiversitaetsforschung -
- Hosting von Open Access-Zeitschriften -
Universitaetsbibliothek Johann Christian Senckenberg
Bockenheimer Landstr. 134-138
60325 Frankfurt am Main
Tel. 069/798-39382
a.pachzelt@ub.uni-frankfurt.de
-------------------------------------------------------
Re: Manipulate stored string in Lucene
Hi,

You don't need a second field name. You can add the field once as indexed with stored=false, and then add a second instance with the same field name that carries the content to be stored but is not indexed. If you want doc values, the same can be done for them. Internally, Lucene handles it that way anyway. Adding a single field that is stored and indexed at the same time is just a convenience.

Uwe

(In reply to the message from "Pachzelt, Adrian" <A.Pachzelt@ub.uni-frankfurt.de> of May 9, 2018 5:57:40 AM UTC.)

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de
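
Uwe's description of plain Lucene can be sketched roughly as follows. This assumes lucene-core is on the classpath. The class name SplitFieldExample, the method buildDocument, and the displayText parameter are made up for illustration, and producing the tag-free text is left to the caller. The custom tokenizer from the original question would be configured through the Analyzer given to the IndexWriter, which is not shown here.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexableField;

// Hypothetical example class; only the Lucene types are real API.
final class SplitFieldExample {
    static final String FIELD = "tagged_text";

    private SplitFieldExample() {}

    /**
     * Builds a document that carries two instances of the same field name:
     * one that is analyzed and indexed, one that is only stored.
     */
    static Document buildDocument(String taggedText, String displayText) {
        Document doc = new Document();
        // Instance 1: tokenized and indexed, but not stored.
        doc.add(new TextField(FIELD, taggedText, Field.Store.NO));
        // Instance 2: stored only, never analyzed or indexed.
        doc.add(new StoredField(FIELD, displayText));
        return doc;
    }

    public static void main(String[] args) {
        Document doc = buildDocument("<title>Introduction</title>", "Introduction");
        // Show how each instance is configured.
        for (IndexableField f : doc.getFields(FIELD)) {
            System.out.println(f.name()
                    + " stored=" + f.fieldType().stored()
                    + " indexOptions=" + f.fieldType().indexOptions());
        }
    }
}
```

Once such a document has been written with an IndexWriter, only the second instance should come back as the stored value in search results, while queries run against the tokens produced from the first.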
Re: Manipulate stored string in Lucene
Oh, it's Solr? Then it is not easily possible. What I described is how plain Lucene works.

Uwe

(In reply to the message from Uwe Schindler <uwe@thetaphi.de> of May 9, 2018 6:09:42 AM UTC.)
Re: Manipulate stored string in Lucene
Hi Uwe,

Thanks for the advice. Yes, I am using Solr, but I thought this would be a Lucene issue.

I had already tried something like your proposed solution: I set the original field to stored=false indexed=true, created a copyField, and set the copy's destination field to stored=true indexed=false. However, I do not know how to manipulate the stored string in the copyField destination. Do you have an idea?

Thanks a lot! :)

Adrian



(In reply to the message from Uwe Schindler <uwe@thetaphi.de> of Wednesday, May 9, 2018 08:11, sent to general@lucene.apache.org.)

Re: Manipulate stored string in Lucene
Hello, Adrian.
If I understood you correctly, this is a job for an UpdateRequestProcessor; see
https://lucene.apache.org/solr/guide/7_3/update-request-processors.html


(In reply to the message from Pachzelt, Adrian <A.Pachzelt@ub.uni-frankfurt.de> of Wed, May 9, 2018 at 11:39 AM.)




--
Sincerely yours
Mikhail Khludnev
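
Whatever update processor ends up being used, it needs some routine that turns the tagged value into the text that should be stored. Below is a minimal sketch of such a routine using only the JDK. The class StoredTextCleaner and its methods are hypothetical and not part of Lucene or Solr, and the regular expression is a simplification that assumes no '>' occurs inside attribute values. The stock processors listed in the guide linked above (for example HTMLStripFieldUpdateProcessorFactory) may already cover this case without custom code, so it is worth checking them first.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Solr.
final class StoredTextCleaner {
    // Matches one tag such as <sec sec-type="Introduction"> or </title>.
    // Simplification: assumes no '>' inside attribute values.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private StoredTextCleaner() {}

    /** Decodes escaped angle brackets; leaves the text unchanged if there are none. */
    static String unescapeAngleBrackets(String text) {
        return text.replace("&lt;", "<").replace("&gt;", ">");
    }

    /** Replaces every tag with a space, then collapses runs of whitespace. */
    static String stripTags(String markup) {
        String withoutTags = TAG.matcher(markup).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    /** Full conversion from the (possibly escaped) field value to display text. */
    static String toDisplayText(String fieldValue) {
        return stripTags(unescapeAngleBrackets(fieldValue));
    }

    public static void main(String[] args) {
        String input = "&lt;sec sec-type=\"Introduction\" id=\"SECID0E4F\"&gt;\n"
                + "&lt;title&gt;Introduction&lt;/title&gt;\n"
                + "&lt;/sec&gt;\n";
        System.out.println(toDisplayText(input)); // prints: Introduction
    }
}
```

The tags are replaced by a space rather than removed outright so that words from adjacent elements do not run together in the stored text.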
Re: Manipulate stored string in Lucene
I will check this out! Thank you, Mikhail! :)



(In reply to the message from Mikhail Khludnev <mkhl@apache.org> of Wednesday, May 9, 2018 11:15, sent to general@lucene.apache.org.)
