Mailing List Archive

Multi-IDF for a single term possible?
Hello,

We are using TF-IDF for scoring (we have yet to migrate to BM25). Different
entities (DOC_TYPES) are processed & stored together in a single index.

When it comes to IDF, I find that there is a single value computed across
all documents & stored as part of TermStats, whereas our documents are not
homogeneous. So, a single IDF value doesn't work for us.

We would like to compute IDF for each <Term/DOC_TYPE> pair, store it &
later use the paired-IDF values during query time. Is something like this
possible via Codecs or other mechanisms?
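
To make the concern concrete, here is a small Java sketch, assuming the classic TF-IDF form idf = 1 + ln((docCount + 1) / (docFreq + 1)). The DOC_TYPE names and all counts are invented for illustration. It compares one IDF computed over the whole index with IDF computed per DOC_TYPE for the same term.

```java
// Sketch: compare one index-wide IDF with IDF computed per DOC_TYPE.
// The DOC_TYPE names and counts below are invented for illustration.
final class PerTypeIdfDemo {

    // Classic TF-IDF style IDF: 1 + ln((docCount + 1) / (docFreq + 1)).
    static double idf(long docFreq, long docCount) {
        return 1.0 + Math.log((docCount + 1.0) / (docFreq + 1.0));
    }

    public static void main(String[] args) {
        // Hypothetical index: 1,000 "mail" docs and 100,000 "log" docs.
        // The term "invoice" occurs in 900 mail docs and 100 log docs.
        long mailDocs = 1_000, mailDf = 900;
        long logDocs = 100_000, logDf = 100;

        double global = idf(mailDf + logDf, mailDocs + logDocs);
        double mail = idf(mailDf, mailDocs);
        double log = idf(logDf, logDocs);

        System.out.printf("global=%.3f mail=%.3f log=%.3f%n", global, mail, log);
    }
}
```

With these made-up counts the index-wide value falls between the two per-type values, so it describes neither DOC_TYPE well.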

Any help is much appreciated.

--
Ravi
Re: Multi-IDF for a single term possible?
Is there any reason why you are not storing each DOC_TYPE in its own index?

On Tue, Dec 3, 2019 at 1:50 PM Ravikumar Govindarajan
<ravikumar.govindarajan@gmail.com> wrote:
> We would like to compute IDF for each <Term/DOC_TYPE> pair, store it &
> later use the paired-IDF values during query time. Is something like this
> possible via Codecs or other mechanisms?

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Multi-IDF for a single term possible?
It is enough to give each DOC_TYPE its own field.
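
One way to read this suggestion: Lucene keeps term statistics per field, so copying the shared text into a field named after the DOC_TYPE gives each type its own document frequencies. The sketch below only mimics that bookkeeping in plain Java and makes no Lucene calls; the `content_<docType>` naming scheme and the class itself are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: route shared text into one field per DOC_TYPE so that document
// frequencies are counted separately per field. Plain Java, no Lucene calls.
final class PerTypeFieldStats {
    // field name -> (term -> number of docs containing the term)
    private final Map<String, Map<String, Integer>> docFreqByField = new HashMap<>();
    // field name -> number of docs that have the field
    private final Map<String, Integer> docCountByField = new HashMap<>();

    // Hypothetical naming scheme, e.g. "content" + "mail" -> "content_mail".
    static String fieldFor(String baseField, String docType) {
        return baseField + "_" + docType;
    }

    void addDocument(String docType, String baseField, String text) {
        String field = fieldFor(baseField, docType);
        docCountByField.merge(field, 1, Integer::sum);

        // Count each term once per document.
        Set<String> unique = new HashSet<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (!tok.isEmpty()) unique.add(tok);
        }
        Map<String, Integer> df = docFreqByField.computeIfAbsent(field, f -> new HashMap<>());
        for (String term : unique) df.merge(term, 1, Integer::sum);
    }

    int docFreq(String field, String term) {
        Map<String, Integer> df = docFreqByField.get(field);
        return df == null ? 0 : df.getOrDefault(term, 0);
    }

    int docCount(String field) {
        return docCountByField.getOrDefault(field, 0);
    }
}
```

A unified search would then presumably need to query every per-type field, which is a cost to weigh against the cleaner statistics.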

On Tue, Dec 3, 2019 at 7:57 AM Adrien Grand <jpountz@gmail.com> wrote:

> Is there any reason why you are not storing each DOC_TYPE in its own index?
Re: Multi-IDF for a single term possible?
Hi Ravi,
Can you give more details on how you store an entity in Lucene? What is a doc type?
What fields do you have?

Cheers

Re: Multi-IDF for a single term possible?
>
> It is enough to give each DOC_TYPE its own field.
>

I kind of over-simplified the problem at hand. Apologies.

DOC_TYPE is just one aspect of the problem. The other is that it is
actually a shared index with multiple users (100-3000 users per
index). There are many hundreds of such shared indexes in our cluster.

Search happens per-user & it doesn't make sense to have a single IDF. We
are ideally looking at some lucene extensions/tricks to store & retrieve
IDF in <User/DOC_TYPE> pairs.

> Is there any reason why you are not storing each DOC_TYPE in its own index?


There are some common fields across all DOC_TYPES (e.g. content, attachment,
etc.), and to provide unified search for a user we colocate them in a
single index.

--
Ravi

On Tue, Dec 3, 2019 at 6:30 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
dceccarelli4@bloomberg.net> wrote:

> Hi Ravi,
> Can you give more details on how you store an entity in Lucene? What is
> a doc type?
> What fields do you have?
>
> Cheers
Re: Multi-IDF for a single term possible?
IDF is a simple measure to calculate. So, if building a separate index for
each user is not an ideal solution, then I suggest you could try to
calculate these statistics upfront. Just maintain these statistics for each
user, then use them in the query process.

At search time, you use these stats in your ranking. One possible way
is to write a similarity wrapper that reads the needed information from
a hash map.
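
A minimal sketch of the hash-map idea, written in plain Java rather than as an actual Lucene Similarity subclass. The key layout (user, DOC_TYPE, term), the class name, and the classic-style formula 1 + ln((docCount + 1) / (docFreq + 1)) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of precomputed per-user statistics kept in hash maps.
// The key layout and all names are assumptions, not a Lucene API.
final class PerUserIdfTable {
    // "user|docType|term" -> number of that user's docs containing the term
    private final Map<String, Long> docFreq = new HashMap<>();
    // "user|docType" -> number of that user's docs of that type
    private final Map<String, Long> docCount = new HashMap<>();

    // Note: "|" as separator assumes user ids and doc types never contain it.
    private static String scope(String user, String docType) {
        return user + "|" + docType;
    }

    // Call once per document.
    void recordDocument(String user, String docType) {
        docCount.merge(scope(user, docType), 1L, Long::sum);
    }

    // Call once per (document, unique term).
    void recordTerm(String user, String docType, String term) {
        docFreq.merge(scope(user, docType) + "|" + term, 1L, Long::sum);
    }

    // IDF from the user's own counts; a similarity wrapper could call this
    // instead of relying on the index-wide statistics.
    double idf(String user, String docType, String term) {
        long n = docCount.getOrDefault(scope(user, docType), 0L);
        long df = docFreq.getOrDefault(scope(user, docType) + "|" + term, 0L);
        return 1.0 + Math.log((n + 1.0) / (df + 1.0));
    }
}
```

Keeping raw counts rather than finished IDF values means the table can be updated incrementally as documents arrive; deletions would need matching decrements, which this sketch leaves out.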

Regards
Ameer



On Wed, 4 Dec 2019 at 00:55, Ravikumar Govindarajan <
ravikumar.govindarajan@gmail.com> wrote:

> Search happens per-user & it doesn't make sense to have a single IDF. We
> are ideally looking at some lucene extensions/tricks to store & retrieve
> IDF in <User/DOC_TYPE> pairs.
Re: Multi-IDF for a single term possible?
Thanks Ameer!

I was thinking about a few ideas, such as tapping into a Codec
extension to store multi-IDF values in two files: an IDF Meta-file & an
IDF Data-file.

The IDF Meta-file holds a list of {UserId, Terms-Data-File-Offset} pairs for
each Term, encoded via ForUtil.

The IDF Data-file holds a "Count" & {doc_type, idf} pairs, encoded as
vInts. ["Count" is the number of vInt pairs to decode for a given UserId.]

TermStats for each Term also needs to be extended to store the pair of
start offsets {IDF Meta-file, IDF Data-file}, as vLongs.
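
As a rough illustration of the IDF Data-file entry described above, the sketch below writes a "Count" followed by that many {doc_type, idf} pairs as variable-length ints, using the usual 7-bits-per-byte scheme. Storing idf as a fixed-point integer (idf * 1000) and using integer doc_type ids are assumptions made here, since the thread does not say how those values are represented. The class and method names are hypothetical, and nothing here touches the real Codec API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Sketch of one IDF Data-file entry: Count, then Count {doc_type, idf} pairs,
// all as variable-length ints. The fixed-point idf encoding is an assumption.
final class IdfEntryCodec {
    static final int SCALE = 1000;

    // 7 data bits per byte; the high bit is set when more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(ByteBuffer in) {
        int value = 0;
        for (int shift = 0; ; shift += 7) {
            byte b = in.get();
            value |= (b & 0x7F) << shift;
            if (b >= 0) return value; // high bit clear: last byte
        }
    }

    // docTypes[i] pairs with idfs[i].
    static byte[] encode(int[] docTypes, float[] idfs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, docTypes.length); // "Count"
        for (int i = 0; i < docTypes.length; i++) {
            writeVInt(out, docTypes[i]);
            writeVInt(out, Math.round(idfs[i] * SCALE));
        }
        return out.toByteArray();
    }

    // Returns the idf stored for docType, or NaN if the entry has none.
    static float lookup(byte[] entry, int docType) {
        ByteBuffer in = ByteBuffer.wrap(entry);
        int count = readVInt(in);
        for (int i = 0; i < count; i++) {
            int type = readVInt(in);
            int scaled = readVInt(in);
            if (type == docType) return scaled / (float) SCALE;
        }
        return Float.NaN;
    }
}
```

The lookup is a linear scan over the pairs, which seems reasonable only while the number of DOC_TYPES per user stays small.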

There's a possibility of a long tail occurring in the IDF Meta-file. That is,
the number of users sharing a term (possibly a common term) could be very
high, in which case we might need to generate sampled data. But this
currently doesn't happen in our app.

This is just a quick hack & I really don't have an estimate of the penalty
we would pay for fetching this info.

I am not sure if this is a worthwhile idea to explore. Any input from
members is much appreciated.

--
Ravi

On Tue, Dec 3, 2019 at 10:30 PM Ameer Albahem <ameer.albahem@gmail.com>
wrote:

> IDF is a simple measure to calculate. So, if building a separate index for
> each user is not an ideal solution, then I suggest you could try to
> calculate these statistics upfront. Just maintain these statistics for each
> user, then use them in the query process.