Mailing List Archive

cvs commit: jakarta-lucene TODO.txt
otis 02/05/27 16:56:54

Added: . TODO.txt
Log:
- Lucene TO-DO items.

Revision Changes Path
1.1 jakarta-lucene/TODO.txt

Index: TODO.txt
===================================================================
$Revision: 1.1 $

LUCENE TO-DO ITEMS


- Term Vector support
c.f.
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgNo=273
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgNo=272

- Support for Search Term Highlighting
c.f.
http://www.geocrawler.org/archives/3/2624/2001/9/50/6553088/
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=115271
http://www.iq-computing.de/index.asp?menu=projekte-lucene-highlight
http://nagoya.apache.org/eyebrowse/BrowseList?listName=lucene-dev@jakarta.apache.org&by=thread&from=56403

- Better support for hits sorted by things other than score.
An easy, efficient case is to support results sorted by the order documents were
added to the index. A little harder and less efficient is support for
results sorted by an arbitrary field.
c.f.
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114756
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00228.html

- Add ability to "boost" individual documents/fields.
When a document is indexed, a numeric "boost" value could be specified for the whole
document, and/or for individual fields. This value would be multiplied into
scores for hits on this document. This would facilitate the implementation of
things like Google's PageRank.
c.f.
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114749
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=114757

- Add to FSDirectory the ability to specify where lock files live and
to disable the use of lock files altogether (for read-only media).
c.f.
http://nagoya.apache.org/eyebrowse/BrowseList?listName=lucene-user@jakarta.apache.org&by=thread&from=57011

- Add some requested methods:
String[] Document.getValues(String fieldName);
String[] IndexReader.getIndexedFields();
void Token.setPositionIncrement(int);
c.f.
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=330010
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=330009

- Péter Halácsy's changes to the QueryParser that make it possible to
programmatically specify a default operator (OR or AND).
c.f.
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=115677

- The recently submitted code that allows for queries such as
"Microsoft suc*" to match "Microsoft success" and "Microsoft sucks".
c.f.
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=333275

- Make package protected abstract methods of org.apache.lucene.search.Searcher
public (I'd like to be able to make subclasses of Searcher, IndexWriter, IndexReader).
c.f.
http://www.mail-archive.com/cgi-bin/htsearch?method=and&format=short&config=lucene-dev_jakarta_apache_org&restrict=&exclude=&words=IndexAccessControl

- Add lastModified() method to Directory, FSDirectory and RAMDirectory, so
it could be cached in IndexWriter/Searcher manager.

- Support for adding more than 1 term to the same position.
N.B. I think the Finnish lady already implemented this. It required some
pieces of Lucene to be modified. (OG).

- The ability to retrieve the number of occurrences not only for a term
but also for a Phrase.
c.f.
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00101.html

- Alex Murzaku contributed some code for dealing with Russian.
c.f.
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=115631

- A lady from Finland submitted code for handling Finnish.

- Dutch stemmer, analyzer, etc.
c.f.
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgNo=145

- French stemmer, analyzer, etc.
c.f.
http://nagoya.apache.org/eyebrowse/BrowseList?listName=lucene-dev@jakarta.apache.org&by=thread&from=56256

- Che Dong's CJKTokenizer for Chinese, Japanese, and Korean.
c.f.
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=330905

- Selecting a language-specific analyzer according to a locale.
Currently we have to rewrite parts of the Lucene code in order to use another
analyzer. It would be useful to select an analyzer without touching the code.

- Adding "-encoding" option and encoding-sensitive methods to tools.
The current tools need minor changes to work in a Japanese (or other-language)
environment: adding an "-encoding" option and argument, using
Reader/Writer classes instead of InputStream/OutputStream classes, etc.


$Revision: 1.1 $




--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
Re: cvs commit: jakarta-lucene TODO.txt
I tried to dig out everything and anything from my Lucene email folder
that looked interesting and useful. I tried finding list archive URLs
for all items, so that people could easily see previous discussions or
get to contributed code, instead of starting the brain-storming process
from scratch. Not all items have references, but that may be improved
later.
As we implement/add things to Lucene we can remove items from this
list, and as we get feature requests that seem popular and useful we
can append them to the list.

I just got sick of keeping all these emails, flagging them, and so on,
and had to put this list of TO-DO items somewhere. I think others may
find it useful, too.

Otis



RE: cvs commit: jakarta-lucene TODO.txt
Can numeric support be added to the todo list? I realize that James Ricci
and I are the only two who have requested it recently... and maybe it's an
unreasonable request for Lucene, due to its design and the amount of work it
would take to implement it - which I'm willing to live with - but I haven't
seen anything that points one way or the other as to why numeric support
doesn't exist in Lucene.

Thanks,

Dan



Re: cvs commit: jakarta-lucene TODO.txt
Hi there!

I just came back from a trip and noticed the start of
a feature-request document for the next version. First,
let me correct Otis: I only submitted some raw
Java code generated from Snowball, giving it as an
example in response to several requests for handling
Russian. That code was not integrated into Lucene's
Analyzer framework.
Second, I never heard from Doug whether it is possible
in theory to implement some other similarity/distance
function and to plug it in instead of the standard
enhanced tf*idf. I think there was at least one other
Lucene user interested in this (especially in the case
of short fields like addresses, single sentences,
etc.)

Cheers,

Alex

--- otis@apache.org wrote:
> - Alex Murzaku contributed some code for dealing
> with Russian.
> c.f.
>
>
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=115631


RE: cvs commit: jakarta-lucene TODO.txt
> From: Alex Murzaku [mailto:murzaku.at.yahoo.com@cutting.at.lucene.com]
>
> Second, never heard from Doug whether it is possible
> in theory to implement some other similarity/distance
> function and to plug these instead of the standard
> enhanced tf*idf. I think there was at least one other
> Lucene user interested in this (especially in the case
> of short fields like addresses, single sentences,
> etc.)

Some scoring changes are hard, some are easy.

Relatively easy things:
- changing per-term factor in score -- currently idf, i.e.,
log(numDocs/(df+1))+1
- changing factor based on term's freq within document -- currently
sqrt(tf)
- changing the coordination factor, the boost a hit gets for containing a
large percentage of the query terms.

These all correspond to methods in Similarity.java. These could be made
into a TermWeight interface, with a default implementation, and a way to
specify an alternate implementation when building a searcher.
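
As a rough illustration of what such an interface might look like, here is a minimal sketch in plain Java. Everything in it is hypothetical: TermWeightSketch, TermWeight, DefaultTermWeight and BinaryTfTermWeight are names made up for this example and are not Lucene classes. The idf expression is read as log(numDocs/(df+1))+1, and the coordination factor is assumed to be the fraction of query terms matched, since no formula for it is given above.

```java
// Hypothetical sketch only: none of these types exist in Lucene.
class TermWeightSketch {

    /** The three "relatively easy" factors, gathered behind one interface. */
    interface TermWeight {
        /** Per-term factor: docFreq = documents containing the term, numDocs = index size. */
        double idf(int docFreq, int numDocs);

        /** Factor based on the term's frequency within a document. */
        double tf(int freq);

        /** Boost for containing a larger share of the query terms. */
        double coord(int overlap, int maxOverlap);
    }

    /** Default implementation mirroring the factors named in the message. */
    static class DefaultTermWeight implements TermWeight {
        public double idf(int docFreq, int numDocs) {
            // Assumed grouping: log(numDocs / (df + 1)) + 1
            return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
        }

        public double tf(int freq) {
            return Math.sqrt(freq);
        }

        public double coord(int overlap, int maxOverlap) {
            // Assumption: fraction of query terms that matched.
            return overlap / (double) maxOverlap;
        }
    }

    /** An alternate implementation that ignores how often a term repeats. */
    static class BinaryTfTermWeight extends DefaultTermWeight {
        @Override
        public double tf(int freq) {
            return freq > 0 ? 1.0 : 0.0;
        }
    }

    public static void main(String[] args) {
        TermWeight standard = new DefaultTermWeight();
        TermWeight binary = new BinaryTfTermWeight();
        System.out.println("default tf(4) = " + standard.tf(4));
        System.out.println("binary  tf(4) = " + binary.tf(4));
    }
}
```

A searcher built this way would take a TermWeight at construction time and fall back to the default when none is supplied; an override such as the binary tf above is the kind of change that might suit short fields.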

Somewhat harder:
- changing per-document factor in score -- currently sqrt(docLength). This
is also a Similarity method, but it is called when the index is created, so
its implementation cannot be changed at search time.

The scoring formula sums products of all these factors.
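
To make "sums products" concrete, here is a small self-contained sketch in Java. The exact arrangement is an assumption, since the message does not spell it out: the coordination factor is taken to multiply a sum, over the query terms found in the document, of tf factor times idf factor times the per-document factor. ScoreSketch and its parameters are hypothetical names, and the per-document factor is simply passed in as a number fixed at indexing time.

```java
// Hypothetical sketch of a "sum of products" score; not Lucene's actual code.
class ScoreSketch {

    /**
     * @param termFreqs frequency of each query term in the document (0 if absent)
     * @param docFreqs  number of documents containing each query term
     * @param numDocs   total number of documents in the index
     * @param docFactor per-document factor, fixed when the document was indexed
     */
    static double score(int[] termFreqs, int[] docFreqs, int numDocs, double docFactor) {
        if (termFreqs.length == 0) {
            return 0.0; // an empty query matches nothing
        }
        double sum = 0.0;
        int matched = 0;
        for (int i = 0; i < termFreqs.length; i++) {
            if (termFreqs[i] == 0) {
                continue; // term does not occur in this document
            }
            matched++;
            double tf = Math.sqrt(termFreqs[i]);
            // Assumed grouping: log(numDocs / (df + 1)) + 1
            double idf = Math.log(numDocs / (double) (docFreqs[i] + 1)) + 1.0;
            sum += tf * idf * docFactor; // one product per matching term
        }
        // Assumption: coordination factor = fraction of query terms matched.
        double coord = matched / (double) termFreqs.length;
        return coord * sum;
    }

    public static void main(String[] args) {
        // Two query terms; only the first occurs in the document (four times).
        System.out.println(score(new int[] {4, 0}, new int[] {9, 5}, 10, 1.0));
    }
}
```

Changing any single factor leaves the loop intact, which is why those changes are the easy ones; changing the shape of the loop itself is the hard case described next.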

Harder-yet things:
- change the form of the scoring formula itself: a fair amount of code
assumes that scores are sums of products of the above factors. It would
be challenging to design things both so that the formula can be easily
altered and so that things are efficient. I think if folks really want to
change the formula fundamentally, they're best off using IndexReader
directly and writing a search algorithm from scratch.

So what in particular are you interested in altering? Would you be
satisfied with the addition of a TermWeight interface?

Doug

RE: cvs commit: jakarta-lucene TODO.txt
Thanks for the very quick answer Doug!

I think/hope that the first solution you offer could
be the most flexible and realistic for the next
release. Once we have a TermWeight interface with a
default implementation, we could start to tinker and
experiment with it until we reach something more
satisfactory for the more esoteric uses of Lucene.

Since I was referring to apps with homogeneous data
(i.e. all records with more or less the same length),
sqrt(docLength) should remain roughly constant.

My concern, in more general terms, is when the query is
the same size as the indexed documents (sentence to
sentence, address to address, file to file) which
could find uses in clustering, data clean-up, etc.
While large document to large document similarity
works fine (but is very slow), short text to short
text similarity seemed more problematic in my
experiments. In any case my problem in these
experiments wasn't just Lucene...

Thanks again,

Alex
