Mailing List Archive

VOTE: Possible features for next release
Hello all,

Below is a list of all features that were requested/suggested for the next
release of Lucene.

If you are in favor of the feature AND you are willing to help implement /
integrate and test it please put a +1 in the brackets. If you are against a
feature please put a -1 in the brackets and provide a reason.

Note: Non committers can vote here, but at least 1 committer must be active
on the feature (i.e. willing to test and integrate it) for it to be part of
the next release.

If something is unclear please let me know. Also, if people have suggestions
on a better way to organize this, let me know.

--Peter

---------------------------------------------------------------------------


[ ] Peter Halacsy's changes to the QueryParser that, I believe, make it
possible to programmatically specify a default operator (OR or AND).

[ ] The recently submitted code that allows for queries such as "Microsoft
suc*" to match "Microsoft success" and "Microsoft sucks".

[ ] Alex Murzaku contributed some code for dealing with Russian.

[ ] A lady from Finland submitted code for handling Finnish.

[ ] Japanese Analyzer ( Kazuhiro Kazama <kazama@ingrid.org>)

[ ] Make the package-protected abstract methods of
org.apache.lucene.search.Searcher public (I'd like to be able to make
subclasses of Searcher, IndexWriter, IndexReader)

[ ] Term Vector Support

[ ] Add a lastModified() method to Directory, FSDirectory and RAMDirectory (so
it could be cached in an IndexWriter/Searcher manager)

[ ] support for adding more than 1 term to the same position (I'm sorry I
didn't find Doug's email about this)

[ ] Does anyone see a problem with adding support for storing unindexed,
untokenized *binary* data as document fields? At the moment, the closest
thing we have is unindexed, untokenized *character* data. Looking at the
source, this will be a trivial change, but I'm curious to learn if there are
specific reasons (other than inclination and opportunity) that this has been
left out.

[ ] Another feature could be the ability to retrieve the number of
occurrences not only for a term but also for a phrase (see
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00101.html)

[ ] Better support for hits sorted by things other than score. An easy,
efficient case is to support results sorted by the order documents were
added to the index.

[ ] Support for results sorted by an arbitrary field.

[ ] Add the ability to "boost" individual documents/fields. When a document is
indexed, a numeric "boost" value could be specified for the whole document,
and/or for individual fields. This value would be multiplied into scores for
hits on this document. This would facilitate the implementation of things
like Google's PageRank.

[ ] Add to FSDirectory the ability to specify where lock files live and to
disable the use of lock files altogether (for read-only media).

[ ] Add some requested methods:
String[] Document.getValues(String fieldName);
String[] IndexReader.getIndexedFields();
void Token.setPositionIncrement(int);


--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
RE: VOTE: Possible features for next release
Another suggestion -

Real support for indexing of numbers. Especially for filtering on them, so
that it doesn't have to be implemented in the fashion of the date filter
with bit set vectors and alphabetic numbers.

I have no idea how difficult this is to implement, but I know I would have a
use for it.
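
The "alphabetic numbers" workaround mentioned above can be sketched roughly like this: numbers are zero-padded to a fixed width so that plain string comparison on terms agrees with numeric order. This is only a minimal illustration in plain Java. SortableNumber, encode and range are hypothetical names, not Lucene API, and negative numbers are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of Lucene: encodes numbers as sortable terms. */
final class SortableNumber {
    // 19 digits is enough for any non-negative long (Long.MAX_VALUE has 19 digits).
    private static final int WIDTH = 19;

    private SortableNumber() {}

    /** Left-pads a non-negative value with zeros so string order equals numeric order. */
    static String encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", value);
    }

    /** Returns the encoded terms that fall in [low, high], using only string comparison. */
    static List<String> range(List<String> terms, long low, long high) {
        String lo = encode(low);
        String hi = encode(high);
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            if (term.compareTo(lo) >= 0 && term.compareTo(hi) <= 0) {
                matches.add(term);
            }
        }
        return matches;
    }
}
```

Without the padding, "9" sorts after "10" as a string, which is why a range filter over raw number strings gives wrong results.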

Dan




-----Original Message-----
From: Peter Carlson [mailto:carlson@bookandhammer.com]
Sent: Thursday, May 23, 2002 10:29 AM
To: Lucene Developers List
Subject: VOTE: Possible features for next release


Hello all,

Below is a list of all features that were requested/suggested for the next
release of Lucene.

.......


RE: VOTE: Possible features for next release
It should be possible to subclass IndexSearcher, IndexWriter, Searcher, Directory, FSDirectory, and RAMDirectory from outside the package.

That would make it possible to write, for example, a ManagedSearcher class that is returned by an IndexAccessControl class:
http://www.mail-archive.com/cgi-bin/htsearch?method=and&format=short&config=lucene-dev_jakarta_apache_org&restrict=&exclude=&words=IndexAccessControl

peter

> -----Original Message-----
> From: Peter Carlson [mailto:carlson@bookandhammer.com]
> Sent: Thursday, May 23, 2002 5:29 PM
> To: Lucene Developers List
> Subject: VOTE: Possible features for next release
>
[...]

RE: VOTE: Possible features for next release
> [...]
>
> [+] Does anyone see a problem with adding support for storing unindexed,
> untokenized *binary* data as document fields? At the moment, the closest
> thing we have is unindexed, untokenized *character* data. Looking at the
> source, this will be a trivial change, but I'm curious to learn if there are
> specific reasons (other than inclination and opportunity) that this has been
> left out.
>
> [...]

RE: VOTE: Possible features for next release
On Thu, 23 May 2002 08:29:17, Peter Carlson wrote:
>[+1] Peter Halacsy's changes to the QueryParser that, I believe, make it
>possible to programmatically specify a default operator (OR or AND).
>
>[+1] The recently submitted code that allows for queries such as "Microsoft
>suc*" to match "Microsoft success" and "Microsoft sucks".
>
>[+1] Alex Murzaku contributed some code for dealing with Russian.
>
>[+1] A lady from Finland submitted code for handling Finnish.
>
>[+1] Japanese Analyzer ( Kazuhiro Kazama <kazama@ingrid.org>)
>
>[+1] Make the package-protected abstract methods of
>org.apache.lucene.search.Searcher public (I'd like to be able to make
>subclasses of Searcher, IndexWriter, IndexReader)
>
>[+1] Term Vector Support
>
>[ ] Add a lastModified() method to Directory, FSDirectory and RAMDirectory (so
>it could be cached in an IndexWriter/Searcher manager)
>
>[ ] support for adding more than 1 term to the same position (I'm sorry I
>didn't find Doug's email about this)
>
>[+1] Does anyone see a problem with adding support for storing unindexed,
>untokenized *binary* data as document fields? At the moment, the closest
>thing we have is unindexed, untokenized *character* data. Looking at the
>source, this will be a trivial change, but I'm curious to learn if there are
>specific reasons (other than inclination and opportunity) that this has been
>left out.
>
>[+1] Another feature could be the ability to retrieve the number of
>occurrences not only for a term but also for a phrase (see
>http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg00101.html)
>
>[+1] Better support for hits sorted by things other than score. An easy,
>efficient case is to support results sorted by the order documents were
>added to the index.
>
>[+1] Support for results sorted by an arbitrary field.
>
>[+1] Add the ability to "boost" individual documents/fields. When a document is
>indexed, a numeric "boost" value could be specified for the whole document,
>and/or for individual fields. This value would be multiplied into scores for
>hits on this document. This would facilitate the implementation of things
>like Google's PageRank.
>
>[ ] Add to FSDirectory the ability to specify where lock files live and to
>disable the use of lock files altogether (for read-only media).
>
>[+1] Add some requested methods:
> String[] Document.getValues(String fieldName);
> String[] IndexReader.getIndexedFields();
> void Token.setPositionIncrement(int);
>
>

I also want to bring up the following:

1. More support for the highlight system ("summarizer tool"). This needs some changes to the Query classes, as suggested by Maik Schreiber: BooleanQuery.getClauses(), some other methods made public instead of private, and so on. That would be more comfortable for users who want to add this feature to their search engine, because right now we have to reapply those changes every time there is a new release.
It would also be good to have a method that returns the positions of the matched terms inside the document (I think this is somewhere in the Scorer, but I don't know how to use it). That way, Analyzer + TermPositions would make it very easy to produce a highlight. So the Document retrieved from the Hits should have a method to get the TermPositions.
At the moment I am using Jakarta ORO to search for/match the terms inside the text, and it seems too slow, especially with large files.

2. I see a lot of "problems" when searching and updating the same index. Maybe it is just me, but this is what I found:
a) It is not possible to "update" a document; it is only possible to delete and re-add it. That means: open a reader, do the delete, close the reader, open a writer, add the document, optimize, close the writer.
So is it possible to move the "delete" method from IndexReader to IndexWriter? Or is that impossible for technical reasons? That way we would open just the writer to update, delete, and add documents. This is useful when the index needs to be updated often.
b) There is no way to update just one field of a document; you need to update the entire document. A field update would be good, though maybe this is hard to do.

3. More documentation about the index format would be useful for users like me who don't know how the index is built: segments, terms, positions, and their relationships.

4. Keeping the index searcher open inside the servlet or JSP saves a lot of time. From my tests on a 1 GB index (600k docs) I see average times like:
a) open for each request: 110 ms
b) open just once: 40 ms
I also built a SearchEngineManager with RMI that sends a callback to the servlet (registered clients) after it refreshes the index, so I re-open the searcher only when I really need to.
I'll write a nice email to explain what my SearchEngineManager does in detail, because it does more than that.

Maybe I went off topic, but I think this is the right moment to discuss such features.
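
To illustrate point 1: if the match offsets were available, highlighting would need no regular-expression scan of the text at all. The following is a minimal sketch in plain Java under the assumption that character offsets (start inclusive, end exclusive) of the matched terms can be obtained somehow; OffsetHighlighter and highlight are hypothetical names and nothing here is existing Lucene API.

```java
import java.util.List;

/** Hypothetical sketch, not Lucene API: wraps known match offsets in markers. */
final class OffsetHighlighter {
    private OffsetHighlighter() {}

    /**
     * @param text    the stored document text
     * @param offsets {start, end} character offsets of matched terms,
     *                sorted by start and non-overlapping (end is exclusive)
     */
    static String highlight(String text, List<int[]> offsets, String open, String close) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] span : offsets) {
            out.append(text, cursor, span[0]); // untouched text before the match
            out.append(open).append(text, span[0], span[1]).append(close);
            cursor = span[1];
        }
        out.append(text, cursor, text.length()); // trailing text after the last match
        return out.toString();
    }
}
```

The work is a single pass over the text, proportional to its length, regardless of how many query terms there are.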




RE: VOTE: Possible features for next release
> From: none none
>
> 2. I see a lot of "problems" when searching and updating
> the same index. Maybe it is just me, but this is what I found:
> a) It is not possible to "update" a document; it is only possible
> to delete and re-add it. That means: open a reader, do the delete,
> close the reader, open a writer, add the document, optimize,
> close the writer.
> So is it possible to move the "delete" method from
> IndexReader to IndexWriter? Or is that impossible for technical
> reasons? That way we would open just the writer to
> update, delete, and add documents. This is useful when the
> index needs to be updated often.

This is not possible without a major re-structuring of Lucene. This has
been discussed here before. Perhaps this should be in the FAQ.

> b) There is no way to update just one field of a document; you
> need to update the entire document. A field update would be
> good, though maybe this is hard to do.

This is not possible without a major re-structuring of Lucene. This has
been discussed here before. Perhaps this should be in the FAQ.

> 3. More documentation about the index format would be useful
> for users like me who don't know how the index is built:
> segments, terms, positions, and their relationships.

This has also been discussed on the mailing lists, but not put into a
published document. Hopefully most users should not need to know the
details of the index format. However, it still should probably be
documented.

See, for example:

http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-user@jakarta.apache.org&msgNo=1618

Doug

Re: VOTE: Possible features for next release
>2. I see a lot of "problems" when searching and updating the same index. Maybe it is just me, but this is what I found:
> a) It is not possible to "update" a document; it is only possible to delete and re-add it. That means: open a reader, do the delete, close the reader, open a writer, add the document, optimize, close the writer.
>So is it possible to move the "delete" method from IndexReader to IndexWriter? Or is that impossible for technical reasons? That way we would open just the writer to update, delete, and add documents. This is useful when the index needs to be updated often.
> b) There is no way to update just one field of a document; you need to update the entire document. A field update would be good, though maybe this is hard to do.
>
(a), and maybe (b), are also on my wish/todo list. The approach I
was thinking of taking was to make these "transactional", so that both
changes are done on an IndexWriter and take effect when the writer
closes. The old document in an older segment gets "shadowed" by the new
one until the optimization. Once optimized, the old document is gone for
good. I think this will solve the issue of updates being cumbersome and
will also make a lot of the concurrency headaches go away.
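
As a rough model of the shadowing idea above (a sketch only, with made-up names; ShadowIndex and its Writer are hypothetical and do not exist in Lucene, and deletes are not modeled), segments can be pictured as maps that are searched newest first:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of "shadowed" updates; none of these classes exist in Lucene. */
final class ShadowIndex {
    // Each segment maps a document key to its content; newer segments come later.
    private final List<Map<String, String>> segments = new ArrayList<>();

    /** Buffers changes and publishes them as one new segment on close(). */
    final class Writer {
        private final Map<String, String> pending = new LinkedHashMap<>();

        void update(String key, String content) {
            pending.put(key, content);
        }

        void close() {
            if (!pending.isEmpty()) {
                segments.add(new LinkedHashMap<>(pending));
                pending.clear();
            }
        }
    }

    Writer openWriter() {
        return new Writer();
    }

    /** Newest segment wins, so an updated document shadows its older copy. */
    String get(String key) {
        for (int i = segments.size() - 1; i >= 0; i--) {
            String content = segments.get(i).get(key);
            if (content != null) {
                return content;
            }
        }
        return null;
    }

    /** Merges all segments into one; shadowed copies are dropped for good. */
    void optimize() {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> segment : segments) {
            merged.putAll(segment); // later segments overwrite earlier ones
        }
        segments.clear();
        segments.add(merged);
    }

    int segmentCount() {
        return segments.size();
    }
}
```

Readers keep seeing the old copy until the writer closes, which is what makes the change look transactional in this model.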

Another feature on my list is a reduction in the number of open files.
This is especially a problem in cases where many indexes are in use at
the same time. The approach I'm thinking of here is to merge all files
of a given segment into a single file after that segment is closed.
Since the segment is never written to after it is first created (besides
the deleted file?), there should be no problem with fragmentation and
growth. This can be done as part of the segment merge, so that the
output of a merge is this new single-file segment rather than the
existing multi-file one. For extra credit, we can add support for having
1 to n file handles allocated to this one-file segment, in case the OS is
able to better optimize non-contiguous reads when multiple
file handles are used. Does anyone know?
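
To make the single-file idea concrete, here is a minimal sketch in plain Java that packs several named sub-files into one byte stream and reads one back. The layout (entry count, then name, length and bytes per sub-file) and the name SingleFileSegment are assumptions made up for this illustration, not an actual Lucene format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;

/** Hypothetical packing format for illustration; not an actual Lucene file format. */
final class SingleFileSegment {
    private SingleFileSegment() {}

    /** Layout: entry count, then (name, length, bytes) for each sub-file in order. */
    static byte[] pack(Map<String, byte[]> files) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(files.size());
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                out.writeUTF(file.getKey());
                out.writeInt(file.getValue().length);
                out.write(file.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Scans the packed data for one sub-file; returns null if it is absent. */
    static byte[] read(byte[] packed, String name) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String entryName = in.readUTF();
                byte[] data = new byte[in.readInt()];
                in.readFully(data);
                if (entryName.equals(name)) {
                    return data;
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real implementation would more likely keep a table of offsets at the head of the file and seek directly to a sub-file instead of scanning; the scan just keeps the sketch short.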

Dmitry.




Re: Possible features for next release
What about the Support for Search Term Highlighting? (see Maik Schreiber's
paper)

It seems to have vanished from the list of features?

----- Original Message -----
From: "Peter Carlson" <carlson@bookandhammer.com>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Thursday, May 23, 2002 8:29 AM
Subject: VOTE: Possible features for next release


RE: VOTE: Possible features for next release
> -----Original Message-----
> From: none none [mailto:korfut@lycos.com]
> Sent: Thursday, May 23, 2002 7:39 PM
> To: Lucene Developers List
> Subject: RE: VOTE: Possible features for next release
>
> [...]
>
> 4. Keeping the index searcher open inside the servlet or JSP
> saves a lot of time. From my tests on a 1 GB index (600k docs)
> I see average times like:
> a) open for each request: 110 ms
> b) open just once: 40 ms

An IndexSearcher must be closed and reopened if (and only if) the index has been modified. Scott Ganyo posted an IndexAccessControl class that can be used to manage searchers. After that we discussed it in private emails.

I hope this logic can become part of Lucene some time. I've attached my latest version of IndexAccessControl (IAC). Usage:
// First obtain an IndexAccessControl instance. There is one IAC
// instance per directory (or index path) the application uses.
IndexAccessControl iac = IndexAccessControl.getInstance(directory);
// Request a searcher
Searcher searcher = iac.getSearcher();

// use the searcher here

// Close it: IAC does not really close the searcher, it only returns it to a cache (in effect a pool of size 1)
searcher.close();

This code is as fast as if you had only one IndexSearcher in your app (to be exact, there is only one real searcher open).

IndexAccessControl also has methods for managing Writers and Readers.
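
For readers without the attachment, the caching idea can be sketched roughly as follows. This is not the attached IndexAccessControl; SearcherCache and CachedSearcher are hypothetical stand-ins in plain Java, and the "index version" supplier is an assumption (it could be, for example, a last-modified timestamp).

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch of the caching idea; it is not the attached IndexAccessControl. */
final class SearcherCache {
    /** Stand-in for a real searcher; only records which index version it was opened on. */
    static final class CachedSearcher {
        final long version;

        CachedSearcher(long version) {
            this.version = version;
        }

        void close() {
            // no-op: the cache owns the searcher's lifetime
        }
    }

    private final LongSupplier indexVersion; // assumed source of "has the index changed?"
    private CachedSearcher current;
    private int opens;

    SearcherCache(LongSupplier indexVersion) {
        this.indexVersion = indexVersion;
    }

    /** Returns the shared searcher, reopening it only if the index has changed. */
    synchronized CachedSearcher getSearcher() {
        long version = indexVersion.getAsLong();
        if (current == null || current.version != version) {
            current = new CachedSearcher(version);
            opens++;
        }
        return current;
    }

    /** Number of times a searcher was really opened. */
    synchronized int openCount() {
        return opens;
    }
}
```

Callers can request and close a searcher on every request, yet the expensive open happens only once per index change.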

peter

PS: To compile this code you must modify some Lucene classes: Searcher, Directory, etc.