Mailing List Archive

fixed url and How to contribute code to lucene sandbox?
http://www.chedong.com/tech/lucene.html

I fixed the reference URL; it now points to:
http://jakarta.apache.org/lucene/

BTW:
How do I contribute code to the Lucene Sandbox?


Che, Dong

----- Original Message -----
From: "Otis Gospodnetic" <otis_gospodnetic@yahoo.com>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Sunday, September 08, 2002 12:01 AM
Subject: Re: Lucene introduction in Chinese


> Thank you for this.
> I think we should add this to the contribution page or some other place
> on the Lucene site (I'll take a look in a bit).
> I would like to just add a link to it.
>
> Note: the link to Lucene's home page at the bottom of the page is
> wrong: http://jakarta.apache.org/Lucene/
> should be
> http://jakarta.apache.org/lucene/
>
> Thanks,
> Otis
>
>
Re: fixed url and How to contribute code to lucene sandbox? [ In reply to ]
I will add this to the contributions page.

--Peter
On Saturday, September 7, 2002, at 10:48 PM, Che Dong wrote:

> http://www.chedong.com/tech/lucene.html
>
> fixed reference url with:
> http://jakarta.apache.org/lucene/
>
> BTW:
> How to contribute code to lucene sandbox?
>
>
> Che, Dong


--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
Re: fixed url and How to contribute code to lucene sandbox? [ In reply to ]
I checked what I have posted before:
http://nagoya.apache.org/eyebrowse/SearchList?listId=&listName=lucene-dev@jakarta.apache.org&searchText=Che&defaultField=sender&Search=Search


My posts fall mainly into two areas:

1. Custom sorting in addition to the default score sorting: make docID an alias for the field you want the output sorted by.
This is solved by sorting the data before indexing (for example, by the PostDate field), so that docID can act as an alias for the sort field. A HitCollector can then
sort by docID, by 1/docID, or even by a more complex strategy such as docID * score.
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=115469
IndexOrderSearcher: sort data before indexing and use 1/docID instead of score

2. CJK support:
2.1 Single-character ("sigram") based: no word segmentation, each character is one token; modified from StandardTokenizer.java
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=330905
CJKTokenizer for Asia language(Chinese Japanese Korean) Word Segment
http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=450266
StandardTokenizer with sigram based CJK Support

2.2 Bigram-based word segmentation: modified from SimpleTokenizer to CJKTokenizer.java
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg01220.html

Thank you

I also have some suggestions, and I am working on a binding from the Lucene structure (Document, Field, Index) to XML. If we define a standard lucene.dtd as the default Lucene input format, it might be useful for integrating applications with Lucene.


Che, Dong
----- Original Message -----
From: "Peter Carlson" <carlson@bookandhammer.com>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Sunday, September 08, 2002 2:08 PM
Subject: Re: fixed url and How to contribute code to lucene sandbox?


> I will add this to the contributions page.
>
> --Peter
Re: fixed url and How to contribute code to lucene sandbox? [ In reply to ]
> BTW:
> How to contribute code to lucene sandbox?

You can just mail lucene-dev and attach your code.
Is this regarding your other contributions?
I haven't had the chance to really look at them yet :(

Stuff that goes into Sandbox is usually a software component or a
project. We haven't really put any code snippets (e.g. single classes)
in there.
Maybe we should have a place to use as a repository for various code
snippets that people contribute, that would otherwise get lost in the
mailing list archives, I don't know.

Otis


Re: fixed url and How to contribute code to lucene sandbox? [ In reply to ]
For code to be added to Sandbox, it also has to be APL.

Otis, I suggest creating a space on Lucene's website for these ad-hoc
contribs. I know Sandbox was meant for this, but it's not reasonable to
expect everyone to APL their code. I'm willing to maintain this section if
necessary. Attachments can be emailed to me or to the list, and I'll add
them in. The alternative is we relax the requirement for Sandbox code to be
APL, or create a SF project for this stuff (ugh).

Regards,
Kelvin


On Sun, 8 Sep 2002 19:57:17 -0700 (PDT), Otis Gospodnetic wrote:
>>BTW:
>>How to contribute code to lucene sandbox?
>
>You can just mail lucene-dev and attach your code.
>Is this regarding your other contributions?
>I haven't had the chance to really look at them yet :(
>
>Stuff that goes into Sandbox is usually a software component or a
>project. We haven't really put any code snippets (e.g. single classes)
>in there.
>Maybe we should have a place to use as a repository for various code
>snippets that people contribute, that would otherwise get lost in the
>mailing list archives, I don't know.
>
>Otis
Re: fixed url and How to contribute code to lucene sandbox? [ In reply to ]
I think that all the code in the sandbox should be APL. If not, then I
would suggest people email their contribution and we create a link to
it in a mail archive.
This way we don't have to maintain it, but it is available without the
contributor needing a special web server.

--Peter




On Sunday, September 8, 2002, at 08:10 PM, Kelvin Tan wrote:

> For code to be added to Sandbox, it also has to be APL.
>
> Otis, I suggest creating a space on Lucene's website for these ad-hoc
> contribs. I know Sandbox was meant for this, but it's not reasonable to
> expect everyone to APL their code. I'm willing to maintain this section if
> necessary. Attachments can be emailed to me or to the list, and I'll add
> them in. The alternative is we relax the requirement for Sandbox code to be
> APL, or create a SF project for this stuff (ugh).
>
> Regards,
> Kelvin
Re: fixed url and How to contribute code to lucene sandbox? [ In reply to ]
Peter,

I agree Sandbox code should be APL. One of the primary reasons Sandbox was
created was to avoid having to use the mailing list archives as a
repository of sorts for contributions, which IMHO, kinda sucks.

The questions here are:

1) What about small APL contributions, such as a single class? Do they go
into Sandbox? If so, do their contributors become Sandbox committers? If not,
who'd be responsible for maintaining them?
2) What about non-APL contributions? Is there an alternative to using the
mail archives to provide access to them?

Regards,
Kelvin


On Sun, 8 Sep 2002 23:07:27 -0700, Peter Carlson wrote:
>I think that all the code in the sandbox should be APL. If not, then I
>would suggest people email their contribution and we create a link to
>it in a mail archive.
>This way we don't have to maintain it, but it is available without a
>special web server from the contributor.
>
>--Peter
Re: fixed url and How to contribute code to lucene sandbox? [ In reply to ]
Che Dong wrote:
> 1. Custom sorting in addition to the default score sorting: make docID an alias for the field you want the output sorted by.
> This is solved by sorting the data before indexing (for example, by the PostDate field), so that docID can act as an alias for the sort field. A HitCollector can then
> sort by docID, by 1/docID, or even by a more complex strategy such as docID * score.
> http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=115469
> IndexOrderSearcher: sort data before indexing and use 1/docID instead of score

That's an interesting approach. I don't recall ever seeing this message
when it was originally posted. Sorry.

I had imagined instead adding this functionality to Hits.java. Having
a different Searcher implementation makes it possible for folks to use
MultiSearcher to combine results from an IndexSearcher and an
IndexOrderSearcher, which would not make sense. If the functionality
instead resides in Hits.java, then it could not be misused in this way.

So the way I was going to do it was to add something to Hits.java like:
public static final int ORDER_BY_SCORE = 0;
public static final int ORDER_BY_DOC_NUM = 1;
public void setHitOrdering(int order);

If ORDER_BY_SCORE is specified then Hits would work as it does now. This
would be the default. But when ORDER_BY_DOC_NUM is specified then
Hits.java would use a HitCollector to implement this ordering.
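
A minimal, self-contained sketch of the ordering switch described above. It does not use the real Lucene classes: OrderedHits, Hit, collect() and docs() are hypothetical names introduced only for illustration, and the constant values 0 and 1 are assumptions. The idea is that matches are collected once and the selected ordering decides how they are returned.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for Hits; not the real Lucene class. */
final class OrderedHits {
    // Distinct values (assumed) so the two modes can be told apart.
    static final int ORDER_BY_SCORE = 0;
    static final int ORDER_BY_DOC_NUM = 1;

    /** One collected match: document number plus its score. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final List<Hit> collected = new ArrayList<>();
    private int order = ORDER_BY_SCORE; // score ordering stays the default

    void setHitOrdering(int order) {
        if (order != ORDER_BY_SCORE && order != ORDER_BY_DOC_NUM) {
            throw new IllegalArgumentException("unknown ordering: " + order);
        }
        this.order = order;
    }

    /** Plays the role of a hit-collector callback: called once per match. */
    void collect(int doc, float score) {
        collected.add(new Hit(doc, score));
    }

    /** Returns the document numbers in the selected order. */
    List<Integer> docs() {
        List<Hit> sorted = new ArrayList<>(collected);
        if (order == ORDER_BY_DOC_NUM) {
            // Index order: ascending document number, score ignored.
            sorted.sort(Comparator.comparingInt((Hit h) -> h.doc));
        } else {
            // Highest score first; ties broken by document number.
            Comparator<Hit> byScoreDesc = (a, b) -> Float.compare(b.score, a.score);
            sorted.sort(byScoreDesc.thenComparingInt(h -> h.doc));
        }
        List<Integer> result = new ArrayList<>();
        for (Hit h : sorted) {
            result.add(h.doc);
        }
        return result;
    }
}
```

With hits collected for documents 7 (score 0.9), 3 (score 0.2) and 5 (score 0.5), the default ordering would return 7, 5, 3, while setHitOrdering(OrderedHits.ORDER_BY_DOC_NUM) would return 3, 5, 7.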

> 2. CJK support:
> 2.1 Single-character ("sigram") based: no word segmentation, each character is one token; modified from StandardTokenizer.java
> http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=330905
> CJKTokenizer for Asia language(Chinese Japanese Korean) Word Segment
> http://nagoya.apache.org/eyebrowse/ReadMsg?listName=lucene-dev@jakarta.apache.org&msgId=450266
> StandardTokenizer with sigram based CJK Support
>
> 2.2 Bigram-based word segmentation: modified from SimpleTokenizer to CJKTokenizer.java
> http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg01220.html

I think it would be great to have some support for Asian languages built
into Lucene. Which of these approaches do you think is best? I like
the idea of a StandardTokenizer or SimpleTokenizer that automatically
provides this via bigrams. What do others think?

Doug



Re: fixed url and How to contribute code to lucene sandbox? [ In reply to ]
I don't know any Asian languages, but from earlier experiments I
remember that bigram tokenization can sometimes hurt matching, e.g.:

w1w2w3 == tokenized as ==> w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would
miss a search for w2 alone. Tokenizing as w1 w2 w3 would work better.
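
The concern above can be illustrated with a small sketch. It is not Lucene code: NGramDemo, ngrams() and matches() are hypothetical helpers, each letter of "ABC" stands in for one CJK character (w1, w2, w3), and matching is simplified to an exact token lookup.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper comparing single-character and bigram tokenization. */
final class NGramDemo {
    /**
     * Splits text into overlapping tokens of n characters each.
     * Works on Java chars, so it assumes characters in the Basic
     * Multilingual Plane (no surrogate pairs).
     */
    static List<String> ngrams(String text, int n) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            tokens.add(text.substring(i, i + n));
        }
        return tokens;
    }

    /** Simplified search: true only if the query is itself an indexed token. */
    static boolean matches(List<String> indexedTokens, String query) {
        return indexedTokens.contains(query);
    }
}
```

Here ngrams("ABC", 2) yields AB and BC, so a lookup for B alone finds nothing, whereas ngrams("ABC", 1) yields A, B and C and the same lookup succeeds.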

--- Doug Cutting <cutting@lucene.com> wrote:
> I think it would be great to have some support for Asian languages
> built into Lucene. Which of these approaches do you think is best?
> I like the idea of a StandardTokenizer or SimpleTokenizer that
> automatically provides this via bigrams. What do others think?
>
> Doug


=====
__________________________________
alex@lissus.com -- http://www.lissus.com
