Some questions

Apr 19, 2002, 4:57 AM

Post #1 of 4 (547 views)

Hi all,

My name is Laura and I'm a new member of this list. I'm a long-time
user of Tomcat and also a member of the Tomcat user list.
Yesterday, looking at the Jakarta menu, I saw Lucene and asked myself: "What is
this?"
After reading the Lucene home page I understood that Lucene is a very
interesting and big project. I'm looking for a product like Lucene!
Because I'm new to this list and a new user of Lucene, I have some
questions that I think you can answer easily.
I saw that Lucene creates its index on the filesystem; I think
this is a problem for a production environment. I usually use a
database, for example Oracle.
Is it possible to integrate Lucene with Oracle or some other database (MySQL)?

I think there isn't an Italian analyzer, is there?
How can I write one?

The last question: I would like my search engine to be able to spider
web sites. Is it possible to spider URLs?
For example, given a page, can I spider that page, then extract its
links and finally spider those links as well?
How can I do this?

Well, I hope to be able to use Lucene.

Thanks for your help

Laura

Re: Some questions

karl at gan

Apr 19, 2002, 6:00 AM

Post #2 of 4 (552 views)

> I saw that Lucene creates its index on the filesystem; I think
> this is a problem for a production environment. I usually use a
> database, for example Oracle.
> Is it possible to integrate Lucene with Oracle or some other database (MySQL)?

You can store the index in BLOB fields, but that's about it so far.

> I think there isn't an Italian analyzer, is there?
> How can I write one?

The implementation for Lucene is pretty straightforward; take a look at the
contributed GermanAnalyzer. Inside the implementing class you implement
stop words, language-dependent case switching, etc.

The English and German analyzers also perform stemming
(making "computers" match "computer" and "histories" match "history", etc.).
This requires a program that understands the plurals and singulars
of Italian. A good start might be to look at http://snowball.sourceforge.net,
as they already have an Italian stemmer.
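
To illustrate only the stop-word and lower-casing part (no stemming), here is
a minimal plain-Java sketch that does not use the Lucene API at all. The class
name, the analyze method and the short stop-word list are assumptions made up
for this example; a real Italian analyzer would need a much fuller list and
would plug into Lucene's Analyzer and TokenStream classes instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class ItalianStopWordSketch {

    // Illustrative only: a handful of common Italian words, not a complete list.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "il", "lo", "la", "i", "gli", "le", "un", "una", "di", "a", "da",
        "in", "con", "su", "per", "e", "o", "che", "non"));

    /** Splits text on non-letter characters, lower-cases it and drops stop words. */
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty leading element
            }
            String token = raw.toLowerCase(Locale.ITALIAN);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [motore, ricerca, trova, pagina]
        System.out.println(analyze("Il motore di ricerca non trova la pagina"));
    }
}
```

Stemming would be a further step applied to each surviving token, which is
where an Italian stemmer such as the Snowball one would come in.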

> The last question: I would like my search engine to be able to spider
> web sites. Is it possible to spider URLs?
> For example, given a page, can I spider that page, then extract its
> links and finally spider those links as well?
> How can I do this?

As Lucene works only with the text content of a document, you will have to
create a spider that retrieves a URL, extracts the text and feeds it to Lucene,
then extracts the links and processes each of these links in the same manner.
For this you will need an HTML parser.
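
The loop described above could be sketched roughly as follows. This is only
an outline under several assumptions: the class and method names are invented
for the example, the fetch function that returns the HTML for a URL is
hypothetical and left to the caller, regular expressions stand in for a real
HTML parser, relative links are not resolved, and the step that would hand the
text to Lucene is just a comment.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
class SpiderSketch {

    // Naive patterns; a real HTML parser copes with malformed markup far better.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Returns the href values of the anchor tags found in the page. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Strips tags and collapses whitespace, leaving the text to be indexed. */
    public static String extractText(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    /**
     * Breadth-first crawl starting at startUrl. The fetch function returns
     * the HTML for a URL, or null if the page cannot be retrieved.
     * Returns a map of URL to extracted text, in visiting order.
     */
    public static Map<String, String> crawl(String startUrl,
                                            Function<String, String> fetch,
                                            int maxPages) {
        Map<String, String> texts = new LinkedHashMap<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty() && texts.size() < maxPages) {
            String url = queue.poll();
            String html = fetch.apply(url);
            if (html == null) {
                continue; // unreachable page, skip it
            }
            // This is the point where the text would be fed to Lucene.
            texts.put(url, extractText(html));
            for (String link : extractLinks(html)) {
                if (seen.add(link)) { // add() is false for already-seen links
                    queue.add(link);
                }
            }
        }
        return texts;
    }
}
```

The set of seen URLs keeps the spider from looping when pages link back to
each other, and maxPages puts a ceiling on how far the crawl can spread.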

happy hacking!

mvh karl øie

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: Some questions

ferrante at unige

Apr 19, 2002, 7:24 AM

Post #3 of 4 (555 views)

> ... snip ...
> I think there isn't an Italian analyzer, is there?
> How can I write one?
>
> ... snip ...
I'm interested in this contribution too. Can I help?

--------------------------------------------------
Marco Ferrante (ferrante@unige.it)
CSITA (Centro Servizi Informatici e Telematici d'Ateneo)
Università degli Studi di Genova - Italy
Via Brigata Salerno, ponte - 16147 Genova
tel (+39) 0103532621 (interno tel. 2621)
--------------------------------------------------


Re: Some questions - Analyzer

rosenm at sirma

Apr 19, 2002, 7:44 AM

Post #4 of 4 (540 views)

See the official Lucene FAQ:

Question 17:

17. Can I write my own custom analyzer?
Sure. An analyzer is basically a factory object that creates a TokenStream
object used to tokenize the text. A typical analyzer implementation creates
the TokenStream by creating a standard tokenizer and combining it with a
series of filters, each performing a different processing step on the token
stream.

Here is a sample customized analyzer contributed by Joanne Proton:

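import java.io.Reader;
import java.util.Hashtable;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyAnalyzer extends Analyzer
{

    /*
     * An array containing some common words that
     * are not usually useful for searching.
     */
    private static final String[] STOP_WORDS =
    {
        "a"     , "and"  , "are"   , "as"    ,
        "at"    , "be"   , "but"   , "by"    ,
        "for"   , "if"   , "in"    , "into"  ,
        "is"    , "it"   , "no"    , "not"   ,
        "of"    , "on"   , "or"    , "s"     ,
        "such"  , "t"    , "that"  , "the"   ,
        "their" , "then" , "there" , "these" ,
        "they"  , "this" , "to"    , "was"   ,
        "will"  ,
        "with"
    };

    /*
     * Stop table
     */
    private static final Hashtable stopTable =
        StopFilter.makeStopTable(STOP_WORDS);

    /*
     * Create a token stream for this analyzer.
     */
    public final TokenStream tokenStream(final Reader reader)
    {
        // Tokenize, then wrap the stream in one filter after another.
        TokenStream result = new StandardTokenizer(reader);

        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopTable);
        // Porter stemming targets English text.
        result = new PorterStemFilter(result);

        return result;
    }
}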
