Mailing List Archive: idea: lucene doclet for indexing javadoc better

One hassle/problem is that if a search engine (say...Lucene...)
is indexing javadoc (html generated from *.java),
it has to wade thru all kinds of junk to get at what's interesting.
And if you try to summarize the document by taking the
1st "n" words (after ignoring tags) you get something like
"Overview Package Class Use Deprecated Index PREV CLASS NEXT CLASS
FRAMES NO FRAMES SUMMARY: INNER | FIELD | CONSTR | METHOD DETAIL: FIELD
| CONSTR".
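
To make the problem concrete, here is a minimal sketch of the naive "first n words" summary. The class name, the regex-based tag stripping, and the sample page are illustrative assumptions, not anything a particular spider is known to do.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class NaiveSummary {
    /** Strips tags with a crude regex and returns the first n whitespace-separated words. */
    static String firstWords(String html, int n) {
        String text = html.replaceAll("<[^>]*>", " ").trim();
        if (text.isEmpty()) {
            return "";
        }
        return Arrays.stream(text.split("\\s+"))
                .limit(n)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Hypothetical fragment shaped like a javadoc page: navigation first, content later.
        String page = "<html><body>"
                + "<a href=\"overview-summary.html\">Overview</a> "
                + "<a href=\"package-summary.html\">Package</a> "
                + "<b>Class</b> <a href=\"class-use/Foo.html\">Use</a> "
                + "<h2>Class Foo</h2><p>Parses widgets from a stream.</p>"
                + "</body></html>";
        // The summary is made up entirely of navigation labels.
        System.out.println(firstWords(page, 4)); // prints "Overview Package Class Use"
    }
}
```

The actual description ("Parses widgets from a stream.") never makes it into the summary unless n is large enough to get past the navigation bar.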

I've done a proof of concept of using the javadoc doclet api and having
an indexer keyed off of that to create a javadoc index, instead of
spidering the output.
It's very preliminary.
I was just wondering if this has been done before, or been discussed
before.
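
A rough sketch of what such a doclet could look like follows. It is not the proof of concept mentioned above: it assumes the current jdk.javadoc.doclet API (the successor to the older com.sun.javadoc doclet API), the class name IndexingDoclet is made up, and it only collects class names and comment text into a map where a real indexer would hand them to Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.lang.model.SourceVersion;
import javax.lang.model.element.TypeElement;
import javax.lang.model.util.ElementFilter;
import jdk.javadoc.doclet.Doclet;
import jdk.javadoc.doclet.DocletEnvironment;
import jdk.javadoc.doclet.Reporter;

public class IndexingDoclet implements Doclet {
    /** Collected entries: qualified class name -> doc comment text (stand-in for an index). */
    static final Map<String, String> ENTRIES = new LinkedHashMap<>();

    /** Collapses whitespace in a raw doc comment; a missing comment becomes "". */
    static String clean(String docComment) {
        if (docComment == null) {
            return "";
        }
        return docComment.replaceAll("\\s+", " ").trim();
    }

    @Override public void init(Locale locale, Reporter reporter) { }

    @Override public String getName() { return "IndexingDoclet"; }

    @Override public Set<? extends Option> getSupportedOptions() { return Set.of(); }

    @Override public SourceVersion getSupportedSourceVersion() { return SourceVersion.latest(); }

    @Override
    public boolean run(DocletEnvironment env) {
        // Visit every class and interface that javadoc was asked to document.
        for (TypeElement type : ElementFilter.typesIn(env.getIncludedElements())) {
            String comment = env.getElementUtils().getDocComment(type);
            ENTRIES.put(type.getQualifiedName().toString(), clean(comment));
        }
        ENTRIES.forEach((name, text) -> System.out.println(name + " -> " + text));
        return true;
    }
}
```

It would be run through the javadoc tool with the -doclet and -docletpath options, so the indexer sees the structured source information directly and never touches the navigation markup of the generated HTML.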

I guess the general principle is that it's always better to index the
original source of the info and not the generated HTML. This is why Lucene
is much nicer than other engines (say, htdig), as the other engines seem to
only be able to spider.

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

Hi,

I have an app running on my box that does exactly this. Besides a bog
standard jsp UI, it also has a funky IE toolbar (like the google bar) to
perform the searches, plus it serves up the java source if you click thru
the results page.

The indexer is indeed run as a doclet via an Ant script. The index is then
packaged up into a WAR and deployed to Tomcat 4. A WARdirectory explodes the
index into either RAM or the file system, depending on the deployment
descriptor, since not all appservers expand WARs (WebLogic for one).

I'm indexing class and method names, modifiers (public, abstract etc),
parameters, imports and some other bits as well as free text of the source
code.
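
As an illustration of the kind of per-class fields listed above, here is a small sketch. It uses reflection purely as a stand-in for the doclet API so that it runs on its own, which also means it cannot see imports or source text. Apart from "params", which is mentioned later in this message, the field names and the Sample class are assumptions.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class FieldSketch {
    /** Small sample class to "index". */
    public abstract static class Sample {
        public abstract int count(String[] names);
    }

    /** Builds a field name -> values map for one class (hypothetical field names). */
    static Map<String, Set<String>> fieldsFor(Class<?> type) {
        Map<String, Set<String>> fields = new LinkedHashMap<>();
        fields.put("class", new TreeSet<>(Set.of(type.getSimpleName())));
        fields.put("modifiers", new TreeSet<>());
        fields.put("method", new TreeSet<>());
        fields.put("params", new TreeSet<>());
        // Class-level modifiers such as public, abstract, static.
        for (String m : Modifier.toString(type.getModifiers()).split(" ")) {
            if (!m.isEmpty()) {
                fields.get("modifiers").add(m);
            }
        }
        // Method names and the simple names of their parameter types.
        for (Method method : type.getDeclaredMethods()) {
            fields.get("method").add(method.getName());
            for (Class<?> p : method.getParameterTypes()) {
                fields.get("params").add(p.getSimpleName());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        fieldsFor(Sample.class).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

In a Lucene index each of these entries would become a field on the document for the class, so a search can be restricted to, say, method names or parameter types.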

Cool eh? Since I'm not much of a COM programmer, the IE bar is taking a bit
longer than I wanted (I've also lost my MSDN library CD, which doesn't help
:-( ), but if anyone's interested in how things are at the moment, let me
know. Once I've put some polish on it, I was going to perhaps try to write a
JavaWorld article and of course donate the code. But for now - it works for
me :-)

I could do with a hand writing a decent QueryParser (JavaCC is not something
I want to dig into) as the standard one has its limitations, especially when
you want to search for arrays (as in params:String[]).
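
The array case is presumably awkward because square brackets delimit range queries in Lucene's query syntax. One way around it without JavaCC is a small hand-written splitter that treats everything after the first colon as literal text. The sketch below is only that: FieldTermParser, FieldTerm and the "contents" default field are made-up names, and it handles a single field:text pair, nothing more.

```java
public class FieldTermParser {
    /** A parsed field/text pair (hypothetical helper type). */
    static final class FieldTerm {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    /** Splits "field:text" on the first colon; the text is kept verbatim, brackets included. */
    static FieldTerm parse(String query, String defaultField) {
        String q = query.trim();
        int colon = q.indexOf(':');
        if (colon <= 0) {
            // No field prefix: search the default field.
            return new FieldTerm(defaultField, q);
        }
        return new FieldTerm(q.substring(0, colon), q.substring(colon + 1));
    }

    public static void main(String[] args) {
        FieldTerm t = parse("params:String[]", "contents");
        System.out.println(t.field + " / " + t.text); // prints "params / String[]"
    }
}
```

The resulting pair could then be wrapped in a TermQuery, provided the field was indexed so that "String[]" survives as a single term.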

Hope this helps,

Les

-----Original Message-----
From: Spencer, Dave
To: lucene-dev@jakarta.apache.org
Sent: 13/03/02 02:28
Subject: idea: lucene doclet for indexing javadoc better
