Mailing List Archive: JSP Parser class wanted

JSP Parser class wanted

Feb 20, 2002, 5:15 PM

Post #1 of 6 (1293 views)

Please, does anyone have a JSPParser class that parses JSPs?

I hacked the HTMLParser class that comes in the Lucene demo and made it parse and index JSPs. But when I ran a search, JSP tags such as

<%pageContext.setAttribute( "req", request );%>
<%@ page import="com.propelnewmedia.tags.BreadcrumbTrailer"%>

and so on were included in the summary.
Then I figured out a way to get the JSP tags out of the summary (and, I think, out of the index as well).

What I did was designate JSP tags (anything starting with <% and ending with %>) as a third comment type in the "void CommentTag() :", "TOKEN :", and "<WithinCommentN> TOKEN :" sections of HTMLParser.jj.

I just copied and pasted the relevant code for Comment2 and mimicked that for my new comment type. I then recompiled HTMLParser.jj using javacc.
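
For readers who would rather not touch the grammar, the same idea (drop everything between <% and %>) can be sketched as a pre-processing step in plain Java. This is not the JavaCC change described above but a swapped-in regex approach; the class name JspTagStripper and its strip() method are hypothetical and not part of Lucene.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Lucene): removes JSP constructs before HTML parsing. */
class JspTagStripper {
    // Matches from "<%" to the nearest "%>", including across line breaks.
    private static final Pattern JSP_TAG = Pattern.compile("<%.*?%>", Pattern.DOTALL);

    /** Returns the page source with every <% ... %> span removed. */
    static String strip(String jspSource) {
        return JSP_TAG.matcher(jspSource).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<%@ page import=\"java.util.List\"%>"
                    + "<p>Hello</p>"
                    + "<%pageContext.setAttribute( \"req\", request );%>";
        System.out.println(strip(page));
    }
}
```

Like the comment-type approach, this treats the first %> it meets as the end of the tag, so a scriptlet containing "%>" inside a Java string literal would be cut short. Custom tag library elements are not handled at all.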

I'm still not out of the woods though. I still need to know how to make Lucene not include list element values, etc. in the search hits. For instance, if a keyword happens to be in a <select> list, it gets counted as a hit.

Any suggestions (or preferably, working code) would be massively appreciated! Thanks in advance.
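
One possible way to keep drop-down list values out of the hits is to remove whole <select> blocks from the page text before it reaches the indexer. The following is only a sketch under that assumption; SelectListStripper is a hypothetical helper, not something shipped with the Lucene demo, and a regex is a blunt tool for HTML.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: drops <select>...</select> blocks before indexing. */
class SelectListStripper {
    // (?is): case-insensitive, and '.' also matches line terminators.
    private static final Pattern SELECT_BLOCK =
            Pattern.compile("(?is)<select\\b.*?</select\\s*>");

    /** Replaces each select block with a single space so neighbouring words stay separated. */
    static String strip(String html) {
        return SELECT_BLOCK.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        String html = "<p>a</p><SELECT name=\"x\"><option>lucene</option></SELECT><p>b</p>";
        System.out.println(strip(html));
    }
}
```

An unclosed <select> would be left in place by this pattern, so badly formed pages still need a real parser.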

RE: JSP Parser class wanted [ In reply to ]

Stephan.Strittmatter.ext at kst

Feb 21, 2002, 12:30 AM

Post #2 of 6 (1268 views)


I am also interested in a JSPParser.
Perhaps it could be added to the demo package?

Greetings,

Stephan


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: JSP Parser class wanted [ In reply to ]

will at javafreelancer

Feb 23, 2002, 11:27 AM

Post #3 of 6 (1273 views)


i have had some success in solving my problem. mind you, it is a hack; a quick fix. it may or may not work for everyone. also the jsp pages i am indexing/searching have very little dynamically generated content. they are mostly static.

my problem was that too much gobbledygook was turning up in the summary. i only wanted content from the main body of the document to appear there. since all of my relevant body content is inside <p> tags, my approach was to have the parser add only text that is inside <p> tags to the summary. to do that, in the HTMLParser.jj file that comes with the lucene demo, i added the following line amongst the other variable declarations:

...

boolean inPTag = false;

...

then i changed the addText() method to:
void addText(String text) throws IOException {
  if (inScript)
    return;
  if (inTitle)
    title.append(text);
  else {
    if (!inPTag)               // I added this line...
      return;                  // ... and this line (it also skips the pipeOut.write below)
    addToSummary(text);
    if (!titleComplete && !title.equals("")) {  // finished title
      synchronized(this) {
        titleComplete = true;  // tell waiting threads
        notifyAll();
      } // end synchronized block
    } // if
  } // end else

  length += text.length();
  pipeOut.write(text);

  afterSpace = false;
}

then i changed the Tag() method to:
void Tag() throws IOException :
{
  Token t1, t2;
  boolean inImg = false;
}
{
  t1=<TagName> {
    inTitle = t1.image.equalsIgnoreCase("<title"); // keep track if in <TITLE>
    inImg = t1.image.equalsIgnoreCase("<img");     // keep track if in <IMG>
    if (inScript) {                                // keep track if in <SCRIPT>
      inScript = !t1.image.equalsIgnoreCase("</script");
    } else {
      inScript = t1.image.equalsIgnoreCase("<script");
    }
    // i added the following if conditional:
    if (inPTag) {                                  // keep track if in p tag
      inPTag = !t1.image.equalsIgnoreCase("</p");
    } else {
      inPTag = t1.image.equalsIgnoreCase("<p");
    }
  }
  (t1=<ArgName>
   (<ArgEquals>
    (t2=ArgValue()                                 // save ALT text in IMG tag
     {
       // I commented the next two lines out because I didn't want the contents
       // of alt tags showing up in the summary:
       // if (inImg && t1.image.equalsIgnoreCase("alt") && t2 != null)
       //   addText("[" + t2.image + "]");
     }
    )?
   )?
  )*
  <TagEnd>
}

all of the above is in addition to the other changes i mentioned in my earlier posts.
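
The flag logic in the Tag() change can be tried out without JavaCC. The sketch below is a stand-alone approximation, not the generated parser: ParagraphTextCollector and its hand-rolled tag scanning are hypothetical, and it only mirrors the inPTag toggle (turn on at <p, turn off at </p, keep text only while the flag is set).

```java
import java.util.Locale;

/** Hypothetical stand-alone illustration of the inPTag toggle; not the JavaCC parser. */
class ParagraphTextCollector {

    /** Returns only the text that appears between <p ...> and </p>. */
    static String collect(String html) {
        StringBuilder summary = new StringBuilder();
        boolean inPTag = false;
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) == '<') {
                int end = html.indexOf('>', i);
                if (end < 0) break;                 // unterminated tag: stop scanning
                String name = tagName(html.substring(i + 1, end));
                if (inPTag) {                       // same toggle as in Tag()
                    inPTag = !name.equals("/p");
                } else {
                    inPTag = name.equals("p");
                }
                i = end + 1;
            } else {
                int next = html.indexOf('<', i);
                if (next < 0) next = html.length();
                if (inPTag) summary.append(html, i, next);  // keep text only inside <p>
                i = next;
            }
        }
        return summary.toString();
    }

    // Lower-cased tag name: "p" for "P class='x'", "/p" for "/P".
    private static String tagName(String inside) {
        String trimmed = inside.trim();
        int cut = 0;
        while (cut < trimmed.length() && !Character.isWhitespace(trimmed.charAt(cut))) cut++;
        return trimmed.substring(0, cut).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(collect("<h1>Menu</h1><p>Hello <b>world</b></p><div>footer</div>"));
    }
}
```

Tags nested inside a paragraph (such as <b>) leave the flag on, because only </p switches it off. A <p> that is never closed would therefore keep collecting to the end of the page, which is also true of the grammar change.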

Then I recompiled HTMLParser.jj with javacc, compiled the java files that javacc produced, packed those class files into a jar, and placed the jar on the classpath so that the lucene indexer could see the new parser.

hope this helps. if anyone has a better solution please post it here. as i said, it's a hack. but with my deadline, it is all i have time for. one day i would love to spend the time really learning javacc and lucene inside and out. then maybe i could build a proper parser. today is just not that day ;¬)

Re: JSP Parser class wanted [ In reply to ]

puffmail at darksleep

Feb 23, 2002, 6:25 PM

Post #4 of 6 (1267 views)


w i l l i a m__b o y d <will@javafreelancer.com> writes:

> i have had some success in solving my problem. mind you, it is a
> hack; a quick fix. it may or may not work for everyone. also the jsp
> pages i am indexing/searching have very little dynamically generated
> content. they are mostly static.

If they're mostly static, why not just code a little crawler to
request the pages via the web-server and parse the rendered HTML?
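
A minimal sketch of the fetching half of such a crawler, assuming the pages are reachable over plain HTTP and encoded as UTF-8 (both are assumptions, as is the class name RenderedPageFetcher). It only downloads one page; following links and handing the HTML to a parser are left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;

/** Hypothetical helper: downloads the HTML a server renders for a given URL. */
class RenderedPageFetcher {

    /** Reads the whole response body and decodes it as UTF-8 (an assumption). */
    static String fetch(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toString("UTF-8");
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java RenderedPageFetcher <url-of-jsp-page>");
            return;
        }
        // The server has already executed the JSP, so only rendered HTML comes back.
        System.out.println(fetch(URI.create(args[0]).toURL()));
    }
}
```

Because the server runs the JSP before responding, no <% ... %> tags reach the HTML parser, and the stock parser can be used unmodified on the result.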

Steven J. Owens
puff@darksleep.com


Re: JSP Parser class wanted [ In reply to ]

will at javafreelancer

Feb 24, 2002, 4:22 AM

Post #5 of 6 (1283 views)


> If they're mostly static, why not just code a little crawler to
> request the pages via the web-server and parse the rendered HTML?
>

right then. i've added that onto my list of things to do. immediately after
"meet project deadline" and "...learning javacc and lucene inside and
out..." ;¬) if anyone has such code they're willing to contribute i would
put it to good use.


Re: JSP Parser class wanted [ In reply to ]

chrisopler at free

Feb 24, 2002, 5:24 AM

Post #6 of 6 (1272 views)


Hi,

this is a great tool to retrieve and scrape html pages (rendered or not)...

http://www.research.compaq.com/SRC/WebL/

:-)

Chris Opler


--
=======================
http://www.openwine.org
