Mailing List Archive: cvs commit: jakarta-lucene/xdocs/stylesheets project.xml

acoliver 02/01/26 07:01:32

Modified: . BUILD.txt build.properties build.xml
src/test/org/apache/lucene IndexTest.java
src/test/org/apache/lucene/index DocTest.java
xdocs/stylesheets project.xml
Added: src/demo Search.html Search.jhtml
src/demo/org/apache/lucene/demo DeleteFiles.java
FileDocument.java HTMLDocument.java IndexFiles.java
IndexHTML.java SearchFiles.java
src/demo/org/apache/lucene/demo/html Entities.java
HTMLParser.jj ParserThread.java Test.java
src/jsp README.txt configuration.jsp footer.jsp header.jsp
index.jsp results.jsp
src/jsp/WEB-INF web.xml
xdocs demo.xml demo2.xml demo3.xml demo4.xml
Removed: src/demo/org/apache/lucene DeleteFiles.java
FileDocument.java HTMLDocument.java IndexFiles.java
IndexHTML.java Search.html Search.jhtml
SearchFiles.java
src/demo/org/apache/lucene/HTMLParser .cvsignore
Entities.java HTMLParser.jj ParserThread.java
Test.java
Log:
Reviewed by: Doug Cutting / Lucene Community
new demo build target
added getting started guide
modified tests
moved demo to demo subpackage
added war demo

Revision Changes Path
1.2 +3 -3 jakarta-lucene/BUILD.txt

Index: BUILD.txt
===================================================================
RCS file: /home/cvs/jakarta-lucene/BUILD.txt,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- BUILD.txt 4 Nov 2001 17:23:04 -0000 1.1
+++ BUILD.txt 26 Jan 2002 15:01:31 -0000 1.2
@@ -1,6 +1,6 @@
Lucene Build Instructions

-$Id: BUILD.txt,v 1.1 2001/11/04 17:23:04 cutting Exp $
+$Id: BUILD.txt,v 1.2 2002/01/26 15:01:31 acoliver Exp $

Basic steps:
0) Install JDK 1.3, Ant 1.4, and the Ant 1.4 optional.jar.
@@ -52,14 +52,14 @@
Download either a zip or a tarred/gzipped version of the archive, and
uncompress it into a directory of your choice.

-Step 3) Connect to the top-level of your Lucene installation
+Step 2) Connect to the top-level of your Lucene installation

Lucene's top-level directory contains the build.properties and
build.xml files. You don't need to change any of the settings in
these files, but you do need to run ant from this location so it knows
where to find them.

-Step 4) Run ant.
+Step 3) Run ant.

Assuming you have ant in your PATH and have set ANT_HOME to the
location of your ant installation, typing "ant" at the shell prompt

1.16 +3 -0 jakarta-lucene/build.properties

Index: build.properties
===================================================================
RCS file: /home/cvs/jakarta-lucene/build.properties,v
retrieving revision 1.15
retrieving revision 1.16
diff -u -r1.15 -r1.16
--- build.properties 25 Dec 2001 19:34:04 -0000 1.15
+++ build.properties 26 Jan 2002 15:01:31 -0000 1.16
@@ -14,6 +14,7 @@

src.dir = ./src/java
demo.src = ./src/demo
+demo.jsp = ./src/jsp
test.src = ./src/test
docs.dir = ./docs
lib.dir = ./lib
@@ -37,6 +38,8 @@
build.demo = ${build.dir}/demo
build.demo.src = ${build.demo}/src
build.demo.classes = ${build.demo}/classes
+build.demo.name = ${name}-demos-${version}
+build.war.name = luceneweb

build.test = ${build.dir}/test
build.test.src = ${build.test}/src

1.18 +45 -3 jakarta-lucene/build.xml

Index: build.xml
===================================================================
RCS file: /home/cvs/jakarta-lucene/build.xml,v
retrieving revision 1.17
retrieving revision 1.18
diff -u -r1.17 -r1.18
--- build.xml 19 Nov 2001 01:19:23 -0000 1.17
+++ build.xml 26 Jan 2002 15:01:31 -0000 1.18
@@ -121,6 +121,45 @@
/>
</target>

+ <target name="jardemo" depends="compile,demo" if="javacc.present">
+ <jar
+ jarfile="${build.demo}/${build.demo.name}.jar"
+ basedir="${build.demo.classes}"
+ excludes="**/*.java"
+ />
+ </target>
+
+ <target name="wardemo" depends="compile,demo,jar,jardemo" if="javacc.present">
+ <mkdir dir="${build.demo}/${build.war.name}"/>
+ <mkdir dir="${build.demo}/${build.war.name}/WEB-INF"/>
+ <mkdir dir="${build.demo}/${build.war.name}/WEB-INF/lib"/>
+
+ <copy todir="${build.demo}/${build.war.name}">
+ <fileset dir="${demo.jsp}">
+ <include name="**/*.jsp"/>
+ <include name="**/*.xml"/>
+ </fileset>
+ </copy>
+
+ <copy todir="${build.demo}/${build.war.name}/WEB-INF/lib">
+ <fileset dir="${build.dir}">
+ <include name="*.jar"/>
+ </fileset>
+ </copy>
+
+ <copy todir="${build.demo}/${build.war.name}/WEB-INF/lib">
+ <fileset dir="${build.demo}">
+ <include name="*.jar"/>
+ </fileset>
+ </copy>
+
+ <jar
+ jarfile="${build.demo}/${build.war.name}.war"
+ basedir="${build.demo}/${build.war.name}"
+ excludes="**/*.java"
+ />
+ </target>
+



@@ -163,9 +202,9 @@
</copy>

<javacc
- target="${build.demo.src}/org/apache/lucene/HTMLParser/HTMLParser.jj"
+ target="${build.demo.src}/org/apache/lucene/demo/html/HTMLParser.jj"
javacchome="${javacc.zip.dir}"
- outputdirectory="${build.demo.src}/org/apache/lucene/HTMLParser"
+ outputdirectory="${build.demo.src}/org/apache/lucene/demo/html"
/>

<mkdir dir="${build.demo.classes}"/>
@@ -321,7 +360,7 @@



- <target name="package" depends="jar, javadocs, demo">
+ <target name="package" depends="jar, javadocs, demo, wardemo">
<mkdir dir="${dist.dir}"/>
<mkdir dir="${dist.dir}/docs"/>
<mkdir dir="${dist.dir}/docs/api"/>
@@ -339,6 +378,7 @@
<fileset dir="${build.demo.classes}"/>
</copy>

+
<copy todir="${dist.dir}/src">
<fileset dir="src"/>
</copy>
@@ -353,6 +393,8 @@
</fileset>
</copy>
<copy file="${build.dir}/${final.name}.jar" todir="${dist.dir}"/>
+ <copy file="${build.demo}/${build.demo.name}.jar" todir="${dist.dir}"/>
+ <copy file="${build.demo}/${build.war.name}.war" todir="${dist.dir}"/>
</target>



1.1 jakarta-lucene/src/demo/Search.html

Index: Search.html
===================================================================
<HTML>
<HEAD>
<TITLE>Lucene Search Demo</TITLE>
</HEAD>
<BODY>

<CENTER>
<H1>
Lucene Search Demo</H1>

<form name=search action=http://localhost:8080/Search.jhtml method=get>
<input name=query size=44> <input type=submit value=Search></form>

</CENTER>

</BODY>
</HTML>

1.1 jakarta-lucene/src/demo/Search.jhtml

Index: Search.jhtml
===================================================================
<HTML>



<java type=import>
javax.servlet.*
javax.servlet.http.*
java.io.*
org.apache.lucene.analysis.*
org.apache.lucene.document.*
org.apache.lucene.index.*
org.apache.lucene.search.*
org.apache.lucene.queryParser.*
org.apache.lucene.demo.*
org.apache.lucene.demo.html.Entities
</java>

<java>
// get index from request
String indexName = request.getParameter("index");
if (indexName == null) // default to "index"
indexName = "index";
Searcher searcher = // make searcher
new IndexSearcher(getReader(indexName));

// get query from request
String queryString = request.getParameter("query");
if (queryString == null)
throw new ServletException("no query specified");

int start = 0; // first hit to display
String startString = request.getParameter("start");
if (startString != null)
start = Integer.parseInt(startString);

int hitsPerPage = 10; // number of hits to display
String hitsString = request.getParameter("hitsPerPage");
if (hitsString != null)
hitsPerPage = Integer.parseInt(hitsString);

boolean showSummaries = true; // show summaries?
if ("false".equals(request.getParameter("showSummaries")))
showSummaries = false;

Query query = null;
try { // parse query
query = QueryParser.parse(queryString, "contents", analyzer);
} catch (ParseException e) { // error parsing query
</java>
<HEAD><TITLE>Error Parsing Query</TITLE></HEAD><BODY>
<p>While parsing `queryString`: `e.getMessage()`
<java>
return;
}

String servletPath = request.getRequestURI(); // getServletPath should work
int j = servletPath.indexOf('?'); // here but doesn't, so we
if (j != -1) // remove query by hand...
servletPath = servletPath.substring(0, j);

</java>

<head><title>Lucene Search Results</title></head><body>

<center>
<form name=search action=`servletPath` method=get>
<input name=query size=44 value='`queryString`'>
<input type=hidden name=index value="`indexName`">
<input type=hidden name=hitsPerPage value=`hitsPerPage`>
<input type=hidden name=showSummaries value=`showSummaries`>
<input type=submit value=Search>
</form>
</center>
<java>
Hits hits = searcher.search(query); // perform query
int end = Math.min(hits.length(), start + hitsPerPage);
</java>

<p>Hits <b><java type=print>start+1</java>-<java type=print>end</java></b>
(out of <java type=print>hits.length()</java> total matching documents):

<ul>
<java>
for (int i = start; i < end; i++) { // display the hits
Document doc = hits.doc(i);
String title = doc.get("title");
if (title.equals("")) // use url for docs w/o title
title = doc.get("url");
</java>
<p><b><java type=print>(int)(hits.score(i) * 100.0f)</java>%
<a href="`doc.get("url")`">
<java type=print>Entities.encode(title)</java>
</b></a>
<java>
if (showSummaries) { // maybe show summary
</java>
<ul><i>Summary</i>:
<java type=print>Entities.encode(doc.get("summary"))</java>
</ul>
<java>
}
}
</java>
</ul>

<java>
if (end < hits.length()) { // insert next page button
</java>
<center>
<form name=search action=`servletPath` method=get>
<input type=hidden name=query value='`queryString`'>
<input type=hidden name=start value=`end`>
<input type=hidden name=index value="`indexName`">
<input type=hidden name=hitsPerPage value=`hitsPerPage`>
<input type=hidden name=showSummaries value=`showSummaries`>
<input type=submit value=Next>
</form>
</center>
<java>
}
</java>

</body>

<java type=class>

Analyzer analyzer = new StopAnalyzer(); // used to tokenize queries

/** Keep a cache of open IndexReaders, so that an index does not have to be
opened for each query. The cache re-opens an index when it has changed
so that additions and deletions are visible ASAP. */

static Hashtable indexCache = new Hashtable(); // name->CachedIndex

class CachedIndex { // an entry in the cache
IndexReader reader; // an open reader
long modified; // reader's modified date

CachedIndex(String name) throws IOException {
modified = IndexReader.lastModified(name); // get modified date
reader = IndexReader.open(name); // open reader
}
}

IndexReader getReader(String name) throws ServletException {
CachedIndex index = // look in cache
(CachedIndex)indexCache.get(name);

try {
if (index != null && // check up-to-date
(index.modified == IndexReader.lastModified(name)))
return index.reader; // cache hit
else {
index = new CachedIndex(name); // cache miss
}
} catch (IOException e) {
StringWriter writer = new StringWriter();
PrintWriter pw = new PrintWriter(writer);
throw new ServletException("Could not open index " + name + ": " +
e.getClass().getName() + "--" +
e.getMessage());
}

indexCache.put(name, index); // add to cache
return index.reader;
}
</java>
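
The reader cache at the end of Search.jhtml can be illustrated without Lucene or a servlet container. The following is only a sketch of the same idea: keep one opened value per name and re-open it when a version stamp changes. `VersionedCache`, `versionOf` and `opener` are hypothetical names introduced here; they stand in for `indexCache`, `IndexReader.lastModified(name)` and `IndexReader.open(name)` and are not part of the commit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.ToLongFunction;

// Sketch only: caches one value per name and re-opens it when its version changes.
class VersionedCache<V> {
    // One cache entry: the opened value plus the version it was opened at.
    private static final class Entry<T> {
        final T value;
        final long version;
        Entry(T value, long version) { this.value = value; this.version = version; }
    }

    private final Map<String, Entry<V>> entries = new HashMap<>();
    private final ToLongFunction<String> versionOf; // stand-in for lastModified(name)
    private final Function<String, V> opener;       // stand-in for open(name)

    VersionedCache(ToLongFunction<String> versionOf, Function<String, V> opener) {
        this.versionOf = versionOf;
        this.opener = opener;
    }

    synchronized V get(String name) {
        long current = versionOf.applyAsLong(name);   // read the version once
        Entry<V> cached = entries.get(name);
        if (cached != null && cached.version == current) {
            return cached.value;                      // cache hit
        }
        V fresh = opener.apply(name);                 // cache miss or stale entry
        entries.put(name, new Entry<>(fresh, current));
        return fresh;
    }

    public static void main(String[] args) {
        long[] version = {1L};
        VersionedCache<String> cache = new VersionedCache<>(
            name -> version[0],
            name -> name + "@" + version[0]);
        System.out.println(cache.get("index"));
        version[0] = 2L;                              // simulate a modified index
        System.out.println(cache.get("index"));
    }
}
```

Unlike the page code, the sketch reads the version once per lookup and stores that same value with the entry, so the stored stamp cannot disagree with the one that was compared. It does not close a replaced value; a real reader cache would need to decide when that is safe.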

1.1 jakarta-lucene/src/demo/org/apache/lucene/demo/DeleteFiles.java

Index: DeleteFiles.java
===================================================================
package org.apache.lucene.demo;

/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2001 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/

import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

class DeleteFiles {
public static void main(String[] args) {
try {
Directory directory = FSDirectory.getDirectory("demo index", false);
IndexReader reader = IndexReader.open(directory);

// Term term = new Term("path", "pizza");
// int deleted = reader.delete(term);

// System.out.println("deleted " + deleted +
// " documents containing " + term);

for (int i = 0; i < reader.maxDoc(); i++)
reader.delete(i);

reader.close();
directory.close();

} catch (Exception e) {
System.out.println(" caught a " + e.getClass() +
"\n with message: " + e.getMessage());
}
}
}

1.1 jakarta-lucene/src/demo/org/apache/lucene/demo/FileDocument.java

Index: FileDocument.java
===================================================================
package org.apache.lucene.demo;

/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2001 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/

import java.io.File;
import java.io.Reader;
import java.io.FileInputStream;
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.DateField;

/** A utility for making Lucene Documents from a File. */

public class FileDocument {
/** Makes a document for a File.
<p>
The document has three fields:
<ul>
<li><code>path</code>--containing the pathname of the file, as a stored,
tokenized field;
<li><code>modified</code>--containing the last modified date of the file as
a keyword field as encoded by <a
href="lucene.document.DateField.html">DateField</a>; and
<li><code>contents</code>--containing the full contents of the file, as a
Reader field.
</ul>
*/
public static Document Document(File f)
throws java.io.FileNotFoundException {

// make a new, empty document
Document doc = new Document();

// Add the path of the file as a field named "path". Use a Text field, so
// that the index stores the path, and so that the path is searchable
doc.add(Field.Text("path", f.getPath()));

// Add the last modified date of the file as a field named "modified". Use a
// Keyword field, so that it's searchable, but so that no attempt is made
// to tokenize the field into words.
doc.add(Field.Keyword("modified",
DateField.timeToString(f.lastModified())));

// Add the contents of the file as a field named "contents". Use a Text
// field, specifying a Reader, so that the text of the file is tokenized.
// ?? why doesn't FileReader work here ??
FileInputStream is = new FileInputStream(f);
Reader reader = new BufferedReader(new InputStreamReader(is));
doc.add(Field.Text("contents", reader));

// return the document
return doc;
}

private FileDocument() {}
}

1.1 jakarta-lucene/src/demo/org/apache/lucene/demo/HTMLDocument.java

Index: HTMLDocument.java
===================================================================
package org.apache.lucene.demo;

/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2001 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/

import java.io.*;
import org.apache.lucene.document.*;
import org.apache.lucene.demo.html.HTMLParser;

/** A utility for making Lucene Documents for HTML documents. */

public class HTMLDocument {
static char dirSep = System.getProperty("file.separator").charAt(0);

public static String uid(File f) {
// Append path and date into a string in such a way that lexicographic
// sorting gives the same results as a walk of the file hierarchy. Thus
// null (\u0000) is used both to separate directory components and to
// separate the path from the date.
return f.getPath().replace(dirSep, '\u0000') +
"\u0000" +
DateField.timeToString(f.lastModified());
}

public static String uid2url(String uid) {
String url = uid.replace('\u0000', '/'); // replace nulls with slashes
return url.substring(0, url.lastIndexOf('/')); // remove date from end
}

public static Document Document(File f)
throws IOException, InterruptedException {
// make a new, empty document
Document doc = new Document();

// Add the url as a field named "url". Use an UnIndexed field, so
// that the url is just stored with the document, but is not searchable.
doc.add(Field.UnIndexed("url", f.getPath().replace(dirSep, '/')));

// Add the last modified date of the file as a field named "modified". Use a
// Keyword field, so that it's searchable, but so that no attempt is made
// to tokenize the field into words.
doc.add(Field.Keyword("modified",
DateField.timeToString(f.lastModified())));

// Add the uid as a field, so that the index can be incrementally maintained.
// This field is not stored with the document; it is indexed, but it is not
// tokenized prior to indexing.
doc.add(new Field("uid", uid(f), false, true, false));

HTMLParser parser = new HTMLParser(f);

// Add the tag-stripped contents as a Reader-valued Text field so it will
// get tokenized and indexed.
doc.add(Field.Text("contents", parser.getReader()));

// Add the summary as an UnIndexed field, so that it is stored and returned
// with hit documents for display.
doc.add(Field.UnIndexed("summary", parser.getSummary()));

// Add the title as a separate Text field, so that it can be searched
// separately.
doc.add(Field.Text("title", parser.getTitle()));

// return the document
return doc;
}

private HTMLDocument() {}
}
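
The comment in `uid()` says that using null as the separator makes lexicographic order match a walk of the file hierarchy. A small JDK-only sketch of that property follows. `UidSketch` and `encodeTime` are hypothetical names used for illustration; `encodeTime` is a fixed-width base-36 stand-in and is not claimed to match `DateField.timeToString`.

```java
import java.util.Arrays;

// Sketch only: mirrors the shape of HTMLDocument.uid/uid2url without Lucene.
class UidSketch {
    // Hypothetical stand-in for DateField.timeToString: zero-padded base 36.
    static String encodeTime(long millis) {
        String s = Long.toString(millis, 36);
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < 9; i++) sb.append('0');
        return sb.append(s).toString();
    }

    // Path components and the date are joined with '\u0000', the lowest char,
    // so a directory's contents always sort before any sibling whose name
    // merely starts with the directory's name.
    static String uid(String path, char dirSep, long lastModified) {
        return path.replace(dirSep, '\u0000') + "\u0000" + encodeTime(lastModified);
    }

    static String uid2url(String uid) {
        String url = uid.replace('\u0000', '/');        // nulls back to slashes
        return url.substring(0, url.lastIndexOf('/'));  // drop the trailing date
    }

    public static void main(String[] args) {
        String[] uids = {
            uid("docs/b.html", '/', 1000L),
            uid("docs-old/a.html", '/', 1000L),
            uid("docs/a/x.html", '/', 1000L),
        };
        Arrays.sort(uids);
        for (String u : uids) System.out.println(uid2url(u));
    }
}
```

With a plain `/` separator, `docs-old/a.html` would sort before everything under `docs/`, because `-` compares lower than `/`. With the null separator the sorted uids come out as `docs/a/x.html`, `docs/b.html`, `docs-old/a.html`, which is the order a walk that sorts each directory listing would visit them.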

1.1 jakarta-lucene/src/demo/org/apache/lucene/demo/IndexFiles.java

Index: IndexFiles.java
===================================================================
package org.apache.lucene.demo;

/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2001 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

import java.io.File;
import java.util.Date;

class IndexFiles {
public static void main(String[] args) {
try {
Date start = new Date();

IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
indexDocs(writer, new File(args[0]));

writer.optimize();
writer.close();

Date end = new Date();

System.out.print(end.getTime() - start.getTime());
System.out.println(" total milliseconds");

} catch (Exception e) {
System.out.println(" caught a " + e.getClass() +
"\n with message: " + e.getMessage());
}
}

public static void indexDocs(IndexWriter writer, File file)
throws Exception {
if (file.isDirectory()) {
String[] files = file.list();
for (int i = 0; i < files.length; i++)
indexDocs(writer, new File(file, files[i]));
} else {
System.out.println("adding " + file);
writer.addDocument(FileDocument.Document(file));
}
}
}

1.1 jakarta-lucene/src/demo/org/apache/lucene/demo/IndexHTML.java

Index: IndexHTML.java
===================================================================
package org.apache.lucene.demo;

/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2001 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.util.Arrays;
import org.apache.lucene.demo.html.HTMLParser;

import java.io.File;
import java.util.Date;

class IndexHTML {
private static boolean deleting = false; // true during deletion pass
private static IndexReader reader; // existing index
private static IndexWriter writer; // new index being built
private static TermEnum uidIter; // document id iterator

public static void main(String[] argv) {
try {
String index = "index";
boolean create = false;
File root = null;

String usage = "IndexHTML [-create] [-index <index>] <root_directory>";

if (argv.length == 0) {
System.err.println("Usage: " + usage);
return;
}

for (int i = 0; i < argv.length; i++) {
if (argv[i].equals("-index")) { // parse -index option
index = argv[++i];
} else if (argv[i].equals("-create")) { // parse -create option
create = true;
} else if (i != argv.length-1) {
System.err.println("Usage: " + usage);
return;
} else
root = new File(argv[i]);
}

Date start = new Date();

if (!create) { // delete stale docs
deleting = true;
indexDocs(root, index, create);
}

writer = new IndexWriter(index, new StandardAnalyzer(), create);
writer.maxFieldLength = 1000000;

indexDocs(root, index, create); // add new docs

System.out.println("Optimizing index...");
writer.optimize();
writer.close();

Date end = new Date();

System.out.print(end.getTime() - start.getTime());
System.out.println(" total milliseconds");

} catch (Exception e) {
System.out.println(" caught a " + e.getClass() +
"\n with message: " + e.getMessage());
}
}

/* Walk directory hierarchy in uid order, while keeping uid iterator from
* existing index in sync. Mismatches indicate one of: (a) old documents to
* be deleted; (b) unchanged documents, to be left alone; or (c) new
* documents, to be indexed.
*/

private static void indexDocs(File file, String index, boolean create)
throws Exception {
if (!create) { // incrementally update

reader = IndexReader.open(index); // open existing index
uidIter = reader.terms(new Term("uid", "")); // init uid iterator

indexDocs(file);

if (deleting) { // delete rest of stale docs
while (uidIter.term() != null && uidIter.term().field() == "uid") {
System.out.println("deleting " +
HTMLDocument.uid2url(uidIter.term().text()));
reader.delete(uidIter.term());
uidIter.next();
}
deleting = false;
}

uidIter.close(); // close uid iterator
reader.close(); // close existing index

} else // no existing index
indexDocs(file);
}

private static void indexDocs(File file) throws Exception {
if (file.isDirectory()) { // if a directory
String[] files = file.list(); // list its files
Arrays.sort(files); // sort the files
for (int i = 0; i < files.length; i++) // recursively index them
indexDocs(new File(file, files[i]));

} else if (file.getPath().endsWith(".html") || // index .html files
file.getPath().endsWith(".htm") || // index .htm files
file.getPath().endsWith(".txt")) { // index .txt files

if (uidIter != null) {
String uid = HTMLDocument.uid(file); // construct uid for doc

while (uidIter.term() != null && uidIter.term().field() == "uid" &&
uidIter.term().text().compareTo(uid) < 0) {
if (deleting) { // delete stale docs
System.out.println("deleting " +
HTMLDocument.uid2url(uidIter.term().text()));
reader.delete(uidIter.term());
}
uidIter.next();
}
if (uidIter.term() != null && uidIter.term().field() == "uid" &&
uidIter.term().text().compareTo(uid) == 0) {
uidIter.next(); // keep matching docs
} else if (!deleting) { // add new docs
Document doc = HTMLDocument.Document(file);
System.out.println("adding " + doc.get("url"));
writer.addDocument(doc);
}
} else { // creating a new index
Document doc = HTMLDocument.Document(file);
System.out.println("adding " + doc.get("url"));
writer.addDocument(doc); // add docs unconditionally
}
}
}
}
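
The incremental update in IndexHTML is a merge of two sorted sequences: the uid terms already in the index and the uids computed from the files on disk. A minimal sketch of that idea follows, using plain sorted string lists instead of Lucene's TermEnum. The class and method names (UidMergeSketch, stale, fresh) are hypothetical and not part of this commit, and the sketch assumes both lists are sorted in the same order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: sorted string lists stand in for the index's uid
// terms and for the uids computed from the files found on disk.
final class UidMergeSketch {

    // Uids present in the old index but not on disk (stale, to be deleted).
    static List<String> stale(List<String> indexed, List<String> onDisk) {
        List<String> out = new ArrayList<>();
        int i = 0;
        for (String uid : onDisk) {
            // Every indexed uid that sorts before this one has no file.
            while (i < indexed.size() && indexed.get(i).compareTo(uid) < 0) {
                out.add(indexed.get(i++));
            }
            if (i < indexed.size() && indexed.get(i).equals(uid)) {
                i++; // unchanged document, keep it
            }
        }
        while (i < indexed.size()) {
            out.add(indexed.get(i++)); // rest of the index is stale
        }
        return out;
    }

    // Uids on disk that the old index does not contain (new, to be added).
    static List<String> fresh(List<String> indexed, List<String> onDisk) {
        List<String> out = new ArrayList<>();
        int i = 0;
        for (String uid : onDisk) {
            while (i < indexed.size() && indexed.get(i).compareTo(uid) < 0) {
                i++; // stale entry, handled by the deletion pass
            }
            if (i < indexed.size() && indexed.get(i).equals(uid)) {
                i++; // already indexed
            } else {
                out.add(uid);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> indexed = Arrays.asList("a", "b", "d");
        List<String> onDisk = Arrays.asList("b", "c", "d", "e");
        System.out.println("delete " + stale(indexed, onDisk)); // delete [a]
        System.out.println("add " + fresh(indexed, onDisk));    // add [c, e]
    }
}
```

The two methods mirror the two calls to indexDocs in main: one pass with deleting set to true, then one pass that adds documents through the writer.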

1.1 jakarta-lucene/src/demo/org/apache/lucene/demo/SearchFiles.java

Index: SearchFiles.java
===================================================================
package org.apache.lucene.demo;

/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2001 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/

import java.io.IOException;
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;

class SearchFiles {
public static void main(String[] args) {
try {
Searcher searcher = new IndexSearcher("index");
Analyzer analyzer = new StandardAnalyzer();

BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
while (true) {
System.out.print("Query: ");
String line = in.readLine();

if (line == null || line.length() == 0) // end of input or empty line
break;

Query query = QueryParser.parse(line, "contents", analyzer);
System.out.println("Searching for: " + query.toString("contents"));

Hits hits = searcher.search(query);
System.out.println(hits.length() + " total matching documents");

final int HITS_PER_PAGE = 10;
for (int start = 0; start < hits.length(); start += HITS_PER_PAGE) {
int end = Math.min(hits.length(), start + HITS_PER_PAGE);
for (int i = start; i < end; i++) {
Document doc = hits.doc(i);
String path = doc.get("path");
if (path != null) {
System.out.println(i + ". " + path);
} else {
String url = doc.get("url");
if (url != null) {
System.out.println(i + ". " + url);
System.out.println(" - " + doc.get("title"));
} else {
System.out.println(i + ". " + "No path nor URL for this document");
}
}
}

if (hits.length() > end) {
System.out.print("more (y/n) ? ");
line = in.readLine();
if (line == null || line.length() == 0 || line.charAt(0) == 'n')
break;
}
}
}
searcher.close();

} catch (Exception e) {
System.out.println(" caught a " + e.getClass() +
"\n with message: " + e.getMessage());
}
}
}
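
The paging loop in SearchFiles steps through the hits in blocks of HITS_PER_PAGE and clamps the last block with Math.min. The sketch below isolates that arithmetic so the page boundaries can be checked without an index. PagingSketch and pages are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the page-boundary arithmetic used in SearchFiles.
final class PagingSketch {
    static final int HITS_PER_PAGE = 10;

    // Returns {start, end} pairs (end exclusive) that cover totalHits results.
    static List<int[]> pages(int totalHits) {
        List<int[]> out = new ArrayList<>();
        for (int start = 0; start < totalHits; start += HITS_PER_PAGE) {
            int end = Math.min(totalHits, start + HITS_PER_PAGE);
            out.add(new int[] { start, end });
        }
        return out;
    }

    public static void main(String[] args) {
        // 23 hits give three pages: 0..9, 10..19 and 20..22.
        for (int[] p : pages(23)) {
            System.out.println(p[0] + ".." + (p[1] - 1));
        }
    }
}
```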

1.1 jakarta-lucene/src/demo/org/apache/lucene/demo/html/Entities.java

Index: Entities.java
===================================================================
package org.apache.lucene.demo.html;

/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2001 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/

import java.util.*;

public class Entities {
static final Hashtable decoder = new Hashtable(300);
static final String[] encoder = new String[0x100];

static final String decode(String entity) {
if (entity.charAt(entity.length()-1) == ';') // remove trailing semicolon
entity = entity.substring(0, entity.length()-1);
if (entity.charAt(1) == '#') {
int start = 2;
int radix = 10;
if (entity.charAt(2) == 'X' || entity.charAt(2) == 'x') {
start++;
radix = 16;
}
Character c =
new Character((char)Integer.parseInt(entity.substring(start), radix));
return c.toString();
} else {
String s = (String)decoder.get(entity);
if (s != null)
return s;
else return "";
}
}

static final public String encode(String s) {
int length = s.length();
StringBuffer buffer = new StringBuffer(length * 2);
for (int i = 0; i < length; i++) {
char c = s.charAt(i);
int j = (int)c;
if (j < 0x100 && encoder[j] != null) {
buffer.append(encoder[j]); // have a named encoding
buffer.append(';');
} else if (j < 0x80) {
buffer.append(c); // use ASCII value
} else {
buffer.append("&#"); // use numeric encoding
buffer.append((int)c);
buffer.append(';');
}
}
return buffer.toString();
}

static final void add(String entity, int value) {
decoder.put(entity, (new Character((char)value)).toString());
if (value < 0x100)
encoder[value] = entity;
}

static {
add("&nbsp", 160);
add("&iexcl", 161);
add("&cent", 162);
add("&pound", 163);
add("&curren", 164);
add("&yen", 165);
add("&brvbar", 166);
add("&sect", 167);
add("&uml", 168);
add("&copy", 169);
add("&ordf", 170);
add("&laquo", 171);
add("&not", 172);
add("&shy", 173);
add("&reg", 174);
add("&macr", 175);
add("&deg", 176);
add("&plusmn", 177);
add("&sup2", 178);
add("&sup3", 179);
add("&acute", 180);
add("&micro", 181);
add("&para", 182);
add("&middot", 183);
add("&cedil", 184);
add("&sup1", 185);
add("&ordm", 186);
add("&raquo", 187);
add("&frac14", 188);
add("&frac12", 189);
add("&frac34", 190);
add("&iquest", 191);
add("&Agrave", 192);
add("&Aacute", 193);
add("&Acirc", 194);
add("&Atilde", 195);
add("&Auml", 196);
add("&Aring", 197);
add("&AElig", 198);
add("&Ccedil", 199);
add("&Egrave", 200);
add("&Eacute", 201);
add("&Ecirc", 202);
add("&Euml", 203);
add("&Igrave", 204);
add("&Iacute", 205);
add("&Icirc", 206);
add("&Iuml", 207);
add("&ETH", 208);
add("&Ntilde", 209);
add("&Ograve", 210);
add("&Oacute", 211);
add("&Ocirc", 212);
add("&Otilde", 213);
add("&Ouml", 214);
add("&times", 215);
add("&Oslash", 216);
add("&Ugrave", 217);
add("&Uacute", 218);
add("&Ucirc", 219);
add("&Uuml", 220);
add("&Yacute", 221);
add("&THORN", 222);
add("&szlig", 223);
add("&agrave", 224);
add("&aacute", 225);
add("&acirc", 226);
add("&atilde", 227);
add("&auml", 228);
add("&aring", 229);
add("&aelig", 230);
add("&ccedil", 231);
add("&egrave", 232);
add("&eacute", 233);
add("&ecirc", 234);
add("&euml", 235);
add("&igrave", 236);
add("&iacute", 237);
add("&icirc", 238);
add("&iuml", 239);
add("&eth", 240);
add("&ntilde", 241);
add("&ograve", 242);
add("&oacute", 243);
add("&ocirc", 244);
add("&otilde", 245);
add("&ouml", 246);
add("&divide", 247);
add("&oslash", 248);
add("&ugrave", 249);
add("&uacute", 250);
add("&ucirc", 251);
add("&uuml", 252);
add("&yacute", 253);
add("&thorn", 254);
add("&yuml", 255);
add("&fnof", 402);
add("&Alpha", 913);
add("&Beta", 914);
add("&Gamma", 915);
add("&Delta", 916);
add("&Epsilon",917);
add("&Zeta", 918);
add("&Eta", 919);
add("&Theta", 920);
add("&Iota", 921);
add("&Kappa", 922);
add("&Lambda", 923);
add("&Mu", 924);
add("&Nu", 925);
add("&Xi", 926);
add("&Omicron",927);
add("&Pi", 928);
add("&Rho", 929);
add("&Sigma", 931);
add("&Tau", 932);
add("&Upsilon",933);
add("&Phi", 934);
add("&Chi", 935);
add("&Psi", 936);
add("&Omega", 937);
add("&alpha", 945);
add("&beta", 946);
add("&gamma", 947);
add("&delta", 948);
add("&epsilon",949);
add("&zeta", 950);
add("&eta", 951);
add("&theta", 952);
add("&iota", 953);
add("&kappa", 954);
add("&lambda", 955);
add("&mu", 956);
add("&nu", 957);
add("&xi", 958);
add("&omicron",959);
add("&pi", 960);
add("&rho", 961);
add("&sigmaf", 962);
add("&sigma", 963);
add("&tau", 964);
add("&upsilon",965);
add("&phi", 966);
add("&chi", 967);
add("&psi", 968);
add("&omega", 969);
add("&thetasym",977);
add("&upsih", 978);
add("&piv", 982);
add("&bull", 8226);
add("&hellip", 8230);
add("&prime", 8242);
add("&Prime", 8243);
add("&oline", 8254);
add("&frasl", 8260);
add("&weierp", 8472);
add("&image", 8465);
add("&real", 8476);
add("&trade", 8482);
add("&alefsym",8501);
add("&larr", 8592);
add("&uarr", 8593);
add("&rarr", 8594);
add("&darr", 8595);
add("&harr", 8596);
add("&crarr", 8629);
add("&lArr", 8656);
add("&uArr", 8657);
add("&rArr", 8658);
add("&dArr", 8659);
add("&hArr", 8660);
add("&forall", 8704);
add("&part", 8706);
add("&exist", 8707);
add("&empty", 8709);
add("&nabla", 8711);
add("&isin", 8712);
add("&notin", 8713);
add("&ni", 8715);
add("&prod", 8719);
add("&sum", 8721);
add("&minus", 8722);
add("&lowast", 8727);
add("&radic", 8730);
add("&prop", 8733);
add("&infin", 8734);
add("&ang", 8736);
add("&and", 8743);
add("&or", 8744);
add("&cap", 8745);
add("&cup", 8746);
add("&int", 8747);
add("&there4", 8756);
add("&sim", 8764);
add("&cong", 8773);
add("&asymp", 8776);
add("&ne", 8800);
add("&equiv", 8801);
add("&le", 8804);
add("&ge", 8805);
add("&sub", 8834);
add("&sup", 8835);
add("&nsub", 8836);
add("&sube", 8838);
add("&supe", 8839);
add("&oplus", 8853);
add("&otimes", 8855);
add("&perp", 8869);
add("&sdot", 8901);
add("&lceil", 8968);
add("&rceil", 8969);
add("&lfloor", 8970);
add("&rfloor", 8971);
add("&lang", 9001);
add("&rang", 9002);
add("&loz", 9674);
add("&spades", 9824);
add("&clubs", 9827);
add("&hearts", 9829);
add("&diams", 9830);
add("&quot", 34);
add("&amp", 38);
add("&lt", 60);
add("&gt", 62);
add("&OElig", 338);
add("&oelig", 339);
add("&Scaron", 352);
add("&scaron", 353);
add("&Yuml", 376);
add("&circ", 710);
add("&tilde", 732);
add("&ensp", 8194);
add("&emsp", 8195);
add("&thinsp", 8201);
add("&zwnj", 8204);
add("&zwj", 8205);
add("&lrm", 8206);
add("&rlm", 8207);
add("&ndash", 8211);
add("&mdash", 8212);
add("&lsquo", 8216);
add("&rsquo", 8217);
add("&sbquo", 8218);
add("&ldquo", 8220);
add("&rdquo", 8221);
add("&bdquo", 8222);
add("&dagger", 8224);
add("&Dagger", 8225);
add("&permil", 8240);
add("&lsaquo", 8249);
add("&rsaquo", 8250);
add("&euro", 8364);

}
}
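
Entities.decode handles numeric character references by reading the digits after "&#" in base 10, or in base 16 when an "x" or "X" follows. The sketch below shows only that branch. It is a variation, not the committed code: it adds length and format guards and returns an empty string on malformed input, which the original does not do. NumericEntitySketch is a hypothetical name.

```java
// Hypothetical sketch of the numeric branch of Entities.decode, with
// extra guards that are assumptions of this example.
final class NumericEntitySketch {

    // Decodes references such as "&#169;" or "&#xA9;"; returns "" otherwise.
    static String decode(String entity) {
        if (entity.endsWith(";")) { // remove trailing semicolon
            entity = entity.substring(0, entity.length() - 1);
        }
        if (entity.length() < 3 || entity.charAt(0) != '&' || entity.charAt(1) != '#') {
            return ""; // not a numeric reference
        }
        int start = 2;
        int radix = 10;
        char marker = entity.charAt(2);
        if (marker == 'x' || marker == 'X') { // hexadecimal form
            start = 3;
            radix = 16;
        }
        try {
            int code = Integer.parseInt(entity.substring(start), radix);
            return String.valueOf((char) code);
        } catch (NumberFormatException e) {
            return ""; // digits missing or malformed
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("&#65;"));  // A
        System.out.println(decode("&#x41;")); // A
    }
}
```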

1.1 jakarta-lucene/src/demo/org/apache/lucene/demo/html/HTMLParser.jj

Index: HTMLParser.jj
===================================================================
/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2001 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/

// HTMLParser.jj

options {
STATIC = false;
OPTIMIZE_TOKEN_MANAGER = true;
//DEBUG_LOOKAHEAD = true;
//DEBUG_TOKEN_MANAGER = true;
}

PARSER_BEGIN(HTMLParser)

package org.apache.lucene.demo.html;

import java.io.*;

public class HTMLParser {
public static int SUMMARY_LENGTH = 200;

StringBuffer title = new StringBuffer(SUMMARY_LENGTH);
StringBuffer summary = new StringBuffer(SUMMARY_LENGTH * 2);
int length = 0;
boolean titleComplete = false;
boolean inTitle = false;
boolean inScript = false;
boolean afterTag = false;
boolean afterSpace = false;
String eol = System.getProperty("line.separator");
PipedReader pipeIn = null;
PipedWriter pipeOut;

public HTMLParser(File file) throws FileNotFoundException {
this(new FileInputStream(file));
}

public String getTitle() throws IOException, InterruptedException {
if (pipeIn == null)
getReader(); // spawn parsing thread
while (true) {
synchronized(this) {
if (titleComplete || (length > SUMMARY_LENGTH))
break;
wait(10);
}
}
return title.toString().trim();
}

public String getSummary() throws IOException, InterruptedException {
if (pipeIn == null)
getReader(); // spawn parsing thread
while (true) {
synchronized(this) {
if (summary.length() >= SUMMARY_LENGTH)
break;
wait(10);
}
}
if (summary.length() > SUMMARY_LENGTH)
summary.setLength(SUMMARY_LENGTH);

String sum = summary.toString().trim();
String tit = getTitle();
if (sum.startsWith(tit))
return sum.substring(tit.length());
else
return sum;
}

public Reader getReader() throws IOException {
if (pipeIn == null) {
pipeIn = new PipedReader();
pipeOut = new PipedWriter(pipeIn);

Thread thread = new ParserThread(this);
thread.start(); // start parsing
}

return pipeIn;
}

void addToSummary(String text) {
if (summary.length() < SUMMARY_LENGTH) {
summary.append(text);
if (summary.length() >= SUMMARY_LENGTH) {
synchronized(this) {
notifyAll();
}
}
}
}

void addText(String text) throws IOException {
if (inScript)
return;
if (inTitle)
title.append(text);
else {
addToSummary(text);
if (!titleComplete && title.length() > 0) { // finished title
synchronized(this) {
titleComplete = true; // tell waiting threads
notifyAll();
}
}
}

length += text.length();
pipeOut.write(text);

afterSpace = false;
}

void addSpace() throws IOException {
if (inScript)
return;
if (!afterSpace) {
if (inTitle)
title.append(" ");
else
addToSummary(" ");

String space = afterTag ? eol : " ";
length += space.length();
pipeOut.write(space);
afterSpace = true;
}
}

// void handleException(Exception e) {
// System.out.println(e.toString()); // print the error message
// System.out.println("Skipping...");
// Token t;
// do {
// t = getNextToken();
// } while (t.kind != TagEnd);
// }
}

PARSER_END(HTMLParser)

void HTMLDocument() throws IOException :
{
Token t;
}
{
// try {
( Tag() { afterTag = true; }
| t=Decl() { afterTag = true; }
| CommentTag() { afterTag = true; }
| t=<Word> { addText(t.image); afterTag = false; }
| t=<Entity> { addText(Entities.decode(t.image)); afterTag = false; }
| t=<Punct> { addText(t.image); afterTag = false; }
| <Space> { addSpace(); afterTag = false; }
)* <EOF>
// } catch (ParseException e) {
// handleException(e);
// }
}

void Tag() throws IOException :
{
Token t1, t2;
boolean inImg = false;
}
{
t1=<TagName> {
inTitle = t1.image.equalsIgnoreCase("<title"); // keep track if in <TITLE>
inImg = t1.image.equalsIgnoreCase("<img"); // keep track if in <IMG>
if (inScript) { // keep track if in <SCRIPT>
inScript = !t1.image.equalsIgnoreCase("</script");
} else {
inScript = t1.image.equalsIgnoreCase("<script");
}
}
(t1=<ArgName>
(<ArgEquals>
(t2=ArgValue() // save ALT text in IMG tag
{
if (inImg && t1.image.equalsIgnoreCase("alt") && t2 != null)
addText("[" + t2.image + "]");
}
)?
)?
)*
<TagEnd>
}

Token ArgValue() :
{
Token t = null;
}
{
t=<ArgValue> { return t; }
| LOOKAHEAD(2)
<ArgQuote1> <CloseQuote1> { return t; }
| <ArgQuote1> t=<Quote1Text> <CloseQuote1> { return t; }
| LOOKAHEAD(2)
<ArgQuote2> <CloseQuote2> { return t; }
| <ArgQuote2> t=<Quote2Text> <CloseQuote2> { return t; }
}

Token Decl() :
{
Token t;
}
{
t=<DeclName> ( <ArgName> | ArgValue() | <ArgEquals> )* <TagEnd>
{ return t; }
}

void CommentTag() :
{}
{
(<Comment1> ( <CommentText1> )* <CommentEnd1>)
|
(<Comment2> ( <CommentText2> )* <CommentEnd2>)
}

TOKEN :
{
< TagName: "<" ("/")? ["A"-"Z","a"-"z"] (<ArgName>)? > : WithinTag
| < DeclName: "<" "!" ["A"-"Z","a"-"z"] (<ArgName>)? > : WithinTag

| < Comment1: "" > : DEFAULT
}

<WithinComment2> TOKEN :
{
< CommentText2: (~[">"])+ >
| < CommentEnd2: ">" > : DEFAULT
}

1.1 jakarta-lucene/src/demo/org/apache/lucene/demo/html/ParserThread.java

Index: ParserThread.java
===================================================================
package org.apache.lucene.demo.html;

/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2001 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/

import java.io.*;

class ParserThread extends Thread {
HTMLParser parser;

ParserThread(HTMLParser p) {
parser = p;
}

public void run() { // convert pipeOut to pipeIn
try {
try { // parse document to pipeOut
parser.HTMLDocument();
} catch (ParseException e) {
System.out.println("Parse Aborted: " + e.getMessage());
} catch (TokenMgrError e) {
System.out.println("Parse Aborted: " + e.getMessage());
} finally {
parser.pipeOut.close();
synchronized (parser) {
parser.summary.setLength(parser.SUMMARY_LENGTH);
parser.titleComplete = true;
parser.notifyAll();
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
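
ParserThread and HTMLParser.getTitle cooperate through a flag guarded by the parser's monitor: the parsing thread sets titleComplete and calls notifyAll, while the caller loops on a timed wait until the flag is set. The sketch below reduces that handshake to its core. TitleHandshakeSketch, finishTitle, awaitTitle and runDemo are hypothetical names, and the sketch ignores the summary-length condition that the real getTitle also checks.

```java
// Hypothetical sketch of the wait/notifyAll handshake between a parsing
// thread (producer) and a caller asking for the title (consumer).
final class TitleHandshakeSketch {
    private final StringBuilder title = new StringBuilder();
    private boolean titleComplete = false;

    // Producer side: called by the parsing thread once the title is known.
    synchronized void finishTitle(String text) {
        title.append(text);
        titleComplete = true; // tell waiting threads
        notifyAll();
    }

    // Consumer side: blocks until the producer has set the flag.
    synchronized String awaitTitle() throws InterruptedException {
        while (!titleComplete) {
            wait(10); // timed wait, then re-check the flag
        }
        return title.toString().trim();
    }

    // Runs one producer thread and waits for its result.
    static String runDemo() {
        final TitleHandshakeSketch s = new TitleHandshakeSketch();
        Thread producer = new Thread(new Runnable() {
            public void run() {
                s.finishTitle("  Hello  ");
            }
        });
        producer.start();
        try {
            String result = s.awaitTitle();
            producer.join();
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // Hello
    }
}
```

Setting the flag in a finally block, as ParserThread does, is what keeps callers from waiting forever when the parse aborts.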

1.1 jakarta-lucene/src/demo/org/apache/lucene/demo/html/Test.java

Index: Test.java
===================================================================
package org.apache.lucene.demo.html;

/* ====================================================================
* The Apache Software License, Version 1.1
*
* Copyright (c) 2001 The Apache Software Foundation. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* 3. The end-user documentation included with the redistribution,
* if any, must include the following acknowledgment:
* "This product includes software developed by the
* Apache Software Foundation (http://www.apache.org/)."
* Alternately, this acknowledgment may appear in the software itself,
* if and wherever such third-party acknowledgments normally appear.
*
* 4. The names "Apache" and "Apache Software Foundation" and
* "Apache Lucene" must not be used to endorse or promote products
* derived from this software without prior written permission. For
* written permission, please contact apache@apache.org.
*
* 5. Products derived from this software may not be called "Apache",
* "Apache Lucene", nor may "Apache" appear in their name, without
* prior written permission of the Apache Software Foundation.
*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
* ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
* USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
* OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*/

import java.io.*;

class Test {
public static void main(String[] argv) throws Exception {
if ("-dir".equals(argv[0])) {
String[] files = new File(argv[1]).list();
java.util.Arrays.sort(files);
for (int i = 0; i < files.length; i++) {
System.err.println(files[i]);
File file = new File(argv[1], files[i]);
parse(file);
}
} else
parse(new File(argv[0]));
}

public static void parse(File file) throws Exception {
HTMLParser parser = new HTMLParser(file);
System.out.println("Title: " + Entities.encode(parser.getTitle()));
System.out.println("Summary: " + Entities.encode(parser.getSummary()));
LineNumberReader reader = new LineNumberReader(parser.getReader());
for (String l = reader.readLine(); l != null; l = reader.readLine())
System.out.println(l);
}
}
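
Test reads the parsed text from parser.getReader(), which HTMLParser backs with a PipedReader fed by a PipedWriter on the parsing thread. The sketch below shows that producer/consumer pipe on its own, using only java.io. PipeSketch and roundTrip are hypothetical names, and the writer thread here writes a fixed string rather than parser output.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;

// Hypothetical sketch: one thread writes into a pipe, the caller reads it
// line by line, as Test does with the parser's reader.
final class PipeSketch {

    static String roundTrip(final String text) {
        try {
            final PipedReader in = new PipedReader();
            final PipedWriter out = new PipedWriter(in);

            Thread writer = new Thread(new Runnable() {
                public void run() {
                    try {
                        out.write(text);
                    } catch (IOException e) {
                        e.printStackTrace();
                    } finally {
                        try {
                            out.close(); // signals end of stream to the reader
                        } catch (IOException ignored) {
                            // nothing more to do for this sketch
                        }
                    }
                }
            });
            writer.start();

            StringBuilder sb = new StringBuilder();
            BufferedReader reader = new BufferedReader(in);
            for (String l = reader.readLine(); l != null; l = reader.readLine()) {
                sb.append(l).append('\n');
            }
            writer.join();
            return sb.toString();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(roundTrip("one\ntwo\n"));
    }
}
```

Closing the writer in a finally block matters: the reader only sees end of stream once the write end is closed, which is why ParserThread closes pipeOut even when parsing aborts.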

1.1 jakarta-lucene/src/jsp/README.txt

Index: README.txt
===================================================================
To build the Jakarta Lucene web app demo, run
"ant wardemo" from the Jakarta Lucene installation
directory (follow the master instructions in
BUILD.txt). For the details of setting up and using
the demo, read the Lucene "Getting Started" guide
provided with the doc build ("ant docs"). If you
have questions, please post them to the Jakarta
Lucene mailing lists.

1.1 jakarta-lucene/src/jsp/configuration.jsp

Index: configuration.jsp
===================================================================
<%
/* Author: Andrew C. Oliver (acoliver2@users.sourceforge.net) */
String appTitle = "Jakarta Lucene Example - Intranet Server Search Application";
/* make sure you point the below string to the index you created with IndexHTML */
String indexLocation = "/opt/lucene/index";
String appfooter = "Jakarta Lucene Template WebApp 1.0";
%>

1.1 jakarta-lucene/src/jsp/footer.jsp

Index: footer.jsp
===================================================================
<% /* Author: Andrew C. Oliver (acoliver2@users.sourceforge.net) */ %>
<p>
<center>
<%=appfooter%>
</center>
</p>
</body>
</html>

1.1 jakarta-lucene/src/jsp/header.jsp

Index: header.jsp
===================================================================
<%@include file="configuration.jsp"%>
<% /* Author: Andrew C. Oliver (acoliver2@users.sourceforge.net) */ %>
<html>
<head>
<title><%=appTitle%></title>
</head>
<body>
<center>
<p>
Welcome to the Lucene Template application. (This is the header)
</p>
</center>

1.1 jakarta-lucene/src/jsp/index.jsp

Index: index.jsp
===================================================================
<%@include file="header.jsp"%>
<% /* Author: Andrew C. Oliver (acoliver2@users.sourceforge.net) */ %>
<center>
<form name="search" action="results.jsp" method="get">
<p>
<input name="query" size="44"/> Search Criteria
</p>
<p>
<input name="maxresults" size="4" value="100"/> Results Per Page 
<input type="submit" value="Search"/>
</p>
</form>
</center>
<%@include file="footer.jsp"%>

1.1 jakarta-lucene/src/jsp/results.jsp

Index: results.jsp
===================================================================
<%@ page import = " javax.servlet.*, javax.servlet.http.*, java.io.*, org.apache.lucene.analysis.*, org.apache.lucene.document.*, org.apache.lucene.index.*, org.apache.lucene.search.*, org.apache.lucene.queryParser.*, org.apache.lucene.demo.*, org.apache.lucene.demo.html.Entities" %>

<%
/*
Author: Andrew C. Oliver, SuperLink Software, Inc. (acoliver2@users.sourceforge.net)

This jsp page is deliberately written in the horrible java directly embedded
in the page style for an easy and concise demonstration of Lucene.
Do note...if you write pages that look like this...sooner or later
you'll have a maintenance nightmare. If you use jsps...use taglibs
and beans! That being said, this should be acceptable for a small
page demonstrating how one uses Lucene in a web app.

This is also deliberately overcommented. ;-)

*/
%>
<%@include file="header.jsp"%>
<%
boolean error = false; //used to control flow for error messages
String indexName = indexLocation; //local copy of the configuration variable
IndexSearcher searcher = null; //the searcher used to open/search the index
Query query = null; //the Query created by the QueryParser
Hits hits = null; //the search results
int startindex = 0; //the first index displayed on this page
int maxpage = 50; //the maximum items displayed on this page
String queryString = null; //the query entered in the previous page
String startVal = null; //string version of startindex
String maxresults = null; //string version of maxpage
int thispage = 0; //used for the for/next either maxpage or
//hits.length() - startindex - whichever is
//less

try {
searcher = new IndexSearcher(
IndexReader.open(indexName) //create an indexSearcher for our page
);
} catch (Exception e) { //any error that happens is probably due
//to a permission problem or non-existent
//or otherwise corrupt index
%>
<p>ERROR opening the Index - contact sysadmin!</p>
<p>While opening index: <%=e.getMessage()%></p>
<% error = true; //don't do anything up to the footer
}
%>
<%
if (error == false) { //did we open the index?
queryString = request.getParameter("query"); //get the search criteria
startVal = request.getParameter("startat"); //get the start index
maxresults = request.getParameter("maxresults"); //get max results per page
try {
maxpage = Integer.parseInt(maxresults); //parse the max results first
startindex = Integer.parseInt(startVal); //then the start index
} catch (Exception e) { } //we don't care if something happens we'll just start at 0
//or end at 50

if (queryString == null)
throw new ServletException("no query "+ //if you don't have a query then
"specified"); //you probably played on the
//query string so you get the
//treatment

Analyzer analyzer = new StopAnalyzer(); //construct our usual analyzer
try {
query = QueryParser.parse(queryString, "contents", analyzer); //parse the
} catch (ParseException e) { //query and construct the Query
//object
//if it's just "operator error"
//send them a nice error HTML

%>
<p>Error While parsing query: <%=e.getMessage()%></p>
<%
error = true; //don't bother with the rest of
//the page
}
}
%>
<%
if (error == false && searcher != null) { // if we've had no errors
// searcher != null was to handle
// a weird compilation bug
thispage = maxpage; // default last element to maxpage
hits = searcher.search(query); // run the query
if (hits.length() == 0) { // if we got no results tell the user
%>
<p> I'm sorry I couldn't find what you were looking for. </p>
<%
error = true; // don't bother with the rest of the
// page
}
}

if (error == false && searcher != null) {
%>
<table>
<tr>
<td>Document</td>
<td>Summary</td>
</tr>
<%
if ((startindex + maxpage) > hits.length()) {
thispage = hits.length() - startindex; // set the max index to maxpage or last
} // actual search result whichever is less

for (int i = startindex; i < (thispage + startindex); i++) { // for each element
%>
<tr>
<%
Document doc = hits.doc(i); //get the next document
String doctitle = doc.get("title"); //get its title
String url = doc.get("url"); //get its url field
if (doctitle == null || doctitle.equals("")) //use the url if it has no title
doctitle = url;
//then output!
%>
<td><a href="<%=url%>"><%=doctitle%></a></td>
<td><%=doc.get("summary")%></td>
</tr>
<%
}
%>
<% if ( (startindex + maxpage) < hits.length()) { //if there are more results...display
//the more link

String moreurl="results.jsp?query=" + java.net.URLEncoder.encode(queryString) + //construct the "more" link (query is URL-encoded)
"&maxresults=" + maxpage +
"&startat=" + (startindex + maxpage);
%>
<tr>
<td></td><td><a href="<%=moreurl%>">More Results>></a></td>
</tr>
<%
}
%>
</table>

<% } //then include our footer.
%>
<%@include file="footer.jsp"%>
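
A note on the paging logic in results.jsp above: the arithmetic can be tried
outside a servlet container. What follows is only a minimal sketch, not part of
this commit. The class and method names (PagingSketch, pageSize, hasMore,
moreUrl) are made up for illustration, the clamp to zero for an out-of-range
start index is an assumption (the jsp simply ends up with an empty loop), and
the URL-encoding of the query is an addition that the committed jsp may not
perform.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Stand-alone sketch of the paging arithmetic used in results.jsp (hypothetical names). */
class PagingSketch {

    /** Number of hits to show on this page: maxpage, or whatever is left. */
    static int pageSize(int startindex, int maxpage, int totalHits) {
        int remaining = totalHits - startindex;
        if (remaining < 0) {
            remaining = 0; // assumption: a start index past the end shows nothing
        }
        return Math.min(maxpage, remaining);
    }

    /** True when a "More Results" link should be shown. */
    static boolean hasMore(int startindex, int maxpage, int totalHits) {
        return startindex + maxpage < totalHits;
    }

    /** Builds the link to the next page, URL-encoding the query text. */
    static String moreUrl(String query, int startindex, int maxpage) {
        try {
            return "results.jsp?query=" + URLEncoder.encode(query, "UTF-8")
                    + "&maxresults=" + maxpage
                    + "&startat=" + (startindex + maxpage);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every JVM
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(pageSize(100, 50, 120));        // last, shorter page
        System.out.println(hasMore(100, 50, 120));         // no further pages
        System.out.println(moreUrl("lucene index", 0, 50)); // link to the second page
    }
}
```

Keeping the arithmetic in small static methods like these is one way to test
it without Tomcat; the jsp itself inlines the same checks.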

1.1 jakarta-lucene/src/jsp/WEB-INF/web.xml

Index: web.xml
===================================================================
<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE web-app
PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
"http://java.sun.com/dtd/web-app_2_3.dtd">

<web-app>

</web-app>

1.3 +1 -1 jakarta-lucene/src/test/org/apache/lucene/IndexTest.java

Index: IndexTest.java
===================================================================
RCS file: /home/cvs/jakarta-lucene/src/test/org/apache/lucene/IndexTest.java,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -r1.2 -r1.3
--- IndexTest.java 18 Sep 2001 17:35:57 -0000 1.2
+++ IndexTest.java 26 Jan 2002 15:01:32 -0000 1.3
@@ -58,7 +58,7 @@
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.document.Document;
-import org.apache.lucene.FileDocument;
+import org.apache.lucene.demo.FileDocument;

import java.io.File;
import java.util.Date;

1.3 +1 -1 jakarta-lucene/src/test/org/apache/lucene/index/DocTest.java

Index: DocTest.java
===================================================================
RCS file: /home/cvs/jakarta-lucene/src/test/org/apache/lucene/index/DocTest.java,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -r1.2 -r1.3
--- DocTest.java 18 Sep 2001 17:35:57 -0000 1.2
+++ DocTest.java 26 Jan 2002 15:01:32 -0000 1.3
@@ -59,7 +59,7 @@
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.document.Document;
-import org.apache.lucene.FileDocument;
+import org.apache.lucene.demo.FileDocument;

import java.io.File;
import java.util.Date;

1.1 jakarta-lucene/xdocs/demo.xml

Index: demo.xml
===================================================================
<?xml version="1.0"?>
<document>
<properties>
<author email="acoliver@apache.org">Andrew C. Oliver</author>
<title>Jakarta Lucene - Building and Installing the Basic Demo</title>
</properties>
<body>

<section name="About this Document">
<p>
This document is intended as a "getting started" guide to using and running the
Jakarta Lucene demos. It walks you through some basic installation and configuration.
</p>
</section>

<section name="About the Demos">
<p>
The Lucene Demo code is a set of command line example applications that demonstrate various
functionality of Lucene and how one should go about adding it to their
applications.
</p>
</section>

<section name="Setting your classpath">
<p>
First, extract the latest Lucene distribution.
</p>
<p>
You should see the Jakarta Lucene jar file in the directory you created
when you extracted the archive. It should be named something like
<b>lucene-{version}.jar</b>.
</p>
<p>
You should also see a file called <b>lucene-demos-{version}.jar</b>.
Put both of these files in your Java CLASSPATH.
</p>
</section>

<section name="Indexing Files">
<p>
Once you've gotten this far you're probably itching to go. Let's <b> build an index!</b>
Assuming you've set your classpath correctly, just type
"java org.apache.lucene.demo.IndexFiles {full-path-to-lucene}/src". This will produce
a subdirectory called "index" which will contain an index of all of the Lucene
source code.
</p>
<p>
<b> To search the index </b> type "java org.apache.lucene.demo.SearchFiles". You'll be prompted
for a query. Type in a swear word and press the enter key. You'll see that the Lucene
developers are very well mannered: you get no results. Now try entering the word "vector".
That should return a whole bunch of documents. The results will page at every tenth
result and ask you whether you want more results.
</p>
</section>

<section name="About the code...">
<p>
<a href="demo2.html">read on>>></a>
</p>
</section>

</body>
</document>
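
A note on the classpath step in demo.xml above: if the demo commands fail with
a class-not-found error, a quick check like the following may help. It is only
a sketch and not part of this commit. ClasspathCheck and isOnClasspath are
hypothetical names; the two class names being probed are the ones this commit
refers to.

```java
/** Reports whether the Lucene and demo classes are visible on the classpath (hypothetical helper). */
class ClasspathCheck {

    /** Returns true if the named class can be found, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.lucene.index.IndexWriter", // expected in lucene-{version}.jar
            "org.apache.lucene.demo.IndexFiles"    // expected in lucene-demos-{version}.jar
        };
        for (int i = 0; i < names.length; i++) {
            String status = isOnClasspath(names[i]) ? "found" : "MISSING";
            System.out.println(names[i] + ": " + status);
        }
    }
}
```

Run it with the same CLASSPATH you intend to use for IndexFiles and
SearchFiles; a MISSING line suggests which jar was left out.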

1.1 jakarta-lucene/xdocs/demo2.xml

Index: demo2.xml
===================================================================
<?xml version="1.0"?>
<document>
<properties>
<author email="acoliver@apache.org">Andrew C. Oliver</author>
<title>Jakarta Lucene - Basic Demo Sources Walkthrough</title>
</properties>
<body>

<section name="About the Code">
<p>
In this section we walk through the sources behind the basic Lucene demo such as where to
find it, its parts and their function. This section is intended for Java developers
wishing to understand how to use Jakarta Lucene in their applications.
</p>
</section>

<section name="Location of the source">
<p>
Relative to the directory created when you extracted Lucene or retrieved it from CVS, you
should see a directory called "src" which in turn contains a directory called "demo".
This is the root for all of the Lucene demos. Under this directory is org/apache/lucene/demo;
this is where all the Java sources live.
</p>
<p>
Within this directory you should see the IndexFiles class we executed earlier. Bring that
up in vi or your alternative text editor and let's take a look at it.
</p>
</section>

<section name="IndexFiles">
<p>
As we discussed in the previous walkthrough, the IndexFiles class creates a Lucene Index.
Let's take a look at how it does this.
</p>
<p>
The first substantial thing the main function does is instantiate an instance
of IndexWriter. It passes a string called "index" and a new instance of a class called
"StandardAnalyzer". The "index" string is the name of the directory that all index information
should be stored in. Because we're not passing any path information, one must assume this
will be created as a subdirectory of the current directory (if it does not already exist). On
some platforms this may actually result in it being created in other directories (such as
the user's home directory).
</p>
<p>
The <b>IndexWriter</b> is the main class responsible for creating indices. To use it you
must instantiate it with a path that it can write the index into. If this path does not
exist it will create it; otherwise it will refresh the index living at that path. You
must also pass an instance of <b>org.apache.lucene.analysis.Analyzer</b>.
</p>
<p>
The <b>Analyzer</b>, in this case the <b>StandardAnalyzer</b>, is little more than a standard Java
Tokenizer, converting all strings to lowercase and filtering out useless words from the index.
By useless words I mean common language words such as articles (a, an, the) and other words that
would be useless for searching. It should be noted that there are different rules for every
language, and you should use the proper analyzer for each. Lucene currently provides Analyzers
for English and German.
</p>
<p>
Looking down further in the file, you should see the indexDocs() code. This recursive function
simply crawls the directories and uses FileDocument to create Document objects. The Document
is simply a data object to represent the content in the file as well as its creation time and
location. These instances are added to the indexWriter. Take a look inside FileDocument. It's
not particularly complicated; it just adds fields to the Document.
</p>
<p>
As you can see there isn't much to creating an index. The devil is in the details. You may also
wish to examine the other samples in this directory, particularly the IndexHTML class. It is
a bit more complex but builds upon this example.
</p>
</section>

<section name="Searching Files">
<p>
The SearchFiles class is quite simple. It primarily collaborates with an IndexSearcher, StandardAnalyzer
(which is used in the IndexFiles class as well) and a QueryParser. The query parser is constructed
with an analyzer used to interpret your query in the same way the Index was interpreted: finding
the end of words and removing useless words like 'a', 'an' and 'the'. The Query object contains the
results from the QueryParser which is passed to the searcher. The searcher results are returned in
a collection of Documents called "Hits" which is then iterated through and displayed to the user.
</p>
</section>

<section name="The Web example...">
<p>
<a href="demo3.html">read on>>></a>
</p>
</section>

</body>
</document>
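
A note on the Analyzer description in demo2.xml above: the idea of lowercasing
text and dropping "useless" words can be shown without Lucene at all. The
following is a toy sketch, not Lucene's implementation and not part of this
commit. StopWordSketch, analyze and the tiny stop list are all made up for
illustration; Lucene's analyzers use their own tokenizers and stop lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy illustration of what a stop-word analyzer does; this is not Lucene code. */
class StopWordSketch {

    // Hypothetical, deliberately tiny stop list.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "of", "and"));

    /** Lowercases the text, splits it on non-letters, and drops stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        String[] words = text.toLowerCase(Locale.ROOT).split("[^a-z]+");
        for (String word : words) {
            // split() can yield an empty first element when the text starts with a separator
            if (word.length() > 0 && !STOP_WORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Vector of an Index")); // prints [vector, index]
    }
}
```

The same analysis has to be applied to the query as to the indexed text,
which is why SearchFiles hands an analyzer to the QueryParser.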

1.1 jakarta-lucene/xdocs/demo3.xml

Index: demo3.xml
===================================================================
<?xml version="1.0"?>

<document>
<properties>
<author email="acoliver@apache.org">Andrew C. Oliver</author>
<title>Jakarta Lucene - Building and Installing the Basic Demo</title>
</properties>
<body>

<section name="About this Document">
<p>
This document is intended as a "getting started" guide to installing and running the
Jakarta Lucene web application demo. This guide assumes that you have read the
information in the previous two examples or already know it anyhow. We'll use
Tomcat 4.0.1 as our reference web container. These demos should work with nearly
any container, but it is up to you to adapt them appropriately.
</p>
</section>

<section name="About the Demos">
<p>
The Lucene Web Application demo is a template web application intended for deployment
on Tomcat or a similar web container. It's NOT designed as a "best practices"
implementation by ANY means. It's more of a "hello world" type Lucene Web App.
The purpose of this application is to demonstrate Lucene. With that being said,
it should be relatively simple to create a small searchable website in Tomcat or
a similar application server.
</p>
</section>

<section name="Indexing Files">
<p>
Once you've gotten this far you're probably itching to go.
Let's start by creating the index you'll need for the web examples.
Since you've already set your classpath in the previous examples,
all you need to do is type
<b> "java org.apache.lucene.demo.IndexHTML -create -index {index-dir} .."</b>.
You'll need to do this from your {tomcat}/webapps/luceneweb directory. {index-dir}
should be a directory that Tomcat has permission to read and write, but is
outside of a web accessible context. By default the webapp is configured
to look in <b>/opt/lucene/index</b> for this index.
</p>
</section>

<section name="Deploying the Demos">
<p>Located in your distribution directory you should see
a war file called luceneweb.war. Copy this to your
{tomcat-home}/webapps directory. You may need to restart
Tomcat. </p>
</section>

<section name="Configuration">
<p>
From your Tomcat directory look in the webapps/luceneweb subdirectory. If it's not
present, try browsing to "http://localhost:8080/luceneweb" then look again.
Edit a file called configuration.jsp. Ensure that the indexLocation is equal to the
location you used for your index. You may also customize the appTitle and appfooter
strings as you see fit. Once you have finished altering the configuration you should
restart Tomcat. You may also wish to update the war file by typing
<b>jar -uf luceneweb.war configuration.jsp</b> from the luceneweb subdirectory.
(The u option is not available in all versions of jar. If yours lacks it, recreate the war file instead.)
</p>
</section>

<section name="Running the Demos">
<p>Now you're ready to roll. In your browser, set the url to "http://localhost:8080/luceneweb",
enter "test" and the number of items per page, and press search.</p>
<p>You should now be looking either at a number of results (provided you didn't erase the
Tomcat examples) or nothing. Try other search terms. Depending on the number of items
per page you set and results returned, there may be a link at the bottom that says "more results>>",
clicking it goes to subsequent pages. If you get an error regarding opening the index, then you
probably set the path in "configuration" incorrectly or Tomcat doesn't have permissions to the
index (or you skipped the step of creating it).</p>
</section>

<section name="About the code...">
<p>
If you want to know more about how this web app works or how to customize it then
<a href="demo4.html">read on>>></a>.
</p>
</section>

</body>
</document>
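
A note on the troubleshooting advice in demo3.xml above: when the web app
reports an error opening the index, the usual suspects (wrong path, missing
permissions, index never created) can be checked from the command line. This
is only a sketch and not part of this commit. IndexLocationCheck and
problemWith are hypothetical names, and the check says nothing about whether
the files form a valid Lucene index. Run it as the same user Tomcat runs as.

```java
import java.io.File;

/** Checks the obvious reasons an index directory cannot be opened (hypothetical helper). */
class IndexLocationCheck {

    /** Returns null if the directory looks usable, otherwise a short reason. */
    static String problemWith(File indexDir) {
        if (!indexDir.exists()) {
            return "does not exist";
        }
        if (!indexDir.isDirectory()) {
            return "is not a directory";
        }
        if (!indexDir.canRead()) {
            return "is not readable by this user";
        }
        String[] files = indexDir.list();
        if (files == null || files.length == 0) {
            return "is empty (was the index created?)";
        }
        return null;
    }

    public static void main(String[] args) {
        // Default matches the indexLocation shipped in configuration.jsp.
        String path = args.length > 0 ? args[0] : "/opt/lucene/index";
        String problem = problemWith(new File(path));
        System.out.println(path + (problem == null ? " looks usable" : " " + problem));
    }
}
```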

1.1 jakarta-lucene/xdocs/demo4.xml

Index: demo4.xml
===================================================================
<?xml version="1.0"?>
<document>
<properties>
<author email="acoliver@apache.org">Andrew C. Oliver</author>
<title>Jakarta Lucene - Basic Demo Sources Walkthrough</title>
</properties>
<body>

<section name="About the Code">
<p>
In this section we walk through the sources behind the basic Lucene Web Application demo:
where to find it, its parts, and their function. This section is intended for Java developers
wishing to understand how to use Jakarta Lucene in their applications or for those involved
in deploying web applications based on Lucene.
</p>
</section>

<section name="Location of the source (developers/deployers)">
<p>
Relative to the directory created when you extracted Lucene or retrieved it from CVS, you
should see a directory called "src" which in turn contains a directory called "jsp".
This is the root of the Lucene web demo.
</p>
<p>
Within this directory you should see the index.jsp file. Bring this up in vi or your
editor of choice.
</p>
</section>

<section name="index.jsp (developers/deployers)">
<p>
This jsp page is pretty boring by itself. All it does is include a header, display a form and
include a footer. If you look at the form, it has two fields: query (where you enter your
search criteria) and maxresults where you specify the number of results per page. If you look
at the form tag, you'll notice it uses the get method as opposed to the post. While this is
considered deprecated functionality by the latest w3c specs, it's unlikely to go away due to the
usefulness of being able to bookmark things like searches. By the structure of this JSP it should
be easy to customize it without even editing this particular file. You could simply change the
header and footer. Let's look at the header.jsp (located in the same directory) next.
</p>
</section>

<section name="header.jsp (developers/deployers)">
<p>
The header is also very simple by itself. The only thing it does is include the configuration.jsp
(which you looked at in the last section of this guide) and set the title and a brief header. This
would be a good place to put your own custom HTML to "pretty" things up a bit. We won't cover the
footer because all it does is display the footer and close your tags. Let's look at the results.jsp,
the meat of this application next.
</p>
</section>

<section name="results.jsp (developers)">
<p>
The results.jsp has a lot more functionality. Much of it is for paging the search results; we'll not
cover this as it's commented well enough. It does not perform any optimizations such as caching results,
etc. as that would make this a more complex example. The first thing in this page is the actual imports
for the Lucene classes and Lucene demo classes. These classes are loaded from the jars included in the
WEB-INF/lib directory in the final war file.
</p>
<p>
You'll notice that this file includes the same header and footer as the "index.jsp". From there the jsp
constructs an IndexSearcher with the "indexLocation" that was specified in the "configuration.jsp". If there
is an error of any kind in opening the index, it is displayed to the user and a boolean flag is set to tell
the rest of the sections of the jsp not to continue.
</p>
<p>
From there, this jsp attempts to get the search criteria, the start index (used for paging) and the maximum
number of results per page. If the maximum results per page is not set or not valid then it and the
start index are set to default values. If only the start index is invalid it is set to a default value. If
the criteria isn't provided then a servlet error is thrown (it is assumed that this is the result of url tampering
or some form of browser malfunction).
</p>
<p>
The jsp moves on to construct an Analyzer (results.jsp as committed uses a StopAnalyzer) to analyze the search criteria. It
is passed to the QueryParser along with the criteria to construct a Query object. You'll also notice the
string literal "contents" included. This is to specify that the search should cover the contents and not
the title, url or some other field in the indexed documents. If there is any error in constructing a Query
object an error is displayed to the user.
</p>
<p>
In the next section of the jsp the IndexSearcher is asked to search given the query object. The results are
returned in a collection called "hits". If hits.length() returns 0 then an error
is displayed to the user and the error flag is set.
</p>
<p>
Finally the jsp iterates through the hits collection and displays properties of the "Document" objects we talked
about in the first walkthrough. These objects contain "known" fields specific to their indexer (in this case
"IndexHTML" constructs a document with "url", "title" and "contents"). You'll notice that these results are paged
but the search is repeated every time. This is an area where optimization could improve performance for large
result sets.
</p>
</section>

<section name="More sources (developers)">
<p>
There are additional sources used by the web app that were not specifically covered by either walkthrough. For
example the HTML parser, the IndexHTML class and HTMLDocument class. These are very similar to the classes
covered in the first example, however they have properties specific to parsing and indexing HTML. This is
beyond our scope; however, by now you should feel like you're "getting started" with Lucene.
</p>
</section>

<section name="Where to go from here? (Everyone!)">
<p>
There are a number of things this demo doesn't do or doesn't do quite right. For instance, you may
have noticed that documents in the root context are unreachable (unless you reconfigure Tomcat to
support that context or redirect to it). Anywhere the directory doesn't quite match the context mapping,
you'll have a broken link in your results. Indexing non-local files and other such needs are not
supported, and there may be security issues with running the indexing application from
your webapps directory. There are a number of things left for you, the implementor or developer, to do.
</p>
<p>
In time some of these things may be added to Lucene as features (if you've got a good idea we'd love to hear it!),
but for now: this is where you begin and the search engine/indexer ends. Lastly, one would assume you'd
want to follow the above advice and customize the application to look a little more fancy than black on
white with "Lucene Template" at the top. We'll see you on the Lucene Users' or Developers' mailing lists!
</p>
</section>

<section name="When to contact the Author">
<p>
Please resist the urge to contact the authors of this document (without bribes of fame and fortune attached). First
contact the <a href="http://jakarta.apache.org/site/mail.html">mailing lists</a>. That being said, feedback
and modifications to this document and samples are ever so greatly appreciated. They are just best sent to the
lists so that everyone can share in them. Certainly you'll get the most help there as well.
Thanks for understanding.
</p>
</section>

</body>
</document>
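
A note on the parameter handling described in demo4.xml above: the rule "an
invalid maximum resets both values, an invalid start index resets only itself"
falls out of parsing both numbers in one try block, maximum first. The
following is a minimal sketch of that behaviour, not part of this commit.
ParamDefaults and parse are hypothetical names; the defaults of 0 and 50 are
the initial values used in results.jsp.

```java
/** Sketch of how results.jsp falls back to defaults for its two numeric parameters. */
class ParamDefaults {

    static final int DEFAULT_START = 0; // initial startindex in results.jsp
    static final int DEFAULT_MAX = 50;  // initial maxpage in results.jsp

    /** Returns {startindex, maxpage} parsed from the raw request parameter strings. */
    static int[] parse(String startVal, String maxresults) {
        int startindex = DEFAULT_START;
        int maxpage = DEFAULT_MAX;
        try {
            maxpage = Integer.parseInt(maxresults);  // parsed first
            startindex = Integer.parseInt(startVal); // only reached if maxresults was valid
        } catch (NumberFormatException e) {
            // null or malformed input: keep whichever defaults are still in place
        }
        return new int[] { startindex, maxpage };
    }

    public static void main(String[] args) {
        int[] both = parse("50", "100");
        int[] badMax = parse("50", "abc");
        int[] noStart = parse(null, "25");
        System.out.println(both[0] + " " + both[1]);       // both values accepted
        System.out.println(badMax[0] + " " + badMax[1]);   // both fall back to defaults
        System.out.println(noStart[0] + " " + noStart[1]); // only the start index falls back
    }
}
```

Parsing each parameter in its own try block would let a valid start index
survive an invalid maximum; the single block shown here mirrors the behaviour
the walkthrough describes.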

1.5 +1 -0 jakarta-lucene/xdocs/stylesheets/project.xml

Index: project.xml
===================================================================
RCS file: /home/cvs/jakarta-lucene/xdocs/stylesheets/project.xml,v
retrieving revision 1.4
retrieving revision 1.5
diff -u -r1.4 -r1.5
--- project.xml 2 Oct 2001 15:54:16 -0000 1.4
+++ project.xml 26 Jan 2002 15:01:32 -0000 1.5
@@ -15,6 +15,7 @@

<menu name="Documentation">
<item name="FAQ" href="http://www.lucene.com/cgi-bin/faq/faqmanager.cgi" target="_blank"/>
+ <item name="Getting Started" href="/gettingstarted.html"/>
<item name="Articles" href="/resources.html"/>
<item name="Javadoc" href="/api/index.html"/>
</menu>
