Mailing List Archive: cvs commit: jakarta-lucene/xdocs luceneplan.xml

otis 2002/06/04 08:29:32

Modified: xdocs luceneplan.xml
Log:
- Fixed spelling a bit.
- Nukes trailing blank spaces.

Revision Changes Path
1.4 +68 -68 jakarta-lucene/xdocs/luceneplan.xml

Index: luceneplan.xml
===================================================================
RCS file: /home/cvs/jakarta-lucene/xdocs/luceneplan.xml,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -r1.3 -r1.4
--- luceneplan.xml 24 Feb 2002 16:33:21 -0000 1.3
+++ luceneplan.xml 4 Jun 2002 15:29:32 -0000 1.4
@@ -1,5 +1,5 @@
<?xml version="1.0" encoding="UTF-8"?>
-
+
<document>
<properties>
<title>Plan for enhancements to Lucene</title>
@@ -8,7 +8,7 @@
</authors>
</properties>
<body>
-
+
<section name="Purpose">
<p>
The purpose of this document is to outline plans for
@@ -21,8 +21,8 @@
The best reference is <a href="http://www.htdig.org">
htDig</a>, though it is not quite as sophisticated as
Lucene, it has a number of features that make it
- desireable. It however is a traditional c-compiled app
- which makes it somewhat unpleasent to install on some
+ desirable. It however is a traditional c-compiled app
+ which makes it somewhat unpleasant to install on some
platforms (like Solaris!).
</p>
<p>
@@ -30,42 +30,42 @@
community for an initial reaction, advice, feedback and
consent. Following this it will be submitted to the
Lucene user community for support. Although, I'm (Andy
- Oliver) capable of providing these enhancements by
- myself, I'd of course prefer to work on them in concert
+ Oliver) capable of providing these enhancements by
+ myself, I'd of course prefer to work on them in concert
with others.
</p>
<p>
- While I'm outlaying a fairly large featureset, these can
+ While I'm outlaying a fairly large feature set, these can
be implemented incrementally of course (and are probably
best if done that way).
</p>
</section>
-
+
<section name="Goal and Objectives">
<p>
The goal is to provide features to Lucene that allow it
- to be used as a dropin search engine. It should provide
+ to be used as a drop-in search engine. It should provide
many of the features of projects like <a
href="http://www.htdig.org">htDig</a> while surpassing
- them with unique Lucene features and capabillities such as
+ them with unique Lucene features and capabilities such as
easy installation on and java-supporting platform,
- and support for document fields and field searches. And
+ and support for document fields and field searches. And
of course, <a href="http://apache.org/LICENSE">
a pragmatic software license</a>.
</p>
<p>
To reach this goal we'll implement code to support the
following objectives that augment but do not replace
- the current Lucene featureset.
+ the current Lucene feature set.
</p>
<ul>
<li>
- Document Location Independance - meaning mapping
+ Document Location Independence - meaning mapping
real contexts to runtime contexts.
Essentially, if the document is at
/var/www/htdocs/mydoc.html, I probably want it
indexed as
- http://www.bigevilmegacorp.com/mydoc.html.
+ http://www.bigevilmegacorp.com/mydoc.html.
</li>
<li>
Standard methods of creating central indicies -
@@ -73,21 +73,21 @@
many environments than is *remote* indexing (for
instance http). I would suggest that most folks
would prefer that general functionality be
- suppored by Lucene instead of having to write
+ supported by Lucene instead of having to write
code for every indexing project. Obviously, if
what they are doing is *special* they'll have to
- code, but general document indexing accross
- webservers would not qualify.
+ code, but general document indexing across
+ web servers would not qualify.
</li>
<li>
- Document interperatation abstraction - currently
+ Document interpretation abstraction - currently
one must handle document object construction via
custom code. A standard interface for plugging
- in format handlers should be supported.
+ in format handlers should be supported.
</li>
<li>
Mime and file-extension to document
- interperatation mapping.
+ interpretation mapping.
</li>
</ul>
</section>
@@ -128,7 +128,7 @@
</li>
<li>
replacement type - the type of
- replacewith path: relative, url or
+ replace with path: relative, URL or
path.
</li>
<li>
@@ -153,8 +153,8 @@
0 - Long.MAX_VALUE.
</li>
<li>
- SleeptimeBetweenCalls - can be used to
- avoid flooding a machine with too many
+ SleeptimeBetweenCalls - can be used to
+ avoid flooding a machine with too many
requests
</li>
<li>
@@ -163,12 +163,12 @@
inactivity.
</li>
<li>
- IncludeFilter - include only items
- matching filter. (can occur mulitple
+ IncludeFilter - include only items
+ matching filter. (can occur multiple
times)
</li>
<li>
- ExcludeFilter - exclude only items
+ ExcludeFilter - exclude only items
matching filter. (can occur multiple
times)
</li>
@@ -196,9 +196,9 @@
(probably from the command line) read
this properties file and get them from
it. Command line options override
- the properties file in the case of
+ the properties file in the case of
duplicates. There should also be an
- enivironment variable or VM parameter to
+ environment variable or VM parameter to
set this.
</li>
</ul>
@@ -209,8 +209,8 @@
</p>
<p>
This should extend the AbstractCrawler and
- support any addtional options required for a
- filesystem index.
+ support any additional options required for a
+ file system index.
</p>


@@ -218,12 +218,12 @@
<b>HTTP Crawler </b>
</p>
<p>
- Supports the AbstractCrawler options as well as:
+ Supports the AbstractCrawler options as well as:
</p>
<ul>
<li>
- span hosts - Wheter to span hosts or not,
- by default this should be no.
+ span hosts - Whether to span hosts or not,
+ by default this should be no.
</li>
<li>
restrict domains - (ignored if span
@@ -237,11 +237,11 @@
recurse and go to
/nextcontext/index.html this option says
to also try /nextcontext to get the dir
- lsiting)
+ listing)
</li>
<li>
map extensions -
- (always/default/never/fallback). Wether
+ (always/default/never/fallback). Whether
to always use extension mapping, by
default (fallback to mime type), NEVER
or fallback if mime is not available
@@ -254,12 +254,12 @@
</ul>

</section>
-
+
<section name="MIMEMap">
<p>
A configurable registry of document types, their
- description, an identifyer, mime-type and file
- extension. This should map both MIME -> factory
+ description, an identifier, mime-type and file
+ extension. This should map both MIME -> factory
and extension -> factory.
</p>
<p>
@@ -287,7 +287,7 @@
<td>"html,htm"</td>
<td></td>
<td>HTMLDocumentFactory</td>
- </tr>
+ </tr>
</table>
</section>
<section name="DocumentFactory">
@@ -300,17 +300,17 @@
</section>
<section name="FieldMapping classes">
<p>
- A class taht maps standard fields from the
+ A class that maps standard fields from the
DocumentFactories into *fields* in the Document objects
they create. I suggest that a regular expression system
or xpath might be the most universal way to do this.
For instance if perhaps I had an XML factory that
represented XML elements as fields, I could map content
- from particular fields to ther fields or supress them
+ from particular fields to their fields or suppress them
entirely. We could even make this configurable.
</p>
<p>
-
+
for example:
</p>
<ul>
@@ -333,48 +333,48 @@
title.suppress=false
</li>
</ul>
- <p>
- In this example we map html documents such that all
- fields are suppressed but author and title. We map
- author and title to anything in the content matching
- author: (and x characters). Okay my regular expresions
+ <p>
+ In this example we map html documents such that all
+ fields are suppressed but author and title. We map
+ author and title to anything in the content matching
+ author: (and x characters). Okay my regular expresions
suck but hopefully you get the idea.
</p>
</section>
<section name="Final Thoughts">
<p>
- We might also consider eliminating the DocumentFactory
- entirely by making an AbstractDocument from which the
- current document object would inherit from. I
- experimented with this locally, and it was a relatively
- minor code change and there was of course no difference
- in performance. The Document Factory classes would
- instead be instances of various subclasses of
+ We might also consider eliminating the DocumentFactory
+ entirely by making an AbstractDocument from which the
+ current document object would inherit from. I
+ experimented with this locally, and it was a relatively
+ minor code change and there was of course no difference
+ in performance. The Document Factory classes would
+ instead be instances of various subclasses of
AbstractDocument.
</p>
<p>
- My inspiration for this is HTDig (http://www.htdig.org/).
- While this goes slightly beyond what HTDig provides by
- providing field mapping (where HTDIG is just interested
- in Strings/numbers wherever they are found), it provides
- at least what I would need to use this as a dropin for
- most places I contract at (with the obvious exception of
- a default set of content handlers which would of course
+ My inspiration for this is HTDig (http://www.htdig.org/).
+ While this goes slightly beyond what HTDig provides by
+ providing field mapping (where HTDIG is just interested
+ in Strings/numbers wherever they are found), it provides
+ at least what I would need to use this as a drop-in for
+ most places I contract at (with the obvious exception of
+ a default set of content handlers which would of course
develop naturally over time).
</p>
<p>
- I am able to certainly contribute to this effort if the
- development community is open to it. I'd suggest we do
- it iteratively in stages and not aim for all of this at
+ I am able to certainly contribute to this effort if the
+ development community is open to it. I'd suggest we do
+ it iteratively in stages and not aim for all of this at
once (for instance leave out the field mapping at first).
</p>
<p>
-
- Anyhow, please give me some feedback, counter
- suggestions, let me know if I'm way off base or out of
+
+ Anyhow, please give me some feedback, counter
+ suggestions, let me know if I'm way off base or out of
line, etc. -Andy
</p>
</section>
-
+
</body>
</document>

--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>