Mailing List Archive

Fw: Configuration RFC
OK, this is my proposal for the crawler configuration. Tell me if
I'm reinventing the wheel:

Overview
--------

I distinguish (logically, not necessarily on a class level) between 5
different types of components:
- "filters" are parts of the message pipeline. They get a message and either
pass it on or not. They are put into a messageHandler pipeline and are
notified about their insertion. Filters don't know about each other. If they
share common data, this has to be kept on the Service level.
- "services" are things like the host manager, probably a logfile manager,
and other things that the other components share. The other components
should be able to access these services, but the services should not know
about them.
- "storages" (or sinks) are where the documents go after they have been
fetched
- "sources" . are sources of messages (i.e. URLMessages). They typically run
within their own thread of control and know the messageHandler.
- then there are some "steering" components that monitor the pipeline and
probably reconfigure it. They build the infrastructure. The ThreadMonitor
gathers runtime information. If we want to have this information displayed
the way we do it now, we need it to know all the other components. I'd leave
that as it is at the moment, we could change it later. But I'd like the
configuration component to know as little as possible about the other
components. See below how I'd like to achieve that.


Layer Diagram
-------------


----------------------|-------------------------------------
                      |  MessageHandler(?)
                      |  - - - - - - - - - - - - - - - - - -
ThreadMon->           |  source | filter | filter... | storage
                      |
                      |--------------|----------------------
Configurator->        |              v
                      |           Services
----------------------|-------------------------------------

I'm not quite sure where the MessageHandler fits in here. Is it also a
service? I like a layered model better.
The other possibility would be to regard all components as being independent
and on the same level. But the configurator keeps track of the interactions
between them.

Configuration
-------------

Prerequisite: All the components mentioned are implemented as JavaBeans
(tada, the main idea today!)

Then we can use bean utility classes to set their properties. I've had a
look at jakarta-commons which contains a BeanUtils package which should
contain whatever we need.
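
To give an idea of what that buys us, here is a tiny stand-in example (not
LARM code, just a throwaway bean): a line like "Fetcher.threadNumber=25"
would essentially boil down to a single BeanUtils call that converts the
string and invokes the setter.

// stand-in bean, only to show what commons-beanutils does for us
public class BeanUtilsDemo {
    public static class Fetcher {
        private int threadNumber;
        public int getThreadNumber() { return threadNumber; }
        public void setThreadNumber(int n) { threadNumber = n; }
    }
    public static void main(String[] args) throws Exception {
        Fetcher fetcher = new Fetcher();
        // "25" is converted to int and setThreadNumber(25) is called via reflection
        org.apache.commons.beanutils.BeanUtils.setProperty(fetcher, "threadNumber", "25");
        System.out.println(fetcher.getThreadNumber());   // prints 25
    }
}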

Since every service/filter is a singleton, we can distinguish it in the
property file by its class name. If we ever need two instances of a class,
we'd have to change that. But for simplicity, I think this will do well at
this time.

Then I think we can use a syntax of the property file like

<ClassName>.<propertyName>=<PropertyValue>

"ClassName" can be fully qualified (i.e. with package) or we could assume a
default package like "de.lanlab.larm.fetcher". This could serve us well if
the package name changes.
[If the class name is fully qualified, however, we'd have a problem with
nested property names like "package.class.foo.bar".]


The Configurator
----------------

The configurator should be capable of the following (a rough sketch follows
below the list):
- separate class names from property names
- initialize the classes found
- register the instances in some kind of naming service (e.g. a global
HashMap)
- find and resolve dependencies among the different components
- set the properties according to the props file (using BeanUtils'
PropertyUtils.set(|Mapped|Indexed)Property())
- provide decent error handling (e.g. return line numbers if exceptions
are thrown)
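
Here is the rough sketch I mean (everything except the commons-beanutils
call is made up for illustration; the two-step wiring from the next section
is left out, and error handling is reduced to reporting the line number):

import java.util.*;
import org.apache.commons.beanutils.BeanUtils;

public class Configurator {
    private static final String DEFAULT_PACKAGE = "de.lanlab.larm.fetcher.";
    private final Map components = new HashMap();    // poor man's naming service

    public void configure(List lines) {               // one "Class.property=value" string per entry
        for (int i = 0; i < lines.size(); i++) {
            String line = ((String) lines.get(i)).trim();
            if (line.length() == 0 || line.startsWith("#")) continue;
            try {
                int eq  = line.indexOf('=');
                int dot = line.indexOf('.');           // the first dot divides class from property
                String className = line.substring(0, dot);
                String property  = line.substring(dot + 1, eq);
                String value     = line.substring(eq + 1);
                Object bean = components.get(className);
                if (bean == null) {                    // initialize the singleton and register it
                    bean = Class.forName(DEFAULT_PACKAGE + className).newInstance();
                    components.put(className, bean);
                }
                BeanUtils.setProperty(bean, property, value);  // simple, nested, indexed or mapped
            } catch (Exception e) {                    // decent error handling: report the line number
                throw new RuntimeException("config error in line " + (i + 1) + ": " + e);
            }
        }
    }
}

Splitting at the first dot is also exactly why fully qualified class names
would clash with nested property names, as noted above.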


Connecting different components:
--------------------------------

I don't want components to create other components or services. This should
be done by the configurator. I can imagine two ways how components may be
connected:
- They tell the configurator that they need this or that service, e.g. the
VisitedFilter needs the HostManager.
- A property contains service names. Then these services have to be set up
before the property is set.
Therefore, the config process needs to be at least twofold: In a first step
the components are set up and initialized, and in a second step, connections
between components are set up.
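
To make the second step concrete, it could look roughly like this (the
interface and all names here are invented for illustration; they don't exist
in LARM yet):

// sketch of the two-step wiring
interface Configurable {
    String[] componentsNeeded();                       // step 1: declare dependencies
    void setComponent(String name, Object component);  // step 2: get them injected
}

class VisitedFilter implements Configurable {
    private Object hostManager;                        // would be a HostManager in real code
    public String[] componentsNeeded() { return new String[] { "HostManager" }; }
    public void setComponent(String name, Object component) {
        if ("HostManager".equals(name)) hostManager = component;
    }
}

// in the configurator, after all components are initialized and registered:
//   for each component c: for each name n in c.componentsNeeded():
//       c.setComponent(n, namingService.get(n));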

Config File Overlays
--------------------

I had the same idea as Andrew about how config files should be able to
overwrite each other.
Internally all properties are treated equally, but the user has to be able
to distinguish several layers of configuration: e.g. a specific setup of
the components that is reused every time, but different domains to be
crawled.
Therefore I propose that several config files can be specified which are
loaded one after the other, each possibly overwriting properties already
specified, e.g.

java ...
de.lanlab... -Iglobal.properties -Imycrawl.properties -DFetcher.threads=50

which means: global.properties is loaded first, then mycrawl.properties is
included and probably overwrites some of the settings in global.properties,
and at last the property Fetcher.threads is set manually.

I know that the JRun server uses a similar method: There you have one
global.properties and a local.properties for each server process instance. I
always found this very useful.
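
A minimal sketch of that overlay mechanism with plain java.util.Properties
(the -I/-D handling is simplified and the argument parsing is made up here):

import java.io.FileInputStream;
import java.util.Properties;

public class ConfigOverlays {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("-I")) {              // include a property file
                props.load(new FileInputStream(args[i].substring(2)));
            } else if (args[i].startsWith("-D")) {       // single override, e.g. -DFetcher.threads=50
                int eq = args[i].indexOf('=');
                props.setProperty(args[i].substring(2, eq), args[i].substring(eq + 1));
            }
        }
        // later files and -D options simply overwrite earlier values
        System.out.println(props.getProperty("Fetcher.threads"));
    }
}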


Example Property File:
---------------------

Configurator.services=HostManager,MessageHandler,LoggerService,HTTPProtocol
# do we need this?

# MessageHandler is initialized first and gets the filters property set.
# those filters have to be initialized in a second step, when all is set up.
MessageHandler.filters=URLLengthFilter,URLScopeFilter,RobotExclusionFilter,URLVisitedFilter,KnownPathsFilter
# configurator knows here that we need a MessageHandler, so the
# Configurator.services line above is redundant in this case

#
LoggerService.baseDir=logs/
LoggerService.logs=store,links # defines property names used below
# LoggerService.logs.store.class=SimpleLogger
LoggerService.logs.store.fileName=store.log
LoggerService.logs.links.fileName=links.log


StoragePipeline.docStorages=LogStorage
StoragePipeline.linkStorages=LinkLogStorage,MessageHandler

LogStorage.log=store # the log name from the logger service
LinkLogStorage.log=links

# LuceneStorage.analyzer=org.apache.lucene.analysis.StandardAnalyzer
# LuceneStorage.createIndex=true
# LuceneStorage.indexName=luceneIndex
# LuceneStorage.fieldInfos=url,content
# LuceneStorage.fieldInfos.url = Index,Store
# LuceneStorage.fieldInfos.content = Index,Store,Tokenize


# manually define host synonyms. I don't know if there's a better way than
# the following, and if the method used here is possible anyway (one property
# two times)
HostManager.synonym=www.foo1.bar.com,www.foo2.bar.com
HostManager.synonym=www1.foo.com,www2.foo.com
# or
# HostManager.addSynonym=www.foo1.bar.com,www.foo2.bar.com
# HostManager.addSynonym=www1.foo.com,www2.foo.com
# coded as void setAddSynonym(String) - not so nice

# alternative:
HostManager.synonyms[0]=www.foo1.bar.com,www.foo2.bar.com
HostManager.synonyms[1]=www1.foo.com,www2.foo.com
# but this would prevent adding further synonyms in other config files
# (see the note on indexed properties after this example)

URLScopeFilter.inScope=http://.*myHost.*
# or additionally URLScopeFilter.outOfScope=... ?

# RobotExclusionFilter doesn't have properties. It just needs to know the
# host manager. MessageHandler should make clear that the filter has to be
# initialized. I think both have to provide a method like
# 'String[] componentsNeeded()' that returns the component names to set up.
# MessageHandler would return the value as specified in
# "MessageHandler.filters", REFilter would return HostManager

HTTPProtocol.extractGZippedFiles=false

URLLengthFilter.maxURLLength=255

Fetcher.threadNumber=25
Fetcher.docStorage=StoragePipeline
Fetcher.linkStorage=StoragePipeline
# here comes the MIME type stuff, which is not yet implemented. Only HTML is
# parsed, the rest is stored as-is.


# this is an example of another storage:

# SQLStorage.driver=com.ashna.JTurbo.driver.Driver
# SQLStorage.url=jdbc:JTurbo://host/parameters
# SQLStorage.user=...
# SQLStorage.password=...
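
One note on the HostManager.synonyms[0] alternative above: the indexed
syntax would map onto standard JavaBeans indexed property accessors, roughly
like this (a sketch, not the real HostManager):

// sketch of the bean-side accessors that "HostManager.synonyms[0]=..." would be mapped onto
public class HostManager {
    private final java.util.ArrayList synonyms = new java.util.ArrayList();
    public String getSynonyms(int index) { return (String) synonyms.get(index); }
    public void setSynonyms(int index, String value) {
        while (synonyms.size() <= index) synonyms.add(null);   // grow the list on demand
        synonyms.set(index, value);
    }
}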


Some Closing Remarks
--------------------

OK, you made it this far. Very good.
I think with this configuration LARM can be much more than just a crawler.
With a few changes it can also be used as a processor for documents that
come over the network, e.g. from a JMS topic.
I haven't mentioned what I call "Sources" or message producers. These are
active components that run in their own thread and put messages into the
queue.
If we have a JMSStorage and a JMSSource, then the crawler can be divided
into two pieces just from the config file.
part one: Fetcher -> JMSStorage
part two: JMSSource -> LuceneStorage
with all the possibilities for distribution.
Given a different source, one could also imagine feeding the crawler with
files or with URLs from a web frontend.



Clemens




--------------------------------------
http://www.cmarschner.net



Re: Configuration RFC [ In reply to ]
Clemens,

This looks very exciting. I have not had a chance to look at your LARM code,
but this overview is fairly informative.


Question 1.

I think you may have mentioned it, but how do the Sources fit in?

For example, one of the goals I am trying to achieve would be to get a URL
(xyz.html) and then change the URL based on some pattern. So change xyz.html
to xyz.xml and get the .xml as a request.
I think you mentioned below that the source would send another request
message to get the data and then eat the message if it didn't also want to
get the .html file? Or would this be the message handler?

Also, would the source be able to say, I want to get all files which meet
this pattern and blindly attempt to get the set of files with changing
parameters? A message generator?

I know that these are specifics, but I would like to know how these fit into
the architecture. It seems like there could be a URL/message generator which
puts potential URLs into the queue, and if they didn't exist then it would
just log that. Is this what you have architected?


Question 2.

Also, should there be a layer for post-processing, or is that a filter? So if
you got an XML file and wanted to transform it, then you could use a filter?

This sourcing might also work really well with the Cocoon project.

Question 3.
Is there any built in support for only getting new files (or updating
changed URLs) or is that the job of the storage layer?

What would be the unique ID in the storage for each retrieved URL/dynamic
URL? The same URL could have different data.

Question 4.
URLScopeFilter - Is this just a wildcard based system, or does it also
handle full regex?


Question 5.
How do you define the pipeline? Right now you have Standardpipeline, but I
don't see a configuration for what the Standardpipeline is.

Question 6.
Content validator. Is there anywhere which could be programmed in to say:
only the date has changed on this web page, so I don't want to consider it
changed or update it.

Question 7.
Notification - Is there some way to notify (via email) that a change has
occurred to a given file, or that a new file is available? Is the thought that
this would be part of logging?

I know that some of these questions are very specific, but I think it might
provide a validation of good architecture to see how these fit in. I think
that the idea of a message queue that either gets filled by someone's own
generator or based on other links embedded in a web page provides a very
flexible architecture.

I like the idea of the overwriting config files, but I am personally a fan
of one big file that the user can configure. This seems to lend itself to
less debugging. So instead of multiple config files based on the URL, maybe
a master set of config options with the overriding parameters based on the
URL pattern.
Something like

<config>
  <global>
    <follow-linkMatchPattern>*</follow-linkMatchPattern>
  </global>
  <site urlMatch="*.apache.org">
    <follow-linkMatchPattern>*.html</follow-linkMatchPattern>
  </site>
</config>

So here the default would follow all links, but if it were in apache.org
then it would only follow .html links. I don't know if this is a real
parameter, just an example. This is how the apache web server works. Tomcat
works in a similar way, but there are different files (web.xml), although
this is mostly because they are completely different applications
potentially written by different people.


Thanks for sharing.

--Peter



Re: Configuration RFC [ In reply to ]
> I think you may have mentioned it, but how do the Sources fit in.
>
> For example, one of the goal I am trying to get would be to get a URL
> (xyz.html) and then change the url based on some pattern. So change xyz.html
> to xyz.xml and get the .xml as a request.
> I think you mentioned below that the source would send another request
> message to get the data and then eat the message if it didn't also want to
> get the .html file? Or would this be the message handler?

Hm, I think most of your questions could be answered by the files in
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/webcrawler-LARM/doc/

Changing the URL would be very easy with a URLRewriteFilter, e.g.

class RewriteFilter implements Filter {
    public Message handleMessage(Message m) {
        URL u = ((URLMessage) m).getURL();
        // do something with the URL
        ((URLMessage) m).setURL(u);
        return m;
    }
}
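
To pick up your xyz.html -> xyz.xml example, such a filter could look roughly
like this (the class is hypothetical and assumes the URLMessage accessors
used above):

class HtmlToXmlRewriteFilter implements Filter {
    public Message handleMessage(Message m) {
        URLMessage um = (URLMessage) m;
        String s = um.getURL().toExternalForm();
        if (s.endsWith(".html")) {
            try {
                // replace the extension and put the new URL back into the message
                um.setURL(new java.net.URL(s.substring(0, s.length() - 5) + ".xml"));
            } catch (java.net.MalformedURLException e) {
                // keep the original URL if the rewritten one is malformed
            }
        }
        return m;
    }
}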

> Also, would the source be able to say, I want to get all files which meet
> this pattern and blindly attempt to get the set of files with changing
> parameters? A message generator?

I don't really know if I get your point. What do you want to accomplish?

> I know that these are specifics, but I would like to know how these fit into
> the architecture? It seems like there could be a URL/message generator which
> put potential URLs into the queue and if they didn't exist then it would
> just log that. Is this what you have architected?

Yes, from what you're writing I think this would be an application.

> Also, should there be a layer for post processing or is that a filter. So if
> you got an xml file and wanted to transform it, then you could use a filter?

At this point this would be a storage in the storage pipeline, although the
name is a little misleading.
Filters are only used to process links (so called URLMessages) before they
get into the crawler threads. The output of these threads is put into the
"storage", which can be a storage pipeline that works just like the filter
pipeline.
In this storage pipeline you can do with the document whatever you want.
Even post processing. The object put into the storage is a "WebDocument"
which contains the URL, the document's title, mime type, size, date, and a
set of name-value pairs which include the raw document by default.

> This sourcing might also work really well with the Cocoon project.

Yes, probably.

> Is there any built in support for only getting new files (or updating
> changed URLs) or is that the job of the storage layer?

I have written an experimental repository that registers itself as a storage
and a filter. From the storage point of view, it puts all URLs it gets in a
MySQL database. When acting as a filter, it reads them from the database and
adds the date when it was last crawled to the URLMessage. Then the crawling
task sends an "If-Modified-Since" header and stops crawling the document if
it was not modified.
Unfortunately it turned out that the storage itself is way too slow. Slower
than crawling all documents from the start.
I haven't checked it in yet, please let me know if you're interested. The
point I have with it is that some config stuff is included in the source
code, and I'd like to move these URLs out of the source code first. That's
why I put so much emphasis on the configuration issue at this point.

> What would be the unique ID for each retrieved in the storage URL/dynamic
> URL? The same URL could have different data.

Hm, at the moment the URL itself is the unique ID. What parameters could
cause the data to be different? I can only imagine the URL, a cookie and the
time of the crawl. Cookies are managed by the HTTP layer at this time. I
don't even know exactly how cookies are treated at the moment.
To be more specific, I haven't expected a single URL to point to different
kinds of pages, but different URLs to point to the same page. Therefore the
URLs are "normalized" to lower the chance that a URL is ambiguous. I.e.
http://host/path1/./path2 is normalized to http://host/path1/path2
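
(For what it's worth, the JDK 1.4 java.net.URI class does this particular
kind of path normalization out of the box; a small illustration, not what
LARM actually uses:)

// illustration only: java.net.URI (JDK 1.4+) removes "." path segments
public class NormalizeDemo {
    public static void main(String[] args) throws java.net.URISyntaxException {
        java.net.URI u = new java.net.URI("http://host/path1/./path2");
        System.out.println(u.normalize());   // prints http://host/path1/path2
    }
}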

> Question 4.
> URLScopeFilter - Is this just a wildcard based system, or does it also
> handle full regex?

Full Perl5 regex, provided by the Apache ORO library. E.g. I'm using the
regex
http://[^/]*\(uni-muenchen\.de\|lmu\.de\|lrz-muenchen\.de\|leo\.org\|studentenwerk\.mhn\.de\|zhs-muenchen\.de\).*

> Question 5.
> How do you define the pipeline. Right now you have Standardpipeline, but I
> don't see a configuration for what the standardpipeline is.

Please refer to the document I mentioned. I don't know what you mean by
Standardpipeline.

> Question 6
> Content validator. Is there anywhere which would be programmed in to say,
> only the date as changed on this web page and so I don't want to consider it
> changed or update it.

You mean computing some sort of checksum (like the "Nilsimsa" mentioned in a
thread some days ago)? This could probably be done within a storage. But you
need fast access to a repository to accomplish the comparison you mentioned.
And you'd have to download the file to compute that checksum.

> Question 7.
> Notification - Is there someway, to notify (via email) that a change has
> occurred to a given file, or a new file is available. Is the thought that
> this would be part of logging?

This could probably be done within the logging stuff. Probably replace the
standard logger (which is not thread-safe and thus very fast) with Log4J and
use an appender that suits your needs.
A prerequisite would again be the repository I mentioned. Probably you could
spend some time to make it really fast...

> I like the idea of the overwriting config files, but I am personally a fan
> of one big file that the user can configure. This seems to lend it self to
> less debugging. So maybe instead of multiple config files based on the url,
> maybe a master set of config options with the overwriting parameters based
> on the url pattern.
> Something like
>
> <config>
> <global>
> <follow-linkMatchPattern>*</follow-linkMatchPattern>
> </global>
> <site urlMatch="*.apache.org">
> <follow-linkMatchPattern>*.html</follow-linkMatchPattern>
> </site>
> </config>
>
> So here the default would follow all links, but if it were in apache.org
> then it would only follow .html links. I don't know if this is a real
> parameter, just an example. This is how the apache web server works. Tomcat
> works in a similar way, but there are different files (web.xml), although
> this is mostly because they are completely different applications
> potentially written by different people.

I don't think this contradicts the way I outlined it.
I think the separation I mentioned is necessary to divide the crawler's
overall configuration (i.e. which filters are used and how they are put
together) from the specific parameters for a crawl (like the example you
mentioned).
What you mean is overwriting general crawl parameters with specific crawl
parameters for specific domains. This is a new issue that I haven't
addressed in the RFC yet.

Besides:
you have used an XML format for the configuration. I personally think XML is
often overengineering. The current proposal comes with Java property files.
I have found property files to be more straightforward, easier to write,
easier to read, and you don't need a 2 MB XML parser. I'm a fan of XP's
"implement it the most simple way you could possibly imagine".
And the times I have used XML I've used an XML->Java converter (Castor XML)
that spared me from parsing it "manually"; another tool I don't want to use
in this project. What do you think? If I had to use XML I'd probably have to
delve into the Tomcat sources to find out how they cope with config stuff.


Clemens






Re: Configuration RFC [ In reply to ]
>
>
>Besides:
>you have used an XML format for the configuration. I personally think XML is
>often overengineering. The current proposal comes with Java property files.
>I have found property files to be more straightforward, easier to write,
>easier to read, and you don't need a 2 MB XML parser. I'm a fan of XP's
>"implement it the most simple way you could possibly imagine".
>And the times I have used XML I've used an XML->Java converter (Castor XML)
>that spared me from parsing it "manually"; another tool I don't want to use
>in this project. What do you think? If I had to use XML I'd probably have to
>delve into the Tomcat sources to find out how they cope with config stuff.
>
>
Parsing XML is not so hard. If you're employed as a software developer,
I suggest you learn it; one day or another you'll need to. That issue
aside, why would you write an XML configuration (aka property) handler? Do
you think you'd do it better than those who went
before you? Perhaps....

some thoughts:

http://jakarta.apache.org/avalon/api/org/apache/avalon/framework/configuration/package-frame.html
http://jakarta.apache.org/avalon/api/org/apache/avalon/framework/configuration/SAXConfigurationHandler.html
http://jakarta.apache.org/avalon/api/org/apache/avalon/framework/configuration/NamespacedSAXConfigurationHandler.html

-Andy

Re: Configuration RFC [ In reply to ]
With (at least) one thing you're right: this all seems pretty much like
Avalon to me. After writing the doc, I read a little in the Avalon docs and
found it all very similar. I already mentioned that some days ago.

--Clemens


Re: Configuration RFC [ In reply to ]
On 7/13/02 1:47 PM, "Clemens Marschner" <cmad@lanlab.de> wrote:

>
>> I think you may have mentioned it, but how do the Sources fit in.
>>
>> For example, one of the goal I am trying to get would be to get a URL
>> (xyz.html) and then change the url based on some pattern. So change
> xyz.html
>> to xyz.xml and get the .xml as a request.
>> I think you mentioned below that the source would send another request
>> message to get the data and then eat the message if it didn't also want to
>> get the .html file? Or would this be the message handler?
>
> Hm, I think most of your question could be answered in the file int
> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/webcr
> awler-LARM/doc/
>
> Changing the URL would be very easy with a URLRewriteFilter, i.e.
> class RewriteFilter implements Filter {
> public Message handleMessage(Message m) {
> URL u = ((URLMessage)m).getURL();
> // do something with the URL
> ((URLMessage)m).setURL(u);
> return m; }}
>

Great

>> Also, would the source be able to say, I want to get all files which meet
>> this pattern and blindly attempt to get the set of files with changing
>> parameters? A message generator?
>
> I don't really know if I get your point. What do you want to accomplish?

So there are some sites which have content which matches a given pattern,
like article20020701-1.html.

It is much easier to crawl if you just get the article based on a pattern of
article[date]-[sequence].html than getting to it through links. This is what
I would like to be able to accomplish.

>
>> I know that these are specifics, but I would like to know how these fit
> into
>> the architecture? It seems like there could be a URL/message generator
> which
>> put potential URLs into the queue and if they didn't exist then it would
>> just log that. Is this what you have architected?
>
> Yes, from what you're writing I think this would be an application.
>
>> Also, should there be a layer for post processing or is that a filter. So
> if
>> you got an xml file and wanted to transform it, then you could use a
> filter?
>
> At this point this would be a storage in the storage pipeline, although the
> name is a little misleading.
> Filters are only used to process links (so called URLMessages) before they
> get into the crawler threads. The output of these threads is put into the
> "storage", which can be a storage pipeline that works just like the filter
> pipeline.
> In this storage pipeline you can do with the document whatever you want.
> Even post processing. The object put into the storage is a "WebDocument"
> which contains the URL, the document's title, mime type, size, date, and a
> set of name-value pairs which include the raw document by default.
>
>> This sourcing might also work really well with the Cocoon project.
>
> Yes, probably.
>
>> Is there any built in support for only getting new files (or updating
>> changed URLs) or is that the job of the storage layer?
>
> I have written an experimental repository that registers itself as a storage
> and a filter. From the storage point of view, it puts all URLs it gets in a
> MySQL database. When acting as a filter, it reads them from the database and
> adds the date when it was last crawled to the URLMessage. Then the crawling
> tasks sends an "If-Modified-Since" header and stops crawling the document if
> it was not modified.
> Unfortunately it turned out that the storage itself is way too slow. Slower
> than crawling all documents from the start.
> I haven't checked it in yet, please let me know if you're interested. The
> point I have with it is that some config stuff is included in the source
> code, and I'd like to move these URLs out of the source code first. That's
> why I put so much emphasis on the configuration issue at this point.
>
>> What would be the unique ID for each retrieved in the storage URL/dynamic
>> URL? The same URL could have different data.
>
> hm at the moment the URL itself is the unique ID. What parameters could
> cause the data to be different? I can only imagine the URL, a cookie and the
> time of the crawl. Cookies are managed by the HTTP layer at this time. I
> don't even know exactly how cookies are treated at the moment.
> To be more specific, I haven't expected a single URL to point to different
> kinds of pages, but different URLs to point to the same page. Therefore the
> URLs are "normalized" to lower the chance that a URL is ambiguous. I.e.
> http://host/path1/./path2 is normalized to http://host/path1/path2
>

This comes up when there is an MVC URL methodology or a URL with POST
parameters.
So /app1/ShowResults

could show lots of different results depending on what parameters were
passed.




>> Question 4.
>> URLScopeFilter - Is this just a wildcard based system, or does it also
>> handle full regex?
>
> full Perl5 regex, provided by the Apache ORO library. I.e. I'm using the
> regex
> http://[^/]*\(uni-muenchen\.de\|lmu\.de\|lrz-muenchen\.de\|leo\.org\|student
> enwerk\.mhn\.de\|zhs-muenchen\.de\).*

Great


>
>> Question 6
>> Content validator. Is there anywhere which would be programmed in to say,
>> only the date as changed on this web page and so I don't want to consider
> it
>> changed or update it.
>
> You mean computing some sort of checksum (like the "Nilsimsa" mentioned in a
> thread some days ago)? This could probably done within a storage. But you
> need fast access to a repository to accomplish the comparison you mentioned.
> And you'd have to download the file to compute that checksum.

What I was thinking of was being able to do a diff and then to say: if the
only thing that changed matches this pattern, then ignore it as changed. The
idea would be to ignore items like dates or counters which change
dynamically.



>
>> I like the idea of the overwriting config files, but I am personally a fan
>> of one big file that the user can configure. This seems to lend it self to
>> less debugging. So maybe instead of multiple config files based on the
> url,
>> maybe a master set of config options with the overwriting parameters based
>> on the url pattern.
>> Something like
>>
>> <config>
>> <global>
>> <follow-linkMatchPattern>*</follow-linkMatchPattern>
>> </global>
>> <site urlMatch="*.apache.org">
>> <follow-linkMatchPattern>*.html</follow-linkMatchPattern>
>> </site>
>> </config>
>>
>> So here the default would follow all links, but if it were in apache.org
>> then it would only follow .html links. I don't know if this is a real
>> parameter, just an example. This is how the apache web server works.
> Tomcat
>> works in a similar way, but there are different files (web.xml), although
>> this is mostly because they are completely different applications
>> potentially written by different people.
>
> I don't think this contradicts the way I outlined.
> I think the separation I mentioned is necessary to divide the crawler's
> overall configuration (i.e. which filters are used and how they are put
> together) from the specific parameters for a crawl (like the example you
> mentioned).
> What you mean is overwriting general crawl parameters with specific crawl
> parameters for specific domains. This is a new issue that I haven't
> addressed in the RFC yet.
>



> Besides:
> you have used an XML format for the configuration. I personally think XML is
> often overengineering. The current proposal comes with Java property files.
> I have found property files to be more straightforward, easier to write,
> easier to read, and you don't need a 2 MB XML parser. I'm a fan of XP's
> "implement it the most simple way you could possibly imagine".
> And the times I have used XML I've used an XML->Java converter (Castor XML)
> that spared me from parsing it "manually"; another tool I don't want to use
> in this project. What do you think? If I had to use XML I'd probably have to
> delve into the Tomcat sources to find out how they cope with config stuff.
>

I don't care if it's XML format; XML tends to be clearer about the
relationships between parameters. Large property-like files can get
confusing without good comments.




I'll go and read more about what you have already done and try provide more
constructive comments.

Thanks again for providing this infrastructure.


Re: Configuration RFC [ In reply to ]
Peter,

> This comes up when there is a MVC url methodology or a URL with POST
> parameters.
> So /app1/ShowResults
>
> Could show lots of different results depending on what were the
> parameters passed.

This wouldn't really apply to a crawler such as LARM, as it discovers
and follows only URLs it finds in fetched documents. It ignores HTML
forms, various form fields, etc., does not POST, just GETs links that
it finds and that pass the filtering criteria.

This should also answer the other question about directly fetching
links that match a pattern. Most crawlers are designed to fetch pages,
extract HTML, links, images, etc., keep some, throw away some, fetch
some more, and so on.

Otis


Re: Configuration RFC [ In reply to ]
Otis,

Many sites are using this MVC methodology to create their sites. Why develop
a new crawler with the same limitations?

--Peter



Re: Configuration RFC [ In reply to ]
> >> Also, would the source be able to say, I want to get all files which meet
> >> this pattern and blindly attempt to get the set of files with changing
> >> parameters? A message generator?
> >
> > I don't really know if I get your point. What do you want to accomplish?
>
> So there are some sites which have content which matches a given pattern,
> like article20020701-1.html.
>
> It is much easier to crawl if you just get the article based on a pattern of
> article[date]-[sequence].html then getting to it through links. This is what
> I would like be able to accomplish.

This could be done in two ways:
Either you discover such a pattern in a filter. Then this filter could
generate new messages and put them in front of the message handler queue.
I was already thinking of a very "greedy" crawling mechanism where each
URL found leads to one message for each directory it is contained in, e.g.
http://host/my/little/path/page.html ->
-> http://host/my/little/path/
-> http://host/my/little/
-> http://host/my/
-> http://host/
Most of these messages will be filtered by the VisitedFilter, but it can
also discover "hidden" directories... probably more than the web master
would like...
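
Just to sketch the expansion itself (a toy example, not LARM code; in the
crawler each printed URL would become a URLMessage put in front of the
messageHandler queue, and most would then be swallowed by the VisitedFilter):

// toy sketch: derive one URL per enclosing directory of a given URL
public class GreedyExpansion {
    public static void main(String[] args) throws java.net.MalformedURLException {
        java.net.URL url = new java.net.URL("http://host/my/little/path/page.html");
        String path = url.getPath();
        int slash = path.lastIndexOf('/');
        while (slash > 0) {
            path = path.substring(0, slash);                    // strip the last segment
            System.out.println(new java.net.URL(url, path + "/"));
            slash = path.lastIndexOf('/');
        }
        System.out.println(new java.net.URL(url, "/"));         // finally the host root
    }
}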


> > hm at the moment the URL itself is the unique ID. What parameters could
> > cause the data to be different? I can only imagine the URL, a cookie and the
> > time of the crawl. Cookies are managed by the HTTP layer at this time. I
> > don't even know exactly how cookies are treated at the moment.
> > To be more specific, I haven't expected a single URL to point to different
> > kinds of pages, but different URLs to point to the same page. Therefore the
> > URLs are "normalized" to lower the chance that a URL is ambiguous. I.e.
> > http://host/path1/./path2 is normalized to http://host/path1/path2
> >
>
> This comes up when there is a MVC url methodology or a URL with POST
> parameters.
> So /app1/ShowResults
>
> Could show lots of different results depending on what were the parameters
> passed.

POST operations are not supported at this time. I don't see an application
for that. POST is only used
in forms, where it doesn't make sense for a crawler to enter "some"
information, or probably with Javascript, for which there is no suitable
parser that detects location.hrefs (will not be easy in any but the most
trivial cases). I also don't know any crawler that does this.

> > You mean computing some sort of checksum (like the "Nilsimsa" mentioned in a
> > thread some days ago)? This could probably done within a storage. But you
> > need fast access to a repository to accomplish the comparison you mentioned.
> > And you'd have to download the file to compute that checksum.
>
> What I was thinking was being able to do a difference and then to say if the
> only thing that changed meets this pattern then ignore it as changed. The
> idea would be to ignore items like dates or counters which change
> dynamically.

I think this is similar to what I said. There's also a paper by
Garcia-Molina et al. about this topic (see Citeseer, "Finding near replicas
of documents on the web").

Clemens



Re: Configuration RFC [ In reply to ]
> This could be done in two ways:
...
sorry..

the second way: Write a Source that puts in new messages, but runs in its
own thread.
I have to add that sources don't exist at this time.

Clemens


Re: Configuration RFC [ In reply to ]
Hi Clemens,

I read the document you put together about this crawler. Thanks.

Below are some comments and questions from someone just getting into the
crawling concepts, but trying to provide constructive ideas. I have not
looked at the code yet, but that's next on the list. I hope this is helpful
and provides a good dialog.

1) The MessageQueue system seems to be somewhat problematic because of
memory issues. This seems like it should be an abstract class with a few
potential options, including your CachingQueue and a SQLQueue that would
handle many of the issues of memory and persistence for large-scale crawls.

2) Extensible Priority Queue. You were talking about limiting the number
of threads that access one host at a time, but this might fly in the face of
the URL reordering concept that you write about later. So if this were
somehow an interface which had different options, this might be more
flexible.

3) Distribution does seem like a problem to be solved (but my guess is in the
longer term). With a distributed system, it seems like it would be best to
have as little communication as possible between the different units. One
thought would be as you stated to partition up the work. The only thought I
have would be to be able to do this at a domain level and not just a
directory level.

5) Potentially adding a receiving pipeline. You have talked about this as a
storage pipeline, but I don't think it should be connected to storage. For
example, I think that processing should occur and then go to storage. Either
a File System or SQL based storage. The storage should not be related to the
post processing. Also, the link parsing should be part of this processing
and not the fetching. This might also make it more scalable since you could
distribute the load better.

5) Here are a few items that I see as potential bottlenecks. Are there any
others that you want to account for?
A) The time to connect to the site. (Network IO constraint)
B) The time to download the page. (Network and file system IO constraint)
C) Parsing the page. (CPU and Memory constraint)
D) Managing Multiple Threads (CPU constraint)
E) List of Visited links (Memory constraint)

6) Things I am going to try to find out from the code:

Overall class naming convention / architecture. Class Diagram.

Source types handled (HTTP, FTP, FILE, SQL?)

Authentication - How does LARM handle this, what types are supported
(digest, ssl, form)

Frames - Is there an encompassing reference file name or is each file
individual? What if you want to display the page?

Cookies and Headers - Support for cookies / HTTP headers

Javascript - How does it handle links made through javascript (error out,
ignore, handle them?)


Re: Configuration RFC [ In reply to ]
> Source types handled (HTTP, FTP, FILE, SQL?)

These can basically be handled with URLs (certainly the first three).
The crawler should generate a list of document URLs to be indexed, and
then the indexer, which you should be able to throttle so it doesn't
take up excessive resources, later goes and gathers the actual
document.

Having a framework for dealing with multiple file types (text, HTML,
PDF, Word, etc) is critical. There was a proposal that floated around
a few months ago which should be dusted off.


Re: Configuration RFC [ In reply to ]
I think that this parsing / indexing should be part of the Receiving
pipeline. The developer may want to convert the pdf to html as well as index
it.

If there is a proposal to do this that would be great.

--Peter



On 7/14/02 8:53 PM, "Brian Goetz" <brian@quiotix.com> wrote:

> Having a framework for dealing with multiple file types (text, HTML,
> PDF, Word, etc) is critical. There was a proposal that floated around
> a few months ago which should be dusted off.


Re: Configuration RFC [ In reply to ]
[snip]
>Having a framework for dealing with multiple file types (text, HTML,
>PDF, Word, etc) is critical. There was a proposal that floated
>around
>a few months ago which should be dusted off.

Indyo, the indexing framework I checked into Sandbox (under the appex
project) handles this aspect of it. I need a bit more time to get the
documentation sorted out, but it'll be real soon now.

Regards,
Kelvin


Re: Configuration RFC [ In reply to ]
> >Having a framework for dealing with multiple file types (text, HTML,
> >PDF, Word, etc) is critical. There was a proposal that floated
> >around
> >a few months ago which should be dusted off.
>
> Indyo, the indexing framework I checked into Sandbox (under the appex
> project) handles this aspect of it. I need abit more time to get the
> documentation sorted out, but it'll be real soon now.

I think I submitted a simple framework for plugging in document
converters. The idea was that converters would digest a document
and produce a list of named fields, and then there was a mapping
to map the document fields to the Lucene field names (which might
be different.) It was pretty simple.

Re: Configuration RFC [ In reply to ]
> Below are some comments and questions from someone just getting into the
> crawling concepts, but trying to provide constructive ideas.

Very very good. That's the kind of discussion I wanted.

> 1) The MessageQueue system seems to be somewhat problematic because of
> memory issues. This seems like it should be an abstract class with a few
> potential options including your CachingQueue, and a SQLQueue that would
> handle many of the issue of memory and persistence for large scale crawls.

Probably another misleading name. The MessageHandler is an active component
that runs in its own thread. You can only put messages into the handler like
in a queue, but then they are processed by the message handler thread, which
runs on high priority. The caching queue is just a utility class that
implements an abstract queue that holds most of its contents on secondary
storage. That means memory is only constrained by the size of the hard drive
and the number of files that can be saved in a directory. Do you think
that's not enough?

> 2) Extensible Priority Queue. As you were talking about limiting the number
> of threads that access one host at a time, but this might fly in the face of
> the URL reordering concept that you write about later. So if this were
> somehow an interface which had different options, this might be more
> flexible.

Could you be more specific on that?
At the moment there's an interface Queue:

    public Object remove();
    public void insert(Object o);
    public void insertMultiple(Collection c);
    public int size();

and a base class called TaskQueue which contains the tasks from the
ThreadPool. Fetcher extends this Queue and uses a FetcherTaskQueue, which
contains a small CachingQueue for each server and gets threads in a
round-robin manner.

                 Queue (i)
        +-----------+-----------------+
        ^                             ^
   CachingQueue                   TaskQueue            <- used in larm.threads
        ^                             ^
        +--------uses-------- FetcherTaskQueue         <- used in larm.fetcher

You're right that this could be ameliorated. Do you have a suggestion?

> 3) Distribution does seem like a problem to be solved (but my guess is in
> longer term). With a distributed system, it seems like it would be best to
> have as little communication as possible between the different units. One
> thought would be as you stated to partition up the work. The only thought I
> have would be to be able to do this at a domain level and not just a
> directory level.

I could imagine other ways to partition it, e.g. take the hash value of each
URL and partition the space between Integer.MIN_VALUE and Integer.MAX_VALUE.
I think partitioning by host names can become highly imbalanced.
Regarding communication, you are completely right. This has to take place in
batch mode, with a couple of hundred to a thousand URLs at a time.
There's a paper on this on the www2002 cdrom ("Parallel Crawlers") that
expresses what I was thinking about for two years now ;-|
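
For illustration, the hash partitioning could be as simple as this (a sketch,
not existing code):

// sketch: map a URL to one of n crawler nodes by hashing the whole URL string
public class Partitioner {
    public static int partitionFor(String url, int numPartitions) {
        // mask the sign bit so the result is always in [0, numPartitions)
        return (url.hashCode() & 0x7fffffff) % numPartitions;
    }
    public static void main(String[] args) {
        System.out.println(partitionFor("http://host/my/little/path/page.html", 4));
    }
}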

> 5) Potentially adding a receiving pipeline. You have talked about this as a
> storage pipeline, but I don't think it should be connected to storage. For
> example, I think that processing should occur and then go to storage. Either
> a File System or SQL based storage. The storage should not be related to the
> post processing.

Well it doesn't have to. The difference between the message queue and the
storage pipeline is that the first is triggered by the message handler and
contains URLMessages and the latter is triggered by the fetcher threads and
contain WebDocuments. For some reason I felt they would be similar, so both
are derived from a "Message" class. In fact WebDocument is derived from
URLMessage, which is not totally correct, since URLMessage is used for links
and contains a referer URL, and a WebDocument is not a link. The right way
would probably be
              Message
                 ^
             URLMessage
       +---------+----------+
       ^                    ^
      Link           (Web)Document

The term "Web" is only correct for http documents, but since it could be any
doc. this could be left out as well.
The reason why documents are put into a storage pipeline and not something
like a storage queue is that it doesn't make sense to queue the high amout
of data, only to store it a short time later again.
But there's no reason not to use the storage pipeline for processing. If you
want to store the raw data, load it later again and process it, the only
thing to be defined is a "document source" that reads documents and puts
them in a processing/storage pipeline.
I see that we're talking about the right names here. Rename StoragePipeline
to ProcessingPipeline, and you're done. Storage is then maybe a special kind
of processing...?

> Also, the link parsing should be part of this processing
> and not the fetching.

True that the doc retrieval has to be divided from the processing part. The
FetcherTask.run() method is completely bloated and was subject to a lot of
experiments.
At this time it is completely limited to HTTP URLs, doesn't do intelligent
mime type processing etc.
The reason why it was done within the fetcher threads is that the
implementation of the storage pipeline is pretty new, and that the scope I
had in mind was limited (large intranets). Parsing could as well be done
within the processing pipeline.

> This might also make it more scalable since you could
> distribute the load better.
Why do you think so? I don't see a lot of difference there.
Let me think about it. If you're talking about the thread of control where
the processing is done, I think this should be the fetcher threads, at least
optionally. This could be done right now since the storages are reentrant
(or at least should be) and are called by the fetchers. The processing
pipeline should be done in parallel, to take advantage of additional
processors (I mean real MPs), if there are any.
On the other hand, the fetcher threads have to be able to get a bunch of
URLMessages, fetch them, and put them in the processing pipeline all at
once, to reduce synchronization. At this time documents are put into the
storage pipeline one after the other, which has to be synchronized in two
places: Where the actual document is stored and where the links are put back
into the message handler. That's too much and probably one of the main
reasons why the crawler doesn't scale up well to 100 threads or more.
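
What I have in mind is roughly this (a self-contained sketch with a stand-in
queue; the real Queue/URLMessage classes look different): collect the links
of one document locally and hand them over with a single synchronized call
instead of locking once per link.

import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedList;
import java.util.List;

// stand-in for the message queue: one lock acquisition per batch instead of per link
class BatchQueue {
    private final LinkedList items = new LinkedList();
    public synchronized void insert(Object o) { items.addLast(o); }
    public synchronized void insertMultiple(Collection c) { items.addAll(c); }
    public synchronized Object remove() { return items.removeFirst(); }
    public synchronized int size() { return items.size(); }
}

class BatchingDemo {
    public static void main(String[] args) {
        BatchQueue queue = new BatchQueue();
        List batch = new ArrayList();                 // per-document, thread-local batch
        batch.add("http://host/a.html");              // links extracted from one document
        batch.add("http://host/b.html");
        queue.insertMultiple(batch);                  // one synchronized call for all of them
        System.out.println(queue.size());             // prints 2
    }
}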

> 5) Here are a few items that I see as potential bottle necks. Are there any
> others that you want to account for?
> A) The time to connect to the site. (Network IO constraint)
> B) The time to download the page. (Network and file system IO constraint)
> C) Parsing the page. (CPU and Memory constraint)
> D) Managing Multiple Threads (CPU constraint)
> E) List of Visited links (Memory constraint)

F) storing the page is also a constraint, especially with SQL databases
(network/file system IO) or when you use the current LuceneStorage (Otis
knows what I'm talking about...).
G) logging (IO constraint). It's buffered now, and flushed every second time
the ThreadMonitor awakes (every 10 seconds at this time). Besides, the
SimpleLogger class is not thread safe and logging is only done in one thread
whenever possible. I have used Log4J before but never in these performance
critical situations, so I can't tell how it would behave here.
Since logging is done so extensively it can become a performance killer. E.g.
when you log a date, a new Date object is created each time. I don't know if
Log4J reuses its objects well. I mostly turned date creation off because it
flooded the heap.

If you want to see another source for heap drain, look how
url.toExternalForm() works...

> 6) Things I am going to try to find out from the code:
>
> Overall class naming convention / architecture. Class Diagram.

Yep; can you reverse engineer that from the code?
I think the threads, storage, and parser packages are ok. The thread pool
was influenced by an article in "Java-Magazin" 2 years ago.
The fetcher package should probably be broken up somehow.
The util package is really bad since it should not have any relationships
with the other packages. Especially the WebDocument needs to be moved out
there. The gui package is outdated, and the graph package just contains a
first attempt to build a graph database of the links crawled.
I've written a much better one recently, one that compresses URLs and which
can be saved and reloaded. Unfortunately, it needs a sorted list of URLs as
its input and thus cannot be used by the URLVisitedFilter. The only way to
circumvent the E) constraint I see is using a BTree, but that would impose
great new IO constraints since this class is used so often.

Other conventions: The source code is, although not intended, formatted
almost in concordance with the Avalon conventions
(http://jakarta.apache.org/avalon/code-standards.html) which I like very
well (I really hate when opening brackets are not on their own line, so
don't make me change that :-) I use "this." instead of m_ for members
although I was used to that in my MFC times.

> Source types handled (HTTP, FTP, FILE, SQL?)

The only part that is limited to HTTP at this time is the FetcherTask. I had
so many problems with URLConnection that I had to replace it with the
HTTPClient; but that client doesn't

> Authentication - How does LARM handle this, what types are supported
> (digest, ssl, form)

Could be handled by HTTPClient (I think all of it) but I haven't tried out.

> Frames - Is there a encompassing reference file name or is each individual
> file. What if you want to display the page?

There's no unambiguous way to determine in which frame a file was put,
since it can be linked by a lot of sources. The links.log contains the link
type (0=normal link, 1=frame src, 2=redirect) for each link recorded.
If you want to get the frame URL but index the real contents you have to do
an analysis of the overall graph structure. I.e. take the inbound frame link
with the highest page rank, etc.

> Cookies and Headers - Support for cookies / HTTP headers

All supported by the HTTPClient. I think Cookies are saved and resent
automatically. Only a subset of the received HTTP headers are written to the
logs at this time. If the cookie storage becomes a problem, you can also
change the internal CookieModule to cater to your needs.
I personally think HTTPClient is designed very well; I have used it for more
than half a year now and it simply works. That's why I don't want to spend
my time to move to the Apache HTTP library, although it may also be working.
If you wonder about the strange way how it is treated within the build
process: I had to adapt a couple of methods of the HTTPClient before I was
told to put it in the Apache CVS.

> Javascript - How does it handle links made through javascript (error out,
> ignore, handle them?)

Nope, and I don't see a solution to handle this for all but the most trivial
cases (document.location.href=...). It will lead to a malformed URL at this
time and be filtered out.

Last thing: Many of the utility classes have a test method in their main()
method. This has to be moved out to some JUnit test cases. I think Mehran
wanted to do something there.

Clemens





