OK, this is my proposal for the crawler configuration. Tell me if
I'm reinventing the wheel:
Overview
--------
I distinguish (logically, not necessarily on a class level) between 5
different types of components:
- "filters" are parts of the message pipeline. They get a message and either
pass it on or not. They are put into a messageHandler pipeline and are
notified about their insertion. Filters don't know about each other. If they
share common data, this has to be kept on the Service level.
- "services" are things like the host manager, probably a logfile manager,
and other things that the other components share. The other components
should be able to access these services, but the services should not know
about them.
- "storages" (or sinks) are where the documents go after they have been
fetched.
- "sources" are sources of messages (i.e. URLMessages). They typically run
within their own thread of control and know the messageHandler.
- then there are some "steering" components that monitor the pipeline and
possibly reconfigure it. They build the infrastructure. The ThreadMonitor
gathers runtime information. If we want to have this information displayed
the way we do it now, it needs to know all the other components. I'd leave
that as it is for the moment; we could change it later. But I'd like the
configuration component to know as little as possible about the other
components. See below how I'd like to achieve that.
Layer Diagram
-------------
---------------|-------------------------------------
               |  MessageHandler(?)
               |- - - - - - - - - - -
ThreadMon->    |source | filter | filter... | storage
               |
               |--------------|----------------------
Configurator-> |              v
               |           Services
---------------|-------------------------------------
I'm not quite sure where the MessageHandler fits in here. Is it also a
service? I like a layered model better.
The other possibility would be to regard all components as being independent
and on the same level. But the configurator keeps track of the interactions
between them.
Configuration
-------------
Prerequisite: All the components mentioned are implemented as JavaBeans
(tada, the main idea today!)
Then we can use bean utility classes to set their properties. I've had a
look at jakarta-commons, which contains a BeanUtils package that should
provide whatever we need.
Since every service/filter is a singleton, we can distinguish it in the
property file by its class name. If we ever need two instances of a class,
we'd have to change that. But for simplicity, I think this will do well at
this time.
Then I think we can use a property file syntax like
<ClassName>.<propertyName>=<PropertyValue>
"ClassName" can be fully qualified (i.e. with package) or we could assume a
default package like "de.lanlab.larm.fetcher". This could serve us well if
the package name changes.
[If the class name is fully qualified, however, we'd have a problem with
nested property names like "package.class.foo.bar".]
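To make the key syntax concrete, here is a minimal sketch of how a key could be split. It assumes unqualified class names that are resolved against the default package, so the first dot always ends the class name. PropertyKey and its fields are hypothetical names, not existing LARM code.

```java
// Hypothetical helper: splits "<ClassName>.<propertyName>" at the first dot.
// Assumption: class names are NOT fully qualified, so the first dot is unambiguous.
final class PropertyKey {
    static final String DEFAULT_PACKAGE = "de.lanlab.larm.fetcher";

    final String className;    // resolved against the default package
    final String propertyPath; // may itself contain dots, e.g. "logs.store.fileName"

    private PropertyKey(String className, String propertyPath) {
        this.className = className;
        this.propertyPath = propertyPath;
    }

    static PropertyKey parse(String key) {
        int dot = key.indexOf('.');
        if (dot <= 0 || dot == key.length() - 1) {
            throw new IllegalArgumentException(
                "Expected <ClassName>.<propertyName> but got: " + key);
        }
        return new PropertyKey(DEFAULT_PACKAGE + "." + key.substring(0, dot),
                               key.substring(dot + 1));
    }
}
```

With fully qualified class names this split would no longer work, which is exactly the ambiguity noted above.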
The Configurator
----------------
The configurator should be capable of the following:
- separate class names from property names
- initialize the classes found
- register the instances in some kind of naming service (e.g. a global
HashMap)
- find and resolve dependencies among the different components
- set the properties according to the props file (using BeanUtils'
PropertyUtils.set(|Mapped|Indexed)Property())
- provide decent error handling (e.g. report line numbers if exceptions
are thrown)
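The first three points and the property setting could be sketched roughly as follows. This is only an illustration under assumptions: it uses plain reflection instead of BeanUtils so that it has no dependencies, it handles simple (non-nested) properties only, and the component classes are passed in explicitly instead of being loaded from a default package. SketchConfigurator and DemoLengthFilter are hypothetical names.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for a filter bean; not LARM's actual class.
class DemoLengthFilter {
    private int maxURLLength;
    public void setMaxURLLength(int maxURLLength) { this.maxURLLength = maxURLLength; }
    public int getMaxURLLength() { return maxURLLength; }
}

class SketchConfigurator {
    private final Map<String, Class<?>> known = new HashMap<>();   // simple name -> class
    private final Map<String, Object> registry = new HashMap<>();  // the "naming service"

    SketchConfigurator(Class<?>... componentTypes) {
        for (Class<?> type : componentTypes) known.put(type.getSimpleName(), type);
    }

    Object lookup(String className) { return registry.get(className); }

    void configure(Properties props) {
        for (String key : props.stringPropertyNames()) {
            int dot = key.indexOf('.');
            if (dot <= 0 || dot == key.length() - 1) {
                throw new IllegalArgumentException("Bad key: " + key);
            }
            String className = key.substring(0, dot);
            // One instance per class name, created on first use.
            Object bean = registry.computeIfAbsent(className, this::instantiate);
            setProperty(bean, key.substring(dot + 1), props.getProperty(key));
        }
    }

    private Object instantiate(String className) {
        Class<?> type = known.get(className);
        if (type == null) throw new IllegalArgumentException("Unknown component: " + className);
        try {
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    // Finds a one-argument setter by name and converts the string value to its type.
    private static void setProperty(Object bean, String property, String value) {
        String setter = "set" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        for (Method m : bean.getClass().getMethods()) {
            if (m.getName().equals(setter) && m.getParameterCount() == 1) {
                try {
                    m.invoke(bean, convert(value, m.getParameterTypes()[0]));
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Cannot set " + property, e);
                }
                return;
            }
        }
        throw new IllegalArgumentException(
            "No setter " + setter + " on " + bean.getClass().getName());
    }

    private static Object convert(String value, Class<?> type) {
        if (type == int.class || type == Integer.class) return Integer.valueOf(value.trim());
        if (type == boolean.class || type == Boolean.class) return Boolean.valueOf(value.trim());
        return value; // everything else is passed through as a String
    }
}
```

Dependency resolution and line-number reporting are left out here; java.util.Properties does not keep line numbers, so that part would probably need its own parser.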
Connecting different components:
--------------------------------
I don't want components to create other components or services. This should
be done by the configurator. I can imagine two ways in which components may
be connected:
- They tell the configurator that they need this or that service. E.g. the
VisitedFilter needs the HostManager.
- A property contains service names. Then these services have to be set up
before the property is set.
Therefore, the config process needs at least two steps: in a first step
the components are set up and initialized, and in a second step the
connections between components are set up.
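The two steps could look roughly like this. It is a sketch under assumptions: the Component interface, its connect() method and the Demo classes are hypothetical; only the name componentsNeeded() is taken from the comments in the example property file.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical contract: each component names the components it needs
// and is handed the resolved instances in the second step.
interface Component {
    String[] componentsNeeded();
    void connect(Map<String, Component> resolved);
}

class DemoHostManager implements Component {
    public String[] componentsNeeded() { return new String[0]; }
    public void connect(Map<String, Component> resolved) { /* needs nothing */ }
}

class DemoVisitedFilter implements Component {
    Component hostManager; // set during the second step
    public String[] componentsNeeded() { return new String[] { "HostManager" }; }
    public void connect(Map<String, Component> resolved) {
        hostManager = resolved.get("HostManager");
    }
}

class TwoPhaseWiring {
    static Map<String, Component> wire(Map<String, Supplier<Component>> factories) {
        // Step 1: create every component; nobody knows anybody yet.
        Map<String, Component> registry = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<Component>> e : factories.entrySet()) {
            registry.put(e.getKey(), e.getValue().get());
        }
        // Step 2: resolve the declared dependencies against the full registry.
        for (Map.Entry<String, Component> e : registry.entrySet()) {
            Map<String, Component> deps = new LinkedHashMap<>();
            for (String name : e.getValue().componentsNeeded()) {
                Component dep = registry.get(name);
                if (dep == null) {
                    throw new IllegalStateException(
                        e.getKey() + " needs unknown component " + name);
                }
                deps.put(name, dep);
            }
            e.getValue().connect(deps);
        }
        return registry;
    }
}
```

Because all instances exist before any connection is made, the order in which components appear in the config file does not matter.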
Config File Overlays
--------------------
I had the same idea as Andrew about how config files should be able to
overwrite each other.
Internally all properties are treated equally. But the user has to be able
to distinguish several layers of configuration, e.g. a specific setup of
the components that is reused every time, combined with different domains
to be crawled.
Therefore I propose that several config files can be specified, which are
loaded one after another, possibly overwriting properties already specified.
E.g.
java ... de.lanlab... -Iglobal.properties -Imycrawl.properties -DFetcher.threads=50
which means: global.properties is loaded first, then mycrawl.properties is
included and possibly overwrites some of the settings in global.properties,
and finally the property Fetcher.threads is set manually.
I know that the JRun server uses a similar method: There you have one
global.properties and a local.properties for each server process instance. I
always found this very useful.
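The overlay behaviour falls out of java.util.Properties almost for free, since loading several sources into the same object lets later values replace earlier ones. A minimal sketch, assuming the -I files are opened elsewhere and passed in as Readers and the -D settings arrive as a separate Properties object (OverlayLoader is a hypothetical name):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

class OverlayLoader {
    // Loads each layer in order into one Properties object;
    // a key in a later layer overwrites the same key from an earlier one.
    static Properties load(List<Reader> layers, Properties commandLine) {
        Properties merged = new Properties();
        for (Reader layer : layers) {
            try (Reader r = layer) {
                merged.load(r);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        merged.putAll(commandLine); // manual settings win over all files
        return merged;
    }

    public static void main(String[] args) {
        Reader global = new StringReader("Fetcher.threads=25\nLoggerService.baseDir=logs/\n");
        Reader myCrawl = new StringReader("Fetcher.threads=10\n");
        Properties manual = new Properties();
        manual.setProperty("Fetcher.threads", "50");
        Properties merged = load(Arrays.asList(global, myCrawl), manual);
        System.out.println(merged.getProperty("Fetcher.threads"));
        System.out.println(merged.getProperty("LoggerService.baseDir"));
    }
}
```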
Example Property File:
---------------------
Configurator.services=HostManager,MessageHandler,LoggerService,HTTPProtocol
# do we need this?
# MessageHandler is initialized first and gets the filters property set.
# those filters have to be initialized in a second step, when all is set up.
MessageHandler.filters=URLLengthFilter,URLScopeFilter,RobotExclusionFilter,URLVisitedFilter,KnownPathsFilter
# configurator knows here that we need a MessageHandler, so the
# Configurator.services line above is redundant in this case
#
LoggerService.baseDir=logs/
# defines property names used below
LoggerService.logs=store,links
# LoggerService.logs.store.class=SimpleLogger
LoggerService.logs.store.fileName=store.log
LoggerService.logs.links.fileName=links.log
StoragePipeline.docStorages=LogStorage
StoragePipeline.linkStorages=LinkLogStorage,MessageHandler
# the log name from the logger service
LogStorage.log=store
LinkLogStorage.log=links
# LuceneStorage.analyzer=org.apache.lucene.analysis.StandardAnalyzer
# LuceneStorage.createIndex=true
# LuceneStorage.indexName=luceneIndex
# LuceneStorage.fieldInfos=url,content
# LuceneStorage.fieldInfos.url = Index,Store
# LuceneStorage.fieldInfos.content = Index,Store,Tokenize
# manually define host synonyms. I don't know if there's a better way than
# the following, and if the method used here is possible anyway (one property
# two times)
HostManager.synonym=www.foo1.bar.com,www.foo2.bar.com
HostManager.synonym=www1.foo.com,www2.foo.com
# or
# HostManager.addSynonym=www.foo1.bar.com,www.foo2.bar.com
# HostManager.addSynonym=www1.foo.com,www2.foo.com
# coded as void setAddSynonym(String) - not so nice
# alternative:
HostManager.synonyms[0]=www.foo1.bar.com,www.foo2.bar.com
HostManager.synonyms[1]=www1.foo.com,www2.foo.com
# but this would prevent adding further synonyms in other config files
URLScopeFilter.inScope=http://.*myHost.*
# or additionally URLScopeFilter.outOfScope=... ?
# RobotExclusionFilter doesn't have properties. It just needs to know the
# host manager. MessageHandler should make clear that the filter has to be
# initialized. I think both have to provide a method like
# 'String[] componentsNeeded()' that returns the component names to set up.
# MessageHandler would return the value as specified in
# "MessageHandler.filters", REFilter would return HostManager
HTTPProtocol.extractGZippedFiles=false
URLLengthFilter.maxURLLength=255
Fetcher.threadNumber=25
Fetcher.docStorage=StoragePipeline
Fetcher.linkStorage=StoragePipeline
# here comes the MIME type stuff, which is not yet implemented. Only HTML is
# parsed, the rest is stored as-is.
# this is an example of another storage:
# SQLStorage.driver=com.ashna.JTurbo.driver.Driver
# SQLStorage.url=jdbc:JTurbo://host/parameters
# SQLStorage.user=...
# SQLStorage.password=...
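On the question of whether one property can appear two times: if the file is read with java.util.Properties, a repeated key keeps only the last value, so the two HostManager.synonym lines would not accumulate. The indexed variant can be collected with a small helper. A sketch, where SynonymProperties and indexed() are hypothetical names:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class SynonymProperties {
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    // Collects prefix[0], prefix[1], ... and stops at the first missing index.
    static List<String> indexed(Properties p, String prefix) {
        List<String> values = new ArrayList<>();
        for (int i = 0; ; i++) {
            String value = p.getProperty(prefix + "[" + i + "]");
            if (value == null) return values;
            values.add(value);
        }
    }

    public static void main(String[] args) {
        Properties repeated = parse(
            "HostManager.synonym=www.foo1.bar.com,www.foo2.bar.com\n"
          + "HostManager.synonym=www1.foo.com,www2.foo.com\n");
        // Only the second line survives.
        System.out.println(repeated.getProperty("HostManager.synonym"));

        Properties byIndex = parse(
            "HostManager.synonyms[0]=www.foo1.bar.com,www.foo2.bar.com\n"
          + "HostManager.synonyms[1]=www1.foo.com,www2.foo.com\n");
        System.out.println(indexed(byIndex, "HostManager.synonyms"));
    }
}
```

The indexed form still has the drawback mentioned above: a second config file that also starts at [0] would overwrite the first entry rather than add to it.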
Some Closing Remarks
--------------------
OK, you made it this far. Very good.
I think with this configuration LARM can be much more than just a crawler.
With a few changes it can also be used as a processor for documents that
come over the network, e.g. via a JMS topic.
I haven't said much about what I call "Sources" or message producers. These
are active components that run in their own thread and put messages into the
queue.
If we have a JMSStorage and a JMSSource, then the crawler can be divided
into two pieces just from the config file.
part one: Fetcher -> JMSStorage
part two: JMSSource -> LuceneStorage
with all the possibilities for distribution.
Given a different source, one could also imagine feeding the crawler with
files or with URLs from a web frontend.
Clemens
--------------------------------------
http://www.cmarschner.net
--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>