Mailing List Archive

LARM: Configuration RFC
OK, this is my proposal for the crawler configuration. Tell me if I'm
reinventing the wheel:

Overview
--------

I distinguish (logically, not necessarily on a class level) between 5
different types of components:
- "filters" are parts of the message pipeline. They get a message and either
pass it on or not. They are put into a messageHandler pipeline and are
notified about their insertion. Filters don't know about each other. If they
share common data, it has to be kept at the Service level
- "services" are things like the host manager, probably a logfile manager,
and other things that the other components share. The other components
should be able to access these services, but the services should not know
about them.
- "storages" (or sinks) are where the documents go after they have been
fetched
- "sources" are sources of messages (i.e. URLMessages). They typically run
within their own thread of control and know the messageHandler.
- then there are some "steering" components that monitor the pipeline and
possibly reconfigure it. They build the infrastructure. The ThreadMonitor
gathers runtime information. If we want to have this information displayed
the way we do it now, we need it to know all the other components. I'd leave
that as it is at the moment, we could change it later. But I'd like the
configuration component to know as little as possible about the other
components. See below how I'd like to achieve that.


Layer Diagram
-------------


---------------|-------------------------------------
| MessageHandler(?)
|- - - - - - - - - - -
ThreadMon-> |source | filter | filter... | storage
|
|--------------|----------------------
Configurator-> | v
| Services
---------------|-------------------------------------

I'm not quite sure where the MessageHandler fits in here. Is it also a
service? I like a layered model better.
The other possibility would be to regard all components as being independent
and on the same level. But the configurator keeps track of the interactions
between them.

Configuration
-------------

Prerequisite: All the components mentioned are implemented as JavaBeans
(tada, the main idea today!)

Then we can use bean utility classes to set their properties. I've had a
look at jakarta-commons, whose BeanUtils package should contain whatever we
need.

Since every service/filter is a singleton, we can identify it in the
property file by its class name. If we ever need two instances of a class,
we'd have to change that. But for simplicity, I think this will do well for
now.

Then I think we can use a syntax of the property file like

<ClassName>.<propertyName>=<PropertyValue>

"ClassName" can be fully qualified (i.e. with package), or we could assume a
default package like "de.lanlab.larm.fetcher". This would serve us well if
the package name changes.
[If the class name is fully qualified, however, we couldn't tell where it
ends and a nested property name like "package.class.foo.bar" begins.]
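To make the splitting concrete, here is a minimal sketch (plain Java, not existing LARM code) of how a configurator might split such a key at the first dot and resolve the unqualified class name against the default package. The key in main() is just an example from this proposal:

```java
// Sketch: split a "<ClassName>.<propertyName>" key at the first dot,
// resolving an unqualified class name against a default package.
// This only handles the unqualified case; a fully qualified class name
// would be ambiguous, as noted above.
class ConfigKey {
    static final String DEFAULT_PACKAGE = "de.lanlab.larm.fetcher";

    final String className;     // fully qualified class name
    final String propertyName;  // possibly nested, e.g. "logs.store.fileName"

    ConfigKey(String key) {
        int dot = key.indexOf('.');
        if (dot < 0)
            throw new IllegalArgumentException("no property name in: " + key);
        // Everything before the first dot is the (unqualified) class name;
        // the rest is the property path handed to the bean utilities.
        this.className = DEFAULT_PACKAGE + "." + key.substring(0, dot);
        this.propertyName = key.substring(dot + 1);
    }

    public static void main(String[] args) {
        ConfigKey k = new ConfigKey("URLLengthFilter.maxURLLength");
        System.out.println(k.className);     // de.lanlab.larm.fetcher.URLLengthFilter
        System.out.println(k.propertyName);  // maxURLLength
    }
}
```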


The Configurator
----------------

The configurator should be capable of the following:
- separate class names from property names
- initialize the classes found
- register the instances in some kind of naming service (e.g. a global
HashMap)
- find and resolve dependencies among the different components
- set the properties according to the props file (using BeanUtils'
PropertyUtils.set(|Mapped|Indexed)Property())
- provide decent error handling (e.g. report line numbers when exceptions
are thrown)


Connecting different components:
--------------------------------

I don't want components to create other components or services. This should
be done by the configurator. I can imagine two ways how components may be
connected:
- They tell the configurator that they need this or that service, e.g. the
VisitedFilter needs the HostManager.
- A property contains service names. Then these services have to be set up
before the property is set.
Therefore, the config process needs at least two steps: in the first step
the components are set up and initialized, and in the second step the
connections between components are established.
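As a sketch of that two-step process (the Component interface, componentsNeeded(), and setComponent() are assumptions of mine here, not existing LARM code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of two-phase configuration: components are instantiated and
// registered first, and only wired together afterwards.
interface Component {
    String[] componentsNeeded();                  // names of required components
    void setComponent(String name, Component c);  // wiring callback
}

class Configurator {
    private final Map<String, Component> registry = new HashMap<>();

    // Step 1: register instances in a simple naming service (a HashMap).
    void register(String name, Component c) {
        registry.put(name, c);
    }

    // Step 2: resolve the declared dependencies between components.
    void wire() {
        for (Component c : registry.values()) {
            for (String needed : c.componentsNeeded()) {
                Component dep = registry.get(needed);
                if (dep == null)
                    throw new IllegalStateException("missing component: " + needed);
                c.setComponent(needed, dep);
            }
        }
    }
}
```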

Config File Overlays
--------------------

I had the same idea as Andrew about how config files should be able to
overwrite each other.
Internally all properties are treated equally. But the user has to be able
to distinguish several layers of configuration: e.g. a specific setup of
the components that is reused every time, but different domains to be
crawled.
Therefore I propose that several config files can be specified which are
loaded in sequence, each possibly overwriting properties already set. E.g.

java ...
de.lanlab... -Iglobal.properties -Imycrawl.properties -DFetcher.threads=50

which means: global.properties is loaded first, then mycrawl.properties is
included and possibly overwrites some of the settings in global.properties,
and finally the property Fetcher.threads is set manually.

I know that the JRun server uses a similar method: There you have one
global.properties and a local.properties for each server process instance. I
always found this very useful.
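The overlay mechanism itself comes almost for free with java.util.Properties, since load() simply replaces keys that already exist. A minimal sketch (the file contents are stubbed with StringReaders; the property names are examples from above):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Sketch of config file overlays: later sources overwrite earlier ones,
// because Properties.load() replaces keys that are already present.
class OverlayConfig {
    static Properties load(Reader... sources) throws IOException {
        Properties props = new Properties();
        for (Reader r : sources) {
            props.load(r);  // later files overwrite earlier keys
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Simulate global.properties followed by mycrawl.properties.
        Properties p = load(
            new StringReader("Fetcher.threads=25\nURLLengthFilter.maxURLLength=255"),
            new StringReader("Fetcher.threads=50"));
        System.out.println(p.getProperty("Fetcher.threads"));  // 50 wins
    }
}
```

A -DFetcher.threads=50 style override would then just be one more layer applied last.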


Example Property File:
---------------------

Configurator.services=HostManager,MessageHandler,LoggerService,HTTPProtocol
# do we need this?

# MessageHandler is initialized first and gets the filters property set.
# those filters have to be initialized in a second step, when all is set up.
MessageHandler.filters=URLLengthFilter,URLScopeFilter,RobotExclusionFilter,URLVisitedFilter,KnownPathsFilter
# the configurator knows from this line that we need a MessageHandler, so
# the Configurator.services line above is redundant in this case

#
LoggerService.baseDir=logs/
LoggerService.logs=store,links # defines property names used below
# LoggerService.logs.store.class=SimpleLogger
LoggerService.logs.store.fileName=store.log
LoggerService.logs.links.fileName=links.log


StoragePipeline.docStorages=LogStorage
StoragePipeline.linkStorages=LinkLogStorage,MessageHandler

LogStorage.log=store # the log name from the logger service
LinkLogStorage.log=links

# LuceneStorage.analyzer=org.apache.lucene.analysis.StandardAnalyzer
# LuceneStorage.createIndex=true
# LuceneStorage.indexName=luceneIndex
# LuceneStorage.fieldInfos=url,content
# LuceneStorage.fieldInfos.url = Index,Store
# LuceneStorage.fieldInfos.content = Index,Store,Tokenize


# manually define host synonyms. I don't know if there's a better way than
# the following, or whether the method used here is even possible (one
# property defined twice)
HostManager.synonym=www.foo1.bar.com,www.foo2.bar.com
HostManager.synonym=www1.foo.com,www2.foo.com
# or
# HostManager.addSynonym=www.foo1.bar.com,www.foo2.bar.com
# HostManager.addSynonym=www1.foo.com,www2.foo.com
# coded as void setAddSynonym(String) - not so nice

# alternative:
HostManager.synonyms[0]=www.foo1.bar.com,www.foo2.bar.com
HostManager.synonyms[1]=www1.foo.com,www2.foo.com
# but this would prevent adding further synonyms in other config files

URLScopeFilter.inScope=http://.*myHost.*
# or additionally URLScopeFilter.outOfScope=... ?

# RobotExclusionFilter doesn't have properties. It just needs to know the
# host manager. MessageHandler should make clear that the filter has to be
# initialized. I think both have to provide a method like
# 'String[] componentsNeeded()' that returns the component names to set up.
# MessageHandler would return the value as specified in
# "MessageHandler.filters"; REFilter would return HostManager.

HTTPProtocol.extractGZippedFiles=false

URLLengthFilter.maxURLLength=255

Fetcher.threadNumber=25
Fetcher.docStorage=StoragePipeline
Fetcher.linkStorage=StoragePipeline
# here comes the MIME type stuff, which is not yet implemented. Only HTML is
# parsed; the rest is stored as-is.


# this is an example of another storage:

# SQLStorage.driver=com.ashna.JTurbo.driver.Driver
# SQLStorage.url=jdbc:JTurbo://host/parameters
# SQLStorage.user=...
# SQLStorage.password=...


Some Closing Remarks
--------------------

OK, you made it all the way down here. Very good.
I think with this configuration LARM can be much more than just a crawler.
With a few changes it could also be used as a processor for documents that
come in over the network, e.g. from a JMS topic.
I haven't mentioned what I call "Sources" or message producers. These are
active components that run in their own thread and put messages into the
queue.
If we have a JMSStorage and a JMSSource, then the crawler can be divided
into two pieces just from the config file.
part one: Fetcher -> JMSStorage
part two: JMSSource -> LuceneStorage
with all the possibilities for distribution.
Given a different source, one could also imagine feeding the crawler with
files or with URLs from a web frontend.



Clemens




--------------------------------------
http://www.cmarschner.net


--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
Re: LARM: Configuration RFC [ In reply to ]
Clemens,

My overall impression is that this is overly complicated.
My brain is probably tired (past 1 AM), but I can't help but think that
there must be a simpler way....

Also, I believe this thread morphed into a thread about whether LARM
could/should be built as an Avalon component. I don't know enough
about Avalon, but I think configuring your components and using Avalon
overlap only partially. That is, Avalon can, I think, provide a good
infrastructure, a good container for your code, lifecycle methods and
such, but I'm not sure if it helps with component and system
configuration.

Some comments inlined...

--- Clemens Marschner <Clemens.Marschner@internet.lmu.de> wrote:
>
> ok, this is my proposal for the crawler configuration. And you tell
> me if
> I'm reinventing the wheel:
>
> Overview
> --------
>
> I distinguish (logically, not necessarily on a class level) between 5
> different types of components:
> - "filters" are parts of the message pipeline. They get a message and
> either
> pass it on or not. They are put into a messageHandler pipeline and
> are notified about their insertion.

Who/what is 'their' here?
Messages are put in the pipeline and filters are notified of their
insertion?

> Filters don't know about each other.
> If they
> share common data, this has to be kept on the Service level
> - "services" are things like the host manager, probably a logfile
> manager,
> and other things that the other components share. The other
> components
> should be able to access these services, but the services should not
> know
> about them.
> - "storages" (or sinks) are where the documents go after they have
> been fetched

Maybe this is just a confusing term to me (storages).
When you fetch a link, what do you do with it?
Do you store the page (HTML and all)?
If so, where do you store it? File system?

Or do you parse it with one of the filters, extract links with another
filter, and send extracted links to URL queue, and extracted text to
LuceneStorage?

> - "sources" . are sources of messages (i.e. URLMessages). They
> typically run
> within their own thread of control and know the messageHandler.
> - then there are some "steering" components that monitor the pipeline
> and
> probably reconfigure it. They build the infrastructure. The
> ThreadMonitor
> gathers runtime information. If we want to have this information
> displayed
> the way we do it now, we need it to know all the other components.
> I'd leave
> that as it is at the moment, we could change it later. But I'd like
> the
> configuration component to know as little as possible about the other
> components. See below how I'd like to achieve that.
>
>
> Layer Diagram
> -------------
>
>
> ---------------|-------------------------------------
> | MessageHandler(?)
> |- - - - - - - - - - -
> ThreadMon-> |source | filter | filter... | storage
> |
> |--------------|----------------------
> Configurator-> | v
> | Services
> ---------------|-------------------------------------
>
> I'm not quite sure where the MessageHandler fits in here. Is it also
> a
> service? I like a layered model better.

I'm also reading your PDF document (version 0.5) now.
One thing seems 'wrong' to me.
If I understand things correctly, you have:
[url/message queue] -> [filter1] ... [filterN] -> [fetcher]

This is the pipeline, correct?

This sounds like it is reversed to me.
Wouldn't this be better:

[url/message queue] -> [fetcher] -> [filter1] ... [filterN]

In English:
- get the next URL (or batch of URLs) to fetch, from the queue
- fetch the URL
- Pass the fetched page through different filters in the pipeline
(e.g. a filter to extract links,
a filter to check each link against the restrictto pattern,
a filter to check each link against the 'Visited' list,
put any remaining links (not filtered out) into the URL queue,
a filter to extract text for indexing (e.g. HTML parser),
a filter to store the extracted text (e.g. Lucene Storage),
a filter to mark the fetched URL as fetched, set the last fetched
date, etc.
)

Wouldn't that be better?
If I understand things correctly, you do this in the opposite order,
which, I think, means that you store all the extracted links in the URL
queue and filter 'bad' ones out only right before fetching.
If that is so, your URL queue is going to be unnecessarily large.

What am I missing? (other than sleep)

> The other possibility would be to regard all components as being
> independent
> and on the same level. But the configurator keeps track of the
> interactions between them.

Do you really need an external Configurator component to configure
other components?
Why not have each component configure itself?
Each component can get its own properties, set its own attributes.
You would need only 1 place to glue them all together.
This would be in Java, and may look something like this:

fetcher = new UrlFetcher();
indexer = new UrlIndexer();
persister = new UrlPersister();
sweeper = new Sweeper();
errorHandler = new ErrorHandler();
...
...
mds.addServerCommand(fetcher);
mds.addServerCommand(indexer);
mds.addServerCommand(persister);
mds.addServerCommand(sweeper);
mds.start();
scheduler.setOutQueue(mds.getInQueue());
try
{
scheduler.start();
}
...

You get the idea.
Here, Scheduler is the component that talks to the URL queue and puts
messages containing URLs in the processing queue (the pipeline).

The 'mds' instance that you see above knows to pass messages from one
component to the next in the above order.

So only the file where the above code is entered needs to know about
different components (filters, URL and page processors, storages).

A while ago you mentioned you wanted to provide different sets of
components, different pipelines (pipelines with different sets of
filters, etc.).
To do that you would either need to create (hard-code) a few common
sets in Java, like above example for 1 set of components, or you could
come up with a way to read the components from a file (properties or
XML or custom format) which will tell your 'configurator' component
which components to instantiate and how to wire them together into a
pipeline.
.....which is, I guess, what you are asking further down.

> Configuration
> -------------
>
> Prerequisite: All the components mentioned are implemented as
> JavaBeans
> (tada, the main idea today!)
>
> Then we can use bean utility classes to set their properties. I've
> had a
> look at jakarta-commons which contains a BeanUtils package which
> should
> contain whatever we need.
>
> since every service/filter is a singleton, we can distinguish it in
> the
> property file by its class name. If we ever need two instances of a
> class,
> we'd have to change that. But for simplicity, I think this will do
> well at this time.

You will want to change that or you'll be sorry when one of your
filters becomes a bottleneck and you can't instantiate more of them :)

> Then I think we can use a syntax of the property file like
>
> <ClassName>.<propertyName>=<PropertyValue>
>
> "ClassName" can be fully qualified (i.e. with package) or we could
> assume a
> default package like "de.lanlab.larm.fetcher". This could serve us
> well if
> the package name changes.
> [.If the class name is fully qualified, however, we'd have a problem
> with
> nested property names like "package.class.foo.bar", however]

Aren't there projects under Jakarta Commons that can eliminate the need
for custom code to translate properties to java beans attributes?
Digester maybe?

> The Configurator
> ----------------
>
> The configurator should be capable of the following:
> - divide class names from property names.
> - initialize the classes found
> - register the instances in some kind of naming service (i.e. a
> global
> HashMap)
> - find and resolve dependencies among the different components
> - set the properties according to the props file (using BeanUtil's
> PropertyUtils.set(|Mapped|Indexed)Property())
> - provide a decent error handling (i.e. return line numbers if
> exceptions
> are thrown)

The first 3 points (dashes) can be taken care of by using k2d2.org
framework, or, I assume, Avalon (hm, that's a big guess, I don't really
know).

> Connecting different components:
> --------------------------------
>
> I don't want components to create other components or services. This
> should
> be done by the configurator. I can imagine two ways how components
> may be
> connected:
> - They tell the configurator that they need this or that service.
> I.e. the
> VisitedFilter needs the HostManager.
> - A property contains Service Names. Than these services have to be
> set up
> before the property is set.
> Therefore, the config process needs to be at least twofold: In a
> first step
> the components are set up and initialized, and in a second step,
> connections
> between components are set up.

Ok, I guess that is what I was talking about earlier.
I'm not a fan of using XML for config files if you can use simpler
name=value properties, but this sounds kind of 'structured', so an XML
config file may be helpful (or you can tokenize those property values
on commas, but that always looked like a hack to me)


> Config File Overlays
> --------------------
>
> I had the same idea as Andrew about how config files should be able
> to
> overwrite each other.
> Internally all properties are treated equally. But the user has to be
> able
> to distinguish several layers of configurations: I.e. a specific
> setup of
> the components that is reused every time, but different domains to be
> crawled.
> Therefore I propose that different config files can be specified
> which are
> loaded subsequently, probably overwriting properties already
> specified. I.e.
>
> java ...
> de.lanlab... -Iglobal.properties -Imycrawl.properties
> -DFetcher.threads=50
>
> which means: global.properties is loaded first, then
> mycrawl.properties is
> included and probably overwrites some of the settings in
> global.properties,
> and at last the property Fetcher.threads is set manually.
>
> I know that the JRun server uses a similar method: There you have one
> global.properties and a local.properties for each server process
> instance. I
> always found this very useful.

Yes, me too.
If you want, I can send you a class that does it, so you don't have to
type it up :). Simple stuff.


> Example Property File:
> ---------------------
>
> Configurator.services=HostManager,MessageHandler,LoggerService,HTTPProtocol
> # do we need this?
>
> # MessageHandler is initialized first and gets the filters property
> set.
> # those filters have to be initialized in a second step, when all is
> set up.
> MessageHandler.filters=URLLengthFilter,URLScopeFilter,RobotExclusionFilter,URLVisitedFilter,KnownPathsFilter
> # configurator knows here that we need a MessageHandler, so the
> Configurator.services line above is redundant in this case
>
> #
> LoggerService.baseDir=logs/
> LoggerService.logs=store,links # defines property names used below
> # LoggerService.logs.store.class=SimpleLogger
> LoggerService.logs.store.fileName=store.log
> LoggerService.logs.links.fileName=links.log
>
>
> StoragePipeline.docStorages=LogStorage
> StoragePipeline.linkStorages=LinkLogStorage,MessageHandler
>
> LogStorage.log=store # the log name from the logger service
> LinkLogStorage.log=links
>
> # LuceneStorage.analyzer=org.apache.lucene.analysis.StandardAnalyzer
> # LuceneStorage.createIndex=true
> # LuceneStorage.indexName=luceneIndex
> # LuceneStorage.fieldInfos=url,content
> # LuceneStorage.fieldInfos.url = Index,Store
> # LuceneStorage.fieldInfos.content = Index,Store,Tokenize

I like this Lucene part - specifying fields' characteristics via
properties.

> # manually define host synonyms. I don't know if there's a better way
> than
> the following, and if the method used here is possible anyway (one
> property
> two times)

No, you can't do that, I'm pretty sure. The Properties class is a
subclass of Hashtable, so keys (prop names) would clash.

> HostManager.synonym=www.foo1.bar.com,www.foo2.bar.com
> HostManager.synonym=www1.foo.com,www2.foo.com
> # or
> # HostManager.addSynonym=www.foo1.bar.com,www.foo2.bar.com
> # HostManager.addSynonym=www1.foo.com,www2.foo.com
> # coded as void setAddSynonym(String) - not so nice
>
> # alternative:
> HostManager.synonyms[0]=www.foo1.bar.com,www.foo2.bar.com
> HostManager.synonyms[1]=www1.foo.com,www2.foo.com
> # but this would prevent adding further synonyms in other config
> files

This is also where XML may help:
<HostSynonym>
<name>www.example.com</name>
<syn>www1.example.com</syn>
<syn>www2.example.com</syn>

<name>www.porkchop.com</name>
<syn>www1.porkchop.com</syn>
<syn>www2.porkchop.com</syn>
</HostSynonym>

> URLScopeFilter.inScope=http://.*myHost.*
> # or additionally URLScopeFilter.outOfScope=... ?

Yes, I was going to tell you the other day.
You need 'include pattern' as well as 'exclude pattern'.
Include pattern may be something like *.de, exclude pattern may be
things like /cgi-bin/ or '?' or /wpoison/ or '.cgi' or ...
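A possible shape for such a filter with both patterns, loosely following the property names above (the class and setter names here are my assumptions, not existing LARM code):

```java
import java.util.regex.Pattern;

// Sketch of a scope filter with both an include and an exclude pattern.
class URLScopeFilter {
    private Pattern inScope;     // the URL must match this ...
    private Pattern outOfScope;  // ... and must not match this

    public void setInScope(String regex)    { inScope = Pattern.compile(regex); }
    public void setOutOfScope(String regex) { outOfScope = Pattern.compile(regex); }

    // Pass the message on only if the URL is inside the scope.
    public boolean accepts(String url) {
        if (inScope != null && !inScope.matcher(url).find()) return false;
        if (outOfScope != null && outOfScope.matcher(url).find()) return false;
        return true;
    }

    public static void main(String[] args) {
        URLScopeFilter f = new URLScopeFilter();
        f.setInScope("http://.*\\.de/");  // include pattern, e.g. *.de hosts
        f.setOutOfScope("/cgi-bin/");     // exclude pattern
        System.out.println(f.accepts("http://www.example.de/index.html"));  // true
        System.out.println(f.accepts("http://www.example.de/cgi-bin/x"));   // false
    }
}
```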

> # RobotExclusionFilter doesn't have properties. It just needs to know
> the
> host manager. MessageHandler should
> # make clear that the filter has to be initialized. I think both have
> to
> provide a method like
> # 'String[] componentsNeeded()' that return the component names to
> set up.
> # MessageHandler would return the value as specified in
> "MessageHandler.filters", REFilter would return
> # HostManager
>
> HTTPProtocol.extractGZippedFiles=false
>
> URLLengthFilter.maxURLLength=255
>
> Fetcher.threadNumber=25
> Fetcher.docStorage=StoragePipeline
> Fetcher.linkStorage=StoragePipeline
> # here comes the MIME type stuff, which is not yet implemented. Only
> HTML is
> parsed, the rest is stored as-is.
>
>
> # this is an example of another storage:
>
> # SQLStorage.driver=com.ashna.JTurbo.driver.Driver
> # SQLStorage.url=jdbc:JTurbo://host/parameters
> # SQLStorage.user=...
> # SQLStorage.password=...
>
>
> Some Closing Remarks
> --------------------
>
> Ok you made it until here. Very good.

2 AM. :(

> I think with this configuration LARM can be much more than just a
> crawler.
> With a few changes it can also be used as a processor for documents
> that
> come over the network, i.e. in a JMS topic.
> I haven't mentioned what I call "Sources" or message producers. These
> are
> active components that run in their own thread and put messages into
> the queue.

Like Fetcher threads?

> If we have a JMSStorage and a JMSSource, then the crawler can be
> divided
> into two pieces just from the config file.
> part one: Fetcher -> JMSStorage
> part two: JMSSource -> LuceneStorage
> with all the possibilities for distribution.
> Given a different source, one could also imagine feeding the crawler
> with files or with URLs from a web frontend.

This sounds like plans of failed 'dot coms' :)
We could do this, and this, and this..... in the end there is no focus,
no product, no company, no jobs. At least there were good parties,
free food, limos, nice SGI screens, Aeron chairs, and a nice office in
Chinatown ;)

Otis



Re: LARM: Configuration RFC [ In reply to ]
> My overall impression is that this is overly complicated.
> My brain is probably tired (past 1 AM), but I can't help but think that
> there must be a simpler way....

Hi Otis,
sorry for the delay; because of that I will repeat most of my original
message:


> > I distinguish (logically, not necessarily on a class level) between 5
> > different types of components:
> > - "filters" are parts of the message pipeline. They get a message and
> > either
> > pass it on or not. They are put into a messageHandler pipeline and
> > are notified about their insertion.
>
> Who/what is 'their' here?
> Messages are put in the pipeline and filters are notified of their
> insertion?

No, filters are installed in the pipeline and get a notification that they
were inserted.

> > Filters don't know about each other.
> > If they
> > share common data, this has to be kept on the Service level
> > - "services" are things like the host manager, probably a logfile
> > manager,
> > and other things that the other components share. The other
> > components
> > should be able to access these services, but the services should not
> > know
> > about them.
> > - "storages" (or sinks) are where the documents go after they have
> > been fetched
>
> Maybe this is just a confusing term to me (storages).
> When you fetch a link, what do you do with it?
> Do you store the page (HTML and all)?
> If so, where do you store it? File system?

Call it a processor or sink. It's "where the meat goes" after being fetched.
(Sorry, watched too much Seinfeld lately.) ["But where do you turn it on?"]

> Or do you parse it with one of the filters, extract links with another
> filter, and send extracted links to URL queue, and extracted text to
> LuceneStorage?

Ok, once more:

        URLMsg.          URLMsg.           WebDoc
-----> FILTER* -----> FETCHER{1} -----> PROCESSOR+

Right now:
Processing = Part of Fetcher -> bad
PROCESSOR = Storage/StoragePipeline -> just a special kind of processor
I'm still not very confident with the word "processor", though. Should be
something like drain or sink. There's no "is-a" between "Storage" and
"Processor".

> > - "sources" . are sources of messages (i.e. URLMessages). They
> > typically run
> > within their own thread of control and know the messageHandler.
> > - then there are some "steering" components that monitor the pipeline
> > and
> > probably reconfigure it. They build the infrastructure. The
> > ThreadMonitor
> > gathers runtime information. If we want to have this information
> > displayed
> > the way we do it now, we need it to know all the other components.
> > I'd leave
> > that as it is at the moment, we could change it later. But I'd like
> > the
> > configuration component to know as little as possible about the other
> > components. See below how I'd like to achieve that.
> >
> >
> > Layer Diagram
> > -------------
> >
> >
> > ---------------|-------------------------------------
> > | MessageHandler(?)
> > |- - - - - - - - - - -
> > ThreadMon-> |source | filter | filter... | storage
> > |
> > |--------------|----------------------
> > Configurator-> | v
> > | Services
> > ---------------|-------------------------------------
> >
> > I'm not quite sure where the MessageHandler fits in here. Is it also
> > a
> > service? I like a layered model better.
>
> I'm also reading your PDF document (version 0.5) now.
> One thing seems 'wrong' to me.
> If I understand things correctly, you have:
> [url/message queue] -> [filter1] ... [filterN] -> [fetcher]
>
> This is the pipeline, correct?
>
> This sounds like it is reversed to me.
> Wouldn't this be better:
>
> [url/message queue] -> [fetcher] -> [filter1] ... [filterN]

No, the filters are applied to the URLs. The idea is that URLs are
filtered or changed, e.g. because of robot exclusion.

> In English:
> - get the next URL (or batch of URLs) to fetch, from the queue
> - fetch the URL
> - Pass the fetched page through different filters in the pipeline
> (.e.g filter to extract links
> filter to check each link against restrictto pattern
> filter to check each link against 'Visited' list
> put any remaining links (not filtered out) to URL queue
> filter to extract text for indexing (e.g. HTML parser)
> filter to store the extracted text (e.g. Lucene Storage)
> filter to mark the fetched URL as fetched, set last fetched
> date, etc.
> )
>
> Wouldn't that be better?

It's all the same, since the whole thing is a CIRCLE. If you draw it from
left to right, put the point where you would insert a URL from the outside
to the left, and we're close together. I put it that way because I think
that every URL should walk through the filter pipeline.
It's also a threading issue: the filter pipeline runs in ONE thread, since
it uses a lot of resources that would otherwise have to be shared among the
threads, which doesn't make sense (it would slow things down a lot). The
processing (or storage) pipeline itself runs within the fetcher threads,
since processing can be done in parallel, while storage _has_ to be done
synchronized. Otherwise there would have to be a document queue in front of
the storage mechanism, which doesn't make sense.
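A rough sketch of that threading model: one filter thread feeds a queue, fetcher threads consume it, and the shared storage call is synchronized. All class names here are illustrative, not LARM classes, and the fetch itself is stubbed out:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// One thread runs the filters and enqueues URLs; fetcher threads take
// from the queue; storage is synchronized because all fetchers share it.
class PipelineSketch {
    private final BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>();
    private final StringBuilder storage = new StringBuilder();

    // Storage _has_ to be synchronized: every fetcher thread writes here.
    synchronized void store(String doc) {
        storage.append(doc).append('\n');
    }

    synchronized String dump() {
        return storage.toString();
    }

    // Runs in the single filter thread: apply filters, then enqueue.
    void filterAndEnqueue(String url) throws InterruptedException {
        if (url.length() <= 255) {  // stand-in for URLLengthFilter etc.
            urlQueue.put(url);
        }
    }

    // Body of one fetcher thread: take a URL, "fetch", hand off to storage.
    Runnable fetcher() {
        return () -> {
            try {
                while (true) {
                    String url = urlQueue.take();
                    if (url.equals("STOP")) return;  // shutdown sentinel
                    store("doc from " + url);        // real fetch stubbed out
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        PipelineSketch p = new PipelineSketch();
        Thread fetcherThread = new Thread(p.fetcher());
        fetcherThread.start();
        p.filterAndEnqueue("http://www.example.de/");
        p.filterAndEnqueue("STOP");
        fetcherThread.join();
        System.out.print(p.dump());
    }
}
```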

> If I understand you right, you do this in the opposite order,
> which, I think, means that you store all the extracted links in the URL
> queue and filter 'bad' ones out only right before fetching.
> If that is so, your URL queue is going to be unnecessarily large.

No, since URL processing is pretty fast, this queue is usually empty. That's
not a problem.

> > The other possibility would be to regard all components as being
> > independent
> > and on the same level. But the configurator keeps track of the
> > interactions between them.
>
> Do you really need an external Configurator component to configure
> other components?
> Why not have each component configure itself?
> Each component can get its own properies, set its own attributes.
> You would need only 1 place to glue them all together.
> This would be in Java, and may look something like this:
>
> fetcher = new UrlFetcher();
> indexer = new UrlIndexer();
> persister = new UrlPersister();
> sweeper = new Sweeper();
> errorHandler = new ErrorHandler();
> ...
> ...
> mds.addServerCommand(fetcher);
> mds.addServerCommand(indexer);
> mds.addServerCommand(persister);
> mds.addServerCommand(sweeper);
> mds.start();
> scheduler.setOutQueue(mds.getInQueue());
> try
> {
> scheduler.start();
> }
> ...
>
> You get the idea.
> Here, Scheduler is the component that talks to the URL queue and puts
> messages containing URLs in the processing queue (the pipeline).
>
> The 'mds' instance that you see above knows to pass messages from one
> component to the next in the above order.

Have you taken a look at FetcherMain.java? That will look familiar to you,
because the pattern is exactly the same. What I attempted was to push this
toward a more generic approach.

> A while ago you mentioned you wanted to provide different sets of
> components, different pipelines (pipelines with different sets of
> filters, etc.).
> To do that you would either need to create (hard-code) a few common
> sets in Java, like above example for 1 set of components, or you could
> come up with a way to read the components from a file (properties or
> XML or custom format) which will tell your 'configurator' component
> which components to instantiate and how to wire them together into a
> pipeline.
> .....which is, I guess, what you are asking further down.
yep

> > since every service/filter is a singleton, we can distinguish it in
> > the
> > property file by its class name. If we ever need two instances of a
> > class,
> > we'd have to change that. But for simplicity, I think this will do
> > well at this time.
>
> You will want to change that or you'll be sorry when one of your
> filters becomes a bottleneck and you can't instantiate more of them :)

Don't know. "Do the simplest thing that could possibly work." I think keeping
it in mind will be enough to keep us from driving into a dead end.

> Aren't there projects under Jakarta Commons that can eliminate the need
> for custom code to translate properties to java beans attributes?
> Digester maybe?

> > The Configurator
> > ----------------
> >
> > The configurator should be capable of the following:
> > - divide class names from property names.
> > - initialize the classes found
> > - register the instances in some kind of naming service (i.e. a
> > global
> > HashMap)
> > - find and resolve dependencies among the different components
> > - set the properties according to the props file (using BeanUtil's
> > PropertyUtils.set(|Mapped|Indexed)Property())
> > - provide a decent error handling (i.e. return line numbers if
> > exceptions
> > are thrown)
>
> The first 3 points (dashes) can be taken care of by using k2d2.org
> framework, or, I assume, Avalon (hm, that's a big guess, I don't really
> know).

Mehran wanted to have a look at Avalon. I hope he'll find out whether it
provides what we want here. Otherwise, k2d2 is a good hint.

> I'm not a fan of using XML for config files if you can use simpler
> name=value properties, but this sounds kind of 'structured', so an XML
> config file may be helpful (or you can tokenize those property values
> on commas, but that always looked like a hack to me)

The only thing I want to avoid is parsing the XML by hand. I've done this
with Castor XML (which I suppose no one would want to use here), and I think
there are already gazillions of frameworks out there that do exactly that. I
still hope that Avalon can get us through this.

> > # LuceneStorage.analyzer=org.apache.lucene.analysis.StandardAnalyzer
> > # LuceneStorage.createIndex=true
> > # LuceneStorage.indexName=luceneIndex
> > # LuceneStorage.fieldInfos=url,content
> > # LuceneStorage.fieldInfos.url = Index,Store
> > # LuceneStorage.fieldInfos.content = Index,Store,Tokenize
>
> I like this Lucene part - specifying fields' characteristics via
> properties.
>
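Parsing those flag lists is the trivial part; here's a sketch of what the
LuceneStorage could do with one fieldInfos value (the flag names are the ones
from my props example; the mapping onto Lucene's Field booleans is an
assumption):

```java
import java.util.StringTokenizer;

// Sketch: turn a value like "Index,Store,Tokenize" into the three booleans
// that a Lucene Field(name, value, store, index, token) constructor call
// would expect. Unknown flags are simply ignored here.
public class FieldFlags {
    public boolean store, index, tokenize;

    public FieldFlags(String spec) {
        for (StringTokenizer t = new StringTokenizer(spec, ", "); t.hasMoreTokens();) {
            String flag = t.nextToken();
            if ("Store".equals(flag))         store = true;
            else if ("Index".equals(flag))    index = true;
            else if ("Tokenize".equals(flag)) tokenize = true;
        }
    }
}
```

The storage would then build something like
new Field(name, value, f.store, f.index, f.tokenize) per configured field.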
> > # manually define host synonyms. I don't know if there's a better way
> > than
> > the following, and if the method used here is possible anyway (one
> > property
> > two times)
>
> No, you can't do that, I'm pretty sure. Properties class is subclass
> of Hash(table?) I think, so keys (prop names) would clash.

ok.

>
> > HostManager.synonym=www.foo1.bar.com,www.foo2.bar.com
> > HostManager.synonym=www1.foo.com,www2.foo.com
> > # or
> > # HostManager.addSynonym=www.foo1.bar.com,www.foo2.bar.com
> > # HostManager.addSynonym=www1.foo.com,www2.foo.com
> > # coded as void setAddSynonym(String) - not so nice
> >
> > # alternative:
> > HostManager.synonyms[0]=www.foo1.bar.com,www.foo2.bar.com
> > HostManager.synonyms[1]=www1.foo.com,www2.foo.com
> > # but this would prevent adding further synonyms in other config
> > files
>
> This is also where XML may help:
> <HostSynonym>
> <name>www.example.com</name>
> <syn>www1.example.com</syn>
> <syn>www2.example.com</syn>
>
> <name>www.porkchop.com</name>
> <syn>www1.porkchop.com</syn>
> <syn>www2.porkchop.com</syn>
> </HostSynonym>
sure
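...though I'd group each host in its own element so the name/syn association
stays unambiguous. Reading that with plain JAXP/DOM is only a few lines; a
sketch (the element names are made up, not a fixed format):

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch: read host synonyms from XML with plain JAXP/DOM. Each <host>
// element carries one <name> and any number of <syn> children; the map
// goes from synonym to canonical host name.
public class SynonymReader {
    public static Map read(String xml) {
        Map synonyms = new HashMap(); // synonym -> canonical host name
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            NodeList hosts = doc.getElementsByTagName("host");
            for (int i = 0; i < hosts.getLength(); i++) {
                Element host = (Element) hosts.item(i);
                String name = host.getElementsByTagName("name").item(0)
                                  .getFirstChild().getNodeValue();
                NodeList syns = host.getElementsByTagName("syn");
                for (int j = 0; j < syns.getLength(); j++) {
                    synonyms.put(syns.item(j).getFirstChild().getNodeValue(), name);
                }
            }
        } catch (Exception ex) {
            // a real configurator should report the position in the file
            throw new RuntimeException("config parse error: " + ex.getMessage());
        }
        return synonyms;
    }
}
```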

>
> > URLScopeFilter.inScope=http://.*myHost.*
> > # or additionally URLScopeFilter.outOfScope=... ?
>
> Yes, I was going to tell you the other day.
> You need 'include pattern' as well as 'exclude pattern'.
> Include pattern may be something like *.de, exclude pattern may be
> things like /cgi-bin/ or '?' or /wpoison/ or '.cgi' or ...

There's something like that in the KnownPathsFilter class, which I should
someday rename to SuckyFilter because it's so horrible. From what I recall,
you can define what the arguments to URL.path.startsWith() and
URL.query.startsWith() have to look like.
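
A cleaner replacement would just hold two patterns, roughly like this (a
sketch; I'm using JDK 1.4's java.util.regex only for brevity, the real filter
could use jakarta-regexp the same way):

```java
import java.util.regex.Pattern;

// Sketch of a scope filter with both an include and an exclude pattern.
// A URL passes only if it matches the include pattern and does not match
// the exclude pattern.
public class URLScopeFilter {
    private final Pattern inScope;
    private final Pattern outOfScope;

    public URLScopeFilter(String include, String exclude) {
        inScope = Pattern.compile(include);
        outOfScope = Pattern.compile(exclude);
    }

    public boolean accepts(String url) {
        return inScope.matcher(url).find() && !outOfScope.matcher(url).find();
    }
}
```

e.g. new URLScopeFilter("http://[^/]*\\.de/", "/cgi-bin/|\\?|\\.cgi") keeps
.de URLs and drops the cgi-ish ones.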

> > I think with this configuration LARM can be much more than just a
> > crawler.
> > With a few changes it can also be used as a processor for documents
> > that
> > come over the network, i.e. in a JMS topic.
> > I haven't mentioned what I call "Sources" or message producers. These
> > are
> > active components that run in their own thread and put messages into
> > the queue.
>
> Like Fetcher threads?

Yes. But I think the other cases mentioned are a lot more interesting.

>
> > If we have a JMSStorage and a JMSSource, then the crawler can be
> > divided
> > into two pieces just from the config file.
> > part one: Fetcher -> JMSStorage
> > part two: JMSSource -> LuceneStorage
> > with all the possibilities for distribution.
> > Given a different source, one could also imagine feeding the crawler
> > with files or with URLs from a web frontend.
>
> This sounds like plans of failed 'dot coms' :)
> We could do this, and this, and this..... in the end there is no focus,
> no product, no company, no jobs. At least there were good parties,
> free food, limos, nice SGI screens, Aeron chairs, and a nice office in
> Chinatown ;)

I understand your sarcasm at 2:30 AM, but to me it is _very_ simple.
Distributing via JMS can (and will) be done in 2 seconds. To me the config
part is much more complicated, since people will start writing config files
and will get angry if the format is changed a short time later...
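
Just to illustrate how cheap the JMS split itself would be, the two halves
could be nothing but two config files along these lines (all component and
property names invented):

```
# crawl.properties -- part one: Fetcher -> JMSStorage
pipeline=Fetcher,JMSStorage
JMSStorage.topicName=larm.documents

# index.properties -- part two: JMSSource -> LuceneStorage
pipeline=JMSSource,LuceneStorage
JMSSource.topicName=larm.documents
LuceneStorage.indexName=luceneIndex
```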


Clemens



--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>