cvs commit: jakarta-lucene-sandbox/contributions/webcrawler-LARM TODO.txt
cmarschner 2002/06/18 04:39:51

Modified: contributions/webcrawler-LARM TODO.txt
Log:
see file

Revision Changes Path
1.2 +40 -13 jakarta-lucene-sandbox/contributions/webcrawler-LARM/TODO.txt

Index: TODO.txt
===================================================================
RCS file: /home/cvs/jakarta-lucene-sandbox/contributions/webcrawler-LARM/TODO.txt,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- TODO.txt 1 Jun 2002 18:55:15 -0000 1.1
+++ TODO.txt 18 Jun 2002 11:39:51 -0000 1.2
@@ -1,11 +1,39 @@

Todos for 1.0 (not yet ordered in decreasing priority)

-$id: $
+$Id$
+
+-----------------------------------------------------------------------------------------------
+solved:
+-----------------------------------------------------------------------------------------------
+
+Bugs:
+ - some relative URLs are not appended appropriately, leading to wrong and growing URLs
+ - 301/302 URLs were not updated: the docs were saved under the old URL, which led to
+ wrong relative URLs (cmarschner, 2002-06-17)
+
+URLs:
+ - include a URLNormalizer
+ * lowercase host names
+ * avoid ambiguities like '%20' / '+'
+ * make sure http://host URLs end with "/"
+ * avoid host name aliases
+ - two host names / one ip address can point to the same web site: www.lmu.de / www.uni-muenchen.de
+ - two host names / one ip address can point to different web sites (then other URLs / pages must differ)
+ suche.lmu.de / interesse.lmu.de
+ * cater for 301/302 result codes
+ STATUS: seems to be solved except that URL parameters can occur in different orders, which is NOT resolved
+ host names are resolved by hand, via a synonym in HostManager. (cmarschner, 2002-06-17)
+ problem: URLMessage size doubles
+
+-----------------------------------------------------------------------------------------------
+remaining:
+-----------------------------------------------------------------------------------------------

* Bugs
- on very fast LAN connections (100MBit), sockets are not freed as fast as allocated
- - some relative URLs are not appended appropriately, leading to wrong and growing URLs
+ probably this will be solved by changing from HTTPClient.* to Jakarta HTTP client and reusing sockets
+

* Build
- added build.xml, but build.bat and build.sh are still working without ANT. Change that.
@@ -16,16 +44,6 @@
* Configuration
- move all configuration stuff into a meaningful properties file

-* URLs:
- - include a URLNormalizer
- * lowercase host names
- * avoid ambiguities like '%20' / '+'
- * make sure http://host URLs end with "/"
- * avoid host name aliases
- - two host names / one ip address can point to the same web site: www.lmu.de / www.uni-muenchen.de
- - two host names / one ip address can point to different web sites (then other URLs / pages must differ)
- suche.lmu.de / interesse.lmu.de
- * cater for 301/302 result codes

* Repository
- optionally use a database as repository (caches, queues, logs)
@@ -50,13 +68,22 @@
* Politeness
- add the option to restrict the number of host accesses per hour/minute

+* URL Extraction
+ - URLs can be encoded in different encoding styles - see http://www.unicode.org/unicode/faq/unicode_web.html
+
+* I18N, HTML encoding
+ - determine document encoding style in content-type, meta tag (http-equiv), or Doctype-tag; adapt URLs to
+ encoding style
+
* Anchor text extraction
* read until a meaningful end tag, not just the first encountered
* remove entities
* optionally remove Tags, leave ALT attribute
* remove redundant spaces

-
+* URLNormalizer
+ * add possibility to add synonyms to top level domains, e.g. "d1.com = d2.com" --> "sub1.d1.com = sub1.d2.com"
+ * add possibility to detect synonyms automatically, e.g. by comparing IP addresses or file checksums

Nice-to-have:

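The URLNormalizer items in the TODO above (lowercase host names, '%20' / '+',
trailing "/" on http://host URLs, hand-maintained host synonyms, and the still
open point about URL parameters in different orders) could be sketched roughly
as follows. This is only an illustration and not LARM's actual URLNormalizer or
HostManager: the class name UrlNormalizerSketch and the methods
addDomainSynonym, canonicalHost and normalize are hypothetical, and rewriting
'+' to '%20' in the query and sorting the parameters are assumptions about what
a canonical form might look like.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch only; not the URLNormalizer shipped with LARM. */
final class UrlNormalizerSketch {

    // Hand-maintained synonyms: alias domain -> canonical domain (assumed design).
    private final Map<String, String> domainSynonyms = new LinkedHashMap<>();

    /** Registers "alias = canonical", e.g. "d1.com = d2.com". */
    void addDomainSynonym(String alias, String canonical) {
        domainSynonyms.put(alias.toLowerCase(Locale.ROOT),
                           canonical.toLowerCase(Locale.ROOT));
    }

    /** Lowercases the host and maps "sub1.d1.com" to "sub1.d2.com". */
    String canonicalHost(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : domainSynonyms.entrySet()) {
            String alias = e.getKey();
            if (h.equals(alias)) {
                return e.getValue();
            }
            if (h.endsWith("." + alias)) {
                // Keep the subdomain prefix (including its dot), swap the suffix.
                return h.substring(0, h.length() - alias.length()) + e.getValue();
            }
        }
        return h;
    }

    /** Returns a canonical form of an absolute URL. */
    String normalize(String url) {
        // URI.normalize() removes "." and ".." path segments.
        URI u = URI.create(url).normalize();
        if (u.getScheme() == null || u.getHost() == null) {
            throw new IllegalArgumentException("absolute URL expected: " + url);
        }
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = canonicalHost(u.getHost());

        // Make sure http://host URLs end with "/".
        String path = u.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        String query = u.getRawQuery();
        if (query != null) {
            // Assumption: treat '+' in the query as an encoded space.
            query = query.replace("+", "%20");
            // Assumption: parameter order is irrelevant, so sort it.
            String[] params = query.split("&");
            Arrays.sort(params);
            query = String.join("&", params);
        }

        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        sb.append(path);
        if (query != null) {
            sb.append('?').append(query);
        }
        return sb.toString();
    }
}
```

With the synonym "uni-muenchen.de = lmu.de" registered, this sketch would map
HTTP://WWW.Uni-Muenchen.DE to http://www.lmu.de/, while suche.lmu.de and
interesse.lmu.de stay distinct because only the domain suffix is rewritten.
Sorting query parameters is not safe for every site, so a real crawler would
probably make that step optional.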