Mailing List Archive

Solr web crawler with recursive option
Hello Team,


I am working with Solr for the first time and have the setup done. I created a core from the command line and now want to crawl a third-party website.
When I try it with individual links, the crawl works and the pages are indexed into the core. I did this with:
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -jar post.jar http://www.example.com

What I want to do now is give it a URL and use the recursive option (-Drecursive) so that it crawls the entire site.
The website I am pointing to has around 125 pages. I tried both of the commands below:
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=yes -jar post.jar http://www.example.com
java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=2 -jar post.jar http://www.example.com

and I get the error below:


POSTed web resource http://www.example.com (depth: 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
... 5 more
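
For reference, the trace shows the failure inside makeDom, called from getLinksFromWebPage, that is, while something is being parsed as XML to extract links. The JDK's XML parser reports "Content is not allowed in prolog" at line 1, column 1 when the very first character of its input is not XML, for example a JSON or plain-text body. What post.jar actually received here is not known from the trace. The sketch below only reproduces the parser message outside Solr; the class name PrologDemo, the helper parseError, and both sample inputs are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not part of Solr or post.jar.
public class PrologDemo {

    /**
     * Parses the given text as XML with the JDK's default DOM parser.
     * Returns null if parsing succeeds, otherwise "line:column message".
     */
    public static String parseError(String body) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(
                body.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException e) {
            // Same exception type as in the "Caused by" line of the trace.
            return e.getLineNumber() + ":" + e.getColumnNumber()
                + " " + e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed XML (assumed sample): parses without error.
        System.out.println(parseError("<response><str>ok</str></response>"));
        // Non-XML input (assumed sample): the parser rejects it immediately.
        System.out.println(parseError("{\"responseHeader\":{\"status\":0}}"));
    }
}
```

If this reproduces the message, a reasonable next check would be to look at the raw bytes that the failing step receives and see whether they begin with something other than an XML document.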



I would be very grateful if anyone could help me solve this issue. I have been trying to fix it for a couple of days.


Regards,
ShivprasadS


Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail, delete and then destroy all copies of the original message.