Mailing List Archive

Italian web sites
Hi all,

I'm using Jobo for spidering web sites and Lucene for indexing. The
problem is that I'd like to spider only Italian web sites.
How can I discover the country of a web site?

Do you know of a method that you can suggest?

Thanks


Laura
RE: Italian web sites
Resolve the site's IP address and look it up in the database at the
Internet topology website http://netgeo.caida.org/perl/netgeo.cgi to
find the country of origin. Use the results to populate your own DB, so
that remote lookups decrease as you accumulate IPs. That will give you
websites hosted in Italy, though, not Italian websites. Unfortunately,
unless Italian pages use a different encoding, picking the language up
from the page itself (JavaScript) won't help much.
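
A minimal sketch of the caching idea above, in Python for brevity (the thread's own tools are Java). The `remote_lookup` callable is a hypothetical stand-in for whatever geolocation service is queried; the NetGeo query format is not shown in this thread, so it is not reproduced here.

```python
import socket
from typing import Callable, Dict, Optional


class CountryCache:
    """Remembers the country found for each IP so repeat lookups stay local."""

    def __init__(self, remote_lookup: Callable[[str], Optional[str]]):
        # remote_lookup is a hypothetical stand-in for a query against a
        # geolocation service: it takes an IP string and returns a country
        # code such as "IT", or None when the country is unknown.
        self._remote_lookup = remote_lookup
        self._known: Dict[str, Optional[str]] = {}
        self.remote_calls = 0

    def country_of_ip(self, ip: str) -> Optional[str]:
        # Only ask the remote service for IPs not seen before.
        if ip not in self._known:
            self.remote_calls += 1
            self._known[ip] = self._remote_lookup(ip)
        return self._known[ip]

    def country_of_host(self, host: str) -> Optional[str]:
        # Resolve the host name to an IPv4 address, then reuse the IP cache.
        return self.country_of_ip(socket.gethostbyname(host))
```

With a real lookup function plugged in, `country_of_host` resolves a name and answers later requests for the same IP from the local table. As noted above, this identifies where a site is hosted, not the language it is written in.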




-----Original Message-----
From: lucene@libero.it [mailto:lucene@libero.it]
Sent: Wednesday, April 24, 2002 1:03 PM
To: lucene-user@jakarta.apache.org
Subject: Italian web sites


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: Italian web sites
What does it mean? An "Italian website" can be:
- a site that uses the Italian language
- a site owned by an Italian organization
- a site hosted at an Italian geographical location
Each definition has a different solution.

Date sent: Wed, 24 Apr 2002 11:02:32 +0200
From: "lucene@libero.it" <lucene@libero.it>
Subject: Italian web sites
To: lucene-user@jakarta.apache.org
Send reply to: Lucene Users List <lucene-user@jakarta.apache.org>

--------------------------------------------------
Marco Ferrante (ferrante@unige.it)
CSITA (Centro Servizi Informatici e Telematici d'Ateneo)
Università degli Studi di Genova - Italy
Via Brigata Salerno, ponte - 16147 Genova
tel (+39) 0103532621 (interno tel. 2621)
--------------------------------------------------


Re: Italian web sites
Combined with that, you could use an Italian stop-word list to run
statistics on a page :-) ?!?
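
One way to run such statistics, sketched in Python: compute the share of a page's words that appear in an Italian stop-word list. The word list below is a small illustrative sample, and the 0.2 threshold is an untested assumption that would need tuning on real pages.

```python
import re

# A tiny, illustrative sample of common Italian function words; a real
# stop-word list would be much longer.
ITALIAN_STOP_WORDS = frozenset({
    "il", "lo", "la", "le", "gli", "di", "da", "in", "con", "su",
    "per", "che", "non", "un", "una", "del", "della", "sono", "è", "e",
})


def stop_word_ratio(text, stop_words=ITALIAN_STOP_WORDS):
    """Return the fraction of word tokens in text that are stop words."""
    # Tokens are runs of letters (accented letters included), lowercased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stop_words)
    return hits / len(tokens)


def looks_italian(text, threshold=0.2):
    # The threshold is an arbitrary assumption, not a measured value.
    return stop_word_ratio(text) >= threshold
```

Short pages and pages that mix languages will give noisy ratios, so a score like this is probably best used together with another signal rather than on its own.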

Re: Italian web sites
Hi all,

I have found a very interesting library which is written in Perl.
The problem now is how I can use this library.

Anyway, the library is TextCat, and you can find it at:

http://odur.let.rug.nl/~vannoord/TextCat/

Bye

Laura

Re: Italian web sites
Hm... this looks very interesting! If it is a Perl executable, you can just
copy the text into a temp file, run the Perl executable on that file, and
redirect the output to another temp file. Then read that file and use the
result in a Lucene keyword field.

mvh karl øie
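
The temp-file round trip described above could look roughly like this in Python. The function name and the way the input path is passed to the tool are assumptions; TextCat's actual command line should be checked in its documentation before use.

```python
import os
import subprocess
import tempfile


def classify_with_external_tool(text, command):
    """Write text to a temp file, run command on it, return the tool's output.

    command is the program and its arguments as a list; the path of the
    input file is appended as the final argument (an assumption about how
    the external tool expects its input).
    """
    with tempfile.TemporaryDirectory() as workdir:
        in_path = os.path.join(workdir, "page.txt")
        out_path = os.path.join(workdir, "result.txt")
        with open(in_path, "w", encoding="utf-8") as handle:
            handle.write(text)
        # Redirect the tool's standard output into the second temp file.
        with open(out_path, "w", encoding="utf-8") as out:
            subprocess.run(command + [in_path], stdout=out, check=True)
        # Read the result back; both files are deleted when the block exits.
        with open(out_path, encoding="utf-8") as result:
            return result.read().strip()
```

A call might look like `classify_with_external_tool(page_text, ["perl", "text_cat"])`, where the script name is hypothetical. From Java, the same steps can be done with `File.createTempFile` and `Runtime.getRuntime().exec`.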

On Wednesday 24 April 2002 13:46, lucene@libero.it wrote:
> Hi all,
>
> I have found a very interesting library which is written in Perl.
> The problem now is how I can use this library.
>
> Anyway, the library is TextCat, and you can find it at:
>
> http://odur.let.rug.nl/~vannoord/TextCat/
>
> Bye
>
> Laura
Re: Italian web sites
Laura

The best method I know is to use character n-grams and the
frequencies of the n-grams that occur most often:
http://citeseer.nj.nec.com/context/698873/68861

Regards,
Ype
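
A small Python sketch of character n-gram language guessing. It uses one common variant, the rank-based "out-of-place" comparison described by Cavnar and Trenkle; whether the reference linked above describes exactly this variant is not confirmed here. The function names, the `_` word padding, and the profile size of 300 are illustrative assumptions.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams of text, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top]


def out_of_place(document_profile, language_profile):
    """Sum of rank differences between the two profiles; n-grams missing
    from the language profile get the maximum penalty."""
    rank = {gram: i for i, gram in enumerate(language_profile)}
    penalty = len(language_profile)
    return sum(abs(i - rank[gram]) if gram in rank else penalty
               for i, gram in enumerate(document_profile))


def guess_language(text, language_profiles):
    """Return the name of the language profile closest to the text."""
    document_profile = ngram_profile(text)
    return min(language_profiles,
               key=lambda name: out_of_place(document_profile,
                                             language_profiles[name]))
```

Here `language_profiles` maps a language name to a profile built with `ngram_profile` from sample text in that language. The larger and more representative the samples, the more reliable the guess is likely to be.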

Re: Italian web sites
The first one.

Bye Laura


> What does it mean? An "Italian website" can be:
> - a site that uses the Italian language
> - a site owned by an Italian organization
> - a site hosted at an Italian geographical location
> Each definition has a different solution.