Wikipedia Web Logs for scientific research
Dear Wikimedia community,

please have a look at the following collaboration
proposal, which I sent to the Wikimedia board yesterday.
The board (and I too!) would like to hear your opinion
on it before making a decision.
I would like to point out that this collaboration would have
purely scientific and academic purposes, without any
commercial involvement.
Finally, the analysis software we are developing (and
will apply to Wikipedia data, if the proposal is
accepted) will be distributed to the scientific community as
open source.

Of course, I will be glad to provide any details and
explanations you consider necessary.

Thank you for your attention. Best regards,

- Mirco


----------------------------------------------------------

Dear Sirs,

I am writing to you on behalf of the KDD-Lab (Laboratory
on Knowledge Discovery and Delivery:
http://www-kdd.isti.cnr.it), a branch of the ISTI institute
of the Italian National Research Centre (CNR).

Our group is working (among other things) on a project
on the analysis of web server logs, and we have recently
been working on analysis techniques that seem
best suited to "content-rich sites". Our first
thought obviously went to Wikipedia...

We would like to have the opportunity to apply our
analysis techniques to the web logs of Wikipedia. Looking at
the Wikipedia access statistics, we believe that an optimal
amount of data would be one of the following: (1) the (raw)
web logs of the English section covering a few days of usage,
or (2) a few weeks for the Italian section.

Do you think it would be possible to start this kind of
collaboration?

Of course, we are willing to provide all the legal
agreements you consider necessary, especially those
regarding privacy. And, obviously, we will properly
acknowledge your contribution in any of our scientific
publications and reports that use the data.

[Addendum: the sensitive information in web logs is
essentially located in the "client IP" field ("who visited
that page"). However, for our research purposes this field
is not strictly needed, as an encrypted version of it would
be enough, thus avoiding most of the privacy issues.]

Thank you for your attention.
Looking forward to receiving your answer and opinion, I send
you my best regards,

- Mirco Nanni

====================================
http://ercolino.isti.cnr.it/mirco
====================================
Re: Wikipedia Web Logs for scientific research
On 5/10/05, Mirco Nanni <mirco.nanni@isti.cnr.it> wrote:
>
> Of course, we are willing to provide all the legal
> agreements you consider necessary, especially those
> regarding privacy. And, obviously, we will properly
> acknowledge your contribution in any of our scientific
> publications and reports that use the data.
>
> [Addendum: the sensitive information in web logs is
> essentially located in the "client IP" field ("who visited
> that page"). However, for our research purposes this field
> is not strictly needed, as an encrypted version of it would
> be enough, thus avoiding most of the privacy issues.]


The problem is that if you replace the IP with a unique number and still
show accesses to user pages, you can probably identify the logged-in users.
I'd be OK if the IPs were masked AND accesses to non-article namespace pages
were not given out.
Re: Wikipedia Web Logs for scientific research
Dori wrote:
>>[Addendum: the sensitive information in web logs is
>>essentially located in the "client IP" field ("who visited
>>that page"). However, for our research purposes this field
>>is not strictly needed, as an encrypted version of it would
>>be enough, thus avoiding most of the privacy issues.]
>
> The problem is that if you replace the IP with a unique number and still
> show accesses to user pages, you can probably identify the logged-in users.
> I'd be OK if the IPs were masked AND accesses to non-article namespace pages
> were not given out.

Well, our objective is not to make web accesses public,
but to apply analysis techniques to them and possibly make
some selected results public (something like -- but a bit
more sophisticated and specific than -- the Webalizer system
which is now used to build the Wikipedia usage statistics).
However, you are right, masking IPs does not solve
privacy problems once and for all. I agree with restricting
the data to web traffic for articles, discarding personal
pages and the like -- moreover, they are not very interesting
for our research purposes.
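
To make the two safeguards discussed in this thread concrete (masking
the client IP and dropping accesses outside the article namespace),
here is a minimal sketch in Python. Everything in it is an assumption
made for illustration, not a description of how the Wikipedia logs are
actually stored or would be exported: the log layout (client IP first,
request line in double quotes, as in Apache-style logs), the "/wiki/"
URL prefix, the secret key, the list of namespace prefixes, and the
function names are all hypothetical. A keyed hash (HMAC) is used
instead of a plain hash because the IPv4 address space is small enough
to be enumerated; note that the same visitor still maps to the same
pseudonym, so this reduces the privacy problem but does not remove it.

```python
import hashlib
import hmac
import re
from urllib.parse import unquote

# Hypothetical secret, kept by the log owner and never released.
SECRET_KEY = b"replace-with-a-random-secret"

# Illustrative, incomplete list of non-article namespace prefixes.
NON_ARTICLE_PREFIXES = tuple(p.lower() for p in (
    "User:", "User_talk:", "Talk:", "Special:", "Wikipedia:",
    "Wikipedia_talk:", "Image:", "Template:", "Help:", "Category:",
))

# Assumed layout: client IP, other fields, then "METHOD path protocol".
LINE_RE = re.compile(r'^(\S+)( .*?"[A-Z]+ )(\S+)(.*)$')


def pseudonymize_ip(ip, key=SECRET_KEY):
    """Replace an IP with a keyed hash; equal IPs get equal pseudonyms."""
    digest = hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def is_article_path(path):
    """True only for /wiki/<title> requests outside the listed namespaces."""
    if not path.startswith("/wiki/"):
        return False  # e.g. index.php requests are simply dropped here
    title = unquote(path[len("/wiki/"):].split("?", 1)[0])
    title = title.replace(" ", "_").lower()
    return not title.startswith(NON_ARTICLE_PREFIXES)


def sanitize(lines):
    """Yield article-only log lines with the client IP pseudonymized."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # drop unparsable lines rather than risk leaking them
        ip, middle, path, rest = match.groups()
        if is_article_path(path):
            yield pseudonymize_ip(ip) + middle + path + rest


if __name__ == "__main__":
    sample = [
        '1.2.3.4 - - [10/May/2005:12:00:00 +0000] "GET /wiki/Main_Page HTTP/1.0" 200 1234',
        '1.2.3.4 - - [10/May/2005:12:00:05 +0000] "GET /wiki/User:Example HTTP/1.0" 200 99',
    ]
    for cleaned in sanitize(sample):
        print(cleaned)
```

With the sample lines above, only the Main_Page request is kept, and
its first field is a 16-character pseudonym instead of the address.
Dropping anything that cannot be parsed is a deliberately conservative
choice: losing a few lines is cheaper than releasing an unmasked IP.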

- Mirco

====================================
http://ercolino.isti.cnr.it/mirco
====================================