Mailing List Archive

Release of squid log data
For a while now, we've been releasing squid log data, stripped of
personally identifying information such as IP addresses, to groups at
two universities: Vrije Universiteit and the University of Minnesota. We
now have a request pending from a third group, at Universidad Rey Juan
Carlos in Spain. They are asking if they can have the full data stream
including IP addresses, and they are prepared to sign a confidentiality
agreement to get it.

I'm leaning towards letting them have it. Via the confidentiality
agreement, we can avoid the most likely abuse scenarios, such as release
of individual user profiles. Currently we let toolserver users process
similar data, assisted by Wikipedia administrators who put web bugs on
the site. They use it to produce the WikiCharts report. Are we to tell
prospective research groups to use the toolserver, rather than their own
substantial hardware, for analysis of Wikipedia traffic patterns?

I'm not sure if this would be allowed under the privacy policy, which does
mention statistics, but doesn't say who is making them. Maybe the use of
web bugs by administrators is already against the privacy policy. In any
case, I think the question would benefit from community discussion,
which is why I am posting it here.

-- Tim Starling


_______________________________________________
foundation-l mailing list
foundation-l@lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/foundation-l
Re: Release of squid log data [ In reply to ]
On 14/09/2007, Tim Starling <tstarling@wikimedia.org> wrote:
> In any
> case, I think the question would benefit from community discussion,
> which is why I am posting it here.

It might be helpful (to prevent uninformed ramblings) if we could have
a draft of the proposed confidentiality agreement, or at least a rough
bulletpoint of what it would cover. Unless that's confidential ;-)

I assume the data processing and handling would be done in Spain? It's
certainly much less of a legal headache to shift the data to Europe
rather than from Europe...

--
- Andrew Gray
andrew.gray@dunelm.org.uk

Re: Release of squid log data [ In reply to ]
Andrew Gray wrote:

>It might be helpful (to prevent uninformed ramblings) if we could have
>a draft of the proposed confidentiality agreement, or at least a rough
>bulletpoint of what it would cover. Unless that's confidential ;-)
>
>I assume the data processing and handling would be done in Spain? It's
>certainly much less of a legal headache to shift the data to Europe
>rather than from Europe...

I'm with Andrew here; it depends on the terms and the scope of access within
their community.

I'm also interested in what their research goal is. Is it technical or
sociological study?

And - to try and not be uninformed - do we need to have a data disclosure
policy? The edit history is a very valuable research dataset, should we have
a "for academic research, under an appropriate non-disclosure agreement" as
part of the privacy policy?


Brian McNeil


Re: Release of squid log data [ In reply to ]
On 9/14/07, Tim Starling <tstarling@wikimedia.org> wrote:
> For a while now, we've been releasing squid log data,

Is there a public url for accessing that data?

Mathias

And just two questions: Do they need the actual IP-address or would
just a distinct number to tell different IP addresses be sufficient?

When you say stripped of personally identifying information, does this
include information such as search queries to our site that might to a
certain degree be used to identify persons? People digging into the
AOL-data did not need IP addresses to identify individual people.

Re: Release of squid log data [ In reply to ]
On 9/14/07, Tim Starling <tstarling@wikimedia.org> wrote:
> For a while now, we've been releasing squid log data, stripped of
> personally identifying information such as IP addresses, to groups at
> two universities: Vrije Universiteit and the University of Minnesota. We
> now have a request pending from a third group, at Universidad Rey Juan
> Carlos in Spain. They are asking if they can have the full data stream
> including IP addresses, and they are prepared to sign a confidentiality
> agreement to get it.
>
How long would the log file run for, and how long would the university
keep the log?

Re: Release of squid log data [ In reply to ]
Mathias Schindler wrote:
> On 9/14/07, Tim Starling <tstarling@wikimedia.org> wrote:
>
>>For a while now, we've been releasing squid log data,
>
>
> Is there a public url for accessing that data?

What data? You mean information about the project? The data itself is
only available as a UDP stream, there's no URL.

> And just two question: Do they need the actual IP-address or would
> just a distinct number to tell different IP addresses be sufficient?

I wouldn't recommend using a hashed IP address to anyone involved in
academic work. I've worked in the academic sector, I know how important
it is for data to be above any criticism. Any data using unique IP
addresses as an estimate of individual user population would be severely
skewed by proxies and NAT.

> When you say stripped of personally identifying information, does this
> include information such as search queries to our side that might to a
> certain degree be used to identify persons? People digging into the
> AOL-data did not need IP addresses to identify individual people.

Yes it includes search queries, user page queries, etc., but they're all
mixed in together in a homogeneous stream. There is no referrer data or
user agent data. So there is no way to correlate requests.

Also, we are only sending them 1 in every 10 requests. You can't tell
much about a person from one tenth of their requests, uniformly mixed in
with requests from 100 million other people.
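
As a rough illustration of the 1-in-10 sampling described above, here is a
minimal Python sketch. It assumes a simple counter-based sampler; the helper
name `sample_requests` and the fake log lines are hypothetical, and this is
not the actual squid/UDP logging code.

```python
from itertools import islice

def sample_requests(requests, rate=10):
    """Yield every `rate`-th request from an iterable of log lines.

    Hypothetical helper: assumes counter-based (not random) sampling.
    """
    # islice with a step keeps the items at index rate-1, 2*rate-1, ...
    return islice(requests, rate - 1, None, rate)

# Usage: 100 fake requests in, 10 out.
fake_log = [f"GET /wiki/Page_{i}" for i in range(100)]
sampled = list(sample_requests(fake_log))
```

With a sampler like this, any one reader's requests show up only as scattered
fragments in the output stream.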

-- Tim Starling


Re: Release of squid log data [ In reply to ]
Andrew Gray wrote:
> On 14/09/2007, Tim Starling <tstarling@wikimedia.org> wrote:
>
>>In any
>>case, I think the question would benefit from community discussion,
>>which is why I am posting it here.
>
>
> It might be helpful (to prevent uninformed ramblings) if we could have
> a draft of the proposed confidentiality agreement, or at least a rough
> bulletpoint of what it would cover. Unless that's confidential ;-)

It hasn't been written yet.

> I assume the data processing and handling would be done in Spain? It's
> certainly much less of a legal headache to shift the data to Europe
> rather than from Europe...

Yes.

-- Tim Starling


Re: Release of squid log data [ In reply to ]
I'd be all for that. Helping out academic enlightenment is one of the
goals of, well, every Wikimedia project I know; this is simply another
aspect of that. But, really, I've never been one to care much about
my own privacy. I don't much care what people know about me and
whether or not they like it. There are plenty out there who do,
however. Making a prelim of the confidentiality agreement and
publishing it might help them. It's already got my support though.

On 9/14/07, Tim Starling <tstarling@wikimedia.org> wrote:
> Andrew Gray wrote:
> > On 14/09/2007, Tim Starling <tstarling@wikimedia.org> wrote:
> >
> >>In any
> >>case, I think the question would benefit from community discussion,
> >>which is why I am posting it here.
> >
> >
> > It might be helpful (to prevent uninformed ramblings) if we could have
> > a draft of the proposed confidentiality agreement, or at least a rough
> > bulletpoint of what it would cover. Unless that's confidential ;-)
>
> It hasn't been written yet.
>
> > I assume the data processing and handling would be done in Spain? It's
> > certainly much less of a legal headache to shift the data to Europe
> > rather than from Europe...
>
> Yes.
>
> -- Tim Starling


--
-Brock

Release of squid log data [ In reply to ]
Do these universities offer to make a significant monetary donation to the
WMF?

Dedalus
Re: Release of squid log data [ In reply to ]
Dedalus wrote:
> Do these university offer to make a significant monetary donation to the
> WMF?
>
> Dedalus

Hmmm...If they did, would that be better or worse?

-Rich Holton

w:en:Rholton

Re: Release of squid log data [ In reply to ]
I fail to see what that has to do with the Foundation supporting academic
research.


Brian.

-----Original Message-----
From: foundation-l-bounces@lists.wikimedia.org
[mailto:foundation-l-bounces@lists.wikimedia.org] On Behalf Of Dedalus
Sent: 14 September 2007 20:01
To: foundation-l@lists.wikimedia.org
Subject: [Foundation-l] Release of squid log data

Do these university offer to make a significant monetary donation to the
WMF?

Dedalus
Re: Release of squid log data [ In reply to ]
On 0, Rich Holton <richholton@gmail.com> scribbled:
> Dedalus wrote:
> > Do these university offer to make a significant monetary donation to the
> > WMF?
> >
> > Dedalus
>
> Hmmm...If they did, would that be better or worse?
>
> -Rich Holton
>
> w:en:Rholton

I think it'd be worse. Anytime money changes hands, or might, there's going to be a conflict of interest. It's easy to say that it's only a *possibility*, that 'we' aren't really going to have any conflict of interest, but if the extensive literature on rationality and heuristics and biases has taught me anything, it's that even forewarned and educated persons fall prey to obvious irrationalities (even the very irrationalities they were warned about!).

No, much better to just avoid that quagmire entirely and keep it on the basis of 'Are you going to abuse this information, and are you going to do something worthwhile enough that it justifies the risk we're taking by giving this data to you?'

--
gwern
OSS body AVN GSGI OTCIXS NSOF ISEC & SEIDM Mexico
Re: Release of squid log data [ In reply to ]
On 9/14/07, Tim Starling <tstarling@wikimedia.org> wrote:
> For a while now, we've been releasing squid log data, stripped of
> personally identifying information such as IP addresses, to groups at
> two universities: Vrije Universiteit and the University of Minnesota. We
> now have a request pending from a third group, at Universidad Rey Juan
> Carlos in Spain. They are asking if they can have the full data stream
> including IP addresses, and they are prepared to sign a confidentiality
> agreement to get it.
>
> I'm leaning towards letting them have it. Via the confidentiality
> agreement, we can avoid the most likely abuse scenarios, such as release
> of individual user profiles. Currently we let toolserver users process
> similar data, assisted by Wikipedia administrators who put web bugs on
> the site. They use it to produce the WikiCharts report. Are we to tell
> prospective research groups to use the toolserver, rather than their own
> substantial hardware, for analysis of Wikipedia traffic patterns?
>
> I'm not sure if this would be allowed on the privacy policy, which does
> mention statistics, but doesn't say who is making them. Maybe the use of
> web bugs by administrators is already against the privacy policy. In any
> case, I think the question would benefit from community discussion,
> which is why I am posting it here.
>
> -- Tim Starling

I don't know if we should be letting any outside groups have the IP
addresses/data we are supposed to keep private; I'm uncomfortable with
that. I'd sooner we have someone here who is already trusted take
requests to run queries. (I note that Greg volunteers to do this...
and, for that matter, has been asking for access to do just such
things in the past.)

I don't think relying on an NDA to keep things private is effective
enough to meet our obligations. If we don't trust people to use proper
research ethics we shouldn't give them access to anything important in
the first place. But mistakes happen, leaks happen, and that you can
show somewhere along the way someone signed something that said they
wouldn't disclose private data doesn't take back the damage done from
mishandling.

The rest of the log data, that isn't private -- I don't see why you
should need to be a university group to access it. Is there somewhere
to do so publicly, or at least where anyone may make a request?

-Kat

--
Wikimedia needs you: http://wikimediafoundation.org/wiki/Fundraising
* * * * * * * * * * * * * * * * * * * * * * * * * * * *
http://en.wikipedia.org/wiki/User:Mindspillage | (G)AIM:Mindspillage
mindspillage or mind|wandering on irc.freenode.net | email for phone

Re: Release of squid log data [ In reply to ]
On 9/14/07, Tim Starling <tstarling@wikimedia.org> wrote:
[snip]
>They are asking if they can have the full data stream
>including IP addresses, and they are prepared to sign a confidentiality
>agreement to get it.
[snip]
> Currently we let toolserver users process
> similar data, assisted by Wikipedia administrators who put web bugs on
> the site. They use it to produce the WikiCharts report. Are we to tell
> prospective research groups to use the toolserver, rather than their own
> substantial hardware, for analysis of Wikipedia traffic patterns?
[snip]

This is simply not true.

The web bug used by Wikicharts uses a URL which gets a custom log
format which logs only the most basic data, here is an example entry:

[14/Sep/2007:00:09:36 +0000] "GET
/xyz.png?ns=0&title=Honored%20Matres&factor=6000&wiki=enwiki HTTP/1.1"

That is the entirety of the logged data. With the exception of the
HTTP version nothing is gathered which is not strictly necessary to
produce the top viewed page data, and even that is gathered at a
sampling rate low enough to make the usefulness questionable.
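
To make concrete how little such an entry carries, here is a minimal Python
sketch that splits the example line above into its fields. The helper name
`parse_entry` is hypothetical and the parsing is only an assumption about
how one might read this format; it is not the actual Wikicharts code.

```python
from urllib.parse import urlsplit, parse_qs

# Example entry copied from the message above (joined onto one line).
entry = ('[14/Sep/2007:00:09:36 +0000] "GET '
         '/xyz.png?ns=0&title=Honored%20Matres&factor=6000&wiki=enwiki HTTP/1.1"')

def parse_entry(line):
    """Split one log line into timestamp, method, query fields and HTTP version."""
    timestamp, _, request = line.partition('] "')
    timestamp = timestamp.lstrip('[')
    method, url, version = request.rstrip('"').split(' ')
    # parse_qs decodes %20 and returns a list per key; keep the first value.
    fields = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
    return {"timestamp": timestamp, "method": method, "version": version, **fields}

record = parse_entry(entry)
```

The resulting record holds a timestamp, the request method, the HTTP version
and the four query parameters; there is no address or user agent to parse.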

Not that it isn't horribly silly that we're using a JS web-bug and
toolserver for this because we are already recording much better data
while the wikicharts approach is unreliable, low quality, and
trivially subject to manipulation. At the time Wikicharts was
established there was no Wikimedia logging, and because all of the
Wikimedia logging data is kept private even from most of our own
'inside people', Wikicharts continues to use this method for its
reporting.

The data we are providing to outsiders is substantially better than
the data available to people with @wikimedia.org addresses, including
myself.

For the moment I'm going to refrain from making further public comment
on this subject because I've not yet read most of the messages and I
think consideration is deserved before issuing some harsh criticism.
... but the comment about wikicharts logging is a factual matter which
demanded correction.

Re: Release of squid log data [ In reply to ]
Tim Starling wrote:
> For a while now, we've been releasing squid log data, stripped of
> personally identifying information such as IP addresses, to groups at
> two universities: Vrije Universiteit and the University of Minnesota. We
> now have a request pending from a third group, at Universidad Rey Juan
> Carlos in Spain. They are asking if they can have the full data stream
> including IP addresses, and they are prepared to sign a confidentiality
> agreement to get it.

Why do they need the IPs?
What is the purpose of the data?

I don't see why personally identifying information could be needed other
than to personally identify someone.
Given that you say IPs are not to be used as unique IDs... Maybe
they're going to proxy-scan hundreds of IPs to find out if they're proxies??

I'd like to see the request reasons :P


PS: The intercepted data would surely be useless but would the data
stream with "personally identifying information" be vulnerable to a
man-in-the-middle attack?


Re: Release of squid log data [ In reply to ]
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 14/09/2007 13:41, Tim Starling wrote:
[..snip..]
> I'm not sure if this would be allowed on the privacy policy, which does
> mention statistics, but doesn't say who is making them. Maybe the use of
> web bugs by administrators is already against the privacy policy. In any
> case, I think the question would benefit from community discussion,
> which is why I am posting it here.
- From http://wikimediafoundation.org/wiki/Privacy_policy#Private_logging
There are 6 points, mostly about law enforcement and project protection
against abuse, followed by "Wikimedia policy does not permit public
distribution of such information under any circumstances, except as
described above", no hint about academic research or NDAs that would
allow third parties to access that personal information, not without
*explicit* consent from the users.

We have strict rules about how CUs have to handle private data of
_editors_ and then we would allow three universities to access data of
_any_user_ who accesses WMF sites?
I consider myself mildly paranoid, so this is undoubtedly POV, but I
think this idea is crazy.

Some people use static IP addresses, even with personal information
attached to whois records, did you ever consider it?
- --

Brownout

ICQ IM: 236537882
MSN IM: brown dot out at hotmail dot com
OpenPGP key: 0xCB11EA7E
fingerprint = 6706 B72E 0500 EC52 B33D 13B6 FCFA 8BE5 CB11 EA7E

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8-svn4549 (GNU/Linux)

iE8DBQFG6yE4fQNSMkK8eW8RCD9QAN9YlzvMcZ2Tm0qe9LTNmqpFx0FeE97SfD+c
wry/AOCh6HhyV3gj0TMqxPxiYfRl7si4qEMPIwHmChgs
=CP9N
-----END PGP SIGNATURE-----

Re: Release of squid log data [ In reply to ]
On 9/14/07, Brownout <brovvnout@gmail.com> wrote:
> We have strict rules about how CUs have to handle private data of
> _editors_ and then we would allow three universities to access data of
> _any_user_ that access WMF sites?
> I consider myself mildly paranoid, so this is undoubtedly POV, but I
> think this idea is crazy.
>
> Some people use static IP addresses, even with personal information
> attached to whois records, did you ever considered it?

It's a tremendous bit of information. For those people whose
identities are in their WP profile, you'd be giving access to
everything they ever read. For those people whose identities aren't
in their WP profile, you'd be giving location information which might
very well be enough to identify them.

What I still don't understand is what period this information would be
from. Would it only be a UDP stream of new requests, or would it
include old log data? At least if it's only new requests those of us
who are "mildly paranoid" can make sure we always access WP through
tor.

Re: Release of squid log data [ In reply to ]
On 9/14/07, Tim Starling <tstarling@wikimedia.org> wrote:
> I wouldn't recommend using a hashed IP address to anyone involved in
> academic work. I've worked in the academic sector, I know how important
> it is for data to be above any criticism. Any data using unique IP
> addresses as an estimate of individual user population would be severely
> skewed by proxies and NAT.

Perhaps in order to prevent potentially violating our own privacy
policy, we can meet the researchers half-way. If we can find out the
reason they need IP addresses we can craft the data we send them to
satisfy their request. For example:

a) they could just need the unique addresses to link together browsing
patterns, but not care for them to be IP addresses. We could
convert the addresses into a unique number (or a salted hash) and send
them the data.

b) they could be looking for network topology information; we could
give them the first two or three octets of the IP address.

c) they could be looking for geographical distribution of queries; we
could do the geo-lookup of addresses and give them coordinate
resolution for each address instead of the address itself.

Obviously, a), b) and c) are all somewhat contentious still, but probably
less so than just giving them raw IP addresses, and could be a good
compromise.
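
Options (a) and (b) above can be sketched roughly as follows in Python. This
is only an illustration under stated assumptions: it uses HMAC-SHA256 as a
stand-in for the "salted hash", handles IPv4 only, and the function names and
the salt value are hypothetical. Option (c) is left out because it would need
an external geo-IP database.

```python
import hashlib
import hmac
import ipaddress

def pseudonymize(ip, salt):
    """Option (a): keyed hash of the address; same IP + same salt -> same token."""
    return hmac.new(salt, ip.encode("ascii"), hashlib.sha256).hexdigest()[:16]

def truncate(ip, octets=2):
    """Option (b): keep only the first `octets` octets of an IPv4 address."""
    # strict=False lets ip_network zero out the host bits for us.
    network = ipaddress.ip_network(f"{ip}/{8 * octets}", strict=False)
    return str(network.network_address)

salt = b"hypothetical-secret-salt"  # assumption: kept private by the log producer
token = pseudonymize("192.0.2.45", salt)
prefix = truncate("192.0.2.45", octets=2)
```

Note that a stable token per address still lets all requests from that
address be linked together, so a scheme like this hides the address itself
but not the browsing pattern.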

-ilya

Re: Release of squid log data [ In reply to ]
No comment on your other points, but I'd hardly consider a university
conducting research under an NDA 'public' disclosure at all, to the
point where I don't think this even breaks the privacy policy,
although whether or not to do it should be a community decision.

On 9/14/07, Brownout <brovvnout@gmail.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> On 14/09/2007 13:41, Tim Starling wrote:
> [..snip..]
> > I'm not sure if this would be allowed on the privacy policy, which does
> > mention statistics, but doesn't say who is making them. Maybe the use of
> > web bugs by administrators is already against the privacy policy. In any
> > case, I think the question would benefit from community discussion,
> > which is why I am posting it here.
> - From http://wikimediafoundation.org/wiki/Privacy_policy#Private_logging
> There are 6 points, mostly about law enforcement and project protection
> against abuse, followed by "Wikimedia policy does not permit public
> distribution of such information under any circumstances, except as
> described above", no hint about academic research or NDAs that would
> allow third parties to access those personal informations, not without
> *explicit* consent from the users.
>
> We have strict rules about how CUs have to handle private data of
> _editors_ and then we would allow three universities to access data of
> _any_user_ that access WMF sites?
> I consider myself mildly paranoid, so this is undoubtedly POV, but I
> think this idea is crazy.
>
> Some people use static IP addresses, even with personal information
> attached to whois records, did you ever considered it?
> - --
>
> Brownout
>
> ICQ IM: 236537882
> MSN IM: brown dot out at hotmail dot com
> OpenPGP key: 0xCB11EA7E
> fingerprint = 6706 B72E 0500 EC52 B33D 13B6 FCFA 8BE5 CB11 EA7E
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.8-svn4549 (GNU/Linux)
>
> iE8DBQFG6yE4fQNSMkK8eW8RCD9QAN9YlzvMcZ2Tm0qe9LTNmqpFx0FeE97SfD+c
> wry/AOCh6HhyV3gj0TMqxPxiYfRl7si4qEMPIwHmChgs
> =CP9N
> -----END PGP SIGNATURE-----


--
-Brock

Re: Release of squid log data [ In reply to ]
Brock Weller wrote:
> No comment on your other points, but id hardly consider a university
> conducting research under a nda 'public' disclosure at all, to the
> point where i don't think this even breaks the privacy policy,
> although whether or not to do it should be a community decision.
>

Even if it's not a violation of the privacy policy per se, IMO the
privacy policy should still be updated to reflect this use of the data
if it's something we agree is a legitimate use of it and are going to be
doing on a semi-regular basis. We already list a few things the data is
used for, so can just add another one to the list.

-Mark


Re: Release of squid log data [ In reply to ]
On 9/14/07, Ilya Haykinson <haykinson@gmail.com> wrote:
> If we can find out the
> reason they need IP addresses we can craft the data we send them to
> satisfy their request. For example:

Two years ago*, when we didn't actually have the data to release, I
proposed a two pronged approach, restated here:

(1) Make as much of the non-private data public as we safely can. This
maximizes the public value of the data and avoids the harm done by
picking favorites, that is, sharing valuable data (commercially valuable
as well as academically valuable) with only certain groups. Plus it
scales much better.

(2) Offer to run reasonable aggregation scripts for those who can
describe a need for access to data we protect. For example, if they
wanted to analyze article views vs country of origin the script could
look up the countries and only disclose that.

If the needs of a researcher can't be met by data scrubbed with a
custom aggregator, then I must question the usefulness of their
research: If it's not possible to convert the research data into an
aggregate result which has no privacy problems then the underlying
data driving their research would be unpublishable, unrepeatable, and
unverifiable.
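
The kind of aggregation script described in (2) could look roughly like the
following Python sketch, using the article-views-by-country example. The
function name, the record layout and the toy lookup table are all
hypothetical; a real script would need a geo-IP database behind
`lookup_country`.

```python
from collections import Counter

def views_by_country(records, lookup_country):
    """Aggregate (ip, title) pairs into per-(title, country) view counts.

    `lookup_country` is a hypothetical callable mapping an IP string to a
    country code; a real script would back it with a geo-IP database.
    """
    counts = Counter()
    for ip, title in records:
        counts[(title, lookup_country(ip))] += 1
    return counts  # the output contains no IP addresses

# Usage with a toy lookup table standing in for a geo-IP database.
toy_table = {"192.0.2.1": "ES", "198.51.100.7": "US", "203.0.113.9": "ES"}
records = [("192.0.2.1", "Squid"), ("203.0.113.9", "Squid"), ("198.51.100.7", "Squid")]
result = views_by_country(records, toy_table.get)
```

Only the aggregated counts would leave the trusted environment; the raw
records would stay where they are.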

Keep in mind that well over 99% of the people potentially impacted by
this aren't our "community", they aren't people who have already
agreed to lose a little privacy by making public edits... they are
just readers.

It is my understanding that public libraries do not generally disclose
detailed use records like this for outside research. Google and the
other search engines fought in court to avoid providing the US
government search log data.

I'm also disappointed with the standard of care provided by some other
academic Wikipedia data researchers in recent memory.

So long as there exist *reasonable alternatives* I'm having a hard
time seeing the justification for this proposed disclosure.



*For some reason our own archive of this thread seems to be missing. I
found a third party copy:
http://www.archivum.info/wikipedia-l@wikimedia.org/2005-08/msg00049.html

Re: Release of squid log data [ In reply to ]
On 9/14/07, Ilya Haykinson <haykinson@gmail.com> wrote:
> On 9/14/07, Tim Starling <tstarling@wikimedia.org> wrote:
> > I wouldn't recommend using a hashed IP address to anyone involved in
> > academic work. I've worked in the academic sector, I know how important
> > it is for data to be above any criticism. Any data using unique IP
> > addresses as an estimate of individual user population would be severely
> > skewed by proxies and NAT.
>
> Perhaps in order to prevent potentially violating our own privacy
> policy, we can meet the researchers half-way.

The best way to avoid violating the privacy policy would be to change
it to say exactly what it is you plan on doing, and to not give data
from before the policy is changed.

> If we can find out the
> reason they need IP addresses we can craft the data we send them to
> satisfy their request. For example:
>
> a) they could just need the unique addresses to link together browsing
> patterns, but not care for them to be IP addresses. We could create
> convert the addresses into a unique number (or a salted hash) and send
> them the data.
>
In case anyone's seriously considering this, make sure you've read
[[AOL search data scandal]] which should show you why it's completely
useless. This is *especially* true with Wikipedia data, where the
urls we access constantly reveal who we are (e.g.
http://en.wikipedia.org/wiki/User_talk:Whatever).

> b) they could be looking for network topology information; we could
> give them the first two or three octets of the IP address.
>
Three octets would be almost as bad as a) for the same reasons. Two
octets would be better, but less useful too.

> c) they could be looking for geographical distribution of queries; we
> could do the geo-lookup of addresses and give them coordinate
> resolution for each address instead of the address itself.
>
If that geo information is limited to country, I guess it wouldn't be too bad.

Release of squid log data [ In reply to ]
If a Chinese or Iranian university offered to sign a confidentiality
agreement, would you accept it? Or an institute in another country with
which they exchange students?

I still remember the talk at Berlin, 21C3, Dec 2004, where inside info was
given about the draconian measures China has taken to keep its citizens under
control. According to the talk they have 30,000 IT personnel working on
patrolling their electronic borders (estimate by 'Reporters without
Borders'), and the best (US) equipment, loads of it. Those guys would love
to parse these data.

I am not questioning the integrity of current applicants at all. I do have
doubts about where the data will ultimately end up, if gradually tens of
institutions carry our viewer data on their portables, or in 2009 on 1 TB
memory sticks :)

Pakistan got the blueprints for ultracentrifuges for producing nuclear bombs
by a friendly student exchange project, from a small peaceful country in
Western Europe. Sensitive scientific data tend to travel.

Erik Zachte

> Tim Starling wrote:
> > For a while now, we've been releasing squid log data, stripped of
> > personally identifying information such as IP addresses, to groups at
> > two universities: Vrije Universiteit and the University of Minnesota. We
> > now have a request pending from a third group, at Universidad Rey Juan
> > Carlos in Spain. They are asking if they can have the full data stream
> > including IP addresses, and they are prepared to sign a confidentiality
> > agreement to get it.
>


_______________________________________________
Re: Release of squid log data [ In reply to ]
On 0, Gregory Maxwell <gmaxwell@gmail.com> scribbled:
> On 9/14/07, Ilya Haykinson <haykinson@gmail.com> wrote:
> > If we can find out the
> > reason they need IP addresses we can craft the data we send them to
> > satisfy their request. For example:
>
> Two years ago*, when we didn't actually have the data to release, I
> proposed a two pronged approach, restated here:
>
> (1) Make as much of the non-private data public as we safely can. This
> maximizes the public value of this data and avoids the harm of
> picking favorites by sharing valuable data (commercially valuable as
> well as academically valuable) with only certain groups. Plus it
> scales much better.
....

In a very strong sense, we can 'safely' make no data available. I went and did a little research (shucks, now I'm feeling like ArmedBlowfish). Entirely apart from obvious attacks using this data, like [[traffic analysis]] and all the various attacks Tor and remailer systems try to protect against, just the database alone is enough to compromise identities and reveal valuable information - even if you pseudonymize and remove data, and even if you insert dummy (but statistically valid, so it doesn't wreck analyses) data.

The obvious example to prove this would be the leak of AOL search queries, but there's an even better example. It turns out that Iceland has a very large and very well known national DNA database with which is associated a large quantity of metadata concerning family trees and what not (a somewhat amusing aside - a professor of mine once described her visits to Icelander-dominated parties; apparently when Icelanders have nothing better to chat about, or nothing particular in common, they simply go over their genealogies and figure out how they are related). Eventually [[Decode Genetics]]'s database was killed out of privacy concerns (<http://observer.guardian.co.uk/international/story/0,6903,1217842,00.html> etc.).

This is interesting, yes, but for us the interesting thing is that efforts were made to anonymize/scrub the data before use. Keeping in mind that the techniques were more advanced than the ones I've seen suggested here, the efforts failed. Inferences could be made from the data that broke the security quite easily. I found one particularly interesting paper on the topic; I quote from the abstract:

"Results: While susceptibility varies, we find that each of the protection methods studied is deficient in their protection against re-identification. In certain instances the protection schema itself, such as singly-encrypted pseudonymization, can be leveraged to compromise privacy even further than simple de-identification permits. In order to facilitate the future development of privacy protection methods, we provide a susceptibility comparison of the methods."

"Conclusion: This work illustrates the danger of blindly adopting identity protection methods for genomic data. Future methods must account for inferences that can be leaked from the data itself and the environment into which the data is being released in order to provide guarantees of privacy. While the protection methods reviewed in this paper provide a base for future protection strategies, our analyses provide guideposts for the development of provable privacy protecting methods."

("Why Pseudonyms Don’t Anonymize: A Computational Re-identification Analysis of Genomic Data Privacy Protection Systems"; <http://privacy.cs.cmu.edu/dataprivacy/projects/linkage/lidap-wp19.pdf>.)

--
gwern
contacts Unix Force SUR Flame analysis bank Gamma CBNRC passwd
