Mailing List Archive

Ethical question regarding some code
Hey,
I have an ethical question that I haven't been able to answer yet. I have
been asking around without getting a definite answer, so I'm putting it to
a larger audience in the hope of a solution.

For almost a year now, I have been developing an NLP-based AI system to
catch sock puppets (two users pretending to be different people but
actually the same person). It's based on the way they speak. The way we
speak is like a fingerprint: it's unique to us, and it's really hard to
forge or change on demand (unlike an IP or user agent). As a result, if you
apply some basic AI techniques to Wikipedia discussions (which can be
really lengthy, trust me), the sock puppets shine through in the data.
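To give a flavor of the signal involved, here is a toy sketch (not the
actual code; the tokenizer and the Jensen-Shannon distance below are
simplistic placeholders I picked just for illustration):

    from collections import Counter
    import math

    def distribution(texts):
        # Toy tokenizer: lowercase whitespace split over a user's comments.
        counts = Counter(tok for t in texts for tok in t.lower().split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def jensen_shannon_distance(p, q):
        # 0 = identical word distributions, 1 = completely disjoint.
        vocab = set(p) | set(q)
        m = {w: (p.get(w, 0) + q.get(w, 0)) / 2 for w in vocab}
        def kl(a):
            return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)
        return math.sqrt((kl(p) + kl(q)) / 2)

On long discussions, same-author pairs tend to score visibly lower than
unrelated pairs, which is the kind of separation the graphs below
illustrate.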

Here's an example; I highly recommend looking at these graphs. I compared
two pairs of users: one pair that are not sock puppets, and one pair of
known socks (a user who was banned indefinitely but came back hidden under
another username). [1][2] These graphs are based on one of several aspects
of this AI system.

I have talked with the WMF and other CUs ("checkusers") about building this
to help us understand and catch socks, especially the ones that have enough
resources to change their IP/UA regularly (like sock farms and/or UPEs).
Also, with the rise of mobile internet providers and the horrible way they
assign IPs to their users, this can come in really handy in some SPI ("sock
puppet investigation") [3] cases.

The problem is that this tool, while built only on public information,
actually has the power to expose legitimate sock puppets: people who live
under oppressive governments and edit on sensitive topics. Disclosing such
connections between two accounts can cost people their lives.

So, this code is not going to be public, period. But we need to have this
code in Wikimedia Cloud Services so that people like CUs on other wikis can
use it as a web-based tool instead of me running it for them on request.
Yet the WMCS terms of use explicitly say code should never be
closed-source, and this is our principle. What should we do? Should I pay a
corporate cloud provider and put such important code and data there? Should
we amend the terms of use to allow exceptions like this one?

The most plausible solution suggested so far (thanks, Huji) is to publish a
shell of code that would be useless without data, keep the code that
produces the data (out of dumps) closed, and update the data myself (which
is fine; running that code is not too hard, even on enwiki). This might be
doable (I'm only around 30% sure; it still might expose too much), but it
wouldn't cover future cases similar to mine, and I think a longer-term
solution is needed here. Also, it would reduce the bus factor to 1, and
maintenance would be complicated.

What should we do?

Thanks
[1]
https://commons.wikimedia.org/wiki/File:Word_distributions_of_two_users_in_fawiki_1.png
[2]
https://commons.wikimedia.org/wiki/File:Word_distributions_of_two_users_in_fawiki_2.png
[3] https://en.wikipedia.org/wiki/Wikipedia:SPI
--
Amir (he/him)
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Ethical question regarding some code [ In reply to ]
Creating and promoting the use of a closed-source tool, especially one
used to detect disruptive editing, runs counter to core Wikimedia
community principles.

Making such a tool closed-source prevents the Wikimedia editing
community from auditing its use, contesting its decisions, making
improvements to it, or learning from its creation. This causes harm to
the community.

Open-sourcing a tool such as this could allow an unscrupulous user to
connect accounts that are not publicly connected. This is a problem
with all sock detection tools. It also causes harm to the community.

The only way to create such a tool that does not harm the community in
any way is to make the tool's decision-making entirely public while
keeping the tool's decisions non-public. This is not possible.
However, we can approach that goal using careful engineering and
attempt to minimize harm. Things like restricting the interface to
CUs, requiring a logged reason for a check, technical barriers against
fishing (comparing two known users, not looking for other potential
users), not making processed data available publicly, and publishing
the entire source code (including code used to load data) can reduce
harm.
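To make that concrete, a minimal sketch of such a guarded, logged,
pairwise-only check could look like this (every name here, such as
is_checkuser and audit_log, is illustrative, not an existing API):

    import datetime

    audit_log = []  # in practice a durable log, reviewable by other CUs

    def is_checkuser(username):
        # Stub: a real deployment would check an OAuth/NDA-backed group.
        return username in {"ExampleCU"}

    def compare_accounts(requester, user_a, user_b, reason, model):
        if not is_checkuser(requester):
            raise PermissionError("interface restricted to checkusers")
        if not reason.strip():
            raise ValueError("a logged reason is required for every check")
        audit_log.append({
            "when": datetime.datetime.utcnow().isoformat(),
            "who": requester,
            "pair": (user_a, user_b),
            "reason": reason,
        })
        # Only a pairwise comparison of two named accounts is exposed;
        # there is deliberately no "find everyone who writes like X"
        # query, which is the technical barrier against fishing.
        return model.compare(user_a, user_b)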

After all that, if you are not satisfied that harm has been
sufficiently reduced, there is only one answer: do not create the
tool.

AntiCompositeNumber

On Wed, Aug 5, 2020 at 10:33 PM Amir Sarabadani <ladsgroup@gmail.com> wrote:
> [quoted message trimmed]
Re: Ethical question regarding some code [ In reply to ]
That's a tough question, and I'm not sure what the answer is.

There is a little bit of precedent with
https://www.mediawiki.org/w/index.php?oldid=2533048&title=Extension:AntiBot

When evaluating harm, I guess one of the questions is how your approach
compares in effectiveness to other publicly available approaches like
http://www.philocomp.net/humanities/signature.htm &
https://github.com/search?q=authorship+attribution+user:pan-webis-de ?
(i.e., there is more harm if your approach is significantly better than
other already-available tools, and less if they're at a similar level)

--
Brian

On Thu, Aug 6, 2020 at 2:33 AM Amir Sarabadani <ladsgroup@gmail.com> wrote:

> [quoted message trimmed]
Re: Ethical question regarding some code [ In reply to ]
I'm afraid I have to agree with what AntiCompositeNumber wrote. When
you set up infrastructure to fight abuse – no matter whether that
infrastructure is a technical barrier like a captcha, a tool that
"blames" people for being sock puppets, or a law – it will affect
*all* users, not only the abusers. What you need to think about is not
whether what you do is right or wrong, but whether there is still an
acceptable balance between your intended positive effects and the
unavoidable negative effects.

That said, I'm very happy to see something like this being discussed
this early. That doesn't always happen. Does anyone still remember
discussing the "Deep User Inspector"[1][2] in 2013?

Having read what has already been said about "harm", I feel something
is missing: AI-based tools always have the potential to cause harm
simply because people don't really understand what it means to work
with such a tool. For example, when the tool says "there is a 95%
certainty this is a sock puppet", people will use this as "proof",
totally ignoring the fact that the particular case they are looking at
could just as well be within the 5%. This is why I believe such a tool
cannot be a toy, open for anyone to play around with, but needs
trained users.
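A toy base-rate calculation makes the point (all numbers are invented
for illustration; the real error rates and priors are unknown):

    prior = 0.01           # assumed fraction of compared pairs that are socks
    sensitivity = 0.95     # chance the tool flags a true sock pair
    false_positive = 0.05  # chance the tool flags an innocent pair

    p_flag = sensitivity * prior + false_positive * (1 - prior)
    p_sock_given_flag = sensitivity * prior / p_flag
    print(f"P(sock | flagged) = {p_sock_given_flag:.2f}")  # ~0.16, not 0.95

Even a tool that is right 95% of the time on both kinds of pairs leaves
most flagged pairs innocent when true socks are rare.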

TL;DR: Closed source? No. Please avoid at all costs. Closed databases? Sure.

Best
Thiemo

[1] https://ricordisamoa.toolforge.org/dui/
[2] https://meta.wikimedia.org/wiki/User_talk:Ricordisamoa#Deep_user_inspector

Re: Ethical question regarding some code [ In reply to ]
Technically, you could make the tool open source and still keep the source
code secret. That solves the maintenance problem (others who get access can
legally modify it). Of course, you'd have to trust everyone with access to
the files not to publish them, which they would be technically entitled to
do (unless there is some NDA-like mechanism).

Transparency and auditability wouldn't be fulfilled just by making the code
public, anyway; they need to be solved by tool design (keeping logs,
providing feedback options for the users, trying to expose the components
of the decision as much as possible).

I'd agree with Bawolff, though, that there is probably no point in going to
great lengths to keep the details secret, as creating a similar tool is
probably not that hard. At the least, you can build assumptions into the
tool that are nontrivial to fulfill outside Toolforge (e.g. using the
replicas instead of dumps) to make running it elsewhere require an effort.
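For example, a sketch of binding the tool to the Toolforge wiki replicas
(the host and credential conventions follow the Toolforge documentation;
the query itself is only illustrative):

    import configparser
    import pymysql

    def replica_connection(wiki="enwiki"):
        cfg = configparser.ConfigParser()
        cfg.read("replica.my.cnf")  # per-tool credentials on Toolforge
        return pymysql.connect(
            host=f"{wiki}.analytics.db.svc.wikimedia.cloud",
            database=f"{wiki}_p",
            user=cfg["client"]["user"],
            password=cfg["client"]["password"],
            charset="utf8mb4",
        )

    def recent_comments(conn, user, limit=100):
        # Illustrative query against the modern MediaWiki schema.
        with conn.cursor() as cur:
            cur.execute(
                "SELECT comment_text FROM revision "
                "JOIN actor ON rev_actor = actor_id "
                "JOIN comment ON rev_comment_id = comment_id "
                "WHERE actor_name = %s ORDER BY rev_timestamp DESC LIMIT %s",
                (user, limit),
            )
            return [row[0] for row in cur.fetchall()]

Outside Toolforge, none of these hosts or credentials resolve, so the
published code would not run as-is.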
Re: Ethical question regarding some code [ In reply to ]
Nice idea! The first time I wrote about this being possible was back in
2008 or so.

The problem is quite simple: you use some observable feature to
fingerprint an adversary. The adversary can then game the system if the
observable feature can somehow be changed or modified. To avoid this,
the observable features are usually chosen to be physical properties that
can't easily be changed.

In this case the features are words and/or relations between words, so
the question is “Can the adversary change their choice of words?” Yes,
they can, because the choice of words is not an inherent physical property
of the user. In fact, there are several programs that help users express
themselves more fluently, and such systems change the observable features,
i.e. the choice of words. The program moves the observable features (the
words) from one user-specific distribution to another, more
program-specific distribution. A priori you will observe the users to be
different, but with the program they will a posteriori be more similar.

A real problem is your own poisoning of the training data. That happens
when you find some subject to be the same as your postulated one and then
feed that information back into your training data. But if you don't do
that, your training data will start to rot, because humans change over
time. It is bad either way.

Even more fun is an adversary who knows what you are doing and tries to
negate your detection algorithm, or even fool you into believing they are
someone else. It is, after all, nothing more than word counts and
statistics. What will you do when someone edits a Wikipedia page and your
system tells you “This revision is most likely written by Jimbo”?

Several such programs exist, and I'm a bit perplexed that they are not in
wider use among Wikipedia's editors. Some of them are quite aggressive and
can propose radical rewrites of the text. I use one of them; it is not the
best, but it still corrects me all the time.

I believe it would be better to create a system where users are internally
identified and externally authenticated. (The former is biometric
identification and must adhere to privacy laws.)

On Thu, Aug 6, 2020 at 4:33 AM Amir Sarabadani <ladsgroup@gmail.com> wrote:

> [quoted message trimmed]
Re: Ethical question regarding some code [ In reply to ]
For those interested: as far as I know, the best solution for this kind of
similarity detection is a Siamese network with RNNs in the first part.
That implies you must extract fingerprints for all likely candidates
(users), and then some, to create a baseline. You cannot simply claim that
two users (the adversary and the postulated sock) are the same because
they have edited the same page. It is quite unlikely a user will edit the
same page with a sock puppet once it is known that such a system is active.
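For concreteness, a minimal PyTorch sketch of such a Siamese encoder (the
layer sizes, the GRU choice, and the contrastive training setup are
assumptions for illustration, not a description of any existing tool):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SiameseEncoder(nn.Module):
        """One shared encoder embeds both users' texts as fingerprints."""
        def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

        def forward(self, token_ids):  # token_ids: (batch, seq_len)
            _, h = self.rnn(self.embed(token_ids))
            return h.squeeze(0)        # (batch, hidden_dim) fingerprints

    def pair_similarity(encoder, a_ids, b_ids):
        # Cosine similarity of the two fingerprints; training with a
        # contrastive loss over known same/different author pairs builds
        # the baseline mentioned above.
        return F.cosine_similarity(encoder(a_ids), encoder(b_ids))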

On Thu, Aug 6, 2020 at 10:49 PM John Erling Blad <jeblad@gmail.com> wrote:

> [quoted message trimmed]
Re: Ethical question regarding some code [ In reply to ]
I think an important thing to note is that this is built on public
information, so such a model, better or worse, can easily be built by any
AI enthusiast. The potential for misuse is limited, as the model is
relatively easy to game, and I don't think its results will hold more
water than behaviour analysis done by a human (at which some editors
excel). Theoretically, feeding such edits into an assessment system
similar to ClueBot and having expert sockpuppet hunters assess them would
result in a much more accurate and more "dangerous" model, so to say; but
since it is built on public information, it shouldn't be closed source,
which probably only stifles innovation (compare, e.g., GPT-3's eventual
release).
If the concern is privacy, it would probably be best to dismantle the
entire project; but again, anyone who wants to can simply put in the hours
required to do something similar, so there is not much point. Also, by
raising this here, you have probably triggered the Streisand effect, and
more people are now aware of your model and its possible repercussions;
still, transparency is integral to all open-source communities. In the
end, it comes down to your choice; there's no right answer as far as I
can tell.

Best,
QEDK

On Fri, Aug 7, 2020, 02:19 John Erling Blad <jeblad@gmail.com> wrote:

> [quoted message trimmed]
Re: Ethical question regarding some code [ In reply to ]
I appreciate that Amir is acknowledging that, as neat as this tool sounds,
its use is fraught with risk. The comparison that immediately jumped to my
mind is the predictive algorithms used in the criminal justice system to
assess the risk of bail jumping or criminal recidivism. These algorithms
have been largely secret, their use hidden, their conclusions non-public.
The more we learn about them, the clearer it becomes how deeply flawed
they are. Obviously the real-world consequences of those tools are more
severe, in that they directly lead to the incarceration of many people,
but I think the comparison is illustrative of the risks. It also suggests
the kind of ongoing, comprehensive review that should accompany making
this tool available to users.

The potential misuse to be concerned about here is by amateurs with
malicious intent, or by intended users who are reckless or ignorant of
the risks. Major governments have the resources to easily build this
themselves, and if they care enough about fingerprinting Wikipedians,
they likely already have.

I think if the tool is useful and there's demand for it, everything about
it (how it works, who uses it, what conclusions and actions are taken as a
result of its use, and so on) should be made public. That's the only way
we'll discover the multiple ways in which it will surely, eventually, be
misused. SPI has been using these 'techniques' manually, or with
unsophisticated tools, for many years. But like any tool, the data fed
into it can train the system incorrectly, the results it returns can be
misunderstood or intentionally misused, and knowledge of its existence
will lead the most sophisticated to beat it or intentionally misdirect it.
People who are innocent of any violation of our norms will be harmed by
its use. Please establish proper cultural and procedural safeguards to
limit the harm as much as possible.
Re: Ethical question regarding some code [ In reply to ]
As others have said, I see several problems:
1. If the code is public, someone can duplicate it and bypass our internal 'safekeeping', because it uses public data.
2. Risk of misuse through either incompetence or malice.
3. Risk of accidentally exposing legitimate sockpuppets, even in the most closed-off situations.
4. Giving people insight into how the AI works.

My answers to those (see also the sketch after this list):

1. I have no problem with keeping this as private-repo (yet technically open-sourced) code. We also run private mailing lists and have private repos for configuration secrets. Yes, it is a bit of a stretch, but... IAR. At the same time, from the description, this seems like something any AI developer with a bit of determination could reproduce... so for how long will this matter?
2. NDA + OAuth access for those who need it. Aggressive logging of every use of the software, with the logs shown to all users of the tool to enforce social control: "User X investigated the matches of account Y", "User Z investigated a match on previously known sockpuppet BlockedQ".
3. Usage-wise, I'd have two flows:
   1. Matches: surface 'matches' against previously known sockpuppets (this requires keeping track of that list). Only disclose details of a match upon additional (logged) user action.
   2. Requests: enter specific account name(s) and ask whether there are matches on/between that/those name(s). (Logged.)
   Those flows might have different match-certainty thresholds, perhaps. If you want to go even further: require signoff on a request by another user before you can actually view the matches.
4. That does leave you with the problem of how you can give people insight into why the AI matched something. That is a hard problem, and I don't know enough about that problem space.
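Here is the sketch referenced above: a rough outline of the
request-plus-signoff flow from points 2 and 3 (all names and the in-memory
storage are illustrative; a real tool would need durable, CU-visible logs):

    import uuid

    activity_log = []  # shown to all users of the tool (social control)
    pending = {}       # request id -> details awaiting second-user signoff

    def request_matches(requester, account_names):
        rid = str(uuid.uuid4())
        pending[rid] = {"requester": requester, "accounts": account_names}
        activity_log.append(f"{requester} requested matches for {account_names}")
        return rid

    def sign_off(reviewer, rid):
        req = pending[rid]
        if reviewer == req["requester"]:
            raise PermissionError("signoff must come from a second user")
        activity_log.append(f"{reviewer} signed off on request {rid}")
        # Only now are the match details disclosed to the requester
        # (that disclosure would itself be logged).
        return pending.pop(rid)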

DJ

> On 6 Aug 2020, at 04:33, Amir Sarabadani <ladsgroup@gmail.com> wrote:
> [quoted message trimmed]
Re: Ethical question regarding some code [ In reply to ]
Thanks Amir for having this conversation here.

On Nathan's point: outside the Wikimedia projects, we of the free
culture movement tend to argue for full transparency about the
functioning of "automated decision making", "algorithmic tools",
"forensic software" and so on, typically ensured by open data and free
software.

Wikimedia wikis are not a court system and a block is not jail, but, for
instance, the EFF just argued that in the US judiciary certain rights
ought to be ensured to respect the Sixth Amendment.
https://www.eff.org/deeplinks/2020/07/our-eu-policy-principles-procedural-justice
https://www.eff.org/deeplinks/2020/08/eff-and-aclu-tell-federal-court-forensic-software-source-code-must-be-disclosed
<https://meta.wikimedia.org/wiki/EU_policy/Consultation_on_the_White_Paper_on_Artificial_Intelligence_(2020)>

Federico

Re: Ethical question regarding some code [ In reply to ]
For better or worse, it seems clear that the cat is out of the bag.
Identity detection through stylometry is now an established technology and
you can easily find code on GitHub or elsewhere (e.g.
https://github.com/jabraunlin/reddit-user-id) to accomplish it (if you have
the time and energy to build a data set and train the model). Back in 2017,
there was even a start-up company that was offering this as a service.
Whatever danger is embodied in Amir's code, it's only a matter of time
before this danger is ubiquitous. And for the worst-case
scenario—governments using the technology to hunt down dissidents—I imagine
this is already happening. So while I agree there is a moral consideration
to releasing this software, I think the moral implications aren't actually
that huge. Eventually, we will just have to accept that creating separate
accounts is not an effective way to protect your identity. That said, I
think taking precautions to minimize (or at least slow down) the potential
abuse of this technology is sensible. TheDJ offered many good suggestions
in this vein so I won't repeat them here. Overall though, I think moving
ahead with this tool is a good idea and I hope you are able to come to a
solution that is amenable to everyone. The WMF is also interested in this
technology (as a potential mitigation for IP masking), so the outcome may
help inform their work as well.

On Fri, Aug 7, 2020 at 5:51 AM Derk-Jan Hartman <
d.j.hartman+wmf_ml@gmail.com> wrote:

> [quoted message trimmed]
Re: Ethical question regarding some code [ In reply to ]
Thank you all for the responses; I'll try to summarize my replies here.

* By closed source, I don't mean the code will be accessible only to me.
It's already accessible to another CU and one WMF staff member, and I would
gladly share the code with anyone who has signed the NDA; they are of
course more than welcome to change it. GitHub has a really low limit on how
many people can access a private repo, but I would be fine with any means
of fixing this.

* I have read people saying that there are already public tools to analyze
text. I disagree: the tools mentioned are for English and not other
languages (maybe I missed something), and even if we imagine such tools
existed for big languages like German and/or French, they wouldn't cover
lots of languages, unlike my tool, which is basically language-agnostic and
depends only on the volume of discussion in the wiki.

* I also disagree that it's not hard to build. I have lots of experience
with NLP (my favorite work being a tool that finds swear words in every
language based on the history of vandalism in that Wikipedia [1]), and it
still took me more than a year (a couple of hours almost every weekend) to
build this. Analyzing pure clean text is not hard; cleaning up wikitext,
templates, and links to get only the text people actually "spoke" is doubly
hard, and analyzing user signatures brings only suffering and sorrow (see
the sketch after this list for a taste of the clean-up involved).

* While in general I agree that if a government wants to build this, it
can, reality is more complicated, and the situation is similar to security:
you can never be 100% secure, but you can increase the cost of hacking you
so much that it becomes pointless for a major actor to try. Governments
have limited budgets, dictatorships are by design corrupt and filled with
incompetent people [2], and sanctions put another restraint on such
governments; so I would not hand them such an opportunity for oppression on
a silver platter, for free. If they really want it, they will have to pay
for it (which means they can't use that money and those resources to
oppress some other group).

* People have said this AI is easy to game. It's not that easy, and the
tools mentioned are limited to English, so it's still a big win for the
integrity of our projects. It boils down again to increasing the cost. If a
major actor wants to spread disinformation, so far they only need to fake
their UA and IP, which is a piece of cake, and I already see that happening
(as a CU). But now they would have to mess with their UA/IP AND change
their manner of speaking (which is an order of magnitude harder than
changing an IP). As I said, increasing this cost might not prevent abuse
from happening, but at least it takes away the ability to oppress other
groups.

* This tool will never be the only reason to block a sock. It is, more than
anything, a helper: if a CU turns up a large range and the accounts are
similar but the result is not conclusive, this tool can help. Or when we
are 90% sure it's a WP:DUCK, this tool can help too. But blocking just
because this tool said so would imply a "Minority Report" situation, and
honestly I would really like to avoid that. It is supposed to empower CUs.

* Banning the use of this tool is not legally possible: the content of
Wikipedia is published under CC BY-SA, which allows such analysis, and you
especially can't ban an off-wiki action. Also, if a university professor
can do this, I don't see the point of banning its use by the most trusted
group of users (CUs). You could ban blocking based on this tool, but I
don't think we should block solely based on it anyway.

* It has been pointed out by people on the checkuser mailing list that
there's no point in logging access to this tool: since the code is
accessible to CUs (if they want it), they can download it and run it on
their own computers without any logging anyway.

* There is a huge difference between CU and this AI tool in matters of
privacy. Both are privacy-sensitive, but CU reveals much more. As a CU, I
know where lots of people live or study because they showed up in my
checks, and while I won't tell a soul about them, it makes me uncomfortable
(I'm not implying CUs are not trusted; it's just that we should respect
people's privacy and avoid "unreasonable searches and seizures" [3]). This
tool only reveals a connection between accounts if one of them is linked to
a public identity and the other is not, which I wholeheartedly agree is not
great, but it's not on the same level as seeing people's IPs. So I even
think that in an ideal world where the AI model was more accurate than CU,
we should stop using CU and rely solely on the AI instead (important: I'm
not implying the current model is better; I'm saying if it were better).
This would help us understand why, for example, fishing for sock puppets
with CU is bad (and banned by policy) while fishing for socks using this AI
is not, and can even be a good starting point. In other words, this tool,
used right, can reduce checkuser actions and protect people's privacy
instead.

* People have been saying you need to teach AI to people so that, for
example, CUs don't make wrong judgments based on it. I want to point out
that the examples mentioned in the discussion are supervised machine
learning, which is AI, but not all of AI. This tool is not machine
learning, but it is AI (it relies heavily on NLP); it produces graphs and
the like, and it doesn't give a number like "95% sure these two users are
the same", which a supervised machine learning model would. I think
reducing people's fingerprints to a single number is inaccurate and harmful
(life is not like a TV crime series where a forensic scientist hands you
the truth using some magic). I will write detailed instructions on how to
use it, but it's not as bad as you'd think; I leave huge room for human
judgment.
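Here is the clean-up sketch I promised above. It uses the (real)
mwparserfromhell library to strip markup; the signature regex is a crude,
English-only stand-in for what is in reality a much messier, per-language
problem:

    import re
    import mwparserfromhell

    # English namespace names only; per-language signature and namespace
    # aliases are a large part of why this step is painful.
    SIGNATURE = re.compile(r"\[\[(?:User|User talk):[^\]]*\]\][^\n]*\(UTC\)")

    def spoken_text(wikitext):
        # Drop signatures first, then strip templates, links, and other
        # markup so only the prose a user actually "spoke" remains.
        without_sigs = SIGNATURE.sub("", wikitext)
        return mwparserfromhell.parse(without_sigs).strip_code().strip()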

[1] Have fun (warning, explicit language):
https://gist.github.com/Ladsgroup/cc22515f55ae3d868f47#file-enwiki
[2] To see why, read the political science book "The Dictator's Handbook":
https://en.wikipedia.org/wiki/The_Dictator%27s_Handbook
[3] From the Fourth Amendment of the US Constitution; you can find a
similar clause in almost every constitution.

Hope this addresses some of the concerns. Sorry for the long email.
Re: Ethical question regarding some code [ In reply to ]
Please stop calling this an “AI” system; it is not. It is statistical
learning.

This is probably not going to make me popular…

In some jurisdictions you will need a permit to create, manage, and store
biometric identifiers, no matter whether the biometric identifier is for a
known person or not. If you want to create biometric identifiers and use
them, make darn sure you follow every applicable law and rule. I'm not
amused by the idea of CUs using illegal tools to vet ordinary users.

Any system that tries to remove the anonymity of users on Wikipedia should
have an RfC where the community can make their concerns heard. This is not
the proper forum to get acceptance from Wikipedia's community.

And by the way, systems for cleaning up prose exist for a whole bunch of
languages, not only English. Grammarly is one, LanguageTool another, and
there are a whole bunch of other such tools.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Ethical question regarding some code [ In reply to ]
On Sat, Aug 8, 2020 at 9:44 PM John Erling Blad <jeblad@gmail.com> wrote:

> Please stop calling this an “AI” system, it is not. It is statistical
> learning.
>
>
So in other words, it is an AI system? AI is just a colloquial synonym for
statistical learning at this point.

--
Brian
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Ethical question regarding some code [ In reply to ]
For my part, I think Amir is going way above and beyond to be so thoughtful
and open about the future of his tool.

I don't see how any part of it constitutes creating biometric identifiers,
nor is it obvious to me how it must remove the anonymity of users.

John, perhaps you can elaborate on your reasoning there?

Ultimately I don't think community approval for this tool is technically
required. I appreciate the effort to solicit input and don't think it would
hurt to do that more broadly.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Ethical question regarding some code [ In reply to ]
On Sat, Aug 8, 2020 at 7:43 PM Amir Sarabadani <ladsgroup@gmail.com> wrote:

> * By closed source, I don't mean it will only be accessible to me. It's
> already accessible to another CU and one WMF staff member, and I would
> gladly share the code with anyone who has signed an NDA; they are of
> course more than welcome to change it. GitHub has a really low limit on
> who can access a private repo, but I would be fine with any means of
> fixing this.
>

Closed source is commonly understood to mean the code is not under an
OSI-approved open-source license (such code is banned from Toolforge).
Contrary to common misconceptions, many OSI-approved open-source licenses
(such as the GPL) allow keeping the code private, as long as the software
itself is also kept private. IMO it would be less confusing to use the
"public"/"private" terminology here - yes the code should be open-sourced,
but that's mostly orthogonal to the concerns discussed here.

> * It has been pointed out by people on the checkuser mailing list that
> there's no point in logging access to this tool: since the code is
> accessible to CUs (if they want it), they can download and run it on
> their own computers without any logging anyway.
>

There's a significant difference between your actions not being logged vs.
your actions being logged unless you actively circumvent the logging (in
ways which would probably seem malicious). Clear red lines work well in a
community project even when there's nothing physically stopping people from
stepping over them.

> * There is a huge difference between CU and this AI tool in matters of
> privacy. While both are privacy-sensitive, CU reveals much more. As a
> CU, I know where lots of people live or study because they showed up in
> my checks (...) but this tool only reveals a connection between
> accounts. If one of them is linked to a public identity and the other is
> not, that is, I wholeheartedly agree, not great, but it's still not on
> the same level as seeing people's IPs.
>

On the other hand, IP checks are very unreliable. A hypothetical tool that
is reliable would be a bigger privacy concern, since it would be used more
often and more successfully to extract private details.
(On the other other hand, as a Wikipedia editor I have a reasonable
expectation of privacy in the site not telling its administrators where I
live. Do I have a reasonable expectation of privacy in it not telling them
what my alt accounts are? Arguably not.)

Also, how much help would such a tool be in off-wiki stylometry? If it can
be used (on its own or with additional tooling) to connect wiki accounts to
other online accounts, that would subjectively seem to me to have a
significantly larger privacy impact than IP addresses.
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Ethical question regarding some code [ In reply to ]
On Fri, Aug 7, 2020 at 6:39 PM Ryan Kaldari <rkaldari@wikimedia.org> wrote:

> Whatever danger is embodied in Amir's code, it's only a matter of time
> before this danger is ubiquitous. And for the worst-case
> scenario—governments using the technology to hunt down dissidents—I imagine
> this is already happening. So while I agree there is a moral consideration
> to releasing this software, I think the moral implications aren't actually
> that huge. Eventually, we will just have to accept that creating separate
> accounts is not an effective way to protect your identity.


Deanonymizing wiki accounts is one way of misusing the tool, and one which
would indeed happen anyway. Another scenario is an attacker examining the
tool with the intent of misleading it (such as using an adversarial network
to construct edits which the tool would consistently misidentify as
belonging to a certain user, which could be used to cast suspicion on a
legitimate user). That specifically depends on the model being publicly
available.
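
As a purely hypothetical sketch of that attack (it assumes white-box
access to some similarity score the tool computes; the Jaccard score, the
names and the threshold below are all made up, not anything the real tool
exposes):

from collections import Counter
import re

def similarity(text_a, text_b):
    # Toy stand-in for the tool's score: shared-vocabulary (Jaccard) ratio.
    a = set(re.findall(r"[a-z']+", text_a.lower()))
    b = set(re.findall(r"[a-z']+", text_b.lower()))
    return len(a & b) / len(a | b)

def mimic(draft, target_corpus, threshold=0.5):
    # Greedily pad a draft with the target's most frequent words until the
    # toy score crosses the assumed decision threshold.
    favorites = Counter(re.findall(r"[a-z']+", target_corpus.lower()))
    for word, _ in favorites.most_common():
        if similarity(draft, target_corpus) >= threshold:
            break
        draft += " " + word
    return draft

target = "frankly this source is unreliable and frankly the citation is weak"
print(mimic("I removed the paragraph", target))

Real stylometry is much harder to fool than this toy, but the principle
stands: publishing the model lets an attacker iterate against the exact
score investigators will see.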
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Ethical question regarding some code [ In reply to ]
On Sun, Aug 9, 2020 at 2:18 AM Nathan <nawrich@gmail.com> wrote:

> I don't see how any part of it constitutes creating biometric identifiers,
> nor is it obvious to me how it must remove the anonymity of users.
>

The GDPR, for example, defines biometric data as "personal data resulting
from specific technical processing relating to the physical, physiological
or behavioural characteristics of a natural person, which allow or confirm
the unique identification of that natural person" (Art. 4(14)). That seems
to fit, although it could be argued that the tool would link accounts to
other accounts and not to people, so the data is not used for
"identification of a natural person"; that does not sound super
convincing, though. The GDPR (Art. 9) generally forbids processing
biometric data, except for a number of special cases, some of which can be
argued to apply:
* processing is carried out in the course of its legitimate activities with
appropriate safeguards by a foundation, association or any other
not-for-profit body with a political, philosophical, religious or trade
union aim and on condition that the processing relates solely to the
members or to former members of the body or to persons who have regular
contact with it in connection with its purposes and that the personal data
are not disclosed outside that body without the consent of the data
subjects;
* processing relates to personal data which are manifestly made public by
the data subject;

but I wouldn't say it's clear-cut.
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Ethical question regarding some code [ In reply to ]
FWIW, the movement strategy included a recommendation* for having a
technology ethics review process [1]; maybe this is a good opportunity to
experiment with an unofficial precursor of that: make a wiki page for the
sock puppet detection tool, set up a proposal process for such pages, and
consider where we could source expert advice from.

* More precisely, it was a draft recommendation. The final recommendations
were significantly less fine-grained.
[1]
https://meta.wikimedia.org/wiki/Strategy/Wikimedia_movement/2018-20/Recommendations/Iteration_2/Product_%26_Technology/8
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l