Mailing List Archive

GDPR compliance
Hi Folks,
At LinkedIn we need to comply with GDPR for a large part of our data, and
an important part of that is being sure we have completely deleted the
data a user asked us to delete within a certain period of time.
The approach we have come up with so far is to:
1. Record the segment creation time somewhere (not decided yet, maybe index
commit userinfo, maybe some other place outside of Lucene)
2. Create a new merge policy which delegates most operations to a normal MP,
like TieredMergePolicy, and then adds extra single-segment merges (from 1
segment to 1 segment, basically only applying deletions) if it finds any
segment is about to violate the GDPR time frame.
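
A minimal sketch of the selection step in item 2, written in Python as a standalone model rather than against Lucene's actual MergePolicy API. The `Segment` record, its fields, and the `safety_margin` parameter are hypothetical names used only for illustration; the deadline is whatever time frame applies and is passed in by the caller.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for per-segment metadata (not a Lucene class)."""
    name: str
    created_at: datetime   # recorded creation time, as in item 1
    deleted_docs: int      # documents marked deleted but still on disk


def segments_to_rewrite(segments, now, deadline, safety_margin):
    """Return names of segments that need a 1-to-1 rewrite.

    A segment qualifies when it still carries deleted documents and its age
    has reached (deadline - safety_margin), i.e. it is about to exceed the
    allowed time frame. Everything else is left to the normal merge policy.
    """
    threshold = deadline - safety_margin
    return [
        s.name
        for s in segments
        if s.deleted_docs > 0 and now - s.created_at >= threshold
    ]


now = datetime(2023, 11, 28)
segments = [
    Segment("_a", now - timedelta(days=27), deleted_docs=120),  # old, has deletes
    Segment("_b", now - timedelta(days=27), deleted_docs=0),    # old, nothing to purge
    Segment("_c", now - timedelta(days=3), deleted_docs=50),    # young, can wait
]
print(segments_to_rewrite(segments, now, timedelta(days=30), timedelta(days=5)))
```

Using the segment creation time is a conservative bound in this model: a delete against a segment can only happen after the segment exists, so no pending delete is older than the segment itself.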

So here are my questions:
1. Is there a better/existing way to do this?
2. I would like to contribute such a merge policy directly to Lucene, since
I think GDPR is a fairly common concern. Do people feel it's necessary or
not?
3. It would also be nice if IndexWriter could store the segment creation
time in the index directly (maybe write it to SegmentInfo?). I can try to do
that, but would like to ask whether there are any objections.

Best
Patrick
Re: GDPR compliance
I don't think there's any problem with GDPR, and I don't think users
should be running unnecessary "optimize". GDPR just says data should
be erased without "undue" delay. Waiting for a merge to nuke the
deleted docs isn't "undue"; there is a good reason for it.

On Tue, Nov 28, 2023 at 2:40 PM Patrick Zhai <zhai7631@gmail.com> wrote:
>
> Hi Folks,
> At LinkedIn we need to comply with GDPR for a large part of our data, and an important part of that is being sure we have completely deleted the data a user asked us to delete within a certain period of time.
> The approach we have come up with so far is to:
> 1. Record the segment creation time somewhere (not decided yet, maybe index commit userinfo, maybe some other place outside of Lucene)
> 2. Create a new merge policy which delegates most operations to a normal MP, like TieredMergePolicy, and then adds extra single-segment merges (from 1 segment to 1 segment, basically only applying deletions) if it finds any segment is about to violate the GDPR time frame.
>
> So here are my questions:
> 1. Is there a better/existing way to do this?
> 2. I would like to contribute such a merge policy directly to Lucene, since I think GDPR is a fairly common concern. Do people feel it's necessary or not?
> 3. It would also be nice if IndexWriter could store the segment creation time in the index directly (maybe write it to SegmentInfo?). I can try to do that, but would like to ask whether there are any objections.
>
> Best
> Patrick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: GDPR compliance
Are larger and older segments even certain to ever be merged in practice? I
was assuming that if there is not a lot of new indexed content and not a
lot of older documents being deleted, large older segments might never have
to be merged.


On Tue 28 Nov 2023 at 20:53, Robert Muir <rcmuir@gmail.com> wrote:

> I don't think there's any problem with GDPR, and I don't think users
> should be running unnecessary "optimize". GDPR just says data should
> be erased without "undue" delay. Waiting for a merge to nuke the
> deleted docs isn't "undue"; there is a good reason for it.
Re: GDPR compliance
What is the expected grace period for the data-deletion request to take effect?

I'm not an expert on the policy, but I think something like "I need my data to be gone in the next 2 seconds" is unreasonable.

Tony X

________________________________
From: Robert Muir <rcmuir@gmail.com>
Sent: Tuesday, November 28, 2023 11:52 AM
To: dev@lucene.apache.org <dev@lucene.apache.org>
Subject: Re: GDPR compliance

I don't think there's any problem with GDPR, and I don't think users
should be running unnecessary "optimize". GDPR just says data should
be erased without "undue" delay. Waiting for a merge to nuke the
deleted docs isn't "undue"; there is a good reason for it.

Re: GDPR compliance
And if you delete those segments, will that data ever be actually
removed from the underlying physical storage? Equally uncertain.

Deleting a file from the filesystem is similar to what Lucene is
doing: it doesn't really delete anything from the disk, just allows it
to be overwritten by future writes.

So I don't think we should provide any "GDPRMergePolicy" to satisfy an
extreme (and short-sighted) legal interpretation. It wouldn't solve
the problem anyway.

On Tue, Nov 28, 2023 at 3:27 PM Ilan Ginzburg <ilansolr@gmail.com> wrote:
>
> Are larger and older segments even certain to ever be merged in practice? I was assuming that if there is not a lot of new indexed content and not a lot of older documents being deleted, large older segments might never have to be merged.
Re: GDPR compliance
Thanks Robert and Dawid,
What you said sounds reasonable to me. I can keep the MP private then, I
guess (and it's not hard to code anyway, so people can still figure it out
easily if they're facing a similar situation).
For our case, we do have some other constraints that require us to "clean"
them every so often, so we still need to do that.

Anyway, thank you for the interpretation of GDPR. I'm actually not sure what
exactly it's trying to enforce, so it's a good lesson for me as well.

Patrick


On Tue, Nov 28, 2023 at 2:48 PM Robert Muir <rcmuir@gmail.com> wrote:

> And if you delete those segments, will that data ever be actually
> removed from the underlying physical storage? Equally uncertain.
>
> Deleting a file from the filesystem is similar to what Lucene is
> doing: it doesn't really delete anything from the disk, just allows it
> to be overwritten by future writes.
>
> So I don't think we should provide any "GDPRMergePolicy" to satisfy an
> extreme (and short-sighted) legal interpretation. It wouldn't solve
> the problem anyway.
Re: GDPR compliance
It's not that insane; it's about several weeks. However, a big segment can
stay there for quite a long time if there aren't enough updates for a merge
policy to pick it up.

On Tue, Nov 28, 2023, 17:14 Dongyu Xu <dongyu214@hotmail.com> wrote:

> What is the expected grace period for the data-deletion request to take
> effect?
>
> I'm not an expert on the policy, but I think something like "I need my data
> to be gone in the next 2 seconds" is unreasonable.
>
> Tony X
Re: GDPR compliance
Another way is to ensure that all documents get updated on a regular
cadence whether there are changes in the underlying data or not. Or you
could regenerate the index from scratch all the time. Of course these
approaches might be more costly for an index that has intrinsically low
update rates, but they do keep the index fresh without the need for any
special tracking.
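
The regular-cadence idea can be sketched as a small selection step: find the documents whose last indexing time is older than the chosen cadence and re-submit them. This is only an illustrative Python model; `last_indexed`, `stale_ids`, and the 14-day cadence are hypothetical, and the actual re-indexing call depends on the application.

```python
from datetime import datetime, timedelta


def stale_ids(last_indexed, now, cadence):
    """Return ids of documents not re-indexed within `cadence`, oldest first.

    `last_indexed` maps a document id to the time it was last written to the
    index. The caller is expected to re-index the returned documents even if
    their underlying data has not changed.
    """
    overdue = [
        (ts, doc_id)
        for doc_id, ts in last_indexed.items()
        if now - ts > cadence
    ]
    return [doc_id for ts, doc_id in sorted(overdue)]


now = datetime(2023, 11, 29)
last_indexed = {
    "doc-1": now - timedelta(days=40),
    "doc-2": now - timedelta(days=2),
    "doc-3": now - timedelta(days=15),
}
print(stale_ids(last_indexed, now, timedelta(days=14)))
```

The cost noted above shows up directly here: with a low natural update rate, almost every document ends up in the stale list each cycle.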

On Tue, Nov 28, 2023, 8:45 PM Patrick Zhai <zhai7631@gmail.com> wrote:

> It's not that insane; it's about several weeks. However, a big segment can
> stay there for quite a long time if there aren't enough updates for a merge
> policy to pick it up.
Re: GDPR compliance
To the valid point Robert makes above about the underlying data still on
the disk (old news):
https://news.sophos.com/en-us/2022/09/23/morgan-stanley-fined-millions-for-selling-off-devices-full-of-customer-pii/

On Wed, Nov 29, 2023 at 11:01 AM Michael Sokolov <msokolov@gmail.com> wrote:

> Another way is to ensure that all documents get updated on a regular
> cadence whether there are changes in the underlying data or not. Or you
> could regenerate the index from scratch all the time. Of course these
> approaches might be more costly for an index that has intrinsically low
> update rates, but they do keep the index fresh without the need for any
> special tracking.