Mailing List Archive: Feature: External hash iterators

Feature: External hash iterators

lukasmai.403+p5p at gmail

Nov 7, 2023, 12:04 PM

Post #1 of 14 (386 views)

To iterate over a hash in Perl, you can fetch up front a list of keys()
(or values()) and use a for loop. But if that's not an option, your only
other choice is each(), which uses an implicit internal iterator stored
in the hash itself. This is brittle: If during the iteration you call a
function which also uses each() on the same hash (or keys() or values()
or just the hash in list context), things break.

That's why I'd like for Perl to have external hash iterators.

Desired properties:

* A hash iterator is exposed to Perl code as some kind of scalar value.
* You can obtain an iterator for any hash.
* You can obtain the current key/value pair from an iterator and
advance it to the next entry, if any, or detect the end of the hash.
* A hash can have any number of concurrent iterators.
* External hash iterators don't interact with each other nor with the
internal iterator.
* Modifying the structure of a hash invalidates all of its iterators,
but that doesn't have to be an observable state. I'm fine if it
means an iterator might skip or revisit some entries.

For example, it might look like this:

my $it = iterator(\%ENV);
while (my ($key, $value) = $it->()) {
say "$key=$value";
}

Here the iterator is exposed as a code reference that returns either the
next key/value pair or the empty list each time it is invoked.

How feasible would it be to implement such a thing?

Re: Feature: External hash iterators [ In reply to ]

grinnz at gmail

Nov 7, 2023, 12:21 PM

Post #2 of 14 (386 views)

On Tue, Nov 7, 2023 at 3:04?PM Lukas Mai <lukasmai.403+p5p@gmail.com> wrote:

> To iterate over a hash in Perl, you can fetch up front a list of keys()
> (or values()) and use a for loop. But if that's not an option, your only
> other choice is each(), which uses an implicit internal iterator stored in
> the hash itself. This is brittle: If during the iteration you call a
> function which also uses each() on the same hash (or keys() or values() or
> just the hash in list context), things break.
>
> That's why I'd like for Perl to have external hash iterators.
>
> Desired properties:
>
> - A hash iterator is exposed to Perl code as some kind of scalar value.
> - You can obtain an iterator for any hash.
> - You can obtain the current key/value pair from an iterator and
> advance it to the next entry, if any, or detect the end of the hash.
> - A hash can have any number of concurrent iterators.
> - External hash iterators don't interact with each other nor with the
> internal iterator.
> - Modifying the structure of a hash invalidates all of its iterators,
> but that doesn't have to be an observable state. I'm fine if it means an
> iterator might skip or revisit some entries.
>
> For example, it might look like this:
>
> my $it = iterator(\%ENV);
> while (my ($key, $value) = $it->()) {
> say "$key=$value";
> }
>
> Here the iterator is exposed as a code reference that returns either the
> next key/value pair or the empty list each time it is invoked.
>
> How feasible would it be to implement such a thing?
>

I think this would conceptually be a wonderful addition to builtin.pm.

Some questions: would it be callable on arrays? Objects? I think both are
reasonable and useful behavior, but both together would lead to array/hash
deref ambiguity similar to the issues with autoderef.

I think there is some prior art on CPAN but all I can find at the moment is
https://metacpan.org/pod/Hash::Iterator (and I much prefer your proposed
interface).

-Dan

Re: Feature: External hash iterators [ In reply to ]

lukasmai.403+p5p at gmail

Nov 7, 2023, 12:29 PM

Post #3 of 14 (386 views)

On 07.11.23 21:21, Dan Book wrote:
> Some questions: would it be callable on arrays? Objects? I think both
> are reasonable and useful behavior, but both together would lead to
> array/hash deref ambiguity similar to the issues with autoderef.

The concept itself could generalize nicely (and one might imagine
iterator-based map/grep-like functions and even a new iterator-based
loop keyword), but for now I just want to focus on hashes. (Because
that's where the missing feature really hurts: I could easily build an
array iterator out of an arrayref plus an index, but there's nothing
like that for hashes.)

Whatever the final interface looks like, I'd want this portion of it to
be hash-specific (e.g. builtin::hash_iterator or Hash::Util::iterator).
We can always build a more generic interface on top of it later.

Re: Feature: External hash iterators [ In reply to ]

grinnz at gmail

Nov 7, 2023, 12:39 PM

Post #4 of 14 (386 views)

On Tue, Nov 7, 2023 at 3:21?PM Dan Book <grinnz@gmail.com> wrote:

> On Tue, Nov 7, 2023 at 3:04?PM Lukas Mai <lukasmai.403+p5p@gmail.com>
> wrote:
>
>> To iterate over a hash in Perl, you can fetch up front a list of keys()
>> (or values()) and use a for loop. But if that's not an option, your only
>> other choice is each(), which uses an implicit internal iterator stored in
>> the hash itself. This is brittle: If during the iteration you call a
>> function which also uses each() on the same hash (or keys() or values() or
>> just the hash in list context), things break.
>>
>> That's why I'd like for Perl to have external hash iterators.
>>
>> Desired properties:
>>
>> - A hash iterator is exposed to Perl code as some kind of scalar
>> value.
>> - You can obtain an iterator for any hash.
>> - You can obtain the current key/value pair from an iterator and
>> advance it to the next entry, if any, or detect the end of the hash.
>> - A hash can have any number of concurrent iterators.
>> - External hash iterators don't interact with each other nor with the
>> internal iterator.
>> - Modifying the structure of a hash invalidates all of its iterators,
>> but that doesn't have to be an observable state. I'm fine if it means an
>> iterator might skip or revisit some entries.
>>
>> For example, it might look like this:
>>
>> my $it = iterator(\%ENV);
>> while (my ($key, $value) = $it->()) {
>> say "$key=$value";
>> }
>>
>> Here the iterator is exposed as a code reference that returns either the
>> next key/value pair or the empty list each time it is invoked.
>>
>> How feasible would it be to implement such a thing?
>>
>
> I think this would conceptually be a wonderful addition to builtin.pm.
>
> Some questions: would it be callable on arrays? Objects? I think both are
> reasonable and useful behavior, but both together would lead to array/hash
> deref ambiguity similar to the issues with autoderef.
>
> I think there is some prior art on CPAN but all I can find at the moment
> is https://metacpan.org/pod/Hash::Iterator (and I much prefer your
> proposed interface).
>

And to add: the most important reason for something like this, which
prevents it from being possible to implement outside of core or XS, and I
presume would be the difficult part of implementation, is that it would
retrieve only the key and value of the current iteration and not store a
copy of all keys up front, as a foreach loop conceptually requires.

-Dan

Re: Feature: External hash iterators [ In reply to ]

perl5-porters at perl

Nov 7, 2023, 12:51 PM

Post #5 of 14 (386 views)

* Dan Book <grinnz@gmail.com> [2023-11-07 15:21:43 -0500]:

> On Tue, Nov 7, 2023 at 3:04?PM Lukas Mai <lukasmai.403+p5p@gmail.com> wrote:
>
> > To iterate over a hash in Perl, you can fetch up front a list of keys()
> > (or values()) and use a for loop. But if that's not an option, your only
> > other choice is each(), which uses an implicit internal iterator stored in
> > the hash itself. This is brittle: If during the iteration you call a
> > function which also uses each() on the same hash (or keys() or values() or
> > just the hash in list context), things break.
> >
> > That's why I'd like for Perl to have external hash iterators.
> >
> > Desired properties:
> >
> > - A hash iterator is exposed to Perl code as some kind of scalar value.
> > - You can obtain an iterator for any hash.
> > - You can obtain the current key/value pair from an iterator and
> > advance it to the next entry, if any, or detect the end of the hash.
> > - A hash can have any number of concurrent iterators.
> > - External hash iterators don't interact with each other nor with the
> > internal iterator.
> > - Modifying the structure of a hash invalidates all of its iterators,
> > but that doesn't have to be an observable state. I'm fine if it means an
> > iterator might skip or revisit some entries.
> >
> > For example, it might look like this:
> >
> > my $it = iterator(\%ENV);
> > while (my ($key, $value) = $it->()) {
> > say "$key=$value";
> > }
> >
> > Here the iterator is exposed as a code reference that returns either the
> > next key/value pair or the empty list each time it is invoked.
> >
> > How feasible would it be to implement such a thing?
> >
>
> I think this would conceptually be a wonderful addition to builtin.pm.
>
> Some questions: would it be callable on arrays? Objects? I think both are
> reasonable and useful behavior, but both together would lead to array/hash
> deref ambiguity similar to the issues with autoderef.
>
> I think there is some prior art on CPAN but all I can find at the moment is
> https://metacpan.org/pod/Hash::Iterator (and I much prefer your proposed
> interface).

This "need" aligns well with the addition of a "kv" keyword that is a
special case of N-at-a-time.

I think it'd very much be an amazing addition to "keys" and "values"
to have something like,

foreach my $kvpair_ref (kv %array) {
printf qq{%s=%s\n}, @$kvpair_ref;
..
}

Would work the same as, perhaps even not handle odd-sized LISTs.

foreach my $tuple_ref (kv @array) {
printf qq{%s=%s\n}, @$tuple_ref;
..
}

This could easily be done as a prototype as I've been prone to use
a lot lately, but would be greatly.

Underneath having a general sub-setter or "N-chunker" that work on
a HASH or ARRAY, and would by like C<kv> but specify the chunk
size. Let's call it C<chunk INT_SIZE, LIST>.

C<kv> would be equivalent to C<chunk 2, %hash> or C<chunk 2, @array>.

Returning an array reference of N size would also allow this
multi-part map that was discussed some time back,

my @result = map { my ($X,$Y,$Z) = @$_; ... } chunk 3, @array.

This would be served up really well with a simple prototype based
module. I may have time and be sufficiently motivated to whip it
up unless someone beats me to it.

My personal belief is the "future" of Perl has more to do with make
data structure iteration and transformation readily accessible -
which I suppose might be leaning more into the APL side of the
house. And I think this is the kind where prototypes can really
shine.

Cheers,
Brett

>
> -Dan

--
oodler@cpan.org ~> oodler577@sdf-eu.org
SDF-EU Public Access UNIX System - http://sdfeu.org
irc.perl.org #openmp #pdl #native
gopher://sdfeu.org/1/users/oodler577/

Carl Spackler: [preparing to dynamite the gopher tunnel] In the
immortal words of Jean Paul Sartre, Au revoir, gopher.

Re: Feature: External hash iterators [ In reply to ]

Nov 7, 2023, 1:11 PM

Post #6 of 14 (386 views)

Michael Conrad <mike@nrdvana.net> wrote:
:I implemented this concept for Tree::RB::XS (and even with the fancy
:behavior that deleting the iterated element moves all iterators at that
:element to the next element) but it required a linked-list (of
:iterators) at each tree node. Applying the same to a HV would add a
:pointer to each HE which seems like too heavy of a solution. A second
:option is to have the hash itself track the known iterators, and then an
:extra 'if' statement every time it frees a HE to see if that was one
:that was being pointed at by any of the iterators... but that overhead
:would probably show up in benchmarks. A third option is to have the
:iterators composed of (hash_weakref, bucket_idx, collision_ofs, prev_he)
:and then each time you want the next element you verify that the hash
:still exists, that bucket_idx still exists, and re-walk the linked list
:of HE by collision_ofs iterations. If collisions are few, this will be
:reasonably efficient. If the iterator no longer finds prev_he at that
:position, it just goes to the next HE or next bucket.
:
:That third one is the only option I can see that doesn't degrade Perl's
:overall performance. Any other implementation ideas?

I think the simplest implementation would be a generation number on the hash,
such that any change to the hash increments the generation and each use of
the iterator verifies that the generation hasn't changed.

It is possible that the overhead of incrementing the generation on any
change would be enough to justify having this be a secondary hash type
that we upgrade to only when the first iterator is requested.

The iterator should certainly have a reference to the hash, which should
be enough to ensure the hash continues to exist.

Hugo

Re: Feature: External hash iterators [ In reply to ]

mike at nrdvana

Nov 7, 2023, 1:31 PM

Post #7 of 14 (386 views)

On 11/7/23 15:21, Dan Book wrote:
> On Tue, Nov 7, 2023 at 3:04?PM Lukas Mai <lukasmai.403+p5p@gmail.com
> <mailto:lukasmai.403%2Bp5p@gmail.com>> wrote:
>
> To iterate over a hash in Perl, you can fetch up front a list of
> keys() (or values()) and use a for loop. But if that's not an
> option, your only other choice is each(), which uses an implicit
> internal iterator stored in the hash itself. This is brittle: If
> during the iteration you call a function which also uses each() on
> the same hash (or keys() or values() or just the hash in list
> context), things break.
>
> That's why I'd like for Perl to have external hash iterators.
>
> Desired properties:
>
> * A hash iterator is exposed to Perl code as some kind of scalar
> value.
> * You can obtain an iterator for any hash.
> * You can obtain the current key/value pair from an iterator and
> advance it to the next entry, if any, or detect the end of the
> hash.
> * A hash can have any number of concurrent iterators.
> * External hash iterators don't interact with each other nor
> with the internal iterator.
> * Modifying the structure of a hash invalidates all of its
> iterators, but that doesn't have to be an observable state.
> I'm fine if it means an iterator might skip or revisit some
> entries.
>
> For example, it might look like this:
>
> my $it = iterator(\%ENV);
> while (my ($key, $value) = $it->()) {
> say "$key=$value";
> }
>
> Here the iterator is exposed as a code reference that returns
> either the next key/value pair or the empty list each time it is
> invoked.
>
> How feasible would it be to implement such a thing?
>
>
> I think this would conceptually be a wonderful addition to builtin.pm
> <http://builtin.pm>.
>
> Some questions: would it be callable on arrays? Objects? I think both
> are reasonable and useful behavior, but both together would lead to
> array/hash deref ambiguity similar to the issues with autoderef.
>
> I think there is some prior art on CPAN but all I can find at the
> moment is https://metacpan.org/pod/Hash::Iterator (and I much prefer
> your proposed interface).
>
> -Dan

I implemented this concept for Tree::RB::XS (and even with the fancy
behavior that deleting the iterated element moves all iterators at that
element to the next element) but it required a linked-list (of
iterators) at each tree node. Applying the same to a HV would add a
pointer to each HE which seems like too heavy of a solution. A second
option is to have the hash itself track the known iterators, and then an
extra 'if' statement every time it frees a HE to see if that was one
that was being pointed at by any of the iterators... but that overhead
would probably show up in benchmarks. A third option is to have the
iterators composed of (hash_weakref, bucket_idx, collision_ofs, prev_he)
and then each time you want the next element you verify that the hash
still exists, that bucket_idx still exists, and re-walk the linked list
of HE by collision_ofs iterations. If collisions are few, this will be
reasonably efficient. If the iterator no longer finds prev_he at that
position, it just goes to the next HE or next bucket.

That third one is the only option I can see that doesn't degrade Perl's
overall performance. Any other implementation ideas? It sure would be
nice to be able to delete the most recent iterated element without
breaking iteration.

-Mike

Re: Feature: External hash iterators [ In reply to ]

scott at perturb

Nov 7, 2023, 1:49 PM

Post #8 of 14 (386 views)

How is this functionally different than iterating over multiple values
at a time
<https://perldoc.perl.org/perl5360delta#iterating-over-multiple-values-at-a-time-(experimental)>
that came out with Perl 5.36?

You mention having multiple iterators at the same time, so that might be
a difference?

- Scott

On 11/7/2023 12:04 PM, Lukas Mai wrote:
>
> To iterate over a hash in Perl, you can fetch up front a list of
> keys() (or values()) and use a for loop. But if that's not an option,
> your only other choice is each(), which uses an implicit internal
> iterator stored in the hash itself. This is brittle: If during the
> iteration you call a function which also uses each() on the same hash
> (or keys() or values() or just the hash in list context), things break.
>
> That's why I'd like for Perl to have external hash iterators.
>
> Desired properties:
>
> * A hash iterator is exposed to Perl code as some kind of scalar value.
> * You can obtain an iterator for any hash.
> * You can obtain the current key/value pair from an iterator and
> advance it to the next entry, if any, or detect the end of the hash.
> * A hash can have any number of concurrent iterators.
> * External hash iterators don't interact with each other nor with
> the internal iterator.
> * Modifying the structure of a hash invalidates all of its
> iterators, but that doesn't have to be an observable state. I'm
> fine if it means an iterator might skip or revisit some entries.
>
> For example, it might look like this:
>
> my $it = iterator(\%ENV);
> while (my ($key, $value) = $it->()) {
> say "$key=$value";
> }
>
> Here the iterator is exposed as a code reference that returns either
> the next key/value pair or the empty list each time it is invoked.
>
> How feasible would it be to implement such a thing?
>

Re: Feature: External hash iterators [ In reply to ]

fawaka at gmail

Nov 7, 2023, 1:50 PM

Post #9 of 14 (386 views)

On Tue, Nov 7, 2023 at 9:04?PM Lukas Mai <lukasmai.403+p5p@gmail.com> wrote:

> To iterate over a hash in Perl, you can fetch up front a list of keys()
> (or values()) and use a for loop. But if that's not an option, your only
> other choice is each(), which uses an implicit internal iterator stored in
> the hash itself. This is brittle: If during the iteration you call a
> function which also uses each() on the same hash (or keys() or values() or
> just the hash in list context), things break.
>
> That's why I'd like for Perl to have external hash iterators.
>
> Desired properties:
>
> - A hash iterator is exposed to Perl code as some kind of scalar value.
> - You can obtain an iterator for any hash.
> - You can obtain the current key/value pair from an iterator and
> advance it to the next entry, if any, or detect the end of the hash.
> - A hash can have any number of concurrent iterators.
> - External hash iterators don't interact with each other nor with the
> internal iterator.
> - Modifying the structure of a hash invalidates all of its iterators,
> but that doesn't have to be an observable state. I'm fine if it means an
> iterator might skip or revisit some entries.
>
> For example, it might look like this:
>
> my $it = iterator(\%ENV);
> while (my ($key, $value) = $it->()) {
> say "$key=$value";
> }
>
> Here the iterator is exposed as a code reference that returns either the
> next key/value pair or the empty list each time it is invoked.
>
> How feasible would it be to implement such a thing?
>

I would agree it's desirable, I'm not sure it's implementable in a sane
way; especially invalidation. Or more specifically if it's possible to
efficiently implement this in a way that doesn't segfault when trying to
access an invalid iterator.

Leon

Re: Feature: External hash iterators [ In reply to ]

grinnz at gmail

Nov 7, 2023, 1:50 PM

Post #10 of 14 (386 views)

On Tue, Nov 7, 2023 at 3:51?PM Oodler 577 <oodler577@sdf-eu.org> wrote:

> * Dan Book <grinnz@gmail.com> [2023-11-07 15:21:43 -0500]:
>
> > On Tue, Nov 7, 2023 at 3:04?PM Lukas Mai <lukasmai.403+p5p@gmail.com>
> wrote:
> >
> > > To iterate over a hash in Perl, you can fetch up front a list of keys()
> > > (or values()) and use a for loop. But if that's not an option, your
> only
> > > other choice is each(), which uses an implicit internal iterator
> stored in
> > > the hash itself. This is brittle: If during the iteration you call a
> > > function which also uses each() on the same hash (or keys() or
> values() or
> > > just the hash in list context), things break.
> > >
> > > That's why I'd like for Perl to have external hash iterators.
> > >
> > > Desired properties:
> > >
> > > - A hash iterator is exposed to Perl code as some kind of scalar
> value.
> > > - You can obtain an iterator for any hash.
> > > - You can obtain the current key/value pair from an iterator and
> > > advance it to the next entry, if any, or detect the end of the hash.
> > > - A hash can have any number of concurrent iterators.
> > > - External hash iterators don't interact with each other nor with
> the
> > > internal iterator.
> > > - Modifying the structure of a hash invalidates all of its
> iterators,
> > > but that doesn't have to be an observable state. I'm fine if it
> means an
> > > iterator might skip or revisit some entries.
> > >
> > > For example, it might look like this:
> > >
> > > my $it = iterator(\%ENV);
> > > while (my ($key, $value) = $it->()) {
> > > say "$key=$value";
> > > }
> > >
> > > Here the iterator is exposed as a code reference that returns either
> the
> > > next key/value pair or the empty list each time it is invoked.
> > >
> > > How feasible would it be to implement such a thing?
> > >
> >
> > I think this would conceptually be a wonderful addition to builtin.pm.
> >
> > Some questions: would it be callable on arrays? Objects? I think both are
> > reasonable and useful behavior, but both together would lead to
> array/hash
> > deref ambiguity similar to the issues with autoderef.
> >
> > I think there is some prior art on CPAN but all I can find at the moment
> is
> > https://metacpan.org/pod/Hash::Iterator (and I much prefer your proposed
> > interface).
>

Many of these suggestions already have available or experimental solutions:

> This "need" aligns well with the addition of a "kv" keyword that is a
> special case of N-at-a-time.
>
> I think it'd very much be an amazing addition to "keys" and "values"
> to have something like,
>
> foreach my $kvpair_ref (kv %array) {
> printf qq{%s=%s\n}, @$kvpair_ref;
> ..
> }

This use case is better served by the new multi-foreach syntax:

foreach my ($k, $v) (%hash) {...}

> Would work the same as, perhaps even not handle odd-sized LISTs.
>
> foreach my $tuple_ref (kv @array) {
> printf qq{%s=%s\n}, @$tuple_ref;
> ..
> }

This use case does however need an intermediary function to interpose the
array indices, which has been added as the 'indexed' builtin function (and
works on either arrays or lists): https://perldoc.perl.org/builtin#indexed

> This could easily be done as a prototype as I've been prone to use
> a lot lately, but would be greatly.
>
> Underneath having a general sub-setter or "N-chunker" that work on
> a HASH or ARRAY, and would by like C<kv> but specify the chunk
> size. Let's call it C<chunk INT_SIZE, LIST>.
>
> C<kv> would be equivalent to C<chunk 2, %hash> or C<chunk 2, @array>.
>
> Returning an array reference of N size would also allow this
> multi-part map that was discussed some time back,
>
> my @result = map { my ($X,$Y,$Z) = @$_; ... } chunk 3, @array.
>

The multi-foreach syntax can be used for many use cases like this:

foreach my ($x, $y, $z) (@array) {...}

A similar syntax could be imagined for map or grep but may be syntactically
more difficult.

A builtin function similar to List::UtilsBy::bundle_by { [@_] } ... would
be useful as well, I don't believe such has been formally proposed yet.

I will note however that these all involve copying the entire set of
values, and so are not especially related to this proposal.

-Dan

Re: Feature: External hash iterators [ In reply to ]

grinnz at gmail

Nov 7, 2023, 1:52 PM

Post #11 of 14 (386 views)

On Tue, Nov 7, 2023 at 4:49?PM Scott Baker <scott@perturb.org> wrote:

> How is this functionally different than iterating over multiple values at
> a time
> <https://perldoc.perl.org/perl5360delta#iterating-over-multiple-values-at-a-time-(experimental)>
> that came out with Perl 5.36?
>

By the important difference I mentioned in a previous followup email: the
iterator retrieves only the key and value currently iterated, as the each
function does, and not the entire set of keys.

-Dan

Re: Feature: External hash iterators [ In reply to ]

Nov 9, 2023, 6:55 AM

Post #12 of 14 (356 views)

On Tue, Nov 07, 2023 at 09:04:09PM +0100, Lukas Mai wrote:
> * Modifying the structure of a hash invalidates all of its iterators,
> but that doesn't have to be an observable state. I'm fine if it
> means an iterator might skip or revisit some entries.

Perl's internal hash iterator guarantees that deleting the current hash
entry doesn't mess up the iterator, which is a very useful property.
Should the external iterator make a similar guarantee?

What about iterating on tied or otherwise magical hashes? Should this be
possible, and if so what semantics?

--
The crew of the Enterprise encounter an alien life form which is
surprisingly neither humanoid nor made from pure energy.
-- Things That Never Happen in "Star Trek" #22

Re: Feature: External hash iterators [ In reply to ]

lukasmai.403+p5p at gmail

Nov 9, 2023, 7:13 AM

Post #13 of 14 (356 views)

On 09.11.23 15:55, Dave Mitchell wrote:
> On Tue, Nov 07, 2023 at 09:04:09PM +0100, Lukas Mai wrote:
>> * Modifying the structure of a hash invalidates all of its iterators,
>> but that doesn't have to be an observable state. I'm fine if it
>> means an iterator might skip or revisit some entries.
>
> Perl's internal hash iterator guarantees that deleting the current hash
> entry doesn't mess up the iterator, which is a very useful property.
> Should the external iterator make a similar guarantee?

Preferably yes, but I don't know how practical it would be to implement
(efficiently). That's why I didn't want to make it a hard requirement.

> What about iterating on tied or otherwise magical hashes? Should this be
> possible, and if so what semantics?

Good question. My idea was to require a new method in classes that
implement TIEHASH, which would create a custom iterator. (If an iterator
is requested for a hash tied to an "old" class that doesn't provide such
a method, an exception would be thrown.)

I'm less familiar with the "otherwise magical" part. Would this require
modification of some internal "vtable" structures to add a new hook?

Re: Feature: External hash iterators [ In reply to ]

demerphq at gmail

Nov 9, 2023, 9:34 AM

Post #14 of 14 (356 views)

On Tue, 7 Nov 2023 at 23:03, <hv@crypt.org> wrote:

> Michael Conrad <mike@nrdvana.net> wrote:
> :I implemented this concept for Tree::RB::XS (and even with the fancy
> :behavior that deleting the iterated element moves all iterators at that
> :element to the next element) but it required a linked-list (of
> :iterators) at each tree node. Applying the same to a HV would add a
> :pointer to each HE which seems like too heavy of a solution. A second
> :option is to have the hash itself track the known iterators, and then an
> :extra 'if' statement every time it frees a HE to see if that was one
> :that was being pointed at by any of the iterators... but that overhead
> :would probably show up in benchmarks. A third option is to have the
> :iterators composed of (hash_weakref, bucket_idx, collision_ofs, prev_he)
> :and then each time you want the next element you verify that the hash
> :still exists, that bucket_idx still exists, and re-walk the linked list
> :of HE by collision_ofs iterations. If collisions are few, this will be
> :reasonably efficient. If the iterator no longer finds prev_he at that
> :position, it just goes to the next HE or next bucket.
> :
> :That third one is the only option I can see that doesn't degrade Perl's
> :overall performance. Any other implementation ideas?
>
> I think the simplest implementation would be a generation number on the
> hash,
> such that any change to the hash increments the generation and each use of
> the iterator verifies that the generation hasn't changed.
>

We already have this. It is not a monotonically increasing value however as
we use the "generation" to randomize the order of the buckets we visit to
prevent people establishing key order dependencies that might break if we
modified the hash function. (It is a proxy for a monotonically increasing
value however, as we use a Marsaglia XORSHIFT RNG which has a cycle length
of 2**32-1, with 0 being the only excluded value.)

The generation value is not modified on deletion however, as Dave pointed
out already it is documented that you can delete the most recent result
from each without breaking the iterator only only on an insert (a store
that actually adds a new key).

> It is possible that the overhead of incrementing the generation on any
> change would be enough to justify having this be a secondary hash type
> that we upgrade to only when the first iterator is requested.
>

As I said we have been doing this for some time.

$ ./perl -le'my %hash=(1..10); my ($k,$v)= each(%hash); $hash{x}++; my
($k,$v)= each(%hash)'
Use of each() on hash after insertion without resetting hash iterator
results in undefined behavior, Perl interpreter: 0x55ab1a71b2a0 at -e line
1.

Not sure why it shows the perl interpreter tho. I guess because it is a
DEBUGGING perl?

As far as I remember the iterator lives in struct xpvhv_aux, along with a
bunch of other properties for "special" hashes, and is added to the hash
lazily only when needed, converting the hash from being based on struct
xpvhv to struct xpvhv_with_aux. Thus any hash that has been iterated,
blessed, or is used as part of the stash, has backreferences, etc, will be
based on xpvhv_with_aux. (At one point the aux structure lived at the end
of the bucket array, but i dont think that is the case anymore.)

The xpvhv_aux structure contains two fields which are used for iteration:
HE xhv_eiter and I32 xhv_riter. The xhv_riter is incremented from 0 to
xhv_max (the index of the last bucket, eg for a hash with 2**k keys it will
be 2**k-1), and the xhv_eiter is the current HE in the bucket chain. The
xhv_riter is xored with xhv_rand to give each hash its own bucket
visitation schedule. Any time a new key is added to the hash the xhv_rand
is perturbed by PERL_XORSHIFT32_A(), and after each iteration we copy
xhv_rand into xhv_last_rand (so we only get one warning per modification).
If someone calls an iterator function we check if xhv_rand and
xhv_last_rand are the same. If they differ we produce a warning.

The xpvhv_aux structure contains all the fields used to implement
weakreferences btw, so it seems conceivable to me that the sequence of
events would be something like:

Init:
upgrade hash to have an aux structure.
initialize the aux structure for normal iteration.
create a standalone xpvhv_aux structure that lives in the PV slot of an
SVPV.
stick the PV in the xhv_backreferences structure in the hash.
hijack the xhv_class_superclass HV field to point back at the HV.

Iter:
check if the xhv_rand value is the same in the SV wrapped aux structure
and the real one.
assuming they are the same then find the "next HE to return", update the
riter/eiter values afterwards, if they differ invalidate the iterator.

On delete:
check if the hash has an aux structure and any backreferences, if it does
then check to see if any are SVPV's, if they are invalidate them.

I'm definitely glossing over some edge cases, but id say most of the pieces
are here to do what Lukas requested. They just need to be assembled into a
cohesive whole.

FWIW, I have a recollection that at one point we actually did have XS/C
code that used hand rolled independent iterators.

One thing I think is important is that these new iterators should mirror
the xhv_rand behavior, that way we can detect when someone is using the
iterator after an insert, and it prevents people from writing code that
depends on the bucket order, which means that we can safely change the hash
function and have different hash functions in different build without
people writing code that expects some specific key order.

Anyway. see hv.h for details of much of this, along with Perl_hv_iterinit
and Perl_hv_iternext in hv.c Perl_hv_iterinit takes care of "upgrading" the
hash to have an aux structure, and some basic bookkeeping.
Perl_hv_iternext() does the incrementing of the riter and the traversal of
the HE chain in the buckets represented by eiter, it is actually a macro
which wraps Perl_hv_iternext_flags().

I'd be happy to help with this effort, but I have little time these days.
This is a worthy enough cause that I'd do my best to make some time for it
tho.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"