Mailing List Archive

Re: threads and stuff (was Re: Perl’s leaky bucket)
Thank you, appreciate the reply. To avoid collision with the other
thread, I've updated the subject.

So I suppose this cuts to the heart of the matter. It's fair, I believe,
that "top level 'threading'" is not something that can be reasonably
supported in the perl runtime (and to be fair, other scripting languages
also struggle greatly with this).

The question, then, becomes: how do we facilitate access to the native
platform's SMP? Perhaps the answer is, as always, "it depends."

I will continue to pound my head on this; something will shake loose. In
the meantime, as proof that I am invested in seeing some interesting
options shake loose, I have offered a module on CPAN called
OpenMP::Environment.

It only provides programmatic manipulation of the environment
variables that OpenMP programs care about, but it is a necessary step
for me to investigate and gain some intuition about what is possible and
what looks "right".

Currently I am thinking a SIMD (single instruction, multiple data)
approach is the most accessible. This doesn't really get at what is
ideal (in my mind), but it at least provides a path forward by creating
threaded libraries made available via Inline::C/XS/FFIs - and even
leveraging these via a tie interface. And if the direction of providing
SIMD "data types" is truly the best one, PDL seems like a really
good target for that.
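To make the Inline::C route concrete, here is a rough sketch of a
threaded "leaf" call (this assumes GCC and its -fopenmp flag, and is
untested as written, so treat the build flags as illustrative):

#!/usr/bin/env perl
use strict;
use warnings;
use Config;

# pass the OpenMP flag to both the compile and link steps
use Inline C => Config =>
    ccflagsex => q{-fopenmp},
    lddlflags => join(q{ }, $Config{lddlflags}, q{-fopenmp});

use Inline C => <<'EOC';
#include <omp.h>
int omp_thread_count() {
    int count = 0;
    #pragma omp parallel
    {
        #pragma omp atomic
        count++;
    }
    return count;
}
EOC

$ENV{OMP_NUM_THREADS} = 4;        # what OpenMP::Environment manages
print omp_thread_count(), qq{\n}; # expect 4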

Thanks for all the feedback; I will monitor the list/thread. I do ask that
we keep in mind that threading "where and when feasible" (even in
language design considerations) is extremely important to Perl's long
term viability and as a driver of important capabilities. The idea of
using Perl to efficiently max out a large SMP machine for useful work is
very exciting to me, and I think I am not the only one.

Cheers,
Brett


On 4/8/21 3:42 AM, Konovalov, Vadim wrote:
>> Nicholas Clark <nick@ccl4.org> wrote:
>>> On Wed, Apr 07, 2021 at 10:49:05AM -0500, B. Estrade wrote:
>>>
>>>> Here's some food for thought. When was the last time anyone had a
>>>> serious discussion of introducing SMP thread safety into the language? Or
>>>> easy access to atomics? These are, IMO, 2 very important and serious things
>>>
>>> Retrofitting proper multicore concurrency to perl would imply that it's
>>> possible at all.
>>>
>>> None of the comparable C-based dynamic languages have done it (eg Python,
>>> Ruby)
>
> Neither Python nor Ruby is a good place to learn how multiprocessing should happen.
> There is a much better place to learn from - Julia.
>
> Julia took an interesting approach: syntax highly inspired by Python, but it correctly
> deals with multithreading. In a sense, it is a much-evolved Python, with all the missing
> features added, in a completely different language.
>
> A very interesting approach, and I suppose this is about what the perl community intended
> to implement when the perl6 discussions began in 2000.
>
>
>
>>> Ruby's new experimental threading is an actor model (I believe). CPython
>>> is dogged by the GIL (all the threads you like, but if you're CPU bound then
>>> you're making one core very toasty). They seem to be thinking about going
>>> multicore by doing something like ithreads (using pickle to pass data
>>> structures between MULTIPLICITY-like interpreters) but that's likely going
>>> to be only as good as ithreads.
>
> It appears that reference counting + threading results in a considerable slowdown,
> and the solution to the problem is a good GC.
> This was "proved" by the GILectomy project; see any google video on this, or this
> nice summary: https://news.ycombinator.com/item?id=11842779
>
> Perl's history with threading (PERL5005THREADS and then ITHREADS, where the 1st was
> deprecated and removed while the 2nd was recently deprecated) leads me to think
> that no real multithreading/multicoring is actually possible at the language level.
> Developing in that direction now is just a waste of time.
>
> "async" is another story, which is Javascript way, where basic JS engine V8 is single-threaded.
>
>
Re: threads and stuff (was Re: Perl’s leaky bucket) [ In reply to ]
B. Estrade

I don't yet understand what you want.

If you want to call OpenMP easily, I'm also trying that in my SPVM module.

https://github.com/yuki-kimoto/SPVM/tree/master/examples/native/openmp

Perl already has Inline::C/XS/FFI modules. SPVM is another way to bind C
libraries.

More examples:

- cuda
- Eigen
- GSL
- OpenCV
- zlib

https://github.com/yuki-kimoto/SPVM/tree/master/examples/native

Re: threads and stuff (was Re: Perl’s leaky bucket) [ In reply to ]
On 4/8/21 11:50 PM, Yuki Kimoto wrote:
> B. Estrade
>
> I don't yet understand what you want.

You and me both. What I *want* and what I think we have to do to get
there are two vastly different things.

For now, I want all of us to think "thread safety" and "SMP" in
everything we do. Not in a design sense, but in an unconscious sense, so
that it imbues itself in every aspect of any work and thinking done in
the name of perl/Perl.

If you ask me right now: by 2024, I'd like to be able to do something as
simple as this in a single OS process, but utilizing "lightweight" OS
threads (e.g., pthreads):

#!/usr/bin/env perl
use strict;
use warnings;
use v12;

# 'traditional perl code'
#...
#

my_r $mailbox_ref = [];
map_r(8) {
    local $thread_id = tid();
    $mailbox_ref->[$thread_id] = qq{Hello from thread #$thread_id};
}
foreach my $msg (@$mailbox_ref) {
    say $msg;
}

Output:

Hello from thread #0
Hello from thread #1
Hello from thread #2
Hello from thread #3
Hello from thread #4
Hello from thread #5
Hello from thread #6
Hello from thread #7

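For contrast, the closest thing available today is ithreads via
threads.pm, where each "thread" is a full interpreter clone rather than a
lightweight pthread over shared data - runnable now, but heavyweight:

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use threads; # ithreads: each thread clones the interpreter

my @workers = map {
    threads->create( sub { sprintf 'Hello from thread #%d', threads->tid() } );
} 1 .. 8;

say $_->join() for @workers;
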
Another example,

#!/usr/bin/env perl
use strict;
use warnings;
use v12;

# 'traditional perl code'
#...
#

sub state_var_update {
    state $ret = 0;
    return ++$ret;
}

sub atomic_var_update {
    atomic $ret = 0; # like 'state' but actions upon it are atomic
    return ++$ret;   # atomic nature forces unordered serialization
}

sub now_serving {
    state $now_serving = 1; # first ticket issued is 1
    my $ticket = shift;
    return undef if $ticket != $now_serving;
    ++$now_serving;         # this ticket is served; move on to the next
    return q{You've been served a nice warm rye loaf!};
}

my $num_threads = 8;
map_r($num_threads) {
    local $thread_id = tid();
    local $bakery_ticket = atomic_var_update; # threads race
    # threads do stuff
    # .. and more stuff
    # ... and maybe more stuff

    # busy wait
    local $ihazbeenserved;
    do {
        $ihazbeenserved = now_serving($bakery_ticket);
    } while (not $ihazbeenserved);

    # "safely" print to STDOUT without clobbering (could use regular
    # printf and risk messages overlapping)
    printf_r("Yay! I, thread %d, now have my %s\n", $thread_id, $ihazbeenserved);
}

# had to google for the following
our ($name, $age, $fee);

format STUDENT =
===================================
@<<<<<<<<<<<<<<<<<<<< @<<
$name $age
@#####.##
$fee
===================================
.

select(STDOUT);
$~ = 'STUDENT';

my @n = ("Deepak", "Rajat", "Vikrant");
my @a = (18, 16, 14);
my @s = (2000.00, 2500.00, 4000.00);

my $i = 0;
foreach (@n) {
    $name = $_;
    $age  = $a[$i];
    $fee  = $s[$i++];
    write;
}

>
> If you want to call OpenMP easily, I'm also trying that in my SPVM module.
>
> https://github.com/yuki-kimoto/SPVM/tree/master/examples/native/openmp
>

I saw your module come through on metacpan under "recent" and it looked
very interesting. I'll look deeper into it.

> Perl already has Inline::C/XS/FFI modules. SPVM is another way to bind
> C libraries.
>
> More examples:
>
> - cuda
> - Eigen
> - GSL
> - OpenCV
> - zlib
>
> https://github.com/yuki-kimoto/SPVM/tree/master/examples/native

Yes, very nice. I will look at it. While what I want and what I am
playing with are vastly different, I am also very interested in FFI-type
support for leveraging SMP, GPUs (not really myself, but it's important),
etc.

My two examples above point to a few things:

0. '_r' indicates "thread safe" - this is an old tradition from IBM;
google it, it's a thing; I promise.

1. atomic data types (thread safe) that, in addition to providing for
"safe" updates (one thread at a time), can serve as barriers
(serialization points)

2. a map_r (or similar) construct that provides a threaded environment;
yes, I know 'map_r' also implies a lot of things related to the existing
'map'; that'd be cool too, but it seemed like a more apt name than what I
had originally, 'fork_r'.

3. ability for threads to call all subs "unsafely" (race condition); but
with the judicious use of atomics one can provide barriers to keep things
consistent

4. thread safe analogs to things like printf (see inline comment in code
above)

5. the use of 'format' was just because I've never used it before and
had just googled it to see what it's all about :)

I quite enjoyed coming up with these examples, and I might even, as an
exercise, try to come up with many more, so that when asked "what do you
mean?" or "show me an example of what you are thinking", I can point to them.

Even after this thought exercise, I am beginning to see how a mix of the
following might point us in the right direction (and through the
minefield that is the perl runtime):

* a threaded block (above, it's map_r; it could be fork_r, or just spawner, etc)
* thread safe forms of:
- scalar reference - declared with 'my_r'; an array ref or hash ref on
the other end is assumed to be a 'thread safe' anonymous array or hash
- arrays - slots can be updated in a threaded environment (threads
can be unsafe and simultaneously update an element; but that's the risk
with SMP)
- hash - same concept as the 'thread safe' array, but for a hash
- atomic - an actual scalar value that can be safely updated; its
primary use is as a synchronization point; default access is "unordered",
meaning the first thread there wins, then all others line up
non-deterministically; there should probably be a way to 'order'
threads based on tid() in some efficient way

Other thoughts,

* sub_r: a thread safe form of 'sub' that also has a serializing effect;
it is run by only one thread at a time (the user is allowed to
do unsafe things using perl's traditional scoping rules) - akin to an
OpenMP 'critical' section.

Anyway, thanks for asking. I'll think more about it and see if I can
flesh it out more.

Note, I am not going for any particular concurrency model. I am going
for "what does SMP look like in perl5". I think I have a good idea, and
the '_r' stuff and atomic(s) are merely language additions to get around
affecting "serial" perl. (No, I don't want a perl_r; just for the record.)

Brett

Re: threads and stuff (was Re: Perl’s leaky bucket) [ In reply to ]
On Fri, Apr 9, 2021 at 1:54 AM B. Estrade <brett@cpanel.net> wrote:

>
> 2. a map_r (or similar) construct that provides a threaded environment;
> yes, I know 'map_r' also implies a lot of things related to the existing
> 'map'; that'd be cool too, but it seemed like a more apt name than what I
> had originally, 'fork_r'.
>

It's still based on our current forking and event loop stuff, but I just
wanted to mention this nice (though underdocumented) interface:
https://metacpan.org/pod/Parallel::Map (essentially Future::Utils::fmap +
IO::Async::Function)

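For reference, usage looks roughly like this - a sketch based on my
reading of its docs (so treat the exact option names as approximate);
pmap_void returns a Future, and ->get drives the loop to completion:

use strict;
use warnings;
use Parallel::Map qw(pmap_void);

my @items = (1 .. 8);

# each block invocation runs in a forked worker, at most 4 at a time
( pmap_void {
    my ($item) = @_;
    printf qq{item %d handled by pid %d\n}, $item, $$;
} foreach => \@items, forks => 4 )->get;
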
-Dan
Re: threads and stuff (was Re: Perl’s leaky bucket) [ In reply to ]
"B. Estrade" <brett@cpanel.net> wrote:
> 0. '_r' indicates "thread safe" - this is an old tradition from IBM; google
> it, it's a thing; I promise.

AFAIK it initially meant "reentrant" (as in functions safe to call
from inside C signal handlers). Thread-safety and reentrancy can
overlap sometimes, but not always. Some implementers
unfortunately propagated the confusion by naming
non-reentrant-but-thread-safe things "_r".

> 4. thread safe analogs to things like printf (see inline comment in code
> above)

Perl buffered I/O might be better off being thread-safe by
default to mimic C stdio characteristics. It would reduce the
learning curve and reduce confusion for users switching between
C and Perl. Note that C printf is not reentrant, since it uses
locking under the hood, but Perl's printf is, thanks to "safe signals".
Re: threads and stuff (was Re: Perl’s leaky bucket) [ In reply to ]
Fair points, and thank you for the clarification.

Brett
Re: threads and stuff (was Re: Perl’s leaky bucket) [ In reply to ]
On Wed, Apr 07, 2021 at 08:49:31PM -0500, B. Estrade wrote:
> FWIW, I can accept that fundamentally this violates any number of basic
> assumptions in the perl runtime. I know it's interpreted, dynamic, etc. I
> understand conceptually that perl data structures necessarily are designed
> to be used in such a runtime that requires "bookkeeping". Worst case for me,
> I come away with a better understanding on why what I am asking actually is
> _impossible_.

> Can perl handle such a "threaded" block internally?
>
> The most likely answer I can project is:
>
> * not directly, an embedded "SMP capable interpreter" approach is the only
> clean way to do this.

Sure, Perl doesn't have a GIL (Global Interpreter Lock), but it doesn't need
to - it doesn't expose a concept of running more than one execution context
on the same interpreter, which is what the Python GIL is about -
time-slicing the single interpreter to run more than one execution context.
(Concurrency, but not parallelism. So just one CPU core.)

What you're describing would need more than one execution context able to
run on the same interpreter. Right now there's a total assumption of 1-to-1
mapping from interpreter to execution context. What differs from Python is
that we *don't* have the assumption of exactly one interpreter in an OS
process, all the C APIs are ready for this, and XS extensions likely should
be.

ithreads exploits this ability to have more than one interpreter to then
provide more than one execution context (which is what we care about) which
means that it can use more than one CPU.

On Wed, Apr 07, 2021 at 11:20:49PM -0400, Dan Book wrote:

> I'd defer to others more familiar with ithreads such as Nicholas's earlier
> post, but what you are describing is largely what threads.pm does, and its
> problems of being heavyweight to spawn and causing bugs largely stem from
> trying to share memory and have the perl interpreter available in each
> thread. So I'll just say, these are a lot of nice wishes, but getting the
> tradeoffs right in the implementation is a challenge, if it is possible at
> all.

Yes. The *problem* is/remains that everything internally assumes/relies on
"interpreter" === "execution context", meaning that

1) ithreads has to rather expensively copy the entire interpreter to create
a second execution context
2) there isn't a good way to share data between contexts

To be viable, the threading model Brett is suggesting still needs the
*first* point to be changed - it only works out if "execution context" is
decoupled from "interpreter state", and it becomes possible to have 2+
execution contexts running concurrently on the same interpreter.

ie *we* wouldn't have to solve:

There should be one -- and preferably only one -- [interpreter]

but the rest is effectively the same problem as removing the GIL.


And that failed, despite considerable efforts.

The two talks by Larry Hastings on this are quite interesting (and fun to
watch, as he is good at explaining things, and it's mostly not Python
specific)

In Removing Python's GIL: The Gilectomy - PyCon 2016
at https://www.youtube.com/watch?v=P3AyI_u66Bw#t=19m45s

he has a slide listing what you need to add locking to. Most of that maps
one-to-one to the perl internals:

* dicts => hashes (for symbol tables)
* lists => arrays (many other interpreter global structures)
* freelists => SV arenas


He measured an immediate 30% slowdown *just* due to having to have reference
counts now be implemented by CPU atomic operations.

A year later, in his talk with the update, he notes that he's moved to
buffered reference counting (and gives a very good explanation of this)

Suffice it to say, this approach makes DESTROY untimely. What's amusing is
that in the Q&A section he realises that he'd missed saying something
important:

https://www.youtube.com/watch?v=pLqv11ScGsQ#t=41m20s

The implication of buffered reference counting is that all refcount
decrements to zero happen on the "reference count committing thread". So,
naïvely, this means that all DESTROY actions happen on that thread, which
now has far too much work. Hence he uses queues to dispatch the destroy
action *back* to the thread that did the last refcount decrement. Meaning
that destruction is even more untimely.


The sad part is that a year later, the update was that he'd given up:

He is "out of bullets" at least with that approach.

With his complicated buffered-reference-count approach he was able to
get his "gilectomized" interpreter to reach performance parity with
CPython, except that his interpreter was running on around seven cores to
keep up with CPython on one.

https://lwn.net/Articles/754577/

It's interesting that he then considered that one would need to rewrite
CPython from reference counting to a tracing garbage collector to get
further; indeed, in another session at the same Python Language Summit
someone from Instagram said that they had experimented with exactly that:

It is part of the C API, though, so reference counting must be
maintained for C extensions; in the experiment, the Instagram developers
moved to a tracing garbage collector everywhere else.

https://lwn.net/Articles/754163/


The Instagram folks also experimented with changing CPython's data
structures to optimise for the common case, and saw speedups.
(this obviously makes the C code more complex and harder to understand, but
more importantly, this does start to break the C APIs, and hence existing
extensions). Which is interesting, because at the end of his talk the
previous year, Larry Hastings observed that Jython and IronPython can both
linearly scale threads with CPUs. Both underlying VMs are written in C (or
equivalent), hence an "existence proof" - it's clearly possible to scale
linearly with a C implementation - it was just a question of how much the C
API he needed to break to get there.

So I'm wondering (this is more [original arm waving] than [original
research]), whether in effect Jython and IronPython are just taking the
(equivalent) slowdown hit for concurrency as his Gilectomy, and then are
able to win the speed back simply by using better data structures (optimised
to the common runtime use patterns), because they aren't constrained by the
C ABI.

(Also they have JITs. But maybe more importantly, before that, they can
specialise the bytecode executed at runtime to remove accessor calls, and do
things like realise that some objects are not visible to other threads, or
maybe don't even need to be on the heap. There are a lot of tricks that
python, perl and VMs of that maturity don't and now can't use.)


Also interesting on the Hacker News thread on the 2016 talk was this:

I work on a new Ruby interpreter, and our solution to this problem has
been to interpret the C code of extensions. That way we can give it one
API, but really implement another.

https://news.ycombinator.com/item?id=11845347

This is "Solang", "system that can execute LLVM-based languages on the JVM":

https://www.researchgate.net/publication/309443492_Bringing_low-level_languages_to_the_JVM_efficient_execution_of_LLVM_IR_on_Truffle
https://www.youtube.com/watch?v=YLtjkP9bD_U

Yes, this is crazy. Ruby needs to call C extensions. To make this efficient
for Ruby running on the JVM, we'll just go compile the C into something we
can (somehow) inline. And it worked. *

However, again, despite all this very smart work, TruffleRuby doesn't seem
to have taken over the world any more than JRuby did.

Nor has PyPy taken over Python.

Despite all these other smart folks working on problems similar to ours (eg
at least 7 on TruffleRuby), I don't see any obvious strategy to steal.



Sideways from this, I started to wonder how you might start to untangle the
"one interpreter" === "one execution context". The obvious start is to try
to avoid reference counting overhead where possible. Most objects *aren't*
visible to more than one thread. (In the degenerate case of one thread,
that's all objects - this feels like a good start.). Hence have the concept
of "local objects" and "shared objects" - it's just one flag bit. You *are
going to need a branch for every reference count twiddle, but it might be
faster, because mostly it will avoid atomic operations (or buffered
reference counting, or whatever)

Everything starts local - if an object becomes referenced by a shared
object, that object must be promoted to "shared". It's the same sort of idea
as generational GC, and would also imply a write barrier on every
assignment. That's a small CPU cost, but it's a new rule that XS code on
CPAN doesn't know that it needs to conform to...

But, stashes are global. So because of this invariant, subs are global, so
pads are global, so all lexicals are global.

Oh pants!

Suddenly everything is global, without trying.


And actually it's worse - pads aren't just global in this sense - they are
one of these *implicit* assumptions of "one is one" - they are an array of
arrays - the inner array is lexicals (and temporaries) for the subroutine,
the outer array permits one of these for each depth of recursion.

There's no way to have two threads execute the same subroutine
simultaneously, without some major refactoring.

(signatures, @_, return values on the stack, sending return values back from
multiple threads to the master at the end - these are all problems, but
trivial in comparison to things like this)

On Fri, Apr 09, 2021 at 12:53:27AM -0500, B. Estrade wrote:

> For now, I want all of us to think "thread safety" and "SMP" in everything
> we do. Not in a design sense, but in an unconscious sense, so that it imbues
> itself in every aspect of any work and thinking done in the name of
> perl/Perl.
>
> If you ask me right now: by 2024, I'd like to be able to do something as
> simple as this in a single OS process, but utilizing "lightweight" OS
> threads (e.g., pthreads):

> my $num_threads = 8;
> map_r($num_threads) {
>     local $thread_id = tid();
>     local $bakery_ticket = atomic_var_update; # threads race
>     # threads do stuff
>     # .. and more stuff
>     # ... and maybe more stuff
>
>     # busy wait
>     local $ihazbeenserved;
>     do {
>         $ihazbeenserved = now_serving($bakery_ticket);
>     } while (not $ihazbeenserved);
>
>     # "safely" print to STDOUT without clobbering (could use regular
>     # printf and risk messages overlapping)
>     printf_r("Yay! I, thread %d, now have my %s\n", $thread_id, $ihazbeenserved);
> }

This isn't something that we can solve by working on it for the next few
years - really it's something that could only be solved by working on it 25
years ago, and having a radically different implementation.

Basically, I don't think that the current Perl 5 internals are going to
support CPU level parallelism in the same interpreter, any more than
CPython ever will, or MRI.

You *can* get to CPU level parallelism within "leaf" calls to C - so likely
Yuki Kimoto's suggestion of SPVM seems interesting - he pasted the link to
https://github.com/yuki-kimoto/SPVM/tree/master/examples/native/openmp

For anything more than that, it's going to take other internals. For the
syntax you're suggesting, the fastest route to seeing it run might be to
look at https://github.com/rakudo-p5/v5 which aimed to write a Perl 5
parser using Rakudo. I'm aware that FROGGS stopped working on it a few
years ago - I don't know how complete it is, but I don't think that there
was any fundamental blocker.

Nicholas Clark

* There's a talk on it. You can watch the whole thing and just s/ruby/perl/g
I had no idea that the Ruby C "API" was just like the Perl C "API".

Spoilers start here:
Of the 2.1 billion lines of code in RubyGems, 0.5 billion is C.
A ruby extension in C can be 10x faster than MRI
Their approach is 15x faster than MRI *without* inlining C, 30x with.
They actually implement much of the Ruby C ABI in Ruby.
The Gem thinks that it's doing this:
existing Gem written in C, compiled => C function in MRI
when actually it's this:
existing Gem unchanged, compiled to LLVM bytecode => their C shim => Ruby
where *all* that is being run inside the JVM.

The single best slide is this:
https://www.youtube.com/watch?v=YLtjkP9bD_U&t=266s
Re: threads and stuff (was Re: Perl’s leaky bucket) [ In reply to ]
Thank you for your continued thoughts, time, and attention. If all we do
is start thinking about it, which I think these threads have
accomplished, then it's progress. I remain confident someone out there
will have an epiphany. All I can offer is encouragement, continued
thought, and those examples I posted showing how I imagine what I am
thinking about from a Perl programmer's perspective - taking into
account only the external developer experience and what I've
internalized idiomatically over the years.

I'm happy to continue to mention it, but I know there are other pressing
matters. I will just sum it up by reiterating what I think the most
important points are, for me:

* Perl's future relevance must be, in large part, driven through
interesting and hard application areas (the SMP thing came from only one
example, but that sort of took on a life of its own)

* SMP is a crucial capability, even if not threads. OpenMP is not
general threading; it's a very limited model. Perl just needs to provide
a very small subset; what that looks like versus what is possible, I
cannot assert. We should always be keeping it in mind. Our collective
unconscious will reveal what that looks like when the time is right.
I have faith in that.

* Perl has nothing to prove to language weenies; it has everything to
prove to its loyal following - practical people doing things they deem
to be serious, in fun and productive ways.

Thank you all, again, for everything. Especially the discussions and
considerations this topic has been afforded.

Cheers,
Brett

Re: threads and stuff (was Re: Perl’s leaky bucket) [ In reply to ]
B. Estrade

I think it's a good choice to work on parallelization at the C/C++ level
rather than across the entire Perl language.

The Perl level only provides fork-based parallelization and parallelization
via non-blocking I/O.

Other kinds of parallelization can be provided at the C/C++ level.

It feels almost impossible to add thread parallelization to the Perl
language without performance degradation.

Perl string processing is tied to copy-on-write.

I don't think it's possible to implement threads and copy-on-write at the
same time without performance degradation.


2021?4?9?(?) 23:31 B. Estrade <brett@cpanel.net>:

> Thank you for your continued thoughts, time, and attention. If all we do
> is starting thinking about it, which I think these threads have
> accomplished, then it's progress. I remain confident someone out there
> will have an epiphany. All I can offer is encouragement, continued
> thought, and those examples I posted showing how I can imaging what I am
> thinking about from a Perl programmer's perspective - taking into
> account only the external developer experience and what I've
> internalized idiomatically over the years.
>
> I'm happy to continue to mention it, but I know there are other pressing
> matters. I will just sum it up by reiterating what I think the most
> important points are, for me:
>
> * Perl's future relevance must be, in large part, driven through
> interesting and hard application areas (the SMP thing came from only one
> example, but that sort of took on a life of its own)
>
> * SMP is a crucial capability, even if not threads. OpenMP is not
> general threading; it's a very limited model. Perl just needs to provide
> a very small subset; what that looks like versus what is possible, I can
> not assert. We should always be keeping it mind. Our collective
> unconscious will reveal what that looks like when it is the right time.
> I have faith in that.
>
> * Perl has nothing to prove to language weenies; it has everything to
> prove to it's loyal following - practical people doing things they deem
> to be serious, in a fun and productive ways.
>
> Thank you all, again, for everything. Especially the discussions and
> considerations this topic has been afforded.
>
> Cheers,
> Brett
>
> On 4/9/21 8:31 AM, Nicholas Clark wrote:
> > On Wed, Apr 07, 2021 at 08:49:31PM -0500, B. Estrade wrote:
> >> FWIW, I can accept that fundamentally this violates any number of basic
> >> assumptions in the perl runtime. I know it's interpreted, dynamic, etc.
> I
> >> understand conceptually that perl data structures necessarily are
> designed
> >> to be used in such a runtime that requires "bookkeeping". Worst case
> for me,
> >> I come away with a better understanding on why what I am asking
> actually is
> >> _impossible_.
> >
> >> Can perl handle such a "threaded" block internally?
> >>
> >> The most likely answer I can project is:
> >>
> >> * not directly, an embedded "SMP capable interpreter" approach is the
> only
> >> clean way to do this.
> >
> > Sure, Perl doesn't have a GIL (Global Interpreter Lock), but it doesn't
> need
> > to - it doesn't expose a concept of running more than one execution
> context
> > on the same interpreter, which is what the Python GIL is about -
> > time-slicing the single interpreter to run more than one execution
> context.
> > (Non-concurrent parallelism. So just one CPU core)
> >
> > What you're describing would need more than one execution context able to
> > run on the same interpreter. Right now there's a total assumption of
> 1-to-1
> > mapping from interpreter to execution context. What differs from Python
> is
> > that we *don't* have the assumption of exactly one interpreter in an OS
> > process, all the C APIs are ready for this, and XS extensions likely
> should
> > be.
> >
> > ithreads exploits this ability to have more than one interpreter to then
> > provide more than one execution context (which is what we care about)
> which
> > means that it can use more than one CPU.
> >
> > On Wed, Apr 07, 2021 at 11:20:49PM -0400, Dan Book wrote:
> >
> >> I'd defer to others more familiar with ithreads such as Nicholas's
> earlier
> >> post, but what you are describing is largely what threads.pm does, and
> its
> >> problems of being heavyweight to spawn and causing bugs largely stem
> from
> >> trying to share memory and have the perl interpreter available in each
> >> thread. So I'll just say, these are a lot of nice wishes, but getting
> the
> >> tradeoffs right in the implementation is a challenge, if it is possible
> at
> >> all.
> >
> > Yes. The *problem* is/remains that everything internally assumes/relies
> on
> > "interpreter" === "execution context", meaning that ithreads has to
> >
> > 1) rather expensively copy the entire interpreter to create a second
> > execution context
> > 2) there isn't a good way to share data between contexts
> >
> > To be viable, the threading model Brett is suggesting still needs the
> > *first* point to be changed - it only works out if "execution context" is
> > decoupled from "interpreter state", and it becomes possible to have 2+
> > execution contexts running concurrently on the same interpreter.
> >
> > ie *we* wouldn't have to solve:
> >
> > There should be one -- and preferably only one -- [interpreter]
> >
> > but the rest is effectively the same problem as removing the GIL.
> >
> >
> > And that failed, despite considerable efforts
> >
> > The two talks by Larry Hastings on this are quite interesting (and fun to
> > watch, as he is good at explaining things, and it's mostly not Python
> > specific)
> >
> > In Removing Python's GIL: The Gilectomy - PyCon 2016
> > at https://www.youtube.com/watch?v=P3AyI_u66Bw#t=19m45s
> >
> > he has a slide listing what you need to add locking to. Most of that maps
> > one-to-one to the perl internals:
> >
> > * dicts => hashes (for symbol tables)
> > * lists => arrays (many other interpreter global structures)
> > * freelists => SV arenas
> >
> >
> > He measured an immediate 30% slowdown *just* due to having to have
> reference
> > counts now be implemented by CPU atomic operations.
> >
> > A year later, in his talk with the update, he notes that he's moved to
> > buffered reference counting (and gives a very good explanation of this)
> >
> > Sufficient to say this approach makes DESTROY untimely. What's amusing is
> > that in the Q&A section he realises that he'd missed saying something
> > important:
> >
> > https://www.youtube.com/watch?v=pLqv11ScGsQ#t=41m20s
> >
> > The implication of buffered reference counting is that all refcount
> > decrement to zero happens on the "reference count committing thread".
> So,
> > na??vely, this means that all DESTROY actions happen on that thread,
> which
> > now has far too much work. Hence he uses queues to dispatch the destroy
> > action *back* to the thread that did the last refcount decrement. Meaning
> > that destruction is even more untimely.
> >
> >
> > The sad part is that a year later, the update was that he'd given up:
> >
> > He is "out of bullets" at least with that approach.
> >
> > With his complicated buffered-reference-count approach he was able
> to
> > get his "gilectomized" interpreter to reach performance parity with
> > CPython???except that his interpreter was running on around seven
> cores to
> > keep up with CPython on one.
> >
> > https://lwn.net/Articles/754577/
> >
> > It's interesting that he then is considering that one would need to
> rewrite
> > CPython from reference counting to a tracing garbage collector to get
> > further, in that in another session at the same Python Language Summit
> > someone from Instagram said that they experimented with exactly that:
> >
> > It is part of the C API, though, so reference counting must be
> > maintained for C extensions; in the experiment, the Instagram
> developers
> > moved to a tracing garbage collector everywhere else.
> >
> > https://lwn.net/Articles/754163/
> >
> >
> > The Instagram folks also experimented with changing CPython's data
> > structures to optimise for the common case, and saw speedups.
> > (this obviously makes the C code more complex and harder to understand,
> but
> > more importantly, this does start to break the C APIs, and hence existing
> > extensions). Which is interesting, because at the end of his talk the
> > previous year, Larry Hastings observed that Jython and IronPython can
> both
> > linearly scale threads with CPUs. Both underlying VMs are written in C
> (or
> > equivalent), hence an "existence proof" - it's clearly possible to scale
> > linearly with a C implementation - it was just a question of how much
> the C
> > API he needed to break to get there.
> >
> > So I'm wondering (this is more [original arm waving] than [original
> > research]), whether in effect Jython and IronPython are just taking the
> > (equivalent) slowdown hit for concurrency as his Gilectomy, and then are
> > able to win the speed back simply by using better data structures (optimised
> > to the common runtime use patterns), because they aren't constrained by the
> > C ABI.
> >
> > (Also they have JITs. But maybe more importantly, before that, they can
> > specialise the bytecode executed at runtime to remove accessor calls, and do
> > things like realise that some objects are not visible to other threads, or
> > maybe don't even need to be on the heap. There are a lot of tricks that
> > python, perl and VMs of that maturity don't and now can't use.)
> >
> >
> > Also interesting on the Hacker News thread on the 2016 talk was this:
> >
> >      I work on a new Ruby interpreter, and our solution to this problem has
> >      been to interpret the C code of extensions. That way we can give it one
> >      API, but really implement another.
> >
> > https://news.ycombinator.com/item?id=11845347
> >
> > This is "Solang", "system that can execute LLVM-based languages on the
> JVM":
> >
> >
> https://www.researchgate.net/publication/309443492_Bringing_low-level_languages_to_the_JVM_efficient_execution_of_LLVM_IR_on_Truffle
> > https://www.youtube.com/watch?v=YLtjkP9bD_U
> >
> > Yes, this is crazy. Ruby needs to call C extensions. To make this efficient
> > for Ruby running on the JVM, we'll just go compile the C into something we
> > can (somehow) inline. And it worked. *
> >
> > However, again, despite all this very smart work, TruffleRuby doesn't seem
> > to have taken over the world any more than JRuby did.
> >
> > Nor has PyPy taken over Python.
> >
> > Despite all these other smart folks working on problems similar to ours (eg
> > at least 7 on TruffleRuby), I don't see any obvious strategy to steal.
> >
> >
> >
> > Sideways from this, I started to wonder how you might start to untangle the
> > "one interpreter" === "one execution context". The obvious start is to try
> > to avoid reference counting overhead where possible. Most objects *aren't*
> > visible to more than one thread. (In the degenerate case of one thread,
> > that's all objects - this feels like a good start.). Hence have the concept
> > of "local objects" and "shared objects" - it's just one flag bit. You *are
> > going to need a branch for every reference count twiddle, but it might be
> > faster, because mostly it will avoid atomic operations (or buffered
> > reference counting, or whatever)
> >
> > Everything starts local - if an object becomes referenced by a shared
> > object, that object must be promoted to "shared". It's the same sort of idea
> > as generational GC, and would also imply a write barrier on every
> > assignment. That's a small CPU cost, but it's a new rule that XS code on
> > CPAN doesn't know that it needs to conform to...
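
(To make the shape of that branch concrete - a sketch only: the flag bit,
atomic helper, and promotion walk below are made up for illustration, while
SvFLAGS and SvREFCNT are perl's real macros, assumed via perl's headers:)

    /* assumes perl's headers for the SV type, SvFLAGS, SvREFCNT */
    #include "EXTERN.h"
    #include "perl.h"

    /* hypothetical spare flag bit marking an SV as visible to >1 thread */
    #define SVf_SHARED_sketch 0x80000000

    void refcnt_inc_maybe_shared(SV *sv) {
        if (SvFLAGS(sv) & SVf_SHARED_sketch)
            shared_atomic_inc(&SvREFCNT(sv));   /* rare, slower path (made up) */
        else
            SvREFCNT(sv)++;                     /* common, thread-local path */
    }

    /* the write barrier: storing a local value into a shared container
       must first promote the value (and all it references) to shared */
    void store_with_barrier(SV *container, SV *value) {
        if ((SvFLAGS(container) & SVf_SHARED_sketch)
            && !(SvFLAGS(value) & SVf_SHARED_sketch))
            promote_to_shared(value);           /* recursive walk - made up */
        /* ... then perform the ordinary store ... */
    }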
> >
> > But, stashes are global. So because of this invariant, subs are global, so
> > pads are global, so all lexicals are global.
> >
> > Oh pants!
> >
> > Suddenly everything is global, without trying.
> >
> >
> > And actually it's worse - pads aren't just global in this sense - they are
> > one of these *implicit* assumptions of "one is one" - they are an array of
> > arrays - the inner array is lexicals (and temporaries) for the subroutine,
> > the outer array permits one of these for each depth of recursion.
> >
> > There's no way to have two threads execute the same subroutine
> > simultaneously, without some major refactoring.
> >
> > (signatures, @_, return values on the stack, sending return values back from
> > multiple threads to the master at the end - these are all problems, but
> > trivial in comparison to things like this)
> >
> > On Fri, Apr 09, 2021 at 12:53:27AM -0500, B. Estrade wrote:
> >
> >> For now, I want all of us to think "thread safety" and "SMP" in everything
> >> we do. Not in a design sense, in the unconsciousness sense; so it imbues
> >> itself in every aspect of any work and thinking done in the name of
> >> perl/Perl.
> >>
> >> If you ask me right now; in 2024, I'd like to be able to do something as
> >> simple as this as a single OS process, but utilizing "light weight" OS
> >> threads (e.g., pthreads):
> >
> >>    my $num_threads = 8;
> >>    map_r($num_threads) {
> >>      local $thread_id = tid();
> >>      local $bakery_ticket = atomic_var_update; # threads race
> >>      # threads do stuff
> >>      # .. and more stuff
> >>      # ... and maybe more stuff
> >>
> >>      # busy wait
> >>      do {
> >>        local $ihazbeenserved = now_serving($bakery_ticket);
> >>      } while(not $ihazbeenserved);
> >>
> >>      # "safely" print to STDOUT without clobbering (could use regular
> >>      # printf and risk messages overlapping)
> >>      printf_r("Yay! I, thread %d, now has my %s", $tid, $ihazbeenserved);
> >>    }
> >
> > This isn't something that we can solve by working on it for the next few
> > years - really it's something that could only be solved by working on it 25
> > years ago, and having a radically different implementation.
> >
> > Basically, I don't think that the current Perl 5 internals are going to
> > support CPU level parallelism in the same interpreter, any more than
> > CPython ever will, or MRI.
> >
> > You *can* get to CPU level parallelism within "leaf" calls to C - so likely
> > Yuki Kimoto's suggestion of SPVM seems interesting - he pasted the link to
> > https://github.com/yuki-kimoto/SPVM/tree/master/examples/native/openmp
> >
> > For anything more than that, it's going to take other internals. For the
> > syntax you're suggesting, the fastest route to seeing it run might be to
> > look at https://github.com/rakudo-p5/v5 which aimed to write a Perl 5
> > parser using Rakudo. I'm aware that FROGGS stopped working on it a few
> > years ago - I don't know how complete it is, but I don't think that there
> > was any fundamental blocker.
> >
> > Nicholas Clark
> >
> > * There's a talk on it. You can watch the whole thing and just s/ruby/perl/g
> >   I had no idea that the Ruby C "API" was just like the Perl C "API".
> >
> >    Spoilers start here:
> >    Of the 2.1 billion lines of code in RubyGems, .5 billion is C.
> >    A ruby extension in C can be 10x faster than MRI
> >    Their approach is 15x faster than MRI *without* inlining C, 30x with.
> >    They actually implement much of the Ruby C ABI in Ruby.
> >    The Gem thinks that it's doing this:
> >    existing Gem written in C, compiled => C function in MRI
> >    when actually it's this:
> >    existing Gem unchanged, compiled to LLVM bytecode => their C shim => Ruby
> >    where *all* that is being run inside the JVM.
> >
> >    The single best slide is this:
> > https://www.youtube.com/watch?v=YLtjkP9bD_U&t=266s
> >
>
Re: threads and stuff (was Re: Perl’s leaky bucket) [ In reply to ]
On 4/11/21 8:13 PM, Yuki Kimoto wrote:
> B. Estrade
>
> I think it's a good choice to work on parallelization at the C / C ++
> level rather than the entire Perl language.
>

Thank you. I've been playing with Inline::C to explore the limitations
of OpenMP::Environment, and indeed I found some, which has motivated
additional work for me there. However, we're truly blessed to have all
the FFI options available and the people supporting them.
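
To make that concrete, here is the shape of what I have been playing
with - a minimal sketch, assuming a compiler with OpenMP support,
Alien::OpenMP to supply the build flags, and OpenMP::Environment just to
set OMP_NUM_THREADS from the Perl side (the C side re-reads it, since the
OpenMP runtime only consults the environment when it initializes):

    use strict;
    use warnings;
    use OpenMP::Environment ();

    # set the thread count the C side will pick up
    my $oenv = OpenMP::Environment->new;
    $oenv->omp_num_threads(8);

    print omp_sum(1_000_000), "\n";    # 500000500000, computed in parallel

    use Inline C => 'DATA', with => 'Alien::OpenMP';

    __DATA__
    __C__
    #include <omp.h>
    #include <stdlib.h>

    /* parallel reduction over 1..n */
    double omp_sum(int n) {
        double sum = 0.0;
        int i;
        /* honor OMP_NUM_THREADS as set from Perl at runtime */
        char *nt = getenv("OMP_NUM_THREADS");
        if (nt) omp_set_num_threads(atoi(nt));
        #pragma omp parallel for reduction(+:sum)
        for (i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }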

> Perl Level only provides fork parallelization and parallelization of non
> blocking I/O.

Well, not really even that in terms of semantics. I'll expand on what I
think I mean, later.
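
What we have today at the Perl level is really process-level parallelism.
A minimal sketch of what I mean - separate interpreters, separate address
spaces, results coming back via pipes or exit status rather than shared
variables:

    use strict;
    use warnings;
    use POSIX ();

    my @pids;
    for my $worker (0 .. 3) {
        my $pid = fork // die "fork failed: $!";
        if ($pid == 0) {             # child: a full copy of the interpreter
            print "worker $worker did its chunk\n";
            POSIX::_exit(0);         # skip the parent's cleanup in the child
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;         # no shared state to reconcile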

>
> The other parallelization can be provided in C/C++ level.
>
> It feels almost impossible to achieve Perl language and thread
> parallelization at the same time without performance degradation.

I've come to the conclusion that perl/Perl is better modeled mentally as an
operating system with a single CPU. So being "performant" may end up
being a trade-off against the ability to provide consistent and
meaningful semantic extensions to the language. I am not saying I don't
care about performance; that is not true at all.

But as useful as mental models are, they're fully subjective. I don't
expect anyone else to actually believe perl/Perl is an operating system.
I would not argue this, either, in the strictest sense of the term.

The test of a "good" mental model is ultimately how useful it is as a
reasoning tool. One mental model can't be proven better or more accurate
than another. You have to look at the consistency and strength of the
ideas that such thinking enables.

In our situation, we're striving to reach some balance of
"implementation cost + performance cost + language consistency" with
"semantic power". That's a hard problem to solve, particularly when we
seem to be lacking any more "use cases" of Perl that are able to inform us
of the next "great" thing to add to it. So that's the challenge. There's
no more low hanging fruit, so we need to get creative.

>
> Perl string processing is tied to copy-on-write.
>
> I don't think it's possible to implement threads and copy-on-write at
> the same time without performance degradation.
>

I am working on an RFC that introduces my current mental model that has
been fruitful in identifying, I think, a powerful incremental step
forward. And just to be clear, it is not exactly "threaded" perl, but it
is my hope it presents a way forward for a lot of things.

The goal is to post this in the next few hours. I am not trying to bait
anyone, just to say that I think I understand some things better thanks
to the outlet here on p5p.

Brett

>
>
> Fri, Apr 9, 2021, 23:31 B. Estrade <brett@cpanel.net
> <mailto:brett@cpanel.net>>:
>
> Thank you for your continued thoughts, time, and attention. If all we do
> is start thinking about it, which I think these threads have
> accomplished, then it's progress. I remain confident someone out there
> will have an epiphany. All I can offer is encouragement, continued
> thought, and those examples I posted showing how I can imagine what I am
> thinking about from a Perl programmer's perspective - taking into
> account only the external developer experience and what I've
> internalized idiomatically over the years.
>
> I'm happy to continue to mention it, but I know there are other
> pressing
> matters. I will just sum it up by reiterating what I think the most
> important points are, for me:
>
> * Perl's future relevance must be, in large part, driven through
> interesting and hard application areas (the SMP thing came from only
> one
> example, but that sort of took on a life of its own)
>
> * SMP is a crucial capability, even if not threads. OpenMP is not
> general threading; it's a very limited model. Perl just needs to
> provide
> a very small subset; what that looks like versus what is possible, I
> can
> not assert. We should always be keeping it in mind. Our collective
> unconscious will reveal what that looks like when it is the right time.
> I have faith in that.
>
> * Perl has nothing to prove to language weenies; it has everything to
> prove to its loyal following - practical people doing things they deem
> to be serious, in fun and productive ways.
>
> Thank you all, again, for everything. Especially the discussions and
> considerations this topic has been afforded.
>
> Cheers,
> Brett
>
> On 4/9/21 8:31 AM, Nicholas Clark wrote:
> > On Wed, Apr 07, 2021 at 08:49:31PM -0500, B. Estrade wrote:
> >> FWIW, I can accept that fundamentally this violates any number
> of basic
> >> assumptions in the perl runtime. I know it's interpreted,
> dynamic, etc. I
> >> understand conceptually that perl data structures necessarily
> are designed
> >> to be used in such a runtime that requires "bookkeeping". Worst
> case for me,
> >> I come away with a better understanding on why what I am asking
> actually is
> >> _impossible_.
> >
> >> Can perl handle such a "threaded" block internally?
> >>
> >> The most likely answer I can project is:
> >>
> >> * not directly, an embedded "SMP capable interpreter" approach
> is the only
> >> clean way to do this.
> >
> > Sure, Perl doesn't have a GIL (Global Interpreter Lock), but it
> doesn't need
> > to - it doesn't expose a concept of running more than one
> execution context
> > on the same interpreter, which is what the Python GIL is about -
> > time-slicing the single interpreter to run more than one
> execution context.
> > (Non-concurrent parallelism. So just one CPU core)
> >
> > What you're describing would need more than one execution context
> able to
> > run on the same interpreter. Right now there's a total assumption
> of 1-to-1
> > mapping from interpreter to execution context. What differs from
> Python is
> > that we *don't* have the assumption of exactly one interpreter in
> an OS
> > process, all the C APIs are ready for this, and XS extensions
> likely should
> > be.
> >
> > ithreads exploits this ability to have more than one interpreter
> to then
> > provide more than one execution context (which is what we care
> about) which
> > means that it can use more than one CPU.
> >
> > On Wed, Apr 07, 2021 at 11:20:49PM -0400, Dan Book wrote:
> >
> >> I'd defer to others more familiar with ithreads such as
> Nicholas's earlier
> >> post, but what you are describing is largely what threads.pm
> <http://threads.pm> does, and its
> >> problems of being heavyweight to spawn and causing bugs largely
> stem from
> >> trying to share memory and have the perl interpreter available
> in each
> >> thread. So I'll just say, these are a lot of nice wishes, but
> getting the
> >> tradeoffs right in the implementation is a challenge, if it is
> possible at
> >> all.
> >
> > Yes. The *problem* is/remains that everything internally
> assumes/relies on
> > "interpreter" === "execution context", meaning that ithreads has to
> >
> > 1) rather expensively copy the entire interpreter to create a second
> >     execution context
> > 2) there isn't a good way to share data between contexts
> >
> > To be viable, the threading model Brett is suggesting still needs the
> > *first* point to be changed - it only works out if "execution
> context" is
> > decoupled from "interpreter state", and it becomes possible to
> have 2+
> > execution contexts running concurrently on the same interpreter.
> >
> > ie *we* wouldn't have to solve:
> >
> >      There should be one -- and preferably only one -- [interpreter]
> >
> > but the rest is effectively the same problem as removing the GIL.
> >
> >
> > And that failed, despite considerable efforts
> >
> > The two talks by Larry Hastings on this are quite interesting
> (and fun to
> > watch, as he is good at explaining things, and it's mostly not Python
> > specific)
> >
> > In Removing Python's GIL: The Gilectomy - PyCon 2016
> > at https://www.youtube.com/watch?v=P3AyI_u66Bw#t=19m45s
> >
> > he has a slide listing what you need to add locking to. Most of
> that maps
> > one-to-one to the perl internals:
> >
> > * dicts => hashes (for symbol tables)
> > * lists => arrays (many other interpreter global structures)
> > * freelists => SV arenas
> >
> >
> > He measured an immediate 30% slowdown *just* due to having to
> have reference
> > counts now be implemented by CPU atomic operations.
> >
> > A year later, in his talk with the update, he notes that he's
> moved to
> > buffered reference counting (and gives a very good explanation of
> this)
> >
> > Sufficient to say this approach makes DESTROY untimely. What's
> amusing is
> > that in the Q&A section he realises that he'd missed saying something
> > important:
> >
> > https://www.youtube.com/watch?v=pLqv11ScGsQ#t=41m20s
> >
> > The implication of buffered reference counting is that all refcount
> > decrement to zero happens on the "reference count committing
> thread".  So,
> > naïvely, this means that all DESTROY actions happen on that
> thread, which
> > now has far too much work. Hence he uses queues to dispatch the
> destroy
> > action *back* to the thread that did the last refcount decrement.
> Meaning
> > that destruction is even more untimely.
> >
> >
> > The sad part is that a year later, the update was that he'd given up:
> >
> >      He is "out of bullets" at least with that approach.
> >
> >      With his complicated buffered-reference-count approach he
> was able to
> >      get his "gilectomized" interpreter to reach performance
> parity with
> >      CPython - except that his interpreter was running on around
> seven cores to
> >      keep up with CPython on one.
> >
> > https://lwn.net/Articles/754577/
> >
> > It's interesting that he then is considering that one would need
> to rewrite
> > CPython from reference counting to a tracing garbage collector to get
> > further, in that in another session at the same Python Language
> Summit
> > someone from Instagram said that they experimented with exactly that:
> >
> >      It is part of the C API, though, so reference counting must be
> >      maintained for C extensions; in the experiment, the
> Instagram developers
> >      moved to a tracing garbage collector everywhere else.
> >
> > https://lwn.net/Articles/754163/
> >
> >
> > The Instagram folks also experimented with changing CPython's data
> > structures to optimise for the common case, and saw speedups.
> > (this obviously makes the C code more complex and harder to
> understand, but
> > more importantly, this does start to break the C APIs, and hence
> existing
> > extensions). Which is interesting, because at the end of his talk the
> > previous year, Larry Hastings observed that Jython and IronPython
> can both
> > linearly scale threads with CPUs. Both underlying VMs are written
> in C (or
> > equivalent), hence an "existence proof" - it's clearly possible
> to scale
> > linearly with a C implementation - it was just a question of how
> much the C
> > API he needed to break to get there.
> >
> > So I'm wondering (this is more [original arm waving] than [original
> > research]), whether in effect Jython and IronPython are just
> taking the
> > (equivalent) slowdown hit for concurrency as his Gilectomy, and
> then are
> > able to win the speed back simply by using better data structures
> (optimised
> > to the common runtime use patterns), because they aren't
> constrained by the
> > C ABI.
> >
> > (Also they have JITs. But maybe more importantly, before that,
> they can
> > specialise the bytecode executed at runtime to remove accessor
> calls, and do
> > things like realise that some objects are not visible to other
> threads, or
> > maybe don't even need to be on the heap. There are a lot of
> tricks that
> > python, perl and VMs of that maturity don't and now can't use.)
> >
> >
> > Also interesting on the Hacker News thread on the 2016 talk was this:
> >
> >      I work on a new Ruby interpreter, and our solution to this
> problem has
> >      been to interpret the C code of extensions. That way we can
> give it one
> >      API, but really implement another.
> >
> > https://news.ycombinator.com/item?id=11845347
> >
> > This is "Solang", "system that can execute LLVM-based languages
> on the JVM":
> >
> >
> https://www.researchgate.net/publication/309443492_Bringing_low-level_languages_to_the_JVM_efficient_execution_of_LLVM_IR_on_Truffle
> > https://www.youtube.com/watch?v=YLtjkP9bD_U
> >
> > Yes, this is crazy. Ruby needs to call C extensions. To make this
> efficient
> > for Ruby running on the JVM, we'll just go compile the C into
> something we
> > can (somehow) inline. And it worked. *
> >
> > However, again, despite all this very smart work, TruffleRuby
> doesn't seem
> > to have taken over the world any more than JRuby did.
> >
> > Nor has PyPy taken over Python.
> >
> > Despite all these other smart folks working on problems similar
> to ours (eg
> > at least 7 on TruffleRuby), I don't see any obvious strategy to
> steal.
> >
> >
> >
> > Sideways from this, I started to wonder how you might start to
> untangle the
> > "one interpreter" === "one execution context". The obvious start
> is to try
> > to avoid reference counting overhead where possible. Most objects
> *aren't*
> > visible to more than one thread. (In the degenerate case of one
> thread,
> > that's all objects - this feels like a good start.). Hence have
> the concept
> > of "local objects" and "shared objects" - it's just one flag bit.
> You *are
> > going to need a branch for every reference count twiddle, but it
> might be
> > faster, because mostly it will avoid atomic operations (or buffered
> > reference counting, or whatever)
> >
> > Everything starts local - if an object becomes referenced by a shared
> > object, that object must be promoted to "shared". It's the same
> sort of idea
> > as generational GC, and would also imply a write barrier on every
> > assignment. That's a small CPU cost, but it's a new rule that XS
> code on
> > CPAN doesn't know that it needs to conform to...
> >
> > But, stashes are global. So because of this invariant, subs are
> global, so
> > pads are global, so all lexicals are global.
> >
> > Oh pants!
> >
> > Suddenly everything is global, without trying.
> >
> >
> > And actually it's worse - pads aren't just global in this sense -
> they are
> > one of these *implicit* assumptions of "one is one" - they are an
> array of
> > arrays - the inner array is lexicals (and temporaries) for the
> subroutine,
> > the outer array permits one of these for each depth of recursion.
> >
> > There's no way to have two threads execute the same subroutine
> > simultaneously, without some major refactoring.
> >
> > (signatures, @_, return values on the stack, sending return
> values back from
> > multiple threads to the master at the end - these are all
> problems, but
> > trivial in comparison to things like this)
> >
> > On Fri, Apr 09, 2021 at 12:53:27AM -0500, B. Estrade wrote:
> >
> >> For now, I want all of us to think "thread safety" and "SMP" in
> everything
> >> we do. Not in a design sense, in the unconsciousness sense; so it
> imbues
> >> itself in every aspect of any work and thinking done in the name of
> >> perl/Perl.
> >>
> >> If you ask me right now; in 2024, I'd like to be able to do
> something as
> >> simple as this as a single OS process, but utilizing "light
> weight" OS
> >> threads (e.g., pthreads):
> >
> >>    my $num_threads = 8;
> >>    map_r($num_threads) {
> >>      local $thread_id = tid();
> >>      local $bakery_ticket = atomic_var_update; # threads race
> >>      # threads do stuff
> >>      # .. and more stuff
> >>      # ... and maybe more stuff
> >>
> >>      # busy wait
> >>      do {
> >>        local $ihazbeenserved = now_serving($bakery_ticket);
> >>      } while(not $ihazbeenserved);
> >>
> >>      # "safely" print to STDOUT without clobbering (could use
> regular
> >>      # printf and risk messages overlapping)
> >>      printf_r("Yay! I, thread %d, now has my %s", $tid,
> $ihazbeenserved);
> >>    }
> >
> > This isn't something that we can solve by working on it for the
> next few
> > years - really it's something that could only be solved by
> working on it 25
> > years ago, and having a radically different implementation.
> >
> > Basically, I don't think that the current Perl 5 internals are
> going to
> > support CPU level parallelism in the same interpreter, any more than
> > CPython ever will, or MRI.
> >
> > You *can* get to CPU level parallelism within "leaf" calls to C -
> so likely
> > Yuki Kimoto's suggestion of SPVM seems interesting - he pasted
> the link to
> >
> https://github.com/yuki-kimoto/SPVM/tree/master/examples/native/openmp
> >
> > For anything more than that, it's going to take other internals.
> For the
> > syntax you're suggesting, the fastest route to seeing it run
> might be to
> > look at https://github.com/rakudo-p5/v5 which aimed to write a Perl 5
> > parser using Rakudo. I'm aware that FROGGS stopped working on it
> a few
> > years ago - I don't know how complete it is, but I don't think
> that there
> > was any fundamental blocker.
> >
> > Nicholas Clark
> >
> > * There's a talk on it. You can watch the whole thing and just
> s/ruby/perl/g
> >    I had no idea that the Ruby C "API" was just like the Perl C
> "API".
> >
> >    Spoilers start here:
> >    Of the 2.1 billion lines of code in RubyGems, .5 billion is C.
> >    A ruby extension in C can be 10x faster than MRI
> >    Their approach is 15x faster than MRI *without* inlining C,
> 30x with.
> >    They actually implement much of the Ruby C ABI in Ruby.
> >    The Gem thinks that it's doing this:
> >    existing Gem written in C, compiled => C function in MRI
> >    when actually it's this:
> >    existing Gem unchanged, compiled to LLVM bytecode => their C
> shim => Ruby
> >    where *all* that is being run inside the JVM.
> >
> >    The single best slide is this:
> > https://www.youtube.com/watch?v=YLtjkP9bD_U&t=266s
> >
>
Re: threads and stuff (was Re: Perl’s leaky bucket) [ In reply to ]
B. Estrade

> I am working on an RFC that introduces my current mental model that has
> been fruitful in identifying

If there is a possible implementation, it could also be shared in a branch.


Mon, Apr 12, 2021, 11:09 B. Estrade <brett@cpanel.net>:

>
>
> On 4/11/21 8:13 PM, Yuki Kimoto wrote:
> > B. Estrade
> >
> > I think it's a good choice to work on parallelization at the C / C ++
> > level rather than the entire Perl language.
> >
>
> Thank you. I've been playing with Inline::C to explore the limitations
> of OpenMP::Environment, and indeed I found some, which has motivated
> additional work for me there. However, we're truly blessed to have all
> the FFI options available and the people supporting them.
>
> > Perl Level only provides fork parallelization and parallelization of non
> > blocking I/O.
>
> Well, not really even that in terms of semantics. I'll expand on what I
> think I mean, later.
>
> >
> > The other parallelization can be provided in C/C++ level.
> >
> > It feels almost impossible to achieve Perl language and thread
> > parallelization at the same time without performance degradation.
>
> I've come to the conclusion that perl/Perl is better modeled mentally as an
> operating system with a single CPU. So being "performant" may end up
> being a trade-off against the ability to provide consistent and
> meaningful semantic extensions to the language. I am not saying I don't
> care about performance; that is not true at all.
>
> But as useful as mental models are, they're fully subjective. I don't
> expect anyone else to actually believe perl/Perl is an operating system.
> I would not argue this, either, in the strictest sense of the term.
>
> The test of a "good" mental model is ultimately how useful it is as a
> reasoning tool. One mental model can't be proven better or more accurate
> than another. You have to look at the consistency and strength of the
> ideas that such thinking enables.
>
> In our situation, we're striving to reach some balance of
> "implementation cost + performance cost + language consistency" with
> "semantic power". That's a hard problem to solve, particularly when we
> seem to be lacking any more "use cases" of Perl that are able to inform us
> of the next "great" thing to add to it. So that's the challenge. There's
> no more low hanging fruit, so we need to get creative.
>
> >
> > Perl string processing is tied to copy-on-write.
> >
> > I don't think it's possible to implement threads and copy-on-write at
> > the same time without performance degradation.
> >
>
> I am working on an RFC that introduces my current mental model that has
> been fruitful in identifying, I think, a powerful incremental step
> forward. And just to be clear, it is not exactly "threaded" perl, but it
> is my hope it presents a way forward for a lot of things.
>
> The goal is to post this in the next few hours. I am not trying to bait
> anyone, just to say that I think I understand some things better thanks
> to the outlet here on p5p.
>
> Brett
>
> >
> >
> > Fri, Apr 9, 2021, 23:31 B. Estrade <brett@cpanel.net
> > <mailto:brett@cpanel.net>>:
> >
> > Thank you for your continued thoughts, time, and attention. If all
> > we do
> > is start thinking about it, which I think these threads have
> > accomplished, then it's progress. I remain confident someone out
> there
> > will have an epiphany. All I can offer is encouragement, continued
> > thought, and those examples I posted showing how I can imagine what
> > I am
> > thinking about from a Perl programmer's perspective - taking into
> > account only the external developer experience and what I've
> > internalized idiomatically over the years.
> >
> > I'm happy to continue to mention it, but I know there are other
> > pressing
> > matters. I will just sum it up by reiterating what I think the most
> > important points are, for me:
> >
> > * Perl's future relevance must be, in large part, driven through
> > interesting and hard application areas (the SMP thing came from only
> > one
> > example, but that sort of took on a life of its own)
> >
> > * SMP is a crucial capability, even if not threads. OpenMP is not
> > general threading; it's a very limited model. Perl just needs to
> > provide
> > a very small subset; what that looks like versus what is possible, I
> > can
> > not assert. We should always be keeping it in mind. Our collective
> > unconscious will reveal what that looks like when it is the right
> time.
> > I have faith in that.
> >
> > * Perl has nothing to prove to language weenies; it has everything to
> > prove to its loyal following - practical people doing things they deem
> > to be serious, in fun and productive ways.
> >
> > Thank you all, again, for everything. Especially the discussions and
> > considerations this topic has been afforded.
> >
> > Cheers,
> > Brett
> >
> > On 4/9/21 8:31 AM, Nicholas Clark wrote:
> > > On Wed, Apr 07, 2021 at 08:49:31PM -0500, B. Estrade wrote:
> > >> FWIW, I can accept that fundamentally this violates any number
> > of basic
> > >> assumptions in the perl runtime. I know it's interpreted,
> > dynamic, etc. I
> > >> understand conceptually that perl data structures necessarily
> > are designed
> > >> to be used in such a runtime that requires "bookkeeping". Worst
> > case for me,
> > >> I come away with a better understanding on why what I am asking
> > actually is
> > >> _impossible_.
> > >
> > >> Can perl handle such a "threaded" block internally?
> > >>
> > >> The most likely answer I can project is:
> > >>
> > >> * not directly, an embedded "SMP capable interpreter" approach
> > is the only
> > >> clean way to do this.
> > >
> > > Sure, Perl doesn't have a GIL (Global Interpreter Lock), but it
> > doesn't need
> > > to - it doesn't expose a concept of running more than one
> > execution context
> > > on the same interpreter, which is what the Python GIL is about -
> > > time-slicing the single interpreter to run more than one
> > execution context.
> > > (Non-concurrent parallelism. So just one CPU core)
> > >
> > > What you're describing would need more than one execution context
> > able to
> > > run on the same interpreter. Right now there's a total assumption
> > of 1-to-1
> > > mapping from interpreter to execution context. What differs from
> > Python is
> > > that we *don't* have the assumption of exactly one interpreter in
> > an OS
> > > process, all the C APIs are ready for this, and XS extensions
> > likely should
> > > be.
> > >
> > > ithreads exploits this ability to have more than one interpreter
> > to then
> > > provide more than one execution context (which is what we care
> > about) which
> > > means that it can use more than one CPU.
> > >
> > > On Wed, Apr 07, 2021 at 11:20:49PM -0400, Dan Book wrote:
> > >
> > >> I'd defer to others more familiar with ithreads such as
> > Nicholas's earlier
> > >> post, but what you are describing is largely what threads.pm
> > <http://threads.pm> does, and its
> > >> problems of being heavyweight to spawn and causing bugs largely
> > stem from
> > >> trying to share memory and have the perl interpreter available
> > in each
> > >> thread. So I'll just say, these are a lot of nice wishes, but
> > getting the
> > >> tradeoffs right in the implementation is a challenge, if it is
> > possible at
> > >> all.
> > >
> > > Yes. The *problem* is/remains that everything internally
> > assumes/relies on
> > > "interpreter" === "execution context", meaning that ithreads has
> to
> > >
> > > 1) rather expensively copy the entire interpreter to create a
> second
> > > execution context
> > > 2) there isn't a good way to share data between contexts
> > >
> > > To be viable, the threading model Brett is suggesting still needs
> the
> > > *first* point to be changed - it only works out if "execution
> > context" is
> > > decoupled from "interpreter state", and it becomes possible to
> > have 2+
> > > execution contexts running concurrently on the same interpreter.
> > >
> > > ie *we* wouldn't have to solve:
> > >
> > > There should be one -- and preferably only one --
> [interpreter]
> > >
> > > but the rest is effectively the same problem as removing the GIL.
> > >
> > >
> > > And that failed, despite considerable efforts
> > >
> > > The two talks by Larry Hastings on this are quite interesting
> > (and fun to
> > > watch, as he is good at explaining things, and it's mostly not
> Python
> > > specific)
> > >
> > > In Removing Python's GIL: The Gilectomy - PyCon 2016
> > > at https://www.youtube.com/watch?v=P3AyI_u66Bw#t=19m45s
> > >
> > > he has a slide listing what you need to add locking to. Most of
> > that maps
> > > one-to-one to the perl internals:
> > >
> > > * dicts => hashes (for symbol tables)
> > > * lists => arrays (many other interpreter global structures)
> > > * freelists => SV arenas
> > >
> > >
> > > He measured an immediate 30% slowdown *just* due to having to
> > have reference
> > > counts now be implemented by CPU atomic operations.
> > >
> > > A year later, in his talk with the update, he notes that he's
> > moved to
> > > buffered reference counting (and gives a very good explanation of
> > this)
> > >
> > > Sufficient to say this approach makes DESTROY untimely. What's
> > amusing is
> > > that in the Q&A section he realises that he'd missed saying
> something
> > > important:
> > >
> > > https://www.youtube.com/watch?v=pLqv11ScGsQ#t=41m20s
> > >
> > > The implication of buffered reference counting is that all
> refcount
> > > decrement to zero happens on the "reference count committing
> > thread". So,
> > > naïvely, this means that all DESTROY actions happen on that
> > thread, which
> > > now has far too much work. Hence he uses queues to dispatch the
> > destroy
> > > action *back* to the thread that did the last refcount decrement.
> > Meaning
> > > that destruction is even more untimely.
> > >
> > >
> > > The sad part is that a year later, the update was that he'd given
> up:
> > >
> > > He is "out of bullets" at least with that approach.
> > >
> > > With his complicated buffered-reference-count approach he
> > was able to
> > > get his "gilectomized" interpreter to reach performance
> > parity with
> > >      CPython - except that his interpreter was running on around
> > seven cores to
> > > keep up with CPython on one.
> > >
> > > https://lwn.net/Articles/754577/
> > >
> > > It's interesting that he then is considering that one would need
> > to rewrite
> > > CPython from reference counting to a tracing garbage collector to
> get
> > > further, in that in another session at the same Python Language
> > Summit
> > > someone from Instagram said that they experimented with exactly
> that:
> > >
> > > It is part of the C API, though, so reference counting must
> be
> > > maintained for C extensions; in the experiment, the
> > Instagram developers
> > > moved to a tracing garbage collector everywhere else.
> > >
> > > https://lwn.net/Articles/754163/
> > >
> > >
> > > The Instagram folks also experimented with changing CPython's data
> > > structures to optimise for the common case, and saw speedups.
> > > (this obviously makes the C code more complex and harder to
> > understand, but
> > > more importantly, this does start to break the C APIs, and hence
> > existing
> > > extensions). Which is interesting, because at the end of his talk
> the
> > > previous year, Larry Hastings observed that Jython and IronPython
> > can both
> > > linearly scale threads with CPUs. Both underlying VMs are written
> > in C (or
> > > equivalent), hence an "existence proof" - it's clearly possible
> > to scale
> > > linearly with a C implementation - it was just a question of how
> > much the C
> > > API he needed to break to get there.
> > >
> > > So I'm wondering (this is more [original arm waving] than
> [original
> > > research]), whether in effect Jython and IronPython are just
> > taking the
> > > (equivalent) slowdown hit for concurrency as his Gilectomy, and
> > then are
> > > able to win the speed back simply by using better data structures
> > (optimised
> > > to the common runtime use patterns), because they aren't
> > constrained by the
> > > C ABI.
> > >
> > > (Also they have JITs. But maybe more importantly, before that,
> > they can
> > > specialise the bytecode executed at runtime to remove accessor
> > calls, and do
> > > things like realise that some objects are not visible to other
> > threads, or
> > > maybe don't even need to be on the heap. There are a lot of
> > tricks that
> > > python, perl and VMs of that maturity don't and now can't use.)
> > >
> > >
> > > Also interesting on the Hacker News thread on the 2016 talk was
> this:
> > >
> > > I work on a new Ruby interpreter, and our solution to this
> > problem has
> > > been to interpret the C code of extensions. That way we can
> > give it one
> > > API, but really implement another.
> > >
> > > https://news.ycombinator.com/item?id=11845347
> > >
> > > This is "Solang", "system that can execute LLVM-based languages
> > on the JVM":
> > >
> > >
> >
> https://www.researchgate.net/publication/309443492_Bringing_low-level_languages_to_the_JVM_efficient_execution_of_LLVM_IR_on_Truffle
> > > https://www.youtube.com/watch?v=YLtjkP9bD_U
> > >
> > > Yes, this is crazy. Ruby needs to call C extensions. To make this
> > efficient
> > > for Ruby running on the JVM, we'll just go compile the C into
> > something we
> > > can (somehow) inline. And it worked. *
> > >
> > > However, again, despite all this very smart work, TruffleRuby
> > doesn't seem
> > > to have taken over the world any more than JRuby did.
> > >
> > > Nor has PyPy taken over Python.
> > >
> > > Despite all these other smart folks working on problems similar
> > to ours (eg
> > > at least 7 on TruffleRuby), I don't see any obvious strategy to
> > steal.
> > >
> > >
> > >
> > > Sideways from this, I started to wonder how you might start to
> > untangle the
> > > "one interpreter" === "one execution context". The obvious start
> > is to try
> > > to avoid reference counting overhead where possible. Most objects
> > *aren't*
> > > visible to more than one thread. (In the degenerate case of one
> > thread,
> > > that's all objects - this feels like a good start.). Hence have
> > the concept
> > > of "local objects" and "shared objects" - it's just one flag bit.
> > You *are
> > > going to need a branch for every reference count twiddle, but it
> > might be
> > > faster, because mostly it will avoid atomic operations (or
> buffered
> > > reference counting, or whatever)
> > >
> > > Everything starts local - if an object becomes referenced by a
> shared
> > > object, that object must be promoted to "shared". It's the same
> > sort of idea
> > > as generational GC, and would also imply a write barrier on every
> > > assignment. That's a small CPU cost, but it's a new rule that XS
> > code on
> > > CPAN doesn't know that it needs to conform to...
> > >
> > > But, stashes are global. So because of this invariant, subs are
> > global, so
> > > pads are global, so all lexicals are global.
> > >
> > > Oh pants!
> > >
> > > Suddenly everything is global, without trying.
> > >
> > >
> > > And actually it's worse - pads aren't just global in this sense -
> > they are
> > > one of these *implicit* assumptions of "one is one" - they are an
> > array of
> > > arrays - the inner array is lexicals (and temporaries) for the
> > subroutine,
> > > the outer array permits one of these for each depth of recursion.
> > >
> > > There's no way to have two threads execute the same subroutine
> > > simultaneously, without some major refactoring.
> > >
> > > (signatures, @_, return values on the stack, sending return
> > values back from
> > > multiple threads to the master at the end - these are all
> > problems, but
> > > trivial in comparison to things like this)
> > >
> > > On Fri, Apr 09, 2021 at 12:53:27AM -0500, B. Estrade wrote:
> > >
> > >> For now, I want all of us to think "thread safety" and "SMP" in
> > everything
> > >> we do. Not in a design sense, in the unconsciousness sense; so it
> > imbues
> > >> itself in every aspect of any work and thinking done in the name
> of
> > >> perl/Perl.
> > >>
> > >> If you ask me right now; in 2024, I'd like to be able to do
> > something as
> > >> simple as this as a single OS process, but utilizing "light
> > weight" OS
> > >> threads (e.g., pthreads):
> > >
> > >> my $num_threads = 8;
> > >> map_r($num_threads) {
> > >> local $thread_id = tid();
> > >> local $bakery_ticket = atomic_var_update; # threads race
> > >> # threads do stuff
> > >> # .. and more stuff
> > >> # ... and maybe more stuff
> > >>
> > >> # busy wait
> > >> do {
> > >> local $ihazbeenserved = now_serving($bakery_ticket);
> > >> } while(not $ihazbeenserved);
> > >>
> > >> # "safely" print to STDOUT without clobbering (could use
> > regular
> > >> # printf and risk messages overlapping)
> > >> printf_r("Yay! I, thread %d, now has my %s", $tid,
> > $ihazbeenserved);
> > >> }
> > >
> > > This isn't something that we can solve by working on it for the
> > next few
> > > years - really it's something that could only be solved by
> > working on it 25
> > > years ago, and having a radically different implementation.
> > >
> > > Basically, I don't think that the current Perl 5 internals are
> > going to
> > > support CPU level parallelism in the same interpreter, any more
> than
> > > CPython ever will, or MRI.
> > >
> > > You *can* get to CPU level parallelism within "leaf" calls to C -
> > so likely
> > > Yuki Kimoto's suggestion of SPVM seems interesting - he pasted
> > the link to
> > >
> >
> https://github.com/yuki-kimoto/SPVM/tree/master/examples/native/openmp
> > >
> > > For anything more than that, it's going to take other internals.
> > For the
> > > syntax you're suggesting, the fastest route to seeing it run
> > might be to
> > > look at https://github.com/rakudo-p5/v5 which aimed to write a
> Perl 5
> > > parser using Rakudo. I'm aware that FROGGS stopped working on it
> > a few
> > > years ago - I don't know how complete it is, but I don't think
> > that there
> > > was any fundamental blocker.
> > >
> > > Nicholas Clark
> > >
> > > * There's a talk on it. You can watch the whole thing and just
> > s/ruby/perl/g
> > > I had no idea that the Ruby C "API" was just like the Perl C
> > "API".
> > >
> > > Spoilers start here:
> > > Of the 2.1 billion lines of code in RubyGems, .5 billion is C.
> > > A ruby extension in C can be 10x faster than MRI
> > > Their approach is 15x faster than MRI *without* inlining C,
> > 30x with.
> > > They actually implement much of the Ruby C ABI in Ruby.
> > > The Gem thinks that it's doing this:
> > > existing Gem written in C, compiled => C function in MRI
> > > when actually it's this:
> > > existing Gem unchanged, compiled to LLVM bytecode => their C
> > shim => Ruby
> > > where *all* that is being run inside the JVM.
> > >
> > > The single best slide is this:
> > > https://www.youtube.com/watch?v=YLtjkP9bD_U&t=266s
> > >
> >
>