Mailing List Archive

Confused about two development utils
Hello

I am developing a simple mp2 application.
I looked for the installation packages for the mp2 utilities, and found these two:

libapache2-mod-perl2

libapache2-mod-apreq2


What is the relation between them? Should I install both, or only the first one?


Thanks.
Re: Confused about two development utils [ In reply to ]
On 22.12.2020 06:49, Matthias Peng wrote:
> Hello
>
> I am developing a simple mp2 application.
> I looked for the installation for mp2 utils, and found this two:
>
> libapache2-mod-perl2
>
> libapache2-mod-apreq2
>
>
> what're their relations? Should I install both, or only the first one?
>

Hi.

They are different, independent packages and module libraries, and you can use one
or the other, or both, depending on your needs.
(We always install both, and we use both.)

For mod_perl per se, you need only the /libapache2-mod-perl2/ package.
This gives you access to all the stuff documented here :
http://perl.apache.org/docs/2.0/api/index.html

*except* what is at the very end of that page :

"Part VI: Related Modules" -> libapreq modules
(this is what is contained in the separate /libapache2-mod-apreq2/ package)
The documentation for libapreq is at :
http://httpd.apache.org/apreq/docs/libapreq2/modules.html

It may be a bit confusing at first, because both (independent) packages use some common
namespaces ("Apache2::" and "APR::"), and because mod_perl and libapreq2 each have
their own form of "Apache Request object", named very similarly:
- for mod_perl it is Apache2::RequestRec
- for libapreq it is Apache2::Request
(I guess that libapreq was there first; that's why it got the better name ;-)

I am a bit reluctant to try explaining the difference further (for fear of confusing you
further), but here is a very rough summary :

- to deal with 99% of what has to do with controlling what happens within Apache httpd in
terms of processing HTTP requests (or just to run your perl scripts faster), use the
mod_perl package.
So install /libapache2-mod-perl2/ first, and start coding.

- if you find out later that you have to do a lot of processing of CGI parameters (the
request "query string") or cookies, you can then install and use the libapreq API,
which (among other things) provides an alternative to what the CGI module provides
(see the small sketch below).
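
To make the distinction concrete, here is a minimal, untested sketch of a mod_perl
response handler which uses libapreq2's Apache2::Request object only for parameter
parsing; the package name and the parameter are invented for the example:

package My::ParamDemo;

use strict;
use warnings;

use Apache2::RequestRec ();        # mod_perl's request record ($r)
use Apache2::RequestIO  ();        # provides $r->print
use Apache2::Request    ();        # libapreq2's request object (parameter parsing)
use Apache2::Const -compile => qw(OK);

sub handler {
    my $r   = shift;                      # Apache2::RequestRec, from mod_perl
    my $req = Apache2::Request->new($r);  # Apache2::Request, from libapreq2

    # libapreq2 does the query-string / form parsing for us
    my $name = $req->param('name') || 'world';

    $r->content_type('text/plain');
    $r->print("Hello, $name\n");

    return Apache2::Const::OK;
}

1;

The handler itself runs under plain mod_perl ($r is the Apache2::RequestRec object);
only the $req->param() call comes from libapreq2, which is why you can start with
libapache2-mod-perl2 alone and add libapache2-mod-apreq2 later if you need it.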

In any case, there is a bit of a learning curve, but it is great fun and very powerful.
Re: Confused about two development utils [ In reply to ]
Can I assume mod_perl is an upgraded version of apreq? Thanks, André.


> [...]
Re: Confused about two development utils [ In reply to ]
I am a newbie to the mp development stack.
After one day of work, I have made a simple handler which returns the
client's address and its PTR record.
The demo:
https://myhostnames.com/

The code is shown below:

package MyHostname;

use strict;

use Net::DNS;
use Apache2::RequestRec ();
use Apache2::RequestIO ();
use Apache2::Connection ();
use APR::Table ();
use Apache2::Const -compile => qw(OK FORBIDDEN);

sub handler {
    my $r = shift;

    my $ip = $r->headers_in->{'CF-Connecting-IP'} ||
             $r->connection->client_ip;

    my $host = dns_query($ip) || "";

    $r->content_type('text/plain; charset=utf-8');
    $r->print("Your IP: $ip, Hostname: $host");

    return Apache2::Const::OK;
}

sub dns_query {
    my $ip = shift;

    my $resolver = Net::DNS::Resolver->new();
    my $reply    = $resolver->query($ip, 'PTR');

    if ($reply) {
        for my $rr ($reply->answer) {
            return $rr->rdstring;    # we need only one
        }
    }

    return;
}

1;
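
For reference, the handler is wired into Apache with something along these lines
(the library path and the location below are only placeholders):

# httpd.conf (illustrative)
PerlSwitches -I/path/to/lib        # directory containing MyHostname.pm
PerlModule MyHostname

<Location />
    SetHandler modperl
    PerlResponseHandler MyHostname
</Location>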



Could anyone give it a review? Thanks in advance.


Matthias




On Tue, Dec 22, 2020 at 1:49 PM Matthias Peng <pengmatthias@gmail.com>
wrote:

> Hello
>
> I am developing a simple mp2 application.
> I looked for the installation for mp2 utils, and found this two:
>
> libapache2-mod-perl2
>
> libapache2-mod-apreq2
>
>
> what're their relations? Should I install both, or only the first one?
>
>
> Thanks.
>
Re: Confused about two development utils [ In reply to ]
Replying to the DL.

On Tue, Dec 22, 2020 at 7:07 PM Mithun Bhattacharya <mithnb@gmail.com>
wrote:

> $r->connection->client_ip would report your proxy server if you have a
> reverse proxy setup - this is not a common use case though.
>
> DNS lookup would usually be an expensive process and you are supposed to
> be nice to other services so cache it for the TTL of the PTR record.
>
> On Tue, Dec 22, 2020 at 6:44 PM Matthias Peng <pengmatthias@gmail.com>
> wrote:
>
>> [...]
Re: Confused about two development utils [ In reply to ]
Thanks Mithun.
1. Since the request is passed through Cloudflare, a CF- header is needed
to fetch the client's real IP.
2. Since I am querying the PTR record via a public recursive resolver (such as 8.8.8.8),
I guess that DNS server has already cached the result, right?

Regards.


On Wed, Dec 23, 2020 at 9:08 AM Mithun Bhattacharya <mithnb@gmail.com>
wrote:

> [...]
Re: Confused about two development utils [ In reply to ]
8.8.8.8 is Google's public DNS server - yes, they can handle whatever you
throw at them, but you shouldn't misuse it. The whole point of TTL in DNS is
suggested caching - you are welcome to ignore it, but you are also being
rude to others.

$r->connection->client_ip is the IP your Apache server sees you coming
through, which I assume is Cloudflare - I have no idea how to see the real
IP behind Cloudflare.
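
To follow the TTL advice, one option is a small per-child cache keyed by IP which
honours the TTL of the PTR record - a rough, untested sketch (the helper name and
the fallback TTL are made up):

use Net::DNS ();

my %PTR_CACHE;    # per-child cache: ip => [ hostname, expiry time ]

sub cached_ptr {
    my ($ip) = @_;

    my $entry = $PTR_CACHE{$ip};
    return $entry->[0] if $entry && $entry->[1] > time();

    my $resolver = Net::DNS::Resolver->new();
    my $reply    = $resolver->query($ip, 'PTR');

    my ($host, $ttl) = ('', 300);          # arbitrary 5-minute fallback on failure
    if ($reply) {
        my ($rr) = $reply->answer;         # the first answer is enough here
        $host = $rr->rdstring;
        $ttl  = $rr->ttl;                  # honour the record's own TTL
    }

    $PTR_CACHE{$ip} = [ $host, time() + $ttl ];
    return $host;
}

A plain hash like this lives and dies with each Apache child; something shared such as
Cache::FastMmap or memcached would be needed if all children are to benefit from the
same cache.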

On Tue, Dec 22, 2020 at 7:19 PM Matthias Peng <pengmatthias@gmail.com>
wrote:

> [...]
Re: Confused about two development utils [ In reply to ]
On 22.12.2020 14:20, Matthias Peng wrote:
> Can I guess mod_perl is the upgraded version of apreq? Thanks Andre.

Not really. They are two different things.

The essence of mod_perl, is to embed a perl interpreter in Apache httpd.
This costs memory, and all the more since many perl modules are not thread-safe, so if you
use them in your code, at this moment the only safe way to do it is to use the Apache
httpd prefork model. This means that each Apache httpd child process has its own copy of
the perl interpreter, which means that the memory used by this embedded perl interpreter
has to be counted n times (as many times as there are Apache httpd child processes running
at any one time).
This can be significant, but it is also relative: if you compare the memory used in
practice by an Apache httpd with mod_perl to other possible solutions for web
applications (e.g. a java-based webserver, or back-end solutions running their own
interpreter), you will see that this "bad side of mod_perl" is really not as bad as many
perl and mod_perl detractors would want you to believe.
There are no miracles: if you want to do many complex things in parallel and do them
fast, you are going to use memory and CPU time, no matter which techniques or tools you use.
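
(For what it is worth, how many times that per-interpreter memory is counted is bounded
by the prefork MPM settings; the values below are purely illustrative:)

# mpm_prefork tuning - example values only
StartServers             5
MinSpareServers          5
MaxSpareServers         10
MaxRequestWorkers       50       # hard cap on simultaneous children (and perl interpreters)
MaxConnectionsPerChild  10000    # recycle each child after this many connections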

Once you have admitted the above, having a perl interpreter embedded in Apache httpd
through mod_perl leads to 2 major things :

1) anything written in perl and used by your web application can run *much* faster (with
the right setup), because it is compiled by the perl interpreter the first time it is
run and then cached in an intermediate form, which runs, say, 100 times faster on any
subsequent run (as long as it happens within the same Apache child). So, *if you know
perl and like perl as a programming language, and you are - as I am - amazed by the
incredible scope, quality and documentation of the perl CPAN library*, that alone makes
it worth having a good look at mod_perl.
But this is only one aspect of mod_perl.

2) the most important aspect (in my view), is that mod_perl allows you to really intervene
and "do things", in perl, inside the logic of Apache httpd itself, at just about any step
of the processing of a HTTP request by Apache. Not everyone is interested in doing this,
but if you find that your applications could benefit from being able to inspect and/or
modify the way in which Apache httpd is processing HTTP requests, there is not really any
tool that compares to mod_perl in that respect.
In fact, what mod_perl really allows you to do, is to turn things around : instead of
being this little perl toolbox that is added to httpd, it allows you to use Apache httpd -
with all its finely tuned code and extensions - *as a toolbox* to do what you want in your
perl application. That is the real power of mod_perl. If mod_perl did not exist, the only
real alternative for doing that kind of thing would be to write your own Apache add-on
modules in C (which would mean forgetting about CPAN and having to look elsewhere for
anything else you need).
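
As a (purely invented) illustration of that second point, here is the kind of thing a
mod_perl handler can do at an early phase of request processing - a PerlAccessHandler
which refuses requests carrying a particular header; the module name and the header are
made up for the example:

package My::BlockByHeader;

use strict;
use warnings;

use Apache2::RequestRec ();
use APR::Table ();
use Apache2::Const -compile => qw(OK FORBIDDEN);

# Runs during the access-control phase, before any content handler.
sub handler {
    my $r = shift;

    # Refuse the request if the client sent the (made-up) marker header.
    return Apache2::Const::FORBIDDEN
        if $r->headers_in->{'X-Blocked-Client'};

    return Apache2::Const::OK;
}

1;

It would be enabled with a single line in the httpd configuration, e.g.
"PerlAccessHandler My::BlockByHeader" inside the relevant <Location> or <Directory> block.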

In comparison to mod_perl, the apreq library is more limited and more focused. Its main
purpose, in my view, is to provide a more efficient alternative to the perl CGI module
for processing cgi-bin script arguments and cookies (and maybe some other things
which I admit I've never really looked at).

I guess that by now you know that I am really a perl and mod_perl fan.

As far as perl is concerned, as a programming language : there are people who like it, and
others who don't, and it is quite pointless to try to convert the ones into the others.
But if what you want is a programming tool which allows you to do many different things
quickly - even things of which you initially know very little about - there is still no
real alternative to the conjunction of perl and the CPAN library.



> [...]
RE: Confused about two development utils [EXT] [ In reply to ]
> This costs memory, and all the more since many perl modules are not thread-safe, so if you use them in your code, at this moment the only safe way to do it is to use the Apache httpd prefork model. This means that each Apache httpd child process has its own copy of the perl interpreter, which means that the memory used by this embedded perl interpreter has to be counted n times (as many times as there are Apache httpd child processes running at any one time).

This isn't quite true - if you load modules before the process forks, the children can share the same parts of memory (copy-on-write). It is useful to be able to "pre-load" core functionality which is used across all functions {this is the case on Linux anyway}. It also speeds up child process creation, as the modules are already in memory and already compiled.

One of the great advantages of mod_perl is Apache2::SizeLimit, which can blow away large child processes - and then, if needed, create new ones. This is not the case with some of the FCGI solutions: the individual processes can grow if there is a memory leak or a request that retrieves a large amount of content (even if not served), but perl can't give the memory back. So FCGI processes only get bigger and bigger and eventually blow up memory (or hit swap first).
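
A rough sketch of how those two ideas usually fit together - preloading in a startup file which runs in the parent before the fork, plus Apache2::SizeLimit to recycle bloated children (the module list, paths and size threshold are only examples):

# startup.pl - pulled in with "PerlRequire /etc/apache2/startup.pl"
# Everything loaded here is compiled once in the parent, so the memory
# pages holding that code can be shared (copy-on-write) by all children.
use strict;
use warnings;

use DBI ();               # example heavy modules to preload
use Net::DNS ();
use MyApp::Handler ();    # hypothetical application module

use Apache2::SizeLimit ();
# Recycle a child (after the current request) once it exceeds ~300 MB.
Apache2::SizeLimit->set_max_process_size(300_000);   # value is in KB

1;

# httpd.conf:
#   PerlRequire        /etc/apache2/startup.pl
#   PerlCleanupHandler Apache2::SizeLimit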





--
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
RE: Confused about two development utils [EXT] [ In reply to ]
Forgot to add - so our FCGI servers need a lot (and I mean a lot) more memory than the mod_perl servers to serve the same level of content (just in case memory blows up with the FCGI backends).

-----Original Message-----
From: James Smith <js5@sanger.ac.uk>
Sent: 23 December 2020 11:34
To: André Warnier (tomcat/perl) <aw@ice-sa.com>; modperl@perl.apache.org
Subject: RE: Confused about two development utils [EXT]


[...]



--
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Re: Confused about two development utils [EXT] [ In reply to ]
Today memory is not a serious problem; each of our servers has 64GB of memory.


> [...]
Re: Confused about two development utils [EXT] [ In reply to ]
unsubscribe.

On Wed, Dec 23, 2020 at 5:05 PM James Smith <js5@sanger.ac.uk> wrote:

> [...]
RE: Confused about two development utils [EXT] [ In reply to ]
Oh, but memory is a problem – just not if you only have a small cluster of machines!

Our boxes are larger than that – but they all run virtual machines {only a small proportion web related} – machines/memory would rapidly become a limiting factor in our data centre - we run VMware [995 hosts] and OpenStack [10,000s of hosts] + a selection of large-memory machines {measured in TBs of memory per machine}.

We would be looking at somewhere between 0.5 PB and 1 PB of memory – and it is not just the price of buying that amount of memory - for many machines we need the fastest memory money can buy for the workload, and we would need a lot more CPUs than we currently have, as we would need a larger number of machines to have 64GB virtual machines {we would get 2 VMs per host}. We currently have approx. 1-2000 CPUs running our hardware (last time I had a figure) – it would probably need to go to approximately 5-10,000!
It is not just the initial outlay but the environmental and financial cost of running that number of machines, and finding space to run them without putting the cooling costs through the roof!! That is without considering what additional constraints having the extra machines may put on storage (at the last count a year ago we had over 30 PBytes of storage on site – and a large amount of offsite backup).

We would also stretch the amount of power we can get from the national grid to power it all - we currently have 3 feeds from different parts of the national grid (we are fortunately in a position where this is possible) and the dedicated link we would need to add more power would be at least 50 miles long!

So - managing cores/memory is vitally important to us – moving to the cloud is an option we are looking at – but that is more than 4 times the price of our onsite set-up (with substantial discounts from AWS) and would require an upgrade of our existing link to the internet – which is currently 40 Gbit/s (I think).

Currently we are analysing very large amounts of data directly linked to the current major world problem – this is why the UK is currently being isolated: we have discovered and can track a new strain, in near real time – other countries have no ability to do this – in a day we can and do handle, sequence and analyse more samples than the whole of France has sequenced since February. We probably don't have more of the new variant strain than other areas of the world – it is just that we know we have it, because of the amount of sequencing and analysis that we in the UK have done.

From: Matthias Peng <pengmatthias@gmail.com>
Sent: 23 December 2020 12:02
To: mod_perl list <modperl@perl.apache.org>
Subject: Re: Confused about two development utils [EXT]

Today memory is not serious problem, each of our server has 64GB memory.


[...]



--
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Re: Confused about two development utils [EXT] [ In reply to ]
James, would you be able to share more info about your setup?
1. What exactly is your application doing that requires so much memory and
CPU - is it something like gene splicing (no, I don't know much about it
beyond Jurassic Park :D )?
2. Do you feel Perl was the best choice for whatever you are doing, and if
yes, then why? How much of your stuff is using mod_perl, considering you
mentioned not much is web related?
3. What are the challenges you are currently facing with your
implementation?

On Wed, Dec 23, 2020 at 6:58 AM James Smith <js5@sanger.ac.uk> wrote:

> [...]
RE: Confused about two development utils [EXT] [ In reply to ]
We don't use perl for everything. Yes, we use it for web data, and yes, we still use it as the glue language in a lot of cases, but the most complex stuff is done with C (not even C++, as that is too slow). Others on site use Python, Java, Rust, Go and PHP, and we are also looking at using GPUs in cases where code can be highly parallelised.

It is not just one application – but many, many applications… All with a common goal of understanding the human genome, and using it to assist in developing new understanding and techniques which can advance health care.

We are a very large sequencing centre (one of the largest in the world) – what I was pointing out is that you can't just throw memory, CPUs and power at a problem – you have to think "how can I do what I need to do with the least resources?" rather than "what resources can I throw at the problem?".

Currently we are acting as the central repository for all COVID-19 sequencing in the UK, along with running one of the largest "wet" labs sequencing data for it – and that is half the sequenced samples in the whole world. The UK is sequencing more COVID-19 genomes a day than most other countries have sequenced since the start of the pandemic in Feb/Mar. This has led to us discovering a new, more transmissible version of the virus, and to knowing in which parts of the country the different strains are present – no other country in the world has the information, technology or infrastructure in place to achieve this.

But this is just a small part of the genomic sequencing we are looking at – we work on:
* other pathogens – e.g. Plasmodium (Malaria);
* cancer genomes (and how effective drugs are);
* the Human Cell Atlas (of which we are a major part), which is looking at how the expression of genes (in the simplest terms, which ones are switched on and switched off) differs between tissues;
* sequencing the genomes of other animals to understand their evolution;
* and looking at some other species in detail, to see what we can learn from them when they have defective genes;

Although all of these are currently scaled back so that we can work relentlessly to support the medical teams and other researchers in getting on top of COVID-19.

What is interesting is that many of the developers we have on campus (well, all WFH at the moment) are (relatively) old, as we learnt to develop code on machines with limited CPU and limited memory – so things had to be efficient, had to be compact…. And that is as important now as it was 20 or 30 years ago – the data we handle is growing faster than Moore's Law! Many of us take pride in doing things as efficiently as possible.

It took around 10 years to sequence and assemble the first human genome {well we are still tinkering with it and filling in the gaps} – now at the institute we can sequence and assemble around 400 human genomes in a day – to the same quality!

So most of our issues are due to the scale of the problems we face – e.g. the human genome has 3 billion base-pairs (A, C, G, Ts), so normal solutions don't scale to that. (Once, many years ago, we looked at setting up an Oracle database where there was at least 1 row for every base pair – recording all variants (think of them as spelling mistakes, for example a T rather than an A, or an extra letter inserted or deleted) for that base pair… The schema was set up – and then they realised it would take 12 months to load the data which we had then, which is probably less than a millionth of what we have now!)

Moving compute off site is difficult, as transferring the volume of data we have would be a problem – you can't easily move all the data to the compute, so you have to bring the compute to the data.

The site I worked on before I became a more general developer was doing just that – and the code that was written 12-15 years ago is actually still going strong. It has seen a few changes over the years – many displays have had to be redeveloped as the scale of the data has got so big that even the summary pages we produced 10 years ago now have to be summarised themselves, because they are so large.


From: Mithun Bhattacharya <mithnb@gmail.com>
Sent: 24 December 2020 00:06
To: mod_perl list <modperl@perl.apache.org>
Subject: Re: Confused about two development utils [EXT]

James would you be able to share more info about your setup ?
1. What exactly is your application doing which requires so much memory and CPU - is it something like gene splicing (no i don't know much about it beyond Jurassic Park :D )
2. Do you feel Perl was the best choice for whatever you are doing and if yes then why ? How much of your stuff is using mod_perl considering you mentioned not much is web related ?
3. What are the challenges you are currently facing with your implementation ?

[...]



--
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Re: Confused about two development utils [EXT] [ In reply to ]
Hello James.
Bravo and many thanks for this excellent overview of your activities. Of course the setup
(in your previous message) and the activities are very impressive by themselves.
But in addition, even though your message is not in itself a perl advocacy message, I feel
that it would have its rightful place in some perl/mod_perl advocacy forum, because it
touches on some general ideas which are valid /also/ for perl and mod_perl.
It was very refreshing to read for once a clear exposé of why it is still important
nowadays to think before programming, to program efficiently, and to choose the right tool
for the job at hand (be it perl, mod_perl, or any other) without the kind of off-the-cuff
a-priori generalisations which tend to plague these discussions.

And even though our own (commercial) activities and setups do not have anything even close
to the scope which you describe, I would like to say that the same basic principles which
you mention in your exposé are just as valid when you scale down as when you scale up.
("--you can’t just throw memory, CPUs, power at a problem – you have to
think – how can I do what I need to do with the least resources..")
Even when you think of a single server, or a single server rack, at any one period in time
there is always a practical limit as to how much memory or CPUs you can fit in a given
server, or how many servers you can fit in a rack, or how many additional Gb of bandwidth
you can allocate per server, beyond which there is a sudden "quantum jump" as to how
practical and cost-effective a whole project becomes.
In that sense, I particularly enjoyed your examples of the database and of the additional
power line.


On 24.12.2020 02:38, James Smith wrote:
> We don’t use perl for everything, yes we use it for web data, yes we still use it as the
> glue language in a lot of cases, the most complex stuff is done with C (not even C++ as
> that is too slow). Others on site use Python, Java, Rust, Go, PHP, along with looking at
> using GPUs in cases where code can be highly parallelised
>
> It is not just one application – but many, many applications… All with a common goal of
> understanding the human genome, and using it to assist in developing new understanding and
> techniques which can advance health care.
>
> We are a very large sequencing centre (one of the largest in the world) – what I was
> pointing out is that you can’t just throw memory, CPUs, power at a problem – you have to
> think – how can I do what I need to do with the least resources. Rather than what
> resources can I throw at the problem.
>
> Currently we are acting as the central repository for all COVID-19 sequencing in the UK,
> along with one of the largest “wet” labs sequencing data for it – and that is half the
> sequenced samples in the whole world. The UK is sequencing more COVID-19 genomes a day
> than most other countries have sequenced since the start of the pandemic in Feb/Mar. This
> has lead to us discovering a new more transmissible version of the virus, and it what part
> of the country the different strains are present – no other country in the world has the
> information, technology or infrastructure in place to achieve this.
>
> But this is just a small part of the genomic sequencing we are looking at – we work on:
> * other pathogens – e.g. Plasmodium (Malaria);
> * cancer genomes (and how effective drugs are);
> * are a major part of the Human Cell Atlas which is looking at how the expression of genes
> (in the simplest terms which ones are switched on and switched off) are different in
> different tissues;
> * sequencing the genomes of other animals to understand their evolution;
> * and looking at some other species in detail, to see what we can learn from them when
> they have defective genes;
>
> Although all these are currently scaled back so that we can work relentlessly to support
> the medical teams and other researchers get on top of COVID-19.
>
> What is interesting is that many of the developers we have on campus (well all wfh at the
> moment) are all (relatively) old as we learnt to develop code on machines with limited CPU
> and limited memory – so that things had to be efficient, had to be compact…. And that is
> as important now as it was 20 or 30 years ago – the data we handle is going up faster than
> Moore’s Law! Many of us have pride in doing things as efficiently as possible.
>
> It took around 10 years to sequence and assemble the first human genome {well we are still
> tinkering with it and filling in the gaps} – now at the institute we can sequence and
> assemble around 400 human genomes in a day – to the same quality!
>
> So most of our issues are due to the scale of the problems we face – e.g. the human genome
> has 3 billion base-pairs (A, C, G, Ts), so normal solutions don’t scale to that. Once,
> many years ago, we looked at setting up an Oracle database where there was at least 1 row
> for every base pair – recording all variants (think of them as spelling mistakes, for
> example a T rather than an A, or an extra letter inserted or deleted) for that base pair…
> The schema was set up – and then they realised it would take 12 months to load the data
> which we had then (which is probably less than a millionth of what we have now)!
>
> Moving compute off site is a problem, as transferring the volume of data we have would be
> impractical – you can’t easily move all the data to the compute – so you have to bring
> the compute to the data.
>
> The site I worked on before I became a more general developer was doing that – and the
> code that was written 12-15 years ago is actually still going strong – it has seen a few
> changes over the years – many displays have had to be redeveloped as the scale of the data
> has got so big that even the summary pages we produced 10 years ago have to be summarised
> because they are so large.
>
> *From:*Mithun Bhattacharya <mithnb@gmail.com>
> *Sent:* 24 December 2020 00:06
> *To:* mod_perl list <modperl@perl.apache.org>
> *Subject:* Re: Confused about two development utils [EXT]
>
> James would you be able to share more info about your setup ?
>
> 1. What exactly is your application doing which requires so much memory and CPU - is it
> something like gene splicing (no, I don't know much about it beyond Jurassic Park :D )
>
> 2. Do you feel Perl was the best choice for whatever you are doing and if yes then why ?
> How much of your stuff is using mod_perl considering you mentioned not much is web related ?
>
> 3. What are the challenges you are currently facing with your implementation ?
>
> On Wed, Dec 23, 2020 at 6:58 AM James Smith <js5@sanger.ac.uk <mailto:js5@sanger.ac.uk>>
> wrote:
>
> Oh but memory is a problem – but not if you have just a small cluster of machines!
>
> Our boxes are larger than that – but they all run virtual machines {only a small
> proportion web related} – machines/memory would rapidly become an issue in our data centre -
> we run VMware [995 hosts] and OpenStack [10,000s of hosts] + a selection of large-memory
> machines {measured in TBs of memory per machine}.
>
> We would be looking at somewhere between 0.5 PB and 1 PB of memory – not just the
> price of buying that amount of memory - for many machines we need the fastest memory
> money can buy for the workload, but we would need a lot more CPUs than we currently
> have, as we would need a larger number of machines to have 64GB virtual machines {we
> would get 2 VMs per host}. We currently have approx. 1-2000 CPUs running our hardware
> (last time I had a figure) – it would probably need to go to approximately 5-10,000!
> It is not just the initial outlay but the environmental and financial cost of running
> that number of machines, and finding space to run them without putting the cooling
> costs through the roof!! That is without considering what additional constraints on
> storage having the extra machines may have (at the last count a year ago we had over
> 30 PBytes of storage on site – and a large amount of offsite backup).
>
> We would also stretch the amount of power we can get from the national grid to power
> it all - we currently have 3 feeds from different parts of the national grid (we are
> fortunately in a position where this is possible) and the dedicated link we would need
> to add more power would be at least 50 miles long!
>
> So - managing cores/memory is vitally important to us – moving to the cloud is an
> option we are looking at – but that is more than 4 times the price of our onsite
> set-up (with substantial discounts from AWS) and would require an upgrade of our
> existing link to the internet – which is currently 40Gbit of data (I think).
>
> Currently we are analysing a very large amount of data directly linked to the current
> major world problem – this is why the UK is currently being isolated, as we have
> discovered and can track a new strain in near real time – other countries have no
> ability to do this – in a day we can and do handle, sequence and analyse more samples
> than the whole of France has sequenced since February. We probably don’t have more of
> the new variant strain than in other areas of the world – it is just that we know we
> have it because of the amount of sequencing and analysis that we in the UK have done.
>
> *From:*Matthias Peng <pengmatthias@gmail.com <mailto:pengmatthias@gmail.com>>
> *Sent:* 23 December 2020 12:02
> *To:* mod_perl list <modperl@perl.apache.org <mailto:modperl@perl.apache.org>>
> *Subject:* Re: Confused about two development utils [EXT]
>
> Today memory is not a serious problem; each of our servers has 64GB of memory.
>
>
> Forgot to add - so our FCGI servers need a lot (and I mean a lot) more memory than
> the mod_perl servers to serve the same level of content (just in case memory blows
> up with FCGI backends)
>
> -----Original Message-----
> From: James Smith <js5@sanger.ac.uk <mailto:js5@sanger.ac.uk>>
> Sent: 23 December 2020 11:34
> To: André Warnier (tomcat/perl) <aw@ice-sa.com <mailto:aw@ice-sa.com>>;
> modperl@perl.apache.org <mailto:modperl@perl.apache.org>
> Subject: RE: Confused about two development utils [EXT]
>
>
> > This costs memory, and all the more since many perl modules are not
> thread-safe, so if you use them in your code, at this moment the only safe way to
> do it is to use the Apache httpd prefork model. This means that each Apache httpd
> child process has its own copy of the perl interpreter, which means that the
> memory used by this embedded perl interpreter has to be counted n times (as many
> times as there are Apache httpd child processes running at any one time).
>
> This isn’t quite true - if you load modules before the process forks then they can
> cleverly share the same parts of memory. It is useful to be able to "pre-load"
> core functionality which is used across all functions {this is the case in Linux
> anyway}. It also speeds up child process generation as the modules are already in
> memory and converted to byte code.
>
> One of the great advantages of mod_perl is Apache2::SizeLimit, which can blow away
> large child processes - and then, if needed, create new ones. This is not the case
> with some of the FCGI solutions, as the individual processes can grow if there is a
> memory leak or a request that retrieves a large amount of content (even if not
> served), but Perl can't give the memory back. So FCGI processes only get bigger
> and bigger and eventually blow up memory (or hit swap first).
>
>
>
>
>
>
> -- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity
> registered in England with number 1021457 and a company registered in England with number
> 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
Re: Confused about two development utils [EXT] [ In reply to ]
unsubscribe.

On Fri, Dec 25, 2020 at 10:30 PM André Warnier (tomcat/perl) <aw@ice-sa.com>
wrote:

Re: Confused about two development utils [EXT] [ In reply to ]
If I have been using mod_perl for development, does that mean I drive a
Maserati? :)


> unsubscribe.
>
> On Fri, Dec 25, 2020 at 10:30 PM André Warnier (tomcat/perl) <
> aw@ice-sa.com> wrote:
>
Re: Confused about two development utils [EXT] [ In reply to ]
>
> This isn’t quite true - if you load modules before the process forks then they can cleverly share the same parts of memory. It is useful to be able to "pre-load" core functionality which is used across all functions {this is the case in Linux anyway}. It also speeds up child process generation as the modules are already in memory and converted to byte code.
>
> One of the great advantages of mod_perl is Apache2::SizeLimit, which can blow away large child processes - and then, if needed, create new ones. This is not the case with some of the FCGI solutions, as the individual processes can grow if there is a memory leak or a request that retrieves a large amount of content (even if not served), but Perl can't give the memory back. So FCGI processes only get bigger and bigger and eventually blow up memory (or hit swap first).
>


The OS shares memory across any kind of fork (copy-on-write). And FWIW,
in GNU/Linux the difference between processes and threads is nominal:
everything is a lightweight process.

This is an OK discussion of it:
https://www.thegeekstuff.com/2013/11/linux-process-and-threads/

BTW - there is memory caching throughout the system: in the hard drive, the
filesystem, the OS, and in Perl.

It would be NICE if the mod_perl code was FIXED so that it would
compile with preload correctly out of the box.
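To make the pre-loading being discussed concrete: it is typically done from a startup file
pulled in once by the Apache parent process (for example via the PerlPostConfigRequire
directive), so the modules are compiled before any children are forked and their pages are
shared copy-on-write. A minimal sketch - the file path and the CPAN module list below are
only examples, not anyone's actual configuration:

  # startup.pl - loaded once in the parent, e.g. in httpd.conf:
  #   PerlPostConfigRequire /etc/apache2/startup.pl
  use strict;
  use warnings;

  # Core mod_perl pieces most handlers end up needing.
  use Apache2::RequestRec ();
  use Apache2::RequestUtil ();
  use Apache2::Const -compile => qw(OK);

  # Heavy CPAN modules worth compiling once in the parent
  # (example list - substitute whatever your handlers actually use).
  use DBI ();
  use DateTime ();

  1;

Children forked after this point start with those modules already compiled, which is the
memory sharing and faster child startup described above.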



--
So many immigrant groups have swept through our town
that Brooklyn, like Atlantis, reaches mythological
proportions in the mind of the world - RI Safir 1998
http://www.mrbrklyn.com

DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002
http://www.nylxs.com - Leadership Development in Free Software
http://www2.mrbrklyn.com/resources - Unpublished Archive
http://www.coinhangout.com - coins!
http://www.brooklyn-living.com

Being so tracked is for FARM ANIMALS and extermination camps,
but incompatible with living as a free human being. -RI Safir 2013
Re: Confused about two development utils [EXT] [ In reply to ]
Preload of what doesn't work?

On Fri, Jan 1, 2021 at 10:08 AM Ruben Safir <ruben@mrbrklyn.com> wrote:

Re: Confused about two development utils [EXT] [ In reply to ]
> On 23 Dec 2020, at 11:36, James Smith <js5@sanger.ac.uk> wrote:
>
> Forgot to add - so our FCGI servers need a lot (and I mean a lot) more memory than the mod_perl servers to serve the same level of content (just in case memory blows up with FCGI backends)
>

I don’t believe this has to be true; I am pretty sure the FCGI servers can also be modified to manage memory as tightly as Apache2::SizeLimit does.
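Something along these lines should do it - a rough, untested sketch of an FCGI worker that
recycles itself once its resident size passes a threshold, much as SizeLimit does for
mod_perl children. The 200 MB limit and the helper name are placeholders, and the RSS check
is Linux-only since it reads /proc:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use FCGI;

  my $MAX_RSS_KB = 200 * 1024;   # placeholder limit: recycle above ~200 MB

  # Linux-specific helper: current resident set size in kB from /proc.
  sub current_rss_kb {
      open my $fh, '<', '/proc/self/status' or return 0;
      while (my $line = <$fh>) {
          return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
      }
      return 0;
  }

  my $request = FCGI::Request();
  while ($request->Accept() >= 0) {
      # ... normal request handling goes here ...
      print "Content-Type: text/plain\r\n\r\nok\n";

      # Exit cleanly after the response if we have grown too large;
      # the FCGI process manager then starts a fresh worker.
      last if current_rss_kb() > $MAX_RSS_KB;
  }

The process manager (mod_fcgid, spawn-fcgi, etc.) has to be configured to respawn workers
for this to behave like SizeLimit's kill-and-replace cycle.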

- Mark