Mailing List Archive

Output filters, data encoding
Hi.
I'm writing a new PerlOutputFilter, stream version.
I have written several working ones before, so I know the general scheme.
But in this latest filter, I have a problem with the data encoding, which I did not
encounter previously.
I did not find an answer in the on-line mod_perl documentation, which seems silent about
any encoding of the data that one might read in a filter.
(I also tried the WWW, without any more success).

To read the response data coming from upstream, I use the standard $f->read(), as in

while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
..

and then I need to pass this data to another module for processing (Template::Toolkit).
To make a long story short, Template::Toolkit misinterprets the data I'm sending to it,
because this data /is/ actually UTF-8, but apparently not marked so internally by the
$f->read(). So TT2 re-encodes it, leading to double UTF-8 encoding.
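
A minimal sketch of the double encoding described above, using only the core Encode module. The byte string is a made-up stand-in for what $f->read() delivers; it is not taken from the actual filter.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for "München", as raw octets (assumed sample input).
my $bytes = "M\xC3\xBCnchen";

# Encoding the undecoded octets treats \xC3 and \xBC as two separate
# characters, so each one is encoded again: this is the double encoding.
my $double = encode('UTF-8', $bytes);

# Decoding first gives a character string; encoding that string
# reproduces the original octets.
my $chars  = decode('UTF-8', $bytes);
my $single = encode('UTF-8', $chars);

printf "%d bytes in, %d bytes double-encoded, %d bytes round-tripped\n",
    length $bytes, length $double, length $single;
```

Whether Template::Toolkit does exactly this internally is an assumption; the sketch only shows why octets that were never decoded end up encoded twice.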

My question is : can I - and how -, set the filehandle that corresponds to the $f->read(),
to a UTF-8 layer ?
I have tried

line 155: binmode($f,'encoding:(UTF-8)');

and that triggers an error :
Not a GLOB reference at (my filter) line 155.\n
)

Or do I need to read the data 'as is', and separately do an

$decoded_buffer = decode('UTF-8', $buffer);

?

Note 1 : I can of course do the above decode(), but it seems more elegant to just insert
the I/O layer.

Note 2 : I know that the pre-filter data is UTF-8 (or at least I choose to believe that it
is), because I check $f->r->content_type(), which returns "text/html;charset=UTF-8".
Also, the filter otherwise works as expected, and I get the expected output in the
browser, except that e.g. "München" becomes "MÃ¼nchen".

Thanks in advance for any tips
André
Re: Output filters, data encoding
On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> My question is : can I - and how -, set the filehandle that corresponds to
> the $f->read(), to a UTF-8 layer ?
> I have tried
>
> line 155: binmode($f,'encoding:(UTF-8)');

Hi André! When specifying a PerlIO layer for a filehandle, you need to
put a colon before the layer name. So the correct binmode call is:

binmode($f, ':encoding(UTF-8)');

> and that triggers an error :
> Not a GLOB reference at (my filter) line 155.\n
> )
Re: Output filters, data encoding
-=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-
> while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
> ..
>
> and then I need to pass this data to another module for processing (Template::Toolkit).
> To make a long story short, Template::Toolkit misinterprets the data I'm
> sending to it, because this data /is/ actually UTF-8, but apparently not
> marked so internally by the $f->read(). So TT2 re-encodes it, leading to
> double UTF-8 encoding.
>
> My question is : can I - and how -, set the filehandle that corresponds to
> the $f->read(), to a UTF-8 layer ?
> I have tried
>
> line 155: binmode($f,'encoding:(UTF-8)');
>
> and that triggers an error :
> Not a GLOB reference at (my filter) line 155.\n
> )
>
> Or do I need to read the data 'as is', and separately do an
>
> $decoded_buffer = decode('UTF-8', $buffer);

There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:

If CHECK is set to "Encode::FB_QUIET", encoding and decoding
immediately return the portion of the data that has been processed so
far when an error occurs. The data argument is overwritten with
everything after that point; that is, the unprocessed portion of the
data. This is handy when you have to call "decode" repeatedly in the
case where your source data may contain partial multi-byte character
sequences, (that is, you are reading with a fixed-width buffer). Here's
some sample code to do exactly that:

my($buffer, $string) = ("", "");
while (read($fh, $buffer, 256, length($buffer))) {
$string .= decode($encoding, $buffer, Encode::FB_QUIET);
# $buffer now contains the unprocessed partial character
}

Looks exactly like your case.


-- Damyan
Re: Output filters, data encoding
On 13.11.2019 19:17, pali@cpan.org wrote:
> On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
>> My question is : can I - and how -, set the filehandle that corresponds to
>> the $f->read(), to a UTF-8 layer ?
>> I have tried
>>
>> line 155: binmode($f,'encoding:(UTF-8)');
>
> Hi André! When specifying PerlIO layer for file handle, you need to
> write colon character before layer name. So correct binmode call is:
>
> binmode($f, ':encoding(UTF-8)');
>
>> and that triggers an error :
>> Not a GLOB reference at (my filter) line 155.\n
>> )

Thanks. Ooops, that was a typo (also in my filter, not only in the list message).
But correcting it, does not change the GLOB error message.
Re: Output filters, data encoding
On Wednesday 13 November 2019 19:52:25 André Warnier (tomcat/perl) wrote:
> On 13.11.2019 19:17, pali@cpan.org wrote:
> > On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> > > My question is : can I - and how -, set the filehandle that corresponds to
> > > the $f->read(), to a UTF-8 layer ?
> > > I have tried
> > >
> > > line 155: binmode($f,'encoding:(UTF-8)');
> >
> > Hi André! When specifying PerlIO layer for file handle, you need to
> > write colon character before layer name. So correct binmode call is:
> >
> > binmode($f, ':encoding(UTF-8)');
> >
> > > and that triggers an error :
> > > Not a GLOB reference at (my filter) line 155.\n
> > > )
>
> Thanks. Ooops, that was a typo (also in my filter, not only in the list message).
> But correcting it, does not change the GLOB error message.

OK. What is $f? Is it an object, or what kind of scalar?
Re: Output filters, data encoding
On 13.11.2019 19:53, pali@cpan.org wrote:
> On Wednesday 13 November 2019 19:52:25 André Warnier (tomcat/perl) wrote:
>> On 13.11.2019 19:17, pali@cpan.org wrote:
>>> On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
>>>> My question is : can I - and how -, set the filehandle that corresponds to
>>>> the $f->read(), to a UTF-8 layer ?
>>>> I have tried
>>>>
>>>> line 155: binmode($f,'encoding:(UTF-8)');
>>>
>>> Hi André! When specifying PerlIO layer for file handle, you need to
>>> write colon character before layer name. So correct binmode call is:
>>>
>>> binmode($f, ':encoding(UTF-8)');
>>>
>>>> and that triggers an error :
>>>> Not a GLOB reference at (my filter) line 155.\n
>>>> )
>>
>> Thanks. Ooops, that was a typo (also in my filter, not only in the list message).
>> But correcting it, does not change the GLOB error message.
>
> Ok. What is the $f? It is object or what kind of scalar?
>
It is the Apache2::Filter object.
See : http://perl.apache.org/docs/2.0/api/Apache2/Filter.html
Configured in httpd as : PerlOutputFilterHandler MyFilter
See also : http://perl.apache.org/docs/2.0/user/handlers/filters.html

My (hopeful) thinking was that considering the
$f->read()
the Apache2::Filter object may also be a FileHandle, hence the attempt at
binmode($f,..)
But that seems to be incorrect.
(And I don't see any (documented) method of Apache2::Filter that would return the
underlying FileHandle either)
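
A small sketch of why binmode() complains, assuming the filter object is an ordinary blessed reference and not a glob (which is what the error message suggests). FakeFilter is a hypothetical stand-in, not the real Apache2::Filter class.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the filter object: a blessed scalar
# reference, which is an object but not a filehandle.
package FakeFilter;
sub new { my ($class) = @_; my $state = 0; return bless \$state, $class }

package main;

my $f = FakeFilter->new;

# binmode() expects a real filehandle (a glob or IO handle), so
# handing it a plain object dies at run time.
my $ok    = eval { binmode($f, ':encoding(UTF-8)'); 1 };
my $error = $ok ? '' : $@;
print "binmode failed: $error" unless $ok;
```

Having a read() method does not make an object a filehandle, so the I/O layer has nothing to attach to.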
Re: Output filters, data encoding
On 13.11.2019 19:37, Damyan Ivanov wrote:
> -=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-
>> while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
>> ..
>>
>> and then I need to pass this data to another module for processing (Template::Toolkit).
>> To make a long story short, Template::Toolkit misinterprets the data I'm
>> sending to it, because this data /is/ actually UTF-8, but apparently not
>> marked so internally by the $f->read(). So TT2 re-encodes it, leading to
>> double UTF-8 encoding.
>>
>> My question is : can I - and how -, set the filehandle that corresponds to
>> the $f->read(), to a UTF-8 layer ?
>> I have tried
>>
>> line 155: binmode($f,'encoding:(UTF-8)');
>>
>> and that triggers an error :
>> Not a GLOB reference at (my filter) line 155.\n
>> )
>>
>> Or do I need to read the data 'as is', and separately do an
>>
>> $decoded_buffer = decode('UTF-8', $buffer);
>
> There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:
>
> If CHECK is set to "Encode::FB_QUIET", encoding and decoding
> immediately return the portion of the data that has been processed so
> far when an error occurs. The data argument is overwritten with
> everything after that point; that is, the unprocessed portion of the
> data. This is handy when you have to call "decode" repeatedly in the
> case where your source data may contain partial multi-byte character
> sequences, (that is, you are reading with a fixed-width buffer). Here's
> some sample code to do exactly that:
>
> my($buffer, $string) = ("", "");
> while (read($fh, $buffer, 256, length($buffer))) {
> $string .= decode($encoding, $buffer, Encode::FB_QUIET);
> # $buffer now contains the unprocessed partial character
> }
>
> Looks exactly like your case.
>
Thanks for the response and the tip.

My idea of adding a UTF-8 layer to the filehandle through which Apache2::Filter reads the
incoming data was probably wrong anyway : it cannot do that, because it gets this data
originally in chunks, as "bucket brigades" from Apache httpd. And there is no guarantee
that such a bucket brigade would always end in "complete" UTF-8 character sequences.
At the very least, this would probably complicate the code underlying $f->read() quite a bit.
It is clearer to handle that in the filter itself.

The Encode::FB_QUIET flag above, with the incremental buffer read, is really smart.
Unfortunately, the Apache2::Filter read() method does not allow as many arguments, and all
one has is something like this :

my $accumulated_content = "";
while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
$accumulated_content .= $buffer;
}

Luckily, in this case, I have to accumulate the complete response content anyway, before I
can decide to call Template::Toolkit on it or not. So I can do a single decode() on
$accumulated_content. Not the most efficient memory-wise, but good enough in this case.
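
The accumulate-then-decode approach can be sketched as follows. The @chunks list is a made-up replacement for successive $f->read() results, with the two bytes of "ü" deliberately split across chunks.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed sample data standing in for what $f->read() returns per call.
my @chunks = ("M\xC3", "\xBCnchen");

# Collect the raw octets first, without decoding anything yet.
my $accumulated_content = '';
$accumulated_content .= $_ for @chunks;

# A single decode over the complete octet string: no multi-byte
# sequence can be cut in half at this point.
my $decoded = decode('UTF-8', $accumulated_content);
print length($decoded), " characters\n";
```

The cost is that the whole response sits in memory, which is acceptable here only because the complete content is needed anyway.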
Re: Output filters, data encoding
Hi

on 2019/11/14 2:12, André Warnier (tomcat/perl) wrote:
> I'm writing a new PerlOutputFilter, stream version.

Can you give a more general introduction for what is "stream version"?

Thank you.
Re: Output filters, data encoding
On Wednesday 13 November 2019 20:10:07 André Warnier (tomcat/perl) wrote:
> On 13.11.2019 19:53, pali@cpan.org wrote:
> > On Wednesday 13 November 2019 19:52:25 André Warnier (tomcat/perl) wrote:
> > > On 13.11.2019 19:17, pali@cpan.org wrote:
> > > > On Wednesday 13 November 2019 19:12:10 André Warnier (tomcat/perl) wrote:
> > > > > My question is : can I - and how -, set the filehandle that corresponds to
> > > > > the $f->read(), to a UTF-8 layer ?
> > > > > I have tried
> > > > >
> > > > > line 155: binmode($f,'encoding:(UTF-8)');
> > > >
> > > > Hi André! When specifying PerlIO layer for file handle, you need to
> > > > write colon character before layer name. So correct binmode call is:
> > > >
> > > > binmode($f, ':encoding(UTF-8)');
> > > >
> > > > > and that triggers an error :
> > > > > Not a GLOB reference at (my filter) line 155.\n
> > > > > )
> > >
> > > Thanks. Ooops, that was a typo (also in my filter, not only in the list message).
> > > But correcting it, does not change the GLOB error message.
> >
> > Ok. What is the $f? It is object or what kind of scalar?
> >
> It is the Apache2::Filter object.
> See : http://perl.apache.org/docs/2.0/api/Apache2/Filter.html
> Configured in httpd as : PerlOutputFilterHandler MyFilter
> See also : http://perl.apache.org/docs/2.0/user/handlers/filters.html
>
> My (hopeful) thinking was that considering the
> $f->read()
> the Apache2::Filter object may also be a FileHandle, hence the attempt at
> binmode($f,..)
> But that seems to be incorrect.
> (And I don't see any (documented) method of Apache2::Filter that would
> return the underlying FileHandle either)

Sorry, then I do not know :-(
Re: Output filters, data encoding
On Wednesday 13 November 2019 20:37:06 Damyan Ivanov wrote:
> -=| André Warnier (tomcat/perl), 13.11.2019 19:12:10 +0100 |=-
> > while (my $sz = $f->read(my $buffer, BUFF_LEN)) {
> > ..
> >
> > and then I need to pass this data to another module for processing (Template::Toolkit).
> > To make a long story short, Template::Toolkit misinterprets the data I'm
> > sending to it, because this data /is/ actually UTF-8, but apparently not
> > marked so internally by the $f->read(). So TT2 re-encodes it, leading to
> > double UTF-8 encoding.
> >
> > My question is : can I - and how -, set the filehandle that corresponds to
> > the $f->read(), to a UTF-8 layer ?
> > I have tried
> >
> > line 155: binmode($f,'encoding:(UTF-8)');
> >
> > and that triggers an error :
> > Not a GLOB reference at (my filter) line 155.\n
> > )
> >
> > Or do I need to read the data 'as is', and separately do an
> >
> > $decoded_buffer = decode('UTF-8', $buffer);
>
> There's a middle ground - partial decoding. See Encode(1)/FB_QUIET:
>
> If CHECK is set to "Encode::FB_QUIET", encoding and decoding
> immediately return the portion of the data that has been processed so
> far when an error occurs. The data argument is overwritten with
> everything after that point; that is, the unprocessed portion of the
> data. This is handy when you have to call "decode" repeatedly in the
> case where your source data may contain partial multi-byte character
> sequences, (that is, you are reading with a fixed-width buffer). Here's
> some sample code to do exactly that:
>
> my($buffer, $string) = ("", "");
> while (read($fh, $buffer, 256, length($buffer))) {
> $string .= decode($encoding, $buffer, Encode::FB_QUIET);
> # $buffer now contains the unprocessed partial character
> }

This code is dangerous: it can enter an endless loop. Once you read an
invalid UTF-8 sequence, the loop above never finishes. So if the input
is under user/attacker control, you introduce a DoS issue.

Instead of FB_QUIET, you should use the Encode::STOP_AT_PARTIAL flag.
With this flag, Encode::decode stops decoding when an otherwise valid
UTF-8 sequence is incomplete and needs more bytes. And by default,
invalid UTF-8 sequences are mapped to the Unicode replacement
character.

Btw, PerlIO::encoding also uses the Encode::STOP_AT_PARTIAL flag to
handle this situation.

PS: I know that Encode::STOP_AT_PARTIAL is undocumented, but it is only
because nobody found time to write documentation for it. It is part of
Encode API and ready to use...
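
A hedged sketch of chunk-wise decoding with Encode::STOP_AT_PARTIAL, as suggested above. The @chunks list and the $pending/$text names are made up for illustration; in a filter the chunks would come from $f->read().

```perl
use strict;
use warnings;
use Encode ();

# Assumed sample chunks; the two bytes of "ü" are split across them.
my @chunks = ("M\xC3", "\xBCnchen");

my ($pending, $text) = ('', '');
for my $chunk (@chunks) {
    $pending .= $chunk;
    # Decodes every complete sequence; an incomplete trailing sequence
    # is left in $pending for the next round.
    $text .= Encode::decode('UTF-8', $pending, Encode::STOP_AT_PARTIAL());
}

# End of stream: anything still pending is a truncated sequence, which
# the default decode maps to the replacement character.
$text .= Encode::decode('UTF-8', $pending) if length $pending;

print length($text), " characters decoded\n";
```

Since the flag is undocumented, it seems prudent to test this against the Encode version actually installed.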

>
> Looks exactly like your case.
>
>
> -- Damyan
Re: Output filters, data encoding
On Wed, 13 Nov 2019 19:12:10 +0100
André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
>

I also found that calls to binmode in output filters generate a double encoding.

Here is a paste of the code of an output filter that adds a menu, some headers and closing tags to the html pages generated by previous modules; it reads from STDOUT, not from a file:

https://pastebin.com/trhjfDxX

It uses this :

> # we have reached the end of the content
> if ($f->seen_eos) {
>
> $content = Encode::decode_utf8( $content ) . '</div>' ;


Never had a problem with it.

I have handlers, not output filters, that read from files and use this :

open DOCUMENT_CONTENT, "<:encoding(UTF-8)", "$document_content" or die "can't open $document_content : $!\n" ;


--

Best regards, Vincent Veyron

https://compta.libremen.com
Free software for double-entry general accounting
Re: Output filters, data encoding
On 14.11.2019 01:09, Hua, Yong wrote:
> Hi
>
> on 2019/11/14 2:12, André Warnier (tomcat/perl) wrote:
>> I'm writing a new PerlOutputFilter, stream version.
>
> Can you give a more general introduction for what is "stream version"?
>
> Thank you.
>
You should read the pages which I referred to previously, they explain this better than I
could do :
1) http://perl.apache.org/docs/2.0/user/handlers/filters.html
2) http://perl.apache.org/docs/2.0/api/Apache2/Filter.html

See in particular here :
http://perl.apache.org/docs/2.0/user/handlers/filters.html#Two_Methods_for_Manipulating_Data
Re: Output filters, data encoding
-=| pali@cpan.org, 14.11.2019 09:51:20 +0100 |=-
> On Wednesday 13 November 2019 20:37:06 Damyan Ivanov wrote:
> > my($buffer, $string) = ("", "");
> > while (read($fh, $buffer, 256, length($buffer))) {
> > $string .= decode($encoding, $buffer, Encode::FB_QUIET);
> > # $buffer now contains the unprocessed partial character
> > }
>
> This code is dangerous. It can enter into endless loop. Once you read
> invalid UTF-8 sequence, above loop never finish. So if buffer input is
> under user/attacker control you introduce DoS issues.

Sure. A check to prevent that would be in order. I must admit that
I was very happy to find a solution to the problem that was even in
the official documentation.
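
One possible shape for such a check, as a sketch only. It assumes a UTF-8 sequence is at most 4 bytes long, so 4 or more leftover bytes cannot be a merely truncated character. decode_chunks and its variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode a list of octet chunks with FB_QUIET,
# without getting stuck on invalid input.
sub decode_chunks {
    my @chunks = @_;
    my ($pending, $text) = ('', '');
    for my $chunk (@chunks) {
        $pending .= $chunk;
        while (length $pending) {
            # Returns the valid prefix; $pending keeps what follows it.
            $text .= Encode::decode('UTF-8', $pending, Encode::FB_QUIET());
            # Up to 3 leftover bytes may be a truncated character,
            # so wait for the next chunk.
            last if length($pending) < 4;
            # Otherwise the leftover starts with an invalid byte:
            # replace that one byte and try again.
            substr($pending, 0, 1, '');
            $text .= "\x{FFFD}";
        }
    }
    # End of input: the default decode maps any remaining malformed
    # bytes to the replacement character.
    $text .= Encode::decode('UTF-8', $pending) if length $pending;
    return $text;
}

my $text = decode_chunks("A\xFF", "BCDE", "M\xC3", "\xBCnchen");
print length($text), " characters decoded\n";
```

Each pass through the inner loop either stops or removes a byte from $pending, so the loop always makes progress.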

> Instead of FB_QUIET, you should use Encode::STOP_AT_PARTIAL flag. This
> is the flag which you want to use. Encode::decode stops decoding when
> valid UTF-8 sequence is not complete and needs more bytes to read. And
> by default invalid UTF-8 sequences are mapped to Unicode replacement
> character.
>
> Btw, PerlIO::encoding uses also Encode::STOP_AT_PARTIAL flag to handle
> this situation.
>
> PS: I know that Encode::STOP_AT_PARTIAL is undocumented, but it is only
> because nobody found time to write documentation for it. It is part of
> Encode API and ready to use...

That would be https://rt.cpan.org/Public/Bug/Display.html?id=67065
(filed 8 years ago, still open).