Mailing List Archive

vlink display of UTF8 characters
We recently found we were having issues with the display of UTF8
characters (I honestly don't know why this issue is coming up now, this
shop has been live for years). At any rate, the problem, after hours
spent debugging, turned out to be coming from the vlink script.

I switched over to vlink.pl and the problem persists, but I was able to
"fix" vlink.pl with the following patch:

--- a/vlink.pl 2010-08-20 17:29:56.000000000 -0700
+++ b/vlink.pl 2016-08-15 03:54:37.000000000 -0700
@@ -141,6 +141,7 @@
 eval { alarm $LINK_TIMEOUT; };

 socket(SOCK, PF_UNIX, SOCK_STREAM, 0) or die "socket: $!\n";
+binmode(SOCK, ':utf8');

 my $ok;


... note that the fix is to simply specify utf8 handling of the socket
stream. Now this is not a fix that I'm willing to push out because it
would likely break backwards compatibility. We really need a way to
make this a configurable setting.

This should also be fixed in vlink.c and tlink.c, but I honestly have
no idea how to enable UTF8 handling of a socket in C. I would like to
switch my client back to vlink.c, though, because it is more
efficient.

Comments, ideas, better fixes?


Peter


_______________________________________________
interchange-users mailing list
interchange-users@icdevgroup.org
http://www.icdevgroup.org/mailman/listinfo/interchange-users
Re: vlink display of UTF8 characters
> On Aug 15, 2016, at 7:07 AM, Peter <peter@pajamian.dhs.org> wrote:
>
> We recently found we were having issues with the display of UTF8
> characters (I honestly don't know why this issue is coming up now, this
> shop has been live for years). At any rate, the problem, after hours
> spent debugging, turned out to be coming from the vlink script.
>
> I switched over to vlink.pl and the problem persists, but I was able to
> "fix" vlink.pl with the following patch:
>
> --- a/vlink.pl 2010-08-20 17:29:56.000000000 -0700
> +++ b/vlink.pl 2016-08-15 03:54:37.000000000 -0700
> @@ -141,6 +141,7 @@
> eval { alarm $LINK_TIMEOUT; };
>
> socket(SOCK, PF_UNIX, SOCK_STREAM, 0) or die "socket: $!\n";
> +binmode(SOCK, ':utf8');
>
> my $ok;

What charset was the site in originally, and what were the other settings (Apache’s AddDefaultCharset, MV_HTTP_CHARSET, MV_UTF8, perl/Encode.pm versions, etc.)?

I wonder if this was properly configured on the IC side, as I’d just expect vlink to pass the octets through regardless of encoding. In any case, this doesn’t feel correct to me, so I’d like to see what other information we can gather.

C-wise, you’d have to write your own equivalent to the PerlIO layer to encode input data as UTF8, which is another reason I think this is just misconfigured, not fundamentally broken at this layer. We’ve had quite a few sites use the IC UTF-8 layer without ever having to resort to vlink modifications.
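
For illustration, here is a minimal sketch of the pass-through behaviour described above, assuming a link program only has to copy octets from one handle to another. The relay_octets helper and the in-memory handles are hypothetical stand-ins, not code taken from vlink.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: copy everything from $in to $out with no encoding
# layer on either handle, so the octets arrive exactly as they were sent.
sub relay_octets {
    my ($in, $out) = @_;
    binmode($in);     # raw bytes in
    binmode($out);    # raw bytes out
    my $buf;
    while (my $n = read($in, $buf, 8192)) {
        print {$out} $buf;
    }
    return;
}

# In-memory handles stand in for the socket and for STDOUT.
my $sent = "caf\xC3\xA9\n";    # "café" already encoded as UTF-8 octets
my $received = '';
open(my $in,  '<', \$sent)     or die "open: $!";
open(my $out, '>', \$received) or die "open: $!";
relay_octets($in, $out);
close($out) or die "close: $!";

print $received eq $sent ? "octets unchanged\n" : "octets altered\n";
```

With no layer on either side, whatever encoding the sender used reaches the receiver untouched, which is why a relay like this should not need to know about UTF-8 at all.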

HTH,

David
--
David Christensen
End Point Corporation
david@endpoint.com
785-727-1171




Re: vlink display of UTF8 characters
On 16/08/16 00:07, Peter wrote:
> We recently found we were having issues with the display of UTF8
> characters (I honestly don't know why this issue is coming up now, this
> shop has been live for years). At any rate, the problem, after hours
> spent debugging, turned out to be coming from the vlink script.
>
> I switched over to vlink.pl and the problem persists, but I was able to
> "fix" vlink.pl with the following patch:
>
> socket(SOCK, PF_UNIX, SOCK_STREAM, 0) or die "socket: $!\n";
> +binmode(SOCK, ':utf8');
>
> ... note that the fix is to simply specify utf8 handling of the socket
> stream. Now this is not a fix that I'm willing to push out because it
> would likely break backwards compatibility. We really need a way to
> make this a configurable setting.

Further testing shows that this breaks binary file uploads. I am
testing how it works to move the binmode line to after get_entity().

I think that reading the HTTP headers for the charset and setting the
binmode accordingly may be the ultimate solution here to make sure that
it doesn't clash with file downloads either.
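
As a rough sketch of the header-reading idea, assuming the response headers are available as one string and end at the first blank line, something like the following could extract the charset. The charset_from_headers name is hypothetical and the parsing is deliberately simple; it is not a complete Content-Type parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lower-cased charset named in the
# Content-Type header, or undef if the headers do not name one.
sub charset_from_headers {
    my ($headers) = @_;
    for my $line (split /\r?\n/, $headers) {
        last unless $line =~ /\S/;    # a blank line ends the headers
        if ($line =~ /^Content-Type:[^;]*;.*?\bcharset\s*=\s*"?([^\s";]+)/i) {
            return lc $1;
        }
    }
    return;
}

my $html = "Status: 200 OK\r\nContent-Type: text/html; charset=UTF-8\r\n\r\n<html>";
my $png  = "Content-Type: image/png\r\n\r\n";

for my $headers ($html, $png) {
    my $charset = charset_from_headers($headers);
    print defined $charset ? "charset: $charset\n" : "no charset given\n";
}
```

A binary response such as an image carries no charset parameter, so a check like this would leave the stream untouched in that case.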


Peter

Re: vlink display of UTF8 characters
On 16/08/16 10:28, Peter wrote:
> Further testing shows that this breaks binary file uploads. I am
> testing how it works to move the binmode line to after get_entity().

Doesn't work if the binmode line is moved down, but if I comment it out
then uploads work fine again (but UTF8 is broken). I will be looking
into a solution, but I'm certainly open to suggestions.


Peter

Re: vlink display of UTF8 characters
This landed in my spam box, I only now found it...

On 16/08/16 02:50, David Christensen wrote:
> What charset was the site in originally and what were the other
> settings (Apache’s AddDefaultCharset, MV_HTTP_CHARSET, MV_UTF8,
> perl/Encode.pm versions, etc)?

The site has been in UTF-8 for years, and we only recently started
having issues with it.

AddDefaultCharset UTF-8

Variable MV_HTTP_CHARSET UTF-8
Variable MV_UTF8 1
DatabaseDefault PG_ENABLE_UTF8 1
DatabaseDefault GDBM_ENABLE_UTF8 1

Database is set to UTF-8 in postgresql.

# /usr/local/perl/bin/perl --version

This is perl 5, version 12, subversion 1 (v5.12.1) built for x86_64-linux

# /usr/local/perl/bin/perl -MEncode -le 'print $Encode::VERSION'
2.39

> I wonder if this was properly configured on the IC side, as I’d just
> expect vlink to pass the octets through regardless of encoding. In
> any case, this doesn’t feel correct to me, so I’d like to see what
> other information we can gather.

I've literally debugged it right to the point where IC writes the data
out to the socket. I can log right from the end of
Vend::Server::respond() (which is right at the point where Interchange
writes to the socket) and I get perfectly formatted UTF-8 data in the
log. I also checked $$body with Encode::is_utf8() at that same point
and it comes back true, so it certainly has UTF8 data at that point.
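
As a side note on what that check can and cannot show, the sketch below uses in-memory handles in place of the socket. Encode::is_utf8() only reports Perl's internal flag on the string; the octets that reach a handle also depend on whether the handle has an encoding layer. The written_octets helper is hypothetical, and this is a general illustration rather than a diagnosis of this particular site.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $string to an in-memory handle opened with
# the given mode and return the octets that were actually written.
sub written_octets {
    my ($mode, $string) = @_;
    my $buf = '';
    open(my $fh, $mode, \$buf) or die "open: $!";
    print {$fh} $string;
    close($fh) or die "close: $!";
    return $buf;
}

# A character string holding "café" (4 characters).
my $body = decode('UTF-8', "caf\xC3\xA9");
print "is_utf8 flag is set\n" if Encode::is_utf8($body);

my $raw   = written_octets('>',      $body);   # handle without a layer
my $layer = written_octets('>:utf8', $body);   # handle with :utf8

printf "no layer:    %d octets\n", length $raw;
printf ":utf8 layer: %d octets\n", length $layer;
```

Without a layer the "é" goes out as the single octet 0xE9, which is not valid UTF-8 on its own; with the :utf8 layer it goes out as the two octets 0xC3 0xA9.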

I then went on to manually fetch a page from vlink (by setting the
environment variables to simulate a page fetch) and it spewed out
invalid UTF-8 chars. I swapped for vlink.pl and still got the invalid
chars. I added the binmode line and it came back with properly
formatted UTF-8. Then I checked with apache and my normal browser with
the swapped out vlink.pl and it likes the output as well.

> C-wise, you’d have to write your own equivalent to the PerlIO layer
> to encode input data as UTF8, which is another reason I think this is
> just misconfigured, not fundamentally broken at this layer. We’ve
> had quite a few sites use the IC UTF-8 layer without ever having to
> resort to vlink modifications.

Yes, I just wonder how many of them are still using vlink. Haven't most
people moved on to mod_perl2 now?


Peter

Re: vlink display of UTF8 characters
On 16/08/16 10:28, Peter wrote:
> Further testing shows that this breaks binary file uploads. I am
> testing how it works to move the binmode line to after get_entity().
>
> I think that reading the HTTP headers for the charset and setting the
> binmode accordingly may be the ultimate solution here to make sure that
> it doesn't clash with file downloads either.

As it turns out, I just didn't move the binmode line down far enough.
That said, I came up with this which seems to work well for me:

--- src/vlink.pl 2010-08-20 17:29:56.000000000 -0700
+++ bin/vlink.pl 2016-08-16 01:22:20.000000000 -0700
@@ -162,9 +162,14 @@
 print SOCK send_entity();
 print SOCK "end\n";

-
+my $utf8;
 while(<SOCK>) {
-    print;
+    $utf8 ||= /^Content-Type: .+; charset=UTF-?8\s*$/i;
+    print;
+    if (!/\S/) {
+        binmode(SOCK, ':utf8') if $utf8 > 0;
+        $utf8 = -1;
+    }
 }

 close (SOCK) or die "close: $!\n";


... so, basically put, it only sets binmode to utf8 for the body, not
for the HTTP headers, and only if it sees a charset explicitly set to
utf8 in the headers.

I'm definitely not going to commit this without a fair bit of further
review, and I still would like to see patches for vlink.c and tlink.c.
I have no idea how those would handle it.


Peter

Re: vlink display of UTF8 characters
On 16/08/16 20:34, Peter wrote:
> I'm definitely not going to commit this without a fair bit of further
> review, and I still would like to see patches for vlink.c and tlink.c.
> I have no idea how those would handle it.

Ok, I'm now finding there are places in the code that work with the
binmode and places that work without it. For example:

[data userdb fname foo] ... returns data that is corrupt without the
binmode but works with the binmode line, but...

[value fname]

... returns data that works without the binmode but is corrupt with it.

At the end of the day it looks like some methods of importing data to IC
are encoding it and some are not. If it gets encoded then it ends up
double-encoded (or something), unless the binmode line removes one of
the encodings. To that end I'm abandoning the idea of putting binmode
in vlink and instead I've just written a simple filter that runs the
data through Encode::decode(). This seems to allow the wrong data to
display correctly without causing all sorts of other issues.
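
To make the double-encoding idea concrete, here is a small sketch that assumes the corruption is plain UTF-8 applied twice, which the thread does not confirm. The octets_as_printed helper is hypothetical and only simulates printing to a handle that has no encoding layer.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: return the octets a handle with no encoding layer
# receives when $string is printed to it.
sub octets_as_printed {
    my ($string) = @_;
    my $buf = '';
    open(my $fh, '>', \$buf) or die "open: $!";
    binmode($fh);
    print {$fh} $string;
    close($fh) or die "close: $!";
    return $buf;
}

my $chars = "caf\x{e9}";                # the text "café" as 4 characters
my $once  = encode('UTF-8', $chars);    # 5 octets, valid UTF-8
my $twice = encode('UTF-8', $once);     # 7 octets, encoded a second time

# One decode() removes the extra layer from the double-encoded value ...
my $repaired = octets_as_printed(decode('UTF-8', $twice));
# ... but removes the only layer from the value that was already correct.
my $damaged  = octets_as_printed(decode('UTF-8', $once));

printf "double-encoded, decoded once: %d octets\n", length $repaired;
printf "single-encoded, decoded once: %d octets\n", length $damaged;
```

Under that assumption, a blanket decode step helps the double-encoded values and harms the correctly encoded ones, which would match one tag working only with the binmode line and the other only without it, and is why a filter applied just to the affected fields is less disruptive.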


Peter
