Mailing List Archive

TeX
Here is the first version of a TeX rendering extension for Wikipedia.
It's not production code yet.

Please comment.

= How does it work =

A new preference is introduced which says whether to:
* always render formulas as PNGs
* render them as HTML if they are simple enough, or as PNGs otherwise
* leave them as pseudo-TeX (mainly for text browsers where neither PNG
nor HTML rendering would be visible)

ISSUE 1: While HTML reduces bandwidth, it is much uglier, so the default is PNG-only.
ISSUE 2: PNGs are rendered with "a bit too big" font. That's on purpose.
A bit too big works well in big and medium resolution, and is still readable
in small resolution. But "a bit too little" with big resolution would be very
hard to read.

Also, a new table is introduced:
CREATE TABLE math (
math_inputhash char(32) NOT NULL,
math_outputhash char(32) NOT NULL,
math_html text NOT NULL,
UNIQUE KEY math_inputhash (math_inputhash)
);

math_inputhash is the MD5 of the input markup, math_outputhash
is the MD5 of the output markup, and math_html is the HTML rendering, or ""
if the formula is too difficult for HTML.

ISSUE 3: MD5 should be stored in binary in final version.
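
To make ISSUE 3 concrete, here is a small sketch in Python (used only for illustration; neither the wiki code nor texvc is written in it). It shows that the hex form of an MD5 digest is 32 characters, which is what char(32) holds, while the raw digest is 16 bytes. The sample markup is made up.

```python
import hashlib

markup = "x + y"  # made-up sample input, not taken from the real code

digest = hashlib.md5(markup.encode("utf-8"))
hex_form = digest.hexdigest()   # text form, as stored in the char(32) columns
raw_form = digest.digest()      # binary form, as a 16-byte column would store it

print(len(hex_form), len(raw_form))
```

Storing the raw form would roughly halve the size of both hash columns and of the unique index.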

OutputPage.php calls renderMath() for every occurrence of <math></math>
in the page text. If the user chose pseudo-TeX, then that's the end.
Otherwise it checks in the database whether the formula is already rendered.
If it is, then it either takes the HTML or generates a link to the image.

ISSUE 4: The directory for math images should be configurable, and it should
also be known to texvc (command line? compilation option?).
It should not be the upload directory.
ISSUE 5: Maybe it should use a/ab/ab*.png like other images. Or maybe
the Wikipedia server should move to ReiserFS.
ISSUE 6: The image should have an ALT= attribute.
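
A possible shape for ISSUES 5 and 6, sketched in Python for illustration only. The function names, the "math" root directory and the choice of the source markup as ALT text are all assumptions, not part of the patch.

```python
import html

def hashed_image_path(root, md5_hex):
    """Spread images over subdirectories in the a/ab/ab....png style."""
    return f"{root}/{md5_hex[0]}/{md5_hex[:2]}/{md5_hex}.png"

def image_tag(path, tex):
    """Build an <img> tag whose ALT text is the escaped source markup."""
    return f'<img src="{path}" alt="{html.escape(tex)}">'

path = hashed_image_path("math", "ab" + "0" * 30)  # made-up hash value
print(image_tag(path, "x < y"))
```

With the hashed layout no single directory collects every image, which matters on filesystems that slow down with very large directories.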

If the image/HTML isn't generated yet, texvc is called. If it fails, an error
message is generated.

ISSUE 7: this message should be localized.
ISSUE 8: texvc shouldn't be in cgi-bin or care should be taken it can't
be called with any evil options.

Depending on return value of texvc results are generated and put into table
for caching.
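
The lookup-then-render flow described so far can be sketched as follows. This is Python with an in-memory SQLite table standing in for the real PHP and MySQL code; fake_texvc, the "/math" directory and the error text are hypothetical placeholders.

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE math ("
    " math_inputhash TEXT NOT NULL UNIQUE,"
    " math_outputhash TEXT NOT NULL,"
    " math_html TEXT NOT NULL)"
)

def fake_texvc(tex):
    """Hypothetical stand-in for texvc: (output hash, html), or None on failure."""
    if "\\bad" in tex:
        return None
    normalized = "".join(tex.split())      # toy normalization: drop whitespace
    return hashlib.md5(normalized.encode()).hexdigest(), ""

def render_math(tex, math_dir="/math"):
    inputhash = hashlib.md5(tex.encode()).hexdigest()
    row = db.execute(
        "SELECT math_outputhash, math_html FROM math WHERE math_inputhash = ?",
        (inputhash,),
    ).fetchone()
    if row is None:                        # not cached yet: call the renderer
        result = fake_texvc(tex)
        if result is None:
            return "<b>math error</b>"     # failures are not cached (ISSUE 9)
        db.execute("INSERT INTO math VALUES (?, ?, ?)", (inputhash, *result))
        row = result
    outputhash, html = row
    # Empty math_html means "too difficult for HTML", so link to the PNG.
    return html if html else f'<img src="{math_dir}/{outputhash}.png">'

print(render_math("x + y"))
```

Note that two different inputs can share one output hash, so the table may hold several rows pointing at the same image.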

ISSUE 9: failures are not cached. In final version they should be cached,
but cleaned on every upgrade of texvc (which may support more
TeX than previous version).

Now texvc takes its input in the first argument.

ISSUE 10: I'd rather use stdin, but proc_open (popen2 for Perl hackers) appears
only in PHP 4.3, and PHP 4.2 is still the standard.

Then it LALR-parses it. What it parses is not real TeX. If HTML has an entity
&foo; and TeX has no \foo, this pseudo-TeX will provide \foo anyway. This
ensures that it's very easy to use.

Then the input is standardized and the MD5 of the standardized version is computed.

ISSUE 11: the race condition of two runs of texvc trying to generate the same
PNG will have to be investigated.
ISSUE 12: texvc should check here whether the output PNG already exists (HTML is fast
to generate, so it doesn't hurt to regenerate it). That may happen not only
in case of a race condition, but also if it was generated from different
input markup (say from "x + y", and we do it from "x+y" now).
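
One common way to handle ISSUES 11 and 12 together, sketched in Python for illustration: skip the work if the PNG exists, otherwise write to a private temporary file and rename it into place. The function and its make_png callback are hypothetical; nothing here is taken from texvc itself.

```python
import hashlib
import os
import tempfile

def store_png(out_dir, normalized_tex, make_png):
    """Write <md5>.png unless it already exists; return (path, created)."""
    name = hashlib.md5(normalized_tex.encode("utf-8")).hexdigest() + ".png"
    final_path = os.path.join(out_dir, name)
    if os.path.exists(final_path):           # ISSUE 12: already rendered
        return final_path, False
    # ISSUE 11: render into a private temp file in the same directory ...
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(make_png(normalized_tex))
        # ... then rename it into place (atomic on POSIX filesystems).
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path, True
```

If two runs race, both may render, but readers never see a half-written PNG, and the second rename just replaces the file with identical content.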

Then it prints md5 and HTML (if any) on stdout.

ISSUE 13: PHP should not wait for texvc to finish from this point. texvc should
probably fork() here.

Now latex, dvips and convert (which in turn uses ghostscript) are called.

ISSUE 14: Latex creates some temporary files. They should be created in some
tmp/ directory, not in current directory.
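
For ISSUE 14, a sketch in Python (illustration only) of running the latex, dvips and convert chain inside a throwaway directory. The file names and the plain command lines are assumptions; the options texvc really passes are not shown in this message.

```python
import os
import subprocess
import tempfile

def build_commands(base="formula"):
    """Command lines for the latex -> dvips -> convert chain (placeholder names)."""
    return [
        ["latex", "-interaction=nonstopmode", base + ".tex"],
        ["dvips", "-o", base + ".ps", base + ".dvi"],
        ["convert", base + ".ps", base + ".png"],
    ]

def render_in_tmp(latex_source, run=subprocess.run):
    """Run the chain in a temporary directory and return the PNG bytes."""
    with tempfile.TemporaryDirectory(prefix="texvc-") as tmp:
        with open(os.path.join(tmp, "formula.tex"), "w") as f:
            f.write(latex_source)
        for cmd in build_commands():
            # cwd=tmp keeps .aux, .log and .dvi files out of the current directory
            run(cmd, cwd=tmp, check=True)
        with open(os.path.join(tmp, "formula.png"), "rb") as f:
            return f.read()
    # the temporary directory and everything in it is removed here
```

The run parameter exists only so the function can be exercised without a TeX installation.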
Re: TeX [ In reply to ]
ISSUE 15: texvc is written in Ocaml

Ocaml is the best available language for writing interpreters for special
purpose languages because:
* it is really fast, comparable in performance to "traditional" compiled
languages with GC like Java, and in cases where GC doesn't introduce
much overhead, to C/C++
* it doesn't segfault
* it has yacc/lex
* it has advanced symbolic processing functionality
* it can do a lot to ensure correctness of programs, and programs in Ocaml
are very easy to reason about (texvc doesn't contain a single variable)
* programs can be written at a much higher level, so development is faster,
yet low-level functionality is also available if needed.

I hope this explanation is enough for you.

To learn more about Ocaml visit Polish Wikipedia, in particular:
http://pl.wikipedia.org/wiki/Ocaml
http://pl.wikipedia.org/wiki/Ocamlyacc

ISSUE 16: this pseudo-TeX is very incomplete. Please tell me what
functionality you need (sums/integrals come to mind ...).

ISSUE 17: texvc uses double-dollar math, not single-dollar math. It
uses more space but looks nicer.
Re: TeX [ In reply to ]
On Sun, Dec 01, 2002 at 02:00:26AM +0100, Tomasz Wegrzanowski wrote:
> Here is first version of TeX rendering extension to Wikipedia.
> It's not production code yet.
>
> Please comment.

Hello taw,

just looked into your diff. Making HTML-rendering an option is
a good idea.

One thing I would strongly propose to change:

function renderMath( $matches )
{
...
$pid = popen ("./math/texvc \"{$tex}\"", "r"); # texvc shouldn't be in cgi-bin


This allows nasty attacks before the TeX-code is validated. Let, for
example, $tex be $(find / -type f|xargs rm)
Then popen starts a shell to start the program and its parameters are expanded by
the shell. A lot of nasty things could be performed this way.

Workaround:
a) use a bi-directional proc_open and put the $tex via stdin
b) create a file with the md5-hash as filename.

Workaround (a) is currently not available in standard PHP.
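
For illustration, workaround (a) looks like this in Python, where the argument-list form never involves a shell. The child process here is a made-up stand-in for texvc that simply echoes its stdin.

```python
import subprocess
import sys

# Stand-in for texvc: a child process that just echoes what it reads on stdin.
FAKE_TEXVC = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]

def run_renderer(tex, argv=FAKE_TEXVC):
    """Send the markup on stdin; argv is a list, so no shell parses the markup."""
    done = subprocess.run(argv, input=tex, capture_output=True, text=True, check=True)
    return done.stdout

print(run_renderer("$(echo hacked)"))
```

The markup arrives at the child as plain data, so shell syntax inside it is never expanded.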

Regarding functions to be provided:
\mbox
\sum
\int
\left, \right
\infty
blackboard letters
\sin, \cos, \lim, \log, ...

This OCAML looks funny, I think I will have to dig deeper into it before
commenting it :-)

Regards,

JeLuF
Re: TeX [ In reply to ]
On Sun, Dec 01, 2002 at 09:22:14AM +0100, Jens Frank wrote:
> One thing I would strongly propose to change:
>
> function renderMath( $matches )
> {
> ...
> $pid = popen ("./math/texvc \"{$tex}\"", "r"); # texvc shouldn't be in cgi-bin
>
>
> This allows nasty attacks before the TeX-code is validated. Let, for
> example, $tex be $(find / -type f|xargs rm)
> Then popen starts a shell to start the program and its parameters are expanded by
> the shell. A lot of nasty things could be performed this way.
>
> Workaround:
> a) use a bi-directional proc_open and put the $tex via stdin
> b) create a file with the md5-hash as filename.
>
> Workaround (a) is currently not available in standard PHP.

PHP has a standard function that escapes shell metacharacters,
exactly for this purpose.

I just forgot to put it there.
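
The message does not name the PHP function (presumably escapeshellarg() or escapeshellcmd()). The same idea in Python, as an illustration only, uses shlex.quote; the printf command is a harmless stand-in for the real texvc call.

```python
import shlex
import subprocess

def shell_command(tex):
    """Build a shell command line with the markup quoted as one literal argument."""
    return "printf %s " + shlex.quote(tex)

tex = "$(echo hacked); `id`"
out = subprocess.run(
    shell_command(tex), shell=True, capture_output=True, text=True
).stdout
print(out == tex)
```

Because the markup is wrapped in single quotes, the shell passes it through untouched instead of expanding $(...) or backticks.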
Re: TeX [ In reply to ]
I wonder if it would be beneficial to skip the external parsing step
and leave all parsing to TeX. The advantage would be that all TeX
functionality is immediately available without any additional work,
including macro packages such as those for
* commutative diagrams and graphs (xypic)
* chemical structure formulas (chemtex)
* music scores (musictex)
* chess diagrams

There is no safety issue, since TeX can be made to run safely so that
no shell processes can be called and only the standard files can be
generated.

The only disadvantage I see is that we would lose the optional HTML
rendering of formulas. While I like that feature, I think we can live
without it: most advanced formulas cannot be rendered in HTML anyway,
and those that can should probably be written in HTML in the first
place to be nice to anonymous users with non-graphical browsers.

Axel

---
Payment: http://www.wikipedia.org/wiki/K%F6nig%27s_lemma

Re: TeX [ In reply to ]
On Sun, Dec 01, 2002 at 05:06:06PM -0800, Axel Boldt wrote:
> I wonder if it would be beneficial to skip the external parsing step
> and leave all parsing to TeX. The advantage would be that all TeX
> functionality is immediately available without any additional work,
> including macro packages such as those for
> * commutative diagrams and graphs (xypic)
> * chemical structure formulas (chemtex)
> * music scores (musictex)
> * chess diagrams
>
> There is no safety issue, since TeX can be made to run safely so that
> no shell processes can be called and only the standard files can be
> generated.
>
> The only disadvantage I see is that we would lose the optional HTML
> rendering of formulas. While I like that feature, I think we can live
> without it: most advanced formulas cannot be rendered in HTML anyway,
> and those that can should probably be written in HTML in the first
> place to be nice to anonymous users with non-graphical browsers.

<table>
<tr align=center><td><td>&infin;
<tr align=center><td><td>&int;<td>x<sup>2</sup> dx
<tr align=center><td><td>0
<tr align=center><td>3<td colspan=2><hr><td><td> - &pi;
<tr align=center><td><td colspan=2>2&pi;
</table>

Ok. That was silly.
But you can see a difference between "possible to do in HTML"
and "sane to do in HTML" now.

Now back to topic. texvc's markup is not really TeX or some subset of it.
It is pseudo-TeX that is much nicer to write. It fixes all the usual
problems (%, &foo; not having a \foo counterpart, some {}-ing issues),
and provides us with full information that we can use for many purposes,
rendering HTML being one. But it could also be extended to render ASCII art,
MathML, or really awesome HTML code like the one in the previous example.
And from the other side too, to accept other kinds of math markup (OpenOffice
math markup has been suggested).

I'm not completely sure about safety. Probably we should both validate and
run TeX in safe mode.

texvc is extremely easy to extend. It shouldn't take too much time just to
add a few dozen new tags, and a night or two should be enough for a
completely new kind of markup, be it chemistry (which certainly should be added),
chess, music or whatever.


Payment: http://pl.wikipedia.org/wiki/Rezolucja
Re: TeX [ In reply to ]
--- Tomasz Wegrzanowski <taw@users.sourceforge.net> wrote:

> But you can see a difference between "possible to do in HTML"
> and "sane to do in HTML" now.

Yes. But do we agree that formulas that are "sane to do in HTML", like
x<sup>2</sup> + y<sup>2</sup> &ge; 0 for instance, should be written in
HTML even when the TeX system is in place?

> I'm not completely sure about safety. Probably we should both
> validate and
> run TeX in safe mode.

For safety concerns, preparsing of TeX is not necessary: the safety
issues of running PHP and calling an external parser are much larger.

> texvc is extremely easy to extend. It shouldn't take too much time
> just to
> add a few dozens of new tags, and a night or two should be enough for
> completely new kind of markup, be it chemistry (which certainly
> should be added),
> chess, music or whatever.

Unfortunately, the xypic package (which I consider the one feature that
is really needed in the math area right now) as well as the other
packages I listed all use their own idiosyncratic syntax quite unlike
ordinary TeX. So extending texvc may not be quite as simple, and I'm
afraid we are creating a development bottleneck by requiring the
external parsing stage.

Axel

Payment: http://www.wikipedia.org/w/wiki.phtml?title=Tree_(graph_theory)&diff=0&oldid=456065

Re: TeX [ In reply to ]
On Mon, Dec 02, 2002 at 10:18:54AM -0800, Axel Boldt wrote:
> --- Tomasz Wegrzanowski <taw@users.sourceforge.net> wrote:
>
> > But you can see a difference between "possible to do in HTML"
> > and "sane to do in HTML" now.
>
> Yes. But do we agree that formulas that are "sane to do in HTML", like
> x<sup>2</sup> + y<sup>2</sup> &ge; 0 for instance, should be written in
> HTML even when the TeX system is in place?

No. It should be in pseudo-TeX.

Part of my plan is to make Wikipedias available in dict format
and put them into Debian packages. The less HTML they use,
the more useful these dicts will be.

> > I'm not completely sure about safety. Probably we should both
> > validate and
> > run TeX in safe mode.
>
> For safety concerns, preparsing of TeX is not necessary: the safety
> issues of running PHP and calling an external parser are much larger.

Right.
How to put TeX into safe mode ?

> > texvc is extremely easy to extend. It shouldn't take too much time
> > just to
> > add a few dozens of new tags, and a night or two should be enough for
> > completely new kind of markup, be it chemistry (which certainly
> > should be added),
> > chess, music or whatever.
>
> Unfortunately, the xypic package (which I consider the one feature that
> is really needed in the math area right now) as well as the other
> packages I listed all use their own idiosyncratic syntax quite unlike
> ordinary TeX. So extending texvc may not be quite as simple, and I'm
> afraid we are creating a development bottleneck by requiring the
> external parsing stage.

If the only thing that can be done with them is generating PNGs, then
PNGs should be generated and uploaded. The syntax of xypic is neither
widely known nor (as far as I can see) very readable, so I don't see
much benefit in generating them on the server.
Re: TeX [ In reply to ]
On Mon, Dec 02, 2002 at 08:39:43PM +0100, Tomasz Wegrzanowski wrote:
> On Mon, Dec 02, 2002 at 10:18:54AM -0800, Axel Boldt wrote:
> > Unfortunately, the xypic package (which I consider the one feature that
> > is really needed in the math area right now) as well as the other
> > packages I listed all use their own idiosyncratic syntax quite unlike
> > ordinary TeX. So extending texvc may not be quite as simple, and I'm
> > afraid we are creating a development bottleneck by requiring the
> > external parsing stage.
>
> If the only thing that can be done with them is generating PNGs, then
> PNGs should be generated and uploaded. Syntax of xypic is neither
> widely-known nor (as far as I can see) very readable, so I don't see
> much benefit with it being generated on server.

It's editable if it's inline. Mistakes can be changed easily.
Expecting a potential editor to have a running TeX installation
to fix an error is a higher requirement than him having a browser.

Regards,

JeLuF
Re: TeX [ In reply to ]
--- Tomasz Wegrzanowski <taw@users.sourceforge.net> wrote:
> On Mon, Dec 02, 2002 at 10:18:54AM -0800, Axel Boldt wrote:

> > Yes. But do we agree that formulas that are "sane to do in HTML",
> > like
> > x<sup>2</sup> + y<sup>2</sup> &ge; 0 for instance, should be
> > written in
> > HTML even when the TeX system is in place?
>
> No. It should be in pseudo-TeX.

I disagree: that way we send lots of unnecessary png's to our users.
This wastes bandwidth and creates problems if people use the wrong
screen resolution or non-graphical browsers.

> Part of my plan is to make Wikipedias available in dict format
> and put them into Debian packages. The less HTML they use,
> the more useful these dicts will be.

Why not put the rendered HTML tree into Debian? That way all the
formatting including tables will be the way the authors intended it to
be.

In any event, I don't think that requirements of a possible future
presentation format should guide our present editing decisions. Right
now, Wikipedia is presented in HTML.

Axel

Re: TeX [ In reply to ]
On Mon, Dec 02, 2002 at 01:03:26PM -0800, Axel Boldt wrote:
> > Part of my plan is to make Wikipedias available in dict format
> > and put them into Debian packages. The less HTML they use,
> > the more useful these dicts will be.
>
> Why not put the rendered HTML tree into Debian? That way all the
> formatting including tables will be the way the authors intended it to
> be.

Putting the HTML tree into Debian is fine, but that's not really the point.
The point is to have Wikipedia accessible the same way foldoc, wordnet,
jargon and all other data are: from the text-mode command line.

> In any event, I don't think that requirements of a possible future
> presentation format should guide our present editing decisions. Right
> now, Wikipedia is presented in HTML.

There is really no reason for Wikipedia to depend on a particular presentation
format. If it becomes multi-format it will be much more useful.
Dict was just one example. I'm sure people have their own lists of formats
they would like to see Wikipedia in.

Another obvious format is TeX (and ps/pdf generated from it) for
printable version of Wikipedia pages.
Re: TeX [ In reply to ]
>Toby Bartels is a troll and I'm not going to listen any more to what he
>says.

But he may have some valid criticisms that could only make texvc better.
Although he seems to be a bit fundamentalist about what texvc should allow
in his comments on the TeX testing pages; e.g., "...we really don't need TeX
support inline, it should probably not be supported inline. ... Wikipedia's
math should stick to English letters in boldface and italic whenever
possible, for the widest readability, with exceptions only for things (like
[pi]) that are universally rendered in other fonts."

Aside from that, my own observations are these:

1) Fonts seem to be over anti-aliased. '+', '-', and '=' signs are all
blurry, when they should be distinct, straight lines.

2) Fonts seem to be over weighted, i.e., everything looks somewhat bold.

3) Inline math doesn't feel very readable. PlanetMath's inline equations
feel a bit more readable than texvc's.

Of course, these are rather non-technical criticisms of texvc's font
rendering. I'd suggest employing whatever font rendering defaults that
PlanetMath is using, as their equations feel much more readable and much
less ugly.

Also, what about <math> ... </math> conflicting with MathML? If people
employ MathML in an article, does texvc check between the <math> ... </math>
tags to see whether it's MathML or TeX? Instead of overloading standard
tags, maybe it should be <tex> ... </tex>, and then just let people know
that Wikipedia only supports a subset of TeX for math purposes only. In any
case, that would make parsing of pages faster, as the "Is this MathML or
TeX?" logic could be thrown out. The price of that logic may be trivial
now, but what about when Wikipedia is serving hundreds of thousands of pages
a day? Plus it would make all the texvc input forward-compatible if it's
decided someday that fuller TeX support is needed and that <tex> ... </tex>
tags should be used.

Just my thoughts...

Okay, I'm done now,

Derek

Re: TeX [ In reply to ]
On Fri, Dec 27, 2002 at 04:26:19PM -0600, Derek Moore wrote:
> Aside from that, my own observations are this:
>
> 1) Fonts seem to be over anti-aliased. '+', '-', and '=' signs are all
> blurry, when they should be distinct, straight lines.
>
> 2) Fonts seem to be over weighted, i.e., everything looks somewhat bold.

These are all pretty much Ghostscript's defaults.
We could experiment with font size (texvc uses 120; some people think even
smaller would look better, but I'm not sure if it would be so on larger
displays too).

> 3) Inline math doesn't feel very readable. PlanetMath's inline equations
> feel a bit more readable than texvc's.

With default settings texvc has a concept of "conservativeness" level.
If the math is simple it renders it in properly italicized HTML, otherwise
it renders it as PNG. In practice most inline equations will be
rendered in HTML, and most displayed equations in PNG, so it
will look more or less right. This is the most that can be done without
analyzing the context where the math appears.
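
As a toy illustration of the "simple enough for HTML" decision (Python, and not how texvc actually decides), the sketch below accepts only single-letter variables, numbers, a few operators and one-character superscripts. Anything else returns None, meaning "fall back to PNG". The token set is an assumption chosen to keep the example short.

```python
import html
import re

# Tokens the toy renderer understands; everything else forces a PNG.
_TOKEN = re.compile(
    r"\s*(?:(?P<var>[A-Za-z])|(?P<num>[0-9]+)|(?P<op>[+\-=<>])|(?P<sup>\^[A-Za-z0-9]))"
)

def render_html(tex):
    """Return HTML for very simple formulas, or None if a PNG is needed."""
    tex = tex.strip()
    out, pos = [], 0
    while pos < len(tex):
        m = _TOKEN.match(tex, pos)
        if m is None:
            return None                       # too difficult for HTML
        if m.group("var"):
            out.append(f"<i>{m.group('var')}</i>")
        elif m.group("num"):
            out.append(m.group("num"))
        elif m.group("op"):
            out.append(" " + html.escape(m.group("op")) + " ")
        else:
            out.append(f"<sup>{m.group('sup')[1]}</sup>")
        pos = m.end()
    return "".join(out)

print(render_html("x^2 + y^2"))
print(render_html("\\sum_{i} x_i"))
```

A stricter or looser token set would correspond to a different "conservativeness" level.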

> Also, what about <math> ... </math> conflicting with MathML? If people
> employ MathML in an article, does texvc check between the <math> ...
> </math> tags to see whether it's MathML or TeX? Instead of overloading
> standard tags, maybe it should be <tex> ... </tex>, and then just let
> people know that Wikipedia only supports a subset of TeX for math purposes
> only.

MathML isn't supposed to be written by humans, so I don't see a problem here.
If some new markup was added, <math> could be extended to alternative markups,
like: <math type="mathml"><mrow>some scary equation</mrow></math>.
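
A sketch of how such a type attribute could be dispatched, in Python for illustration only. The regular expression, the renderer table and the default of "tex" are assumptions; the real code lives in PHP and may look quite different.

```python
import re

# Matches <math>...</math> with an optional type="..." attribute.
MATH_TAG = re.compile(
    r'<math(?:\s+type="(?P<type>[a-z]+)")?>(?P<body>.*?)</math>', re.DOTALL
)

def render_all(text, renderers, default="tex"):
    """Replace each <math> element using the renderer chosen by its type."""
    def replace(m):
        kind = m.group("type") or default
        render = renderers.get(kind)
        if render is None:
            return m.group(0)          # unknown type: leave the markup untouched
        return render(m.group("body"))
    return MATH_TAG.sub(replace, text)

# Hypothetical renderers that only label their input.
renderers = {
    "tex": lambda body: f"[TEX:{body}]",
    "mathml": lambda body: f"[MML:{body}]",
}
print(render_all('a <math>x^2</math> b <math type="mathml"><mrow>y</mrow></math>', renderers))
```

The type check is a single dictionary lookup per formula, which is tiny next to the cost of rendering.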

> In any case, that would make parsing of pages faster, as the "Is
> this MathML or TeX?" logic could be thrown out. The price of that logic
> may be trivial now, but what about when Wikipedia is serving hundreds of
> thousands of pages a day? Plus it would make all the texvc input
> forward-compatible if it's decided someday that fuller TeX support is
> needed and that <tex> ... </tex> tags should be used.

Computation costs would be really negligible. Checking whether something
is MathML or TeX is probably 4 or 5 orders of magnitude less resource-intensive
than the rest of the things we do to render a page.