Mailing List Archive

Parsing
There's a discussion on wikipedia-l about the exact syntax of a URL and
whether a punctuation mark at the end of the URL should be considered part
of the URL or not.
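
One possible rule, sketched in Python purely for illustration: match
greedily up to the next whitespace, then hand any trailing punctuation back
to the surrounding text. The name split_url and the TRAILING character set
are assumptions made for this sketch, not what the Wikipedia software does.

```python
import re

# Assumed rule: a URL runs to the next whitespace, but punctuation at
# its very end belongs to the sentence rather than to the URL.
URL_RE = re.compile(r"https?://\S+")
TRAILING = ".,;:!?)"  # hypothetical choice of "sentence" punctuation


def split_url(text):
    """Return (url, rest_of_text) for the first URL in text, or None."""
    m = URL_RE.search(text)
    if m is None:
        return None
    raw = m.group(0)
    url = raw.rstrip(TRAILING)          # drop trailing punctuation
    rest = raw[len(url):] + text[m.end():]
    return url, rest
```

Note the cost of the simple rule: a URL that legitimately ends in ")" or
"." would be cut short, which is exactly the ambiguity under discussion.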

This made me think: Would it make sense to make a formal BNF grammar for
the Wikipedia text format, so a LALR(1) parser could be made for it?
Would that make any sense at all with PHP, or just be too hard to code
and inflexible?

Only ten years ago, people would use C programming and YACC to solve
problems like this, and regexp-based parsing was considered just too
inefficient.


--
Lars Aronsson (lars@aronsson.se)
tel +46-70-7891609
http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/
Re: Parsing
> This made me think: Would it make sense to make a formal BNF
> grammar for the Wikipedia text format, so a LALR(1) parser could
> be made for it? Would that make any sense at all with PHP, or
> just be too hard to code and inflexible?

I'd love to have a formal grammar of some kind (I think regexps
would be fine), and I agree with Jan that a totally wiki-specific
syntax would be far better than our current mish-mash of HTML and
wiki markup. But I'm not sure if it's not already too late to
revisit those decisions.

But if it isn't, I'll be happy to discuss what a syntax might
look like.
Re: Parsing
On Thu, 25 Jul 2002 lcrocker@nupedia.com wrote:

> > This made me think: Would it make sense to make a formal BNF
> > grammar for the Wikipedia text format, so a LALR(1) parser could
> > be made for it? Would that make any sense at all with PHP, or
> > just be too hard to code and inflexible?
>
> I'd love to have a formal grammar of some kind (I think regexps
> would be fine), and I agree with Jan that a totally wiki-specific
> syntax would be far better than our current mish-mash of HTML and
> wiki markup. But I'm not sure if it's not already too late to
> revisit those decisions.
>
> But if it isn't, I'll be happy to discuss what a syntax might
> look like.

Wiki is still a new concept. Think how HTML was based on SGML, then
evolved into HTML 2, 3, 4, and then XML came along, because people
understood from the HTML experience that SGML was overly complex.

There is a big world of PhpWikis out there with [single bracket] link
syntax. There are other wiki implementations with different ideas about
syntax. But no wiki is as big as Wikipedia, so this is the most
concentrated amount of experience. This is where a format standard should
or at least could start to form.


--
Lars Aronsson (lars@aronsson.se)
tel +46-70-7891609
http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/
Re: Parsing
On Thu, Jul 25, 2002 at 12:17:26PM +0200, Lars Aronsson wrote:
> On Thu, 25 Jul 2002 lcrocker@nupedia.com wrote:
> > > This made me think: Would it make sense to make a formal BNF
> > > grammar for the Wikipedia text format, so a LALR(1) parser could
> > > be made for it? Would that make any sense at all with PHP, or
> > > just be too hard to code and inflexible?
> >
> > I'd love to have a formal grammar of some kind (I think regexps
> > would be fine), and I agree with Jan that a totally wiki-specific
> > syntax would be far better than our current mish-mash of HTML and
> > wiki markup. But I'm not sure if it's not already too late to
> > revisit those decisions.
> >
> > But if it isn't, I'll be happy to discuss what a syntax might
> > look like.
>
> Wiki is still a new concept. Think how HTML was based on SGML, then
> evolved into HTML 2, 3, 4, and then XML came along, because people
> understood from the HTML experience that SGML was overly complex.
>
> There is a big world of PhpWikis out there with [single bracket] link
> syntax. There are other wiki implementations with different ideas about
> syntax. But no wiki is as big as Wikipedia, so this is the most
> concentrated amount of experience. This is where a format standard should
> or at least could start to form.

I tried to make a formal grammar for Wikipedia markup (LALR, regexps, or
whatever), and I can tell you that it's next to impossible if almost
arbitrary HTML markup is allowed.

HTML table syntax is especially difficult to parse,
so maybe we should make our own?

Without HTML tables I think that we could limit what kind of HTML is
allowed and make some sane formal syntax.

It's not easy to design simple table markup that:
* allows multicolumn and multirow cells
* allows cell attributes
* can nest tables
* allows all constructs that HTML allows inside cells, i.e. multiple
paragraphs, lists etc.
* is readable
* is easy to write

So I suggest that you check http://sf.net/projects/freetable
I made this a while ago to allow simpler HTML tables.
It seems to be working and is used by WebMake and WebsiteMetaLanguage.

Syntax looks something like this:

<wwwtable border=1>
(1,1)
column 1, row 1
(+,)
the same column, next row
(*,2)
column 2 in any row
(*,3) align=center
columns 3 should be centered
(1,3)
Some centered text
(3,3)
Other centered text
</wwwtable>

which is converted to:
<table border=1>
<tr>
<td>column 1, row 1</td>
<td>column 2 in any row</td>
<td align=center>columns 3 should be centered Some centered text</td>
</tr>
<tr>
<td>the same column, next row</td>
<td>column 2 in any row</td>
<td align=center>columns 3 should be centered</td>
</tr>
<tr>
<td>&nbsp;</td>
<td>column 2 in any row</td>
<td align=center>columns 3 should be centered Other centered text</td>
</tr>
</table>

I'd say it's much better than what Wikipedia currently uses.
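
To make the addressing rules in the example above concrete, here is a
minimal Python sketch of a converter. It is not freetable's actual code:
the names HEADER, resolve and convert are made up for illustration, and the
rules are inferred only from the sample (coordinates are (row,column), an
empty slot repeats the previous value, "+" means previous plus one, "*"
matches any row or column). It takes the lines between the wwwtable tags
and emits only the rows.

```python
import re

# A cell header such as "(1,3)", "(+,)" or "(*,3) align=center".
HEADER = re.compile(r"^\(([0-9]*|[+*]),([0-9]*|[+*])\)\s*(.*)$")


def resolve(spec, prev):
    """Map one coordinate spec to a number, or None for the '*' wildcard."""
    if spec == "*":
        return None
    if spec == "":
        return prev          # empty slot: same as previous cell
    if spec == "+":
        return prev + 1      # next row or column
    return int(spec)


def convert(body):
    entries = []             # (row, col, attrs, text_lines); None = any
    row = col = 1
    for line in body.strip().splitlines():
        line = line.strip()
        m = HEADER.match(line)
        if m:
            r, c = resolve(m.group(1), row), resolve(m.group(2), col)
            row = row if r is None else r
            col = col if c is None else c
            entries.append((r, c, m.group(3), []))
        elif entries and line:
            entries[-1][3].append(line)   # text belongs to the last header
    n_rows = max((e[0] for e in entries if e[0] is not None), default=0)
    n_cols = max((e[1] for e in entries if e[1] is not None), default=0)
    out = []
    for r in range(1, n_rows + 1):
        out.append("<tr>")
        for c in range(1, n_cols + 1):
            hits = [e for e in entries
                    if e[0] in (None, r) and e[1] in (None, c)]
            # Wildcard entries contribute first, exact cells last.
            hits.sort(key=lambda e: (e[0] is not None) + (e[1] is not None))
            attrs = " ".join(e[2] for e in hits if e[2])
            text = " ".join(" ".join(e[3]) for e in hits if e[3])
            out.append("<td%s>%s</td>"
                       % (" " + attrs if attrs else "", text or "&nbsp;"))
        out.append("</tr>")
    return "\n".join(out)
```

Multirow and multicolumn cells, nesting and the enclosing table tag are
left out of this sketch.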
Re: Parsing
On Thu, Jul 25, 2002 at 02:25:18AM -0700, lcrocker@nupedia.com wrote:
>
> I'd love to have a formal grammar of some kind (I think regexps
> would be fine),

Hmm, I seem to remember I promised that once. :-/ I'll see what I can do. If
people want to help, just go to

http://www.wikipedia.com/wiki/User:Jan_Hidders/Wikipedia_syntax

(I probably should put this on the meta-wikipedia.)

Just to be clear: the syntax should not describe what we accept and do not
accept (we actually accept everything, so that's a really simple grammar :-))
but should have enough "resolution" to allow us to specify the semantics of
the mark-up. We should not concentrate first on making it LALR(1) or
anything, but just on making it unambiguous (in the parsing sense of the
word) and complete.
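
As a rough illustration of "accept everything, but unambiguously", here is
a minimal Python sketch of a tokenizer for the quote mark-up only. The
token set and the name tokenize are assumptions made for this example, not
the proposed Wikipedia syntax. Because the alternatives are tried in a
fixed order and the last two cover every character, each input has exactly
one tokenization and no input is rejected.

```python
import re

# Ordered alternatives: ''' (bold), '' (italic), a run of plain text,
# and finally a single stray quote so that nothing is ever rejected.
TOKEN = re.compile(r"'''|''|[^']+|'")


def tokenize(text):
    """Split text into mark-up and text tokens; joining them gives text back."""
    return TOKEN.findall(text)
```

Deciding what each token means (for instance whether an unmatched '' is
literal text) is then a separate, semantic step.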

> and I agree with Jan that a totally wiki-specific syntax would be far
> better than our current mish-mash of HTML and wiki markup. But I'm not
> sure if it's not already too late to revisit those decisions.

Was it a conscious decision? I got the impression the early software didn't
filter out HTML so people used it and now we are stuck with it.

Apart from the big technical advantages I still feel that having a simple
HTML-free mark-up language is necessary to keep Wikipedia accessible for
newcomers. Having lots of complicated HTML that is not very WYSIWYG makes
editing harder. This inevitably means that you cannot do a lot of fancy
lay-out things, but I believe that is not a bug but a feature.

So, yes, it is probably impossible to come up with an HTML-free mark-up that
has an equivalent for all the HTML that is currently used. However, we would
probably be breaking only a very small percentage of pages and we could even
automatically detect those pages and put them on a "to be simplified" list.

> But if it isn't, I'll be happy to discuss what a syntax might
> look like.

I have once made a proposal on

http://www.wikipedia.com/wiki/User:Jan_Hidders/HTML-free_mark-up

but I have to admit that it was mainly to draw some discussion.

-- Jan Hidders
Re: Parsing
Jan.Hidders wrote:
> Was it a conscious decision? I got the impression the early software didn't
> filter out HTML so people used it and now we are stuck with it.

That's more or less right. There was a discussion, but really, the code is
the law.

> Apart from the big technical advantages I still feel that having a simple
> HTML-free mark-up language is necessary to keep Wikipedia accessible for
> newcomers. Having lots of complicated HTML that is not very WYSIWYG makes
> editing harder. This inevitably means that you cannot do a lot of fancy
> lay-out things, but I believe that is not a bug but a feature.

My grasp of the consensus (but maybe I am misremembering... perhaps
this is just my grasp of my own opinion!) is that going out of our way
to be HTML-free is not a good thing, but that not allowing the many
html-nightmares is a good thing.

For example, many many many people, not just programmers, understand
how to make html <b>bold</b> and <i>italics</i>. Those are intuitive
and harmless. The original Ward Cunningham wiki solution of ' and ''
and ''' for different things, well, that was never very intuitive and
newcomers didn't know about it.

Supporting some html tags, familiar and harmless ones, seems like a
good idea.
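
A minimal Python sketch of what "supporting some html tags" could mean in
code: escape everything, then restore only bare tags from a whitelist. The
whitelist contents and the name sanitize are assumptions for illustration,
not the behaviour of the current software.

```python
import html
import re

ALLOWED = {"b", "i"}  # assumed whitelist of "familiar and harmless" tags

# After escaping, a bare tag such as <b> or </i> looks like &lt;b&gt;.
ESCAPED_TAG = re.compile(r"&lt;(/?)([A-Za-z]+)&gt;")


def sanitize(text):
    """Escape all HTML, then re-enable attribute-free whitelisted tags."""
    escaped = html.escape(text, quote=False)

    def restore(m):
        name = m.group(2).lower()
        if name in ALLOWED:
            return "<%s%s>" % (m.group(1), name)
        return m.group(0)  # leave every other tag escaped

    return ESCAPED_TAG.sub(restore, escaped)
```

This sketch does not check that tags are balanced or properly nested, so
well-formedness remains a separate problem.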

Of course, at this point, we do have an established userbase of
writers, some of whom are known to us as regulars, but lots and lots
of whom may only show up once every month to write a little bit. I
think we have a duty not to change anything in a way that will
astonish them.

--Jimbo
Re: Parsing
On Fri, Jul 26, 2002 at 07:00:28AM -0700, Jimmy Wales wrote:
> Jan.Hidders wrote:
>
> My grasp of the consensus (but maybe I am misremembering... perhaps
> this is just my grasp of my own opinion!) is that going out of our way
> to be HTML-free is not a good thing, but that not allowing the many
> html-nightmares is a good thing.

I think you are right that this was and is the consensus. Doesn't mean I
can't try, does it? :-)

> For example, many many many people, not just programmers, understand
> how to make html <b>bold</b> and <i>italics</i>. Those are intuitive
> and harmless. The original Ward Cunningham wiki solution of ' and ''
> and ''' for different things, well, that was never very intuitive and
> newcomers didn't know about it.

<rave>
Oh, come on! How long does it take for newcomers to grasp what '' and
''' means? I agree that in itself there is nothing wrong with <b> and <i>
although I personally think they are slightly less easy to read than the
WikiWiki notation and I think it is always better to simply have one notation
for every mark-up.

However, they are at the root of the HTML problems because once you start
allowing such HTML-like mark-up the software has to decide which tags it
allows and which it doesn't and if the stuff is well-formed or not and
that's just hard to do. So in the beginning it simply wasn't done, and
because, as you said, the code was and is the law, the life of the parser
has now become unnecessarily difficult, and some pages look more and more
like regular HTML pages that are hard to understand and edit.

It really would have been so much better if right from the beginning somebody
had said: no HTML tags. It really would have. *sigh*
</rave>

> Of course, at this point, we do have an established userbase of writers,
> some of whom are known to us as regulars, but lots and lots of whom may
> only show up once every month to write a little bit. I think we have a
> duty not to change anything in a way that will astonish them.

You are right, of course, but I would say that we also have a duty towards
the hundreds of thousands of potential contributors who will come to visit us
in the future. And seeing how Wikipedia is happily humming along at the
moment and even getting a facial soon, I'm quite sure that they will come.
:-)

-- Jan Hidders