Mailing List Archive: Wikitext, Document Models, and HTML5 Output

Wikitext, Document Models, and HTML5 Output

Jan 10, 2022, 6:53 AM

Post #1 of 5 (314 views)

Wikitech-l,

Hello. I have a question about the HTML output of wiki parsers. How simple or complex would it be for a wiki parser to output, instead of a flat document structure inside a <div> element, an <article> element containing nested <section> elements?

Recently, in the Community Wishlist Survey Sandbox<https://meta.wikimedia.org/wiki/Community_Wishlist_Survey/Sandbox>, the speech synthesis of Wikipedia articles<https://meta.wikimedia.org/wiki/Community_Wishlist_Survey/Sandbox#Spoken_articles> was broached. The proposer of these ideas indicated that, for best results, some content, e.g., "See also" sections, should not be synthesized.

In response to these interesting ideas, I mentioned some ideas from EPUB, namely referencing pronunciation lexicons from HTML<https://www.w3.org/publishing/epub3/epub-contentdocs.html#sec-pls> and SSML attributes in HTML<https://www.w3.org/publishing/epub3/epub-contentdocs.html#sec-xhtml-ssml-attrib>, as well as the CSS Speech Module<https://www.w3.org/TR/css-speech-1/>, and noted that output HTML content could be styled using the CSS Speech Module's speak property.

In these regards, I started thinking about how one might extend wikitext syntax to be able to style sections, e.g.:

== See also == {style="speak:never"}

Next, I inspected the HTML of some Wikipedia articles and realized that, due to the structure of the output HTML documents, it isn't simple to style or to add attributes to sections. There are only <h2>, <h3>, <h4> (et cetera) elements inside of a containing <div> element; sections are not yet structured elements.

The gist is that, instead of outputting HTML like:

<div class="mw-parser-output">
  <h2><span class="mw-headline" id="Heading">Heading</span></h2>
  <p>Paragraph 1</p>
  <p>Paragraph 2</p>
  <h3><span class="mw-headline" id="Subheading">Subheading</span></h3>
  <p>Paragraph 3</p>
  <p>Paragraph 4</p>
</div>

could a wiki parser output HTML5 like:

<article class="mw-parser-output">
  <section id="Heading">
    <header><h2><span class="mw-headline">Heading</span></h2></header>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
    <section id="Subheading">
      <header><h3><span class="mw-headline">Subheading</span></h3></header>
      <p>Paragraph 3</p>
      <p>Paragraph 4</p>
    </section>
  </section>
</article>

Initial thoughts regarding the latter HTML5 are that it is better structured, more semantic, more styleable, and potentially more accessible. If there is any interest, I could write up a lengthier discussion comparing the two and explaining why one might be better, and more useful, than the other.
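
To make the transformation concrete, here is a minimal sketch in Python of how the flat output could be regrouped into nested sections. It is only an illustration under strong assumptions: the input is a list of already-serialized top-level elements, one per line, every heading carries its id on the mw-headline span as in the example, and all tags are balanced. The function name nest_sections is hypothetical; a real parser would work on a DOM or token stream rather than on lines of text.

```python
import re

# Matches a heading line such as <h2><span ... id="Heading">Heading</span></h2>
HEADING = re.compile(r'<h([1-6])>.*?id="([^"]+)".*?</h\1>')


def nest_sections(flat_lines):
    """Wrap each heading and the content that follows it in a <section>."""
    out = ['<article class="mw-parser-output">']
    open_levels = []  # heading levels of the currently open <section> elements
    for line in flat_lines:
        m = HEADING.match(line)
        if m:
            level = int(m.group(1))
            # Close every open section at the same or a deeper level.
            while open_levels and open_levels[-1] >= level:
                open_levels.pop()
                out.append("  " * (len(open_levels) + 1) + "</section>")
            indent = "  " * (len(open_levels) + 1)
            out.append(f'{indent}<section id="{m.group(2)}">')
            open_levels.append(level)
            # Move the id from the headline span to the section.
            heading = line.replace(f' id="{m.group(2)}"', "")
            out.append(f"{indent}  <header>{heading}</header>")
        else:
            out.append("  " * (len(open_levels) + 1) + line)
    # Close whatever is still open at the end of the document.
    while open_levels:
        open_levels.pop()
        out.append("  " * (len(open_levels) + 1) + "</section>")
    out.append("</article>")
    return out


flat = [
    '<h2><span class="mw-headline" id="Heading">Heading</span></h2>',
    "<p>Paragraph 1</p>",
    "<p>Paragraph 2</p>",
    '<h3><span class="mw-headline" id="Subheading">Subheading</span></h3>',
    "<p>Paragraph 3</p>",
    "<p>Paragraph 4</p>",
]
nested = nest_sections(flat)
print("\n".join(nested))
```

The stack of open heading levels is what decides where each section ends: a new heading closes every open section of the same or a deeper level before opening its own.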

Is this the correct mailing list for discussing wiki technology, wiki parsing, wikitext, document models, and HTML5 output?

Best regards,

Adam

Re: Wikitext, Document Models, and HTML5 Output

bawolff at gmail

Jan 11, 2022, 7:50 PM

Post #2 of 5 (314 views)

Have you seen the HTML structure of Parsoid's output?

E.g.
https://en.wikipedia.org/api/rest_v1/page/html/Dog
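
One way to inspect that structure is sketched below in Python, using only the standard library. It lists the nesting depth of every <section> element in a page's HTML. The data-mw-section-id attribute name is an assumption based on my reading of Parsoid's output, the User-Agent string is a placeholder, and the helper names are hypothetical. The fetch function makes a network request against the endpoint quoted above, so it is defined here but not called.

```python
from html.parser import HTMLParser
from urllib.parse import quote
from urllib.request import Request, urlopen


class SectionOutline(HTMLParser):
    """Collect (nesting depth, data-mw-section-id) for every <section>."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.sections = []

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self.depth += 1
            # The attribute may be absent; .get() then yields None.
            self.sections.append((self.depth, dict(attrs).get("data-mw-section-id")))

    def handle_endtag(self, tag):
        if tag == "section":
            self.depth -= 1


def section_outline(html):
    """Return a list of (depth, section id) pairs in document order."""
    parser = SectionOutline()
    parser.feed(html)
    parser.close()
    return parser.sections


def fetch_parsoid_html(title):
    """Download the Parsoid HTML for a page title (performs a network call)."""
    url = "https://en.wikipedia.org/api/rest_v1/page/html/" + quote(title, safe="")
    req = Request(url, headers={"User-Agent": "section-outline-sketch/0.1"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Calling section_outline(fetch_parsoid_html("Dog")) would then show how deeply the sections of that article are nested.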

--
Bawolff

Re: Wikitext, Document Models, and HTML5 Output

adamsobieski at hotmail

Jan 12, 2022, 9:42 AM

Post #3 of 5 (314 views)

Bawolff,

Thank you. I just checked out the Parsoid output and do see nested <section> elements.

Brainstorming, what do you think about wikitext syntax for putting class, style, or other attributes on sections?

Best regards,
Adam

Re: Wikitext, Document Models, and HTML5 Output

cananian at wikimedia

Jan 13, 2022, 7:22 AM

Post #4 of 5 (314 views)

First: <section> tags are definitely on the near-future roadmap. There are
some issues with balancing tags (when the section deliberately has an
unclosed <div>) that are awaiting the completion of the Parsoid transition,
but it will certainly happen.

WRT adding additional attributes/properties to certain constructs -- yes,
this is a somewhat pervasive issue with wikitext. List items and headings
are probably the biggest examples of "un-annotatable" constructs.
Typically folks hack around the issue by using literal <div> or <span> tags
in their markup ... but then that leads to the nesting issues described in
the second sentence above.

https://phabricator.wikimedia.org/T230658 contains a discussion, originally
in the context of list item attributes, and in
https://phabricator.wikimedia.org/T230658#5916798 it is proposed to have a
general-purpose `{{#attr}}` parser function which would attach itself to
the nearest containing HTML tag. More discussion of that proposal in
https://phabricator.wikimedia.org/T230658#5786980. I think that would
address your use case?
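
To illustrate the "attach to the nearest containing HTML tag" idea, here is a rough Python sketch. It is not the proposed implementation: it assumes a toy markup in which the parser function appears inline as {{#attr|key=value|...}} inside already-balanced HTML, it ignores void and self-closing elements, and the name apply_attr_markers is hypothetical.

```python
import html
import re

# Tokens, tried in order: an {{#attr|...}} marker, a closing tag,
# an opening tag, or a run of plain text.
TOKEN = re.compile(
    r"(?P<attr>\{\{#attr\|(?P<pairs>[^{}]*)\}\})"
    r"|(?P<close></[a-zA-Z][a-zA-Z0-9]*>)"
    r"|(?P<open><(?P<oname>[a-zA-Z][a-zA-Z0-9]*)(?P<oattrs>[^<>]*)>)"
    r"|(?P<text>[^<{]+|.)",
    re.DOTALL,
)


def apply_attr_markers(markup):
    """Move each {{#attr|...}} marker onto its nearest enclosing open tag."""
    out = []    # output pieces; open tags are kept as mutable [name, attrs]
    stack = []  # indexes into `out` of the tags that are currently open
    for m in TOKEN.finditer(markup):
        if m.group("attr"):
            if stack:  # a marker outside any tag is silently dropped
                tag = out[stack[-1]]
                for pair in m.group("pairs").split("|"):
                    key, _, value = pair.partition("=")
                    escaped = html.escape(value.strip(), quote=True)
                    tag[1] += f' {key.strip()}="{escaped}"'
        elif m.group("open"):
            stack.append(len(out))
            out.append([m.group("oname"), m.group("oattrs")])
        elif m.group("close"):
            if stack:
                stack.pop()
            out.append(m.group("close"))
        else:
            out.append(m.group("text"))
    return "".join(p if isinstance(p, str) else f"<{p[0]}{p[1]}>" for p in out)


print(apply_attr_markers("<ul><li>{{#attr|id=foo|class=bar}}xyz</li></ul>"))
```

The same mechanism would cover headings, since a marker placed inside a heading would land on the heading element that contains it.
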
--scott

--
(http://cscott.net)

Re: Wikitext, Document Models, and HTML5 Output

adamsobieski at hotmail

Jan 13, 2022, 12:25 PM

Post #5 of 5 (314 views)

Scott,

Thank you.

I'm glad to hear that <section> elements are on the roadmap.

Yes, that item (T230658<https://phabricator.wikimedia.org/T230658>) addresses the use case.

In addition to {{#attr}} and the syntactic possibilities indicated in T230658<https://phabricator.wikimedia.org/T230658>:

:::<attr id=foo class=bar/> xyz ("magic extension")
:::{{#attr|id=foo|class=bar}} ("magic parser function")
:::|id=foo|class=bar| ("like table syntax")
:::[id=foo][class=bar] ("like CSS syntax")
::: id=foo class=bar <<< xyz >>> ("requires use of heredoc syntax")

===<attr id=foo> foo ===

=== id=foo class=bar <<< heading >>> ===

syntactic possibilities for sections also include:

== Section == style="..."

== Section == { style="..." }

== Section == @ style="..."

== Section ==
@ style="..."

== Section == | style="..."

== Section ==
| style="..."

There could be some complexity if end users want to style both the <section> element and its nested <h2> element. In these regards, the following additional syntactic possibilities could be useful for disambiguation:

== Section @ style="..." ==

== Section | style="..." ==

That is, syntactic possibilities could be combined for simultaneously styling <h2> and <section> elements:

== Section @ style="...h2..." == @ style="...section..."

== Section | style="...h2..." == | style="...section..."

== Section @ style="...h2..." ==
@ style="...section..."

== Section | style="...h2..." ==
| style="...section..."
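
As a rough illustration of how the single-line @ forms above could be read, here is a small Python sketch. It is a toy under stated assumptions: only the @ separator is handled, attribute values are double-quoted, neither the title nor the values contain "@" or a run of "=" as long as the heading delimiter, and the name parse_heading is hypothetical.

```python
import re

# ==...== delimiters, the heading content, then optional "@ attrs" for the section.
HEADING = re.compile(r'^(={1,6})\s*(.*?)\s*\1(?:\s*@\s*(.*))?$')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_heading(line):
    """Split a heading line into level, title, heading attrs and section attrs."""
    m = HEADING.match(line.strip())
    if m is None:
        return None
    inner, section_part = m.group(2), m.group(3) or ""
    # Anything after "@" inside the delimiters applies to the heading itself.
    title, _, heading_part = inner.partition("@")
    return {
        "level": len(m.group(1)),
        "title": title.strip(),
        "heading_attrs": dict(ATTR.findall(heading_part)),
        "section_attrs": dict(ATTR.findall(section_part)),
    }


print(parse_heading('== See also == @ style="speak:never"'))
print(parse_heading('== Section @ style="color:red" == @ style="speak:never"'))
```

Attributes written inside the delimiters go to the heading element and attributes written after the closing delimiter go to the section, which is the disambiguation described above.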

In any event, it looks like adding styles and attributes to sections is being looked into, which is good news for the use case: using styling to indicate which sections should not be rendered to audio by hypertext-to-speech engines.
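
For illustration, a speech front end could honor such styling roughly as in the Python sketch below. The assumptions are significant: only inline style attributes are examined (no stylesheets), a muted element mutes all of its descendants, any "speak: always" override on a descendant is ignored, and the names SpeakableText and speakable_text are hypothetical.

```python
import re
from html.parser import HTMLParser

SPEAK_NEVER = re.compile(r"(?:^|;)\s*speak\s*:\s*never\s*(?:;|$)")
# Void elements have no end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SpeakableText(HTMLParser):
    """Collect text that is not inside an element styled speak:never."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one bool per open element: True if it is muted
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        style = dict(attrs).get("style") or ""
        inherited = bool(self.stack) and self.stack[-1]
        self.stack.append(inherited or bool(SPEAK_NEVER.search(style)))

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        muted = bool(self.stack) and self.stack[-1]
        if not muted and data.strip():
            self.chunks.append(data.strip())


def speakable_text(html):
    """Return the text a speech engine would be given, joined by spaces."""
    parser = SpeakableText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    '<article><section id="Intro"><h2>Intro</h2><p>Dogs are mammals.</p></section>'
    '<section id="See_also" style="speak:never"><h2>See also</h2><p>Cat</p></section>'
    "</article>"
)
print(speakable_text(page))
```

This only works cleanly because the section is an element of its own; with the flat heading structure there is no single element on which to hang the style.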

Best regards,
Adam
