Mailing List Archive

Update on abstract schema and schema changes
Hello,
It has been a while since I gave an update on the state of abstracting
schema and schema changes in MediaWiki
<https://phabricator.wikimedia.org/T191231>. So here's a really long one.

So far around half of the MediaWiki core tables have been migrated to
abstract schema (plus lots of extensions like Wikibase, Babel, Linter,
BetaFeatures, etc.). Special thanks to Tgr for reviewing most of the
patches, and to Sam Reed and James Forrester for doing the extensions.

With the growing number of schemas being abstracted, this is going to
affect your development if you work on schemas and schema changes in core
or any of the extensions. So if you do, please read Manual:Schema changes
<https://www.mediawiki.org/wiki/Manual:Schema_changes> on mediawiki.org.

You might think that abstraction is just migrating SQL to JSON, but it's
much more: we are making the database schema of MediaWiki much more
consistent. We are basically addressing several long-standing issues like
T164898 <https://phabricator.wikimedia.org/T164898> and T42626
<https://phabricator.wikimedia.org/T42626> as well.
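To give a sense of what the JSON looks like, an abstract table definition
is roughly of this shape (a simplified, hypothetical table for
illustration; the real definitions live in maintenance/tables.json, and
the exact type names and options there may differ):

```json
{
	"name": "example_table",
	"columns": [
		{ "name": "ex_id",
		  "type": "integer",
		  "options": { "autoincrement": true, "notnull": true } },
		{ "name": "ex_timestamp",
		  "type": "mwtimestamp",
		  "options": { "notnull": true } }
	],
	"indexes": [
		{ "name": "ex_timestamp", "columns": [ "ex_timestamp" ], "unique": false }
	],
	"pk": [ "ex_id" ]
}
```

From one such definition, the schema generator can emit the correct DDL
for MySQL, SQLite, and Postgres, which is what eliminates the per-DBMS
drift described below.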

*Improvement aspects*

The first aspect is drift between different DBMSes. The SQLite schema is
produced by regex replacement (this code
<https://github.com/wikimedia/mediawiki/blob/c477bcf2c5c482d3189ec3579c5dee444eb06f7d/includes/libs/rdbms/database/DatabaseSqlite.php#L898>),
which is less than great, but at least it comes from one place. The
Postgres schema, however, has drifted so drastically from MySQL/SQLite
that fixing it has so far required 76 schema changes, addressing issues
ranging from missing indexes to missing primary keys, extra
AUTO_INCREMENT where it shouldn't be, missing DEFAULT values, drifting
data types, and much more. You can follow the Postgres fixes here
<https://phabricator.wikimedia.org/T164898>.

The second aspect is inconsistency in the schema itself. How do we model
strings? VARCHAR? VARBINARY? VARCHAR() BINARY? (All three are different
things.) You'd be surprised how inconsistent our MySQL schema is. So far,
we are migrating all VARCHAR() BINARY fields to VARBINARY (ten schema
changes so far).
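To see why the distinction matters: under MySQL's default collations a
plain VARCHAR compares case-insensitively, while VARBINARY compares raw
bytes. A rough Python analogy (illustrative only, not actual MySQL
behavior code):

```python
# VARCHAR under a case-insensitive collation: 'Foo' and 'foo' compare
# equal. Python's casefold() approximates that comparison.
varchar_equal = "Foo".casefold() == "foo".casefold()

# VARBINARY: values are raw bytes, compared byte-for-byte, so the same
# two strings are NOT equal.
varbinary_equal = "Foo".encode() == "foo".encode()

print(varchar_equal)    # True
print(varbinary_equal)  # False
```

Mixing the two types across tables means the same join or comparison can
silently behave differently depending on which column it touches.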

Another inconsistency is timestamps. In MySQL, around half of them are
BINARY(14) and the other half VARBINARY(14) (while in Postgres all of
them are TIMESTAMPTZ); there is even a ticket
<https://phabricator.wikimedia.org/T42626> about it. It makes sense to
migrate all of them to BINARY(14), but not all timestamps are 14
characters: expiry fields, for example, accept "infinity" as a value,
and it's a valid timestamp in Postgres ¯\_(ツ)_/¯ When you turn an expiry
field into BINARY(14), "infinity" gets padded to 14 bytes, and as a
result MediaWiki no longer recognizes it as infinity (the padded value
is not equal to "infinity"). There are several ways forward for handling
expiry fields; you can follow the discussion in this gerrit patch
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/631936>.
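The padding problem fits in a few lines of Python (an analogy, not
MediaWiki code; MySQL pads BINARY values on the right with 0x00 bytes up
to the declared length):

```python
# Simulate storing the string "infinity" in a fixed-width BINARY(14)
# column: the value is right-padded to 14 bytes on storage.
stored = b"infinity".ljust(14, b"\x00")

print(stored)                 # b'infinity\x00\x00\x00\x00\x00\x00'
print(stored == b"infinity")  # False: the padded value no longer matches

# The round-trip only works if the application strips the padding again:
print(stored.rstrip(b"\x00") == b"infinity")  # True
```

So either every read path has to strip padding, or expiry fields need a
different representation; that trade-off is what the gerrit discussion
above is about.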

Another fun aspect: booleans. MySQL doesn't have a real boolean type; it
translates BOOLEAN to TINYINT(1). Other DBMSes don't have TINYINT, but
they do have SMALLINT and BOOL (and we mostly use SMALLINT for these
already), so we decided to go with SMALLINT for these cases. This is
different from what Doctrine DBAL does (it uses BOOL), so we introduced
our own custom type for booleans.
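Conceptually, such a custom type is just a pair of conversions at the
application boundary. A sketch in Python (illustrative only; the real
implementation is a custom Doctrine DBAL type in PHP):

```python
# Booleans are stored as 0/1 in a SMALLINT column and converted back
# when reading, so application code never sees the integer encoding.
def bool_to_db(value: bool) -> int:
    """Encode a boolean for storage in a small integer column."""
    return 1 if value else 0

def bool_from_db(value: int) -> bool:
    """Decode a stored 0/1 small integer back into a boolean."""
    return value != 0

print(bool_to_db(True))    # 1
print(bool_from_db(0))     # False
```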

Last but not least: ENUMs. MySQL and Postgres support them but SQLite
doesn't, and Doctrine DBAL doesn't support ENUM at all (as it's an
anti-pattern), while core has eight fields that are ENUMs. There's an RFC
to discourage using it in general
<https://phabricator.wikimedia.org/T119173>; feel free to comment on it.

A miscellaneous note: the directories that hold the archive of SQL
patches for schema changes are exploding (some of the SQL patches are
even orphaned, but we can't find them because there are so many). So I
started an RFC to clean that mess up: Drop support for database upgrades
older than two LTS releases <https://phabricator.wikimedia.org/T259771>

*What's next?*

- We continue to migrate more tables; hopefully we will get two thirds
of them done by the end of the year (fingers crossed). You can follow
the progress in its ticket <https://phabricator.wikimedia.org/T230428>.
- We will support abstract schema changes really soon, as in a couple
of weeks. Basically, you write a JSON file containing before and after
snapshots of a table, and then a maintenance script produces the needed
SQL patches for you for the different schemas. This will increase
developer productivity drastically, since 1) schema change SQL files
become more reliable, consistent, and less prone to errors like adding
the column to the wrong table in some DBMSes
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/328377/22/maintenance/mssql/archives/patch-user_groups-ug_expiry.sql#4>
and 2) you don't need to know Postgres or SQLite peculiarities to make
patches against them. The reason you need to provide the whole table
even when adding something like an index is that SQLite doesn't support
all forms of ALTER TABLE; in some cases you have to create a temporary
table, move the data around, and then rename and drop, producing
beautiful SQL patches like this
<https://gerrit.wikimedia.org/r/c/mediawiki/core/+/630341/1/maintenance/sqlite/archives/patch-querycachetwo-qcc_title-varbinary.sql>
- We are working on improving the script that reports drifts between
core and our production databases. I have already made it work with
abstract schemas as well, and I will continue working on it so that it
reports even smaller differences like field size, type, etc., which is
now much easier thanks to the abstract schema. Slowly we will migrate
that script to production (as part of the SRE scripts) and set up
automated reports and automated drift fixes (on small wikis). You can
follow the work in this ticket
<https://phabricator.wikimedia.org/T104459>. So far this script has
been run manually, but it has already found many thousands of drifts
across the cluster, and all of them have been fixed thanks to our
amazing DBAs (look at the ticket).
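For context on the SQLite limitation mentioned above, here is the
copy-and-rename dance in miniature, using Python's built-in sqlite3
module and a hypothetical table (SQLite's ALTER TABLE cannot change an
existing column's type, so the patch has to go through a new table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A hypothetical table whose 'title' column we want to change from TEXT
# to BLOB.
cur.execute("CREATE TABLE querycache (title TEXT, namespace INTEGER)")
cur.execute("INSERT INTO querycache VALUES ('Main_Page', 0)")

# 1. Create a new table with the desired schema.
cur.execute("CREATE TABLE querycache_tmp (title BLOB, namespace INTEGER)")
# 2. Copy the data across.
cur.execute("INSERT INTO querycache_tmp SELECT title, namespace FROM querycache")
# 3. Drop the old table and rename the new one into place.
cur.execute("DROP TABLE querycache")
cur.execute("ALTER TABLE querycache_tmp RENAME TO querycache")
conn.commit()

print(cur.execute("SELECT namespace FROM querycache").fetchone())  # (0,)
```

Generating these multi-statement patches mechanically from the JSON
snapshots is exactly what the upcoming maintenance script removes from
developers' plates.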


*How can I help?*
Glad you asked! You can follow the abstract-schema
<https://gerrit.wikimedia.org/r/q/hashtag:%22abstract-schema%22+(status:open%20OR%20status:merged)>
hashtag in gerrit and review patches, or you can make them yourself
(familiarize yourself with the documentation
<https://www.mediawiki.org/wiki/Manual:Schema_changes> first). If you
maintain an extension, feel free to migrate its table(s) (and track it
in this ticket <https://phabricator.wikimedia.org/T259374>). If you use
Postgres for MediaWiki, please help us with testing the Postgres
improvements.

Thanks for reading this long email!

Best
--
Amir (he/him)
Re: Update on abstract schema and schema changes
I hadn’t followed along with this work and this summary is amazing, thank
you so much!

On Sat, Oct 10, 2020 at 21:14 Amir Sarabadani <ladsgroup@gmail.com> wrote:
