Report: High Availability and Distributed Storage miniconf at LCA 2012

Hi All,

Apologies for the mass email, but it seemed most appropriate to post a
followup to all the lists I originally sent the LCA 2012 HA miniconf CFP
to. I would humbly suggest that any miniconf-related replies be sent
either directly to me, or to ha-wg@lists.linux-foundation.org.
Comments on the HA BOF mentioned below should probably go to either
pacemaker@oss.clusterlabs.org or linux-ha@lists.linux-ha.org.

==========

The High Availability and Distributed Storage miniconf[1] at LCA 2012
went very well. Probably 60+ in attendance (so about 1/8th of the
conference attendees, given 7 other concurrent miniconfs), with maybe a
few fewer later in the day. First half was more linux-ha type stuff,
second half more database-y, with a bit of CTDB and Samba foo in the
middle. Sadly we didn't actually get much in the way of distributed
storage talks -- oddly enough, there was a conspicuous absence of
Gluster and Ceph talks in the main conf track as well. We hope to have
better luck next year (I plan to propose this miniconf again).

The talks were almost all 25 minute slots, as follows:

Storage Replication in High-Performance High-Availability Environments
http://www.youtube.com/watch?v=l910kiEuHOM
by Florian Haas; discussion of using DRBD with flashcache to
provide failover while still keeping the cache hot.

Building a Non-Shared Storage HA Cluster with Pacemaker & PostgreSQL 9.1
http://www.youtube.com/watch?v=ON4QGfDkqwg
by Keisuke Mori; enhanced pgsql RA to work with PostgreSQL
streaming replication.

Extend Pacemaker to Support Geographically Distributed Clustering
http://www.youtube.com/watch?v=S3DB_DSVI_A
by Tim Serong on behalf of Jiaju Zhang; an introduction to
Booth (what it is, how to configure it).

HiPBX - HiAv VoIP with Open Source Software and 5000 Lines of Bash
http://www.youtube.com/watch?v=CpMifzcYSdU
by Rob Thomas; showing how he built an HA VoIP system with live
demo (which almost worked) and a rickroll. Very entertaining.

Squashing SPOFs with Common Sense, Velcro, and a Hammer
http://www.youtube.com/watch?v=6mQ65Flmri8
also by Rob Thomas; somewhat more generic (label everything,
do proper cable management etc.), but still entertaining.

CTDB Overview
http://www.youtube.com/watch?v=L7-QSbEEjS0
by Ronnie Sahlberg; CTDB's approach to clustering - run
everything everywhere instead of classic active/passive, and
know what state is safe to drop/lose if a node dies.

High Availability Login Services with Samba4 Active Directory
http://www.youtube.com/watch?v=-EeqYbEwJU8
by Kai Blin; brief overview of using Samba4 for AD auth - Kai
has a whole bunch of little embedded systems in his house
running this, which is kind of cute.

HA Lessons Learned from Darth Vader
http://www.youtube.com/watch?v=tnBz8212X5M
by Ronnie Sahlberg; essentially saying the Empire got it wrong
with the Death Star (big SPOF), but did better on Hoth with its
redundant army of AT-ATs.

MySQL for the Developer in a Post-Oracle World
http://www.youtube.com/watch?v=oJ9HnFgC48s
by Adam Donnison; the various forks of MySQL, covering both
forks of the project itself and the different companies now
providing development, consulting etc.

MySQL and Postgres Cloud Offerings
http://www.youtube.com/watch?v=UFTp0zA4Mx8
by Stewart Smith & Selena Deckelmann; basically there aren't
many sensible DB cloud offerings and/or they don't work and/or
they don't scale (I might be exaggerating, but probably not
much).

Scaling Data: Postgres, The Stack and the Future of Replication
http://www.youtube.com/watch?v=Pdgzy7KoGWU
by Selena Deckelmann; some general postgres discussion, live
demo of setting up binary replication, new stuff in 9.2.

Swift 101
http://www.youtube.com/watch?v=mX25RtDvf8E
by Monty Taylor; introduction to Swift in OpenStack - it's not
a RAID, it's not distributed storage, it's not (etc.), it's an
object store! Good for backups (large, write once, read never)
and web content (small, write once, read many).

MySQL Web Infra Scaling and Keeping it Online, Cheaply
http://www.youtube.com/watch?v=A4K-ZDDBRHI
by Arjen Lentz; the approaches his company takes when "fixing"
client systems so that they're resilient to failure (MySQL
tuning, split web/db servers, backups, monitoring, master/slave
systems etc.)

We also had two lightning talks which apparently weren't recorded. One
was Avi Miller from Oracle announcing that they're supporting DRBD 8.3
in UEK2 (which is currently in beta). The other was from Florian Haas
ranting about crappy HA stack usability (e.g. inscrutable command line
options and incomprehensible error messages). It was fun.

On Thursday, I co-presented the tutorial "High Availability Sprint: from
the brink of disaster to the Zen of Pacemaker" with Florian Haas. We
ran through basic concepts of drbd, corosync, pacemaker etc. then did a
walkthrough of setting up drbd+corosync+pacemaker+mysql on two VMs (VM
images were provided in advance, so participants could follow along).
This was well received, with people coming out of it actually
understanding what the hell we were talking about. Probably 30-40
attendees. The video is at http://www.youtube.com/watch?v=3GoT36cK6os

After that we had an HA birds of a feather session for a couple of
hours, maybe 15-20 people. Partly this was answering questions and
random discussion, but also us (myself, Florian, Andrew Beekhof) seeking
feedback about pain points with the HA stack. Comments included:

- Documentation is still too hard to find.

- crm shell lacks some facilities for automation with e.g. Puppet.
Someone wanted to be able to query the current value of a monitor op
on a resource. Querying the whole primitive and grepping is too
coarse.

- The whole stack is too complicated(?), and/or there is some concern
about maintenance of documentation going forward.

- Corosync 2.0 drops support for plugins, and requires libqb.

- Someone wants resource-agents manpage generation foo to go to a
devel package, so people shipping their own RAs can utilize that.

- A "frequently encountered errors and solutions" help page somewhere
would be of major benefit. We could probably crowdsource this to some
extent. We're still evaluating where this could be hosted best, but
currently the Clusterlabs wiki seems like the most suitable candidate.

- The need to deprecate resource agents came up again ("should I use
ocf:heartbeat:drbd or ocf:linbit:drbd?"), highlighting the need for
the overdue OCF spec update.

- Part of the reason for Red Hat's decision to use their own (new, in
development) shell for Pacemaker in RHEL 7(?) is that they want that
shell to do whole cluster setup, including corosync etc., which is a
different scope from that of the crm shell.
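
To make the crm shell automation point above a bit more concrete, here
is a minimal sketch (not something shown at the BOF) of pulling just the
monitor op settings for one resource out of CIB XML, rather than dumping
the whole primitive and grepping. The resource names and the sample XML
are made up for illustration, and the helper monitor_ops() is
hypothetical. On a real cluster the XML would presumably come from
something like `cibadmin --query`; treat that as an assumption and check
it against your Pacemaker version.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB fragment, for illustration only. On a live cluster the
# XML would come from the cluster itself (assumed: `cibadmin --query`).
SAMPLE_CIB = """
<resources>
  <primitive id="p_mysql" class="ocf" provider="heartbeat" type="mysql">
    <operations>
      <op id="p_mysql-start-0" name="start" interval="0" timeout="120s"/>
      <op id="p_mysql-monitor-30s" name="monitor" interval="30s" timeout="30s"/>
    </operations>
  </primitive>
  <primitive id="p_ip" class="ocf" provider="heartbeat" type="IPaddr2">
    <operations>
      <op id="p_ip-monitor-10s" name="monitor" interval="10s" timeout="20s"/>
    </operations>
  </primitive>
</resources>
"""


def monitor_ops(cib_xml, resource_id):
    """Return the interval/timeout of each monitor op on one primitive.

    Hypothetical helper: returns an empty list if the resource is not
    found or has no monitor op defined.
    """
    root = ET.fromstring(cib_xml.strip())
    results = []
    for primitive in root.iter("primitive"):
        if primitive.get("id") != resource_id:
            continue
        for op in primitive.iter("op"):
            if op.get("name") == "monitor":
                results.append(
                    {"interval": op.get("interval"), "timeout": op.get("timeout")}
                )
    return results


if __name__ == "__main__":
    for op in monitor_ops(SAMPLE_CIB, "p_mysql"):
        print(op["interval"], op["timeout"])
```

Something along these lines could be wrapped by a Puppet provider or
similar, which then only has to compare a couple of values instead of
parsing shell output.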

Thanks for reading, hope it was interesting.

Regards,

Tim

[1]
http://lca2012.linux.org.au/wiki/index.php/Miniconfs/HighAvailabilityAndDistributedStorage
--
Tim Serong
Senior Clustering Engineer
SUSE
tserong@suse.com