Mailing List Archive

pacemaker + drbd9 and prefers location constraint
Hello, everyone. I am seeing a behaviour that I cannot quite understand
when DRBD is managed by Pacemaker and I have a "prefers" location
constraint.

These are my resources:

Full List of Resources:
* Clone Set: DRBDData-clone [DRBDData] (promotable):
* Promoted: [ pcs01 ]
* Unpromoted: [ pcs02 pcs03 ]
* Resource Group: nfs:
* portblock_on_nfs (ocf:heartbeat:portblock): Started pcs01
* vip_nfs (ocf:heartbeat:IPaddr2): Started pcs01
* drbd_fs (ocf:heartbeat:Filesystem): Started pcs01
* nfsd (ocf:heartbeat:nfsserver): Started pcs01
* exportnfs (ocf:heartbeat:exportfs): Started pcs01
* portblock_off_nfs (ocf:heartbeat:portblock): Started pcs01

And I have a location preference for pcs01 on the DRBDData-clone resource:
resource 'DRBDData-clone' prefers node 'pcs01' with score INFINITY
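
(For reference, and assuming the constraint was created with pcs, the
command would have been something along the lines of:

# pcs constraint location DRBDData-clone prefers pcs01=INFINITY

together with the usual ordering/colocation constraints tying the nfs
group to the promoted DRBD instance.)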

# drbdadm status
exports role:Primary
disk:UpToDate
pcs02.lan role:Secondary
peer-disk:UpToDate
pcs03.lan role:Secondary
peer-disk:UpToDate


Now, while pcs01 is providing the resources, I mount the NFS export on a
client and start copying a 15 GB random file.
After about 5 GB have been copied, I pull the plug on the pcs01 node. After
a few seconds pcs02 is promoted and the copy resumes.
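
(For completeness, the client side was roughly the following; the VIP
address, export path and source file here are placeholders, not the
actual values:

# mount -t nfs <vip_nfs>:/<export> /mnt
# cp /path/to/random-15g.bin /mnt/
)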

Output for drbdadm status:

exports role:Primary
disk:UpToDate
pcs01.lan connection:Connecting
pcs03.lan role:Secondary congested:yes ap-in-flight:1032 rs-in-flight:0
peer-disk:UpToDate

Output for pcs status:

Node List:
* Online: [ pcs02 pcs03 ]
* OFFLINE: [ pcs01 ]

Full List of Resources:
* Clone Set: DRBDData-clone [DRBDData] (promotable):
* Promoted: [ pcs02 ]
* Unpromoted: [ pcs03 ]
* Stopped: [ pcs01 ]
* Resource Group: nfs:
* portblock_on_nfs (ocf:heartbeat:portblock): Started pcs02
* vip_nfs (ocf:heartbeat:IPaddr2): Started pcs02
* drbd_fs (ocf:heartbeat:Filesystem): Started pcs02
* nfsd (ocf:heartbeat:nfsserver): Started pcs02
* exportnfs (ocf:heartbeat:exportfs): Started pcs02
* portblock_off_nfs (ocf:heartbeat:portblock): Started pcs02



Now, when the 15 GB file is nearly 14 GB copied (so at least 9 GB would
need to resync once pcs01 is back online), I start pcs01 again.

Since pcs01 is the preferred node in Pacemaker, the services move back to
it as soon as Pacemaker detects it is online.
The question is: how can an "inconsistent/degraded" replica become Primary
before the resync is completed?

# drbdadm status
exports role:Primary
disk:Inconsistent
pcs02.lan role:Secondary
replication:SyncTarget peer-disk:UpToDate done:79.16
pcs03.lan role:Secondary
replication:PausedSyncT peer-disk:UpToDate done:78.24
resync-suspended:dependency

The service moved back to pcs01:
Node List:
* Online: [ pcs01 pcs02 pcs03 ]

Full List of Resources:
* Clone Set: DRBDData-clone [DRBDData] (promotable):
* Promoted: [ pcs01 ]
* Unpromoted: [ pcs02 pcs03 ]
* Resource Group: nfs:
* portblock_on_nfs (ocf:heartbeat:portblock): Started pcs01
* vip_nfs (ocf:heartbeat:IPaddr2): Started pcs01
* drbd_fs (ocf:heartbeat:Filesystem): Started pcs01
* nfsd (ocf:heartbeat:nfsserver): Started pcs01
* exportnfs (ocf:heartbeat:exportfs): Started pcs01
* portblock_off_nfs (ocf:heartbeat:portblock): Started pcs01


# drbdadm --version
DRBDADM_BUILDTAG=GIT-hash:\ fd0904f7bf256ecd380e1c19ec73c712f3855d40\
build\ by\ mockbuild@42fe748df8a24339966f712147eb3bfd\,\ 2023-11-01\
01:47:26
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x090111
DRBD_KERNEL_VERSION=9.1.17
DRBDADM_VERSION_CODE=0x091a00
DRBDADM_VERSION=9.26.0
# cat /etc/redhat-release
AlmaLinux release 9.3 (Shamrock Pampas Cat)


Is that a bug? Shouldn't that corrupt the filesystem?




Atenciosamente/Kind regards,
Salatiel
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user
Re: pacemaker + drbd9 and prefers location constraint
On 12/11/23 20:51, Salatiel Filho wrote:
> The question is: how can an "inconsistent/degraded" replica become
> Primary before the resync is completed?

It can, as long as the replication link is connected to another node that
has up-to-date data: reads that hit blocks which have not been resynced
yet are served from that up-to-date peer. Commanding the replication link
to disconnect will fail until the resynchronization is completed. Loss of
the replication link either freezes I/O or returns I/O errors to the
application, depending on several configuration settings.
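
In a setup like this, that behaviour is typically governed by the
quorum-related resource options. A minimal sketch, assuming DRBD 9
resource options as documented in drbd.conf(5) (adapt to your actual
configuration):

resource exports {
    options {
        quorum majority;        # require a majority of nodes to be reachable for I/O
        on-no-quorum io-error;  # return I/O errors on quorum loss; suspend-io freezes I/O instead
    }
    ...
}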

> Is that a bug? Shouldn't that corrupt the filesystem?

Not a bug, works as intended.

Cheers,
Robert