Mailing List Archive

Proxmox with Linstor: Online migration / disk move problem
Hi,

I'm trying to migrate VM storage to Linstor SDS and am running into some
odd problems. All nodes are running Proxmox VE 7.1:

pve-manager/7.1-5/6fe299a0 (running kernel: 5.13.19-1-pve)

Linstor storage is, for now, on one host. When I create a new VM on
Linstor it works. When I try to migrate a VM from another host (and
another storage) to Linstor it fails:

2021-11-22 13:06:53 starting migration of VM 116 to node 'proxmox-ve3'
(192.168.8.203)
2021-11-22 13:06:53 found local disk 'local-lvm:vm-116-disk-0' (in
current VM config)
2021-11-22 13:06:53 starting VM 116 on remote node 'proxmox-ve3'
2021-11-22 13:07:01 volume 'local-lvm:vm-116-disk-0' is
'linstor-local:vm-116-disk-1' on the target
2021-11-22 13:07:01 start remote tunnel
2021-11-22 13:07:03 ssh tunnel ver 1
2021-11-22 13:07:03 starting storage migration
2021-11-22 13:07:03 scsi1: start migration to
nbd:unix:/run/qemu-server/116_nbd.migrate:exportname=drive-scsi1
drive mirror is starting for drive-scsi1 with bandwidth limit: 51200 KB/s
drive-scsi1: Cancelling block job
drive-scsi1: Done.
2021-11-22 13:07:03 ERROR: online migrate failure - block job (mirror)
error: drive-scsi1: 'mirror' has been cancelled
2021-11-22 13:07:03 aborting phase 2 - cleanup resources
2021-11-22 13:07:03 migrate_cancel
2021-11-22 13:07:08 ERROR: migration finished with problems (duration
00:00:16)
TASK ERROR: migration problems

Linstor volumes are created during the migration and there are no errors
in its logs. I don't know why Proxmox is cancelling this job.
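
This is roughly all I looked at on the LINSTOR side (error reports plus
the journals of the stock systemd units from the Debian packages), and
none of it shows anything around the time of the failure:

linstor error-reports list
journalctl -u linstor-controller --since "13:00"
journalctl -u linstor-satellite --since "13:00"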

When I try to move a disk from NFS to Linstor (online) it fails:

create full clone of drive scsi0 (nfs-backup:129/vm-129-disk-0.qcow2)

NOTICE
Trying to create diskful resource (vm-129-disk-1) on (proxmox-ve3).
drive mirror is starting for drive-scsi0 with bandwidth limit: 51200 KB/s
drive-scsi0: Cancelling block job
drive-scsi0: Done.
TASK ERROR: storage migration failed: block job (mirror) error:
drive-scsi0: 'mirror' has been cancelled


To move storage to Linstor I first have to move it to NFS (online), turn
off the VM, and move the VM storage offline to Linstor. The bizarre thing
is that once I do this, I can move this particular VM's storage from
Linstor to NFS online and from NFS to Linstor online. I can also migrate
the VM online, from Linstor, directly to another node and another storage
without problems.
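
In CLI terms the only sequence that works is roughly this (give or take
the exact qm syntax; the VM/disk/storage names are the ones from my setup):

qm move_disk 116 scsi1 nfs-backup       # online move to shared NFS: works
# (the VM has to end up on the Linstor node; linstor-local is
#  restricted to proxmox-ve3 in storage.cfg)
qm shutdown 116
qm move_disk 116 scsi1 linstor-local    # offline move to Linstor: works
qm start 116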

I've set up a test cluster to reproduce this problem and couldn't - online
migration to Linstor storage just worked. I don't know why it's not
working on the main cluster - any hints on how to debug it?

storage.cfg:

dir: local
        path /var/lib/vz
        content vztmpl,iso,backup

lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images
        nodes proxmox-ve5,proxmox-ve4

drbd: linstor-local
        content images,rootdir
        controller 192.168.8.203
        resourcegroup linstor-local
        preferlocal yes
        nodes proxmox-ve3

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        nodes proxmox-ve0,proxmox-ve1,proxmox-ve2
        sparse 1

nfs: nfs-backup
        export /data/nfs
        path /mnt/pve/nfs-backup
        server backup2
        content rootdir,backup,images,iso,vztmpl
        options vers=3

--
Best regards,
Łukasz Wąsikowski
_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user
Re: Proxmox with Linstor: Online migration / disk move problem
On Mon, Nov 22, 2021 at 03:28:10PM +0100, Łukasz Wąsikowski wrote:
> [...]

Hi Łukasz,

I have heard of that once before, but never experienced it myself and so
far no customers complained so I did not dive into it.

If you can reproduce it, that would be highly appreciated. To me it
looks like the plugin and LINSTOR basically did their job, but then
something else happens. These are just random thoughts that might be
complete nonsense:

- maybe some size rounding error and the resulting DRBD device is just a
tiny bit too small. If you can reproduce it, I would check the sizes of
source/destination. If the device is only slightly too small, the mirror
should at least start writing data before it fails at the end. So does it
take some time till it fails? Do you see that some data was written at
the beginning of the DRBD block device that matches the source? But maybe
there is already a size check at the beginning and it fails fast, who
knows. Maybe try with a VM that has exactly the same size as the failing
one in production.
- some race and the DRBD device isn't actually ready before the
migration wants to write data. Maybe there is more time before a
disk gets used when a VM is created vs. when existing data is written
to a freshly created device + migration.
- check dmesg to see what happened on DRBD level
- start grepping for the error messages in pve/pve-storage to see when
and why these errors happen. Find out what tool/function gets called and
then manually call that tool several times in some "linstor spawn &&
$magic_tool" loop to trigger the race (if there is one). A rough sketch
of the size check and the grep is below.

HTH, rck
Re: Proxmox with Linstor: Online migration / disk move problem
Hi Roland,

On 2021-11-23 at 08:41, Roland Kammerer wrote:

> I have heard of that once before, but never experienced it myself and so
> far no customers complained so I did not dive into it.
>
> If you can reproduce it, that would be highly appreciated. To me it
> looks like the plugin and LINSTOR basically did their job, but then
> something else happens. These are just random thoughts that might be
> complete nonsense:
>
> - maybe some size rounding error and the resulting DRBD device is just a
> tiny bit too small. If you can reproduce it, I would check the sizes of
> source/destination. If the device is only slightly too small, the mirror
> should at least start writing data before it fails at the end. So does it
> take some time till it fails? Do you see that some data was written at
> the beginning of the DRBD block device that matches the source? But maybe
> there is already a size check at the beginning and it fails fast, who
> knows. Maybe try with a VM that has exactly the same size as the failing
> one in production.

It fails at the start of the migration. I have two VMs that are identical
in size - both have a 32 GB disk. The one that was not migrated to
Linstor looks like this:

scsi0: local-lvm:vm-131-disk-0,cache=writeback,size=32G

The one that was migrated (offline, via NFS) to Linstor, and which
originally had size=32G, looks like this on Linstor:

scsi0: linstor-local:vm-125-disk-1,cache=writeback,size=33555416K

33555416 KiB is 32.0009384 GiB, slightly larger than 32 GiB.
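
In other words, the overhead is 984 KiB on top of the exact 32 GiB
(33554432 KiB) - less than 1 MiB:

$ echo $((33555416 - 32 * 1024 * 1024))
984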

> - some race and the DRBD device isn't actually ready before the
> migration wants to write data. Maybe there is more time before a
> disk gets used when a VM is created vs. when existing data is written
> to a freshly created device + migration.

I don't think so (but I may be wrong). "linstor volume list" when I
start the live migration looks like this: https://pastebin.com/FWqbq6uK

It shows "InUse" at some point during the migration.

> - check dmesg to see what happened on DRBD level

dmesg from a failed migration of a VM between nodes (thin LVM to
Linstor); the dmesg is from the target node: https://pastebin.com/rN5ZQ8vN

This VM has two disks:

scsi0: local-lvm:vm-132-disk-0,cache=writeback,size=16M
scsi1: local-lvm:vm-132-disk-1,cache=writeback,size=10244M

This VM was on Linstor at some point, because its size is 10244M instead
of the original 10G.

--
Best regards,
Łukasz Wąsikowski


Re: Proxmox with Linstor: Online migration / disk move problem
I was able to reproduce this issue on PVE 6.4 as well, with the latest
packages installed. I never used this combination before, so I'm not sure
if it is something that started happening recently after updating PVE or
LINSTOR packages. The task is cancelled almost immediately, without the
migration process starting at all, and the new LINSTOR resource is
removed instantly as well.

https://pastebin.com/i4yuKYyp
Re: Proxmox with Linstor: Online migration / disk move problem
On Wed, Nov 24, 2021 at 02:00:38PM +0000, G. Milo wrote:
> I was able to reproduce this issue on PVE 6.4 as well with the latest
> packages installed. Never used this combination before, so I'm not sure if
> it is something that started happening recently after updating PVE or
> LINSTOR packages..The task is cancelled almost immediately, without
> starting the migration process at all and the new linstor resource is
> removed instantly as well.
>
> https://pastebin.com/i4yuKYyp

okay... so what do we have:
- it can happen from local lvm (Łukasz) and local zfs (Milo)
- it can happen with about 32G (Łukasz) and smaller 11G (Milo)

Milo, as you seem to be able to reproduce it immediately, can you try
smaller volumes, like 2G? Does it happen with those as well?

Does it need to be a running VM, or can it happen if the VM is turned
off as well?

I will try to reproduce that later today/tomorrow, all information that
narrows that down a bit might help.

Thanks, rck
Re: Proxmox with Linstor: Online migration / disk move problem
On Thu, Nov 25, 2021 at 08:57:45AM +0100, Roland Kammerer wrote:
> On Wed, Nov 24, 2021 at 02:00:38PM +0000, G. Milo wrote:
> > I was able to reproduce this issue on PVE 6.4 as well with the latest
> > packages installed. Never used this combination before, so I'm not sure if
> > it is something that started happening recently after updating PVE or
> > LINSTOR packages..The task is cancelled almost immediately, without
> > starting the migration process at all and the new linstor resource is
> > removed instantly as well.
> >
> > https://pastebin.com/i4yuKYyp
>
> okay... so what do we have:
> - it can happen from local lvm (Łukasz) and local zfs (Milo)
> - it can happen with about 32G (Łukasz) and smaller 11G (Milo)
>
> Milo, as you seem to be able to reproduce it immediately, can you try
> smaller volumes, like 2G? Does it happen with those as well?
>
> Does it need to be a running VM, or can it happen if the VM is turned
> off as well?
>
> I will try to reproduce that later today/tomorrow, all information that
> narrows that down a bit might help.

I reproduced it (1G alpine image from local-lvm), let's see what I can
find out, currently I don't need input from your side.

Thanks, rck
Re: Proxmox with Linstor: Online migration / disk move problem
Just in case it helps: I did some additional tests, and for me the issue
seems to be narrowed down to the following.

Online migration from local LVM/ZFS/directory storage to ZFS-backed
LINSTOR storage always succeeds, so the problem appears to be isolated to
migrating from local storage to thin-LVM-backed LINSTOR storage.
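
One way to compare what the two backends actually allocate for the same
nominal size (the VG/zpool names and the _00000 suffix below are just
what LINSTOR typically uses; check "linstor storage-pool list" and
"linstor volume list" for the real names):

# thin-LVM backed pool: size of the backing LV (VG name is a placeholder)
lvs --units k --noheadings -o lv_size drbdpool/vm-129-disk-1_00000

# ZFS backed pool: size of the backing zvol (zpool name is a placeholder)
zfs get -H -o value volsize zpool1/vm-129-disk-1_00000

# what LINSTOR itself reports
linstor volume list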

Re: Proxmox with Linstor: Online migration / disk move problem
On Thu, Nov 25, 2021 at 11:23:35AM +0100, Roland Kammerer wrote:
> I reproduced it (1G alpine image from local-lvm), let's see what I can
> find out, currently I don't need input from your side.

Almost forgot to report back "my" findings:

tl;dr: "seems like qemu does not like moving from a smaller to a bigger disk
here.."

So please use offline migration. More details and links to forum posts
and Proxmox bug reports are here:

https://lists.proxmox.com/pipermail/pve-devel/2021-November/051103.html
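
A completely untested idea, based on Łukasz's observation that a disk
which already has the rounded size (33555416K for a 32G disk) migrates
online just fine: grow the source disk to the size LINSTOR will allocate
for it before the online move, something like

qm resize 129 scsi0 33555416K      # rounded size for a 32G disk as seen
                                   # earlier in this thread; adjust to
                                   # whatever LINSTOR actually allocates
qm move_disk 129 scsi0 linstor-local

No idea if qemu is happier with that; offline migration is the safe route.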

Regards, rck
Re: Proxmox with Linstor: Online migration / disk move problem
On 2021-12-02 at 16:53, Roland Kammerer wrote:

> On Thu, Nov 25, 2021 at 11:23:35AM +0100, Roland Kammerer wrote:
>> I reproduced it (1G alpine image from local-lvm), let's see what I can
>> find out, currently I don't need input from your side.
>
> Almost forgot to report back "my" findings:
>
> tl;dr: "seems like qemu does not like moving from a smaller to a bigger disk
> here.."
>
> So please use offline migration: more details and links to forum posts
> and proxmox bug reports here:
>
> https://lists.proxmox.com/pipermail/pve-devel/2021-November/051103.html

Well, then this is a no-go for Linstor for me. I can't migrate large
production VMs offline, and Linstor on thin LVM won't work as storage on
the same host on which "plain" thin LVM storage works. So I have to get
rid of one of them, and it has to be Linstor. It's sad it didn't work out.

Best regards,
Łukasz Wąsikowski