Mailing List Archive

[PATCH v6 00/12] HWPOISON: soft offline rework
Hi,

This patchset is the latest version of the soft offline rework series,
targeted for v5.9.

Since v5, I dropped the patches which tweak refcount handling in
madvise_inject_error() to avoid the "unknown refcount page" error.
I could not confirm the fix myself (the error did not reproduce with v5 in
my environment), but with this change soft_offline_page() is always called
with the refcount already held, so the error should not happen any more.
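For reference, the path that keeps the refcount held looks roughly like this
(a simplified sketch based on my reading of the mainline madvise code, not a
hunk from this series; the wrapper name and error handling are illustrative
only):

/* Sketch only: madvise(MADV_SOFT_OFFLINE) pins the page before injecting. */
static int soft_offline_inject_sketch(unsigned long start)
{
        struct page *page;

        /* Take a reference on the target page first. */
        if (get_user_pages_fast(start, 1, 0, &page) != 1)
                return -EIO;

        /*
         * MF_COUNT_INCREASED tells the memory-failure code that the
         * reference is already held by the caller.
         */
        return soft_offline_page(page_to_pfn(page), MF_COUNT_INCREASED);
}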

Dropped patches
- mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
- mm,madvise: Refactor madvise_inject_error
- mm,hwpoison: remove MF_COUNT_INCREASED
- mm,hwpoison: remove flag argument from soft offline functions

Thanks,
Naoya Horiguchi

Quoting cover letter of v5:
----
The main focus of this series is to stabilize soft offline. Historically,
soft offlined pages have suffered from race conditions because PageHWPoison
is used a little too aggressively, which (directly or indirectly) invades
other mm code that cares little about hwpoison. This results in unexpected
behavior or kernel panics, which is very far from soft offline's "do not
disturb userspace or other kernel components" policy.

The main point of this change set is to contain the target page "via the
buddy allocator": we first free the target page as we do for normal pages,
and remove it from the buddy allocator only once we confirm it has reached
the free list. There is certainly a race window with page allocation, but
that is fine: someone really wants that page and the page still works, so
soft offline can happily give up.
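In code terms, the core of the idea is roughly the following (a minimal
sketch reusing the take_page_off_buddy() helper that this series adds for
this purpose; the locking, refcount checks and error handling of the real
patches are omitted):

static bool soft_offline_free_page_sketch(struct page *page)
{
        /* Mark the page so that its contents are never trusted again. */
        SetPageHWPoison(page);

        /*
         * Pull the page out of the buddy free lists. If someone raced with
         * us and already allocated it, the page still works fine, so soft
         * offline simply gives up.
         */
        if (!take_page_off_buddy(page)) {
                ClearPageHWPoison(page);
                return false;
        }
        return true;
}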

Oscar's v4 tried to handle the race around reallocation, but that part
still seems to be a work in progress, so I decided to leave it out of the
changes targeted for v5.9. Thank you for your contribution, Oscar.

---
Previous versions:
v1: https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
v2: https://lore.kernel.org/linux-mm/20191017142123.24245-1-osalvador@suse.de/
v3: https://lore.kernel.org/linux-mm/20200624150137.7052-1-nao.horiguchi@gmail.com/
v4: https://lore.kernel.org/linux-mm/20200716123810.25292-1-osalvador@suse.de/
v5: https://lore.kernel.org/linux-mm/20200805204354.GA16406@hori.linux.bs1.fc.nec.co.jp/T/#t
---
Summary:

Naoya Horiguchi (5):
mm,hwpoison: cleanup unused PageHuge() check
mm, hwpoison: remove recalculating hpage
mm,hwpoison-inject: don't pin for hwpoison_filter
mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
mm,hwpoison: double-check page count in __get_any_page()

Oscar Salvador (7):
mm,hwpoison: Un-export get_hwpoison_page and make it static
mm,hwpoison: Kill put_hwpoison_page
mm,hwpoison: Unify THP handling for hard and soft offline
mm,hwpoison: Rework soft offline for free pages
mm,hwpoison: Rework soft offline for in-use pages
mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
mm,hwpoison: Return 0 if the page is already poisoned in soft-offline

include/linux/mm.h | 3 +-
include/linux/page-flags.h | 6 +-
include/ras/ras_event.h | 3 +
mm/hwpoison-inject.c | 18 +--
mm/madvise.c | 5 -
mm/memory-failure.c | 307 +++++++++++++++++++++------------------------
mm/migrate.c | 11 +-
mm/page_alloc.c | 60 +++++++--
8 files changed, 203 insertions(+), 210 deletions(-)

Re: [PATCH v6 00/12] HWPOISON: soft offline rework
On Thu, Aug 06, 2020 at 06:49:11PM +0000, nao.horiguchi@gmail.com wrote:
> Hi,
>
> This patchset is the latest version of the soft offline rework series,
> targeted for v5.9.
>
> Since v5, I dropped the patches which tweak refcount handling in
> madvise_inject_error() to avoid the "unknown refcount page" error.
> I could not confirm the fix myself (the error did not reproduce with v5 in
> my environment), but with this change soft_offline_page() is always called
> with the refcount already held, so the error should not happen any more.

With this patchset, arm64 still suffers from premature allocation failures
for 512M hugepages.

# git clone https://gitlab.com/cailca/linux-mm
# cd linux-mm; make
# ./random 1
- start: migrate_huge_offline
- use NUMA nodes 0,1.
- mmap and free 2147483648 bytes hugepages on node 0
- mmap and free 2147483648 bytes hugepages on node 1
madvise: Cannot allocate memory

[ 292.456538][ T3685] soft offline: 0x8a000: hugepage isolation failed: 0, page count 2, type 7ffff80001000e (referenced|uptodate|dirty|head)
[ 292.469113][ T3685] Soft offlining pfn 0x8c000 at process virtual address 0xffff60000000
[ 292.983855][ T3685] Soft offlining pfn 0x88000 at process virtual address 0xffff40000000
[ 293.271369][ T3685] Soft offlining pfn 0x8a000 at process virtual address 0xffff60000000
[ 293.834030][ T3685] Soft offlining pfn 0xa000 at process virtual address 0xffff40000000
[ 293.851378][ T3685] soft offline: 0xa000: hugepage migration failed -12, type 7ffff80001000e (referenced|uptodate|dirty|head)

The freshly booted system still had 40G+ of memory free before running the test.

Reverting the following commits allowed the test to run successfully over and over again.

"mm, hwpoison: remove recalculating hpage"
"mm,hwpoison-inject: don't pin for hwpoison_filter"
"mm,hwpoison: Un-export get_hwpoison_page and make it static"
"mm,hwpoison: kill put_hwpoison_page"
"mm,hwpoison: unify THP handling for hard and soft offline"
"mm,hwpoison: rework soft offline for free pages"
"mm,hwpoison: rework soft offline for in-use pages"
"mm,hwpoison: refactor soft_offline_huge_page and __soft_offline_page"

i.e., it is not enough to only revert,

mm,hwpoison: double-check page count in __get_any_page()
mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
mm,hwpoison: return 0 if the page is already poisoned in soft-offline

>
> Dropped patches
> - mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
> - mm,madvise: Refactor madvise_inject_error
> - mm,hwpoison: remove MF_COUNT_INCREASED
> - mm,hwpoison: remove flag argument from soft offline functions
>
> Thanks,
> Naoya Horiguchi
>
> Quoting cover letter of v5:
> ----
> The main focus of this series is to stabilize soft offline. Historically,
> soft offlined pages have suffered from race conditions because PageHWPoison
> is used a little too aggressively, which (directly or indirectly) invades
> other mm code that cares little about hwpoison. This results in unexpected
> behavior or kernel panics, which is very far from soft offline's "do not
> disturb userspace or other kernel components" policy.
>
> The main point of this change set is to contain the target page "via the
> buddy allocator": we first free the target page as we do for normal pages,
> and remove it from the buddy allocator only once we confirm it has reached
> the free list. There is certainly a race window with page allocation, but
> that is fine: someone really wants that page and the page still works, so
> soft offline can happily give up.
>
> Oscar's v4 tried to handle the race around reallocation, but that part
> still seems to be a work in progress, so I decided to leave it out of the
> changes targeted for v5.9. Thank you for your contribution, Oscar.
>
> ---
> Previous versions:
> v1: https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
> v2: https://lore.kernel.org/linux-mm/20191017142123.24245-1-osalvador@suse.de/
> v3: https://lore.kernel.org/linux-mm/20200624150137.7052-1-nao.horiguchi@gmail.com/
> v4: https://lore.kernel.org/linux-mm/20200716123810.25292-1-osalvador@suse.de/
> v5: https://lore.kernel.org/linux-mm/20200805204354.GA16406@hori.linux.bs1.fc.nec.co.jp/T/#t
> ---
> Summary:
>
> Naoya Horiguchi (5):
> mm,hwpoison: cleanup unused PageHuge() check
> mm, hwpoison: remove recalculating hpage
> mm,hwpoison-inject: don't pin for hwpoison_filter
> mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
> mm,hwpoison: double-check page count in __get_any_page()
>
> Oscar Salvador (7):
> mm,hwpoison: Un-export get_hwpoison_page and make it static
> mm,hwpoison: Kill put_hwpoison_page
> mm,hwpoison: Unify THP handling for hard and soft offline
> mm,hwpoison: Rework soft offline for free pages
> mm,hwpoison: Rework soft offline for in-use pages
> mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
> mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
>
> include/linux/mm.h | 3 +-
> include/linux/page-flags.h | 6 +-
> include/ras/ras_event.h | 3 +
> mm/hwpoison-inject.c | 18 +--
> mm/madvise.c | 5 -
> mm/memory-failure.c | 307 +++++++++++++++++++++------------------------
> mm/migrate.c | 11 +-
> mm/page_alloc.c | 60 +++++++--
> 8 files changed, 203 insertions(+), 210 deletions(-)

Re: [PATCH v6 00/12] HWPOISON: soft offline rework
On Mon, Aug 10, 2020 at 11:22:55AM -0400, Qian Cai wrote:
> On Thu, Aug 06, 2020 at 06:49:11PM +0000, nao.horiguchi@gmail.com wrote:
> > Hi,
> >
> > This patchset is the latest version of the soft offline rework series,
> > targeted for v5.9.
> >
> > Since v5, I dropped the patches which tweak refcount handling in
> > madvise_inject_error() to avoid the "unknown refcount page" error.
> > I could not confirm the fix myself (the error did not reproduce with v5 in
> > my environment), but with this change soft_offline_page() is always called
> > with the refcount already held, so the error should not happen any more.
>
> With this patchset, arm64 still suffers from premature allocation failures
> for 512M hugepages.
>
> # git clone https://gitlab.com/cailca/linux-mm
> # cd linux-mm; make
> # ./random 1
> - start: migrate_huge_offline
> - use NUMA nodes 0,1.
> - mmap and free 2147483648 bytes hugepages on node 0
> - mmap and free 2147483648 bytes hugepages on node 1
> madvise: Cannot allocate memory
>
> [ 292.456538][ T3685] soft offline: 0x8a000: hugepage isolation failed: 0, page count 2, type 7ffff80001000e (referenced|uptodate|dirty|head)
> [ 292.469113][ T3685] Soft offlining pfn 0x8c000 at process virtual address 0xffff60000000
> [ 292.983855][ T3685] Soft offlining pfn 0x88000 at process virtual address 0xffff40000000
> [ 293.271369][ T3685] Soft offlining pfn 0x8a000 at process virtual address 0xffff60000000
> [ 293.834030][ T3685] Soft offlining pfn 0xa000 at process virtual address 0xffff40000000
> [ 293.851378][ T3685] soft offline: 0xa000: hugepage migration failed -12, type 7ffff80001000e (referenced|uptodate|dirty|head)
>
> The freshly booted system still had 40G+ of memory free before running the test.

As I commented on v5, this failure is expected and does not indicate a
kernel issue. Once we successfully soft offline a hugepage, the memory
range covering that hugepage can never be used as a hugepage again, because
one of its subpages has been removed from the buddy allocator. So if you
keep soft offlining hugepages, eventually every memory range is "holed" and
no hugepage can be allocated in the system anymore.

Please fix your test program to choose the number of loops (NR_LOOP)
properly, so that you can assume a hugepage can always be allocated during
the test. For example, if 40G of memory is usable and the hugepage size is
512MB, NR_LOOP should not be larger than 80.
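The bound is just the usable memory divided by the hugepage size; for
example (a userspace illustration, with the ~40G figure taken from this
thread rather than measured on your machine):

#include <stdio.h>

int main(void)
{
        unsigned long long usable_bytes   = 40ULL << 30;  /* ~40G reported free */
        unsigned long long hugepage_bytes = 512ULL << 20; /* 512MB hugepage     */

        /*
         * Each successful soft offline permanently removes one hugepage-sized
         * range from the pool, so this is the most iterations that can succeed.
         */
        printf("NR_LOOP upper bound = %llu\n", usable_bytes / hugepage_bytes);
        return 0;
}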

>
> Reverting the following commits allowed the test to run successfully over and over again.
>
> "mm, hwpoison: remove recalculating hpage"
> "mm,hwpoison-inject: don't pin for hwpoison_filter"
> "mm,hwpoison: Un-export get_hwpoison_page and make it static"
> "mm,hwpoison: kill put_hwpoison_page"
> "mm,hwpoison: unify THP handling for hard and soft offline"
> "mm,hwpoison: rework soft offline for free pages"
> "mm,hwpoison: rework soft offline for in-use pages"
> "mm,hwpoison: refactor soft_offline_huge_page and __soft_offline_page"

I'm still not sure why the test succeeded with these commits reverted,
because the current mainline kernel provides a similar mechanism to prevent
reuse of soft offlined pages. So this success looks suspicious to me.

To investigate more, I want to have additional info about the page states
of the relevant pages after soft offlining. Could you collect it by the
following steps?

- modify random.c not to run hotplug_memory() in migrate_huge_hotplug_memory(),
- compile it and run "./random 1" once,
- to collect page state with hwpoisoned pages, run "./page-types -Nlr -b hwpoison",
where page-types is available under tools/vm in kernel source tree.
- choose a few pfns of soft offlined pages from kernel message
"Soft offlining pfn ...", and run "./page-types -Nlr -a <pfn>".

Thanks,
Naoya Horiguchi

>
> i.e., it is not enough to only revert,
>
> mm,hwpoison: double-check page count in __get_any_page()
> mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
> mm,hwpoison: return 0 if the page is already poisoned in soft-offline
>

Re: [PATCH v6 00/12] HWPOISON: soft offline rework
> On Aug 10, 2020, at 11:11 PM, HORIGUCHI NAOYA (堀口 直也) <naoya.horiguchi@nec.com> wrote:
>
> I'm still not sure why the test succeeded with these commits reverted,
> because the current mainline kernel provides a similar mechanism to prevent
> reuse of soft offlined pages. So this success looks suspicious to me.

Even if we call munmap() on the range, it still can't be reused? If so, how do we recover that memory?

>
> To investigate more, I want to have additional info about the page states
> of the relevant pages after soft offlining. Could you collect it by the
> following steps?

Do you want to collect those from the failing kernel or the succeeding one?

Re: [PATCH v6 00/12] HWPOISON: soft offline rework
On Mon, Aug 10, 2020 at 11:45:36PM -0400, Qian Cai wrote:
>
>
> > On Aug 10, 2020, at 11:11 PM, HORIGUCHI NAOYA (堀口 直也) <naoya.horiguchi@nec.com> wrote:
> >
> > I'm still not sure why the test succeeded with these commits reverted,
> > because the current mainline kernel provides a similar mechanism to prevent
> > reuse of soft offlined pages. So this success looks suspicious to me.
>
> Even if we call munmap() on the range, it still can't be reused? If so, how do we recover that memory?

No, it can't, because soft offline isolates the physical page, so even
after calling munmap() the side effect remains on the page. In your
random.c, the memory offline/online cycle resets the hwpoison status, so
you can allocate hugepages again in another run of the program.

>
> >
> > To investigate more, I want to have additional info about the page states
> > of the relevant pages after soft offlining. Could you collect it by the
> > following steps?
>
> Do you want to collect those from the failing kernel or the succeeding one?

I'd like to check on the succeeding kernel.
Sorry for the lack of information.

Thanks,
Naoya Horiguchi

Re: [PATCH v6 00/12] HWPOISON: soft offline rework
On Tue, Aug 11, 2020 at 03:11:40AM +0000, HORIGUCHI NAOYA (堀口 直也) wrote:
> I'm still not sure why the test succeeded with these commits reverted,
> because the current mainline kernel provides a similar mechanism to prevent
> reuse of soft offlined pages. So this success looks suspicious to me.
>
> To investigate more, I want to have additional info about the page states
> of the relevant pages after soft offlining. Could you collect it by the
> following steps?
>
> - modify random.c not to run hotplug_memory() in migrate_huge_hotplug_memory(),
> - compile it and run "./random 1" once,
> - to collect page state with hwpoisoned pages, run "./page-types -Nlr -b hwpoison",
> where page-types is available under tools/vm in kernel source tree.
> - choose a few pfns of soft offlined pages from kernel message
> "Soft offlining pfn ...", and run "./page-types -Nlr -a <pfn>".

# ./page-types -Nlr -b hwpoison
offset len flags
99a000 1 __________B________X_______________________
99c000 1 __________B________X_______________________
99e000 1 __________B________X_______________________
9a0000 1 __________B________X_______________________
ba6000 1 __________B________X_______________________
baa000 1 __________B________X_______________________

Every single one of the pfns looked like this:

# ./page-types -Nlr -a 0x99a000
offset len flags
99a000 1 __________B________X_______________________

# ./page-types -Nlr -a 0x99e000
offset len flags
99e000 1 __________B________X_______________________

# ./page-types -Nlr -a 0x99c000
offset len flags
99c000 1 __________B________X_______________________

Re: [PATCH v6 00/12] HWPOISON: soft offline rework
On Tue, Aug 11, 2020 at 01:39:24PM -0400, Qian Cai wrote:
> On Tue, Aug 11, 2020 at 03:11:40AM +0000, HORIGUCHI NAOYA (堀口 直也) wrote:
> > I'm still not sure why the test succeeded with these commits reverted,
> > because the current mainline kernel provides a similar mechanism to prevent
> > reuse of soft offlined pages. So this success looks suspicious to me.
> >
> > To investigate more, I want to have additional info about the page states
> > of the relevant pages after soft offlining. Could you collect it by the
> > following steps?
> >
> > - modify random.c not to run hotplug_memory() in migrate_huge_hotplug_memory(),
> > - compile it and run "./random 1" once,
> > - to collect page state with hwpoisoned pages, run "./page-types -Nlr -b hwpoison",
> > where page-types is available under tools/vm in kernel source tree.
> > - choose a few pfns of soft offlined pages from kernel message
> > "Soft offlining pfn ...", and run "./page-types -Nlr -a <pfn>".
>
> # ./page-types -Nlr -b hwpoison
> offset len flags
> 99a000 1 __________B________X_______________________
> 99c000 1 __________B________X_______________________
> 99e000 1 __________B________X_______________________
> 9a0000 1 __________B________X_______________________
> ba6000 1 __________B________X_______________________
> baa000 1 __________B________X_______________________

Thank you. It shows only 6 records, which is unexpected to me because
random.c iterates soft offlining 2 hugepages with madvise() 1000 times.
Somehow (maybe in an arch-specific way?) the other hwpoisoned pages might
have been cleared? If they really were, the success of this test is
spurious, and this patchset can be considered a fix.

>
> Every single one of the pfns looked like this:
>
> # ./page-types -Nlr -a 0x99a000
> offset len flags
> 99a000 1 __________B________X_______________________
>
> # ./page-types -Nlr -a 0x99e000
> offset len flags
> 99e000 1 __________B________X_______________________
>
> # ./page-types -Nlr -a 0x99c000
> offset len flags
> 99c000 1 __________B________X_______________________

Re: [PATCH v6 00/12] HWPOISON: soft offline rework
On Wed, Aug 12, 2020 at 04:32:01AM +0900, Naoya Horiguchi wrote:
> On Tue, Aug 11, 2020 at 01:39:24PM -0400, Qian Cai wrote:
> > On Tue, Aug 11, 2020 at 03:11:40AM +0000, HORIGUCHI NAOYA (堀口 直也) wrote:
> > > I'm still not sure why the test succeeded with these commits reverted,
> > > because the current mainline kernel provides a similar mechanism to prevent
> > > reuse of soft offlined pages. So this success looks suspicious to me.
> > >
> > > To investigate more, I want to have additional info about the page states
> > > of the relevant pages after soft offlining. Could you collect it by the
> > > following steps?
> > >
> > > - modify random.c not to run hotplug_memory() in migrate_huge_hotplug_memory(),
> > > - compile it and run "./random 1" once,
> > > - to collect page state with hwpoisoned pages, run "./page-types -Nlr -b hwpoison",
> > > where page-types is available under tools/vm in kernel source tree.
> > > - choose a few pfns of soft offlined pages from kernel message
> > > "Soft offlining pfn ...", and run "./page-types -Nlr -a <pfn>".
> >
> > # ./page-types -Nlr -b hwpoison
> > offset len flags
> > 99a000 1 __________B________X_______________________
> > 99c000 1 __________B________X_______________________
> > 99e000 1 __________B________X_______________________
> > 9a0000 1 __________B________X_______________________
> > ba6000 1 __________B________X_______________________
> > baa000 1 __________B________X_______________________
>
> Thank you. It shows only 6 records, which is unexpected to me because
> random.c iterates soft offlining 2 hugepages with madvise() 1000 times.
> Somehow (maybe in an arch-specific way?) the other hwpoisoned pages might
> have been cleared? If they really were, the success of this test is
> spurious, and this patchset can be considered a fix.

The test was designed to catch a previous bug (which the latest patchset
fixed) where the kernel would enter an endless loop.

https://lore.kernel.org/lkml/1570829564.5937.36.camel@lca.pw/

However, I don't understand why mmap() does not return ENOMEM in the first
place when overcommit_memory == 0, instead of munmap() and/or madvise()
returning ENOMEM. I suppose that is the price to pay for the heuristic, and
I can't easily confirm whether it is related to this patchset or not.

addr = mmap(NULL, length, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (addr == MAP_FAILED) {
        if (i == 0 || errno != ENOMEM) {
                perror("mmap");
                return 1;
        }
        usleep(1000);
        continue;
}
memset(addr, 0, length);

code = madvise(addr, length, MADV_SOFT_OFFLINE);
if (safe_munmap(addr, length))
        return 1;

/* madvise() could return >= 0 on success. */
if (code < 0 && errno != EBUSY) {
        perror("madvise");
        return 1;
}

Otherwise, our test keeps running and ignores ENOMEM correctly. I also
confirmed that this patchset has a higher soft-offlining success rate
("page-types" shows 400+ lines), which changes the existing assumption
(apparently in a good way in this case).