Mailing List Archive

[00/41] Large Blocksize Support V7 (adds memmap support)
This patchset modifies the Linux kernel so that larger block sizes than
page size can be supported. Larger block sizes are handled by using
compound pages of an arbitrary order for the page cache instead of
single pages with order 0.

- Support is added in a way that limits the changes to existing code.
As a result filesystems can support larger page I/O with minimal changes.

- The page cache functions are mostly unchanged. Instead of a page struct
representing a single page they take a head page struct (which looks
the same as a regular page struct apart from the compound flags) and
operate on those. Most page cache functions can stay as they are
(a minimal sketch of the head page handling follows this list).

- No locking protocols are added or modified.

- The support is also fully transparent at the level of the OS. No
specialized heuristics are added to switch to larger pages. Large
page support is enabled by filesystems or device drivers when a device
or volume is mounted. Larger block sizes are usually set during volume
creation although the patchset supports setting these sizes per file.
The formatted partition will then always be accessed with the
configured blocksize.

- Large blocks also do not mean that the 4k mmap semantics need to be abandoned.
The included mmap support will happily map 4k chunks of large blocks so that
user space sees no changes.
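
To make the head page handling above concrete, here is a minimal sketch
(not code from the patchset; the helper name is made up for illustration)
of how a routine handed a page cache page can derive the effective page
size from the compound order of its head page, using the kernel's
compound_head() and compound_order() helpers:

#include <linux/mm.h>

/*
 * Illustrative only: an order-0 page yields PAGE_SIZE, the head page of a
 * higher order compound page yields the full compound size, so callers do
 * not need to distinguish the two cases.
 */
static inline unsigned long pagecache_page_size(struct page *page)
{
	page = compound_head(page);	/* no-op for order-0 pages */
	return PAGE_SIZE << compound_order(page);
}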

Some of the changes are:

- Replace the use of PAGE_CACHE_XXX constants to calculate offsets into
pages with functions that do the same and allow the constants to
be parameterized (see the sketch after this list).

- Extend the capabilities of compound pages so that they can be
put onto the LRU and reclaimed.

- Allow setting a larger blocksize via set_blocksize()
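
As a rough illustration of the page_cache_xxx parameterization, helpers
along the following lines could replace the constants. This is a sketch
only, not the patchset's code: the real helpers presumably derive the
order from the mapping or the page, whereas here it is passed explicitly
to keep the example self-contained. With order 0 the helpers degenerate
to the classic PAGE_CACHE_SHIFT/PAGE_CACHE_SIZE behaviour:

#include <linux/mm.h>
#include <linux/types.h>

static inline unsigned int page_cache_shift(unsigned int order)
{
	return PAGE_SHIFT + order;		/* order 0 == today's 4k case */
}

static inline unsigned long page_cache_size(unsigned int order)
{
	return 1UL << page_cache_shift(order);
}

static inline pgoff_t page_cache_index(unsigned int order, loff_t pos)
{
	return pos >> page_cache_shift(order);	/* which page cache page */
}

static inline unsigned long page_cache_offset(unsigned int order, loff_t pos)
{
	return pos & (page_cache_size(order) - 1);	/* offset within it */
}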

Rationales:
-----------

1. The ability to handle memory of an arbitrarily large size using
a single page struct "handle" is essential for scaling memory handling
and reducing overhead in multiple kernel subsystems. This patchset
is a strategic move that allows performance gains throughout the
kernel.

2. Reduce fsck times. Larger block sizes mean faster file system checking.
Using a 64k block size will reduce the number of blocks to be managed
by a factor of 16 and produce much denser and more contiguous metadata.

3. Performance. If we look at IA64 vs. x86_64 then it seems that the
faster interrupt handling on x86_64 compensates for the speed loss due to
a smaller page size (4k vs 16k on IA64). Supporting larger block sizes
on all architectures allows a significant reduction in I/O overhead and increases
the size of I/O that can be performed by hardware in a single request
since the number of scatter gather entries is typically limited for
one request. This is going to become increasingly important to support
the ever growing memory sizes since we may have to handle excessively
large numbers of 4k requests for data sizes that may become common
soon. For example, to write a 1 terabyte file the kernel would have to
handle 256 million 4k chunks.

4. Cross arch compatibility: It is currently not possible to mount
a 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.
With this patch this becomes possible. Note that this also means that
some filesystems are already capable of working with blocksizes of
up to 64k (ext2, XFS), which is currently only possible on a select
few arches. This patchset enables that functionality on all arches.
There are no special modifications needed to the filesystems. The
set_blocksize() function call will simply support a larger blocksize.
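
The change behind that last point is small. Today set_blocksize() rejects
anything above PAGE_SIZE; conceptually the check only needs to be relaxed,
roughly as in the sketch below (not the actual patch; the upper bound used
here is an assumption for illustration):

#include <linux/mm.h>

/*
 * Sketch of the relaxed validation only.  Mainline set_blocksize() caps the
 * size at PAGE_SIZE; with large blocksize support the cap moves up to the
 * largest compound page the page cache can handle.
 */
static int blocksize_is_supported(unsigned int size)
{
	if (size < 512 || (size & (size - 1)))	/* power of two, at least 512 */
		return 0;
	return size <= (PAGE_SIZE << (MAX_ORDER - 1));
}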

5. VM scalability
Large block sizes mean less state keeping for the information being
transferred. For a 1TB file one needs to handle 256 million page
structs in the VM if one uses 4k page size. A 64k page size reduces
that amount to 16 million. If the limitations in existing filesystems
are removed then even higher reductions become possible. For very
large files like that a page size of 2 MB may be beneficial, which
will reduce the number of page structs to handle to 512k. The variable
nature of the block size means that the size can be tuned at file
system creation time for the anticipated needs on a volume.

6. IO scalability
The IO layer will receive large blocks of contiguous memory with
this patchset. This means that fewer scatter gather elements are needed
and the memory used is guaranteed to be contiguous. Instead of having
to handle 4k chunks we can f.e. handle 64k chunks in one go.

7. Limited scatter gather support restricts I/O sizes.

A lot of I/O controllers are limited in the number of scatter gather
elements that they support. For example a controller that supports 128
entries in the scatter gather lists can only perform I/O of 128*4k =
512k in one go. If the blocksize is larger (f.e. 64k) then we can perform
larger I/O transfers. If we support 128 entries then 128*64k = 8M
can be transferred in one transaction.
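
A toy helper, purely illustrative and not kernel code, just to make the
relationship explicit: with a fixed number of scatter gather entries the
maximum I/O per request scales linearly with the block size.

static inline unsigned long max_request_bytes(unsigned int sg_entries,
					      unsigned long blocksize)
{
	/* 128 entries * 4k blocks = 512k; 128 entries * 64k blocks = 8M */
	return (unsigned long)sg_entries * blocksize;
}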

Dave Chinner measured a performance increase of 50% when going to 64k
blocksize with XFS with an earlier version of this patchset.

8. We have problems supporting devices with a higher blocksize than
page size. This is for example important to support CDs and DVDs that
can only read and write 32k or 64k blocks. We currently have a shim
layer in there to deal with this situation which limits the speed
of I/O. The developers are currently looking for ways to completely
bypass the page cache because of this deficiency.

9. 32/64k blocksize is also used in flash devices. Same issues.

10. Future hard disks will support bigger block sizes than Linux can
support since we are limited to PAGE_SIZE. OK, the on-board cache
may buffer this for us, but what is the point of handling smaller
page sizes than what the drive supports?


Fragmentation issues
--------------------

The Linux VM is gradually acquiring abilities to defragment memory. These
capabilities are partially present for 2.6.23. Later versions may merge
more of the defragmentation work. The use of large pages may cause
significant fragmentation to memory. Large buffers require pages of higher
order. Defragmentation support is necessary to ensure that pages of higher
order are available or reclaimable when necessary.

There have been a number of statements that defragmentation cannot ever
work. However, the failures with the early defragmentation code from the
spring no longer occur. I have seen no failures with 2.6.23 when
testing with 16k and 32k blocksize. The effect of the limited
defragmentation capabilities in 2.6.23 may already be sufficient for many
uses.

I would like to increase the supported blocksize to very large pages in the
future so that device drivers will be capable of providing large contiguous
mappings. For that purpose I think that we need a mechanism to reserve
pools of varying large sizes at boot time. Such a mechanism can also be used
to compensate in situations where one wants to use larger buffers but
defragmentation support is not (yet?) capable of reliably providing pages
of the desired sizes.

How to make this patchset work:
-------------------------------

1. Apply this patchset or do a

git pull
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git
largeblock

(The git archive is used to keep the patchset up to date. Please send patches
against the git tree)

2. Enable LARGE_BLOCKSIZE Support
3. Compile kernel

In order to use a filesystem with a larger blocksize it needs to be formatted
for that larger blocksize. This is done using the mkfs.xxx tool for each
filesystem. Surprisingly the existing tools work without modification. These
formatting tools may warn you that the blocksize you specify is not supported
on your particular architecture. Ignore that warning since this is no longer
true after you have applied this patchset.
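
For example, to format a volume with a 64k blocksize one could run something
like the following (illustrative; the exact options and the device name
depend on your setup and mkfs version):

mke2fs -b 65536 /dev/sdb1
mkfs.xfs -b size=65536 /dev/sdb1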

Tested file systems:

Filesystem   Max Blocksize   Changes

Reiserfs     8k              Page size functions
Ext2         64k             Page size functions
XFS          64k             Page size functions / Remove PAGE_SIZE check
Ramfs        MAX_ORDER       Parameter to specify order

Todo/Issues:

- There are certainly numerous issues with this patch. I have only tested
copying files back and forth, volume creation etc. Others have run
fsxlinux on the volumes. The missing mmap support limits what can be
done for now.

- ZONE_MOVABLE is available in 2.6.23. Using kernelcore=xxx as a kernel
parameter enables an area where defragmentation can work. This may be
necessary to avoid OOMs although I have seen no problems with up to 32k
blocksize even without that measure.

- The antifragmentation patches in Andrew's tree address more fragmentation
issues. However, large orders may still lead to fragmentation
of the movable sections. Memory compaction is still not merged and will
likely be needed to reliably support even larger orders of 256k or more.
How memory compaction impacts performance still has to be determined.

- Support for bouncing pages.

- Remove PAGE_CACHE_xxx constants after using page_cache_xxx functions
everywhere. But that will have to wait until merging becomes possible.
For now certain subsystems (shmem f.e.) are not using these functions.
They will only use order 0 pages.

- Support for non harddisk based filesystems. Remove the pktdvd etc
layers needed because the VM currently does not support sufficiently
large blocksizes for these devices. Look for other places in the kernel
where we have similar issues.

V6->V7
- Mmap support
- Further cleanups
- Against 2.6.23-rc5
- Drop provocative ext2 patch
- Add patches to enable 64k blocksize in ext2/3 (Thanks, Mingming)

V5->V6:
- Rediff against 2.6.23-rc4
- Fix breakage introduced by updates to reiserfs
- Readahead fixes by Fengguang Wu <fengguang.wu@gmail.com>
- Provide a git tree that is kept up to date

V4->V5:
- Diff against 2.6.22-rc6-mm1
- provide test tree on ftp.kernel.org:/pub/linux

V3->V4
- It is possible to transparently make filesystems support larger
blocksizes by simply allowing larger blocksizes in set_blocksize.
Remove all special modifications for mmap etc from the filesystems.
This now makes 3 disk based filesystems that can use larger blocks
(reiser, ext2, xfs). Are there any other useful ones to make work?
- Patch against 2.6.22-rc4-mm2 which allows the use of Mel's antifrag
logic to avoid fragmentation.
- More page cache cleanup by applying the functions to filesystems.
- Disable bouncing when the gfp mask is setup.
- Disable mmap directly in mm/filemap.c to avoid filesystem changes
while we have no mmap support for higher order pages.

RFC V2->V3
- More restructuring
- It actually works!
- Add XFS support
- Fix up UP support
- Work out the direct I/O issues
- Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert
back to constants. Disabled for 32bit and HIGHMEM configurations.
This also allows a gradual migration to the new page cache
inline functions. LARGE_BLOCKSIZE capabilities can be
added gradually and if there is a problem then we can disable
a subsystem.

RFC V1->V2
- Some ext2 support
- Some block layer, fs layer support etc.
- Better page cache macros
- Use macros to clean up code.

Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:

> 5. VM scalability
> Large block sizes mean less state keeping for the information being
> transferred. For a 1TB file one needs to handle 256 million page
> structs in the VM if one uses 4k page size. A 64k page size reduces
> that amount to 16 million. If the limitation in existing filesystems
> are removed then even higher reductions become possible. For very
> large files like that a page size of 2 MB may be beneficial which
> will reduce the number of page struct to handle to 512k. The variable
> nature of the block size means that the size can be tuned at file
> system creation time for the anticipated needs on a volume.

There is a limitation in the VM. Fragmentation. You keep saying this
is a solved issue and just assuming you'll be able to fix any cases
that come up as they happen.

I still don't get the feeling you realise that there is a fundamental
fragmentation issue that is unsolvable with Mel's approach.

The idea that there even _is_ a bug to fail when higher order pages
cannot be allocated was also brushed aside by some people at the
vm/fs summit. I don't know if those people had gone through the
math about this, but it goes somewhat like this: if you use a 64K
page size, you can "run out of memory" with 93% of your pages free.
If you use a 2MB page size, you can fail with 99.8% of your pages
still free. That's 64GB of memory used on a 32TB Altix.
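
(For reference, the worst case behind those fractions is presumably one
pinned 4k page per higher-order block. A throwaway userspace check of the
arithmetic, purely illustrative:)

#include <stdio.h>

int main(void)
{
	double used_64k = 4096.0 / (64 * 1024);		/* one 4k page pinned per 64k block */
	double used_2mb = 4096.0 / (2 * 1024 * 1024);	/* one 4k page pinned per 2MB block */

	printf("64k blocks: %.1f%% of memory still free at failure\n",
	       100.0 * (1.0 - used_64k));		/* ~93% */
	printf("2MB blocks: %.1f%% of memory still free at failure\n",
	       100.0 * (1.0 - used_2mb));		/* ~99.8% */
	printf("pinned on a 32TB machine: %.0f GB\n",
	       32.0 * 1024 * used_2mb);			/* 64 GB */
	return 0;
}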

If you don't consider that is a problem because you don't care about
theoretical issues or nobody has reported it from running -mm
kernels, then I simply can't argue against that on a technical basis.
But I'm totally against introducing known big fundamental problems to
the VM at this stage of the kernel. God knows how long it takes to ever
fix them in future after they have become pervasive throughout the
kernel.

IMO the only thing that higher order pagecache is good for is a quick
hack for filesystems to support larger block sizes. And after seeing it
is fairly ugly to support mmap, I'm not even really happy for it to do
that.

If VM scalability is a problem, then it needs to be addressed in other
areas anyway for order-0 pages, and if contiguous pages helps IO
scalability or crappy hardware, then there is nothing stopping us from
*attempting* to get contiguous memory in the current scheme.

Basically, if you're placing your hopes for VM and IO scalability on this,
then I think that's a totally broken thing to do and will end up making
the kernel worse in the years to come (except maybe on some poor
configurations of bad hardware).
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Tuesday 11 September 2007 22:12, Jörn Engel wrote:
> On Tue, 11 September 2007 04:52:19 +1000, Nick Piggin wrote:
> > On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
> > > 5. VM scalability
> > > Large block sizes mean less state keeping for the information being
> > > transferred. For a 1TB file one needs to handle 256 million page
> > > structs in the VM if one uses 4k page size. A 64k page size reduces
> > > that amount to 16 million. If the limitation in existing filesystems
> > > are removed then even higher reductions become possible. For very
> > > large files like that a page size of 2 MB may be beneficial which
> > > will reduce the number of page struct to handle to 512k. The
> > > variable nature of the block size means that the size can be tuned at
> > > file system creation time for the anticipated needs on a volume.
> >
> > The idea that there even _is_ a bug to fail when higher order pages
> > cannot be allocated was also brushed aside by some people at the
> > vm/fs summit. I don't know if those people had gone through the
> > math about this, but it goes somewhat like this: if you use a 64K
> > page size, you can "run out of memory" with 93% of your pages free.
> > If you use a 2MB page size, you can fail with 99.8% of your pages
> > still free. That's 64GB of memory used on a 32TB Altix.
>
> While I agree with your concern, those numbers are quite silly. The

They are the theoretical worst case. Obviously with a non-trivially
sized system and non-DoS workload, they will not be reached.


> chances of 99.8% of pages being free and the remaining 0.2% being
> perfectly spread across all 2MB large_pages are lower than those of SHA1
> creating a collision. I don't see anyone abandoning git or rsync, so
> your extreme example clearly is the wrong one.
>
> Again, I agree with your concern, even though your example makes it look
> silly.

It is not simply a question of once-off chance for an all-at-once layout
to fail in this way. Fragmentation slowly builds over time, and especially
if you do actually use higher-order pages for a significant number of
things (unlike today), then the problem will become worse. If you
have any part of your workload that is affected by fragmentation, then
it will cause unfragmented regions to eventually be used for
fragmentation-inducing allocations (by definition -- if it did not, then
there would be no fragmentation problem and no need for Mel's patches).

I don't know what happens as time tends towards infinity, but I don't think
it will be good.

At millions of allocations per second, how long does it take to produce
an unacceptable number of free pages before the ENOMEM condition?
Furthermore, what *is* an unacceptable number? I don't know. I am not
trying to push this feature in, so the burden is not mine to make sure it
is OK.

Yes, we already have some of these problems today. Introducing more
and worse problems and justifying them because of existing ones is much
more silly than my quoting of the numbers. IMO.

Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wednesday 12 September 2007 01:36, Mel Gorman wrote:
> On Tue, 2007-09-11 at 04:52 +1000, Nick Piggin wrote:
> > On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
> > > 5. VM scalability
> > > Large block sizes mean less state keeping for the information being
> > > transferred. For a 1TB file one needs to handle 256 million page
> > > structs in the VM if one uses 4k page size. A 64k page size reduces
> > > that amount to 16 million. If the limitation in existing filesystems
> > > are removed then even higher reductions become possible. For very
> > > large files like that a page size of 2 MB may be beneficial which
> > > will reduce the number of page struct to handle to 512k. The
> > > variable nature of the block size means that the size can be tuned at
> > > file system creation time for the anticipated needs on a volume.
> >
> > There is a limitation in the VM. Fragmentation. You keep saying this
> > is a solved issue and just assuming you'll be able to fix any cases
> > that come up as they happen.
> >
> > I still don't get the feeling you realise that there is a fundamental
> > fragmentation issue that is unsolvable with Mel's approach.
>
> I thought we had discussed this already at VM and reached something
> resembling a conclusion. It was acknowledged that depending on
> contiguous allocations to always succeed will get a caller into trouble
> and they need to deal with fallback - whether the problem was
> theoretical or not. It was also strongly pointed out that the large
> block patches as presented would be vulnerable to that problem.

Well Christoph seems to still be spinning them as a solution for VM
scalability and first class support for making contiguous IOs, large
filesystem block sizes etc.

At the VM summit I think the conclusion was that grouping by
mobility could be merged. I'm still not thrilled by that, but I was
going to get steamrolled[*] anyway... and seeing as the userspace
hugepages is a relatively demanded workload and can be
implemented in this way with basically no other changes to the
kernel and already must have fallbacks.... then that's actually a
reasonable case for it.

The higher order pagecache, again I'm just going to get steamrolled
on, and it actually isn't so intrusive minus the mmap changes, so I
didn't have much to reasonably say there.

And I would have kept quiet this time too, except for the worrying idea
to use higher order pages to fix the SLUB vs SLAB regression, and if
the rationale for this patchset was more realistic.

[*] And I don't say steamrolled because I'm bitter and twisted :) I
personally want the kernel to be perfect. But I realise it already isn't
and for practical purposes people want these things, so I accept
being overruled, no problem. The fact simply is -- I would have been
steamrolled I think :P

> The alternatives were fs-block and increasing the size of order-0. It
> was felt that fs-block was far away because it's complex and I thought
> that increasing the pagesize like what Andrea suggested would lead to
> internal fragmentation problems. Regrettably we didn't discuss Andrea's
> approach in depth.

Sure. And some people run workloads where fragmentation is likely never
going to be a problem, they are shipping this poorly configured hardware
now or soon, so they don't have too much interest in doing it right at this
point, rather than doing it *now*. OK, that's a valid reason which is why I
don't use the argument that we should do it correctly or never at all.


> I *thought* that the end conclusion was that we would go with
> Christoph's approach pending two things being resolved;
>
> o mmap() support that we agreed on is good

In theory (and again for the filesystem guys who don't have to worry about
it). In practice after seeing the patch it's not a nice thing for the VM to
have to do.


> I also thought there was an acknowledgement that long-term, fs-block was
> the way to go - possibly using contiguous pages optimistically instead
> of virtual mapping the pages. At that point, it would be a general
> solution and we could remove the warnings.

I guess it is still in the air. I personally think a vmapping approach and/or
teaching filesystems to do some nonlinear block metadata access is the
way to go (strangely, this happens to be one of the fsblock paradigms!).
OTOH, I'm not sure how much buy-in there was from the filesystems guys.
Particularly Christoph H and XFS (which is strange because they already do
vmapping in places).

That's understandable though. It is a lot of work for filesystems. But the
reason I think it is the correct approach for larger block than soft-page
size is that it doesn't have fundamental issues (assuming that virtually
mapping the entire kernel is off the table).


> Basically, to start out with, this was going to be an SGI-only thing so
> they get to rattle out the issues we expect to encounter with large
> blocks and help steer the direction of the
> more-complex-but-safer-overall fs-block.

That's what I expected, but it seems from the descriptions in the patches
that it is also supposed to cure cancer :)


> > The idea that there even _is_ a bug to fail when higher order pages
> > cannot be allocated was also brushed aside by some people at the
> > vm/fs summit.
>
> When that brushing occured, I thought I made it very clear what the
> expectations were and that without fallback they would be taking a risk.
> I am not sure if that message actually sank in or not.

No, you have been good about that aspect. I wasn't trying to point to you
at all here.


> > I don't know if those people had gone through the
> > math about this, but it goes somewhat like this: if you use a 64K
> > page size, you can "run out of memory" with 93% of your pages free.
> > If you use a 2MB page size, you can fail with 99.8% of your pages
> > still free. That's 64GB of memory used on a 32TB Altix.
>
> That's the absolute worst case but yes, in theory this can occur and
> it's safest to assume the situation will occur somewhere to someone. It
> would be difficult to craft an attack to do it but conceivably a machine
> running for a long enough time would trigger it particularly if the
> large block allocations are GFP_NOIO or GFP_NOFS.

It would be interesting to craft an attack. If you knew roughly the layout
and size of your dentry slab for example... maybe you could stat a whole
lot of files, then open one and keep it open (maybe post the fd to a unix
socket or something crazy!) when you think you have filled up a couple
of MB worth of them. Repeat the process until your movable zone is
gone. Or do the same things with pagetables, or task structs, or radix
tree nodes, etc.. these are the kinds of things I worry about (as well as
just the gradual natural degradation).

Yeah, it might be reasonably possible to make an attack that would
deplete most of higher order allocations while pinning somewhat close
to just the theoretical minimum required.

[snip]

Thanks Mel. Fairly good summary I think.


> > Basically, if you're placing your hopes for VM and IO scalability on
> > this, then I think that's a totally broken thing to do and will end up
> > making the kernel worse in the years to come (except maybe on some poor
> > configurations of bad hardware).
>
> My magic 8-ball is in the garage.
>
> I thought the following plan was sane but I could be la-la
>
> 1. Go with large block + explosions to start with
> - Second class feature at this point, not fully supported
> - Experiment in different places to see what it gains (if anything)
> 2. Get fs-block in slowly over time with the fallback options replacing
> Christophs patches bit by bit
> 3. Kick away warnings
> - First class feature at this point, fully supported

I guess that was my hope. The only problem I have with a 2nd class
higher order pagecache on a *practical* technical issue is introducing
more complexity in the VM for mmap. Andrea and Hugh are probably
more guardians of that area of code than I, so if they're happy with the
mmap stuff then again I can accept being overruled on this ;)

Then I would love to say #2 will go ahead (and I hope it would), but I
can't force it down the throat of the filesystem maintainers just like I
feel they can't force vm devs (me) to do a virtually mapped and
defrag-able kernel :) Basically I'm trying to practice what I preach and
I don't want to force fsblock onto anyone.

Maybe when ext2 is converted and if I can show it isn't a performance
problem / too much complexity then I'll have another leg to stand on
here... I don't know.

> Independently of that, we would work on order-0 scalability,
> particularly readahead and batching operations on ranges of pages as
> much as possible.

Definitely. Also, aops capable of spanning multiple pages, batching of
large write(2) pagecache insertion, etc all are things we must go after,
regardless of the large page and/or block size work.
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wednesday 12 September 2007 04:31, Mel Gorman wrote:
> On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote:
> > Hi Mel,
>
> Hi,
>
> > On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> > > that increasing the pagesize like what Andrea suggested would lead to
> > > internal fragmentation problems. Regrettably we didn't discuss Andrea's
> >
> > The config_page_shift approach guarantees that kernel stacks or whatever
> > other non-defragmentable allocations go into the same 64k "not
> > defragmentable" page. Not like with the SGI design, where an 8k kernel
> > stack could be allocated in the first 64k page, and then another 8k stack
> > could be allocated in the next 64k page, effectively pinning all 64k
> > pages until Nick's worst case scenario triggers.
>
> In practice, it's pretty difficult to trigger. Buddy allocators always
> try and use the smallest possible sized buddy to split. Once a 64K is
> split for a 4K or 8K allocation, the remainder of that block will be
> used for other 4K, 8K, 16K, 32K allocations. The situation where
> multiple 64K blocks gets split does not occur.
>
> Now, the worst case scenario for your patch is that a hostile process
> allocates large amount of memory and mlocks() one 4K page per 64K chunk
> (this is unlikely in practice I know). The end result is you have many
> 64KB regions that are now unusable because 4K is pinned in each of them.
> Your approach is not immune from problems either. To me, only Nick's
> approach is bullet-proof in the long run.

One important thing, I think, is that in Andrea's case the memory will be
accounted for (eg. we can limit mlock, or work within various memory
accounting things).

With fragmentation, I suspect it will be much more difficult to do this. It
would be another layer of heuristics that will also inevitably go wrong
at times if you try to limit how much "fragmentation" a process can do.
Quite likely it is hard to make something even work reasonably well in
most cases.


> > We can still try to save some memory by
> > defragging the slab a bit, but it's by far *not* required with
> > config_page_shift. No defrag at all is required infact.
>
> You will need to take some sort of defragmentation to deal with internal
> fragmentation. It's a very similar problem to blasting away at slab
> pages and still not being able to free them because objects are in use.
> Replace "slab" with "large page" and "object" with "4k page" and the
> issues are similar.

Well yes, and slab has issues today too with internal fragmentation,
targeted reclaim and some (small) higher order allocations.
But at least with config_page_shift, you don't introduce _new_ sources
of problems (eg. coming from pagecache or other allocs).

Sure, there are some other things -- like pagecache can actually use
up more memory instead -- but there are a number of other positives
that Andrea's has as well. It is using order-0 pages, which are first class
throughout the VM; they have per-cpu queues, and do not require any
special reclaim code. They also *actually do* reduce the page
management overhead in the general case, unlike higher order pcache.

So combined with the accounting issues, I think it is unfair to say that
Andrea's is just moving the fragmentation to internal. It has a number
of upsides. I have no idea how it will actually behave and perform, mind
you ;)


> > Plus there's a cost in defragging and freeing cache... the more you
> > need defrag, the slower the kernel will be.
> >
> > > approach in depth.
> >
> > Well it wasn't my fault if we didn't discuss it in depth though.
>
> If it's my fault, sorry about that. It wasn't my intention.

I think it did get brushed aside a little quickly too (not blaming anyone).
Maybe because Linus was hostile. But *if* the idea is that page
management overhead has or will become a problem that needs fixing,
then neither higher order pagecache, nor (obviously) fsblock, fixes this
properly. Andrea's most definitely has the potential to.
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wednesday 12 September 2007 04:25, Maxim Levitsky wrote:

> Hi,
>
> I think that the fundamental problem is not fragmentation/large pages/...
>
> The problem is the VM itself.
> The vm doesn't use virtual memory, that's all, that's the problem.
> Although this will be probably linux 3.0, I think that the right way to
> solve all those problems is to make all kernel memory vmalloced (except few
> areas like kernel .text)
>
> It will suddenly remove the buddy allocator, it will remove need for
> highmem, it will allow to allocate any amount of memory (for example 4k
> stacks will be obsolete)
> It will even allow kernel memory to be swapped to disk.
>
> This is the solution, but it is very very hard.

I'm not sure that it is too hard. OK it is far from trivial...

This is not a new idea though, it has been floated around for a long
time (since before Linux I'm sure, although I have no references).

There are lots of reasons why such an approach has fundamental
performance problems too, however. Your kernel can't use huge tlbs
for a lot of memory, you can't find the physical address of a page
without walking page tables, defragmenting still has a significant
cost in terms of moving pages and flushing TLBs etc.

So the train of thought up to now has been that a virtually mapped
kernel would be "the problem with the VM itself" ;)

We're actually at a point now where higher order allocations are
pretty rare and not a big problem (except with very special cases
like hugepages and memory hotplug which can mostly get away
with compromises, so we don't want to turn over the kernel just
for these).

So in my opinion, any increase of the dependence on higher order
allocations is simply a bad move until a killer use-case can be found.
They move us further away from good behaviour on our assumed
ideal of an identity mapped kernel.

(I don't actually dislike the idea of virtually mapped kernel. Maybe
hardware trends will favour that model and there are some potential
simple instructions a CPU can implement to help with some of the
performance hits. I'm sure it will be looked at again for Linux one day)
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wednesday 12 September 2007 06:01, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > There is a limitation in the VM. Fragmentation. You keep saying this
> > is a solved issue and just assuming you'll be able to fix any cases
> > that come up as they happen.
> >
> > I still don't get the feeling you realise that there is a fundamental
> > fragmentation issue that is unsolvable with Mel's approach.
>
> Well my problem first of all is that you did not read the full message. It
> discusses that later and provides page pools to address the issue.
>
> Secondly you keep FUDding people with lots of theoretical concerns
> assuming Mel's approaches must fail. If there is an issue (I guess there
> must be right?) then please give us a concrete case of a failure that we
> can work against.

On the other hand, you ignore the potential failure cases, and ignore
the alternatives that do not have such cases.
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wednesday 12 September 2007 06:11, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > It would be interesting to craft an attack. If you knew roughly the
> > layout and size of your dentry slab for example... maybe you could stat a
> > whole lot of files, then open one and keep it open (maybe post the fd to
> > a unix socket or something crazy!) when you think you have filled up a
> > couple of MB worth of them. Repeat the process until your movable zone is
> > gone. Or do the same things with pagetables, or task structs, or radix
> > tree nodes, etc.. these are the kinds of things I worry about (as well as
> > just the gradual natural degradation).
>
> I guess you would have to run that without my targeted slab reclaim
> patchset? Otherwise the slab that are in the way could be reclaimed and
> you could not produce your test case.

I didn't realise you had patches to move pinned dentries, radix tree nodes,
task structs, page tables etc. Did I miss them in your last patchset?
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wednesday 12 September 2007 06:01, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > There is a limitation in the VM. Fragmentation. You keep saying this
> > is a solved issue and just assuming you'll be able to fix any cases
> > that come up as they happen.
> >
> > I still don't get the feeling you realise that there is a fundamental
> > fragmentation issue that is unsolvable with Mel's approach.
>
> Well my problem first of all is that you did not read the full message. It
> discusses that later and provides page pools to address the issue.
>
> Secondly you keep FUDding people with lots of theoretical concerns
> assuming Mel's approaches must fail. If there is an issue (I guess there
> must be right?) then please give us a concrete case of a failure that we
> can work against.

And BTW, before you accuse me of FUD, I'm actually talking about the
fragmentation issues on which Mel I think mostly agrees with me at this
point.

Also, do you really have a rational reason why we should just up and accept
all these big changes happening merely because, while there are lots
of theoretical issues, the person pointing them out to you hasn't happened
to give you a concrete failure case? Oh, and the actual performance
benefit is not really even quantified yet, crappy hardware
notwithstanding, and neither has a proper evaluation of the alternatives
been done.

So... would you drive over a bridge if the engineer had this mindset?
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wednesday 12 September 2007 06:42, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > > I guess you would have to run that without my targeted slab reclaim
> > > patchset? Otherwise the slab that are in the way could be reclaimed and
> > > you could not produce your test case.
> >
> > I didn't realise you had patches to move pinned dentries, radix tree
> > nodes, task structs, page tables etc. Did I miss them in your last
> > patchset?
>
> You did not mention that in your earlier text.

Actually, I am pretty sure everything I mentioned was explicitly
things that your patches do not handle. This was not a coincidence.


> If these are issues then we
> certainly can work on that. Could you first provide us some real failure
> conditions so that we know that these are real problems?

I think I would have as good a shot as any to write a fragmentation
exploit, yes. I think I've given you enough info to do the same, so I'd
like to hear a reason why it is not a problem.
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wednesday 12 September 2007 06:53, Mel Gorman wrote:
> On (11/09/07 11:44), Nick Piggin didst pronounce:

> However, this discussion belongs more with the non-existent-remove-slab
> patch. Based on what we've seen since the summits, we need a thorough
> analysis with benchmarks before making a final decision (kernbench, ebizzy,
> tbench (netpipe if someone has the time/resources), hackbench and maybe
> sysbench as well as something the filesystem people recommend to get good
> coverage of the subsystems).

True. Aside, it seems I might have been mistaken in saying Christoph
is proposing to use higher order allocations to fix the SLUB regression.
Anyway, I agree let's not get sidetracked about this here.


> I'd rather not get side-tracked here. I regret you feel steamrolled but I
> think grouping pages by mobility is the right thing to do for better usage
> of the TLB by the kernel and for improving hugepage support in userspace
> minimally. We never really did see eye-to-eye but this way, if I'm wrong
> you get to chuck eggs down the line.

No it's a fair point, and even the hugepage allocations alone are a fair
point. From the discussions I think it seems like quite probably the right
thing to do pragmatically, which is what Linux is about and I hope will
result in a better kernel in the end. So I don't have complaints except
from little ivory tower ;)


> > Sure. And some people run workloads where fragmentation is likely never
> > going to be a problem, they are shipping this poorly configured hardware
> > now or soon, so they don't have too much interest in doing it right at
> > this point, rather than doing it *now*. OK, that's a valid reason which
> > is why I don't use the argument that we should do it correctly or never
> > at all.
>
> So are we saying the right thing to do is go with fs-block from day 1 once
> we get it to optimistically use high-order pages? I think your concern
> might be that if this goes in then it'll be harder to justify fsblock in
> the future because it'll be solving a theoretical problem that takes months
> to trigger if at all. i.e. The filesystem people will push because
> apparently large block support as it is solves world peace. Is that
> accurate?

Heh. It's hard to say. I think fsblock could take a while to implement,
regardless of high order pages or not. I actually would like to be able
to pass down a mandate to say higher order pagecache will never
get merged, simply so that these talented people would work on
fsblock ;)

But that's not my place to say, and I'm actually not arguing that high
order pagecache does not have uses (especially as a practical,
shorter-term solution which is unintrusive to filesystems).

So no, I don't think I'm really going against the basics of what we agreed
in Cambridge. But it sounds like it's still being billed as first-order
support right off the bat here.


> > OTOH, I'm not sure how much buy-in there was from the filesystems guys.
> > Particularly Christoph H and XFS (which is strange because they already
> > do vmapping in places).
>
> I think they use vmapping because they have to, not because they want
> to. They might be a lot happier with fsblock if it used contiguous pages
> for large blocks whenever possible - I don't know for sure. The metadata
> accessors they might be unhappy with because it's inconvenient but as
> Christoph Hellwig pointed out at VM/FS, the filesystems who really care
> will convert.

Sure, they would rather not. But there are also a lot of ways you can
improve vmap more than what XFS does (or probably what darwin does)
(more persistence for cached objects, and batched invalidates for example).
There are also a lot of trivial things you can do to make a lot of those
accesses not require vmaps (and less trivial things, but even such things
as binary searches over multiple pages should be quite possible with a bit
of logic).


> > It would be interesting to craft an attack. If you knew roughly the
> > layout and size of your dentry slab for example... maybe you could stat a
> > whole lot of files, then open one and keep it open (maybe post the fd to
> > a unix socket or something crazy!) when you think you have filled up a
> > couple of MB worth of them.
>
> I might regret saying this, but it would be easier to craft an attack
> using pagetable pages. It's woefully difficult to do but it's probably
> doable. I say pagetables because while slub targeted reclaim is on the
> cards and memory compaction exists for page cache pages, pagetables are
> currently pinned with no prototype patch existing to deal with them.

But even so, you can just hold an open fd in order to pin the dentry you
want. My attack would go like this: get the page size and allocation group
size for the machine, then get the number of dentries required to fill a
slab. Then read in that many dentries and pin one of them. Repeat the
process. Even if there is other activity on the system, it seems possible
that such a thing will cause some headaches after not too long a time.
Some sources of pinned memory are going to be better than others for
this of course, so yeah maybe pagetables will be a bit easier (I don't know).


> > Then I would love to say #2 will go ahead (and I hope it would), but I
> > can't force it down the throat of the filesystem maintainers just like I
> > feel they can't force vm devs (me) to do a virtually mapped and
> > defrag-able kernel :) Basically I'm trying to practice what I preach and
> > I don't want to force fsblock onto anyone.
>
> If the FS people really want it and they insist that this has to be a
> #1 citizen then it's fsblock or make something new up.

Well I'm glad you agree :) I think not all do, but as you say maybe the
only thing is just to leave it up to the individual filesystems...
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wednesday 12 September 2007 07:41, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > I think I would have as good a shot as any to write a fragmentation
> > exploit, yes. I think I've given you enough info to do the same, so I'd
> > like to hear a reason why it is not a problem.
>
> No you have not explained why the theoretical issues continue to exist
> given even just considering Lumpy Reclaim in .23 nor what effect the
> antifrag patchset would have.

So how does lumpy reclaim, your slab patches, or anti-frag have
much effect on the worst case situation? Or help much against a
targetted fragmentation attack?


> And you have used a 2M pagesize which is
> irrelevant to this patchset that deals with blocksizes up to 64k. In my
> experience the use of blocksize < PAGE_COSTLY_ORDER (32k) is reasonably
> safe.

I used EXACTLY the page sizes that you brought up in your patch
description (ie. 64K and 2MB).
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wednesday 12 September 2007 07:48, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > But that's not my place to say, and I'm actually not arguing that high
> > order pagecache does not have uses (especially as a practical,
> > shorter-term solution which is unintrusive to filesystems).
> >
> > So no, I don't think I'm really going against the basics of what we
> > agreed in Cambridge. But it sounds like it's still being billed as
> > first-order support right off the bat here.
>
> Well its seems that we have different interpretations of what was agreed
> on. My understanding was that the large blocksize patchset was okay
> provided that I supply an acceptable mmap implementation and put a
> warning in.

Yes. I think we differ on our interpretations of "okay". In my interpretation,
it is not OK to use this patch as a way to solve VM or FS or IO scalability
issues, especially not while the alternative approaches that do _not_ have
these problems have not been adequately compared or argued against.


> > But even so, you can just hold an open fd in order to pin the dentry you
> > want. My attack would go like this: get the page size and allocation
> > group size for the machine, then get the number of dentries required to
> > fill a slab. Then read in that many dentries and pin one of them. Repeat
> > the process. Even if there is other activity on the system, it seems
> > possible that such a thing will cause some headaches after not too long a
> > time. Some sources of pinned memory are going to be better than others
> > for this of course, so yeah maybe pagetables will be a bit easier (I
> > don't know).
>
> Well even without slab targeted reclaim: Mel's antifrag will sort the
> dentries into separate blocks of memory and so isolate the issue.

So even after all this time you do not understand what the fundamental
problem is with anti-frag and yet you are happy to waste both our time
in endless flamewars telling me how wrong I am about it.

Forgive me if I'm starting to be rude, Christoph. This is really irritating.
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Tue, Sep 11, 2007 at 04:52:19AM +1000, Nick Piggin wrote:
> The idea that there even _is_ a bug to fail when higher order pages
> cannot be allocated was also brushed aside by some people at the
> vm/fs summit. I don't know if those people had gone through the
> math about this, but it goes somewhat like this: if you use a 64K
> page size, you can "run out of memory" with 93% of your pages free.
> If you use a 2MB page size, you can fail with 99.8% of your pages
> still free. That's 64GB of memory used on a 32TB Altix.
>
> If you don't consider that is a problem because you don't care about
> theoretical issues or nobody has reported it from running -mm

-mm kernels also forbid mmap, so there's no chance the largepages are
mlocked etc... that's not the final thing that is being measured.

> kernels, then I simply can't argue against that on a technical basis.
> But I'm totally against introducing known big fundamental problems to
> the VM at this stage of the kernel. God knows how long it takes to ever
> fix them in future after they have become pervasive throughout the
> kernel.

Seconded.

> IMO the only thing that higher order pagecache is good for is a quick
> hack for filesystems to support larger block sizes. And after seeing it
> is fairly ugly to support mmap, I'm not even really happy for it to do
> that.

Additionally I feel the ones that will get the main advantage from the
quick hack are the crippled devices that are ~30% slower if the SG
tables are large.

> If VM scalability is a problem, then it needs to be addressed in other
> areas anyway for order-0 pages, and if contiguous pages helps IO
> scalability or crappy hardware, then there is nothing stopping us from

Yep.

> *attempting* to get contiguous memory in the current scheme.
>
> Basically, if you're placing your hopes for VM and IO scalability on this,
> then I think that's a totally broken thing to do and will end up making
> the kernel worse in the years to come (except maybe on some poor
> configurations of bad hardware).

Agreed. For my part I am really convinced the only sane way to
approach the VM scalability and larger physically contiguous pages
problem is the CONFIG_PAGE_SHIFT patch (aka large PAGE_SIZE from Hugh
for 2.4). I also have to say I always disliked the PAGE_CACHE_SIZE
definition too ;). I take it only as an attempt at documentation.

Furthermore all the issues with writeprotect faults over MAP_PRIVATE
regions will have to be addressed the same way with both approaches if
we want real 100% 4k-granular backwards compatibility.

On this topic I'm also going to suggest the cpu vendors to add a 64k
tlb using the reserved 62nd bitflag in the pte (right after the NX
bit). So if alignment allows we can map pagecache with a 64k large tlb
on x86 (with a PAGE_SIZE of 64k), mixing it with the 4k tlb in the
same address space if userland alignment forbids using the 64k tlb. If
we want to break backwards compatibility and force all alignments on
64k and get rid of any 4k tlb to simplify the page fault code we can
do it later anyway... No idea if this is feasible to achieve on the
hardware level though, it's not my problem anyway to judge this ;). As
constraints to the hardware interface it would be ok to require the
62nd 64k-tlb bitflag to be only available on the pte that would have
normally mapped a physical address 64k naturally aligned, and to
require all later overlapping 4k ptes to be set to 0. If you've better
ideas to achieve this than my interface please let me know.

And if I'm terribly wrong and the variable order pagecache is the way
to go for the long run, the 64k tlb feature will fit in that model
very nicely too.

The reason of the 64k magic number is that this is the minimum unit of
contiguous I/O required to reach platter speed on most devices out
there. And it incidentally also matches ppc64 ;).
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Tue, 11 September 2007 04:52:19 +1000, Nick Piggin wrote:
> On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
>
> > 5. VM scalability
> > Large block sizes mean less state keeping for the information being
> > transferred. For a 1TB file one needs to handle 256 million page
> > structs in the VM if one uses 4k page size. A 64k page size reduces
> > that amount to 16 million. If the limitation in existing filesystems
> > are removed then even higher reductions become possible. For very
> > large files like that a page size of 2 MB may be beneficial which
> > will reduce the number of page struct to handle to 512k. The variable
> > nature of the block size means that the size can be tuned at file
> > system creation time for the anticipated needs on a volume.
>
> The idea that there even _is_ a bug to fail when higher order pages
> cannot be allocated was also brushed aside by some people at the
> vm/fs summit. I don't know if those people had gone through the
> math about this, but it goes somewhat like this: if you use a 64K
> page size, you can "run out of memory" with 93% of your pages free.
> If you use a 2MB page size, you can fail with 99.8% of your pages
> still free. That's 64GB of memory used on a 32TB Altix.

While I agree with your concern, those numbers are quite silly. The
chances of 99.8% of pages being free and the remaining 0.2% being
perfectly spread across all 2MB large_pages are lower than those of SHA1
creating a collision. I don't see anyone abandoning git or rsync, so
your extreme example clearly is the wrong one.

Again, I agree with your concern, even though your example makes it look
silly.

Jörn

--
You can't tell where a program is going to spend its time. Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you've proven that's where the bottleneck is.
-- Rob Pike
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Tue, 2007-09-11 at 04:52 +1000, Nick Piggin wrote:
> On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
>
> > 5. VM scalability
> > Large block sizes mean less state keeping for the information being
> > transferred. For a 1TB file one needs to handle 256 million page
> > structs in the VM if one uses 4k page size. A 64k page size reduces
> > that amount to 16 million. If the limitation in existing filesystems
> > are removed then even higher reductions become possible. For very
> > large files like that a page size of 2 MB may be beneficial which
> > will reduce the number of page struct to handle to 512k. The variable
> > nature of the block size means that the size can be tuned at file
> > system creation time for the anticipated needs on a volume.
>
> There is a limitation in the VM. Fragmentation. You keep saying this
> is a solved issue and just assuming you'll be able to fix any cases
> that come up as they happen.
>
> I still don't get the feeling you realise that there is a fundamental
> fragmentation issue that is unsolvable with Mel's approach.
>

I thought we had discussed this already at VM and reached something
resembling a conclusion. It was acknowledged that depending on
contiguous allocations to always succeed will get a caller into trouble
and they need to deal with fallback - whether the problem was
theoretical or not. It was also strongly pointed out that the large
block patches as presented would be vulnerable to that problem.

The alternatives were fs-block and increasing the size of order-0. It
was felt that fs-block was far away because it's complex and I thought
that increasing the pagesize like what Andrea suggested would lead to
internal fragmentation problems. Regrettably we didn't discuss Andrea's
approach in depth.

I *thought* that the end conclusion was that we would go with
Christoph's approach pending two things being resolved;

o mmap() support that we agreed on is good
o A clear statement, with logging maybe for users that mounted a large
block filesystem that it might blow up and they get to keep both parts
when it does. Basically, for now it's only suitable in specialised
environments.

I also thought there was an acknowledgement that long-term, fs-block was
the way to go - possibly using contiguous pages optimistically instead
of virtual mapping the pages. At that point, it would be a general
solution and we could remove the warnings.

Basically, to start out with, this was going to be an SGI-only thing so
they get to rattle out the issues we expect to encounter with large
blocks and help steer the direction of the
more-complex-but-safer-overall fs-block.

> The idea that there even _is_ a bug to fail when higher order pages
> cannot be allocated was also brushed aside by some people at the
> vm/fs summit.

When that brushing occurred, I thought I made it very clear what the
expectations were and that without fallback they would be taking a risk.
I am not sure if that message actually sank in or not.

That said, the filesystem people can experiment to some extent against
Christoph's approach as long as they don't think they are 100% safe.
Again, their experimenting will help steer the direction of fs-block.

>
> I don't know if those people had gone through the
> math about this, but it goes somewhat like this: if you use a 64K
> page size, you can "run out of memory" with 93% of your pages free.
> If you use a 2MB page size, you can fail with 99.8% of your pages
> still free. That's 64GB of memory used on a 32TB Altix.
>

That's the absolute worst case but yes, in theory this can occur and
it's safest to assume the situation will occur somewhere to someone. It
would be difficult to craft an attack to do it but conceivably a machine
running for a long enough time would trigger it particularly if the
large block allocations are GFP_NOIO or GFP_NOFS.

> If you don't consider that is a problem because you don't care about
> theoretical issues or nobody has reported it from running -mm
> kernels, then I simply can't argue against that on a technical basis.

The -mm kernels have patches related to watermarking that will not be
making it to mainline for reasons we don't need to revisit right now.
The lack of the watermarking patches may turn out to be a non-issue but
the point is that what's in mainline is not exactly the same as -mm and
mainline will be running for longer periods of time in a different
environment.

Where we expected to see the use of this patchset was in specialised
environments *only*. The SGI people can mitigate their mixed
fragmentation problems somewhat by setting slub_min_order ==
large_block_order so that blocks get allocated and freed at the same
size. This is partway towards Andrea's solution of raising the size
of an order-0 allocation. The point of printing out the warnings at
mount time was not so much for the general user, who may miss the logs,
but for distributions that consider turning large block use on by
default, to discourage them until such time as we have proper fallback
in place.
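
For reference, the order works out like this (plain user-space
arithmetic, nothing kernel-specific; 4k pages and 64k blocks are just
the example sizes discussed here):

#include <stdio.h>

/* Smallest order such that (page_size << order) covers block_size;
 * 4 for 4k pages and 64k blocks, i.e. the value one would pass as
 * slub_min_order= so slabs are allocated and freed at the large
 * block granularity. */
int main(void)
{
	unsigned long page_size = 4096;
	unsigned long block_size = 64 * 1024;
	int order = 0;

	while ((page_size << order) < block_size)
		order++;

	printf("large_block_order = %d (slub_min_order=%d)\n", order, order);
	return 0;
}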

> But I'm totally against introducing known big fundamental problems to
> the VM at this stage of the kernel. God knows how long it takes to ever
> fix them in future after they have become pervasive throughout the
> kernel.
>
> IMO the only thing that higher order pagecache is good for is a quick
> hack for filesystems to support larger block sizes. And after seeing it
> is fairly ugly to support mmap, I'm not even really happy for it to do
> that.
>

If the mmap() support is poor and going to be an obstacle in the future,
then that is a reason to hold it up. I haven't actually read the mmap()
support patch yet, so I have no worthwhile opinion at this point.

If the mmap() mess can be agreed on, the large block patchset as it is
could give us important information, from the users willing to deal with
this risk, about what sort of behaviour to expect. If they find it fails
all the time, then fs-block having the complexity of optimistically
using large pages is not worthwhile either. That is useful data.

> If VM scalability is a problem, then it needs to be addressed in other
> areas anyway for order-0 pages, and if contiguous pages helps IO
> scalability or crappy hardware, then there is nothing stopping us from
> *attempting* to get contiguous memory in the current scheme.
>

This was also brought up at the VM Summit, but for the benefit of the
people that were not there:

It was emphasised that large block support is not the solution to all
scalability problems, and there was a strong emphasis that fixing up
the order-0 uses should be encouraged. In particular, readahead should
be batched so that each page is not individually locked. There were also
other page-related operations that should be done in batch. On a similar
note, it was pointed out that dcache lookup is something that should be
scaled better - possibly before spending too much time on things like
page cache or radix locks.
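
As a rough user-space illustration of the batching point (a pthread
mutex standing in for whatever per-page or per-mapping locking is
involved - this is not kernel code, just the shape of the saving):

#include <pthread.h>
#include <stddef.h>

/* Stand-ins: a page descriptor and the per-page work done on it. */
struct page_stub { unsigned long index; };
static void process_one(struct page_stub *p) { (void)p; }

static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;

/* One lock round trip per page: what "individually locked" looks like. */
static void per_page(struct page_stub **pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		pthread_mutex_lock(&tree_lock);
		process_one(pages[i]);
		pthread_mutex_unlock(&tree_lock);
	}
}

/* Batched: the lock (and any other per-call overhead) is amortised
 * over the whole range, which is the point being made for readahead. */
static void batched(struct page_stub **pages, size_t n)
{
	pthread_mutex_lock(&tree_lock);
	for (size_t i = 0; i < n; i++)
		process_one(pages[i]);
	pthread_mutex_unlock(&tree_lock);
}

int main(void)
{
	struct page_stub pages[8], *ptrs[8];

	for (size_t i = 0; i < 8; i++) {
		pages[i].index = i;
		ptrs[i] = &pages[i];
	}
	per_page(ptrs, 8);
	batched(ptrs, 8);
	return 0;
}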

For scalability, it was also pointed out at some point that heavy users
of large blocks may now find themselves contending on the zone->lock and
they might well find that order-0 pages were what they wanted to use
anyway.

> Basically, if you're placing your hopes for VM and IO scalability on this,
> then I think that's a totally broken thing to do and will end up making
> the kernel worse in the years to come (except maybe on some poor
> configurations of bad hardware).

My magic 8-ball is in the garage.

I thought the following plan was sane, but I could be la-la:

1. Go with large block + explosions to start with
- Second-class feature at this point, not fully supported
- Experiment in different places to see what it gains (if anything)
2. Get fs-block in slowly over time with the fallback options replacing
Christoph's patches bit by bit
3. Kick away warnings
- First-class feature at this point, fully supported

Independently of that, we would work on order-0 scalability,
particularly readahead and batching operations on ranges of pages as
much as possible.

--
Mel "la-la" Gorman

Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 11:49, David Chinner wrote:
> On Tue, Sep 11, 2007 at 04:00:17PM +1000, Nick Piggin wrote:
> > > > OTOH, I'm not sure how much buy-in there was from the filesystems
> > > > guys. Particularly Christoph H and XFS (which is strange because they
> > > > already do vmapping in places).
> > >
> > > I think they use vmapping because they have to, not because they want
> > > to. They might be a lot happier with fsblock if it used contiguous
> > > pages for large blocks whenever possible - I don't know for sure. The
> > > metadata accessors they might be unhappy with because it's inconvenient
> > > but as Christoph Hellwig pointed out at VM/FS, the filesystems who
> > > really care will convert.
> >
> > Sure, they would rather not to. But there are also a lot of ways you can
> > improve vmap more than what XFS does (or probably what darwin does)
> > (more persistence for cached objects, and batched invalidates for
> > example).
>
> XFS already has persistence across the object life time (which can be many
> tens of seconds for a frequently used buffer)

But you don't do a very good job. When you go above 64 cached mappings,
you purge _all_ of them. fsblock's vmap cache can have a much higher number
(if you want), and purging can just unmap a batch which is decided by a simple
LRU (thus important metadata gets saved).


> and it also does batched
> unmapping of objects as well.

It also could do a lot better at unmapping. Currently you're just calling
vunmap a lot of times in sequence. That still requires global IPIs and TLB
flushing every time.

This simple patch should easily be able to reduce that number by 2 or 3
orders of magnitude (maybe more on big systems).
http://www.mail-archive.com/linux-arch@vger.kernel.org/msg03956.html

vmap area locking and data structures could also be made a lot better
quite easily, I suspect.


> > There are also a lot of trivial things you can do to make a lot of those
> > accesses not require vmaps (and less trivial things, but even such things
> > as binary searches over multiple pages should be quite possible with a
> > bit of logic).
>
> Yes, we already do the many of these things (via xfs_buf_offset()), but
> that is not good enough for something like a memcpy that spans multiple
> pages in a large block (think btree block compaction, splits and
> recombines).

fsblock_memcpy(fsblock *src, int soff, fsblock *dst, int doff, int size); ?
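
Something like the following sketch, where the struct layout is purely
hypothetical and not the real fsblock one; the copy walks the backing
pages one at a time with kmap()/kunmap(), so no vmap is needed:

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/string.h>

/* Hypothetical sketch only: copy size bytes from offset soff in src to
 * offset doff in dst, touching one backing page at a time so the large
 * block never has to be virtually mapped. */
struct fsblock_sketch {
	struct page **pages;	/* pages backing the block, in order */
};

static void fsblock_memcpy_sketch(struct fsblock_sketch *src, int soff,
				  struct fsblock_sketch *dst, int doff,
				  int size)
{
	while (size > 0) {
		int s_in = soff & (PAGE_SIZE - 1);
		int d_in = doff & (PAGE_SIZE - 1);
		int chunk = size;
		char *s, *d;

		if (chunk > PAGE_SIZE - s_in)
			chunk = PAGE_SIZE - s_in;
		if (chunk > PAGE_SIZE - d_in)
			chunk = PAGE_SIZE - d_in;

		s = kmap(src->pages[soff >> PAGE_SHIFT]);
		d = kmap(dst->pages[doff >> PAGE_SHIFT]);
		memcpy(d + d_in, s + s_in, chunk);
		kunmap(dst->pages[doff >> PAGE_SHIFT]);
		kunmap(src->pages[soff >> PAGE_SHIFT]);

		soff += chunk;
		doff += chunk;
		size -= chunk;
	}
}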


> IOWs, we already play these vmap harm-minimisation games in the places
> where we can, but still the overhead is high and something we'd prefer
> to be able to avoid.

I don't think you've looked nearly hard enough at all this low-hanging
fruit.

I just gave 4 things which combined might easily reduce xfs vmap overhead
by several orders of magnitude, all without changing much code at all.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Nick Piggin <nickpiggin@yahoo.com.au> writes:

> On Tuesday 11 September 2007 22:12, Jörn Engel wrote:
>> On Tue, 11 September 2007 04:52:19 +1000, Nick Piggin wrote:
>> > On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
>> > > 5. VM scalability
>> > > Large block sizes mean less state keeping for the information being
>> > > transferred. For a 1TB file one needs to handle 256 million page
>> > > structs in the VM if one uses 4k page size. A 64k page size reduces
>> > > that amount to 16 million. If the limitations in existing filesystems
>> > > are removed then even higher reductions become possible. For very
>> > > large files like that a page size of 2 MB may be beneficial which
>> > > will reduce the number of page structs to handle to 512k. The
>> > > variable nature of the block size means that the size can be tuned at
>> > > file system creation time for the anticipated needs on a volume.
>> >
>> > The idea that there even _is_ a bug to fail when higher order pages
>> > cannot be allocated was also brushed aside by some people at the
>> > vm/fs summit. I don't know if those people had gone through the
>> > math about this, but it goes somewhat like this: if you use a 64K
>> > page size, you can "run out of memory" with 93% of your pages free.
>> > If you use a 2MB page size, you can fail with 99.8% of your pages
>> > still free. That's 64GB of memory used on a 32TB Altix.
>>
>> While I agree with your concern, those numbers are quite silly. The
>
> They are the theoretical worst case. Obviously with a non trivially
> sized system and non-DoS workload, they will not be reached.

I would think it should be pretty hard to have only one page out of
each 2MB chunk allocated and non-evictable (not writeable, swappable
or movable). Wouldn't that require some kernel driver to allocate all
pages and then selectively free them in such a pattern as to keep one
page per 2MB chunk?

Assuming nothing tries to allocate a large chunk of ram while holding
too many locks for the kernel to free it.

>> chances of 99.8% of pages being free and the remaining 0.2% being
>> perfectly spread across all 2MB large_pages are lower than those of SHA1
>> creating a collision. I don't see anyone abandoning git or rsync, so
>> your extreme example clearly is the wrong one.
>>
>> Again, I agree with your concern, even though your example makes it look
>> silly.
>
> It is not simply a question of once-off chance for an all-at-once layout
> to fail in this way. Fragmentation slowly builds over time, and especially
> if you do actually use higher-order pages for a significant number of
> things (unlike we do today), then the problem will become worse. If you
> have any part of your workload that is affected by fragmentation, then
> it will cause unfragmented regions to eventually be used for fragmentation
> inducing allocations (by definition -- if it did not, eg. then there would be
> no fragmentation problem and no need for Mel's patches).

It might be naive (stop me as soon as I go into dream world) but I
would think there are two kinds of fragmentation:

Hard fragments - physical pages the kernel can't move around
Soft fragments - virtual pages/cache that happen to cause a fragment

I would further assume most ram is used on soft fragments and that the
kernel will free them up by flushing or swapping the data when there
is sufficient need. With defragmentation support the kernel could
prevent some flushing or swapping by moving the data from one
physical page to another. But that would just reduce unnecessary work
and not change the availability of larger pages.

Further I would assume that there are two kinds of hard fragments:
Fragments allocated once at start time and temporary fragments.

At boot time (or when a module is loaded or something) you get a tiny
amount of ram allocated that will remain busy for basically ever. You
get some fragmentation right there that you can never get rid of.

At runtime a lot of pages are allocated and quickly freed again. They
preferably get placed in regions where there already is fragmentation -
in regions where there are suitably sized holes already. They would
only break a free 2MB chunk into smaller chunks if there is no small
hole to be found.

Now a trick I would use is to put kernel-allocated pages at one end of
the ram and virtual/cache pages at the other end. Small kernel allocs
would find holes at the start of the ram while big allocs would have
to move more to the middle or end of the ram to find a large enough
hole. And virtual/cache pages could always be cleared out to free
large contiguous chunks.

Splitting the two types would prevent fragmentation between freeable
and non-freeable regions, always giving us a large pool to pull
compound pages from.

One could also split the ram into regions of different page sizes,
meaning that some large compound pages may not be split below a
certain limit. E.g. some amount of ram would be reserved for chunks
>= 64k only. This should be configurable via sysfs.

> I don't know what happens as time tends towards infinity, but I don't think
> it will be good.

It depends on the lifetime of the allocations. If the lifetime is
uniform enough then larger chunks of memory allocated for small
objects will always be freed after a short time. If the lifetime
varies widely then it can happen that one page of a larger chunk
remains busy far longer than the rest causing fragmentation.

I'm hoping that we don't have such wide variance in lifetime that we
run into an ENOMEM. I'm hoping allocation and freeing are not random
events that would result in an expected infinite number of allocations
being alive as time tends towards infinity. I'm hoping there is enough
dependence between the two to impose an upper limit on the
fragmentation.

> At millions of allocations per second, how long does it take to produce
> an unacceptable number of free pages before the ENOMEM condition?
> Furthermore, what *is* an unacceptable number? I don't know. I am not
> trying to push this feature in, so the burden is not mine to make sure it
> is OK.

I think the only acceptable solution is to have the code cope with
large pages being unavailable and use multiple smaller chunks instead
in a tight spot. By all means try to use a large contiguous chunk, but
never fail just because we are too fragmented. I'm sure modern systems
with 4+GB ram will not run into the wall, but I'm equally sure older
systems with as little as 64MB quickly will. Handling the fragmented
case is the only way to make sure we keep running.

> Yes, we already have some of these problems today. Introducing more
> and worse problems and justifying them because of existing ones is much
> more silly than my quoting of the numbers. IMO.

Regards,
Goswin
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Hi Mel,

On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> that increasing the pagesize like what Andrea suggested would lead to
> internal fragmentation problems. Regrettably we didn't discuss Andrea's

The config_page_shift approach guarantees that kernel stacks, and
whatever other non-defragmentable allocations there are, go into the
same 64k "not defragmentable" page. This is unlike the SGI design,
where an 8k kernel stack could be allocated in the first 64k page and
then another 8k stack in the next 64k page, effectively pinning all
64k pages until Nick's worst case scenario triggers.

What I said at the VM summit is that your reclaim-defrag patch in the
slub isn't necessarily entirely useless with config_page_shift,
because the larger the software page_size, the more partial pages we
could find in the slab, so to save some memory if there are tons of
pages very partially used, we could free some of them.

But the whole point is that with the config_page_shift, Nick's worst
case scenario can't happen by design regardless of defrag or not
defrag. While it can _definitely_ happen with SGI design (regardless
of any defrag thing). We can still try to save some memory by
defragging the slab a bit, but it's by far *not* required with
config_page_shift. No defrag at all is required, in fact.

Plus there's a cost in defragging and freeing cache... the more you
need defrag, the slower the kernel will be.

> approach in depth.

Well it wasn't my fault if we didn't discuss it in depth though. I
tried to discuss it in all possible occasions where I was suggested to
talk about it and where it was somewhat on topic. Given I wasn't even
invited at the KS, I felt it would not be appropriate for me to try to
monopolize the VM summit according to my agenda. So I happily listened
to what the top kernel developers are planning ;), while giving
some hints on what I think the right direction is instead.

> I *thought* that the end conclusion was that we would go with

Frankly I don't care what the end conclusion was.

> Christoph's approach pending two things being resolved;
>
> o mmap() support that we agreed on is good

Let's see how good the mmap support for variable order page size will
work after the 2 weeks...

> o A clear statement, with logging maybe for users that mounted a large
> block filesystem that it might blow up and they get to keep both parts
> when it does. Basically, for now it's only suitable in specialised
> environments.

Yes, but perhaps you missed that such printk is needed exactly to
provide proof that SGI design is the wrong way and it needs to be
dumped. If that printk ever triggers it means you were totally wrong.

> I also thought there was an acknowledgement that long-term, fs-block was
> the way to go - possibly using contiguous pages optimistically instead
> of virtual mapping the pages. At that point, it would be a general
> solution and we could remove the warnings.

fsblock should stack on top of config_page_shift simply. Both are
needed. You don't want to use 64k pages on a laptop but you may want a
larger blocksize for the btrees etc... if you've a large harddisk and
not much ram.

> That's the absolute worst case but yes, in theory this can occur and
> it's safest to assume the situation will occur somewhere to someone. It

Do you agree this worst case can't happen with config_page_shift?

> Where we expected to see the use of this patchset was in specialised
> environments *only*. The SGI people can mitigate their mixed
> fragmentation problems somewhat by setting slub_min_order ==
> large_block_order so that blocks get allocated and freed at the same
> size. This is partway towards Andrea's solution of raising the size
> of an order-0 allocation. The point of printing out the warnings at

Except you don't get all the full benefits of it...

Even if I could end up mapping 4k kmalloced entries in userland for
the tail packing, that IMHO would still be preferable to keeping the
base page small and making a hard effort to create large pages out of
small pages. The approach I advocate keeps the base page big and the
fast path fast, and instead does some work to split the base pages
outside the buddy for the small files.

All your defrag work is still good to have, like I said at the VM
summit if you remember, to grow the hugetlbfs at runtime etc... I just
rather avoid to depend on it to avoid I/O failure in presence of
mlocked pagecache for example.

> Independently of that, we would work on order-0 scalability,
> particularly readahead and batching operations on ranges of pages as
> much as possible.

That's pretty much an unnecessary logic, if the order0 pages become
larger.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 07:52, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > > No you have not explained why the theoretical issues continue to exist
> > > given even just considering Lumpy Reclaim in .23 nor what effect the
> > > antifrag patchset would have.
> >
> > So how does lumpy reclaim, your slab patches, or anti-frag have
> > much effect on the worst case situation? Or help much against a
> > targetted fragmentation attack?
>
> F.e. Lumpy reclaim reclaim neighboring pages and thus works against
> fragmentation. So your formulae no longer works.

OK, I'll describe how it works and what the actual problem with it is. I
haven't looked at the patches for a fair while so you can forgive my
inaccuracies in terminology or exact details.

So anti-frag groups memory into (say) 2MB chunks. Top priority heuristic
is that allocations which are movable all go into groups with other movable
memory and allocations which are not movable do not go into these
groups. This is flexible though, so if a workload wants to use more non
movable memory, it is allowed to eat into first free, then movable
groups after filling all non-movable groups. This is important because
it is what makes anti-frag flexible (otherwise it would just be a memory
reserve in another form).

In my attack, I cause the kernel to allocate lots of unmovable allocations
and deplete movable groups. I theoretically then only need to keep a
small number (1/2^N) of these allocations around in order to DoS a
page allocation of order N.
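
To put numbers on that 1/2^N (back-of-the-envelope user-space
arithmetic only; the 32TB figure is just the Altix example from above):

#include <stdio.h>

/* Worst case: one pinned 4k page per large block blocks every
 * allocation of that block size while almost all memory is still
 * free. */
int main(void)
{
	const double base_kb = 4;		/* 4k base pages */
	const double block_kb[] = { 64, 2048 };	/* 64k and 2MB blocks */
	const double ram_kb = 32.0 * 1024 * 1024 * 1024;	/* 32TB */

	for (int i = 0; i < 2; i++) {
		double pages_per_block = block_kb[i] / base_kb;
		double pct_free = 100.0 * (1.0 - 1.0 / pages_per_block);
		double pinned_gb = (ram_kb / block_kb[i]) * base_kb
					/ (1024 * 1024);

		printf("%4.0fk blocks: fail with %.1f%% free, %.0f GB pinned\n",
		       block_kb[i], pct_free, pinned_gb);
	}
	return 0;
}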

And it doesn't even have to be a DoS. The natural fragmentation
that occurs in a kernel today has the possibility to slowly push out
the movable groups and give you the same situation.

Now there are lots of other little heuristics, *including lumpy reclaim
and various slab reclaim improvements*, that improve the effectiveness
or speed of this thing, but at the end of the day, it has the same basic
issues. Unless you can move practically any currently unmovable
allocation (which will either be a lot of intrusive code or require a
vmapped kernel), then you can't get around the fundamental problem.
And if you do get around the fundamental problem, you don't really
need to group pages by mobility any more because they are all
movable[*].

So lumpy reclaim does not change my formula nor significantly help
against a fragmentation attack. AFAIKS.

[*] ok, this isn't quite true because if you can actually put a hard limit on
unmovable allocations then anti-frag will fundamentally help -- get back to
me on that when you get patches to move most of the obvious ones.
Like pinned dentries, inodes, buffer heads, page tables, task structs, mm
structs, vmas, anon_vmas, radix-tree nodes, etc.


> > > And you have used a 2M pagesize which is
> > > irrelevant to this patchset that deals with blocksizes up to 64k. In my
> > > experience the use of blocksize < PAGE_COSTLY_ORDER (32k) is reasonably
> > > safe.
> >
> > I used EXACTLY the page sizes that you brought up in your patch
> > description (ie. 64K and 2MB).
>
> The patch currently only supports 64k.

Sure, and I pointed out the theoretical figure for 64K pages as well. Is that
figure not problematic to you? Where do you draw the limit for what is
acceptable? Why? What happens with tiny memory machines where a reserve
or even the anti-frag patches may not be acceptable and/or work very well?
When do you require reserve pools? Why are reserve pools acceptable for
first-class support of filesystems when it has very loudly been made a
known policy decision by Linus in the past (and for some valid reasons)
that we should not put limits on the sizes of caches in the kernel?
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tuesday 11 September 2007 05:26:05 Nick Piggin wrote:
> On Wednesday 12 September 2007 04:31, Mel Gorman wrote:
> > On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote:
> > > Hi Mel,
> >
> > Hi,
> >
> > > On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> > > > that increasing the pagesize like what Andrea suggested would lead to
> > > > internal fragmentation problems. Regrettably we didn't discuss Andrea's
> > >
> > > The config_page_shift guarantees the kernel stacks or whatever not
> > > defragmentable allocation other allocation goes into the same 64k "not
> > > defragmentable" page. Not like with SGI design that a 8k kernel stack
> > > could be allocated in the first 64k page, and then another 8k stack
> > > could be allocated in the next 64k page, effectively pinning all 64k
> > > pages until Nick worst case scenario triggers.
> >
> > In practice, it's pretty difficult to trigger. Buddy allocators always
> > try and use the smallest possible sized buddy to split. Once a 64K is
> > split for a 4K or 8K allocation, the remainder of that block will be
> > used for other 4K, 8K, 16K, 32K allocations. The situation where
> > multiple 64K blocks gets split does not occur.
> >
> > Now, the worst case scenario for your patch is that a hostile process
> > allocates large amount of memory and mlocks() one 4K page per 64K chunk
> > (this is unlikely in practice I know). The end result is you have many
> > 64KB regions that are now unusable because 4K is pinned in each of them.
> > Your approach is not immune from problems either. To me, only Nicks
> > approach is bullet-proof in the long run.
>
> One important thing I think in Andrea's case, the memory will be accounted
> for (eg. we can limit mlock, or work within various memory accounting things).
>
> With fragmentation, I suspect it will be much more difficult to do this. It
> would be another layer of heuristics that will also inevitably go wrong
> at times if you try to limit how much "fragmentation" a process can do.
> Quite likely it is hard to make something even work reasonably well in
> most cases.
>
>
> > > We can still try to save some memory by
> > > defragging the slab a bit, but it's by far *not* required with
> > > config_page_shift. No defrag at all is required, in fact.
> >
> > You will need to take some sort of defragmentation to deal with internal
> > fragmentation. It's a very similar problem to blasting away at slab
> > pages and still not being able to free them because objects are in use.
> > Replace "slab" with "large page" and "object" with "4k page" and the
> > issues are similar.
>
> Well yes and slab has issues today too with internal fragmentation,
> targetted reclaim and some (small) higher order allocations too today.
> But at least with config_page_shift, you don't introduce _new_ sources
> of problems (eg. coming from pagecache or other allocs).
>
> Sure, there are some other things -- like pagecache can actually use
> up more memory instead -- but there are a number of other positives
> that Andrea's has as well. It is using order-0 pages, which are first class
> throughout the VM; they have per-cpu queues, and do not require any
> special reclaim code. They also *actually do* reduce the page
> management overhead in the general case, unlike higher order pcache.
>
> So combined with the accounting issues, I think it is unfair to say that
> Andrea's is just moving the fragmentation to internal. It has a number
> of upsides. I have no idea how it will actually behave and perform, mind
> you ;)
>
>
> > > Plus there's a cost in defragging and freeing cache... the more you
> > > need defrag, the slower the kernel will be.
> > >
> > > > approach in depth.
> > >
> > > Well it wasn't my fault if we didn't discuss it in depth though.
> >
> > If it's my fault, sorry about that. It wasn't my intention.
>
> I think it did get brushed aside a little quickly too (not blaming anyone).
> Maybe because Linus was hostile. But *if* the idea is that page
> management overhead has or will become a problem that needs fixing,
> then neither higher order pagecache, nor (obviously) fsblock, fixes this
> properly. Andrea's most definitely has the potential to.
>

Hi,

I think that the fundamental problem is not fragmentation/large pages/...

The problem is the VM itself.
The VM doesn't use virtual memory, that's all; that's the problem.
Although this will probably be Linux 3.0 material, I think that the
right way to solve all those problems is to make all kernel memory
vmalloced (except a few areas like the kernel .text).

It will suddenly remove the buddy allocator, remove the need for
highmem, and allow allocating any amount of memory (for example, 4k
stacks will be obsolete).
It will even allow kernel memory to be swapped to disk.

This is the solution, but it is very very hard.

Best regards,
Maxim Levitsky
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote:
> Hi Mel,
>

Hi,

> On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> > that increasing the pagesize like what Andrea suggested would lead to
> > internal fragmentation problems. Regrettably we didn't discuss Andrea's
>
> The config_page_shift guarantees the kernel stacks or whatever not
> defragmentable allocation other allocation goes into the same 64k "not
> defragmentable" page. Not like with SGI design that a 8k kernel stack
> could be allocated in the first 64k page, and then another 8k stack
> could be allocated in the next 64k page, effectively pinning all 64k
> pages until Nick worst case scenario triggers.
>

In practice, it's pretty difficult to trigger. Buddy allocators always
try to use the smallest possible sized buddy to split. Once a 64K block
is split for a 4K or 8K allocation, the remainder of that block will be
used for other 4K, 8K, 16K and 32K allocations. The situation where
multiple 64K blocks get split does not occur.

Now, the worst case scenario for your patch is that a hostile process
allocates a large amount of memory and mlocks() one 4K page per 64K
chunk (this is unlikely in practice, I know). The end result is you have
many 64KB regions that are now unusable because 4K is pinned in each of
them. Your approach is not immune from problems either. To me, only
Nick's approach is bullet-proof in the long run.

> What I said at the VM summit is that your reclaim-defrag patch in the
> slub isn't necessarily entirely useless with config_page_shift,
> because the larger the software page_size, the more partial pages we
> could find in the slab, so to save some memory if there are tons of
> pages very partially used, we could free some of them.
>

This is true. Slub targeted reclaim (Christoph's work) is useful
independent of this current problem.

> But the whole point is that with the config_page_shift, Nick's worst
> case scenario can't happen by design regardless of defrag or not
> defrag.

I agree with this. It's why I thought Nick's approach was where we were
going to finish up ultimately.

> While it can _definitely_ happen with SGI design (regardless
> of any defrag thing).

I have never stated that the SGI design is immune from this problem.

> We can still try to save some memory by
> defragging the slab a bit, but it's by far *not* required with
> config_page_shift. No defrag at all is required, in fact.
>

You will need to take some sort of defragmentation to deal with internal
fragmentation. It's a very similar problem to blasting away at slab
pages and still not being able to free them because objects are in use.
Replace "slab" with "large page" and "object" with "4k page" and the
issues are similar.

> Plus there's a cost in defragging and freeing cache... the more you
> need defrag, the slower the kernel will be.
>
> > approach in depth.
>
> Well it wasn't my fault if we didn't discuss it in depth though.

If it's my fault, sorry about that. It wasn't my intention.

> I
> tried to discuss it in all possible occasions where I was suggested to
> talk about it and where it was somewhat on topic.

Who said it was off-topic? Again, if this was me, sorry - you should
have chucked something at my head to shut me up.

> Given I wasn't even
> invited at the KS, I felt it would not be appropriate for me to try to
> monopolize the VM summit according to my agenda. So I happily listened
> to what the top kernel developers are planning ;), while giving
> some hints on what I think the right direction is instead.
>

Right, clearly we failed, or at least had sub-optimal results, discussing
this one at VM Summit. Good job we have mail to pick up the stick with.

> > I *thought* that the end conclusion was that we would go with
>
> Frankly I don't care what the end conclusion was.
>

heh. Well we need to come to some sort of conclusion here or this will
go around the merry-go-round till we're all bald.

> > Christoph's approach pending two things being resolved;
> >
> > o mmap() support that we agreed on is good
>
> Let's see how good the mmap support for variable order page size will
> work after the 2 weeks...
>

Ok, I'm ok with that.

> > o A clear statement, with logging maybe for users that mounted a large
> > block filesystem that it might blow up and they get to keep both parts
> > when it does. Basically, for now it's only suitable in specialised
> > environments.
>
> Yes, but perhaps you missed that such printk is needed exactly to
> provide proof that SGI design is the wrong way and it needs to be
> dumped. If that printk ever triggers it means you were totally wrong.
>

heh, I suggested printing the warning because I knew it had this
problem. The purpose in my mind was to see how far the design could be
brought before fs-block had to fill in the holes.

> > I also thought there was an acknowledgement that long-term, fs-block was
> > the way to go - possibly using contiguous pages optimistically instead
> > of virtual mapping the pages. At that point, it would be a general
> > solution and we could remove the warnings.
>
> fsblock should stack on top of config_page_shift simply.

It should be able to stack on top of either approach and arguably
setting slub_min_order=large_block_order with large block filesystems is
90% of your approach anyway.

> Both are
> needed. You don't want to use 64k pages on a laptop but you may want a
> larger blocksize for the btrees etc... if you've a large harddisk and
> not much ram.

I am still failing to see what happens when there are pagetable pages,
slab objects or mlocked 4k pages pinning the 64K pages and you need to
allocate another 64K page for the filesystem. I *think* you deadlock in
a similar fashion to Christoph's approach but the shape of the problem
is different because we are dealing with internal instead of external
fragmentation. Am I wrong?

> > That's the absolute worst case but yes, in theory this can occur and
> > it's safest to assume the situation will occur somewhere to someone. It
>
> Do you agree this worst case can't happen with config_page_shift?
>

Yes. I just think you have a different worst case that is just as bad.

> > Where we expected to see the use of this patchset was in specialised
> > environments *only*. The SGI people can mitigate their mixed
> > fragmentation problems somewhat by setting slub_min_order ==
> > large_block_order so that blocks get allocated and freed at the same
> > size. This is partway towards Andrea's solution of raising the size
> > of an order-0 allocation. The point of printing out the warnings at
>
> Except you don't get all the full benefits of it...
>
> Even if I could end up mapping 4k kmalloced entries in userland for
> the tail packing, that IMHO would still be preferable to keeping the
> base page small and making a hard effort to create large pages out of
> small pages. The approach I advocate keeps the base page big and the
> fast path fast, and instead does some work to split the base pages
> outside the buddy for the small files.
>

Small files (you need something like Shaggy's page tail packing),
pagetable pages and pte pages all have to be dealt with. These are the
things I think will cause us internal fragmentation problems.

> All your defrag work is still good to have, like I said at the VM
> summit if you remember, to grow the hugetlbfs at runtime etc... I just
> rather avoid to depend on it to avoid I/O failure in presence of
> mlocked pagecache for example.
>

I'd rather avoid depending on it for the system to work 100% of the
time, all the same. Hence I've been saying that we need fsblock
ultimately for this to be a 100% supported feature.

> > Independently of that, we would work on order-0 scalability,
> > particularly readahead and batching operations on ranges of pages as
> > much as possible.
>
> That's pretty much an unnecessary logic, if the order0 pages become
> larger.
>

Quite possibly.

--
Mel Gorman

Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Hi,

On Tue, Sep 11, 2007 at 07:31:01PM +0100, Mel Gorman wrote:
> Now, the worst case scenario for your patch is that a hostile process
> allocates large amount of memory and mlocks() one 4K page per 64K chunk
> (this is unlikely in practice I know). The end result is you have many
> 64KB regions that are now unusable because 4K is pinned in each of them.

Initially 4k kmalloced tails aren't going to be mapped in
userland. But let's take the kernel stack that would generate the same
problem and that is clearly going to pin the whole 64k slab/slub
page.

What I think you're missing is that for Nick's worst case to trigger
with the config_page_shift design, you would need the _whole_ ram to
be _at_least_once_ allocated completely in kernel stacks. Unless 100%
of ram were at some point allocated by slub purely for kernel stacks,
such a scenario could never materialize.

With the SGI design + defrag, Nick's scenario can instead happen with
only total_ram/64k kernel stacks allocated.
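
As back-of-the-envelope arithmetic (user space only, with 32GB picked
as an arbitrary machine size to make the ratio concrete):

#include <stdio.h>

/* How much kernel stack allocation it takes before every 64k region
 * of RAM is pinned, under each design. */
int main(void)
{
	const double ram_kb = 32.0 * 1024 * 1024;	/* 32GB example */
	const double stack_kb = 8, region_kb = 64;

	/* 4k base pages + compound blocks: one 8k stack landing in each
	 * 64k region is enough to pin it. */
	double stacks_sgi = ram_kb / region_kb;
	/* 64k base pages (config_page_shift): stacks pack into 64k pages,
	 * so pinning everything needs roughly the whole of RAM in stacks. */
	double stacks_cps = ram_kb / stack_kb;

	printf("stacks to pin all 64k regions: %.0f (%.1f GB of stacks)"
	       " vs %.0f (%.1f GB)\n",
	       stacks_sgi, stacks_sgi * stack_kb / (1024 * 1024),
	       stacks_cps, stacks_cps * stack_kb / (1024 * 1024));
	return 0;
}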

The problem of slub fragmentation isn't a new problem; it happens in
today's kernels as well, and at least the slab by design is meant to
_defrag_ internally. So it's practically already solved, and it
provides some guarantees, unlike the buddy allocator.

> If it's my fault, sorry about that. It wasn't my intention.

It's not the fault of anyone, I simply didn't push too hard towards my
agenda for the reasons I just said, but I used any given opportunity
to discuss it.

With on-topic I meant not talking about it during the other topics,
like mmap_sem or RCU with radix tree lock ;)

> heh. Well we need to come to some sort of conclusion here or this will
> go around the merri-go-round till we're all bald.

Well, I only meant I'm still free to disagree if I think there's a
better way. All SGI has provided so far is data to show that their I/O
subsystem is much faster if the data is physically contiguous in ram
(ask Linus if you want more details, or better don't ask). That's not
very interesting data for my usages and with my hardware, and I guess
it's more likely that config_page_shift will produce interesting
numbers than their patch on my possible usage cases, but we'll never
know until both are finished.

> heh, I suggested printing the warning because I knew it had this
> problem. The purpose in my mind was to see how far the design could be
> brought before fs-block had to fill in the holes.

Indeed!

> I am still failing to see what happens when there are pagetable pages,
> slab objects or mlocked 4k pages pinning the 64K pages and you need to
> allocate another 64K page for the filesystem. I *think* you deadlock in
> a similar fashion to Christoph's approach but the shape of the problem
> is different because we are dealing with internal instead of external
> fragmentation. Am I wrong?

Pagetables aren't the issue. They should still be pre-allocated in
page_size chunks. The 4k entries with a 64k page size are surely not
worse than a 32-byte kmalloc today; the slab by design defragments the
stuff. There's probably room for improvement in that area even without
freeing any objects, just by ordering the list with an rbtree (or
better a heap, like CFS should also use!!) so as to always allocate
new objects from the most full partial slab; that alone would probably
help a lot (not sure if slub does anything like that, I'm not fond of
slub yet).
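
Something like this toy sketch (user space, with a hypothetical
partial-slab descriptor, and a linear scan standing in for the
rbtree/heap ordering):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical partial-slab descriptor, for illustration only. */
struct partial_slab {
	unsigned int inuse;	/* objects currently allocated */
	unsigned int objects;	/* total objects the slab can hold */
};

/* Pick the fullest partial slab (fewest free objects), so new
 * allocations concentrate there and nearly empty slabs can drain and
 * be freed. A real implementation would keep the list ordered instead
 * of scanning it. */
static struct partial_slab *pick_fullest(struct partial_slab *s, size_t n)
{
	struct partial_slab *best = NULL;

	for (size_t i = 0; i < n; i++) {
		unsigned int free = s[i].objects - s[i].inuse;

		if (free == 0)
			continue;	/* completely full, nothing to hand out */
		if (!best || free < best->objects - best->inuse)
			best = &s[i];
	}
	return best;
}

int main(void)
{
	struct partial_slab slabs[] = { {3, 8}, {7, 8}, {8, 8}, {1, 8} };
	struct partial_slab *p = pick_fullest(slabs, 4);

	printf("picked slab with %u/%u objects in use\n", p->inuse, p->objects);
	return 0;
}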

> Yes. I just think you have a different worst case that is just as bad.

Disagree here...

> small files (you need something like Shaggy's page tail packing),
> pagetable pages, pte pages all have to be dealt with. These are the
> things I think will cause us internal fragmentaiton problems.

Also note that not all users will need to turn on the tail
packing. We're talking here about features that not all users will
need anyway. And we're in the same boat as ppc64, no difference.

Thanks!
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tue, 11 Sep 2007, Nick Piggin wrote:

> There is a limitation in the VM. Fragmentation. You keep saying this
> is a solved issue and just assuming you'll be able to fix any cases
> that come up as they happen.
>
> I still don't get the feeling you realise that there is a fundamental
> fragmentation issue that is unsolvable with Mel's approach.

Well my problem first of all is that you did not read the full message. It
discusses that later and provides page pools to address the issue.

Secondly you keep FUDding people with lots of theoretical concerns
assuming Mel's approaches must fail. If there is an issue (I guess there
must be right?) then please give us a concrete case of a failure that we
can work against.

> The idea that there even _is_ a bug to fail when higher order pages
> cannot be allocated was also brushed aside by some people at the
> vm/fs summit. I don't know if those people had gone through the
> math about this, but it goes somewhat like this: if you use a 64K
> page size, you can "run out of memory" with 93% of your pages free.
> If you use a 2MB page size, you can fail with 99.8% of your pages
> still free. That's 64GB of memory used on a 32TB Altix.

Allocations can currently fail and all code has the requirement to handle
failure cases in one form or another.

Currently we can only handle up to order 3 allocs it seems. 2M pages (and
in particular pagesizes > MAX_ORDER) will have to be handled by a separate
large page pool facility discussed in the earlier message.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tue, 11 Sep 2007, Andrea Arcangeli wrote:

> Furthermore all the issues with writeprotect faults over MAP_PRIVATE
> regions will have to be addressed the same way with both approaches if
> we want real 100% 4k-granular backwards compatibility.

Could you be more specific as to why my patch does not address that issue?

> And if I'm terribly wrong and the variable order pagecache is the way
> to go for the long run, the 64k tlb feature will fit in that model
> very nicely too.

Right.
