Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wed, 19 Sep 2007, Rene Herman wrote:
>
> Well, not so sure about that. What if one of your expected uses for example is
> video data storage -- lots of data, especially for multiple streams, which still
> needs relatively fast machinery. Why would you care about the overhead of
> _small_ blocks?

.. so work with an extent-based filesystem instead.

16k blocks are total idiocy. If this wasn't about a "support legacy
customers" issue, I think the whole patch series has been a total waste of time.

Linus
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On 09/19/2007 05:50 AM, Linus Torvalds wrote:

> On Wed, 19 Sep 2007, Rene Herman wrote:

>> Well, not so sure about that. What if one of your expected uses for example is
>> video data storage -- lots of data, especially for multiple streams, which still
>> needs relatively fast machinery. Why would you care about the overhead of
>> _small_ blocks?
>
> .. so work with an extent-based filesystem instead.
>
> 16k blocks are total idiocy. If this wasn't about a "support legacy
> customers" issue, I think the whole patch series has been a total waste of time.

Admittedly, extent-based might not be a particularly bad answer at least to
the I/O side of the equation...

I do feel larger blocksizes continue to make sense in general though. Packet
writing on CD/DVD is a problem already today since the hardware needs 32K or
64K blocks, and I'd expect to see more of these and similar situations when
flash gets (even) more popular, which it sort of inevitably is going to be.

Rene.
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wed, 19 Sep 2007, Rene Herman wrote:
>
> I do feel larger blocksizes continue to make sense in general though. Packet
> writing on CD/DVD is a problem already today since the hardware needs 32K or
> 64K blocks, and I'd expect to see more of these and similar situations when
> flash gets (even) more popular, which it sort of inevitably is going to be.

.. that's what scatter-gather exists for.

What's so hard with just realizing that physical memory isn't contiguous?

It's why we have MMU's. It's why we have scatter-gather.
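To make the scatter-gather point concrete, here is a simplified user-space
illustration; it models the concept only and is not the kernel's struct
scatterlist API. One 64k logical block is described to the hardware as
sixteen discontiguous 4k pages:

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE  4096
#define BLOCK_SIZE (64 * 1024)

struct sg_entry {              /* simplified stand-in for a real S/G element */
	void   *addr;          /* would be a physical/bus address in real life */
	size_t  len;
};

int main(void)
{
	size_t nents = BLOCK_SIZE / PAGE_SIZE;
	struct sg_entry sg[BLOCK_SIZE / PAGE_SIZE];

	/* sixteen separately allocated (non-contiguous) 4k pages */
	for (size_t i = 0; i < nents; i++) {
		sg[i].addr = malloc(PAGE_SIZE);
		sg[i].len  = PAGE_SIZE;
	}
	/* the controller walks the list; no contiguous 64k buffer is needed */
	printf("%zu sg entries describe one %d-byte block\n", nents, BLOCK_SIZE);

	for (size_t i = 0; i < nents; i++)
		free(sg[i].addr);
	return 0;
}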

Linus
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On 09/19/2007 06:33 AM, Linus Torvalds wrote:

> On Wed, 19 Sep 2007, Rene Herman wrote:

>> I do feel larger blocksizes continue to make sense in general though. Packet
>> writing on CD/DVD is a problem already today since the hardware needs 32K or
>> 64K blocks, and I'd expect to see more of these and similar situations when
>> flash gets (even) more popular, which it sort of inevitably is going to be.
>
> .. that's what scatter-gather exists for.
>
> What's so hard with just realizing that physical memory isn't contiguous?
>
> It's why we have MMU's. It's why we have scatter-gather.

So if I understood that right, you'd suggest dealing with devices with
larger physical blocksizes at some level above the current block layer.

I'm not familiar enough with either block or fs to be able to argue that
effectively...

Rene.

Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Tue, Sep 18, 2007 at 06:06:52PM -0700, Linus Torvalds wrote:
> > especially as the Linux
> > kernel limitations in this area are well known. There's no "16K mess"
> > that SGI is trying to clean up here (and SGI have offered both IA64 and
> > x86_64 systems for some time now, so not sure how you came up with that
> > whacko theory).
>
> Well, if that is the case, then I vote that we drop the whole patch-series
> entirely. It clearly has no reason for existing at all.
>
> There is *no* valid reason for 16kB blocksizes unless you have legacy
> issues.

Ok, let's step back for a moment and look at a basic, fundamental
constraint of disks - seek capacity. A decade ago, a terabyte of
filesystem had 30 disks behind it - a seek capacity of about
6000 seeks/s. Nowadays, that's a single disk with a seek
capacity of about 200/s. We're going *rapidly* backwards in
terms of seek capacity per terabyte of storage.

Now fill that terabyte of storage and index it in the most efficient
way - let's say btrees are used because lots of filesystems use
them. Hence the depth of the tree is roughly O((log n)/m) where m is
a factor of the btree block size. Effectively, btree depth = seek
count on lookup of any object.

When the filesystem had a capacity of 6,000 seeks/s, we didn't
really care if the indexes used 4k blocks or not - the storage
subsystem had an excess of seek capacity to deal with
less-than-optimal indexing. Now we have over an order of magnitude
fewer seeks to expend in index operations *for the same amount of
data* so we are really starting to care about minimising the
number of seeks in our indexing mechanisms and allocations.

We can play tricks in index compaction to reduce the number of
interior nodes of the tree (like hashed indexing in the XFS and ext3
htree directories) but that still only gets us so far in reducing
seeks and doesn't help at all for tree traversals. That leaves us
with the btree block size as the only factor we can further vary to
reduce the depth of the tree. i.e. "m".

So we want to increase the filesystem block size to improve the
efficiency of our indexing. That improvement in efficiency
translates directly into better performance on seek constrained
storage subsystems.
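To put rough numbers on this (purely illustrative: the one-billion-record
count and the 32-byte key+pointer entry below are assumptions for the
example, not figures from the mail), a quick back-of-the-envelope
calculation of index depth, i.e. seeks per lookup, versus block size:

#include <math.h>
#include <stdio.h>

int main(void)
{
	double nrecords = 1e9;          /* assumed number of indexed objects */
	int entry_size = 32;            /* assumed key + pointer per btree entry */
	int blocksizes[] = { 4096, 16384, 65536 };

	for (int i = 0; i < 3; i++) {
		double fanout = (double)blocksizes[i] / entry_size;
		double depth  = ceil(log(nrecords) / log(fanout));
		printf("%6d-byte blocks: fanout ~%4.0f, tree depth ~%.0f\n",
		       blocksizes[i], fanout, depth);
	}
	return 0;                       /* 4k -> ~5 levels, 64k -> ~3 levels */
}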

The problem is this: to alter the fundamental block size of the
filesystem we also need to alter the data block size and that is
exactly the piece that Linux does not support right now. So while
we have the capability to use large block sizes in certain
filesystems, we can't use that capability until the data path
supports it.

To summarise, large block size support in the filesystem is not
about "legacy" issues. It's about trying to cope with the rapid
expansion of storage capabilities of modern hardware where we have
to index much, much more data with a corresponding decrease in
the seek capability of the hardware.

> So get your stories straight, people.

Ok, so let's set the record straight. There were 3 justifications
for using *large pages* to *support* large filesystem block sizes.
The justifications for the variable order page cache with large
pages were:

1. little code change needed in the filesystems
-> still true

2. Increased I/O sizes on 4k page machines (the "SCSI
controller problem")
-> redundant thanks to Jens Axboe's quick work

3. avoiding the need for vmap() as it has great
overhead and does not scale
-> Nick is starting to work on that and has
already had good results.

Everyone seems to be focussing on #2 as the entire justification for
large block sizes in filesystems and that this is an "SGI" problem.
Nothing could be further from the truth - the truth is that large
pages solved multiple problems in one go. We now have a different,
better solution for #2, so please, please stop using that as some
justification for claiming filesystems don't need large block sizes.

However, all this doesn't change the fact that we have a major storage
scalability crunch coming in the next few years. Disk capacity is
likely to continue to double every 12 months for the next 3 or 4
years. Large block size support is only one mechanism we need to
help cope with this trend.

The variable order page cache with large pages was a means to an end
- it's not the only solution to this problem and I'm extremely happy
to see that there is progress on multiple fronts. That's the
strength of the Linux community showing through. In the end, I
really don't care how we end up supporting large filesystem block
sizes in the page cache - all I care about is that we end up
supporting it as efficiently and generically as we possibly can.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On 9/19/07, David Chinner <dgc@sgi.com> wrote:
> The problem is this: to alter the fundamental block size of the
> filesystem we also need to alter the data block size and that is
> exactly the piece that Linux does not support right now. So while
> we have the capability to use large block sizes in certain
> filesystems, we can't use that capability until the data path
> supports it.

it's much simpler to teach the fs to understand multipage data (like
multipage bitmap scans, multipage extent searches, etc.) than to deal with mm
fragmentation, IMHO. At the same time you don't bloat IO traffic with
unused space.

--
thanks, Alex

Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wed, Sep 19, 2007 at 03:09:10PM +1000, David Chinner wrote:
> Ok, let's step back for a moment and look at a basic, fundamental
> constraint of disks - seek capacity. A decade ago, a terabyte of
> filesystem had 30 disks behind it - a seek capacity of about
> 6000 seeks/s. Nowadays, that's a single disk with a seek
> capacity of about 200/s. We're going *rapidly* backwards in
> terms of seek capacity per terabyte of storage.
>
> Now fill that terabyte of storage and index it in the most efficient
> way - let's say btrees are used because lots of filesystems use
> them. Hence the depth of the tree is roughly O((log n)/m) where m is
> a factor of the btree block size. Effectively, btree depth = seek
> count on lookup of any object.

I agree. btrees will clearly benefit if the nodes are larger. We've an
excess of disk capacity and a huge gap between seeking and contiguous
bandwidth.

You don't need largepages for this, fsblocks is enough.

Largepages for you are a further improvement to reduce the number of SG
entries and potentially reduce the cpu utilization a bit (not much
though, only the pagecache works with largepages and especially with
small sized random I/O you'll be taking the radix tree lock the same
number of times...).

Plus of course you don't like fsblock because it requires work to
adapt a fs to it, I can't argue about that.

> Ok, so let's set the record straight. There were 3 justifications
> for using *large pages* to *support* large filesystem block sizes.
> The justifications for the variable order page cache with large
> pages were:
>
> 1. little code change needed in the filesystems
> -> still true

Disagree, the mmap side is not a little change. If you do it just for
the non-mmapped I/O that truly is a hack, but then frankly I would
prefer only the read/write hack (without mmap) so it will not reject
heavily against my stuff and it'll be quicker to nuke it out of the
kernel later.

> 3. avoiding the need for vmap() as it has great
> overhead and does not scale
> -> Nick is starting to work on that and has
> already had good results.

Frankly I don't follow this vmap thing. Can you elaborate? Is this
about allowing the blkdev pagecache for metadata to go in highmem?
Is that the kmap thing? I think we can stick to a direct mapped b_data
and avoid all overhead of converting a struct page to a virtual
address. It takes the same 64bit size anyway in ram and we avoid one
layer of indirection and many modifications. If we wanted to switch to
kmap for blkdev pagecache we should have done it years ago; now it's far
too late to worry about it.

> Everyone seems to be focussing on #2 as the entire justification for
> large block sizes in filesystems and that this is an "SGI" problem.

I agree it's not an SGI problem and this is why I want a design that
has a _slight chance_ to improve performance on x86-64 too. If the
variable order page cache provides any further improvement on top
of fsblock, it will be only because your I/O device isn't fast with small
sg entries.

For the I/O layout the fsblock is more than enough, but I don't think
your variable order page cache will help in any significant way on
x86-64. Furthermore the complexity of handling page faults on largepages
is almost equivalent to the complexity of config-page-shift, but
config-page-shift gives you the whole cpu-saving benefits that you can
never remotely hope to achieve with variable order page cache.

config-page-shift + fsblock IMHO is the way to go for x86-64, with one
additional 64k PAGE_SIZE rpm. config-page-shift will stack nicely on
top of fsblocks.

fsblock will provide the guarantee of "mounting" all fs anywhere no
matter which config-page-shift you selected at compile time, as well
as dvd writing. Then config-page-shift will provide the cpu
optimization on all fronts, not just for the pagecache I/O for the
large ram systems, without fragmentation issues and with 100%
reliability in the "free" numbers (not working by luck). That's all we
need as far as I can tell.
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Wed, Sep 19, 2007 at 04:04:30PM +0200, Andrea Arcangeli wrote:
> On Wed, Sep 19, 2007 at 03:09:10PM +1000, David Chinner wrote:
> > Ok, let's step back for a moment and look at a basic, fundamental
> > constraint of disks - seek capacity. A decade ago, a terabyte of
> > filesystem had 30 disks behind it - a seek capacity of about
> > 6000 seeks/s. Nowadays, that's a single disk with a seek
> > capacity of about 200/s. We're going *rapidly* backwards in
> > terms of seek capacity per terabyte of storage.
> >
> > Now fill that terabyte of storage and index it in the most efficient
> > way - let's say btrees are used because lots of filesystems use
> > them. Hence the depth of the tree is roughly O((log n)/m) where m is
> > a factor of the btree block size. Effectively, btree depth = seek
> > count on lookup of any object.
>
> I agree. btrees will clearly benefit if the nodes are larger. We've an
> excess of disk capacity and a huge gap between seeking and contiguous
> bandwidth.
>
> You don't need largepages for this, fsblocks is enough.

Sure, and that's what I meant when I said VPC + large pages was
a means to the end, not the only solution to the problem.

> Plus of course you don't like fsblock because it requires work to
> adapt a fs to it, I can't argue about that.

No, I don't like fsblock because it is inherently a "structure
per filesystem block" construct, just like buggerheads. You
still need to allocate millions of them when you have millions of
dirty pages around. Rather than type it all out again, read
the fsblocks thread from here:

http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2

FWIW, with Chris Mason's extent-based block mapping (which btrfs
is using and Christoph Hellwig is porting XFS over to) we completely
remove buggerheads from XFS and so fsblock would be a pretty
major step backwards for us if Chris's work goes into mainline.

> > Ok, so let's set the record straight. There were 3 justifications
> > for using *large pages* to *support* large filesystem block sizes.
> > The justifications for the variable order page cache with large
> > pages were:
> >
> > 1. little code change needed in the filesystems
> > -> still true
>
> Disagree, the mmap side is not a little change.

That's not in the filesystem, though. ;)

However, I agree that if you don't have mmap then it's not
worthwhile and the changes for VPC aren't trivial.

> > 3. avoiding the need for vmap() as it has great
> > overhead and does not scale
> > -> Nick is starting to work on that and has
> > already had good results.
>
> Frankly I don't follow this vmap thing. Can you elaborate?

We currently support metadata blocks larger than page size for
certain types of metadata in XFS. e.g. directory blocks.
This however, requires vmap()ing a bunch of individual,
non-contiguous pages out of a block device address space
in exactly the fashion that was proposed by Nick with fsblock
originally.

vmap() has severe scalability problems - read this subthread
of this discussion between Nick and myself:

http://lkml.org/lkml/2007/9/11/508

> > Everyone seems to be focussing on #2 as the entire justification for
> > large block sizes in filesystems and that this is an "SGI" problem.
>
> I agree it's not an SGI problem and this is why I want a design that
> has a _slight chance_ to improve performance on x86-64 too. If the
> variable order page cache provides any further improvement on top
> of fsblock, it will be only because your I/O device isn't fast with small
> sg entries.

<sigh>

There we go - back to the bloody I/O devices. Can ppl please stop
bringing this up because it *is not an issue any more*.

> config-page-shift + fsblock IMHO is the way to go for x86-64, with one
> additional 64k PAGE_SIZE rpm. config-page-shift will stack nicely on
> top of fsblocks.

Hmm - so you'll need page cache tail packing as well in that case
to prevent memory being wasted on small files. That means any way
we look at it (VPC+mmap or config-page-shift+fsblock+pctails)
we've got some non-trivial VM modifications to make.

If VPC can be separated from the large contiguous page requirement
(i.e. virtually mapped compound page support), I still think it
comes out on top because it doesn't require every filesystem to be
modified and you can use standard pages where they are optimal
(i.e. on filesystems where block size <= PAGE_SIZE).

But, I'm not going to argue endlessly for one solution or another;
I'm happy to see different solutions being chased, so may the
best VM win ;)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Thu, Sep 20, 2007 at 11:38:21AM +1000, David Chinner wrote:
> Sure, and that's what I meant when I said VPC + large pages was
> a means to the end, not the only solution to the problem.

The whole point is that it's not an end; it's an end only from your own
fs-centric view (which is fair enough, sure), but I watch the whole
VM, not just the pagecache...

The same way the fs-centric view hopes to get this little bit of
further optimization from largepages to reach "the end", my VM-wide
view wants the same little bit of optimization for *everything*
including tmpfs and anonymous memory, slab etc..! This is clearly why
config-page-shift is better...

If you're ok not to be on the edge and you want a generic rpm image
that runs quite optimally for any workload, then 4k+fsblock is just
fine of course. But if we go on the edge we should aim for the _very_
end for the whole VM, not just for "the end of the pagecache on
certain files". Especially when the complexity involved in the mmap
code is similar, and it will reject heavily if we merge this
not-very-end solution that only reaches "the end" for the pagecache.

> No, I don't like fsblock because it is inherently a "structure
> per filesystem block" construct, just like buggerheads. You
> still need to allocate millions of them when you have millions of
> dirty pages around. Rather than type it all out again, read
> the fsblocks thread from here:
>
> http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2

Thanks for the pointer!

> FWIW, with Chris Mason's extent-based block mapping (which btrfs
> is using and Christoph Hellwig is porting XFS over to) we completely
> remove buggerheads from XFS and so fsblock would be a pretty
> major step backwards for us if Chris's work goes into mainline.

I tend to agree that if we change it, fsblock should support extents if
that's what you need on xfs to support range-locking etc... Whatever
happens in vfs should please all existing fs without people needing to
go their own way again... Or replace fsblock with Chris's block
mapping. Frankly I didn't see Chris's code so I cannot comment
further. But your complaints sound sensible. We certainly want to
avoid the lowlevel fs getting smarter than the vfs again. The brainy stuff
should be in the vfs!

> That's not in the filesystem, though. ;)
>
> However, I agree that if you don't have mmap then it's not
> worthwhile and the changes for VPC aren't trivial.

Yep.

>
> > > 3. avoiding the need for vmap() as it has great
> > > overhead and does not scale
> > > -> Nick is starting to work on that and has
> > > already had good results.
> >
> > Frankly I don't follow this vmap thing. Can you elaborate?
>
> We currently support metadata blocks larger than page size for
> certain types of metadata in XFS. e.g. directory blocks.
> This however, requires vmap()ing a bunch of individual,
> non-contiguous pages out of a block device address space
> in exactly the fashion that was proposed by Nick with fsblock
> originally.
>
> vmap() has severe scalability problems - read this subthread
> of this discussion between Nick and myself:
>
> http://lkml.org/lkml/2007/9/11/508

So the idea of vmap is that it's much simpler to have a contiguous
virtual address space for a large blocksize than to find the right
b_data[index] once you exceed PAGE_SIZE...

The global tlb flush with ipi would kill performance, you can forget
any global mapping here. The only chance to do this would be like we
do with kmap_atomic per-cpu on highmem, with preempt_disable (for the
enjoyment of the rt folks out there ;). What's the problem with having
it per-cpu? Is this what fsblock already does? You just have to
allocate a new virtual range of size numberofentriesinvmap*blocksize
every time you mount a new fs. Then instead of calling kmap you call
vmap and vunmap when you're finished. That should provide decent
performance, especially with physically indexed caches.

Anything more heavyweight than what I suggested is probably overkill,
even vmalloc_to_page.
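A rough sketch of that per-cpu scheme, in the style of kmap_atomic: each CPU
owns a small private virtual window that is only ever remapped locally, so no
cross-CPU TLB flush (IPI) is needed. The percpu_block_window[] array and the
map/unmap helpers are hypothetical placeholders, not existing kernel
interfaces:

extern void *percpu_block_window[];	/* hypothetical: one small VA window per cpu */

static void *vmap_block_atomic(struct page **pages, int nr_pages)
{
	int cpu = get_cpu();			/* disables preemption */
	void *vaddr = percpu_block_window[cpu];	/* this CPU's private VA range */

	/* hypothetical helper: install ptes for nr_pages into this CPU's window */
	map_pages_into_window(vaddr, pages, nr_pages);
	return vaddr;
}

static void vunmap_block_atomic(void *vaddr, int nr_pages)
{
	/* hypothetical helper: clear the ptes, flush only the local TLB */
	unmap_window_local(vaddr, nr_pages);
	put_cpu();				/* re-enables preemption */
}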

> Hmm - so you'll need page cache tail packing as well in that case
> to prevent memory being wasted on small files. That means any way
> we look at it (VPC+mmap or config-page-shift+fsblock+pctails)
> we've got some non-trivial VM modifications to make.

Hmm no, the point of config-page-shift is that if you really need to
reach "the very end", you probably don't care about wasting some
memory, because either your workload can't fit in cache, or it fits in
cache regardless, or you're not wasting memory because you work with
large files...

The only point of this largepage stuff is to go an extra mile to save
a bit more cpu vs a strict vmap based solution (fsblock of course
will be smart enough that if it notices the PAGE_SIZE is >= blocksize
it doesn't need to run any vmap at all and it can just use the direct
mapping, so vmap translates in 1 branch only to check the blocksize
variable, PAGE_SIZE is immediate in the .text at compile time). But if
you care about that tiny bit of performance during I/O operations
(variable order page cache only gives that tiny bit of performance
during read/write syscalls!!!), then it means you actually want to
save CPU _everywhere_ not just in read/write and while mangling
metadata in the lowlevel fs. And that's what config-page-shift should
provide...
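The "one branch" fast path could look roughly like this;
block_virtual_address() and vmap_block_slowpath() are made-up names for
illustration, only page_address() and PAGE_SIZE are real:

static void *block_virtual_address(struct page **pages, unsigned int blocksize)
{
	if (blocksize <= PAGE_SIZE)		/* direct mapping, no vmap needed */
		return page_address(pages[0]);
	return vmap_block_slowpath(pages, blocksize);	/* hypothetical slow path */
}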

This is my whole argument for preferring config-page-shift+fsblock (or
whatever other fsblock replacement, but Nick's design looked quite
sensible to me, if integrated with extent-based locking, without
having seen Chris's yet). And that is regardless of the fact that
config-page-shift also has the other benefit of providing guarantees
for the meminfo levels, and the fact that it doesn't strictly require
defrag heuristics to avoid hitting worst-case huge-ram-waste scenarios.

> But, I'm not going to argue endlessly for one solution or another;
> I'm happy to see different solutions being chased, so may the
> best VM win ;)

;)
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Thu, 20 Sep 2007, David Chinner wrote:

> > Disagree, the mmap side is not a little change.
>
> That's not in the filesystem, though. ;)

And it's really only a minimal change for some function to loop over all
4k pages and elsewhere index the right 4k subpage.

Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Thu, 20 Sep 2007, Andrea Arcangeli wrote:

> The only point of this largepage stuff is to go an extra mile to save
> a bit more cpu vs a strict vmap based solution (fsblock of course
> will be smart enough that if it notices the PAGE_SIZE is >= blocksize
> it doesn't need to run any vmap at all and it can just use the direct
> mapping, so vmap translates in 1 branch only to check the blocksize
> variable, PAGE_SIZE is immediate in the .text at compile time). But if

Hmmm.. You are not keeping up with things? Heard of virtual compound
pages? They only require a vmap when the page allocator fails a
larger order allocation (which we have established is rare to the point
of nonexistence). The next rev of large blocksize will use that for
fallback. Plus vcompounds can be used to get rid of most uses of vmalloc
reducing the need for virtual mapping in general.

Largeblock is a general solution for managing large data sets using a
single page struct. See the original message that started this thread. It
can be used for various other subsystems like vcompound can.
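A simplified sketch of that fallback (try the physically contiguous
higher-order allocation first, stitch individually allocated pages together
with vmap() only when it fails); this illustrates the idea, it is not the
actual virtual compound patch, and error handling is trimmed:

static void *alloc_large_buffer(unsigned int order, gfp_t gfp)
{
	struct page *page = alloc_pages(gfp | __GFP_NOWARN, order);
	struct page **pages;
	unsigned int i, nr = 1 << order;

	if (page)
		return page_address(page);	/* common case: physically contiguous */

	pages = kmalloc(nr * sizeof(*pages), gfp);	/* fallback bookkeeping */
	if (!pages)
		return NULL;
	for (i = 0; i < nr; i++)
		pages[i] = alloc_page(gfp);	/* order-0 pages, no contiguity needed */
	return vmap(pages, nr, VM_MAP, PAGE_KERNEL);	/* stitch them together */
}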
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Thu, 20 Sep 2007, Christoph Lameter wrote:
> On Thu, 20 Sep 2007, David Chinner wrote:
> > > Disagree, the mmap side is not a little change.
> >
> > That's not in the filesystem, though. ;)
>
> And it's really only a minimal change for some function to loop over all
> 4k pages and elsewhere index the right 4k subpage.

I agree with you on that: the changes you had to make to support mmap
were _much_ less bothersome than I'd been fearing, and I'm surprised
some people still see that side of it as a sticking point.

But I've kept very quiet because I remain quite ambivalent about the
patchset: I'm somewhere on the spectrum between you and Nick, shifting
my position from hour to hour. Don't expect any decisiveness from me.

In some senses I'm even further off the scale away from you: I'm
dubious even of Nick and Andrew's belief in opportunistic contiguity.
Just how hard should we trying for contiguity? How far should we
go in sacrificing our earlier "LRU" principles? It's easy to bump
up PAGE_ALLOC_COSTLY_ORDER, but what price do we pay when we do?

I agree with those who want to see how the competing approaches
work out in practice: which is frustrating for you, yes, because
you are so close to ready. (I've not glanced at virtual compound,
but had been wondering in that direction before you suggested it.)

I do think your patchset is, for the time being at least, a nice
out-of-tree set, and it's grand to be able to bring a filesystem
from another arch with larger pagesize and get at the data from it.

I've found some fixes needed on top of your Large Blocksize Support
patches: I'll send those to you in a moment. Looks like you didn't
try much swapping!

I only managed to get ext2 working with larger blocksizes:
reiserfs -b 8192 wouldn't mount ("reiserfs_fill_super: can not find
reiserfs on /dev/sdb1"); ext3 gave me mysterious errors ("JBD: tar
wants too many credits", even after adding JBD patches that you
turned out to be depending on); and I didn't try ext4 or xfs
(I'm guessing the latter has been quite well tested ;)

Hugh
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
mel@skynet.ie (Mel Gorman) writes:

> On (16/09/07 23:31), Andrea Arcangeli didst pronounce:
>> On Sun, Sep 16, 2007 at 09:54:18PM +0100, Mel Gorman wrote:
>> Allocating ptes from slab is fairly simple but I think it would be
>> better to allocate ptes in PAGE_SIZE (64k) chunks and preallocate the
>> nearby ptes in the per-task local pagetable tree, to reduce the number
>> of locks taken and not to enter the slab at all for that.
>
> It runs the risk of pinning up to 60K of data per task that is unusable for
> any other purpose. On average, it'll be more like 32K but worth keeping
> in mind.

Two things to both of you respectively.

Why should we try to stay out of the pte slab? Isn't the slab exactly
made for this thing? To efficiently handle a large number of equal
size objects for quick allocation and deallocation? If it is a locking
problem then there should be a per cpu cache of ptes. Say 0-32
ptes. If you run out you allocate 16 from slab. When you overflow you
free 16 (which would give you your 64k allocations but in multiple
objects).
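A toy user-space model of that batching (a cache of up to 32 entries,
refilled and flushed in batches of 16; in the kernel one instance would
exist per cpu); slab_alloc()/slab_free() are stand-ins for the real slab
allocator and the locking is left out:

#include <stdlib.h>

#define CACHE_MAX   32
#define CACHE_BATCH 16

struct pte_cache {
	void *obj[CACHE_MAX];
	int nr;
};

static void *slab_alloc(void) { return malloc(4096); }	/* stand-in */
static void slab_free(void *p) { free(p); }		/* stand-in */

static void *pte_alloc_cached(struct pte_cache *pc)
{
	if (pc->nr == 0)				/* empty: refill a batch */
		while (pc->nr < CACHE_BATCH)
			pc->obj[pc->nr++] = slab_alloc();
	return pc->obj[--pc->nr];
}

static void pte_free_cached(struct pte_cache *pc, void *pte)
{
	if (pc->nr == CACHE_MAX)			/* full: flush a batch */
		while (pc->nr > CACHE_MAX - CACHE_BATCH)
			slab_free(pc->obj[--pc->nr]);
	pc->obj[pc->nr++] = pte;
}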

As for the wastage: every pte page can map 2MB on amd64, 4MB on i386, 8MB
on sparc (?). A 64k pte chunk would be 32MB, 64MB and 32MB (?)
respectively. For the sbrk() and mmap() usage from glibc malloc() that
would be fine as they grow linearly, and the mmap() call in glibc could
be made to align to those chunks. But for a program like rtorrent
using mmap to bring in chunks of a 4GB file this looks disastrous.

>> Infact we
>> could allocate the 4 levels (or anyway more than one level) in one
>> single alloc_pages(0) and track the leftovers in the mm (or similar).

Personally I would really go with a per cpu cache. When mapping a page
reserve 4 tables. Then you walk the tree and add entries as
needed. And last you release 0-4 unused entries to the cache.

MfG
Goswin
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
mel@skynet.ie (Mel Gorman) writes:

> On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
>> But when you already have say 10% of the ram in mixed groups then it
>> is a sign that external fragmentation is happening and some time should
>> be spent on moving movable objects.
>>
>
> I'll play around with it on the side and see what sort of results I get.
> I won't be pushing anything any time soon in relation to this though.
> For now, I don't intend to fiddle more with grouping pages by mobility
> for something that may or may not be of benefit to a feature that hasn't
> been widely tested with what exists today.

I watched the videos you posted. A nice and quite clear improvement
with and without your logic. Kudos.

When you play around with it may I suggest a change to the display of
the memory information. I think it would be valuable to use a Hilbert
curve to arrange the pages into pixels. Like this:

# #    0 3
# #
###    1 2

### ###    0 1 E F
  # #
### ###    3 2 D C
#     #
# ### #    4 7 8 B
# # # #
### ###    5 6 9 A

                   +-----------+-----------+
# ##### ##### #    |00 03 04 05|3A 3B 3C 3F|
# #   # #   # #    |           |           |
### ### ### ###    |01 02 07 06|39 38 3D 3E|
    #     #        |           |           |
### ### ### ###    |0E 0D 08 09|36 37 32 31|
# #   # #   # #    |           |           |
# ##### ##### #    |0F 0C 0B 0A|35 34 33 30|
#             #    +-----+-----+           |
### ####### ###    |10 11|1E 1F|20 21 2E 2F|
  # #     # #      |     |     |           |
### ### ### ###    |13 12|1D 1C|23 22 2D 2C|
#     # #     #    |     +-----+           |
# ### # # ### #    |14 17|18 1B|24 27 28 2B|
# # # # # # # #    |     |     |           |
### ### ### ###    |15 16|19 1A|25 26 29 2A|
                   +-----+-----+-----------+

I've drawn in allocations for 16, 8, 4, 5, 32 pages in that order in
the last one. The idea is to get nearby pages visually near each other in the
output, and into an area instead of lines. Easier on the eye. It also
manages to always draw aligned order(x) blocks as squares or rectangles
(even or odd order).
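For reference, the standard index-to-coordinate conversion for a Hilbert
curve is tiny; the sketch below maps a page index d to (x, y) pixel
coordinates on an n x n grid. The orientation may come out rotated or
mirrored compared to the hand-drawn figures above:

#include <stdio.h>

/* Rotate/flip a quadrant appropriately (standard Hilbert curve helper). */
static void rot(int n, int *x, int *y, int rx, int ry)
{
	if (ry == 0) {
		if (rx == 1) {
			*x = n - 1 - *x;
			*y = n - 1 - *y;
		}
		int t = *x; *x = *y; *y = t;
	}
}

/* Convert distance d along the curve to (x, y) on an n x n grid (n a power of 2). */
static void d2xy(int n, int d, int *x, int *y)
{
	int rx, ry, t = d;
	*x = *y = 0;
	for (int s = 1; s < n; s *= 2) {
		rx = 1 & (t / 2);
		ry = 1 & (t ^ rx);
		rot(s, x, y, rx, ry);
		*x += s * rx;
		*y += s * ry;
		t /= 4;
	}
}

int main(void)
{
	/* Map the first 16 page indices onto a 4x4 grid of pixels. */
	for (int d = 0; d < 16; d++) {
		int x, y;
		d2xy(4, d, &x, &y);
		printf("page %2d -> pixel (%d,%d)\n", d, x, y);
	}
	return 0;
}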

>> Maybe instead of reserving one could say that you can have up to 6
>> groups of space
>
> And if the groups are 1GB in size? I tried something like this already.
> It didn't work out well at the time although I could revisit.

You adjust group size with the number of groups total. You would not
use 1GB Huge Pages on a 2GB ram system. You could try 2MB groups. I
think for most current systems we are lucky there. 2MB groups fit
hardware support and give a large but not too large number of groups
to work with.

But you only need to stick to hardware suitable group sizes for huge
tlb support right? For better I/O and such you could have 512Kb groups
if that size gives a reasonable number of groups total.

>> not used by unmovable objects before aggressive moving
>> starts. I don't quite see why you NEED reserving as long as there is
>> enough space free alltogether in case something needs moving.
>
> hence, increase min_free_kbytes.

Which is different from reserving a full group as it does not count
fragmented space as lost.

>> 1 group
>> worth of space free might be plenty to move stuff too. Note that all
>> the virtual pages can be stuffed in every little free space there is
>> and reassembled by the MMU. There is no space lost there.
>>
>
> What you suggest sounds similar to having a type MIGRATE_MIXED where you
> allocate from when the preferred lists are full. It became a sizing
> problem that never really worked out. As I said, I can try again.

Not really. I'm saying we should actively defragment mixed groups
during allocation, and always as little as possible, when a certain
level of external fragmentation is reached. A MIGRATE_MIXED sounds
like giving up completely if things get bad enough. Compare it to a
cheap network switch going into hub mode when its arp table runs full.
If you have ever had that happen then you know how bad that is.

>> But until one tries one can't say.
>>
>> MfG
>> Goswin
>>
>> PS: How do allocations pick groups?
>
> Using GFP flags to identify the type.

That is the type of group, not which one.

>> Could one use the oldest group
>> dedicated to each MIGRATE_TYPE?
>
> Age is difficult to determine so probably not.

Put the uptime as sort key into each group header on creation or type
change. Then sort the partially used groups by that key. A heap will do
fine and be fast.

>> Or lowest address for unmovable and
>> highest address for movable? Something to better keep the two out of
>> each other way.
>
> We bias the location of unmovable and reclaimable allocations already. It's
> not done for movable because it wasn't necessary (as they are easily
> reclaimed or moved anyway).

Except that is never done, so it doesn't count.

MfG
Goswin
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
mel@skynet.ie (Mel Gorman) writes:

> On (17/09/07 00:38), Goswin von Brederlow didst pronounce:
>> mel@skynet.ie (Mel Gorman) writes:
>>
>> > On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
>> >> Mel Gorman <mel@csn.ul.ie> writes:
>> >> Looking at my
>> >> little test program evicting movable objects from a mixed group should
>> >> not be that expensive as it doesn't happen often.
>> >
>> > It happens regularly if the size of the block you need to keep clean is
>> > lower than min_free_kbytes. In the case of hugepages, that was always
>> > the case.
>>
>> That assumes that the number of groups allocated for unmovable objects
>> will continuously grow and shrink.
>
> They do grow and shrink. The number of pagetables in use changes for
> example.

By numbers of groups' worth? And full groups get freed, unmixed and
filled by movable objects?

>> I'm assuming it will level off at
>> some size for long times (hours) under normal operations.
>
> It doesn't unless you assume the system remains in a steady state for its
> lifetime. Things like updatedb tend to throw a spanner into the works.

Moved to cron weekly here. And even normally it is only once a day. So
what if it starts moving some pages while updatedb runs. If it isn't
too braindead it will reclaim some dentries updatedb has created and
left for good. It should just cause the dentry cache to be smaller at
no cost. I'm not calling that normal operations. That is a once a day
special. What I don't want is to spend 1% of cpu time copying
pages. That would be unacceptable. Copying 1000 pages per updatedb run
would be trivial on the other hand.

>> There should
>> be some buffering of a few groups to be held back in reserve when it
>> shrinks to prevent the scenario that the size is just at a group
>> boundary and always grows/shrinks by 1 group.
>>
>
> And what size should this group be that all workloads function?

1 is enough to prevent jittering. If you don't hold a group back and
you are exactly at a group boundary then alternately allocating and
freeing one page would result in a group allocation and freeing every
time. With one group in reserve you only get a group allocation or
freeing when a group's worth of change has happened.

This assumes that changing the type and/or state of a group is
expensive. Takes time or locks or some such. Otherwise just let it
jitter.
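A toy model of that hysteresis (hold at most one empty group in reserve, so
one page allocated and freed right at a group boundary no longer converts a
group on every operation); the numbers and names are made up for
illustration:

#include <stdio.h>

static int spare_groups;	/* empty groups held in reserve */
static int conversions;		/* how often a group had to be converted */

static void group_became_empty(void)
{
	if (spare_groups < 1)
		spare_groups++;		/* keep it, absorb the jitter */
	else
		conversions++;		/* give it back (type/state change) */
}

static void need_group(void)
{
	if (spare_groups > 0)
		spare_groups--;		/* reuse the reserve, no conversion */
	else
		conversions++;		/* must convert/allocate a new group */
}

int main(void)
{
	/* allocate and free one page at an exact group boundary 1000 times */
	for (int i = 0; i < 1000; i++) {
		need_group();		/* the allocation crosses the boundary */
		group_became_empty();	/* the free empties the group again */
	}
	printf("group conversions with one group in reserve: %d\n", conversions);
	return 0;
}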

>> >> So if
>> >> you evict movable objects from mixed group when needed all the
>> >> pagetable pages would end up in the same mixed group slowly taking it
>> >> over completly. No fragmentation at all. See how essential that
>> >> feature is. :)
>> >>
>> >
>> > To move pages, there must be enough blocks free. That is where
>> > min_free_kbytes had to come in. If you cared only about keeping 64KB
>> > chunks free, it makes sense but it didn't in the context of hugepages.
>>
>> I'm more concerned with keeping the little unmovable things out of the
>> way. Those are the things that will fragment the memory and prevent
>> any huge pages from being available even with moving other stuff out of the
>> way.
>
> That's fair, just not cheap

That is the price you pay. To allocate 2MB of ram you have to have 2MB
of free ram or make them free. There is no way around that. Moving
pages means that you can actually get those 2MB even if the price
is high and that you have more choice deciding what to throw away or
swap out. I would rather have a 2MB malloc take some time than have it
fail because the kernel doesn't feel like it.

>> Can you tell me how? I would like to do the same.
>>
>
> They were generated using trace_allocmap kernel module in
> http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.81-rc2.tar.gz
> in combination with frag-display in the same package. However, in the
> current version against current -mm's, it'll identify some movable pages
> wrong. Specifically, it will appear to be mixing movable pages with slab
> pages and it doesn't identify SLUB pages properly at all (SLUB came after
> the last revision of this tool). I need to bring an annotation patch up to
> date before it can generate the images correctly.

Thanks. I will test that out and see what I get on a few lustre
servers and clients. That is probably quite a different workload from
what you test.

MfG
Goswin
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
Andrea Arcangeli <andrea@suse.de> writes:

> On Mon, Sep 17, 2007 at 12:56:07AM +0200, Goswin von Brederlow wrote:
>> When has free ever given any useful "free" number? I can perfectly
>> well allocate another gigabyte of memory despite free saying 25MB. But
>> that is because I know that the buffer/cached are not locked in.
>
Well, as you said you know that buffer/cached are not locked in. If
/proc/meminfo were rubbish like you seem to imply in the first
line, why would we ever bother to export that information and even
waste time writing a binary that parses it for admins?

As a user I know it because I didn't put a kernel source into /tmp. A
program can't reasonably know that.

>> On the other hand 1GB can instantly vanish when I start a Xen domain
>> and anything relying on the free value would lose.
>
Actually you'd better check meminfo or free before starting 1G of Xen!!

Xen has its own memory pool and can quite aggressively reclaim memory
from dom0 when needed. I just meant to say that the number in
/proc/meminfo can change in a second so it is not much use knowing
what it said last minute.

>> The only sensible thing for an application concerned with swapping is
>> to watch the swapping and then reduce itself. Not the amount
>> free. Although I wish there were some kernel interface to get a
>> pressure value of how valuable free pages would be right now. I would
>> like that for fuse so a userspace filesystem can do caching without
>> crippling the kernel.
>
> Repeated drop caches + free can help.

I would kill any program that does that to find out how much free ram
the system has.

MfG
Goswin
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Sun, 16 September 2007 11:44:09 -0700, Linus Torvalds wrote:
> On Sun, 16 Sep 2007, Jörn Engel wrote:
> >
> > My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc.
> > which are pinned for their entire lifetime and another for regular
> > files/inodes. One could take a three-way approach and have
> > always-pinned, often-pinned and rarely-pinned.
> >
> > We won't get never-pinned that way.
>
> That sounds pretty good. The problem, of course, is that most of the time,
> the actual dentry allocation itself is done before you really know which
> case the dentry will be in, and the natural place for actually giving the
> dentry lifetime hint is *not* at "d_alloc()", but when we "instantiate"
> it with d_add() or d_instantiate().
>
> [...]
>
> And yes, you'd end up with the reallocation overhead quite often, but at
> least it would now happen only when filling in a dentry, not in the
> (*much* more critical) cached lookup path.

There may be another approach. We could create a never-pinned cache,
without trying hard to keep it full. Instead of moving a hot dentry at
dput() time, we move a cold one from the end of lru. And if the lru
list is short, we just chicken out.

Our definition of "short lru list" can either be based on a ratio of
pinned to unpinned dentries or on a metric of cache hits vs. cache
misses. I tend to dislike the cache hit metric, because updatedb would
cause tons of misses and result in the same mess we have right now.

With this double cache, we have a source of slabs to cheaply reap under
memory pressure, but still have a performance advantage (memcpy beats
disk io by orders of magnitude).

Jörn

--
The story so far:
In the beginning the Universe was created. This has made a lot
of people very angry and been widely regarded as a bad move.
-- Douglas Adams
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Sep 23, 2007, at 02:22:12, Goswin von Brederlow wrote:
> mel@skynet.ie (Mel Gorman) writes:
>> On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
>>> But when you already have say 10% of the ram in mixed groups then
>>> it is a sign that external fragmentation is happening and some time
>>> should be spent on moving movable objects.
>>
>> I'll play around with it on the side and see what sort of results
>> I get. I won't be pushing anything any time soon in relation to
>> this though. For now, I don't intend to fiddle more with grouping
>> pages by mobility for something that may or may not be of benefit
>> to a feature that hasn't been widely tested with what exists today.
>
> I watched the videos you posted. A nice and quite clear improvement
> with and without your logic. Kudos.
>
> When you play around with it may I suggest a change to the display
> of the memory information. I think it would be valuable to use a
> Hilbert curve to arrange the pages into pixels. Like this:
>
> # #    0 3
> # #
> ###    1 2
>
> ### ###    0 1 E F
>   # #
> ### ###    3 2 D C
> #     #
> # ### #    4 7 8 B
> # # # #
> ### ###    5 6 9 A

Here's an excellent example of a 0-255 numbered Hilbert curve used
to enumerate the various top-level allocations of IPv4 space:
http://xkcd.com/195/

Cheers,
Kyle Moffett

Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Sun, Sep 23, 2007 at 08:56:39AM +0200, Goswin von Brederlow wrote:
> As a user I know it because I didn't put a kernel source into /tmp. A
> program can't reasonably know that.

Various apps require you (admin/user) to tune the size of their
caches. Seems like you never tried to set up a database, oh well.

> Xen has its own memory pool and can quite aggressively reclaim memory
> from dom0 when needed. I just meant to say that the number in

The whole point is if there's not enough ram of course... this is why
you should check.

> /proc/meminfo can change in a second so it is not much use knowing
> what it said last minute.

The numbers will change depending on what's running on your
system. It's up to you to know; plus I normally keep vmstat monitored
in the background to see how the cache/free levels change over
time. Those numbers are worthless if they could be fragmented...

> I would kill any program that does that to find out how much free ram
> the system has.

The admin should do that if he's unsure, not a program of course!
Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Fri, 21 Sep 2007, Hugh Dickins wrote:

> I've found some fixes needed on top of your Large Blocksize Support
> patches: I'll send those to you in a moment. Looks like you didn't
> try much swapping!

yup. Thanks for looking at it.

>
> I only managed to get ext2 working with larger blocksizes:
> reiserfs -b 8192 wouldn't mount ("reiserfs_fill_super: can not find
> reiserfs on /dev/sdb1"); ext3 gave me mysterious errors ("JBD: tar
> wants too many credits", even after adding JBD patches that you
> turned out to be depending on); and I didn't try ext4 or xfs
> (I'm guessing the latter has been quite well tested ;)

Yes, there were issues with the first releases of the JBD patches. The
current crop in mm is fine but much of that may have bypassed this list.



Re: [00/41] Large Blocksize Support V7 (adds memmap support) [ In reply to ]
On Thursday 20 September 2007 11:38, David Chinner wrote:
> On Wed, Sep 19, 2007 at 04:04:30PM +0200, Andrea Arcangeli wrote:

> > Plus of course you don't like fsblock because it requires work to
> > adapt a fs to it, I can't argue about that.
>
> No, I don't like fsblock because it is inherently a "structure
> per filesystem block" construct, just like buggerheads. You
> still need to allocate millions of them when you have millions of
> dirty pages around. Rather than type it all out again, read
> the fsblocks thread from here:

I don't think there is anything inherently wrong with a structure
per filesystem block construct, in the places where you want to
have a handle on that information.

In the data path of a lot of filesystems, it's not really useful, no
question.

But the block / buffer head concept is there and is useful for many
things obviously. fsblock, I believe, improves on it; maybe even to
the point where there won't be too much reason for many
filesystems to convert their data paths to something different (eg.
nobh mode, or an extent block mapping).


> http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2
>
> FWIW, with Chris Mason's extent-based block mapping (which btrfs
> is using and Christoph Hellwig is porting XFS over to) we completely
> remove buggerheads from XFS and so fsblock would be a pretty
> major step backwards for us if Chris's work goes into mainline.

If you don't need to manipulate or manage the pagecache on a
per-block basis anyway, then you shouldn't need fsblock (or anything
else particularly special) to do higher order block sizes.

If you do sometimes need to, then fsblock *may* be a way you can
remove vmap code from your filesystem and share it with generic
code...


> But, I'm not going to argue endlessly for one solution or another;
> I'm happy to see different solutions being chased, so may the
> best VM win ;)

Agreed :)

