Mailing List Archive

[patch] new-vm improvement [Re: 2.2.0 Bug summary]
On Thu, 31 Dec 1998, Andrea Arcangeli wrote:
> Comments?
>
> Ah, the shrink_mmap limit was wrong, since we account only not-referenced
> pages.
>
> Patch against 2.2.0-pre1:
Whoops, in the last email I forgot to change the subject a bit (adding
[patch]) and to remove these printks:
Index: linux/mm/vmscan.c
diff -u linux/mm/vmscan.c:1.1.1.1.2.43 linux/mm/vmscan.c:1.1.1.1.2.45
--- linux/mm/vmscan.c:1.1.1.1.2.43 Thu Dec 31 17:56:27 1998
+++ linux/mm/vmscan.c Thu Dec 31 19:41:06 1998
@@ -449,11 +449,7 @@
case 0:
/* swap_out() failed to swapout */
if (shrink_mmap(priority, gfp_mask))
- {
- printk("swapout 0 shrink 1\n");
return 1;
- }
- printk("swapout 0 shrink 0\n");
return 0;
case 1:
/* this would be the best but should not happen right now */
Andrea Arcangeli
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Hi Andrea,
I added your patch to test1-2.2.0-pre2, and it does not appear to work,
because the page cache grows too large and swaps out too many things.
At bootup:
telomere:~> cat free1
             total       used       free     shared    buffers     cached
Mem:         63476      25684      37792      18108       1476      15376
-/+ buffers/cache:       8832      54644
Swap:        34236          0      34236
After running netscape:
telomere:~> cat free2
             total       used       free     shared    buffers     cached
Mem:         63476      48352      15124      30592       1796      28772
-/+ buffers/cache:      17784      45692
Swap:        34236          0      34236
After running 'wc /usr/bin/*'
telomere:~> cat free3
             total       used       free     shared    buffers     cached
Mem:         63476      61100       2376      10464       1064      51040
-/+ buffers/cache:       8996      54480
Swap:        34236       9904      24332
Without your patch, running 'wc /usr/bin/*' swaps out only about 220k.
So perhaps your change to 'count' in filemap.c was incorrect?
> - count = (limit<<1) >> (priority);
> + count = limit >> priority;
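(For illustration: at any given priority the new formula scans half as many
not-referenced pages per call, e.g. limit/32 becomes limit/64 at priority 6,
so shrink_mmap() gives up sooner and reclaims the cache half as aggressively,
which would fit the cache growth I'm seeing.)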
-benRI
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
I'll try to explain my latest VM patch.
The patch basically does two things.
It adds a heuristic to block thrashing tasks in try_to_free_pages() while
allowing normal tasks to keep running fine in the meantime.
It returns to the old do_try_to_free_pages() way of doing things. I think the
reason the old way was no longer working well is that we were using
swap_out() like the other freeing methods, while swapout really has nothing
to do with them.
To get VM stability under low memory we must use both swap_out() (which puts
pages from the user process's virtual memory into the swap cache) and
shrink_mmap() in a new way. My new method puts user pages in the swap cache
because there we can handle aging very well. Then shrink_mmap() can free a
not-referenced page to make real progress in freeing memory (and not only in
swapping out).
So basically my patch does make the system swap out more than we were used
to, but most of the time we will not need a swapin to put the pages back
into the process's virtual memory.
Somebody reported a big slowdown of the thrashing application. Right now I
don't know which bit of the patch caused this slowdown (yesterday my
benchmark here didn't show it). My new trashing_memory heuristic will
probably decrease performance for the thrashing application (but hey, you
know that if you need performance you can always buy more RAM ;), but it
will improve performance a lot for normal, non-thrashing tasks.
I'll try to change do_free_user_and_cache() to see if I can achieve
something better.
I also changed swap_out(), since I think the best way to choose a process
is to compare the raw RSS. And I don't want swap_cnt to be decremented
every time something is swapped out: I want the kernel to keep passing
through all the pages of one process once it has started playing with it
(if the process still exists, of course ;). I also changed the pressure of
swap_out(), since it makes no sense to me to pass more than once over the
VM of all tasks in the system. Now at priority 6 swap_out() tries to swap
out something from at most nr_tasks/7 tasks (with a lower bound of 1 task).
I also changed the pressure of shrink_mmap(), because it made no sense to
me to do two passes over just the not-referenced pages.
I also changed swapout(), allowing it to return 0, 1 or more.
0 means that swap_out() was not able to put anything into the swap cache.
1 means that swap_out() was able to swap something out and has also freed
up a page (how??? it can't right now, because the page should always still
be present at least in the swap cache).
2 means that swap_out() has swapped out one page and that the page is still
referenced somewhere (probably by the swap cache).
So in case 2 and case 0 we must use shrink_mmap() to make real progress in
page freeing. This is the idea that my new do_free_user_and_cache()
follows.
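In user-space sketch form the dispatch looks like this (the stubs stand in
for the real kernel functions; this just restates the idea, the real code
is in my patch):

	/* sketch: stubs stand in for the kernel functions */
	static int swap_out(int priority, int gfp_mask)    { return 2; }
	static int shrink_mmap(int priority, int gfp_mask) { return 1; }

	static int do_free_user_and_cache(int priority, int gfp_mask)
	{
		switch (swap_out(priority, gfp_mask)) {
		default:
			/* case 2: a page went to the swap cache but is
			 * still referenced there; shrink_mmap() can now
			 * complete the freeing. */
			shrink_mmap(priority, gfp_mask);
			return 1;	/* some swapping progress */
		case 0:
			/* nothing could be unmapped: only the cache can
			 * give memory back */
			return shrink_mmap(priority, gfp_mask) ? 1 : 0;
		case 1:
			/* unmapped _and_ freed: should not happen now */
			return 1;
		}
	}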
Comments?
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
I rediffed my VM patch against test1-patch-2.2.0-pre3.gz. I also fixed some
bugs (not totally critical, but...) pointed out by Linus in my last code. I
also changed the shrink_mmap(0) to shrink_mmap(priority), because it was
costing a lot of performance. There is no need to do a shrink_mmap(0), for
example, if the cache/buffers are under min. In that case we must allow
swap_out() to grow the cache before starting to shrink it.
So basically this new patch is _far_ more efficient than the last one (I
have never seen such good/stable/fast behavior before!).
This new patch is against testing/test1-patch-2.2.0-pre3.gz, which is
against v2.1/2.2.0-pre2, which is against patch-2.2.0-pre1-vs-2.1.132.gz
(where is this last one now?).
Ah, testing/test1-patch-2.2.0-pre3.gz was missing the trashing_memory
initialization that allows every process to make a fast start.
Index: linux/kernel/fork.c
diff -u linux/kernel/fork.c:1.1.1.3 linux/kernel/fork.c:1.1.1.1.2.6
--- linux/kernel/fork.c:1.1.1.3 Thu Dec 3 12:55:12 1998
+++ linux/kernel/fork.c Thu Dec 31 17:56:28 1998
@@ -567,6 +570,7 @@

/* ok, now we should be set up.. */
p->swappable = 1;
+ p->trashing_memory = 0;
p->exit_signal = clone_flags & CSIGNAL;
p->pdeath_signal = 0;

Index: linux/mm/vmscan.c
diff -u linux/mm/vmscan.c:1.1.1.8 linux/mm/vmscan.c:1.1.1.1.2.49
--- linux/mm/vmscan.c:1.1.1.8 Fri Jan 1 19:12:54 1999
+++ linux/mm/vmscan.c Fri Jan 1 20:29:19 1999
@@ -162,8 +162,9 @@
* copy in memory, so we add it to the swap
* cache. */
if (PageSwapCache(page_map)) {
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
- return (atomic_read(&page_map->count) == 0);
+ return entry;
}
add_to_swap_cache(page_map, entry);
/* We checked we were unlocked way up above, and we
@@ -180,8 +181,9 @@
* asynchronously. That's no problem, shrink_mmap() can
* correctly clean up the occassional unshared page
* which gets left behind in the swap cache. */
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
- return 1; /* we slept: the process may not exist any more */
+ return entry; /* we slept: the process may not exist any more */
}

/* The page was _not_ dirty, but still has a zero age. It must
@@ -194,8 +196,9 @@
set_pte(page_table, __pte(entry));
flush_tlb_page(vma, address);
swap_duplicate(entry);
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
- return (atomic_read(&page_map->count) == 0);
+ return entry;
}
/*
* A clean page to be discarded? Must be mmap()ed from
@@ -210,7 +213,7 @@
flush_cache_page(vma, address);
pte_clear(page_table);
flush_tlb_page(vma, address);
- entry = (atomic_read(&page_map->count) == 1);
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
return entry;
}
@@ -369,8 +372,14 @@
* swapped out. If the swap-out fails, we clear swap_cnt so the
* task won't be selected again until all others have been tried.
*/
- counter = ((PAGEOUT_WEIGHT * nr_tasks) >> 10) >> priority;
+ counter = nr_tasks / (priority+1);
+ if (counter < 1)
+ counter = 1;
+ if (counter > nr_tasks)
+ counter = nr_tasks;
+
for (; counter >= 0; counter--) {
+ int retval;
assign = 0;
max_cnt = 0;
pbest = NULL;
@@ -382,15 +391,8 @@
continue;
if (p->mm->rss <= 0)
continue;
- if (assign) {
- /*
- * If we didn't select a task on pass 1,
- * assign each task a new swap_cnt.
- * Normalise the number of pages swapped
- * by multiplying by (RSS / 1MB)
- */
- p->swap_cnt = AGE_CLUSTER_SIZE(p->mm->rss);
- }
+ if (assign)
+ p->swap_cnt = p->mm->rss;
if (p->swap_cnt > max_cnt) {
max_cnt = p->swap_cnt;
pbest = p;
@@ -404,14 +406,13 @@
}
goto out;
}
- pbest->swap_cnt--;
-
/*
* Nonzero means we cleared out something, but only "1" means
* that we actually free'd up a page as a result.
*/
- if (swap_out_process(pbest, gfp_mask) == 1)
- return 1;
+ retval = swap_out_process(pbest, gfp_mask);
+ if (retval)
+ return retval;
}
out:
return 0;
@@ -438,44 +439,74 @@
printk ("Starting kswapd v%.*s\n", i, s);
}

-#define free_memory(fn) \
- count++; do { if (!--count) goto done; } while (fn)
+static int do_free_user_and_cache(int priority, int gfp_mask)
+{
+ switch (swap_out(priority, gfp_mask))
+ {
+ default:
+ shrink_mmap(priority, gfp_mask);
+ /*
+ * We done at least some swapping progress so return 1 in
+ * this case. -arca
+ */
+ return 1;
+ case 0:
+ /* swap_out() failed to swapout */
+ if (shrink_mmap(priority, gfp_mask))
+ return 1;
+ return 0;
+ case 1:
+ /* this would be the best but should not happen right now */
+ printk(KERN_DEBUG
+ "do_free_user_and_cache: swapout returned 1\n");
+ return 1;
+ }
+}

-static int kswapd_free_pages(int kswapd_state)
+static int do_free_page(int * state, int gfp_mask)
{
- unsigned long end_time;
+ int priority = 6;

- /* Always trim SLAB caches when memory gets low. */
- kmem_cache_reap(0);
+ kmem_cache_reap(gfp_mask);

+ switch (*state) {
+ do {
+ default:
+ if (do_free_user_and_cache(priority, gfp_mask))
+ return 1;
+ *state = 1;
+ case 1:
+ if (shm_swap(priority, gfp_mask))
+ return 1;
+ *state = 2;
+ case 2:
+ shrink_dcache_memory(priority, gfp_mask);
+ *state = 0;
+ } while (--priority >= 0);
+ }
+ return 0;
+}
+
+static int kswapd_free_pages(int kswapd_state)
+{
/* max one hundreth of a second */
- end_time = jiffies + (HZ-1)/100;
- do {
- int priority = 5;
- int count = pager_daemon.swap_cluster;
+ unsigned long end_time = jiffies + (HZ-1)/100;

- switch (kswapd_state) {
- do {
- default:
- free_memory(shrink_mmap(priority, 0));
- kswapd_state++;
- case 1:
- free_memory(shm_swap(priority, 0));
- kswapd_state++;
- case 2:
- free_memory(swap_out(priority, 0));
- shrink_dcache_memory(priority, 0);
- kswapd_state = 0;
- } while (--priority >= 0);
- return kswapd_state;
- }
-done:
- if (nr_free_pages > freepages.high + pager_daemon.swap_cluster)
+ do {
+ do_free_page(&kswapd_state, 0);
+ if (nr_free_pages > freepages.high)
break;
} while (time_before_eq(jiffies,end_time));
+ /* take kswapd_state on the stack to save some byte of memory */
return kswapd_state;
}

+static inline void enable_swap_tick(void)
+{
+ timer_table[SWAP_TIMER].expires = jiffies+(HZ+99)/100;
+ timer_active |= 1<<SWAP_TIMER;
+}
+
/*
* The background pageout daemon.
* Started as a kernel thread from the init process.
@@ -523,6 +554,7 @@
current->state = TASK_INTERRUPTIBLE;
flush_signals(current);
run_task_queue(&tq_disk);
+ enable_swap_tick();
schedule();
swapstats.wakeups++;
state = kswapd_free_pages(state);
@@ -542,35 +574,23 @@
* if we need more memory as part of a swap-out effort we
* will just silently return "success" to tell the page
* allocator to accept the allocation.
- *
- * We want to try to free "count" pages, and we need to
- * cluster them so that we get good swap-out behaviour. See
- * the "free_memory()" macro for details.
*/
int try_to_free_pages(unsigned int gfp_mask, int count)
{
- int retval;
-
+ int retval = 1;
lock_kernel();

- /* Always trim SLAB caches when memory gets low. */
- kmem_cache_reap(gfp_mask);
-
- retval = 1;
if (!(current->flags & PF_MEMALLOC)) {
- int priority;
-
current->flags |= PF_MEMALLOC;
-
- priority = 5;
- do {
- free_memory(shrink_mmap(priority, gfp_mask));
- free_memory(shm_swap(priority, gfp_mask));
- free_memory(swap_out(priority, gfp_mask));
- shrink_dcache_memory(priority, gfp_mask);
- } while (--priority >= 0);
- retval = 0;
-done:
+ while (count--)
+ {
+ static int state = 0;
+ if (!do_free_page(&state, gfp_mask))
+ {
+ retval = 0;
+ break;
+ }
+ }
current->flags &= ~PF_MEMALLOC;
}
unlock_kernel();
@@ -593,7 +613,8 @@
if (priority) {
p->counter = p->priority << priority;
wake_up_process(p);
- }
+ } else
+ enable_swap_tick();
}

/*
@@ -631,9 +652,8 @@
want_wakeup = 3;

kswapd_wakeup(p,want_wakeup);
- }
-
- timer_active |= (1<<SWAP_TIMER);
+ } else
+ enable_swap_tick();
}

/*
@@ -642,7 +662,6 @@

void init_swap_timer(void)
{
- timer_table[SWAP_TIMER].expires = jiffies;
timer_table[SWAP_TIMER].fn = swap_tick;
- timer_active |= (1<<SWAP_TIMER);
+ enable_swap_tick();
}
Index: linux/mm/swap_state.c
diff -u linux/mm/swap_state.c:1.1.1.4 linux/mm/swap_state.c:1.1.1.1.2.9
--- linux/mm/swap_state.c:1.1.1.4 Fri Jan 1 19:12:54 1999
+++ linux/mm/swap_state.c Fri Jan 1 19:25:33 1999
@@ -262,6 +262,9 @@
struct page * lookup_swap_cache(unsigned long entry)
{
struct page *found;
+#ifdef SWAP_CACHE_INFO
+ swap_cache_find_total++;
+#endif

while (1) {
found = find_page(&swapper_inode, entry);
@@ -269,8 +272,12 @@
return 0;
if (found->inode != &swapper_inode || !PageSwapCache(found))
goto out_bad;
- if (!PageLocked(found))
+ if (!PageLocked(found)) {
+#ifdef SWAP_CACHE_INFO
+ swap_cache_find_success++;
+#endif
return found;
+ }
__free_page(found);
__wait_on_page(found);
}
If this patch decreases performance for you (possibly due to too much
memory being swapped out) you can try this incremental patch (which I have
never tried here, btw):
Index: mm//vmscan.c
===================================================================
RCS file: /var/cvs/linux/mm/vmscan.c,v
retrieving revision 1.1.1.1.2.49
diff -u -r1.1.1.1.2.49 vmscan.c
--- vmscan.c 1999/01/01 19:29:19 1.1.1.1.2.49
+++ linux/mm/vmscan.c 1999/01/01 19:51:22
@@ -441,6 +441,9 @@

static int do_free_user_and_cache(int priority, int gfp_mask)
{
+ if (shrink_mmap(priority, gfp_mask))
+ return 1;
+
switch (swap_out(priority, gfp_mask))
{
default:
I have written a swap benchmark that dirtifies 160 Mbyte of VM in a loop.
For the first loop 2.2-pre1 took 106 sec, for the second loop 120, and then
it got worse.
test1-pre3 + my new patch in this email instead takes 120 sec for the first
loop (since it's allocating, it's probably slowed down a bit by the
trashing_memory heuristic, and that's right), then 90 sec for the second
loop and 77 sec for the third loop!! And the system was far from idle (as
it was when I measured 2.2-pre1), but I was using it without special care
and it was perfectly usable (2.2-pre1 was unusable instead).
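For reference, a minimal sketch of such a benchmark (my reconstruction from
the description above, not the actual program I ran):

	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>

	#define SIZE (160UL << 20)	/* dirtify 160 Mbyte of VM */
	#define PAGE 4096UL

	int main(void)
	{
		char *buf = malloc(SIZE);
		unsigned long off;
		time_t start;
		int loop;

		if (!buf)
			return 1;
		for (loop = 1; loop <= 3; loop++) {
			start = time(NULL);
			/* touch every page so the kernel must swap */
			for (off = 0; off < SIZE; off += PAGE)
				buf[off] = 1;
			printf("loop %d: %ld sec\n", loop,
			       (long)(time(NULL) - start));
		}
		return 0;
	}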
Comments?
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Andrea Arcangeli wrote:
>
> Please stop and try my new patch against Linus's test1-pre3 (which just
> merges some of my new stuff).
I got the patch and I must say I'm impressed. I ran my "117 image" test
and got these results:
[Note: This loads 117 different images at the same time using 117
separate instances of 'xv' started in the background, and results in ~
165 MB of swap area usage. The machine is an AMD K6-2 300 with 128MB.]
2.1.131-ac11                          172 sec  (This was previously the best)
2.2.0-pre1 + Arcangeli's 1st patch    400 sec
test1-pre + Arcangeli's 2nd patch     119 sec  (!)
Processor utilization was substantially greater with the new patch
compared to either of the others. Before it starts using swap, memory
is being consumed at ~ 4MB/sec. After it starts to swap out, it streams
out at ~ 2MB/sec.
The performance is ~ 45% better than ac11 and ~ 70% better than
2.2.0-pre1 in this test.
I was going to test the low memory case but got side tracked.
Thanks,
Steve
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
On Fri, 1 Jan 1999, Andrea Arcangeli wrote:
> I rediffed my VM patch against test1-patch-2.2.0-pre3.gz. I also fixed some
> bugs (not totally critical, but...) pointed out by Linus in my last code. I
> also changed the shrink_mmap(0) to shrink_mmap(priority), because it was
> costing a lot of performance. There is no need to do a shrink_mmap(0), for
> example, if the cache/buffers are under min. In that case we must allow
> swap_out() to grow the cache before starting to shrink it.
>
> So basically this new patch is _far_ more efficient than the last one (I
> have never seen such good/stable/fast behavior before!).
Hmm, I just found a big problem: the patch was perfect as long as there
was no I/O bound application running.
When an I/O bound application starts to read/write through the fs, the
buffers and the cache grow, so kswapd has to use do_free_user_and_cache()
to make space for the new data in the cache.
The problem with my last approach is that do_free_user_and_cache() was
always generating I/O to asynchronously push part of user memory out to
swap. This had a _bad_ impact on the I/O performance of the I/O bound
process :(.
I am the first guy to hate seeing swapins/swapouts while there are tons of
freeable memory sitting in cache/buffers.
So I obviously changed something. This new patch fixes the problem fine;
even if it doesn't achieve the same interactive performance as before under
heavy swapping (but it's close), it's a bit more sane ;).
The system is still perfectly balanced, though, and now there are no
unnecessary swapins/swapouts under heavy fs operation while there is a lot
of freeable memory.
Since to be happy I always need to change something more than what's
needed, I also moved kmem_cache_reap() next to shrink_dcache_memory().
Here is a new patch against test1-pre3. Steve, if you are going to make
comparisons, let me know the results of course! Thanks.
You can also try to increase priority = 8 in vmscan.c to 9 and see if the
benchmark improves that way...
Index: linux/kernel/fork.c
diff -u linux/kernel/fork.c:1.1.1.3 linux/kernel/fork.c:1.1.1.1.2.6
--- linux/kernel/fork.c:1.1.1.3 Thu Dec 3 12:55:12 1998
+++ linux/kernel/fork.c Thu Dec 31 17:56:28 1998
@@ -567,6 +570,7 @@

/* ok, now we should be set up.. */
p->swappable = 1;
+ p->trashing_memory = 0;
p->exit_signal = clone_flags & CSIGNAL;
p->pdeath_signal = 0;

Index: linux/mm/swap_state.c
diff -u linux/mm/swap_state.c:1.1.1.4 linux/mm/swap_state.c:1.1.1.1.2.9
--- linux/mm/swap_state.c:1.1.1.4 Fri Jan 1 19:12:54 1999
+++ linux/mm/swap_state.c Fri Jan 1 19:25:33 1999
@@ -262,6 +262,9 @@
struct page * lookup_swap_cache(unsigned long entry)
{
struct page *found;
+#ifdef SWAP_CACHE_INFO
+ swap_cache_find_total++;
+#endif

while (1) {
found = find_page(&swapper_inode, entry);
@@ -269,8 +272,12 @@
return 0;
if (found->inode != &swapper_inode || !PageSwapCache(found))
goto out_bad;
- if (!PageLocked(found))
+ if (!PageLocked(found)) {
+#ifdef SWAP_CACHE_INFO
+ swap_cache_find_success++;
+#endif
return found;
+ }
__free_page(found);
__wait_on_page(found);
}
Index: linux/mm/vmscan.c
diff -u linux/mm/vmscan.c:1.1.1.8 linux/mm/vmscan.c:1.1.1.1.2.51
--- linux/mm/vmscan.c:1.1.1.8 Fri Jan 1 19:12:54 1999
+++ linux/mm/vmscan.c Sat Jan 2 04:18:31 1999
@@ -10,6 +10,11 @@
* Version: $Id: vmscan.c,v 1.5 1998/02/23 22:14:28 sct Exp $
*/

+/*
+ * Revisioned the page freeing algorithm: do_free_user_and_cache().
+ * Copyright (C) 1998 Andrea Arcangeli
+ */
+
#include <linux/slab.h>
#include <linux/kernel_stat.h>
#include <linux/swap.h>
@@ -162,8 +167,9 @@
* copy in memory, so we add it to the swap
* cache. */
if (PageSwapCache(page_map)) {
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
- return (atomic_read(&page_map->count) == 0);
+ return entry;
}
add_to_swap_cache(page_map, entry);
/* We checked we were unlocked way up above, and we
@@ -180,8 +186,9 @@
* asynchronously. That's no problem, shrink_mmap() can
* correctly clean up the occassional unshared page
* which gets left behind in the swap cache. */
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
- return 1; /* we slept: the process may not exist any more */
+ return entry; /* we slept: the process may not exist any more */
}

/* The page was _not_ dirty, but still has a zero age. It must
@@ -194,8 +201,9 @@
set_pte(page_table, __pte(entry));
flush_tlb_page(vma, address);
swap_duplicate(entry);
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
- return (atomic_read(&page_map->count) == 0);
+ return entry;
}
/*
* A clean page to be discarded? Must be mmap()ed from
@@ -210,7 +218,7 @@
flush_cache_page(vma, address);
pte_clear(page_table);
flush_tlb_page(vma, address);
- entry = (atomic_read(&page_map->count) == 1);
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
return entry;
}
@@ -369,8 +377,14 @@
* swapped out. If the swap-out fails, we clear swap_cnt so the
* task won't be selected again until all others have been tried.
*/
- counter = ((PAGEOUT_WEIGHT * nr_tasks) >> 10) >> priority;
+ counter = nr_tasks / (priority+1);
+ if (counter < 1)
+ counter = 1;
+ if (counter > nr_tasks)
+ counter = nr_tasks;
+
for (; counter >= 0; counter--) {
+ int retval;
assign = 0;
max_cnt = 0;
pbest = NULL;
@@ -382,15 +396,8 @@
continue;
if (p->mm->rss <= 0)
continue;
- if (assign) {
- /*
- * If we didn't select a task on pass 1,
- * assign each task a new swap_cnt.
- * Normalise the number of pages swapped
- * by multiplying by (RSS / 1MB)
- */
- p->swap_cnt = AGE_CLUSTER_SIZE(p->mm->rss);
- }
+ if (assign)
+ p->swap_cnt = p->mm->rss;
if (p->swap_cnt > max_cnt) {
max_cnt = p->swap_cnt;
pbest = p;
@@ -404,14 +411,13 @@
}
goto out;
}
- pbest->swap_cnt--;
-
/*
* Nonzero means we cleared out something, but only "1" means
* that we actually free'd up a page as a result.
*/
- if (swap_out_process(pbest, gfp_mask) == 1)
- return 1;
+ retval = swap_out_process(pbest, gfp_mask);
+ if (retval)
+ return retval;
}
out:
return 0;
@@ -438,44 +444,64 @@
printk ("Starting kswapd v%.*s\n", i, s);
}

-#define free_memory(fn) \
- count++; do { if (!--count) goto done; } while (fn)
+static int do_free_user_and_cache(int priority, int gfp_mask)
+{
+ if (shrink_mmap(priority, gfp_mask))
+ return 1;

-static int kswapd_free_pages(int kswapd_state)
+ if (swap_out(priority, gfp_mask))
+ /*
+ * We done at least some swapping progress so return 1 in
+ * this case. -arca
+ */
+ return 1;
+
+ return 0;
+}
+
+static int do_free_page(int * state, int gfp_mask)
{
- unsigned long end_time;
+ int priority = 8;

- /* Always trim SLAB caches when memory gets low. */
- kmem_cache_reap(0);
+ switch (*state) {
+ do {
+ default:
+ if (do_free_user_and_cache(priority, gfp_mask))
+ return 1;
+ *state = 1;
+ case 1:
+ if (shm_swap(priority, gfp_mask))
+ return 1;
+ *state = 2;
+ case 2:
+ shrink_dcache_memory(priority, gfp_mask);
+ kmem_cache_reap(gfp_mask);
+ *state = 0;
+ } while (--priority >= 0);
+ }
+ return 0;
+}

+static int kswapd_free_pages(int kswapd_state)
+{
/* max one hundreth of a second */
- end_time = jiffies + (HZ-1)/100;
- do {
- int priority = 5;
- int count = pager_daemon.swap_cluster;
+ unsigned long end_time = jiffies + (HZ-1)/100;

- switch (kswapd_state) {
- do {
- default:
- free_memory(shrink_mmap(priority, 0));
- kswapd_state++;
- case 1:
- free_memory(shm_swap(priority, 0));
- kswapd_state++;
- case 2:
- free_memory(swap_out(priority, 0));
- shrink_dcache_memory(priority, 0);
- kswapd_state = 0;
- } while (--priority >= 0);
- return kswapd_state;
- }
-done:
- if (nr_free_pages > freepages.high + pager_daemon.swap_cluster)
+ do {
+ do_free_page(&kswapd_state, 0);
+ if (nr_free_pages > freepages.high)
break;
} while (time_before_eq(jiffies,end_time));
+ /* take kswapd_state on the stack to save some byte of memory */
return kswapd_state;
}

+static inline void enable_swap_tick(void)
+{
+ timer_table[SWAP_TIMER].expires = jiffies+(HZ+99)/100;
+ timer_active |= 1<<SWAP_TIMER;
+}
+
/*
* The background pageout daemon.
* Started as a kernel thread from the init process.
@@ -523,6 +549,7 @@
current->state = TASK_INTERRUPTIBLE;
flush_signals(current);
run_task_queue(&tq_disk);
+ enable_swap_tick();
schedule();
swapstats.wakeups++;
state = kswapd_free_pages(state);
@@ -542,35 +569,23 @@
* if we need more memory as part of a swap-out effort we
* will just silently return "success" to tell the page
* allocator to accept the allocation.
- *
- * We want to try to free "count" pages, and we need to
- * cluster them so that we get good swap-out behaviour. See
- * the "free_memory()" macro for details.
*/
int try_to_free_pages(unsigned int gfp_mask, int count)
{
- int retval;
-
+ int retval = 1;
lock_kernel();

- /* Always trim SLAB caches when memory gets low. */
- kmem_cache_reap(gfp_mask);
-
- retval = 1;
if (!(current->flags & PF_MEMALLOC)) {
- int priority;
-
current->flags |= PF_MEMALLOC;
-
- priority = 5;
- do {
- free_memory(shrink_mmap(priority, gfp_mask));
- free_memory(shm_swap(priority, gfp_mask));
- free_memory(swap_out(priority, gfp_mask));
- shrink_dcache_memory(priority, gfp_mask);
- } while (--priority >= 0);
- retval = 0;
-done:
+ while (count--)
+ {
+ static int state = 0;
+ if (!do_free_page(&state, gfp_mask))
+ {
+ retval = 0;
+ break;
+ }
+ }
current->flags &= ~PF_MEMALLOC;
}
unlock_kernel();
@@ -593,7 +608,8 @@
if (priority) {
p->counter = p->priority << priority;
wake_up_process(p);
- }
+ } else
+ enable_swap_tick();
}

/*
@@ -631,9 +647,8 @@
want_wakeup = 3;

kswapd_wakeup(p,want_wakeup);
- }
-
- timer_active |= (1<<SWAP_TIMER);
+ } else
+ enable_swap_tick();
}

/*
@@ -642,7 +657,6 @@

void init_swap_timer(void)
{
- timer_table[SWAP_TIMER].expires = jiffies;
timer_table[SWAP_TIMER].fn = swap_tick;
- timer_active |= (1<<SWAP_TIMER);
+ enable_swap_tick();
}
Andrea Arcangeli
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
On Fri, 1 Jan 1999, Steve Bergman wrote:
>
> I got the patch and I must say I'm impressed. I ran my "117 image" test
> and got these results:
>
> 2.1.131-ac11                          172 sec  (This was previously the best)
> 2.2.0-pre1 + Arcangeli's 1st patch    400 sec
> test1-pre + Arcangeli's 2nd patch     119 sec  (!)
Would you care to do some more testing? In particular, I'd like to hear
how basic 2.2.0pre3 works (that's essentially the same as test1-pre, with
only minor updates). I'd like to calibrate the numbers against that, rather
than against kernels that I haven't actually ever run myself.
The other thing I'd like to hear is how pre3 looks with this patch, which
should behave basically like Andrea's latest patch but without the
obfuscation he put into his patch..
Linus
-----
diff -u --recursive --new-file v2.2.0-pre3/linux/Makefile linux/Makefile
--- v2.2.0-pre3/linux/Makefile Fri Jan 1 12:58:14 1999
+++ linux/Makefile Fri Jan 1 12:58:29 1999
@@ -1,7 +1,7 @@
VERSION = 2
PATCHLEVEL = 2
SUBLEVEL = 0
-EXTRAVERSION =-pre3
+EXTRAVERSION =-pre4

ARCH := $(shell uname -m | sed -e s/i.86/i386/ -e s/sun4u/sparc64/ -e s/arm.*/arm/ -e s/sa110/arm/)

diff -u --recursive --new-file v2.2.0-pre3/linux/drivers/misc/parport_procfs.c linux/drivers/misc/parport_procfs.c
--- v2.2.0-pre3/linux/drivers/misc/parport_procfs.c Sun Nov 8 14:02:59 1998
+++ linux/drivers/misc/parport_procfs.c Fri Jan 1 21:27:12 1999
@@ -305,12 +305,11 @@
{
base = new_proc_entry("parport", S_IFDIR, &proc_root,PROC_PARPORT,
NULL);
- base->fill_inode = &parport_modcount;
-
if (base == NULL) {
printk(KERN_ERR "Unable to initialise /proc/parport.\n");
return 0;
}
+ base->fill_inode = &parport_modcount;

return 1;
}
diff -u --recursive --new-file v2.2.0-pre3/linux/fs/binfmt_misc.c linux/fs/binfmt_misc.c
--- v2.2.0-pre3/linux/fs/binfmt_misc.c Fri Jan 1 12:58:20 1999
+++ linux/fs/binfmt_misc.c Fri Jan 1 13:00:10 1999
@@ -30,6 +30,16 @@
#include <asm/uaccess.h>
#include <asm/spinlock.h>

+/*
+ * We should make this work with a "stub-only" /proc,
+ * which would just not be able to be configured.
+ * Right now the /proc-fs support is too black and white,
+ * though, so just remind people that this should be
+ * fixed..
+ */
+#ifndef CONFIG_PROC_FS
+#error You really need /proc support for binfmt_misc. Please reconfigure!
+#endif

#define VERBOSE_STATUS /* undef this to save 400 bytes kernel memory */

diff -u --recursive --new-file v2.2.0-pre3/linux/include/linux/swapctl.h linux/include/linux/swapctl.h
--- v2.2.0-pre3/linux/include/linux/swapctl.h Tue Dec 22 14:16:58 1998
+++ linux/include/linux/swapctl.h Fri Jan 1 22:31:21 1999
@@ -90,18 +90,6 @@
#define PAGE_DECLINE (swap_control.sc_page_decline)
#define PAGE_INITIAL_AGE (swap_control.sc_page_initial_age)

-/* Given a resource of N units (pages or buffers etc), we only try to
- * age and reclaim AGE_CLUSTER_FRACT per 1024 resources each time we
- * scan the resource list. */
-static inline int AGE_CLUSTER_SIZE(int resources)
-{
- unsigned int n = (resources * AGE_CLUSTER_FRACT) >> 10;
- if (n < AGE_CLUSTER_MIN)
- return AGE_CLUSTER_MIN;
- else
- return n;
-}
-
#endif /* __KERNEL */

#endif /* _LINUX_SWAPCTL_H */
diff -u --recursive --new-file v2.2.0-pre3/linux/mm/vmscan.c linux/mm/vmscan.c
--- v2.2.0-pre3/linux/mm/vmscan.c Fri Jan 1 12:58:21 1999
+++ linux/mm/vmscan.c Fri Jan 1 22:41:58 1999
@@ -363,13 +363,23 @@
/*
* We make one or two passes through the task list, indexed by
* assign = {0, 1}:
- * Pass 1: select the swappable task with maximal swap_cnt.
- * Pass 2: assign new swap_cnt values, then select as above.
+ * Pass 1: select the swappable task with maximal RSS that has
+ * not yet been swapped out.
+ * Pass 2: re-assign rss swap_cnt values, then select as above.
+ *
* With this approach, there's no need to remember the last task
* swapped out. If the swap-out fails, we clear swap_cnt so the
* task won't be selected again until all others have been tried.
+ *
+ * Think of swap_cnt as a "shadow rss" - it tells us which process
+ * we want to page out (always try largest first).
*/
- counter = ((PAGEOUT_WEIGHT * nr_tasks) >> 10) >> priority;
+ counter = nr_tasks / (priority+1);
+ if (counter < 1)
+ counter = 1;
+ if (counter > nr_tasks)
+ counter = nr_tasks;
+
for (; counter >= 0; counter--) {
assign = 0;
max_cnt = 0;
@@ -382,15 +392,9 @@
continue;
if (p->mm->rss <= 0)
continue;
- if (assign) {
- /*
- * If we didn't select a task on pass 1,
- * assign each task a new swap_cnt.
- * Normalise the number of pages swapped
- * by multiplying by (RSS / 1MB)
- */
- p->swap_cnt = AGE_CLUSTER_SIZE(p->mm->rss);
- }
+ /* Refresh swap_cnt? */
+ if (assign)
+ p->swap_cnt = p->mm->rss;
if (p->swap_cnt > max_cnt) {
max_cnt = p->swap_cnt;
pbest = p;
@@ -404,14 +408,13 @@
}
goto out;
}
- pbest->swap_cnt--;

/*
* Nonzero means we cleared out something, but only "1" means
* that we actually free'd up a page as a result.
*/
if (swap_out_process(pbest, gfp_mask) == 1)
- return 1;
+ return 1;
}
out:
return 0;
@@ -451,19 +454,17 @@
/* max one hundreth of a second */
end_time = jiffies + (HZ-1)/100;
do {
- int priority = 5;
+ int priority = 8;
int count = pager_daemon.swap_cluster;

switch (kswapd_state) {
do {
default:
free_memory(shrink_mmap(priority, 0));
+ free_memory(swap_out(priority, 0));
kswapd_state++;
case 1:
free_memory(shm_swap(priority, 0));
- kswapd_state++;
- case 2:
- free_memory(swap_out(priority, 0));
shrink_dcache_memory(priority, 0);
kswapd_state = 0;
} while (--priority >= 0);
@@ -562,7 +563,7 @@

current->flags |= PF_MEMALLOC;

- priority = 5;
+ priority = 8;
do {
free_memory(shrink_mmap(priority, gfp_mask));
free_memory(shm_swap(priority, gfp_mask));
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Linus Torvalds wrote:
>
> On Fri, 1 Jan 1999, Steve Bergman wrote:
> >
> > I got the patch and I must say I'm impressed. I ran my "117 image" test
> > and got these results:
> >
> > 2.1.131-ac11                          172 sec  (This was previously the best)
> > 2.2.0-pre1 + Arcangeli's 1st patch    400 sec
> > test1-pre + Arcangeli's 2nd patch     119 sec  (!)
>
> Would you care to do some more testing? In particular, I'd like to hear
> how basic 2.2.0pre3 works (that's essentially the same as test1-pre, with
> only minor updates)? I'd like to calibrate the numbers against that,
> rather than against kernels that I haven't actually ever run myself.
>
> The other thing I'd like to hear is how pre3 looks with this patch, which
> should behave basically like Andrea's latest patch
Hi Linus,
Andrea sent another patch to correct a problem with i/o bound processes,
which he also posted to linux-kernel. The performance in this test is
unchanged.
Here are the results:
2.1.131-ac11                          172 sec
2.2.0-pre1 + Arcangeli's 1st patch    400 sec
test1-pre + Arcangeli's 2nd patch     119 sec
test1-pre + Arcangeli's 3rd patch     119 sec
test1-pre + Arcangeli's 3rd patch     117 sec
  (changed to priority = 9 in mm/vmscan.c)
2.2.0-pre3                            175 sec
2.2.0-pre3 + Linus's patch            129 sec
RH5.2 Stock (2.0.36-0.7)              280 sec
I noticed while watching 'vmstat 1' during the test that '2.2.0 + Linus's
patch' was not *quite* as smooth as the Arcangeli patches, in that there
were periods of 2 or 3 seconds in which the swap-out rate would fall to
~800k/sec and then jump back up to 1.8-2.5MB/sec. I have only run your
patch once, though. I'll check it further tomorrow to confirm that that is
really the case. Note how much better 2.2 is doing compared to 2.0.36-0.7
in this situation.
I should be available for a good part of this weekend for further testing;
just let me know.
As a reference:
AMD K6-2 300
128MB ram
2GB seagate scsi2 dedicated to swap
Data drive is 6.5GB UDMA
Steve Bergman
steve@netplus.net
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
On Fri, 1 Jan 1999, Linus Torvalds wrote:
> The other thing I'd like to hear is how pre3 looks with this patch, which
> should behave basically like Andrea's latest patch but without the
> obfuscation he put into his patch..
I still think the most important part of all my latest VM patches is my
new do_free_user_and_cache(). It allows the VM to scale much better and to
stay perfectly balanced.
Why run swap_out() `count' times without taking a look at whether the cache
has grown too much?
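That is, rather than letting the old free_memory() macro run swap_out() up
to `count' times back to back, every iteration should recheck the cache
first; as a sketch of the loop (not the literal patch code):

	while (count--) {
		/* cache or buffers grown too much? free from there */
		if (shrink_mmap(priority, gfp_mask))
			continue;
		/* otherwise unmap a user page into the swap cache */
		if (!swap_out(priority, gfp_mask))
			break;	/* no progress at this priority */
	}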
Andrea Arcangeli
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
On Fri, 1 Jan 1999, Linus Torvalds wrote:
> The other thing I'd like to hear is how pre3 looks with this patch, which
> should behave basically like Andrea's latest patch but without the
> obfuscation he put into his patch..
I rediffed my latest swapout stuff against your latest tree (I consider
your latest patch as test1-pre4, right?).
Index: linux/mm/vmscan.c
diff -u linux/mm/vmscan.c:1.1.1.9 linux/mm/vmscan.c:1.1.1.1.2.52
--- linux/mm/vmscan.c:1.1.1.9 Sat Jan 2 15:46:20 1999
+++ linux/mm/vmscan.c Sat Jan 2 15:53:33 1999
@@ -10,6 +10,11 @@
* Version: $Id: vmscan.c,v 1.5 1998/02/23 22:14:28 sct Exp $
*/

+/*
+ * Revisioned the page freeing algorithm: do_free_user_and_cache().
+ * Copyright (C) 1998 Andrea Arcangeli
+ */
+
#include <linux/slab.h>
#include <linux/kernel_stat.h>
#include <linux/swap.h>
@@ -162,8 +167,9 @@
* copy in memory, so we add it to the swap
* cache. */
if (PageSwapCache(page_map)) {
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
- return (atomic_read(&page_map->count) == 0);
+ return entry;
}
add_to_swap_cache(page_map, entry);
/* We checked we were unlocked way up above, and we
@@ -180,8 +186,9 @@
* asynchronously. That's no problem, shrink_mmap() can
* correctly clean up the occassional unshared page
* which gets left behind in the swap cache. */
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
- return 1; /* we slept: the process may not exist any more */
+ return entry; /* we slept: the process may not exist any more */
}

/* The page was _not_ dirty, but still has a zero age. It must
@@ -194,8 +201,9 @@
set_pte(page_table, __pte(entry));
flush_tlb_page(vma, address);
swap_duplicate(entry);
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
- return (atomic_read(&page_map->count) == 0);
+ return entry;
}
/*
* A clean page to be discarded? Must be mmap()ed from
@@ -210,7 +218,7 @@
flush_cache_page(vma, address);
pte_clear(page_table);
flush_tlb_page(vma, address);
- entry = (atomic_read(&page_map->count) == 1);
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
return entry;
}
@@ -381,6 +389,7 @@
counter = nr_tasks;

for (; counter >= 0; counter--) {
+ int retval;
assign = 0;
max_cnt = 0;
pbest = NULL;
@@ -413,8 +422,9 @@
* Nonzero means we cleared out something, but only "1" means
* that we actually free'd up a page as a result.
*/
- if (swap_out_process(pbest, gfp_mask) == 1)
- return 1;
+ retval = swap_out_process(pbest, gfp_mask);
+ if (retval)
+ return retval;
}
out:
return 0;
@@ -441,42 +451,64 @@
printk ("Starting kswapd v%.*s\n", i, s);
}

-#define free_memory(fn) \
- count++; do { if (!--count) goto done; } while (fn)
+static int do_free_user_and_cache(int priority, int gfp_mask)
+{
+ if (shrink_mmap(priority, gfp_mask))
+ return 1;

-static int kswapd_free_pages(int kswapd_state)
+ if (swap_out(priority, gfp_mask))
+ /*
+ * We done at least some swapping progress so return 1 in
+ * this case. -arca
+ */
+ return 1;
+
+ return 0;
+}
+
+static int do_free_page(int * state, int gfp_mask)
{
- unsigned long end_time;
+ int priority = 8;

- /* Always trim SLAB caches when memory gets low. */
- kmem_cache_reap(0);
+ switch (*state) {
+ do {
+ default:
+ if (do_free_user_and_cache(priority, gfp_mask))
+ return 1;
+ *state = 1;
+ case 1:
+ if (shm_swap(priority, gfp_mask))
+ return 1;
+ *state = 2;
+ case 2:
+ shrink_dcache_memory(priority, gfp_mask);
+ kmem_cache_reap(gfp_mask);
+ *state = 0;
+ } while (--priority >= 0);
+ }
+ return 0;
+}

+static int kswapd_free_pages(int kswapd_state)
+{
/* max one hundreth of a second */
- end_time = jiffies + (HZ-1)/100;
- do {
- int priority = 8;
- int count = pager_daemon.swap_cluster;
+ unsigned long end_time = jiffies + (HZ-1)/100;

- switch (kswapd_state) {
- do {
- default:
- free_memory(shrink_mmap(priority, 0));
- free_memory(swap_out(priority, 0));
- kswapd_state++;
- case 1:
- free_memory(shm_swap(priority, 0));
- shrink_dcache_memory(priority, 0);
- kswapd_state = 0;
- } while (--priority >= 0);
- return kswapd_state;
- }
-done:
- if (nr_free_pages > freepages.high + pager_daemon.swap_cluster)
+ do {
+ do_free_page(&kswapd_state, 0);
+ if (nr_free_pages > freepages.high)
break;
} while (time_before_eq(jiffies,end_time));
+ /* take kswapd_state on the stack to save some byte of memory */
return kswapd_state;
}

+static inline void enable_swap_tick(void)
+{
+ timer_table[SWAP_TIMER].expires = jiffies+(HZ+99)/100;
+ timer_active |= 1<<SWAP_TIMER;
+}
+
/*
* The background pageout daemon.
* Started as a kernel thread from the init process.
@@ -524,6 +556,7 @@
current->state = TASK_INTERRUPTIBLE;
flush_signals(current);
run_task_queue(&tq_disk);
+ enable_swap_tick();
schedule();
swapstats.wakeups++;
state = kswapd_free_pages(state);
@@ -543,35 +576,23 @@
* if we need more memory as part of a swap-out effort we
* will just silently return "success" to tell the page
* allocator to accept the allocation.
- *
- * We want to try to free "count" pages, and we need to
- * cluster them so that we get good swap-out behaviour. See
- * the "free_memory()" macro for details.
*/
int try_to_free_pages(unsigned int gfp_mask, int count)
{
- int retval;
-
+ int retval = 1;
lock_kernel();

- /* Always trim SLAB caches when memory gets low. */
- kmem_cache_reap(gfp_mask);
-
- retval = 1;
if (!(current->flags & PF_MEMALLOC)) {
- int priority;
-
current->flags |= PF_MEMALLOC;
-
- priority = 8;
- do {
- free_memory(shrink_mmap(priority, gfp_mask));
- free_memory(shm_swap(priority, gfp_mask));
- free_memory(swap_out(priority, gfp_mask));
- shrink_dcache_memory(priority, gfp_mask);
- } while (--priority >= 0);
- retval = 0;
-done:
+ while (count--)
+ {
+ static int state = 0;
+ if (!do_free_page(&state, gfp_mask))
+ {
+ retval = 0;
+ break;
+ }
+ }
current->flags &= ~PF_MEMALLOC;
}
unlock_kernel();
@@ -594,7 +615,8 @@
if (priority) {
p->counter = p->priority << priority;
wake_up_process(p);
- }
+ } else
+ enable_swap_tick();
}

/*
@@ -632,9 +654,8 @@
want_wakeup = 3;

kswapd_wakeup(p,want_wakeup);
- }
-
- timer_active |= (1<<SWAP_TIMER);
+ } else
+ enable_swap_tick();
}

/*
@@ -643,7 +664,6 @@

void init_swap_timer(void)
{
- timer_table[SWAP_TIMER].expires = jiffies;
timer_table[SWAP_TIMER].fn = swap_tick;
- timer_active |= (1<<SWAP_TIMER);
+ enable_swap_tick();
}
The try_to_swap_out() changes (entry = atomic_read()) are really not
important for performance. We could always return 1 instead of
atomic_read() and treat a retval of 1 from swap_out() like every current
retval > 1. Since I can't see a big performance impact from atomic_read(),
I left it in, because it gives us more information than a plain 1 (which
would only tell us that we successfully unlinked a page from the user
process memory).
I also have a new experimental patch against the one above, which here
improves swapout performance a _lot_. The benchmark that dirtifies 160
Mbyte in a loop used to take near 106 sec and now takes 89 sec. It will
also prevent non-thrashing processes from being swapped out.
I don't consider this production code, though, but I am interested if
somebody will try it ;):
Index: mm//vmscan.c
===================================================================
RCS file: /var/cvs/linux/mm/vmscan.c,v
retrieving revision 1.1.1.1.2.52
diff -u -r1.1.1.1.2.52 vmscan.c
--- vmscan.c 1999/01/02 14:53:33 1.1.1.1.2.52
+++ linux/mm/vmscan.c 1999/01/02 15:19:21
@@ -353,7 +353,6 @@
}

/* We didn't find anything for the process */
- p->swap_cnt = 0;
p->swap_address = 0;
return 0;
}
@@ -423,6 +422,14 @@
* that we actually free'd up a page as a result.
*/
retval = swap_out_process(pbest, gfp_mask);
+ /*
+ * Don't play with other tasks next time if the huge one
+ * is been swapedin in the meantime. This can be considered
+ * a bit experimental, but it seems to improve a lot the
+ * swapout performances here. -arca
+ */
+ p->swap_cnt = p->mm->rss;
+
if (retval)
return retval;
}

Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Linus Torvalds wrote:
>
> Would you care to do some more testing? In particular, I'd like to hear
> how basic 2.2.0pre3 works (that's essentially the same as test1-pre, with
> only minor updates)? I'd like to calibrate the numbers against that,
> rather than against kernels that I haven't actually ever run myself.
>
I've done some more testing, this time including the low memory case.
For low memory testing I built the dhcp server from SRPM in 8MB with X,
xdm, various daemons (sendmail, named, inetd, etc.), and vmstat 1
running. Swap area stayed at about 8MB usage. I have also run the
128MB tests some more and have slightly more accurate results. Here is
the summary:
Kernel                              128MB      8MB
---------------------------------  -------  -------
2.1.131-ac11                        172 sec  260 sec
test1-pre + Arcangeli's patch       119 sec  226 sec
2.2.0-pre3                          175 sec  334 sec
2.2.0-pre3 + Linus's patch          129 sec  312 sec
RH5.2 Stock (2.0.36-0.7)            280 sec  N/A
-Steve
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
On Sat, 2 Jan 1999, Andrea Arcangeli wrote:
> I rediffed my latest swapout stuff against your latest tree (I consider
> your latest patch as test1-pre4, right?).
I developed new exciting stuff this afternoon! The most important thing is
the swapout smart-weight code. Basing the priority on the number of
processes to try to swap out was really ugly and not smart.
The second change is to shrink_mmap(): it now cares much more about aging.
We have only one bit and we must use it carefully so we don't throw things
out of the cache ;)
I also added/removed some PG_referenced bits. But please don't trust the
PG_referenced changes too much, since I have not thought about them very
hard (maybe they are not needed?).
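The single bit gives a second-chance (clock-style) aging; in sketch form
the scan does

	if (test_and_clear_bit(PG_referenced, &page->flags))
		continue;	/* recently referenced: survives this sweep */
	/* unreferenced for a whole sweep: try to free the page */

so only pages that stay unreferenced between two sweeps actually get freed
(the HANDLE_AGING macro in the patch below moves this test right next to
each freeing attempt).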
I returned the minimum for cache and buffers to 5%. This allows me to run
every memory-thrashing proggy I can, for as long as I like, while still
keeping my last commands (free) and filesystem data (ls -l) in cache
(because the thrashing program _only_ plays with its own VM and asks
nothing of the kernel, of course).
Ah, and whoops, in the last patch I made a mistake and forgot to change
max_cnt to unsigned long. This should be changed in your tree too, Linus.
This new patch seems to really rock here and seems _far_ better than
anything I tried before! Steve, could you try it and send feedback?
Thanks ;)
Please excuse me, Linus, if I have not yet cleaned things up, but my spare
time is very short and I want to _try_ to improve things a bit more
first...
This patch is against 2.2.0-pre4 (the latest patch posted by Linus here).
Index: linux/include/linux/mm.h
diff -u linux/include/linux/mm.h:1.1.1.3 linux/include/linux/mm.h:1.1.1.1.2.11
--- linux/include/linux/mm.h:1.1.1.3 Sat Jan 2 15:24:18 1999
+++ linux/include/linux/mm.h Sat Jan 2 21:40:13 1999
@@ -118,7 +118,6 @@
unsigned long offset;
struct page *next_hash;
atomic_t count;
- unsigned int unused;
unsigned long flags; /* atomic flags, some possibly updated asynchronously */
struct wait_queue *wait;
struct page **pprev_hash;
@@ -295,8 +294,7 @@

/* filemap.c */
extern void remove_inode_page(struct page *);
-extern unsigned long page_unuse(struct page *);
-extern int shrink_mmap(int, int);
+extern int FASTCALL(shrink_mmap(int, int));
extern void truncate_inode_pages(struct inode *, unsigned long);
extern unsigned long get_cached_page(struct inode *, unsigned long, int);
extern void put_cached_page(unsigned long);
Index: linux/include/linux/pagemap.h
diff -u linux/include/linux/pagemap.h:1.1.1.1 linux/include/linux/pagemap.h:1.1.1.1.2.1
--- linux/include/linux/pagemap.h:1.1.1.1 Fri Nov 20 00:01:16 1998
+++ linux/include/linux/pagemap.h Sat Jan 2 21:40:13 1999
@@ -77,6 +77,7 @@
*page->pprev_hash = page->next_hash;
page->pprev_hash = NULL;
}
+ clear_bit(PG_referenced, &page->flags);
page_cache_size--;
}

Index: linux/mm/filemap.c
diff -u linux/mm/filemap.c:1.1.1.8 linux/mm/filemap.c:1.1.1.1.2.35
--- linux/mm/filemap.c:1.1.1.8 Fri Jan 1 19:12:53 1999
+++ linux/mm/filemap.c Sat Jan 2 21:40:13 1999
@@ -118,6 +122,10 @@
__free_page(page);
}

+#define HANDLE_AGING(page) \
+ if (test_and_clear_bit(PG_referenced, &(page)->flags)) \
+ continue;
+
int shrink_mmap(int priority, int gfp_mask)
{
static unsigned long clock = 0;
@@ -140,12 +148,11 @@
page = page->next_hash;
clock = page->map_nr;
}
-
- if (test_and_clear_bit(PG_referenced, &page->flags))
- continue;

/* Decrement count only for non-referenced pages */
- count--;
+ if (!test_bit(PG_referenced, &page->flags))
+ count--;
+
if (PageLocked(page))
continue;

@@ -160,6 +167,7 @@
if (page->buffers) {
if (buffer_under_min())
continue;
+ HANDLE_AGING(page);
if (!try_to_free_buffers(page))
continue;
return 1;
@@ -167,12 +175,14 @@

/* is it a swap-cache or page-cache page? */
if (page->inode) {
- if (pgcache_under_min())
- continue;
if (PageSwapCache(page)) {
+ HANDLE_AGING(page);
delete_from_swap_cache(page);
return 1;
}
+ if (pgcache_under_min())
+ continue;
+ HANDLE_AGING(page);
remove_inode_page(page);
return 1;
}
@@ -181,6 +191,8 @@
return 0;
}

+#undef HANDLE_AGING
+
/*
* Update a page cache copy, when we're doing a "write()" system call
* See also "update_vm_cache()".
Index: linux/mm/swap.c
diff -u linux/mm/swap.c:1.1.1.5 linux/mm/swap.c:1.1.1.1.2.8
--- linux/mm/swap.c:1.1.1.5 Sat Jan 2 15:24:40 1999
+++ linux/mm/swap.c Sat Jan 2 21:40:13 1999
@@ -64,13 +64,13 @@
swapstat_t swapstats = {0};

buffer_mem_t buffer_mem = {
- 2, /* minimum percent buffer */
+ 5, /* minimum percent buffer */
10, /* borrow percent buffer */
60 /* maximum percent buffer */
};

buffer_mem_t page_cache = {
- 2, /* minimum percent page cache */
+ 5, /* minimum percent page cache */
15, /* borrow percent page cache */
75 /* maximum */
};
Index: linux/mm/vmscan.c
diff -u linux/mm/vmscan.c:1.1.1.9 linux/mm/vmscan.c:1.1.1.1.2.57
--- linux/mm/vmscan.c:1.1.1.9 Sat Jan 2 15:46:20 1999
+++ linux/mm/vmscan.c Sat Jan 2 21:45:22 1999
@@ -10,6 +10,12 @@
* Version: $Id: vmscan.c,v 1.5 1998/02/23 22:14:28 sct Exp $
*/

+/*
+ * Revisioned the page freeing algorithm (do_free_user_and_cache), and
+ * developed a smart mechanism to handle the swapout weight.
+ * Copyright (C) 1998 Andrea Arcangeli
+ */
+
#include <linux/slab.h>
#include <linux/kernel_stat.h>
#include <linux/swap.h>
@@ -162,8 +168,9 @@
* copy in memory, so we add it to the swap
* cache. */
if (PageSwapCache(page_map)) {
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
- return (atomic_read(&page_map->count) == 0);
+ return entry;
}
add_to_swap_cache(page_map, entry);
/* We checked we were unlocked way up above, and we
@@ -180,8 +187,9 @@
* asynchronously. That's no problem, shrink_mmap() can
* correctly clean up the occassional unshared page
* which gets left behind in the swap cache. */
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
- return 1; /* we slept: the process may not exist any more */
+ return entry; /* we slept: the process may not exist any more */
}

/* The page was _not_ dirty, but still has a zero age. It must
@@ -194,8 +202,9 @@
set_pte(page_table, __pte(entry));
flush_tlb_page(vma, address);
swap_duplicate(entry);
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
- return (atomic_read(&page_map->count) == 0);
+ return entry;
}
/*
* A clean page to be discarded? Must be mmap()ed from
@@ -210,7 +219,7 @@
flush_cache_page(vma, address);
pte_clear(page_table);
flush_tlb_page(vma, address);
- entry = (atomic_read(&page_map->count) == 1);
+ entry = atomic_read(&page_map->count);
__free_page(page_map);
return entry;
}
@@ -230,7 +239,7 @@
*/

static inline int swap_out_pmd(struct task_struct * tsk, struct vm_area_struct * vma,
- pmd_t *dir, unsigned long address, unsigned long end, int gfp_mask)
+ pmd_t *dir, unsigned long address, unsigned long end, int gfp_mask, unsigned long * counter, unsigned long * next_addr)
{
pte_t * pte;
unsigned long pmd_end;
@@ -256,13 +265,19 @@
if (result)
return result;
address += PAGE_SIZE;
+ if (!*counter)
+ {
+ *next_addr = address;
+ return 0;
+ } else
+ (*counter)--;
pte++;
} while (address < end);
return 0;
}

static inline int swap_out_pgd(struct task_struct * tsk, struct vm_area_struct * vma,
- pgd_t *dir, unsigned long address, unsigned long end, int gfp_mask)
+ pgd_t *dir, unsigned long address, unsigned long end, int gfp_mask, unsigned long * counter, unsigned long * next_addr)
{
pmd_t * pmd;
unsigned long pgd_end;
@@ -282,9 +297,11 @@
end = pgd_end;

do {
- int result = swap_out_pmd(tsk, vma, pmd, address, end, gfp_mask);
+ int result = swap_out_pmd(tsk, vma, pmd, address, end, gfp_mask, counter, next_addr);
if (result)
return result;
+ if (!*counter)
+ return 0;
address = (address + PMD_SIZE) & PMD_MASK;
pmd++;
} while (address < end);
@@ -292,7 +309,7 @@
}

static int swap_out_vma(struct task_struct * tsk, struct vm_area_struct * vma,
- unsigned long address, int gfp_mask)
+ unsigned long address, int gfp_mask, unsigned long * counter, unsigned long * next_addr)
{
pgd_t *pgdir;
unsigned long end;
@@ -306,16 +323,19 @@

end = vma->vm_end;
while (address < end) {
- int result = swap_out_pgd(tsk, vma, pgdir, address, end, gfp_mask);
+ int result = swap_out_pgd(tsk, vma, pgdir, address, end, gfp_mask, counter, next_addr);
if (result)
return result;
+ if (!*counter)
+ return 0;
address = (address + PGDIR_SIZE) & PGDIR_MASK;
pgdir++;
}
return 0;
}

-static int swap_out_process(struct task_struct * p, int gfp_mask)
+static int swap_out_process(struct task_struct * p, int gfp_mask,
+ unsigned long * counter)
{
unsigned long address;
struct vm_area_struct* vma;
@@ -334,9 +354,16 @@
address = vma->vm_start;

for (;;) {
- int result = swap_out_vma(p, vma, address, gfp_mask);
+ unsigned long next_addr;
+ int result = swap_out_vma(p, vma, address, gfp_mask,
+ counter, &next_addr);
if (result)
return result;
+ if (!*counter)
+ {
+ p->swap_address = next_addr;
+ return 0;
+ }
vma = vma->vm_next;
if (!vma)
break;
@@ -350,6 +377,19 @@
return 0;
}

+static unsigned long total_rss(void)
+{
+ unsigned long total_rss = 0;
+ struct task_struct * p;
+
+ read_lock(&tasklist_lock);
+ for (p = init_task.next_task; p != &init_task; p = p->next_task)
+ total_rss += p->mm->rss;
+ read_unlock(&tasklist_lock);
+
+ return total_rss;
+}
+
/*
* Select the task with maximal swap_cnt and try to swap out a page.
* N.B. This function returns only 0 or 1. Return values != 1 from
@@ -358,7 +398,10 @@
static int swap_out(unsigned int priority, int gfp_mask)
{
struct task_struct * p, * pbest;
- int counter, assign, max_cnt;
+ int assign;
+ unsigned long max_cnt, counter;
+
+ counter = total_rss() >> priority;

/*
* We make one or two passes through the task list, indexed by
@@ -374,13 +417,8 @@
* Think of swap_cnt as a "shadow rss" - it tells us which process
* we want to page out (always try largest first).
*/
- counter = nr_tasks / (priority+1);
- if (counter < 1)
- counter = 1;
- if (counter > nr_tasks)
- counter = nr_tasks;
-
- for (; counter >= 0; counter--) {
+ while (counter > 0) {
+ int retval;
assign = 0;
max_cnt = 0;
pbest = NULL;
@@ -413,8 +451,9 @@
* Nonzero means we cleared out something, but only "1" means
* that we actually free'd up a page as a result.
*/
- if (swap_out_process(pbest, gfp_mask) == 1)
- return 1;
+ retval = swap_out_process(pbest, gfp_mask, &counter);
+ if (retval)
+ return retval;
}
out:
return 0;
@@ -441,42 +480,63 @@
printk ("Starting kswapd v%.*s\n", i, s);
}

-#define free_memory(fn) \
- count++; do { if (!--count) goto done; } while (fn)
+static int do_free_user_and_cache(int priority, int gfp_mask)
+{
+ if (shrink_mmap(priority, gfp_mask))
+ return 1;

-static int kswapd_free_pages(int kswapd_state)
+ if (swap_out(priority, gfp_mask))
+ /*
+ * We done at least some swapping progress so return 1 in
+ * this case. -arca
+ */
+ return 1;
+
+ return 0;
+}
+
+static int do_free_page(int * state, int gfp_mask)
{
- unsigned long end_time;
+ int priority = 8;

- /* Always trim SLAB caches when memory gets low. */
- kmem_cache_reap(0);
+ switch (*state) {
+ do {
+ default:
+ if (do_free_user_and_cache(priority, gfp_mask))
+ return 1;
+ *state = 1;
+ case 1:
+ if (shm_swap(priority, gfp_mask))
+ return 1;
+ *state = 0;

+ shrink_dcache_memory(priority, gfp_mask);
+ kmem_cache_reap(gfp_mask);
+ } while (--priority >= 0);
+ }
+ return 0;
+}
+
+static int kswapd_free_pages(int kswapd_state)
+{
/* max one hundreth of a second */
- end_time = jiffies + (HZ-1)/100;
- do {
- int priority = 8;
- int count = pager_daemon.swap_cluster;
+ unsigned long end_time = jiffies + (HZ-1)/100;

- switch (kswapd_state) {
- do {
- default:
- free_memory(shrink_mmap(priority, 0));
- free_memory(swap_out(priority, 0));
- kswapd_state++;
- case 1:
- free_memory(shm_swap(priority, 0));
- shrink_dcache_memory(priority, 0);
- kswapd_state = 0;
- } while (--priority >= 0);
- return kswapd_state;
- }
-done:
- if (nr_free_pages > freepages.high + pager_daemon.swap_cluster)
+ do {
+ do_free_page(&kswapd_state, 0);
+ if (nr_free_pages > freepages.high)
break;
} while (time_before_eq(jiffies,end_time));
+ /* take kswapd_state on the stack to save some byte of memory */
return kswapd_state;
}

+static inline void enable_swap_tick(void)
+{
+ timer_table[SWAP_TIMER].expires = jiffies+(HZ+99)/100;
+ timer_active |= 1<<SWAP_TIMER;
+}
+
/*
* The background pageout daemon.
* Started as a kernel thread from the init process.
@@ -524,6 +584,7 @@
current->state = TASK_INTERRUPTIBLE;
flush_signals(current);
run_task_queue(&tq_disk);
+ enable_swap_tick();
schedule();
swapstats.wakeups++;
state = kswapd_free_pages(state);
@@ -543,35 +604,23 @@
* if we need more memory as part of a swap-out effort we
* will just silently return "success" to tell the page
* allocator to accept the allocation.
- *
- * We want to try to free "count" pages, and we need to
- * cluster them so that we get good swap-out behaviour. See
- * the "free_memory()" macro for details.
*/
int try_to_free_pages(unsigned int gfp_mask, int count)
{
- int retval;
-
+ int retval = 1;
lock_kernel();

- /* Always trim SLAB caches when memory gets low. */
- kmem_cache_reap(gfp_mask);
-
- retval = 1;
if (!(current->flags & PF_MEMALLOC)) {
- int priority;
-
current->flags |= PF_MEMALLOC;
-
- priority = 8;
- do {
- free_memory(shrink_mmap(priority, gfp_mask));
- free_memory(shm_swap(priority, gfp_mask));
- free_memory(swap_out(priority, gfp_mask));
- shrink_dcache_memory(priority, gfp_mask);
- } while (--priority >= 0);
- retval = 0;
-done:
+ while (count--)
+ {
+ static int state = 0;
+ if (!do_free_page(&state, gfp_mask))
+ {
+ retval = 0;
+ break;
+ }
+ }
current->flags &= ~PF_MEMALLOC;
}
unlock_kernel();
@@ -594,7 +643,8 @@
if (priority) {
p->counter = p->priority << priority;
wake_up_process(p);
- }
+ } else
+ enable_swap_tick();
}

/*
@@ -632,9 +682,8 @@
want_wakeup = 3;

kswapd_wakeup(p,want_wakeup);
- }
-
- timer_active |= (1<<SWAP_TIMER);
+ } else
+ enable_swap_tick();
}

/*
@@ -643,7 +692,6 @@

void init_swap_timer(void)
{
- timer_table[SWAP_TIMER].expires = jiffies;
timer_table[SWAP_TIMER].fn = swap_tick;
- timer_active |= (1<<SWAP_TIMER);
+ enable_swap_tick();
}
Index: linux/fs/buffer.c
diff -u linux/fs/buffer.c:1.1.1.5 linux/fs/buffer.c:1.1.1.1.2.6
--- linux/fs/buffer.c:1.1.1.5 Fri Jan 1 19:10:20 1999
+++ linux/fs/buffer.c Sat Jan 2 21:40:07 1999
@@ -1263,6 +1263,7 @@
panic("brw_page: page not locked for I/O");
clear_bit(PG_uptodate, &page->flags);
clear_bit(PG_error, &page->flags);
+ set_bit(PG_referenced, &page->flags);
/*
* Allocate async buffer heads pointing to this page, just for I/O.
* They do _not_ show up in the buffer hash table!
Andrea Arcangeli
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
On Sat, 2 Jan 1999, Andrea Arcangeli wrote:
> is the swapout smart weight code. Basing the priority on the number of
> processes to try to swap out was really ugly and not smart.
But I made two mistakes in it. Benjamin pointed out within a millisecond
that there was no need to put the address on the stack, and looking a
_bit_ more at swap_out_pmd() I noticed that the old code was already
updating swap_address, whoops ;).
I noticed the second, much more important mistake while running with 8MB
of RAM, because the memory-trashing test program was segfaulting. The bug
was basing the maximal weight of swap_out() on total_rss rather than on
the sum of the total_vm of all processes. With 8MB all my processes got
swapped out, so swap_out() stopped working ;). It's fixed now...
> The second change is to shrink_mmap(); this will cause
> shrink_mmap() to care much more about aging. We have only one bit and we
> must use it carefully so we don't throw pages out of the cache ;)
This change was pretty buggy too. The only good part was not applying the
pgcache min limits before shrinking the _swap_cache_. Now I have also
changed pgcache_under_min to ignore the swap-cache size (the swap cache is
now a bit more fast-changing/volatile).
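For reference, a minimal sketch of what the pgcache_under_min change does
(the page counts here are invented; the real macro in mm.h now subtracts
swapper_inode.i_nrpages, as in the diff below):

#include <stdio.h>

/* Hypothetical figures, in pages: a 64MB box (16384 4KB pages) whose
 * page cache is dominated by swap-cache pages after heavy swapout. */
#define NUM_PHYSPAGES	16384UL
#define MIN_PERCENT	5UL

int main(void)
{
	unsigned long page_cache_size = 900;	/* all page-cache pages */
	unsigned long swapcache_pages = 700;	/* of those, swap cache */

	/* Old test: swap-cache pages inflate page_cache_size, so the
	 * cache looks big enough and shrink_mmap() keeps shrinking it
	 * although only 200 real cache pages (~1.2%) remain. */
	int old_under_min = page_cache_size * 100
			  < MIN_PERCENT * NUM_PHYSPAGES;

	/* New test: the swap cache is excluded, the genuine page cache
	 * is seen to be under min_percent and is left alone. */
	int new_under_min = (page_cache_size - swapcache_pages) * 100
			  < MIN_PERCENT * NUM_PHYSPAGES;

	printf("old pgcache_under_min() = %d\n", old_under_min);	/* 0 */
	printf("new pgcache_under_min() = %d\n", new_under_min);	/* 1 */
	return 0;
}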
> I also added/removed some PG_referenced bits. But please, don't trust the
> PG_referenced changes too much since I have not thought about them much
> (maybe they are not needed?).
Hmm, I guess at least the brw_page set_bit was not needed, because before
that function runs, either a __find_page() or an add_to_...cache() has
already run.
> Ah, and whoops, in the last patch I made a mistake and forgot to change
> max_cnt to unsigned long. This should also be changed in your tree, Linus.
Some other counters should also be moved from int to unsigned long to
handle huge RAM sizes.
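A toy illustration of that concern (the page count is hypothetical and far
beyond any 1999 machine; it only shows the failure mode): the old
shrink_mmap() stored (limit << 1) >> priority in an int, and once
limit << 1 exceeds INT_MAX the int count wraps negative on typical
hardware while an unsigned long keeps the real value.

#include <stdio.h>

int main(void)
{
	unsigned long limit = 0x60000000UL;	/* hypothetical num_physpages */

	int bad_count            = (int)(limit << 1);	/* wraps negative */
	unsigned long good_count = limit << 1;

	printf("int count:           %d\n", bad_count);		/* -1073741824 */
	printf("unsigned long count: %lu\n", good_count);	/* 3221225472  */
	return 0;
}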
> This new patch really seems to rock here and seems _far_ better than
> anything I tried before! Steve, could you try it and send feedback? Thanks ;)
Here is Steve's feedback:
                        128MB      8MB
                      -------  -------
Your previous patch:  132 sec  218 sec
This patch:           118 sec  226 sec
Even though `This patch' was pretty buggy (as pointed out above), it was
going slightly _faster_ at 128MB. I guess the reason for the 8MB slowdown
was the s/rss/total_vm/ issue (but I am not 100% sure).
I fixed the bugs, so I am reposting the fixed diff against pre4. I also
cleaned up a few things...
Index: linux/include/linux/mm.h
diff -u linux/include/linux/mm.h:1.1.1.3 linux/include/linux/mm.h:1.1.1.1.2.12
--- linux/include/linux/mm.h:1.1.1.3 Sat Jan 2 15:24:18 1999
+++ linux/include/linux/mm.h Sun Jan 3 03:43:52 1999
@@ -118,7 +118,6 @@
unsigned long offset;
struct page *next_hash;
atomic_t count;
- unsigned int unused;
unsigned long flags; /* atomic flags, some possibly updated asynchronously */
struct wait_queue *wait;
struct page **pprev_hash;
@@ -295,8 +294,7 @@

/* filemap.c */
extern void remove_inode_page(struct page *);
-extern unsigned long page_unuse(struct page *);
-extern int shrink_mmap(int, int);
+extern int FASTCALL(shrink_mmap(int, int));
extern void truncate_inode_pages(struct inode *, unsigned long);
extern unsigned long get_cached_page(struct inode *, unsigned long, int);
extern void put_cached_page(unsigned long);
@@ -379,8 +377,8 @@

#define buffer_under_min() ((buffermem >> PAGE_SHIFT) * 100 < \
buffer_mem.min_percent * num_physpages)
-#define pgcache_under_min() (page_cache_size * 100 < \
- page_cache.min_percent * num_physpages)
+#define pgcache_under_min() ((page_cache_size-swapper_inode.i_nrpages)*100\
+ < page_cache.min_percent * num_physpages)

#endif /* __KERNEL__ */

Index: linux/include/linux/pagemap.h
diff -u linux/include/linux/pagemap.h:1.1.1.1 linux/include/linux/pagemap.h:1.1.1.1.2.1
--- linux/include/linux/pagemap.h:1.1.1.1 Fri Nov 20 00:01:16 1998
+++ linux/include/linux/pagemap.h Sat Jan 2 21:40:13 1999
@@ -77,6 +77,7 @@
*page->pprev_hash = page->next_hash;
page->pprev_hash = NULL;
}
+ clear_bit(PG_referenced, &page->flags);
page_cache_size--;
}

Index: linux/mm/filemap.c
diff -u linux/mm/filemap.c:1.1.1.8 linux/mm/filemap.c:1.1.1.1.2.36
--- linux/mm/filemap.c:1.1.1.8 Fri Jan 1 19:12:53 1999
+++ linux/mm/filemap.c Sun Jan 3 03:13:09 1999
@@ -122,13 +126,14 @@
{
static unsigned long clock = 0;
unsigned long limit = num_physpages;
+ unsigned long count;
struct page * page;
- int count;

count = limit >> priority;

page = mem_map + clock;
- do {
+ while (count != 0)
+ {
page++;
clock++;
if (clock >= max_mapnr) {
@@ -167,17 +172,17 @@

/* is it a swap-cache or page-cache page? */
if (page->inode) {
- if (pgcache_under_min())
- continue;
if (PageSwapCache(page)) {
delete_from_swap_cache(page);
return 1;
}
+ if (pgcache_under_min())
+ continue;
remove_inode_page(page);
return 1;
}

- } while (count > 0);
+ }
return 0;
}

Index: linux/mm/swap.c
diff -u linux/mm/swap.c:1.1.1.5 linux/mm/swap.c:1.1.1.1.2.8
--- linux/mm/swap.c:1.1.1.5 Sat Jan 2 15:24:40 1999
+++ linux/mm/swap.c Sat Jan 2 21:40:13 1999
@@ -64,13 +64,13 @@
swapstat_t swapstats = {0};

buffer_mem_t buffer_mem = {
- 2, /* minimum percent buffer */
+ 5, /* minimum percent buffer */
10, /* borrow percent buffer */
60 /* maximum percent buffer */
};

buffer_mem_t page_cache = {
- 2, /* minimum percent page cache */
+ 5, /* minimum percent page cache */
15, /* borrow percent page cache */
75 /* maximum */
};
Index: linux/mm/vmscan.c
diff -u linux/mm/vmscan.c:1.1.1.9 linux/mm/vmscan.c:1.1.1.1.2.59
--- linux/mm/vmscan.c:1.1.1.9 Sat Jan 2 15:46:20 1999
+++ linux/mm/vmscan.c Sun Jan 3 03:43:54 1999
@@ -10,6 +10,12 @@
* Version: $Id: vmscan.c,v 1.5 1998/02/23 22:14:28 sct Exp $
*/

+/*
+ * Revisioned the page freeing algorithm (do_free_user_and_cache), and
+ * developed a smart mechanism to handle the swapout weight.
+ * Copyright (C) 1998 Andrea Arcangeli
+ */
+
#include <linux/slab.h>
#include <linux/kernel_stat.h>
#include <linux/swap.h>
@@ -163,7 +169,7 @@
* cache. */
if (PageSwapCache(page_map)) {
__free_page(page_map);
- return (atomic_read(&page_map->count) == 0);
+ return 1;
}
add_to_swap_cache(page_map, entry);
/* We checked we were unlocked way up above, and we
@@ -195,7 +201,7 @@
flush_tlb_page(vma, address);
swap_duplicate(entry);
__free_page(page_map);
- return (atomic_read(&page_map->count) == 0);
+ return 1;
}
/*
* A clean page to be discarded? Must be mmap()ed from
@@ -210,9 +216,8 @@
flush_cache_page(vma, address);
pte_clear(page_table);
flush_tlb_page(vma, address);
- entry = (atomic_read(&page_map->count) == 1);
__free_page(page_map);
- return entry;
+ return 1;
}

/*
@@ -230,7 +235,7 @@
*/

static inline int swap_out_pmd(struct task_struct * tsk, struct vm_area_struct * vma,
- pmd_t *dir, unsigned long address, unsigned long end, int gfp_mask)
+ pmd_t *dir, unsigned long address, unsigned long end, int gfp_mask, unsigned long * counter)
{
pte_t * pte;
unsigned long pmd_end;
@@ -251,18 +256,20 @@

do {
int result;
- tsk->swap_address = address + PAGE_SIZE;
result = try_to_swap_out(tsk, vma, address, pte, gfp_mask);
+ address += PAGE_SIZE;
+ tsk->swap_address = address;
if (result)
return result;
- address += PAGE_SIZE;
+ if (!--*counter)
+ return 0;
pte++;
} while (address < end);
return 0;
}

static inline int swap_out_pgd(struct task_struct * tsk, struct vm_area_struct * vma,
- pgd_t *dir, unsigned long address, unsigned long end, int gfp_mask)
+ pgd_t *dir, unsigned long address, unsigned long end, int gfp_mask, unsigned long * counter)
{
pmd_t * pmd;
unsigned long pgd_end;
@@ -282,9 +289,11 @@
end = pgd_end;

do {
- int result = swap_out_pmd(tsk, vma, pmd, address, end, gfp_mask);
+ int result = swap_out_pmd(tsk, vma, pmd, address, end, gfp_mask, counter);
if (result)
return result;
+ if (!*counter)
+ return 0;
address = (address + PMD_SIZE) & PMD_MASK;
pmd++;
} while (address < end);
@@ -292,7 +301,7 @@
}

static int swap_out_vma(struct task_struct * tsk, struct vm_area_struct * vma,
- unsigned long address, int gfp_mask)
+ unsigned long address, int gfp_mask, unsigned long * counter)
{
pgd_t *pgdir;
unsigned long end;
@@ -306,16 +315,19 @@

end = vma->vm_end;
while (address < end) {
- int result = swap_out_pgd(tsk, vma, pgdir, address, end, gfp_mask);
+ int result = swap_out_pgd(tsk, vma, pgdir, address, end, gfp_mask, counter);
if (result)
return result;
+ if (!*counter)
+ return 0;
address = (address + PGDIR_SIZE) & PGDIR_MASK;
pgdir++;
}
return 0;
}

-static int swap_out_process(struct task_struct * p, int gfp_mask)
+static int swap_out_process(struct task_struct * p, int gfp_mask,
+ unsigned long * counter)
{
unsigned long address;
struct vm_area_struct* vma;
@@ -334,9 +346,12 @@
address = vma->vm_start;

for (;;) {
- int result = swap_out_vma(p, vma, address, gfp_mask);
+ int result = swap_out_vma(p, vma, address, gfp_mask,
+ counter);
if (result)
return result;
+ if (!*counter)
+ return 0;
vma = vma->vm_next;
if (!vma)
break;
@@ -350,6 +365,19 @@
return 0;
}

+static unsigned long get_total_vm(void)
+{
+ unsigned long total_vm = 0;
+ struct task_struct * p;
+
+ read_lock(&tasklist_lock);
+ for_each_task(p)
+ total_vm += p->mm->total_vm;
+ read_unlock(&tasklist_lock);
+
+ return total_vm;
+}
+
/*
* Select the task with maximal swap_cnt and try to swap out a page.
* N.B. This function returns only 0 or 1. Return values != 1 from
@@ -358,8 +386,11 @@
static int swap_out(unsigned int priority, int gfp_mask)
{
struct task_struct * p, * pbest;
- int counter, assign, max_cnt;
+ int assign;
+ unsigned long counter, max_cnt;

+ counter = get_total_vm() >> priority;
+
/*
* We make one or two passes through the task list, indexed by
* assign = {0, 1}:
@@ -374,20 +405,14 @@
* Think of swap_cnt as a "shadow rss" - it tells us which process
* we want to page out (always try largest first).
*/
- counter = nr_tasks / (priority+1);
- if (counter < 1)
- counter = 1;
- if (counter > nr_tasks)
- counter = nr_tasks;
-
- for (; counter >= 0; counter--) {
+ while (counter > 0) {
assign = 0;
max_cnt = 0;
pbest = NULL;
select:
read_lock(&tasklist_lock);
- p = init_task.next_task;
- for (; p != &init_task; p = p->next_task) {
+ for_each_task(p)
+ {
if (!p->swappable)
continue;
if (p->mm->rss <= 0)
@@ -410,10 +435,11 @@
}

/*
- * Nonzero means we cleared out something, but only "1" means
- * that we actually free'd up a page as a result.
+ * Nonzero means we cleared out something, and "1" means
+ * that we actually moved a page from the process memory
+ * to the swap cache (it's not been freed yet).
*/
- if (swap_out_process(pbest, gfp_mask) == 1)
+ if (swap_out_process(pbest, gfp_mask, &counter))
return 1;
}
out:
@@ -441,42 +467,63 @@
printk ("Starting kswapd v%.*s\n", i, s);
}

-#define free_memory(fn) \
- count++; do { if (!--count) goto done; } while (fn)
+static int do_free_user_and_cache(int priority, int gfp_mask)
+{
+ if (shrink_mmap(priority, gfp_mask))
+ return 1;

-static int kswapd_free_pages(int kswapd_state)
+ if (swap_out(priority, gfp_mask))
+ /*
+ * We done at least some swapping progress so return 1 in
+ * this case. -arca
+ */
+ return 1;
+
+ return 0;
+}
+
+static int do_free_page(int * state, int gfp_mask)
{
- unsigned long end_time;
+ int priority = 8;

- /* Always trim SLAB caches when memory gets low. */
- kmem_cache_reap(0);
+ switch (*state) {
+ do {
+ default:
+ if (do_free_user_and_cache(priority, gfp_mask))
+ return 1;
+ *state = 1;
+ case 1:
+ if (shm_swap(priority, gfp_mask))
+ return 1;
+ *state = 0;
+
+ shrink_dcache_memory(priority, gfp_mask);
+ kmem_cache_reap(gfp_mask);
+ } while (--priority >= 0);
+ }
+ return 0;
+}

+static int kswapd_free_pages(int kswapd_state)
+{
/* max one hundreth of a second */
- end_time = jiffies + (HZ-1)/100;
- do {
- int priority = 8;
- int count = pager_daemon.swap_cluster;
+ unsigned long end_time = jiffies + (HZ-1)/100;

- switch (kswapd_state) {
- do {
- default:
- free_memory(shrink_mmap(priority, 0));
- free_memory(swap_out(priority, 0));
- kswapd_state++;
- case 1:
- free_memory(shm_swap(priority, 0));
- shrink_dcache_memory(priority, 0);
- kswapd_state = 0;
- } while (--priority >= 0);
- return kswapd_state;
- }
-done:
- if (nr_free_pages > freepages.high + pager_daemon.swap_cluster)
+ do {
+ do_free_page(&kswapd_state, 0);
+ if (nr_free_pages > freepages.high)
break;
} while (time_before_eq(jiffies,end_time));
+ /* take kswapd_state on the stack to save some byte of memory */
return kswapd_state;
}

+static inline void enable_swap_tick(void)
+{
+ timer_table[SWAP_TIMER].expires = jiffies+(HZ+99)/100;
+ timer_active |= 1<<SWAP_TIMER;
+}
+
/*
* The background pageout daemon.
* Started as a kernel thread from the init process.
@@ -524,6 +571,7 @@
current->state = TASK_INTERRUPTIBLE;
flush_signals(current);
run_task_queue(&tq_disk);
+ enable_swap_tick();
schedule();
swapstats.wakeups++;
state = kswapd_free_pages(state);
@@ -543,35 +591,23 @@
* if we need more memory as part of a swap-out effort we
* will just silently return "success" to tell the page
* allocator to accept the allocation.
- *
- * We want to try to free "count" pages, and we need to
- * cluster them so that we get good swap-out behaviour. See
- * the "free_memory()" macro for details.
*/
int try_to_free_pages(unsigned int gfp_mask, int count)
{
- int retval;
-
+ int retval = 1;
lock_kernel();

- /* Always trim SLAB caches when memory gets low. */
- kmem_cache_reap(gfp_mask);
-
- retval = 1;
if (!(current->flags & PF_MEMALLOC)) {
- int priority;
-
current->flags |= PF_MEMALLOC;
-
- priority = 8;
- do {
- free_memory(shrink_mmap(priority, gfp_mask));
- free_memory(shm_swap(priority, gfp_mask));
- free_memory(swap_out(priority, gfp_mask));
- shrink_dcache_memory(priority, gfp_mask);
- } while (--priority >= 0);
- retval = 0;
-done:
+ while (count--)
+ {
+ static int state = 0;
+ if (!do_free_page(&state, gfp_mask))
+ {
+ retval = 0;
+ break;
+ }
+ }
current->flags &= ~PF_MEMALLOC;
}
unlock_kernel();
@@ -594,7 +630,8 @@
if (priority) {
p->counter = p->priority << priority;
wake_up_process(p);
- }
+ } else
+ enable_swap_tick();
}

/*
@@ -632,9 +669,8 @@
want_wakeup = 3;

kswapd_wakeup(p,want_wakeup);
- }
-
- timer_active |= (1<<SWAP_TIMER);
+ } else
+ enable_swap_tick();
}

/*
@@ -643,7 +679,6 @@

void init_swap_timer(void)
{
- timer_table[SWAP_TIMER].expires = jiffies;
timer_table[SWAP_TIMER].fn = swap_tick;
- timer_active |= (1<<SWAP_TIMER);
+ enable_swap_tick();
}
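One construct in the patch that may look odd is do_free_page()'s switch
that jumps into the middle of a do/while. It is a resumable state machine:
*state records which freeing pass last made progress, so the next call
jumps straight back to it instead of always restarting from shrink_mmap().
Here is a minimal user-space sketch of the same control flow (pass_a,
pass_b and their success pattern are made up purely for illustration):

#include <stdio.h>

static int a_budget = 1;	/* pass A succeeds once, then keeps failing */

static int pass_a(int priority)
{
	(void)priority;
	return a_budget-- > 0;
}

static int pass_b(int priority)
{
	return priority == 7;	/* pass B succeeds only at priority 7 */
}

/* Same shape as do_free_page(): the switch jumps into the loop body,
 * so a caller resumes at the pass recorded in *state. */
static int do_work(int *state)
{
	int priority = 8;

	switch (*state) {
	do {
	default:
		if (pass_a(priority))
			return 1;	/* progress; next call retries pass A */
		*state = 1;
	case 1:
		if (pass_b(priority))
			return 1;	/* progress; next call resumes at pass B */
		*state = 0;
	} while (--priority >= 0);
	}
	return 0;	/* no pass made progress at any priority */
}

int main(void)
{
	int state = 0;
	int i;

	for (i = 0; i < 3; i++)
		printf("call %d -> progress=%d, state=%d\n",
		       i, do_work(&state), state);
	return 0;
}

The third call enters the switch with state == 1 and lands directly on
case 1, which is how a kswapd pass resumes where the previous one left
off.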
As usual, if you, Steve, or anyone else try this, I am interested in the
numbers ;). Thanks.
Andrea Arcangeli
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Hi Andrea,
OK, I tested your latest VM and I'm impressed :) I have a 64MB PPro
box, and I tested a kernel compile. This isn't the BEST test, because
it doesn't stress the VM much. So, I got almost exactly the same times
(one or two seconds shorter out of 9m25s) as with Linus's test1-pre4.
HOWEVER! It used NO swap! Here is the result of 'free' after I
compiled:
             total       used       free     shared    buffers     cached
Mem:         63548      52600      10948      31772       6176      24868
-/+ buffers/cache:      21556      41992
Swap:        34236          0      34236
test1-pre4, on the other hand, had used 10 MB of swap by this point.
BTW, I used these settings in both cases:
telomere:~> cat /proc/sys/vm/pagecache
1 30 75
telomere:~> cat /proc/sys/vm/buffermem
1 20 60
I am going to test the actual pre4 now.
Oh, and I have a PPro box, libc6, EIDE (4MB/s media read/write time).
-BenRI
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Hi Andrea,
My pet VM benchmark is the compilation of a set of about 50 C++
files which regularly grows the EGCS compiler's VM size (as shown
by 'top') to 75-90 MB. I only have 64MB of RAM, so it swaps a lot.
Here are the times (as measured by the 'time' command) for the
compilation of this suite of files (using 'make' and EGCS 1.0.1)
with 2.2.0pre4 and 2.2.0pre4 with your latest VM patch:
TMS Compile with 2.2.0pre4
589.830u 68.830s 18:09.88 60.4% 0+0k 0+0io 188062pf+260255w
TMS Compile with 2.2.0pre4 and Andrea's latest patch
597.840u 71.030s 21:59.36 50.6% 0+0k 0+0io 298514pf+237324w
                 ^^^^^^^^                  ^^^^^^
Note the wall-clock time increases from 18 minutes to almost
22 minutes and the number of page faults increases from 188,000
to 298,500. It seems something is invalidating pages too aggressively
in your patch.
Is there something I can tune to improve this? Is there an experiment
I can run to help fine-tune your VM changes?
-Ben McCann
--
Ben McCann Indus River Networks
31 Nagog Park
Acton, MA, 01720
email: bmccann@indusriver.com web: www.indusriver.com
phone: (978) 266-8140 fax: (978) 266-8111
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
Ben McCann writes:
> Note the wall-clock time increases from 18 minutes to almost
> 22 minutes and the number of page faults increases from 188,000
> to 298,500. It seems something is invalidating pages too
> aggressively in your patch.
>
> Is there something I can tune to improve this? Is there an
> experiment I can run to help fine-tune your VM changes?
For the past few weeks, VM patches and benchmark results have been
going back and forth. I start to worry that Linux is being tuned
to run benchmarks, not to run useful software. Please be careful.
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
>>>>> "Albert" == Albert D Cahalan <acahalan@cs.uml.edu> writes:
Albert> For the past few weeks, VM patches and benchmark results have
Albert> been going back and forth. I start to worry that Linux is
Albert> being tuned to run benchmarks, not to run useful
Albert> software. Please be careful.
Yes, just like one of Stephen Tweedie's benchmarks was to time the
compile of some program running over NFS .... that's of course not a
real-world thing.
Jes
Re: [patch] new-vm improvement [Re: 2.2.0 Bug summary]
On Wed, 6 Jan 1999, Albert D. Cahalan wrote:
>
> Ben McCann writes:
>
> > Note the wall-clock time increases from 18 minutes to almost
> > 22 minutes and the number of page faults increases from 188,000
> > to 298,500. It seems something is invalidating pages too
> > aggressively in your patch.
> >
> > Is there something I can tune to improve this? Is there an
> > experiment I can run to help fine-tune your VM changes?
>
> For the past few weeks, VM patches and benchmark results have been
> going back and forth. I start to worry that Linux is being tuned
> to run benchmarks, not to run useful software. Please be careful.
The best insurance against this concern is to test alpha patches against
your favorite useful application and report the results to the developers.
-Mike