Kernel Traffic #16 For 29 Apr 1999

By Zack Brown

Table Of Contents

Introduction

For those of you who noticed that the Quotes (quotes.html) pages haven't been updated since issue 13: good news! They're finally up to date.

Mailing List Stats For This Week

We looked at 907 posts in 3440K.

There were 364 different contributors. 163 posted more than once. 132 posted last week too.

The top posters of the week were:

1. Linux Takes A Performance Hit To Compensate For Bug In MacOS

9 Apr 1999 - 13 Apr 1999 (8 posts) Archive Link: "TCP push sometimes missing under 2.2.5?"

Topics: Networking

People: Andi Kleen, Jens-Uwe Mager, David S. Miller, Chris Wedgwood

Once again, Linux must sacrifice performance because of the closed development of other systems.

Jens-Uwe Mager noticed some TCP performance problems under the latest kernels that were not there in 2.0.36 (the problem only cropped up with clients running MacOS 8.5.1 with OpenTransport, because of a bug in that system). Andi Kleen replied, "PSH is not set when a part of the write exceeds the usable window. This OpenTransport behaviour is really stupid. Does this small patch fix it? It adds some cycles to a fast path, but I see no better way."

Jens-Uwe gratefully replied that the patch worked like a charm; but Chris Wedgwood didn't like the fact that it slowed down a fast code-path; and felt that Apple should just fix their bug. Jens-Uwe asked, "It is just one more variable and the assignment and one more test for this variable, is that so bad?"

David S. Miller explained, "It's not going to stop us from putting in the fix, but the issue is that register pressure is extremely high in that routine, and this change causes many more local variables to live on the stack instead of in hard registers."
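The behaviour at issue can be illustrated outside the kernel. The sketch below is not kernel code; it is a toy model of the situation Andi Kleen describes: a sender normally sets PSH on the last segment of a write, but when the usable window is smaller than the write, the segment that exhausts the window is not the logical end of the write, so no segment carries PSH, and a receiver that (incorrectly) waits for PSH, like the buggy OpenTransport stack, stalls. The `push_on_window_limit` flag models the workaround.

```python
# Toy model (not kernel code) of PSH behaviour when a write exceeds
# the usable TCP window.  MSS and the function name are illustrative.
MSS = 1460

def segments(write_len, usable_window, push_on_window_limit=False):
    """Yield (segment_length, psh) pairs for one write()."""
    sendable = min(write_len, usable_window)
    sent = 0
    segs = []
    while sent < sendable:
        seg = min(MSS, sendable - sent)
        sent += seg
        # PSH normally marks the end of the application write ...
        psh = (sent == write_len)
        # ... the workaround also pushes when the window cut us short.
        if push_on_window_limit and sent == sendable:
            psh = psh or (sendable < write_len)
        segs.append((seg, psh))
    return segs

# Window smaller than the write: without the fix, no segment gets PSH.
print(segments(8192, 4096))
print(segments(8192, 4096, push_on_window_limit=True))
```

With an 8192-byte write into a 4096-byte window, the unpatched model never sets PSH; the patched one pushes on the window-limited final segment, which is the extra test-and-assignment on the fast path that the thread argues about.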

2. FAT Fixes

10 Apr 1999 - 21 Apr 1999 (5 posts) Archive Link: "Re: [CFT] rename patch + FAT stuff"

Topics: FS: FAT, FS: NFS, FS: UMSDOS, FS: VFAT

People: Alexander Viro

Alexander Viro was the only contributor to this thread. He posted a new FAT patch (ftp://ftp.math.psu.edu/pub/viro/), not intended for inclusion in the main kernel tree. It fixed a bug that could corrupt the whole filesystem and closed a few race conditions, but still had lots of problems. He said to consider it ALPHA code.

Two days later he had a new patch, which he dubbed early BETA. It still didn't have UMSDOS or NFS nailed down, and MSDOS was likely to be less stable than VFAT.

Four days later he had a BETA patch, now working with UMSDOS (barely), but with a lot of stuff left to do.

After another four days he posted a new patch, presumably BETA. Apparently it worked for him, but was still likely to have bugs.

The next day he posted a new patch (the last for a while, apparently), BETA, with some fixes, but still without the real UMSDOS cleanup he wanted to do.

Throughout the thread, he exhorted folks to test it out and help squash bugs.

3. Scheduling Optimization Attempt

11 Apr 1999 - 14 Apr 1999 (8 posts) Archive Link: "[PATCH] small scheduling optimization"

Topics: Modems, Scheduler

People: Rik van Riel, Ingo Molnar, Richard Gooch, Kurt Garloff

Rik van Riel posted a patch (http://www.nl.linux.org/~riel/patches/) for a small optimization to the code that called the scheduling code. He summed up the problem as, "with a program like rc5des running (nice +19) in the background and NOTHING in the foreground Linux still goes through the scheduler 50 times a second!" He added, "This small patch solves that problem by simply not rescheduling if there's nothing else to be run."

Kurt Garloff reported success, but started having another problem after applying Rik's patch: his modem tools started misbehaving. Apparently there was some private email on that, and Rik replied, "I implemented the fix Pascal Dupuis suggested and have put the new patch online: http://www.nl.linux.org/~riel/patches/schedule-bigpatch-2.2.5-2"

Meanwhile, Ingo Molnar said, "hm, it should only go 5 times a second into the scheduler if this is the only process running. Are you sure that it's 50 times a second? With 5 times a second and 5 usecs per schedule(), it's 0.000025 seconds per 1.0 second overhead, acceptable i think."

Rik replied, "A nice +19 process only has a timeslice of 10 (20 if it's alone) milliseconds. That will give 50 reschedules" per second; but Richard Gooch corrected him with, "a lone process will get a timeslice of 20 *ticks*, not 20 milliseconds. So that's 200 millisecond timeslices, which is 5 times per second."
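Richard Gooch's correction is easy to check with back-of-the-envelope arithmetic. On x86 in the 2.2.x kernels, HZ is 100, so a scheduler tick is 10 ms:

```python
# Back-of-the-envelope check of Richard Gooch's correction.
HZ = 100                      # x86 in 2.2.x
tick_ms = 1000 // HZ          # 10 ms per tick

timeslice_ticks = 20          # a lone nice +19 process, per the thread
timeslice_ms = timeslice_ticks * tick_ms

reschedules_per_sec = 1000 // timeslice_ms
print(timeslice_ms, reschedules_per_sec)   # 200 ms slices, 5 per second

# Rik's figure of 50/sec comes from reading the timeslice as 20 ms:
print(1000 // 20)
```

Reading the timeslice as 20 *milliseconds* rather than 20 *ticks* is exactly the factor-of-ten disagreement between Rik's 50/sec and Ingo's 5/sec.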

The thread ended here; presumably Rik's fix was not broken, just not as much of an improvement as he'd first thought.

4. Swap Files Vs. Swap Partitions

13 Apr 1999 - 20 Apr 1999 (24 posts) Archive Link: "Static Swap"

Topics: Big Memory Support, FS

People: Riley Williams, Mike A. Harris, Stephen C. Tweedie, David Lang, Brandon S. Allbery, Steffen Grunewald, Matthew Sayler, Nick Holloway

Someone posted, wanting to resize their swap space dynamically, i.e. without repartitioning. Nick Holloway pointed to his swapd program (written in 1993) at http://www.alfie.demon.co.uk/download/swapd-1.4.tar.gz, which automatically detects when more swap is needed and adds swapfiles to cover the increased load. Apparently Marc Merlin is the current maintainer, though Nick didn't know of a URL for his work (and Marc's homepage (http://marc.merlins.org/indexor.html) doesn't mention swapd either).

Meanwhile, Steffen Grunewald also suggested swap files, but he wasn't sure if the space could be freed up without rebooting. Several people pointed him to the 'swapoff' command, and Riley Williams loudly declared, "you CAN free up swap space without rebooting, PROVIDING the swap file or partition you wish to free up is SMALLER than your unused swap space at the time you wish to free it up."

Mike A. Harris added, "swap files are SLOW compared to partitions, and dynamically creating them and destroying them will likely end up with highly fragmented swapspace, and the performance of your swapping system will be reminiscent of Windows 95, which by default uses dynamically resizing temp swap space via file access. Ridiculously slow indeed." But Stephen C. Tweedie corrected him with, "That used to be true, but the worst inefficiencies have been cured and in 2.2.*, swap files are not all that much slower than partitions (although they are still measurably slower)."

There was also a bit of discussion about how much swap space is optimal for a given system, regardless of the question of dynamically changing swap size. Riley replied (essentially to the original post) with:

There's a common misconception that the swap space one needs is exactly proportional to the amount of RAM one has, and this looks like a classic example of that...

Can I state FROM EXPERIENCE that the so-called rule that swap space should be 1.5 times RAM size is BOGUS!!! Some systems need much more, others run happily on much less...

My experience has been that the following apply...

  1. All systems should have at least 4M of swap space, as Linux is (or at least appears to be) less stable otherwise.
  2. Where a system has multiple hard drives, Linux appears to be more stable with a swap area on each drive than having some drives with no swap area on them.
  3. The sum of RAM and swap space should be at least 32M on a system that is not running X, and at least 64M on a system that is running X.
  4. Too much swap space is not a problem, but too little swap space really hurts.
  5. If the system in question regularly runs with more than 67% of its swap space used, more swap space should be allocated.

Personally, I just allocate one 124M swap partition on each drive installed, always as a primary partition, and leave it at that. I've never had problems as a result of that policy, and with modern 2G+ drives, the loss of 124M of data area per drive isn't even noticed.

Matthew Sayler also recommended giving all swap files and partitions equal priority, saying he saw a tremendous performance gain on his 486.

Brandon S. Allbery objected to Riley's post, saying that the swap=ram*1.5 rule was just a guideline, and there was no hard and fast rule for determining how much swap is best for a given system. Riley replied that the 1.5 guideline was still bogus, and added, "In my experience, the BIG problem with the current 'rule' as presented is that people accept it as a hard and fast rule, so never consider whether their system is under-performing as a result."

He went on to recommend:

A general guideline that's usable is to allocate 8M or the amount required to top up your RAM to 64M, whichever is greater, rounded up to the next complete cylinder value ABOVE the relevant figure.

Alternatively, one that says to allocate 124 megs of swap could easily be honoured on well over 90% of current systems without ever noticing the reduction in drive data capacity, and will normally produce much better performance than the 1.5-rule stands a chance of doing. Since the maximum size of a swap partition is 126 megs

(though as Stephen C. Tweedie subsequently pointed out, this limit is gone in 2.2.x kernels. --ed)

, this basically takes the worst case requirement and assumes such will occur.

Granted, this will normally overstate the amount of swap needed, but as stated in my original post, too much swap is not a problem whereas too little really hurts performance...
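Riley's guideline above reduces to simple arithmetic: allocate 8 MB of swap, or enough to top RAM up to 64 MB, whichever is greater, rounded up to a whole cylinder. A sketch of that calculation follows; the drive geometry used (255 heads x 63 sectors x 512-byte sectors, about 7.8 MB per cylinder) is an illustrative assumption, not something from the thread:

```python
import math

# Riley's guideline: swap = max(8 MB, 64 MB - RAM), rounded UP to a
# whole cylinder.  CYL_BYTES is an assumed geometry for illustration.
CYL_BYTES = 255 * 63 * 512

def swap_mb(ram_mb):
    need = max(8, 64 - ram_mb) * 1024 * 1024       # bytes of swap needed
    cylinders = math.ceil(need / CYL_BYTES)        # round up, as he says
    return cylinders * CYL_BYTES / (1024 * 1024)

print(swap_mb(16))    # a 16 MB box: 48 MB, rounded up to whole cylinders
print(swap_mb(128))   # a big-RAM box: the 8 MB floor still applies
```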

David Lang followed up with this interesting point: "Part of the problem with this is that on other OS's it really is a necessary rule. For several AIX boxes we have, I have had to add hard drives to hold the swap partition because, even with 2G of RAM each, they would run out of swap before they ran out of memory. Now that swap is in the 3-4G range we are actually able to use all the memory in the system. Linux swap does not work this way, but most Unix people don't realize it."

5. Possible Race Condition Explored

17 Apr 1999 - 21 Apr 1999 (4 posts) Archive Link: "SMP race in page IO list"

Topics: Debugging, SMP, Virtual Memory

People: Stephen C. Tweedie, Ingo Molnar

Patrik Rak was browsing the code and thought he'd found a race in put_pio_request() (http://lxr.linux.no/source/mm/filemap.c?v=2.2.6#L1614). He felt that some data could be lost if several processors tried to access it at once, and recommended a spinlock to protect the data.

Ingo Molnar pointed out that the race could never happen because, even though put_pio_request() itself had no locks, all calls to put_pio_request() were already protected by a lock. Patrik said that was exactly what he was asking: were all the calls really protected? Stephen C. Tweedie said, "Yes, pretty much the whole VM is protected by that lock. The only bits which aren't are protected by the mutex on the current process's memory management structures, but that's mostly limited to some bits of the page fault code. All of the swap code runs under the kernel lock, with no exceptions."
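The locking pattern Ingo describes, where a function takes no lock itself because every caller already holds one, can be sketched in user space. The names below are illustrative, not the kernel's:

```python
# User-space sketch of "lock held by the caller": put_pio_request()
# does no locking of its own; the invariant is that every call site
# holds kernel_lock, which is exactly what Patrik was asking about.
import threading

kernel_lock = threading.Lock()
pio_request_list = []

def put_pio_request(req):
    # No lock taken here -- the caller is required to hold kernel_lock.
    pio_request_list.append(req)

def caller(n):
    for i in range(n):
        with kernel_lock:          # every call site is protected
            put_pio_request((threading.get_ident(), i))

threads = [threading.Thread(target=caller, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(pio_request_list))       # all 4000 updates arrive
```

The price of this idiom, as the thread shows, is that the safety of the unlocked function is invisible at its definition; you have to audit every call site to verify the invariant.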

6. First Responses To Mindcraft

17 Apr 1999 - 21 Apr 1999 (6 posts) Archive Link: "Configurable larger physical memory"

Topics: Big Memory Support, Microsoft, Networking

People: Mitchell Blank Jr, Andi Kleen

In response to the Mindcraft tests, Bernd Paysan wrote a patch against 2.2.5 (just for Intel machines) to tune the amount of physical memory Linux uses. Andi Kleen pointed out that the same thing was tried in 2.1.x and taken out again, because a lot of people were misconfiguring their systems.

John Kodis seemed to recall that with the older patch, some of the misconfiguration happened when people set the value to match the actual amount of physical memory they had installed. Mitchell Blank Jr felt the problems had also been "people setting it as high as they possibly can ("hey, I might put 3G of ram in here someday...") without realizing that there are tradeoffs and they probably want the default." Bernd felt some better documentation could probably fix those problems.
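For context, the long-standing runtime knob for this is the `mem=` boot parameter, which tells the kernel how much physical memory to use when the BIOS reports it wrongly. A hypothetical lilo.conf stanza (the image path and size are illustrative):

```
# /etc/lilo.conf -- tell a 2.2 kernel to use 128 MB of RAM
image=/boot/vmlinuz-2.2.5
    label=linux
    append="mem=128M"
```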

There were a lot of other linux-kernel responses to the Mindcraft tests, but most of those discussions seem to be ongoing.

7. Memory Management Bug Hunt

17 Apr 1999 - 21 Apr 1999 (4 posts) Archive Link: "Linux version 2.2.5 memory management"

People: Richard B. Johnson, Stephen C. Tweedie, Andrea Arcangeli

Richard B. Johnson said:

There is a serious problem with V2.2.5 memory management. When the machine is first booted (single-user, no swap, only root mounted) I can acquire about 50 megabyte of free pages from a 132 megabyte machine.

Whether or not this is appropriate is another matter. I acquire pages until the 'low_on_memory' flag is true. If I acquire pages until the return value is NULL, the machine will hang forever with processes being killed off including init.

If I dirty the pages, then give them all back -- and I promise that I did give them all back (source-code is available), on subsequent attempts to acquire pages, I can only get about a megabyte.

Once in this new state, I acquire and give back the same low amount consistently. In other words, there is an initial gigantic leak, followed by no apparent additional leak.

The test-code implements a device module which can be configured as a FIFO, i.e., you write to it until it's full, then you can read back what has been written. This is not the 'Unix' notion of a FIFO, but a FIFO that can be read and written as a device. This device can initially acquire about 50 megabytes of pages before it is 'full'. It is emptied by reading from it, at which time it returns pages to the kernel.

Therefore the testing can be done with a simple script, the results are attached.

If I do the tests in single-user mode, then restart multi-user, eventually (after several hours), the kernel recovers most of the pages that have been lost. I can speed up the recovery by executing ls -R /, i.e., exercising the directory cache.

At that time, I can consistently get 30 or more megabytes of pages without any apparent additional leaks.

I believe that there is something very wrong with the heuristics in returning free pages to the free list. Free pages have to go on the free list before they have been sorted out to see what makes contiguous areas of various sizes. It looks from the code that recently freed pages are not used again until they have been 'coalesced' because somebody needed some contiguous ones via kmalloc.
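The coalescing Richard speculates about is the buddy-allocator idea: on free, a block is merged with its "buddy" (the adjacent, equal-sized block) whenever that buddy is also free, building the larger contiguous areas that allocators like kmalloc() need. A toy sketch of that mechanism, heavily simplified and not the 2.2 mm code:

```python
# Toy buddy coalescing: addresses are page numbers, a block of order o
# covers 2**o pages.  free_area[o] holds the start pages of free
# blocks of order o.  Illustrative only -- not the kernel's code.
MAX_ORDER = 4

free_area = {o: set() for o in range(MAX_ORDER + 1)}

def free_pages(addr, order):
    while order < MAX_ORDER:
        buddy = addr ^ (1 << order)    # buddy differs in exactly one bit
        if buddy not in free_area[order]:
            break                      # buddy busy: stop merging here
        free_area[order].remove(buddy) # merge with the free buddy ...
        addr = min(addr, buddy)
        order += 1                     # ... into a block twice the size
    free_area[order].add(addr)

# Freeing 16 single pages coalesces them into one 16-page block:
for page in range(16):
    free_pages(page, 0)
print(free_area[MAX_ORDER])
```

The sketch shows why a pool of freed-but-uncoalesced pages could look "lost" to a naive page counter, which is the shape of the behaviour Richard thought he was seeing.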

At 4 AM, Andrea Arcangeli groggily posted a small fix (not even a patch, just a line of code to insert), to which Richard replied, "Yep! This plugs a major hole. It now makes the system usable. However there is still a 5:1 difference in the page-count I can initially get and eventually be left with (instead of 50:1). So there is still a problem with getting those free pages back onto the free list," and added, "I have everything set up to test anything else you want to try."

But Stephen C. Tweedie crashed the party with, "If this makes any difference at all then you are not just grabbing and releasing free pages! That change can only have an effect if you are using fs buffers. Free list management for buffers and cache pages is completely different from the physical free list management. It would be useful to know what you are actually doing here."

End Of Thread. They must have gone to private email.

8. linux-kernel Slowdown

19 Apr 1999 - 24 Apr 1999 (14 posts) Archive Link: "SCSI error: hardware, software, or firmware?"

Topics: Disks: SCSI, Mailing List Administration

People: Adam Heath, David S. Miller, Matti Aarnio

There were some problems with the list. It started when someone posted a SCSI problem, and Adam Heath said, "the guy who sent this is my coworker, and I was watching for the email on the list. He told me when he sent it. It didn't arrive until 10 hours later. I highly doubt this is optimal."

David S. Miller replied, "The machine ran out of disk space right after I went to sleep, once I woke up I promptly fixed the problem and now the machine has caught up with its backlog, relax."

Adam said, "Hey, no prob. Glad it was so simple to fix. Was it a case of the small / getting filled? I hate it when that happens," and Matti Aarnio replied:

Nope, it was a question of a 0.5G mailer spool filling with a flood of about 5MB error messages ...

Setting the input message size limit to 1 MB, and killing all spooled message files whose size exceeded 1MB, got us out of the tight spot..

As a result, you can't send thru VGER any message whose size exceeds that 1MB -- not that you would want to, either.. Digests may grow up to 100 kB; most messages are under 5 kB in size.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.