Kernel Traffic
Latest | Archives | People | Topics
Home | News | RSS Feeds | Mailing Lists | Authors Info | Mirrors | Stalled Traffic

Kernel Traffic #142 For 19 Nov 2001

By Zack Brown

Table Of Contents

Mailing List Stats For This Week

We looked at 1525 posts in 6343K.

There were 529 different contributors. 227 posted more than once. 197 posted last week too.

The top posters of the week were:

1. Clock Synchronization Problems In 2.2 And 2.4

30 Oct 2001 - 10 Nov 2001 (36 posts) Archive Link: "PROBLEM: Linux updates RTC secretly when clock synchronizes"

People: Ian Maclaine-cross, Pavel Machek, Linus Torvalds

Ian Maclaine-cross posted a patch, and explained:

When /usr/sbin/ntpd synchronizes the Linux kernel (or system) clock using the Network Time Protocol the kernel time is accurate to a few milliseconds. Linux then sets the Real Time (or Hardware or CMOS) Clock to this time at approximately 11 minute intervals. Typical RTCs drift less than 10 s/day so rebooting causes only millisecond errors.

Linux currently does not record the 11 minute updates to a log file. Clock programs (like hwclock) cannot correct RTC drift at boot without knowing when the RTC was last set. If NTP service is available after a long shutdown, ntpd may step the time. Worse after a longer shutdown ntpd may drop out or even synchronize to the wrong time zone. The workarounds are clumsy.

Please find following my small patch for linux/arch/i386/kernel/time.c which adds a KERN_NOTICE of each 11 minute update to the RTC. This is just for i386 machines at present. A script can search the logs for the last set time of the RTC and update /etc/adjtime. Hwclock can then correct the RTC for drift and set the kernel clock.

I patched Linux 2.2.19 and 2.4.12 then compiled, installed and rebooted on Pentium MMX and AMD K6-III machines respectively. When the kernel clock synchronized "...: Real Time Clock set at xxx s" appeared in the kernel log every 661 s where "xxx" was the current system time. Messages ceased whenever the clock was unsynchronized. Ntpd produces typically four log lines in 661 s so the increase in log volume is small for ntpd users and nothing for nonusers. The patch added 11 bytes to the size of my compressed kernel.
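Ian's scheme implies a simple recovery path at boot: search the kernel log for the last RTC-set notice. A minimal sketch of such a script, in Python rather than shell, with the message format taken from Ian's quoted example (the exact wording of the notice is an assumption):

```python
import re

# Find the most recent "Real Time Clock set at <secs> s" notice in a
# kernel log, so a boot script could update /etc/adjtime for hwclock.
RTC_SET = re.compile(r"Real Time Clock set at (\d+) s")

def last_rtc_set(log_lines):
    """Return the epoch seconds of the most recent RTC update, or None."""
    last = None
    for line in log_lines:
        match = RTC_SET.search(line)
        if match:
            last = int(match.group(1))
    return last
```

With the last set time in hand, a boot script can compute how long the RTC has run unset and hand a drift correction to hwclock.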

Pavel Machek replied, "That seems as very wrong solution. What about just making kernel only _read_ system clock, and never set it? That looks way cleaner to me." Ian replied:

QUESTION: What results in best timekeeping by the RTC, constant updates or logging the offset?


The Linux kernel code for the 11 minute update in arch/i386/kernel/time.c has an RTC setting error of ±0.005 s. The adjtimex source suggests an RTC reading error of ±0.000025 s.

Accurate RTC timekeeping also requires an accurate value of average drift rate for typical use. Measuring this requires timing over a long unset period such as one month.

Logging the offset is more accurate per reading and allows more accurate measurement of drift than 11 minute updates.


RTC accuracy supports optionalizing the 11 minute update.

Other reasons to optionalize the 11 minute update which various people suggest:

  1. The kernel should not dictate OS policy;
  2. Simplifies programming with /dev/rtc;
  3. Improves performance of /dev/rtc;
  4. Slightly reduced kernel size;
  5. Slightly faster timer_interrupt;
  6. Easier to use utilities like hwclock.

I agree with you, Pavel. Commenting out the 11 minute update code is a better solution. :)
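Ian's argument turns on simple drift arithmetic: with an accurately logged set time, the RTC's average drift rate can be measured over a long unset period and corrected at boot. A sketch of that arithmetic (function names here are mine, not from the thread):

```python
SECONDS_PER_DAY = 86400

def drift_rate(offset_then, offset_now, elapsed_seconds):
    """Average RTC drift in seconds per day, from two offset readings
    (RTC time minus reference time) taken elapsed_seconds apart."""
    return (offset_now - offset_then) / elapsed_seconds * SECONDS_PER_DAY

def boot_correction(rate_s_per_day, days_since_set):
    """Correction to apply to the RTC after it has run unset for a
    while -- the adjustment hwclock derives from /etc/adjtime."""
    return rate_s_per_day * days_since_set
```

This is why the measurement interval matters: a month-long unset period averages out the ±0.005 s setting error, while 11 minute updates keep resetting the baseline.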

Pavel asked if Ian would push the patch through to Linus Torvalds, and Ian said he'd work something up for 2.5.

2. Some ext3 Users Discover They've Been Using ext2

7 Nov 2001 (53 posts) Archive Link: "ext3 vs resiserfs vs xfs"

Topics: FS: ReiserFS, FS: ext2, FS: ext3

People: Alan Cox, Arjan van de Ven, Roy Sigurd Karlsbakk

Roy Sigurd Karlsbakk asked which was best, ext3, Reiserfs, or xfs. In particular, he'd noticed that after a sudden shutdown on his RedHat 7.2 box, the system wanted to run fsck at bootup anyway, just like ext2. But Alan Cox replied, "RH 7.2 after an unexpected shutdown will give you a 5 second count down when you can choose to force an fsck - ext3 doesnt need an fsck but sometimes folks might want to force it thats all."

Several people complained that fsck ran whether they wanted it to or not, and it was pointed out that they might not actually be using ext3. After some investigation, this was confirmed: a number of folks who had believed themselves to be using ext3, because of text in various system configuration files like /etc/fstab, were surprised to find that they'd been sudden-booting non-journaling filesystems all this time.
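The surprise could have been caught by checking the kernel's own view of the mounts rather than /etc/fstab: /proc/mounts reports what is actually mounted. A small sketch of that check (the function name is mine):

```python
def mounted_fs_types(proc_mounts_text):
    """Map mount point -> filesystem type as the kernel reports it.
    Pass the contents of /proc/mounts; unlike /etc/fstab, this shows
    what is really mounted, which is how an ext2-vs-ext3 surprise
    like the one in the thread can be detected."""
    types = {}
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # fields are: device, mount point, fs type, options, ...
            types[fields[1]] = fields[2]
    return types
```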

After fixing the problem on his own system, Zvi Har'El asked why RedHat did not compile ext3 directly into the kernel, which would make the problem much less likely. Arjan van de Ven (of RedHat) replied, "The basic idea is "everything which can be a module will be a module", even scsi is a module. And if you use grub, it's 100% transparent as the initrd will be automatically added to the grub config when you install the RH kernel rpm; even if you use lilo the initrd is supposed to be made for you automatically."

The thread ended somewhat inconclusively. There was not much discussion of the relative merits of Reiserfs and xfs.

3. Revising The Linux Scheduler

8 Nov 2001 - 14 Nov 2001 (23 posts) Archive Link: "[patch] scheduler cache affinity improvement for 2.4 kernels"

Topics: Real-Time, SMP, Scheduler

People: Ingo Molnar, Mike Fedyk, Davide Libenzi

Ingo Molnar announced, "i've attached a patch that fixes a long-time performance problem in the Linux scheduler." He went on to explain:

it's a fix for a UP and SMP scheduler problem Alan described to me recently, the 'CPU intensive process scheduling' problem. The essence of the problem: if there are multiple, CPU-intensive processes running, intermixed with other scheduling activities such as interactive work or network-intensive applications, then the Linux scheduler does a poor job of affinizing processes to processor caches. Such scheduler workload is common for a large percentage of important application workloads: database server workloads, webserver workloads and math-intensive clustered jobs, and other applications.

If there are CPU-intensive processes A B and C, and a scheduling-intensive X task, then in the stock 2.4 kernels we end up scheduling in the following way:

    A X A X A ... [timer tick]
    B X B X B ... [timer tick]
    C X C X C ... [timer tick]

ie. we switch between CPU-intensive (and possibly cache-intensive) processes every timer tick. The timer tick can be 10 msec or shorter, depending on the HZ value.

the intended length of the timeslice of such processes is supposed to be dependent on their priority - for typical CPU-intensive processes it's 100 msecs. But in the above case, the effective timeslice of the CPU/cache-intensive process is 10 msec or lower, causing potential cache thrashing if the working set of A, B and C are larger than the cache size of the CPU but the individual process' workload fits into cache. Repopulating a large processor cache can take many milliseconds (on a 2MB on-die cache Xeon CPU it takes more than 10 msecs to repopulate a typical cache), so the effect can be significant.

The correct behavior would be:

    A X A X A ... [10 timer ticks]
    B X B X B ... [10 timer ticks]
    C X C X C ... [10 timer ticks]

this is in fact what happens if the scheduling activity of process 'X' does not happen.

solution: i've introduced a new current->timer_ticks field (which is not in the scheduler 'hot cacheline', nor does it cause any scheduling overhead), which counts the number of timer ticks registered by any particular process. If the number of timer ticks reaches the number of available timeslices then the timer interrupt marks the process for reschedule, clears ->counter and ->timer_ticks. These 'timer ticks' have to be correctly administered across fork() and exit(), and some places that touch ->counter need to deal with timer_ticks too, but otherwise the patch has low impact.

scheduling semantics impact: this causes CPU hogs to be more affine to the CPU they were running on, and will 'batch' them more aggressively - without giving them more CPU time than under the stock scheduler. The change does not impact interactive tasks since they grow their ->counter above that of CPU hogs anyway. It might cause less 'interactivity' in CPU hogs - but this is the intended effect.

performance impact: this field is never used in the scheduler hotpath. It's only used by the low frequency timer interrupt, and by the fork()/exit() path, which can take an extra variable without any visible impact. Also some fringe cases that touch ->counter needed updating too: the OOM code and RR RT tasks.

performance results: The cases i've tested appear to work just fine, and the change has the cache-affinity effect we are looking for. I've measured 'make -j bzImage' execution times on an 8-way, 700 MHz, 2MB cache Xeon box. (certainly not a box whose caches are easy to thrash.) Here are 6 successive make -j execution times with and without the patch applied. (To avoid pagecache layout and other effects, the box is running a modified but functionally equivalent version of the patch which allows runtime switching between the old and new scheduler behavior.)

stock scheduler:

  real    1m1.817s
  real    1m1.871s
  real    1m1.993s
  real    1m2.015s
  real    1m2.049s
  real    1m2.077s

with the patch applied:

  real    1m0.177s
  real    1m0.313s
  real    1m0.331s
  real    1m0.349s
  real    1m0.462s
  real    1m0.792s

ie. stock scheduler is doing it in 62.0 seconds, new scheduler is doing it in 60.3 seconds, a ~3% improvement - not bad, considering that compilation is executing 99% in user-space, and that there was no 'interactive' activity during the compilation job.

- to further measure the effects of the patch i've changed HZ to 1024 on a single-CPU, 700 MHz, 2MB cache Xeon box, which improved 'make -j' kernel compilation times by 4%.

- Compiling just drivers/block/floppy.c (which is a cache-intensive operation) in parallel, with a constant single-process Apache network load in the background shows a 7% improvement.

This shows the results we expected: with smaller timeslices, the effect of cache thrashing shows up more visibly.

(NOTE: i used 'make -j' only to create a well-known workload that has a high cache footprint. It's not to suggest that 'make -j' makes much sense on a single-CPU box.)

(it would be nice if those people who suspect scalability problems in their workloads could further test and verify the effects of this patch.)

the patch is against 2.4.15-pre1 and boots/works just fine on both UP and SMP systems.
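The batching Ingo describes can be modelled in a few lines. This toy simulation (names such as timer_tick, Task, and the 10-tick timeslice are illustrative, not the patch's actual code) shows a CPU hog keeping the processor for a full timeslice of ticks before the next hog runs:

```python
from dataclasses import dataclass

TIMESLICE_TICKS = 10  # ~100 msecs at HZ=100, per Ingo's description

@dataclass
class Task:
    name: str
    timer_ticks: int = 0

def timer_tick(current):
    """Toy model of the patched tick handler: charge one tick to the
    running task and request a reschedule only once a full timeslice
    of ticks has been consumed (clearing the count, as the patch does)."""
    current.timer_ticks += 1
    if current.timer_ticks >= TIMESLICE_TICKS:
        current.timer_ticks = 0
        return True   # mark the process for reschedule
    return False

def run(hogs, total_ticks):
    """Round-robin some CPU hogs, switching only on timeslice expiry;
    returns which hog owned the CPU at each tick."""
    owner, i = [], 0
    for _ in range(total_ticks):
        owner.append(hogs[i].name)
        if timer_tick(hogs[i]):
            i = (i + 1) % len(hogs)
    return owner
```

In the stock scheduler the intervening wakeups of X would let a different hog win each tick; here the tick accounting keeps A on the CPU for its full 10 ticks, which is exactly the cache-affinity effect the patch is after.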

Davide Libenzi gave a link to his own proposal for a better scheduler, and to the patch to make it happen. Ingo had some criticism of the patch and they went back and forth for a while on it, and at one point Mike Fedyk stepped between them with:

Conceptually, both patches are compatible.

Whether they are technically is for someone else to say...

Ingo's patch in effect lowers the number of jiffies taken per second in the scheduler (by making each task use several jiffies).

Davide's patch can take the default scheduler (even Ingo's enhanced scheduler) and make it per processor, with his extra layer of scheduling between individual processors.

I think that together, they both win. Davide's patch keeps a task from switching CPUs very often, and Ingo's patch will make each task on each CPU use the cache to the best extent for that task.

It remains to be proven whether the coarser scheduling approach (Ingo's) will actually help when looking at cache properties.... When each task takes a longer time slice, that allows the other tasks to be flushed out of the caches during that time. When the next task comes back in, it will have to re-populate the cache again. And the same for the next and etc...

There was some more argument, and the thread petered out inconclusively.

4. Lots Of Swapping During NFS Writes

11 Nov 2001 (3 posts) Archive Link: "Writing over NFS causes lots of paging"

Topics: FS: NFS, Virtual Memory

People: Simon Kirby, Linus Torvalds, Trond Myklebust

Simon Kirby reported:

It looks like when writing large amounts of data to NFS, where the remote end is slower than the local end, the local end appears to start swapping out a lot. I'm guessing this is because it can read much faster than it can write.

Also, I see NFS timeouts and thus "I/O error" messages from cp when it is mounted with the "soft" option, even with high timeouts. "hard" works fine, but I didn't want to use it for this mount.

Linus Torvalds replied:

the real reason for why the NFS write stuff causes page-outs is that the VM layer does not really understand the notion of writeback pages.

The VM layer has one explicit special case: it knows about the magic in "page->buffers", and can handle writeback for block-oriented devices sanely. But any non-buffer-oriented filesystem is "invisible" to the VM layer, and has to use other tricks to make the VM ignore its pages.

In the case of NFS, it increments the page count and has its own private non-VM-visible writeback data structures. This pins the page in memory, but at the same time, because the VM doesn't understand it, the VM will end up thinking the page is mapped in user space or something else, and won't know how to start writeouts.

Quite frankly, I don't rightly know what the real fix is. Making "page->buffers" be a generic thing (a "void *") along with making the buffer flushing logic be behind an address space operation is probably the right thing in the long run.
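Linus's suggestion - make the per-page pointer opaque and route writeback through an address-space operation - amounts to a dispatch-table design. A toy model of the idea (all class and method names here are illustrative, not the kernel's actual API):

```python
class Page:
    """A page with an opaque, filesystem-owned pointer (the generic
    'void *' Linus suggests in place of the buffer-specific field)."""
    def __init__(self, data):
        self.data = data
        self.private = None   # fs-private; the VM never interprets it

class AddressSpaceOps:
    """Per-filesystem operations the VM calls through."""
    def writeback_page(self, page):
        raise NotImplementedError

class BufferBackedFS(AddressSpaceOps):
    """Block-device style filesystem: flushes buffer-head state."""
    def writeback_page(self, page):
        return "flushed buffers for %r" % page.data

class NFSLikeFS(AddressSpaceOps):
    """Network filesystem keeping its own private writeback records."""
    def writeback_page(self, page):
        page.private = {"request": page.data}  # private writeback state
        return "queued NFS write for %r" % page.data

def vm_writeout(page, ops):
    """The VM starts writeout through the ops table without needing
    to understand the filesystem's private per-page state."""
    return ops.writeback_page(page)
```

The point of the indirection is that the VM can initiate writeouts for any filesystem, instead of only the buffer-backed ones it has special-cased knowledge of.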

But Trond Myklebust objected:

That only takes care of writebacks. Don't forget that reading & readahead can also eat memory if somebody forgets to call lock_page() (a common problem on 'hard,intr' mounts).

You'll notice that in the NFS updates I sent you the other day, there is a new function 'nfs_try_to_free_pages()' that provides a rather generic way of freeing up NFS memory resources. Its sole purpose today is to ensure that we keep an upper limit of 256 requests per mount.

My (still somewhat vague) plan is to expand that interface some time during 2.5.x to allow the VM to control and limit the memory usage of the NFS client - flushing out read and write requests if necessary. IMHO, the filesystem can often be more efficient at clearing out pages if we leave the choice of strategy up to it, rather than having the VM micro-manage exactly which page is to be thrown out first. For instance, under NFSv3 there is usually a huge advantage to sending off a COMMIT over any other call, since it can potentially free up a whole truckload of pages.

There was no reply.

5. Linus Preparing 2.4 Hand-Off To Marcelo

12 Nov 2001 - 14 Nov 2001 (14 posts) Archive Link: "[PATCH] reformat mtrr.c to conform to CodingStyle"

Topics: BSD: NetBSD

People: Linus Torvalds, Jeff Garzik, Benjamin LaHaise, Andreas Dilger, Chris Wedgwood, Helge Hafting

Benjamin LaHaise posted a patch to make the mtrr.c file conform to the coding style standards laid out in Documentation/CodingStyle. But Linus Torvalds replied politically, "I don't like reformatting without at least asking the maintainer, unless the maintainer isn't doing maintenance. Also, right now I'd rather not have any big patches even if they are just syntactic.. Makes hand-over to Marcelo simpler."

Chris Wedgwood pointed out that if folks wanted to start doing stylistic patches, there were tons of places that needed it. Jeff Garzik agreed, and added, "For mtrr.c it (a) is unmaintained for years, and (b) is actively being hacked on by non-maintainers. It has an especially strong case for Lindent'ing."

Elsewhere, Andreas Dilger pointed out that some of Benjamin's changes to mtrr.c looked not so nice, and Benjamin replied, "That's what Lindent came up with, which evidently needs tweaking." Close by, Jeff put in, "IMHO CodingStyle is defined in theory by Documentation/CodingStyle, and in practice by linux/scripts/Lindent, which was changed in 2.4.15-preXX even, to be more up-to-date."

Elsewhere, Helge Hafting suggested that Linus should just run Lindent himself on the whole tree, instead of having bunches of people submit style patches. He proposed:

find linux/ -name "*.[ch]" | linux/scripts/Lindent

But Jeff replied, "Lindent still does a few dumb things which make me review the code after formatting and before submission..." He also suggested checking out NetBSD's indent program, which Christoph Hellwig had recently ported to Linux. Christoph gave the URL to his version, and suggested the discussion might be heading off topic.

End of thread.

6. Linux Vs. FreeBSD Benchmark

13 Nov 2001 - 14 Nov 2001 (11 posts) Archive Link: "2.4.x has finally made it!"

Topics: BSD: FreeBSD, Virtual Memory

People: Alastair Stevens, Matthias Andree, Christoph Hellwig, Kevin Wooten, Doug McNaught

Alastair Stevens pointed to an article comparing Linux and FreeBSD:

For those who haven't seen it yet, Moshe Bar has revisited his Linux 2.4 vs FreeBSD benchmarks, using 2.4.12 in this case:

During the original benchmarks, using the newly-released 2.4.0, Linux was largely hammered by FreeBSD, and exhibited all sorts of interactivity problems under load, due to VM and other issues.

Well now, with the newer releases, we really seem to have caught up again. The new VM, and the millions of other fixes since 2.4.0 have made all the difference. FreeBSD is an impressive OS, and a worthy competitor with a distinguished heritage - it's great to see Linux snapping at its heels.

So congratulations to all kernel developers - 2.4.x has basically 'made it' now, and months of hard work have produced a stable, high-performance and cutting edge kernel. I'm looking forward to running the cross-bred Linus / Alan 2.4.15 soon, and even more to 2.5.x - as Linux heads onwards to new levels yet again.

Matthias Andree replied, "Wow. That person is knowledgeable... NOT. Turning off fsync() for mail is just as good as piping it to /dev/null. See RFC-1123." And Christoph Hellwig added, "After the last VM article no one expects any clue from him anyway 8)". Elsewhere, Kevin Wooten also remarked, "Why is he using FreeBSD 4.3? Version 4.4 has been out for quite a while... that seems like quite an oversight, unless 4.3 performs better than 4.4, which I doubt." However, Doug McNaught pointed out, regarding the fsync() comment:

Umm... He specifically stated that it was a Very Bad Idea for production systems. He simply wanted to measure general throughput rather than disk latency (which is a bottleneck with fsync() enabled).

It's a benchmark, lighten up! ;)

Matthias replied that the article purported to test every-day use, and that disk latency would fall into that category. He said, "fsync() efficiency comes into play and wants to be benchmarked as well. How do you know if your fsync() syncs what's needed, the whole partition, the partition's meta data (softupdates!) or the world (all blocks)?" Doug felt that was a good point, and the thread ended.
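Matthias's underlying point is the classic durability contract: a mail server must not acknowledge delivery until the message has reached stable storage. A minimal sketch of a durable write in Python (the helper name is mine, not from the thread):

```python
import os

def durable_write(path, data):
    """Write data and force it to stable storage before returning --
    the guarantee RFC 1123 expects of a mail server before it
    acknowledges delivery.  Without the fsync, a crash can lose data
    the application believed was safely stored."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)   # flush file contents to disk before returning
    finally:
        os.close(fd)
```

The benchmark trade-off in the thread follows directly: skipping the fsync() raises throughput by trading away exactly this guarantee.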







Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.