Kernel Traffic #296 For 12 Feb 

By Zack Brown

Introduction

I'd like to thank two folks who've helped me out with Kernel Traffic over the past couple of weeks: Folkert van Heusden, with mailbox statistics, and Jonas Berlin, with URL generation for thread archives.

Folkert's mboxstats (http://www.vanheusden.com/mboxstats/) program is really terrific; in fact, it gathers way more statistics than I actually use. Hopefully my XSLT recipes will grow into the wealth of information provided by mboxstats.

Jonas replied to my call for help in finding the Google Groups URLs for the various threads covered in KT. This is a thorny problem, because although Groups lets you search by Message-ID, the particular lkml archive is gated through a Usenet server, so the Message-ID gets munged. Jonas wrote a utility to scrape Google's HTML, sift through search results, and find the links to the proper post. Anyone interested in this script can look it over (../bin/news2googleurl-4).

Mailing List Stats For This Week

We looked at 1608 posts in 10MB. See the Full Statistics.

There were 598 different contributors. 224 posted more than once. The average length of each message was 103 lines.

The top posters of the week were:
53 posts in 313KB by Adrian Bunk
41 posts in 311KB by Karim Yaghmour
41 posts in 241KB by Andreas Gruenbacher
38 posts in 236KB by Andrew Morton
31 posts in 118KB by Matt Mackall

The top subjects of the week were:
110 posts in 612KB for "2.6.11-rc1-mm1"
60 posts in 287KB for "[PATCH] dynamic tick patch"
46 posts in 219KB for "[patch 1/13] Qsort"
29 posts in 150KB for "seccomp for 2.6.11-rc1-bk8"
26 posts in 108KB for "Announce loop-AES-v3.0b file/swap crypto package"

These stats generated by mboxstats version 2.2

1. Linux 2.6.11-rc1-mm1 Released; FUSE And LTT (With relayfs) Included

14 Jan  - 25 Jan  (144 posts) Archive Link: "2.6.11-rc1-mm1"

Topics: Extended Attributes, FS: ext3, Kernel Release Announcement, Samba, Security, Version Control

People: Andrew Morton, Andre Eisenbach, Miklos Szeredi, Rogério Brito, Peter Buckingham, Matthias Urlichs, Andi Kleen, Roman Zippel, Thomas Gleixner, Karim Yaghmour, Masami Hiramatsu

Andrew Morton announced Linux 2.6.11-rc1-mm1, saying:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm1/

The FUSE developers had been trying like the dickens to get FUSE into the main kernel. Getting into the -mm tree was a major step, and Miklos Szeredi, Kasper Sandberg, and Andre Eisenbach all expressed joy and happiness. Andre said:

As a long time user of KDE's kio-slaves, I was always missing the kio-slave functionality on the command line and in non-kde programs. FUSE provides a kio-slave interface, but hopefully the inclusion of FUSE in the mm-kernel will cause more "fuse native" filesystems to come out which provide the functionality of the various kio-slaves.

Some things I'd like to see (as I am currently using the KIO equivalent) implemented as FUSE fs:

Imagination is the limit, and since it can be implemented in userspace pretty easily with FUSE, I am looking forward to see what people can come up with and hope that FUSE is here to stay.

Miklos replied that 'fish' was already available at http://sourceforge.net/projects/fuse, although "You need to dowload fuse-2.2-pre3 and sshfs-1.0. It should work on any kernel including the 2.6.10-rc1-mm1 with FUSE compiled in."

Andrew Morton asked Kasper, one of the other FUSE enthusiasts, what FUSE-based filesystems he used, and why. Rogério Brito replied, "I have never used a -mm kernel tree before, but seeing that fuse got included made me download the patch to try it. I'll be using gmailfs (which needs fuse) just to see how things work with Debian's testing (sarge) userland." Peter Buckingham also said to Andrew, "we're currently prototyping a lightweight network filesystem proxy using fuse." And Matthias Urlichs said that he currently used "sshfs (best idea for file access through firewalls). gmailfs (best free off-site backup facility). Will use encfs as soon as FUSE is in mainline (I'm using cryptoloop now, but that's not sanely backupable.)" Kasper had nothing to add.

The other big topic of the thread was relayfs, which got a more measured reaction. Andi Kleen said:

relayfs and netlink are for completely problem spaces. relayfs is for relaying a lot of data quickly (e.g. for kernel instrumentation). There it fills a niche that printk doesn't fill (since it's too slow). netlink is quite slow (allocates data for each event, does lots of other gunk), but an useful extensible format for low frequency events.

For the problems that relayfs solves netlink is totally unusable due to low efficiency (you could as well use printk, but that is also to slow). I think a low overhead logging mechanism is very much needed, because I find myself reinventing it quite often when I need to debug some timing sensitive problem. Trying to tackle these with printk is hopeless because it changes timing too much.

The problem relayfs has IMHO is that it is too complicated. It seems to either suffer from a overfull specification or second system effect. There are lots of different options to do everything, instead of a nice simple fast path that does one thing efficiently. IMHO before merging it should go through a diet and only keep the paths that are actually needed and dropping a lot of the current baggage.

Preferably that would be only the fastest options (extremly simple per CPU buffer with inlined fast path that drop data on buffer overflow), with leaving out anything more complicated. My ideal is something like the old SGI ktrace which was an extremly simple mechanism to do lockless per CPU logging of binary data efficiently and reading that from a user daemon.

Roman Zippel felt that relayfs should indeed be merged, though its API should be declared unstable as developers worked to simplify it. He agreed with Andi about relayfs' overcomplexity, saying, "relayfs should resemble a very simple pipe, maybe making it possible to writing them directly to disk." Close by, Masami Hiramatsu suggested a project he was working on as a simpler replacement for relayfs: Linux Kernel State Tracer (LKST) (http://lkst.sourceforge.net/). Andi encouraged him to publish his code for review, and Masami said the latest version only supported kernel 2.6.9; he added that he'd port it up to the latest kernel as soon as he could.

Also in reply to Andi's first post, Karim Yaghmour, one of the relayfs developers, said the relayfs folks would consider "reasonable" changes, but he also felt a lot of what Andi considered unnecessary was actually quite useful. The subthread veered into technical considerations at that point.

Elsewhere, Thomas Gleixner replied more harshly to Andrew's initial announcement regarding relayfs, or more precisely LTT as a whole. He thought the code was way too big, adding 150K to the source tree just for debugging. He suggested, "I don't see any real advantage over a nice implemented per cpu ringbuffer, which is lock free and does not add variable timed delays in the log path. Don't tell me that a ringbuffer is not suitable, it's a question of size and it is the same problem for relayfs. If you don't have enough buffers it does not work. This applies for every implementation of tracebuffering you do. In space constraint systems relayfs is even worse as it needs more memory than the plain ringbuffer. The ringbuffer has a nice advantage. In case the system crashes you can retrieve the last and therefor most interesting information from the ringbuffer without any hassle via BDI or in the worstcase via a serial dump. You can even copy the tail of the buffer into a permanent storage like buffered SRAM so it can be retrieved after reboot." His post was quite a bit longer, all critical, and Karim replied:

Granted tracing is not free, but please avoid spreading FUD without actually carrying out proper testing. We've done quite a large number of tests and we've demonstrated over and over that LTT, and ltt-over- relayfs, is actually very efficient. If you're interested in actual test data, then you may want to check out the following:

http://www.opersys.com/ftp/pub/LTT/Documentation/ltt-usenix.ps.gz

http://lwn.net/Articles/13870/

We are aware of the cost of the various tracing components, as you can see by my earlier posting about early-checking to minimize the cost of the tracing hooks for kernel compiled with them, and are open for any optimization. If you have any concrete suggestions, save the scrap-everything-I-know-better (which is really unproductive as you would anyway have to go down the same path we have), we are more than willing to entertain them.

Thomas replied, "Yes, the "you would anyway have to go down the same path we have" argument really scares me away from doing so. I don't buy this kind of arguments." But Andrew Morton said, "I do. When someone has been working on a real-world project for several years we *need* to understand all the problems which that person encountered before we can competently review the implementation. Surely you've been there before: you throw out all the old stuff, write a new one and once you've addressed all the warts and corner cases and weird-but-valid requirements it ends up with the same complexity as the original."

The rest of the subthread degenerated into bickering between Thomas and Karim.
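To make the two buffer policies from this thread concrete, here is a small userspace sketch in Python. It is only a toy model, not code from relayfs, LTT, or any kernel patch: the class name, method names, and the idea of counting lost events are all assumptions made for illustration. Andi's proposal corresponds to dropping new data when a per-CPU buffer is full, while Thomas's ring buffer overwrites the oldest data so that the most recent events survive. The model does not demonstrate lock-freedom; it only shows that each CPU writes to its own buffer.

```python
from collections import deque


class PerCpuTraceBuffer:
    """Toy model of one fixed-size trace buffer per CPU (not kernel code)."""

    def __init__(self, num_cpus, capacity, overwrite_oldest):
        self.capacity = capacity
        self.overwrite_oldest = overwrite_oldest
        self.dropped = [0] * num_cpus  # events lost per CPU, in either mode
        # A deque with maxlen discards its oldest entry when a new one is
        # appended to a full buffer, which models the ring-buffer policy.
        maxlen = capacity if overwrite_oldest else None
        self.buffers = [deque(maxlen=maxlen) for _ in range(num_cpus)]

    def log(self, cpu, event):
        """Record an event on one CPU; return False if the event was dropped."""
        buf = self.buffers[cpu]
        if len(buf) >= self.capacity:
            self.dropped[cpu] += 1
            if not self.overwrite_oldest:
                return False  # drop-on-overflow: keep the old data
        buf.append(event)  # ring mode: the oldest entry falls off
        return True

    def tail(self, cpu, count):
        """Return up to `count` of the most recent events for one CPU."""
        if count <= 0:
            return []
        return list(self.buffers[cpu])[-count:]
```

After logging five events into a three-slot buffer, the drop-on-overflow variant still holds the first three events, while the ring variant holds the last three. That is the property Thomas points to when he says the most interesting information can be retrieved after a crash.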

2. Loop-AES Version 3.0b Crypto Package Released

16 Jan  - 21 Jan  (26 posts) Archive Link: "Announce loop-AES-v3.0b file/swap crypto package"

Topics: Advanced Encryption Standard, BSD: FreeBSD, Ioctls

People: Jari Ruusu, Bill Davidsen

Jari Ruusu said:

loop-AES changes since previous release:

bzip2 compressed tarball is here:

http://loop-aes.sourceforge.net/loop-AES/loop-AES-v3.0b.tar.bz2
md5sum b295ff982cd4503603b38fdc54e604cc

http://loop-aes.sourceforge.net/loop-AES/loop-AES-v3.0b.tar.bz2.sign

Bill Davidsen asked if this would ever make it into the official tree, and Jari replied, "Unlikely to go to mainline kernel. Mainline folks are just too much in love with their backdoored device crypto (http://marc.theaimsgroup.com/?l=linux-kernel&m=107419912024246&w=2) implementations. If you want strong device crypto in mainline kernel, maybe you should take a look at FreeBSD gbde."

A number of folks took Jari to task over his use of the term 'backdoor', which implied an intentional deception; and folks debated for a while.

3. plugsched Version 2.0 Released; Some Discussion Of Official Inclusion

19 Jan  - 21 Jan  (11 posts) Archive Link: "[ANNOUNCE][RFC] plugsched-2.0 patches ..."

Topics: Big O Notation, FS: sysfs, Scheduler

People: Peter Williams, Marc E. Fiuczynski, Shailabh Nagar, Valdis Kletnieks, Jens Axboe, Con Kolivas, Ingo Molnar, William Lee Irwin III, Andrew Morton

Peter Williams announced that his plugsched-2.0 patches "are now available from: <http://prdownloads.sourceforge.net/cpuse/plugsched-2.0-for-2.6.10.patch?download> as a single patch to linux-2.6.10 and at: <http://prdownloads.sourceforge.net/cpuse/plugsched-2.0-for-2.6.10.patchset.tar.gz?download> as a (gzipped and tarred) patch set including "series" file which nominates the order of application of the patches." He went on:

This is an update of the earlier version of plugsched (previously released by Con Kolivas) and has a considerably modified scheduler interface that is intended to reduce the amount of code duplication required when adding a new scheduler. It also contains a sysfs interface based on work submitted by Chris Han.

This version of plugsched contains 4 schedulers:

  1. "ingosched" which is the standard active/expired array O(1) scheduler created by Ingo Molnar,
  2. "staircase" which is Con Kolivas's version 10.5 O(1) staircase scheduler,
  3. "spa_no_frills" which is a single priority array O(1) scheduler without any interactive response enhancements, etc., and
  4. "zaphod" which is a single priority array O(1) scheduler with interactive response bonuses, throughput bonuses and a choice of priority based or entitlement based interpretation of "nice".

Schedulers 3 and 4 also offer unprivileged real time tasks and hard/soft per task CPU rate caps.

The required scheduler can be selected at boot time by supplying a string of the form "cpusched=<name>" where <name> is one of the names listed above.

The default scheduler (that will be used in the absence of a "cpusched" boot argument) can be configured at build time and is set to "ingosched" by default.

The file /proc/scheduler contains a string describing the current scheduler.

The directory /sys/cpusched/<current scheduler name>/ contains any scheduler configuration control files that may apply to the current scheduler.
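The boot-time selection Peter describes can be sketched in a few lines of Python. This is a userspace illustration only, not plugsched code: the function name is hypothetical, and the handling of an unrecognized name (falling back to the default) is an assumption, since the announcement does not say what plugsched does in that case. The scheduler names and the "cpusched=<name>" form are taken from the announcement.

```python
# Scheduler names listed in the plugsched-2.0 announcement.
KNOWN_SCHEDULERS = ("ingosched", "staircase", "spa_no_frills", "zaphod")


def select_scheduler(cmdline, default="ingosched"):
    """Pick a scheduler name from a kernel-style command line string.

    Assumption: an unknown or missing "cpusched=" value leaves the
    build-time default in place.
    """
    chosen = default
    for arg in cmdline.split():
        key, sep, value = arg.partition("=")
        if key == "cpusched" and sep and value in KNOWN_SCHEDULERS:
            chosen = value
    return chosen
```

For example, select_scheduler("root=/dev/hda1 cpusched=zaphod ro") returns "zaphod", while an empty command line returns "ingosched".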

Kasper Sandberg was very pleased to see this work going forward, and Marc E. Fiuczynski put it more historically:

Peter, thank you for maintaining Con's plugsched code in light of Linus' and Ingo's prior objections to this idea. On the one hand, I partially agree with Linus&Ingo's prior views that when there is only one scheduler that the rest of the world + dog will focus on making it better. On the other hand, having a clean framework that lets developers in a clean way plug in new schedulers is quite useful.

Linus & Ingo, it would be good to have an indepth discussion on this topic. I'd argue that the Linux kernel NEEDS a clean pluggable scheduling framework.

Let me make a case for this NEED by example. Ingo's scheduler belongs to the egalitarian regime of schedulers that do a poor job of isolating workloads from each other in multiprogrammed environments such as those found on Enterprise servers and in my case on PlanetLab (www.planet-lab.org) nodes. This has been rectified by HP-UX, Solaris, and AIX through the use of fair share schedulers that use O(1) schedulers within a share. Currently PlanetLab uses a CKRM modified version of Ingo's scheduler. Similarly, the linux-vserver project also modifies Ingo's scheduler to construct an entitlement based scheduling regime. These are not just variants of O(1) schedulers in the sense of Con's staircase O(1). Nor is it clear what the best type of scheduler is for these environments (i.e., HP-UX, Solaris and AIX don't have it fully solved yet either). The ability to dynamically swap out schedulers on a production system like PlanetLab would help in determining what type of scheduler is the most appropriate. This is because it is non-trivial, if not impossible, to recreate the multiprogrammed workloads that we see in a lab.

For these reasons, it would be useful for plugsched (or something like it) to make its way into the mainline kernel as a framework to plug in different schedulers. Alternatively, it would be useful to consider in what way Ingo's scheduler needs to support plugins such as the CKRM and Vserver types of changes.

Peter remarked, "I'm hoping that the CKRM folks will send me a patch to add their scheduler to plugsched :-)" Marc replied, "They are planning to release a patch against 2.6.10. But their patch wont stand alone against 2.6.10 and so it might be difficult for you to integrate their code into a scheduler for plugsched." Shailabh Nagar, who had done work on the project, replied:

Thats true. The current CKRM CPU scheduler is not a standalone component...if it were made one, it would need a non-CKRM interface to define classes, set their shares etc.

However, we have not investigated the possibility of making our CPU scheduler a pluggable one that could be loaded into a kernel equipped with the plugsched patches AND the CKRM framework. This should be possible but not a high priority until there is more consensus for having CPU schedulers pluggable at all (we have more basic stuff to fix in our scheduler such as load balancing).

Of course, we're more than happpy to work with someone willing to chip in and make our scheduler pluggable.

Elsewhere, Valdis Kletnieks suggested including plugsched in Andrew Morton's -mm series, comparing the situation to the disk elevator code. He said, "we started with one disk elevator, and now we have 3 or 4 that are selectable on the fly after some banging around in -mm." [...] "All the arguments that support having more than one elevator apply equally well to the CPU scheduler...." Jens Axboe, on the other hand, felt the two were completely different kettles of fish. He said, "Yes they are both schedulers, but that's about where the 'similarity' stops. The CPU scheduler must be really fast, overhead must be kept to a minimum. For a disk scheduler, we can affort to burn cpu cycles to increase the io performance. The extra abstraction required to fully modularize the cpu scheduler would come at a non-zero cost as well, but I bet it would have a larger impact there. I doubt you could measure the difference in the disk scheduler." He added, "There are vast differences between io storage devices, that is why we have different io schedulers. I made those modular so that the desktop user didn't have to incur the cost of having 4 schedulers when he only really needs one."

Marc replied, "Modularization usually is done through a level of indirection (function pointers). I have a can of "indirection be gone" almost ready to spray over the plugsched framework that would reduce the overhead to zero at runtime. I'd be happy to finish that work if it makes it more palpable to integrate a plugsched framework into the kernel?" Con Kolivas replied, "The indirection was a minor point. On modern cpus it was suggested by wli" [William Lee Irwin III] "that this would not be a demonstrable hit in perormance. Having said that, I'm sure Peter would be happy for another developer. I know how tiring and lonely it can feel maintaining such a monster." Peter agreed whole-heartedly that "the more hands the lighter the load." He went on:

Another issue (than indirection) that I think needs to be addressed at some stage is freeing up the memory occupied by the code of the schedulers that were unlucky not to be picked. Something like what __init offers only more selective.

And the option of allowing more than one CPU per run queue is another direction that needs addressing. This could allow a better balance between the good scheduling fairness that is obtained by using a single run queue with the better scalability obtained by using separate run queues.

But this idea was not explored further during the thread.

4. Linux 2.6.11-rc1-mm2 Released

19 Jan  - 21 Jan  (23 posts) Archive Link: "2.6.11-rc1-mm2"

Topics: Ioctls, Kernel Release Announcement, Kexec

People: Andrew Morton

Andrew Morton announced Linux 2.6.11-rc1-mm2, saying:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc1/2.6.11-rc1-mm2/

5. Status Of RelayFS

19 Jan  - 23 Jan  (11 posts) Archive Link: "[PATCH] relayfs redux for 2.6.10: lean and mean"

Topics: FS: procfs

People: Karim Yaghmour, Greg KH

Karim Yaghmour said:

I've reworked the relayfs patch extensively. The API and internals have been heavily purged. The patch is in fact almost HALF! its original size, loosing 90KB and going from 200KB to 110KB.

Here's a summary of changes:

I've tested this with a hacked ltt patch and I can get it to collect data in the managed mode without a problem. Reading the data though is another story, I'll update the LTT patch once I know where the relayfs stuff is heading. Beware: don't try to use the ltt code in Andrew's tree with this relayfs, it's a completely different beast.

So without further ado, here's the code. I've removed the Documentation/filesystems/relayfs.txt file for now (no, don't worry that's not part of the 90KB that went away ;). It's probably going to need some rewriting before we're done, so let's not get distracted by it for now.

Karim

The full patch is here:
http://www.opersys.com/ftp/pub/relayfs/patch-relayfs-redux-2.6.10-050120-real
http://www.opersys.com/ftp/pub/relayfs/patch-relayfs-redux-2.6.10-050120-real.bz2

Greg KH and Pekka Enberg read the patch and offered specific suggestions. Then on an architectural note, Greg said:

Hm, how about this idea for cutting about 500 more lines from the code:

Why not drop the "fs" part of relayfs and just make the code a set of struct file_operations. That way you could have "relayfs-like" files in any ram based file system that is being used. Then, a user could use these fops and assorted interface to create debugfs or even procfs files using this type of interface.

As relayfs really is almost the same (conceptually wise) as debugfs as far as concept of what kinds of files will be in there (nothing anyone would ever rely on for normal operations, but for debugging only) this keeps users and developers from having to spread their debugging and instrumenting files from accross two different file systems.

Karim replied:

However this assumes that the users of relayfs are not going to want it during normal system operation. This is an assumption that fails with at least LTT as it is targeted at sysadmins, application developers and power users who need to be able to trace their systems at any time.

I don't mind piggy-backing off another fs, if it makes sense, but unlike debugfs, relayfs is meant for general use, and all files in there are of the same type: relay channels for dumping huge amounts of data to user-space. It seems to me the target audience and basic idea (relay channels only in the fs) are different, but let me know if there's a compeling argument for doing this in another way without making it too confusing for users of those special "files" (IOW, when this starts being used in distros, it'll be more straightforward for users to understand if all files in a mounted fs behave a certain way than if they have certain "odd" files in certain directories, even if it's /proc.)

Greg asked, "since you are proposing that relayfs be mounted all the time, where do you want to mount it at? I had to provide a "standard" location for debugfs for people to be happy with it, and the same issue comes up here." Karim replied, "this is a very good question. We've taken to the habit of having a /relayfs. If this is too problematic, I don't see any problem with /mnt/relayfs also. In either case, I have to admit frankly that I'm not familiar with the exact formal rules for introducing something like this. Of course I'm aware of the FHS and LSB, but let me know what you think is the best way to proceed here." However, the conversation did not continue on the mailing list.

6. Some Debate Over OOM Killer Future

20 Jan  - 21 Jan  (13 posts) Archive Link: "oom killer gone nuts"

Topics: Big Memory Support, OOM Killer, Virtual Memory

People: Marcelo Tosatti, Andries Brouwer, Andrea Arcangeli, Jens Axboe, Andrew Morton

Jens Axboe reported problems with the OOM (out-of-memory) killer, where it would kill processes apparently indiscriminately, even though there was plenty of RAM free. Andries Brouwer remarked sardonically that the OOM killer should just be removed entirely as a failed idea. But Marcelo Tosatti replied:

There is a user requirement for overcommit mode, you know.

Saying "hey, there's no more overcommit mode in future v2.6 releases, you run out of memory and get -ENOMEM" is not really an option is it?

You propose to remove the OOM killer and do what? Lockup solid?

It is _WAY_ off right now: look at the amount of free pages:

DMA free:4536kB min:60kB low:72kB high:88kB active:0kB inactive:0kB present:16384kB pages_scanned:0 all_unreclaimable? no protections[]: 0 0 0
Normal free:524648kB min:4028kB low:5032kB high:6040kB active:76508kB inactive:81760kB present:1031360kB pages_scanned:0 all_unreclaimable? no protections[]: 0 0 0
HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no protections[]: 0 0 0
DMA: 556*4kB 155*8kB 65*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4536kB
Normal: 29800*4kB 25115*8kB 6953*16kB 1251*32kB 326*64kB 103*128kB 31*256kB 12*512kB 3*1024kB 1*2048kB 0*4096kB = 524648kB
HighMem: empty

v2.4 gets it pretty much right for most cases, and its obviously screwed up right now in v2.6.

Andrea/Thomas were working on getting it fixed ??
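The free-list lines in Marcelo's dump can be checked with a short Python sketch. The helper name is made up for illustration; the input strings and the watermark value are copied from the dump above. Each term is "count*sizekB", and the sum should match the zone's reported free memory.

```python
import re


def free_list_total_kb(line):
    """Sum the 'count*sizekB' terms of a free-list line, in kB."""
    terms = re.findall(r"(\d+)\*(\d+)kB", line)
    return sum(int(count) * int(size) for count, size in terms)


# Free-list lines copied from the dump quoted above.
DMA = ("556*4kB 155*8kB 65*16kB 1*32kB 0*64kB 0*128kB 0*256kB "
       "0*512kB 0*1024kB 0*2048kB 0*4096kB")
NORMAL = ("29800*4kB 25115*8kB 6953*16kB 1251*32kB 326*64kB 103*128kB "
          "31*256kB 12*512kB 3*1024kB 1*2048kB 0*4096kB")

# "high" watermark reported for the Normal zone, in kB.
NORMAL_HIGH_KB = 6040

dma_free_kb = free_list_total_kb(DMA)
normal_free_kb = free_list_total_kb(NORMAL)
normal_above_high = normal_free_kb > NORMAL_HIGH_KB
```

The sums come to 4536kB and 524648kB, matching the dump, and the Normal zone's free memory is far above its high watermark, which is why Marcelo calls the OOM kill "_WAY_ off".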

Andries replied:

Right now we have three overcommit modes. They are specified by:

0: overcommit, but keep it reasonable (the current default)
1: overcommit, always say yes
2: keep track of all our obligations, do not overcommit

So, one has the right to expect that no OOM situation can occur in overcommit mode 2. But in 2.6.10 it can. That is a bug. The conclusion must be that bookkeeping is done incorrectly. Perhaps also mode 0 is affected by that same bug.

Now you ask what I propose. There is no hurry worrying about that - the first thing should be to fix the bookkeeping problem.

But assume that fixed. Then everybody can run in mode 2 and never have any problems. That is what I do.

Yes, you say, but that is an inefficient use of memory. Perhaps. That is the price I am willing to pay for the guarantee that my processes are not killed at some random moment.

But if someone else does not do anything of importance and doesnt care if his processes die at arbitrary moments if only things go as fast as possible and use as much of his precious memory as possible, then also for him overcommit mode 2 can be useful. It is accompanied by the variable overcommit_ratio R - the amount of memory that can be used is Swap + Memory*(R/100). Here R can be larger than 100, so in overcommit mode 2 one can specify very precisely what amount of overcommitment is considered acceptable.

Very few people run overcommit mode 2, and lots of things are badly tested. It cannot become the default today. But I would like to see it the default at some future moment.
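The limit Andries describes for overcommit mode 2 is simple enough to write down directly. The Python sketch below follows only the formula in his post, Swap + Memory*(R/100); the function names and the example figures in the usage note are assumptions for illustration, and the kernel's real accounting is not modeled here.

```python
def commit_limit_kb(swap_kb, ram_kb, overcommit_ratio):
    """Commit limit in kB, per the formula Swap + Memory * (R / 100)."""
    return swap_kb + ram_kb * overcommit_ratio // 100


def mode2_allows(committed_kb, request_kb, swap_kb, ram_kb, overcommit_ratio):
    """True if a new request still fits under the mode 2 commit limit."""
    limit = commit_limit_kb(swap_kb, ram_kb, overcommit_ratio)
    return committed_kb + request_kb <= limit
```

With a hypothetical 1048576kB of RAM and 524288kB of swap, a ratio of 50 gives a limit of 1048576kB, while a ratio of 150 gives 2097152kB. The second case shows Andries's point that R may exceed 100, which lets an administrator choose exactly how much overcommitment to tolerate.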

Close by, Andrea Arcangeli replied to Marcelo:

I'm working on fixing it, not just tuning it. The bugs in mainline aren't about the selection algorithm (which is normally what people calls oom killer). The bugs in mainline are about being able to kill a task reliably, regardless of which task we pick, and every linux kernel out there has always killed some task when it was oom. So the bugs are just obvious regressions of 2.6 if compared to 2.4.

But this is all fixed now, I'm starting sending the first patches to Andrew very shortly (last week there was still the oracle stuff going on). Now I can fix the rejects.

I will guarantee nothing about which task will be picked (that's the old code at works, I changed not a bit in what normally people calls "the oom killer", plus the recent improvement from Thomas), but I guarantee the VM won't kill tasks right and left like it does now (i.e. by invoking the oom killer multiple times).

Andries was very skeptical, and insisted that the default should be to not kill any processes. But the discussion petered out at this point. However, Andrea later posted a bunch of patches for various fixes in this area; and it looked as though Andrew Morton was in the process of applying them to his -mm tree.

7. Mysterious Disk-Space Reportage

21 Jan  - 24 Jan  (8 posts) Archive Link: "negative diskspace usage"

Topics: FS: ext3

People: Wichert Akkerman, Andries Brouwer

Wichert Akkerman reported:

After cleaning up a bit df suddenly showed interesting results:

Filesystem            Size  Used Avail Use% Mounted on
/dev/md4             1019M  -64Z  1.1G 101% /tmp
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md4               1043168 -73786976294838127736   1068904 101% /tmp

This is on a ext3 filesystem on a 2.6.10-ac10 kernel.

Andries Brouwer replied:

The Used column is total-free, so free was 2^66 + 964440. That 2^66 no doubt was 2^64 in a computation counting 4K-blocks, and arose at some point where a negative number was considered unsigned.

But having available=1068904 larger than free=964440 is strange.

I assume this was produced by statfs or statfs64 or so. You can check using "strace -e statfs64 df /dev/md4" that these really are the values returned by the kernel, so that we can partition the blame between df and the kernel.

The values are computed by

buf->f_blocks = es->s_blocks_count - overhead;
buf->f_bfree = ext3_count_free_blocks (sb);
buf->f_bavail = buf->f_bfree - es->s_r_blocks_count;

that is: blocks = total - overhead, and available = free - reserved. strace shows three values, and I expect tune2fs or so will show 2 more.

More available than free sounds like a negative count of reserved blocks. Are you still able to examine the situation?
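Andries's arithmetic can be reproduced from the df output alone. The Python sketch below uses only the numbers Wichert posted (in 1K-blocks) and the two relations Andries states, used = total - free and available = free - reserved; the variable names are made up for illustration.

```python
# Values from the "df" output above, in 1K-blocks.
TOTAL_1K = 1043168
USED_SHOWN_1K = -73786976294838127736
AVAIL_1K = 1068904

# df shows used = total - free, so free = total - used.
free_1k = TOTAL_1K - USED_SHOWN_1K

# Strip the spurious 2^66 (that is, 2^64 when counting 4K-blocks).
free_without_wrap_1k = free_1k - 2**66

# available = free - reserved, so reserved = free - available.
reserved_1k = free_without_wrap_1k - AVAIL_1K
reserved_4k = reserved_1k // 4
```

This gives a free count of 2^66 + 964440, as Andries says, and an implied reserved count of -104464 1K-blocks (-26116 4K-blocks), which is the "negative count of reserved blocks" he suspects.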

Wichert confirmed that the values came from statfs64, but said he was no longer able to examine the situation. He did, however, say:

I do have some more information. A e2fsck run on that filesystem was just as interesting:

/dev/md4: clean, 16/132480 files, -15514/264960 blocks

Forcing an e2fsck revelated a few groups with incorrect block counts:

Free blocks count wrong for group #2 (34308, counted=32306).
Free blocks count wrong for group #6 (45805, counted=32306).
Free blocks count wrong for group #8 (14741, counted=2354).
Free blocks count wrong (280474, counted=252586).

After fixing those everything returned to normal. I did run dumpe2fs on the filesystem, if that is interesting I can retrieve and post that.

Andries replied, "It is an interesting situation, but probably there is not enough information to find out what happened."
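The e2fsck numbers Wichert posted are at least internally consistent, which a few lines of Python can confirm. Everything below is taken from his output; only the variable names are invented. The negative "used" figure is just total minus the recorded free count, and the three per-group corrections add up exactly to the correction in the filesystem-wide free count.

```python
TOTAL_BLOCKS = 264960

# group number -> (recorded free blocks, counted free blocks)
GROUP_FIXES = {2: (34308, 32306), 6: (45805, 32306), 8: (14741, 2354)}

FREE_RECORDED = 280474
FREE_COUNTED = 252586

used_before = TOTAL_BLOCKS - FREE_RECORDED  # what e2fsck first printed
used_after = TOTAL_BLOCKS - FREE_COUNTED    # after the counts were fixed

group_excess = sum(rec - cnt for rec, cnt in GROUP_FIXES.values())
total_excess = FREE_RECORDED - FREE_COUNTED
```

The recorded free count exceeded the real one by 27888 blocks, all of it accounted for by groups 2, 6, and 8. That explains the -15514 in the first e2fsck line, though not how the group counters went wrong in the first place.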

8. Weird Behavior On Seek Or Append In SysFS

21 Jan  - 24 Jan  (4 posts) Archive Link: "[PATCH 1/3] disallow seeks and appends on sysfs files"

Topics: FS: sysfs

People: Mitch Williams, Greg KH

Mitch Williams posted a patch which "causes sysfs to return errors if the caller attempts to append to or seek on a sysfs file." Greg KH asked what the current SysFS behavior was if one attempted an append or seek, and Mitch replied:

Because the store method doesn't have an offset argument, it must assume that all writes are based from the beginning of the buffer.

So if your sysfs file contains "123" and you do

echo "45" >> mysysfsfile

instead of the expected "12345", you end up with "45" in the file with no errors. Opening the file, seeking, and writing gives the same type of behavior, with no errors.

This patch just sets a few flags to make sure that errors are returned when this behavior is seen. Logically then, the two "features" do the same thing (set flags), and prevent the same behavior (writing wrong contents without error).
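The behavior Mitch describes can be modeled in userspace with a toy Python class. This is not sysfs code: the class and parameter names are hypothetical, and the specific errno values raised in the "patched" mode are assumptions, since the summary above does not say which errors the patch returns. The point is only that a store method with no offset argument can do nothing except replace the whole value.

```python
import errno


class SysfsAttrModel:
    """Toy model of a sysfs attribute whose store() gets no offset."""

    def __init__(self, value, reject_seek_and_append=False):
        self.value = value
        self.strict = reject_seek_and_append  # models the patched behavior

    def write(self, data, offset=0, append=False):
        if self.strict and append:
            # Assumed errno; the real patch may return something else.
            raise OSError(errno.EINVAL, "append not supported")
        if self.strict and offset != 0:
            # Assumed errno; the real patch may return something else.
            raise OSError(errno.ESPIPE, "seek not supported")
        # store() sees only the buffer, so offset and append are ignored
        # and the old contents are replaced from the beginning.
        self.value = data
        return len(data)
```

In the unpatched model, appending "45" to an attribute holding "123" silently leaves "45" (a real echo would also write a trailing newline, which is ignored here). In the strict model the same call raises an error and the value stays "123".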

Greg was astonished that this was the case, and said he'd accept the patch; but he wanted Mitch to split it into two patches, one to handle seeks, and one for appends. Mitch was OK with this.

9. Software Suspend Under SMP

24 Jan  - 25 Jan  (3 posts) Archive Link: "Enable swsusp on SMP machines"

Topics: SMP, Software Suspend

People: Pavel Machek

Pavel Machek said, "This enables swsusp on SMP machines. It should be working in 2.6.10, already (but you may need noapic in 2.6.10)."

10. superio scx200 Module Renamed To scx

24 Jan  (4 posts) Archive Link: "[1/1] superio: change scx200 module name to scx."

People: Greg KH

Evgeniy Polyakov changed the superio scx200 module name to scx. Greg KH accepted the patch. There was no discussion.

11. New timeofday Core Subsystem

24 Jan  - 26 Jan  (25 posts) Archive Link: "[RFC][PATCH] new timeofday core subsystem (v. A2)"

People: John Stultz

John Stultz said:

Here is a new release of my time of day proposal, which include ppc64 support as well as suspend/resume and cpufreq hooks. For basic summary of my ideas, you can follow this link: http://lwn.net/Articles/100665/

This patch implements the architecture independent portion of the time of day subsystem. Included is timeofday.c (which includes all the time of day management and accessor functions), ntp.c (which includes the ntp scaling code, leapsecond processing, and ntp kernel state machine code), timesource.c (for timesource specific management functions), interface definition .h files, the example jiffies timesource (lowest common denominator time source, mainly for use as example code) and minimal hooks into arch independent code.

The patch does not function without minimal architecture specific hooks (i386, x86-64, and ppc64 examples to follow), and it can be applied to a tree without affecting the code.

New in this version:

12. Marvell mv64xxx I2C Driver

25 Jan  - 26 Jan  (6 posts) Archive Link: "[PATCH][I2C] Marvell mv64xxx i2c driver"

Topics: I2C

People: Mark A. Greer, Greg KH, Jean Delvare

Mark A. Greer said, "Marvell makes a line of host bridge for PPC and MIPS systems. On those bridges is an i2c controller. This patch adds the driver for that i2c controller." Jean Delvare and Greg KH had some technical comments, which Mark attempted to address in subsequent versions of his patch. Folks seemed to look favorably on the patch overall.

13. Software Suspend 2.1.5.7B For Linux 2.4.28

25 Jan  (1 post) Archive Link: "Software Suspend for 2.4 Final Release"

Topics: Forward Port, Software Suspend

People: Nigel Cunningham

Nigel Cunningham said:

SoftwareSuspend 2.1.5.7B for the 2.4.28 kernel is now available from softwaresuspend.berlios.de (http://softwaresuspend.berlios.de) .

Bug fixes and forward ports to 2.4.29 and later kernels notwithstanding, it is intended to be the last release of SoftwareSuspend for the 2.4 series kernels.

The 2.4 version of Suspend is generally pretty easily to get going, but if you have any questions or problems, you will find lots of resources at softwaresuspend.berlios.de (http://softwaresuspend.berlios.de) . In particular, there are HOWTOs, FAQs, and a Wiki that you can consult before asking on the mailing lists you'll also find there.

Fuller instructions regarding applying the package can be found in the README file, included in the package.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.