Kernel Traffic #291 For 4 Jan 2005

By Zack Brown

Mailing List Stats For This Week

We looked at 3430 posts in 21201K.

There were 555 different contributors. 313 posted more than once. 164 posted last week too.

1. Possible Changes To Kernel Stable/Unstable Development Methods

1 Dec 2004 - 16 Dec 2004 (108 posts) Subject: "page fault scalability patch V12 [0/7]: Overview and performance"

People: Linus Torvalds, Jeff Garzik, Andrew Morton, Christoph Lameter

Christoph Lameter posted some page fault performance improvements, which Linus Torvalds liked, but Linus said, "I don't want to apply this before I get 2.6.10 out the door, but I'm happy with it." Jeff Garzik asked, "Does that mean that 2.6.10 is actually close to the door?" And Andrew Morton replied:

We need an -rc3 yet. And I need to do another pass through the regressions-since-2.6.9 list. We've made pretty good progress there recently. Mid to late December is looking like the 2.6.10 date.

We need to be be achieving higher-quality major releases than we did in 2.6.8 and 2.6.9. Really the only tool we have to ensure this is longer stabilisation periods.

Of course, nobody will test -rc3 and a zillion people will test final 2.6.10, which is when we get lots of useful bug reports. If this keeps on happening then we'll need to get more serious about the 2.6.10.n process.

Or start alternating between stable and flakey releases, so 2.6.11 will be a feature release with a 2-month development period and 2.6.12 will be a bugfix-only release, with perhaps a 2-week development period, so people know that the even-numbered releases are better stabilised.

We'll see. It all depends on how many bugs you can fix in the next two weeks ;)

I expected a big discussion about this, but no. The thread veered off into ways of doing regression testing, and automating test-suites for each kernel release. Only a few folks had any comments to make about the possible change of development process, and there was no significant discussion.

2. Dynamically Defined HZ Value Coming To 2.6

11 Dec 2004 - 22 Dec 2004 (126 posts) Subject: "dynamic-hz"

Topics: Forward Port, Power Management: ACPI

People: Andrea Arcangeli, Andrew Morton, Pavel Machek, Con Kolivas

Andrea Arcangeli said:

The below patch allows to set the HZ dynamically at boot time with command line parameter. HZ=1000 HZ=100 HZ=333 any other value just works (though certain value may cause more or less drift to the system time advance/decrease).

Is there any interest from the mainline developers to merge this into 2.6? I'm getting requests for this feature being forward ported to 2.6 (both for batch jobs and for the powersaved that can trim the hz down to 80mhz). It should be up to the user to choose the HZ like it was in 2.4-aa.

This patch is quite intrusive since many HZ visible to userspace have to be converted to USER_HZ, and most important because HZ isn't available at compile time anymore and every variable in function of HZ must be either changed to be in function of USER_HZ or it must be initialized at runtime. The code has debugging code (optional at compile time) so that I can guarantee that there cannot be any regression.

Technically this makes a lot of sense to me (well, you can guess why I implemented it in the first place), at least in archs where one cannot reprogram the timer chip in a performant way (to stop timer ticks completely until the next posted timer). This is in production for years in SLES8 btw.

http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.23aa3/9999_zzz-dynamic-hz-5.gz
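As a rough illustration of the conversion Andrea describes, the sketch below treats HZ as a value known only at run time and scales tick counts to a fixed userspace rate. It is a userspace sketch and not code from the patch: the function names are hypothetical, and USER_HZ is assumed to be 100.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed tick rate exposed to userspace; 100 is an assumption for this sketch. */
#define USER_HZ 100

/*
 * Convert a tick count taken at the boot-time rate `hz` into USER_HZ units.
 * Hypothetical helper: `hz` is a parameter because it is no longer a
 * compile-time constant.
 */
uint64_t ticks_to_user_hz(uint64_t ticks, unsigned int hz)
{
    assert(hz > 0);
    return ticks * USER_HZ / hz;
}

/* Convert a timeout in milliseconds to ticks at rate `hz`, rounding up. */
uint64_t msecs_to_ticks(uint64_t msecs, unsigned int hz)
{
    assert(hz > 0);
    return (msecs * hz + 999) / 1000;
}
```

With hz set to 333, a 10 ms timeout rounds up to 4 ticks. Values like this are what must be initialized at run time once HZ stops being a compile-time constant.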

Several folks expressed interest in this, and Pavel Machek asked what the overhead was. Andrea replied, "It's not a measurable difference." Con Kolivas pointed out (and Andrea agreed) that the value of this patch would not be seen by desktop users. Andrea acknowledged:

Sure, desktop doesn't need this, the reason somebody is asking for it, is that the desktop stuff hurted some other non-desktop usages. Infact my 2.4 tree was setting by default HZ=1000 if 'desktop' paramter was passed to the kernel (so that I could lower the timeslice accordingly too, without losing the effect of the nicelevels between nice 0 and +19).

The other new case where I'm asked for this feature is again not the desktop but the high end laptop with cpu throttling down to 80mhz, and what Pavel mentioned about the lower consumption. Perhaps we could do variable HZ there, though I doubt it has a pit that can be reprogrammed with sane performance.

Very few people are going to get real benefit from HZ=1000, but I certainly agree it worth to keep HZ=1000 on desktops since on a idle machine the downside of the more frequent irq sure isn't measurable, while having shorter timeslices may be visible with many tasks, and shorter timeslices requires faster HZ to preserve the nicelevels.

There was a fairly long discussion about the potential benefits of Andrea's patch (or lack thereof), and at one point Andrew Morton remarked:

There are apparently some laptops which exhibit appreciable latency between the start of ACPI sleep and actually consuming less power. The 1ms wakeup frequency will shorten battery life on these machines significantly. (I forget the exact numbers - Len will know).

So I guess we're going to have to do this sometime - I don't think there's any other solution apart from going fully tickless, which would be considerably more intrusive.

We should retain the option of compile-time constant HZ - it's easy enough. Probably the patch already does that.

The discussion of merits continued for some time, though Andrew's post seemed to decide the issue in favor of the patch; a final interesting tidbit came regarding the minimum possible HZ value, when Pavel said that he'd "tried defining HZ to 10 once, and there are some #if arrays in the kernel that prevented me from doing that." Andrea replied, "I guess you're right and the minimum is HZ=12. I'm pretty sure I could go down to 25, perhaps the absolute minium was 12 and not 10."

3. Linux 2.6.10-rc3-mm1

13 Dec 2004 - 17 Dec 2004 (25 posts) Subject: "2.6.10-rc3-mm1"

Topics: Kernel Release Announcement, Software Suspend

People: Andrew Morton, Nigel Cunningham

Andrew Morton announced Linux 2.6.10-rc3-mm1, saying:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10-rc3/2.6.10-rc3-mm1/

Kasper Sandberg pointed out that there seemed to be a lot of swsusp1 work going into 2.6; he suggested that swsusp2 should be merged instead. Andrew referred Kasper to the discussion covered in Issue #289, Section #2 (24 Nov 2004: Status Of Software Suspend), saying that Nigel Cunningham's work on Suspend2 was not yet ripe for inclusion. Nigel replied that he'd been working to address many of the issues raised in that thread, and that he'd be submitting a new patch-set soon.

4. Status Of ioctls In Linux

14 Dec 2004 - 22 Dec 2004 (11 posts) Subject: "ioctl assignment strategy?"

Topics: FS: sysfs, Ioctls

People: Greg KH, Chris Friesen, Alan Cox, Lee Revell, Olivier Galibert

Al Hooton asked about the policies for official ioctl assignment, having looked in all the docs and list archives he could find; Greg KH replied, "why do you want to use an ioctl? ioctls are generally frowned upon these days, and trying to add a new one is a tough and arduous process, that is not for the weak, or faint of heart." Chris Friesen asked, "what other options would you suggest for arbitrary char devices to allow for control that doesn't fit nicely into the read/write paradigm?" Greg said, "Rethink the way you want to control your device. Seriously, a lot of ioctls can be broken down into single device files, single sysfs files, or other such things (a whole new fs as a last resort too.)" Chris asked what the big problem was with ioctls, and Greg said, "ioctls are basically a simple way to add any kind of syscall to the kernel. They also have nasty 32/64 bit issues. Because we want to have well-defined syscalls that work on all platforms, and not any arbitrary type of call, it is good to restrict ioctls." Alan Cox, close by, also said:

Ioctls do have some serious problems that make them nice to avoid

  1. Each ioctl handler has its own data structures. While you could write XML objects to encapsulate this in write() it is also true in many cases that there is a simple logical expression of the operation - eg configuration options tend to fit well into files as you can see with /sysfs - unless they need to be atomic transactions with rollback at which point the same people who decry ioctl will hate embedding sqlite in the kernel

    Seriously however - multiple structures means multiple validation functions means more new code and more errors. It's a lot easier to get ioctls wrong. There are a lot of things that don't need to be ioctl. A look at security history says in general "ioctls cause bugs"

  2. Ioctl structures tend to be binary. Welcome to 32/64bit emulation hell. Good design can avoid this. Good design is not XML for this purpose.
  3. Ioctl is unstructured and so each ioctl is a new mystery to the programmer. We all know how write works and in many cases echo "451" > /proc/sys/vm/blah is quite obvious.
  4. It's hard to ioctl from the command line or scripts

The "ioctls are evil" blind hate department really annoy me however because like all extreme views the truth very rarely fits their model

Lee Revell added, "Another objection was that all ioctls take the BKL. I think you did not hear this one raised as much because it reflected a deficiency in the system. But now at least 2 different solutions have been posted for BKL-less ioctls so that objection is no longer valid." Olivier Galibert also added to Alan's list, "ioctls don't have a reliable size information in the call, making them hard to forward over a network in a generic way, or even pass to another userspace process."
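Points 2 and 3 in Alan's list can be sketched in userspace C. The first struct below has a layout that depends on the ABI, which is where 32/64-bit emulation trouble comes from; the second uses fixed-width fields so the layout is the same on both. The parser shows the text alternative, accepting a value such as the "451" in Alan's echo example. All names here are hypothetical and none of this is taken from the thread or from kernel sources.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Layout depends on the ABI: long and pointers are 4 bytes on 32-bit
 * Linux and 8 bytes on typical 64-bit Linux. */
struct cfg_native {
    int   id;
    long  value;
    void *buf;
};

/* Fixed-width fields and explicit padding: same layout on 32 and 64 bit. */
struct cfg_fixed {
    uint32_t id;
    uint32_t pad;
    uint64_t value;
    uint64_t buf;   /* user pointer carried as a 64-bit integer */
};

/*
 * Parse a decimal setting such as "451\n", as a write() handler might.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
int parse_setting(const char *text, long *out)
{
    char *end;
    long v;

    assert(text != NULL && out != NULL);
    errno = 0;
    v = strtol(text, &end, 10);
    if (end == text || errno == ERANGE)
        return -1;
    if (*end == '\n')   /* echo appends a newline */
        end++;
    if (*end != '\0')
        return -1;
    *out = v;
    return 0;
}
```

The text form needs only one validation routine and no per-architecture translation, at the cost of parsing on every write.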

5. Linux 2.4.29-pre2 Released

16 Dec 2004 (3 posts) Subject: "Linux 2.4.29-pre2"

Topics: FS: XFS, Security, USB

People: Marcelo Tosatti, Adam Heath

Marcelo Tosatti announced Linux 2.4.29-pre2, saying:

It contains a relatively small number of changes: XFS sync, SPARC64 sync, USB gadget updates, couple of libata fixes, amongst others.

Also a networking update, containing fixes for following recently discovered security issues:

CAN-2004-1137
IGMP vulnerabilities - local priveledge escalation and remote DoS:
http://isec.pl/vulnerabilities/isec-0018-igmp.txt

CAN-2004-1016
scm_send local DoS:
http://isec.pl/vulnerabilities/isec-0019-scm.txt

Adam Heath remarked, "I don't know if you've been following, but it was recently discoverd that on smp, if multiple processes read from /dev/urandom at the same time, they can get the same data. Theodore Y. T'so posted a patch to fix this for 2.6, and someone else told me this problem has existed all the way back to 1.3. This is a security issue, and should be included in the 2.4 tree." Marcelo replied, "Yes, I'm aware of it, Tytso is working on v2.4 backport of the correct locking. Thanks for the reminder!"

6. Linux 2.6.9-ac16 Released

16 Dec 2004 - 21 Dec 2004 (21 posts) Subject: "Linux 2.6.9-ac16"

Topics: Kernel Release Announcement

People: Alan Cox, Arjan van de Ven

Alan Cox announced Linux 2.6.9-ac16, saying:

Further small fixes for different minor things. A merge of some of the small cleanups from Fedora work and also the fixes for the igmp and vc holes.

Arjan van de Ven is now building RPMS of the kernel and those can be found in the RPM subdirectory and should be yum-able. Expect the RPMS to lag the diff a little as the RPM builds and tests do take time.

The HPT366 rework project is also not ready (its gone back to the drawing board until the current panic is over if you are a volunteer and wondered what is up).

ftp://ftp.kernel.org/pub/linux/kernel/people/alan/linux-2.6/2.6.9/

7. Which 2.6 Branch To Use

16 Dec 2004 (4 posts) Subject: "2.6 flavours"

Topics: Version Control

People: Maciej Soltysiak, Alan Cox, Andrew Morton

Maciej Soltysiak remarked:

AFAICS the -ac tree should be the most stable of all kernels, right?

-mm is totally bleeding edge
-bk the same
-ck is experimental

Others are experimental too.

Looking at the changelogs, the most reasonable kernel to use for generic use are the -ac kernels, which I am going to use since 2.6.10 as long as Alan is kindly going to continue his fabulous work.

I swear not to use 2.6.10 until Alan publishes 2.6.10-ac1 :-)

Someone pointed out that Andrew Morton's -mm tree might be bleeding edge, but that Andrew made a conscious choice about when to do each release, and that this choice probably took stability into account. Alan Cox said:

2.6.x-mm is more like some of the work the old 2.4-ac did in merging new stuff (its also worth noting that 2.4-ac ended up more stable than 2.4 at times so -mm might be stable)

The -ac tree is trying to be fairly conservative. When I merge stuff that is a little less conservative because it has to be done then I've tried to put a note in the relnotes for that release warning people its more testing grade.

8. usbmon Debugging Tool; Location Of Debug Directory

19 Dec 2004 - 23 Dec 2004 (14 posts) Subject: "My vision of usbmon"

Topics: FS: sysfs, Hot-Plugging, USB

People: Pete Zaitcev, Nick Piggin, Greg KH, Jeff Garzik

Pete Zaitcev said:

This is usbmon which I cooked up because I got tired from adding dbg()'s and polluting my dmesg. I use it to hunt bugs in USB storage devices so far, and it's useful, although limited at this stage.

I looked at the Harding's USBmon patch, and I think he got a few things right. The main of them is that I underestimated the benefits of placing the special files into the filesystem namespace. When we discussed it with Greg in the airport, we decided that having some sort of Netlink-style socket would be the best option. I decided to make a u-turn and attach those sockets into the namespace (currently under /dbg, but it can change). What this buys us is:

  1. cat(1): never bet against it. It's too handy. And netcat is just not the same.
  2. USBmon userland in Java. Just try to hack in JNI a little as I have and you'll see.

He also got some parts wrong. They are small things, but unfortunately, pervasive. For example, he relies on urb->dev, which is not a good idea in case of HCD which zero it far away from the completion call site, such as usb-ohci in 2.4. And it's error-prone and a maintenance problem to audit all HCDs and add usbmon calls. Races by design, too. Small things like that, but many. Eventually, I wrote everything from scratch. It's rather embarrassing that I could not save USBmon and gave in to NIH.

Since it's a big NIH, usbmon is not compatible with USBmon's userland. It can be made compatible, but it needs a small adaptation layer, because Harding aggregated at a device, and I do it on a bus (I can explain why, but it's rather long; it has to do with hotplug and races).

The architecture to support various output formats is present. Obvious candidates are Old USBmon format and a Binary format. But it's not done.

Please ask if something is not obvious in the code.

Greg KH loved all of this, and said he'd add it to the official tree whenever Pete felt it was ready.

Nick Piggin asked, "Is there any reason why these debug filesystems are going under the root directory? Why not /sys/debug or /sys/kernel/debug or something?" Greg said he didn't really care, but Jeff Garzik said that someone should pick a single location, and use that consistently. Greg replied:

Bah, fine, make me make a policy decision, damm I tried hard to resist :)

Anyway, here's a patch I just applied that creates the /sys/kernel/debug directory (you need a small patch that exports the proper subsys for this to work, if anyone wants that too, I'll send it.) Now, if you want, you can mount debugfs at this location.

Now either this is going to make people happy, or make them mad I didn't pick their proposed location. Either way, I'm going on vacation in 2 days, so I will not be around to hear the screams...
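For readers who want to try the location Greg chose, here is a minimal sketch of mounting debugfs from C with mount(2). It assumes a Linux kernel with debugfs available and a caller running as root; the helper name and the DEBUGFS_DIR macro are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mount.h>

/* Directory named in Greg's message above. */
#define DEBUGFS_DIR "/sys/kernel/debug"

/*
 * Mount debugfs on `target`. Returns 0 on success, -1 on failure
 * (errno is set by mount(2)). Requires root privileges.
 */
int mount_debugfs(const char *target)
{
    assert(target != NULL);
    /* The source string is not meaningful for a virtual filesystem;
     * "none" is the conventional placeholder. */
    if (mount("none", target, "debugfs", 0, NULL) != 0) {
        perror("mount debugfs");
        return -1;
    }
    return 0;
}
```

Calling mount_debugfs(DEBUGFS_DIR) corresponds to running `mount -t debugfs none /sys/kernel/debug` from a shell.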

9. Linux 2.4.29-pre3 Released

22 Dec 2004 (1 post) Subject: "Linux 2.4.29-pre3"

Topics: FS: NFS

People: Marcelo Tosatti

Marcelo Tosatti announced Linux 2.4.29-pre3, saying:

Here goes the third -pre of 2.4.29.

More importantly this release contains a correction for the "int 0x80 hole" security problem in AMD64 port (CAN-2004-1144).

It also contains a few important v2.6 backports (tty/ldisc and pty races), some hardening patches from Solar (none of those are exploitable bugs, just paranoic/early error detection), and a few networking updates.

This release should also fix the "NFS hang on unlink" issues present in v2.4.28.

It should appear in the kernel.org mirrors in a few minutes.

Sharon And Joy
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.