Kernel Traffic #260 For 5 Jun 2004

By Zack Brown

Introduction

Last issue I put out a call for help, after my computer crashed, leaving me with a slow loaner and nothing else. I'd like to thank the two or three dozen people who responded with very nice, generous offers. In the end, Professor Allan Cruse and Professor Greg Benson of the University of San Francisco Department of Computer Science (http://www.cs.usfca.edu) fixed me up with a really excellent dual PIII system. That's the same department responsible for the open source FlashMob Computing (http://www.flashmobcomputing.org) project, which in April of this year linked 150 volunteer personal systems together to produce a 77 GFlop supercomputer. Various problems prevented them from making use of the over 700 computers that arrived to participate, but clearly the level of public interest is high, and future experiments will likely produce ever-more-impressive results. I'm very grateful for their help with Kernel Traffic. They transformed an essentially unworkable situation into a much easier and better one.

Mailing List Stats For This Week

We looked at 1986 posts in 10592K.

There were 514 different contributors. 254 posted more than once. 189 posted last week too.

1. Speeding Up SATA

27 Mar 2004 - 3 Apr 2004 (115 posts) Subject: "[PATCH] speed up SATA"

Topics: Disks: IDE, Serial ATA, Virtual Memory

People: Jeff Garzik, Stefan Smietanowski, Nick Piggin, Andrea Arcangeli, Jens Axboe

Jeff Garzik said:

The "lba48" feature in ATA allows for addressing of sectors > 137GB, and also allows for transfers of up to 64K sector, instead of the traditional 256 sectors in older ATA.

libata simply limited all transfers to a 200 sectors (just under the 256 sector limit). This was mainly being careful, and making sure I had a solution that worked everywhere. I also wanted to see how the iommu S/G stuff would shake out.

Things seem to be looking pretty good, so it's now time to turn on lba48-sized transfers. Most SATA disks will be lba48 anyway, even the ones smaller than 137GB, for this and other reasons.

With this simple patch, the max request size goes from 128K to 32MB... so you can imagine this will definitely help performance. Throughput goes up. Interrupts go down. Fun for the whole family.

The attached patch is for 2.6.x kernels only. It should apply to 2.6.5-rc2 or later, including my latest 2.6-libata patch on kernel.org. This patch should be pretty harmless, but you never know what could happen when you throw the throttle wide open. Testing in -mm would be a good thing, for example :)

Volunteers are welcome to post a 2.4 backport of this patch to linux-ide@vger.kernel.org, and I'll merge it into my 2.4 libata queue.
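
The sizes Jeff quotes can be checked with a little arithmetic. The sketch below assumes 512-byte sectors, an 8-bit sector count for traditional ATA commands, and a 16-bit sector count for lba48 commands; the function and variable names are illustrative only and do not come from libata.

```python
SECTOR_BYTES = 512  # assumption: traditional 512-byte sectors


def max_transfer_bytes(count_bits: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Largest single transfer for a sector-count field of the given width."""
    # A count of 0 is read as "maximum", so the limit is 2**count_bits sectors.
    return (2 ** count_bits) * sector_bytes


lba28_max = max_transfer_bytes(8)    # traditional ATA: 256 sectors
lba48_max = max_transfer_bytes(16)   # lba48: 64K sectors
libata_old = 200 * SECTOR_BYTES      # the conservative 200-sector cap

print(lba28_max // 1024, "KiB per request (traditional limit)")
print(lba48_max // (1024 * 1024), "MiB per request (lba48 limit)")
print(libata_old // 1024, "KiB per request (old libata cap)")
```

Under these assumptions the traditional limit works out to the 128K Jeff mentions and the lba48 limit to 32MB.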

Stefan Smietanowski asked, "What will happen when a PATA disk lies behind a Marvel(ous) bridge, as in most SATA disks today?" And Jeff replied, "Larger transfers work fine in PATA, too. WRT bridges, it is generally the best idea to limit to UDMA/100 (udma), but larger transfers are OK." Stefan also asked, "Is large transfers mandatory in the LBA48 spec and is LBA48 really mandatory in SATA?" And Jeff replied:

Yes and no, in that order :) SATA doesn't mandate lba48, but it is highly unlikely that you will see SATA disk without lba48.

Regardless, libata supports what the drive supports. Older disks still work just fine.

Elsewhere, Nick Piggin suggested to Jeff that a maximum request size of 32M was too much. It would, he said, incur additional latency costs, as well as sacrifice the granularity of disk scheduling. He added, "I bet returns start diminishing pretty quickly after 1MB or so." Jeff disagreed, saying that his implementation simply exported the hardware maximums, and that it was up to the system administrator of the given machine to institute a disk scheduling policy appropriate for that machine. A long discussion followed, and at one point Andrea Arcangeli said:

this is not an I/O scheduler or VM issue.

the max size of a request is something that should be set internally to the blkdev layer (at a lower level than the I/O scheduler or the VM layer).

The point is that if you run read contigously from disk with a 1M or 32M request size, the wall time speed difference will be maybe 0.01% or so. Running 100 irqs per second or 3 irq per second doesn't make any measurable difference. Same goes for keeping the I/O pipeline full, 1M is more than enough to go at the speed of the storage with minimal cpu overhead. we waste 900 irqs per second just in the timer irq and another 900 irqs per second per-cpu in the per-cpu local interrupts in smp.

In 2.4 reaching 512k DMA units that helped a lot, but going past 512k didn't help in my measurements. 1M maybe these days is needed (as Jens suggested) but >1M still sounds overkill and I completely agree with Jens about that.

If one day things will change and the harddisk will require 32M large DMA transactions to keep up with the speed of the disk, the thing should be still solved during disk discovery inside the blkdev layer. The "automagic" suggestions discussed by Jamie and Jens should be just benchmarks internal to the blkdev layer, trying to read contigously first with 1M then 2M then 4M etc.. until the speed difference goes below 1% or whatever similar "autotune" algorithm.

But definitely this is not an I/O scheduler or VM issue, it's all about discovering the minimal DMA transaction size that provides peak bulk I/O performance for a certain device. The smaller the size, the better the latencies and the less ram will be pinned at the same time (i.e. think a 64M machine writing at 32M chunks at time).

Of course if we'll ever deal with hardware where 32M requests makes a difference, then we may have to add overrides to the I/O scheduler to lower the max_requests (i.e. like my obsolete max_bomb_segments did). But I expect that by default the contigous I/O will use the max_sector choosen by the blkdev layer (not choosen by VM or I/O scheduler) to guarantee the best bulk I/O performance as usual (the I/O scheduler option would be just an optional override). the max_sectors is just about using a sane DMA transaction size, good enough to run at disk-speed without measurable cpu overhead, but without being too big so that it provides sane latencies. Overkill huge DMA transactions might even stall the cpu when accessing the mem bus (though I'm not an hardware guru so this is just a guess).

So far there was no need to autotune it, and settings like 512k were optimal.

Don't take me wrong, I find extremely great that you now can raise the IDE request size to a value like 512k, the 128k limit was the ugliest thing of IDE ever, but you provided zero evidence that going past 512k is beneficial at all, and your bootup log showing 32M is all but exciting, I'd be a lot more excited to see 512k there.

I expect that the boost from 128k to 512k is very significant, but I expect that from 512k to 32M there will be just a total waste of latency with zero performance gain in throughput. So unless you measure any speed difference from 512k to 32M I recommend to set it to 512k for the short term like most other driver does for the same reasons.
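
The "autotune" idea Andrea describes (read contiguously at 1M, then 2M, then 4M, and stop once the gain falls below about 1%) might look roughly like the following. This is only a sketch of that description, not code from the thread: `autotune_request_size`, `measure_throughput`, and the toy device model are all hypothetical, and a real implementation would live inside the block layer and time actual reads.

```python
from typing import Callable

MIB = 1 << 20


def autotune_request_size(measure_throughput: Callable[[int], float],
                          start: int = 1 * MIB,
                          limit: int = 32 * MIB,
                          threshold: float = 0.01) -> int:
    """Double the request size until the throughput gain drops below threshold."""
    best_size = start
    best_rate = measure_throughput(start)
    size = start * 2
    while size <= limit:
        rate = measure_throughput(size)
        # Stop as soon as doubling buys less than `threshold` (1% by default).
        if rate < best_rate * (1 + threshold):
            break
        best_size, best_rate = size, rate
        size *= 2
    return best_size


def toy_disk(size: int, overhead_s: float = 0.0001, bandwidth: float = 60e6) -> float:
    """Hypothetical device: fixed per-request overhead plus streaming time."""
    return size / (overhead_s + size / bandwidth)


print(autotune_request_size(toy_disk) // MIB, "MiB chosen for the toy device")
```

With a small per-request overhead, as in the toy model, the loop settles on a size far below the hardware maximum, which is the behaviour Andrea argues for.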

Jeff agreed with most of this, but said:

My point is there are two maximums:

1) the hardware limit
2) the limit that "makes sense", e.g. 512k or 1M for most

The driver should only care about #1, and should be "told" #2.

He added later, "I think the length of this discussion alone clearly implies that the low-level driver should not be responsible for selecting this value, if nothing else ;-)" Jens Axboe also said:

Here's a quickly done patch that attempts to adjust the value based on a previous range of completed requests. It changes ->max_sectors to be a hardware limit, adding ->optimal_sectors to be our max issued io target. It is split on READ and WRITE. The target is to keep request execution time under BLK_IORATE_TARGET, which is 50ms in this patch. read-ahead max window is kept within a single request in size.

So this is pretty half-assed, but it gets the point across. Things that should be looked at (meaning - should be done, but I didn't want to waste time on them now):

Jeff and others were very happy to see this, and discussed some of its technical issues.
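
As a rough illustration of the approach Jens describes, the sketch below keeps a hardware limit and derives a separate issue target from recently completed requests so that a request should finish within the 50ms target. It is not Jens's patch: the function name, the history format, and the `min_sectors` floor are assumptions made for the example, and the real patch tracks READ and WRITE separately inside the block layer.

```python
BLK_IORATE_TARGET_MS = 50.0  # target execution time quoted in the description


def adjust_optimal_sectors(optimal: int, max_sectors: int, completed,
                           min_sectors: int = 8) -> int:
    """Pick an issue target from (sectors, elapsed_ms) pairs of completed requests.

    max_sectors is the hardware limit; the result never exceeds it.
    """
    total_sectors = sum(sectors for sectors, _ in completed)
    total_ms = sum(ms for _, ms in completed)
    if total_sectors == 0 or total_ms <= 0:
        return optimal  # no usable history: keep the current target
    ms_per_sector = total_ms / total_sectors
    target = int(BLK_IORATE_TARGET_MS / ms_per_sector)
    return max(min_sectors, min(target, max_sectors))


# Hypothetical usage: one target per direction, as in the description.
optimal = {"READ": 256, "WRITE": 256}
history = {"READ": [(1024, 20.0), (1024, 20.0)], "WRITE": [(256, 100.0)]}
for direction in optimal:
    optimal[direction] = adjust_optimal_sectors(
        optimal[direction], 65536, history[direction])
print(optimal)
```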

2. Reservation-Based ext3 Pre-Allocation

30 Mar 2004 - 5 Apr 2004 (12 posts) Subject: "[RFC, PATCH] Reservation based ext3 preallocation"

Topics: FS: ext3

People: Mingming Cao, Andrew Morton

Mingming Cao said:

Ext3 preallocation is currently missing. This is the first cut of the prototype for the reservation based ext3 preallocation based on the ideas suggested by Andrew and Ted. The implementation is incomplete, but I want to hear your valuable opinion about the current design.

What I have done in this version of prototype:

  1. basic reservation structure and operations
  2. reservation based ext3 block allocation
  3. and reservation window allocations
  4. block allocation when fs reservation is turned off

For 1) Use a sorted double linked list for the per-filesystem reservation list, like the vm_region does. The operations on double linked list are abstract so later if necessary we could replace it with other sophisticated tree easily.

Each inode have a reservation structure inside it's ext3_inode_info structure. Each reservation structure contains(start, end, list_head, goal_window_size)

For 2) The basic idea is: When we try to allocate a new block for a inode, if there is a reservation window for it, it will try to do allocation from there.

If it does not have a reservation window, we will allocate a block and make a reservation window for it. Instead of doing the block allocation first then do the reservation window allocation second, we make the reservation window first, then allocate a block within the window. The new reservation window has at least one free block and does not overlap with other reservation windows. This way we avoid keeping looking up the reservation list again and again when we found a free bit on bitmap and not sure if it belongs to any body's reservation window.

For 3) To allocate a new reservation window, we search the part of filesystem reservation list that fall into the group which we are trying to allocate a block from. We will have a goal block to guide where we want the new reservation window start from. If we already have a old reservation, we will discard it first, then search the part of list that after the old reservation window. Otherwise the sub-list start from the beginning of the group. The new reservation window could cross group boundary. The reservation window contains at least one free block.

For 4) If the filesystem has reservation turned off, all the code/path for new block allocation is the same as the current code-- just call ext3_try_to_allocate() with a NULL reservation window pointer.
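
The window search in points 2) and 3) can be sketched as follows. This is a simplified model, not the ext3 patch: it uses a sorted Python list where the patch uses a doubly linked list, it ignores block groups, and `ReservationList`, `find_window`, and the `is_free` callback (standing in for the block bitmap) are hypothetical names.

```python
import bisect


class ReservationList:
    """Sorted, non-overlapping reservation windows as inclusive (start, end) pairs."""

    def __init__(self):
        self.windows = []

    def find_window(self, goal, size, is_free, fs_end):
        """Reserve a window of `size` blocks at or after `goal`.

        The window must not overlap an existing reservation and must contain
        at least one free block. Returns (start, end), or None if nothing fits.
        """
        start = goal
        # Index of the first existing window that starts at or after the goal.
        i = bisect.bisect_left(self.windows, (goal, goal))
        # The window just before it may still cover the goal block.
        if i > 0 and self.windows[i - 1][1] >= start:
            start = self.windows[i - 1][1] + 1
        while start + size - 1 <= fs_end:
            end = start + size - 1
            if i < len(self.windows) and self.windows[i][0] <= end:
                # Candidate overlaps the next reservation: skip past it.
                start = self.windows[i][1] + 1
                i += 1
                continue
            if any(is_free(block) for block in range(start, end + 1)):
                self.windows.insert(i, (start, end))
                return (start, end)
            start = end + 1  # no free block in this span; try the next one
        return None


reservations = ReservationList()
print(reservations.find_window(0, 8, lambda block: True, 99))
print(reservations.find_window(4, 8, lambda block: True, 99))
```

Because the window is created before any block is handed out, the allocator only has to consult the reservation list once per window rather than once per free bit, which is the saving described in point 2).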

Above logic has been verified on a user level simulation program. Attached prototype patch (against 2.6.4 kernel) compiles and boots. I have done initial test of the patch on a 2way PIII 700Mhz box.

Andrew Morton replied:

I think this is heading the right way.

Apart from that, looking good.

Mingming posted an updated patch, and they discussed some technical aspects.

3. Linux 2.6.5-rc3-mm2 Released

31 Mar 2004 - 2 Apr 2004 (7 posts) Subject: "2.6.5-rc3-mm2"

Topics: Kernel Release Announcement

People: Andrew Morton

Andrew Morton announced Linux 2.6.5-rc3-mm2, saying, "A small update, mainly to sync up with the CPU scheduler developers and testers."

4. Linux 2.6.5-rc3-mm4 Released

1 Apr 2004 - 6 Apr 2004 (21 posts) Subject: "2.6.5-rc3-mm4"

Topics: Kernel Release Announcement, USB, Version Control

People: Andrew Morton, Marc-Christian Petersen, Greg KH

Andrew Morton announced Linux 2.6.5-rc3-mm4, saying:

2.6.5-rc3-mm3 was just a quick sync for CPU scheduler devel-and-test.

This update again mainly contains CPU scheduler changes.

Marc-Christian Petersen replied, "hmm, did something changed in handling USB mice? starting with 2.6.5-rc3-mm1 and the included bk-usb.patch my USB mouse won't work anymore. Using bk-usb.patch from 2.6.5-rc2-mm5 in 2.6.5-rc3-mm4 all works fine for me." Greg KH replied, "The hid.ko module was renamed to usbhid.ko. Are you sure that you are still loading the proper driver?" Marc-Christian saw that no, the driver wasn't successfully loaded; and said, "grmpf. If I had read Documentation/input/input.txt more carefully I'd noticed the change myself. Sorry for the noise. Works now."

5. New disable-cap-mlock() sysctl

1 Apr 2004 - 5 Apr 2004 (62 posts) Subject: "disable-cap-mlock"

People: Andrea Arcangeli, Andrew Morton, William Lee Irwin III

Andrea Arcangeli said:

Oracle needs this sysctl, I designed it and Ken Chen implemented it. I guess google also won't dislike it.

This is a lot simpler than the mlock rlimit and this is people really need (not the rlimit). The rlimit thing can still be applied on top of this. This should be more efficient too (besides its simplicity).

can you apply to mainline?

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.5-rc3/disable-cap-mlock-1

William Lee Irwin III posted an alternative patch, and Andrea said he didn't mind which was chosen. A technical discussion started up, and at one point Andrew Morton asked:

What is the Oracle requirement in detail?

If it's for access to hugetlbfs then there are the uid= and gid= mount options.

If it's for access to SHM_HUGETLB then there was some discussion about extending the uid= thing to shm, but nothing happened. This could be resurrected.

If it's just generally for the ability to mlock lots of memory then RLIMIT_MEMLOCK would be preferable. I don't see why we'd need the sysctl when `ulimit -m' is available? (Where is that patch btw?)

Andrea said that Oracle's requirement was to be able "to shmget(SHM_HUGETLB) as normal user. However I cannot disable the CAP_IPC_LOCK from hugetlbfs/inode.c since that would break local security w.r.t. mlock. If I've to disable that single check in hugetlbfs/inode.c then I prefer to disable all CAP_IPC_LOCK so they can as well use mlock." Kenneth W Chen also confirmed that accessing SHM_HUGETLB was the main goal.

6. Linux VFS Timestamp Resolution Causing User Problems In 2.6

1 Apr 2004 - 2 Apr 2004 (28 posts) Subject: "Linux 2.6 nanosecond time stamp weirdness breaks GCC build"

Topics: FS: FAT, FS: NFS, FS: ext3

People: Ulrich Weigand, Andi Kleen, Paul Eggert, Jamie Lokier

Ulrich Weigand reported "spurious rebuilds of some auto-generated files in the gcc/ directory while building the Ada libraries." For example, he said, "insn-conditions.c is a rather small file, and compiling it into insn-conditions.o takes usually less than a second. So, it may well happen that these two files end up having a time stamp that differs only in the sub-second part. (Note that with Linux 2.4, file time stamps didn't actually have a sub-second part; this is a new 2.6 feature.)" He went on:

Unfortunately, while Linux now has sub-second time stamps, those cannot actually be *stored* on disk when using the ext3 file system (because the on-disk format has no space). This has the rather curious effect that while the inode is still held in Linux's inode cache, the time stamp will retain its sub-second component, but once the inode was dropped from the inode cache and later re-read from disk, the sub-second part will be truncated to zero.

Now, if this truncation happens to the insn-conditions.o file, but not the insn-conditions.c file, the former will jump from being just slightly newer than the latter to being just slightly *older* than the latter. For some reason, this tends to occur rather reliably during a gcc bootstrap on my system.

Now, as long as the main 'make' is running, this has no adverse effect (apparently because make remembers it has already built the insn-conditions.o file from insn-conditions.c). However, once the libada library is built, it performs a recursive make invocation that once again checks dependencies in the master gcc build directory, including the dependencies on gnat1. At this point, make will re-check the file time stamps and decide it needs to rebuild insn-conditions.o. This in turn triggers a rebuild of libbackend.a and all compiler binaries.

This is bad for two reasons. First, at this point in time various macros required to build the main gcc components aren't in fact set up correctly, and thus the file will be rebuilt using the host compiler and not the newly built stage3 gcc.

More importantly, when using parallel make to do the bootstrap, at this point some *other* library, e.g. libstdc++ or libjava, will be built at the same time, using the cc1plus or jc1 binaries that the libada make has just decided it needs to rebuild. While these binaries are being rebuilt, they will be in a deleted or inconsistent state for a certain period of time. During this period, attempts to start compiles for libstdc++ or libjava components will fail, causing the whole bootstrap to abort.

He concluded, "this should probably be fixed in the kernel, e.g. by not reporting high-precision time stamps in the first place if the file system cannot store them ..."
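
The ordering flip Ulrich describes can be reproduced with plain integers. In this sketch timestamps are nanosecond counts, the values are arbitrary, and `needs_rebuild` models only make's basic rule that a target is rebuilt when a prerequisite is newer than it; none of these names come from the thread.

```python
NSEC = 1_000_000_000


def truncate_to_seconds(ts_ns: int) -> int:
    """What an on-disk format with whole-second timestamps can store."""
    return (ts_ns // NSEC) * NSEC


def needs_rebuild(source_mtime_ns: int, object_mtime_ns: int) -> bool:
    """make's basic rule: rebuild when the prerequisite is newer than the target."""
    return source_mtime_ns > object_mtime_ns


# Arbitrary example: both files written within the same second.
source_c = 1000 * NSEC + 200_000_000   # insn-conditions.c, still cached
object_o = 1000 * NSEC + 700_000_000   # insn-conditions.o, built 0.5s later

rebuild_while_cached = needs_rebuild(source_c, object_o)

# The .o inode is dropped from the cache and re-read from disk.
object_o_reread = truncate_to_seconds(object_o)
rebuild_after_reread = needs_rebuild(source_c, object_o_reread)

print(rebuild_while_cached, rebuild_after_reread)
```

Only the object file's timestamp changed, and only because its inode left the cache, yet the dependency check now gives the opposite answer.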

Andi Kleen replied:

Interesting. We discussed the case as a theoretical possibility when the patch was merged, but it seemed to unlikely to make it worth complicating the first version.

The solution from back then I actually liked best was to just round up to the next second instead of rounding down when going from 1s resolution to ns.

But Paul Eggert said:

Please don't do that. Longstanding tradition in timestamp code is to truncate toward minus infinity when converting from a higher-resolution timestamp to a lower-resolution timestamp. This is consistent behavior, and is easy to explain: let's stick to it as a uniform practice.

There are two basic principles here. First, ordinary files should not change spontaneously: hence a file's timestamp should not change merely because its inode is no longer cached. Second, a file's timestamp should never be "in the future": hence one should never round such timestamps up.

The only way I can see to satisfy these two principles is to truncate the timestamp right away, when it is first put into the inode cache. That way, the copy in main memory equals what will be put onto disk. This is the approach taken by other operating systems like Solaris, and it explains why parallel GCC builds won't have this problem on these other systems.

Switching subjects slightly, in <http://mail.gnu.org/archive/html/bug-coreutils/2004-03/msg00095.html> I recently contributed code to coreutils that fixes some bugs with "cp --update" and "mv --update" when files are copied from high-resolution-timestamp file systems to low-resolution-timestamp file systems. This code dynamically determines the timestamp resolution of a file system by examining (and possibly mutating) its timestamps. The current Linux+ext3 behavior (which I did not know about) breaks this code, because it can cause "cp" to falsely think that ext3 has nanosecond-resolution timestamps.

How long has the current Linux+ext3 behavior been in place? If it's widespread, I'll probably have to think about adding a workaround to coreutils. Does the behavior affect all Linux filesystems, or just ext3?
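
Paul's proposal, truncating the timestamp when it first enters the inode cache, can be modelled in a few lines. This is a sketch of the idea only, assuming each filesystem advertises a timestamp granularity in nanoseconds; `CachedInode` and its methods are hypothetical and not kernel interfaces.

```python
NSEC = 1_000_000_000


def truncate(ts_ns: int, granularity_ns: int) -> int:
    """Truncate toward minus infinity to a multiple of the granularity."""
    return ts_ns - (ts_ns % granularity_ns)


class CachedInode:
    """Toy in-memory inode that never holds more precision than the disk can."""

    def __init__(self, granularity_ns: int):
        self.granularity_ns = granularity_ns
        self.mtime_ns = 0

    def set_mtime(self, now_ns: int) -> None:
        # Truncate immediately, so memory and disk always agree.
        self.mtime_ns = truncate(now_ns, self.granularity_ns)

    def reread_from_disk(self) -> None:
        # Simulates eviction and reload: the disk stores the truncated value.
        self.mtime_ns = truncate(self.mtime_ns, self.granularity_ns)


inode = CachedInode(granularity_ns=NSEC)
inode.set_mtime(5 * NSEC + 123_456_789)
before = inode.mtime_ns
inode.reread_from_disk()
print(before == inode.mtime_ns)
```

Both of Paul's principles hold in this model: the timestamp does not change when the inode is reloaded, and it is never later than the moment it was set.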

Jamie Lokier said:

All Linux filesystems - the nanoseconds field is retained on in-memory inodes by the generic VFS code. The stored resolution varies among filesystems, with the coarsest being 2 seconds (FAT), and some do store nanoseconds. AFAIK there is no way to determine the stored resolution using file operations alone.

This behaviour was established in 2.5.48, 18th November 2002.

The behaviour might not be restricted to Linux, because non-Linux NFS clients may be connected to a Linux NFS server which has this behaviour.

Paul replied, "it sounds like there's no easy workaround for existing systems. Still it'd be nice to fix the bug for future systems." There was a bit more discussion, but nothing conclusive came out of it, and the thread petered out.

7. Linux 2.6.5 Released

3 Apr 2004 - 4 Apr 2004 (2 posts) Subject: "Linux v2.6.5"

Topics: Kernel Release Announcement, Sound: ALSA

People: Linus Torvalds

Linus Torvalds announced Kernel 2.6.5, saying:

Some more architecture updates, and a few fixes for silly broken stuff in -rc3. And an ALSA update.

And I'll be offline for a week, so have fun with this, and if you get the shakes because you want to compile a new kernel every day, there's always the -mm tree to play with while I'm gone..

8. IPv6 Support For SELinux

3 Apr 2004 - 4 Apr 2004 (2 posts) Subject: "[SELINUX] Add IPv6 support"

Topics: Networking

People: James Morris, James H. Cloos Jr., James H. Cloos

James Morris said:

The patch below, against 2.6.5, adds explicit IPv6 support to SELinux.

Brief description of changes:

Corresponding userspace patches are available at <http://people.redhat.com/jmorris/selinux/ipv6/>, although current userspace tools will continue to function normally (but without explicit IPv6 support).

For more details at the security management level, see <http://marc.theaimsgroup.com/?l=selinux&m=108068187630948&w=2>

This code has been under testing and review for several weeks. Andrew, please consider applying to -mm.

James H. Cloos Jr. remarked in reply:

From a scan through the patch, it looks like it does in fact only handle those tcp, udp and raw.

Sctp also should be supported by these mechanisms, given that 2.6 has both in the main tree.

I'd expect many systems which will be installed in the next few quarters and which could make good use of the selinux controls will require sctp support.

9. atp870u SCSI Driver Maintainership

4 Apr 2004 - 7 Apr 2004 (2 posts) Subject: "Who maintains the atp870u driver? (ACARD PCI SCSI)"

Topics: Disks: SCSI, PCI

People: Marcelo Tosatti, Doug Ledford, James Bottomley

Stuart Longland asked who the atp870u SCSI driver maintainer was, saying he recently got a PCI ACARD SCSI card, and was seeing errors at run-time. Marcelo Tosatti replied, "No one really maintains it officially. James Bottomley and Doug Ledford have done some fixes for it on v2.6, you might want trying to ask them directly."

10. Linux 2.6.5-mc1 Released

4 Apr 2004 - 5 Apr 2004 (5 posts) Subject: "2.6.5-mc1"

Topics: Kernel Release Announcement, Version Control

People: Andrew Morton

Andrew Morton announced 2.6.5-mc1 (sic), saying:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5/2.6.5-mc1/

"mc" == "merge candidate", for want of a better name. This tree holds the patches which are slated for inclusion into Linus's tree in the short-term.

As Linus is offline for a week we should expect that the contents of -mc1 will be merged into kernel bitkeeper around April 12.

2.6.5-mm1 will consist of all of 2.6.5-mc1 plus other patches. The separation point is "mc.patch" in the -mm series file - everything before mc.patch is part of both the -mc and -mm kernels and everything after mc.patch is in -mm only.

The -mc series probably won't live for very long - I'm releasing it so that people can prepare patches against what Linus's kernel will look like when he returns.
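
The separation Andrew describes can be expressed as a split of the series file at the "mc.patch" entry. The sketch below assumes a series file with one patch name per line and optional "#" comments; `split_series` and the patch names in the usage example are made up for illustration.

```python
def split_series(series_lines, marker="mc.patch"):
    """Split a series file into (patches in both -mc and -mm, patches in -mm only)."""
    names = [line.strip() for line in series_lines
             if line.strip() and not line.lstrip().startswith("#")]
    if marker not in names:
        raise ValueError(f"{marker} not found in series")
    idx = names.index(marker)
    # Everything before the marker is shared; everything after is -mm only.
    return names[:idx], names[idx + 1:]


# Hypothetical series contents, for illustration only.
example = ["first-fix.patch", "# a comment", "second-fix.patch",
           "mc.patch", "experimental.patch"]
in_both, mm_only = split_series(example)
print(in_both, mm_only)
```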

11. Linux 2.6.5-mm1 Released

4 Apr 2004 - 5 Apr 2004 (6 posts) Subject: "2.6.5-mm1"

Topics: Version Control

People: Andrew Morton

45 minutes after releasing 2.6.5-mc1, Andrew Morton also released 2.6.5-mm1, saying:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5/2.6.5-mm1/

The current versions of these patches are in

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.5/2.6.5-mm1/dropped/

See the series file there for applying order.

12. BTTV Maintainership

5 Apr 2004 (3 posts) Subject: "[BTTV] is anyone taking care of this drivers?"

People: Luis Miguel, Gerd Knorr

Luis Miguel said, "I have data for a new bttv 878 based card that I think must be introduced in the cards database info in order to make it work. Who is the maintainer of that driver right now?" Gerd Knorr replied that he (Gerd) maintained the BTTV driver.

13. GPL Violation By Tritton Technologies

7 Apr 2004 (3 posts) Subject: "probable Linux Kernel GPL violation"

Topics: Networking

People: Roy Franz, Matthias Urlichs, Erik Andersen

Roy Franz said:

I have encountered a vendor that is selling a device that admittedly contains a modified Linux kernel, and they refuse to provide the source for the modified kernel. They claim that the GPL does not apply to those changes. They do not seem to be using modules. I don't know how to pursue this further, so I am bringing it to the attention of the kernel developers.

The product in question is the Tritton Technologies NAS120. (They also offer version with router functionality that is based on the same board.) This board is based on the Toshiba TX39 processor (MIPS), and has a realtek ethernet chip on it. It is running kernel version 2.4.16.

See: http://www.trittontechnologies.com/products.html

This is clearly made by mct.com.tw: http://www.mct.com.tw/prod/sa-100.html As some files in the image identify it as such.

Several other vendors also sell versions of this. http://www.iogear.com/main.php?loc=product&product_id=645 and http://www.claxan.ch/de/prod_det.asp?PRODID=CL-SA110&TOPNAVID=-1 (claxan offers download of some source code, but not the kernel. A Customer of theirs contacted them and they also refuse to release kernel source.)

Here is the response from Tritton stating that they will not release the modified kernel source. This was after several email exchanges where I was being very clear that I was interested in the kernel source.

He quoted the email he'd received from technical support:

Earlier I stated that the kernal would be included with the package. While this is still true, you must know that the modifications made to the kernal will not be included. This is because of two things: 1) Those mods do not fall under the GPL and 2) They are owned by Toshiba.

Matthias Urlichs said, "I'd alert them to the fact that they're about to get a lot of well-deserved negative publicity about this." And Erik Andersen added, "I've added the NAS120 to the BusyBox Hall of Shame..."

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.