
Kernel Traffic #181 For 25 Aug 2002

By Zack Brown


Mailing List Stats For This Week

We looked at 1483 posts in 7515K.

There were 409 different contributors. 210 posted more than once. 149 posted last week too.

The top posters of the week were:

1. Some 2.5 Performance Patches And Benchmarks

10 Aug 2002 - 14 Aug 2002 (21 posts) Archive Link: "[patch 1/21] random fixes"

Topics: Big O Notation, Disks: IDE, FS: ext2, FS: ext3, FS: sysfs, SMP

People: Andrew Morton, Adam Kropelin

Andrew Morton posted a patch and explained:

Sorry, but there's a ton of stuff here. It ends up as a 4600 line diff. Some code dating back to 2.5.24. It's almost all performance work and it has been very painful getting its effectiveness tested on the big machines; the main problem has been getting them booting 2.5 at all. The results still are not as conclusive as I'd like, but the signs are good, and there are no other proposals around to fix these problems.

This one is mainly a resend.

Old stuff:

Fixes some bogosity which I added to max_block():

After a few posts, Adam Kropelin replied:

I did a bit of testing since I've always thought 2.4 (and 2.5) writeout behavior left something to be desired. Testbed was a SMP x86 (2xPPro-200) with 160 MB of RAM. I used everyone's favorite 2.5 scapegoat: IDE, with a single not-very-fast IBM disk. Filesystem was ext3 in data=ordered mode. Test workload was an inbound (from the point of view of the system under test) FTP transfer of a 600 MB iso image. All test runs were from a clean boot with all unnecessary services shut down.

Results (average of 4 runs):

2.5.31-akpm: 2m 43s
2.5.31:      2m 33s
2.4.19:      2m 18s

`vmstat 1` shows some differences, especially with respect to 2.4 vs. 2.5. In about 40% of the cases when the bo drops to (near) 0, the machine stalled (FTP transfer halted, vmstat output paused, etc.). With 2.5.31-akpm, the stalls were about 3-4 seconds in length. With 2.5.31, the stalls were of the same duration, but slightly less frequent. With 2.4.19, the stalls were very frequent (closer to 70% of the time bo hit 0), but were only 1-2 seconds in duration.

Below are representative samples of `vmstat 1` for each kernel during the test. (Note that the low cache usage in the 2.5.31 sample is because the snapshot is from early in the run when the cache is still filling.)

Andrew replied:

yes. For this workload (10 mbyte/sec ftp transfer onto a >20 meg/sec disk) the application should never block on IO - all writeback should happen via pdflush.

2.4 starts background writeback at 30% dirty and synchronous writeback at 60% dirty.

2.5 starts background writeback at 40% dirty and synchronous writeback at 50% dirty.

You can make 2.5 use the 2.4 settings with

cd /proc/sys/vm
echo 30 > dirty_background_ratio
echo 60 > dirty_async_ratio
echo 70 > dirty_sync_ratio

and I expect you'll find that fixes it up. Setting dirty_background_ratio to 10% will make it even better. But it will hurt dbench numbers at certain client counts, which is a national emergency.

Sigh. I don't know what the right numbers are. There aren't any; that's the problem with magic numbers. That part of the kernel is making writeback and throttling decisions in total ignorance of the overall state of the system.

Worst comes to worst, we can set the 2.5 knobs at the same level as the 2.4 ones, but I'd rather prefer that we can come up with something dynamic.

In fact, I'd be inclined to set the background ratio much lower than 2.4, and to hell with dbench. Because the lower level is better for real programs, as you've observed.
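The two thresholds Andrew describes can be pictured with a small model. This is only a sketch: the function name `writeback_action` and its return labels are invented here for illustration, and the real kernel logic is considerably more involved. It takes the percentages quoted above (30%/60% for 2.4, 40%/50% for 2.5) and reports what would happen at a given level of dirty memory.

```python
def writeback_action(dirty_pct, background_ratio, sync_ratio):
    """Classify a dirty-memory percentage against two writeback thresholds.

    Hypothetical helper for illustration only; this is not kernel code.
    """
    if dirty_pct >= sync_ratio:
        return "sync"        # the process dirtying memory is made to wait
    if dirty_pct >= background_ratio:
        return "background"  # writeback happens behind the writer's back
    return "none"


# Threshold pairs as quoted in the thread.
KERNEL_24 = {"background_ratio": 30, "sync_ratio": 60}
KERNEL_25 = {"background_ratio": 40, "sync_ratio": 50}

for pct in (35, 45, 55):
    print(pct,
          writeback_action(pct, **KERNEL_24),
          writeback_action(pct, **KERNEL_25))
```

With these numbers the gap between "start background writeback" and "block the writer" is 30 percentage points on 2.4 but only 10 on 2.5.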

Adam replied that Andrew's suggested numbers brought the patch in line with vanilla 2.5.31 speeds, but that both kernels still tested slower than 2.4.19. Andrew said the numbers really sweetened things up on his own box, and couldn't explain the discrepancy. They went back and forth on it for a few posts, until Adam posted some information on his disk speed. Andrew sputtered:

gack. I've seen pencils which can write faster than that.

So your wirespeed actually exceeds the disk speed. That changes things.

The kernel *has* to stall the generator of dirty data. We can make the stalls shorter, and more frequent. Go into drivers/block/ll_rw_blk.c and see where it's initialising batch_requests. Just change it to

batch_requests = 1;

batch_requests needs to die anyhow...

And in fs/mpage.c, set RATELIMIT_PAGES to 16.

The application has to block, but the disk should certainly never fall idle. I'll play with this a bit. IDE ceased to be an option in 2.5.30, which does not aid this effort.

It also turned out that Adam had been running ext3 rather than ext2, contrary to what he had reported. He wrote:

*Actually* switching to ext2 (rather than just pretending) made a tremendous difference. New numbers:

2.5.31-stock: 1m 49s
2.5.31-akpm: 1m 50s
2.4.19-stock: 1m 34s

...but, applying the writeout threshold settings you suggested:

2.5.31-stock: 1m 34s
2.5.31-akpm: 1m 34s

(That's with dirty_background at 30%; 10% turned in the same numbers as 30% did.)

Presumably with the disk as the bottleneck, the -akpm changes aren't expected to do much. At least they're not degrading anything.
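To put Adam's ext2 timings side by side, the elapsed times can be converted to seconds and compared against the 2.4.19 run. The sketch below does only that arithmetic; the helper names `to_seconds` and `percent_slower` are made up for this example, and the throughput figure simply divides the 600 MB image size mentioned earlier by the elapsed time.

```python
import re


def to_seconds(text):
    """Convert a duration such as '1m 49s' into whole seconds."""
    match = re.fullmatch(r"\s*(\d+)m\s*(\d+)s\s*", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def percent_slower(candidate, baseline):
    """How much longer `candidate` took than `baseline`, in percent."""
    base = to_seconds(baseline)
    return 100.0 * (to_seconds(candidate) - base) / base


# ext2 results with default writeout thresholds, as reported above.
EXT2_RUNS = {
    "2.5.31-stock": "1m 49s",
    "2.5.31-akpm": "1m 50s",
    "2.4.19-stock": "1m 34s",
}
ISO_SIZE_MB = 600  # size of the transferred image, from Adam's first post

for kernel, elapsed in EXT2_RUNS.items():
    secs = to_seconds(elapsed)
    print(f"{kernel}: {secs}s, {ISO_SIZE_MB / secs:.1f} MB/s, "
          f"{percent_slower(elapsed, EXT2_RUNS['2.4.19-stock']):+.1f}% vs 2.4.19")
```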

2. Status Of klibc And Logging

12 Aug 2002 - 20 Aug 2002 (24 posts) Archive Link: "klibc and logging"

Topics: FS: NFS, FS: initramfs, FS: ramfs, Klibc

People: H. Peter Anvin, Erik Andersen, Oliver Xymoron, Miquel van Smoorenburg, Andrew Morton

H. Peter Anvin felt that klibc was almost complete, though he imagined plenty of bugs would turn up once folks started really pounding on it. But he added, "However, I'm wondering what to do about logging. Kernel log messages get stored away until klogd gets started, but early userspace may need some way to log messages -- and syslog is obviously not running. The easiest way to do this would probably be to be able to write to /proc/kmsg (which probably really should be /dev/kmsg) and push messages onto the kernel's message queue; but we could also have a dedicated location in the initramfs for writing logs, and do it all in userspace. In the latter case there needs to be a convention to make sure this file is actually present in the namespace at the time syslog starts, and of course syslog needs to know about it..."

Erik Andersen suggested simply writing to /dev/console, adding, "If someone is unable to read from that (VGA, serial, network console, whatever) while trying to set up an NFS-root, they get to keep both pieces."

Elsewhere, Miquel van Smoorenburg suggested using /dev/shm/log, but H. Peter said that it "Requires too much to work before it can be made available." He added, "Andrew Morton sent me a proposed patch last night which adds a klogctl (a.k.a. sys_syslog) which does a printk() from userspace. It was less than 10 lines; i.e. probably worth it. I have hooked this up to syslog(3) in klibc, although the code is not checked in yet." Oliver Xymoron, however, thought this was overkill. He added, "it's got some interesting security implications - only root should be able to flood the kernel message queue, but non-root should be able to syslog, even from early userspace."

3. New Thread Creation Syscall

13 Aug 2002 - 15 Aug 2002 (32 posts) Archive Link: "[patch] clone_startup(), 2.5.31-A0"

Topics: Version Control

People: Ingo Molnar, Linus Torvalds, Christoph Hellwig

Ingo Molnar posted a patch and announced:

the attached patch implements a new syscall on x86, clone_startup(). The parameters are:

clone_startup(fn, child_stack, flags, tls_desc, pid_addr)

with the help of this syscall glibc's next-generation pthreads code is able to perform single-syscall thread creation: clone_startup() sets up the child TLS and writes the child PID back into userspace. (which address points to the thread control block.).

the child PID has to be written back because otherwise the parent and the child would have to do expensive and unrobust synchronization. The first instruction the child executes might as well be a signal handler, and glibc code needs the TLS and the PID of the thread. There are a number of workarounds in userspace that can solve this problem without clone_startup(), but each of them has disadvantages:

and the kernel can indeed provide a pretty good solution - why not do it this way.

the TLS setup avoids an extra set_thread_area() syscall. [Standalone glibc applications still use set_thread_area(), so this syscall does not obsolete set_thread_area().]

Implementational issues: i've introduced a new kernel-internal clone flag, CLONE_STARTUP. In theory we could use the existing clone() syscall and let applications fill in CLONE_STARTUP - i felt uneasy about this solution because it introduces a versioned sys_clone() parametering, making things messy. But it would undoubtedly work safely and reliably, and it's even a bit slower than the separated syscalls solution this patch adds.

for performance reasons the kernel does not recognize the -1 TLS descriptor number, but this is a non-issue, child threads inherit the TLS index of the parent anyway. Also, the kernel does not allow an 'empty' TLS to be defined.

Linus Torvalds had some implementation issues, but said, "Other than that the thing seems to make sense." Christoph Hellwig also said the name clone_startup() was just terrible. He suggested spawn_thread() or create_thread() instead, adding, "it's not our good old clone and not a lookalike, it's some pthreadish monster.." Linus replied:

I agree that the name is a bit ugly, but this is a system call that I actually think is fundamentally useful (ie I can see how it would be totally usable quite outside any specific library implementation issues).

Talking it over with the IBM threading guys is still worthwhile, though. There may be _other_ information that the IBM guys have problems with, and it could easily be that the interface we really want is even more generic.

But he went on:

This one definitely isn't a pthread-specific problem. The old UNIX fork() semantics for <pid> returning really are fairly broken, since the notion of returning the pid in a local register is inherently racy for _anything_ that wants to maintain a list of its children and needs to access the list in the SIGCHLD handler.

(Simple explanation: imagine a child that exits so quickly that the parent hasn't even had time to do the "store pid into the array" before the parent is already signalled with SIGCHLD. Yes, this happens, and yes, it's a real problem. It wasn't that long ago that bash would _crash_ on this).

With a system call like this, the parent can

Note how this isn't even thread-specific: it very much would work with a regular fork-like approach and a standard shell.
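The race Linus describes comes down to the order in which the parent sees two events: storing the child's PID in its table, and handling SIGCHLD for that child. The following sketch replays both orderings without creating any real processes. The function `replay` and the event names are invented for this illustration, and PID 42 is arbitrary.

```python
def replay(events):
    """Replay parent-side events; return (still_tracked, unmatched_exits).

    `events` is a list of ("store", pid) or ("sigchld", pid) tuples in the
    order the parent observes them. Purely a toy model of the bookkeeping.
    """
    tracked = set()    # PIDs the parent has recorded as live children
    unmatched = []     # exits the SIGCHLD handler could not find in the table
    for kind, pid in events:
        if kind == "store":
            tracked.add(pid)
        elif kind == "sigchld":
            if pid in tracked:
                tracked.remove(pid)
            else:
                unmatched.append(pid)
        else:
            raise ValueError(f"unknown event: {kind}")
    return tracked, unmatched


# fork()-style return value: the child exits before the parent stores the PID.
racy = [("sigchld", 42), ("store", 42)]

# PID written to the parent's memory before the child can run.
safe = [("store", 42), ("sigchld", 42)]

print(replay(racy))
print(replay(safe))
```

In the racy ordering the handler sees an exit it cannot match, and the table is left holding a PID for a child that no longer exists. When the PID is already in place before the child runs, both problems disappear.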

Christoph suggested calling the thing spawn(), but Linus replied:

spawn() to me implies doing the equivalent of "vfork()+execve()", the way most non-UNIX OS's do new process creation.

I don't dislike the "thread" name too much, but I want this to be generic, and seen as such. The same way the original clone() was a proper superset of fork(), this needs to be a proper superset, not just in name, but in _usage_. If it's useful for only one thing, that's not good.

Ingo replied:

okay, the attached patch gets rid of clone_startup() and adds two new clone() flags instead:

CLONE_SETTLS => if present then the third clone() syscall parameter is the new TLS.

CLONE_SETTID => if present then the child TID is written to the address specified by the fourth clone() parameter.

the new parameters are handled in a safe way, clone() returns -EFAULT or -EINVAL if there's some problem with them.

No current code is affected by these new flags. Patch was testbooted on 2.5.31-BK-current.

They went on to discuss the particularities of the patch itself, and the subthread ended.

4. Prospects Of NFSv4 And Crypto In 2.5

13 Aug 2002 - 16 Aug 2002 (21 posts) Archive Link: "patch 14/38: CLIENT: add ->setup_read() nfs_rpc_op for async read, part 1"

Topics: Disk Arrays: LVM, Disks: IDE, Disks: SCSI, FS: NFS

People: Dax Kelson, Alan Cox, Linus Torvalds, Oliver Xymoron

In the course of discussion, Dax Kelson asked:

Linus, I'm curious if the NFSv4 patches will be accepted in the near future (ie, before 2.6).

I for one would REALLY like to see NFSv4 (actually, Kerberized NFSv4 is what I'm after). I just finished setting up a Kerberized Solaris NFS environment with home directories automounted from the clients with strong user authentication.

Frankly, the stock (non-Kerberized) NFS security model blows.

The fact that any janitor with a laptop (or any client with a malicious root user) can nuke all home directories from a standard NFS home directory server bothers me greatly.

Alan Cox replied, "Thats not an NFS2 or NFS3 issue, thats an implementation matter. A proper NFS credential system prevents that from occurring. You also have to fix some bogon assumptions in our NFS client too I grant." Dax asked for some more details, and Alan went on:

Ok item #1 you authenticate with the server and get a cryptographic key for use as credentials. This solves the bad client problem. Kerberos, gssapi etc will do the job

Item #2 is a bug in our NFS page cache handling. Its not legal in NFS to assume we can share caches between processes unless they have the same NFS credentials for the query. The most we can do (and should do) is that when we think we can reuse a cache entry we issue an NFS ACCESS check for NFSv3 or for NFSv2 we write it back to the server if dirty then issue a read for the new credential set.

In light of item #1, Dax asked what the prospects were for getting crypto into the main kernel tree. Linus Torvalds replied:

For a good enough excuse, and with a good enough argument that it's not likely to be a big export problem, I don't think it's impossible any more.

However, the "good enough excuse" has to be better than "some technically excellent, but not very widespread" thing.

Quite frankly, I personally suspect that crypto is one of those things that will be added by vendor kernels first - if vendors are willing to handle whatever export issues there are, that's good, and if they aren't, then the standard kernel cannot really force it upon them anyway.

I personally doubt that NFS would be the thing driving this. Judging by past performance, NFS security issues don't seem to bother people. I'd personally assume that the thing that would be important enough to people for vendors to add it is VPN or encrypted (local) disks.

Oliver Xymoron replied:

I would have thought that there'd be a big push for merging IPSEC in as it creates one of those "network effects" but it's still stalled by politics. I think they're waiting for a written invitation or something.

Is loopback solid enough currently to make crypto over loopback worthwhile? It's occurred to me that it might be better to move the translation hooks down to the generic block layer proper so that things like LVM and iSCSI and brain-damaged bit-swapped IDE could take advantage of them without the deadlock-prone layering issues of loopback. Thoughts?

And Linus replied:

I don't know that it is clear which layer should do it. It's certainly _not_ clear whether the block layer is the right point, and even if you want a hook there I really suspect that upper layers want to pass in "context" data to the encryption layer.

In particular, having a global disk security mechanism may not actually be a good idea - you may want to have per-file key information, which at least implies that the block layer cannot do it alone, and upper layers need to pass in different user keys depending on the operation.

In other situations, the proper layer is obviously the stream itself (ie the "NFS over SSH/Kerberos" kind of thing), but that approach assumes that you trust the remote end. If you don't trust the remote end, you're back to wanting per-file encryption (possibly in _addition_ to the stream encryption), which then ends up implying that you need to have encryption either at the page cache level or at the actual API level.

(The API level tends to be just done with user-level loadable libraries, of course, so there may not be much reason for kernel support there. Kernel support may or may not be desirable even if the encryption itself were to be done by the user)

And separate from the actual encryption, you have key management. Even if the kernel were to do no encryption at all (as with a user-level library approach), I suspect that some people would like to have support for keeping track of which keys a process has.

And THIS, I suspect, is one of the major reasons there isn't really all that much encryption in the kernel. There's just too much choice, and different people really need different things - resulting in it being all over the place.

Clearly some people trust their servers, and just want to have a safe conduit over the WAN when they access them. Others don't even trust the LAN or the server contents themselves, and want per-file protection with private passwords. And both have a good point. It just means that there is no "hook". There is a "maze of hooks, all slightly different".

5. NFSv4 Server Support

13 Aug 2002 - 14 Aug 2002 (2 posts) Archive Link: "patch 38/38: SERVER: giant patch importing NFSv4 server functionality"

Topics: FS: NFS

People: Kendrick M. Smith

Kendrick M. Smith posted a big patch and announced:

Now that all the hooks are in place, this large patch imports all of the new code for the NFSv4 server.

This patch makes almost no changes to the existing nfsd codebase (these have been taken care of by the preceding patches).

One aspect of the NFSv4 code deserves comment. The most natural scheme for processing a COMPOUND request would seem to be:

1a. XDR decode phase, decode args of all operations
2a. processing phase, process all operations
3a. XDR encode phase, encode results of all operations

However, we use a scheme which works as follows:

1b. XDR decode phase, decode args of all operations
2b. For each operation,
process the operation
encode the result

To see what is wrong with the first scheme, consider a COMPOUND of the form READ REMOVE. Since the last bit of processing for the READ request occurs in XDR encode, we might discover in step 3a that the READ request should return an error. Therefore, the REMOVE request should not be processed at all. This is a fatal problem, since the REMOVE has already been done in step 2a!

Another type of problem would occur in a COMPOUND of the form READ WRITE. Assume that both operations succeed. Under scheme (a), the WRITE is actually performed _before_ the READ (since the "real" READ is really done during XDR encode). This is certainly incorrect if the READ and WRITE ranges overlap.

These examples might seem a little artificial, but nevertheless it does seem that in order to process a COMPOUND correctly in all cases, we need to use scheme (b) instead of scheme (a).

(To construct less artificial examples, just substitute GETATTR for READ in the examples above. This works because the "real" GETATTR is done during XDR encode: one would really have to bend over backwards in order to arrange things otherwise.)
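The difference between the two schemes can be shown with a toy model. Everything in the sketch below is hypothetical (the `ToyServer` class, the tuple format for operations, the `unreadable` set used to force a READ error); it is not the nfsd code, only an illustration of why deferring the real READ to the encode step matters.

```python
class OpError(Exception):
    """An operation inside the COMPOUND failed."""


class ToyServer:
    """Hypothetical stand-in for a file server; not the nfsd code."""

    def __init__(self, files, unreadable=()):
        self.files = dict(files)           # name -> contents
        self.unreadable = set(unreadable)  # names whose READ will fail

    def process(self, op):
        kind, name, *payload = op
        if kind == "REMOVE":
            self.files.pop(name, None)
        elif kind == "WRITE":
            self.files[name] = payload[0]
        # READ does nothing here: as in the post, the "real" read is
        # deferred until the result is encoded.

    def encode(self, op):
        kind, name, *_ = op
        if kind == "READ":
            if name in self.unreadable or name not in self.files:
                raise OpError(f"READ {name} failed")
            return self.files[name]
        return "OK"


def scheme_a(server, compound):
    """Scheme (a): process every operation, then encode every result."""
    for op in compound:
        server.process(op)
    results = []
    for op in compound:
        try:
            results.append(server.encode(op))
        except OpError as err:
            results.append(str(err))
            break
    return results


def scheme_b(server, compound):
    """Scheme (b): process and encode one operation at a time."""
    results = []
    for op in compound:
        server.process(op)
        try:
            results.append(server.encode(op))
        except OpError as err:
            results.append(str(err))
            break  # later operations are never processed
    return results
```

For example, with a file `f` whose READ is forced to fail, `scheme_a` on `[("READ", "f"), ("REMOVE", "f")]` reports the READ error but has already removed the file, while `scheme_b` stops at the error and leaves the file in place. For `[("READ", "f"), ("WRITE", "f", "new")]`, `scheme_a` returns the newly written contents from the READ, whereas `scheme_b` returns the old contents.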

6. More Logging Issues Considered

13 Aug 2002 - 19 Aug 2002 (21 posts) Archive Link: "[patch] printk from userspace"

Topics: Capabilities, Klibc, Spam

People: Andrew Morton, Benjamin LaHaise, Alexander Viro, Linus Torvalds, H. Peter Anvin

Andrew Morton posted a patch to allow programs in user-space to use printk(). He said, "The main use of this is within hpa's klibc - initial userspace needs a way of logging information and this API allows that information to be captured into the printk ringbuffer. It ends up in /var/log/messages. Messages are truncated at 1024 characters by printk's vsprintf(). Requires CAP_SYS_ADMIN." Benjamin LaHaise thought this was an awful idea for security reasons, since any user could spam the kernel's log ringbuffer. But several folks pointed out that Andrew's patch required root privileges, so not every user could take advantage of it. Benjamin then asked, "isn't adding yet another syscall that's equivalent to write(2) a reason to take this patch and burn it along with the vomit its caused?" Alexander Viro replied:

I have a better suggestion. How about we make write(2) on /dev/console to act as printk()? IOW, how about making _all_ writes to console show up in dmesg?

Then we don't need anything special to do logging _and_ we get output of init scripts captured. For free. dmesg(8) would pick that up, klogd(8) will work as is, etc.

H. Peter Anvin said /dev/console was probably not the best place for that; and Linus Torvalds agreed, but added, "I like the notion. I've always hated the fact that all the boot-time messages get lost, simply because syslogd hadn't started, and as a result things like fsck ran without any sign afterwards. The kernel log approach saves it all in one place."

Benjamin reiterated that the original patch should be reverted, as it created a new syscall that duplicated existing functionality. He posted a patch that he preferred, and Andrew agreed that it was better.

7. Dealing With Oopsen

14 Aug 2002 (1 post) Archive Link: "[patch] 2.4.20-pre2 RTFM Documentation/oops-tracing.txt"

People: Keith Owens

Keith Owens posted a patch and explained:

In the hope that this will reduce the number of "what does this oops mean?" and "what do I need to report?" questions on l-k. Hits all architectures, only tested on i386 but "obviously correct" (yeah, right!).

Print "Read Documentation/oops-tracing.txt before reporting this problem" before oops, nmi lockup etc. If somebody is maintaining a bug database, I don't mind if the message is modified to also point to the bug database.

8. PC-Speaker Driver In Mainstream Kernel

14 Aug 2002 - 18 Aug 2002 (22 posts) Archive Link: "Re: [ANNOUNCE] New PC-Speaker driver"

People: Stas Sergeev, Denis Vlasenko, Daniel Phillips, Andrew Rodland

A couple weeks before, Stas Sergeev announced a new PC speaker driver for 2.4.18. Now, Denis Vlasenko reported that it worked for playing mp3s. Stas thanked him for the test, and said, "indeed my primary goal was to make the sound quality acceptable even for playing MP3s. With the motherboard's output attached to an external speakers the quality is definitely acceptable, but for the internal beeper I am not sure if it is possible to really enjoy MP3s however:)" Andrew Rodland reported getting some good sound and some horrible noise. He thought it might be the motherboard, and Stas agreed with this (pointing out that most folks were not experiencing bad noise problems with the driver), and tried to come up with a workaround.

There was also some talk of getting the driver into the mainstream kernel, but at one point Denis Vlasenko said:

I'm afraid I'll disappoint you guys but chances of getting this into mainline are slim for the following reasons:

  1. New motherboards have built-in sound, it may be crappy but definitely better than PC speaker.
  2. PC speaker hardware is not standardized enough. It is designed to beep reliably, but no manufacturer tests it for good frequency diagram and such. Since they may be wired differently, you can't be sure which way you can force maximum amplitude on a particular mobo (there are 2 or 3 ways to reach max on different mobos. Or so I read in a magazine a long ago).
  3. It loads CPU enormously. Even more so considering that some recent chipsets _emulate_ speaker via their integrated sound and SMM mode (ick).

Stas admitted that getting the code into the main tree was not very likely, mainly because the code just wasn't ready. He and others felt that Denis' specific objections were not serious barriers. To Denis' first point, Stas and Daniel Phillips pointed out that not all modern motherboards came with sound. And to Denis' third point, Stas felt that this was a bug, and the code should not be so CPU hungry.

No agreement was reached in the thread.

9. Benchmarking Forking On 2.4.20-pre2 And 2.4.20-pre2-ac1

14 Aug 2002 (4 posts) Archive Link: "Performance differences for recent kernels running forky test."

Topics: FS: sysfs, SMP, Virtual Memory

People: Steven Cole, Alan Cox, Andrew Morton

Steven Cole announced:

I ran the following lots_of_forks.sh script for several kernels.

http://people.nl.linux.org/~phillips/patches/lots_of_forks.sh

using time -v sh lots_of_forks.sh

The results for 2.4.20-pre2 and 2.4.20-pre2-ac1 are very different.

2.4.20-pre2:
Command being timed: "sh lots_of_forks.sh"
User time (seconds): 18.15
System time (seconds): 24.96
Percent of CPU this job got: 181%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.71

2.4.20-pre2-ac1:
Command being timed: "sh lots_of_forks.sh"
User time (seconds): 28.04
System time (seconds): 53.18
Percent of CPU this job got: 187%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:43.32

I ran this test 8 times in a row with no pause in between runs. The numbers are System time as reported by time -v. The test machine is 2-way p3, SMP kernels, configured the same, no tweaks to /proc/sys/vm.

        2.4.20-pre2     2.4.20-pre2-ac1 2.5.28          2.5.31

1       24.96           53.18           39.91           37.04
2       24.92           52.42           44.91           45.88
3       24.69           50.69           48.63           44.89
4       24.39           51.9            58.12           55.8
5       24.72           46.14           49.81           43.18
6       24.34           47.99           57.62           40.93
7       24.64           52.33           50.42           47.27
8       24.53           52.84           45              36.49

This may be a very unfair benchmark. Or it may show something worth investigating further.

Alan Cox replied:

I'd expect that to be the case. Rmap is a huge win for many things but its not a good win on fork times. The question is whether fork bombs dominate your working load and what the tradeoffs are between saner VM behaviour and less accounting overhead.

Its not clear what the answer is.

Andrew Morton also targeted the rmap code as a likely cause of Steven's results. He suggested, "Could you retest 2.5.31 with http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.31/stuff-sent-to-linus/everything.gz applied?" Steven did so and reported back:

here are the results:

        2.5.31-vanilla  2.5.31-akpm-all
1               37.04           35.98
2               45.88           36.94
3               44.89           35.23
4               55.8            38.49
5               43.18           51.43
6               40.93           46.3
7               47.27           46.94
8               36.49           47.19
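Because the individual runs vary quite a bit, it can help to reduce each column to a mean and a spread. The sketch below does that for the system times in Steven's two tables. The dictionary name and the `summarize` helper are invented for this example, and the choice of sample standard deviation as the measure of spread is an assumption here, not something from the thread.

```python
from statistics import mean, stdev

# System time in seconds for runs 1-8, copied from the tables above.
SYSTEM_TIME = {
    "2.4.20-pre2":     [24.96, 24.92, 24.69, 24.39, 24.72, 24.34, 24.64, 24.53],
    "2.4.20-pre2-ac1": [53.18, 52.42, 50.69, 51.9, 46.14, 47.99, 52.33, 52.84],
    "2.5.28":          [39.91, 44.91, 48.63, 58.12, 49.81, 57.62, 50.42, 45],
    "2.5.31":          [37.04, 45.88, 44.89, 55.8, 43.18, 40.93, 47.27, 36.49],
    "2.5.31-akpm-all": [35.98, 36.94, 35.23, 38.49, 51.43, 46.3, 46.94, 47.19],
}


def summarize(samples):
    """Return (mean, sample standard deviation) for a list of timings."""
    return mean(samples), stdev(samples)


for kernel, samples in SYSTEM_TIME.items():
    average, spread = summarize(samples)
    print(f"{kernel:16s} mean {average:6.2f}s  stdev {spread:5.2f}s")
```

On these figures 2.4.20-pre2-ac1 averages roughly 50.9 seconds of system time against roughly 24.6 for 2.4.20-pre2. The patched 2.5.31 averages about 42.3 seconds against about 43.9 for vanilla, a difference smaller than the run-to-run spread on either 2.5.31 kernel.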

10. WOLK Project Needs Maintainer

14 Aug 2002 (1 post) Archive Link: "[ANNOUNCE] WOLK v3.5 FINAL, Codemane 'Fin' alias 'Birthday Release'"

Topics: Big O Notation, Clustering: Mosix, Scheduler

People: Marc-Christian Petersen

Marc-Christian Petersen gave a URL and announced:

I am very proud to announce: (and also give me a small present for my b'day :)

WOLK v3.5 FINAL, Codename 'Fin' alias 'Birthday Release'

A lot of development has been done since the last version of WOLK, v3.4.

Also I am a kind of happy that this is the last release of the "Working Overloaded Linux Kernel", because I don't have the time that WOLK needs for further good development. I have a job, a girl friend, some friends etc. so I depend on some people who will help me to manage this project successfull.

I've asked a thousand times for help, got some offers, but never really get that what they promised to do. The only real helping hands for patches was Michael Gasperi, the only Co-Developer for my project, and Dominik Perpeet, my "personal" C/C++ Guru ;). And 2 times Jure Pecar ... But 2-3 people are definitively too less for managing this big project successfull.

Anyway, it was a nice time, having fun, a good kernel, many learning about kernel internals. Thanks a lot to all those people around over the planet of using WOLK.

Sorry, no OpenMosix and also no O(1) scheduler as I promised!

11. GPG-Signing Kernel Mailing List Posts

14 Aug 2002 (2 posts) Archive Link: "GPG/PGP-signatures"

People: David Weinehall, John L. Males, Thomas Duffy, Roger Gammans, Martin Waitz, Udo A. Steinberg, Brandon Low, Austin Gonyou

David Weinehall reported:

I think that it is great that people GPG/PGP-sign their posts (and I intend to start doing so myself, as soon as I get proper connectivity at home, and thus don't have to send all e-mail via a remote server where I don't want to store my private key), but when the public keys are unavailable, hard to obtain, or invalid for one reason or another, the signing is useless.

I've monitored the use of signatures on the list the last few months, and have come up with this list of people whose signatures I've been unable to find (neither available on wwwkeys.pgp.net nor has a x-pgp or x-gpg header saying where to download the key):

Justin Carlson          C1A06FBE
Florent Chabaud         95C81C3C
Thomas Duffy            38F3C1BC
David Fries             CB1EE8F0
Roger Gammans           88DE0B3E
Austin Gonyou           59853282
Josh Litherland         893D9228
Brandon Low             1F012DC6
John L. Males           99ED3565
Brendan W. McAdams      82306710
Solomon Peachy          2DBBE7D0
Joe Radinger            F957E8F3
Udo A. Steinberg        233B9D29
Gianni Tedesco          8646BE7D
Martin Waitz            DFE80FB2
Derek James Witt        972FE938
Wiktor Wodecki          A1559FE7
Pete de Zwart           984AF710

I've bcc:d all of the above.

For those who possibly don't know how to upload their public key to a public server, here's how:

gpg --keyserver wwwkeys.pgp.net --send-keys <keyid>

He replied to himself the next day, saying:

I found the public keys for the following persons on search.keyserver.net (thanks to Roger Gammans for the hint!);

Brandon Low
Solomon Peachy
Pete de Zwart
Gianni Tedesco (seems to have a broken mail-client; the signatures on his posts are BAD, at least according to mutt/gnupg v1.0.7)
Roger Gammans

Finally a recommendation:

add

x-gpg-fingerprint: <fingerprint>
x-gpg-key: <url to your key or a keyserver>

to your mail-headers.

12. Journaling API Documentation For 2.4

14 Aug 2002 (2 posts) Archive Link: "Re: [PATCH] Re: Some JBD documenation"

Topics: FS

People: Roger Gammans

Roger Gammans posted a patch to add some journaling API documentation to Documentation/DocBook/journal-api.tmpl in 2.4. He said:

This patch is the JBD DocBook documentation patch which I've been promising for too long now ;-).

This one is against 2.4.20-pre2.

Andrew if you would prefer it diff against something different let me know.

13. VM Regress 0.5 Released

14 Aug 2002 - 17 Aug 2002 (4 posts) Archive Link: "VM Regress 0.5"

Topics: Virtual Memory

People: Mel Gorman

Mel Gorman announced:

Project page: http://www.csn.ul.ie/~mel/projects/vmregress/

Download: http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.5.tar.gz

This is the second public release of VM Regress. It is the beginnings of a regression, benchmarking and test tool for the Linux VM. The web page has an introduction and the project itself has quite comprehensive documentation and commentary. It is still in its very early days.

This release has a lot of solidification of the suite infrastructure, the full changelog is at the end. I'm not going to bore people with the details but it is in better shape. There are two new features that are of serious note.

  1. Print out page present/swapped map. The pagetable core module can now print out a map at the end of the test representing pages in an address space. This means that when a test is run referencing pages, it is possible to see *exactly* what the address space looked like after. I am considering expanding this to print out the address space of any process but I'm not sure how useful that would be. It might be handy for running an ad-hoc shell script and having the module attach to the process to print out its address space before exiting to see what it looked like at the end.....
  2. A script plot_map.pl is provided that will take test output that has a map (currently only fault.o) and produce a webpage of it. See http://www.csn.ul.ie/~mel/projects/vmregress/output_sample/test_fault_zero.html for what a sample page looks like. The graph shows the address space being tested. When the test finished, the whole process was being swapped out so all the pages at the beginning are swapped and the later ones are still present. It is expected that later tests will show how well rmap makes this look in comparison to older kernels. Later tests will also have a more interesting reference pattern than straight linear referencing.

The next feature to add to the graphs is including an artificial page reference count. For a given test, the number of times a page was referenced will be recorded and this will be compared to the pages present in memory. This should help determine if Linux can choose the right pages to keep in memory or not. From there, tests will be written with different reference patterns and different types of memory such as mmaped files and so on.

A feature of lesser note is the beginning of being able to view the vmalloc area. At the moment, it'll only print out the dimensions though. Much of the rest was building up the core.

Again, it can't provide all the answers yet, but it's very early days and if this was a quick job, it would have been written already :-) . I hope that eventually it'll be able to answer most VM related questions.

Changelog for this release as follows....

Version 0.5

Any feedback is appreciated.

14. I/O Scheduler Logic Munged In 2.4.20 Pre-Patch Cycle

14 Aug 2002 (1 post) Archive Link: "[patch] elevator seek accounting fixes"

People: Jens Axboe

Jens Axboe posted a patch and reported:

Folks, we have a somewhat embarrassing problem in the 2.4 i/o scheduler. Basically the accounting currently done in 2.4.20-pre2 is bogus.

Requests are aged as the i/o scheduler passes over them, looking for a merge or insertion point. If no merges are found, a complete queue scan will have been completed and thus aged all requests down by one. So requests in front of the insertion point are penalized as much as those behind it. If a merge is found, ll_rw_blk.c will call the elevator merge_cleanup function, which will account the merge by decrementing the sequence of requests behind the merge point by the sector count of the buffer. In addition, these requests have been aged as well. So the end result is that a seek penalizes all requests with a cost of 1, a merge penalizes requests behind it with a cost of sectors + 1, typically 9.

Completely crap! I dunno where this logic got reversed, but here's a patch to fix it. It does two things:

The actual values will probably need a bit of tweaking. I would very much appreciate people testing what gives good throughput on benchmarks, and what provides good latency for desktops.

Marcelo, please apply for 2.4.20-pre2
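
As an editorial illustration (not taken from Jens' patch), the accounting he describes can be modelled in a few lines of Python. The Request class, the function names, the starting budget of 100 and the 8-sector buffer are all assumptions made up for this sketch; "behind the merge point" is modelled as a higher list index, and the merged request itself is assumed not to be charged.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    sequence: int  # remaining latency budget (hypothetical starting value)


def age_on_seek(queue):
    """No merge found: the full queue scan ages every request by one."""
    for req in queue:
        req.sequence -= 1


def age_on_merge(queue, merge_index, sectors):
    """Merge found: requests behind the merge point are aged by the scan
    and then charged the sector count by the merge cleanup."""
    for req in queue[merge_index + 1:]:
        req.sequence -= 1        # aged while the scan passed over them
        req.sequence -= sectors  # merge accounting


def penalties(start, queue):
    """Cost charged to each request relative to its starting budget."""
    return [before - req.sequence for before, req in zip(start, queue)]


def fresh_queue():
    return [Request(f"r{i}", 100) for i in range(4)]


start = [req.sequence for req in fresh_queue()]

seek_queue = fresh_queue()
age_on_seek(seek_queue)
seek_cost = penalties(start, seek_queue)

merge_queue = fresh_queue()
age_on_merge(merge_queue, merge_index=1, sectors=8)
merge_cost = penalties(start, merge_queue)

print("seek :", seek_cost)
print("merge:", merge_cost)
```

Under these assumptions a seek charges every request 1, while a merge charges the requests behind it 8 + 1 = 9, which is the "typically 9" figure in the report.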

15. Cleaning Up Language In 2.4 Configure.help File

14 Aug 2002 (1 post) Archive Link: "PATCH - Configure.help grammar/readability patch"

People: Kevin Burtch

Kevin Burtch announced:

I've been thinking about doing this since the beginning of the configuration help texts. The majority of the changes are the replacement of "say" with "select" (along with appropriate rewording), since even though there are many different ways to configure the kernel, speech recognition isn't one of them yet. :) I've tried to correct the grammar with all of the replacements as thoroughly as I could, and make it more consistent. Since I was at it, I also wrapped some of the long lines.

The patch is against 2.4.19, and is located at http://www.geocities.com/kevin_burtch/config.patch.gz (I didn't want to post it since it's over 1MB in size uncompressed)

16. IDE Maintainership Troubles; IDE Work Delayed Until 2.7

16 Aug 2002 - 21 Aug 2002 (87 posts) Archive Link: "IDE?"

Topics: Disks: IDE, PCI

People: Linus Torvalds, Matthias Andree, Andre Hedrick, Skip Ford, Alan Cox, Marc-Christian Petersen, Rik van Riel, Andrew Morton, Larry McVoy, Andries Brouwer, Alexander Viro, Jens Axboe, Martin J. Bligh, Russell King, Mark Lord

Martin J. Bligh noticed that Jens Axboe had deleted the entire IDE core from 2.5, and replaced it with the version from 2.4.19-pre-acX. He asked what had happened, and Linus Torvalds replied, "Martin gave up the fight he had to do all the time, so.." Matthias Andree replied:

Not having seen much of all the work, this sounds like it is a sad day, for Martin who contributed a lot of his time to work on the issues, while people seemed not to be too grateful, screaming from either end of the road, this must wear people out.

Is there a way how the improvements that parts of the stuff have received can be rescued somehow? Or at least the knowledge of which can be used somewhat directly for a IDE-TNG driver? I'd find it really sad to just let go of so much time that had been invested into the project -- but keep in mind I have NO knowledge of the gory details of the last 2.5 IDE stuff.

Russell King said he'd resubmit some of his own stuff to the new maintainer, and Andre Hedrick said:

I will hand it to you guys on a silver platter IDE-TNG.

Below yields modular chipsets and channel index registration. Selectable IOPS for arch independent Taskfile Transport layers. Now to finish the job with device class link lists to address fully modular subdrivers. It also includes 1st generation of device open and select calls of subdrivers.

You have ide-cd registered on a cdrw and you want to burn a cd?
open(/dev/hdX) transform_subdriver_scsi close(/dev/hdX)
open(/dev/sg) and burn baby burn.
close(/dev/sg) releases transform_subdriver_scsi
open(/dev/hdX) load native atapi transport.

He added, "If this is what you want, this is what I have to put on the table. If you do not I will delete the code."

Skip Ford replied, "Can't you just create a patch and send it to the list? I for one would like to try out your code. Just diff it and send it without the song and dance please." Alan Cox remarked, "You can do the switch (one way only right now) in 2.4.20-pre2-ac3. Ultimately for 2.4 I want to get to the point where open() tries to switch between srfoo and hdfoo and locks out the other user. For 2.5 we can get more esoteric. 2.4 has to continue to just work."

Elsewhere, Marc-Christian Petersen said to Linus:

I am beside my self with laughing, sorry :P

I really can imagine what are you dreaming of. Like: "shit, f*ck, why the hell I kicked Andre Hedrick in the ass and why, for heaven's sake, I said jump in the lake to him?!!?"

Sorry, couldn't resist. ;)

Rik van Riel spoke up:

Having thrown away months and months of hard work, or giving up on months of hard work is NOT FUN.

I'm thankful Martin tried to make the IDE layer better.

His method of removing things to "add a better implementation later" may not have worked out in the end, but I'm thankful he tried.

And Andrew Morton added, "Yes. Martin starkly demonstrated how much work is needed in there, and how much cruft has accumulated. That is valuable."

Elsewhere, Linus also replied to Marc-Christian's "I really can imagine what you are dreaming" comment, saying:

Actually, you apparently can't.

I'm dreaming of an IDE maintainer that people (including, very much, me) can work with. I don't know why, but IDE has pretty much since day one been a fairly problematic area, and has caused a lot more maintainer headache than the rest of the kernel put together..

There's been one fairly smooth IDE transition (the original transition from hd.c to ide.c), and calling even that "smooth" is pretty much all hindsight - at the time people thought it was horribly stupid to not allow big controversial changes to hd.c, and the resulting code duplication was considered a disaster.

Right now it looks like Alan is at least for the moment willing to work on the IDE code, which is obviously great. I just wonder how long he'll stand it (he's maintained various IDE buglists etc issues for years, so we can hope).

Larry McVoy asked, "could you (or anyone) provide a history of the IDE maintainers to date and why they didn't work out and what would need to change to make them work out? I'm sticking my nose in where I know nothing but maybe one of them could rise up to the necessary level if it were spelled out what that level was." Andries Brouwer explained:

Prehistory: Linus and others.

Since start of 1994: Mark Lord. Everybody was happy until mid 1998 (2.1.111), when, after a discussion about problems a few people had with DMA, Linus patched linux/drivers/block/Config.in:

-        bool '     Generic PCI bus-master DMA support' CONFIG_BLK_DEV_IDEDMA
+       # Either the DMA code is buggy or the DMA hardware is unreliable. FIXME.
+        #bool '     Generic PCI bus-master DMA support' CONFIG_BLK_DEV_IDEDMA

and Mark wrote: "I will not be updating the IDE driver again until the linux/drivers/block/Config.in file is restored to its pre-111 state."

Since Fall 1998 (2.1.122): Andre Hedrick.

Recent history is known.

(Lots of other people also did useful work on IDE. I will not try to list names since I would forget some.)

Of course, "IDE maintainer" implies work on the interface with the hardware and work on the interface with the block I/O subsystem of the kernel. Some people know all about the hardware, others know much less about hardware but have good ideas about the driver interface. There is no reason to force the "IDE maintainer" to be a single person.

Linus replied:

There isn't in theory, but because a minor change to one part will make other drivers fail subtly, one person has to be the one that holds the bag in the end. Because somebody _will_ be the one that everybody looks at when something breaks.

That is probably one of the largest reasons for the "IDE disease". The symptoms of the disease is that people complain about the stuff not working, and the maintainer eventually getting so fed up with the complaints that he stops interacting with reality, and starts worrying about compliance with the documentation instead, hoping that that will fix everything.

Which would work fine, except a lot of the time the problems aren't due to things in the documentation, but simply due to hardware that isn't really in spec and needs to have workarounds etc. So once the "this is how it is documented, and if it doesn't work your machine is broken" disease starts, it's all downhill from there.

I will claim that this happens for a lot of other hardware too, but in other hardware there often isn't quite as much baggage (people in the end throw out the core and start on a new one without historical cruft), and the inter-driver linkages do not exist to _nearly_ the same degree. When a developer can work with just one chipset, it's still possible to believe that you can keep up. But when you get blamed for all the different IDE problems, you crawl into your shell and go away.

This is why I believe that the only sane result in the end is to have independent drivers that probably end up having a lot of duplication (the same way hd.c and ide.c started out with a lot of duplication), but where there truly _can_ be multiple people in charge of their own drivers (and also clear _whose_ problem it is when one IDE controller driver doesn't work).

(Some of the infrastructure could be made truly generic, but the generic part should _only_ be for stuff that is truly hardware-independent, and simply _cannot_ be impacted by quirks and outright bugs in the hw implementations. In short, only stuff that can be argued about on a logical and clear level, and where the rules are made up by us, not by quirky hardware).

Elsewhere, Linus also replied to Larry:

I actually don't think it's the people as much as it is the ridiculous linkages inside ide.c and the hugely complicated rules. The code is messy.

The network drivers have various setups that share the same chipset, but there they tend to be individual drivers that just share helper routines. Each driver still does their own PCI driver registration etc. In contrast, when it comes to IDE, you're supposed to be an IDE driver first, and a PCI chipset driver second, and that putting of the cart before the horse results in problems.

Even something as simple as a PIIX driver (which _should_ just register itself as a driver for the piix chipsets) doesn't do that. Instead, we have ide-pci.c, which has a list of all the chipsets it knows about, and then does initialization and calls the init routines that it knows about. That's just incestuous.

And we all know where incest leads. Hereditary insanity.
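
As an editorial aside, the two registration styles Linus contrasts can be sketched in Python. Nothing here is kernel code: the table, the registry and every function name are hypothetical, chosen only to show where knowledge of each chipset lives in the two designs.

```python
# Style A (the ide-pci.c pattern described above): one central table
# knows about every chipset and calls the init routine it has listed.
def init_piix(device):
    return f"piix ready on {device}"


def init_via(device):
    return f"via ready on {device}"


CENTRAL_CHIPSET_TABLE = {"piix": init_piix, "via": init_via}


def central_probe(chipset, device):
    init = CENTRAL_CHIPSET_TABLE.get(chipset)
    return init(device) if init else None


# Style B (the network-driver pattern): generic code only keeps a
# registry; each driver registers itself and owns its own probe routine.
DRIVER_REGISTRY = {}


def register_driver(chipset, probe):
    DRIVER_REGISTRY[chipset] = probe


def bus_probe(chipset, device):
    probe = DRIVER_REGISTRY.get(chipset)
    return probe(device) if probe else None


# This part would live in the (hypothetical) piix driver's own file.
def piix_probe(device):
    return f"piix ready on {device}"


register_driver("piix", piix_probe)

result_a = central_probe("piix", "00:07.1")
result_b = bus_probe("piix", "00:07.1")
print(result_a)
print(result_b)
```

Both styles give the same result for a known chipset. The difference is that adding a driver in style A means editing the central table, while in style B the new driver only calls register_driver() from its own code.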

Larry asked again, what would be required for an IDE maintainer. Alan Cox replied:

IMHO you need

Linus also replied to himself:

Note: it _is_ the people too, don't get me wrong. But in other areas we have people like Al Viro, who can drive grown men to cry (and drink) with his not-very-polite postings. And in those areas it hasn't been a huge problem, even though some people probably take a valium or two before they dare open emails from Al.

So the messiness and interconnectedness of the IDE layer just seems to bring the people problem to a sharp and ugly point. The absolute lack of communication skills wrt IDE among the people who have worked on it has been stunning, and that probably _is_ because the code is so hard to even talk about.

Alexander Viro piped up with:

Sigh... What we need with IDE is

  1. translator/bogon filter between hardware folks and the rest of us. If Jens or Alan are willing to do that for a while - wonderful.
  2. review of code structure in existing code. Doing that.
  3. careful massage (as opposed to grand rewrite) of said structure

into something sane. With series of small provable equivalent transformations. And whoever does that is in serious risk of burnout - current spaghetti in there is a fscking mess. I'll try to help with that - I know how to do such work, but I don't promise to get it all the way to sanity.

When we will have sane structure and sane interfaces, the life will get easier. Until then full-time maintainership of drivers/ide/* is a one-way ticket to Bedlam.

Linus replied:

I really would suggest an alternate (but not necessarily very different) approach.

The approach I'd advocate is to

The point of IDE-TNG would be to only support the major controllers this way, but let those major controllers have a driver that is meant for _them_ and doesn't have to worry about historical baggage.

And then in five years, in Linux-3.2, we might finally just drop support for the old IDE code with PIO etc. Inevitably some people will still use it (the same way some people still use Linux-2.0 with hd.c), but it won't have been in the way for making a cleaner driver in the meantime.

And yes, by now this all is obviously 2.7.x material.

17. NCR Voyager Support In 2.5

18 Aug 2002 (1 post) Archive Link: "[PATCH: NEW SUBARCHITECTURE FOR 2.5.31] support for NCR voyager (3/4/5xxx series)"

Topics: SMP, Version Control

People: James Bottomley

James Bottomley announced:

This patch adds SMP (and UP) support for voyager which is an (up to 32 way) SMP microchannel non-PC architecture.

The only change since 2.5.31 is that the code will now boot with a non-zero boot cpu id (previously current->cpu was being initialised too late).

The patch is in two parts: The i386 sub-architecture split is separated from the addition of the voyager components

http://www.hansenpartnership.com/voyager/files/arch-split-2.5.31.diff (108k)

http://www.hansenpartnership.com/voyager/files/voyager-2.5.31.diff (146k)

You must apply the split diff before applying the voyager one.

These two patches are also available as separate bitkeeper trees (the voyager tree is a superset of the arch-split one):

http://linux-voyager.bkbits.net/voyager-2.5

http://linux-voyager.bkbits.net/arch-split-2.5

18. 2.5 Problem Report Status

18 Aug 2002 (1 post) Archive Link: "2.5 Problem Report status"

Topics: Disk Arrays: RAID, Disks: IDE, FS: JFS, Feature Freeze

People: Thomas Molina

Thomas Molina reported:

Following is the latest status report. There have been no significant updates to the list in the past couple of days. The status report, with links to discussions can be found at:

http://members.cox.net/tmolina/kernprobs/status.html

Notes:

The state of the IDE subsystem in 2.5 is in too much of a flux for tracking problems to be fruitful. I probably won't add any new ones until feature freeze unless specifically requested. Floppy support is currently broken in 2.5. Higher priority items are delaying work on a fix.

2.5 Kernel Problem Reports as of 16 Aug

Problem Title                     Status       Discussion
RAID 0 BIO problem                open         2.5.30
schedule() with irqs disabled!    open         2.5.30
bonding driver failure in 2.5     closed       2.5.30
serial oops                       closed       2.5.30
NUMA-Q minimal workaround updates closed       2.5.30
PnP BIOS problem                  closed       2.5.30 
New connections stall             closed       2.5.30
JFS oops                          closed       2.5.30
serial core on embedded PPC       closed       2.5.30 
handle_scancode oops              closed       2.5.30
spinlock deadlock                 closed       2.5.30
smp cpu problem                   closed       2.5.30
LTP process_stress causes oops    open         2.5.30
elv_queue_empty oops              open         2.5.30
Page Writeback oops               open         2.5.30
slab BUG                          open         2.5.30
pmd_page problem                  open         2.5.30
vga console problem               open         2.5.30
P200MMX boot problem              open         2.5.30
io apic problem                   open         2.5.30
dcache oops                       open         2.5.30
vm86 oops                         open         2.5.30
modules don't work                open         12 Aug 2002
unmount oops                      open         12 Aug 2002
usb problem                       open         11 Aug 2002
modules don't work                open         13 Aug 2002
pte.chain BUG                     open         13 Aug 2002
scancode oops                     open         12 Aug 2002
cciss broken                      proposed fix 14 Aug 2002
qlogicisp oops                    open         14 Aug 2002
kmap_atomic oops                  open         15 Aug 2002

19. gcml2 Version 0.7 Available

19 Aug 2002 (1 post) Archive Link: "ANNOUNCE: gcml2 0.7"

People: Greg Banks

Greg Banks announced:

gcml2 is (among other things) a Linux kconfig language syntax checker. Version 0.7 is available at

http://sourceforge.net/project/showfiles.php?group_id=18813&release_id=106023

and

http://www.alphalink.com.au/~gnb/gcml2/download.html

There's also an online summary of the warnings and errors from the syntax checker, with real examples, from

http://www.alphalink.com.au/~gnb/gcml2/checker.html

Here's the change log

20. Improving Threading Scalability

19 Aug 2002 - 20 Aug 2002 (22 posts) Archive Link: "[patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6"

Topics: Big O Notation, SMP, Version Control

People: Ingo Molnar, Linus Torvalds

Ingo Molnar posted a patch and announced:

this patch is the next step in the journey to get top-notch threading support implemented under Linux.

every Linux kernel in existence, including 2.5.31, has a fundamentally unscalable sys_exit() implementation, with an overhead that goes up if the number of tasks in the system goes up. It does not matter whether those tasks are doing any work - just sleeping indefinitely causes sys_exit() overhead to go up.

200 tasks is typical for a relatively idle server system. 1000 tasks or more is not uncommon during usual server load on a midrange server. There are servers that use 5000 tasks or more. It is not uncommon for highly threaded code to use even more threads - client/server models are often the easiest to implement by using a per-connection thread model. [Eg. vsftpd, one of the fastest and most secure FTP servers under Linux, uses 2 (often 3) threads per client connection [which, for security reason is implemented via inside per-client isolated processes].]

Some numbers to back this up, i've tested 2.5.31-BK-curr on a 525 MHz PIII, it produces the following lat_proc fork+exit latencies:

  # of tasks in the system                200   1000    2000    4000
  ------------------------------------------------------------------
  ./lat_proc fork+exit (microseconds):  743.0  923.1  1153.4  1622.2

it can be seen that the fork()+exit() overhead more than doubles with every 4000 tasks in the system.

for threaded applications the situation is even worse. A threading benchmark that just tests the (linear) creation and exit of 100 thousand threads using the old glibc libpthreads library, gives the following results:

  # of tasks in the system                200   1000    2000    4000
  ------------------------------------------------------------------
  ./perf -s 100000 -t 1 -r 0 -T
  in seconds:                            17.8   37.3    61.1   108.0

  latency of single-thread create+exit
  in microseconds:                        178    373     611    1080

using the same test linked against new libpthreads:

  # of tasks in the system                200   1000    2000    4000
  ------------------------------------------------------------------
  ./perf -s 100000 -t 1 -r 0 -T
  in seconds:                             6.8   25.6    48.7    95.1

  latency of single-thread create+exit
  in microseconds:                         68    256     487     951

the same regression happens as in the old-pthreads case, but with a (dramatically) lower baseline [which are due to the other optimizations]. With a couple of hundred threads created, the thread create+exit latency becomes dominated by the hefty sys_exit() overhead.

all in one - sys_exit() is O(nr_tasks), and heavily so - even a reasonable number of completely idle tasks increase the exit() overhead very visibly.

why is sys_exit() so expensive? The main overhead is in forget_original_parent(), which must be called for every thread that exits: the kernel has to find all children the exiting task has created during its lifetime, and has to 'reparent' them to some other task. The current implementation uses a for_each_task() over every task in the system, and finds those tasks whose real parent is the now exiting task. This is a fundamental work of the kernel that cannot be optimized away - the child/parent tree must always stay coherent.

but the for_each_task() iteration is excessive. There is a subtle reason for it though: ptrace reparents debugged tasks to the debugging task, which detaches the child from the original parent. Thus forget_original_parent() has to search the whole tasklist, to make sure even debugged tasks are properly reparented.

the attached patch (against BK-curr) reimplements this logic in a scalable way: the ptrace code also maintains a global list of debugged tasks - which list is usually empty in a normal system. It has at most a few tasks - those which are debugged currently. This list can be maintained very cheaply, in a number of strategic places.

forget_original_parent() then searches the exiting task's ->children list, plus the global ptrace list. In the usual 'task has no children and there are no debugged tasks around' case this becomes as cheap as two list_empty() tests!

now on to the performance results, on the same 525 MHz PIII box, lat_proc:

  # of tasks in the system                200     1000     2000     4000
  ----------------------------------------------------------------------
  ./lat_proc fork+exit (microseconds):
                                        657.1    680.6    640.8    682.5

  (unpatched kernel results):          (743.0)  (923.1) (1153.4) (1622.2)

process creation latency is essentially constant, it's independent of the number of tasks in the system. Even the baseline results got improved by more than 10%. For the 4000 tasks case the speedup is more than 130%.

the speedup is even bigger for threaded applications using the old pthreads code:

  # of tasks in the system                200    1000    2000    4000
  --------------------------------------------------------------------
  ./perf -s 100000 -t 1 -r 0 -T
  in seconds:                            12.6    12.8    11.9    11.9
  (unpatched results):                   (17.8) (37.3)  (61.1) (108.0)

  latency of single-thread create+exit
  in microseconds:                        126    128     119      119
  (unpatched kernel):                    (178)   (373)   (611)  (1080)

  improvement:                             41%    191%   413%     807%

ie. in the 4000 tasks case the improvement is almost 10-fold!

testing the new glibc libpthreads code shows dramatic improvements:

  # of tasks in the system                200   1000    2000    4000
  ------------------------------------------------------------------
  ./perf -s 100000 -t 1 -r 0 -T
  in seconds:                             1.7    1.9     1.9     1.9
  (unpatched results):                   (6.8) (25.6)  (48.7)  (95.1)

  latency of single-thread create+exit
  in microseconds:                         17     19      19      19
  (unpatched kernel):                     (68)  (256)   (487)   (951)


  improvement:                            300%  1247%   2463%   4900%

in the 200 tasks case the speedup is 4-fold, in the 4000 tasks case the speedup is 50-fold (!). Thread create+destroy latency is a steady 19 usecs. This also enables the head-to-head comparison of old pthreads and new libpthreads: new libpthreads is more than 6 times faster. This is what i'd finally call good threading performance.

the hardest part of the patch was to make sure ptrace() semantics are still correct. I believe i have managed to at least test the typical workload: i've tested a complex mix of high-load strace situations, threaded and unthreaded code as well, SMP and UP, so i'm reasonably certain that it works for the kind of load i use on my systems. [ But ptrace() is complex beyond belief, so i might as well have missed some of the subtler items. ]
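
As an editorial check on the "improvement" rows in the tables above, the percentages can be reproduced by dividing the unpatched latency by the patched latency, subtracting one, and truncating. The sketch below uses the microsecond figures from the post; the function and variable names are made up for illustration.

```python
def improvement_percent(unpatched, patched):
    """Relative speedup of the patched kernel, in percent."""
    return (unpatched / patched - 1.0) * 100.0


# tasks in system -> (unpatched usec, patched usec), copied from the tables
old_pthreads_usec = {200: (178, 126), 1000: (373, 128),
                     2000: (611, 119), 4000: (1080, 119)}
new_pthreads_usec = {200: (68, 17), 1000: (256, 19),
                     2000: (487, 19), 4000: (951, 19)}
lat_proc_usec = {200: (743.0, 657.1), 4000: (1622.2, 682.5)}


def table(results):
    """Truncated improvement percentage per task count."""
    return {tasks: int(improvement_percent(unpatched, patched))
            for tasks, (unpatched, patched) in results.items()}


old_pct = table(old_pthreads_usec)
new_pct = table(new_pthreads_usec)
lat_pct = table(lat_proc_usec)

for label, pct in (("old pthreads", old_pct),
                   ("new libpthreads", new_pct),
                   ("lat_proc", lat_pct)):
    print(label, pct)
```

This reproduces the 41%, 191%, 413% and 807% row for old pthreads, and the 300%, 1247% and 2463% entries for new libpthreads. The 4000-task entry for new libpthreads works out to about 4905%, which the post rounds to 4900%. The lat_proc figures give 13% and 137%, consistent with "more than 10%" and "more than 130%".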

He posted an update shortly thereafter, and Linus Torvalds replied:

Hmm.. This looks good, but I wonder if the real problem isn't really that our ptrace approach has always been kind of flaky.

Basically, we started with the notion that only parents can trace their children, so no reparenting was ever needed. Then PTRACE_ATTACH came along, and we did the reparenting, and I think _that_ is where we made our big mistake.

We should just have made a separate "tsk->tracer" pointer, instead of continuing with the perverted "my parent is my tracer" logic. We shouldn't really re-write the parent/child relationship just because we're being traced.

I'd be happy to apply this patch (well, your fixed version), but I think I'd prefer even more to make the whole reparenting go away, and keep the child list valid all through the lifetime of a process. How painful could that be?

Ingo replied, "unless the ptrace interface is reworked in an incompatible way, i cannot see how this would work." He explained, "the problem is that the tracing task wants to do a wait4() on all traced children, and the only way to get that is to have the traced tasks in the child list. Eg. strace -f traces a random number of tasks, and after the PTRACE_CONTINUE call, the wait4 done by strace must be able to 'get events' from pretty much any of the traced tasks." Various folks discussed possible implementations, and at some point Linus said, "Ok, you've convinced me. The reparenting is fairly ugly, but it sounds like other implementations would be fairly equivalent and it would be mainly an issue of just which list we'd work on."
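
To make the scaling argument in this thread concrete, here is an editorial toy model in Python, not kernel code. The Task class, ptrace_attach() and both reparenting functions are hypothetical stand-ins; the model only counts how many tasks each scheme has to visit when a parent exits, and it leaves out the list bookkeeping for the task that adopts the orphans.

```python
class Task:
    def __init__(self, name, parent=None):
        self.name = name
        self.real_parent = parent  # the task that created us
        self.children = []         # tasks currently on our child list
        if parent is not None:
            parent.children.append(self)


def ptrace_attach(tracer, child, ptrace_list):
    """Tracing moves the child onto the tracer's child list, so the
    original parent can no longer find it through ->children."""
    child.real_parent.children.remove(child)
    tracer.children.append(child)
    ptrace_list.append(child)


def reparent_full_scan(all_tasks, exiting, reaper):
    """Old scheme: walk every task in the system."""
    visited = 0
    for task in all_tasks:
        visited += 1
        if task.real_parent is exiting:
            task.real_parent = reaper
    return visited


def reparent_lists(exiting, ptrace_list, reaper):
    """New scheme: walk the child list plus the global ptrace list."""
    visited = 0
    for task in list(exiting.children) + list(ptrace_list):
        visited += 1
        if task.real_parent is exiting:
            task.real_parent = reaper
    return visited


def build_system(idle_tasks=1000):
    init = Task("init")
    parent = Task("parent", init)
    kids = [Task(f"kid{i}", parent) for i in range(3)]
    debugger = Task("debugger", init)
    idle = [Task(f"idle{i}", init) for i in range(idle_tasks)]
    ptrace_list = []
    ptrace_attach(debugger, kids[0], ptrace_list)
    all_tasks = [init, parent, debugger] + kids + idle
    return all_tasks, init, parent, kids, ptrace_list


all_tasks, init, parent, kids, ptrace_list = build_system()
full_visits = reparent_full_scan(all_tasks, parent, init)
full_ok = all(kid.real_parent is init for kid in kids)

all_tasks, init, parent, kids, ptrace_list = build_system()
list_visits = reparent_lists(parent, ptrace_list, init)
list_ok = all(kid.real_parent is init for kid in kids)

print("full scan visited:", full_visits)
print("two lists visited:", list_visits)
```

With 1000 idle tasks the full scan visits 1006 tasks, while the two-list version visits 3: the two children still on the parent's list plus the one traced child found through the ptrace list. Both end up reparenting all three children, including the traced one.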

21. SUNRPC Maintainership

19 Aug 2002 (1 post) Archive Link: "SUNRPC maintainer"

Topics: MAINTAINERS File, Maintainership

People: Chuck Lever

Chuck Lever asked to be listed as the SUNRPC maintainer in the /usr/src/linux/MAINTAINERS file.

22. Support For vt8235 IDE Chip In 2.4

19 Aug 2002 (1 post) Archive Link: "Add support for vt8235 IDE for 2.4"

Topics: Disks: IDE, Version Control

People: Vojtech Pavlik

Vojtech Pavlik posted a patch and announced:

This adds support for the vt8235 IDE to the 2.4 kernel. Very needed, because the chip is now starting to sell.

Same patch should also apply to 2.5.

You can import this changeset into BK by piping this whole message to: '| bk receive [path to repository]' or apply the patch as usual.

23. Maintainer List

19 Aug 2002 (1 post) Archive Link: "lk maintainers"

Topics: Bug Tracking, Disk Arrays: RAID, Disks: IDE, Disks: SCSI, FS: NFS, FS: NTFS, FS: ReiserFS, FS: autofs, FS: devfs, FS: ext2, FS: ext3, FS: smbfs, Framebuffer, Hot-Plugging, I2O, Kernel Build System, Networking, PCI, Real-Time: RTLinux, Samba, Serial ATA, Software Suspend, Sound: ALSA, Spam, USB, Virtual Memory

People: Denis Vlasenko, Trond Myklebust, Arnaldo Carvalho de Melo, Alexander Viro, Hans Reiser, Rik van Riel, Linus Torvalds, Vojtech Pavlik, Geert Uytterhoeven, Jeff Garzik, Andre Hedrick, Greg KH, Jaroslav Kysela, Anton Altaparmakov, Tigran Aivazian, Martin J. Bligh, Arjan van de Ven, Eric S. Raymond, Mike Phillips, Oleg Drokin, H. Peter Anvin, Alan Cox, Pavel Machek, Dave Jones, Richard Gooch, Andrew Morton, Jens Axboe, Ingo Molnar, Victor Yodaiken, James Simmons, Tim Waugh, Rusty Russell, Gerd Knorr, Andrea Arcangeli, Martin Dalecki, David S. Miller, Rogier Wolff, Urban Widmark, Petr Vandrovec, Marcelo Tosatti, Neil Brown, Ralf Baechle, Russell King, Keith Owens, Robert Love, Maksim Krasnyanskiy

Denis Vlasenko said:

This document is mailed to lkml regularly and will be modified whenever a new victim wishes to be listed in it or someone can no longer devote his time to maintainer work.

If you want your entry added/updated/removed, contact me.

------- cut here ------ cut here ------ cut here ------ cut here ------

So, you are new to Linux kernel hacking and want to submit a kernel bug report or a patch but don't know how to do it and _where_ to report it?

Preparing bug report:

Compile problems: report GCC output and result of "grep '^CONFIG_' .config"
Oops: decode it with ksymoops
Unkillable process: Alt-SysRq-T and ksymoops relevant part
Yes it means you should have ksymoops installed and tested, which is easy to get wrong. I've done that too often.

More info in the FAQ at http://www.tux.org/lkml/

Sending bug report/patch:

Current Linux kernel people

Note that this list is sorted in reversed date order, most recent entries first. This means that entries at bottom can be outdated :-(

Linux kernel mailing list <linux-kernel@vger.kernel.org>
        Post anything related to Linux kernel here, but nothing else :-)

Dave Engebretsen <engebret@vnet.ibm.com> [15 aug 2002]
        PPC64 architecture maintainer.  Please send PPC64 patches to me
        and our mailing list at <linuxppc64-dev@lists.linuxppc.org>

Ingo Molnar <mingo@elte.hu> [30 jul 2002]
        Ingo wrote the new scheduler for 2.5.

Ralf Baechle <ralf@uni-koblenz.de> [30 jul 2002]
        I am maintainer of the AX.25 code

Victor Yodaiken <yodaiken@fsmlabs.com> [30 jul 2002]
        RTLinux patches, updates, contributions, drivers.
        Please send first to the list: rtl@rtlinux.org

Pavel Machek <pavel@ucw.cz> [27 jul 2002]
        I am network block device maintainer. Visit http://nbd.sf.net.
        (see Steven Whitehouse <steve@gw.chygwyn.com> entry)
        I am working on software suspend.

William Irwin <wli@holomorphy.com> [02 jul 2002]
        Send bug reports and/or feature requests related to many tasks,
        rmap, space consumption, or allocators to me. I'm involved in
        * rmap
        * memory allocators
        * reducing space consumed by data structures (e.g. struct page)
        * issues arising in workloads with many tasks
        * kernel janitoring
        See also:
        Rik van Riel <riel@surriel.com>
        Andrea Arcangeli <andrea@suse.de>
        Martin Bligh <Martin.Bligh@us.ibm.com>
        Andrew Morton <akpm@zip.com.au>

Dave Jones <davej@suse.de> [23 apr 2002]
        I collect various bits and pieces for inclusion in 2.5,
        especially small and trivial ones and driver updates.
        I'll feed them to Linus when (and if) they
        are proved to be worthy.

Andre Hedrick <andre@linux-ide.org> [09 apr 2002]
        ATA/ATAPI Storage Architect [2.0,2.2,2.4]
        HBA interface developer
        Serial ATA Architect [future release]
        Voting NCITS member AT-Attachment Committee

Andrea Arcangeli <andrea@suse.de> [28 mar 2002]
        Send VM related bug reports and patches to me.
        I'm especially interested in VM issues with:
        * lots of RAM and CPUs
        * NUMA
        * heavy swap scenarios
        * performance of I/O intensive workloads (in particular
          with lots of async buffer flushing involved)
        See also Martin J. Bligh <Martin.Bligh@us.ibm.com> entry
        Mail also:
        Arjan van de Ven <arjanv@redhat.com>

Martin J. Bligh <Martin.Bligh@us.ibm.com> [28 mar 2002]
        I'm interested in VM issues with lots (>4G for i386)
        of RAM, lots of CPUs, NUMA

Steven Whitehouse <steve@chygwyn.com> [27 mar 2002]
        I am the Linux DECnet network stack maintainer
        Visit http://www.chygwyn.com/decnet/

Arnaldo Carvalho de Melo <acme@conectiva.com.br> [26 mar 2002]
        IPX, 802.2 LLC, NetBEUI, http://kerneljanitors.org,
        cyclom2x sync card driver

John Cagle <jcagle@kernel.org> [19 mar 2002]
        The current maintainer of devices.txt, the list of
        assigned device numbers for LANANA.  Consult the web
        site (www.lanana.org) for instructions on submitting
        requests for new device numbers.  Send all device
        related email to <device@lanana.org>.

Tigran Aivazian <tigran@veritas.com>
        I am author and maintainer of BFS filesystem and IA32
        microcode update driver.

Rogier Wolff <R.E.Wolff@BitWizard.nl> [12 mar 2002]
        I do "specialix serial ports":
        drivers/char/specialix.c (IO8+)
        drivers/char/sx.c        (SX, SI, SIO)
        drivers/char/rio/*.c     (RIO)

Martin Dalecki <martin@dalecki.de> [11 mar 2002]
        IDE subsystem maintainer for 2.5
        (mail Vojtech Pavlik <vojtech@suse.cz> too)

Ed Vance <serial24@macrolink.com> [05 mar 2002]
        Maintainer for the generic serial driver, serial.c,
        for 2.2 and 2.4 kernels.  Please post patches to
        linux-serial@vger.kernel.org for tested bug
        fixes or to add support for a new serial device.
        Limited to time available. If I have not responded
        in a week, yell at serial24@macrolink.com

netfilter/iptables development <netfilter-devel@lists.samba.org> [23 feb 2002]
        Please report all netfilter/iptables related problems
        to this mailinglist, where all netfilter developers are present.
        See also http://www.netfilter.org/contact.html

Hans Reiser <reiser@namesys.com> [16 feb 2002]
        Send me all reiserfs related patches with a cc to
        reiserfs-dev@namesys.com, send bug reports to
        reiserfs-dev@namesys.com, send paid support requests to
        support@namesys.com after going to www.namesys.com/support.html
        to pay, send discussions (not bug reports unless they are
        interesting to most persons) to reiserfs-list@namesys.com.
        If we sit on your patch for a week without responding,
        yell at us, we deserve it.  Look at our web page
        at www.namesys.com for more about sending us code,
        working with us, and our patch submission and tracking system.

Paul Bristow <paul@paulbristow.net> [16 feb 2002]
        I am an ide-floppy driver maintainer
        (ATAPI ZIP, LS-120/240 Superdisk, Clik! drives).

Mike Phillips <phillim2@comcast.net> [15 feb 2002]
        Token ring subsystem and drivers.

Anton Altaparmakov <aia21@cam.ac.uk> [15 feb 2002]
        I am the NTFS guy.

https://bugzilla.redhat.com/bugzilla [14 feb 2002]
        Reports of problems with the Red Hat shipped kernels.

Alan Cox <alan@lxorguk.ukuu.org.uk> [14 feb 2002]
        Linux 2.2 maintainer (maintenance fixes only).
        Collator of patches for unmaintained things in 2.2/2.4.
        Maintainer of the 2.4-ac (2.4 plus stuff being tested) tree.
        I2O, sound, 3c501 maintainer for 2.2/2.4.

Robert Love <rml@tech9.net> [14 feb 2002]
        Preemptible kernel is mine.

ALSA development <alsa-devel@alsa-project.org> [12 feb 2002]
Jaroslav Kysela <perex@perex.cz> [12 feb 2002]
        Advanced Linux Sound Architecture
        ALSA patches are available at
        ftp://ftp.alsa-project.org/pub/kernel-patches/*

Neil Brown <neilb@cse.unsw.edu.au> [08 feb 2002]
        I am interested in any issues with the code in:
        NFS server    (fs/nfsd/*)
        software RAID (drivers/md/{md,raid,linear}*)
        or related include files.

Maksim Krasnyanskiy <maxk@qualcomm.com> [08 feb 2002]
        I'm author and maintainer of the Bluetooth subsystem
        and Universal TUN/TAP device driver.
        These days mostly working on Bluetooth stuff.

Rik van Riel <riel@conectiva.com.br> [07 feb 2002]
        Send me VM related stuff, please CC to linux-mm@kvack.org

Geert Uytterhoeven <geert@linux-m68k.org> [07 feb 2002]
        I work on the frame buffer subsystem, the m68k port (Amiga part),
        and the PPC port (CHRP LongTrail part).
        Unfortunately I barely have spare time to really work on these
        things. My job is not Linux-related (so far :-). I can not
        promise anything about my maintainership performance.

H. Peter Anvin <hpa@zytor.com> [07 feb 2002]
        i386 boot and feature code, i386 boot protocol, autofs3,
        compressed iso9660 (but I'll accept all iso9660-related
        changes.)  kernel.org site manager; please contact me
        for sponsorship-related issues.

kernel.org admins <ftpadmin@kernel.org> [07 feb 2002]
        Kernel.org sysadmins.  Contact us if you notice something breaks,
        or if you want a change make sure you give us at least 1-2 weeks.
        Please note that we got a lot of feature requests, a lot of
        which conflict or simply aren't practical; we don't have time to
        respond to all requests.

Greg KH <greg@kroah.com> [07 feb 2002]
        I am USB and PCI Hotplug maintainer.

Trond Myklebust <trond.myklebust@fys.uio.no> [07 feb 2002]
        I am NFS client maintainer.

James Simmons <jsimmons@transvirtual.com> [07 feb 2002]
        Console and framebuffer subsystems.
        I also play around with the input layer.

Richard Gooch <rgooch@atnf.csiro.au> [07 feb 2002]
        I maintain devfs. I want people to Cc: me when reporting devfs
        problems, since I don't read all messages on linux-kernel.
        Send devfs related patches to me directly, rather than
        bypassing me and sending to Linus/Marcelo/Alan/Dave etc.

Russell King <rmk@arm.linux.org.uk> [06 feb 2002]
        ARM architecture maintainer.  Please send all ARM patches through
        the patch system at http://www.arm.linux.org.uk/developer/patches/
        New serial drivers maintainer for 2.5.  Submit patches to
        rmk+serial@arm.linux.org.uk

Andrew Morton <akpm@zip.com.au> [05 feb 2002]
        I'm receptive to any reproducible bug anywhere in the 2.4 kernel.
        Specialising in ext2, ext3 and network drivers.
        Not thinking about 2.5.x at this time.

Petr Vandrovec <vandrove@vc.cvut.cz> [05 feb 2002]
        ncpfs filesystem, matrox framebuffer driver, problems related
        to VMware - in all of 2.2.x, 2.4.x and 2.5.x.

Reiserfs developers list <reiserfs-dev@namesys.com> [05 feb 2002]
        Send all reiserfs-related stuff here including but not limited to bug
        reports, fixes, suggestions.

Oleg Drokin <green@linuxhacker.ru> [05 feb 2002]
        SA11x0 USB-ethernet and SA11x0 watchdog are mine.

Vojtech Pavlik <vojtech@ucw.cz> [05 feb 2002]
        Feel free to send me bug reports and patches to input device drivers
        (drivers/input/*, drivers/char/joystick/*)
        I also want to receive bug reports and patches for following
        USB drivers: printer, acm, catc, hid*, usbmouse, usbkbd, wacom.
        All other (not in the list) USB driver changes should go to USB
        maintainer (hopefully there is one listed here :-).
        Also CC me if you are posting VIA IDE driver related message
        (although I am not IDE subsystem maintainer).

======= These entries are suggested by lkml folks ========

Ralf Baechle <ralf@gnu.org> [27 mar 2002]
        I am mips/mips64 maintainer.

David S. Miller <davem@redhat.com> [07 feb 2002]
        I am Sparc64 and networking core maintainer.

======= These ones I made myself ========
======= I am waiting confirmation/correction from these people ========

Urban Widmark <urban@teststation.com> [13 feb 2002]
        smbfs

Jeff Garzik <jgarzik@mandrakesoft.com> [12 feb 2002]
        I am the network-card-drivers guy (8139 for instance).
        CC me and Andrew Morton <akpm@zip.com.au> on network driver patches.

video4linux list <video4linux-list@redhat.com> [12 feb 2002]
Gerd Knorr <kraxel@bytesex.org> [12 feb 2002]
        video4linux

Tim Waugh <twaugh@redhat.com> [08 feb 2002]
        > Who is maintaining the linux iomega stuff?
        For 2.4.x, me (in theory). I don't have time for 2.5.x at the moment.

Alexander Viro <viro@math.psu.edu> [5 feb 2002]
        I am NOT a fs subsystem maintainer. But I won't kill
        you if you send me some generic fs bug reports and (hopefully) patches.

Eric S. Raymond <esr@thyrsus.com> [5 feb 2002]
        Send kernel configuration bug reports and suggestions to me.
        Also I'll be more than happy to accept help entries for kernel config
        options (Configure.help).

Gérard Roudier <groudier@free.fr> [5 feb 2002]
        I am SCSI guy.

Jens Axboe <axboe@suse.de> [5 feb 2002]
        I am block device subsystem maintainer.

Keith Owens <kaos@ocs.com.au> [5 feb 2002]
        ksymoops, kbuild, .. .. .. .. .  are mine.

Linus Torvalds <torvalds@transmeta.com> [5 feb 2002]
        Do not send anything to me unless it is for 2.5, well tested,
        discussed on lkml and is used by significant number of people.
        In general it is a bad idea to send me small fixes and driver
        updates, send them to subsystem maintainers and/or
        "small stuff" integrator (currently Dave Jones <davej@suse.de>,
        see his entry). Sorry, I can't do everything.

Marcelo Tosatti <marcelo@conectiva.com.br> [5 feb 2002]
        Do not send anything to me unless it is for 2.4 and well tested.
        If you are sending me small fixes and driver updates, send
        a copy to subsystem maintainers and/or "small stuff" integrators:
        - Alan Cox <alan@lxorguk.ukuu.org.uk>,
        - Rusty Russell <trivial@rustcorp.com.au>.

Rusty Russell <rusty@rustcorp.com.au> [5 feb 2002]
        Here are some cleanups of whitespace in .....
        Want me to add this to the trivial patch collection for tracking?
        If so just send (or cc:) it to trivial@rustcorp.com.au.

24. Status Of 2.4 Virtual Memory Subsystem

19 Aug 2002 (5 posts) Archive Link: "Re: Linux 2.4.20-pre4"

Topics: Virtual Memory

People: Rik van Riel, Martin J. Bligh, Andrew Morton, Andrea Arcangeli

Someone asked if Andrea Arcangeli's VM subsystem work from his -aa tree would be merged into 2.4.20, or a later 2.4 kernel, or at all, and someone else said they hoped it would be merged soon, as it seemed really great. Rik van Riel said it wouldn't happen until someone split the huge -aa patch into small self-contained pieces, in which case there was a good chance it would be merged. Martin J. Bligh remarked that as far as he knew, Andrew Morton had already done that. Someone else confirmed this, adding that 2.4.19 had received some of those pieces, and that the rest were slated to go into 2.4.20.

25. Preallocating Blocks On ReiserFS

20 Aug 2002 (1 post) Archive Link: "[bk] Reiserfs patch 1 of 1 for 2.4.20"

Topics: FS: ReiserFS, Version Control

People: Hans Reiser, Oleg Drokin

Hans Reiser announced, "Credit to Oleg Drokin. This changeset enables preallocation of blocks on reiserfs filesystems by default. Prevents excessive interleaving of file layouts and excessive calls to the allocator that can hurt performance, especially on multiprocess workloads. This simply restores the default to what was in 2.4.19 but using the new allocator code. Please apply. You can get it from bk://thebsh.namesys.com/bk/reiser3-linux-2.4"

26. devfs v199.16 Is Available

20 Aug 2002 (1 post) Archive Link: "[PATCH] devfs v199.16 available"

Topics: FS: devfs

People: Richard Gooch

Richard Gooch announced:

Version 199.16 of my devfs patch is now available from: http://www.atnf.csiro.au/~rgooch/linux/kernel-patches.html. The devfs FAQ is also available here.

Patch directly available from: ftp://ftp.??.kernel.org/pub/linux/kernel/people/rgooch/v2.4/devfs-patch-current.gz

AND:

ftp://ftp.atnf.csiro.au/pub/people/rgooch/linux/kernel-patches/v2.4/devfs-patch-current.gz

This is against 2.4.20-rc3. Highlights of this release:

27. NTFS Update

20 Aug 2002 (1 post) Archive Link: "[BK-2.5 PATCH] NTFS 2.1.0 1/7: Add config option for writing"

Topics: FS: NTFS, FS: ext3, Version Control

People: Anton Altaparmakov, Andrew Morton, Richard Russon

Anton Altaparmakov said:

Linus, please do a

bk pull http://linux-ntfs.bkbits.net/ntfs-2.5

Below is the 1st of 7 ChangeSets updating NTFS to 2.1.0, which you will get when you bk pull the ntfs-2.5 repository. Together they implement file overwrite support for NTFS.

This first ChangeSet is the only one touching files outside fs/ntfs/ and the files touched are only the defconfig files and fs/Config.in and fs/Config.help, which are updated adding a new configuration option for the new write support.

The remaining ChangeSets add the actual write code in small chunks.

I would like to thank Andrew Morton and Al Viro for investing their time and answering the numerous questions I had.

Comments on the code would be appreciated, so get reading everyone. (-:

The current code is relatively well tested both for mmap(2) and write(2) both using existing applications to randomly write to files and using custom programs to do specialized writes to test boundary conditions.

Still the code has only been run on two machines, so people trying it, please have backups! I am confident it won't eat your data, but I am not willing to guarantee it! I have put in an appropriately very scary config help message to scare off the casual user for the moment...

Features of NTFS 2.1.0

It is now possible to write over existing files both with mmap(2) and write(2).

It is now possible to setup a loopback on an NTFS file and then you have full read/write access to the loopback device. You can create a Linux fs on the loop device for example and mount it.

This has been a much requested feature because it allows installation of Linux on an NTFS partition using the loopback trick, i.e. from Windows one creates a large file on NTFS, then one boots Linux (from installation CD, rescue floppies or whatever) and as root does:

mount -t ntfs -o rw /dev/hda1 /mnt/ntfs
losetup /dev/loop0 /mnt/ntfs/some_dir/preprepared_large_file
mke2fs -j /dev/loop0
mount -t ext3 /dev/loop0 /mnt/new_root
mkdir /mnt/new_root/old_root
<install Linux into /mnt/new_root>
umount /mnt/new_root
losetup -d /dev/loop0
umount /mnt/ntfs
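
The sequence above assumes the large file already exists on the NTFS volume. To rehearse the losetup and mke2fs steps on a scratch filesystem first, a zero-filled file of a chosen size is enough. A sketch, where `make_image` and its two arguments are hypothetical names introduced only for illustration:

```sh
# Create a zero-filled file of size_kb kilobytes to serve as a loopback image.
# make_image is a hypothetical helper, not something from the NTFS patch.
make_image() {
    image_path="$1"
    size_kb="$2"
    # bs=1024 keeps the unit explicit and works with both GNU and BSD dd.
    dd if=/dev/zero of="$image_path" bs=1024 count="$size_kb" 2>/dev/null
}
```

For example, `make_image /tmp/test_root.img 65536` would produce a 64 MB file that can then be handed to losetup as root.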

From now on, you can boot Linux and, using a minimal ramdisk loaded via floppy for example, one just needs to have something similar to the following done:

mount -t ntfs -o rw /dev/hda1 /mnt/ntfs
mount -t ext3 -o loop /mnt/ntfs/some_dir/preprepared_large_file /mnt/new_root
cd /mnt/new_root
pivot_root . old_root
exec chroot . sh <dev/console >dev/console 2>&1
umount /old_root

[Note you probably cannot umount /old_root but it doesn't matter. It doesn't disturb anyone... You could always hide it inside root/old_root or something so users don't see it.]

I haven't actually tried to install Linux in the above way but Richard Russon (flatcap) tested the loopback/mke2fs/read-write stuff and it worked fine for him.

Limitations of NTFS 2.1.0 overwrite abilities

Anyone who tries this new code please let me know how you get on...

28. Status Of 2.5 Kernel

20 Aug 2002 (1 post) Archive Link: "[STATUS 2.5] August 21, 2002"

Topics: Disks: IDE, FS: XFS, Feature Freeze, Serial ATA

People: Guillaume Boissiere

Guillaume Boissiere announced the August 21 Status List:

In recent news, the 1.0 version of Serial ATA has been released, and XFS is now available as a series of small patches for inclusion.

The details, as always, are at: http://www.kernelnewbies.org/status/

I also marked a number of projects that seem unlikely to be merged before feature freeze in grey. If you disagree, let me know.

29. devfs Version 217 Available

20 Aug 2002 (1 post) Archive Link: "[PATCH] devfs v217 available"

Topics: FS: devfs

People: Richard Gooch

Richard Gooch announced:

Version 217 of my devfs patch is now available from: http://www.atnf.csiro.au/~rgooch/linux/kernel-patches.html. The devfs FAQ is also available here.

Patch directly available from: ftp://ftp.??.kernel.org/pub/linux/kernel/people/rgooch/v2.5/devfs-patch-current.gz

AND: ftp://ftp.atnf.csiro.au/pub/people/rgooch/linux/kernel-patches/v2.5/devfs-patch-current.gz

NOTE: kernel 2.5.1 and later require devfsd-v1.3.19 or later.

This is against 2.5.31. Highlights of this release:

30. IPMI Driver For 2.4

21 Aug 2002 (5 posts) Archive Link: "[patch] IPMI driver for Linux"

People: Corey Minyard, Alan Cox, Larry Butler

Corey Minyard announced:

I have been working on an IPMI driver for Linux for MontaVista, and I think it's ready to see the light of day :-). I would like to see this included in the mainstream kernel eventually. You can get it at http://home.attbi.com/~minyard. It should work on any kernel version, although you will have to fix up the Config.in and Makefile, and the Configure.help stuff may not work (it's currently in the 2.4 location).

The web page has documentation on the driver, and documentation is included in the patch, too. This is a fairly full-featured driver with a watchdog, panic event generation, full kernel and userland access to the driver, multi-user/multi-interface support, and emulators for other IPMI device drivers.

Alan Cox had some specific complaints, but added, "its way way way nicer than the hideous thing a certain chip vendor sent me."

Larry Butler said he'd been working on his own version of the same driver. He said:

I've been working on a driver too because the busy waits in the drivers that are out there can hold a CPU for too long. I've measured as much as 120ms.

First I tried sleeping in the driver until the very next jiffy. I found that my driver became unreliable under high CPU load because the scheduling delays were too long. I even managed to wedge the BMC on one of my test systems in a way I can't seem to fix. :)

What I finally settled on was using the timer interrupt. This seems to work well both in terms of being nice to the rest of the system (I register a shared irq handler only while I need it) and being reliable even under high load. So, just consider it a suggestion. I'd like to see your driver included too. It's certainly more complete than mine. You must have access to more documentation than I do.

And Corey replied:

I tie into the highres timer code for short sleeps. It does require that you have highres timers installed in your kernel and enabled. Otherwise you are right, it is very slow.

Since I had access to highres timers, that was a lot easier than hooking into and configuring the timer interrupt, and a lot more portable, too.

If you want to post your code or modify mine to add the timer interrupt support, that would be great.

31. Plan For IDE In 2.5

21 Aug 2002 (3 posts) Archive Link: "2.5 IDE Whitepaper?"

Topics: Disks: IDE, Hot-Plugging, PCI

People: Paul Bennett, Jeff Garzik, Alan Cox

Paul Bennett asked, "I am looking for documentation regarding the 2.5 IDE rewrite. For example: What are the goals for 2.5. What is the implementation plan? What were the problems in 2.4, and how will they be fixed in 2.5, etc?" Jeff Garzik replied:

<chuckle> I wish :) I imagine it will happen like most things happen, Linus describes his ideas and goals and wishes in a few lkml posts, and eventually something like it happens :)

Alan Cox (the acting IDE maintainer) replied:

I can try, my working list approximates this (ignoring the 2.5 porting/block I/O stuff which is a chapter in itself)

Phase #1 (mostly complete)
Merge Andre's current code [DONE]
Remove all the bogus code from the PCI drivers [90% DONE]
Move all the drivers separate from the core code [DONE]
Migrate the PCI drivers to a registration API and allow insmod
Fix bugs arising from the first bits of phase 1

Phase #2
Deal with insmod of a device currently running as legacy
Fix up the locking ready to allow rmmod of a pci driver
Allow rmmod and hotplug at the controller level

Phase #3
Complete splitting setup-pci functions into smaller bits of code and replace deep magic and callbacks with functions called from each driver. Get all the if device==foo out of the PCI code paths

Phase #4
Do something about the ide_register/unregister end of the world and legacy chipset stuff. The PPC folks may tackle this in advance
Get us to the point we can foo = ide_attach(); ide_remove(foo) for arbitrary interfaces

And then (when the setup is turned the right way out and not before) begin looking at turning the actual block I/O engine the right way out. (That is driver calls helpers not midlayer and magic)

That should allow us to keep solid stable IDE along the way.

32. VM Regress 0.6 Is Available

21 Aug 2002 (1 post) Archive Link: "VM Regress 0.6"

Topics: Big Memory Support, Virtual Memory

People: Mel Gorman

Mel Gorman announced:

Project page: http://www.csn.ul.ie/~mel/projects/vmregress/
Download: http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.6.tar.gz

This is the third public release of VM Regress. It is the beginnings of a regression, benchmarking and test tool for the Linux VM. The web page has an introduction and the project itself has quite comprehensive documentation and commentary. It is still in its very early days but there is a lot more in here than there was in 0.5.

This release had a lot of minor bug fixes in it including building with highmem support and late 2.5 kernels. It has been heavily tested with both 2.4.19 and 2.4.19-rmap14a.

The first item of note is that multiple instances of the same test can now run but only one will output information to the proc entry. This will allow 100 small instances of a test to run rather than one very large instance.

Second item is the pagemap.o module. When read, it will print out all VMA's of the reading process and print what pages are present/swapped in that region in encoded format. A perl library is provided for decoding the information.

Third item is the introduction of the mapanon.o module. It exports four proc interfaces for opening, reading, writing and closing memory mapped regions. It is designed to be used by a benchmarking perl script (bin/bench_mapanon.pl --man for details) for testing how quickly anonymous pages are used within an mmaped region and illustrates what pages the kernel decides to swap out. The report from the benchmark will show how quickly pages were accessed, what pages were present/swapped in comparison to how often a page was referenced and a graph of vmstat output. Tests are running currently measuring the performance of 2.4.19 and 2.4.19-rmap14a. They will be posted up when they complete running.

Fourth item is several perl libraries made available that are aimed at making developing of new tests very easy. They cover a lot of the drudge work a test has to do such as graphing, reading proc entries, decoding information and so on. The manual has most of the details. All of VM Regress is designed to be very easy to interface with so other tests can be easily developed.
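
For readers who have not looked at per-process mappings before, the stock kernel already lists a process's VMAs in /proc/<pid>/maps, which gives a feel for the kind of data the pagemap.o module described above reports (the module's own encoded format is not reproduced here). A sketch, assuming the usual `start-end perms ...` line layout with hexadecimal addresses; `vma_sizes` is a hypothetical helper name:

```sh
# Print the permissions and size of each VMA listed in a maps file.
# vma_sizes is a hypothetical helper; it reads the standard /proc/<pid>/maps
# layout, not the encoded output of pagemap.o.
vma_sizes() {
    maps_file="${1:-/proc/self/maps}"
    while read -r range perms rest; do
        start="${range%-*}"   # text before the dash: start address (hex)
        end="${range#*-}"     # text after the dash: end address (hex)
        echo "$perms $(( (0x$end - 0x$start) / 1024 )) kB"
    done < "$maps_file"
}
```

Calling `vma_sizes /proc/$$/maps` lists the regions of the current shell, one line per VMA.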

The next step is to update mapanon to cover mmaped files as well as anonymous memory. This is so a simulated web server can be run, complete with bots browsing web pages, similar to what Rik van Riel outlined in an email sent to the list. This will help the tool be both a micro analysis and overall performance testing and benchmark tool.

Further down the line is the development of statistical analysis tools for examining different data sets, in particular the timing information the bench_mapanon.pl script produces.

This is still very much in its early days and is expected to take a long time to develop fully but it's at the point where it can produce useful figures. Reasonably comprehensive documentation is available with the package and from the webpage. Any feedback is appreciated.

Full changelog for 0.6

Version 0.6

Sharon And Joy
 

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.