Kernel Traffic #42 For 8�Nov�1999

By Zack Brown

Table Of Contents

Mailing List Stats For This Week

We looked at 1024 posts in 4085K.

There were 405 different contributors. 186 posted more than once. 146 posted last week too.

The top posters of the week were:

1. The Saga Continues: Filesystem Corruption In Stable Series

21�Oct�1999�-�29�Oct�1999 (34 posts) Archive Link: "access beyond end of device errors in 2.2.13pre18 AND 2.2.5"

Topics: Disk Arrays: RAID, Disks: IDE, Disks: SCSI, FS: NFS, SMP

People: Mark Hagger,�Alan Cox,�Martin Schulz,�Lech Szychowski,�Stephen C. Tweedie,�Paul Jakma,�Steven N. Hirsch

This first came up in Issue�#25, Section�#3� (16�Jun�1999:�FS Corruption With Later 2.2.x?) , when the problem of filesystem corruption in the stable series was first reported. A little progress was made the following week in Issue�#26, Section�#7� (29�Jun�1999:�Filesystem Corruption Saga Continues) , but nothing conclusive. Meanwhile, Alan Cox had inserted some debugging code that became useful in other ways in Issue�#27, Section�#7� (4�Jul�1999:�Tekram Bug, Hunt And Fix) . But the problem was still causing frayed nerves in Issue�#28, Section�#20� (18�Jul�1999:�Questions Of FS Integrity Slows Development Of Stable Series) . This time, Mark Hagger reported an error ("attempt to access beyond end of device") that would lead to filesystem corruption with all stable kernels after 2.2.5, and added, "This problem is really starting to get very worrying indeed, and I suspect is going to be a real pain to track down. But as it leads to file system corruption (and I've seen others reporting this in the list), surely this has to be tracked down pretty urgently? Especially since Redhat are now shipping 2.2.12 as their distribution kernel!"

Alan replied, "People are looking believe me. The little **** is going to die as soon as its found 8)"

A number of people reported similar errors and corruption. Tommy van Leeuwen was one of those, but added that folks seemed to be speculating that all such problems with the 2.2.x series boiled down to hardware problems. Martin Schulz also reported on his experiences, saying, "Well I do have a similar Problem with LVD Scsi drives (IBM DDRS) connected to an 7890 Adaptec U2W controller (on board on Asus P2Bs); Intel Pentium II 400MHz, 128MB RAM. Running kernel 2.2.5-22." He had suspected various hardware components (memory, cables, scanner, power supply) and tested each in turn. They all checked out OK, and the problem persisted.

Lech Szychowski reported the same problem with a similar system, but added, "2.3.24pre4 seems (so far, of course) to be free of this problem; by this I mean that 2.3.23pre1 after having been running for the same time as 2.3.24pre4 has now most likely would have complained."

Steven N. Hirsch reported his first encounter with this problem, and posted his hardware configuration:

ASUS P2B-DS w/ 2 x 500Mhz. PIII
IBM 9LZX U2 LVD SCSI disk on built-in Adaptec 7890/1 controller
256 MB PC-100 non-ECC Dimm
Viper 550 AGP video w/ 16 Meg

Kernel 2.2.13 + valinux nfs patches

He was sure there were no hardware problems with the box, since it had been totally stable prior to that error. Lech speculated that it might be an SMP problem, but Martin Schulz (who also reported similar problems) said his box was a UP system.

Alan suggested that anyone experiencing the problem with any kernel in the stable tree, should first of all upgrade to the latest kernel (2.2.13) and see if the problem persisted. Darrell Wright tried this and got the same errors when using ide-scsi emulation. Accessing files on a cdrom mounted with ide-scsi emulation turned on, gave the tell-tale error.

Paul Jakma also reported getting the same error under 2.2.13pre12 and earlier under 2.2.11; he had assumed hardware problems at first, but wasn't sure anymore, with all the other reports. He described his system:

AMD K6-233
TMC Via MVP3 board
Buslogic Multimaster.
Seagate Cheetah, IBM DDRS, IBM DGHS.
Errors occured on the DGHS, other drives are scratch space.

Machine patched with latest RAID and knfsd 1.4. NFS server to a couple of
other machines, 1 of which mounts /usr.

Regarding the likelihood of the problem being caused by hardware, Alan replied, "Most cases are. However I am sure not every single one is. That makes it really hard to trace alas." Paul replied that neither SMP nor SCSI/IDE seemed to be the common factor, and suggested that it might be knfs. Martin, for one, couldn't deny it.

Stephen C. Tweedie said that hardware problems and certain knfsd problems could account for the vast majority of reports, and suggested that everyone seeing the errors use the most recent knfsd.

P.A.M. van Dam replied that he was getting the error but not using knfsd. For him, doing extensive IO on a 4-way striped LV triggered the error, while a single LV gave no problem. Stephen replied:

Yes, but my reply was to another user with another fault.

This is exactly why this problem is so hard to track down --- when something random goes wrong in the OS, users often notice it first as filesystem corruption (I saw exactly the same thing when working on VMS filesystem internals for DEC).

In this case we have a known knfsd fault which explains some of the problems. The bulk of the other reports have been traced to hardware problems, or to memory scribbles caused by other bits of the kernel in certain specific kernel versions.

If you are only seeing the problem on a specific complex LV volume, then that suggests that either you are getting data corruption from one of the drives on that LV (over-long IDE cables, for example, or drives with known firmware bugs have all shown it in the past), or that there is an LV bug somewhere.

Meanwhile, Martin had followed Stephen's advice and upgraded to knfsd-1.4.7-7 and kernel 2.2.13, but still saw the error. However, he then took out knfsd and replaced it with the nfs-server-2.2beta40-1 package. At first glance, this seemed to solve the problem, but shortly afterwards the same error started cropping up again.

No solution presented itself on the list.

2. Protecting The Kernel From Root

26�Oct�1999�-�28�Oct�1999 (16 posts) Archive Link: "Sealing the kernel"

Topics: Access Control Lists, Backward Compatibility

People: John Langford,�Aaron Sethman,�Dimitris Margaritis,�Jesse Pollard,�Horst von Brand

John Langford and Dimitris Margaritis took an idea from phrack-55, to create a module that would seal the kernel off, so arbitrary kernel-mode access was not available to root. This would protect systems from trojan modules. They had some working code, but they wanted to make it a bit more subtle. They wanted to allow only certain pre-selected modules to be loadable, and no others. But John explained:

Unfortunately, a problem develops here, and this is where we would like some advice. We want the md5sum to run in the kernel because we are thinking of kernel mode as "safe" and anything in user mode as "unsafe". But the relocation of the module happens in insmod in (untrusted) user mode. Naturally, modules relocated to different addresses have different md5sums. There are several possible work arounds:

  1. Try to guess the most likely eventual module relocation locations and allow the md5sum of the relocated module to match any one of the possibilities.

    - messy and likely to fail to work.

  2. Shift module relocation into kernel mode doing the md5sum as soon as it is seen.

    - feasible but a fairly large change in the way things are done. All of a sudden the create_module syscall is useless.

  3. Make create_module pass back a dummy relocation (0x00000000?) then complete the relocation in kernel mode if the module passes an md5sum.

    - inelegant, but maintains backwards compatibility

  4. Try to just do an md5sum on unrelocated portions of the module

    - nasty and it's unclear that this is "safe".

Are there other methods? What's best?

Aaron Sethman questioned the value of such a project, saying, "You have a good idea...But what good is it really when the person can just go ahead a recompile a kernel...replace the current one and then cause the system to "crash". Unless of course you are booting off of some sort of read-only media(A write protected floppy comes to mind). Also, another idea, backdoor insmod/modprobe so that your special module doesn't get loaded again in the future. Its really impossible to protect the machine from root. Sure you can keep them out of the kernel level stuff. But what good is that really? The root user could still do nasty things to your system regardless."

Dimitris explained:

Yes, John forgot to mention that we're assuming boot from a read-only media such as a write-protected floppy or CD-ROM. We also assume that the rc scripts, kernel, and all modules to be loaded at boot time (before of course the sealing module) also reside on that medium.

About your last point, yes, root can do a lot of nasty things, but by sealing the kernel at least they are constrained to what's available through kernel services. That may help presumably by disabling a lot of stuff in the running kernel.

Jesse Pollard gave a pointer to Rule Set Based Access Control (RSBAC) for Linux (http://www.rsbac.de/rsbac/) page, and suggested that a better general solution would be to make root a non-privileged user.

There was a bit more discussion, but big-time folks like Horst von Brand felt there were much simpler solutions available.

3. Proposal For Half Duplex Serial Support

27�Oct�1999 (9 posts) Archive Link: "Half duplex serial support"

Topics: Ioctls

People: Riley Williams,�David Woodhouse,�Theodore Y. Ts'o

Riley Williams made the following proposal:

This is a revision of my original proposal to take in information since received, not a duplicate, but I've revised it to state what appears to be the current requirements:

  1. The serial port will provide a flag field to indicate its state, with the following states being defined:
    1. NOT-HERE - This port is not present. This is the default state for all ports in the kernel before any probes occur, and is the only valid state for ports that are not found to be present.
    2. FULL-DUPLEX - This port is currently operating in full duplex mode. This is the default state for all ports for which the hardware has been detected.
    3. FULL-TO-RX - This port is in the process of switching from full duplex mode to half duplex receiving mode.
    4. RECEIVING - This port is currently receiving data in half duplex mode, so anything to transmit is just queued until reception finishes.
    5. RX-TO-TX - This port is in the process of switching from reception to transmission of data in half duplex mode. This would only occur when the receiver has been idle for a predefined length of time AND the transmit buffer is not empty.
    6. TRANSMITTING - This port is currently transmitting data in half duplex mode, so is not able to receive anything.
    7. TX-TO-RX - This port is in the process of switching from transmission to reception of data in half duplex mode.
    8. RX-TO-FULL - This port is in the process of switching from half duplex reception to full duplex mode.
  2. The following hooks should be implemented:
    1. A routine to be called to say "Time sufficient for X bytes to be received has passed without reception of any bytes, and the port is in the RECEIVING mode" (as defined above). The default is for this routine to be called only if X (an unsigned quantity) is non-zero.
    2. A routine to be called to say "This port is in the TRANSMITTING mode and has run out of data to send".
    3. A routine to be called to switch the transmitter when any of the transitions (1.3), (1.5), (1.7) or (1.8) need to occur.
  3. The following additional IOCTL's should be implemented:
    1. One to specify the value of X as referred to in (2.1) above.
    2. One to specify that the port should switch to state Y (as defined in (1) above) now.

From an implementation viewpoint:

2.1 could be implemented along the lines of "On receipt of each byte, set a timeout to expire when the time needed for X bytes to be received has passed, resetting any such timeout that is currently in progress for this port. Should this timeout actually occur, call the specified routine."

2.2 would be implemented fairly easily, and the default response in the absence of any other would be to cause transition (1.7) to occur.

2.3 would be specific to the individual line disciplines.

David Woodhouse pointed out that part 3 of Riley's proposal did not have to implement net IOCTLs. He explained that Theodore Y. Ts'o had suggested that they "extend the API between the line discipline and the low level driver to support these requirements, and then implement our half-duplex protocols as line disciplines."

Speaking for himself, Ted elsewhere added, "What you've laid out is a pretty good design of what the half-duplex line discpline will need to do; it's where the work is done that's critical in terms of managing complexity." He and Riley then had a brief implementation discussion.

4. VFS Inode Structure

27�Oct�1999 (6 posts) Archive Link: "Question on the VFS inode structure."

Topics: BSD, Microkernels: Mach

People: Malcolm Beattie,�Alexander Viro

T V Govind noticed that a lot of filesystem-specific information was kept in the VFS (Virtual File System) inode structure. Since the VFS layer is supposed to be generic, it didn't make sense to him that it would have information about specific filesystems. He suggested that a pointer into filesystem-specific data would make the inode structure more reasonable, and wouldn't force a modification to the VFS layer every time a new filesystem was supported under Linux. Alexander Viro replied that as soon as the pagecache changes settled down in the unstable series, the filesystem-specific data would be moved to external allocation. As of 2.3.1 there were no longer any obstacles to implementation, but there were still a lot of things to add along the way.

Malcolm Beattie offered an interesting explanation of Digital UNIX:

I don't remember exactly what you proposed for this, but if you are intending to replace the inode structure with a generic one which has a pointer to a separately allocated per-fs structure then here's a warning. Digital UNIX used to that but had to change it exactly the opposite way for performance reasons. Its vnode structure used to have a v_data pointer to a separately allocated per-fs "private" structure. However, performance was bad because of the pointer following and the extra fragmentation. They ended up reuniting the two so that they could allocate the whole (generic + private) structure in one chunk and then made v_data point to directly after the generic part (i.e. the start of the no-longer-separate private part).

They did the same with their ungodly mixture of BSD and Mach structures for tasks and threads: they had separate struct task, struct proc and struct utask which they combined into a single struct super_task (though still with pointers between the supposedly "separate" bits) and they had separate struct thread and struct np_uthread which they combined into struct super_thread.

Alexander was properly appalled.

5. Subtle Multiarchitecture Code

28�Oct�1999�-�31�Oct�1999 (20 posts) Archive Link: "[PATCH] four warnings and a patch"

Topics: Microkernels: Mach

People: Horst von Brand,�Michael Elizabeth Chastain,�Matthew Wilcox,�Alan Cox,�Dave Jones

Dennis Hou got some warnings while compiling 2.3.24, and posted a one-line patch to fix one of them. The warning fixed by the patch was, "8390.c:1092: warning: unused variable `ei_local'", and the patch merely removed the line that used that variable. Alan Cox (and others) pointed out that the variable actually was used, just not in Intel architectures. It was also very hard to see because of some weird macro indirection. The question had come up before, and nobody seemed surprised that it had come up again. Dave Jones asked if maybe the relevant code could be #ifdef'd for the architectures that used it, so it wouldn't come up again. But Horst von Brand recommended, "Linus hates #ifdef (and I concur, for what it is worth). Why not just place the answer I was given (and which the gentleman doing the latest patch got, and countless others, no doubt) as a comment near the offending line?"

Elsewhere, Dennis did submit a patch that #ifdef'd the line, but Michael Elizabeth Chastain replied:

I think duplicating that #ifdef is bad because now there are two places where a maintainer would have to change it.

I recommend just putting a comment in front of NS8390_trigger_send and letting this problem go.

I tried another fix, which is to make EI_SHIFT(x) always reference its argument (I was inspired by #define spin_lock in include/linux/spinlock.h). This fixes the 8390.c warning but breaks pcnet_cs.c.

This little warning isn't worth the trouble.

Matthew Wilcox also replied to Dennis' patch, saying, "I don't think this is a good idea. Take a look at the Mach sources sometime for what happens if you allow this sort of thing. For example: scsi_ncr53c700.c contains 568 preprocessor lines out of a total of 3430. It's real spaghetti code."

6. Backward Compatibility In The Unstable Tree

28�Oct�1999�-�30�Oct�1999 (5 posts) Archive Link: "term issue 2.3.22 <-> 25"

Topics: Backward Compatibility

People: Alan Cox,�Jeremy Katz,�Linus Torvalds

Soren Leiden reported that 'gnome-terminal' wouldn't run since 2.3.23, while oddly enough 'rxvt' and 'xterm' both worked fine. Jeremy Katz replied that he didn't know about 2.3.23, but there was a change to the 'rlimit' struct in 2.3.24 that could break 'gnome-terminal'. Alan Cox replied that he had just sent Linus Torvalds a patch to fix this backward compatibility problem. Soren asked if the problem affected more than just 'gnome-terminal', and Alan replied, "It breaks gnome-terminal and a couple of other apps that do sanity checking of their potentially setuid environment. Those that don't care don't trip up," and added, "Its a syscall bug. Now a fixed one 8)"

7. Questions And Answers About New Buffer Hash

29�Oct�1999�-�30�Oct�1999 (5 posts) Archive Link: "questions about new buffer hash"

Topics: Disk Arrays: RAID, Disks: IDE, Disks: SCSI, FS: ext2, FS: procfs

People: David S. Miller,�Andrea Arcangeli

Andrea Arcangeli noticed that 2.2.14pre12 merged a new buffer hashtable from David S. Miller. In David's code was the comment (quoted by Andrea), "After several hours of tedious analysis, the following hash function won. Do not mess with it... -DaveM" Andrea asked if he could see the analysis. David replied:

First I modified the kernel such that I could dump the active buffer head state into user space at any point in time. I created a new procfs node, /proc/bcache, which allow this. It dumped raw structures which presented, for each currently active buffer head, only the elements of the buffer head structure which the hash function cares about.

Next, using this kernel, I configured several block device configurations to get a decent mix of block device numbers. For the most part the configurations were composed of varying ratio'd combinations of SCSI disks, IDE disks, and RAID volumes of the previous two.

On the configurations mentioned I would run two sets of tests to fill up the buffer cache. The first set would be to fill the disks up with tons of directories and other metadata, then run a "find -type f" to fill as much of main memory with buffer cache data as I could. The second test used "dbench X" where X was chosen such that 2.2.x (which doesn't have your LRU changes nor the new page cache stuff) only got minimal memory pressure and main memory held most of the written dirty data and metadata created by the benchmark.

During 1 second intervals, and also right after each test set, buffer cache snapshots were taken using my /proc/bcache interface and stored into files.

Next using this test data acquired from the machines I wrote a simulator. Into which various hash functions could be plugged in and various statistics about the distribution of the hash input bits could be obtained.

The simulator would simply use the hash function you selected, the hash table size you tell it, and build a hash table from the dump file to get a snapshot of buffer cache hash table state for any one of the dumps obtained during the test runs done above. It would print out all sorts of useful statistics such as longest hash chain length, shortest, average length, etc.

At this point I had all of the data and tools necessary to perform the analysis.

The next task of course was hash function experimentation and trying to quantize something about the behavior of the bits in the hash input values.

The simulator was first given the old buffer cache hash function, plus another function which just took the device/blocknr and jumbled all of the bits around evenly into the hash modulo bits. The latter was found to act better than the existing hash, but still left a few medium to long sized chains during the simulations.

To come up with the final hash function seen in the code right now I analyzed the bit changing distribution of each bit in each hash function input value from all the buffer cache state dumps.

Each bit in each of the two inputs was assigned a "changing priority" by the simulator. A bit which changed more often was assigned a higher priority. So for example a bit which was 0 half of the time and 1 half of the time would have the highest possible priority.

Bits which changed rarely or not at all are of no use to the hash, since they will not change the value much if at all. Bits which have a high rate of change, if possitioned in non-conflicting spots within the hash, will aide the distribution of the hash function.

With that in mind, and my bit changing priorities, I decided upon the positions where block encompassing high priority bit sets belonged within the hash modulo encompassing bits. This is how the hash function you see was derived.

Finally, I took this new hash function and plugged it into the simulator. It performed better than the other two test hash functions, and had the best distribution for all hash table size selections. On average the distribution was at worst %92 off of an ideal perfect even distribution. No hash function is perfect, and perfection was not what was strived for here.

I enclose below the dump hash analysis program in it's utter RAW form I left it in when I finished my buffer cache hash function tuning.

The hash table size is chosen by the HSHIFT macro, and the hash function is controlled by the HASH macro. Like I said this thing is in a very RAW form, which was fine for me since I had this constantly recompiled and in an editor during my work which is all I needed. I could turn on and off various forms of statistics as I needed them using preprossor directives.

Andrea gave many thanks for the information, and had two points to make in response. First, he said that David hadn't taken account of the time needed to compute the hash function. With an empty hash table, the cost of creating the table might be a real-world concern. David replied:

Yes I most certainly did, I count processor cycles for a living in the work I do. Anyone who knows me well will tell you that I'll completely rearchitect the trap handling on UltraSparc from scratch just to save 2 processor cycles in the trap entry/exit code for that platform.

The computational cost in real life means:

  1. Some _constant_ number of processor cycles
  2. Some _constant_ set of instruction cache lines to hold to hash computation code

Compare this (in real life) to the cost of:

  1. _Non-constant_ extra traversals through the main hash chain loop
  2. _Non-constant_ memory references assosciated with touching significantly more than (nr_buffer_heads / nr_hash_chains) buffer heads.

And furthermore, I did not use any multiplications or other multi-cycle operations in the hash function.

Furthermore, if the hash table is almost empty, there isn't much buffer cache activity is there? So whatever small added cost my new hash function has is lost on the profiling radar compared to whatever other things the system must be doing.

Andrea replied, having done some benchmarks to compare 2.2.12 hash table creation times with the new 2.2.14pre2's hash table creation times. His results seemed to show that David's code used 2.6 times the time as the previous code.

Andrea's other point, in his initial reply to David's long explanation, was that David's test-data may not have been generic enough. In particular, Andrea was concerned that his hash function may have been unintentionally optimized for ext2 (or ext2+raid) patterns. He explained, "This because for example the metadata on ext2 can't flow on the whole disk (there's no inode map in ext2) and depending on the layout of your filesystem and the size of your hashtable, you may found the per-group-fixed-located metadata colliding in very special (not generic) buckets (and so you may need to ignore some special bit in the block_nr that is never hit by metadata writes). So it's not obvious to me that the fixed bits remains the same without using ext2 for example." David replied:

I used UFS, MINIX, and ISO9660 file systems in my tests. I did not explicitly list the filesystem types I had used in my initial response, for this I apologize.

If it is impossible to generate a perfect hash for all input values, then some data set must be chosen to act as one which is representative of the data upon which the hash function can be expected to act.

I believe the data sets I chose to use are something a bit more than "empirical" as you have described it, and are a good representation of the data sets which will be seen by the general populace of Linux users.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.