Table Of Contents
| # | Dates | Posts | Topic |
|---|---|---|---|
| 1. | 13 Jun 2002 - 23 Jun 2002 | (29 posts) | Sparc64 Support For O(1) Scheduler; Developer Interaction |
| 2. | 18 Jun 2002 - 23 Jun 2002 | (39 posts) | Shrinking ext2 And ext3 Directories |
| 3. | 18 Jun 2002 - 24 Jun 2002 | (100 posts) | The Future Of Linux Multiprocessor Support |
| 4. | 18 Jun 2002 - 22 Jun 2002 | (11 posts) | Thoughts On Bootless Kernel Upgrades |
| 5. | 18 Jun 2002 - 21 Jun 2002 | (6 posts) | Cleaning Up The Source Tree |
| 6. | 19 Jun 2002 - 24 Jun 2002 | (35 posts) | ext2/ext3 Scalability |
| 7. | 20 Jun 2002 - 22 Jun 2002 | (5 posts) | Status Of CML2 And Kernel Configuration System |
Mailing List Stats For This Week
We looked at 1159 posts in 5612K.
There were 375 different contributors. 182 posted more than once. 146 posted last week too.
1. Sparc64 Support For O(1) Scheduler; Developer Interaction
13 Jun 2002 - 23 Jun 2002 (29 posts) Archive Link: "[PATCH] 2.4-ac: sparc64 support for O(1) scheduler"
Topics: Big O Notation, Developer Interaction, SMP, Scheduler
People: Robert Love, David S. Miller, Ingo Molnar, Thomas Duffy, Alan Cox
Robert Love posted a patch and said to Alan Cox:
Attached patch provides SPARC64 support for the O(1) scheduler in 2.4-ac. This is based off a 2.5 backport for my O(1) scheduler patches by Thomas Duffy (i.e. give him the credit).
I do not know if any other architectures in 2.4-ac support the new scheduler yet, but I will work on sending you the diffs as I get them or do them...
Patch is against 2.4.19-pre10-ac2, please apply.
David S. Miller objected, "Ummm what is with all of those switch_mm() hacks? Is this an attempt to work around the locking problems? Please don't do that as it is going to kill performance and having ifdef sparc64 sched.c changes is ugly to say the least. Ingo posted the correct fix to the locking problem with the patch he posted the other day, that is what should go into the -ac patches." But Robert replied, "I am explicitly refraining from sending Alan any code that is not well-tested in 2.5 and my machines first. As Ingo's new switch_mm() bits are not even in 2.5 yet, I plan to wait a bit before sending them... (I am currently putting together all the scheduler bits we have been working on for a 2.4-ac patch...)" He added, "If you like, Alan can hold off on this and take it when the appropriate patches are in." But David rejoined:
Your sparc64 kernel/sched.c bits have zero testing in any kernel. What point are you trying to make? It disables a very important optimization on SMP sparc64. It's simply unacceptable.
Ingo's change which deletes the frozen locking bits has to be installed with the patches which allow sparc64 to continue working without the deadlock bug, they cannot be added separately.
And Robert came back with, "I don't care about Sparc64, especially as a short term item. Long term yes you are right but for the -ac work, it can fall back for a while."
David didn't reply to this, but close by, Ingo Molnar said regarding his own patches, "Linus applied them already, they will be in 2.5.22. They fix real bugs and i've seen no problems on my testboxes. Those bits are a must for SMP x86 and Sparc64 as well, there is absolutely no reason to selectively delay their backmerge. Besides the last task_rq_lock() optimization which got undone in 2.5 already, all the recent scheduler bits i posted are needed." Robert replied:
I know they are fine (I looked over them) and I saw Linus took them, but 2.5.22 is not yet out and I did not see any reason to rush to new bits to Alan for 2.4 when we could wait a bit and make sure 2.5 proves them fine...
My approach thus far with 2.5 -> 2.4 O(1) backports has been one of caution and it has worked fine thus far. I figure, what is the rush?
2. Shrinking ext2 And ext3 Directories
18 Jun 2002 - 23 Jun 2002 (39 posts) Archive Link: "Shrinking ext3 directories"
Topics: FS: ext2, FS: ext3
People: Stephen C. Tweedie, Alexander Viro, Andrew Morton, Andreas Dilger, Daniel Phillips
Someone pointed out the age-old problem that after deleting files and directories from a given directory in an ext2 or ext3 filesystem, the blocks allocated for them in that directory would not be freed. The poster knew that the traditional way around this was to create a new directory, move any desired items from the old directory to the new, and then delete the old entirely. But this seemed messy, so the poster asked if there were any way to 'shrink' the directory without going through the rigamarole.
At one point, Andreas Dilger said that there was no way currently, and that implementing such a feature would probably take a lot of work. Stephen C. Tweedie also agreed that there was no current implementation, but he added, "However, I know that Daniel Phillips has been thinking about adding that for his HTree extensions which add fast directory indexing to ext2/3." But Alexander Viro said with a shrug, "for ext2 a limited form of "shrinking" is easy to implement. ext2_delete_entry() can easily notice that it's about to create an empty entry spanning entire last block. In that case it should just walk back and check beginnings of previous blocks, as long as they are empty (inode = 0, len = block size). Then it's vmtruncate() time - all IO on directories is protected by i_sem, so we are safe. IOW, making sure that empty blocks in the end of directory get freed is a matter of 10-20 lines." He offered to do it himself, and the original poster said that would be great. But Stephen objected, with:
It's certainly easier at the tail, but with htree we may have genuinely enormous directories and being able to hole-punch arbitrary coalesced blocks could be a huge win. Also, doing the coalescing block by block is likely to be far easier for ext3 than truncating the directory arbitrarily back in one go.
Chopping a large directory at once brings back the truncate() nightmare of having to make an unbounded disk operation seem atomic, even if it has to get split over multiple transactions. Incremental coalescing should allow us to know in advance how many disk blocks we might end up touching for the operation, so we can guarantee to do it in one transaction.
At one point Andrew Morton gave some links to patches and remarked, "btw, I merged all the ext3 htree stuff into 2.5.23 yesterday. Haven't tested it much at all yet." At this point folks delved into the particular implementation details.
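As a rough illustration of the tail-trimming idea Alexander described (not his patch, and not kernel code), the walk-back can be modelled in a few lines of Python. The `DirEntry` class, the 4096-byte `BLOCK_SIZE`, and the function names are all assumptions made up for this sketch; deleting list elements merely stands in for the vmtruncate() call.

```python
from dataclasses import dataclass
from typing import List

BLOCK_SIZE = 4096  # assumed block size, for illustration only


@dataclass
class DirEntry:
    inode: int    # 0 means the entry is unused
    rec_len: int  # number of bytes this entry spans within its block


def block_is_empty(block: List[DirEntry], block_size: int = BLOCK_SIZE) -> bool:
    """A block is empty when its first entry is unused and spans the whole block."""
    first = block[0]
    return first.inode == 0 and first.rec_len == block_size


def trim_empty_tail(blocks: List[List[DirEntry]], block_size: int = BLOCK_SIZE) -> int:
    """Drop trailing empty blocks in place and return how many were freed.

    Walks back from the last block while blocks are empty, then truncates.
    Empty blocks in the middle of the directory are left alone.
    """
    keep = len(blocks)
    while keep > 0 and block_is_empty(blocks[keep - 1], block_size):
        keep -= 1
    freed = len(blocks) - keep
    del blocks[keep:]  # stands in for truncating the directory file
    return freed


# Hypothetical directory: used, empty, used, empty, empty
used = [DirEntry(inode=12, rec_len=BLOCK_SIZE)]
empty = [DirEntry(inode=0, rec_len=BLOCK_SIZE)]
directory = [list(used), list(empty), list(used), list(empty), list(empty)]
freed = trim_empty_tail(directory)
print(freed, len(directory))  # 2 3
```

Note that the empty block in the middle of the example directory survives. That is the limitation Stephen raised: only hole-punching or coalescing would reclaim it.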
3. The Future Of Linux Multiprocessor Support
18 Jun 2002 - 24 Jun 2002 (100 posts) Archive Link: "latest linus-2.5 BK broken"
Topics: Clustering, FS: initramfs, FS: ramfs, Hyperthreading, Microkernels, Ottawa Linux Symposium, Real-Time, SMP, Scheduler, Version Control
People: Linus Torvalds, Eric W. Biederman, Larry McVoy, Cort Dougan, Jeff Garzik
In the course of discussing something else, Linus Torvalds remarked:
I'm absolutely 100% convinced that you don't want to have a "single kernel" for a cluster, you want to run independent kernels with good communication infrastructure between them (ie global filesystem, and try to make the networking look uniform).
Trying to have a single kernel for thousands of nodes is just crazy. Even if the system were ccNuma and _could_ do it in theory.
The NuMA work can probably take single-kernel to maybe 64+ nodes, before people just start turning stark raving mad. There's no way you'll have single-kernel for thousands of CPU's, and still stay sane and claim any reasonable performance under generic loads.
Eric W. Biederman replied:
The compute cluster problem is an interesting one. The big items I see on the todo list are:
Services like a schedulers, already exist.
Basically the job of a cluster scheduler gets much easier, and the scheduler more powerful once it gets the ability to suspend jobs. Checkpointing buys three things. The ability to preempt jobs, the ability to migrate processes, and the ability to recover from failed nodes, (assuming the failed hardware didn't corrupt your jobs checkpoint).
Once solutions to the cluster problems become well understood I wouldn't be surprised if some of the supporting services started to live in the kernel like nfsd. Parts of the distributed filesystem certainly will.
I suspect process checkpointing and restoring will evolve something like pthread support. With some code in user space, and some generic helpers in the kernel as clean pieces of the job can be broken off. The challenge is only how to save/restore interprocess communications. Things like moving a tcp connection from one node to another are interesting problems.
But also I suspect most of the hard problems that we need kernel help with can have uses independent of checkpointing. Already we have web server farms that spread connections to a single ip across nodes.
Larry McVoy came in with:
I've been trying to get Linus to listen to this for years and he keeps on flogging the tired SMP horse instead. DEC did it and Sun has been passing around these slides for a few weeks, so maybe they'll do it too. Then Linux can join the party after it has become a fine grained, locked to hell and back, soft "realtime", numa enabled, bloated piece of crap like all the other kernels and we'll get to go through the "let's reinvent Unix for the 3rd time in 40 years" all over again. What fun. Not.
Sorry to be grumpy, go read the slides, I'll be at OLS, I'd be happy to talk it over with anyone who wants to think about it. Paul McKenney from IBM came down to San Francisco to talk to me about it, put me through an 8 or 9 hour session which felt like a PhD exam, and after trying to poke holes in it grudgingly let on that maybe it was a good idea. He was kind enough to write up what he took away from it, here it is.
Eric W. Biederman replied:
Hmm. My impression is that Linux has been doing SMP but mostly because it hasn't become a nightmare so far. Linus just a moment ago noted that there are scalability limits to SMP.
As for the cc-SMP stuff.
You have presented your idea, and maybe it will be useful. But at the moment it is not the place to start. What I need today is process checkpointing. The rest comes in easy incremental steps from there.
For me the natural place to start is with clusters, they are cheaper and more accessible than SMPs. And then work on the clustering software with gradual refinements until it can be managed as one machine. At that point it should be easy to compare which does a better job for SMPs.
At one point, Cort Dougan said:
"Beating the SMP horse to death" does make sense for 2 processor SMP machines. When 64 processor machines become commodity (Linux is a commodity hardware OS) something will have to be done. When research groups put Linux on 1k processors - it's an experiment. I don't think they have much right to complain that Linux doesn't scale up to that level - it's not designed to.
That being said, large clusters are an interesting research area but it is _not_ a failing of Linux that it doesn't scale to them.
Linus replied, regarding the insistence on SMP for 2 processor systems, saying:
It makes fine sense for any tightly coupled system, where the tight coupling is cost-efficient.
Today that means 2 CPU's, and maybe 4.
Things like SMT (Intel calls it "HT") increase that to 4/8. It's just _cheaper_ to do that kind of built-in SMP support than it is to not use it.
The important part of what Cort says is "commodity". Not the "small number of CPU's". Linux is focusing on SMP, because it is the ONLY INTERESTING HARDWARE BASE in the commodity space.
ccNuma and clusters just aren't even on the _radar_ from a commodity standpoint. While commodity 4- and 8-way SMP is just a few years away.
So because SMP hardware is cheap and efficient, all reasonable scalability work is done on SMP. And the fringe is just that - fringe. The numa/cluster fringe tends to try to use SMP approaches because they know they are a minority, and they want to try to leverage off the commodity.
And it will continue to be this way for the foreseeable future. People should just accept the fact.
The only thing that may change the current state of affairs is that some cluster/numa issues are slowly percolating down and they may become more commoditized. For example, I think the AMD approach to SMP on the hammer series is "local memories" with a fast CPU interconnect. That's a lot more NUMA than we're used to in the PC space.
On the other hand, another interesting trend seems to be that since commoditizing NUMA ends up being done with a lot of integration, the actual _latency_ difference is so small that those potential future commodity NUMA boxes can be considered largely UMA/SMP.
And I guarantee Linux will scale up fine to 16 CPU's, once that is commodity. And the rest is just not all that important.
Elsewhere, Larry said that any attempt to scale Linux up past a few CPUs would lead to an unworkable mass of threading and locking that could never be undone. Jeff Garzik chimed in, with:
One point that is missed, I think, is that Linux secretly wants to be a microkernel.
Oh, I don't mean the strict definition of microkernel, we are continuing to push the dogma of "do it in userspace" or "do it in process context" (IOW userspace in the kernel).
Look at the kernel now -- the current kernel is not simply an event-driven, monolithic program [the traditional kernel design]. Linux also depends on a number of kernel threads to perform various asynchronous tasks. We have had userspace agents managing bits of hardware for a while now, and that trend is only going to be reinforced with Al's initramfs.
IMO, the trend of the kernel is towards a collection of asynchronous tasks, which lends itself to high parallelism. Hardware itself is trending towards playing friendly with other hardware in the system (examples: TCQ-driven bus release and interrupt coalescing), another element of parallelism.
I don't see the future of Linux as a twisted nightmare of spinlocks.
That's not a microkernel design philosophy, it's a good OS design philosophy. If it doesn't _have_ to be in the kernel, it generally shouldn't be.
I agree with you that Linux is already a loosely connected yet highly inter-dependent set of asynchronous tasks. That makes for a very difficult to analyze system.
I don't see Linux being in serious jeopardy in the short-term of becoming solaris. It only aims at running on 1-4 processors and does a pretty good job of that. Most sane people realize, as Larry points out, that the current design will not scale to 64 processors and beyond. That's obvious, it's not an alarmist or deep statement. The key is to realize that it's not _meant_ to scale that high right now.
I've done a little work with Larry's suggestion for scaling Linux and it's very smart in that it solves the problem in a very simple and elegant way. DEC did the same thing with Galaxy some time ago but they layered it with so much of their cluster software and OpenVMS that it lost all the performance that it had gained by being clever. If you want a simple description of the idea (the way I am working on it), it's a software version of NORMA.
Linux's sweet spot is 2-4 processors and probably shouldn't try to change. It's a very hard problem going higher. Many systems have failed in exactly the same way trying to do that sort of thing. Just cluster a bunch of those 2-4 processor Linux's (room full of boxes, large 64-way IBM server or some hybrid) and you have a clean solution.
4. Thoughts On Bootless Kernel Upgrades
18 Jun 2002 - 22 Jun 2002 (11 posts) Archive Link: "kernel upgrade on the fly"
Topics: FS, Hot-Plugging, Microsoft, Software Suspend
People: Rob Landley, John Alvord
Adi Zaimi asked if anyone had thought about ways of upgrading the kernel of a running system, without requiring a reboot. Rob Landley replied:
Thought about, yes. At length. That's why it hasn't been done. :)
Closest you'll get at the moment is some variant of two kernel monte, I.E. a reboot to a new kernel with all processes offed, but at least without involving the bios.
The new swsup infrastructure from pavel machek theoretically lets you freeze the state of your system to disk, so we're a heck of a lot farther ahead than we were. If you want to re-open this can of worms, the only way to go is to start with some combination of these two projects:
That said, the fundamental problem is that when you change kernels, run-time state structures change. Parsing your run-time state from oldvers to feed into newvers can't really be done automatically because your tool wouldn't know what any of the changes MEAN, so you would probably have to write a custom frozen process converter, which would be a pain and a half to debug, to say the least. (And by the time you've got that even half debugged you need to do it for the NEXT kernel...)
Of course software suspend theoretically deals with at least some of the device driver issues, so there's a certain amount of handwaving you can do on that end. And migrating hot network connections is something people have in fact done before, although you'll have to ask around about who. (Ask the security nuts, they consider it a bad thing. :)
Nothing is impossible for anyone impervious to reason, and you might surprise us (it'd make a heck of a graduate project). Hot migration isn't IMPOSSIBLE, it's just a flipping pain in the ass. But the issue's a bit threadbare in these parts (somewhere between "are we there yet mommy?" and "can I buy a pony?"). Try the swsup mailing list, they might be willing to humor you...
(And the people most likely to WANT this feature ("this system never goes down" types) are also the least likely to want to deal with subtle bugs from a bad conversion that don't manifest until a week after the new system comes up when cron goes nuts at 3 am. Of course whether hot migration is more dangerous to your data than the interaction between Andre's and Martin's egos in the ATAPI layer is an open question... :) Ahem. Right...)
The SANE answer always has been to just schedule some down time for the box. The insane answer involves giving an awful lot of money to Sun or IBM or some such for hot-pluggable backplanes. (How do you swap out THE BACKPLANE? That's an answer nobody seems to have...)
Clusters. Migrating tasks in the cluster, potentially similar problem. Look at mosix and the NUMA stuff as well, if you're actually serious about this. You have to reduce a process to its vital data, once all the resources you can peel away from it have been peeled away, swapped out, freed, etc. If you can suspend and save an individual running process to a disk image (just a file in the filesystem), in such a way that it can be individually re-loaded later (by the same kernel), you're halfway there. No, it's not as easy as it sounds. :)
John Alvord replied, "IMO the biggest reason it hasn't been done is the existence of loadable modules. Most driver-type development work can be tested without rebooting." But Rob came back with:
That's part of it, sure. (And I'm sure the software suspend work is leveraging the ability to unload modules.)
There's a dependency tree: processes need resources like mounted filesystems and open file handles to the network stack and such, and you can't unmount filesystems and unload devices while they're in use. Taking a running system apart and keeping track of the pieces needed to put it back together again is a bit of a challenge.
The software suspend work can't freeze processes individually to separate files (that I know of), but I've heard blue-sky talk about potentially adding it. (Dunno what the actual plans are, pavel machek probably would). If processes could be frozen in a somewhat kernel independent way (so that their run-time state was parsed in again in a known format and flung into any functioning kernel), then upgrading to a new kernel would just be a question of suspending all the processes you care about preserving, doing a two kernel monte, and restoring the processes. Migrating a process from one machine to another in a network cluster would be possible too.
I'm sure it's not as easy as it sounds, but looking at the software suspend work would be a necessary first step. They are, at least, serializing processes to disk and bringing them back afterwards. I'm fairly certain it's happening the way microsoft word saves *.doc files (block write the run-time structures to disk and block read them back in verbatim later, and hope all your compiler alignment offsets and such match if there's any version skew).
Then again, the star office people reverse engineered that and made it (mostly) work without even having access to the source code... :)
Hmmm, what would be involved in serializing a process to disk? Obviously you start by sending it a suspend signal. There's the process stuff, of course. (Priority, etc.) That's not too bad. You'd need to record all the memory mappings (not just the contents of the physical and swapped out memory mappings (which should be saved to the serializing file), but also the memory protection states and memory mapped file ranges and such, so you can map it all back in at the appropriate location later). I'd bug whoever did the recent shared page table work (daniel philips?) for information about what that really MEANS.
You'd need to record all the open file handles, of course. (For actual files this includes position in file, corresponding locks, etc. For the zillions of things that just LOOK like files, pipes and sockets and character and block devices, expect special case code).
Pipes bring up a fun point: you can't always serialize just one process. Sometimes they clump together, and if you kill one more go down with it. Thread groups are easy to spot, as well as parent/child relationships that share memory maps and file handles and such, but even just a simple "cat blah | less" means there are two processes connected by a pipe which pretty much need to be serialized together. (A common real-world case is that one of those processes is going to be the X11 server, this brings up a WORLD of fun. For a 1.00 release it's an obvious "Don't Do That Then", and later on might have special case behavior.)
If an actual file handle is open to an otherwise unlinked file, you need to either make a link to that file somewhere (not too hard, that info is already in proc/###/fs) or maybe cache the contents of the file as part of the serialized image...
Which brings up the whole question of how portable a serialized program image should be. Forget swapping kernels, I mean running the system for a while before resuming the "frozen" executable. Rename a couple files and the resume is going to get confused. You kind of have to restore to the exact same system you left off at, because if you have an open file handle to a file or device driver that isn't there on the resumed system, you basically have some variant of a "broken pipe" scenario. (Then again, forced unmount of filesystems can sort of give you this problem anyway, so infrastructure to deal with it is going to have to be faced at some point...)
For rebooting a running system with the same mounted partitions and hopefully the same set of device drivers, this isn't really any worse than software suspend. And detecting a missing file and having the resume fail with an error would be pretty easy. But also pretty darn easy to trigger, but that's the user's problem...
What other resources attach to a process? The process infos itself (user ID, capabilities), memory mappings, file handles... Bound sockets... Signal handlers and masks... I/O port mappings and such if you're running as root...
It's not an unsolvable problem, but it IS a can of worms. Just plain reparenting a process turned out to be complicated enough they made reparent_to_init (see kernel/sched.c).
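Rob's observation that processes "clump together" through pipes and shared mappings amounts to finding connected components. The following is only a sketch of that grouping step, using a small union-find; the function name, the pid numbers, and the idea of feeding it a ready-made list of sharing pairs are assumptions for illustration, and every pid named in a pair is assumed to appear in the pid list. Discovering which processes actually share a resource is the hard part and is not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def checkpoint_groups(pids: Iterable[int],
                      shared: Iterable[Tuple[int, int]]) -> List[Set[int]]:
    """Group processes that would have to be frozen together.

    `shared` lists pairs of pids that share a pipe, memory map or similar
    resource. The result is the list of connected components, ordered by
    the smallest pid in each group.
    """
    parent: Dict[int, int] = {pid: pid for pid in pids}

    def find(pid: int) -> int:
        # Follow parent links to the group's representative (path halving).
        while parent[pid] != pid:
            parent[pid] = parent[parent[pid]]
            pid = parent[pid]
        return pid

    for a, b in shared:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    groups: Dict[int, Set[int]] = defaultdict(set)
    for pid in parent:
        groups[find(pid)].add(pid)
    return sorted(groups.values(), key=min)


# Hypothetical pids: 100 is "cat blah", 101 is "less", 200 is unrelated.
groups = checkpoint_groups([100, 101, 200], [(100, 101)])
# -> two groups: {100, 101} and {200}
```

With this grouping, "cat blah | less" comes out as one unit to serialize, while the unrelated process can be frozen on its own.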
5. Cleaning Up The Source Tree
18 Jun 2002 - 21 Jun 2002 (6 posts) Archive Link: "2.5.x: arch/i386/kernel/cpu"
Topics: Source Tree
People: H. Peter Anvin, Dave Jones, Patrick Mochel
H. Peter Anvin called out:
Whomever broke up arch/i386/kernel/setup.c and created the CPU directory (very good idea) messed up in at least one place:
The *AMD-defined* CPUID flags (0x80000001) are not just used on AMD processors! In fact, at least AMD, Transmeta, Cyrix and VIA all use them; I don't know about Centaur or Rise. Intel supports the actual level starting with the P4 although it returns all zero.
It should, in my opinion, be moved into generic_identify(). Anyone who has a reason why that shouldn't be done speak now or I'll send the patch to Linus.
Dave Jones gave credit for the patch to Patrick Mochel, and agreed that H. Peter's patch would be a good idea, unless Patrick had a better one. H. Peter remarked, "Note that this is great. We should do the same with bugs.h which is, if anything, an even worse mess." And Dave replied, "Agreed. Patrick also did similar work on the mtrr driver which isn't merged anywhere yet. That's something else that's been long overdue this treatment. (Also on my list for chopping into bits is agpgart_be.c, but that's another story..)"
Patrick also came into the discussion, thanking H. Peter for the catch, and encouraging him to send in his patch if it was readily available, and "If not, I'll add it to my short list and look at it in the next few days (hopefully)."
6. ext2/ext3 Scalability
19 Jun 2002 - 24 Jun 2002 (35 posts) Archive Link: "ext3 performance bottleneck as the number of spindles gets large"
Topics: Disks: SCSI, FS: ext2, FS: ext3, Locking, SMP
People: Andrew Morton, Dave Hansen, Andreas Dilger, Stephen C. Tweedie
Someone from Intel reported that they'd been benchmarking block I/O throughput for 8K writes as the number of SCSI adapters and drives per adapter was increased. On their dual processor 1.2GHz PIII with 2G RAM, running kernel 2.4.16 or 2.4.18, they found that the Bonnie++ benchmark showed throughput going down as the number of spindles went up. As far as they could tell, the problem boiled down to ext3 making too much use of the BKL (Big Kernel Lock). The poster suggested replacing the BKL usage with per-filesystem locking instead. Andrew Morton replied, "ext3 scalability is very poor, I'm afraid. The fs really wasn't up and running until kernel 2.4.5 and we just didn't have time to address that issue." He added, "The vague plan there is to replace lock_kernel with lock_journal where appropriate. But ext3 scalability work of this nature will be targeted at the 2.5 kernel, most probably."
Dave Hansen took a look at the code and agreed that BKL contention was pretty hairy in ext3. He added, "We used to see plenty of ext2 BKL contention, but Al Viro did a good job fixing that early in 2.5 using a per-inode rwlock. I think that this is the required level of lock granularity, another global lock just won't cut it. http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock." Andreas Dilger said:
There are a variety of different efforts that could be made towards removing the BKL from ext2 and ext3. The first, of course, would be to have a per-filesystem lock instead of taking the BKL (I don't know if Al has changed lock_super() in 2.5 to be a real semaphore or not). As Andrew mentioned, there would also need to be a per-journal lock to ensure coherency of the journal data. Currently the per-filesystem and per-journal lock would be equivalent, but when a single journal device can be shared among multiple filesystems they would be different locks.
I will leave it up to Andrew and Stephen to discuss locking scalability within the journal layer.
Within the filesystem there can be a large number of increasingly fine locks added - a superblock-only lock with per-group locks, or even per-bitmap and per-inode-table(-block) locks if needed. This would allow multi-threaded inode and block allocations, but a sane lock ranking strategy would have to be developed. The bitmap locks would only need to be 2-state locks, because you only look at the bitmaps when you want to modify them. The inode table locks would be read/write locks.
If there is a try-writelock mechanism for the individual inode table blocks you can avoid write lock contention for creations by simply finding the first un-write-locked block in the target group's inode table (usually in the hundreds of blocks per group for default parameters). For inode allocation you don't really care which inode you get, as long as you get one in the preferred group (even that isn't critical for directory creation). For inode deletions you will get essentially random block locking, which is actually improved by the find-first-unlocked allocation policy (at the expense of dirtying more inode table blocks).
Contention for the superblock lock for updates to the superblock free block and free inode counts could be mitigated by keeping "per-group delta buckets" in memory, that are written into the superblock only once every few seconds or at statfs time instead of needing multiple locks for each block/inode alloc/free. The groups already keep their own summary counts for free blocks and inodes. The coherency of these fields with the superblock on recovery would be handled at journal recovery time (either in the kernel or e2fsck). Other than these two fields there are few write updates to the superblock (on ext3 there is also the orphan list, modified at truncate and when an open file is unlinked and when such a file is closed).
I have even been thinking about multi-threaded directory-entry creation in a single directory. One nice thing about ext2/ext3 directory blocks is that each one is self-contained and can be modified independently. For regular ext2/ext3 directories you would only be able to do multi-threaded deletes by having a lock for each directory block. For creations you would need to lock the entire directory to ensure exclusive access for a create, which is the same single-threaded behaviour for a single directory we have today with the directory i_sem.
However, if you are using the htree indexed directory layout (which you will be, if you care about scalable filesystem performance) then there is only a single block into which a given filename can be added, so you can have per-block locks even for file creation. As the number of directory entries grows (and hence more directory blocks) the locking becomes increasingly more fine-grained so you get better scalability with larger directories, which is what you want.
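The find-first-unlocked policy Andreas described for inode table blocks can be sketched in user space with ordinary try-locks. This is only an analogy: plain Python `threading.Lock` objects stand in for the per-block write locks he proposed, and the function name is an assumption made up for the example.

```python
import threading


def lock_first_free_block(block_locks):
    """Return the index of the first block whose lock we obtained, or None.

    acquire(blocking=False) is the try-lock: it returns False immediately
    if the lock is already held, so a creator never waits behind another
    creator working in an earlier block.
    """
    for index, lock in enumerate(block_locks):
        if lock.acquire(blocking=False):
            return index  # the caller must release block_locks[index]
    return None


# Hypothetical group with four inode table blocks; block 0 is busy.
locks = [threading.Lock() for _ in range(4)]
locks[0].acquire()
chosen = lock_first_free_block(locks)
print(chosen)  # 1
locks[chosen].release()
locks[0].release()
```

The trade-off he mentioned is visible here: allocations spread over more blocks than strictly necessary, in exchange for never contending on a busy one.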
The next steps for ext2 are: stare at Anton's next set of graphs and then, I expect, removal of the fs-private bitmap LRUs, per-cpu buffer LRUs to avoid blockdev mapping lock contention, per-blockgroup locks and removal of lock_super from the block allocator.
But there's no point in doing that while zone->lock and pagemap_lru_lock are top of the list. Fixes for both of those are in progress.
ext2 is bog-simple. It will scale up the wazoo in 2.6.
But he added, "ext3 is about 700x as complex as ext2. It will need to be done with some care."
Elsewhere, Stephen C. Tweedie felt that it might not be necessary to wait for 2.6 for ext3 scalability. He said:
I think we can do better than that, with care. lock_journal could easily become a read/write lock to protect the transaction state machine, as there's really only one place --- the commit thread --- where we end up changing the state of a transaction itself (eg. from running to committing). For short-lived buffer transformations, we already have the datalist spinlock.
There are a few intermediate types of operation, such as the do_get_write_access. That's a buffer operation, but it relies on us being able to allocate memory for the old version of the buffer if we happen to be committing the bh to disk already. All of those cases are already prepared to accept BKL being dropped during the memory allocation, so there's no problem with doing the same for a short-term buffer spinlock; and if the journal_lock is only taken shared in such places, then there's no urgent need to drop that over the malloc.
Even the commit thread can probably avoid taking the journal lock in many cases --- it would need it exclusively while changing a transaction's global state, but while it's just manipulating blocks on the committing transaction it can probably get away with much less locking.
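Returning to the "per-group delta buckets" Andreas suggested earlier in the thread for the superblock free counts, the bookkeeping might look roughly like the following. This is a user-space sketch under assumed names (`FreeBlockCounter`, `adjust`, `flush`); it says nothing about how the journal or recovery would treat the counts.

```python
import threading


class FreeBlockCounter:
    """Global free count plus per-group deltas folded in only occasionally."""

    def __init__(self, total_free, groups):
        self.superblock_free = total_free  # global count, rarely touched
        self.superblock_lock = threading.Lock()
        self.deltas = [0] * groups         # pending change per block group
        self.group_locks = [threading.Lock() for _ in range(groups)]

    def adjust(self, group, change):
        """Record an allocation (negative) or free (positive) under the group lock only."""
        with self.group_locks[group]:
            self.deltas[group] += change

    def flush(self):
        """Fold pending deltas into the global count, e.g. periodically or at statfs time."""
        with self.superblock_lock:
            for group, lock in enumerate(self.group_locks):
                with lock:
                    self.superblock_free += self.deltas[group]
                    self.deltas[group] = 0
            return self.superblock_free


counter = FreeBlockCounter(total_free=1000, groups=4)
counter.adjust(0, -3)   # three blocks allocated in group 0
counter.adjust(2, +1)   # one block freed in group 2
print(counter.flush())  # 998
```

Between flushes the global count is stale, which is why the proposal pairs this with fixing up the superblock fields at journal recovery or in e2fsck.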
7. Status Of CML2 And Kernel Configuration System
20 Jun 2002 - 22 Jun 2002 (5 posts) Archive Link: "CML2"
Topics: Configuration, Disks: SCSI, Kernel Build System
People: Eric Weigle, Roman Zippel, Sam Ravnborg, Eric S. Raymond
Hayden James asked about the status of CML2, and Eric Weigle replied, "OOoooh, ouch. You apparently missed the two (or more) significant flamewars on these topics. The current status is that kbuild will probably slowly be merged by going through Kai and being munged into Linus-acceptable patches, while CML2 will probably sit around and never get merged unless ESR accepts the fact that cool code solving a problem doesn't automagically get into the kernel. See the thread rooted somewhere around here ("Disgusted with Kbuild..."): http://www.uwsg.iu.edu/hypermail/linux/kernel/0202.2/0000.html"
Roman Zippel also said (referring to Eric S. Raymond, CML2 author), "Due to the silence of him, we must assume that he has given up. CML2 has a few problems, which make it unlikely that it gets included as is. Anyway, not all hope is lost, I started my own configuration system some time ago, which will be less complex than CML2. It's only advancing a bit slowly currently, as I only have little time to work on it." Sam Ravnborg asked:
Despite the fact that you are advancing slowly could you explain what your plans are with the configuration system?
As of today we have basically three different ways to read the Config.in files, where xconfig are the one with the best but also most critical parser/analyser. Do you plan to replace all of them or?
And Roman replied:
My plan is to convert the current configuration into a new format (I have a tool for that), which is more flexible and will allow that all needed information to configure/build a driver is at a single place. It currently looks like this:
tristate "SCSI disk support"
If you want to use a SCSI hard disk ...
More information can be added later to this.
The current parsers will all be replaced with a single parser, actually it's a library that does all the work and which allows multiple front ends to behave identical.
Sharon And Joy
Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.