<?xml version="1.0" ?>

<kc>

<title>Kernel Traffic</title>

<author contact="mailto:zbrown@tumblerings.org">Zack Brown</author>

<issue num="319" date="28 Aug 2005 00:00:00 -0800" />

<mailbox-stats>
	<global-stats>
		<generated-at>Sat Aug 27 13:41:48 2005</generated-at>
		<first-message>Tue Dec  7 15:09:03 2004</first-message>
		<last-message>Wed Jul 20 16:35:04 2005</last-message>
		<totals>
			<n-messages>2412</n-messages>
			<n-is-reply>1857</n-is-reply>
			<avg-resp-time>545:03:01</avg-resp-time>
			<n-pgp-signed>219</n-pgp-signed>
			<total-size>14MB</total-size>
			<avg-size>6KB</avg-size>
			<n-attachments>62</n-attachments>
			<att-size>597KB</att-size>
			<bussiest-day-in-n day="2005-07-12"><n-msgs>275</n-msgs><n-bytes>2MB</n-bytes></bussiest-day-in-n>
			<bussiest-day-in-bytes day="2005-07-08"><n-msgs>227</n-msgs><n-bytes>2MB</n-bytes></bussiest-day-in-bytes>
			<n-writers>717</n-writers>
			<wrote-more-then-1-message>276</wrote-more-then-1-message>
			<n-lines>247937</n-lines>
			<header-size>133126</header-size>
			<n-user-agents>65</n-user-agents>
			<n-organisations>54</n-organisations>
			<n-toplevel-domains>50</n-toplevel-domains>
			<avg-spam-score>-28.207131</avg-spam-score>
			<spammiest-writer><score>4.200000</score><name>sandy</name></spammiest-writer>
		</totals>
		<averages>
			<lines-per-message>102</lines-per-message>
			<lines-per-header>55</lines-per-header>
			<header-percent-of-message>53.69%</header-percent-of-message>
			<header-percent-of-total>43.00%</header-percent-of-total>
			<line-length>28</line-length>
		</averages>
		<importance>
			<low>0.00%</low>
			<normal>0.91%</normal>
			<high>0.00%</high>
		</importance>

	</global-stats>
	<top-writers>
		<top-writer rank="1">
			<e-mail-addr>hans&#32;reiser</e-mail-addr>
			<n-messages>87</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>384KB</total-size>
			<mostly-written-at>14:32</mostly-written-at>
		</top-writer>
		<top-writer rank="2">
			<e-mail-addr>david&#32;masover</e-mail-addr>
			<n-messages>85</n-messages>
			<avg-size>7KB</avg-size>
			<total-size>538KB</total-size>
			<mostly-written-at>11:59</mostly-written-at>
		</top-writer>
		<top-writer rank="3">
			<e-mail-addr>hal&#32;rosenstock</e-mail-addr>
			<n-messages>65</n-messages>
			<avg-size>10KB</avg-size>
			<total-size>609KB</total-size>
			<mostly-written-at>00:00</mostly-written-at>
		</top-writer>
		<top-writer rank="4">
			<e-mail-addr>nigel&#32;cunningham</e-mail-addr>
			<n-messages>62</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>288KB</total-size>
			<mostly-written-at>18:14</mostly-written-at>
		</top-writer>
		<top-writer rank="5">
			<e-mail-addr>pavel&#32;machek</e-mail-addr>
			<n-messages>57</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>289KB</total-size>
			<mostly-written-at>13:39</mostly-written-at>
		</top-writer>
	</top-writers>
	<top-subjects>
		<top-subject rank="1">
			<subject>reiser4&#32;plugins</subject>
			<n-messages>394</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>3MB</total-size>
			<mostly-written-at>13:51</mostly-written-at>
			<first-msg>1119406078</first-msg>
			<last-msg>1121277917</last-msg>
		</top-subject>
		<top-subject rank="2">
			<subject>-mm&#32;-&#62;&#32;2.6.13&#32;merge&#32;status</subject>
			<n-messages>104</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>436KB</total-size>
			<mostly-written-at>14:10</mostly-written-at>
			<first-msg>1119340498</first-msg>
			<last-msg>1120199404</last-msg>
		</top-subject>
		<top-subject rank="3">
			<subject>[git&#32;patches]&#32;ide&#32;update</subject>
			<n-messages>51</n-messages>
			<avg-size>7KB</avg-size>
			<total-size>333KB</total-size>
			<mostly-written-at>15:18</mostly-written-at>
			<first-msg>1120445526</first-msg>
			<last-msg>1121124107</last-msg>
		</top-subject>
		<top-subject rank="4">
			<subject>reiser4&#32;vs&#32;politics:&#32;linux&#32;misses&#32;out&#32;again</subject>
			<n-messages>51</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>250KB</total-size>
			<mostly-written-at>15:26</mostly-written-at>
			<first-msg>1120139040</first-msg>
			<last-msg>1121143220</last-msg>
		</top-subject>
		<top-subject rank="5">
			<subject>-mm&#32;-&#62;&#32;2.6.13&#32;merge&#32;status&#32;(fuse)</subject>
			<n-messages>40</n-messages>
			<avg-size>4KB</avg-size>
			<total-size>157KB</total-size>
			<mostly-written-at>12:31</mostly-written-at>
			<first-msg>1119371217</first-msg>
			<last-msg>1119769466</last-msg>
		</top-subject>
	</top-subjects>
	<top-receivers>
		<top-receiver rank="1">
			<e-mail-addr>linux-kernel@vger.kernel.org</e-mail-addr>
			<n-messages>384</n-messages>
			<avg-size>9KB</avg-size>
			<total-size>4MB</total-size>
			<mostly-written-at>13:19</mostly-written-at>
		</top-receiver>
		<top-receiver rank="2">
			<e-mail-addr>akpm@osdl.org</e-mail-addr>
			<n-messages>290</n-messages>
			<avg-size>8KB</avg-size>
			<total-size>3MB</total-size>
			<mostly-written-at>12:00</mostly-written-at>
		</top-receiver>
		<top-receiver rank="3">
			<e-mail-addr>david&#32;masover</e-mail-addr>
			<n-messages>85</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>425KB</total-size>
			<mostly-written-at>14:13</mostly-written-at>
		</top-receiver>
		<top-receiver rank="4">
			<e-mail-addr>hans&#32;reiser</e-mail-addr>
			<n-messages>84</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>402KB</total-size>
			<mostly-written-at>13:52</mostly-written-at>
		</top-receiver>
		<top-receiver rank="5">
			<e-mail-addr>nigel&#32;cunningham</e-mail-addr>
			<n-messages>55</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>237KB</total-size>
			<mostly-written-at>12:44</mostly-written-at>
		</top-receiver>
	</top-receivers>
	<top-ccers>
		<top-ccers rank="1">
			<e-mail-addr>linux-kernel@vger.kernel.org</e-mail-addr>
			<n-messages>923</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>5MB</total-size>
			<mostly-written-at>12:57</mostly-written-at>
		</top-ccers>
		<top-ccers rank="2">
			<e-mail-addr>akpm@osdl.org</e-mail-addr>
			<n-messages>208</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>2MB</total-size>
			<mostly-written-at>14:15</mostly-written-at>
		</top-ccers>
		<top-ccers rank="3">
			<e-mail-addr>hans&#32;reiser</e-mail-addr>
			<n-messages>73</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>369KB</total-size>
			<mostly-written-at>13:16</mostly-written-at>
		</top-ccers>
		<top-ccers rank="4">
			<e-mail-addr>openib-general@openib.org</e-mail-addr>
			<n-messages>65</n-messages>
			<avg-size>10KB</avg-size>
			<total-size>610KB</total-size>
			<mostly-written-at>01:04</mostly-written-at>
		</top-ccers>
		<top-ccers rank="5">
			<e-mail-addr>jeff&#32;garzik</e-mail-addr>
			<n-messages>63</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>275KB</total-size>
			<mostly-written-at>15:09</mostly-written-at>
		</top-ccers>
	</top-ccers>
	<top-level-domains>
		<tld rank="1">
			<name>com</name>
			<freq>1067</freq>
			<avg-size>7KB</avg-size>
			<total-size>7MB</total-size>
		</tld>
		<tld rank="2">
			<name>org</name>
			<freq>331</freq>
			<avg-size>6KB</avg-size>
			<total-size>2MB</total-size>
		</tld>
		<tld rank="3">
			<name>de</name>
			<freq>216</freq>
			<avg-size>6KB</avg-size>
			<total-size>2MB</total-size>
		</tld>
		<tld rank="4">
			<name>net</name>
			<freq>215</freq>
			<avg-size>8KB</avg-size>
			<total-size>2MB</total-size>
		</tld>
		<tld rank="5">
			<name>cz</name>
			<freq>93</freq>
			<avg-size>6KB</avg-size>
			<total-size>502KB</total-size>
		</tld>
	</top-level-domains>
	<top-timezones>
		<tz rank="1">
			<name>+0200</name>
			<freq>679</freq>
			<avg-size>6KB</avg-size>
			<total-size>4MB</total-size>
		</tz>
		<tz rank="2">
			<name>-0700</name>
			<freq>397</freq>
			<avg-size>6KB</avg-size>
			<total-size>2MB</total-size>
		</tz>
		<tz rank="3">
			<name>-0400</name>
			<freq>389</freq>
			<avg-size>6KB</avg-size>
			<total-size>3MB</total-size>
		</tz>
		<tz rank="4">
			<name>-0500</name>
			<freq>198</freq>
			<avg-size>8KB</avg-size>
			<total-size>2MB</total-size>
		</tz>
		<tz rank="5">
			<name>+0100</name>
			<freq>195</freq>
			<avg-size>6KB</avg-size>
			<total-size>2MB</total-size>
		</tz>
	</top-timezones>
	<top-organisations>
		<org rank="2">
			<name>cycades</name>
			<freq>62</freq>
			<bytes>288KB</bytes>
		</org>
		<org rank="3">
			<name>me,&#32;myself&#32;and&#32;i</name>
			<freq>18</freq>
			<bytes>112KB</bytes>
		</org>
		<org rank="4">
			<name>ypo4</name>
			<freq>17</freq>
			<bytes>90KB</bytes>
		</org>
		<org rank="5">
			<name>kihon&#32;technologies</name>
			<freq>11</freq>
			<bytes>46KB</bytes>
		</org>
	</top-organisations>
	<top-user-agents>
		<useragent rank="1">
			<name>mozilla</name>
			<freq>63</freq>
			<bytes>2MB</bytes>
		</useragent>
		<useragent rank="2">
			<name>evolution</name>
			<freq>44</freq>
			<bytes>2MB</bytes>
		</useragent>
		<useragent rank="3">
			<name>mutt/1.5.9i</name>
			<freq>43</freq>
			<bytes>2MB</bytes>
		</useragent>
		<useragent rank="4">
			<name>mozilla/5.0</name>
			<freq>30</freq>
			<bytes>823KB</bytes>
		</useragent>
		<useragent rank="5">
			<name>kmail/1.8.1</name>
			<freq>18</freq>
			<bytes>391KB</bytes>
		</useragent>
	</top-user-agents>
	<messages-per-day>
		<Sunday><msgs>165</msgs><bytes>910KB</bytes></Sunday>
		<Monday><msgs>399</msgs><bytes>3MB</bytes></Monday>
		<Tuesday><msgs>457</msgs><bytes>3MB</bytes></Tuesday>
		<Wednesday><msgs>442</msgs><bytes>3MB</bytes></Wednesday>
		<Thursday><msgs>351</msgs><bytes>3MB</bytes></Thursday>
		<Friday><msgs>322</msgs><bytes>2MB</bytes></Friday>
		<Saturday><msgs>181</msgs><bytes>899KB</bytes></Saturday>
	</messages-per-day>
	<messages-per-month>
		<Jan><msgs>0</msgs><bytes>0</bytes></Jan>
		<Feb><msgs>0</msgs><bytes>0</bytes></Feb>
		<Mar><msgs>0</msgs><bytes>0</bytes></Mar>
		<Apr><msgs>0</msgs><bytes>0</bytes></Apr>
		<May><msgs>0</msgs><bytes>0</bytes></May>
		<Jun><msgs>500</msgs><bytes>3MB</bytes></Jun>
		<Jul><msgs>1817</msgs><bytes>11MB</bytes></Jul>
		<Aug><msgs>0</msgs><bytes>0</bytes></Aug>
		<Sep><msgs>0</msgs><bytes>0</bytes></Sep>
		<Oct><msgs>0</msgs><bytes>0</bytes></Oct>
		<Nov><msgs>0</msgs><bytes>0</bytes></Nov>
		<Dec><msgs>0</msgs><bytes>0</bytes></Dec>
	</messages-per-month>
	<messages-per-day-of-month>
		<day-1><msgs>40</msgs><bytes>198KB</bytes></day-1>
		<day-2><msgs>11</msgs><bytes>53KB</bytes></day-2>
		<day-3><msgs>14</msgs><bytes>82KB</bytes></day-3>
		<day-4><msgs>59</msgs><bytes>290KB</bytes></day-4>
		<day-5><msgs>61</msgs><bytes>290KB</bytes></day-5>
		<day-6><msgs>179</msgs><bytes>2MB</bytes></day-6>
		<day-7><msgs>159</msgs><bytes>918KB</bytes></day-7>
		<day-8><msgs>227</msgs><bytes>2MB</bytes></day-8>
		<day-9><msgs>147</msgs><bytes>717KB</bytes></day-9>
		<day-10><msgs>100</msgs><bytes>538KB</bytes></day-10>
		<day-11><msgs>261</msgs><bytes>2MB</bytes></day-11>
		<day-12><msgs>275</msgs><bytes>2MB</bytes></day-12>
		<day-13><msgs>175</msgs><bytes>972KB</bytes></day-13>
		<day-14><msgs>96</msgs><bytes>742KB</bytes></day-14>
		<day-15><msgs>13</msgs><bytes>63KB</bytes></day-15>
		<day-16><msgs>0</msgs><bytes>0</bytes></day-16>
		<day-17><msgs>0</msgs><bytes>0</bytes></day-17>
		<day-18><msgs>0</msgs><bytes>0</bytes></day-18>
		<day-19><msgs>0</msgs><bytes>0</bytes></day-19>
		<day-20><msgs>2</msgs><bytes>15KB</bytes></day-20>
		<day-21><msgs>77</msgs><bytes>284KB</bytes></day-21>
		<day-22><msgs>65</msgs><bytes>280KB</bytes></day-22>
		<day-23><msgs>50</msgs><bytes>263KB</bytes></day-23>
		<day-24><msgs>42</msgs><bytes>207KB</bytes></day-24>
		<day-25><msgs>23</msgs><bytes>129KB</bytes></day-25>
		<day-26><msgs>51</msgs><bytes>291KB</bytes></day-26>
		<day-27><msgs>78</msgs><bytes>409KB</bytes></day-27>
		<day-28><msgs>44</msgs><bytes>235KB</bytes></day-28>
		<day-29><msgs>22</msgs><bytes>113KB</bytes></day-29>
		<day-30><msgs>47</msgs><bytes>362KB</bytes></day-30>
		<day-31><msgs>0</msgs><bytes>0</bytes></day-31>
	</messages-per-day-of-month>
	<messages-per-hour>
		<hour-1><msgs>73</msgs><bytes>365KB</bytes></hour-1>
		<hour-2><msgs>47</msgs><bytes>284KB</bytes></hour-2>
		<hour-3><msgs>29</msgs><bytes>183KB</bytes></hour-3>
		<hour-4><msgs>31</msgs><bytes>267KB</bytes></hour-4>
		<hour-5><msgs>14</msgs><bytes>62KB</bytes></hour-5>
		<hour-6><msgs>22</msgs><bytes>91KB</bytes></hour-6>
		<hour-7><msgs>40</msgs><bytes>248KB</bytes></hour-7>
		<hour-8><msgs>58</msgs><bytes>297KB</bytes></hour-8>
		<hour-9><msgs>104</msgs><bytes>552KB</bytes></hour-9>
		<hour-10><msgs>120</msgs><bytes>656KB</bytes></hour-10>
		<hour-11><msgs>125</msgs><bytes>795KB</bytes></hour-11>
		<hour-12><msgs>187</msgs><bytes>2MB</bytes></hour-12>
		<hour-13><msgs>153</msgs><bytes>848KB</bytes></hour-13>
		<hour-14><msgs>146</msgs><bytes>963KB</bytes></hour-14>
		<hour-15><msgs>161</msgs><bytes>911KB</bytes></hour-15>
		<hour-16><msgs>148</msgs><bytes>843KB</bytes></hour-16>
		<hour-17><msgs>122</msgs><bytes>596KB</bytes></hour-17>
		<hour-18><msgs>122</msgs><bytes>724KB</bytes></hour-18>
		<hour-19><msgs>87</msgs><bytes>510KB</bytes></hour-19>
		<hour-20><msgs>101</msgs><bytes>656KB</bytes></hour-20>
		<hour-21><msgs>130</msgs><bytes>781KB</bytes></hour-21>
		<hour-22><msgs>103</msgs><bytes>511KB</bytes></hour-22>
		<hour-23><msgs>118</msgs><bytes>613KB</bytes></hour-23>
	</messages-per-hour>
	<urls>
		<url-1><freq>1743</freq><url>http://vger.kernel.org/majordomo-info.html</url></url-1>
		<url-2><freq>1740</freq><url>http://www.tux.org/lkml/</url></url-2>
		<url-3><freq>69</freq><url>http://enigmail.mozdev.org</url></url-3>
		<url-4><freq>10</freq><url>http://www.lenzg.org/</url></url-4>
		<url-5><freq>9</freq><url>http://johannes.sipsolutions.net/powerbook/touchpad/</url></url-5>
	</urls>
	<top-avg-resp>
		<resp-pers rank="1">
			<name>guorke</name>
			<avg-resp-time>00:00:40</avg-resp-time>
			<n-replies>1</n-replies>
		</resp-pers>
		<resp-pers rank="2">
			<name>kerin&#32;millar</name>
			<avg-resp-time>00:03:27</avg-resp-time>
			<n-replies>1</n-replies>
		</resp-pers>
		<resp-pers rank="3">
			<name>liyu@xxx</name>
			<avg-resp-time>00:04:12</avg-resp-time>
			<n-replies>1</n-replies>
		</resp-pers>
		<resp-pers rank="4">
			<name>brice&#32;goglin</name>
			<avg-resp-time>00:12:11</avg-resp-time>
			<n-replies>1</n-replies>
		</resp-pers>
		<resp-pers rank="5">
			<name>mark&#32;williamson</name>
			<avg-resp-time>00:16:47</avg-resp-time>
			<n-replies>1</n-replies>
		</resp-pers>
		<resp-pers rank="6">
			<name>alan&#32;stern</name>
			<avg-resp-time>00:17:54</avg-resp-time>
			<n-replies>1</n-replies>
		</resp-pers>
	</top-avg-resp>
	<created-with><name>mboxstats</name><version>2.8</version><developer>folkert@vanheusden.com</developer><url>http://www.vanheusden.com/mboxstats/</url></created-with>
</mailbox-stats>

<section
  title="Status Of -mm Tree Merging Into 2.6.13"
  subject="-mm -&gt; 2.6.13 merge status"
  archive="http://groups.google.com/group/linux.kernel/msg/125a82b96bbe2e2e?hl=en"
  posts="588"
  startdate="20 Jun 2005 22:54:58 -0800"
  enddate="14 Jul 2005 14:15:33 -0800"
>
<topic>FS: CacheFS</topic>
<topic>FS: NFS</topic>
<topic>FS: sysfs</topic>
<topic>Hot-Plugging</topic>
<topic>Kexec</topic>
<topic>Profiling</topic>
<topic>SMP</topic>
<topic>Software Suspend</topic>

<p>Andrew Morton said:</p>

<quote who="Andrew Morton">

<p>This summarises my current thinking on various patches which are presently
in -mm. I cover large things and small-but-controversial things. Anything
which isn't covered here (and that's a lot of material) is probably a "will
merge", unless it obviously isn't.</p>

<p>(If you reply to this email it would be a good idea to alter the Subject:
to reflect which feature you are discussing)</p>

<p>git-ocfs</p>

<blockquote>

<p>    The OCFS2 filesystem.  OK by me, although I'm not sure it's had enough
    review.</p>

</blockquote>

<p>sparsemem</p>

<blockquote>

<p>    OK by me for a merge.  Need to poke arch maintainers first, check that
    they've looked at it sufficiently closely.</p>

</blockquote>

<p>vm-early-zone-reclaim</p>

<blockquote>

<p>    Needs some convincing benchmark numbers to back it up.  Otherwise OK.</p>

</blockquote>

<p>avoiding-mmap-fragmentation</p>

<blockquote>

<p>    Tricky.  Addresses vm area fragmentation issues due to recent
    optimisations to the free-area lookup code.  Will merge.</p>

</blockquote>

<p>periodically-drain-non-local-pagesets</p>

<blockquote>

<p>    Will merge</p>

</blockquote>

<p>pcibus_to_node and users</p>

<blockquote>

<p>    Will merge</p>

</blockquote>

<p>CONFIG_HZ for x86 and ia64: changes default HZ to 250, make HZ
Kconfigurable.</p>

<blockquote>

<p>    Will merge (will switch default to 1000 Hz later if that seems
necessary)</p>

</blockquote>

<p>dmi-*.patch</p>

<blockquote>

<p>    Will merge.  I have a comment "The below break x440".  Maybe it got
    fixed.  We'll doubtless hear if not.</p>

</blockquote>

<p>xen-*.patch</p>

<blockquote>

<p>    These are little cleanups and abstractions which make a Xen merge
    easier.  May as well merge them.</p>

</blockquote>

<p>CPU hotplug for x86 and x86_64</p>

<blockquote>

<p>    Not really useful on current hardware, but these provide
    infrastructure which some power management patches need, and it seems
    sensible to make the reference architecture support hotplug.  Will
merge.</p>

</blockquote>

<p>swsusp-on-SMP</p>

<blockquote>

<p>    Will merge.</p>

</blockquote>

<p>cfq version 3</p>

<blockquote>

<p>    Not sure.  Jens seems to be setting up a few git trees.  On hold.</p>

</blockquote>

<p>RCUification of the key management code</p>

<blockquote>

<p>    Don't know - dhowells seemed diffident last time we discussed this.</p>

</blockquote>

<p>timers-fixes-improvements.patch</p>

<blockquote>

<p>    SMP speedups for the core timer code.  It was bumpy, but this seems
    stable now.  Will merge.</p>

</blockquote>

<p>kprobes-*</p>

<blockquote>

<p>    Will merge</p>

</blockquote>

<p>rapidio-*</p>

<blockquote>

<p>    Will merge.</p>

</blockquote>

<p>namespace*.patch</p>

<blockquote>

<p>    Awaiting viro ack.</p>

</blockquote>

<p>xtensa architecture</p>

<blockquote>

<p>    Is xtensa now, or will it be in the future a sufficiently popular
    architecture to justify the cost of having this code in the tree?</p>

<p>    Heaven knows.  Will merge.</p>

</blockquote>

<p>dlm-*.patch: Red Hat distributed lock manager</p>

<blockquote>

<p>    Hard.  Right now it seems that no in-kernel projects will use this and
    only one out-of-kernel project will use it.  Shelve the problem until
    after Kernel Summit, where some light may be shed.</p>

<p>    Opinions are sought...</p>

</blockquote>

<p>connector.patch</p>

<blockquote>

<p>    Nice idea IMO, but there are still questions around the
    implementation.  More dialogue needed ;)</p>

</blockquote>

<p>connector-add-a-fork-connector.patch</p>

<blockquote>

<p>    OK, but needs connector.</p>

</blockquote>

<p>inotify</p>

<blockquote>

<p>    There are still concerns about the userspace API and internal
    implementation details.  More slogging needed.</p>

</blockquote>

<p>pcmcia-*.patch</p>

<blockquote>

<p>    Makes the pcmcia layer generate hotplug events and deprecates cardmgr.
    Will merge.</p>

</blockquote>

<p>NUMA-aware slab allocator</p>

<blockquote>

<p>    Seems stable now, but it needs some ifdef reduction work before
    merging, please.</p>

</blockquote>

<p>CPU scheduler</p>

<blockquote>

<p>    Will merge some of these patches.  We're still discussing which ones.</p>

</blockquote>

<p>perfctr</p>

<blockquote>

<p>    Not yet, but getting closer.  The PPC64 guys still need to sort out a
    few interface issues with Mikael.  We might be able to fit this into
    2.6.13 if people get a move on.</p>

</blockquote>

<p>cachefs</p>

<blockquote>

<p>    This is a ton of code which knows rather a lot about pagecache
    internals.  It allows the AFS client to cache file contents on a local
    blockdev.</p>

<p>    I don't think it's a justified addition for only AFS and I'd prefer to
    see it proven for NFS as well.</p>

<p>    Issues around add-page-becoming-writable-notification.patch need to
    be resolved.</p>

</blockquote>

<p>cachefs-for-nfs</p>

<blockquote>

<p>    A recent addition.  Needs review from NFS developers and considerably
    more testing.</p>

<p>    These things aren't looking likely for 2.6.13.</p>

</blockquote>

<p>kexec and kdump</p>

<blockquote>

<p>    I guess we should merge these.</p>

<p>    I'm still concerned that the various device shutdown problems will
    mean that the success rate for crashing kernels is not high enough for
    kdump to be considered a success.  In which case in six months time we'll
    hear rumours about vendors shipping wholly different crashdump
    implementations, which would be quite bad.</p>

<p>    But I think this has gone as far as it can go in -mm, so it's a bit of
    a punt.</p>

</blockquote>

<p>reiser4</p>

<blockquote>

<p>    Merge it, I guess.</p>

<p>    The patches still contain all the reiser4-specific namespace
    enhancements, only it is disabled, so it is effectively dead code.  Maybe
    we should ask that it actually be removed?</p>

</blockquote>

<p>v9fs</p>

<blockquote>

<p>    I'm not sure that this has a sufficiently high
    usefulness-to-maintenance-cost ratio.</p>

</blockquote>

<p>fuse</p>

<blockquote>

<p>    This is useful, but there are, AFAIK, two issues:</p>

<p>    - We're still deadlocked over some permission-checking hacks in there</p>

<p>    - It has an NFS server implementation which only works if the
      to-be-served file happens to be in dcache.</p>

<p>      It has been said that a userspace NFS server can be used to get
      full NFS server functionality with FUSE.  I think the half-assed kernel
      implementation should be done away with.</p>

</blockquote>

<p>execute-in-place</p>

<blockquote>

<p>    Will merge.  Have the embedded guys commented on the usefulness of
    this for execute-out-of-ROM?</p>

</blockquote>

</quote>

<p>There was quite a long thread following this post. A frustrated Miklos
Szeredi spoke out about the FUSE situation. The sticking point was whether
to keep unprivileged mount support in the FUSE patch, or remove it to make
acceptance easier. Miklos said he would not remove it even if that meant FUSE
couldn't go in the kernel. He argued that people should examine the code and
the feature, and try to improve their understanding before simply rejecting it
on esthetic terms. Various folks argued back and forth without saying much;
then Andrew asked for some clarification on what the controversy was really
about. Miklos said:</p>

<quote who="Miklos Szeredi">

<p>The controversial part is fuse_allow_task() called from fuse_permission()
and fuse_revalidate() (fs/fuse/dir.c).</p>

<p>This function (as explained by the header comment) disallows access to
the filesystem for any task, which the filesystem owner (the user who did
the mount) is not allowed to ptrace.</p>

<p>The rationale is that accessing the filesystem gives the
filesystem implementor ptrace like capabilities (detailed in
Documentation/filesystems/fuse.txt)</p>

<p>It is controversial, because obviously root owned tasks are not ptrace-able
by the user, and so these tasks will be denied access to the user mounted
filesystem (-EACCES is returned on stat() or any other file operation).</p>

<p>However nobody raised _any_ concrete technical problem associated with
this, and the 4 years of widespread use didn't turn up any either. So IMO
it's "ugly" only in people's heads and not in reality.</p>

</quote>
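
<p>The rule Miklos describes can be modelled outside the kernel. What follows
is a simplified Python sketch, not the FUSE code itself: the Task class, the
may_ptrace() helper and model_fuse_access() are hypothetical names, and the
plain uid comparison is an assumption made for illustration (the kernel's real
ptrace permission check looks at more than the uid).</p>

<pre>
```python
import errno
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a process; only the uid matters here."""
    name: str
    uid: int


def may_ptrace(owner_uid: int, task: Task) -> bool:
    """Assumed rule: root may trace anything, others only their own tasks."""
    if owner_uid == 0:
        return True
    return task.uid == owner_uid


def model_fuse_access(mount_owner_uid: int, task: Task) -> int:
    """Return 0 if the task may touch the mount, else -EACCES.

    Mirrors the rule from the thread: a task gets access only if the
    user who did the mount would be allowed to ptrace that task.
    """
    if may_ptrace(mount_owner_uid, task):
        return 0
    return -errno.EACCES
```
</pre>

<p>Under this model, a task running as the mounting user is let through,
while a root-owned task touching the same user mount gets -EACCES, which is
exactly the behaviour that people in the thread found objectionable.</p>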

<p>The discussion continued, and Andrew seemed to grok what Miklos was saying,
but there was still no concrete decision made.</p>

<p>Elsewhere, Andi Kleen had some remarks to make about the prospect of
Reiser4 going in. He said, <quote who="Andi Kleen">Has there been actually
any serious review on this? Last time I looked there was a lot of very ugly
code in there. Also I'm not sure things like coming with an own profiler
and spinlock debugger are really acceptable. At least this stuff should
be removed too.</quote>
Christoph Hellwig replied, <quote who="Christoph Hellwig">I don't think so.
Everyone used the previous criteria of the broken core changes, broken
filesystem semantics and its own useless abstraction layer as an excuse not
to look deeply at this huge mess yet.</quote> But Hans said:</p>

<quote who="Hans Reiser">

<p>V4 has a mailing list, and a large number of testers who read the code
and comment on it.  V4 has been reviewed and tested much more than V3 was
before merging.  Given that we sent it in quite some time ago, your suggestion
that an additional review by unspecified additional others be a requirement
for merging seems untimely. Do you see my point of view on this?</p>

<p>I would however enjoy receiving coding suggestions at ANY time. We don't
get as much of that as I would like.  I would in particular love to have
you Andi Kleen do a full review of V4 if you could be that generous with
your time, as I liked much of the advice you gave us on V3.</p>

<p>Unspecified others doing a review, well, who knows, I will surely take
the time to consider what is said by them though.....</p>

<p>I would prefer to not get reviews from authors of other filesystems who
prefer their own code, skim through our code without taking the time to
grok our philosophy and approach in depth, and then complain that our code
is different from what they chose to write, and think that our changing to
be like them should be mandated. I will not name names here....</p>

<p>Some of the suggestions on our mailing list are great, some reflect a lack
of 5 years working with our code, perhaps I should feed our mailing list into
the linux kernel mailing list so that people on the kernel mailing list are
more aware that we exist and are active?</p>

</quote>

<p>Jeff Garzik pointed out, <quote who="Jeff Garzik">when a merge is imminent,
a lot more attention is paid</quote> ... <quote who="Jeff Garzik">If you
want to get your code merged, you gotta work with the system, and LISTEN to
the feedback.</quote></p>

<p>At some point close by, Hans remarked, <quote who="Hans Reiser">I like
feedback on our code, and I particularly like feedback from a Mr. Andi
Kleen, but there is no need to tie it to merging. If, however, it serves as
an effective excuse to get some of your time allocated by SuSE management,
sure, go for it.;-)</quote> Jeff replied, <quote who="Jeff Garzik">All
merges of new code go like this. You've been around here for a while, this
should not be a shock. "Hans' team says its good stuff" is not a criteria
for merging.</quote> Hans suggested benchmarking the code instead of talking
so much, and Jeff replied, <quote who="Jeff Garzik">Still not a criteria
for merging. We have to care about the code behind the benchmarks.</quote></p>

<p>Elsewhere, regarding v9fs, Eric Van Hensbergen said to Andrew:</p>

<quote who="Eric Van Hensbergen">

<p>I think v9fs/9P has some unique aspects which differentiate it from
the other distributed system protocols integrated into Linux:</p>

<ul>

<li>it presents a unified distributed resource sharing protocol.  It
will be able to distribute devices, file systems, system services, and
application interfaces.</li>

<li>it provides non-caching RPC-style access to synthetic file systems
which could be used with in-kernel file systems such as sysfs or with
user-space synthetics such as those provided by FUSE</li>

<li>its implementation supports transport independence enabling easy
support for different interconnects (shared memory, Xen device
channels, RDMA, Infiniband, etc.)</li>

</ul>

<p>v9fs-2.0 has a somewhat limited audience at the moment - but now that
the initial implementation is more or less complete we are working to
build applications on top of it (and provide a better server).  It's
being integrated into cluster projects at LANL and being looked at wrt
virtualization I/O at IBM.  It's our hope that these improvements and
cluster applications will motivate more wide-spread use of the v9fs
module.</p>

</quote>

<p>Ronald G. Minnich also added:</p>

<quote who="Ronald G. Minnich">

<p>I got pointed at this discussion. Here are my $.02 on why we at LANL are
interested in v9fs.</p>

<p>We build clusters on the order of 2000 machines at present, with larger
systems coming along. The system which we use to run these clusters is
bproc. While bproc has proven to be very powerful to date, it does have
its limits:</p>

<ul>

<li>requires homogenous system</li>

<li>the network protocols it uses, while simple, are somewhat ad-hoc (as is
common in this type of system)</li>

<li>if you are on a bproc system as user x, using 25% of the system, you
still see 100% of the processes. This is a bit of a security issue.</li>

</ul>

<p>We have a desire to build single-system-image looking clusters along the
bproc model, but at the same time compose those clusters of, e.g., Opterons
and G5s. This mixing is highly desirable for computations that have phases,
some of which belong on one type of a machine, and some on another.</p>

<p>We are going to use v9fs as the glue for our next-generation cluster
software, called 'xcpu'. Xcpu has been implemented on Plan 9 and works
there. I have ported xcpu to Linux, using v9fs as the client side and Russ
Cox's plan9ports server to write servers.</p>

<p>xcpu presents a remote execution service as a 9p server. xcpu has been
tested across architectures and it works very well. By summer 2006, we hope
to have cut over our bproc systems to xcpu.</p>

<p>That's one use for v9fs. We also plan to use v9fs to provide us with
servers for global /proc, monitoring, and control systems for our clusters.</p>

<p>The global /proc is interesting. bproc provides a global /proc, but it is
incomplete; entries for, e.g., exe and maps are not filled in. bproc also
caches part of the /proc, but the rules about what is cached and what the
timeouts are, are set in the kernel module and not easily changed. We are
going to have an "aggregating" user level 9p server based on Mirtchovski's
aggrfs, which will both aggregate all the cluster nodes, and have caching
rules that make sense in clusters of 1000s of nodes (for example, it is ok
to cache /proc/x/status; there is no need to cache /proc/x/maps, and you
probably don't want to anyway).</p>

<p>A neat capability is that if we give a user, e.g., 25% of the cluster,
we can tailor that user's name space so that they only see their procs and
the 25% of the cluster they own. This is good for security, but also good
for convenience: most users don't really care that some other user is on 75%
of the cluster. Global pid spaces are neat in theory, messy in practice at
large scale. I want my global pid space to be global to *me*, meaning I see
the global space of the nodes I care about. The sysadmin, of course, wants
to see everything. All this is possible. V9fs, along with Linux private name
spaces, will allow us to provide this model: users can see some or all of
the global pid space, depending on need; users can be constrained to only
see part of the global pid space, depending on other issues.</p>

<p>9p will also replace the Supermon protocol, allowing people to easily
view status information in a file system.</p>

<p>In addition to the cluster usage, there is also grid usage. The 9grid,
composed of plan 9 systems, is connected by 9p servers. Linux systems can
join the 9grid with no problem, once Linux has v9fs.</p>

<p>Were v9fs just a file system, I would not really be interested in it one
way or another; we have NFS, after all. But v9fs is really the key piece of
a new model of cluster services we are building at LANL. 9p will be the glue,
and v9fs will be the needed client side for hooking 9p servers into the file
system name space.</p>

<p>I'm hoping we can see v9fs in the kernel someday.</p>

</quote>

</section>

<section
  title="Clamping Down On SysFS &quot;Abuses&quot;"
  subject="sysfs abuse in recent i2o changes"
  archive="http://groups.google.com/group/linux.kernel/msg/2482ff295bcc8136?hl=en"
  posts="8"
  startdate="28 Jun 2005 03:21:02 -0800"
  enddate="08 Jul 2005 03:11:17 -0800"
>
<topic>FS: sysfs</topic>
<topic>I2O</topic>

<p>Christoph Hellwig remarked:</p>

<quote who="Christoph Hellwig">

<p>drivers/message/i2o/config-osm.c has a function sysfs_create_fops_file,
which creates a sysfs file with supplied file_operations. This is pretty
much against the sysfs design which only wants simple attributes, ASCII or
for corner cases binary.</p>

<p>Also, if we're going to allow this code it should move to sysfs.
And stop using lookup_hash directly (use lookup_one_len instead), it'll go
away soon.</p>

</quote>

<p>Markus Lidel replied:</p>

<quote who="Markus Lidel">

<p>First, the attributes provided through these functions are for accessing
the firmware... The controller has a little limitation, it could only handle
64k blocks, but sysfs only has 4k...</p>

<p>Now there are two options:</p>

<ol>

<li>when writing: read a 64k block, merge it with the 4k block and write it
back, when reading: read a 64k block and only return the needed 4k block.</li>

<li>extend the sysfs attribute to allow 64k blocks</li>

</ol>

<p>IMHO the first is not a very good solution, because for a 64k block it
has to be written 16 times...</p>

<p>Of course if someone finds a better solution i would be glad to hear
about it...</p>

</quote>

<p>Greg KH said, <quote who="Greg KH">Use the binary file interface of sysfs,
which was written exactly for this kind of thing. :)</quote> Markus gave this
a try, but said, <quote who="Markus Lidel">i haven't found a way to increase
the block size beyond 4k, could you please tell me how i could adjust it,
or where i could read about it?</quote> Greg replied:</p>

<quote who="Greg KH">

<p>Your code should not care about the block size of the data given to you,
as userspace could be giving you 1 byte at a time. Buffer it up yourself
and then write it out to the device when needed.</p>

<p>But if you are doing this for firmware, then please use the kernel firmware
interface, it does all of the buffering for you.</p>

<p>Either way, having your own file_ops in sysfs is not allowed.</p>

</quote>
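
<p>Greg's suggestion to buffer in the driver can be illustrated with a small
user-space sketch. It is not taken from the thread or from the i2o driver: the
BlockBuffer class, the write_block callback and the zero-padding of a trailing
partial block are all assumptions made for illustration. It accepts writes of
any size and only hands complete 64k blocks to the device.</p>

```python
class BlockBuffer:
    """Collect arbitrarily sized writes and emit them one full block at a time."""

    def __init__(self, write_block, block_size=64 * 1024):
        self.write_block = write_block   # hypothetical callback that programs one block
        self.block_size = block_size
        self.pending = bytearray()

    def write(self, data):
        """Accept any number of bytes; flush every complete block."""
        self.pending.extend(data)
        while len(self.pending) >= self.block_size:
            self.write_block(bytes(self.pending[:self.block_size]))
            del self.pending[:self.block_size]
        return len(data)

    def close(self):
        """Flush a trailing partial block, zero-padded (an assumption of this sketch)."""
        if self.pending:
            self.write_block(bytes(self.pending).ljust(self.block_size, b"\0"))
            self.pending.clear()


blocks = []
buf = BlockBuffer(blocks.append)
for _ in range(17):
    buf.write(b"x" * 4096)     # seventeen 4k writes, as a 4k-limited interface would deliver them
buf.close()
```

<p>With this arrangement each 64k block reaches the device once, rather than
being rewritten for every 4k piece as in the read-merge-write option Markus
described.</p>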

<p>Markus said that Greg's solution was more complex and required a lot more
code. Greg offered a couple of suggestions, none of which worked for Markus,
and the thread ended.</p>

</section>

<section
  title="Some Consideration Of Swap Files Versus Swap Partitions"
  subject="Swap partition vs swap file"
  archive="http://groups.google.com/group/fa.linux.kernel/msg/926b53e8dcf0a11a?hl=en"
  posts="19"
  startdate="28 Jun 2005 16:57:21 -0800"
  enddate="13 Jul 2005 02:58:40 -0800"
>
<topic>FS: ext3</topic>

<p>Mike Richards asked if there were any differences between using a swap file
and a swap partition. Andrew Morton replied, <quote who="Andrew Morton">In
2.6 they have the same reliability and they will have the same performance
unless the swapfile is badly fragmented.</quote> Mike replied:</p>

<quote who="Mike Richards">

<p>Three more short questions if you have time:</p>

<ol>

<li>You specify kernel 2.6 -- What about kernel 2.4? How less reliable or
worse performing is a swapfile on 2.4?</li>

<li>Is it possible for the swapfile to become fragmented over time, or does
it just keep using the same blocks over and over? i.e. if it's all contiguous
when you first create the swapfile, will it stay that way for the life of
the file?</li>

<li>Does creating the swapfile on a journaled filesystem (e.g. ext3 or reiser)
incur a significant performance hit?</li>

</ol>

</quote>

<p>To the first question, Andrew replied, <quote who="Andrew Morton">2.4 is
weaker: it has to allocate memory from the main page allocator when performing
swapout. 2.6 avoids that.</quote> And to the third, Andrew said there was no
performance penalty for creating a swapfile on a journaled filesystem. He
said, <quote who="Andrew Morton">The kernel generates a map of swap offset
-&gt; disk blocks at swapon time and from then on uses that map to perform
swap I/O directly against the underlying disk queue, bypassing all caching,
metadata and filesystem code.</quote></p>
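
<p>The mapping Andrew describes can be sketched roughly as follows. This is
a simplified illustration, not the kernel's actual data structure: the
function names are hypothetical, and it assumes one disk block per swap page.
The map is built once from the file's block list, and later lookups never
consult the filesystem.</p>

```python
from bisect import bisect_right


def build_extents(disk_blocks):
    """Collapse a per-page disk block list into (first_page, first_block, length) runs."""
    extents = []
    for page, block in enumerate(disk_blocks):
        if extents:
            start_page, start_block, length = extents[-1]
            if block == start_block + length:      # contiguous with the previous run
                extents[-1] = (start_page, start_block, length + 1)
                continue
        extents.append((page, block, 1))
    return extents


def lookup(extents, page):
    """Translate a swap page offset to a disk block using the prebuilt map."""
    if not extents or 0 > page:
        raise IndexError("page offset outside swap file")
    starts = [e[0] for e in extents]
    i = bisect_right(starts, page) - 1
    start_page, start_block, length = extents[i]
    if page - start_page >= length:
        raise IndexError("page offset outside swap file")
    return start_block + (page - start_page)


contiguous = build_extents(range(1000, 1008))
fragmented = build_extents([1000, 1001, 5000, 5001, 5002, 90, 91, 92])
```

<p>A contiguous file collapses to a single run, while the fragmented one needs
three, which is one way to picture why a badly fragmented swapfile costs more
seeks for the same I/O.</p>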

<p>To the second question, regarding fragmentation, Andrew said, <quote
who="Andrew Morton">Create the swapfile when the filesystem is young and empty,
it'll be nice and contiguous. Once created the kernel will never add or remove
blocks. The kernel won't let you use a sparse file for a swapfile.</quote>
Coywolf Qi Hunt remarked, <quote who="Coywolf Qi Hunt">I guess/hope dd
always makes it contiguously.</quote> And Bernd Eckenfels replied, <quote
who="Bernd Eckenfels">No, it is creating files by appending just like any
other file write. One could think about a call to create unfragmented files;
however, since this is not always working, best is to create those files young
or defragment them before usage.</quote> But Jeremy Nickurak remarked that
<quote who="Jeremy Nickurak">this defeats one of the biggest advantages a
swap file has over a swap partition: the ability to easily reconfigure the
amount of hd space reserved for swap.</quote> Wakko Warner then asked, <quote
who="Wakko Warner">Is it possible to create a large file w/o actually writing
that much to the device (ie uninitialized). There's absolutely no reason
that a swap file needs to be fully initialized, only part which mkswap does.
Of course, I would expect that ONLY root be able to do this.</quote> A couple
of posts down the line, Bernd replied, <quote who="Bernd Eckenfels">There is no
portable/documented way to grow a file without having the file system null
its content. However why is that a problem, you don't create those files very
often. Besides it is better for the OS to be able to assume that a page with
zeros in it is equal to the page on fresh swap.</quote></p>

</section>

<section
  title="Status Of Linux Trace Toolkit Overhaul"
  subject="[PATCH/RFC] Significantly reworked LTT core"
  archive="http://groups.google.com/group/linux.kernel/msg/947af33a27f753e5?hl=en"
  posts="8"
  startdate="01 Jul 2005 18:46:25 -0800"
  enddate="08 Jul 2005 05:20:09 -0800"
>

<p>Karim Yaghmour said:</p>

<quote who="Karim Yaghmour">

<p>A few months back, there was a very large thread of discussion about
the inclusion of the ltt code by Andrew in -mm. Following this discussion,
relayfs was quite heavily trimmed down. However, unlike what I had promised,
I never got around to actually do the same to the ltt code. Part of it was my
not being ready to actually gut 5 years of coding ... that was just kind of
difficult. Lately though, through active discussion on the ltt-dev list, this
issue has resurfaced and a few pieces of revamped code started going around.
Thanks to Mathieu Desnoyers (Ecole Polytechnique) and Michael Raymond (SGI)
getting things moving again, I got back to thinking about the best way
to get the LTT code down to a palatable structure. And this time around,
I gave simplicity a chance ...</p>

<p>Which brings me to the patch below. This is a significantly cut
down version of the ltt core. It's now 5K instead of the initial 100K.
While the size has been trimmed down, much of the functionality can still
be easily obtained through the introduction of a new method: the ltt
multiplexer (ltt_mux). Basically, this is the function that controls the
tracing behavior. If none is provided, no tracing goes on. Typically, such
a function would be implemented as part of a loadable "control" module. Said
module would be responsible for:</p>

<ul>

<li>Allocating and managing relayfs buffers for storing events</li>

<li>Allowing the user-space tracing daemon to control tracing, such as by
controlling event masks, etc.</li>

<li>Communicate with the user-space daemon for committing buffered data</li>

<li>Providing primitives for having multiple tracing streams, including
flight-recording.</li>

<li>Provide abstractions for registering new facilities and events.</li>

<li>Maintaining overall sanity of tracing functionality.</li>

</ul>

<p>IOW, much of what was purged can now be modularized and loaded
separately. Obviously this doesn't preclude having those modules still
packaged with the rest of the kernel, but it does make things much cleaner.</p>

<p>This patch isn't definitive, it's truly experimental. I've only
compile-tested it for now. I'm posting it here mostly as a preview. Of course,
your feedback is welcome.</p>

</quote>

<p>Christoph Hellwig remarked:</p>

<quote who="Christoph Hellwig">

<p>This code is rather pointless. The ltt_mux is doing all the real work
and it's not included. And while we're at it the layering for it is wrong
as well - the ltt_log_event API should be implemented by the actual multiplexer
with what's in ltt_log_event now minus the irq disabling becoming a library
function.</p>

<p>Exporting a pointer to the root dentry seems like a very wrong API as well,
that's an implementation detail that should be hidden.</p>

<p>Besides that the code is not following Documentation/CodingStyle at all,
please read it.</p>

<p>Besides that I'd suggest scrapping the ltt name and ltt_ prefix - we know
we're on linux, and we don't care whether it's a toolkit, but spelling trace_
out would actually be a lot more descriptive. So what about trace_* symbol
names and trace.[ch] filenames?</p>

</quote>

<p>Karim said he didn't mind changing the name, and he'd look into following
CodingStyle more closely. Regarding Christoph's criticism of the code itself,
Karim replied:</p>

<quote who="Karim Yaghmour">

<p>Yes, you're partially right, ltt_mux is doing a lot of work, and it's not
included. However, what work ltt_mux is doing is administrative and that's what
was complained about a lot last time the ltt patches were included. So yes,
I could provide a very basic ltt_mux that would instantiate a single relayfs
channel and does no filtering whatsoever, but that would be insufficient for
real usage. And if I provided a full mux, then we'd pretty much end up with
the same code we had previously.</p>

<p>By having it this way, the essential part of the mechanism, its logging
code, is shared by all, yet there can be any number of muxes loaded on top
of it. The LKST project, for example, has got a module that just counts the
events that occur. Plug that as the mux, and always return NULL (no channel
to write to) and you're ready to go.</p>

<p>For ltt, the mux would be quite involved, including having netlink sockets
going back and forth talking to a user-space daemon, and allowing quite a
few options/features to be set.</p>

<p>In other cases, it should be fairly simple to implement a mux local to
a given subsystem that a developer needs to monitor. He can then manage
everything about how tracing goes on without having to rewrite his own
logging function.</p>

<p>The rationale here is simple: there is no need to have multiple logging
functions, but there are already multiple existing implementations of deciding
how and what needs to be logged, how it's controlled, and how it interfaces with
the outside world (be it user-space or otherwise.) This code, simplistic as
it may be, serves this reality quite well.</p>

<p>If what's in ltt_log_event goes into the multiplexer, then we're back
to having each implementation have its own buffering mechanism and yet no
single entry-point for tracing inside the kernel.</p>

<p>Replacing local_irq_disable/enable() with function pointers is not a
problem, if that's something desirable.</p>

</quote>
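
<p>The split Karim argues for, one shared logging entry point with a
replaceable multiplexer on top, might look roughly like the sketch below. It
is a user-space illustration under stated assumptions, not the posted patch:
register_mux, log_event and the two example muxes are hypothetical names, a
lock stands in for the interrupt disabling mentioned in the thread, and a
plain list stands in for a relayfs channel.</p>

```python
import threading

_mux = None                    # the currently registered multiplexer, if any
_mux_lock = threading.Lock()   # stand-in for the irq disabling discussed above


def register_mux(mux):
    """Install the function that decides where (and whether) an event is logged."""
    global _mux
    with _mux_lock:
        _mux = mux


def log_event(event_id, payload):
    """Single shared entry point: ask the mux for a channel, then append the event."""
    with _mux_lock:
        if _mux is None:
            return False               # no mux registered: no tracing goes on
        channel = _mux(event_id, payload)
        if channel is None:
            return False               # the mux consumed or dropped the event itself
        channel.append((event_id, payload))
        return True


def make_counting_mux(counts):
    """In the spirit of the LKST example: count events and return no channel."""
    def mux(event_id, payload):
        counts[event_id] = counts.get(event_id, 0) + 1
        return None
    return mux


def make_single_channel_mux(channel):
    """The 'very basic' mux: a single channel and no filtering."""
    def mux(event_id, payload):
        return channel
    return mux


counts = {}
register_mux(make_counting_mux(counts))
log_event("syscall_entry", b"")
log_event("syscall_entry", b"")

trace = []
register_mux(make_single_channel_mux(trace))
log_event("irq", b"\x01")
```

<p>Swapping muxes changes the policy without touching log_event, which is the
property Karim defends. Christoph's objection is that the interesting policy
code is exactly the part left out of the patch.</p>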

<p>Christoph said, <quote who="Christoph Hellwig">We're not gonna add hooks to
the kernel so you can compile the same horrible code you had before against it
out of tree. Do a sane demux and submit it.</quote> And Karim replied, <quote
who="Karim Yaghmour">If I just wanted hooks, I would have submitted a patch
that did just that, without any logging function. The code for the mux that
goes on top of that code is actually on its way to be completely rewritten.
I can see that you may have read my posting as indicating that we were
recompiling the same previous code out of tree, but that is certainly not the
intent. FWIW, we'll look at submitting a minimal mux with the patch.</quote></p>

</section>

<section
  title="Linux 2.6.13-rc2-mm1 Released"
  subject="2.6.13-rc2-mm1"
  archive="http://groups.google.com/group/fa.linux.kernel/msg/8c4aff776db3c315?hl=en"
  posts="16"
  startdate="07 Jul 2005 03:00:37 -0800"
  enddate="11 Jul 2005 14:22:20 -0800"
>
<topic>Digital Video Broadcasting</topic>
<topic>Kernel Release Announcement</topic>
<topic>Software Suspend</topic>
<topic>User-Mode Linux</topic>
<topic>Virtual Memory</topic>

<mention>Miklos Szeredi</mention>
<mention>Andrew Morton</mention>

<p>Andrew Morton announced Linux 2.6.13-rc2-mm1, saying:</p>

<quote who="Andrew Morton">

<p><a
href="ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.13-rc2/2.6.13-rc2-mm1/">ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.13-rc2/2.6.13-rc2-mm1/</a></p>

<p>(kernel.org seems to be stuck again - there's a copy at <a
href="http://www.zip.com.au/~akpm/linux/patches/stuff/2.6.13-rc2-mm1.gz">http://www.zip.com.au/~akpm/linux/patches/stuff/2.6.13-rc2-mm1.gz</a>)</p>

<ul>

<li>Various stuff.</li>

<li>I seem to have quite a bit of material here which is appropriate to
2.6.13:</li>

<ul>

  <li>All the ppc64 patches</li>

  <li>Most of the PM/swsusp patches</li>

  <li>UML</li>

  <li>Not sure about smsc-ircc2-*?</li>

  <li>infiniband update (the VM changes are awkward, but are localised)</li>

  <li>rapidio is still on hold pending rework of rapidio-support-net-driver.patch</li>

  <li>all the VFS/namespace patches are on hold pending review</li>

  <li>all the DVB patches</li>

  <li>all the pcmcia patches</li>

  <li>all the nfs4 patches</li>

  <li>all the v4l patches</li>

  <li>other random patches</li>

</ul>

<li>Anything which you think needs to go into 2.6.13, please let me know.</li>

<li>And I need to do another round of sending patches to subsystem maintainers.
Each time I do this only about one third of it sticks. Please try harder.</li>

</ul>

</quote>

<p>Miklos Szeredi asked about the status of FUSE inclusion, but there was
no yea or nay on it.</p>

</section>

<section
  title="Audit Subsystem Maintainership"
  subject="[PATCH] Add MAINTAINERS entry for audit subsystem"
  archive="http://groups.google.com/group/linux.kernel/msg/ca3831574041db67?hl=en"
  posts="4"
  startdate="07 Jul 2005 17:12:23 -0800"
  enddate="09 Jul 2005 04:19:17 -0800"
>
<topic>MAINTAINERS File</topic>

<mention>David Woodhouse</mention>
<mention>Chris Wright</mention>
<mention>Andrew Morton</mention>

<p>Chris Wright added an entry for the audit subsystem to the MAINTAINERS
file, giving a mailing list but no actual maintainer. As it turned out,
David Woodhouse had already submitted a similar patch to Andrew Morton's -mm
tree, but Chris had missed it because David had made a slight alphabetizing
error. They sorted it out.</p>

</section>

<section
  title="New Apple USB Touchpad Driver For Recent PowerBooks"
  subject="[PATCH] Apple USB Touchpad driver (new)"
  archive="http://groups.google.com/group/linux.kernel/msg/a94a4dfb929d59db?hl=en"
  posts="28"
  startdate="08 Jul 2005 02:17:32 -0800"
  enddate="12 Jul 2005 11:21:57 -0800"
>
<topic>USB</topic>

<mention>Vojtech Pavlik</mention>
<mention>Peter Osterlund</mention>
<mention>Johannes Berg</mention>

<p>Stelian Pop said:</p>

<quote who="Stelian Pop">

<p>This is a driver for the USB touchpad which can be found on post-February
2005 Apple PowerBooks (PowerBook5,6).</p>

<p>This driver is derived from Johannes Berg's appletrackpad driver (<a
href="http://johannes.sipsolutions.net/PowerBook/touchpad/">http://johannes.sipsolutions.net/PowerBook/touchpad/</a>),
but it has been improved in some areas:</p>

<ul>

<li>appletouch is a full kernel driver, no userspace program is necessary</li>

<li>appletouch can be interfaced with the synaptics X11 driver (<a
href="http://web.telia.com/~u89404340/touchpad/index.html">http://web.telia.com/~u89404340/touchpad/index.html</a>),
in order to have touchpad acceleration, scrolling, etc.</li>

</ul>

<p>This driver has been tested by the readers of the 'debian-powerpc' mailing
list for a few weeks now and I believe it is now ready for inclusion into
the mainline kernel.</p>

<p>Credits go to Johannes Berg for reverse-engineering the touchpad protocol,
Frank Arnold for further improvements, and Alex Harper for some additional
information about the inner workings of the touchpad sensors.</p>

</quote>

<p>Johannes Berg was happy to see this going into the kernel, and offered
some technical suggestions. Stelian accepted them and submitted an updated
patch. Vojtech Pavlik, Peter Osterlund and others also pitched in with
their suggestions, which Stelian also implemented.</p>

</section>

<section
  title="bootutils 0.0.5 Released"
  subject="[ANNOUNCE] bootutils v0.0.5"
  archive="http://groups.google.com/group/linux.kernel/msg/8018e03c9643ab1f?hl=en"
  posts="1"
  startdate="09 Jul 2005 13:36:36 -0800"
>
<topic>FS: ReiserFS</topic>
<topic>FS: ext2</topic>
<topic>FS: ext3</topic>
<topic>FS: initramfs</topic>
<topic>FS: ramfs</topic>
<topic>Klibc</topic>

<p>Nigel Kukard said:</p>

<quote who="Nigel Kukard">

<p>Project Description:</p>

<p>BootUtils is a collection of utilities to facilitate booting of modern
Kernel 2.6 based systems. BootUtils is designed for initramfs, although
volunteers to add support for initrd are welcome. The process of finding the
root volume either by label or explicit label= on the kernel command line,
mounting it and 'switchroot'ing is automated. BootUtils can also drop to
emergency shell if the root volume cannot be mounted. Why not even start
sshd and allow admin login if the box is in a remote location?</p>

<p>Features:</p>

<ul>

<li>Automatic detection of root volume by label or explicit kernel commandline option</li>
<li>Supports ext2, ext3, jfs, reiserfs and xfs</li>
<li>Emergency shell dropping in the case of a root volume problem</li>
<li>Distribution independent</li>

</ul>

<p>Changes:</p>

<ul>

<li>Added support to build with klibc</li>
<li>Included libblkid/libuuid</li>
<li>Fixed parsing of multiple root= kernel commandline options.</li>

</ul>

<p>Website:</p>

<p><a
href="http://www.freshmeat.net/projects/bootutils/">http://www.freshmeat.net/projects/bootutils/</a></p>

</quote>
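
<p>As a rough illustration of the root-volume selection described in the
announcement, the sketch below picks a root specification from a kernel
command line. It is not BootUtils code. The find_root function, the
root=LABEL=name syntax, the fallback label "ROOT" and the rule that a later
root= overrides an earlier one are all assumptions, since the announcement
does not say how multiple root= options are resolved.</p>

```python
def find_root(cmdline, default_label="ROOT"):
    """Return ('label', name) or ('device', path) for the root volume."""
    spec = None
    for word in cmdline.split():
        if word.startswith("root="):
            spec = word[len("root="):]   # assumption: the last root= wins
    if spec is None:
        return ("label", default_label)  # hypothetical fallback: search by label
    if spec.startswith("LABEL="):
        return ("label", spec[len("LABEL="):])
    return ("device", spec)


by_device = find_root("ro root=/dev/hda1 quiet root=/dev/sda2")
by_label = find_root("ro root=LABEL=slash")
fallback = find_root("ro quiet")
```

<p>An initramfs script would then resolve a label to a device before mounting
and switching root, or drop to the emergency shell if nothing matches.</p>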

</section>

<section
  title="Summary Of Recent RT Patch Acceptance Discussion"
  subject="Attempted summary of &quot;RT patch acceptance&quot; thread, take 2"
  archive="http://groups.google.com/group/linux.kernel/msg/bb26e19e01a90413?hl=en"
  posts="7"
  startdate="11 Jul 2005 06:55:52 -0800"
  enddate="13 Jul 2005 06:29:18 -0800"
>
<topic>Assembly</topic>
<topic>Big O Notation</topic>
<topic>Microkernels: Adeos</topic>
<topic>Networking</topic>
<topic>POSIX</topic>
<topic>Real-Time: RTAI</topic>
<topic>SMP</topic>
<topic>Scheduler</topic>
<topic>Small Systems</topic>
<topic>Sound: ALSA</topic>
<topic>Virtual Memory</topic>

<mention>Thomas Gleixner</mention>
<mention>Lee Revell</mention>
<mention>David Lang</mention>
<mention>Bill Davidsen</mention>
<mention>Duncan Sands</mention>
<mention>Karim Yaghmour</mention>
<mention>Steven Rostedt</mention>
<mention>John Alvord</mention>
<mention>Takashi Iwai</mention>
<mention>Peter Chubb</mention>
<mention>Inaky Perez-Gonzalez</mention>
<mention>Andrew Morton</mention>
<mention>Paul G. Allen</mention>
<mention>Con Kolivas</mention>
<mention>Ingo Molnar</mention>
<mention>Victor Yodaiken</mention>
<mention>Kristian Benoit</mention>
<mention>Jonathan Corbet</mention>
<mention>Andrea Arcangeli</mention>
<mention>Gene Heskett</mention>
<mention>Daniel Walker</mention>
<mention>Darren Hart</mention>
<mention>Nicolas Pitre</mention>
<mention>Philippe Gerum</mention>
<mention>Sven-Thorsten Dietrich</mention>
<mention>Chris Friesen</mention>
<mention>Marcelo Tosatti</mention>
<mention>Paulo Marques</mention>
<mention>Nick Piggin</mention>
<mention>Andi Kleen</mention>
<mention>Bill Huey</mention>
<mention>William Lee Irwin III</mention>
<mention>Zwane Mwaikambo</mention>

<p>Paul E. McKenney posted a summary of some recent discussion of RT patch
acceptance:</p>

<quote who="Paul E. McKenney">

<p>CONTENTS</p>

<p>A.      INTRODUCTION<br />
B.      DESIRABLE PROPERTIES<br />
C.      LINUX REALTIME APPROACHES<br />
D.      OTHER ASPECTS OF REALTIME<br />
E.      SUMMARY<br />
F.      RESOURCES</p>

<p>Search for a line beginning with the corresponding capital letter followed
by a period to jump to the corresponding section.</p>


<p>A.  INTRODUCTION</p>

<p>Common wisdom dictates that realtime operating systems, particularly
hard-realtime operating systems, must be designed from ground up; that
serious realtime support cannot be simply grafted onto an existing
general-purpose operating system.  Although this common wisdom was
not arrived at lightly, it is often worthwhile to look for important
exceptions to this sort of general rule of thumb.  Candidate exceptions
include:</p>

<ol>

<li>      Many realtime applications use a very restricted subset of
        the services provided by a general-purpose OS like Linux.
        Some applications require realtime support only for scheduling
        user-mode code, for example, an application that directly accesses
        MMIO registers mapped into its address space.  This observation
        leads to the possibility of providing very limited realtime
        support.</li>

<li>      Computer performance and capacity has increased dramatically
        over the past few decades, quite literally by multiple orders
        of magnitude.  A small embedded system can easily be much more
        capable than a mid-70s supercomputer, for example, the vaunted
        Cray-1, introduced in 1976, ran at 160MFLOPs and sported 8MB of
        main memory.  In today's terms, this would be a modest embedded
        system -- and just you try running Linux on an 8MB system!
        This dramatic increase in performance permits some applications
        that would have required heavy-duty RTOS support in the 70s to
        run reasonably well on unmodified general-purpose OSes.</li>

</ol>

<p>There are still limits to the degree of realtime support that one can
expect from a general-purpose OS -- there are some extremely demanding
applications that can be satisfied only by hand-coded assembly running
on bare metal.  In fact, there are applications that can be satisfied
only by custom hardware implementations.  For example, standard DRAM is
only so fast, and large CPU caches help only the common case, not the
worst case that is important for hard realtime.  In this case, the
custom hardware might be a small CPU core with a modest amount of static
RAM.  In still more demanding situations, custom logic might be required.</p>

<p>Nevertheless, it is clear that Linux can support significant realtime
requirements, as it is already being used heavily in the realtime arena.
But how far should Linux extend its realtime support, and what is the
best way to extend Linux in this direction?  Can one approach to realtime
satisfy all reasonable requirements, or would it be better to support
multiple approaches, each with its area of applicability?</p>

<p>The answers to these questions are not yet clear, and have been the
subject of much spirited discussion, for example, see the more than
300 messages in the following LKML thread:</p>

<p>        <a href="http://lkml.org/lkml/2005/5/23/156">http://lkml.org/lkml/2005/5/23/156</a><br />
        <a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111689227213061&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111689227213061&amp;w=2</a></p>

<p>This document looks at some strategies that have been proposed for
realtime Linux, comparing and contrasting their capabilities.  But, to
evaluate these strategies, it is first necessary to determine what exactly
one might want in a realtime Linux.  If you would rather skip straight to
the comparing and contrasting, search for "LINUX REALTIME APPROACHES".</p>


<p>B.  DESIRABLE PROPERTIES</p>

<p>As usual, there are conflicting desires, at least they conflict given
the current state of the art.  These desires fall into the following
categories:</p>

<ol>

<li>Quality of service</li>
<li>Amount of code that must be inspected to assure quality of service</li>
<li>API provided</li>
<li>Relative complexity of OS and applications</li>
<li>Fault isolation: what non-RT failures endanger RT code?</li>
<li>What hardware and software configurations are supported?</li>

</ol>

<p>Each of these categories is expanded upon below, and later used to compare
a number of proposed realtime approaches for Linux.  The discussion does
go for some time, which is not surprising given that it is summarizing
many hundreds of email messages.  ;-)  Search for the corresponding
number at the beginning of a line to skip directly to the discussion of
a given category.</p>

<ol>

<li>Quality of Service</li>

<p>The traditional view is that the entire operating system is either
hard realtime, soft realtime, or non-realtime, but this viewpoint
is too coarse grained.  Different workloads have different needs,
and there is disagreement over the exact definitions of these three
categories of realtime.  For example, (at least) the following two
definitions of "hard realtime" are in use:</p>

<blockquote>

<p>a.      In absence of hardware failures, software provably meets
        the specified deadlines.  This is fine and good, but many
        applications simply do not need this "diamond hard" realtime.</p>

<p>b.      Failure to meet the specified deadline results in application
        failure.  This is OK, but -only- if there is a corresponding
        required probability of success.  Otherwise, one could claim
        "hard realtime" by simply failing the application every time it
        tries to do anything, which is clearly not useful.</p>

</blockquote>

<p>A better approach is to simply specify the required probability
of meeting the specified deadline in absence of hardware failure.
A probability of 1.0 is consistent with definition (a).  Other
applications will be satisfied with a probability such as 0.999999,
which might be sufficiently high that the probability of software
scheduling failure is "in the noise" compared with the probability
of hardware failure.  A recent LKML thread called this "metal hard"
realtime.  Or was it "ruby hard"?  ;-)</p>

<p>Of course, one can increase the reliability of hardware through
redundancy, but no hardware configuration provides perfect reliability.
For example, clusters can increase reliability, so that the probability
of failure of the cluster is p^n, where "p" is the probability of a
single node failing and "n" is the number of nodes.  Note that this
expression never reaches a probability of 0, no matter how large "n" is.
In addition, this mathematical expression assumes that the failover
software is perfectly reliable and perfectly configured.  This assumption
conflicts sharply with my own experience, in which there has always been
a point beyond which adding nodes -decreased- cluster reliability.
So one can argue that effort put into making software more reliable
than is the underlying hardware is effort wasted.  That said, there
are situations, such as when human life is on the line, where such
effort might be an extremely wise investment.</p>

<p>The timeframe is just as important as is the probability of meeting the
deadline.  Any system can provide hard realtime guarantees if the deadline
is an infinite amount of time in the future.  No computer system that I am
aware of at this writing is capable of meeting a 1-picosecond scheduling
deadline for any task of non-zero duration, but then neither can dedicated
digital hardware.  Some applications have definite response-time goals,
for example, industrial process-control applications tend to have
response-time goals ranging from 100s of microseconds to small numbers
of seconds, while non-interactive applications such as graphics playback
(movies and the like) are said to need no better than about 7 milliseconds
scheduling jitter.  Other applications can benefit from any improvement in
response-time goals -- faster is better, think in terms of Doom players --
but even in these cases there is normally a point of diminishing returns.</p>

<p>The services used by the realtime application also figure in.  Given
current disk technology, it is not possible to meet a 100-microsecond
deadline for a 1GB synchronous write to disk.  Not even if you cheat
and supply the disk with a battery-backed-up DRAM cache.  However, many
realtime applications need only a few of the services that an operating
system might provide.  This list might include interrupt handling, process
scheduling, disk I/O, network I/O, process creation/destruction, VM
operations, and so on.  Keep in mind that many popular RTOSes provide very
little in the way of services!  They frequently leave the complex stuff
(e.g., web serving) to general-purpose operating systems.  This situation
raises the possibility of providing a single Linux operating-system
instance that provides some services with realtime guarantees and other
services in a non-realtime fashion, with no guarantees of any sort.</p>

<p>Note that each service can have an associated deadline that it can
meet.  The interrupt system might be able to meet a 1-microsecond
deadline, the real-time process scheduler a 10-microsecond deadline,
the disk I/O system a 10-millisecond deadline for moderate-sized
I/Os, and so on.  The deadline that a service can meet might also
depend on the parameters, so that the disk-I/O system would be
expected to take longer for larger I/Os.</p>

<p>Furthermore, the probability might vary from service to service or with
the parameters to that service.  For example, the probability of network
I/O completing successfully in minimal time might well be a function
of the number of packets transmitted (to account for the probability of
packet loss) as well as of packet size (to account for bit-error rate).
To make things even more complicated, the probability of meeting the
deadline will vary depending on the length of time allowed.  Considering
the networking example, a very short deadline might not allow the data
transmission to complete, even if it proceeds at wire speed.  A longer
deadline might allow transmission to complete, but only if there are
no transmission errors.  An even longer deadline might allow time for
a limited number of retransmissions, in order to recover from packet
loss due to transmission errors.  Of course, a deadline infinitely far
into the future would allow guaranteed completion, but I for one am not
that patient.</p>

<p>Finally, the performance and scalability of both realtime and non-realtime
applications running on the system can be important.  Given the current
state of the art, one must pay a performance penalty for realtime support,
but the smaller the penalty, the better.</p>

<p>So, to sum up, here are the components of a quality-of-service metric
for realtime OSes:</p>

<blockquote>

<p>a.      List of services for which realtime response is supported.</p>

<p>b.      For each service:</p>

<blockquote>

<p>        i.      Probability of meeting a deadline in absence of hardware
                failure, ranging from 0 to 1, with the value of 1
                corresponding to the hardest possible hard realtime.</p>

<p>        ii.     Allowable deadline, measured from the time that
                the request is initiated to the time by which the
                response must be received.</p>

</blockquote>

<p>c.      Performance and scalability provided to both realtime and
        non-realtime applications.</p>

</blockquote>
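<p>The components listed above could be recorded in a small data structure
along the following lines.  This is a sketch only: the class and field
names (ServiceQoS, RealtimeQoS, deadline_us, and so on) are hypothetical,
and the example values are made up for illustration.</p>

```python
from dataclasses import dataclass, field


@dataclass
class ServiceQoS:
    name: str                     # one entry in the list of services (a)
    deadline_probability: float   # item b.i: 0.0 to 1.0
    deadline_us: float            # item b.ii: request start to response

    def __post_init__(self):
        if self.deadline_probability > 1.0 or 0.0 > self.deadline_probability:
            raise ValueError("probability must lie between 0 and 1")
        if not self.deadline_us > 0:
            raise ValueError("deadline must be positive")

    @property
    def is_hard(self):
        # A probability of 1 is the hardest possible hard realtime.
        return self.deadline_probability == 1.0


@dataclass
class RealtimeQoS:
    services: list = field(default_factory=list)   # item a
    performance_notes: str = ""                     # item c

    def supports(self, name):
        return any(s.name == name for s in self.services)


if __name__ == "__main__":
    # Made-up numbers, purely to show the shape of the metric.
    qos = RealtimeQoS(
        services=[ServiceQoS("task scheduling", 0.999, 100.0)],
        performance_notes="small penalty for non-realtime tasks",
    )
    print(qos.supports("task scheduling"), qos.services[0].is_hard)
```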

<li>Amount of Code Inspection Required</li>

<p>So you add a new feature to a realtime operating system.  How much of
the rest of the system must you inspect and understand in order to be
able to guarantee that your new feature provides the required level
of realtime response?  The smaller this amount of code, the easier it
is to add new features and fix bugs, and the greater the number of
people who will be able to contribute to the project.  In addition,
the smaller the amount of such code, the smaller the probability that
some well-intentioned bug fix will break realtime response.</p>

<p>Each of the following categories of code might need to be inspected:</p>

<blockquote>

<p>a.      The low-level interrupt-handling code.</p>

<p>b.      The realtime process scheduler.</p>

<p>c.      Any code that disables interrupts.</p>

<p>d.      Any code that disables preemption.</p>

<p>e.      Any code that holds a lock, mutex, semaphore, or other resource
        that is needed by the code implementing your new feature, as
        well as the code that actually implements the lock, mutex,
        semaphore, or other resource.</p>

<p>f.      Any code that manipulates hardware that can stall the bus,
        delay interrupts, or otherwise interfere with forward progress.
        Note that it is also necessary to inspect user-level code that
        directly manipulates such hardware.</p>

</blockquote>

<p>Of course, use of automated tools could make such inspection much more
reliable and less onerous, but one would want such tools to deal with
the very large number of CPU architectures and configuration options
that Linux supports.  The smaller the amount of code that must be
inspected, the less chance there is that such a tool will fall victim to
configuration-architecture combinatorial explosion.  Of course, a tool
that supported only a specific CPU architecture with a limited set of
configuration options might still be useful, but the wider the coverage,
the more useful the tool.</p>

<p>The hardware connection called out in point "f" above is quite important,
and much more difficult to deal with, since machine-inspectable source
code for firmware and for hardware (e.g., VHDL code) is typically not
readily available.  These sorts of problems are anything but theoretical;
for example, see section 4.5 of:</p>

<p><a href="http://www.cs.utah.edu/~regehr/papers/hotos7/hotos7.html">http://www.cs.utah.edu/~regehr/papers/hotos7/hotos7.html</a></p>

<p>which describes some problems that were triggered by X-windows (not
kernel!) driver bugs that resulted in hardware stalls.  Similar problems
have been triggered in other chipsets:</p>

<p><a href="http://www.rme-audio.de/english/techinfo/nforce4_tests.htm">http://www.rme-audio.de/english/techinfo/nforce4_tests.htm</a></p>

<p>At present, there is no known way of finding these problems other than
exhaustive testing.</p>

<p>Each of the Linux realtime approaches uses a different strategy to
minimize the amount of code in these categories.  These differences
are surprisingly important, and will be discussed in more detail
when going over the various approaches to Linux realtime.</p>

<li>API Provided</li>

<p>I never have learned to -really- like the POSIX API, with the gets()
primitive being a particular cause of heartburn, but given the huge
amount of software out there that relies on it and the equally huge
number of developers who are familiar with it, one should certainly
strive to provide it, or at least a sizeable subset of it.</p>

<p>Other popular APIs include the various Java runtime environments,
and of course the feared and loathed, but quite ubiquitous, Windows
API.</p>

<p>There are a lot of developers and a lot of software out there.  The
more of these existing developers and software your API supports,
the more successful your realtime facility is likely to be.</p>

<li>Relative Complexity</li>

<p>How much realtime capability should be added to the operating system?
How much of this burden should the applications take on?  Is it better
to push some of the complexity into a nanokernel, hypervisor, or other
software or firmware layer?  Let's first look at the tradeoff between
OS and application.</p>

<p>For example, although it is certainly possible to program for separate
realtime and non-realtime operating-system instances, doing so adds
complexity to the application.  Complexity is particularly deadly in the
hard realtime arena, and can be literally so if human lives are at risk.</p>

<p>Balancing this consideration is the need for simplicity in the
operating-system kernel.  This balancing act must be carefully considered,
taking both the relative complexities and the number of uses into
account.  Some would argue that it is worthwhile adding 1,000 lines
to the OS if that saves 100 lines in each of 1,000 applications.
Others would disagree, perhaps citing the greater fault isolation
that might be provided by the separation.</p>

<p>But this balance clearly must be struck somewhere between writing the
application to bare metal on the one hand (but achieving a perfectly
simple zero-size operating system) and bloating the operating system
beyond the limits of maintainability on the other hand.</p>

<p>Similar arguments can be made for moving some functionality into a
hypervisor or nanokernel layer, though fault isolation also comes
into play here.</p>

<p>Many of the most vociferous arguments seem to revolve around this
complexity issue.  It is quite possible that there never will be a single
agreed-upon solution, since different people place different emphasis on
different aspects of this design choice.  Nonetheless, a well-thought-out
discussion is very likely to turn up better design choices.</p>

<li>Fault Isolation</li>

<p>Can a programming error in a non-realtime application or in a non-realtime
portion of the OS harm a realtime application?</p>

<p>Some applications do not care: in these cases, a failure anywhere
causes a user-visible failure, so it is not important to isolate
faults.  Of course, even in these cases, it may be valuable to isolate
faults in order to aid debugging, but, other than that, the fault
isolation does not help overall application reliability -- regardless
of where the bug occurs, the user sees a failure.</p>

<p>In other cases, the realtime portion of the application is protecting
someone's life and limb, but the non-realtime portion is only compiling
statistics and reports.  In this case, fault isolation can be of the
utmost importance.</p>

<p>What sorts of faults need isolating?</p>

<ul>

<li>Excessive disabling of interrupts.</li>

<li>Excessive disabling of preemption.</li>

<li>Holding a lock, mutex, or semaphore for too long, when that
resource must be acquired by realtime code.</li>

<li>Memory corruption, either via wild pointers or via wild DMA.</li>

</ul>

<p>These faults might occur in the main kernel, in a loadable module, or in
some debugging tool, such as a kprobe procedure or a kernel-debugger
breakpoint script.  Though in the latter case, perhaps realtime
deadlines should not be guaranteed when actively debugging.  After all,
straightforward debugging techniques, such as use of printk(), can cause
response-time problems even in non-realtime environments.</p>

<li>Hardware and Software Configurations</li>

<p>Is SMP required?  If so, how many CPUs?  How many tasks?  How many
disks?  How many HBAs?</p>

<p>If all the code in the kernel were O(1), it might not matter, but
the Linux kernel has not yet reached this goal, and perhaps never
will completely reach it.  Therefore, some applications may choose to
restrict the software or the hardware configuration of the platform in
order to meet the realtime deadlines.  This approach is consistent with
traditional RTOS methodology, as RTOS vendors have been known to restrict
the configurations in which they will support hard realtime guarantees.</p>

</ol>

<p>C.  LINUX REALTIME APPROACHES</p>

<p>The following general approaches to Linux realtime have been proposed,
along with many variations on each of these themes:</p>

<ol>

<li>non-CONFIG_PREEMPT</li>
<li>CONFIG_PREEMPT</li>
<li>CONFIG_PREEMPT_RT</li>
<li>Nested OS</li>
<li>Dual-OS/Dual-Core</li>
<li>Migration Between OSes</li>
<li>Migration Within OS</li>

</ol>

<p>Each of these general approaches is discussed in the following sections.
Each section ends with a brief (but perhaps controversial) summary of
the corresponding approach's strengths and weaknesses.  I do not address
"strength of community", even though this may well be the decisive factor.
After all, the technical comparison will provide sufficient flame-bait.
That said, if you are working on realtime extensions to Linux, you really
really should be posting regularly on LKML.  Yes, the resulting flames
can be painful at times, but a little heat is needed for a patchset to
get "well done" (sorry for the pun, but the point is nonetheless serious).</p>

<p>This document does not present measured comparisons among all of the
approaches, despite the fact that such comparisons would be extremely
useful.  The reason for this, aside from gross laziness, is that it is
wise to agree on the metrics beforehand.  Therefore, the comparisons
in this document are for the most part qualitative.  In some cases,
they are based on actual measurements, but these measurements were
taken by different people on different configurations using different
benchmarks.  This is a prime area for future improvement.</p>

<ol>

<li>non-CONFIG_PREEMPT</li>

<p>This is the stock kernel, without even preemption.  Why would -anyone-
think of using stock 2.6 for a realtime task?  Because some realtime
applications have very forgiving scheduling deadlines.  One project
I worked on in the early 1980s had 2-second response-time deadlines.
This was quite a challenge, given that it was running on a 4MHz Z80 CPU --
though, to be fair, the Z80 was accompanied by a hardware floating-point
processor that was able to compute a 32-bit floating-point multiply in
well under a millisecond.  Modern hardware running a stock Linux 2.6
kernel would have no problem with this application.  Hey, just having
32 address bits rather than only 16 would have helped a lot!</p>

<blockquote>

<p>a.      Quality of service: "soft realtime", with timeframe of 10s of
        milliseconds for most services.  Some I/O requests can take
        longer.  Provides full performance and scalability to both
        realtime and non-realtime applications.</p>

<p>b.      Amount of code that must be inspected to assure quality of service
        for a new feature: the entire kernel, every little bit of it,
        since the entire kernel runs with preemption disabled.</p>

<p>c.      API provided: POSIX with limited realtime extensions.
        Realtime and non-realtime applications can interact using
        the normal POSIX services.</p>

<p>d.      Relative complexity of OS and applications: everything is
        stock, and all the normal system calls operate as expected.</p>

<p>e.      Fault isolation: none.</p>

<p>f.      Hardware and software configurations supported: all of them.
        Larger hardware configurations and some device drivers can
        result in degraded response time.</p>

</blockquote>

<p>Strengths:  Simplicity and robustness.  "Good enough" realtime support
        for undemanding realtime applications.  Excellent performance
        and scalability for both realtime and non-realtime applications.
        Applications and administrators see a single OS instance.</p>

<p>Weaknesses:  Poor realtime response, need to inspect the entire kernel
        to find issues that degrade realtime response.</p>

<li>CONFIG_PREEMPT</li>

<p>The CONFIG_PREEMPT option renders much of the kernel code preemptible,
with the exception of spinlock critical sections, RCU read-side critical
sections, code with interrupts disabled, code that accesses per-CPU
variables, and other code that explicitly disables preemption.</p>

<blockquote>

<p>a.      Quality of service: "soft realtime", with timeframe of 100s of
        microseconds for task scheduling and interrupt handling, but
        -only- for very carefully restricted hardware configurations
        that exclude problematic devices and drivers (such as VGA)
        that can cause latency bumps of tens or even hundreds of
        milliseconds (-not- microseconds).  Furthermore, the software
        configuration of such systems must be carefully controlled,
        for example, doing a "kill -1" traverses the entire task list
        with tasklist_lock held (see kill_something_info()), which might
        result in disappointing latencies in systems with very large
        numbers of tasks.  System services providing I/O, networking,
        task creation, and VM manipulation can take much longer.  A very
        small performance penalty is exacted, since spinlocks and RCU
        must suppress preemption.</p>

<p>        Kristian Benoit and Karim Yaghmour measured CONFIG_PREEMPT
        at a maximum interrupt-response-time latency of about 555
        microseconds, see:</p>

<p>        <a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112086443319815&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112086443319815&amp;w=2</a></p>

<p>        The machine under test was a Dell PowerEdge SC420 with a P4
        2.8GHz CPU and 256MB RAM running a UP build of Fedora Core 3.</p>

<p>b.      Amount of code that must be inspected to assure quality of service
        for a new feature:</p>

<blockquote>

<p>        i.      The low-level interrupt-handling code.</p>

<p>        ii.     The process scheduler.</p>

<p>        iii.    Any code that disables interrupts, which includes
                all interrupt handlers, both hardware and softirq.</p>

<p>        iv.     Any code that disables preemption, including spinlock
                critical sections, RCU read-side critical sections,
                code with interrupts disabled, code that accesses
                per-CPU variables, and other code that explicitly
                disables preemption.</p>

<p>        v.      Any code that holds a lock, mutex, semaphore, or other
                resource that is needed by the code implementing your
                new feature, as well as the code that actually implements
                the lock, mutex, semaphore, or other resource.</p>

<p>        vi.     Any code that manipulates hardware that can stall the
                bus, delay interrupts, or otherwise interfere with
                forward progress.  Note that it is also necessary to
                inspect user-level code that directly manipulates such
                hardware.</p>

</blockquote>

<p>c.      API provided: POSIX with limited realtime extensions.</p>

<p>d.      Relative complexity of OS and applications: all the normal system
        calls operate as expected, so realtime and non-realtime processes
        can interact normally.</p>

<p>e.      Fault isolation: none.</p>

<p>f.      Hardware and software configurations supported: all of them.
        Larger hardware configurations and some device drivers can
        result in degraded response time.</p>

</blockquote>

<p>Strengths:  Simplicity.  Available now, even from distributions.
        Provides "good enough" realtime support for a large number
        of applications.  Applications and administrators see a
        single OS instance.</p>

<p>Weaknesses:  Limited testing, so that some robustness issues remain.
        Need to inspect large portions of the kernel in order
        to find issues that degrade realtime response.</p>

<li>CONFIG_PREEMPT_RT</li>

<p>The CONFIG_PREEMPT_RT patch by Ingo Molnar introduces additional
preemption, allowing most spinlock (now "mutex") critical sections,
RCU read-side critical sections, and interrupt handlers to be preempted.
Preemption of spinlock critical sections requires that priority
inheritance be added to prevent the "priority inversion" problem where
a low-priority task holding a lock is preempted by a medium-priority
task, while a high-priority task is blocked waiting on the lock.
The CONFIG_PREEMPT_RT patch addresses this via "priority inheritance",
where a task waiting on a lock "donates" its priority to the task holding
that lock, but only until it releases the lock.  In the example above,
the low-priority task would run at high priority until it released the
lock, preempting the medium-priority task, so that the high-priority
task gets the lock in a timely fashion.  Priority inheritance has been
used in a number of realtime OS environments over the past few decades,
so it is a well-tested concept.</p>
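<p>The three-task example above can be played out in a toy simulation.
This is a sketch under stated assumptions, not a model of the actual
patch: a single CPU, one lock, one-tick time steps, a made-up workload,
and the simplification that a task needing the lock counts as a waiter
from the moment it arrives.  The function name simulate() and the
workload numbers are hypothetical.</p>

```python
def simulate(priority_inheritance):
    """Return the tick at which the high-priority task H finishes.

    Made-up workload: L (priority 1) takes the lock at tick 0 and needs
    3 ticks inside the critical section; M (priority 2) arrives at tick 1
    and needs 5 ticks without the lock; H (priority 3) arrives at tick 1
    and needs the lock for 1 tick.
    """
    tasks = {
        "L": {"prio": 1, "arrive": 0, "work": 3, "needs_lock": True},
        "M": {"prio": 2, "arrive": 1, "work": 5, "needs_lock": False},
        "H": {"prio": 3, "arrive": 1, "work": 1, "needs_lock": True},
    }
    owner = None   # name of the task currently holding the lock
    tick = 0
    while tasks["H"]["work"] > 0:
        ready = [n for n, t in tasks.items()
                 if tick >= t["arrive"] and t["work"] > 0]
        # Effective priority starts out as the task's own priority.
        eff = {n: tasks[n]["prio"] for n in ready}
        if priority_inheritance and owner is not None:
            # Each waiter "donates" its priority to the lock holder.
            for n in ready:
                if tasks[n]["needs_lock"] and n != owner:
                    eff[owner] = max(eff[owner], eff[n])
        # A task is blocked if it needs the lock and someone else holds it.
        runnable = [n for n in ready
                    if not tasks[n]["needs_lock"] or owner in (None, n)]
        current = max(runnable, key=lambda n: eff[n])
        if tasks[current]["needs_lock"]:
            owner = current
        tasks[current]["work"] -= 1
        if owner == current and tasks[current]["work"] == 0:
            owner = None   # releasing the lock also ends the donation
        tick += 1
    return tick


if __name__ == "__main__":
    print("H finishes at tick", simulate(False), "without inheritance")
    print("H finishes at tick", simulate(True), "with inheritance")
```

<p>With this workload, the high-priority task finishes at tick 9 without
priority inheritance, because the medium-priority task runs to completion
first, and at tick 4 with it.</p>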

<p>One problem with priority inheritance is that it is difficult to implement
for reader-writer locks, where a high-priority writer might wish to
donate its high priority to a large number of low-priority readers.
The CONFIG_PREEMPT_RT patch addresses this by allowing only one task at
a time to read-acquire a reader-writer lock, although it is permitted
to do so recursively.  This can limit the scalability of reader-writer
locks, but one would not expect any change unless and until someone finds
a serious scalability limit that affects a significant fraction of
realtime users.</p>
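<p>The single-reader restriction can be illustrated with a toy,
non-blocking model in which acquisition calls return True or False
rather than sleeping.  This is a sketch only, not the patch's
implementation: the class SingleReaderRWLock and its method names are
hypothetical.  The point is that the lock has at most one owner at any
time, so a waiting writer has only one task to donate its priority to.</p>

```python
class SingleReaderRWLock:
    """Toy model: at most one owner, readers may re-acquire recursively."""

    def __init__(self):
        self.owner = None      # the one task holding the lock, if any
        self.depth = 0         # recursive read-acquisition count
        self.writing = False   # True while held for writing

    def read_acquire(self, task):
        if self.owner is None:
            self.owner, self.depth = task, 1
            return True
        if self.owner == task and not self.writing:
            self.depth += 1    # recursive read by the same task
            return True
        return False           # a different task would have to wait

    def write_acquire(self, task):
        if self.owner is None:
            self.owner, self.writing = task, True
            return True
        return False

    def release(self, task):
        if self.owner != task:
            raise ValueError("lock is not held by this task")
        if self.writing:
            self.writing = False
            self.owner = None
        else:
            self.depth -= 1
            if self.depth == 0:
                self.owner = None


if __name__ == "__main__":
    lock = SingleReaderRWLock()
    print(lock.read_acquire("A"), lock.read_acquire("A"),
          lock.read_acquire("B"))
```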

<p>Note that a few critical spinlocks remain non-preemptible, using the
"raw spinlock" implementation.</p>

<blockquote>

<p>a.      Quality of service: "soft realtime", with timeframe of a few 10s
        of microseconds for task scheduling and interrupt-handler entry.
        System services providing I/O, networking, task creation, and
        VM manipulation can take much longer, though some subsystems
        (e.g., ALSA) have been reworked to obtain good latencies.
        Since spinlocks are replaced by blocking mutexes, the performance
        penalty can be significant (up to 40%) for some system calls,
        but user-mode execution runs at full speed.  There is likely to
        be some performance penalty exacted from RCU, but, with luck,
        this penalty will be minimal.</p>

<p>        Kristian Benoit and Karim Yaghmour have run an impressive set of
        benchmarks comparing CONFIG_PREEMPT_RT with CONFIG_PREEMPT(?) and
        Ipipe, see the LKML threads starting with:</p>

<p>        1. <a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111846495403131&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111846495403131&amp;w=2</a><br />
        2. <a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111928813818151&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111928813818151&amp;w=2</a><br />
        3. <a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112008491422956&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112008491422956&amp;w=2</a><br />
        4. <a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112086443319815&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112086443319815&amp;w=2</a></p>

<p>        This last run put CONFIG_PREEMPT_RT at about 70 microseconds
        interrupt-response-time latency.  The machine under test was a
        Dell PowerEdge SC420 with a P4 2.8GHz CPU and 256MB RAM running
        a UP build of Fedora Core 3.</p>

<p>b.      Amount of code that must be inspected to assure quality of service
        for a new feature:</p>

<blockquote>

<p>        i.      The low-level interrupt-handling code.</p>

<p>        ii.     The process scheduler.</p>

<p>        iii.    Any code that disables interrupts, but -not- including
                interrupt handlers, which now run in process context.</p>

<p>        iv.     Any code that disables preemption, including raw-spinlock
                critical sections, code with interrupts disabled, code
                that accesses per-CPU variables, and other code that
                explicitly disables preemption.</p>

<p>        v.      Any code that holds a lock, mutex, semaphore, or other
                resource that is needed by the code implementing your
                new feature, as well as the code that actually implements
                the lock, mutex, semaphore, or other resource.</p>

<p>        vi.     Any code that manipulates hardware that can stall the
                bus, delay interrupts, or otherwise interfere with
                forward progress.  Note that it is also necessary to
                inspect user-level code that directly manipulates such
                hardware.</p>

</blockquote>

<p>c.      API provided: POSIX with limited realtime extensions.</p>

<p>d.      Relative complexity of OS and applications: all the normal system
        calls operate as expected, so realtime and non-realtime processes
        can interact normally.</p>

<p>e.      Fault isolation: none.</p>

<p>f.      Hardware and software configurations supported: most of them.
        SMP support is a bit rough, and a number of drivers have not yet
        been upgraded to work properly in the CONFIG_PREEMPT_RT environment.
        It is likely that larger hardware configurations and some device
        drivers can result in degraded scheduling latency, but given that
        normal spinlocks are now preemptible, this effect should be much
        less of an issue than for CONFIG_PREEMPT.</p>

</blockquote>

<p>Strengths:  Excellent scheduling latencies, potential for hard
        realtime for some services (e.g., user-mode execution) in
        some configurations.  A number of aspects of this approach
        might be incrementally added to Linux (e.g., priority
        inheritance for semaphores to prevent semaphore priority
        inversion, see "other aspects of realtime" for more discussion
        of this).  Applications and administrators see a single
        OS instance.</p>

<p>Weaknesses:  Limited testing, so that robustness issues remain.
        Large patch to Linux (~31K lines of context diff as of
        V0.7.51-23).  Both realtime and non-realtime applications pay
        performance and scalability penalties for the realtime service.</p>

<li>Nested OS</li>

<p>The Linux instance runs as a user process in an enclosing RTOS.  Realtime
service is provided by the RTOS, and a richer set of non-realtime services
is provided by the Linux instance.  Note that there is considerable
variety in RTOSes, and this section defines this term in its broadest
possible meaning, including full OSes, hypervisors, nanokernels, and
interrupt pipelines.  At some point, it may make sense to split this
section based on the type of the enclosing "OS", but there does not
seem to be much reason to break it up at this point.</p>

<blockquote>

<p>a.      Quality of service: hard realtime, with timeframe of about 10
        microseconds for services provided by the underlying RTOS.
        More complex services (I/O, task creation, and so on) will
        likely take longer to execute, which may impose a significant
        performance and scalability penalty.</p>

<p>        Philippe Gerum's interrupt-pipeline layer, named Ipipe, is an
        example of an extreme case of a minimal RTOS.  Kristian Benoit
        and Karim Yaghmour measured Ipipe's CONFIG_PREEMPT at a maximum
        interrupt-response-time latency of about 50 microseconds, see:</p>

<p>        <a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112086443319815&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112086443319815&amp;w=2</a></p>

<p>        This result was the best of the three alternatives tested
        (CONFIG_PREEMPT, CONFIG_PREEMPT_RT, and Ipipe in conjunction
        with Linux 2.6.12).  It is believed that hardware limitations
        prevent much improvement in this result.</p>

<p>        The machine under test was a Dell PowerEdge SC420 with a P4
        2.8GHz CPU and 256MB RAM running a UP build of Fedora Core 3.</p>

<p>b.      Amount of code that must be inspected to assure quality of service
        for a new feature:</p>

<blockquote>

<p>        i.      All of the RTOS.  One would strive to keep the RTOS
                quite small, but the greater the number of realtime
                services provided, the larger the RTOS must be.</p>

<p>        ii.     Any Linux-kernel code that disables interrupts.
                Note that in many implementations, the Linux kernel
                will be prevented from disabling interrupts, since
                any attempt to disable interrupts will trap into
                the RTOS.</p>

<p>                If the Linux kernel runs in privileged mode, however,
                all bets are off.  In this case, special care must be
                used to avoid disabling the real hardware interrupts,
                including such disabling within any kernel modules
                that might be loaded.</p>

<p>        iii.    Any code that manipulates hardware that can stall the
                bus, delay interrupts, or otherwise interfere with
                forward progress.  Note that it is also necessary to
                inspect user-level code that directly manipulates such
                hardware.</p>

</blockquote>

<p>c.      API provided: Whatever the RTOS wants to provide, often a
        subset of POSIX with realtime extensions.</p>

<p>d.      Relative complexity of OS and applications: there are now two
        operating systems, both of which must be configured and
        administered.  Applications that contain both realtime and
        non-realtime components must be explicitly aware of both OS
        instances, and of their respective APIs.</p>

<p>e.      Fault isolation: the following faults may propagate from the
        Linux OS to the underlying RTOS, or not, depending on the
        implementation:</p>

<blockquote>

<p>        i.      Excessive disabling of interrupts, if the Linux instance
                is permitted to disable them (hopefully not).</p>

<p>        ii.     Memory corruption, if the Linux instance is given direct
                access to the hardware MMU or to DMA-capable I/O devices.</p>

</blockquote>

<p>f.      Hardware and software configurations supported: depends on
        the implementation; however, there are products with this
        architecture that support SMP and a reasonable variety of devices.
        Note that supporting a large variety of devices either requires
        that this support be present in the RTOS, or that Linux be
        granted access to the devices.  In the latter case, Linux will
        likely have the ability to DMA over the top of the RTOS.</p>

</blockquote>

<p>Strengths:  Excellent scheduling latencies.  Hard-realtime support for
        some services in some configurations.  Reasonable fault isolation
        for some implementations.  Well-tested and robust implementations
        are available (I-pipe, L4Linux, RT-Linux, ...).</p>

<p>Weaknesses:  Realtime application software must deal with two separate
        OS instances and their respective APIs, with explicit
        communication.  Administrators must deal with two OS instances.
        Non-realtime applications are likely to suffer significant
        performance and scalability penalties.</p>

<li>Dual-OS/Dual-Core</li>

<p>Linux and RTOS instances run side-by-side on different CPUs in the
same system.  The CPUs might be different physical CPUs, different
hardware threads in the same CPU, or different virtual CPUs provided by
a virtualizing layer, such as Xen.  The two instances might or might not share
memory, and, if they do share memory, there might or might not be hardware
protection to prevent one OS from overwriting the other OS's memory.</p>

<blockquote>

<p>a.      Quality of service: hard realtime, with timeframe of about 10
        microseconds for services provided by the RTOS.  Extremely simple
        polling-loop "RTOSes" could potentially provide sub-microsecond
        latencies.  More complex services (I/O, task creation, and so on)
        will likely take longer to execute.  Since the Linux instance
        runs on a separate core, there need not be any performance or
        scalability penalty for non-realtime tasks.</p>

<p>b.      Amount of code that must be inspected to assure quality of service
        for a new feature: all of the RTOS, but only the RTOS.  One would
        strive to keep the RTOS quite small, but the greater the number
        of realtime services provided, the larger the RTOS must be.</p>

<p>        One important exception: if the RTOS and the Linux kernel access
        a shared hardware device (including memory!), it may be possible
        for Linux accesses to that hardware device to stall the RTOS.</p>

<p>c.      API provided: Whatever the RTOS wants to provide, often a
        subset of POSIX with realtime extensions.</p>

<p>d.      Relative complexity of OS and applications: there are now two
        operating systems, both of which must be configured and
        administered.  Applications that contain both realtime and
        non-realtime components must be explicitly aware of both
        OS instances and APIs, and must also be aware of whatever
        hardware facility is used to communicate between the realtime
        and non-realtime CPUs.</p>

<p>e.      Fault isolation: the following faults may propagate from the
        Linux OS to the underlying RTOS, or not, depending on the
        implementation:</p>

<blockquote>

<p>        i.      Memory corruption, but only if the Linux instance is given
                direct access to the RTOS's memory or to DMA-capable
                I/O devices that can access the RTOS's memory.</p>

</blockquote>

<p>f.      Hardware and software configurations supported: depends on
        the implementation; however, there are products based on this
        approach that support SMP and a reasonable variety of devices.</p>

</blockquote>

<p>Strengths:  Best possible scheduling latencies with the hardest reasonable
        realtime -- just as good as bare metal in some implementations.
        Best possible fault isolation for some implementations.
        Well-tested and robust implementations are available.
        Linux can be used as is, so full performance and scalability
        can be provided to non-realtime tasks.</p>

<p>Weaknesses:  Realtime application software must deal with two separate
        OS instances, with explicit communication.  Administrators must
        deal with two OS instances.  "RTOSes" that provide the best
        latencies offer the least services -- in extreme cases, the only
        service is execution of raw code on bare metal.  The pair of cores
        will be more expensive than a single core, though one might
        use virtualization to emulate the two CPUs.</p>

<li>Migration Between OSes</li>

<p>A Linux instance and an RTOS instance run side-by-side in the same system.  The two
OSes might run on different physical CPUs, different hardware threads
in the same CPU, different virtual CPUs provided by a virtualizing
layer like Xen, or alternatively, the two OSes might use some sort
of interrupt-pipeline scheme (such as Adeos) to share a single CPU.</p>

<p>However, applications see a single unified environment.  Applications
run on the RTOS, but the RTOS provides Linux-compatible system calls and
memory layout.  If the application invokes a non-realtime system call,
the task is transparently migrated to the Linux OS instance for the
duration of that system call.  This differs from the other dual-OS
approaches, where the applications must be explicitly aware of the
different OSes.</p>

<p>At this writing, it appears that the two instances need to share memory,
since tasks can migrate from one OS to the other.</p>

<blockquote>

<p>a.      Quality of service: hard realtime, with timeframe of about 10
        microseconds for services provided by the RTOS.  More complex
        services (I/O, task creation, and so on) will likely take longer
        to execute.  It is also possible for tasks to be "trapped"
        in the Linux instance, for example, if they are sleeping, but
        have not yet been given a chance to respond to some event that
        should wake them up.  The performance and scalability penalties
        to non-realtime tasks can be expected to depend on the amount
        of protection provided for realtime tasks against non-realtime
        misbehavior -- the greater the protection, the greater the
        expected penalty.  It may be possible to provide hardware
        support to improve this tradeoff.</p>

<p>b.      Amount of code that must be inspected to assure quality of service
        by a new feature:</p>

<blockquote>

<p>        i.      All of the RTOS.  One would strive to keep the RTOS
                quite small, but the greater the number of realtime
                services provided, the larger the RTOS must be.</p>

<p>        ii.     Any Linux-kernel code that disables interrupts.
                Note that in many implementations, the Linux kernel
                will be prevented from disabling interrupts, since
                any attempt to disable interrupts will trap into
                the RTOS or into the underlying software/firmware
                layer (e.g., Xen or Adeos).</p>

<p>                If the Linux kernel runs in privileged mode, however,
                all bets are off.  In this case, special care must be
                used to avoid disabling the real hardware interrupts,
                including such disabling within any kernel modules
                that might be loaded.</p>

<p>        iii.    Any code that manipulates hardware that can stall the
                bus, delay interrupts, or otherwise interfere with
                forward progress.  Note that it is also necessary to
                inspect user-level code that directly manipulates such
                hardware.</p>

<p>        iv.     Any Linux code that manipulates a data structure that
                the RTOS accesses.  If the Linux and RTOS code
                share any sort of lock, then all critical sections
                of that lock must be inspected, as must the implementation
                of the lock itself.  The same is true of any shared mutex,
                shared semaphore, or other shared resource.</p>

</blockquote>

<p>c.      API provided: Full POSIX with realtime extensions.  Anytime
        a task running in the context of the RTOS attempts to execute
        a non-realtime system call, it is migrated to the Linux instance.</p>

<p>d.      Relative complexity of OS and applications: there are now two
        operating systems, both of which must be configured and
        administered.  However, applications can be written as if
        there were only one OS instance that provided the full set
        of services, some realtime and some not.</p>

<p>e.      Fault isolation: the following faults may propagate from the
        Linux OS to the underlying RTOS, or not, depending on the
        implementation:</p>

<blockquote>

<p>        i.      Excessive disabling of interrupts, if the Linux OS
                is permitted to disable hardware interrupts (hopefully
                not, though preventing this may require special hardware).</p>

<p>        ii.     Memory corruption, either due to wild pointer or
                via wild DMA.</p>

</blockquote>

<p>f.      Hardware and software configurations supported: depends on
        the implementation; however, it is reasonable to believe that
        SMP and a reasonable variety of devices could be supported.
        Note that supporting a large variety of devices either requires
        that this support be present in the RTOS, or that Linux be
        granted access to the devices.  In the latter case, Linux will
        likely have the ability to DMA over the RTOS.</p>

</blockquote>

<p>Strengths:  Excellent scheduling latencies.  Hard-realtime support for
        some services in some configurations.  Applications see a
        single OS.</p>

<p>Weaknesses:  Administrators must deal with two OS instances.
        The two OSes will be extremely sensitive to each other's
        version and patch level, since they access each other's
        data structures.</p>

<li>Migration Within OS</li>

<p>A Linux instance runs on multiple CPUs, either different physical CPUs,
different hardware threads in the same CPU, or different virtual CPUs
provided by a virtualizing layer such as Xen.  Some (but not all!) of
the CPUs are designated as realtime CPUs.  If a task running on a
realtime CPU executes a trap or system call that contains non-deterministic
code sequences, the task is migrated to a non-realtime CPU to complete
execution of the trap or system call, then migrated back.  This prevents
any non-realtime execution of a given realtime task from interfering
with that of other realtime tasks.</p>

<p>Interrupts can be directed away from realtime CPUs.  Such interrupt
redirection is supported on a few architectures, and has in fact been
used for realtime support since at least the 2.4 kernel.</p>
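
<p>Interrupt redirection of this kind is usually expressed as a CPU bitmask.
On stock Linux, for instance, each IRQ has an affinity mask in
/proc/irq/N/smp_affinity, written as a hexadecimal CPU bitmap.  The C sketch
below only computes and formats such a mask.  The helper names
non_realtime_mask() and format_mask() are made up for illustration, and a
project such as ARTiS may well use its own mechanism rather than this
interface.</p>

<pre><![CDATA[
```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/*
 * Return a bitmask with one bit set for every CPU in [0, ncpus) that is
 * NOT marked in rt_mask.  Bit i corresponds to CPU i.
 */
unsigned long non_realtime_mask(unsigned int ncpus, unsigned long rt_mask)
{
    unsigned int bits = CHAR_BIT * sizeof(unsigned long);
    unsigned long all, mask;

    assert(ncpus >= 1 && ncpus <= bits);
    all = (ncpus == bits) ? ~0UL : (1UL << ncpus) - 1;
    mask = all & ~rt_mask;
    /* Some, but not all, CPUs may be realtime: one must be left over. */
    assert(mask != 0);
    return mask;
}

/* Format the mask in hexadecimal, the notation smp_affinity uses. */
int format_mask(char *buf, size_t len, unsigned long mask)
{
    return snprintf(buf, len, "%lx", mask);
}
```
]]></pre>

<p>With eight CPUs, of which CPUs 6 and 7 are realtime (rt_mask 0xc0), the
resulting mask is 3f, meaning that interrupts would be steered to CPUs 0
through 5.</p>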

<blockquote>

<p>a.      Quality of service: ~40 microseconds for ARTiS, with restricted
        hard/firm realtime supported for user-mode execution.  More
        complex services (I/O, task creation, and so on) will likely take
        longer to execute.  It is also possible for tasks to be "trapped"
        on the non-realtime CPUs, for example, if they are sleeping,
        but have not yet been given a chance to respond to some event
        that should wake them up.  Since a stock non-CONFIG_PREEMPT Linux
        may be used, there need be no performance or scalability penalty
        for non-realtime tasks, nor for realtime tasks that execute only
        realtime operations.  There can be a significant migration penalty
        when realtime tasks frequently execute non-realtime operations.</p>

<p>b.      Amount of code that must be inspected to assure quality of service
        by a new feature:</p>

<blockquote>

<p>        i.      Any part of the Linux kernel that is permitted to execute
                on the realtime CPUs.  This would normally be only the
                realtime portions of the scheduler and the low-level
                interrupt and trap handling code (the actual interrupts
                and traps would be migrated, if necessary).</p>

<p>        ii.     Any critical section of any lock acquired by the portion
                of the Linux kernel that is permitted to execute on the
                realtime CPUs.</p>

<p>        iii.    Any code that manipulates hardware that can stall the
                bus, delay interrupts, or otherwise interfere with
                forward progress, but only if that hardware can affect
                or is used by both the realtime and the non-realtime
                CPUs.</p>

<p>                That said, note that it is also necessary to inspect
                user-level code that directly manipulates such hardware.</p>

</blockquote>

<p>c.      API provided: Full POSIX with realtime extensions.</p>

<p>d.      Relative complexity of OS and applications: There is but
        one OS, though it has a bit of added complexity due to the
        migration capability.  Applications see only one OS.</p>

<p>e.      Fault isolation: the following faults may propagate from the
        non-realtime CPUs to the realtime CPUs:</p>

<blockquote>

<p>        i.      Holding a lock, mutex, or semaphore for too long, when
                that resource must be acquired by code that is permitted
                to run on the realtime CPUs.</p>

<p>        ii.     Memory corruption, either due to wild pointer or
                via wild DMA.</p>

</blockquote>

<p>f.      Hardware and software configurations supported: all configurations,
        though single-CPU systems must have some sort of virtualizing
        facility so that the OS sees at least two virtual CPUs.</p>

</blockquote>

<p>Strengths:  Excellent scheduling latencies.  Hard-realtime support for
        some services in some configurations.  Applications and
        administrators see a single OS and API.  Full performance and
        scalability for non-realtime and for pure-realtime tasks.</p>

<p>Weaknesses:  Migration overhead.  Requires multiple CPUs, either real or
        virtual.</p>

</ol>

<p>D.  OTHER ASPECTS OF REALTIME</p>

<p>1.      PRIORITY INVERSION PROBLEM STATEMENT<br />
2.      PRIORITY INVERSION SOLUTIONS<br />
3.      PRIORITY INVERSION AND PTHREADS</p>

<ol>

<li>PRIORITY INVERSION PROBLEM STATEMENT</li>

<p>Priority inversion is a situation where a low-priority thread is holding
a resource that a high-priority task needs.  Priority inversion can
result in indefinite delay of the high-priority task, so it is fatal for
realtime applications, and, in extreme cases, can be intolerable even
for non-realtime applications.</p>

<p>To see how priority inversion can happen, consider the following sequence
of events:</p>

<blockquote>

<p>a.      Low-priority thread A acquires a pthread_mutex.</p>

<p>b.      Medium-priority thread B starts executing CPU-bound, preempting
        thread A.</p>

<p>c.      High-priority thread C attempts to acquire the pthread_mutex,
        but is blocked because A holds it.</p>

</blockquote>

<p>Suppose that thread B is a realtime thread and that it will execute
CPU-bound indefinitely.  Since it is a realtime thread, its priority
will never age down, so low-priority thread A will never get to execute.
Thread A will therefore never release the pthread_mutex, so high-priority
thread C will never be able to proceed.  This situation is fatal for
realtime systems, and can be literally so if thread C is controlling
a life-support system.</p>
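
<p>The scenario above can be reproduced with a small single-CPU simulation.
This is a sketch under simplifying assumptions, not a model of any real
scheduler: time advances in whole ticks, the highest-priority runnable task
always runs, thread A needs three more ticks to leave its critical section,
and thread C needs one tick once it owns the mutex.  The function name
simulate() and these tick counts are invented for the example.  The inherit
flag switches on the priority-boosting remedy so that the two outcomes can
be compared.</p>

<pre><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

enum { TASK_A, TASK_B, TASK_C, NTASKS };

struct task {
    int prio;        /* base priority; larger means more important */
    int work_left;   /* CPU ticks still needed; -1 means CPU-bound forever */
    bool blocked;    /* true while waiting for the mutex */
};

/*
 * Simulate the scenario tick by tick on one CPU.  Returns the tick at
 * which thread C finishes, or -1 if it has not finished after max_ticks.
 */
int simulate(bool inherit, int max_ticks)
{
    struct task t[NTASKS] = {
        [TASK_A] = { 1, 3, false },   /* holds the mutex, 3 ticks to go */
        [TASK_B] = { 2, -1, false },  /* CPU-bound, never blocks */
        [TASK_C] = { 3, 1, true },    /* blocked on the mutex */
    };
    int owner = TASK_A;

    assert(max_ticks > 0);
    for (int tick = 1; tick <= max_ticks; tick++) {
        int best = -1, best_prio = -1;

        /* Pick the runnable task with the highest effective priority. */
        for (int i = 0; i < NTASKS; i++) {
            int prio = t[i].prio;

            if (t[i].blocked || t[i].work_left == 0)
                continue;
            /* With inheritance, the owner borrows C's priority. */
            if (inherit && i == owner && t[TASK_C].blocked &&
                t[TASK_C].prio > prio)
                prio = t[TASK_C].prio;
            if (prio > best_prio) {
                best = i;
                best_prio = prio;
            }
        }
        if (best < 0)
            break;                      /* nothing left to run */
        if (t[best].work_left > 0)
            t[best].work_left--;
        if (best == TASK_A && t[TASK_A].work_left == 0) {
            owner = TASK_C;             /* A releases, C acquires */
            t[TASK_C].blocked = false;
        }
        if (best == TASK_C && t[TASK_C].work_left == 0)
            return tick;
    }
    return -1;
}
```
]]></pre>

<p>Under these assumptions, simulate(false, 1000) returns -1 because thread B
monopolizes the CPU, whereas simulate(true, 1000) returns 4: thread A runs
for three ticks at boosted priority, releases the mutex, and thread C
completes on the fourth tick.</p>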

<p>Note that although this example used a pthread_mutex, many other types
of resources can be involved in a priority-inversion situation.  For
a second example, consider the following sequence of events:</p>

<blockquote>

<p>a.      Low-priority task A holds a large block of memory, which
        it is about to free up.</p>

<p>b.      Medium-priority task B starts executing CPU-bound, preempting
        task A.</p>

<p>c.      High-priority task C attempts to allocate some memory, but
        is blocked because the system is short on memory, and A has
        not yet freed up its large block.</p>

</blockquote>

<p>This is a different type of resource, but the result is very similar.  The
problem is not limited to mutexes and memory; other types of resources that
can be involved in priority inversion include:</p>

<blockquote>

<p>a.      Communications packets.  Low-priority task A is prevented
        from transmitting by medium-priority task B, thereby blocking
        high-priority task C, which needs to receive the packet that
        task A is being prevented from sending.  With protocols such
        as TCP/IP, the priority inversion can span multiple systems:
        for example, tasks A and B might be on one system and task C
        on another system on the same LAN.</p>

<p>b.      Signals and/or events.  Low-priority task A is prevented from
        posting by medium-priority task B, thereby blocking
        high-priority task C, which needs to receive the signal/event
        that task A is being prevented from sending.</p>

<p>c.      File data.  Low-priority task A is prevented from writing out
        data to a file by task B, thereby blocking task C, which needs
        this data in order to proceed with its own processing.</p>

</blockquote>

<p>The hard cold fact is that pretty much any resource that can cause a
task to block can be involved in a priority inversion situation.</p>

<li>PRIORITY INVERSION SOLUTIONS</li>

<p>There are a number of ways of preventing priority inversion:</p>

<p>a.      Disable preemption while a resource is held.<br />
b.      Forbid resources to be acquired by tasks of different priorities.<br />
c.      Priority inheritance.</p>

<p>These are each covered in the following sections.</p>

<blockquote>

<p>a.  Disable preemption while a resource is held.</p>

<p>A simple but effective way to prevent priority inversion is to
disable preemption during the time that the resource is held.
This works very well for some sorts of resources, particularly
locks.  The CONFIG_PREEMPT option in the Linux kernel uses this for
all spinlocks and also for RCU read-side critical sections.  However,
this approach is impractical for resources that may be held while
blocked, such as sema_t sleeplocks, memory, and communications,
the latter of which might involve memory allocation, which might
block if the system is low on memory.</p>

<p>Even where disabling preemption does work well, it can degrade
scheduling latencies.  Since a major goal of extreme realtime
support is to -reduce- scheduling latencies, other approaches are
needed.</p>

<p>b.  Forbid resources to be acquired by tasks of different priorities.</p>

<p>The "diamond-hard" realtime approach is to simply prohibit tasks of
different priorities from sharing any blocking resources.  This is
simple in principle, but can become quite complex in practice.  In some
cases, non-blocking mechanisms can be used, such as asynchronous I/O
or non-blocking synchronization.  However, although non-blocking mechanisms
can prevent the high-priority task from blocking, they are of no
help if the high-priority task really needs the information held
by the low-priority task.  In such cases, it may be necessary to
dynamically adjust priorities, perhaps via schemes such as deadline
scheduling.</p>

<p>There is a huge body of literature on realtime scheduling mechanisms at
all levels of complexity and effectiveness, which cannot be reproduced
here.  However, a conceptually simple approach would be to increase the
priority of "supplier" tasks so that "consumer" tasks get what they
need when they need it.  If this is automated, it is called "priority
inheritance".</p>

<p>c.  Priority inheritance.</p>

<p>With priority inheritance, the holder of a given resource is temporarily
boosted to the maximum priority of all tasks waiting for that
resource.  This temporary priority-boost is removed as soon as the resource
is released.</p>

<p>Of course, there can be complications.  For example, a given low-priority
task might be holding multiple locks, each of which is being waited on
by a different high-priority task.  While the low-priority task holds all of
these locks, its priority is boosted to that of the highest-priority
task waiting on any of the locks, but when it releases one of the locks,
it might be necessary to decrease (but not eliminate) the boost to allow
for the smaller set of high-priority tasks still waiting.</p>

<p>Another complication is "transitivity", where a low-priority task A
holds one lock needed by medium-priority task B, which in turn holds
a second lock needed by high-priority task C.  In this case, task A
needs to inherit task C's priority in a transitive manner through
both of the locks.  Such a priority inheritance chain could be
arbitrarily long.</p>
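
<p>Both complications, multiple held locks and transitive chains, fall out
of one recursive definition: a task's effective priority is the maximum of
its own priority and the effective priorities of all tasks blocked on locks
that it owns.  The C sketch below illustrates that definition only.  The
struct layouts and the names effective_prio() and hand_over() are
hypothetical, the recomputation from scratch is far less efficient than
what a real kernel would do, and the recursion assumes that there is no
deadlock cycle among the tasks.</p>

<pre><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

struct task;

struct lock {
    struct task *owner;        /* task currently holding the lock */
};

struct task {
    int prio;                  /* base priority; larger is more important */
    struct lock *blocked_on;   /* lock being waited for, or NULL */
};

/*
 * Effective priority of t: its base priority, raised to the effective
 * priority of any task in all[0..n) that waits on a lock owned by t.
 * The recursion follows inheritance chains of arbitrary length.
 */
int effective_prio(const struct task *t, const struct task *all, size_t n)
{
    int prio;

    assert(t != NULL && all != NULL);
    prio = t->prio;
    for (size_t i = 0; i < n; i++) {
        const struct task *w = &all[i];

        if (w != t && w->blocked_on != NULL && w->blocked_on->owner == t) {
            int p = effective_prio(w, all, n);

            if (p > prio)
                prio = p;
        }
    }
    return prio;
}

/* Release lock l by handing it to waiter w, which stops being blocked. */
void hand_over(struct lock *l, struct task *w)
{
    assert(w->blocked_on == l);
    l->owner = w;
    w->blocked_on = NULL;
}
```
]]></pre>

<p>If a priority-1 task owns two locks awaited by tasks of priority 5 and 9,
its effective priority is 9.  After it hands the second lock to the
priority-9 waiter, the value drops to 5 rather than to 1, which is the
"decrease but not eliminate" case described above.</p>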

<p>Furthermore, avoiding blocking does not necessarily make the underlying
problem go away.  For example, suppose that the high-priority task were
executing the following loop:</p>

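<pre>        for (;;) {
                /* Poll for the lock instead of blocking on it. */
                if (spin_trylock(&amp;my_mutex))
                        break;  /* lock acquired */
                /* Not acquired: sleep for about 10 ms, then retry. */
                set_current_state(TASK_UNINTERRUPTIBLE);
                schedule_timeout(HZ / 100);
        }</pre>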

<p>The standard priority-inheritance mechanisms would not understand the
need to priority boost in this case.  But suppose that they did.  Then
what would they make of the following code?</p>

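<pre>        for (;;) {
                /* Poll for the lock instead of blocking on it. */
                if (spin_trylock(&amp;my_mutex))
                        break;  /* lock acquired */
                /* Occasionally give up without ever getting the lock. */
                if ((random() &amp; 0xfff) == 0)
                        break;
                set_current_state(TASK_UNINTERRUPTIBLE);
                schedule_timeout(HZ / 100);
        }</pre>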

<p>How is the priority-inheritance mechanism going to figure out that it
should remove the priority boost when the high-priority task breaks
out of the loop?</p>

<p>Despite such complications, priority inheritance works reasonably
well for exclusive locks, and is a major component of Ingo Molnar's
CONFIG_PREEMPT_RT patch.  There are strongly held opinions both for and
against priority inheritance, for example:</p>

<p><a href="http://www.linuxdevices.com/articles/AT7168794919.html">http://www.linuxdevices.com/articles/AT7168794919.html</a></p>

<p>in which Victor Yodaiken considers priority inheritance to be
harmful, and, as near as I can tell, soft realtime to be irrelevant.
Doug Locke posted a rebuttal at:</p>

<p><a href="http://www.linuxdevices.com/articles/AT5698775833.html">http://www.linuxdevices.com/articles/AT5698775833.html</a></p>

<p>The big advantage of priority inheritance is that it is simple for
its users.  Use of priority inheritance does degrade scheduling
latency compared to a carefully hand-crafted solution, and priority
inheritance's implementation is difficult for reader-writer locks,
to say nothing of memory allocation or communications primitives.</p>

<p>Nevertheless, priority inheritance does seem to have a significant
role to play in mainstream "metal hard" realtime.  It is not perfect,
but, then again, what is?</p>

</blockquote>

<li>PRIORITY INVERSION AND PTHREADS</li>

<p>Inaky Perez-Gonzalez's "fusyn" project is intended to bring priority
inheritance to user-level pthread_mutex primitives, although it (perhaps
wisely) leaves reader-writer primitives alone.  More information on
fusyn may be found at the following web sites and LKML threads:</p>

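<p>        <a href="http://developer.osdl.org/dev/robustmutexes/fusyn/20040510">http://developer.osdl.org/dev/robustmutexes/fusyn/20040510</a><br />
        <a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111362457509145&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111362457509145&amp;w=2</a><br />
        <a href="http://marc.theaimsgroup.com/?t=111333601400001&amp;r=1&amp;w=2">http://marc.theaimsgroup.com/?t=111333601400001&amp;r=1&amp;w=2</a></p>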

<p>Interestingly enough, the complexity of pthread_mutex priority
inheritance depends strongly on the threading model in use.  Linux
NPTL uses a 1:1 threading model, so that each user-visible pthread
has its own kernel task.  In this threading model, priority inheritance
can be carried out entirely by the Linux kernel, since all pthreads
are visible to it.</p>
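
<p>From the application's side, POSIX already defines how a program asks for
a priority-inheritance mutex: the protocol attribute is set to
PTHREAD_PRIO_INHERIT before the mutex is initialized.  The C sketch below
uses only that standard interface and not the fusyn patches themselves.
Whether the request succeeds depends on the C library and kernel in use, so
the hypothetical helper init_pi_mutex() simply reports the error code (for
example ENOTSUP) instead of assuming support.</p>

<pre><![CDATA[
```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Initialize *m as a priority-inheritance mutex.  Returns 0 on success
 * or an errno value (for example ENOTSUP) if the request is refused.
 */
int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc;

    assert(m != NULL);
    rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    /* Ask that the owner inherit the priority of blocked waiters. */
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```
]]></pre>

<p>A mutex set up this way is locked and unlocked with the usual
pthread_mutex_lock() and pthread_mutex_unlock() calls, so the rest of the
application needs no changes.</p>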

<p>However, some pthreads implementers choose an m:n threading model, where
a user-level thread scheduler multiplexes multiple user-visible pthreads
onto a potentially smaller set of kernel tasks.  In this m:n case, the
Linux kernel has no idea which of multiple tasks it should priority-boost,
and it might well be that the pthread in need of a boost is currently
not assigned to a task.  Therefore, m:n priority boosting must involve
both the kernel and the user-level schedulers, making it quite complex
and fragile.</p>

<p>Therefore, use of 1:1 user-level thread scheduling is recommended in
the strongest possible terms.</p>

<p>Why use m:n user-level thread scheduling in the first place?  It turns
out that some applications benefit from the extremely efficient
user-level context switches that m:n scheduling provides.  However,
every optimization has its price, and the price of m:n user-level thread
scheduling becomes apparent in realtime systems.</p>

</ol>

<p>E.  SUMMARY</p>

<p>At this point, it does not appear that any one approach can be all things
to all realtime applications.  It is therefore too early to pick a winner.
Advocates of a given approach are therefore advised to concentrate their
energy on implementations of their favorite approach, rather than engaging
in flamewars with advocates of other approaches.  ;-)</p>

<p>After all, in the end, the approaches that best meet the needs of the
user community will win out.  In fact, given that the Linux community
has come up with no fewer than seven classes of solutions to a problem
that is commonly thought to be unsolvable, it seems quite reasonable
to expect that more classes of solutions will yet appear.</p>

<p>So, which of these approaches can be combined?  The first three can
be thought of as elaborations on the general preemption theme, and
can be combined with each of the remaining four.  The nested-OS and
dual-OS/dual-core ideas can be combined by having one of the OSes
on one of the cores have another OS nested within it.   The
dual-core/dual-OS approach can be combined with either of the
migration approaches, simply by having one of the cores implement
the migration approach.  It should be possible to combine the two
migration approaches, though it is not clear that this is useful.</p>

<p>Regardless of whether Linux's direction ends up being a single one of
these approaches, an as-yet-unknown approach, some combination, or one
of several approaches depending on the workload, realtime Linux looks
to remain an exciting area.</p>

<p>F.  RESOURCES</p>

<p>1.  General Discussion</p>

<p><a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111689227213061&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111689227213061&amp;w=2</a></p>

<p>        Spirited LKML debate on realtime Linux that inspired this
        document.</p>

<p><a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111846495403131&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111846495403131&amp;w=2</a><br />
<a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111928813818151&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111928813818151&amp;w=2</a><br />
<a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112008491422956&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112008491422956&amp;w=2</a><br />
<a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112086443319815&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112086443319815&amp;w=2</a></p>

<p>        Kristian Benoit's and Karim Yaghmour's realtime-latency
        measurement LKML threads.</p>

<p><a href="http://www.cs.utah.edu/~regehr/papers/hotos7/hotos7.html">http://www.cs.utah.edu/~regehr/papers/hotos7/hotos7.html</a><br />
<a href="http://www.rme-audio.de/english/techinfo/nforce4_tests.htm">http://www.rme-audio.de/english/techinfo/nforce4_tests.htm</a></p>

<p>        Description of how hardware latencies can impact response time.</p>

<p><a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111362457509145&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=111362457509145&amp;w=2</a><br />
<a href="http://marc.theaimsgroup.com/?t=111333601400001&amp;r=1&amp;w=2">http://marc.theaimsgroup.com/?t=111333601400001&amp;r=1&amp;w=2</a></p>

<p>        LKML discussions of "fusyn" priority-inheritance implementation
        of pthread_mutex.</p>

<p>2.  Example Realtime Approaches</p>

<p><a href="ftp://kernel.org/pub/linux/kernel/v2.6">ftp://kernel.org/pub/linux/kernel/v2.6</a></p>

<p>        Linux kernel source for non-CONFIG_PREEMPT and CONFIG_PREEMPT
        kernels.</p>

<p><a href="http://people.redhat.com/mingo/realtime-preempt/">http://people.redhat.com/mingo/realtime-preempt/</a></p>

<p>        Ingo Molnar's CONFIG_PREEMPT_RT patch.</p>

<p><a href="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112051169508144&amp;w=2">http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=112051169508144&amp;w=2</a></p>

<p>        Philippe Gerum's I-pipe patch 2.6.12-v0.9-00.  This is an
        example of the nested-OS approach, with I-pipe being an
        extreme example of a lightweight enclosing OS.</p>

<p><a href="http://download.gna.org/rtai/documentation/fusion/pdf/Life-with-Adeos.pdf">http://download.gna.org/rtai/documentation/fusion/pdf/Life-with-Adeos.pdf</a><br />
<a href="http://download.gna.org/rtai/documentation/fusion/pdf/Introduction-to-UVMs.pdf">http://download.gna.org/rtai/documentation/fusion/pdf/Introduction-to-UVMs.pdf</a><br />
<a href="http://download.gna.org/rtai/documentation/fusion/pdf/Native-API-Tour.pdf">http://download.gna.org/rtai/documentation/fusion/pdf/Native-API-Tour.pdf</a></p>

<p>        Documents describing Philippe Gerum's RTAI/fusion approach,
        which is an example of migration between OSes.</p>

<p><a href="http://www.lifl.fr/west/publi/MPSD04rtlws.pdf">http://www.lifl.fr/west/publi/MPSD04rtlws.pdf</a><br />
<a
href="http://lkml.org/lkml/2005/5/3/50">http://lkml.org/lkml/2005/5/3/50</a></p>

<p>    Paper describing ARTiS (Asymmetric RealTime SMP), an example
    of the migration-within-OS approach, along with
    an LKML posting of the corresponding Linux patch.
    Additional ARTiS publications may be found at <a
    href="http://www.lifl.fr/west/artis/">http://www.lifl.fr/west/artis/</a>.</p>


<p>ACKNOWLEDGEMENTS</p>

<p>This document was extracted from the emails and code of a large number
of people, including those listed below in alphabetic order.  Please
accept my apologies if I left you out, and please let me know of this
or any other error or omission so that I can generate the fix.</p>

<p>Andi Kleen,
Andrea Arcangeli,
Andrew Morton,
Bill Davidsen,
Bill Huey,
Brian O'Mahoney,
Chris Friesen,
Con Kolivas,
Daniel Walker,
Darren Hart,
David Lang,
Duncan Sands,
Elladan,
Eric Piel,
Esben Nielsen,
Gene Heskett,
Giuseppe Bilotta,
Hari N,
Henry Kingman,
Ingo Molnar,
James R Bruce,
John Alvord,
Jonathan Corbet,
K.R. Foley,
Karim Yaghmour,
Kristian Benoit,
Kusche Klaus,
Lee Revell,
Manas Saksena,
Marcelo Tosatti,
NZG,
Nick Piggin,
Nicolas Pitre,
Paul G. Allen,
Paulo Marques,
Peter Chubb,
Philippe Gerum,
Steven Rostedt,
Sven-Thorsten Dietrich,
Takashi Iwai,
Theodore Y Tso,
Thomas Gleixner,
Tim Bird,
Tom Vier,
Valdis Kletnieks,
William Lee Irwin III,
Zan Lynx,
Zwane Mwaikambo,
john cooper</p>

</quote>

<p></p>

</section>

<section
  title="Linux 2.6.13-rc2-mm2 Released"
  subject="2.6.13-rc2-mm2"
  archive="http://groups.google.com/group/fa.linux.kernel/msg/c06a2a063cf6d352?hl=en"
  posts="21"
  startdate="12 Jul 2005 01:17:24 -0800"
  enddate="14 Jul 2005 01:58:06 -0800"
>
<topic>Kernel Release Announcement</topic>

<p>Andrew Morton announced Linux version 2.6.13-rc2-mm2, saying:</p>

<quote who="Andrew Morton">

<p><a href="ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.13-rc2/2.6.13-rc2-mm2/">ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.13-rc2/2.6.13-rc2-mm2/</a></p>

<p>(And at <a
href="http://www.zip.com.au/~akpm/linux/patches/stuff/2.6.13-rc2-mm2.gz">http://www.zip.com.au/~akpm/linux/patches/stuff/2.6.13-rc2-mm2.gz</a>
- kernel.org mirroring is being slow again)</p>

<ul>

<li>MM updates</li>

<li>More video4linux updates</li>

<li>Infiniband feature work</li>

</ul>

</quote>

<p>Matthias Urlichs added that this was also available <quote
who="Matthias Urlichs">as a GIT archive (once the mirror has mirrored): <a
href="http://www.kernel.org/pub/scm/linux/kernel/git/smurf/v2.6.13-rc2-mm2.git/">http://www.kernel.org/pub/scm/linux/kernel/git/smurf/v2.6.13-rc2-mm2.git/</a>.
Suggestions for improvements welcome.</quote></p>

</section>

</kc>

