<?xml version="1.0" ?>

<kc>

<title>Kernel Traffic</title>

<author contact="mailto:zbrown@tumblerings.org">Zack Brown</author>

<issue num="298" date="06 Mar 2005 00:00:00 -0800" />

<mailbox-stats>
	<global-stats>
		<generated-at>Sun Mar  6 08:38:43 2005</generated-at>
		<first-message>2005/01/19 00:31:37</first-message>
		<last-message>2005/01/20 16:50:54</last-message>
		<totals>
			<n-messages>1994</n-messages>
			<total-size>11MB</total-size>
			<avg-size>6KB</avg-size>
			<n-writers>623</n-writers>
			<wrote-more-then-1-message>238</wrote-more-then-1-message>
			<n-lines>195450</n-lines>
			<header-size>106019</header-size>
			<n-user-agents>61</n-user-agents>
			<n-organisations>50</n-organisations>
			<n-toplevel-domains>35</n-toplevel-domains>
		</totals>
		<averages>
			<lines-per-message>98</lines-per-message>
			<lines-per-header>53</lines-per-header>
			<header-percent-of-message>54.24%</header-percent-of-message>
			<header-percent-of-total>43.07%</header-percent-of-total>
			<line-length>10898</line-length>
			<bits-per-byte>4.2465</bits-per-byte>
		</averages>
		<importance>
			<low>0.00%</low>
			<normal>0.10%</normal>
			<high>0.00%</high>
		</importance>

	</global-stats>
	<top-writers>
		<top-writer rank="1">
			<e-mail-addr>Tejun Heo</e-mail-addr>
			<n-messages>74</n-messages>
			<avg-size>8KB</avg-size>
			<total-size>556KB</total-size>
			<mostly-written-at>14:18</mostly-written-at>
		</top-writer>
		<top-writer rank="2">
			<e-mail-addr>Eric W. Biederman</e-mail-addr>
			<n-messages>66</n-messages>
			<avg-size>8KB</avg-size>
			<total-size>470KB</total-size>
			<mostly-written-at>00:14</mostly-written-at>
		</top-writer>
		<top-writer rank="3">
			<e-mail-addr>Bartlomiej Zolnierkiewicz</e-mail-addr>
			<n-messages>62</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>260KB</total-size>
			<mostly-written-at>08:14</mostly-written-at>
		</top-writer>
		<top-writer rank="4">
			<e-mail-addr>Adrian Bunk</e-mail-addr>
			<n-messages>53</n-messages>
			<avg-size>10KB</avg-size>
			<total-size>495KB</total-size>
			<mostly-written-at>12:08</mostly-written-at>
		</top-writer>
		<top-writer rank="5">
			<e-mail-addr>Greg KH</e-mail-addr>
			<n-messages>43</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>186KB</total-size>
			<mostly-written-at>14:00</mostly-written-at>
		</top-writer>
	</top-writers>
	<top-subjects>
		<top-subject rank="1">
			<subject>Patch 4/6  randomize the stack pointer</subject>
			<n-messages>63</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>303KB</total-size>
			<mostly-written-at>14:46</mostly-written-at>
		</top-subject>
		<top-subject rank="2">
			<subject>i8042 access timings</subject>
			<n-messages>44</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>213KB</total-size>
			<mostly-written-at>14:11</mostly-written-at>
		</top-subject>
		<top-subject rank="3">
			<subject>[PATCH] OpenBSD Networking-related randomization port</subject>
			<n-messages>39</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>228KB</total-size>
			<mostly-written-at>14:54</mostly-written-at>
		</top-subject>
		<top-subject rank="4">
			<subject>[PATCH] Dynamic tick, version 050127-1</subject>
			<n-messages>38</n-messages>
			<avg-size>7KB</avg-size>
			<total-size>245KB</total-size>
			<mostly-written-at>13:16</mostly-written-at>
		</top-subject>
		<top-subject rank="5">
			<subject>[PROPOSAL/PATCH] Remove PT_GNU_STACK support before 2.6.11</subject>
			<n-messages>36</n-messages>
			<avg-size>4KB</avg-size>
			<total-size>144KB</total-size>
			<mostly-written-at>14:29</mostly-written-at>
		</top-subject>
	</top-subjects>
	<top-receivers>
		<top-receiver rank="1">
			<e-mail-addr>linux-kernel@vger.kernel.org</e-mail-addr>
			<n-messages>341</n-messages>
			<avg-size>7KB</avg-size>
			<total-size>3MB</total-size>
			<mostly-written-at>13:39</mostly-written-at>
		</top-receiver>
		<top-receiver rank="2">
			<e-mail-addr>Andrew Morton</e-mail-addr>
			<n-messages>195</n-messages>
			<avg-size>8KB</avg-size>
			<total-size>2MB</total-size>
			<mostly-written-at>12:12</mostly-written-at>
		</top-receiver>
		<top-receiver rank="3">
			<e-mail-addr>Bartlomiej Zolnierkiewicz</e-mail-addr>
			<n-messages>58</n-messages>
			<avg-size>8KB</avg-size>
			<total-size>410KB</total-size>
			<mostly-written-at>15:33</mostly-written-at>
		</top-receiver>
		<top-receiver rank="4">
			<e-mail-addr>Tejun Heo</e-mail-addr>
			<n-messages>54</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>225KB</total-size>
			<mostly-written-at>08:23</mostly-written-at>
		</top-receiver>
		<top-receiver rank="5">
			<e-mail-addr>Linus Torvalds</e-mail-addr>
			<n-messages>50</n-messages>
			<avg-size>7KB</avg-size>
			<total-size>309KB</total-size>
			<mostly-written-at>14:35</mostly-written-at>
		</top-receiver>
	</top-receivers>
	<top-ccers>
		<top-ccers rank="1">
			<e-mail-addr>&lt;linux-kernel@xxx</e-mail-addr>
			<n-messages>810</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>5MB</total-size>
			<mostly-written-at>13:06</mostly-written-at>
		</top-ccers>
		<top-ccers rank="2">
			<e-mail-addr>Andrew Morton</e-mail-addr>
			<n-messages>199</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>960KB</total-size>
			<mostly-written-at>12:00</mostly-written-at>
		</top-ccers>
		<top-ccers rank="3">
			<e-mail-addr>linux-ide@vger.kernel.org</e-mail-addr>
			<n-messages>76</n-messages>
			<avg-size>5KB</avg-size>
			<total-size>379KB</total-size>
			<mostly-written-at>09:12</mostly-written-at>
		</top-ccers>
		<top-ccers rank="4">
			<e-mail-addr>&lt;fastboot@xxx</e-mail-addr>
			<n-messages>66</n-messages>
			<avg-size>8KB</avg-size>
			<total-size>490KB</total-size>
			<mostly-written-at>02:30</mostly-written-at>
		</top-ccers>
		<top-ccers rank="5">
			<e-mail-addr>Linus Torvalds</e-mail-addr>
			<n-messages>64</n-messages>
			<avg-size>6KB</avg-size>
			<total-size>363KB</total-size>
			<mostly-written-at>14:23</mostly-written-at>
		</top-ccers>
	</top-ccers>
	<top-level-domains>
		<tld rank="1">
			<name>com</name>
			<freq>725</freq>
		</tld>
		<tld rank="2">
			<name>org</name>
			<freq>397</freq>
		</tld>
		<tld rank="3">
			<name>de</name>
			<freq>203</freq>
		</tld>
		<tld rank="4">
			<name>net</name>
			<freq>138</freq>
		</tld>
		<tld rank="5">
			<name>cz</name>
			<freq>99</freq>
		</tld>
	</top-level-domains>
	<top-timezones>
		<tz rank="1">
			<name>+0100</name>
			<freq>728</freq>
		</tz>
		<tz rank="2">
			<name>-0800</name>
			<freq>339</freq>
		</tz>
		<tz rank="3">
			<name>-0500</name>
			<freq>260</freq>
		</tz>
		<tz rank="4">
			<name>+0000</name>
			<freq>159</freq>
		</tz>
		<tz rank="5">
			<name>+0900</name>
			<freq>129</freq>
		</tz>
	</top-timezones>
	<top-organisations>
		<org rank="2">
			<name>OSDL</name>
			<freq>10</freq>
			<bytes>39KB</bytes>
		</org>
		<org rank="3">
			<name>Red Hat, Inc.</name>
			<freq>9</freq>
			<bytes>103KB</bytes>
		</org>
		<org rank="4">
			<name>USAGI Project</name>
			<freq>9</freq>
			<bytes>36KB</bytes>
		</org>
		<org rank="5">
			<name>Mostly alphabetical, except Q, which We do not fancy</name>
			<freq>9</freq>
			<bytes>32KB</bytes>
		</org>
	</top-organisations>
	<top-user-agents>
		<useragent rank="1">
			<name>Mozilla</name>
			<freq>48</freq>
			<bytes>834KB</bytes>
		</useragent>
		<useragent rank="2">
			<name>Evolution</name>
			<freq>41</freq>
			<bytes>985KB</bytes>
		</useragent>
		<useragent rank="3">
			<name>Mutt/1.5.6+20040907i</name>
			<freq>40</freq>
			<bytes>2MB</bytes>
		</useragent>
		<useragent rank="4">
			<name>Mozilla/5.0</name>
			<freq>36</freq>
			<bytes>856KB</bytes>
		</useragent>
		<useragent rank="5">
			<name>Mutt/1.4.1i</name>
			<freq>24</freq>
			<bytes>424KB</bytes>
		</useragent>
	</top-user-agents>
	<messages-per-day>
		<Sunday><msgs>252</msgs><bytes>2MB</bytes></Sunday>
		<Monday><msgs>266</msgs><bytes>2MB</bytes></Monday>
		<Tuesday><msgs>227</msgs><bytes>2MB</bytes></Tuesday>
		<Wednesday><msgs>274</msgs><bytes>2MB</bytes></Wednesday>
		<Thursday><msgs>348</msgs><bytes>2MB</bytes></Thursday>
		<Friday><msgs>342</msgs><bytes>2MB</bytes></Friday>
		<Saturday><msgs>213</msgs><bytes>2MB</bytes></Saturday>
	</messages-per-day>
	<messages-per-month>
		<Jan><msgs>301</msgs><bytes>2MB</bytes></Jan>
		<Feb><msgs>1620</msgs><bytes>9MB</bytes></Feb>
		<Mar><msgs>0</msgs><bytes>0</bytes></Mar>
		<Apr><msgs>0</msgs><bytes>0</bytes></Apr>
		<May><msgs>0</msgs><bytes>0</bytes></May>
		<Jun><msgs>0</msgs><bytes>0</bytes></Jun>
		<Jul><msgs>0</msgs><bytes>0</bytes></Jul>
		<Aug><msgs>0</msgs><bytes>0</bytes></Aug>
		<Sep><msgs>1</msgs><bytes>4KB</bytes></Sep>
		<Oct><msgs>0</msgs><bytes>0</bytes></Oct>
		<Nov><msgs>0</msgs><bytes>0</bytes></Nov>
		<Dec><msgs>0</msgs><bytes>0</bytes></Dec>
	</messages-per-month>
	<messages-per-day-of-month>
		<day-1><msgs>68</msgs><bytes>361KB</bytes></day-1>
		<day-2><msgs>148</msgs><bytes>833KB</bytes></day-2>
		<day-3><msgs>238</msgs><bytes>2MB</bytes></day-3>
		<day-4><msgs>291</msgs><bytes>2MB</bytes></day-4>
		<day-5><msgs>180</msgs><bytes>2MB</bytes></day-5>
		<day-6><msgs>227</msgs><bytes>2MB</bytes></day-6>
		<day-7><msgs>217</msgs><bytes>2MB</bytes></day-7>
		<day-8><msgs>146</msgs><bytes>806KB</bytes></day-8>
		<day-9><msgs>86</msgs><bytes>475KB</bytes></day-9>
		<day-10><msgs>19</msgs><bytes>103KB</bytes></day-10>
		<day-11><msgs>0</msgs><bytes>0</bytes></day-11>
		<day-12><msgs>0</msgs><bytes>0</bytes></day-12>
		<day-13><msgs>0</msgs><bytes>0</bytes></day-13>
		<day-14><msgs>0</msgs><bytes>0</bytes></day-14>
		<day-15><msgs>0</msgs><bytes>0</bytes></day-15>
		<day-16><msgs>0</msgs><bytes>0</bytes></day-16>
		<day-17><msgs>0</msgs><bytes>0</bytes></day-17>
		<day-18><msgs>0</msgs><bytes>0</bytes></day-18>
		<day-19><msgs>33</msgs><bytes>305KB</bytes></day-19>
		<day-20><msgs>1</msgs><bytes>4KB</bytes></day-20>
		<day-21><msgs>1</msgs><bytes>32KB</bytes></day-21>
		<day-22><msgs>2</msgs><bytes>6KB</bytes></day-22>
		<day-23><msgs>0</msgs><bytes>0</bytes></day-23>
		<day-24><msgs>2</msgs><bytes>8KB</bytes></day-24>
		<day-25><msgs>13</msgs><bytes>56KB</bytes></day-25>
		<day-26><msgs>8</msgs><bytes>33KB</bytes></day-26>
		<day-27><msgs>90</msgs><bytes>393KB</bytes></day-27>
		<day-28><msgs>48</msgs><bytes>341KB</bytes></day-28>
		<day-29><msgs>31</msgs><bytes>166KB</bytes></day-29>
		<day-30><msgs>25</msgs><bytes>126KB</bytes></day-30>
		<day-31><msgs>48</msgs><bytes>323KB</bytes></day-31>
	</messages-per-day-of-month>
	<messages-per-hour>
		<hour-1><msgs>65</msgs><bytes>421KB</bytes></hour-1>
		<hour-2><msgs>42</msgs><bytes>195KB</bytes></hour-2>
		<hour-3><msgs>35</msgs><bytes>256KB</bytes></hour-3>
		<hour-4><msgs>12</msgs><bytes>55KB</bytes></hour-4>
		<hour-5><msgs>6</msgs><bytes>25KB</bytes></hour-5>
		<hour-6><msgs>11</msgs><bytes>50KB</bytes></hour-6>
		<hour-7><msgs>30</msgs><bytes>133KB</bytes></hour-7>
		<hour-8><msgs>40</msgs><bytes>207KB</bytes></hour-8>
		<hour-9><msgs>105</msgs><bytes>543KB</bytes></hour-9>
		<hour-10><msgs>125</msgs><bytes>716KB</bytes></hour-10>
		<hour-11><msgs>128</msgs><bytes>691KB</bytes></hour-11>
		<hour-12><msgs>144</msgs><bytes>2MB</bytes></hour-12>
		<hour-13><msgs>106</msgs><bytes>557KB</bytes></hour-13>
		<hour-14><msgs>128</msgs><bytes>712KB</bytes></hour-14>
		<hour-15><msgs>105</msgs><bytes>472KB</bytes></hour-15>
		<hour-16><msgs>118</msgs><bytes>649KB</bytes></hour-16>
		<hour-17><msgs>101</msgs><bytes>563KB</bytes></hour-17>
		<hour-18><msgs>87</msgs><bytes>439KB</bytes></hour-18>
		<hour-19><msgs>105</msgs><bytes>708KB</bytes></hour-19>
		<hour-20><msgs>105</msgs><bytes>488KB</bytes></hour-20>
		<hour-21><msgs>103</msgs><bytes>610KB</bytes></hour-21>
		<hour-22><msgs>80</msgs><bytes>436KB</bytes></hour-22>
		<hour-23><msgs>61</msgs><bytes>329KB</bytes></hour-23>
	</messages-per-hour>
	<created-with><name>mboxstats</name><version>2.2</version><developer>folkert@vanheusden.com</developer><url>http://www.vanheusden.com/mboxstats/</url></created-with>
</mailbox-stats>

<section
  title="kexec And crashdump"
  subject="[PATCH 0/29] overview"
  archive="http://groups-beta.google.com/group/linux.kernel/msg/a51e5eed36a75769"
  posts="99"
  startdate="18 Jan 2005 23:31:37 -0800"
  enddate="04 Feb 2005 04:02:42 -0800"
>
<topic>Executable File Format</topic>
<topic>Kexec</topic>

<mention>Itsuro Oda</mention>

<p>Eric W. Biederman said:</p>

<quote who="Eric W. Biederman">

<p>This patchset is a major refresh of the kexec on panic
functionality in the kernel.  The primary aim of which was to take
the requirements capture of the kernel crashdump patches and
start integrating the functionality cleanly into the kexec
patches.</p>

<p>Major accomplishments:</p>

<p>

<ul>

<li>Compat syscall support has been added.</li>

<li>The crashdump capture code has been separated from the kexec on panic
code.</li>

<li>The kernel to jump to on panic is now loaded in place.</li>

<li>A long standing bug that allowed 2 source pages to copy data to a single
destination page has been caught and fixed.</li>

<li>Support for loading an x86_64 kernel in a reserved area of memory has been
completed.</li>

</ul>

</p>

<p>The crashdump code is currently slightly broken.  I have attempted to
minimize the breakage so things can quickly be made to work again.</p>

<p>With respect to a final design discussion there are two remaining open
issues.  The first is how little hardware shutdown we can get away with in
the kernel that is panicking.  I believe we can reduce this to a simple NMI
to the other cpus telling them to stop.  This has been addressed as a major
concern in previous conversations.</p>

<p>The second issue is the most significant with respect to the
design of a kernel based crash dump capture implementation.  How does the
crashdump capture process discover relevant information about the kernel
that just crashed?  There are two options.</p>

<p>

<ol>

<li>As represented by the current crashdump patches the crashdump kernel
and the kernel in which it loads are kept in sync so that it has up-to-date
versions of all of the crashed kernel's data structures because it is built from the
same source.  So it only needs to find the address of the data structures
it would like to look at.</li>

<li>

<p>The relevant information if it is available when sys_kexec_load is called
is exported to user space, or the machine_crash_shutdown method marshalls what
little information must be captured when the machine dies in a well known
standard format (most likely ELF notes).  Allowing the crashdump capture
process to simply pass on the information or utilize it as appropriate.</p>

<p>If the second method can successfully represent all of the interesting
information then we can allow kernel version skew, between the two kernels, and
potentially implement the entire crash dump capture process in user space.</p>

</li>

</ol>

</p>

<p>As best as I have been able to discover, the interesting information
includes the cpu state (registers) at the time of the crash/panic, the list
of memory regions the kernel that has crashed was using, and potentially
the list of pages dedicated to kernel data as opposed to user space, so that
people with insane amounts of memory (1TB+) don't require unmanageably
large core files.</p>

</quote>

<p>He quoted an earlier message by Andrew Morton, in which Andrew had
said, <quote who="Andrew Morton">I don't want us to be in a position of
merging all that code and then finding out that it cannot be made to work
"sufficiently well", forcing us to revert it and find a new crashdump solution.
You guys know far better than I when we will reach that threshold.  If the
kexec/dump developers can say "yup, this is going to work (because X)"
then I'm happy.</quote> Eric now offered:</p>

<quote who="Eric W. Biederman">

<p>So here is my subjective view.</p>

<p>

<ul>

<li>This code needs to sit in a development tree for a little while to shake
out whatever bugs still linger from my massive refactoring.</li>

<li>Through the kexec patches the code and design appears to be sound.
Given that machine_kexec is little more than a jump there are few possible
implementations that will be able to use it.  The only exception I can see
is running special dump drivers from the kernel that crashed, and I believe
no one thinks that will work well.</li>

<li>Once we finish sorting out the best way to get information out of the
kernel that crashed I think we will have a complete architecture that is
largely portable to any architecture.</li>

</ul>

</p>

<p>In the interests of full disclosure my main interest is using the
kernel as a bootloader for other kernels and that has been working fairly
well for years now :)</p>

</quote>

<p>He posted a couple dozen atomic patches for this. Vivek Goyal replied:</p>

<quote who="Vivek Goyal">

<p>We have started making changes to get crashdump up and running again.
Following are a few identified items to be done.</p>

<p>

<ol>

<li>Reserve the backup region (640k) during kernel bootup.</li>

<li>Copy the data to backup region during crash. (Moved to kexec user space
code, patch posted in separate mail.)</li>

<li>Prepare elf headers while loading kexec panic kernel and store in reserved
memory area.</li>

<li>Pass required information to crashdump kernel, which parses it and exports
through /proc/vmcore. (may be user space utility, open to discussion)</li>

</ol>

</p>

<p>Following patch implements item 1) in the list. Soon we shall be rolling
out the patches for rest.</p>

</quote>

<p>In going over some of the implementation details, Eric found a number of
problems with Vivek's patch; for a while it seemed the discussion would descend
into confusion, as Eric felt Vivek was making only minimal changes in
response to Eric's design suggestions. This had not been Vivek's intention,
however, and they were soon 'back on the same page', as Eric put it. Vivek
described the new design, saying, <quote who="Vivek Goyal">The whole idea is
that Crash image is represented in ELF Core format.  These ELF Headers are
prepared by kexec-tools user space and put in one segment. Address of start of
image is passed to the capture kernel(or user space) using one command line
(eg. crashimage=). Now either kernel space or user space can parse the elf
headers and extract required information and export final kernel elf core
image.</quote> He went on:</p>

<quote who="Vivek Goyal">

<p>If I prepare one ELF header for each physically contiguous memory area (as
obtained from /proc/iomem) instead of per zone, then the number of ELF headers
will come down significantly. I don't have any idea of the number of actual
physically contiguous regions present per machine, but roughly assuming it
to be 1 per node, it will lead to 256 + 1024 = 1280 program headers. At 56
bytes per 64 bit program header this will amount to 70KB.</p>

<p>This is a worst case estimate, and lower end machines will require
much less space. On machines as big as 1024 cpus, this should not be a
concern, as big machines come with big RAMs.</p>

<p>Eric, do you still think that ELF headers are inappropriate to be passed
across an interface boundary?</p>

<p>ELF headers can be prepared by kexec-tools in advance and put into one
of the data segments. This requires following information to be available
to user space.</p>

<p>

<ul>

<li>Starting address of space reserved by kernel for notes section
(crash_notes[]). Probably can be obtained from /proc/kallsyms?</li>

<li>NR_CPUS. Maybe sysconf(_SC_NPROCESSORS_CONF) should be sufficient.</li>

<li>Size of memory reserved per cpu. No clue how to get that. Any
suggestions?<br />
        Maybe hard-coding something like a 1K area per cpu would be enough to
        address future needs?</li>

</ul>

</p>

<h3>Regarding Backup Region</h3>

<p>

<ul>

<li>Kexec user space does the reservation for backup region segment.</li>

<li>Purgatory copies the backup data to backup region. (Already
implemented)</li>

<li>A separate elf header is prepared to represent backed up memory
region. And "offset" field of this program header can contain the actual
physical address where backup contents are stored.</li>

</ul>

</p>

</quote>
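
<p>As a rough check on the arithmetic in Vivek's estimate, the following
sketch packs 64-bit ELF program headers in Python and totals their size. The
field layout follows the standard Elf64_Phdr structure; the helper names
(pack_phdr, headers_for_regions, table_size) and the sample memory map are
illustrative assumptions and are not taken from kexec-tools.</p>

<pre>
```python
import struct

# Elf64_Phdr layout: p_type and p_flags (32-bit each), then p_offset,
# p_vaddr, p_paddr, p_filesz, p_memsz, p_align (64-bit each).
# "=" selects standard sizes with no padding.
PHDR_FORMAT = "=IIQQQQQQ"
PHDR_SIZE = struct.calcsize(PHDR_FORMAT)

PT_LOAD = 1


def pack_phdr(phys_start, size, file_offset=0):
    """Pack one PT_LOAD header describing a physically contiguous region."""
    return struct.pack(
        PHDR_FORMAT, PT_LOAD, 0, file_offset, 0, phys_start, size, size, 0
    )


def headers_for_regions(regions):
    """Concatenate one program header per (start, size) region."""
    return b"".join(pack_phdr(start, size) for start, size in regions)


def table_size(n_headers):
    """Bytes needed for n_headers 64-bit program headers."""
    return n_headers * PHDR_SIZE


if __name__ == "__main__":
    # Hypothetical two-region memory map, in the spirit of /proc/iomem.
    demo = headers_for_regions([(0x0, 0xA0000), (0x100000, 0x3FF00000)])
    print(len(demo), "bytes for 2 headers")
    # Worst case from the message: 256 + 1024 headers.
    total = table_size(256 + 1024)
    print(total, "bytes, i.e.", total // 1024, "KB")
```
</pre>

<p>The file_offset parameter is where a header for the backup region could
carry the physical address of the saved contents, as suggested in the last
item of the list above; how kexec-tools actually fills that field is not
shown here.</p>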

<p>Eric had some criticisms, but felt this was a "good place to start". Itsuro
Oda asked why, in all this, the ELF format was considered necessary. Eric
replied that the ELF format itself was not necessary, but that the information
carried in an ELF header closely matched the information that needed to be
communicated here, which made it a natural fit. When Koichi
Suzuki echoed Itsuro's concerns, saying, <quote who="Koichi Suzuki">Format
conversion should be done in healthy system separately and we should restrict
what to do while taking the dump as few as possible,</quote> Eric expanded:</p>

<quote who="Eric W. Biederman">

<p>The big part of the conversation that is happening right now is how do
we uncouple dependencies between the various parts as much as possible.
There is nothing here about format conversions except as to convert weird
kernel formats into a stable interface.</p>

<p>There are 3 pieces of code interacting.</p>

<p>

<ol>

<li>The primary kernel that will call panic.</li>

<li>The kernel+initrd that takes over.</li>

<li>The user space that sets it all up (/sbin/kexec) while the primary kernel
is still in a sane state.</li>

</ol>

</p>

<p>The goal is to make those 3 pieces as independent of each other as
reasonably possible.</p>

<p>So the kernel+initrd that captures a crash dump will live and execute in
a reserved area of memory.  It needs to know which memory regions are valid,
and it needs to know small things like the final register state of each cpu.
For the set of valid memory regions it is the intention to encode that as
an array of ELF program headers.  The information of what the final register
contents were will be encoded as ELF notes.  There will be one PT_NOTE segment
per cpu that holds the notes needed to encode a given cpu's final state.
It really does not matter to the implementation that captures each cpu's final
register state which format we record the data in so using a format designed
not to change is not a problem.  So all that needs to be communicated to the
kernel+initrd that captures a crash dump is the location of an ELF header
and it can figure out all of the rest.</p>

<p>For the primary kernel except for remembering its final cpu register
state as it dies it does nothing except jump to the crash recovery kernel.
All of the interesting information will be exported to user space.</p>

<p>/sbin/kexec is the glue that fills in the cracks.  While the primary
kernel is in a sane state it sets everything up including finding out which
memory areas need to be looked at.  And it stashes it all in a reserved area
of memory, that has never been the target of DMA transfers.</p>

<p>The goal is to reduce the dependencies as much as possible.  So an old
stable kernel can take a crash dump of a new buggy kernel.  And so that you
don't have to be running the latest and greatest user space simply to set
everything up.  Although it is still better to require a user-space upgrade
to cope with new kernels than to require the crash capture kernel+initrd to
be upgraded.</p>

</quote>
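
<p>To make the "one PT_NOTE segment per cpu" idea more concrete, here is a
minimal Python sketch of the generic ELF note layout: a header of three
32-bit words (name size, descriptor size, type) followed by the name and the
descriptor, each padded to a 4-byte boundary. The function names are
hypothetical, and using raw register bytes as the descriptor is a
simplification; a real core file stores a full prstatus structure there.</p>

<pre>
```python
import struct

NT_PRSTATUS = 1  # standard note type for per-thread status in core files


def pad4(data):
    """Pad bytes with zeros up to the next 4-byte boundary."""
    return data + b"\0" * (-len(data) % 4)


def pack_note(name, note_type, desc):
    """Pack one ELF note: header (namesz, descsz, type), name, descriptor."""
    name_z = name + b"\0"  # namesz counts the terminating NUL
    header = struct.pack("=III", len(name_z), len(desc), note_type)
    return header + pad4(name_z) + pad4(desc)


def notes_per_cpu(register_dumps):
    """Build one note blob per cpu from raw register bytes (simplified)."""
    return [pack_note(b"CORE", NT_PRSTATUS, regs) for regs in register_dumps]


if __name__ == "__main__":
    # Two hypothetical cpus with made-up 16-byte register snapshots.
    blobs = notes_per_cpu([bytes(16), bytes(range(16))])
    for cpu, blob in enumerate(blobs):
        print("cpu", cpu, "note is", len(blob), "bytes")
```
</pre>

<p>Because the layout is fixed and self-describing, the capture kernel only
needs the location of the ELF header to find these notes, which is the
decoupling Eric describes.</p>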

</section>

<section
  title="New scrubd Page Zeroing Daemon"
  subject="A scrub daemon (prezeroing)"
  archive="http://groups-beta.google.com/group/linux.kernel/msg/9c4cc81b96c27015"
  posts="38"
  startdate="21 Jan 2005 12:29:03 -0800"
  enddate="08 Feb 2005 15:32:43 -0800"
>
<topic>FS: sysfs</topic>
<topic>SMP</topic>

<mention>Andrea Arcangeli</mention>
<mention>Andrew Morton</mention>

<p>Christoph Lameter posted a patch that <quote who="Christoph Lameter">Adds
management of ZEROED and NOT_ZEROED pages and a background daemon called
scrubd.</quote> He went on:</p>

<quote who="Christoph Lameter">

<p>scrubd is disabled by default but can be enabled by writing an order
number to /proc/sys/vm/scrub_start. If a page is coalesced of that order
or higher then the scrub daemon will start zeroing until all pages of order
/proc/sys/vm/scrub_stop and higher are zeroed and then go back to sleep.</p>

<p>In an SMP environment the scrub daemon is typically running on the most
idle cpu. Thus a single threaded application running on one cpu may have the
other cpu zeroing pages for it etc. The scrub daemon is hardly noticeable
and usually finishes zeroing quickly since most processors are optimized
for linear memory filling.</p>

<p>Note that this patch does not depend on any other patches but other
patches would improve what scrubd does. The extension of clear_pages by an
order parameter would increase the speed of zeroing and the patch introducing
alloc_zeroed_user_highpage is necessary for user pages to be allocated from
the pool of zeroed pages.</p>

</quote>
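
<p>The start/stop thresholds described above can be pictured with a small
Python model. This is only a sketch of the behaviour as quoted: the function
names and the dictionary of unzeroed free blocks are assumptions made for
illustration, and nothing here reflects the actual kernel data structures in
the patch.</p>

<pre>
```python
def should_wake(coalesced_order, scrub_start):
    """Wake the daemon once a free block of order scrub_start or higher appears."""
    return coalesced_order >= scrub_start


def scrub(unzeroed, scrub_stop):
    """Zero every unzeroed block of order scrub_stop or higher.

    unzeroed maps order to the number of unzeroed free blocks of that order.
    Returns the blocks left unzeroed and the number of base pages zeroed.
    """
    remaining = {}
    pages_zeroed = 0
    for order, count in unzeroed.items():
        if order >= scrub_stop:
            # A block of a given order spans 2**order base pages.
            pages_zeroed += count * 2 ** order
        else:
            remaining[order] = count
    return remaining, pages_zeroed


if __name__ == "__main__":
    # Hypothetical state: an order-4 block was just coalesced.
    free_unzeroed = {0: 10, 2: 3, 4: 1}
    if should_wake(4, scrub_start=4):
        left, zeroed = scrub(free_unzeroed, scrub_stop=2)
        print("zeroed", zeroed, "pages; still unzeroed:", left)
```
</pre>

<p>With a start order of 4 and a stop order of 2 in this made-up state, the
daemon would zero the order-2 and order-4 blocks and leave the order-0 pages
alone before going back to sleep.</p>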

<p>There was a good bit of wrangling over implementation, and later he posted
an update, saying:</p>

<quote who="Christoph Lameter">

<p>Changes from V4 to V6:</p>

<p>

<ul>

<li>V5 posted as independent patches</li>

<li>copyright update in Altix BTE driver</li>

<li>Note early work on __GFP_ZERO by Andrea Arcangeli</li>

<li>Simplify Altix BTE zeroing driver and handle timeouts correctly (kscrubd
hung once in a while).</li>

<li>Support /proc/buddyinfo</li>

<li>Make the higher order clear_page patch less invasive. Name it
clear_pages.</li>

<li>patch against 2.6.11-rc3</li>

</ul>

</p>

<p>More information and a combined patchset is available at <a
href="http://oss.sgi.com/projects/page_fault_performance">http://oss.sgi.com/projects/page_fault_performance</a>.</p>

<p>The most expensive operation in the page fault handler is (apart from SMP
locking overhead) the touching of all cache lines of a page by zeroing the
page. This zeroing means that all cachelines of the faulted page (on Altix
that means all 128 cachelines of 128 bytes each) must be handled and later
written back. This patch makes it possible to avoid touching all cachelines if
only a part of the cachelines of that page is needed immediately after the
fault. Doing so will only be effective for sparsely accessed memory which is
typical for anonymous memory and pte maps.  Prezeroed pages will only be used
for those purposes. Unzeroed pages will be used as usual for file mapping,
page caching etc.</p>

<p>The patch makes prezeroing very effective by:</p>

<p>

<ol>

<li>Applying zeroing operations only to pages of higher order, which results
in many pages that will later become zero order pages to be zeroed in one
step.</li>

<li>Hardware support for offloading zeroing from the cpu. This avoids the
invalidation of the cpu caches by extensive zeroing operations.</li>

</ol>

</p>

<p>The scrub daemon is invoked when an unzeroed page of a certain order has
been generated so that it's worth running it. If no higher order pages are
present then the logic will favor hot zeroing rather than simply shifting
processing around. kscrubd typically runs only for a fraction of a second
and sleeps for long periods of time even under memory benchmarking. kscrubd
performs short bursts of zeroing when needed and tries to stay off the
processor as much as possible.</p>

<p>The benefits of prezeroing are reduced to minimal quantities if all
cachelines of a page are touched. Prezeroing can only be effective if the
whole page is not immediately used after the page fault.</p>

<p>The patch is composed of 3 parts:</p>

<p>[1/3] clear_pages(page, order) to zero higher order pages<br />
        Adds a clear_pages function with the ability to zero higher order
        pages. This allows the zeroing of large areas of memory without
        repeatedly invoking clear_page() from the page allocator, scrubd and
        the huge page allocator.</p>

<p>[2/3] Page Zeroing<br />
        Adds management of ZEROED and NOT_ZEROED pages and a background
        daemon called scrubd.</p>

<p>[3/3] SGI Altix Block Transfer Engine Support<br />
        Implements a driver to shift the zeroing off the cpu into hardware.
        This avoids the potential impact of zeroing on cpu caches.</p>

</quote>
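
<p>The numbers in Christoph's description can be worked through with a short
Python sketch. The cacheline figures are the Altix values quoted above; the
helper names are hypothetical, and the only other assumption is the usual
convention that a block of a given order spans two to the power of that
order contiguous pages.</p>

<pre>
```python
CACHELINES_PER_PAGE = 128  # Altix figure quoted above
CACHELINE_BYTES = 128      # Altix figure quoted above
PAGE_BYTES = CACHELINES_PER_PAGE * CACHELINE_BYTES


def bytes_cleared(order):
    """Bytes covered by one clear_pages(page, order) call."""
    return PAGE_BYTES * 2 ** order


def cachelines_saved(lines_touched):
    """Cachelines the fault path avoids dirtying on a prezeroed page.

    lines_touched is how many cachelines the application uses right
    after the fault.
    """
    return max(CACHELINES_PER_PAGE - lines_touched, 0)


if __name__ == "__main__":
    print("page size:", PAGE_BYTES, "bytes")
    print("order 4 block:", bytes_cleared(4), "bytes in one call")
    print("sparse access, 4 lines used:", cachelines_saved(4), "lines saved")
    print("dense access, all lines used:", cachelines_saved(128), "lines saved")
```
</pre>

<p>The last two lines of the demo mirror the caveat in the message: when
every cacheline of the page is touched straight away, prezeroing saves
nothing on the fault path.</p>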

<p>Andrew Morton seemed interested in accepting the patch, but he wanted
benchmarks showing a real improvement, and he needed the patch to adhere to
existing APIs for starting, binding, and stopping kernel threads. Christoph
started to comply, but the thread petered out.</p>

</section>

<section
  title="ST M41T00 I2C RTC Chip Driver Released"
  subject="[PATCH][I2C] ST M41T00 I2C RTC chip driver"
  archive="http://groups-beta.google.com/group/fa.linux.kernel/msg/69a8223ea4e173c0"
  posts="8"
  startdate="31 Jan 2005 10:05:28 -0800"
  enddate="04 Feb 2005 16:04:09 -0800"
>
<topic>I2C</topic>

<mention>Greg KH</mention>
<mention>Jean Delvare</mention>

<p>Mark A. Greer said:</p>

<quote who="Mark A. Greer">

<p>This patch adds support for the ST M41T00 RTC chip.</p>

<p>You will likely notice that it implements a PPC-specific interface
(/dev/rtc-&gt;drivers/char/genrtc.h-&gt;include/asm-ppc/rtc.h-&gt;this file).
This was necessary to support a subset of ppc platforms that need to hook
up the rtc support at runtime.  If I implemented /dev/rtc directly or
interfaced to genrtc.c directly, those platforms couldn't use this driver.
Eventually, I hope to work on more uniform rtc support across all the
processor architectures.</p>

<p>Also, on ppc at least, the hw clock can be set from a timer interrupt if
STA_UNSYNC is not set (e.g., ntpd is running).  To handle this, a tasklet
is used to set the clock if in_interrupt() is true.</p>

</quote>

<p>Jean Delvare, although not intimately familiar with the hardware involved,
still offered some comments, mainly typos, naming conventions, and some
memory management advice. Mark posted an updated patch, taking all of Jean's
suggestions. Several days later, with no further replies, he asked if his
patch could be accepted for inclusion at that point. Greg KH asked if Mark could
send the patch with a proper Changelog blurb, and Mark did so. The blurb read:</p>

<quote who="Mark A. Greer">

<p>This patch adds support for the ST M41T00 I2C RTC chip.</p>

<p>This rtc chip has no mechanism to freeze its registers while being read;
however, it will delay updating the external values of the registers for 250ms
after a register is read.  To ensure that a sane time value is read, the driver
verifies that the same register values were read twice before returning.</p>

<p>Also, when setting the rtc from an interrupt handler, a tasklet is used
to provide the context required by the i2c core code.</p>

</quote>
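
<p>The read-twice check described in the blurb can be sketched in Python as
follows. This is not the driver's code: read_stable, its max_attempts limit,
and the fake register source in the demo are all assumptions made for
illustration, since the blurb does not say how many retries the driver
allows.</p>

<pre>
```python
def read_stable(read_registers, max_attempts=10):
    """Read until two consecutive reads return identical register values."""
    previous = read_registers()
    for _ in range(max_attempts):
        current = read_registers()
        if current == previous:
            return current
        # Values changed between reads; keep the newer one and try again.
        previous = current
    raise RuntimeError("registers did not settle")


if __name__ == "__main__":
    # Hypothetical chip whose seconds register rolls over mid-read.
    samples = iter([(59, 59, 23), (0, 0, 0), (0, 0, 0)])
    print(read_stable(lambda: next(samples)))
```
</pre>

<p>Comparing whole register tuples, instead of individual fields, guards
against a rollover that lands between reading the seconds and reading the
hours.</p>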

</section>

<section
  title="Linux 2.6.11-rc3 Released"
  subject="Linux 2.6.11-rc3"
  archive="http://groups-beta.google.com/group/fa.linux.kernel/msg/701f1a5e40afab24"
  posts="9"
  startdate="02 Feb 2005 18:35:39 -0800"
  enddate="04 Feb 2005 09:37:14 -0800"
>
<topic>Disks: SCSI</topic>
<topic>FS: XFS</topic>
<topic>Kernel Release Announcement</topic>
<topic>Power Management: ACPI</topic>
<topic>Sound: ALSA</topic>
<topic>Version Control</topic>

<p>Linus Torvalds announced Linux 2.6.11-rc3, saying:</p>

<quote who="Linus Torvalds">

<p>This has a number of architecture updates (mips, arm, ppc, x86-64, ia64),
and updates ACPI, DRI, ALSA, SCSI, XFS and InfiniBand.. And a lot of small
one-liners all over.</p>

<p>I'd _really_ like to calm down for a final 2.6.11 now, so please note
anything really important I missed, but keep the rest pending. And give this
a good testing..</p>

<p>Oh, and the automated bitkeeper mirroring to bkbits.net seems slightly
broken right now (hasn't updated in the last 48 hours), but the tar-balls
are all there, and the BK updating mechanism will hopefully be fixed soon.</p>

<p>(I've got a few BK trees in private places, it's only the public bkbits.net
one that hasn't gotten mirrored out yet - many other BK developers will know
where to find my secondary trees and can pull from them instead).</p>

</quote>

</section>

<section
  title="FUSE Version 2.2 Released"
  subject="[ANNOUNCE] Filesystem in Userspace - 2.2 "
  archive="http://groups-beta.google.com/group/fa.linux.kernel/msg/204129ce5d9bfee2"
  posts="2"
  startdate="03 Feb 2005 03:29:26 -0800"
  enddate="03 Feb 2005 04:39:22 -0800"
>

<p>Miklos Szeredi announced:</p>

<quote who="Miklos Szeredi">

<p>FUSE version 2.2 is out there:</p>

<p><a
href="http://sourceforge.net/project/showfiles.php?group_id=121684&amp;package_id=132802&amp;release_id=301878">http://sourceforge.net/project/showfiles.php?group_id=121684&amp;package_id=132802&amp;release_id=301878</a></p>

<p>This can be used standalone or with recent -mm kernels (with the exception
of -rc2-mm2).</p>

<p>Most notable changes since 2.1:</p>

<p>

<ul>

<li>Added file handle parameter to open/read/write/release.  This should
make life easier for filesystems wanting to implement stateful I/O.</li>

<li>Added compatibility to the 2.1 and to some extent to the 1.X API</li>

<li>Re-added ability to interrupt operations.  This time more carefully than
in 1.X.</li>

</ul>

</p>

<p>Regressions:</p>

<p>

<ul>

<li>Removed shared-writable mmap support, which could deadlock the linux
memory subsystem.  This should not affect most people, but if some application
breaks for you, I'd like to hear about it.</li>

<li>Made the readpages() operation synchronous, again for deadlock
considerations.  This can degrade performance, especially for high latency
filesystems, since previously parallel read-ahead is now serialized.</li>

</ul>

</p>

<p>In the long run I hope to solve both problems, but neither is trivial.
Ideas are welcome, as well as bugreports of course.</p>

</quote>

<p>Franco Broi reported excellent success, saying, <quote who="Franco
Broi">I've just ported my filesystem to 2.2-pre6 and was able to throw away
about 300 lines of code, the filehandle stuff is great. I was hoping to give
it a thorough test and report back before 2.2 was released but you beat me
to it.  It just keeps getting better and better, well done!</quote></p>

</section>

<section
  title="Linux 2.6.11-rc3-mm1 Released"
  subject="2.6.11-rc3-mm1"
  archive="http://groups-beta.google.com/group/fa.linux.kernel/msg/52ce998695501ac9"
  posts="59"
  startdate="04 Feb 2005 10:33:50 -0800"
  enddate="09 Feb 2005 16:22:02 -0800"
>
<topic>Device Mapper</topic>
<topic>Kernel Release Announcement</topic>
<topic>PCI</topic>
<topic>USB</topic>
<topic>Version Control</topic>

<p>Andrew Morton announced Linux 2.6.11-rc3-mm1, saying:</p>

<quote who="Andrew Morton">

<p><a href="ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc3/2.6.11-rc3-mm1/">ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc3/2.6.11-rc3-mm1/</a></p>

<p>

<ul>

<li>The bk-usb and bk-pci and bk-driver-core trees have been temporarily
dropped from -mm, for they are not healthy at present.</li>

<li>After many months dormancy, the ieee1394 tree is back and is included
in -mm.  Anyone who has been having firewire problems please test it.</li>

</ul>

</p>

</quote>

<p>Greg KH said:</p>

<quote who="Greg KH">

<p>Ok, I've cleaned up the bk-usb tree a bunch.  If anyone had a previous
copy of it, please just delete it and clone it again.  It's at:</p>

<blockquote>

<p>        bk://kernel.bkbits.net/gregkh/linux/usb-2.6</p>

</blockquote>

<p>and is safe for consumption.</p>

<p>Andrew, can you put it back into the next -mm release?</p>

<p>Oh, and below is the diffstat and changelog of the patches in it.
I've also placed a full patch of it, against the 2.6.11-rc3-bk1 tree for
those who don't like to use bk, or are just curious about putting this on
top of the latest -mm release:</p>

<blockquote>

<p><a
href="http://kernel.org/pub/linux/kernel/people/gregkh/usb/2.6/2.6.11-rc3/bk-usb-2.6.11-rc3-mm1.patch">kernel.org/pub/linux/kernel/people/gregkh/usb/2.6/2.6.11-rc3/bk-usb-2.6.11-rc3-mm1.patch</a></p>

</blockquote>

<p>Also, if you have sent me a USB patch that is not already in the mainline
tree, and is not included in this big patch-bundle, please resend it, as my
USB patch queue is now empty.</p>

<p>Oops, no, I have a pending patch from Petko Manolov that didn't make it
into here, sorry about that Petko, I'll get to that one next week.</p>

<p>Next up, the bk-pci and bk-driver-core mess...</p>

</quote>

<p>Elsewhere, Laurent Riffard reported:</p>

<quote who="Laurent Riffard">

<p>loading dm-mod module fails with this message :</p>

<blockquote>        

<p>FATAL: Error inserting dm-mod
(/lib/modules/2.6.11-rc3-mm1/kernel/drivers/md/dm-mod.ko): Device or
resource busy</p>

</blockquote>        

<p>The following line appears in dmesg :</p>

<blockquote>        

<p>register_blkdev: failed to get major for device-mapper</p>

</blockquote>        

<p>It was OK with kernel 2.6.11-rc2-mm2. Same config, did "make oldconfig".</p>

</quote>

<p>Andrew replied:</p>

<quote who="Andrew Morton">

<p>You've enabled CONFIG_BASE_SMALL and so the major_names[] hashtable has
just one element.  device-mapper uses dynamic major allocation, the range of
which is limited to the size of the top-level major_names[] array.  You ran
out of slots and register_blkdev() failed.</p>

<p>So for now I guess we must drop
base-small-shrink-major_names-hash.patch.</p>

<p>Al, that code looks rather crappy.  Shouldn't we be using an idr tree
or something?</p>

<p>Also, we can never generate a major number of zero if the caller passed
in major=0.  How come?</p>

</quote>
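
<p>Andrew's diagnosis can be illustrated with a simplified model of dynamic
major allocation. This is a Python sketch, not the kernel source: the function
name allocate_dynamic_major is hypothetical, and it assumes the scan runs from
the top slot downward and treats slot 0 as "nothing found", which would be
consistent with both the failure Laurent saw and Andrew's remark that major 0
is never generated.</p>

```python
def allocate_dynamic_major(major_names):
    """Pick a free major by scanning from the top slot down to slot 1.

    major_names models the top-level table: None marks a free slot.
    Slot 0 is never handed out, so a one-element table always fails.
    """
    for index in range(len(major_names) - 1, 0, -1):
        if major_names[index] is None:
            return index
    # No free slot: the real call would report
    # "failed to get major" and return an error.
    return None


# A full-size table has room; a one-element table (the shrunken
# CONFIG_BASE_SMALL case in this model) has none.
full_table = [None] * 255
small_table = [None]
print(allocate_dynamic_major(full_table))
print(allocate_dynamic_major(small_table))
```

<p>In this model the number of majors available for dynamic allocation is one
less than the table size, so shrinking the table to a single element leaves
nothing to allocate.</p>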

<p>Laurent confirmed that selecting CONFIG_BASE_FULL=y solved his
problem. Close by, Christoph Hellwig remarked, <quote who="Christoph
Hellwig">It'd be nice to see major_names just gone completely.  It's only
used for /proc/devices output, and with the infrastructure for easily sharing
majors that one is completely misleading..</quote> Alexander Viro replied:</p>

<quote who="Alexander Viro">

<p>ACK.  Moreover, dynamic registration of *majors* makes very little sense
these days - about as much as setting lower limit on IP block registration
to /12.</p>

<p>IMO we should put a large part of device number space for dynamic
allocations (current static ones barely scratch the surface - we could easily
leave upper half and nobody'd notice) and use e.g. buddy allocator within it.
With allocation requests taking size of area as argument (rounded up to
power of 2, which it normally would be anyway).</p>

<p>Any objections to that?  Hell, we can even have register_blkdev() without
a fixed major calling blkdev_allocate(name, 1&lt;&lt;20) and then eliminate
the callers in favour of saner-sized requests.  Then kill register_blkdev()
completely...</p>

</quote>

<p>There was no reply to this on the list.</p>
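
<p>For readers unfamiliar with the idea, a buddy allocator over a range of
device numbers might look roughly like the toy below. It is a Python sketch of
the general technique only: the class BuddyAllocator and its methods are
hypothetical, and nothing here reflects an actual kernel interface such as the
blkdev_allocate() call Alexander mentions.</p>

```python
class BuddyAllocator:
    """Toy buddy allocator over the number range [0, 2**max_order)."""

    def __init__(self, max_order):
        self.max_order = max_order
        # free[k] holds the start offsets of free blocks of size 2**k.
        self.free = {k: set() for k in range(max_order + 1)}
        self.free[max_order].add(0)

    def alloc(self, size):
        """Return (start, block_size), or None if nothing fits."""
        # Round the request up to a power of two.
        order = max(size - 1, 0).bit_length()
        # Find the smallest free block that is large enough.
        k = order
        while k in self.free and not self.free[k]:
            k += 1
        if k not in self.free:
            return None
        start = min(self.free[k])
        self.free[k].remove(start)
        # Split the block, returning the upper halves to the free lists.
        while k != order:
            k -= 1
            self.free[k].add(start + 2 ** k)
        return start, 2 ** order

    def release(self, start, block_size):
        """Free a block from alloc(), merging with its buddy when possible."""
        k = block_size.bit_length() - 1
        while k != self.max_order:
            buddy = start ^ (2 ** k)
            if buddy not in self.free[k]:
                break
            self.free[k].remove(buddy)
            start = min(start, buddy)
            k += 1
        self.free[k].add(start)
```

<p>A request for 3 numbers from a 16-number range is rounded up to a block of
4, and releasing every block merges the halves back into the original range.
Because sizes are always powers of two, a block's buddy is found by flipping a
single bit of its start offset.</p>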

</section>

<section
  title="RelayFS Updated"
  subject="[PATCH] relayfs redux, part 3"
  archive="http://groups-beta.google.com/group/linux.kernel/msg/26e66b87c73eee2d"
  posts="9"
  startdate="04 Feb 2005 12:17:37 -0800"
  enddate="05 Feb 2005 01:54:16 -0800"
>
<topic>SMP</topic>

<mention>Christoph Hellwig</mention>
<mention>Andi Kleen</mention>

<p>Tom Zanussi said:</p>

<quote who="Tom Zanussi">

<p>Here's the latest version of relayfs, against 2.6.10.  It includes a
bunch of cleanup and restructuring prompted by the previous round of
comments, but the major change that people would care about would
probably be the changes to the logging functions relay_write(),
__relay_write(), and relay_reserve().  They've been rewritten to be
more efficient, or so I hope - I'm sure I'll hear about how they
should be improved for the next version in any case. ;-) Thanks to
everyone who commented on the previous version.</p>

<p>This is what the API currently looks like:</p>

<p>    rchan *relay_open(chanpath, subbuf_size, n_subbufs, flags, callbacks);<br />
    void relay_close(chan);<br />
    unsigned relay_write(chan, data, length);<br />
    unsigned __relay_write(chan, data, length);<br />
    void *relay_reserve(chan, length);<br />
    void relay_subbufs_consumed(chan, subbufs_consumed, cpu);<br />
    extern void relay_reset(chan);<br />
    void relay_commit(buf, subbuf_idx, count);</p>

<p>  helper macros:</p>

<p>    relay_get_buffer(chan, cpu)<br />
    relay_get_padding(buf, subbuf_idx)<br />
    relay_get_commit(buf, subbuf_idx)</p>

<p>  callbacks:</p>

<p>    int subbuf_start(buf, subbuf, prev_subbuf_idx);<br />
    int deliver(buffer, subbuf, subbuf_idx);<br />
    int fileop_notify(buf, filp, fileop);</p>

<p>As before, I've tested this code on a single proc machine using a
hacked version of the kprobes network packet tracing module, which can
be found here:</p>

<p><a href="http://prdownloads.sourceforge.net/dprobes/plog.tar.gz?download">http://prdownloads.sourceforge.net/dprobes/plog.tar.gz?download</a></p>

<p>Once everyone's more or less happy with the API and implementation,
I'll do some SMP testing and write some Documentation.</p>

</quote>

<p>Christoph Hellwig and Andi Kleen both had nitty-gritty objections to
various lines of the patch; but neither had any serious problems with it, and
Tom said he'd incorporate all their corrections into a subsequent version.</p>

</section>

<section
  title="Linux Test Project Updated"
  subject="[ANNOUNCE] February release of LTP"
  archive="http://groups-beta.google.com/group/linux.kernel/msg/7329cf21b9c51696"
  posts="1"
  startdate="07 Feb 2005 14:24:39 -0800"
>
<topic>Networking</topic>

<mention>David Stevens</mention>

<p>Marty Ridgeway announced the February release of the Linux Test Project
(LTP), saying:</p>

<quote who="Marty Ridgeway">

<p>LTP-20050207</p>

<p>

<ul>

<li>runltp now exports $TMPDIR as a copy of $TMP, certain exceptions caused
these to be different.</li>

<li>extra functions for LTP libs are to make these tests fail with a more
informative message when attempts to create swap on tmpfs are made.</li>

<li>IPV6 testcase updates from David Stevens</li>

<li>Applied patch from Jacky Malcles that fixes an inconsistency regarding
synchronization.</li>

<li>Make proc01 skip kcore</li>

<li>Fix gives a hint to the probable solution if capset01 test fails</li>

<li>Fix for race conditions in synchronization between children and parent
on fcntl15.</li>

<li>Applied patch from Jacky Malcles to allow test to run on ia64.</li>

<li>The test llseek sets RLIMIT_FSIZE to a small number, this fix to restore
it to its original value.</li>

<li>Fix IPV6 Makefile install path problem</li>

</ul>

</p>

</quote>

</section>

<section
  title="New Marvell MV64xxx I2C Driver"
  subject="[PATCH][I2C] Marvell mv64xxx i2c driver"
  archive="http://groups-beta.google.com/group/linux.kernel/msg/ae4293720826e40e"
  posts="5"
  startdate="08 Feb 2005 15:27:23 -0800"
  enddate="09 Feb 2005 13:33:59 -0800"
>
<topic>I2C</topic>

<mention>Bartlomiej Zolnierkiewicz</mention>
<mention>Jean Delvare</mention>

<p>Mark A. Greer said:</p>

<quote who="Mark A. Greer">

<p>Marvell makes a line of host bridges for PPC and MIPS systems.  On those
bridges is an i2c controller.  This patch adds the driver for that i2c
controller.</p>

<p>Please apply.</p>

<p>Depends on patch submitted by Jean Delvare: <a
href="http://archives.andrew.net.au/lm-sensors/msg29405.html">http://archives.andrew.net.au/lm-sensors/msg29405.html</a></p>

</quote>

<p>Bartlomiej Zolnierkiewicz offered some minor fixes and criticisms of the
patch, and Mark went through several patch iterations with him.</p>

</section>

</kc>

