Commit Graph

43937 Commits

Author SHA1 Message Date
Nick Piggin
cd54e7e543 [PATCH] mm: incorrect VM_FAULT_OOM returns from drivers
Some drivers are returning OOM when it is not in response to a memory
shortage.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Jaroslav Kysela <perex@suse.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Nick Piggin
f2a2a7108a [PATCH] oom: less memdie
Don't cause all threads in all other thread groups to gain TIF_MEMDIE
otherwise we'll get a thundering herd eating our memory reserve.  This may not
be the optimal scheme, but it fits our policy of allowing just one TIF_MEMDIE
in the system at once.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Nick Piggin
f3af38d30c [PATCH] oom: cleanup messages
Clean up the OOM killer messages to be more consistent.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Nick Piggin
c33e0fca35 [PATCH] oom: don't kill unkillable children or siblings
Abort the kill if any of our threads have OOM_DISABLE set.  Having this
test here also prevents any OOM_DISABLE child of the "selected" process
from being killed.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Paul Jackson
7253f4ef04 [PATCH] memory page_alloc zonelist caching reorder structure
Rearrange the struct members in the 'struct zonelist_cache' structure, so
as to put the readonly (once initialized) z_to_n[] array first, where it
will come right after the zones[] array in struct zonelist.

This pretty much eliminates the chance that the two frequently written
elements of 'struct zonelist_cache', the fullzones bitmap and last_full_zap
times, will end up on the same cache line as the performance sensitive,
frequently read, never (after init) written zones[] array.

Keeping frequently written data off frequently read cache lines is good for
performance.

Thanks to Rohit Seth for the suggestion.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Rohit Seth <rohitseth@google.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Paul Jackson
9276b1bc96 [PATCH] memory page_alloc zonelist caching speedup
Optimize the critical zonelist scanning for free pages in the kernel memory
allocator by caching the zones that were found to be full recently, and
skipping them.

Remembers the zones in a zonelist that were short of free memory in the
last second.  And it stashes a zone-to-node table in the zonelist struct,
to optimize that conversion (minimize its cache footprint.)

Recent changes:

    This differs in a significant way from a similar patch that I
    posted a week ago.  Now, instead of having a nodemask_t of
    recently full nodes, I have a bitmask of recently full zones.
    This solves a problem that last weeks patch had, which on
    systems with multiple zones per node (such as DMA zone) would
    take seeing any of these zones full as meaning that all zones
    on that node were full.

    Also I changed names - from "zonelist faster" to "zonelist cache",
    as that seemed to better convey what we're doing here - caching
    some of the key zonelist state (for faster access.)

    See below for some performance benchmark results.  After all that
    discussion with David on why I didn't need them, I went and got
    some ;).  I wanted to verify that I had not hurt the normal case
    of memory allocation noticeably.  At least for my one little
    microbenchmark, I found (1) the normal case wasn't affected, and
    (2) workloads that forced scanning across multiple nodes for
    memory improved up to 10% fewer System CPU cycles and lower
    elapsed clock time ('sys' and 'real').  Good.  See details, below.

    I didn't have the logic in get_page_from_freelist() for various
    full nodes and zone reclaim failures correct.  That should be
    fixed up now - notice the new goto labels zonelist_scan,
    this_zone_full, and try_next_zone, in get_page_from_freelist().

There are two reasons I persued this alternative, over some earlier
proposals that would have focused on optimizing the fake numa
emulation case by caching the last useful zone:

 1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
    have seen real customer loads where the cost to scan the zonelist
    was a problem, due to many nodes being full of memory before
    we got to a node we could use.  Or at least, I think we have.
    This was related to me by another engineer, based on experiences
    from some time past.  So this is not guaranteed.  Most likely, though.

    The following approach should help such real numa systems just as
    much as it helps fake numa systems, or any combination thereof.

 2) The effort to distinguish fake from real numa, using node_distance,
    so that we could cache a fake numa node and optimize choosing
    it over equivalent distance fake nodes, while continuing to
    properly scan all real nodes in distance order, was going to
    require a nasty blob of zonelist and node distance munging.

    The following approach has no new dependency on node distances or
    zone sorting.

See comment in the patch below for a description of what it actually does.

Technical details of note (or controversy):

 - See the use of "zlc_active" and "did_zlc_setup" below, to delay
   adding any work for this new mechanism until we've looked at the
   first zone in zonelist.  I figured the odds of the first zone
   having the memory we needed were high enough that we should just
   look there, first, then get fancy only if we need to keep looking.

 - Some odd hackery was needed to add items to struct zonelist, while
   not tripping up the custom zonelists built by the mm/mempolicy.c
   code for MPOL_BIND.  My usual wordy comments below explain this.
   Search for "MPOL_BIND".

 - Some per-node data in the struct zonelist is now modified frequently,
   with no locking.  Multiple CPU cores on a node could hit and mangle
   this data.  The theory is that this is just performance hint data,
   and the memory allocator will work just fine despite any such mangling.
   The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
   (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
   all be self correcting after at most a one second delay.

 - This still does a linear scan of the same lengths as before.  All
   I've optimized is making the scan faster, not algorithmically
   shorter.  It is now able to scan a compact array of 'unsigned
   short' in the case of many full nodes, so one cache line should
   cover quite a few nodes, rather than each node hitting another
   one or two new and distinct cache lines.

 - If both Andi and Nick don't find this too complicated, I will be
   (pleasantly) flabbergasted.

 - I removed the comment claiming we only use one cachline's worth of
   zonelist.  We seem, at least in the fake numa case, to have put the
   lie to that claim.

 - I pay no attention to the various watermarks and such in this performance
   hint.  A node could be marked full for one watermark, and then skipped
   over when searching for a page using a different watermark.  I think
   that's actually quite ok, as it will tend to slightly increase the
   spreading of memory over other nodes, away from a memory stressed node.

===============

Performance - some benchmark results and analysis:

This benchmark runs a memory hog program that uses multiple
threads to touch alot of memory as quickly as it can.

Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
the total 96 GBytes on the system, and using 1, 19, 37, or 55
threads (on a 56 CPU system.)  System, user and real (elapsed)
timings were recorded for each run, shown in units of seconds,
in the table below.

Two kernels were tested - 2.6.18-mm3 and the same kernel with
this zonelist caching patch added.  The table also shows the
percentage improvement the zonelist caching sys time is over
(lower than) the stock *-mm kernel.

      number     2.6.18-mm3	   zonelist-cache    delta (< 0 good)	percent
 GBs    N  	------------	   --------------    ----------------	systime
 mem threads   sys user  real	  sys  user  real     sys  user  real	 better
  12	 1     153   24   177	  151	 24   176      -2     0    -1	   1%
  12	19	99   22     8	   99	 22	8	0     0     0	   0%
  12	37     111   25     6	  112	 25	6	1     0     0	  -0%
  12	55     115   25     5	  110	 23	5      -5    -2     0	   4%
  38	 1     502   74   576	  497	 73   570      -5    -1    -6	   0%
  38	19     426   78    48	  373	 76    39     -53    -2    -9	  12%
  38	37     544   83    36	  547	 82    36	3    -1     0	  -0%
  38	55     501   77    23	  511	 80    24      10     3     1	  -1%
  64	 1     917  125  1042	  890	124  1014     -27    -1   -28	   2%
  64	19    1118  138   119	  965	141   103    -153     3   -16	  13%
  64	37    1202  151    94	 1136	150    81     -66    -1   -13	   5%
  64	55    1118  141    61	 1072	140    58     -46    -1    -3	   4%
  90	 1    1342  177  1519	 1275	174  1450     -67    -3   -69	   4%
  90	19    2392  199   192	 2116	189   176    -276   -10   -16	  11%
  90	37    3313  238   175	 2972	225   145    -341   -13   -30	  10%
  90	55    1948  210   104	 1843	213   100    -105     3    -4	   5%

Notes:
 1) This test ran a memory hog program that started a specified number N of
    threads, and had each thread allocate and touch 1/N'th of
    the total memory to be used in the test run in a single loop,
    writing a constant word to memory, one store every 4096 bytes.
    Watching this test during some earlier trial runs, I would see
    each of these threads sit down on one CPU and stay there, for
    the remainder of the pass, a different CPU for each thread.

 2) The 'real' column is not comparable to the 'sys' or 'user' columns.
    The 'real' column is seconds wall clock time elapsed, from beginning
    to end of that test pass.  The 'sys' and 'user' columns are total
    CPU seconds spent on that test pass.  For a 19 thread test run,
    for example, the sum of 'sys' and 'user' could be up to 19 times the
    number of 'real' elapsed wall clock seconds.

 3) Tests were run on a fresh, single-user boot, to minimize the amount
    of memory already in use at the start of the test, and to minimize
    the amount of background activity that might interfere.

 4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.

 5) Notice that the 'real' time gets large for the single thread runs, even
    though the measured 'sys' and 'user' times are modest.  I'm not sure what
    that means - probably something to do with it being slow for one thread to
    be accessing memory along ways away.  Perhaps the fake numa system, running
    ostensibly the same workload, would not show this substantial degradation
    of 'real' time for one thread on many nodes -- lets hope not.

 6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
    ran quite efficiently, as one might expect.  Each pair of threads needed
    to allocate and touch the memory on the node the two threads shared, a
    pleasantly parallizable workload.

 7) The intermediate thread count passes, when asking for alot of memory forcing
    them to go to a few neighboring nodes, improved the most with this zonelist
    caching patch.

Conclusions:
 * This zonelist cache patch probably makes little difference one way or the
   other for most workloads on real numa hardware, if those workloads avoid
   heavy off node allocations.
 * For memory intensive workloads requiring substantial off-node allocations
   on real numa hardware, this patch improves both kernel and elapsed timings
   up to ten per-cent.
 * For fake numa systems, I'm optimistic, but will have to leave that up to
   Rohit Seth to actually test (once I get him a 2.6.18 backport.)

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Rohit Seth <rohitseth@google.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: David Rientjes <rientjes@cs.washington.edu>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Christoph Lameter
89689ae7f9 [PATCH] Get rid of zone_table[]
The zone table is mostly not needed.  If we have a node in the page flags
then we can get to the zone via NODE_DATA() which is much more likely to be
already in the cpu cache.

In case of SMP and UP NODE_DATA() is a constant pointer which allows us to
access an exact replica of zonetable in the node_zones field.  In all of
the above cases there will be no need at all for the zone table.

The only remaining case is if in a NUMA system the node numbers do not fit
into the page flags.  In that case we make sparse generate a table that
maps sections to nodes and use that table to to figure out the node number.
 This table is sized to fit in a single cache line for the known 32 bit
NUMA platform which makes it very likely that the information can be
obtained without a cache miss.

For sparsemem the zone table seems to be have been fairly large based on
the maximum possible number of sections and the number of zones per node.
There is some memory saving by removing zone_table.  The main benefit is to
reduce the cache foootprint of the VM from the frequent lookups of zones.
Plus it simplifies the page allocator.

[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Chen, Kenneth W
c0a499c2c4 [PATCH] __unmap_hugepage_range(): add comment
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Paul Jackson
0798e5193c [PATCH] memory page alloc minor cleanups
- s/freeliest/freelist/ spelling fix

- Check for NULL *z zone seems useless - even if it could happen, so
  what?  Perhaps we should have a check later on if we are faced with an
  allocation request that is not allowed to fail - shouldn't that be a
  serious kernel error, passing an empty zonelist with a mandate to not
  fail?

- Initializing 'z' to zonelist->zones can wait until after the first
  get_page_from_freelist() fails; we only use 'z' in the wakeup_kswapd()
  loop, so let's initialize 'z' there, in a 'for' loop.  Seems clearer.

- Remove superfluous braces around a break

- Fix a couple errant spaces

- Adjust indentation on the cpuset_zone_allowed() check, to match the
  lines just before it -- seems easier to read in this case.

- Add another set of braces to the zone_watermark_ok logic

From: Paul Jackson <pj@sgi.com>

  Backout one item from a previous "memory page_alloc minor cleanups" patch.
   Until and unless we are certain that no one can ever pass an empty zonelist
  to __alloc_pages(), this check for an empty zonelist (or some BUG
  equivalent) is essential.  The code in get_page_from_freelist() blow ups if
  passed an empty zonelist.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Andrew Morton
a2ce774096 [PATCH] uml: workqueue build fix
arch/um/drivers/chan_kern.c:643: error: conflicting types for 'chan_interrupt'
  arch/um/include/chan_kern.h:31: error: previous declaration of 'chan_interrupt'

Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Andrey Mirkin
822191a2fa [PATCH] skip data conversion in compat_sys_mount when data_page is NULL
OpenVZ Linux kernel team has found a problem with mounting in compat mode.

Simple command "mount -t smbfs ..." on Fedora Core 5 distro in 32-bit mode
leads to oops:

  Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: compat_sys_mount+0xd6/0x290
  Process mount (pid: 14656, veid=300, threadinfo ffff810034d30000, task ffff810034c86bc0)
  Call Trace: ia32_sysret+0x0/0xa

The problem is that data_page pointer can be NULL, so we should skip data
conversion in this case.

Signed-off-by: Andrey Mirkin <amirkin@openvz.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Andrew Morton
a1e85378ba [PATCH] drm-sis linkage fix
Fix http://bugzilla.kernel.org/show_bug.cgi?id=7606

WARNING: "drm_sman_set_manager" [drivers/char/drm/sis.ko] undefined!

Cc: <daniel-silveira@gee.inatel.br>
Cc: Dave Airlie <airlied@linux.ie>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Andrew Morton
676dcb8bc2 [PATCH] add bottom_half.h
With CONFIG_SMP=n:

  drivers/input/ff-memless.c:384: warning: implicit declaration of function 'local_bh_disable'
  drivers/input/ff-memless.c:393: warning: implicit declaration of function 'local_bh_enable'

Really linux/spinlock.h should include linux/interrupt.h.  But interrupt.h
includes sched.h which will need spinlock.h.

So the patch breaks the _bh declarations out into a separate header and
includes it in both interrupt.h and spinlock.h.

Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Andi Kleen <ak@suse.de>
Cc: <stable@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:20 -08:00
Russell King
05f96ef118 [ARM] Allow gcc to optimise arm_add_memory a little more
For some reason, gcc was calculating meminfo.bank[meminfo.nr_banks]
repeatedly.  Use a pointer to it instead.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:26:16 +00:00
Pavel Pisa
86987d5bf4 [ARM] 3991/1: i.MX/MX1 high resolution time source
Enhanced resolution for time measurement functions.

Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:24:16 +00:00
Pavel Pisa
5c894cd1c8 [ARM] 3990/1: i.MX/MX1 more precise PLL decode
The future high resolution support inclusion utilizes
imx_decode_pll() in timer base frequency computation.
This use requires more precise computation without
discarding 10 bits by shifting left.

Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Acked-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:24:15 +00:00
Ben Dooks
9073341c2b [ARM] 3986/1: H1940: suspend to RAM support
Add support to suspend and resume, using the
H1940's bootloader

Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Signed-off-by: Arnaud Patard <arnaud.patard@rtp-net.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:17:49 +00:00
Kevin Hilman
f9a8ca1cab [ARM] 3985/1: ixp4xx clocksource cleanup
Rather than using a device_initcall() for the clocksource initialization, just call the init from the sys_timer init function.

Signed-off-by: Kevin Hilman <khilman@mvista.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:17:07 +00:00
Rod Whitby
a47d08e2e3 [ARM] 3984/1: ixp4xx/nslu2: Fix disk LED numbering (take 2)
This patch fixes an error in the numbering of the disk LEDs on the
Linksys NSLU2. The error crept in because the physical location
of the LEDs has the Disk 2 LED *above* the Disk 1 LED.

Thanks to Gordon Farquharson for reporting this.

Signed-off-by: Rod Whitby <rod@whitby.id.au>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:17:06 +00:00
Lennert Buytenhek
46156e04de [ARM] 3994/1: ixp23xx: fix handling of pci master aborts
The PCI master abort handling issue that affected ixp2000 also
affects ixp23xx.

Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:16:19 +00:00
Nicolas Pitre
2dc20a51dc [ARM] 3981/1: sched_clock for PXA2xx
Here's a 63-bit implementation of shed_clock() for PXA2xx.  The actual
period depends on the value of CLOCK_TICK_RATE and whether or not
reduced scaling factors were provided for it.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:06:55 +00:00
Nicolas Pitre
752bee178e [ARM] 3980/1: extend the ARM Versatile sched_clock implementation from 32 to 63 bit
period

This provides a 63 bit clock counter guaranteed to be monotonic over a
period of 35583 days instead of a clock wrap every 179 seconds, as long
as sched_clock() is called at least once every 89 seconds.  This should
not be a problem in practice, although a kernel timer could be scheduled
every 80 seconds for example simply to call sched_clock() making sure
top bits are always synchronized if need be.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:06:53 +00:00
Nicolas Pitre
2f1675c11a [ARM] 3979/1: extend the SA11x0 sched_clock implementation from 32 to 63 bit period
This provides a 63 bit clock counter guaranteed to be monotonic over a
period of 370 days instead of a clock wrap every 19.4 minutes, as long
as sched_clock() is called at least once every 9.7 minutes which
shouldn't be a problem in practice.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:06:50 +00:00
Nicolas Pitre
838ccbc35e [ARM] 3978/1: macro to provide a 63-bit value from a 32-bit hardware counter
This is done in a completely lockless fashion. Bits 0 to 31 of the count
are provided by the hardware while bits 32 to 62 are stored in memory.
The top bit in memory is used to synchronize with the hardware count
half-period.  When the top bit of both counters (hardware and in memory)
differ then the memory is updated with a new value, incrementing it when
the hardware counter wraps around.  Because a word store in memory is
atomic then the incremented value will always be in synch with the top
bit indicating to any potential concurrent reader if the value in memory
is up to date or not wrt the needed increment.  And any race in updating
the value in memory is harmless as the same value would be stored more
than once.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:06:45 +00:00
Nicolas Pitre
fa4adc6149 [ARM] 3611/4: optimize do_div() when divisor is constant
On ARM all divisions have to be performed "manually".  For 64-bit
divisions that may take more than a hundred cycles in many cases.

With 32-bit divisions gcc already use the recyprocal of constant
divisors to perform a multiplication, but not with 64-bit divisions.

Since the kernel is increasingly relying upon 64-bit divisions it is
worth optimizing at least those cases where the divisor is a constant.
This is what this patch does using plain C code that gets optimized away
at compile time.

For example, despite the amount of added C code, do_div(x, 10000) now
produces the following assembly code (where x is assigned to r0-r1):

	adr	r4, .L0
	ldmia	r4, {r4-r5}
	umull	r2, r3, r4, r0
	mov	r2, #0
	umlal	r3, r2, r5, r0
	umlal	r3, r2, r4, r1
	mov	r3, #0
	umlal	r2, r3, r5, r1
	mov	r0, r2, lsr #11
	orr	r0, r0, r3, lsl #21
	mov	r1, r3, lsr #11
	...
.L0:
	.word	948328779
	.word	879609302

which is the fastest that can be done for any value of x in that case,
many times faster than the __do_div64 code (except for the small x value
space for which the result ends up being zero or a single bit).

The fact that this code is generated inline produces a tiny increase in
.text size, but not significant compared to the needed code around each
__do_div64 call site this code is replacing.

The algorithm used has been validated on a 16-bit scale for all possible
values, and then recodified for 64-bit values.  Furthermore I've been
running it with the final BUG_ON() uncommented for over two months now
with no problem.

Note that this new code is compiled with gcc versions 4.0 or later.
Earlier gcc versions proved themselves too problematic and only the
original code is used with them.

Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:06:09 +00:00
Lennert Buytenhek
47d7e524b7 [ARM] 3993/1: ep93xx: add cirrus logic edb9302a support
Add support for the Cirrus Logic EDB9302A Evaluation Board.  Confirmed
to work by Chase Douglas.

Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:01:56 +00:00
George G. Davis
5636810d6f [ARM] 3982/2: Explicitly select 32-bit ARM ISA (-marm)
Do not assume that the ARM GCC toolchain defaults to building for the
32-bit ARM ISA (-marm) case. Instead, explicitly select -marm in CFLAGS
since the toolchain default can be for the 16-bit Thumb ISA (-mthumb) in
some odd/rare cases.

Signed-off-by: George G. Davis <gdavis@mvista.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-12-07 16:01:11 +00:00
Patrick Caulfield
ac33d07105 [DLM] Clean up lowcomms
This fixes up most of the things pointed out by akpm and Pavel Machek
with comments below indicating why some things have been left:

Andrew Morton wrote:
>
>> +static struct nodeinfo *nodeid2nodeinfo(int nodeid, gfp_t alloc)
>> +{
>> +	struct nodeinfo *ni;
>> +	int r;
>> +	int n;
>> +
>> +	down_read(&nodeinfo_lock);
>
> Given that this function can sleep, I wonder if `alloc' is useful.
>
> I see lots of callers passing in a literal "0" for `alloc'.  That's in fact
> a secret (GFP_ATOMIC & ~__GFP_HIGH).  I doubt if that's what you really
> meant.  Particularly as the code could at least have used __GFP_WAIT (aka
> GFP_NOIO) which is much, much more reliable than "0".  In fact "0" is the
> least reliable mode possible.
>
> IOW, this is all bollixed up.

When 0 is passed into nodeid2nodeinfo the function does not try to allocate a
new structure at all. it's an indication that the caller only wants the nodeinfo
struct for that nodeid if there actually is one in existance.
I've tidied the function itself so it's more obvious, (and tidier!)

>> +/* Data received from remote end */
>> +static int receive_from_sock(void)
>> +{
>> +	int ret = 0;
>> +	struct msghdr msg;
>> +	struct kvec iov[2];
>> +	unsigned len;
>> +	int r;
>> +	struct sctp_sndrcvinfo *sinfo;
>> +	struct cmsghdr *cmsg;
>> +	struct nodeinfo *ni;
>> +
>> +	/* These two are marginally too big for stack allocation, but this
>> +	 * function is (currently) only called by dlm_recvd so static should be
>> +	 * OK.
>> +	 */
>> +	static struct sockaddr_storage msgname;
>> +	static char incmsg[CMSG_SPACE(sizeof(struct sctp_sndrcvinfo))];
>
> whoa.  This is globally singly-threaded code??

Yes. it is only ever run in the context of dlm_recvd.
>>
>> +static void initiate_association(int nodeid)
>> +{
>> +	struct sockaddr_storage rem_addr;
>> +	static char outcmsg[CMSG_SPACE(sizeof(struct sctp_sndrcvinfo))];
>
> Another static buffer to worry about.  Globally singly-threaded code?

Yes. Only ever called by dlm_sendd.

>> +
>> +/* Send a message */
>> +static int send_to_sock(struct nodeinfo *ni)
>> +{
>> +	int ret = 0;
>> +	struct writequeue_entry *e;
>> +	int len, offset;
>> +	struct msghdr outmsg;
>> +	static char outcmsg[CMSG_SPACE(sizeof(struct sctp_sndrcvinfo))];
>
> Singly-threaded?

Yep.

>>
>> +static void dealloc_nodeinfo(void)
>> +{
>> +	int i;
>> +
>> +	for (i=1; i<=max_nodeid; i++) {
>> +		struct nodeinfo *ni = nodeid2nodeinfo(i, 0);
>> +		if (ni) {
>> +			idr_remove(&nodeinfo_idr, i);
>
> Didn't that need locking?

Not. it's only ever called at DLM shutdown after all the other threads
have been stopped.

>>
>> +static int write_list_empty(void)
>> +{
>> +	int status;
>> +
>> +	spin_lock_bh(&write_nodes_lock);
>> +	status = list_empty(&write_nodes);
>> +	spin_unlock_bh(&write_nodes_lock);
>> +
>> +	return status;
>> +}
>
> This function's return value is meaningless.  As soon as the lock gets
> dropped, the return value can get out of sync with reality.
>
> Looking at the caller, this _might_ happen to be OK, but it's a nasty and
> dangerous thing.  Really the locking should be moved into the caller.

It's just an optimisation to allow the caller to schedule if there is no work
to do. if something arrives immediately afterwards then it will get picked up
when the process re-awakes (and it will be woken by that arrival).

The 'accepting' atomic has gone completely. as Andrew pointed out it didn't
really achieve much anyway. I suspect it was a plaster over some other
startup or shutdown bug to be honest.


Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Pavel Machek <pavel@ucw.cz>
2006-12-07 09:25:13 -05:00
Steven Whitehouse
34126f9f41 [GFS2] Change gfs2_fsync() to use write_inode_now()
This is a bit better than the previous version of gfs2_fsync()
although it would be better still if we were able to call a
function which only wrote the inode & metadata. Its no big deal
though that this will potentially write the data as well since
the VFS has already done that before calling gfs2_fsync(). I've
also added a comment to explain whats going on here.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Morton <akpm@osdl.org>
2006-12-07 09:13:14 -05:00
Alan
fd3367af3d [PATCH] libata: Incorrect timing computation for PIO5/6
The ata timing computation code makes some mistakes in PIO5/6 because a
check was not updated correctly when I put this support into the kernel.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 07:37:07 -05:00
Mikael Pettersson
25b93d81b9 [PATCH] sata_promise: new EH conversion, take 2
This patch converts sata_promise to use new-style libata error
handling on Promise SATA chips, for both SATA and PATA ports.

* ATA_FLAG_SRST is no longer set
* ->phy_reset is no longer set as it is unused when ->error_handler
   is present, and pdc_sata_phy_reset() has been removed
* pdc_freeze() masks interrupts and halts DMA via PDC_CTLSTAT
* pdc_thaw() clears interrupt status in PDC_INT_SEQMASK and then
  unmasks interrupts in PDC_CTLSTAT
* pdc_error_handler() reinitialises the port if it isn't frozen,
  and then invokes ata_do_eh() with standard {s,}ata reset methods
* pdc_post_internal_cmd() resets the port in case of errors
* the PATA-only 20619 chip continues to use old-style EH:
  not by necessity but simply because I don't have documentation
  for it or any way to test it

Since the previous version pdc_error_handler() has been rewritten
and it now mostly matches ahci and sata_sil24. In case anyone
wonders: the call to pdc_reset_port() isn't a heavy-duty reset,
it's a light-weight reset to quickly put a port into a sane state.

The discussion about the PCI flushes in pdc_freeze() and pdc_thaw()
seemed to end with a consensus that the flushes are OK and not
obviously redundant, so I decided to keep them for now.

This patch was prepared against 2.6.19-git7, but it also applies
to 2.6.19 + libata #upstream, with or without the revised sata_promise
cleanup patch I recently submitted.

This patch does conflict with the #promise-sata-pata patch:
this patch removes pdc_sata_phy_reset() while #promise-sata-pata
modifies it. The correct patch resolution is to remove the function.

Tested on 2037x and 2057x chips, with PATA patches on top and disks
on both SATA and PATA ports.

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 07:25:01 -05:00
Albert Lee
e3472cbe5c [PATCH] libata: let ATA_FLAG_PIO_POLLING use polling pio for ATA_PROT_NODATA
Even if ATA_FLAG_PIO_POLLING is set, libata uses irq pio for the ATA_PROT_NODATA protocol.
This patch let ATA_FLAG_PIO_POLLING use polling pio for the ATA_PROT_NODATA protocol.

Signed-off-by: Albert Lee <albertcc@tw.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 07:22:28 -05:00
Mikael Pettersson
d324d4627d [PATCH] sata_promise: cleanups, take 2
This patch performs two simple cleanups of sata_promise.

* Remove board_20771 and map device id 0x3577 to board_2057x.
  After the recent corrections for SATAII chips, board_20771 and
  board_2057x were equivalent in the driver.

* Remove hp->hotplug_offset and use hp->flags & PDC_FLAG_GEN_II
  to compute hotplug_offset in pdc_host_init(). hp->hotplug_offset
  was used to distinguish 1st and 2nd generation chips in one
  particular case, but now we have that information in a more
  general form in hp->flags, so hp->hotplug_offset is redundant.

Changes since previous submission: rebased on libata-dev #upstream,
cleaned up hotplug_offset computation based on Tejun's comments,
expanded hotplug_offset removal rationale.

This patch does not depend on the pending new EH conversion patch.

Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 07:21:24 -05:00
Jeff Garzik
0ae851352a [wireless] zd1211rw: workqueue-related build fixes
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 06:30:30 -05:00
Jeff Garzik
0bfdcc88df [netdrvr] netxen: workqueue-related build fixes 2006-12-07 06:30:07 -05:00
Jeff Garzik
f1ff0fdc35 Merge tag 'r8169-upstream-20061204-00' of git://electric-eye.fr.zoreil.com/home/romieu/linux-2.6 into upstream 2006-12-07 05:05:58 -05:00
Jeff Garzik
359f2d17e3 Merge branch 'upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 into upstream
Conflicts:

	drivers/net/wireless/zd1211rw/zd_mac.h
	net/ieee80211/softmac/ieee80211softmac_assoc.c
2006-12-07 05:02:40 -05:00
Stephen Hemminger
0efdf26266 [PATCH] sky2: sparse warnings
Get rid of sparse warnings in sky2 driver because of mixed enum
usage.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 04:59:20 -05:00
Stephen Hemminger
7f4b45c526 [PATCH] skge: fix sparse warnings
Fix sparse warnings from using enum as part of arithmetic
expression, and comment indentation fixes

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 04:59:20 -05:00
Brice Goglin
e67bda55e2 [PATCH] myri10ge: write as 2 32-byte blocks in myri10ge_submit_8rx
In the myri10ge_submit_8rx() routine, write the 64 byte request block as
2 32-byte blocks so that it is handled by the hardware pio write handler
if write-combining is enabled.

Signed-off-by: Brice Goglin <brice@myri.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 04:59:20 -05:00
Stephen Hemminger
c3905bc4b7 [PATCH] sky2: receive queue watermark tweak
This patch makes the receive performance on some systems go from
714MB/s to 941MB/s. It adjusts the watermark of the receive queue
to be lower, thereby avoiding excess hardware flow control. This is
most important on the systems which have little/no additional buffering.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 04:58:33 -05:00
Stephen Hemminger
6771290102 [PATCH] sky2: beter ram buffer partitioning
Different chips have different sizes of ram buffers, and some versions have
no ram buffer at all!.  Be more careful about sizing the ram usage because
it maybe a problem if vendor keeps changing sizes.

There is the (unlikely) possibility that some of the errors on some of the
chips have been caused by partitioning not on a 1K boundary.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 04:58:33 -05:00
Stephen Hemminger
e5b74c7ddd [PATCH] sky2: add comments to PCI ids
Add comments to sky2 driver to show relationship between PCI id and
hardware.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 04:58:33 -05:00
Stephen Hemminger
2a45b49c30 [PATCH] sky2: add PCI for 88ec033
Add another new/missing pci id for 88ec033 chip.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 04:58:32 -05:00
Andrew Victor
a3f63e4f4b [PATCH] AT91RM9200 Ethernet: Use dev_alloc_skb()
Use dev_alloc_skb() instead of alloc_skb().

It is also not necessary to adjust skb->len manually since that's
already done by skb_put().

Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 04:58:32 -05:00
Andrew Victor
51cc210457 [PATCH] AT91RM9200 Ethernet: Add netpoll / netconsole support
Adds netpoll / netconsole support.

Original patch from Bill Gatliff.

Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 04:58:32 -05:00
Andrew Victor
cf42553ab4 [PATCH] AT91RM9200 Ethernet: Move check_timer variable and use mod_timer()
Move the global 'check_timer' variable into the private data structure.
Also now use mod_timer().

Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 04:58:32 -05:00
Andrew Victor
c57ee096b6 [PATCH] AT91RM9200 Ethernet: Remove 'at91_dev' and use netdev_priv()
Remove the global 'at91_dev' variable.
Use netdev_priv() instead of casting dev->priv directly.

Signed-off-by: Andrew Victor <andrew@sanpeople.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-07 04:58:32 -05:00
Jeff Garzik
8d1413b280 Merge branch 'master' into upstream
Conflicts:

	drivers/net/netxen/netxen_nic.h
	drivers/net/netxen/netxen_nic_main.c
2006-12-07 04:57:19 -05:00
Randy Dunlap
272491ef42 [NETFILTER]: Fix non-ANSI func. decl.
Fix non-ANSI function declaration:
net/netfilter/nf_conntrack_core.c:1096:25: warning: non-ANSI function declaration of function 'nf_conntrack_flush'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-07 01:17:24 -08:00