Commit e986850598 ("mm, vmscan: only evict file pages when we have
plenty") makes a point of not going for anonymous memory while there is
still enough inactive cache around.
The check was added only for global reclaim, but it is just as useful to
reduce swapping in memory cgroup reclaim:
200M-memcg-defconfig-j2
vanilla patched
Real time 454.06 ( +0.00%) 453.71 ( -0.08%)
User time 668.57 ( +0.00%) 668.73 ( +0.02%)
System time 128.92 ( +0.00%) 129.53 ( +0.46%)
Swap in 1246.80 ( +0.00%) 814.40 ( -34.65%)
Swap out 1198.90 ( +0.00%) 827.00 ( -30.99%)
Pages allocated 16431288.10 ( +0.00%) 16434035.30 ( +0.02%)
Major faults 681.50 ( +0.00%) 593.70 ( -12.86%)
THP faults 237.20 ( +0.00%) 242.40 ( +2.18%)
THP collapse 241.20 ( +0.00%) 248.50 ( +3.01%)
THP splits 157.30 ( +0.00%) 161.40 ( +2.59%)
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As per documentation and other places calling putback_lru_pages(),
putback_lru_pages() is called on error only. Make the CMA code behave
consistently.
[akpm@linux-foundation.org: remove a test-n-branch in the wrapup code]
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the warning:
drivers/md/persistent-data/dm-transaction-manager.c:28:1: warning: "HASH_SIZE" redefined
In file included from include/linux/elevator.h:5,
from include/linux/blkdev.h:216,
from drivers/md/persistent-data/dm-block-manager.h:11,
from drivers/md/persistent-data/dm-transaction-manager.h:10,
from drivers/md/persistent-data/dm-transaction-manager.c:6:
include/linux/hashtable.h:22:1: warning: this is the location of the previous definition
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull powerpc updates from Benjamin Herrenschmidt:
"So from the depth of frozen Minnesota, here's the powerpc pull request
for 3.9. It has a few interesting highlights, in addition to the
usual bunch of bug fixes, minor updates, embedded device tree updates
and new boards:
- Hand tuned asm implementation of SHA1 (by Paulus & Michael
Ellerman)
- Support for Doorbell interrupts on Power8 (kind of fast
thread-thread IPIs) by Ian Munsie
- Long overdue cleanup of the way we handle relocation of our open
firmware trampoline (prom_init.c) on 64-bit by Anton Blanchard
- Support for saving/restoring & context switching the PPR (Processor
Priority Register) on server processors that support it. This
allows the kernel to preserve thread priorities established by
userspace. By Haren Myneni.
- DAWR (new watchpoint facility) support on Power8 by Michael Neuling
- Ability to change the DSCR (Data Stream Control Register) which
controls cache prefetching on a running process via ptrace by
Alexey Kardashevskiy
- Support for context switching the TAR register on Power8 (new
branch target register meant to be used by some new specific
userspace perf event interrupt facility which is yet to be enabled)
by Ian Munsie.
- Improve preservation of the CFAR register (which captures the
origin of a branch) on various exception conditions by Paulus.
- Move the Bestcomm DMA driver from arch powerpc to drivers/dma where
it belongs by Philippe De Muyter
- Support for Transactional Memory on Power8 by Michael Neuling
(based on original work by Matt Evans). For those curious about
the feature, the patch contains a pretty good description."
(See commit db8ff90702: "powerpc: Documentation for transactional
memory on powerpc" for the mentioned description added to the file
Documentation/powerpc/transactional_memory.txt)
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (140 commits)
powerpc/kexec: Disable hard IRQ before kexec
powerpc/85xx: l2sram - Add compatible string for BSC9131 platform
powerpc/85xx: bsc9131 - Correct typo in SDHC device node
powerpc/e500/qemu-e500: enable coreint
powerpc/mpic: allow coreint to be determined by MPIC version
powerpc/fsl_pci: Store the pci ctlr device ptr in the pci ctlr struct
powerpc/85xx: Board support for ppa8548
powerpc/fsl: remove extraneous DIU platform functions
arch/powerpc/platforms/85xx/p1022_ds.c: adjust duplicate test
powerpc: Documentation for transactional memory on powerpc
powerpc: Add transactional memory to pseries and ppc64 defconfigs
powerpc: Add config option for transactional memory
powerpc: Add transactional memory to POWER8 cpu features
powerpc: Add new transactional memory state to the signal context
powerpc: Hook in new transactional memory code
powerpc: Routines for FP/VSX/VMX unavailable during a transaction
powerpc: Add transactional memory unavaliable execption handler
powerpc: Add reclaim and recheckpoint functions for context switching transactional memory processes
powerpc: Add FP/VSX and VMX register load functions for transactional memory
powerpc: Add helper functions for transactional memory context switching
...
Commit d2e5f0c (ACPI / PCI: Rework the setup and cleanup of device
wakeup) moved the initial disabling of system wakeup for PCI devices
into a place where it can actually work and that exposed a hidden old
issue with crap^Wunusual system designs where the same power
resources are used for both wakeup power and device power control at
run time.
Namely, say there is one power resource such that the ACPI power
state D0 of a PCI device depends on that power resource (i.e. the
device is in D0 when that power resource is "on") and it is used
as a wakeup power resource for the same device. Then, calling
acpi_pci_sleep_wake(pci_dev, false) for the device in question will
cause the reference counter of that power resource to drop to 0,
which in turn will cause it to be turned off. As a result, the
device will go into D3cold at that point, although it should have
stayed in D0.
As it turns out, that happens to USB controllers on some laptops
and USB becomes unusable on those machines as a result, which is
a major regression from v3.8.
To fix this problem, (1) increment the reference counters of wakup
power resources during their initialization if they are "on"
initially, (2) prevent acpi_disable_wakeup_device_power() from
decrementing the reference counters of wakeup power resources that
were not enabled for wakeup power previously, and (3) prevent
acpi_enable_wakeup_device_power() from incrementing the reference
counters of wakeup power resources that already are enabled for
wakeup power.
In addition to that, if it is impossible to determine the initial
states of wakeup power resources, avoid enabling wakeup for devices
whose wakeup power depends on those power resources.
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Fabio Baltieri <fabio.baltieri@linaro.org>
Tested-by: Fabio Baltieri <fabio.baltieri@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Trivial, but WRITE SAME is an SBC command so it seems strange for a
related function (defined in target_core_sbc.c) to be in the spc_
namespace.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
The sock_diag_lock_handler() and sock_diag_unlock_handler() actually
make the code less readable. Get rid of them and make the lock usage
and access to sock_diag_handlers[] clear on the first sight.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Userland can send a netlink message requesting SOCK_DIAG_BY_FAMILY
with a family greater or equal then AF_MAX -- the array size of
sock_diag_handlers[]. The current code does not test for this
condition therefore is vulnerable to an out-of-bound access opening
doors for a privilege escalation.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mlx4_en driver allocates the number of objects for the CPU affinity
reverse-map based on the number of rx rings of the device. However,
mlx4_assign_eq() calls irq_cpu_rmap_add() as many times as IRQ's are
assigned to EQ's, which can be as large as mlx4_dev->caps.comp_pool. If
caps.comp_pool is larger than rx_ring_num we will eventually hit the
BUG_ON() in cpu_rmap_add().
Fix this problem by allocating space for the maximum number of CPU
affinity reverse-map objects we might want to add.
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
Acked-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The memory to hold the network device tx_cq is not being allocated with
the correct size in mlx4_en_init_netdev(). It should use MAX_TX_RINGS
instead of MAX_RX_RINGS. This can cause problems if the number of tx
rings being used is greater than MAX_RX_RINGS.
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
Acked-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Disable hard IRQ before kexec a new kernel image.
Not doing it can result in corrupted data in the memory segments
reserved for the new kernel.
Signed-off-by: Phileas Fogg <phileas-fogg@mail.ru>
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull parisc updates from Helge Deller.
The bulk of this is optimized page coping/clearing and cache flushing
(virtual caches are lovely) by John David Anglin.
* 'parisc-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: (31 commits)
arch/parisc/include/asm: use ARRAY_SIZE macro in mmzone.h
parisc: remove empty lines and unnecessary #ifdef coding in include/asm/signal.h
parisc: sendfile and sendfile64 syscall cleanups
parisc: switch to available compat_sched_rr_get_interval implementation
parisc: fix fallocate syscall
parisc: fix error return codes for rt_sigaction and rt_sigprocmask
parisc: convert msgrcv and msgsnd syscalls to use compat layer
parisc: correctly wire up mq_* functions for CONFIG_COMPAT case
parisc: fix personality on 32bit kernel
parisc: wire up process_vm_readv, process_vm_writev, kcmp and finit_module syscalls
parisc: led driver requires CONFIG_VM_EVENT_COUNTERS
parisc: remove unused compat_rt_sigframe.h header
parisc/mm/fault.c: Port OOM changes to do_page_fault
parisc: space register variables need to be in native length (unsigned long)
parisc: fix ptrace breakage
parisc: always detect multiple physical ranges
parisc: ensure that mmapped shared pages are aligned at SHMLBA addresses
parisc: disable preemption while flushing D- or I-caches through TMPALIAS region
parisc: remove IRQF_DISABLED
parisc: fixes and cleanups in page cache flushing (4/4)
...
Right now it's safe only during initial mount *and* functions are asking
to be abused for dynamic adding of objects.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.
CC: Christoph Hellwig <hch@infradead.org>
CC: Jens Axboe <axboe@kernel.dk>
CC: Jeff Moyer <jmoyer@redhat.com>
CC: stable@vger.kernel.org
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Allocating a file structure in function get_empty_filp() might fail because
of several reasons:
- not enough memory for file structures
- operation is not allowed
- user is over its limit
Currently the function returns NULL in all cases and we loose the exact
reason of the error. All callers of get_empty_filp() assume that the function
can fail with ENFILE only.
Return error through pointer. Change all callers to preserve this error code.
[AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
(things remaining here deal with alloc_file()), removed pipe(2) behaviour change]
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's safe only under namespace_sem or vfsmount_lock; all places
in fs/namespace.c that want mnt->mnt_ns->user_ns actually want to use
current->nsproxy->mnt_ns->user_ns (note the calls of check_mnt() in
there).
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull ia64 build breakage fix from Tony Luck.
* tag 'please-pull-fix-ia64-build' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
sched: move RR_TIMESLICE from sysctl.h to rt.h
Pull core locking changes from Ingo Molnar:
"The biggest change is the rwsem lock-steal improvements, both to the
assembly optimized and the spinlock based variants.
The other notable change is the clean up of the seqlock implementation
to be based on the seqcount infrastructure.
The rest is assorted smaller debuggability, cleanup and continued -rt
locking changes."
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rwsem-spinlock: Implement writer lock-stealing for better scalability
futex: Revert "futex: Mark get_robust_list as deprecated"
generic: Use raw local irq variant for generic cmpxchg
lockdep: Selftest: convert spinlock to raw spinlock
seqlock: Use seqcount infrastructure
seqlock: Remove unused functions
ntp: Make ntp_lock raw
intel_idle: Convert i7300_idle_lock to raw_spinlock
locking: Various static lock initializer fixes
lockdep: Print more info when MAX_LOCK_DEPTH is exceeded
rwsem: Implement writer lock-stealing for better scalability
lockdep: Silence warning if CONFIG_LOCKDEP isn't set
watchdog: Use local_clock for get_timestamp()
lockdep: Rename print_unlock_inbalance_bug() to print_unlock_imbalance_bug()
locking/stat: Fix a typo
Pull x86 microcode loading update from Peter Anvin:
"This patchset lets us update the CPU microcode very, very early in
initialization if the BIOS fails to do so (never happens, right?)
This is handy for dealing with things like the Atom erratum where we
have to run without PSE because microcode loading happens too late.
As I mentioned in the x86/mm push request it depends on that
infrastructure but it is otherwise a standalone feature."
* 'x86/microcode' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/Kconfig: Make early microcode loading a configuration feature
x86/mm/init.c: Copy ucode from initrd image to kernel memory
x86/head64.c: Early update ucode in 64-bit
x86/head_32.S: Early update ucode in 32-bit
x86/microcode_intel_early.c: Early update ucode on Intel's CPU
x86/tlbflush.h: Define __native_flush_tlb_global_irq_disabled()
x86/microcode_intel_lib.c: Early update ucode on Intel's CPU
x86/microcode_core_early.c: Define interfaces for early loading ucode
x86/common.c: load ucode in 64 bit or show loading ucode info in 32 bit on AP
x86/common.c: Make have_cpuid_p() a global function
x86/microcode_intel.h: Define functions and macros for early loading ucode
x86, doc: Documentation for early microcode loading
With commit 8170e6bed4 ("x86, 64bit: Use a #PF handler to materialize
early mappings on demand") we started hitting an early bootup crash
where the Xen hypervisor would inform us that:
(XEN) d7:v0: unhandled page fault (ec=0000)
(XEN) Pagetable walk from ffffea000005b2d0:
(XEN) L4[0x1d4] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 7 (vcpu#0) crashed on cpu#3:
(XEN) ----[ Xen-4.2.0 x86_64 debug=n Not tainted ]----
.. that Xen was unable to context switch back to dom0.
Looking at the calling stack we find:
[<ffffffff8103feba>] xen_get_user_pgd+0x5a <--
[<ffffffff8103feba>] xen_get_user_pgd+0x5a
[<ffffffff81042d27>] xen_write_cr3+0x77
[<ffffffff81ad2d21>] init_mem_mapping+0x1f9
[<ffffffff81ac293f>] setup_arch+0x742
[<ffffffff81666d71>] printk+0x48
We are trying to figure out whether we need to up-date the user PGD as
well. Please keep in mind that under 64-bit PV guests we have a limited
amount of rings: 0 for the Hypervisor, and 1 for both the Linux kernel
and user-space. As such the Linux pvops'fied version of write_cr3
checks if it has to update the user-space cr3 as well.
That clearly is not needed during early bootup. The recent changes (see
above git commit) streamline the x86 page table allocation to be much
simpler (And also incidentally the #PF handler ends up in spirit being
similar to how the Xen toolstack sets up the initial page-tables).
The fix is to have an early-bootup version of cr3 that just loads the
kernel %cr3. The later version - which also handles user-page
modifications will be used after the initial page tables have been
setup.
[ hpa: removed a redundant #ifdef and made the new function __init.
Also note that x86-32 already has such an early xen_write_cr3. ]
Tested-by: "H. Peter Anvin" <hpa@zytor.com>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1361579812-23709-1-git-send-email-konrad.wilk@oracle.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Different versions of glibc are broken in different ways, but the short of
it is that for the time being, frsize should == bsize, and be used as the
multiple for the blocks, free, and available fields. This mirrors what is
done for NFS. The previous reporting of the page size for frsize meant
that newer glibc and df would report a very small value for the fs size.
Fixes http://tracker.ceph.com/issues/3793.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
randconfig build reports the following error which is caused by
CONFIG_PM_OPP being unset.
CC arch/arm/mach-imx/mach-imx6q.o
arch/arm/mach-imx/mach-imx6q.c: In function ‘imx6q_opp_init’:
arch/arm/mach-imx/mach-imx6q.c:248:2: error: implicit declaration of function ‘of_init_opp_table’ [-Werror=implicit-function-declaration]
Fix the error by giving a more correct condition for empty
of_init_opp_table() implementation.
Reported-by: Rob Herring <robherring2@gmail.com>
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Acked-by: Nishanth Menon <nm@ti.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After commit 92ef2a2 (ACPI: Change the ordering of PCI root bridge
driver registrarion), acpi_hest_init() is never called for acpi=off
(acpi_disabled), so hest_disable is not set, but hest_tab is NULL,
which causes apei_hest_parse() to crash when it is called from
aer_acpi_firmware_first().
Fix that by making apei_hest_parse() check if hest_tab is not NULL
in addition to checking hest_disable. Also remove the now useless
acpi_disabled check from apei_hest_parse().
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit def8203 ("microblaze idle: delete pm_idle") introduced the following
compile error:
arch/microblaze/kernel/process.c: In function 'cpu_idle':
arch/microblaze/kernel/process.c💯 error: 'idle' undeclared (first use in this function)
arch/microblaze/kernel/process.c💯 error: (Each undeclared identifier is reported only once
arch/microblaze/kernel/process.c💯 error: for each function it appears in.)
arch/microblaze/kernel/process.c:106: error: implicit declaration of function 'idle'
This patch fixes it.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Acked-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 26bab0c ("blackfin idle: delete pm_idle") introduced the following
compile error:
arch/blackfin/kernel/process.c: In function ‘cpu_idle’:
arch/blackfin/kernel/process.c:83: error: ‘idle’ undeclared (first use in this function)
arch/blackfin/kernel/process.c:83: error: (Each undeclared identifier is reported only once
arch/blackfin/kernel/process.c:83: error: for each function it appears in.)
arch/blackfin/kernel/process.c:88: error: implicit declaration of function ‘idle’
This patch fixes it.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Acked-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>