The current VFIO_DEVICE_RESET interface only maps to PCI use cases
where we can isolate the reset to the individual PCI function. This
means the device must support FLR (PCIe or AF), PM reset on D3hot->D0
transition, device specific reset, or be a singleton device on a bus
for a secondary bus reset. FLR does not have widespread support,
PM reset is not very reliable, and bus topology is dictated by the
system and device design. We need to provide a means for a user to
induce a bus reset in cases where the existing mechanisms are not
available or not reliable.
This device specific extension to VFIO provides the user with this
ability. Two new ioctls are introduced:
- VFIO_DEVICE_PCI_GET_HOT_RESET_INFO
- VFIO_DEVICE_PCI_HOT_RESET
The first provides the user with information about the extent of
devices affected by a hot reset. This is essentially a list of
devices and the IOMMU groups they belong to. The user may then
initiate a hot reset by calling the second ioctl. We must be
careful that the user has ownership of all the affected devices
found via the first ioctl, so the second ioctl takes a list of file
descriptors for the VFIO groups affected by the reset. Each group
must have IOMMU protection established for the ioctl to succeed.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
net: sctp: Fix data chunk fragmentation for MTU values which are not multiple of 4
Initially the problem was observed with ipsec, but later it became clear that
SCTP data chunk fragmentation algorithm has problems with MTU values which are
not multiple of 4. Test program was used which just transmits 2000 bytes long
packets to other host. tcpdump was used to observe re-fragmentation in IP layer
after SCTP already fragmented data chunks.
With MTU 1500:
12:54:34.082904 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 1500)
10.151.38.153.39303 > 10.151.24.91.54321: sctp (1) [DATA] (B) [TSN: 2366088589] [SID: 0] [SSEQ 1] [PPID 0x0]
12:54:34.082933 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 596)
10.151.38.153.39303 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 2366088590] [SID: 0] [SSEQ 1] [PPID 0x0]
12:54:34.090576 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48)
10.151.24.91.54321 > 10.151.38.153.39303: sctp (1) [SACK] [cum ack 2366088590] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0]
With MTU 1499:
13:02:49.955220 IP (tos 0x2,ECT(0), ttl 64, id 48215, offset 0, flags [+], proto SCTP (132), length 1492)
10.151.38.153.39084 > 10.151.24.91.54321: sctp[|sctp]
13:02:49.955249 IP (tos 0x2,ECT(0), ttl 64, id 48215, offset 1472, flags [none], proto SCTP (132), length 28)
10.151.38.153 > 10.151.24.91: ip-proto-132
13:02:49.955262 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 600)
10.151.38.153.39084 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 404355346] [SID: 0] [SSEQ 1] [PPID 0x0]
13:02:49.956770 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48)
10.151.24.91.54321 > 10.151.38.153.39084: sctp (1) [SACK] [cum ack 404355346] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0]
Here problem in data portion limit calculation leads to re-fragmentation in IP,
which is sub-optimal. The problem is max_data initial value, which doesn't take
into account the fact, that data chunk must be padded to 4-bytes boundary.
It's enough to correct max_data, because all later adjustments are correctly
aligned to 4-bytes boundary.
After the fix is applied, everything is fragmented correctly for uneven MTUs:
15:16:27.083881 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 1496)
10.151.38.153.53417 > 10.151.24.91.54321: sctp (1) [DATA] (B) [TSN: 3077098183] [SID: 0] [SSEQ 1] [PPID 0x0]
15:16:27.083907 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto SCTP (132), length 600)
10.151.38.153.53417 > 10.151.24.91.54321: sctp (1) [DATA] (E) [TSN: 3077098184] [SID: 0] [SSEQ 1] [PPID 0x0]
15:16:27.085640 IP (tos 0x2,ECT(0), ttl 63, id 0, offset 0, flags [DF], proto SCTP (132), length 48)
10.151.24.91.54321 > 10.151.38.153.53417: sctp (1) [SACK] [cum ack 3077098184] [a_rwnd 79920] [#gap acks 0] [#dup tsns 0]
The bug was there for years already, but
- is a performance issue, the packets are still transmitted
- doesn't show up with default MTU 1500, but possibly with ipsec (MTU 1438)
Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Free_irq is not needed if there has been no request_irq. Free_irq is
removed from both the probe and remove functions. The correct request_irq
and free_irq are found in the open and close functions.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression e;
@@
*e = platform_get_irq(...);
... when != request_irq(e,...)
*free_irq(e,...)
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Solution:
=========
- Synchronize linux's `include/uapi/linux/in6.h'
with glibc's `inet/netinet/in.h'.
- Synchronize glibc's `inet/netinet/in.h with linux's
`include/uapi/linux/in6.h'.
- Allow including the headers in either other.
- First header included defines the structures and macros.
Details:
========
The kernel promises not to break the UAPI ABI so I don't
see why we can't just have the two userspace headers
coordinate?
If you include the kernel headers first you get those,
and if you include the glibc headers first you get those,
and the following patch arranges a coordination and
synchronization between the two.
Let's handle `include/uapi/linux/in6.h' from linux,
and `inet/netinet/in.h' from glibc and ensure they compile
in any order and preserve the required ABI.
These two patches pass the following compile tests:
cat >> test1.c <<EOF
int main (void) {
return 0;
}
EOF
gcc -c test1.c
cat >> test2.c <<EOF
int main (void) {
return 0;
}
EOF
gcc -c test2.c
One wrinkle is that the kernel has a different name for one of
the members in ipv6_mreq. In the kernel patch we create a macro
to cover the uses of the old name, and while that's not entirely
clean it's one of the best solutions (aside from an anonymous
union which has other issues).
I've reviewed the code and it looks to me like the ABI is
assured and everything matches on both sides.
Notes:
- You want netinet/in.h to include bits/in.h as early as possible,
but it needs in_addr so define in_addr early.
- You want bits/in.h included as early as possible so you can use
the linux specific code to define __USE_KERNEL_DEFS based on
the _UAPI_* macro definition and use those to cull in.h.
- glibc was missing IPPROTO_MH, added here.
Compile tested and inspected.
Reported-by: Thomas Backlund <tmb@mageia.org>
Cc: Thomas Backlund <tmb@mageia.org>
Cc: libc-alpha@sourceware.org
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: David S. Miller <davem@davemloft.net>
Tested-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Carlos O'Donell <carlos@redhat.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It upsets static analyzers when we don't check for allocation failure.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Having PCIe/PCI-X capability isn't enough to assume that there are
extended capabilities. Both specs define that the first capability
header is all zero if there are no extended capabilities. Testing
for this avoids an erroneous message about hiding capability 0x0 at
offset 0x100.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Jeff Kirsher says:
====================
This series contains updates to igb only.
Todd provides a fix for igb to not look for a PBA in the iNVM on
devices that are flashless.
Akeem provides igb patches to add a new PHY id for i354, as well as
a couple of patches to implement the new PHY id. He also provides
several patches to correctly report the appropriate media type as
well as correctly report advertised/supported link for i354 devices.
Lastly Akeem implements a 1 second delay mechanism for i210 devices
to avoid erroneous link issue with the link partner.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull x86 mm changes from Ingo Molnar:
"Misc smaller fixes:
- a parse_setup_data() boot crash fix
- a memblock and an __early_ioremap cleanup
- turn the always-on CONFIG_ARCH_MEMORY_PROBE=y into a configurable
option and turn it off - it's an unrobust debug facility, it
shouldn't be enabled by default"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: avoid remapping data in parse_setup_data()
x86: Use memblock_set_current_limit() to set limit for memblock.
mm: Remove unused variable idx0 in __early_ioremap()
mm/hotplug, x86: Disable ARCH_MEMORY_PROBE by default
Pull x86 relocation changes from Ingo Molnar:
"This tree contains a single change, ELF relocation handling in C - one
of the kernel randomization patches that makes sense even without
randomization present upstream"
* 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, relocs: Move ELF relocation handling to C
Pull timers/nohz changes from Ingo Molnar:
"It mostly contains fixes and full dynticks off-case optimizations, by
Frederic Weisbecker"
* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
nohz: Include local CPU in full dynticks global kick
nohz: Optimize full dynticks's sched hooks with static keys
nohz: Optimize full dynticks state checks with static keys
nohz: Rename a few state variables
vtime: Always debug check snapshot source _before_ updating it
vtime: Always scale generic vtime accounting results
vtime: Optimize full dynticks accounting off case with static keys
vtime: Describe overriden functions in dedicated arch headers
m68k: hardirq_count() only need preempt_mask.h
hardirq: Split preempt count mask definitions
context_tracking: Split low level state headers
vtime: Fix racy cputime delta update
vtime: Remove a few unneeded generic vtime state checks
context_tracking: User/kernel broundary cross trace events
context_tracking: Optimize context switch off case with static keys
context_tracking: Optimize guest APIs off case with static key
context_tracking: Optimize main APIs off case with static key
context_tracking: Ground setup for static key use
context_tracking: Remove full dynticks' hacky dependency on wide context tracking
nohz: Only enable context tracking on full dynticks CPUs
...
Pablo Neira Ayuso says:
====================
The following batch contains:
* Three fixes for the new synproxy target available in your
net-next tree, from Jesper D. Brouer and Patrick McHardy.
* One fix for TCPMSS to correctly handling the fragmentation
case, from Phil Oester. I'll pass this one to -stable.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the new 'recover_locks' kernel parameter to 'recover_lost_locks'
and change the default to 'false'. Document why in
Documentation/kernel-parameters.txt
Move the 'recover_lost_locks' kernel parameter to fs/nfs/super.c to
make it easy to backport to kernels prior to 3.6.x, which don't have
a separate NFSv4 module.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When an NFSv4 client loses contact with the server it can lose any
locks that it holds.
Currently when it reconnects to the server it simply tries to reclaim
those locks. This might succeed even though some other client has
held and released a lock in the mean time. So the first client might
think the file is unchanged, but it isn't. This isn't good.
If, when recovery happens, the locks cannot be claimed because some
other client still holds the lock, then we get a message in the kernel
logs, but the client can still write. So two clients can both think
they have a lock and can both write at the same time. This is equally
not good.
There was a patch a while ago
http://comments.gmane.org/gmane.linux.nfs/41917
which tried to address some of this, but it didn't seem to go
anywhere. That patch would also send a signal to the process. That
might be useful but for now this patch just causes writes to fail.
For NFSv4 (unlike v2/v3) there is a strong link between the lock and
the write request so we can fairly easily fail any IO of the lock is
gone. While some applications might not expect this, it is still
safer than allowing the write to succeed.
Because this is a fairly big change in behaviour a module parameter,
"recover_locks", is introduced which defaults to true (the current
behaviour) but can be set to "false" to tell the client not to try to
recover things that were lost.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add client side debugging to help trace socket connection/disconnection
and unexpected state change issues.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When CONFIG_NFS_V4_1 is not enabled, gcc emits this warning:
linux/fs/nfs/nfs4state.c:255:12: warning:
‘nfs4_begin_drain_session’ defined but not used [-Wunused-function]
static int nfs4_begin_drain_session(struct nfs_client *clp)
^
Eventually NFSv4.0 migration recovery will invoke this function, but
that has not yet been merged. Hide nfs4_begin_drain_session()
behind CONFIG_NFS_V4_1 for now.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
linux/fs/nfs/nfs4session.c:337:6: warning:
symbol 'nfs41_set_target_slotid' was not declared. Should it be static?
Move nfs41_set_target_slotid() and nfs41_update_target_slotid() back
behind CONFIG_NFS_V4_1, since, in the final revision of this work,
they are used only in NFSv4.1 and later.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pull x86 fb changes from Ingo Molnar:
"This tree includes preparatory patches for SimpleDRM driver support,
by David Herrmann. They clean up x86 framebuffer support by creating
simplefb devices wherever possible. More background can be found at
http://lwn.net/Articles/558104/"
* 'x86-fb-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fbdev: fbcon: select VT_HW_CONSOLE_BINDING
fbdev: efifb: bind to efi-framebuffer
fbdev: vesafb: bind to platform-framebuffer device
fbdev: simplefb: add common x86 RGB formats
x86: sysfb: move EFI quirks from efifb to sysfb
x86: provide platform-devices for boot-framebuffers
fbdev: simplefb: mark as fw and allocate apertures
fbdev: simplefb: add init through platform_data
Pull x86 cpu feature fixes from Ingo Molnar:
"Two small cpufeature support updates"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Fix override new_cpu_data.x86 with 486
x86, cpufeature: Use new CC_HAVE_ASM_GOTO
Pull tiny x86 boot cleanups from Ingo Molnar.
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Fix a sanity check in printf.c
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, boot: Fix warning due to undeclared strlen()
Pull x86/asmlinkage changes from Ingo Molnar:
"As a preparation for Andi Kleen's LTO patchset (link time
optimizations using GCC's -flto which build time optimization has
steadily increased in quality over the past few years and might
eventually be usable for the kernel too) this tree includes a handful
of preparatory patches that make function calling convention
annotations consistent again:
- Mark every function without arguments (or 64bit only) that is used
by assembly code with asmlinkage()
- Mark every function with parameters or variables that is used by
assembly code as __visible.
For the vanilla kernel this has documentation, consistency and
debuggability advantages, for the time being"
* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asmlinkage: Fix warning in xen asmlinkage change
x86, asmlinkage, vdso: Mark vdso variables __visible
x86, asmlinkage, power: Make various symbols used by the suspend asm code visible
x86, asmlinkage: Make dump_stack visible
x86, asmlinkage: Make 64bit checksum functions visible
x86, asmlinkage, paravirt: Add __visible/asmlinkage to xen paravirt ops
x86, asmlinkage, apm: Make APM data structure used from assembler visible
x86, asmlinkage: Make syscall tables visible
x86, asmlinkage: Make several variables used from assembler/linker script visible
x86, asmlinkage: Make kprobes code visible and fix assembler code
x86, asmlinkage: Make various syscalls asmlinkage
x86, asmlinkage: Make 32bit/64bit __switch_to visible
x86, asmlinkage: Make _*_start_kernel visible
x86, asmlinkage: Make all interrupt handlers asmlinkage / __visible
x86, asmlinkage: Change dotraplinkage into __visible on 32bit
x86: Fix sys_call_table type in asm/syscall.h
Pull x86/asm changes from Ingo Molnar:
"Main changes:
- Apply low level mutex optimization on x86-64, by Wedson Almeida
Filho.
- Change bitops to be naturally 'long', by H Peter Anvin.
- Add TSX-NI opcodes support to the x86 (instrumentation) decoder, by
Masami Hiramatsu.
- Add clang compatibility adjustments/workarounds, by Jan-Simon
Möller"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, doc: Update uaccess.h comment to reflect clang changes
x86, asm: Fix a compilation issue with clang
x86, asm: Extend definitions of _ASM_* with a raw format
x86, insn: Add new opcodes as of June, 2013
x86/ia32/asm: Remove unused argument in macro
x86, bitops: Change bitops to be native operand size
x86: Use asm-goto to implement mutex fast path on x86-64
Pull x86/apic changes from Ingo Molnar:
"Smaller fixes"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioapic: Check attr against the previous setting when programmed more than once
x86/ioapic/kcrash: Prevent crash_kexec() from deadlocking on ioapic_lock
x86/acpi: Fix incorrect sanity check in acpi_register_lapic()
Pull scheduler changes from Ingo Molnar:
"Various optimizations, cleanups and smaller fixes - no major changes
in scheduler behavior"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix the sd_parent_degenerate() code
sched/fair: Rework and comment the group_imb code
sched/fair: Optimize find_busiest_queue()
sched/fair: Make group power more consistent
sched/fair: Remove duplicate load_per_task computations
sched/fair: Shrink sg_lb_stats and play memset games
sched: Clean-up struct sd_lb_stat
sched: Factor out code to should_we_balance()
sched: Remove one division operation in find_busiest_queue()
sched/cputime: Use this_cpu_add() in task_group_account_field()
cpumask: Fix cpumask leak in partition_sched_domains()
sched/x86: Optimize switch_mm() for multi-threaded workloads
generic-ipi: Kill unnecessary variable - csd_flags
numa: Mark __node_set() as __always_inline
sched/fair: Cleanup: remove duplicate variable declaration
sched/__wake_up_sync_key(): Fix nr_exclusive tasks which lead to WF_SYNC clearing
The dpll actually runs at the port clock so we don't need
to multiply it again with the pixel multiplier to get the
adjusted_mode.clock. This is in contrast to the ironlake
pixel clock readout code which uses the fdi dotclock: That
one does _not_ run with multiplied pixels.
This issue goes back to the original clock readout code added
in
commit f1f644dc66
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date: Thu Jun 27 00:39:25 2013 +0300
drm/i915: get mode clock when reading the pipe config v9
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
The sdvo input timing needs to be the actual mode, the sdvo
encoder automatically adjusts for the need of pixel doubling or
quadrupling. This was lost in pipe config conversion of the
pixel multiplier in
commit 6cc5f341b5
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Wed Mar 27 00:44:53 2013 +0100
drm/i915: add pipe_config->pixel_multiplier
While at it ditch the intel_ prefix from the crtc in
intel_sdvo_mode_set.
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Historically we've run our own driver hotplug handling in our own
work-queue, which then launched the drm core hotplug handling in the
system workqueue. This is important since we flush our own driver
workqueue in the pageflip code while hodling modeset locks, and only
the drm hotplug code grabbed these locks. But with
commit 69787f7da6
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Tue Oct 23 18:23:34 2012 +0000
drm: run the hpd irq event code directly
this was changed and now we could deadlock in our flip handler if
there's a hotplug work blocking the progress of the crucial unpin
works. So this broke the careful deadlock avoidance implemented in
commit b4a98e57fc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Nov 1 09:26:26 2012 +0000
drm/i915: Flush outstanding unpin tasks before pageflipping
Since the rule thus far has been that work items on our own workqueue
may never grab modeset locks simply restore that rule again.
v2: Add a comment to the declaration of dev_priv->wq to warn readers
about the tricky implications of using it. Suggested by Chris Wilson.
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Stuart Abercrombie <sabercrombie@chromium.org>
Reported-by: Stuart Abercrombie <sabercrombie@chromium.org>
References: http://permalink.gmane.org/gmane.comp.freedesktop.xorg.drivers.intel/26239
Cc: stable@vger.kernel.org
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
[danvet: Squash in a comment at the place where we schedule the work.
Requested after-the-fact by Chris on irc since the hpd work isn't the
only place we botch this.]
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Pull perf changes from Ingo Molnar:
"As a first remark I'd like to point out that the obsolete '-f'
(--force) option, which has not done anything for several releases,
has been removed from 'perf record' and related utilities. Everyone
please update muscle memory accordingly! :-)
Main changes on the perf kernel side:
- Performance optimizations:
. for trace events, by Steve Rostedt.
. for time values, by Peter Zijlstra
- New hardware support:
. for Intel Silvermont (22nm Atom) CPUs, by Zheng Yan
. for Intel SNB-EP uncore PMUs, by Zheng Yan
- Enhanced hardware support:
. for Intel uncore PMUs: add filter support for QPI boxes, by Zheng Yan
- Core perf events code enhancements and fixes:
. for full-nohz feature handling, by Frederic Weisbecker
. for group events, by Jiri Olsa
. for call chains, by Frederic Weisbecker
. for event stream parsing, by Adrian Hunter
- New ABI details:
. Add attr->mmap2 attribute, by Stephane Eranian
. Add PERF_EVENT_IOC_ID ioctl to return event ID, by Jiri Olsa
. Export u64 time_zero on the mmap header page to allow TSC
calculation, by Adrian Hunter
. Add dummy software event, by Adrian Hunter.
. Add a new PERF_SAMPLE_IDENTIFIER to make samples always
parseable, by Adrian Hunter.
. Make Power7 events available via sysfs, by Runzhen Wang.
- Code cleanups and refactorings:
. for nohz-full, by Frederic Weisbecker
. for group events, by Jiri Olsa
- Documentation updates:
. for perf_event_type, by Peter Zijlstra
Main changes on the perf tooling side (some of these tooling changes
utilize the above kernel side changes):
- Lots of 'perf trace' enhancements:
. Make 'perf trace' command line arguments consistent with
'perf record', by David Ahern.
. Allow specifying syscalls a la strace, by Arnaldo Carvalho de Melo.
. Add --verbose and -o/--output options, by Arnaldo Carvalho de Melo.
. Support ! in -e expressions, to filter a list of syscalls,
by Arnaldo Carvalho de Melo.
. Arg formatting improvements to allow masking arguments in
syscalls such as futex and open, where the some arguments are
ignored and thus should not be printed depending on other args,
by Arnaldo Carvalho de Melo.
. Beautify futex open, openat, open_by_handle_at, lseek and futex
syscalls, by Arnaldo Carvalho de Melo.
. Add option to analyze events in a file versus live, so that
one can do:
[root@zoo ~]# perf record -a -e raw_syscalls:* sleep 1
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 25.150 MB perf.data (~1098836 samples) ]
[root@zoo ~]# perf trace -i perf.data -e futex --duration 1
17.799 ( 1.020 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, ua
113.344 (95.429 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 4294967
133.778 ( 1.042 ms): 18004 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 429496
[root@zoo ~]#
By David Ahern.
. Honor target pid / tid options when analyzing a file, by David Ahern.
. Introduce better formatting of syscall arguments, including so
far beautifiers for mmap, madvise, syscall return values,
by Arnaldo Carvalho de Melo.
. Handle HUGEPAGE defines in the mmap beautifier, by David Ahern.
- 'perf report/top' enhancements:
. Do annotation using /proc/kcore and /proc/kallsyms when
available, removing the forced need for a vmlinux file kernel
assembly annotation. This also improves this use case because
vmlinux has just the initial kernel image, not what is actually
in use after various code patchings by things like alternatives.
By Adrian Hunter.
. Add --ignore-callees=<regex> option to collapse undesired parts
of call graphs, by Greg Price.
. Simplify symbol filtering by doing it at machine class level,
by Adrian Hunter.
. Add support for callchains in the gtk UI, by Namhyung Kim.
. Add --objdump option to 'perf top', by Sukadev Bhattiprolu.
- 'perf kvm' enhancements:
. Add option to print only events that exceed a specified time
duration, by David Ahern.
. Improve stack trace printing, by David Ahern.
. Update documentation of the live command, by David Ahern
. Add perf kvm stat live mode that combines aspects of 'perf kvm
stat' record and report, by David Ahern.
. Add option to analyze specific VM in perf kvm stat report, by
David Ahern.
. Do not require /lib/modules/* on a guest, by Jason Wessel.
- 'perf script' enhancements:
. Fix symbol offset computation for some dsos, by David Ahern.
. Fix named threads support, by David Ahern.
. Don't install scripting files files when perl/python support
is disabled, by Arnaldo Carvalho de Melo.
- 'perf test' enhancements:
. Add various improvements and fixes to the "vmlinux matches
kallsyms" 'perf test' entry, related to the /proc/kcore
annotation feature. By Adrian Hunter.
. Add sample parsing test, by Adrian Hunter.
. Add test for reading object code, by Adrian Hunter.
. Add attr record group sampling test, by Jiri Olsa.
. Misc testing infrastructure improvements and other details,
by Jiri Olsa.
- 'perf list' enhancements:
. Skip unsupported hardware events, by Namhyung Kim.
. List pmu events, by Andi Kleen.
- 'perf diff' enhancements:
. Add support for more than two files comparison, by Jiri Olsa.
- 'perf sched' enhancements:
. Various improvements, including removing reliance on some
scheduler tracepoints that provide the same information as the
PERF_RECORD_{FORK,EXIT} events. By David Ahern.
. Remove odd build stall by moving a large struct initialization
from a local variable to a global one, by Namhyung Kim.
- 'perf stat' enhancements:
. Add --initial-delay option to skip measuring for a defined
startup phase, by Andi Kleen.
- Generic perf tooling infrastructure/plumbing changes:
. Tidy up sample parsing validation, by Adrian Hunter.
. Fix up jobserver setup in libtraceevent Makefile.
by Arnaldo Carvalho de Melo.
. Debug improvements, by Adrian Hunter.
. Fix correlation of samples coming after PERF_RECORD_EXIT event,
by David Ahern.
. Improve robustness of the topology parsing code,
by Stephane Eranian.
. Add group leader sampling, that allows just one event in a group
to sample while the other events have just its values read,
by Jiri Olsa.
. Add support for a new modifier "D", which requests that the
event, or group of events, be pinned to the PMU.
By Michael Ellerman.
. Support callchain sorting based on addresses, by Andi Kleen
. Prep work for multi perf data file storage, by Jiri Olsa.
. libtraceevent cleanups, by Namhyung Kim.
And lots and lots of other fixes and code reorganizations that did not
make it into the list, see the shortlog, diffstat and the Git log for
details!"
[ Also merge a leftover from the 3.11 cycle ]
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Prevent race in unthrottling code
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (237 commits)
perf trace: Tell arg formatters the arg index
perf trace: Add beautifier for open's flags arg
perf trace: Add beautifier for lseek's whence arg
perf tools: Fix symbol offset computation for some dsos
perf list: Skip unsupported events
perf tests: Add 'keep tracking' test
perf tools: Add support for PERF_COUNT_SW_DUMMY
perf: Add a dummy software event to keep tracking
perf trace: Add beautifier for futex 'operation' parm
perf trace: Allow syscall arg formatters to mask args
perf: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node()
perf: Export struct perf_branch_entry to userspace
perf: Add attr->mmap2 attribute to an event
perf/x86: Add Silvermont (22nm Atom) support
perf/x86: use INTEL_UEVENT_EXTRA_REG to define MSR_OFFCORE_RSP_X
perf trace: Handle missing HUGEPAGE defines
perf trace: Honor target pid / tid options when analyzing a file
perf trace: Add option to analyze events in a file versus live
perf evlist: Add tracepoint lookup by name
perf tests: Add a sample parsing test
...
Let's not add a function for every external interrupt subclass for
which we need reference counting. Just have two register/unregister
functions which have a subclass parameter:
void irq_subclass_register(enum irq_subclass subclass);
void irq_subclass_unregister(enum irq_subclass subclass);
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Use hlists for the hashed array of external interrupt handlers.
Reduces the size of the array by 50% (2KB).
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
This is the same as what other architectures did.
The change has also the advantage that there won't be any interleaving
messages between printk() and print_symbol().
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Pull core/locking changes from Ingo Molnar:
"Main changes:
- another mutex optimization, from Davidlohr Bueso
- improved lglock lockdep tracking, from Michel Lespinasse
- [ assorted smaller updates, improvements, cleanups. ]"
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
generic-ipi/locking: Fix misleading smp_call_function_any() description
hung_task debugging: Print more info when reporting the problem
mutex: Avoid label warning when !CONFIG_MUTEX_SPIN_ON_OWNER
mutex: Do not unnecessarily deal with waiters
mutex: Fix/document access-once assumption in mutex_can_spin_on_owner()
lglock: Update lockdep annotations to report recursive local locks
lockdep: Introduce lock_acquire_exclusive()/shared() helper macros
Pull RCU updates from Ingo Molnar:
"Main RCU changes this cycle were:
- Full-system idle detection. This is for use by Frederic
Weisbecker's adaptive-ticks mechanism. Its purpose is to allow the
timekeeping CPU to shut off its tick when all other CPUs are idle.
- Miscellaneous fixes.
- Improved rcutorture test coverage.
- Updated RCU documentation"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU
nohz_full: Add full-system-idle state machine
jiffies: Avoid undefined behavior from signed overflow
rcu: Simplify _rcu_barrier() processing
rcu: Make rcutorture emit online failures if verbose
rcu: Remove unused variable from rcu_torture_writer()
rcu: Sort rcutorture module parameters
rcu: Increase rcutorture test coverage
rcu: Add duplicate-callback tests to rcutorture
doc: Fix memory-barrier control-dependency example
rcu: Update RTFP documentation
nohz_full: Add full-system-idle arguments to API
nohz_full: Add full-system idle states and variables
nohz_full: Add per-CPU idle-state tracking
nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
nohz_full: Add Kconfig parameter for scalable detection of all-idle state
nohz_full: Add testing information to documentation
rcu: Eliminate unused APIs intended for adaptive ticks
rcu: Select IRQ_WORK from TREE_PREEMPT_RCU
rculist: list_first_or_null_rcu() should use list_entry_rcu()
...
Function test_and_clear_bit implies a memory barrier, so subsequent
memory barriers are unnecessary.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
scale_stime() silently assumes that stime < rtime, otherwise
when stime == rtime and both values are big enough (operations
on them do not fit in 32 bits), the resulting scaling stime can
be bigger than rtime. In consequence utime = rtime - stime
results in negative value.
User space visible symptoms of the bug are overflowed TIME
values on ps/top, for example:
$ ps aux | grep rcu
root 8 0.0 0.0 0 0 ? S 12:42 0:00 [rcuc/0]
root 9 0.0 0.0 0 0 ? S 12:42 0:00 [rcub/0]
root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt]
root 11 0.1 0.0 0 0 ? S 12:42 0:02 [rcuop/0]
root 12 62422329 0.0 0 0 ? S 12:42 21114581:35 [rcuop/1]
root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt]
or overflowed utime values read directly from /proc/$PID/stat
Reference:
https://lkml.org/lkml/2013/8/20/259
Reported-and-tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: stable@vger.kernel.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Call generic_write_sync() from the deferred I/O completion handler if
O_DSYNC is set for a write request. Also make sure various callers
don't call generic_write_sync if the direct I/O code returns
-EIOCBQUEUED.
Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add support to the core direct-io code to defer AIO completions to user
context using a workqueue. This replaces opencoded and less efficient
code in XFS and ext4 (we save a memory allocation for each direct IO)
and will be needed to properly support O_(D)SYNC for AIO.
The communication between the filesystem and the direct I/O code requires
a new buffer head flag, which is a bit ugly but not avoidable until the
direct I/O code stops abusing the buffer_head structure for communicating
with the filesystems.
Currently this creates a per-superblock unbound workqueue for these
completions, which is taken from an earlier patch by Jan Kara. I'm
not really convinced about this use and would prefer a "normal" global
workqueue with a high concurrency limit, but this needs further discussion.
JK: Fixed ext4 part, dynamic allocation of the workqueue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With this series, this check is no longer required and
we shouldn't need to reject drivers DMA'ing more than the
MAX number of slots.
Signed-off-by: Joel Fernandes <joelf@ti.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Dummy slot has been used as a way for missed-events not to be
reported as missing. This has been particularly troublesome for cases
where we might want to temporarily pause all incoming events.
For EDMA DMAC, there is no way to do any such pausing of events as
the occurence of the "next" event is not software controlled.
Using "edma_pause" in IRQ handlers doesn't help as by then the event
in concern from the slave is already missed.
Linking a dummy slot, is seen to absorb these events which we didn't
want to miss. So we don't link to dummy, but instead leave it linked
to NULL set, allow an error condition and detect the channel that
missed it.
Consider the case where we have a scatter-list like:
SG1->SG2->SG3->SG4->SG5->SG6->Null
For ex, for a MAX_NR_SG of 2, earlier we were splitting this as:
SG1->SG2->Null
SG3->SG4->Null
SG5->SG6->Null
Now we split it as
SG1->SG2->Null
SG3->SG4->Null
SG5->SG6->Dummy
This approach results in lesser unwanted interrupts that occur
for the last list split. The Dummy slot has the property of not
raising an error condition if events are missed unlike the Null
slot. We are OK with this as we're done with processing the
whole list once we reach Dummy.
Signed-off-by: Joel Fernandes <joelf@ti.com>
[modifed duplicate s-o-b & patch title]
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
In an effort to move to using Scatter gather lists of any size with
EDMA as discussed at [1] instead of placing limitations on the driver,
we work through the limitations of the EDMAC hardware to find missed
events and issue them.
The sequence of events that require this are:
For the scenario where MAX slots for an EDMA channel is 3:
SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> Null
The above SG list will have to be DMA'd in 2 sets:
(1) SG1 -> SG2 -> SG3 -> Null
(2) SG4 -> SG5 -> SG6 -> Null
After (1) is succesfully transferred, the events from the MMC controller
donot stop coming and are missed by the time we have setup the transfer
for (2). So here, we catch the events missed as an error condition and
issue them manually.
In the second part of the patch, we make handle the NULL slot cases:
For crypto IP, we continue to receive events even continuously in
NULL slot, the setup of the next set of SG elements happens after
the error handler executes. This is results in some recursion problems.
Due to this, we continously receive error interrupts when we manually
trigger an event from the error handler.
We fix this, by first detecting if the Channel is currently transferring
from a NULL slot or not, that's where the edma_read_slot in the error
callback from interrupt handler comes in. With this we can determine if
the set up of the next SG list has completed, and we manually trigger
only in this case. If the setup has _not_ completed, we are still in NULL
so we just set a missed flag and allow the manual triggerring to happen
in edma_execute which will be eventually called. This fixes the above
mentioned race conditions seen with the crypto drivers.
[1] http://marc.info/?l=linux-omap&m=137416733628831&w=2
Signed-off-by: Joel Fernandes <joelf@ti.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Manual trigger for events missed as a result of splitting a
scatter gather list and DMA'ing it in batches. Add a helper
function to trigger a channel incase any such events are missed.
Signed-off-by: Joel Fernandes <joelf@ti.com>
Acked-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Process SG-elements in batches of MAX_NR_SG if they are greater
than MAX_NR_SG. Due to this, at any given time only those many
slots will be used in the given channel no matter how long the
scatter list is. We keep track of how much has been written
inorder to process the next batch of elements in the scatter-list
and detect completion.
For such intermediate transfer completions (one batch of MAX_NR_SG),
make use of pause and resume functions instead of start and stop
when such intermediate transfer is in progress or completed as we
donot want to clear any pending events.
Signed-off-by: Joel Fernandes <joelf@ti.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Changes are made here for configuring existing parameters to support
DMA'ing them out in batches as needed.
Also allocate as many as slots as needed by the SG list, but not more
than MAX_NR_SG. Then these slots will be reused accordingly.
For ex, if MAX_NR_SG=10, and number of SG entries is 40, still only
10 slots will be allocated to DMA the entire SG list of size 40.
Also enable TC interrupts for slots that are a last in a current
iteration, or that fall on a MAX_NR_SG boundary.
Signed-off-by: Joel Fernandes <joelf@ti.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>