In xen_set_pte() if batching is unavailable (because the caller is in
an interrupt context such as handling a page fault) it would fall back
to using native_set_pte() and trapping and emulating the PTE write.
On 32-bit guests this requires two traps for each PTE write (one for
each dword of the PTE). Instead, do one mmu_update hypercall
directly.
During construction of the initial page tables, continue to use
native_set_pte() because most of the PTEs being set are in writable
and unpinned pages (see phys_pmd_init() in arch/x86/mm/init_64.c) and
using a hypercall for this is very expensive.
This significantly improves page fault performance in 32-bit PV
guests.
lmbench3 test Before After Improvement
----------------------------------------------
lat_pagefault 3.18 us 2.32 us 27%
lat_proc fork 356 us 313.3 us 11%
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Coverity points out that we do not free in one case the
pr_backup - and sure enough we forgot.
Found by Coverity (CID 401970)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If a driver leaves its poll method NULL, the device is assumed to
be both readable and writable without blocking.
This patch add .poll method to xen mcelog device driver, so that
when mcelog use system calls like ppoll or select, it would be
blocked when no data available, and avoid spinning at CPU.
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
copy_to_user might sleep and print a stack trace if it is executed
in an atomic spinlock context. Like this:
(XEN) CMCI: send CMCI to DOM0 through virq
BUG: sleeping function called from invalid context at /home/konradinux/kernel.h:199
in_atomic(): 1, irqs_disabled(): 0, pid: 4581, name: mcelog
Pid: 4581, comm: mcelog Tainted: G O 3.5.0-rc1upstream-00003-g149000b-dirty #1
[<ffffffff8109ad9a>] __might_sleep+0xda/0x100
[<ffffffff81329b0b>] xen_mce_chrdev_read+0xab/0x140
[<ffffffff81148945>] vfs_read+0xc5/0x190
[<ffffffff81148b0c>] sys_read+0x4c/0x90
[<ffffffff815bd039>] system_call_fastpath+0x16
This patch schedule a workqueue for IRQ handler to poll the data,
and use mutex instead of spinlock, so copy_to_user sleep in atomic
context would not occur.
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This patch provide Xen physical cpus online/offline sys interface.
User can use it for their own purpose, like power saving:
by offlining some cpus when light workload it save power greatly.
Its basic workflow is, user online/offline cpu via sys interface,
then hypercall xen to implement, after done xen inject virq back to dom0,
and then dom0 sync cpu status.
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
there are 3 funcs which need to be _initcalled in a logic sequence:
1. xen_late_init_mcelog
2. mcheck_init_device
3. threshold_init_device
xen_late_init_mcelog must register xen_mce_chrdev_device before
native mce_chrdev_device registration if running under xen platform;
mcheck_init_device should be inited before threshold_init_device to
initialize mce_device, otherwise a a NULL ptr dereference will cause panic.
so we use following _initcalls
1. device_initcall(xen_late_init_mcelog);
2. device_initcall_sync(mcheck_init_device);
3. late_initcall(threshold_init_device);
when running under xen, the initcall order is 1,2,3;
on baremetal, we skip 1 and we do only 2 and 3.
Acked-and-tested-by: Borislav Petkov <bp@amd64.org>
Suggested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When MCA error occurs, it would be handled by Xen hypervisor first,
and then the error information would be sent to initial domain for logging.
This patch gets error information from Xen hypervisor and convert
Xen format error into Linux format mcelog. This logic is basically
self-contained, not touching other kernel components.
By using tools like mcelog tool users could read specific error information,
like what they did under native Linux.
To test follow directions outlined in Documentation/acpi/apei/einj.txt
Acked-and-tested-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Ke, Liping <liping.ke@intel.com>
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Elminate some simple_strto* usage.
checkpatch also noted pr_ conversations, which have been done as
recommended. The pr_fmt() define is used to shorten line length.
Other multi-line string warnings are also elmininated.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add a congestion control agent in the driver that handles gets and
sets from the congestion control manager in the fabric for the
Performance Scale Messaging (PSM) library.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Profiling has shown that sdma_lock is proving a bottleneck for
performance. The situations include:
- RDMA reads when krcvqs > 1
- post sends from multiple threads
For RDMA read the current global qib_wq mechanism runs on all CPUs
and contends for the sdma_lock when multiple RMDA read requests are
fielded on differenct CPUs. For post sends, the direct call to
qib_do_send() from multiple threads causes the contention.
Since the sdma mechanism is per port, this fix converts the existing
workqueue to a per port single thread workqueue to reduce the lock
contention in the RDMA read case, and for any other case where the QP
is scheduled via the workqueue mechanism from more than 1 CPU.
For the post send case, This patch modifies the post send code to test
for a non empty sdma engine. If the sdma is not idle the (now single
thread) workqueue will be used to trigger the send engine instead of
the direct call to qib_do_send().
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
There is a cut-and-paste typo in the function qib_pci_slot_reset()
where it prints that the "link_reset" function is called rather than
the "slot_reset" function. This makes the message misleading.
Signed-off-by: Betty Dall <betty.dall@hp.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In trusted networks, e.g., intranet, data-center, the client does not
need to use Fast Open cookie to mitigate DoS attacks. In cookie-less
mode, sendmsg() with MSG_FASTOPEN flag will send SYN-data regardless
of cookie availability.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On paths with firewalls dropping SYN with data or experimental TCP options,
Fast Open connections will have experience SYN timeout and bad performance.
The solution is to track such incidents in the cookie cache and disables
Fast Open temporarily.
Since only the original SYN includes data and/or Fast Open option, the
SYN-ACK has some tell-tale sign (tcp_rcv_fastopen_synack()) to detect
such drops. If a path has recurring Fast Open SYN drops, Fast Open is
disabled for 2^(recurring_losses) minutes starting from four minutes up to
roughly one and half day. sendmsg with MSG_FASTOPEN flag will succeed but
it behaves as connect() then write().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2)
and write(2). The application should replace connect() with it to
send data in the opening SYN packet.
For blocking socket, sendmsg() blocks until all the data are buffered
locally and the handshake is completed like connect() call. It
returns similar errno like connect() if the TCP handshake fails.
For non-blocking socket, it returns the number of bytes queued (and
transmitted in the SYN-data packet) if cookie is available. If cookie
is not available, it transmits a data-less SYN packet with Fast Open
cookie request option and returns -EINPROGRESS like connect().
Using MSG_FASTOPEN on connecting or connected socket will result in
simlar errno like repeating connect() calls. Therefore the application
should only use this flag on new sockets.
The buffer size of sendmsg() is independent of the MSS of the connection.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On receiving the SYN-ACK after SYN-data, the client needs to
a) update the cached MSS and cookie (if included in SYN-ACK)
b) retransmit the data not yet acknowledged by the SYN-ACK in the final ACK of
the handshake.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements sending SYN-data in tcp_connect(). The data is
from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch).
The length of the cookie in tcp_fastopen_req, init'd to 0, controls the
type of the SYN. If the cookie is not cached (len==0), the host sends
data-less SYN with Fast Open cookie request option to solicit a cookie
from the remote. If cookie is not available (len > 0), the host sends
a SYN-data with Fast Open cookie option. If cookie length is negative,
the SYN will not include any Fast Open option (for fall back operations).
To deal with middleboxes that may drop SYN with data or experimental TCP
option, the SYN-data is only sent once. SYN retransmits do not include
data or Fast Open options. The connection will fall back to regular TCP
handshake.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With help from Eric Dumazet, add Fast Open metrics in tcp metrics cache.
The basic ones are MSS and the cookies. Later patch will cache more to
handle unfriendly middleboxes.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch impelements the common code for both the client and server.
1. TCP Fast Open option processing. Since Fast Open does not have an
option number assigned by IANA yet, it shares the experiment option
code 254 by implementing draft-ietf-tcpm-experimental-options
with a 16 bits magic number 0xF989. This enables global experiments
without clashing the scarce(2) experimental options available for TCP.
When the draft status becomes standard (maybe), the client should
switch to the new option number assigned while the server supports
both numbers for transistion.
2. The new sysctl tcp_fastopen
3. A place holder init function
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In its receive path, mlx4_en driver maps each page chunk that it pushes
to the hardware and unmaps it when pushing it up the stack. This limits
throughput to about 3Gbps on a Power7 8-core machine.
One solution is to map the entire allocated page at once. However, this
requires that we keep track of every page fragment we give to a
descriptor. We also need to work with the discipline that all fragments will
be released (in the sense that it will not be reused by the driver
anymore) in the order they are allocated to the driver.
This requires that we don't reuse any fragments, every single one of
them must be reallocated. We do that by releasing all the fragments that
are processed and only after finished processing the descriptors, we
start the refill.
We also must somehow guarantee that we either refill all fragments in a
descriptor or none at all, without resorting to giving up a page
fragment that we would have already given. Otherwise, we would break the
discipline of only releasing the fragments in the order they were
allocated.
This has passed page allocation fault injections (restricted to the
driver by using required-start and required-end) and device hotplug
while 16 TCP streams were able to deliver more than 9Gbps.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dynamically allocated sysfs attributes must be initialized using
sysfs_attr_init(), otherwise lockdep complains:
BUG: key <address> not in .data!
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update the references to bridge utilities and web pages
to current locations
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 9ac32e1b firmware: convert e100 driver to request_firmware()
did a straight conversion of the in-driver ucode to external
files. This introduced the possibility of the driver failing
to enable an interface due to missing ucode. There was no
evaluation of the importance of the ucode at the time.
Based on comments in earlier versions of this driver, and in
the source code for the FreeBSD fxp driver, we can assume that
the ucode implements the "CPU Cycle Saver" feature on supported
adapters. Although generally wanted, this is an optional
feature. The ucode source is not available, preventing it from
being included in free distributions. This creates unnecessary
problems for the end users. Doing a network install based on a
free distribution installer requires the user to download and
insert the ucode into the installer.
Making the ucode optional when possible improves the user
experience and driver usability.
The ucode for some adapters include a bugfix, making it
essential. We continue to fail for these adapters unless the
ucode is available.
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the asix_set_eeprom() function to provide support for
programming the configuration EEPROM via ethtool.
Signed-off-by: Christian Riesch <christian.riesch@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code for reading the EEPROM via ethtool in the asix
driver has a few issues. It cannot handle odd length values
(accesses must be aligned at 16 bit boundaries) and interprets the
offset provided by ethtool as 16 bit word offset instead as byte offset.
The new code for asix_get_eeprom() introduced by this patch is
modeled after the code in
drivers/net/ethernet/atheros/atl1e/atl1e_ethtool.c
and provides read access to the entire EEPROM with arbitrary
offsets and lengths.
Signed-off-by: Christian Riesch <christian.riesch@omicron.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because there are multiple variants to the stmmac/dwmac driver, the
dts bindings should be updated to include version of the IP used.
Signed-off-by: Dinh Nguyen <dinguyen@altera.com>
Acked-by: Stefan Roese <sr@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
cxgb3 interface has a bad performance when VLAN is set. On my current
setup, a PowerLinux 7R2, I am able to get around 7 Gbps on a TCP_STREAM
(8 instances, 4k message).
With this patch, I am able to reach 9.5 Gbps.
Signed-off-by: Breno Leitao <brenohl@br.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Ethernet II wrapper is only used by IPX protocol, may have once
been used by Appletalk but not currently. Therefore it makes sense to
move it to the IPX dust bin and drop the exports.
Build tested only.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The stack_not_used() function in <linux/sched.h> assumes that stacks
grow downwards. This is not true on IA64 or PARISC, so this function
would walk off in the wrong direction and into the weeds.
Found on IA64 because of a compilation failure with recursive dependencies
on IA64_TASKSIZE and IA64_THREAD_INFO_SIZE.
Fixing the code is possible, but should be combined with other
infrastructure additions to set up the "canary" at the end of the stack.
Reported-by: Fengguang Wu <fengguang.wu@intel.com> (failed allmodconfig build)
Signed-off-by: Tony Luck <tony.luck@intel.com>
tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
per network namespace.
This leads to bad behavior on multiqueue NICS, because many cpus
contend for the socket lock and once socket lock is acquired, extra
false sharing on various socket fields slow down the operations.
To better resist to attacks, we use a percpu socket. Each cpu can
run without contention, using appropriate memory (local node)
Additional features :
1) We also mirror the queue_mapping of the incoming skb, so that
answers use the same queue if possible.
2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree()
3) We now limit the number of in-flight RST/ACK [1] packets
per cpu, instead of per namespace, and we honor the sysctl_wmem_default
limit dynamically. (Prior to this patch, sysctl_wmem_default value was
copied at boot time, so any further change would not affect tcp_sock
limit)
[1] These packets are only generated when no socket was matched for
the incoming packet.
Reported-by: Bill Sommerfeld <wsommerfeld@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use global seqlock for the nh_exceptions. Call
fnhe_oldest with the right hash chain. Correct the diff
value for dst_set_expires.
v2: after suggestions from Eric Dumazet:
* get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe
* continue daddr search in rt_bind_exception
v3:
* remove the daddr check before seqlock in rt_bind_exception
* restart lookup in rt_bind_exception on detected seqlock change,
as suggested by David Miller
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use RFS infrastructure and flow steering in HW to keep CPU
affinity of rx interrupts and application per TCP stream.
A flow steering filter is added to the HW whenever the RFS
ndo callback is invoked by core networking code.
Because the invocation takes place in interrupt context, the
actual setup of HW is done using workqueue. Whenever new filter
is added, the driver checks for expiry of existing filters.
Since there's window in time between the point where the core
RFS code invoked the ndo callback, to the point where the HW
is configured from the workqueue context, the 2nd, 3rd etc
packets from that stream will cause the net core to invoke
the callback again and again.
To prevent inefficient/double configuration of the HW, the filters
are kept in a database which is indexed using hash function to enable
fast access.
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable callers of mlx4_assign_eq to supply a pointer to cpu_rmap.
If supplied, the assigned IRQ is tracked using rmap infrastructure.
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_options_compile can be called for forwarded packets,
make sure the specific-destionation address is a local one as
specified in RFC 1812, 4.2.2.2 Addresses in Options
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move fib_compute_spec_dst at the only place where it
is needed.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull three md bugfixes from NeilBrown:
"One of the bugs was introduced in 3.5-rc1. Others have been there for
longer."
* tag 'md-3.5-fixes' of git://neil.brown.name/md:
md/raid1: close some possible races on write errors during resync
md: avoid crash when stopping md array races with closing other open fds.
md: fix bug in handling of new_data_offset
Pull networking changes from David Miller:
"Ok, we should be good to go now"
1) We have to statically initialize the init_net device list head rather
than do so in an initcall, otherwise netprio_cgroup crashes if it's
built statically rather than modular (Mark D. Rustad)
2) Fix SKB null oopser in CIPSO ipv4 option processing (Paul Moore)
3) Qlogic maintainers update (Anirban Chakraborty)
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net: Statically initialize init_net.dev_base_head
MAINTAINERS: Changes in qlcnic and qlge maintainers list
cipso: don't follow a NULL pointer when setsockopt() is called
Pull HID update from Jiri Kosina:
"A final round of changes for HID for 3.5: just device ID additions."
* 'upstream-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: hid-multitouch: add support for Zytronic panels
HID: add Sennheiser BTD500USB device support
HID: add battery quirk for Apple Wireless ANSI
The strcpy was being used to set the name of the board. Since the
destination char* was read-only and the name is set statically at
compile time; this was both wrong and redundant.
The type of char* is changed to const char* to prevent future errors.
Reported-by: Radek Masin <radek@masin.eu>
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
[ Taking directly due to vacations - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>