[IB] mthca: first pass at catastrophic error reporting
Add some initial support for detecting and reporting catastrophic errors reported by Mellanox HCAs. We start a periodic timer which polls the catastrophic error reporting buffer in device memory. If an error is detected, we dump the contents of the buffer for port-mortem debugging, and report a fatal asynchronous error to higher levels. In the future we can try to recover from these errors by resetting the device, but this will require some work in higher-level code as well. Let's get this in now, so that we at least get catastrophic errors reported in logs. Signed-off-by: Roland Dreier <rolandd@cisco.com>
This commit is contained in:
@@ -1,6 +1,7 @@
|
||||
/*
|
||||
* Copyright (c) 2004, 2005 Topspin Communications. All rights reserved.
|
||||
* Copyright (c) 2005 Mellanox Technologies. All rights reserved.
|
||||
* Copyright (c) 2005 Cisco Systems. All rights reserved.
|
||||
*
|
||||
* This software is available to you under a choice of one of two
|
||||
* licenses. You may choose to be licensed under the terms of the GNU
|
||||
@@ -706,9 +707,13 @@ int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status)
|
||||
|
||||
MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET);
|
||||
dev->cmd.max_cmds = 1 << lg;
|
||||
MTHCA_GET(dev->catas_err.addr, outbox, QUERY_FW_ERR_START_OFFSET);
|
||||
MTHCA_GET(dev->catas_err.size, outbox, QUERY_FW_ERR_SIZE_OFFSET);
|
||||
|
||||
mthca_dbg(dev, "FW version %012llx, max commands %d\n",
|
||||
(unsigned long long) dev->fw_ver, dev->cmd.max_cmds);
|
||||
mthca_dbg(dev, "Catastrophic error buffer at 0x%llx, size 0x%x\n",
|
||||
(unsigned long long) dev->catas_err.addr, dev->catas_err.size);
|
||||
|
||||
if (mthca_is_memfree(dev)) {
|
||||
MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET);
|
||||
|
Reference in New Issue
Block a user