[IB] mthca: first pass at catastrophic error reporting

Add some initial support for detecting and reporting catastrophic
errors reported by Mellanox HCAs.  We start a periodic timer which
polls the catastrophic error reporting buffer in device memory.  If an
error is detected, we dump the contents of the buffer for port-mortem
debugging, and report a fatal asynchronous error to higher levels.

In the future we can try to recover from these errors by resetting the
device, but this will require some work in higher-level code as well.
Let's get this in now, so that we at least get catastrophic errors
reported in logs.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
This commit is contained in:
Roland Dreier
2005-10-27 11:03:38 -07:00
parent 7cc656efb5
commit 3d155f8cd0
5 changed files with 176 additions and 1 deletions

View File

@@ -1,6 +1,7 @@
/*
* Copyright (c) 2004, 2005 Topspin Communications. All rights reserved.
* Copyright (c) 2005 Mellanox Technologies. All rights reserved.
* Copyright (c) 2005 Cisco Systems. All rights reserved.
*
* This software is available to you under a choice of one of two
* licenses. You may choose to be licensed under the terms of the GNU
@@ -706,9 +707,13 @@ int mthca_QUERY_FW(struct mthca_dev *dev, u8 *status)
MTHCA_GET(lg, outbox, QUERY_FW_MAX_CMD_OFFSET);
dev->cmd.max_cmds = 1 << lg;
MTHCA_GET(dev->catas_err.addr, outbox, QUERY_FW_ERR_START_OFFSET);
MTHCA_GET(dev->catas_err.size, outbox, QUERY_FW_ERR_SIZE_OFFSET);
mthca_dbg(dev, "FW version %012llx, max commands %d\n",
(unsigned long long) dev->fw_ver, dev->cmd.max_cmds);
mthca_dbg(dev, "Catastrophic error buffer at 0x%llx, size 0x%x\n",
(unsigned long long) dev->catas_err.addr, dev->catas_err.size);
if (mthca_is_memfree(dev)) {
MTHCA_GET(dev->fw.arbel.fw_pages, outbox, QUERY_FW_SIZE_OFFSET);