My First JBOD, Part 2: Irony

[Image: J4200]

After unpacking, racking, and mounting the JBOD, I waited until the weekend had started before powering down the server and installing the RAID card. Connected it all up, rebooted into the Adaptec BIOS, and configured the 6x 1TB drives into a RAID6 array. After that, I installed the RAID StorageManager off of Sun’s website, and then the “Common Array Manager” software. CAM is supposed to provide a web GUI to an organization’s worth of Sun JBODs, so you can update JBOD firmware and query status and whatnot from a single interface. There are client and server bits written in Java that run on the various boxes, so the data path was going to look like this:

JBOD -> XEN dom0 running remote proxy tool -> XEN domU running web GUI

I say “was going” and “supposed to” because all the remote proxy tool in CAM ended up doing was consistently triggering a kernel panic in the aacraid driver whenever its detection code fired up.

Take a long drag off the irony of driver and firmware issues, and download the latest-n-greatest aacraid driver and firmware from Intel via Sun, and update. Same results. Repeat in various configurations, and before throwing in the towel, get a basic dump and file a bug. I didn’t put any more serious thought into debugging it simply because this whole thing has to be up and running yesterday, and the last time I asked for documentation on the topic, I was rebuffed with a variant of this classic: “If you were smart enough to debug the kernel, you wouldn’t need documentation on how to debug the kernel.”

Take a moment to stand in awe of the massive poisonous cobaggery involved in that statement being offered to someone who wants to help fix a crasher. I’ll wait.

That kind of shit would never fly in any GNOME venue, which is why GNOME kicks so much ass.

Update: The cobaggery about kernel development did not come from Sun or any representative of any company involved in open-source, and was unrelated to this situation at all. I relate it simply because it pertains to debugging kernel issues, and why I don’t do it.

4 Responses

  1. numpty says:

    Without wishing to defend the response from Sun, it’s not the first time I’ve been on the receiving end of a Bad Case of Attitude from members of the GNOME community as well. There are good people and bad people in every community; don’t kid yourself that, just because GNOME isn’t a business, there aren’t assholes in our midst.

  2. Paul McDonnell says:

    James, you describe the CAM proxy running in dom0 and the CAM BUI running in domU. Is the panic occurring in the dom0 or domU RHE instance? If the latter, the BUI needs to be told where the proxy is. The registration wizard will search for you, but I wonder whether, when the search is being done in the domU (which will be fruitless since it is a virtual machine with virtual drivers), the aacraid driver is choking on the virtual drivers. A possible workaround is to specify the IP address of the dom0 host in the registration wizard. The BUI and the proxy communicate via TCP/IP. As long as there is network connectivity between the domU and dom0 (which is required for CAM to work in your setup), specifying the IP address of the dom0 host in the registration wizard will prevent the discovery from happening in the domU, so that it occurs only in the dom0 (which is what you want).

  3. James Cape says:

    numpty: As noted in the update, the cobag in question isn’t a known employee of any company.

    Paul: The actual array is attached to dom0, and it’s dom0 that’s crashing when the register process on dom0 starts talking to the RAID controller. The domUs only get the generic Xen device (disk image on a logical vol, on a different array/controller) that’s presented to them.
