We saw the exact same symptoms previously using Syncsort backup, with QLogic HBAs, on Windows 2k, with Vixel switches. All firmware and drivers are up to date and have been confirmed against every vendor's compatibility matrix. We've turned off Insight polling and the EMC polling of devices. All cables and GBICs have been replaced multiple times. Since we have replaced every single component, and had the individual vendor experts either perform or verify the configuration, we are at our wits' end trying to find the cause.
Things that make a SAN stop:
1) Look at the switch port error statistics for any periodic errors that match the timeframe when you lose connection to the tape resources. Take a look at both the incoming ports from the hosts, and the outgoing ports to the tapes. If your issue is light loss, or bad connections, the errors should be getting logged in the switch. Also look for any type of timeout error, since your problem seems to be intermittent. (hard down problems are MUCH easier to figure out!)
2) Make sure your SCSI-connected drives are hard coded to the Chaparral SCSI bridge. I have seen instances in the past where power glitches caused the internal FC-AL loops or SCSI buses to reset, thereby changing the ID of the connected drives. Do not auto-negotiate anything. Hard code it.
3) Weird SAN issues such as the ones you are seeing can be caused by error conditions on one member of a zone set affecting the other members. The way to solve that problem is to make sure you are doing HBA zoning. HBA zoning means there is only a single initiator and a single target within each zone. Create a different zone for each HBA in your SAN, and include only that HBA and the storage port it needs access to.
4) Make sure there are no tape targets mixed in with your disk targets in any zone. All tape connectivity must be excluded from disk connectivity. In other words, zone out all your tapes into multiple separate tape zones.
5) Do not use the same HBA for disk connectivity as tape connectivity. This is a sure way to see the problems you are experiencing, because any SCSI bus reset commands that the tapes perform when they rewind will wreak havoc with your disk access. It is always best to use three HBAs in each server. Two for path failover disk access, and the third one dedicated to tape access.
6) If there are any patch panels in your SAN, make sure there is minimal dB signal loss from the connections. (A decibel (dB) is a unit of measurement of signal loss for optical networks.) Using a patch panel, the maximum allowable signal loss is 0.5 dB per connection. This means that the total allowable loss when using a patch panel, together with the host and storage connections, is about 4 dB. Using 50 um multi-mode cables, you lose about 4 dB per kilometer. Using 62.5 um cables, you lose about 4.5 dB per kilometer. Just make sure you don't lose more than 6 dB between connections when using either 62.5 um or 50 um cables.
7) Use your fingers, and run them along the cables to make sure there are no indentations. An indentation like this is known as a micro bend, and it causes refraction in the cable. Cable refraction means the light is being scattered inside the cable, which also causes dB loss.
8) Make sure there are no macro bends in your cable plant. Macro bends can occur when cables are bunched up under the floor improperly, and the cables get bent into tight loops of less than 6 inches. Never bend a fibre cable more than 90 degrees. Fibre cables are made of glass, and glass (ahem) can fracture or break. You might never be able to see the break inside the cable, but it can cause all kinds of strange, intermittent problems to pop up. Never use twist-tie wraps to bind fibre cables together. Use Velcro wraps instead, because twist-tie wraps can cause crimps in the fibre cable. A crimp in the cable will cause signal loss, making your life miserable.
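The loss figures in item 6 lend themselves to a quick back-of-the-envelope check for each link. The sketch below is only an illustration: it uses the rule-of-thumb numbers quoted above (0.5 dB per connection, 4 dB or 4.5 dB per kilometer, a 6 dB ceiling), and the function names and the example link are hypothetical. It is not a substitute for measuring the link with a light meter.

```python
# Figures quoted in the answer above; treat them as rules of thumb, not a spec.
LOSS_DB_PER_KM = {50.0: 4.0, 62.5: 4.5}   # multi-mode loss by core size (um)
LOSS_DB_PER_CONNECTION = 0.5               # worst case per patch-panel connection
MAX_LINK_LOSS_DB = 6.0                     # ceiling between connections


def link_loss_db(segments, connections):
    """Estimate the loss of one link in dB.

    segments: list of (length_in_metres, core_size_um) cable runs
    connections: number of mated connections along the path
    """
    cable_loss = sum(
        (length_m / 1000.0) * LOSS_DB_PER_KM[core_um]
        for length_m, core_um in segments
    )
    return cable_loss + connections * LOSS_DB_PER_CONNECTION


def check_link(segments, connections):
    """Return (estimated_loss_db, list_of_warnings) for one link."""
    warnings = []
    cores = {core_um for _, core_um in segments}
    if len(cores) > 1:
        warnings.append("mixed 50 um and 62.5 um cable in one link")
    loss = link_loss_db(segments, connections)
    if loss > MAX_LINK_LOSS_DB:
        warnings.append("estimated loss exceeds %.1f dB" % MAX_LINK_LOSS_DB)
    return loss, warnings


if __name__ == "__main__":
    # Hypothetical link: 150 m of 50 um cable through four connections.
    loss, warnings = check_link([(150, 50.0)], connections=4)
    print("estimated loss: %.2f dB" % loss, warnings)
```

An estimate that comes out well under the ceiling does not rule out micro bends or crimps, which add loss that no calculation based on length alone will show.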
You replaced all the components, and you had the same issues with different brands of equipment, so your issue is most likely not a hardware problem. It sounds like more of a timing problem, a component mismatch (such as 50 um cables and 62.5 um cables in the same cable plant), or a configuration or software issue. My guess is you may have tapes and disks in the same zone, or more than one server in the same zone.
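If you can export your zone configuration from the switches, the zoning rules in items 3 through 5 can be checked mechanically. The following is a minimal sketch, assuming you have already turned the export into two plain dictionaries. The zone names, member aliases and the audit_zones function are all hypothetical and do not correspond to any vendor's tool or file format.

```python
def audit_zones(zones, roles):
    """Return a sorted list of (zone_or_hba, problem) tuples.

    zones: dict of zone name -> list of member aliases
    roles: dict of member alias -> "initiator", "disk" or "tape"
    """
    findings = []
    hba_targets = {}  # initiator alias -> kinds of target it is zoned with
    for name, members in zones.items():
        kinds = [roles[m] for m in members]
        initiators = [m for m in members if roles[m] == "initiator"]
        if len(initiators) > 1:
            findings.append((name, "more than one initiator in zone"))
        if "tape" in kinds and "disk" in kinds:
            findings.append((name, "tape and disk targets share a zone"))
        for hba in initiators:
            hba_targets.setdefault(hba, set()).update(
                k for k in kinds if k != "initiator"
            )
    for hba, kinds in hba_targets.items():
        if {"tape", "disk"} <= kinds:
            findings.append((hba, "same HBA used for tape and disk"))
    return sorted(findings)


if __name__ == "__main__":
    # Hypothetical aliases and zones, for illustration only.
    roles = {
        "srv1_hba0": "initiator",
        "srv1_hba1": "initiator",
        "srv2_hba0": "initiator",
        "array_p0": "disk",
        "lib_drv0": "tape",
    }
    zones = {
        "z_good": ["srv1_hba0", "array_p0"],
        "z_mixed": ["srv1_hba1", "array_p0", "lib_drv0"],
        "z_shared": ["srv1_hba0", "srv2_hba0", "lib_drv0"],
    }
    for subject, problem in audit_zones(zones, roles):
        print(subject + ": " + problem)
```

An empty result only means the zoning follows the three rules. It says nothing about timing or cabling problems, which still need the checks described above.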
If you still cannot solve the problem after doing everything mentioned here, then it's time to break out the old protocol analyzer. Protocol analyzers are not cheap, but they can be rented at a reasonable price for a day or two to help finally determine the cause of your errors. If you do not feel comfortable doing your own analysis, I'm sure any one of the storage hardware vendors you are using should be able to help you out with a statement of work to get the problem fixed.
This was first published in May 2005