Hyper-V guest disk bottleneck

I have a Hyper-V environment where all physical hosts connect to a PowerVault 3800f with 10 x 15k SAS disks in RAID 10 via Fibre Channel - one virtual disk per host, four hosts.

On the SAN performance logs I'm seeing latency spikes of around 50ms on each virtual disk, but average latency is around 2ms.

On the VMs at the same time I'm seeing latency spikes of 600ms - once I even saw 1100ms.

This tells me that something is going wrong between the SAN and the VM.

Write cache hit is constant at 100%, read cache hit is usually around the 70% mark, and reads are usually below 40% of total I/O.
Maximum combined IOPS of the two RAID controllers is around 9k.

On the VMs, memory and processor usage is unremarkable, though there is high paging on all of them (they use dynamic memory - maybe I should stick to fixed?).

The VM with a particular issue has processor usage of around 50% (25 virtual processors) - so it is split over NUMA nodes.

Where should I be looking for the delay between the SAN and the VM?



Probably nowhere. That sounds like you are hitting the limits of your chosen configuration, and the numbers are roughly in line with what I'd expect. 10 disks in RAID 10 effectively limits you to five simultaneous spindle writes, and even with high-speed disks, four VMs will be random-seeking enough that performance can tank temporarily without warning.

More, smaller disks are always better than fewer, larger disks. Tiered storage and/or lots of spindles is the cure. There's a reason those deployments are so popular now.
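
As a rough illustration of that ceiling, here's a back-of-the-envelope sketch. The per-spindle IOPS figure is an assumption (a typical number for a 15k SAS disk, not something read off this array), and the read/write mix just takes the "reads below 40%" observation above at face value:

```python
# Rough estimate of the array's random-I/O ceiling. The ~180 IOPS per 15k
# spindle is an assumed illustrative figure, not something read off the SAN.

DISKS = 10
PER_DISK_IOPS = 180      # typical random IOPS for a 15k SAS spindle (assumed)
READ_FRACTION = 0.4      # the post says reads are usually below 40% of total I/O

raw_backend_iops = DISKS * PER_DISK_IOPS     # 1800

# In RAID 10 every front-end write lands on two spindles, so the backend cost
# of one host I/O is (read_fraction * 1) + (write_fraction * 2).
write_penalty = READ_FRACTION * 1 + (1 - READ_FRACTION) * 2   # 1.6

max_host_iops = raw_backend_iops / write_penalty

print(f"raw backend IOPS        : {raw_backend_iops}")
print(f"estimated host IOPS cap : {max_host_iops:.0f} at a 40/60 read/write mix")
```

On numbers like those the spindles, not the controllers' 9k IOPS, are the practical ceiling, and four hosts' worth of mixed random I/O can briefly queue up behind them.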



Why is there such a disparity between the latency reported by the SAN and that reported by the VM, though? Where are those 500ms going?



Hyper-V is going to try to reorder requests when under load. It may decide that delaying one VM to handle others is better than letting the writes go through in true FIFO fashion - and it's usually right. Seeking for VM 1, then 3, then 1, then 3, then 4 will be far more expensive than doing VM 1, 1, 1 and then 3, 3, 3, but the latter does show up as big latency in VM 3. Overall, though, the trade-off is still a good one. That's why your SAN latency is lower than the VM latency: if left to run FIFO, the SAN latency would go up because there would be more random seeking, and that would impact all VMs significantly.
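
To make that trade-off concrete, here is a toy Python model - invented seek costs and request layout, not Hyper-V's actual scheduler - that services the same queue in strict FIFO order and then in per-VM batches. It reports VM 3's per-I/O service time (roughly what the array would log) alongside its completion latency (roughly what the guest's counters would show):

```python
# Toy model of why the per-I/O latency logged by the array can look fine
# while a guest sees much larger numbers: the guest's figure also includes
# time spent queued while other VMs' batches are serviced. Seek costs and
# the request layout are invented; this is not how Hyper-V schedules I/O.

def run(order, seek_cost=0.01, transfer=0.2):
    """Service requests in order; return (vm, service_ms, completion_ms) tuples."""
    head, clock, out = 0, 0.0, []
    for vm, pos in order:
        service = abs(pos - head) * seek_cost + transfer   # what the array "sees"
        clock += service                                    # completion incl. queuing
        head = pos
        out.append((vm, service, clock))
    return out

# Three VMs, all requests queued at t=0, data in different regions of the array.
queued = [(1, 100), (3, 900), (1, 110), (3, 910), (4, 500),
          (1, 120), (3, 920), (4, 510), (1, 130), (3, 930)]

fifo    = run(queued)                                   # strict arrival order
batched = run(sorted(queued, key=lambda r: r[0]))       # each VM's I/O grouped together

for name, result in (("FIFO", fifo), ("batched", batched)):
    makespan = result[-1][2]
    vm3 = [(s, c) for vm, s, c in result if vm == 3]
    print(f"{name:8s} makespan {makespan:5.1f} ms | "
          f"VM3 avg service {sum(s for s, _ in vm3)/len(vm3):4.1f} ms, "
          f"avg completion {sum(c for _, c in vm3)/len(vm3):4.1f} ms")
```

In the batched run the array-side service time for VM 3 stays small, but the completion latency the guest would report is several times higher because its requests sat behind VM 1's batch - while the total time to drain the whole queue is far lower than under FIFO.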



Ah, I see.

Ok - thanks for your help once again Cliff.

More disks then.



Cliff - I was just thinking about this some more. The VM that's showing the highest latencies is an RD session host, and the only other VMs on the same host are a DC and a licensing server. So the session host is pretty much the only VM doing anything on that host - there would be no need for Hyper-V to intervene and reorganise disk requests from different VMs, because the lion's share would be coming from the session host.

That being the case, why would the latencies on the SAN for the virtual disk used by this particular Hyper-V host be so different from the latencies reported by the session host VM?
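
One way to answer that directly - a minimal sketch, assuming a recent Python on both machines and using only the stock LogicalDisk/PhysicalDisk counters - is to sample the same disk-latency counters inside the session host VM and on its Hyper-V host over the same window and compare them:

```python
# Sample disk latency counters via typeperf (the stock Windows counter CLI)
# so the same script can be run inside the guest and on the Hyper-V host at
# the same time and the two sets of numbers compared. -si is the sample
# interval in seconds, -sc the number of samples. On the host it may also be
# worth adding the "Hyper-V Virtual Storage Device" counter set if exposed.

import csv
import subprocess

COUNTERS = [
    r"\LogicalDisk(_Total)\Avg. Disk sec/Read",
    r"\LogicalDisk(_Total)\Avg. Disk sec/Write",
    r"\PhysicalDisk(_Total)\Current Disk Queue Length",
]

def sample(interval_s=1, samples=60):
    """Collect the counters for a minute and return (header, rows of floats)."""
    cmd = ["typeperf", *COUNTERS, "-si", str(interval_s), "-sc", str(samples)]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    rows = list(csv.reader(line for line in out.splitlines() if line.startswith('"')))
    header, data = rows[0], rows[1:]
    return header, [[float(v) for v in row[1:]]
                    for row in data if all(v.strip() for v in row[1:])]

if __name__ == "__main__":
    header, data = sample()
    for name, values in zip(header[1:], zip(*data)):
        print(f"{name}: avg {sum(values)/len(values):.4f}, max {max(values):.4f}")
```

If the host-side read/write latencies stay close to what the SAN reports (around 2 ms average) while the guest-side figures spike into the hundreds of milliseconds, the extra time is being added between the VM's virtual disk and the physical volume - a climbing queue length at the same moments is the usual giveaway.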
