Attack of the Disk Fairies

Many years ago, while I was working on Unix mini-computer systems from IBM, HP and Data General (now consigned to the annals of computing history), disk IO monitoring and performance tuning were all the rage as we attempted to get the most out of the systems that were implemented. Indeed, much effort went into developing benchmarks to ensure with certainty that the applications could perform on the hardware before they went live. Oracle database, Pick under the guise of Unidata, and medical-specific languages such as MUMPS were all part of the test mix and occasionally threw up the odd result. In particular, I remember two results that stand out.

One was an HP 9000 box nicknamed Emerald. When tested, it was so far ahead of the previous benchmark that we had to readjust the workload to bring the figures for future tests back into something we could easily compare with other suppliers. The other was a Data General machine that should have been a scorching performer but was not, due to a particular set of circumstances described at the time as a “thundering herd”, where a massive number of processes and disk activity hit the server at the same point in the benchmark. The server then stopped in its tracks and did not respond for an extended amount of time. The effect was that the overall benchmark result was well below par. A patch to the kernel resolved the issue and allowed the machine to return a much improved result.

The IO issue, however, stayed with me through other incidents over the years where disk performance was not as anticipated. Recently a client with a new machine suffered performance issues where it appeared that phantom activity on the disk was stopping a single virtual machine from delivering the service required.

The server in question was a Dell R420 with a RAID 1 array on 2 drives and a RAID 5 array on 3 disks. The machine was running ESXi 5.10 and one SBS 2011 VM, but the performance was very poor. On closer inspection we noted that disk activity was occurring even when the VM was not running, and after comparing with a similar ESX host we found that the other machine had virtually no disk activity. Checking in the BIOS of the RAID controller showed that the virtual disks were in “patrol read mode”, which we were advised by Dell was an automated mode that checked the disks (in this case two virtual disks inside the RAID 5 array) to make sure that they were optimal. In this case, though, it seemed that the patrol mode was running pretty much all the time and that this was the source of the phantom disk activity. Dell recommended using the OpenManage tool to resolve this by turning the Patrol Read Mode off. They were unable to explain why the check had switched itself on and why it seemed to affect only one array and not the other.
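
The comparison we made by eye, idle disk activity on the suspect host against a similar healthy one, can be roughed out in a few lines of Python. This is only a sketch under assumptions: the function name, the thresholds and the sample figures are all made up for illustration, and the throughput samples would have to come from whatever monitoring tool you have to hand (esxtop, for example).

```python
from statistics import mean


def phantom_activity(suspect_kbps, baseline_kbps, factor=10.0, floor_kbps=100.0):
    """Return True if the suspect host's idle disk throughput looks abnormal.

    suspect_kbps and baseline_kbps are idle-time throughput samples (KB/s)
    taken with no VMs running, from the suspect host and from a comparable
    healthy host. factor and floor_kbps are arbitrary illustrative
    thresholds, not vendor guidance.
    """
    if not suspect_kbps or not baseline_kbps:
        raise ValueError("need at least one sample from each host")
    suspect = mean(suspect_kbps)
    baseline = mean(baseline_kbps)
    # Flag only when the idle load is both non-trivial in absolute terms
    # and well above what the healthy host shows.
    return suspect > floor_kbps and suspect > factor * max(baseline, 1.0)


# Hypothetical figures: a busy "idle" host against a quiet one.
noisy_host = [5200, 4800, 5100]
quiet_host = [12, 8, 15]
print(phantom_activity(noisy_host, quiet_host))
```

A result of True only says that something is touching the disks while the host should be idle. It was still the RAID controller settings that told us what that something was.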

So if you are looking for ghostly disk activity, don’t rule out that the machine itself may be working against you.