Using PowerShell To Send Metrics To Graphite

One side of monitoring that is difficult or expensive in the Windows world is performance monitoring. Windows comes with Performance Monitor, but that is only useful for short-term monitoring or for troubleshooting a live performance problem. If you want to keep historical metrics, you might use something like SCOM, but it can be expensive and is a fairly complex product.

There is a tool that has been around for a few years in the Linux world called Graphite. It is a very simple but powerful metric collection system used to store and render time-series data. You can find out more about it at the Graphite website. There is also an excellent blog post which introduces the basic concepts of Graphite here: http://matt.aimonetti.net/posts/2013/06/26/practical-guide-to-graphite-monitoring/.

The problem I faced was that there was no way to get Windows Performance Counters over to the Graphite server. There are plenty of daemons which do this on Linux, but Windows was left out.

In the environment I look after at work, my servers are all Windows, so I ended up writing my own PowerShell functions to do the collection and forwarding to a Graphite server. The metrics are sent over UDP to Carbon, the metric collection daemon used by Graphite.

Here is an example graph which can be generated in a few clicks in Graphite. It received its metrics from my Graphite PowerShell functions. It is tracking LDAP searches against our 3 domain controllers for the last 24 hours.

[Graph: LDAP searches across the three domain controllers over the last 24 hours]

Another example, comparing CPU usage to SQL Work Tables Created on our database server for the last week.

[Graph: CPU usage compared with SQL Work Tables Created over the last week]

The Configuration File

First off, there is an XML configuration file where you can specify the details of your Graphite server, how regularly you want to collect your metrics, the metrics you want to collect, and any filters to remove metrics you don’t want (this is useful when you have multiple network adapters but only care about the one or two that are connected).

Here is what the included configuration file looks like:
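The definitive copy ships with the functions on GitHub; roughly, it is shaped something like the sketch below (the element and attribute names here are only an approximation, not the exact schema):

    <!-- Illustrative sketch only - the real schema is the file included with the functions -->
    <Configuration>
        <Graphite>
            <CarbonServer>graphite01.yourdomain.local</CarbonServer>
            <CarbonServerPort>2003</CarbonServerPort>
            <MetricPathPrefix>datacenter1.servers</MetricPathPrefix>
            <MetricSendIntervalSeconds>5</MetricSendIntervalSeconds>
        </Graphite>
        <PerformanceCounters>
            <Counter Name="\Processor(_Total)\% Processor Time"/>
            <Counter Name="\Memory\Available MBytes"/>
            <Counter Name="\Network Interface(*)\Bytes Sent/sec"/>
        </PerformanceCounters>
        <Filtering>
            <!-- Strip out metrics you do not care about, e.g. disconnected or virtual NICs -->
            <MetricFilter>isatap</MetricFilter>
            <MetricFilter>teredo tunneling</MetricFilter>
        </Filtering>
        <Logging VerboseOutput="True"/>
    </Configuration>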

As you can see, it is very easy to add metrics which will be collected by the script. You can configure MetricSendIntervalSeconds, which is how long you want the script to wait between sending metrics to Graphite. Keep in mind it takes around 1.5 seconds to collect the default metrics included in the script, so I don’t recommend collecting and sending metrics more often than every 5 seconds.

How The Functions Work

The script includes a few internal functions which are used to get metrics over to Graphite. All the functions have built-in help which you can access via the Get-Help PowerShell CmdLet.

  • Load-XMLConfig – Loads the configuration values in the XML file into an object which can be used inside the rest of the functions
  • ConvertTo-GraphiteMetric  – This takes the name of a Windows Performance counter and converts it to a metric name usable by Graphite

  • Send-GraphiteMetric – submits metrics to the Graphite Carbon daemon using UDP. This can be useful on its own, for example if you want to send a metric so you know when you are about to deploy a new patch from the developers. You can then compare the time of this metric to environment performance and see if the patch caused any performance impact. Etsy has a great article on how they do this with Graphite here: http://codeascraft.com/2010/12/08/track-every-release/

If you want to manually send a metric on patch install so you can graph it against other metrics, you can call Send-GraphiteMetric with the current date as the timestamp (see the example after this list).

  • Start-StatsToGraphite – this is an endless While loop which collects the metrics specified in the XML file and sends them to Graphite. If you change the XML file while the script is running, it will automatically reload the configuration on the next send interval, so any new Performance Counters will start being collected and sent through to Graphite.
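For example, a one-off deployment marker could be sent with something along these lines. The parameter names are my assumption about the function's interface, and the server name and metric path are placeholders, so check Get-Help Send-GraphiteMetric for the exact syntax:

    # Send a single metric marking a patch deployment, stamped with the current date/time.
    # NOTE: parameter names are assumed - run Get-Help Send-GraphiteMetric for the real ones.
    Send-GraphiteMetric -CarbonServer graphite01.yourdomain.local -CarbonServerPort 2003 `
        -MetricPath datacenter1.deployments.webapp.patch `
        -MetricValue 1 `
        -DateTime (Get-Date)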

With the VerboseOutput configuration value set in the XML file, you will see the following output when you run Start-StatsToGraphite from an interactive PowerShell session.
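Getting to that point is just a case of loading the functions into your session and calling Start-StatsToGraphite; the path below is only an example of wherever you have saved the script:

    # Dot-source the functions, then start the endless collection loop.
    . C:\Scripts\Graphite-PowerShell.ps1
    Start-StatsToGraphite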

[Screenshot: verbose console output from Start-StatsToGraphite]

The script can also be run as a service, to make sure you don’t miss any metrics even if the machine reboots.

To start using PowerShell to send metrics to Graphite in your environment, you can find a detailed installation guide over at GitHub: https://github.com/MattHodge/Graphite-PowerShell-Functions

Update (11/11/2014): v1.0 of the PowerShell Graphite functions has been released, which now bundles the tools as a PowerShell Module: https://github.com/MattHodge/Graphite-PowerShell-Functions/releases/tag/v1.0

Replacing a failed disk in Windows Server 2012 Storage Spaces with PowerShell

Failed hard disks are inevitable. There are many ways to provide resiliency against hard disk failure, and Windows Server 2012/Windows Server 2012 R2's built-in feature to provide this is Storage Spaces.

A hard disk failed inside my Storage Pool, so let's switch over to PowerShell to get this resolved.


PS: You can skip to the bottom of this post for the TL;DR version.

Diagnosis

Firstly, open up an administrative PowerShell prompt. To get the status of my Storage Space (which I called pool), I run the command

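Something along these lines, using Get-StoragePool from the built-in Storage module:

    # Show the pool along with its OperationalStatus and HealthStatus.
    Get-StoragePool -FriendlyName pool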

I can see that my Storage Space named pool is in a degraded state.

To check the health of the volumes sitting inside the Storage Pool, use the command

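Get-VirtualDisk will list each volume with its resiliency setting and status; something like:

    # List every virtual disk in the pool and how healthy it is.
    Get-VirtualDisk | Format-Table FriendlyName, ResiliencySettingName, OperationalStatus, HealthStatus -AutoSize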

We can see that the Media, Software and DocumentsPhotos volumes have Degraded as their OperationalStatus. This means that they are still attached and accessible, but their reliability cannot be ensured should there be another drive failure. These volumes use either a Parity or Mirror resiliency setting, which has allowed Storage Spaces to save my data even with the drive failure.

The Backups and VMTemplates volumes have a Detached OperationalStatus. I was not using any resiliency on this data as it is easily replaced, so it looks like I have lost the data on these volumes.

To get an idea of what is happening at the physical disk layer, I run the command

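Get-PhysicalDisk covers that; for example:

    # Show every physical disk the system can see, including failed ones.
    Get-PhysicalDisk | Sort-Object FriendlyName | Format-Table FriendlyName, OperationalStatus, HealthStatus, Usage, Size -AutoSize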

We can see that PhysicalDisk1 is in a failed state. As the HP N40L has a 4-bay enclosure with 4 TB hard disks in it, it is easy to determine that PhysicalDisk1 is in the first bay of the enclosure.

Retiring the failed Disk

Now that I had determined which disk had failed, the server was shut down and the failed disk in the first bay was replaced with a spare 4 TB hard disk.

With the server back online, open PowerShell back up with administrative permissions and check what the physical disks look like now

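Running Get-PhysicalDisk again shows the state after the swap:

    # Check the physical disks again now the drive has been replaced.
    Get-PhysicalDisk | Sort-Object FriendlyName | Format-Table FriendlyName, OperationalStatus, HealthStatus, Usage -AutoSize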

We can see that the new disk that was installed has taken the FriendlyName of PhysicalDisk1 and has a HealthStatus of Healthy. The failed disk has lost its FriendlyName and its OperationalStatus has changed to Lost Communication.

First, let's single out the missing disk.

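The failed disk no longer reports a FriendlyName, but it can be picked out by its OperationalStatus; for example:

    # The failed disk is the one reporting "Lost Communication".
    Get-PhysicalDisk | Where-Object { $_.OperationalStatus -eq "Lost Communication" }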

Assign the missing disk to a variable
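The variable name $missingDisk below is just what I use for the steps that follow:

    # Capture the failed disk so it can be referenced in later commands.
    $missingDisk = Get-PhysicalDisk | Where-Object { $_.OperationalStatus -eq "Lost Communication" }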

Next we need to tell the storage pool that the disk has been retired
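Retiring it is done with Set-PhysicalDisk; something like:

    # Mark the failed disk as retired so the pool stops expecting it back.
    Set-PhysicalDisk -InputObject $missingDisk -Usage Retired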

Adding a new Disk

To add the replacement disk into the Storage Pool
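Add-PhysicalDisk does this. The example below grabs whatever disk is available to be pooled, which on this server can only be the newly installed drive ($newDisk is just an example variable name):

    # Find the new disk (the only one not already in a pool) and add it to the pool.
    $newDisk = Get-PhysicalDisk -CanPool $true
    Add-PhysicalDisk -PhysicalDisks $newDisk -StoragePoolFriendlyName pool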

Repairing the Volumes

The next step after adding the new disk to the Storage Pool is to repair each of the Virtual Disks residing on it.
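Repair-VirtualDisk handles this; kicking each repair off with -AsJob hands the prompt straight back so progress can be checked afterwards (the volume names are the ones from my pool):

    # Repair each degraded virtual disk; -AsJob runs the repairs in the background.
    Repair-VirtualDisk -FriendlyName Media -AsJob
    Repair-VirtualDisk -FriendlyName Software -AsJob
    Repair-VirtualDisk -FriendlyName DocumentsPhotos -AsJob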

We can see the repair running by entering

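Checking the virtual disks again shows which ones are being worked on:

    # Virtual disks being repaired show an OperationalStatus of InService.
    Get-VirtualDisk | Format-Table FriendlyName, OperationalStatus, HealthStatus -AutoSize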

An OperationalStatus of InService lets us know the volume is currently being repaired. The percentage completion of the repair can be found by running

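Get-StorageJob reports each running repair along with its progress:

    # Each repair appears as a storage job with a PercentComplete value.
    Get-StorageJob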

Removing the lost Virtual Disks

Since there was no resiliency on the VMTemplates and Backups volumes, they can be deleted with the following command
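Remove-VirtualDisk takes care of this; double-check the names before running it, as the volumes are gone once removed:

    # Delete the simple (non-resilient) volumes whose data was lost with the disk.
    Remove-VirtualDisk -FriendlyName VMTemplates
    Remove-VirtualDisk -FriendlyName Backups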

Removing the Failed Disk from the Pool

This step will not work if you still have a Degraded virtual disk in the Storage Pool, so make sure all repairs have completed first.
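Once everything is healthy again, the retired disk can be dropped out of the pool with Remove-PhysicalDisk, reusing the $missingDisk variable captured earlier:

    # Remove the retired disk from the Storage Pool.
    Remove-PhysicalDisk -PhysicalDisks $missingDisk -StoragePoolFriendlyName pool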

TL;DR

Here is the Too Long; Didn’t Read version of the post
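The pool and volume names below are mine, so adjust them to suit your environment, and treat this as a sketch of the process rather than a script to paste in blindly:

    # 1. Check the pool, the volumes and the physical disks.
    Get-StoragePool -FriendlyName pool
    Get-VirtualDisk
    Get-PhysicalDisk

    # 2. After physically replacing the failed drive, retire the missing disk.
    $missingDisk = Get-PhysicalDisk | Where-Object { $_.OperationalStatus -eq "Lost Communication" }
    Set-PhysicalDisk -InputObject $missingDisk -Usage Retired

    # 3. Add the replacement disk to the pool.
    Add-PhysicalDisk -PhysicalDisks (Get-PhysicalDisk -CanPool $true) -StoragePoolFriendlyName pool

    # 4. Repair each resilient virtual disk and keep an eye on progress.
    Repair-VirtualDisk -FriendlyName Media -AsJob
    Repair-VirtualDisk -FriendlyName Software -AsJob
    Repair-VirtualDisk -FriendlyName DocumentsPhotos -AsJob
    Get-StorageJob

    # 5. Delete the non-resilient volumes that were lost.
    Remove-VirtualDisk -FriendlyName VMTemplates
    Remove-VirtualDisk -FriendlyName Backups

    # 6. Once all repairs have finished, remove the retired disk from the pool.
    Remove-PhysicalDisk -PhysicalDisks $missingDisk -StoragePoolFriendlyName pool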

The full list of Windows Server Storage Spaces CmdLets can be found on TechNet here: http://technet.microsoft.com/en-us/library/hh848705.aspx