As a longtime enterprise infrastructure specialist, I’ve spent countless hours trying to optimize the performance of environments. Early in my career, I spent some time on a team that worked very closely with the monitoring team, where I learned how hard it was to correlate the volumes of data collected. We were collecting so much data about our environment that it was almost overwhelming: things like the temperature of the CPU, how many storage IOs were pending, and how much memory was in use. We had all this awesome data, and what did we do with it? We set up monitoring to make sure the numbers didn’t cross a certain threshold. When they did cross that threshold, we sent an alert. All this data at our fingertips, and all we used it for was alerting. I knew something was off, but I was green and didn’t understand that we were missing the bigger picture.
That was a long time ago, and I’m on a very different career path now. What I’ve learned since is that data matters, and that we can use technical data to make business decisions. If you’ve never had to write a business justification for spending IT dollars this may seem foreign to you, so let me explain. Let’s say I have a server running a business-critical application responsible for batch processing and sending invoices. Without this application, the bills never get sent to the customers, and the money stops coming in. When the money stops flowing, the wheels of the business stop rolling, and it’s a huge problem. The physical server this application runs on is 6 years old, but it meets the business need. No one ever complains about performance, so it’s largely out of people’s minds. To the IT operations folks, this is an aging server that needs to be replaced, but that doesn’t translate into business value.
Is the IT operations person wrong? The answer is, of course, that it depends. To answer it, we really need to understand the performance characteristics of the server in question. Let’s look at the completely fictional story of an IT operations guy; let’s call him Alex.
With all the information being collected, Alex can see that the nightly batch process consumes 100% of the server’s CPU. He can also see that a dozen times a day the server uses all of its memory. What does this all mean to Alex? That sometimes the server is doing more work than it can handle. That is all he knows. What he can’t see is what the application is doing during those spikes, so he makes a note of the peak times and decides to talk to the application support team. Once Alex talks to Jen from the application team, he is able to connect the dots: the memory spikes happen when a sales guy is entering a big order. Alex knows he needs to talk to a sales guy to understand what is actually going on, so he calls Todd, who tells him that every time he uses the system to enter a sale he has to wait for 45 minutes. Todd says it has always been like this, so he has never complained to anyone. Alex now has a clear picture of what’s going on and enough information to see that he can give the sales team back hours of their day to make more sales. Even just giving Todd back 45 minutes a day is worth the cost of a new server.
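To make Alex’s argument concrete, here is a rough sketch of the arithmetic behind that kind of justification. Only the 45 minutes comes from the story; the server cost, the hourly value of a salesperson’s time, and the function name are hypothetical placeholders you would swap for your own figures.

```python
def payback_days(server_cost, hourly_value, minutes_saved_per_day, people=1):
    """Return how many working days of recovered time it takes to cover
    the cost of a new server. All inputs are supplied by the caller."""
    daily_value = hourly_value * (minutes_saved_per_day / 60) * people
    if daily_value <= 0:
        raise ValueError("recovered time must be worth more than zero")
    return server_cost / daily_value


# Hypothetical placeholder figures, not real pricing or salary data.
SERVER_COST = 12_000.0   # assumed purchase price of the replacement server
HOURLY_VALUE = 80.0      # assumed value of one hour of a salesperson's time

# 45 minutes a day is the wait Todd describes in the story.
days = payback_days(SERVER_COST, HOURLY_VALUE, minutes_saved_per_day=45)
print(f"Payback for one salesperson: {days:.0f} working days")
```

With these made-up numbers the server pays for itself in 200 working days for Todd alone, and passing a larger `people` value shortens that proportionally. The point is not the specific result but that the technical data turns into a number the business can weigh.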
I know the story above is a simple example, but it holds true any time IT is looking to spend money. IT can’t keep being a segmented cost center for the business, and that won’t change until you have insight into how changes in infrastructure impact the business. I recently had a chance to meet with the team at CloudPhysics at Tech Field Day 11, and I have to say I was impressed. Right now the software collects and reports on all manner of metrics inside VMware as an initial scope. It collects loads of data and uploads it to their centralized data lake. The data is then compared and correlated in some phenomenal ways. It will take VM data, host data, datastore data, and even knowledge base articles to gain insights into the performance of the system. On top of that, CloudPhysics is building a platform that allows you to see how a workload would change if the underlying infrastructure changed. I was very impressed by the team, but don’t take my word for it; check it out for yourself.