Performance is a feature

Recently a client of ours came to us to discuss one of their systems and the millions of hits per day it has to process. Through some fine-tuning and a migration to Azure they’re handling that load just fine on their API cluster. I had a little free time and wanted to take a look at whether there were any obvious hot paths that would be easy to fix. First, I had to learn how to profile a .NET Core application.

Donald Knuth had quite a bit to say about optimization:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Premature optimization occurs when you optimize something whose impact you don’t fully understand. Sure, you can probably guess that a database query is slow. With performance profiling you can know precisely how slow it is, and how much of a speedup your efforts have gained. It’s no different with code: performance metrics are key. With the stage set, let’s take a look at the different ways to collect those metrics for .NET Core, as the playing field has changed somewhat.

I have a strong opinion that an integrated development environment should be integrated, so when I want to accomplish something the first route I look at is Visual Studio. I’ll be exploring the slightly different tools VS 2017 offers for working with .NET Core applications. I’ve created a simple repository with two different ways of concatenating strings so we can explore their performance impact. All screenshots were taken using Visual Studio Enterprise with that repository.

It’s also a nice obvious indication of why you should avoid string concatenation in a loop!
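If you’re not following along with the repository, the two approaches look something like this (a minimal sketch; the method names mirror the ones discussed later, but the actual repository code may differ):

```csharp
using System.Text;

public static class StringTests
{
    // Naive version: each += allocates a brand-new string and copies
    // everything accumulated so far, making the loop O(n^2) overall.
    public static string TightLoopConcatenate(int iterations)
    {
        string result = "";
        for (int i = 0; i < iterations; i++)
        {
            result += i;
        }
        return result;
    }

    // StringBuilder version: appends into a growable internal buffer,
    // allocating only when the buffer needs to expand.
    public static string TightLoopStringBuilder(int iterations)
    {
        var builder = new StringBuilder();
        for (int i = 0; i < iterations; i++)
        {
            builder.Append(i);
        }
        return builder.ToString();
    }
}
```

Both produce identical output; only the allocation behavior differs.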

Starting with the obvious

Our little application has a problem. If you run the program you’ll find that one of the methods executes quite quickly while the other does not. On my machine these are the results I see:
Running StringBuilder
Time Elapsed: 1ms
Running Concatenate
Time Elapsed: 3079ms
Press enter to close the app.

Fortunately I’ve thought ahead and put in a stopwatch timer so we can see what’s going on here. If we were instead looking at an otherwise opaque system, we’d have to use tools to get these numbers. The easiest thing to do? Hit F5 and run the application in debug mode to completion. Be sure to hit enter to close the application.
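The stopwatch harness is nothing more than a Stopwatch wrapped around each call; a sketch along these lines (the Run helper and the work inside the lambdas are illustrative, not the repository’s exact Program.cs):

```csharp
using System;
using System.Diagnostics;
using System.Text;

class Program
{
    // Times a unit of work and reports it in the same format as the output above.
    public static long Run(string name, Action work)
    {
        Console.WriteLine($"Running {name}");
        var stopwatch = Stopwatch.StartNew();
        work();
        stopwatch.Stop();
        Console.WriteLine($"Time Elapsed: {stopwatch.ElapsedMilliseconds}ms");
        return stopwatch.ElapsedMilliseconds;
    }

    static void Main()
    {
        Run("StringBuilder", () =>
        {
            var sb = new StringBuilder();
            for (int i = 0; i < 30_000; i++) sb.Append(i);
        });

        Run("Concatenate", () =>
        {
            string s = "";
            for (int i = 0; i < 30_000; i++) s += i;
        });
    }
}
```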

Visual Studio, when running in debug mode, will helpfully display the Diagnostic Tools window (which can also be found under Debug -> Windows -> Show Diagnostic Tools) with a variety of tools for better understanding what our application is doing. From top to bottom we have Events, Process Memory, and CPU use.

  • Events
The Events window shows IntelliTrace events, a feature of Visual Studio Enterprise. These events are automatically fired by tools and the .NET framework libraries. They provide useful details about program state at a point in time, and enable features such as historical debugging. This system deserves an entire blog post of its own for another time.
  • Process Memory
    The CLR will track the running program’s memory use and display it in this time chart. Additionally you’ll see events such as garbage collection passes.
  • CPU use
Finally, a chart of CPU use over time as a percentage of all processors on the system. We only see limited detail here because CPU profiling isn’t enabled.

Below all of these, the Summary tab contains a simple button for ‘Record CPU Profile’. This is exactly what we’re looking for, as it enables a deeper look into our .NET Core application. Clicking it brings us to the CPU Usage tab, which shows that CPU profiling is enabled but was not for the previous execution of the application. With the CPU profiler now enabled we can re-run the application, and in doing so we find our CPU graph is now green!

Debug tools with visible CPU profile results

By drag-selecting a region of time in the CPU graph we can get a breakdown of what our code was doing in that period, filling the CPU Usage tab with the function calls that were executing during it. The Total CPU figures list what was going on in each of those individual calls.

In our performance test code here we can see that there are three offenders: Main, TightLoopConcatenate, and an external call to System.String.Concat. This gives us a roadmap to go looking for what’s going on with this code, but we can dive even deeper into these particular methods by double-clicking on any one. In our case we can double-click on the Main function call and get the caller/callee relationship chart.

The center box in this view is the function we’re investigating. Unsurprisingly, 90% of the program’s runtime was spent inside the Main function. The box to the right, however, reveals our issue: that one method took over 91.5% of the runtime while the other methods combined took less than a percentage point. By left-clicking the bright red box we can jump right into inspecting what’s going on there.

The caller and callee display of the diagnostics profiler

In this view we can see right away which line of code was the hot spot. All of those calls out to System.String.Concat took a long time to process, and now that we know exactly where the issue lies we can write a more efficient method to handle that line of code. Perhaps we might deprecate this method entirely with an Obsolete attribute, informing consumers of this library to use the StringBuilder version? Or we could simply refactor this method to use the StringBuilder approach below. Either way, we’ve isolated what our program is doing.
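Marking the slow method with [Obsolete] lets the compiler nudge every caller toward the replacement. A sketch of what that might look like (method names are illustrative):

```csharp
using System;
using System.Text;

public static class StringOps
{
    // Callers now get a compile-time warning pointing them at the faster method.
    [Obsolete("String concatenation in a loop is O(n^2); use TightLoopStringBuilder instead.")]
    public static string TightLoopConcatenate(int iterations)
    {
        string result = "";
        for (int i = 0; i < iterations; i++) result += i;
        return result;
    }

    public static string TightLoopStringBuilder(int iterations)
    {
        var builder = new StringBuilder();
        for (int i = 0; i < iterations; i++) builder.Append(i);
        return builder.ToString();
    }
}
```

Passing true as the attribute’s second argument upgrades the warning to a compile error once every consumer has migrated.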

Memory matters

But perhaps we need to investigate a memory issue instead? With millions of hits to a web service we should be mindful of the resources of the machines we’re using, even in the cloud. .NET performance degrades when available memory is limited, and reducing the memory footprint of each request helps keep requests snappy and throughput high. If we don’t need to scale out, we don’t need to pay for it!

Fortunately we also happen to have a nice tool that can tell us all about the memory allocations our application is performing by using snapshots. We’ll set three breakpoints in Program.cs, one at line 10, one at line 17, and one at line 22. This way we won’t interfere with the stopwatch timer. Run the application in debug mode and when the first breakpoint is hit go to the Memory Usage tab and click ‘Take Snapshot’.

This will mostly be a snapshot of the initial state of the application; we haven’t really done much yet other than write a line to the console. That’s reflected in the grand total of around 334 objects and 69.01KB of memory use on my machine. Since we have a snapshot we also get the opportunity to inspect the object heap by clicking the View Heap button. We’ll get to this in a moment. For now, go ahead and hit Continue to move to the next breakpoint and take another memory snapshot.

At this point we’ve executed the first of the two test methods, generating some significant memory use and loading up the heap. Fortunately by using a StringBuilder we haven’t incurred any GC pauses and we haven’t spent a lot of time doing the concatenation. We can take a look at precisely what’s changed by viewing the object heap with the View Heap button. In the heap viewer the top right dropdown lets us compare this snapshot #2 to snapshot #1 to see what’s changed.

With this view we can see that our StringBuilder got a bit bigger, and we’ve added both a Stopwatch and an Int64. We’ve also picked up a RuntimeType somewhere along the line, as well as a NumberFormatInfo, likely as part of the Stopwatch. If we needed to dig deeper we could select a specific object type to see a listing of where those objects are coming from.
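You can approximate the same snapshot diff in code with GC.GetTotalMemory, which, when asked to force a full collection first, reports only the bytes still reachable, much like the snapshot view. A rough sketch (the workload is illustrative):

```csharp
using System;
using System.Text;

public static class HeapDelta
{
    // Returns roughly how many live bytes a piece of work added to the heap.
    public static long MeasureLiveGrowth()
    {
        // Collect first so the baseline counts only objects that are still rooted.
        long before = GC.GetTotalMemory(forceFullCollection: true);

        var builder = new StringBuilder();
        for (int i = 0; i < 10_000; i++) builder.Append(i);
        string result = builder.ToString();

        long after = GC.GetTotalMemory(forceFullCollection: true);
        GC.KeepAlive(result); // keep the string rooted through the second reading
        return after - before;
    }
}
```

The numbers are approximate (the GC rounds to heap segment granularity), but the before/after delta tells the same story as comparing two snapshots.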

Now we’ll hit Continue again and arrive at the last breakpoint. It takes a bit longer this time, and we can watch the Memory graph bounce around with what appears to be a solid yellow bar above it. Taking a snapshot at the breakpoint reveals... not much has changed! But our graph clearly grew, so what gives? Clearly something happened.

The issue lies with our choice of location for the breakpoint. By the time we’ve reached our last breakpoint all of the problematic memory is no longer active according to the garbage collector. We’re outside the library method and we’re limited to the objects that have a path to root, or objects that the GC considers to be ‘alive’. That yellow bar above the chart? Let’s zoom in a lot.

Here we can see the issue at play. Each one of those yellow tags indicates a garbage collection event occurred. We’re generating a huge amount of memory inside this method and the garbage collector is being called very often to make sure the memory pressure stays low. Each time it’s called the strings being concatenated are cleaned up, so they won’t show up in our final breakpoint’s snapshot as they’re already gone.

We’ll adjust our debug setup by setting a breakpoint on line 14 of Class1.cs and pausing there to take a look. I’ve also let the loop counter run to around 5000 and taken another snapshot at that point. If you hit Continue a few times, taking a snapshot each time, you’ll notice that while the object count doesn’t increase, the heap size does grow a little at a time. The heap analysis doesn’t show the cumulative effects, but the process memory chart does: the one GC event in my timeline indicates a whole 12.6MB (!) of strings was cleaned up in one go. Clearly we should take a look at precisely what this method is doing to generate so much memory, and in doing so we learn that strings are immutable types and can generate a lot of GC overhead.
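You can also watch the collector’s behavior from inside the program: GC.CollectionCount reports how many times a generation has been collected, so a delta around the loop counts the same GC events the profiler showed as yellow tags. A sketch, with an illustrative iteration count:

```csharp
using System;

public static class GcPressure
{
    // Counts gen-0 collections triggered while concatenating in a loop.
    public static int Gen0CollectionsDuringConcat(int iterations)
    {
        int before = GC.CollectionCount(0);

        string s = "";
        for (int i = 0; i < iterations; i++)
        {
            s += i; // each pass allocates a fresh, ever-longer string
        }

        return GC.CollectionCount(0) - before;
    }
}
```

With enough iterations this reports many collections, while the equivalent StringBuilder loop typically triggers none at all.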

How should we defend against these types of allocations? Tools like the CLR Heap Allocation Analyzer can give you insight when writing the code the first time, while the Visual Studio memory profiler can help figure out where large allocations of memory are coming from. Other tools such as the JetBrains dotMemory tool can further assist with problematic memory use.

Invest to save

By using these out-of-the-box profiler features we can narrow down the source of expensive code and see whether there are more effective ways to write it. Spending a little time watching our program in the debugger can give us insight into how it might handle a huge number of requests, such as the 3 billion per day that one of our clients has to deal with. Analyzing hotspots of CPU use and large memory allocations can not only improve your program’s throughput but also reduce your Azure bill, as your systems scale out less often when dealing with high load.

What’s your performance bottleneck in your codebase? Do you find your team doesn’t have enough time to circle back around to look at things with a critical eye? Nebbia can help your team find the time to investigate issues and more effectively structure your organization to adapt to the modern Azure cloud development lifecycle.