Vladimir Vukićević — Words
 



If you’ve ever used any profiling tools but have never used Shark on OS X, find a Mac and play with it for a while.  In fact, if you can’t find a Mac, go to an Apple Store, find one with the SDK installed, and play with Shark.  I’ll warn you, though: it’ll spoil you.  You will have a hard time using any other profiling tools.

While much of what we do at Mozilla can be profiled using Shark with results that are meaningful cross-platform, the graphics subsystem is not one of those areas.  Under OS X, all of our drawing operations go through Quartz; we don’t go through much of Cairo’s low-level software rasterizer, as we do on Win32.  So I often have to profile on Win32 when chasing down performance issues.

The best tool available for profiling on Windows today is Intel’s VTune, at least as far as I can find.  (If there are better tools, please let me know.)  Much like Shark and other similar tools, it collects data whenever a sampling interval expires, either time-based or program counter based.  Crucially, unlike Shark, it does not collect the entire call stack at each sample point, which becomes an issue later.
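To make that last distinction concrete, here is a toy sketch in Python.  It is not how either tool is implemented; the sample data and function names are invented, and it only assumes that a sample is either a full call stack or just the function that was executing.

```python
from collections import Counter

# Hypothetical samples: each one is a call stack, outermost frame first.
# All function names here are made up for illustration.
STACK_SAMPLES = [
    ("main", "paint", "fill_rect", "memcpy"),
    ("main", "paint", "fill_rect", "memcpy"),
    ("main", "paint", "draw_text", "memcpy"),
    ("main", "layout", "measure"),
]

def leaf_only(samples):
    """Keep only the innermost frame, as a sampler without call stacks would."""
    return [stack[-1] for stack in samples]

def flat_profile(leaves):
    """Count samples per function, most frequent first."""
    return Counter(leaves).most_common()

def callers_of(samples, function):
    """With full stacks we can also ask who called a hot function."""
    callers = Counter()
    for stack in samples:
        # Pair each frame with the frame it called.
        for caller, callee in zip(stack, stack[1:]):
            if callee == function:
                callers[caller] += 1
    return callers.most_common()

if __name__ == "__main__":
    print(flat_profile(leaf_only(STACK_SAMPLES)))
    print(callers_of(STACK_SAMPLES, "memcpy"))
```

Both kinds of sample can tell you that `memcpy` is hot.  Only the stack samples can tell you that `fill_rect` is responsible for most of it, because the leaf-only data has already thrown the callers away.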

The main difference between the two is the UI.  Here’s what Shark looks like when you start it up:

Note that this is with the mini-config editor shown; normally you’d only see the top line.  A Start button, a (sane) default sampling type, whether you want to sample a process or everything, and a list of running processes.  If I wanted to, I could have Shark launch a process for me, which would be useful if I was profiling startup, but there’s no need to bother in this case.

Let’s hit Start and sample Firefox; note that I’m using a debug build with symbols because it’s what I have handy, and am running a simple benchmark.  I fire up the benchmark and hit Option-Esc at the start, and Option-Esc at the end.  Shark will chug a bit, looking up symbols and so forth, and this view will come up:

With no configuration, I have an incredibly useful view of what’s going on in my program.  From this, I can clearly see which functions are taking the most time.  But, because of the call stack sampling I mentioned above, I can drill down and figure out exactly who’s calling these top functions:

A top-down view provides an inverse view of my sampling run.  95% of all samples were in firefox-bin, and they were all from the start() symbol, so I can quickly see the hot path all the way from the start of my program here:

These views are the simplest ones that Shark provides, but they provide a wealth of information.  Based on this, I have a clear picture of where time is being spent, and can identify potential candidates for optimization.
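For anyone curious how a top-down view and its hot path can fall out of call stack samples, here is a rough Python sketch.  It is my own illustration, not Shark’s code; the tree layout and the sample stacks (other than the start symbol) are assumptions.

```python
def build_tree(samples):
    """Merge stacks into a top-down tree of {"count": n, "children": {...}} nodes."""
    root = {"count": 0, "children": {}}
    for stack in samples:
        root["count"] += 1
        node = root
        for frame in stack:  # outermost frame first
            node = node["children"].setdefault(
                frame, {"count": 0, "children": {}}
            )
            node["count"] += 1
    return root

def hot_path(root):
    """Follow the most-sampled child at every level, starting at the root."""
    path = []
    node = root
    while node["children"]:
        name, node = max(
            node["children"].items(), key=lambda item: item[1]["count"]
        )
        path.append((name, node["count"]))
    return path

if __name__ == "__main__":
    # Hypothetical stacks; only "start" is taken from the profile above.
    samples = [
        ("start", "main", "paint", "fill"),
        ("start", "main", "paint", "fill"),
        ("start", "main", "paint", "fill"),
        ("start", "main", "layout"),
    ]
    for name, count in hot_path(build_tree(samples)):
        print(name, count)
```

Each node’s count is the number of samples that passed through it, so walking down the largest child at each level gives the path that accounts for most of the run.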

However, going back to the first view above, the top function is in CoreGraphics; it was called by functions in libRIP.A.dylib.  As those are core OS X system libraries, it’s interesting to know that that’s where my time is spent, but I really care about where the time is spent in parts of the application that I have control over.  Shark has a feature called Data Mining that lets you do exactly this.  For any symbol or library, you can perform a simple operation — charge its time to its callers.  Here’s what the right-click menu looks like for the top symbol:

Note that you can also do things like remove certain call paths from the sampling display, to drill down into cases that you care about that you may not have been able to create direct benchmarks for.  I’m going to go through and charge the system libraries all to callers — CoreGraphics, libRIP, libSystem, mach_kernel, AppKit, libobjc, etc., and I end up with this view:

If the first view, with no work or configuration, was merely “incredibly useful”, this is mind-bogglingly amazing.  I am looking at where time is being spent in my app, and that’s it.  I can still drill down and see who called _cairo_quartz_surface_fill:

That’s only a small fraction of what Shark can do, but even without knowing any other Shark abilities, you get crucial and useful information for your app.
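The “charge to callers” operation described above is simple to sketch on top of call stack samples.  The Python below is my reading of the idea, not Shark’s actual implementation: frames are hypothetical (library, symbol) pairs, the "app" library label is a placeholder, and every symbol except _cairo_quartz_surface_fill is invented.

```python
from collections import Counter

# Hypothetical samples. Each frame is (library, symbol), outermost first.
# Only _cairo_quartz_surface_fill appears in the real profile; the other
# symbols and the "app" library label are made up for this example.
SAMPLES = [
    (("app", "start"), ("app", "_cairo_quartz_surface_fill"),
     ("libRIP.A.dylib", "rip_render"), ("CoreGraphics", "cg_blend")),
    (("app", "start"), ("app", "_cairo_quartz_surface_fill"),
     ("libRIP.A.dylib", "rip_render"), ("CoreGraphics", "cg_blend")),
    (("app", "start"), ("app", "layout_frame"), ("libSystem", "memcpy")),
    (("app", "start"), ("app", "layout_frame")),
]

SYSTEM_LIBRARIES = {"CoreGraphics", "libRIP.A.dylib", "libSystem"}

def charge_to_callers(samples, libraries):
    """Drop frames from the given libraries so each sample lands on its caller."""
    charged = []
    for stack in samples:
        kept = tuple(frame for frame in stack if frame[0] not in libraries)
        if kept:  # a stack made only of charged frames has no caller left
            charged.append(kept)
    return charged

def top_symbols(samples):
    """Count samples by the symbol of the innermost remaining frame."""
    return Counter(stack[-1][1] for stack in samples)

if __name__ == "__main__":
    print(top_symbols(SAMPLES))
    print(top_symbols(charge_to_callers(SAMPLES, SYSTEM_LIBRARIES)))
```

Before charging, the top of the list is system code; afterwards, the same samples show up against the application functions that caused that work.  Note that none of this is possible without the full stack in every sample.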

Let’s run through the same thing with VTune.  Here’s the first screen you see when you launch VTune:

Let’s do some Quick Performance Analysis, and I get sent over to this screen:

My options are to profile the entire system or to select an application to launch.  I don’t want to launch an application (though much of the remainder applies regardless), so let’s profile the entire system.  Firefox is already running, so that should get caught.  I can’t actually even choose call graph data here, because that requires an app to launch; if you do choose both call graph and sampling data, it will run your app twice, the first time to sample, and the second to collect call graph info (with no merging of the data, just two different datasets).  Let’s choose sampling and go:

We’re presented with this UI information overload disaster.  I’m going to make the window quite large to see the data from now on, but this is what we have.  We see that most of the instructions retired samples are in firefox.exe; drilling down into that, 99.9% is inside a single thread, also good.  The problems start on the next view, once I drill down into that thread:

I am presented with a view of all the modules that were included in a sample of this thread.  Why do I care about the module?  I might care to see what module a symbol is in, but I really want all the symbols, not all the modules.  Most of the clockticks are in ntkrnlpa.exe, and the second highest (and most of the instructions) are in igxpdv32.DLL.  I don’t control either of those; the first is a Windows kernel module, and the second is (I think) the Intel video driver.  I could obtain symbols for the first from Microsoft, but I have no idea what’s happening in the second.  xul.dll is next, which we’ll jump into.  Note that this was even more useless before libXUL was created; when we had a dozen different component DLLs, it was impossible to get any kind of information here.  Even now, some pieces, such as the JS library and NSPR, are separate.  Let’s drill down into xul.dll:

This view shows me the functions inside xul.dll that were executing when a sample was taken.  I have no idea how this relates to the rest of the system; indeed, if any of these functions happened to call a function that took up most of the time inside ntkrnlpa.exe, I’d have no idea which one it would be, because the time spent inside ntkrnlpa.exe isn’t represented here.  This data is at best not useful, and at worst damagingly misleading.  For example, in the image above, the OnPaint method is showing up as taking almost no time.  In reality, all of the time is inside OnPaint’s descendants, but VTune doesn’t have this information.  Doing a Call Graph profiling run reveals some of this, but it suffers from many similar UI difficulties, including being able to see only immediate ancestors and immediate children of a given function.
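The OnPaint problem is the difference between “self” time and “total” time.  Here is a small Python sketch of how the two diverge, assuming call stack samples are available.  The stacks are invented; only the OnPaint name comes from the profile above.

```python
from collections import Counter

# Hypothetical samples, outermost frame first. OnPaint is the method from
# the VTune view discussed above; every other name is a placeholder.
SAMPLES = [
    ("OnPaint", "paint_frame", "fill_rect", "kernel_blit"),
    ("OnPaint", "paint_frame", "fill_rect", "kernel_blit"),
    ("OnPaint", "paint_frame", "draw_text"),
    ("OnPaint",),
    ("handle_event", "run_script"),
]

def self_and_total(samples):
    """Return (self, total) sample counts per function."""
    self_counts = Counter()
    total_counts = Counter()
    for stack in samples:
        # Self: the function that was actually executing.
        self_counts[stack[-1]] += 1
        # Total: every function on the stack, counted once per sample
        # so that recursion does not inflate the number.
        for frame in set(stack):
            total_counts[frame] += 1
    return self_counts, total_counts

if __name__ == "__main__":
    self_counts, total_counts = self_and_total(SAMPLES)
    print("OnPaint self: ", self_counts["OnPaint"], "of", len(SAMPLES))
    print("OnPaint total:", total_counts["OnPaint"], "of", len(SAMPLES))
```

In this made-up data OnPaint is the executing function in one sample out of five, but it is on the stack in four of them.  A view limited to self time, which is all that leaf-only samples can provide, shows only the first number.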

It’s entirely possible that my difficulties with VTune are because I just don’t know how to use it.  I’d be happy to admit this, if someone could show me how to get the same data that I can get out of Shark out of VTune.  But as it is, VTune is incredibly frustrating to use for optimizing any kind of application.  I can see it being useful for doing microoptimization on specific inner loops that can be isolated into a stand-alone benchmark, but Shark handles that case fine as well.

On top of all of this, Shark comes with the free OS X SDK, whereas VTune is a $700 product with an annual $280 support renewal (required to receive upgrades, even for the major version you purchased!).  I hope that the VTune folks bite the bullet and just copy Shark’s UI for the next version; why mess with something that works really well?  There’s still plenty of innovation that would be possible, especially in the area of optimization assistants and the like, but for pure profiling work, give me Shark.


14 Comments to “Why can’t VTune be more like Shark?”  

  1. Perry Lorier

    I /really/ like valgrind, using “callgrind” skin and then using kcachegrind for reporting. It runs with “valgrind --tool=callgrind $yourprocesshere” and then generates a file that’s read by kcachegrind. This produces various reports, including a nested stacked box graph (http://kcachegrind.sourceforge.net/cgi-bin/show.cgi/KcacheGrindCalleeMap). This is awesome coz you can see where your CPU’s going at a glance. There are a variety of visualisations (call graphs, callee/caller, per line of code, etc) which really help nail down what small changes you can make to your program to produce huge performance improvements.

  2. David Smith

    I realize you probably already know this, but just for the benefit of anyone else reading who wants to try out Shark:

    1) With a profile open, hit Window Menu > Show Advanced Settings
    2) Among the plethora of useful things here is a checkbox “Show All Branches”, make sure it’s checked

    The “total” column is now actually useful rather than merely showing the exact same value as the “self” column! I rarely use Charge * to Callers anymore because I can just sort by total.

  3. Joe

    One comment about Shark: If you go to Window > Show Advanced Settings, you can get Shark to automatically charge all system libraries to their callers by ticking the appropriate checkbox. This should save a lot of manual work.

    Also, something that people will inevitably point out is that Shark comes free with the OS, but that OS isn’t free and it only works on Apple computers, so it’s not precisely Apples-to-apples.

  4. Jeffrey Stedfast

    Heya Vlad,

    Michael Zucchi and I used to use Rational’s Quantify for profiling Evolution back in the day (we had a copy for Solaris) and I /think/ it is available for Windows these days. Of course, it’s not free ;-)

    I also can’t really remember how it compares to Shark or VTune other than Quantify instrumented the code, it didn’t use intervals iirc. So it probably gets more accurate data overall, but that might not be all that important depending on what you are profiling. The sampling technique used by most tools is probably “good enough”.

    http://www.mozilla.org/unix/quantify.html

    hehe :-)

    Anyways, clearly not as quick and simple to get setup as Shark which is one of the down sides to needing to instrument the code :-(

  5. asdf

    On the other hand, Apple’s instrumented profiling is incredibly non-useful, as you will quickly produce traces so big that the analysis tool can’t open them. It is buggy as hell in many other ways too.

  6. Miha

    Vlad did you by any chance try AQTime from http://www.automatedqa.com/ ?

  7. Joe

    Also, I just remembered one awesome memory/performance profiler that I used/saw used in a previous job: GlowCode. It doesn’t work on 64-bit code, but I do remember it being pretty awesome.

  8. vladimir

    @fejj: Yeah, Quantify still exists, but the last few times I tried it, it either didn’t work with more recent versions of Visual C++, or just flat out balked at Mozilla-sized code. Not a pretty experience.

    @Miha: I haven’t! I remember hearing about AQTime a while ago, but it completely slipped my mind. I’ve grabbed the eval and will give it a try. They have sane licensing terms as well, so I hope it works out well.

    @Perry: Yeah, linux has a bunch of great tools here, especially since it’s easy to collect data and then massage them for different displays. I’ve mostly used oprofile and sysprof, but I need to try kcachegrind as well.

    @Joe: Cool, I’ll add GlowCode to the mix.

    I guess I’ll do another followup post once I try out AQTime and GlowCode, and I’ll poke at the linux tools as well.

  9. Robert O'Callahan

    I tried AQTime a while back. It sucked, I think because it didn’t do call-stack sampling, which is what Shark does and what VTune doesn’t (which is the fundamental reason why VTune sucks, although the UI is horrible too).

    Sysprof and oprofile (and dtrace?) do call-stack sampling. Oprofile doesn’t really have UI, sysprof’s UI is in the right direction but not quite as good as Shark’s last I checked. Having the X server doing a lot of work on your app’s behalf, but in its own process, really hurts profilability though.

    Rob

  10. vladimir

    @roc: I think AQtime has changed significantly since then; I just wrote up a post at http://blog.vlad1.com/2008/07/20/aqtime/ about my quick eval of it. I’ll be talking about its allocation profiler as well, since it seems pretty useful.

  11. Bluehive

    Hey Vladimir,
    I’m quite late to post a reply, as I happened to see this interesting blog only today. I’d like you to look at one more tool: HP Caliper. Take my word, you’ll get addicted to it like me :). Besides HP-UX, it works on Linux as well. The call graph feature you mentioned is available with Caliper in two forms: 1. sampled call graph, 2. sampled call stack. It can give you every bit of info you need, including thread primitive data, application blocking/running status, disassembly listing and so on. I know Shark is good, but Caliper is a serious contender :-) The only downside is that it’s available on the Intel Itanium platform only.

    Bluehive

  12. Zhentar

    I came across this post a few weeks ago, wishing VTune could do this. I have since found a solution: Shark’s Windows counterpart. The Windows Performance Toolkit, a free download from MSDN, has a UI that puts VTune to shame, and in Vista will do call stack profiling. You will have to do one step of set-up to download symbols for system files. I also have not been able to get the call stack functionality to work in 64-bit, though it’s supposed to be possible.

  1. AQtime at Vladimir Vukićević
  2. BSBlog » Blog Archive » Profiling Dromaeo Testcases with Shark