Profiling Tools


In-code performance profiling

The onnxruntime_perf_test.exe tool (available from the build drop) can be used to test various performance knobs; run onnxruntime_perf_test.exe -h for usage instructions. The perf_view tool can also be used to render the statistics as a summarized view in the browser.

You can enable ONNX Runtime latency profiling in code:

import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.enable_profiling = True

# "model.onnx" is a placeholder for your model path; after inference,
# end_profiling() stops profiling and returns the path of the JSON file.
sess = rt.InferenceSession("model.onnx", sess_options)
profile_file = sess.end_profiling()

If you are using the onnxruntime_perf_test.exe tool, you can add -p [profile_file] to enable performance profiling.

In both cases, you will get a JSON file containing detailed performance data (threading, the latency of each operator, etc.). This file uses a standard performance tracing format, and several tools can open it to give a user-friendly view:

  • (Windows) Use the Windows Performance Analyzer (WPA) GUI to open the trace, via the Perfetto OSS plugin (Microsoft-Performance-Tools-Linux-Android)
  • Perfetto UI - Successor to Chrome Tracing UI
  • chrome://tracing:
    • Open a Chromium based browser such as Edge or Chrome
    • Type chrome://tracing in the address bar
    • Load the generated JSON file
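Beyond the viewers above, the trace can also be summarized programmatically. The sketch below aggregates per-operator latency from the JSON events; it assumes only the "cat", "name", and "dur" fields shown in the examples later on this page, and the file path is whatever your session produced.

```python
import json
from collections import defaultdict

def summarize_profile(path):
    """Sum the duration (microseconds) of each Node event in an ORT profile."""
    with open(path) as f:
        events = json.load(f)  # the profile is a JSON array of trace events
    totals = defaultdict(int)
    for ev in events:
        if ev.get("cat") == "Node":
            totals[ev["name"]] += ev.get("dur", 0)
    return dict(totals)
```

Sorting the resulting dictionary by value gives a quick "hottest operators" list without opening a trace viewer.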

Execution Provider (EP) Profiling

Starting with ONNX Runtime 1.17, support has been added for profiling execution providers (EPs) and Neural Processing Units (NPUs), provided the EP supports profiling in its SDK.

Qualcomm QNN EP

As mentioned in the QNN EP documentation, profiling is supported.

Cross-Platform CSV Tracing

The Qualcomm AI Engine Direct SDK (QNN SDK) supports profiling. When a developer uses the QNN SDK directly, outside of ONNX Runtime, it writes profiling output to a text CSV file. ONNX Runtime mimics this support and outputs the same CSV format.

If profiling_level is provided, ONNX Runtime will append profiling data to a qnn-profiling-data.csv file in the current working directory.

TraceLogging ETW (Windows) Profiling

As covered in the logging documentation, ONNX Runtime supports dynamic enablement of ETW tracing providers, using the settings below. If the TraceLogging provider is enabled and profiling_level is provided, CSV output is automatically disabled.

  • Provider Name: Microsoft.ML.ONNXRuntime
  • Provider GUID: 3a26b1ff-7484-7484-7484-15261f42614d
  • Keywords: Profiling = 0x100 per logging.h
  • Level:
    • 5 (VERBOSE) = profiling_level=basic (good details without perf loss)
    • greater than 5 = profiling_level=detailed (individual ops are logged with inference perf hit)
  • Event: QNNProfilingEvent
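As one way to capture these events, a trace session covering this provider can be started on Windows with the built-in logman tool. This is a sketch, not the only option (WPR profiles work as well); the session name and output path are arbitrary placeholders.

```shell
:: Start an ETW session for the ONNX Runtime provider with
:: keywords 0x100 (Profiling) at level 5 (VERBOSE, i.e. profiling_level=basic)
logman start ort_profile -p "{3a26b1ff-7484-7484-7484-15261f42614d}" 0x100 0x5 -o ort_profile.etl -ets

:: ... run your ONNX Runtime workload ...

:: Stop the session; events are in ort_profile.etl
logman stop ort_profile -ets
```

The resulting .etl file can then be opened in WPA as described above.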

GPU Profiling

To profile CUDA kernels, please add the cupti library to your PATH and use the onnxruntime binary built from source with --enable_cuda_profiling. To profile ROCm kernels, please add the roctracer library to your PATH and use the onnxruntime binary built from source with --enable_rocm_profiling.

Performance numbers from the device will then be attached to those from the host. For example:

{"cat":"Node", "name":"Add_1234", "dur":17, ...}
{"cat":"Kernel", "name":"ort_add_cuda_kernel", "dur":33, ...}

Here, the “Add” operator on the host launched a CUDA kernel named “ort_add_cuda_kernel” on the device, which ran for 33 microseconds. If an operator launched multiple kernels during execution, the performance numbers of those kernels will all be listed following the call sequence:

{"cat":"Node", "name":<name of the node>, ...}
{"cat":"Kernel", "name":<name of the kernel called first>, ...}
{"cat":"Kernel", "name":<name of the kernel called next>, ...}
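That host/device association can be reconstructed with a short script. This is a minimal sketch assuming the events appear in call order, as described above, and using only the "cat" and "name" fields:

```python
def group_kernels(events):
    """Attach each device Kernel event to the most recent host Node event."""
    grouped = {}
    current = None
    for ev in events:
        if ev["cat"] == "Node":
            current = ev["name"]
            grouped.setdefault(current, [])
        elif ev["cat"] == "Kernel" and current is not None:
            grouped[current].append(ev["name"])
    return grouped
```

For the example above, this maps "Add_1234" to the list ["ort_add_cuda_kernel"], making it easy to see which device kernels each host operator triggered.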