ONNX Runtime provides various graph optimizations to improve performance. Graph optimizations are essentially graph-level transformations, ranging from small graph simplifications and node eliminations to more complex node fusions and layout optimizations.
Graph optimizations are divided in several categories (or levels) based on their complexity and functionality. They can be performed either online or offline. In online mode, the optimizations are done before performing the inference, while in offline mode, the runtime saves the optimized graph to disk. ONNX Runtime provides Python, C#, C++, and C APIs to enable different optimization levels and to choose between offline vs. online mode.
Below we provide details on the optimization levels, the online/offline mode, and the various APIs to control them.
Graph optimizations are divided into three levels:
- Layout Optimizations
The optimizations belonging to one level are performed after the optimizations of the previous level have been applied (e.g., extended optimizations are applied after basic optimizations have been applied).
All optimizations are enabled by default.
These are semantics-preserving graph rewrites which remove redundant nodes and redundant computation. They run before graph partitioning and thus apply to all the execution providers. Available basic graph optimizations are as follows:
Constant Folding: Statically computes parts of the graph that rely only on constant initializers. This eliminates the need to compute them during runtime.
- Redundant node eliminations: Remove all redundant nodes without changing the graph structure. The following such optimizations are currently supported:
- Identity Elimination
- Slice Elimination
- Unsqueeze Elimination
- Dropout Elimination
- Semantics-preserving node fusions : Fuse/fold multiple nodes into a single node. For example, Conv Add fusion folds the Add operator as the bias of the Conv operator. The following such optimizations are currently supported:
- Conv Add Fusion
- Conv Mul Fusion
- Conv BatchNorm Fusion
- Relu Clip Fusion
- Reshape Fusion
These optimizations include complex node fusions. They are run after graph partitioning and are only applied to the nodes assigned to the CPU or CUDA or ROCm execution provider. Available extended graph optimizations are as follows:
|GEMM Activation Fusion||CPU|
|Matmul Add Fusion||CPU|
|Conv Activation Fusion||CPU|
|GELU Fusion||CPU, CUDA, ROCm|
|Layer Normalization Fusion||CPU, CUDA, ROCm|
|BERT Embedding Layer Fusion||CPU, CUDA, ROCm||Fuse BERT embedding layer, layer normalization and attention mask length|
|Attention Fusion*||CPU, CUDA, ROCm|
|Skip Layer Normalization Fusion||CPU, CUDA, ROCm||Fuse bias of fully connected layer, skip connection and layer normalization|
|Bias GELU Fusion||CPU, CUDA, ROCm||Fuse bias of fully connected layer and GELU activation|
|GELU Approximation*||CUDA, ROCm||Disabled by default. Enable with kOrtSessionOptionsEnableGeluApproximation|
Approximations (click to expand)
Approximations (click to expand)
To optimize performance of BERT, approximation is used in GELU Approximation and Attention Fusion for CUDA and ROCm execution provider. The impact on accuracy is negligible based on our evaluation: F1 score for a BERT model on SQuAD v1.1 is almost same (87.05 vs 87.03).
These optimizations change the data layout for applicable nodes to achieve higher performance improvements. They are run after graph partitioning and are only applied to nodes assigned to CPU execution provider. Available layout optimizations are as follows:
- NCHWc Optimizer: Optimizes the graph by using NCHWc layout instead of NCHW layout.
All optimizations can be performed either online or offline. In online mode, when initializing an inference session, we also apply all enabled graph optimizations before performing model inference. Applying all optimizations each time we initiate a session can add overhead to the model startup time (especially for complex models), which can be critical in production scenarios. This is where the offline mode can bring a lot of benefit. In offline mode, after performing graph optimizations, ONNX Runtime serializes the resulting model to disk. Subsequently, we can reduce startup time by using the already optimized model and disabling all optimizations.
- When running in offline mode, make sure to use the exact same options (e.g., execution providers, optimization level) and hardware as the target machine that the model inference will run on (e.g., you cannot run a model pre-optimized for a GPU execution provider on a machine that is equipped only with CPU).
- When layout optimizations are enabled, the offline mode can only be used on compatible hardware to the environment when the offline model is saved. For example, if model has layout optimized for AVX2, the offline model would require CPUs that support AVX2.
ONNX Runtime defines the
GraphOptimizationLevel enum to determine which of the aforementioned optimization levels will be enabled. Choosing a level enables the optimizations of that level, as well as the optimizations of all preceding levels. For example, enabling Extended optimizations, also enables Basic optimizations. The mapping of these levels to the enum is as follows:
- GraphOptimizationLevel::ORT_DISABLE_ALL -> Disables all optimizations
- GraphOptimizationLevel::ORT_ENABLE_BASIC -> Enables basic optimizations
- GraphOptimizationLevel::ORT_ENABLE_EXTENDED -> Enables basic and extended optimizations
- GraphOptimizationLevel::ORT_ENABLE_ALL -> Enables all available optimizations including layout optimizations
To enable serialization of the optimized model to disk, set the SessionOptions option
import onnxruntime as rt sess_options = rt.SessionOptions() # Set graph optimization level sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED # To enable model serialization after graph optimization set this sess_options.optimized_model_filepath = "<model_output_path\optimized_model.onnx>" session = rt.InferenceSession("<model_path>", sess_options)
const OrtApi* Ort::g_api = OrtGetApi(ORT_API_VERSION); OrtEnv* env; g_ort->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "test", &env); OrtSessionOptions* session_options; g_ort->CreateSessionOptions(&session_options) // Set graph optimization level g_ort->SetSessionGraphOptimizationLevel(session_options, ORT_ENABLE_EXTENDED); // To enable model serialization after graph optimization set this const wchar_t* optimized_model_path = L"optimized_model_path"; g_ort->SetOptimizedModelFilePath(session_options, optimized_model_path); OrtSession* session; const wchar_t* model_path = L"model_path"; g_ort->CreateSession(env, model_path, session_option, &session);
SessionOptions so = new SessionOptions(); // Set graph optimization level so.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_EXTENDED; // To enable model serialization after graph optimization set this so.OptimizedModelFilePath = "model_output_path\optimized_model.onnx" var session = new InferenceSession(modelPath, so);
Ort::SessionOptions session_options; // Set graph optimization level session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED); // To enable model serialization after graph optimization set this session_options.SetOptimizedModelFilePath("optimized_file_path"); auto session_ = Ort::Session(env, "model_file_path", session_options);