afe.ir.quantization_conv
========================

.. py:module:: afe.ir.quantization_conv

.. autoapi-nested-parse::

   Quantization functions for convolution and matrix multiply.

Attributes
----------

.. autoapisummary::

   afe.ir.quantization_conv.ChannelScale
   afe.ir.quantization_conv.ChannelQScale
   afe.ir.quantization_conv.ChannelShift
   afe.ir.quantization_conv.INTRINSIC_SHIFT_LO
   afe.ir.quantization_conv.INTRINSIC_SHIFT_HI

Classes
-------

.. autoapisummary::

   afe.ir.quantization_conv.ConvolutionPrecision
   afe.ir.quantization_conv.ConvPlanRequantization
   afe.ir.quantization_conv.ConvPlanQuantizations
   afe.ir.quantization_conv.ConvBacktrackingParameters

Functions
---------

.. autoapisummary::

   afe.ir.quantization_conv.reshape_weight_to_output_channels
   afe.ir.quantization_conv.get_quantization_range
   afe.ir.quantization_conv.decompose_power_of_2
   afe.ir.quantization_conv.normalize_with_pow2
   afe.ir.quantization_conv.weight_single_quantization_scale
   afe.ir.quantization_conv.weight_quantization_scale
   afe.ir.quantization_conv.select_convolution_scales
   afe.ir.quantization_conv.run_backtracking_loop
   afe.ir.quantization_conv.adjust_plan_zero_weights
   afe.ir.quantization_conv.try_increase_intrinsic_shift
   afe.ir.quantization_conv.try_adjust_plan_shift_value
   afe.ir.quantization_conv.try_adjust_plan_product_value
   afe.ir.quantization_conv.quantize_convolution_scales
   afe.ir.quantization_conv.quantize_weight_tensor
   afe.ir.quantization_conv.try_quantize_bias_tensor
   afe.ir.quantization_conv.quantized_product_zero_value
   afe.ir.quantization_conv.output_zp_correction_in_bias
   afe.ir.quantization_conv.quantize_convolution_parameters
   afe.ir.quantization_conv.get_bfloat16_with_int_weights_quant_params

Module Contents
---------------

.. py:data:: ChannelScale

.. py:data:: ChannelQScale

.. py:data:: ChannelShift

.. py:data:: INTRINSIC_SHIFT_LO
   :value: 1

.. py:data:: INTRINSIC_SHIFT_HI
   :value: 8

.. py:function:: reshape_weight_to_output_channels(weight: numpy.ndarray) -> numpy.ndarray

   Reshape a weight tensor so that its last axis corresponds to a convolution
   operation's output channel axis. That is, the convolution's output at a given
   channel, output[..., c], depends on reshaped_weights[..., c], bias[c], and
   some values from the convolution's input.

   This tensor shape is useful for code that computes per-channel information or
   does per-channel scaling on weights.

.. py:function:: get_quantization_range(dtype: Union[afe.ir.tensor_type.ScalarType, numpy.number], asymmetry: bool) -> Tuple[int, int]

   Get the numeric range that should be used when quantizing numbers to be
   stored using dtype. The range is the entire value range when using asymmetric
   quantization, and is reduced to a symmetric range when using symmetric
   quantization.

   :param dtype: Quantized data type. It must be a signed integer type.
   :param asymmetry: Whether to use an asymmetric range
   :return: Numeric range

.. py:function:: decompose_power_of_2(x: ChannelScale, rounding: ml_kernels.math_helpers.RoundType) -> Tuple[ChannelShift, ChannelScale]

   Decompose x into a power-of-2 part i and a fractional part f such that
   x = f * 2**i.

   The range of f is selected based on how i is rounded:

   * UPWARD: 0.5 < f <= 1
   * TONEAREST: sqrt(0.5) <= f <= sqrt(2)
   * TRUNC: 1 <= f < 2

   Where x is 0, f and i will be 0.

   :param x: Number to decompose
   :param rounding: How to round the exponent
   :return: Decomposed values (i, f)
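   As a concrete illustration of the UPWARD case, here is a minimal NumPy
   sketch. ``decompose_pow2_upward`` is a hypothetical stand-in, not this
   module's implementation; it assumes ``x`` holds nonnegative scale factors.

   .. code-block:: python

      # For x > 0, pick i = ceil(log2(x)) so that f = x / 2**i lies in
      # (0.5, 1]; where x == 0, both parts are 0.
      import numpy as np

      def decompose_pow2_upward(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
          i = np.zeros_like(x, dtype=np.int32)
          f = np.zeros_like(x, dtype=np.float64)
          nz = x > 0
          i[nz] = np.ceil(np.log2(x[nz])).astype(np.int32)
          f[nz] = x[nz] / np.exp2(i[nz].astype(np.float64))
          return i, f

      i, f = decompose_pow2_upward(np.array([0.0, 0.75, 3.0]))
      # i == [0, 0, 2], f == [0.0, 0.75, 0.75]; check: f * 2.0**i == x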
.. py:function:: normalize_with_pow2(x: ChannelScale) -> Tuple[ChannelShift, ChannelScale]

   Find powers of 2 that normalize each element of x to the range (0.5, 1.0].

   :param x: Scale factors to normalize
   :return: Tuple (i, y) of exponents and normalized scale factors satisfying
            x = y * 2**i.

.. py:function:: weight_single_quantization_scale(weight: numpy.ndarray, bits: int = 8) -> float

   Calculate a scalar quantization scale for a convolution or matrix multiply
   weight tensor.

   :param weight: Floating-point weight tensor
   :param bits: Number of bits used for quantization
   :return: Quantization scale. It has the same meaning as the scale field of
            class Quantization.

.. py:function:: weight_quantization_scale(weight: numpy.ndarray, per_channel: bool, bits: int = 8) -> ChannelScale

   Calculate a quantization scale for a convolution or matrix multiply weight
   tensor.

   :param weight: Floating-point weight tensor
   :param per_channel: Whether to do per-channel quantization
   :param bits: Number of bits to be used
   :return: Quantization scale.

.. py:class:: ConvolutionPrecision

   The precision to use for quantizing convolution. This determines how
   quantization does some calculations and chooses which integer type to use.
   Some choices (such as sima_int8) completely determine the integer type, while
   others do not.

   .. py:attribute:: sima_int8

   .. py:attribute:: tflite_int8

   .. py:attribute:: restricted_tflite_int8

   .. py:attribute:: sima_int16

   .. py:attribute:: tflite_int16

   .. py:attribute:: restricted_tflite_int16

   .. py:attribute:: sima_int32

   .. py:method:: has_multiplier() -> bool

      Return True if this quantization method can use a TFLite multiplier other
      than 1. Return False if it uses ArithFoldedRequantization or forces the
      multiplier to be 1.

   .. py:method:: has_zp_correction() -> bool

      Return True if this quantization method can use a zero point correction
      other than 0.

   .. py:method:: is_arith_folded() -> bool

      Return True if this is one of the quantization methods that uses
      ArithFoldedRequantization.

   .. py:method:: is_tflite() -> bool

      Return True if this is one of the quantization methods that uses
      TFLiteRequantization.

.. py:class:: ConvPlanRequantization(scale: ChannelScale, shift: ChannelShift, multiplier: ChannelQScale)

   Adjustable requantization for convolution.

   This class holds the requantization as both a floating-point number and a
   quantized representation. When these values are modified, they are kept
   consistent (modulo rounding) with the formula
   ``scale = multiplier * (2**-shift)``.

   :param scale: Requantization scale as a floating-point value.
   :param shift: Right shift to perform. Its shape must be the same as scale's.
   :param multiplier: Integer multiplier to use. Its shape must be either () or
                      the same as scale's.

   .. py:attribute:: scale
      :type: ChannelScale

   .. py:attribute:: shift
      :type: ChannelShift

   .. py:attribute:: multiplier
      :type: ChannelQScale

   .. py:method:: deepcopy() -> ConvPlanRequantization

      Make an independent copy of this object.

   .. py:method:: adjust_shift(adjustment: Union[ChannelShift, int])

      Add the given value to the right-shift value.

   .. py:method:: set_unit_scale(positions: numpy.ndarray)

      Set the scale to 1 in the given positions. Shift is set to 0 and
      multiplier is set to 1 in the given positions.
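   The invariant can be illustrated numerically; the sketch below uses made-up
   per-channel values and is not tied to the class's internals:

   .. code-block:: python

      # scale == multiplier * 2.0**-shift, so adding 1 to the right-shift
      # halves the effective scale while the integer multiplier is unchanged
      # (this mirrors what adjust_shift(1) preserves, modulo rounding).
      import numpy as np

      multiplier = np.array([96, 112])      # integer multipliers, per channel
      shift = np.array([7, 8])              # right-shifts, per channel
      scale = multiplier * 2.0 ** -shift    # [0.75, 0.4375]

      shift = shift + 1                     # like adjust_shift(1)
      assert np.allclose(multiplier * 2.0 ** -shift, scale / 2)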
.. py:class:: ConvPlanQuantizations

   Adjustable quantization parameters for convolution or matrix multiply. This
   class holds parameters that may be modified while deciding how to quantize
   the calculation.

   The parameters relate a real-number calculation ::

      c = a * w + b

   to a quantized calculation (the actual calculation is not selected here, and
   it may be different from this formula) ::

      Qc = S * (Qa * Qw) / 2**h + constant_terms

   by ::

      Qw = w * Sw
      Qa = a * Sa
      Qc = c * Sc + Zc
      S = 2**h * Sc / (Sa * Sw)

   The factor of 2**h is a right-shift that is included in the integer
   convolution.

   :param weight: Scale factor Sw relating real weight w to quantized weight Qw.
                  It may contain 0.
   :param output: Quantization (Sc, Zc) relating real output c to quantized
                  output Qc
   :param requant: Requantization S relating quantized product to output Qc
   :param intrinsic_shift: Right-shift h, used to produce an additional scale
                           factor in the convolution product

   .. py:attribute:: weight
      :type: ChannelScale

   .. py:attribute:: output
      :type: afe.ir.defines.Quantization

   .. py:attribute:: requant
      :type: ConvPlanRequantization

   .. py:attribute:: intrinsic_shift
      :type: numpy.ndarray

   .. py:method:: deepcopy() -> ConvPlanQuantizations

      Make an independent copy of this object.

   .. py:method:: set_intrinsic_shift(value: numpy.ndarray)

      Set the intrinsic shift, h, to the given value.

   .. py:method:: set_weight_zero(positions: numpy.ndarray)

      Set the weight scale, Sw, to 0 at the given channel positions.

   .. py:method:: set_requant_one(positions: numpy.ndarray)

      Set the requantization scale to 1 at the given channel positions.

   .. py:method:: scale_weight_pow2(exponent: Union[numpy.ndarray, int])

      Multiply the weight quantization scale, Sw, by 2**exponent.

   .. py:method:: scale_output_pow2(exponent: int)

      Multiply the output quantization scale, Sc, by 2**exponent.

   .. py:method:: scale_requant_pow2(exponent: numpy.ndarray)

      Multiply the requantization, S, by 2**exponent.
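   The scale relation can be checked with a small worked example; all values
   below are invented for illustration, with the bias and output zero point
   folded into ``constant_terms``:

   .. code-block:: python

      # A numeric check of S = 2**h * Sc / (Sa * Sw).
      import numpy as np

      a, w = 2.0, 0.5                   # real input and weight; c = a * w
      Sa, Sw, Sc, Zc, h = 10.0, 20.0, 4.0, 3, 1

      Qa, Qw = a * Sa, w * Sw           # quantized input and weight
      S = 2.0**h * Sc / (Sa * Sw)       # requantization scale

      # The integer kernel right-shifts the product by h, then requantizes.
      Qc = S * (Qa * Qw) / 2.0**h + Zc  # constant_terms here is just Zc
      assert np.isclose(Qc, (a * w) * Sc + Zc)  # matches Qc = c * Sc + Zc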
.. py:function:: select_convolution_scales(weight: numpy.ndarray, input_quant: afe.ir.defines.Quantization, output_distribution: afe.ir.attributes.ObservedDistribution, *, precision: ConvolutionPrecision, asymmetry: bool, per_channel: bool) -> ConvPlanQuantizations

   Choose quantization parameters for a generalized matrix multiply based on the
   input's quantization and the optimal quantization of the weight and output.
   This choice does not account for value ranges of other integer constants and
   intermediate results. Those should be handled separately.

   :param weight: A weight tensor.
   :param input_quant: Quantization that was selected for the input of
                       generalized matrix multiply.
   :param output_distribution: Value distribution of the output of generalized
                               matrix multiply.
   :param precision: Precision to quantize for.
   :param asymmetry: Whether to use asymmetric quantization.
   :param per_channel: Whether to do per-channel quantization. If true, the
                       scales will be a tensor with one value per channel. If
                       false, the scales will be scalars.
   :return: Weight tensor scale, requantization scale, and quantization of the
            convolution output.

.. py:class:: ConvBacktrackingParameters

   Quantization parameters that are fixed at the beginning of the quantization
   algorithm, such that the algorithm has to restart if they are changed. These
   values may be modified in the backtracking loop.

   :param precision: Precision to use for output calculations.
   :param relu_fallback_precision: Alternative precision to use if "precision"
                                   can't be supported due to limitations in the
                                   backend's implementation of ReLU. If this is
                                   None, "precision" is assumed to be fully
                                   supported.
   :param intrinsic_shift_adjustment: Locations where an extra right-shift is
                                      used with the int15 convolution algorithm.
                                      It is an array of bool, where True means
                                      to use an extra right-shift; it is 0D for
                                      per-tensor or 1D for per-channel
                                      quantization. When the input is int8, it
                                      must be a 0D array containing False.
   :param weight_adjustment: Extra right-shift applied to weights. Values
                             greater than zero reduce the weight's precision to
                             fewer than 8 bits. It is an array of int, 0D for
                             per-tensor or 1D for per-channel quantization.

   .. py:attribute:: precision
      :type: ConvolutionPrecision

   .. py:attribute:: relu_fallback_precision
      :type: Optional[ConvolutionPrecision]

   .. py:attribute:: intrinsic_shift_adjustment
      :type: numpy.ndarray

   .. py:attribute:: weight_adjustment
      :type: numpy.ndarray

   .. py:method:: default_intrinsic_shift_adjustment(n_channels: int, per_channel: bool, use_int15: bool) -> numpy.ndarray
      :staticmethod:

      Default value of intrinsic shift. The default is not to use any extra
      right-shift.

      :param n_channels: Number of channels in the convolution output
      :param per_channel: Whether per-channel quantization is used
      :param use_int15: Whether the int15 convolution algorithm is used
      :return: Default value of intrinsic shift

   .. py:method:: default_weight_adjustment(n_channels: int, per_channel: bool) -> numpy.ndarray
      :staticmethod:

      Default weight adjustment. The default is not to use any extra
      right-shift.

      :param n_channels: Number of channels in the convolution output
      :param per_channel: Whether per-channel quantization is used
      :return: Default value of weight adjustment

.. py:function:: run_backtracking_loop(f: Callable[[afe.ir.defines.NodeReporter], _A], backtracking_limit: int, backtracking_error_message: str, error_reporter: Optional[afe.ir.defines.NodeReporter] = None) -> _A

   Retry the backtracking computation in f until it succeeds.

   The callable f represents a restartable function that uses some mutable state
   to represent its starting condition. It may update its mutable state and
   raise a _Retry exception to restart; the state change should help it make
   progress after it restarts. It may return a value to end the loop.

   :param f: Backtracking computation to run
   :param backtracking_limit: Maximum number of times to attempt f. If f is
                              attempted this many times without returning a
                              result, an exception will be raised.
   :param backtracking_error_message: Error message to use if f does not return.
   :param error_reporter: Used for reporting errors.
   :return: Return value of f.

.. py:function:: adjust_plan_zero_weights(weights: numpy.ndarray, quantizations: ConvPlanQuantizations, per_channel: bool, error_reporter: afe.ir.defines.NodeReporter)

   Adjust the convolution plan where the weights would be zero after
   quantization.

   :param weights: Floating-point weights.
   :param quantizations: Quantization parameters. Will be modified.
   :param per_channel: Whether to do per-channel quantization.
   :param error_reporter: Error reporter used for quantization warnings.

.. py:function:: try_increase_intrinsic_shift(backtracking_parameters: ConvBacktrackingParameters, positions: numpy.ndarray) -> None

   Set backtracking_parameters.intrinsic_shift_adjustment to True where
   positions is True. Raise _Retry() if any backtracking parameters were
   changed.

   :param backtracking_parameters: Mutable variables for backtracking. May be
                                   modified.
   :param positions: Array of bool, containing True where the intrinsic shift
                     adjustment should be set to True.
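   The retry protocol shared by run_backtracking_loop and the ``try_*``
   functions above can be sketched in a few lines. ``_Retry`` is private to
   this module, so the sketch defines stand-in names (``Retry``, ``run_loop``,
   ``plan``) purely for illustration:

   .. code-block:: python

      class Retry(Exception):
          pass

      def run_loop(f, limit: int, message: str):
          for _ in range(limit):
              try:
                  return f()
              except Retry:
                  continue  # f updated its mutable state; start over
          raise RuntimeError(message)

      # A restartable computation: its mutable state is `params`, and it
      # raises Retry after adjusting that state so the next attempt can
      # make progress.
      params = {"shift": 12}

      def plan() -> int:
          if params["shift"] > 8:        # pretend 8 is the hardware limit
              params["shift"] -= 1       # make progress, then restart
              raise Retry()
          return params["shift"]

      assert run_loop(plan, limit=10, message="no valid plan") == 8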
.. py:function:: try_adjust_plan_shift_value(backtracking_parameters: ConvBacktrackingParameters, quantizations: ConvPlanQuantizations, use_int15: bool, error_reporter: afe.ir.defines.NodeReporter) -> None

   Adjust the convolution plan where the shift value is out of range or where
   the shift is so large that it causes severe precision loss. Raise _Retry()
   if any backtracking parameters were changed.

   :param backtracking_parameters: Mutable variables for backtracking. May be
                                   modified.
   :param quantizations: Quantization parameters. May be modified.
   :param use_int15: Whether the plan is for int15 convolution.
   :param error_reporter: Error reporter used for quantization warnings.

.. py:function:: try_adjust_plan_product_value(backtracking_parameters: ConvBacktrackingParameters, quantizations: ConvPlanQuantizations, use_int15: bool, error_reporter: afe.ir.defines.NodeReporter) -> None

   Adjust the convolution plan where the integer convolution result is not in
   the representable range. Raise _Retry() if any backtracking parameters were
   changed.

   :param backtracking_parameters: Mutable variables for backtracking. May be
                                   modified.
   :param quantizations: Quantization parameters. May be modified.
   :param use_int15: Whether the plan is for int15 convolution.
   :param error_reporter: Error reporter used for quantization warnings.

.. py:function:: quantize_convolution_scales(quantizations: ConvPlanQuantizations, precision: ConvolutionPrecision, allow_full_output_precision: bool) -> Tuple[ChannelScale, ChannelScale, ml_kernels.requantization.BaseRequantization[numpy.ndarray], afe.ir.tensor_type.ScalarType, afe.ir.defines.Quantization]

   Adjust the quantization parameters based on zero values, limits on integer
   constants, and limits on integer intermediate results. The final choices of
   weight scale, bias scale, requantization, and output quantization are
   returned.

   :param quantizations: Quantization parameters.
   :param precision: The precision to use for quantizing convolution.
   :param allow_full_output_precision: Whether 16-bit precision can be widened
                                       to 32-bit output. If false, quantizing
                                       with 16-bit precision will always
                                       produce 16-bit output.
   :return: New quantization scale of weights, quantization scale of the bias,
            requantization to perform after convolution, type of output, and
            quantization of output.

.. py:function:: quantize_weight_tensor(weight: numpy.ndarray, weight_scale: ChannelScale, bits: int = 8) -> Tuple[numpy.ndarray, numpy.ndarray]

   Create a quantized weight tensor.

   :param weight: Weight values being quantized.
   :param weight_scale: Scale of the weights.
   :param bits: Number of bits used for quantized weights.
   :return: Tuple of np.ndarray. The first value is the quantized weights. The
            second is the fake-quantized weights, calculated by dividing the
            quantized weights by the scale; this returns them to approximately
            their original fp32 values and exposes the quantization error
            caused by rounding and clipping during quantization.
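   A minimal sketch of this quantize/fake-quantize pair follows, assuming
   symmetric quantization and a nonzero scale; the module's actual rounding and
   clipping details may differ:

   .. code-block:: python

      import numpy as np

      def quantize_weights(weight: np.ndarray, scale: np.ndarray, bits: int = 8):
          # Symmetric signed range, e.g. [-127, 127] for 8 bits.
          hi = 2 ** (bits - 1) - 1
          q = np.clip(np.round(weight * scale), -hi, hi)
          fake = q / scale   # back to fp32, exposing the quantization error
          return q.astype(np.int8 if bits <= 8 else np.int16), fake

      q, fake = quantize_weights(np.array([0.31, -1.7]), scale=np.float64(100.0))
      # q == [31, -127]; fake == [0.31, -1.27] shows the clipping error on -1.7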
.. py:function:: try_quantize_bias_tensor(backtracking_parameters: ConvBacktrackingParameters, bias: Optional[numpy.ndarray], zp_correction: numpy.ndarray, bias_scale: ChannelScale, use_int15: bool, per_channel: bool) -> numpy.ndarray

   Quantize a bias tensor. If it can't be quantized due to integer overflow,
   adjust backtracking parameters. Raise _Retry() if any backtracking
   parameters were changed.

   :param backtracking_parameters: Mutable variables for backtracking. May be
                                   modified.
   :param bias: Floating-point bias tensor.
   :param zp_correction: Integer zero point correction to be added to the bias.
                         This may include correction for the input zero point
                         and/or output zero point, depending on the
                         quantization scheme.
   :param bias_scale: Quantization scale to use for bias.
   :param use_int15: Whether int15 convolution is used.
   :param per_channel: Whether per-channel quantization is used.
   :return: Quantized bias tensor.

.. py:function:: quantized_product_zero_value(q_weight: numpy.ndarray, zero_point: int, intrinsic_shift: Union[numpy.ndarray, int]) -> numpy.ndarray

   Calculate the result of quantized generalized matrix multiply when the input
   is filled with the zero point value. This represents the zero point result,
   which should be subtracted to get the true product.

   :param q_weight: Quantized weight tensor
   :param zero_point: Zero point of input tensor
   :param intrinsic_shift: Right-shift that is performed by the convolution
                           algorithm.
   :return: Convolution result as a 1D tensor
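   A sketch of this zero-point product follows, assuming the weight's last axis
   is the output-channel axis (as produced by reshape_weight_to_output_channels)
   and a plain arithmetic right-shift; the kernel's shift rounding may differ:

   .. code-block:: python

      # With the input filled with the zero point, each output channel sees
      # zp times the sum of that channel's weights, right-shifted by the
      # convolution's intrinsic shift.
      import numpy as np

      def zero_value(q_weight: np.ndarray, zero_point: int,
                     intrinsic_shift: int) -> np.ndarray:
          per_channel_sum = q_weight.reshape(-1, q_weight.shape[-1]).sum(axis=0)
          return (zero_point * per_channel_sum.astype(np.int64)) >> intrinsic_shift

      qw = np.array([[1, -2], [3, 4]])     # 2 inputs, 2 output channels
      zv = zero_value(qw, zero_point=10, intrinsic_shift=1)
      # per-channel sums are [4, 2]; zp * sums = [40, 20]; >> 1 gives [20, 10]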
.. py:function:: output_zp_correction_in_bias(precision: ConvolutionPrecision, output_quant: afe.ir.defines.Quantization, requantization: ml_kernels.requantization.BaseRequantization[numpy.ndarray]) -> int

   Calculate the zero point correction to add to the convolution or matrix
   multiply's bias array so that the output has the desired quantization.

   If the convolution will not combine zero point correction with bias, but
   instead will do two separate additions, then the result is 0. Otherwise, the
   result is the output's zero point, scaled based on the requantization.

   :param precision: Convolution precision type
   :param output_quant: Quantization of convolution's output
   :param requantization: Requantization that is performed at the end of
                          convolution
   :return: Zero point correction that should be added to the bias array

.. py:function:: quantize_convolution_parameters(input_quant: afe.ir.defines.Quantization, output_distribution: afe.ir.attributes.ObservedDistribution, weight: numpy.ndarray, bias: Optional[numpy.ndarray], *, per_channel: bool, bias_corrector: afe.ir.bias_correction.BiasCorrector, asymmetry: bool, use_int15: bool, use_sima_relu_workaround: bool, precision: ConvolutionPrecision, allow_full_output_precision: bool, error_reporter: Optional[afe.ir.defines.NodeReporter] = None) -> Tuple[numpy.ndarray, numpy.ndarray, ml_kernels.requantization.BaseRequantization[numpy.ndarray], afe.ir.tensor_type.ScalarType, afe.ir.defines.Quantization, bool]

   Select quantized parameters for convolution or matrix multiply.

   :param input_quant: Quantization that was selected for the input of
                       convolution.
   :param output_distribution: Value distribution of the output of convolution.
   :param weight: Weight tensor.
   :param bias: A bias tensor. If it is None, a bias tensor will still be
                returned containing the bias correction that was introduced by
                quantization.
   :param per_channel: Whether to do per-channel quantization. If true, the
                       scale will be a tensor with one value per channel.
   :param bias_corrector: How to calculate a bias correction term.
   :param asymmetry: Whether to use asymmetric quantization.
   :param use_int15: Whether to quantize for the int15 convolution algorithm.
                     If false, quantize for the int8 convolution algorithm.
   :param use_sima_relu_workaround: Whether to use a workaround for int8 SiMa
                                    quantization with ReLU activation. If True,
                                    and ReLU cannot be executed by the backend,
                                    then use TFLite quantization. This
                                    parameter is only relevant when precision
                                    is sima_int8 or sima_int16, and it must be
                                    False otherwise.
   :param precision: The precision to use for quantizing convolution output.
   :param allow_full_output_precision: Whether 16-bit precision can be widened
                                       to 32-bit output. If false, quantizing
                                       with 16-bit precision will always
                                       produce 16-bit output.
   :param error_reporter: Used for warnings about bad quantization.
   :return: A tuple containing the chosen quantization-related parameters: the
            quantized weight tensor, the quantized bias tensor, the
            requantization, the scalar type of the output, the quantization of
            the output, and the msb_left_shift flag value.

.. py:function:: get_bfloat16_with_int_weights_quant_params(attrs: afe.ir.attributes.ConvAddActivationAttrs, per_channel: bool, bits: int) -> tuple[numpy.ndarray, numpy.ndarray | None, ml_kernels.requantization.BaseRequantization]

   Get the quantized weights, the bias (if present), and the requantization.
   Weights are quantized to int8 or int4, and the bias, if present, is left
   unquantized; this allows the requantization scale factor to be simply
   1/weight_scale, since requantization is done after adding the bias.

   :param attrs: Convolution attributes containing the weights.
   :param per_channel: Whether a per-channel quantization scheme is used for
                       weights.
   :param bits: Number of bits to be used.
   :return: Quantized weights, optionally the bias, and the requantization.
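   A numeric sketch of the underlying scale relation follows (bias omitted):
   with bfloat16 activations the input scale is effectively 1, so dividing the
   integer-weight product by the weight scale recovers the real-valued product.
   The values below are invented for illustration:

   .. code-block:: python

      import numpy as np

      w = np.array([0.31, -0.52])
      Sw = 127.0 / np.abs(w).max()          # 8-bit symmetric weight scale
      qw = np.round(w * Sw)                 # integer weights

      a = np.array([1.5, -2.0])             # bfloat16 activations, scale 1
      acc = (a * qw).sum()                  # integer-weight accumulation
      assert np.isclose(acc / Sw, (a * w).sum(), atol=1e-2)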