afe.ir.quantization_conv

Quantization functions for convolution and matrix multiply.

Attributes

ChannelScale

ChannelQScale

ChannelShift

INTRINSIC_SHIFT_LO

INTRINSIC_SHIFT_HI

Classes

ConvolutionPrecision

The precision to use for quantizing convolution. This determines how quantization does some calculations and chooses which integer type to use.

ConvPlanRequantization

Adjustable requantization for convolution.

ConvPlanQuantizations

Adjustable quantization parameters for convolution or matrix multiply.

ConvBacktrackingParameters

Quantization parameters that are fixed at the beginning of the quantization algorithm, such that the algorithm has to restart if they are changed.

Functions

reshape_weight_to_output_channels(→ numpy.ndarray)

Reshape a weight tensor so that its last axis corresponds to a convolution operation's output channel axis.

get_quantization_range(→ Tuple[int, int])

Get the numeric range that should be used when quantizing numbers to be stored using dtype.

decompose_power_of_2(→ Tuple[ChannelShift, ChannelScale])

Decompose x into a power-of-2 part i and a fractional part f such that x = f * 2**i.

normalize_with_pow2(→ Tuple[ChannelShift, ChannelScale])

Find powers of 2 that normalize each element of x to the range (0.5, 1.0].

weight_single_quantization_scale(→ float)

Calculate a scalar quantization scale for a convolution or matrix multiply weight tensor.

weight_quantization_scale(→ ChannelScale)

Calculate a quantization scale for a convolution or matrix multiply weight tensor.

select_convolution_scales(→ ConvPlanQuantizations)

Choose quantization parameters for a generalized matrix multiply based on the input's quantization and the optimal quantization of the weight and output.

run_backtracking_loop(→ _A)

Retry the backtracking computation in f until it succeeds.

adjust_plan_zero_weights(weights, quantizations, ...)

Adjust the convolution plan where the weights would be zero after quantization.

try_increase_intrinsic_shift(→ None)

Set backtracking_parameters.intrinsic_shift_adjustment to True where positions is True.

try_adjust_plan_shift_value(→ None)

Adjust the convolution plan where the shift value is out of range or where the shift is so large that it causes severe precision loss.

try_adjust_plan_product_value(→ None)

Adjust the convolution plan where the integer convolution result is not in the representable range.

quantize_convolution_scales(→ Tuple[ChannelScale, ...)

Adjust the quantization parameters based on zero values, limits on integer constants, and limits on integer intermediate results.

quantize_weight_tensor(→ Tuple[numpy.ndarray, ...)

Create a quantized weight tensor.

try_quantize_bias_tensor(→ numpy.ndarray)

Quantize a bias tensor. If it can't be quantized due to integer overflow, adjust backtracking parameters.

quantized_product_zero_value(→ numpy.ndarray)

Calculate the result of quantized generalized matrix multiply when the input is filled with the zero point value.

output_zp_correction_in_bias(→ int)

Calculate the zero point correction to add to the convolution or matrix multiply's bias array so that the output has the desired quantization.

quantize_convolution_parameters(→ Tuple[numpy.ndarray, ...)

Select quantized parameters for convolution or matrix multiply.

get_bfloat16_with_int_weights_quant_params(...)

Get the quantized weights, the bias if present, and the requantization.

Module Contents

afe.ir.quantization_conv.ChannelScale[source]
afe.ir.quantization_conv.ChannelQScale[source]
afe.ir.quantization_conv.ChannelShift[source]
afe.ir.quantization_conv.INTRINSIC_SHIFT_LO = 1[source]
afe.ir.quantization_conv.INTRINSIC_SHIFT_HI = 8[source]
afe.ir.quantization_conv.reshape_weight_to_output_channels(weight: numpy.ndarray) → numpy.ndarray[source]

Reshape a weight tensor so that its last axis corresponds to a convolution operation’s output channel axis. That is, the convolution’s output at a given channel output[…, c] depends on reshaped_weights[…, c], bias[c], and some values from the convolution’s input. This tensor shape is useful for code that computes per-channel information or does per-channel scaling on weights.
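
For illustration, once the last axis is the output channel axis, per-channel statistics reduce over all other axes. A minimal sketch (the HWIO layout below is an assumption; the function handles whatever layouts afe uses internally):

import numpy as np

weight = np.random.rand(3, 3, 16, 32)             # hypothetical HWIO layout: 32 output channels
reshaped = weight.reshape(-1, weight.shape[-1])   # flatten all axes except the output channels
per_channel_max = np.abs(reshaped).max(axis=0)    # shape (32,): one statistic per output channel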

afe.ir.quantization_conv.get_quantization_range(dtype: afe.ir.tensor_type.ScalarType | numpy.number, asymmetry: bool) → Tuple[int, int][source]

Get the numeric range that should be used when quantizing numbers to be stored using dtype. The range is the entire value range when using asymmetric quantization, and is reduced to a symmetric range when using symmetric quantization.

Parameters:
  • dtype – Quantized data type. It must be a signed integer type.

  • asymmetry – Whether to use an asymmetric range

Returns:

Numeric range
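
A minimal sketch of the likely behavior for a signed integer type; only the asymmetric full range is pinned down by the text above, so the symmetric reduction (mirroring the positive bound) is an assumption:

import numpy as np

def quantization_range_sketch(dtype, asymmetry):
    info = np.iinfo(dtype)
    # Full range for asymmetric quantization; mirrored positive bound
    # (assumed behavior) for symmetric quantization.
    return (info.min, info.max) if asymmetry else (-info.max, info.max)

quantization_range_sketch(np.int8, True)    # (-128, 127)
quantization_range_sketch(np.int8, False)   # (-127, 127)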

afe.ir.quantization_conv.decompose_power_of_2(x: ChannelScale, rounding: ml_kernels.math_helpers.RoundType) → Tuple[ChannelShift, ChannelScale][source]

Decompose x into a power-of-2 part i and a fractional part f such that

x = f * 2**i

The range of f is selected based on how i is rounded:

UPWARD: 0.5 < f <= 1
TONEAREST: sqrt(0.5) <= f <= sqrt(2)
TRUNC: 1 <= f < 2

Where x is 0, f and i will be 0.

Parameters:
  • x – Number to decompose

  • rounding – How to round the exponent

Returns:

Decomposed values (i, f)
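
As a sketch, the UPWARD case can be reproduced with numpy (the real function also handles the other rounding modes; this only illustrates the identity x = f * 2**i and the (0.5, 1] range):

import numpy as np

def decompose_upward_sketch(x):
    # i = ceil(log2(x)) puts f = x * 2**-i in (0.5, 1]; zeros map to (0, 0).
    safe = np.where(x == 0, 1.0, x)
    i = np.where(x == 0, 0.0, np.ceil(np.log2(safe)))
    f = np.where(x == 0, 0.0, x * 2.0 ** -i)
    return i, f

i, f = decompose_upward_sketch(np.array([0.0, 0.3, 1.0, 6.0]))
# i = [0, -1, 0, 3], f = [0, 0.6, 1.0, 0.75], and f * 2**i recovers x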

afe.ir.quantization_conv.normalize_with_pow2(x: ChannelScale) → Tuple[ChannelShift, ChannelScale][source]

Find powers of 2 that normalize each element of x to the range (0.5, 1.0].

Parameters:

x – Scale factors to normalize

Returns:

Tuple (i, y) of exponents and normalized scale factors satisfying x = y * 2**i.
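
This is the same decomposition as decompose_power_of_2 with UPWARD rounding. A sketch using numpy.frexp, which yields mantissas in [0.5, 1) and so needs a correction for exact powers of two:

import numpy as np

x = np.array([0.3, 1.0, 6.0])
y, i = np.frexp(x)                  # x = y * 2**i with y in [0.5, 1)
pow2 = (y == 0.5)                   # nudge exact powers of two into (0.5, 1.0]
y = np.where(pow2, 1.0, y)
i = np.where(pow2, i - 1, i)
# now x == y * 2**i with y in (0.5, 1.0]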

afe.ir.quantization_conv.weight_single_quantization_scale(weight: numpy.ndarray, bits: int = 8) → float[source]

Calculate a scalar quantization scale for a convolution or matrix multiply weight tensor.

Parameters:
  • weight – Floating-point weight tensor

  • bits – Number of bits used for quantization

Returns:

Quantization scale. It has the same meaning as the scale field of class Quantization.
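
Given that Qw = w * Sw in the formulas of ConvPlanQuantizations below, a plausible sketch (the exact bound and zero handling are assumptions):

import numpy as np

def weight_single_scale_sketch(weight, bits=8):
    # Map the largest weight magnitude onto the symmetric integer bound,
    # e.g. 127 for 8 bits (assumed behavior).
    bound = 2.0 ** (bits - 1) - 1
    max_abs = float(np.abs(weight).max())
    return bound / max_abs if max_abs > 0 else 1.0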

afe.ir.quantization_conv.weight_quantization_scale(weight: numpy.ndarray, per_channel: bool, bits: int = 8) → ChannelScale[source]

Calculate a quantization scale for a convolution or matrix multiply weight tensor.

Parameters:
  • weight – Floating-point weight tensor

  • per_channel – Whether to do per-channel quantization

  • bits – Number of bits to be used

Returns:

Quantization scale.
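
A per-channel variant of the sketch above, building on reshape_weight_to_output_channels (the zero-guard value is an assumption):

import numpy as np
from afe.ir.quantization_conv import reshape_weight_to_output_channels

def weight_scale_sketch(weight, per_channel, bits=8):
    bound = 2.0 ** (bits - 1) - 1
    if not per_channel:
        return bound / max(float(np.abs(weight).max()), 1e-30)
    reshaped = reshape_weight_to_output_channels(weight)
    flat = np.abs(reshaped).reshape(-1, reshaped.shape[-1])
    return bound / np.maximum(flat.max(axis=0), 1e-30)   # one scale per output channel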

class afe.ir.quantization_conv.ConvolutionPrecision[source]

The precision to use for quantizing convolution. This determines how quantization does some calculations and chooses which integer type to use. Some choices (such as sima_int8) completely determine the integer type, while others do not.

sima_int8[source]
tflite_int8[source]
restricted_tflite_int8[source]
sima_int16[source]
tflite_int16[source]
restricted_tflite_int16[source]
sima_int32[source]
has_multiplier() → bool[source]

Return True if this quantization method can use a TFLite multiplier other than 1. Return False if it uses ArithFoldedRequantization or forces the multiplier to be 1.

has_zp_correction() → bool[source]

Return True if this quantization method can use a zero point correction other than 0.

is_arith_folded() → bool[source]

Return True if this is one of the quantization methods that uses ArithFoldedRequantization.

is_tflite() → bool[source]

Return True if this is one of the quantization methods that uses TFLiteRequantization.

class afe.ir.quantization_conv.ConvPlanRequantization(scale: ChannelScale, shift: ChannelShift, multiplier: ChannelQScale)[source]

Adjustable requantization for convolution. This class holds the requantization as both a floating-point number and a quantized representation. When these values are modified, they are kept consistent (modulo rounding) with the formula

scale = multiplier * (2**-shift)

Parameters:
  • scale – Requantization scale as a floating-point value.

  • shift – Right shift to perform. Its shape must be the same as scale’s.

  • multiplier – Integer multiplier to use. Its shape must be either () or the same as scale’s.

scale: ChannelScale[source]
shift: ChannelShift[source]
multiplier: ChannelQScale[source]
deepcopy() → ConvPlanRequantization[source]

Make an independent copy of this object.

adjust_shift(adjustment: ChannelShift | int)[source]

Add the given value to the right-shift value.

set_unit_scale(positions: numpy.ndarray)[source]

Set the scale to 1 in the given positions. Shift is set to 0 and multiplier is set to 1 in the given positions.
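
A toy check of the invariant scale = multiplier * (2**-shift), with arbitrary values; presumably adjust_shift recomputes the multiplier so that the scale stays fixed modulo rounding:

import numpy as np

scale = np.array([0.75, 0.4])
shift = np.array([2, 3])
multiplier = np.round(scale * 2.0 ** shift).astype(np.int32)   # [3, 3]
# multiplier * 2**-shift is [0.75, 0.375]: consistent modulo rounding
assert np.allclose(scale, multiplier * 2.0 ** -shift, atol=0.05)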

class afe.ir.quantization_conv.ConvPlanQuantizations[source]

Adjustable quantization parameters for convolution or matrix multiply. This class holds parameters that may be modified while deciding how to quantize the calculation.

The parameters relate a real-number calculation

c = a * w + b

to a quantized calculation (the actual calculation is not selected here, and it may be different from this formula)

Qc = S * (Qa * Qw) / 2^h + constant_terms

by

Qw = w * Sw
Qa = a * Sa
Qc = c * Sc + Zc
S = 2^h * Sc / (Sa * Sw)

The factor of 2^h is a right-shift that is included in the integer convolution.

Parameters:
  • weight – Scale factor Sw relating real weight w to quantized weight Qw. It may contain 0.

  • output – Quantization (Sc, Zc) relating real output c to quantized output Qc

  • requant – Requantization S relating quantized product to output Qc

  • intrinsic_shift – Right-shift h, used to produce an additional scale factor in the convolution product

weight: ChannelScale[source]
output: afe.ir.defines.Quantization[source]
requant: ConvPlanRequantization[source]
intrinsic_shift: numpy.ndarray[source]
deepcopy() → ConvPlanQuantizations[source]

Make an independent copy of this object.

set_intrinsic_shift(value: numpy.ndarray)[source]

Set the intrinsic shift, h, to the given value.

set_weight_zero(positions: numpy.ndarray)[source]

Set the weight scale, Sw, to 0 at the given channel positions.

set_requant_one(positions: numpy.ndarray)[source]

Set the requantization scale to 1 at the given channel positions.

scale_weight_pow2(exponent: numpy.ndarray | int)[source]

Multiply the weight quantization scale, Sw, by 2**exponent.

scale_output_pow2(exponent: int)[source]

Multiply the output quantization scale, Sc, by 2**exponent.

scale_requant_pow2(exponent: numpy.ndarray)[source]

Multiply the requantization, S, by 2**exponent.
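
A toy numeric check of the relations above (all values arbitrary):

# Real computation: c = a * w + b
a, w, b = 0.5, 0.25, 0.0
Sa, Sw, Sc, Zc, h = 100.0, 64.0, 40.0, 0, 0
Qa, Qw = a * Sa, w * Sw                              # 50.0, 16.0
S = 2 ** h * Sc / (Sa * Sw)                          # 0.00625
Qc = S * (Qa * Qw) / 2 ** h                          # 5.0
assert abs(Qc - ((a * w + b) * Sc + Zc)) < 1e-9      # agrees with Qc = c * Sc + Zc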

afe.ir.quantization_conv.select_convolution_scales(weight: numpy.ndarray, input_quant: afe.ir.defines.Quantization, output_distribution: afe.ir.attributes.ObservedDistribution, *, precision: ConvolutionPrecision, asymmetry: bool, per_channel: bool) → ConvPlanQuantizations[source]

Choose quantization parameters for a generalized matrix multiply based on the input’s quantization and the optimal quantization of the weight and output.

This choice does not account for value ranges of other integer constants and intermediate results. Those should be handled separately.

Parameters:
  • weight – A weight tensor.

  • input_quant – Quantization that was selected for the input of generalized matrix multiply.

  • output_distribution – Value distribution of the output of generalized matrix multiply.

  • precision – Precision to quantize for.

  • asymmetry – Whether to use asymmetric quantization.

  • per_channel – Whether to do per-channel quantization. If true, the scales will be a tensor with one value per channel. If false, the scales will be scalars.

Returns:

Weight tensor scale, requantization scale, and quantization of the convolution output.

class afe.ir.quantization_conv.ConvBacktrackingParameters[source]

Quantization parameters that are fixed at the beginning of the quantization algorithm, such that the algorithm has to restart if they are changed. These values may be modified in the backtracking loop.

Parameters:
  • precision – Precision to use for output calculations.

  • relu_fallback_precision – Alternative precision to use if “precision” can’t be supported due to limitations in the backend’s implementation of ReLU. If this is None, “precision” is assumed to be fully supported.

  • intrinsic_shift_adjustment – Locations where extra right-shift is used with the int15 convolution algorithm. It is an array of bool, where True means to use extra right-shift; it is 0D for per-tensor or 1D for per-channel quantization. When the input is int8, it must be a 0D array of False.

  • weight_adjustment – Extra right-shift applied to weights. Values greater than zero reduce the weight’s precision to fewer than 8 bits. It is an array of int, 0D for per-tensor or 1D for per-channel quantization.

precision: ConvolutionPrecision[source]
relu_fallback_precision: ConvolutionPrecision | None[source]
intrinsic_shift_adjustment: numpy.ndarray[source]
weight_adjustment: numpy.ndarray[source]
static default_intrinsic_shift_adjustment(n_channels: int, per_channel: bool, use_int15: bool) → numpy.ndarray[source]

Default value of intrinsic shift. The default is not to use any extra right-shift.

Parameters:
  • n_channels – Number of channels in the convolution output

  • per_channel – Whether per-channel quantization is used

  • use_int15 – Whether the int15 convolution algorithm is used

Returns:

Default value of intrinsic shift

static default_weight_adjustment(n_channels: int, per_channel: bool) → numpy.ndarray[source]

Default weight adjustment. The default is not to use any extra right-shift.

Parameters:
  • n_channels – Number of channels in the convolution output

  • per_channel – Whether per-channel quantization is used

Returns:

Default value of weight adjustment
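
A sketch of both defaults, assuming the 0D/1D convention stated in the class documentation (the dtype choices are assumptions):

import numpy as np

def default_weight_adjustment_sketch(n_channels, per_channel):
    # No extra right-shift: 1D zeros for per-channel, a 0D zero otherwise.
    return np.zeros(n_channels if per_channel else (), dtype=np.int32)

def default_intrinsic_shift_sketch(n_channels, per_channel, use_int15):
    # Analogous but boolean; stays a 0D False unless int15 per-channel
    # convolution is in use (assumed reading of the constraints above).
    shape = n_channels if (per_channel and use_int15) else ()
    return np.zeros(shape, dtype=bool)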

afe.ir.quantization_conv.run_backtracking_loop(f: Callable[[afe.ir.defines.NodeReporter], _A], backtracking_limit: int, backtracking_error_message: str, error_reporter: afe.ir.defines.NodeReporter | None = None) → _A[source]

Retry the backtracking computation in f until it succeeds.

The callable object in f represents a restartable function that uses some mutable state to represent its starting condition. It may update its mutable state and raise a _Retry exception to restart; the state change should help it make progress after it restarts. It may return a value to end the loop.

Parameters:
  • f – Backtracking computation to run

  • backtracking_limit – Maximum number of times to attempt f. If f is attempted this many times without returning a result, an exception will be raised.

  • backtracking_error_message – Error message to use if f does not return.

  • error_reporter – Used for reporting errors.

Returns:

Return value of f.
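
A minimal sketch of this retry protocol (_Retry is module-private; it is redeclared here only for illustration, and error handling is simplified):

class _Retry(Exception):
    """Raised by the computation to request a restart."""

def run_backtracking_loop_sketch(f, backtracking_limit, message, error_reporter=None):
    for _ in range(backtracking_limit):
        try:
            return f(error_reporter)
        except _Retry:
            continue                  # f mutated its backtracking state; start over
    raise RuntimeError(message)       # the real code likely reports via error_reporter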

afe.ir.quantization_conv.adjust_plan_zero_weights(weights: numpy.ndarray, quantizations: ConvPlanQuantizations, per_channel: bool, error_reporter: afe.ir.defines.NodeReporter)[source]

Adjust the convolution plan where the weights would be zero after quantization.

Parameters:
  • weights – Floating-point weights.

  • quantizations – Quantization parameters. Will be modified.

  • per_channel – Whether to do per-channel quantization.

  • error_reporter – Error reporter used for quantization warnings.

afe.ir.quantization_conv.try_increase_intrinsic_shift(backtracking_parameters: ConvBacktrackingParameters, positions: numpy.ndarray) → None[source]

Set backtracking_parameters.intrinsic_shift_adjustment to True where positions is True. Raise _Retry() if any backtracking parameters were changed.

Parameters:
  • backtracking_parameters – Mutable variables for backtracking. May be modified.

  • positions – Array of bool, containing True where the intrinsic shift adjustment should be set to True.

afe.ir.quantization_conv.try_adjust_plan_shift_value(backtracking_parameters: ConvBacktrackingParameters, quantizations: ConvPlanQuantizations, use_int15: bool, error_reporter: afe.ir.defines.NodeReporter) → None[source]

Adjust the convolution plan where the shift value is out of range or where the shift is so large that it causes severe precision loss. Raise _Retry() if any backtracking parameters were changed.

Parameters:
  • backtracking_parameters – Mutable variables for backtracking. May be modified.

  • quantizations – Quantization parameters. May be modified.

  • use_int15 – Whether the plan is for int15 convolution.

  • error_reporter – Error reporter used for quantization warnings.

afe.ir.quantization_conv.try_adjust_plan_product_value(backtracking_parameters: ConvBacktrackingParameters, quantizations: ConvPlanQuantizations, use_int15: bool, error_reporter: afe.ir.defines.NodeReporter) → None[source]

Adjust the convolution plan where the integer convolution result is not in the representable range. Raise _Retry() if any backtracking parameters were changed.

Parameters:
  • backtracking_parameters – Mutable variables for backtracking. May be modified.

  • quantizations – Quantization parameters. May be modified.

  • use_int15 – Whether the plan is for int15 convolution.

  • error_reporter – Error reporter used for quantization warnings.

afe.ir.quantization_conv.quantize_convolution_scales(quantizations: ConvPlanQuantizations, precision: ConvolutionPrecision, allow_full_output_precision: bool) → Tuple[ChannelScale, ChannelScale, ml_kernels.requantization.BaseRequantization[numpy.ndarray], afe.ir.tensor_type.ScalarType, afe.ir.defines.Quantization][source]

Adjust the quantization parameters based on zero values, limits on integer constants, and limits on integer intermediate results.

The final choice of weight scale, bias scale, requantization, and output quantization are returned.

Parameters:
  • quantizations – Quantization parameters.

  • precision – The precision to use for quantizing convolution.

  • allow_full_output_precision – Whether 16-bit precision can be widened to 32-bit output. If false, quantizing with 16-bit precision will always produce 16-bit output.

Returns:

New quantization scale of weights, quantization scale of bias, requantization to perform after convolution, type of output, and quantization of output.

afe.ir.quantization_conv.quantize_weight_tensor(weight: numpy.ndarray, weight_scale: ChannelScale, bits: int = 8) → Tuple[numpy.ndarray, numpy.ndarray][source]

Create a quantized weight tensor.

Parameters:
  • weight – Floating-point weight values to quantize.

  • weight_scale – Quantization scale of the weights.

  • bits – Number of bits used for quantized weights.

Returns:

Tuple of np.ndarray. The first element is the quantized weights; the second is the fake-quantized weights, computed by dividing the quantized weights by the scale, which maps them back to approximate fp32 values and exposes the quantization error introduced by rounding and clipping.
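
A hedged sketch of the described behavior (the integer bounds and zero-scale handling are assumptions):

import numpy as np

def quantize_weight_sketch(weight, weight_scale, bits=8):
    bound = 2 ** (bits - 1) - 1
    q = np.clip(np.round(weight * weight_scale), -bound, bound)
    safe_scale = np.where(weight_scale == 0, 1.0, weight_scale)
    fake_q = q / safe_scale     # back to approximate fp32; exposes rounding/clipping error
    return q.astype(np.int8 if bits <= 8 else np.int16), fake_q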

afe.ir.quantization_conv.try_quantize_bias_tensor(backtracking_parameters: ConvBacktrackingParameters, bias: numpy.ndarray | None, zp_correction: numpy.ndarray, bias_scale: ChannelScale, use_int15: bool, per_channel: bool) → numpy.ndarray[source]

Quantize a bias tensor. If it can’t be quantized due to integer overflow, adjust backtracking parameters. Raise _Retry() if any backtracking parameters were changed.

Parameters:
  • backtracking_parameters – Mutable variables for backtracking. May be modified.

  • bias – Floating-point bias tensor.

  • zp_correction – Integer zero point correction to be added to the bias. This may include correction for the input zero point and/or output zero point, depending on the quantization scheme.

  • bias_scale – Quantization scale to use for bias.

  • use_int15 – Whether int15 convolution is used.

  • per_channel – Whether per-channel quantization is used.

Returns:

Quantized bias tensor.
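
A sketch of the overflow-checked quantization, reusing the _Retry sketch above (the int32 bias range and the rounding are assumptions; the real code also updates the backtracking parameters before retrying):

import numpy as np

def quantize_bias_sketch(bias, bias_scale, zp_correction):
    q = np.round(bias * bias_scale).astype(np.int64) + zp_correction
    lo, hi = np.iinfo(np.int32).min, np.iinfo(np.int32).max
    if np.any((q < lo) | (q > hi)):
        raise _Retry()          # caller adjusts backtracking parameters and restarts
    return q.astype(np.int32)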

afe.ir.quantization_conv.quantized_product_zero_value(q_weight: numpy.ndarray, zero_point: int, intrinsic_shift: numpy.ndarray | int) → numpy.ndarray[source]

Calculate the result of quantized generalized matrix multiply when the input is filled with the zero point value. This represents the zero point result, which should be subtracted to get the true product.

Parameters:
  • q_weight – Quantized weight tensor

  • zero_point – Zero point of input tensor

  • intrinsic_shift – Right-shift that is performed by the convolution algorithm.

Returns:

Convolution result as a 1D tensor
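
Per output channel, this is the zero point times the sum of that channel’s weights, right-shifted by the intrinsic shift. A sketch (the shift’s rounding mode is an assumption):

import numpy as np
from afe.ir.quantization_conv import reshape_weight_to_output_channels

def product_zero_value_sketch(q_weight, zero_point, intrinsic_shift=0):
    w = reshape_weight_to_output_channels(q_weight).astype(np.int64)
    channel_sums = w.reshape(-1, w.shape[-1]).sum(axis=0)   # 1D: one sum per output channel
    return (zero_point * channel_sums) >> intrinsic_shift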

afe.ir.quantization_conv.output_zp_correction_in_bias(precision: ConvolutionPrecision, output_quant: afe.ir.defines.Quantization, requantization: ml_kernels.requantization.BaseRequantization[numpy.ndarray]) → int[source]

Calculate the zero point correction to add to the convolution or matrix multiply’s bias array so that the output has the desired quantization.

If the convolution will not combine zero point correction with bias, but instead will do two separate additions, then the result is 0. Otherwise, the result is the output’s zero point, scaled based on the requantization.

Parameters:
  • precision – Convolution precision type

  • output_quant – Quantization of convolution’s output

  • requantization – Requantization that is performed at the end of convolution

Returns:

Zero point correction that should be added to the bias array
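
A toy illustration of the scaling (my inference from the description above, not the exact rounding used): to land on output zero point Zc after requantizing by S, the correction added before requantization must be approximately Zc / S.

Zc = 10
multiplier, shift = 3, 2
S = multiplier * 2.0 ** -shift      # requantization scale 0.75
correction = round(Zc / S)          # 13; requantized: 13 * 0.75 = 9.75, close to Zc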

afe.ir.quantization_conv.quantize_convolution_parameters(input_quant: afe.ir.defines.Quantization, output_distribution: afe.ir.attributes.ObservedDistribution, weight: numpy.ndarray, bias: numpy.ndarray | None, *, per_channel: bool, bias_corrector: afe.ir.bias_correction.BiasCorrector, asymmetry: bool, use_int15: bool, use_sima_relu_workaround: bool, precision: ConvolutionPrecision, allow_full_output_precision: bool, error_reporter: afe.ir.defines.NodeReporter | None = None) → Tuple[numpy.ndarray, numpy.ndarray, ml_kernels.requantization.BaseRequantization[numpy.ndarray], afe.ir.tensor_type.ScalarType, afe.ir.defines.Quantization, bool][source]

Select quantized parameters for convolution or matrix multiply.

Parameters:
  • input_quant – Quantization that was selected for the input of convolution.

  • output_distribution – Value distribution of the output of convolution.

  • weight – Weight tensor.

  • bias – A bias tensor. If it is None, a bias tensor will still be returned containing the bias correction that was introduced by quantization.

  • per_channel – Whether to do per-channel quantization. If true, the scale will be a tensor with one value per channel.

  • bias_corrector – How to calculate a bias correction term.

  • use_int15 – Whether to quantize for the int15 convolution algorithm. If false, quantize for the int8 convolution algorithm.

  • use_sima_relu_workaround – Whether to use a workaround for int8 SiMa quantization with relu activation. If True, and relu cannot be executed by the backend, then use TFLite quantization. This parameter is only relevant when precision is sima_int8 or sima_int16, and it must be False otherwise.

  • precision – The precision to use for quantizing convolution output.

  • allow_full_output_precision – Whether 16-bit precision can be widened to 32-bit output. If false, quantizing with 16-bit precision will always produce 16-bit output.

  • error_reporter – Used for warnings about bad quantization.

Returns:

A tuple containing the chosen quantization-related parameters: the quantized weight tensor, the quantized bias tensor, the requantization, the scalar type of the output, the quantization of the output, and the msb_left_shift flag value.

afe.ir.quantization_conv.get_bfloat16_with_int_weights_quant_params(attrs: afe.ir.attributes.ConvAddActivationAttrs, per_channel: bool, bits: int) → tuple[numpy.ndarray, numpy.ndarray | None, ml_kernels.requantization.BaseRequantization][source]

Get the quantized weights, the bias if present, and the requantization. Weights are quantized to int8 or int4; the bias, if present, is left unquantized. This allows the requantization scale factor to be simply 1/weight_scale, because requantization is done after the bias is added.

Parameters:
  • attrs – Convolution attributes containing the weights and optional bias.

  • per_channel – Whether per-channel quantization scheme is used for weights.

  • bits – Number of bits to be used.

Returns:

Quantized weights, the bias if present (left unquantized), and the requantization.
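
A toy check of the scale relation described above: with the bias added before requantization, recovering real-valued outputs only needs the factor 1/weight_scale.

w, Sw = 0.3, 64.0            # real weight and its quantization scale
q_w = round(w * Sw)          # 19
restored = q_w * (1.0 / Sw)  # 0.296875, close to w; 1/Sw is the requantization scale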