sima_utils.transformer.model.language_cache_model

Classes

LanguageCacheModel

Base implementation for the cache model of the language model.

Module Contents

class sima_utils.transformer.model.language_cache_model.LanguageCacheModel

Base implementation for the cache model of the language model.

When Sliding Window Attention is supported, the cache model comes in two flavors depending on the layer index: a global cache or a local (sliding-window) cache. Because the cache is managed outside the cache model, the difference is reflected only in the input shapes of the K and V tensors.
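As an illustration, a minimal sketch of the two K/V cache shapes; the dimension names (batch, num_kv_heads, max_seq_len, sliding_window, head_dim) and the use of torch are assumptions for illustration, not part of this class's interface.

    import torch

    batch, num_kv_heads, head_dim = 1, 8, 256
    max_seq_len, sliding_window = 8192, 4096

    # Global-attention layer: the K/V cache spans the full context length.
    k_global = torch.zeros(batch, num_kv_heads, max_seq_len, head_dim)
    v_global = torch.zeros(batch, num_kv_heads, max_seq_len, head_dim)

    # Sliding-window layer: the K/V cache is bounded by the window size.
    k_local = torch.zeros(batch, num_kv_heads, sliding_window, head_dim)
    v_local = torch.zeros(batch, num_kv_heads, sliding_window, head_dim)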

num_tokens

Number of input tokens. Set to a value greater than 1 to consume multiple input tokens in a single model invocation.

token_idx

Token index.

logit_softcapping

Attention logit soft capping (used by Gemma 2).

num_tokens: int
token_idx: int
logit_softcapping: float | None
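For reference, a minimal sketch of how attention logit soft capping is typically applied (as in Gemma 2), assuming logit_softcapping holds the cap value; this illustrates the technique and is not the class's internal code.

    import torch

    def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
        # Squashes attention logits into the open range (-cap, cap).
        return cap * torch.tanh(logits / cap)

    logits = torch.randn(1, 8, 16, 16)  # example attention logits
    capped = soft_cap(logits, 50.0)     # Gemma 2 caps attention logits at 50.0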
gen_onnx_files()
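A hedged usage sketch: only the method name gen_onnx_files() is documented above, so the no-argument constructor and call shown here are assumptions.

    from sima_utils.transformer.model.language_cache_model import LanguageCacheModel

    # Assumed construction; the real constructor signature may differ.
    model = LanguageCacheModel()
    model.num_tokens = 1            # consume one input token per invocation
    model.token_idx = 0             # position of the current token
    model.logit_softcapping = 50.0  # enable attention logit soft capping; None disables it

    # Export the ONNX files for the cache model.
    model.gen_onnx_files()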