sima_utils.transformer.model.language_cache_model

Classes

LanguageCacheModel

Base implementation for the cache model of the language model.

Module Contents

class sima_utils.transformer.model.language_cache_model.LanguageCacheModel

Base implementation for the cache model of the language model.

When Sliding Window Attention is supported, the cache model comes in two flavors depending on the layer index: a global cache or a local (sliding-window) cache. Because the cache is managed outside the cache model, the difference is reflected only in the input shapes of the K and V tensors.
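As an illustration, a minimal sketch of the two K/V cache shapes; the dimension names (batch, num_kv_heads, max_seq_len, sliding_window, head_dim) and the use of torch are assumptions for illustration, not part of this class's interface.

    import torch

    batch, num_kv_heads, head_dim = 1, 8, 256
    max_seq_len, sliding_window = 8192, 4096

    # Global-attention layer: the K/V cache spans the full context length.
    k_global = torch.zeros(batch, num_kv_heads, max_seq_len, head_dim)
    v_global = torch.zeros(batch, num_kv_heads, max_seq_len, head_dim)

    # Sliding-window layer: the K/V cache is bounded by the window size.
    k_local = torch.zeros(batch, num_kv_heads, sliding_window, head_dim)
    v_local = torch.zeros(batch, num_kv_heads, sliding_window, head_dim)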

num_tokens

Number of input tokens. Set to a value greater than 1 to consume multiple input tokens in a single model invocation.

token_idx

Token index.

logit_softcapping

Attention logit soft capping (used by Gemma 2).

num_tokens: int
token_idx: int
logit_softcapping: float | None
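For reference, a minimal sketch of how attention logit soft capping is typically applied (as in Gemma 2), assuming logit_softcapping holds the cap value; this illustrates the technique and is not the class's internal code.

    import torch

    def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
        # Squashes attention logits into the open range (-cap, cap).
        return cap * torch.tanh(logits / cap)

    logits = torch.randn(1, 8, 16, 16)  # example attention logits
    capped = soft_cap(logits, 50.0)     # Gemma 2 caps attention logits at 50.0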
gen_onnx_files()
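A hedged usage sketch: only the method name gen_onnx_files() is documented above, so the no-argument constructor and call shown here are assumptions.

    from sima_utils.transformer.model.language_cache_model import LanguageCacheModel

    # Assumed construction; the real constructor signature may differ.
    model = LanguageCacheModel()
    model.num_tokens = 1            # consume one input token per invocation
    model.token_idx = 0             # position of the current token
    model.logit_softcapping = 50.0  # enable attention logit soft capping; None disables it

    # Export the ONNX files for the cache model.
    model.gen_onnx_files()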