Perplexity

Perplexity is a common metric for evaluating language models. Given a test set, the model predicts each token from the preceding tokens (its context); the better those predictions are, the less 'perplexed' (surprised) the model is by the test set, and therefore the better the language model. There are different mathematical formulations of perplexity; a typical one is the exponential of the average negative log-likelihood of the test-set tokens, PPL = exp(-(1/N) * sum_i log p(x_i | x_<i)), often computed over a sliding window on the test set when the text exceeds the model's context length.
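As a minimal sketch of this computation (assuming PyTorch and the Hugging Face transformers library are available; the model name "gpt2" and the sample text are illustrative choices, not prescribed by this entry), the average negative log-likelihood can be read off the causal-language-model loss and exponentiated:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative model; any causal (autoregressive) language model works the same way.
    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    text = "Perplexity is the exponential of the average negative log-likelihood."
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean cross-entropy,
        # i.e. the average negative log-likelihood over the predicted tokens.
        loss = model(input_ids, labels=input_ids).loss

    perplexity = torch.exp(loss)
    print(f"Perplexity: {perplexity.item():.2f}")

For texts longer than the model's context window, the same quantity is usually estimated by sliding a (possibly strided) window over the test set, scoring only the tokens in each window that have enough preceding context, and averaging the negative log-likelihoods before exponentiating; the external reference below walks through that variant.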
Related concepts:
BLEU, Word Error Rate
External reference:
https://huggingface.co/docs/transformers/perplexity