The main goal of language modeling is to estimate the probability distribution of linguistic units such as words and sentences. In recent years, language modeling has become a central topic in natural language processing and has attracted considerable research attention. Almost all the language we encounter in real life is subject to some form of noise. Examples include:
Speech is subject to both speech errors and physical background noise.
Handwriting can be messy and unreadable, or the paper can have dirt on it.
Likewise, older printed text can have defects such as smudges or faded ink, or the paper itself can be dirty.
Typing, especially on mobile devices, introduces typographical errors.
If all this noise were removed, what would the resulting language look like in its pure form? One answer is that “pure language” is a kind of mental representation. Producing any form of language from that representation introduces additional kinds of “noise”:
Speech undergoes a transformation from a mental representation to the physical act of speaking.
Spelling can be treated as a significant transformation from mental representations of words to sequences of letters. It is especially non-trivial in English, which is why spelling bees exist for English but not for most other languages.
But we have no idea what such a mental representation looks like, so for the purpose of building language technologies, it is far more convenient to pretend that pure language is digital text. Therefore, our language models are models of text, represented as sequences of words drawn from a finite vocabulary.
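To make the final definition concrete, here is a minimal sketch of one of the simplest such models: a unigram model that estimates word probabilities from a toy corpus and assigns a probability to a word sequence. The corpus, tokenization, and the independence assumption are illustrative simplifications, not anything prescribed by the text above.

```python
from collections import Counter
import math

# Toy corpus (an illustrative assumption): whitespace tokenization into words.
corpus = "the cat sat on the mat the dog sat on the rug".split()

counts = Counter(corpus)
total = sum(counts.values())

def unigram_prob(word):
    """Maximum-likelihood estimate of P(word) over the finite vocabulary."""
    return counts[word] / total

def sequence_log_prob(words):
    """Log-probability of a word sequence, assuming words are independent
    (the unigram assumption)."""
    return sum(math.log(unigram_prob(w)) for w in words)

print(unigram_prob("the"))                      # 4 of 12 tokens
print(sequence_log_prob(["the", "cat", "sat"]))
```

Real language models replace the independence assumption with conditioning on preceding context, but the interface is the same: a probability distribution over sequences of words from a finite vocabulary.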