In library science, language identification is important for categorizing materials. Because librarians often have to categorize materials in languages they are not familiar with, they sometimes rely on tables of frequent words and distinctive letters or characters to help them identify languages. While a single such word or character may not suffice to distinguish one language from another with a similar orthography, identifying several is often highly reliable.
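The table-lookup idea can be sketched in a few lines of Python. This is only an illustration: the cue tables, the function names `count_cues` and `guess_language`, and the `min_hits` threshold are all assumptions made for the example, and a real reference table would be far larger and would also list distinctive characters.

```python
import re
from collections import Counter

# Hypothetical cue tables (assumption): a handful of frequent words per language.
FREQUENT_WORDS = {
    "english": {"the", "and", "of", "to", "is", "that"},
    "german": {"der", "die", "und", "das", "ist", "nicht"},
    "french": {"le", "les", "et", "des", "est", "que"},
}

def count_cues(text):
    """Count, per language, how many tokens of `text` appear in its cue table."""
    tokens = re.findall(r"\w+", text.lower())
    hits = Counter()
    for language, words in FREQUENT_WORDS.items():
        hits[language] = sum(1 for token in tokens if token in words)
    return hits

def guess_language(text, min_hits=2):
    """Return the language with the most cue hits, or None if evidence is thin."""
    hits = count_cues(text)
    language, best = hits.most_common(1)[0]
    # Require several cues and a unique winner: one shared word proves little.
    tied = [name for name, n in hits.items() if n == best]
    if best < min_hits or len(tied) > 1:
        return None
    return language

print(guess_language("The history of the town and the river"))
```

Returning `None` when fewer than `min_hits` cues are found mirrors the point that one matching word is weak evidence while several together are much stronger.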
One statistical approach to language identification compares the compressibility of the text with the compressibility of texts in a set of known languages. This approach is known as a mutual-information-based distance measure. The same technique can also be used to empirically construct family trees of languages that closely correspond to the trees constructed using historical methods.
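A rough sketch of the compression idea, using Python's standard `zlib` module: append the unknown text to a reference text in each known language and see how many extra compressed bytes it costs. The reference strings, the names `extra_bytes` and `identify_by_compression`, and the use of raw extra bytes as the score are assumptions for illustration. This is a simplified proxy, not a normalized mutual-information distance, and real use would need far larger reference samples.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib (DEFLATE) compression."""
    return len(zlib.compress(data, 9))

def extra_bytes(reference: str, text: str) -> int:
    """Bytes that `text` adds when it is compressed after `reference`."""
    ref = reference.encode("utf-8")
    return compressed_size(ref + text.encode("utf-8")) - compressed_size(ref)

def identify_by_compression(text: str, references: dict) -> str:
    """Return the language whose reference makes `text` cheapest to encode."""
    return min(references, key=lambda lang: extra_bytes(references[lang], text))

# Placeholder reference texts (assumption); real corpora are much larger.
REFERENCES = {
    "english": (
        "the quick brown fox jumps over the lazy dog and the cat sat on the mat "
        "while the rain fell on the roof of the house "
    ),
    "german": (
        "der schnelle braune fuchs springt ueber den faulen hund und die katze sass "
        "auf der matte waehrend der regen auf das dach fiel "
    ),
}

print(identify_by_compression("the cat sat on the mat", REFERENCES))
```

The intuition is that a compressor primed with text in the right language has already seen the common letter sequences, so the unknown text adds fewer bytes.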
Another technique, as described by Dunning (1994), is to create a language n-gram model from a “training text” for each of the languages. Then, for any piece of text needing to be identified, a similar model is built and compared with each stored language model. The stored model that is most similar to the model of the text indicates the most likely language.
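The model-comparison step can be illustrated with character trigram counts. This sketch makes several assumptions that are not taken from the source: the tiny training strings, the function names, the choice of n = 3, and cosine similarity as the comparison. It is not necessarily the scoring used by Dunning (1994).

```python
import math
from collections import Counter

def ngram_model(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased, space-normalized text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify_by_ngrams(text: str, models: dict) -> str:
    """Return the language whose stored model is most similar to that of `text`."""
    sample = ngram_model(text)
    return max(models, key=lambda lang: cosine_similarity(sample, models[lang]))

# Placeholder training texts (assumption); real models use large corpora.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "spanish": "el zorro salta sobre el perro perezoso y el gato se sienta sobre la alfombra",
}
MODELS = {lang: ngram_model(sample) for lang, sample in TRAINING.items()}

print(identify_by_ngrams("el gato y el perro", MODELS))
```

One model per language is built once and stored; only the short model for the unknown text is computed at identification time.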