Written by Shailesh Sridhar, Associate Data Scientist at MyGate
Introduction
Here’s a quick exercise.
The two sentences below mean “The coffee smells good”. One of them is in Chinese and the other one is in German.
- Der kaffee riecht gut.
- Kafei wen qilai hen xiang.
Can you guess which one is in which language?
If you guessed that the first sentence is in German and the second is in Chinese then you are right.
If you are like most people, you are not confident about the meaning of even one of the words in those two sentences.
So how did you know?
Language Identification in the real world
Language identification is the task of determining the language a given piece of text is written in.
It can act as an important first step in several text processing operations. Especially for operations such as search and text analytics, figuring out the language of the text and applying language-specific processing is generally a lot more efficient than applying generic processing.
Let us look at the example of a complex of guest houses at an academic institution. Many people enter the complex and visit the guest rooms. A data entry operator sits at the gate and is responsible for noting down each visitor's purpose of visit. His English is not very good and he tends to make mistakes.
The following are entries made by people in similar situations. Can you guess the correct spelling of each one?
Entries
1. Oobar
2. POLEMBAR
3. GEZER
4. WHATRPHITR
. . .
Correct Spellings
1. Uber
2. Plumber
3. Geyser
4. Water Filter
Such misspellings make it quite a difficult task to map entries to their corresponding categories.
At MyGate, the data cleanse system is responsible for carrying out this mapping, and it uses phonetics-based algorithms along with regular text-based methods in the process. Two essential components of the system are the Levenshtein distance technique, used to identify the most likely candidates to replace an incorrect word, and Elasticsearch, used for representative storage.
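To give a sense of the candidate-matching step, here is a minimal sketch of Levenshtein-based matching. The category list and the best_match helper are made-up illustrations, not the actual data cleanse system:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical category vocabulary; the real system's list is far larger.
CATEGORIES = ["uber", "plumber", "geyser", "water filter", "newspaper"]

def best_match(entry: str) -> str:
    """Return the category closest in edit distance to a (possibly misspelled) entry."""
    return min(CATEGORIES, key=lambda c: levenshtein(entry.lower(), c))

print(best_match("POLEMBAR"))    # plumber
print(best_match("WHATRPHITR"))  # water filter
```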
The real problem arises when text from other languages is mixed into the entries.
Here are a few such examples:
—————-
newspaper ka check dena hai
Room bearing 702 dekhna
DELIVERY NAM TO BATATE JAO YAAR
————-
Mapping such entries becomes a nightmare, as spelling mistakes now have to be dealt with not only for English words but for words from other languages as well. How do you know whether a misspelled word belongs to English or Hindi? Adding to the complexity are words such as 'TO', which can be a valid word in both Hindi and English.
If there were a way to at least identify, within an entry, which language each word belonged to, things would be a lot easier. This is where language identification comes into the picture.
Word-level language identification refers to the task of identifying the language of a text one word at a time. It is a popular research topic, attracting even big names such as Microsoft Research.
Several approaches exist for solving this problem. However, one aspect most of them have in common is that they rely on surrounding context: given the words around the word under consideration, it is possible to predict which language that word belongs to.
“DELIVERY NAM TO BATATE JAO YAAR”
Here the word TO is flanked by NAM and BATATE, both of which belong to the Hindi dictionary, so TO is highly likely to be a Hindi word as well.
This approach works very well for texts that are many words long. For our use case, however, that is clearly not guaranteed, so we cannot depend on context. The plot below shows the lengths of the entries made.
Figure 1: Distribution of the number of words per entry.
Most entries are 1-2 words long and lack useful context. How do you figure out the language of an unseen word without context?
The Feel of a Word
Let us go back to the sentences we encountered at the beginning of the post.
1. Der kaffee riecht gut
2. Kafei wen qilai hen xiang
Why does the second sentence 'sound' more Chinese? Let us say you only had one word from each sentence.
1. Riecht
2. Xiang
Most people would still identify 2 as Chinese and 1 as German. "Xiang" can be broken up like so:
Xi – A – ng
Intuitively, these parts of the word feel Chinese. Using the hundreds of Chinese words we have heard from sources we may never have consciously considered (e.g. Beijing, Li Xingping, Jackie Chan), our brain subconsciously identifies 'Xiang' as a highly probable Chinese candidate. The word 'feels' Chinese.
To some extent we can actually teach this ‘feeling’ to a computer.
N-Grams
Let's take two 5-letter words and split them into pairs of adjacent letters.
xiang -> [xi, ia, an, ng] (list 1)
hello -> [he, el, ll, lo] (list 2)
If we have a dictionary of all possible two-letter sequences, along with the number of times each is seen across all the words of a particular language, we will probably find that the sequences in list 1 are much more common in Chinese than in English.
Each such sequence of adjacent characters is called a bigram.
A sequence of 'n' adjacent characters is called a character-level n-gram. N-grams play a very important role in Natural Language Processing (NLP). While word-level n-grams (where a sentence is decomposed into sequences of words) are more popular, character-level n-grams, which we use here, can be very useful in their own right.
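Decomposing a word into its character n-grams takes only a couple of lines of code. Here is a minimal sketch; the char_ngrams helper is our own illustration:

```python
def char_ngrams(word: str, n: int = 2) -> list[str]:
    """Return all sequences of n adjacent characters in a word."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("xiang"))        # ['xi', 'ia', 'an', 'ng']
print(char_ngrams("hello"))        # ['he', 'el', 'll', 'lo']
print(char_ngrams("xiang", n=3))   # ['xia', 'ian', 'ang']
```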
Let’s take a look at two more familiar languages, English and Hindi. It is very easy to think of words in Hindi which contain the bigram ‘kh’. Khana. Khargosh. Aankh. Khel.
Now try doing the same for english.
The stark difference becomes clearer when we plot the number of words in each language containing the ‘kh’ bigram.
Figure 2: Number of words containing the bigram 'kh' in English vs Hindi.
Every language has its own unique flow and n-gram frequencies. Programming a computer to decompose a word into n-grams and analyze their frequencies in order to predict the language it belongs to seems like a viable approach.
The reason n-grams are such a powerful concept is that spelling errors typically involve only a few characters; most of the characters in a word remain unaffected. Hence, while a few n-grams may be affected by a spelling error, the majority of them are likely to remain untouched.
Not only that, spelling errors tend to follow patterns of their own, and these patterns may themselves vary by language. If we have enough examples of spelling errors in a language, we can probably learn the n-gram patterns of those errors for that particular language.
Let us see how well an implementation of this idea performs.
An Implementation
Let us limit the problem to the following: given a word, identify whether it belongs to English or Hindi.
In order to capture the important n-gram based features a language possesses, we can either try to hand-code rules, which is an incredibly difficult task, or depend on good old machine learning to identify the rules on its own from thousands and thousands of training examples.
So first, we create a large dictionary of English and Hindi words and decompose each word into its corresponding n-grams. We can try out various values of 'n' and identify which one produces the best results. We also add every n-gram we encounter to a large vocabulary of n-grams.
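A rough sketch of this step is shown below; the tiny word lists are placeholders standing in for the real dictionaries:

```python
from collections import Counter

def char_ngrams(word: str, n: int) -> list[str]:
    """Same helper as in the earlier sketch."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Placeholder word lists; the real dictionaries contain thousands of words each.
english_words = ["water", "filter", "plumber", "geyser", "newspaper"]
hindi_words = ["khana", "khargosh", "aankh", "khel", "batate"]

n = 3
ngram_counts = {
    "english": Counter(g for w in english_words for g in char_ngrams(w, n)),
    "hindi": Counter(g for w in hindi_words for g in char_ngrams(w, n)),
}

# The combined set of every n-gram seen in either language.
vocabulary = set(ngram_counts["english"]) | set(ngram_counts["hindi"])

print(ngram_counts["hindi"].most_common(3))
print(len(vocabulary))
```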
Now that we know the n-grams belonging to each word and the language the word comes from, we can begin training a classifier to figure out the patterns that make a word English or Hindi. Here we use Naive Bayes, a simple yet powerful algorithm which probably deserves a post of its own. The basic idea of Naive Bayes is that, from the frequency with which each n-gram occurs in a particular language, we can estimate the probability of a word belonging to that language based on the n-grams it contains.
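Concretely, treating a word's n-grams g1, g2, ..., gk as (naively) independent features, the standard Naive Bayes rule scores each language as

P(language | word) ∝ P(language) × P(g1 | language) × P(g2 | language) × ... × P(gk | language)

and picks the language with the higher score, where each P(gi | language) is estimated from how often that n-gram appears in the training words of that language.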
By training a Naive Bayes model on a large number of examples, the model learns which n-grams are important for classifying a word as Hindi or English. Now, when an unseen word is encountered, even one with a spelling mistake, it is decomposed into its n-grams and the probability of the word being Hindi or English is calculated from the probabilities corresponding to each of those n-grams.
A model was trained using a dataset of 8000 English words and 8000 Hindi words, once with 3-grams and once with 4-grams. The model predicts the probability of a word belonging to Hindi and to English. If the calculated probability was less than 0.6 for both Hindi and English, the word's label was marked as 'unknown'.
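Here is a minimal sketch of one way to build such a model, using scikit-learn's CountVectorizer for character n-gram features and MultinomialNB for the classifier, with the 0.6 'unknown' threshold applied on top. The word lists and the predict_language helper are illustrative placeholders rather than the production setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder training data; the actual model used roughly 8000 words per language.
english_words = ["water", "filter", "plumber", "geyser", "newspaper", "delivery"]
hindi_words = ["khana", "khargosh", "aankh", "khel", "batate", "dekhna"]

words = english_words + hindi_words
labels = ["english"] * len(english_words) + ["hindi"] * len(hindi_words)

# Character-level 3-grams as features; swap ngram_range to (4, 4) to try 4-grams.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(words)

model = MultinomialNB()
model.fit(X, labels)

def predict_language(word: str, threshold: float = 0.6) -> str:
    """Label a word as 'english' or 'hindi', or 'unknown' if the model is unsure."""
    probs = model.predict_proba(vectorizer.transform([word.lower()]))[0]
    best = probs.argmax()
    if probs[best] < threshold:
        return "unknown"
    return model.classes_[best]

print(predict_language("dekhna"))  # 'hindi'
print(predict_language("to"))      # too short for any 3-gram, so it comes out 'unknown'
```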
This allows us to get very informative results.
| Configuration | English Precision | English Recall | Hindi Precision | Hindi Recall |
| --- | --- | --- | --- | --- |
| n=3, Train 90%, Test 10% | 0.9388 | 0.9136 | 0.9158 | 0.9403 |
| n=3, Train 10%, Test 90% | 0.9235 | 0.8908 | 0.8946 | 0.9262 |
| n=4, Train 90%, Test 10% | 0.9444 | 0.9205 | 0.9224 | 0.9457 |
| n=4, Train 10%, Test 90% | 0.9156 | 0.8815 | 0.8859 | 0.9186 |

Table 3: Precision and recall for English and Hindi with 3-grams and 4-grams, under two train/test splits.
The model performs quite well, reinforcing our idea that n-grams can be very useful for language identification. The hypothesis that n-grams can capture the 'feel' of a language has worked out nicely.
We can now take a breath of relief, knowing that the person who said the words:
“DELIVERY NAM TO BATATE JAO YAAR”
is one step closer to having his pain understood.