Factor Language Model Programming

(ii) Inflection Dictionary
This dictionary contains all possible inflections of the Telugu language. Each entry of the stem word dictionary lists indexes into this dictionary to indicate which inflections are possible with that stem.
The proposed corpus structure reduces the corpus size drastically. Every stem word may take a number of inflections. If the inflected words were stored as they are, the corpus size would be up to m*n, where m is the number of stem words and n is the number of inflections. Instead of storing all the inflected words, the proposed corpus structure stores stem words and inflections separately and handles the inflected words through morphology. Hence the corpus needs entries for only m stem words and n inflections, i.e., m+n. For a corpus of 1000 stem words and 10 inflections, the required corpus size is 1000+10=1010 entries, which would otherwise have been 1000*10=10000.
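The size argument can be checked with a small sketch. The toy entries below are taken from the examples used in this text, but the index lists are illustrative assumptions, and plain concatenation stands in for the morphological rules that the real system applies.

```python
# Toy data: each stem lists the indexes of the inflections it allows.
# The index assignments are assumptions made for illustration only.
INFLECTIONS = ["ki", "ni", "gAriki", "lO", "tO"]
STEMS = {"nAnna": [0, 1, 2], "pustakaM": [3, 4]}


def factored_size(m, n):
    """Entries stored when stems and inflections are kept separately."""
    return m + n


def full_form_size(m, n):
    """Entries stored if every stem were stored with every inflection."""
    return m * n


def expand(stems, inflections):
    """Generate inflected forms on demand instead of storing them.

    Concatenation is a stand-in for the morphology component.
    """
    return [stem + inflections[i] for stem, ids in stems.items() for i in ids]


print(factored_size(1000, 10))   # 1010
print(full_form_size(1000, 10))  # 10000
print(expand(STEMS, INFLECTIONS))
```

The index lists stored with each stem are what allow the inflected forms to be regenerated without storing them.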

Fig 5.3: Corpus structure of proposed Language Model
Textual Word Segmentation using Proposed Language Model
The proposed language model is used to develop a textual word segmenter. A word segmenter divides the given inflected word into a stem and a single inflection. This is required because the corpus stores stems and inflections separately.
The input to the word segmenter is an inflected word. The syllabifier divides this word into syllables and identifies whether each letter is a vowel or a consonant. Applying the syllabification rules yields the syllabified form of the input, which is passed to the analyzer. The analyzer separates the stem and the inflection part of the given word. The stem is validated by comparing it with the stem words present in the stem dictionary. If the stem is present, the inflection of the input word is compared with the inflections listed for that stem in the inflection dictionary. If the inflection matches, the output is displayed directly; otherwise the segmenter selects the appropriate inflection(s) through comparison and then displays the result.
Syllabification is the separation of a word into syllables, where syllables are considered the phonological building blocks of words. The word is divided the way it is pronounced, and the separation is marked by a hyphen. In the morphological analyzer, the main objective is to divide the given word into the root word and the inflection. To do this, the input word is divided into syllables, and the syllables are compared with the root words and inflections to obtain the root word and the appropriate inflection.

Fig 5.4: Block diagram of Word Segmenter for text
Steps for word segmentation

Receive the inflected word as input from the user.
Syllabify the input.
Analyze the input and validate the stem word.
Identify the appropriate inflection for the given stem word by comparing the inflection of the given word with the inflections listed for that stem word in the inflection dictionary.
Display the appropriate inflected word.
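The stem validation and inflection lookup steps can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionaries are toy data, the function name `segment` is hypothetical, the split is tried on letters rather than on syllables, and stems whose form changes under inflection are not handled.

```python
# Toy dictionaries (assumed for illustration); each stem lists the indexes
# of the inflections that are possible with it.
INFLECTIONS = ["ki", "ni", "gAriki", "lO", "tO"]
STEMS = {"nAnna": [0, 1, 2], "pustakaM": [3, 4]}


def segment(word):
    """Split an inflected word into (stem, inflection), or return None.

    The longest candidate stem is tried first. A split is accepted only
    when the stem is in the stem dictionary and the remainder is one of
    the inflections listed for that stem.
    """
    for cut in range(len(word) - 1, 0, -1):
        stem, rest = word[:cut], word[cut:]
        allowed = STEMS.get(stem)
        if allowed is not None and rest in [INFLECTIONS[i] for i in allowed]:
            return stem, rest
    return None


print(segment("nAnnagAriki"))  # ('nAnna', 'gAriki')
print(segment("pustakaMlO"))   # ('pustakaM', 'lO')
print(segment("nAnnalO"))      # None: lO is not listed for nAnna
```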

For example, consider the word “nAnnagAriki” (నాన్నగారికి), meaning “to father”. The input is given by the user in Roman transliteration format. This input is first divided into lexemes as:
n – A – n – n – a – g – A – r – i – k – i
Now the array is processed: the type of each lexeme (consonant or vowel) is identified, and the rules of syllabification are applied one by one.

“No two vowels come together in Telugu literature.”
The given user input does not have two vowels together, so it satisfies this rule and is left unchanged. If the rule is not satisfied, an error message is displayed stating that the given input is incorrect. Now the array is:
c – v – c – c – v – c – v – c – v – c – v

“Initial and final consonants in a word go with the first and last vowel respectively.”
Telugu words rarely end with a consonant; almost all of them end with a vowel. So the final consonant in this rule is not a consonant that ends the string but the last consonant in the string. Applying this second rule changes the array as follows:
c – v – c – c – v – c – v – c – v – c – v
cv – c – c – v – c – v – c – v – cv
This generated output is further processed by applying the other rules.

“VCV: The C goes with the right vowel.”
Wherever the string has the form VCV, this rule divides it as V – CV. The previous rule combined the first and last consonants with the adjacent vowels; this rule combines a consonant with the vowel on its right and separates it from the vowel on its left. Applying this rule to the output of the second rule gives:
cv – c – c – v – c – v – c – v – cv
cv – c – c – v – cv – cv – cv
This output is not yet completely syllabified; one more rule must be applied to finish the syllabification of the given user input word.

“Two or more Cs between Vs – First C goes to the left and the rest to right.”
If the string has the form VCCC*V, this rule splits it as VC – CC*V. In the above output, the VCCV in the string is syllabified as VC – CV. The output then becomes:
cv – c – c – v – cv – cv – cv
cvc – cv – cv – cv – cv
Now this output is converted back to the respective consonants and vowels, giving the complete syllabified form of the given user input.
nAn – na – gA – ri – ki
cvc – cv – cv – cv – cv
Hence, for the given user input “nAnnagAriki”, the generated syllabified form is “nAn – na – gA – ri – ki”.
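The four rules can be sketched in a few lines. This is a minimal sketch, not the implementation used in this work: the vowel set is an assumption about the Roman transliteration scheme, the function names are hypothetical, and the syllable boundaries are computed in a single pass over the vowel positions rather than by rewriting the array rule by rule. The result is the same for the example above.

```python
# Assumed vowel letters of the Roman transliteration; every other
# letter is treated as a consonant.
VOWELS = set("aAiIuUeEoO")


def cv_pattern(word):
    """Label each letter as 'c' (consonant) or 'v' (vowel)."""
    return ["v" if ch in VOWELS else "c" for ch in word]


def syllabify(word):
    """Split a transliterated word into syllables using the four rules."""
    vowel_pos = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not vowel_pos:
        raise ValueError("input contains no vowel")
    cuts = []
    for left, right in zip(vowel_pos, vowel_pos[1:]):
        between = right - left - 1  # consonants between two vowels
        if between == 0:
            # No two vowels may come together.
            raise ValueError("two vowels come together: input is incorrect")
        # VCV: the single C goes right. Two or more Cs: the first C
        # goes left and the rest go right.
        cuts.append(left + 1 if between == 1 else left + 2)
    # Initial and final consonants stay with the first and last vowel,
    # so the only other boundaries are the two ends of the word.
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]


print("-".join(cv_pattern("nAnnagAriki")))  # c-v-c-c-v-c-v-c-v-c-v
print("-".join(syllabify("nAnnagAriki")))   # nAn-na-gA-ri-ki
```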

Fig 5.5: Word Segmenter showing an inflected word without change in stem form

Fig 5.6: Word Segmenter showing an inflected word with a change in stem form
SCIL – Speech Corrector for Indian Languages
In an inflectional language, every word consists of one or several morphemes into which the word can be segmented. The approach used here exploits this characteristic of Telugu to address the above-mentioned problem of needing a very large corpus for good recognition accuracy.
SCIL is a procedure that

deals with complex word forms,
is applied after recognition, and
corrects misrecognized words.

Architecture of SCIL
The design of the Speech Corrector for Indian Languages consists of the Syllable Identifier, Phone Sequence Generator, Word Segmenter, and Morpho-Syntactic Analyzer modules. Input speech is decoded by a conventional ASR system, which gives the identified word as a string. The Syllable Identifier and Phone Sequence Generator convert this string into a sequence of phones, which is the input to the Word Segmenter module. The Word Segmenter matches the phonetized input with the root words stored in the dictionary module and generates a possible set of root words. The Morpho-Syntactic Analyzer compares the inflection part of the signal with the list of possible inflections from the database and gives the correct inflection. This is given to the Morph Analyzer, which applies the morpho-syntactic rules of the language and gives the correct inflected word.

Fig 5.7: Block diagram of SCIL
i) Syllable Identifier
The Syllable Identifier marks the rough boundaries of the syllables and labels them. At this stage, we get a list of syllables separated by hyphens. The user input is syllabified, and this is the input to the next module. E.g. dE-vA-la-yA-ku
ii) Phone Sequence Generator
As the words in the dictionary are stored as phone-level transcriptions, this module generates the phone sequence from the syllables. E.g. d-E-v-A-l-a-y-A-k-u
iii) Word Segmenter
This module compares the phonetized input, starting from its beginning, with the root words stored in the dictionary module and lists the possible set of root words. For the example above, the possible root word is dEvAlayamu.
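A minimal sketch of this step, assuming that each letter of the Roman transliteration corresponds to exactly one phone (the function name `phone_sequence` is hypothetical):

```python
def phone_sequence(syllabified):
    """Flatten a hyphen-separated syllable string into a list of phones.

    Assumes one transliteration letter per phone.
    """
    return [ch for syllable in syllabified.split("-") for ch in syllable]


print("-".join(phone_sequence("dE-vA-la-yA-ku")))  # d-E-v-A-l-a-y-A-k-u
```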
iv) Dictionary
The Dictionary contains stems and inflections separately. It does not store inflected words, as it is very difficult, if not impossible, to cover all inflected words of the language. The database consists of two dictionaries:

Stem Dictionary
Inflection Dictionary

The Stem Dictionary contains the stem words of the language, the signal information for each stem (the duration and location of its utterance), and the list of indices of the Inflection Dictionary entries that are possible with that stem word.
The Inflection Dictionary contains the inflections of the language and the signal information for each inflection (the duration and location of its utterance). Both dictionaries are implemented using a trie structure in order to reduce the search space.
v) Morpho-Syntactic Analyzer
This module compares the inflection part of the signal with the list of possible inflections from the database and gives the correct inflection. This is given to the Morph Analyzer, which applies the morpho-syntactic rules of the language and gives the correct inflected word.
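A trie-based stem dictionary can be sketched as follows. This is a standard trie, not the modified trie proposed in this work. The class and method names are hypothetical, the inflection indexes are illustrative, and the signal information (duration and location) is omitted.

```python
class TrieNode:
    def __init__(self):
        self.children = {}          # phone -> TrieNode
        self.inflection_ids = None  # set only on nodes that end a stem


class StemTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, phones, inflection_ids):
        """Store a stem together with the indexes of its inflections."""
        node = self.root
        for p in phones:
            node = node.children.setdefault(p, TrieNode())
        node.inflection_ids = list(inflection_ids)

    def prefixes(self, phones):
        """Yield (stem, inflection_ids) for each stem that begins the input."""
        node = self.root
        for depth, p in enumerate(phones, start=1):
            node = node.children.get(p)
            if node is None:
                return  # no stored stem continues along this path
            if node.inflection_ids is not None:
                yield "".join(phones[:depth]), node.inflection_ids


trie = StemTrie()
trie.insert("nAnna", [0, 1, 2])
trie.insert("nANemu", [0, 1])
print(list(trie.prefixes("nAnnacAriku")))  # [('nAnna', [0, 1, 2])]
```

The lookup stops as soon as no stored stem shares the current prefix, so only one path of the trie is visited for each input word.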
Post Recognition Procedure

Capture the utterance, an isolated inflected word.
Get its syllabified form.
Generate phone sequence from the syllabified word.
Compare the phone sequences with stem words in the dictionary and identify the stem.
Segment the word into stem and inflection.
Get the list of possible inflections.
Compare the inflection signals possible with that stem one by one and apply morpho-syntactic rules of the language to combine stem and inflection.
Display the inflected word.

Using the rules, the possible set of root words is combined with the possible set of inflections, and the results are compared with the given input. If the given input is correct, the nearest possible root word and inflection are displayed. If the given input is not correct, the inflection part of the input word is compared with the inflections of that particular root word to identify the nearest possible inflection; the root word is combined with the identified inflection, sandhi rules are applied, and the output is displayed. When more than one root word or more than one inflection has the minimum edit distance, the model displays all the possible options, and the user can choose the correct one. For example, when the given word is pustakaMdO (పుస్తకందో), two inflections are possible: tO, making it pustakaMtO (పుస్తకంతో), meaning ‘with the book’, and lO, making it pustakaMlO (పుస్తకంలో), meaning ‘in the book’. The present work lists both words and leaves the choice to the user. We are working on improving this by selecting the appropriate word based on the context.
SCIL Algorithm

W = Utterance.wav
Syl[] = SyllableIdentifier(W)
Phone[] = phonetizer(Syl[])
stem = getStem(Phone[])
Infl[] = getInflections(stem)
While (not exactMatch)

word = MorphAnalyzer(stem, inflMatch)

display word
Stop

Working of SCIL
Once the possible root words are identified, the given word is segmented into two parts, the first being the root word and the second the inflection. The inflection part is then compared in the reverse direction for a match in the inflection dictionary. Only the inflections listed against the possible root words are considered, which reduces the search space and makes the algorithm faster.
For example, consider “nAnnagAriki” (నాన్నగారికి), meaning “to father”, misrecognized as nAn-na-cA-ri-ku (నాన్నచారికు). SCIL is then applied and corrects the recognition error as follows:
The output from the ASR is nAn-na-cA-ri-ku. The phone sequence generator generates the phone sequence n-A-n-n-a-c-A-r-i-k-u. This is matched with the set of root words stored in the dictionary module, which identifies the possible set of root words from the Stem Dictionary as follows:

…….

nAnna (నాన్న)

nANemu (నాణెము)

………

The given word is then segmented into the root word and the inflection, and the inflection part is matched against the inflection dictionary. Only the inflections listed against the possible root words, such as the following, are considered:

ki (కి)

ni (ని)

gAriki (గారికి)

………

Possible set of inflections in the inflection dictionary
After the possible set of root words and the possible set of inflections are obtained, they are combined with the help of SaMdhi formation rules. In this example, cA-ri-ku is compared with the inflections of the root word nAnna.
The comparison identifies gAriki as the nearest possible inflection, the root word is combined with this inflection, and the output is displayed as “nAnnagAriki”.
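The nearest-inflection selection can be sketched with a minimum edit distance search. The text states only that minimum edit distance is used; taking it to be the Levenshtein distance over transliterated letters is an assumption, as are the function names. Plain concatenation stands in for the SaMdhi rules.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nearest_inflections(observed, candidates):
    """Return every candidate at minimum distance from the observed part."""
    distances = {c: edit_distance(observed, c) for c in candidates}
    best = min(distances.values())
    return [c for c in candidates if distances[c] == best]


# Misrecognized nAn-na-cA-ri-ku: stem nAnna, observed inflection cAriku.
matches = nearest_inflections("cAriku", ["ki", "ni", "gAriki"])
print(["nAnna" + m for m in matches])  # ['nAnnagAriki']

# pustakaMdO: two inflections tie, so both options are listed.
print(nearest_inflections("dO", ["tO", "lO"]))  # ['tO', 'lO']
```

Returning all candidates at the minimum distance, rather than only the first, matches the behavior described earlier in which every tied option is shown to the user.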
Conclusions
The language model proposed in this work reduces the corpus size by using a factored approach. The search process is sped up by the use of a trie-based structure, and a change to the standard trie is proposed.
A post-recognition procedure, SCIL, is designed, which uses the proposed language model and corrects words misrecognized at their inflections. The approach is tested using 1500 speech samples. These samples consist of 100 distinct words, each repeated 3 times and recorded by 5 speakers in the age group 18-50. It is implemented as a speaker-dependent system. An average model is built from the three utterances of each word for each speaker. Each speaker is given a unique ID, which is used to select that speaker's average model for testing.
