USE OF LANGUAGE MODELS IN HANDWRITTEN SENTENCE/PHRASE RECOGNITION

Rohini K. Srihari and Sargur N. Srihari

Center of Excellence for Document Analysis and Recognition
State University of New York at Buffalo
UB Commons, 520 Lee Entrance, Suite 202
Amherst, NY 14228-2567

CONTACT INFORMATION

Mail:	CEDAR
	UB Commons
	520 Lee Entrance, Suite 202
	Amherst, NY 14228-2567
E-Mail:	rohini@cedar.buffalo.edu
	srihari@cedar.buffalo.edu
Phone:	(716) 645-6162
Fax:	(716) 645-6176

WWW PAGE

http://www.cedar.buffalo.edu/Linguistics/

PROGRAM AREA

3. Other Communication Modalities.

KEYWORDS

On-line recognition
Language models
Handwriting recognition
Statistical linguistic models
Linguistic post-processing

PROJECT SUMMARY

This research focuses on the use of human language models in performing handwriting recognition. Systems for handwriting recognition are referred to as off-line or on-line systems, depending on whether ordinary handwriting on paper is scanned and digitized or a special stylus and a pressure-sensitive tablet are used. In both, the ultimate objective is to convert handwritten sentences or phrases in analog form (off-line or on-line sources) into digital form (ASCII).

Due to variability in handwriting styles and distortions caused by the digitizing process, even the best handwritten word recognizer is unreliable when the number of word choices is large. This necessitates the use of linguistic constraints (which employ sentence-level context) to achieve a performance level comparable to that of humans.

The research has three parts: (i) adapting existing language models; (ii) developing new language models suited to this application, such as hybrid language models that are more powerful than n-gram models but avoid the computational overhead associated with context-free grammars; and (iii) designing and using training corpora and lexicons effectively, which is necessary because the language employed in handwritten text tends to be informal and ungrammatical.

The value of the research will be demonstrated by incorporating the improved language models into actual handwriting recognition systems, both on-line and off-line. Performance metrics will show the utility of accurate language models by measuring recognition accuracy as a function of the amount of linguistic processing required.

The research holds promise for pen-based human-computer communication, for handwritten document recognition, and for understanding the role of language in the interpretation of writing.

FY95 Accomplishments:
Completed an electronic corpus of informal language (the CEDAR corpus) consisting of approximately 300,000 words. The words are tokenized and tagged using a tagset designed for informal language. The corpus is drawn from electronic mail and newsgroups, and is used for developing and testing new language models.

Developed a statistical simulator, which models the output of a handwritten word recognizer; it models different levels of handwriting quality, styles of handwriting (cursive, printed, or mixed), and distortions of word shape.
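The idea of simulating a word recognizer's output can be illustrated with a toy sketch. The code below is not the CEDAR simulator: the function name, the parameters, and the use of word length as a crude stand-in for word-shape confusability are all assumptions made for illustration. It returns a ranked candidate list in which the true word is the top choice with a chosen probability.

```python
import random

def simulate_recognizer(true_word, lexicon, top_choice_accuracy=0.5,
                        n_choices=3, rng=None):
    """Return a ranked candidate list for true_word (toy stand-in only).

    top_choice_accuracy is the probability that the true word is ranked
    first; lowering it mimics poorer handwriting quality.
    """
    rng = rng or random.Random()
    # Hypothetical confusability rule: words of the same length as the
    # truth are treated as having a similar shape.
    confusable = [w for w in lexicon
                  if w != true_word and len(w) == len(true_word)]
    rng.shuffle(confusable)
    distractors = confusable[: n_choices - 1]
    candidates = [true_word] + distractors
    if distractors and rng.random() >= top_choice_accuracy:
        # Demote the true word to a randomly chosen lower rank.
        pos = rng.randrange(1, len(candidates))
        candidates[0], candidates[pos] = candidates[pos], candidates[0]
    return candidates

if __name__ == "__main__":
    lexicon = ["the", "cat", "car", "can", "dog", "house"]
    print(simulate_recognizer("cat", lexicon, rng=random.Random(0)))
```

A fuller simulator would also model writing style and distortions of word shape, as described above; this sketch only controls top-choice accuracy.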

Developed and benchmarked a syntactic language model designed to improve the recognition rate of a handwritten word recognizer. In tests on simulated recognition results generated from segments of the CEDAR corpus, the tag bigram program achieved over 75% accuracy when the top-choice accuracy of the simulator was a little over 50%.
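One way a tag bigram model can pick among word candidates is a Viterbi search over (word, tag) states. The sketch below is a minimal illustration under stated assumptions, not the program benchmarked above: the function name, the tag names, the scoring rule (recognizer confidence times tag transition probability) and all probabilities in the example are hypothetical.

```python
import math

def best_sentence(candidates, word_tags, tag_bigram, floor=1e-6):
    """Choose one word per position using a tag bigram model.

    candidates: one non-empty list per word position, each holding
                (word, recognizer_confidence) pairs with confidence > 0.
    word_tags:  dict mapping a word to its possible part-of-speech tags.
    tag_bigram: dict mapping (prev_tag, tag) to a probability; "<s>"
                marks the sentence start. Unseen pairs get `floor`.
    """
    # best[(word, tag)] = (log score, word path) for the best path
    # ending in that state.
    best = {("", "<s>"): (0.0, [])}
    for position in candidates:
        new_best = {}
        for word, conf in position:
            for tag in word_tags.get(word, ["UNK"]):
                for (_, prev_tag), (score, path) in best.items():
                    trans = tag_bigram.get((prev_tag, tag), floor)
                    total = score + math.log(conf) + math.log(trans)
                    state = (word, tag)
                    if state not in new_best or total > new_best[state][0]:
                        new_best[state] = (total, path + [word])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

if __name__ == "__main__":
    # Hypothetical recognizer output: "cell" outranks "call" by shape alone.
    candidates = [
        [("he", 0.6), ("be", 0.4)],
        [("will", 0.5), ("wilt", 0.5)],
        [("cell", 0.6), ("call", 0.4)],
    ]
    word_tags = {"he": ["PRON"], "be": ["VERB"], "will": ["MODAL"],
                 "wilt": ["VERB"], "call": ["VERB"], "cell": ["NOUN"]}
    tag_bigram = {("<s>", "PRON"): 0.5, ("<s>", "VERB"): 0.1,
                  ("PRON", "MODAL"): 0.5, ("MODAL", "VERB"): 0.6,
                  ("MODAL", "NOUN"): 0.05}
    print(best_sentence(candidates, word_tags, tag_bigram))
```

In this made-up example the third word's top choice by shape is "cell", but the tag sequence PRON MODAL VERB is far more likely than PRON MODAL NOUN, so the search selects "call".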

Developed and tested a language model based on lexical semantics, i.e., collocations. Preliminary tests indicate a significant increase in the word recognition rate.

Generalized the Damerau-Levenshtein (DL) string distance measure to model more accurately the errors encountered in script recognition.
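For reference, the sketch below computes the standard restricted DL distance (insertion, deletion, substitution, and transposition of adjacent characters) and adds one simple extension: an optional table of reduced substitution costs for visually similar letters. That table and its values are assumptions made for illustration; this is not the generalization developed in the project, whose operations are described in the edit-distance papers listed under PROJECT REFERENCES.

```python
def weighted_dl_distance(a, b, sub_cost=None):
    """Restricted Damerau-Levenshtein distance with optional weighted
    substitutions.

    sub_cost maps a (char, char) pair to a substitution cost; pairs are
    looked up in both orders and default to 1.0. Insertions, deletions
    and adjacent transpositions each cost 1.0.
    """
    sub_cost = sub_cost or {}
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = float(j)          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            x, y = a[i - 1], b[j - 1]
            if x == y:
                cost = 0.0
            else:
                cost = sub_cost.get((x, y), sub_cost.get((y, x), 1.0))
            d[i][j] = min(d[i - 1][j] + 1.0,        # deletion
                          d[i][j - 1] + 1.0,        # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and x == b[j - 2] and a[i - 2] == y:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[-1][-1]

if __name__ == "__main__":
    print(weighted_dl_distance("form", "from"))
    # Hypothetical cost: 'a' and 'u' are easily confused in cursive script.
    print(weighted_dl_distance("cat", "cut", {("a", "u"): 0.4}))
```

With such a table, a lexicon word that differs from the recognizer's output only by a commonly confused letter ranks closer than one that differs by an arbitrary letter.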

Developed a database of on-line handwriting corresponding to several genres (work done with principal funding from the Linguistic Data Consortium). The database, consisting of sentences, isolated words and postal addresses (over 100,000 words in all), was collected from over 200 writers. The sentences are based on 12 passages of 15 sentences each; each passage contains examples of common English language usage. These handwriting samples are being used with the recognition systems, whose results are fed to the language models.

Developed five interactive demonstrations of aspects of this research, which are accessible on the World Wide Web (URL: http://www.cedar.buffalo.edu/Linguistics/demos.html). The demos are: a simulator of handwriting recognition, on-line handwritten word recognition, off-line handwritten word recognition, string matching, and linguistic postprocessing. The last offers a choice of several algorithms. Tested the simulator's validity by comparing its output with that of an actual recognizer.

PROJECT REFERENCES

Srihari, R.K., C. Ng, C. Baltus and J. Kud, Use of Language models in On-line Recognition of Handwritten Sentences, in Proceedings of Third International Workshop on Frontiers in Handwriting Recognition, Buffalo, NY, May 1993, pp. 284-294.
Srihari, R.K., and C. Baltus, Incorporating syntactic constraints in recognizing handwritten sentences, in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-93), Chambery, France, August 1993, pp. 1262-1267.
Srihari, R.K., Use of lexical and syntactic constraints in recognizing handwritten sentences, in Proceedings of the ARPA workshop on Human Language Technology, March 1994, Princeton, NJ, pp. 403-407.
Seni, G., N. Nasrabadi and R.K. Srihari, An On-line Cursive Word Recognition System, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, June 1994, pp. 404-410.
Govindaraju, V., R.K. Srihari and S.N. Srihari, Handwritten Text Recognition, in Proceedings of IAPR Workshop on Document Analysis Systems, Kaiserslautern, Germany, October 1994, pp. 157-171.
Seni, G. and R.K. Srihari, A Hierarchical Approach to On-Line Script Recognition Using a Large Vocabulary, in Proceedings of Fourth International Workshop on Frontiers in Handwriting Recognition (IWFHR IV), Taipei, Taiwan, December 1994, pp. 472-479.
Seni, G., V. Kripasundar and R.K. Srihari, Generalizing edit distance for handwritten text recognition, in Proceedings of SPIE/IS&T Conference on Document Recognition, San Jose, CA, Feb. 5-10, 1995, pp. 54-65.
Srihari, S.N., and R.K. Srihari, Written Language Recognition, in Survey of Human Language Technology (Ronald A. Cole, Joseph Mariani, Hans Uszkoreit, Annie Zaenen, and Victor Zue, eds.), in press.
Seni, G., V. Kripasundar and R.K. Srihari, Generalizing edit distance to incorporate domain information: handwritten text recognition as a case study, in Pattern Recognition, accepted.
Seni, G., R.K. Srihari, N. Nasrabadi, Large Vocabulary Recognition of On-Line Handwritten Cursive Words, in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), accepted.

AREA BACKGROUND

The role of human language models in handwriting recognition is explored. The goal of a handwriting recognition system (HWRS) is to convert handwritten sentences or phrases in analog form (off-line or on-line) into digital form (ASCII). Linguistic constraints (context at the lexical, syntactic, semantic and pragmatic levels) hold the promise of reaching human performance levels. The aim is to develop a language model that combines lexical, syntactic and semantic knowledge and can be integrated into an HWRS. Such a model is needed to overcome ambiguity, which arises both from variability in handwriting, e.g., style (cursive, discrete, discrete touching), spacing (inter-word, inter-line) and legibility, and from large lexicon sizes, and which makes recognition of characters and words by shape alone unreliable.

AREA REFERENCES

Florian C. Coulmas, The Writing Systems of the World, Blackwell, New York, NY, 1989.

M.A. Just and P.A. Carpenter, The Psychology of Reading and Language Comprehension, Allyn and Bacon, Inc., Newton, MA, 1987.

R. Srihari and C.M. Baltus, Incorporating syntactic constraints in recognizing handwritten sentences, Proc. of the Int. Joint Conf. on Artificial Intelligence (IJCAI-93), pp. 1262-1267, Chambery, France, August 1993.

RELATED PROGRAM AREAS

2. Speech and Natural Language Understanding.

6. Intelligent Interactive Systems for Persons with Disabilities.

POTENTIAL RELATED PROJECTS

The work on language models for handwriting recognition is relevant to language models for speech recognition. Persons with a typing or speech disability might find the handwriting modality a useful alternative. A potential project would combine handwriting, speech, and typing for interaction.