Rohini K. Srihari and Sargur N. Srihari
Center of Excellence for Document Analysis and Recognition
State University of New York at Buffalo
UB Commons, 520 Lee Entrance, Suite 202
Amherst, NY 14228-2567
3. Other Communication Modalities.
Statistical linguistic models
This research focuses on the use of human language models in performing handwriting recognition. Systems for handwriting recognition are referred to as off-line or on-line systems, depending on whether ordinary handwriting on paper is scanned and digitized or a special stylus and a pressure-sensitive tablet are used. In both, the ultimate objective is to convert handwritten sentences or phrases in analog form (off-line or on-line sources) into digital form (ASCII).
Due to variability in handwriting styles and distortions caused by the digitizing process, even the best handwritten word recognizer is unreliable when the number of word choices is large. This necessitates the use of linguistic constraints (which employ sentence-level context) to achieve a performance level comparable to that of humans.
There are three parts to the research: (i) adapting existing language models, (ii) developing new language models suitable for this application (such as hybrid language models, which are more powerful than n-gram models but avoid the computational overhead associated with context-free grammars), and (iii) effective design and use of training corpora and lexicons, which is necessary because the language employed in handwritten text tends to be informal and ungrammatical.
The value of the research will be demonstrated by incorporating the improved language models into actual handwriting recognition systems, both on-line and off-line. Performance metrics will show the utility of accurate language models by measuring recognition accuracy as a function of the amount of linguistic processing required.
The research holds promise for pen-based human-computer communication, handwritten document recognition and understanding the role of language in interpretation of writing.
Completed an electronic corpus of informal language (CEDAR corpus) consisting of approximately 300,000 words. The words are tokenized and tagged using a tagset designed for informal language. The corpus is drawn from electronic mail and newsgroups, and is used for developing and testing new language models.
Developed a statistical simulator, which models the output of a handwritten word recognizer; it models different levels of handwriting quality, styles of handwriting (cursive, printed, or mixed), and distortions of word shape.
Developed and benchmarked a syntactic language model designed to improve the recognition rate of a handwritten word recognizer. In tests on simulated recognition results generated from segments of the CEDAR corpus, the tag bigram program achieved over 75% accuracy when the top-choice accuracy of the simulator was slightly over 50%.
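One way a tag bigram model can rerank recognizer output is sketched below in Python. This is a minimal illustration and not the project's actual program: the function name best_sentence, the tag labels, the "<s>" sentence-start tag, the probability floor and all numbers are assumptions made for the example. Each word position carries several (word, tag, confidence) candidates, and a Viterbi search picks the sequence that maximizes recognizer confidence combined with tag-to-tag transition probability.

```python
import math

def best_sentence(candidates, tag_bigram, floor=1e-6):
    """Pick one candidate per position using tag bigram probabilities.

    candidates: list (one entry per word position) of lists of
                (word, tag, confidence) tuples, with confidence > 0.
    tag_bigram: dict mapping (previous_tag, tag) -> probability.
    floor:      probability assumed for tag pairs missing from tag_bigram.
    """
    if not candidates:
        return []
    # best[k] holds (log score, word path) for the best path that ends
    # in candidate k of the position processed most recently.
    best = []
    for word, tag, conf in candidates[0]:
        start = math.log(tag_bigram.get(("<s>", tag), floor))
        best.append((math.log(conf) + start, [word]))
    for pos in range(1, len(candidates)):
        new_best = []
        for word, tag, conf in candidates[pos]:
            options = []
            for (prev_score, path), (_, prev_tag, _) in zip(best, candidates[pos - 1]):
                trans = math.log(tag_bigram.get((prev_tag, tag), floor))
                options.append((prev_score + trans, path))
            prev_score, path = max(options, key=lambda o: o[0])
            new_best.append((prev_score + math.log(conf), path + [word]))
        best = new_best
    return max(best, key=lambda b: b[0])[1]

# Hypothetical recognizer output for a three-word sentence.
candidates = [
    [("he", "PRON", 0.4), ("be", "VERB", 0.6)],
    [("will", "MODAL", 0.5), ("wilt", "VERB", 0.5)],
    [("call", "VERB", 0.45), ("tall", "ADJ", 0.55)],
]
# Hypothetical tag transition probabilities.
tag_bigram = {
    ("<s>", "PRON"): 0.5, ("<s>", "VERB"): 0.05,
    ("PRON", "MODAL"): 0.4, ("PRON", "VERB"): 0.3,
    ("VERB", "MODAL"): 0.05, ("VERB", "VERB"): 0.05,
    ("MODAL", "VERB"): 0.8, ("MODAL", "ADJ"): 0.01,
    ("VERB", "ADJ"): 0.1,
}
print(best_sentence(candidates, tag_bigram))  # ['he', 'will', 'call']
```

In this toy input the recognizer's own top choices would give "be ... tall", while the tag constraints recover "he will call", which is the kind of correction a syntactic model is meant to provide.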
Developed and tested a language model based on lexical semantics, i.e., collocations. Preliminary tests indicate a significant increase in the word recognition rate.
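As an illustration of how collocational evidence might be quantified, the following Python sketch scores word pairs by pointwise mutual information (PMI) estimated from a small corpus. PMI is a common collocation measure chosen here as an assumption; the summary does not state which measure the project used. The function name collocation_scores, the window parameter and the toy sentences are likewise hypothetical.

```python
import math
from collections import Counter

def collocation_scores(sentences, window=2):
    """Return a function pmi(w, v) estimated from tokenized sentences.

    sentences: list of lists of word tokens.
    window:    how many following words count as co-occurring with a word.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sent in sentences:
        word_counts.update(sent)
        for i, w in enumerate(sent):
            for v in sent[i + 1:i + 1 + window]:
                # Unordered pair, so (w, v) and (v, w) share one count.
                pair_counts[frozenset((w, v))] += 1
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    def pmi(w, v):
        joint = pair_counts.get(frozenset((w, v)), 0)
        if joint == 0:
            return float("-inf")  # pair never observed together
        p_joint = joint / total_pairs
        p_w = word_counts[w] / total_words
        p_v = word_counts[v] / total_words
        return math.log2(p_joint / (p_w * p_v))

    return pmi

# Toy corpus: "strong tea" recurs, while "the" pairs with many words.
corpus = [["strong", "tea"], ["strong", "tea"], ["the", "tea"],
          ["the", "dog"], ["the", "cat"], ["the", "sun"]]
pmi = collocation_scores(corpus, window=1)
print(pmi("strong", "tea") > pmi("the", "tea"))  # True
```

A recognizer post-processor could then prefer, among competing word candidates, the one with the highest score relative to its already-recognized neighbors.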
Generalized the Damerau-Levenshtein (DL) string distance measure to accurately incorporate errors encountered in script recognition.
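For reference, the baseline measure can be sketched as follows in Python. This is the restricted Damerau-Levenshtein distance (optimal string alignment variant) with an optional substitution-cost hook, and not the generalized measure developed in this project, whose details are not given here. The function name dl_distance, the sub_cost parameter and the example cost for the pair u/v are assumptions made for illustration.

```python
from typing import Callable

def dl_distance(a: str, b: str,
                sub_cost: Callable[[str, str], float] = lambda x, y: 1.0) -> float:
    """Restricted Damerau-Levenshtein distance between strings a and b.

    Insertions, deletions and adjacent transpositions cost 1.0;
    substituting x by y costs sub_cost(x, y) (1.0 by default).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = float(j)          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            cost = 0.0 if same else sub_cost(a[i - 1], b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1.0,        # deletion
                          d[i][j - 1] + 1.0,        # insertion
                          d[i - 1][j - 1] + cost)   # match or substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)  # transposition
    return d[-1][-1]

# Hypothetical cost table: letters that look alike in script are cheaper to confuse.
def script_cost(x: str, y: str) -> float:
    return 0.5 if {x, y} == {"u", "v"} else 1.0

print(dl_distance("ca", "ac"))                  # 1.0
print(dl_distance("have", "haue", script_cost)) # 0.5
```

The substitution hook shows one direction a script-oriented generalization could take: shape-confusable characters are penalized less than arbitrary substitutions when matching recognizer output against a lexicon.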
Developed a database of on-line handwriting corresponding to several genres (work done with principal funding from the Linguistic Data Consortium). The database, consisting of sentences, isolated words and postal addresses (over 100,000 words in total), was collected from over 200 writers. The sentences are based on 12 passages of 15 sentences each; each passage contains examples of common English language usage. These handwriting samples are being used with the recognition systems, whose results are fed to the language models.
Developed five different interactive demonstrations of aspects of this research, which are accessible on the World Wide Web (URL: http://www.cedar.buffalo.edu/Linguistics/demos.html). The demos are: simulator of handwriting recognition, handwritten word recognition (on-line and off-line), string matching and linguistic postprocessing. The last offers a choice of several algorithms. Tested the simulator's validity by comparing its output with that of an actual recognizer.
The role of human language models in handwriting recognition is explored. The goal of a handwriting recognition system (HWRS) is to convert handwritten sentences or phrases in analog form (off-line or on-line) into digital form (ASCII). Linguistic constraints (context at the lexical, syntactic, semantic and pragmatic levels) hold the promise of reaching human performance levels. The aim is to develop a language model that combines lexical, syntactic and semantic knowledge and can be integrated into an HWRS. This is needed to overcome ambiguity, which arises both from handwriting variability and legibility, e.g., style (cursive, discrete, discrete touching) and spacing (inter-word, inter-line), etc., and from large lexicon sizes, and which makes recognition of characters and words by shape alone unreliable.
Florian C. Coulmas, The Writing Systems of the World, Blackwell, New York, NY, 1989.
M.A. Just and P.A. Carpenter, The Psychology of Reading and Language Comprehension, Allyn and Bacon, Inc., Newton, MA, 1987.
R. Srihari and C.M. Baltus, Incorporating syntactic constraints in recognizing handwritten sentences, Proc. of the Int. Joint Conf. on Artificial Intelligence, (IJCAI-93), pp. 1262-126, Chambery, France, August 1993.
2. Speech and Natural Language Understanding.
6. Intelligent Interactive Systems for Persons with Disabilities.
The work on language models for handwriting recognition is relevant to language models for speech recognition. Persons with a typing or speech disability might find the handwriting modality a useful alternative. A potential project would combine handwriting, speech, and typing for interaction.