Computer Science Department
Mississippi State, MS 39762
The exploration of this problem is important because tagging failures due to part-of-speech confusion often lead to a large loss of information at later stages in the understanding process. In our current research, tagged and parsed scientific journal article text is delivered to another sub-system which extracts information used in generating indices for the text. When the content of a sentence, or set of sentences, is lost due to the mis-tagging of a few critical words, the knowledge extraction process for those sentences is less successful and can even fail entirely.
With published error rates ranging from 5% to less than 3% for existing taggers, there is a temptation to consider the tagging problem "solved". However, these rates are for full text, which includes a high proportion of function words (prepositions, conjunctions, pronouns, and such), which are relatively easy to tag. Error rates on the content-bearing words are higher. Moreover, most taggers have access to substantial lexicons of previously encountered words and the tags with which those words have been labeled. When the error rate is calculated solely for novel vocabulary not represented in the lexicon, it skyrockets to 25-50%.
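The distinction between the full-text error rate and the error rate on novel vocabulary can be made concrete with a small sketch. The function name `error_rates`, the token format, and the toy data are illustrative assumptions, not part of any existing tagger.

```python
def error_rates(tokens, lexicon):
    """Return error rates over all tokens and over novel tokens only.

    tokens  : iterable of (word, gold_tag, predicted_tag) triples (assumed format)
    lexicon : set of lower-cased words the tagger has seen before
    """
    counts = {"all": [0, 0], "novel": [0, 0]}  # [errors, total] per group
    for word, gold, predicted in tokens:
        groups = ["all"]
        if word.lower() not in lexicon:
            groups.append("novel")  # word is not represented in the lexicon
        for group in groups:
            counts[group][1] += 1
            counts[group][0] += int(gold != predicted)
    # A group with no tokens has no defined error rate.
    return {g: (err / total if total else None)
            for g, (err, total) in counts.items()}


# Toy example (made-up data): one novel main verb mis-tagged as a noun.
toy = [
    ("the", "DT", "DT"),
    ("enzyme", "NN", "NN"),
    ("cleaves", "VBZ", "NNS"),
    ("substrate", "NN", "NN"),
]
rates = error_rates(toy, lexicon={"the", "enzyme", "substrate"})
```

In the toy data the overall rate looks modest while the rate on novel words is total, which is the kind of gap the paragraph above describes.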
The research proposed here targets a critical tagging error, that of mis-labeling the main verbs of sentences. This error is especially damaging to later stages of information extraction systems. It is also a surprisingly common error in domain-specific texts outside of the few domains for which large, well-developed corpora and lexicons exist. Our own domain includes a high proportion of previously unseen words. A rule-based tagger whose published error rate is less than 3% gives the wrong part of speech to about 50% of the novel verbs that occur in main-verb position after being trained on 160,000 words of text from our domain. It is the goal of this research to cut that error rate by an order of magnitude.
To this end, a method for broadening the context is described below, as well as a technique for incorporating the broader context into the tagging task. A prototype system for tagging has obtained very encouraging results, which are also reported below.
Whenever a tagger encounters an unknown word, it must use some heuristic to decide the part of speech for that word. Because nouns occur more frequently than any other part of speech category, most taggers have a heavy bias toward labeling an unfamiliar word as a noun. Our experience suggests that guessing that an unknown word is a noun is usually the least damaging strategy, for the following reason: sentences tend to have many noun phrases in which adjectives, nouns, and verbs can occur as modifiers. When one of these adjectives or verbs is mis-labeled as a noun, minimal damage is done (in practical terms of recovering information from the tagged text). The noun phrase in which the mis-tagged term occurs is undisturbed, and the term is perceived to be fulfilling the appropriate role (modification). In our corpus of text, verbs occur more often as modifiers than as main verbs. The error rate on verbs is about equal to that of other non-closed classes. However, the error rate on novel verbs in main verb position is very high (about 50%).
When main verbs are mis-tagged as nouns, the structure of the resulting parse is usually severely compromised. Rarely is it the case that there is simply a noun place-holder where the verb should be. Consequently, a major block of information can be lost from the text. The problem is compounded when several sentences in a row lose their main verbs to mis-tagging. To solve the problem of mis-tagging novel verbs in main verb position, we propose to widen the tagging context window. Like others who use this approach, we postpone the combinatorial explosion which results from examining the larger context by looking at tag sequences rather than lexical sequences. In the preliminary experiment reported below, we lower the dimensionality of the problem space further by mapping the Penn Treebank part of speech tags onto a four-bit Gray code. This approach was suggested by our investigations into the fractal aspects of language.
0010 possessive ending
0011 subordinate conjunction
0101 existential "there"
0110 (unused currently)
1000 determiner/adjective/possessive pronoun
1100 (special filler)
1110 foreign word/punctuation
In the associated code space, the concept noun is a distance of 1 from related concepts such as pronoun, adjective, gerund, and possessive ending. The concept verb is maximally close to existential "there" and to/aux/modal, while the latter is maximally distant from the noun class. Distance here is the Manhattan distance on the corresponding 4-cube (equivalently, the Hamming distance), which is easily calculated by counting the number of differing bits. The Gray code induces a code space in which the syntactic neighborhoods of words may be plotted as points in four-space. Arbitrarily long neighborhoods correspond to arbitrarily fine granularity in the space. The fact that noun phrases are recursively embedded in noun phrases (e.g., a noun phrase inside a prepositional phrase that modifies a noun phrase that is the object of a relative clause that modifies a noun phrase) guarantees that neighborhoods in some regions of fine granularity will be very similar to, though not identical with, regions of larger granularity. That is, high magnification of some areas of the space should reveal patterns similar to regions at low magnification.
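The bit-counting distance can be sketched as follows. Only the codes that appear in the partial table above are used; the codes for the noun and verb classes are not listed there, so they are deliberately omitted rather than guessed. The names `GRAY_CODES`, `code_distance`, and `neighbors` are illustrative.

```python
# Codes copied from the partial table above. Noun and verb codes are not
# shown in that table and are therefore left out of this sketch.
GRAY_CODES = {
    "possessive ending": 0b0010,
    "subordinate conjunction": 0b0011,
    'existential "there"': 0b0101,
    "determiner/adjective/possessive pronoun": 0b1000,
    "special filler": 0b1100,
    "foreign word/punctuation": 0b1110,
}


def code_distance(a: int, b: int) -> int:
    """Manhattan distance on the 4-cube: the number of differing bits."""
    return bin((a ^ b) & 0b1111).count("1")


def neighbors(label: str, radius: int = 1) -> list[str]:
    """Labels whose codes lie within `radius` bit flips of `label`."""
    target = GRAY_CODES[label]
    return sorted(
        other
        for other, code in GRAY_CODES.items()
        if other != label and code_distance(target, code) <= radius
    )
```

For example, `code_distance(0b0010, 0b0011)` is 1, since the two codes differ only in the last bit.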
If each Gray code is assigned a color, then patterns in tagged sentences can be discerned visually. Inspecting these patterns revealed to us that there clearly are part of speech neighborhoods where main verbs often occur. Many of these patterns were longer than could be captured by a window of only two words, as used by most taggers today. We seek to develop a tool to automatically discover these neighborhoods in order to correct noun/verb confusions that do occur. This tool would not replace our existing tagger, but supplement it by checking the tagged text for words that are probably mis-tagged, given their neighborhood.
Using the Gray code described above, a journal article can be translated to a series of tag neighborhoods around each noun, verb, adjective, and adverb. In a prototype experiment, these neighborhoods have been used as input to a neural network to see if it can predict the correct tag for the word in the center of the neighborhood. A window of six tags on each side of the target word is used, in addition to the last two letters of the target word. The neural net was trained on nine files derived from hand-tagged and hand-corrected articles. Each of these files contained 9400 training pairs, where a pair consists of a neighborhood around a target word and the expected output. The prototype achieves a 3% error rate as tested on a tenth file (also of 9400 testing pairs) held out from the training data. These results are encouraging because the neural net is working with open-class words, and looks only at the last two letters of each target word. The neural net does not need or use a lexicon, so the known/unknown word distinction is not present. Our conjecture is that the tagging of novel vocabulary (which has been so problematic) will be about as successful as the tagging of known vocabulary, given this hybrid approach.
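One way such an input vector might be assembled is sketched below. The window of six tags per side and the two-letter suffix come from the description above; everything else is an assumption made for illustration: padding sentence edges with the "special filler" code, the 5-bit letter encoding, and the names `neighborhood`, `code_bits`, and `letter_bits`. The prototype's actual input encoding is not specified here.

```python
import string

WINDOW = 6        # tags on each side of the target word, as in the text
FILLER = 0b1100   # "special filler" code; using it as edge padding is an assumption


def code_bits(code: int) -> list[int]:
    """Unpack a 4-bit Gray code into a list of bits, most significant first."""
    return [(code >> shift) & 1 for shift in (3, 2, 1, 0)]


def letter_bits(ch: str) -> list[int]:
    """Hypothetical 5-bit letter encoding: a=1 ... z=26, anything else 0."""
    value = string.ascii_lowercase.find(ch.lower()) + 1
    return [(value >> shift) & 1 for shift in (4, 3, 2, 1, 0)]


def neighborhood(codes: list[int], words: list[str], i: int) -> list[int]:
    """Feature vector for the word at position i.

    codes : Gray code of each word's tag in the sentence
    words : the words themselves (only the target's last two letters are used)
    The target word's own tag is excluded; it is what the network predicts.
    """
    left = [codes[j] if j >= 0 else FILLER for j in range(i - WINDOW, i)]
    right = [codes[j] if j < len(codes) else FILLER
             for j in range(i + 1, i + 1 + WINDOW)]
    features = [bit for code in left + right for bit in code_bits(code)]
    for ch in words[i][-2:].rjust(2):   # pad one-letter words on the left
        features.extend(letter_bits(ch))
    return features
```

Under these assumptions each training input has 12 x 4 + 2 x 5 = 58 binary features, and the expected output is the code of the target word's tag.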
We propose to examine two alternatives after completing the experiments with evolved and fractally configured neural networks. These experiments involve characterizing the contexts in which each type of neural net out-performs our rule-based tagger. The first alternative is to develop a hybrid system which recognizes the discovered contexts and selectively chooses the rule-based tagger or the best-performing neural network, depending on an a priori likelihood of success for each word. The second alternative is to alter the current rule-based tagger so that it consults a neural network only for those words which are not represented in its lexicon.
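The second alternative amounts to a simple dispatch on lexicon membership, sketched below. The function `hybrid_tag` and the callable interface for the two taggers are hypothetical; neither corresponds to an existing implementation.

```python
from typing import Callable, Sequence

# A tagger here is any callable that returns a tag for the word at position i.
Tagger = Callable[[Sequence[str], int], str]


def hybrid_tag(words: Sequence[str], lexicon: set[str],
               rule_tagger: Tagger, net_tagger: Tagger) -> list[str]:
    """Tag known words with the rule-based tagger, novel words with the net."""
    tags = []
    for i, word in enumerate(words):
        if word.lower() in lexicon:
            tags.append(rule_tagger(words, i))   # word is in the lexicon
        else:
            tags.append(net_tagger(words, i))    # novel word: consult the net
    return tags
```

The first alternative would replace the lexicon test with a per-word estimate of which tagger is more likely to succeed in the current context.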
As it happens, information extraction systems have a pressing need for a solution to the mis-classification problem described above. The utility of the Gray code for that problem was immediately evident. Once the Gray code has been exploited in that endeavor, we plan to return to its original motivation as a tool for exploring self-similarity in language; we hope that it will be only the first of a number of tools to look for attractors corresponding to syntactic units such as noun and verb phrases.
The expected pay-off is a novel hybrid system with substantially lower overall error rates than those currently reported and, even more important, dramatic reduction (order of magnitude difference from the current error rate) of the specific and damaging error of mis-tagging main verbs.
The proposed application of fractal theory to natural language understanding is a unique and promising component of our research. Preliminary results point to a new understanding of complex patterns in natural languages and to new and innovative future approaches to the understanding of natural languages in domains where training of systems is impractical.
Months 4-9: Examination of contexts in which rule-based tagger and neural net are least and most reliable; begin development of hybrid tagger; prepare new test-bed.
Months 9-12: Complete and test hybrid neural net/rule-based tagger system.
As time permits: Further exploration of the utility of the Gray code for studying fractal (self-similar) aspects of natural language. Examine other ways to extend neighborhoods for the purpose of discovering complex patterns.