Marek Placiński
Nicolaus Copernicus University in Toruń
Scientific Supervisor: dr hab. Przemysław Żywiczyński
Psycholinguistic theories of sentence processing and their computational representations
Human sentence processing is one of the central issues in psycholinguistics. Disputes arise because the architecture of the sentence-processing mechanism remains unknown. This paper presents the proposed architectures and their possible realisations in the form of computational models.
Keywords: natural language processing, psycholinguistics, serial processing, parallel processing, modularity, interactivity, ambiguity, computational linguistics, CL, NLP, artificial neural networks, computational psycholinguistics.
This paper presents various approaches to sentence processing from a psycholinguistic perspective. First, the general problem of sentence processing is introduced. This is followed by a discussion of the proposed architectures of sentence processing. The final part describes parsing and two computational models of human sentence processing.
1. The problem of sentence processing
What poses a major problem in sentence and language processing is ambiguity. Ambiguity may occur at nearly every level of comprehension, but two types are particularly relevant to the study of sentence processing. The first is lexical ambiguity, which arises from polysemy: a word's meanings may be confined to a single part of speech or cut across parts of speech (as with the word “check”). Ambiguity also exists at the level of syntax. Syntactic ambiguity occurs when one sentence can be interpreted in a number of ways because of its structure. Syntactic ambiguities can be roughly divided into global ambiguities, which concern the whole sentence, and local ambiguities, which concern only a small part of it. Because of these ambiguities, how language is processed remains a disputed issue in psycholinguistics (Crocker 1996: 25-27).
2. Architectural issues: modularity and interactivity
To begin with, a distinction must be made between modular and interactive theories of sentence processing, the two competing approaches to explaining how language is processed.
The modular theory of language processing is rooted in the nativist approach to language, which assumes that language is transmitted genetically; the same is claimed for the sentence-processing mechanism, which is also said to be innate. The modular theory assumes that language is processed independently at each level: each module is devoted to a different aspect of language, and once a module has interpreted its input, the result is passed on to a higher module. The theory of modularity was strongly influenced by Jerry Fodor. Fodor holds that the mind contains a number of central systems, responsible for functions such as attention and memory, which receive information from input systems that process sensory information and language. He takes the input systems to be modular and to have distinct functions. They are domain specific, because each is responsible for a single type of data: the ears, for instance, receive noises from the environment, but it is the speech module that processes speech. They are also mandatory, since they cannot help but process the information they receive, and they operate rapidly. The input systems are informationally encapsulated, which means that each module works on its own input without access to information held by other systems; once a module has processed its input, the output is simply passed on to the next one. Finally, they are localised, in that they have a fixed neural architecture in the brain. The finding that all meanings of a homophone are initially activated is often cited in support of the modularity theory (Field 2004: 181-182).
The second model of language processing is the interactive model. It assumes that when information is processed at one level, its disambiguation is aided by other levels, whether above or below the current level of processing. Information is processed in two directions in the interactive model. The first is bottom-up processing, in which smaller chunks of data are progressively combined into a more general interpretation. The second is top-down processing, in which conceptual knowledge influences the interpretation of the perceptual data. Bottom-up processing is therefore termed “data-driven” and top-down processing “context-driven”. On the interactive view, both kinds of processing are used simultaneously to process language. Two computational models, TRACE and Shortlist, are used to bear out the theory (Field 2004: 137-138).
3. Architectural issues: parallel and serial processing
As in the previous case, these two architectural models are competing theories, and they are applied above all to explain the resolution of local syntactic ambiguity.
In the parallel architecture, all possible interpretations of an ambiguous utterance are produced and stored in memory. As more input is received, erroneous interpretations are eliminated until only the correct one remains. In practice, however, parallel models assume that not all possible interpretations are retained, which is motivated by the existence of garden-path sentences (e.g. “The horse raced past the barn fell”). Serial processing, on the other hand, considers only one interpretation at a time; the interpretation is therefore constantly updated and revised when it turns out to be wrong (Bader et al. 2003: 165-166).
4. Computational models of sentence processing
This section of the paper defines parsing and presents two parsing strategies. A discussion of two computational models of sentence processing follows: probabilistic models and artificial neural networks, both of which offer a viable account of human sentence processing. Moreover, the two models represent different views of the architecture of the human mind, the first being closer to the modular view and the second to the interactive one.
4.1. Parsing
Parsing is the automatic analysis of sentences aimed at determining their possible syntactic structures. One of the prerequisites of parsing is a formal grammar, a mathematical model of syntax which contains rules describing the structure of sentences and the ways in which words combine (Nederhof and Satta 2010). Most parsers are based on either bottom-up or top-down search strategies (Jurafsky and Martin 2000: 356).
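As an illustration, such a grammar can be written down as a simple data structure. The toy rules and lexicon in the Python sketch below are invented for illustration (loosely echoing the “book a flight” example used later) and are not taken from the cited sources.

# A toy context-free grammar as a plain Python data structure: each
# left-hand-side category maps to the right-hand sides it may expand into.
# Rules and lexicon are invented for illustration.
TOY_GRAMMAR = {
    "S":  [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP": [["Det", "Nominal"], ["ProperNoun"]],
    "VP": [["Verb"], ["Verb", "NP"]],
    "Nominal": [["Noun"], ["Nominal", "Noun"]],
}

TOY_LEXICON = {
    "book":   ["Verb", "Noun"],   # lexically ambiguous
    "a":      ["Det"],
    "flight": ["Noun"],
    "does":   ["Aux"],
    "she":    ["ProperNoun"],     # simplification for the sketch
}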
Top-down parsing starts from the most general category and works towards the input until the most specific, part-of-speech, level is reached. The parser thus assumes that the input it receives is an actual sentence: it accepts that the input can be derived from the designated start symbol S. It then builds all possible sentence trees in parallel by expanding grammar rules from their left-hand sides; a tree may, for example, expect a noun phrase followed by a verb phrase, or an auxiliary followed by a noun phrase and a verb phrase. The right-hand sides of the rules are expanded recursively, and parses whose predictions fail to match the input are eliminated (Jurafsky and Martin 2000: 356-357).
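A minimal sketch of this strategy, assuming a recursive-descent parser with backtracking over an invented toy grammar (left-recursive rules are deliberately left out, since a naive top-down parser would loop on them):

# Top-down (recursive-descent) parsing with backtracking over a toy grammar.
# Categories, rules and lexicon are invented; left-recursive rules are
# avoided because a naive top-down parser would recurse on them forever.

GRAMMAR = {
    "S":  [["NP", "VP"], ["VP"]],
    "NP": [["Det", "Noun"]],
    "VP": [["Verb", "NP"], ["Verb"]],
}

LEXICON = {
    "book":   ["Verb", "Noun"],
    "a":      ["Det"],
    "the":    ["Det"],
    "flight": ["Noun"],
}


def parse(cat, words, i):
    """Yield (tree, next_position) for every derivation of `cat` starting at i."""
    # Match the next word as a terminal of category `cat`.
    if i < len(words) and cat in LEXICON.get(words[i], []):
        yield (cat, words[i]), i + 1
    # Predict: expand every rule for `cat` and try to match its right-hand side.
    for rhs in GRAMMAR.get(cat, []):
        for children, j in parse_sequence(rhs, words, i):
            yield (cat, *children), j


def parse_sequence(cats, words, i):
    """Yield (children, next_position) for derivations of a sequence of categories."""
    if not cats:
        yield (), i
        return
    for first, j in parse(cats[0], words, i):
        for rest, k in parse_sequence(cats[1:], words, j):
            yield (first,) + rest, k


def top_down_parses(sentence):
    words = sentence.lower().split()
    return [tree for tree, j in parse("S", words, 0) if j == len(words)]


print(top_down_parses("Book a flight"))
# [('S', ('VP', ('Verb', 'book'), ('NP', ('Det', 'a'), ('Noun', 'flight'))))]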
The opposite of top-down parsing is bottom-up parsing. A bottom-up parser starts from the words of the input and attempts to build complete parse trees from them, applying the grammar rules of the CFG one at a time; a parse is successful if a tree rooted in the start symbol S is constructed. Parsing begins by looking the input words up in the lexicon and building partial representations with parts of speech assigned to them; if an input word is ambiguous, several parse trees are generated. In contrast to the top-down parser, the bottom-up parser matches the right-hand sides of rules against the material built so far, and partial parses that cannot be combined into larger constituents are eliminated. For example, when the ambiguous sentence “Book a flight” is encountered, the interpretation in which “book” is a noun heading a noun phrase is rejected, and the interpretation in which a verb phrase consists of a verb followed by a noun phrase (a determiner and a nominal) is retained (Jurafsky and Martin 2000: 357-359).
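A minimal sketch of the bottom-up strategy as a shift-reduce recognizer with exhaustive backtracking; the grammar, lexicon and success criterion (a single S covering the whole input) are the same kind of invented toy material as above:

# Bottom-up (shift-reduce) parsing with exhaustive backtracking over a toy
# grammar.  The parser shifts the next word under one of its lexical
# categories, or reduces the top of the stack when it matches a rule's
# right-hand side; it succeeds when the input is consumed and only S remains.

GRAMMAR = {
    "S":  [["VP"], ["NP", "VP"]],
    "NP": [["Det", "Noun"]],
    "VP": [["Verb", "NP"], ["Verb"]],
}

LEXICON = {
    "book":   ["Verb", "Noun"],   # the ambiguity triggers competing parses
    "a":      ["Det"],
    "flight": ["Noun"],
}


def shift_reduce(words):
    """Return True if some sequence of shifts and reductions yields S."""
    def search(stack, i):
        if i == len(words) and stack == ["S"]:
            return True
        # Reduce: replace a matching stack top with the rule's left-hand side.
        for lhs, rhss in GRAMMAR.items():
            for rhs in rhss:
                if len(stack) >= len(rhs) and stack[-len(rhs):] == rhs:
                    if search(stack[:-len(rhs)] + [lhs], i):
                        return True
        # Shift: consume the next word under each of its possible categories.
        if i < len(words):
            for pos in LEXICON.get(words[i], []):
                if search(stack + [pos], i + 1):
                    return True
        return False

    return search([], 0)


print(shift_reduce("book a flight".split()))   # True: "book" ends up as a Verb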
4.2. Probabilistic models
This subsection first defines probabilistic modelling and then discusses two concrete models, one of lexical and one of syntactic processing.
Probabilistic models make it possible to build experience-based accounts of processing which take the frequency of occurrence of items into consideration, and they have proved successful in developing wide-coverage models of language processing. They also involve a shift of perspective: rather than focusing on the constraints of the human sentence processor, they focus on its vast abilities. The goal of a probabilistic model is to find the most probable interpretation of a sentence. Crocker offers the following function to describe the human parser:
$t_i' = \operatorname{argmax}_{t_i \in T_i} P(t_i \mid w_{1 \ldots i}, K)$
The function states that, when word $w_i$ is processed, the preferred analysis $t_i'$ of the substring $w_{1 \ldots i}$ is the analysis $t_i$ with the highest probability in the set $T_i$ of possible analyses, conditioned on the words of the sentence seen so far and on our general knowledge $K$. This is, however, only a general equation and leaves a lot of space for interpretation (Crocker 2010: 493-494).
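As a toy illustration of the argmax, assuming invented probabilities for two competing analyses of a locally ambiguous string (the figures are made up, not Crocker's):

# Keeping the most probable of the candidate analyses (invented figures).
candidate_analyses = {
    "main-verb reading":        0.62,   # e.g. "raced" as the main verb
    "reduced-relative reading": 0.38,   # e.g. "raced" as a passive participle
}
preferred = max(candidate_analyses, key=candidate_analyses.get)
print(preferred)                        # -> main-verb reading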
Many of the ambiguities that occur at the level of syntax have their roots in lexical ambiguities, and the effect of frequency is crucial in resolving them. First, high-frequency words are analysed faster than low-frequency words. Secondly, words are predominantly interpreted in their most typical part of speech. Next, verb subcategorisation preferences influence parsing decisions. Finally, polysemous words are interpreted primarily in their most frequent meaning. Likelihood therefore plays a vital role in arriving at the proper interpretation of an utterance and helps to reduce parsing ambiguity. Corley and Crocker developed a bigram model of part-of-speech (POS) tagging. The model is a bigram model because it combines the probability of a part of speech given the word with the probability of that part of speech given the preceding one, i.e. the local context of the sentence. The Viterbi algorithm is implemented in order to determine the most likely POS sequence. Importantly, if the most likely POS does not fit into the sentence and a less likely one has to be chosen, the model may perform a reanalysis and assign different POS tags to other words in the sentence, not only to the ambiguous one (Crocker 2010: 494-496).
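A minimal sketch in the spirit of such a bigram tagger, assuming invented transition and emission probabilities rather than corpus estimates; the Viterbi algorithm picks the tag sequence that maximises their product:

# A bigram POS tagger sketch: the Viterbi algorithm finds the tag sequence
# maximising the product of transition probabilities P(tag | previous tag)
# and emission probabilities P(word | tag).  All numbers are invented.

TAGS = ["Det", "Noun", "Verb"]

TRANS = {                                # P(tag_i | tag_{i-1}); "<s>" = sentence start
    ("<s>", "Det"): 0.5, ("<s>", "Noun"): 0.2, ("<s>", "Verb"): 0.3,
    ("Det", "Noun"): 0.9, ("Det", "Verb"): 0.1,
    ("Noun", "Verb"): 0.6, ("Noun", "Noun"): 0.4,
    ("Verb", "Det"): 0.7, ("Verb", "Noun"): 0.3,
}

EMIT = {                                 # P(word | tag); "book" is ambiguous
    ("Det", "a"): 0.6, ("Det", "the"): 0.4,
    ("Noun", "book"): 0.3, ("Noun", "flight"): 0.7,
    ("Verb", "book"): 1.0,
}


def viterbi(words):
    """Return (most probable tag sequence, its probability)."""
    # best[i][tag] = (probability, previous tag) of the best path ending in tag.
    best = [dict() for _ in words]
    for t in TAGS:
        p = TRANS.get(("<s>", t), 0.0) * EMIT.get((t, words[0]), 0.0)
        if p > 0:
            best[0][t] = (p, None)
    for i in range(1, len(words)):
        for t in TAGS:
            emit = EMIT.get((t, words[i]), 0.0)
            scores = [(p_prev * TRANS.get((prev, t), 0.0) * emit, prev)
                      for prev, (p_prev, _) in best[i - 1].items()]
            if scores and max(scores)[0] > 0:
                best[i][t] = max(scores)
    # Read off the best final tag and follow the back-pointers.
    tag, (p, _) = max(best[-1].items(), key=lambda kv: kv[1][0])
    tags = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        tags.append(tag)
    return list(reversed(tags)), p


print(viterbi("book a flight".split()))
# (['Verb', 'Det', 'Noun'], about 0.079): "book" is tagged as a verb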
Although the lexical disambiguator is crucial in resolving some structural ambiguities, it is not a complete model of syntactic disambiguation. Frequency, however, applies not only to the lexicon but to syntax as well. Mitchell et al. developed the tuning hypothesis, which assumes that, when faced with an ambiguity, the human parser first chooses the option which has most frequently led it to a correct interpretation in the past. To model human syntactic processing, probabilistic context-free grammars (PCFGs) are used. The difference between PCFGs and traditional context-free grammars is that PCFGs annotate grammar rules with rule probabilities. A rule probability expresses how likely a left-hand-side category is to expand into a given right-hand side: for example, how likely a verb phrase is to expand into a verb followed by a noun phrase. As with lexical frequency, syntactic frequency reflects the ease with which a rule can be accessed in the human mind: the more frequent the rule, the easier it is to access (Crocker 2010: 496-497).
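A minimal sketch of how a PCFG scores a parse, assuming invented rule probabilities: the probability of a tree is the product of the probabilities of the rules used to build it (lexical probabilities are ignored here for simplicity).

# Scoring a parse with a PCFG: multiply the probabilities of the rules used.
# The rule probabilities below are invented; in a real model they would be
# estimated from a treebank.  Lexical probabilities are ignored here.

PCFG = {
    ("S",  ("NP", "VP")):    0.8,
    ("S",  ("VP",)):         0.2,
    ("NP", ("Det", "Noun")): 0.6,
    ("NP", ("Pronoun",)):    0.4,
    ("VP", ("Verb", "NP")):  0.7,
    ("VP", ("Verb",)):       0.3,
}


def tree_probability(tree):
    """A tree is (category, child, child, ...); leaves are plain strings."""
    category, *children = tree
    if all(isinstance(child, str) for child in children):
        return 1.0                        # pre-terminal: skip lexical probability
    rhs = tuple(child[0] for child in children)
    p = PCFG[(category, rhs)]
    for child in children:
        p *= tree_probability(child)
    return p


# "Book a flight" parsed as an imperative: S -> VP, VP -> Verb NP, NP -> Det Noun.
imperative = ("S", ("VP", ("Verb", "book"),
                          ("NP", ("Det", "a"), ("Noun", "flight"))))
print(tree_probability(imperative))       # 0.2 * 0.7 * 0.6 = about 0.084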
4.3. Artificial neural networks
This subsection of the paper presents artificial neural networks, focusing on two issues: their architecture and their learning methods.
Artificial neural networks (ANNs) are a class of computational models which mimic properties of neurons in the brain. They are composed of small units, each responsible for a simple processing task. Although ANNs were initially of great interest to artificial intelligence, nowadays, due to their engineering capabilities, they are primarily applied in natural language processing (Henderson 2010: 221).
The most popular type of ANN is the multi-layered perceptron (MLP), which can be represented by the following diagram.
[Fig. 1. A multi-layered perceptron: input units feed into hidden units, which feed into output units. Adapted from Henderson, 2010.]
The nodes in the diagram represent the small processing units and the arrows represent weighted links. The units are divided into input, hidden and output units. When a vector of input values x is placed on the input nodes, the MLP computes a vector of output values y, computing a value for each hidden unit along the way. MLPs are feed-forward networks, so no loops are allowed. The hidden nodes are of primary interest: the hidden layer computes a new set of continuous-valued outputs from its inputs. Henderson describes the mechanism in the following way:
“A unit j computes its output value, called its activation, as a function of the weights $w_{ji}$ on links from units i to unit j and the activations $z_i$ of these units i. Usually this computation is some function of the weighted sum $w_{j0} + \sum_i z_i w_{ji}$ of j's inputs $z_i$, where $w_{j0}$ is the bias of j and $w_{ji}$ is 0 if no such link exists. For the hidden units, the output of each unit $z_j$ is often a normalized log-linear function of its weighted inputs, called a sigmoid function: $z_j = \frac{1}{1 + e^{-(w_{j0} + \sum_i z_i w_{ji})}}$” (Henderson 2010: 223).
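A minimal sketch of this forward computation, assuming a single hidden layer, sigmoid activations, and arbitrary layer sizes, inputs and random weights chosen purely for illustration:

# Forward pass of a small MLP: each hidden and output unit applies the
# sigmoid to a weighted sum of its inputs plus a bias.  Sizes, inputs and
# (random) weights are arbitrary; a trained network would have learned them.
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
n_input, n_hidden, n_output = 3, 4, 2

W_hidden = rng.normal(size=(n_hidden, n_input))   # w_ji: link from input i to hidden j
b_hidden = rng.normal(size=n_hidden)              # w_j0: bias of hidden unit j
W_output = rng.normal(size=(n_output, n_hidden))
b_output = rng.normal(size=n_output)

x = np.array([0.5, -1.0, 2.0])                    # input vector x
z = sigmoid(W_hidden @ x + b_hidden)              # hidden activations z_j
y = sigmoid(W_output @ z + b_output)              # output vector y
print(y)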
Learning in MLPs is conducted via the backpropagation algorithm. Training starts with random weights, and each step changes the weights a little so that erroneous outputs are gradually corrected: backpropagation compares the network's output with the target output and uses the error to update the weights, so that the network produces more accurate results on subsequent inputs. The updates follow the gradient of the error, i.e. gradient descent; more sophisticated gradient-based methods require more computation per step but fewer steps to reach a result. However, pure gradient descent finds only a “locally optimal set of weights”: it can end up in a local minimum, a point from which any small change of the weights increases the error, and thus fail to reach the global minimum. This is not a fatal problem, because the exact best set of weights is not being sought, but measures must be taken to avoid poor local minima. Two techniques serve this purpose: random initialisations and stochastic gradient descent. Running the training several times from different random initialisations makes it possible to sample several local minima and pick the most suitable one, and it also shows how severe the local-minimum problem is (and whether it is a problem at all). Stochastic gradient descent adds randomness to the weight updates, so that the weights sometimes move up the error gradient rather than down it; it is usually implemented by performing an update after each individual training example rather than after a full pass through the training set (Henderson 2010: 223-224).
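A minimal sketch of backpropagation with stochastic gradient descent on the toy XOR problem, assuming a single hidden layer with sigmoid units; the architecture, learning rate, number of epochs and random seed are arbitrary illustrative choices, not taken from Henderson:

# Backpropagation with stochastic gradient descent on the XOR problem:
# one weight update per training example, starting from random weights.
# Architecture, learning rate, epochs and seed are arbitrary choices.
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])                # XOR targets

W1 = rng.normal(size=(4, 2))                      # random initialisation
b1 = np.zeros(4)
W2 = rng.normal(size=4)
b2 = 0.0
lr = 0.5                                          # learning rate

for epoch in range(5000):
    for i in rng.permutation(len(X)):             # stochastic: one example at a time
        x, t = X[i], T[i]
        z = sigmoid(W1 @ x + b1)                  # forward pass
        y = sigmoid(W2 @ z + b2)
        delta_out = (y - t) * y * (1 - y)         # error signal at the output
        delta_hidden = delta_out * W2 * z * (1 - z)
        W2 -= lr * delta_out * z                  # small step down the error gradient
        b2 -= lr * delta_out
        W1 -= lr * np.outer(delta_hidden, x)
        b1 -= lr * delta_hidden

print([round(float(sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)), 2) for x in X])
# usually close to [0, 1, 1, 0]; an unlucky run may get stuck in a local minimum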
5. Conclusions
The aim of this paper was to introduce young scholars to psycholinguistic research on sentence processing. It summarises some of the research on the architecture of sentence processing and presents two influential computational approaches which may bear out the architectural theories.
References:
Bader, Markus, Josef Bayer, Jens-Max Hopf and Michael Meng. “Is human sentence parsing serial or parallel? Evidence from event-related brain potentials”. Cognitive Brain Research 15 (2003): 165-177.
Clark, Alexander, Chris Fox, and Shalom Lappin (eds.). The Handbook of Computational Linguistics and Natural Language Processing. Chichester: Blackwell Publishing Ltd, 2010.
Crocker, Matthew W. Computational Psycholinguistics: An Interdisciplinary Approach to the Study of Language. Kluwer Academic Publishers, 1996.
Crocker, Matthew W. “Computational Psycholinguistics,” in: Alexander Clark, Chris Fox, and Shalom Lappin (eds.), The Handbook of Computational Linguistics and Natural Language Processing. Chichester: Blackwell Publishing Ltd, 2010, 482-513.
Field, John. Psycholinguistics: The Key Concepts. Routledge, 2004.
Henderson, James B. “Artificial Neural Networks,” in: Alexander Clark, Chris Fox, and Shalom Lappin (eds.), The Handbook of Computational Linguistics and Natural Language Processing. Chichester: Blackwell Publishing Ltd, 2010.
Jurafsky, Daniel and James H. Martin. Speech and Language Processing. Pearson Education International, 2000.
Nederhof, Mark-Jan and Giorgio Satta. “Theory of Parsing,” in: Alexander Clark, Chris Fox, and Shalom Lappin (eds.), The Handbook of Computational Linguistics and Natural Language Processing. Chichester: Blackwell Publishing Ltd, 2010, 105-130.