Deep Learning MT and Logos Model

Translation, Brains and the Computer

Part of the book series: Machine Translation: Technologies and Applications ((MATRA,volume 2))

Abstract

In this Chapter we compare the 45-year-old Logos Model with AI’s deep learning technology and the neural machine translation (NMT) technology that deep learning has given rise to. At a strictly computational level, Logos Model bears zero relationship to NMT, but we point out a number of ways in which Logos Model may nevertheless be seen to have anticipated NMT, specifically at the level of architecture and function. We take note of the fact that NMT has drifted away from interest in the biological verisimilitude of its models, and we note what experts say about the negative effect this has had on so-called continual machine learning (where new learning does not interfere with old learning, an obvious, vital requirement in MT). We discuss the related need for generalizations in MT learning, generalizations that are both semantic and syntactic, akin to the function exhibited by the brain in its continual learning and processing of language. Our discussion turns on a particular point that experts at Google Deep Mind are making about continual learning, one that they say AI has overlooked and that, from our perspective, bears critically on MT. It concerns the way the declarative, similarity-based operations of the hippocampus complement the more analytical, procedure-based operations of the neocortex to support continual learning. Most telling in this regard is their assertion that hippocampal learning is more than “item specific” and that, on the contrary, it exhibits distinct powers of semantic generalization. We note with satisfaction how that assertion about the complementary nature of hippocampal-neocortex learning comports with the analogical/analytical aspects of language processing in Logos Model. The very name and nature of SAL pattern-rules in Logos Model suggest this complementarity. We contend that the views of these deep learning experts provide indirect neuroscientific support for an MT methodology that affords continual, complexity-free learning, one that is predicated upon hippocampal/neocortex-like generalizations (viz., semantico-syntactic patterns). The present Chapter concludes with a Logos Model exercise illustrating the effectiveness of these declarative, hippocampal-like processes for MT.

Notes

  1.

    Koehn (2011, p.155): “The task of decoding in [statistical] machine translation is to find the best scoring translation.” Cho et al. (2014) speak of NMT models as “encoding-decoding models.”
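    A minimal sketch of what “find the best scoring translation” means: the candidate strings and scores below are invented for illustration (a real SMT or NMT decoder would build and score such hypotheses incrementally, e.g. by beam search, rather than receive them ready-made).

      # Hypothetical candidates with hypothetical log-probability scores.
      candidates = {
          "The UN today honored the countries.": -3.9,
          "The UN honored today the countries.": -4.7,
          "Today the UN the countries honored.": -8.2,
      }

      # "Decoding" in Koehn's sense: select the best-scoring hypothesis.
      best = max(candidates, key=candidates.get)
      print(best)  # -> "The UN today honored the countries."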

  2.

    Of course, Logos Model could also be seen as starting out with an encoding function (source string ➔ SAL string), and ending with decoding (transformed SAL string ➔ target string).

  3.

    Kumaran et al. (2016). The authors speak of deep learning processes as “recurrent similarity computation,” a term only remotely fitting for Logos Model.

  4.

    The term deep in deep learning refers to the number of hidden layers in the net, as compared to shallower machine learning models like the earlier perceptron and first connectionist models. DLTM can have as many as 10 or more layers (Kumaran et al. 2016). Logos Model has six layers. Google’s GNMT Translate has 8 decoding layers and 8 encoding layers, with the top encoding layer connected to the bottom decoding layer.
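    Purely as an illustration of what these layer counts refer to, the toy sketch below passes one input vector through a stack of hidden layers; the depth, width and random weights are arbitrary and imply nothing about the actual parameters of Logos Model, GNMT or any other system.

      import numpy as np

      rng = np.random.default_rng(0)
      depth, width = 8, 16                              # e.g. 8 hidden layers
      layers = [rng.standard_normal((width, width)) * 0.1 for _ in range(depth)]

      x = rng.standard_normal(width)                    # a made-up input vector
      for W in layers:                                  # "deep" = many successive
          x = np.tanh(W @ x)                            # hidden-layer transformations
      print(x.shape)                                    # (16,)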

  5.

    Sennrich and Haddow (2016). Authors describe modest output improvements for German↔English translation in a recurrent NMT model with experimental linguistic tags attached to input word embeddings. Linguistic tags included POS, lemmas, morphological features, and simple dependency labels (e.g. relating a word in a phrase to its head).
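    In that approach each linguistic feature receives its own embedding, and the feature embeddings are concatenated with the word embedding to form the input vector. The sketch below shows only this concatenation step, with made-up vector sizes and random stand-ins for the learned embeddings.

      import numpy as np

      rng = np.random.default_rng(1)

      # Toy embedding tables (random stand-ins for embeddings learned in training).
      word_emb = {"honored": rng.standard_normal(8)}
      pos_emb  = {"VERB":    rng.standard_normal(3)}
      dep_emb  = {"root":    rng.standard_normal(3)}

      # Input representation for one token: word vector plus concatenated
      # linguistic-feature vectors, in the spirit of Sennrich and Haddow (2016).
      x = np.concatenate([word_emb["honored"], pos_emb["VERB"], dep_emb["root"]])
      print(x.shape)  # (14,)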

  6.

    Access to global context information is provided by what is technically known as Long Short-Term Memory (LSTM). LSTM was designed to overcome the history-access limitation imposed by n in n-gram processes, where context is capped at the preceding n−1 words. Most advanced NMT systems employ LSTM, e.g., Microsoft’s Bing Translator and Google Translate.
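    To make the contrast with n-gram history concrete: an n-gram model sees only the preceding n−1 words, whereas a recurrent state is updated from every word seen so far. The sketch below uses a plain recurrent update (not the full LSTM gating) simply to show a state that carries the whole history; the sizes, weights and stand-in embeddings are arbitrary.

      import numpy as np

      rng = np.random.default_rng(2)
      W_h = rng.standard_normal((8, 8)) * 0.1
      W_x = rng.standard_normal((8, 8)) * 0.1

      sentence = ["the", "united", "nations", "honored", "the", "countries"]
      n = 3                                     # a trigram model's window
      h = np.zeros(8)                           # recurrent state (full history)

      for i, word in enumerate(sentence):
          x = rng.standard_normal(8)            # stand-in for the word's embedding
          h = np.tanh(W_h @ h + W_x @ x)        # summarizes *all* words so far
          ngram_context = sentence[max(0, i - n + 1):i]   # only the last n-1 words
          print(word, "| n-gram sees:", ngram_context)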

  7.

    Logos Model’s use of numbers was not a little influenced by the fact that the model was originally implemented in Fortran IV (later re-implemented in C). See Part II Postscript for an illustration of how numerical representation was used, and a brief discussion of why LM’s use of numbers turned out to be advantageous.

  8.

    A variant convolutional function is “max-pooling” (see Kalchbrenner et al. 2014).
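    A bare-bones illustration of max-pooling over a sequence of feature vectors (toy numbers only): each output dimension keeps the maximum activation that the corresponding feature reached anywhere in the sequence.

      import numpy as np

      # Rows = positions in the sentence, columns = learned features (toy values).
      features = np.array([[0.1, 0.9, 0.3],
                           [0.7, 0.2, 0.4],
                           [0.5, 0.6, 0.8]])

      pooled = features.max(axis=0)   # maximum over positions, per feature
      print(pooled)                   # [0.7 0.9 0.8]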

  9.

    As stated earlier, this is technically known as Long Short-Term Memory (LSTM), a looping-back function also found in some recurrent NMT models.

  10.

    Kalchbrenner et al. (2014). Authors describe a convolutional neural net that automatically learns conditional distributions from an aligned, bilingual corpus.

  11.

    In LM, hidden layer units are indexed on their semantico-syntactic patterns, making sweeps look like longest-match lexical lookup.
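    The idea of longest-match lexical lookup can be illustrated with a toy matcher over a hypothetical pattern store; the entries below are invented and bear no relation to actual SAL patterns.

      # Hypothetical pattern store: word sequences mapped to labels.
      patterns = {
          ("united", "nations"): "ORG",
          ("united", "nations", "general", "assembly"): "ORG",
          ("honored",): "VERB-transitive",
      }

      def longest_match(words, start):
          """Return the longest stored pattern beginning at position `start`, if any."""
          for length in range(len(words) - start, 0, -1):   # try longest first
              key = tuple(words[start:start + length])
              if key in patterns:
                  return key, patterns[key]
          return None

      tokens = ["the", "united", "nations", "general", "assembly", "honored", "them"]
      print(longest_match(tokens, 1))
      # -> (('united', 'nations', 'general', 'assembly'), 'ORG')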

  12.

    In LM, this history includes a top-down, structural view of the entire sentence produced by layers R1 & R2 in its initial macro-parse (described in Chap. 6). Being able to view such top-down data at any point in the subsequent micro-parse (layers P1 to P4) has obvious advantages. Bahdanau et al. (2015) describe an NMT model where units are similarly progressively annotated with information about the entire input sequence.

  13.

    Dettmers (2015, web) notes that it is “misleading to explain deep learning as mimicry of the human brain.”

  14.

    Sanborn and Chater (2016, web): “…that a Bayesian brain must represent all possible probabilities and make exact calculations using these probabilities [is] too complex for any physical system, including brains.” Authors nevertheless hold for a Bayesian brain but propose that the brain works by “sampling [just] one or a few [probabilities] at a time.”

  15.

    McClelland et al. (1995, p. 25). However, see Hassabis et al. (2017) where authors describe recent AI efforts to find a biologically plausible way around this issue in artificial neural nets, drawing upon dynamic, local information rather than feedback to effect correct connectivity. In LM, local information is also critical in effecting error-free connectivity.

  16.

    LSTM might be construed as a partial means of simulating neuromodulation in DLTM.

  17.

    The connection effects this resolution only if all constraints can be satisfied, e.g., that the candidate verb agrees with the subject in number. In LM, data pertaining to the subject must already have been communicated via neuromodulation to the entire hidden layer for this purpose.

  18.

    By the same token, a hidden layer unit in LM that recognizes clausal transitions will transmit a neuromodulatory signal that effectively uninhibits units whose function is to recognize verbs. (See Fig. 4.6 in Chap. 4 for a neuroscientist’s depiction of neuromodulation in the brain.)

  19.

    Marblestone et al. (2016, web): “Machine learning and neuroscience speak different languages today.” See also Hassabis et al. (2017). Authors of both papers argue for closer integration between the disciplines, contending that artificial intelligence has always benefited from closer integration with brain processes.

  20.

    Neuroscience first became aware that the brain had two entirely different learning and memory regions when trauma to the hippocampus caused memory loss of recent experience, leaving remote, long-term memory intact (McClelland et al. 1995).

  21.

    We use “learning” here in the broadest, everyday sense: seeing or hearing something not seen or heard before, or saying something never said before (which therefore had to be learned). The knowledge, for example, that 79 is a prime number was not learned directly from experience but by virtue of a learned cognitive act.

  22.

    Chomsky abandoned his formalisms (Chomsky 1990) but never his theory that syntax was separate and distinct from semantics. His theory of syntax, however, has not held up well, e.g., Palmer (2006, p. 267): “Chomsky’s current theory has reallocated the explanatory burden from … the syntactic module to the lexicon, with no advance in plausibility. My own exploration and evaluation of Chomsky’s theories … led me to predict that his work will ultimately be seen as a kind of scientific flash flood, generating great excitement, wreaking havoc, but leaving behind only an arid gulch.”

  23.

    In its translation of (4), Microsoft’s Bing NMT Translator most interestingly (and quite legitimately) transformed the passive voice of the English sentence into the active voice in French: Les Nations Unies ont honoré aujourd’hui les pays qui ont accepté les réfugiés. For some reason, in a subsequent run of (4) through Bing NMT Translator, the French output reverted to the passive voice.

  24.

    Kumaran et al. (2016, web). One of the authors, Demis Hassabis, was a co-founder of Deep Mind Technologies, a British AI enterprise acquired by Google in 2014 as Google Deep Mind. The study’s principal author, Dharshan Kumaran, is a senior research scientist at Google Deep Mind. Most interestingly, the third author is James McClelland, co-originator of connectionism in the mid-80s. His presence as author signals Google Deep Mind’s interest, in their words, in “highlighting the connection between neuroscience and machine learning.”

  25.

    In Kumaran et al. (2016, web), authors observe that deep networks “share the characteristics of the slow-learning of the neocortical system …; they achieve an optimal parametric characterization of the statistics of the environment by learning gradually through repeated, interleaved exposure to large numbers of training examples.” Multi-layered neural networks “gradually learn to extract features when trained by adjusting weights to minimize errors in network output.”
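    The quoted description (“adjusting weights to minimize errors” through “repeated, interleaved exposure”) corresponds to ordinary gradient-descent training. The deliberately tiny example below fits a single weight to minimize squared error over repeated passes through toy data; it stands in for the idea only, not for any actual NMT training setup.

      # Toy data: outputs are roughly 2x the inputs, so the single weight w
      # should drift gradually toward 2.0 through repeated, interleaved exposure.
      data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0)]
      w, lr = 0.0, 0.01

      for epoch in range(200):              # repeated passes ("slow learning")
          for x, y in data:                 # interleaved examples
              error = w * x - y
              w -= lr * 2 * error * x       # gradient of the squared error w.r.t. w
      print(round(w, 2))                    # approximately 2.0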

  26.

    As we saw in Chap. 5, this perception of hippocampal learning as specific and superficial has been common among virtually all neuropsychologists, and doubtless explains why the hippocampal role in language processing had remained unrecognized for so long. In this view, language acquisition calls for the generalities of grammar (neocortex), not just the specifics of lexicon (hippocampus).

  27.

    As stated in Chap. 6, Logos Model was depicted over two decades ago as a fortuitous implementation of hippocampal declarative, associative, pattern-based learning (Scott 1990, 2003).

  28.

    McClelland et al. (1995, web): Authors propose that “memories are first stored via synaptic changes in the hippocampal system; that these changes support reinstatement of recent memories in the neocortex; that neocortical synapses change a little on each reinstatement; and that remote memory is based on accumulated neocortical changes. Models that learn via adaptive changes to connections help explain this organization … The hippocampal system permits rapid learning of new items without disrupting this structure.”

  29.

    As noted in Chap. 5, constructionist linguists would clearly agree with this.

  30.

    See related paper on complementary learning by Guise and Shapiro (2017) where, in a study of spatial learning among rats, the authors propose that prior knowledge in medial prefrontal cortex (mPFC) interacts with the CA1 component of the hippocampus, teaching the latter to “retrieve distinct representations of similar circumstances,” thus supporting new learning in the hippocampus. Evidence for this is seen in the fact that when mPFC is inactivated, new learning by CA1 is interfered with.

  31.

    In Chap. 2 we asked whether there might not be some unrecognized cerebral function that accounts for the brain’s freedom from complexity in processing language, a function that might be simulated to the benefit of MT. What Kumaran, Hassabis and McClelland assert about semantic generalizations in the hippocampus might be said to identify that unrecognized function. Logos Model may be said to have demonstrated it.

  32.

    In academic linguistics, semantic field theory fully recognizes that semantics has an abstract, generalized aspect, no less than syntax. For example, terms like batter, pitcher, catcher, fielder all bring to mind the more general concept of baseball. However, I’m not aware of any MT model that knows how to exploit this aspect of semantics. Logos Model’s SAL comes closest but is far less concerned with semantic abstractions (and hypernymy per se) than with the interaction of meaning with syntax, i.e., with their mutual complementarity. Semantic field theory does not concern itself with the interactions of semantics and syntax.

  33.

    The authors cite a 1993 study in their bibliography (Knowlton and Squire 1993) to the effect that “category knowledge” can be acquired “cumulatively from multiple examples,” but we note that this dated study characteristically, and very specifically, excludes the hippocampus from category learning.

  34.

    However, see discussion on second-order language by two constructionists in Postscripts 5-F and 5-G in Chap. 5.

  35.

    The neurolinguist Pulvermüller (2013, web) reports on neuroimaging studies designed to identify and locate processes and regions responsible for “symbolic semantics.” Though he does not cite the hippocampus directly, he identifies multiple temporal and parietal areas linked in “combinatorial learning.” Interestingly, Pulvermüller speaks of “correlated learning” that takes place between circuits that handle new words, for example, and previously learned circuits, a correlation that, he says, “leads to the emergence of semantic categories.”

  36.

    Neuroscientists like Pothos (2007) identify these patterns with hippocampal exemplars. Pothos concurs that semantic patterns far more than syntactic rules form the deeper basis of language learning.

  37.

    Koehn and Knowles (2017, web). Authors note that the quality of NMT output generally degrades (i) where input text is out-of-domain, and/or (ii) where training resources were scant. Authors also note that the internal processes of NMT systems are difficult to interpret given that “specific word choices are based on large matrices of real-numbered values.”

  38.

    Toral and Sánchez-Cartagena (2017) report that phrase-based SMT significantly outperforms NMT on sentences greater than 40 words in length in a wide variety of languages.

  39.

    In this respect SAL can be seen as a considerably more refined elaboration of Case grammar (Fillmore 1968) and Valency grammar (Fischer and Ágel 2010). That these grammars traditionally speak of their verb classifications as syntactic subdivisions conceals the fact that these verb subdivisions have their basis in semantics, i.e., that they are really semantico-syntactic subdivisions.

  40.

    Microsoft’s Bing NMT Translator turned out the best translation among all systems: Les Nations qui ont signé l’accord que notre pays offrait ont changé d’avis. We elected to show its SMT output for (5) only to demonstrate the point of this exercise.

  41.

    If (7) had been formed with no elisions, e.g., The cat that the dog that we owned chased belonged to our neighbor, readers presumably would have had somewhat less difficulty with it.

  42.

    Google’s Ray Kurzweil, for one, believes that machines in the future will not only emulate the human neocortex but will actually extend it.

References

  • Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: Oral presentation at the 3rd international conference on learning representations (ICLR 2015), San Diego. http://www.iclr.cc/lib/exe/fetch.php?media=iclr2015:bahdanau-iclr2015.pdf. Accessed 26 Nov 2016

  • Cho K (2015) Introduction to neural machine translation with GPUs (part 1). https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus. Accessed 24 June 2016

  • Cho K, van Merriënboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder-decoder approaches. In: Proceedings of the eighth workshop on syntax, semantics and structure in statistical translation (SSST-8), Doha, pp 103–111. https://arxiv.org/pdf/1409.1259.pdf

  • Chomsky N (1990) On formalization and formal linguistics. Nat Lang Linguist Theory 8:143–147

  • Deeplearning4j Development Team (2016) Introduction to deep neural networks. https://deeplearning4j.org/neuralnet-overview. Accessed 14 Aug 2016

  • Dettmers T (2015) Deep learning in a nutshell: core concepts. Internet Blog. https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training. Accessed 8 June 2016

  • Fillmore C (1968) The case for case. In: Bach E, Harms RT (eds) Universals in linguistic theory. Holt/Rinehart and Winston, New York/London, pp 1–88

  • Fischer K, Ágel V (2010) Dependency grammar and valency theory. In: The Oxford handbook of linguistic analysis. Oxford University Press, Oxford, pp 223–255

  • Goldberg AE (2009) The nature of generalization in language. Cogn Linguist 20(1):93–127

  • Guise KG, Shapiro M (2017) Medial prefrontal cortex reduces memory interference by modifying hippocampal encoding. Neuron 94(1):183–192

  • Hassabis D, Kumaran D, Summerfield C, Botvinick M (2017) Neuroscience-inspired artificial intelligence. Neuron 95(2):245–258

  • Hayakawa SI, Hayakawa AR (1991) Language in thought and action, 5th edn. Houghton Mifflin Harcourt, New York

  • Kalchbrenner N, Blunsom P (2013) Recurrent convolutional neural networks for discourse compositionality. In: Proceedings of the 2013 workshop on continuous vector space models and their compositionality, Sofia, pp 119–126

  • Kalchbrenner N, Grefenstette E, Blunsom P (2014) A convolutional neural network for modelling sentences. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, Baltimore, pp 655–665

  • Knowlton BJ, Squire LR (1993) The learning of categories: parallel brain systems for item memory and category knowledge. Science 262(5140):1747–1749

  • Koehn P (2011) Statistical machine translation. Cambridge University Press, Cambridge

  • Koehn P, Knowles R (2017) Six challenges for neural machine translation. In: Proceedings of the first workshop on neural machine translation, Vancouver, pp 26–39. arXiv:1706.03872v1. Accessed 13 Dec 2017

  • Kumaran D, McClelland JL (2012) Generalization through the recurrent interaction of episodic memories: a model of the hippocampal system. Psychol Rev 119(3):573–616

  • Kumaran D, Hassabis D, McClelland JL (2016) What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends Cogn Sci 20(7). https://doi.org/10.1016/j.tics.2016.05.004. Accessed 12 Jan 2017

  • Kurzweil R (2013) How to create a mind: the secret of human thought revealed. Penguin Books, New York

  • Liu S, Yang N, Li M, Zhou M (2014) A recursive recurrent neural network for statistical machine translation. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, Baltimore, pp 1491–1500

  • Marblestone AH, Wayne G, Kording KP (2016) Toward an integration of deep learning and neuroscience. Front Comput Neurosci 10(19). https://doi.org/10.3389/fncom.2016.00094

  • McClelland JL, McNaughton BL, O’Reilly RC (1995) Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol Rev 102(3):419–457. https://www.ncbi.nlm.nih.gov/pubmed/7624455

  • Meng F, Lu Z, Wang M, Li H, Jiang W, Liu Q (2015) Encoding source language with convolutional neural network for machine translation. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing, vol 1, Long Papers, Beijing, pp 20–30

  • Palmer DC (2006) On Chomsky’s appraisal of Skinner’s Verbal Behavior: a half-century of misunderstanding. Behav Anal 29(2):253–267

  • Pothos EM (2007) Theories of artificial grammar learning. Psychol Bull 133:227–244

  • Pulvermüller F (2013) How neurons make meaning: brain mechanisms for embodied and abstract-symbolic semantics. Trends Cogn Sci 17(9):458–470. http://www.sciencedirect.com/science/article/pii/S1364661313001228. Accessed 13 Dec 2015

  • Sanborn AN, Chater N (2016) Bayesian brains without probabilities. Trends Cogn Sci 20(12):883–893. http://www.sciencedirect.com/science/journal/13646613/20/12?sdc=1. Accessed 6 Feb 2017

  • Scott B (1989) The logos system. In: Proceedings of MT summit II, Munich, pp 137–142

  • Scott B (1990) Biological neural net for parsing long, complex sentences. Logos Corporation Publication

  • Scott B (2003) Logos model: an historical perspective. Mach Transl 18(1):1–72

  • Sennrich R, Haddow B (2016) Linguistic input features improve neural machine translation. arXiv:1606.02892v2 [cs.CL]. Accessed 15 Aug 2017

  • Toral A, Sánchez-Cartagena VM (2017) A multifaceted evaluation of neural versus phrase-based machine translation for 9 language directions. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics, vol 1, Long Papers, Valencia, pp 1063–1073. arXiv:1701.02901 [cs.CL]

  • Zhang J, Ye L (2010) Series feature aggregation for content-based image retrieval. Comput Electr Eng 36(4):691–701

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Scott, B. (2018). Deep Learning MT and Logos Model. In: Translation, Brains and the Computer. Machine Translation: Technologies and Applications, vol 2. Springer, Cham. https://doi.org/10.1007/978-3-319-76629-4_8

  • DOI: https://doi.org/10.1007/978-3-319-76629-4_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-76628-7

  • Online ISBN: 978-3-319-76629-4
