Backwards Differentiation in AD and Neural Nets: Past Links and New Opportunities
Paul J. Werbos
National Science Foundation, Arlington, VA, USA
pwerbos@nsf.gov
Summary. Backwards calculation of derivatives, sometimes called the reverse mode, the full adjoint method, or backpropagation, has been developed and applied in many fields. This paper reviews several strands of history, advanced capabilities, and types of application, particularly those which are crucial to the development of brain-like capabilities in intelligent control and artificial intelligence.
Key words: Reverse mode, backpropagation, intelligent control, reinforcement learning, neural networks, MLP, recurrent networks, approximate dynamic programming, adjoint, implicit systems
1 Introduction and Summary
Backwards differentiation, or "the reverse accumulation of derivatives," has been used in many different fields, under different names, for different purposes. This paper will review the parts of that history, and of the underlying concepts, which I experienced directly. More importantly, it will describe how reverse differentiation could have more impact across a much wider range of applications.
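As an illustration only (my own toy example, not taken from any of the papers cited here), the following Python sketch shows reverse accumulation on a small function: one forward sweep stores the intermediate values, and one backward sweep propagates adjoints from the output to the inputs, returning the full gradient at roughly the cost of a single extra function evaluation. The function and variable names are arbitrary choices for this sketch.

```python
import math

def f_and_gradient(x1, x2):
    """Reverse accumulation for the toy function f(x1, x2) = sin(x1*x2) + x2**2."""
    # Forward sweep: compute and store intermediates.
    v1 = x1 * x2          # v1 = x1 * x2
    v2 = math.sin(v1)     # v2 = sin(v1)
    v3 = x2 ** 2          # v3 = x2^2
    f = v2 + v3           # output

    # Backward sweep: seed the output adjoint with 1 and apply the
    # chain rule step by step, using only locally stored values.
    f_bar = 1.0
    v2_bar = f_bar                            # df/dv2 = 1
    v3_bar = f_bar                            # df/dv3 = 1
    v1_bar = v2_bar * math.cos(v1)            # through sin
    x1_bar = v1_bar * x2                      # dv1/dx1 = x2
    x2_bar = v1_bar * x1 + v3_bar * 2.0 * x2  # two paths reach x2

    return f, (x1_bar, x2_bar)

if __name__ == "__main__":
    value, grad = f_and_gradient(0.5, 2.0)
    print(value, grad)  # value and full gradient from one backward sweep
```

Note that every step of the backward sweep uses only values stored at, or adjacent to, the corresponding node of the computation; this locality is what the circuit and brain analogies in item 4 below rely on.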
Backwards differentiation has been used in four main ways known to me:
1. In automatic differentiation (AD), a field well covered by the rest of this book. In AD, reverse differentiation is usually called the "reverse method" or the "adjoint method." However, the term "adjoint method" has actually been used to describe two different generations of methods. Only the newer generation, which Griewank has called "the true adjoint method," captures the full power of the method.
2. In neural networks, where it is normally called "backpropagation" [532, 541, 544]. Surveys have shown that backpropagation is used in a majority of the real-world applications of artificial neural networks (ANNs). This is the stream of work that I know best, and which I may even claim to have originated.
3. In hand-coded "adjoint" or "dual" subroutines developed for specific models and applications, e.g., [534, 535, 539, 540].
4. In circuit design. Because the calculations of the reverse method are all local, it is possible to insert circuits onto a chip which physically calculate derivatives backwards, on the same chip that calculates the quantities being differentiated. Professor Robert Newcomb at the University of Maryland, College Park, is one of the people who has implemented such "adjoint circuits."
Some of us believe that local calculations of this kind must exist in the brain, because the computational capabilities of the brain require some use of derivatives and because mechanisms have been found in the brain which fit this idea.
These four strands of research could benefit greatly from greater collaboration. For example, the AD community may well have the deepest understanding of how to actually calculate derivatives and how to build robust dual subroutines, but the neural network community has worked hard to find many ways of using backpropagation in a wide variety of applications.
The gap between the AD community and the neural network community reminds me of a split I once saw between some people making aircraft engines and people making aircraft bodies.