# Multilayer Feedforward Networks and the Backpropagation Algorithm


## Introduction

### Backpropagation

The backpropagation algorithm is perhaps the most widely used training algorithm for multilayer feedforward networks. Even so, many people find it difficult to build a multilayer feedforward network and its training algorithm from scratch, whether because of the math (which can look daunting at first glance, given all the derivations) or because of the work involved in actually coding the network and the training algorithm. Hopefully, after reading this guide you will walk away knowing more about the backpropagation algorithm than you did before. Before continuing with this tutorial, you might want to check out James' introductory essay on neural networks.

The Perceptron

### Summary

The problem with the perceptron is that it cannot express non-linear decisions. The perceptron is basically a linear threshold device which returns a certain value, 1 for example, if the dot product of the input vector and the associated weight vector plus the bias surpasses the threshold, and another value, -1 for example, if the threshold is not reached.

When the dot product of the input vector and the associated weight vector plus the bias, $f(x_1, x_2, \dots, x_n) = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + w_b$, is set equal to the threshold and graphed in the $x_1, x_2, \dots, x_n$ coordinate space, the result is a line (or, in more than two dimensions, a plane or hyperplane). More importantly, this boundary separates the space into two regions. All input vectors that give a value of $f$ greater than the threshold fall in one region, and those that do not fall in the other (see figure).

Left: a linearly separable decision surface. Right: a non-linearly separable decision surface.
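The threshold behaviour described above can be sketched in a few lines of Python. This is only an illustration: the function name `perceptron`, the hand-picked AND weights, and the coarse search grid are assumptions made for the example, not part of the original guide.

```python
import itertools

def perceptron(x, w, bias, threshold=0.0):
    """Linear threshold unit: returns 1 if w.x + bias exceeds the threshold, else -1."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if weighted_sum > threshold else -1

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND is linearly separable: one hand-picked (hypothetical) weight setting solves it.
and_outputs = [perceptron(x, [1.0, 1.0], -1.5) for x in inputs]

# XOR is not: no weight setting on this coarse grid reproduces its truth table.
grid = [i * 0.5 for i in range(-4, 5)]
xor_targets = [-1, 1, 1, -1]
xor_solutions = [
    (w1, w2, b)
    for w1, w2, b in itertools.product(grid, repeat=3)
    if [perceptron(x, [w1, w2], b) for x in inputs] == xor_targets
]
```

The grid search is not a proof, but it matches the geometric argument: a single line cannot put (0,1) and (1,0) on one side and (0,0) and (1,1) on the other.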

The obvious problem with this model, then, is: what if the decision cannot be linearly separated? The failure of the perceptron to learn the XOR function and to distinguish between even and odd almost led to the demise of faith in neural network research. The solution came, however, with the development of neuron models that apply a sigmoid function to the weighted sum ($w_1 x_1 + w_2 x_2 + \dots + w_n x_n + w_b$) to make the activation of the neuron non-linear, scaled, and differentiable (and therefore continuous). An example of a commonly used sigmoid function is the logistic function $o(y) = 1/(1 + e^{-y})$, where $y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + w_b$. When these "sigmoid units" are arranged layer by layer, with the activations of each layer serving as the input vector of the layer downstream, the multilayer feedforward network is created.
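As a small sketch of the logistic function mentioned above, the following Python defines it together with its derivative, which can be written in terms of the output itself as o(1 - o). The function names are chosen for this example only.

```python
import math

def logistic(y):
    """Logistic sigmoid: squashes any real y into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def logistic_derivative(y):
    """Derivative of the logistic function, expressed through its own output."""
    o = logistic(y)
    return o * (1.0 - o)
```

Being able to compute the derivative from the activation alone is what makes this function convenient for the weight updates discussed later.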

Multilayer feedforward networks normally consist of three or four layers. There is always one input layer and one output layer, and usually one hidden layer; some classification problems may need two hidden layers, although this case is rare. The term "input layer neurons" is a misnomer: no sigmoid unit is applied to the value of each of these neurons. Their raw values are fed into the layer downstream of the input layer (the hidden layer). Once the activations of the hidden layer neurons are computed, they are fed downstream to the next layer, until the activations eventually reach the output layer, in which each output neuron is associated with a specific classification category. In a fully connected multilayer feedforward network, each neuron in one layer is connected by a weight to every neuron in the layer downstream of it. A bias is also associated with each of these weighted sums. Thus, to compute the value of each neuron in the hidden and output layers, one must first add the bias to the weighted sum of its inputs and then apply f(sum) (the sigmoid function) to calculate the neuron's activation.
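The layer-by-layer computation described above might look like the following sketch. The data layout (one list of weights per neuron plus a separate list of biases) and the example weight values are assumptions made for illustration.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def layer_forward(inputs, weights, biases):
    """weights[j][i] links input i to neuron j of this layer; biases[j] is neuron j's bias."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def network_forward(inputs, layers):
    """Feed raw input values through each (weights, biases) layer in turn."""
    activations = inputs  # the input layer applies no sigmoid
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical 2-input, 2-hidden, 1-output network with made-up weights.
example_layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2]),  # hidden layer
    ([[1.2, -0.7]], [0.05]),                   # output layer
]
example_output = network_forward([1.0, 0.0], example_layers)
```

Each call to `layer_forward` uses the previous layer's activations as its input vector, which is all that "feeding downstream" means here.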

Graph of the logistic function. Notice it scales the output to a value ranging from 0 to 1

How then does the network learn the problem at hand? By modifying all the weights, of course. If you know calculus, you might have already guessed that taking the partial derivative of the network error with respect to each weight tells us a little about the direction in which the error is moving. In fact, if we take the negative of this derivative (i.e. of the rate of change of the error as the value of the weight increases), scale it by a small learning rate, and add it to the weight, the error will decrease until it reaches a local minimum. This makes sense: if the derivative is positive, the error increases as the weight increases, so the obvious thing to do is to add a negative value to the weight, and vice versa if the derivative is negative. The actual derivation will be covered later. These partial derivatives are computed and applied starting with the hidden-to-output layer weights and then moving to the input-to-hidden layer weights (as it turns out, this is necessary, since changing the upstream set of weights requires the partial derivatives calculated in the layer downstream), which is why this algorithm is called the "backpropagation algorithm".
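To make the "add the negative derivative" idea concrete, here is a minimal sketch with a single logistic unit, one weight, and one training sample. The learning rate `eta`, the starting weight, and the sample values are all assumptions chosen for the example.

```python
import math

def logistic(y):
    return 1.0 / (1.0 + math.exp(-y))

x, target = 1.0, 1.0   # one training input and its target output (made up)
w = 0.0                # initial weight (made up)
eta = 0.5              # learning rate (made up)
errors = []
for step in range(50):
    o = logistic(w * x)
    errors.append((target - o) ** 2)
    # dE/dw for E = (target - o)^2 with a logistic unit
    gradient = -2.0 * (target - o) * o * (1.0 - o) * x
    # Add the negative derivative, scaled by the learning rate
    w += -eta * gradient
```

In this toy setting the recorded error shrinks at every step, because the derivative is always negative and the weight therefore keeps increasing toward values that push the output closer to the target.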

A 3-layer feedforward network. Notice that in this fully connected network every neuron in the hidden and output layer is connected to every neuron in the previous layer.

How is the error of the network computed? In most classification networks, the output neuron that achieves the highest activation determines how the network classifies the input vector. For example, if we wanted to train our network to recognize 7x7 binary images of the numbers 0 through 9, we would expect our network to have 10 output neurons, with each output neuron corresponding to one number. Thus if the first output neuron is most activated, the network classifies the image (which has been converted to an input vector and fed into the network) as "0"; if the second neuron is, "1"; and so on. To calculate the error, we create a target vector consisting of the expected outputs. For example, for the image of the number 7, we would want the eighth output neuron to have an activation of 1.0 (the maximum for a sigmoid unit) and all other output neurons to have an activation of 0.0. Now, starting from the first output neuron and ending at the tenth, calculate the squared error by squaring the difference between the target value (the expected value for the output neuron) and the actual output value. Take the average of all these squared errors and you have the network error. The error is squared so that it is never negative and its derivative is easy to compute.
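The target vector, the averaged squared error, and the "highest activation wins" rule can be sketched as follows. The helper names are hypothetical and exist only for this example.

```python
def target_vector(digit, classes=10):
    """1.0 at the position of the expected digit, 0.0 everywhere else."""
    return [1.0 if k == digit else 0.0 for k in range(classes)]

def mean_squared_error(target, output):
    """Average of the squared differences between target and actual activations."""
    return sum((t - o) ** 2 for t, o in zip(target, output)) / len(target)

def classify(output):
    """Index of the most activated output neuron."""
    return max(range(len(output)), key=lambda k: output[k])
```

For the digit 7, `target_vector(7)` puts the 1.0 at index 7, which is the eighth output neuron because counting starts at "0".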

Once the error is computed, the weights can be updated one by one. This process continues from image to image until the network is finally able to recognize all the images in the training set.

### Training

Recall that training basically involves feeding training samples as input vectors through a neural network, calculating the error of the output layer, and then adjusting the weights of the network to minimize the error. In this guide, each "training epoch" involves one exposure of the network to a training sample from the training set, and one adjustment of each of the weights of the network, layer by layer (many other texts reserve "epoch" for a full pass over the training set). Training samples may be selected from the training set at random (I would recommend this method, especially if the training set is particularly small), or simply by going through the training samples in order.

Training can stop when the network error dips below a particular error threshold. The threshold is up to you: a squared error of .001 is a good starting point, but this varies from problem to problem, and in some cases you may never reach a squared error of .001 or less. It is important to note, however, that excessive training can have damaging results in problems such as pattern recognition. The network may become too adapted to the samples in the training set, and thus may be unable to accurately classify samples outside of the training set. For example, if we overtrained a network with a training set consisting of sound samples of the words "dog" and "cog", the network may become unable to recognize the word "dog" or "cog" said by an unusual voice unlike those in the training set. When this happens, we can either include these samples in the training set and retrain, or we can set a more lenient error threshold.

These "outside" samples make up the "validation" set. This is how we assess our network's performance. We cannot expect to assess network performance based solely on the success of the network in learning an isolated training set. Tests must be done to confirm that the network is also capable of classifying samples outside of the training set.
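A minimal sketch of this check is to measure accuracy separately on the training set and on the validation set. Everything here is a toy assumption: the sample data, the labels, and the deliberately overfitted `memoriser`, which stands in for an overtrained network.

```python
def accuracy(classify, samples):
    """Fraction of (input, label) samples that classify() labels correctly."""
    correct = sum(1 for x, label in samples if classify(x) == label)
    return correct / len(samples)

# Made-up toy data for illustration only.
training_set = [((0, 0), "a"), ((0, 1), "b"), ((1, 0), "b")]
validation_set = [((1, 1), "a"), ((2, 0), "b")]

# A stand-in for an overtrained network: it memorises the training set
# and falls back to guessing "b" for anything it has not seen.
memory = dict(training_set)

def memoriser(x):
    return memory.get(x, "b")

training_accuracy = accuracy(memoriser, training_set)
validation_accuracy = accuracy(memoriser, validation_set)
```

Here the training accuracy is perfect while the validation accuracy is only one half, which is the gap the validation set is meant to expose.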

### Backpropagation Algorithm

The first step is to feed the input vector through the network and compute the activation of every unit in the network. Recall that this is done by computing the weighted sum coming into the unit and then applying the sigmoid function. The 'x' vector is the activation of the previous layer.

The 'w' vector denotes the weights linking the neuron unit to the previous neuron layer.
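In symbols, a standard way to write this step, assuming $w_i$ is the weight on input $x_i$, $w_b$ is the bias, and the sigmoid is the logistic function introduced earlier, is:

$$
y = \sum_{i=1}^{n} w_i x_i + w_b, \qquad o = \frac{1}{1 + e^{-y}}
$$

Here $y$ is the weighted sum coming into the unit and $o$ is the unit's activation.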

The second step is to compute the squared error of the network. Recall that this is done by taking the sum of the squared error of every unit in the output layer. The target vector involved is associated with the training sample (the input vector).

't' denotes a target value in the target vector, and 'o' denotes the activation of a unit in the output layer.
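Written out, and assuming the index $k$ runs over the units of the output layer, the squared error described in this step would be:

$$
E = \sum_{k} (t_k - o_k)^2
$$

Many derivations put a factor of $\tfrac{1}{2}$ in front of the sum so that the 2 produced by differentiation cancels. That choice only rescales the error and can be absorbed into the learning rate.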

The third step is to calculate the error term of each output unit, indicated below as 'delta'.

The error term is related to the partial derivative of the network error with respect to each weight.
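For logistic output units, the usual form of this error term can be sketched as follows. It assumes the $\tfrac{1}{2}$-scaled squared error $E = \tfrac{1}{2}\sum_k (t_k - o_k)^2$, which is a common convention rather than something stated in this guide, and writes $y_k$ for the weighted sum coming into output unit $k$:

$$
\delta_k = -\frac{\partial E}{\partial y_k} = (t_k - o_k)\, o_k (1 - o_k)
$$

The factor $o_k (1 - o_k)$ is the derivative of the logistic function, and $(t_k - o_k)$ is how far the unit is from its target.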

The fourth step is to calculate the error term of each of the hidden units.

The hidden unit error term depends on the error terms calculated for the output units.
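A standard way to write this dependence, assuming $o_h$ is the activation of hidden unit $h$, $w_{kh}$ is the weight from hidden unit $h$ to output unit $k$, and $\delta_k$ is the error term of output unit $k$ from step three, is:

$$
\delta_h = o_h (1 - o_h) \sum_{k} w_{kh}\, \delta_k
$$

Each hidden unit is held responsible for the output errors in proportion to the weights that connect it to the output units.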

The fifth step is to compute the weight deltas. 'Eta' here is the learning rate. A low learning rate can ensure more stable convergence. A high learning rate can speed up convergence in some cases.

'x' denotes the unit that's connected to the unit downstream by the weight 'w'
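In symbols, a common form of the weight delta, assuming $\delta$ is the error term of the unit downstream of the weight and $x$ is the activation of the unit feeding into it, is:

$$
\Delta w = \eta\, \delta\, x
$$

For a bias weight the same rule applies with $x = 1$, giving $\Delta w_b = \eta\, \delta$.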

The final step is to add the weight deltas to each of the weights. I prefer adjusting the weights one layer at a time. This method involves recomputing the network error before the next weight layer error terms are computed.

Once finished, proceed back to step 1.
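Putting the six steps together, a minimal sketch in plain Python might look like the following. Several things here are assumptions made for illustration: the XOR training set, three hidden units, the learning rate of 0.5, the fixed random seed, and the storage of each bias as the last entry of a weight row. Unlike the layer-at-a-time preference described above, this sketch computes all error terms first and then updates every weight, which is the more common variant.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(x, w_hidden, w_output):
    """Step 1. Each weight row is [w_1, ..., w_n, w_b]; zip() stops before the bias."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + row[-1]) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden)) + row[-1]) for row in w_output]
    return hidden, output

def train_step(x, target, w_hidden, w_output, eta):
    hidden, output = forward(x, w_hidden, w_output)                       # step 1
    error = sum((t - o) ** 2 for t, o in zip(target, output))             # step 2
    delta_out = [o * (1 - o) * (t - o) for t, o in zip(target, output)]   # step 3
    delta_hid = [                                                         # step 4
        h * (1 - h) * sum(w_output[k][j] * delta_out[k] for k in range(len(w_output)))
        for j, h in enumerate(hidden)
    ]
    for k, row in enumerate(w_output):                                    # steps 5 and 6
        for j, h in enumerate(hidden):
            row[j] += eta * delta_out[k] * h
        row[-1] += eta * delta_out[k]
    for j, row in enumerate(w_hidden):
        for i, xi in enumerate(x):
            row[i] += eta * delta_hid[j] * xi
        row[-1] += eta * delta_hid[j]
    return error

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]   # XOR
w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]    # 2 inputs + bias
w_output = [[rng.uniform(-1, 1) for _ in range(4)]]                      # 3 hidden + bias

def network_error():
    """Summed squared error over the whole training set."""
    return sum(
        sum((t - o) ** 2 for t, o in zip(target, forward(x, w_hidden, w_output)[1]))
        for x, target in samples
    )

initial_error = network_error()
for _ in range(5000):
    x, target = rng.choice(samples)   # random sample selection
    train_step(x, target, w_hidden, w_output, eta=0.5)
final_error = network_error()
```

Comparing `initial_error` with `final_error` shows whether training helped. A real training loop would also stop once the error falls below the chosen threshold and would check performance on a separate validation set.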

### Similar Content

Abstract
Recently in the field of Artificial Intelligence, scientists have been asking which approach best models the human brain: bottom-up or top-down. Both approaches have their advantages and their disadvantages. The top-down approach has the advantage of having all the necessary knowledge already present for the program to use (given that the knowledge was pre-programmed) and thus it can perform relatively high-level tasks such as language processing. The bottom-up approach has the advantage of being able to model lower-level human functions, such as image recognition, motor control and learning capabilities. Each method fails where the other excels, and it is this trait of the two approaches that is the root of the debate.

In order to assess this, this essay deals with two areas of Artificial Intelligence — Natural Language Processing and robotics. Natural Language Processing uses the top-down methodology, whereas robotics uses the bottom-up approach. Jerry A. Fodor, Mentalese, and conceptual representation support the ideas of a top-down approach. The MIT Cog Robot Team fervently supports the bottom-up approach when modelling the human brain. This essay looks at the theory behind conceptual representation and its parallel in philosophy, Mentalese. The theory involved in the bottom-up approaches is skipped due to the background knowledge required in both computer science and neurobiology.
After looking at the two approaches, I concluded that currently Artificial Intelligence is too segmented to create any universal approach to modelling the brain — the top-down approach has all the necessary traits needed for NLP, and the bottom-up approach has all the necessary traits for fundamental robotics. A universal model will probably arise from the bottom-up approach since Artificial Intelligence is seeking its information from neurobiology instead of philosophy, thus more concrete theories about the functioning of the brain are formed. For the moment, though, either the approaches should be used in their respective fields, or a compromise (such as an object-orientated approach) should be formulated.

Introduction
Throughout the history of artificial intelligence one question has always been asked when given a problem. Should the problem be solved via the top-down method or through the bottom-up method? There are many different areas of artificial intelligence where this question arises, but none more so than in the areas of Natural Language Processing (NLP) and robotics.

As we grow up we learn the language of our parents, making mistakes at first, but slowly growing used to using language to communicate with others. Humans definitely learn through a bottom-up approach — we all start with nothing when we are born. It is through our own intellect and learning that we master language. In the field of computing, though, such methods cannot always be utilised.
The two approaches to the problems are called top-down and bottom-up, according to how they tackle the problems. Top-down takes pre-programmed knowledge (like a large knowledge base) and uses symbolic creation, manipulation, linking and analysis to perform the calculations. The top-down approach is the one most commonly used in the field of classical (and neo-classical) artificial intelligence that utilises serial programming. The bottom-up approach tackles the problem at hand by starting with a relatively simple abstract program that is built to learn by itself — building its own knowledge base and commonsense assertions. This is normally done with parallel processing, or data structures simulating parallel processing, such as neural networks.
When looking at the similarities of these approaches to the human brain, one can see the similarities in both. The bottom-up approach has the obvious connection that it uses learning mechanisms to gain its knowledge. The top-down approach has most of its connections in the field of natural language, and the philosophical computational models of the brain. Much philosophical theory supports the idea of an inherently computational brain, one that uses symbol manipulation in order to do its ‘calculations’.
The two approaches differ greatly in their usefulness to the two fields concerned. In NLP, the bottom-up approach would take such a long time before the knowledge system was rich enough to paraphrase text, or infer from newspaper articles, that a large amount of pre-programmed (but reusable) starting information would be a much more practical approach. With robotics, though, the amount of space required for a large pre-programmed knowledge base is huge, and with the space restrictions on a computer chip, large code segments are not an option. More importantly, the top-down, linear approaches are very easily subjected to exceptions that the knowledge base cannot handle, especially given a real world environment in which to operate. Since bottom-up approaches learn to deal with such exceptions and difficulties, the systems adapt, and are incredibly flexible in their execution.
As stated, the bottom-up approach utilises parallel processing, and advanced data structures such as neural networking, evolutionary computing, and other such methods. The theory behind these ideas is too broad and is, aside from a brief introduction to the subject, beyond the boundaries of this essay. Instead, this essay deals with one of the computational models of the brain, Mentalese, and its parallel in computer science — conceptual representation.
Despite the applications of the bottom-up and the top-down approach to different fields of NLP and robotics, both fields have a common problem — how to code commonsense into the program or robot, whether to use brute computation, or when to let the program learn for itself.
Commonsense and General Knowledge
Commonsense, or consensus reality, is an incredible obstacle that AI faces. Over the course of our childhood, millions of tiny ‘facts’ are gathered by us that are taken for granted later in life. Think about the importance of this for any program that is to exhibit general intelligence. It has been generally accepted by the AI community that any future parsing program will need command of general knowledge. Such is the opinion of the ParseTalk team:

Dreyfus (a well-known sceptic of AI) says that a program can only be classified as intelligent after it has exhibited a general command of commonsense. For instance, a classic program designed to answer questions concerning a restaurant might be able to answer, "What food was ordered?" or, "How much did the person pay for his food?" Yet it could not answer "What part of the body was used to eat the food?" or "How many pairs of chopsticks were required to eat the food?" So much of our life requires us to relate to our commonsense that we never notice it. So, how could commonsense be coded into a computer?

The top-down method relies on vast amounts of pre-programmed information, concepts, and other such symbols for the program to use. The bottom-up method uses techniques such as neural networking, evolutionary computing, and parallel processing to allow the program to adapt and learn from its environment. Classical AI chooses the top-down method for use in programs such as inference engines, expert systems and other such programs where learning knowledge over many years is not an option. The field of robotics often takes the bottom-up methodology, letting the systems get information from the ‘senses’ and allowing them to adapt to the environment.
When looking at natural languages though, a top-down approach is often needed. This is due to several reasons, the first being that most modern day computers do not have the capabilities to learn through sensory experience. Secondly, a program that is sold with the intent to read and paraphrase large amounts of text will not be acceptable if it requires two years of continual running so it can learn the knowledge required before usage. Therefore, an incredible amount of pre-programmed information is required. The most comprehensive top-down knowledge base so far is the CYC Project.
The CYC Project: The Top-Down Approach
The CYC Project is a knowledge base (KB) being created by the Cycorp Corporation in Austin, Texas. CYC aims to turn several million general knowledge assumptions into a computable form. The project has been going on for about 14 years and over a million discrete concepts and commonsense assertions have been turned into CYC entities.

A CYC entity is not necessarily limited to one word; often it represents a group of words, or a concept. Look at this example taken from the CYC KB:
;;; #$Food-ReadyToEat
(#$isa #$Food-ReadyToEat #$ProductType)
(#$isa #$Food-ReadyToEat #$ExistingStuffType)
(#$genls #$Food-ReadyToEat #$FoodAndDrink)
(#$genls #$Food-ReadyToEat #$OrganicStuff)
(#$genls #$Food-ReadyToEat #$Food)
You can see how ‘food that is ready to eat’ is represented in CYC as a group of IS-A relationships and GENLS relationships. The IS-A relationships are an ‘element of’-relationship whereas the GENLS relationships are a ‘subset of’-relationship. This hierarchy creates a large linked concept for a very simple term. For example, the Food-ReadyToEat concept IS-A ExistingStuffType, and in the CYC KB, ExistingStuffType is represented as:

;;; #$ExistingStuffType
(#$isa #$ExistingStuffType #$Collection)
(#$genls #$ExistingStuffType #$TemporalStuffType)
(#$genls #$ExistingStuffType #$StuffType)

With the following comment about it: "…A collection of collections. Each element of #$ExistingStuffType is a collection of things (including portions of things) which are temporally and spatially stufflike; they may also be stufflike in other ways, e.g., in some physical property. Division in time or space does not destroy the stufflike quality of the object…"
It is apparent how generic many of the concepts get as they rise in the CYC hierarchy. Evidently, such a huge KB would generate a large concept for a small entity, but such a large concept is necessary. For example, the CYC team created a sample program in 1994 that fetched images given search criteria. Given a request to search for images of seated people, the program retrieved an image with the following caption: "There are some cars. They are on a street. There are some trees on the side of the street. They are shedding their leaves. Some of them are yellow taxicabs. The New York City skyline is in the background. It is sunny." The program had deduced that cars have seats, in which people normally sit, when the car is in motion.

COG: The Bottom-Up Approach
After many years of GOFAI (Good Old Fashioned Artificial Intelligence) research, scientists started doubting the classical AI approaches. The MIT Cog Robot team eloquently put their doubts:

The whole robot is designed without any complex systems modelling, or pre-programmed human intelligence. An impressive example of this is Cog’s ability to successfully play with a slinky. The robot’s arms are controlled by a set of self-adaptive oscillators. Each joint in the arm is actuated by an oscillator that uses local information to adjust the frequency and phase of the arm. None of the oscillators are connected, nor do any of them share information. When the arms are unconnected, they are uncoordinated, yet if a slinky is used to connect them, the oscillators adapt to the motion, and coordinated behaviour is achieved.
You can see how simple systems can model quite complex behaviour. The problem with doing this is that it takes a long time for a system to adjust to its environment, just like a real human. This can prove impractical. So, in the field of NLP, a top-down approach is required most of the time. An exception would perhaps be a computer program that can learn a new language dynamically.
With both approaches to common sense and general knowledge, there is one thing in common: the vast amount of knowledge required. A method of learning, storing, and retrieving all this information is also needed. A lot of this is automatically taken care of through the bottom-up approach. With the top-down approach, such luxuries are not present; everything has to be hard coded. For example, all the text written by the CYC project is useless unless a way can be created to conceptualise and link all the entities together. A computer cannot understand the entity #$ExistingStuffType as it stands. A program that parses the KB and turns it into a data structure that the computer can manipulate is necessary. There is no set way of doing this, but one promising field of Artificial Intelligence exists for this purpose: that of Conceptual Representation.
Conceptual Representation: Theory
Conceptual Representation (CR) relies on symbolic data types called conceptual structures. A conceptual structure (now referred to as a C-structure) must represent the meaning of a natural language idiom in an unequivocal way. Roger Schank, one of the pioneers of CR, says the following:

C-structures are created, stored, manipulated and interpreted within a CR program. In a typical CR program there are three parts: the parser and conceptualizer, another module that acts as an inference engine (or something similar), and finally a module to output the necessary information. The parser takes the natural language it is designed to parse and creates the C-structure primitives necessary. Then, the main program uses the concepts to either paraphrase the input text, draw inferences from the text provided or perform other similar functions. Finally, the output module will convert those structures back into a natural language; this does not necessarily have to be the same language as the input text.

Parsing
A look at parsing and its two approaches is necessary at this point. Parsers generally take information and convert it into a data structure that the computer can manipulate. With reference to Artificial Intelligence, a parser is generally a program (or module of the program) that takes a natural language sentence and converts it into a group of symbols. There are generally two methods of parsing, bottom-up and top-down. The bottom-up method takes each word separately, matches the word to its syntactic category, does this for the following word, and attempts to find grammar rules that can join these words together. This procedure continues until the whole sentence is parsed, and the computer has represented the sentence in a well-formed structure. The top-down method, on the other hand, starts with the various grammar rules and then tries to find instances of the rules within the sentence.

Here the bottom-up and top-down relationship is slightly different. Nevertheless, a parallel can be drawn if the grammar of a sentence can be seen as the base of language (like commonsense is the base of cognitive intelligence). Both approaches have problems largely due to the large amount of computational time both require. With the bottom-up approach, a lot of time is wasted looking for combinations of syntactic categories that do not exist in the sentence. The same problem appears in the top-down approach, although it is looking for instances of grammar that are not present that wastes the time.
So which method represents the human mind closer? Neither is perfect, because both methods simply involve syntactic analysis. Take these two examples:
Carrie’s box of strawberries was edible.
Carrie’s love of Kevin was intense.
If a program syntactically analyzed these two statements, it would come to the correct conclusion that the strawberries were edible, but the incorrect conclusion that Kevin was intense. Despite the syntactical structure of the two sentences being exactly the same, the meaning is different. Nevertheless, even if a syntactical approach is used, it can be used to point the computer to the correct meaning. As you will see with conceptual representation, if prior knowledge is known about the word ‘love’ then the computer can create the correct data structure to represent the sentence. This still does not answer the question of what type of parser the brain is closer to. In Schank’s words, ‘Does a human who is trying to understand look at what he has or does he look to fulfill his expectations?’ The answer seems to be both; a person not only handles the syntax of the sentence, but also does a certain degree of prediction. Take the following incomplete sentence:

John: I’ve played my guitar for over three hours and my fingers feel like ——
Looking at the syntax of the sentence it is easy to see that the next word will be a verb (‘dying’) or a noun (‘jelly’). It is easy, therefore, to predict the conceptual structure of a sentence. Problems arise when meaning has to be predicted too. We have the problem of context, for instance. The fingers could be very worn out; they could be very callused from playing, or they could feel hot from playing for so long.

Prediction is easier when there is more information to go on, for example, if John had said "and my poor fingers," from the context of the sentence, we could have gathered that the fingers do not feel so good. This kind of prediction is called conversational prediction. Another type of prediction is based upon the listener’s knowledge. If the listener knows John to be an avid guitar player, then he might expect a positive comment, but if he knows John’s parents force him to play the guitar, the listener could expect a negative remark.
All these factors are constituents when a human listens to someone talking. With all this taken into account, Schank sums up the answer the following way:
Types of Concepts
A concept can be any one of three different types: a nominal, an action, or a modifier. Nominals are concepts that do not need to be explained further. Schank refers to nominals as picture-producers, or PPs, because he says that nominals produce a picture relating to the concept in the mind of the hearer. An action is what a nominal can do, or more specifically, what an animate nominal can perform on some object. Therefore, a verb like ‘hit’ is classified as an action, but ‘like’ is not, since no action is performed. Finally, a modifier is a descriptor of a nominal or an action. In English, modifiers could be given the names adverbs and adjectives, yet since CR is supposedly independent of any language, the non-grammatical terms PA (picture aiders – for modifiers of nominals) and AA (action aiders – for modifiers of actions) are used by Schank.

These three categories can all relate to each other; such relationships are called dependencies. Dependencies are well described by Schank:
Therefore, nominals and actions are always governors, and the two types of modifiers are dependents. This does not mean, though, that a nominal or an action cannot also be a dependent. For instance, some actions are derived from other actions, take for example the imaginary structure CR-STEAL (conceptual type for stealing). Since stealing is really swapping of possession (with one party not wanting that change of possession), it can be derived from a simpler concept of possession change.

C-Diagrams
C-Diagrams are the graphical equivalent of the structures that would be created inside a computer, showing the different relationships between the concepts. C-Diagrams can get extremely complicated, with many different associations between the primitives; this essay will cover the basics. Below is an example of a C-Diagram:

The above represents the sentence "John hit his little dog." ‘John’ is a nominal since it does not need anything further to describe it. ‘Hit’ is an action, since it is something that an animate nominal does. The dependency is said to be a ‘two-way dependency’ since both ‘John’ and ‘hit’ are required for the conceptualisation; such a dependency is denoted by a ⇔ in the diagram. ‘Dog’ is also a governor in this conceptualisation, yet it does not make sense within this conceptualisation without a dependency to the action ‘hit.’ Such a dependency is called an ‘objective dependency’; this is denoted by an arrow (the ‘o’ above it denotes the objectivity). Now all the governors are in place, and we have created "John hit a dog" as a concept. We have to further this by adding the dependencies ‘little’ and ‘his’. ‘Little’ is called an attributive dependency, since it is a PA for ‘dog’. Attributive dependencies are denoted by a ↑ in the diagram. Finally, the ‘his’ has to be added. Since ‘his’ is just a pronoun, another way of expressing John, you would think it would be dependent on ‘dog.’ It is not this simple, though, since ‘his’ also implies possession of ‘dog’ by ‘John.’ This is called a prepositional dependency, and is denoted by a ⇑, followed by a label indicating the type of prepositional dependency. POSS-BY, for instance, denotes possession.
With all this in mind, let's look at a more complicated C-Diagram. Take the sentence, "I gave the man a book." Firstly, the ‘I’ and ‘give’ relationship is a two-way dependency, so a ⇔ is used. The ‘p’ above the arrow denotes that the event took place in the past. The ‘book’ is objectively dependent on ‘give’, so the arrow is used to denote this. Now, though, we have a change in possession; this is represented in the diagram by two arrows, with the arrow pointing toward the governor (the originator), and the arrow pointing away (the actor). The final diagram would look like:

You can see through conceptual representation, how a computer could create, store and manipulate symbols or concepts to represent sentences. How can we be sure that such methods are accurate models of the brain? We cannot be certain, but we can look at philosophy for theories that can support such computational, serial models of the brain.
Computational Models of the Brain
In order to explain the functioning of the brain, the field of philosophy has come up with many, many models of the brain. The ones that are of most interest to Artificial Intelligence are the computational models. There are quite a few computational models of the brain ranging from GOFAI approaches (like conceptual representation) to connectionism and parallelism, to the neuropsychological and neurocomputational theories and evolutionary computing theories (genetic algorithms).

Connectionism and Parallelism
Branches of AI started to look at other areas of neurobiology and computer science. The serial methods of computing were put to one side in favour of parallel processing. Connectionism attempts to model the brain by creating large networks of entities simulating the neurones of the brain; these are most commonly called ‘neural networks.’ Connectionism attempts to model lower-level functions of the brain, such as motion-detection. Parallelism, otherwise known as PDP (parallel distributed processing), is the branch of computational cognitive psychology that attempts to model the higher-level aspects of the brain, such as face recognition or language learning. This is yet another instance of how the bottom-up approach is limited to the fields of robotics.

Introduction to Mentalese
Despite this movement away from GOFAI by some researchers, the majority of scientists carried on with the classical approach. One such pioneer of the ‘computational model’ field was Jerry A. Fodor. He says the following:

This quote is very reminiscent of conceptual representation and its methodology. Fodor argues that since sciences all have laws governing their phenomena, psychology and the workings of the brain are not an exception.

Language of Thought and Mentalese
Fodor was an important figure in the idea of a ‘language of thought.’ Fodor theorises that the brain ‘parses’ information and transforms it into a mental language of its own that it can subsequently manipulate and change more efficiently. When a person needs to ‘speak’, the brain converts the language it uses into a natural language. Fodor thought that this mental language extended beyond natural language, to the senses too:

Fodor called this ‘language of thought’ Mentalese. The theory of Mentalese says that thought is based upon language, that we think in a language not like English, French or Japanese but in the brain’s own universal language. Before studying the idea of Mentalese, the concepts ‘syntax’ and ‘semantics’ should be fully explained.

Syntax and Semantics
Syntactic features of a language are the words and sentences that relate to the form rather than to the meaning of the sentence. Syntax tells us how a sentence in a particular language is formed, how to form grammatically correct sentences. For example, in English the sentence, ‘Cliff went to the supermarket’ is syntactically correct whereas, ‘To supermarket Cliff the went’ makes no sense whatsoever. Semantics are the features of words that relate to the overall meaning of the sentence/paragraph. The semantics of a word also define the relations of the word to the real world, and also to its contextual place in the sentence.

How are syntax and semantics related to symbols and representations? The syntax of symbols is a huge hierarchy built on simple base symbols that stand for simple representations. From these simple symbols, more complicated symbols are derived. The semantics of symbols is easy to explain: symbols are inherently semantic. Symbols represent objects and relate various objects to each other; therefore, when you hear or read a sentence it is broken up into symbols that are all related to one another.
With these terms defined, we can see how Mentalese a) can be restated as a computational theory and b) supports the idea of conceptual representation. Mentalese is computational because ‘it invokes representations which are manipulated or processed according to formal rules.’ The syntax of Mentalese is just like the hierarchy of CR structures — with different, more complex structures derived from base structures. The theory even has similarities to the architecture of an entire CR program. The brain receives sentences and turns them into Mentalese, just like a CR program would parse a stream of text, and conceptualize it into structures within the computer that do not resemble (and are independent of) the language that the text was in. When the brain needs to retrieve this information, it converts the Mentalese back into the natural language needed, just like a CR program takes the concepts and changes them back into the necessary language.
Earlier on, the idea of an implementing mechanism was introduced. How can such a mechanism be viewed within the Mentalese paradigm? The basic idea of an implementing mechanism (IM) is that lower-level mechanisms implement higher-level ones. In more generic terms, an IM goes further than the "F causes G" stage, to the "How does F cause G?" stage. Fodor says that an IM specifies ‘a mechanism that the instantiation of F is sufficient to set in motion, the operation of which reliably produces some state of affairs that is sufficient for the instantiation of G.’
Which Model does the Brain More Closely Follow?
After looking at all the examples of the different approaches, we are presented with the final question of which one best represents the human brain? In the field of NLP, it seems that the GOFAI, top-down approach is by far the best approach, whereas, the fields of robotics and image recognition are benefited by the connectionist, bottom-up approach. Despite this seemingly concrete conclusion, we are still plagued with a few problems.

Consistency
How many different types of cells are our brains composed of? Essentially, it just uses one type — the neurone. How can the brain both exhibit serial and parallel behaviours with only one type of cell? The obvious answer to this is that it does not. This is a fault in the overall integrity of both approaches.

For example, we have parallel, bottom-up neural networks that can successfully detect pictures, but cannot figure out the algorithm for an XOR bit-modifier. We have serial, top-down CR programs that can read newspaper articles, make inferences, and even learn from these inferences — yet, such programs often make bogus discoveries, due to the lack of background information and general knowledge. We have robots that can play the guitar, walk up stairs, aid in bomb disposal, yet nothing gets close to a program with overall intellect equal to that of a human.
Integration of the Fields
What will happen when robotics reaches the stage where it is on the brink of real-time communication, voice-recognition and synthesis? The fields of NLP and robotics will have to integrate, and a common method will have to be devised. The question of what method to follow would simply arise again. Since the current breed of robots use parallel processing, future robots will no doubt use similar data systems to the ones today, and would not easily adapt to a serial, top-down approach to language. Even if a successful bottom-up approach is found for robots, will it be practical?

Will humans tolerate ‘infant’ robots that need to be ‘looked after’ for the first 3 or 4 years of their life while they learn the language of their ‘parents’? For robots that are to know a certain language (for instance, English) from the minute they are turned on, a top-down approach is bound to be necessary. No doubt a compromise system of language processing would have to be developed; perhaps the object-orientated system of ParseTalk could be promising, since OOP offers the advantages of serial programming with some of the advantages of parallel processing too.
Top-Down Approach
One of the best ways to support the top-down approach and its similarities to the brain is to look at just how similar Mentalese and conceptual representation are. Mentalese assumes that there is a language of the brain completely independent of the natural language (or languages) of the individual. CR assumes the exact same thing, creating data structures with more abstract names that are independent of the language, but rely on the parser for their input. This can explain the ease with which programs utilising CR convert between two languages.

Both the ideas of Mentalese and CR must have been formulated from the same direction and perspective of the brain. Both assume a certain independence that the brain has over the rest of the language/communication areas of the brain. Both assume that computational manipulation is performed only after the language has been transformed into this language of the brain. It is about this area that Mentalese and conceptual representation diverge — conceptual representation is limited to language (or has not yet been applied to other areas), whereas Fodor maintains that this mental language applies to cognitive and perceptive areas of the brain too.
A fault in the Mentalese theory is that Fodor says Mentalese is universal. It seems hard to imagine that as we all grow up, learning how to see, hear, speak, and all the other activities we learn as a new-born, we all adopt the same language. A possible counter to this is that the language is already there — but this creates a lot of new complications. Nevertheless, this fault is not detrimental to its applications in computer processing, since nearly any base for an NLP program will require some prior programming (analogous to the pre-made rules of Mentalese).
The COG Team also outlined three basic (but very important) faults with a GOFAI approach to AI:
The team backs up their assertions with various results from psychological tests that have been performed. Such monolithic representations and controlling units underlie the theory of Mentalese and conceptual representation.

Bottom-Up Approach
The main advantage of the bottom-up approach is its (relative) simplicity and its flexibility. Using structures such as neural networks, programs have been created that can do things that would be impossible to do with conventional, serial approaches. For example, Hopfield networks can relatively quickly recognise partial, noisy, shrunken, or obscured versions of images that they have ‘seen’ before. Another program powered by a neural network has been trained to accept present tense English verbs and convert them to past tense.
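As a rough illustration of how a Hopfield network recalls something it has ‘seen’ from a noisy copy, here is a minimal sketch in Python. It assumes the textbook formulation (units taking the values +1 and -1, Hebbian weights, one unit updated at a time); the function names and the tiny eight-unit ‘images’ are made up for the example.

```python
def train_hopfield(patterns):
    """Build the weight matrix with the Hebbian rule (no self-connections)."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for pattern in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += pattern[i] * pattern[j] / n
    return weights


def recall(weights, state, sweeps=5):
    """Update one unit at a time until the state settles (fixed sweep count)."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            total = sum(weights[i][j] * state[j] for j in range(len(state)))
            state[i] = 1 if total >= 0 else -1
    return state


# Two stored 'images' of eight pixels each (+1 = on, -1 = off).
stored = [
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
]
weights = train_hopfield(stored)

# A copy of the first image with its last pixel flipped.
noisy = [1, -1, 1, -1, 1, -1, 1, 1]
restored = recall(weights, noisy)
print(restored == stored[0])
```

Nothing here is searched or matched serially; the stored images are spread across the weights, and the noisy input simply settles into the nearest one.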

The strongest argument for the bottom-up approach is that the learning processes are the same that any human undergoes when they grow up. If you present the robot or program to the outside world it will learn things, and adapt itself to them. Like the COG Team asserts, the key to human intelligence is the four traits they spelled out: developmental organisation, social interaction, embodiment, and integration.
The bottom-up approach also models human behaviour (emotions inclusive) due to the chaotic nature of parallel processing and neural networking:
The main downfall of the bottom-up approach is its practicality. Building a robot’s knowledge base from the ground up every time one is made might be reasonable for a research robot, but for (once created) commercial robots or even very intelligent programs, such a procedure would be out of the question.

Conclusion
In conclusion, a definite approach to the brain has not yet been developed, but the two approaches (top-down and bottom-up) describe different aspects of the brain. The top-down approach seems like it can explain how humans use their knowledge in conversation. What the top-down approach does not solve however, is how we get that initial knowledge to begin with — the bottom-up approach does. Through ideas such as neural networking and parallel processing, we can see how the brain could possibly take sensory information and convert it into data that it can remember and store. Nevertheless, these systems have so far only demonstrated an ability to learn, and not sufficient ability to manipulate and change data in the way that the brain and programs utilising a top-down methodology can.

These attributes of the approaches lead to their dispersion within the two fields of AI. Natural Language Processing took up the top-down approach, since that had all the necessary data manipulation required to do the advanced analysis of languages. Yet, the large amount of storage space required for a top-down program and the lack of a good learning mechanism made the top-down approach too cumbersome for robotics. They adopted the bottom-up approach, which proved to be good for things such as face recognition, motor control, sensory analysis and other such ‘primitive’ human attributes. Unfortunately, any degree of higher-level complexity is very hard to achieve with a bottom-up approach.
Now we have one approach modelling the lower-level aspects of the human brain, and another modelling the higher levels, so which models the brain best overall? Top-down approaches have been in development for as long as AI has been around, but serious approaches to the bottom-up methodologies have only really started in the last twenty years or so. Since bottom-up approaches draw on what we know from neurobiology and psychology, rather than on philosophy as GOFAI scientists do, there may be a lot more we have yet to discover. These discoveries, though, may be many years in the future. In the meantime, a compromise should be reached between the two levels to attempt to model the brain consistently given the current technology. The object-orientated approach might be one answer; research into neural networks trained to create and modify data structures similar to those used in conceptual representation might be another.
Artificial Intelligence is the science of the unknown — trying to emulate something we cannot understand. GOFAI scientists have always hoped that AI might one day explain the brain, instead of the other way around — connectionist scientists do not, and perhaps this is the key to creating flexible code that can react given any environment, just like the human brain — real Artificial Intelligence.
Bibliography.
Altman, Ira. The Concept of Intelligence: A Philosophical Analysis. Maryland: 1997.
Brooks, R. A., Breazeal (Ferrell), C., Irie, R., Kemp, C. C., Marjanovic, M., Scassellati, B. & Williamson, M. M. (1998), Alternative Essences of Intelligence, in ‘Proceedings of the American Association of Artificial Intelligence (AAAI-98)’.
Brooks, R. A., Breazeal (Ferrell), C., Irie, R., Kemp, C. C., Marjanovic, M., Scassellati, B. & Williamson, M. M. (1998), The Cog Project: Building a Humanoid Robot.
Churchland, Patricia and Sejnowski, Terrence. The Computational Brain. London: 1996.
Churchland, Paul. The Engine of Reason, the Seat of the Soul: A Philosophical Journey into the Brain. London: 1996.
Crane, Tim. The Mechanical Mind: A Philosophical Introduction to Minds, Machines and Mental Representation. London: 1995.
Fodor, Jerry A. Elm and the Expert: An Introduction to Mentalese and Its Semantics. Cambridge: 1994.
Hahn, Udo; Schacht, Susanne; Bröker, Norbert. Concurrent, Object-Orientated Natural Language Parsing: The ParseTalk Model. Arbeitsgruppe Linguistische Informatik/Computerlinguistik. Freiburg: 1995.
Lenat, Douglas B. Artificial Intelligence. Scientific American, September 1995.
Lenat, Douglas B. The Dimensions of Context-Space. Cycorp Corporation, 1998.
Penrose, Roger. The Emperor’s New Mind: Concerning Computers, Minds and The Laws of Physics. Oxford: 1989.
Schank, Roger. The Cognitive Computer: On Language, Learning and Artificial Intelligence. Reading: 1984.
Schank, Roger and Colby, Kenneth. Computer Models of Thought and Language. San Francisco: 1973.
Watson, Mark. AI Agents in Virtual Reality Worlds. New York: 1996.

• Intention
How Neural Networks Work
Neural networks work by using a system of receiving inputs, sending outputs, and performing self-corrections based on the difference between the output and expected output, also known as the cost.
Neural networks are composed of neurons, which in turn compose layers, or collections of neurons. For example, there is an input layer and an output layer. In between these two layers, there are layers known as hidden layers. These layers allow for more complex and nuanced behavior by the neural network. A neural network can be thought of as a multi-tier cake: the first tier of the cake represents the input, the tiers in between, or lack thereof, represent the hidden layers, and the last tier represents the output.
The two mechanisms of learning are Forward Propagation and Backward Propagation. Forward Propagation uses linear algebra for calculating what the activation of each neuron of the next layer should be, and then pushing, or propagating, those values forward. Backward Propagation uses calculus to determine what values in the network need to be changed in order to bring the output closer to the expected output.
Forward Propagation

As can be seen from the gif above, each layer is composed of multiple neurons, and each neuron is connected to every neuron of the following and previous layers, save for the input and output layers, since they are not surrounded by layers on both sides.

To put it simply, a neural network represents a collection of activations, weights, and biases. They can be defined as:
- Activation: A value representing how strongly a neuron is firing.
- Weight: How strong the connection is between two neurons. Affects how much of the activation is propagated onto the next layer.
- Bias: A minimum threshold for whether or not the current neuron's activation and weight should affect the next neuron's activation.

Each neuron has an activation and a bias. Every connection to every neuron is represented as a weight. The activations, weights, biases, and connections can be represented using matrices. Activations are calculated using this formula:

After the inner portion of the function has been computed, the resulting matrix gets pumped into a special function known as the Sigmoid Function. The sigmoid is defined as:

The sigmoid function is handy since its output always lies between zero and one. This process is repeated layer by layer until the activations of the output neurons have been calculated.
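The forward pass described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the usual form in which each neuron's activation is sigmoid(weighted sum of the previous activations + bias), with the logistic sigmoid 1 / (1 + e^(-z)). The function names, the list-of-lists layout, and the example weights are made up for illustration and are not taken from AI Chan.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward_layer(weights, biases, activations):
    """Compute one layer's activations from the previous layer's.

    weights[j][k] is the weight from neuron k of the previous layer
    to neuron j of this layer; biases[j] belongs to neuron j.
    """
    return [
        sigmoid(sum(w * a for w, a in zip(row, activations)) + bias)
        for row, bias in zip(weights, biases)
    ]


def forward(network, inputs):
    """Propagate the inputs through every layer and return the output layer."""
    activations = inputs
    for weights, biases in network:
        activations = forward_layer(weights, biases, activations)
    return activations


# A 2-input network with one hidden layer of 2 neurons and 1 output neuron.
network = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                     # output layer
]
output = forward(network, [1.0, 0.0])
print(output)
```

Plain lists are used here to keep the sketch dependency-free; with a matrix library each layer becomes a single matrix-vector product followed by an element-wise sigmoid.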
Backward Propagation
The process of a neural network performing self-correction is referred to as Backward Propagation or backprop. This article will not go into detail about backprop since it can be a confusing topic. To summarize, the algorithm uses a technique from calculus known as Gradient Descent. Picture the error as a surface over a space that has one dimension for every weight and bias in the network; the direction of change that reduces the error the fastest must be found. The goal of using gradient descent is to modify the weights and biases such that the error in the network approaches zero.
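Gradient descent on its own is easier to see on a toy problem than on a whole network. The sketch below is only an illustration: the error surface, its minimum at w = 3 and b = -1, the learning rate, and the step count are all made-up values, and a real network would have many more parameters.

```python
def cost(w, b):
    """Toy error surface with its minimum at w = 3, b = -1 (made up)."""
    return (w - 3.0) ** 2 + (b + 1.0) ** 2


def gradient(w, b):
    """Partial derivatives of the toy cost with respect to w and b."""
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)


def gradient_descent(w, b, rate=0.1, steps=100):
    """Repeatedly step against the gradient, i.e. downhill on the surface."""
    for _ in range(steps):
        dw, db = gradient(w, b)
        w -= rate * dw
        b -= rate * db
    return w, b


w, b = gradient_descent(0.0, 0.0)
print(w, b, cost(w, b))
```

Each step moves the parameters a little in the direction that lowers the cost; backprop is the bookkeeping that supplies those partial derivatives for every weight and bias in a real network.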

Furthermore, you can find the cost, or error, of a network using this formula:

Unlike forward propagation, which is done from input to output, backward propagation goes from output to input. For every activation, find the error in that neuron, how much of a role it played in the error of the output, and adjust accordingly. This technique uses concepts such as the chain rule, partial derivatives, and multi-variate calculus; therefore, it's a good idea to brush up on one's calculus skills.
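For a single sigmoid neuron, the chain rule mentioned above fits in a few lines. This is a minimal sketch that assumes the quadratic cost C = 1/2 (a - y)^2, where a is the neuron's output and y the expected output; that choice of cost, the function names, and the example numbers are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, bias, inputs):
    """Activation of one sigmoid neuron."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def cost(output, expected):
    """Quadratic cost for a single output (an assumed choice of cost)."""
    return 0.5 * (output - expected) ** 2


def gradients(weights, bias, inputs, expected):
    """Chain rule: dC/dw = dC/da * da/dz * dz/dw, and likewise for the bias."""
    a = neuron_output(weights, bias, inputs)
    # dC/da = (a - expected); da/dz = a * (1 - a) for the sigmoid.
    delta = (a - expected) * a * (1.0 - a)
    # dz/dw_k = inputs[k]; dz/db = 1.
    return [delta * x for x in inputs], delta


weights, bias = [0.4, -0.6], 0.1
inputs, expected = [1.0, 0.5], 1.0
weight_grads, bias_grad = gradients(weights, bias, inputs, expected)
print(weight_grads, bias_grad)
```

The value `delta` is the neuron's share of the error; in a multi-layer network the same quantity is passed backwards through the weights to compute the error of the previous layer.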
High Level Algorithm
1. Initialize the weight and bias matrices for all layers to random decimal numbers between -1 and 1.
2. Propagate the input through the network.
3. Compare the output with the expected output.
4. Propagate the correction backwards through the network.
5. Repeat this for N training samples.
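Put together, the algorithm above might look like the following Python sketch. It is not the code from AI Chan; it assumes sigmoid neurons, the quadratic cost, per-sample updates, a learning rate of 0.5, and a tiny XOR data set, all of which are choices made for illustration. Whether the network fully learns XOR depends on the random starting weights.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    # Random weights and biases between -1 and 1.
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases


def forward(network, inputs):
    # Returns the activations of every layer, starting with the inputs.
    layers = [inputs]
    for weights, biases in network:
        prev = layers[-1]
        layers.append([sigmoid(sum(w * a for w, a in zip(row, prev)) + b)
                       for row, b in zip(weights, biases)])
    return layers


def train_sample(network, inputs, expected, rate):
    layers = forward(network, inputs)
    # Compare output with expected output: error term of each output neuron.
    deltas = [(a - y) * a * (1 - a) for a, y in zip(layers[-1], expected)]
    # Walk from the output layer back towards the input layer.
    for index in range(len(network) - 1, -1, -1):
        weights, biases = network[index]
        prev = layers[index]
        # Error of the previous layer, computed before the weights change
        # (for the input layer the result is simply unused).
        prev_deltas = [
            sum(weights[j][k] * deltas[j] for j in range(len(deltas)))
            * prev[k] * (1 - prev[k])
            for k in range(len(prev))
        ]
        for j, delta in enumerate(deltas):
            for k in range(len(prev)):
                weights[j][k] -= rate * delta * prev[k]
            biases[j] -= rate * delta
        deltas = prev_deltas


def total_cost(network, samples):
    return sum(
        0.5 * sum((a - y) ** 2 for a, y in zip(forward(network, x)[-1], t))
        for x, t in samples
    )


rng = random.Random(0)  # fixed seed so runs are repeatable
network = [init_layer(2, 3, rng), init_layer(3, 1, rng)]
samples = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]

cost_before = total_cost(network, samples)
for _ in range(5000):  # repeat over the training samples
    for inputs, expected in samples:
        train_sample(network, inputs, expected, rate=0.5)
cost_after = total_cost(network, samples)
print(cost_before, cost_after)
```

Comparing the cost before and after training is a quick sanity check that the corrections are being propagated in the right direction.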
Source Code
If you're interested in looking into the guts of a neural network, check out AI Chan! It's a simple-to-integrate library for machine learning I wrote in C++. Feel free to learn from it and use it in your own projects.
https://bitbucket.org/mrsaturnsan/aichan/
Resources
http://neuralnetworksanddeeplearning.com/

• Hey guys,
I'm starting to get really interested in A.I. and especially machine learning. However, I can't find a good place to start. I have looked on the internet for tutorials but they are all useless to beginners because they all end up assuming that you have a lot of knowledge about machine learning already and that you know what you are doing. But for me (an absolute beginner) it's hard to understand what they are saying in the tutorials.
I have tried to make my own A.I. by just playing around with some maths and the basics of machine learning that I already know, like neurons and weights and biases. For the most part my little mini projects work but they are not proper A.I.
Could anyone please recommend some tutorials for beginners that I could watch? My main programming language is Python, however I am starting to learn Java and C#. I have already looked at bigNeuralNetwork's tutorial; at the beginning it was great and I understood everything, but then halfway through I didn't know what was going on.
• By JimboC
I'm trying to figure out the best way to implement a neural network with a varying number of inputs. Because of an NDA, I can't post my specific issue or include my data, but I've come up with a scenario that is pretty close to my dilemma, though it is oversimplified quite a bit. I'm just looking for a high-level suggestion of a possible way forward so hopefully it will be enough.
Say I'm tasked with writing a neural network to automate evaluating employees in a manufacturing job. There's a set number of inputs for each employee such as: days called out, times late, hours of OT worked and so on. In addition to this data there's also the information about the items produced, which will be the same for each item: time taken to complete the item, amount of extra materials used to make the item, quality of the item, number of tweaks needed before the item can be shipped and stuff like that. The number of data inputs for each item is always the same. The issue is that each employee produces a different number of items and I'm not sure how to account for that in the neural network.
So what I'm looking at for input data:
- employee data with an equal number of inputs
- item data with an equal number of inputs
- varying amounts of items which need to be associated with each employee

The output I need is a score that takes into account all the employee data and all the item data for the items that employee produced.

I was looking at averaging the item 'scores' for all the items an employee makes, but this didn't work since each item has to be part of the employee's data. This is because each item could have certain criteria that need to weigh heavier in the evaluation (say an item is made with all criteria being excellent or all criteria being exceptionally poor), so each item needs to be considered individually alongside the employee's other numbers.
Next I thought of creating a static number of item slots for all employees and setting all the inputs without an associated item to zero. This didn't work since the higher-numbered item input weights got rated down because they weren't used on most employees, and even items that should have swayed the output wound up barely registering for the employee.
I'm new to neural networking but my neural network does seem to be working, it's just not giving me the relevant output I was hoping for without being able to include the item data. I started looking into recurrent neural networks, but the more I read about them the less they seem to be something that would help in my situation. Is this something that wouldn't be a good application for a neural network because of the different amounts of item data? Is there some other method of implementing neural networks that would be better suited to my data?
Edit: my current implementation is actually 2 neural networks. 1 for the data items to produce an item score and 1 for the employee data using output from the first as an input. To back propagate I'm just passing the MSE from the Employee neural network to the output of the Item neural network and then back propagating from there.
Edited to make title clearer.