We assume a set of two target classes labeled 0 and 1, with a correct label y ∈ {0, 1}. When using the logistic loss, it is assumed that the output layer is transformed using the sigmoid function. Because the scores ŷ have been transformed using the softmax function to be non-negative and sum to one, increasing the mass assigned to the correct class means decreasing the mass assigned to all the other classes.
Ranking losses In some settings, we are not given supervision in terms of labels, but rather as pairs of correct and incorrect items x and x′, and our goal is to score correct items above incorrect ones. Such training situations arise when we have only positive examples, and generate negative examples by corrupting a positive example. A common variation is to use the log version of the ranking loss: L_ranking(log)(x, x′) = log(1 + exp(−(NN(x) − NN(x′)))). Similarly, Van de Cruys used the ranking loss in a selectional-preferences task, in which the network was trained to rank correct verb-object pairs above incorrect, automatically derived ones, as did Weston et al. in related work.
An example of using the ranking log loss can be found in Gao et al. In order to drive the loss down, the learner can identify features (letter bigrams) in x_o that occur in only a few other documents, and give them very strong weights toward the incorrect French class.
We then look for parameter values that have both a low loss and low complexity. In practice, the regularizers R equate complexity with large weights, and work to keep the parameter values low. In particular, the regularizers R measure the norms of the parameter matrices, and drive the learner toward solutions with low norms.
Common choices for R are the L2 norm, the L1 norm, and the elastic-net. L2 regularization In L2 regularization, R takes the form of the squared L2 norm of the parameters, trying to keep the sum of the squares of the parameter values low: R_L2(W) = ‖W‖²₂ = Σ_{i,j} (W[i,j])². L1 regularization In L1 regularization, R takes the form of the L1 norm of the parameters, trying to keep the sum of the absolute values of the parameters low: R_L1(W) = ‖W‖₁ = Σ_{i,j} |W[i,j]|.
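The following is a minimal NumPy sketch of the two regularizers; the weight matrix, the regularization strength lam, and the stand-in data_loss value are illustrative assumptions rather than part of the book's formulation.

import numpy as np

def r_l2(W):
    return np.sum(W ** 2)        # sum of squared parameter values, ||W||_2^2

def r_l1(W):
    return np.sum(np.abs(W))     # sum of absolute parameter values, ||W||_1

W = np.array([[0.5, -2.0], [0.0, 3.0]])
lam = 0.01                       # regularization strength (illustrative)
data_loss = 1.3                  # stand-in for the data term of the objective
objective_l2 = data_loss + lam * r_l2(W)
objective_l1 = data_loss + lam * r_l1(W)

The elastic-net combines the two, adding both an L1 term and an L2 term to the objective.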
The L1 regularizer thus encourages sparse solutions: models with many parameters whose value is exactly zero. A common solution is to use a gradient-based method. For the sake of example, assume that solving for the optimum analytically (by setting the derivative to zero and solving) cannot be done; indeed, it is challenging to use this approach for functions of multiple variables. An alternative approach is a numeric one: compute the derivative u = f′(x_i) of the function at a chosen point x_i. If u = 0, then x_i is an optimum point; otherwise, the sign of u tells us in which direction the function increases at x_i. Gradient-based optimization simply generalizes this idea to functions with multiple variables. The gradient of a function with k variables is the collection of k partial derivatives, one according to each of the variables. Moving the inputs in the direction of the gradient will increase the value of the function, while moving them in the opposite direction will decrease it.
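As an aside, the derivative at a point can itself be estimated numerically by a finite difference. The sketch below illustrates this for a partial derivative; the function f and the step size eps are illustrative choices, not taken from the text.

def f(x1, x2):
    return x1 ** 2 + 3 * x2

def numeric_partial(f, args, i, eps=1e-6):
    # finite-difference estimate of the partial derivative of f
    # with respect to its i-th argument, at the point 'args'
    bumped = list(args)
    bumped[i] += eps
    return (f(*bumped) - f(*args)) / eps

print(numeric_partial(f, (2.0, 1.0), 0))   # ~4.0, the derivative of x1**2 at x1 = 2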
When optimizing the loss L(Θ), we repeatedly compute the gradient of the loss with respect to the parameters Θ and move the parameters a small step in the opposite direction of the gradient. Convexity In gradient-based optimization, it is common to distinguish between convex (or concave) functions and non-convex (non-concave) functions. A convex function is a function whose second-derivative is always non-negative.
As a consequence, convex functions have a single minimum point. Similarly, concave functions are functions whose second-derivatives are always negative or zero, and as a consequence have a single maximum point. Convex (concave) functions have the property that they are easy to minimize (maximize) using gradient-based optimization: simply follow the gradient until an extremum point is reached, and once it is reached we know we have obtained the global extremum point.
In contrast, for functions that are neither convex nor concave, a gradient-based optimization procedure may converge to a local extremum point, missing the global optimum. Stochastic gradient descent (SGD) is a general optimization algorithm.
It takes as input a function f(x; Θ) parameterized by Θ, a training set of inputs and desired outputs, and a loss function L; it repeatedly samples a training example, computes the loss of the prediction on that example, computes the gradient of the loss with respect to Θ, and moves Θ a small step in the opposite direction of the gradient. Note that the error calculated at each such step is based on a single training example, and is thus just a rough estimate of the corpus-wide loss L that we are aiming to minimize.
A common way of reducing this noise is to estimate the error and the gradients based on a sample of m examples; this is referred to as minibatch SGD. Here, the algorithm estimates the gradient of the corpus loss based on the minibatch rather than a single example. Higher values of m provide better estimates of the corpus-wide gradients, while smaller values allow more updates and in turn faster convergence.
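The following is a minimal minibatch-SGD sketch for a linear model with a squared loss. The model, loss, learning rate, and batch size are illustrative assumptions and not the book's algorithm listing; the intent is only to make the sampling-and-update loop concrete.

import numpy as np

def minibatch_sgd(X, y, dim, lr=0.1, batch_size=16, epochs=10):
    w = np.zeros(dim)
    n = len(y)
    for _ in range(epochs):
        order = np.random.permutation(n)            # shuffle the training set
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]   # the current minibatch
            Xb, yb = X[idx], y[idx]
            preds = Xb @ w
            # gradient of the mean squared loss over the minibatch
            grad = 2 * Xb.T @ (preds - yb) / len(idx)
            w -= lr * grad                          # step against the gradient
    return w

# toy usage: recover weights close to [1, -2] from noisy observations
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -2.0]) + 0.01 * rng.normal(size=200)
print(minibatch_sgd(X, y, dim=2))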
For modest sizes of m, some computing architectures (e.g., GPUs) allow an efficient parallel implementation of the minibatch gradient computation. With a properly decreasing learning rate, SGD is guaranteed to converge to a global optimum if the function is convex, which is the case for linear and log-linear models coupled with the loss functions and regularizers discussed in this chapter.
However, it can also be used to optimize non-convex functions such as multi-layer neural networks, albeit without the convergence guarantees. To perform the updates, we need to compute the gradients of the loss with respect to the values W and b. Here, we stick to high-school level differentiation techniques. Adaptive learning rate algorithms, including AdaGrad [Duchi et al.], adapt the learning rate separately for each parameter based on the history of its gradients. For details of these algorithms, see the original papers or Bengio et al.
In the XOR example the transformed data has the same dimensions as the original one, but often in order to make the data linearly separable one needs to map it to a space with a much higher dimension. Working in very high dimensional spaces can become computationally prohibitive, and the ingenuity in kernel methods is the use of the kernel trick [Aizerman et al.], which allows one to work in the transformed space without ever computing the transformed representation explicitly. Having established the motivation, we now turn to describe multi-layer neural networks in more detail. In the brain-inspired metaphor, a neuron is a computational unit that has scalar inputs and outputs.
Each input has an associated weight; the neuron multiplies each input by its weight, sums the results, applies a nonlinear function to the sum, and passes the result to its output. The neurons are connected to each other, forming a network: the output of one neuron may feed into the inputs of one or more other neurons. Such networks were shown to be very capable computational devices.
If the weights are set correctly, a neural network with enough neurons and a nonlinear activation function can approximate a very wide range of mathematical functions (we will be more precise about this later).
A typical feed-forward neural network may be drawn as a layered graph of such neurons. While the brain metaphor is sexy and intriguing, it is also distracting and cumbersome to manipulate mathematically. We therefore switch back to using more concise mathematical notation. As will soon become apparent, a feed-forward network of this kind is simply a stack of linear models separated by nonlinear functions. Taking this view, a single neuron is simply a linear model: NN_Perceptron(x) = xW + b.
Breaking it down, xW^1 + b^1 is a linear transformation of the input x from d_in dimensions to d_1 dimensions. Without the nonlinearity in g, the neural network can only represent linear transformations of the input. We can add additional linear transformations and nonlinearities, resulting in an MLP with two hidden layers.
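The following NumPy sketch shows forward passes for an MLP with one hidden layer (MLP1) and with two hidden layers (MLP2). The dimensions, the random initialization, and the use of tanh as the nonlinearity g are illustrative assumptions.

import numpy as np

def g(z):
    return np.tanh(z)            # the nonlinear activation

d_in, d1, d2, d_out = 4, 8, 6, 3
rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.normal(size=(d_in, d1)), np.zeros(d1)
U1     = 0.1 * rng.normal(size=(d1, d_out))
W2, b2 = 0.1 * rng.normal(size=(d1, d2)), np.zeros(d2)
U2     = 0.1 * rng.normal(size=(d2, d_out))

def mlp1(x):
    # one hidden layer: g(x W1 + b1), followed by a linear output layer
    return g(x @ W1 + b1) @ U1

def mlp2(x):
    # two hidden layers: g(g(x W1 + b1) W2 + b2), then a linear output layer
    h1 = g(x @ W1 + b1)
    h2 = g(h1 @ W2 + b2)
    return h2 @ U2

x = rng.normal(size=d_in)
print(mlp1(x), mlp2(x))          # two d_out-dimensional score vectors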
A bias term can be added to a layer by adding to it an additional neuron that does not have any incoming connections, whose value is always 1. Each hidden layer is followed by a nonlinear activation. Other types of architectures exist. Such layers have uses also in language processing, and will be discussed in later chapters. Networks with several hidden layers are said to be deep networks, hence the name deep learning.
When describing a neural network, one should specify the dimensions of the layers and the input. A layer will expect a d_in dimensional vector as its input, and transform it into a d_out dimensional vector.
For a fully connected layer l(x) = xW + b with input dimensionality d_in and output dimensionality d_out, the dimensions of x, W, and b are 1×d_in, d_in×d_out, and 1×d_out, respectively. Like the case with linear models, the output of a neural network is a d_out dimensional vector. Similarly, if the output vector entries are positive and sum to one, the output can be interpreted as a distribution over class assignments (such output normalization is typically achieved by applying a softmax transformation on the output layer, as discussed in Chapter 2).
Still, the gradient-based optimization methods discussed in Chapter 2 can be applied in this non-convex setting as well. Training neural networks is discussed in detail in Chapter 5. Finally, the theoretical result does not state how large the hidden layer should be. Indeed, Telgarsky shows that there exist neural networks with many layers of bounded size that cannot be approximated by networks with fewer layers unless these layers are exponentially large. In practice, we train neural networks on relatively small amounts of data using local search methods such as variants of stochastic gradient descent, and use hidden layers of relatively modest sizes (up to several thousand dimensions).
In many cases, however, MLP1 does indeed provide strong results. For further discussion on the representation power of feed-forward neural networks, see Bengio et al. Some NLP researchers also experimented with other forms of nonlinearities such as cube and tanh-cube. Despite its simplicity, the ReLU activation performs well for many tasks, especially when combined with the dropout regularization technique (discussed below).
Model regularization is just as important in deep neural networks as it is in linear models, and perhaps even more so. Work by Wager et al. establishes a connection between the dropout technique and L2 regularization. Another view links dropout to model averaging and ensemble techniques [Srivastava et al.]. In some settings we want to compare how similar two vectors are. For example, vectors v1 ∈ R^d and v2 ∈ R^d may be the output layers of two MLPs, and we would like to train the network to produce similar vectors for some training examples, and dissimilar vectors for others. In what follows we describe common functions that take two vectors u ∈ R^d and v ∈ R^d, and return a scalar.
Dot Product A very common option is to use the dot product: sim_dot(u, v) = u·v = Σ_{i=1}^{d} u[i] v[i]. Similarly, for a trainable distance function one can use a bilinear form over the difference vector, such as dist(u, v) = (u − v) M (u − v), where M is a trainable parameter matrix. If the symbolic feature is encoded as a one-hot vector x, the lookup operation can be implemented as the multiplication xE, where E is the embedding matrix.
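The sketch below illustrates these operations in NumPy: the dot-product similarity, a trainable bilinear similarity u M v, and the equivalence between a one-hot multiplication and an embedding-table lookup. The matrix M, the embedding table E, and all dimensions are illustrative assumptions.

import numpy as np

d, vocab_size, emb_dim = 4, 10, 4
rng = np.random.default_rng(0)
u, v = rng.normal(size=d), rng.normal(size=d)

sim_dot = u @ v                              # dot-product similarity
M = rng.normal(size=(d, d))                  # trainable parameter matrix
sim_bilinear = u @ M @ v                     # trainable bilinear similarity

E = rng.normal(size=(vocab_size, emb_dim))   # embedding matrix
word_id = 3
x = np.zeros(vocab_size)
x[word_id] = 1.0                             # one-hot encoding of the word
assert np.allclose(x @ E, E[word_id])        # xE equals selecting row word_id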
Embeddings are discussed in more depth in Chapter 8 when discussing dense representations of categorical features, and in Chapter 10 when discussing pre-trained word representations. Still, gradient-based methods produce good results in practice.
Gradient calculation is central to the approach. However, for complex networks this process can be laborious and error-prone. For most purposes, it is preferable to use automatic tools for gradient computation [Bengio]. A computation graph is a representation of an arbitrary mathematical computation as a graph. Consider, for example, a graph for the computation of a simple expression such as (a·b + 1)·(a·b + 2). Since a neural network is essentially a mathematical expression, it can be represented as a computation graph.
For example, a small MLP can be drawn as such a graph. Network inputs are treated as constants, and drawn without a surrounding node. Input and parameter nodes have no incoming arcs, and output nodes have no outgoing arcs. The graph can further be extended with a node that computes the loss of the network's prediction with respect to a desired output. Once the graph is built, it is straightforward to run either a forward computation (computing the result of the computation) or a backward computation (computing the gradients), as we show below.
Constructing the graphs may look daunting, but is actually very easy using dedicated software libraries and APIs. More formally, in a graph of N nodes, we associate each node with an index i according to a topological ordering of the nodes. Let f_i be the function computed by node i (e.g., multiplication, addition, and so on).
Denote by v(i) the output value of node i; the forward computation traverses the nodes in topological order and computes v(i) for each node from the values of its inputs. Denote by d(i) the gradient of the graph's final output with respect to the output of node i; the backward computation (backpropagation) traverses the nodes in reverse topological order and computes the d(i) values using the chain rule. Some operations need care when propagating gradients: after picking the i-th element of a vector, only that element participates in the remainder of the computation, so only that element receives a gradient. Similarly, for the function max(a, b), only the larger of the two inputs participates in the rest of the computation and receives a gradient. In software, graph creation is made almost transparent by use of operator overloading, as the Python sketch below illustrates: the code first maps words to numeric IDs, wraps the model parameters as graph nodes, and then builds the computation graph itself.
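The listing below is a minimal sketch in the spirit of the DyNet example referenced here; it is not the original listing, and the vocabulary, dimensions, network shape, and three-word input are illustrative assumptions. It builds a graph for an MLP over a concatenation of three word embeddings and computes a cross-entropy loss.

import dynet as dy

vocab = {"the": 0, "black": 1, "dog": 2}              # maps words to numeric IDs (toy)
model = dy.ParameterCollection()
E  = model.add_lookup_parameters((len(vocab), 50))    # word-embedding table
pW = model.add_parameters((20, 150))                  # hidden layer: 3*50 -> 20
pb = model.add_parameters(20)
pU = model.add_parameters((2, 20))                    # output layer: 20 -> 2 classes
trainer = dy.SimpleSGDTrainer(model)

def build_graph(words, gold_class):
    dy.renew_cg()                                     # start a fresh computation graph
    # wrap the model parameters as graph nodes
    W, b, U = dy.parameter(pW), dy.parameter(pb), dy.parameter(pU)
    x = dy.concatenate([E[vocab[w]] for w in words])  # embedding lookups + concatenation
    h = dy.tanh(W * x + b)
    scores = U * h
    return dy.pickneglogsoftmax(scores, gold_class)   # cross-entropy loss node

loss = build_graph(["the", "black", "dog"], gold_class=1)
print(loss.value())   # forward computation
loss.backward()       # backward computation (gradients)
trainer.update()      # SGD parameter update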
Note how transparent the graph creation is: there is almost a one-to-one correspondence between creating the graph and describing it mathematically. Toolkits such as TensorFlow instead define the computation graph and the optimizer (e.g., a gradient-descent optimizer with a fixed learning rate) ahead of time; the graph is then compiled, and forward and backward propagation are applied to this compiled graph.
However, the compilation step itself can be costly, and it makes the interface more cumbersome to work with. See Neubig et al. for further discussion. One can then design arbitrarily deep and complex networks, and be able to easily evaluate and train them thanks to automatic forward and gradient computation.
While this book is not intended as a comprehensive guide to successfully training neural networks, we do list here a few of the prominent issues. For further discussion on optimization techniques and algorithms for neural networks, refer to Bengio et al.
For some theoretical discussion and analysis, refer to Glorot and Bengio. For various practical tips and recommendations, see Bottou and LeCun et al. Analysis by He et al. addresses weight initialization for networks with ReLU activations. Using ensembles often increases the prediction accuracy, at the cost of having to run the prediction step several times (once for each model).
Dealing with the vanishing gradients problem is still an open research question. For exploding gradients, a simple and effective remedy is clipping: let ĝ be the gradients of all parameters in the network, and ‖ĝ‖ their L2 norm; Pascanu et al. suggest that if ‖ĝ‖ exceeds a threshold, the gradients are rescaled by setting ĝ ← (threshold / ‖ĝ‖) ĝ. Saturated neurons have very small gradients, and should be avoided. If your network does not train well, it is advisable to monitor the network for layers with many saturated or dead neurons. Saturated neurons are caused by too large values entering the layer.
Dead neurons are caused by all signals entering the layer being negative (for example, this can happen after a large gradient update). Reducing the learning rate will help in this situation. For saturated layers, another option is to normalize the values in the saturated layer after the activation, i.e., replace the activation g(h) with its normalized version g(h)/‖g(h)‖.
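Returning to the gradient-clipping rule mentioned above, here is a minimal NumPy sketch; the threshold value is an illustrative choice.

import numpy as np

def clip_by_norm(g, threshold):
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = (threshold / norm) * g      # rescale so that the norm equals threshold
    return g

g = np.array([3.0, 4.0])                # norm is 5
print(clip_by_norm(g, threshold=1.0))   # -> [0.6, 0.8]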
As of this writing, such normalization of activations is less popular in natural language applications. In practice, most implementations go over the training examples in random order, essentially performing random sampling without replacement. Learning rates that are too small will take a very long time to converge.
Learning rate scheduling decreases the rate as a function of the number of observed minibatches. A common schedule is dividing the initial learning rate by the iteration number. In terms of the computation graph abstraction, one can create a computation graph for each of the k training examples, and then connect the k loss nodes under an averaging node, whose output will be the loss of the minibatch. All of these models take as input vectors x and produce predictions.
Up until now we assumed the vectors x are given. Deciding on the right features is an integral part of a successful machine learning project. We now diverge from the training machinery in order to discuss the feature functions that are used for language data, which will be the topic of the next few chapters.
Chapter 7 discusses feature choices for some concrete NLP problems. Chapter 8 deals with encoding the features as input vectors that can be fed to a neural network. For example, problems in which we are required to produce sentences or longer texts (i.e., generation problems) are deferred to later parts of the book. Words In these problems we are given a word, and need to say something about it. What language is it in? How common is it? What other words are similar to it? Is it a misspelling of another word?
And so on. Texts In these problems we are faced with a piece of text, be it a phrase, a sentence, a paragraph, or a document, and need to say something about it. Is it spam or not? Is it sarcastic? Is it positive, negative, or neutral toward some issue? Who wrote it? Is it reliable? Will this text be liked by 16–18 year-old males? Paired Texts In these problems we are given a pair of words or longer texts, and need to say something about the pair.
Are words A and B synonyms? Is word A a valid translation for word B? Are documents A and B written by the same author? Can the meaning of sentence A be inferred from sentence B? Word in Context Here, we are given a piece of text and a particular word (or phrase, or letter, etc.) within it, and need to say something about that word in the context in which it appears.
Is the word apple in a given context referring to a company or a fruit? Is on the right preposition to use in I read a book on London? Does a given period denote a sentence boundary or an abbreviation?
Is the given word part of a name of a person, location, or organization? Relation between two words Here we are given two words or phrases within the context of a larger document, and need to say something about the relation between them. Is word A the subject of verb B in a given sentence? What is a word? We are using the term word rather loosely. A process called tokenization is in charge of splitting text into tokens (what we call here words) based on whitespace and punctuation. In English, the job of the tokenizer is quite simple, although it does need to consider cases such as abbreviations (I.B.M.) and titles (Mr.). In other languages, things can become much trickier: in Hebrew and Arabic some words attach to the next one without whitespace, and in Chinese there is no whitespace at all. It is common for English tokenizers to handle these cases as well. More interestingly, take something like New York: is it two words, or one? What about ice cream? Is it the same as ice-cream or icecream? And what about idioms such as kick the bucket? In general, we distinguish between words and tokens.
We refer to the output of a tokenizer as a token, and to the meaning-bearing units as words. Having said that, in this book, we use the term word very loosely, and take it to be interchangeable with token. It is important to keep in mind, however, that the story is more complex than that.
As words and letters are discrete items, our features often take the form of indicators or counts. An indicator feature takes a value of 0 or 1, depending on the existence of a condition (e.g., a feature taking the value 1 if a given word appears at least once in the document). A count takes a value depending on the number of times some event occurred (e.g., a feature counting the number of times a given word appears in the text). When looking at a word in isolation, we may consider the form of the word itself: Are all letters capitalized? Does the word include a hyphen? Does it include a digit? Does it end with ing? We may also look at the word with relation to external sources of information: How many times does the word appear in a large collection of text?
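A small sketch of such indicator features over a word form; the feature names and the prefix/suffix lengths are illustrative choices.

def word_features(word):
    return {
        "all_caps": word.isupper(),
        "has_hyphen": "-" in word,
        "has_digit": any(ch.isdigit() for ch in word),
        "ends_with_ing": word.endswith("ing"),
        "prefix_3": word[:3],          # letter prefix, often informative
        "suffix_3": word[-3:],         # letter suffix, often informative
    }

print(word_features("well-known"))
print(word_features("booking"))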
Does the word appear in a list of common person names in the U.S.? Lemmas and Stems We often look at the lemma (the dictionary entry) of the word, mapping forms such as booking, booked, books to their common lemma book. A coarser process than lemmatization, which can work on any sequence of letters, is called stemming.
Note that the result of stemming need not be a valid word: picture and pictures and pictured will all be stemmed to pictur. Lexical Resources An additional source of information about word forms is lexical resources. Such lexicons will typically also include lemma information. A very well-known lexical resource in English is WordNet [Fellbaum].
WordNet is a very large, manually curated dataset attempting to capture conceptual semantic knowledge about words. Each word belongs to one or several synsets, where each synset describes a cognitive concept. For example, the word star as a noun belongs to the synsets astronomical celestial body, someone who is dazzlingly skilled, any celestial body visible from earth, and an actor who plays a principal role, among others.
Other semantic relations in WordNet include antonyms (opposite words) as well as holonyms and meronyms (part-whole and whole-part relations). WordNet contains information about nouns, verbs, adjectives, and adverbs. FrameNet [Fillmore et al.] is another lexical resource, organized around frames that describe events and their participants. Yet another kind of resource lists words and phrases, and for each one provides a list of words and phrases that can be used to mean roughly the same thing.
Lexical resources such as these contain a lot of information, and can serve as a good source of features. Distributional Information Another important source of information about words is distributional: which other words behave similarly to it in the text?
We return to distributional information later in the book. Features for Text When we consider a sentence, a paragraph, or a document, the observable features are the counts and the order of the letters and the words within the text. Bag of words A very common feature extraction procedure for sentences and documents is the bag-of-words approach (BOW).
In this approach, we look at the histogram of the words within the text, i.e., we consider each word's count as a feature. We can also compute quantities that are directly derived from the words and the letters, such as the length of the sentence in terms of number of letters or number of words. Consider a document d which is part of a larger corpus D. Rather than representing each word w in d by its normalized count in the document, #d(w) / Σ_{w′∈d} #d(w′) (its term frequency, TF), TF-IDF weighting additionally scales each word by its inverse document frequency, log(|D| / |{d′ ∈ D : w ∈ d′}|), so that words that appear in many documents of the corpus receive low weights.
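The sketch below computes bag-of-words counts and TF-IDF weights over a toy corpus. The corpus and the particular TF-IDF variant (normalized counts, logarithmic IDF) are illustrative assumptions; many variants exist.

import math
from collections import Counter

corpus = [
    "the dog chased the cat".split(),
    "the cat sat on the mat".split(),
    "dogs and cats are pets".split(),
]

def tf(doc):
    counts = Counter(doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}     # normalized word counts

def idf(word, docs):
    df = sum(1 for doc in docs if word in doc)           # document frequency
    return math.log(len(docs) / df)

doc = corpus[0]
weights = {w: tf_w * idf(w, corpus) for w, tf_w in tf(doc).items()}
print(weights)   # "the" (appearing in most documents) gets a low weight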
Besides words, one may also look at consecutive pairs or triplets of words (bigram and trigram features). Ngram features are discussed in more depth later in this chapter. Features of Words in Context When considering a word within a sentence or a document, the directly observable features of the word are its position within the sentence, as well as the words or letters surrounding it.
Words that are closer to the target word are often more informative about it than words that are further apart. For example, consider the sentence the brown fox jumped over the lazy dog, with the target word jumped.
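A sketch of window-based features around a target word, using a window of two words to each side with relative positions; the feature-string format is an illustrative choice.

def window_features(words, i, window=2):
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue                        # skip the target word itself
        j = i + offset
        token = words[j] if 0 <= j < len(words) else "<PAD>"
        feats["w[%+d]=%s" % (offset, token)] = 1
    return feats

sent = "the brown fox jumped over the lazy dog".split()
print(window_features(sent, sent.index("jumped")))
# {'w[-2]=brown': 1, 'w[-1]=fox': 1, 'w[+1]=over': 1, 'w[+2]=the': 1}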
Encoding of window-based features as vectors is discussed in Chapter 8. Position Besides the context of the word, we may be interested in its absolute position within a sentence. Features for Word Relations When considering two words in context, besides the position of each one and the words surrounding them, we can also look at the distance between the words and the identities of the words that appear between them. Linguistics has a much wider breadth than syntax, and there are other systems that regulate human linguistic behavior besides the syntactic one.
For a more in-depth overview, see the further reading recommendations at the end of this section. Consider the sentence the boy with the black shirt opened the door with a key. However, VP chunks may contain more elements, covering also cases such as will open and did not open. Words that are far apart in the surface form of the sentence may be close in its dependency tree.
For example, boy and opened have four words between them in the surface form, but have a direct nsubj edge connecting them in the dependency tree. Other kinds of relations are more semantic. The syntactic tree does not tell us, however, what the semantic roles of the arguments are with respect to the verb, i.e., that the boy is the agent performing the opening action, the door is the thing being opened, and the key is the instrument.
Besides the observable properties (letters, words, counts, lengths, linear distances, frequencies, etc.), we can also look at inferred linguistic properties of the text. For example, we could look at the part-of-speech tag (POS) of a word within a document: is it a noun, a verb, an adjective, or a determiner?
Is it the main verb of the sentence? When given two words in a sentence, we can consider the syntactic dependency tree of the sentence, and the subtree or paths that connect the two words within this tree, as well as properties of that path. Words that are far apart in the sentence in terms of the number of words separating them can be close to each other in the syntactic structure. Another important phenomenon is that of anaphora. Consider the sentence sequence: the boy opened the door with a key. It1 was locked.
He2 saw a man. He3 was smiling. Anaphora resolution (also called coreference resolution) will tell us that It1 refers to the door (and not the key or the boy), he2 refers to the boy, and he3 is likely to refer to the man. Part-of-speech tags, syntactic roles, discourse relations, anaphora, and so on are concepts that are based on linguistic theories that were developed by linguists over a long period of time, with the aim of capturing the rules and regularities in the very messy system of the human language.
While many aspects of the rules governing language are still open for debate, and others may seem overly rigid or simplistic, the concepts explored here and others do indeed capture a wide and important array of generalizations and regularities in language.
Are linguistic concepts needed? Some proponents of deep learning argue that such inferred, manually designed linguistic properties are not needed, and that the neural network will learn these intermediate representations (or equivalent, or better ones) on its own.
Even if we do have enough data, we may want to focus the network on certain aspects of the text and hint to it that it should ignore others, by providing the generalized concepts in addition to, or even instead of, the surface forms of the words. Finally, even if we do not use these linguistic properties as input features, we may want to help guide the network by using them as additional supervision in a multi-task learning setup (see Chapter 20), or by designing network architectures or training paradigms that are more suitable for learning certain linguistic phenomena.
Overall, we see enough evidence that the use of linguistic concepts helps improve language understanding and production systems. Further Reading When dealing with natural language text, it is well advised to be aware of the linguistic concepts beyond letters and words, as well as of the current computational tools and resources that are available. For a discussion of current NLP methods, tools, and resources, see the book by Jurafsky and Martin as well as the various specialized titles in this series.
Linear models cannot assign a score to a conjunction of events (X occurred and Y occurred and …) that is not a sum of their individual scores, unless the conjunction itself is modeled as its own feature. When using a neural network such as a multi-layer perceptron (Chapter 4), the model designer can specify only the set of core features, and rely on the network training procedure to pick up on the important combinations on its own. Word-bigrams, as well as trigrams (sequences of three items) of letters or words, are also common.
Beyond that, 4-grams and 5-grams are sometimes used for letters, but rarely for words due to sparsity issues. It should be intuitively clear why word-bigrams are more informative than individual words: they capture structures such as New York, not good, and Paris Hilton.
Indeed, a bag-of-bigrams representation is much more powerful than bag-of-words, and in many cases proves very hard to beat. Of course, not all bigrams are equally informative: bigrams such as of the, on a, the boy, etc., are very common and, for most tasks, carry little information beyond their individual words.
However, it is very hard to know a priori which ngrams will be useful for a given task. Bidirectional RNNs (Chapters 14 and 16) generalize the ngram concept even further, and can be sensitive to informative ngrams of varying lengths, as well as ngrams with gaps in them. Unless we have a specialized list of foods, we will not learn that pizza is more similar to burger than it is to chair, and it will be even harder to learn that pizza is more similar to burger than it is to icecream.
Many algorithms were derived over the years to make use of this property, and learn generalizations of words based on the contexts in which they occur. Turian et al., for example, demonstrate the benefit of using such derived word representations as features. However, care must be taken when using such word similarity information, as it can have unintended consequences. We will discuss word embedding methods and the use of word vectors in more detail in Chapter 10 and the chapters that follow. As we saw in Chapter 2, a bag of letter-bigrams is a very strong feature representation for the language identification task.
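The sketch below extracts such letter-bigram count features from a document; the example sentences and the lack of a count threshold are illustrative simplifications.

from collections import Counter

def letter_bigrams(text):
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

doc_en = "the quick brown fox jumped over the lazy dog"
doc_de = "der schnelle braune fuchs springt über den faulen hund"
print(letter_bigrams(doc_en).most_common(5))
print(letter_bigrams(doc_de).most_common(5))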
Concretely, each possible letter-bigram (or each letter-bigram appearing at least k times in at least one language) is a core feature, and the value of a core feature for a given document is the count of that feature in the document. A similar task is the one of encoding detection. Here, a good feature representation is a bag of byte-bigrams. For a task such as topic classification, in contrast, the letter level is not very informative, and our basic units will be words. Word order is not very informative for this task (except maybe for consecutive word pairs such as bigrams).
If nothing else, such models should be considered as strong baselines for whatever networks you are designing. We may also replace or supplement words by distributional features such as word clusters or word-embedding vectors. When using a bag-of-words, it is sometimes useful to weight each word in proportion to its informativeness, for example using the TF-IDF weighting discussed above. However, the learning algorithm is often capable of coming up with the weighting on its own.
Another option is to use word indicators rather than word counts: each word in the document (or each word above a given count) will be represented once, regardless of its number of occurrences in the document. A good approximation of function words is the list of the top few hundred or so most frequent words in a large corpus. By focusing on such features, we can learn to capture subtle stylistic variations in writing that are unique to an author and very hard to fake.
A good feature set for the authorship attribution task includes a bag of function words and pronouns, a bag of POS tags, and bags of POS bigrams, trigrams, and 4-grams. Additionally, we may want to consider the density of function words, i.e., how frequently they occur within fixed-size windows of the text.
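A small sketch of such features: a bag of function words and their density over windows of the text. The tiny function-word list and the window size are illustrative assumptions; a real system would use a much larger list (and POS-based features as well).

from collections import Counter

FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "he", "she", "it", "that", "on", "was"}

def function_word_features(tokens, window=6):
    bag = Counter(t for t in tokens if t.lower() in FUNCTION_WORDS)
    densities = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        # fraction of tokens in this window that are function words
        densities.append(sum(t.lower() in FUNCTION_WORDS for t in chunk) / len(chunk))
    return bag, densities

text = "it was the best of times it was the worst of times".split()
print(function_word_features(text))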
Our feature function when classifying a word w_i has access to all the words in the sentence and their letters, as well as all the previous tagging decisions (i.e., the tags assigned to words w_1, ..., w_{i−1}).
In Chapter 19 we discuss the structured learning case, using the same set of features. After all, they are deterministic functions of the word. A typical input to the named entity recognition (NER) problem would be a sentence such as: John Smith, president of McCormik Industries, visited his niece Paris in Milan, reporters say. While NER is a sequence segmentation task (it assigns labeled brackets over non-overlapping sentence spans), it is often modeled as a sequence tagging task, like POS-tagging.
See, for example, Lample et al. Distributional features such as word clusters or word vectors are also extremely useful for the NER task. Prepositions are very common, and also very ambiguous. Consider, for example, the word for in the following sentences. We went there for lunch. He paid for me. We ate for two hours. He would have left for home, but it started raining.
In order to fully understand the meaning of a sentence, one should arguably know the correct senses of the prepositions within it.
Schneider et al. also address this task. We follow here a feature set inspired by the work of Hovy et al. Besides the word itself, we will look at the context in which it occurs. Simply looking at the words adjacent to the preposition is not always informative, and we need a better mechanism for selecting informative contexts. In linguistic terms, we say that this heuristic helps us capture the governor and the object of the preposition. It is also somewhat brittle; it is not hard to imagine cases in which it fails.
For robustness, we may look at both the governor and object extracted from the parser and the governor and object extracted using the heuristic, and use all four as sources for features. See, for example, Litkowski and Hargraves, and Srikumar and Roth.
After extracting the governor and the object (and perhaps also words adjacent to the governor and the object), we can use them as the basis for further feature extraction. See the work of Hovy et al. One approach to modeling the dependency parsing task is the arc-factored approach [McDonald et al.]. Training the scoring function such that it works well with the search procedure will be discussed in a later chapter. The second part of the book (Parts III and IV) introduces more specialized neural network architectures, including 1D convolutional neural networks, recurrent neural networks, conditioned-generation models, and attention-based models.
These architectures and techniques are the driving force behind state-of-the-art algorithms for machine translation, syntactic parsing, and many other applications. Finally, we also discuss tree-shaped networks, structured prediction, and the prospects of multi-task learning.