NMA-DL Keynote Speaker: Yoshua Bengio

Hello, my name is Yoshua Bengio, and today I'm going to tell you about deep learning for AI.

I got into this field a few decades ago because I was very excited about what could be called an amazing hypothesis, which many people take for granted now but which wasn't obvious then: that there would be a few simple principles, a bit like the laws of physics, that would allow us to understand intelligence. And because it's just a few principles, it would have to rely a lot on the notion of learning, so that all of the knowledge we have, instead of being the result of a huge bag of tricks, would be mostly acquired through learning. That's pretty much what machine learning, as an approach to AI, is about.

It's interesting to contrast this with the classical AI approach, the rule-based symbolic methods that were dominant in the 80s and 90s, when I started doing this kind of research. In those classical approaches, the knowledge is provided by humans; machines only do inference. In other words, they combine the pieces of known knowledge in order to come up with answers to questions, and there's no learning, or not much, and no adaptation. It's not clear how to handle uncertainty, although progress has been made on that front using graphical models, which in fact intersect a lot with machine learning. Another problem is that these approaches hardly make contact with low-level perception and action, which is something where machine learning has really succeeded to a great extent. Another area where machine learning has been successful but classical AI didn't handle well is everything that has to do with intuitive knowledge: things like common sense, which we know but don't necessarily know how to explain precisely. That's the kind of thing you can learn with current deep learning, but it's not clear how to express it in a few rules and symbols. So machine learning really gave us a path towards a set of general learning principles that would avoid having to program a huge bag of tricks by hand. Still, in the last few years and looking forward, there are a lot of gaps between machine learning and human intelligence, and one of them has to do precisely with what was the strength of the rule-based approaches: higher-level cognition and reasoning. Current machine learning is attacking this as well, but it is really an open problem; I'll come back to it at the end.

Now, there are a lot of approaches within machine learning, and today I'll talk about the neural net approach, which I've been involved with. It's inspired by the brain, by neuroscience and by cognitive science, and it sees computation as emerging from the synergy of a large number of simple adaptive computational units. There is a focus on the notion of distributed representations, which I will elaborate on: for example, words, which are symbols, are not represented by a symbol; they are represented by a pattern of activation, by a vector, a word representation. This has been very important for the success of deep learning, especially in my work on language modeling, which I'll touch upon.

More generally, in the neural net approach we see the conceptual ingredients for how a system can become intelligent as arising from combining three things: an objective function or reward function; a learning rule, or approximate optimizer, that tries to make the system maximize that objective; and an initial architecture, a parameterization of the set of functions we want to search over. All of that gives rise to the so-called end-to-end learning approach, where all of the pieces of the network adapt together to help each other achieve the objective.
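As a concrete illustration of these three ingredients (not part of the talk; the architecture, objective and data here are arbitrary placeholders), a minimal PyTorch sketch might look like this:

```python
import torch
import torch.nn as nn

# 1. Architecture: a parameterization of the family of functions we search over.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))

# 2. Objective: the quantity the system is asked to optimize.
objective = nn.CrossEntropyLoss()

# 3. Learning rule: an approximate optimizer that adapts all parameters together.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# End-to-end learning: every layer receives a gradient from the same objective.
x = torch.randn(32, 10)            # toy batch of inputs
y = torch.randint(0, 2, (32,))     # toy labels
loss = objective(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```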
Now, machine learning theory has a lot of important concepts, and I will just tell you about the ones I find most salient. The most important concept in machine learning is that of generalization, and you have to distinguish it from learning by heart. Learning by heart is really easy for computers: you can just store the data in memory or in files. Strangely enough, learning by heart is difficult for humans. What is easy for humans, and not so easy for computers, is to generalize to new cases. That's the challenge, and you can't do it unless you make some kind of assumption about the data. It doesn't have to be an explicit assumption; for example, the structure of the neural net embodies that kind of assumption. In a lot of machine learning theory, we express it as a form of smoothness prior: we assume that the function we want to learn is smooth, meaning that for two nearby inputs x and x', the outputs of the function should also be nearby.

Another important concept from machine learning theory is capacity: the number of arbitrary examples that a learner could always learn by heart, like asking how many examples I can stuff into your bag. You can control capacity by having a small net versus a large net, for example. There's a really important quantity that varies with capacity, called the generalization gap: the difference between the performance you measure on your training data and the performance you measure on your test data. That gap can increase or decrease as you change capacity. When you have too much capacity, you are in the overfitting regime, and the generalization gap will increase as you increase capacity. When you have too little capacity, your network is too small, you are underfitting, and the generalization gap will decrease as you increase capacity.

Now, how do we manage to generalize? It seems there is a barrier, called the curse of dimensionality, that makes it harder and harder as we try to generalize in higher dimensions. To illustrate, consider first learning in a one-dimensional space, and let's make it really easy by breaking that space, the real line, into 10 bins and then collecting data for each bin, maybe counting how many examples fall in each bin, or the average output value you observe for that bin. Quickly enough you will have data for all the bins, and there will be enough in each bin to have reliable statistics about what the answer should be. Now consider two dimensions: instead of 10 bins you have 10 by 10, that is 100 bins, and you might end up with some empty bins where no data falls, but maybe not that many. If you go to three dimensions, that's 10 by 10 by 10, a thousand bins, and it gets more tricky: you're more likely to find empty bins. And if you go to even higher dimensions and think about an image of a thousand by a thousand pixels, that's a million pixels, a million dimensions, and the exponentially vast majority of bins will be empty; there will be no data. You can bleed a bit of information from one bin to its neighbors, but unfortunately, for a geometric reason, that doesn't work well, and it's not sufficient to address the problem.
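As a rough illustration of this exponential explosion (not from the talk; the data budget and bin count are arbitrary), you can simulate how many of the 10^d cells actually receive any data as the dimension d grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_bins = 10_000, 10          # fixed data budget, 10 bins per axis

for d in [1, 2, 3, 4, 5, 6]:
    x = rng.random((n_points, d))      # uniform samples in [0, 1]^d
    cells = (x * n_bins).astype(int)   # which of the 10^d cells each point lands in
    occupied = len(np.unique(cells, axis=0))
    total = n_bins ** d
    print(f"d={d}: {occupied}/{total} cells occupied "
          f"({100 * occupied / total:.4f}%)")
```

With the same 10,000 points, every cell is occupied in one dimension, while in six dimensions almost all of the million cells are empty.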
So you need other assumptions to be able to generalize, besides assuming that the function is smooth. In deep learning, and that's probably one of its greatest strengths, we exploit different forms of compositionality to get better generalization. Compositionality is easiest to understand by thinking about how we can compose words in language to obtain different meanings. The number of ways we can combine words we know in novel ways is exponentially large, and that's very powerful: we can generalize to sentences we've never seen before in a meaningful way.

In neural nets, especially deep nets, we have two ways of getting that kind of compositionality. At each level you can have a distributed representation, meaning a pattern of activations, and you can generalize from the patterns you've seen to patterns you haven't seen, so long as you see enough data for each feature. And then you can compose different layers if you have a deep network. That's really the power of deep nets: you have this composition of features, a hierarchy of features, and you also get an exponential advantage from it. The kind of compositionality we find in language is something we are trying to put into neural nets these days; that's what I call system 2 deep learning, and it's ongoing research. In all of these cases, deep learning is characterized by the notion of representation: we're learning an internal representation that was not given ahead of time. That's really the hallmark of deep learning.

To understand better why this could be exponentially advantageous, think about a convolutional net, a special kind of neural net for images. It takes an image as input and outputs some categories, maybe different scenes. But in order to produce its answers it somehow discovers, near one of the top hidden layers, a set of features, with individual units specializing. In fact, we found in experiments that there is such a specialization and that you can describe verbally what those units learn. Imagine one unit learns the feature that a person wears glasses, another learns whether the person is female or not, another learns whether the person is a child or not. You can imagine that the data contains some combinations of these features, but it's very unlikely you've seen all the combinations: with a thousand features you would need something like two to the thousand examples, which is not plausible. Yet these things work, and they work because you don't need to see all the combinations of features. You can think of it more like learning the meaning of each feature, and once you understand the meaning of each feature you can combine them, for example with a linear classifier. So if you have n features, and each of them needs on the order of k degrees of freedom to be learned, then you only need on the order of n times k degrees of freedom, and on the order of n times k examples, to learn that. This parallel composition of features really can give you an exponential advantage, and we have mathematical results that tell that story in different words. That should be contrasted with more classical non-parametric statistical methods, which would require, just like in the curse-of-dimensionality example, tiling the space with an exponential number of configurations, and you would need that exponential number of examples.
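To make this counting argument concrete, here is a toy sketch (my own example, not from the talk, and it assumes the label really is a simple additive function of the features): a linear classifier trained on a few hundred binary feature vectors generalizes to combinations it has almost surely never seen, even though there are about a million possible combinations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_features = 20                      # 2**20 ≈ 1e6 possible feature combinations
true_w = rng.normal(size=n_features)

def make_data(n):
    X = rng.integers(0, 2, size=(n, n_features))       # random binary feature vectors
    y = (X @ true_w > true_w.sum() / 2).astype(int)    # label depends on each feature additively
    return X, y

# Train on only 500 combinations, test on combinations almost surely never seen before.
X_train, y_train = make_data(500)
X_test, y_test = make_data(10_000)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```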
Now let's turn to the hierarchy of features. You can get a lot of inspiration here by looking at how humans understand the world and how engineers build solutions to problems. If you consider image recognition, you have pixels, then low-level features, then motifs, parts, objects and so on, and you can see similar things in text and speech. This is a natural way, at least in the minds of humans, to grasp the world, in a multi-level hierarchy of features, and it has been a big inspiration for deep learning.

As you go higher up in the hierarchy, you hope to end up with more and more abstract features. What does it mean to be abstract? It means the feature allows you to describe what is going on in a very general way that applies to many settings. You can even think of the highest level of representation as ideally separating, or disentangling, as I called it a few decades ago, the explanatory factors of variation. If you can do that, it becomes much easier to generalize and transfer to new tasks, because new tasks or new settings typically correspond to recombining just a few of these variables, so they are easy to learn from very little data. A lot of the recent work is about that notion of learning representations that separate those factors of variation, and, going even further, disentangling not just those factors but the way they are related to each other, which you can think of as the causal structure, the set of causal mechanisms that relate them at the top level.

A nice property of neural nets that was uncovered very early on is their ability to approximate anything: if you have a big enough neural net, you can approximate any function to some given precision, and if you want more precision, you make the network bigger, as in wider. In fact, a single hidden layer is sufficient. That doesn't mean it's going to be the network that generalizes best; in fact, a lot of our recent experience, and even theory, suggests that you want a network that's deep enough. And having that universal approximation property is not sufficient, because it doesn't guarantee that it's going to be easy to train the neural net to find those weights; the theory mainly says that there exists a set of weights that does the job. It also doesn't guarantee good generalization outside of the training set. So it's a piece of the theory, but it's clearly insufficient.

So how do we train those networks? It turns out that, of the many different optimization methods that exist, one of the simplest, stochastic gradient descent, has worked incredibly well for deep learning. Gradient descent just says: change the parameters in the direction that reduces the loss, the error, the most, and that direction is given by the derivative of the loss with respect to the parameters. On a computer it's fairly easy to compute those derivatives; in the brain it's less obvious. The stochastic part comes in because the true gradient is an average over all the examples, but of course you don't want to wait and see all the examples before you modify your parameters; you don't wait until the end of your life before you change, right, you change every day. So stochastic gradient descent is a kind of noisy version of gradient descent where you only look at a few examples before you modify the parameters. The reason the gradient is so simple and so useful is that, if you're going to make a small change to the parameters, it's the best change to make, in the sense of going down the loss, or up the objective, as fast as possible.
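Here is a minimal from-scratch sketch of minibatch stochastic gradient descent on a toy linear regression problem (the problem, learning rate and batch size are arbitrary choices for illustration, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: y = X w* + noise.
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(5)                       # parameters to learn
lr, batch_size = 0.1, 32

for step in range(500):
    idx = rng.integers(0, len(X), batch_size)   # look at only a few examples
    Xb, yb = X[idx], y[idx]
    err = Xb @ w - yb
    grad = 2 * Xb.T @ err / batch_size          # derivative of the minibatch squared error w.r.t. w
    w -= lr * grad                              # small step in the direction that reduces the loss

print("learned w :", np.round(w, 2))
print("true    w*:", np.round(w_true, 2))
```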
Let me go back to the notion of distributed representations. At NeurIPS 2000, my collaborators and I presented a neural language model, later extended in a JMLR paper in 2003. It was really a simple extension of the standard multi-layer neural net where, in the first layer, instead of a fully connected layer, we consider the input as a sequence of symbols, a sequence of words, and associate with each word a learned word vector. Each word in the vocabulary has a different representation, and we take these word vectors, concatenate them, and feed them into a normal neural net. It's as if the first layer had shared weights: if you represent each word by a one-hot vector, a vector of all zeros except a single one, then you just have a kind of local connectivity, like in a convolutional net, with shared weights, and that's it. Another challenge was the output: we have tens or hundreds of thousands of words in the vocabulary, we want to predict, say, the next word in the sentence, or a word in the middle, and we need to normalize the outputs so that they sum to one. That's the softmax, and it creates a bit of a computational challenge, but today it's handled quite nicely with GPUs. And we can look at those word vectors: if you reduce dimensionality, project them to 2D and zoom in, you see that words that are semantically similar end up close to each other in the learned vector space. You can spend a lot of time studying this; it's quite exciting.

You can do many other useful tricks with neural net architectures. For example, you can handle multiple tasks and use the same representation across all of them; the gradients from all the tasks then push the representation in a way that's good for all the tasks, which often ends up being a better representation.

You can also turn these nets upside down: instead of classifying or predicting some label given an input, they can sample an input given some conditions. One of the approaches for doing that is something we introduced in 2014 called GANs, generative adversarial networks; it's one of the few methods that really work well, and it's very popular. Another one worth mentioning is the variational autoencoder, which came out at about the same time. One interesting aspect of GANs is that instead of having a single network, the generator, which produces images given an input noise vector and potentially some conditions, there is also another network, a discriminator, which is a normal classifier trained to distinguish between real images and those produced by the generator. You can then take the signal coming from the discriminator to train the generator to fool the discriminator. The progress made with these kinds of generative models has been amazing in the last few years: you can see the progression from 2014 to 2018, and modern ones are so good that it's hard to tell the images are machine generated. And as I said, you can condition those generators on relevant information; for example, you can map a sequence like a sentence to an image.
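Returning to the neural language model described above, here is a minimal PyTorch sketch of that style of architecture (a simplified variant with placeholder sizes, not the exact 2003 model):

```python
import torch
import torch.nn as nn

class NeuralLanguageModel(nn.Module):
    """Predict the next word from the previous `context` words, Bengio et al. (2003) style."""
    def __init__(self, vocab_size=10_000, context=4, emb_dim=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # one learned vector per word
        self.hidden = nn.Linear(context * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)         # a score for every word in the vocabulary

    def forward(self, word_ids):                         # word_ids: (batch, context) integer indices
        vectors = self.embed(word_ids)                   # (batch, context, emb_dim)
        h = torch.tanh(self.hidden(vectors.flatten(1)))  # concatenate word vectors, then a normal net
        return self.out(h)                               # softmax is folded into the loss below

model = NeuralLanguageModel()
loss_fn = nn.CrossEntropyLoss()                          # softmax over the vocabulary + log-likelihood
contexts = torch.randint(0, 10_000, (32, 4))             # toy batch of 4-word contexts
targets = torch.randint(0, 10_000, (32,))                # the word to predict
loss = loss_fn(model(contexts), targets)
loss.backward()
```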
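And here is a hedged sketch of the adversarial training signal just described, using toy generator and discriminator networks on vectors rather than images (all architectures and hyperparameters are placeholders of mine):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))   # generator: noise -> fake sample
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))    # discriminator: sample -> real/fake score
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(64, 2) + torch.tensor([3.0, 3.0])     # stand-in for "real" data

# Discriminator step: learn to distinguish real samples from generated ones.
fake = G(torch.randn(64, 16)).detach()
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: use the discriminator's signal to learn to fool it.
fake = G(torch.randn(64, 16))
g_loss = bce(D(fake), torch.ones(64, 1))                 # ask the discriminator to call the fakes real
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```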
Another really interesting architectural advance in deep learning is the introduction of attention. Attention had been studied earlier, but it really took off with a particular form of attention mechanism we introduced in my group in 2014 for the purpose of machine translation. You want to go, say, from an English sentence like "economic growth has slowed down in recent years" to a French sentence, and you could use a recurrent net to generate the output sentence. But when you're producing the next word, say the translation of "economic", it would really help if, instead of looking at the whole input sequence as one big object, you could focus your attention on the right word, in this case the first word of the English sentence, "economic". That's what attention allows you to do. You can have attention focused on one thing, or you can have it focus on several things by using what's called multi-head attention: you have several attention mechanisms, each of them produces a kind of soft selection from the input, and these soft selections can be concatenated as input for the next computation. When you stack these self-attention mechanisms on top of each other, you essentially get transformers, and transformers have been amazingly successful: they give you the current state of the art in natural language processing, language modeling, translation and so on.

At the heart of this is a gating mechanism, something that multiplies the weights based on some context. The way to think about it is that it can potentially put all the importance on one element of the input and ignore the others. What's exciting is that it changes the picture of what a neural net means. Whereas traditionally we think of neural nets as operating on vectors, attention is a kind of internal action; it can be learned like an internal policy that decides what computation to do and what to compute over. Thanks to this, you can now think of the input not as a vector but rather as a set, an unordered set, because the same weights can be applied to all the elements and it's the attention that decides the weights on the fly. Of course, a set is a very general data structure, and you can represent graphs as well; graph neural nets are another architectural advance that is very closely related to transformers and attention.
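As an illustration of the soft selection just described, here is a minimal sketch of dot-product attention over a set of input vectors (a generic textbook form, not the exact 2014 translation model; the sizes are placeholders):

```python
import torch
import torch.nn.functional as F

def attention(query, keys, values):
    """Soft selection: weight each element of the input set by its relevance to the query."""
    scores = keys @ query / keys.shape[-1] ** 0.5   # one relevance score per input element
    weights = F.softmax(scores, dim=0)              # normalized, most weight on the best match
    return weights @ values, weights                # weighted combination of the values

# A "sentence" of 5 input elements, each an 8-dimensional vector.
keys = values = torch.randn(5, 8)
query = torch.randn(8)                              # what the decoder is currently looking for
context, weights = attention(query, keys, values)
print(weights)                                      # the weights over the set, computed on the fly
```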
Now, looking forward, there is something important, which I mentioned a little at the beginning, that is missing from current machine learning, and that's generalization beyond the training distribution. It is different from generalization beyond the training data, which is the normal kind of generalization; you might have an infinite amount of training data from the same training distribution. For example, imagine you have a lot of data from one country, like the United States, and you'd like your system to generalize to a different country, say in Europe. It turns out that when you train, even with huge amounts of data from the first country, and then apply the system in the other setting, in Europe, it doesn't work as well: there is a loss in performance. Of course, theory says it's a different distribution, so why would you even expect it to work? You need some kind of assumption to deal with that, and people are starting to redefine learning theory to handle this notion of out-of-distribution generalization.

The path that I and others have chosen is to look at how humans do it, because humans do an incredible job of generalizing in new settings. What's exciting is that when they do that, they use a different form of computation in their brain: they use system 2 cognition. When they are behaving in their habitual way, they use system 1 cognition, which is intuitive, fast and unconscious, so you don't have access to the computations being performed. When they attend to something new, something that requires their attention, they use a more sequential form of computation, one that is conscious, so you can explain to others what you're doing and why. We'd like to extend deep learning in that direction.

To finish, I want to give you some clues about how to run machine learning experiments, because you need to get your hands dirty to really understand what you're doing, and becoming an expert in machine learning requires becoming an expert at coding up and running these experiments. Don't trust your own code; there could be bugs, so try to run other people's code as a baseline, and once you have debugged your code, make it available to others so they can check it and replicate your experiments. Also, do a lot of comparisons: always compare against some baseline methods, because a performance number in the absolute is hard to interpret, but if you can compare it against other methods, ideally against the state of the art, then you have something you can talk about. For the data on which you apply a new algorithm, it's better to use existing datasets, existing benchmarks with published results and ideally published code, so you can compare your new method fairly, as opposed to working on your own dataset that nobody has tried before, where it's harder for reviewers to make comparisons with previous work. Don't trust negative experimental results, whether yours or other people's: they could be due to a bug, so it's important to try to find an explanation for them, and maybe to run experiments that test the hypotheses embodied in those explanations. In general, think of experiments not just as something to improve a benchmark and beat other people's methods, but rather as a way to understand why something is working, what ingredients matter, and what it teaches us that we didn't already know or weren't sure of.

Thank you very much, and I hope you enjoy the rest of the Neuromatch sessions.