Lesson 7 - ResNets, U-Nets, GANs and RNNs
These are my personal notes from the fast.ai Live course (the new International Fellowship programme), and they will continue to be updated and improved if I find anything useful and relevant while I continue to review the course and study in much more depth. Thanks for reading and happy learning!
Live date: 13 Dec 2018, GMT+8
Topics
Residual Networks (ResNets)
DenseNets
U-Nets
Image restoration
Generative Adversarial Networks (GANs)
Wasserstein GAN (WGAN)
Super resolution
Feature/perceptual loss function
Recurrent Neural Networks (RNNs)
What now?
Coming up: part 2!
AMA
Lesson Resources
Jupyter Notebook and code
Other Resources
Papers
Optional reading
Other Useful Information
Useful Tools and Libraries
Assignments
Run lesson 7 notebooks.
Practice.
Go back and watch the videos again.
My Notes
Welcome to lesson 7, the last lesson of part 1. This will be a pretty intense lesson. Don't let that bother you, because partly what I want to do is give you enough things to think about to keep you busy until part 2. In fact, for some of the things we cover today, I'm not going to tell you all the details. I'll just point out a few things and say, "okay, we're not talking about that yet, we're not talking about that yet." Then come back in part 2 to get the details on some of these extra pieces. So today will be a lot of material pretty quickly. You might require a few viewings to fully understand it all, or a few experiments and so forth. That's kind of intentional. I'm trying to give you stuff to keep you amused for a couple of months.
Share your work
So we are going to use the MNIST dataset. As I read this in, I am going to show some more details about pieces of the Data Block API.
Data Block API
We start by using the data block API one function at a time.
This saves the paths to the images.
Normally when you show images they are in RGB, but this time we want to use a binary colormap, so we can change it this way.
Our image list contains 70k images, and then we can see the shape of the first five images and also where these images come from.
You want to include that unit axis at the start. Fastai will do that for you even if it is reading 1-channel images.
We can print certain images.
Then we need to define a validation set. If you don't have one, you need to tell it that by using the .no_split() method to create a kind of empty validation set. You can't skip it entirely; you have to say how to split, and one of the options is no split.
My split data now has a training set and a validation set.
Because our data is in folders whose names tell us the labels of the items inside, we can just use this function.
Now we also have the labels.
Now we've got a transformed LabelLists. We can pick a batch size and call databunch, and then we can call normalize. In this case, we are not using a pre-trained model, so there's no reason to use ImageNet stats here. If you call normalize() like this without passing in stats, it will grab a batch of data at random and use that to decide what normalization stats to use. That's a good idea if you're not using a pre-trained model.
How many transformed versions of the image do you create? The answer is infinite. Each time we grab one item from the dataset, we do a random transform on the fly, so potentially every one will look a bit different.
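Putting those steps together, here is a minimal sketch of the data block pipeline described above, along the lines of the lesson notebook. The folder names and the transform values are assumptions based on the standard MNIST png layout, so treat them as placeholders rather than the definitive code.

```python
from fastai.vision import *

path = untar_data(URLs.MNIST)                 # full MNIST in png form
defaults.cmap = 'binary'                      # show 1-channel images with a binary colormap

il = ImageList.from_folder(path, convert_mode='L')          # keep images as 1 channel
sd = il.split_by_folder(train='training', valid='testing')  # or sd = il.no_split()
ll = sd.label_from_folder()                                 # folder names are the labels

tfms = ([*rand_pad(padding=3, size=28, mode='zeros')], [])  # small random padding/cropping
data = (ll.transform(tfms)
          .databunch(bs=128)
          .normalize())   # no stats passed: we are not using a pre-trained model
```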
Let's start out by creating a simple CNN, a simple convnet.
The code comments show the grid size; it halves at each layer because the stride is 2, so the kernel moves two pixels at a time.
Then we turn it into a learner.
Train it. So that's how easy it is to create a pretty accurate digit detector.
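For reference, a sketch of that simple convnet, following the shape of the lesson notebook (the exact channel counts are the ones I recall from it, but any reasonable values would do):

```python
from fastai.vision import *   # brings in nn, Flatten, Learner, accuracy

def conv(ni, nf): return nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)

model = nn.Sequential(
    conv(1, 8),    # 14x14
    nn.BatchNorm2d(8),
    nn.ReLU(),
    conv(8, 16),   # 7x7
    nn.BatchNorm2d(16),
    nn.ReLU(),
    conv(16, 32),  # 4x4
    nn.BatchNorm2d(32),
    nn.ReLU(),
    conv(32, 16),  # 2x2
    nn.BatchNorm2d(16),
    nn.ReLU(),
    conv(16, 10),  # 1x1
    nn.BatchNorm2d(10),
    Flatten()      # remove the (1,1) grid, leaving 10 activations
)

learn = Learner(data, model, loss_func=nn.CrossEntropyLoss(), metrics=accuracy)
learn.fit_one_cycle(3, max_lr=0.1)
```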
Let's refactor that a little. Rather than saying conv, BatchNorm, ReLU all the time, fastai already has something called conv_layer
which lets you create conv, BatchNorm, ReLU combinations. It has various options for other tweaks, but the basic version is exactly what I've shown you.
So we can refactor that like so. That's exactly the same neural net.
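The refactored version might look like this (again a sketch following the lesson notebook; conv_layer defaults to stride 1, so we pass stride=2):

```python
def conv2(ni, nf): return conv_layer(ni, nf, stride=2)  # conv + ReLU + batch norm

model = nn.Sequential(
    conv2(1, 8),    # 14x14
    conv2(8, 16),   # 7x7
    conv2(16, 32),  # 4x4
    conv2(32, 16),  # 2x2
    conv2(16, 10),  # 1x1
    Flatten()
)
```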
How can we improve this? What we really want to do is create a deeper network. A very easy way to create a deeper network would be to add a stride 1 conv after every stride 2 conv, because a stride 1 conv doesn't change the feature map size at all, so you can add as many as you like. But there's a problem.
The problem, as the ResNet paper showed, is that if you just stack more layers this way, a 56-layer network actually ends up with worse training error than a 20-layer network. When you see something weird happen, really good researchers don't go, "No, it's not working." They go, "That's interesting." Kaiming He said, "That's interesting. What is going on?" He said, "I don't know, but what I do know is this: I could take this 56-layer network and make a new version of it which is identical but has to be at least as good as the 20-layer network, and here's how. Every 2 convolutions, I am going to take the input to those 2 convolutions and add it to the result of those 2 convolutions." In other words, instead of saying output = c2(c1(x)), he is saying output = x + c2(c1(x)). There are still 56 layers' worth of convolutions in there.
His theory was that it has to be at least as good as the 20-layer version, because it could always just set conv2 and conv1 to a bunch of zero weights for everything except the first 20 layers, since the input x can just go straight through. So this thing, as you can see, is called an identity connection. It's the "identity" function: nothing happens at all. It is also known as a skip connection.
So that was the theory. That is what the paper described as the intuition: what would happen if we created something which has to train at least as well as the 20-layer neural network, because it kind of contains the 20-layer neural network; there is literally a path where you can skip over all the convolutions. So what happened? He won ImageNet that year. He easily won ImageNet that year, and in fact the record-breaking result on ImageNet speed training that we achieved ourselves in the last year used this too. ResNets have been revolutionary.
Here's a trick if you are interested in doing novel research. Anytime you find some model for anything, whether it's for medical image segmentation or some kind of GAN or whatever, that was written a couple of years ago, they might have forgotten to put ResBlocks in. So replace their convolutional path with a bunch of ResBlocks and you will almost always get better results faster. It's a good trick.
So the big picture was this one. Here's what happened. These plots represent the loss surface, and as we can see, adding skip connections makes it much smoother.
In our code, we can create a ResBlock in just the way I described. We create an nn.Module. We create 2 conv_layers. Remember that conv_layer is conv2d, ReLU, BatchNorm. Then in forward, we go conv1 of x, conv2 of that, and then add x.
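As a sketch, that ResBlock looks something like this (a minimal version of the idea; fastai's res_block, mentioned next, does essentially the same thing):

```python
class ResBlock(nn.Module):
    def __init__(self, nf):
        super().__init__()
        self.conv1 = conv_layer(nf, nf)   # stride 1, so the grid size is unchanged
        self.conv2 = conv_layer(nf, nf)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))   # the identity (skip) connection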
There is a res_block function already in fastai, so you can just call res_block instead and pass in how many filters you want.
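Using that, the deeper MNIST model from the lesson notebook becomes roughly the following (reusing the hypothetical conv2 and Flatten from the refactored sketch above):

```python
def conv_and_res(ni, nf): return nn.Sequential(conv2(ni, nf), res_block(nf))

model = nn.Sequential(
    conv_and_res(1, 8),
    conv_and_res(8, 16),
    conv_and_res(16, 32),
    conv_and_res(32, 16),
    conv2(16, 10),
    Flatten()
)
```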
There's something else here: when you create your MergeLayer, you can optionally set dense=True. What happens if you do? If you do, it doesn't go x + x.orig; it goes torch.cat([x, x.orig]). In other words, rather than putting a plus (+) in this connection, it does a concatenate. That is pretty interesting, because what happens is that you have your input coming into your ResBlock, and once you use concatenate instead of plus, it's not called a ResBlock anymore, it's called a DenseBlock. And it is not called a ResNet anymore, it is called a DenseNet.
The DenseNet was invented about a year after the ResNet, and if you read the DenseNet paper, it can sound incredibly complex and different, but actually it's literally identical except that the plus here is replaced with concat. So you have your input coming into your dense block, and you've got a few convolutions in here, and then you've got some output coming out, and then you've got your identity connection, and remember it doesn't plus, it concats, so the channel axis gets a little bit bigger. Then we do another dense block, and at the end of that, we have the result of the convolution as per usual, but this time the identity block is that big.
So you can see that what happens is that with dense blocks it's getting bigger and bigger and bigger, and kind of interestingly the exact input is still here. So actually, no matter how deep you get the original input pixels are still there, and the original layer 1 features are still there, and the original layer 2 features are still there. So as you can imagine, DenseNets are very memory intensive. There are ways to manage this. From time to time, you can have a regular convolution and it squishes your channels back down, but they are memory intensive. But, they have very few parameters. So for dealing with small datasets, you should definitely experiment with dense blocks and DenseNets. They tend to work really well on small datasets.
Also, because it's possible to keep those original input pixels all the way down the path, they work really well for segmentation. For segmentation, you want to be able to reconstruct the original resolution of your picture, so having all of those original pixels still there is super helpful.
TL;DR: DenseNets take a lot of memory because we need to store all these values, but the layers have a small number of parameters. This is why you should try DenseBlocks on problems where you have a small dataset.
:memo: New in fastai v1.0.37: SequentialEx, MergeLayer, and res_block to more easily create ResNet and DenseNet architectures.
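A minimal sketch of how those helpers fit together, based on the description above (an assumption about usage, not a definitive recipe): SequentialEx keeps the original input around as x.orig, and MergeLayer either adds it back or concatenates it.

```python
# dense=False: x + x.orig              -> a ResBlock
# dense=True:  torch.cat([x, x.orig])  -> a DenseBlock (channel count grows)
block = SequentialEx(conv_layer(16, 16),
                     conv_layer(16, 16),
                     MergeLayer(dense=False))
```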
That's ResNets. One of the main reasons to tell you about them, other than the fact that ResNets are awesome, is that these skip connections are useful in other places as well. They are particularly useful in other ways of designing architectures for segmentation. So in building this lesson, I keep trying to take old papers and imagine what that person would have done if they had access to all the modern techniques we have now, and I try to rebuild them in a more modern style.
TL;DR: Jeremy is breaking the records for many different problems by applying modern techniques to old papers. He thinks about what the paper writers might have done differently if they had had access to the newest techniques.
U-Net
What we will do to get there is use this U-Net (unet_learner). We've used U-Nets before; I've improved it a bit since then. We used one when we did the CamVid segmentation, but we didn't understand what I was doing. So we're now in a position where we can understand what I was doing.
Here's the thing: in order to color code this as a pedestrian but that as a bicyclist, it needs to know what it is. It needs to actually know what a pedestrian looks like, and it needs to know where the pedestrian is, and that this is the arm of the pedestrian and not part of their shopping basket. It needs to really understand a lot about this picture to do this task. And it really does do this task. When you look at the results of our top model, I can't find a single incorrect pixel by eye. I know there are a few wrong, but I can't see the ones that are wrong. It's that accurate. So how does it do that?
The way we are doing it to get really, really good results is, not surprisingly, using pre-training. So we start with a ResNet-34, and you can see that:
and if you don't say pretrained=False
, by default you get pretrained=True
, why not.
TL;DR: The model starts with a normal image and then goes through the layers of a pre-trained model. The U's left side is the pre-trained model. It reduces the size until it is pretty small. Then we go back up on the right side of the U until we get back to the same size image we had when we started. We increase the size by doing a stride-half convolution, a.k.a. deconvolution, a.k.a. transposed convolution.
So that's what the U-Net's downsampling path (the left half is called the downsampling path) looks like. Ours is just a ResNet-34. You can see it here with learn.summary(); this is literally a ResNet-34. You can see that the size keeps halving, the channels keep going up, and so forth.
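For reference, building the U-Net learner looks roughly like this, assuming `data` is a segmentation databunch as in the CamVid notebook; the weight decay value is just the one I recall from the lesson and may differ:

```python
from fastai.vision import *

learn = unet_learner(data, models.resnet34, wd=1e-2)   # pretrained=True by default
learn.summary()   # the downsampling (left) half is literally a ResNet-34
```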
Eventually, you get down to a point where, if you use the U-Net architecture, it's 28 by 28 with 1,024 channels. With the ResNet architecture and a 224 pixel input, it would be 512 channels by 7 by 7. So it's a pretty small grid size on this feature map. Somehow, we've got to end up with something which is the same size as our original picture. So how do we do that? How do you do computation which increases the grid size? Well, we don't have a way to do that in our current bag of tricks. We can use a stride 1 conv to do computation and keep the grid size, or a stride 2 conv to do computation and halve the grid size.
This is how you can increase the resolution, and this was the way people did it until maybe a year or two ago. Here's another trick for improving things you find online, because this is actually a dumb way to do it, and it's kind of obvious that it's a dumb way for a couple of reasons. One is that, if you have a look at the shaded area on the left, nearly all of those pixels are white. They're nearly all zeros. What a waste. What a waste of time, what a waste of computation. There's just nothing going on there.
So I've now upscaled from 2 by 2 to 4 by 4. I haven't done any interesting computation, but now, on top of that, I could just do a stride 1 convolution, and now I have done some computation.
An upsample like this is called nearest neighbor interpolation. It's super fast, which is nice. So you can do a nearest neighbor interpolation and then a stride 1 conv, and now you've got some computation which is actually using the information; there are no zeros in the upper left 4x4. And this (one pixel to the right) is kind of nice because it gets a mixture of A's and B's, which is kind of what you would want, and so forth.
Another approach is instead of using nearest neighbor interpolation, you can use bilinear interpolation which basically means instead of copying A to all those different cells you take a weighted average of the cells around it.
For example, if you were looking at what should go here (red), you would kind of go, "oh, it's about 3 A's, 2 C's, 1 D, and 2 B's", and you take the average; not exactly, but roughly a weighted average. Bilinear interpolation you'll find all over the place; it's a pretty standard technique. Anytime you look at a picture on your computer screen and change its size, it's doing bilinear interpolation. So you can do that and then a stride 1 conv. That is what people were using, well, what people still tend to use. That's as much as I'm going to teach you in this part. In part 2, we will learn what the fastai library is actually doing behind the scenes, which is something called pixel shuffle, also known as sub-pixel convolution. It's not dramatically more complex, but complex enough that I won't cover it today. They're all the same basic idea: all of these things basically let us do a convolution that ends up with something that's twice the size.
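A minimal sketch of that "interpolate, then stride 1 conv" upsampling step in plain PyTorch (pixel shuffle, which fastai actually uses, is left for part 2):

```python
import torch.nn as nn

def upsample_block(ni, nf):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),             # or mode='bilinear'
        nn.Conv2d(ni, nf, kernel_size=3, stride=1, padding=1),   # the actual computation
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(nf),
    )
```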
That gives us our upsampling path. It lets us go from 28 by 28 to 56 by 56 and keep on doubling the size, so that's good. And that was it until U-Net came along. That's what people did, and it didn't work really well, which is not surprising, because in this 28 by 28 feature map, how the heck is it going to have enough information to reconstruct a 572 by 572 output space? That's a really tough ask. So you tended to end up with these things that lacked fine detail.
This is the U-Net code from fastai, and the key thing that comes in is the encoder. The encoder refers to the downsampling part of the U-Net, in other words, in our case, a ResNet-34. In most cases they have this specific older-style architecture, but like I said, replace any older-style architecture bits with ResNet bits and life improves, particularly if they're pre-trained. So that certainly happened for us. So we start with our encoder.
So the layers of our U-Net are an encoder, then batch norm, then ReLU, and then middle_conv, which is just (conv_layer, conv_layer). Remember, conv_layer is a conv, ReLU, batch norm in fastai. So that middle conv is these two extra steps here at the bottom:
It's doing a little bit of computation. It's kind of nice to add more layers of computation where you can. So: encoder, batch norm, ReLU, and then two convolutions. Then we enumerate through these indexes (sfs_idxs). What are these indexes? I haven't included the code, but basically we figure out the layer number where each of the stride 2 convs occurs and store them in an array of indexes. Then we can loop through that and say, for each one of those points, create a UnetBlock, telling it how many upsampling channels there are and how many cross-connection channels. These gray arrows are called cross connections, at least that's what I call them.
So really all the work is going on in a UnetBlock, and the UnetBlock has to store the activations at each of these downsampling points. The way to do that, as we learned in the last lesson, is with hooks. So we put hooks into the ResNet-34 to store the activations each time there's a stride 2 conv, and you can see here we grab the hook (self.hook = hook). Then we grab the result of the stored value in that hook, and we literally just go torch.cat to concatenate the upsampled convolution with the result of the hook, which we chuck through batch norm, and then we do two convolutions to it.
Actually, something you could play with at home is pretty obvious here (the very last line). Anytime you see two convolutions like this, there's an obvious question: what if we used a ResNet block instead? So you could try replacing those two convs with a ResNet block; you might find you get even better results. That's the kind of thing I look for when I look at an architecture: "oh, two convs in a row, that probably should be a ResNet block."
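A simplified sketch of the idea in the UnetBlock (not fastai's actual implementation, which uses pixel shuffle and a few more details): upsample, grab the hooked encoder activations, concatenate, and run two convs.

```python
from fastai.vision import *   # for conv_layer, torch, nn

class SimpleUnetBlock(nn.Module):
    def __init__(self, up_in_c, hook_c, out_c, hook):
        super().__init__()
        self.hook = hook                                         # stores the encoder activations
        self.upconv = nn.ConvTranspose2d(up_in_c, up_in_c // 2,
                                         kernel_size=2, stride=2)  # double the grid size
        self.bn = nn.BatchNorm2d(hook_c)
        self.conv1 = conv_layer(up_in_c // 2 + hook_c, out_c)
        self.conv2 = conv_layer(out_c, out_c)                    # the two convs mentioned above

    def forward(self, up_in):
        s = self.bn(self.hook.stored)           # cross connection from the downsampling path
        up = self.upconv(up_in)
        x = torch.cat([up, s], dim=1)           # concatenate on the channel axis
        return self.conv2(self.conv1(x))
```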
Okay, so that's U-Net and it's amazing to think it preceded ResNet, preceded DenseNet. It wasn't even published in a major machine learning venue. It was actually published in MICCAI which is a specialized medical image computing conference. For years, it was largely unknown outside of the medical imaging community. Actually, what happened was Kaggle competitions for segmentation kept on being easily won by people using U-Nets and that was the first time I saw it getting noticed outside the medical imaging community. Then gradually, a few people in the academic machine learning community started noticing, and now everybody loves U-Net, which I'm glad because it's just awesome.
So identity connections, regardless of whether they're a plus style or a concat style, are incredibly useful. They can basically get us close to the state of the art on lots of important tasks. So I want to use them on another task now.
The next task we are going to look at is image restoration. We start with an image, but instead of creating a segmentation mask, we try to create a better quality version of the image.
We start with low-resolution images that have text written on top of them, and the goal is to create high-resolution images with the text removed.
The easiest way to create this kind of dataset is to take some good images and crappify them.
I'm not going to write out the code again, so if you are interested in the exact code, watch the video again or check the code in the lesson notebooks. It uses the same method for a slightly different task, and in my opinion there is nothing essentially new in it.
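For the general idea, here is a rough sketch of a crappifier (not the exact lesson code): shrink the image, stamp a random number on it, and save it as a low-quality JPEG. The size and quality range are arbitrary example values.

```python
import random
from PIL import Image, ImageDraw

def crappify(src_path, dest_path, size=96):
    img = Image.open(src_path).convert('RGB')
    img = img.resize((size, size), resample=Image.BILINEAR)       # low resolution
    quality = random.randint(10, 70)                              # random JPEG quality
    w, h = img.size
    ImageDraw.Draw(img).text((random.randint(0, w // 2), random.randint(0, h // 2)),
                             str(quality), fill=(255, 255, 255))  # write a number on top
    img.save(dest_path, quality=quality)   # dest_path should end in .jpg for JPEG artifacts
```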
[:warning: WIP: incomplete video transcribe for this part :warning:]
Generative Adversarial Network (GAN)
[:warning: WIP: incomplete video transcribe for this part :warning:]
We have a crappy image which we run through the generator to get a prediction. We compare that to the high-resolution image using MSE. This alone doesn't do well, because the pixel difference is not that big: the output looks similar, but details like the cat's eyes can still be blurry. That is why we build another model (a discriminator, a.k.a. critic) which gets both images and tries to predict which picture is the original. In other words, that is just a classifier, which we already know how to build. Our plan is to create images so good that the classifier mislabels them as real pictures. We use the critic as a loss function. We first train the generator to make images as good as possible using MSE. Then we train the critic to recognize which images are generated, then train the generator against the critic, and continue this loop until the model produces good results. This is called a GAN. The idea is that the two loss functions capture different kinds of things, and by using them alternately we get better results than by choosing just one.
When we create the critic on our own, we can't use a ResNet as the base model. We will learn more about why in part 2, but for now, just use the gan_critic() function. There is also a GAN learner in fastai which takes a generator and a critic as input and trains the model using both.
When we are training a GAN, both the generator and critic losses stay about the same, because when one gets better it makes the other's job harder. The only way to see how they are doing is to look at the results.
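A hedged sketch of wiring those pieces together with the fastai v1 GAN API, assuming learn_gen (the pre-trained U-Net generator) and data_crit (a databunch of real vs. generated images) were built in earlier steps as in the lesson notebook; the exact weights and arguments follow my reading of that notebook and may differ:

```python
from fastai.vision import *
from fastai.vision.gan import *

learn_crit = Learner(data_crit, gan_critic(),
                     loss_func=AdaptiveLoss(nn.BCEWithLogitsLoss()))

# GANLearner alternates between training the generator against the critic and training the critic.
learn = GANLearner.from_learners(learn_gen, learn_crit,
                                 weights_gen=(1., 50.),   # pixel-MSE weight vs. critic weight
                                 show_img=False)
learn.fit(40, lr=1e-4)
```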
To get better results with a GAN, as in this restoration task, Jeremy showed a little trick. As we know, the activations of every layer detect some kind of feature. So we put the generative model's output and the target y image through an ImageNet model and look at how different the features are that it recognizes. For example, if the ImageNet model's activations in layer 3 recognize cat eyes in the target image, then the generated image should trigger that same feature detector in that layer. This way we can use the ImageNet model's features to teach the generative model and get much better results.
U-Nets are for when the size of your output is similar to the size of your input and kind of aligned with it. There's no point having cross connections if that level of spatial resolution in the output isn't useful. Any kind of generative modelling (and segmentation is kind of generative modelling: it's generating a picture which is a mask of the original objects), probably anything where you want the resolution of the output to be of the same fidelity as the resolution of the input, is a good fit. Obviously something like a classifier makes no sense: in a classifier, you just want the downsampling path, because at the end you just want a single number, like is it a dog or a cat, or what kind of pet is it, or whatever.
It is kind of interesting, because the dataset we use here is the LSUN bedroom dataset, which we provide in our URLs, and as you can see it has bedrooms, lots of bedrooms. The approach we use in this case is to just ask, "can we create a bedroom?" What we actually do is that the input to the generator isn't an image that we clean up; we actually feed the generator random noise. The generator's task is then: can you turn random noise into something which the critic can't tell apart from a real bedroom?
We are not doing any pre-training here, or any of the other stuff that makes it fast and easy. This is a very traditional approach. But you can see you still just go GANLearner.wgan (this kind of older-style approach), pass the data, the generator, and the critic in the usual way, and call fit. You will see that in this case we have show_image on. After epoch 1 it's not creating great bedrooms, nor after 2 or 3. You can really see that in the early days these kinds of GANs don't do a great job of anything. Eventually, after a couple of hours of training, it's producing somewhat bedroom-ish things. Anyway, it's a notebook you can play with and have a bit of fun.
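For reference, a rough sketch of the WGAN setup (the basic_generator/basic_critic helpers and argument values are my recollection of the fastai v1 API as used in the lesson and may differ; `data` is the LSUN bedroom databunch built earlier):

```python
from functools import partial
from fastai.vision import *
from fastai.vision.gan import *

# The generator turns a noise vector into a 64x64 image; the critic scores real vs. fake.
generator = basic_generator(in_size=64, n_channels=3, n_extra_layers=1)
critic    = basic_critic(in_size=64, n_channels=3, n_extra_layers=1)

learn = GANLearner.wgan(data, generator, critic, switch_eval=False,
                        opt_func=partial(optim.Adam, betas=(0., 0.99)), wd=0.)
learn.fit(30, lr=2e-4)
```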
Super Resolution
It shares something with GANs: after we go through our generator, which they called the image transform net, you can see they've got this kind of U-Net shaped thing. They didn't actually use U-Nets, because at the time this came out nobody in the machine learning world knew about U-Nets; nowadays, of course, we would use U-Nets. Anyway, it's something U-Net-ish. I should mention that in this kind of architecture, where we have a downsampling path followed by an upsampling path, the downsampling path is very often called the encoder (as you saw in our code) and the upsampling path is very often called the decoder.
In generative models generally, including generative text models, neural translation, stuff like that, they tend to be called the encoder and the decoder. We have this generator and we want a loss function that says, "is the thing that is created like the thing that we want?" The way they do that is they take the predictions (remember, y_hat is what we normally use for a prediction from a model) and put them through a pre-trained ImageNet network. At the time this came out, the pre-trained ImageNet network they were using was VGG. It's kind of old now, but people still tend to use it because it works fine for this process. So they take the prediction and put it through a VGG network pre-trained on ImageNet; it doesn't matter much which one it is.
Normally the output of that will tell you, "is this generated thing a dog or a cat or an aeroplane or a fire engine or whatever?" But in the process of getting to that final classification, it goes through many different layers. In this case, they color-coded all the layers with the same grid size and feature map size in the same color. So every time we switch colors, we're switching grid size: there is a stride 2 conv, or in VGG's case a max pooling layer, which is a similar idea. What we could do is this: let's not take the final output of the VGG model on this generated image, but let's take something in the middle. Let's take the activations of some layer in the middle. Those activations might be a feature map of, say, 256 channels by 28 by 28. Those 28x28 grid cells will roughly semantically say things like "in this part of the 28x28 grid, is there something that looks furry, or shiny, or circular, or like an eyeball, or whatever?" We then take the target (the actual y value), put it through the same pre-trained VGG network, pull out the activations of the same layer, and then do a mean squared error comparison. It will say things like, "in the real image, grid cell (1, 1) of that 28x28 feature map is furry and blue and round-shaped, and in the generated image it's furry and blue but not round-shaped", so it's only an okay match. That ought to go a long way towards fixing our eyeball problem, because the feature map is going to say, "there are eyeballs here (in the target), but there aren't here (in the generated image)", so to do a better job at that spot it has to make better eyeballs. That is the idea. That's what we call feature losses, or what Johnson et al. called perceptual losses.
FeatureLoss class:
hook outputs
make features
forward pass
feature losses
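A simplified sketch of the pieces listed above (not the exact lesson code; the VGG layer indices and weights here are just example values): grab the activations of a few VGG layers, run both the prediction and the target through, and compare those activations alongside a plain pixel loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16_bn

class FeatureLoss(nn.Module):
    def __init__(self, layer_ids=(22, 32, 42), layer_wgts=(5., 15., 2.)):
        super().__init__()
        vgg = vgg16_bn(pretrained=True).features.eval()
        for p in vgg.parameters(): p.requires_grad_(False)   # VGG is only a fixed feature extractor
        self.vgg, self.layer_ids, self.wgts = vgg, layer_ids, layer_wgts

    def make_features(self, x):
        "Collect the activations of the chosen layers (the 'hook outputs' / 'make features' step)."
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids: feats.append(x)
        return feats

    def forward(self, pred, target):
        pred_feat, targ_feat = self.make_features(pred), self.make_features(target)
        loss = F.l1_loss(pred, target)                        # plain pixel loss
        for f_p, f_t, w in zip(pred_feat, targ_feat, self.wgts):
            loss = loss + w * F.l1_loss(f_p, f_t)             # feature (perceptual) losses
        return loss
```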
[:warning: WIP: incomplete video transcribe for this part going through the notebook :warning:]
[:warning: WIP: incomplete video transcribe for this part :warning:]
Yeah, pretty much. We don't fully know yet. It's a pretty new area, and there are a lot of opportunities there. We will be looking at some in a moment, actually.
[:warning: WIP: incomplete video transcribe for this part :warning:]
Here is what we have learned so far in the course, some of the main things.
We've learned that neural nets consist of sandwiches of layers of affine functions, which are basically matrix multiplications (or a slightly more general version of them), and nonlinearities like ReLU. We learned that the results of those calculations are called activations, and the things that go into those calculations are called parameters. The parameters are initially randomly initialized, or we copy them over from a pre-trained model, and then we train them with SGD or faster versions (momentum, Adam). We learned that convolutions are a particular kind of affine function that works great for autocorrelated data, things like images. We learned about batch norm, dropout, data augmentation, and weight decay as ways of regularizing models, and also that batch norm helps train models quickly. Today we learned about Res/dense blocks. We've obviously learned about image classification and regression, embeddings, categorical and continuous variables, collaborative filtering, language models and NLP classification, and then segmentation, U-Net, and GANs.
So go over these things and make sure you feel comfortable with each of them. If you've only watched this series once, you definitely won't be; people normally watch it three times or so to really understand the details.
One thing that hasn't been covered here yet is RNNs. So that's the last thing. We're going to do RNNs.
[:warning: WIP: incomplete video transcribe for this part :warning:]
[:warning: WIP: incomplete video transcribe for this part :warning:]
[:warning: WIP: incomplete video transcribe for this part :warning:]
Same thing with a loop:
The code is also the same, but this time there is a loop.
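A minimal sketch of that looped model, following the shape of the lesson's human-numbers notebook (nv is the vocabulary size and nh the hidden size, defined elsewhere; the exact layer names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoopModel(nn.Module):
    def __init__(self, nv, nh):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)   # input to hidden (the "green arrow")
        self.h_h = nn.Linear(nh, nh)      # hidden to hidden (the "orange arrow")
        self.h_o = nn.Linear(nh, nv)      # hidden to output
        self.bn  = nn.BatchNorm1d(nh)
        self.nh  = nh

    def forward(self, x):                 # x: (batch, sequence) of token ids
        h = torch.zeros(x.shape[0], self.nh, device=x.device)
        for i in range(x.shape[1]):       # the loop: the same layers reused at every step
            h = h + self.i_h(x[:, i])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)                # predict the next token from the final state
```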
[:warning: WIP: incomplete video transcribe for this part :warning:]
Multi fully connected model:
Maintain state:
[:warning: WIP: incomplete video transcribe for this part :warning:]
PyTorch nn.RNN:
2-layer GRU:
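A hedged sketch of the 2-layer GRU version: nn.GRU replaces the hand-written loop, and the hidden state is kept between batches but detached so backprop is truncated (bptt). nv, nh, and bs are placeholders defined elsewhere, and the lesson version also applies batch norm over time, which is omitted here for simplicity.

```python
import torch
import torch.nn as nn

class GRUModel(nn.Module):
    def __init__(self, nv, nh, bs):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.rnn = nn.GRU(nh, nh, num_layers=2, batch_first=True)
        self.h_o = nn.Linear(nh, nv)
        self.h   = torch.zeros(2, bs, nh)        # one hidden state per GRU layer

    def forward(self, x):                        # x: (batch, sequence) of token ids
        res, h = self.rnn(self.i_h(x), self.h)   # res: output at every time step
        self.h = h.detach()                      # keep the state, truncate backprop
        return self.h_o(res)                     # predict the next token at every step
```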
But here's the thing, when you think about this, think about it without the loop.
It looks like this. It just keeps on going, and we've got a bptt of 20, so there are 20 layers of this. We know from the "Visualizing the Loss Landscape" paper that deep networks have awful, bumpy loss surfaces. So when you start creating long timescales and multiple layers, these things get impossible to train. There are a few tricks you can do. One thing is to add skip connections, of course.
What people normally do instead is, rather than just adding these together, they use a mini neural net inside to decide how much of the green arrow to keep and how much of the orange arrow to keep. When you do that, you get something that is called either a GRU or an LSTM, depending on the details of that mini neural net; we will learn about those details in part 2. They really don't matter much, frankly. So we can now say let's create a GRU instead. It's just like what we had before, but it will handle longer sequences in deeper networks. Let's use 2 layers. And we are up to 75%, then 81%.
TL;DR: This technique can be used for text labeling but also for many other tasks.
[:warning: WIP: incomplete video transcribe for this part :warning:]
OK. So that's it! That's deep learning, or at least the practical pieces of it from my point of view. Having watched this once, you won't get it all, and I don't recommend that you watch it so slowly that you do get it all the first time. Go back, look at it again, take your time, and there will be bits where you go, "oh, now I see what you're saying", and then you'll be able to implement things you couldn't implement before, and you'll be able to dig in more than before.
Definitely go back and do it again. As you do, write code, not just for yourself: put it on GitHub. It doesn't matter if you think it's great code or not. The fact that you are writing code and sharing it is impressive, and you'll get feedback if you tell people on the forum, "hey, I wrote this code. It's not great, but it's my first effort. Does anything jump out at you?" People will say things like, "oh, that bit was done well. Hey, but did you know for this bit you could use this library and save some time?" You'll learn a lot by interacting with your peers.
As you've noticed, I started introducing more and more papers. Part 2 will have a lot more papers, so it's a good time to start reading some of the papers that have been introduced in this section. All the bits that are derivations and theorems and lemmas, you can skip them; I do. They add almost nothing to your understanding of practical deep learning. But the bits that say why we are solving this problem, and what the results are, and so forth, are really interesting.
Try to write English prose. Not English prose that you want to be read by Geoffrey Hinton and Yann LeCun, but English prose you want to be read by you as of six months ago. Because there are a lot more people in the audience like you as of six months ago than there are Geoffrey Hintons and Yann LeCuns. That's the person you understand best. You know what they need.
Go and get help, and help others. Tell us about your success stories. Perhaps the most important thing is to get together with others. People's learning gets so much better if you have that social experience. So start a book club, get involved in meetups, create study groups, and build things. And again, it doesn't have to be amazing. Build something that you think would make the world a little bit better if it existed, or that you think would be slightly delightful to your two-year-old, or that you just want to show your brother the next time they come around, whatever. Just finish something. Finish something, and then try to make it a little better.
Then come back for part 2, where we will be looking at lots of interesting stuff, in particular going deep into the fastai codebase to understand exactly how we built it. We will actually go through it as we were building it; we created notebooks along the way showing where we were each day, so we will see the software development process itself. We'll talk about the process of doing research, how to read academic papers, how to turn math into code, and then a whole bunch of additional types of models that we haven't seen yet. So we will be going beyond practical deep learning into actually cutting-edge research.
Ask Me Anything (AMA)
I hear that all the time, so I thought I should answer it, and it got a few votes. People who have come to our study group are always shocked at how disorganized and incompetent I am. So I often hear people saying things like, "Oh wow, I thought you were this deep learning role model and I'd get to see how to be like you. Now I'm not sure I want to be like you at all." :laughing: :laughing: :laughing: For me, it's all about just having a good time with it. I never have many plans. I just want to finish what I start. If you're not having fun with it, it's really hard to continue, because there's a lot of frustration in deep learning. It's not like building a web app, where it's "authentication, check; backend service watchdog, check; user credentials, check", and you're making progress. Whereas for stuff like this GAN work we've been doing for a couple of weeks, it's just: it's not working, it's not working, that also didn't work, that also didn't work, until "I can tell it's amazing. OMG, it's a cat!" That's kind of how it is. You don't get that regular feedback, so you've got to have fun with it. My day is: I don't do any meetings, I don't do coffees, I don't watch TV, I don't play computer games. I spend lots of time with my family, lots of time exercising, lots of time reading and coding and doing things I like. The main thing is to finish something, like properly finish it. So when you get to that point where you think you're 80% of the way through, but you haven't quite created the README yet and the install process is still a bit clunky: that is what 99% of GitHub projects look like. You'll see the README says, "TODO: complete baseline experiments, document, blah blah blah." Don't be that person. Just do something properly and finish it. Maybe get some other people around you to work with you so you're doing it together, and, you know, get it done.
I still feel exactly the same way as I did three years ago when we started this, which is that it's all about transfer learning. It's under-appreciated and under-researched. Every time we put transfer learning into something, we make it much better. Our academic paper on transfer learning for NLP has been one piece of changing the direction of NLP this year; it made it all the way to the New York Times. Just a stupidly obvious little thing that we threw together. I remain excited about that. I remain unexcited about reinforcement learning for most things. I don't see it being used by normal people for normal things for nearly anything. It's an incredibly inefficient way to solve problems which are often solved more simply and more quickly in other ways. It probably has a role in the world, but a limited one, and not in most people's day-to-day work.
Just code. Just code all the time. I know it's perfectly possible to get to this point of the course without having written any code yet; I hear from people who have. And if that's you, it's OK. You just go through and do it again, and this time write code. Look at the shapes of your inputs, look at your outputs, and make sure you know how to grab a mini-batch, look at its mean and standard deviation, and plot it. There's so much material that we've covered. If you can get to a point where you can rebuild those notebooks from scratch without too much cheating (and when I say from scratch, I mean using the fastai library, not from scratch from scratch), you'll be at the top edge of practitioners, because you'll be able to do all of these things yourself, and that's really, really rare. And that will put you in a great position for part 2.
Well, like I say, I don't make plans. I just piss around. I mean, our only plan for fast.ai as an organization is to make deep learning accessible as a tool for normal people to use for normal stuff. As long as you need code, we've failed at that, because 99.8% of the world can't code. The main goal would be to get to a point where it's not a library but a piece of software that doesn't require code. It certainly shouldn't require a lengthy, hard-working course like this one. I want to get rid of the course, get rid of the code. I want to make it so you can do useful stuff quickly and easily. That's maybe 5 years away, maybe longer.
Alright. I hope to see you back here for part 2. Thank you. :clap: :clap: :clap: