Lesson 8 - Object Detection
These are my personal notes from the fast.ai course, and they will continue to be updated and improved as I find anything useful and relevant while I continue reviewing the course to study in more depth. Thanks for reading and happy learning!
Topics
A quick recap of what we learned in part 1.
Introduces the new focus of this part of the course: cutting edge research.
We’ll show you how to read an academic paper in a way that you don’t get overwhelmed by the notation and writing style.
Another difference in this part is that we’ll be digging deeply into the source code of the fastai and PyTorch libraries.
We’ll see how to use Python’s debugger to deepen your understanding of what’s going on, as well as to fix bugs.
The main topic of this lesson is object detection, which means getting a model to draw a box around every key object in an image, and label each one correctly.
Two main tasks: find and localize the objects, and classify them; we’ll use a single model to do both these at the same time.
Such multi-task learning generally works better than creating different models for each task—which many people find rather counter-intuitive.
To create this custom network whilst leveraging a pre-trained model, we’ll use fastai's flexible custom head architecture.
Lesson Resources
Jupyter Notebook and Code
Assignments
Papers
Must read
WIP
Additional papers (optional)
WIP
Other Resources
Blog Posts and Articles
Lesson summary by Avinash
Other Useful Information
Tips and Tricks
Useful Tools and Libraries
Integrated Development Environment (IDE)
If you don't have an IDE or lightweight editor, download one.
PyCharm Community Edition is free
Bounding box annotation tools
My Notes
Where We Are
What we've learnt so far.
Differentiable Layers
Transfer Learning
Architecture Design
Handling over-fitting
Embeddings
From Part 1 "practical" to Part 2 "cutting edge".
Goals and approach.
Part 1 really was all about introducing best practices in deep learning.
Part 2 is cutting edge deep learning for coders, and what that means is Jeremy often does not know the exact best parameters, architecture details, and so forth to solve a particular problem. We do not necessarily know if it’s going to solve a problem well enough to be practically useful.
:warning: Be careful with sample code [00:13:20]! With the code academics put up to go along with their papers, or example code somebody else has written on GitHub, Jeremy nearly always finds there is some massive critical flaw. So be careful of taking code from online resources, and be ready to do some debugging.
It's time to start reading papers.
Each week, we will be implementing a paper or two. In academic papers, people love to use Greek letters. They also hate to refactor, so you will often see a page-long formula where, when you look at it carefully, you'll realize the same sub-equation appears 8 times. Academic papers are a bit weird, but in the end, it's the way the research community communicates its findings, so we need to learn to read them.
Part 2's topics.
This lesson will start on object detection.
Object Detection
Two main differences from what we are used to:
1. We have multiple things that we are classifying.
This part is not new; we did this in part 1 with the Planet satellite tutorial.
2. Bounding boxes around what we are classifying.
The box has the object entirely in it, but is no bigger than it needs to be. For these object detection datasets, we are looking for a fixed pool of object classes (horse, person, car, tree, etc.), but not necessarily every object in the image.
Stages
Find the largest object.
Find where it is.
Try and do both at the same time.
Pascal Notebook
Start with the Pascal Notebook.
You may find a line `torch.cuda.set_device(3)` left behind, which will give you an error if you only have one GPU. This is how you select a GPU when you have multiple, so just set it to zero (`torch.cuda.set_device(0)`) or take out the line entirely.
Pascal VOC
We will be looking at the Pascal VOC dataset. It's quite slow, so you may prefer to download from this mirror. There are two different competition/research datasets, from 2007 and 2012. We'll be using the 2007 version. You can use the larger 2012 for better results, or even combine them (but be careful to avoid data leakage between the validation sets if you do this).
:memo: How to download the dataset:
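A minimal sketch in Python, assuming the official VOC 2007 archive URL and a data directory such as `data/pascal` (the mirror linked above may be faster, and the COCO-style JSON annotations used below are a separate download linked from the notebook):

```python
import tarfile
from pathlib import Path
from urllib.request import urlretrieve

PATH = Path('data/pascal')
PATH.mkdir(parents=True, exist_ok=True)

# official VOC 2007 trainval archive (assumption: the mirror above may be faster)
url = 'http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar'
tar_path = PATH/'VOCtrainval_06-Nov-2007.tar'
if not tar_path.exists():
    urlretrieve(url, str(tar_path))
with tarfile.open(tar_path) as f:
    f.extractall(PATH)
```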
Pathlib
Unlike previous lessons, we are using the Python 3 standard library `pathlib` for our paths and file access. Note that it returns an OS-specific class (on Linux, `PosixPath`), so your output may look a little different. Most libraries that take paths as input can accept a pathlib object, although some (like `cv2`) can't, in which case you can use `str()` to convert it to a string.
:memo: pathlib cheat sheet
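A few common operations, with illustrative paths:

```python
from pathlib import Path

PATH = Path('data/pascal')
list(PATH.iterdir())                      # directory contents, as Path objects
(PATH/'pascal_train2007.json').exists()   # the / operator joins path segments
str(PATH/'some_image.jpg')                # plain string for libraries like cv2
```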
Python 3 Generators
Generators are a simple and powerful tool for creating iterators. They are written like regular functions but use the `yield` statement whenever they want to return data.
The reason that things generally return generators is that if the directory had 10 million items in it, you don't necessarily want a 10-million-long list. Generators let you do things "lazily".
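A minimal example:

```python
def squares(n):
    for i in range(n):
        yield i * i            # hand back one value, then pause until asked again

gen = squares(10_000_000)      # returns instantly; no 10-million-long list is built
next(gen)                      # 0
next(gen)                      # 1
list(squares(5))               # [0, 1, 4, 9, 16] - force a list when you want one
```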
Loading Pascal VOC Annotations
The `pascal_train2007.json` file contains not the images themselves but the bounding boxes and the classes of the objects.
:memo: `json.load()` deserializes a file-like object containing a JSON document to a Python object.
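A sketch, reusing the `PATH` from above:

```python
import json

trn_j = json.load((PATH/'pascal_train2007.json').open())
trn_j.keys()              # includes 'images', 'annotations', 'categories'
trn_j['annotations'][0]   # one annotation: bbox, image_id, category_id, ...
```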
Images
Annotations
Schema:
- `bbox`: column (x coord, origin at top left), row (y coord, origin at top left), height, width
- `image_id`: you'd have to join this up with `trn_j[IMAGES]` (above) to find `file_name` etc.
- `category_id`: see `trn_j[CATEGORIES]` (below)
- `segmentation`: polygon segmentation (we will not be using them)
- `ignore`: we will ignore the `ignore` flags
- `iscrowd`: specifies that it is a crowd of that object, not just one of them
Categories
Convert VOC's Bounding Box
Convert VOC's height/width into top-left/bottom-right, and switch x/y coords to be consistent with numpy.
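A sketch of the conversion, assuming a VOC box of `[x, y, width, height]` (the notebook's helper for this is called `hw_bb`):

```python
import numpy as np

def hw_bb(bb):
    # VOC gives [x, y, width, height]; return [top-left row, top-left col,
    # bottom-right row, bottom-right col], so rows come first as in numpy
    return np.array([bb[1], bb[0], bb[3] + bb[1] - 1, bb[2] + bb[0] - 1])
```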
Python's defaultdict
A `defaultdict` is useful any time you want to have a default dictionary entry for new keys [00:55:05]. If you try to access a key that doesn't exist, it magically makes itself exist, and it sets itself equal to the return value of the function you specify (in this case `lambda: []`).
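For example:

```python
from collections import defaultdict

trn_anno = defaultdict(lambda: [])   # unknown keys default to an empty list
trn_anno[12].append(('bbox', 7))     # key 12 springs into existence as []
trn_anno[12]                         # [('bbox', 7)]
trn_anno[99]                         # [] - merely accessing a key creates it
```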
Convert Back to VOC Bounding Box Format
Some libs take VOC-format bounding boxes, so this lets us convert back when required [1:02:23]:
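A sketch of the inverse of `hw_bb` above (the notebook calls it `bb_hw`):

```python
def bb_hw(a):
    # inverse of hw_bb: [row0, col0, row1, col1] back to VOC [x, y, width, height]
    return np.array([a[1], a[0], a[3] - a[1] + 1, a[2] - a[0] + 1])
```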
Fastai's `open_image` function
Fastai uses OpenCV. TorchVision uses PyTorch tensors for data augmentations etc. A lot of people use Pillow (`PIL`). Jeremy did a lot of testing of all of these, and he found OpenCV was about 5 to 10 times faster than TorchVision.
Matplotlib
:bookmark: Note to self: as we will use Matplotlib frequently, it's worth prioritizing learning it properly.
Tricks:
`plt.subplots`: a useful wrapper for creating plots, regardless of whether you have more than one subplot.
:memo: Matplotlib has an optional object-oriented API which I think is much easier to understand and use (although few examples online use it!).
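A quick sketch of the object-oriented style:

```python
import numpy as np
import matplotlib.pyplot as plt

im = np.random.rand(64, 64)                      # stand-in image
fig, axes = plt.subplots(2, 2, figsize=(8, 8))   # one call, any grid shape
for ax in axes.flat:                             # axes is a 2x2 array of Axes
    ax.imshow(im)                                # call methods on each Axes object
    ax.set_axis_off()
plt.show()
```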
Visible text regardless of background color.
A simple but rarely used trick to make text visible regardless of background is to use white text with a black outline, or vice versa. Here's how to do it in Matplotlib:
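A minimal sketch using Matplotlib's `patheffects` (the notebook wraps this in a small `draw_outline` helper):

```python
import matplotlib.pyplot as plt
from matplotlib import patheffects

fig, ax = plt.subplots()
txt = ax.text(0.2, 0.5, 'horse', color='white', fontsize=16, weight='bold')
# draw a black stroke around the glyphs, then the normal white fill on top
txt.set_path_effects([patheffects.Stroke(linewidth=3, foreground='black'),
                      patheffects.Normal()])
plt.show()
```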
Packaging it all up
When you are working with a new dataset, getting to the point that you can rapidly explore it pays off.
Next Complex Step - Largest Item Classifier
Rather than trying to solve everything at once, let’s make continual progress. We know how to find the biggest object in each image and classify it, so let’s start from there.
Steps we need to do (a sketch follows the list):
Go through each of the bounding boxes in an image and get the largest one.
Sort the annotations for each image by bounding box size (descending).
Now we have a dictionary from image id to a single bounding box - the largest for that image.
Plot the bounding box.
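A minimal sketch of finding the largest box per image, assuming the `trn_anno` dictionary from earlier, with each annotation stored as `(bbox, class_id)` and boxes as numpy arrays of `[row0, col0, row1, col1]`:

```python
import numpy as np

def get_lrg(boxes):
    # sort by box area (height * width), biggest first
    return sorted(boxes, key=lambda x: np.prod(x[0][-2:] - x[0][:2]),
                  reverse=True)[0]

# image id -> the single largest (bbox, class_id) for that image
trn_lrg_anno = {img_id: get_lrg(boxes) for img_id, boxes in trn_anno.items()}
```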
Model Data
:memo: Often it's easiest to simply create a CSV of the data you want to model, rather than trying to create a custom dataset.
Here we use Pandas to help us create a CSV of the image filename and class.
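A sketch, assuming the lookup dicts built while exploring the dataset (`trn_ids`, `trn_fns` mapping id to filename, `cats` mapping class id to name, and `trn_lrg_anno` from the previous step; names follow the notebook):

```python
import pandas as pd

df = pd.DataFrame({'fn': [trn_fns[o] for o in trn_ids],
                   'cat': [cats[trn_lrg_anno[o][1]] for o in trn_ids]},
                  columns=['fn', 'cat'])
df.to_csv(PATH/'tmp/lrg.csv', index=False)
```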
:bookmark: note to self: learn Pandas.
Model
From here it’s just like lesson 2 "Dogs vs Cats"!
One thing that is different is `crop_type`.
For bounding boxes, we do not want to crop the image, because unlike ImageNet, where the thing we care about is pretty much in the middle and pretty big, a lot of the things in object detection are quite small and close to the edge. By setting `crop_type` to `CropType.NO`, it will not crop.
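A sketch with the fastai 0.7 API used in the course (paths and sizes are assumptions):

```python
from fastai.conv_learner import *   # fastai 0.7, as used in the course

sz, f_model = 224, resnet34
# CropType.NO squishes each rectangular image to sz x sz instead of cropping
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO)
md = ImageClassifierData.from_csv(PATH, 'VOCdevkit/VOC2007/JPEGImages',
                                  PATH/'tmp/lrg.csv', tfms=tfms)
```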
Data Loaders
You already know that inside a model data object we have a bunch of things, including a training data loader and a training data set. The main thing to know about a data loader is that it is an iterator: each time you grab the next iteration from it, you get a mini-batch.
If you want to grab just a single batch, this is how you do it:
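In the fastai 0.7 idiom, something like:

```python
x, y = next(iter(md.val_dl))   # one mini-batch of images and labels
x.shape, y.shape               # e.g. torch.Size([64, 3, 224, 224]), torch.Size([64])
```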
Training with ResNet34
Accuracy is still at 79%.
Accuracy isn’t improving much — since many images have multiple different objects, it’s going to be impossible to be that accurate.
Training Results
Let's look at the 20 classes.
It's doing a pretty good job of classifying the largest object.
In the next stage, we create a bounding box around an object.
Debugging
How to understand unfamiliar code:
Run each line of code step by step, print out the inputs and outputs.
Method 1: Break a large piece of code out of a cell and put the lines in separate cells, one line per cell. For example, take the contents of a loop, copy it, create a cell above it, paste it, un-indent it, and set `i=0`.
Method 2: Use the Python debugger `pdb` to step through code.
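A minimal example with the standard library's `pdb`:

```python
import pdb

def buggy_step(i):
    val = i * 2
    pdb.set_trace()   # pauses here and drops you into the (Pdb) prompt
    return val

# handy commands at the prompt: n (next line), s (step into), c (continue),
# l (list source), p val (print a variable), u / d (move up/down the stack)
buggy_step(3)
```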
:bookmark: note to self: re-learn pdb
Next Stage: Create A Bounding Box Around An Object
We know we can make a regression neural net instead of a classification one. This is accomplished by changing the last layer of the neural net: instead of softmax, we use MSE, and it is now a regression problem. We can have multiple outputs.
Bounding Box Only
Now we’ll try to find the bounding box of the largest object. This is simply a regression with 4 outputs (predicting the values below; see the sketch after this list), so we can use a CSV with multiple 'labels'.
- top-left x
- top-left y
- lower-right x
- lower-right y
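A sketch of building that CSV, reusing the `trn_lrg_anno` and `trn_fns` lookups from the classifier step (the four coordinates are stored space-separated in a single column):

```python
# one 'bbox' column holding all 4 coordinates, space-separated
bb = np.array([' '.join(str(p) for p in trn_lrg_anno[o][0]) for o in trn_ids])
df = pd.DataFrame({'fn': [trn_fns[o] for o in trn_ids], 'bbox': bb},
                  columns=['fn', 'bbox'])
df.to_csv(PATH/'tmp/bb.csv', index=False)
```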
Transform the bounding box data
Open and read the first 5 lines of the CSV.
Set our model and parameters.
Tell the fastai lib to make a continuous network model
Set `continuous=True` to tell fastai this is a regression problem, which means it won't one-hot encode the labels, and will use MSE as the default crit.
Note that we have to tell the transforms constructor that our labels are coordinates, so that it can handle the transforms correctly.
Also, we use `CropType.NO` because we want to 'squish' the rectangular images into squares, rather than center cropping, so that we don't accidentally crop out some of the objects. (This is less of an issue in something like ImageNet, where there is a single object to classify, and it's generally large and centrally located.)
We will look at `TfmType.COORD` in the next lesson, but for now, just realize that when we are doing scaling and data augmentation, that needs to happen to the bounding boxes, not just the images.
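Putting those pieces together, a sketch reusing `f_model` and `sz` from the classifier step (the augmentations are illustrative):

```python
# tfm_y=TfmType.COORD makes each augmentation transform the box with the image
augs = [RandomFlip(tfm_y=TfmType.COORD),
        RandomRotate(3, tfm_y=TfmType.COORD),
        RandomLighting(0.05, 0.05, tfm_y=TfmType.COORD)]
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO,
                       tfm_y=TfmType.COORD, aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, 'VOCdevkit/VOC2007/JPEGImages',
                                  PATH/'tmp/bb.csv', tfms=tfms, continuous=True)
```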
Custom Head
Custom head allows us to add additional layers on the end of the ResNet.
The fastai library lets you use a `custom_head` to add your own module on top of a convnet, instead of the adaptive pooling and fully connected net which is added by default. In this case, we don't want to do any pooling, since we need to know the activations of each grid cell.
The final layer has 4 activations, one per bounding box coordinate. Our target is continuous, not categorical, so the MSE loss function used does not do any sigmoid or softmax to the module outputs.
Flatten()
The previous layer normally has 7x7x512 activations in ResNet34, so flatten that out into a single vector of length 25088.
L1Loss
Rather than adding up the squared errors, add up the absolute values of the errors. This is normally what you want, because adding up the squared errors penalizes bad misses too heavily, so L1 loss is generally better to work with.
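A sketch of the custom head and loss, along the lines of the notebook, reusing `f_model` and the `md` built above:

```python
# 25088 = 7 * 7 * 512, the flattened ResNet34 feature map; 4 outputs = box coords
head_reg4 = nn.Sequential(Flatten(), nn.Linear(25088, 4))
learn = ConvLearner.pretrained(f_model, md, custom_head=head_reg4)
learn.opt_fn = optim.Adam
learn.crit = nn.L1Loss()
```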
Check the model to see that the additional layer has been added:
Try and fit the model.
Save model.
Training Results
Let's see how our model did.
We will revise this more next lesson.
As you look further down, it starts looking a bit crappy — anytime we have more than one object. This is not surprising. Overall, it did a pretty good job.