Idea
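Convert the detection problem into a classification problem: propose candidate regions, then classify each region as one of the object classes or as background.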
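The reduction from detection to classification can be sketched as a loop over proposals. This is a minimal illustration, not R-CNN itself: `detect_by_classification`, `propose`, `extract` and the toy stubs are hypothetical names, and any callable that scores a feature can stand in for a per-class classifier.

```python
def detect_by_classification(image, propose, extract, classifiers, threshold=0.0):
    """Detection as classification: score every proposed region with every
    per-class classifier and keep (box, label, score) triples above threshold."""
    detections = []
    for box in propose(image):
        feature = extract(image, box)
        for label, score_fn in classifiers.items():
            score = score_fn(feature)
            if score > threshold:
                detections.append((box, label, score))
    return detections


# Hypothetical stubs: two fixed proposals and a "feature" that is just box area.
def toy_propose(image):
    return [(0, 0, 2, 2), (1, 1, 2, 2)]


def toy_extract(image, box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


print(detect_by_classification(None, toy_propose, toy_extract, {"big": lambda a: a - 3}))
# prints [((0, 0, 2, 2), 'big', 1)]
```

The detector never localizes anything directly; localization comes entirely from which proposals survive classification, which is why proposal quality matters so much.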
Architecture
Region Proposals
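R-CNN is agnostic to the proposal method; the original work uses selective search to generate about 2,000 category-independent region proposals per image.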
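Selective search builds proposals by hierarchical grouping: start from an over-segmentation, repeatedly merge the most similar pair of neighbouring regions, and record the bounding box of every region created along the way. The sketch below assumes the initial segmentation is already given as a 2-D grid of integer labels (the real method obtains it from Felzenszwalb and Huttenlocher's graph-based segmentation) and uses only the size and fill similarities, omitting the colour and texture terms. All function names are hypothetical.

```python
def initial_regions(labels):
    """Size (pixel count) and inclusive bounding box (x1, y1, x2, y2) per label."""
    regions = {}
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            r = regions.setdefault(lab, {"size": 0, "box": (x, y, x, y)})
            x1, y1, x2, y2 = r["box"]
            r["size"] += 1
            r["box"] = (min(x1, x), min(y1, y), max(x2, x), max(y2, y))
    return regions


def adjacent_pairs(labels):
    """Unordered pairs of labels that touch horizontally or vertically."""
    h, w = len(labels), len(labels[0])
    pairs = set()
    for y in range(h):
        for x in range(w):
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < h and nx < w and labels[y][x] != labels[ny][nx]:
                    a, b = labels[y][x], labels[ny][nx]
                    pairs.add((min(a, b), max(a, b)))
    return pairs


def union_box(b1, b2):
    return (min(b1[0], b2[0]), min(b1[1], b2[1]), max(b1[2], b2[2]), max(b1[3], b2[3]))


def similarity(r1, r2, image_size):
    """Size similarity plus fill similarity (colour and texture omitted)."""
    x1, y1, x2, y2 = union_box(r1["box"], r2["box"])
    bbox_area = (x2 - x1 + 1) * (y2 - y1 + 1)
    s_size = 1 - (r1["size"] + r2["size"]) / image_size
    s_fill = 1 - (bbox_area - r1["size"] - r2["size"]) / image_size
    return s_size + s_fill


def hierarchical_grouping(labels):
    image_size = len(labels) * len(labels[0])
    regions = initial_regions(labels)
    pairs = adjacent_pairs(labels)
    proposals = [r["box"] for r in regions.values()]
    next_id = max(regions) + 1
    while pairs:
        # Merge the most similar pair of neighbouring regions.
        a, b = max(pairs, key=lambda p: similarity(regions[p[0]], regions[p[1]], image_size))
        new_id, next_id = next_id, next_id + 1
        regions[new_id] = {
            "size": regions[a]["size"] + regions[b]["size"],
            "box": union_box(regions[a]["box"], regions[b]["box"]),
        }
        proposals.append(regions[new_id]["box"])
        # Neighbours of a or b become neighbours of the merged region.
        updated = set()
        for p, q in pairs:
            if p in (a, b) or q in (a, b):
                other = q if p in (a, b) else p
                if other not in (a, b):
                    updated.add((min(other, new_id), max(other, new_id)))
            else:
                updated.add((p, q))
        pairs = updated
        del regions[a], regions[b]
    return proposals


proposals = hierarchical_grouping([[0, 0, 1], [0, 2, 1], [3, 3, 1]])
```

For a connected segmentation with n initial regions, the loop performs n - 1 merges, so it returns 2n - 1 boxes ending with the whole image; here that is 7 boxes, the last being (0, 0, 2, 2). Recording every intermediate region is what gives proposals at all scales.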
Feature Extraction
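Each proposal, whatever its size or aspect ratio, is warped to the fixed 227 × 227 pixel input the CNN expects. Before warping, the tight bounding box is dilated so that the warped frame contains exactly p pixels of image context around the original box (p = 16 in the original paper). The CNN is AlexNet, pre-trained on ImageNet to learn generic feature representations and then fine-tuned on warped region proposals from the detection dataset. Each region yields a 4096-dimensional feature vector.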
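The dilation and warping steps can be sketched on a plain 2-D list image. If the warped output is S pixels wide and must keep p pixels of context on each side, the original box occupies S - 2p pixels, so the box must be grown by p · w / (S - 2p) original-image pixels on each side horizontally (and likewise with the height vertically). The function names are hypothetical, nearest-neighbour sampling is an assumption made for brevity, and `fill` stands in for the mean-value padding used when the dilated box extends past the image.

```python
import math


def dilate_box(box, out_size=227, p=16):
    """Grow (x1, y1, x2, y2) so that, after warping to out_size x out_size,
    exactly p pixels of context surround the original box on every side."""
    x1, y1, x2, y2 = box
    inner = out_size - 2 * p          # pixels the original box spans after warping
    cx = p * (x2 - x1) / inner        # horizontal context in original-image pixels
    cy = p * (y2 - y1) / inner        # vertical context in original-image pixels
    return (x1 - cx, y1 - cy, x2 + cx, y2 + cy)


def warp(image, box, out_size=227, fill=0):
    """Anisotropically resample `box` (continuous coordinates) of a 2-D image to
    out_size x out_size with nearest-neighbour sampling; samples that fall
    outside the image get `fill`."""
    h, w = len(image), len(image[0])
    x1, y1, x2, y2 = box
    sx, sy = (x2 - x1) / out_size, (y2 - y1) / out_size
    out = []
    for i in range(out_size):
        y = math.floor(y1 + (i + 0.5) * sy)
        row = []
        for j in range(out_size):
            x = math.floor(x1 + (j + 0.5) * sx)
            row.append(image[y][x] if 0 <= y < h and 0 <= x < w else fill)
        out.append(row)
    return out


image = [[(x + y) % 256 for x in range(320)] for y in range(240)]
crop = warp(image, dilate_box((50, 40, 150, 100)))   # 227 x 227, ready for the CNN
```

The warp is anisotropic: width and height are scaled independently, so a wide box is squashed horizontally rather than letterboxed.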
Object Classification
The feature vector of each region proposal is scored by a separate classifier for every object class of interest. R-CNN uses linear Support Vector Machines (SVMs): for each class, one binary SVM is trained to decide whether a region proposal contains an instance of that class.
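The one-SVM-per-class scheme can be sketched as follows: for class c, regions labelled c are positives and every other region (other classes and background) is a negative. This is a minimal stand-in, assuming tiny 2-D toy features in place of the 4096-dimensional CNN features and plain full-batch subgradient descent on the soft-margin hinge loss instead of a dedicated solver; it also omits the hard negative mining the paper uses. Function names and data are hypothetical.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Minimise 0.5*||w||^2 + C * sum(max(0, 1 - y_i * (w.x_i + b)))
    by full-batch subgradient descent. Labels y_i must be +1 or -1."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0                    # gradient of the 0.5*||w||^2 term
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:        # hinge term is active for this sample
                gw = [g - C * yi * xj for g, xj in zip(gw, xi)]
                gb -= C * yi
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    return w, b


def train_per_class_svms(features, labels, classes):
    """One binary SVM per class: that class's regions are positives,
    all other regions (other classes and background) are negatives."""
    return {c: train_linear_svm(features, [1 if l == c else -1 for l in labels])
            for c in classes}


def score_region(svms, feature):
    """Per-class SVM scores for one region's feature vector."""
    return {c: dot(w, feature) + b for c, (w, b) in svms.items()}


features = [(3.0, 0.0), (2.5, 0.5), (0.0, 3.0), (0.5, 2.5), (0.0, 0.0), (0.5, 0.5)]
labels = ["cat", "cat", "dog", "dog", "background", "background"]
svms = train_per_class_svms(features, labels, ["cat", "dog"])
scores = score_region(svms, (2.8, 0.2))   # positive for "cat", negative for "dog"
```

Because each class has its own binary decision, a region can score positively for several classes or for none; a region that no SVM accepts is effectively background.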