Region of interest pooling explained

Region of interest pooling (also known as RoI pooling) is an operation widely used in object detection tasks with convolutional neural networks, for example to detect multiple cars and pedestrians in a single image. Its purpose is to perform max pooling on inputs of nonuniform sizes to obtain fixed-size feature maps (e.g. 7×7).

We’ve just released an open-source implementation of the RoI pooling layer for TensorFlow (you can find it here). In this post, we’re going to say a few words about this interesting neural network layer. But first, let’s start with some background.

Two major tasks in computer vision are object classification and object detection. In the first case the system is supposed to correctly label the dominant object in an image. In the second case it should provide correct labels and locations for all objects in an image. Of course there are other interesting areas of computer vision, such as image segmentation, but today we’re going to focus on detection. In this task we’re usually supposed to draw bounding boxes around any object from a previously specified set of categories and assign a class to each of them. For example, let’s say we’re developing an algorithm for self-driving cars and we’d like to use a camera to detect other cars, pedestrians, cyclists, etc. — our dataset might look like this.

In this case we’d have to draw a box around every significant object and assign a class to it. This task is more challenging than classification on datasets such as MNIST or CIFAR. On each frame of the video there might be multiple objects, some of them overlapping, some poorly visible or occluded. Moreover, for such an algorithm performance can be a key issue: in particular, for autonomous driving we have to process tens of frames per second.

So how do we solve this problem?


Typical architecture

The object detection architecture we’re going to be talking about today is broken down into two stages:

  1. Region proposal: Given an input image, find all possible places where objects can be located. The output of this stage should be a list of bounding boxes of likely positions of objects. These are often called region proposals or regions of interest. There are quite a few methods for this task, but we’re not going to talk about them in this post.
  2. Final classification: For every region proposal from the previous stage, decide whether it belongs to one of the target classes or to the background. Here we could use a deep convolutional network.

 

Object detection pipeline with region of interest pooling

 

Usually in the proposal phase we have to generate a lot of regions of interest. Why? If an object is not detected during the first stage (region proposal), there’s no way to correctly classify it in the second phase. That’s why it’s extremely important for the region proposals to have a high recall, and that’s achieved by generating very large numbers of proposals (e.g., a few thousand per frame). Most of them will be classified as background in the second stage of the detection algorithm.

Some problems with this architecture are:

  • Generating a large number of regions of interest can lead to performance problems. This would make real-time object detection difficult to implement.
  • It’s suboptimal in terms of processing speed. More on this later.
  • You can’t do end-to-end training, i.e., you can’t train all the components of the system in one run (which would yield much better results).

That’s where region of interest pooling comes into play.


Region of interest pooling — description

Region of interest pooling is a neural-net layer used for object detection tasks. It was first proposed by Ross Girshick in April 2015 (the article can be found here), and it achieves a significant speedup of both training and testing while maintaining a high detection accuracy. The layer takes two inputs:

  1. A fixed-size feature map obtained from a deep convolutional network with several convolution and max-pooling layers.
  2. An N x 5 matrix representing a list of regions of interest, where N is the number of RoIs. The first column represents the image index and the remaining four are the coordinates of the top-left and bottom-right corners of the region.
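
For illustration, such a matrix might look like the NumPy sketch below. The coordinate values are made up purely for this example; the exact format may differ between implementations:

```python
import numpy as np

# Each row: [image_index, x_top_left, y_top_left, x_bottom_right, y_bottom_right]
# (the values below are made up purely for illustration)
rois = np.array([
    [0, 10, 12, 24, 40],   # an RoI on image 0
    [0, 50, 30, 90, 80],   # another RoI on image 0
    [1,  5,  5, 60, 45],   # an RoI on image 1
], dtype=np.int32)

print(rois.shape)  # (3, 5): an N x 5 matrix with N = 3
```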

 

An image from the Pascal VOC dataset annotated with region proposals (the pink rectangles)

 

What does the RoI pooling actually do? For every region of interest from the input list, it takes a section of the input feature map that corresponds to it and scales it to some pre-defined size (e.g., 7×7). The scaling is done by:

  1. Dividing the region proposal into equal-sized sections (the number of which is the same as the dimension of the output)
  2. Finding the largest value in each section
  3. Copying these max values to the output buffer
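
To make the three steps concrete, here’s a minimal NumPy sketch of the operation for a single 2-D feature map and a single RoI. Everything here is an illustrative assumption rather than the exact logic of our TensorFlow op: the name roi_pool, the inclusive corner coordinates, the np.linspace rule for splitting the region into roughly equal sections, and the simplification that the region is at least as large as the output grid.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one RoI of a 2-D feature map to a fixed output size.

    feature_map: 2-D array of shape (height, width)
    roi: (x1, y1, x2, y2), top-left and bottom-right corners,
         inclusive, already in feature-map coordinates
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2 + 1, x1:x2 + 1]
    out_h, out_w = output_size
    output = np.empty(output_size, dtype=feature_map.dtype)

    # Step 1: divide the region into a grid of roughly equal sections.
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            section = region[row_edges[i]:row_edges[i + 1],
                             col_edges[j]:col_edges[j + 1]]
            # Steps 2 and 3: take the max of each section and copy it
            # into the output buffer.
            output[i, j] = section.max()
    return output
```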

The result is that from a list of rectangles with different sizes, we can quickly get a list of corresponding feature maps with a fixed size. Note that the dimension of the RoI pooling output doesn’t actually depend on the size of the input feature map or on the size of the region proposals; it’s determined solely by the number of sections we divide the proposal into.

What’s the benefit of RoI pooling? The main one is processing speed. If there are multiple object proposals on the frame (and usually there’ll be a lot of them), we can still use the same input feature map for all of them. Since computing the convolutions at the early stages of processing is very expensive, this approach can save us a lot of time.
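
As a sketch of that reuse (again using the hypothetical roi_pool helper and N x 5 format from above): the convolutional feature maps are computed once per image, and then every proposal is pooled from them without recomputing any convolutions.

```python
def roi_pool_batch(feature_maps, rois, output_size=(7, 7)):
    # feature_maps: one 2-D map per image, computed once by the
    # convolutional layers; rois: N x 5 matrix of
    # [image_index, x1, y1, x2, y2] rows. The expensive convolutions
    # are never recomputed per proposal.
    return np.stack([
        roi_pool(feature_maps[idx], (x1, y1, x2, y2), output_size)
        for idx, x1, y1, x2, y2 in rois
    ])
```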


Region of interest pooling — example

Let’s consider a small example to see how it works. We’re going to perform region of interest pooling on a single 8×8 feature map, one region of interest and an output size of 2×2. Our input feature map looks like this:

Region of interest pooling example (input feature map)

Let’s say we also have a region proposal (top left, bottom right coordinates): (0, 3), (7, 8). In the picture it would look like this:

Region of interest pooling example (region proposal)

Normally, there’d be multiple feature maps and multiple proposals for each of them, but we’re keeping things simple for the example.

By dividing it into 2×2 sections (because the output size is 2×2), we get:

Region of interest pooling example (pooling sections)

Notice that the size of the region of interest doesn’t have to be perfectly divisible by the number of pooling sections (in this case our RoI is 7×5 and we have 2×2 pooling sections).
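
Under the np.linspace splitting rule from the sketch above, the 7-wide, 5-tall region simply produces sections of slightly different sizes:

```python
# Column and row boundaries for a 7-wide, 5-tall region split into a 2x2 grid:
np.linspace(0, 7, 3).astype(int)  # array([0, 3, 7]) -> section widths 3 and 4
np.linspace(0, 5, 3).astype(int)  # array([0, 2, 5]) -> section heights 2 and 3
```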

The max values in each of the sections are:

Region of interest pooling example (output)

And that’s the output from the Region of Interest pooling layer. Here’s our example presented in the form of a nice animation:

Region of interest pooling (animation)
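
For completeness, the example can be replayed with the hypothetical roi_pool sketch from above. We read the proposal as a 7×5 region and write it as (0, 3, 6, 7) in 0-indexed, inclusive coordinates (an assumption on our part; see the comments below for some discussion of indexing). The feature-map values here are random, so the numbers, though not the 2×2 shape, will differ from the figures:

```python
feature_map = np.random.rand(8, 8)  # stand-in for the 8x8 map above
pooled = roi_pool(feature_map, roi=(0, 3, 6, 7), output_size=(2, 2))
print(pooled.shape)  # (2, 2) regardless of the RoI's size
```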

What are the most important things to remember about RoI Pooling?

  • It’s used for object detection tasks
  • It allows us to reuse the feature map from the convolutional network
  • It can significantly speed up both training and testing
  • It allows us to train object detection systems in an end-to-end manner

If you need an open-source implementation of RoI pooling in TensorFlow, you can find our version here.

In the next post, we’re going to show you some examples of how to use region of interest pooling with Neptune and TensorFlow.

References:

  • Girshick, Ross. “Fast R-CNN.” Proceedings of the IEEE International Conference on Computer Vision. 2015.
  • Girshick, Ross, et al. “Rich feature hierarchies for accurate object detection and semantic segmentation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
  • Sermanet, Pierre, et al. “OverFeat: Integrated recognition, localization and detection using convolutional networks.” arXiv preprint arXiv:1312.6229. 2013.


41 replies
  1. Rui says:

    typo: “Let’s say we also have a region proposal (top left, bottom right coordinates): (5,0), (10,7)”, should be (0, 7)
  2. Filip Novotny says:

    Hello,

    a very good explanation of RoI pooling, thanks for that!
    However I am confused by this:
    In this case we’d have to *draw a box around every significant object* and assign a class to it. This task is more challenging […]
    So how do we solve this problem?

    From what I understand, you say that Fast(er) R-CNN allows us to predict locations of objects without any ground truth boxes: just run the thing and candidate class-agnostic boxes will pop out.

    However, this is a quote from the Fast R-CNN paper:
    “Third, the network is modified to take two data inputs: a list of images and a *list of RoIs* in those images.”

    There seems to be a need for ground truth boxes. Am I missing something or is there a mistake in the article?

    Thanks,

    F.
    • Tomasz Grel says:

      Hello,

      You definitely need ground truth boxes to train the model, and at test time the model is going to predict both the bounding box and the class label. So by “draw a box around every significant object” I meant that the boxes are generated at test time; at train time you have to provide the ground truth boxes and labels.
  3. Alex Wang says:

    Hi:
    I am confused by this:
    ‘For every region of interest from the input list, it takes a section of the input feature map that corresponds to it’

    for example:
    an image of size (299, 299), and a region of interest (cat, 50, 40, 30, 30), how can I find the region in the feature map corresponding to this RoI?
    Thanks.
    • Tomasz Grel says:

      In our example we used the RoI format of the top left and bottom right corners of the rectangle, so the RoI (50, 40, 30, 30) makes little sense (it should rather be (30, 30, 50, 40)). The corresponding RoI is the section of the feature map between the points (30, 30) and (50, 40).
  4. llp says:

    The ground truth boxes are based on images, but the RoIs are based on feature maps, so how do they correspond to each other? I’m confused about this. Thank you.
    • Krzysztof says:

      As far as I understand their code, they are using VGG without the last pooling layer, so they transform an image (224 x 224 x 3) into (14 x 14 x 512) – height and width are divided by 16. They map RoI coordinates just by dividing them by 16.

      I have the feeling that the original authors did the same, as this step isn’t explained in the paper and dividing by some factor is the most obvious way of doing it.
    • Tomasz Grel says:

      The intuition is that you want to partition the image in roughly equal parts, but the size of the output is fixed and the input image can have any size.
      • Sai says:

        Thanks for the reply! My confusion is about different possible permutations of 4 pooling rectangles. The output will differ based on the 4 rectangles we choose, right?

        • Tomasz Grel says:

          Yes, you’re right, it would be different, but we don’t really care about it. It’s perfectly sufficient to just pick one deterministic method and stick with it.
  5. Ryan says:

    I think there’s a typo in case there’s confusion for beginners.
    “Let’s say we also have a region proposal (top left, bottom right coordinates): (0, 3), (7, 8). In the picture it would look like this:”
    if 1-indexing is used, the top left coordinate would be (1, 3) instead of (0, 3) according to the picture used

    • Tomasz Grel says:

      I don’t think I’ve tested this scenario so I’m not sure. If you test this and notice incorrect behavior please create an issue on our github page.

    • sunya says:

      can’t work, it returns error ‘tensorflow.python.framework.errors_impl.NotFoundError: /home/sunya1989/ml/lib/python3.5/site-packages/roi_pooling/roi_pooling.so: undefined symbol: _ZTIN10tensorflow8OpKernelE’

      • Tomasz Grel says:

        This looks like a low-level C++ issue. Please check that you’re using the right gcc version and the right TensorFlow version.

    • AVi says:

      what will happen if RoIs get too small? for example if we want the output from RoI pooling to be 2×2 but our RoI is smaller than that (1×1). Do we neglect this RoI?

      • Tomasz Grel says:

        This situation is perfectly normal. Usually proposals will cover only a small patch of the image where the interesting object is located. The purpose of the RoI pooling layer is to cut this interesting patch out of the feature map and feed it to the rest of the network for further classification.
  6. Neeraj Sajjan says:

    In the code of the original implementation [https://github.com/rbgirshick/fast-rcnn/blob/master/matlab/fast_rcnn_im_detect.m] and many other implementations [e.g. https://github.com/yhenon/keras-frcnn/blob/master/keras_frcnn/RoiPoolingConv.py], the approach to RoI pooling is different from what you have presented here.

    In the implementation presented here, if say we have an RoI of size [7, 5] and desired output [2, 2], then we get the first block size to be floor([7/2, 5/2]), i.e. [3, 2]. The rest of the boxes will have their sizes adjusted to be [4, 2], [3, 3] and [4, 3].

    But according to the original implementation and several others, they use the same [3, 2] as the size for all boxes. This leads to ignoring some of the activations in the RoI proposal.

    Am I correct in my above observation and if so, shouldn’t your implementation be better than the original?
  7. Krzysztof says:

    There is one part of RoI pooling which isn’t clear to me – you wrote:
    “What does the RoI pooling actually do? For every region of interest from the input list, it takes a section of the input feature map that corresponds to it.”
    How do we map coordinates of an RoI on the original image to coordinates in the last layer of the ConvNet? For the nets proposed in the paper (AlexNet, VGG) the problem is how to map coordinates from (224, 224, 3) to (7, 7, 512). It’s not trivial at all. Even if we omit the last pooling layer in the ConvNet, it’s still (14, 14, 256).
    • Tomasz Grel says:

      Great question. This is not trivial, but no matter what backbone you’re using (VGG, ResNet, FPN) you should be able to postprocess the RoI coordinates to map them into original image coordinates. Most papers treat this step as a technical detail, so they don’t really talk about it. If you really want to dive deep into this, my advice would be to read the original Fast R-CNN or Faster R-CNN code in Caffe or some other high-quality implementation.
