Friday, May 25, 2012

Algorithm improvement for Coca-Cola can shape recognition


One of the most interesting projects I've worked in the past couple years as I was still a student, was a final project about image processing. The goal was to develop a system to be able to recognize Coca-Cola cans (note that I'm stressing the word cans, you'll see why in a minute). You can see a sample below, with the can recognized in the green rectangle with scale and rotation.



Template matching



Some contraints on the project:



  • The background could be very noisy.

  • The can could have any scale or rotation or even orientation (within reasonable limits)

  • The image could have some degree of fuziness (contours could be not really straight)

  • There could be Coca-Cola bottles in the image, and the algorithm should only detect the can !

  • The brightness of the image could vary a lot (so you can't rely "too much" on color detection.

  • The can could be partly hidden on the sides or the middle (and possibly partly hidden behind the bottle !)

  • There could be no cans at all in the image, in which case you had to find nothing and write a message saying so.



So you could end up with tricky things like this (which in this case had my algorithm totally fail):



Total fail



Now I've done this project obviously as it was a while ago, and had a lot of fun doing it, and I had a decent implementation. Here are some details about my implementation:



Language : Done in C++ using OpenCV library.



Pre-processing : Regarding image pre-processing I mean how to transform it in a more raw form to give to the algorithm. I used 2 methods:



  1. Changing color domain from RGB to HSV ( Hue Saturation Value ) and filtering based on "red" hue, saturation above a certain threshold to avoid orange-like colors, and filtering of low value to avoid dark tones. The end result was a binary black and white image, where all white pixels would represent the pixels that match this threshold. Obviously there is still a lot of crap in the image, but this reduces the number of dimensions you have to work with). Binarized image

  2. Noise filtering using median filtering (taking the median pixel value of all neighbors and replace the pixel by this value) to reduce noise.

  3. Using Canny Edge Detection Filter to get the contours of all items after 2 precedent steps. Contour detection



Algorithm : The algorithm itself I chose for this task was taken from this (awesome) book on feature extraction and called Generalized Hough Transform (pretty different from the regular Hough Transform). It basically says a few things:



  • You can describe an object in space without knowing its analytical equation (which is the case here).

  • It is resistent to image deformations such as scaling and rotation, as it will basically test your image for every combination of scale factor and rotation factor.

  • It uses a base model (a template) that the algorithm will "learn".

  • Each pixel remaining in the contour image will vote for another pixel which will supposedly be the center (in terms of gravity) of your object, based on what it learned from the model.



In the end, you end up with a heat map of the votes, for example here all the pixels of the contour of the can will vote for its gravitational center, so you'll have a lot of votes in the same pixel corresponding to the center, and will see a peak in the heat map as below.



GHT



Once you have that, a simple threshold-based heuristic can give you the location of the center pixel, from which you can derive the scale and rotation and then plot your little rectangle around it (final scale and rotation factor will obviously be relative to your original template). In theory at least...



Results : Now, while this approach worked in the basic cases, it was severely lacking in some areas:



  • It is extremely slow ! I'm not stressing this enough. Almost a full day was needed to process the 30 test images, obvisouly because I had a very high scaling factor for rotation and translation, since some of the cans were very small.

  • It was completely lost when bottles were in the image, and for some reason almost always found the bottle instead of the can (perhaps because bottles were bigger, thus had more pixels, thus more votes)

  • Fuzzy images were also no good, since the votes ended up in pixel at random locations around the center, thus ending with a very noisy heat map.

  • Invariance in translation and rotation was achieved, but not in orientation, meaning that a can that was not directly facing the camera objective wasn't recognized.



Can you help me improve my specific algorithm, using exclusively OpenCV features, to resolve the four specific issues mentionned?



I hope some people will also learn something out of it as well, after all I think not only people who ask questions should learn :)


Source: Tips4all

11 comments:

  1. An alternative approach would be to extract features (keypoints) using the scale-invariant feature transform (SIFT) or Speeded Up Robust Features (SURF).

    It is implemented in OpenCV 2.3.1.

    You can find a nice code example using features in Features2D + Homography to find a known object <---

    Both algorithms are invariant to scaling and rotation. Since they work with features, you can also handle occulsion (as long as enough keypoints are visible).



    Image source: tutoial example

    The processing takes a few hundred ms for SIFT, SURF is bit faster, but it not suitable for realtime applications. ORB uses FAST which is weaker regarding rotation invariance.

    The original papers


    SURF: Speeded Up Robust Features
    Distinctive Image Features
    from Scale-Invariant Keypoints
    ORB: an efficient alternative to SIFT or SURF

    ReplyDelete
  2. Fun problem: when I glanced at your bottle image I thought it was a can too. But, as a human, what I did to tell the difference is that I then noticed it was also a bottle...

    So, to tell cans and bottles apart, how about simply scanning for bottles first? If you find one, mask out the label before looking for cans.

    Not too hard to implement if you're already doing cans. The real downside is it doubles your processing time. (But thinking ahead to real-world applications, you're going to end up wanting to do bottles anyway ;-)

    ReplyDelete
  3. To speed things up, I would take advantage of the fact that you are not asked to find an arbitrary image/object, but specifically one with the Coca-Cola logo. This is significant because this logo is very distinctive, and it should have a characteristic, scale-invariant signature in the frequency domain, particularly in the red channel of RGB. That is to say, the alternating pattern of red-to-white-to-red encountered by a horizontal scan line (trained on a horizontally aligned logo) will have a distinctive "rhythm" as it passes through the central axis of the logo. That rhythm will "speed up" or "slow down" at different scales and orientations, but will remain proportionally equivalent. You could identify/define a few dozen such scanlines, both horizontally and vertically through the logo and several more diagonally, in a starburst pattern. Call these the "signature scan lines."



    Searching for this signature in the target image is a simple matter of scanning the image in horizontal strips. Look for a high-frequency in the red-channel (indicating moving from a red region to a white one), and once found, see if it is followed by one of the frequency rhythms identified in the training session. Once a match is found, you will instantly know the scan-line's orientation and location in the logo (if you keep track of those things during training), so identifying the boundaries of the logo from there is trivial.

    I would be surprised if this weren't a linearly-efficient algorithm, or nearly so. It obviously doesn't address your can-bottle discrimination, but at least you'll have your logos.

    (Update: for bottle recognition I would look for coke (the brown liquid) adjacent to the logo -- that is, inside the bottle. Or, in the case of an empty bottle, I would look for a cap which will always have the same basic shape, size, and distance from the logo and will typically be all white or red. Search for a solid color eliptical shape where a cap should be, relative to the logo. Not foolproof of course, but your goal here should be to find the easy ones fast.)

    (It's been a few years since my image processing days, so I kept this suggestion high-level and conceptual. I think it might slightly approximate how a human eye might operate -- or at least how my brain does!)

    ReplyDelete
  4. Looking at shape

    Take a gander at the shape of the red portion of the can/bottle. Notice how the can tapers off slightly at the very top whereas the bottle label is straight. You can distinguish between these two by comparing the width of the red portion across the length of it.

    Looking at highlights

    One way to distinguish between bottles and cans is the material. A bottle is made of plastic whereas a can is made of aluminum metal. In sufficiently well-lit situations, looking at the specularity would be one way of telling a bottle label from a can label.

    As far as I can tell, that is how a human would tell the difference between the two types of labels. If the lighting conditions are poor, there is bound to be some uncertainty in distinguishing the two anyways. In that case, you would have to be able to detect the presence of the transparent/translucent bottle itself.

    ReplyDelete
  5. I would detect red rectangles: RGB -> HSV, filter red -> binary image, close (dilate then erode, known as imclose in matlab)

    Then look through rectangles from largest to smallest. Rectangles that have smaller rectangles in a known position/scale can both be removed (assuming bottle proportions are constant, the smaller rectangle would be a bottle cap).

    This would leave you with red rectangles, then you'll need to somehow detect the logos to tell if they're a red rectangle or a coke can. Like OCR, but with a known logo?

    ReplyDelete
  6. If you are not limited to just a camera which wasn't in one of your constraints perhaps you can move to using a range sensor like the Xbox Kinect. With this you can perform depth and colour based matched segmentation of the image. This allows for faster separation of objects in the image. You can then use ICP matching or similar techniques to even match the shape of the can rather then just its outline or colour and given that it is cylindrical this may be a valid option for any orientation if you have a previous 3D scan of the target. These techniques are often quite quick especially when used for such a specific purpose which should solve your speed problem.

    Also I could suggest, not necessarily for accuracy or speed but for fun you could use a trained neural network on your hue segmented image to identify the shape of the can. These are very fast and can often be up to 80/90% accurate. Training would be a little bit of a long process though as you would have to manually identify the can in each image.

    This is an interesting question and its so in depth and cool looking, well done!

    ReplyDelete
  7. I really like Darren Cook's and stacker's answers to this problem. I was in the midst of throwing my thoughts into a comment on those, but I believe my approach is too answer-shaped to not leave here.

    In short summary, you've identified an algorithm to determine that a Coca-Cola logo is present at a particular location in space. You're now trying to determine, for arbitrary orientations and arbitrary scaling factors, a heuristic suitable for distinguishing Coca-Cola cans from other objects, inclusive of: bottles, billboards, advertisements, and Coca-Cola paraphernalia all associated with this iconic logo. You didn't call out many of these additional cases in your problem statement, but I feel they're vital to the success of your algorithm.

    The secret here is determining what visual features a can contains or, through the negative space, what features are present for other Coke products that are not present for cans. To that end, the current top answer sketches out a basic approach for selecting "can" if and only if "bottle" is not identified, either by the presence of a bottle cap, liquid, or other similar visual heuristics.

    The problem is this breaks down. A bottle could, for example, be empty and lack the presence of a cap, leading to a false positive. Or, it could be a partial bottle with additional features mangled, leading again to false detection. Needless to say, this isn't elegant, nor is it effective for our purposes.

    To this end, the most correct selection criteria for cans appear to be the following:


    Is the shape of the object silhouette, as you sketched out in your question, correct? If so, +1.
    If we assume the presence of natural or artificial light, do we detect a chrome outline to the bottle that signifies whether this is made of aluminum? If so, +1.
    Do we determine that the specular properties of the object are correct, relative to our light sources (illustrative video link on light source detection)? If so, +1.
    Can we determine any other properties about the object that identify it as a can, including, but not limited to, the topological image skew of the logo, the orientation of the object, the juxtaposition of the object (for example, on a planar surface like a table or in the context of other cans), and the presence of a pull tab? If so, for each, +1.


    Your classification might then look like the following:


    For each candidate match, if the presence of a Coca Cola logo was detected, draw a gray border.
    For each match over +2, draw a red border.


    This visually highlights to the user what was detected, emphasizing weak positives that may, correctly, be detected as mangled cans.

    The detection of each property carries a very different time and space complexity, and for each approach, a quick pass through http://dsp.stackexchange.com is more than reasonable for determining the most correct and most efficient algorithm for your purposes. My intent here is, purely and simply, to emphasize that detecting if something is a can by invalidating a small portion of the candidate detection space isn't the most robust or effective solution to this problem, and ideally, you should take the appropriate actions accordingly.

    And hey, congrats on the Hacker News posting! On the whole, this is a pretty terrific question worthy of the publicity it received. :)

    ReplyDelete
  8. Please take a look at Zdenek Kalal's Predator tracker. It requires some training, but it can actively learn how the tracked object looks at different orientations and scales and does it in realtime!

    The source code is available on his site. It's in MATLAB, but perhaps there is a Java implementation already done by a community member. I have succesfully re-implemented the tracker part of TLD in C#. If I remember correctly, TLD is using Ferns as the keypoint detector. I use either SURF or SIFT instead (already suggested by @stacker) to reacquire the object if it was lost by the tracker. The tracker's feedback makes it easy to build with time a dynamic list of sift/surf templates that with time enable reacquiring the object with very high precision.

    If you're interested in my C# implementation of the tracker, feel free to ask.

    Enjoy! =D

    ReplyDelete
  9. Isn't it difficult even for humans to distinguish between a bottle and a can in the second image (provided the transparent region of the bottle is hidden)?

    They are almost the same except for a very small region (that is, width at the top of the can is a little small while the wrapper of the bottle is the same width throughout, but a minor change right?).

    The first thing that came to my mind was to check for the red top of bottle. But it is still a problem, if ther is no top for the bottle, or if it is partially hidden (as mentioned above).

    The second thing I thought was about the transparency of bottle. OpenCV has some works on finding transparent objects in an image. Check the below links.


    OpenCV Meeting Notes Minutes 2012-03-05
    OpenCV Meeting Notes Minutes 2012-02-21


    Particularly look at this to see how accurately they detect glass:


    OpenCV Meeting Notes Minutes 2012-04-24


    See their implmentation result:



    They say it is the implementation of the paper "A Geodesic Active Contour Framework for Finding Glass" by K. McHenry and J. Ponce, CVPR 2006.. (Download paper).

    It might be helpful in your case a little bit, but problem arises again if the bottle is filled.

    So I think here, you can search for the transparent body of the bottles first or for a red region connected to two transparent objects laterally which is obviously the bottle. (When working ideally, an image as follows.)



    Now you can remove the yellow region, that is, the label of the bottle and run your algorithm to find the can.

    Anyway, this solution also has different problems like in the other solutions.


    It works only if your bottle is empty. In that case, you will have to search for the red region between the two black colors (if the Coca Cola liquid is black).
    Another problem if transparent part is covered.


    But anyway, if there are none of the above problems in the pictures, this seems be to a better way.

    ReplyDelete
  10. This may be a very naive idea (or may not work at all), but the dimensions of all the coke cans are fixed. So may be if the same image contains both a can and a bottle then you can tell them apart by size considerations (bottles are going to be larger). Now because of missing depth (i.e. 3D mapping to 2D mapping) its possible that a bottle may appear shrunk and there isn't a size difference. You may recover some depth information using stereo-imaging and then recover the original size.

    ReplyDelete
  11. I'm not aware of OpenCV but looking at the problem logically I think you could differentiate between bottle and can by changing the image which you are looking for i.e. Coca Cola. You should incorporate till top portion of can as in case of can there is silver lining at top of coca cola and in case of bottle there will be no such silver lining.

    But obviously this algorithm will fail in cases where top of can is hidden, but in such case even human will not be able to differentiate between the two (if only coca cola portion of bottle/can is visible)

    ReplyDelete