Cognitive Computing Laboratory

Deep Aerial Object Recognition

Aerial imagery, captured by drones or other unmanned aerial vehicles, is a powerful tool for surveillance because of its wide field of view and because drones can reach places that are physically difficult to visit. It has many applications, such as border security, search and rescue, and image and video understanding. The wide-area view comes at a cost, however: objects of interest occupy only a small number of pixels in the image, so it is common for an object detector to miss a vehicle in an aerial view. Cluttered backgrounds and other objects also make a large number of false positive predictions highly probable.

Aerial vehicle detection and recognition becomes more specific when the goal is not merely to detect vehicles but to find particular ones. For example, a detection system can concentrate on searching for a car with a specific color, type, and other attributes (e.g., yellow taxi, large green truck). In this scenario, the system can support applications such as finding a suspicious vehicle or a specific target vehicle among many other vehicles, objects, and backgrounds.

In this project, we propose a framework that handles this problem as open-ended classification. A classical image classification system (see Fig. 1a) receives an image and produces an output label. Our architecture (see Fig. 1b) instead receives an image together with a textual description of the queried object (i.e., a vehicle label), represented by a code-vector, and makes a yes/no decision about whether the input image matches the desired class label.
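As a minimal illustration of how such a textual query can be turned into a code-vector (the vocabulary and function below are our own illustrative assumptions, not the project's actual attribute set), a binary bag-of-words encoding in Python might look like this:

# Minimal sketch: encode a textual vehicle query as a bag-of-words
# code-vector. The vocabulary is hypothetical; the project's actual
# attribute set (colors, vehicle types, etc.) may differ.
VOCAB = ["yellow", "green", "red", "black", "white",
         "taxi", "truck", "car", "bus", "van", "large", "small"]
WORD_TO_IDX = {w: i for i, w in enumerate(VOCAB)}

def encode_query(description: str) -> list[int]:
    """Return a binary bag-of-words vector for a query like 'yellow taxi'."""
    vec = [0] * len(VOCAB)
    for word in description.lower().split():
        if word in WORD_TO_IDX:
            vec[WORD_TO_IDX[word]] = 1
    return vec

print(encode_query("yellow taxi"))        # 1s at the 'yellow' and 'taxi' positions
print(encode_query("large green truck"))  # 1s at 'large', 'green', and 'truck'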

Fig. 1.  The proposed deep architecture, which receives an image as well as the desired class description as inputs and predicts a yes/no decision indicating whether the image has the desired class label. The proposed deep vehicle detector consists of a VGG-16-like deep network with only one fully connected layer to extract visual descriptors. This fully connected layer feeds the next stage, which fuses the visual features extracted by the VGG structure with the textual features describing the desired vehicle. The textual descriptions of the desired classes are encoded using a bag-of-words representation, and a fully connected layer then transforms this representation into the fusion space. The desired classes in our experiments consist of the colors and types of the vehicles, but they can be more complicated, with more textual detail about the desired vehicles.
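A rough PyTorch sketch of this two-stream design is given below. The source specifies only a VGG-16-like backbone with a single fully connected layer, a fully connected transform of the bag-of-words vector, a fusion step, and a yes/no output; the layer widths and the choice of concatenation as the fusion operation are our own assumptions.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class QueryConditionedDetector(nn.Module):
    """Two-stream sketch: visual features from a VGG-16-like backbone are
    fused with an embedded bag-of-words query to predict a yes/no decision.
    Layer widths are illustrative assumptions, not values from the project."""

    def __init__(self, vocab_size: int, feat_dim: int = 512):
        super().__init__()
        # VGG-16 convolutional trunk; only one fully connected layer follows it.
        self.backbone = vgg16(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.visual_fc = nn.Linear(512 * 7 * 7, feat_dim)
        # Fully connected layer mapping the bag-of-words query vector
        # into the same feature space as the visual descriptor.
        self.text_fc = nn.Linear(vocab_size, feat_dim)
        # Fusion of the two streams followed by a binary decision head.
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 1),  # logit for the yes/no decision
        )

    def forward(self, image: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        v = self.backbone(image)
        v = torch.flatten(self.pool(v), 1)
        v = torch.relu(self.visual_fc(v))
        t = torch.relu(self.text_fc(query))
        fused = torch.cat([v, t], dim=1)  # concatenation assumed as the fusion step
        return self.classifier(fused)     # "yes" if sigmoid(logit) > 0.5

Under these assumptions the network would be trained with a binary cross-entropy loss on (image, query) pairs, labeled 1 when the query describes the imaged vehicle and 0 otherwise.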