How not to test your deep learning algorithm?

An application to object recognition on satellite imagery


Testing the performance of AI and deep learning neural networks can be a pain. When you build a deep learning application, you want to validate its performance as thoroughly as possible and fill your product spec sheet with quantitative numbers. However, quantitative testing needs ground truth to be meaningful, and having ground truth makes it tempting to include that data in training to increase… performance. This is a real dilemma, especially when labeled data is scarce or expensive to get, e.g. when dealing with remote sensing images.


What is algorithm performance?


But before even thinking about testing data, it is important to understand what a “good” algorithm is. 

When training an AI, a common misunderstanding is to think that you want your algorithm to learn every possible situation there is, so that it is robust to everything that could happen. But this is not true. What you want is for the algorithm to be able to generalize, i.e. to learn the minimal combination of features that describes the object. Mathematically, this is a combination of convolution filters whose weights are tuned to minimize the loss function.


For face detection, for instance, an algorithm will learn that a face is round, contains ears, a mouth, a nose, eyes, and that eyes are elliptic, that they contain a round concentric black shape and another circle that can be blue/green/brown... Combining these features with the right weighting gives the lowest loss. This is how convolutional neural networks, which are the basis of AI for object recognition, actually “learn”.

In addition to learning what the algorithm is tasked to detect, it is also paramount that it learns what it should not detect. If you want to detect red Ferraris, you will need to give your algorithm lots of red objects which are not Ferraris, or it will put a Ferrari tag on every red object it sees, including tomatoes or British phone booths. To test your algorithm’s performance, you will then want to give it lots of red sports cars, to be sure it is robust enough. However, by doing so, you may bias your test results: giving too many “hard examples” will result in unrealistically poor scores. Getting your test data right is thus a big challenge: if your test base is composed of many “easy” examples, you will overestimate performance, but if your test base is too hard, you will exhaust your resources chasing performance that is impossible to reach in production.
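To see how the mix of easy and hard examples moves the headline number, here is a minimal sketch. It assumes recall is measured separately on an "easy" and a "hard" subset; the two recall values and the function name `blended_recall` are hypothetical and chosen only for illustration.

```python
def blended_recall(recall_easy: float, recall_hard: float, hard_share: float) -> float:
    """Overall recall when a fraction `hard_share` of the test objects are hard examples."""
    if not 0.0 <= hard_share <= 1.0:
        raise ValueError("hard_share must be between 0 and 1")
    # Recall over the whole test base is the share-weighted average of the
    # per-subset recalls, because recall is normalized by ground-truth counts.
    return (1.0 - hard_share) * recall_easy + hard_share * recall_hard


# Hypothetical per-subset recalls for one and the same model.
RECALL_EASY, RECALL_HARD = 0.95, 0.60

for share in (0.1, 0.5, 0.9):
    overall = blended_recall(RECALL_EASY, RECALL_HARD, share)
    print(f"{share:.0%} hard examples -> overall recall {overall:.3f}")
```

With these assumed numbers, the same model goes from about 0.92 to below 0.65 simply because the composition of the test base changed.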


It is nearly impossible to have a working recipe when building a testing protocol, but it is possible to give guidelines regarding what NOT to do.



Bad idea #1: Don’t think about your metrics


In order to test, you need metrics. These are usually pretty standard: recall, precision, F1/F2 scores or mAP (mean average precision) are familiar to most data scientists. What is trickier is computing them in a way that is relevant to your real-life applications. To master this, you need to think ahead about your use cases and the technology you will need to use. Indeed, even if you want to detect one single thing, you might end up using different metrics depending on your customers’ needs.
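As a reminder of how these standard metrics relate to each other, here is a small sketch that derives precision, recall and an F-beta score from detection counts. The function name and the example counts are hypothetical; F1 is the case beta = 1, and F2 (beta = 2) weights recall more heavily than precision.

```python
def precision_recall_fbeta(tp: int, fp: int, fn: int, beta: float = 1.0):
    """Return (precision, recall, F-beta) from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


# Hypothetical counts: 80 correct detections, 20 false alarms, 40 missed objects.
p, r, f1 = precision_recall_fbeta(80, 20, 40, beta=1.0)
_, _, f2 = precision_recall_fbeta(80, 20, 40, beta=2.0)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} F2={f2:.3f}")
```

In this example recall is lower than precision, so the F2 score comes out below the F1 score: the choice of beta already encodes which kind of error matters most for a given customer.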

A great illustration of this is car detection on satellite imagery. Imagine that you want to measure parking occupancy: you could use object localisation and detection to put boxes around cars. The metric would then be pretty straightforward: IoU (intersection over union) between predicted and labeled boxes. One box counts as one object, and you can derive your parking occupancy from the count.
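For reference, IoU between two axis-aligned boxes can be sketched as follows. The `(x_min, y_min, x_max, y_max)` box convention in pixels is an assumption made for this example, not something the article prescribes.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# A 10x6 pixel "car" and a prediction shifted by only 2 pixels along x.
label = (0, 0, 10, 6)
prediction = (2, 0, 12, 6)
print(f"IoU = {iou(label, prediction):.3f}")
```

On an object this small, a 2-pixel offset is enough to bring the IoU down to about 0.67, which is worth keeping in mind when choosing the IoU threshold above which a detection counts as correct.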

However, object-detection algorithms can struggle to detect vehicles on satellite imagery at around 50cm resolution or coarser. This is because these algorithms, whether one-stage or two-stage, rely on the concept of regions and classify those regions. But a car on a 50cm image is mainly a blurry ovoid of 10x6 pixels… way too small to be considered a “region” by most region-based algorithms.


This is what trucks/cars would look like on 30–50cm satellite imagery

Thus, an alternative could be to use segmentation, i.e. algorithms that classify every pixel of your picture.

While ensuring better performances, segmentation algorithms can struggle to separate small objects.

This way you would get the total surface of the parking lot that is occupied by cars. But separating individual objects would be much harder. So how would you compute your score? It’s starting to get tricky…! One solution is to look at the surface covered by the segmentation: in the end, parking space is just m². But if you were to use the same technology to actually “count” cars, e.g. to estimate a car maker’s production, it would be completely different, and you would need another metric.
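The surface-based approach can be sketched in a few lines. This is a minimal illustration that assumes a binary segmentation mask (1 = car pixel), a known ground sampling distance `gsd_m` in metres per pixel, and a known parking surface; all names and numbers are hypothetical.

```python
def occupied_area_m2(mask, gsd_m: float) -> float:
    """Surface covered by positive pixels of a binary mask, in square metres."""
    n_pixels = sum(1 for row in mask for value in row if value)
    return n_pixels * gsd_m ** 2


def occupancy_rate(mask, gsd_m: float, parking_area_m2: float) -> float:
    """Fraction of the parking surface covered by pixels classified as car."""
    if parking_area_m2 <= 0:
        raise ValueError("parking_area_m2 must be positive")
    return occupied_area_m2(mask, gsd_m) / parking_area_m2


# Hypothetical 20x30 pixel mask at 50cm resolution with two touching 10x6 pixel "cars".
mask = [[0] * 30 for _ in range(20)]
for y in range(2, 8):
    for x in range(5, 25):  # two cars side by side, merged into one blob
        mask[y][x] = 1

print(occupied_area_m2(mask, gsd_m=0.5), occupancy_rate(mask, gsd_m=0.5, parking_area_m2=100.0))
```

The two cars are merged into a single blob, so counting objects would fail here, yet the occupied surface is still measured correctly. That is exactly why the metric has to follow the use case.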


Even a metric as simple as precision can be misleading. Basically, it measures the share of detections that are correct, and therefore how often the algorithm raises false detections. However, the number of false detections grows with the size of the image. It is easy to see that, for the same algorithm, you are more likely to get false detections on a 200km² image than on a 20km² one. This is why, internally, we prefer to evaluate some metrics per km².
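The effect is easy to reproduce with a toy calculation. The sketch below assumes a hypothetical scenario in which the same 90 vehicles are correctly detected whatever the crop size, while the detector raises a constant 0.5 false alarms per km²; both numbers are made up for illustration.

```python
def precision(tp: int, fp: int) -> float:
    """Share of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def false_alarms_per_km2(fp: int, area_km2: float) -> float:
    """False detections normalized by the imaged surface."""
    if area_km2 <= 0:
        raise ValueError("area_km2 must be positive")
    return fp / area_km2


TRUE_POSITIVES = 90          # assumed identical in both crops
FALSE_ALARM_DENSITY = 0.5    # per km², assumed constant for this detector

for area_km2 in (20, 200):
    fp = round(FALSE_ALARM_DENSITY * area_km2)
    print(
        f"{area_km2} km²: precision={precision(TRUE_POSITIVES, fp):.3f}, "
        f"false alarms per km²={false_alarms_per_km2(fp, area_km2):.2f}"
    )
```

Under these assumptions precision falls from 0.9 to under 0.5 when the image gets ten times larger, while the per-km² false alarm rate stays the same and describes the detector itself rather than the size of the scene.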


This simple example illustrates that there are infinite ways to evaluate performances depending on the objects, technology, use cases… Users and developers should be careful when looking at a number that often seems “too good to be true”. And this is just the beginning. Once you have the metric, one big question remains: which data do I need to use? Or in fact, which data should I not use?



Bad idea #2: Testing on the training dataset


Yes, “nobody is dumb enough to do that”. Indeed, using different data for training and for validation/testing is basic machine learning practice. But why is it really a bad idea? The answer is actually more complex than it looks.

Using training data to test your algorithm can hide overfitting and over-specialization, both of which describe an algorithm that matches its data without learning to generalize. Over-specialization comes from a training base that is not diverse enough, and overfitting comes from too much freedom in your model.


Both can be very hard to detect in deep learning, as you often have millions of free parameters and hundreds of thousands of data points for training. Regularization techniques such as dropout help but cannot totally prevent the problem. Going back to the red Ferrari example: if your training base is only composed of black cars and red Ferraris, there is a very strong possibility that your model will over-simplify the problem into red = Ferrari, not red = not Ferrari. This is a basic example of over-specialization caused by the training base, and rigorous testing detects it easily.



Bad idea #3: Split the same image between training and test


If your application is detecting pedestrians for an autonomous car, you will want to use very varied landscapes and roads. Nobody would think of using one half of each image for training and the other half for testing.

Unfortunately, we still see that a lot for remote sensing applications, mainly due to the scarcity and size of the data. Satellite images can be up to 1 billion pixels, hosting tens of thousands of instances of the same object. A first POC usually leverages a very limited number of images, often one or two.


Consequently, testing is often performed on subparts of the same image that was used in training, which does not account for weather, image quality or landscape variations and creates a huge positive bias. This also explains the large discrepancy you can see between the performance reported in some research papers or POCs and actual “real life” applications. When you train and test on one image it is quite easy to get above 90 or 95% F1 score, but a huge disappointment always comes when you test the same network on a different location or date.

Same area, same sensor but 3 weeks apart
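One way to avoid this trap is to split at the level of source images rather than tiles, so that every tile of a given image ends up on one side only. The sketch below is a minimal illustration; the `(tile_id, image_id)` input format and the function name `split_by_image` are assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_image(tiles, test_fraction=0.3, seed=0):
    """Split (tile_id, image_id) pairs so that no source image feeds both train and test."""
    by_image = defaultdict(list)
    for tile_id, image_id in tiles:
        by_image[image_id].append(tile_id)

    image_ids = sorted(by_image)
    if len(image_ids) < 2:
        # With a single image there is no way to keep train and test apart.
        raise ValueError("need at least two source images for an honest split")

    random.Random(seed).shuffle(image_ids)
    n_test = min(len(image_ids) - 1, max(1, round(len(image_ids) * test_fraction)))
    test_images = set(image_ids[:n_test])
    train = [t for img in image_ids[n_test:] for t in by_image[img]]
    test = [t for img in image_ids[:n_test] for t in by_image[img]]
    return train, test, test_images


# Hypothetical dataset: 4 satellite images cut into 5 tiles each.
tiles = [(f"img{i}_tile{j}", f"img{i}") for i in range(4) for j in range(5)]
train_tiles, test_tiles, test_images = split_by_image(tiles)
print(len(train_tiles), len(test_tiles), sorted(test_images))
```

The function deliberately refuses to split a dataset made of a single image: in that situation any score measured on held-out tiles inherits the positive bias described above.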

Bad idea #4: Use the same geographical scene to train and test


This notion is trickier to understand and can also depend on the use case. If you are performing recurring monitoring of the same area, it is not necessarily bad to test your algorithm on this area, but if you want to build generic detectors, you will once again put a huge positive bias in your results. It is quite easy to understand that a lack of landscape diversity in your test base will bias the scores positively. However, it is also related to how Earth Observation satellites actually work.


Most Earth Observation satellites are in sun-synchronous orbit: they always image the same area at the same solar hour, meaning every image you get of Paris from the World View 3 satellite will be taken at the same hour in the late morning (off-nadir imaging capabilities aside, but I’m trying to keep it simple here…). So you will get similar lighting conditions and shadows, and you will not capture the daily variability in weather or human activities.


Depending on the site, you can also encounter very different atmospheric phenomena such as haze, humidity or aerosols. All of these impact image quality and thus the algorithm’s ability to detect objects within the satellite images. We found a huge variation in image quality, and thus in algorithm performance (up to 30%!), between sites imaged at the same local hour and at the same resolution due to this factor. It is therefore necessary to evaluate performance both on sites used in training and on sites not used in training, to see whether results are consistent and homogeneous.


Examples of various landscapes and geographies that drastically change algorithm results
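Checking whether scores are consistent between sites seen and not seen in training can be sketched as below. The per-image result rows, the site names and the counts are all hypothetical; test images from "seen" sites are still assumed to be different images from the ones used for training.

```python
def pooled_f1(rows) -> float:
    """F1 score computed from the summed counts of several evaluation rows."""
    tp = sum(r["tp"] for r in rows)
    fp = sum(r["fp"] for r in rows)
    fn = sum(r["fn"] for r in rows)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def seen_vs_unseen_f1(rows, training_sites):
    """Return (F1 on sites present in training, F1 on sites absent from training)."""
    seen = [r for r in rows if r["site"] in training_sites]
    unseen = [r for r in rows if r["site"] not in training_sites]
    return pooled_f1(seen), pooled_f1(unseen)


# Hypothetical test results, one row per test image.
results = [
    {"site": "site_a", "tp": 90, "fp": 10, "fn": 10},
    {"site": "site_b", "tp": 60, "fp": 20, "fn": 20},
    {"site": "site_c", "tp": 50, "fp": 25, "fn": 25},
]
f1_seen, f1_unseen = seen_vs_unseen_f1(results, training_sites={"site_a"})
print(f"seen sites F1={f1_seen:.3f}, unseen sites F1={f1_unseen:.3f}")
```

A large gap between the two numbers, as in this made-up example, is a sign that the detector has specialized on the geography it was trained on.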


Last bad idea: Use the same sensor in train and test


This also depends on your use case, but you will usually want to take advantage of every sensor available to get images more frequently. Even at the same resolution, sensors have different physical characteristics that can change image appearance, spectral response or color balance. We have even seen performance variations between the two sister satellites Pleiade 1A and Pleiade 1B: when trained only on P1A, our object detection algorithm performed much worse on P1B.


Consequently, we now test performance on the whole sensor catalog we own, from QuickBird 2 to Worldview 4, panchromatic and multi-spectral. This is also important in order to understand where to specialize. For example, we realized that you can use the same algorithm for boat detection between 30cm and 1m resolution, but you would need 3 vehicle detection algorithms to cover the same resolution range. Only a balanced test base can show you this.


Sensor and image qualities can have a tremendous impact on object aspects
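Whether a test base is balanced across sensors can be checked with a simple count before any score is computed. This is only a sketch: the sensor names, the object counts and the 10% threshold are hypothetical, and the right threshold depends on the use case.

```python
from collections import Counter


def sensor_shares(sensor_per_object):
    """Share of labeled test objects coming from each sensor."""
    counts = Counter(sensor_per_object)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty test base")
    return {sensor: n / total for sensor, n in counts.items()}


def underrepresented(shares, min_share: float):
    """Sensors whose share of the test base is below `min_share`."""
    return sorted(s for s, share in shares.items() if share < min_share)


# Hypothetical test base: one sensor name per labeled object.
labels = ["sensor_a"] * 700 + ["sensor_b"] * 250 + ["sensor_c"] * 50
shares = sensor_shares(labels)
print(shares)
print(underrepresented(shares, min_share=0.10))
```

In this example a global score would be dominated by `sensor_a`, and a drop on `sensor_c` would barely move it, which is why scores are worth reporting per sensor as well as overall.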


In the end, what is good performance?


For marketing or demonstration purposes, it is always easy to show 90–95%, sometimes even 97% precision/recall. Even with a limited number of training and test examples, it is easy to over-fit your remote sensing application (voluntarily or not) so that it is very good on one specific image, location, and time.


However, problems appear when scaling up, and they are always related to human performance, since the labels an algorithm learns from are produced by humans. This is why it can take months or years of research to get above 90% F1 score for object detection in satellite imagery. And sometimes these scores are not even reachable.


On the 1st of October, we reached a milestone at Earthcube: we had officially labeled more than 1 million vehicles alone for training deep learning algorithms on satellite images. So we decided to benchmark the quality of this ground truth with several military-trained photo analysts. The results were no surprise if you look back at what a vehicle looks like on a satellite image: when the analysts labeled the same full-size images, we saw variability between them ranging from 10% on “perfect” clean 30cm multi-spectral images to more than 25% on more difficult images (panchromatic only, high off-nadir angle…). Of course, this decreases the quality of the dataset and thus the detection rate. If it is hard for a human, it is also hard for the algorithm and, in the end, an AI cannot be better than the dataset it was trained on.


There are more than 37k vehicles on this image. It is impossible for a human to label it with 95% accuracy without spending 4 weeks on it.
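The article does not say how the variability between analysts was computed. As one possible way to quantify it, the sketch below assumes that the labels of two analysts have already been matched object by object, and derives a disagreement rate and a pairwise F1 score from the counts; the function name and the numbers are hypothetical.

```python
def annotator_agreement(both: int, only_a: int, only_b: int):
    """Return (disagreement rate, pairwise F1) between two annotators.

    both   -- objects labeled by both annotators
    only_a -- objects labeled by annotator A only
    only_b -- objects labeled by annotator B only
    """
    union = both + only_a + only_b
    if union == 0:
        raise ValueError("no labeled objects")
    disagreement = (only_a + only_b) / union
    # Treating either annotator as the reference gives the same F1.
    pairwise_f1 = 2 * both / (2 * both + only_a + only_b)
    return disagreement, pairwise_f1


# Hypothetical counts on one image.
disagreement, human_f1 = annotator_agreement(both=800, only_a=120, only_b=80)
print(f"disagreement={disagreement:.1%}, human-vs-human F1={human_f1:.3f}")
```

A human-versus-human F1 of this kind gives a useful point of comparison: an algorithm scored against one analyst's labels can hardly be expected to agree with them more than another trained analyst does.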

The same conclusion was drawn when we discussed automatic pipeline surveillance. When you actually look at human performance, an observer can miss up to 30% of the threats during an aerial flyby, due to a drop in attention caused by long flight times, the speed of the aircraft, or the observation conditions.


In the end, when you get 70 or 75% F1 score, which can be considered a bad score on paper, you could be, in fact, above human-level performances on many use cases and image types.


When you ask customers, they always want “>90%”, but it is also the job of the data science team to explain what is actually achievable and what good performance looks like compared to what a human can achieve. It is then critical to be able to show that the test methodology ensures and confirms the reliability and robustness of the solution and, above all, fits the client’s use case.