An Efficient Solution for Semantic Segmentation of Three Ground‐based Cloud Datasets

Machine learning approaches have shown state-of-the-art ability to handle segmentation and detection tasks and are increasingly employed to extract patterns and spatiotemporal features from the ever-increasing stream of Earth system data. However, a significant challenge remains: the generalization capability of models on cloud images of different types and under different weather conditions. After studying several popular methods, we propose a semantic segmentation neural network for cloud segmentation. It extracts features learned from source and target domains in an end-to-end manner, which addresses the significant lack of labels in observed cloud image data. The network is evaluated on the Singapore Whole Sky Image Segmentation (SWIMSEG) dataset using the Mean Intersection-over-Union, recall, F-score, and accuracy metrics, scoring 86%, 97%, 92%, and 96%, respectively, which demonstrates its efficiency and robustness. Most importantly, a new benchmark based on the SWIMSEG dataset for the task of cloud segmentation is introduced. The other two datasets, BENCHMARK and Cirrus Cumulus Stratus Nimbus, are evaluated through visualization of the outputs of the model trained on the SWIMSEG dataset.


Introduction
The Earth's average temperature is mainly regulated by clouds, which absorb and reflect shortwave radiation as well as absorb and emit longwave radiation (Huang & Su, 2008; Jiang et al., 2018; Su et al., 2017). The heat generated by clouds influences the atmospheric circulation and water content, which in turn affect cloud shapes. Clouds appear in a variety of forms and are in a continuous process of evolution, and there are several principles of discrimination by which clouds are classified according to their shapes or distinctive interior features. Clouds can be regarded as a kind of natural texture owing to their texture similarity, brightness similarity, and contour continuity in an image. Therefore, it is possible to identify the 10 main cloud groups, known as cloud genera, by means of cloud segmentation based on recognition of certain features. Traditional cloud segmentation methods are mainly based on threshold segmentation, which combines fixed, dynamic, or mixed thresholds to distinguish different categories (Heinle et al., 2010; Li et al., 2011; Long et al., 2006; Souza-Echer et al., 2006). However, these methods rarely use spatial information and rely heavily on the weather conditions. Therefore, considerable mis-segmentation between the source domain and the target domains can still be found. Besides, employing color and texture features can also yield some promising detection results (Mantelli Neto et al., 2010; Richards & Sullivan, 1992), but issues remain to be resolved, such as smooth cloud regions and cloud boundaries with the sky.
Recent developments in deep convolutional neural networks have made relentless progress on semantic segmentation (Badrinarayanan et al., 2017; Chen et al., 2014; Long et al., 2015; Noh et al., 2015; Wu et al., 2019; Reichstein et al., 2019; Zhao et al., 2017). Semantic segmentation is regarded as a fundamental task in computer vision that enables complete scene understanding: it predicts dense labels for all pixels in an image to generate the corresponding mask. Cloud segmentation can be treated as an application of image segmentation, and therefore applying semantic segmentation techniques to cloud detection is a reasonable consideration. Moreover, existing deep learning approaches for cloud segmentation have largely concentrated on satellite data (Drönner et al., 2018; Lu et al., 2019). Hence, it is worth exploring cloud segmentation performance by means of deep learning on ground-based cloud datasets.
Recently, a few methods based on deep learning have achieved better results than traditional algorithms on ground-based cloud datasets. Dev et al. (2019) propose a model that can effectively solve the problem of thin clouds with wrong labels. A SegCloud model trained on 400 whole sky images with manually marked labels has been proposed for cloud image segmentation (Xie et al., 2019). Neither work comprehensively compares the effects of other models on their datasets, leaving the generalization of the models to other datasets unclear. Thus, there is a pressing need for a deep learning-based method that is effective, robust, and fast. Furthermore, a labeling algorithm that reduces errors induced by subjective human judgment is also indispensable. Comparing the performances of various previous benchmark models will contribute to selecting neural network frameworks for other tasks.
In the following sections, we first describe three publicly available datasets. Second, the experimental designs and parameter settings are further presented. Third, the proposed method is thoroughly compared with the traditional methods. Finally, we summarize the paper and draw conclusions about this work.

Three Ground-Based Cloud Datasets
Typically, an annotated dataset is an essential prerequisite for a rigorous evaluation of any traditional or modern segmentation algorithm. It is well known that a well-organized dataset with corresponding ground truth contributes to improving accuracy and avoiding overfitting in machine learning. Three existing public cloud datasets are used for cloud segmentation.

The BENCHMARK Dataset
The BENCHMARK dataset is a subset of the UTILITY dataset collected by Li et al. (2011) from three sources: a whole sky camera, photographers, and the Internet. The UTILITY dataset includes 1,000 cloud images, which are cropped to various dimensions to avoid areas near the sun and the horizon. After visual screening, 32 images covering different cloud forms, such as cumuliform, cirriform, and stratiform, are selected from UTILITY. A Voronoi polygonal region generator is then used to generate binary mask images as the ground truth, with an average resolution of 682 × 512 pixels. A few sample images from BENCHMARK are displayed in Figure 1a. Given the limited number of samples in this dataset, using deep learning techniques on it alone may lead to overfitting.

The Singapore Whole Sky Image Segmentation Dataset
Another publicly available cloud dataset annotated with masks is the Singapore Whole Sky Image Segmentation (SWIMSEG) dataset created by Dev et al. (2017). Figure 1b illustrates a few representative cloud samples. It consists of 1,013 images with dimensions of 600 × 600, obtained by a sky imaging system developed and deployed at Nanyang Technological University in Singapore. Each cloud patch is manually chosen from a 2-year series of observation images. Cloud masks for the dataset are generated in discussion with cloud experts from the Singapore Meteorological Services. It has more samples than the BENCHMARK dataset and is more suitable for machine learning training.

The Cirrus Cumulus Stratus Nimbus Dataset
Zhang et al. (2018) introduce the Cirrus Cumulus Stratus Nimbus (CCSN) dataset, which is grouped into 11 categories under meteorological criteria. It includes 2,543 cloud images of 256 × 256 pixels in JPEG format and contains the 10 cloud genera of most concern in cloud observation, as well as contrails. Contrails may cause a warming effect and have therefore aroused the concern of many scientists. However, compared to the two datasets mentioned above, this dataset currently does not have cloud masks. Therefore, we use the model trained on the SWIMSEG dataset to generate the corresponding cloud masks.
Experimental Design
These neural network frameworks are first deployed for cloud segmentation, and their performances are evaluated on the SWIMSEG dataset. In addition, many evaluation metrics have been introduced and are frequently adopted to assess the effectiveness of semantic segmentation techniques. In the performance evaluation, we adopt not only the precision, recall, F-score, and accuracy metrics but also the Mean Intersection-over-Union (MIoU) (Garcia-Garcia et al., 2017), a widely used metric for semantic segmentation that quantifies the ratio between the intersection and the union of two sets. The percentage can also be interpreted as the number of true positives over the sum of true positives, false negatives, and false positives:

MIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}},

where k + 1 denotes the number of categories in the SWIMSEG dataset and p_{ij} is the number of false positives, that is, the number of pixels of class i inferred to belong to class j. Similarly, p_{ii} represents the number of true positives, and p_{ji} is the number of false negatives.
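As a concrete illustration, the MIoU defined above can be computed from a predicted label map and its ground truth as follows. This is a minimal NumPy sketch, not the evaluation code used in the paper; for SWIMSEG, num_classes is 2 (cloud versus sky).

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union of two label maps.

    For each class c, IoU_c = TP / (TP + FP + FN), i.e. the overlap
    of the predicted and ground-truth regions over their union.
    Classes absent from both maps are skipped.
    """
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))  # true positives
        fp = np.sum((pred == c) & (gt != c))  # false positives
        fn = np.sum((pred != c) & (gt == c))  # false negatives
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return float(np.mean(ious))
```

Averaging per-class IoU (rather than pooling all pixels) prevents the dominant class, often clear sky, from masking poor performance on the minority class.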
The 12 neural networks are implemented by training on the SWIMSEG dataset with TensorFlow (Abadi, 2015) (version 1.12) running on a single NVIDIA GeForce GTX 1080 Ti. The SWIMSEG dataset, which includes 1,013 cloud images, is divided into three parts according to common practice: 911 cloud images are used as the training set, and the remaining samples form the validation and testing sets, with 51 cloud images each. In the training process, the batch size is 1, and each image is resized to 512 × 512 pixels to match the number of input layer nodes. Figure 2 illustrates the overall structure, which consists of an encoder and a decoder. The former is similar to a convolutional neural network without the final fully connected layer; here, we apply ResNet101 (He et al., 2015), a state-of-the-art neural network pretrained on the ImageNet dataset (Deng et al., 2009), as the backbone. The latter uses several specialized networks to generate a complete cloud mask. The hyperparameters of the network in each layer keep their original settings. The entire network is fine-tuned on the SWIMSEG dataset with the RMSProp optimizer, a learning rate of 0.0001, and a decay of 0.995. The same settings are used to train the 12 kinds of encoder-decoder architectures for 300 epochs. To isolate the effect of data augmentation, we do not employ data augmentation techniques in these experiments.
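The 911/51/51 partition described above can be reproduced with a simple shuffled split. This is a hedged sketch: the actual image assignments and random seed used for the experiments are not published, so the seed below is an illustrative assumption.

```python
import random

def split_swimseg(n_images=1013, n_train=911, n_val=51, seed=0):
    """Shuffle image indices and split them into disjoint
    train/validation/test subsets (911/51/51 for SWIMSEG).

    The seed and ordering are illustrative assumptions, not the
    authors' actual partition.
    """
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test
```

Fixing the seed makes the split reproducible across the 12 architectures, so every network sees identical training, validation, and test images.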
A classic U-Net (Ronneberger et al., 2015) is introduced with SE-ResNet-152 (Hu et al., 2017), pretrained on ImageNet, as the backbone to train on the SWIMSEG dataset. Here the batch size is 4 and the learning rate is 0.0001 with the Adam (Kingma & Ba, 2014) optimizer. The other settings, such as the partitioning of the SWIMSEG dataset, are the same as in the previous experiments. The model is trained on the SWIMSEG dataset for 50 epochs. Since the dataset is very small, a large number of augmentation methods are adopted to increase the quantity of cloud images, such as horizontal flips, affine transforms, perspective transforms, brightness/contrast/color manipulations, image blurring and sharpening, Gaussian noise, and random crops. These transformations can be easily implemented with the albumentations package (Buslaev et al., 2018). Some neural networks with slight modifications are used to compare their final results. For example, MobileNets-skip indicates that skip connections are used between network layers. The only difference between FC-DenseNet-56 and FC-DenseNet-67 is the network depth, with 56 and 67 layers, respectively.
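Two of the augmentations listed above, horizontal flipping and random cropping, can be sketched in plain NumPy. The key point for segmentation is that every geometric transform must be applied identically to the image and its mask so the pixel labels stay aligned; this is a minimal illustration, not the albumentations pipeline used in the paper.

```python
import numpy as np

def augment_pair(image, mask, crop_size, rng):
    """Apply a random horizontal flip and a random crop to an
    image/mask pair, keeping the two spatially aligned."""
    # Flip with probability 0.5, on both arrays or neither.
    if rng.random() < 0.5:
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    # Crop the same window out of image and mask.
    h, w = mask.shape[:2]
    ch, cw = crop_size
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    image = image[top:top + ch, left:left + cw]
    mask = mask[top:top + ch, left:left + cw]
    return image, mask
```

Photometric transforms (brightness, contrast, blur, noise), by contrast, are applied to the image only, since they do not move pixels and therefore leave the mask unchanged.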

Experimental Result and Discussion
By comparing with the results of existing methods, this section demonstrates the effectiveness of our proposed method.

Evaluation of Several Semantic Segmentation Methods
The results of the different methods are summarized in Table 1. Our proposed method achieves an IoU of 86%, the best result among all methods. Besides the IoU criterion, its recall, F-score, and accuracy are also higher than those of the other methods. Moreover, MobileNets-skip achieves lower results compared with MobileNets, while the performance of DeepLabV3+ is better than that of its original network design. Two versions of the full-resolution residual network (FRRN) are implemented. We find that more complicated networks do not always yield more satisfying results. For example, the FRRN-B network structure is deeper than FRRN-A, but FRRN-A outperforms FRRN-B. The performance of FC-DenseNet-67 is also worse than that of the shallower FC-DenseNet-56 structure. The reason may be overfitting caused by deep networks together with the lack of data augmentation on this small dataset.

Comparison to Traditional Works
The performance of our proposed method has been extensively analyzed and compared with traditional works. As the detailed results in Table 2 show, our results outperform the current best results of Dev et al. (2017), which depend entirely on color characteristics. Our method achieves high precision, recall, and F-score, at 95%, 97%, and 92%, respectively. The segmentation speed is also faster than that of traditional methods, taking only 0.90 s to produce a cloud mask. Meanwhile, the model trained on the SWIMSEG dataset is used to produce cloud masks for images in the BENCHMARK and SWIMSEG datasets. The predicted masks are visualized in Figure 3. Overall, the model can grasp the core parts and profiles of the cloud images; that is to say, it has learned cloud features from the SWIMSEG dataset during the training process. Although the results on the BENCHMARK dataset (Figure 3, left three columns) fail at some points compared with the ground truth, such as in the top middle of (c)1, the generated cloud masks are more reliable in a visual comparison with traditional results and the ground truth. The failures may arise because the quality of the generated cloud labels largely depends on the specific knowledge of experts, and those experts cannot represent cloud masks of different color depths, such as dark clouds. Unsurprisingly, the results on the SWIMSEG dataset (Figure 3, right three columns) are quite good and at some points even more realistic than the ground truth. For example, in the lower middle of (a)4, the corresponding ground truth [(b)4] does not include the thin cloud, whereas our generated cloud mask contains it and truly reflects the depth of different parts of the cloud, which helps to distinguish different weather phenomena.

The Performance of Generating CCSN Dataset Labels
The labeling process is a cumbersome and time-consuming task for domain experts, and it is not feasible to create all labels manually for datasets beyond dozens of petabytes. Therefore, the model trained on the SWIMSEG dataset is utilized to segment a few samples in the CCSN dataset. The segmentation performance is visualized in Figure 4. Observing the generated results, most of them are reliable cloud masks. For example, the contrail [the bottom of Column (a) in Figure 4] has a nearly perfect mask with clear boundaries. Deficiencies or mis-predictions appear only on relatively unclear and dark boundaries between adjacent structures. For example, the SWIMSEG dataset does not include objects like the mountain [at the bottom of Column (b) of Figure 4], which causes mis-segmentation. It can also be found that the color information in cloud images greatly affects the final segmentation results: in the middle of Column (b) of Figure 4, the dark part on top of the cumulonimbus cloud is not identified as cloud.

Conclusion
In this paper, we study the challenging yet essential task of ground-based cloud segmentation at the pixel level based on deep learning. The performances of 12 kinds of semantic segmentation networks and several variants are thoroughly evaluated on the SWIMSEG dataset. This is the first time that multiple deep learning methods have been used to comprehensively evaluate and compare segmentation performance on ground-based cloud datasets. A novel neural network combination is introduced, which surpasses traditional methods by a large margin, and the trained model is then applied to the CCSN dataset to produce cloud masks. The experimental results validate the effectiveness of our approach and show the generalization capability and robustness of the model. Although the results are promising, there is still room for improvement. For example, the current ground truth only distinguishes cloud from non-cloud, but actual observations may contain other objects, such as buildings and mountains. Moreover, we contribute a benchmark on the specific application of cloud segmentation to facilitate future research on this problem. We believe that the cloud segmentation task will witness rapid developments with the help of well-labeled datasets.