1 Introduction

In this research, we are concerned with two problems: the first is face detection in surveillance videos and the second is concealed face identification. Face detection is concerned with discovering faces in a video or in video frames and, when a face is discovered, recording its location in the image. Face detection techniques face numerous challenges. Face pose in a video frame can vary with the position of the camera relative to the face: the camera view can be frontal, inclined, or profile. Faces can also be concealed partially or totally, either by innocent features such as beards or glasses or by threatening ones such as masks. Another problem is occlusion, where faces are partially hidden by other objects in the video frame. Lighting conditions and camera characteristics can also distort the appearance of a face. To handle these complications, researchers have proposed different techniques. Robust face recognition from multi-view videos was proposed by Du, Sankaranarayanan, and Chellappa [10]. Advanced face detection techniques can handle adverse conditions such as poor lighting and profile angles. Current techniques utilize neural networks and skin color identification. Skin detection using a color processing mechanism was proposed by Wu et al. [43]. Skin color detection utilizing a neural network was proposed by Kim, Hwang, and Cho [24]; this algorithm achieved a short face detection time. A Shearlet neural network for face detection using one sample per person was presented by Borgi et al. [3]. Ejbali, Zaied, and Ben Amar [11] implemented a face recognition model based on elastic graph equivalence, skin segmentation, and consequent. A literature survey of face recognition is presented by Zhao, Chellappa, Phillips, and Rosenfeld [49]. Filali et al. [12] introduced texture classification of melanoma skin cancer utilizing an efficient convolutional neural network. Chai, Shan, Chen, and Gao [5] utilized locally linear regression for pose-invariant face recognition. Khan and Khan [23] pioneered highly reliable algorithms for face localization in multifaceted images.

Identifying the skin area is a well-known first step for locating faces in images or video frames, and it reduces the time complexity of face detection algorithms. In contrast, we want to detect covered faces, so we propose an exclusion algorithm. First, we identify faces in frames by applying head and shoulder identification. Second, if movement is detected in the video, we locate the face. Finally, we exclude the concealed face assumption if skin complexion is detected.

Face detection techniques are classified into four categories: template matching techniques, feature invariant techniques, knowledge-based techniques, and appearance training–based techniques.

  1. Template matching techniques: This technique stores various patterns that describe faces, either as a whole face or as facial features, as described by Wang and He [41]. It locates faces utilizing a correlation function with respect to a standard face feature. Its main problem is that it is difficult to define typical templates that fit various poses, angles, facial expressions, and illumination settings.

  2. Feature invariant techniques: These techniques utilize facial features such as the eyes, nose, and skin color for face detection. The technique explained by Song et al. [37] is robust under different lighting settings. In the first step, it extracts discriminative local features. In the second step, it employs a spatial pyramid to construct a local-holistic face image. It then utilizes a support vector machine for classification.

  3. Knowledge-based techniques: These techniques, as described by Devadethan et al. [8] and [18], utilize rule extraction for face detection. The rules describe the associations among facial features. These techniques are utilized for face localization. They suffer from low accuracy, because accuracy depends heavily on the rules considered for face detection: restrictive rules yield a low detection rate, while loose rules yield a high false detection rate. The authors of [18] assessed face detection in angular positions. They assess matched features utilizing minimum distance measures within clusters of the facial area and maximum distance measures among classes of non-facial areas.

  4. Appearance training–based techniques: These techniques, as described by Molder and Oancea [31], are trained utilizing a set of facial images; no predefined templates are required. Convolutional neural network face detection utilizes this training methodology. Khan M. Z. et al. [22] utilized scale-invariant transforms for feature extraction. They proposed a training technique that utilizes the Labeled Faces in the Wild dataset presented in Huang G. B. et al. [19].
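
The correlation-based search used by template matching techniques can be illustrated with a minimal sketch. It assumes grayscale images stored as nested lists and uses normalized cross-correlation as the correlation function; the function names `ncc` and `best_match` are hypothetical, and the sketch is not the method of Wang and He [41].

```python
from math import sqrt


def ncc(window, template):
    """Normalized cross-correlation of two equally sized 2-D lists (range -1..1)."""
    w = [v for row in window for v in row]
    t = [v for row in template for v in row]
    mean_w = sum(w) / len(w)
    mean_t = sum(t) / len(t)
    num = sum((a - mean_w) * (b - mean_t) for a, b in zip(w, t))
    den = sqrt(sum((a - mean_w) ** 2 for a in w) * sum((b - mean_t) ** 2 for b in t))
    # A flat window or template has no structure to correlate; score it as 0.
    return num / den if den else 0.0


def best_match(image, template):
    """Slide the template over the image and return (best score, (row, col))."""
    th, tw = len(template), len(template[0])
    best_score, best_pos = -2.0, (0, 0)
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            window = [row[x:x + tw] for row in image[y:y + th]]
            score = ncc(window, template)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_score, best_pos


# Toy example: a 2x2 pattern embedded in an otherwise empty 5x5 image.
template = [[1, 2], [3, 4]]
image = [[0] * 5 for _ in range(5)]
image[2][1], image[2][2], image[3][1], image[3][2] = 1, 2, 3, 4
print(best_match(image, template))
```

A single fixed template only matches faces whose pose and illumination resemble it, which is the limitation noted for this family of techniques.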

Our research presents a technique for face detection based on skin color segmentation using a hybrid color model that combines the normalized RGB and the YCbCr models. This paper combines skin segmentation and facial features to detect faces in images. We found that skin segmentation improves the accuracy of the algorithm, which gives better results than other approaches.

This paper is organized as follows: Section 2 presents background on face detection in general and in constrained environments. Section 3 describes the proposed technique. Section 4 describes the proposed hybrid approach for skin detection. Section 5 gives experimental results, and Section 6 summarizes the conclusions.

2 Background

A face shape prototype with local adaptive morphology handling is presented by Liu, Guo, Liu, Lee, and Yao [29] as an alignment standard to overcome geometric distortion artifacts due to various poses. Du, Hu, Qiao, and Pitas [9] proposed a very well-connected face recognition system utilizing low-rank sparse illustration. Zaman, Shafie, and Mustafah [46] presented a facial recognition system that is robust against different expressions and camera occlusions. Jin, McCann, Froustey, and Unser [21] proposed a deep convolutional neural network to resolve ill-posed inverse image problems.

Zhu, Mai, and Shao [50] utilized a color attenuation methodology for haze removal from hazy images. They introduced a linear modelling paradigm for scene depth utilizing a supervised learning method, with which the depth information can be modelled. Yu, Bampis, Gupta, and Bovik [45] introduced a bi-step image quality estimation methodology, which is very important in any image prediction system. According to Fu, Xu, Li, Liu, Ye, and Zhu [13], crowd density estimation is carried out utilizing convolutional neural networks, which is very helpful in surveillance systems for detecting and estimating the number of people in crowds. Wang, Wang, Wang, Zhang, and Qiao [42] proposed scene recognition utilizing local patches. Wu, Lin, Dong, Yan, Bian, and Yang [44] utilized a one-example methodology for person re-identification through progressive learning.

Chen, Papandreou, Kokkinos, Murphy, and Yuille [6] presented semantic image segmentation utilizing deep convolutional nets; the proposed methodology performed very well in image segmentation that utilizes semantic information. Zhang K., Zhang Z., Li Z., and Qiao Y. [48] proposed a framework that leverages a cascaded architecture of deep convolutional networks. They reported high performance in predicting face location in a coarse-to-fine fashion. Badrinarayanan, Kendall, and Cipolla [2] also proposed an image segmentation technique using a neural network. Li et al. [26] likewise utilized semantic information in maps to solve the salient object identification problem; they modelled the semantic attributes of salient objects and utilized a convolutional neural network with raw images as input and saliency maps as output. Romera, Álvarez, Bergasa, and Arroyo [36] also focused on extracting semantic information to achieve semantic segmentation in real time. Although Gao, Li, Woo, and Tian [14] concentrated on segmentation of thermography images using genetic algorithms, their proposal can be extended to normal images. Liu, Xiao, and Yang [28] focused on edge detection utilizing a coastline detection algorithm that mixes the region and edge active contour methods.

Real-time operation is crucial to the success of surveillance systems. Awais M. et al. [1] proposed a video surveillance system with enhanced accuracy and lower computational complexity. Their system comprises face localization and recognition and takes real-time videos of faces. The system extracts key frames and compares them with stored facial images, using histogram of oriented gradients (HOG) features. Their simulation results show success rates of almost 92%, which are comparable with those of deep learning approaches, while deep learning techniques have a higher computational cost. Ullah H. et al. [40] presented a new real-time face recognition technique that handles occlusion. The system utilizes 68 points to detect the face in the input image. Linear discriminant techniques are then utilized to extract face features. In the last stage, a classifier with nearest-center measures is used. Experimental results show that the system operates in real time. Haq M. et al. [17] proposed a novel technique to boost the performance of low-resolution face recognition. Other articles have studied high-performance face recognition; for example, Zhang J. et al. [47] designed a high-performance face recognition system that utilizes edge computing.

Although we surveyed many face detection and recognition techniques, the most important systems for our research are those that recognize masked faces or faces under occlusion. Qezavati H., Majidi B., and Manzuri M. T. [34] introduced a methodology for the detection of partially covered faces. It is utilized in surveillance videos containing faces partially concealed by headscarves and eyeglasses. The methodology combines Haar and binary histogram for face classification. Rajeshwari, Karibasappa, and GopalKrishna [35] surveyed face detection based on skin detection. Liao, Jain, and Li [27] addressed problems in face detection with no prior constraints. They utilized normalized pixel difference image features, which are motivated by experimental psychology. Bu W. et al. [4] presented a novel cascaded convolutional neural network (CNN) framework to detect masked faces. They also constructed a dataset of masked faces. Ge S. et al. [15] proposed LLE-CNNs for occluded face detection. Pre-trained CNNs are utilized to extract facial regions from the image and express them with descriptors. These descriptors are converted into similarity-based measures. They tested the system on a large pool of synthesized faces, occluded faces, and non-faces. Ghiasi G. and Fowlkes C.C. [16] presented a hierarchical deformable face detection model. They presented occlusions in a structured model and enhanced the training data with synthetically occluded face images. Nair A. and Potgantwar A. [32] proposed automated masked person detection in less time.

In a recent study, Ud Din N. et al. [39] proposed mask object removal in face images. The task is challenging because facial masks usually cover a large part of the face and because training datasets of face images with and without masks are lacking. They introduced a solution for mask detection and utilized a generative adversarial network (GAN) with two discriminators.

3 The proposed technique for human detection in surveillance scene

The proposed technique is intended for surveillance systems. It aims to detect concealed faces in surveillance images or videos. It comprises several algorithms, starting with one that takes images of the surveillance scene under different conditions. New objects that enter the scene under surveillance are detected. An object is identified as a human being using height and width measurement extraction. If a human being is detected, the face and shoulder areas are extracted utilizing patterns learned in a training phase. The face area is clustered into patches, and each patch is classified as a skin patch or a concealed face patch. For this purpose, a human complexion detector using a hybrid technique that combines normalized RGB and YCbCr is proposed. An overview of the proposed technique is depicted in Fig. 1, and the head and shoulder detection algorithm is depicted in Fig. 2. Algorithm 1 depicts the training phase for head and shoulder detection. Algorithm 2 is utilized to determine the head and shoulder for an unknown image. Algorithm 3 depicts the face detection algorithm.

Fig. 1 An overview of the proposed technique

Fig. 2 Head and shoulder training and detection

Algorithm 1: Training phase for head and shoulder detection (Output: {pattern})

Algorithm 2: Determine Head and Shoulder (Input: Clus(i), R(Clus(i)))

Algorithm 3: Face Detection Algorithm

4 The proposed hybrid approach skin detector

The RGB color space is the one mainly utilized for digital images, as explained by Cheng, Liu, and Haifeng [7]. The R, G, and B components are highly correlated, which makes the color space sensitive to illumination changes. To reduce this sensitivity, each RGB component is normalized, giving the normalized RGB color space described by Loesdau et al. [30]. The normalization in Eqs. 1, 2, and 3 helps to reduce the dependency between the RGB components.

$$ r=\frac{R}{R+G+B} $$
(1)
$$ g=\frac{G}{R+G+B} $$
(2)
$$ b=\frac{B}{R+G+B} $$
(3)
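
As a minimal sketch of Eqs. 1, 2, and 3, the following function normalizes one pixel. The function name `normalize_rgb` and the handling of a black pixel (where the sum is zero) are assumptions of this sketch and are not specified in the text.

```python
def normalize_rgb(R, G, B):
    """Return the normalized components (r, g, b) of Eqs. 1-3 for one pixel."""
    total = R + G + B
    if total == 0:
        # Black pixel: the ratios are undefined. Returning equal shares is an
        # assumption of this sketch, not something the paper specifies.
        return (1 / 3, 1 / 3, 1 / 3)
    return (R / total, G / total, B / total)


# The same color at two brightness levels maps to the same normalized triple.
bright = normalize_rgb(200, 120, 80)
dim = normalize_rgb(100, 60, 40)
print(bright, dim)
```

Because r + g + b = 1, only two of the three normalized components are independent, which is how the normalization reduces the dependency on overall intensity.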

The HSV (hue, saturation, value) model is described by Jang and Ra [20] as a perceptual model that separates luminance and chrominance components. The H, S, and V components are given in Eqs. 4, 5, and 6.

$$ H={\cos}^{-1}\frac{\frac{1}{2}\ \left(\left(R-G\right)+\left(R-B\right)\right)}{\sqrt{{\left(R-G\right)}^2+\left(R-B\right)\left(G-B\right)}} $$
(4)
$$ S=1-3\left(\frac{\min \left(R,G,B\right)}{R+G+B}\right) $$
(5)
$$ V=\frac{1}{3}\left(R+G+B\right) $$
(6)

Another model is YCbCr, which encodes RGB images as introduced by Lei et al. [25]. This model utilizes a linear transform to separate illumination and chrominance components, and it is very effective for the detection of human skin. YCbCr is calculated as in Eqs. 7, 8, and 9.

$$ Y=0.299\ R+0.587\ G+0.114\ B $$
(7)
$$ {C}_b=B-Y $$
(8)
$$ {C}_r=R-Y $$
(9)
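
The conversion can be sketched as follows. The sketch uses the standard luma weights, which sum to one, and the usual convention in which Cb is the blue-difference and Cr the red-difference component. The differences are left unscaled and without offsets, unlike the digital YCbCr encodings used in video standards. The function name `rgb_to_ycbcr` is hypothetical.

```python
def rgb_to_ycbcr(R, G, B):
    """Return (Y, Cb, Cr) with unscaled color differences for one pixel."""
    Y = 0.299 * R + 0.587 * G + 0.114 * B  # luma: weighted sum of R, G, B
    Cb = B - Y                             # blue-difference chroma
    Cr = R - Y                             # red-difference chroma
    return Y, Cb, Cr


# A gray pixel carries no chrominance; a pure red pixel has positive Cr.
gray = rgb_to_ycbcr(128, 128, 128)
red = rgb_to_ycbcr(255, 0, 0)
print(gray, red)
```

Keeping the illumination in Y and the color in Cb and Cr is what allows a skin detector to compare chrominance while largely ignoring brightness.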

We simulated the three models using 100 labelled image patches: 70 skin patches and 30 non-skin patches. The computed confusion matrices are shown in Tables 1, 2, and 3. The results are not convincing: the best model, YCbCr, gives only 71.4% true-positive detection and 53.3% true-negative detection. We therefore decided to utilize a hybrid model of the normalized RGB and the YCbCr models, as given in Eqs. 10, 11, and 12.

Table 1 The normalized RGB
Table 2 HSV model
Table 3 YCbCr model
$$ Y=0.299\ r+0.587\ g+0.114\ b $$
(10)
$$ {C}_b=b-Y $$
(11)
$$ {C}_r=r-Y $$
(12)
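
A minimal sketch of the hybrid features follows: the pixel is first normalized as in Eqs. 1 to 3, and the luma and color differences are then computed on the normalized components. It assumes the standard luma weights with Cb as the blue difference and Cr as the red difference. The function name `hybrid_features` and the treatment of black pixels are assumptions of this sketch, and no skin thresholds are included because the text does not state any.

```python
def hybrid_features(R, G, B):
    """Return (Y, Cb, Cr) computed on normalized rgb for one pixel."""
    total = R + G + B
    if total == 0:
        # Assumption of this sketch: treat a black pixel as neutral gray.
        r = g = b = 1 / 3
    else:
        r, g, b = R / total, G / total, B / total
    Y = 0.299 * r + 0.587 * g + 0.114 * b  # luma of the normalized pixel
    return Y, b - Y, r - Y                 # (Y, Cb, Cr)


# The same color under two illumination levels yields identical features.
print(hybrid_features(200, 120, 80))
print(hybrid_features(100, 60, 40))
```

The example shows the intended benefit of the combination: scaling the brightness of a pixel leaves its hybrid features unchanged.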

The hybrid model takes advantage of the normalization of the RGB model to reduce the dependency between the RGB components, and of the YCbCr model to separate illumination and chrominance components. The confusion matrices of the normalized RGB, HSV, YCbCr, and proposed hybrid models are shown in Tables 1, 2, 3, and 4. In the experiments, we used 2000 patches: 1000 human skin patches and 1000 masked human skin patches. The hybrid model gives 97.3% true-positive detection and 98.2% true-negative detection, outperforming the other three models. The true-negative percentage is very important for our proposed system: if we are sure that an area is a face and skin is not detected in it, we can assume that the face is covered. The proposed algorithm is depicted in Algorithm 4. Identification of the covered face is depicted in Algorithm 5. The whole system is depicted in Fig. 3a and b.
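
The exclusion logic can be illustrated with a short sketch. It is not a reproduction of Algorithm 5: it assumes that each patch of a detected face region has already been labelled as skin or non-skin, and the parameter `min_skin_fraction` is a hypothetical threshold that the text does not specify.

```python
def is_concealed(patch_is_skin, min_skin_fraction=0.5):
    """Decide whether a detected face region is concealed.

    patch_is_skin: list of booleans, one per patch of the face region.
    min_skin_fraction: hypothetical threshold, not taken from the paper.
    """
    if not patch_is_skin:
        raise ValueError("no patches supplied for the face region")
    skin_fraction = sum(patch_is_skin) / len(patch_is_skin)
    # The region is known to be a face; too little skin suggests a covering.
    return skin_fraction < min_skin_fraction


print(is_concealed([False] * 8 + [True] * 2))  # mostly non-skin patches
print(is_concealed([True] * 9 + [False]))      # mostly skin patches
```

This is why the true-negative rate matters: a non-skin patch wrongly labelled as skin would hide a covering from this rule.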

Table 4 Our proposed hybrid model
Fig. 3 a Classification of skin and non-skin patches. b Detection of concealed faces

Algorithm 4. Training for presence of human complexion (Clus(i))

Algorithm 5. Identification of the covered face (Clus(i))

5 Experimental results

We tested the hybrid model utilizing a set of training images selected from two databases: the first (TAN) is presented by Tan et al. [38], and the second (FvNF) is presented by Nanni and Lumini [33]. The first database includes 650 skin and non-skin patches. Its color images are obtained from various sources and under different illumination settings. The second dataset, FvNF (face vs. nonface), is composed of 800 face images. It was collected and used by Nanni and Lumini [33] to evaluate the capability of a skin detector to detect the presence of a face based on the number of pixels classified as skin. For our experiments, we created a synthetic dataset by occluding parts of the skin in 450 images in FvNF and labelling them as the concealed human set, while the rest of the images were left unoccluded.

For classification of the skin patches, we applied the FMeasure as stated in Eq. 13. Table 5 shows that the proposed hybrid technique has a high detection rate compared with existing models. Metrics such as FMeasure and specificity are utilized to compare the proposed algorithm (Algorithms 4 and 5) with other algorithms in the literature. Specificity is the true-negative rate: the ratio of actual negatives that are correctly detected as such. FMeasure is a metric of experiment accuracy that combines precision and recall. Recall, false-positive ratio (FPR), and false-negative ratio (FNR) are also utilized in the experimentation, and the comparison results are shown in Tables 5 and 6 and Fig. 6. All of these metrics are defined in Eqs. 13 to 18.

Table 5 The performance measures of the proposed hybrid approach against other approaches using the TAN dataset of 600 face images
Table 6 The performance measures of the proposed hybrid approach against other approaches using the FvNF dataset of 450 images
$$ FMeasure=2\times \frac{\left( Precision\times Recall\right)}{\left( Precision+ Recall\right)} $$
(13)
$$ Precision=\frac{TP}{\left( TP+ FP\right)} $$
(14)
$$ Recall=\frac{TP}{\left( TP+ FN\right)} $$
(15)
$$ Specificity=\frac{TN}{\left( TN+ FP\right)} $$
(16)
$$ FPR=\frac{FP}{\left( FP+ TN\right)} $$
(17)
$$ FNR=\frac{FN}{\left( FN+ TP\right)} $$
(18)
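
The metrics can be computed from the four confusion-matrix counts as in the sketch below, where FPR and FNR are the complements of specificity and recall, respectively. The function name `detection_metrics` is hypothetical. The example counts are illustrative values chosen to reproduce the 71.4% true-positive and 53.3% true-negative rates reported for the YCbCr model on 70 skin and 30 non-skin patches; they are not copied from the tables.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute the metrics of Eqs. 13-18 from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. each class is present and
    at least one positive prediction is made.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f_measure": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),  # equals 1 - specificity
        "fnr": fn / (fn + tp),  # equals 1 - recall
    }


# Illustrative counts: 50 of 70 skin patches and 16 of 30 non-skin patches correct.
metrics = detection_metrics(tp=50, fp=14, tn=16, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```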

Some of the skin patches, their clustering, and the skin patches after concealment (used in the actual experiments) are depicted in Fig. 4 (dataset TAN) and Fig. 5 (dataset FvNF). The proposed hybrid approach achieved high concealed face detection performance for frontal faces. Performance degrades when images contain non-frontal faces and for dark skin images concealed with a brown cover. The hybrid approach achieves an average specificity of 96.8%, an enhancement of 28%, and an average detection rate of 97.5%, an enhancement of 42.12% over the second-best algorithm, the YCbCr model (Fig. 6). For FMeasure, which combines precision and recall, the proposed hybrid approach achieves an enhancement of 30%.

Fig. 4 a Patches of skin. b Clustering. c Concealing (dataset TAN)

Fig. 5 a Example of an image from FvNF. b Synthetic masked image

Fig. 6 The performance measures of the proposed hybrid approach against other approaches using the FvNF dataset of 450 images

In Table 7, we compare some existing systems for concealed face detection. Different features are sought and compared with our proposed system. The systems that we compared are as follows:

  1. GAN: A novel generative network for unmasking masked faces, introduced by Ud Din N. et al. [39].

  2. Head scarf: Qezavati H., Majidi B., and Manzuri M. T. [34] detected partially covered faces with headscarves.

  3. LLE: Ge S. et al. [15] succeeded in detecting masked faces with LLE-CNNs.

  4. Occlusion coherence: Ghiasi G. and Fowlkes C.C. [16] localized occluded faces utilizing a hierarchical deformable model.

  5. Viola: Nair A. and Potgantwar A. [32] proposed masked face detection using the Viola algorithm.

Table 7 Comparison of existing systems for concealed face detection

We carried out experiments to determine the runtime of our proposed approach, because it is crucial that our algorithm runs in real time. We compared our implementation with Viola (Nair A. and Potgantwar A. [32]) and GAN (Ud Din N. et al. [39]). The Viola system detects people's faces and determines whether they are masked in a video-based setting. It comprises several phases: it estimates the distance from the camera, identifies the eye line and the face, and finally detects whether the face is masked. For GAN, we compared part of our proposed system (masked face detection) with the first phase of GAN, which detects the mask in images, and we added face and shoulder detection to GAN. Both systems are known for their real-time occluded face detection.

In Fig. 7, we show the average runtime comparison combined with the average detection rate. We used 450 images from the FvNF database with synthetic masks.

Fig. 7 Average runtime and average detection rate for Viola, GAN, and our proposed system

As shown in Fig. 7, Viola has a slightly lower average runtime than our proposed system but also a lower detection rate. Our system has a 12% higher detection rate while its average runtime is 8% higher, which is still considered real time. GAN has a higher average runtime than our proposed system with a comparable detection rate; our system is 30% faster than GAN.

Figure 8 presents the average runtime for Viola and GAN versus our proposed system for 11 different masked face images.

Fig. 8 Average runtime in ms for Viola and GAN versus our proposed system for 11 different masked face images

6 Conclusions

This paper proposes a robust approach for concealed face detection under different camera angles and illumination settings. Human skin patches are identified by a hybrid non-linear transform model that combines the normalized RGB (rgb) color space model and the YCbCr color model. The concealed face is detected by negating the presence of skin patches in a truly identified face. The novel technique for concealed face detection relies on complexion detection to challenge a concealed face assumption. The proposed algorithm first determines the existence of a human being in the surveillance scene. The head and shoulder contour is then detected, and the face is clustered into patches. The presence or absence of human skin is then determined using the hybrid approach that combines the normalized RGB and the YCbCr color spaces. The technique was tested on two datasets: the first contains 650 skin patches, and the second contains 800 face images. Masks are synthesized on 60% of the images in the datasets. The algorithm achieves an average masked face detection rate of 97.51% for concealed faces in real time.