Academic Editor: Jun-Qiang Wang

Received: 14 November 2024 / Revised: 24 December 2024 / Accepted: 6 January 2025 / Published: 16 January 2025

Citation: Alhassan, M.A.M.; Yılmaz, E. Evaluating YOLOv4 and YOLOv5 for Enhanced Object Detection in UAV-Based Surveillance. Processes 2025, 13, 254. https://doi.org/10.3390/pr13010254

Copyright: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article

Evaluating YOLOv4 and YOLOv5 for Enhanced Object Detection in UAV-Based Surveillance

Mugtaba Abdalrazig Mohamed Alhassan * and Ersen Yılmaz

Electrical-Electronic Engineering Department, Bursa Uludag University, 16059 Bursa, Turkey; ersen@uludag.edu.tr
* Correspondence: mugtaba.abdalrazig@gmail.com

Abstract: Traditional surveillance systems often rely on fixed cameras with limited coverage and on human monitoring, which can introduce errors and delays. Unmanned Aerial Vehicles (UAVs) equipped with object detection algorithms, such as You Only Look Once (YOLO), offer a robust solution for dynamic surveillance, enabling real-time monitoring over large and inaccessible areas. In this study, we present a comparative analysis of YOLOv4 and YOLOv5 for UAV-based surveillance applications, focusing on two critical metrics: detection speed (Frames Per Second, FPS) and accuracy (Average Precision, AP). Using aerial imagery captured by a UAV, along with 20,288 images from the Microsoft Common Objects in Context (MS COCO) dataset, we evaluate each model's suitability for deployment in high-demand environments. The results indicate that YOLOv5 outperforms YOLOv4, with a 1.63-fold increase in FPS and a 1.09-fold improvement in AP, suggesting that YOLOv5 is the more efficient option for UAV-based detection. To align with recent advancements, this study also highlights potential areas for integrating newer YOLO models and transformer-based architectures in future research to further enhance detection performance and model robustness. This work aims to provide a solid foundation for UAV-based object detection, while acknowledging the need for continuous development to accommodate newer models and evolving detection challenges.

Keywords: image processing; object detection; unmanned aerial vehicles; YOLO

1. Introduction

Over the past years, significant advancements have taken place in the field of object detection, and among these, the evolution of the You Only Look Once (YOLO) algorithm stands out as one of the most noteworthy [1]. YOLO first came onto the scene in June 2016, when Joseph Redmon et al. shared the first version on GitHub [2], marking a transformative shift in the approach to object detection. Unlike its predecessors, YOLO reformulated object detection as a regression problem, leveraging a single convolutional neural network (CNN) to predict both object locations and their associated class probabilities [2]. This innovative approach not only made YOLO faster and more accurate but also improved its generalization capabilities [3]. The first version of YOLO laid the groundwork for YOLOv2, released in December 2016, which applied batch normalization across all of its convolutional layers. This enhancement made training more effective by reducing overfitting and improved both stability and performance [2].
YOLOv2 was followed by YOLOv3, released in April 2018 [3], which introduced significant enhancements to the algorithm. One notable improvement was its ability to predict objects at three different scales, a feature that extended its effectiveness in detecting objects of various sizes. Additionally, YOLOv3 adopted a more efficient backbone architecture known as Darknet-53 [4], a contribution that notably bolstered both its accuracy and processing speed [3].

In April 2020, the release of YOLOv4 marked a significant milestone in the YOLO series [5]. This version was strategically fine-tuned for optimal resource utilization, aiding deployment across diverse computing platforms, including edge devices. It introduced a novel foundational architecture, CSPDarknet53, which led to improved detection accuracy and enhanced speed [5]. YOLOv4 remains a state-of-the-art object detection model that provides a significant improvement over previous versions.

Afterwards, the development of YOLOv5 was taken up by Ultralytics [6]. A new architecture based on the CSP (Cross-Stage-Partial) backbone improved the model's accuracy while maintaining fast inference speeds [6]. The CSP architecture also made YOLOv5 compatible with a wide range of platforms, from mobile and other low-power edge devices to more capable hardware such as NVIDIA GPUs; an example of applying these algorithms on an NVIDIA device can be seen in [4]. YOLOv5 maintains anchor-based predictions similar to previous YOLO versions, while the introduction of a novel anchor-free approach enhanced its capability to detect smaller objects; an example of this approach can be seen in [7]. Additionally, a summary of the various YOLO properties is provided in Table 1.

Table 1. Comparing the different versions of YOLO, their features, and performance characteristics.

YOLO Version | Release Date | Backbone Architecture | Key Features
YOLOv1 | June 2016 | Custom network (24 convolutional layers) | Single-stage design aimed at efficiency
YOLOv2 | December 2016 | Darknet-19 | Batch normalization for faster training and improved performance
YOLOv3 | April 2018 | Darknet-53 | Deeper architecture for improved feature extraction
YOLOv4 | April 2020 | CSPDarknet53 | Improved efficiency and accuracy through feature reuse
YOLOv5 | June 2020 | Modified CSPDarknet53 | Scalable architecture with multiple model sizes

Other state-of-the-art object detection models, such as Faster R-CNN and EfficientDet, have also demonstrated significant potential in UAV applications. Faster R-CNN, as a two-stage detector, is highly regarded for its exceptional detection accuracy and robust performance in complex environments. However, its reliance on substantial computational resources to achieve real-time processing speeds poses challenges for deployment on resource-constrained UAV platforms.
In contrast, EfficientDet employs compound scaling and an advanced architecture that strikes a commendable balance between speed and accuracy, making it a competitive alternative for UAV-based tasks. Nonetheless, both Faster R-CNN and EfficientDet encounter limitations in real-time applications, as their computational demands remain considerably higher than those of YOLO models, which are specifically optimized for lightweight and efficient performance in real-time UAV operations [8–10].

According to Microsoft COCO, object recognition algorithms involve two components: classification, which assigns an image a label selected from a predefined set of classes (e.g., car, person, and dog), and object detection, which marks all objects in an image that belong to one or more of the categories and defines each object's spatial boundaries by predicting a bounding box that encompasses it [11].

Focusing on the two most recent versions of YOLO at the time of this study, YOLOv4, released in April 2020, was followed by YOLOv5 in June 2020, whose release raised questions about its claims of increased speed (140 frames per second, FPS) and a significantly reduced code size. YOLOv5 is notably compact, measuring around 27 megabytes, compared to YOLOv4, which occupies 244 megabytes according to their Git repositories [6,12]; this difference affects the efficacy of object detection in aerial imagery.

On the other hand, in recent years, Unmanned Aerial Vehicles (UAVs) have seen an exponential increase in use across many different fields, including but not limited to surveillance, for which some examples can be examined in [13], agriculture [14], search-and-rescue missions [15,16], and infrastructure inspection [17]. These versatile aerial platforms have revolutionized data collection and monitoring tasks, with applications that demand efficient and accurate object detection capabilities [15].

2. Related Work

Object detection has undergone significant advancements in recent years, particularly with the introduction and evolution of the YOLO family of algorithms. These advancements have greatly impacted various fields, including UAV (Unmanned Aerial Vehicle) applications. Numerous studies have compared different versions of YOLO, each aiming to improve the balance between detection accuracy, speed, and resource efficiency.

Nepal et al. [4] evaluate the performance of three versions of the YOLO object detection algorithm in identifying suitable landing spots for UAVs during emergencies. YOLOv3, with its Darknet53 backbone, is noted for its speed but lower accuracy. YOLOv4, incorporating CSPDarknet53, enhances accuracy and detection speed over YOLOv3 by integrating advanced features like SPP and PANet. YOLOv5, also based on CSPDarknet53 and implemented in PyTorch, offers the highest accuracy and speed comparable to YOLOv4. The evaluation, conducted on the MS COCO dataset, reveals that YOLOv5 strikes the best balance between accuracy and speed, making it highly suited for real-time UAV operations, while YOLOv3 remains advantageous in scenarios where detection speed is critical. This comparative analysis underscores the trade-offs between accuracy and speed in selecting the appropriate YOLO algorithm for UAV applications.
In a similar vein, Tan et al. [18] compare RetinaNet, SSD, and YOLOv3 for real-time pill identification, revealing distinct strengths and weaknesses among the models. RetinaNet demonstrates the highest mean Average Precision (mAP) at 82.89%, but its lower rate of 17 frames per second (FPS) limits its suitability for real-time applications. SSD offers a balance, with an mAP of 82.71% and 32 FPS. In contrast, YOLOv3 excels in speed, at 51 FPS with a slightly lower mAP of 80.69%. Additionally, YOLOv3 performs better on hard-sample detection, making it well suited for deployment in hospital environments where real-time processing is critical. YOLOv3 therefore emerges as the best option for real-time pill identification in busy hospital pharmacies due to its superior detection speed, despite a marginally lower mAP than RetinaNet.

Kuznetsova et al. [1] conduct a detailed comparison between the YOLOv5 and YOLOv3 models for the task of apple detection. The research demonstrates that YOLOv5 significantly outperforms YOLOv3 across various performance metrics. For example, YOLOv5 achieves a precision of 96.5%, a recall of 97.2%, and an F1 score of 96.9%, compared to lower scores for YOLOv3. Additionally, YOLOv5 shows a notable reduction in both the False Negative Rate (FNR) and the False Positive Rate (FPR). The findings highlight the advancements in deep convolutional networks represented by YOLOv5, underscoring its potential to enhance robotic harvesting technology in horticulture.

It is important to note that choosing the appropriate version of YOLO depends on the specific needs of the user, such as accuracy, speed, hardware limitations, and ease of use [5]. In our case, the target is UAV applications in general. Our research aims to offer insights into the advantages and limitations of YOLOv4 and YOLOv5 when applied to UAV tasks. We explore critical factors, including how well objects are detected, how quickly the models process data, the size of the models, and the resources they require. The results of this comparative analysis will help inform the selection of the most appropriate algorithm for UAV missions, ultimately improving the effectiveness and precision of aerial operations across various industries.

3. Methodology

Attaining very fast and highly accurate object detection at the same time might seem like a straightforward goal, but achieving it in a UAV environment involves navigating a complex trade-off [19]. Several critical parameters come into play, including the choice of GPU based on processing power and power consumption, as well as the selection of the YOLO pre-trained model, which affects both the speed and accuracy of detection. These factors create a delicate balance, where optimizing one element often means making sacrifices in another.

The parameter selection outlined above becomes clearer in the context of UAVs, as follows:

• Limited Battery Life: UAVs rely on batteries for power, and running resource-intensive algorithms like YOLO can drain the battery quickly, limiting flight time [20].

• Processing Power and Memory: Implementing YOLO on UAVs may require specialized hardware or optimizations to ensure real-time performance [20]. Since speed depends on the algorithm's performance, lower algorithmic performance may require more advanced hardware. Three hardware platforms with different processing capabilities are compared in Table 2.
• Payload Constraints: UAVs often have weight restrictions, and adding the equipment necessary for YOLO, such as cameras and processing units, can impact flight time and overall performance, especially on small UAVs [21].

• Data Transmission: Transmitting high-resolution images or video streams from the UAV to a ground station for YOLO processing can strain the communication bandwidth and introduce latency. This is especially challenging for real-time applications [22].

• Real-Time Processing: Achieving real-time YOLO processing on UAVs can be challenging due to limited computational resources. This can affect the responsiveness of the system in dynamic environments.

• Object Size and Distance: YOLO may struggle to detect small or distant objects, which can be a significant limitation in UAV applications, especially for tasks such as search and rescue or wildlife monitoring.

• Cost: Implementing YOLO on UAVs may require investments in hardware, software, and training, which can increase the overall cost of UAV operations. As can be seen in Table 2, as hardware performance increases, so does the cost.

In UAV applications, selecting the appropriate hardware is highly dependent on the specific use case. Table 2 provides a framework resembling a selection matrix to offer general guidance and assist in choosing suitable hardware and software configurations. In this study, as detailed in Section 4.2, the NVIDIA Jetson Orin NX 16 GB is utilized as the primary hardware platform. This device demonstrates excellent performance in fast detection and tracking tasks, making it suitable for the demanding requirements of UAV-based applications.

Table 2. A comparison of selected NVIDIA Jetson modules, highlighting their key specifications and features [23].

 | AGX ORIN 64 GB | AGX XAVIER 64 GB | JETSON NANO
AI performance | 275 TOPS | 32 TOPS | 472 GFLOPS
GPU | 2048-core | 512-core | 128-core
Memory bandwidth | 204.8 GB/s | 136.5 GB/s | 25.6 GB/s
Power | 15 W–60 W | 10 W–30 W | 5 W–10 W
Cost 1 | $1800 | $1400 | $500

1 Prices according to https://www.arrow.com.

Figures 1 and 2 show the structures of YOLOv4 and YOLOv5. Notably, YOLOv5 is a streamlined version of the earlier YOLO algorithms because it uses the PyTorch framework instead of the Darknet framework used by YOLOv4.

Figure 1. The YOLOv4 architecture, consisting of a CSPDarknet53 backbone; an SPP and PAN neck; and a YOLOv3 head [5].

Figure 2. The YOLOv5 architecture, delineating its three primary components: backbone, neck, and head. This representation is derived from the TensorBoard visualization of the model and the official documentation available in the YOLOv5 repository [24].
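As a quick, concrete illustration of the framework difference just noted (PyTorch for YOLOv5 versus Darknet for YOLOv4), the minimal sketch below loads a YOLOv5 model through the Ultralytics torch.hub interface, reports its parameter count, and runs a single inference. This is an illustrative example rather than part of this study's evaluation pipeline; the model variant (yolov5s) and the sample image URL are assumptions, not choices made in the paper.

```python
import torch

# Load a small YOLOv5 variant from the Ultralytics hub; the PyTorch-native
# packaging is what makes YOLOv5 lightweight to deploy compared to Darknet.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Parameter count gives a rough sense of the model's footprint.
n_params = sum(p.numel() for p in model.parameters())
print(f'yolov5s parameters: {n_params / 1e6:.1f} M')

# Run inference on a sample image (URL is the Ultralytics demo image).
results = model('https://ultralytics.com/images/zidane.jpg')
results.print()  # prints detected classes, confidences, and timing
```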
YOLOv4 utilizes CSPDarknet53 as its backbone, which improves gradient flow and enhances the model's learning capability by splitting feature maps and performing partial connections. This architecture is effective at detecting objects in complex environments, making it suitable for UAV applications that involve dense or cluttered scenes. In contrast, YOLOv5 employs an efficient CSP-based backbone (depending on the variant, often a simpler one than YOLOv4's). Its lightweight design improves computational efficiency, enabling faster inference speeds, which are critical for UAVs that operate with limited processing power. The metrics used to evaluate both models are defined as follows:

\text{Precision} = \frac{TP}{TP + FP} \quad (1)

\text{Recall} = \frac{TP}{TP + FN} \quad (2)

AP = \int_{r=0}^{r=1} P(r)\, dr \quad (3)

mAP = \frac{1}{n} \sum_{k=1}^{n} AP_k \quad (4)

The definitions of precision and recall are given in Equations (1) and (2). Here, TP stands for true positive, the correct prediction of a positive instance; FP for false positive, the wrong prediction of a negative instance; and FN for false negative, the missed prediction of a positive instance. Precision demonstrates how accurately the model can predict positive results: a high precision value signifies that the model tends to be correct when it predicts a positive outcome. Recall gives the proportion of true positive identifications relative to the overall count of verified positive instances.

From Equation (3), the Average Precision (AP), which is used in Tables 3–5, is the area under the precision–recall curve at each IoU threshold; here, r represents the recall variable. AP maps the PR curve to a single scalar value in the range of 0 to 1. A high Average Precision reflects situations where both precision and recall are high, while low values occur when either precision or recall is low across the confidence threshold levels. Average Recall (AR), in turn, signifies the highest recall achieved with a set number of detections per image, averaged across categories and Intersection over Union (IoU) thresholds. From Equation (4), the overall performance of the model can be measured: the Average Precision is calculated for each individual class, and these AP values are then averaged across the n classes (a worked numerical sketch follows Table 3).

Table 3. Average Precision (AP) and Average Recall (AR) calculated for YOLOv4 and YOLOv5 on the MS COCO evaluation server; the evaluated category is car.

Metric | YOLOv4 | YOLOv5
AP, IoU = 0.50:0.95, area = all, maxDets = 100 | 0.456 | 0.506
AP, IoU = 0.50, area = all, maxDets = 100 | 0.715 | 0.750
AP, IoU = 0.75, area = all, maxDets = 100 | 0.492 | 0.549
AP, IoU = 0.50:0.95, area = small, maxDets = 100 | 0.358 | 0.392
AP, IoU = 0.50:0.95, area = medium, maxDets = 100 | 0.603 | 0.675
AP, IoU = 0.50:0.95, area = large, maxDets = 100 | 0.601 | 0.702
AR, IoU = 0.50:0.95, area = all, maxDets = 1 | 0.193 | 0.211
AR, IoU = 0.50:0.95, area = all, maxDets = 10 | 0.545 | 0.590
AR, IoU = 0.50:0.95, area = all, maxDets = 100 | 0.603 | 0.643
AR, IoU = 0.50:0.95, area = small, maxDets = 100 | 0.508 | 0.534
AR, IoU = 0.50:0.95, area = medium, maxDets = 100 | 0.736 | 0.793
AR, IoU = 0.50:0.95, area = large, maxDets = 100 | 0.781 | 0.856
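To make Equations (1)–(4) concrete, the short sketch below computes precision, recall, AP (by numerically integrating a precision–recall curve with the trapezoid rule), and mAP. The TP/FP/FN counts and the PR-curve samples are invented for illustration only; the per-class AP values are the YOLOv5 figures from Tables 3–5 (IoU = 0.50:0.95, area = all, maxDets = 100).

```python
import numpy as np

# --- Precision and Recall, Equations (1) and (2), for illustrative counts ---
tp, fp, fn = 80, 20, 40          # hypothetical detection counts
precision = tp / (tp + fp)       # 0.800
recall = tp / (tp + fn)          # 0.667

# --- AP, Equation (3): area under a toy precision-recall curve ---
r = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])     # recall samples (invented)
p = np.array([1.0, 0.95, 0.90, 0.75, 0.60, 0.40])  # precision samples (invented)
ap = float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))  # trapezoid rule

# --- mAP, Equation (4): average of per-class AP (YOLOv5, Tables 3-5) ---
per_class_ap = {'car': 0.506, 'bus': 0.762, 'person': 0.614}
map_value = sum(per_class_ap.values()) / len(per_class_ap)

print(f'precision={precision:.3f} recall={recall:.3f} '
      f'AP={ap:.3f} mAP={map_value:.3f}')
```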
Table 4. Average Precision (AP) and Average Recall (AR) calculated for YOLOv4 and YOLOv5 on the MS COCO evaluation server; the evaluated category is bus.

Metric | YOLOv4 | YOLOv5
AP, IoU = 0.50:0.95, area = all, maxDets = 100 | 0.691 | 0.762
AP, IoU = 0.50, area = all, maxDets = 100 | 0.877 | 0.891
AP, IoU = 0.75, area = all, maxDets = 100 | 0.785 | 0.834
AP, IoU = 0.50:0.95, area = small, maxDets = 100 | 0.250 | 0.315
AP, IoU = 0.50:0.95, area = medium, maxDets = 100 | 0.588 | 0.658
AP, IoU = 0.50:0.95, area = large, maxDets = 100 | 0.803 | 0.876
AR, IoU = 0.50:0.95, area = all, maxDets = 1 | 0.501 | 0.547
AR, IoU = 0.50:0.95, area = all, maxDets = 10 | 0.762 | 0.830
AR, IoU = 0.50:0.95, area = all, maxDets = 100 | 0.768 | 0.834
AR, IoU = 0.50:0.95, area = small, maxDets = 100 | 0.431 | 0.516
AR, IoU = 0.50:0.95, area = medium, maxDets = 100 | 0.710 | 0.771
AR, IoU = 0.50:0.95, area = large, maxDets = 100 | 0.857 | 0.923

Table 5. Average Precision (AP) and Average Recall (AR) calculated for YOLOv4 and YOLOv5 on the MS COCO evaluation server; the evaluated category is person.

Metric | YOLOv4 | YOLOv5
AP, IoU = 0.50:0.95, area = all, maxDets = 100 | 0.548 | 0.614
AP, IoU = 0.50, area = all, maxDets = 100 | 0.827 | 0.846
AP, IoU = 0.75, area = all, maxDets = 100 | 0.606 | 0.669
AP, IoU = 0.50:0.95, area = small, maxDets = 100 | 0.375 | 0.413
AP, IoU = 0.50:0.95, area = medium, maxDets = 100 | 0.621 | 0.696
AP, IoU = 0.50:0.95, area = large, maxDets = 100 | 0.719 | 0.824
AR, IoU = 0.50:0.95, area = all, maxDets = 1 | 0.189 | 0.209
AR, IoU = 0.50:0.95, area = all, maxDets = 10 | 0.554 | 0.614
AR, IoU = 0.50:0.95, area = all, maxDets = 100 | 0.645 | 0.701
AR, IoU = 0.50:0.95, area = small, maxDets = 100 | 0.497 | 0.530
AR, IoU = 0.50:0.95, area = medium, maxDets = 100 | 0.711 | 0.776
AR, IoU = 0.50:0.95, area = large, maxDets = 100 | 0.807 | 0.889

4. Experiment

This comparative analysis between YOLOv4 and YOLOv5 is conducted using two distinct methodologies. The MS COCO dataset is selected for this study due to its comprehensive annotations, wide range of object categories, and extensive diversity in environmental conditions, which make it an ideal choice for evaluating object detection models compared to other datasets. The UAV platform is chosen after a thorough evaluation to ensure its suitability for real-world applications and to identify potential challenges during practical deployment. This evaluation revealed critical factors, such as the difficulty of detecting small objects and the unique challenges introduced by the aerial perspective of the UAV. These insights are instrumental in refining the system for robust and reliable performance under real-world conditions. These choices are made to balance generalization capability, computational feasibility, and real-world applicability, all of which are essential to improving UAV-based applications, such as search-and-rescue (SAR) operations.

4.1. MS-COCO Evaluation

The evaluation of both YOLO versions is conducted using standard metrics, including Average Precision (AP) and Average Recall (AR). These metrics provide insights into the detection accuracy and the algorithm's ability to identify true positives across different Intersection over Union (IoU) thresholds. This benchmark is widely recognized for its reliability and comprehensive evaluation metrics and has been adopted by leading organizations such as Microsoft, Facebook, and the Common Visual Data Foundation (CVDF). For these reasons, we choose to adopt it in our research to ensure a robust and standardized performance evaluation.
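For reference, per-category results such as those in Tables 3–5 can be produced locally with the standard pycocotools evaluation API. The sketch below is a minimal illustration, assuming ground-truth annotations and model detections already exported in COCO JSON format; the file names are placeholders, not artifacts from this study.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground truth and detections in COCO JSON format (placeholder file names).
coco_gt = COCO('instances_val2017.json')
coco_dt = coco_gt.loadRes('yolo_detections.json')

evaluator = COCOeval(coco_gt, coco_dt, iouType='bbox')
# Restrict the evaluation to a single category, e.g., 'car', as in Table 3.
evaluator.params.catIds = coco_gt.getCatIds(catNms=['car'])
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP/AR lines in the format mirrored by Tables 3-5
```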
This method employs 20,288 images from the MS COCO dataset [11] to evaluate the detection performance of both YOLO algorithms. Using real-world images from this dataset, a comparison is performed by analyzing the performance of YOLOv4 and YOLOv5 in detecting three specific object classes: car, bus, and person. These categories are chosen because they correspond directly to critical targets commonly encountered in UAV applications under diverse conditions. Although the MS COCO dataset includes a wide range of object categories, many (e.g., kitchen utensils or furniture) are irrelevant to UAV-based scenarios; including such unrelated categories would dilute the focus of the study and reduce computational efficiency without providing meaningful insights into UAV applications.

The results are shown in Tables 3–5, where AP measures the area under the precision–recall curve, reflecting the model's ability to balance precision (few false positives) and recall (many true positives) across various confidence thresholds. In UAV applications, AP is critical for evaluating detection quality under diverse conditions. In the following, we relate these metrics to typical use cases in detail.

1. Surveillance and Monitoring: High precision is vital in applications like military surveillance or border monitoring, where false positives can lead to unnecessary alerts or actions. For example, the detection of intruders or vehicles must have high AP to avoid misidentifying benign objects as threats. Recall also matters for ensuring that no critical objects (e.g., intruders) are missed, but a balanced AP ensures consistent performance.

2. Search and Rescue: In disaster scenarios, UAVs often scan large areas to locate missing persons or critical items. High AP ensures that the system can accurately detect objects such as humans or vehicles, minimizing false positives that could divert rescue efforts. Precision and recall trade-offs must be optimized for speed and resource efficiency.

3. Inspection and Maintenance: In industrial UAV applications, such as inspecting wind turbines or power lines, high AP ensures the accurate detection of defects or anomalies, reducing the risk of missing critical issues or flagging false positives that increase operational costs.

This study employs standard model indicators [25] to validate the target detection model; the Average Precision (AP) and Average Recall (AR) metrics are used to evaluate algorithm performance. These indicators depend on three variables: IoU, area, and maxDets. As can be seen in Tables 3–5, IoU refers to the Intersection over Union, which is formulated below and visualized in Figure 3 (a short code sketch follows the figure). It measures how well the predicted position aligns with the ground-truth position of the object and is used in evaluating precision and recall. For example, an IoU threshold of 0.50 means counting only detections whose IoU with the ground truth is at least 0.50; consequently, as the IoU threshold increases, the measured precision decreases.

IoU = \frac{\text{area of overlap}}{\text{area of union}}

Figure 3. The Intersection over Union.
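As referenced above, the IoU formula translates directly into a few lines of code. The following minimal sketch computes the IoU of a predicted and a ground-truth box in [x1, y1, x2, y2] pixel coordinates; the example box values are invented for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes [x1, y1, x2, y2]."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Overlap area is zero when the boxes do not intersect.
    overlap = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0

# Example: a prediction shifted slightly from the ground truth.
print(iou([100, 100, 200, 200], [110, 110, 210, 210]))  # ~0.68
```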
The observed performance differences at varying IoU thresholds indicate that YOLOv5 is generally more versatile and reliable across a wide range of conditions.
Its superior precision and robustness at both low and medium IoU thresholds make it particularly suited for UAV applications that demand real-time responsiveness and high detection accuracy, such as traffic monitoring and agricultural assessments. While YOLOv4 remains competitive, its limitations become more pronounced at stricter thresholds, potentially affecting its usability in tasks requiring precise localization. By considering these differences, practitioners can select the appropriate model and IoU threshold based on the specific requirements of their UAV application.

To further explore the differences in detection performance between YOLOv4 and YOLOv5, Tables 6 and 7 provide the calculated means and standard deviations. In addition, Figures 4 and 5 present box plots visualizing the distribution of the AP and AR metrics for the three object categories: person, bus, and car. These figures further support the statistical findings that YOLOv5 outperforms YOLOv4 in terms of precision and recall. The consistency and reduced variability of YOLOv5 underscore its suitability for applications requiring stable object detection.

Table 6. Mean ± standard deviation of Average Precision (AP) for YOLOv4 and YOLOv5 across the person, bus, and car object categories.

 | Person | Bus | Car
YOLOv4 | 0.62 ± 0.14 | 0.67 ± 0.21 | 0.54 ± 0.12
YOLOv5 | 0.68 ± 0.14 | 0.72 ± 0.20 | 0.60 ± 0.12

Table 7. Mean ± standard deviation of Average Recall (AR) for YOLOv4 and YOLOv5 across the person, bus, and car object categories.

 | Person | Bus | Car
YOLOv4 | 0.57 ± 0.20 | 0.67 ± 0.15 | 0.56 ± 0.19
YOLOv5 | 0.62 ± 0.22 | 0.74 ± 0.15 | 0.60 ± 0.21

Figure 4. Box plots illustrating the distribution of (a) Average Precision (AP) and (b) Average Recall (AR) for YOLOv4 across the person, bus, and car object categories.

Figure 5. Box plots illustrating the distribution of (a) Average Precision (AP) and (b) Average Recall (AR) for YOLOv5 across the person, bus, and car object categories.

4.2. UAV-Captured Real Images Evaluation

In this approach, as shown in Figure 6, we use a multirotor drone (Raven Base from MR Aviation) with a 30× zoom camera (Z30F from Viewpro) to collect original images and videos from different altitudes. These images represent the unique challenges posed by UAV environments, encompassing factors such as varying altitudes, lighting conditions, and weather variables. This method offers insights into the suitability of the algorithms for real-world UAV applications. The computational speed and the ability to detect small objects are compared for both algorithms, and the results are discussed in the next section. The test environment in which the algorithms run is as follows: Python 3.10.12, torch 2.0.1 with CUDA 11.8 (CUDA:0), a Tesla T4 GPU, and CUDNN_HALF = 1.

Figure 6. YOLO object detection (residual blocks, bounding box regression, and Intersection over Union).
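To indicate how per-frame throughput figures like those reported in Section 5.2 can be measured in such an environment, the sketch below times inference over a video stream. It is a minimal illustration assuming the Ultralytics torch.hub interface and OpenCV for decoding; the video path, model variant, and confidence threshold are placeholders rather than artifacts of this study.

```python
import time

import cv2
import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.conf = 0.25  # confidence threshold; raising it discards low-confidence boxes

cap = cv2.VideoCapture('uav_footage.mp4')  # placeholder path to a test video
frames, start = 0, time.time()

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # The hub wrapper accepts numpy arrays; OpenCV decodes BGR, so flip to RGB.
    _ = model(frame[:, :, ::-1])
    frames += 1

cap.release()
elapsed = time.time() - start
print(f'{frames} frames in {elapsed:.1f} s -> {frames / elapsed:.1f} FPS')
```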
5. Results and Discussion

The comparison between YOLOv4 and YOLOv5 yields valuable information on their performance across various metrics.

5.1. Detection Accuracy

In terms of detection precision, YOLOv5 exhibits better performance than YOLOv4, as evidenced by the results in Tables 3–5 from the evaluation on the MS COCO dataset. The mean Average Precision (mAP) scores consistently favor YOLOv5, highlighting its improved precision and reliability in identifying objects.

5.2. Speed and Efficiency

Under the same test environment, Table 8 shows the speed comparison: YOLOv4 and YOLOv5 reach 23.5 FPS and 38.4 FPS, respectively, indicating that YOLOv5 achieves a higher frame rate than YOLOv4. This improvement in speed is crucial for real-time applications, especially in dynamic environments.

Table 8. Sample detection frames for YOLOv4 (left column) and YOLOv5 (right column) from the UAV-captured test video (resolution of 720 × 1280 at 24 FPS, 3551 frames). The measured results for YOLOv4 and YOLOv5 are 23.5 FPS and 38.4 FPS, respectively. It can also be seen that YOLOv4 has more success in detecting objects with a small area, which is an advantage for UAVs detecting objects from a greater distance.

5.3. Confidence Values

UAV-based detection systems often require fast decision making in dynamic environments. Adjusting the confidence threshold can help strike a balance between real-time performance and detection quality: a higher threshold might reduce computational time by eliminating low-confidence detections, while a lower threshold could slow down the process because more detections must be handled. Although YOLOv4 demonstrates slightly higher average confidence values than YOLOv5, as depicted in Figure 5, it is important to note that the pre-trained YOLOv4 and YOLOv5 models are optimized for ground-level images; therefore, to foster a more pertinent and insightful comparison, the utilization of datasets such as the VisDrone dataset becomes particularly relevant. That dataset encompasses a diverse collection of images and videos captured by drone-mounted cameras, offering a more fitting context for evaluation.

5.4. Suitability for UAV Applications

The study emphasizes the importance of evaluating these models in a UAV environment. The results underscore that YOLOv5, with its heightened accuracy and speed, is more apt for UAV applications.

6. Conclusions

The YOLOv4 and YOLOv5 algorithms are introduced, and their performance is first compared on the COCO dataset, which is now considered one of the standard datasets for comparing AI detection algorithms. Both algorithms are then applied to videos taken from a UAV platform. The results are analyzed with different metrics, speed (FPS) and accuracy (AP), under different conditions, including IoU, area, and maxDets. Based on the experiments with the COCO dataset, YOLOv5 outperforms YOLOv4 with an accuracy increase of 5% to 16%, depending on the evaluation conditions. Processing speeds on the UAV-captured images are measured as 23.5 FPS and 38.4 FPS for YOLOv4 and YOLOv5, respectively, a 63% increase for YOLOv5. This shows that YOLOv5 is both more efficient and more accurate, which means that YOLOv5 can run on smaller and cheaper hardware with the same performance as YOLOv4. This is important for UAV applications, since each additional unit of weight and power usage reduces flight time and payload capacity.

The method used in this work has been tested in a cloud environment using Google Colab with a Tesla T4 GPU. The performance of the algorithms depends on the operating hardware: GPUs with more processing units, such as the V100 and A100, will deliver faster performance, while edge devices like the NVIDIA Jetson series will be slower. However, on the same platform, YOLOv5 should outperform YOLOv4 at the rates reported here in both precision and speed.
The analysis might be extended in the future to further robotic applications of these algorithms, such as target tracking, observing how detection performance affects robotic performance metrics such as tracking error, flight time, and weight.

Finally, it is important to highlight some of the limitations associated with YOLOv5 to provide a balanced perspective on its applicability and performance. While YOLOv5 is designed for speed and efficiency, its larger variants, such as YOLOv5x, can be computationally intensive, requiring substantial resources for both training and inference. This poses challenges for deployment on resource-constrained devices, such as UAVs, and in real-time applications where low latency is critical. Moreover, UAV applications often involve detecting small or partially obscured objects at a distance, conditions that can challenge YOLOv5's anchor-based detection approach, despite notable improvements over earlier YOLO versions. Another key limitation lies in YOLOv5's heavy reliance on the quality and diversity of its training datasets. Although the model demonstrates exceptional performance when trained on comprehensive datasets like MS COCO, its detection accuracy can deteriorate significantly when applied to scenarios involving unseen objects or environments not represented in the training data. This dependency underscores the need for extensive domain-specific data collection and augmentation to maintain robust performance in real-world UAV applications, where environmental conditions can be highly variable and unpredictable.

For future work in this area, expanding the comparative study to include other state-of-the-art object detection models, such as Transformer-based architectures like DETR, would provide valuable insights into their suitability for UAV applications. Furthermore, research can focus on key directions such as integrating YOLOv4 and YOLOv5 into UAV systems for real-world deployment. This integration could provide critical information on the performance of the models in various applications, including traffic monitoring, disaster response, and agricultural monitoring. Testing these models under varying environmental conditions is essential to evaluate their robustness, adaptability, and practicality in real-time UAV operations.

Author Contributions: All authors contributed their unique insights to the research concept and to the design and implementation of the research. M.A.M.A. contributed to writing the original draft and to the analysis of results; E.Y. reviewed, edited, and checked the manuscript. After review and discussion, the authors unanimously approved the content of the final manuscript. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Data Availability Statement: The MS COCO evaluation server using the test-dev2017 dataset, which was used in this study, is openly available online at https://competitions.codalab.org/competitions/20794#results, accessed on 19 July 2023.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Kuznetsova, A.; Maleva, T.; Soloviev, V. YOLOv5 Versus YOLOv3 for Apple Detection; Springer: Berlin/Heidelberg, Germany, 2021; pp. 349–358. [CrossRef]
2. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. Available online: http://pjreddie.com/yolo/ (accessed on 19 September 2023).
3. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of YOLO Algorithm Developments. In Procedia Computer Science; Elsevier B.V.: Amsterdam, The Netherlands, 2021; pp. 1066–1073. [CrossRef]
4. Nepal, U.; Eslamiat, H. Comparing YOLOv3, YOLOv4 and YOLOv5 for Autonomous Landing Spot Detection in Faulty UAVs. Sensors 2022, 22, 464. [CrossRef] [PubMed]
5. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
6. YOLOv5 v1.0 Commits · Ultralytics/yolov5 · GitHub. Available online: https://github.com/ultralytics/yolov5/commits/v1.0 (accessed on 18 September 2023).
7. Zhang, P.; Li, D. EPSA-YOLO-V5s: A novel method for detecting the survival rate of rapeseed in a plant factory based on multiple guarantee mechanisms. Comput. Electron. Agric. 2022, 193, 106714. [CrossRef]
8. Kırac, E.; Özbek, S. Deep Learning Based Object Detection with Unmanned Aerial Vehicle Equipped with Embedded System. J. Aviat. 2024, 8, 15–25. [CrossRef]
9. Kim, J.; Cho, J. RGDiNet: Efficient Onboard Object Detection with Faster R-CNN for Air-to-Ground Surveillance. Sensors 2021, 21, 1677. [CrossRef]
10. Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A Survey of Object Detection for UAVs Based on Deep Learning. Remote Sens. 2024, 16, 149. [CrossRef]
11. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312.
12. YOLOv4 GitHub—AlexeyAB/Darknet at 3d4242a6e534fe44afd3c0bf0de92e0f4e9ce23f. Available online: https://github.com/AlexeyAB/darknet/tree/3d4242a6e534fe44afd3c0bf0de92e0f4e9ce23f (accessed on 19 September 2023).
13. Carrio, A.; Sampedro, C.; Rodriguez-Ramos, A.; Campoy, P. A review of deep learning methods and applications for unmanned aerial vehicles. J. Sens. 2017, 2017, 3296874. [CrossRef]
14. Salvini, P. Urban robotics: Towards responsible innovations for our cities. Robot. Auton. Syst. 2018, 100, 278–286. [CrossRef]
15. Schedl, D.C.; Kurmi, I.; Bimber, O. Search and Rescue with Airborne Optical Sectioning. Nat. Mach. Intell. 2020, 2, 783–790. [CrossRef]
16. Xu, R.; Lin, H.; Lu, K.; Cao, L.; Liu, Y. A forest fire detection system based on ensemble learning. Forests 2021, 12, 217. [CrossRef]
17. Park, S.E.; Eem, S.H.; Jeon, H. Concrete crack detection and quantification using deep learning and structured light. Constr. Build. Mater. 2020, 252, 119096. [CrossRef]
18. Tan, L.; Huangfu, T.; Wu, L.; Chen, W. Comparison of RetinaNet, SSD, and YOLO v3 for real-time pill identification. BMC Med. Inform. Decis. Mak. 2021, 21, 324. [CrossRef] [PubMed]
19. Li, S.; Ozo, M.M.O.I.; Wagter, C.D.; de Croon, G.C.H.E. Autonomous drone race: A computationally efficient vision-based navigation and control strategy. Robot. Auton. Syst. 2020, 133, 103621. [CrossRef]
20. Plastiras, G.; Kyrkou, C.; Theocharides, T.
EdgeNet—Balancing accuracy and performance for edge-based convolutional neural network object detectors. In ACM International Conference Proceeding Series; Association for Computing Machinery: New York, NY, USA, 2019. [CrossRef]
21. Bayer, R.; Priest, J.; Tözün, P. Reaching the Edge of the Edge: Image Analysis in Space. arXiv 2024, arXiv:2301.04954.
22. Yang, Q.; Yang, J.H. HD video transmission of multi-rotor Unmanned Aerial Vehicle based on 5G cellular communication network. Comput. Commun. 2020, 160, 688–696. [CrossRef]
23. Jetson Modules, Support, Ecosystem, and Lineup | NVIDIA Developer. Available online: https://developer.nvidia.com/embedded/jetson-modules (accessed on 19 October 2023).
24. Liu, L.; Liu, Y.; Gao, X.-Z.; Zhang, X. An Immersive Human-Robot Interactive Game Framework Based on Deep Learning for Children's Concentration Training. Healthcare 2022, 10, 1779. [CrossRef] [PubMed]
25. COCO—Common Objects in Context Metrics. Available online: https://cocodataset.org/#detection-eval (accessed on 7 November 2023).

Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.