Academic Editor: Jun-Qiang Wang

Received: 14 November 2024 / Revised: 24 December 2024 / Accepted: 6 January 2025 / Published: 16 January 2025

Citation: Alhassan, M.A.M.; Yılmaz, E. Evaluating YOLOv4 and YOLOv5 for Enhanced Object Detection in UAV-Based Surveillance. Processes 2025, 13, 254. https://doi.org/10.3390/pr13010254

Copyright: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article

Evaluating YOLOv4 and YOLOv5 for Enhanced Object Detection in UAV-Based Surveillance

Mugtaba Abdalrazig Mohamed Alhassan * and Ersen Yılmaz

Electrical-Electronic Engineering Department, Bursa Uludag University, 16059 Bursa, Turkey; ersen@uludag.edu.tr
* Correspondence: mugtaba.abdalrazig@gmail.com

Abstract: Traditional surveillance systems often rely on fixed cameras with limited coverage and on human monitoring, which can introduce errors and delays. Unmanned Aerial Vehicles (UAVs) equipped with object detection algorithms, such as You Only Look Once (YOLO), offer a robust solution for dynamic surveillance, enabling real-time monitoring over large and inaccessible areas. In this study, we present a comparative analysis of YOLOv4 and YOLOv5 for UAV-based surveillance applications, focusing on two critical metrics: detection speed (Frames Per Second, FPS) and accuracy (Average Precision, AP). Using aerial imagery captured by a UAV, along with 20,288 images from the Microsoft Common Objects in Context (MS COCO) dataset, we evaluate each model's suitability for deployment in high-demand environments. The results indicate that YOLOv5 outperforms YOLOv4, with a 1.63-fold increase in FPS and a 1.09-fold improvement in AP, suggesting that YOLOv5 is the more efficient option for UAV-based detection. To align with recent advancements, this study also highlights potential areas for integrating newer YOLO models and transformer-based architectures in future research to further enhance detection performance and model robustness. This work aims to provide a solid foundation for UAV-based object detection, while acknowledging the need for continuous development to accommodate newer models and evolving detection challenges.

Keywords: image processing; object detection; unmanned aerial vehicles; YOLO

1. Introduction

Over the past years, significant advancements have taken place in the field of object detection, and among these, the evolution of the You Only Look Once (YOLO) algorithm stands out as one of the most noteworthy [1]. YOLO first came onto the scene in June 2016, when Joseph Redmon et al. shared the first version on GitHub [2], marking a transformative shift in the approach to object detection. Unlike its predecessors, YOLO reformulated object detection as a regression problem, leveraging a single convolutional neural network (CNN) to predict both object locations and their associated class probabilities [2]. This innovative approach not only made YOLO faster and more accurate but also improved its generalization capabilities [3]. The first version of YOLO laid the groundwork for YOLOv2, released in December 2016, which applied batch normalization across all of its convolutional layers. This enhancement made training more effective by reducing overfitting and improved both stability and performance [2].
YOLOv2 was followed by YOLOv3, released in April 2018 [3], which introduced significant enhancements to the algorithm. One notable improvement was its ability to predict objects at three different scales, a feature that extended its effectiveness in detecting objects of various sizes. Additionally, YOLOv3 adopted a more efficient backbone architecture known as Darknet-53 [4], a contribution that notably bolstered both its accuracy and processing speed [3].

In April 2020, the release of YOLOv4 marked a significant milestone in the YOLO series [5]. This version was strategically fine-tuned for optimal resource utilization, aiding deployment across diverse computing platforms, including edge devices. It introduced a novel foundational architecture, CSPDarknet53, which led to improved detection accuracy and enhanced speed [5]. YOLOv4 remains a state-of-the-art object detection model that provides a significant improvement over previous versions.

Afterwards, the development of YOLOv5 was taken up by Ultralytics [6]. A new architecture based on the CSP (Cross-Stage-Partial) backbone improved the model's accuracy while maintaining fast inference speeds [6]. The CSP architecture also made YOLOv5 compatible with a wide range of platforms, from mobile and other low-power edge devices to more capable hardware such as NVIDIA GPUs; an example of applying these algorithms on an NVIDIA device can be seen in [4]. YOLOv5 maintains anchor-based predictions similar to previous YOLO versions, while the introduction of a novel anchor-free approach enhanced its capability to detect smaller objects; an example of this approach can be seen in [7]. Additionally, a summary of the various YOLO properties is provided in Table 1.

Table 1. Comparing the different versions of YOLO, their features, and performance characteristics.

YOLO Version | Release Date | Backbone Architecture | Key Features
YOLOv1 | June 2016 | Custom network (24 convolutional layers) | Single-stage design aimed at efficiency
YOLOv2 | December 2016 | Darknet-19 | Batch normalization for faster training and improved performance
YOLOv3 | April 2018 | Darknet-53 | Deeper architecture for improved feature extraction
YOLOv4 | April 2020 | CSPDarknet53 | Improved efficiency and accuracy through feature reuse
YOLOv5 | June 2020 | Modified CSPDarknet53 | Scalable architecture with multiple model sizes

Other state-of-the-art object detection models, such as Faster R-CNN and EfficientDet, have also demonstrated significant potential in UAV applications. Faster R-CNN, as a two-stage detector, is highly regarded for its exceptional detection accuracy and robust performance in complex environments. However, its reliance on substantial computational resources to achieve real-time processing speeds poses challenges for deployment on resource-constrained UAV platforms.
In contrast, EfficientDet employs compound scaling and an advanced architecture that strikes a commendable balance between speed and accuracy, making it a competitive alternative for UAV-based tasks. Nonetheless, both Faster R-CNN and EfficientDet encounter limitations in real-time applications, as their computational demands remain considerably higher than those of YOLO models, which are specifically optimized for lightweight and efficient performance in real-time UAV operations [8–10].

According to Microsoft COCO, object recognition algorithms involve two components: classification, which assigns an image a label selected from a predefined set of classes (e.g., car, person, and dog), and object detection, which marks all objects in an image that belong to one or more of the categories and defines each object's spatial boundaries by predicting a bounding box that encompasses it [11].

Focusing on the two most recent versions of YOLO at the time of this study, YOLOv4, released in April 2020, was followed by YOLOv5 in June 2020, whose release raised questions about its claims of increased speed (140 frames per second, FPS) and a significantly reduced code size. YOLOv5 is notably compact, measuring around 27 megabytes, compared to YOLOv4, which occupies 244 megabytes according to their Git repositories [6,12]; this difference affects the efficacy of object detection in aerial imagery.

On the other hand, in recent years, Unmanned Aerial Vehicles (UAVs) have seen an exponential increase in use across many different fields, including but not limited to surveillance, for which some examples can be examined in [13], agriculture [14], search-and-rescue missions [15,16], and infrastructure inspection [17]. These versatile aerial platforms have revolutionized data collection and monitoring tasks, with applications that demand efficient and accurate object detection capabilities [15].

2. Related Work

Object detection has undergone significant advancements in recent years, particularly with the introduction and evolution of the YOLO family of algorithms. These advancements have greatly impacted various fields, including UAV (Unmanned Aerial Vehicle) applications. Numerous studies have compared different versions of YOLO, each aiming to improve the balance between detection accuracy, speed, and resource efficiency.

Nepal et al. [4] evaluate the performance of three versions of the YOLO object detection algorithm in identifying suitable landing spots for UAVs during emergencies. YOLOv3, with its Darknet53 backbone, is noted for its speed but lower accuracy. YOLOv4, incorporating CSPDarknet53, enhances accuracy and detection speed over YOLOv3 by integrating advanced features like SPP and PANet. YOLOv5, also based on CSPDarknet53 and implemented in PyTorch, offers the highest accuracy and speed comparable to YOLOv4. The evaluation, conducted on the MS COCO dataset, reveals that YOLOv5 strikes the best balance between accuracy and speed, making it highly suited for real-time UAV operations, while YOLOv3 remains advantageous in scenarios where detection speed is critical. This comparative analysis underscores the trade-offs between accuracy and speed in selecting the appropriate YOLO algorithm for UAV applications.
In a similar vein, Tan et al. [18] compare RetinaNet, SSD, and YOLOv3 for real-time pill identification, revealing distinct strengths and weaknesses among the models. RetinaNet demonstrates the highest mean Average Precision (mAP) at 82.89%, but its lower rate of 17 frames per second (FPS) limits its suitability for real-time applications. SSD offers a balance, with an mAP of 82.71% and 32 FPS. In contrast, YOLOv3 excels in speed, at 51 FPS with a slightly lower mAP of 80.69%. Additionally, YOLOv3 performs better on hard-sample detection, making it well suited for deployment in hospital environments where real-time processing is critical. YOLOv3 therefore emerges as the best option for real-time pill identification in busy hospital pharmacies due to its superior detection speed, despite a marginally lower mAP than RetinaNet.

Kuznetsova et al. [1] conduct a detailed comparison between the YOLOv5 and YOLOv3 models for the task of apple detection. The research demonstrates that YOLOv5 significantly outperforms YOLOv3 across various performance metrics. For example, YOLOv5 achieves a precision of 96.5%, a recall of 97.2%, and an F1 score of 96.9%, compared to lower scores for YOLOv3. Additionally, YOLOv5 shows a notable reduction in both the False Negative Rate (FNR) and the False Positive Rate (FPR). The findings highlight the advancements in deep convolutional networks represented by YOLOv5, underscoring its potential to enhance robotic harvesting technology in horticulture.

It is important to note that choosing the appropriate version of YOLO depends on the specific needs of the user, such as accuracy, speed, hardware limitations, and ease of use [5]. In our case, the target is UAV applications in general. Our research aims to offer insights into the advantages and limitations of YOLOv4 and YOLOv5 when applied to UAV tasks. We explore critical factors, including how well objects are detected, how quickly the models process data, the size of the models, and the resources they require. The results of this comparative analysis will help inform the selection of the most appropriate algorithm for UAV missions, ultimately improving the effectiveness and precision of aerial operations across various industries.

3. Methodology

Attaining very fast and highly accurate object detection at the same time might seem like a straightforward goal, but achieving it in a UAV environment involves navigating a complex trade-off [19]. Several critical parameters come into play, including the choice of GPU based on processing power and power consumption, as well as the selection of the YOLO pre-trained model, which affects both the speed and accuracy of detection. These factors create a delicate balance, where optimizing one element often means making sacrifices in another.

The parameter selection outlined above becomes clearer in the context of UAVs, as follows:

• Limited Battery Life: UAVs rely on batteries for power, and running resource-intensive algorithms like YOLO can drain the battery quickly, limiting flight time [20].

• Processing Power and Memory: Implementing YOLO on UAVs may require specialized hardware or optimizations to ensure real-time performance [20]. Since speed depends on the algorithm's performance, lower algorithmic performance may require more advanced hardware. Three hardware platforms with different processing capabilities are compared in Table 2.
• Payload Constraints: UAVs often have weight restrictions, and adding the equipment necessary for YOLO, such as cameras and processing units, can impact flight time and overall performance, especially on small UAVs [21].

• Data Transmission: Transmitting high-resolution images or video streams from the UAV to a ground station for YOLO processing can strain the communication bandwidth and introduce latency. This is especially challenging for real-time applications [22].

• Real-Time Processing: Achieving real-time YOLO processing on UAVs can be challenging due to limited computational resources. This can affect the responsiveness of the system in dynamic environments.

• Object Size and Distance: YOLO may struggle to detect small or distant objects, which can be a significant limitation in UAV applications, especially for tasks such as search and rescue or wildlife monitoring.

• Cost: Implementing YOLO on UAVs may require investments in hardware, software, and training, which can increase the overall cost of UAV operations. As can be seen in Table 2, as hardware performance increases, so does the cost.

In UAV applications, selecting the appropriate hardware is highly dependent on the specific use case. Table 2 provides a framework resembling a selection matrix to offer general guidance and assist in choosing suitable hardware and software configurations. In this study, as detailed in Section 4.2, the NVIDIA Jetson Orin NX 16 GB is utilized as the primary hardware platform. This device demonstrates excellent performance in fast detection and tracking tasks, making it suitable for the demanding requirements of UAV-based applications.

Table 2. A comparison of selected NVIDIA Jetson modules, highlighting their key specifications and features [23].

 | AGX ORIN 64 GB | AGX XAVIER 64 GB | JETSON NANO
AI performance | 275 TOPS | 32 TOPS | 472 GFLOPS
GPU | 2048-core | 512-core | 128-core
Memory bandwidth | 204.8 GB/s | 136.5 GB/s | 25.6 GB/s
Power | 15 W–60 W | 10 W–30 W | 5 W–10 W
Cost 1 | $1800 | $1400 | $500

1 Prices according to https://www.arrow.com.

Figures 1 and 2 show the structures of YOLOv4 and YOLOv5. Notably, YOLOv5 is a streamlined version of the earlier YOLO algorithms because it uses the PyTorch framework instead of the Darknet framework used by YOLOv4.

Figure 1. The YOLOv4 architecture, consisting of a CSPDarknet53 backbone; an SPP and PAN neck; and a YOLOv3 head [5].

Figure 2. The YOLOv5 architecture, delineating its three primary components: backbone, neck, and head. This representation is derived from the TensorBoard visualization of the model and the official documentation available in the YOLOv5 repository [24].
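As a quick, concrete illustration of the framework difference just noted (PyTorch for YOLOv5 versus Darknet for YOLOv4), the minimal sketch below loads a YOLOv5 model through the Ultralytics torch.hub interface, reports its parameter count, and runs a single inference. This is an illustrative example rather than part of this study's evaluation pipeline; the model variant (yolov5s) and the sample image URL are assumptions, not choices made in the paper.

```python
import torch

# Load a small YOLOv5 variant from the Ultralytics hub; the PyTorch-native
# packaging is what makes YOLOv5 lightweight to deploy compared to Darknet.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Parameter count gives a rough sense of the model's footprint.
n_params = sum(p.numel() for p in model.parameters())
print(f'yolov5s parameters: {n_params / 1e6:.1f} M')

# Run inference on a sample image (URL is the Ultralytics demo image).
results = model('https://ultralytics.com/images/zidane.jpg')
results.print()  # prints detected classes, confidences, and timing
```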
YOLOv4 utilizes CSPDarknet53 as its backbone, which improves gradient flow and enhances the model's learning capability by splitting feature maps and performing partial connections. This architecture is effective at detecting objects in complex environments, making it suitable for UAV applications that involve dense or cluttered scenes. In contrast, YOLOv5 employs an efficient CSP-based backbone (depending on the variant, often a simpler one than YOLOv4's). Its lightweight design improves computational efficiency, enabling faster inference speeds, which are critical for UAVs that operate with limited processing power. The metrics used to evaluate both models are defined as follows:

\text{Precision} = \frac{TP}{TP + FP} \quad (1)

\text{Recall} = \frac{TP}{TP + FN} \quad (2)

AP = \int_{r=0}^{r=1} P(r)\, dr \quad (3)

mAP = \frac{1}{n} \sum_{k=1}^{n} AP_k \quad (4)

The definitions of precision and recall are given in Equations (1) and (2). Here, TP stands for true positive, the correct prediction of a positive instance; FP for false positive, the wrong prediction of a negative instance; and FN for false negative, the missed prediction of a positive instance. Precision demonstrates how accurately the model can predict positive results: a high precision value signifies that the model tends to be correct when it predicts a positive outcome. Recall gives the proportion of true positive identifications relative to the overall count of verified positive instances.

From Equation (3), the Average Precision (AP), which is used in Tables 3–5, is the area under the precision–recall curve at each IoU threshold; here, r represents the recall variable. AP maps the PR curve to a single scalar value in the range of 0 to 1. A high Average Precision reflects situations where both precision and recall are high, while low values occur when either precision or recall is low across the confidence threshold levels. Average Recall (AR), in turn, signifies the highest recall achieved with a set number of detections per image, averaged across categories and Intersection over Union (IoU) thresholds. From Equation (4), the overall performance of the model can be measured: the Average Precision is calculated for each individual class, and these AP values are then averaged across the n classes (a worked numerical sketch follows Table 3).

Table 3. Average Precision (AP) and Average Recall (AR) calculated for YOLOv4 and YOLOv5 on the MS COCO evaluation server; the evaluated category is car.

Metric | YOLOv4 | YOLOv5
AP, IoU = 0.50:0.95, area = all, maxDets = 100 | 0.456 | 0.506
AP, IoU = 0.50, area = all, maxDets = 100 | 0.715 | 0.750
AP, IoU = 0.75, area = all, maxDets = 100 | 0.492 | 0.549
AP, IoU = 0.50:0.95, area = small, maxDets = 100 | 0.358 | 0.392
AP, IoU = 0.50:0.95, area = medium, maxDets = 100 | 0.603 | 0.675
AP, IoU = 0.50:0.95, area = large, maxDets = 100 | 0.601 | 0.702
AR, IoU = 0.50:0.95, area = all, maxDets = 1 | 0.193 | 0.211
AR, IoU = 0.50:0.95, area = all, maxDets = 10 | 0.545 | 0.590
AR, IoU = 0.50:0.95, area = all, maxDets = 100 | 0.603 | 0.643
AR, IoU = 0.50:0.95, area = small, maxDets = 100 | 0.508 | 0.534
AR, IoU = 0.50:0.95, area = medium, maxDets = 100 | 0.736 | 0.793
AR, IoU = 0.50:0.95, area = large, maxDets = 100 | 0.781 | 0.856
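To make Equations (1)–(4) concrete, the short sketch below computes precision, recall, AP (by numerically integrating a precision–recall curve with the trapezoid rule), and mAP. The TP/FP/FN counts and the PR-curve samples are invented for illustration only; the per-class AP values are the YOLOv5 figures from Tables 3–5 (IoU = 0.50:0.95, area = all, maxDets = 100).

```python
import numpy as np

# --- Precision and Recall, Equations (1) and (2), for illustrative counts ---
tp, fp, fn = 80, 20, 40          # hypothetical detection counts
precision = tp / (tp + fp)       # 0.800
recall = tp / (tp + fn)          # 0.667

# --- AP, Equation (3): area under a toy precision-recall curve ---
r = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])     # recall samples (invented)
p = np.array([1.0, 0.95, 0.90, 0.75, 0.60, 0.40])  # precision samples (invented)
ap = float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))  # trapezoid rule

# --- mAP, Equation (4): average of per-class AP (YOLOv5, Tables 3-5) ---
per_class_ap = {'car': 0.506, 'bus': 0.762, 'person': 0.614}
map_value = sum(per_class_ap.values()) / len(per_class_ap)

print(f'precision={precision:.3f} recall={recall:.3f} '
      f'AP={ap:.3f} mAP={map_value:.3f}')
```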
Table 4. Average Precision (AP) and Average Recall (AR) calculated for YOLOv4 and YOLOv5 on the MS COCO evaluation server; the evaluated category is bus.

Metric | YOLOv4 | YOLOv5
AP, IoU = 0.50:0.95, area = all, maxDets = 100 | 0.691 | 0.762
AP, IoU = 0.50, area = all, maxDets = 100 | 0.877 | 0.891
AP, IoU = 0.75, area = all, maxDets = 100 | 0.785 | 0.834
AP, IoU = 0.50:0.95, area = small, maxDets = 100 | 0.250 | 0.315
AP, IoU = 0.50:0.95, area = medium, maxDets = 100 | 0.588 | 0.658
AP, IoU = 0.50:0.95, area = large, maxDets = 100 | 0.803 | 0.876
AR, IoU = 0.50:0.95, area = all, maxDets = 1 | 0.501 | 0.547
AR, IoU = 0.50:0.95, area = all, maxDets = 10 | 0.762 | 0.830
AR, IoU = 0.50:0.95, area = all, maxDets = 100 | 0.768 | 0.834
AR, IoU = 0.50:0.95, area = small, maxDets = 100 | 0.431 | 0.516
AR, IoU = 0.50:0.95, area = medium, maxDets = 100 | 0.710 | 0.771
AR, IoU = 0.50:0.95, area = large, maxDets = 100 | 0.857 | 0.923

Table 5. Average Precision (AP) and Average Recall (AR) calculated for YOLOv4 and YOLOv5 on the MS COCO evaluation server; the evaluated category is person.

Metric | YOLOv4 | YOLOv5
AP, IoU = 0.50:0.95, area = all, maxDets = 100 | 0.548 | 0.614
AP, IoU = 0.50, area = all, maxDets = 100 | 0.827 | 0.846
AP, IoU = 0.75, area = all, maxDets = 100 | 0.606 | 0.669
AP, IoU = 0.50:0.95, area = small, maxDets = 100 | 0.375 | 0.413
AP, IoU = 0.50:0.95, area = medium, maxDets = 100 | 0.621 | 0.696
AP, IoU = 0.50:0.95, area = large, maxDets = 100 | 0.719 | 0.824
AR, IoU = 0.50:0.95, area = all, maxDets = 1 | 0.189 | 0.209
AR, IoU = 0.50:0.95, area = all, maxDets = 10 | 0.554 | 0.614
AR, IoU = 0.50:0.95, area = all, maxDets = 100 | 0.645 | 0.701
AR, IoU = 0.50:0.95, area = small, maxDets = 100 | 0.497 | 0.530
AR, IoU = 0.50:0.95, area = medium, maxDets = 100 | 0.711 | 0.776
AR, IoU = 0.50:0.95, area = large, maxDets = 100 | 0.807 | 0.889

4. Experiment

This comparative analysis between YOLOv4 and YOLOv5 is conducted using two distinct methodologies. The MS COCO dataset is selected for this study due to its comprehensive annotations, wide range of object categories, and extensive diversity in environmental conditions, which make it an ideal choice for evaluating object detection models compared to other datasets. The UAV platform is chosen after a thorough evaluation to ensure its suitability for real-world applications and to identify potential challenges during practical deployment. This evaluation revealed critical factors, such as the difficulty of detecting small objects and the unique challenges introduced by the aerial perspective of the UAV. These insights are instrumental in refining the system for robust and reliable performance under real-world conditions. These choices are made to balance generalization capability, computational feasibility, and real-world applicability, all of which are essential to improving UAV-based applications, such as search-and-rescue (SAR) operations.

4.1. MS-COCO Evaluation

The evaluation of both YOLO versions is conducted using standard metrics, including Average Precision (AP) and Average Recall (AR). These metrics provide insights into the detection accuracy and the algorithm's ability to identify true positives across different Intersection over Union (IoU) thresholds. This benchmark is widely recognized for its reliability and comprehensive evaluation metrics and has been adopted by leading organizations such as Microsoft, Facebook, and the Common Visual Data Foundation (CVDF). For these reasons, we choose to adopt it in our research to ensure a robust and standardized performance evaluation.
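For reference, per-category results such as those in Tables 3–5 can be produced locally with the standard pycocotools evaluation API. The sketch below is a minimal illustration, assuming ground-truth annotations and model detections already exported in COCO JSON format; the file names are placeholders, not artifacts from this study.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground truth and detections in COCO JSON format (placeholder file names).
coco_gt = COCO('instances_val2017.json')
coco_dt = coco_gt.loadRes('yolo_detections.json')

evaluator = COCOeval(coco_gt, coco_dt, iouType='bbox')
# Restrict the evaluation to a single category, e.g., 'car', as in Table 3.
evaluator.params.catIds = coco_gt.getCatIds(catNms=['car'])
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP/AR lines in the format mirrored by Tables 3-5
```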
This method employs 20,288 images from the MS COCO dataset [11] to evaluate the detection performance of both YOLO algorithms. Using real-world images from this dataset, a comparison is performed by analyzing the performance of YOLOv4 and YOLOv5 in detecting three specific object classes: car, bus, and person. These categories are chosen because they correspond directly to critical targets commonly encountered in UAV applications under diverse conditions. Although the MS COCO dataset includes a wide range of object categories, many (e.g., kitchen utensils or furniture) are irrelevant to UAV-based scenarios; including such unrelated categories would dilute the focus of the study and reduce computational efficiency without providing meaningful insights into UAV applications.

The results are shown in Tables 3–5, where AP measures the area under the precision–recall curve, reflecting the model's ability to balance precision (few false positives) and recall (many true positives) across various confidence thresholds. In UAV applications, AP is critical for evaluating detection quality under diverse conditions. In the following, we relate these metrics to typical use cases in detail.

1. Surveillance and Monitoring: High precision is vital in applications like military surveillance or border monitoring, where false positives can lead to unnecessary alerts or actions. For example, the detection of intruders or vehicles must have high AP to avoid misidentifying benign objects as threats. Recall also matters for ensuring that no critical objects (e.g., intruders) are missed, but a balanced AP ensures consistent performance.

2. Search and Rescue: In disaster scenarios, UAVs often scan large areas to locate missing persons or critical items. High AP ensures that the system can accurately detect objects such as humans or vehicles, minimizing false positives that could divert rescue efforts. Precision and recall trade-offs must be optimized for speed and resource efficiency.

3. Inspection and Maintenance: In industrial UAV applications, such as inspecting wind turbines or power lines, high AP ensures the accurate detection of defects or anomalies, reducing the risk of missing critical issues or flagging false positives that increase operational costs.

This study employs standard model indicators [25] to validate the target detection model; the Average Precision (AP) and Average Recall (AR) metrics are used to evaluate algorithm performance. These indicators depend on three variables: IoU, area, and maxDets. As can be seen in Tables 3–5, IoU refers to the Intersection over Union, which is formulated below and visualized in Figure 3 (a short code sketch follows the figure). It measures how well the predicted position aligns with the ground-truth position of the object and is used in evaluating precision and recall. For example, an IoU threshold of 0.50 means counting only detections whose IoU with the ground truth is at least 0.50; consequently, as the IoU threshold increases, the measured precision decreases.

IoU = \frac{\text{area of overlap}}{\text{area of union}}

Figure 3. The Intersection over Union.
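As referenced above, the IoU formula translates directly into a few lines of code. The following minimal sketch computes the IoU of a predicted and a ground-truth box in [x1, y1, x2, y2] pixel coordinates; the example box values are invented for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes [x1, y1, x2, y2]."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Overlap area is zero when the boxes do not intersect.
    overlap = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0

# Example: a prediction shifted slightly from the ground truth.
print(iou([100, 100, 200, 200], [110, 110, 210, 210]))  # ~0.68
```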
The observed performance differences at varying IoU thresholds indicate that YOLOv5 is generally more versatile and reliable across a wide range of conditions.
Its superior precision and robustness at both low and medium IoU thresholds make it particularly suited for UAV applications that demand real-time responsiveness and high detection accuracy, such as traffic monitoring and agricultural assessments. While YOLOv4 remains competitive, its limitations become more pronounced at stricter thresholds, potentially affecting its usability in tasks requiring precise localization. By considering these differences, practitioners can select the appropriate model and IoU threshold based on the specific requirements of their UAV application.

To further explore the differences in detection performance between YOLOv4 and YOLOv5, Tables 6 and 7 provide the calculated means and standard deviations. In addition, Figures 4 and 5 present box plots visualizing the distribution of the AP and AR metrics for the three object categories: person, bus, and car. These figures further support the statistical findings that YOLOv5 outperforms YOLOv4 in terms of precision and recall. The consistency and reduced variability of YOLOv5 underscore its suitability for applications requiring stable object detection.

Table 6. Mean ± standard deviation of Average Precision (AP) for YOLOv4 and YOLOv5 across the person, bus, and car object categories.

 | Person | Bus | Car
YOLOv4 | 0.62 ± 0.14 | 0.67 ± 0.21 | 0.54 ± 0.12
YOLOv5 | 0.68 ± 0.14 | 0.72 ± 0.20 | 0.60 ± 0.12

Table 7. Mean ± standard deviation of Average Recall (AR) for YOLOv4 and YOLOv5 across the person, bus, and car object categories.

 | Person | Bus | Car
YOLOv4 | 0.57 ± 0.20 | 0.67 ± 0.15 | 0.56 ± 0.19
YOLOv5 | 0.62 ± 0.22 | 0.74 ± 0.15 | 0.60 ± 0.21

Figure 4. Box plots illustrating the distribution of (a) Average Precision (AP) and (b) Average Recall (AR) for YOLOv4 across the person, bus, and car object categories.

Figure 5. Box plots illustrating the distribution of (a) Average Precision (AP) and (b) Average Recall (AR) for YOLOv5 across the person, bus, and car object categories.

4.2. UAV-Captured Real Images Evaluation

In this approach, as shown in Figure 6, we use a multirotor drone (Raven Base from MR Aviation) with a 30× zoom camera (Z30F from Viewpro) to collect original images and videos from different altitudes. These images represent the unique challenges posed by UAV environments, encompassing factors such as varying altitudes, lighting conditions, and weather variables. This method offers insights into the suitability of the algorithms for real-world UAV applications. The computational speed and the ability to detect small objects are compared for both algorithms, and the results are discussed in the next section. The test environment in which the algorithms run is as follows: Python 3.10.12, torch 2.0.1 with CUDA 11.8 (CUDA:0), a Tesla T4 GPU, and CUDNN_HALF = 1.

Figure 6. YOLO object detection (residual blocks, bounding box regression, and Intersection over Union).
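To indicate how per-frame throughput figures like those reported in Section 5.2 can be measured in such an environment, the sketch below times inference over a video stream. It is a minimal illustration assuming the Ultralytics torch.hub interface and OpenCV for decoding; the video path, model variant, and confidence threshold are placeholders rather than artifacts of this study.

```python
import time

import cv2
import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.conf = 0.25  # confidence threshold; raising it discards low-confidence boxes

cap = cv2.VideoCapture('uav_footage.mp4')  # placeholder path to a test video
frames, start = 0, time.time()

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # The hub wrapper accepts numpy arrays; OpenCV decodes BGR, so flip to RGB.
    _ = model(frame[:, :, ::-1])
    frames += 1

cap.release()
elapsed = time.time() - start
print(f'{frames} frames in {elapsed:.1f} s -> {frames / elapsed:.1f} FPS')
```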
5. Results and Discussion

The comparison between YOLOv4 and YOLOv5 yields valuable information on their performance across various metrics.

5.1. Detection Accuracy

In terms of detection precision, YOLOv5 exhibits better performance than YOLOv4, as evidenced by the results in Tables 3–5 from the evaluation on the MS COCO dataset. The mean Average Precision (mAP) scores consistently favor YOLOv5, highlighting its improved precision and reliability in identifying objects.

5.2. Speed and Efficiency

Under the same test environment, Table 8 shows the speed comparison: YOLOv4 and YOLOv5 reach 23.5 FPS and 38.4 FPS, respectively, indicating that YOLOv5 achieves a higher frame rate than YOLOv4. This improvement in speed is crucial for real-time applications, especially in dynamic environments.

Table 8. Sample detection frames for YOLOv4 (left column) and YOLOv5 (right column) from the UAV-captured test video (resolution of 720 × 1280 at 24 FPS, 3551 frames). The measured results for YOLOv4 and YOLOv5 are 23.5 FPS and 38.4 FPS, respectively. It can also be seen that YOLOv4 has more success in detecting objects with a small area, which is an advantage for UAVs detecting objects from a greater distance.

5.3. Confidence Values

UAV-based detection systems often require fast decision making in dynamic environments. Adjusting the confidence threshold can help strike a balance between real-time performance and detection quality: a higher threshold might reduce computational time by eliminating low-confidence detections, while a lower threshold could slow down the process because more detections must be handled. Although YOLOv4 demonstrates slightly higher average confidence values than YOLOv5, as depicted in Figure 5, it is important to note that the pre-trained YOLOv4 and YOLOv5 models are optimized for ground-level images; therefore, to foster a more pertinent and insightful comparison, the utilization of datasets such as the VisDrone dataset becomes particularly relevant. That dataset encompasses a diverse collection of images and videos captured by drone-mounted cameras, offering a more fitting context for evaluation.

5.4. Suitability for UAV Applications

The study emphasizes the importance of evaluating these models in a UAV environment. The results underscore that YOLOv5, with its heightened accuracy and speed, is more apt for UAV applications.

6. Conclusions

The YOLOv4 and YOLOv5 algorithms are introduced, and their performance is first compared on the COCO dataset, which is now considered one of the standard datasets for comparing AI detection algorithms. Both algorithms are then applied to videos taken from a UAV platform. The results are analyzed with different metrics, speed (FPS) and accuracy (AP), under different conditions, including IoU, area, and maxDets. Based on the experiments with the COCO dataset, YOLOv5 outperforms YOLOv4 with an accuracy increase of 5% to 16%, depending on the evaluation conditions. Processing speeds on the UAV-captured images are measured as 23.5 FPS and 38.4 FPS for YOLOv4 and YOLOv5, respectively, a 63% increase for YOLOv5. This shows that YOLOv5 is both more efficient and more accurate, which means that YOLOv5 can run on smaller and cheaper hardware with the same performance as YOLOv4. This is important for UAV applications, since each additional unit of weight and power usage reduces flight time and payload capacity.

The method used in this work has been tested in a cloud environment using Google Colab with a Tesla T4 GPU. The performance of the algorithms depends on the operating hardware: GPUs with more processing units, such as the V100 and A100, will deliver faster performance, while edge devices like the NVIDIA Jetson series will be slower. However, on the same platform, YOLOv5 should outperform YOLOv4 at the rates reported here in both precision and speed.
The analysis might be extended in the future to further robotic applications of these algorithms, such as target tracking, observing how detection performance affects robotic performance metrics such as tracking error, flight time, and weight.

Finally, it is important to highlight some of the limitations associated with YOLOv5 to provide a balanced perspective on its applicability and performance. While YOLOv5 is designed for speed and efficiency, its larger variants, such as YOLOv5x, can be computationally intensive, requiring substantial resources for both training and inference. This poses challenges for deployment on resource-constrained devices, such as UAVs, and in real-time applications where low latency is critical. Moreover, UAV applications often involve detecting small or partially obscured objects at a distance, conditions that can challenge YOLOv5's anchor-based detection approach, despite notable improvements over earlier YOLO versions. Another key limitation lies in YOLOv5's heavy reliance on the quality and diversity of its training datasets. Although the model demonstrates exceptional performance when trained on comprehensive datasets like MS COCO, its detection accuracy can deteriorate significantly when applied to scenarios involving unseen objects or environments not represented in the training data. This dependency underscores the need for extensive domain-specific data collection and augmentation to maintain robust performance in real-world UAV applications, where environmental conditions can be highly variable and unpredictable.

For future work in this area, expanding the comparative study to include other state-of-the-art object detection models, such as Transformer-based architectures like DETR, would provide valuable insights into their suitability for UAV applications. Furthermore, research can focus on key directions such as integrating YOLOv4 and YOLOv5 into UAV systems for real-world deployment. This integration could provide critical information on the performance of the models in various applications, including traffic monitoring, disaster response, and agricultural monitoring. Testing these models under varying environmental conditions is essential to evaluate their robustness, adaptability, and practicality in real-time UAV operations.

Author Contributions: All authors contributed their unique insights to the research concept and to the design and implementation of the research. M.A.M.A. contributed to writing the original draft and to the analysis of results; E.Y. reviewed, edited, and checked the manuscript. After review and discussion, the authors unanimously approved the content of the final manuscript. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Data Availability Statement: The MS COCO evaluation server using the test-dev2017 dataset, which was used in this study, is openly available online at https://competitions.codalab.org/competitions/20794#results, accessed on 19 July 2023.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Kuznetsova, A.; Maleva, T.; Soloviev, V. YOLOv5 Versus YOLOv3 for Apple Detection; Springer: Berlin/Heidelberg, Germany, 2021; pp. 349–358. [CrossRef]
2. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. Available online: http://pjreddie.com/yolo/ (accessed on 19 September 2023).
3. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of YOLO Algorithm Developments. In Procedia Computer Science; Elsevier B.V.: Amsterdam, The Netherlands, 2021; pp. 1066–1073. [CrossRef]
4. Nepal, U.; Eslamiat, H. Comparing YOLOv3, YOLOv4 and YOLOv5 for Autonomous Landing Spot Detection in Faulty UAVs. Sensors 2022, 22, 464. [CrossRef] [PubMed]
5. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
6. YOLOv5 v1.0 Commits · Ultralytics/yolov5 · GitHub. Available online: https://github.com/ultralytics/yolov5/commits/v1.0 (accessed on 18 September 2023).
7. Zhang, P.; Li, D. EPSA-YOLO-V5s: A novel method for detecting the survival rate of rapeseed in a plant factory based on multiple guarantee mechanisms. Comput. Electron. Agric. 2022, 193, 106714. [CrossRef]
8. Kırac, E.; Özbek, S. Deep Learning Based Object Detection with Unmanned Aerial Vehicle Equipped with Embedded System. J. Aviat. 2024, 8, 15–25. [CrossRef]
9. Kim, J.; Cho, J. RGDiNet: Efficient Onboard Object Detection with Faster R-CNN for Air-to-Ground Surveillance. Sensors 2021, 21, 1677. [CrossRef]
10. Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A Survey of Object Detection for UAVs Based on Deep Learning. Remote Sens. 2024, 16, 149. [CrossRef]
11. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312.
12. YOLOv4 GitHub—AlexeyAB/Darknet at 3d4242a6e534fe44afd3c0bf0de92e0f4e9ce23f. Available online: https://github.com/AlexeyAB/darknet/tree/3d4242a6e534fe44afd3c0bf0de92e0f4e9ce23f (accessed on 19 September 2023).
13. Carrio, A.; Sampedro, C.; Rodriguez-Ramos, A.; Campoy, P. A review of deep learning methods and applications for unmanned aerial vehicles. J. Sens. 2017, 2017, 3296874. [CrossRef]
14. Salvini, P. Urban robotics: Towards responsible innovations for our cities. Robot. Auton. Syst. 2018, 100, 278–286. [CrossRef]
15. Schedl, D.C.; Kurmi, I.; Bimber, O. Search and Rescue with Airborne Optical Sectioning. Nat. Mach. Intell. 2020, 2, 783–790. [CrossRef]
16. Xu, R.; Lin, H.; Lu, K.; Cao, L.; Liu, Y. A forest fire detection system based on ensemble learning. Forests 2021, 12, 217. [CrossRef]
17. Park, S.E.; Eem, S.H.; Jeon, H. Concrete crack detection and quantification using deep learning and structured light. Constr. Build. Mater. 2020, 252, 119096. [CrossRef]
18. Tan, L.; Huangfu, T.; Wu, L.; Chen, W. Comparison of RetinaNet, SSD, and YOLO v3 for real-time pill identification. BMC Med. Inform. Decis. Mak. 2021, 21, 324. [CrossRef] [PubMed]
19. Li, S.; Ozo, M.M.O.I.; Wagter, C.D.; de Croon, G.C.H.E. Autonomous drone race: A computationally efficient vision-based navigation and control strategy. Robot. Auton. Syst. 2020, 133, 103621. [CrossRef]
20. Plastiras, G.; Kyrkou, C.; Theocharides, T.
EdgeNet—Balancing accuracy and performance for edge-based convolutional neural network object detectors. In ACM International Conference Proceeding Series; Association for Computing Machinery: New York, NY, USA, 2019. [CrossRef]
21. Bayer, R.; Priest, J.; Tözün, P. Reaching the Edge of the Edge: Image Analysis in Space. arXiv 2024, arXiv:2301.04954.
22. Yang, Q.; Yang, J.H. HD video transmission of multi-rotor Unmanned Aerial Vehicle based on 5G cellular communication network. Comput. Commun. 2020, 160, 688–696. [CrossRef]
23. Jetson Modules, Support, Ecosystem, and Lineup | NVIDIA Developer. Available online: https://developer.nvidia.com/embedded/jetson-modules (accessed on 19 October 2023).
24. Liu, L.; Liu, Y.; Gao, X.-Z.; Zhang, X. An Immersive Human-Robot Interactive Game Framework Based on Deep Learning for Children's Concentration Training. Healthcare 2022, 10, 1779. [CrossRef] [PubMed]
25. COCO—Common Objects in Context Metrics. Available online: https://cocodataset.org/#detection-eval (accessed on 7 November 2023).

Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.