

Application of Lightweight Human Pose Estimation Network Based on HRNet in Fall Detection


Ren Yuanhang1, Tao Ye2


1. Liaoning University of Science and Technology, Anshan 114051, China


Abstract: With the rapid development of deep learning, detection algorithms based on deep learning have been widely applied in many fields. Fall detection is of great significance in elderly care, safety monitoring, and related areas. This paper proposes a deep-learning-based human pose estimation algorithm and applies it to the fall detection task. The algorithm uses the lightweight network Lightweight HRNet to detect human keypoints, maintaining keypoint accuracy through high-resolution feature map outputs. The VGG16 network is then used to classify the channel-wise concatenation of the image and its pose map in order to identify the fall state. Thanks to its simple structure and strong performance, VGG16 can effectively extract high-level image features. Experimental results show that the algorithm achieves high accuracy on the fall detection task and maintains good performance under varying lighting conditions and in complex backgrounds. Compared with traditional methods, the deep-learning-based algorithm adaptively learns the features of different scenes, improving robustness and generalization while meeting the real-time and accuracy requirements of practical applications. These experiments demonstrate the effectiveness of the algorithm and provide a feasible solution for elderly care and safety monitoring.


Keywords: human pose estimation; lightweight network; HRNet; VGG16; fall detection


1. Background and significance of the study


1.1 Background


With the aging of the population, falls have become a major healthcare problem for the elderly. Sending an immediate alarm to the caregiver of an elderly person who has fallen can prevent further injury, reduce treatment costs, and increase the chances of recovery [1]. About 30% of people over 65 fall every year [2], and falls are the leading cause of accidental injury and death among people over 65 [3]. Rapid help after a fall reduces the risk of hospitalization by 26% and the risk of death by 80% [4]. Fall detection technology can detect falls and enable timely rescue, thereby reducing the risk of injury in older adults. Human pose estimation is a key technology for fall detection: by analyzing changes in the positions of human keypoints in images or videos, it can determine whether a person is in a fall state.


1.2 Significance


In recent years, significant progress has been made in deep-learning-based human pose estimation and fall detection. Researchers have proposed various complex deep learning models, such as the OpenPose algorithm by Cao Z et al. [5] and the AlphaPose algorithm by Fang H et al. [6]. These models achieve excellent accuracy, but they demand substantial computing resources and are difficult to deploy widely in real-time scenarios. In addition, some fall detection methods rely on a separate human detection stage, which increases system complexity and latency. To address these problems, researchers have begun to explore lightweight networks for fall detection.


In this paper, a deep-learning-based human pose estimation algorithm is applied to fall detection. The algorithm uses the Lightweight HRNet network to detect human keypoints and uses the VGG16 network to classify the concatenation of the image and its pose map in order to identify the fall state. Experimental results show that the proposed algorithm achieves high accuracy on the fall detection task and maintains good performance under varying lighting conditions and in complex backgrounds. Compared with traditional methods, the deep-learning-based algorithm adaptively learns the features of different scenes, improving robustness and generalization while meeting the real-time and accuracy requirements of practical applications. These experiments demonstrate the effectiveness of the algorithm and provide a feasible solution for elderly care and safety monitoring.


1.3 Research status at home and abroad


1.3.1 Current status of foreign research


In recent years, vision-based fall detection methods using deep learning have been widely studied abroad. Kim D et al. [7] analyzed abnormal pedestrian behavior under surveillance cameras, discussed the advantages and disadvantages of various algorithms, and examined how to improve detection performance in practical applications. They point out that although deep learning methods excel in accuracy, their high computational requirements make them difficult to apply widely in real-time systems. Dubey S et al. [8] surveyed two-dimensional (2D) and three-dimensional (3D) human pose estimation techniques, covering both classical and deep learning approaches to various computer vision problems as well as the different deep learning models used for pose estimation. Their work shows that human pose estimation has significant application potential, but existing methods still leave room for improvement in real-time performance and robustness.


In addition, Ibrahim M R et al. [9] studied the detection of near-miss cycling accidents from video streams, discussed combining video processing techniques with deep learning models, and proposed future research directions. Their work shows that real-time video stream processing plays an important role in detection, but the trade-off between computational cost and real-time performance must be addressed. Gao J et al. [10] surveyed groundbreaking deep learning models for fusing multimodal big data. They found that multimodal data can significantly improve detection accuracy and robustness, but data fusion techniques still need further optimization.


1.3.2 Current status of domestic research


In China, fall detection technology based on deep learning has also received extensive attention. Zhao R et al. [11] developed a novel Dynamic Centroid Model (DCM) to detect abnormal pedestrian behavior in public places, especially two common anomalies: U-turns and falls.


1.3.3 Research on existing problems and challenges


Although remarkable progress has been made in deep-learning-based fall detection at home and abroad, several problems and challenges remain. First, existing methods have high computational requirements, making real-time detection on embedded devices difficult. Second, although multimodal data fusion can improve detection performance, the complexity of data preprocessing and fusion algorithms increases the difficulty of system implementation. Finally, detection performance under varying lighting conditions and in complex backgrounds still needs improvement, as these factors affect the robustness and generalization ability of an algorithm.


1.4 Research content and innovation points of this paper


To address the above problems, this paper proposes a fall detection method based on Lightweight HRNet and the VGG16 network. Specifically, the lightweight network Lightweight HRNet is used to detect human keypoints, with high-resolution feature map outputs maintaining keypoint accuracy. The main advantage of Lightweight HRNet is its bottom-up approach, which requires no separate human detection stage and therefore has an inherent advantage in detection speed. The VGG16 network is then used to classify the concatenation of the image and its pose map in order to identify the fall state. A key contribution of VGG is demonstrating that increasing network depth can, in some cases, improve performance [12]. Because of its simple structure and strong performance, VGG16 is a classic model for image classification and object detection and can effectively extract high-level image features. Experimental results show that the algorithm achieves high accuracy on the fall detection task. The research in this paper is not only technically innovative but also provides a feasible solution for elderly care and safety monitoring. The experiments confirm the effectiveness of the algorithm, which can be applied to various practical scenarios to improve the quality of life and safety of the elderly.


2. Overview of system development technology

2.1 Jetson Xavier NX


The NVIDIA Jetson series covers low-power embedded platforms for GPU-accelerated computing [13]. These systems provide strong computing performance and are designed to support deep learning, computer vision, and artificial intelligence (AI) applications, with an independent CPU, GPU, PMIC, and flash memory for applications that require real-time detection on local devices. The Jetson Xavier NX, one of these platforms, is equipped with an NVIDIA Volta GPU and a six-core NVIDIA Carmel ARM CPU, delivering strong computing power in a small form factor. It is designed for deep learning and AI applications on edge devices that require high-performance inference, and is suitable both for human behavior recognition with lightweight networks and for object detection tasks.


The Jetson Xavier NX has two CSI camera interfaces to support computer vision and real-time image processing; an HDMI interface for connecting a display and monitoring program execution in real time; four USB ports for connecting USB drives, keyboards, mice, and other devices; an RJ45 Gigabit Ethernet port for network access; and a Micro USB OTG interface to accommodate additional devices.


2.2 VGG16 model


The VGG network stacks multiple 3×3 convolution filters in place of a single large convolution filter, which increases network depth while reducing the total number of parameters [14]. The VGG16 model consists of 5 convolutional blocks, 3 fully connected layers, and 1 softmax layer. The network is divided into 5 stages, each consisting of several convolutional layers followed by a 2×2 max pooling layer. Specifically, the first two stages each have two convolutional layers, and the last three stages each have three convolutional layers. This design extracts abstract image features layer by layer, from edges and textures up to high-level object features.


After the convolution and pooling operations, VGG16 performs classification through three fully connected layers, and the output layer uses the softmax function to generate a probability distribution, with each node corresponding to the probability of one class. Although the parameter count is large (about 138M) and training is resource-intensive, VGG16 performs well in tasks such as image classification and object detection; it is an important model for beginners to study and a benchmark for many computer vision tasks.
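To make this structure concrete, the following minimal sketch (assuming PyTorch and torchvision, which the paper does not specify) loads VGG16 and replaces its 1000-way ImageNet head with the two-way fall/normal classifier used later in this paper:

```python
# Minimal sketch (not the authors' code): adapting torchvision's VGG16
# for a two-class (fall / normal) task.
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(weights=None)        # 5 conv blocks + 3 FC layers, ~138M params
model.classifier[6] = nn.Linear(4096, 2)  # replace the 1000-way head with n = 2

x = torch.randn(1, 3, 224, 224)           # VGG16 expects a fixed 224x224 input
logits = model(x)                         # shape: (1, 2)
probs = torch.softmax(logits, dim=1)      # one probability per class
```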


2.3 Lightweight HRNet network


In recent years, with the rapid development of deep learning, human pose estimation has made significant progress. However, traditional pose estimation networks usually require large amounts of computing resources and storage, which limits their use in real-time application scenarios. To solve this problem, researchers have proposed lightweight pose estimation networks, which aim to reduce the computational and storage requirements of the network while maintaining accuracy.


Among these, Lightweight HRNet, a lightweight network based on the HRNet architecture, has attracted extensive attention for its performance and efficiency. Proposed by Jinzhen Liao et al. in 2024, it is an efficient network designed for bottom-up multi-person pose estimation. Its main advantage is that the bottom-up method requires no separate human detection stage [15], giving it an inherent advantage in detection speed.


3. Overall design of the system


3.1 Design of fall detection methods


This paper treats fall detection as an image classification task and uses the VGG16 network to classify the input images. The network structure of VGG16 is shown in Figure 1. The input image size is fixed at 224×224. During convolution, the network stacks 3×3 kernels throughout, so a deeper network can be built without introducing a large number of parameters and computations. Network depth is a key factor in improving the performance of a neural network, and a network of appropriate depth can learn more complex feature representations. A 3×3 kernel is also sufficient to cover the local neighborhood of the input image and capture edges, textures, and other low-level features that are essential in image processing, which is why kernels of this size are so widely used. The network uses pooling layers, rather than stride-2 convolutions, to reduce the resolution of feature maps, which lowers the network's computation and parameter count. Max pooling is also useful for extracting the main features of a feature map: by selecting the maximum value in each local region, it preserves important information while suppressing noise and unimportant details, helping the network focus on the key features in the image. Pooling also increases translational invariance to some extent, meaning that small positional changes of features in the input image have little effect on the pooling output; this matters for tasks such as object recognition, because objects can appear at different positions in an image.
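As a brief illustration of the 3×3 stacking principle (illustrative code, not from the paper), the following sketch compares the parameter counts of two stacked 3×3 convolutions and a single 5×5 convolution covering the same receptive field:

```python
# Two stacked 3x3 convolutions reach the same 5x5 receptive field as one
# 5x5 convolution, but with fewer parameters (and an extra nonlinearity).
import torch.nn as nn

c = 64  # channel count, chosen only for illustration
stacked = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(c, c, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)
single = nn.Conv2d(c, c, kernel_size=5, padding=2)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stacked), count(single))  # 73856 vs 102464 for c = 64
```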


Figure 1 Network structure of VGG16


In the final stage, the network outputs the image classification through the fully connected layers, as shown in Figure 1; the n in the final output feature map denotes the number of image classes. The simplicity and ease of implementation of the VGG16 architecture make it a classic model in deep learning, widely used in image recognition, image classification, and other tasks, and well suited to fall recognition from static images.


To design the fall detection method, a fall dataset based on static images was built for network training. The original dataset was downloaded from the PaddlePaddle website, and the final dataset was obtained by further processing it. The original data are messy: most images were collected online and are annotated with person bounding boxes and fall labels, while a small portion was added manually from frames of video-based fall datasets. The scenes are complex, each image may contain multiple people and multiple labels, and the data mainly cover two scenarios: athletes falling in sports scenes and pedestrians falling in daily situations, drawn largely from news footage and surveillance video. The original dataset cannot be used directly: it contains many people in cluttered scenes with severe occlusion and widely varying person sizes, making recognition difficult, so it requires processing. First, individual persons were cropped from the dataset according to the annotations. These crops were then manually screened, keeping only clear, unoccluded images of a single person either falling or in a normal state. Finally, each image was named in the format "status.serial"; for example, the first fall image is named "fall.00001". After processing, 4056 images were obtained. Because the dataset is small, no test set was set aside; instead, the images were split into training and validation sets at a ratio of 7:3, yielding 2839 training images and 1217 validation images. Figure 2 shows some images from the processed fall dataset.
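The preparation steps above can be sketched as follows (illustrative only; the annotation format and file layout are assumptions, not the authors' tooling):

```python
# Hedged sketch of the dataset preparation: crop single persons from
# annotated images, name them "status.serial", and split 7:3 into
# training and validation sets.
import random
from pathlib import Path
from PIL import Image

def crop_and_name(img_path, box, status, serial, out_dir):
    """box = (left, top, right, bottom) taken from the original annotation."""
    person = Image.open(img_path).crop(box)
    person.save(Path(out_dir) / f"{status}.{serial:05d}.jpg")  # e.g. fall.00001.jpg

def split_dataset(files, train_ratio=0.7, seed=0):
    files = list(files)
    random.Random(seed).shuffle(files)
    k = int(len(files) * train_ratio)  # 4056 images -> 2839 train, 1217 val
    return files[:k], files[k:]
```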


Figure 2 Sample images from the processed fall dataset


3.2 Introduction to the experimental process and analysis of the results


Following the implementation steps of fall detection, the experimental process is introduced in three parts: network training, analysis of experimental results, and implementation of fall detection.


3.2.1 Training of the network


The fall detection network was trained on an NVIDIA GeForce RTX 3060. Based on the device's memory capacity, the batch size was set to 32 and the learning rate to 1e-4. Because the dataset is small, training was limited to 30 epochs to prevent overfitting. Mean squared error was selected as the loss for this experiment. Let x be an image input to the network. Since each image must be classified into one of two states (fall, normal), the network outputs a two-dimensional array f(x) for each input, i.e., n = 2 in Figure 1, with each dimension giving the predicted probability of the corresponding state. The ground-truth label g(x) of an image is also a two-dimensional array in one-hot form: the position of the true class is set to 1 and the other to 0. For example, a fall image is labeled [1 0] and a normal image [0 1]. The loss function Loss(x) of the network uses the mean squared error, calculated as shown in Equation (1). Mean squared error penalizes large errors more heavily, which helps the model focus on reducing large errors and improving prediction accuracy.

Loss(x) = ‖g(x) − f(x)‖²    (1)
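A minimal training-loop sketch consistent with these settings follows (assuming PyTorch; `model` and `train_loader` stand for the VGG16 network and the fall dataset loader, and the Adam optimizer is an assumption, as the paper does not name one):

```python
# Sketch of the described training setup: batch size 32, learning rate
# 1e-4, 30 epochs, MSE loss against one-hot labels such as [1, 0] (fall)
# and [0, 1] (normal). nn.MSELoss averages Equation (1) over the batch.
import torch
import torch.nn as nn

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer assumed

for epoch in range(30):
    for images, onehot_labels in train_loader:     # batches of 32
        optimizer.zero_grad()
        f_x = torch.softmax(model(images), dim=1)  # predicted probabilities f(x)
        loss = criterion(f_x, onehot_labels)       # ||g(x) - f(x)||^2, averaged
        loss.backward()
        optimizer.step()
```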


As mentioned earlier, during the experiments not only the image itself but also its corresponding pose map was input to the VGG16 network. HRNet-W48 was used to generate the pose maps of the persons in the training set; pose maps for some of the data are shown in Figure 3. The pose map is a single-channel image, so when it is input together with the picture, the two are concatenated along the channel axis, and the resulting 4×224×224 feature map is fed into the network for training.
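The concatenation itself is straightforward; a sketch assuming PyTorch tensors in channel-first layout:

```python
# The RGB crop (3 x 224 x 224) and its single-channel pose map
# (1 x 224 x 224) are stacked along the channel axis.
import torch

image = torch.randn(3, 224, 224)     # RGB person crop
pose_map = torch.randn(1, 224, 224)  # one-channel pose map from HRNet-W48
fused = torch.cat([image, pose_map], dim=0)
assert fused.shape == (4, 224, 224)
```

Note that a standard VGG16 first convolution expects 3 input channels, so for a 4-channel input the first layer must be widened accordingly; the text leaves this detail implicit.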


(a) Pose maps for the fall-state data


(b) Pose maps for the non-fall-state data


Figure 3 Pose maps of some of the dataset


3.2.2 Experimental results and result analysis of fall detection on the validation set


The experiment was conducted in three parts: the first used the person image as input, the second used the pose map as input, and the third used the concatenation of the person image and the pose map as input. The pose maps were generated by HRNet-W48. Table 1 shows the final results on the validation dataset. When the input is the concatenation of the image and the pose map, fall detection performs best, reaching 96.0% accuracy on the validation set (a sketch of this accuracy computation follows Table 1). Consistent with the earlier analysis, the experimental results show that the pose map input does help the network detect the fall state of a person.


Table 1 Comparison of fall detection performance with different inputs

Input                         Input image size    Detection accuracy
Person image                  224×224             95.2%
Pose map                      224×224             93.7%
Person image + pose map       224×224             96.0%
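A hedged sketch of how such validation accuracies can be computed (not the authors' evaluation code): the prediction is the argmax over the network's two outputs, compared against the one-hot label.

```python
# Validation accuracy: fraction of images whose predicted class matches
# the one-hot ground-truth label.
import torch

@torch.no_grad()
def validation_accuracy(model, loader):
    correct = total = 0
    for inputs, onehot_labels in loader:
        pred = model(inputs).argmax(dim=1)
        true = onehot_labels.argmax(dim=1)
        correct += (pred == true).sum().item()
        total += true.numel()
    return correct / total  # e.g. 0.960 for the image + pose map input
```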


3.2.3 Implementation of fall detection


According to the above experimental results, the pose map input assists the network in the fall detection task. The implementation therefore uses Lightweight HRNet to detect a person's pose map, and the image and pose map are concatenated and input to the network to detect the final fall behavior. Because Lightweight HRNet uses the associative embedding (AE) algorithm for pose detection, keypoints may be detected repeatedly, so the detected keypoints are also preprocessed to improve the visualization quality of the resulting pose map.


In addition, the AE algorithm first detects all human keypoints in the input image and then groups them to complete pose detection; it does not automatically generate human detection boxes, while the trained VGG16 network described above detects the fall state of a single-person image. Therefore, the pose maps detected by the network are used to localize all human bodies in the input image, as shown in Figure 4. First, after keypoint preprocessing, single poses that do not overlap and are accurately predicted are selected, as shown in Figure 4(a). Next, each person is localized from the pose map by constructing the enclosing rectangle of the pose, as shown in Figure 4(b). The enclosing rectangle is built from the keypoints and frames all of them, but at this stage some rectangles do not enclose the entire body, so they must be expanded to include the whole body, as shown in Figure 4(c). Finally, according to the generated rectangles, the single-person image and the person's pose map are cropped, concatenated, and input to the VGG16 network for fall detection; the detection result is shown in Figure 4(d).
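The enclosing-rectangle step can be sketched as follows (the 15% margin is an illustrative assumption; the paper does not state how far the rectangles are expanded):

```python
# Enclosing rectangle of a detected pose: frame all keypoints, then
# expand by a margin so the whole body (head top, feet) is included,
# clamped to the image bounds.
import numpy as np

def enclosing_box(keypoints, img_w, img_h, margin=0.15):
    """keypoints: (K, 2) array of (x, y) coordinates from Lightweight HRNet."""
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    dx, dy = margin * (x_max - x_min), margin * (y_max - y_min)
    x0, y0 = max(0, int(x_min - dx)), max(0, int(y_min - dy))
    x1, y1 = min(img_w, int(x_max + dx)), min(img_h, int(y_max + dy))
    return x0, y0, x1, y1
```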


(a) Predicted human pose maps


(b) Enclosing rectangles of the pose maps


(c) Expanded enclosing rectangles of the pose maps


(d) Fall detection results


Figure 4 Fall detection


3.3 Implementation of Fall Detection on Jetson Xavier NX


Section 2.1 described the Jetson Xavier NX in detail; this section introduces the implementation of fall detection on the device. The application uses the lightweight human pose estimation network described above to detect the human pose map, and the VGG16 network to determine the person's fall state.


To implement fall detection, the weights of the Lightweight HRNet trained for pose estimation and of the VGG16 network trained for fall detection are first imported onto the device. After the input image passes through Lightweight HRNet, the detected pose map is preprocessed using the human pose detection results to improve its visualization and produce a clearer pose estimate. The generated pose map is then converted to a single-channel image, concatenated with the original image, and input to the VGG16 network, allowing the device to detect the fall state of the input image. Finally, the fall detection results are displayed on the device interface. Figure 5 shows the device's fall detection interface.
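Putting the pieces together, the on-device pipeline can be sketched as below. The helper names `postprocess`, `resize_crop`, and `render_pose_map` are hypothetical stand-ins for the keypoint preprocessing, cropping, and pose map rendering steps described above, and the weight file names are assumptions:

```python
# End-to-end sketch of the Jetson Xavier NX pipeline (illustrative names).
import torch

pose_net = torch.load("lightweight_hrnet.pth")  # assumed weight files
fall_net = torch.load("vgg16_fall.pth")
pose_net.eval(); fall_net.eval()

@torch.no_grad()
def detect_falls(frame):                        # frame: (3, H, W) tensor
    keypoints = pose_net(frame.unsqueeze(0))    # bottom-up pose estimation
    results = []
    for person_kpts in postprocess(keypoints):  # dedup / group keypoints
        box = enclosing_box(person_kpts, frame.shape[2], frame.shape[1])
        crop = resize_crop(frame, box, size=224)      # (3, 224, 224)
        pose_map = render_pose_map(person_kpts, box)  # (1, 224, 224)
        x = torch.cat([crop, pose_map], dim=0).unsqueeze(0)
        state = fall_net(x).argmax(dim=1).item()      # 0 = fall, 1 = normal
        results.append((box, state))
    return results
```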


(a) Fall Detection Interface 1


(b) Fall Detection Interface 2


Figure 5 Fall detection page


4. Conclusion


In this paper, a deep-learning-based human pose estimation algorithm is applied to fall detection: the Lightweight HRNet network detects human keypoints, and the VGG16 network classifies the concatenation of the image and its pose map. Experimental results show that the algorithm achieves high accuracy under varying lighting conditions and in complex backgrounds and can detect the fall state in real time, significantly improving the robustness and generalization ability of the system. Compared with traditional methods, the algorithm not only improves accuracy but also reduces computing resource requirements, making it suitable for practical elderly care and safety monitoring scenarios. However, performance bottlenecks remain when processing large-scale datasets, and future work will further optimize the computational efficiency of the algorithm.


References:

[1] Chen L, Li R, Zhang H, et al. Intelligent fall detection method based on accelerometer data from a wrist-worn smart watch[J]. Measurement, 2019, 140: 215-226.

[2] Christiansen T L, Lipsitz S, Scanlan M, et al. Patient activation related to fall prevention: a multisite study[J]. Joint Commission Journal on Quality and Patient Safety, 2020, 46: 129-135.

[3] Hsieh C Y, Liu K C, Huang C N, et al. Novel hierarchical fall detection algorithm using a multiphase fall model[J]. Sensors, 2017, 17(2): 307.

[4] Stevens J A. Falls among older adults—risk factors and prevention strategies[J]. Journal of Safety Research, 2005, 36(4): 409-411.

[5] Cao Z, Simon T, Wei S E, et al. Realtime multi-person 2D pose estimation using part affinity fields[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 7291-7299.

[6] Fang H S, Xie S, Tai Y W, et al. RMPE: regional multi-person pose estimation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2334-2343.

[7] Kim D, Kim H, Mok Y, et al. Real-time surveillance system for analyzing abnormal behavior of pedestrians[J]. Applied Sciences, 2021, 11(13): 6153.

[8] Dubey S, Dixit M. A comprehensive survey on human pose estimation approaches[J]. Multimedia Systems, 2023, 29(1): 167-195.

[9] Ibrahim M R, Haworth J, Christie N, et al. CyclingNet: detecting cycling near misses from video streams in complex urban scenes with deep learning[J]. IET Intelligent Transport Systems, 2021, 15(10): 1331-1344.

[10] Gao J, Li P, Chen Z, et al. A survey on deep learning for multimodal data fusion[J]. Neural Computation, 2020, 32(5): 829-864.

[11] Zhao R, Wang Y, Jia P, et al. Abnormal behavior detection based on dynamic pedestrian centroid model: case study on U-turn and fall-down[J]. IEEE Transactions on Intelligent Transportation Systems, 2023, 24(8): 8066-8078.

[12] Yang H, Ni J, Gao J, et al. A novel method for peanut variety identification and classification by improved VGG16[J]. Scientific Reports, 2021, 11(1): 15756.

[13] Jabłoński B, Makowski D, Perek P, et al. Evaluation of NVIDIA Xavier NX platform for real-time image processing for plasma diagnostics[J]. Energies, 2022, 15(6): 2088.

[14] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.

[15] Liao J, Cui W, Tao Y, et al. Lightweight HRNet: a lightweight network for bottom-up human pose estimation[J]. Engineering Letters, 2024, 32(3).


Liaoning University of Science and Technology College Student Innovation Training Project: 2024 Liaoning Province Innovation Training Project "Research on Human Fitness Motion Detection Based on Deep Learning" (LNYJG2024092).