CN109284733B - Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network - Google Patents

Info

Publication number
CN109284733B
CN109284733B CN201811197781.2A CN201811197781A CN109284733B CN 109284733 B CN109284733 B CN 109284733B CN 201811197781 A CN201811197781 A CN 201811197781A CN 109284733 B CN109284733 B CN 109284733B
Authority
CN
China
Prior art keywords
shopping guide
pedestrian
training
layer
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811197781.2A
Other languages
Chinese (zh)
Other versions
CN109284733A (en)
Inventor
赵云波
林建武
李灏
宣琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN201811197781.2A
Publication of CN109284733A
Application granted
Publication of CN109284733B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G07 CHECKING-DEVICES
    • G07C TIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
    • G07C1/00 Registering, indicating or recording the time of events or elapsed time, e.g. time-recorders for work people
    • G07C1/10 Registering, indicating or recording the time of events or elapsed time, e.g. time-recorders for work people together with the recording, indicating or registering of other data, e.g. of signs of identity

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Game Theory and Decision Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

A shopping guide negative behavior monitoring method based on yolo and a multitask convolutional neural network. First, a yolo-based pedestrian detection model is trained: the model is pre-trained on the ImageNet and voc2007 data sets and then fine-tuned on surveillance-scene images. Next, a multitask convolutional neural network based on ResNet50 is constructed and trained on manually labeled multi-label image data. Then the shopping mall surveillance feed is read over the RTSP protocol, pedestrians in each frame are detected with the pedestrian detection model, and each pedestrian image is input into the multitask convolutional neural network, which identifies whether the pedestrian is a shopping guide, whether the pedestrian is sitting, and whether the pedestrian is playing with a mobile phone. From these attributes the method judges whether a shopping guide shows negative behavior, and frames of "serious negative" and "general negative" shopping guides are saved locally. In this way, the yolo-based pedestrian detection network and the multitask convolutional neural network together monitor and record shopping guide negative behavior effectively.

Description

Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network
Technical Field
The invention relates to a method for monitoring negative behavior of shopping guides in the field of new retail.
Background
As labor costs rise, recruiting more shopping guides raises a shopping mall's costs. However, some shopping guides show negative work behaviors, such as playing with a mobile phone or sitting while customers are nearby, which wastes human resources. To avoid unnecessary expenditure, effective attendance management of shopping guides in a shopping mall is very important.
A common attendance system can only record when a shopping guide clocks in and out. It cannot automatically analyze whether the shopping guide works negatively during working hours, and it cannot record images of the shopping guide doing so. To meet this need, the invention uses computer vision to recognize and analyze images collected by the surveillance cameras that are ubiquitous in shopping malls.
For pedestrian detection, existing methods use the histogram of oriented gradients (HOG) as the descriptor and an SVM as the classifier; their accuracy is limited and false detections are common. In recent years, deep convolutional neural networks have been applied to pedestrian detection and have greatly improved its accuracy. However, because of cross-dataset fitting problems in transfer learning, these methods lack robustness from the surveillance viewpoint.
For attribute recognition, convolutional neural networks achieve a classification accuracy that traditional methods cannot match. In recent years, CNN architectures such as VGG, ResNet, and DenseNet have been widely used. However, an original ResNet can classify only one attribute, so recognizing multiple attributes requires training multiple models, which greatly increases the computational burden.
Therefore, there is currently no complete solution for a monitoring system that identifies and records negative behavior of shopping guides.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a shopping guide negative behavior monitoring method based on yolo and a multitask convolutional neural network.
To achieve this aim, the invention designs a shopping guide negative behavior monitoring system based on yolo and a multitask convolutional neural network. First, a yolo-based pedestrian detection model and a ResNet50-based multitask convolutional neural network are trained. Next, pedestrians are detected with the yolo-based detection model in surveillance images sampled at fixed intervals. Then the ResNet50-based multitask convolutional neural network identifies the attributes and behaviors of the shopping guides in the shopping mall, judges whether negative behavior is present, and saves images of shopping guides who behave negatively. This addresses, to a certain extent, the detection of shopping guide negative behavior and the automatic attendance checking of working conditions. The method can be applied to attendance systems, shopping guide management, shop operation, and similar aspects of new retail scenarios.
The technical scheme adopted by the invention to solve the technical problem is as follows:
A shopping guide negative behavior monitoring method based on yolo and a multitask convolutional neural network comprises the following steps:
Step 1, training a yolo-based pedestrian detection model: constructing a pedestrian detection model based on yolo, pre-training a classification model with the ImageNet data set, pre-training a detection model with the voc2007 data set, and fine-tuning the model with a surveillance-viewpoint data set;
Step 2, training a multitask convolutional neural network based on ResNet50: constructing a multitask convolutional neural network based on ResNet50, and training it;
Step 3, recording shopping guide negative behavior: reading the surveillance feed, detecting pedestrians in the shopping mall, identifying the attributes of the pedestrians, and saving frames that show shopping guide negative behavior.
Compared with the prior art, the technical scheme of the invention has the following advantages:
(1) the pedestrian detection model trained by the invention performs robust pedestrian detection from the surveillance viewpoint of a shopping mall;
(2) the multitask convolutional neural network trained by the invention identifies several attributes of a pedestrian simultaneously while keeping high accuracy and robustness;
(3) the invention extends the attendance system to record negative behavior during work, rather than only late arrival and early departure, thereby improving the attendance system.
Drawings
FIG. 1 is a schematic diagram of the yolo pre-trained classification model of the present invention;
FIG. 2 is a schematic diagram of a yolo-based pedestrian detection model of the present invention;
FIG. 3 is a schematic diagram of a multitask convolutional neural network based on ResNet50 of the present invention;
FIG. 4 is a flow chart of the method of the present invention;
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail below with reference to the accompanying drawings and examples.
Example 1:
A shopping guide negative behavior monitoring method based on yolo and a multitask convolutional neural network comprises the following steps:
(1) Training a yolo-based pedestrian detection model
Step 11: constructing a pedestrian detection model based on yolo;
The invention borrows the training procedure and network structure of second-generation yolo (yolo-v2) and improves the network structure so that the model is more robust from the surveillance viewpoint. Specifically, the original yolo-v2 network contains 19 convolutional layers and 5 max pooling layers. The invention uses skip-layer fusion: stage one of feature extraction uses 13 convolutional layers and 4 max pooling layers, stage two uses 7 convolutional layers, and 1 max pooling layer sits between stage one and stage two. The feature map output by stage one is resized to match the size of the feature map output by stage two, and the two feature maps, now of equal size, are fused by superposition to form the input of stage three. Stage three has two modes. Mode one is a classification network, used when pre-training the model; it consists of a 3 × 3 convolutional layer and a fully connected layer whose number of neurons equals the number of classes. Mode two is a detection network, trained after loading the pre-trained parameters of mode one; it consists of a 3 × 3 convolutional layer followed by a 1 × 1 convolutional layer whose number of convolution kernels depends on the detection categories: number of anchors × (5 + number of detection categories).
The classification network of mode one, shown in FIG. 1, is described in detail below:
Stage one: the input image size is 448 × 448 × 3. Each layer of stage one is a convolutional layer followed by batch normalization and ReLU nonlinear activation; where noted, a 2 × 2 max pooling operation follows. The convolution kernel sizes are:
  • layer 1: 3 × 3 × 32, followed by 2 × 2 max pooling;
  • layer 2: 3 × 3 × 64, followed by 2 × 2 max pooling;
  • layer 3: 3 × 3 × 128;
  • layer 4: 1 × 1 × 64;
  • layer 5: 3 × 3 × 128;
  • layer 6: 3 × 3 × 256;
  • layer 7: 1 × 1 × 128;
  • layer 8: 3 × 3 × 256, followed by 2 × 2 max pooling;
  • layer 9: 3 × 3 × 512;
  • layer 10: 1 × 1 × 256;
  • layer 11: 3 × 3 × 512;
  • layer 12: 1 × 1 × 256;
  • layer 13: 3 × 3 × 512.
The output feature map of stage one is denoted output1.
Stage two: a 2 × 2 max pooling operation is first applied to the feature map output by stage one. Each layer of stage two is a convolutional layer followed by batch normalization and ReLU nonlinear activation. The convolution kernel sizes are:
  • layer 1: 3 × 3 × 1024;
  • layer 2: 1 × 1 × 512;
  • layer 3: 3 × 3 × 1024;
  • layer 4: 1 × 1 × 512;
  • layer 5: 3 × 3 × 1024;
  • layer 6: 3 × 3 × 1024;
  • layer 7: 3 × 3 × 1024.
The output feature map of stage two is denoted output2.
Stage three: a 1 × 1 × 64 convolution is applied to the stage-one output feature map output1, which is then resized to the same size as the stage-two output feature map output2; the resized feature map is denoted output1_1. The feature map output1_1 is superimposed on output2 to form a new feature map, output3. A 3 × 3 × 1024 convolution, batch normalization, and ReLU nonlinear activation are then applied to the fused feature map output3. Finally, a fully connected layer of 1000 neurons is added, and the network is constrained by a softmax loss function.
The benefit of this skip-layer operation is that output3 contains both the fine features of output2, obtained after deep convolution, and the basic features of output1_1, obtained after shallow convolution, so the network is more accurate.
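The feature-map sizes implied by the three stages can be traced with a short calculation. This is a minimal sketch and not part of the patent text. It assumes that every convolution preserves the spatial size, that each 2 × 2 max pooling halves it, that stage one contains the four pooling layers stated in step 11 (the layer-by-layer description names only three), that the resizing of output1 is a space-to-depth rearrangement as in the yolo-v2 passthrough layer, and that "superposition" means channel-wise concatenation. All function names are hypothetical.

```python
def pooled_size(size, num_pools):
    """Spatial size after num_pools 2 x 2 max-pooling layers with stride 2."""
    for _ in range(num_pools):
        size //= 2
    return size


def space_to_depth(height, width, channels, block=2):
    """Shape after moving each block x block spatial patch into the channel axis."""
    return height // block, width // block, channels * block * block


def trace_shapes(input_size=448):
    # Stage one: 13 convolutions and 4 max-pooling layers, ending with 512 kernels.
    s1 = pooled_size(input_size, 4)
    output1 = (s1, s1, 512)
    # One more max pooling, then stage two, ending with 1024 kernels.
    s2 = pooled_size(s1, 1)
    output2 = (s2, s2, 1024)
    # Stage three: 1 x 1 x 64 convolution on output1, then resizing
    # (assumed here to be space-to-depth with a 2 x 2 block).
    output1_1 = space_to_depth(s1, s1, 64)
    # Fusion, assumed to be concatenation along the channel axis.
    output3 = (s2, s2, output1_1[2] + output2[2])
    return output1, output2, output1_1, output3


if __name__ == "__main__":
    names = ("output1", "output2", "output1_1", "output3")
    for name, shape in zip(names, trace_shapes()):
        print(name, shape)
```

Under these assumptions output1_1 and output2 share the same height and width, which is the condition the text requires before the two maps are fused.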
The detection network of mode two, shown in FIG. 2, has the same structure as the classification model except for the last two layers. The difference is as follows: in the detection network, a 3 × 3 × 1024 convolution, batch normalization, and ReLU nonlinear activation are applied to the fused feature map output3 in stage three; the fully connected layer is then removed and replaced with a 1 × 1 × 30 convolutional layer; finally, the model is constrained by coordinate loss, confidence loss, and class loss.
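The 30 kernels of the final 1 × 1 convolution are consistent with the formula number of anchors × (5 + number of detection categories) from step 11. The sketch below checks this arithmetic. The value of 5 anchors is an assumption (it is the usual yolo-v2 setting and is not stated in the text); the single class is the pedestrian class. The function name is hypothetical.

```python
def detection_kernels(num_anchors, num_classes):
    """Kernels in the final 1 x 1 convolution: anchors x (5 + classes).

    The 5 covers four box coordinates plus one confidence score per anchor.
    """
    return num_anchors * (5 + num_classes)


# Assumed values: 5 anchors (not stated in the text) and 1 class (pedestrian).
print(detection_kernels(num_anchors=5, num_classes=1))  # 30
```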
Step 12: pre-training a classification model by using an ImageNet data set;
Good initialization parameters are an important part of model convergence, and because labeling detection data is laborious, detection data sets contain little data per category. Therefore, a classification model is trained on the ImageNet data set, and the trained classification model parameters are used as the initialization parameters of the structure shared with the detection model.
The standard 1000-class ImageNet images are first randomly cropped, rotated, and shifted in hue, saturation, and exposure to obtain more training data, and are resized to 224 × 224. The model is trained for 160 epochs with the SGD optimizer, with an initial learning rate of 0.1, a momentum of 0.9, and a weight decay of 0.0005.
The network is then fine-tuned at a larger size (448 × 448) for 10 epochs with a learning rate of 0.001.
Step 13: pre-training the detection model with the voc2007 dataset;
Since the first layers of the detection model are identical to those of the classification network, the parameters of the classification network trained in step 12 are used as the initialization parameters of the shared structure in the detection network. The voc2007 data set is a common detection data set with 20 classes of labeled objects, including pedestrians. Only the pedestrian image data is used for training: data enhancement is applied to the pedestrian data, the images are resized to 448 × 448, and the model is trained for 160 epochs with the SGD optimizer at an initial learning rate of 0.0001;
Step 14: fine-tuning the model with the surveillance-viewpoint data set;
Most pedestrian images in voc2007 are not taken from a surveillance viewpoint, so the model trained in step 13 has difficulty detecting pedestrians in shopping mall surveillance frames. Therefore, the data set of the BOT2018 New Retail Technology Challenge is selected for fine-tuning; its pedestrian images were collected from surveillance cameras in real shopping mall scenes. Data enhancement operations such as horizontal flipping, random center cropping, and fine adjustment in HSV space are applied to the images, which are then resized to 448 × 448;
The model trained in step 13 is loaded and trained for 160 epochs with the SGD optimizer. The initial learning rate is 0.001 and decreases as training proceeds: 0.001 for epochs 0-5, 0.0001 for epochs 5-80, and 0.00001 for epochs 80-160.
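The stepwise learning-rate schedule of step 14 can be written as a small lookup function. This is an illustrative sketch only: the text gives the ranges 0-5, 5-80, and 80-160 without saying which end of each range is inclusive, so half-open intervals over zero-based epoch indices are assumed. The function name is hypothetical.

```python
def fine_tune_learning_rate(epoch):
    """Learning rate for a zero-based epoch index in the 160-epoch fine-tuning run."""
    if not 0 <= epoch < 160:
        raise ValueError("epoch must be in [0, 160)")
    if epoch < 5:       # epochs 0-4
        return 1e-3
    if epoch < 80:      # epochs 5-79
        return 1e-4
    return 1e-5         # epochs 80-159


if __name__ == "__main__":
    for epoch in (0, 5, 80, 159):
        print(epoch, fine_tune_learning_rate(epoch))
```

A function of this form could be passed to whatever scheduler hook the training framework provides; the choice of framework is left open here.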
(2) Training a multitask convolutional neural network based on ResNet50
Step 21: constructing a multitask convolutional neural network based on ResNet50;
For each pedestrian detected in (1), the pedestrian's attributes must be identified in order to judge whether a shopping guide is working negatively. The attributes labeled in the data set are: "customer" or "shopping guide", "male" or "female", "standing" or "sitting", and "playing mobile phone" or "not playing mobile phone". Attributes from different pairs are not related to each other and can therefore be treated as unrelated attributes.
ResNet50 is a network structure with excellent classification performance. However, an original ResNet50 often performs poorly when it directly identifies multiple unrelated attributes, and training one model per attribute consumes extra computing resources. Therefore, for the identification of shopping guide negative behavior, the invention designs a multitask convolutional neural network based on ResNet50, whose structure is shown in FIG. 3.
Specifically, the last two layers of the original ResNet50 (the fully connected layer and the pooling layer) are removed and four parallel fully connected layers are attached. Each fully connected layer has 2 neurons, so together they represent 8 attributes: "customer" and "shopping guide", "male" and "female", "standing" and "sitting", "playing mobile phone" and "not playing mobile phone". Two attributes on the same fully connected layer are associated attributes, and attributes on different fully connected layers are unrelated attributes. Each fully connected layer is followed by a Softmax layer. The Softmax loss function is calculated as:
[Equation (1), the Softmax loss function, is provided as an image in the original: Figure BDA0001829231970000071]
the final loss function value is the sum of four Softmax loss function values, namely:
Loss=L1+L2+L3+L4 (2)
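Equation (2) can be illustrated with a few lines of plain Python. Because the formula of the Softmax loss is given only as an image in the original, the sketch assumes the standard form, the negative log of the softmax probability of the true label; this is an assumption rather than the patent's exact notation. The logits, labels, and function names are hypothetical.

```python
import math


def softmax_cross_entropy(logits, label):
    """Softmax loss for one sample: -log(softmax(logits)[label])."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def multitask_loss(head_logits, head_labels):
    """Sum of the per-head Softmax losses, as in equation (2)."""
    return sum(
        softmax_cross_entropy(logits, label)
        for logits, label in zip(head_logits, head_labels)
    )


if __name__ == "__main__":
    # Hypothetical outputs of the four two-neuron heads for one pedestrian image:
    # identity, gender, posture, phone use.
    logits = [[2.0, 0.5], [0.1, 1.2], [1.5, 1.5], [0.3, 2.2]]
    labels = [0, 1, 0, 1]
    print(round(multitask_loss(logits, labels), 4))
```

Summing the four losses with equal weight means one backward pass updates the shared ResNet50 backbone for all four attribute pairs at once, which is what lets a single model replace four separate classifiers.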
Step 22: training the multitask convolutional neural network based on ResNet50;
In a convolutional neural network, good initial parameters play an important role in the convergence of the model, so the parameters of a ResNet50 trained on the ImageNet data set, except those of the last two layers, are loaded as the initialization parameters of the multitask convolutional neural network. The training data is the labeled data of the BOT2018 New Retail Technology Challenge, to which data enhancement operations such as horizontal flipping, random center cropping, and HSV space enhancement are applied to obtain more training data. The network is trained with the Adam optimization algorithm for 40 epochs at an initial learning rate of 0.0005.
(3) Shopping guide negative behavior record
Step 31: reading a monitoring picture;
In a shopping mall, the widely deployed surveillance system provides the data for the system, so no additional equipment is needed. Before the surveillance feed is read, two parameters must be set: the working time interval and the sampling time. The working time interval makes the system attend only to working hours, which reduces extra computation and false detections; the purpose of the method is to detect whether a shopping guide behaves negatively during working hours, so behavior outside working hours is not considered. The sampling time controls how often the surveillance feed is read, which also saves computing resources, since detection is not needed at every moment. The shorter the sampling time, the more often recognition runs and the stricter the management, but the greater the computational burden; the longer the sampling time, the less often recognition runs and the lighter the computational burden, but the looser the management. By default, the working time interval is 9 am to 9 pm and one sample is taken every 30 seconds.
Specifically, the surveillance feed is read over the RTSP protocol, and the frequency at which frames are read is controlled by the computer's system time and the sampling time.
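The timing logic of step 31 can be sketched with the standard library alone. The example below only decides whether a frame should be read at a given moment, using the default values from the text (9 am to 9 pm, one sample every 30 seconds). Treating 9 pm itself as outside working hours is an assumption, and all names are hypothetical. Grabbing the frame from the RTSP stream is omitted; a library such as OpenCV (`cv2.VideoCapture` accepts an RTSP URL) could be used for that part.

```python
from datetime import datetime, time, timedelta

WORK_START = time(9, 0)                 # default working interval: 9 am
WORK_END = time(21, 0)                  # to 9 pm
SAMPLE_PERIOD = timedelta(seconds=30)   # default sampling time


def should_sample(now, last_sample):
    """Return True when a new frame should be read at time `now`.

    `last_sample` is the time of the previous sample, or None if no
    frame has been read yet.
    """
    if not (WORK_START <= now.time() < WORK_END):
        return False  # outside working hours: no detection at all
    if last_sample is None:
        return True
    return now - last_sample >= SAMPLE_PERIOD


if __name__ == "__main__":
    previous = datetime(2018, 10, 15, 10, 0, 0)
    print(should_sample(datetime(2018, 10, 15, 10, 0, 30), previous))
```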
Step 32: detecting pedestrians in the shopping mall;
The pedestrian detection model trained in (1) is loaded. The surveillance image read in step 31 is normalized, converted into a tensor, and fed into the pedestrian detection model, which outputs four coordinates (top, bottom, left, and right) for each pedestrian in the image, so multiple people can be detected in one image;
step 33: identifying a pedestrian attribute;
A pedestrian detected in step 32 may be a shopping guide or a customer, and the aim is to identify whether a shopping guide behaves negatively; the multitask convolutional neural network designed by the invention serves this purpose. The multitask convolutional neural network trained in (2) is loaded, the image data of each pedestrian detected in step 32 is taken as its input, and the values output by the fully connected layers are the model's confidences for the pedestrian's attributes. If the confidence for "shopping guide" is higher than that for "customer", the pedestrian is identified as a shopping guide; if the confidence for "male" is higher than that for "female", the pedestrian is identified as male; if the confidence for "standing" is higher than that for "sitting", the pedestrian is identified as standing; and if the confidence for "playing mobile phone" is higher than that for "not playing mobile phone", the pedestrian is identified as playing with a mobile phone. And vice versa.
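The comparison rule of step 33 amounts to picking the larger of two confidences in each of the four heads. The sketch below is an illustration under assumptions: the head names, the order of the two neurons within each head, and the tie-breaking toward the second label are hypothetical choices, not details given in the text.

```python
# Assumed order of the two neurons in each head (for illustration only).
HEADS = {
    "identity": ("customer", "shopping guide"),
    "gender": ("male", "female"),
    "posture": ("standing", "sitting"),
    "phone": ("playing mobile phone", "not playing mobile phone"),
}


def decode_attributes(confidences):
    """For each head, return the label whose confidence is higher.

    `confidences` maps a head name to a pair of confidence values in the
    same order as the labels in HEADS. Ties go to the second label.
    """
    result = {}
    for head, labels in HEADS.items():
        first, second = confidences[head]
        result[head] = labels[0] if first > second else labels[1]
    return result


if __name__ == "__main__":
    example = {
        "identity": (0.2, 0.8),
        "gender": (0.9, 0.1),
        "posture": (0.3, 0.7),
        "phone": (0.6, 0.4),
    }
    print(decode_attributes(example))
```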
Step 34: recording a shopping guide negative behavior picture;
A block diagram of the system is shown in FIG. 4.
Specifically, during working hours, shopping guide negative behavior is judged from the pedestrian attributes identified in step 33. First, the pedestrian's identity is checked. If the pedestrian is a shopping guide, the posture (standing or sitting) and the working state (playing with a mobile phone or not) are analyzed, and the frame is then checked for customers; when a customer is present, the requirements on the shopping guide are stricter. For example, when a shopping guide is playing with a mobile phone or sitting and no customer is in the frame, the shopping guide is considered "general negative"; when a shopping guide is playing with a mobile phone or sitting and a customer is in the frame, the shopping guide is considered "serious negative". The specific degrees of negative behavior are listed in Table 1. The frame is saved in folder 1 for a "serious negative" shopping guide and in folder 2 for a "general negative" shopping guide, and is not saved for a "positive" shopping guide. The store owner can penalize negative shopping guides accordingly, based on the image frames in folder 1 and folder 2.
TABLE 1 Shopping guide negative behavior decision table
[Table 1 is provided as an image in the original: Figure BDA0001829231970000101]
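The decision logic of step 34 can be sketched as a small function. Table 1 is available only as an image, so the sketch follows the two examples given in the text and assumes that a shopping guide who is neither sitting nor playing with a mobile phone is "positive"; any finer distinctions in the table are not reproduced. The function names and the folder strings are hypothetical.

```python
def judge_shopping_guide(is_sitting, is_playing_phone, customer_present):
    """Return 'serious negative', 'general negative', or 'positive' for one shopping guide."""
    if not (is_sitting or is_playing_phone):
        return "positive"  # assumed: no negative posture or activity
    # Negative behavior counts as more severe when a customer is in the frame.
    return "serious negative" if customer_present else "general negative"


def target_folder(level):
    """Folder in which the frame is saved; None means the frame is not saved."""
    folders = {"serious negative": "folder 1", "general negative": "folder 2"}
    return folders.get(level)


if __name__ == "__main__":
    level = judge_shopping_guide(is_sitting=True, is_playing_phone=False, customer_present=True)
    print(level, target_folder(level))
```

The function is meant to be called only for pedestrians already identified as shopping guides; `customer_present` would be true when any other pedestrian in the same frame is identified as a customer.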
Example 2:
(1) Selecting experimental data
The invention uses the data set of the BOT2018 New Retail Technology Challenge. The data was collected in real shopping mall scenes, and the labels in the image data are: "customer" and "shopping guide", "male" and "female", "standing" and "sitting", "playing mobile phone" and "not playing mobile phone". The data set covers 5 scenes and contains 5000 images, each showing varying numbers of shopping guides and customers. The 5000 images are divided into a training set and a test set at a ratio of 9:1, sampled evenly from each scene. The training set contains 1980 images of scene 1, 937 of scene 2, 915 of scene 3, 356 of scene 4, and 312 of scene 5, for a total of 4500; the test set contains 220 images of scene 1, 105 of scene 2, 101 of scene 3, 40 of scene 4, and 34 of scene 5.
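The per-scene counts can be checked against the stated totals and the 9:1 ratio with a few lines of arithmetic. The counts below are taken directly from the text; the variable names are hypothetical.

```python
# Images per scene, as listed in the text.
train = {1: 1980, 2: 937, 3: 915, 4: 356, 5: 312}
test = {1: 220, 2: 105, 3: 101, 4: 40, 5: 34}

train_total = sum(train.values())
test_total = sum(test.values())
print(train_total, test_total)  # 4500 500

# Share of each scene that lands in the training set (close to 0.9 everywhere).
train_share = {scene: train[scene] / (train[scene] + test[scene]) for scene in train}
for scene, share in train_share.items():
    print(scene, round(share, 3))
```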
(2) Results of the experiment
The multitask convolutional neural network based on ResNet50 was constructed and trained according to (2) of Example 1: the parameters of a ResNet50 trained on ImageNet were loaded, and the network was trained on the BOT shopping mall data set for 40 epochs with the Adam optimizer at an initial learning rate of 0.0005. The final accuracy on the test set is shown in Table 2:
TABLE 2 Results of the experiment
[Table 2 is provided as an image in the original: Figure BDA0001829231970000111]
The embodiments described in this specification merely illustrate implementations of the inventive concept. The scope of protection of the invention should not be regarded as limited to the specific forms set forth in the embodiments; it also extends to equivalent technical means that those skilled in the art can conceive on the basis of the inventive concept.

Claims (1)

1. A shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural networks comprises the following steps:
(1) training a yolo-based pedestrian detection model;
step 11: constructing a pedestrian detection model based on yolo;
using a skip-layer fusion scheme: the first stage of feature extraction uses 13 convolutional layers and 4 max-pooling layers, and the second stage uses 7 convolutional layers, with 1 max-pooling layer between the first stage and the second stage; the feature map output by the first stage is resized to match the size of the feature map output by the second stage; the two size-matched feature maps are then fused by superposition and become the input of the third stage; the third stage has two modes: the first mode is a classification network, used when pre-training the model, consisting of one 3 × 3 convolutional layer and one fully connected layer whose number of neurons equals the number of classes; the second mode is a detection network, trained after loading the pre-trained parameters of the first mode, consisting of one 3 × 3 convolutional layer followed by one 1 × 1 convolutional layer whose number of convolution kernels depends on the number of detection categories, specifically: number of anchors × (5 + number of detection categories);
step 12: pre-training a classification model by using an ImageNet data set;
training a classification model by using the ImageNet data set, and taking the trained classification model parameters as initialization parameters of a common structure in the detection model;
step 13: pre-training the detection model with the voc2007 dataset;
because the first few layers of the detection model are identical to those of the classification network, the parameters of the classification network trained in step 12 are used as the initialization parameters of the shared structure in the detection network; the voc2007 data set is a commonly used detection data set with 20 labeled object categories, including pedestrian image data; only the pedestrian image data is used for training, data enhancement is applied to the pedestrian data, the image size is adjusted to 448 × 448, and the model is trained for 160 epochs with an SGD optimizer at an initial learning rate of 0.0001;
step 14: fine-tuning the model with a surveillance-view data set;
selecting the data set of the BOT2018 New Retail Technology Challenge for fine-tuning, wherein the pedestrian images of the data set are acquired from monitoring cameras in real shopping mall scenes; performing data enhancement operations such as horizontal rotation, center random cropping and HSV (hue, saturation, value) space fine adjustment on the images of the data set, and adjusting the size to 448 × 448;
loading the model trained in step 13 and training for 160 epochs with an SGD optimizer; the initial learning rate is 0.001 and is reduced as training proceeds: 0.001 for epochs 0-5, 0.0001 for epochs 5-80, and 0.00001 for epochs 80-160;
(2) training a multitask convolutional neural network based on ResNet 50;
step 21: constructing a multitask convolutional neural network based on ResNet 50;
for each pedestrian detected in step (1), the attributes of the pedestrian need to be identified in order to judge whether a shopping guide is working negatively, and the attributes labeled in the data set comprise: "customer" or "shopping guide", "male" or "female", "standing" or "sitting", "playing mobile phone" or "not playing mobile phone"; attributes in different pairs are independent of each other and are regarded as unrelated attributes;
for the identification of shopping guide negative behavior, a multitask convolutional neural network is designed based on ResNet50;
specifically, the last two layers of the original ResNet50 (the fully connected layer and the pooling layer) are removed and four parallel fully connected layers are attached, each with 2 neurons, together representing 8 attributes: "customer" and "shopping guide", "male" and "female", "standing" and "sitting", "playing mobile phone" and "not playing mobile phone"; two attributes on the same fully connected layer are associated attributes, and attributes on different fully connected layers are unrelated attributes; each fully connected layer is followed by a softmax layer; the Softmax loss function is calculated as:
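$$L = -\sum_{j=1}^{T} y_j \log \frac{e^{a_j}}{\sum_{k=1}^{T} e^{a_k}} \qquad (1)$$
where $a_j$ is the $j$-th output of the fully connected layer, $y_j$ is the corresponding entry of the one-hot label, and $T$ is the number of outputs of the layer;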
the final loss function value is the sum of four Softmax loss function values, namely:
Loss=L1+L2+L3+L4 (2)
step 22: training a multitask convolutional neural network based on ResNet 50;
loading the parameters of a ResNet50 trained on the ImageNet data set, except those of the last two layers, as the initialization parameters of the multitask convolutional neural network; the data set is the labeled data of the BOT2018 New Retail Technology Challenge, to which data enhancement is applied to obtain more usable training data; training uses an Adam optimizer with an initial learning rate of 0.0005 for 40 epochs;
(3) shopping guide negative behavior records;
step 31: reading a monitoring picture;
in a shopping mall, a widely distributed monitoring system provides the data; before reading the monitoring picture, two parameters need to be set: the working time interval and the monitoring sampling time; the working time interval restricts the system to working hours, and the monitoring sampling time controls how often a monitoring picture is read;
reading the monitoring picture through the rtsp protocol, with the timing of the reads controlled by the system time of the computer;
step 32: detecting pedestrians in a mall;
loading the pedestrian detection model trained in step (1), reading the monitoring image from step 31, normalizing the image, converting it into a Tensor, and feeding the Tensor into the pedestrian detection model, which outputs four coordinates for each pedestrian in the image, namely the upper, lower, left and right coordinates;
step 33: identifying a pedestrian attribute;
loading the multitask convolutional neural network trained in step (2) and taking the image data of each pedestrian detected in step 32 as its input; the output values of the fully connected layers are the model's confidences for the corresponding attributes of the pedestrian; if the confidence for "shopping guide" is higher than the confidence for "customer", the pedestrian is identified as a shopping guide; if the confidence for "male" is higher than the confidence for "female", the pedestrian is identified as male; if the confidence for "standing" is higher than the confidence for "sitting", the pedestrian is identified as standing; and if the confidence for "playing mobile phone" is higher than the confidence for "not playing mobile phone", the pedestrian is identified as playing with a mobile phone;
step 34: recording a shopping guide negative behavior picture;
during business hours, shopping guide negative behavior is judged from the pedestrian attributes identified in step 33; first, it is determined whether the pedestrian is a shopping guide; if so, the posture and working state of the shopping guide are analyzed, and it is further determined whether a customer is present in the picture containing the shopping guide, a stricter standard being applied to the shopping guide when a customer is present; when a shopping guide is playing with a mobile phone or sitting and there is no customer in the picture, the shopping guide is considered "generally negative"; when a shopping guide is playing with a mobile phone or sitting and a customer is in the picture, the shopping guide is considered "severely negative"; the degree of negative behavior is determined as follows: when a customer is nearby, a shopping guide who is sitting or playing with a mobile phone is judged "severely negative", and one who is standing and not playing with a mobile phone is judged "positive"; when no customer is nearby, a shopping guide who is sitting or playing with a mobile phone is judged "generally negative", and one who is standing and not playing with a mobile phone is judged "positive"; the picture is saved in folder 1 for a "severely negative" shopping guide, saved in folder 2 for a "generally negative" shopping guide, and not saved for a "positive" shopping guide.
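Several of the numeric rules stated in claim 1 can be sketched in a few lines of Python: the kernel count of the final 1 × 1 detection layer (step 11), the stepwise learning rate used for fine-tuning (step 14), and the attribute decision by comparing confidences (step 33). This is only an illustrative sketch under stated assumptions. The function names are hypothetical, the epoch ranges are read as half-open intervals [0, 5), [5, 80) and [80, 160), and each head is assumed to output the confidence of the attribute the claim tests for first.

```python
def detection_filters(num_anchors: int, num_classes: int) -> int:
    """Kernel count of the last 1 x 1 layer: anchors x (5 + detection categories)."""
    return num_anchors * (5 + num_classes)


def finetune_lr(epoch: int) -> float:
    """Learning rate for a zero-based epoch index during fine-tuning."""
    if not 0 <= epoch < 160:
        raise ValueError("epoch must be in [0, 160)")
    if epoch < 5:       # assumption: epoch 5 already uses the lower rate
        return 1e-3
    if epoch < 80:
        return 1e-4
    return 1e-5


# (attribute tested in the claim, alternative attribute) for each head.
HEADS = [
    ("shopping guide", "customer"),
    ("male", "female"),
    ("standing", "sitting"),
    ("playing mobile phone", "not playing mobile phone"),
]


def decode_attributes(head_outputs):
    """Pick, per head, the tested attribute only if its confidence is strictly higher."""
    decoded = []
    for (tested, other), (conf_tested, conf_other) in zip(HEADS, head_outputs):
        decoded.append(tested if conf_tested > conf_other else other)
    return decoded


print(decode_attributes([(0.9, 0.1), (0.2, 0.8), (0.3, 0.7), (0.6, 0.4)]))
```

With a single detection category (pedestrian), `detection_filters` reduces to six kernels per anchor.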
CN201811197781.2A 2018-10-15 2018-10-15 Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network Active CN109284733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811197781.2A CN109284733B (en) 2018-10-15 2018-10-15 Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811197781.2A CN109284733B (en) 2018-10-15 2018-10-15 Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network

Publications (2)

Publication Number Publication Date
CN109284733A CN109284733A (en) 2019-01-29
CN109284733B true CN109284733B (en) 2021-02-02

Family

ID=65176535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811197781.2A Active CN109284733B (en) 2018-10-15 2018-10-15 Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network

Country Status (1)

Country Link
CN (1) CN109284733B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948490A (en) * 2019-03-11 2019-06-28 浙江工业大学 A kind of employee's specific behavior recording method identified again based on pedestrian
CN109919135A (en) * 2019-03-27 2019-06-21 华瑞新智科技(北京)有限公司 Behavioral value method, apparatus based on deep learning
CN110222942B (en) * 2019-05-14 2022-11-25 北京天正聚合科技有限公司 Method and device for identifying shopping guide, electronic equipment and storage medium
CN110210750A (en) * 2019-05-29 2019-09-06 北京天正聚合科技有限公司 A kind of method, apparatus, electronic equipment and storage medium identifying Shopping Guide's business
CN110414421B (en) * 2019-07-25 2023-04-07 电子科技大学 Behavior identification method based on continuous frame images
US20210150347A1 (en) * 2019-11-14 2021-05-20 Qualcomm Incorporated Guided training of machine learning models with convolution layer feature data fusion
CN113051967A (en) * 2019-12-26 2021-06-29 广州慧睿思通科技股份有限公司 Monitoring method, device, server and computer readable storage medium
CN111309954B (en) * 2020-02-24 2023-10-17 浙江力石科技股份有限公司 Scenic spot shopping guide behavior identification system
CN111461169B (en) * 2020-03-04 2023-04-07 浙江工商大学 Pedestrian attribute identification method based on forward and reverse convolution and multilayer branch depth network
CN111291840A (en) * 2020-05-12 2020-06-16 成都派沃智通科技有限公司 Student classroom behavior recognition system, method, medium and terminal device
CN112183397A (en) * 2020-09-30 2021-01-05 四川弘和通讯有限公司 Method for identifying sitting protective fence behavior based on cavity convolutional neural network
CN112016527B (en) * 2020-10-19 2022-02-01 成都大熊猫繁育研究基地 Panda behavior recognition method, system, terminal and medium based on deep learning
CN112307976B (en) * 2020-10-30 2024-05-10 北京百度网讯科技有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN113111859B (en) * 2021-05-12 2022-04-19 吉林大学 License plate deblurring detection method based on deep learning
CN117274953A (en) * 2023-09-28 2023-12-22 深圳市厚朴科技开发有限公司 Vehicle and pedestrian attribute identification method system, device and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169421A (en) * 2017-04-20 2017-09-15 华南理工大学 A kind of car steering scene objects detection method based on depth convolutional neural networks
CN108510000A (en) * 2018-03-30 2018-09-07 北京工商大学 The detection and recognition methods of pedestrian's fine granularity attribute under complex scene

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8195598B2 (en) * 2007-11-16 2012-06-05 Agilence, Inc. Method of and system for hierarchical human/crowd behavior detection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Vision based real-time fish detection using convolutional neural network;Minsung Sung等;《IEEE》;20171026;第1-6页 *
基于深度卷积神经网络的行人检测;芮挺等;《计算机工程与应用》;20150819;第162-166页 *

Also Published As

Publication number Publication date
CN109284733A (en) 2019-01-29

Similar Documents

Publication Publication Date Title
CN109284733B (en) Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network
US10726244B2 (en) Method and apparatus detecting a target
CN109961009B (en) Pedestrian detection method, system, device and storage medium based on deep learning
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
US10891465B2 (en) Methods and apparatuses for searching for target person, devices, and media
CN110532970B (en) Age and gender attribute analysis method, system, equipment and medium for 2D images of human faces
CN106960195B (en) Crowd counting method and device based on deep learning
CN111797653A (en) Image annotation method and device based on high-dimensional image
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
CN110033040B (en) Flame identification method, system, medium and equipment
CN111242127A (en) Vehicle detection method with granularity level multi-scale characteristics based on asymmetric convolution
WO2021051547A1 (en) Violent behavior detection method and system
CN110765882B (en) Video tag determination method, device, server and storage medium
CN111738344A (en) Rapid target detection method based on multi-scale fusion
US20190236738A1 (en) System and method for detection of identity fraud
CN108564673A (en) A kind of check class attendance method and system based on Global Face identification
CN111310662A (en) Flame detection and identification method and system based on integrated deep network
CN109670065A (en) Question and answer processing method, device, equipment and storage medium based on image recognition
CN108446688B (en) Face image gender judgment method and device, computer equipment and storage medium
CN110879982A (en) Crowd counting system and method
CN112101195A (en) Crowd density estimation method and device, computer equipment and storage medium
CN113205002B (en) Low-definition face recognition method, device, equipment and medium for unlimited video monitoring
KR101961462B1 (en) Object recognition method and the device thereof
CN115115825B (en) Method, device, computer equipment and storage medium for detecting object in image
CN109670423A (en) A kind of image identification system based on deep learning, method and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant