CN103064987A

CN103064987A - Bogus transaction information identification method

Info

Publication number: CN103064987A
Application number: CN2013100376918A
Authority: CN
Inventors: 王永康; 张爱华
Original assignee: Beijing 58 Information Technology Co Ltd
Current assignee: Beijing 58 Information Technology Co Ltd
Priority date: 2013-01-31
Filing date: 2013-01-31
Publication date: 2013-04-24
Anticipated expiration: 2033-01-31
Also published as: CN103064987B

Abstract

The invention discloses a method for identifying bogus transaction information, comprising the following steps: S101, acquiring the information features, the information content, and/or the picture information of information published by a user; and S201, identifying bogus transaction information in the information published by the user according to those information features, information content, and/or picture information. The method can greatly reduce the amount of bogus transaction information, improve the authenticity of transaction information, improve user experience, and greatly reduce labor cost.

Description

A method for identifying bogus transaction information

Technical field

The present invention relates to the field of Internet technology, and in particular to a method for identifying bogus transaction information.

Background technology

With the development of the Internet, online information has become increasingly rampant, and it is increasingly hard to tell true from false. For e-commerce and classified-information websites, providing users with safe, genuine merchandise information is an important and basic requirement. Identifying whether information published by users is genuine has therefore become key to ensuring information security, and it is a problem that many websites face.

To identify bogus transaction information, current methods rely mainly on manual review supplemented by some technical means, for example identifying blacklisted IP (Internet Protocol) addresses, or determining that the published content or format is illegal or that the price range is illegal. Information that is determined with certainty to be illegal is then deleted.

The shortcomings of the existing approach are as follows. Manual review consumes a great deal of manpower, and the auxiliary technical means can delete only a small part of the bogus transaction information, so a large amount of it escapes. Information that is 100% certain to be bogus can be deleted, but nothing can be done about information that is, say, 85% likely to be bogus, because the existing means cannot estimate the degree to which a piece of information is bogus.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for identifying bogus transaction information, in order to solve the prior-art problems that identifying bogus transaction information consumes a great deal of manpower and that the identification rate is low.

To solve the above technical problem, in one aspect, the present invention provides a method for identifying bogus transaction information, comprising:

Step S101: obtain the information features, the information content, and/or the picture information of the information published by the user;

Step S201: identify bogus transaction information in the information published by the user according to its information features, information content, and/or picture information.

Further, before the information features of the information published by the user are obtained, the method comprises the following steps:

Step S1011: obtain the basic data of information previously published by users;

Step S1012: from the obtained basic data of previously published information, extract training data and determine the positive and negative samples;

Step S1013: perform feature conversion on the data in the positive and negative samples to obtain data in a set data format;

Step S1014: build a regression model from the data in the set data format.

Further, step S1013 specifically comprises:

Each feature of every record in the positive and negative samples is classified as either numeric type or enumerated type;

The dimension value of a numeric feature is left unchanged: the value is placed at the position that the numeric feature occupies in the sample;

For an enumerated feature, its MD5 value is first calculated and then taken modulo W to obtain the remainder; in the sample, the value at the position given by the remainder is set to 1.

Further, step S1014 specifically comprises:

The data in the set data format obtained in step S1013 is converted into a sparse matrix;

The resulting sparse matrix, whose records have the form $(x_1, x_2, x_3, x_4, x_5, \dots, x_p)$, is input to the model training program, where $p$ is the data size of the data in the set data format; the parameters $(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5, \dots, \beta_p)$ corresponding to the records are obtained;

A regression model is built; the regression model is:

$$g(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

Further, after the regression model has been built, when information published by a user is received, step S101 is specifically:

Step S1015: obtain the basic data of the information published by the user, which comprises extracting the basic features of the published information and deriving the meta-features; the basic features and the meta-features together serve as the basic data for mining.

Further, after the basic data of the information published by the user has been obtained, step S201 specifically comprises the following steps:

Step S2011: perform feature conversion on the obtained basic data of the information published by the user to obtain a data format that the model can process;

Step S2012: convert the data obtained in step S2011 into sparse-matrix form and identify bogus information with the regression model, where P is the probability P(Y=1|x) given by the model. If P > M, then Y = 1, indicating that the information published by the user is genuine transaction information; otherwise, if P ≤ M, then Y = 0, indicating that it is bogus transaction information. M is a preset threshold.

Further, before the information content of the information published by the user is obtained, the method comprises the following steps:

Step S1021: obtain and review the information content of information previously published by users, and divide it into two classes, information that passed review and information that did not, as the sample data for classification;

Step S1022: perform word segmentation on the information content in the samples;

Step S1023: extract feature words by calculation;

Step S1024: calculate the feature value of each feature word in every document of each class;

Step S1025: train a recognition model from the feature values of the words in every document of each class.

Further, step S1023 specifically comprises:

The CHI value of each word is calculated; the chi-square test formula is:

$$\chi^2(t, c) = \frac{N\,(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}$$

where A is the number of documents in the category that contain the word; B is the number of documents not in the category that contain the word; C is the number of documents in the category that do not contain the word; D is the number of documents not in the category that do not contain the word; N is the total number of documents; t is the word whose CHI value is being calculated; c is the category; and $\chi^2$ is the chi-square (CHI) value;

The P words with the largest CHI values among all words are then taken as the feature words;

Step S1024 specifically comprises:

The feature values are calculated with the TFIDF algorithm or a variant of it. The TFIDF approach counts, for each class, the number of times each feature word occurs in every document and the number of documents that contain the word, and uses the TFIDF value as the feature value. Each document is converted into the form: category ID \t feature index \t feature value. The TFIDF formula is TFIDF = TF × IDF, where TF is the frequency with which a feature word occurs in the document and IDF is the inverse document frequency, that is, the total number of documents divided by the number of documents containing the word.

Further, after the information content of the information published by the user has been obtained, step S201 specifically comprises the following steps:

Step S2021: perform word segmentation on the information content published by the user;

Step S2022: extract feature words by calculation;

Step S2023: calculate the feature value of each word in the information content published by the user;

Step S2024: identify bogus transaction information in the information content published by the user according to the trained recognition model.

Further, identifying bogus transaction information in the information published by the user according to its picture information specifically comprises the following steps:

Step S2031: query the historical picture library to determine whether the current picture has appeared in it; if it has, further determine whether the post content is the same and whether the location is the same; if both differ, the information published by the user that contains this picture is judged to be bogus transaction information; otherwise, it is judged to be genuine transaction information;

Alternatively, determine whether the picture carries a watermark; if it does, further determine whether the watermark is legal; if the watermark is illegal, the information published by the user that contains this picture is judged to be bogus transaction information; otherwise, it is judged to be genuine transaction information.

The beneficial effects of the present invention are as follows:

The present invention can greatly reduce the amount of bogus transaction information, improve the authenticity of transaction information, improve user experience, and at the same time greatly reduce labor cost.

Description of drawings

Fig. 1 is a flowchart of a method for identifying bogus transaction information in an embodiment of the present invention.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawing and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

As shown in Fig. 1, an embodiment of the present invention relates to a method for identifying bogus transaction information, comprising:

Step S101 covers three cases. The first identifies bogus transaction information from the information features of the information published by the user, that is, from user features and behavior. The second identifies it from the information content published by the user, that is, from the text content of the post. The third identifies it from picture information.

First, identification of bogus transaction information based on the information features of the information published by the user is described. Before the information features are obtained, the method comprises the following steps:

Step S1011: obtain the basic data of information previously published by users. In this step, the data is spliced together and the users' posting logs are analyzed to extract the basic features of the published information. Basic features are data extracted directly from previously published information, for example the user identifier (USER ID), posting IP, cookieid, telephone number, time information (including day of the week, month, and date), posting duration, page views, refresh count, posting city, and posting category. Meta-features are then derived from the basic features. Meta-features are data obtained by statistics or calculation on the basis of the basic features, such as the number of posts from the same IP, the number of posting cities for the same IP, the number of posts by the same user, the number of posting cities for the same user, the number of posts with the same cookie, and the number of posting cities for the same cookie. The basic features and the meta-features together serve as the basic data for mining. For example, a record such as R1 (123123, 192.168.11.11, DFOKIEBNGIDH1232, 18311067654, ...) is produced.
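The derived statistics of step S1011 (posts and posting cities per IP, per user, and per cookie) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `derive_meta_features` and the dictionary keys `ip`, `user_id`, `cookie_id`, and `city` are hypothetical names chosen for the example.

```python
from collections import defaultdict

def derive_meta_features(posts):
    """Add per-IP, per-user, and per-cookie statistics to each post.

    posts: list of dicts with the hypothetical keys
    'ip', 'user_id', 'cookie_id', and 'city'.
    """
    keys = ("ip", "user_id", "cookie_id")
    post_counts = {key: defaultdict(int) for key in keys}
    city_sets = {key: defaultdict(set) for key in keys}

    # First pass: accumulate the statistics over all earlier posts.
    for post in posts:
        for key in keys:
            post_counts[key][post[key]] += 1
            city_sets[key][post[key]].add(post["city"])

    # Second pass: attach the derived values to a copy of each record.
    enriched = []
    for post in posts:
        record = dict(post)
        for key in keys:
            record[f"posts_same_{key}"] = post_counts[key][post[key]]
            record[f"cities_same_{key}"] = len(city_sets[key][post[key]])
        enriched.append(record)
    return enriched
```

Each enriched record then carries both the directly extracted fields and the derived counts, which matches the idea of using both kinds of feature together as the basic data for mining.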

Step S1012: extract training data from the obtained basic data of previously published information. In this step, on the basis of the result of step S1011, data confirmed by manual review to be genuine or bogus is used as the positive and negative samples: genuine data is a positive sample and bogus data is a negative sample. For example, R1 is labeled as a positive or a negative sample. The manual review may rely on human judgment based on commonly known information, or on verification by means such as telephone calls.

Step S1013: perform feature conversion on the data in the positive and negative samples to obtain data in a set data format. In this step, each feature of every record in the positive and negative samples is classified as numeric type or enumerated type. Numeric type means that the data is itself a number; enumerated type means that the data is not itself a number, and its representation is obtained by mapping from the original dimension and value. For example, USER ID and posting IP are enumerated data, while posting duration and the number of posting cities for the same user are numeric data. The dimension value of a numeric feature is left unchanged; for example, if a feature has the value 20 and occupies the 10th position in the sample, then 20 is placed at the 10th position. For an enumerated feature, its MD5 (Message Digest Algorithm 5) value is first calculated and then taken modulo W (for example W = 300000), that is, the MD5 value is divided by 300000 and the remainder is taken, so that the value of the enumerated feature falls between 1 and 300000. For example, take two features (telephone number, number of posts with the same telephone number) with the corresponding values (18211078765, 100). The number of posts with the same telephone number is numeric and the telephone number is enumerated, so the post count keeps its position in the sample, while the MD5 value of the telephone number is calculated and taken modulo 300000, giving for example 180834. The vector produced for this record is then (0, 100, 0, ..., 1), in which position 180834 of the sample is set to 1, indicating that this position holds a value and that the value is 1.
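The feature conversion of step S1013 can be sketched with Python's standard `hashlib` module. The helper names `hashed_position` and `convert_record` are hypothetical, and placing the hashed slots after the numeric slots is an assumption of this sketch (the text does not say how collisions between numeric positions and hashed positions are avoided).

```python
import hashlib

W = 300000  # modulus from the example in the text

def hashed_position(value, w=W):
    """MD5 the value and take the digest modulo w.

    Python's % yields a remainder in 0..w-1; add 1 if 1-based
    positions are wanted.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % w

def convert_record(numeric_values, enumerated_values, w=W):
    """Return a sparse {position: value} vector for one record.

    Numeric features keep their own value at their own position.
    Each enumerated feature sets a 1 at its hashed position.
    Assumption: hashed slots start after the numeric slots.
    """
    vector = {}
    for position, value in enumerate(numeric_values):
        if value != 0:
            vector[position] = value
    offset = len(numeric_values)
    for value in enumerated_values:
        vector[offset + hashed_position(value, w)] = 1
    return vector
```

For the example record, `convert_record([100], ["18211078765"])` keeps the post count 100 in its numeric slot and sets a single 1 at the slot derived from the telephone number. Two different enumerated values can still hash to the same slot, which is the usual trade-off of this kind of hashing.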

Step S1014: build a regression model from the data in the set data format. The requirement on the regression model is that the codomain of its result lies in [0, 1], or can be mapped into that range by calculation; logistic regression is used as the example below. Step S1013 yields a set of vectors, for example (0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 32, 43, ..., 1, 0, 0, ..., 1, 0, 0, ...). Because these vectors may have 300000 dimensions, representing them in full would consume a great deal of memory, so each vector is converted into sparse-matrix form. For example, if the vector above is the first record, its row index is 1 and the corresponding sparse-matrix entries are: 1 10 12 (where 10 is the column index), 1 11 32, 1 12 43, and so on. After every record has been converted in this way, the resulting sparse matrix is input to the model training program, and the output is the parameters corresponding to the records. Put simply, if a record is $(x_1, x_2, x_3, x_4, x_5, \dots, x_p)$, where $p$ is the data size of the data in the set data format, the model training program solves for the corresponding parameters $(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5, \dots, \beta_p)$. The regression model is then built and can be expressed as:

$$g(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$
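The sparse representation and the linear score g(x) from step S1014 can be illustrated as follows. This is only a sketch: the function names are hypothetical, the coefficients are made-up numbers rather than trained parameters, and column indices are 1-based to match the example entries "1 10 12, 1 11 32, 1 12 43".

```python
def to_sparse_triples(row_index, vector):
    """Return (row, column, value) triples for the non-zero entries.

    Columns are 1-based, as in the example in the text.
    """
    return [
        (row_index, column, value)
        for column, value in enumerate(vector, start=1)
        if value != 0
    ]

def linear_score(triples, beta):
    """Compute g(x) = beta[0] + sum(beta[column] * value) for one record.

    beta[0] is the intercept; beta[j] multiplies x_j.
    """
    return beta[0] + sum(beta[column] * value for _, column, value in triples)
```

Only the non-zero entries are stored and visited, so the cost of scoring a record depends on how many features it actually has rather than on the full 300000 dimensions.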

After the regression model has been built, when information published by a user is subsequently received, step S101 is specifically:

Step S1015: obtain the basic data of the information published by the user, which comprises extracting the basic features of the published information and deriving the meta-features; the basic features and the meta-features together serve as the basic data for mining. The details are the same as in step S1011 and are not repeated here.

After the basic data of the information published by the user has been obtained, step S201 specifically comprises the following steps:

Step S2011: perform feature conversion on the obtained basic data of the information published by the user to obtain data in the set data format. The method is the same as in step S1013 and is not described again.

Step S2012: convert the data in the set data format obtained in step S2011 into sparse-matrix form and identify bogus information with the regression model. In this step, once the sparse matrix has been obtained, g(x) can be computed from the $(x_1, x_2, x_3, x_4, x_5, \dots, x_p)$ corresponding to the information published by the user, and from it the result P(Y=1|x), that is, the probability that Y = 1. If P > M, then Y = 1, indicating that the information published by the user is genuine transaction information; otherwise, if P ≤ M, then Y = 0, indicating that it is bogus transaction information. M is a preset threshold.
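For the logistic-regression example, P(Y=1|x) is obtained from g(x) with the logistic function, P = 1 / (1 + e^(-g(x))), and is then compared with the threshold M. A minimal sketch follows; the function names and the default threshold of 0.5 are assumptions for illustration, since the text only says that M is preset.

```python
import math

def probability_genuine(g):
    """Logistic function applied to the linear score g(x).

    Written in a numerically stable form for large |g|.
    """
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))
    e = math.exp(g)
    return e / (1.0 + e)

def classify(g, threshold=0.5):
    """Return Y: 1 (genuine) if P > M, otherwise 0 (bogus)."""
    p = probability_genuine(g)
    return 1 if p > threshold else 0
```

Because the rule is a strict inequality, a post whose probability equals the threshold exactly is treated as bogus. Raising M makes the filter stricter, and lowering it lets more borderline posts through.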

Second, identification of bogus transaction information based on the information content published by the user is described. Before the information content is obtained, the method comprises the following steps:

Step S1021: obtain the information content of information previously published by users and review it (manually or automatically); transaction posts that passed review and those that did not are treated as two classes and used as the sample data for classification. The positive and negative training sets can be extracted automatically from expert manual labels and from algorithms whose accuracy is high (above a set threshold);

Step S1022: perform word segmentation on the information content in the samples; the segmentation can be improved with a custom dictionary. An existing segmentation method can be used, for example the ICT segmentation method or another segmentation method.

Step S1023: extract feature words. In this step, stop words, rare words, and common words are filtered out of the segmentation result of step S1022, and a method such as CHI (the chi-square test) is then used to select the feature words most strongly correlated with the class. Specifically, the CHI value of each word is calculated, and the 1000 words with the largest CHI values are taken as feature words. The chi-square test formula is:

$$\chi^2(t, c) = \frac{N\,(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}$$

where A is the number of documents in the category that contain the word; B is the number of documents not in the category that contain the word; C is the number of documents in the category that do not contain the word; D is the number of documents not in the category that do not contain the word; N is the total number of documents; t is the word whose CHI value is being calculated; c is the category; and $\chi^2$ is the chi-square (CHI) value.
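The feature-word selection of step S1023 can be sketched as follows, assuming the standard chi-square statistic for text feature selection, χ² = N(AD − BC)² / ((A + C)(B + D)(A + B)(C + D)), with B taken as the number of documents outside the category that contain the word and N = A + B + C + D. The function names and the sample words are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square value of a word for a category.

    a: in category, contains word      b: not in category, contains word
    c: in category, lacks word         d: not in category, lacks word
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the word carries no information
    return n * (a * d - b * c) ** 2 / denominator

def top_feature_words(stats, k):
    """Return the k words with the largest chi-square values.

    stats: {word: (A, B, C, D)}
    """
    ranked = sorted(stats, key=lambda word: chi_square(*stats[word]), reverse=True)
    return ranked[:k]
```

A word that appears equally often inside and outside the category scores 0, while a word that appears only inside the category scores highest, which is why the top-ranked words are the ones most strongly associated with the class.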

Step S1024: vectorize the documents to obtain the feature value of each feature word in every document of each class. This step uses the TFIDF algorithm: for each class, it counts the number of times each feature word occurs in every document and the number of documents that contain the word, and uses the TFIDF value as the feature value. Each document is converted into the form: category ID \t feature index \t feature value. The TFIDF formula is TFIDF = TF × IDF, where TF is the frequency with which a feature word occurs in the document and IDF is the inverse document frequency, that is, the total number of documents divided by the number of documents containing the word.
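The vectorization of step S1024 can be sketched as follows. IDF is computed exactly as the text defines it, as the total number of documents divided by the number of documents containing the word (many implementations take a logarithm of this ratio, which the text does not mention). Taking TF as the raw occurrence count, the 1-based feature indices, and the function names are assumptions of this sketch.

```python
from collections import Counter

def tfidf_vectors(documents, feature_words):
    """Return one {feature_index: tfidf} dict per document.

    documents: list of token lists (already segmented).
    feature_words: the selected feature words, indexed from 1.
    """
    index = {word: i for i, word in enumerate(feature_words, start=1)}
    total = len(documents)

    # Number of documents that contain each feature word.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens) & set(index))

    vectors = []
    for tokens in documents:
        counts = Counter(token for token in tokens if token in index)
        vectors.append(
            {index[word]: n * (total / doc_freq[word]) for word, n in counts.items()}
        )
    return vectors

def to_lines(category_id, vector):
    """Format a vector as tab-separated 'category, index, value' lines."""
    return [f"{category_id}\t{idx}\t{val}" for idx, val in sorted(vector.items())]
```

The `to_lines` helper shows one plausible reading of the "category ID, feature index, feature value" layout, with one line per non-zero feature.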

Step S1025: train a recognition model from the feature values of the words in every document of each class. In this step, the feature values are trained with a method such as SVM (support vector machine), decision tree, or Bayesian classification. Every document was converted into vector form in step S1024; the Weka (Waikato Environment for Knowledge Analysis) classification program is used to train on these vectors, and different classification methods, such as SVM, decision tree, or Bayesian classification, can be selected to produce a recognition model. SVM, decision trees, and Bayesian classification are existing, mature training methods and are not described in detail here.
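The text leaves the training itself to an existing tool. Purely to illustrate what Bayesian classification over the weighted vectors might look like, here is a small multinomial naive Bayes sketch with add-one smoothing in plain Python. It is a stand-in, not the tool or algorithm used in the patent; the function names and the label convention (1 for posts that passed review, 0 for posts that did not) are assumptions.

```python
import math
from collections import defaultdict

def train_naive_bayes(samples):
    """samples: list of (label, {feature_index: weight}) pairs."""
    class_counts = defaultdict(int)
    feature_totals = defaultdict(lambda: defaultdict(float))
    vocabulary = set()
    for label, vector in samples:
        class_counts[label] += 1
        for idx, weight in vector.items():
            feature_totals[label][idx] += weight
            vocabulary.add(idx)
    return class_counts, feature_totals, vocabulary

def predict(model, vector):
    """Return the label with the highest log-posterior score."""
    class_counts, feature_totals, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Add-one smoothing over the known feature indices.
        denominator = sum(feature_totals[label].values()) + len(vocabulary)
        score = math.log(count / total_docs)
        for idx, weight in vector.items():
            if idx in vocabulary:
                likelihood = (feature_totals[label].get(idx, 0.0) + 1.0) / denominator
                score += weight * math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Features that were never seen during training are ignored at prediction time, so a post made up entirely of unseen words falls back to the class prior.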

After the recognition model has been obtained, when information published by a user is subsequently received, step S101 is specifically:

Step S1026: obtain the information content of the information published by the user; for example, when the user publishes a post, the specific content of the post is obtained.

After the information content of the information published by the user has been obtained, step S201 specifically comprises the following steps:

Step S2021: perform word segmentation on the information content published by the user.

Step S2022: extract feature words. The method is the same as in step S1023 and is therefore not described in detail.

Step S2023: vectorize the content to obtain the feature value of each word in the information content published by the user. The method is the same as in step S1024 and is therefore not described in detail.

Step S2024: identify bogus transaction information in the information content published by the user according to the trained recognition model. The recognition models obtained with SVM, decision trees, Bayesian classification, and similar methods are existing, mature models, and their use for recognition is likewise mature technology, so this step is not described in detail.

Finally, identification of bogus transaction information based on picture information is described. After the picture information of the information published by the user has been obtained, identifying bogus transaction information according to that picture information (step S201) comprises the following steps:

Step S2031: query the picture library to determine whether the current picture has appeared in it; if it has, further determine whether the post content is the same and whether the location is the same; if both differ, the information published by the user that contains this picture is judged to be bogus transaction information; otherwise, it is judged to be genuine transaction information. Alternatively, determine whether the picture carries a watermark; if it does, further determine whether the watermark is legal; if the watermark is illegal, the information published by the user that contains this picture is judged to be bogus transaction information; otherwise, it is judged to be genuine transaction information.
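The two picture checks of step S2031 reduce to simple decision rules, sketched below. The text does not say how pictures are matched or how watermarks are detected, so this sketch assumes an exact-match key (an MD5 of the image bytes) and a watermark that has already been extracted as text; all function names and sample values are hypothetical.

```python
import hashlib

def picture_key(image_bytes):
    """Exact-match key for a picture (assumption: byte-identical reuse)."""
    return hashlib.md5(image_bytes).hexdigest()

def bogus_by_library(library, image_bytes, content, location):
    """Return True if the picture check flags the post as bogus.

    library: {picture_key: (content, location)} from earlier posts.
    Flagged only when the picture was seen before AND both the post
    content and the location differ from the earlier post.
    """
    earlier = library.get(picture_key(image_bytes))
    if earlier is None:
        return False
    earlier_content, earlier_location = earlier
    return content != earlier_content and location != earlier_location

def bogus_by_watermark(watermark, allowed_watermarks):
    """Return True if a watermark is present and not in the allowed set.

    watermark: detected watermark text, or None if there is none.
    """
    return watermark is not None and watermark not in allowed_watermarks
```

A production system would more likely need a perceptual or near-duplicate match, since a reposted picture is often resized or recompressed and would no longer be byte-identical.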

In addition, the three strategies above can be combined for a joint judgment, for example by combining two of them or all three. If any one or two of the three cases judge the information published by the user to be bogus transaction information, the information is judged to be bogus transaction information.

As the above embodiments show, the present invention can greatly reduce the amount of bogus transaction information, improve the authenticity of transaction information, improve user experience, and at the same time greatly reduce labor cost.

Although preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will recognize that various improvements, additions, and substitutions are also possible; therefore, the scope of the present invention should not be limited to the above embodiments.
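The joint judgment amounts to a logical OR over whichever strategies were run. A minimal sketch, in which the function name and the strategy names are hypothetical:

```python
def combined_verdict(verdicts):
    """Return True (bogus) if any strategy that was run says bogus.

    verdicts maps a strategy name to True (bogus), False (genuine),
    or None (strategy not run for this post).
    """
    return any(flag is True for flag in verdicts.values())
```

For example, `combined_verdict({"features": False, "content": True, "picture": None})` treats the post as bogus because the content-based strategy flagged it, even though the feature-based strategy did not and the picture strategy was not run.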

Claims

1. A method for identifying bogus transaction information, characterized by comprising:

2. The method for identifying bogus transaction information according to claim 1, characterized in that, before the information features of the information published by the user are obtained, the method comprises the following steps:

3. The method for identifying bogus transaction information according to claim 2, characterized in that step S1013 specifically comprises:

For an enumerated feature, its MD5 value is first calculated and then taken modulo W to obtain the remainder; in the sample, the value at the position given by the remainder is set to 1.

4. The method for identifying bogus transaction information according to claim 3, characterized in that step S1014 specifically comprises:

The data obtained in step S1013 is converted into a sparse matrix;

The resulting sparse matrix, whose records have the form $(x_1, x_2, x_3, x_4, x_5, \dots, x_p)$, is input to the model training program, where $p$ is the data size of the data in the set data format; the parameters $(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5, \dots, \beta_p)$ corresponding to the records are obtained;

A regression model is built; the regression model is: $g(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$

5. The method for identifying bogus transaction information according to claim 4, characterized in that, after the regression model has been built, when information published by a user is received, step S101 is specifically:

6. The method for identifying bogus transaction information according to claim 5, characterized in that, after the basic data of the information published by the user has been obtained, step S201 specifically comprises the following steps:

Step S2011: perform feature conversion on the obtained basic data of the information published by the user to obtain data in the set data format;

Step S2012: convert the data in the set data format obtained in step S2011 into sparse-matrix form and identify bogus information with the regression model; if P > M, then Y = 1, indicating that the information published by the user is genuine transaction information; otherwise, if P ≤ M, then Y = 0, indicating that it is bogus transaction information; M is a preset threshold.

7. The method for identifying bogus transaction information according to claim 1 or 6, characterized in that, before the information content of the information published by the user is obtained, the method comprises the following steps:

Step S1022: perform word segmentation on the information content in the samples;

Step S1023: extract feature words by calculation;

8. The method for identifying bogus transaction information according to claim 7, characterized in that step S1023 specifically comprises:

The CHI value of each word is calculated; the chi-square test formula is:

$$\chi^2(t, c) = \frac{N\,(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}$$

where A is the number of documents in the category that contain the word; B is the number of documents not in the category that contain the word; C is the number of documents in the category that do not contain the word; D is the number of documents not in the category that do not contain the word; N is the total number of documents; t is the word whose CHI value is being calculated; c is the category; and $\chi^2$ is the chi-square (CHI) value;

The P words with the largest CHI values among all words are then taken as the feature words;

Step S1024 specifically comprises:

The TFIDF algorithm is used: for each class, the number of times each feature word occurs in every document and the number of documents that contain the word are counted, and the TFIDF value is used as the feature value. Each document is converted into the form: category ID \t feature index \t feature value. The TFIDF formula is TFIDF = TF × IDF, where TF is the frequency with which a feature word occurs in the document and IDF is the inverse document frequency, that is, the total number of documents divided by the number of documents containing the word.

9. The method for identifying bogus transaction information according to claim 8, characterized in that, after the information content of the information published by the user has been obtained, step S201 specifically comprises the following steps:

Step S2022: extract feature words by calculation;

10. The method for identifying bogus transaction information according to claim 1, 6, or 9, characterized in that identifying bogus transaction information in the information published by the user according to its picture information specifically comprises the following steps:

Step S2031: query the picture library to determine whether the current picture has appeared in it; if it has, further determine whether the post content is the same and whether the location is the same; if both differ, the information published by the user that contains this picture is judged to be bogus transaction information; otherwise, it is judged to be genuine transaction information;