Note: We cleaned and republished the dataset because we found some errors in the original dataset.
ZhihuRec dataset is constructed by the Information Retrieval group of Tsinghua University (THUIR) and Zhihu company, and it is for research purposes only.
ZhihuRec dataset is collected from Zhihu, a knowledge-sharing platform. It comprises around 100M interactions collected within 10 days, involving 798K users, 165K questions, 554K answers, 240K authors, 70K topics, and more than 501K user query logs. Descriptions of users, answers, questions, authors, and topics are also provided, all of them anonymized. To the best of our knowledge, this is the largest real-world interaction dataset for personalized recommendation.
As the ZhihuRec dataset contains about 100M user-answer impression logs, it is also called ZhihuRec-100M. Two smaller datasets, ZhihuRec-20M and ZhihuRec-1M, are randomly sampled from ZhihuRec-100M to suit different application requirements. They contain about 20M and 1M user-answer impression logs and can be viewed as a medium-size dataset and a relatively small dataset, respectively.
Filename | Size | Description |
---|---|---|
inter_impression.csv | 2.6 GB | user clicks and impressions |
inter_query.csv | 111 MB | user queries |
info_user.csv | 135 MB | the features of the users that occur in the dataset |
info_answer.csv | 917 MB | the features of the answers that occur in the dataset |
info_question.csv | 14 MB | the features of the questions that occur in the dataset |
info_author.csv | 3.1 MB | the features of the authors that occur in the dataset |
info_topic.csv | 413 KB | the IDs of the topics that occur in the dataset |
info_token.csv | 409 MB | the features of the tokens that occur in the dataset |
Dataset | ZhihuRec-100M | ZhihuRec-20M | ZhihuRec-1M |
---|---|---|---|
#impressions * | 99,978,523 | 19,999,857 | 999,970 |
#clicks | 26,981,583 | 5,402,345 | 268,656 |
#clicks : #non-clicks | 1 : 2.71 | 1 : 2.70 | 1 : 2.72 |
#queries * | 3,899,553 | 776,201 | 38,422 |
#users * | 798,086 | 159,642 | 7,974 |
avg #impressions per user | 125.27 | 125.28 | 125.40 |
avg #clicks per user | 33.81 | 33.84 | 33.69 |
#users with queries | 501,893 | 100,271 | 5,047 |
avg #queries per user | 7.77 | 7.74 | 7.61 |
#answers * | 554,976 | 343,103 | 81,563 |
#questions * | 165,012 | 104,130 | 29,340 |
#authors * | 240,956 | 167,796 | 47,888 |
#topics * | 72,318 | 54,785 | 22,897 |
#tokens * | 556,546 | 428,334 | 249,586 |
* The two smaller datasets can be generated by taking the top $N$ lines of each of the eight files, where $N$ is the corresponding starred count in the table above.
Some fields in the dataset are null; they are represented by empty strings in the files.
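The top-$N$ rule above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' tooling: the helper name `take_top_lines` is hypothetical, UTF-8 encoding is an assumption, and the value of `n` for each file is assumed to be the starred count in the statistics table (for example, 19,999,857 for `inter_impression.csv` when building ZhihuRec-20M).

```python
from itertools import islice


def take_top_lines(src_path, dst_path, n):
    """Copy the first n lines of src_path to dst_path and return how many were copied.

    Hypothetical helper: the file is streamed line by line, so even the
    2.6 GB impression log never has to fit in memory.
    """
    copied = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in islice(src, n):
            dst.write(line)
            copied += 1
    return copied
```

If a file has fewer than `n` lines, the function simply copies all of them, so the return value can be compared against the expected count as a sanity check.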
Fields of inter_impression.csv:

Index | Nullable | Description |
---|---|---|
0 | | user ID |
1 | | answer ID |
2 | | impression timestamp |
3 | | click timestamp (0 for non-click) |
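The four impression fields listed above can be read as follows. This is a rough sketch under several assumptions that the source does not confirm: the table describes `inter_impression.csv`, fields are comma-separated, there is no header row, and timestamps are integers. The names `Impression`, `parse_impression`, and `click_ratio` are hypothetical, and the sample rows are invented for illustration.

```python
import csv
import io
from typing import NamedTuple, Optional


class Impression(NamedTuple):
    user_id: str
    answer_id: str
    impression_ts: int
    click_ts: Optional[int]  # None when the impression was not clicked


def parse_impression(row):
    """Turn one raw row (list of strings) into an Impression."""
    user_id, answer_id, impression_ts, click_ts = row[:4]
    click = int(click_ts)
    # A click timestamp of 0 marks a non-click, per the field description.
    return Impression(user_id, answer_id, int(impression_ts),
                      click if click != 0 else None)


def click_ratio(rows):
    """Return (#clicks, #non-clicks) over an iterable of raw rows."""
    clicks = non_clicks = 0
    for row in rows:
        if parse_impression(row).click_ts is None:
            non_clicks += 1
        else:
            clicks += 1
    return clicks, non_clicks


# Invented sample data; the comma delimiter is an assumption.
sample = io.StringIO(
    "u1,a1,1525104000,1525104005\n"
    "u1,a2,1525104010,0\n"
    "u2,a1,1525104020,0\n"
)
print(click_ratio(csv.reader(sample)))  # (1, 2)
```

Run over the full log, the same counter should reproduce the `#clicks : #non-clicks` row of the statistics table.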
Fields of inter_query.csv:

Index | Nullable | Description |
---|---|---|
0 | | user ID |
1 | | token IDs in the query (separated by spaces) |
2 | | query timestamp |
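Because the query tokens are packed into a single space-separated field, each row needs a second split after the columns are separated. The sketch below assumes the row has already been split into its three columns (for instance with `csv.reader`, assuming a comma delimiter) and that the timestamp is an integer; `split_ids` and `parse_query` are hypothetical names.

```python
def split_ids(field):
    """Split a space-separated ID field; an empty (null) field gives an empty list."""
    return field.split()


def parse_query(row):
    """Turn one raw query row (list of strings) into a dict."""
    user_id, token_field, query_ts = row[:3]
    return {
        "user_id": user_id,
        "token_ids": split_ids(token_field),
        "timestamp": int(query_ts),
    }
```

Calling `str.split()` with no argument is a deliberate choice here: it returns `[]` for an empty string, whereas `"".split(" ")` would return `[""]` and introduce a bogus empty token.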
Fields of info_user.csv:

Index | Nullable | Description |
---|---|---|
0 | | user ID |
1 | | register timestamp |
2 | | gender |
3 | | login frequency |
4 | | #followers |
5 | | #topics followed by this user |
6 | | #questions followed by this user |
7 | | #answers |
8 | | #questions |
9 | | #comments |
10 | | #thanks received by this user |
11 | | #comments received by this user |
12 | | #likes received by this user |
13 | | #dislikes received by this user |
14 | | register type |
15 | | register platform |
16 | | from android or not |
17 | | from iphone or not |
18 | | from ipad or not |
19 | | from pc or not |
20 | | from mobile web or not |
21 | | device model |
22 | | device brand |
23 | | platform |
24 | | province |
25 | | city |
26 | | topic IDs followed by this user (separated by spaces) |
Fields of info_answer.csv:

Index | Nullable | Description |
---|---|---|
0 | | answer ID |
1 | | question ID |
2 | | anonymous or not |
3 | | author ID (null for anonymous) |
4 | | labeled high-value answer or not |
5 | | recommended by the editor or not |
6 | | create timestamp |
7 | | contain pictures or not |
8 | | contain videos or not |
9 | | #thanks |
10 | | #likes |
11 | | #comments |
12 | | #collections |
13 | | #dislikes |
14 | | #reports |
15 | | #helpless |
16 | | token IDs in the answer (separated by spaces) |
17 | | topic IDs of the answer (separated by spaces) |
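The answer fields combine the two conventions described in this document: null values stored as empty strings (the author ID of an anonymous answer, for example) and ID lists separated by spaces. A minimal sketch of mapping one row to a record is shown below. The field names in `ANSWER_FIELDS` are hypothetical labels chosen to mirror the 18 descriptions above and are not part of the dataset; all scalar values are left as strings because the source does not state their encoding.

```python
# Hypothetical field names, in the index order given by the table above.
ANSWER_FIELDS = [
    "answer_id", "question_id", "is_anonymous", "author_id",
    "is_high_value", "is_editor_recommended", "create_ts",
    "has_pictures", "has_videos", "n_thanks", "n_likes", "n_comments",
    "n_collections", "n_dislikes", "n_reports", "n_helpless",
    "token_ids", "topic_ids",
]
LIST_FIELDS = {"token_ids", "topic_ids"}


def parse_answer(row):
    """Map one raw row (list of 18 strings) to a dict.

    Empty strings become None; space-separated ID fields become lists.
    """
    if len(row) != len(ANSWER_FIELDS):
        raise ValueError(f"expected {len(ANSWER_FIELDS)} fields, got {len(row)}")
    record = {}
    for name, value in zip(ANSWER_FIELDS, row):
        if name in LIST_FIELDS:
            record[name] = value.split()
        else:
            record[name] = value if value != "" else None
    return record
```

Checking the column count up front makes a wrong delimiter guess fail loudly instead of silently shifting fields.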
Fields of info_question.csv:

Index | Nullable | Description |
---|---|---|
0 | | question ID |
1 | | create timestamp |
2 | | #answers |
3 | | #followers |
4 | | #invitations |
5 | | #comments |
6 | | token IDs in the question (separated by spaces) |
7 | | topic IDs of the question (separated by spaces) |
Fields of info_author.csv:

Index | Nullable | Description |
---|---|---|
0 | | author ID |
1 | | is excellent author or not |
2 | | #followers |
3 | | is excellent answerer or not |
Fields of info_topic.csv:

Index | Nullable | Description |
---|---|---|
0 | | topic ID |
Fields of info_token.csv:

Index | Nullable | Description |
---|---|---|
0 | | token ID * |
1 | | word vector trained by word2vec (64 dimensions, separated by spaces) |
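Each token row pairs an ID with a 64-dimensional vector stored as space-separated numbers. The sketch below shows one way to decode such a row, assuming the columns have already been separated and that the components parse as decimal floats; `parse_token` is a hypothetical name.

```python
def parse_token(row, dim=64):
    """Turn one raw token row into (token_id, vector as a list of floats)."""
    token_id, vector_field = row[:2]
    vector = [float(x) for x in vector_field.split()]
    # Guard against a wrong delimiter guess or a truncated line.
    if len(vector) != dim:
        raise ValueError(f"token {token_id}: expected {dim} values, got {len(vector)}")
    return token_id, vector
```

The resulting mapping from token ID to vector can then be used to look up the token IDs that appear in queries, questions, and answers.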
* ZhihuRec can't provide the corresponding text of tokens for privacy reasons. Researchers can use word vectors in the dataset or train word vectors from scratch.
ZhihuRec dataset can be downloaded from here. It accompanies the paper "A Large-Scale Rich Context Query and Recommendation Dataset in Online Knowledge-Sharing". Please cite the paper if you use this dataset:
@misc{hao2021largescale,
  title={A Large-Scale Rich Context Query and Recommendation Dataset in Online Knowledge-Sharing},
  author={Bin Hao and Min Zhang and Weizhi Ma and Shaoyun Shi and Xinxing Yu and Houzhi Shan and Yiqun Liu and Shaoping Ma},
  year={2021},
  eprint={2106.06467},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}
This dataset is for research use only. If you have any questions about this work or the dataset, please contact Bin Hao at [email protected].