TX-WORD2VEC

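Tencent's open-source word2vec model.

The original release is about 15 GB, which makes it hard for most hobbyists to work with.

So I made some smaller versions that are easier to use.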

5000-small.txt 5,000 words; small enough to download and experiment with

45000-small.txt 45,000 words; already enough to solve many problems

70000-small.txt 70,000 words 133MB https://pan.baidu.com/s/1DprHD8HwEqkWRBG0ss2y1A

100000-small.txt 100,000 words 190MB https://pan.baidu.com/s/1KqPOwfrw3KoLJqTsCUdriA

500000-small.txt 500,000 words 953MB https://pan.baidu.com/s/1SGwxpGW8HjYw8HdKQUB8Gw

1000000-small.txt 1,000,000 words 1.9GB https://pan.baidu.com/s/1ObstPl7R8o1L98Ag9owGiw

2000000-small.txt 2,000,000 words 3.8GB https://pan.baidu.com/s/1hmCiMandgyedjmP520_Aog

For anything larger, download the original release yourself:

https://ai.tencent.com/ailab/nlp/en/data/Tencent_AILab_ChineseEmbedding.tar.gz
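The README does not say how the small files were produced. One plausible approach, sketched here under the assumption that a small file is simply the first N entries of the original word2vec text file with a rewritten header, looks like this. The function name `truncate_word2vec_text` is hypothetical and not part of this repository.

```python
def truncate_word2vec_text(src_path, dst_path, keep):
    """Copy the first `keep` vectors of a word2vec text file into a new file.

    Assumes the standard text layout: a header line "<vocab_size> <dim>",
    followed by one "<word> <v1> ... <vdim>" line per word.
    """
    with open(src_path, encoding="utf-8") as src:
        # Read the original header to learn the vocabulary size and dimension.
        total, dim = map(int, src.readline().split())
        keep = min(keep, total)
        with open(dst_path, "w", encoding="utf-8") as dst:
            # Write a new header that matches the reduced vocabulary.
            dst.write(f"{keep} {dim}\n")
            written = 0
            for line in src:
                if written == keep:
                    break
                dst.write(line)
                written += 1
    return keep, dim
```

After extracting the archive, a call such as `truncate_word2vec_text("Tencent_AILab_ChineseEmbedding.txt", "70000-small.txt", 70000)` would produce a 70,000-word file; the extracted file name here is an assumption.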

How to use

Load the model

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("5000-small.txt")  # plain-text format, so the default binary=False applies

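Play with the model

model.most_similar(positive=['女', '国王'], negative=['男'], topn=1)  # analogy: woman + king - man

model.doesnt_match("上海 成都 广州 北京".split(" "))  # odd one out among Shanghai, Chengdu, Guangzhou, Beijing

model.similarity('女人', '男人')  # cosine similarity between "woman" and "man"

model.most_similar('特朗普', topn=10)  # ten nearest neighbours of "Trump"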
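To make the calls above less opaque, here is a minimal pure-Python sketch of the idea behind `similarity` and `most_similar`: cosine similarity between vectors, and ranking by similarity to a query built from added and subtracted unit vectors. It is a simplified illustration, not gensim's implementation, and the three-dimensional `toy` vectors are made up for the example rather than taken from the Tencent model.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity: dot product of the two unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def most_similar(vectors, positive, negative=(), topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for sign, words in ((1.0, positive), (-1.0, negative)):
        for word in words:
            query = [q + sign * x for q, x in zip(query, unit(vectors[word]))]
    # Words used in the query are excluded from the results.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors with axes (male, female, royal), for illustration only.
toy = {
    "男": [1.0, 0.0, 0.0],
    "女": [0.0, 1.0, 0.0],
    "国王": [1.0, 0.0, 1.0],
    "女王": [0.0, 1.0, 1.0],
    "苹果": [0.1, 0.1, -1.0],
}

# The toy vectors are constructed so that 女王 ranks first here.
print(most_similar(toy, positive=["女", "国王"], negative=["男"], topn=1))
```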

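Deep learning example

Use an LSTM model to predict ratings from Douban reviews.

First, download the Douban data:

Douban review data (149 MB): https://pan.baidu.com/s/1WbqoCKsmrnpf6n5ZTV-fKA

Then download the word-segmentation package that matches the library: https://pan.baidu.com/s/19busyY1yysbOgdYWxIaIQA

Before loading the 70000-word dictionary

After loading the 70000-word dictionary

For the code, see Use Tencent Word Embeddings with douban datasets.ipynb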
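The notebook itself is not reproduced here. As a rough illustration of one step such a pipeline usually needs, the sketch below turns a word-to-vector mapping into an index table plus an embedding matrix, and encodes a segmented review as a fixed-length index sequence. The names `PAD`, `UNK`, `build_embedding_matrix` and `encode` are hypothetical, and the notebook may well do this differently.

```python
PAD, UNK = "<pad>", "<unk>"  # hypothetical reserved tokens, not from the notebook

def build_embedding_matrix(vectors, dim):
    """Return (word_to_index, matrix); rows 0 and 1 are zero vectors for PAD and UNK."""
    word_to_index = {PAD: 0, UNK: 1}
    matrix = [[0.0] * dim, [0.0] * dim]
    for word, vec in vectors.items():
        word_to_index[word] = len(matrix)
        matrix.append(list(vec))
    return word_to_index, matrix

def encode(tokens, word_to_index, max_len):
    """Map segmented tokens to indices, truncating or right-padding to max_len."""
    ids = [word_to_index.get(t, word_to_index[UNK]) for t in tokens[:max_len]]
    return ids + [word_to_index[PAD]] * (max_len - len(ids))
```

The resulting matrix would typically serve as the initial weights of an embedding layer in front of the LSTM, with each review passed in as the index list returned by `encode`.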

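This is only meant as a starting point. If anyone has time to contribute other models or datasets, pull requests are welcome.

If you run into problems, feel free to open an issue.

Since you have read this far, join the QQ group to discuss with more experienced users:

Python natural language processing: 644612230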
