ValueError: perplexity must be less than n_samples #6

Closed
caicai555 opened this issue Aug 18, 2022 · 7 comments

@caicai555

When the number of topics does not exceed 30 (TSNE's default perplexity), a ValueError occurs, because the perplexity must be strictly less than the number of samples (here, the number of topics):

ValueError: perplexity must be less than n_samples

Maybe we could simply set the perplexity to 5, or derive it from the number of topics (e.g. by adding an n_topic variable to the _report() and _distance() methods).
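
A minimal sketch of the second suggestion, assuming the t-SNE input has one row per topic, so that n_samples equals the number of topics. The helper name `safe_perplexity` and the cap of 30 are illustrative and not part of tmplot:

```python
def safe_perplexity(n_topics: int, cap: float = 30.0) -> float:
    """Return a perplexity that is strictly less than the number of topics.

    Hypothetical helper: `cap` mirrors TSNE's default perplexity of 30.
    """
    if n_topics < 2:
        raise ValueError("at least 2 topics are needed for a 2D embedding")
    # Never exceed n_topics - 1, so perplexity < n_samples always holds.
    return float(min(cap, n_topics - 1))


for n in (5, 20, 50):
    print(n, safe_perplexity(n))
```

The returned value could then be passed as the `perplexity` argument of `sklearn.manifold.TSNE`.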

@maximtrp
Owner

You are right. Thank you! I will fix it in the next release (will try to make it soon).

maximtrp self-assigned this Aug 19, 2022
maximtrp added the bug and enhancement labels Aug 19, 2022
@maximtrp
Owner

@caicai555 I have pushed a commit with a fix. Could you please test it?

@caicai555
Author

caicai555 commented Aug 22, 2022

> @caicai555 I have pushed a fixing commit. Could you please test it?

I just upgraded tmplot to 0.0.9 and tested it with the same data that worked in version 0.0.8 once I set the t-SNE perplexity to 5. Now a different error occurs:


AttributeError                            Traceback (most recent call last)
<ipython-input-5-b777329f9dd5> in <module>
      1 import tmplot as tmp
----> 2 tmp.report(model=model, docs=texts)

D:\anaconda\lib\site-packages\tmplot\_report.py in report(model, docs, topics_labels, corpus, layout, show_headers, show_docs, show_words, show_topics, topics_kws, height, width, coords_kws, words_kws, docs_kws, top_docs_kws)
    135 
    136     if 'topics_coords' not in _topics_kws:
--> 137         topics_coords = prepare_coords(model, **_coords_kws)
    138         _topics_kws.update({
    139             'topics_coords': topics_coords,

D:\anaconda\lib\site-packages\tmplot\_report.py in prepare_coords(model, labels, dist_kws, scatter_kws)
     42     theta = get_theta(model)
     43     topics_dists = get_topics_dist(phi, **dist_kws)
---> 44     topics_coords = get_topics_scatter(topics_dists, theta, **scatter_kws)
     45     topics_coords['label'] = labels or theta.index
     46     return topics_coords

D:\anaconda\lib\site-packages\tmplot\_distance.py in get_topics_scatter(topic_dists, theta, method, method_kws)
    175         transformer = Isomap(**method_kws)
    176 
--> 177     coords = transformer.fit_transform(topic_dists)
    178 
    179     topics_xy = DataFrame(coords, columns=['x', 'y'])

D:\anaconda\lib\site-packages\sklearn\manifold\_t_sne.py in fit_transform(self, X, y)
   1121         """
   1122         self._check_params_vs_input(X)
-> 1123         embedding = self._fit(X)
   1124         self.embedding_ = embedding
   1125         return self.embedding_

D:\anaconda\lib\site-packages\sklearn\manifold\_t_sne.py in _fit(self, X, skip_num_points)
    960 
    961             t0 = time()
--> 962             distances_nn = knn.kneighbors_graph(mode="distance")
    963             duration = time() - t0
    964             if self.verbose:

D:\anaconda\lib\site-packages\sklearn\neighbors\_base.py in kneighbors_graph(self, X, n_neighbors, mode)
    922 
    923         elif mode == "distance":
--> 924             A_data, A_ind = self.kneighbors(X, n_neighbors, return_distance=True)
    925             A_data = np.ravel(A_data)
    926 

D:\anaconda\lib\site-packages\sklearn\neighbors\_base.py in kneighbors(self, X, n_neighbors, return_distance)
    761         )
    762         if use_pairwise_distances_reductions:
--> 763             results = PairwiseDistancesArgKmin.compute(
    764                 X=X,
    765                 Y=self._fit_X,

sklearn\metrics\_pairwise_distances_reduction.pyx in sklearn.metrics._pairwise_distances_reduction.PairwiseDistancesArgKmin.compute()

D:\anaconda\lib\site-packages\sklearn\utils\fixes.py in threadpool_limits(limits, user_api)
    149         return controller.limit(limits=limits, user_api=user_api)
    150     else:
--> 151         return threadpoolctl.threadpool_limits(limits=limits, user_api=user_api)
    152 
    153 

D:\anaconda\lib\site-packages\threadpoolctl.py in __init__(self, limits, user_api)
    169             self._check_params(limits, user_api)
    170 
--> 171         self._original_info = self._set_threadpool_limits()
    172 
    173     def __enter__(self):

D:\anaconda\lib\site-packages\threadpoolctl.py in _set_threadpool_limits(self)
    266             return None
    267 
--> 268         modules = _ThreadpoolInfo(prefixes=self._prefixes,
    269                                   user_api=self._user_api)
    270         for module in modules:

D:\anaconda\lib\site-packages\threadpoolctl.py in __init__(self, user_api, prefixes, modules)
    338 
    339             self.modules = []
--> 340             self._load_modules()
    341             self._warn_if_incompatible_openmp()
    342         else:

D:\anaconda\lib\site-packages\threadpoolctl.py in _load_modules(self)
    371             self._find_modules_with_dyld()
    372         elif sys.platform == "win32":
--> 373             self._find_modules_with_enum_process_module_ex()
    374         else:
    375             self._find_modules_with_dl_iterate_phdr()

D:\anaconda\lib\site-packages\threadpoolctl.py in _find_modules_with_enum_process_module_ex(self)
    483 
    484                 # Store the module if it is supported and selected
--> 485                 self._make_module_from_path(filepath)
    486         finally:
    487             kernel_32.CloseHandle(h_process)

D:\anaconda\lib\site-packages\threadpoolctl.py in _make_module_from_path(self, filepath)
    513             if prefix in self.prefixes or user_api in self.user_api:
    514                 module_class = globals()[module_class]
--> 515                 module = module_class(filepath, prefix, user_api, internal_api)
    516                 self.modules.append(module)
    517 

D:\anaconda\lib\site-packages\threadpoolctl.py in __init__(self, filepath, prefix, user_api, internal_api)
    604         self.internal_api = internal_api
    605         self._dynlib = ctypes.CDLL(filepath, mode=_RTLD_NOLOAD)
--> 606         self.version = self.get_version()
    607         self.num_threads = self.get_num_threads()
    608         self._get_extra_info()

D:\anaconda\lib\site-packages\threadpoolctl.py in get_version(self)
    644                              lambda: None)
    645         get_config.restype = ctypes.c_char_p
--> 646         config = get_config().split()
    647         if config[0] == b"OpenBLAS":
    648             return config[1].decode("utf-8")

AttributeError: 'NoneType' object has no attribute 'split'
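
The last frames of the traceback are in threadpoolctl, where `get_config()` returned `None` while the OpenBLAS version was being read, so the failure may depend on the installed library versions rather than on tmplot itself. Below is a small diagnostic sketch, using only the standard library, that collects versions worth including in a report. The package list is an assumption about what is relevant here:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of packages that appear in the traceback above.
report = installed_versions(["tmplot", "scikit-learn", "threadpoolctl", "numpy"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

If threadpoolctl turns out to be an old release, upgrading it is a plausible thing to try, though nothing in this thread confirms that it resolves the error.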

maximtrp reopened this Aug 25, 2022
@maximtrp
Owner

Could you please post the code and a data sample that produce this error?

@caicai555
Author

Just like this one, I think. Here is the code, run in a Jupyter notebook with Python 3.8.8:

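import bitermplus as btm
import numpy as np
import pandas as pd
import pickle as pkl
import matplotlib.pyplot as plt
import tmplot as tmp

# Load the data: one pre-tokenized document per line
df = pd.read_csv('../data/cmdata_cutted_filt_stp.txt', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()

# X, vocabulary and biterms were not defined in the original post.
# The three calls below are an assumption based on the bitermplus README.
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

# Train the model
n_topics = 50
model = btm.BTM(
    X, vocabulary, seed=666, T=n_topics, M=100, alpha=50/n_topics, beta=0.01)
model.fit(biterms, iterations=20)

# Build the visualization report (this call raises the error)
tmp.report(model=model, docs=texts)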

And here is part of the texts data, which should be unrelated to this issue:

texts
['听闻 喜得 贵子 世代 为官',
 '玩意 养大',
 '了噜 二代 吓死',
 '婴儿 身上 香味 猫咪 味道 吸引 过来',
 '喜欢 折耳 逼迫 不养 想养 做好 准备 存在 所有 都病 存在 避免 性状 会病 看到 家里 痛苦 忍下 看着 痛苦 善待 科普 基因 致病 基因 同一个 基因 发病率 混血 发病 混血 不会 发病 网上 资料 欧洲 联盟 英国 爱猫 协会 国际 猫咪 联盟 承认 美国 利益 没有 国内 名气 猫咪 声名 原因 希望 全面 认识 骨骼 是否 购买 做出 选择 没有 买卖 没有 繁育 购买者 折耳 低于 贩子 期望 价格 出售 导致 利润 降低 抑制 繁育 领养 了折 耳猫 用心 照顾 以下 注意事项 饮食 科学 饮食 为主 吃零食 千万别 补钙 喜欢 爬高 喜欢 增加 踏脚板 舒适 柔软 包括 纯种 容易 结石 体质 折耳 身体素质 体重 要求 饮水 摄入量 看到 提供 电梯 抱起 放下 前爪 先着 坚持 检查 发病 之后 可用 软骨素 缓解 症状 原创 当妈 以后 看见 折耳 视频 评论 复制粘贴 弹幕 确实',
 '开玩笑 孩子 需要 大人 监护 才能 宠物 不能 可爱 教训',
 '围着 嫉妒',
 '最好 看着 可爱 痛苦',
 '原名 直接 看到 授权 视频 大家 吸猫',
 '熙春 授权 视频 应该 属于 撞车 视频',
 '油管 看过 视频 宝宝 走路 所有 宝宝 毯子 睡着 总有 猫会 静静 宝宝 身边',
 '感觉 猫猫 感受 人类 崽崽 没有 爪子 试探 方法 看一看 闻一闻',
 '弹幕 科普 发言 弹幕',
 '最好 动物 靠近 婴儿',
 '小东西 小会 长大',
 '酥梨 酥梨 哈哈哈哈 哈哈哈哈 可爱',
 '熙春 支持 正版 授权 理智 观看 举报 拉黑 屏蔽 视频 营销 没毛 区别 支持 正版 授权 支持 正版 授权 猫咪 授权 只给 熙春 有没有 来源 大家 视为 视频 谢谢 上去 看到 拜托',
 '哈哈 最后 弹幕 怪味 换尿布',
 '家里 波斯猫 身上 争宠',
 '感觉 不能 太小 孩子 零距离 接触 小孩 很嫩 就算 不是故意 可能 弄伤 小孩 不会 控制 力气 力气 不大 要命',
 '出生 家庭 后来 面前 眼睛 麻麻 抓瞎 送走',
 '看着 长大 了能 滑稽',
 '观察 准备 小孩',
 '闻到 香味儿',
 '骨科 发病 痛苦 喜欢',
 '婴儿 案例 切记',
 '小时候 懂事 觉醒 宠物 吓死',
 '听话 聪明 不想 商量 出门时 躲起来',
 '小孩子 安全',
 '视频 麻麻',
 '眼神 超级',
 '猫咪 熟悉 成员 气味',
 '油管 搬运 挣个 容易 搬运 不标 搬运 来源',
 '看到 韩语',
 '换尿布 哈哈哈 哈哈哈',
 '过来 闻闻 子奶 味儿',
 '小孩 一巴掌',
 '喂奶 恶心 打回 娘胎 回炉 重造',
 '玩意 养大', ...]

Thanks for your great effort and quick replies on this issue. If you need the exact model file, please let me know :)

(BTW, I've tried analysing my texts with your BitermPlus package. It is concise and elegant, and quite easy to get the hang of. However, I find that its results differ from those of the R package BTM (https://github.com/bnosac/BTM) and BTMpy (https://github.com/jasperyang/BTMpy) under the same settings. The latter two gave the same topic distribution, and their top words seem to make sense, but the BitermPlus output is a bit confusing. I am a beginner in BTM and not sure what is behind the problem. Maybe you could check it if you have the time?)

PS: Maybe I should open an issue there, but I don't want to annoy you xd

@maximtrp
Owner

Thank you for your comments! I will try to reproduce this bug. By the way, how many iterations have you used with bitermplus? The authors of the algorithm refer to 2000 in their paper. I have found 600 to be more or less sufficient.

@caicai555
Author

> Thank you for your comments! I will try to reproduce this bug. By the way, what number of iterations have you used with bitermplus? The authors of algorithm refer to 2000 in their paper. I have found 600 to be more or less sufficient.

Due to time restrictions, I have only tried 200 iterations at most XD. I'll test with more iterations later when I return to my lab. What confuses me the most is that the results differ even when all the settings are exactly the same.

Thanks again for your warm help :)
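
When comparing outputs across implementations, topic indices need not line up even when the topics themselves agree, so a direct index-by-index comparison can overstate the difference. The sketch below shows one way to pair topics by the overlap of their top words. The function names and the toy word lists are hypothetical, and Jaccard similarity is only one possible choice of measure:

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_topic_matches(top_words_a, top_words_b):
    """For each topic in run A, find the most similar topic in run B."""
    matches = []
    for i, words_a in enumerate(top_words_a):
        scores = [jaccard(words_a, words_b) for words_b in top_words_b]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches


# Toy top-word lists standing in for the output of two implementations.
run_a = [["cat", "kitten", "pet"], ["video", "source", "repost"]]
run_b = [["repost", "video", "license"], ["pet", "cat", "baby"]]
for i, j, score in best_topic_matches(run_a, run_b):
    print(f"topic {i} in A ~ topic {j} in B (Jaccard {score:.2f})")
```

Low best-match scores across most topics would suggest a real difference between the runs rather than a mere reordering of topics.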
