ValueError: perplexity must be less than n_samples #6

caicai555 · 2022-08-18T11:12:09Z

When the number of topics is 30 or fewer (30 being TSNE's default perplexity setting), a ValueError occurs.

ValueError: perplexity must be less than n_samples

Maybe we could simply set perplexity to 5, or derive it from the number of topics (e.g. by adding an n_topic variable to the _report() and _distance() methods).
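The second suggestion can be sketched roughly as follows. This is only an illustration, not the fix that went into tmplot: `safe_perplexity` and `embed_topics` are hypothetical helper names, and the sketch assumes the topic matrix has one row per topic. scikit-learn raises the error whenever perplexity is not strictly less than the number of samples, so the helper caps the value at n_samples - 1.

```python
def safe_perplexity(n_samples: int, requested: float = 30.0) -> float:
    """Return a perplexity strictly below n_samples (hypothetical helper)."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least 2 samples")
    # scikit-learn requires perplexity < n_samples, so cap it at n_samples - 1.
    return float(min(requested, n_samples - 1))


def embed_topics(topic_matrix, requested: float = 30.0):
    """Project a (n_topics, n_features) array to 2D with a capped perplexity."""
    # Imported lazily so that safe_perplexity() can be used without scikit-learn.
    from sklearn.manifold import TSNE

    perplexity = safe_perplexity(len(topic_matrix), requested)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    return tsne.fit_transform(topic_matrix)
```

With this approach the default of 30 is kept for models with more than 30 topics and only lowered for smaller models, rather than hard-coding a small value such as 5 for everyone.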

maximtrp · 2022-08-19T09:03:04Z

You are right. Thank you! I will fix it in the next release (will try to make it soon).

maximtrp · 2022-08-19T16:37:21Z

@caicai555 I have pushed a commit with a fix. Could you please test it?

caicai555 · 2022-08-22T03:39:48Z

I just upgraded tmplot to 0.0.9 and tested it with the same data, which had worked in version 0.0.8 once I set the t-SNE perplexity to 5. Now a different error occurs:

AttributeError                            Traceback (most recent call last)
<ipython-input-5-b777329f9dd5> in <module>
      1 import tmplot as tmp
----> 2 tmp.report(model=model, docs=texts)

D:\anaconda\lib\site-packages\tmplot\_report.py in report(model, docs, topics_labels, corpus, layout, show_headers, show_docs, show_words, show_topics, topics_kws, height, width, coords_kws, words_kws, docs_kws, top_docs_kws)
    135 
    136     if 'topics_coords' not in _topics_kws:
--> 137         topics_coords = prepare_coords(model, **_coords_kws)
    138         _topics_kws.update({
    139             'topics_coords': topics_coords,

D:\anaconda\lib\site-packages\tmplot\_report.py in prepare_coords(model, labels, dist_kws, scatter_kws)
     42     theta = get_theta(model)
     43     topics_dists = get_topics_dist(phi, **dist_kws)
---> 44     topics_coords = get_topics_scatter(topics_dists, theta, **scatter_kws)
     45     topics_coords['label'] = labels or theta.index
     46     return topics_coords

D:\anaconda\lib\site-packages\tmplot\_distance.py in get_topics_scatter(topic_dists, theta, method, method_kws)
    175         transformer = Isomap(**method_kws)
    176 
--> 177     coords = transformer.fit_transform(topic_dists)
    178 
    179     topics_xy = DataFrame(coords, columns=['x', 'y'])

D:\anaconda\lib\site-packages\sklearn\manifold\_t_sne.py in fit_transform(self, X, y)
   1121         """
   1122         self._check_params_vs_input(X)
-> 1123         embedding = self._fit(X)
   1124         self.embedding_ = embedding
   1125         return self.embedding_

D:\anaconda\lib\site-packages\sklearn\manifold\_t_sne.py in _fit(self, X, skip_num_points)
    960 
    961             t0 = time()
--> 962             distances_nn = knn.kneighbors_graph(mode="distance")
    963             duration = time() - t0
    964             if self.verbose:

D:\anaconda\lib\site-packages\sklearn\neighbors\_base.py in kneighbors_graph(self, X, n_neighbors, mode)
    922 
    923         elif mode == "distance":
--> 924             A_data, A_ind = self.kneighbors(X, n_neighbors, return_distance=True)
    925             A_data = np.ravel(A_data)
    926 

D:\anaconda\lib\site-packages\sklearn\neighbors\_base.py in kneighbors(self, X, n_neighbors, return_distance)
    761         )
    762         if use_pairwise_distances_reductions:
--> 763             results = PairwiseDistancesArgKmin.compute(
    764                 X=X,
    765                 Y=self._fit_X,

sklearn\metrics\_pairwise_distances_reduction.pyx in sklearn.metrics._pairwise_distances_reduction.PairwiseDistancesArgKmin.compute()

D:\anaconda\lib\site-packages\sklearn\utils\fixes.py in threadpool_limits(limits, user_api)
    149         return controller.limit(limits=limits, user_api=user_api)
    150     else:
--> 151         return threadpoolctl.threadpool_limits(limits=limits, user_api=user_api)
    152 
    153 

D:\anaconda\lib\site-packages\threadpoolctl.py in __init__(self, limits, user_api)
    169             self._check_params(limits, user_api)
    170 
--> 171         self._original_info = self._set_threadpool_limits()
    172 
    173     def __enter__(self):

D:\anaconda\lib\site-packages\threadpoolctl.py in _set_threadpool_limits(self)
    266             return None
    267 
--> 268         modules = _ThreadpoolInfo(prefixes=self._prefixes,
    269                                   user_api=self._user_api)
    270         for module in modules:

D:\anaconda\lib\site-packages\threadpoolctl.py in __init__(self, user_api, prefixes, modules)
    338 
    339             self.modules = []
--> 340             self._load_modules()
    341             self._warn_if_incompatible_openmp()
    342         else:

D:\anaconda\lib\site-packages\threadpoolctl.py in _load_modules(self)
    371             self._find_modules_with_dyld()
    372         elif sys.platform == "win32":
--> 373             self._find_modules_with_enum_process_module_ex()
    374         else:
    375             self._find_modules_with_dl_iterate_phdr()

D:\anaconda\lib\site-packages\threadpoolctl.py in _find_modules_with_enum_process_module_ex(self)
    483 
    484                 # Store the module if it is supported and selected
--> 485                 self._make_module_from_path(filepath)
    486         finally:
    487             kernel_32.CloseHandle(h_process)

D:\anaconda\lib\site-packages\threadpoolctl.py in _make_module_from_path(self, filepath)
    513             if prefix in self.prefixes or user_api in self.user_api:
    514                 module_class = globals()[module_class]
--> 515                 module = module_class(filepath, prefix, user_api, internal_api)
    516                 self.modules.append(module)
    517 

D:\anaconda\lib\site-packages\threadpoolctl.py in __init__(self, filepath, prefix, user_api, internal_api)
    604         self.internal_api = internal_api
    605         self._dynlib = ctypes.CDLL(filepath, mode=_RTLD_NOLOAD)
--> 606         self.version = self.get_version()
    607         self.num_threads = self.get_num_threads()
    608         self._get_extra_info()

D:\anaconda\lib\site-packages\threadpoolctl.py in get_version(self)
    644                              lambda: None)
    645         get_config.restype = ctypes.c_char_p
--> 646         config = get_config().split()
    647         if config[0] == b"OpenBLAS":
    648             return config[1].decode("utf-8")

AttributeError: 'NoneType' object has no attribute 'split'
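The traceback ends inside threadpoolctl's get_version() rather than in tmplot itself, which suggests (this is an assumption, not something confirmed in this thread) that the failure depends on the installed environment. A minimal sketch for collecting the relevant package versions to include in a report; `installed_versions` is a hypothetical helper and the package list is only a guess at what matters here:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("threadpoolctl", "scikit-learn", "numpy", "scipy", "tmplot")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```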

maximtrp · 2022-08-25T07:24:07Z

Could you please post the code and a data sample that produce this error?

caicai555 · 2022-08-25T19:16:27Z

Something like this, I think. Here is the code, run in a Jupyter notebook with Python 3.8.8:

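import bitermplus as btm
import numpy as np
import pandas as pd
import pickle as pkl
import matplotlib.pyplot as plt
import tmplot as tmp

# Load the data (one pre-tokenized document per line)
df = pd.read_csv('../data/cmdata_cutted_filt_stp.txt', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()

# Preprocessing: omitted in the original post; assumed to follow the bitermplus README
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

# Train the model
n_topics = 50
model = btm.BTM(
    X, vocabulary, seed=666, T=n_topics, M=100, alpha=50/n_topics, beta=0.01)
model.fit(biterms, iterations=20)

# Visualize
tmp.report(model=model, docs=texts)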

And here's part of the [texts] data, which is probably not the cause of this issue.

texts
['听闻 喜得 贵子 世代 为官',
 '玩意 养大',
 '了噜 二代 吓死',
 '婴儿 身上 香味 猫咪 味道 吸引 过来',
 '喜欢 折耳 逼迫 不养 想养 做好 准备 存在 所有 都病 存在 避免 性状 会病 看到 家里 痛苦 忍下 看着 痛苦 善待 科普 基因 致病 基因 同一个 基因 发病率 混血 发病 混血 不会 发病 网上 资料 欧洲 联盟 英国 爱猫 协会 国际 猫咪 联盟 承认 美国 利益 没有 国内 名气 猫咪 声名 原因 希望 全面 认识 骨骼 是否 购买 做出 选择 没有 买卖 没有 繁育 购买者 折耳 低于 贩子 期望 价格 出售 导致 利润 降低 抑制 繁育 领养 了折 耳猫 用心 照顾 以下 注意事项 饮食 科学 饮食 为主 吃零食 千万别 补钙 喜欢 爬高 喜欢 增加 踏脚板 舒适 柔软 包括 纯种 容易 结石 体质 折耳 身体素质 体重 要求 饮水 摄入量 看到 提供 电梯 抱起 放下 前爪 先着 坚持 检查 发病 之后 可用 软骨素 缓解 症状 原创 当妈 以后 看见 折耳 视频 评论 复制粘贴 弹幕 确实',
 '开玩笑 孩子 需要 大人 监护 才能 宠物 不能 可爱 教训',
 '围着 嫉妒',
 '最好 看着 可爱 痛苦',
 '原名 直接 看到 授权 视频 大家 吸猫',
 '熙春 授权 视频 应该 属于 撞车 视频',
 '油管 看过 视频 宝宝 走路 所有 宝宝 毯子 睡着 总有 猫会 静静 宝宝 身边',
 '感觉 猫猫 感受 人类 崽崽 没有 爪子 试探 方法 看一看 闻一闻',
 '弹幕 科普 发言 弹幕',
 '最好 动物 靠近 婴儿',
 '小东西 小会 长大',
 '酥梨 酥梨 哈哈哈哈 哈哈哈哈 可爱',
 '熙春 支持 正版 授权 理智 观看 举报 拉黑 屏蔽 视频 营销 没毛 区别 支持 正版 授权 支持 正版 授权 猫咪 授权 只给 熙春 有没有 来源 大家 视为 视频 谢谢 上去 看到 拜托',
 '哈哈 最后 弹幕 怪味 换尿布',
 '家里 波斯猫 身上 争宠',
 '感觉 不能 太小 孩子 零距离 接触 小孩 很嫩 就算 不是故意 可能 弄伤 小孩 不会 控制 力气 力气 不大 要命',
 '出生 家庭 后来 面前 眼睛 麻麻 抓瞎 送走',
 '看着 长大 了能 滑稽',
 '观察 准备 小孩',
 '闻到 香味儿',
 '骨科 发病 痛苦 喜欢',
 '婴儿 案例 切记',
 '小时候 懂事 觉醒 宠物 吓死',
 '听话 聪明 不想 商量 出门时 躲起来',
 '小孩子 安全',
 '视频 麻麻',
 '眼神 超级',
 '猫咪 熟悉 成员 气味',
 '油管 搬运 挣个 容易 搬运 不标 搬运 来源',
 '看到 韩语',
 '换尿布 哈哈哈 哈哈哈',
 '过来 闻闻 子奶 味儿',
 '小孩 一巴掌',
 '喂奶 恶心 打回 娘胎 回炉 重造',
 '玩意 养大', ...]

Thanks for your great effort and quick replies on this issue. If you need the exact model file, please let me know :)

(By the way, I've tried analysing my texts with your BitermPlus package: it is concise and elegant, and quite easy to get the hang of. However, with the same settings the results differ from those of the R package BTM (https://github.com/bnosac/BTM) and BTMpy (https://github.com/jasperyang/BTMpy). The latter two gave the same topic distribution and their top words seem to make sense, but the BitermPlus output is a bit confusing. I am a beginner in BTM and not sure what's behind the problem; maybe you could check it if you have the time?)

PS: maybe I should leave an issue there but I don't want to annoy you xd

maximtrp · 2022-08-26T07:29:47Z

Thank you for your comments! I will try to reproduce this bug. By the way, how many iterations have you used with bitermplus? The authors of the algorithm refer to 2000 in their paper. I have found 600 to be more or less sufficient.

caicai555 · 2022-08-26T18:41:47Z

Due to time restrictions, I have only tried 200 iterations at most XD. I'll test with more iterations later when I return to my lab. What confuses me the most is that the results differ even when all the settings are exactly the same.

Thanks again for your warm help :)
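When comparing output from different BTM implementations, note that topic indices are arbitrary: topic 3 in one model need not correspond to topic 3 in another, and different random number generators can give different results even with the same seed. A minimal sketch of one way to compare two models, assuming each is available as a topics-by-words matrix (one row per topic) over the same vocabulary in the same order; `cosine` and `match_topics` are hypothetical helpers, not part of any of the packages mentioned:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_topics(phi_a, phi_b):
    """For each topic (row) of phi_a, find the most similar topic in phi_b.

    Returns a list of (index_in_a, index_in_b, similarity) tuples.
    The matching is greedy per row, so two rows of phi_a may map to the
    same row of phi_b.
    """
    matches = []
    for i, row_a in enumerate(phi_a):
        sims = [cosine(row_a, row_b) for row_b in phi_b]
        j = max(range(len(sims)), key=sims.__getitem__)
        matches.append((i, j, sims[j]))
    return matches


if __name__ == "__main__":
    # Toy example: the same two topics, listed in a different order.
    phi_a = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
    phi_b = [[0.1, 0.1, 0.8], [0.8, 0.2, 0.0]]
    for i, j, sim in match_topics(phi_a, phi_b):
        print(f"topic {i} in A best matches topic {j} in B (cosine {sim:.3f})")
```

If most topics find a close match, the two models broadly agree and differ only in topic order; many low similarities would point to a real difference, for example too few iterations.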

maximtrp self-assigned this Aug 19, 2022

maximtrp added the bug and enhancement labels Aug 19, 2022

maximtrp closed this as completed in d5c6f0f Aug 19, 2022

maximtrp reopened this Aug 25, 2022

LappisProblems mentioned this issue Jun 9, 2023

AttributeError: 'NoneType' object has no attribute 'split' #8

Closed

maximtrp added a commit that referenced this issue Oct 27, 2023

fixed tsne perplexity error (#6, #9), minor code improvements, tests …

3364b7a

maximtrp closed this as completed Oct 27, 2023

yogurt-shadow mentioned this issue Apr 11, 2024

ValueError: perplexity must be less than n_samples nhamlv-55/Ropey#7

Closed
