If you are passing by, please consider giving the JStarCraft framework a Star as encouragement for the author!
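JStarCraft NLP is a lightweight engine for natural language processing, released under the Apache 2.0 license.
It focuses on several core problems in natural language processing:
- Lexical analysis
- Syntactic analysis
- Semantic analysis
- Information extraction
- Text clustering
- Text classification
It covers a variety of natural language processing algorithms and integrates multiple natural language processing frameworks. It provides researchers and developers in the field with general-purpose designs and reference implementations that meet the requirements of industrial-grade scenarios, and promotes the use of natural language processing in the Java ecosystem.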
- 1. Text relevance
    - Word relevance
    - Phrase relevance
    - Sentence relevance
    - Document relevance
- 2. Text hashing
    - Locality-sensitive hashing
    - Bloom filter
- 3. Lexical analysis
    - Word segmentation
    - Part-of-speech tagging
- 4. Syntactic analysis
    - Syntactic structure analysis
    - Dependency analysis
- 5. Semantic analysis
- 6. Information extraction
    - Language detection
    - Entity extraction
    - Relation extraction
    - Event extraction
- 7. Text clustering
- 8. Text classification
- 9. Compatibility with Lucene, Solr, and ElasticSearch
- 10. Integration with third-party frameworks
    - Ansj
    - Stanford CoreNLP
    - HanLP
    - IK
    - Jcseg
    - jieba
    - MMSEG
    - MYNLP
    - THULAC
    - word
JStarCraft NLP requires the following environment:
- JDK 8 or later
- Maven 3
git clone https://github.com/HongZhaoHua/jstarcraft-core.git
mvn -f jstarcraft-core/pom.xml install -Dmaven.test.skip=true
git clone https://github.com/HongZhaoHua/jstarcraft-ai.git
mvn -f jstarcraft-ai/pom.xml install -Dmaven.test.skip=true
git clone https://github.com/HongZhaoHua/jstarcraft-nlp.git
mvn -f jstarcraft-nlp/pom.xml install -Dmaven.test.skip=true
- Configure the Maven dependency
<dependency>
<groupId>com.jstarcraft</groupId>
<artifactId>nlp</artifactId>
<version>1.0</version>
</dependency>
- Configure the Gradle dependency
implementation group: 'com.jstarcraft', name: 'nlp', version: '1.0'
Name | Function | Default value |
---|---|---|
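**Information Entropy** measures how rich the external collocations of a fragment are, that is, how varied the text adjacent to it is;
**Mutual Information** measures how fixed the internal collocation of a fragment is, that is, how strongly its parts co-occur.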
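The two measures above can be illustrated with a small, self-contained sketch. It is not taken from the JStarCraft NLP code base: the class name `NewWordStatistics` and all of its methods are hypothetical, probabilities are estimated crudely as occurrence count divided by text length, and mutual information is taken as the minimum pointwise mutual information over all two-part splits of the fragment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of JStarCraft NLP.
final class NewWordStatistics {

    /** Shannon entropy (natural logarithm) of a frequency table; 0 for an empty table. */
    static double entropy(Map<Character, Integer> counts) {
        double total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        double entropy = 0;
        for (int count : counts.values()) {
            double p = count / total;
            entropy -= p * Math.log(p);
        }
        return entropy;
    }

    /** Number of (possibly overlapping) occurrences of fragment in text. */
    static int count(String text, String fragment) {
        int count = 0;
        for (int i = text.indexOf(fragment); i >= 0; i = text.indexOf(fragment, i + 1)) {
            count++;
        }
        return count;
    }

    /** Entropy of the characters directly to the left (or right) of each occurrence. */
    static double neighborEntropy(String text, String fragment, boolean left) {
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = text.indexOf(fragment); i >= 0; i = text.indexOf(fragment, i + 1)) {
            int j = left ? i - 1 : i + fragment.length();
            if (j >= 0 && j < text.length()) {
                counts.merge(text.charAt(j), 1, Integer::sum);
            }
        }
        return entropy(counts);
    }

    /**
     * Minimum pointwise mutual information over all two-part splits.
     * Assumes the fragment has at least two characters and occurs in the text.
     */
    static double mutualInformation(String text, String fragment) {
        double n = text.length();
        double pWhole = count(text, fragment) / n;
        double min = Double.POSITIVE_INFINITY;
        for (int k = 1; k < fragment.length(); k++) {
            double pLeft = count(text, fragment.substring(0, k)) / n;
            double pRight = count(text, fragment.substring(k)) / n;
            min = Math.min(min, Math.log(pWhole / (pLeft * pRight)));
        }
        return min;
    }

    public static void main(String[] args) {
        String text = "abxabyabz";
        System.out.println(neighborEntropy(text, "ab", true));
        System.out.println(neighborEntropy(text, "ab", false));
        System.out.println(mutualInformation(text, "ab"));
    }
}
```

In the toy text, "ab" is followed by three different characters, so its right-side entropy is ln 3, while "a" and "b" never occur apart, which gives a high mutual information. A fragment that scores high on both measures is a plausible word candidate.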
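Code | Name | Word class | Description |
---|---|---|---|
A | Adjective | Content word | First letter of the English word "adjective" |
C | Conjunction | Function word | First letter of the English word "conjunction" |
D | Adverb | Function word | Second letter of the English word "adverb" |
E | Interjection | Function word | First letter of the English word "exclamation" |
M | Numeral | Content word | Third letter of the English word "numeral" |
N | Noun | Content word | First letter of the English word "noun" |
O | Onomatopoeia | Function word | First letter of the English word "onomatopoeia" |
P | Preposition | Function word | First letter of the English word "preposition" |
Q | Quantifier (measure word) | Content word | First letter of the English word "quantity" |
R | Pronoun | Content word | Second letter of the English word "pronoun" |
T | Article | Function word | Third letter of the English word "article" |
U | Auxiliary (particle) | Function word | Second letter of the English word "auxiliary" |
V | Verb | Content word | First letter of the English word "verb" |
W | Punctuation | | |
X | Unknown | | |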
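As an illustration only, the part-of-speech table above could be modelled in Java as an enum keyed by the single-letter code. The names `PosTag`, `WordClass`, and `fromCode` are hypothetical and are not claimed to match any type in JStarCraft NLP; unrecognised codes fall back to `X` (unknown) by assumption.

```java
// Hypothetical sketch of the tag table; not the library's own type.
enum PosTag {
    A("adjective", WordClass.CONTENT),
    C("conjunction", WordClass.FUNCTION),
    D("adverb", WordClass.FUNCTION),
    E("exclamation", WordClass.FUNCTION),
    M("numeral", WordClass.CONTENT),
    N("noun", WordClass.CONTENT),
    O("onomatopoeia", WordClass.FUNCTION),
    P("preposition", WordClass.FUNCTION),
    Q("quantity", WordClass.CONTENT),
    R("pronoun", WordClass.CONTENT),
    T("article", WordClass.FUNCTION),
    U("auxiliary", WordClass.FUNCTION),
    V("verb", WordClass.CONTENT),
    W("punctuation", WordClass.NONE),
    X("unknown", WordClass.NONE);

    /** Content words, function words, or neither (punctuation and unknown). */
    enum WordClass { CONTENT, FUNCTION, NONE }

    private final String label;
    private final WordClass wordClass;

    PosTag(String label, WordClass wordClass) {
        this.label = label;
        this.wordClass = wordClass;
    }

    String label() {
        return label;
    }

    WordClass wordClass() {
        return wordClass;
    }

    /** Looks up a tag by its one-letter code, ignoring case; unknown codes map to X. */
    static PosTag fromCode(char code) {
        try {
            return valueOf(String.valueOf(Character.toUpperCase(code)));
        } catch (IllegalArgumentException exception) {
            return X;
        }
    }
}
```

For example, `PosTag.fromCode('n').wordClass()` yields `WordClass.CONTENT`, which makes it easy to filter out function words and punctuation after tagging.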
Code | Name |
---|---|
cmn | Mandarin Chinese |
spa | Spanish |
eng | English |
rus | Russian |
arb | Standard Arabic |
ben | Bengali |
hin | Hindi |
por | Portuguese |
ind | Indonesian |
jpn | Japanese |
fra | French |
deu | German |
jav | Javanese |
kor | Korean |
tel | Telugu |
vie | Vietnamese |
mar | Marathi |
ita | Italian |
tam | Tamil |
tur | Turkish |
urd | Urdu |
guj | Gujarati |
pol | Polish |
ukr | Ukrainian |
fas | Persian |
kan | Kannada |
mai | Maithili |
mal | Malayalam |
mya | Burmese |
ori | Oriya (macrolanguage) |
gax | Borana-Arsi-Guji Oromo |
swh | Swahili (individual language) |
sun | Sundanese |
ron | Romanian |
pan | Panjabi |
bho | Bhojpuri |
amh | Amharic |
hau | Hausa |
fuv | Nigerian Fulfulde |
bos | Bosnian (Cyrillic) |
bos | Bosnian (Latin) |
hrv | Croatian |
nld | Dutch |
srp | Serbian (Cyrillic) |
srp | Serbian (Latin) |
tha | Thai |
ckb | Central Kurdish |
yor | Yoruba |
uzn | Northern Uzbek (Cyrillic) |
uzn | Northern Uzbek (Latin) |
zlm | Malay (individual language) (Arabic) |
zlm | Malay (individual language) (Latin) |
ibo | Igbo |
nep | Nepali (macrolanguage) |
ceb | Cebuano |
skr | Saraiki |
tgl | Tagalog |
hun | Hungarian |
azj | North Azerbaijani (Cyrillic) |
azj | North Azerbaijani (Latin) |
sin | Sinhala |
koi | Komi-Permyak |
ell | Modern Greek (1453-) |
ces | Czech |
run | Rundi |
bel | Belarusian |
plt | Plateau Malagasy |
qug | Chimborazo Highland Quichua |
mad | Madurese |
nya | Nyanja |
zyb | Yongbei Zhuang |
pbu | Northern Pashto |
kin | Kinyarwanda |
zul | Zulu |
bul | Bulgarian |
swe | Swedish |
lin | Lingala |
som | Somali |
hms | Southern Qiandong Miao |
hnj | Hmong Njua |
ilo | Iloko |
kaz | Kazakh |
uig | Uighur (Arabic) |
uig | Uighur (Latin) |
hat | Haitian |
khm | Khmer |
aka | Akan |
hil | Hiligaynon |
sna | Shona |
tat | Tatar |
xho | Xhosa |
hye | Armenian |
min | Minangkabau |
afr | Afrikaans |
lua | Luba-Lulua |
sat | Santali |
bod | Tibetan |
tir | Tigrinya |
fin | Finnish |
slk | Slovak |
tuk | Turkmen (Cyrillic) |
tuk | Turkmen (Latin) |
dan | Danish |
nob | Norwegian Bokmål |
suk | Sukuma |
als | Tosk Albanian |
sag | Sango |
nno | Norwegian Nynorsk |
heb | Hebrew |
mos | Mossi |
tgk | Tajik |
cat | Catalan |
sot | Southern Sotho |
kat | Georgian |
bcl | Central Bikol |
glg | Galician |
lao | Lao |
lit | Lithuanian |
umb | Umbundu |
tsn | Tswana |
vec | Venetian |
nso | Pedi |
ban | Balinese |
bug | Buginese |
knc | Central Kanuri |
kng | Koongo |
ibb | Ibibio |
lug | Ganda |
ace | Achinese |
bam | Bambara |
tzm | Central Atlas Tamazight |
ydd | Eastern Yiddish |
kmb | Kimbundu |
lun | Lunda |
shn | Shan |
war | Waray (Philippines) |
dyu | Dyula |
wol | Wolof |
kir | Kirghiz |
nds | Low German |
fuf | Pular |
mkd | Macedonian |
vmw | Makhuwa |
zgh | Standard Moroccan Tamazight |
ewe | Ewe |
khk | Halh Mongolian |
slv | Slovenian |
ayr | Central Aymara |
bem | Bemba (Zambia) |
emk | Eastern Maninkakan |
bci | Baoulé |
bum | Bulu (Cameroon) |
epo | Esperanto |
pam | Pampanga |
tiv | Tiv |
tpi | Tok Pisin |
ven | Venda |
ssw | Swati |
nyn | Nyankole |
kbd | Kabardian |
iii | Sichuan Yi |
yao | Yao |
lav | Latvian |
quz | Cusco Quechua |
src | Logudorese Sardinian |
sco | Scots |
tso | Tsonga |
rmy | Vlax Romani |
men | Mende (Sierra Leone) |
fon | Fon |
nhn | Central Nahuatl |
dip | Northeastern Dinka |
kde | Makonde |
snn | Siona |
kbp | Kabiyè |
tem | Timne |
toi | Tonga (Zambia) |
est | Estonian |
snk | Soninke |
cjk | Chokwe |
ada | Adangme |
aii | Assyrian Neo-Aramaic |
quy | Ayacucho Quechua |
rmn | Balkan Romani |
bin | Bini |
gaa | Ga |
ndo | Ndonga |
nym | Nyamwezi |
sus | Susu |
tly | Talysh |
srr | Serer |
kha | Khasi |
hea | Northern Qiandong Miao |
gkp | Guinea Kpelle |
hni | Hani |
fry | Western Frisian |
yua | Yucateco |
fij | Fijian |
fur | Friulian |
tet | Tetum |
wln | Walloon |
eus | Basque |
oss | Ossetian |
nbl | South Ndebele |
pov | Upper Guinea Crioulo |
cym | Welsh |
lus | Lushai |
dag | Dagbani |
dga | Southern Dagaare |
bre | Breton |
kek | Kekchí |
lij | Ligurian |
pcd | Picard |
roh | Romansh |
bfa | Bari |
kri | Krio |
cnh | Hakha Chin |
lob | Lobi |
arn | Mapudungun |
bba | Baatonum |
dzo | Dzongkha |
kea | Kabuverdianu |
sah | Yakut |
smo | Samoan |
koo | Konzo |
nzi | Nzima |
maz | Central Mazahua |
pis | Pijin |
ctd | Tedim Chin |
cos | Corsican |
ltz | Luxembourgish |
lia | West-Central Limba |
mlt | Maltese |
hna | Mina (Cameroon) |
zdj | Ngazidja Comorian |
guc | Wayuu |
qwh | Huaylas Ancash Quechua |
quc | K'iche' |
div | Dhivehi |
isl | Icelandic |
kqn | Kaonde |
pap | Papiamento |
gle | Irish |
dyo | Jola-Fonyi |
hns | Caribbean Hindustani |
gjn | Gonja |
njo | Ao Naga |
hus | Huastec |
mag | Magahi |
xsm | Kasem |
ote | Mezquital Otomi |
qxn | Northern Conchucos Ancash Quechua |
tyv | Tuvinian |
gag | Gagauz |
san | Sanskrit |
shk | Shilluk |
nba | Nyemba |
miq | Mískito |
mam | Mam |
tah | Tahitian |
nav | Navajo |
ami | Amis |
lot | Otuho |
cak | Kaqchikel |
tzh | Tzeltal |
tzo | Tzotzil |
lns | Lamnso' |
ton | Tonga (Tonga Islands) |
tbz | Ditammari |
lad | Ladino |
vai | Vai |
mto | Totontepec Mixe |
ady | Adyghe |
abk | Abkhazian |
ast | Asturian |
tsz | Purepecha |
swb | Maore Comorian |
cab | Garifuna |
krl | Karelian |
zam | Miahuatlán Zapotec |
top | Papantla Totonac |
cha | Chamorro |
crs | Seselwa Creole French |
ddn | Dendi (Benin) |
loz | Lozi |
mri | Maori |
hsb | Upper Sorbian |
cri | Sãotomense |
pbb | Páez |
alt | Southern Altai |
qva | Ambo-Pasco Quechua |
mxv | Metlatónoc Mixtec |
gla | Scottish Gaelic |
kjh | Khakas |
csw | Swampy Cree |
qvm | Margos-Yarowilca-Lauricocha Quechua |
fao | Faroese |
kal | Kalaallisut |
cni | Asháninka |
chk | Chuukese |
mah | Marshallese |
rar | Rarotongan |
evn | Evenki |
qvn | North Junín Quechua |
wwa | Waama |
buc | Bushi |
qvh | Huamalíes-Dos de Mayo Huánuco Quechua |
toj | Tojolabal |
lue | Luvale |
qvc | Cajamarca Quechua |
ojb | Northwestern Ojibwa |
jiv | Shuar |
qud | Calderón Highland Quichua |
lld | Ladin |
hlt | Matu Chin |
que | Quechua |
pon | Pohnpeian |
agr | Aguaruna |
qxa | Chiquián Ancash Quechua |
quh | South Bolivian Quechua |
tca | Ticuna |
chj | Ojitlán Chinantec |
ike | Eastern Canadian Inuktitut |
kwi | Awa-Cuaiquer |
rgn | Romagnol |
oki | Okiek |
tob | Toba |
guu | Yanomamö |
qxu | Arequipa-La Unión Quechua |
pau | Palauan |
shp | Shipibo-Conibo |
gld | Nanai |
gug | Paraguayan Guaraní |
mzi | Ixcatlán Mazatec |
cjs | Shor |
mic | Mi'kmaq |
haw | Hawaiian |
eve | Even |
yap | Yapese |
cbt | Chayahuita |
ame | Yanesha' |
gyr | Guarayu |
vep | Veps |
cpu | Pichis Ashéninka |
acu | Achuar-Shiwiar |
not | Nomatsiguenga |
sme | Northern Sami |
yad | Yagua |
ura | Urarina |
cbu | Candoshi-Shapra |
huu | Murui Huitoto |
cof | Colorado |
boa | Bora |
ztu | Güilá Zapotec |
piu | Pintupi-Luritja |
cbr | Cashibo-Cacataibo |
mcf | Matsés |
bis | Bislama |
orh | Oroqen |
ykg | Northern Yukaghir |
ese | Ese Ejja |
nio | Nganasan |
cic | Chickasaw |
csa | Chiltepec Chinantec |
mcd | Sharanahua |
amc | Amahuaca |
amr | Amarakaeri |
cot | Caquinte |
oaa | Orok |
ajg | Aja (Benin) |
arl | Arabela |
ppl | Pipil |
bax | Bamun |
nku | Bouna Kulango |
cbi | Chachi |
ccp | Chakma |
chr | Cherokee (Cherokee) |
chr | Cherokee (Cherokee) |
duu | Drung |
cfm | Falam Chin |
fat | Fanti |
ido | Ido |
ina | Interlingua (International Auxiliary Language Association) |
kkh | Khün |
ktu | Kituba (Democratic Republic of Congo) |
fkv | Kven Finnish |
lat | Latin |
glv | Manx |
mfq | Moba |
mnw | Mon |
mxi | Mozarabic |
pcm | Nigerian Pidgin |
niu | Niuean |
kqs | Northern Kissi |
sey | Secoya |
ekk | Standard Estonian |
lvs | Standard Latvian |
blt | Tai Dam |
kdh | Tem |
tdt | Tetun Dili |
twi | Twi (Latin) |
twi | Twi (Latin) |
auc | Waorani |
gaz | West Central Oromo |
pnb | Western Panjabi |
zro | Záparo |
JStarCraft NLP is released under the Apache 2.0 license; any derivative work based on it belongs to the author of that derivative work.
Author | 洪钊桦 |
---|---|
E-mail | [email protected], [email protected] |