<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html><head><title>Python: module jgtextrank.__init__</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body bgcolor="#f0f0f8">
<table width="100%" cellspacing=0 cellpadding=2 border=0 summary="heading">
<tr bgcolor="#7799ee">
<td valign=bottom> <br>
<font color="#ffffff" face="helvetica, arial"> <br><big><big><strong><a href="jgtextrank.html"><font color="#ffffff">jgtextrank</font></a>.__init__</strong></big></big> (version 0.1.3)</font></td
><td align=right valign=bottom
><font color="#ffffff" face="helvetica, arial"><a href=".">index</a><br><a href="file:c%3A%5Coak-project%5Cpython%5Cgithub%5Cjgtextrank%5Cjgtextrank%5C__init__.py">c:\oak-project\python\github\jgtextrank\jgtextrank\__init__.py</a></font></td></tr></table>
<p><tt>jgtextrank: Yet another Python implementation of TextRank<br>
==================================<br>
<br>
jgtextrank is a Python package for the creation, manipulation, and study of the TextRank algorithm, a graph-based approach to keywords extraction and summarization<br>
<br>
<br>
Website (including documentation)::<br>
<br>
https://github.com/jerrygaoLondon/jgtextrank<br>
<br>
Source::<br>
<br>
https://github.com/jerrygaoLondon/jgtextrank<br>
<br>
Bug reports::<br>
<br>
https://github.com/jerrygaoLondon/jgtextrank/issues<br>
<br>
Simple example<br>
--------------<br>
Extract weighted keywords with an undirected graph::<br>
<br>
>>> from jgtextrank import keywords_extraction<br>
>>> example_abstract = "Compatibility of systems of linear constraints over the set of natural numbers. " "Criteria of compatibility of a system of linear Diophantine equations, strict inequations, " "and nonstrict inequations are considered. Upper bounds for components of a minimal set of " "solutions and algorithms of construction of minimal generating sets of solutions for all " "types of systems are given. These criteria and the corresponding algorithms for " "constructing a minimal supporting set of solutions can be used in solving all the " "considered types systems and systems of mixed types."<br>
>>> <a href="#-keywords_extraction">keywords_extraction</a>(example_abstract, top_p = 1, directed=False, weight_comb="sum")[0][:15]<br>
[('linear diophantine equations', 0.18059), ('minimal supporting set', 0.16649), ('minimal set', 0.13201), ('types systems', 0.1194), ('linear constraints', 0.10997), ('strict inequations', 0.08832), ('systems', 0.08351), ('corresponding algorithms', 0.0767), ('nonstrict inequations', 0.07276), ('mixed types', 0.07178), ('set', 0.06674), ('minimal', 0.06527), ('natural numbers', 0.06466), ('algorithms', 0.05479), ('solutions', 0.05085)]<br>
<br>
<br>
License<br>
-------<br>
<br>
Released under the MIT License::<br>
<br>
Copyright (C) 2017, JIE GAO <[email protected]></tt></p>
<p>
<table width="100%" cellspacing=0 cellpadding=2 border=0 summary="section">
<tr bgcolor="#eeaa77">
<td colspan=3 valign=bottom> <br>
<font color="#ffffff" face="helvetica, arial"><big><strong>Functions</strong></big></font></td></tr>
<tr><td bgcolor="#eeaa77"><tt> </tt></td><td> </td>
<td width="100%"><dl><dt><a name="-build_cooccurrence_graph"><strong>build_cooccurrence_graph</strong></a>(preprocessed_context:Generator[Tuple[List[str], List[Tuple[str, str]]], NoneType, NoneType], directed:bool=False, weighted:bool=False, conn_with_original_ctx=True, window:int=2) -> Tuple[networkx.classes.graph.Graph, List[List[str]]]</dt><dd><tt>build cooccurrence graph from filtered context<br>
and only consider single words as candidates for addition to the graph<br>
<br>
prepare vertex representation -> add vertex > add edges<br>
<br>
For directed or undirected, the conclusion of the paper is that "no 'direction' that can be established between<br>
co-occurring words."<br>
<br>
:type preprocessed_context: generator or list/iterable<br>
:param preprocessed_context: tuples of tokenised text and PoS-tagged text filtered by the syntactic filter<br>
:type directed: bool<br>
:param directed: default False; the best results were observed with an undirected graph<br>
:type weighted: bool. Custom weights are not supported yet<br>
:TODO: directed graphs are not fully supported yet; forward and backward co-occurrence still need to be defined.<br>
For a directed graph, the direction should follow the natural flow of the text<br>
:type conn_with_original_ctx: bool<br>
:param conn_with_original_ctx: True to check the co-occurrence link between two vertices in the original context;<br>
otherwise, connections are checked in the filtered context.<br>
More vertex connections can be built if 'conn_with_original_ctx' is set to False<br>
:type window: int<br>
:param window: a window of N words<br>
:rtype: tuple[nx.Graph, list]<br>
:return: (networkx) graph object ready to be scored, along with all tokenised raw text split by context</tt></dd></dl>
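<p><tt>A minimal usage sketch (not part of the original docs; it assumes the NLTK 'punkt' and 'averaged_perceptron_tagger' resources are installed, see <a href="#-preprocessing">preprocessing</a>):<br>
<br>
>>> from jgtextrank import preprocessing, build_cooccurrence_graph<br>
>>> text = "Compatibility of systems of linear constraints over the set of natural numbers."<br>
>>> preprocessed_context = preprocessing(text)  # generator of (tokenised context, filtered PoS-tagged tokens)<br>
>>> graph, original_tokenised_context = build_cooccurrence_graph(preprocessed_context, directed=False, window=2)<br>
>>> sorted(graph.nodes())  # vertices are the syntactically filtered single-word tokens (output omitted)<br>
</tt></p>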
<dl><dt><a name="-compute_TeRGraph"><strong>compute_TeRGraph</strong></a>(term_graph:networkx.classes.graph.Graph) -> Dict[str, float]</dt><dd><tt>compute graph vertices with TeRGraph algorithms<br>
<br>
This algorithm is based on the assumption that term representativeness in a graph for a specific domain depends on<br>
the number of neighbors that it has, and the number of neighbors of its neighbors. A term with more neighbors is<br>
less representative of the specific domain.<br>
<br>
The original paper requires a connected graph; this method sets isolated nodes to 0 (by default).<br>
<br>
Lossio-Ventura, J. A., Jonquet, C., Roche, M., & Teisseire, M. (2014, September).<br>
Yet another ranking function for automatic multiword term extraction.<br>
In International Conference on Natural Language Processing (pp. 52-64). Springer, Cham.<br>
<br>
:param term_graph: NetworkX graph<br>
:return: dict, all nodes weighted with TeRGraph metric</tt></dd></dl>
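<p><tt>A minimal usage sketch (assumed workflow, not from the original docs): score an existing co-occurrence graph with TeRGraph.<br>
<br>
>>> from jgtextrank import preprocessing, build_cooccurrence_graph, compute_TeRGraph<br>
>>> graph, _ = build_cooccurrence_graph(preprocessing("Criteria of compatibility of a system of linear Diophantine equations."))<br>
>>> tergraph_weights = compute_TeRGraph(graph)  # dict mapping each vertex to its TeRGraph weight; isolated nodes get 0<br>
</tt></p>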
<dl><dt><a name="-compute_neighborhood_size"><strong>compute_neighborhood_size</strong></a>(term_cooccur_graph) -> Dict[str, int]</dt><dd><tt>Number of immediate neighbors to a node<br>
<br>
a version of node degree that disregards self-loops (e.g., "again, again, again")<br>
<br>
:param term_cooccur_graph: NetworkX graph<br>
:return: dict, all nodes weighted with neighborhood size</tt></dd></dl>
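<p><tt>An illustrative sketch of the metric itself, computed directly with networkx:<br>
<br>
>>> import networkx as nx<br>
>>> g = nx.Graph()<br>
>>> g.add_edges_from([("again", "again"), ("again", "more"), ("more", "words")])  # first edge is a self-loop<br>
>>> {n: len(set(g.neighbors(n)) - {n}) for n in g.nodes()}  # node degree, disregarding self-loops<br>
{'again': 1, 'more': 2, 'words': 1}<br>
</tt></p>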
<dl><dt><a name="-keywords_extraction"><strong>keywords_extraction</strong></a>(text:str, window:int=2, top_p:float=1, top_t:Union[int, NoneType]=None, directed:bool=False, weighted:bool=False, conn_with_original_ctx:bool=True, syntactic_categories:Set[str]={'NNP', 'JJ', 'NNS', 'NN'}, stop_words:Set[str]=None, lemma:bool=False, solver:str='pagerank', max_iter:int=100, tol:float=1e-06, weight_comb:str='norm_max', mu:int=5, workers:int=1)</dt><dd><tt>TextRank keywords extraction for unstructured text<br>
<br>
:type text: string, required<br>
:param text: textual data for keywords extraction<br>
:type window: int, required<br>
:param window: co-occurrence window size (default with forward and backward context). Recommended range: 2-10<br>
:type top_t: int or None, optional<br>
:param top_t: the top T vertices in the ranking are retained for post-processing.<br>
Top T is computed from top P if this value is None<br>
:type top_p: float or None, optional<br>
:param top_p: the top percentage (P) of vertices retained for post-processing.<br>
The top 1/3 of all vertices is recommended in the original paper.<br>
:type directed: bool, required<br>
:param directed: directed or undirected graph (a reserved parameter)<br>
:type weighted: bool, optional<br>
:param weighted: weighted or unweighted; custom weighted graphs are not supported yet. Default: False.<br>
The best result was found with an unweighted graph in the original paper<br>
<br>
:type conn_with_original_ctx: bool, optional<br>
:param conn_with_original_ctx: whether to build vertex connections from the original context or from the filtered context.<br>
True to check the co-occurrence link between two vertices in the original context;<br>
otherwise, connections are checked in the context filtered by the syntactic rule<br>
<br>
More vertex connections can be built if 'conn_with_original_ctx' is set to False<br>
:type syntactic_categories: set [of string], required<br>
:param syntactic_categories: Default with noun and adjective categories.<br>
Syntactic categories (by default, Part-of-Speech (PoS) tags) are defined to<br>
filter accepted graph vertices (by default, word-based tokens as single syntactic units).<br>
<br>
Any word that does not match the predefined categories will be removed based on its corresponding PoS tag.<br>
<br>
The best result was found with noun and adjective categories only in the original paper.<br>
:type stop_words: set of [string {‘english’}], or None (default), optional<br>
:param stop_words: remove stop words from the PoS-tagged context (token tuple list).<br>
Stop words are treated as noisy common/function words.<br>
Providing a list of stop words can improve vertex network connectivity<br>
and increase the weights of more meaningful words.<br>
:type lemma: bool<br>
:param lemma: whether to lemmatise the text<br>
:type solver: string, optional<br>
:param solver: {'pagerank', 'pagerank_numpy', 'pagerank_scipy', 'betweenness_centrality', 'degree_centrality',<br>
'hits', 'closeness_centrality', 'edge_betweenness_centrality', 'eigenvector_centrality',<br>
'katz_centrality', 'communicability_betweenness', 'current_flow_closeness', 'current_flow_betweenness',<br>
'edge_current_flow_betweenness', 'load_centrality', 'clustering_coefficient', 'TeRGraph',<br>
'coreness', 'neighborhood_size'}, default 'pagerank'<br>
Graph ranking algorithms supported in networkx to use for vertex ranking.<br>
<br>
- 'pagerank' networkx pagerank implementation<br>
- 'pagerank_numpy' numpy pagerank implementation<br>
- 'pagerank_scipy' scipy pagerank implementation<br>
- 'betweenness_centrality' computes the shortest-path betweenness centrality of a node<br>
- 'degree_centrality' computes the degree centrality for nodes.<br>
- 'hits' computes the HITS algorithm for a node. The average of the authority and hub values is used<br>
- 'closeness_centrality' computes closeness centrality for nodes.<br>
- 'edge_betweenness_centrality' computes betweenness centrality for edges.<br>
The maximum edge betweenness value among all possible edge pairs is adopted for each vertex<br>
- 'eigenvector_centrality' computes the eigenvector centrality for the co-occurrence graph.<br>
- 'katz_centrality' computes the Katz centrality for the nodes based on the centrality of its neighbors.<br>
- 'communicability_betweenness' computes subgraph communicability for all pairs of nodes<br>
- 'current_flow_closeness' computes current-flow closeness centrality for nodes.<br>
- 'current_flow_betweenness' computes current-flow betweenness centrality for nodes.<br>
- 'edge_current_flow_betweenness' computes current-flow betweenness centrality for edges.<br>
- 'load_centrality' computes edge load. This is an experimental algorithm in networkx<br>
that counts the number of shortest paths crossing each edge.<br>
- 'clustering_coefficient' computes the clustering coefficient for nodes. Only undirected graph is supported.<br>
- 'TeRGraph': computes the TeRGraph (Lossio-Ventura, 2014) weights for nodes.<br>
The solver requires a connected graph and isolated nodes will be set to 0.<br>
- 'coreness' (Batagelj & Zaversnik, 2003) measures how "deep" a node (word/phrase) is in the co-occurrence network.<br>
This indicates how strongly the node is connected to the network. The "deeper" a word, the more important it is.<br>
The metric is not suitable for ranking terms directly, but it has proved to be a useful feature for keywords extraction<br>
- 'neighborhood_size' computes the number of immediate neighbors of a node.<br>
This is a version of node degree that disregards self-loops<br>
<br>
Note: Centrality measures (such as "current flow betweenness", "current flow closeness", and "communicability_betweenness")<br>
do not support loosely connected graphs, and betweenness centrality measures cannot be computed on single isolated nodes.<br>
It is recommended to reconsider the graph construction method or increase the context window size to<br>
ensure a (strongly) connected graph.<br>
:type max_iter: int, optional<br>
:param max_iter: number of maximum iteration of pagerank, katz_centrality<br>
:type tol: float, optional, default 1.0e-6<br>
:param tol: Error tolerance used to check convergence, the value varies for specific solver<br>
:type weight_comb: str<br>
:param weight_comb: {'avg', 'norm_avg', 'log_norm_avg', 'gaussian_norm_avg', 'sum', 'norm_sum', 'log_norm_sum',<br>
'gaussian_norm_sum', 'max', 'norm_max', 'log_norm_max', 'gaussian_norm_max',<br>
'len_log_norm_max', 'len_log_norm_avg', 'len_log_norm_sum'}, default 'norm_max'<br>
The weight combination method for weighting multi-word candidate terms.<br>
<br>
- 'max' : maximum value of vertex weights<br>
- 'avg' : average of vertex weights<br>
- 'sum' : sum of vertex weights<br>
- 'norm_max' : MWT unit size normalisation of 'max' weight<br>
- 'norm_avg' : MWT unit size normalisation of 'avg' weight<br>
- 'norm_sum' : MWT unit size normalisation of 'sum' weight<br>
- 'log_norm_max' : logarithm based normalisation of 'max' weight<br>
- 'log_norm_avg' : logarithm based normalisation of 'avg' weight<br>
- 'log_norm_sum' : logarithm based normalisation of 'sum' weight<br>
- 'gaussian_norm_max' : gaussian normalisation of 'max' weight<br>
- 'gaussian_norm_avg' : gaussian normalisation of 'avg' weight<br>
- 'gaussian_norm_sum' : gaussian normalisation of 'sum' weight<br>
- 'len_log_norm_max': log2(|a| + 0.1) * 'max', adapted from the C-value (Frantzi, 2000) formula<br>
- 'len_log_norm_avg': log2(|a| + 0.1) * 'avg', adapted from the C-value (Frantzi, 2000) formula<br>
- 'len_log_norm_sum': log2(|a| + 0.1) * 'sum', adapted from the C-value (Frantzi, 2000) formula<br>
<br>
NOTE: the \*_norm_\* methods penalise/smooth terms longer than the default 5-token size<br>
to achieve a saturation level as term size grows<br>
:type mu: int, optional<br>
:param mu: mean value setting a center point (default 5), so that MWT candidates near the central point are ranked higher.<br>
This parameter is only required and effective for normalisation-based MWT weighting methods<br>
:type workers: int, optional<br>
:param workers: number of workers (CPU cores)<br>
<br>
:rtype: tuple [list[tuple[string,float]], dict[string:float]]<br>
:return: keywords: sorted keywords with weights along with Top T weighted vertices<br>
:raise: ValueError</tt></dd></dl>
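<p><tt>A further usage sketch (reusing 'example_abstract' from the module example above; output omitted): switch the ranking solver and the multi-word weighting method, and unpack both return values.<br>
<br>
>>> from jgtextrank import keywords_extraction<br>
>>> keywords, top_vertices = keywords_extraction(example_abstract, window=2, top_p=0.3, solver="degree_centrality", weight_comb="norm_max")<br>
>>> keywords[:5]<br>
</tt></p>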
<dl><dt><a name="-keywords_extraction_from_corpus_directory"><strong>keywords_extraction_from_corpus_directory</strong></a>(corpus_dir:str, encoding:str='utf-8', solver:str='pagerank', max_iter:int=100, tol:float=0.0001, window:int=2, top_p:float=0.3, top_t:Union[int, NoneType]=None, directed:bool=False, weighted:bool=False, syntactic_categories:Set[str]={'NNP', 'JJ', 'NNS', 'NN'}, stop_words:Set[str]=None, lemma:bool=False, weight_comb:str='norm_max', mu:int=5, export:bool=False, export_format:str='csv', export_path:str='', workers:int=1) -> Tuple[List[Tuple[str, float]], Dict[str, float]]</dt><dd><tt>:type corpus_dir: string<br>
:param corpus_dir: corpus directory where text files are located and will be read and processed<br>
:type encoding: string, required<br>
:param encoding: encoding of the text, default 'utf-8'<br>
:type solver: string, optional<br>
:param solver: {'pagerank', 'pagerank_numpy', 'pagerank_scipy', 'betweenness_centrality', 'degree_centrality',<br>
'hits', 'closeness_centrality', 'edge_betweenness_centrality', 'eigenvector_centrality',<br>
'katz_centrality', 'communicability_betweenness', 'current_flow_closeness', 'current_flow_betweenness',<br>
'edge_current_flow_betweenness', 'load_centrality', 'clustering_coefficient', 'TeRGraph',<br>
'coreness'}, default 'pagerank'<br>
Graph ranking algorithms supported in networkx to use for vertex ranking.<br>
<br>
- 'betweenness_centrality' computes the shortest-path betweenness centrality of a node<br>
- 'degree_centrality' computes the degree centrality for nodes.<br>
- 'hits' computes the HITS algorithm for a node. The average of the authority and hub values is used<br>
- 'closeness_centrality' computes closeness centrality for nodes.<br>
- 'edge_betweenness_centrality' computes betweenness centrality for edges.<br>
The maximum edge betweenness value among all possible edge pairs is adopted for each vertex<br>
- 'eigenvector_centrality' computes the eigenvector centrality for the co-occurrence graph.<br>
- 'katz_centrality' computes the Katz centrality for the nodes based on the centrality of its neighbors.<br>
- 'communicability_betweenness' computes subgraph communicability for all pairs of nodes<br>
- 'current_flow_closeness' computes current-flow closeness centrality for nodes.<br>
- 'current_flow_betweenness' computes current-flow betweenness centrality for nodes.<br>
- 'edge_current_flow_betweenness' computes current-flow betweenness centrality for edges.<br>
- 'load_centrality' computes edge load. This is an experimental algorithm in networkx<br>
that counts the number of shortest paths crossing each edge.<br>
- 'clustering_coefficient' computes the clustering coefficient for nodes. Only undirected graph is supported.<br>
- 'TeRGraph': computes the TeRGraph (Lossio-Ventura, 2014) weights for nodes.<br>
The solver requires a connected graph and isolated nodes will be set to 0.<br>
- 'coreness' (Batagelj & Zaversnik, 2003) measures how "deep" a node (word/phrase) is in the co-occurrence network.<br>
This indicates how strongly the node is connected to the network. The "deeper" a word, the more important it is.<br>
The metric is not suitable for ranking terms directly, but it has proved to be a useful feature for keywords extraction<br>
- 'neighborhood_size' computes the number of immediate neighbors of a node.<br>
This is a version of node degree that disregards self-loops<br>
<br>
Note: Centrality measures (such as "current flow betweenness", "current flow closeness", and "communicability_betweenness")<br>
do not support loosely connected graphs, and betweenness centrality measures cannot be computed on single isolated nodes.<br>
It is recommended to reconsider the graph construction method or increase the context window size to<br>
ensure a (strongly) connected graph.<br>
:type max_iter: int, optional<br>
:param max_iter: number of maximum iteration of pagerank, katz_centrality<br>
:type tol: float, optional, default 1e-4<br>
:param tol: Error tolerance used to check convergence, the value varies for specific solver<br>
:type window: int, required<br>
:param window: co-occurrence window size (default with forward and backward context). Default value: 2<br>
:type top_p: float, required<br>
:param top_p: the top percentage of vertices retained for post-processing. Default: 1/3 of all vertices<br>
:type top_t: int|None(default), optional<br>
:param top_t: the top T vertices in the ranking are retained for post-processing<br>
if None is provided, top T will be computed from top P. Otherwise, top T will be used to filter vertices<br>
<br>
:type directed: bool, required<br>
:param directed: directed or undirected graph; the best result was found with an undirected graph in the original paper. Default: False<br>
:type weighted: bool, required<br>
:param weighted: weighted or unweighted; weighted graphs are not supported yet. Default: False.<br>
The best result was found with an unweighted graph in the original paper<br>
:type syntactic_categories: set[string], required<br>
:param syntactic_categories: Syntactic categories (by default, Part-of-Speech (PoS) tags) are defined to<br>
filter accepted graph vertices (essentially word-based tokens).<br>
Default with noun and adjective categories.<br>
<br>
Any word that does not match the predefined categories will be removed<br>
based on its corresponding PoS tag.<br>
<br>
The best result was found with noun and adjective categories only in the original paper.<br>
:type stop_words: set[string {‘english’}] | None (default), Optional<br>
:param stop_words: remove stop words from the PoS-tagged context (token tuple list).<br>
Stop words are treated as noisy common/function words.<br>
Providing a list of stop words can improve vertex network connectivity<br>
and increase the weights of more meaningful words.<br>
:type lemma: bool<br>
:param lemma: whether to lemmatise the text<br>
:type weight_comb: str<br>
:param weight_comb: {'avg', 'norm_avg', 'log_norm_avg', 'gaussian_norm_avg', 'sum', 'norm_sum', 'log_norm_sum',<br>
'gaussian_norm_sum', 'max', 'norm_max', 'log_norm_max', 'gaussian_norm_max',<br>
'len_log_norm_max', 'len_log_norm_avg', 'len_log_norm_sum'}, default 'norm_max'<br>
The weight combination method for weighting multi-word candidate terms.<br>
<br>
- 'max' : maximum value of vertex weights<br>
- 'avg' : average of vertex weights<br>
- 'sum' : sum of vertex weights<br>
- 'norm_max' : MWT unit size normalisation of 'max' weight<br>
- 'norm_avg' : MWT unit size normalisation of 'avg' weight<br>
- 'norm_sum' : MWT unit size normalisation of 'sum' weight<br>
- 'log_norm_max' : logarithm based normalisation of 'max' weight<br>
- 'log_norm_avg' : logarithm based normalisation of 'avg' weight<br>
- 'log_norm_sum' : logarithm based normalisation of 'sum' weight<br>
- 'gaussian_norm_max' : gaussian normalisation of 'max' weight<br>
- 'gaussian_norm_avg' : gaussian normalisation of 'avg' weight<br>
- 'gaussian_norm_sum' : gaussian normalisation of 'sum' weight<br>
- 'len_log_norm_max': log2(|a| + 0.1) * 'max', adapted from the C-value (Frantzi, 2000) formula<br>
- 'len_log_norm_avg': log2(|a| + 0.1) * 'avg', adapted from the C-value (Frantzi, 2000) formula<br>
- 'len_log_norm_sum': log2(|a| + 0.1) * 'sum', adapted from the C-value (Frantzi, 2000) formula<br>
<br>
NOTE: the \*_norm_\* methods penalise/smooth terms longer than the default 5-token size<br>
to achieve a saturation level as term size grows<br>
:type mu: int, optional<br>
:param mu: mean value setting a center point (default 5), so that candidates near the central point are ranked higher.<br>
This parameter is only required and effective for normalisation-based MWT weighting methods<br>
:type export: bool<br>
:param export: True to export the result, else False<br>
:type export_format: string<br>
:param export_format: export file format. Supported options: "csv" | "json". Default: "csv"<br>
:type export_path: string<br>
:param export_path: file path where the result will be exported to<br>
:type workers: int<br>
:param workers: available CPU cores that can be used to parallelize co-occurrence computation<br>
:rtype: tuple [list[tuple[string,float]], dict[string:float]]<br>
:return: keywords: sorted keywords with weights along with Top T weighted vertices</tt></dd></dl>
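<p><tt>A minimal usage sketch (the directory and output paths below are placeholders): process every text file in a corpus directory and export the result to CSV.<br>
<br>
>>> from jgtextrank import keywords_extraction_from_corpus_directory<br>
>>> keywords, top_vertices = keywords_extraction_from_corpus_directory("corpus_dir", encoding="utf-8", top_p=0.3, export=True, export_format="csv", export_path="keywords.csv")<br>
</tt></p>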
<dl><dt><a name="-keywords_extraction_from_segmented_corpus"><strong>keywords_extraction_from_segmented_corpus</strong></a>(segmented_corpus_context:Union[Generator[List[str], NoneType, NoneType], jgtextrank.utility.CorpusContent2RawSentences], solver:str='pagerank', max_iter:int=100, tol:float=1e-06, window:int=2, top_p:float=0.3, top_t:Union[int, NoneType]=None, directed:bool=False, weighted:bool=False, conn_with_original_ctx:bool=True, syntactic_categories:Set[str]={'NNP', 'JJ', 'NNS', 'NN'}, stop_words:Set[str]=None, lemma:bool=False, weight_comb:str='norm_max', mu:int=5, export:bool=False, export_format:str='csv', export_path:str='', encoding:str='utf-8', workers:int=1) -> Tuple[List[Tuple[str, float]], Dict[str, float]]</dt><dd><tt>TextRank keywords extraction for a list of context of tokenised textual corpus.<br>
This method allows any pre-defined keyword co-occurrence context criteria (e.g., sentence, or paragraph,<br>
or section, or a user-defined segment) and any pre-defined word segmentation<br>
<br>
:type segmented_corpus_context: list|generator, required<br>
:param segmented_corpus_context: pre-tokenised corpus formatted as a pre-defined context list.<br>
A tokenised sentence list is the recommended (and default) context corpus in TextRank.<br>
You can also choose your own pre-defined co-occurrence context (e.g., paragraph, entire document, or a user-defined segment).<br>
<br>
:Example: input:<br>
<br>
>>> context_1 = ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog", ".", "hey","diddle", "diddle", ",", "the", "cat", "and", "the", "fiddle","."]<br>
<br>
>>> context_2 = ["The", "cow", "jumped", "over", "the", "moon",".", "The", "little", "dog", "laughted", "to", "see","such", "fun", "."]<br>
<br>
>>> segmented_corpus_context = [context_1, context_2]<br>
:type solver: string, optional<br>
:param solver: {'pagerank', 'pagerank_numpy', 'pagerank_scipy', 'betweenness_centrality', 'degree_centrality',<br>
'hits', 'closeness_centrality', 'edge_betweenness_centrality', 'eigenvector_centrality',<br>
'katz_centrality', 'communicability_betweenness', 'current_flow_closeness', 'current_flow_betweenness',<br>
'edge_current_flow_betweenness', 'load_centrality', 'clustering_coefficient', 'TeRGraph',<br>
'coreness'}, default 'pagerank'<br>
Graph ranking algorithms supported in networkx to use for vertex ranking.<br>
<br>
- 'betweenness_centrality' computes the shortest-path betweenness centrality of a node<br>
- 'degree_centrality' computes the degree centrality for nodes.<br>
- 'hits' computes the HITS algorithm for a node. The average of the authority and hub values is used<br>
- 'closeness_centrality' computes closeness centrality for nodes.<br>
- 'edge_betweenness_centrality' computes betweenness centrality for edges.<br>
The maximum edge betweenness value among all possible edge pairs is adopted for each vertex<br>
- 'eigenvector_centrality' computes the eigenvector centrality for the co-occurrence graph.<br>
- 'katz_centrality' computes the Katz centrality for the nodes based on the centrality of its neighbors.<br>
- 'communicability_betweenness' computes subgraph communicability for all pairs of nodes<br>
- 'current_flow_closeness' computes current-flow closeness centrality for nodes.<br>
- 'current_flow_betweenness' computes current-flow betweenness centrality for nodes.<br>
- 'edge_current_flow_betweenness' computes current-flow betweenness centrality for edges.<br>
- 'load_centrality' computes edge load. This is an experimental algorithm in networkx<br>
that counts the number of shortest paths crossing each edge.<br>
- 'clustering_coefficient' computes the clustering coefficient for nodes. Only undirected graph is supported.<br>
- 'TeRGraph': computes the TeRGraph (Lossio-Ventura, 2014) weights for nodes.<br>
The solver requires a connected graph and isolated nodes will be set to 0.<br>
- 'coreness' (Batagelj & Zaversnik, 2003) measures how "deep" a node (word/phrase) is in the co-occurrence network.<br>
This indicates how strongly the node is connected to the network. The "deeper" a word, the more important it is.<br>
The metric is not suitable for ranking terms directly, but it has proved to be a useful feature for keywords extraction<br>
- 'neighborhood_size' computes the number of immediate neighbors of a node.<br>
This is a version of node degree that disregards self-loops<br>
<br>
Note: Centrality measures (such as "current flow betweenness", "current flow closeness", and "communicability_betweenness")<br>
do not support loosely connected graphs, and betweenness centrality measures cannot be computed on single isolated nodes.<br>
It is recommended to reconsider the graph construction method or increase the context window size to<br>
ensure a (strongly) connected graph.<br>
:type max_iter: int, optional<br>
:param max_iter: number of maximum iteration of pagerank, katz_centrality<br>
:type tol: float, optional, default 1.0e-6<br>
:param tol: Error tolerance used to check convergence, the value varies for specific solver<br>
:type window: int, required<br>
:param window: co-occurrence window size (default with forward and backward context). Default value: 2<br>
:type top_p: float, optional<br>
:param top_p: the top percentage of vertices retained for post-processing. Default: 1/3 of all vertices<br>
:type top_t: int|None(default), optional<br>
:param top_t: the top T vertices in the ranking are retained for post-processing<br>
:type directed: bool, required<br>
:param directed: directed or undirected graph; the best result was found with an undirected graph in the original paper. Default: False<br>
:type weighted: bool, required<br>
:param weighted: weighted or unweighted; custom weighted graphs are not supported yet. Default: False.<br>
The best result was found with an unweighted graph in the original paper.<br>
When this is set to True, the graph construction component will try to construct a fully connected graph<br>
by connecting isolated nodes (caused by a small context window) with a low weight (default 0.001).<br>
Please check whether the ranking algorithm supports weighted graphs.<br>
Note: custom weights are not supported yet.<br>
<br>
:type conn_with_original_ctx: bool, optional<br>
:param conn_with_original_ctx: True to check the co-occurrence link between two vertices in the original context;<br>
otherwise, connections are checked in the filtered context.<br>
More vertex connections can be built if 'conn_with_original_ctx' is set to False<br>
:type syntactic_categories: set[string], required<br>
:param syntactic_categories: Syntactic categories (by default, Part-of-Speech (PoS) tags) are defined to<br>
filter accepted graph vertices (essentially word-based tokens).<br>
Default with noun and adjective categories.<br>
<br>
Any word that does not match the predefined categories will be removed<br>
based on its corresponding PoS tag.<br>
<br>
The best result was found with noun and adjective categories only in the original paper.<br>
:type stop_words: set[string {‘english’}] | None (default), Optional<br>
:param stop_words: remove stop words from the PoS-tagged context (token tuple list).<br>
Stop words are treated as noisy common/function words.<br>
Providing a list of stop words can improve vertex network connectivity<br>
and increase the weights of more meaningful words.<br>
:type lemma: bool<br>
:param lemma: whether to lemmatise the text<br>
:type weight_comb: str<br>
:param weight_comb: {'avg', 'norm_avg', 'log_norm_avg', 'gaussian_norm_avg', 'sum', 'norm_sum', 'log_norm_sum',<br>
'gaussian_norm_sum', 'max', 'norm_max', 'log_norm_max', 'gaussian_norm_max',<br>
'len_log_norm_max', 'len_log_norm_avg', 'len_log_norm_sum'}, default 'norm_max'<br>
The weight combination method for weighting multi-word candidate terms.<br>
<br>
- 'max' : maximum value of vertex weights<br>
- 'avg' : average of vertex weights<br>
- 'sum' : sum of vertex weights<br>
- 'norm_max' : MWT unit size normalisation of 'max' weight<br>
- 'norm_avg' : MWT unit size normalisation of 'avg' weight<br>
- 'norm_sum' : MWT unit size normalisation of 'sum' weight<br>
- 'log_norm_max' : logarithm based normalisation of 'max' weight<br>
- 'log_norm_avg' : logarithm based normalisation of 'avg' weight<br>
- 'log_norm_sum' : logarithm based normalisation of 'sum' weight<br>
- 'gaussian_norm_max' : gaussian normalisation of 'max' weight<br>
- 'gaussian_norm_avg' : gaussian normalisation of 'avg' weight<br>
- 'gaussian_norm_sum' : gaussian normalisation of 'sum' weight<br>
- 'len_log_norm_max': log2(|a| + 0.1) * 'max', adapted from the C-value (Frantzi, 2000) formula<br>
- 'len_log_norm_avg': log2(|a| + 0.1) * 'avg', adapted from the C-value (Frantzi, 2000) formula<br>
- 'len_log_norm_sum': log2(|a| + 0.1) * 'sum', adapted from the C-value (Frantzi, 2000) formula<br>
<br>
NOTE: the \*_norm_\* methods penalise/smooth terms longer than the default 5-token size<br>
to achieve a saturation level as term size grows<br>
:type mu: int, optional<br>
:param mu: mean value setting a center point (default 5), so that candidates near the central point are ranked higher.<br>
This parameter is only required and effective for normalisation-based MWT weighting methods<br>
:type export: bool<br>
:param export: True to export the result, else False<br>
:type export_format: string<br>
:param export_format: export file format. Supported options: "csv" | "json". Default: "csv"<br>
:type export_path: string<br>
:param export_path: file path where the result will be exported to<br>
:type encoding: string, required<br>
:param encoding: encoding of the text, default 'utf-8'<br>
:type workers: int<br>
:param workers: number of CPU cores to use; default 1<br>
:rtype: tuple [list[tuple[string,float]], dict[string, float]]<br>
:return: keywords: sorted keywords with weights along with Top T weighted vertices</tt></dd></dl>
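<p><tt>Continuing the input example above (a usage sketch; output omitted):<br>
<br>
>>> from jgtextrank import keywords_extraction_from_segmented_corpus<br>
>>> keywords, top_vertices = keywords_extraction_from_segmented_corpus(segmented_corpus_context, window=2, top_p=0.3)<br>
>>> keywords[:3]<br>
</tt></p>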
<dl><dt><a name="-keywords_extraction_from_tagged_corpus"><strong>keywords_extraction_from_tagged_corpus</strong></a>(tagged_corpus_context:List[List[Tuple[str, str]]], solver:str='pagerank', max_iter:int=100, tol:float=1e-06, window:int=2, top_p:float=0.3, top_t:Union[int, NoneType]=None, directed:bool=False, weighted:bool=False, conn_with_original_ctx:bool=True, syntactic_categories:Set[str]={'NNP', 'JJ', 'NNS', 'NN'}, stop_words:Set[str]=None, lemma:bool=False, weight_comb:str='norm_max', mu:int=5, export:bool=False, export_format:str='csv', export_path:str='', encoding:str='utf-8', workers:int=1) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]</dt><dd><tt>TextRank keywords extraction for pos tagged corpus context list<br>
<br>
This method allows to use external Part-of-Speech tagging, and any pre-defined keyword co-occurrence context criteria (e.g., sentence, or paragraph,<br>
or section, or a user-defined segment) and any pre-defined word segmentation<br>
<br>
:type tagged_corpus_context: list[list[tuple[string, string]]] or generator<br>
:param tagged_corpus_context: pre-tagged corpus in the form of (token, PoS tag) tuples<br>
:type solver: string, optional<br>
:param solver: {'pagerank', 'pagerank_numpy', 'pagerank_scipy', 'betweenness_centrality', 'degree_centrality',<br>
'hits', 'closeness_centrality', 'edge_betweenness_centrality', 'eigenvector_centrality',<br>
'katz_centrality', 'communicability_betweenness', 'current_flow_closeness', 'current_flow_betweenness',<br>
'edge_current_flow_betweenness', 'load_centrality', 'clustering_coefficient', 'TeRGraph',<br>
'coreness'}, default 'pagerank'<br>
Graph ranking algorithms supported in networkx to use for vertex ranking.<br>
<br>
- 'betweenness_centrality' computes the shortest-path betweenness centrality of a node<br>
- 'degree_centrality' computes the degree centrality for nodes.<br>
- 'hits' computes the HITS algorithm for a node. The average of the authority and hub values is used<br>
- 'closeness_centrality' computes closeness centrality for nodes.<br>
- 'edge_betweenness_centrality' computes betweenness centrality for edges.<br>
The maximum edge betweenness value among all possible edge pairs is adopted for each vertex<br>
- 'eigenvector_centrality' computes the eigenvector centrality for the co-occurrence graph.<br>
- 'katz_centrality' computes the Katz centrality for the nodes based on the centrality of its neighbors.<br>
- 'communicability_betweenness' computes subgraph communicability for all pairs of nodes<br>
- 'current_flow_closeness' computes current-flow closeness centrality for nodes.<br>
- 'current_flow_betweenness' computes current-flow betweenness centrality for nodes.<br>
- 'edge_current_flow_betweenness' computes current-flow betweenness centrality for edges.<br>
- 'load_centrality' computes edge load. This is an experimental algorithm in networkx<br>
that counts the number of shortest paths crossing each edge.<br>
- 'clustering_coefficient' computes the clustering coefficient for nodes. Only undirected graph is supported.<br>
- 'TeRGraph': computes the TeRGraph (Lossio-Ventura, 2014) weights for nodes.<br>
The solver requires a connected graph and isolated nodes will be set to 0.<br>
- 'coreness' (Batagelj & Zaversnik, 2003) measures how "deep" a node (word/phrase) is in the co-occurrence network.<br>
This indicates how strongly the node is connected to the network. The "deeper" a word, the more important it is.<br>
The metric is not suitable for ranking terms directly, but it has proved to be a useful feature for keywords extraction<br>
- 'neighborhood_size' computes the number of immediate neighbors of a node.<br>
This is a version of node degree that disregards self-loops<br>
<br>
Note: Centrality measures (such as "current flow betweenness", "current flow closeness", and "communicability_betweenness")<br>
do not support loosely connected graphs, and betweenness centrality measures cannot be computed on single isolated nodes.<br>
It is recommended to reconsider the graph construction method or increase the context window size to<br>
ensure a (strongly) connected graph.<br>
:type max_iter: int, optional<br>
:param max_iter: number of maximum iteration of pagerank, katz_centrality<br>
:type tol: float, optional, default 1.0e-6<br>
:param tol: Error tolerance used to check convergence, the value varies for specific solver<br>
:type window: int, required<br>
:param window: co-occurrence window size (default with forward and backward context). Default value: 2<br>
:type top_p: float, optional<br>
:param top_p: the top percentage of vertices retained for post-processing. Default: 1/3 of all vertices<br>
:type top_t: int|None(default), optional<br>
:param top_t: the top T vertices in the ranking are retained for post-processing<br>
:type directed: bool, required<br>
:param directed: directed or undirected graph; the best result was found with an undirected graph in the original paper. Default: False<br>
:type weighted: bool, required<br>
:param weighted: weighted or unweighted; weighted graphs are not supported yet. Default: False.<br>
The best result was found with an unweighted graph in the original paper<br>
:type conn_with_original_ctx: bool, optional<br>
:param conn_with_original_ctx: True to check connections between two vertices in the original context;<br>
otherwise, connections are checked in the filtered context.<br>
More vertex connections can be built if 'conn_with_original_ctx' is set to False<br>
:type syntactic_categories: set[string], required<br>
:param syntactic_categories: Syntactic categories (by default, Part-of-Speech (PoS) tags) are defined to<br>
filter accepted graph vertices (essentially word-based tokens).<br>
Default with noun and adjective categories.<br>
<br>
Any word that does not match the predefined categories will be removed<br>
based on its corresponding PoS tag.<br>
<br>
The best result was found with noun and adjective categories only in the original paper.<br>
:type stop_words: set[string {‘english’}] | None (default), Optional<br>
:param stop_words: remove stop words from the PoS-tagged context (token tuple list).<br>
Stop words are treated as noisy common/function words.<br>
Providing a list of stop words can improve vertex network connectivity<br>
and increase the weights of more meaningful words.<br>
:type lemma: bool<br>
:param lemma: whether to lemmatise the text<br>
:type weight_comb: str<br>
:param weight_comb: {'avg', 'norm_avg', 'log_norm_avg', 'gaussian_norm_avg', 'sum', 'norm_sum', 'log_norm_sum',<br>
'gaussian_norm_sum', 'max', 'norm_max', 'log_norm_max', 'gaussian_norm_max',<br>
'len_log_norm_max', 'len_log_norm_avg', 'len_log_norm_sum'}, default 'norm_max'<br>
The weight combination method for weighting multi-word candidate terms.<br>
<br>
- 'max' : maximum value of vertex weights<br>
- 'avg' : average of vertex weights<br>
- 'sum' : sum of vertex weights<br>
- 'norm_max' : MWT unit size normalisation of 'max' weight<br>
- 'norm_avg' : MWT unit size normalisation of 'avg' weight<br>
- 'norm_sum' : MWT unit size normalisation of 'sum' weight<br>
- 'log_norm_max' : logarithm based normalisation of 'max' weight<br>
- 'log_norm_avg' : logarithm based normalisation of 'avg' weight<br>
- 'log_norm_sum' : logarithm based normalisation of 'sum' weight<br>
- 'gaussian_norm_max' : gaussian normalisation of 'max' weight<br>
- 'gaussian_norm_avg' : gaussian normalisation of 'avg' weight<br>
- 'gaussian_norm_sum' : gaussian normalisation of 'sum' weight<br>
- 'len_log_norm_max': log2(|a| + 0.1) * 'max', adapted from the C-value (Frantzi, 2000) formula<br>
- 'len_log_norm_avg': log2(|a| + 0.1) * 'avg', adapted from the C-value (Frantzi, 2000) formula<br>
- 'len_log_norm_sum': log2(|a| + 0.1) * 'sum', adapted from the C-value (Frantzi, 2000) formula<br>
<br>
NOTE: the \*_norm_\* methods penalise/smooth terms longer than the default 5-token size<br>
to achieve a saturation level as term size grows<br>
:type mu: int, optional<br>
:param mu: mean value setting a center point (default 5), so that candidates near the central point are ranked higher.<br>
This parameter is only required and effective for normalisation-based MWT weighting methods<br>
:type export: bool<br>
:param export: True to export the result, else False<br>
:type export_format: string<br>
:param export_format: {'csv', 'json'}, default 'csv'<br>
export file format<br>
:type export_path: string<br>
:param export_path: file path where the result will be exported to<br>
:type encoding: string, required<br>
:param encoding: encoding of the export file, default 'utf-8'<br>
:type workers: int<br>
:param workers: number of CPU cores to use; default 1<br>
:rtype: tuple [list[tuple[string,float]], dict[string:float]]<br>
:return: keywords: sorted keywords with weights along with Top T weighted vertices</tt></dd></dl>
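<p><tt>A minimal usage sketch (the PoS tags below are hand-written Penn Treebank tags for illustration, not produced by this package):<br>
<br>
>>> from jgtextrank import keywords_extraction_from_tagged_corpus<br>
>>> tagged_corpus_context = [[("Compatibility", "NN"), ("of", "IN"), ("systems", "NNS"), ("of", "IN"), ("linear", "JJ"), ("constraints", "NNS")]]<br>
>>> keywords, top_vertices = keywords_extraction_from_tagged_corpus(tagged_corpus_context, top_p=1)<br>
</tt></p>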
<dl><dt><a name="-preprocessing"><strong>preprocessing</strong></a>(text:str, syntactic_categories:Set[str]={'NNP', 'JJ', 'NNS', 'NN'}, stop_words:Union[Set[str], NoneType]=None, lemma:bool=False) -> Generator[Tuple[List[str], List[Tuple[str, str]]], NoneType, NoneType]</dt><dd><tt>pre-processing pipeline: sentence splitting -> tokenisation -><br>
Part-of-Speech(PoS) tagging -> syntactic filtering (default with sentential context)<br>
<br>
Text segmentation: uses NLTK's recommended English word tokenizer (currently an improved :class:`.TreebankWordTokenizer`<br>
along with :class:`.PunktSentenceTokenizer`)<br>
<br>
PoS tagging: Use NLTK's currently recommended part of speech tagger ('taggers/averaged_perceptron_tagger/english.pickle')<br>
<br>
You can download both via<br>
<br>
>>> import nltk<br>
>>> nltk.download('punkt')<br>
>>> nltk.download('averaged_perceptron_tagger')<br>
<br>
:type text: string<br>
:param text: plain text<br>
:type syntactic_categories: Set [of string], required<br>
:param syntactic_categories: Default with noun and adjective categories.<br>
Syntactic categories (by default, Part-of-Speech (PoS) tags) are defined to<br>
filter accepted graph vertices (by default, word-based tokens as single syntactic units).<br>
<br>
Any word that does not match the predefined categories will be removed based on its corresponding PoS tag.<br>
<br>
The best result was found with noun and adjective categories only in the original paper.<br>
:type stop_words: Set of [string {‘english’}], or None (default), Optional<br>
:param stop_words: remove stop words from the PoS-tagged context (token tuple list).<br>
Stop words are treated as noisy common/function words.<br>
Providing a list of stop words can improve vertex network connectivity<br>
and increase the weights of more meaningful words.<br>
:type lemma: bool<br>
:param lemma: whether to lemmatise the text<br>
:rtype: generator (of tuples)<br>
:return: result: tuples of tokenised context text (sentence level by default)<br>
and the corresponding PoS-tagged context text filtered by the syntactic filter</tt></dd></dl>
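<p><tt>A minimal usage sketch (exact tags depend on the NLTK tagger; output omitted):<br>
<br>
>>> from jgtextrank import preprocessing<br>
>>> result = preprocessing("Compatibility of systems of linear constraints over the set of natural numbers.")<br>
>>> tokenised_sentence, filtered_tagged_tokens = next(result)  # one tuple per context (sentence by default)<br>
</tt></p>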
<dl><dt><a name="-preprocessing_tokenised_context"><strong>preprocessing_tokenised_context</strong></a>(tokenised_context:Union[Generator[List[str], NoneType, NoneType], List[List[str]]], syntactic_categories:Set[str]={'NNP', 'JJ', 'NNS', 'NN'}, stop_words:Union[Set[str], NoneType]=None, lemma:bool=False) -> Generator[Tuple[List[str], List[Tuple[str, str]]], NoneType, NoneType]</dt><dd><tt>pre-processing tokenised corpus context (recommend as sentences)<br>
<br>
pipeline: Part-of-Speech tagging -> syntactic filtering (default with sentential context)<br>
<br>
:type tokenised_context: generator or iterable object<br>
:param tokenised_context: generator of tokenised context(default with sentences)<br>
:type syntactic_categories: set [of string], required<br>
:param syntactic_categories: Default with noun and adjective categories.<br>
Syntactic categories (default as Part-Of-Speech(PoS) tags) are defined to<br>
filter accepted graph vertices (default with word-based tokens as single syntactic unit).<br>
<br>
Any word that does not match the predefined categories will be removed based on its corresponding PoS tag.<br>
<br>
The best result was found with noun and adjective categories only in the original paper.<br>
:type stop_words: set of [string {‘english’}], or None (default), Optional<br>
:param stop_words: remove stop words from the PoS-tagged context (token tuple list).<br>
Stop words are treated as noisy common/function words.<br>
Providing a list of stop words can improve vertex network connectivity<br>
and increase the weights of more meaningful words.<br>
:type lemma: bool<br>
:param lemma: whether to lemmatise the text<br>
:rtype: generator [of tuple]<br>
:return: pre-processed raw text tokens split by context and filtered text tokens split by context</tt></dd></dl>
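<p><tt>A minimal usage sketch with a pre-tokenised sentence (illustrative; output omitted):<br>
<br>
>>> from jgtextrank import preprocessing_tokenised_context<br>
>>> tokenised_sentences = [["Compatibility", "of", "systems", "of", "linear", "constraints"]]<br>
>>> preprocessed = preprocessing_tokenised_context(tokenised_sentences)<br>
>>> raw_tokens, filtered_tagged_tokens = next(preprocessed)<br>
</tt></p>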
</td></tr></table><p>
<table width="100%" cellspacing=0 cellpadding=2 border=0 summary="section">
<tr bgcolor="#55aa55">
<td colspan=3 valign=bottom> <br>
<font color="#ffffff" face="helvetica, arial"><big><strong>Data</strong></big></font></td></tr>
<tr><td bgcolor="#55aa55"><tt> </tt></td><td> </td>
<td width="100%"><strong>__all__</strong> = ['preprocessing', 'preprocessing_tokenised_context', 'build_cooccurrence_graph', 'keywords_extraction', 'keywords_extraction_from_segmented_corpus', 'keywords_extraction_from_tagged_corpus', 'keywords_extraction_from_corpus_directory', 'compute_TeRGraph', 'compute_neighborhood_size']</td></tr></table>
</body></html>