
Yitizi

Input a Chinese character. Output all the variant characters of it.
輸入一個漢字,輸出它的全部異體字。
输入一个汉字,输出它的全部异体字。

Usage

Python

pip install yitizi
>>> import yitizi
>>> yitizi.get('和')
['咊', '龢']

JavaScript (Node.js)

npm install yitizi
> const Yitizi = require('yitizi');
> Yitizi.get('和');
[ '咊', '龢' ]

JavaScript (browser)

<script src="https://cdn.jsdelivr.net/npm/yitizi@<version>"></script>
> Yitizi.get('和');
[ '咊', '龢' ]

Design

Connections between variant characters can be modeled as a graph with characters as vertices, where two characters are variants of each other if they are directly connected by an edge.

To reduce data redundancy, only a few types of basic connections are stored in the data tables under data/. The full graph, yitizi.json, is computed from them by running build/main.py.
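The graph model can be illustrated with a minimal sketch. It assumes the graph is held as a mapping from each character to the set of its directly connected characters. The helper names `make_graph` and `get` are hypothetical, the two edges are hand-written toy data, and the actual layout of yitizi.json may differ.

```python
from collections import defaultdict


def make_graph(edges):
    """Build an undirected graph: character -> set of directly connected characters."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def get(graph, char):
    """Return the variants of `char`: exactly its direct neighbours, sorted."""
    return sorted(graph.get(char, ()))


# Toy data only (assumption): two edges taken from the usage example above.
toy = make_graph([("和", "咊"), ("和", "龢")])
print(get(toy, "和"))  # ['咊', '龢']
```

Looking up a character never follows more than one edge, which is why the full graph has to be computed ahead of time rather than derived at query time.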

Basic connections

A basic connection between two variant characters is classified as one of three types: equivalent, intersecting, or simplification.

  • Equivalent "全等": Two characters are equivalent only if they are interchangeable in most texts without changing the meaning. When computing the full graph, this connection is treated as both symmetric and transitive, i.e.

    • If A is an equivalent variant of B, then B is an equivalent variant of A;
    • If A is an equivalent variant of B, and B is an equivalent variant of C, then A is an equivalent variant of C.
  • Intersecting "語義交疊": Two characters are intersecting variants if they are interchangeable in certain cases. This connection is also symmetric, but not necessarily transitive. Characters with intersecting variants are arranged in groups (rows in the data files), each group representing a specific meaning shared by the characters listed in it. A character can belong to multiple groups.

    Example: "閒" has two intersecting variants: "閑" and "間", listed in two groups:

    閒閑  # meaning "vacant"
    閒間  # meaning "in the middle"
    閑>闲  # simplified form (same below)
    間>间
    

    Then in the computed yitizi.json:

    • 閒 and 閑 (闲) are variants of each other;
    • 閒 and 間 (间) are variants of each other;
    • 閑 (闲) and 間 (间) are unrelated.

    [Figure: Example I-1]

    A more complex (though abstract) example:

    =AB  # "=" means equivalent variants
    ACD
    AEFG
    
    • A, B, C and D are variants of one another;
    • A, B, E, F and G are variants of one another;
    • No connections between C (or D) and E (or F/G).

    [Figure: Example I-2]

  • Simplification "簡體": A non-transitive and asymmetric connection. A simplified character is associated only with its traditional form.

    Example 1: "么" is 1) a simplified form of "麼", 2) an equivalent variant of "幺"; "麼" has an equivalent variant "麽", then:

    • 麼, 麽 and 么 are variants of one another;
    • 幺 and 么 are variants of each other;
    • 麼 or 麽 is unrelated to 幺.

    [Figure: Example S-1]

    Example 2: "苧" is 1) a simplified form of "薴", 2) a traditional form of "苎", then:

    • 苧 is a variant of 薴 and 苎;
    • 薴 and 苎 are unrelated.

    [Figure: Example S-2]

    Example 3: "芸" is a simplified form of "藝" (Japanese Shinjitai) and "蕓" (Chinese), and "艺" is also a simplified form of "藝" (Chinese), then:

    • 藝, 芸 and 艺 are variants of one another;
    • 蕓 and 芸 are variants of each other;
    • 藝 or 艺 is unrelated to 蕓.

    [Figure: Example S-3]
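One reading of the three connection types that is consistent with Examples I-2 and S-1 to S-3 can be sketched as follows. This is not the code in build/main.py: the function `build_graph`, its parameters, and the input shapes (plain strings for rows, pairs for simplifications) are assumptions made for illustration. The sketch also does not try to decide how simplified forms propagate through intersecting groups.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(equivalent=(), intersecting=(), simplification=()):
    """Return {char: set of variants} (hypothetical helper, not the project's API).

    equivalent:     strings, each listing characters equivalent to one another
    intersecting:   strings, each listing one intersecting group
    simplification: (traditional, simplified) pairs
    """
    # Union-find: "equivalent" is symmetric and transitive.
    parent = {}

    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for row in equivalent:
        for c in row[1:]:
            parent[find(c)] = find(row[0])

    classes = defaultdict(set)
    for c in list(parent):
        classes[find(c)].add(c)

    def eq_class(c):
        return classes.get(find(c), {c})

    graph = defaultdict(set)

    def connect_all(chars):
        for a, b in combinations(chars, 2):
            graph[a].add(b)
            graph[b].add(a)

    # 1. Every equivalence class is fully connected.
    for members in classes.values():
        connect_all(members)

    # 2. Each intersecting group, widened by its members' equivalents,
    #    is fully connected. Separate groups are not merged.
    for row in intersecting:
        connect_all(set().union(*(eq_class(c) for c in row)))

    # 3. A traditional character, its equivalents, and its simplified forms
    #    are fully connected. Nothing is propagated any further.
    simplified_of = defaultdict(set)
    for trad, simp in simplification:
        simplified_of[trad].add(simp)
    for trad, simps in simplified_of.items():
        connect_all(eq_class(trad) | simps)

    return graph


# Example I-2: "=AB", "ACD", "AEFG"
g = build_graph(equivalent=["AB"], intersecting=["ACD", "AEFG"])
print(sorted(g["A"]))  # ['B', 'C', 'D', 'E', 'F', 'G']
print("E" in g["C"])   # False

# Example S-1: 麼>么, 么 = 幺, 麼 = 麽
s1 = build_graph(equivalent=["幺么", "麼麽"], simplification=[("麼", "么")])
print("麽" in s1["么"], "幺" in s1["麼"])  # True False
```

Under this reading, step 3 is what keeps simplification non-transitive: two traditional characters that share a simplified form, such as 藝 and 蕓 sharing 芸, are each connected to it but never to each other.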

Data source

Note for developers

Before publishing a new release, replace every occurrence of the version string.
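A small helper can make this substitution less error-prone. The sketch below is hypothetical: the function name `replace_version` and the version numbers are placeholders, and it works on text only, so which files to run it over is left to the maintainer.

```python
import re


def replace_version(text, old, new):
    """Replace whole occurrences of version `old` with `new`.

    Returns (new_text, number_of_replacements). A match is skipped when it
    is only part of a longer version number, e.g. "0.1.2" inside "0.1.20".
    """
    pattern = re.compile(r"(?<![\d.])" + re.escape(old) + r"(?!\.?\d)")
    return pattern.subn(lambda m: new, text)


# Placeholder version numbers, for illustration only.
text, count = replace_version("npm/yitizi@0.1.2 and 0.1.20", "0.1.2", "0.1.3")
print(text, count)  # npm/yitizi@0.1.3 and 0.1.20 1
```

Checking the returned count against the number of occurrences expected is a cheap way to notice a missed or unexpected match.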