Breakdown of Unicode CJK characters into pieces by name.

TimToady/cjkrads

    Copyright 2019, Larry Wall

    Unless otherwise noted, everything in this directory may be distributed
    under the terms of the Artistic License, version 2.0 or later.

    This project is an attempt to decompose all CJK characters into named
    radicals.  Some of the names are standard radical names, some are the names
    of other characters, and some are completely made up.  The goal is to have
    a system of mnemonics, not to try for a traditional radical naming scheme.

    Nevertheless, where practical I've tried to use the standard radical
    names and meanings.  One problem is that there are multiple standards to
    pick from here.  I originally started with just the Japanese characters,
    so there may be a slight residual bias toward Japanese meanings, but
    when I have a choice between a radical name that is a purely Japanese
    interpretation and one that also relates to the semantic space of the
    Chinese borrowing, I've opted for the Chinese-related meaning, even if
    it's not the usual Japanese gloss (or Chinese gloss) for that particular
    character or radical.

    For similar reasons, some of my names are not the modern meaning in
    any CJKV language, but can be mnemonically related via the etymology.
    For instance, the original "clam" symbol is called MORNING in KangXi,
    but means "dragon" in Chinese (and zodiacal Japanese).  But many of the
    derived meanings apparently relate to the idea of a clam shell opening
    up, like a pair of lips, or the daytime sky opening up like a clamshell.
    So "clam" it is.

    For infrequently used names I've often just run the component part names
    together--the goal in such cases is merely to get a unique name, not to
    be particularly meaningful.

    Since the names are picked primarily for their mnemonic value, it should
    never be assumed that the particular word I've chosen is somehow the one
    and only "real" meaning...though of course I've tried to come close to
    the core meaning if there is one.

    In particular, some of the mnemonics chosen here may make more sense to
    an English speaker than to a Chinese speaker.  A speaker of Chinese is
    likely better off remembering characters based on the phonetic radicals,
    since Chinese characters are partially phonetic and partially logographic.
    Our treatment here is mostly logographic, which is just as much of a
    distortion as treating Chinese as mostly phonetic.  As usual the truth
    is somewhere in the middle.

    In any case, the decompositions in here largely take the approach:
    "This piece LOOKS like that so we'll pretend that's what it really is."
    Please, please do not assign any etymological significance to accidental
    lookalikes.  Chinese is full of symbols that have fallen together or
    fallen apart over time, either accidentally or intentionally, and every
    honest scholar of Chinese admits that there is a great deal of guesswork
    in many of the derivations, especially in some of the more obscure corners
    of the orthography.  As a vague approximation for some of that scholarly
    work, this project is even less reliable.

    PLEASE DO NOT TAKE ME FOR A CHINESE SCHOLAR.

    Also, be aware that Unicode characters vary in font genericity; the
    plane 0 characters often list several variants construed to have the
    same meaning, while the individual variants may be found (sometimes)
    in the plane 2 extensions.  Another difference you should be aware of
    is that the planes tend to count strokes differently.  (This is because
    the base plane contains unifications of characters that can be written
    three or four different ways, while the extensions mostly contain
    characters written one way, so you can talk about the variants as
    different characters.)
    
    For instance, the radical for plants, 艹, is counted as six
    strokes in the base plane (because it's a variant of 艸), but
    as four strokes in the extensions, whether written as 艹 or 艹.
    In general, count yourself lucky if you come within a stroke or two
    of the stroke count listed here.
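
    As a small illustration of the plane distinction above, the plane of a
    character can be read directly off its codepoint.  This is only a sketch
    in Python; the function name is made up and nothing here is part of the
    project's own tooling.

```python
def plane(ch):
    """Return the Unicode plane number of a single character.

    Each plane holds 0x10000 codepoints, so the plane is the codepoint
    shifted right by 16 bits.
    """
    return ord(ch) >> 16


# Plane 0 is the base plane; plane 2 holds the supplementary ideographs.
for ch in ["太", "𠈌", "𠼡"]:
    print(ch, "U+%04X" % ord(ch), "plane", plane(ch))
```

    A lookup tool could use this to decide which stroke-counting convention
    is likely to apply before comparing counts.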

    Finally, there are plenty of mistakes in here, due to bad fonts, bad
    eyesight, or just bad thinking on the part of the author.  Sucks to
    be us...

    When one exists, the simplified character gets a lowercase name, while
    the corresponding traditional character gets an uppercase name.

    A variant or alternate form of the traditional character will usually be
    prefixed with a lowercase v (followed by a digit in the case of multiple
    variants, but note that we just use "v" to mean the "v1" variant).
    No meaning should be attributed to the digit.  There's no system here, except
    that the higher digits tend to be rarer so I came to them later.  Some of my
    early variants use '-ish' or '-y' suffixes, for example, the various
    variants of "Good", "good", "goodish", "goody", in parallel with "Whoa", etc.

    These are only general rules, so don't expect complete consistency.
    In particular, lowercase is often used for traditional forms that have no
    simplified form.  Contrariwise, some lower/uppercase names overlap because
    I wanted to use the same name for two entirely different characters,
    or for two minor variants of the same character.  There just aren't that
    many good short words in English.  (For similar reasons I don't apologize
    for using some of the earthier short words supplied by English. :)

    In the cases where we have named a phonetic syllable after its sound,
    it is typically uppercased: Mang, Li, Xi, etc.
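
    The naming conventions above can be roughly sketched as a classifier.
    This is a guess at a reading aid, not a rule the data is guaranteed to
    follow: it assumes the "v" prefix is attached directly to an uppercase
    name (as in a hypothetical "v2Good"), and the function name and category
    labels are invented for illustration.

```python
import re

# Assumed shape of a variant name: lowercase v, optional digits, then the name.
VARIANT = re.compile(r"^v(\d*)(.+)$")


def guess_kind(name):
    """Guess what a name's spelling suggests; returns (kind, variant_number, base)."""
    m = VARIANT.match(name)
    if m and m.group(2)[0].isupper():
        # A bare "v" is treated as the "v1" variant.
        number = int(m.group(1)) if m.group(1) else 1
        return ("variant", number, m.group(2))
    if name[0].isupper():
        # Uppercase covers traditional forms and sound-named syllables alike.
        return ("traditional-or-phonetic", None, name)
    # Lowercase covers simplified forms and traditional forms with no simplification.
    return ("simplified-or-unpaired", None, name)


for example in ["Good", "good", "vGood", "v2Good", "Mang"]:
    print(example, guess_kind(example))
```

    Because these are only general rules, the result is a hint at best; the
    "-ish" and "-y" style variants, for instance, are not detected at all.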

    If a character has the special codepoint 0000 with ? for a grapheme, it
    means that Unicode does not (yet?) contain this pattern as a separate
    character, so it must be described indirectly.  In general, I haven't
    added these pseudo-characters unless the patterns occur four or five
    times within other characters, with a slight bias to adding a pattern
    sooner if it seems to have some etymological significance as a unit
    of meaning, and later if it just seems to be accidental juxtaposition.
    (In that sense, it's a bit like the distinction between constellations
    that are star clusters vs those that are mere asterisms.)
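
    The rule of thumb above (add a pseudo-character once a pattern turns up
    four or five times) could be approximated by a simple count.  The sketch
    below assumes decompositions are available as a mapping from a character
    to a list of part names and that the codepoint is stored as a string of
    hex digits; both are assumptions, as are all the names used.

```python
from collections import Counter


def is_pseudo(codepoint, grapheme):
    """True for the placeholder entry: codepoint 0000 with ? as the grapheme."""
    return codepoint == "0000" and grapheme == "?"


def frequent_patterns(decompositions, threshold=4):
    """Return part names that occur in at least `threshold` characters."""
    counts = Counter()
    for parts in decompositions.values():
        # Count each part once per character, however often it repeats inside it.
        counts.update(set(parts))
    return sorted(p for p, n in counts.items() if n >= threshold)


# Made-up data: "p" occurs in four characters, "q" in only three.
sample = {"a": ["p", "q"], "b": ["p"], "c": ["p", "q"], "d": ["p"], "e": ["q", "q"]}
print(frequent_patterns(sample))
```

    A raw count cannot capture the bias toward patterns with etymological
    significance; that part remains a judgment call.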

    To describe the location of each subradical, we use suffixes after
    a dot.

    Basic positions:

           ⿰/⿲           ⿱/⿳
        l  left         t  top
        c  center       m  middle
        r  right        b  bottom

           ⿴/⿵/⿶/⿷/⿸/⿹/⿺
        o  outside (can have multiple insertions, 𠼡=bow.o126)
        i  inside (only one direction, usually, 𠼡=sigh.i1,work.i2,work.i6)

           Direction of insertion is 0..7 for 45° rotations clockwise:
            0  ⿵ n     太=big.o0  處=Tiger.o0
            1  ⿹ ne    㢧=bow.o1
            2     e     𢎛=bow.o2
            3     se    头=big.o3
            4  ⿶ s     凷=bin.o4
            5  ⿺ sw    㲌=fur.o5  彪=Tiger.o5
            6  ⿷ w     匛=box.o6  𧆬=Tiger.o6
            7  ⿸ nw    在=dam.o7  𧆸=Tiger.o7

            8  ⿴ completely surrounds inner glyph (8 sides)  㘞=closure.o8
            9  ⿴ has internal extension ticks, lines, marks  丼=well.o9

           ⿻
        x  overlapped/shared without obvious containment 甴=mouth.x,tack.x
           (note that c/C and m/M below already imply some overlap)
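
    The position suffixes above can be decoded mechanically.  What follows
    is a minimal sketch in Python, assuming each component is written as
    name.suffix exactly as in the examples; the function and table names are
    invented here and are not part of the project.

```python
# Illustrative only: these names are not part of the project.
COMPASS = ["n", "ne", "e", "se", "s", "sw", "w", "nw"]

# Ideographic description characters listed beside each direction digit above.
# Digits 2 and 3 have no counterpart in the table.
IDC = {0: "⿵", 1: "⿹", 4: "⿶", 5: "⿺", 6: "⿷", 7: "⿸", 8: "⿴", 9: "⿴"}

LETTERS = {
    "l": "left", "c": "center", "r": "right",
    "t": "top", "m": "middle", "b": "bottom",
    "x": "overlapped",
}


def split_component(token):
    """Split 'big.o0' into ('big', 'o0'); a token without a dot has no suffix."""
    name, _, suffix = token.partition(".")
    return name, suffix


def describe_suffix(suffix):
    """Spell out a suffix such as 'o126', 'i2', 'tc' or 'x'."""
    if not suffix:
        return ""
    kind, rest = suffix[0], suffix[1:]
    if kind in ("o", "i") and (rest == "" or rest.isdigit()):
        role = "outside" if kind == "o" else "inside"
        where = []
        for d in rest:
            n = int(d)
            if n < 8:
                where.append(COMPASS[n])       # 0..7: 45-degree steps clockwise
            elif n == 8:
                where.append("all sides")      # completely surrounds
            else:
                where.append("internal marks")  # 9: internal extension ticks
        return " ".join([role] + where)
    # Anything else is read letter by letter; unknown letters pass through.
    return " ".join(LETTERS.get(ch, ch) for ch in suffix)


for token in ["bow.o126", "sigh.i1", "work.i6", "Tiger.o5", "guy.tc"]:
    name, suffix = split_component(token)
    print(name, "->", describe_suffix(suffix))
```

    For 𠼡, this reads bow.o126 as an outer piece with insertions at ne, e
    and w, matching the three inner pieces sigh.i1, work.i2 and work.i6.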

    Spreaders:
        C  anti-c, that is, split l+r (or wider than the c thing)  粥=2bow.C
        M  anti-m, that is, split t+b (or taller than the m thing) 呂=2mouth.M
        V  Vertical, 3 or more in a stack, usually nothing inside  ☰ =3h.V
        H  Horizontal, 3 or more in a row, usually nothing inside  Ⅲ =3v.H

        T  triple   众 is 1*guy.tc 2*guy.bC
        Q  quad     𠈌 is 4*guy.CM or 4*guy.MC in a square
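
    The examples above put a replication count in front of the name, either
    directly (2bow.C) or with an asterisk (2*guy.bC).  A hedged sketch of
    parsing that, assuming names never begin with a digit and that a missing
    count means one copy; the regular expression and function names are
    illustrative only.

```python
import re

# Optional count (with optional "*"), then a name, then an optional .suffix.
TOKEN = re.compile(r"^(?:(\d+)\*?)?([^.\d][^.]*)(?:\.(.+))?$")


def parse_token(token):
    """Return (count, name, suffix) for tokens like '2bow.C' or '1*guy.tc'."""
    m = TOKEN.match(token)
    if m is None:
        raise ValueError("unrecognized component: %r" % token)
    count, name, suffix = m.groups()
    return (int(count) if count else 1, name, suffix or "")


def total_copies(description, name):
    """Add up the counts of every whitespace-separated token with this name."""
    total = 0
    for token in description.split():
        count, token_name, _ = parse_token(token)
        if token_name == name:
            total += count
    return total


print(parse_token("2bow.C"))
print(total_copies("1*guy.tc 2*guy.bC", "guy"))
```

    Summing the counts for 众 gives three copies of guy, which is what the
    T (triple) label describes.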

    Generally, if it's ambiguous whether to describe crossings vertically
    or horizontally, I've chosen to use c/C to describe things in the
    horizontal dimension rather than m/M in the vertical dimension.  But
    sometimes I do both...

    Approximators:
        ~R    Rotation (replication implies rotational symmetry, e.g. 𦮙.b)
                May be annotated with 45° rotations 0..7 clockwise
        ~F    Flipped (replication implies mirror symmetry)
                May be annotated with location/axis of flip
        ~S    Stolen from or shared with another piece
                May be annotated with location of piece shared
        ~oid  A descriptive subshape that isn't really a subradical
                (a 止 contains 上 but they are really different shapes)
        ~like A miscellaneous could-be-confused-with lookalike
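
    The approximators can be separated from a name in much the same way.
    This sketch assumes the marker is attached directly after the name with
    a tilde, and that anything following the marker is its annotation; how
    it combines with a dotted position suffix is not specified above, so
    that case is left out.  All names in the code are hypothetical.

```python
MARKERS = {
    "R": "rotated",
    "F": "flipped",
    "S": "stolen or shared",
    "oid": "descriptive subshape",
    "like": "lookalike",
}


def split_approximator(token):
    """Return (base, marker, annotation); marker is None when there is no known ~marker."""
    base, sep, tail = token.partition("~")
    if not sep:
        return base, None, ""
    # Try longer markers first so a longer marker is never shadowed by a shorter one.
    for marker in sorted(MARKERS, key=len, reverse=True):
        if tail.startswith(marker):
            return base, marker, tail[len(marker):]
    return base, None, tail


for token in ["up~oid", "bow~R2", "bow"]:
    base, marker, note = split_approximator(token)
    print(base, MARKERS.get(marker, "exact"), note)
```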

    An isolated single slash (/) gives a vague idea of alternation.
    Note that the first alternative is usually going to be the "correct"
    one, and subsequent alternatives are just other ways it would be
    possible to decompose the character, in case someone looks them up
    that way.
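
    Splitting a description on isolated single slashes might look like the
    sketch below.  It assumes tokens are separated by whitespace, so that
    "isolated" means the slash stands alone as a token; the sample part
    names are invented.

```python
def split_alternatives(description):
    """Split on stand-alone "/" tokens; the first list is the preferred reading."""
    alternatives, current = [], []
    for token in description.split():
        if token == "/":
            alternatives.append(current)
            current = []
        else:
            # A stand-alone "//" is a different marker and stays in place.
            current.append(token)
    alternatives.append(current)
    return alternatives


print(split_alternatives("sun.l moon.r / bright.x"))
```

    A search index could register every alternative while displaying only
    the first.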

    An isolated double slash (//) is used mostly on radicals to indicate
    that the subparts are merely descriptive of shapes, and should not
    be recursively expanded for search purposes.  Because they are not
    recursively expanded, we use a bit more redundancy of description
    for the subparts that people might search on.  For instance, we
    might put both 电 "electric" and ⺃ "hook", even though electric
    would expand to hook if we did that, which we don't, so there.
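
    The effect of the double slash on searching can be sketched as a
    recursive expansion that stops at shape-only entries.  The table layout,
    the "shape_only" flag, and the entries "spark" and "charge" are all
    invented for illustration; only "electric" and "hook" come from the
    example above.

```python
def searchable_parts(name, table, seen=None):
    """Collect every part name a search on `name` should match."""
    seen = set() if seen is None else seen
    entry = table.get(name)
    if entry is None or name in seen:
        return set()
    seen.add(name)  # guard against accidental cycles in the table
    found = set()
    for part in entry["parts"]:
        found.add(part)
        if not entry["shape_only"]:
            # Ordinary entries expand recursively; shape-only ones do not.
            found |= searchable_parts(part, table, seen)
    return found


table = {
    "hook": {"parts": [], "shape_only": False},
    "electric": {"parts": ["hook"], "shape_only": False},
    "spark": {"parts": ["electric"], "shape_only": True},    # described with //
    "charge": {"parts": ["electric"], "shape_only": False},
}

print(sorted(searchable_parts("spark", table)))
print(sorted(searchable_parts("charge", table)))
```

    In this made-up table a search for hook would miss "spark" unless hook
    were listed explicitly, which is the redundancy described above.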
