Alphabetic

A Python module for retrieving script types of writing systems including alphabets, abjads, abugidas, syllabaries, logographs, featurals as well as Latin script codes.

Description / Background

Alphabetic is a small project that was born out of the need to find out the alphabet of different languages for a private NLP project. Determining the alphabet (or other script types) of a language plays an important role in a variety of NLP tasks and can be used, for example, to classify the language of a given text, normalize it by removing noisy/random strings, apply fine-grained regex pattern matching, and more.

Core functionality in a nutshell: given a specific language, Alphabetic first translates its name internally into the corresponding ISO code (either ISO 639-2/3 or ISO 15924) and then outputs the corresponding script, which is categorized according to the writing systems listed in the following table (adapted from here):

| Writing system | Each symbol represents | Example |
|---|---|---|
| Abjad | Consonant | Arabic alphabet |
| Abugida | Consonant accompanied by a specific vowel; modifying symbols represent other vowels | Indian Devanagari |
| Alphabet | Consonant or vowel | Latin alphabet |
| Featural system | Distinctive feature of segment | Korean Hangul |
| Logographic | Word or morpheme as well as syllable | Chinese characters |
| Syllabary | Syllable | Japanese kana |

The distinction between the different script types is important in this respect and necessary in certain application scenarios, as otherwise it can lead to unexpected behavior. Perhaps you have already worked with the built-in string functions in Python? If so, you may have noticed the following questionable result:

print("伏伐休众优伙".isalpha())

# True

The answer True could be interpreted as meaning that the string, which is written in Chinese, is alphabetic. From a linguistic point of view, however, this is incorrect, as Chinese has no alphabet (its writing system is logographic). Conversely, isalpha() returns False for the following string, which is written in the Devanagari script, an abugida rather than an alphabet:

print("अमित".isalpha())
       
# False

For this and other use cases Alphabetic can be employed.
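
The two isalpha() results above follow from Unicode general categories rather than from any notion of script type. The following is a small illustration that uses only the standard library and is not part of Alphabetic: str.isalpha() accepts only characters in a letter category, and the Devanagari string contains a vowel sign that is classified as a combining mark.

```python
import unicodedata


def categories(text):
    """Return (character, Unicode general category) pairs for a string."""
    return [(ch, unicodedata.category(ch)) for ch in text]


# Chinese characters fall into "Lo" (Letter, other), so isalpha() is True.
print(categories("伏伐"))

# The vowel sign in the Devanagari string is "Mc" (Mark, spacing combining),
# which is not a letter category, so isalpha() is False for the whole string.
print(categories("अमित"))
```

In other words, isalpha() answers "are these all letters according to Unicode?", not "is this text written in an alphabet?".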

Installation

The easiest way to install Alphabetic is to use pip, where you can choose between PyPI and this repository:

  • pip install alphabetic
  • pip install git+https://github.com/Halvani/alphabetic.git

The latter will pull and install the latest commit from this repository as well as the required Python dependencies. Note that the repo is updated regularly, while PyPI packages are released less frequently (primarily after major bug fixes, refactoring, etc.).

Usage

A simple lookup of a language's script (e.g., alphabet) can be performed as follows:

from alphabetic import WritingSystem

ws = WritingSystem()
ws.by_language(ws.Language.Hawaiian)

# {"Hawaiian": ["A", "E", "H", "I", "K", "L", "M", "N", "O", "P", "U", "W", "a", "e", "h", "i", "k", "l", "m", "n", "o", "p", "u", "w", "ʻ"]}

By default, the output of by_language is a dictionary containing the name and the corresponding script of the selected language. To retrieve only the latter, use ws.by_language(ws.Language.Hawaiian, as_list=True). However, some languages such as Japanese have not one but multiple writing systems. In such a case, the output would look like this:

ws.by_language(ws.Language.Japanese)

# {"Japanese": {"Hiragana": ["あ", "い", ...], "Kanji": ["万", "丁", ...], "Katakana": ["ア", "イ", ...]}}
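
Because the result is either a dictionary of lists (one script) or a dictionary of dictionaries (several scripts), calling code may want to treat both shapes uniformly. The helper below is a hypothetical sketch and not part of Alphabetic; it operates on plain dictionaries shaped like the two outputs shown above, using abbreviated sample data.

```python
def flatten_scripts(result):
    """Collect all characters from a by_language-style dictionary into one list."""
    characters = []
    for scripts in result.values():
        if isinstance(scripts, dict):
            # Several writing systems, e.g. Hiragana, Kanji and Katakana.
            for script_characters in scripts.values():
                characters.extend(script_characters)
        else:
            # A single script given as a plain list of characters.
            characters.extend(scripts)
    return characters


# Abbreviated sample data shaped like the outputs shown above.
single = {"Hawaiian": ["A", "E", "H"]}
multiple = {
    "Japanese": {
        "Hiragana": ["あ", "い"],
        "Kanji": ["万", "丁"],
        "Katakana": ["ア", "イ"],
    }
}

print(flatten_scripts(single))
print(flatten_scripts(multiple))
```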

In case you want a pretty print of the output, use:

ws.pretty_print(ws.by_language(ws.Language.Dutch))

# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z

If the resulting script represents an alphabet, the result can be further filtered in terms of:

  • Letter Casing:
ws.pretty_print(ws.by_language(ws.Language.Bosnian))

# Ђ Ј Љ Њ Ћ Џ А Б В Г Д Е Ж З И К Л М Н О П Р С Т У Ф Х Ц Ч Ш а б в г д е ж з и к л м н о п р с т у ф х ц ч ш ђ ј љ њ ћ џ

ws.pretty_print(ws.by_language(ws.Language.Bosnian, letter_case=ws.LetterCase.Upper))

# Ђ Ј Љ Њ Ћ Џ А Б В Г Д Е Ж З И К Л М Н О П Р С Т У Ф Х Ц Ч Ш
  • Multigraphs:
ws.pretty_print(ws.by_language(ws.Language.Aleut))
      
# A B Ch D E F G H Hl Hm Hn Hng I J K L M N Ng O P Q R S T U Uu V W X X̂ Y Z a b ch d e f g h hl hm hn hng i j k l m n ng o p q r s t u uu v w x x̂ y z Á á Ĝ ĝ

ws.pretty_print(ws.by_language(ws.Language.Aleut, strip_multigraphs=True, multigraphs_size=ws.MultigraphSize.All))

# A B D E F G H I J K L M N O P Q R S T U V W X Y Z a b d e f g h i j k l m n o p q r s t u v w x y z Á á Ĝ ĝ
  • Diacritics:
ws.pretty_print(ws.by_language(ws.Language.Czech))
                
# A B C C h D E F G H I J K L M N O P Q R S T U V W X Y Z a b c c h d e f g h i j k l m n o p q r s t u v w x y z Á É Í Ó Ú Ý á é í ó ú ý Č č Ď ď Ě ě Ň ň Ř ř Š š Ť ť Ů ů Ž ž

ws.pretty_print(ws.by_language(ws.Language.Czech, strip_diacritics=True))

# A B C C h D E F G H I J K L M N O P Q R S T U V W X Y Z a b c c h d e f g h i j k l m n o p q r s t u v w x y z
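
To make the three filters more concrete, here is a rough standard-library sketch of how such filtering could work on a list of letters. It is an assumption about the general technique, not Alphabetic's actual implementation. Judging by the outputs above, letters with diacritics are dropped rather than reduced to their base letters, and the sketch mirrors that behavior. All function names are hypothetical.

```python
import unicodedata


def only_upper(letters):
    """Keep entries whose first character is upper case (covers 'Ch' as well)."""
    return [letter for letter in letters if letter[0].isupper()]


def without_multigraphs(letters):
    """Drop entries that consist of more than one code point."""
    return [letter for letter in letters if len(letter) == 1]


def has_diacritic(letter):
    """True if the canonical decomposition contains a combining mark."""
    decomposed = unicodedata.normalize("NFD", letter)
    return any(unicodedata.combining(ch) for ch in decomposed)


def without_diacritics(letters):
    """Drop entries that carry at least one diacritical mark."""
    return [letter for letter in letters if not has_diacritic(letter)]


# Small made-up sample; the last two entries are the precomposed letters Á and á.
sample = ["A", "Ch", "D", "a", "ch", "d", "\u00c1", "\u00e1"]

print(only_upper(sample))
print(without_multigraphs(sample))
print(without_diacritics(sample))
```

Note that a letter written as a base character plus a combining mark counts as two code points, so the length-based multigraph filter would remove it as well.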

For certain languages such as Chinese (simplified), which have a language code but no alphabet, a fallback strategy is used that maps the ISO 639-2 language code to an ISO 15924 script code (in this example, "chi" --> "Hans"). As a user, you do not have to handle this manually; simply request the language as usual:

ws.pretty_print(ws.by_language(ws.Language.Chinese_Simplified))

# 㑇 㑊 㕮 㘎 㙍 㙘 㙦 㛃 㛚 㛹 㟃 㠇 㠓 㤘 㥄 㧐 ...
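
The lookup with fallback described above can be pictured roughly as follows. This is a minimal sketch under stated assumptions: the dictionaries are hypothetical, heavily abbreviated stand-ins for the real data, and the function name lookup is invented for illustration. Only the language names, codes and sample characters are taken from this README.

```python
# Hypothetical, abbreviated stand-ins for the underlying data.
LANGUAGE_CODES = {"Hawaiian": "haw", "Chinese_Simplified": "chi"}  # name -> ISO 639-2/3
ALPHABETS = {"haw": ["A", "E", "H", "I"]}                          # keyed by ISO 639-2/3
SCRIPT_FALLBACK = {"chi": "Hans"}                                  # ISO 639-2 -> ISO 15924
SCRIPTS = {"Hans": ["㑇", "㑊", "㕮"]}                              # keyed by ISO 15924


def lookup(language):
    """Resolve a language name to its characters, falling back to a script code."""
    code = LANGUAGE_CODES[language]
    if code in ALPHABETS:
        return ALPHABETS[code]
    # No alphabet stored under the language code: use the script code instead.
    script_code = SCRIPT_FALLBACK[code]
    return SCRIPTS[script_code]


print(lookup("Hawaiian"))
print(lookup("Chinese_Simplified"))
```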

Another important use case is to check whether a given sequence of characters represents a specific script of a writing system. This can be achieved as follows:

ws.is_abjad("גדולים או בינוניים") # True
ws.is_alphabet("גדולים או בינוניים") # False

ws.is_alphabet("dobré ráno") # True
ws.is_abjad("dobré ráno") # False

ws.is_logographic("早上好") # True
ws.is_syllabary("早上好") # False

ws.is_abugida("ምልካም እድል") # True
ws.is_abjad("ምልካም እድል") # False

ws.is_featural("좋은 아침") # True
ws.is_logographic("좋은 아침") # False

ws.is_alphabet("დილა მშვიდობისა") # True
ws.is_abjad("დილა მშვიდობისა") # False
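
For intuition only, a comparable check can be approximated with the standard library by looking at Unicode character names. This is not how Alphabetic works internally (it relies on its own character data). The prefix-to-type mapping below is a small, assumed subset, and script_type is a hypothetical helper.

```python
import unicodedata

# Assumed mapping from the first word of a Unicode character name to a script type.
SCRIPT_TYPE_BY_NAME_PREFIX = {
    "HEBREW": "abjad",
    "ARABIC": "abjad",
    "LATIN": "alphabet",
    "GEORGIAN": "alphabet",
    "CYRILLIC": "alphabet",
    "GREEK": "alphabet",
    "ETHIOPIC": "abugida",
    "DEVANAGARI": "abugida",
    "HANGUL": "featural",
    "CJK": "logographic",
    "HIRAGANA": "syllabary",
    "KATAKANA": "syllabary",
}


def script_type(text):
    """Return the single script type shared by all non-space characters, else None."""
    found = set()
    for ch in unicodedata.normalize("NFC", text):
        if ch.isspace():
            continue
        # Characters without a name yield "", which maps to None below.
        prefix = unicodedata.name(ch, "").split(" ")[0]
        found.add(SCRIPT_TYPE_BY_NAME_PREFIX.get(prefix))
    return found.pop() if len(found) == 1 else None


print(script_type("dobré ráno"))
print(script_type("좋은 아침"))
print(script_type("早上好"))
```

A name-based heuristic like this is coarse: it cannot tell languages apart and treats punctuation or digits as "unknown", which is one reason a curated per-language character list is more dependable.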

Furthermore, you can also use Alphabetic to remove all characters from a given string that do not occur within the supported script types (abjads, abugidas, alphabets, etc.):

ws.strip_non_script_characters("SΓpλrώaσcσhεeςn!", ws.Language.German)
# "Sprachen" (languages)

ws.strip_non_script_characters("SΓpλrώaσcσhεeςn!", ws.Language.Greek)
# "Γλώσσες" (languages)

If no language is given, all characters of all supported script types are considered:

ws.strip_non_script_characters("#jüste BAD/good tösté X4567Y ßÜ משהו действует?!")
# Result: 'jüste BADgood tösté XY ßÜ משהו действует'
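
Conceptually, this kind of stripping is a membership filter against a set of allowed characters. The sketch below illustrates the idea with a simplified, assumed German alphabet. The names GERMAN_ALPHABET and keep_script_characters are hypothetical and not part of Alphabetic.

```python
import string

# Simplified stand-in for a German alphabet (an assumption for this example).
GERMAN_ALPHABET = set(string.ascii_letters + "ÄÖÜäöüß")


def keep_script_characters(text, alphabet, keep_whitespace=True):
    """Drop every character that is neither in the alphabet nor (optionally) whitespace."""
    return "".join(
        ch for ch in text
        if ch in alphabet or (keep_whitespace and ch.isspace())
    )


print(keep_script_characters("SΓpλrώaσcσhεeςn!", GERMAN_ALPHABET))  # Sprachen
```

Whitespace is kept by default here because the example output above preserves the spaces between words.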

If you wish, you can also list the characters of a language based on a specified Unicode range:

ws.generate_all_characters_in_range("\u0400-\u04FF") # Cyrillic block, used by Bulgarian among others

# ['Ѐ', 'Ё', 'Ђ', 'Ѓ', 'Є', ..., 'Ӽ', 'ӽ', 'Ӿ', 'ӿ']
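
Expanding such a range by hand takes only a few lines with chr and ord. The function below is a hypothetical sketch of the idea. Whether Alphabetic additionally filters unassigned or combining code points is not stated here, and the sketch does no such filtering.

```python
def characters_in_range(range_spec):
    """Expand a 'first-last' specification into all characters in between (inclusive)."""
    first, last = range_spec.split("-")
    return [chr(code_point) for code_point in range(ord(first), ord(last) + 1)]


cyrillic = characters_in_range("\u0400-\u04FF")
print(len(cyrillic))   # 256 code points in the Cyrillic block
print(cyrillic[:3])    # ['Ѐ', 'Ё', 'Ђ']
print(cyrillic[-2:])   # ['Ӿ', 'ӿ']
```

This simple parser assumes that neither endpoint is the hyphen character itself.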

Features

  • Currently 151 languages and corresponding scripts are supported, with more to follow over time;

  • In total, Alphabetic covers six script types: abjads, abugidas, alphabets, syllabaries, logographics as well as featurals;

  • Besides (true) writing systems, Alphabetic also offers Latin script representations (e.g., Morse or NATO Phonetic Alphabet);

  • Alphabetic includes a complete list of all ISO 639-1/2/3 as well as ISO 15924 codes and enables bidirectional translation between language names and codes;

  • At the heart of Alphabetic are JSON files that can be used independently of the respective programming language or application;

  • Consistently documented source code.
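
The bidirectional translation between language names and codes mentioned above can be pictured as two dictionaries, one derived from the other. This is an illustrative sketch only: the three entries are taken from the table of supported languages below, while translate is a hypothetical helper and not Alphabetic's API.

```python
# Three name -> ISO 639-2/3 pairs from the list of supported languages.
NAME_TO_CODE = {"Hawaiian": "haw", "Hebrew": "heb", "Korean": "kor"}

# The reverse direction is derived once, so both lookups stay in sync.
CODE_TO_NAME = {code: name for name, code in NAME_TO_CODE.items()}


def translate(value):
    """Return the code for a language name, or the name for a code."""
    if value in NAME_TO_CODE:
        return NAME_TO_CODE[value]
    if value in CODE_TO_NAME:
        return CODE_TO_NAME[value]
    raise KeyError(f"unknown language name or code: {value!r}")


print(translate("Hawaiian"))  # haw
print(translate("kor"))       # Korean
```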

Supported Languages

Language ISO 639-2/3 code
Abkhazian abk
Afar aar
Afrikaans afr
Albanian sqi
Aleut ale
Amharic amh
Angika anp
Arabic ara
Arapaho arp
Armenian arm
Assamese asm
Avar ava
Avestan ave
Balochi bal
Bambara bam
Bashkir bak
Basque baq
Bavarian bar
Belarusian bel
Bislama bis
Boko bqc
Boro brx
Bosnian bos
Breton bre
Bulgarian bul
Buryat bua
Catalan cat
Chamorro cha
Chechen che
Cherokee chr
Chichewa nya
Chinese_Simplified chi
Chukchi ckt
Chuvash chv
Cimbrian cim
Cornish cor
Corsican cos
Cree cre
Croatian hrv
Czech ces
Danish dan
Dungan dng
Dutch nld
Dzongkha dzo
Elfdalian ovd
English eng
Esperanto epo
Estonian est
Ewe ewe
Faroese fao
Fijian fij
Finnish fin
Flemish dut
French fra
Georgian kat
German deu
Greek gre
Guarani grn
Haitian_Creole hat
Hausa hau
Hawaiian haw
Hebrew heb
Herero her
Hindi hin
Icelandic isl
Igbo ibo
Indonesian ind
Irish gle
Istro_Romanian ruo
Italian ita
Japanese jpn
Javanese jav
Jeju jje
Kabardian kbd
Kanuri kau
Kashubian csb
Kazakh kaz
Kinyarwanda kin
Kirghiz kir
Komi kpv
Korean kor
Kumyk kum
Kurmanji kmr
Latin lat
Latvian lav
Lezghian lez
Lingala lin
Lithuanian lit
Luganda lug
Luxembourgish ltz
Macedonian mkd
Malagasy mlg
Malay may
Malayalam mal
Maltese mlt
Manx glv
Maori mao
Mari chm
Marshallese mah
Moksha mdf
Moldovan rum
Mongolian mon
Mru mro
Nepali nep
Norwegian nor
Occitan oci
Oromo orm
Osage osa
Parthian xpr
Pashto pus
Persian per
Phoenician phn
Polish pol
Portuguese por
Punjabi_Gurmukhī _pan
Punjabi_Shahmukhi pan
Quechua que
Rohingya rhg
Russian rus
Samaritan smp
Samoan smo
Sango sag
Sanskrit san
Scottish_Gaelic gla
Serbian srp
Slovak slo
Slovenian slv
Somali som
Sorani ckb
Spanish spa
Sundanese sun
Swedish swe
Swiss_German gsw
Tajik tgk
Tatar tat
Turkish tur
Turkmen tuk
Tuvan tyv
Twi twi
Ugaritic uga
Ukrainian ukr
Uzbek uzb
Venda ven
Vengo bav
Volapük vol
Welsh wel
Wolof wol
Yakut sah
Yiddish yid
Zeeuws zea
Zulu zul

Supported Abjads

Abjad ISO code
Arabic Arab
Balochi bal
Hebrew Hebr
Hebrew_Samaritan Samr
Parthian Prti
Pashto pus
Persian per
Phoenician Phnx
Punjabi_Shahmukhi pan
Sorani ckb
Ugaritic Ugar
Yiddish yid

Supported Abugidas

Abugida ISO code
Amharic amh
Angika anp
Assamese asm
Boro brx
Devanagari Deva
Dzongkha dzo
Ethiopic Ethi
Hindi hin
Malayalam Mlym
Nepali nep
Punjabi_Gurmukhī Guru
Sanskrit san
Sundanese Sund
Thaana Thaa

Supported Syllabaries

Syllabary ISO code
Avestan Avst
Carian Cari
Cherokee Cher
Hiragana Hira
Katakana Kana
Lydian Lydi

Supported Logographics

Logographic ISO code
Chinese_Simplified Hans
Kanji Hani

Supported featural writing systems

Featural ISO code
Hangul Hang

Design Considerations / Limitations

Once you delve deeper into the world of writing systems, the numerous difficulties that arise when organizing the various script types quickly become overwhelming. This is particularly true of non-Latin scripts with their many variants and forms. Therefore, several design decisions were made to keep Alphabetic as simple and usable as possible.

  • For languages that have several alphabet variants, the more modern or the most frequently encountered form was used. Sources such as Omniglot, Wikipedia and Britannica were consulted for this purpose.

  • For Arabic scripts where the character form depends on its position, the so-called isolated forms were used.

  • Multigraphs are considered part of the scripts. However, they can be suppressed if desired. The same applies to diacritical marks (e.g., acute, breve, cedilla, grave).

  • The function is_abugida is not fully functional because not all vowel forms are integrated yet.

  • For so-called non-bicameral languages such as Hebrew or Arabic, where there is no distinction between upper and lower case, the letter_case argument is ignored and the entire alphabet is returned instead:

ws.pretty_print(ws.by_language(ws.Language.Hebrew, letter_case=ws.LetterCase.Upper))
                                                   
# א ב ג ד ה ו ז ח ט י כ ך ל מ ם נ ן ס ע פ ף צ ץ ק ר ש ת

ws.pretty_print(ws.by_language(ws.Language.Arabic, letter_case=ws.LetterCase.Lower))

# ا ب ة ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي
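
Whether a script distinguishes letter case at all can be tested with Python's built-in case mappings. The following is a minimal sketch of the behavior described above, where a case filter is skipped for scripts without case. Both function names are hypothetical and the letter lists are abbreviated samples.

```python
def is_bicameral(letters):
    """True if at least one letter has distinct upper- and lower-case forms."""
    return any(letter.upper() != letter.lower() for letter in letters)


def filter_case(letters, upper=True):
    """Apply a case filter, but return everything for scripts without letter case."""
    if not is_bicameral(letters):
        return list(letters)
    if upper:
        return [letter for letter in letters if letter[0].isupper()]
    return [letter for letter in letters if letter[0].islower()]


hebrew = ["א", "ב", "ג"]        # no upper/lower distinction
dutch = ["A", "B", "a", "b"]    # bicameral

print(filter_case(hebrew, upper=True))   # all three letters are returned
print(filter_case(dutch, upper=True))    # ['A', 'B']
```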

Contribution

If you like this project, you are welcome to support it, e.g. by testing it or providing additional languages (there is a lot to do with regard to the remaining languages). Feel free to fork the repository and create a pull request to suggest and collaborate on changes.

Disclaimer

Although this project has been carried out with great care, no liability is accepted for the completeness and accuracy of all the underlying data. The use of Alphabetic for integration into production systems is at your own risk!

Furthermore, please note that this project is still in its initial phase. The code structure may therefore change over time.

Citation

If you find this repository helpful, feel free to cite it in your paper or project:

@software{Halvani_Constituent_Treelib:2024,
	author = {Halvani, Oren},
	title = {{Alphabetic - A Python module for retrieving script types of writing systems including alphabets, abjads, abugidas, syllabaries, logographs, featurals as well as Latin script codes}},
	doi = {https://doi.org/10.5281/zenodo.11580510},
	month = jun,	
	url = {https://github.com/Halvani/alphabetic},
	version = {0.0.5},
	year = {2024}
}

License

The Alphabetic package is released under the Apache-2.0 license. See LICENSE for further details.

Last Remarks

As is usual with open source projects, we developers do not earn any money with what we do, but are primarily interested in giving something back to the community with fun, passion and joy. Nevertheless, we would be very happy if you rewarded all the time that has gone into the project with just a small star 🤗