BERTopic on Chinese Language Education in South Korea

Abstract

After a long and intricate process of development, Chinese language education in South Korea has entered a new boom period in recent decades, marked by the emergence of numerous private foreign-language institutions. To understand the reasons and motivations behind this boom, this paper collected more than 10,000 Q&A records about Chinese learning and private language-training institutions posted on the NAVER Knowledge-in platform over the past decade, and presents an empirical study of those data using a recent topic-modelling technique called "BERTopic". After analyzing the differences in topics between local and imported private educational institutions, the paper finds that although the overall demand for Chinese language learning in Korean society is increasing, demand also differs markedly between groups. Therefore, to achieve sustainable development of Chinese education in South Korea, the basic concept of “localized development” should be taken into account.

Experiment & Results

Overview of the "BERTopic" workflow

(Figure: overview of the BERTopic process, by Maarten Grootendorst)

Data preparation

  • Using the double-restricted search method of “keyword + catalogue index” over the time range 2011.01.01 to 2021.04.01, the URLs of the targeted Q&A posts were collected automatically from the "NAVER Knowledge-in" platform;
  • From the 10,708 collected URLs, the necessary information (including "title", "question description", "answer description", "question time", "cumulative page views", etc.) was extracted or summarized;
  • All texts were then cleaned and reorganized. The cleaning process includes: deleting informal characters, spell-checking Korean expressions, and merging valid paragraphs (see the sketch after this list);
  • Finally, the cleaned data were saved as a dataset suitable for model training, in which each text is separately documented with a specific index based on its “keyword” (the name of the private language-training institution).
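
As a rough illustration of the cleaning step, the sketch below strips informal characters and merges paragraphs with simple regular expressions. The file and column names are assumptions for illustration; the actual cleaning scripts (including the Korean spell-checking pass) are not published here and may differ.

```python
import re

import pandas as pd


def clean_text(text: str) -> str:
    """Remove informal characters and collapse whitespace.

    A minimal sketch of the cleaning step described above, not the
    paper's exact procedure (its Korean spell-check pass is omitted).
    """
    # Keep Hangul syllables, Latin letters, digits, and basic punctuation.
    text = re.sub(r"[^\uAC00-\uD7A3a-zA-Z0-9 .,?!]", " ", text)
    # Merge paragraphs: collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical file name and column names; the real schema may differ.
df = pd.read_csv("naver_kin_qna.csv")
df["document"] = (df["title"] + " " + df["question"] + " " + df["answer"]).map(clean_text)
df.to_csv("cleaned_dataset.csv", index=False)
```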

Keyword Number of questions Number of page views Category of origin
Sisa (시사) 1408 1420551 Korean
Moon-Jeonga(문정아) 1798 1339087 Korean
Caihong(차이홍) 1575 1272821 Sino-Korean
HakSeupji(학습지) 2050 1222846 Korean
Hackers(해커스) 1765 1033599 Korean
Confucius Institute(공자학원/공자 아카데미) 673 537608 Sino-Korean
Pagoda(파고다) 506 453905 Korean
SiwonSchool(시원스쿨) 439 358808 Korean
YBM 268 315248 Korean
Gumon(구몬) 226 184430 Korean

Model training

  • Two branches
    • Topic modelling for the whole dataset: all 10,708 records (without distinguishing keywords) are fed into the “BERTopic” model for automatic topic classification.
    • Topic modelling for several sub-datasets: the data for 4 representative keywords are extracted separately as 4 sub-datasets and fed into the “BERTopic” model one by one for automatic topic classification (see the splitting sketch after this list).
  • Three steps
    • Parameter tuning: adjusting the parameters used in training the topic models, including the pre-trained language model for text embedding, the topic-merging method, the minimum topic size, the range of representative keywords, etc. (a training sketch follows the parameter table below).
    • Visualization: mapping the trained topic models into a two-dimensional space to facilitate further classification and explanation with human subjective judgment.
    • Keyword extraction: after training, extracting the 15 most significantly related keywords of each subdivided topic and listing them for the follow-up analysis.
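
A minimal sketch of how the two branches could be prepared, assuming the cleaned data sit in a pandas DataFrame with hypothetical "document" and "keyword" columns:

```python
import pandas as pd

df = pd.read_csv("cleaned_dataset.csv")  # hypothetical file name and schema

# Branch 1: the whole dataset, 10,708 documents with no keyword distinction.
all_docs = df["document"].tolist()

# Branch 2: one sub-dataset per representative keyword, modelled one by one.
sub_datasets = {kw: grp["document"].tolist() for kw, grp in df.groupby("keyword")}
```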

Stage Main functions Key parameter Value
1 Parameter setup BERTopic() & fit_transform() embedding_model "distiluse-base-multilingual-cased-v1"
1 Parameter setup  nr_topics "auto"
1 Parameter setup  top_n_words 15
1 Parameter setup  min_topic_size 50 (total) / 20 (separate)
1 Parameter setup  n_gram_range (1, 1)
2 Topic visualization visualize_topics() & visualize_topics_over_time() top_n number of all topics
2 Topic visualization  topics depending on needs
3 Keyword extraction get_topic_info() & get_topics()
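
The training setup can be reproduced roughly as follows, using the parameter values from the table above; the exact BERTopic API may vary slightly depending on the library version used at the time.

```python
from bertopic import BERTopic

# Step 1: parameter setup, with the values documented above
# (min_topic_size=50 for the full dataset, 20 for each sub-dataset).
topic_model = BERTopic(
    embedding_model="distiluse-base-multilingual-cased-v1",
    nr_topics="auto",      # merge similar topics automatically
    top_n_words=15,        # keep 15 representative keywords per topic
    min_topic_size=50,
    n_gram_range=(1, 1),   # unigrams only
)
topics, probs = topic_model.fit_transform(all_docs)

# Step 2: map the topics into two-dimensional space for visual inspection.
fig = topic_model.visualize_topics()

# Step 3: extract topic summaries and per-topic keyword lists.
topic_info = topic_model.get_topic_info()
all_keywords = topic_model.get_topics()
```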

Visualization of results

  • Topics of Total
  • Topics of Hackers
  • Topics of Moon-Jeonga (Munza)
  • Topics of Caihong
  • Topics of Confucius Institute (Kongzi)
  • Topics of Sisa
  • Topics of HakSeupji (Haksp)
  • Topics over time

For more details, see the plot list below.
Plot name Type Link
Total (original) html Total
Hackers (original) html Hackers
Munza (original) html Munza
Caihong (original) html Caihong
Kongzi (original) html Kongzi
Sisa (original) html Sisa
Haksp (original) html Haksp
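
Since BERTopic's visualize_* functions return Plotly figures, the HTML plots listed above can be produced with write_html(). The timestamp column name and the top_n_topics value below are assumptions for illustration:

```python
# Save the intertopic distance map as a standalone HTML file.
topic_model.visualize_topics().write_html("total.html")

# Topics over time: timestamps taken from the "question time" field
# (column name assumed). Note: some older BERTopic releases also expect
# the `topics` list as an argument to topics_over_time(); check the
# installed version.
timestamps = df["question_time"].tolist()
topics_over_time = topic_model.topics_over_time(all_docs, timestamps)
fig = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
fig.write_html("topics_over_time.html")
```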

Citation

This repository is the database for the paper: "A Practical Text-Mining Research on Representative Institutions of “CLE” in South Korea with Naver Knowledge-in Q&A Data with BERTopic".
