Hi! In this notebook, we start our text mining journey by scraping a list of news articles about the coronavirus from tirto.id and detik.com using the BeautifulSoup package. The contents are saved to individual .tsv (tab-separated values) files, which are loaded again for further analysis. From there, we analyze the posting pattern of each site and train a Word2Vec model using the gensim package to analyze the semantic and syntactic similarity between the preprocessed words.
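
As a rough illustration of the save-and-reload step, here is a minimal sketch that writes scraped articles to tab-separated text and reads them back using only the standard library. The column names (`url`, `date`, `title`, `content`) and the helper names `write_articles` and `read_articles` are assumptions made for this example, not the notebook's actual schema.

```python
import csv
import io

# Hypothetical column layout; the real notebook may store different fields.
FIELDS = ["url", "date", "title", "content"]


def write_articles(handle, articles):
    """Write article dicts as tab-separated rows with a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for article in articles:
        # Collapse tabs and newlines inside the text so each article
        # stays on a single row of the file.
        row = {key: " ".join(str(article.get(key, "")).split()) for key in FIELDS}
        writer.writerow(row)


def read_articles(handle):
    """Load the tab-separated rows back into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Made-up sample data, standing in for a scraped article.
sample = [
    {
        "url": "https://example.com/a",
        "date": "2020-03-02",
        "title": "Sample title",
        "content": "First line.\nSecond\tline.",
    }
]

# An in-memory buffer stands in for a .tsv file on disk.
buffer = io.StringIO()
write_articles(buffer, sample)
buffer.seek(0)
loaded = read_articles(buffer)
```

With a real file, the same helpers can be used with `open(path, "w", newline="", encoding="utf-8")` for writing and `open(path, newline="", encoding="utf-8")` for reading.
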
- https://github.com/har07/PySastrawi
- https://github.com/pebbie/pebahasa/blob/master/indonesian
- https://github.com/aliakbars/bilp/blob/master/stoplist
- http://web.archive.org/web/20120608052057/http://fpmipa.upi.edu/staff/yudi/stop_words_list.txt