Yuki Yada, Jiaying Feng, Tsuneo Matsumoto, Nao Fukushima, Fuyuko Kido, Hayato Yamana
This repository provides the text-based dataset and experimental code for the paper: Dark patterns in e-commerce: a dataset and its baseline evaluations
Accepted at IEEE BigData 2022, 5th Workshop on Big Data for CyberSecurity (BigCyber-2022)
Dark patterns are user interface designs on online services that are designed to make users behave in ways they do not intend.
In this work, we construct a text-based dataset for dark pattern automatic detection on e-commerce sites and its detection using machine learning as a baseline.
For baseline detection, we applied following two methods:
- Classical NLP Methods ( Bag-of-Words + Classical Machine Learning Model )
- Transformer-based pre-trained language models ( e.g. BERT )
For more information, please check our Paper.
dataset/
: dataset.tsv in this directory is the dataset for text-based dark patterns automatic detection.darkpattern-auto-detection-classical/
: Experimental code of baseline evaluation using classical NLP methods.darkpattern-auto-detection-deeplearning/
: Experimental code of baseline evaluation using transformer-based pre-trained language models.scraping/
: Code for collecting non-dark pattern texts in the dataset.
dataset/dataset.tsv is a text-based dataset for dark pattern automatic detection (TSV Format). Dark pattern texts were obtained from Mathur et al.’s study in 2019, which consists of 1,818 dark pattern texts from shopping sites. Then, we collect non-dark pattern texts on e-commerce sites by accessing and segmenting the sites targeted by the Mathur et al.'s study.
Scraping code for non-dark pattern texts is on scraping/. That is implemented using Typscript (Javascript) and Puppeteer.
We conduct experiment of dark pattern auto detection using the dataset. The code is on darkpattern-auto-detection-classical/ (Classical NLP Methods) and darkpattern-auto-detection-deeplearning/ (Transformer-based pre-trained language models).
Model | Accuracy | AUC | F1 score | Precision | Recall |
---|---|---|---|---|---|
Logistic Regression | 0.961 | 0.989 | 0.960 | 0.981 | 0.940 |
SVM | 0.954 | 0.987 | 0.952 | 0.986 | 0.922 |
Random Forest | 0.958 | 0.989 | 0.957 | 0.984 | 0.932 |
Gradient Boosting | 0.962 | 0.989 | 0.961 | 0.976 | 0.947 |
Model | Accuracy | AUC | F1 score | Precision | Recall |
---|---|---|---|---|---|
0.972 | 0.993 | 0.971 | 0.982 | 0.961 | |
0.965 | 0.993 | 0.965 | 0.973 | 0.957 | |
0.966 | 0.993 | 0.966 | 0.979 | 0.954 | |
0.959 | 0.991 | 0.959 | 0.972 | 0.946 | |
0.965 | 0.986 | 0.965 | 0.973 | 0.957 | |
0.966 | 0.992 | 0.966 | 0.975 | 0.958 | |
0.942 | 0.988 | 0.940 | 0.968 | 0.914 |
This project is based on the Mathur et al.’s study and its dataset. We thank their authors for making the source code publically available.
- We extract text data from "Pattern String" column in dark-patterns.csv.
- We ignored missing and duplicate text data.