Human-written Text vs ChatGPT-generated Text

This is a novel Transformer-based approach to distinguishing ChatGPT-generated text from human-written text. The model was also deployed on a local server using Flask, with Docker used to manage all the dependencies.

Author - Omkar Nitsure

Email - [email protected]
GitHub profile - https://github.com/omkarnitsureiitb

I used the publicly available OpenGPTText dataset, which you can get here. I designed and implemented a Text-to-Text Transformer (T5) model for text classification; you can take a look at the model architecture in the paper. The model performed very well, reaching an accuracy of 85% alongside strong scores on the other evaluation metrics. Furthermore, I deployed the model using Flask on localhost to provide a better interface for performing the text classification.
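A minimal sketch of what such a Flask endpoint could look like is shown below. The `classify` helper is a hypothetical placeholder; the actual model-loading and preprocessing code in this repository is not reproduced here.

```python
# Minimal Flask serving sketch (illustrative; `classify` is a placeholder
# for the real preprocessing + Transformer inference code).
from flask import Flask, request, jsonify

app = Flask(__name__)

def classify(text: str) -> float:
    # Placeholder: in the real app this would preprocess `text`,
    # run the trained model, and return P(ChatGPT-generated).
    return 0.5

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json(force=True).get("text", "")
    prob = classify(text)
    label = "ChatGPT" if prob > 0.5 else "Human"
    return jsonify({"label": label, "probability": prob})

if __name__ == "__main__":
    # Serve on localhost; in this project the app runs inside a Docker container.
    app.run(host="0.0.0.0", port=5000)
```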

Simpler classification algorithms such as Support Vector Machines (SVMs) or ensemble techniques such as Random Forests could be used, but they tend not to work as well as Transformers because they are not built for sequential data. Transformers, on the other hand, offer the advantage of multi-head attention: they can find relations between words that are far apart in a sentence, something that Recurrent Neural Networks (RNNs) struggle with.

Dataset

I used a part of the OpenGPTText dataset, available in this repository, for training. The distribution of the number of words per sentence for both human and ChatGPT text is shown below -

Text Preprocessing and Vector Embeddings

Standard text preprocessing techniques were applied: removing stopwords and punctuation, tokenization, lemmatization, and stemming. Each word was then mapped to a 50-dimensional vector embedding using the pretrained GloVe network. Each sentence was limited to a length of 100 words, so sentences with more than 100 words were truncated and sentences with fewer than 100 words were padded with the vector embedding of the full stop.
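A rough sketch of this pipeline is given below, assuming NLTK for tokenization, stopword removal, lemmatization and stemming, and the public `glove.6B.50d.txt` file for the 50-dimensional embeddings; the exact libraries and file names used in this repository may differ.

```python
# Sketch of the described preprocessing (assumptions: NLTK utilities and
# the public glove.6B.50d.txt embedding file).
import string
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

MAX_LEN = 100      # fixed sentence length
EMBED_DIM = 50     # GloVe vector size

# Load GloVe vectors: word -> 50-dimensional numpy array
glove = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def embed_sentence(text: str) -> np.ndarray:
    # Tokenize, drop stopwords and punctuation, then lemmatize and stem
    tokens = [t.lower() for t in nltk.word_tokenize(text)]
    tokens = [t for t in tokens if t not in stop_words and t not in string.punctuation]
    tokens = [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]

    # Look up GloVe vectors, truncate to MAX_LEN, pad with the full-stop embedding
    vectors = [glove[t] for t in tokens if t in glove][:MAX_LEN]
    pad = glove["."]
    while len(vectors) < MAX_LEN:
        vectors.append(pad)
    return np.stack(vectors)           # shape: (100, 50)
```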

Model Architecture

A Transformer encoder is the backbone of the model: 2 Transformer encoder layers followed by 3 linear layers.
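A minimal PyTorch sketch of this architecture is shown below; the number of attention heads, hidden sizes and dropout value are illustrative assumptions, not the exact values used in this repository.

```python
# Sketch of the described architecture: 2 Transformer encoder layers
# followed by 3 linear layers. Head count, hidden sizes and dropout are
# illustrative assumptions.
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, embed_dim=50, num_heads=5, num_layers=2, dropout=0.1):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dropout=dropout, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Three linear layers on top of the pooled encoder output
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, 16), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(16, 1),
        )

    def forward(self, x):              # x: (batch, 100, 50) GloVe embeddings
        h = self.encoder(x)            # (batch, 100, 50)
        h = h.mean(dim=1)              # average-pool over the sequence
        return self.classifier(h)      # (batch, 1) logit for "ChatGPT"
```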

Training and performance improvement

The model was trained with different hyperparameters. Dropout was used as a regularization technique, and the training error was further reduced by using a cosine annealing learning rate scheduler, which improved the accuracy from 77% to 85%. I also tried different numbers of Transformer encoder layers, from 2 to 5. All of them gave similar results, so I chose 2 encoder layers because they gave a significant compute saving without any loss in performance. The final optimal hyperparameters, along with the loss curves for the different numbers of encoder layers, are shown below -
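The sketch below shows how dropout (already built into the model above) and a cosine annealing schedule fit into a standard PyTorch training loop; the optimizer choice, learning rate and epoch count are assumptions, not the repository's actual settings.

```python
# Training sketch: cosine annealing learning rate schedule with PyTorch.
# Optimizer, learning rate and epoch count are assumptions.
import torch
import torch.nn as nn

model = TextClassifier()                               # from the sketch above
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
num_epochs = 30
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

def train(train_loader):
    for epoch in range(num_epochs):
        model.train()
        for x, y in train_loader:                      # x: (batch, 100, 50), y: (batch, 1)
            optimizer.zero_grad()
            loss = criterion(model(x), y.float())
            loss.backward()
            optimizer.step()
        scheduler.step()                               # anneal the learning rate once per epoch
```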
