Human-written Text vs ChatGPT-generated Text

This is a novel Transformer-based approach to distinguishing ChatGPT-generated text from human-written text. The model was also deployed on a local server using Flask, with Docker used to manage all the dependencies.

Author - Omkar Nitsure

Email - [email protected]
GitHub profile - https://github.com/omkarnitsureiitb

I used the publicly available OpenGPTText dataset, which you can get here. I designed and implemented a Text-to-Text Transformer (T5) model for text classification; you can take a look at the model architecture in the paper. The model performed very well, reaching an accuracy of 85% alongside strong scores on the other evaluation metrics. Furthermore, I deployed the model using Flask on localhost to provide a better interface for performing the text classification.
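A minimal sketch of what such a Flask endpoint could look like is shown below. The `classify` helper is a hypothetical placeholder; the actual model-loading and preprocessing code in this repository is not reproduced here.

```python
# Minimal Flask serving sketch (illustrative; `classify` is a placeholder
# for the real preprocessing + Transformer inference code).
from flask import Flask, request, jsonify

app = Flask(__name__)

def classify(text: str) -> float:
    # Placeholder: in the real app this would preprocess `text`,
    # run the trained model, and return P(ChatGPT-generated).
    return 0.5

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json(force=True).get("text", "")
    prob = classify(text)
    label = "ChatGPT" if prob > 0.5 else "Human"
    return jsonify({"label": label, "probability": prob})

if __name__ == "__main__":
    # Serve on localhost; in this project the app runs inside a Docker container.
    app.run(host="0.0.0.0", port=5000)
```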

Simpler classification algorithms such as Support Vector Machines (SVMs) or ensemble techniques such as Random Forests could be used, but they tend not to work as well as Transformers because they are not built for sequential data. Transformers, on the other hand, offer the advantage of multi-head attention: they can find relations between words that are far apart in a sentence, something that Recurrent Neural Networks (RNNs) struggle with.

Dataset

I used a part of the OpenGPTText dataset, available in this repository, for training. The distribution of the number of words per sentence for both human and ChatGPT text is shown below -

Text Preprocessing and Vector Embeddings

Standard text preprocessing techniques were applied: removing stopwords and punctuation, tokenization, lemmatization, and stemming. Each word was then mapped to a 50-dimensional vector embedding using the pretrained GloVe network. Each sentence was limited to a length of 100 words, so sentences with more than 100 words were truncated and sentences with fewer than 100 words were padded with the vector embedding of the full stop.
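A rough sketch of this pipeline is given below, assuming NLTK for tokenization, stopword removal, lemmatization and stemming, and the public `glove.6B.50d.txt` file for the 50-dimensional embeddings; the exact libraries and file names used in this repository may differ.

```python
# Sketch of the described preprocessing (assumptions: NLTK utilities and
# the public glove.6B.50d.txt embedding file).
import string
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

MAX_LEN = 100      # fixed sentence length
EMBED_DIM = 50     # GloVe vector size

# Load GloVe vectors: word -> 50-dimensional numpy array
glove = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def embed_sentence(text: str) -> np.ndarray:
    # Tokenize, drop stopwords and punctuation, then lemmatize and stem
    tokens = [t.lower() for t in nltk.word_tokenize(text)]
    tokens = [t for t in tokens if t not in stop_words and t not in string.punctuation]
    tokens = [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]

    # Look up GloVe vectors, truncate to MAX_LEN, pad with the full-stop embedding
    vectors = [glove[t] for t in tokens if t in glove][:MAX_LEN]
    pad = glove["."]
    while len(vectors) < MAX_LEN:
        vectors.append(pad)
    return np.stack(vectors)           # shape: (100, 50)
```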

Model Architecture

A Transformer encoder is the backbone of the model: 2 Transformer encoder layers followed by 3 linear layers.
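A minimal PyTorch sketch of this architecture is shown below; the number of attention heads, hidden sizes and dropout value are illustrative assumptions, not the exact values used in this repository.

```python
# Sketch of the described architecture: 2 Transformer encoder layers
# followed by 3 linear layers. Head count, hidden sizes and dropout are
# illustrative assumptions.
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, embed_dim=50, num_heads=5, num_layers=2, dropout=0.1):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dropout=dropout, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Three linear layers on top of the pooled encoder output
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, 16), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(16, 1),
        )

    def forward(self, x):              # x: (batch, 100, 50) GloVe embeddings
        h = self.encoder(x)            # (batch, 100, 50)
        h = h.mean(dim=1)              # average-pool over the sequence
        return self.classifier(h)      # (batch, 1) logit for "ChatGPT"
```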

Training and performance improvement

The model was trained with different hyperparameters. Dropout was used as a regularization technique, and the training error was further reduced by using a cosine annealing learning rate scheduler, which improved the accuracy from 77% to 85%. I also tried different numbers of Transformer encoder layers, from 2 to 5. All of them gave similar results, so I chose 2 encoder layers because they gave a significant compute saving without any loss in performance. The final optimal hyperparameters, along with the loss curves for the different numbers of encoder layers, are shown below -
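The sketch below shows how dropout (already built into the model above) and a cosine annealing schedule fit into a standard PyTorch training loop; the optimizer choice, learning rate and epoch count are assumptions, not the repository's actual settings.

```python
# Training sketch: cosine annealing learning rate schedule with PyTorch.
# Optimizer, learning rate and epoch count are assumptions.
import torch
import torch.nn as nn

model = TextClassifier()                               # from the sketch above
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
num_epochs = 30
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

def train(train_loader):
    for epoch in range(num_epochs):
        model.train()
        for x, y in train_loader:                      # x: (batch, 100, 50), y: (batch, 1)
            optimizer.zero_grad()
            loss = criterion(model(x), y.float())
            loss.backward()
            optimizer.step()
        scheduler.step()                               # anneal the learning rate once per epoch
```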
