Chatbot for consulting candidates seeking jobs on pasona.vn website

This chatbot is integrated into the pasona.vn website as an AI assistant that answers web visitors' and job seekers' job-related inquiries (salary, open positions, etc.). It applies a fine-tuned BERT model to categorize the main topic of each message and routes the message to a topic-specific response process that interacts with the database. This repository contains both the frontend and the backend: the frontend is written in HTML, CSS, and JavaScript; the backend in Python and SQL. Message processing combines several NLP techniques, and topic classification uses a fine-tuned BERT model. This README introduces how the chatbot operates, the structure of the source code, and some highlights of the backend process, focusing on the NLP techniques and the ML model. The full report is written in Vietnamese.

Table of contents:

  1. Workflow of the chatbot
  2. Source code summary

  1. Workflow of the chatbot

    chatbot icon appears on the website

    When customers access the homepage of the pasona.vn website, a blinking chatbot icon appears in the bottom right corner of the screen. When a customer clicks the icon, the system displays a greeting and invites them to fill out a contact information form. The system then verifies and processes the submitted data:

    • If the phone number and email already exist in the company's database, the system compares the submission with the stored record in the contact information table and updates any fields that differ. It then retrieves the customer code and saves it to local storage.
    • If the phone number and email do not exist in the database, a new row is created to store the new customer's contact information, and the customer code is likewise retrieved and saved to local storage.
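
    As a rough illustration, the verification step above could look like the following sketch, assuming a Flask-SQLAlchemy setup; the Contact model, its fields, and customer_code are hypothetical names for illustration, not the repository's actual schema:

```python
# Hypothetical upsert for the contact form (model and field names assumed).
def upsert_contact(db, Contact, form):
    contact = Contact.query.filter_by(
        phone=form["phone"], email=form["email"]
    ).first()
    if contact:
        # Existing customer: refresh any fields that changed in the form.
        for field in ("name", "address"):
            if form.get(field) and form[field] != getattr(contact, field):
                setattr(contact, field, form[field])
    else:
        # New customer: create a fresh contact row.
        contact = Contact(**form)
        db.session.add(contact)
    db.session.commit()
    return contact.customer_code  # the frontend saves this to localStorage
```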

    contact form

    After the customer fills out the contact form, the system displays a form similar to the Job Page, allowing the customer to enter job search criteria (not all fields are required). The system then calls the Job Page's "search job" API to perform the query and displays the results (in a mobile interface). At the same time, it stores the search criteria in the data repository (the job_search_history table); if the customer does not perform a search, no search history is saved.
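
    A minimal sketch of the conditional history save, again with assumed model and field names:

```python
# Hypothetical: persist a search only when the customer submitted criteria.
def save_search_history(db, JobSearchHistory, customer_code, criteria):
    if not any(criteria.values()):
        return  # no search performed, so no history row is written
    db.session.add(JobSearchHistory(customer_code=customer_code, **criteria))
    db.session.commit()
```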

    search job form

    search job result

    Below the displayed search results, the system will ask the customer if they have any more questions and provide three options:

    3 options

    • "Search job": Continue searching for jobs using the available template. The system will display the job search template for the customer to enter search information.
    • "Chat with chatbot": If the customer has other questions besides job searching, instead of interacting with customer service staff, the system will be the one to receive the question and provide an answer. When the customer selects this option, the system will open a message composition box and a send message button. The customer enters the message content into the message composition box, and the system will process the message, interact with the data repository, and return a response, as well as store the message content, message analysis, and response. After providing a response, the system will ask the customer if they have any more questions, along with three options.

      the customer sends a question and the system processes it

      the response from chatbot

    • "End chat": If the customer no longer needs to use the chatbot, they can choose to end the conversation. At this point, the system will disable the message composition box and send message button. Additionally, the system will invite the customer to rate their chatbot experience using a 5-point scale represented by 5 stars. Whatever star the customer selects, the system will store the value (score) of that star and update it in the "rating" column in the customer contact table. Finally, a thank you message will be displayed, and the conversation will end.

      rating section
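
    A storage sketch for the rating step, with the same caveat that model and field names are assumptions:

```python
# Hypothetical: write the selected star value into the contact table.
def save_rating(db, Contact, customer_code, stars):
    contact = Contact.query.filter_by(customer_code=customer_code).first()
    contact.rating = stars  # 1-5, the star the customer clicked
    db.session.commit()
```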


  2. Source code summary

    Frontend

    Originally, this chatbot was an element of the website footer. In this repository, however, it has been extracted as a standalone component. Here's the structure of the frontend folder:

    frontend source code

    a sample function to call API from server

    Backend

    backend source code

    • app.py: Sets up environment variables, connects to the database, and registers blueprints for routes.
    • .env: Contains information about keys and accounts of third-party applications linked to the program.
    • In the routes folder, the file chat.py specifies functions and methods for each endpoint.
    • In the controller folder, the file chat_controller.py defines the functions called by the routes. Each function specifies the expected input data and the services applied to process that input.
    • In the services folder, each file corresponds to a stage in the chatbot process. Each file will define objects and functions that perform specific operations (adding, updating, deleting, etc.).
    • In the model folder, each file corresponds to a data table used in the chatbot process. Each file will define the columns of each table and the data types of each data field.
    • In the utils folder, the files will define the structure (format) of input/output data.
    • The fine-tuning_model folder contains the fine-tuning scripts for the deep learning natural language processing (NLP) model. The files in this folder do not run with the rest of the chatbot program; they are run separately at specific times to improve the model.

    Execution flow: Routes -> controllers -> utils -> services -> models -> controllers -> utils -> routes.
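
    To make this flow concrete, here is a minimal sketch of the routes-to-controller wiring, assuming Flask and the folder layout above; the endpoint path and payload shape are illustrative assumptions:

```python
# routes/chat.py (sketch): the blueprint that app.py registers.
from flask import Blueprint, request, jsonify
from controller.chat_controller import chat_controller  # import path assumed

chat_bp = Blueprint("chat", __name__)

@chat_bp.route("/chat", methods=["POST"])
def chat():
    payload = request.get_json()         # raw message from the frontend
    result = chat_controller(payload)    # controllers -> utils -> services -> models
    return jsonify(result)               # the formatted response returns via routes
```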

    Fine-tuning the deep learning BERT model:

    After processing, cleaning, and extracting the original messages so that the message keywords are retained, the author classifies the message topics to direct them to the processing steps corresponding to each topic. Currently there are two main topics, 'amount' and 'salary', plus an additional topic for messages unrelated to recruitment and employment, labeled 'trivia'. Including the 'trivia' topic reduces processing time: upon recognizing a message belonging to this topic, the system transfers it directly to message storage and automatically responds, 'Sorry, your request is beyond our knowledge. We are in the progress of expanding our intelligence. Thanks for your resource.'
    In cases where messages are classified as 'amount' or 'salary', the system performs deeper language analysis to identify the objects and their roles in the SQL query. This is a data classification problem, where the training dataset is pre-labeled: a deep learning model 'learns' the characteristics of each class from the training dataset, allowing it to classify test data.
    Given the absence of historical data about customer queries on the company's website, in the first trial the author aims to leverage existing functions and libraries to save time preparing the training data and fine-tuning deep learning models. The synsets() function from the WordNet library (accessed through NLTK) is chosen to detect synonyms related to the topics 'amount' and 'salary'. However, the synonym lists in this library are limited: many words or phrases that humans associate with a topic are not formal synonyms, so the synsets() lookup cannot discern those relationships.
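
    The limitation is easy to reproduce; this snippet uses NLTK's WordNet interface to list every lemma WordNet knows for 'salary':

```python
# Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

# Collect every lemma of every synset of the topic word 'salary'.
synonyms = {lemma.name() for s in wn.synsets("salary") for lemma in s.lemmas()}
print(synonyms)  # e.g. {'salary', 'wage', 'pay', 'earnings', 'remuneration'}

# Phrases a job seeker might actually type, such as 'income range' or
# 'compensation package', never appear here -- the limitation noted above.
```
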
    Realizing that the available NLTK libraries might not meet the chatbot's specific processing needs, the author creates a training dataset of questions that customers might send to the chatbot, labels each question's topic, and fine-tunes a pre-trained deep learning model to classify and identify the message topics.
    The training dataset comprises 322 messages: 111 'amount,' 111 'salary,' and 100 'trivia.' Before training, this dataset will undergo cleaning using the 'clean_data()' function defined in Task 3. Post-training, the author will employ the 'WordCloud()' function from the wordcloud library to visualize common words for each topic, facilitating adjustment in case of excessive similarity between two classes.
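
    A sketch of that visual check, assuming the dataset lives in an Excel file with 'message' and 'label' columns and reusing the author's clean_data() function (the file name and column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

df = pd.read_excel("training_data.xlsx")        # assumed file name
df["clean"] = df["message"].apply(clean_data)   # author's cleaning function

# One word cloud per topic, to spot excessive overlap between classes.
for topic in ("amount", "salary", "trivia"):
    text = " ".join(df.loc[df["label"] == topic, "clean"])
    cloud = WordCloud(width=800, height=400).generate(text)
    plt.imshow(cloud)
    plt.axis("off")
    plt.title(topic)
    plt.show()
```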

    After obtaining the training dataset, the author explores and selects a deep learning model: in this case, BERT. Leveraging knowledge of machine learning, artificial intelligence, and deep learning models, the author fine-tunes the BERT model to suit the classification problem (message topic identification).
    Each deep learning model serving a task is stored in a separate file within the 'fine-tuning_model' folder; the message topic identification model lives in 'train_model_modifier.py'. First, the necessary libraries (torch, transformers, tensorflow) are installed via "pip install torch tensorflow transformers", and the essential modules are imported: torch, BertForSequenceClassification, BertTokenizer, tensorflow, and some modules for representation and model evaluation.
    For the current classification task, the original model will be fine-tuned by adding a classification layer at the end of the model, and the output of the main model will be the input to this classification layer. A class named UpdatePretrainedModel is created, in which a function named train_and_save_model() is defined with the following input parameters:

    • new_input_texts: set of cleaned messages
    • new_labels: set of corresponding labels for the input messages
    • save_path: file name to contain the tuned model
    • num_classes: number of classes (labels) in the classification part
    • batch_size: number of data samples in one training iteration
    • learning_rate: a hyperparameter used in training neural networks. Its value is a positive number, usually between 0 and 1.
    Next, the model name for fine-tuning, 'bert-base-uncased', is declared, and the pre-trained BERT model is loaded for classification by invoking BertForSequenceClassification.from_pretrained() with the model name and a number of labels equal to the number of classes in the training data. The input data is processed with the model's tokenizer to transform it into a form suitable for the model. Both the tokenized input messages and the (already encoded) labels are converted into PyTorch tensors and combined into a TensorDataset. A DataLoader feeds the TensorDataset into the model, handling batch size selection and data shuffling during training.
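
    Based on that description, the data-preparation step could look like this sketch (not the repository's exact code; new_input_texts, new_labels, num_classes, and batch_size are the parameters listed above):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(
    model_name, num_labels=num_classes
)

# Tokenize the cleaned messages into padded tensors the model accepts.
enc = tokenizer(new_input_texts, padding=True, truncation=True,
                return_tensors="pt")
labels = torch.tensor(new_labels)

# Combine inputs and labels; DataLoader handles batching and shuffling.
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], labels)
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
```
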
    Subsequently, the loss function 'CrossEntropyLoss' and the optimizer 'Adam' are defined. The optimizer function updates the model's weights based on gradients during the backpropagation process.
    Backpropagation, short for "backward propagation of errors," is a common method used in training artificial neural networks in conjunction with an optimization method like gradient descent. This method calculates the gradient of the loss function with respect to all relevant weights in that neural network. This gradient is used in the optimization method to update the weights, aiming to minimize the loss function.
    Once the inputs are prepared, the training loop begins. The number of epochs is set to 200, and the loss is logged every 50 epochs to monitor whether the model is overfitting. At the beginning of each epoch, the accumulated loss is reset to 0 and the model is put into training mode. The inner loop iterates through batches of data samples. For each batch, the optimizer's gradients are zeroed, the model's output and the loss are computed, backpropagation is executed, the model's weights are updated, and the batch loss is added to the epoch total.
    Upon completion of the 200 training epochs, torch.save() is used to export the model file under the filename passed through the save_path parameter. The fine-tuning function concludes and returns the name of the fine-tuned model file.
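
    Continuing the variables from the data-preparation sketch, the training loop described above might look like this; it mirrors the described procedure rather than reproducing train_and_save_model() verbatim:

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

model.train()
for epoch in range(200):
    total_loss = 0.0
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()                  # reset gradients for this batch
        outputs = model(input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.logits, batch_labels)
        loss.backward()                        # backpropagation
        optimizer.step()                       # update the model's weights
        total_loss += loss.item()
    if (epoch + 1) % 50 == 0:                  # log the loss every 50 epochs
        print(f"epoch {epoch + 1}: total loss = {total_loss:.4f}")

torch.save(model.state_dict(), save_path)      # export the fine-tuned model
```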

    In the main() function, after cleaning the training data and saving the updated data to an Excel file, the file is opened as a DataFrame. The model fine-tuning function is called with the necessary parameters. Upon completion, a new .pth file containing the fine-tuned model appears in the directory.
    The next step after successfully training the model is model evaluation. Here, the author uses a confusion matrix and calculates metrics such as accuracy, precision, recall, and F1 score. Using the training dataset again, after loading the fine-tuned model, the author performs the embedding steps and predicts labels for the messages; the prediction results are saved into a new column in the data table, as sketched below.
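
    An evaluation sketch using scikit-learn, assuming df holds the cleaned messages with integer-encoded labels in a 'label_id' column (column names are illustrative):

```python
import torch
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

model.eval()
with torch.no_grad():
    enc = tokenizer(df["clean"].tolist(), padding=True, truncation=True,
                    return_tensors="pt")
    logits = model(enc["input_ids"],
                   attention_mask=enc["attention_mask"]).logits
    df["predicted"] = logits.argmax(dim=1).numpy()  # predictions as new column

print(confusion_matrix(df["label_id"], df["predicted"]))
print("accuracy:", accuracy_score(df["label_id"], df["predicted"]))
print(precision_recall_fscore_support(df["label_id"], df["predicted"],
                                      average="macro"))
```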

    • Accuracy is computed as the ratio of correct predictions to the total number of samples in the dataset. In this initial trial, the accuracy is about 66%.
    • The confusion matrix is a vital tool for evaluating the performance of a classification model in machine learning. It provides a comprehensive view of how the model predicts classes and shows how the predictions are distributed compared to reality. Elements on the main diagonal represent correctly predicted observations, while off-diagonal elements indicate items misclassified into other classes. With this model, the 'salary' class is frequently misclassified as 'amount': only 1 of the 18 'salary' samples is labeled correctly; the rest are labeled 'amount'.
    This indicates an issue with the training dataset: the 'amount' and 'salary' classes are currently too similar, and the dataset needs to be modified to differentiate the two classes better. Looking at the word clouds of both classes, it's evident that they share many common words (position, job, role). However, in the 'amount' class the common words reflect the topic's content (open, total, position: the number of open job positions), whereas in the 'salary' class the common words are generic and unrelated to salary (job, role, position).

    wordcloud of 'amount' topic

    wordcloud of 'salary' topic

    Therefore, the 'salary' class data needs to be modified. After modification, common words like average, standard, role... are more related to the salary theme, making this new data more acceptable.

    wordcloud of modified 'salary' topic

    Upon fine-tuning the model for the second time, the evaluation results are more promising. Specifically, the accuracy has increased to 99.38%, and there are no more misclassifications between the 'amount' and 'salary' classes. Although the 'trivia' class has 2 cases misclassified as 'amount', it's negligible and can be ignored. The second fine-tuned model is accepted and ready for deployment.

    After fine-tuning the deep learning model, the author combines two methods of message topic identification into one function named `topic_identifier()`. This function takes two input parameters: a dictionary of topics (as during model fine-tuning, labels were encoded, and the model's predictions return numerical types, requiring a dictionary defining the encoded topics) and the cleaned message text. Initially, the `synsets()` function from the WordNet library is used to find any synonyms of the names of the three topics within the sentence. If a synonym is found, that word is replaced with the topic word, and the function returns the message with the replaced topic word. Otherwise, if no synonyms of any topic word are found, the fine-tuned model is utilized. The `topic_identifier()` function returns an object containing the identified topic word and the message.
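
    A sketch of how the two methods combine, with tokenizer and model assumed to be the fine-tuned artifacts loaded elsewhere; the exact label encoding is an assumption:

```python
from nltk.corpus import wordnet as wn

def topic_identifier(topics, message):
    """topics maps the model's numeric labels to topic words,
    e.g. {0: 'amount', 1: 'salary', 2: 'trivia'} (mapping assumed)."""
    words = message.split()
    for i, word in enumerate(words):
        for topic in topics.values():
            lemmas = {l.name() for s in wn.synsets(topic) for l in s.lemmas()}
            if word in lemmas:
                words[i] = topic      # replace the synonym with the topic word
                return {"topic": topic, "message": " ".join(words)}
    # No WordNet synonym found: fall back to the fine-tuned BERT classifier.
    enc = tokenizer(message, return_tensors="pt")
    pred = model(**enc).logits.argmax(dim=1).item()
    return {"topic": topics[pred], "message": message}
```
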
    Within `chat_controller()`, the topic words are listed in a collection, and the `topic_identifier()` function is called to determine the message's topic. In the case where the topic word is 'trivia', the message is stored in the data store with a response of "Sorry," indicating the end of the chat session. Otherwise, the system continues the analysis, constructs an SQL query, and generates a response for the customer.
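
    And a dispatch sketch for chat_controller(); store_message(), build_database_answer(), and SORRY_RESPONSE are hypothetical helpers standing in for the storage, analysis, and query-building services described above:

```python
TOPICS = {0: "amount", 1: "salary", 2: "trivia"}  # encoding assumed

def chat_controller(payload):
    cleaned = clean_data(payload["message"])            # author's cleaning step
    result = topic_identifier(TOPICS, cleaned)
    if result["topic"] == "trivia":
        store_message(payload, result, SORRY_RESPONSE)  # hypothetical service
        return {"answer": SORRY_RESPONSE, "end_chat": True}
    # 'amount' / 'salary': deeper analysis, SQL query, formatted answer.
    return build_database_answer(result)                # hypothetical service
```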

    • Messages unrelated to recruitment and job topics ('trivia'): these messages are correctly categorized and saved into the database, and the chat session ends with the response "Sorry, the information has not been updated in our intelligence."
    • Messages inquiring about salary or the number of job openings ('amount', 'salary'): these messages are also classified successfully and proceed to the deeper analysis and SQL query construction described above.
