
Project_Maui

Using News Sentiment to Predict Stock Price Movements

A program that lets a user choose a company from the S&P 500 and run a logistic regression model that predicts the movement of the company's stock price on the next trading day, based on the current sentiment (VADER) of Reuters news articles related to the company.

UX/UI Showcase

Choose a Company

TE Connectivity

Required Model Accuracy of 20%

Costco Buy

Required Model Accuracy of 63%

Please wait till the end for modified recommendation. Thank you for your patience! Costco Adjust

Original Goal

Had hourly data on stock prices and news sentiments been available:

Predict the intraday movement in stock price between the current point in time and the end of the trading day, to determine whether a trade will be profitable by end of day. The planned machine learning model took the average news sentiment between h-24 and h0 (h = hour) as features and the intraday change in price between h0 and 4pm as the target, on a rolling single-day basis over historical data.

NOTE: You must have active keys from the following APIs to run this program:

File to run: project_code > master_function

Link to Project Proposal

Data

Cleaning and Curation

News Sentiments

  • Pulled 20 Articles per day from news API

    • Could only pull for past 30 days
    • Request limitations informed Dataframe Structure
  • Sentiments are placed into four categories:

    • compound, positive, negative and neutral
  • Cleaned up articles using Lemmatization and stop word removal

    • Marginally affected polarity score
    • Applied Vader Sentiment Analyzer to return Polarity Score
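    Code
    import re

    # requires the NLTK 'stopwords', 'punkt' and 'wordnet' data packages
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()

    # function to tokenize text
    def tokenizer(text):
        # strip everything except letters and spaces
        sw = set(stopwords.words('english'))
        regex = re.compile("[^a-zA-Z ]")
        re_clean = regex.sub('', text)
        # split into words and lemmatize each one
        words = word_tokenize(re_clean)
        lem = [lemmatizer.lemmatize(word) for word in words]
        # lowercase and drop stop words
        tokens = [word.lower() for word in lem if word.lower() not in sw]

        # exporting tokenized words as output
        return tokens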
    
  • Ensure relevancy by including both the company and ticker

    Code
    # establishing keywords for news pull
    keyword = f'{company} AND {ticker}'

Stock Prices

Breaks on weekends and holidays? Here we go! Sentiment scores from non-trading days are averaged into the scores of the next trading day.

Code
        # iterating through sentiment score / article DataFrame to...
        for index, row in dataframe.iterrows():

            # if daily return is null value for a given day - i.e. a non-trading day,
            if pd.isnull(row['return']):
                
                # then append polarity scores to their respective lists
                compound.append(row['compound'])
                positive.append(row['positive'])
                negative.append(row['negative'])
                neutral.append(row['neutral'])
                dataframe.drop(index=index, inplace=True)
            
            # if there was a return value - i.e. it was a trading day
            elif pd.notnull(row['return']):
                
                # The list of compound polarity scores will be empty if the stock was traded
                # on the previous day; therefore, move along.
                if len(compound) == 0:
                    pass

                # If the list is not empty, then at least one day prior was a non-trading 
                # day. Append the current day's scores to the list and calculate the mean 
                # for each score. Then replace the current day's polarity scores with the 
                # average scores of today and previous non-trading days.
                else:
                    compound.append(row['compound'])
                    compound_mean = np.mean(compound)
                    compound = []

                    positive.append(row['positive'])
                    positive_mean = np.mean(positive)
                    positive = []

                    negative.append(row['negative'])
                    negative_mean = np.mean(negative)
                    negative = []

                    neutral.append(row['neutral'])
                    neutral_mean = np.mean(neutral)
                    neutral = []

                    dataframe.at[index, 'compound'] = compound_mean
                    dataframe.at[index, 'positive'] = positive_mean
                    dataframe.at[index, 'negative'] = negative_mean
                    dataframe.at[index, 'neutral'] = neutral_mean

            else:
                pass
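The loop above can also be expressed without row-by-row iteration. The following is a minimal sketch, not the project's implementation: the helper name `roll_into_next_trading_day`, the constant `SCORE_COLS` and the demo values are hypothetical, and it assumes the same column names as the loop (`return` plus the four polarity scores).

```python
import numpy as np
import pandas as pd

# Hypothetical constant: the four polarity score columns used in the loop above.
SCORE_COLS = ["compound", "positive", "negative", "neutral"]


def roll_into_next_trading_day(df):
    """Average the scores of non-trading days into the next trading day."""
    is_trading = df["return"].notna()
    # Counting trading days from the bottom up gives each trading day and the
    # non-trading days immediately before it the same group id.
    group = is_trading[::-1].cumsum()[::-1]
    out = df.copy()
    out[SCORE_COLS] = df.groupby(group)[SCORE_COLS].transform("mean")
    # Keep trading days only; trailing non-trading days are dropped as well.
    return out[is_trading]


# Made-up demo data: Friday and Monday are trading days, the weekend is not.
demo = pd.DataFrame(
    {
        "compound": [0.2, 0.4, 0.6, 0.8],
        "positive": [0.1, 0.2, 0.3, 0.4],
        "negative": [0.0, 0.1, 0.1, 0.1],
        "neutral": [0.9, 0.7, 0.6, 0.5],
        "return": [1.0, np.nan, np.nan, -0.5],
    },
    index=["Fri", "Sat", "Sun", "Mon"],
)
result = roll_into_next_trading_day(demo)
print(result)
```

In this demo, Monday's scores become the mean of Saturday, Sunday and Monday, while Friday is left unchanged. Unlike the loop, the sketch returns a new DataFrame instead of modifying the input in place.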

Sample of Pre-model Dataframe

get_model_data

Featuring Lags

Lagging days feature is incorporated into the get_model_data function:

def get_model_data(company, ticker, lag=0):
    # shifting the return column up to adjust for a lag in stock reaction to sentiments
    final_df = cleaned_df(combined_df)
    final_df['return'] = final_df['return'].shift(-lag)
    final_df.dropna(inplace=True)
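To illustrate what the shift does, here is a small sketch on made-up data. The helper `apply_lag` and the values are hypothetical. With `lag=1`, each day's sentiment is paired with the following day's return, and the last row is dropped because it has no return to pair with.

```python
import pandas as pd

# Made-up sentiment scores and daily returns for four trading days.
df = pd.DataFrame(
    {"compound": [0.1, -0.3, 0.5, 0.2], "return": [1.0, -2.0, 0.5, 1.5]},
    index=pd.to_datetime(["2020-04-06", "2020-04-07", "2020-04-08", "2020-04-09"]),
)


def apply_lag(frame, lag=0):
    """Pair each day's sentiment with the return `lag` trading days later."""
    out = frame.copy()
    out["return"] = out["return"].shift(-lag)
    # Rows at the end have no future return to pair with, so they are dropped.
    return out.dropna()


lagged = apply_lag(df, lag=1)
print(lagged)
```

Each extra day of lag costs one row of data, which matters when only about 30 days of articles are available.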

Limitations

  • Lack of affordable availability of historical intraday stock price data

    • Had to change scope of project from intraday predictions to day over day
  • News API only allows for 30 days of historical articles to be pulled in

    • Limited training data likely affects the accuracy of our model

Sources

  • IEX Finance - historical stock price data
  • News API / Reuters - historical news articles

Models

We predict whether the closing price of a stock will rise (1) or fall (-1) compared to the closing price of the previous trading day. This is supervised machine learning, as we have a target variable. The data is split 70/30 into training and testing sets (30% held out for testing) to fit and evaluate the models.

For models, we used Logit (logistic) regression and a Balanced Random Forest Classifier to predict the probability of the binary outcome.

Other models used include LSTM Sequential and 3-Layer Neural Network.

Python Libraries:

Data for News Sentiments

NLTK

Logit and Miscellaneous Classifiers:

scikit-learn

Balanced Random Forest:

imbalanced-learn

For Neural Network Sequential and LSTM Models:

keras

tensorflow

Evaluation Results: Which Model Shall We Use?

Based on 31 days of data for Disney (DIS): 3/15/2020 to 4/15/2020

Note: Test statistics change with live data, which may lead to a different choice of model.

a. Logit Regression

  • The balanced accuracy score is 0.83.
Evaluation

Logit Evaluation Logit Evaluation on Training vs. Testing Data

Code
# ********* MODEL FITTING *************
# --------- Logit -----------
# --------- Start -----------

M = 'Logit'
from sklearn import linear_model
# a large C means weak regularization
lm = linear_model.LogisticRegression(C = 1e5)
lm.fit(X_train, y_train)
lm_pred = lm.predict(X_test)

# --------- Logit -----------
# ---------- End ------------

b. Balanced Random Forest Classifier Ensemble Learning

  • The balanced accuracy score is 0.67.
  • A better choice than the Decision Tree model, as averaging many trees reduces overfitting
Evaluation

Balanced Random Forest Evaluation

Code
# ********* MODEL FITTING *************
# ----- Balanced Random Forest -------
# ------------- Start ----------------

# Fit a Balanced Random Forest Classifier, which randomly under-samples
# each bootstrap sample to balance the classes (no separate resampling step)
from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)
brf_pred = brf.predict(X_test)

# ----- Balanced Random Forest -------
# -------------- End -----------------

c. Decision Tree Resampling

  • The balanced accuracy score is 0.67.
Evaluation

Decision Tree Evaluation

Code
# ********* MODEL FITTING *************
   # ----- Decision Tree -------
   # --------Start-------------

from sklearn import tree
# Needed for decision tree visualization
import pydotplus
from IPython.display import Image

# Creating the decision tree classifier instance
model_tree = tree.DecisionTreeClassifier()
# Fitting the model
model_tree = model_tree.fit(X_train, y_train)
# Making predictions using the testing data
tree_pred = model_tree.predict(X_test)

  # --- Decision Tree --------
   # --------End-------------
Image

Decision Tree Visualization

Data Preparation for Models

Code
# Creating training and testing data sets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# shuffle=False keeps the chronological order (random_state has no effect without shuffling)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle=False, random_state=42)

# For neural network sequential, LSTM and ensemble learning
# Create the StandardScaler instance
scaler = StandardScaler()
# Fit the Standard Scaler with the training data
X_scaler = scaler.fit(X_train)

# Scale the training data - only scale X_train and X_test data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)


# Creating validation data sets for deep learning on neural network model training
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.3, shuffle=False)

Model Evaluation

  • Confusion Matrix and Balanced Accuracy Scores for Logit, Supervised Resampling and Ensemble Learning
  • Model evaluation would differ by company and time window.
  • Interpretation of accuracy scores
    • The Logit model reports a balanced accuracy score for its predictions on the test dataset, which is 30% of 20 days, or 6 predictions.
      • We usually get 50% as the balanced accuracy score because, with so little test data and only 14 days of training data, the model is essentially guessing at random and getting it right half of the time.
      • Sometimes the score is .25 (25%) or .625 (62.5%), and 6 predictions cannot produce those fractions as a share of correct guesses. That is because the model uses a balanced accuracy score, the average of the per-class recall, rather than a straight accuracy score. Read here for explanation.
    • If you change balanced_accuracy_score to accuracy_score in the model (a simple calculation of correct guesses divided by total guesses) and print the confusion matrix, you'll see that it returns the expected fraction.
Code
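from IPython.display import display
from sklearn.metrics import balanced_accuracy_score, classification_report, confusion_matrix

# Score the accuracy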
print("Training vs. Testing - Logit")
print(f"Training Data Score: {lm.score(X_train, y_train):,.04f}")
print(f"Testing Data Score: {lm.score(X_test, y_test):,.04f}")

# Evaluating the Logit model in a nicer format
# Calculating the confusion matrix
cm_lm = confusion_matrix(y_test, lm_pred)
cm_lm_df = pd.DataFrame(
    cm_lm, index=["Actual -1", "Actual 1"], columns=["Predicted -1", "Predicted 1"]
)
# Calculating the accuracy score
acc_lm_score = balanced_accuracy_score(y_test, lm_pred)

# Displaying results
print("Confusion Matrix - Logit")
display(cm_lm_df)
print(f"Balanced Accuracy Score : {acc_lm_score:,.04f}")
print("Classification Report - Logit")
print(classification_report(y_test, lm_pred))
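The difference between straight accuracy and balanced accuracy discussed above can be checked by hand. The sketch below uses plain Python and made-up labels for six test days. The function names are hypothetical, and balanced accuracy is computed as the mean of the per-class recall, which is how scikit-learn defines `balanced_accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the actual labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[i] == label for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Made-up test set of 6 days: four up days (1) and two down days (-1).
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, -1, 1]

acc = accuracy(y_true, y_pred)            # 4 correct out of 6
bal_acc = balanced_accuracy(y_true, y_pred)  # (3/4 + 1/2) / 2 = 0.625
print(f"accuracy={acc:.4f}, balanced accuracy={bal_acc:.4f}")
```

Here four of six guesses are right (accuracy 0.6667), but the recall is 3/4 on up days and 1/2 on down days, so the balanced accuracy is 0.625, a value that is not a multiple of 1/6.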

Results: Real vs. Predicted

Logit

Logit Predictions on 1-Day Rolling Window

Logit predictions without rolling window

Logit Predictions

Balanced Random Forest

Graph

Balanced Random Forest Predictions

Three-layer Neural Network

Graph

3-Layer Neural Network Predictions

For more details on models, please click on the link below:

Test Folder

Conclusions

Impact of Positive vs. Negative News Sentiments

Conclusion: Predictions based on negative sentiments track the actual direction of returns more closely than those based on positive sentiments, though the model may be overfitting.

Deep Learning LSTM Model

Positive News Sentiments

LSTM Prediction on Positive Sentiments

LSTM Training Loss on Positive Sentiments

Negative News Sentiments

LSTM Prediction on Negative Sentiments

LSTM Training Loss on Negative Sentiments

Note: Potentially overfitting. Needed more data.

The parallel categories plots and OLS regressions below point to a consistent conclusion: negative and compound sentiment scores hold higher predictive power on the direction of daily returns.

Parallel Categories on Negative Sentiments

Parallel Categories on Composite Sentiments

Compared to positive and neutral sentiments:

Parallel Categories on Positive Sentiments

Parallel Categories on Neutral Sentiments

OLS Prediction on Price Move Directions due to News Sentiments in the Past 5 Days

Actual returns (the blue line on top) are more in tune with the predictions based on negative sentiments (the purple line, third from the bottom) and compound sentiments (the brown line, second from the bottom) over the past five trading days.

Actual vs. Lagged OLS Evaluation

OLS Predicted Returns on News Sentiments in the Past 5 Days

OLS Predicted Actual vs. Lagged Returns on Sentiment Categories

Note: Inverted due to complications when multiplying positive and negative signs.

Snoozed Sentiments? For How Long?

Impact on Lagged Response to News Sentiments

Conclusion: We found that news sentiments over the past five trading days have the same predictive capacity as one-day sentiments.

OLS with Rolling One-day Training Window

OLS Rolling 1D

OLS Returns with Rolling Three-day Training Window

OLS Rolling 2D

OLS with Rolling Five-day Training Window

OLS Rolling 3D

Discussions

Gradient Boosting Ensemble and SVM models also provide higher accuracy scores (50%) than other models.

Gradient Boosting Ensemble Learning on News Sentiments

Gradient Boosting Visualization

Returns from predicted directions: does multiplication always work?

What happens when multiplying a prediction of price drop, i.e. -1, with a negative actual return? Please click below to see a solution.

Code
sigma = (dis['return']/100).std()
all_predic = all_pred
all_predic['OLS_predi'] = 0.7 * sigma * (-all_predic['OLS_pred'])

SVM Prediction

Interpretation on Test Statistics

Confusion Matrix

|                 | Predicted 0 (-1) | Predicted 1 |
|-----------------|------------------|-------------|
| Actually 0 (-1) | TN               | FP          |
| Actually 1      | FN               | TP          |
  • Accuracy = (TP+TN)/(TP+TN+FP+FN)
    • It treats FP and FN equally and can be misleading for imbalanced data:
      • In COVID-19 testing, for example, most results are true negatives (TN), so they dominate the accuracy score
      • Yet such tests need to focus on minimizing false negatives (FN)
    • Therefore, other test statistics need to be considered
Graph Illustration

Accuracy Interpretation

Other Model Evaluation Statistics
  • Precision = TP/(TP+FP)

    • Out of all the predictions of "1" (daily price increase), how many actually increased?
    • It focuses on predicted price increases and uses the figures in the second column of the confusion matrix.
  • Recall = TP/(TP+FN)

    • How many of the actual daily price increases are predicted correctly?
    • It uses the second row of the confusion matrix.
    • Recall is also called the sensitivity of the model.
  • Specificity = TN/(TN+FP)

    • How many of the actual downward price moves are predicted correctly?
    • It uses the first row of the confusion matrix and examines only the downward price moves in our data.
  • F1 = 2 x (Precision x Recall)/(Precision + Recall)

    • The F1 score is the harmonic mean of precision and recall.
    • As precision and recall usually move in opposite directions, the F1 score is a good balance between the two.
    • F1 draws on the second row and second column, i.e. the actual and predicted upward price moves.
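The formulas above can be computed directly from the four cells of the confusion matrix. This is a minimal sketch with made-up counts. The function name `classification_metrics` is hypothetical, and it assumes every denominator is non-zero (a real evaluation would need to guard against empty classes).

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute the statistics above from the confusion matrix cells.

    Assumes all denominators are non-zero.
    """
    precision = tp / (tp + fp)       # second column of the matrix
    recall = tp / (tp + fn)          # second row of the matrix
    specificity = tn / (tn + fp)     # first row of the matrix
    f1 = 2 * (precision * recall) / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Made-up counts for six test days: 3 TP, 1 TN, 1 FP, 1 FN.
metrics = classification_metrics(tn=1, fp=1, fn=1, tp=3)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, recall is 0.75 and specificity is 0.5. Their average, 0.625, is the balanced accuracy score for the same predictions.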

Deployment

Jupyter Lab, Jupyter Notebook, Visual Studio Code

Built With

Future Steps

We are working on the following three features to upgrade our machine learning widget.

Our group has been working on several solutions to incorporate a Buy, Sell and Hold feature into the master function widget.

The original code appears as follows:
def signal_column(df):
    df['test'] = None
    for index, row in df.iterrows():
        if pd.isnull(row['test']):
            if df.loc[index]['return'] >= 2:
                df.at[index, 'test'] = 1
            elif df.loc[index]['return'] <= -2:
                df.at[index, 'test'] = -1
            else:
                df.at[index, 'test'] = 0
    return df
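For comparison, the same thresholds can be applied without iterating over rows. This is a sketch rather than the project's code: the function name `signal_column_vectorized` and the `threshold` parameter are hypothetical. Unlike the original, it returns a copy with an integer column instead of modifying `df` in place.

```python
import numpy as np
import pandas as pd


def signal_column_vectorized(df, threshold=2.0):
    """Label each day buy (1), sell (-1) or hold (0) from its daily return."""
    out = df.copy()
    conditions = [out["return"] >= threshold, out["return"] <= -threshold]
    # Rows matching neither condition (including missing returns) get 0.
    out["test"] = np.select(conditions, [1, -1], default=0)
    return out


# Made-up daily returns in percent.
demo = pd.DataFrame({"return": [2.5, -3.0, 0.4, 2.0, -2.0]})
signals = signal_column_vectorized(demo)
print(signals)
```

An integer column also avoids the object dtype that results from initializing the column with `None`.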

To fit it in, we tried the three versions below:

a. on get_model_data(company, ticker, lag=0) function

Buy Sell Hold on Data

b. on on_button_clicked(b) function

Buy Sell Hold on Button Clicked

c. on the model(df) function
# defining model to run Logit logistic regression model on the feature/target DataFrame
# and export predicted price movement and model accuracy
def model(df):
    # preparing the dataframe: label each day as buy (1), sell (-1) or hold (0)
    df['return_sign'] = None
    for index, row in df.iterrows():
        if pd.isnull(row['return_sign']):
            if df.loc[index]['return'] >= 2:
                df.at[index, 'return_sign'] = 1
            elif df.loc[index]['return'] <= -2:
                df.at[index, 'return_sign'] = -1
            else:
                df.at[index, 'return_sign'] = 0
    # casting from object to int so scikit-learn treats the target as class labels
    df['return_sign'] = df['return_sign'].astype(int)
    df = df.drop(columns=['text'])
    # creating the features (X) and target (y) sets
    X = df.iloc[:, 0:4]
    y = df["return_sign"]
    # creating training and testing data sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle=False, random_state=42)
    # fitting model
    M = 'Logit'
    lm = linear_model.LogisticRegression(solver = 'lbfgs')
    lm.fit(X_train, y_train)
    lm_pred = lm.predict(X_test)
    # calculating confusion matrix over all three classes
    labels = [-1, 0, 1]
    cm_lm = confusion_matrix(y_test, lm_pred, labels=labels)
    cm_lm_df = pd.DataFrame(
        cm_lm,
        index=[f"Actual {label}" for label in labels],
        columns=[f"Predicted {label}" for label in labels],
    )
    # calculating the accuracy score
    acc_lm_score = balanced_accuracy_score(y_test, lm_pred)
    # exporting model accuracy and predicted price movement float variables as output
    return acc_lm_score, lm_pred[-1]

Our API access had run out as of the date of this README, so it remains unknown whether the buy-sell-hold feature works.

Click here for latest version Featuring Buy Sell and Hold

Furthermore, we discussed an options feature based on Black-Scholes pricing that showcases put-call parity. It features an interactive input function. Outputs include put and call prices, with Greeks to measure price sensitivities.

Click here for latest version on options feature
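As a rough illustration of the idea (not the code behind the link above), the sketch below prices a European call and put with the standard Black-Scholes formulas for a non-dividend-paying stock and checks put-call parity, C - P = S - K·exp(-rT). The function names and input values are hypothetical, and Greeks are left out for brevity.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def black_scholes(spot, strike, rate, sigma, t):
    """Return (call, put) prices of European options, no dividends assumed."""
    d1 = (log(spot / strike) + (rate + 0.5 * sigma ** 2) * t) / (sigma * sqrt(t))
    d2 = d1 - sigma * sqrt(t)
    discount = exp(-rate * t)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


# Made-up inputs: at-the-money option, 5% rate, 20% volatility, one year.
spot, strike, rate, sigma, t = 100.0, 100.0, 0.05, 0.2, 1.0
call, put = black_scholes(spot, strike, rate, sigma, t)

# Put-call parity: the gap should be zero up to floating-point error.
parity_gap = (call - put) - (spot - strike * exp(-rate * t))
print(f"call={call:.4f}, put={put:.4f}, parity gap={parity_gap:.2e}")
```

Because both prices come from the same d1 and d2, the parity gap is zero up to floating-point error, which makes it a handy sanity check on any pricing function.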

Another topic that we spoke about was an Amazon Lex Bot.

Click here for latest version on Lambda function for Amazon Bot

Contributors

  • Richard Bertrand
  • Ava Lee
  • Devin Nigro
  • Brody Wacker

Files

Data

Dataframe

Stock Data

Models

Models of Good Fit

Scikit-learn Classifiers

Neural Net

LSTM

Future Steps

Future Steps

References
