Authors: Brendan Ferris, Michael Wirtz
This project analyzes the needs of Butterfly Ventures, a micro venture capital fund seeking a machine learning model that identifies, with high precision, companies that will be acquired. To model this problem, we collected a dataset of startups that each fall into one of three categories: closed, operating, or acquired. To minimize false positives (companies predicted to be acquired that are not), we chose precision as our target metric. Our baseline Logistic Regression model had a precision score ranging from 13% to 30%, exhibiting low predictive power. Our final and best model was a Random Forest model with a precision score of 28%.
Butterfly Ventures is a small VC fund that is low on capital. Because of its limited funds, it is looking for a way to better filter companies in the hope of making the most of its investments. The fund is aware of the following statistics: 75% of venture-backed startups fail. Under 50% of businesses make it to their fifth year. 33% of startups make it to the 10-year mark. Only 40% of startups actually turn a profit. Given this knowledge, Butterfly Ventures is targeting startups that it believes have the best chance of being acquired, which it regards as a reliable route to investment profits. For this purpose, it has hired a group of data scientists to create a model that classifies whether or not a startup will be acquired.
To help Butterfly Ventures, we trained our model on a Kaggle dataset with information on 54,000 companies sourced from Crunchbase. The three original classifications in the dataset were "closed," "operating," and "acquired"; however, we grouped operating and closed into one category (not acquired) in order to predict the "acquired" class. The feature definitions for the dataset can be found below, or on the Crunchbase website.
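The grouping of the three original labels into a binary target can be sketched as follows. This is a minimal illustration that assumes the labels are the three strings quoted above; the function name `to_binary_target` is hypothetical and not taken from the project notebooks.

```python
def to_binary_target(statuses):
    """Map raw status labels to 1 (acquired) or 0 (not acquired)."""
    # "operating" and "closed" both collapse into the negative class.
    return [1 if s.strip().lower() == "acquired" else 0 for s in statuses]


labels = ["operating", "acquired", "closed", "operating"]
target = to_binary_target(labels)
```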
Original Features |
---|
|
Engineered Features |
---|
|
Target |
---|
|
Certain columns were so strongly predictive of the target that they were dropped from the models. Those columns are as follows:
Removed From Original Dataset |
---|
|
Overall, this project uses the information in the dataset to build models that maximize precision.
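For reference, precision is the share of predicted acquisitions that turn out to be real acquisitions: true positives divided by the sum of true positives and false positives. The snippet below is a small sketch on made-up labels; the helper name `precision` and the example values are illustrative only, not project results.

```python
def precision(y_true, y_pred):
    """Precision = TP / (TP + FP) for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # With no positive predictions precision is undefined; return 0.0 here.
    return tp / (tp + fp) if (tp + fp) else 0.0


y_true = [1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
score = precision(y_true, y_pred)  # 1 true positive, 2 false positives
```

A missed acquisition (a false negative) does not lower this score, which is why precision suits a fund that cares most about not backing the wrong company.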
To get the most out of our features, we one-hot encoded (dummied) all of the categorical columns. We presumed that the category list column would be the most beneficial to our model, since it places each startup into specific business-type categories.
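As a rough sketch of what dummying looks like in pandas, the example below encodes a toy frame with `pd.get_dummies`. The column names (`category`, `region`, `funding`) and values are placeholders chosen for illustration and are not the dataset's actual columns.

```python
import pandas as pd

# Toy frame; column names and values are illustrative only.
df = pd.DataFrame({
    "category": ["Software", "Games", "Software"],
    "region": ["West", "East", "West"],
    "funding": [1_000_000, 250_000, 4_000_000],
})

# One indicator column per category level; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["category", "region"])
```

Each distinct value becomes its own column, so a high-cardinality column can widen the feature matrix considerably.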
Because the classes were highly imbalanced, we used a mixture of upsampling and downsampling techniques to balance the acquired (1) and not acquired (0) classes.
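One way to combine the two techniques is to pick a target size per class, downsample the majority class to it, and upsample the minority class to it. The sketch below uses only the standard library; the function `rebalance`, the target size of 40, and the toy data are assumptions for illustration, and the notebooks may use a different approach.

```python
import random


def rebalance(rows, labels, n_per_class, seed=0):
    """Return rows/labels with exactly n_per_class examples of each class."""
    rng = random.Random(seed)
    out_rows, out_labels = [], []
    for cls in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == cls]
        if len(members) >= n_per_class:
            # Downsample: draw without replacement.
            picked = rng.sample(members, n_per_class)
        else:
            # Upsample: keep every original row, then draw extras with replacement.
            picked = members + rng.choices(members, k=n_per_class - len(members))
        out_rows.extend(picked)
        out_labels.extend([cls] * n_per_class)
    return out_rows, out_labels


rows = list(range(100))
labels = [1] * 10 + [0] * 90          # 10% acquired, 90% not acquired
bal_rows, bal_labels = rebalance(rows, labels, n_per_class=40)
```

Resampling is usually applied to the training split only, so that duplicated rows do not appear in the evaluation data.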
For our logistic regression models, continuous variables with large values slowed training and hurt performance, so we standardized the continuous features.
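Standardizing a feature means subtracting its mean and dividing by its standard deviation, which is the per-column transformation that scikit-learn's `StandardScaler` applies. The sketch below shows the arithmetic with the standard library; the helper `standardize` and the funding values are made up for illustration, and the source does not say which tool the project used.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


funding = [50_000, 1_000_000, 25_000_000, 300_000_000]
scaled = standardize(funding)
```

In practice the mean and standard deviation would be computed on the training split and reused for the test split.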
We ran through multiple iterations of both logistic regression and random forest models in order to maximize the precision score.
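A comparison of the two model families on precision might look like the sketch below. It runs on synthetic data from `make_classification` rather than the startup dataset, and the hyperparameters are arbitrary assumptions, so the scores it produces say nothing about the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the startup data (about 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # zero_division=0 avoids a warning if a model predicts no positives.
    scores[name] = precision_score(y_test, preds, zero_division=0)
```

Fixing `random_state` keeps repeated runs comparable, which helps when checking whether a precision score is stable or fluctuating.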
Our random forest model yielded the best precision result at 28%. Although the logistic regression models initially produced high precision scores, the values fluctuated, indicating low predictive power.
The conclusions that can be drawn given our results include:
- Predicting whether a company would be acquired is a complex problem, and expanded data collection would greatly benefit the precision of the model.
- A combination of upsampling and downsampling drastically reduced false positives while preserving precision.
- Random Forest models yielded the highest precision figures.
- If Butterfly Ventures does not have the resources to collect more data, it may be beneficial to pivot into developing a more interpretable model, then drawing insights from it to guide investment decisions.
- Scrape data on startup management to get an indication of how management can affect acquisition.
See the full modeling process in the modeling notebook or review this presentation.
For additional info, contact Brendan Ferris or Michael Wirtz.
├── README.md
├── archive
│   ├── EDA_notebook.ipynb
│   ├── cleaning_notebook.ipynb
│   └── modeling_notebook.ipynb
├── data
│   ├── cleaned_investments_VC.csv
│   └── investments_VC.csv
├── images
│   ├── cat_frequency_graph.png
│   ├── class_imbalance_graph.png
│   ├── external-content.duckduckgo.com.jpg
│   ├── funding.png
│   ├── startup_acquisitions_blue.jpeg
│   └── startup_acquisitions_red.jpeg
└── slide_deck.pdf