
PCA_in_Python

Principal Component Analysis on a data set of 10 stocks in order to reduce dimensionality to 2 principal components. Numpy, Pandas.

Today we will perform PCA on a set of 10 stocks I've chosen: the top 5 holdings of the Nasdaq and the top 5 holdings of the Dow Jones as of August 2018. The data is included in the repository as 'PCAstocks.csv'.

Nasdaq: Apple, Amazon, Microsoft, Google, Facebook

Dow: Boeing, UnitedHealth Group, Goldman Sachs, 3M, Home Depot

We first decide the number of dimensions we want to reduce our data set to. We can do this by naively guessing and checking, or we can write a loop and see where the diminishing returns of adding dimensions (or principal components) kick in.
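The loop can be sketched roughly as below. This is a minimal illustration, not the repo's exact code: the synthetic return matrix stands in for the returns computed from 'PCAstocks.csv', and is built with two dominant common factors so the scree drops off after the second component, as in the repo's figure.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the repo's 10-stock daily returns (250 days),
# constructed from two common factors plus idiosyncratic noise.
rng = np.random.default_rng(0)
returns = rng.normal(size=(250, 2)) @ rng.normal(size=(2, 10)) \
          + 0.1 * rng.normal(size=(250, 10))

# Fit PCA with all components once; explained_variance_ratio_ holds the
# marginal share of variance each additional principal component explains.
pca = PCA()
pca.fit(returns)
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {ratio:.1%}")
```

Plotting these ratios against the component index gives the scree graph below; we look for the point where the curve flattens.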

Here is what our graph looks like: figure_1

As we can see, the effect of adding principal components drops off significantly after the second (from ~15% to below 8%). With this in mind we will make our desired number of dimensions 2, which is also nice because it allows for easy visualization.

We fit our data using the number of dimensions we want and scikit-learn's PCA(). Here is a printout of the desired dimensions and the variability of the data they explain: figure_2
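The fitting step looks roughly like this, again using a synthetic stand-in for the repo's return matrix rather than the actual CSV:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 10-stock return matrix (250 days x 10 stocks).
rng = np.random.default_rng(0)
returns = rng.normal(size=(250, 2)) @ rng.normal(size=(2, 10)) \
          + 0.1 * rng.normal(size=(250, 10))

# Fit with the 2 dimensions chosen above; explained_variance_ratio_ now
# reports only the share of variance the two kept components explain.
pca = PCA(n_components=2)
pca.fit(returns)
print(pca.explained_variance_ratio_)
```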

Now we calculate our factor returns and factor exposures (risk variables). We can then plot the factor exposures on our first principal component against those on our second. Here's the result: figure_3
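In scikit-learn terms, the factor returns are the data projected onto the components (`transform`), and the factor exposures are each stock's loadings on the components (`components_`). A sketch under the same synthetic stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to display interactively
import matplotlib.pyplot as plt

# Synthetic stand-in for the 10-stock return matrix (250 days x 10 stocks).
rng = np.random.default_rng(0)
returns = rng.normal(size=(250, 2)) @ rng.normal(size=(2, 10)) \
          + 0.1 * rng.normal(size=(250, 10))

pca = PCA(n_components=2).fit(returns)
factor_returns = pca.transform(returns)  # (250, 2): daily return of each PC
factor_exposures = pca.components_.T     # (10, 2): each stock's loading on PC1/PC2

# One point per stock; in the real data these separate into two groups.
plt.scatter(factor_exposures[:, 0], factor_exposures[:, 1])
plt.xlabel("PC1 exposure")
plt.ylabel("PC2 exposure")
plt.savefig("exposures.png")
```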

Very interesting! We can see two rather distinct groupings of the data. Not coincidentally, they group into their respective tech and Dow industries, from which the data was pulled (top 5 holdings of QQQ, top 5 of DIA).

PCA allows us to reduce the number of dimensions we are working with, which with financial data can often be extremely numerous (the curse of dimensionality). This allows for easier visualization, computation, and intuition.

We could further pair this with k-means clustering as a way to cluster and classify our factor exposures, and then try to classify any new data as one risk exposure or the other.
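That follow-up could look roughly like this. The exposure matrix here is hypothetical, standing in for the (10, 2) factor exposures computed from the real data, with two loose groups mimicking the tech/Dow split:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (10, 2) factor-exposure matrix: two loose groups of 5 stocks.
rng = np.random.default_rng(1)
exposures = np.vstack([
    rng.normal(loc=[0.4, 0.3], scale=0.05, size=(5, 2)),   # "tech-like" group
    rng.normal(loc=[0.2, -0.3], scale=0.05, size=(5, 2)),  # "Dow-like" group
])

# Cluster the exposures into two groups, then classify a new stock's
# (illustrative) exposures as one risk profile or the other.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(exposures)
print(km.labels_)                  # cluster id assigned to each stock
print(km.predict([[0.38, 0.28]]))  # which cluster a new point falls into
```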

PCA is one of the most common dimensionality reduction techniques in data science and rather easy to implement.
