Data Science 1: Introduction to Data Science
Final Project: Predicting Stock Prices from Financial Data¶
Harvard University
Fall 2023
Jake Pappo, Vincent Hock, Thomas Garity, Tomas Arevalo
Import Libraries¶
Note: xgboost is not included in the CS109a environment. A separate installation of that package is required.
!pip install xgboost
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import itertools
from scipy.stats import uniform
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score, RandomizedSearchCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, FunctionTransformer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from xgboost import XGBClassifier
from pandas.api.types import CategoricalDtype
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer, KNNImputer, MissingIndicator
import requests
from IPython.core.display import HTML
styles = requests.get(
"https://raw.githubusercontent.com/Harvard-IACS/2021-CS109A/master/"
"themes/static/css/cs109.css"
).text
HTML(styles)
Notebook Contents¶
1. Introduction
2. Data Description
3. EDA and Visualizations
4. Data Preprocessing
5. Modeling
6. Feature Importance
7. Results & Discussion
8. Future Work
1. Introduction¶
From quantitative trading firms like Jane Street and Citadel to the average day trader trying to get a leg up, the monetarily motivated have been attempting to "beat" the stock market for decades. It sounds like a straightforward task: predict the trajectory of a company's value before the rest of the market can, and invest (or divest) more quickly than everyone else. This path, though amply trodden, is a treacherous one. Unless absolutely confident in the accuracy of their model, any investor takes a risk when deciding to put money in (or pull it out), and when individual trades run to hundreds of thousands of dollars, that risk cannot be taken lightly. The reward of a successful risk, however, can be enormous.
Unfortunately, accurately modeling the stock market has thus far shown itself to be an intractable problem. Even firms with billion-dollar budgets haven't cracked the code to perfect predictions, so we've set our sights on a more manageable task: direction. It may not be possible to determine exactly how much a company's stock will rise or fall in a given year, but perhaps it is possible to determine whether it will rise or fall, and which factors contribute most to that outcome. Thus, we set out to explore the viability of a model that uses financial indicators of publicly traded companies from prior years to classify a company as either increasing or decreasing in value in the following year.
2. Data Description¶
The original data from Kaggle contained 5 datasets, one for each year from 2014 to 2018. For our purposes (i.e., predicting behavior in 2019), we decided to use only the data from the two years prior to 2019. Each dataset contains 200+ financial indicators from the 10-K (annual report) filings of over 4,000 US publicly traded companies.
More precisely, there are exactly 223 predictors and 4,392 distinct observations (companies). Among these predictors are Revenue, Operating Income, various expenses, various taxes, growth metrics, and a number of other lesser-known descriptors of a company's performance. Of these 223 predictors, all but one are quantitative. Sector, the lone categorical predictor, indicates each company's macro-area (Technology, Financial Services, Healthcare, ...).
Each dataset also contains the PRICE VAR [%] of the company in the year following the year of the annual report. That is, 2018_Financial_Data.csv contains 200+ financial indicators of each company for the year 2018, as well as the variation in company stock price from the first trading day in January 2019 to the last trading day in December 2019. The Class variable is a binary classification of PRICE VAR [%] and will serve as our response variable in the classification models that follow. If PRICE VAR [%] > 0, then Class = 1, and if PRICE VAR [%] < 0, then Class = 0. Thus, predicting Class = 1 for company X amounts to predicting that an investment in company X will yield a positive return. The PRICE VAR [%] response variable could be used in further exploration of a regression model, as opposed to the classification model that we will be pursuing.
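To make the encoding concrete, here is a minimal sketch (not part of the original pipeline) showing how Class follows from 2019 PRICE VAR [%] on a few hypothetical values:
# Illustrative only: how Class is derived from the price-variation column
import pandas as pd

toy = pd.DataFrame({'2019 PRICE VAR [%]': [32.8, -12.5, 0.4, -99.9]})
toy['Class'] = (toy['2019 PRICE VAR [%]'] > 0).astype(int)   # 1 if the price rose, 0 otherwise
print(toy)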
Load in the data¶
# Load in the data (2018, 2017)
df_2018 = pd.read_csv('data/2018_Financial_Data.csv')
df_2018 = df_2018.rename(columns={'Unnamed: 0':'Ticker'})
df_2017 = pd.read_csv('data/2017_Financial_Data.csv')
df_2017 = df_2017.rename(columns={'Unnamed: 0':'Ticker'})
df_2018 = df_2018.set_index('Ticker')
df_2017 = df_2017.set_index('Ticker')
df_2017 = df_2017.reindex(df_2018.index)
3. EDA and Visualizations¶
Let's first take a peek at our data. As stated above, each row (our observations) represents a specific company's stock, and each column represents a financial indicator of the company behind that stock. Immediately, we see that we have a large number of columns, 224 to be exact (including the response variables), along with 4,392 observations.
As we look deeper into our dataset, we can see that it was a pretty good year for the market. The averages of predictors such as Revenue, Revenue Growth, and Gross Profit are all positive, and the average Operating Income exceeded the average Operating Expenses. Additionally, looking at the final two variables, we see that the average stock price variation was a 20% increase, and looking at Class, our response variable, we see that 69% of the companies saw an increase in their stock price. Finally, we see that the minimum price variation was -99.86% while the maximum was 3756.72%, with a standard deviation of 82.6%. Because of this, it may be extremely hard to predict the exact PRICE VAR [%], but we hope to obtain good classification scores for Class.
df_2018.head()
Revenue | Revenue Growth | Cost of Revenue | Gross Profit | R&D Expenses | SG&A Expense | Operating Expenses | Operating Income | Interest Expense | Earnings before Tax | ... | Receivables growth | Inventory Growth | Asset Growth | Book Value per Share Growth | Debt Growth | R&D Expense Growth | SG&A Expenses Growth | Sector | 2019 PRICE VAR [%] | Class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Ticker | |||||||||||||||||||||
CMCSA | 9.450700e+10 | 0.1115 | 0.000000e+00 | 9.450700e+10 | 0.000000e+00 | 6.482200e+10 | 7.549800e+10 | 1.900900e+10 | 3.542000e+09 | 1.511100e+10 | ... | 0.2570 | 0.0000 | 0.3426 | 0.0722 | 0.7309 | 0.0000 | 0.1308 | Consumer Cyclical | 32.794573 | 1 |
KMI | 1.414400e+10 | 0.0320 | 7.288000e+09 | 6.856000e+09 | 0.000000e+00 | 6.010000e+08 | 3.062000e+09 | 3.794000e+09 | 1.917000e+09 | 2.196000e+09 | ... | 0.0345 | -0.0920 | -0.0024 | 0.0076 | -0.0137 | 0.0000 | -0.1265 | Energy | 40.588068 | 1 |
INTC | 7.084800e+10 | 0.1289 | 2.711100e+10 | 4.373700e+10 | 1.354300e+10 | 6.750000e+09 | 2.042100e+10 | 2.331600e+10 | -1.260000e+08 | 2.331700e+10 | ... | 0.1989 | 0.0387 | 0.0382 | 0.1014 | -0.0169 | 0.0390 | -0.0942 | Technology | 30.295514 | 1 |
MU | 3.039100e+10 | 0.4955 | 1.250000e+10 | 1.789100e+10 | 2.141000e+09 | 8.130000e+08 | 2.897000e+09 | 1.499400e+10 | 3.420000e+08 | 1.430300e+10 | ... | 0.4573 | 0.1511 | 0.2275 | 0.6395 | -0.5841 | 0.1738 | 0.0942 | Technology | 64.213737 | 1 |
GE | 1.216150e+11 | 0.0285 | 9.546100e+10 | 2.615400e+10 | 0.000000e+00 | 1.811100e+10 | 4.071100e+10 | -1.455700e+10 | 5.059000e+09 | -2.177200e+10 | ... | -0.2781 | -0.2892 | -0.1575 | -0.4487 | -0.2297 | 0.0000 | 0.0308 | Industrials | 44.757840 | 1 |
5 rows × 224 columns
df_2018.describe()
Revenue | Revenue Growth | Cost of Revenue | Gross Profit | R&D Expenses | SG&A Expense | Operating Expenses | Operating Income | Interest Expense | Earnings before Tax | ... | 3Y Dividend per Share Growth (per Share) | Receivables growth | Inventory Growth | Asset Growth | Book Value per Share Growth | Debt Growth | R&D Expense Growth | SG&A Expenses Growth | 2019 PRICE VAR [%] | Class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 4.346000e+03 | 4253.000000 | 4.207000e+03 | 4.328000e+03 | 4.155000e+03 | 4.226000e+03 | 4.208000e+03 | 4.357000e+03 | 4.208000e+03 | 4.321000e+03 | ... | 4067.000000 | 4268.000000 | 4160.000000 | 4178.000000 | 4121.000000 | 4128.000000 | 4133.000000 | 4144.000000 | 4392.000000 | 4392.000000 |
mean | 5.119287e+09 | 3.455278 | 3.144946e+09 | 2.043954e+09 | 1.180176e+08 | 9.005022e+08 | 1.435546e+09 | 6.541207e+08 | 1.001350e+08 | 5.584432e+08 | ... | 0.006081 | 36.768524 | 0.183066 | 1.389013 | 0.262530 | 9.928446 | 0.091891 | 0.153610 | 20.803948 | 0.693534 |
std | 2.049504e+10 | 195.504906 | 1.508813e+10 | 7.682369e+09 | 9.330891e+08 | 3.661116e+09 | 5.529831e+09 | 2.969341e+09 | 3.780021e+08 | 2.639327e+09 | ... | 0.239653 | 2347.079237 | 4.688013 | 35.123904 | 5.612666 | 363.717734 | 0.823281 | 0.839647 | 82.622147 | 0.461078 |
min | -6.894100e+07 | -3.461500 | -2.669055e+09 | -1.818220e+09 | -1.042000e+08 | -1.401594e+08 | -4.280000e+09 | -1.455700e+10 | -1.408252e+09 | -2.177200e+10 | ... | -1.000000 | -1.000000 | -1.000000 | -0.999100 | -32.258100 | -1.000000 | -1.000000 | -1.000000 | -99.864779 | 0.000000 |
25% | 6.501425e+07 | 0.000000 | 3.415500e+06 | 3.618903e+07 | 0.000000e+00 | 2.056226e+07 | 4.223644e+07 | -5.510000e+06 | 0.000000e+00 | -1.000800e+07 | ... | 0.000000 | -0.048075 | 0.000000 | -0.036700 | -0.108600 | -0.082850 | 0.000000 | -0.004650 | -7.477173 | 0.000000 |
50% | 4.982640e+08 | 0.074900 | 1.741180e+08 | 2.219470e+08 | 0.000000e+00 | 9.390450e+07 | 1.806253e+08 | 4.203800e+07 | 5.693500e+06 | 2.730700e+07 | ... | 0.000000 | 0.010200 | 0.000000 | 0.034750 | 0.026100 | 0.000000 | 0.000000 | 0.065700 | 17.639393 | 1.000000 |
75% | 2.457878e+09 | 0.188500 | 1.297814e+09 | 9.767015e+08 | 1.450150e+07 | 4.117162e+08 | 6.796040e+08 | 2.862690e+08 | 5.817075e+07 | 2.238810e+08 | ... | 0.042050 | 0.185900 | 0.080050 | 0.160575 | 0.138400 | 0.115425 | 0.009700 | 0.167625 | 39.625879 | 1.000000 |
max | 5.003430e+11 | 12739.000000 | 3.733960e+11 | 1.269470e+11 | 2.883700e+10 | 1.065100e+11 | 1.065100e+11 | 7.089800e+10 | 9.168000e+09 | 7.290300e+10 | ... | 4.079100 | 153332.333300 | 293.473000 | 1184.993800 | 313.395800 | 17646.823500 | 36.898100 | 43.718800 | 3756.716345 | 1.000000 |
8 rows × 223 columns
df_2018.info()
<class 'pandas.core.frame.DataFrame'> Index: 4392 entries, CMCSA to ZYME Columns: 224 entries, Revenue to Class dtypes: float64(222), int64(1), object(1) memory usage: 7.5+ MB
# Calculate value counts
df_2018['Class'].value_counts()
Class 1 3046 0 1346 Name: count, dtype: int64
Visualizations¶
With our original idea being to predict the exact change in price variation, we wanted to see its distribution. Since 75% of the data had a price variation of roughly 39% or lower, we chose to limit the $x$-axis to values from -110% to 200%. We added a vertical line at $x=0$ to more easily visualize the classes: every value to the right of the line has Class $=1$; every value to the left has Class $=0$. From this graph, we can see that the values are roughly normally distributed around 17-18%, with obvious outliers beyond 200%.
sns.histplot(df_2018, x='2019 PRICE VAR [%]')
plt.xlabel('2019 Price Variation [%] ')
plt.title('Distribution of Price Variations')
plt.axvline(x=0, color='red', linestyle='--', linewidth=2)
plt.xlim(-110, 200)
plt.show()
Similarly, we wanted to see how Revenue was distributed, as high revenue could potentially correlate with an increase in price variation. Like 2019 PRICE VAR [%], Revenue is roughly normally distributed (on a log scale), but with far fewer outliers.
sns.histplot(df_2018, x='Revenue', log_scale=True)
plt.xlabel('Revenue (log Scale)')
plt.title('Distribution of Company Revenue')
plt.show()
Because the simple distribution of Revenue doesn't tell us much on its own, we wanted to see how it looks with Class mixed in. Looking at the left plot, which shows the raw counts, we see that Class $=1$ has more counts at almost every revenue level (about twice as many), but this makes sense given that 69% of companies have Class $=1$. Thus, the proportion plot on the right is more informative. There, the two distributions are almost identical and both roughly normal on the log scale, but there is one small takeaway: for companies with revenue above the modal Revenue, the Class $=1$ proportion is slightly higher, while for companies below the modal Revenue, the Class $=0$ proportion is slightly higher. This makes sense, as companies with above-typical revenue probably do have a slightly higher probability of their stock price increasing, but overall Revenue didn't show much promise.
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Counts
sns.histplot(df_2018, x='Revenue', hue='Class', log_scale=True, ax=axes[0])
axes[0].set_xlabel('Revenue (log Scale)')
axes[0].set_title('Distribution of Company Revenue with Class(Count)')
# Proportion
sns.histplot(df_2018, x='Revenue', hue='Class', log_scale=True, stat='proportion', common_norm=False, ax=axes[1])
axes[1].set_xlabel('Revenue (log Scale)')
axes[1].set_title('Distribution of Company Revenue with Class(Proportion)')
plt.tight_layout()
plt.show()
We wanted to continue exploring the relationship between Revenue and 2019 PRICE VAR [%], so we plotted it, along with an additional plot of Revenue Growth vs. 2019 PRICE VAR [%]. The idea behind plotting revenue growth as well was to gain more information: a company could have high revenue, but if it was lower than the previous year's (i.e., negative revenue growth), that might help us predict a decrease in price variation. As before, we added a horizontal line at $y=0$ to help visualize the classes. When plotting just revenue, the points are fairly symmetric about the line $y=0$, which suggests that revenue isn't a particularly important predictor until it exceeds roughly $10^{10}$, about a factor of 10 above the modal revenue. The Revenue Growth plot echoes this conclusion, as it is also symmetric about $y=0$.
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Revenue vs. 2019 Price Change
sns.scatterplot(x='Revenue', y='2019 PRICE VAR [%]', data=df_2018, ax=axes[0])
axes[0].set_title('Revenue vs. 2019 Price Change (SymLog Scale)')
axes[0].set_xlabel('Revenue')
axes[0].set_ylabel('2019 Price Change (%)')
# horizontal line at y=0
axes[0].axhline(y=0, color='red', linestyle='--', linewidth=2)
axes[0].set_xscale('log')
axes[0].set_yscale('symlog')
# Revenue Growth vs. 2019 Price Change
sns.scatterplot(x='Revenue Growth', y='2019 PRICE VAR [%]', data=df_2018, ax=axes[1])
axes[1].set_title('Revenue Growth vs. 2019 Price Change (SymLog Scale)')
axes[1].set_xlabel('Revenue Growth')
axes[1].set_ylabel('2019 Price Change (%)')
# horizontal line at y=0
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=2)
axes[1].set_xscale('symlog')
axes[1].set_yscale('symlog')
plt.tight_layout()
plt.show()
Clearly, Revenue didn't seem to be a great predictor, so we moved on to exploring Sector, a categorical predictor that tells us which industry each company belongs to. The top 3 sectors were Financial Services, Healthcare, and Technology, each with more than 600 companies. Additionally, we wanted to see whether any specific sector had an unusual Class split, but the only sectors where we saw this were Utilities and Real Estate, which had low counts. Everything else roughly followed the ~70/30 split.
df_2018['Sector'].value_counts()
Sector Financial Services 824 Healthcare 691 Technology 636 Industrials 574 Consumer Cyclical 506 Basic Materials 276 Real Estate 255 Energy 248 Consumer Defensive 191 Utilities 102 Communication Services 89 Name: count, dtype: int64
# Calculate value counts
sector_counts = df_2018['Sector'].value_counts()
# Plot the bar chart
fig, ax1 = plt.subplots()
# Plot count on the left y-axis
ax1.bar(sector_counts.index, sector_counts.values)
ax1.set_xlabel('Sector')
ax1.set_ylabel('Count')
ax1.set_xticks(range(len(sector_counts)))
ax1.set_xticklabels(sector_counts.index, rotation=45, ha="right")
ax1.tick_params('y')
ax1.set_title(r'Distribution of $Sector$')
# Create a twin Axes on the right side for proportion
ax2 = ax1.twinx()
ax2.set_ylim(0,.2)
ax2.set_ylabel('Proportion')
# Calculate proportion and plot on the right y-axis
proportion = sector_counts / sector_counts.sum()
ax2.bar(sector_counts.index, proportion)
plt.show()
sns.histplot(data=df_2018, x='Sector', hue='Class', multiple="stack", shrink=0.8)
plt.title('Breakdown of Sector Counts with Class Hue')
plt.xlabel('Sector')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.show()
After exploring some predictors by hand, we decided to simply test all predictors against the response variable Class and plot the top 20 predictors with the highest correlation. The most highly correlated variable ended up being cashConversionCycle, with a correlation of almost 0.5. The next highest, operatingCycle, was slightly over 0.2, and the rest were around or slightly above 0.1. So, aside from cashConversionCycle, there are no real standout predictors, which makes sense: if a single predictor had a very high correlation with Class, the stock market would be very easy to predict. Thus, our models will have to combine a multitude of predictors in order to make the best possible predictions.
all_predictors = df_2018.drop(columns='2019 PRICE VAR [%]').columns
corrs = df_2018.drop(columns='2019 PRICE VAR [%]').corr(numeric_only=True)
correlation_with_class = corrs['Class'].sort_values()[-21:-1]  # top 20 predictors; the final entry (Class itself, corr = 1) is dropped
correlation_with_class.plot(kind='barh')
plt.title('Top 20 Predictors by Correlation with Response (Class)')
plt.xlabel('Correlation')
plt.ylabel('Predictor')
plt.show()
4. Data Preprocessing¶
Add the YoY change for each variable¶
For each predictor, we calculate the % change between 2017 and 2018. This should improve the predictions since it captures the trend of companies as opposed to absolute values at a given time. For example, a company's overall value at a given time can only tell you so much about how its stock price will change over the year. The % change of its value over the previous year, though, may offer insights into how the company will fare in the following year.
Furthermore, the above graph of correlations features many predictors that measure the growth of a given statistic over several years. This shows the importance of measuring change over time and thus led us to include the YoY % change variables. We also found that the models performed better with a single year-over-year comparison (2017 to 2018) than with multiple; this may be because older data is less relevant or has more missing values.
num_cols = df_2018.columns.drop(['Class', 'Sector', '2019 PRICE VAR [%]'])
perc_change_df = (df_2018[num_cols] - df_2017[num_cols]) / df_2017[num_cols]
perc_change_df.columns = perc_change_df.columns + ' (YoY)'
df = pd.concat([df_2018, perc_change_df], axis=1)
display(df.shape, df.head())
(4392, 445)
Revenue | Revenue Growth | Cost of Revenue | Gross Profit | R&D Expenses | SG&A Expense | Operating Expenses | Operating Income | Interest Expense | Earnings before Tax | ... | 10Y Dividend per Share Growth (per Share) (YoY) | 5Y Dividend per Share Growth (per Share) (YoY) | 3Y Dividend per Share Growth (per Share) (YoY) | Receivables growth (YoY) | Inventory Growth (YoY) | Asset Growth (YoY) | Book Value per Share Growth (YoY) | Debt Growth (YoY) | R&D Expense Growth (YoY) | SG&A Expenses Growth (YoY) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Ticker | |||||||||||||||||||||
CMCSA | 9.450700e+10 | 0.1115 | 0.000000e+00 | 9.450700e+10 | 0.000000e+00 | 6.482200e+10 | 7.549800e+10 | 1.900900e+10 | 3.542000e+09 | 1.511100e+10 | ... | inf | 1.394095 | 0.981435 | 1.325792 | NaN | 7.875648 | -0.761243 | 11.711304 | NaN | 1.505747 |
KMI | 1.414400e+10 | 0.0320 | 7.288000e+09 | 6.856000e+09 | 0.000000e+00 | 6.010000e+08 | 3.062000e+09 | 3.794000e+09 | 1.917000e+09 | 2.196000e+09 | ... | NaN | 0.128674 | -0.168657 | -0.393673 | -1.490144 | -0.846154 | -1.329004 | -0.751361 | NaN | 4.938967 |
INTC | 7.084800e+10 | 0.1289 | 2.711100e+10 | 4.373700e+10 | 1.354300e+10 | 6.750000e+09 | 2.042100e+10 | 2.331600e+10 | -1.260000e+08 | 2.331700e+10 | ... | -0.107338 | 0.351598 | 0.245161 | 0.017391 | -0.849709 | -0.563927 | 1.086420 | -1.279339 | 0.413043 | -0.146739 |
MU | 3.039100e+10 | 0.4955 | 1.250000e+10 | 1.789100e+10 | 2.141000e+09 | 8.130000e+08 | 2.897000e+09 | 1.499400e+10 | 3.420000e+08 | 1.430300e+10 | ... | NaN | NaN | NaN | -0.440748 | 0.865432 | -0.196397 | 0.370847 | -5.650478 | 0.357813 | -0.261176 |
GE | 1.216150e+11 | 0.0285 | 9.546100e+10 | 2.615400e+10 | 0.000000e+00 | 1.811100e+10 | 4.071100e+10 | -1.455700e+10 | 5.059000e+09 | -2.177200e+10 | ... | 2.686084 | -4.795148 | 12.712042 | 7.301493 | 1.202589 | -15.189189 | 0.821762 | 17.230159 | NaN | -3.933333 |
5 rows × 445 columns
Preliminary Feature Selection¶
Remove Duplicates¶
duplicate_cols = list()
for i, col1 in enumerate(df.columns):
for col2 in df.columns[i + 1:]:
if df[col1].equals(df[col2]):
duplicate_cols.append((col1, col2))
cols_to_drop = [pair[0] for pair in duplicate_cols]
df = df.drop(columns=cols_to_drop)
df.shape
(4392, 397)
Remove columns with at least $k$ missing values¶
ks = list(range(0, 4392, 12))
# Store the number of columns with > k missing values
num_cols = []
for k in ks:
# Count the number of missing values per column
missing_values_count = df.isnull().sum()
# Filter columns with more than k missing values
low_quality_columns = missing_values_count[missing_values_count >= k].index.tolist()
# Append the result to num_cols
num_cols.append(low_quality_columns)
# Flatten the list of lists
num_cols_flat = list(itertools.chain.from_iterable(num_cols))
plt.plot(ks, [len(cols) for cols in num_cols])
plt.xlabel(rf'$k$')
plt.axvline(x=2000, c='k', ls='--', label=r'$k=2000$ (cutoff)')
plt.title(r'Number of Columns with at least $k$ missing values')
plt.legend();
sparse_cols = df.count()[df.count() <= 2000].index
df = df.drop(columns=sparse_cols)
We remove all predictors with fewer than $2000$ non-missing values, i.e., columns where more than half of the ~$4400$ entries are missing. As seen in the chart above, the number of predictors with at least $k$ missing values drops off sharply as $k$ increases: the distribution of missing-value counts is right-skewed, and the majority of predictors have relatively few missing values. Setting the threshold around $2000$ removes the long tail of that distribution, letting us drop the roughly $20$ predictors with the least information while still retaining a large feature set.
Train-test Split¶
df = df.rename(columns={'2019 PRICE VAR [%]': 'Variation'})
X = df.drop(columns=['Variation', 'Class'])
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=209)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((3513, 376), (879, 376), (3513,), (879,))
One-hot Encoding¶
We one-hot encode the Sector variable so that it can be properly handled by our models. Because there is no inherent ordering of the sectors, encoding them as an ordinal feature would not make sense. One-hot encoding allots a binary column to each sector, in which a $1$ indicates that the corresponding company belongs to that sector. This will allow the model to consider a company's particular sector when predicting the change in its stock price. Companies within the same sector often operate, and thus perform, similarly, so this distinction should help the model.
The graph below demonstrates the importance of accounting for a company's sector when trying to predict the change in its stock price over the year. The correlations between the predictors and the response variable, Class, are noticeably higher when calculated within each sector. The large increase in correlations from across all sectors (the "All" box on the far left) to within individual sectors reflects the additional predictive power gained by including the Sector variable in the model.
oh_encoder = OneHotEncoder(drop=None, handle_unknown='ignore')
X_train_ohe = oh_encoder.fit_transform(X_train[['Sector']])
X_train_ohe = pd.DataFrame(X_train_ohe.toarray(), columns=oh_encoder.get_feature_names_out(['Sector']))
X_train_ohe.index = X_train.index
X_train = pd.concat([X_train.drop(columns=['Sector']), X_train_ohe], axis=1)
X_test_ohe = oh_encoder.transform(X_test[['Sector']])
X_test_ohe = pd.DataFrame(X_test_ohe.toarray(), columns=oh_encoder.get_feature_names_out(['Sector']))
X_test_ohe.index = X_test.index
X_test = pd.concat([X_test.drop(columns=['Sector']), X_test_ohe], axis=1)
# Plot the predictor-response correlations by sector
sectors = df['Sector'].unique().tolist()
# Initialize dictionary to store predictor-response correlations within each sector
sector_corrs = {}
# Store the correlations between the response and each predictor across ALL sectors (i.e., the whole dataset)
all_corr = df.corrwith(df['Class'], numeric_only=True)
sector_corrs['All'] = all_corr
# Store correlations for the data within each sector
for sector in sectors:
corrs = df[df['Sector'] == sector].corrwith(df['Class'], numeric_only=True)
sector_corrs[sector] = corrs
# Convert to df and drop response (trivial; corr = 1)
sector_corrs_df = pd.DataFrame(sector_corrs)
sector_corrs_df = sector_corrs_df.drop('Class')
# Plot
melted_df = sector_corrs_df.melt(var_name='Sector', value_name='Correlation')
plt.figure(figsize=(15, 8))
sns.boxplot(x='Sector', y='Correlation', data=melted_df)
plt.title('Predictor-Response Correlations', fontsize=15)
plt.xticks(rotation=45, ha='right')
plt.show()
Standardization¶
We decided to standardize the data since the range of values across predictors is huge, from small decimals (e.g., growth rates) to hundreds of billions (e.g., revenue of the largest companies). Standardization puts every predictor on a comparable scale, which our models handle more gracefully.
Furthermore, for parametric models, the coefficients of standardized predictors can be compared directly, whereas without standardization they would be skewed by each predictor's scale.
# Replace infinity values with NaN
X_train = X_train.replace([np.inf, -np.inf], np.nan)
X_test = X_test.replace([np.inf, -np.inf], np.nan)
# Define a function to apply a transformation to data and return them
def transform_data(transformer, X_train, X_test):
X_train_trans = transformer.fit_transform(X_train)
X_train_trans = pd.DataFrame(X_train_trans, columns=transformer.get_feature_names_out(), index=X_train.index)
X_test_trans = transformer.transform(X_test)
X_test_trans = pd.DataFrame(X_test_trans, columns=transformer.get_feature_names_out(), index=X_test.index)
return X_train_trans, X_test_trans
std_scaler = StandardScaler()
X_train_std, X_test_std = transform_data(std_scaler, X_train, X_test)
Imputation¶
Having dealt with the majority of the missingness in the dataset, we now turn to imputation to resolve the remaining NaN values. Through our previous work in Milestone 2, we determined that mean imputation was the most effective method to do so.
Note: The difference between mean imputation and kNN imputation was extremely marginal, and the selection of one imputation method over the other would have nearly no consequence on the model's performance.
mean_imp = SimpleImputer(strategy='mean')
X_train_final, X_test_final = transform_data(mean_imp, X_train_std, X_test_std)
# knn_imp = KNNImputer(n_neighbors=3)
# X_train_knnimp, X_test_knnimp = transform_data(knn_imp, X_train, X_test)
5. Modeling¶
Naive Model¶
# Naive model (old data): pool the Class labels from the 2014-2017 datasets
yearly_classes = []
for i in range(4, 8):
    yearly_classes.append(pd.read_csv(f'data/201{i}_Financial_Data.csv')['Class'])
data = pd.concat(yearly_classes, ignore_index=True)
print(f"{data.mean() * 100:.2f}% of stocks in the data from 2014-2017 improved.")
print(f"{df_2018['Class'].mean() * 100:.2f}% of stocks in the data from 2018 improved.")
51.53% of stocks in the data from 2014-2017 improved. 69.35% of stocks in the data from 2018 improved.
Let's assume that a naive model is built without access to data from the end of 2018; it would then be smart to look at the general trend of stocks in the data from the previous four years. As seen in the results above, $51.53$% of stocks had a YoY increase in price from 2014-2017. Furthermore, even with access to the data from the end of 2018, the majority of stocks, now $69.35$%, went up in price. Either way, the naive model would predict Class $=1$ for every stock. This achieves roughly $69$% accuracy on the $2018$ dataset, which is the baseline we are trying to beat.
# Naive Model
y_pred_train_naive = np.ones_like(y_train)
y_pred_test_naive = np.ones_like(y_test)
naive_train_acc = accuracy_score(y_train, y_pred_train_naive)
naive_test_acc = accuracy_score(y_test, y_pred_test_naive)
print(f'Naive Model Train Accuracy: {naive_train_acc}')
print(f'Naive Model Test Accuracy: {naive_test_acc}')
Naive Model Train Accuracy: 0.695417022487902 Naive Model Test Accuracy: 0.6860068259385665
Logistic Regression w/ L1 Penalty Term¶
# Logistic Regression (L1)
Cs = [.005, .01, .02, .05]
lasso = LogisticRegressionCV(Cs=Cs, cv=5, penalty='l1', solver='saga', max_iter=1000, n_jobs=-1, random_state=209).fit(X_train_final, y_train)
best_C = lasso.C_[0]
y_pred_train_lasso = lasso.predict(X_train_final)
y_pred_test_lasso = lasso.predict(X_test_final)
lasso_train_acc = accuracy_score(y_train, y_pred_train_lasso)
lasso_test_acc = accuracy_score(y_test, y_pred_test_lasso)
# Store the top 10 predictors' coefs and their names
coefficients = lasso.coef_[0]
feature_names = X_train.columns
coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
coef_df = coef_df.reindex(coef_df['Coefficient'].abs().sort_values(ascending=False).index)
top_10_coefs = coef_df.head(10)
# Plot
plt.figure(figsize=(10, 6))
plt.barh(top_10_coefs['Feature'], top_10_coefs['Coefficient'], color='skyblue')
plt.xlabel('Coefficient Value')
plt.title('Top 10 Coefficients in the L1-regularized Logistic Regression')
plt.grid(axis='x')
plt.show()
The L1-regularized logistic regression model conveniently performs feature selection by driving many coefficients to 0. This gives us an initial perspective on feature importance, as seen in the above graph, which displays the top 10 predictors by coefficient magnitude. This model also improved upon the naive model's test accuracy by about $2$ percentage points.
However, logistic regression is a parametric, linear model and may not capture complex relationships in the data. Furthermore, our feature set is so large that adding polynomial features would cause the number of predictors to get out of hand quickly (with 386 features, the degree-2 interaction terms alone would add over 74,000 columns). Therefore, we turn to non-parametric models hereafter.
# Store predictors deemed most important by LogisticRegressionCV (i.e., their coefficient was not driven to 0)
predictors = X_train_final.columns
important_predictors = []
for predictor, coef in zip(predictors, lasso.coef_.tolist()[0]):
    if coef != 0:  # keep any predictor whose coefficient was not driven to 0 (positive or negative)
important_predictors.append(predictor)
display(X_train_final.shape)
(3513, 386)
print(rf'Out of the original {len(predictors)} features, the L1-regularized logistic regression model drives the weights of {len(predictors) - len(important_predictors)} of them to 0.')
Out of the original 386 features, the L1-regularized logistic regression model drives the weights of 331 of them to 0.
# L1 Logistic Performance
print(f'L1 Logistic Train Accuracy: {lasso_train_acc}')
print(f'L1 Logistic Test Accuracy: {lasso_test_acc}')
L1 Logistic Train Accuracy: 0.7349843438656419 L1 Logistic Test Accuracy: 0.7076222980659841
kNN Classification¶
# kNN Classifier
ks = range(10, 101, 10)
train_acc_means = []
val_acc_means = []
for k in ks:
    knn_cv = cross_validate(KNeighborsClassifier(n_neighbors=k),
                            X_train_final, y_train, cv=5, scoring='accuracy', return_train_score=True)
    train_acc_means.append(np.mean(knn_cv['train_score']))
    val_acc_means.append(np.mean(knn_cv['test_score']))
best_idx = np.argmax(val_acc_means)
best_k = ks[best_idx]
# Plot the accuracies
plt.plot(ks, train_acc_means)
plt.plot(ks, val_acc_means)
# Add labels
plt.xlabel(r'$k$')
plt.ylabel('accuracy')
plt.axvline(best_k, c='k', ls='--', label=rf'best $k={best_k}$')
plt.title(r'kNN Classifier CV Scores')
plt.xticks(ks)
plt.legend();
This cross-validation graph shows that $k=60$ is the optimal hyperparameter for the kNN classifier, as it maximizes the validation accuracy.
# Fit a final kNN model on all the data
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_final, y_train)
y_pred_train_knn = knn.predict(X_train_final)
y_pred_test_knn = knn.predict(X_test_final)
knn_train_acc = accuracy_score(y_train, y_pred_train_knn)
knn_test_acc = accuracy_score(y_test, y_pred_test_knn)
This kNN model classifies a new data point based on the majority class of its $60$ nearest neighbors. In terms of performance, it is better than the naive model but slightly worse than logistic regression.
We will now turn to a new class of non-parametric models–decision trees.
# kNN Performance
print(f'kNN Train Accuracy: {knn_train_acc}')
print(f'kNN Test Accuracy: {knn_test_acc}')
kNN Train Accuracy: 0.7122117847993168 kNN Test Accuracy: 0.6996587030716723
Single Decision Tree¶
# Single Decision Tree CV
train_scores = []
cvmeans = []
depths = list(range(1,21))
for max_depth in depths:
tree = DecisionTreeClassifier(max_depth=max_depth, random_state=109)
tree.fit(X_train_final, y_train)
train_scores.append(accuracy_score(y_train, tree.predict(X_train_final)))
scores = cross_val_score(estimator=DecisionTreeClassifier(max_depth=max_depth, random_state=109), X=X_train_final, y=y_train, cv=5, n_jobs=-1)
cvmeans.append(scores.mean())
best_idx = np.argmax(cvmeans)
best_depth_tree = depths[best_idx]
cvmeans = np.array(cvmeans)
depths = range(1, 21)
plt.plot(depths, train_scores, label = 'training scores')
plt.plot(depths, cvmeans, label = 'mean validation scores')
plt.axvline(best_depth_tree, c='k', ls='--', label=f'best depth={best_depth_tree}')
plt.title('Non-CV Training Accuracy vs. CV Validation Accuracy')
plt.xlabel('Max Tree Depth')
plt.xticks(range(0, 21, 2))
plt.ylabel('Accuracy Score')
plt.legend()
plt.show()
Interestingly, despite the large feature space, cross-validation identifies a max depth of $3$ as optimal. This means that overfitting begins relatively quickly. Depth $3$ will now be used for a final single decision tree, fit on the entire training set.
# Fit a final decision tree on all the data
tree = DecisionTreeClassifier(max_depth=best_depth_tree, random_state=0).fit(X_train_final, y_train)
y_pred_train_tree = tree.predict(X_train_final)
y_pred_test_tree = tree.predict(X_test_final)
tree_train_acc = accuracy_score(y_train, y_pred_train_tree)
tree_test_acc = accuracy_score(y_test, y_pred_test_tree)
# Plot the decision tree
plt.figure(figsize=(35, 10))
plot_tree(tree, filled=True, feature_names=X_train.columns, class_names=['0', '1'], rounded=True, fontsize=20)
plt.show()
The above image allows us to visualize the single decision tree fit on all the training data. This improves upon the kNN classifier, but is still worse than the logistic regression model by ~$0.2$%.
We will stick with decision trees, but turn to ensemble methods, which perform better than a single tree since they are able to aggregate many of them into one model.
# Tree Performance
print(f'Single Tree Train Accuracy: {tree_train_acc}')
print(f'Single Tree Test Accuracy: {tree_test_acc}')
Single Tree Train Accuracy: 0.7378309137489325 Single Tree Test Accuracy: 0.7053469852104665
Bagging¶
After testing various values of n_estimators, we found that the model's performance plateaus before $200$. Thus, n_estimators $=200$ is sufficient to reduce the variance of the decision-tree base estimators; a sketch of how this could be checked follows. Below that, we perform CV to determine the optimal tree depth.
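This check was done informally rather than saved as a cell; one way to verify the plateau, using out-of-bag accuracy over an illustrative grid of ensemble sizes (the fully grown base trees here are also just for illustration):
# Sketch: check that bagging accuracy plateaus as the number of trees grows (illustrative settings)
tree_counts = [25, 50, 100, 200, 300]
plateau_scores = []
for n in tree_counts:
    bag_check = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=209),
                                  n_estimators=n, oob_score=True, n_jobs=-1,
                                  random_state=209).fit(X_train_final, y_train)
    plateau_scores.append(bag_check.oob_score_)
plt.plot(tree_counts, plateau_scores, marker='o')
plt.xlabel('n_estimators')
plt.ylabel('OOB accuracy')
plt.title('Bagging OOB Accuracy vs. Number of Trees')
plt.show()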
# Max Depth CV - Bagging
oob_scores = []
depths = list(range(3,25))
for depth in depths:
bagger = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=depth), n_estimators=100, oob_score=True, random_state=209
).fit(X_train_final, y_train)
oob_scores.append(bagger.oob_score_)
best_idx = np.argmax(oob_scores)
best_depth = depths[best_idx]
best_depth
14
# Bagging Performance
n_trees = 200
bagger = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=best_depth), n_estimators=n_trees, n_jobs=-1, random_state=209
).fit(X_train_final, y_train)
y_train_pred_bagger = bagger.predict(X_train_final)
y_test_pred_bagger = bagger.predict(X_test_final)
bagger_train_acc = accuracy_score(y_train, y_train_pred_bagger)
bagger_test_acc = accuracy_score(y_test, y_test_pred_bagger)
bagging_predictors_1 = [bagged_tree.tree_.feature[0] for bagged_tree in bagger.estimators_]
bagging_counts_1 = pd.Series(bagging_predictors_1).value_counts()
bagging_predictors_2 = [bagged_tree.tree_.feature[1] for bagged_tree in bagger.estimators_]
bagging_counts_2 = pd.Series(bagging_predictors_2).value_counts()
top_predictors_bagging = pd.DataFrame(index=range(X_train_final.shape[1]))
top_predictors_bagging['predictor'] = X_train_final.columns[top_predictors_bagging.index]
top_predictors_bagging['first count'] = bagging_counts_1
# top_predictors_bagging['second count'] = bagging_counts_2
top_predictors_bagging = top_predictors_bagging.dropna(thresh=2)
top_predictors_bagging = top_predictors_bagging.reindex(top_predictors_bagging['first count'].sort_values(ascending=False).index)
plt.figure(figsize=(12, 8))
plt.barh(top_predictors_bagging['predictor'], top_predictors_bagging['first count'], color='skyblue')
plt.xlabel('Number of Times Used for the First Split')
plt.ylabel('Predictor')
plt.title('Top Node Predictors')
plt.gca().invert_yaxis() # Invert y-axis for better readability
plt.show()
This graph demonstrates how correlated the individual trees in the bagging model are. Out of the $386$ predictors, only $15$ are ever used for the first split. Not only that, but Tangible Asset Value dominates the distribution, accounting for the first split in $80$ of the $200$ trees.
The bagging model has the highest test accuracy thus far, at roughly $73.7$%, a significant improvement over the logistic regression. However, in order to further improve generalizability, we fit a random forest model below. A random forest decorrelates the individual trees by only considering a random subset of the predictors at each split.
# Bagging Performance
print(f'Bagging Train Accuracy: {bagger_train_acc}')
print(f'Bagging Test Accuracy: {bagger_test_acc}')
Bagging Train Accuracy: 0.977227440933675 Bagging Test Accuracy: 0.7372013651877133
Random Forest on All Predictors¶
For the same reasons as above, we set n_estimators $=200$.
# Random Forest Classifier
train_acc_mean = []
val_acc_mean = []
k = 5
depths = list(range(3,30))
# Cross Validation
for max_depth in depths:
rf_scores = cross_validate(estimator=RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth, max_features='sqrt', random_state=209),
X=X_train_final,
y=y_train,
scoring='accuracy',
cv=k,
n_jobs=-1,
return_train_score=True)
train_acc_mean.append(np.mean(rf_scores['train_score']))
    val_acc_mean.append(np.mean(rf_scores['test_score']))
# Get best depth for RF
best_idx = np.argmax(val_acc_mean)
best_depth_rf = depths[best_idx]
print(f'The cross-validation results show that the best depth is {best_depth_rf}.')
The cross-validation results show that the best depth is 9.
# Plot the accuracies
plt.plot(depths, train_acc_mean)
plt.plot(depths, val_acc_mean)
# Add labels
plt.xlabel(r'depth')
plt.ylabel('accuracy')
plt.axvline(best_depth_rf, c='k', ls='--', label=rf'best depth={best_depth_rf}')
plt.title(r'Random Forest CV Scores')
plt.xticks(range(3,31,3))
plt.legend();
n_trees = 200
tree_depth = best_depth_rf
# Fit a RF
random_forest = RandomForestClassifier(n_estimators=n_trees,
max_depth=best_depth_rf,
max_features='sqrt',
random_state=0
).fit(X_train_final, y_train)
y_train_pred = random_forest.predict(X_train_final)
y_test_pred = random_forest.predict(X_test_final)
rf_train_acc = accuracy_score(y_train, y_train_pred)
rf_test_acc = accuracy_score(y_test, y_test_pred)
# RF Performance
print(f'RF Train Accuracy: {rf_train_acc}')
print(f'RF Test Accuracy: {rf_test_acc}')
RF Train Accuracy: 0.9373754625676061 RF Test Accuracy: 0.7485779294653014
Random Forest on Important Predictors¶
Given the nature of a random forest (i.e., because each split only considers a random subset of the features, it benefits from a higher ratio of important to unimportant predictors), it may be beneficial to reduce the likelihood of selecting an unimportant predictor at any given split. Thus, we will try a random forest model on only the important predictors identified by the lasso regularization.
# Random Forest Classifier (important predictors)
X_train_important = X_train_final[important_predictors]
X_test_important = X_test_final[important_predictors]
n_trees = 200
tree_depth = best_depth_rf
# Fit a RF
rf_important = RandomForestClassifier(n_estimators=n_trees,
max_depth=tree_depth,
max_features='sqrt',
random_state=0
).fit(X_train_important, y_train)
y_train_important_pred = rf_important.predict(X_train_important)
y_test_important_pred = rf_important.predict(X_test_important)
rf_important_train_acc = accuracy_score(y_train, y_train_important_pred)
rf_important_test_acc = accuracy_score(y_test, y_test_important_pred)
# RF Performance, after lasso selection
print(f'RF Train Accuracy (full data): {rf_train_acc}')
print(f'RF Test Accuracy (full data): {rf_test_acc}')
print(f'RF Train Accuracy (lasso-selected predictors): {rf_important_train_acc}')
print(f'RF Test Accuracy (lasso-selected predictors): {rf_important_test_acc}')
RF Train Accuracy (full data): 0.9373754625676061 RF Test Accuracy (full data): 0.7485779294653014 RF Train Accuracy (lasso-selected predictors): 0.9154568744662681 RF Test Accuracy (lasso-selected predictors): 0.7281001137656428
The random forest and its decorrelated trees improve upon the bagging model in test accuracy, while dropping slightly in training accuracy. As of now, the random forest has the highest test accuracy, at $74.9$%.
AdaBoost¶
Next, we move to another ensemble method: boosting. While bagging and random forests decrease the variance of deep decision trees by averaging many of them, boosting instead addresses the bias of low-complexity (i.e., shallow) decision trees by fitting them sequentially, with each new tree concentrating on the observations the previous trees got wrong.
# AdaBoost - (depth,iter,lr,acc): (2,150,.05) = .745, (1,200,.05)=.741, (2,200,.05)=.743, (2,200,0.3)=.744, (2,100,.075)=.743
ada_depth = 2
n_iters = 150
lr = 0.05
adaboost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=ada_depth),
n_estimators=n_iters,
learning_rate=lr,
random_state=0
).fit(X_train_final, y_train)
ada_train_acc = adaboost.score(X_train_final, y_train)
ada_test_acc = adaboost.score(X_test_final, y_test)
The AdaBoost model slightly underperforms the bagging and random forest models, though it is better than every other one.
Also, note that the train accuracy is much lower than the other two ensemble methods', since the base learner for AdaBoost is much simpler.
print(f'AdaBoost Train Accuracy: {ada_train_acc}')
print(f'AdaBoost Test Accuracy: {ada_test_acc}')
AdaBoost Train Accuracy: 0.7913464275547964 AdaBoost Test Accuracy: 0.7349260523321957
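The depth, number of iterations, and learning rate above were tuned by hand (see the comment in the AdaBoost cell). Since RandomizedSearchCV and scipy's uniform are already imported at the top of the notebook, the search could also be automated; the ranges below are illustrative and are not the exact settings we explored.
# Sketch: randomized search over AdaBoost hyperparameters (illustrative ranges)
param_dist = {'n_estimators': [100, 150, 200],
              'learning_rate': uniform(0.01, 0.3),      # samples uniformly from [0.01, 0.31)
              'estimator__max_depth': [1, 2, 3]}        # depth of the base decision tree
ada_search = RandomizedSearchCV(AdaBoostClassifier(estimator=DecisionTreeClassifier(), random_state=0),
                                param_distributions=param_dist, n_iter=20,
                                cv=5, scoring='accuracy', n_jobs=-1, random_state=209)
ada_search.fit(X_train_final, y_train)
print(ada_search.best_params_, ada_search.best_score_)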
Stacking (209)¶
Lastly, we build a stacking model, another ensemble method. Whereas bagging, random forests, and boosting are homogeneous, in that the ensemble consists of one type of model (e.g., decision trees of depth $k$), stacking is heterogeneous: it can combine different kinds of models, even other ensembles. Here, we use a stacking model with cross-validated hyperparameters to combine all the previous models we fit throughout the notebook.
estimators = [('lasso', LogisticRegression(C=best_C, penalty='l1', solver='liblinear', random_state=209)),
('knn', KNeighborsClassifier(n_neighbors=best_k)),
('bagger', BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=13), n_estimators=200, n_jobs=-1, random_state=209)),
('rf', RandomForestClassifier(n_estimators=200, max_depth=7, random_state=209)),
('adaboost', AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2), n_estimators=150, learning_rate=0.05, random_state=209))]
stacker = StackingClassifier(estimators=estimators,
final_estimator = LogisticRegression(),
cv=5,
n_jobs=-1)
stacker.fit(X_train_final, y_train)
StackingClassifier(cv=5, estimators=[('lasso', LogisticRegression(C=0.05, penalty='l1', random_state=209, solver='liblinear')), ('knn', KNeighborsClassifier(n_neighbors=60)), ('bagger', BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=13), n_estimators=200, n_jobs=-1, random_state=209)), ('rf', RandomForestClassifier(max_depth=7, n_estimators=200, random_state=209)), ('adaboost', AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2), learning_rate=0.05, n_estimators=150, random_state=209))], final_estimator=LogisticRegression(), n_jobs=-1)
stacker_train_acc = stacker.score(X_train_final, y_train)
stacker_test_acc = stacker.score(X_test_final, y_test)
# Stacker Performance
print(f'Stacker Train Accuracy: {stacker_train_acc}')
print(f'Stacker Test Accuracy: {stacker_test_acc}')
Stacker Train Accuracy: 0.9288357529177341 Stacker Test Accuracy: 0.7474402730375427
Overall, stacking performed slightly worse than the random forest, making it our second-best model by a marginal $0.12$ percentage points. Nonetheless, it improves upon the naive model's test accuracy by over $6$ percentage points, suggesting that financial-indicator data does carry real, if modest, signal about the direction of stock price changes.
6. Feature Importance¶
The two graphs below show the feature importances for the top 30 predictors in the Random Forest and AdaBoost models, respectively. The scale on the x-axis reveals that the Random Forest's importance is less concentrated in its top predictors. For example, Total Liabilities is its single most important feature, yet it accounts for only a little over $1$% of the total importance, and the 30 predictors displayed add up to only ~$18$%.
The AdaBoost model, while still utilizing many predictors, relies much more heavily on its most important ones. Its top feature, Debt Growth (YoY), has an importance of around $4$%, and its top 30 predictors together account for $57.8$%.
rf_importances = pd.DataFrame(data=[random_forest.feature_importances_], columns=X_train_std.columns, index=['RF']).T.sort_values(by='RF', ascending = True)
ada_importances = pd.DataFrame(data=[adaboost.feature_importances_], columns=X_train_std.columns, index=['Ada']).T.sort_values(by='Ada', ascending = True)
ada_importances[-30:].sum()
Ada 0.577874 dtype: float64
rf_importances[-30:].plot(kind='barh')
plt.title('Top 30 Predictors for Random Forest')
plt.xlabel('Importance')
plt.xlim(0, 0.05)
plt.show()
ada_importances[-30:].plot(kind='barh')
plt.title('Top 30 Predictors for AdaBoost')
plt.xlabel('Importance')
plt.xlim(0, 0.05)
plt.show()
7. Results & Discussion¶
Thinking back to our goal of classifying publicly-traded companies as increasing or decreasing in value based on their 10-K filings, we have proposed several models as viable solutions. Along the way, we have reached interesting findings about the various factors that play into a model's success at classifying companies:
- Model / Hyperparameter Selection: what are the strengths and weaknesses of our models, and what do their optimal hyperparameters tell us about the dataset?
- Preprocessing: how might our original preprocessing have affected our final models?
- Feature Importance: what can we learn about the dataset based on the importance of features in the models?
- Contextualizing our Strengths and Weaknesses: are our models' properties appropriate for their real-world applications?
First, let's look at how our various models compared:
classifiers = ['Naive Model',
'L1 Logistic Regression',
'kNN Classifier',
'Single Decision Tree',
'Bagging Model',
'Random Forest',
'AdaBoost',
'Stacking Model']
train_scores = [naive_train_acc, lasso_train_acc, knn_train_acc, tree_train_acc, bagger_train_acc, rf_train_acc, ada_train_acc, stacker_train_acc]
train_scores = [np.round(score * 100, 2) for score in train_scores]
test_scores = [naive_test_acc, lasso_test_acc, knn_test_acc, tree_test_acc, bagger_test_acc, rf_test_acc, ada_test_acc, stacker_test_acc]
test_scores = [np.round(score * 100, 2) for score in test_scores]
# Create the df
results_df = pd.DataFrame({'Training Accuracy':train_scores, 'Testing Accuracy':test_scores}, index=classifiers)
results_df
Training Accuracy | Testing Accuracy | |
---|---|---|
Naive Model | 69.54 | 68.60 |
L1 Logistic Regression | 73.50 | 70.76 |
kNN Classifier | 71.22 | 69.97 |
Single Decision Tree | 73.78 | 70.53 |
Bagging Model | 97.72 | 73.72 |
Random Forest | 93.74 | 74.86 |
AdaBoost | 79.13 | 73.49 |
Stacking Model | 92.88 | 74.74 |
Model/Hyperparameter Selection
The models with the highest test accuracy, in the end, were the Random Forest and the Stacking model. Posting an increase of roughly $6$ percentage points in test accuracy over the Naive model, the Stacker still managed to obtain high training accuracy as well. This suggests that it could handle the dataset's complexity without badly overfitting, perhaps because it combines a mix of simple and ensemble learners.
The random forest, on the other hand, was not expected to be our best model, since it does not combine multiple strong learners the way the stacker does. Its test accuracy shows that it also generalizes well; we suspect that combining it with (slightly) weaker models may have pulled the stacker's accuracy down a touch.
One limitation of our stacker is time and space complexity: is it worth training and storing five different models if the gain in test accuracy is less than 1%? Our ensemble models' test accuracies all clustered around 73-75%, and their closeness suggests choosing a model on factors other than test accuracy alone. The random forest is the more space-efficient option, as it houses smaller base learners without compromising much on train or test accuracy, so if we were strictly limited on storage space, we would pick it over the stacking model.
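To make the storage and latency trade-off concrete, a rough comparison like the sketch below could be run on the fitted models (the numbers are machine-dependent and were not part of our analysis):
# Sketch: rough serialized size and prediction time for the two best models (machine-dependent)
import pickle
import time

for name, model in [('Random Forest', random_forest), ('Stacking', stacker)]:
    size_mb = len(pickle.dumps(model)) / 1e6
    start = time.perf_counter()
    model.predict(X_test_final)
    elapsed = time.perf_counter() - start
    print(f'{name}: ~{size_mb:.1f} MB serialized, {elapsed:.2f}s to predict on the test set')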
Another limitation is the interpretability of the stacking model. Although we can perform analyses on the coefficients of the logistic regression combining our models, it is hard to discover the effect of one predictor on the outcome.
Some noteworthy hyperparameters were the max_depth values chosen for the single decision tree, the bagging model, and the random forest. While a shallow depth was optimal for a single decision tree, there was minimal drop-off in validation scores at greater depths. We believe this hints at the strength of ensemble methods: the deeper base learners in the bagging and random forest models were still reaching decent accuracies on their own (a quick check of this is sketched below).
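As a quick sanity check of that claim, one could score a few of the bagging model's individual base trees directly; a sketch (not part of our original analysis) is below.
# Sketch: accuracy of a few individual deep trees from the fitted bagging ensemble
solo_scores = [est.score(X_test_final.values, y_test) for est in bagger.estimators_[:5]]
print('Test accuracy of 5 individual bagged trees:', [round(s, 3) for s in solo_scores])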
Preprocessing:
We made decisions in our preprocessing that contributed to the strengths and weaknesses of the model:
- Standardization allowed the predictors to be on similar scales. Thus, when we selected our most important predictors using L1 regularization, we knew those selections were not driven by the scale of the predictors. Standardization, however, made our model less interpretable than a non-standardized version would be.
- Adding Year-over-Year Comparisons: these predictors allowed us to capture the rate of change of each financial indicator. We hypothesized that every model would lean on these predictors, as last year's changes might hint at next year's. The year-over-year predictors, though, were deemed less important in several of our models.
Feature Importance
- Building on the previous point about YoY variables: in the AdaBoost model, though, they were very important; perhaps when a weak learner can only make a few splits, these relative measures are more favorable. This hints that the YoY variables may be better suited for error correction, or that they capture the dataset's outliers best.
- We still think that it was worth experimenting with these YoY extra predictors. Perhaps they would be of more value in a regression model, as they could suggest when the stock price's rate of change is trending upwards or downwards.
- We found that predictors about assets and liabilities were consistently the most important in our models. This could be useful information for new investors wondering what factors affect a stock price the most.
- Finally, we also experimented with interaction terms (specifically, Sector interacting with each of the other predictors), but they did not improve the scores significantly. For example, we hypothesized that Capital Expenditure (the amount of money a company uses to buy and maintain physical assets, such as property and equipment) matters much more within the Industrials sector than within the Technology sector. Interaction terms would have been useful in the logistic regression, but a decision tree can effectively model the interaction of two predictors by default through successive splits. Since we mainly used tree-based ensemble models, interaction terms were not overly beneficial (a sketch of how such terms could be built follows this list).
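For reference, a minimal sketch of how such Sector interaction terms could be constructed from the one-hot encoded sectors is below; the choice of Capital Expenditure as the base column is illustrative and assumes that column survived our preprocessing.
# Sketch: build Sector x predictor interaction terms (illustrative; not used in the final models)
base_feat = 'Capital Expenditure'   # assumed to be present after preprocessing
sector_cols = [c for c in X_train_final.columns if c.startswith('Sector_')]
interactions = pd.DataFrame(index=X_train_final.index)
for sec in sector_cols:
    interactions[f'{base_feat} x {sec}'] = X_train_final[base_feat] * X_train_final[sec]
X_train_interact = pd.concat([X_train_final, interactions], axis=1)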
Contextualizing our Strengths and Weaknesses
We can place these strengths and weaknesses in the context of our target audience: If we are hoping to create a model that investors train, use, and interpret in real-time, space and time efficiency are key. If an investor, for example, is guessing what a company's Revenue will look like for the year and wants to quickly predict whether that is enough to make the stock go up or down, a stacking model is not the best solution. A stacker is optimal for a model that is trained once and then used on full observations, while a simpler model (perhaps our original L1 Logistic Regression) is better for interpretability.
8. Future Work¶
With more time, we believe that such a classification model is genuinely feasible. Some potential areas for improvement are:
- Time Series Data: We saw that YoY comparisons were useful in our boosting model, suggesting that true time-series data (the same predictors measured at multiple points in time) would be helpful in the future. With time series, we could get a more accurate proxy for a stock's trend at a given time.
- Regression instead of Classification: A classification model is great for binary decisions (like invest vs. not invest), but in the stock market, the true question might be "invest in stock x or stock y", which cannot be answered without a regression model. When we tried regression, we struggled to obtain an $R^2$ higher than 0.005 (see the sketch after this list). In the future, building a better regression model would be an exciting challenge.
- Continued Feature Engineering: With our large number of predictors, the curse of dimensionality is always a risk. If we had the time, we could spend it trimming less important features (using regularization, or by removing the least frequently used features in our ensemble models). This could lead to models that are overall more interpretable and efficient to train/store.
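For the regression direction mentioned above, a minimal sketch of the kind of baseline we have in mind is below; it reuses the preprocessed features and the renamed Variation column, and the estimator and settings are illustrative rather than the configuration behind the R² figure quoted above.
# Sketch: regression baseline on the raw 2019 price variation (illustrative settings)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

y_var_train = df.loc[X_train_final.index, 'Variation']
y_var_test = df.loc[X_test_final.index, 'Variation']
rf_reg = RandomForestRegressor(n_estimators=200, max_depth=9, n_jobs=-1, random_state=209)
rf_reg.fit(X_train_final, y_var_train)
print(f'Test R^2: {r2_score(y_var_test, rf_reg.predict(X_test_final)):.3f}')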