Data Science 1: Introduction to Data Science

Final Project: Predicting Stock Prices from Financial Data¶

Harvard University
Fall 2023
Jake Pappo, Vincent Hock, Thomas Garity, Tomas Arevalo


Import Libraries¶

Note: xgboost is not included in the CS109a environment. A separate installation of that package is required.

In [ ]:
!pip install xgboost
In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import itertools

from scipy.stats import uniform
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score, RandomizedSearchCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, FunctionTransformer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, StackingClassifier
from xgboost import XGBClassifier

from pandas.api.types import CategoricalDtype
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer, KNNImputer, MissingIndicator
In [ ]:
import requests
from IPython.core.display import HTML
styles = requests.get(
    "https://raw.githubusercontent.com/Harvard-IACS/2021-CS109A/master/"
    "themes/static/css/cs109.css"
).text
HTML(styles)
Out[ ]:

Notebook Contents¶

  • 1. Introduction

  • 2. Data Description

  • 3. EDA and Visualizations

  • 4. Data Preprocessing

  • 5. Modeling

  • 6. Feature Importance

  • 7. Results & Discussion

  • 8. Future Work

1. Introduction¶

From quantitative trading firms like Jane Street and Citadel to the average day trader trying to get a leg up, the monetarily motivated have been attempting to "beat" the stock market for decades. It sounds like a straightforward task: predict the trajectory of a company's value before the rest of the market can, and invest (or divest) more quickly than everyone else. This path, though amply trodden, is a treacherous one. Unless absolutely confident in the accuracy of their model, any investor takes a risk when deciding to put money in (or pull it out), and when these trades reach magnitudes of hundreds of thousands of dollars, that risk cannot be taken lightly. The reward of a successful risk, however, can be enormous.

Unfortunately, accurately modeling the stock market has thus far shown itself to be an intractable problem. Even firms with billions in budget haven't cracked the code to perfect predictions, so we've set our eyes on a more manageable task: direction. It may not be possible to determine exactly how much a company's stock will rise or fall in a given year, but perhaps it is possible to determine whether a company's stock will rise or fall, and which factors might contribute most to that result. Thus, we've set out to explore the viability of a model that uses financial indicators of publicly traded companies from prior years to classify a company as either increasing or decreasing in value in the following year.

2. Data Description¶

The original data from Kaggle contained 5 datasets, one for each year from 2014 to 2018. For our purposes (i.e., predicting behavior in 2019), we decided to use only the data from the two years prior to 2019 (2017 and 2018). Each dataset contains 200+ financial indicators from the 10-K (annual report) filings of over 4,000 US publicly traded companies.

More precisely, there are exactly 223 predictors and 4,392 distinct observations (companies). Among these predictors are Revenue, Operating Income, various expenses, various taxes, growth metrics, and a number of other lesser-known descriptors of a company's performance. Of these 223 predictors, all but one are quantitative. Sector, the lone categorical predictor, indicates each company's macro-area (Technology, Financial Services, Healthcare, ...).

Each dataset also contains the PRICE VAR [%] of each company in the year following the year of the annual report. That is, 2018_Financial_Data.csv contains 200+ financial indicators for each company for the year 2018, as well as the variation in the company's stock price from the first trading day of January 2019 to the last trading day of December 2019. The Class variable is a binary version of PRICE VAR [%] and will serve as the response variable in the classification models that follow: if PRICE VAR [%] > 0, Class = 1, and if PRICE VAR [%] < 0, Class = 0. Thus, predicting Class = 1 for company X amounts to predicting that an investment in company X will yield a positive return. The PRICE VAR [%] variable itself could be used in a future regression model, as opposed to the classification models we pursue here.
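For concreteness, the mapping from PRICE VAR [%] to Class can be reproduced in a couple of lines. This is a minimal sketch with made-up values (the Kaggle files already ship with Class precomputed), and it treats a variation of exactly 0% as Class = 0:

In [ ]:
# Illustration only (hypothetical values): Class is the sign of the following
# year's price variation, with an exact 0% change mapped to Class = 0
import pandas as pd

price_var = pd.Series([32.79, -12.40, 0.0, 5.10], name='2019 PRICE VAR [%]')
derived_class = (price_var > 0).astype(int)
print(derived_class.tolist())  # [1, 0, 0, 1]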

Load in the data¶

In [ ]:
# Load in the data (2018, 2017)
df_2018 = pd.read_csv('data/2018_Financial_Data.csv')
df_2018 = df_2018.rename(columns={'Unnamed: 0':'Ticker'})

df_2017 = pd.read_csv('data/2017_Financial_Data.csv')
df_2017 = df_2017.rename(columns={'Unnamed: 0':'Ticker'})

df_2018 = df_2018.set_index('Ticker')
df_2017 = df_2017.set_index('Ticker')

df_2017 = df_2017.reindex(df_2018.index)

3. EDA and Visualizations¶

Let's first take a peek at our data. As stated above, each row (observation) represents a specific company's stock, and each column (predictor) represents a financial indicator of the company the stock belongs to. Immediately, we see that the dataset is wide: 224 columns in total, including the response variables, along with 4,392 observations.

As we look deeper into our dataset, we can see that it was a pretty good year for the market. The averages of predictors such as Revenue, Revenue Growth, and Gross Profit are all positive, as is the average Operating Income. Additionally, looking at the final two variables, the average stock price variation was roughly a 20% increase, and Class, our response variable, shows that about 69% of companies saw their stock price rise. Finally, the minimum price variation was -99.86% while the maximum was 3756.72%, with a standard deviation of 82.6%. Because of this spread, predicting the exact PRICE VAR [%] may be extremely hard, but we hope to get good classification scores for Class.

In [ ]:
df_2018.head()
Out[ ]:
Revenue Revenue Growth Cost of Revenue Gross Profit R&D Expenses SG&A Expense Operating Expenses Operating Income Interest Expense Earnings before Tax ... Receivables growth Inventory Growth Asset Growth Book Value per Share Growth Debt Growth R&D Expense Growth SG&A Expenses Growth Sector 2019 PRICE VAR [%] Class
Ticker
CMCSA 9.450700e+10 0.1115 0.000000e+00 9.450700e+10 0.000000e+00 6.482200e+10 7.549800e+10 1.900900e+10 3.542000e+09 1.511100e+10 ... 0.2570 0.0000 0.3426 0.0722 0.7309 0.0000 0.1308 Consumer Cyclical 32.794573 1
KMI 1.414400e+10 0.0320 7.288000e+09 6.856000e+09 0.000000e+00 6.010000e+08 3.062000e+09 3.794000e+09 1.917000e+09 2.196000e+09 ... 0.0345 -0.0920 -0.0024 0.0076 -0.0137 0.0000 -0.1265 Energy 40.588068 1
INTC 7.084800e+10 0.1289 2.711100e+10 4.373700e+10 1.354300e+10 6.750000e+09 2.042100e+10 2.331600e+10 -1.260000e+08 2.331700e+10 ... 0.1989 0.0387 0.0382 0.1014 -0.0169 0.0390 -0.0942 Technology 30.295514 1
MU 3.039100e+10 0.4955 1.250000e+10 1.789100e+10 2.141000e+09 8.130000e+08 2.897000e+09 1.499400e+10 3.420000e+08 1.430300e+10 ... 0.4573 0.1511 0.2275 0.6395 -0.5841 0.1738 0.0942 Technology 64.213737 1
GE 1.216150e+11 0.0285 9.546100e+10 2.615400e+10 0.000000e+00 1.811100e+10 4.071100e+10 -1.455700e+10 5.059000e+09 -2.177200e+10 ... -0.2781 -0.2892 -0.1575 -0.4487 -0.2297 0.0000 0.0308 Industrials 44.757840 1

5 rows × 224 columns

In [ ]:
df_2018.describe()
Out[ ]:
Revenue Revenue Growth Cost of Revenue Gross Profit R&D Expenses SG&A Expense Operating Expenses Operating Income Interest Expense Earnings before Tax ... 3Y Dividend per Share Growth (per Share) Receivables growth Inventory Growth Asset Growth Book Value per Share Growth Debt Growth R&D Expense Growth SG&A Expenses Growth 2019 PRICE VAR [%] Class
count 4.346000e+03 4253.000000 4.207000e+03 4.328000e+03 4.155000e+03 4.226000e+03 4.208000e+03 4.357000e+03 4.208000e+03 4.321000e+03 ... 4067.000000 4268.000000 4160.000000 4178.000000 4121.000000 4128.000000 4133.000000 4144.000000 4392.000000 4392.000000
mean 5.119287e+09 3.455278 3.144946e+09 2.043954e+09 1.180176e+08 9.005022e+08 1.435546e+09 6.541207e+08 1.001350e+08 5.584432e+08 ... 0.006081 36.768524 0.183066 1.389013 0.262530 9.928446 0.091891 0.153610 20.803948 0.693534
std 2.049504e+10 195.504906 1.508813e+10 7.682369e+09 9.330891e+08 3.661116e+09 5.529831e+09 2.969341e+09 3.780021e+08 2.639327e+09 ... 0.239653 2347.079237 4.688013 35.123904 5.612666 363.717734 0.823281 0.839647 82.622147 0.461078
min -6.894100e+07 -3.461500 -2.669055e+09 -1.818220e+09 -1.042000e+08 -1.401594e+08 -4.280000e+09 -1.455700e+10 -1.408252e+09 -2.177200e+10 ... -1.000000 -1.000000 -1.000000 -0.999100 -32.258100 -1.000000 -1.000000 -1.000000 -99.864779 0.000000
25% 6.501425e+07 0.000000 3.415500e+06 3.618903e+07 0.000000e+00 2.056226e+07 4.223644e+07 -5.510000e+06 0.000000e+00 -1.000800e+07 ... 0.000000 -0.048075 0.000000 -0.036700 -0.108600 -0.082850 0.000000 -0.004650 -7.477173 0.000000
50% 4.982640e+08 0.074900 1.741180e+08 2.219470e+08 0.000000e+00 9.390450e+07 1.806253e+08 4.203800e+07 5.693500e+06 2.730700e+07 ... 0.000000 0.010200 0.000000 0.034750 0.026100 0.000000 0.000000 0.065700 17.639393 1.000000
75% 2.457878e+09 0.188500 1.297814e+09 9.767015e+08 1.450150e+07 4.117162e+08 6.796040e+08 2.862690e+08 5.817075e+07 2.238810e+08 ... 0.042050 0.185900 0.080050 0.160575 0.138400 0.115425 0.009700 0.167625 39.625879 1.000000
max 5.003430e+11 12739.000000 3.733960e+11 1.269470e+11 2.883700e+10 1.065100e+11 1.065100e+11 7.089800e+10 9.168000e+09 7.290300e+10 ... 4.079100 153332.333300 293.473000 1184.993800 313.395800 17646.823500 36.898100 43.718800 3756.716345 1.000000

8 rows × 223 columns

In [ ]:
df_2018.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4392 entries, CMCSA to ZYME
Columns: 224 entries, Revenue to Class
dtypes: float64(222), int64(1), object(1)
memory usage: 7.5+ MB
In [ ]:
# Calculate value counts
df_2018['Class'].value_counts()
Out[ ]:
Class
1    3046
0    1346
Name: count, dtype: int64

Visualizations¶

Since our original idea was to predict the exact change in price variation, we wanted to see its distribution. Because 75% of the data had a price variation of roughly 39% or lower, we limited the $x$ axis to values from -110% to 200%. We added a vertical line at $x=0$ to more easily visualize the class: every value to the right of the line has Class $=1$, and every value to the left has Class $=0$. From this graph, we can see that the values are roughly normally distributed around 17-18%, with obvious outliers once the price variation exceeds 200%.

In [ ]:
sns.histplot(df_2018, x='2019 PRICE VAR [%]')
plt.xlabel('2019 Price Variation [%] ')
plt.title('Distribution of Price Variations')
plt.axvline(x=0, color='red', linestyle='--', linewidth=2)
plt.xlim(-110, 200)
plt.show()
No description has been provided for this image

Similarly, we wanted to see how Revenue was distributed, since high revenue could potentially be correlated with an increase in price variation. Like 2019 PRICE VAR [%], Revenue looks roughly normal (here on a log scale), but it has far fewer outliers.

In [ ]:
sns.histplot(df_2018, x='Revenue', log_scale=True)
plt.xlabel('Revenue (log Scale)')
plt.title('Distribution of Company Revenue')
plt.show()
No description has been provided for this image

Because the simple distribution of Revenue doesn't tell us much on its own, we wanted to see how it looks with Class mixed in. The left plot shows the raw counts: when Class $=1$ there are more counts at almost every revenue level (about twice as many), which makes sense given that 69% of companies have Class $=1$. Thus, we want to look more closely at the proportion plot on the right. The two proportion distributions are almost identical and both roughly normal, but there is one small takeaway: for companies with revenue above the modal Revenue, the counts for Class $=1$ are slightly higher, while for companies with revenue below the modal Revenue, the counts for Class $=0$ are slightly higher. This makes sense, as companies with above-typical revenue probably have a slightly higher probability of their stock price increasing, but overall Revenue did not show much promise as a predictor.

In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Counts
sns.histplot(df_2018, x='Revenue', hue='Class', log_scale=True, ax=axes[0])
axes[0].set_xlabel('Revenue (log Scale)')
axes[0].set_title('Distribution of Company Revenue with Class(Count)')

# Proportion
sns.histplot(df_2018, x='Revenue', hue='Class', log_scale=True, stat='proportion', common_norm=False, ax=axes[1])
axes[1].set_xlabel('Revenue (log Scale)')
axes[1].set_title('Distribution of Company Revenue with Class(Proportion)')

plt.tight_layout()
plt.show()
No description has been provided for this image

We wanted to continue exploring the relationship between Revenue and 2019 PRICE VAR [%], so we plotted it, along with an additional plot of Revenue Growth vs. 2019 PRICE VAR [%]. The idea behind also plotting revenue growth was to gain more information: a company could have high revenue, but if it was lower than the previous year's (i.e., negative revenue growth), that might help us predict a decrease in price variation. As earlier, we added a horizontal line at $y=0$ to help visualize the classes. When plotting just the revenue, the points are fairly symmetrical about the line $y=0$, which indicates that Revenue probably isn't that important a predictor until it exceeds roughly $10^{10}$, a factor of 10 above the modal revenue. The Revenue Growth plot echoes this conclusion, as it is also symmetrical about the line $y=0$.

In [ ]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Revenue vs. 2019 Price Change
sns.scatterplot(x='Revenue', y='2019 PRICE VAR [%]', data=df_2018, ax=axes[0])
axes[0].set_title('Revenue vs. 2019 Price Change (SymLog Scale)')
axes[0].set_xlabel('Revenue')
axes[0].set_ylabel('2019 Price Change (%)')

# horizontal line at y=0
axes[0].axhline(y=0, color='red', linestyle='--', linewidth=2)
axes[0].set_xscale('log')
axes[0].set_yscale('symlog')

# Revenue Growth vs. 2019 Price Change
sns.scatterplot(x='Revenue Growth', y='2019 PRICE VAR [%]', data=df_2018, ax=axes[1])
axes[1].set_title('Revenue Growth vs. 2019 Price Change (SymLog Scale)')
axes[1].set_xlabel('Revenue Growth')
axes[1].set_ylabel('2019 Price Change (%)')

# horizontal line at y=0
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=2)
axes[1].set_xscale('symlog')
axes[1].set_yscale('symlog')

plt.tight_layout()
plt.show()
No description has been provided for this image

Clearly, Revenue didn't seem to be a great predictor, so we moved on to exploring Sector, a categorical predictor indicating the industry each company belongs to. The top 3 sectors were Financial Services, Healthcare, and Technology, each with more than 600 companies. Additionally, we wanted to see whether any specific sector had an unusual split for Class, but the only sectors where we saw this were Utilities and Real Estate, which have low counts. Everything else seemed to follow the ~70/30 split.

In [ ]:
df_2018['Sector'].value_counts()
Out[ ]:
Sector
Financial Services        824
Healthcare                691
Technology                636
Industrials               574
Consumer Cyclical         506
Basic Materials           276
Real Estate               255
Energy                    248
Consumer Defensive        191
Utilities                 102
Communication Services     89
Name: count, dtype: int64
In [ ]:
# Calculate value counts
sector_counts = df_2018['Sector'].value_counts()

# Plot the bar chart
fig, ax1 = plt.subplots()

# Plot count on the left y-axis
ax1.bar(sector_counts.index, sector_counts.values)
ax1.set_xlabel('Sector')
ax1.set_ylabel('Count')
ax1.set_xticks(range(len(sector_counts)))
ax1.set_xticklabels(sector_counts.index, rotation=45, ha="right")
ax1.tick_params('y')
ax1.set_title(r'Distribution of $Sector$')

# Create a twin Axes on the right side for proportion
ax2 = ax1.twinx()
ax2.set_ylim(0,.2)
ax2.set_ylabel('Proportion')

# Calculate proportion and plot on the right y-axis
proportion = sector_counts / sector_counts.sum()
ax2.bar(sector_counts.index, proportion)

plt.show()
No description has been provided for this image
In [ ]:
sns.histplot(data=df_2018, x='Sector', hue='Class', multiple="stack", shrink=0.8)
plt.title('Breakdown of Sector Counts with Class Hue')
plt.xlabel('Sector')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')

plt.show()
No description has been provided for this image

After exploring some predictors by hand, we decided to simply compute the correlation of every predictor with the response variable Class and plot the 20 predictors with the highest correlation. The most highly correlated variable ended up being cashConversionCycle, with a correlation of almost 0.5. The next highest, operatingCycle, was slightly over 0.2, and the rest were around or slightly above 0.1. Therefore, aside from cashConversionCycle, there are no real standout predictors, which makes sense: if a single predictor had a very high correlation with Class, the stock market would be very easy to predict. Thus, our models will have to combine many predictors in order to make the best possible prediction.

In [ ]:
all_predictors = df_2018.drop(columns='2019 PRICE VAR [%]').columns

corrs = df_2018.drop(columns='2019 PRICE VAR [%]').corr(numeric_only=True)
# Top 20 predictors by correlation with Class (excluding Class itself, which sorts last)
correlation_with_class = corrs['Class'].sort_values()[-21:-1]

correlation_with_class.plot(kind='barh')
plt.title('Top 20 Predictors by Correlation with Response (Class)')
plt.xlabel('Correlation')
plt.ylabel('Predictor')
plt.show()
No description has been provided for this image

4. Data Preprocessing¶

Add the YoY change for each variable¶

For each predictor, we calculate the % change between 2017 and 2018. This should improve the predictions since it captures the trend of companies as opposed to absolute values at a given time. For example, a company's overall value at a given time can only tell you so much about how its stock price will change over the year. The % change of its value over the previous year, though, may offer insights into how the company will fare in the following year.

Furthermore, the above graph of correlations contains many predictors that measure the growth of a given statistic over several years. This shows the importance of measuring change over time and, thus, led us to include the YoY % change variables. Also, the models performed better with a single year-over-year change (2017 to 2018) than with several; this may be because the older data is less valuable or has more missing values.

In [ ]:
num_cols = df_2018.columns.drop(['Class', 'Sector', '2019 PRICE VAR [%]'])

perc_change_df = (df_2018[num_cols] - df_2017[num_cols]) / df_2017[num_cols]
perc_change_df.columns = perc_change_df.columns + ' (YoY)'

df = pd.concat([df_2018, perc_change_df], axis=1)
display(df.shape, df.head())
(4392, 445)
Revenue Revenue Growth Cost of Revenue Gross Profit R&D Expenses SG&A Expense Operating Expenses Operating Income Interest Expense Earnings before Tax ... 10Y Dividend per Share Growth (per Share) (YoY) 5Y Dividend per Share Growth (per Share) (YoY) 3Y Dividend per Share Growth (per Share) (YoY) Receivables growth (YoY) Inventory Growth (YoY) Asset Growth (YoY) Book Value per Share Growth (YoY) Debt Growth (YoY) R&D Expense Growth (YoY) SG&A Expenses Growth (YoY)
Ticker
CMCSA 9.450700e+10 0.1115 0.000000e+00 9.450700e+10 0.000000e+00 6.482200e+10 7.549800e+10 1.900900e+10 3.542000e+09 1.511100e+10 ... inf 1.394095 0.981435 1.325792 NaN 7.875648 -0.761243 11.711304 NaN 1.505747
KMI 1.414400e+10 0.0320 7.288000e+09 6.856000e+09 0.000000e+00 6.010000e+08 3.062000e+09 3.794000e+09 1.917000e+09 2.196000e+09 ... NaN 0.128674 -0.168657 -0.393673 -1.490144 -0.846154 -1.329004 -0.751361 NaN 4.938967
INTC 7.084800e+10 0.1289 2.711100e+10 4.373700e+10 1.354300e+10 6.750000e+09 2.042100e+10 2.331600e+10 -1.260000e+08 2.331700e+10 ... -0.107338 0.351598 0.245161 0.017391 -0.849709 -0.563927 1.086420 -1.279339 0.413043 -0.146739
MU 3.039100e+10 0.4955 1.250000e+10 1.789100e+10 2.141000e+09 8.130000e+08 2.897000e+09 1.499400e+10 3.420000e+08 1.430300e+10 ... NaN NaN NaN -0.440748 0.865432 -0.196397 0.370847 -5.650478 0.357813 -0.261176
GE 1.216150e+11 0.0285 9.546100e+10 2.615400e+10 0.000000e+00 1.811100e+10 4.071100e+10 -1.455700e+10 5.059000e+09 -2.177200e+10 ... 2.686084 -4.795148 12.712042 7.301493 1.202589 -15.189189 0.821762 17.230159 NaN -3.933333

5 rows × 445 columns

Preliminary Feature Selection¶

Remove Duplicates¶
In [ ]:
duplicate_cols = list()
for i, col1 in enumerate(df.columns):
    for col2 in df.columns[i + 1:]:
        if df[col1].equals(df[col2]):
            duplicate_cols.append((col1, col2))
cols_to_drop = [pair[0] for pair in duplicate_cols]
df = df.drop(columns=cols_to_drop)
df.shape
Out[ ]:
(4392, 397)
Remove columns with at least $k$ missing values¶
In [ ]:
# Count the number of missing values in each column (this does not depend on k)
missing_values_count = df.isnull().sum()

ks = list(range(0, 4392, 12))
# For each k, store the list of columns with at least k missing values
cols_with_k_missing = []
for k in ks:
    low_quality_columns = missing_values_count[missing_values_count >= k].index.tolist()
    cols_with_k_missing.append(low_quality_columns)

plt.plot(ks, [len(cols) for cols in cols_with_k_missing])
plt.xlabel(r'$k$')
plt.axvline(x=2000, c='k', ls='--', label=r'$k=2000$ (cutoff)')
plt.title(r'Number of Columns with at least $k$ missing values')
plt.legend();
No description has been provided for this image
In [ ]:
sparse_cols = df.count()[df.count() <= 2000].index
df = df.drop(columns=sparse_cols)

We remove all predictors with at least $2000$ missing values (i.e., more than $45$% of the column is missing). As seen in the chart above, the number of predictors with at least $k$ missing values drops off rapidly as $k$ increases; most predictors have relatively few missing values, with a long tail of very sparse ones. We set the threshold to $2000$ to remove that sparse tail, which lets us drop roughly $40$ of the least informative predictors while still retaining a large feature set.

Train-test Split¶

In [ ]:
df = df.rename(columns={'2019 PRICE VAR [%]': 'Variation'})

X = df.drop(columns=['Variation', 'Class'])
y = df['Class']
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=209)
In [ ]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[ ]:
((3513, 376), (879, 376), (3513,), (879,))

One-hot Encoding¶

We one-hot encode the Sector variable so that it can be properly handled by our models. Because there is no natural ordering of the sectors, encoding them as an ordinal feature would not make sense. One-hot encoding assigns a binary column to each sector, in which a $1$ indicates that the corresponding company belongs to that sector. This allows the model to consider a company's particular sector when predicting the change in its stock price. Companies within the same sector often operate, and thus perform, similarly, so this distinction should help the model.

The below graph demonstrates the importance of accounting for a company's sector when trying to predict the change in its stock price over the year. The correlations between the predictors and the response variable, Class, are noticeably higher when computed within each sector. The increase in correlations from across all sectors (the "All" box on the far left) to within individual sectors reflects the additional predictive power gained by including the Sector variable in the model.

In [ ]:
oh_encoder = OneHotEncoder(drop=None, handle_unknown='ignore')
X_train_ohe = oh_encoder.fit_transform(X_train[['Sector']])
X_train_ohe = pd.DataFrame(X_train_ohe.toarray(), columns=oh_encoder.get_feature_names_out(['Sector']))
X_train_ohe.index = X_train.index
X_train = pd.concat([X_train.drop(columns=['Sector']), X_train_ohe], axis=1)
In [ ]:
X_test_ohe = oh_encoder.transform(X_test[['Sector']])
X_test_ohe = pd.DataFrame(X_test_ohe.toarray(), columns=oh_encoder.get_feature_names_out(['Sector']))
X_test_ohe.index = X_test.index
X_test = pd.concat([X_test.drop(columns=['Sector']), X_test_ohe], axis=1)
In [ ]:
# Function to plot the correlations by sector
sectors = df['Sector'].unique().tolist()
# Initialize dictionary to store predictor-response correlations within each sector
sector_corrs = {}

# Store the correlations between the response and each predictor across ALL sectors (i.e., the whole dataset)
all_corr = df.corrwith(df['Class'], numeric_only=True)
sector_corrs['All'] = all_corr

# Store correlations for the data within each sector
for sector in sectors:
    corrs = df[df['Sector'] == sector].corrwith(df['Class'], numeric_only=True)
    sector_corrs[sector] = corrs

# Convert to df and drop response (trivial; corr = 1)
sector_corrs_df = pd.DataFrame(sector_corrs)
sector_corrs_df = sector_corrs_df.drop('Class')

# Plot
melted_df = sector_corrs_df.melt(var_name='Sector', value_name='Correlation')

plt.figure(figsize=(15, 8))
sns.boxplot(x='Sector', y='Correlation', data=melted_df)
plt.title('Predictor-Response Correlations', fontsize=15)
plt.xticks(rotation=45, ha='right')
plt.show()
/Users/Tomas/micromamba/envs/cs109a/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide
  c /= stddev[:, None]
/Users/Tomas/micromamba/envs/cs109a/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide
  c /= stddev[None, :]
/Users/Tomas/micromamba/envs/cs109a/lib/python3.11/site-packages/numpy/lib/function_base.py:2742: RuntimeWarning: invalid value encountered in subtract
  X -= avg[:, None]
/Users/Tomas/micromamba/envs/cs109a/lib/python3.11/site-packages/numpy/core/_methods.py:118: RuntimeWarning: invalid value encountered in reduce
  ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
No description has been provided for this image

Standardization¶

We decided to standardize the data since the range of values across predictors is quite large, from mere decimals (e.g., growth rates) to hundreds of billions (e.g., revenue of the largest companies). The models will be able to handle standardized data and its smaller values with more ease.

Furthermore, for parametric models, the coefficients of standardized predictors can be directly compared, whereas they'd be skewed without standardization.

In [ ]:
# Replace infinity values with NaN
X_train = X_train.replace([np.inf, -np.inf], np.nan)
X_test = X_test.replace([np.inf, -np.inf], np.nan)
In [ ]:
# Define a function to apply a transformation to data and return them
def transform_data(transformer, X_train, X_test):
    X_train_trans = transformer.fit_transform(X_train)
    X_train_trans = pd.DataFrame(X_train_trans, columns=transformer.get_feature_names_out(), index=X_train.index)
    X_test_trans = transformer.transform(X_test)
    X_test_trans = pd.DataFrame(X_test_trans, columns=transformer.get_feature_names_out(), index=X_test.index)
    return X_train_trans, X_test_trans
In [ ]:
std_scaler = StandardScaler()
X_train_std, X_test_std = transform_data(std_scaler, X_train, X_test)

Imputation¶

Having dealt with the majority of the missingness in the dataset, we now turn to imputation to resolve the remaining NaN values. Through our previous work in Milestone 2, we determined that mean imputation was the most effective method for doing so.

Note: The difference between mean imputation and kNN imputation was extremely marginal, and the selection of one imputation method over the other would have nearly no consequence on the model's performance.
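The Milestone 2 comparison is not reproduced in this notebook; a sketch of how such a check could be run on the standardized training data (the pipeline and logistic model here are chosen only for illustration) might look like:

In [ ]:
# Hypothetical sketch of an imputation comparison: 5-fold CV accuracy of a simple
# classifier under mean imputation vs. kNN imputation on the standardized training data
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

for name, imputer in [('mean', SimpleImputer(strategy='mean')),
                      ('kNN', KNNImputer(n_neighbors=3))]:
    pipe = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X_train_std, y_train, cv=5, scoring='accuracy')
    print(f'{name} imputation CV accuracy: {scores.mean():.3f}')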

In [ ]:
mean_imp = SimpleImputer(strategy='mean')
X_train_final, X_test_final = transform_data(mean_imp, X_train_std, X_test_std)
In [ ]:
# knn_imp = KNNImputer(n_neighbors=3)
# X_train_knnimp, X_test_knnimp = transform_data(knn_imp, X_train, X_test)

5. Modeling¶

Naive Model¶

In [ ]:
# Naive model: pool the Class labels from the 2014-2017 datasets
yearly_classes = [pd.read_csv(f'data/201{i}_Financial_Data.csv')['Class'] for i in range(7, 3, -1)]
data = pd.concat(yearly_classes, ignore_index=True)

print(f"{data.mean() * 100:.2f}% of stocks in the data from 2014-2017 improved.")
print(f"{df_2018['Class'].mean() * 100:.2f}% of stocks in the data from 2018 improved.")
51.53% of stocks in the data from 2014-2017 improved.
69.35% of stocks in the data from 2018 improved.

Let's assume that the naive model is built without access to data from the end of 2018; in that case, it makes sense to look at the general trend of stocks over the previous 4 years of available data. As seen in the results above, $51.53$% of stocks had a YoY increase in price from 2014-2017. Furthermore, even with access to the data from the end of 2018, the majority of stocks, now $69.35$%, went up in price. Either way, the naive model would predict Class $=1$ for every stock. This achieves roughly $69$% accuracy on the 2018 dataset, which is the baseline we are trying to beat.

In [ ]:
# Naive Model
y_pred_train_naive = np.ones_like(y_train)
y_pred_test_naive = np.ones_like(y_test)

naive_train_acc = accuracy_score(y_train, y_pred_train_naive)
naive_test_acc = accuracy_score(y_test, y_pred_test_naive)
In [ ]:
print(f'Naive Model Train Accuracy: {naive_train_acc}')
print(f'Naive Model Test Accuracy: {naive_test_acc}')
Naive Model Train Accuracy: 0.695417022487902
Naive Model Test Accuracy: 0.6860068259385665

Logistic Regression w/ L1 Penalty Term¶

In [ ]:
# Logistic Regression (L1)
Cs = [.005, .01, .02, .05]
lasso = LogisticRegressionCV(Cs=Cs, cv=5, penalty='l1', solver='saga', max_iter=1000, n_jobs=-1, random_state=209).fit(X_train_final, y_train)
best_C = lasso.C_[0]

y_pred_train_lasso = lasso.predict(X_train_final)
y_pred_test_lasso = lasso.predict(X_test_final)

lasso_train_acc = accuracy_score(y_train, y_pred_train_lasso)
lasso_test_acc = accuracy_score(y_test, y_pred_test_lasso)
In [ ]:
# Store the top 10 predictors' coefs and their names
coefficients = lasso.coef_[0]
feature_names = X_train.columns
coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
coef_df = coef_df.reindex(coef_df['Coefficient'].abs().sort_values(ascending=False).index)
top_10_coefs = coef_df.head(10)

# Plot
plt.figure(figsize=(10, 6))
plt.barh(top_10_coefs['Feature'], top_10_coefs['Coefficient'], color='skyblue')
plt.xlabel('Coefficient Value')
plt.title('Top 10 Coefficients in the L1-regularized Logistic Regression')
plt.grid(axis='x')
plt.show()
No description has been provided for this image

The L1-regularized logistic regression model conveniently selects predictors to include by driving many coefficients to 0. This allows us to get an initial perspective on feature importance, as seen in the above graph, which displays the top 10 predictors based on their coefficient values. This model also improved upon the test accuracy of the naive model by $2$%.

However, logistic regression is a parametric model and may not be able to capture complex relationships in the data. Furthermore, our feature set is so large that adding polynomial features would cause the number of predictors to quickly get out of hand. Therefore, we turn to non-parametric models hereafter.
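To put a rough number on that claim, a quick back-of-the-envelope check (using the ~386-column design matrix built above) shows that a degree-2 expansion alone would already produce roughly 75,000 predictors:

In [ ]:
# Back-of-the-envelope count: a degree-2 polynomial expansion of p predictors adds
# p squared terms and p*(p-1)/2 pairwise interactions on top of the p linear terms
p = X_train_final.shape[1]
n_poly_features = p + p + p * (p - 1) // 2
print(f'{p} predictors -> {n_poly_features} degree-2 polynomial terms')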

In [ ]:
# Store the predictors deemed most important by LogisticRegressionCV (here, those whose coefficient is positive rather than driven to 0)
predictors = X_train_final.columns
important_predictors = []
for predictor, coef in zip(predictors, lasso.coef_.tolist()[0]):
    if coef > 0:
        important_predictors.append(predictor)
In [ ]:
display(X_train_final.shape)
(3513, 386)
In [ ]:
print(rf'Out of the original {len(predictors)} features, the L1-regularized logistic regression model drives the weights of {len(predictors) - len(important_predictors)} of them to 0.')
Out of the original 386 features, the L1-regularized logistic regression model drives the weights of 331 of them to 0.
In [ ]:
# L1 Logistic Performance
print(f'L1 Logistic Train Accuracy: {lasso_train_acc}')
print(f'L1 Logistic Test Accuracy: {lasso_test_acc}')
L1 Logistic Train Accuracy: 0.7349843438656419
L1 Logistic Test Accuracy: 0.7076222980659841

kNN Classification¶

In [ ]:
# kNN Classifier: cross-validate over the number of neighbors
ks = range(10,101,10)
train_acc_means = []
val_acc_means = []

for k in ks:
    knn_cv = cross_validate(KNeighborsClassifier(n_neighbors=k),
                           X_train_final, y_train, cv=5, scoring='accuracy', return_train_score=True)
    train_acc_means.append(np.mean(knn_cv['train_score']))
    val_acc_means.append(np.mean(knn_cv['test_score']))

best_idx = np.argmax(val_acc_means)
best_k = ks[best_idx]
In [ ]:
# Plot the accuracies
plt.plot(ks, train_acc_means)
plt.plot(ks, val_acc_means)
# Add labels
plt.xlabel(r'$k$')
plt.ylabel('accuracy')
plt.axvline(best_k, c='k', ls='--', label=rf'best $k={best_k}$')
plt.title(r'kNN Classifier CV Scores')
plt.xticks(ks)
plt.legend();
No description has been provided for this image

This cross-validation graph shows that $k=60$ is the optimal hyperparameter for the kNN classifier, as it maximizes the validation accuracy.

In [ ]:
# Fit a final kNN model on all the data
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_final, y_train)

y_pred_train_knn = knn.predict(X_train_final)
y_pred_test_knn = knn.predict(X_test_final)

knn_train_acc = accuracy_score(y_train, y_pred_train_knn)
knn_test_acc = accuracy_score(y_test, y_pred_test_knn)

This kNN model classifies a new data point based on the majority class of its $60$ nearest neighbors. In terms of performance, it is better than the naive model but slightly worse than logistic regression.
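As a concrete illustration of that voting rule, the class probabilities returned by the fitted classifier are simply neighbor vote fractions (the bracketed numbers in the comment below are hypothetical, not an actual output):

In [ ]:
# For any single company, predict_proba is the fraction of its 60 nearest training
# neighbors in each class; a (hypothetical) output of [[0.35, 0.65]] would mean
# 39 of the 60 neighbors are Class 1, so the company is classified as 1
example_company = X_test_final.iloc[[0]]
print(knn.predict_proba(example_company))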

We will now turn to a new class of non-parametric models–decision trees.

In [ ]:
# kNN Performance
print(f'kNN Train Accuracy: {knn_train_acc}')
print(f'kNN Test Accuracy: {knn_test_acc}')
kNN Train Accuracy: 0.7122117847993168
kNN Test Accuracy: 0.6996587030716723

Single Decision Tree¶

In [ ]:
# Single Decision Tree CV
train_scores = []
cvmeans = []
depths = list(range(1,21))

for max_depth in depths:
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=109)
    tree.fit(X_train_final, y_train)
    train_scores.append(accuracy_score(y_train, tree.predict(X_train_final)))
    scores = cross_val_score(estimator=DecisionTreeClassifier(max_depth=max_depth, random_state=109), X=X_train_final, y=y_train, cv=5, n_jobs=-1)
    cvmeans.append(scores.mean())

best_idx = np.argmax(cvmeans)
best_depth_tree = depths[best_idx]
In [ ]:
cvmeans = np.array(cvmeans)
depths = range(1, 21)

plt.plot(depths, train_scores, label = 'training scores')
plt.plot(depths, cvmeans, label = 'mean validation scores')
plt.axvline(best_depth_tree, c='k', ls='--', label=f'best depth={best_depth_tree}')

plt.title('Non-CV Training Accuracy vs. CV Validation Accuracy')

plt.xlabel('Max Tree Depth')
plt.xticks(range(0, 21, 2))
plt.ylabel('Accuracy Score')
plt.legend()
plt.show()
No description has been provided for this image

Interestingly, despite the large feature space, cross-validation identifies a max depth of $3$ as optimal. This means that overfitting begins relatively quickly. Depth $3$ will now be used for a final single decision tree, fit on the entire training set.

In [ ]:
# Fit a final decision tree on all the data
tree = DecisionTreeClassifier(max_depth=best_depth_tree, random_state=0).fit(X_train_final, y_train)

y_pred_train_tree = tree.predict(X_train_final)
y_pred_test_tree = tree.predict(X_test_final)

tree_train_acc = accuracy_score(y_train, y_pred_train_tree)
tree_test_acc = accuracy_score(y_test, y_pred_test_tree)
In [ ]:
# Plot the decision tree
plt.figure(figsize=(35, 10))
plot_tree(tree, filled=True, feature_names=X_train.columns, class_names=['0', '1'], rounded=True, fontsize=20)
plt.show()
No description has been provided for this image

The above image allows us to visualize the single decision tree fit on all the training data. This improves upon the kNN classifier, but is still worse than the logistic regression model by ~$0.2$%.

We will stick with decision trees, but turn to ensemble methods, which perform better than a single tree since they are able to aggregate many of them into one model.

In [ ]:
# Tree Performance
print(f'Single Tree Train Accuracy: {tree_train_acc}')
print(f'Single Tree Test Accuracy: {tree_test_acc}')
Single Tree Train Accuracy: 0.7378309137489325
Single Tree Test Accuracy: 0.7053469852104665

Bagging¶

After testing various values of n_estimators, we find that the model's performance plateaus before $200$. Thus, n_estimators $=200$ is sufficient to reduce the variance of the decision-tree base estimators. We perform CV below to determine the optimal tree depth.
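The n_estimators sweep itself is not shown in the notebook; a sketch of the kind of check involved might look like the following (the fixed depth of 10 is purely illustrative, not the tuned value found below):

In [ ]:
# Rough sweep over the number of bagged trees, tracking the out-of-bag score
# as it plateaus; the tuned depth is found via CV in the next cell
oob_by_n = {}
for n in [25, 50, 100, 150, 200, 300]:
    bag = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=10),
                            n_estimators=n, oob_score=True, n_jobs=-1,
                            random_state=209).fit(X_train_final, y_train)
    oob_by_n[n] = bag.oob_score_
print(oob_by_n)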

In [ ]:
# Max Depth CV - Bagging
oob_scores = []
depths = list(range(3,25))
for depth in depths:
    bagger = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=depth), n_estimators=100, oob_score=True, random_state=209
                               ).fit(X_train_final, y_train)
    oob_scores.append(bagger.oob_score_)

best_idx = np.argmax(oob_scores)
best_depth = depths[best_idx]
In [ ]:
best_depth
Out[ ]:
14
In [ ]:
# Bagging Performance
n_trees = 200
bagger = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=best_depth), n_estimators=n_trees, n_jobs=-1, random_state=209
                            ).fit(X_train_final, y_train)

y_train_pred_bagger = bagger.predict(X_train_final)
y_test_pred_bagger = bagger.predict(X_test_final)

bagger_train_acc = accuracy_score(y_train, y_train_pred_bagger)
bagger_test_acc = accuracy_score(y_test, y_test_pred_bagger)
In [ ]:
bagging_predictors_1 = [bagged_tree.tree_.feature[0] for bagged_tree in bagger.estimators_]
bagging_counts_1 = pd.Series(bagging_predictors_1).value_counts()
bagging_predictors_2 = [bagged_tree.tree_.feature[1] for bagged_tree in bagger.estimators_]
bagging_counts_2 = pd.Series(bagging_predictors_2).value_counts()

top_predictors_bagging = pd.DataFrame(index=range(X_train_final.shape[1]))
top_predictors_bagging['predictor'] = X_train_final.columns[top_predictors_bagging.index]
top_predictors_bagging['first count'] = bagging_counts_1
# top_predictors_bagging['second count'] = bagging_counts_2
top_predictors_bagging = top_predictors_bagging.dropna(thresh=2)
top_predictors_bagging = top_predictors_bagging.reindex(top_predictors_bagging['first count'].sort_values(ascending=False).index)
In [ ]:
plt.figure(figsize=(12, 8))
plt.barh(top_predictors_bagging['predictor'], top_predictors_bagging['first count'], color='skyblue')
plt.xlabel('Number of Times Used for the First Split')
plt.ylabel('Predictor')
plt.title('Top Node Predictors')
plt.gca().invert_yaxis()  # Invert y-axis for better readability
plt.show()
No description has been provided for this image

This graph demonstrates how correlated the individual trees in the bagging model are. Out of the $386$ predictors, only $15$ are ever used for the first split. Not only that, but Tangible Asset Value dominates the distribution, accounting for the first split of $80$ of the $200$ trees.

The bagging model has the highest test accuracy thus far, at roughly $73.7$%, a clear improvement over the logistic regression. However, in order to further improve generalizability, we will fit a random forest model below. A random forest decorrelates the individual trees by only selecting from a random subset of the predictors at each split.

In [ ]:
# Bagging Performance
print(f'Bagging Train Accuracy: {bagger_train_acc}')
print(f'Bagging Test Accuracy: {bagger_test_acc}')
Bagging Train Accuracy: 0.977227440933675
Bagging Test Accuracy: 0.7372013651877133

Random Forest on All Predictors¶

For the same reasons as above, we set n_estimators=$200$.
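As a reminder of the decorrelation mechanism mentioned above, max_features='sqrt' restricts each split to a small random subset of predictors; a quick check of that subset's size on the preprocessed feature matrix:

In [ ]:
# With max_features='sqrt', each split in the forest below only considers about
# sqrt(386) ~ 19 randomly chosen predictors, which is what decorrelates the trees
print(int(np.sqrt(X_train_final.shape[1])))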

In [ ]:
# Random Forest Classifier: cross-validate over tree depth
train_acc_mean = []
val_acc_mean = []
k = 5
depths = list(range(3,30))
# Cross Validation
for max_depth in depths:
    rf_scores = cross_validate(estimator=RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth, max_features='sqrt', random_state=209),
                               X=X_train_final,
                               y=y_train,
                               scoring='accuracy',
                               cv=k,
                               n_jobs=-1,
                               return_train_score=True)
    train_acc_mean.append(np.mean(rf_scores['train_score']))
    val_acc_mean.append(np.mean(rf_scores['test_score']))
In [ ]:
# Get best depth for RF
best_idx = np.argmax(val_acc_mean)
best_depth_rf = depths[best_idx]
print(f'The cross-validation results show that the best depth is {best_depth_rf}.')
The cross-validation results show that the best depth is 9.
In [ ]:
# Plot the accuracies
plt.plot(depths, train_acc_mean)
plt.plot(depths, val_acc_mean)
# Add labels
plt.xlabel(r'depth')
plt.ylabel('accuracy')
plt.axvline(best_depth_rf, c='k', ls='--', label=rf'best depth={best_depth_rf}')
plt.title(r'Random Forest CV Scores')
plt.xticks(range(3,31,3))
plt.legend();
No description has been provided for this image
In [ ]:
n_trees = 200
tree_depth = best_depth_rf

# Fit a RF
random_forest = RandomForestClassifier(n_estimators=n_trees,
                                       max_depth=best_depth_rf,
                                       max_features='sqrt',
                                       random_state=0
                                       ).fit(X_train_final, y_train)

y_train_pred = random_forest.predict(X_train_final)
y_test_pred = random_forest.predict(X_test_final)

rf_train_acc = accuracy_score(y_train, y_train_pred)
rf_test_acc = accuracy_score(y_test, y_test_pred)
In [ ]:
# RF Performance
print(f'RF Train Accuracy: {rf_train_acc}')
print(f'RF Test Accuracy: {rf_test_acc}')
RF Train Accuracy: 0.9373754625676061
RF Test Accuracy: 0.7485779294653014

Random Forest on Important Predictors¶

Given the nature of a random forest (i.e., it performs better when a larger share of the candidate predictors at each split are informative), it may be beneficial to reduce the likelihood of selecting an unimportant predictor. Thus, we will try a random forest model on only the important predictors identified by the L1 (lasso) regularization above.

In [ ]:
# Random Forest Classifier (important predictors)
X_train_important = X_train_final[important_predictors]
X_test_important = X_test_final[important_predictors]

n_trees = 200
tree_depth = best_depth_rf

# Fit a RF
rf_important = RandomForestClassifier(n_estimators=n_trees,
                                       max_depth=tree_depth,
                                       max_features='sqrt',
                                       random_state=0
                                       ).fit(X_train_important, y_train)

y_train_important_pred = rf_important.predict(X_train_important)
y_test_important_pred = rf_important.predict(X_test_important)

rf_important_train_acc = accuracy_score(y_train, y_train_important_pred)
rf_important_test_acc = accuracy_score(y_test, y_test_important_pred)
In [ ]:
# RF Performance, after lasso selection
print(f'RF Train Accuracy (full data): {rf_train_acc}')
print(f'RF Test Accuracy (full data): {rf_test_acc}')
print(f'RF Train Accuracy (lasso-selected predictors): {rf_important_train_acc}')
print(f'RF Test Accuracy (lasso-selected predictors): {rf_important_test_acc}')
RF Train Accuracy (full data): 0.9373754625676061
RF Test Accuracy (full data): 0.7485779294653014
RF Train Accuracy (lasso-selected predictors): 0.9154568744662681
RF Test Accuracy (lasso-selected predictors): 0.7281001137656428

The random forest and its decorrelated trees improve upon the bagging model in test accuracy, while dropping slightly in training accuracy. As of now, the random forest has the highest test accuracy, at $74.86$%.

AdaBoost¶

Next, we will move to another ensemble method: boosting. While bagging and random forests decrease the variance of deep decision trees by aggregating many of them, boosting reduces the bias of low-complexity (i.e., shallow) decision trees by sequentially fitting each new tree to the errors made by the ensemble so far.

In [ ]:
# AdaBoost - (depth,iter,lr,acc): (2,150,.05) = .745, (1,200,.05)=.741, (2,200,.05)=.743, (2,200,0.3)=.744, (2,100,.075)=.743
ada_depth = 2
n_iters = 150
lr = 0.05
adaboost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=ada_depth),
                              n_estimators=n_iters,
                              learning_rate=lr,
                              random_state=0
                             ).fit(X_train_final, y_train)

ada_train_acc = adaboost.score(X_train_final, y_train)
ada_test_acc = adaboost.score(X_test_final, y_test)

The AdaBoost model slightly underperforms the bagging and random forest models, though it is better than every other one.

Also, note that the train accuracy is much lower than the other two ensemble methods', since the base learner for AdaBoost is much simpler.

In [ ]:
print(f'AdaBoost Train Accuracy: {ada_train_acc}')
print(f'AdaBoost Test Accuracy: {ada_test_acc}')
AdaBoost Train Accuracy: 0.7913464275547964
AdaBoost Test Accuracy: 0.7349260523321957

Stacking (209)¶

Lastly, we will build a stacking model, which is another ensemble method. However, whereas bagging, random forests, and boosting are homogeneous, in that the ensemble consists of many copies of the same type of model (e.g., decision trees of depth $k$), stacking is heterogeneous: it can combine different kinds of models, even other ensemble models. Here, we use a stacking model, with the cross-validated hyperparameters found above, to combine the previous models fit throughout the notebook.

In [ ]:
estimators = [('lasso', LogisticRegression(C=best_C, penalty='l1', solver='liblinear', random_state=209)),
              ('knn', KNeighborsClassifier(n_neighbors=best_k)),
              ('bagger', BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=13), n_estimators=200, n_jobs=-1, random_state=209)),
              ('rf', RandomForestClassifier(n_estimators=200, max_depth=7, random_state=209)),
              ('adaboost', AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2), n_estimators=150, learning_rate=0.05, random_state=209))]

stacker = StackingClassifier(estimators=estimators,
                             final_estimator = LogisticRegression(),
                             cv=5,
                             n_jobs=-1)

stacker.fit(X_train_final, y_train)
Out[ ]:
StackingClassifier(cv=5,
                   estimators=[('lasso',
                                LogisticRegression(C=0.05, penalty='l1',
                                                   random_state=209,
                                                   solver='liblinear')),
                               ('knn', KNeighborsClassifier(n_neighbors=60)),
                               ('bagger',
                                BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=13),
                                                  n_estimators=200, n_jobs=-1,
                                                  random_state=209)),
                               ('rf',
                                RandomForestClassifier(max_depth=7,
                                                       n_estimators=200,
                                                       random_state=209)),
                               ('adaboost',
                                AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2),
                                                   learning_rate=0.05,
                                                   n_estimators=150,
                                                   random_state=209))],
                   final_estimator=LogisticRegression(), n_jobs=-1)
In [ ]:
stacker_train_acc = stacker.score(X_train_final, y_train)
stacker_test_acc = stacker.score(X_test_final, y_test)
In [ ]:
# Stacker Performance
print(f'Stacker Train Accuracy: {stacker_train_acc}')
print(f'Stacker Test Accuracy: {stacker_test_acc}')
Stacker Train Accuracy: 0.9288357529177341
Stacker Test Accuracy: 0.7474402730375427

Overall, stacking performed slightly worse than the random forest, making it our second-best model by a marginal 0.12 percentage points. Nonetheless, it improves upon the naive model's test accuracy by over $5$ percentage points, showing that financial-indicator data does carry meaningful signal about the direction of a stock's price change.

6. Feature Importance¶

The two graphs below show the feature importances of the top 30 predictors in the Random Forest and AdaBoost models, respectively. The scale on the x-axis reveals that the Random Forest's importance is less concentrated in its top predictors. For example, Total Liabilities is its single most important feature, yet it accounts for only a little over $1$% of the total importance, and the 30 predictors displayed add up to only ~$18$%.

The AdaBoost model, while still utilizing many predictors, relies much more heavily on its most important ones. The top feature, Debt Growth (YoY), has an importance of around $4$%, and the top 30 predictors together account for $57.8$%.

In [ ]:
rf_importances = pd.DataFrame(data=[random_forest.feature_importances_], columns=X_train_std.columns, index=['RF']).T.sort_values(by='RF', ascending = True)
ada_importances = pd.DataFrame(data=[adaboost.feature_importances_], columns=X_train_std.columns, index=['Ada']).T.sort_values(by='Ada', ascending = True)
In [ ]:
ada_importances[-30:].sum()
Out[ ]:
Ada    0.577874
dtype: float64
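For completeness, the analogous sum for the Random Forest side (the ~18% figure quoted above) can be checked the same way:

In [ ]:
# Same check for the Random Forest's top 30 importances
rf_importances[-30:].sum()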
In [ ]:
rf_importances[-30:].plot(kind='barh')
plt.title('Top 30 Predictors for Random Forest')
plt.xlabel('Importance')
plt.xlim(0, 0.05)
plt.show()
No description has been provided for this image
In [ ]:
ada_importances[-30:].plot(kind='barh')
plt.title('Top 30 Predictors for AdaBoost')
plt.xlabel('Importance')
plt.xlim(0, 0.05)
plt.show()
No description has been provided for this image

7. Results & Discussion¶

Thinking back to our goal of classifying publicly-traded companies as increasing or decreasing in value based on their 10-K filings, we have proposed several models as viable solutions. Along the way, we have reached interesting findings about the various factors that play into a model's success at classifying companies:

  • Model / Hyperparameter Selection: what are the strengths and weaknesses of our models, and what do their optimal hyperparameters tell us about the dataset?
  • Preprocessing: how might our original preprocessing have affected our final models?
  • Feature Importance: what can we learn about the dataset based on the importance of features in the models?
  • Contextualizing our Strengths and Weaknesses: are our models' properties appropriate for their real-world applications?

First, let's look at how our various models compared:

In [ ]:
classifiers = ['Naive Model',
               'L1 Logistic Regression',
               'kNN Classifier',
               'Single Decision Tree',
               'Bagging Model',
               'Random Forest',
               'AdaBoost',
               'Stacking Model']
train_scores = [naive_train_acc, lasso_train_acc, knn_train_acc, tree_train_acc, bagger_train_acc, rf_train_acc, ada_train_acc, stacker_train_acc]
train_scores = [np.round(score * 100, 2) for score in train_scores]
test_scores = [naive_test_acc, lasso_test_acc, knn_test_acc, tree_test_acc, bagger_test_acc, rf_test_acc, ada_test_acc, stacker_test_acc]
test_scores = [np.round(score * 100, 2) for score in test_scores]

# Create the df
results_df = pd.DataFrame({'Training Accuracy':train_scores, 'Testing Accuracy':test_scores}, index=classifiers)
In [ ]:
results_df
Out[ ]:
Training Accuracy Testing Accuracy
Naive Model 69.54 68.60
L1 Logistic Regression 73.50 70.76
kNN Classifier 71.22 69.97
Single Decision Tree 73.78 70.53
Bagging Model 97.72 73.72
Random Forest 93.74 74.86
AdaBoost 79.13 73.49
Stacking Model 92.88 74.74

Model/Hyperparameter Selection

The models with the highest test accuracy, in the end, were the Stacking model and the Random Forest. Posting an increase of over 5 percentage points in test accuracy over the Naive model, the Stacker still managed to obtain high training accuracy as well. This suggests that it could handle the dataset's complexity without overfitting badly, perhaps because it combines a mix of simple and ensemble learners.

The random forest, on the other hand, was not expected to be our best model, as it does not combine multiple strong learners the way the stacker does. Its test accuracy suggests that it also generalizes well; in fact, we believe that combining it with (barely) weaker models in the stacker may have brought the stacker's accuracy down slightly.

One limitation of our stacker is time and space complexity: is it worth training and storing 5 different models if the gain in test accuracy is under 1%? Our ensemble models' test accuracies all clustered around 73-74%, and their closeness suggests choosing a model on factors other than test accuracy alone. The random forest is the more space-efficient option, as it houses smaller base learners without compromising train or test accuracy, so if we were strictly limited on storage space it would be the natural choice over the stacking model.
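One rough way to put numbers on that storage trade-off (an illustrative check, not something we report results for) would be to compare pickled model sizes:

In [ ]:
# Illustrative check of the storage argument above: approximate size of each fitted
# ensemble via its pickled byte size (numbers will vary with hyperparameters and data)
import pickle
for name, model in [('random forest', random_forest), ('stacker', stacker)]:
    print(f'{name}: ~{len(pickle.dumps(model)) / 1e6:.1f} MB')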

Another limitation is the interpretability of the stacking model. Although we can perform analyses on the coefficients of the logistic regression combining our models, it is hard to discover the effect of one predictor on the outcome.

Some hyperparameters of note were the max_depth values chosen for the single decision tree, the bagging model, and the random forest. While a low depth was optimal for a single decision tree, there was minimal drop-off in validation scores at greater depths. We believe this hints at the strength of ensemble methods: the deeper base learners in the bagging and random forest models were still reaching decent accuracies on their own.

Preprocessing:

We made decisions in our preprocessing that contributed to the strengths and weaknesses of the model:

  • Standardization allowed the predictors to be on similar scales. Thus, when we selected our most important predictors using L1 regularization, we knew those selections were not driven by the scale of the predictors. Standardization did, however, make our model less interpretable than a non-standardized version.
  • Adding Year-over-Year Comparisons: these predictors let us capture the rate of change of each financial indicator. We hypothesized that every model would lean on them, as past changes might be suggestive of the following year's movement. The YoY predictors, though, were deemed less important in several of our models.

Feature Importance

  • Building on the previous point about YoY variables: in the AdaBoost model they were very important; perhaps when a weak learner can only use a couple of splits, these variables are more favorable. This hints that the YoY variables may be better at error correction, or that they capture the dataset's outliers best.
  • We still think that it was worth experimenting with these YoY extra predictors. Perhaps they would be of more value in a regression model, as they could suggest when the stock price's rate of change is trending upwards or downwards.
  • We found that predictors about assets and liabilities were consistently the most important in our models. This could be useful information for new investors wondering what factors affect a stock price the most.
  • Finally, we also experimented with interaction terms (specifically sector interacting with each of the other predictors), but they did not improve the scores significantly. For example, we hypothesized that Capital Expenditure–the amount of money a company uses to buy and maintain physical assets, such as property and equipment–matters much more within the Industrials sector than the Technology sector. Interaction terms would have been useful in the logistic regression, but in a decision tree, we can effectively model the interaction of two predictors by default. We mainly used tree-based ensemble models, so interaction terms would not have been overly beneficial.

Contextualizing our Strengths and Weaknesses

We can place these strengths and weaknesses in the context of our target audience: If we are hoping to create a model that investors train, use, and interpret in real-time, space and time efficiency are key. If an investor, for example, is guessing what a company's Revenue will look like for the year and wants to quickly predict whether that is enough to make the stock go up or down, a stacking model is not the best solution. A stacker is optimal for a model that is trained once and then used on full observations, while a simpler model (perhaps our original L1 Logistic Regression) is better for interpretability.

8. Future Work¶

With more time, we believe that such a classification model is genuinely feasible. Some potential areas for improvement are:

  • Time Series Data: We saw that YoY comparisons were useful in our boosting model, suggesting that true time series data (the same predictors observed at multiple points in time) would be helpful in the future. With time series data, we could build a more accurate proxy for a stock's trend at a given time.
  • Regression instead of Classification: A classification model is great for binary decisions (invest vs. not invest), but in the stock market the real question is often "invest in stock x or stock y?", which cannot be answered without a regression model. When we tried regression, we struggled to obtain an $R^2$ higher than 0.005. Building a better regression model would be an exciting future challenge (a minimal sketch of such a baseline follows this list).
  • Continued Feature Engineering: With our large number of predictors, the curse of dimensionality is always a risk. If we had the time, we could spend it trimming less important features (using regularization, or by removing the least frequently used features in our ensemble models). This could lead to models that are overall more interpretable and efficient to train/store.
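As a starting point for that future regression work, a minimal sketch might reuse the preprocessed matrices from the Data Preprocessing section and simply swap the response; the forest hyperparameters below are borrowed from the classifier and would certainly need retuning:

In [ ]:
# Hypothetical regression baseline: predict the raw 2019 price variation ('Variation')
# instead of the binary Class label, using the rows of the existing train/test split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

y_var_train = df.loc[X_train_final.index, 'Variation']
y_var_test = df.loc[X_test_final.index, 'Variation']

rf_reg = RandomForestRegressor(n_estimators=200, max_depth=9, n_jobs=-1, random_state=209)
rf_reg.fit(X_train_final, y_var_train)
print(f'Test R^2: {r2_score(y_var_test, rf_reg.predict(X_test_final)):.4f}')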