This project uses a dataset of Google (GOOGL) stock prices from May 2009 to August 2018. The source provides both AMD and Google datasets; only the Google data is used for now.
Dataset Source: https://www.kaggle.com/gunhee/amdgoogle
We use the stock prices from 2009 to 2017 (9 years) to train the neural network and then predict the stock prices for the year 2018. This is a regression problem.
To achieve this goal, we will train a Recurrent Neural Network (LSTM). We will use one of the deep learning libraries, Keras, to build the neural network.
# Install the deep learning libraries before importing them (only needed once per environment)
!pip install keras
!pip install tensorflow
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
from keras.layers import LSTM, Dense, Dropout
from keras.models import Sequential
from sklearn.preprocessing import MinMaxScaler
df = pd.read_csv("dataset/GOOGL.csv")
df.head()
df.tail()
df.info()
# dataframe.size
print("Size = ", df.size)
# dataframe.shape
shape = df.shape
print("Shape = ", shape)
print("Shape[0] x Shape[1] = {}".format(shape[0]*shape[1]))
df.dataframeName = 'GOOGL.csv'
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')
Let's take a quick look at what the data looks like
df.head(5)
#statistical summary of the attributes
df.describe()
# boxplots
df.iloc[:,0:6].boxplot()
plt.show()
df.iloc[:,6:7].boxplot()
plt.show()
df.hist(figsize=(12, 6))
plt.show()
# Correlation Matrix (numeric columns only; the Date column is excluded)
# numeric_only requires pandas >= 1.5
corr = df.corr(numeric_only=True)
corr
# Correlation Matrix Plot
f = plt.figure(figsize=(12, 6))
plt.matshow(corr, fignum=f.number)
# label the ticks with the columns of the correlation matrix, not of df
plt.xticks(range(corr.shape[1]), corr.columns, fontsize=14, rotation=45)
plt.yticks(range(corr.shape[0]), corr.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
# Correlation heatmap
corr.style.background_gradient(cmap='coolwarm').format(precision=2)
# Attribute Scatter Plot Matrix
from pandas.plotting import scatter_matrix
scatter_matrix(df, figsize=(12, 6), diagonal='kde', alpha=0.2)
plt.show()
The dataset spans the years 2009 to 2018. We train the model on the data from 2009 to 2017 and test it on the 2018 data, so we need to split the data into two parts.
# convert the Date column of the DataFrame to datetime format for easy handling
df['Date'] = pd.to_datetime(df['Date'])
# the separation date: the last day of the training period
dec_2017 = '2017-12-31'
mask = (df['Date'] <= dec_2017)
data_09to17 = df.loc[mask]
data_09to17.tail()
mask = (df['Date'] > dec_2017)
data_18 = df.loc[mask]
data_18.head()
We will use `data_09to17` for the further steps. We will use `data_18` at the end for testing and validating.
The dataset contains several columns, as displayed above, but in this project we will use only the `Open` stock prices to train our model. For convenience, we therefore create another variable that stores only the required (`Open` stock price) information.
training_set = data_09to17.iloc[:,1:2].values
print(training_set)
print("********************")
print("********************")
print(training_set.shape)
Now we can see that there is only one column, containing the `Open` stock prices. There are a total of 2168 stock prices.
print(type(data_09to17))
print(type(data_09to17.iloc[:,1:2]))
print(type(data_09to17.iloc[:,1:2].values))
# iloc[rangeofRows, rangeofColumns]
# Indexing starts from zero.
# ":" indicates entire range.
# "1:2" selects only column 1, because the upper bound is excluded.
# Mathematical operations are performed on arrays, so it is crucial to convert the data to arrays.
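The difference between slicing with `1:2` and indexing with a plain `1` can be seen in a small sketch. The `toy` frame and its values below are invented for illustration and are not taken from GOOGL.csv.

```python
import pandas as pd

# Toy frame with made-up values, used only to show the shapes.
toy = pd.DataFrame({
    "Date": ["2017-01-03", "2017-01-04"],
    "Open": [800.6, 809.9],
    "High": [811.4, 813.4],
})

as_frame = toy.iloc[:, 1:2]   # slice: keeps a 2-D DataFrame with one column
as_series = toy.iloc[:, 1]    # single integer: returns a 1-D Series

print(type(as_frame).__name__, as_frame.values.shape)    # DataFrame (2, 1)
print(type(as_series).__name__, as_series.values.shape)  # Series (2,)
```

The 2-D shape matters later because scikit-learn scalers expect input of shape `(n_samples, n_features)`.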
A neural network generally trains better when all of its inputs lie in a similar range. As the output above shows, the stock prices span a wide range, so we scale the training data into a common range. This process is called feature scaling. Two popular methods for feature scaling are:
Standardization:
$ x' = \frac{x - \bar{x}}{\sigma} $
where $ x $ is the original feature vector, $ \bar{x} $ is the mean of that feature vector, and $ \sigma $ is its standard deviation.
Min-max normalization:
$ x' = \frac{x - \text{min}(x)}{\text{max}(x)-\text{min}(x)} $
where $ x $ is an original value and $ x' $ is the normalized value.
Normalization is a common choice for recurrent neural networks, so we use min-max normalization here. You may also experiment with other feature scaling methods.
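To make the two formulas concrete, here is a minimal NumPy sketch that applies both to a toy column of prices. The values in `prices` are invented for illustration and are not from the dataset.

```python
import numpy as np

# Toy prices, invented purely for illustration.
prices = np.array([[200.0], [250.0], [300.0], [400.0]])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (prices - prices.mean(axis=0)) / prices.std(axis=0)

# Min-max normalization: the smallest value maps to 0, the largest to 1.
normalized = (prices - prices.min(axis=0)) / (prices.max(axis=0) - prices.min(axis=0))

print(standardized.ravel())
print(normalized.ravel())  # [0.   0.25 0.5  1.  ]
```

Standardized values have zero mean and unit standard deviation but no fixed bounds, whereas min-max values always fall between 0 and 1 on the data they were computed from.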
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range = (0,1))
scaled_training_set = scaler.fit_transform(training_set)
print(scaled_training_set.shape)
scaled_training_set
The `fit` method only calculates the min and max values; it does not apply the formula to the training set. The `fit_transform` method calculates them and then applies the min-max formula to the training set. After the transformation, the training data (features) lie in the range 0 to 1, as shown above.
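The split between fitting and transforming can be sketched with plain NumPy. The helper functions `transform` and `inverse_transform` below are hypothetical stand-ins that mimic the idea, not scikit-learn's implementation, and the numbers are invented.

```python
import numpy as np

train = np.array([[10.0], [20.0], [30.0]])  # toy training prices
test = np.array([[25.0], [40.0]])           # toy later prices

# "fit": learn the statistics from the training data only.
lo, hi = train.min(axis=0), train.max(axis=0)

def transform(x):
    """Apply the min-max formula using the training min and max."""
    return (x - lo) / (hi - lo)

def inverse_transform(x_scaled):
    """Map scaled values back to the original price units."""
    return x_scaled * (hi - lo) + lo

scaled_train = transform(train)  # what fit followed by transform yields on train
scaled_test = transform(test)    # reuses the training statistics, no refit

print(scaled_train.ravel())  # [0.  0.5 1. ]
print(scaled_test.ravel())   # [0.75 1.5 ]
```

Note that a test value above the training maximum is scaled to a value greater than 1. This is expected when later prices exceed anything seen during training.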
X_train = []
y_train = []
# each sample holds the previous 60 scaled prices; the target is the price that follows
for i in range(60, len(scaled_training_set)):
    X_train.append(scaled_training_set[i-60:i,0])
    y_train.append(scaled_training_set[i,0])
X_train,y_train = np.array(X_train),np.array(y_train)
print(X_train.shape, y_train.shape)
X_train = np.reshape(X_train,(X_train.shape[0],X_train.shape[1],1)) # adding the features axis: (samples, timesteps, features)
X_train.shape
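The windowing step above can be checked on a short sequence where the result is easy to verify by eye. The helper `make_windows` is a hypothetical name introduced here for illustration. It follows the same logic as the loop above, with a window of 3 instead of 60.

```python
import numpy as np

def make_windows(series, window):
    """Return (X, y): each X sample holds `window` consecutive values,
    and y is the value that immediately follows that window."""
    X, y = [], []
    for i in range(window, len(series)):
        X.append(series[i - window:i])
        y.append(series[i])
    # Add a trailing axis so X has shape (samples, timesteps, features).
    X = np.array(X).reshape(-1, window, 1)
    return X, np.array(y)

series = np.arange(10, dtype=float)  # 0.0, 1.0, ..., 9.0
X, y = make_windows(series, window=3)

print(X.shape, y.shape)    # (7, 3, 1) (7,)
print(X[0].ravel(), y[0])  # [0. 1. 2.] 3.0
```

A sequence of length n yields n minus `window` samples, which is why 2168 prices with a 60-day window give 2108 training samples.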
from keras.models import Sequential
from keras.layers import Dense,LSTM,Dropout
model = Sequential()
model.add(LSTM(units=100, return_sequences = True, input_shape =(X_train.shape[1],1)))
model.add(Dropout(0.2))
model.add(LSTM(units=100, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units=100, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(units=100, return_sequences = False))
model.add(Dropout(0.2))
model.add(Dense(units =1))
model.compile(optimizer='adam',loss="mean_squared_error")
hist = model.fit(X_train, y_train, epochs = 20, batch_size = 32, verbose=2)
plt.figure(figsize = (16, 8))
plt.plot(hist.history['loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()
testing_set = data_18.iloc[:,1:2]
y_test = testing_set.iloc[60:,0:].values
testing_set = testing_set.iloc[:,0:].values
scaled_testing_set = scaler.transform(testing_set)
scaled_testing_set.shape
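X_test = []
# build 60-day windows from the scaled 2018 prices, mirroring the training windows
for i in range(60, len(scaled_testing_set)):
    X_test.append(scaled_testing_set[i-60:i,0])
X_test = np.array(X_test)
print(X_test.shape)
X_test = np.reshape(X_test,(X_test.shape[0],X_test.shape[1],1)) # adding the features axis: (samples, timesteps, features)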
X_test.shape
y_pred = model.predict(X_test) # predict the scaled prices for the test windows
predicted_price = scaler.inverse_transform(y_pred) # invert the scaling so the predictions are in price units for plotting
plt.figure(figsize = (16, 8))
plt.plot(y_test, color = 'blue', label = 'Actual Stock Price')
plt.plot(predicted_price, color = 'red', label = 'Predicted Stock Price')
plt.title('GOOGL stock price prediction')
plt.xlabel('Time')
plt.ylabel('Stock Price')
plt.legend()
plt.show()
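Since this is a regression problem, the visual comparison could be complemented with a single error figure. The sketch below computes the root mean squared error (RMSE), which is an assumption on our part because the notebook itself does not compute a metric. The function `rmse` and the toy arrays are hypothetical; in the notebook it would be called as `rmse(y_test, predicted_price)`.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally sized arrays."""
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy values, invented for illustration only.
toy_actual = [[100.0], [102.0], [104.0]]
toy_pred = [[101.0], [101.0], [106.0]]

print(rmse(toy_actual, toy_pred))  # about 1.414
```

RMSE is expressed in the same units as the prices, so it can be read directly against the scale of the plot.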