3.3. Convolutional Neural Networks from scratch in Python#
We will be building a Convolutional Neural Network (CNN) model from scratch using NumPy in Python. Please check out the following list of ingredients (if you have not already done so), so that you can cook (code) the CNN model from scratch. This is going to be the most general CNN model that you can find anywhere on the net (without using any for loops, except for the epochs part :))!
Note: I have already explained (in detail) most of the code sections in my previous chapters (such as forward and backward propagation in Convolution layers, forward and backward propagation in Pooling layers, and the MLP model for the Fully connected layers). I will just list them here for you to go and check out (so that I can skip the tedious work of explaining them again and concentrate more on the fun part).
Ingredients
Now that we have all the ingredients available, we are ready to code the most general Convolutional Neural Network (CNN) model from scratch using NumPy in Python.
Import essential libraries#
# numpy for linear algebra
import numpy as np
# matplotlib for plotting the loss functions and/or accuracy
import matplotlib.pyplot as plt
# confusion matrix
from sklearn.metrics import confusion_matrix
# accuracy score
from sklearn.metrics import accuracy_score
# show progress bar
from tqdm import tqdm
Activation class#
This class contains methods to compute the activation functions, together with the forward propagation and backpropagation through the activation, as per the description in the chapter Shortcut to calculate forward pass and backpropagation across layers (link to previous chapter).
class Activation:
    def __init__(self, activation_type=None):
        '''
        Parameters
        
        activation_type: type of activation
        available options are 'sigmoid', 'linear', 'tanh', 'softmax', 'prelu' and 'relu'
        '''
        if activation_type is None:
            self.activation_type = 'linear'
        else:
            self.activation_type = activation_type
    def linear(self, x):
        '''
        Parameters
        
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (in case of batch gradient descent of size m)
        and 'd' is the number of features
        '''
        return x
    def d_linear(self, x):
        '''
        Parameters
        
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (in case of batch gradient descent of size m)
        and 'd' is the number of features
        '''
        return np.ones(x.shape)
    def sigmoid(self, x):
        '''
        Parameters
        
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (in case of batch gradient descent of size m)
        and 'd' is the number of features
        '''
        return 1/(1+np.exp(-x))
    def d_sigmoid(self, x):
        '''
        Parameters
        
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (in case of batch gradient descent of size m)
        and 'd' is the number of features
        '''
        return self.sigmoid(x) * (1-self.sigmoid(x))
     
    def tanh(self, x):
        '''
        Parameters
        
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (in case of batch gradient descent of size m)
        and 'd' is the number of features
        '''
        return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
    def d_tanh(self, x):
        '''
        Parameters
        
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (in case of batch gradient descent of size m)
        and 'd' is the number of features
        '''
        return 1-(self.tanh(x))**2
    def ReLU(self, x):
        '''
        Parameters
        
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (in case of batch gradient descent of size m)
        and 'd' is the number of features
        '''
        return x * (x > 0)
    def d_ReLU(self, x):
        '''
        Parameters
        
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (in case of batch gradient descent of size m)
        and 'd' is the number of features
        '''
        return (x>0)*np.ones(x.shape)
    def PReLU(self, x, alpha=0.2):
        '''
        Parameters
        alpha: slope parameter (𝛼)
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (or rows)
        and 'd' is the number of features (or columns)
        '''
        return np.where(x > 0, x, alpha*x) 
    def d_PReLU(self, x, alpha=0.2):
        '''
        Parameters
        alpha: slope parameter (𝛼)
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (or rows)
        and 'd' is the number of features (or columns)
        '''
        return np.where(x > 0, 1, alpha) 
    def softmax(self, x):
        '''
        Parameters
        
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (in case of batch gradient descent of size m)
        and 'd' is the number of features
        '''
        z = x - np.max(x, axis=-1, keepdims=True)
        numerator = np.exp(z)
        denominator = np.sum(numerator, axis=-1, keepdims=True)
        softmax = numerator / denominator
        return softmax
    def d_softmax(self, x):
        '''
        Parameters
        
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (in case of batch gradient descent of size m)
        and 'd' is the number of features
        '''
        if len(x.shape)==1:
            x = np.array(x).reshape(1,-1)
        else:
            x = np.array(x)
        m, d = x.shape
        a = self.softmax(x)
        tensor1 = np.einsum('ij,ik->ijk', a, a)
        tensor2 = np.einsum('ij,jk->ijk', a, np.eye(d, d))
        return tensor2 - tensor1
    def get_activation(self, x):
        '''
        Parameters
        
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (in case of batch gradient descent of size m)
        and 'd' is the number of features
        '''
        if self.activation_type == 'sigmoid':
            return self.sigmoid(x)
        elif self.activation_type == 'tanh':
            return self.tanh(x)
        elif self.activation_type == 'relu':
            return self.ReLU(x)
        elif self.activation_type == 'linear':
            return self.linear(x)
        elif self.activation_type == 'prelu':
            return self.PReLU(x)
        elif self.activation_type == 'softmax':
            return self.softmax(x)
        else:
            raise ValueError("Valid Activations are only 'sigmoid', 'linear', 'tanh' 'softmax', 'prelu' and 'relu'")
    def get_d_activation(self, x):
        '''
        Parameters
        
        x: input matrix of shape (m, d) 
        where 'm' is the number of samples (in case of batch gradient descent of size m)
        and 'd' is the number of features
        '''
        if self.activation_type == 'sigmoid':
            return self.d_sigmoid(x)
        elif self.activation_type == 'tanh':
            return self.d_tanh(x)
        elif self.activation_type == 'relu':
            return self.d_ReLU(x)
        elif self.activation_type == 'linear':
            return self.d_linear(x)
        elif self.activation_type == 'prelu':
            return self.d_PReLU(x)
        elif self.activation_type == 'softmax':
            return self.d_softmax(x)
        else:
            raise ValueError("Valid Activations are only 'sigmoid', 'linear', 'tanh', 'softmax', 'prelu' and 'relu'")
    def forward(self, X):
        self.X = X
        z = self.get_activation(X)
        return z
    
    def backpropagation(self, dz):
        f_prime = self.get_d_activation(self.X)
        if self.activation_type=='softmax':
            # because derivative of softmax is a tensor
            dx = np.einsum('ijk,ik->ij', f_prime, dz)
        else:
            dx = dz * f_prime
        return dx
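Before moving on, here is a quick sanity check (not part of the original notebook; the values below are made up for illustration) showing how the Activation class can be used on its own:
# quick sanity check for the Activation class (illustrative values)
relu = Activation(activation_type='relu')
x = np.array([[-1.0, 2.0], [3.0, -4.0]])    # shape (m=2, d=2)
a = relu.forward(x)                         # [[0, 2], [3, 0]]
dx = relu.backpropagation(np.ones_like(x))  # gradient flows only where x > 0
print(a)
print(dx)                                   # [[0, 1], [1, 0]]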
Cost function#
Follow the lecture to develop the cost function class
class Cost:
    def __init__(self, cost_type='mse'):
        '''
        Parameters
        
        cost_type: type of cost function
        available options are 'mse', and 'cross-entropy'
        '''
        self.cost_type = cost_type
    def mse(self, a, y):
        '''
        Parameters
        
        a: Predicted output array of shape (m, d)
        y: Actual output array of shape (m, d)
        '''
        return (1/2)*np.sum((np.linalg.norm(a-y, axis=1))**2)
    def d_mse(self, a, y):
        '''
        represents dJ/da
        Parameters
        
        a: Predicted output array of shape (m, d)
        y: Actual output array of shape (m, d)
        '''
        return a - y
    def cross_entropy(self, a, y, epsilon=1e-12):
        '''
        Parameters
        
        a: Predicted output array of shape (m, d)
        y: Actual output array of shape (m, d)
        '''
        a = np.clip(a, epsilon, 1. - epsilon)
        return -np.sum(y*np.log(a))
    def d_cross_entropy(self, a, y, epsilon=1e-12):
        '''
        represents dJ/da
        Parameters
        
        a: Predicted output array of shape (m, d)
        y: Actual output array of shape (m, d)
        '''
        a = np.clip(a, epsilon, 1. - epsilon)
        return -y/a
    def get_cost(self, a, y):
        '''
        Parameters
        
        a: Predicted output array of shape (m, d)
        y: Actual output array of shape (m, d)
        '''
        if self.cost_type == 'mse':
            return self.mse(a, y)
        elif self.cost_type == 'cross-entropy':
            return self.cross_entropy(a, y)
        else:
            raise ValueError("Valid cost functions are only 'mse', and 'cross-entropy'")
    def get_d_cost(self, a, y):
        '''
        Parameters
        
        a: Predicted output array of shape (m, d)
        y: Actual output array of shape (m, d)
        '''
        if self.cost_type == 'mse':
            return self.d_mse(a, y)
        elif self.cost_type == 'cross-entropy':
            return self.d_cross_entropy(a, y)
        else:
            raise ValueError("Valid cost functions are only 'mse', and 'cross-entropy'")
Optimizers#
This class contains different optimizers (such as RMSProp, Adam, etc.) used for updating the parameters.
class Optimizer:
    def __init__(self, optimizer_type=None, shape_W=None, shape_b=None,
                 momentum1=0.9, momentum2=0.999, epsilon=1e-8):
        '''
        Parameters
        momentum1: float hyperparameter >= 0 that accelerates gradient descent in the relevant 
                   direction and dampens oscillations. Used in SGD with momentum and in Adam.
        momentum2: float hyperparameter >= 0. Used in RMSProp and Adam.
        optimizer_type: type of optimizer
                        available options are 'gd', 'sgd' (this also includes momentum), 'adam', and 'rmsprop'
        shape_W: Shape of the weight matrix W/ Kernel K
        shape_b: Shape of the bias matrix b
        epsilon: parameter used in RMSProp and Adam to avoid division by zero error
        '''
        if optimizer_type is None:
            self.optimizer_type = 'adam'
        else:
            self.optimizer_type = optimizer_type
        self.momentum1 = momentum1
        self.momentum2 = momentum2
        self.epsilon = epsilon
        self.vdW = np.zeros(shape_W)
        self.vdb = np.zeros(shape_b)
        self.SdW = np.zeros(shape_W)
        self.Sdb = np.zeros(shape_b)
    def GD(self, dW, db, k):
        '''
        dW: gradient of Weight W for iteration k
        db: gradient of bias b for iteration k
        k: iteration number
        '''
        return dW, db
 
    def SGD(self, dW, db, k):
        '''
        dW: gradient of Weight W for iteration k
        db: gradient of bias b for iteration k
        k: iteration number
        '''
        self.vdW = self.momentum1*self.vdW + (1-self.momentum1)*dW
        self.vdb = self.momentum1*self.vdb + (1-self.momentum1)*db
        return self.vdW, self.vdb
    def RMSProp(self, dW, db, k):
        '''
        dW: gradient of Weight W for iteration k
        db: gradient of bias b for iteration k
        k: iteration number
        '''
        self.SdW = self.momentum2*self.SdW + (1-self.momentum2)*(dW**2)
        self.Sdb = self.momentum2*self.Sdb + (1-self.momentum2)*(db**2)
        den_W = np.sqrt(self.SdW) + self.epsilon
        den_b = np.sqrt(self.Sdb) + self.epsilon
        return dW/den_W, db/den_b
 
    def Adam(self, dW, db, k):
        '''
        dW: gradient of Weight W for iteration k
        db: gradient of bias b for iteration k
        k: iteration number
        '''
        # momentum
        self.vdW = self.momentum1*self.vdW + (1-self.momentum1)*dW
        self.vdb = self.momentum1*self.vdb + (1-self.momentum1)*db
        # rmsprop
        self.SdW = self.momentum2*self.SdW + (1-self.momentum2)*(dW**2)
        self.Sdb = self.momentum2*self.Sdb + (1-self.momentum2)*(db**2)
        # correction
        if k>1:
            vdW_h = self.vdW / (1-(self.momentum1**k))
            vdb_h = self.vdb / (1-(self.momentum1**k))
            SdW_h = self.SdW / (1-(self.momentum2**k))
            Sdb_h = self.Sdb / (1-(self.momentum2**k))
        else:
            vdW_h = self.vdW 
            vdb_h = self.vdb
            SdW_h = self.SdW
            Sdb_h = self.Sdb
        den_W = np.sqrt(SdW_h) + self.epsilon
        den_b = np.sqrt(Sdb_h) + self.epsilon
        return vdW_h/den_W, vdb_h/den_b
    def get_optimization(self, dW, db, k):
        if self.optimizer_type == 'gd':
            return self.GD(dW, db, k)
        elif self.optimizer_type == 'sgd':
            return self.SGD(dW, db, k)
        elif self.optimizer_type == 'rmsprop':
            return self.RMSProp(dW, db, k)
        elif self.optimizer_type == 'adam':
            return self.Adam(dW, db, k)
        else:
            raise ValueError("Valid optimizer options are only 'gd', 'sgd', 'rmsprop', and 'adam'.")
Learning Rate decay#
This class contains different methods to implement the learning rate decay scheduler.
class LearningRateDecay:
    def __init__(self):
        pass
    def constant(self, t, lr_0):
        '''
        t: iteration
        lr_0: initial learning rate
        '''
        return lr_0
    def time_decay(self, t, lr_0, k):
        '''
        lr_0: initial learning rate
        k: Decay rate
        t: iteration number
        '''
        lr = lr_0 /(1+(k*t))
        return lr
    def step_decay(self, t, lr_0, F, D):
        '''
        lr_0: initial learning rate
        F: factor value controlling the rate at which the learning rate drops
        D: “Drop every” iteration
        t: current iteration
        '''
        mult = F**np.floor((1+t)/D)
        lr = lr_0 * mult
        return lr
    def exponential_decay(self, t, lr_0, k):
        '''
        lr_0: initial learning rate
        k: Exponential Decay rate
        t: iteration number
        '''
        lr = lr_0 * np.exp(-k*t)
        return lr
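For example (the hyperparameter values below are arbitrary), you can compare how the different schedules shrink an initial learning rate of 0.05 over iterations:
# comparing the decay schedules (illustrative hyperparameters)
lr_decay = LearningRateDecay()
lr_0 = 0.05
for t in [0, 10, 100]:
    print(t,
          round(lr_decay.time_decay(t, lr_0, k=0.01), 5),
          round(lr_decay.step_decay(t, lr_0, F=0.5, D=50), 5),
          round(lr_decay.exponential_decay(t, lr_0, k=0.01), 5))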
Utility function#
This class contains several utility functions such as one-hot encoding, label encoding, normalization, train-test split, etc.
class Utility:
    def __init__(self):
        pass
    def label_encoding(self, Y):
        '''
        Parameters:
        Y: (m,d) shape matrix with categorical data
        Return
        result: label encoded data of 𝑌
        idx_list: list of the dictionaries containing the unique values 
                  of the columns and their mapping to the integer.
        '''
        idx_list = []
        result = []
        for col in range(Y.shape[1]):
            indexes = {val: idx for idx, val in enumerate(np.unique(Y[:, col]))}
            result.append([indexes[s] for s in Y[:, col]])
            idx_list.append(indexes)
        return np.array(result).T, idx_list
    def onehot(self, X):
        '''
        Parameters:
        X: 1D array of labels of length "m"
        Return
        X_onehot: (m,d) one hot encoded matrix (one-hot of X) 
                  (where d is the number of unique values in X)
        indexes: dictionary containing the unique values of X and their mapping to the integer column
        '''
        indexes = {val: idx for idx, val in enumerate(np.unique(X))}
        y = np.array([indexes[s] for s in X])
        X_onehot = np.zeros((y.size, len(indexes)))
        X_onehot[np.arange(y.size), y] = 1
        return X_onehot, indexes
    def minmax(self, X, min_X=None, max_X=None):
        if min_X is None:
            min_X = np.min(X, axis=0)
        if max_X is None:
            max_X = np.max(X, axis=0)
        Z = (X - min_X) / (max_X - min_X)
        return Z, min_X, max_X
    def standardize(self, X, mu=None, std=None):
        if mu is None:
            mu = np.mean(X, axis=0)
        if std is None:
            std = np.std(X, axis=0)
        Z = (X - mu) / std
        return Z, mu, std
    def inv_standardize(self, Z, mu, std):
        X = Z*std + mu
        return X
    def train_test_split(self, X, y, test_ratio=0.2, seed=None):
        if seed is not None:
            np.random.seed(seed)
        train_ratio = 1-test_ratio
        indices = np.random.permutation(X.shape[0])
        train_idx, test_idx = indices[:int(train_ratio*len(X))], indices[int(train_ratio*len(X)):]
        X_train, X_test = X[train_idx,:], X[test_idx,:]
        y_train, y_test = y[train_idx], y[test_idx]
        return X_train, X_test, y_train, y_test
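Here is a small illustration (labels made up) of the one-hot encoder and the train-test split:
# quick sanity check for the Utility class (illustrative labels)
utility = Utility()
labels = np.array(['cat', 'dog', 'cat', 'bird'])
onehot, mapping = utility.onehot(labels)
print(mapping)          # {'bird': 0, 'cat': 1, 'dog': 2}
print(onehot.shape)     # (4, 3)
X_dummy = np.random.randn(4, 2)
X_tr, X_te, y_tr, y_te = utility.train_test_split(X_dummy, labels, test_ratio=0.25, seed=0)
print(X_tr.shape, X_te.shape)   # (3, 2) (1, 2)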
Weights initializer class#
class Weights_initializer:
    def __init__(self, shape, initializer_type=None, seed=None):
        '''
        Parameters
        shape: Shape of the weight matrix
        initializer_type: type of weight initializer
        available options are 'zeros', 'ones', 'random_normal', 'random_uniform', 
        'he_normal', 'xavier_normal' and 'glorot_normal'
        '''
        self.shape = shape
        if initializer_type is None:
            self.initializer_type = "he_normal"
        else:
            self.initializer_type = initializer_type
        self.seed = seed
        
    def zeros_initializer(self):
        if self.seed is not None:
            np.random.seed(self.seed)
        return np.zeros(self.shape)
    def ones_initializer(self):
        if self.seed is not None:
            np.random.seed(self.seed)
        return np.ones(self.shape)
    def random_normal_initializer(self):
        if self.seed is not None:
            np.random.seed(self.seed)
        return np.random.normal(size=self.shape)
    def random_uniform_initializer(self):
        if self.seed is not None:
            np.random.seed(self.seed)
        return np.random.uniform(size=self.shape)
    def he_initializer(self):
        if self.seed is not None:
            np.random.seed(self.seed)
        try:
            F, Kc, Kh, Kw = self.shape
        except:
            Kh, Kw = self.shape
        return np.random.randn(*self.shape) * np.sqrt(2/Kh)
    def xavier_initializer(self):
        '''
        shape: Shape of the Kernel matrix.
        '''
        if self.seed is not None:
            np.random.seed(self.seed)
        try:
            F, Kc, Kh, Kw = self.shape
        except:
            Kh, Kw = self.shape
        return np.random.randn(*self.shape) * np.sqrt(1/Kh)
    def glorot_initializer(self):
        '''
        shape: Shape of the weight matrix.
        '''
        if self.seed is not None:
            np.random.seed(self.seed)
        try:
            F, Kc, Kh, Kw = self.shape
        except:
            Kh, Kw = self.shape
        return np.random.randn(*self.shape) * np.sqrt(2/(Kh+Kw))
    def get_initializer(self):
        if self.initializer_type == 'zeros':
            return self.zeros_initializer()
        elif self.initializer_type == 'ones':
            return self.ones_initializer()
        elif self.initializer_type == 'random_normal':
            return self.random_normal_initializer()
        elif self.initializer_type == 'random_uniform':
            return self.random_uniform_initializer()
        elif self.initializer_type == 'he_normal':
            return self.he_initializer()
        elif self.initializer_type == 'xavier_normal':
            return self.xavier_initializer()
        elif self.initializer_type == 'glorot_normal':
            return self.glorot_initializer()
        else:
            raise ValueError("Valid initializer options are 'zeros', 'ones', 'random_normal', 'random_uniform', 'he_normal', 'xavier_normal', and 'glorot_normal'")
Dense class#
Dense class implements the operation \(a = f(XW + b)\),
where the activation \(f(\cdot)\) is used if specified, else we do not use it. \(W\) is a weights matrix created by the Dense layer based on the type of initialization (link to previous chapter) provided, and \(b\) is a bias vector created by the layer (only applicable if use_bias is True). These are all attributes of Dense.
class Dense:
    def __init__(self, neurons, activation_type=None, use_bias=True, 
                 weight_initializer_type=None, weight_regularizer=None, seed=None, input_dim=None):
        '''
        Parameters:
        neurons: Positive integer (number of neurons), dimensionality of the output
        activation_type: type of activation
                         available options are 'sigmoid', 'linear', 'tanh', 'softmax', 'prelu' and 'relu'
                         If you don't specify anything, no activation is applied (i.e. "linear" activation: a(x) = x).
        use_bias: Boolean, whether the layer uses a bias vector.
        
        weight_initializer_type: Initializer for the kernel weights matrix.
        
        weight_regularizer: Tuple, Regularizer function applied to the weights matrix ('L2', 0.01) or ('L1', 2)
        seed: To generate reproducible results
        input_dim: integer showing number of neurons in input layer 
        '''
        self.neurons = neurons
        self.activation = Activation(activation_type=activation_type)
        self.use_bias = use_bias
        self.weight_initializer_type = weight_initializer_type # none is handled
        if weight_regularizer is None:
            self.weight_regularizer = ('L2', 0)
        else:
            self.weight_regularizer = weight_regularizer
        self.seed = seed
        self.input_dim = input_dim
    def initialize_parameters(self, hl, optimizer_type):
        '''
        hl: Number of neurons in layer l-1
        '''
        shape_W = (hl, self.neurons)
        shape_b = (self.neurons, 1)
        initializer = Weights_initializer(shape=shape_W,
                                          initializer_type=self.weight_initializer_type,
                                          seed=self.seed)
        self.W = initializer.get_initializer()
        self.b = np.zeros(shape_b)
        
        self.optimizer = Optimizer(optimizer_type=optimizer_type, shape_W=shape_W, shape_b=shape_b)
    def forward(self, X):
        self.X = X
        r = X @ self.W
        self.z = r + self.b.T
        a = self.activation.forward(self.z)
        return a
    
    def backpropagation(self, da):
        dz = self.activation.backpropagation(da)
        dr = dz.copy()
        self.db = np.sum(dz, axis=0).reshape(-1,1)
        self.dW = (self.X.T) @ dr
        dX = dr @ (self.W.T)
        return dX
    def update(self, lr, m, k):
        '''
        Parameters:
        lr: learning rate
        m: batch_size (number of samples in batch)
        k: iteration_number
        '''
        dW, db = self.optimizer.get_optimization(self.dW, self.db, k)
        if self.weight_regularizer[0].lower()=='l2':
            dW += self.weight_regularizer[1] * self.W
        elif self.weight_regularizer[0].lower()=='l1':
            dW += self.weight_regularizer[1] * np.sign(self.W)
        
        self.W -= dW*(lr/m)
        if self.use_bias:
            self.b -= db*(lr/m)
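A quick shape check for the Dense layer (the sizes below are arbitrary): forward maps \((m, h_{l-1})\) to \((m, \text{neurons})\), and backpropagation returns an error of the input's shape while caching dW and db:
# quick shape check for the Dense layer (illustrative sizes)
dense = Dense(neurons=4, activation_type='relu', seed=42)
dense.initialize_parameters(hl=3, optimizer_type='adam')  # 3 input features
X = np.random.randn(5, 3)                                 # batch of 5 samples
a = dense.forward(X)
dX = dense.backpropagation(np.ones_like(a))
print(a.shape, dX.shape, dense.W.shape, dense.dW.shape)   # (5, 4) (5, 3) (3, 4) (3, 4)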
Dropout class#
This class will perform forward and backpropagation for a Dropout layer
class Dropout:
    def __init__(self, p):
        '''
        Parameters
        p: probability of keeping a unit active (inverted dropout; kept units are scaled by 1/p)
        '''
        self.p = p
        if self.p == 0:
            self.p += 1e-6
        if self.p == 1:
            self.p -= 1e-6
    
    def forward(self, X):
        self.mask = (np.random.rand(*X.shape) < self.p) / self.p 
        Z = X * self.mask
        return Z
    
    def backpropagation(self, dZ):
        dX = dZ * self.mask
        return dX
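A quick look at the scaling behaviour (values made up): kept units are scaled by 1/p during training, and the same mask is reused in the backward pass:
# quick sanity check for the Dropout layer (illustrative values)
np.random.seed(0)
drop = Dropout(p=0.5)
X = np.ones((4, 6))
Z = drop.forward(X)                            # kept entries become 1/p = 2.0
dX = drop.backpropagation(np.ones_like(Z))     # same mask applied to the gradient
print(Z)
print(dX)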
Batch Normalization class#
This class will perform forward and backpropagation for a Batch Normalization layer. Right now it is only prepared for 1D data (that is, it can only be used in the Fully Connected or MLP layers, not in the Convolution or Maxpool layers).
Note: We will initialise \(\gamma\) as ones and \(\beta\) as zeroes so that the output of the linear batch-norm transformation initially follows the standard zero-mean unit-variance normal distribution. This provides a normalised starting point, for which the model can update the \(\gamma\) and \(\beta\) to scale and shift the distribution(s) of each input accordingly (for the current layer).
Forward pass
eps represents: \(\epsilon\)
mu represents: \(\mu\)
var represents: \(\sigma^2\)
zmu represents: \(\bar{z_l}\)
ivar represents: \(\frac{1}{\sqrt{\sigma^2 + \epsilon}}\)
zhat represents: \(\hat{z_l}\)
q represents: \(q_l\)
Backpropagation
This dq variable below represents \(\frac{\partial J}{\partial q_l}\)
dgamma represents: \(\frac{\partial J}{\partial \gamma}\)
dbeta represents: \(\frac{\partial J}{\partial \beta}\)
dzhat represents: \(\frac{\partial J}{\partial \hat{z_l}}\)
dvar represents: \(\frac{\partial J}{\partial \sigma^2}\)
dmu represents: \(\frac{\partial J}{\partial \mu}\)
dz represents: \(\frac{\partial J}{\partial z_l}\)
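For reference, the variables above correspond to the standard batch normalization forward and backward equations (summarized here; the code below implements these same expressions):
\(\mu = \frac{1}{m}\sum_{i=1}^{m} z_i, \quad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(z_i-\mu)^2, \quad \hat{z} = \frac{z-\mu}{\sqrt{\sigma^2 + \epsilon}}, \quad q = \gamma\hat{z} + \beta\)
\(\frac{\partial J}{\partial \gamma} = \sum_i \frac{\partial J}{\partial q_i}\,\hat{z}_i, \quad \frac{\partial J}{\partial \beta} = \sum_i \frac{\partial J}{\partial q_i}, \quad \frac{\partial J}{\partial \hat{z}} = \gamma\,\frac{\partial J}{\partial q}\)
\(\frac{\partial J}{\partial \sigma^2} = -\frac{1}{2}\sum_i \frac{\partial J}{\partial \hat{z}_i}(z_i-\mu)(\sigma^2+\epsilon)^{-3/2}, \quad \frac{\partial J}{\partial \mu} = -\sum_i \frac{\partial J}{\partial \hat{z}_i}\frac{1}{\sqrt{\sigma^2+\epsilon}}\)
\(\frac{\partial J}{\partial z} = \frac{\partial J}{\partial \hat{z}}\frac{1}{\sqrt{\sigma^2+\epsilon}} + \frac{\partial J}{\partial \sigma^2}\frac{2(z-\mu)}{m} + \frac{1}{m}\frac{\partial J}{\partial \mu}\)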
class BatchNormalization:
    def __init__(self, momentum=0.9, epsilon=1e-6):
        '''
        Parameters
        momentum: Momentum for the moving average
        epsilon: 𝜖, Small float added to variance to avoid dividing by zero
        '''
        self.epsilon = epsilon
        self.momentum = momentum
    
    def initialize_parameters(self, d):
        '''
        d: Shape of input to BN layer
        '''
        self.gamma = np.ones((d))
        self.beta = np.zeros((d))
        self.running_mean = np.zeros((d))
        self.running_var = np.zeros((d))
    
    def forward(self, z, mode='train'):
        '''
        z: Input to BN layer
        mode: forward pass used for train or test
        ''' 
        if mode=='train':
            self.m, self.d = z.shape
            self.mu = np.mean(z, axis = 0) # 𝜇
            self.var = np.var(z, axis=0) # 𝜎^2
            self.zmu = z - self.mu # z - 𝜇
            self.ivar = 1 / np.sqrt(self.var + self.epsilon) # 𝜎𝑖𝑛𝑣
            self.zhat = self.zmu * self.ivar 
            q = self.gamma*self.zhat + self.beta # ql
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * self.mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * self.var
        elif mode=='test':
            q = (z - self.running_mean) / np.sqrt(self.running_var + self.epsilon)
            q = self.gamma*q + self.beta
        else:
            raise ValueError('Invalid forward batchnorm mode "%s"' % mode)
        return q
    def backpropagation(self, dq):
        self.dgamma = np.sum(dq * self.zhat, axis=0)
        self.dbeta = np.sum(dq, axis=0)
        dzhat = dq * self.gamma
        dvar = np.sum(dzhat * self.zmu * (-.5) * (self.ivar**3), axis=0)
        dmu = np.sum(dzhat * (-self.ivar), axis=0)
        dz = dzhat * self.ivar + dvar * (2/self.m) * self.zmu + (1/self.m)*dmu
        return dz
    def update(self, lr, m, k):
        '''
        Parameters:
        lr: learning rate
        m: batch_size (number of samples in batch)
        k: iteration_number
        '''
        self.gamma -= self.dgamma*(lr/m)
        self.beta -= self.dbeta*(lr/m)
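A quick numerical check (random data, not from the original notebook): in training mode the normalized output should have roughly zero mean and unit variance per feature:
# quick sanity check for the BatchNormalization layer (random data)
np.random.seed(0)
bn = BatchNormalization()
bn.initialize_parameters(d=3)
z = np.random.randn(8, 3) * 5 + 2               # batch of 8 samples, 3 features
q = bn.forward(z, mode='train')
print(q.mean(axis=0).round(6), q.var(axis=0).round(6))   # ~0 and ~1 per feature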
Padding2D class#
This class will perform padding on a batch of 2D images with channels. We treat padding as a separate layer, with its own forward and backpropagation operations.
We also need a small function to extract the original input errors from the padded ones. You can think of this as backpropagation through the padding operation.
class Padding2D:
    def __init__(self, p='valid'):
        '''
        Parameters:
        p: padding type
        Allowed types are only 'same', 'valid', an integer or a tuple of length 2.
        '''
        self.p = p
        
    def get_dimensions(self, input_shape, kernel_size, s=(1,1)):
        '''
        Utility function to help get the dimension of the output after padding
        '''
        if len(input_shape)==4:
            m, Nc, Nh, Nw = input_shape
        elif len(input_shape)==3:
            Nc, Nh, Nw = input_shape
        
        Kh, Kw = kernel_size
        sh, sw = s
        p = self.p
        
        if type(p)==int:
            pt, pb = p, p
            pl, pr = p, p
        if type(p)==tuple:
            ph, pw = p
            pt, pb = ph//2, (ph+1)//2
            pl, pr = pw//2, (pw+1)//2
        elif p=='valid':
            pt, pb = 0, 0
            pl, pr = 0, 0
        elif p=='same':
            # calculating how much padding is required in all 4 directions 
            # (top, bottom, left and right)
            ph = (sh-1)*Nh + Kh - sh
            pw = (sw-1)*Nw + Kw - sw
            pt, pb = ph//2, (ph+1)//2
            pl, pr = pw//2, (pw+1)//2
        else:
            raise ValueError("Incorrect padding type. Allowed types are only 'same', 'valid', an integer or a tuple of length 2.")
            
        if len(input_shape)==4:
            output_shape = (m, Nc, Nh+pt+pb, Nw+pl+pr)
        elif len(input_shape)==3:
            output_shape = (Nc, Nh+pt+pb, Nw+pl+pr)
        
        return output_shape, (pt, pb, pl, pr)
        
    def forward(self, X, kernel_size, s=(1,1)):
        '''
        Parameters:
        
        X: input of shape (m, Nc, Nh, Nw)
        
        s: strides along height and width (sh, sw)
        
        kernel_size: kernel size as specified in Conv2D layer
        
        Returns: 
        
        Xp: padded X
        '''
        self.input_shape = X.shape
        m, Nc, Nh, Nw = self.input_shape
        
        self.output_shape, (self.pt, self.pb, self.pl, self.pr) = self.get_dimensions(self.input_shape, 
                                                                                      kernel_size, s=s)
        
        zeros_r = np.zeros((m, Nc, Nh, self.pr))
        zeros_l = np.zeros((m, Nc, Nh, self.pl))
        zeros_t = np.zeros((m, Nc, self.pt, Nw + self.pl + self.pr))
        zeros_b = np.zeros((m, Nc, self.pb, Nw + self.pl + self.pr))
        Xp = np.concatenate((X, zeros_r), axis=3)
        Xp = np.concatenate((zeros_l, Xp), axis=3)
        Xp = np.concatenate((zeros_t, Xp), axis=2)
        Xp = np.concatenate((Xp, zeros_b), axis=2)
        
        return Xp
    
    def backpropagation(self, dXp):
        '''
        Parameters: 
        
        dXp: Backprop Error of padded X (Xp)
        
        Return:
        
        dX: Backprop Error of X
        '''
        m, Nc, Nh, Nw = self.input_shape
        dX = dXp[:, :, self.pt:self.pt+Nh, self.pl:self.pl+Nw]
        return dX
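As a quick shape check (input sizes arbitrary), 'same' padding with a 5x5 kernel and unit strides pads a 28x28 image to 32x32, and backpropagation crops the error back to the original size:
# quick shape check for the Padding2D layer (illustrative sizes)
pad = Padding2D(p='same')
X = np.random.randn(2, 3, 28, 28)                  # (m, Nc, Nh, Nw)
Xp = pad.forward(X, kernel_size=(5, 5), s=(1, 1))
dX = pad.backpropagation(np.ones(Xp.shape))
print(Xp.shape, dX.shape)                          # (2, 3, 32, 32) (2, 3, 28, 28)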
Convolution2D class#
This class will perform forward and backpropagation for a Convolution layer on a batch of 2D images with channels.
The input is a batch of images (with channels) and there are multiple filters.
Shape of Image will be \((m, N_c, N_h, N_w)\).
The shape of Kernel will be \((F, K_c, K_h, K_w)\) where \(F\) is the total number of filters.


class Conv2D:
    def __init__(self, filters, kernel_size, s=(1, 1), p='valid',
                 activation_type=None, use_bias=True, weight_initializer_type=None, 
                 kernel_regularizer=None, seed=None, input_shape=None):
        '''
        Parameters:
        filters: Integer, the number of output filters in the convolution (F).
        
        kernel_size: An integer or tuple/list of 2 integers, 
                     specifying the height and width of the 2D convolution window.
                     
        s: strides along height and width (sh, sw)
        
        p: padding type
           Allowed types are only 'same', 'valid', an integer or a tuple of length 2.
        activation_type: type of activation
                         available options are 'sigmoid', 'linear', 'tanh', 'softmax', 'prelu' and 'relu'
                         If you don't specify anything, no activation is applied (i.e. "linear" activation: a(x) = x).
        use_bias: Boolean, whether the layer uses a bias vector.
        
        weight_initializer_type: Initializer for the kernel weights matrix.
        
        kernel_regularizer: Tuple, Regularizer function applied to the kernel matrix ('L2', 0.01) or ('L1', 2)
        seed: To generate reproducible results
        input_shape: tuple showing size of input: (batch_size, channels, rows, cols) -> (m, Nc, Nh, Nw)
        '''
        
        self.padding = Padding2D(p=p)
        
        self.F = filters
        
        self.input_shape_x = input_shape
        
        if type(kernel_size)==int:
            self.kernel_size = (kernel_size, kernel_size)
        elif type(kernel_size)==tuple and len(kernel_size)==2:
            self.kernel_size = kernel_size
            
        self.Kh, self.Kw  = self.kernel_size
        
        if type(s)==int:
            self.s = (s,s)
        elif type(s)==tuple and len(s)==2:
            self.s = s
        
        self.sh, self.sw = self.s
        
        self.activation = Activation(activation_type=activation_type)
        self.use_bias = use_bias
        self.weight_initializer_type = weight_initializer_type # none is handled
        if kernel_regularizer is None:
            self.kernel_regularizer = ('L2', 0)
        else:
            self.kernel_regularizer = kernel_regularizer
        self.seed = seed
        
    def get_dimensions(self, input_shape):
        
        self.input_shape_x = input_shape # (3D or 4D)
        
        # Padded X will be actual input to this Conv2D
        
        self.input_shape, _ = self.padding.get_dimensions(self.input_shape_x, 
                                                          self.kernel_size, self.s)
        
        if len(input_shape)==3:
            self.Nc, self.Nh, self.Nw = self.input_shape
        elif len(input_shape)==4:
            self.m, self.Nc, self.Nh, self.Nw = self.input_shape
        
        # Output shape
        self.Oh = (self.Nh - self.Kh)//self.sh + 1
        self.Ow = (self.Nw - self.Kw)//self.sw + 1
        
        if len(input_shape)==3:
            self.output_shape = (self.F, self.Oh, self.Ow)
        elif len(input_shape)==4:
            self.output_shape = (self.m, self.F, self.Oh, self.Ow)
    def initialize_parameters(self, input_shape, optimizer_type):
        
        self.get_dimensions(input_shape)
        
        shape_b = (self.F, self.Oh, self.Ow)
        
        shape_K = (self.F, self.Nc, self.Kh, self.Kw)
        
        initializer = Weights_initializer(shape=shape_K,
                                          initializer_type=self.weight_initializer_type,
                                          seed=self.seed)
        
        self.K = initializer.get_initializer()
        self.b = np.zeros(shape_b)
        
        self.optimizer = Optimizer(optimizer_type=optimizer_type, shape_W=shape_K, shape_b=shape_b)
    
    def dilate2D(self, X, Dr=(1,1)):
        dh, dw = Dr # Dilate rate
        m, C, H, W = X.shape
        Xd = np.insert(arr=X, obj=np.repeat(np.arange(1,W), dw-1), values=0, axis=-1)
        Xd = np.insert(arr=Xd, obj=np.repeat(np.arange(1,H), dh-1), values=0, axis=-2)
        return Xd
    def prepare_subMatrix(self, X, Kh, Kw, s):
        m, Nc, Nh, Nw = X.shape
        sh, sw = s
        Oh = (Nh-Kh)//sh + 1
        Ow = (Nw-Kw)//sw + 1
        strides = (Nc*Nh*Nw, Nw*Nh, Nw*sh, sw, Nw, 1)
        strides = tuple(i * X.itemsize for i in strides)
        subM = np.lib.stride_tricks.as_strided(X, 
                                               shape=(m, Nc, Oh, Ow, Kh, Kw),
                                               strides=strides)
        return subM
         
    def convolve(self, X, K, s=(1,1), mode='front'):
        F, Kc, Kh, Kw = K.shape
        subM = self.prepare_subMatrix(X, Kh, Kw, s)
        
        if mode=='front':
            return np.einsum('fckl,mcijkl->mfij', K, subM)
        elif mode=='back':
            return np.einsum('fdkl,mcijkl->mdij', K, subM)
        elif mode=='param':
            return np.einsum('mfkl,mcijkl->fcij', K, subM)
    def dZ_D_dX(self, dZ_D, Nh, Nw):
        # Pad the dilated dZ (dZ_D -> dZ_Dp)
        _, _, Hd, Wd = dZ_D.shape
        ph = Nh - Hd + self.Kh - 1
        pw = Nw - Wd + self.Kw - 1
        
        padding_back = Padding2D(p=(ph, pw))
        dZ_Dp = padding_back.forward(dZ_D, self.kernel_size, self.s)
        # Rotate K by 180 degrees
        
        K_rotated = self.K[:, :, ::-1, ::-1]
        
        # convolve dZ_Dp with K_rotated
        
        dXp = self.convolve(dZ_Dp, K_rotated, mode='back')
        
        dX = self.padding.backpropagation(dXp)
        return dX
    def forward(self, X):
        # padding
        self.X = X
        Xp = self.padding.forward(X, self.kernel_size, self.s)
        
        # convolve Xp with K
        Z = self.convolve(Xp, self.K, self.s) + self.b
    
        a = self.activation.forward(Z)
        
        return a
     
    def backpropagation(self, da):
        Xp = self.padding.forward(self.X, self.kernel_size, self.s)
        m, Nc, Nh, Nw = Xp.shape
        dZ = self.activation.backpropagation(da)
        
        # Dilate dZ (dZ-> dZ_D)
        
        dZ_D = self.dilate2D(dZ, Dr=self.s)
        
        dX = self.dZ_D_dX(dZ_D, Nh, Nw)
        
        # Gradient dK
        
        _, _, Hd, Wd = dZ_D.shape
        
        ph = self.Nh - Hd - self.Kh + 1
        pw = self.Nw - Wd - self.Kw + 1
        padding_back = Padding2D(p=(ph, pw))
        dZ_Dp = padding_back.forward(dZ_D, self.kernel_size, self.s)
        
        self.dK = self.convolve(Xp, dZ_Dp, mode='param')
        
        # Gradient db
        
        self.db = np.sum(dZ, axis=0)
        
        return dX
    def update(self, lr, m, k):
        '''
        Parameters:
        lr: learning rate
        m: batch_size (number of samples in batch)
        k: iteration_number
        '''
        dK, db = self.optimizer.get_optimization(self.dK, self.db, k)
        if self.kernel_regularizer[0].lower()=='l2':
            dK += self.kernel_regularizer[1] * self.K
        elif self.kernel_regularizer[0].lower()=='l1':
            dK += self.kernel_regularizer[1] * np.sign(self.K)
        self.K -= dK*(lr/m)
        
        if self.use_bias:
            self.b -= db*(lr/m)
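A quick forward/backward shape check for the Conv2D layer (the sizes below are arbitrary): with 'same' padding and unit strides, the spatial dimensions are preserved and the number of channels becomes the number of filters:
# quick shape check for the Conv2D layer (illustrative sizes)
conv = Conv2D(filters=8, kernel_size=(3, 3), p='same', activation_type='relu', seed=1)
conv.initialize_parameters(input_shape=(2, 3, 16, 16), optimizer_type='adam')
X = np.random.randn(2, 3, 16, 16)                 # (m, Nc, Nh, Nw)
a = conv.forward(X)
dX = conv.backpropagation(np.ones(a.shape))
print(a.shape, dX.shape, conv.K.shape)            # (2, 8, 16, 16) (2, 3, 16, 16) (8, 3, 3, 3)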
Pooling2D class#
This class will perform forward and backpropagation for a Pooling layer on a batch of 2D images with channels.
The available pooling types are max and mean.

class Pooling2D:
    def __init__(self, pool_size=(2,2), s=(2,2), p='valid', pool_type='max'):
        '''
        Parameters:
        
        pool_size: An integer or tuple/list of 2 integers, 
                     specifying the height and width of the 2D pooling window.
                     
        s: strides along height and width (sh, sw)
        
        p: padding type
           Allowed types are only 'same', 'valid', an integer or a tuple of length 2.
        pool_type: pooling type
        Allowed types are only 'max', or 'mean'
        '''
        self.padding = Padding2D(p=p)
        
        if type(pool_size)==int:
            self.pool_size = (pool_size, pool_size)
        elif type(pool_size)==tuple and len(pool_size)==2:
            self.pool_size = pool_size
            
        self.Kh, self.Kw  = self.pool_size
        
        if type(s)==int:
            self.s = (s,s)
        elif type(s)==tuple and len(s)==2:
            self.s = s
        
        self.sh, self.sw = self.s
        
        self.pool_type = pool_type    
    def get_dimensions(self, input_shape):
        
        if len(input_shape)==4:
            m, Nc, Nh, Nw = input_shape
        elif len(input_shape)==3:
            Nc, Nh, Nw = input_shape
        Oh = (Nh-self.Kh)//self.sh + 1
        Ow = (Nw-self.Kw)//self.sw + 1
        
        if len(input_shape)==4:
            self.output_shape = (m, Nc, Oh, Ow)
        elif len(input_shape)==3:
            self.output_shape = (Nc, Oh, Ow)
    def prepare_subMatrix(self, X, pool_size, s):
        m, Nc, Nh, Nw = X.shape
        sh, sw = s
        Kh, Kw = pool_size
        Oh = (Nh-Kh)//sh + 1
        Ow = (Nw-Kw)//sw + 1
        strides = (Nc*Nh*Nw, Nh*Nw, Nw*sh, sw, Nw, 1)
        strides = tuple(i * X.itemsize for i in strides)
        subM = np.lib.stride_tricks.as_strided(X, 
                                               shape=(m, Nc, Oh, Ow, Kh, Kw),
                                               strides=strides)
        return subM
    
    def pooling(self, X, pool_size=(2,2), s=(2,2)):
        
        subM = self.prepare_subMatrix(X, pool_size, s)
        
        if self.pool_type=='max':
            return np.max(subM, axis=(-2,-1))
        elif self.pool_type=='mean':
            return np.mean(subM, axis=(-2,-1))
        else:
            raise ValueError("Allowed pool types are only 'max' or 'mean'.")
    def prepare_mask(self, subM, Kh, Kw):
        
        m, Nc, Oh, Ow, Kh, Kw = subM.shape
        a = subM.reshape(-1,Kh*Kw)
        idx = np.argmax(a, axis=1)
        b = np.zeros(a.shape)
        b[np.arange(b.shape[0]), idx] = 1
        mask = b.reshape((m, Nc, Oh, Ow, Kh, Kw))
        return mask
    def mask_dXp(self, mask, Xp, dZ, Kh, Kw):
        dA = np.einsum('i,ijk->ijk', dZ.reshape(-1), mask.reshape(-1,Kh,Kw)).reshape(mask.shape)
        m, Nc, Nh, Nw = Xp.shape
        strides = (Nc*Nh*Nw, Nh*Nw, Nw, 1)
        strides = tuple(i * Xp.itemsize for i in strides)
        dXp = np.lib.stride_tricks.as_strided(dA, Xp.shape, strides)
        return dXp
            
    def maxpool_backprop(self, dZ, X):
        
        Xp = self.padding.forward(X, self.pool_size, self.s)
        subM = self.prepare_subMatrix(Xp, self.pool_size, self.s)
        m, Nc, Oh, Ow, Kh, Kw = subM.shape
        
        m, Nc, Nh, Nw = Xp.shape
        mask = self.prepare_mask(subM, Kh, Kw)
        
        dXp = self.mask_dXp(mask, Xp, dZ, Kh, Kw)
        
        return dXp
    def dZ_dZp(self, dZ):
        sh, sw = self.s
        Kh, Kw = self.pool_size
        dZp = np.kron(dZ, np.ones((Kh,Kw), dtype=dZ.dtype)) # similar to repelem in matlab
        jh, jw = Kh-sh, Kw-sw # jump along height and width
        if jw!=0:
            L = dZp.shape[-1]-1
            l1 = np.arange(sw, L)
            l2 = np.arange(sw + jw, L + jw)
            mask = np.tile([True]*jw + [False]*jw, len(l1)//jw).astype(bool)
            r1 = l1[mask[:len(l1)]]
            r2 = l2[mask[:len(l2)]]
            dZp[:, :, :, r1] += dZp[:, :, :, r2]
            dZp = np.delete(dZp, r2, axis=-1)
        if jh!=0:
            L = dZp.shape[-2]-1
            l1 = np.arange(sh, L)
            l2 = np.arange(sh + jh, L + jh)
            mask = np.tile([True]*jh + [False]*jh, len(l1)//jh).astype(bool)
            r1 = l1[mask[:len(l1)]]
            r2 = l2[mask[:len(l2)]]
            dZp[:, :, r1, :] += dZp[:, :, r2, :]
            dZp = np.delete(dZp, r2, axis=-2)
        return dZp
    def averagepool_backprop(self, dZ, X):
        
        Xp = self.padding.forward(X, self.pool_size, self.s)
        m, Nc, Nh, Nw = Xp.shape
        dZp = self.dZ_dZp(dZ)
        ph = Nh - dZp.shape[-2]
        pw = Nw - dZp.shape[-1]
        
        padding_back = Padding2D(p=(ph, pw))
        
        dXp = padding_back.forward(dZp, s=self.s, kernel_size=self.pool_size)
        return dXp / (self.Kh * self.Kw)  # each input contributes 1/(Kh*Kw) to its window mean
    def forward(self, X):
        '''
        Parameters:
        
        X: input of shape (m, Nc, Nh, Nw)
        
        Returns: 
        
        Z: pooled X
        '''
        self.X = X
        # padding
        Xp = self.padding.forward(X, self.pool_size, self.s)
        Z = self.pooling(Xp, self.pool_size, self.s)
        
        return Z
    
    def backpropagation(self, dZ):
        '''
        Parameters: 
        
        dZ: Output Error
        
        Return:
        
        dX: Backprop Error of X
        '''
        if self.pool_type=='max':
            dXp = self.maxpool_backprop(dZ, self.X)
        elif self.pool_type=='mean':
            dXp = self.averagepool_backprop(dZ, self.X)
        dX = self.padding.backpropagation(dXp)
        return dX
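A quick shape check for the pooling layer (sizes arbitrary): a 2x2 max pool with stride 2 halves the spatial dimensions, and backpropagation routes the error back to an array of the input's shape:
# quick shape check for the Pooling2D layer (illustrative sizes)
pool = Pooling2D(pool_size=(2, 2), s=(2, 2), pool_type='max')
X = np.random.randn(2, 3, 8, 8)                   # (m, Nc, Nh, Nw)
Z = pool.forward(X)
dX = pool.backpropagation(np.ones(Z.shape))
print(Z.shape, dX.shape)                          # (2, 3, 4, 4) (2, 3, 8, 8)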
Flatten class#
Converts a 4D input \((m, N_c, N_h, N_w)\) to a 2D array of shape \((m, N_c \times N_h \times N_w)\) so that it can be sent to the Dense layer.
class Flatten:
    def __init__(self):
        pass
    def forward(self, X):
        self.m, self.Nc, self.Nh, self.Nw = X.shape
        X_flat = X.reshape((self.m, self.Nc * self.Nh * self.Nw))
        return X_flat
     
    def backpropagation(self, dZ):
        dX = dZ.reshape((self.m, self.Nc, self.Nh, self.Nw))
        return dX
    
    def get_dimensions(self, input_shape):
        
        if len(input_shape)==4:
            self.m, self.Nc, self.Nh, self.Nw = input_shape
        elif len(input_shape)==3:
            self.Nc, self.Nh, self.Nw = input_shape
        
        self.output_shape = self.Nc * self.Nh * self.Nw
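And a one-line check for Flatten (sizes arbitrary):
# quick shape check for the Flatten layer (illustrative sizes)
flat = Flatten()
X = np.random.randn(2, 3, 4, 4)
Z = flat.forward(X)
print(Z.shape, flat.backpropagation(Z).shape)     # (2, 48) (2, 3, 4, 4)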
CNN#
This class finally contains the compile, summary, fit, predict, etc. methods for executing our CNN model. Apart from network_architecture, summary and initialize_parameters, all other methods are almost the same as those used in the MLP model.
class CNN:
    def __init__(self, layers=None):
        '''
        This is a sequential CNN model
        '''
        if layers is None:
            self.layers = []
        else:
            self.layers = layers
        self.network_architecture_called = False
    def add(self, layer):
        # adds a layer to CNN model
        self.layers.append(layer)
    def Input(self, input_shape):
        self.d = input_shape
        self.architecture = [self.d] # output architecture
        self.layer_name = ["Input"]
    
    def network_architecture(self):
        for layer in self.layers:
            if layer.__class__.__name__=='Conv2D':
                if layer.input_shape_x is not None:
                    self.Input(layer.input_shape_x)
                layer.get_dimensions(self.architecture[-1])
                self.architecture.append(layer.output_shape)
                self.layer_name.append(layer.__class__.__name__)
            elif layer.__class__.__name__ in ['Flatten', 'Pooling2D']:
                layer.get_dimensions(self.architecture[-1])
                self.architecture.append(layer.output_shape)
                self.layer_name.append(layer.__class__.__name__)
            elif layer.__class__.__name__=='Dense':
                self.architecture.append(layer.neurons)
                self.layer_name.append(layer.__class__.__name__)
            else:
                self.architecture.append(self.architecture[-1])
                self.layer_name.append(layer.__class__.__name__)
        self.layers = [layer for layer in self.layers if layer is not None]
        try:
            idx = self.layer_name.index("NoneType")
            del self.layer_name[idx]
            del self.architecture[idx]
        except ValueError:
            pass
    def summary(self):
        if self.network_architecture_called==False:
            self.network_architecture()
            self.network_architecture_called = True
        len_assigned = [45, 26, 15]
        count = {'Dense': 1, 'Activation': 1, 'Input': 1,
                'BatchNormalization': 1, 'Dropout': 1, 'Conv2D': 1,
                'Pooling2D': 1, 'Flatten': 1}
        col_names = ['Layer (type)', 'Output Shape', '# of Parameters']
        print("Model: CNN")
        print('-'*sum(len_assigned))
        text = ''
        for i in range(3):
            text += col_names[i] + ' '*(len_assigned[i]-len(col_names[i]))
        print(text)
        print('='*sum(len_assigned))
        total_params = 0
        trainable_params = 0
        non_trainable_params = 0
        for i in range(len(self.layer_name)):
            # layer name
            layer_name = self.layer_name[i]
            name = layer_name.lower() + '_' + str(count[layer_name]) + ' ' + '(' + layer_name + ')'
            count[layer_name] += 1
            # output shape
            try:
                out = '(None, ' 
                for n in range(len(self.architecture[i])-1):
                    out += str(self.architecture[i][n]) + ', '
                out += str(self.architecture[i][-1]) + ')'
            except TypeError:
                # architecture[i] is an integer (Dense/Flatten output)
                out = '(None, ' + str(self.architecture[i]) + ')'
            # number of params
            if layer_name=='Dense':
                h0 = self.architecture[i-1]
                h1 = self.architecture[i]
                if self.layers[i-1].use_bias:
                    params = h0*h1 + h1
                else:
                    params = h0*h1
                total_params += params
                trainable_params += params
            elif layer_name=='BatchNormalization':
                h = self.architecture[i]
                params = 4*h
                trainable_params += 2*h
                non_trainable_params += 2*h
                total_params += params
            elif layer_name=='Conv2D':
                layer = self.layers[i-1]
                if layer.use_bias:
                    add_b = 1
                else:
                    add_b = 0
                params = ((layer.Nc * layer.Kh * layer.Kw) + add_b) * layer.F
                trainable_params += params
                total_params += params
            else:
                # Pooling, Dropout, Flatten, Input
                params = 0
            names = [name, out, str(params)]
            # print this row
            text = ''
            for j in range(3):
                text += names[j] + ' '*(len_assigned[j]-len(names[j]))
            print(text)
            if i!=(len(self.layer_name)-1):
                print('-'*sum(len_assigned))
            else:
                print('='*sum(len_assigned))
        print("Total params:", total_params)
        print("Trainable params:", trainable_params)
        print("Non-trainable params:", non_trainable_params)
        print('-'*sum(len_assigned))
    def compile(self, cost_type, optimizer_type):
        self.cost = Cost(cost_type)
        self.cost_type = cost_type
        self.optimizer_type = optimizer_type
    def initialize_parameters(self):
        if self.network_architecture_called==False:
            self.network_architecture()
            self.network_architecture_called = True
        # initialize parameters for different layers
        for i, layer in enumerate(self.layers):
            if layer.__class__.__name__ in ['Dense', 'Conv2D']:
                layer.initialize_parameters(self.architecture[i], self.optimizer_type)
            elif layer.__class__.__name__=='BatchNormalization':
                layer.initialize_parameters(self.architecture[i])
    def fit(self, X, y, epochs=10, batch_size=5, lr=1, X_val=None, y_val=None, verbose=1, lr_decay=None, **kwargs):
        
        self.history = {'Training Loss': [],
                        'Validation Loss': [],
                        'Training Accuracy': [],
                        'Validation Accuracy': []}
        iterations = 0
        self.m = batch_size
        self.initialize_parameters()
        total_num_batches = np.ceil(len(X)/batch_size)
        for epoch in range(epochs):
            cost_train = 0
            num_batches = 0
            y_pred_train = []
            y_train = []
            
            print('\nEpoch: ' + str(epoch+1) + '/' + str(epochs))
            for i in tqdm(range(0, len(X), batch_size)):
                X_batch = X[i:i+batch_size]
                y_batch = y[i:i+batch_size]
                
                Z = X_batch.copy()
                # feed-forward
                for layer in self.layers:
                    Z = layer.forward(Z)
                # calculating training accuracy
                if self.cost_type=='cross-entropy':
                    y_pred_train += np.argmax(Z, axis=1).tolist()
                    y_train += np.argmax(y_batch, axis=1).tolist()
                
                # calculating the loss
                cost_train += self.cost.get_cost(Z, y_batch) / self.m
                # calculating dL/daL (last layer backprop error)
                dZ = self.cost.get_d_cost(Z, y_batch)
                
                # backpropagation
                for layer in self.layers[::-1]:
                    dZ = layer.backpropagation(dZ)
                # Parameters update
                for layer in self.layers:
                    if layer.__class__.__name__ in ['Dense', 'BatchNormalization', 'Conv2D']:
                        layer.update(lr, self.m, iterations)
                # Learning rate decay
                if lr_decay is not None:
                    lr = lr_decay(iterations, **kwargs)
                num_batches += 1
                iterations += 1
            
            cost_train /= num_batches
            # printing purpose only (Training Accuracy, Validation loss and accuracy)
            text  = 'Training Loss: ' + str(round(cost_train, 4)) + ' - '
            self.history['Training Loss'].append(cost_train)
            # training accuracy
            if self.cost_type=='cross-entropy':
                accuracy_train = np.sum(np.array(y_pred_train) == np.array(y_train)) / len(y_train)
                text += 'Training Accuracy: ' + str(round(accuracy_train, 4))
                self.history['Training Accuracy'].append(accuracy_train)
            else:
                text += 'Training Accuracy: ' + str(round(cost_train, 4))
                self.history['Training Accuracy'].append(cost_train)
            
            if X_val is not None:
                cost_val, accuracy_val = self.evaluate(X_val, y_val, batch_size)
                text += ' - Validation Loss: ' + str(round(cost_val, 4)) + ' - '
                self.history['Validation Loss'].append(cost_val)
                text += 'Validation Accuracy: ' + str(round(accuracy_val, 4))
                self.history['Validation Accuracy'].append(accuracy_val)
            if verbose:
                print(text)
            else:
                print()
    def evaluate(self, X, y, batch_size=None):
        
        if batch_size is None:
            batch_size = len(X)
        cost = 0
        correct = 0
        num_batches = 0
        utility = Utility()
        Y_1hot, _ = utility.onehot(y)
        for i in range(0, len(X), batch_size):
            X_batch = X[i:i+batch_size]
            y_batch = y[i:i+batch_size]
            Y_1hot_batch = Y_1hot[i:i+batch_size]
            Z = X_batch.copy()
            for layer in self.layers:
                if layer.__class__.__name__=='BatchNormalization':
                    Z = layer.forward(Z, mode='test')
                else:
                    Z = layer.forward(Z)
            if self.cost_type=='cross-entropy':
                cost += self.cost.get_cost(Z, Y_1hot_batch) / len(y_batch)
                y_pred = np.argmax(Z, axis=1).tolist()
                correct += np.sum(y_pred == y_batch)
            else:
                cost += self.cost.get_cost(Z, y_batch) / len(y_batch)
            num_batches += 1
        if self.cost_type=='cross-entropy':
            accuracy = correct / len(y)
            cost /= num_batches
            return cost, accuracy
        else:
            cost /= num_batches
            return cost, cost
    def loss_plot(self):
        plt.plot(self.history['Training Loss'], 'k')
        if len(self.history['Validation Loss'])>0:
            plt.plot(self.history['Validation Loss'], 'r')
            plt.legend(['Train', 'Validation'], loc='upper right')
            plt.title('Model Loss')
        else:
            plt.title('Training Loss')
        plt.ylabel('Loss')
        plt.xlabel('Epoch')
        plt.show()
    def accuracy_plot(self):
        plt.plot(self.history['Training Accuracy'], 'k')
        if len(self.history['Validation Accuracy'])>0:
            plt.plot(self.history['Validation Accuracy'], 'r')
            plt.legend(['Train', 'Validation'], loc='lower right')
            plt.title('Model Accuracy')
        else:
            plt.title('Training Accuracy')
        plt.ylabel('Accuracy')
        plt.xlabel('Epoch')
        plt.show()
    def predict(self, X, batch_size=None):
        
        if batch_size==None:
            batch_size = len(X)
        for i in range(0, len(X), batch_size):
            X_batch = X[i:i+batch_size]
            Z = X_batch.copy()
            for layer in self.layers:
                if layer.__class__.__name__=='BatchNormalization':
                    Z = layer.forward(Z, mode='test')
                else:
                    Z = layer.forward(Z)
            if i==0:
                if self.cost_type=='cross-entropy':
                    y_pred = np.argmax(Z, axis=1).tolist()
                else:
                    y_pred = Z
            else:
                if self.cost_type=='cross-entropy':
                    y_pred += np.argmax(Z, axis=1).tolist()
                else:
                    y_pred = np.vstack((y_pred, Z))
        
        return np.array(y_pred)
Validating model using MNIST Dataset#
Check this page (link to an external website) to learn more about the MNIST dataset.
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1], X_train.shape[2])
X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1], X_test.shape[2])
samples = 5000
X_train = X_train[:samples, :]/255
X_test = X_test[:samples, :]/255
y_train = y_train[:samples]
y_test = y_test[:samples]
utility = Utility()
# train validation split
X_train_new, X_val, y_train_new, y_val = utility.train_test_split(X_train, y_train, test_ratio=0.2, seed=42)
Y_1hot_train, _ = utility.onehot(y_train_new)
input_shape = X_train_new.shape[1:]
output_dim = Y_1hot_train.shape[1]
input_shape, output_dim
((1, 28, 28), 10)
model = CNN()
model.add(model.Input(input_shape=input_shape))
model.add(Conv2D(32, kernel_size=(5, 5), p='same', activation_type="relu"))
model.add(Pooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dropout(0.4))
model.add(Dense(output_dim, activation_type="softmax"))
model.summary()
Model: CNN
--------------------------------------------------------------------------------------
Layer (type)                                 Output Shape              # of Parameters
======================================================================================
input_1 (Input)                              (None, 3, 32, 32)         0              
--------------------------------------------------------------------------------------
conv2d_1 (Conv2D)                            (None, 32, 32, 32)        2432           
--------------------------------------------------------------------------------------
pooling2d_1 (Pooling2D)                      (None, 32, 16, 16)        0              
--------------------------------------------------------------------------------------
flatten_1 (Flatten)                          (None, 8192)              0              
--------------------------------------------------------------------------------------
dropout_1 (Dropout)                          (None, 8192)              0              
--------------------------------------------------------------------------------------
dense_1 (Dense)                              (None, 10)                81930          
======================================================================================
Total params: 84362
Trainable params: 84362
Non-trainable params: 0
--------------------------------------------------------------------------------------
batch_size = 256
epochs = 10
lr = 0.05
model.compile(cost_type="cross-entropy", optimizer_type="adam")
LR_decay = LearningRateDecay()
model.fit(X_train_new, Y_1hot_train, epochs=epochs, batch_size=batch_size, lr=lr, X_val=X_val, 
          y_val=y_val, verbose=1, lr_decay=LR_decay.constant, lr_0=lr)
Epoch: 1/10
100%|██████████| 16/16 [06:06<00:00, 22.88s/it]
Training Loss: 2.2061 - Training Accuracy: 0.3025 - Validation Loss: 1.6472 - Validation Accuracy: 0.454
Epoch: 2/10
100%|██████████| 16/16 [05:43<00:00, 21.45s/it]
Training Loss: 1.3552 - Training Accuracy: 0.535 - Validation Loss: 1.121 - Validation Accuracy: 0.648
Epoch: 3/10
100%|██████████| 16/16 [05:36<00:00, 21.02s/it]
Training Loss: 1.0117 - Training Accuracy: 0.6577 - Validation Loss: 0.9004 - Validation Accuracy: 0.694
Epoch: 4/10
100%|██████████| 16/16 [05:50<00:00, 21.88s/it]
Training Loss: 0.7985 - Training Accuracy: 0.727 - Validation Loss: 0.7702 - Validation Accuracy: 0.751
Epoch: 5/10
100%|██████████| 16/16 [05:30<00:00, 20.66s/it]
Training Loss: 0.6912 - Training Accuracy: 0.766 - Validation Loss: 0.677 - Validation Accuracy: 0.77
Epoch: 6/10
100%|██████████| 16/16 [05:33<00:00, 20.87s/it]
Training Loss: 0.6056 - Training Accuracy: 0.7988 - Validation Loss: 0.6384 - Validation Accuracy: 0.799
Epoch: 7/10
100%|██████████| 16/16 [05:35<00:00, 20.99s/it]
Training Loss: 0.5665 - Training Accuracy: 0.81 - Validation Loss: 0.572 - Validation Accuracy: 0.806
Epoch: 8/10
100%|██████████| 16/16 [05:30<00:00, 20.68s/it]
Training Loss: 0.5072 - Training Accuracy: 0.8355 - Validation Loss: 0.5284 - Validation Accuracy: 0.839
Epoch: 9/10
100%|██████████| 16/16 [05:31<00:00, 20.73s/it]
Training Loss: 0.4754 - Training Accuracy: 0.8448 - Validation Loss: 0.5172 - Validation Accuracy: 0.843
Epoch: 10/10
100%|██████████| 16/16 [05:31<00:00, 20.73s/it]
Training Loss: 0.4364 - Training Accuracy: 0.8615 - Validation Loss: 0.501 - Validation Accuracy: 0.853
model.loss_plot()
model.accuracy_plot()
y_pred = model.predict(X_test, batch_size=batch_size)
confusion_matrix(y_test, y_pred)
array([[403,   0,   6,   5,   3,  10,  11,   1,  13,   8],
       [  0, 547,   5,   3,   3,   3,   3,   1,   6,   0],
       [  7,  14, 401,  36,   6,   7,   6,  15,  37,   1],
       [  1,   6,  15, 387,   1,  37,   5,  20,  23,   5],
       [  2,   2,   4,   0, 418,   2,   8,   8,   8,  48],
       [ 12,   4,   6,  38,  10, 304,  11,  10,  51,  10],
       [ 11,  10,  15,   1,  12,  22, 383,   1,   7,   0],
       [  1,  10,  18,  12,  11,   1,   0, 415,   4,  40],
       [ 10,   7,  13,  31,  20,  26,   2,  11, 350,  19],
       [  2,   5,   4,   9,  47,   7,   2,  25,  11, 408]])
acc = accuracy_score(y_test, y_pred)
print('Error Rate =',round((1-acc)*100, 2))
print('Accuracy =',round((acc)*100, 2))
Error Rate = 19.68
Accuracy = 80.32
The loss is converging, and with a more complex model (at the cost of higher training time) we could achieve much higher accuracy on MNIST. But for now, the model is working as expected!