given a training data set comprising $N$ observations $\{\mathbf{x}_n\}$, where $n = 1, \dots, N$, together with corresponding target values $\{t_n\}$, the goal is to predict the value of $t$ for a new value of $\mathbf{x}$ with a predictive hypothesis function $h$:

$$h(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}$$
but to make our hypothesis applicable to more function spaces we extend this linear function with a set of fixed, non-linear basis functions $\phi_j(\mathbf{x})$ such that

$$h(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x})$$

where it is convenient to define a dummy basis function $\phi_0(\mathbf{x}) = 1$ for the bias parameter, such that

$$h(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$$

this gets rid of the restrictions we would be imposing by using linear functions of $\mathbf{x}$; it is, however, still a linear function of $\mathbf{w}$, which eases analysis.
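as a concrete sketch of the basis expansion (the text doesn't fix a particular basis, so the Gaussian bumps, their centers, and their width here are assumptions for illustration), the features can be collected into a design matrix whose first column is the dummy basis $\phi_0(\mathbf{x}) = 1$:

```python
import numpy as np

def design_matrix(x, centers, width=1.0):
    """build Phi with phi_0(x) = 1 (bias column) followed by Gaussian
    basis functions centered at `centers` -- one illustrative choice;
    polynomials or sigmoids would slot in the same way."""
    phi = [np.ones_like(x)]  # phi_0(x) = 1 for the bias parameter
    for c in centers:
        phi.append(np.exp(-(x - c) ** 2 / (2 * width ** 2)))
    return np.stack(phi, axis=1)  # shape (N, M)

x = np.linspace(0.0, 1.0, 5)
Phi = design_matrix(x, centers=[0.25, 0.75])
print(Phi.shape)  # (5, 3): bias column plus two Gaussian features
```

the hypothesis on the whole data set is then just `Phi @ w`, which is what makes the model linear in $\mathbf{w}$ even though it is non-linear in $\mathbf{x}$.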
here, $\mathbf{W}$ is the weight matrix, which we hope represents a linear transformation that maps a vector of input features $\boldsymbol{\phi}(\mathbf{x})$ into an output vector $\mathbf{y}$.
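a minimal illustration of that view (the dimensions are made up for the example): with $M$ basis features and $K$ outputs, $\mathbf{W}$ is a $K \times M$ matrix and prediction is a single matrix–vector product:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 3, 2                       # 3 basis features, 2 outputs (arbitrary)
W = rng.normal(size=(K, M))       # the weight matrix
phi = np.array([1.0, 0.5, -0.2])  # a feature vector; phi[0] = 1 is the bias
y = W @ phi                       # linear transformation into the output vector
print(y.shape)  # (2,)
```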
we start with a random point in weight space for $\mathbf{W}$, and use an optimization algorithm to arrive at a good enough approximation of a hypothetical target function $f$ from which we assume the observations $\{t_n\}$ were drawn, "good enough" being defined by some criterion or loss function.
here we consider traditional gradient descent as the optimization method. the observations may potentially be divided into batches, but that doesn't matter in theory. our goal is to converge on a good enough $\mathbf{W}$ by moving in the direction opposite to the gradient of the loss function, since doing so takes us closer to a local minimum (i.e. "sliding downhill"). so a training step would consist of:

$$\mathbf{W} \leftarrow \mathbf{W} - \eta \, \nabla_{\mathbf{W}} E(\mathbf{W})$$
where $E$ is the loss function and $\eta$ is the learning rate.
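the update rule can be sketched in a few lines; the text doesn't pin down a loss, so sum-of-squares $E(\mathbf{w}) = \tfrac12 \lVert \boldsymbol{\Phi}\mathbf{w} - \mathbf{t} \rVert^2$ is an assumed choice here, with gradient $\boldsymbol{\Phi}^\top(\boldsymbol{\Phi}\mathbf{w} - \mathbf{t})$, and the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 3))        # design matrix, 20 observations
t = Phi @ np.array([1.0, -2.0, 0.5])  # targets drawn from known true weights
w = rng.normal(size=3)                # random starting point in weight space
eta = 0.01                            # learning rate

for _ in range(2000):
    grad = Phi.T @ (Phi @ w - t)      # gradient of 0.5 * ||Phi w - t||^2
    w = w - eta * grad                # step opposite to the gradient

print(np.round(w, 3))  # close to the true weights [1.0, -2.0, 0.5]
```

since the targets here are noiseless, the iterates slide all the way down to the weights that generated the data.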
the optimal weight matrix $\mathbf{W}^*$ would be the one that minimizes this loss for a given batch (set of observations):

$$\mathbf{W}^* = \underset{\mathbf{W}}{\arg\min}\; E(\mathbf{W})$$
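for a sum-of-squares loss (again an assumed choice, with made-up data) this argmin has a closed form: the least-squares solution, at which the gradient vanishes. a quick sketch of that optimality check:

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 4))  # design matrix for one batch
t = rng.normal(size=50)         # target values

# w* = argmin_w 0.5 * ||Phi w - t||^2, computed directly via least squares
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# at the minimum, the gradient Phi^T (Phi w* - t) is (numerically) zero
grad = Phi.T @ (Phi @ w_star - t)
print(np.allclose(grad, 0.0))  # True
```

this is the point the gradient-descent iteration above is approximating step by step.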