given a training data set comprising $N$ observations $\{\mathbf{x}_n\}$, where $n = 1, \dots, N$, together with corresponding target values $\{t_n\}$, the goal is to predict the value of $t$ for a new value of $\mathbf{x}$ with a predictive hypothesis function $h$:

$$h(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}$$
but to make our hypothesis applicable to more function spaces we extend this linear function with a set of fixed, non-linear basis functions $\phi_j(\mathbf{x})$ such that

$$h(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x})$$

where it is convenient to define a dummy basis function $\phi_0(\mathbf{x}) = 1$ for the bias parameter, such that

$$h(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$$

this gets rid of the restrictions we would be imposing by using linear functions of $\mathbf{x}$; it is, however, still a linear function of $\mathbf{w}$, which eases analysis.
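as a concrete sketch of the basis expansion (the text doesn't fix a particular basis, so the Gaussian bumps, their centers, and their width here are assumptions for illustration), the features can be collected into a design matrix whose first column is the dummy basis $\phi_0(\mathbf{x}) = 1$:

```python
import numpy as np

def design_matrix(x, centers, width=1.0):
    """build Phi with phi_0(x) = 1 (bias column) followed by Gaussian
    basis functions centered at `centers` -- one illustrative choice;
    polynomials or sigmoids would slot in the same way."""
    phi = [np.ones_like(x)]  # phi_0(x) = 1 for the bias parameter
    for c in centers:
        phi.append(np.exp(-(x - c) ** 2 / (2 * width ** 2)))
    return np.stack(phi, axis=1)  # shape (N, M)

x = np.linspace(0.0, 1.0, 5)
Phi = design_matrix(x, centers=[0.25, 0.75])
print(Phi.shape)  # (5, 3): bias column plus two Gaussian features
```

the hypothesis on the whole data set is then just `Phi @ w`, which is what makes the model linear in $\mathbf{w}$ even though it is non-linear in $\mathbf{x}$.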
here, $\mathbf{W}$ is the weight matrix, which we hope represents a linear transformation that maps a vector of input features $\boldsymbol{\phi}(\mathbf{x})$ into an output vector $\mathbf{y}$.
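a minimal illustration of that view (the dimensions are made up for the example): with $M$ basis features and $K$ outputs, $\mathbf{W}$ is a $K \times M$ matrix and prediction is a single matrix–vector product:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 3, 2                       # 3 basis features, 2 outputs (arbitrary)
W = rng.normal(size=(K, M))       # the weight matrix
phi = np.array([1.0, 0.5, -0.2])  # a feature vector; phi[0] = 1 is the bias
y = W @ phi                       # linear transformation into the output vector
print(y.shape)  # (2,)
```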
we start with a random point in weight space for $\mathbf{W}$, and use an optimization algorithm to arrive at a good enough approximation of a hypothetical target function $f$ from which we assume the observations $\{t_n\}$ were drawn, "good enough" being defined by some criterion or loss function.
here we consider traditional gradient descent as the optimization method. the observations may potentially be divided into batches, but that doesn't matter in theory. our goal is to converge on a good enough $\mathbf{W}$ by moving in the direction opposite to the gradient of the loss function, since doing so takes us closer to a local minimum (i.e. "sliding downhill"). so a training step would consist of:

$$\mathbf{W} \leftarrow \mathbf{W} - \eta \, \nabla_{\mathbf{W}} E(\mathbf{W})$$
where $E$ is the loss function and $\eta$ is the learning rate.
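the update rule can be sketched in a few lines; the text doesn't pin down a loss, so sum-of-squares $E(\mathbf{w}) = \tfrac12 \lVert \boldsymbol{\Phi}\mathbf{w} - \mathbf{t} \rVert^2$ is an assumed choice here, with gradient $\boldsymbol{\Phi}^\top(\boldsymbol{\Phi}\mathbf{w} - \mathbf{t})$, and the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 3))        # design matrix, 20 observations
t = Phi @ np.array([1.0, -2.0, 0.5])  # targets drawn from known true weights
w = rng.normal(size=3)                # random starting point in weight space
eta = 0.01                            # learning rate

for _ in range(2000):
    grad = Phi.T @ (Phi @ w - t)      # gradient of 0.5 * ||Phi w - t||^2
    w = w - eta * grad                # step opposite to the gradient

print(np.round(w, 3))  # close to the true weights [1.0, -2.0, 0.5]
```

since the targets here are noiseless, the iterates slide all the way down to the weights that generated the data.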
the optimal weight matrix $\mathbf{W}^*$ would be the one that minimizes this loss for a given batch (set of observations):

$$\mathbf{W}^* = \underset{\mathbf{W}}{\arg\min}\; E(\mathbf{W})$$
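for a sum-of-squares loss (again an assumed choice, with made-up data) this argmin has a closed form: the least-squares solution, at which the gradient vanishes. a quick sketch of that optimality check:

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 4))  # design matrix for one batch
t = rng.normal(size=50)         # target values

# w* = argmin_w 0.5 * ||Phi w - t||^2, computed directly via least squares
w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# at the minimum, the gradient Phi^T (Phi w* - t) is (numerically) zero
grad = Phi.T @ (Phi @ w_star - t)
print(np.allclose(grad, 0.0))  # True
```

this is the point the gradient-descent iteration above is approximating step by step.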