see /home/mahmooz/brain/notes/data/62/38c619-d187-481c-9e8b-db1c4af6f51f/hedge_usmani.pdf. to parallelize the training of a neural network, the simplest technique i've seen around is this: construct multiple copies of the same network, split the training data into "batches" (smaller subsets of the whole training dataset), and train each copy on its own subset, each running on its own thread or core. once all the copies finish training, take the average value of every parameter across the copies and construct a new copy from those averaged parameters. then repeat the process, starting each round from the averaged network, until the epoch is complete (i.e. the whole dataset has been trained on).
this technique has its downsides, as does every other technique i've read about, but it's one of the simplest, so i'm gonna go with it first.
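here's a rough sketch of the loop i have in mind, using a toy one-layer linear "network" trained with plain numpy instead of a real network; everything here (train_shard, the learning rate, the number of rounds, the thread pool) is just a placeholder i picked to illustrate the structure, not anything taken from the pdf:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))            # toy dataset
true_w = rng.normal(size=8)
y = X @ true_w + 0.01 * rng.normal(size=1024)

def train_shard(w, X_shard, y_shard, lr=0.01, steps=50):
    """train a private copy of the parameters on one shard of the data."""
    w = w.copy()
    for _ in range(steps):
        # gradient of mean squared error for the linear model
        grad = 2 * X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)
        w -= lr * grad
    return w

n_workers = 4
w = np.zeros(8)                           # shared starting parameters
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

for round_ in range(10):
    # train one copy per shard in parallel, each starting from the current w
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        copies = list(pool.map(lambda s: train_shard(w, *s), shards))
    # average every parameter across the copies to form the next network
    w = np.mean(copies, axis=0)

print("error:", np.linalg.norm(w - true_w))

a real network would have many parameter tensors instead of one vector, so the averaging step would loop over all of them, but the shape of the loop (split, train copies, average, repeat) is the same.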