Paper reference: Large-Scale Machine Learning with Stochastic Gradient Descent
This GD(Gradient Descent), which is used for computing weight of NN (also used for other Machine Learning Algorithm). zi represents the example ‘i’, also as (xi, yi). After calculate all examples, we need to compute the average for all differentials by weight. Calculating all examples is a slow progress, so we can image GD is not adequate efficient.
Here comes the SGD, which use only one example to compute gradient. It is simpler, and more efficient.
Using SGD in K-mean clustering algorithm seems counterintuitive for me at first glance. But after thinking about “Sample zi belongs to cluster of wk, then don’t wait for all samples, just update wk by zi“, it becomes conceivable.
ASGD is suitable for distributed machine learning environment, since it could get averaged gradient from any example of data at any time (no order restrain).