Paper reference: Large-Scale Machine Learning with Stochastic Gradient Descent

The paper covers GD (Gradient Descent), which is used for computing the weights of a neural network (and of many other machine learning algorithms). Here z_i denotes example i, also written as the pair (x_i, y_i). GD computes the gradient of the loss with respect to the weights on every example, then averages those per-example gradients before taking a single update step. Because every update requires a full pass over the data, we can see that GD is not efficient enough for large datasets.

Here comes SGD, which uses only one example per gradient update. It is simpler, and each step is far cheaper.
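The same toy fit with SGD looks like this (a sketch under my own assumed learning rate and step count; the example is drawn uniformly at random each step):

```python
import random

def sgd(xs, ys, lr=0.05, steps=500, seed=0):
    # SGD: each update uses the gradient of a SINGLE randomly drawn
    # example, so one step costs O(1) regardless of dataset size.
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))
        w -= lr * (w * xs[i] - ys[i]) * xs[i]
    return w

w = sgd([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Each noisy single-example step is much cheaper than a full pass, which is the whole trade-off the paper analyzes.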

Using SGD in the k-means clustering algorithm seemed counterintuitive to me at first glance. But after thinking of it as "sample z_i belongs to the cluster of centroid w_k, so don't wait for all samples, just update w_k with z_i", it becomes conceivable.

ASGD (averaged SGD) is suitable for a distributed machine learning environment, since it can compute a gradient from any example of the data at any time (no ordering constraint) while averaging out the noise of individual updates.
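A minimal sketch of averaged SGD, assuming the Polyak-Ruppert style of averaging (plain SGD plus a running average of the iterates, which is what is returned; hyperparameters are again my own illustrative choices):

```python
import random

def averaged_sgd(xs, ys, lr=0.05, steps=500, seed=0):
    # ASGD: run ordinary SGD, but also keep a running average w_bar of
    # the iterates; the average is less sensitive to the noise of the
    # last few single-example updates than the final iterate alone.
    rng = random.Random(seed)
    w, w_bar = 0.0, 0.0
    for t in range(1, steps + 1):
        i = rng.randrange(len(xs))
        w -= lr * (w * xs[i] - ys[i]) * xs[i]
        w_bar += (w - w_bar) / t  # running average over all iterates
    return w_bar

w = averaged_sgd([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Because the average commutes with summation, workers can in principle contribute updates from examples in any order, which is the property the note above is pointing at.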