study

Read paper “Large-Scale Machine Learning with Stochastic Gradient Descent”

Paper reference: Large-Scale Machine Learning with Stochastic Gradient Descent

This GD(Gradient Descent), which is used for computing weight of NN (also used for other Machine Learning Algorithm). z_i represents the example ‘i’, also as (x_i, y_i). After calculate all examples, we need to compute the average for all differentials by weight. Calculating all examples is a slow progress, so we can image GD is not adequate efficient.

SGD

Here comes the SGD, which use only one example to compute gradient. It is simpler, and more efficient.

k-mean

Using SGD in K-mean clustering algorithm seems counterintuitive for me at first glance. But after thinking about “Sample z_i belongs to cluster of w_k, then don’t wait for all samples, just update w_k by z_i“, it becomes conceivable.

ASGD

ASGD is suitable for distributed machine learning environment, since it could get averaged gradient from any example of data at any time (no order restrain).

Read paper “In-Datacenter Performance Analysis of a Tensor Processing Unit”

Paper reference: In-Datacenter Performance Analysis of a Tensor Processing Unit”
Application
Using floating point (16bit or 32bit) for NN (Neural Network) training, then a step called quantization transforms floating-point numbers into narrow integers–often just 8 bits–which are usually good enough for inference.
MLP(Multi-layer Perceptions), CNN(Convolutional Neural Netowrks), and RNN(Recurrent Neural Networks), these three types of NN represent 95% of NN inference workload in Google datacenter. Therefore, the TPU mainly focus on them.

As we can see, CNNs are usually dense-computing NN, which are better for TPU.

TPU has 25 times as many MACs (Multiply and Accumulate) and 3.5 times as much on-chip memory as the K80 GPU.
Architecture
The TPU was designed to be a coprocessor on the PCIe I/O bus, more like FPU(floating-poin unit) than it is to a GPU.

The parameters of NN model (weights) comes from off-chip memory (8G DDR3 DRAM) to Weight FIFO, and then flow into MMU(Matrix Multiply Unit). The request (sample need to be inference) comes from PCIe to Unified Buffer, and also flow into MMU finally.
Even the “Activation” and “Pooling” algorithm in CNN have been fixed into hardware.

The MMU contains 256×256 MACs that can perform 8-bit multiply-and-adds on signed or unsigned integers.

According to this Floor Plan, we can imaging that UB and MMU might cost most energy of TPU.

TPU instructions follow the CISC tradition and only has about a dozen instructions, include “Read_Host_Memory”, “Read_Weights”, “MatrixMultiply”, “Activate” etc. Recalling how many codes we need to write to implement a effective Activation function, then we could conceive the speed of using only one “Activate” instruction in TPU.
This paper said TPU is a type of Systolic Array. But what is Systolic Array? Here is the explain: A systolic array is a network of processors that rhythmically compute and pass data through the system.
Performance
There are lot of tables and diagrams which show the top-rate performance of TPU. Although the TPU is fast, it also depend on the computing-density of applications. The CNNs are most computing-dense NN, so it gains most speed(or TeraOps per second) from TPU:

In this paper, it didn’t explain why the GPU is slower than TPU in inference operation. The only sentence about this topic is in “8 Discussion”: “GPUs have traditionally been seen as high-throughput architectures that reply on high-bandwidth DRAM and thousands of threads to achieve their goals”. Actually, I think this is not a serious explain.
The interesting thing is, after Google publish this paper, the CEO of Nvidia – Jensen Huang, wrote a blog to gently appeal a fact: the state-of-the-art GPU (Tesla P40) can inference faster than TPU. The war between different giants of Deep learning is just beginning.

Google TPU from Hao(Robin) Dong

Read paper “A Column Store Engine for Real-Time Streaming Analytics” (MemSQL)

Paper reference: A Column Store Engine for Real-Time Streaming Analytics
Background:
According to the official website, MemSQL is “a high-performance data warehouse designed for the cloud that delivers ultra-fast insights of your live and historical data”. It uses row-storage and lock-free engine for data in memory and column-storage for data in the disk.
MemSQL could also store all data into the disk. In its most durable state, MemSQL will not lose any transactions which have been acknowledged. And it implements “Read Commited” isolation level for transactions.
I heard about MemSQL in early 2012, but don’t know it has become a OLTP-and-OLAP system until several days ago.
What is the problem?
To fulfil OLAP jobs, MemSQL has to store data into a disk with columnar-storage-format. But MemSQL still needs to process OLTP requests such as random INSERT or UPDATE. Therefore it stores data into a data structure named “Segment”. And by connecting different segments into an ordered segments list (which named “Sorted Runs”), MemSQL could balance the requirements between frequent INSERT/UPDATE operations and SCAN/GROUP BY operations.
For example:
1. Users INSERT three keys: 1, 92, 107. MemSQL will create a segment that contains the three keys:
        [1, 92, 107]
2. Users continue to INSERT two keys: 63, 84. The segments list are:
        [1, 92, 107]
        [63, 84]
3. After many INSERT operations, the segments become:
        [1, 92, 107]
        [2, 17, 42]
        [63, 84]
        [110, 118, 172]
Now，MemSQL makes these segments into “Sorted Runs”, which have a basic order for keys:

When the SELECT comes, MemSQL could find the row quickly by just looking up two ordered segment-lists. Uses could also SCAN two segment-lists effectively to do OLAP tasks since all the data are stored in columnar-format.
What happens if users INSERT more rows? MemSQL will merge the old big Sorted-Runs and create a new segment for freshly-insert-data, which could keep the number of Sorted-Runs acceptable.
In practice, MemSQL column store engine uses a constant of 8, so that the biggest sorted run has at least of all the segments, the second biggest sorted run has at least of the remaining segments, and so forth. This strategy seems just like LSM tree in LevelDB of Google. The difference between LevelDB and MemSQL is LevelDB store every key-value pair respectively but MemSQL store a batch of rows into one segment.
If INSERT operations come when MemSQL is merging, the small merging actions will be aborted and relaunch. For big merging actions, it will barely skip any missing or updated segments, for skipping some segments will not ruin the new merged Sorted-Runs.
As we can see, MemSQL endeavours to avoid locking for in-memory data operations, which makes its performance significantly superior.
There are also some practical considerations for MemSQL.

If only one merger is working at big Sort-Runs, it will cost too much time and the small Sorted-Runs will become tremendous for intensive INSERT operations. So MemSQL launch two mergers: Fast Merger and Slow Merger.
MemSQL could accumulate some rows before batching them into a segment, which will decrease the fragment of data.
MemSQL also create special Commands for users to sort all the segments, or just decrease the level of Sorted-Runs.
Columnar-Storage format in memory makes using SIMD instruments of CPU possible. This shed some light on me: maybe one day we can run machine learning jobs on MemSQL directly 🙂

My understanding of CNN (Convolutional Neural Network)

The classic Neural Network of Machine Learning usually use fully-connection, which will cost too much computing resource to get final result if the inputs are high-resolution images. So comes the Convolutional Neural Network. CNN (Convolutional Neural Network) splits the whole big image into small pieces (called Receptive Fields), and do some “Convolutional Operations” (actually are some image transformations, also called Kernels) on each Receptive Field, then the pooling operation (usually max-polling, which is simply collect a biggest feature weight in a 2X2 matrix).
Receptive Fields is easy to understand, but why do it use different kind of “Convolutional Operations” on them? In my opinion, “Convolutional Operations” means using different kind of Kernel Functions to transfer the same image (for example: sharpen the image, or detect the edge of object in image), so they could reveal different views of the same image.
These different Kernel Functions review different “Features” of a image, thus we call them “Feature Maps”:
Convolutional Neural Network
From http://mxnet.io/tutorials/python/mnist.html
(The matrix of light-yellow is just transferred from light-gray matrix on its left)
By using Receptive Fields and max-pooling, the number of neurons will become very small gradually, which will make computing (or regression) much more easy and fast:
Convolutional Neural Network
From http://www.cnblogs.com/bzjia-blog/p/3442788.html
Therefore, I reckon the main purpose of using CNN is to reduce the difficulty of computing result of a fully-connected Neural Network.

Robin on Linux

study