Robin on Linux – Page 14 – All about technology

Efficient reading in pandas

My previous code read the whole file and then kept only the one column I needed:

import pandas as pd
df = pd.read_csv("data.csv")["card_id"]

In the test environment, this program used more than 10GB of memory because the data file is large.

To reduce the memory usage, I switched to usecols:

import pandas as pd
df = pd.read_csv("data.csv", usecols=["card_id"])

After that, the program used less than 1GB of memory.

The only catch is that column selection is not spelled the same way everywhere: read_csv() takes usecols, while read_sql() and read_parquet() take a columns argument instead (for read_sql(), it only applies when reading a whole table by name).
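A small self-contained check of what usecols does, assuming an in-memory CSV as a stand-in for data.csv (the amount and city columns are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv; only card_id comes from the post.
csv_text = "card_id,amount,city\n1001,9.5,Paris\n1002,12.0,Tokyo\n"

# Reading everything parses all three columns.
full = pd.read_csv(io.StringIO(csv_text))

# With usecols, only the requested column is parsed and kept.
only_ids = pd.read_csv(io.StringIO(csv_text), usecols=["card_id"])

print(list(full.columns))
print(list(only_ids.columns))
```

Note that this returns a one-column DataFrame, whereas indexing with ["card_id"] after reading returns a Series.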

Compare two tables in BigQuery

According to this answer, the best solution for comparing two tables in BigQuery is:

(
  SELECT * FROM table1
  EXCEPT DISTINCT
  SELECT * from table2
)
UNION ALL
(
  SELECT * FROM table2
  EXCEPT DISTINCT
  SELECT * from table1
)

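But in my test, two tables with the same rows were reported as different by the above snippet. It turned out that the columns of the two tables may be in a different order, and the set operators match columns by position. The rows may be stored in a different order too, although that alone does not affect a set operation such as EXCEPT DISTINCT. A more robust solution lists the columns explicitly in a fixed order (the ORDER BY clauses below also fix the row order):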

(
  (
  SELECT col1, col2, col3, col4
  FROM table1
  ORDER BY col1, col2
  )
  EXCEPT DISTINCT
  (
  SELECT col1, col2, col3, col4
  FROM table2
  ORDER BY
  col1, col2
  )
)
UNION ALL
(
  (
  SELECT col1, col2, col3, col4
  FROM table2
  ORDER BY col1, col2
  )
  EXCEPT DISTINCT
  (
  SELECT col1, col2, col3, col4
  FROM table1
  ORDER BY
  col1, col2
  )
)
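The column-order effect can be reproduced locally. This is only a sketch: it assumes SQLite (from Python's standard library) as a stand-in for BigQuery, uses plain EXCEPT (which is already distinct in SQLite), and the two tiny tables are invented for the demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same rows in both tables, but the columns are declared in a different order.
conn.executescript("""
    CREATE TABLE table1 (col1 INTEGER, col2 TEXT);
    CREATE TABLE table2 (col2 TEXT, col1 INTEGER);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO table2 VALUES ('b', 2), ('a', 1);
""")


def symmetric_difference(select1, select2):
    """Rows returned by exactly one of the two SELECT statements."""
    query = f"""
        SELECT * FROM ({select1} EXCEPT {select2})
        UNION ALL
        SELECT * FROM ({select2} EXCEPT {select1})
    """
    return conn.execute(query).fetchall()


# SELECT * compares columns by position, so identical data looks different.
naive = symmetric_difference("SELECT * FROM table1", "SELECT * FROM table2")
# An explicit column list lines the columns up; row order plays no role.
fixed = symmetric_difference(
    "SELECT col1, col2 FROM table1", "SELECT col1, col2 FROM table2"
)
print(len(naive), fixed)
```

The rows of table2 were inserted in the opposite order on purpose, and the explicit column list alone is enough to make the difference empty.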

Some tips about pandas, again

1. pd.merge() may change the names of the original columns:

import pandas as pd
df1 = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
df2 = pd.DataFrame(data={"name": ["lion", "heart"], "age": [50, 60]})
merged = pd.merge(df1, df2, how="outer", on="name")
print(merged)

The output has no column named age; instead it has two new columns named age_x and age_y, because age exists in both dataframes and is not a join key. So when merging two tables with many columns, be aware that the column names may change.
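If the default _x and _y suffixes are too cryptic, pd.merge() accepts a suffixes argument. A minimal sketch with the same two dataframes (the suffix names _left and _right are just an illustrative choice):

```python
import pandas as pd

df1 = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
df2 = pd.DataFrame(data={"name": ["lion", "heart"], "age": [50, 60]})

# Overlapping non-key columns get these suffixes instead of _x and _y.
merged = pd.merge(df1, df2, how="outer", on="name", suffixes=("_left", "_right"))
print(list(merged.columns))
```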

2. Use iterrows() to traverse the rows of a dataframe:

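import pandas as pd
from multiprocessing import Pool
def process(item):
    # iterrows() yields (index, row) tuples, so item[1] is the row as a pd.Series
    print(item[1])
if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import
    df = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
    with Pool(6) as pool:
        pool.map(process, df.iterrows())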

If we pass the dataframe directly, as in pool.map(process, df), it iterates over the column names of the dataframe instead of its rows.

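3. How to append a pd.Series to a pd.DataFrame. According to this article, the easiest way is the following (DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0, so this needs an older pandas):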

import pandas as pd
df = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
series = pd.Series(["water", 50], index=["name", "age"])
print(df.append(series, ignore_index=True))

The result is

    name  age
0  robin   40
1   hood   30
2  water   50

Or, we can give the pd.Series a name and remove ignore_index. It gives the same values, but the new row is labelled with that name instead of 2.

If the pd.Series has no index matching the column names (it then falls back to the default 0, 1), the result becomes:

    name   age      0     1
0  robin  40.0    NaN   NaN
1   hood  30.0    NaN   NaN
2    NaN   NaN  water  50.0
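On pandas 2.0 and later, where DataFrame.append() no longer exists, the same row can be added with pd.concat(). This is one possible replacement rather than the method from the article mentioned above:

```python
import pandas as pd

df = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
series = pd.Series(["water", 50], index=["name", "age"])

# Turn the Series into a one-row DataFrame, then stack it under df.
appended = pd.concat([df, series.to_frame().T], ignore_index=True)
print(appended)
```

Because the Series mixes a string and a number, the age column of the result has object dtype; call appended.astype({"age": int}) if a numeric column is needed.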

Build a Python module for MFCC’s C++ implementation

MFCC stands for Mel-frequency cepstral coefficients. It’s a powerful feature representation for sound. Although there are many MFCC implementations in different programming languages, they give quite different results for the same audio input.

To solve this problem, I took an open-source C++ implementation of MFCC and built a Python module for it. Using SWIG made this work less painful.

The function takes sample_rate and a one-dimensional array as input, and produces a two-dimensional array as output. So the C++ header file looks like:

void mfcc(int sample_rate,
          short* in_array, int size_in,
          double** out_array, int* dim1, int* dim2
         );

We also need to use numpy, so the interface file for SWIG is:

%module mfcc
%{
  #define SWIG_FILE_WITH_INIT
  #include "mfcc.hpp"
%}
%include "numpy.i"
%init %{
  import_array();
%}
%apply (short* IN_ARRAY1, int DIM1) {(short* in_array, int size_in)}
%apply (double** ARGOUTVIEW_ARRAY2, int* DIM1, int* DIM2) {(double** out_array, int* dim1, int* dim2)}
%rename (mfcc) my_mfcc;
%inline %{
  void my_mfcc(int sample_rate, short* in_array, int size_in, double** out_array, int* dim1, int* dim2) {
    mfcc(sample_rate, in_array, size_in, out_array, dim1, dim2);
  }
%}

To use this module, here is some example Python code:

import mfcc
import numpy as np
from scipy.io import wavfile
sr, audio = wavfile.read("mono.wav")
output = mfcc.mfcc(sr, audio)
print(output.shape, output)

All the code is in my repository.

Transfer Redshift SQL to BigQuery SQL

In my recent work, I needed to run some SQL snippets from Redshift on Google’s BigQuery platform. Since different data warehouses use different SQL dialects, the translation work couldn’t be avoided.

Here are some tips:

Redshift	BigQuery
field::VARCHAR	CAST(field AS String)
isnull(), nvl()	ifnull()
dateadd()	date_add()
datediff()	date_diff()
union	union distinct (use union all only if duplicates should be kept)
field ILIKE pattern	UPPER(field) LIKE UPPER(pattern)
split_part(string, delimiter, part)	split(string, delimiter)[safe_ordinal(part)]

In Redshift we can select columns like this:

SELECT
  SQRT(score) AS new_score,
  new_score * 10
FROM ...

But in BigQuery we can’t reference an alias defined with “AS” elsewhere in the same SELECT list, so the expression has to be repeated. The SQL in BigQuery should be:

SELECT
  SQRT(score) AS new_score,
  SQRT(score) * 10
FROM ...

BigQuery also has the “WITH” clause, which can replace a “temporary table” and is very powerful:

WITH result AS (
   WITH example AS ( SELECT * FROM `dataset.table` )
   SELECT * FROM example
)
SELECT * FROM result

Understanding Transformer

In the paper Attention Is All You Need, the Transformer neural network was introduced for the first time in 2017. One year later, BERT appeared. Last year I gave a simple presentation about the Transformer and BERT at my previous company, shown below:

Transformer and BERT from Hao(Robin) Dong

A couple of days ago I started to review the Transformer paper again, and found that I have to recommend the article The Illustrated Transformer once more. It really helped me understand many details of the Transformer.

But one question still popped into my mind: what is the decoder in the Transformer for, and how does information flow from the encoder to the decoder? After thinking for quite a while, I figured it out: the Transformer was originally designed for the machine translation task. The encoder “transforms” a sentence in the source language into a set of Keys and Values; the decoder “transforms” a word in the target language into a Query. Attending with the Query over those Keys and Values produces a vector, which acts as the embedding of the next word in the target language (it is what the model uses to predict that word).

Here is a diagram I drew. I hope it explains my own confusion.

“Ich bin ein guter Kerl” in German means “I am a good guy”. By encoding all the German words into a set of Keys and Values, and decoding “good” into a Query, the Transformer can finally output the embedding vector of “guy”.
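The Query/Keys/Values step described above can be sketched as single-head scaled dot-product attention in NumPy. This is a simplified illustration, not the full Transformer: the vectors are random stand-ins for real encoder and decoder states, and the learned projection matrices, multiple heads and masking are all left out:

```python
import numpy as np


def cross_attention(query, keys, values):
    """Scaled dot-product attention of decoder queries over encoder outputs."""
    d_k = keys.shape[-1]
    scores = query @ keys.T / np.sqrt(d_k)
    # Softmax over the source positions (shifted for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # The output is a weighted mix of the Values.
    return weights @ values, weights


rng = np.random.default_rng(0)
d_model = 8
# Five source words ("Ich bin ein guter Kerl"): one Key and one Value each.
keys = rng.normal(size=(5, d_model))
values = rng.normal(size=(5, d_model))
# One target word ("good") turned into a Query by the decoder.
query = rng.normal(size=(1, d_model))

context, weights = cross_attention(query, keys, values)
print(context.shape, weights.round(2))
```

The weights show how much each source word contributes to the output vector for the current target position.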

“Show” the sound of a bird

librosa seems to be a really popular Python library for audio processing. Using librosa, I can show the MFCC of a bird’s sound in just a few lines of Python:

import librosa
import matplotlib.pyplot as plt
from matplotlib import cm
audio, sample_rate = librosa.load("shriek.mp3")
mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
fig, ax = plt.subplots()
ax.imshow(mfccs, interpolation="nearest", cmap=cm.coolwarm, origin="lower", aspect="auto")
ax.set_title("MFCC")
plt.show()

The image looks like this:

I know that it’s not intuitive for people like me to understand the meaning of this kind of spectrum image. But it should be suitable input for a machine learning model, such as a CNN, to recognize.

Use matplotlib to draw multiple pictures

To draw several sample images on one page, I used matplotlib, one of the most popular Python libraries:

import cv2
import matplotlib.pyplot as plt
img = cv2.imread("bird.jpg")
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
rows = 4
columns = 2
fig, ax = plt.subplots(nrows=rows, ncols=columns)
for index, axis in enumerate(ax.ravel()):
    axis.imshow(img)
    axis.set_title("Subplot" + str(index), fontsize=6)
    axis.set_axis_off()
plt.show()

Then a window like this pops up:

Using loop in Jsonnet

Jsonnet is a templating language and tool for generating JSON/YAML files. Since we now have a real language instead of a static configuration format, we can generate a bunch of configuration items with simple code.

For example, I use a loop in Jsonnet to write repeated items:

local food_type = ['cake', 'fruit', 'vegetable', 'rice', 'eggs', 'milk', 'meat', 'pizza'];
local final_meal = [
  { name: 'food_' + type, value: 'like' }
  for type in food_type
];
{ 'sample.yaml': std.manifestYamlDoc(
  final_meal,
) }

The std.manifestYamlDoc function is used to generate YAML instead of JSON.

After running jsonnet sample.jsonnet -m . -S, it will generate sample.yaml as:

- "name": "food_cake"
  "value": "like"
- "name": "food_fruit"
  "value": "like"
- "name": "food_vegetable"
  "value": "like"
- "name": "food_rice"
  "value": "like"
- "name": "food_eggs"
  "value": "like"
- "name": "food_milk"
  "value": "like"
- "name": "food_meat"
  "value": "like"
- "name": "food_pizza"
  "value": "like"

Less code, and more human-readable. I think this is the power of Jsonnet.

Using GPU for LightGBM

One of my team members ran some tests on using a GPU for LightGBM training. The result is quite good: the GPU made training 2 times as fast.

But this also raised my interest in how LightGBM uses the GPU for training. The GBDT algorithm mostly uses operations like condition checking, data sorting, and split point searching; it doesn’t have the matrix operations that are the strong point of GPUs.

Then Jimmy (one of my colleagues) sent me a paper. It explains exactly how LightGBM uses the GPU: for the histogram algorithm. The story is: to find the best split point for a feature (a column of the dataset), LightGBM needs to collect the feature values into bins with different value ranges. This process can be executed concurrently, so it can be moved onto the GPU.

So a GPU can be used not only for heavy matrix operations but also for highly parallel workloads. Thanks to Jimmy, the paper resolved my doubt.
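To make the binning step concrete, here is a simplified NumPy sketch of building a histogram for one feature. It is not LightGBM's implementation: the equal-width bins, the bin count of 16 and the random data are assumptions made only for illustration:

```python
import numpy as np


def build_histogram(feature, gradients, n_bins=16):
    """Accumulate gradient sums and sample counts per bin for one feature."""
    # Equal-width bin edges: an assumption made to keep the sketch short.
    edges = np.linspace(feature.min(), feature.max(), n_bins + 1)
    # Map every value to a bin id in [0, n_bins - 1]; each sample is
    # handled independently, which is what makes the step parallel.
    bin_ids = np.digitize(feature, edges[1:-1])
    grad_sum = np.bincount(bin_ids, weights=gradients, minlength=n_bins)
    counts = np.bincount(bin_ids, minlength=n_bins)
    return grad_sum, counts


rng = np.random.default_rng(42)
feature = rng.normal(size=1000)    # one column of the dataset
gradients = rng.normal(size=1000)  # per-sample gradients from the loss

grad_sum, counts = build_histogram(feature, gradients)
print(counts.sum(), grad_sum.shape)
```

Candidate split points then only need to be evaluated at the bin boundaries, using the accumulated sums rather than the raw samples.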

The code for GPU kernel of LightGBM is in three files:

src/treelearner/ocl/histogram16.cl
src/treelearner/ocl/histogram64.cl
src/treelearner/ocl/histogram256.cl

Since the kernels are implemented with the OpenCL framework, LightGBM can train on both Nvidia and AMD GPUs.