After upgrading my k8s cluster, every Kubeflow Pipelines job finishes only its first operation and then hangs. The cause is a bug in Argo (Kubeflow Pipelines is built on Argo). The simplest and most straightforward solution is to relaunch the k8s cluster with a lower version. In my case, 1.18.20 works very well.
Furthermore, to let tasks in Kubeflow Pipelines run BigQuery jobs in GCP, we need to set the security settings of the node pool.
As shown above, choose a specific service account that can access BigQuery resources, instead of the default Compute Engine account.
Therefore, to run Kubeflow Pipelines successfully, we need to launch a k8s cluster with the following rules:
Use a lower version, e.g. 1.18.20
Set a service account with access to the desired resources on the node pools (see the sketch below)
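For the second rule, the relevant GKE node-pool settings look roughly like this (just a sketch of the API fields; the pool and service-account names are placeholders, not the ones I actually use):
node_pool = {
    "name": "kfp-pool",
    "config": {
        # A dedicated service account that can access BigQuery,
        # instead of the default Compute Engine account.
        "service_account": "kfp-bigquery@my-project.iam.gserviceaccount.com",
        # The cloud-platform scope lets IAM decide the actual permissions.
        "oauth_scopes": ["https://www.googleapis.com/auth/cloud-platform"],
    },
}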
Remember not to use gs://spark-lib/bigquery/spark-bigquery-latest.jar, because it will hang your job when you are reading BigQuery tables. It seems even Google makes significant mistakes in their cloud platform :p
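For example, pinning a specific release of the connector instead of the "latest" jar looks roughly like this (the version number and table name below are placeholders, not the ones from my job):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bq-read")
    # Pin a specific spark-bigquery connector release instead of
    # gs://spark-lib/bigquery/spark-bigquery-latest.jar.
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1",
    )
    .getOrCreate()
)

# Read a BigQuery table into a DataFrame.
df = (
    spark.read.format("bigquery")
    .option("table", "my_project.my_dataset.my_table")
    .load()
)
df.show(5)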
2. If a PySpark job needs to use some additional packages in the Dataproc cluster, what should we do?
We still need to add a few more items to the template to let it install pip packages:
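Roughly, the extra items are cluster properties like the following (a sketch; the package names and versions are placeholders, and I believe dataproc:pip.packages is the property that triggers pip installation on all nodes):
cluster_config = {
    "software_config": {
        "properties": {
            # Comma-separated list of pip packages to install on every node.
            "dataproc:pip.packages": "pandas==1.3.0,google-cloud-bigquery==2.24.0",
        },
    },
}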
For NLU (Natural Language Understanding), we use a bidirectional language model (like BERT), but for NLG (Natural Language Generation), a left-to-right unidirectional language model (like GPT) is the only choice.
Could we accomplish these two tasks by using one unified language model?
In this paper, the authors use a mask matrix to run different tasks in the same model:
The pivotal equation for this method is:
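The equation image isn't reproduced here, but as far as I remember from the UniLM paper, the masked self-attention of one layer is roughly:
A_l = softmax( Q K^T / sqrt(d_k) + M ) V
where Q, K and V are the query/key/value projections of the previous layer's output, and M is the mask matrix described below.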
“M is the mask matrix and determines whether a pair of tokens can be attended to each other.”
“Unidirectional LM is done by using a triangular matrix for the self-attention mask M (as in the above equation), where the upper triangular part of the self-attention mask is set to −∞, and the other elements to 0”
“Within one training batch, 1/3 of the time we use the bidirectional LM objective, 1/3 of the time we employ the sequence-to-sequence LM objective, and both left-to-right and right-to-left LM objectives are sampled with the rate of 1/6”
(Note that these fractions describe how often each objective — bidirectional / unidirectional / seq2seq — is used during training, not how the samples are split.)
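To make the mask idea concrete, here is a tiny NumPy sketch of my own (not the paper's code) that builds the triangular mask and applies it inside the attention softmax:
import numpy as np

def unidirectional_mask(seq_len):
    # M[i, j] = 0 if token i may attend to token j, -inf otherwise.
    # For a left-to-right LM, the upper-triangular part (future tokens) is masked.
    mask = np.zeros((seq_len, seq_len))
    mask[np.triu_indices(seq_len, k=1)] = -np.inf
    return mask

def masked_attention(Q, K, V, M):
    # softmax(Q K^T / sqrt(d_k) + M) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(masked_attention(Q, K, V, unidirectional_mask(seq_len)).shape)  # (4, 8)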
Following the document, I tried to deploy the Kubeflow management cluster, but after running make apply-cluster it reported:
The management cluster name "kubeflow-mgmt" is valid.
# Delete the directory so any resources that have been removed
# from the manifests will be pruned
rm -rf build/cluster
mkdir -p build/cluster
kustomize build ./cluster -o build/cluster
# Create the cluster
anthoscli apply -f build/cluster
I0723 14:53:19.329785 24546 main.go:230] reconcile serviceusage.cnrm.cloud.google.com/Service container.googleapis.com
I0723 14:53:23.236897 24546 main.go:230] reconcile container.cnrm.cloud.google.com/ContainerCluster kubeflow-mgmt
Unexpected error: error reconciling objects: error reconciling ContainerCluster:gcp-wow-rwds-ai-mlchapter-dev/kubeflow-mgmt: error creating GKE cluster kubeflow-mgmt: googleapi: Error 400: Project "gcp-wow-rwds-ai-mlchapter-dev" has no network named "default".
make: *** [apply-cluster] Error 1
The reason for this error is that Kubeflow can only use the VPC network named "default" in GCP. This issue is still open and has been attributed to Anthos.
Workaround: create a new GKE cluster manually, and set MGMT_NAME to the existing cluster's name.
Groovy can be used as the configuration language for Jenkins workflows. Although I can't make head or tail of Groovy yet, its syntax is not hard to learn.
How to iterate a list
To export a bunch of tables to CSV format, we could use:
[
    "table1",
    "table2",
    "table3",
].each { table_name ->
    sh "export ${table_name} to ${table_name}.csv"
}
Get the output of a shell command
We can run shell commands in the Groovy file and then capture their output:
all_files = sh(
    script: 'ls -lh',
    returnStdout: true
).trim()
echo "All files: ${all_files}"
Recently I dug out my USBasp programmer and a few AVR microcontrollers to enjoy programming in C again. Unexpectedly, the old ATTINY2313V and ATmega88V wouldn't work with my USBasp (maybe their fuses have already been set for an external crystal, but I don't have one at hand). The only two chips that work are the ATmega16A and ATmega16L. At least I can still have some fun with them.
Later, while glancing over the documentation for Atmel's new ATTINY series, I found out that the ATTINY13A has only 1KB of flash to store the program. My waterfall-light example compiled to a 2KB hex file. Does that mean I can't fit my program into the ATTINY13A? How can I get the real flash footprint of the hex file?
Here is one solution, using avr-size:
$ avr-size main.hex
   text    data     bss     dec     hex filename
      0     756       0     756     2f4 main.hex
Only 756 bytes (text + data) will be used in flash, so the ATTINY13A should be okay.
Now, what should I do if I want to reduce the size of the binary file compiled from my code? Here is the guide from Atmel.
First, I changed the type of the variable "mode" from "unsigned int" to "unsigned char", which brought the binary down to 726 bytes. Then I made all internal functions "static" (I guess this removes some symbols that are otherwise kept for external linking), reducing the binary size to 668 bytes.
—— 2021.07.13 ——
Furthermore, when I use the "-mrelax" option of gcc-avr to enable linker relaxation, the binary size shrinks to 656 bytes.
As the menu above shows, Vertex AI is trying to cover all the common steps of building and running a machine learning model.
For my experiment, I just created a Dataset by loading a file from GCS. Unfortunately, the loading process supports only CSV files as tabular data, so I had to convert my big Parquet file into CSV format first (really inconvenient).
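The conversion itself is simple with pandas (the paths are placeholders; reading and writing gs:// paths also needs the gcsfs package installed):
import pandas as pd

# Placeholder paths; gs:// access requires gcsfs.
df = pd.read_parquet("gs://my-bucket/data.parquet")
df.to_csv("gs://my-bucket/data.csv", index=False)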
Strange error
But after I created a training job using the built-in XGBoost container, it reported a strange error:
There is an invalid column, but what is its name? The GUI didn't show it. I finally found out that it is a column with an empty name. It seems Vertex AI can't even process a table that contains a column with an empty name.
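A quick check like this (my own snippet, assuming the table fits in a pandas DataFrame) would have caught it before uploading:
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder path
# Columns with empty or auto-generated names ("Unnamed: N") usually come
# from an unnamed index column in the source file.
bad_cols = [c for c in df.columns
            if str(c).strip() == "" or str(c).startswith("Unnamed:")]
print("Suspicious columns:", bad_cols)
df = df.drop(columns=bad_cols)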
2. AutoML
After manually removing the column with the empty name and selecting AutoML for my tabular data, the training went through successfully. The final regression L1 loss is 0.237, about the same result as my own LightGBM model.
3. Custom Package
Following this document, I created a custom Python package for training my XGBoost model. The home-brew package uses environment variables to get the Dataset from GCS. The final L1 loss is slightly worse than LightGBM's.
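The environment-variable part looks roughly like this (a sketch, not the actual package; the AIP_* names follow Vertex AI's convention for custom training with a managed dataset, and the wildcard handling is my assumption):
import os
import pandas as pd
import gcsfs  # needed to list and read gs:// paths

# Vertex AI injects these when a managed Dataset is attached to the training job.
train_uri = os.environ["AIP_TRAINING_DATA_URI"]
model_dir = os.environ.get("AIP_MODEL_DIR", "gs://my-bucket/model/")  # placeholder fallback

def load_csv(uri):
    # The URI may contain wildcards because the exported data is sharded.
    fs = gcsfs.GCSFileSystem()
    frames = [pd.read_csv("gs://" + p) for p in fs.glob(uri)]
    return pd.concat(frames, ignore_index=True)

train_df = load_csv(train_uri)
print(train_df.shape)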
Frankly speaking, I haven't seen any advantage of Vertex AI over our home-brew Argo/K8S training framework. And in the Vertex AI training process, particular errors such as OOM (Out Of Memory) are hard to discover.