After upgrading my k8s cluster, every Kubeflow Pipelines job finishes only its first operation and then hangs. The cause is a bug in Argo (Kubeflow Pipelines is built on Argo). The simplest and most straightforward solution is to relaunch the k8s cluster with a lower version. In my case, 1.18.20 works very well.
Furthermore, to let tasks in Kubeflow Pipelines run BigQuery jobs in GCP, we need to set the security settings of the node pool.
As shown above, choose a specific service account that can access BigQuery resources, instead of the default Compute Engine account.
Therefore, to run Kubeflow Pipelines successfully, we need to launch a k8s cluster with the following rules:
Use a lower version, e.g. 1.18.20
Set a service account with access to the desired resources on the node pools (see the sketch below)
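For the second rule, the relevant GKE node-pool settings look roughly like this (just a sketch of the API fields; the pool and service-account names are placeholders, not the ones I actually use):
node_pool = {
    "name": "kfp-pool",
    "config": {
        # A dedicated service account that can access BigQuery,
        # instead of the default Compute Engine account.
        "service_account": "kfp-bigquery@my-project.iam.gserviceaccount.com",
        # The cloud-platform scope lets IAM decide the actual permissions.
        "oauth_scopes": ["https://www.googleapis.com/auth/cloud-platform"],
    },
}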
Remember not to use gs://spark-lib/bigquery/spark-bigquery-latest.jar, because it will hang your job when you are reading BigQuery tables. It seems even Google makes significant mistakes in their cloud platform :p
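For example, pinning a specific release of the connector instead of the "latest" jar looks roughly like this (the version number and table name below are placeholders, not the ones from my job):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bq-read")
    # Pin a specific spark-bigquery connector release instead of
    # gs://spark-lib/bigquery/spark-bigquery-latest.jar.
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1",
    )
    .getOrCreate()
)

# Read a BigQuery table into a DataFrame.
df = (
    spark.read.format("bigquery")
    .option("table", "my_project.my_dataset.my_table")
    .load()
)
df.show(5)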
2. If a PySpark job needs to use some additional packages in the Dataproc cluster, what should we do?
We still need to add a few more items to the template to let it install pip packages:
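Roughly, the extra items are cluster properties like the following (a sketch; the package names and versions are placeholders, and I believe dataproc:pip.packages is the property that triggers pip installation on all nodes):
cluster_config = {
    "software_config": {
        "properties": {
            # Comma-separated list of pip packages to install on every node.
            "dataproc:pip.packages": "pandas==1.3.0,google-cloud-bigquery==2.24.0",
        },
    },
}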
For NLU (Natural Language Understanding), we use a bidirectional language model (like BERT), but for NLG (Natural Language Generation), a left-to-right unidirectional language model (like GPT) is the only choice.
Could we accomplish these two tasks by using one unified language model?
In this paper, the authors use a mask matrix to run different tasks in the same model:
The pivotal equation for this method is:
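The equation image isn't reproduced here, but as far as I remember from the UniLM paper, the masked self-attention of one layer is roughly:
A_l = softmax( Q K^T / sqrt(d_k) + M ) V
where Q, K and V are the query/key/value projections of the previous layer's output, and M is the mask matrix described below.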
“M is the mask matrix and determines whether a pair of tokens can be attended to each other.”
“Unidirectional LM is done by using a triangular matrix for the self-attention mask M (as in the above equation), where the upper triangular part of the self-attention mask is set to −∞, and the other elements to 0”
“Within one training batch, 1/3 of the time we use the bidirectional LM objective, 1/3 of the time we employ the sequence-to-sequence LM objective, and both left-to-right and right-to-left LM objectives are sampled with the rate of 1/6”
(Note that these fractions describe how often each objective — bidirectional / unidirectional / seq2seq — is used during training, not how the samples are split.)
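To make the mask idea concrete, here is a tiny NumPy sketch of my own (not the paper's code) that builds the triangular mask and applies it inside the attention softmax:
import numpy as np

def unidirectional_mask(seq_len):
    # M[i, j] = 0 if token i may attend to token j, -inf otherwise.
    # For a left-to-right LM, the upper-triangular part (future tokens) is masked.
    mask = np.zeros((seq_len, seq_len))
    mask[np.triu_indices(seq_len, k=1)] = -np.inf
    return mask

def masked_attention(Q, K, V, M):
    # softmax(Q K^T / sqrt(d_k) + M) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(masked_attention(Q, K, V, unidirectional_mask(seq_len)).shape)  # (4, 8)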
Following the document, I tried to deploy the Kubeflow management cluster, but after running make apply-cluster it reported:
The management cluster name "kubeflow-mgmt" is valid.
# Delete the directory so any resources that have been removed
# from the manifests will be pruned
rm -rf build/cluster
mkdir -p build/cluster
kustomize build ./cluster -o build/cluster
# Create the cluster
anthoscli apply -f build/cluster
I0723 14:53:19.329785 24546 main.go:230] reconcile serviceusage.cnrm.cloud.google.com/Service container.googleapis.com
I0723 14:53:23.236897 24546 main.go:230] reconcile container.cnrm.cloud.google.com/ContainerCluster kubeflow-mgmt
Unexpected error: error reconciling objects: error reconciling ContainerCluster:gcp-wow-rwds-ai-mlchapter-dev/kubeflow-mgmt: error creating GKE cluster kubeflow-mgmt: googleapi: Error 400: Project "gcp-wow-rwds-ai-mlchapter-dev" has no network named "default".
make: *** [apply-cluster] Error 1
The reason for this error is that Kubeflow can only use the VPC network named "default" in GCP. This issue is still open and has been attributed to Anthos.
Workaround: create a new GKE cluster manually, and set MGMT_NAME to the existing cluster's name.
Groovy can be used as the configuration language for Jenkins workflows. Although I can't make head or tail of Groovy yet, its syntax is not hard to learn.
How to iterate a list
To export a bunch of tables to CSV format, we could use:
[
    "table1",
    "table2",
    "table3",
].each { table_name ->
    sh "export ${table_name} to ${table_name}.csv"
}
Get the output of a shell command
We can run shell commands in the Groovy file and then capture their output:
all_files = sh(
    script: 'ls -lh',
    returnStdout: true
).trim()
echo "All files: ${all_files}"
Recently I dug out my USBasp programmer and a few AVR microcontrollers to enjoy programming in C again. Unexpectedly, the old ATTINY2313V and ATmega88V wouldn't work with my USBasp (maybe their fuses have already been set for an external crystal, but I don't have one at hand). The only two chips that work are the ATmega16A and ATmega16L. At least I can still have some fun with them.
Later, while glancing over the documentation for Atmel's new ATTINY series, I found out that the ATTINY13A has only 1KB of flash to store the program. My waterfall-light example compiled to a 2KB hex file. Does that mean I can't fit my program into the ATTINY13A? How can I get the real flash footprint of the hex file?
Here is one solution, using avr-size:
$ avr-size main.hex
   text    data     bss     dec     hex filename
      0     756       0     756     2f4 main.hex
Only 756 bytes (text + data) will be used in flash, so the ATTINY13A should be okay.
Now, what should I do if I want to reduce the size of the binary file compiled from my code? Here is the guide from Atmel.
First, I changed the type of the variable "mode" from "unsigned int" to "unsigned char", which brought the binary down to 726 bytes. Then I made all internal functions "static" (I guess this removes some symbols that are otherwise kept for external linking), reducing the binary size to 668 bytes.
—— 2021.07.13 ——
Furthermore, when I use the "-mrelax" option of gcc-avr to enable linker relaxation, the binary size shrinks to 656 bytes.
As the menu above shows, Vertex AI is trying to cover all the common steps of building and running a machine learning model.
For my experiment, I just created a Dataset by loading a file from GCS. Unfortunately, the loading process supports only CSV files as tabular data, so I had to convert my big Parquet file into CSV format first (really inconvenient).
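The conversion itself is simple with pandas (the paths are placeholders; reading and writing gs:// paths also needs the gcsfs package installed):
import pandas as pd

# Placeholder paths; gs:// access requires gcsfs.
df = pd.read_parquet("gs://my-bucket/data.parquet")
df.to_csv("gs://my-bucket/data.csv", index=False)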
Strange error
But after I created a training job using the built-in XGBoost container, it reported a strange error:
There is an invalid column, but what is its name? The GUI didn't show it. I finally found out that it is a column with an empty name. It seems Vertex AI can't even process a table that contains a column with an empty name.
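A quick check like this (my own snippet, assuming the table fits in a pandas DataFrame) would have caught it before uploading:
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder path
# Columns with empty or auto-generated names ("Unnamed: N") usually come
# from an unnamed index column in the source file.
bad_cols = [c for c in df.columns
            if str(c).strip() == "" or str(c).startswith("Unnamed:")]
print("Suspicious columns:", bad_cols)
df = df.drop(columns=bad_cols)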
2. AutoML
After manually removing the column with the empty name and selecting AutoML for my tabular data, the training went through successfully. The final regression L1 loss is 0.237, about the same result as my own LightGBM model.
3. Custom Package
Following this document, I created a custom Python package for training my XGBoost model. The home-brew package uses environment variables to get the Dataset from GCS. The final L1 loss is slightly worse than LightGBM's.
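The environment-variable part looks roughly like this (a sketch, not the actual package; the AIP_* names follow Vertex AI's convention for custom training with a managed dataset, and the wildcard handling is my assumption):
import os
import pandas as pd
import gcsfs  # needed to list and read gs:// paths

# Vertex AI injects these when a managed Dataset is attached to the training job.
train_uri = os.environ["AIP_TRAINING_DATA_URI"]
model_dir = os.environ.get("AIP_MODEL_DIR", "gs://my-bucket/model/")  # placeholder fallback

def load_csv(uri):
    # The URI may contain wildcards because the exported data is sharded.
    fs = gcsfs.GCSFileSystem()
    frames = [pd.read_csv("gs://" + p) for p in fs.glob(uri)]
    return pd.concat(frames, ignore_index=True)

train_df = load_csv(train_uri)
print(train_df.shape)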
Frankly speaking, I haven't seen any advantage of Vertex AI over our home-brew Argo/K8S training framework. And in the Vertex AI training process, particular errors such as OOM (Out Of Memory) are hard to discover.