col1
0 NaN
1 NaN
2 NaN
0 False
1 False
2 False
Name: col1, dtype: bool
0 False
1 False
2 False
Name: col1, dtype: bool
If we convert “None” directly to the float type, it becomes the “NaN” value. “NaN” cannot be compared with a real floating-point number, so it is neither “greater than or equal to” nor “less than” any float; both comparisons simply return False.
Since the floating-point type is the only one that allows a “None” (missing) value in a column, we should be extra careful when processing floating-point numbers.
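A minimal sketch of this behaviour (the column values here are made up for illustration, not taken from the snippet above):
import pandas as pd

# A column holding None is stored as NaN once its dtype is float
s = pd.Series([None, 1.5], dtype=float)
print(s >= 1.0)  # the NaN element compares as False
print(s < 1.0)   # ... and is False here as well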
About two weeks ago I came across the “number of islands” problem on LeetCode. The first idea that jumped into my mind was dynamic programming: I could create a new matrix (let’s call it the ‘number matrix’) with the same size as the problem matrix, and give every position a value: the number of islands in the area from (0, 0) to that position.
For example, take the problem matrix below:
1 1 1 1
0 0 0 0
1 0 1 1
0 0 0 0
problem matrix
We can create the ‘number matrix’:
1 1 1 1
1 1 1 1
2 2 3 3
2 2 3 3
number matrix
Why is the (4, 0) position of the ‘number matrix’ 2? Because the (0, 0) to (4, 0) area of the ‘problem matrix’ contains two islands:
1
0
1
0
(0, 0) to (4, 0) area of ‘problem matrix’
Then, every time we want to calculate the value of a position in the ‘number matrix’, we can first look at the values of its ‘left’, ‘top’ and ‘left-top’ positions.
For example, since the (3, 3), (4, 3) and (3, 4) positions of the ‘number matrix’ are all ‘3’, and the (4, 4) position of the ‘problem matrix’ is ‘0’, the (4, 4) position of the ‘number matrix’ should also be ‘3’.
But unfortunately, after two weeks of struggling, I finally found out that the value of a position in the ‘number matrix’ cannot be decided by its three nearest positions (left, top and left-top). Here is a counterexample:
1 1 1 1 1 1 1
0 0 0 0 0 0 1
1 1 1 1 1 0 1
1 0 0 0 1 0 1
1 0 1 0 1 0 1
1 0 1 1 1 0 1
1 1 1 1 1 1 1
problem matrix A
Let’s just build the number matrix for the (6, 6) to (7, 7) area of problem matrix A:
2 2
2 ?
Don’t rush. Let’s look at another problem matrix:
1 1 1 1 1
1 0 0 0 1
1 0 1 0 1
1 0 0 0 1
1 1 1 1 1
problem matrix B
This time, we also build the number matrix for the (4, 4) to (5, 5) area of problem matrix B:
2 2
2 ?
See? They have the same values for left, top and left-top, but different final answers: problem matrix A has just 1 island, while problem matrix B has 2.
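To double-check this counterexample, below is a small sketch that counts islands with a plain flood fill (this is only a verification helper, not the dynamic-programming idea above; the function name is mine):
def count_islands(grid):
    """Count 4-connected islands of 1s with an iterative flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 and not seen[r][c]:
                count += 1
                stack = [(r, c)]
                seen[r][c] = True
                while stack:
                    cr, cc = stack.pop()
                    for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                        if (0 <= nr < rows and 0 <= nc < cols
                                and grid[nr][nc] == 1 and not seen[nr][nc]):
                            seen[nr][nc] = True
                            stack.append((nr, nc))
    return count

matrix_a = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
matrix_b = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
print(count_islands(matrix_a))  # prints 1
print(count_islands(matrix_b))  # prints 2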
Two weeks of hard thinking only produced a wrong idea. But this is the charm of algorithms 🙂
ls whatever
if [ $? -eq 0 ]; then
    echo "success"
else
    echo "fail"
fi
Since there is no file named ‘whatever’ in the current directory, the return code ‘$?’ will be non-zero and the script will print:
$ sh test.sh
ls: whatever: No such file or directory
fail
But what happens when someone wants to capture the details of the failed command in the shell script, for example by writing the error log to a file while also showing it on standard output, using the ‘tee’ command?
ls whatever|tee log
if [ $? -eq 0 ]; then
    echo "success"
else
    echo "fail"
fi
Unfortunately, ‘$?’ now holds the return status of ‘tee log’, which is 0, instead of that of ‘ls whatever’. So the snippet prints “success”, which is not what we (or at least I) expect.
We can modify our shell script to use the ‘PIPESTATUS’ array instead:
ls whatever|tee log
if [ ${PIPESTATUS[0]} -eq 0 ]; then
    echo "success"
else
    echo "fail"
fi
I am ashamed to say that, even as a long-term UNIX developer, this is the first time I learned that the pipe symbol ‘|’ also changes the return status of the whole command line.
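Note that ‘PIPESTATUS’ is a bash array, not a POSIX sh feature; if /bin/sh points to a shell such as dash, the PIPESTATUS version of the script needs to be run with bash instead of sh.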
We usually use the Python code below to get the number of CPU cores:
from multiprocessing import cpu_count
print("CPU cores:", cpu_count())
But when this snippet runs inside a Docker container, it returns the number of CPU cores of the physical machine the container runs on, not the actual ‘--cpus’ value (for Docker) or CPU limit (for Kubernetes).
Then how can we get the number of CPU cores set by the Docker argument or the Kubernetes configuration for this container?
The only answer I could find is here, and the corresponding Python code I wrote is:
def get_cpu_limit():
    with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as fp:
        cfs_quota_us = int(fp.read())
    with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as fp:
        cfs_period_us = int(fp.read())
    container_cpus = cfs_quota_us // cfs_period_us
    # For physical machine, the `cfs_quota_us` could be '-1'
    cpus = cpu_count() if container_cpus < 1 else container_cpus
    return cpus
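One caveat: those paths belong to cgroup v1. On hosts that use cgroup v2, the limit is exposed in a single file, /sys/fs/cgroup/cpu.max, whose content is either “max <period>” or “<quota> <period>”. A sketch of a variant that handles both cases (the function name here is mine):
import os
from multiprocessing import cpu_count

def get_cpu_limit_any_cgroup():
    v2_path = "/sys/fs/cgroup/cpu.max"
    if os.path.exists(v2_path):
        # cgroup v2: a single file like "200000 100000" or "max 100000"
        with open(v2_path) as fp:
            quota, period = fp.read().split()
        if quota == "max":
            return cpu_count()
        return max(1, int(quota) // int(period))
    # otherwise fall back to the cgroup v1 logic above
    return get_cpu_limit()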
We were using Pandas to get the number of rows for a parquet file:
import pandas as pd
df = pd.read_parquet("my.parquet")
print(df.shape[0])
This is easy, but it costs a lot of time and memory when the parquet file is very large. For example, it may take more than 100 GB of memory just to read a 10 GB parquet file.
If we only need to get the number of rows, not the whole data, Pyarrow will be a better solution:
import pyarrow.parquet as pq
table = pq.read_table("my.parquet", columns=[])
print(table.num_rows)
This method only takes a couple of seconds and uses about 2 GB of memory for the same parquet file.
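If all we need is the row count, an even cheaper option should be to read only the parquet footer metadata, without building a table at all:
import pyarrow.parquet as pq

# Only the file footer (metadata) is read here; no column data is loaded
parquet_file = pq.ParquetFile("my.parquet")
print(parquet_file.metadata.num_rows)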
I intend to use distillation for training my model. The plan is:
1. Train model A and model B with the same code and the same dataset
2. Predict the dataset with model A and model B, and store the average of their results
3. Use the average prediction as the target of a new training process
Steps 1 and 2 were successful. But when I ran the new training process, it reported the loss as “NaN” after some steps.
To find out the reason, I started to print all the “average prediction results” for every step. At first they looked normal, but after a while I found that some inputs contained “NaN”.
Why is there “NaN” in the “average prediction results”? My guess: some samples are too rare (or too special), so the model gives an unreliable output for them. It’s quite easy to just ignore these samples:
if np.isnan(label).any() or not np.isfinite(label).all():
    # Drop the corresponding sample
    return None
Imagine we have an array of numbers [9, 8, 7, 1, 2, 3, 4, 5, 6]. What is the best way to split it into 3 partitions with the most “average” sums, meaning the partition sums have the smallest differences? (Remember that we can’t change the order of the array.)
For example, we could split the array into three partitions like this: [9, 8, 7] [1, 2, 3, 4] [5, 6]. The sums of the three partitions are then 24, 10 and 11. This solution is not the best since the sums differ a lot.
The algorithm and source code for this problem are in the book “The Algorithm Design Manual”. The idea is dynamic programming: let M[n][k] be the smallest possible value of the largest partition sum when the first n elements are split into k partitions; then M[n][k] is the minimum, over all last split points i, of max(M[i][k-1], sum of elements i+1..n). Below is the Python code of my implementation:
# Find the most average way for partitions
import numpy as np

def maximum_partition(sequence, M, nr_partitions, sum_array):
    for n in range(2, len(sequence) + 1):
        for k in range(2, nr_partitions + 1):
            array = []
            for i in range(1, n + 1):
                select = max(M[i][k - 1], sum_array[n - 1] - sum_array[i - 1])
                array.append(select)
            M[n][k] = min(array)
    return M[len(sequence)][nr_partitions]

def init_matrix(sequence, nr_partitions, M, sum_array):
    for index in range(len(sequence)):
        sum_array.append(sum(sequence[: index + 1]))
    for k in range(1, nr_partitions + 1):
        M[1][k] = sequence[0]
    for n in range(1, len(sequence) + 1):
        M[n][1] = sum(sequence[:n])

if __name__ == "__main__":
    # The sequence and the number of partitions
    sequence = [9, 8, 7, 1, 2, 3, 4, 5, 6]
    partitions = 3
    # init
    M = np.zeros((len(sequence) + 1, partitions + 1), dtype=int)
    sum_array = []
    init_matrix(sequence, partitions, M, sum_array)
    # call the main function
    range_sum_max = maximum_partition(sequence, M, partitions, sum_array)
    print("Sum of the maximum range:", range_sum_max)
    # split the sequence by using maximum sum of one range
    current_sum = 0
    for index in range(len(sequence)):
        if (current_sum + sequence[index]) > range_sum_max:
            print("| ", end="")
            current_sum = 0
        current_sum += sequence[index]
        print(sequence[index], end=" ")
    print("\r")
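For the example sequence the answer should be 17: a first partition whose sum stays within 17 can hold at most 9 and 8, and the remaining 36 cannot fit into two partitions of at most 16 each, so 17 is the minimum achievable maximum sum. The script therefore splits the array as 9 8 | 7 1 2 3 4 | 5 6, with partition sums 17, 17 and 11.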
The code is unbelievably simple (fewer than 50 lines). This is the power and charm of dynamic programming.
I have been reading the second edition of “The Algorithm Design Manual” recently. When I reached the chapter about backtracking, I realized that this method can solve many problems, even complicated ones, quickly and efficiently. So I decided to write a solution to the n-queens problem using backtracking.
# Using backtracking to resolve the n-queens problem
import numpy as np

max_columns = 8
max_rows = max_columns

def print_chess(problem):
    head = "_" * (len(problem) + 2)
    tail = "-" * (len(problem) + 2)
    print(head)
    for row in problem:
        row_str = "|"
        for item in row:
            row_str += str(item)
        row_str += "|"
        print(row_str)
    print(tail + "\n")

def remove_diagonal(problem, occupied_coordinate, max_cols, candidates):
    row, col = occupied_coordinate
    step = max_cols - col - 1
    # remove right-down
    if (row + step) < len(problem) and (row + step) in candidates:
        candidates.remove(row + step)
    # remove right-up
    if (row - step) >= 0 and (row - step) in candidates:
        candidates.remove(row - step)

def construct_candidates(problem, k) -> list:
    if k <= 0:  # for the first column
        return [0]  # return 'first row' if it is a 1x1 chessboard
    else:
        # find empty rows
        candidates = set(range(len(problem)))
        for col in range(k):
            for row in range(max_rows):
                if problem[row][col] > 0:
                    # remove queens in the same row or same column
                    if row in candidates:
                        candidates.remove(row)
                    # remove queens in the same diagonal
                    remove_diagonal(problem, (row, col), k, candidates)
        return list(candidates)

# check all rows to make sure they all have queens ('1')
def is_solution(problem, k):
    result = True
    for index in range(min(k, len(problem))):
        if sum(problem[index]) <= 0:
            result = False
    return result

def construct_solution(problem, candidate, k):
    new_problem = problem.copy()
    new_problem[candidate][k - 1] = 1
    return new_problem

def solve(problem, k):
    if is_solution(problem, k):
        print_chess(problem)
        return 1
    else:
        count = 0
        candidates = construct_candidates(problem, k)
        for candidate in candidates:
            new_problem = construct_solution(problem, candidate, k)
            count += solve(new_problem, k + 1)
        return count

if __name__ == "__main__":
    # try to resolve a max_rows x max_columns chess matrix
    initial_problem = np.zeros((max_rows, max_columns), dtype=int)
    count = solve(initial_problem, 1)
    print("Number of solutions: ", count)
I use NumPy for operations of the “chessboard” (matrix) as it’s extremely efficient.
The previous time I wrote code for the n-queens problem was in 2001, twenty years ago. The data centre of my school only gave us some old Intel 486 machines for practice, since we were not students of the computer science department (they kept the newest machines for their own students). The eight-queens problem took about half an hour to run back then. Now it takes only half a second on my laptop:
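For reference, the eight-queens problem has 92 distinct solutions, so the final count printed by the script should be 92.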