All about technology
The first topic on the second day of CLSF 2014 was about NFS, led by Tao Peng from PrimaryData. The NFS protocol has been updated to version 4.2, and the main job of the NFS community now is implementing on the server side many features (such as "server-side copy") that have long been available in local file systems.
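To an application, "server-side copy" ultimately means the kernel can offload a copy request instead of shuttling the data through the client. Below is a minimal sketch of what that looks like from user space, assuming a kernel and glibc new enough to provide copy_file_range() (the interface such offload is later plumbed into); the argument handling and error handling are my own simplification, not code from the talk.

/* Illustration only: let the kernel (and, over NFS, potentially the
 * server) move the data instead of read()/write() through the client.
 * Requires copy_file_range(), i.e. a reasonably new kernel and glibc. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return 1;
	}

	int in = open(argv[1], O_RDONLY);
	int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (in < 0 || out < 0) {
		perror("open");
		return 1;
	}

	struct stat st;
	if (fstat(in, &st) < 0) {
		perror("fstat");
		return 1;
	}

	ssize_t left = st.st_size;
	while (left > 0) {
		ssize_t n = copy_file_range(in, NULL, out, NULL, left, 0);
		if (n < 0) {
			perror("copy_file_range");
			return 1;
		}
		if (n == 0)		/* unexpected EOF on the source */
			break;
		left -= n;
	}

	close(in);
	close(out);
	return 0;
}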
#note: the way to create an ext4 filesystem without a journal: mkfs.ext4 -O ^has_journal .... /dev/sdx
Another problem: Xiaomi wants to use the deadline I/O scheduler, but they cannot use cgroup-based I/O control together with 'deadline'.
Coly Li (from Alibaba): I suggest you try tpps, which is a simple and effective I/O scheduler in ali_kernel.
The next topic, about ext4, was led by Zheng Liu. This year ext4 has added no new major features (maybe that is why it is so stable). In Google's Chrome OS they want to store per-user data such as cookies, so an encryption feature needs to be added to ext4. We asked why Chrome OS does not simply use eCryptfs on top of ext4. Zheng Liu's answer: the requirement came from Google itself, so no one knows why. Ext4 also gained two new options, "sparse_super2" (keep backup superblocks only in the first and the last block groups) and "packed_meta_blocks" (squeeze all ext4 metadata to the beginning of the disk, mainly for SMR disks).
CLSF (China Linux Storage & File System Workshop) is an effort to bring local Linux kernel hackers together to share and exchange ideas. CLSF is an invitation-only workshop; to keep communication effective, only a small group of people is invited. Most of the invitees are active upstream Linux kernel developers from China who focus on the I/O and storage subsystems.
CLSF 2014 was held in the office of Xiaomi, a famous consumer electronics company in China. Participants came mainly from Huawei, Fujitsu, Intel, Alibaba and other companies.
The first topic, led by Jiufei Xue from Huawei, was about ocfs2. Huawei has been building its private cloud product on ocfs2, so over the last two years Huawei's kernel developers have committed many fix patches and new features to the ocfs2 community. This year they added range locking to ocfs2, so users can lock not only a whole file but also a specific range of a file, which improves performance in a cluster when many clients read and write the same files at the same time.
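The idea of range locking is easiest to see from user space: POSIX byte-range locks let a process lock just a slice of a file rather than the whole file. The sketch below only illustrates that concept with ordinary fcntl() advisory locks; it is not the ocfs2/DLM implementation discussed at the workshop, and the file name and offsets are made up.

/* Illustration only: lock bytes [4096, 8192) of a file for writing,
 * leaving the rest of the file available to other processes. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("data.bin", O_RDWR | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct flock fl = {
		.l_type   = F_WRLCK,	/* exclusive (write) lock */
		.l_whence = SEEK_SET,
		.l_start  = 4096,	/* start of the locked range */
		.l_len    = 4096,	/* length of the locked range */
	};
	if (fcntl(fd, F_SETLKW, &fl) < 0) {	/* block until the range is free */
		perror("fcntl(F_SETLKW)");
		return 1;
	}

	/* ... read/modify/write only the locked range here ... */

	fl.l_type = F_UNLCK;			/* release the range lock */
	fcntl(fd, F_SETLK, &fl);
	close(fd);
	return 0;
}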
We use a pipe in our program and ran into a new problem: it fails when we try to write 16MB of data into the pipe in one go. It looks like a pipe has a limited size. But what exactly is the size? After searching the web, the answers were inconsistent: some say 16KB, others say 64KB. Therefore I had to read the kernel code myself to find the correct answer.
Since all the servers in my company run ali_kernel, which is based on the CentOS 2.6.32 kernel, I traced the original code path:
sys_pipe() --> sys_pipe2() --> do_pipe_flags() --> create_write_pipe():
struct file *create_write_pipe(int flags)
{
......
	path.dentry->d_flags &= ~DCACHE_UNHASHED;
	d_instantiate(path.dentry, inode);

	err = -ENFILE;
	f = alloc_file(&path, FMODE_WRITE, &write_pipefifo_fops);
	if (!f)
		goto err_dentry;
	f->f_mapping = inode->i_mapping;
......
It looks like all write operations on the pipe are managed by "write_pipefifo_fops". Let's look inside:
const struct file_operations write_pipefifo_fops = {
	.llseek		= no_llseek,
	.read		= bad_pipe_r,
	.write		= do_sync_write,
	.aio_write	= pipe_write,
	.poll		= pipe_poll,
	.unlocked_ioctl	= pipe_ioctl,
	.open		= pipe_write_open,
	.release	= pipe_write_release,
	.fasync		= pipe_write_fasync,
};
Clearly, pipe_write() is responsible for writing (do_sync_write() just ends up calling the .aio_write method, i.e. pipe_write()). Keep going.
static ssize_t
pipe_write(struct kiocb *iocb, const struct iovec *_iov,
	   unsigned long nr_segs, loff_t ppos)
{
......
	for (;;) {
		int bufs;

		if (!pipe->readers) {
			send_sig(SIGPIPE, current, 0);
			if (!ret)
				ret = -EPIPE;
			break;
		}
		bufs = pipe->nrbufs;
		/* is there room for another page-sized buffer in the ring? */
		if (bufs < PIPE_BUFFERS) {
			int newbuf = (pipe->curbuf + bufs) & (PIPE_BUFFERS-1);
			struct pipe_buffer *buf = pipe->bufs + newbuf;
			struct page *page = pipe->tmp_page;
			char *src;
			int error, atomic = 1;

			if (!page) {
				page = alloc_page(GFP_HIGHUSER);
				if (unlikely(!page)) {
					ret = ret ? : -ENOMEM;
					break;
				}
				pipe->tmp_page = page;
			}
......
			pipe->nrbufs = ++bufs;
			pipe->tmp_page = NULL;

			total_len -= chars;
			if (!total_len)
				break;
		}
......
		/* the ring is full: sleep until a reader drains some buffers */
		pipe_wait(pipe);
......
As shown above, the kernel allocates a new page when an incoming write does not fit into the existing buffers and the ring still has free slots. Every time it adds a page it increments 'pipe->nrbufs', and once 'nrbufs' reaches PIPE_BUFFERS the loop blocks in pipe_wait(), which means the write() system call has to wait. 'PIPE_BUFFERS' is set to 16 and a page in the Linux kernel is 4KB, so a pipe in ali_kernel can hold 64KB (16 * 4KB) of data at one time.
This changed in kernel version 2.6.35, which added a new proc entry, '/proc/sys/fs/pipe-max-size', together with the F_SETPIPE_SZ/F_GETPIPE_SZ fcntl commands, so the pipe capacity is no longer fixed at 64KB.
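A quick way to confirm the capacity from user space is to fill a pipe with non-blocking writes and count how many bytes it accepts before EAGAIN; on 2.6.35 and later you can also query it directly with fcntl(F_GETPIPE_SZ). A minimal sketch (error handling kept deliberately small):

/* Fill a pipe with non-blocking writes to measure its capacity.
 * On 2.6.35+ kernels F_GETPIPE_SZ reports the size as well; on the
 * 2.6.32-based ali_kernel the loop alone shows the 64KB limit. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fds[2];
	if (pipe(fds) < 0) {
		perror("pipe");
		return 1;
	}

	/* Non-blocking, so write() returns EAGAIN instead of sleeping
	 * in pipe_wait() once all PIPE_BUFFERS pages are in use. */
	fcntl(fds[1], F_SETFL, O_NONBLOCK);

	char buf[4096] = {0};
	size_t total = 0;
	for (;;) {
		ssize_t n = write(fds[1], buf, sizeof(buf));
		if (n < 0) {
			if (errno == EAGAIN)
				break;		/* the pipe is full */
			perror("write");
			return 1;
		}
		total += n;
	}
	printf("pipe accepted %zu bytes before filling up\n", total);

#ifdef F_GETPIPE_SZ
	/* Available since 2.6.35, together with /proc/sys/fs/pipe-max-size */
	printf("F_GETPIPE_SZ reports %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));
#endif
	return 0;
}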
Problem 1:
The zookeeper cluster had been running well for half a year. But today, after I re-configured it and ran the command
zkServer.sh start-foreground
it failed to start up and reported:
log4j:WARN No appenders could be found for logger (org.apache.zookeeper.server.quorum.QuorumPeerConfig).
log4j:WARN Please initialize the log4j system properly.
Invalid config, exiting abnormally
The point is the last line, "Invalid config" (the log4j lines are just warnings); therefore I reviewed zoo.cfg many times but could not find any mistake in it.
After checking all the configuration again, I eventually found the problem: the file "myid" was missing. After adding the "myid" file, zookeeper started up correctly.
echo [the server's id number, matching its server.N line in zoo.cfg] > /var/log/zookeeper/myid (the directory is the 'dataDir' configured in zoo.cfg)
It seems zookeeper's error message is misleading: it says the config is invalid, but the real reason is a missing file.
Problem 2:
To tolerate the failure of at most four servers, we assumed that a five-server zookeeper cluster would be enough. After studying Paxos for a while, a question occurred to me: the majority of a five-server cluster is three servers, so how could zookeeper elect a new leader if more than two servers are down? I ran the test and found that zookeeper does indeed fail to work once more than two of the five servers are shut down.
The correct size of a zookeeper cluster that can tolerate the failure of four servers is nine (2 * 4 + 1), because after four servers go down the five survivors are still a majority of the nine-server cluster.
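The arithmetic is simple enough to put in a few lines; the helper names in the sketch below are just mine for illustration.

/* Quorum arithmetic for a zookeeper ensemble (illustration only).
 * To tolerate f failed servers the ensemble needs 2*f + 1 members,
 * because the survivors must still form a strict majority. */
#include <stdio.h>

static int majority(int ensemble)
{
	return ensemble / 2 + 1;
}

static int servers_needed(int failures_tolerated)
{
	return 2 * failures_tolerated + 1;
}

int main(void)
{
	for (int f = 1; f <= 4; f++) {
		int n = servers_needed(f);
		printf("tolerate %d failures: %d servers (majority = %d, survivors = %d)\n",
		       f, n, majority(n), n - f);
	}
	return 0;
}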