seastar – Robin on Linux

China Linux Storage & Filesystem 2015 workshop (first day)

The first topic was led by Haomai Wang from XSKY. He first introduced some basic concepts of Ceph, and I took the opportunity to ask some questions.
Robin (from Alibaba): Does Ceph cache all metadata (the so-called “cluster map”) on the monitor nodes, so that a client can fetch data with just one network hop?
Haomai: Yes. One hop, and the request reaches the OSD.
Robin: If I use CephFS, is it still one hop?
Haomai: Still one hop. Although CephFS adds the MDS, the MDS does not store the filesystem's data or metadata; it only stores the context of the distributed locks.
Ceph also supports Samba 2.0/3.0 now. On Linux, it is recommended to use iSCSI to access a Ceph storage cluster, because using the rbd/libceph kernel modules would require updating the kernel on the clients. Ceph uses a pipeline model for message processing, so it works well with hard disks but not with SSDs. In the future, the developers will use an asynchronous framework (such as Seastar) to refactor Ceph.
Robin: If I use three replicas in Ceph, will the client write the three copies concurrently?
Haomai: No. First, the I/O goes to the primary OSD. The primary OSD then issues two replicated I/Os to the other two OSDs, waits until both of them come back, and only then returns “the I/O succeeded” to the client.
Robin: OK, so we still have two hops…. Is it difficult to change the OSDs to write at the same time, so that we can lower the latency of Ceph?
Haomai: That will not be easy. Ceph uses the primary OSD to guarantee the consistency of write transactions.
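To make the write path Haomai describes more concrete, here is a minimal Python sketch of primary-copy replication. It is only a toy simulation: the `OSD` class, `primary_write`, and `demo` are hypothetical names of mine, not Ceph's real interfaces, and the choice to apply the primary's local write before fanning out to the replicas is an assumption.

```python
import asyncio


class OSD:
    """Toy stand-in for a storage daemon (hypothetical, not Ceph's OSD API)."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    async def write_local(self, key, value):
        await asyncio.sleep(0)  # placeholder for real disk I/O
        self.store[key] = value
        return True


async def primary_write(primary, replicas, key, value):
    """Client -> primary is hop one; primary -> replicas is hop two."""
    # Assumption: the primary applies the write locally first.
    await primary.write_local(key, value)
    # The two replica writes are issued together and awaited together.
    acks = await asyncio.gather(*(r.write_local(key, value) for r in replicas))
    # Only after every replica has acknowledged does the client get its ack.
    return all(acks)


def demo():
    osds = [OSD("osd.0"), OSD("osd.1"), OSD("osd.2")]
    ok = asyncio.run(primary_write(osds[0], osds[1:], "obj", b"data"))
    return ok, [o.store.get("obj") for o in osds]
```

Because the client only ever talks to the primary, one node decides the order of writes, which is the consistency argument above. The price is the second hop I was asking about.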
The future development plan for Ceph includes de-duplication at the pool level. Coly Li (from SUSE) said that de-duplication is better done at the business level than at the block level, because the duplicated information has already been split up at the block level. But the developers in the Ceph community still seem to want to make Ceph omnipotent.

Discussion about Ceph

Jiaju Zhang from Red Hat led the topic on use cases of Ceph in enterprises. Ceph has become the most famous open source storage software in the world and is also used by Red Hat, Intel, SanDisk (low-end storage arrays), Samsung, and SUSE.
The next topic was about ScyllaDB and Seastar, led by Asias He from OSv. ScyllaDB is a distributed key/value store engine written in C++14 and fully compatible with Cassandra. It can also run CQL (Cassandra Query Language). In the graph, ScyllaDB is 40 times faster than Cassandra. The asynchronous development framework used in ScyllaDB is called Seastar.
Robin: What’s the magic in ScyllaDB?
Asias: We shard requests to every CPU core and run with no locks and no threads. Data is zero-copy, and we use bidirectional queues to transfer messages between cores. The test results are based on the kernel TCP/IP network stack, but we will use our own network stack in the future.
Yanhai Zhu (from Alibaba): I think the test you did is not fair enough: ScyllaDB is designed to run on multiple cores, but Cassandra is not. You guys should run 24 Cassandra instances to compare with ScyllaDB, not just one.
Asias: Maybe you are right. But ScyllaDB uses message queues to transfer messages between CPU cores, so it avoids the cost of atomic operations and locks. Also, Cassandra is written in Java, which means its performance drops when the JVM does garbage collection. ScyllaDB is written entirely in C++, so its performance is much steadier.
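The shard-per-core idea Asias describes can be sketched roughly as follows. This is a single-process Python simulation, not Seastar's actual API: `Shard`, `ShardedStore`, and the CRC32-based routing are hypothetical choices of mine, and a plain `deque` stands in for the queues between cores.

```python
import zlib
from collections import deque


class Shard:
    """One shard per (simulated) core; only this shard touches its own data."""

    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.data = {}        # private state, so no lock is needed
        self.inbox = deque()  # stand-in for an inter-core message queue

    def run_once(self):
        """Process every queued message in arrival order."""
        replies = []
        while self.inbox:
            op, key, value = self.inbox.popleft()
            if op == "put":
                self.data[key] = value
                replies.append((key, "ok"))
            elif op == "get":
                replies.append((key, self.data.get(key)))
        return replies


class ShardedStore:
    def __init__(self, n_shards):
        self.shards = [Shard(i) for i in range(n_shards)]

    def owner(self, key):
        # Assumed routing rule: a stable hash of the key picks the shard.
        return zlib.crc32(key.encode()) % len(self.shards)

    def submit(self, op, key, value=None):
        # A request is never executed by the caller; it is sent as a message.
        self.shards[self.owner(key)].inbox.append((op, key, value))

    def drain(self):
        out = []
        for shard in self.shards:
            out.extend(shard.run_once())
        return out
```

Since every key has exactly one owning shard, and that shard handles its messages one at a time, a `get` queued after a `put` for the same key sees the new value without any lock or atomic operation on the data itself.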
The last topic of the day was led by Xu Wang, the CTO of Hyper (a startup company in China that works on running containers like VMs).
The name Hyper means “hypervisor” plus “Docker image”. Customers can now run Docker images on Xen/KVM/VirtualBox.

clsf2015

The guy on the right side is Xu Wang