All about technology
Recently I gave a presentation about my past work on the Linux kernel, and dug out the slides I made four years ago (when I was on the kernel team at Taobao). For anybody who is interested in file systems 🙂
I need to compress small pieces of data in my project without too heavy a side effect on performance. Many people recommended LZ4 to me, since it is nearly the fastest compression algorithm available at present.
Compression ratio is not the most important factor here, but it still needs to be evaluated. So I created a 4 KB file containing normal text (taken from the README of an open-source project) and compressed it with LZ4 and GZIP.
                  | LZ4  | GZIP -1
Compression Ratio | 1.57 | 2.24
Hmm, the compression ratio of LZ4 doesn't look bad. But then I ran the test on this special content (adapted from here):
The First 1,000 Primes (the 1,000th is 7919) For more information on primes see http://primes.utm.edu/ 2,,3,,5,,7, 11, 13, 17, 19, 23, 29 31, 37, 41, 43, 47, 53, 59, 61, 67, 71 73, 79, 83, 89, 97, 101, 103, 107, 109, 113 127, 131, 137, 139, 149, 151, 157, 163, 167, 173 179, 181, 191, 193, 197, 199, 211, 223, 227, 229 233, 239, 241, 251, 257, 263, 269, 271, 277, 281 283, 293, 307, 311, 313, 317, 331, 337, 347, 349 353, 359, 367, 373, 379, 383, 389, 397, 401, 409 419, 421, 431, 433, 439, 443, 449, 457, 461, 463 467, 479, 487, 491, 499, 503, 509, 521, 523, 541 547, 557, 563, 569, 571, 577, 587, 593, 599, 601 607, 613, 617, 619, 631, 641, 643, 647, 653, 659 661, 673, 677, 683, 691, 701, 709, 719, 727, 733 739, 743, 751, 757, 761, 769, 773, 787, 797, 809 811, 821, 823, 827, 829, 839, 853, 857, 859, 863 877, 881, 883, 887, 907, 911, 919, 929, 937, 941 947, 953, 967, 971, 977, 983, 991, 997,1009,1013 1019,1021,1031,1033,1039,1049,1051,1061,1063,1069 1087,1091,1093,1097,1103,1109,1117,1123,1129,1151 1153,1163,1171,1181,1187,1193,1201,1213,1217,1223 1229,1231,1237,1249,1259,1277,1279,1283,1289,1291 1297,1301,1303,1307,1319,1321,1327,1361,1367,1373 1381,1399,1409,1423,1427,1429,1433,1439,1447,1451 1453,1459,1471,1481,1483,1487,1489,1493,1499,1511 1523,1531,1543,1549,1553,1559,1567,1571,1579,1583 1597,1601,1607,1609,1613,1619,1621,1627,1637,1657 1663,1667,1669,1693,1697,1699,1709,1721,1723,1733 1741,1747,1753,1759,1777,1783,1787,1789,1801,1811 1823,1831,1847,1861,1867,1871,1873,1877,1879,1889 1901,1907,1913,1931,1933,1949,1951,1973,1979,1987 1993,1997,1999,2003,2011,2017,2027,2029,2039,2053 2063,2069,2081,2083,2087,2089,2099,2111,2113,2129 2131,2137,2141,2143,2153,2161,2179,2203,2207,2213 2221,2237,2239,2243,2251,2267,2269,2273,2281,2287 2293,2297,2309,2311,2333,2339,2341,2347,2351,2357 2371,2377,2381,2383,2389,2393,2399,2411,2417,2423 2437,2441,2447,2459,2467,2473,2477,2503,2521,2531 2539,2543,2549,2551,2557,2579,2591,2593,2609,2617 
2621,2633,2647,2657,2659,2663,2671,2677,2683,2687 2689,2693,2699,2707,2711,2713,2719,2729,2731,2741 2749,2753,2767,2777,2789,2791,2797,2801,2803,2819 2833,2837,2843,2851,2857,2861,2879,2887,2897,2903 2909,2917,2927,2939,2953,2957,2963,2969,2971,2999 3001,3011,3019,3023,3037,3041,3049,3061,3067,3079 3083,3089,3109,3119,3121,3137,3163,3167,3169,3181 3187,3191,3203,3209,3217,3221,3229,3251,3253,3257 3259,3271,3299,3301,3307,3313,3319,3323,3329,3331 3343,3347,3359,3361,3371,3373,3389,3391,3407,3413 3433,3449,3457,3461,3463,3467,3469,3491,3499,3511 3517,3527,3529,3533,3539,3541,3547,3557,3559,3571 3581,3583,3593,3607,3613,3617,3623,3631,3637,3643 3659,3671,3673,3677,3691,3697,3701,3709,3719,3727 3733,3739,3761,3767,3769,3779,3793,3797,3803,3821 3823,3833,3847,3851,3853,3863,3877,3881,3889,3907 3911,3917,3919,3923,3929,3931,3943,3947,3967,3989 4001,4003,4007,4013,4019,4021,4027,4049,4051,4057 4073,4079,4091,4093,4099,4111,4127,4129,4133,4139 4153,4157,4159,4177,4201,4211,4217,4219,4229,4231 4241,4243,4253,4259,4261,4271,4273,4283,4289,4297 4327,4337,4339,4349,4357,4363,4373,4391,4397,4409 4421,4423,4441,4447,4451,4457,4463,4481,4483,4493 4507,4513,4517,4519,4523,4547,4549,4561,4567,4583 4591,4597,4603,4621,4637,4639,4643,4649,4651,4657 4663,4673,4679,4691,4703,4721,4723,4729,4733,4751 4759,4783,4787,4789,4793,4799,4801,4813,4817,4831 4861,4871,4877,4889,4903,4909,4919,4931,4933,4937 4943,4951,4957,4967,4969,4973,4987,4993,4999,5003 5009,5011,5021,5023,5039,5051,5059,5077,5081,5087 5099,5101,5107,5113,5119,5147,5153,5167,5171,5179 5189,5197,5209,5227,5231,5233,5237,5261,5273,5279 5281,5297,5303,5309,5323,5333,5347,5351,5381,5387 5393,5399,5407,5413,5417,5419,5431,5437,5441,5443 5449,5471,5477,5479,5483,5501,5503,5507,5519,5521 5527,5531,5557,5563,5569,5573,5581,5591,5623,5639 5641,5647,5651,5653,5657,5659,5669,5683,5689,5693 5701,5711,5717,5737,5741,5743,5749,5779,5783,5791 5801,5807,5813,5821,5827,5839,5843,5849,5851,5857 5861,5867,5869,5879,5881,5897,5903,5923,5927,5939 
5953,5981,5987,6007,6011,6029,6037,6043,6
the result became interesting:
                  | LZ4  | GZIP -1
Compression Ratio | 0.99 | 2.25
GZIP compressed this content as usual, but LZ4 didn't. So LZ4 fails to compress some special content (such as numeric text), even though it is faster than almost any other compression algorithm.
Somebody may ask: why not use a specialised algorithm to compress numeric content? The answer: in a real system, we cannot know the type of our small data segments in advance.
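The measurement is easy to reproduce with Python's standard library. This is a hedged sketch: zlib implements the same DEFLATE algorithm that gzip -1 uses, but LZ4 is not in the stdlib, so only the gzip side is shown, and the two inputs are made-up stand-ins for my test files, not the originals.

```python
import zlib

# Made-up stand-ins for the two ~4 KB test inputs:
# README-like prose, and comma-separated numbers like the primes list.
prose = (b"This program is distributed in the hope that it will be useful, "
         b"but WITHOUT ANY WARRANTY; see the license for details. " * 40)[:4096]
numbers = ",".join(str(n) for n in range(1000, 2000)).encode()[:4096]

def gzip1_ratio(data: bytes) -> float:
    """Compression ratio (original size / compressed size) at level 1."""
    return len(data) / len(zlib.compress(data, 1))

print("prose:  ", round(gzip1_ratio(prose), 2))
print("numbers:", round(gzip1_ratio(numbers), 2))
```

Both inputs shrink under gzip, because DEFLATE's Huffman stage squeezes digit-heavy bytes even when the LZ stage finds few repeated strings, whereas LZ4, which has no entropy-coding stage, is left with nothing to exploit on such data.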
Redis uses an rdb file to make data persistent. Yesterday, I used redis-rdb-tools to dump the data from an rdb file into JSON format. After that, I wrote Scala code to read the data from the JSON file and put it into Redis.
First, I found that the JSON file was almost 50% bigger than the rdb file. After checking the whole JSON file, I confirmed that the root cause was not the redundant symbols in JSON, such as braces and brackets, but the Unicode escaping in JSON format, especially "\u0001" for the ASCII byte 0x01. So I wrote code to replace it:
val key: String = line.stripPrefix("\"").stripSuffix("\"").replaceAll("\\\\u0001", "^A")
Then the size of the JSON file became normal.
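The inflation is easy to check: JSON encodes a control byte such as 0x01 as the six-character escape \u0001. A quick Python demonstration:

```python
import json

raw = "\x01" * 100             # 100 one-byte control characters
encoded = json.dumps(raw)      # each byte becomes the six chars \u0001
print(len(raw), len(encoded))  # 100 vs 602 (600 escape chars + 2 quotes)
```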
There was still another problem. To read the JSON file line by line, I used code from http://naildrivin5.com/blog/2010/01/26/reading-a-file-in-scala-ruby-java.html:
import scala.io._

object ReadFile extends Application {
  val s = Source.fromFile("some_file.txt")
  s.getLines.foreach( (line) => {
    println(line.trim.toUpperCase)
  })
}
But this code loads all the data from the file before running "foreach". As my file is bigger than 100 GB, it costs far too much time and RAM ….
The best way is to use Java's I/O streams:
import java.io.{BufferedReader, File, FileReader}

val dataFile = new File("my.txt")
val reader = new BufferedReader(new FileReader(dataFile))
var line: String = reader.readLine()
while (line != null) {
  // do something with 'line'
  line = reader.readLine()
}
This reads the file truly line by line.
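For comparison, the same streaming pattern in Python, where iterating a file object yields one buffered line at a time, so memory stays constant regardless of file size (the sample file here is made up for illustration):

```python
import os
import tempfile

# Create a small sample file standing in for the huge data file.
path = os.path.join(tempfile.mkdtemp(), "my.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

lines = []
with open(path) as f:
    for line in f:  # lazy: reads one buffered line per iteration
        lines.append(line.strip().upper())

print(lines)
```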
While testing the performance of Redis these days, I needed the mset() interface of Jedis (a Java Redis client). But the prototype of mset() in Jedis is:
@Override
public String mset(final String... keysvalues) {
At first, I wrote my Scala code like this:
var array = Array[String]()
array = array :+ key1 :+ value1
array = array :+ key2 :+ value2
jedis.mset(array)
But it reported a compile error:
[error] xxx: overloaded method value mset with alternatives:
[error]   (x$1: String*)String
[error]   (x$1: Array[Byte]*)String
[error]  cannot be applied to (Array[String])
[error] jedis.mset(array)
[error]            ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 4 s, completed Jan 8, 2016 11:32:47 AM
After searching through many Scala/Java documents on Google, I finally found the answer: http://docs.scala-lang.org/style/types.html. So, let's write the code this way:
jedis.mset(array:_*)
Then the Scala Array[String] is passed as Java varargs. This also works for Seq[String].
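The same spread operation exists in Python as *, which may make the parallel clearer. mset() below is a toy stand-in for the Jedis method, not the real API:

```python
store = {}

def mset(*keysvalues: str) -> str:
    """Toy stand-in for Jedis's mset(final String... keysvalues)."""
    if len(keysvalues) % 2 != 0:
        raise ValueError("wrong number of arguments for MSET")
    store.update(zip(keysvalues[::2], keysvalues[1::2]))
    return "OK"

array = ["key1", "value1", "key2", "value2"]
# mset(array) would bind the whole list to a single vararg slot;
# mset(*array) spreads it, just like jedis.mset(array:_*) in Scala.
result = mset(*array)
print(result, store)
```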
The first book is about network hardware, like routers and switches. As a coder, I usually use servers in the cloud, so I had never seen a real high-performance router (I have only seen bare-metal servers and 1 Gb switches). This book opened my eyes.
The second book is about how to build a datacenter. That is really a job for architects, not IT guys.
About two years ago, I worked with the MySQL team in my company as a kernel developer. We used NAND PCIe cards and flashcache as our solution for MySQL to handle high-throughput pressure. But not until this year did I read through the architecture of the InnoDB engine, the most powerful and effective storage engine in MySQL. Actually, it's not so difficult to get an overview of InnoDB from a book. It is still very hard to understand its code, though 🙂
I haven't gone to the cinema to watch "The Martian", because I had already read it on my Kindle during my commute every day. It is really a sci-fi story for geeks who like doing research in computing, chemistry, physics, etc. The only question I want to ask the author is: "How could you invent so many troubles on Mars to torture Mark Watney?"
Unikernels are specialised, single-address-space machine images constructed using library operating systems. The concept is quite old (it dates back to embedded systems in the 1980s), but it has become more and more popular in this cloud-computing age thanks to its portability and security.
In recent days, I tested two famous unikernel products, Rumpkernel and OSv, by running Redis in them.
1. Run redis in Rumpkernel (KVM)
First, build rumpkernel and its environment following https://github.com/rumpkernel/wiki/wiki/Tutorial%3A-Serve-a-static-website-as-a-Unikernel, then:
git clone https://github.com/rumpkernel/rumprun-packages.git
cd rumprun-packages/redis
make # we get "bin/redis-server" now
rumprun-bake hw_virtio ./redis.bin bin/redis-server # we get "redis.bin" now, so try to run it (make sure you have configured tap)
rumprun kvm -i -M 4096 \
-I if,vioif,'-net tap,script=no,ifname=tap0'\
-W if,inet,static,10.0.120.101/24 \
-b images/data.iso,/data \
-- ./redis.bin /data/conf/redis.conf
2. Run redis in OSv (KVM)
First, build OSv following the tutorial at https://github.com/cloudius-systems/osv/ and set up the virbr0 network (as qemu/kvm usually does), then:
vim apps/redis-memonly/GET # change the redis version to 3.0.2 (as same as redis in rumpkernel)
./scripts/build image=redis-memonly -j20
# run it (use only one cpu core, as rumpkernel)
sudo ./scripts/run.py -nv -c 1
3. Run redis on host (centos 7 on bare hardware)
wget "http://download.redis.io/releases/redis-3.0.2.tar.gz"
tar xzf redis-3.0.2.tar.gz
cd redis-3.0.2
make -j20
./src/redis-server
4. Use a benchmark tool to test them
I chose memtier_benchmark as the benchmark tool.
5. The test result
I was trying to install a new Linux kernel (4.4-rc5) on my CentOS 7 server. But when I ran "sudo make install", it reported:
$ sudo make install
sh ./arch/x86/boot/install.sh 4.4.0-rc5 arch/x86/boot/bzImage \
    System.map "/boot"
gzip: stdout: No space left on device
dracut: creation of /boot/initramfs-0-rescue-a5ad1e5b00de400bbc8e83ec69fbe9ee.img failed
cp: error writing '/boot/vmlinuz-0-rescue-a5ad1e5b00de400bbc8e83ec69fbe9ee': No space left on device
cp: failed to extend '/boot/vmlinuz-0-rescue-a5ad1e5b00de400bbc8e83ec69fbe9ee': No space left on device
grubby: error writing /boot/grub2/grub.cfg-: No space left on device
grubby: error writing /boot/grub2/grub.cfg-: No space left on device
The initramfs-0-rescue-XXX file occupied too much space on the boot device. Then I found this article. But after adding dracut_rescue_image="no" to /etc/dracut.conf, the problem still existed.
Finally, I ran
$ grep dracut_rescue /usr/lib/dracut/* -rn
/usr/lib/dracut/dracut.conf.d/02-rescue.conf:1:dracut_rescue_image="yes"
So on CentOS 7 the effective configuration item for dracut lives in /usr/lib/dracut/dracut.conf.d/02-rescue.conf, not /etc/dracut.conf. The final solution is:
$ vim /usr/lib/dracut/dracut.conf.d/02-rescue.conf
# change "yes" to "no"
dracut_rescue_image="no"
In my kernel module, I first wrote:
int alloc_device(int number)
{
	char name[64];

	snprintf(name, sizeof(name), "worker%d", number);
	request_cache = kmem_cache_create(name, SECTOR_SIZE, 0, 0, NULL);
	......
}
On CentOS 7, this module worked fine, but after I ported it to CentOS 6, it reported:
kmem_cache_create: duplicate cache worker0 ......
The key to this problem lies in the implementation of kmem_cache_create() in the 2.6.32 kernel (used by CentOS 6):
struct kmem_cache *
kmem_cache_create (const char *name, size_t size, size_t align,
	unsigned long flags, void (*ctor)(void *))
{
	......
	cachep->ctor = ctor;
	cachep->name = name;
	......
}
When creating a new cache, it only stores the 'name' pointer; it does not strdup() a copy. But the 'name' in my kernel module was a temporary variable (on the stack), so once that stack memory was reused, the allocator considered the name "duplicated".
The correct code should look like:
static char (*names)[64];

/* before calling alloc_device() */
names = kcalloc(NR_OF_DEVICE, sizeof(*names), GFP_KERNEL);
......

int alloc_device(int number)
{
	snprintf(names[number], sizeof(names[number]), "worker%d", number);
	request_cache = kmem_cache_create(names[number], SECTOR_SIZE, 0, 0, NULL);
	......
}
But why didn't the old module report this error on CentOS 7? Because the default memory allocator on CentOS 7 is SLUB, while on CentOS 6 it is SLAB, and the two have totally different implementations.
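The pointer-versus-copy pitfall can be mimicked outside the kernel. In this Python analogy (an illustration, not kernel code), the registry keeps the caller's mutable buffer by reference, just as 2.6.32's kmem_cache_create() keeps the name pointer, so reusing the buffer corrupts the registered name:

```python
registry = []  # keeps name buffers by reference, like cachep->name = name

def create_cache(name: bytearray) -> None:
    """Register a cache name; reject duplicates, like kmem_cache_create()."""
    for stored in registry:
        if stored == name:
            raise ValueError("duplicate cache " + name.decode())
    registry.append(name)  # stores a reference, no copy

buf = bytearray(b"worker0")
create_cache(buf)
buf[:] = b"worker1"  # caller reuses the buffer, as a reused stack slot would
try:
    create_cache(bytearray(b"worker1"))  # spurious "duplicate cache" error
except ValueError as e:
    print(e)
```

Copying the name on registration (registry.append(bytes(name))) would avoid the clash, the same way per-device buffers fix the module.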
fio is an effective tool for testing the I/O performance of a block device (or a file system).
Today, a colleague told me that fio had reported "Cache flush bypassed!", which he read as meaning that all I/O had bypassed the device cache. I couldn't agree, because the cache policy of a RAID card can usually only be changed by a dedicated tool (such as MegaCli), not by a test tool.
Reviewing the code of fio:
static int __file_invalidate_cache(struct thread_data *td, struct fio_file *f,
unsigned long long off,
unsigned long long len)
{
......
} else if (f->filetype == FIO_TYPE_BD) {
int retry_count = 0;
ret = blockdev_invalidate_cache(f);
while (ret < 0 && errno == EAGAIN && retry_count++ < 25) {
/*
* Linux multipath devices reject ioctl while
* the maps are being updated. That window can
* last tens of milliseconds; we'll try up to
* a quarter of a second.
*/
usleep(10000);
ret = blockdev_invalidate_cache(f);
}
if (ret < 0 && errno == EACCES && geteuid()) {
if (!root_warn) {
log_err("fio: only root may flush block "
"devices. Cache flush bypassed!\n");
root_warn = 1;
}
ret = 0;
}
and the implementation of blockdev_invalidate_cache() is:
static inline int blockdev_invalidate_cache(struct fio_file *f)
{
return ioctl(f->fd, BLKFLSBUF);
}
Therefore, "Cache flush bypassed!" does not mean that all I/O will bypass the device cache; it actually means: "fio can't flush the device cache, so let's ignore it".
If you want to disable the DRAM cache on the RAID card, the correct way is to set its cache policy to "Write Through":
sudo /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp -WT -Immediate -Lall -aALL
When we use read() to read data from a socket, like:
ret = read(fd, buffer, sizeof(struct msg));
read() may return a 'ret' smaller than sizeof(struct msg), even when the socket is not O_NONBLOCK.
The correct way is:
ret = recv(fd, buffer, sizeof(struct msg), MSG_WAITALL);
Then recv() will wait until all sizeof(struct msg) bytes have been read.
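The same flag is exposed in Python's socket module, which makes the behaviour easy to observe: the message arrives in two chunks, yet a single recv() with MSG_WAITALL returns all of it (MSG_SIZE here stands in for sizeof(struct msg)):

```python
import socket
import threading
import time

MSG_SIZE = 16  # stand-in for sizeof(struct msg)
a, b = socket.socketpair()

def writer():
    a.sendall(b"x" * 8)  # first half of the message
    time.sleep(0.05)     # make sure the halves arrive separately
    a.sendall(b"y" * 8)  # second half

t = threading.Thread(target=writer)
t.start()
# Without MSG_WAITALL, this recv() could legally return after only the
# first 8 bytes; with it, the call blocks until MSG_SIZE bytes arrive.
data = b.recv(MSG_SIZE, socket.MSG_WAITALL)
t.join()
print(len(data))  # 16
```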