Saturday, January 30, 2010

Buffered IO and minor page faults

I had a problem in which my application is creating lots of minor page faults in the range of multiple ten thousands, I initially thought it is created by malloc() calls as mentioned by ezolt's paper, by further analysis showed that that problem is fixed in the glibc. I was sure that mmap will be a source of minor page faults as it is adding a free page to the process page table. So I did a strace of the application, that showed me that one mmap() call is coming after every open() call, by looking at the source code, I figured out that buffered IO calls like fopen, fread are creating this problem.

open("/root/dmesg.txt", O_RDONLY)       = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=23818, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2afbfba25000
read(3, "Linux version 2.6.18-120.el5 (bre"..., 4096) = 4096
read(3, "dle threads.\nCPU: Physical Proces"..., 4096) = 4096
read(3, "device 0000:00:1c.0 to 64\nPCI: Se"..., 4096) = 4096
read(3, "SB hub found\nhub 1-0:1.0: 8 ports"..., 4096) = 4096
read(3, "t 4\nusb-storage: waiting for devi"..., 4096) = 4096
read(3, "20\nACPI: PCI Interrupt 0000:01:04"..., 4096) = 3338
read(3, ""..., 4096)                    = 0
close(3)         

Now what is buffered IO? it is a buffered layer implemented on top of Direct IO  calls like open, read etc. Glibc library will have a buffer of typical size 4096bytes(i.e. size of page 4K), every read and write by the application is served from this buffer. If you want to know more about buffered IO read the books The C Standard Library and Unix File Systems.

The mmap() call is used by the glibc to create buffer, it is clearly mapping 4K page buffer. I really don't know why this is done this way, I asked in the glibc mailing list but nobody responded, see the post.

So how we will avoid minor page faults created by buffered IO calls? Luckily glibc provides another function called setvbuf(), by which application can provide it is own buffer.  If we provide our own buffer then glibc will not allocate buffer by using mmap(). So using setvbuf() avoids minor page faults and also improves the program performance.

10000 loops

[root@mysys testnew]# time ./read_nobuffer       --> direct IO
real    0m0.922s
user    0m0.210s
sys     0m0.711s

[root@mysys testnew]# time ./read_bufmmap     --> buffered IO with library mmap
real    0m0.321s
user    0m0.106s
sys     0m0.215s

[root@mysys testnew]# time ./read_bufsetvbuf    --> buffered IO with
user provided buffer,setvbuf()
real    0m0.178s
user    0m0.071s
sys     0m0.106s

[root@mysys testnew]#

 Minor Page Faults (see under faults/s)

[root@mysys ~]# sar -B 1 10000
Linux 2.6.18-120.el5 (mysys)     12/16/2009

06:17:43 PM  pgpgin/s pgpgout/s   fault/s  majflt/s
06:17:44 PM      0.00      0.00     51.00      0.00
06:17:45 PM      0.00      0.00     24.00      0.00
06:17:46 PM      0.00      0.00     12.00      0.00
06:17:47 PM      0.00      0.00     12.00      0.00
06:17:48 PM      0.00     31.68     11.88      0.00
06:17:49 PM      0.00      4.04     12.12      0.00
06:17:50 PM      0.00      0.00     12.00      0.00
06:17:51 PM      0.00      0.00     14.00      0.00
06:17:52 PM      0.00      0.00     12.00      0.00
06:17:53 PM      0.00      0.00    163.00      0.00   ---> direct IO
06:17:54 PM      0.00      0.00     36.00      0.00
06:17:55 PM      0.00      0.00     14.00      0.00
06:17:56 PM      0.00      0.00  10213.00      0.00  ---> buffered IO with library mmap
06:17:57 PM      0.00      0.00     13.27      0.00
06:17:58 PM      0.00     28.57    217.35      0.00  ---> buffered IO with user provided buffer,setvbuf()
06:17:59 PM      0.00      0.00     12.24      0.00
06:18:00 PM      0.00      0.00     12.12      0.00
06:18:01 PM      0.00      0.00     12.24      0.00
 
Source code for programs I used is pasted below.

[root@mysys testnew]# cat read_nobuffer.c
int main(void)
{
       char buffer[256];
       int i = 0;

       int fp;
       while(i++ < 10000)
       {
       if(!(fp = open("/root/dmesg.txt", O_RDONLY)))
               return 0;
       while(read(fp, buffer, 256));
       close(fp);
       }
       return 0;
}

[root@mysys testnew]# cat read_bufmmap.c
int main(void)
{
       char buffer[256];
       int i = 0;

       FILE *fp;
       while(i++ < 10000)
       {
       if((fp = fopen("/root/dmesg.txt", "r")) == NULL)
               return 0;
       while(!feof(fp))
               fread(buffer, 256, 1, fp);
       fclose(fp);
       }
       return 0;
}

[root@mysys testnew]# cat read_bufsetvbuf.c
char buffer123[8192];
int main(void)
{
       char buffer[256];
       int i = 0;

       FILE *fp;
       while(i++ < 10000)
       {
       if((fp = fopen("/root/dmesg.txt", "r")) == NULL)
               return 0;
       setvbuf(fp, buffer123, _IOFBF,  4096);
       while(!feof(fp))
               fread(buffer, 256, 1, fp);
       fclose(fp);
       }
       return 0;
}

Building Android Emulator on SMP Fedora 12

I was trying to build Android Eclair Emulator on a Fedora 12 SMP system. By making use of SMP support in make utility I was able to build the emulator much faster.

See the cpu utilizations graphs below. When we use of make without any SMP options, cpu utilization graph will come as below.

 
where one cpu is utilized to the maxium and other is not much used. But if you provide SMP options to make, then cpu utilization graph will change like what is below.

Now both the cpu's are utilized to the maximum. Now I will explain the steps and some tricks involved in compiling Eclair branch of Android on system with non standard java installation. Non standard java installation means, you have a system where java version is not 1.5, like my system where installed java version is 1.6 

[nmathew@Devel Desktop]$ java -version
java version "1.6.0_0"
OpenJDK Runtime Environment (IcedTea6 1.6) (fedora-33.b16.fc12-i386)
OpenJDK Client VM (build 14.0-b16, mixed mode)

Android requires java 1.5 i.e. because dalvik currently understands only the code generated by java 1.5, I think Android developers are adding support for Java 1.6.
Now we will start with it. You can follow the steps mentioned here. In order to sync eclair branch you must use repo init -u git://android.git.kernel.org/platform/manifest.git -b eclair. repo sync command will take some time. After that we need to setup the java 1.5 environment. For that first we need to download and install the java JDK  from sun site. Install in any of the local directory by just running java_ee_sdk-5_01-linux.bin directly. Now we need to export the JAVA path. 

export JAVA_HOME=/media/_workspace/android/java_cc
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jdk/bin:$JAVA_HOME/jdk/jre/bin:$PATH
export ANDROID_JAVA_HOME=$JAVA_HOME

Now issue the following commands
./build/envsetup.sh
make -jN where N is the number of physical cpu cores in the system. For my case N was 2.

After completion of compilation you can execute following commands to run the emulator.
cd ./out/host/linux-x86/bin/
./emulator -system /media/_workspace/android/out/target/product/generic/ -kernel /media/_workspace/android/prebuilt/android-arm/kernel/kernel-qemu -data /media/_workspace/android/out/target/product/generic/userdata.img

Now you can see something like below.








Friday, January 29, 2010

Minor Page faults and dynamic memory allocation in Linux

There is a very good paper written by Philip Ezolt, this paper takes the reader for a step by step analysis of a problem where malloc is causing large number of minor page faults. If you want to understand what is written here, you must read that paper.

In that author is using following code snippet to reproduce the issue.

int main(int argc, char *argv[])
{
    char *ptr;
    int count;
   
    count = atoi(argv[1]) * 1024;

    while(1)
    {
        ptr = malloc(count);
        free(ptr);
    }
    return 0;
}

You run the application like ./test_malloc 128 where 128KB is the M_MMAP_THRESHOLD, use sar(System Activity Report tool) like sar -B 1 1000, where 1 is the time interval and 1000 is the count.  But when you see the sar output you won't see much changes in the minor page faults count, surprised? Don't worry, this is because the current glibc's malloc implementation changed a lot from 2001. See Wolfram Gloger's page and Arjan's patch.

So how we can reproduce the issue again, see the code snippet below

int main(int argc, char *argv[])
{
    char *ptr;
    int count;
   
    count = atoi(argv[1]) * 1024;

    while(1)
    {
        ptr = malloc(count);
        free(ptr);
        count += (100*1024);
    }
    return 0;
}

where we increment the chunk size by 100KB, this way we can reproduce the minor page faults issue.

This is because current glibc defines M_MMAP_THRESHOLD as a dynamically varying quantity between  DEFAULT_MMAP_THRESHOLD_MIN(128KB) and DEFAULT_MMAP_THRESHOLD_MAX(32MB) in 32 bit systems. This simple algorithm wants to achieve two things.

1) Create minimum fragmentation
2) use mmap() for long lived allocations
3) temporary allocations should use brk().

Major difference between brk() and mmap() is that, the chunk got using mmap() cann't be recycled i.e. it cann't be used for future allocations, you must give back to the system using munmap(), while chunk allocated using brk() can be used for future allocations. mmap() allocations are quite expensive. 

Now coming back to the algorithm, we have two main things, long lived ones and temporary ones, when we allocate a chunk greater than M_MMAP_THRESHOLD for the first time we will consider it as long lived ones and will use mmap() for allocating that chunk.  If we free this chunk immediately then glibc will update the M_MMAP_THRESHOLD and will consider future allocations of that chunk size as temporary ones and will use brk() for allocating it. Once it is allocated using brk(), that chunk can be recycled.This is for chunk sizes between between DEFAULT_MMAP_THRESHOLD_MIN and DEFAULT_MMAP_THRESHOLD_MAX. If chunk size is greater than DEFAULT_MMAP_THRESHOLD_MAX then glibc will always consider it as long lived ones and will use mmap() for allocating the chunk.

We can use mallopt() call to tune M_MMAP_THRESHOLD, DEFAULT_MMAP_THRESHOLD_MIN and DEFAULT_MMAP_THRESHOLD_MAX. If the application set the M_MMAP_THRESHOLD value explicitly then dynamic behaviour of M_MMAP_THRESHOLD will be switched off.