Saturday, January 30, 2010

Buffered IO and minor page faults

I had a problem in which my application is creating lots of minor page faults in the range of multiple ten thousands, I initially thought it is created by malloc() calls as mentioned by ezolt's paper, by further analysis showed that that problem is fixed in the glibc. I was sure that mmap will be a source of minor page faults as it is adding a free page to the process page table. So I did a strace of the application, that showed me that one mmap() call is coming after every open() call, by looking at the source code, I figured out that buffered IO calls like fopen, fread are creating this problem.

open("/root/dmesg.txt", O_RDONLY)       = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=23818, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2afbfba25000
read(3, "Linux version 2.6.18-120.el5 (bre"..., 4096) = 4096
read(3, "dle threads.\nCPU: Physical Proces"..., 4096) = 4096
read(3, "device 0000:00:1c.0 to 64\nPCI: Se"..., 4096) = 4096
read(3, "SB hub found\nhub 1-0:1.0: 8 ports"..., 4096) = 4096
read(3, "t 4\nusb-storage: waiting for devi"..., 4096) = 4096
read(3, "20\nACPI: PCI Interrupt 0000:01:04"..., 4096) = 3338
read(3, ""..., 4096)                    = 0
close(3)         

Now what is buffered IO? it is a buffered layer implemented on top of Direct IO  calls like open, read etc. Glibc library will have a buffer of typical size 4096bytes(i.e. size of page 4K), every read and write by the application is served from this buffer. If you want to know more about buffered IO read the books The C Standard Library and Unix File Systems.

The mmap() call is used by the glibc to create buffer, it is clearly mapping 4K page buffer. I really don't know why this is done this way, I asked in the glibc mailing list but nobody responded, see the post.

So how we will avoid minor page faults created by buffered IO calls? Luckily glibc provides another function called setvbuf(), by which application can provide it is own buffer.  If we provide our own buffer then glibc will not allocate buffer by using mmap(). So using setvbuf() avoids minor page faults and also improves the program performance.

10000 loops

[root@mysys testnew]# time ./read_nobuffer       --> direct IO
real    0m0.922s
user    0m0.210s
sys     0m0.711s

[root@mysys testnew]# time ./read_bufmmap     --> buffered IO with library mmap
real    0m0.321s
user    0m0.106s
sys     0m0.215s

[root@mysys testnew]# time ./read_bufsetvbuf    --> buffered IO with
user provided buffer,setvbuf()
real    0m0.178s
user    0m0.071s
sys     0m0.106s

[root@mysys testnew]#

 Minor Page Faults (see under faults/s)

[root@mysys ~]# sar -B 1 10000
Linux 2.6.18-120.el5 (mysys)     12/16/2009

06:17:43 PM  pgpgin/s pgpgout/s   fault/s  majflt/s
06:17:44 PM      0.00      0.00     51.00      0.00
06:17:45 PM      0.00      0.00     24.00      0.00
06:17:46 PM      0.00      0.00     12.00      0.00
06:17:47 PM      0.00      0.00     12.00      0.00
06:17:48 PM      0.00     31.68     11.88      0.00
06:17:49 PM      0.00      4.04     12.12      0.00
06:17:50 PM      0.00      0.00     12.00      0.00
06:17:51 PM      0.00      0.00     14.00      0.00
06:17:52 PM      0.00      0.00     12.00      0.00
06:17:53 PM      0.00      0.00    163.00      0.00   ---> direct IO
06:17:54 PM      0.00      0.00     36.00      0.00
06:17:55 PM      0.00      0.00     14.00      0.00
06:17:56 PM      0.00      0.00  10213.00      0.00  ---> buffered IO with library mmap
06:17:57 PM      0.00      0.00     13.27      0.00
06:17:58 PM      0.00     28.57    217.35      0.00  ---> buffered IO with user provided buffer,setvbuf()
06:17:59 PM      0.00      0.00     12.24      0.00
06:18:00 PM      0.00      0.00     12.12      0.00
06:18:01 PM      0.00      0.00     12.24      0.00
 
Source code for programs I used is pasted below.

[root@mysys testnew]# cat read_nobuffer.c
int main(void)
{
       char buffer[256];
       int i = 0;

       int fp;
       while(i++ < 10000)
       {
       if(!(fp = open("/root/dmesg.txt", O_RDONLY)))
               return 0;
       while(read(fp, buffer, 256));
       close(fp);
       }
       return 0;
}

[root@mysys testnew]# cat read_bufmmap.c
int main(void)
{
       char buffer[256];
       int i = 0;

       FILE *fp;
       while(i++ < 10000)
       {
       if((fp = fopen("/root/dmesg.txt", "r")) == NULL)
               return 0;
       while(!feof(fp))
               fread(buffer, 256, 1, fp);
       fclose(fp);
       }
       return 0;
}

[root@mysys testnew]# cat read_bufsetvbuf.c
char buffer123[8192];
int main(void)
{
       char buffer[256];
       int i = 0;

       FILE *fp;
       while(i++ < 10000)
       {
       if((fp = fopen("/root/dmesg.txt", "r")) == NULL)
               return 0;
       setvbuf(fp, buffer123, _IOFBF,  4096);
       while(!feof(fp))
               fread(buffer, 256, 1, fp);
       fclose(fp);
       }
       return 0;
}

Building Android Emulator on SMP Fedora 12

I was trying to build Android Eclair Emulator on a Fedora 12 SMP system. By making use of SMP support in make utility I was able to build the emulator much faster.

See the cpu utilizations graphs below. When we use of make without any SMP options, cpu utilization graph will come as below.

 
where one cpu is utilized to the maxium and other is not much used. But if you provide SMP options to make, then cpu utilization graph will change like what is below.

Now both the cpu's are utilized to the maximum. Now I will explain the steps and some tricks involved in compiling Eclair branch of Android on system with non standard java installation. Non standard java installation means, you have a system where java version is not 1.5, like my system where installed java version is 1.6 

[nmathew@Devel Desktop]$ java -version
java version "1.6.0_0"
OpenJDK Runtime Environment (IcedTea6 1.6) (fedora-33.b16.fc12-i386)
OpenJDK Client VM (build 14.0-b16, mixed mode)

Android requires java 1.5 i.e. because dalvik currently understands only the code generated by java 1.5, I think Android developers are adding support for Java 1.6.
Now we will start with it. You can follow the steps mentioned here. In order to sync eclair branch you must use repo init -u git://android.git.kernel.org/platform/manifest.git -b eclair. repo sync command will take some time. After that we need to setup the java 1.5 environment. For that first we need to download and install the java JDK  from sun site. Install in any of the local directory by just running java_ee_sdk-5_01-linux.bin directly. Now we need to export the JAVA path. 

export JAVA_HOME=/media/_workspace/android/java_cc
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jdk/bin:$JAVA_HOME/jdk/jre/bin:$PATH
export ANDROID_JAVA_HOME=$JAVA_HOME

Now issue the following commands
./build/envsetup.sh
make -jN where N is the number of physical cpu cores in the system. For my case N was 2.

After completion of compilation you can execute following commands to run the emulator.
cd ./out/host/linux-x86/bin/
./emulator -system /media/_workspace/android/out/target/product/generic/ -kernel /media/_workspace/android/prebuilt/android-arm/kernel/kernel-qemu -data /media/_workspace/android/out/target/product/generic/userdata.img

Now you can see something like below.








Friday, January 29, 2010

Minor Page faults and dynamic memory allocation in Linux

There is a very good paper written by Philip Ezolt, this paper takes the reader for a step by step analysis of a problem where malloc is causing large number of minor page faults. If you want to understand what is written here, you must read that paper.

In that author is using following code snippet to reproduce the issue.

int main(int argc, char *argv[])
{
    char *ptr;
    int count;
   
    count = atoi(argv[1]) * 1024;

    while(1)
    {
        ptr = malloc(count);
        free(ptr);
    }
    return 0;
}

You run the application like ./test_malloc 128 where 128KB is the M_MMAP_THRESHOLD, use sar(System Activity Report tool) like sar -B 1 1000, where 1 is the time interval and 1000 is the count.  But when you see the sar output you won't see much changes in the minor page faults count, surprised? Don't worry, this is because the current glibc's malloc implementation changed a lot from 2001. See Wolfram Gloger's page and Arjan's patch.

So how we can reproduce the issue again, see the code snippet below

int main(int argc, char *argv[])
{
    char *ptr;
    int count;
   
    count = atoi(argv[1]) * 1024;

    while(1)
    {
        ptr = malloc(count);
        free(ptr);
        count += (100*1024);
    }
    return 0;
}

where we increment the chunk size by 100KB, this way we can reproduce the minor page faults issue.

This is because current glibc defines M_MMAP_THRESHOLD as a dynamically varying quantity between  DEFAULT_MMAP_THRESHOLD_MIN(128KB) and DEFAULT_MMAP_THRESHOLD_MAX(32MB) in 32 bit systems. This simple algorithm wants to achieve two things.

1) Create minimum fragmentation
2) use mmap() for long lived allocations
3) temporary allocations should use brk().

Major difference between brk() and mmap() is that, the chunk got using mmap() cann't be recycled i.e. it cann't be used for future allocations, you must give back to the system using munmap(), while chunk allocated using brk() can be used for future allocations. mmap() allocations are quite expensive. 

Now coming back to the algorithm, we have two main things, long lived ones and temporary ones, when we allocate a chunk greater than M_MMAP_THRESHOLD for the first time we will consider it as long lived ones and will use mmap() for allocating that chunk.  If we free this chunk immediately then glibc will update the M_MMAP_THRESHOLD and will consider future allocations of that chunk size as temporary ones and will use brk() for allocating it. Once it is allocated using brk(), that chunk can be recycled.This is for chunk sizes between between DEFAULT_MMAP_THRESHOLD_MIN and DEFAULT_MMAP_THRESHOLD_MAX. If chunk size is greater than DEFAULT_MMAP_THRESHOLD_MAX then glibc will always consider it as long lived ones and will use mmap() for allocating the chunk.

We can use mallopt() call to tune M_MMAP_THRESHOLD, DEFAULT_MMAP_THRESHOLD_MIN and DEFAULT_MMAP_THRESHOLD_MAX. If the application set the M_MMAP_THRESHOLD value explicitly then dynamic behaviour of M_MMAP_THRESHOLD will be switched off.

Saturday, December 5, 2009

Porting 32 bit appications to 64 bit in Linux

We when talk about 64bit we have two platforms amd64 and IA64. Here we are discussing only about amd64 , well known as x86_64. amd64 is the 64 bit extension of x86 developed initially by AMD, and now intel also produces it. IA64 is also known as Itanium architecture, is an entirely new architecture developed jointly by HP and Intel, theses are the processors used in HP's Integrity range of server.

In x86_64 Linux systems LP64 is the data model used and ILP32 is the model used in x86 systems. The following table will give you the difference,

Datatype
LP64
ILP64
LLP64
ILP32
char
8
8
8
8
short
16
16
16
16
int
32
64
32
32
long
64
64
32
32
long long
64
64
64
64
pointer
64
64
64
32

In 32bit and 64bit Linux systems, we are concerned only with ILP32 and LP64. long and pointer are the only difference between the two data models, so when we port applications from 32bit to 64 bit we must be concerned about part of the code which uses  long and pointer data types. If we can uses int data type instead of long then it is very easy to port applications from 32 bit to 64 bit.

Now we covered the higher level languages part, let's move on to assembly language.


The above table is will guide you to write instructions for GAS (GNU Assembler),  say push rax(64bit extension of eax register in x86) register, then the instruction is pushq %rax.

The register sets on a x86_64 platform is given below.




In x86_64, sixteen general purpose registers are available, i.e. rax, rbx, rcx, rdx, rdi, rsi, rbp, rsp and 8 additional registers r8-r15.

In x86 architecture we have segmentation, so we have difference between far call and near call.



Even though we have segmentation, we are using paging on top of flat memory model in x86 linux. By seeing this, in x86_64 we don't have segmentation, so linux uses only paging in linux. Since segmentation is not available in x86_64 platform, far call and near call are same, that means no need to push the CS segment register into the stack. One more thing to see is alignment in x86_64 platform, read more on alignment here.

x86_64 has different method of passing arguments during function calls, x86_64 uses mainly GP registers for passing arguments, in x86 we use stack for passing arguments.



x86_64 application binary interface is here. Best reference is AMD64 architecture Programmer's Manuals and Intel architecture system Programmer's Manuals.

Thursday, November 26, 2009

Blogging in First

I am starting this technical blog, hoping I can share something with people out there.