Technology and Information

DNS impact on Informix / Impacto do DNS no Informix

You decided...

This article if the first "on-demand" topic I write about. I had a few options for topics I'd like to cover and. The impact of DNS on Informix was the most voted. So you asked for it, here it is. I will probably keep doing this from now on.
Personally I also like this topic, partially because I had several problems related to this in more than one customer. This is a relatively common issue in complex environments.
I have a public and generic disclaimer about the material I publish here, but for this case I'd like to stretch this just a bit... Beside the normal disclaimer, I'd like to state that most of the information presented here lies a bit far from my regular competencies. This is the result of a lot of digging and investigation over the time (not only from me), and there may be a few (hopefully minor) errors on the information provided. Feel free to comment, email me etc.

Very short introduction about DNS

DNS is the acronym for Domain Name System which is a protocol/service that is able to convert a hostname (e.g. onlinedomus.com) into an IP address (e.g. 89.1.2.3) and the opposite. It can also do a lot of other stuff, like telling which mail server is responsible for a specific domain etc. but this is out of the scope of the article.
Without DNS there would be no Internet as we know it. It's a critical component of the Internet infra-structure and it's performance and security is crucial for us, the users.
In a small network you can work without DNS for the basic name resolution functions, by using files. But if your network is larger, it can be very hard to use those files.
The DNS system uses an hierarchical architecture and the UDP protocol (U does not stand for unreliable but it could be...) for performance reasons. Proper configuration of a DNS system can be a complex task, and my personal experience tells me it's not very easy to find who knows how to do it properly. Furthermore, many times people don't realize the terrible impact a DNS misconfiguration or malfunction may have on the systems.
I will not explain (and I wouldn't know how to) all the DNS configuration aspects, but I'd like to reference a few points:

/etc/nsswitch.conf
This file (on Unix/Linux systems, but the name can vary) defines how the name resolution (and other services) are used. In particular it can define if the system uses files, NIS, the DNS servers or other mechanism and the order it uses. As an example, a line like:

hosts: dns files

indicates that for hostname lookups the system will first ask the DNS servers and then looks in the files
/etc/hosts
This file can map IP addresses into hostnames (and vice-versa). As an example:

89.1.2.3 www.onlinedomus.com onlinedomus.com

This tells the system that the IP address 89.1.2.3 will map to "www.onlinedomus.com" (and vice-versa). As you can imagine, a lookup for "onlinedomus.com" will also map to the same IP address.
/etc/resolv.conf
This contains the list of DNS servers that will be used for lookups and possibly a few other options (like requests timeout, names of domains that will be appended to simple hostnames, if the lookups for those hostnames fail etc.). An example:

nameserver 192.168.112.2
nameserver 9.64.162.21

How does Informix use the DNS?

From the DNS perspective, Informix is just another application. The /etc/nsswitch.conf file will tell if Informix will use the files (/etc/hosts) or the DNS servers (specified in /etc/resolv.conf). The first important thing to note is that all interaction between Informix and the DNS system goes through system calls. In particular these two functions or their equivalents or replacements:

gethostbyname()
In short, this receives an hostname and returns a structure containing the IP address
gethostbyaddr()
This receives an IP address and returns the hostname that matches it

So, if something is not working, we need to understand how/when Informix calls these two functions and how they work. Typically customers blame Informix when it's not to blame (not completely, but more on this later). Most (if not all) problems I've seen in DNS affecting Informix can affect other applications and are reproducible with very short C language programs. This does not mean we couldn't do a few things differently (better?). But again, I'll cover this later (keep in mind the disclaimer! :) )
This article is written mainly from the database server perspective. But the DNS has obvious implications for the client too... Let's start there and then I'll jump into the server.
When a client tool tries to connect to an Informix server, it starts by looking up the $INFORMIXSERVER (or equivalent given in the connection string) in the $INFORMIXSQLHOSTS file (for Java it can look up LDAP or HTTP servers for the info, but let's stick to the files for easier understanding). The file contains lines in the following format:

INFORMIXSERVER PROTOCOL HOSTNAME/IP_ADDRESS PORT_NUMBER/SERVICE_NAME OPTIONS

when the client libraries find the line matching the INFORMIXSERVER, they check the hostname (or IP address) and the port number (or service name).
Typically, if a service name is used, we look the port number in /etc/services (this can be configured in /etc/nsswitch.conf). Personally I tend to use the port number to avoid that lookup...
Then, if a hostname is used, the client must map it to an IP address. For that it calls the gethostbyname() function. This function will behave as specified in /etc/nsswitch.conf, and will try to map the name to an IP address. A failure to do that will raise error -930. This can be reproduced:

cheetah@pacman.onlinedomus.com:fnunes-> echo $INFORMIXSERVER; grep $INFORMIXSERVER $INFORMIXSQLHOSTS; dbaccess sysmaster -
blogtest
blogtest     onsoctcp     nowhere.onlinedomus.com 1500

930: Cannot connect to database server (nowhere.onlinedomus.com).
cheetah@pacman.onlinedomus.com:fnunes->

and if you need evidences of what's going on behind the scenes we can use strace (or truss):

strace -o /tmp/strace.out dbaccess sysmaster -

This is an edited extract of /tmp/strace.out generated by the command above. If you have the patience, you can see it doing the following:

Open /etc/nsswitch.conf
Open $INFORMIXSQLHOSTS (/home/informix/etc/sqlhosts)
Open /etc/services (exceptionally I used a name instead of a port number)
Open /etc/resolv.conf to find out the configured nameservers
Open a socket to 192.168.112.2 (my configured DNS server)
Ask for nowhere.onlinedomus.com
Open /etc/hosts (in /etc/nsswtich.conf I configured to search the files if the DNS lookup fails)
Read the error message from the Informix message files
Write the error message to stderr
Exit with error code -1

cheetah@pacman.onlinedomus.com:fnunes-> cat /tmp/strace.out
[...]
open("/etc/nsswitch.conf", O_RDONLY)    = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=1803, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7862000
read(3, "#\n# /etc/nsswitch.conf\n#\n# An ex"..., 4096) = 1803
read(3, "", 4096)                       = 0
close(3)                                = 0
[...]
open("/home/informix/etc/sqlhosts", O_RDONLY|O_LARGEFILE) = 4
_llseek(4, 0, [0], SEEK_SET)            = 0
read(4, "blogtest     onsoctcp     nowher"..., 4096) = 1389
[...]
open("/etc/services", O_RDONLY|O_CLOEXEC) = 4
fstat64(4, {st_mode=S_IFREG|0644, st_size=644327, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7862000
read(4, "# /etc/services:\n# $Id: services"..., 4096) = 4096
close(4)                                = 0
[...]
open("/etc/resolv.conf", O_RDONLY)      = 4
[...]
read(4, "", 4096)                       = 0
close(4)                                = 0
[...]
open("/lib/libresolv.so.2", O_RDONLY)   = 4
read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0Pf\256\0004\0\0\0"..., 512) = 512
[...]
close(4)                                = 0
[...]
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4
connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.112.2")}, 16) = 0
gettimeofday({1325590403, 502576}, NULL) = 0
poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}])
send(4, "tl\1\0\0\1\0\0\0\0\0\0\7nowhere\vonlinedomus"..., 41, MSG_NOSIGNAL) = 41
poll([{fd=4, events=POLLIN}], 1, 5000)  = 1 ([{fd=4, revents=POLLIN}])
ioctl(4, FIONREAD, [101])               = 0
recvfrom(4, "tl\201\203\0\1\0\0\0\1\0\0\7nowhere\vonlinedomus"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.112.2")}, [16]) = 101
close(4)                                = 0
[...]
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 4
[...]
read(4, "127.0.0.1\tpacman1.onlinedomus.ne"..., 4096) = 439
[...]
close(4)                                = 0
[...]
read(3, "Cannot connect to database serve"..., 40) = 40
write(2, "\n", 1)                       = 1
write(2, "  930: Cannot connect to databas"..., 68) = 68
exit_group(-1)                          = ?
cheetah@pacman.onlinedomus.com:fnunes->

Now, we should move to the Informix server side. This requires a bit more work and preliminary explanations. To start, we must understand what the engine needs in order to establish the connection. One of those things is to do a reverse name lookup (IP address to hostname). This is not essential, but it's always tried. Informix may need the hostname for trust relation validation and to provide information to the DBA.

As you know, the Informix database engine comprises several operating system processes. From the OS perspective they all look the same (oninit), but every one has a specific role and runs certain engine threads. We can see the threads with:

panther@pacman.onlinedomus.com:fnunes-> onstat -g ath

IBM Informix Dynamic Server Version 11.70.UC4 -- On-Line -- Up 00:17:01 -- 411500 Kbytes

Threads:
tid     tcb      rstcb    prty status                vp-class       name
2       5583fa38 0        1    IO Idle                 3lio*        lio vp 0
3       558551f8 0        1    IO Idle                 4pio*        pio vp 0
4       5586b1f8 0        1    IO Idle                 5aio*        aio vp 0
5       558811f8 8f59dc0  1    IO Idle                 6msc*        msc vp 0
6       558af1f8 0        1    IO Idle                 7fifo*       fifo vp 0
7       558c9590 0        1    IO Idle                 9aio*        aio vp 1
8       558df3b8 54267018 3    sleeping secs: 1        8cpu         main_loop()
9       559276f8 0        1    running                10soc*        soctcppoll
10      5593ed18 0        2    sleeping forever        1cpu*        soctcplst
11      55927d20 542675fc 1    sleeping secs: 1        8cpu         flush_sub(0)
12      55988018 54267be0 1    sleeping secs: 1        8cpu         flush_sub(1)
13      559881f0 542681c4 1    sleeping secs: 1        8cpu         flush_sub(2)
14      559883c8 542687a8 1    sleeping secs: 1        8cpu         flush_sub(3)
15      559885a0 54268d8c 1    sleeping secs: 1        8cpu         flush_sub(4)
16      55988778 54269370 1    sleeping secs: 1        8cpu         flush_sub(5)
17      55988bf0 54269954 1    sleeping secs: 1        8cpu         flush_sub(6)
18      559fb468 54269f38 1    sleeping secs: 1        8cpu         flush_sub(7)
19      559fb640 0        3    IO Idle                 8cpu*        kaio
20      55ab6018 5426a51c 2    sleeping secs: 1        8cpu         aslogflush
21      55ab6960 5426ab00 1    sleeping secs: 92       1cpu         btscanner_0
22      55b6a408 5426b0e4 3    cond wait  ReadAhead    1cpu         readahead_0
39      55bcd5c8 0        3    IO Idle                 1cpu*        kaio
40      55bcd7a0 5426bcac 3    sleeping secs: 1        1cpu*        onmode_mon
41      55d3e148 5426c874 3    sleeping secs: 1        8cpu         periodic
49      55e80a78 5426da20 1    sleeping secs: 177      1cpu         dbScheduler
51      55f340f8 5426d43c 1    sleeping forever        1cpu         dbWorker1
52      55f34d80 5426ce58 1    sleeping forever        8cpu         dbWorker2
59      562ee228 5426e5e8 1    cond wait  bp_cond      1cpu         bf_priosweep()

And the OS processes with:

panther@pacman.onlinedomus.com:fnunes-> onstat -g glo

IBM Informix Dynamic Server Version 11.70.UC4 -- On-Line -- Up 00:18:48 -- 411500 Kbytes

MT global info:
sessions threads  vps      lngspins
0        29       10       3       

sched calls     thread switches yield 0   yield n   yield forever
total:    9589515         8992470         597961    14485     4457836  
per sec:  0               0               0         0         0        

Virtual processor summary:
class       vps       usercpu   syscpu    total   
cpu         2         11.51     94.06     105.57  
aio         2         3.57      75.44     79.01   
lio         1         0.01      0.01      0.02    
pio         1         0.00      0.01      0.01    
adm         1         0.01      0.15      0.16    
soc         1         0.04      0.15      0.19    
msc         1         0.00      0.01      0.01    
fifo        1         0.00      0.01      0.01    
total       10        15.14     169.84    184.98  

Individual virtual processors:
vp    pid       class       usercpu   syscpu    total     Thread    Eff  
1     29395     cpu         5.63      46.80     52.43     66.41     78%
2     29398     adm         0.01      0.15      0.16      0.00       0%
3     29399     lio         0.01      0.01      0.02      0.02     100%
4     29400     pio         0.00      0.01      0.01      0.01     100%
5     29401     aio         3.29      74.30     77.59     77.59    100%
6     29402     msc         0.00      0.01      0.01      0.03      31%
7     29403     fifo        0.00      0.01      0.01      0.01     100%
8     29404     cpu         5.88      47.26     53.14     64.45     82%
9     29405     aio         0.28      1.14      1.42      1.42     100%
10    29406     soc         0.04      0.15      0.19      NA         NA
tot         15.14     169.84    184.98

The threads that are listening on the engine TCP ports are the poll threads (soctcppoll) running on SOC class (this depends on the NETTYPE parameter). When a new request is received by them they call the listener threads (soctcplst) running on the cpu class to initiate the authentication process. Parts of this task are run by the MSC virtual processor. As we can see in the last output this has the PID 29402. So, in order to see what happens I'll trace that OS process. For reasons that I'll explain later, I will turn off the NS_CACHE feature (Informix 11.7) and I will restart the engine. So, for the first connection attempt we get (some parts cut off):

1      0.000000 semop(753664, {{5, -1, 0}}, 1) = 0
2      7.009868 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
3      0.000107 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
4      0.000242 close(3)                  = 0
5      0.000060 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
6      0.000063 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
7      0.000095 close(3)                  = 0
8      [...]
9      0.000000 open("/etc/resolv.conf", O_RDONLY) = 3
10     0.000000 fstat64(3, {st_mode=S_IFREG|0644, st_size=55, ...}) = 0
11     0.000000 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xab4000
12     0.000000 read(3, "# Generated by NetworkManager\nna"..., 4096) = 55
13     0.000926 read(3, "", 4096)         = 0
14     0.000050 close(3)                  = 0
15     [...]
16     0.000057 futex(0x29ab44, FUTEX_WAKE_PRIVATE, 2147483647) = 0
17     0.000256 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
18     0.000089 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.112.2")}, 16) = 0
19     0.000107 gettimeofday({1325605320, 167025}, NULL) = 0
20     0.000072 poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
21     0.000083 send(3, "\363\337\1\0\0\1\0\0\0\0\0\0\0011\003112\003168\003192\7in-ad"..., 44, MSG_NOSIGNAL) = 44
22     0.000322 poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
23     2.061369 ioctl(3, FIONREAD, [121]) = 0
24     0.000111 recvfrom(3, "\363\337\201\203\0\1\0\0\0\1\0\0\0011\003112\003168\003192\7in-ad"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.112.2")}, [16]) = 121
25     0.000155 close(3)                  = 0
26     0.000090 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
27     0.000377 fstat64(3, {st_mode=S_IFREG|0644, st_size=439, ...}) = 0
28     0.000089 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf22000
29     0.000057 read(3, "127.0.0.1\tpacman1.onlinedomus.ne"..., 4096) = 439
30     0.000130 close(3)                  = 0
31     [...]
32     0.000072 semop(753664, {{7, 1, 0}}, 1) = 0
33     0.000069 semop(753664, {{7, 1, 0}}, 1) = 0
34     0.007558 semop(753664, {{5, -1, 0}}, 1

Please note that for clarity I added line numbers and time differences between each call (this will be important later). Let's explain this.

On line 1) we have a semop() which is the way MSC VP stays idle. That was before the connection attempt.
7 seconds later it tries to "talk" with nscd daemon (lines 2-8). This is kind of a Linux specific mechanism that I'm not running. Then accesses the /etc/nsswitch.conf. Just keep in memory that it did this on the first attempt
Then it accesses /etc/resolv.conf (lines 9-15) and finds the nameserver address
On lines 16-25 it talks to the DNS server (192.168.112.2) and asks for the reverse name of the connecting IP address
Since the answer is inconclusive, it goes to /etc/hosts (lines 26-31)
I have cut the remaining part which is related to the authentication (opening the /etc/passwd, /etc/group, /etc/shadow etc.).
Finally it returns to the normal idle state

A few important points about this:

It all happened pretty quick (values are in seconds)
We don't see the gethostbyaddr() call. This is not a "system call" for strace. So we see the lower level calls, but not the gethostbyaddr() function. We can catch it by attaching a debugger to the same process. This is important because usually it's hard to discuss this issues with the network and OS administrators because they tend to assume all this is done by Informix. It isn't! Informix just calls gethostbyaddr() (or equivalent fiunctions)

Now, let's see the same for the second attempt. We would expect it to be the same, but it isn't. I have cut exactly the same parts:

1      0.000000 semop(753664, {{5, -1, 0}}, 1) = 0
2      6.452154 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
3      0.000099 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.112.2")}, 16) = 0
4      0.008816 gettimeofday({1325605445, 534040}, NULL) = 0
5      0.000089 poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
6      0.000100 send(3, "\233\t\1\0\0\1\0\0\0\0\0\0\0011\003112\003168\003192\7in-ad"..., 44, MSG_NOSIGNAL) = 44
7      0.000417 poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
8      2.089726 ioctl(3, FIONREAD, [121]) = 0
9      0.000118 recvfrom(3, "\233\t\201\203\0\1\0\0\0\1\0\0\0011\003112\003168\003192\7in-ad"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.112.2")}, [16]) = 121
10     0.000132 close(3)                  = 0
11     0.000069 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
12     0.000102 fstat64(3, {st_mode=S_IFREG|0644, st_size=439, ...}) = 0
13     0.000092 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe1f000
14     0.000064 read(3, "127.0.0.1\tpacman1.onlinedomus.ne"..., 4096) = 439
15     0.000099 close(3)                  = 0
16     [...]
17     0.000068 semop(753664, {{0, 1, 0}}, 1) = 0
18     0.000096 semop(753664, {{0, 1, 0}}, 1) = 0
19     0.000076 semop(753664, {{5, -1, 0}}, 1

So, again the analysis:

The first part accessing the nscd daemon, the /etc/nsswitch.conf and the /etc/resolv.conf is completely gone.
As you can see it starts by connecting to the DNS.
Then it reads the files (/etc/hosts)
Then I cut the authentication part as before
Finally it returns to the idle state

The important point to note here is that the first attempt is different from the others. And again, Informix just calls gethostbyaddr()... just the same call each time. Well... to be correct the function we call may depend on the platform. As I mentioned earlier, only by using a debugger we can find the gethostbyaddr() call. I've done it and here is the result:

(gdb) break connect
Breakpoint 1 at 0x95f640
(gdb) continue
Continuing.

Breakpoint 1, 0x0095f640 in connect () from /lib/libpthread.so.0
(gdb) where
#0  0x0095f640 in connect () from /lib/libpthread.so.0
#1  0x00aec9ab in reopen () from /lib/libresolv.so.2
#2  0x00aee542 in __libc_res_nsend () from /lib/libresolv.so.2
#3  0x00aeb24e in __libc_res_nquery () from /lib/libresolv.so.2
#4  0x002b6dc7 in _nss_dns_gethostbyaddr2_r () from /lib/libnss_dns.so.2
#5  0x002b6f1a in _nss_dns_gethostbyaddr_r () from /lib/libnss_dns.so.2
#6  0x0020890b in gethostbyaddr_r@@GLIBC_2.1.2 () from /lib/libc.so.6
#7  0x00211f77 in getnameinfo () from /lib/libc.so.6
#8  0x08c0e664 in ifx_getipnodebyaddr ()
#9  0x08c0f79c in ifx_gethostbyaddr ()
#10 0x08c0f8a2 in __osgethostbyaddr ()
#11 0x08b0c055 in aio_workon ()
#12 0x08b0c9c3 in aiothread ()
#13 0x08b0dbcb in iothread ()
#14 0x08b00762 in startup ()
#15 0x558749e8 in ?? ()
#16 0x00000000 in ?? ()
(gdb)

As you can see I've setup a breakpoint for connect(). Then I "continue" the execution and I try the connection. gdb stops the program at the breakpoint and I get the stack trace (read bottom up).
So, it shows we call getnameinfo() which in turn calls gethostbyaddr_r() etc. All this belongs to the system libraries, not to Informix code.

There are two additional considerations we need to be aware. First, the Informix MSC VP processes it's requests in a serial manner. For each connection it asks what it needs from the DNS servers and/or files and makes the authentication. By default we only have one MSC VP... so if one request gets stuck.... yes... the following connections will suffer delays. This delays can be a fraction of second, or a few seconds, but on some systems I've seen tens (heard about hundreds) of connections per second, so even a few seconds will have large impact.
The second consideration relates to the differences between the first call and the subsequent ones. As we've seen above, on the first call the process checks the configuration (the /etc/nsswitch.conf and /etc/resolv.conf files). After that first check it does not do that anymore. And this causes a problem. Again this is the behavior of the system functions (but not necessarily the end of the story....)

So, hopefully I was able to explain how Informix interacts with the DNS system. The important technical deep dive should be over. We'll proceed to the implications. It's important you understand all the above before proceeding to the next paragraphs.

What problems can we face?

Above I've tried to show you how things work when everything is ok. But what happens when something is wrong? Let's see what can go wrong first and then the implications. I'll also try to explain who to blame (and again the disclaimer...). The purpose of course is not to finger point, but knowing where the problem lies is the first step to solve it.

Network problems prevent the connection to the DNS servers
If this happens, the requests sent by the MSC VP will have to timeout (typically a few seconds) before the OS call returns. This delay will cause all the other connection requests to stay on hold (assuming we just have one MSC VP). If the network problems persist, it really doesn't matter how many MSC VPs we have, since they'll all get stuck and all our connection attempts will suffer delays
The DNS server dies, is stopped, or is extremely slow
The effect of this is very similar to the previous. Anything that causes delays in the DNS requests will potentially cause delays in the engine connections. Note that this can be a generic issue that affects all the requests, or it can affect only some requests. Due to the hierarchical and distributed nature of the DNS system, it may be able to answer most requests, but get slow or event try to "talk" to an unavailable server for some specific names or IP addresses. Needless to say this makes the debug of these problems very difficult to do
Something goes wrong with the DSN system. The reverse lookups fail and this affects the trusted connections.
You start to see -956 errors in the online.log and -951 is sent to the client side
You need to change your DNS servers
Then you''ll need to restart Informix. This is a nasty problem caused by Informix (we could do better). As I mentioned above, the OS function calls keep the configurations in memory for efficiency reasons (it would be very "expensive" to re-read that for each call). So, the problem in how it works is that if you want to change the policy (/etc/nsswitch.conf) or the DNS servers (/etc/resolv.conf), the Informix processes will not pick up the change. I can assure you that Informix is not the only daemon that suffers with this. The first time I had a similar problem with this, I noticed that for example sendmail was also suffering... (trying to talk with the older servers)

From the above problems it's easy to understand that the three first are not Informix's fault. When a customer decides (presumably because it needs) to use a DNS infra-structure people have to understand that it automatically becomes a critical component of the system. Any problems in it will cause problems in the layers above, specifically in the Informix system.
And that leaves us with the forth problem. I wrote before and it's true that this is how the functions work (and I'll show this in action), so why do I write that this is Informix's fault? Well, because there is a way to handle this. There is another system call named res_init() that does precisely what we need. From the Linux manual, I quote:

The res_init() function reads the configuration files (see resolv.conf(5)) to get the default domain name, search order and name server address(es).
If no server is given, the local host is tried. If no domain is given, that associated with the local host is used. It can be overridden with the environment variable LOCALDOMAIN. res_init() is normally executed by the first call to one of the other functions.

So, my point is that Informix should provide a way by which the DBA would request all the MSC VPs to call this function. This would refresh the information that is cached by the first call.
In fact IBM has a pending feature request about this. I really hope this could be implemented in a future release. This is something that we may live without for years, but if you need it, it really would make a difference

Knowing if you have the problem

How do we know that we have a problem in DNS and that Informix is being impacted? Typically you'll get some sort of complain from the Informix clients. Usually something like "the database is terribly slow", or "it takes a long time to connect", or eventually you'll have connections refused with server side error -956 (not trusted). In extreme situations you can have weird errors in the online.log (-25xxx ). In all these situations you'll notice that the already existing sessions are working smoothly.
But in order to be sure you may follow two paths:

The most simple is to run a "netstat -a" command on the database server. This shows all the TCP/UDP connections to and from the machine. By default it will go through all the socket connections and will try to reverse lookup the respective IP addresses. If you're having problems you'll see that the netstat output will be very slow or at least with some "bumps". But for this simple tests to provide meaningful conclusions, you must be sure that the system is still configured to use the same configuration (/etc/nsswitch.conf and /etc/resolv.conf) that was in place when the Informix engine was started. Otherwise you'll not be comparing apples to apples
The most complex is to run a truss or strace command against the MSC VP with timings. This can show slow response from the calls to the DNS hosts. Be aware that running truss/strace requires root privileges and that even if everything is running fine, it will cause delays on systems with a large number of connects per second

I wrote earlier that I could demonstrate the facts presented here without using Informix. For that I created a simple program (line numbers included):

1  #include <sys/time.h>
2  #include <netdb.h>
3  #include <resolv.h>
4  #include <stdlib.h>
5  #include <string.h>
6  
7  int main(int argc, char **argv)
8  {
9    struct hostent *hp;
10    in_addr_t data;
11    char buff[100];
12    struct timeval ts_initial, ts_final;
13  
14    if (argc == 2) {
15      strcpy(buff,argv[1]);
16    }
17    else {
18      printf("Introduce an IP address: ");
19      if (fscanf(stdin,"%s", buff) == EOF)
20        exit(0);
21    }
22  
23    while (1 == 1) {
24      data = inet_addr(buff);
25      gettimeofday(&ts_initial, NULL);
26      hp = gethostbyaddr(&data, 4, AF_INET);
27      gettimeofday(&ts_final, NULL);
28  
29      if (hp == NULL) {
30        printf("Unknown host (%s). Took %f seconds\n", buff, (double)((ts_final.tv_sec * 1000000 + ts_final.tv_usec) - (ts_initial.tv_sec * 1000000 + ts_initial.tv_usec))/1000000);
31      }
32      else {
33        printf("Name (%s): %s Took %f seconds\n", buff, hp->h_name, (double)((ts_final.tv_sec * 1000000 + ts_final.tv_usec) - (ts_initial.tv_sec * 1000000 + ts_initial.tv_usec))/1000000);
34      }
35      printf("Next: ");
36      if (fscanf(stdin,"%s", buff) == EOF)
37        exit(0);
38      if ( strncmp("refresh", buff, 7) == 0 )
39      {
40         res_init();
41         printf("Called res_init()\n");
42         printf("Next: ");
43         if (fscanf(stdin,"%s", buff) == EOF)
44           exit(0);
45      }
46    }
47  }

This is basically a loop that reads an IP address (which are not validated, so it's easily breakable), and runs gethostbyaddr() on it. If the "ip" provided is "refresh" then it calls res_init(). It reports the information returned by the resolver subsystem and the time it took. You can use it interactively or redirect a file with one IP address per line.

I'll run it interactively with tracing in order to show the effect of calling res_init(). I have "hosts: dns,files" in /etc/nsswitch.conf and 192.168.112.2 on /etc/resolv.conf. This file also contains options do define the timeout to 2 seconds. So I call this with:

cheetah@pacman1.onlinedomus.net:fnunes-> strace -r -o test_resolv.trace ./test_resolv
Introduce an IP address: 1.1.1.1
Unknown host (1.1.1.1). Took 2.010235 seconds
Next: 1.1.1.2
Unknown host (1.1.1.2). Took 2.003324 seconds
Next: refresh
Called res_init()
Next: 1.1.1.3
Unknown host (1.1.1.3). Took 2.004002 seconds
Next: ^C

Before I introduce "refresh" I change the DNS nameserver in /etc/resolv.conf from 192.168.112.2 to 192.168.112.5.
The trace is this (line numbers and timmings added):

1        [...]
2        0.000000 write(1, "Introduce an IP address: ", 25) = 25
3        0.000000 read(0, "1.1.1.1\n", 1024) = 8
4        3.055699 gettimeofday({1325865284, 104186}, NULL) = 0
5        [...]
6        0.000160 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
7        0.000099 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ECONNREFUSED (Connection refused)
8        0.000128 close(3)                  = 0
9        0.000053 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
10       0.000123 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ECONNREFUSED (Connection refused)
11       0.000122 close(3)                  = 0
12       0.000095 open("/etc/nsswitch.conf", O_RDONLY) = 3
13       0.000108 fstat64(3, {st_mode=S_IFREG|0644, st_size=1803, ...}) = 0
14       0.000101 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7843000
15       0.000057 read(3, "#\n# /etc/nsswitch.conf\n#\n# An ex"..., 4096) = 1803
16       0.000001 read(3, "", 4096)         = 0
17       0.000000 close(3)                  = 0
18       [...]
19       0.000055 open("/etc/resolv.conf", O_RDONLY) = 3
20       0.000072 fstat64(3, {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
21       0.000095 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7843000
22       0.000050 read(3, "# Generated by NetworkManager\nna"..., 4096) = 118
23       0.000086 read(3, "", 4096)         = 0
24       0.000046 close(3)                  = 0
25       [...]
26       0.000173 open("/etc/host.conf", O_RDONLY) = 3
27       0.000068 fstat64(3, {st_mode=S_IFREG|0644, st_size=26, ...}) = 0
28       0.000081 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7843000
29       0.000179 read(3, "multi on\norder hosts,bind\n", 4096) = 26
30       0.000083 read(3, "", 4096)         = 0
31       0.000048 close(3)                  = 0
32       0.000049 munmap(0xb7843000, 4096)  = 0
33       0.000160 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
34       0.000075 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.112.2")}, 16) = 0
35       0.000640 gettimeofday({1325865284, 108850}, NULL) = 0
36       0.000062 poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
37       0.000094 send(3, "r\363\1\0\0\1\0\0\0\0\0\0\0011\0011\0011\0011\7in-addr\4arp"..., 38, MSG_NOSIGNAL) = 38
38       0.000889 poll([{fd=3, events=POLLIN}], 1, 2000) = 0 (Timeout)
39       2.003373 close(3)                  = 0
40       0.000109 open("/etc/ld.so.cache", O_RDONLY) = 3
41       0.000078 fstat64(3, {st_mode=S_IFREG|0644, st_size=72238, ...}) = 0
42       0.000093 mmap2(NULL, 72238, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7821000
43       0.000054 close(3)                  = 0
44       [...]
45       0.000105 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
46       0.000097 fcntl64(3, F_GETFD)       = 0x1 (flags FD_CLOEXEC)
47       0.000065 fstat64(3, {st_mode=S_IFREG|0644, st_size=438, ...}) = 0
48       0.000053 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7843000
49       0.000035 read(3, "127.0.0.1\tpacman.onlinedomus.net"..., 4096) = 438
50       0.000137 read(3, "", 4096)         = 0
51       0.000130 close(3)                  = 0
52       [...]
53       0.000101 write(1, "Unknown host (1.1.1.1). Took 2.0"..., 46) = 46
54       0.000071 write(1, "Next: ", 6)     = 6
55       0.000267 read(0, "1.1.1.2\n", 1024) = 8
56       0.000071 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
57       0.000077 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.112.2")}, 16) = 0
58       0.000049 poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
59       0.000063 send(3, ":\243\1\0\0\1\0\0\0\0\0\0\0012\0011\0011\0011\7in-addr\4arp"..., 38, MSG_NOSIGNAL) = 38
60       0.000152 poll([{fd=3, events=POLLIN}], 1, 2000) = 0 (Timeout)
61       2.002550 close(3)                  = 0
62       0.000088 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
63       0.000077 fstat64(3, {st_mode=S_IFREG|0644, st_size=438, ...}) = 0
64       0.000092 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7843000
65       0.000057 read(3, "127.0.0.1\tpacman.onlinedomus.net"..., 4096) = 438
66       0.000110 read(3, "", 4096)         = 0
67       0.000049 close(3)                  = 0
68       [...]
69       0.000060 write(1, "Unknown host (1.1.1.2). Took 2.0"..., 46) = 46
70       0.000076 write(1, "Next: ", 6)     = 6
71       0.000253 read(0, "refresh\n", 1024) = 8
72      17.639011 open("/etc/resolv.conf", O_RDONLY) = 3
73       0.000088 fstat64(3, {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
74       0.000087 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7843000
75       0.000052 read(3, "# Generated by NetworkManager\n#n"..., 4096) = 118
76       0.000108 read(3, "", 4096)         = 0
77       0.000047 close(3)                  = 0
78       0.000048 munmap(0xb7843000, 4096)  = 0
79       0.000065 write(1, "Called res_init()\n", 18) = 18
80       0.000060 write(1, "Next: ", 6)     = 6
81       0.000051 read(0, "1.1.1.3\n", 1024) = 8
82       3.595382 gettimeofday({1325865312, 174933}, NULL) = 0
83       0.000075 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
84       0.000078 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.112.5")}, 16) = 0
85       0.000266 gettimeofday({1325865312, 175350}, NULL) = 0
86       0.000052 poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
87       0.000069 send(3, "\321\234\1\0\0\1\0\0\0\0\0\0\0013\0011\0011\0011\7in-addr\4arp"..., 38, MSG_NOSIGNAL) = 38
88       0.000085 poll([{fd=3, events=POLLIN}], 1, 2000) = 0 (Timeout)
89       2.002271 close(3)                  = 0
90       0.000081 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
91       0.000076 fstat64(3, {st_mode=S_IFREG|0644, st_size=438, ...}) = 0
92       0.000087 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7843000
93       0.000054 read(3, "127.0.0.1\tpacman.onlinedomus.net"..., 4096) = 438
94       0.000091 read(3, "", 4096)         = 0
95       0.000047 close(3)                  = 0
96       0.000048 munmap(0xb7843000, 4096)  = 0
97       0.000062 gettimeofday({1325865314, 178373}, NULL) = 0
98       0.000058 write(1, "Unknown host (1.1.1.3). Took 2.0"..., 46) = 46
99       0.000072 write(1, "Next: ", 6)     = 6
100      0.000053 read(0, 0xb7844000, 1024) = ? ERESTARTSYS (To be restarted)
101      0.918564 --- SIGINT (Interrupt) @ 0 (0) ---
102      0.000931 +++ killed by SIGINT +++

And the explanation:

Lines 1-5 the program starts and asks for the IP address. 1.1.1.1 is provided
Lines 6-18 it tries to contact the nscd (Linux caching daemon) and then opens and reads /etc/nsswitch.conf
Lines 19-31 opens the other two configuration files (/etc/resolv.conf and /etc/host.conf)
Lines 33-44 contacts the DNS server on 192.168.112.2 (timeout = 2s)
Lines 45-52 reads /etc/hosts and prints the result (Unknown host)
Lines 53-68 is just the same, but doesn't read the config files since it's not the first time
Lines 69-79 I insert "refresh" and it calls res_init() and re-reads /etc/resolv.conf. Meanwhile, just before that I changed the /etc/resolv.conf and put 192.168.112.5
Lines 80-99 I insert another IP address (1.1.1.3) and it goes to the new DNS server (192.168.112.5). Failing to get a proper answer re-reads the /etc/hosts.
Lines 100-102 it was asking for another IP address and I pressed Control+C

Hacking it just for fun!

Please don't try this at home!.. Really, the following is something risky, not supported, and for learning purposes only. I argued above that calling res_init() would allow you to change the DNS servers without restarting Informix. Let's prove it!
Again I traced a connection, looking at MSC VP. I get this:

socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.112.2")}, 16) = 0
gettimeofday({1326038028, 174740}, NULL) = 0
poll([{fd=3, events=POLLOUT}], 1, 0)    = 1 ([{fd=3, revents=POLLOUT}])
send(3, "+\316\1\0\0\1\0\0\0\0\0\0\0011\003112\003168\003192\7in-ad"..., 44, MSG_NOSIGNAL) = 44
poll([{fd=3, events=POLLIN}], 1, 2000)  = 0 (Timeout)
close(3)                                = 0
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=438, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x12f000
read(3, "127.0.0.1\tpacman.onlinedomus.net"..., 4096) = 438
close3)

It is connecting to 192.168.112.2, which is what I have on /etc/resolv.conf. If I change that on the file to 192.168.112.5 and try to connect again, the same thing happens (it's not aware of the change). But now, without further changes in the file I run a debugger against the MSC VP process:

[root@pacman tmp]# gdb -p 4649
GNU gdb (GDB) Fedora (7.1-18.fc13)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-redhat-linux-gnu".
For bug reporting instructions, please see:
<http: bugs="" gdb="" software="" www.gnu.org="">.
Attaching to process 4649
Reading symbols from /usr/informix/srvr1170uc4/bin/oninit...(no debugging symbols found)...done.
[...]
(gdb) call __res_init()
$1 = 0
(gdb) detach
Detaching from program: /usr/informix/srvr1170uc4/bin/oninit, process 4649
(gdb) quit
[root@pacman tmp]#

And I "call" the __res_init() function which I checked to be defined in the libc.so code. Let's trace another connection now:

socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.112.5")}, 16) = 0
gettimeofday({1326038409, 784712}, NULL) = 0
poll([{fd=3, events=POLLOUT}], 1, 0)    = 1 ([{fd=3, revents=POLLOUT}])
send(3, "\364K\1\0\0\1\0\0\0\0\0\0\0011\003112\003168\003192\7in-ad"..., 44, MSG_NOSIGNAL) = 44
poll([{fd=3, events=POLLIN}], 1, 2000)  = 0 (Timeout)
close(3)                                = 0
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=438, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x12f000
read(3, "127.0.0.1\tpacman.onlinedomus.net"..., 4096) = 438
close(3)

Ups! Hacked!... Of course this is not supported. Don't try this on real servers. It just serves the purpose of proving a point.

Conclusions

What I tried to demonstrate in this article is how Informix interacts with the DNS system, if you configure DNS in your environment. I hope that I made clear that once DNS is in place it takes a very important role in the connection process. So much that a bad function in DNS will have major impact in the database system. There are a few points that are Informix responsibility that can contribute to the impact. In particular:

The fact that the reverse DNS requests are made by the MSC VP(s) can generate delays in requests that would work ok, just because a previous one has a problem. Having more than one MSC VP can greatly alleviate this, but may not solve it
The fact that Informix doesn't establish timeouts to the gethostbyname() calls - or equivalent - can also lead to unnecessary delays. But note that the functions signatures don't have any way to ask for this, so an alarm would have to be setup, so that the process would receive a signal and then check if it was still waiting on the function. This would cause additional overhead, that would not make sense specially because the timeouts are possible to configure in the generic DNS configuration
The fact that the functions called, cache the configuration details, allied to the fact that Informix has no way to clear the cache, means that a change in the DNS addresses will require a stop in the database system. There is a feature request to allow this. I'd love to see this implemented.

On the other hand, Informix 11.7 introduced a great feature to ease all this. With the NS_CACHE parameter we can configure the cache times for several systems (DNS, users/passwords, services...). This can reduce to a minimum the number of requests. Of course, as with any other caching mechanism the risk is to have stalled entries. But we can clear any of the caches by using "onmode -wm NS_CACHE=..." and using a timeout of zero. This would be a great place to call res_init() by the way...

To roundup the article I'd like to mention a few best practices:

Use Informix 11.7 if you can, so that you can take advantage of the caching. Your DNS servers and DNS admins will appreciate it if you have high connection rates
Consider including "files" in the /etc/nsswitch.conf configuration. Some people consider this a bad idea, because it keeps the /etc/hosts file very busy. But if you keep it short and use the cache of Informix 11.7 it shouldn't be too bad. And it can save you if your DNS blows away (you could then add entries to the /etc/hosts file even temporarily). Note that at least in my test system (Linux) not even res_init makes the process re-read /etc/nsswitch.conf
Make sure your DNS admins and DBA get along... They must work together and should be aware that their systems are tightly connected
Use Informix 11.7 if you can, so that you can use REMOTE_SERVER_CFG. Many companies don't allow the DBA to manage the /etc/hosts.equiv files (system files). If you have a problem in DNS, and your reverse DNS queries start to fail, your trusted connections will fail if they use the names and not the IP addresses. So, in critical situations it may be useful that the DBAs can act immediately and add the IP addresses in the trusted lists. With REMOTE_SERVER_CFG they'll be able to do it.
Use very small timeouts configured in /etc/resolvs.conf (1-2s) to minimize the waisted wait time (a successful query to a DNS is pretty quick

Small query performance analysis / Pequena an�lise de performance de querys

This article is written in English and Portuguese
Este artigo est� escrito em Ingl�s e Portugu�s

English version:

The need...

The end of the year is typically a critical time for IT people. Following up on last article's I'm still working with performance issues. "Performance issues on Informix?!" I hear you say... Well yes, but to give you an idea of the kind of system I'm talking about I can say that recently we noticed three small tables (between 3 and 60 rows) that between mid-night and 11AM were scanned 33M times. To save you the math, that's around 833 scans/queries for each of these tables per second. And this started to happen recently, on top of the normal load that nearly 3000 sessions can generate...
So, the point is: every bit of performance matters. And in most cases, on this system there are no long running queries. It's mostly very short requests made an incredible number of times. And yes, this makes the DBA life harder... If you have long running queries with bad query plans they're usually easy to spot. But if you have a large number of very quick queries, but with questionable query plans, than it's much more difficult to find.

Just recently I had one of this situations. I've found a query with a questionable query plan. The query plan varies with the arguments and both possible options have immediate response times (fraction of a second). That's not the first time I've found something similar, and most of the times I face the same situation twice I usually decide I need to have some tool to help me on that.

The idea!

The purpose was to see the difference in the work the engine does between two query plans. And when I say "tool" I'm thinking about a script. Last time I remember having this situation, I used a trick in dbaccess to obtain the performance counters for both the session, and the tables involved. Some of you probably know, others may not, but when dbaccess parses an SQL script file it can recognize a line starting with "!" as an instruction to execute the rest of the line as a SHELL command. So basically what I did previously was to customize the SQL script containing the query like this:

!onstat -z

SELECT .... FROM .... WHERE ...

!some_shell_scritpt

where some_shell_script had the ability to find the session and run an onstat -g tpf and also an onstat -g ppf. These two onstat commands show us a lot of performance counters respectively from the threads (tpf) and from the partitions (ppf). The output looks like:

IBM Informix Dynamic Server Version 11.70.UC4 -- On-Line -- Up 7 days 23:42:15 -- 411500 Kbytes

Thread profiles
tid lkreqs lkw dl to lgrs isrd iswr isrw isdl isct isrb lx bfr bfw lsus lsmx seq
24  0      0   0  0  0    0    0    0    0    0    0    0  0   0   0    0    0  
26  0      0   0  0  0    0    0    0    0    0    0    0  95  95  0    0    0  
51  32917  0   0  0  21101 13060 3512 57   532  3795 0    0  91215 29964 0    125008 4226
52  39036  0   0  0  9099 11356 2648 80   1372 265  0    0  45549 9312 0    244900 21 
49  705    0   0  0  574  8938 0    139  0    139  0    0  22252 148 0    5656 541
2444 706    0   0  0  14   344  0    4    0    0    3    0  819 7   136  224  0

This tells us the thread Id, lock requests, lock waits, deadlocks, timeouts, logical log records, isam calls (read, write, rewrite, delete, commit and rollback), long transactions, buffer reads and writes, logical log space used, logical log space maximum and sequential scans.
And this:

panther@pacman.onlinedomus.com:informix-> onstat -g ppf | grep -v "0     0     0     0     0     0     0     0     0     0     0     0"

IBM Informix Dynamic Server Version 11.70.UC4 -- On-Line -- Up 7 days 23:43:41 -- 411500 Kbytes

Partition profiles
partnum    lkrqs lkwts dlks  touts isrd  iswrt isrwt isdel bfrd  bfwrt seqsc rhitratio
0x100001   0     0     0     0     0     0     0     0     13697 0     0     100
0x100002   993   0     0     0     445   0     0     0     1460  0     2     100
0x10002d   6769  0     0     0     2379  34    340   34    9094  581   2     100
0x10002e   164   0     0     0     166   0     0     0     472   0     2     100
0x10002f   2122  0     0     0     2750  0     0     0     5288  0     0     100
0x100030   0     0     0     0     4     0     0     0     700   0     4     100
0x100034   14192 0     0     0     5922  192   80    192   15566 1274  0     100
0x100035   2260  0     0     0     188   80    0     80    2766  655   4     100
0x100036   1350  0     0     0     548   34    0     34    1872  249   0     100
0x100037   80    0     0     0     16    4     0     4     346   28    0     100
0x100038   4720  0     0     0     738   360   0     360   3734  1557  0     100

which tells us some of the above, but for each partition.
Note that I reset the counters, run the query and then obtain the profile counters. Ideally, nothing else should be running on the instance (better to do it on a test instance)

Sharing it

But I decided to make this a bit easier and I created a script for doing it. I'm also using this article to announce that starting today, I'll try to keep my collection of scripts on a publicly available site:

http://onlinedomus.com/informix/viewvc.cgi

This repository contains a reasonable amount of scripts for several purposes. Ideally I should create proper documentation and use cases for each one of them, but I currently don't have that. It's possible I'll cover some of them here in the blog, but probably only integrated in a wider article (like this one).

These scripts were created by me (with one exception - setinfx was created by Eric Vercelleto when we were colleagues in Informix Portugal and we should thank him for allowing the distribution), during my free time and should all contain license info (GPL 2.0). This means you can use them, copy them, change them etc. Some of them are very old and may not contain this info.
Some fixes and improvements were done during project engagements. Many of them were based on ideas I got from some scripts available in IIUG's software repository or from colleagues ideas, problems and suggestions (Thanks specially to Ant�nio Lima and Adelino Silva)

It's important to notice that the scripts are available "as-is", no guarantees are made and I cannot be held responsible for any problem that it's use may cause.
Having said that, I've been using most of them on several customers for years without problems.
Any comments and/or suggestions are very welcome, and if I find the suggestions interesting and they don't break the script's ideas and usage, I'll be glad to incorporate them on future versions.

Many of the scripts have two option switches that provide basic help (-h) and version info (-V).
If by any chance you are using any of these scripts I suggest you check the site periodically to find any updates. I try my best to maintain retro-compatibility and old behavior when I make changes on them.

Back to the problem

So, this article focus on analyzing and comparing the effects of running a query with two (or more) different query plans. The script created for this was ixprofiling. If you run it with -h (help) option it will print:

panther@pacman.onlinedomus.com:fnunes-> ./ixprofiling -h
ixprofiling [ -h | -V ]
            -s SID database
            [-z|-Z|-n] database sql_script
     -h             : Get this help
     -V             : Get script version
     -s SID database: Get stats for session (SID) and database
     -n             : Do NOT reset engine stats
     -z             : Reset engine stats using onstat (default - needs local database)
     -Z             : Reset engine stats using SQL Admin API (can work remotely )

Let's see what the options do:

-s SID database
Shows the info similar to onstat -g tpf (for the specified session id) and onstat -g ppf (for the specified database)
It will show information for all the partition objects in the specified database for which any of the profile counters is different from zero. Note that when I write partition, it can be a table, a table's partition or a table's index.
database sql_script
Runs the specified SQL script after making some changes that will (by default) reset the engine profile counters (-z option). See more information about the SQL script below
-n
Prevents the reset of profile counters (if you're not a system database administrator you'll need to specify this to avoid errors)
-z
Resets the profile counters using onstat -z. This is the quickest and most simple way to do it but will need local database access.
-Z
Resets the counters using SQL admin API, so it can be used on remote databases

And now let's see an usage example. The script has some particularities that need to be detailed.
First, since the idea is to compare two or more query plans we can put all the variations inside the SQL script, separating them by a line like:

-- QUERY

when the script finds these lines, it will automatically get the stats (from the previous query) and reset the counters to prepare for the next query. If you use just one query you don't need this, since by default it will reset the counters at the beginning and show the stats at the end.
If you put two or more queries on the script don't forget to end each query with ";" or it will break the functionality.
Let's see a practical example. I have a table with the following structure:

create table ibm_test_case 
  (
    col1 integer,
    col2 smallint not null ,
    col3 integer,
    col4 integer,
[... irrelevant bits... ]
    col13 datetime year to second,
[... more irrelevant bits... ]
  );

create index ix_col3_col13 on ibm_test_case (col3,col13) using btree ;
create index ix_col4 on ibm_test_case (col4) using btree ;

and a query like:

select c.col1
from ibm_test_case c
where
        c.col3 = 123456789 and
        c.col4 = 1234567 and
        c.col13 = ( select max ( c2.col13 ) from ibm_test_case c2
                where
                        c2.col3 = c.col3 and
                        c2.col4 = c.col4
                );

The problem is the query plan for the sub-query. It can choose between an index headed by col3 and another on col4. So I create a test_case.sql with:

unload to /dev/null select c.col1
from ibm_test_case c
where
        c.col3 = 123456789 and
        c.col4 = 1234567 and
        c.col13 = ( select max ( c2.col13 ) from ibm_test_case c2
                where
                        c2.col3 = c.col3 and
                        c2.col4 = c.col4
                );

-- QUERY
unload to /dev/null select c.col1
from ibm_test_case c
where
        c.col3 = 123456789 and
        c.col4 = 1234567 and
        c.col13 = ( select --+ INDEX ( c2 ix_col3_col13 )
                max ( c2.col13 ) from ibm_test_case c2
                where
                        c2.col3 = c.col3 and
                        c2.col4 = c.col4
                );

Note that on the second query I'm forcing the use of a particular index.
Then we run:

ixprofiling stores test_case.sql

and we get the following output:

Database selected.

Engine statistics RESETed. Query results:

Explain set.


1 row(s) unloaded.


Thread profiles (SID: 2690)
LkReq LkWai DLks  TOuts LgRec IsRd  IsWrt IsRWr IsDel BfRd  BfWrt LgUse LgMax SeqSc Srts  DskSr SrtMx Sched CPU Time    Name        
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----------- ------------ 
5224  0     0     0     0     2611  0     0     0     2646  0     0     0     0     0     0     0     2170  0.051671256 sqlexec     

Partitions profiles (Database: stores)
LkReq LkWai DLks  TOuts DskRd DskWr IsRd  IsWrt IsRWr IsDel BfRd  BfWrt SeqSc Object name                                           
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ------------------------------------------------------
6     0     0     0     0     0     2     0     0     0     10    0     0     systables
2609  0     0     0     1933  0     2607  0     0     0     2609  0     0     ibm_test_case
1     0     0     0     2     0     1     0     0     0     6     0     0     ibm_test_case#ix_col3_col13
2608  0     0     0     3     0     1     0     0     0     21    0     0     ibm_test_case#ix_col4
Engine statistics RESETed. Query results:

1 row(s) unloaded.


Thread profiles (SID: 2690)
LkReq LkWai DLks  TOuts LgRec IsRd  IsWrt IsRWr IsDel BfRd  BfWrt LgUse LgMax SeqSc Srts  DskSr SrtMx Sched CPU Time    Name        
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----------- ------------ 
17    0     0     0     0     6     0     0     0     31    0     0     0     0     0     0     0     188   0.003161049 sqlexec     

Partitions profiles (Database: stores)
LkReq LkWai DLks  TOuts DskRd DskWr IsRd  IsWrt IsRWr IsDel BfRd  BfWrt SeqSc Object name                                           
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ------------------------------------------------------
6     0     0     0     0     0     2     0     0     0     10    0     0     systables
4     0     0     0     2     0     1     0     0     0     4     0     0     ibm_test_case
7     0     0     0     0     0     6     0     0     0     17    0     0     ibm_test_case#ix_col3_col13

So, we can now analyze the differences. As you can see the output is more friendly than the output from onstat. On the session section we can see the usual counters, plus the number of times the engine scheduled the thread(s) to run, the CPU time consumed and the name of the threads.
On the tables/partitions section, we can find the partition, table or index name in a friendly nomenclature (instead of the partnum).
As for the comparison, you can spot a big difference. Much more buffer reads and ISAM reads for the first query plan and also a bigger CPU time. Be aware however that for very fast queries the CPU times may show very big variance so don't assume a lower CPU time is always associated with the better query plan. You should repeat the test many times to see the oscillations.
Also note that the meaning of ISAM calls is many times misunderstood. Some people think it's the number of "SELECTs", others the number of rows returned... In reality it's the number of internal functions calls. Some engine settings like BATCHEDREAD_TABLE and BATCHEDREAD_INDEX may influence the number of calls for the same query and query result.

That's all for now. I leave you with the repository and hopefully future articles will focus on some of these scripts. Feel free to use them and send me you suggestions.

Vers�o Portuguesa:

A necessidade...

O fina do ano � tipicamente uma altura critica para os inform�ticos. Continuando no mesmo tema do �ltimo artigo, continuo a trabalhar com problemas de performance. "Problemas de performance em Informix?!" poder�o estar a pensar... Bem, sim, mas para vos dar uma ideia do sistema sobre o qual estou a falar, posso dizer que recentemente not�mos tr�s pequenas tabelas (entre 3 e 60 linhas) que entre a meia-noite e as onze da manh� eram varridas (sequential scan) 33M de vezes. Para poupar nas contas posso j� dizer que d� cerca de 833 scans/queries por segundo para cada uma das tabelas. E isto come�ou a acontecer recentemente sobre a carga "normal" que perto de 3000 sess�es podem criar.
Portanto, a ideia � que cada bocadinho de performance tem impacto. Na maioria dos casos, este sistema n�o tem queries longas. Na maior parte das vezes os problemas s�o pedidos com curta dura��o mas feitos um imenso n�mero de vezes. E sim, isto torna a vida dos DBAs mais dicf�cil... Se tivermos queries longas com maus planos de execu��o s�o normalmente f�ceis de identificar. Mas se tivermos um grande n�mero de queries muito curtas, com um plano de execu��o question�vel, isso � muito mais dif�cil de encontrar.

Ainda recentemente tive uma dessas situa��es. Detectei uma query com um plano de execu��o duvidoso. O plano de execu��o varia com os par�metros usados e ambas as alternativas t�m um tempo de resposta "imediato" (frac��o de segundo). N�o foi a primeira vez que encontrei algo semelhante, e na maioria dos casos em que enfrento uma situa��o duas vezes, normalmente decido que preciso de alguma ferramenta que me ajude no futuro.

A ideia!

O objectivo era evidenciar a diferen�a no trabalho feito pelo motor entre dois planos de execu��o. E quando refiro "ferramenta" estou a pensar num script. A �ltima vez que me lembro de ter tido uma situa��o destas usei um truque no dbaccess para obter os indicadores de performance tanto para a sess�o como para as tabelas envolvidas.
Alguns de v�s saber�o, outros n�o, mas quando o dbaccess l� um scritpt SQL pode reconhecer uma linha come�ada com "!" como uma instru��o para executar o resto da linha como um comando SHELL. Assim, o que fiz em situa��es anteriores foi alterar o script SQL que continha a query para algo do g�nero:

!onstat -z

SELECT .... FROM .... WHERE ...

!um_shell_scritpt

onde um_shell_script tem a capacidade de encontrar a sess�o e correr um onstat -g tpf e tamb�m um onstat -g ppf. Ests dois comandos mostram-nos uma s�rie de contadores de performance respectivamente da sess�o/thread (tpf) e das parti��es (ppf). O output � semelhante a isto:

IBM Informix Dynamic Server Version 11.70.UC4 -- On-Line -- Up 7 days 23:42:15 -- 411500 Kbytes

Thread profiles
tid lkreqs lkw dl to lgrs isrd iswr isrw isdl isct isrb lx bfr bfw lsus lsmx seq
24  0      0   0  0  0    0    0    0    0    0    0    0  0   0   0    0    0  
26  0      0   0  0  0    0    0    0    0    0    0    0  95  95  0    0    0  
51  32917  0   0  0  21101 13060 3512 57   532  3795 0    0  91215 29964 0    125008 4226
52  39036  0   0  0  9099 11356 2648 80   1372 265  0    0  45549 9312 0    244900 21 
49  705    0   0  0  574  8938 0    139  0    139  0    0  22252 148 0    5656 541
2444 706    0   0  0  14   344  0    4    0    0    3    0  819 7   136  224  0

�-nos mostrado o ID da thread, n�mero de pedidos de lock, esperas em locks, deadlocks, lock timeouts, chamadas ISAM (leitura, escrita, re-escrita, apagar, commit e rollback), transac��es longas, leituras e escritas de buffers, espa�o usado em logical logs e m�ximo espa�o usado em logical logs e n�mero de sequential scans. E isto:

panther@pacman.onlinedomus.com:informix-> onstat -g ppf | grep -v "0     0     0     0     0     0     0     0     0     0     0     0"

IBM Informix Dynamic Server Version 11.70.UC4 -- On-Line -- Up 7 days 23:43:41 -- 411500 Kbytes

Partition profiles
partnum    lkrqs lkwts dlks  touts isrd  iswrt isrwt isdel bfrd  bfwrt seqsc rhitratio
0x100001   0     0     0     0     0     0     0     0     13697 0     0     100
0x100002   993   0     0     0     445   0     0     0     1460  0     2     100
0x10002d   6769  0     0     0     2379  34    340   34    9094  581   2     100
0x10002e   164   0     0     0     166   0     0     0     472   0     2     100
0x10002f   2122  0     0     0     2750  0     0     0     5288  0     0     100
0x100030   0     0     0     0     4     0     0     0     700   0     4     100
0x100034   14192 0     0     0     5922  192   80    192   15566 1274  0     100
0x100035   2260  0     0     0     188   80    0     80    2766  655   4     100
0x100036   1350  0     0     0     548   34    0     34    1872  249   0     100
0x100037   80    0     0     0     16    4     0     4     346   28    0     100
0x100038   4720  0     0     0     738   360   0     360   3734  1557  0     100

que nos mostra alguns dos contadores anteriores, mas por parti��o.
Note-se que re-inicializo os contadores, corro a query e depois obtenho os outputs. Idealmente n�o dever� estar mais nada a correr na inst�ncia (� prefer�vel usar uma inst�ncia de teste).

Partilha

Mas decidi tornar isto um pouco mais f�cil e criei um script para o fazer. Estou tamb�m a usar este artigo para anunciar que a partir de hoje, tentarei manter a minha colec��o de scripts dispon�vel num site p�blico:

http://onlinedomus.com/informix/viewvc.cgi

Este reposit�rio cont�m uma quantidade razo�vel de scripts e outras ferramentas �teis para v�rias tareafas. Idealmente eu deveria criar documenta��o e casos de uso para cada um deles, mas de momento isso n�o est� feito. � poss�vel que v� descrevendo alguns destes scripts em futuros artigos, mas sempre integrados em assuntos mais vastos (como este)

Estes scripts foram criados por mim (com uma excep��o - setinfx foi criado por Eric Vercelletto quando �ramos colegas na Informix Portugal e devemos agradecer-lhe por permitir a distribui��o), durante os meus tempos livres e devem conter informa��o de licenciamento (GPL 2.0). Isto quer dizer que podem ser usados, distribuidos, alterados etc.). Alguns podem n�o ter esta informa��o por serem muito antigos.
Naturalmente algumas correc��es e melhorias foram feitas durante projectos em clientes, sempre que detecto algum erro ou hip�tese de melhoria no seu uso. Muitos deles foram baseados em ideias que obtive de scripts existents no reposit�rio do IIUG, ou de ideias, problemas e sugest�es de colegas (agradecimento especial ao Ant�nio Lima e ao Adelino Silva)

� importante avisar que os scripts s�o disponiblizados "como s�o", sem qualquer tipo de garantia implicita ou explicita e eu n�o posso ser responsabilizado por qualquer problema que advenha do seu uso. Posto isto, conv�m tamb�m dizer que a maioria dos scripts t�m sido usados por mim em clientes ao longo de anos, sem problemas.

Quaisquer coment�rios e/ou sugest�es s�o bem vindas, e se os achar interessantes terei todo o prazer em os incorportar em futuras vers�es (desde que n�o fujam � l�gica e utiliza��o do script)
Muitos destes scripts disponibilizam duas op��es que fornecem ajuda b�sica (-h) e informa��o sobre a vers�o (-V).
Se utilizar algum destes scripts no seu ambiente, sugiro que verifique periodicamente se houve correc��es ou melhorias, consultando o site com alguma regularidade. Sempre que poss�vel evito que novas funcionalidades alterem o comportamento do script.

De volta ao problema

Este artigo foca a an�lise e compara��o dos efeitos de executar uma query com dois (ou mais) planos de execu��o. O script criado para isso chama-se ixprofiling. Se corrido com a op��o -h (help) mostra-nos:

panther@pacman.onlinedomus.com:fnunes-> ./ixprofiling -h
ixprofiling [ -h | -V ]
            -s SID database
            [-z|-Z|-n] database sql_script
     -h             : Get this help
     -V             : Get script version
     -s SID database: Get stats for session (SID) and database
     -n             : Do NOT reset engine stats
     -z             : Reset engine stats using onstat (default - needs local database)
     -Z             : Reset engine stats using SQL Admin API (can work remotely )

Vejamos o que fazem as op��es:

-s SID base_dados
Mostra informa��o semelhante ao onstat -g tpf (para a sess�o indicada por SID) e onstat -g ppf (para a base de dados indicada)
Ir� mostrar informa��o para todas as parti��es na base de dados escolhida, para as quais exista pelo menos um dos contadores com valor diferente de zero. Note-se que quando refiro parti��o estou a referir-me a uma tabela, a um fragmento de tabela fragmentada (ou se preferir particionada) ou a um ind�ce.
base_dados script_sql
Corre o script SQL indicado, fazendo altera��es que ir�o (por omiss�o), re-inicializar os contadores de performance do motor (op��o -z). Veja mais informa��o sobre o script SQL abaixo
-n
Evita a re-inicializa��o dos contadores de performance (se n�o f�r administrador do sistema de base de dados ter� de usar esta op��o para evitar erros)
-z
Faz a re-inicializa��o dos contadores do motor usando o comando onstat -z. Esta � a forma mais simples e r�pida de o fazer, mas requer que a base de dados seja local
-Z
Faz a re-inicializa��o dos contadores utilizando a SQL Admin API, de forma que possa ser feito com bases de dados remotas

E agora vejamos um exemplo de uso. O script tem algumas particularidades que merecem ser datalhadas.
Primeiro e porque a ideia � comparar dois ou mais planos de execu��o, podemos colocar todas as variantes de plano de execu��o dentro do mesmo script SQL usando uma linha como esta para separar as queries:

-- QUERY

Estas linhas s�o automaticamente substitu�das por comandos que obt�m os contadores actuais (da query anterior) e que re-inicializam os mesmos contadores preparando a execu��o seguinte. Se usar apenas uma query n�o � necess�rio isto, pois por omiss�o a re-inicializa��o dos contadores � feita no in�cio, e ap�s a �ltima query s�o automaticamente mostrados os contadores.
Se colocar duas ou mais queries no script n�o se esque�a de terminar cada uma com ";" ou o script n�o funcionar� como esperado.
Vamos ver um exemplo pr�tico. Tenho uma tabela com a seguinte estrutura:

create table ibm_test_case 
  (
    col1 integer,
    col2 smallint not null ,
    col3 integer,
    col4 integer,
[... parte irrelevante ... ]
    col13 datetime year to second,
[... mais colunas irrelevantes ... ]
  );

create index ix_col3_col13 on ibm_test_case (col3,col13) using btree ;
create index ix_col4 on ibm_test_case (col4) using btree ;

e uma query com:

select c.col1
from ibm_test_case c
where
        c.col3 = 123456789 and
        c.col4 = 1234567 and
        c.col13 = ( select max ( c2.col13 ) from ibm_test_case c2
                where
                        c2.col3 = c.col3 and
                        c2.col4 = c.col4
                );

O problem � o plano de execu��o da sub-query. Pode escolher entre um �ndice come�ado pela coluna col3 e outro pela coluna col4. Por isso crio um ficheiro, caso_teste.sql com:

unload to /dev/null select c.col1
from ibm_test_case c
where
        c.col3 = 123456789 and
        c.col4 = 1234567 and
        c.col13 = ( select max ( c2.col13 ) from ibm_test_case c2
                where
                        c2.col3 = c.col3 and
                        c2.col4 = c.col4
                );

-- QUERY
unload to /dev/null select c.col1
from ibm_test_case c
where
        c.col3 = 123456789 and
        c.col4 = 1234567 and
        c.col13 = ( select --+ INDEX ( c2 ix_col3_col13 )
                max ( c2.col13 ) from ibm_test_case c2
                where
                        c2.col3 = c.col3 and
                        c2.col4 = c.col4
                );

Repare que na segunda query estou a for�ar o uso de um determinado �ndice:
Depois corro:

ixprofiling stores caso_teste.sql

e obtemos o seguinte:

Database selected.

Engine statistics RESETed. Query results:

Explain set.


1 row(s) unloaded.


Thread profiles (SID: 2690)
LkReq LkWai DLks  TOuts LgRec IsRd  IsWrt IsRWr IsDel BfRd  BfWrt LgUse LgMax SeqSc Srts  DskSr SrtMx Sched CPU Time    Name        
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----------- ------------ 
5224  0     0     0     0     2611  0     0     0     2646  0     0     0     0     0     0     0     2170  0.051671256 sqlexec     

Partitions profiles (Database: stores)
LkReq LkWai DLks  TOuts DskRd DskWr IsRd  IsWrt IsRWr IsDel BfRd  BfWrt SeqSc Object name                                           
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ------------------------------------------------------
6     0     0     0     0     0     2     0     0     0     10    0     0     systables
2609  0     0     0     1933  0     2607  0     0     0     2609  0     0     ibm_test_case
1     0     0     0     2     0     1     0     0     0     6     0     0     ibm_test_case#ix_col3_col13
2608  0     0     0     3     0     1     0     0     0     21    0     0     ibm_test_case#ix_col4
Engine statistics RESETed. Query results:

1 row(s) unloaded.


Thread profiles (SID: 2690)
LkReq LkWai DLks  TOuts LgRec IsRd  IsWrt IsRWr IsDel BfRd  BfWrt LgUse LgMax SeqSc Srts  DskSr SrtMx Sched CPU Time    Name        
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----------- ------------ 
17    0     0     0     0     6     0     0     0     31    0     0     0     0     0     0     0     188   0.003161049 sqlexec     

Partitions profiles (Database: stores)
LkReq LkWai DLks  TOuts DskRd DskWr IsRd  IsWrt IsRWr IsDel BfRd  BfWrt SeqSc Object name                                           
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ------------------------------------------------------
6     0     0     0     0     0     2     0     0     0     10    0     0     systables
4     0     0     0     2     0     1     0     0     0     4     0     0     ibm_test_case
7     0     0     0     0     0     6     0     0     0     17    0     0     ibm_test_case#ix_col3_col13

Ent�o, podemos agora analisar as diferen�as. Como se pode ver o resultado � mais simp�tico que o do onstat. Na sec��o relativa � sess�o, podemos ver os contadores habituais, mais o n�mero de vezes que o motor escalonou a thread para correr, e o tempo de CPU consumido, bem como o nome das threads
Na sec��o destinada �s parti��es podemos encontrar os nomes das tabelas, parti��es ou ind�ces numa nomenclatura f�cil de entender (em vez do partnum).
Sobre a compara��o, podemos ver uma grande diferen�a. Muitos mais leituras de buffers e chamadas ISAM para o primeiro plano de execu��o e tamb�m mais consumo de CPU. Mas aten��o que para queries muito r�pidas os tempos de CPU podem apresentar uma varia��o muito grande. Por isso conv�m n�o assumir imediatamente que um plano de execu��o � melhor porque se v� um tempo de CPU menor na primeira intera��o. Deve repetir-se o teste muitas vezes para se verificar as oscila��es.
Chamo tamb�m a aten��o para o significado das chamadas ISAM. Muitas vezes vejo confus�es sobre este tema. Algumas pessoas pensam que s�o o n�mero de SELECTs (para os ISAM reads), ou que ser�o o n�mero de linhas retornadas... Na realidade � o n�mero de chamadas a fun��es internas. Algumas configura��es do motor como BATCHEDREAD_TABLE e BATCHEDREAD_INDEX podem influenciar o n�mero destas chamadas, para a mesma query e mesmo conjunto de resultados.

� tudo por agora. Deixo-lhe o reposit�rio e a esperan�a que artigos futuros se foquem em alguns destes scripts. Use-os � vontade e envie quaisquer sugest�es.

Optimizer secrets / segredos do optimizador

This article is written in English and Portuguese

Este artigo est� escrito em Ingl�s e Portugu�s

English version:

Spending a lot of time with customers is great. It gives me time to go beyond the strict task execution that short projects allow. We actually have time to dive deep into what sometimes looks only as a curiosity. I'll describe one of this situations in this post. It started out as a normal performance issue. A query involving a join was causing an hash join and apparently there was an index that could be used. Some quick investigation showed that the datatypes didn't match on the join columns, and as such the index was not being used.
The situation was fixed (using an explicit cast since changing the datatypes would require more deep analysis) and the programmer(s) received an explanation about the issue so that future situations would be avoided. Simple right? Yes... But when you have inquiring minds and programmers with a few minutes to spare you may be surprised. And this was the case. A few weeks later the DBA team received a request to explain why we recommended that the columns used in joins should have the same datatype. A programmer had produced a testcase where the engine was able to convert the parameter sent and use the index. In other words, if the engine is smart enough why should we care?!

Although this could be considered a waste of time (using the same datatypes or explicit cast is a good practice, right?!) the question was interesting enough to make us spend some time with it. In fact I had seen situations in the past where apparently sometimes the engine was smart, and others not. I never thought too much about it, since I always recommended to follow the best practices (which clearly saves us some troubles). So, personally I also had this doubt, and together with the customer DBAs we started to do some tests. We came up with a very simple test case that we though would show the problem:

DATABASE stores;

DROP TABLE IF EXISTS tst_int;
DROP TABLE IF EXISTS tst_char;

CREATE TABLE tst_int
(
    c1 INTEGER,
    c2 INTEGER
);
CREATE TABLE tst_char
(
    c1 CHAR(15),
    c2 INTEGER
);

INSERT INTO tst_int
SELECT FIRST 2000 pg_pagenum, pg_partnum FROM sysmaster:syspaghdr;

INSERT INTO tst_char
SELECT FIRST 2000 pg_pagenum, pg_partnum FROM sysmaster:syspaghdr;

CREATE INDEX i_tst_int ON tst_int(c1);
CREATE INDEX i_tst_char ON tst_char(c1);

SET EXPLAIN ON;
SELECT "tst_int with char parameter:", * FROM tst_int WHERE c1 = '12345678';
SELECT "tst_char whth int parameter",* FROM tst_char WHERE c1 = 12345678;

--- demonstrates that each index key is being casted
INSERT INTO tst_char VALUES("a1", 12345678);
SELECT * FROM tst_char WHERE c1 = 12345678;

If we take a look at the query plan we see:

QUERY: (OPTIMIZATION TIMESTAMP: 12-10-2011 23:25:19)
------
SELECT "tst_int with char parameter:", * FROM tst_int WHERE c1 = '12345678'

Estimated Cost: 1
Estimated # of Rows Returned: 1

  1) informix.tst_int: INDEX PATH

    (1) Index Name: informix.i_tst_int
        Index Keys: c1   (Serial, fragments: ALL)
        Lower Index Filter: informix.tst_int.c1 = 12345678

So here the engine was "smart". Meaning it converted the CHAR to an INTEGER and that allowed it to use the index. Nice.

But here:

QUERY: (OPTIMIZATION TIMESTAMP: 12-10-2011 23:25:19)
------
SELECT "tst_char whth int parameter",* FROM tst_char WHERE c1 = 12345678

Estimated Cost: 84
Estimated # of Rows Returned: 1

  1) informix.tst_char: SEQUENTIAL SCAN

        Filters: informix.tst_char.c1 = 12345678

It looks as it's not that smart... Instead of converting the INTEGER parameter to a CHAR and use the index it decides to do the opposite: Converts all the CHARs in that column into INTEGERs and makes a SEQUENTIAL SCAN.
Since I didn't have a good explanation for this we decided to open a PMR to get an official technical support explanation.

Technical support reported that we had something in the documentation that tries to explain this:

http://publib.boulder.ibm.com/infocenter/idshelp/v117/topic/com.ibm.sqls.doc/ids_sqs_1620.htm

This documentation is highly confusing but it tells us something important: If we use a numeric filter against a CHAR column, the database server will convert all the CHAR values to numbers. This is precisely what we saw in the example above. But it still does not explain why. Let me show why with a few examples:

If we have "00123" in the table and we give 123 as the parameter, if we convert the number to CHAR and try to match it, ir would fail. Bad results...
If we have "234" in the column and we give 9 as the parameter for a "less or equal" (col <= 9), if we convert the number to CHAR, and apply the filter to the CHAR type, it would match ("9" >= "234"). Again it would return an incorrect result because by using an INTEGER as a parameter we're assuming INTEGER comparison

So, after this, the rule would seem pretty simple. Something like: "If you use a numeric parameter against a CHAR column, all the values in the column will be converted to numeric. On the other hand, using a CHAR parameter against a numeric column allows the engine to convert the numeric to CHAR and if there is an index on the column it can be used".

But life's not that simple... If it was so simple why couldn't we find it in the documentation? I tried to search for other RDBMS (Oracle and MS SQL) documentation, and in those cases they're very clear about the issue. Something like "whenever an implicit CAST is needed we follow a precedence table of datatypes. The datatype with lower precedence will be converted into the datatype with higher precedence". Sincerely I thought this was a good way to put it, and if we did the same, why not document it properly? So the PMR already opened started to look like a documentation bug.
But again, life sometimes is not simple... And while this was being analyzed and discussed, the customer team discovered an interesting scenario: If you give an integer value as a filter against a CHAR column, AND the length of the integer value (excluding any leading zeros) is equal to the size of the column, than Informix will convert the number to CHAR and eventually will use an index on the specified column.
This is the optimizer being smart. If you think about it, if the number has the same number of digits as the length of the CHAR column, you can convert the number to CHAR and compare it. The result set will be correct no matter the values in question or the operator.

To end the story, while browsing the documentation in search for other topics we came across this:

http://publib.boulder.ibm.com/infocenter/idshelp/v117/topic/com.ibm.perf.doc/ids_prf_536.htm

It's clear and well explained. Informix makes the necessary casts but that can have a real impact on the performance, specially if exists an index on the column. And the optimizer is smart enough to use a better query plan in the only situation where it can be done. Really nice and at least here (Performance Guide) it's well explained. I really don't mind when a PMR generates a bug because that's a product improvement, but I must admit I prefer to be proven wrong and see that the product really works well!

Vers�o Portuguesa:

� �timo passar muito tempo com clientes. D�-me tempo para ir para al�m da estrita execu��o de tarefas que os projetos curtos permitem. Temos tempo para aprofundar o que por vezes n�o parece ser mais que uma curiosidade. Nesta entrada vou descrever uma dessas situa��es. Come�ou como um banal problema de performance. Uma query que envolvia um join estava a gerar um hash join havendo um �ndice que aparentemente podia ser usado. Ap�s uma r�pida investiga��o percebeu-se que os tipos de dados das colunas envolvidas no join n�o eram iguais e por isso o �ndice n�o era usado.
A situa��o foi corrigida (usando um CAST expl�cito pois mudar os tipos de dados teria necessitado de uma an�lise mais profunda e poderia ter outras implica��es) e o programador(es) recebeu uma explica��o sobre o problema para que situa��es semelhantes fossem evitadas no futuro. Simples, certo? Sim... Mas quando temos mentes curiosas e programadores com alguns minutos para dispensar podemos ser surpreendidos. E este foi um desses casos. Umas semanas mais tarde a equipa de DBAs do cliente recebeu um pedido para explicar o porqu� da recomenda��o, acompanhado de um caso de teste que demonstrava que o motor conseguia usar um �ndice mesmo quando os tipos de dados n�o batiam certo. Por outras palavras, se o motor tem intelig�ncia para o fazer, porque nos devemos n�s preocupar?!

Apesar de isto poder ser considerado uma perda de tempo (usar os mesmos tipos de dados ou um CAST expl�cito � uma boa pr�tica, n�o �?!) a quest�o era suficientemente interessante para nos fazer gastar algum tempo com ela. Na realidade j� tinha tido situa��es no passado onde aparentemente o motor parecia inteligente, e outras onde tal n�o acontecia. Nunca pensei muito no assunto, dado que recomendo sempre que seja seguida as boas pr�ticas (que claramente nos evitam problemas). Portanto, pessoalmente tamb�m tinha esta d�vida e em conjunto com os DBAs do cliente inici�mos alguns testes. Cri�mos um exemplo muito simples que pensamos que demonstra o problema:

DATABASE stores;

DROP TABLE IF EXISTS tst_int;
DROP TABLE IF EXISTS tst_char;

CREATE TABLE tst_int
(
    c1 INTEGER,
    c2 INTEGER
);
CREATE TABLE tst_char
(
    c1 CHAR(15),
    c2 INTEGER
);

INSERT INTO tst_int
SELECT FIRST 2000 pg_pagenum, pg_partnum FROM sysmaster:syspaghdr;

INSERT INTO tst_char
SELECT FIRST 2000 pg_pagenum, pg_partnum FROM sysmaster:syspaghdr;

CREATE INDEX i_tst_int ON tst_int(c1);
CREATE INDEX i_tst_char ON tst_char(c1);

SET EXPLAIN ON;
SELECT "tst_int with char parameter:", * FROM tst_int WHERE c1 = '12345678';
SELECT "tst_char whth int parameter",* FROM tst_char WHERE c1 = 12345678;

--- demonstrates that each index key is being casted
INSERT INTO tst_char VALUES("a1", 12345678);
SELECT * FROM tst_char WHERE c1 = 12345678;

Se olharmos para o plano de execu��o vemos:

QUERY: (OPTIMIZATION TIMESTAMP: 12-10-2011 23:25:19)
------
SELECT "tst_int with char parameter:", * FROM tst_int WHERE c1 = '12345678'

Estimated Cost: 1
Estimated # of Rows Returned: 1

  1) informix.tst_int: INDEX PATH

    (1) Index Name: informix.i_tst_int
        Index Keys: c1   (Serial, fragments: ALL)
        Lower Index Filter: informix.tst_int.c1 = 12345678

Ou seja, aqui o motor era "esperto". Convertia o CHAR para INTEGER e isso permitia usar o �ndice. Boa.

Mas aqui:

QUERY: (OPTIMIZATION TIMESTAMP: 12-10-2011 23:25:19)
------
SELECT "tst_char whth int parameter",* FROM tst_char WHERE c1 = 12345678

Estimated Cost: 84
Estimated # of Rows Returned: 1

  1) informix.tst_char: SEQUENTIAL SCAN

        Filters: informix.tst_char.c1 = 12345678

Parece que n�o � assim t�o esperto.... Em vez de converter o par�metro INTEGER para um CHAR e usar o �ndice, decide fazer o oposto: Converte todos os CHARs daquela coluna para INTEGERs e executa um SEQUENTIAL SCAN.
Como n�o tinha uma boa explica��o para isto decidi abrir um PMR para obter uma explica��o oficial do suporte t�cnico.

O suporte t�cnico informou que n�s t�nhamos algo na documenta��o que tenta explicar isto:

http://publib.boulder.ibm.com/infocenter/idshelp/v117/topic/com.ibm.sqls.doc/ids_sqs_1620.htm

Esta documenta��o � altamente confusa, mas diz-nos algo importante: Se usarmos um filtro num�rico contra uma coluna do tipo CHAR, o servidor de base de dados ir� converter todos os valores CHAR da coluna em n�meros. Isto � exatamente o que encontr�mos no exemplo acima. Mas ainda n�o explica porqu�. Deixe-me explicar o porqu� com alguns exemplos:

Se tivermos o valor "00123" na tabela e usarmos 123 como par�metro/filtro, se convertermos o n�mero para CHAR e tentarmos fazer a compara��o n�o vamos retornar nada ('123' != '00123') . Resultados errados...
Se tivermos o valor "234" na coluna e dermos 9 como par�metro/filtro para uma condi��o de menor ou igual (col <= 9), se convertermos o n�mero para CHAR isso implicaria que a linha era retornada ("9" >= "234"). Mais uma vez iria retornar um resultado "errado", pois ao usarmos um par�metro num�rico estamos a assumir compara��o num�rica (onde 9 < 234)

Assim, depois disto a regra parecia muito simples. Algo como "Se usarmos um par�metro num�rico contra uma coluna do tipo CHAR, todos os valores da coluna ser�o convertidos para num�rico. Por outro lado, usar um par�metro CHAR contra uma coluna num�rica permite que o motor converta o par�metro para n�mero e use um �ndice caso exista".

Mas a vida n�o � assim t�o simples... Se era assim t�o direto porque raz�o n�o estava documentado (ou pelo menos n�s n�o t�nhamos encontrado)? Tentei procurar na documenta��o de outros sistemas de gest�o de bases de dados (Oracle e MS SQL) , e nestes casos eram bastante claros sobre o assunto. Um resumo livre seria "sempre que um CAST impl�cito seja necess�rio, seguimos uma tabela de preced�ncias de tipos de dados. O tipo de dado com menor preced�ncia ser� convertido para o que tem mais preced�ncia". Sinceramente isto pareceu-me uma forma correta de colocar a quest�o, e se faz�amos o mesmo porque n�o ter isto claro na documenta��? Assim o PMR j� aberto parecia encaminhar-se para um bug de documenta��o.

Mas novamente, a vida por vezes n�o � simples... E enquanto isto estava a ser analisado e discutido , a equipa do cliente descobriu um cen�rio interessante: Se usarmos um numero como filtro contra uma coluna do tipo CHAR, e o numero de d�gitos desse inteiro (excluindo quaisquer zeros � esquerda) for igual ao n�mero de caracteres definido na coluna, ent�o o n�mero ser� convertido para CHAR e um eventual �ndice na coluna ser� usado.
Isto � o optimizador a ser "esperto". Se pensarmos sobre o assunto, se o n�mero de d�gitos do n�mero for igual ao n�mero de caracteres da coluna, podemos convert�-lo para CHAR e compar�-lo com a coluna. O resultado ser� o correto independentemente dos valores em quest�o e do operador.

Para terminar a hist�ria, enquanto consult�va-mos a documenta��o devido a outro assunto, demos com o seguinte:

http://publib.boulder.ibm.com/infocenter/idshelp/v117/topic/com.ibm.perf.doc/ids_prf_536.htm

Est� claro e bem explicado. O Informix efetua os CASTS necess�rios para resolver queries onde existam inconsist�ncias entre os tipos de dados. Mas isso pode ter um impacto significativo na performance, especialmente se existir um �ndice na coluna. E o optimizador � suficientemente inteligente para obter um melhor plano de execu��o na �nica situa��o onde isso pode ser feito. Muito correto e pelo menos aqui (Guia de Performance) est� bem explicado.
Sinceramente n�o me importo muito quando um PMR d� origem a um bug, pois isso traduz-se numa melhoria do produto, mas tenho de admitir que prefiro que resulte que estava enganado e ser-me mostrado que o produto est� a funcionar bem!

Short notes / Notas curtas

This article is written in English and Portuguese
Este artigo est� escrito em Ingl�s e Portugu�s

English version:

This is a very short article that consists of a few short notes.
First, as you can notice from the banner above, IIUG is going to organize the 2012 International User Conference in San Diego on April 22-25 2012. Next year's conference will move from the traditional Kansas City location to San Diego, California. As usual, you can get a lot of value for money. I will probably skip this one, but Ill surely miss it. Please consult the conference URL for all the details.

Some time ago I added a blog to my list, mas to be honest at the time I didn't review the content that it already had. Recently, during some investigation I ended up there and I had the opportunity to browse over it's content and it's really interesting. I'm talking about http://www.jfmiii/informix/ . The URL shows that it's author is John Miller, one of the most well known elements of the Informix community. John is the author of a lot of Informix material, specially about update statistics and backup and restore. In the last years is has been involved with Open Admin Tool (OAT). He's also a usual presence in conferences and presentations done or sponsored by IBM. I highly recommend this blog. The blog also includes contributions from other well known members of the community

Still related to blogs, I just added one more to the list: http://informixdba.wordpress.com/ . It's author is Ben Thompson, an UK based Informix and Oralce DBA. Not many articles yet, but it looks promising.

Philip Howard, from Bloor Research talks about the Informix revive and also mentions it in the article "Breakthrough and instrumented applications".

TatukGIS a Polish based GIS software and solution provider recently extended it's products support to Informix and DB2

In a blog hosted by Microsoft, there is a reference to future support for Informix 11 on their BizTalk ESB software in the next release of the product

Vers�o Portuguesa:

Este � um artigo muito curto e consistir� em algumas notas sobre v�rios temas.
Primeiro, como pode ver pelo cabe�alho acima, o IIUG vai organizar a confer�ncia internacional de utilizadores de 2012 em San Diego entre 22 e 25 de Abril de 2012. A confer�ncia do ano que vem deixar� a localiza��o tradicional em Kansas City e ir� para San Diego na California. Como � h�bito pode obter bastante valor pelo investimento. Em princ�pio n�o irei estar presente, mas irei sem d�vida sentir a falta. Consulte o endere�o da confer�ncia para mais detalhes.

H� uns tempos adicionei um blog � lista, mas sinceramente na altura n�o revi todo o conte�do que j� existia. Recentemente durante alguma investiga��o que efectuei acabei por l� ir parar e tive oportunidade de verificar que tem conte�do bastante interessante. Estou a falar de http://www.jfmiii/informix/ . Pelo endere�o e conte�do � f�cil perceber que se trata do John Miller um dos mais conhecidos elementos da comunidade Informix. O John tem estado ligado a imenso material sobre Informix, em particular artigos sobre update statistics, backup e restore e outros. Nos �ltimos anos tem estado por detr�s do Open Admin Tool. � tamb�m presen�a habitual em confer�ncias e apresenta��es da IBM. Recomendo vivamente. O blog cont�m tamb�m contribui��es de outros elementos bem conhecidos da comunidade

Ainda relacionado com blogs, acabei de adicionar um � lista: http://informixdba.wordpress.com/ . O seu autor � Ben Thompson, um DBA Informix e Oracle baseado no Reino Unido. Ainda n�o tem muitos artigos, mas parece promissor.

Philip Howard, da Bloor Research fala sobre o Informix revive e tamb�m o refere noutro artigo: "Breakthrough and instrumented applications".

TatukGIS, um fornecedre de software e solu��es GIS, baseado na Pol�nia anunciou recentemente a extens�o do suporte nos seus produtos para Informix e DB2

Num blog hospedado na Microsoft, existe uma referencia ao futuro suporte ao Informix 11 na pr�xima release do seu BizTalk ESB.