Technology and Information: Optimizer secrets / segredos do optimizador

This article is written in English and Portuguese

Este artigo est� escrito em Ingl�s e Portugu�s

English version:

Spending a lot of time with customers is great. It gives me time to go beyond the strict task execution that short projects allow. We actually have time to dive deep into what sometimes looks only as a curiosity. I'll describe one of this situations in this post. It started out as a normal performance issue. A query involving a join was causing an hash join and apparently there was an index that could be used. Some quick investigation showed that the datatypes didn't match on the join columns, and as such the index was not being used.
The situation was fixed (using an explicit cast since changing the datatypes would require more deep analysis) and the programmer(s) received an explanation about the issue so that future situations would be avoided. Simple right? Yes... But when you have inquiring minds and programmers with a few minutes to spare you may be surprised. And this was the case. A few weeks later the DBA team received a request to explain why we recommended that the columns used in joins should have the same datatype. A programmer had produced a testcase where the engine was able to convert the parameter sent and use the index. In other words, if the engine is smart enough why should we care?!

Although this could be considered a waste of time (using the same datatypes or explicit cast is a good practice, right?!) the question was interesting enough to make us spend some time with it. In fact I had seen situations in the past where apparently sometimes the engine was smart, and others not. I never thought too much about it, since I always recommended to follow the best practices (which clearly saves us some troubles). So, personally I also had this doubt, and together with the customer DBAs we started to do some tests. We came up with a very simple test case that we though would show the problem:

DATABASE stores;

DROP TABLE IF EXISTS tst_int;
DROP TABLE IF EXISTS tst_char;

CREATE TABLE tst_int
(
    c1 INTEGER,
    c2 INTEGER
);
CREATE TABLE tst_char
(
    c1 CHAR(15),
    c2 INTEGER
);

INSERT INTO tst_int
SELECT FIRST 2000 pg_pagenum, pg_partnum FROM sysmaster:syspaghdr;

INSERT INTO tst_char
SELECT FIRST 2000 pg_pagenum, pg_partnum FROM sysmaster:syspaghdr;

CREATE INDEX i_tst_int ON tst_int(c1);
CREATE INDEX i_tst_char ON tst_char(c1);

SET EXPLAIN ON;
SELECT "tst_int with char parameter:", * FROM tst_int WHERE c1 = '12345678';
SELECT "tst_char whth int parameter",* FROM tst_char WHERE c1 = 12345678;

--- demonstrates that each index key is being casted
INSERT INTO tst_char VALUES("a1", 12345678);
SELECT * FROM tst_char WHERE c1 = 12345678;

If we take a look at the query plan we see:

QUERY: (OPTIMIZATION TIMESTAMP: 12-10-2011 23:25:19)
------
SELECT "tst_int with char parameter:", * FROM tst_int WHERE c1 = '12345678'

Estimated Cost: 1
Estimated # of Rows Returned: 1

  1) informix.tst_int: INDEX PATH

    (1) Index Name: informix.i_tst_int
        Index Keys: c1   (Serial, fragments: ALL)
        Lower Index Filter: informix.tst_int.c1 = 12345678

So here the engine was "smart". Meaning it converted the CHAR to an INTEGER and that allowed it to use the index. Nice.

But here:

QUERY: (OPTIMIZATION TIMESTAMP: 12-10-2011 23:25:19)
------
SELECT "tst_char whth int parameter",* FROM tst_char WHERE c1 = 12345678

Estimated Cost: 84
Estimated # of Rows Returned: 1

  1) informix.tst_char: SEQUENTIAL SCAN

        Filters: informix.tst_char.c1 = 12345678

It looks as it's not that smart... Instead of converting the INTEGER parameter to a CHAR and use the index it decides to do the opposite: Converts all the CHARs in that column into INTEGERs and makes a SEQUENTIAL SCAN.
Since I didn't have a good explanation for this we decided to open a PMR to get an official technical support explanation.

Technical support reported that we had something in the documentation that tries to explain this:

http://publib.boulder.ibm.com/infocenter/idshelp/v117/topic/com.ibm.sqls.doc/ids_sqs_1620.htm

This documentation is highly confusing but it tells us something important: If we use a numeric filter against a CHAR column, the database server will convert all the CHAR values to numbers. This is precisely what we saw in the example above. But it still does not explain why. Let me show why with a few examples:

If we have "00123" in the table and we give 123 as the parameter, if we convert the number to CHAR and try to match it, ir would fail. Bad results...
If we have "234" in the column and we give 9 as the parameter for a "less or equal" (col <= 9), if we convert the number to CHAR, and apply the filter to the CHAR type, it would match ("9" >= "234"). Again it would return an incorrect result because by using an INTEGER as a parameter we're assuming INTEGER comparison

So, after this, the rule would seem pretty simple. Something like: "If you use a numeric parameter against a CHAR column, all the values in the column will be converted to numeric. On the other hand, using a CHAR parameter against a numeric column allows the engine to convert the numeric to CHAR and if there is an index on the column it can be used".

But life's not that simple... If it was so simple why couldn't we find it in the documentation? I tried to search for other RDBMS (Oracle and MS SQL) documentation, and in those cases they're very clear about the issue. Something like "whenever an implicit CAST is needed we follow a precedence table of datatypes. The datatype with lower precedence will be converted into the datatype with higher precedence". Sincerely I thought this was a good way to put it, and if we did the same, why not document it properly? So the PMR already opened started to look like a documentation bug.
But again, life sometimes is not simple... And while this was being analyzed and discussed, the customer team discovered an interesting scenario: If you give an integer value as a filter against a CHAR column, AND the length of the integer value (excluding any leading zeros) is equal to the size of the column, than Informix will convert the number to CHAR and eventually will use an index on the specified column.
This is the optimizer being smart. If you think about it, if the number has the same number of digits as the length of the CHAR column, you can convert the number to CHAR and compare it. The result set will be correct no matter the values in question or the operator.

To end the story, while browsing the documentation in search for other topics we came across this:

http://publib.boulder.ibm.com/infocenter/idshelp/v117/topic/com.ibm.perf.doc/ids_prf_536.htm

It's clear and well explained. Informix makes the necessary casts but that can have a real impact on the performance, specially if exists an index on the column. And the optimizer is smart enough to use a better query plan in the only situation where it can be done. Really nice and at least here (Performance Guide) it's well explained. I really don't mind when a PMR generates a bug because that's a product improvement, but I must admit I prefer to be proven wrong and see that the product really works well!

Vers�o Portuguesa:

� �timo passar muito tempo com clientes. D�-me tempo para ir para al�m da estrita execu��o de tarefas que os projetos curtos permitem. Temos tempo para aprofundar o que por vezes n�o parece ser mais que uma curiosidade. Nesta entrada vou descrever uma dessas situa��es. Come�ou como um banal problema de performance. Uma query que envolvia um join estava a gerar um hash join havendo um �ndice que aparentemente podia ser usado. Ap�s uma r�pida investiga��o percebeu-se que os tipos de dados das colunas envolvidas no join n�o eram iguais e por isso o �ndice n�o era usado.
A situa��o foi corrigida (usando um CAST expl�cito pois mudar os tipos de dados teria necessitado de uma an�lise mais profunda e poderia ter outras implica��es) e o programador(es) recebeu uma explica��o sobre o problema para que situa��es semelhantes fossem evitadas no futuro. Simples, certo? Sim... Mas quando temos mentes curiosas e programadores com alguns minutos para dispensar podemos ser surpreendidos. E este foi um desses casos. Umas semanas mais tarde a equipa de DBAs do cliente recebeu um pedido para explicar o porqu� da recomenda��o, acompanhado de um caso de teste que demonstrava que o motor conseguia usar um �ndice mesmo quando os tipos de dados n�o batiam certo. Por outras palavras, se o motor tem intelig�ncia para o fazer, porque nos devemos n�s preocupar?!

Apesar de isto poder ser considerado uma perda de tempo (usar os mesmos tipos de dados ou um CAST expl�cito � uma boa pr�tica, n�o �?!) a quest�o era suficientemente interessante para nos fazer gastar algum tempo com ela. Na realidade j� tinha tido situa��es no passado onde aparentemente o motor parecia inteligente, e outras onde tal n�o acontecia. Nunca pensei muito no assunto, dado que recomendo sempre que seja seguida as boas pr�ticas (que claramente nos evitam problemas). Portanto, pessoalmente tamb�m tinha esta d�vida e em conjunto com os DBAs do cliente inici�mos alguns testes. Cri�mos um exemplo muito simples que pensamos que demonstra o problema:

DATABASE stores;

DROP TABLE IF EXISTS tst_int;
DROP TABLE IF EXISTS tst_char;

CREATE TABLE tst_int
(
    c1 INTEGER,
    c2 INTEGER
);
CREATE TABLE tst_char
(
    c1 CHAR(15),
    c2 INTEGER
);

INSERT INTO tst_int
SELECT FIRST 2000 pg_pagenum, pg_partnum FROM sysmaster:syspaghdr;

INSERT INTO tst_char
SELECT FIRST 2000 pg_pagenum, pg_partnum FROM sysmaster:syspaghdr;

CREATE INDEX i_tst_int ON tst_int(c1);
CREATE INDEX i_tst_char ON tst_char(c1);

SET EXPLAIN ON;
SELECT "tst_int with char parameter:", * FROM tst_int WHERE c1 = '12345678';
SELECT "tst_char whth int parameter",* FROM tst_char WHERE c1 = 12345678;

--- demonstrates that each index key is being casted
INSERT INTO tst_char VALUES("a1", 12345678);
SELECT * FROM tst_char WHERE c1 = 12345678;

Se olharmos para o plano de execu��o vemos:

QUERY: (OPTIMIZATION TIMESTAMP: 12-10-2011 23:25:19)
------
SELECT "tst_int with char parameter:", * FROM tst_int WHERE c1 = '12345678'

Estimated Cost: 1
Estimated # of Rows Returned: 1

  1) informix.tst_int: INDEX PATH

    (1) Index Name: informix.i_tst_int
        Index Keys: c1   (Serial, fragments: ALL)
        Lower Index Filter: informix.tst_int.c1 = 12345678

Ou seja, aqui o motor era "esperto". Convertia o CHAR para INTEGER e isso permitia usar o �ndice. Boa.

Mas aqui:

QUERY: (OPTIMIZATION TIMESTAMP: 12-10-2011 23:25:19)
------
SELECT "tst_char whth int parameter",* FROM tst_char WHERE c1 = 12345678

Estimated Cost: 84
Estimated # of Rows Returned: 1

  1) informix.tst_char: SEQUENTIAL SCAN

        Filters: informix.tst_char.c1 = 12345678

Parece que n�o � assim t�o esperto.... Em vez de converter o par�metro INTEGER para um CHAR e usar o �ndice, decide fazer o oposto: Converte todos os CHARs daquela coluna para INTEGERs e executa um SEQUENTIAL SCAN.
Como n�o tinha uma boa explica��o para isto decidi abrir um PMR para obter uma explica��o oficial do suporte t�cnico.

O suporte t�cnico informou que n�s t�nhamos algo na documenta��o que tenta explicar isto:

http://publib.boulder.ibm.com/infocenter/idshelp/v117/topic/com.ibm.sqls.doc/ids_sqs_1620.htm

Esta documenta��o � altamente confusa, mas diz-nos algo importante: Se usarmos um filtro num�rico contra uma coluna do tipo CHAR, o servidor de base de dados ir� converter todos os valores CHAR da coluna em n�meros. Isto � exatamente o que encontr�mos no exemplo acima. Mas ainda n�o explica porqu�. Deixe-me explicar o porqu� com alguns exemplos:

Se tivermos o valor "00123" na tabela e usarmos 123 como par�metro/filtro, se convertermos o n�mero para CHAR e tentarmos fazer a compara��o n�o vamos retornar nada ('123' != '00123') . Resultados errados...
Se tivermos o valor "234" na coluna e dermos 9 como par�metro/filtro para uma condi��o de menor ou igual (col <= 9), se convertermos o n�mero para CHAR isso implicaria que a linha era retornada ("9" >= "234"). Mais uma vez iria retornar um resultado "errado", pois ao usarmos um par�metro num�rico estamos a assumir compara��o num�rica (onde 9 < 234)

Assim, depois disto a regra parecia muito simples. Algo como "Se usarmos um par�metro num�rico contra uma coluna do tipo CHAR, todos os valores da coluna ser�o convertidos para num�rico. Por outro lado, usar um par�metro CHAR contra uma coluna num�rica permite que o motor converta o par�metro para n�mero e use um �ndice caso exista".

Mas a vida n�o � assim t�o simples... Se era assim t�o direto porque raz�o n�o estava documentado (ou pelo menos n�s n�o t�nhamos encontrado)? Tentei procurar na documenta��o de outros sistemas de gest�o de bases de dados (Oracle e MS SQL) , e nestes casos eram bastante claros sobre o assunto. Um resumo livre seria "sempre que um CAST impl�cito seja necess�rio, seguimos uma tabela de preced�ncias de tipos de dados. O tipo de dado com menor preced�ncia ser� convertido para o que tem mais preced�ncia". Sinceramente isto pareceu-me uma forma correta de colocar a quest�o, e se faz�amos o mesmo porque n�o ter isto claro na documenta��? Assim o PMR j� aberto parecia encaminhar-se para um bug de documenta��o.

Mas novamente, a vida por vezes n�o � simples... E enquanto isto estava a ser analisado e discutido , a equipa do cliente descobriu um cen�rio interessante: Se usarmos um numero como filtro contra uma coluna do tipo CHAR, e o numero de d�gitos desse inteiro (excluindo quaisquer zeros � esquerda) for igual ao n�mero de caracteres definido na coluna, ent�o o n�mero ser� convertido para CHAR e um eventual �ndice na coluna ser� usado.
Isto � o optimizador a ser "esperto". Se pensarmos sobre o assunto, se o n�mero de d�gitos do n�mero for igual ao n�mero de caracteres da coluna, podemos convert�-lo para CHAR e compar�-lo com a coluna. O resultado ser� o correto independentemente dos valores em quest�o e do operador.

Para terminar a hist�ria, enquanto consult�va-mos a documenta��o devido a outro assunto, demos com o seguinte:

http://publib.boulder.ibm.com/infocenter/idshelp/v117/topic/com.ibm.perf.doc/ids_prf_536.htm

Est� claro e bem explicado. O Informix efetua os CASTS necess�rios para resolver queries onde existam inconsist�ncias entre os tipos de dados. Mas isso pode ter um impacto significativo na performance, especialmente se existir um �ndice na coluna. E o optimizador � suficientemente inteligente para obter um melhor plano de execu��o na �nica situa��o onde isso pode ser feito. Muito correto e pelo menos aqui (Guia de Performance) est� bem explicado.
Sinceramente n�o me importo muito quando um PMR d� origem a um bug, pois isso traduz-se numa melhoria do produto, mas tenho de admitir que prefiro que resulte que estava enganado e ser-me mostrado que o produto est� a funcionar bem!