Optimizing queries. The basics of EXPLAIN in PostgreSQL (part 2)


Podolzhayut to publish the author's processing Understanding EXPLAIN from Guillaume Lelarge.
Again note that some information has been omitted for brevity, so I strongly recommend to read the original.
Previous part:

Part 1

Cache


What happens on a physical level, with follow through with our request? Face it. My server raised to Ubuntu 13.10. Use disk level caches of the OS.
Stop PostgreSQL, force detecting changes in the file system, clear the caches, start PostgreSQL:
the
> sudo service postgresql-9.3 stop
> sudo sync
> sudo su -
# echo 3 > /proc/sys/vm/drop_caches
# exit
> sudo service postgresql-9.3 start

Now the caches are cleared, try to execute the query with the option BUFFERS
the
EXPLAIN (ANALYZE,BUFFERS) SELECT * FROM foo;

QUERY PLAN
— Seq Scan on foo (cost=0.00..18334.10 rows=1000010 width=37) (actual time=0.525..734.754 rows=1000010 loops=1)
Buffers: shared read=8334
Total runtime: 1253.177 ms
(3 rows)

The table is read out parts of the blocks. The cache is empty. The entire table is read from disk. This had to count 8334 block.
the Buffers: shared read — the number of blocks read from disk.

Repeat the last request
the
EXPLAIN (ANALYZE,BUFFERS) SELECT * FROM foo;

QUERY PLAN
— Seq Scan on foo (cost=0.00..18334.10 rows=1000010 width=37) (actual time=0.173..693.000 rows=1000010 loops=1)
Buffers: shared hit=32 read=8302
Total runtime: 1208.433 ms
(3 rows)

the Buffers: shared hit — number of blocks read from cache and PostgreSQL.
If you repeat this query multiple times, you will see how PostgreSQL each time more data is recovered from the cache. With each query PostgreSQL fills its cache.
Reading from cache faster than reading from disk. Can see this trend by tracking the value of Total runtime.
The cache size is determined by the constant shared_buffers in the file postgresql.conf.

WHERE


Add a condition to your query
the
EXPLAIN SELECT * FROM foo WHERE c1 > 500;

QUERY PLAN
— Seq Scan on foo (cost=0.00..20834.12 rows=999522 width=37)
Filter: (c1 > 500)
(2 rows)

Indexes the table has no. When the query runs, sequentially reads each record of the table (Seq Scan). Each record is compared with the condition c1 > 500. If the condition is met, the record is entered into the result. Otherwise — discarded. Filter means exactly this behavior.
The value of cost, which is logical, has increased.
The expected number of rows of result is rows is decreased.
the original explains why cost takes precisely this value and how to calculate the expected number of rows.

It's time to create the indexes.
the
CREATE INDEX ON foo(c1);
EXPLAIN SELECT * FROM foo WHERE c1 > 500;

QUERY PLAN
— Seq Scan on foo (cost=0.00..20834.12 rows=999519 width=37)
Filter: (c1 > 500)
(2 rows)

The expected number of rows has changed. Specified. Otherwise, nothing new. What about the index?
the
EXPLAIN (ANALYZE) SELECT * FROM foo WHERE c1 > 500;

QUERY PLAN
— Seq Scan on foo (cost=0.00..20834.12 rows=999519 width=37) (actual time=0.572..848.895 rows=999500 loops=1)
Filter: (c1 > 500)
Rows Removed by Filter: 510
Total runtime: 1330.788 ms
(4 rows)

Filtered only 510 lines of more than a million. We had to consider more than 99.9% of the table.

Force will force to use the index, prohibiting Seq Scan:
the
SET enable_seqscan TO off;
EXPLAIN (ANALYZE) SELECT * FROM foo WHERE c1 > 500;

QUERY PLAN
— Index Scan using foo_c1_idx on foo (cost=0.42..34623.01 rows=999519 width=37) (actual time=0.178..1018.045 rows=999500 loops=1)
Index Cond: (c1 > 500)
Total runtime: 1434.429 ms
(3 rows)

Index Scan, Index Cond instead of Filter — the index foo_c1_idx.
In the sample almost the entire table using the index only increases the cost and the query execution time. The scheduler is not stupid!

Don't forget to lift the ban on the use of a Seq Scan:
the
SET enable_seqscan TO on;


the
EXPLAIN SELECT * FROM foo WHERE c1 < 500;

QUERY PLAN
— Index Scan using foo_c1_idx on foo (cost=0.42..25.78 rows=491 width=37)
Index Cond: (c1 < 500)
(2 rows)

Here the planner has decided to use the index.

Let's complicate a condition. Use the text field.
the
EXPLAIN SELECT * FROM foo
WHERE c1 < 500 AND c2 LIKE 'abcd%';

QUERY PLAN
— Index Scan using foo_c1_idx on foo (cost=0.42..27.00 rows=1 width=37)
Index Cond: (c1 < 500)
Filter: (c2 ~~ 'abcd%'::text)
(3 rows)

As you can see, use the index foo_c1_idx for the condition c1 < 500. For c2 ~~ 'abcd%'::text filter is used.
Please note that the output uses POSIX the LIKE operator.

If the condition only text field:
the
EXPLAIN (ANALYZE)
SELECT * FROM foo WHERE c2 LIKE 'abcd%';

QUERY PLAN
— Seq Scan on foo (cost=0.00..20834.12 rows=100 width=37) (actual time=..14.497 412.030 rows=10 loops=1)
Filter: (c2 ~~ 'abcd%'::text)
Rows Removed by Filter: 1000000
Total runtime: 412.120 ms
(4 rows)

Expected, Seq Scan.

Build index on c2:
the
CREATE INDEX ON foo(c2);
EXPLAIN (ANALYZE) SELECT * FROM foo
WHERE c2 LIKE 'abcd%';

QUERY PLAN
— Seq Scan on foo (cost=0.00..20834.12 rows=100 width=37) (actual time=..20.992 424.946 rows=10 loops=1)
Filter: (c2 ~~ 'abcd%'::text)
Rows Removed by Filter: 1000000
Total runtime: 425.039 ms
(4 rows)

Again, Seq Scan? The index is not used because the base I have for the text field uses the UTF-8 format.
When you create the index in such cases it is necessary to use the class operator text_pattern_ops:
the
CREATE INDEX ON foo(c2 text_pattern_ops);
EXPLAIN SELECT * FROM foo WHERE c2 LIKE 'abcd%';

QUERY PLAN
— Bitmap Heap Scan on foo (cost=4.58..55.20 rows=100 width=37)
Filter: (c2 ~~ 'abcd%'::text)
-> Bitmap Index Scan on foo_c2_idx1 (cost=0.00..4.55 rows=13 width=0)
Index Cond: ((c2 ~ > =~ 'abcd'::text) AND (c2 ~ < ~ 'abce'::text))
(4 rows)

Yay! Did it!
Bitmap Index Scan — the index foo_c2_idx1 to define the desired records, and then climbs in PostgreSQL the table itself - (Bitmap Heap Scan) - to make sure that these records actually exist. This behavior is associated with versioned PostgreSQL.

If you choose not the whole line, and only the field on which the index is built.
the
EXPLAIN SELECT c1 FROM foo WHERE c1 < 500;

QUERY PLAN
— Index Only Scan using foo_c1_idx on foo (cost=0.42..25.78 rows=491 width=4)
Index Cond: (c1 < 500)
(2 rows)

Index Only Scan is faster than Index Scan due to the fact that you do not have to read the table row entirely: width=4.

Summary


the
    the
  • Seq Scan reads the entire table.
  • the
  • Index Scan — index is used for WHERE clauses, reads the table when selecting rows.
  • the
  • Bitmap Index Scan — Index Scan first, then the control sample on the table. Effectively for a large number of rows.
  • the
  • Index Only Scan the fastest. Read only the index.


Part 3
Article based on information from habrahabr.ru

Комментарии

Популярные сообщения из этого блога

Integration of PostgreSQL with MS SQL Server for those who want faster and deeper

Custom database queries in MODx Revolution

Parse URL in Zend Framework 2