Integration of PostgreSQL with MS SQL Server for those who want faster and deeper

Recently habré was already published integration PostgreSQL and MSSQL. But, items there is categorically not enough. Therefore, the objectives of this publication are the following:

the

to broaden and deepen publicly available information about FDW for MSSQL-called tds_fdw to tell the difference in major versions and describe basic compatibility issues;
to talk about the possibilities of optimization of queries that use the foreign table;
to address the subject of caching external data in a materialized view;
to say a few words about the exotic ways to integrate PostgreSQL and MSSQL.

the

Installing and configuring TDS FDW

The guys from PostgresPro already sufficiently said about this process, I will not repeat. Will leave a few links to the official PostgreSQL documentation and examples from tds_fdw:

the

creating an extension (extension) in PG: off.dokumentacija;
server: off.documentation, example of tds_fdw;
creating users (user mapping): off.documentation, example of tds_fdw;
create a foreign table (foreign table): off.dokumentacija, example of tds_fdw.

And one more thing: please do not do as specified in installation instructions tds_fdw

the

sudo make USE_PGXS=1 install

rescue cats, build deb package, and enjoy your life:

the

sudo USE_PGXS=1 checkinstall

the

the differences between the major versions of TDS FDW

At the moment there are two current versions FDW'shki: stable 1.0.7 and 2.0.0-alpha that, in fact, is the master branch and where all the excitement happens. Here is a small list of their differences:

the

in 2.0.0, finally adds support for pushdown conditions in the WHERE clause, directly related to the external table; however, it doesn't work when using option query in the Declaration foreign_table;
added support for tds 7.4 version (below I will describe why it is necessary and important);
there are some problems with the work of the DISTINCT on the outer table (issue on github), although not known for certain: my hands do not grow out or the bug is quite tricky and manifests itself only under certain circumstances.

the

pitfalls compatibility

Until recently, tds_fdw did not work with the version of the tds is above 7.3. But during the writing of this article support the version 7.4 had seeking. So now, since the commit 3a803c, tds_fdw supports all current version of tds.

Why is this version so important? For me personally this is important because of the need to work with MSSQL 2012. In short: in Ruby on Rails to connect to the MSSQL library is used activerecord-sqlserver-adapter, which, in turn, uses tiny_tds that uses FreeTDS, who is able to communicate with MSSQL. The trouble is that for RoR 3 and the respective major versions of libraries, using tds version 7.1 nailed down and change it through the config can only be in 4+ version. The version 7.1 works fine with MSSQL 2008, but when dealing with MSSQL 2012 receive the following error:

the

DB-Library error: DB #: 20017, DB Msg: Unexpected EOF from the server

the

ActiveRecord::LostConnection: TinyTds::Error: closed connection: ...

the

TinyTds::Error: Adaptive Server connection failed

They wanted to avoid by using FDW as to update RoR is strictly longer and more expensive. But tds_fdw did not support the correct version and had to do something.

As for mistakes, they all appear randomly and grow because of one and the same place; bring some "diversity" in the app, causing it to fall off in random places at random times. Treated all this mess just by using the correct tds version. For MSSQL 2012 is tds 7.4.

Here the first ambush: support version tds 7.4 implemented in FreeTDS from version 0.95. But out of the box in Ubuntu 14.04 and 16.04 go version of 0.91-5 and 0.91-6.1build1, respectively. And get a newer version of FreeTDS in two ways:

build FreeTDS from source;
to use alternative PPA with the FreeTDS version 1.00.

In the second case, there is one caveat: the specified repository has a package only for Ubuntu 14.04 (trusty which). To 16.04 (which xenial), there's nothing there. But, in General, nothing fatal, and if correct 16.04 /etc/apt/sources.list.d/jamiewillis-freetds-trusty.list something like this:

the

deb http://ppa.launchpad.net/jamiewillis/freetds/ubuntu trusty main

It will be possible to put the package in latest Ubuntu (and Yes, it works without problems).

If you have CentOS, then it is possible easy to find FreeTDS to version 0.95, inclusive. All over will have to build from source.

a workaround for the compatibility issues

If the error number 20017 and its derivatives are very much bother, and the ability to Taipalsaari the required version of the tds, we can handle the exception, throw PostgreSQL and restart the block/method/etc that access MSSQL via FDW. In my case for RoR apps, it looked like this:

the

def retry_mssql_operation(tries = 5)
begin
yield
rescue ActiveRecord::StatementInvalid = > e
if e.message =~ /^PG::FdwUnableToCreateExecution/ && tries > 0
tries -= 1
retry
else
raise
end
end
end

For the first time save, but for a long-term solution is strictly not suitable.

the

a Little about pushdown and how FDW works "fingers"

Before proceeding to the optimization of queries to external database I would like to say a few words about pushdown. For some reason the description of this mechanism is not popular in Russian-speaking resources (either I'm not familiar with its correct translation and tritasavit press down on the block this is clearly not out of the Opera). So I want to briefly tell you about it.

In the simplest case, when we are in the PG running the query:

the

SELECT column_name FROM foreign_table WHERE column_id = 42;

In fact in the database proishodil the following:

from assotsiirovannaya with the foreign_table the table (or tables) located on an external server, retrieves all the content to postgres;
then, the received data is filtered based on the conditions of the WHERE clause.

Not very effective scheme, especially if from the table with several million rows I want to get only one. And here there is pushdown. This mechanism allows to reduce the number of lines that we get from the remote server. This is done by constructing a query to an external DB based on what we want on the PG side, that is, given that stated in WHERE, JOIN, ORDER etc. in other words, FDW parses the original query in PotsgreSQL, choose, can realize remote data store to a query according to these conditions. Hence the obvious consequence: pushdown is not applicable for all FDW (for example, file_fdw pushdown is almost useless, but for postgres_fdw or tds_fdw — quite the contrary).

Total: pushdown is cool, it allows you to use the external data store, reduces the amount of data circulating between PG and external storage, thus speeding up the execution of queries, but it is a separate mechanism, so it is necessary to implement, maintain and it's a fairly trivial task.
the

Boost requests

With the installation, configuration, and materiel understood. We now proceed to the description of how to quickly retrieve data from MSSQL.

Pushdown

Useful to this approach in the case of simple queries, not burdened with various JOIN and other SQL tricks. In the latest version tds_fdw (currently, it is 2.0.0-alpha) adds support for pushdown simple WHERE clause.

For example, consider the table simple_table from the database MSSQL. This table has two fields: id and data. The external table definition for it is:

the

CREATE FOREIGN TABLE mssql_table (
id integer,
custom_data varchar OPTIONS (column_name 'data'))
SERVER mssql_svr
OPTIONS (schema_name 'dbo',
table_name 'simple_table',
row_estimate_method 'showplan_all',
match_column_names '1');

In this case, the first column has the same name in PostgreSQL and MSSQL: id. The second column different names in PG and MSSQL, so here the desired option is column_name. This parameter explicitly sets the column display the columns from PostgreSQL to MSSQL. Also, at the end of the specified match_column_name, which is responsible for implicit mapping of the column names by name, that is, thanks to him, mapetla the id column.

All, now if you run this query:

the

SELECT custom_data FROM mssql_table WHERE id = 42;

FDW have to handle the condition specified in WHERE clause and collect the correct query in MSSQL. For example like this:

the

SELECT data FROM simple_table WHERE id = 42;

In the case tds_fdw version 1.0.7 and below is the query in MSSQL is another:

the

SELECT id, data FROM simple_table;

Once again, pushdown, at the moment, it only works for WHERE clauses; for JOIN, ORDER and other functions like MAX, LOWER, etc. it won't fly up.

And one more thing: how do you know what the query actually executed on the MSSQL side? When using FDW for example, MySQL, the explain you get this kind of line:

the


Remote query: SELECT `id`, `customer_id`, `order_date` FROM `data`.`orders`

And it's convenient. In tds_fdw this yet and need to go the longer way through the FreeTDS logs. By default, the FreeTDS logs off, but it's easy to fix scratch in /etc/freetds/freetds.conf. There you can find here the following lines:

the

; dump file = /tmp/freetds.log
; debug flags = 0xffff

Which need to remove the semicolon at the beginning.

Now for any query from PG to MSSQL, FreeTDS will log everything I can. This will slow down the execution of all external requests and can spawn a bunch of logs (in my case normal SELECT made a log ~300Mb, and the JOIN was barely radilsya in ~1.5 GB). But in logic will be seen that the are actually executed in MSSQL. Besides, the volume of logs can be reduced, play around with `debug flags`. Read more about logging in FreeTDS to write here, and details on `debug flags` lie here.

Materialized view

Materialized view (hereinafter MV) is a normal representation of the + data table. This approach will help in the case of complex queries with joyname external and internal tables, functions, preference and courtesans.

Profit from MV as follows: it are "native" object for the PG, that is, MV is wonderful interacts with the rest of PostgreSQL, and it can be indexed and analyzed, regardless of the data source that filled him. Cons too: the MV needs to be updated. You can update via the internal triggers, external events can completely re-build and so on. But, in any case, MV creates the gap PG from the source data.

For the above external table, you can create MV as follows:

the

CREATE MATERIALIZED VIEW materialized_mssql_table AS
SELECT id, custom_data
FROM mssql_table;

Now all the data from MSSQL to PostgreSQL is, so they can be indexed as you wish (B-tree, GIN and GiST, etc.), is available statistics, you can see details about the query execution plan and much more enjoyable of PG.
To update the MV through a standard INCERT/UPDATE/DELETE commands, or just recreate all the content with

the

REFRESH MATERIALIZED VIEW CONCURRENTLY materialized_mssql_table;

The CONCURRENTLY option allows you to update the MV is not blocking a competing read requests, but requires more time and resources. Also, to be able to use CONCURRENTLY the target MV must satisfy some requirements. You can find them at the corresponding documentation page.

Exotic approach

Honestly, God knows whether to fly this approach may be dear audience, say something interesting on this account. In any case, I think you need to say about him, since specialized resources, most of the questions on the integration of two databases reply "use FDW" and no change is expected, even if you want strange.

So when you might need it: if all of the above options didn't help due to various constraints. For example:

the

obscenely large amount of external databases and the impossibility of cloning in PG;
stringent demands on performance and optimal query in an external database;
desire to execute a parameterized query, i.e. the equivalent of selecting query for FDW, only the dynamic parameter, for example I want to use full text search on MSSQL side through the function CONTAINS;
something unusual.

What to use: dbi-link or dblink-tds. This is analogues dblink' but with the support of several DBMS: PostgreSQL, MySQL, MSSQL Server and Oracle in the case of dbi-link and just TDS'and in the case of dblink-tds.

As seen the mechanics of the job: as a specialized analogue of the FDW in the form of a function PG which collects inside your query based on the passed arguments, executes it in the external DB using the above tools, retrieves the data, processes them, and returns them in PG as pipeline-funkcia. I mean, hypothetically, you can run the query that I want to present the result in a form digestible for further processing in PG.

All of the above is pure theoretical considerations. If you have real experience using these or similar tools, please share your knowledge with the world.

the

Conclusion

Currently the only simple and working solution for interfacing PostgreSQL and MSSQL. This tds_fdw. He has a lot of flaws, but the project is progressing, bugs repaired, the features are rolled out and it's great. So tds_fdw can solve most of the problems associated with extracting data from MSSQL using PG. Those who want faster, better and with courtesans will help PostgreSQL and its rich Arsenal of tools for optimization. And those who want a very strange and wants to do everything in the database with a minimum of external services will be in difficulty. The ancient tools, no documentation, no support, ~~inhabited by robots~~ and in addition to reading source code does not help.

Article based on information from habrahabr.ru

Поиск по этому блогу

computer express