Greg Sabino Mullane: Postgres migrating SQL_ASCII to UTF-8 with fix_latin

Upgrading Postgres is not quite as painful as it used to be, thanks primarily to the pg_upgrade program, but there are times when it simply cannot be used. We recently had an existing End Point client come to us requesting help upgrading from their current Postgres database (version 9.2) to the latest version (9.6 - but soon to be 10). They also wanted to finally move away from their SQL_ASCII encoding to UTF-8. As this meant that pg_upgrade could not be used, we took the opportunity to enable checksums as well (another change that cannot be done via pg_upgrade). Finally, they were moving their database server to new hardware. There were many lessons learned and bumps along the way for this migration, but for this post I'd like to focus on one of the most vexing problems: the database encoding.

When a Postgres database is created, it is set to a specific encoding. The most common one (and the default) is "UTF8". This covers 99% of all users' needs. The second most common one is the poorly-named "SQL_ASCII" encoding, which should be named "DANGER_DO_NOT_USE_THIS_ENCODING", because it causes nothing but trouble. The SQL_ASCII encoding basically means no encoding at all, and simply stores any bytes you throw at it. This usually means the database ends up containing a whole mess of different encodings, creating a "byte soup" that will be difficult to sanitize when moving to a real encoding (i.e. UTF-8).
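
If you are not sure what your own databases are set to, the encodings are visible right in the system catalogs, and a new database can be created with an explicit encoding. A minimal sketch (the database name below is just a placeholder):

-- Audit the encoding of every database in the cluster; any SQL_ASCII
-- entry is a candidate for the kind of migration described here:
SELECT datname, pg_encoding_to_char(encoding) AS encoding
  FROM pg_database;

-- Create a new database with an explicit UTF-8 encoding; template0 is
-- used so the template's own encoding cannot get in the way:
CREATE DATABASE utf8_target
  TEMPLATE template0
  ENCODING 'UTF8';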

Many tools exist which convert text from one encoding to another. One of the most popular ones on Unix boxes is "iconv". Although this program works great if your source text is using one encoding, it fails when it encounters byte soup.

For this migration, we first did a pg_dump from the old database to a newly created UTF-8 test database, just to see which tables had encoding problems. Quite a few did - but not all of them! - so we wrote a script to import tables in parallel, with some filtering for the problem ones. As mentioned above, iconv was not particularly helpful: looking at the tables closely showed evidence of many different encodings in each one: Windows-1252, ISO-8859-1, Japanese, Greek, and many others. There were even large bits that were plainly binary data (e.g. images) that simply got shoved into a text field somehow. This is the big problem with SQL_ASCII: it accepts *everything*, and does no validation whatsoever. The iconv program simply could not handle these tables, even when adding the //IGNORE option.

To better explain the problem and the solution, let's create a small text file with a jumble of encodings. Discussions of how UTF-8 represents characters, and its interactions with Unicode, are avoided here, as Unicode is a dense, complex subject, and this article is dry enough already. :)

First, we want to add some items using the encodings 'Windows-1252' and 'Latin-1'. These encoding systems were attempts to extend the basic ASCII character set to include more characters. As these encodings pre-date the invention of UTF-8, they do it in a very inelegant (and incompatible) way. Use of the "echo" command is a great way to add arbitrary bytes to a file as it allows direct hex input:

$ echo -e "[Windows-1252]   Euro: \x80   Double dagger: \x87" > sample.data
$ echo -e "[Latin-1]   Yen: \xa5   Half: \xbd" >> sample.data
$ echo -e "[Japanese]   Ship: \xe8\x88\xb9" >> sample.data
$ echo -e "[Invalid UTF-8]  Blob: \xf4\xa5\xa3\xa5" >> sample.data

This file looks ugly. Notice all the "wrong" characters when we simply view the file directly:

$ cat sample.data
[Windows-1252]   Euro: �   Double dagger: �
[Latin-1]   Yen: �   Half: �
[Japanese]   Ship: 船
[Invalid UTF-8]  Blob: ����

Running iconv is of little help:

## With no source encoding given, it errors on the Euro:
$ iconv -t utf8 sample.data >/dev/null
iconv: illegal input sequence at position 23

## We can tell it to ignore those errors, but it still barfs on the blob:
$ iconv -t utf8//ignore sample.data >/dev/null
iconv: illegal input sequence at position 123

## Telling it the source is Windows-1252 fixes some things, but still sinks the Ship:
$ iconv -f windows-1252 -t utf8//ignore sample.data
[Windows-1252]   Euro: €   Double dagger: ‡
[Latin-1]   Yen: ¥   Half: ½
[Japanese]   Ship: èˆ¹
[Invalid UTF-8]  Blob: ô¥£¥

After testing a few other tools, we discovered the nifty Encoding::FixLatin, a Perl module which provides a command-line program called "fix_latin". Rather than being authoritative like iconv, it tries its best to fix things up with educated guesses. Its documentation gives a good summary:

  The script acts as a filter, taking source data which may contain a mix of
  ASCII, UTF8, ISO8859-1 and CP1252 characters, and producing output will be
  all ASCII/UTF8.

  Multi-byte UTF8 characters will be passed through unchanged (although
  over-long UTF8 byte sequences will be converted to the shortest normal
  form). Single byte characters will be converted as follows:

    0x00 - 0x7F   ASCII - passed through unchanged
    0x80 - 0x9F   Converted to UTF8 using CP1252 mappings
    0xA0 - 0xFF   Converted to UTF8 using Latin-1 mappings

While this works great for fixing the Windows-1252 and Latin-1 problems (and thus accounted for at least 95% of our tables' bad encodings), it still allows "invalid" UTF-8 to pass on through, which means that Postgres will still refuse to accept it. Let's check our test file:

$ fix_latin sample.data
[Windows-1252]   Euro: €   Double dagger: ‡
[Latin-1]   Yen: ¥   Half: ½
[Japanese]   Ship: 船
[Invalid UTF-8]  Blob: ����

## Postgres will refuse to import that last part:
$ echo "SELECT E'"  "$(fix_latin sample.data)"  "';" | psql
ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5

## Even adding iconv is of no help:
$ echo "SELECT E'"  "$(fix_latin sample.data | iconv -t utf-8)"  "';" | psql
ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5

The UTF-8 specification is rather dense and puts many requirements on encoders and decoders. How well programs implement these requirements (and optional bits) varies, of course. But at the end of the day, we needed that data to go into a UTF-8 encoded Postgres database without complaint. When in doubt, go to the source! The relevant file in the Postgres source code responsible for rejecting bad UTF-8 (as in the examples above) is src/backend/utils/mb/wchar.c. Analyzing that file shows a small but elegant piece of code whose job is to ensure only "legal" UTF-8 is accepted:

bool
pg_utf8_islegal(const unsigned char *source, int length)
{
  unsigned char a;

  switch (length)
  {
    default:
      /* reject lengths 5 and 6 for now */
      return false;
    case 4:
      a = source[3];
      if (a < 0x80 || a > 0xBF)
        return false;
      /* FALL THRU */
    case 3:
      a = source[2];
      if (a < 0x80 || a > 0xBF)
        return false;
      /* FALL THRU */
    case 2:
      a = source[1];
      switch (*source)
      {
        case 0xE0:
          if (a < 0xA0 || a > 0xBF)
            return false;
          break;
        case 0xED:
          if (a < 0x80 || a > 0x9F)
            return false;
          break;
        case 0xF0:
          if (a < 0x90 || a > 0xBF)
            return false;
          break;
        case 0xF4:
          if (a < 0x80 || a > 0x8F)
            return false;
          break;
        default:
          if (a < 0x80 || a > 0xBF)
            return false;
          break;
      }
      /* FALL THRU */
    case 1:
      a = *source;
      if (a >= 0x80 && a < 0xC2)
        return false;
      if (a > 0xF4)
        return false;
      break;
  }
  return true;
}
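
Incidentally, you do not need to read C to poke at these rules: the convert_from() function pushes incoming bytes through the same kind of validation, so it offers a quick SQL-level check of whether a given byte sequence will be accepted (a small illustration, reusing the bytes from our sample file):

-- The valid UTF-8 sequence for the ship character is accepted:
SELECT convert_from('\xe888b9'::bytea, 'UTF8');

-- The four "blob" bytes are rejected with the same complaint we saw
-- from psql above (invalid byte sequence for encoding "UTF8"):
SELECT convert_from('\xf4a5a3a5'::bytea, 'UTF8');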

Now that we know the UTF-8 rules for Postgres, how do we ensure our data follows it? While we could have made another standalone filter to run after fix_latin, that would increase the migration time. So I made a quick patch to the fix_latin program itself, rewriting that C logic in Perl. A new option "--strict-utf8" was added. Its job is to simply enforce the rules found in the Postgres source code. If a character is invalid, it is replaced with a question mark (there are other choices for a replacement character, but we decided simple question marks were quick and easy - and the surrounding data was unlikely to be read or even used anyway).

Voila! All of the data was now going into Postgres without a problem. Observe:

$ echo "SELECT E'"  "$(fix_latin  sample.data)"  "';" | psql
ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5

$ echo "SELECT E'"  "$(fix_latin --strict-utf8 sample.data)"  "';" | psql
                   ?column?                   
----------------------------------------------
  [Windows-1252]   Euro: €   Double dagger: ‡+
 [Latin-1]   Yen: ¥   Half: ½                +
 [Japanese]   Ship: 船                       +
 [Invalid UTF-8]  Blob: ???? 
(1 row)

What are the lessons here? First and foremost, never use SQL_ASCII. It's outdated, dangerous, and will cause much pain down the road. Second, there are an amazing number of client encodings in use, especially for old data, but the world has pretty much standardized on UTF-8 these days, so even if you are stuck with SQL_ASCII, the amount of Windows-1252 and other monstrosities will be small. Third, don't be afraid to go to the source. If Postgres is rejecting your data, it's probably for a very good reason, so find out exactly why. There were other challenges to overcome in this migration, but the encoding was certainly one of the most interesting ones. Everyone, the client and us, is very happy to finally have everything using UTF-8!

Gabriele Bartolini: Barman 2.2 and the magic of parallel copy

Barman 2.2 introduces support for parallel copy, improving the performance of both backup and recovery operations in your PostgreSQL disaster recovery solution.

Barman is a piece of software that has been incrementally improved since its conception in 2011. Brick after brick, with just one goal in mind: foster a disaster recovery culture in PostgreSQL, by making the whole backup/recovery process easier and more standard.

Barman is full of interesting features that go beyond disaster recovery (consider the WAL hub facility implemented via barman_wal_restore). Just to name a few: support for rsync/SSH and streaming backup, support for both WAL archiving and streaming (including synchronous for zero data loss clusters), a monitoring facility, incremental backup and recovery, hook scripts, and so on.

However, when managing large databases, Barman suffered from being bound to one CPU for backup and recovery operations. Lately, this was perceived as Barman’s main weak spot by users, and we decided to fill the gap.

Version 2.2 introduces support for parallel backup and recovery when using the rsync copy method, allowing you to specify how many jobs you want to run concurrently.

We have added one global option, called parallel_jobs, which can be overridden at server level. For backward compatibility, this option is set to 1 by default. It controls parallelism for both the backup and the recover command.

[vaughan]
description =  "Backup of SRV database"
ssh_command = ssh postgres@vaughan
conninfo = host=vaughan user=barman dbname=postgres
backup_method = rsync
parallel_jobs = 4
; … more options here

In some cases though, users might want to change the default behaviour and decide how many jobs are requested for a backup or recovery operation. For this reason we have implemented the --jobs option (or -j) for both the backup and recover command.

If you want to spread your backup over 8 rsync processes, you can simply execute:

$ barman backup -j 8 vaughan

Likewise, for recovery:

$ barman recover -j 8 [...] vaughan [...]

Another interesting change is in the show-backup command. This is an excerpt taken from one of the Subito.it databases (thanks for the kind concession and for co-funding the development of this feature). You can appreciate the improvement:

$ barman show-backup pg95 last

       ... [snip] ...
  Base backup information:
    Disk usage           : 1.8 TiB (1.8 TiB with WALs)
    Incremental size     : 672.6 GiB (-62.76%)
       ... [snip] ...
    WAL number           : 392
    WAL compression ratio: 60.68%
    Begin time           : 2017-06-15 01:00:02.929344+02:00
    End time             : 2017-06-15 02:55:06.626676+02:00
    Copy time            : 1 hour, 29 minutes, 31 seconds + 6 seconds startup
    Estimated throughput : 128.2 MiB/s (4 jobs)
       ... [snip] ...

Roughly, with their 1.8 terabyte database, Subito.it has reduced their backup time by 60% (from 3 hours and 40 minutes, to less than 1 hour and 30 minutes). Similarly, they have reduced recovery time by 40% (from 5 hours and 20 minutes, to 3 hours and 10 minutes) by increasing the number of jobs from 1 to 4.

Indeed, Subito.it automatically tests their backups through post-backup hook scripts that re-create a reporting database from scratch every day (watch my presentation from 2015 at PgConf.US for details). Thanks to this feature, Subito.it is able to provision a database to their BI department almost 5 hours earlier!

It goes without saying that there is not a precise formula for this, as many variables come into play, including I/O and network throughput. But it is definitely another option you now have with Barman.

Barman 2.2 also fixes a few outstanding bugs and improves the robustness of your PostgreSQL disaster recovery installation by adding the max_incoming_wals_queue option, which makes sure that your WAL files are regularly archived by Barman.

As with any other previous release, just update your current Barman installation and you will be able to experience parallel backup and recovery.

We believe this is a killer feature; let us know if you like it and share your feedback with us!

David Fetter: Cool Stuff in PostgreSQL 10: Auto-logging

We started off by creating a logging infrastructure, then arranging for a single table to use it. Rather than repeat that work for each table, let's use a relatively although not completely new feature: `EVENT TRIGGER`. The idea here is that we fire a trigger on `CREATE TABLE` and see to it that the table is logged. We'll write the trigger first, even though in reality, we'd need to load the function it calls first.

```sql
CREATE EVENT TRIGGER add_logger
    ON ddl_command_end
    WHEN tag IN ('create table')
    EXECUTE PROCEDURE add_logger();

COMMENT ON EVENT TRIGGER add_logger IS 'Ensure that each table which is not a log gets logged';
```

The magic happens inside `add_logger()`, but it's magic we've already seen. First, we'll get the table's name and schema using `pg_event_trigger_ddl_commands()`, filtering out tables which are already log tables. The test here is crude and string-based, but we could easily go to schema-based ones.

```sql
CREATE OR REPLACE FUNCTION add_logger()
RETURNS event_trigger
LANGUAGE plpgsql
AS $$
DECLARE
    r   RECORD;
    cmd TEXT;
BEGIN
    SELECT p.*, c.relname AS table_name
    INTO STRICT r
    FROM pg_catalog.pg_event_trigger_ddl_commands() p
    JOIN pg_catalog.pg_class c ON (p.objid = c.oid)
    WHERE p.object_type = 'table'
      AND c.relname !~ '_log$'; /* No recursing! */

    IF NOT FOUND THEN
        RAISE NOTICE 'Skipping log table';
        RETURN;
    END IF;
```

Next, we add the log table for the table in question:

```sql
    RAISE NOTICE 'Adding log table(s) for %.%', r.schema_name, r.table_name;

    cmd := format(
        'CREATE TABLE IF NOT EXISTS %I PARTITION OF the_log FOR VALUES IN (%L) PARTITION BY LIST(table_name);',
        pg_catalog.concat_ws('_', r.schema_name, 'log'),
        r.schema_name
    );
    EXECUTE cmd;

    cmd := format(
        'CREATE TABLE IF NOT EXISTS %I PARTITION OF %s FOR VALUES IN (%L);',
        pg_catalog.concat_ws('_', r.schema_name, r.table_name, 'log'),
        pg_catalog.concat_ws('_', r.schema_name, 'log'),
        r.table_name
    );
    EXECUTE cmd;
```

Finally, the triggers that populate it:

```sql
    cmd := format(
        $q$CREATE TRIGGER %I AFTER INSERT ON %I.%I
           REFERENCING NEW TABLE AS new_table
           FOR EACH STATEMENT EXECUTE PROCEDURE public.log();$q$,
        pg_catalog.concat_ws('_', 'log_insert', r.schema_name, r.table_name),
        r.schema_name,
        r.table_name
    );
    EXECUTE cmd;

    cmd := format(
        $q$CREATE TRIGGER %I AFTER UPDATE ON %I.%I
           REFERENCING OLD TABLE AS old_table NEW TABLE AS new_table
           FOR EACH STATEMENT EXECUTE PROCEDURE public.log();$q$,
        pg_catalog.concat_ws('_', 'log_update', r.schema_name, r.table_name),
        r.schema_name,
        r.table_name
    );
    EXECUTE cmd;

    cmd := format(
        $q$CREATE TRIGGER %I AFTER DELETE ON %I.%I
           REFERENCING OLD TABLE AS old_table
           FOR EACH STATEMENT EXECUTE PROCEDURE public.log();$q$,
        pg_catalog.concat_ws('_', 'log_delete', r.schema_name, r.table_name),
        r.schema_name,
        r.table_name
    );
    EXECUTE cmd;

EXCEPTION
    WHEN no_data_found THEN
        NULL;
    WHEN too_many_rows THEN
        RAISE EXCEPTION 'This function should only fire on one table, not this list: %',
            r.object_identity;
END;
$$;
```

This concludes this mini-series on auto-logging. I hope you've enjoyed it, and until the next time, keep database doing!

Nftables port knocking

One of the main advantages of nftables over iptables is its native handling of sets. They can be used for multiple purposes and, thanks to the timeout capabilities, it is easy to do some fun things like implementing port knocking in user space. The idea of this technique is fairly simple: a closed port is dynamically opened if the user sends packets, in order, to a predetermined series of ports. To implement that on a system, let's start from a ruleset where Input is closed except for an SSH rule using a set of IPv4 addresses as a parameter:
# nft list ruleset
table inet Filter {
	set Client_SSH_IPv4 {
		type ipv4_addr
		flags timeout
		elements = { 192.168.1.1, 192.168.1.254 }
	}

	chain Input {
		type filter hook input priority 0; policy drop;
		ct state established counter accept
		tcp dport ssh ip saddr @Client_SSH_IPv4 counter accept
		iif "lo" accept
		counter log prefix "Input Dft DROP " drop
	}
}
This gives us control over which IPs get access to the SSH server. The set contains two fixed IPs, but the timeout flag is set, allowing the definition of entries with a timeout. To implement a three-step port knocking scheme we define two sets with a short timeout: hitting the first port makes you enter the first set, and if you are in the first set you can enter the second set by hitting the second port. To get the first step done we use the fact that nftables can add elements to a set in the packet path:
# nft add rule inet Filter Raw tcp dport 36 set add ip saddr @Step1 drop
Once this is done, the second step is a simple derivative of the first one:
# nft add rule inet Filter Raw ip saddr @Step1 tcp dport 15 set add ip saddr @Step2 drop
We add these rules in a chain hooked at prerouting (priority -300), so the kernel won't do any connection tracking for these packets. This gets us the following ruleset:
	set Step1 {
		type ipv4_addr
		timeout 5s
	}

	set Step2 {
		type ipv4_addr
		timeout 5s
	}

	chain Raw {
		type filter hook prerouting priority -300; policy accept;
		tcp dport 36 set add ip saddr @Step1 drop
		ip saddr @Step1 tcp dport 15 set add ip saddr @Step2 drop
		ip saddr @Step2 tcp dport 42 set add ip saddr timeout 10s @Client_SSH_IPv4 drop
	}
A poor man's test can be:
nmap -p 36 IPKNOCK & sleep 1;  nmap -p 15 IPKNOCK & sleep 1;  nmap -p 42 IPKNOCK
We run this and our IP is in the Client_SSH_IPv4 set:
# nft list set inet Filter Client_SSH_IPv4
table inet Filter {
	set Client_SSH_IPv4 {
		type ipv4_addr
		flags timeout
		elements = { 192.168.2.1 expires 8s, 192.168.1.1,
			     192.168.1.254 }
	}
}
and we can SSH to IPKNOCK 😉

And here is the complete ruleset:

# nft list ruleset
table inet Filter {
	set Client_SSH_IPv4 {
		type ipv4_addr
		flags timeout
		elements = { 192.168.1.1, 192.168.1.254 }
	}

	set Step1 {
		type ipv4_addr
		timeout 5s
	}

	set Step2 {
		type ipv4_addr
		timeout 5s
	}

	chain Input {
		type filter hook input priority 0; policy drop;
		ct state established counter accept
		tcp dport ssh ip saddr @Client_SSH_IPv4 counter packets 0 bytes 0 accept
		iif "lo" accept
		counter log prefix "Input Dft DROP " drop
	}

	chain Raw {
		type filter hook prerouting priority -300; policy accept;
		tcp dport 36 set add ip saddr @Step1 drop
		ip saddr @Step1 tcp dport netstat set add ip saddr @Step2 drop
		ip saddr @Step2 tcp dport nameserver set add ip saddr timeout 10s @Client_SSH_IPv4 drop
	}
}
For more information on nftables feel free to visit the Nftables wiki.

Pavel Stehule:

I hope that everyone who uses psql uses less as the default pager. See Merlin's important article. less is a great pager, but it is not designed for paging relational data, and it misses some features. Mainly, it is not possible to freeze some rows and some columns. I didn't find any existing solution good enough for me, although there are some projects on GitHub. A special pager for psql should have some logic to choose the frozen columns (in chars) and rows correctly and automatically.

So I wrote a new pager, pspg - the Postgres pager. This pager is designed primarily for use as a psql pager. It is now available for playing and testing. There will probably be some unwanted artifacts, but for usual work it is good enough. It is my first ncurses application, so there will probably be some bugs.

This pager currently supports:
* possibility to set one of three themes from the command line: -s 0 - black/white theme, -s 1 - mc theme, -s 2 - fox theme,
* possibility to set frozen columns from the command line (-c N) or interactively (0..4),
* scrolling: left, right, up, down, PgUp, PgDn, ^PgUp, ^PgDn,
* possibility to run a demo from the command line: -d,
* different psql border linestyles: ascii, unicode,
* different psql border levels: 0, 1, 2

Usage:
export PAGER="pspg" #default theme 1
psql dbname
\pset border 2 # works best with border 2, but others are supported too
or
export PAGER="pspg -s 0" #black white theme
psql dbname
..

Screenshots of the three themes: black & white, midnight commander, and fox.


I invite cooperation from anyone with ncurses knowledge. All patches are welcome.

Shaun M. Thomas: PG Phriday: Crazy Correlated Column Crusade

For a long time, the Postgres query planner has sported a huge blinking neon blind-spot that frustrated and enraged DBAs throughout the universe to a level just shy of murderous frenzy. How is this even possible? What terrible lurking horror could elicit such a visceral response from probably the most boring and straight-laced people ever to draw breath? What else? Correlated statistics.

The Adventure Begins!

The Postgres query planner is a cost-estimation engine. Postgres gathers statistics on table contents such as most and least frequent values, value cardinality, rudimentary histograms, and so on. Along with multiple metrics related to hardware responsiveness, it generates multiple viable execution strategies, and chooses the one that probably costs the least resources. So far, this is something nearly all database engines do.

But there’s an area that query planners tend to skip: correlation. After all, if two columns are related in some manner, there’s nearly an infinite number of possible combinations; gathering statistics on all of them would be inadvisable to say the least. Unfortunately, doing so is critical in some cases.

To see why, let’s build a test case that helps illustrate the problem more directly. As usual, this means trotting out the old, reliable sensor_log table:

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       BIGINT NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, id % 100,
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
  FROM generate_series(1, 1000000) s(id);

CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

ANALYZE sensor_log;

So now we have 1,000,000 rows spread across 1,000 sensors, tracking up to 100 distinct values. In this particular scenario, the location and reading columns are strongly correlated thanks to our use of the modulo operator. Up to a value of 100, there’s a 1-1 relationship between location and reading.
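
For the skeptical, a quick check along these lines confirms the 1-1 mapping; each location should report exactly one distinct reading:

SELECT location, count(DISTINCT reading) AS distinct_readings
  FROM sensor_log
 GROUP BY location
 ORDER BY location
 LIMIT 5;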

So far, this is what the Postgres query planner sees:

EXPLAIN
SELECT *
  FROM sensor_log
 WHERE location = 10
   AND reading = 10;

                                       QUERY PLAN                                        
-----------------------------------------------------------------------------------------
 Bitmap Heap Scan on sensor_log  (cost=19.67..2704.07 rows=10 width=28)
   Recheck Cond: (location = 10)
   Filter: (reading = 10)
   ->  Bitmap Index Scan on idx_sensor_log_location  (cost=0.00..19.66 rows=965 width=0)
         Index Cond: (location = 10)

What’s 1,000,000 divided by 1,000? We should be seeing 1,000 rows in the results, but the estimate is off by two orders of magnitude. Well, that’s pretty embarrassing.

Into the Bottomless Chasm

To see why Postgres mis-estimated the amount of matches, we need to view the statistics Postgres has stored for the sensor_log table:

SELECT attname, n_distinct
  FROM pg_stats
 WHERE tablename = 'sensor_log'
   AND attname IN ('location', 'reading');

 attname  | n_distinct 
----------+------------
 location |       1000
 reading  |        100

Well, that looks correct, right? So what’s the problem? The issue is that Postgres doesn’t know these two values are tightly bound. By default, all predicate probabilities are multiplied independently. So we have a 1/1000 chance multiplied by a 1/100 chance, and we end up with a 1/100000 chance. What’s 1,000,000 divided by 100,000? So it’s no surprise that Postgres only expects to match 10 rows from sensor_log based on our query.

A Hero is Born

That used to be the end of the story. Postgres simply couldn’t handle correlated columns, and would repeatedly botch row estimates where they occurred. In some cases, a good DBA with a lot of optimization expertise could construct a one-time remedy. Unfortunately, it’s not always possible to hand-tune every query. When that happened, everyone simply had to hope the bad estimates weren’t too detrimental, or were exceptionally rare.

Starting with Postgres 10 and the new CREATE STATISTICS statement however, things are a bit different. Now we can specifically inform Postgres of column correlations where relevant, and the results are quite amazing. Let’s create a correlation and attempt the query again:

CREATE STATISTICS stats_sensor_log_location_reading_1t1
       (dependencies)
    ON location, reading
  FROM sensor_log;

ANALYZE sensor_log;

EXPLAIN
SELECT *
  FROM sensor_log
 WHERE location = 10
   AND reading = 10;

                                       QUERY PLAN                                        
-----------------------------------------------------------------------------------------
 Bitmap Heap Scan on sensor_log  (cost=19.94..2714.09 rows=970 width=28)
   Recheck Cond: (location = 10)
   Filter: (reading = 10)
   ->  Bitmap Index Scan on idx_sensor_log_location  (cost=0.00..19.70 rows=970 width=0)
         Index Cond: (location = 10)

Much better! Now when Postgres encounters these two columns in a query, it will have a far more accurate picture of how the data is related.
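
The dependency information Postgres gathered is itself visible in the catalogs. Assuming the Postgres 10 catalog layout, something along these lines shows what was stored for our new statistics object:

SELECT stxname, stxkind, stxdependencies
  FROM pg_statistic_ext
 WHERE stxname = 'stats_sensor_log_location_reading_1t1';

The stxdependencies column lists the functional dependency degrees detected between the columns, which is what the planner now consults when both columns show up in a predicate.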

Following the White Rabbit

Why is this so important? In this case the query plans are the same, and would produce equivalent results regardless of the estimation error. But that is only true for extremely simple queries; once we start combining tables, things change drastically.

We only have a single table right now, so let’s join it to itself to simulate what would happen if query complexity increased even slightly:

EXPLAIN
SELECT l1.*
  FROM sensor_log l1
  JOIN sensor_log l2 USING (sensor_log_id)
 WHERE l1.location = 10
   AND l1.reading = 10
   AND l2.location = 10
   AND l2.reading = 10;

                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Hash Join  (cost=2746.16..5452.56 rows=1 width=28)
   Hash Cond: (l1.sensor_log_id = l2.sensor_log_id)
   ->  Bitmap Heap Scan on sensor_log l1  (cost=19.94..2714.09 rows=970 width=28)
         Recheck Cond: (location = 10)
         Filter: (reading = 10)
         ->  Bitmap Index Scan on idx_sensor_log_location  (cost=0.00..19.70 rows=970 width=0)
               Index Cond: (location = 10)
   ->  Hash  (cost=2714.09..2714.09 rows=970 width=4)
         ->  Bitmap Heap Scan on sensor_log l2  (cost=19.94..2714.09 rows=970 width=4)
               Recheck Cond: (location = 10)
               Filter: (reading = 10)
               ->  Bitmap Index Scan on idx_sensor_log_location  (cost=0.00..19.70 rows=970 width=0)
                     Index Cond: (location = 10)

We can ignore much of this output. Focus on the count estimates. We have roughly 1000 rows being hash joined with another 1000 rows. Hashing 2000 values isn’t especially intensive, and this just happens to be the optimal query plan. This is precisely what we want to see, and Postgres 10 makes it possible.

Things Fall Apart

So what would have happened in an older, somewhat less sophisticated version of Postgres? We can drop the statistical representation we created and see for ourselves. Be prepared; this is a gruesome exhibition not suitable for some readers:

DROP STATISTICS stats_sensor_log_location_reading_1t1;

EXPLAIN
SELECT l1.*
  FROM sensor_log l1
  JOIN sensor_log l2 USING (sensor_log_id)
 WHERE l1.location = 10
   AND l1.reading = 10
   AND l2.location = 10
   AND l2.reading = 10;

                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Nested Loop  (cost=20.13..2791.49 rows=1 width=28)
   ->  Bitmap Heap Scan on sensor_log l1  (cost=19.70..2713.85 rows=9 width=28)
         Recheck Cond: (location = 10)
         Filter: (reading = 10)
         ->  Bitmap Index Scan on idx_sensor_log_location  (cost=0.00..19.70 rows=970 width=0)
               Index Cond: (location = 10)
   ->  Index Scan using sensor_log_pkey on sensor_log l2  (cost=0.42..8.45 rows=1 width=4)
         Index Cond: (sensor_log_id = l1.sensor_log_id)
         Filter: ((location = 10) AND (reading = 10))

Deplorable. Now that Postgres only expects to match about 10 rows, it reverts to a nested loop to join the two tables together. Our particular example is only off by two orders of magnitude at relatively low scale, so the execution times don’t diverge too much. This particular query plan is “only” about 50-60% slower than the optimal version.

Yet queries get more complicated than this, to be sure. Add another predicate, up the scale slightly, and Postgres might use a nested loop for a situation where a merge would have been more appropriate. Consider that merges are normally reserved for very large result sets, and try not to balk in abject terror after realizing a bad estimate here could mean looping through millions of results unintentionally. It’s the difference between a plan that might finish in less than a second, and one that may consume several hours.

Trouble in Paradise

Unfortunately, correlated statistics aren’t quite perfect yet. As a new and somewhat experimental piece of functionality, there are some edge cases it trips over. In past articles, the location column was historically a VARCHAR type for mostly arbitrary reasons. Let’s switch it back and apply the same dependent statistics to see what the planner does:

CREATE TABLE sensor_log_vc (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);

INSERT INTO sensor_log_vc (location, reading, reading_date)
SELECT s.id % 1000, id % 100,
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
  FROM generate_series(1, 1000000) s(id);

CREATE INDEX idx_sensor_log_vc_location ON sensor_log_vc (location);
CREATE INDEX idx_sensor_log__vc_date ON sensor_log_vc (reading_date);

CREATE STATISTICS stats_sensor_log_vc_location_reading_1t1
       (dependencies)
    ON location, reading
  FROM sensor_log_vc;

ANALYZE sensor_log_vc;

EXPLAIN
SELECT *
  FROM sensor_log_vc
 WHERE location = '10'
   AND reading = 10;

                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 Bitmap Heap Scan on sensor_log_vc  (cost=19.70..2623.99 rows=9 width=23)
   Recheck Cond: ((location)::text = '10'::text)
   Filter: (reading = 10)
   ->  Bitmap Index Scan on idx_sensor_log_vc_location  (cost=0.00..19.70 rows=970 width=0)
         Index Cond: ((location)::text = '10'::text)

What happened here? Notice how everything is being cast to TEXT by the planner? Normally this is quite innocuous, as TEXT and VARCHAR are roughly equivalent in Postgres. Though it’s plainly obvious that “roughly” does not mean “exactly”, at least to some of the backend code.

If we try the experiment again, but use TEXT for the column type instead, we get the expected result:

CREATE TABLE sensor_log_txt (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       TEXT NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);

INSERT INTO sensor_log_txt (location, reading, reading_date)
SELECT s.id % 1000, id % 100,
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
  FROM generate_series(1, 1000000) s(id);

CREATE INDEX idx_sensor_log_txt_location ON sensor_log_txt (location);
CREATE INDEX idx_sensor_log__txt_date ON sensor_log_txt (reading_date);

CREATE STATISTICS stats_sensor_log_txt_location_reading_1t1
       (dependencies)
    ON location, reading
  FROM sensor_log_txt;

ANALYZE sensor_log_txt;

EXPLAIN
SELECT *
  FROM sensor_log_txt
 WHERE location = '10'
   AND reading = 10;

                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 Bitmap Heap Scan on sensor_log_txt  (cost=19.93..2619.57 rows=968 width=23)
   Recheck Cond: (location = '10'::text)
   Filter: (reading = 10)
   ->  Bitmap Index Scan on idx_sensor_log_txt_location  (cost=0.00..19.68 rows=968 width=0)
         Index Cond: (location = '10'::text)

Hopefully this kind of thing will be ironed out before Postgres 10 is released to the public. If not, it’s still better than what we’ve been forced to endure thus far.

A Welcome Rest

It’s difficult not to be excited by Postgres 10. The development team has pulled out all the stops and is addressing long-standing concerns as if there was no tomorrow. On a more personal level, I’ve ranted and raved about this particular shortcoming on numerous occasions. I’ve had queries that should have executed in seconds literally take hours instead. In the past, the only way to fix these instances was to resort to CTEs or even multiple temp tables, thus forcing Postgres to periodically materialize for accurate row counts when it mattered.

At best, that’s an ugly hack. At worst, it makes a temporary problem permanent barring frequent code refactors. Such workarounds may avoid the catastrophic result, but they’re almost never optimal. CTEs and temp tables aren’t mere optimization hints or fences, they’re optimization walls. When used in situations like this, improvements in the query planner are rendered irrelevant; such walls were constructed expressly to thwart its efforts.

So now we’ve entered an era where bad practices aren’t perpetuated by the Postgres query planner itself. That’s an advancement of Earth-shattering proportions, the implications of which will continue to reverberate through the community in exciting and unexpected ways. There is indeed much about Postgres 10 that has me jumping in anticipation, but this… this is something special.

Well, at least from the perspective of an easily entertained and thoroughly boring DBA. Hopefully I’m not the only one that feels this way.
