TABLESAMPLE, SQL STANDARD AND EXTENSIBLE (PostgreSQL 9.5)


- November 16, 2017

Add a TABLESAMPLE clause to SELECT statements that allows
user to specify random BERNOULLI sampling or block level
SYSTEM sampling. Implementation allows for extensible
sampling functions to be written, using a standard API.
Basic version follows SQLStandard exactly. Usable
concrete use cases for the sampling API follow in later
commits.


Getting a random sample of a table looks potentially interesting, but how does it work?

Let's make a table with some random data:


create table test (
    id serial primary key,
    some_timestamp timestamptz,
    some_text text
);
CREATE TABLE
insert into test (some_timestamp, some_text)
    select
        now() - random() * '1 year'::interval,
        'depesz #' || i
    from
        generate_series(1,100000) i;
INSERT 0 100000




The table is around 6MB:


                   List of relations
 Schema | Name | Type  | Owner  |  Size   | Description 
--------+------+-------+--------+---------+-------------
 public | test | table | depesz | 5920 kB | 
(1 row)




TABLESAMPLE has two modes: SYSTEM and BERNOULLI.

Before we go any further, we need to know how large the table is, in pages:


select relpages from pg_class where relname = 'test';
 relpages 
----------
      736
(1 row)




OK. So we have 736 pages and 100,000 rows, which means that a single page holds about 136 rows on average.
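As a quick check of that arithmetic, here is a minimal Python sketch. The helper names are made up for illustration; the numbers are the ones reported above.

```python
def rows_per_page(total_rows: int, pages: int) -> float:
    """Average number of rows stored in one page."""
    return total_rows / pages


def percent_for_rows(target_rows: int, total_rows: int) -> float:
    """Percentage of the table that corresponds to target_rows rows."""
    return 100.0 * target_rows / total_rows


if __name__ == "__main__":
    print(round(rows_per_page(100_000, 736)))   # 136
    print(percent_for_rows(10, 100_000))        # 0.01
```

The second helper gives the value to pass to TABLESAMPLE for a desired row count, since the clause takes a percentage rather than a fraction.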

Let's say we'd like to get just 10 rows. 10 rows out of 100,000 means we want 0.0001 of the table, which is 0.01%. TABLESAMPLE takes its argument as a percentage, so:


explain analyze select * from test tablesample system ( 0.01 );
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Sample Scan (system) on test  (cost=0.00..0.08 rows=8 width=44) (actual time=0.016..0.021 rows=136 loops=1)
 Planning time: 0.102 ms
 Execution time: 0.045 ms
(3 rows)




That's too much: 136 rows instead of 10. Why is that?

Well, the SYSTEM method samples whole pages: it picks pages at random (here it picked a single page) and returns all rows from each picked page. That makes it fast: pick a random page number from 1 to 736, load that 8kB page, and return all rows from it.

But whatever it returns comes in whole pages, so it can't return just part of a page.
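The page-level behaviour can be illustrated with a toy model in Python. This is only a sketch, not PostgreSQL's implementation: it assumes each page is kept independently with the given probability (which is how the PostgreSQL documentation describes SYSTEM), and the function names are hypothetical.

```python
import random


def make_pages(total_rows, rows_per_page):
    """Split row ids 1..total_rows into consecutive fixed-size pages."""
    ids = list(range(1, total_rows + 1))
    return [ids[i:i + rows_per_page] for i in range(0, total_rows, rows_per_page)]


def system_sample(pages, percent, rng):
    """Toy page-level sampling: keep each page with probability percent/100
    and return every row from the pages that were kept."""
    picked = [page for page in pages if rng.random() < percent / 100.0]
    return [row for page in picked for row in page]


if __name__ == "__main__":
    rng = random.Random(42)
    # 100,000 rows at 136 rows per page gives 736 pages (the last one is shorter).
    pages = make_pages(total_rows=100_000, rows_per_page=136)
    sample = system_sample(pages, percent=1.0, rng=rng)
    # Rows always arrive as complete pages, never as a handful of single rows.
    print(len(pages), len(sample))
```

In this model the work done depends on the number of pages picked, not on the size of the table, which is why the method stays fast.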

Of course, we can then apply a secondary randomization:


explain analyze with x as (select * from test tablesample system ( 0.01 ))
select * from x order by random() limit 10;
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.39..0.41 rows=8 width=44) (actual time=0.088..0.090 rows=10 loops=1)
   CTE x
     ->  Sample Scan (system) on test  (cost=0.00..0.08 rows=8 width=44) (actual time=0.015..0.027 rows=136 loops=1)
   ->  Sort  (cost=0.31..0.33 rows=8 width=44) (actual time=0.087..0.087 rows=10 loops=1)
         Sort Key: (random())
         Sort Method: top-N heapsort  Memory: 25kB
         ->  CTE Scan on x  (cost=0.00..0.19 rows=8 width=44) (actual time=0.017..0.048 rows=136 loops=1)
 Planning time: 0.136 ms
 Execution time: 0.115 ms
(9 rows)




Using "order by random()" is usually slow, but here we're ordering only 136 rows, so it's fast enough.

There is a second method, BERNOULLI, that can return a smaller number of rows:
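The same second step, expressed outside the database, is just a random pick from an already small list. A minimal Python sketch, with a hypothetical helper name and a stand-in list instead of real rows:

```python
import random


def pick_rows(page_rows, limit, rng):
    """Second step: take `limit` distinct rows at random from a small sample."""
    return rng.sample(page_rows, min(limit, len(page_rows)))


if __name__ == "__main__":
    rng = random.Random(7)
    one_page = list(range(1, 137))      # stand-in for the 136 sampled rows
    print(pick_rows(one_page, 10, rng))
```

Because the input is a single page of rows, the cost of shuffling or sorting it is negligible compared with doing the same over the whole table.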


explain analyze select * from test tablesample bernoulli ( 0.01 );
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Sample Scan (bernoulli) on test  (cost=0.00..736.08 rows=8 width=44) (actual time=0.465..2.742 rows=10 loops=1)
 Planning time: 0.107 ms
 Execution time: 2.758 ms
(3 rows)




Looks great: the number of rows is what I wanted (it will not always match the given percentage exactly, as it's random). But notice what happens when I add more data:


insert into test (some_timestamp, some_text)
    select
        now() - random() * '1 year'::interval,
        'depesz #' || i
    from
        generate_series(100001, 1000000) i;
INSERT 0 900000




The table is now 10 times larger (1,000,000 rows), so a sample of about 10 rows is now 0.001% of it. Timings:


explain analyze select * from test tablesample system ( 0.001 );
                                                  QUERY PLAN                                                  
--------------------------------------------------------------------------------------------------------------
 Sample Scan (system) on test  (cost=0.00..0.10 rows=10 width=25) (actual time=0.029..0.042 rows=136 loops=1)
 Planning time: 0.125 ms
 Execution time: 0.061 ms
(3 rows)
 
 
explain analyze select * from test tablesample bernoulli ( 0.001 );
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Sample Scan (bernoulli) on test  (cost=0.00..7353.10 rows=10 width=25) (actual time=2.779..27.288 rows=15 loops=1)
 Planning time: 0.112 ms
 Execution time: 27.312 ms
(3 rows)




The time for SYSTEM sampling is more or less the same. But for BERNOULLI it's 10x longer (27.3 ms instead of 2.8 ms). Why? Basically, BERNOULLI has to sequentially scan the whole table and decide at random, row by row, whether to return each row, so that it ends up with more or less the number of rows we want.

This means that while it is more precise and gives every row exactly the same chance, it is slower.
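A toy Python model shows why the cost grows with the table. It is a sketch under the assumption that every row is visited and kept independently with the given probability; it is not PostgreSQL's code, and the function name is made up.

```python
import random


def bernoulli_sample(rows, percent, rng):
    """Toy row-level sampling: visit every row and keep each one
    independently with probability percent/100."""
    visited = 0
    kept = []
    for row in rows:
        visited += 1
        if rng.random() < percent / 100.0:
            kept.append(row)
    return kept, visited


if __name__ == "__main__":
    rng = random.Random(42)
    for total in (100_000, 1_000_000):
        # 1000 / total is the percentage that yields about 10 rows
        kept, visited = bernoulli_sample(range(1, total + 1), 1000.0 / total, rng)
        # visited grows with the table even though the sample stays around 10 rows
        print(total, visited, len(kept))
```

The number of rows kept varies from run to run, but the number of rows visited is always the full table, which matches the 10x slowdown after the table grew 10x.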

In most cases, I think, SYSTEM sampling will be the best choice, but keep in mind that it works at the page level.

You also have to remember that TABLESAMPLE is applied before any WHERE conditions, so this query:


explain analyze select * from test TABLESAMPLE SYSTEM ( 1 ) where id < 10;




will usually not return any rows. It first picks 1% of the pages and only then filters their rows by id < 10, so the sample would have to include the page holding those rows (here, the first one) for the filter to find any matches.

All in all, I think it is useful. Thanks.
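To illustrate the sample-before-filter ordering, here is a small Python simulation. It is a toy model with hypothetical names, assuming ids are stored in insertion order with 136 rows per page and each page is kept independently with 1% probability.

```python
import random


def sample_then_filter(pages, percent, max_id, rng):
    """Sample pages first, then apply the filter id < max_id to the sampled rows."""
    sampled = [row for page in pages if rng.random() < percent / 100.0 for row in page]
    return [row for row in sampled if row < max_id]


if __name__ == "__main__":
    rng = random.Random(1)
    # 736 pages of consecutive ids, 136 per page (the last page is shorter)
    pages = [list(range(start, min(start + 136, 100_001)))
             for start in range(1, 100_001, 136)]
    trials = 1000
    hits = sum(1 for _ in range(trials) if sample_then_filter(pages, 1.0, 10, rng))
    # Only samples that happen to include the first page can match id < 10.
    print(hits, "of", trials, "samples returned any rows")
```

Under these assumptions only about one sample in a hundred contains the first page, so nearly every run of the filtered query comes back empty.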


TutorialDBA.com

Nijamutheen J: I have 6+ years of experience as a PostgreSQL database administrator and PostgreSQL architect, as well as in Linux administration, web hosting (Apache server), Oracle, MySQL, MariaDB, MSSQL, AWS and server security, and the Greenplum database at Allstate.
