Postgresql 9.4 Architecture Explanation With Query Flow
LIBPQ -Library Pooled Quota
- Details About Connected Users Using tools
- libpq is the C application programmer's interface to PostgreSQL. libpq is a set of library functions that allow client programs to pass queries to the PostgreSQL backend server and to receive the results of these queries.
- Client programs that use libpq must include the header file libpq-fe.h and must link with the libpq library
- Here are also several complete examples of libpq applications in the directory src/test/examples in the source code distribution.
- Whenever we issue a query or the action made by us (client) is called the client process
- It is front end.
- Front end may be a text application, graphical application or web server page.
- Through TCP/IP clients access the server
- Many users at a time can access the DB
- FORKS – This process makes multi user access possible. It don’t disturb the postgres process
- The work of postmaster is that it authenticates the port (5432) and allocates process for users.
- It is also called as postgres. It accepts the connection from the clients(we) like database files and manages the database action.
I.Instance
II.Storage
I.Instance is divide into two types
1.Memory buffer
2.Utility Process
1.Memory Buffer:
a)Shared_buffer:
Sets the amount of memory the database server uses for shared memory buffers. The default is typically 128 megabytes (128MB), but might be less if your kernel settings will not support it (as determined during initdb). This setting must be at least 128 kilobytes. (Non-default values of BLCKSZ change the minimum.) However, settings significantly higher than the minimum are usually needed for good performance. This parameter can only be set at server start.
If you have a dedicated database server with 1GB or more of RAM, a reasonable starting value for shared_buffers is 25% of the memory in your system. There are some workloads where even large settings for shared_buffers are effective, but because PostgreSQL also relies on the operating system cache, it is unlikely that an allocation of more than 40% of RAM to shared_buffers will work better than a smaller amount. Larger settings for shared_buffers usually require a corresponding increase in checkpoint_segments, in order to spread out the process of writing large quantities of new or changed data over a longer period of time.
On systems with less than 1GB of RAM, a smaller percentage of RAM is appropriate, so as to leave adequate space for the operating system. Also, on Windows, large values for shared_buffers aren't as effective. You may find better results keeping the setting relatively low and using the operating system cache more instead. The useful range for shared_buffers on Windows systems is generally from 64MB to 512MB.
b)Wall_buffer:
The amount of shared memory used for WAL data that has not yet been written to disk. The default setting of -1 selects a size equal to 1/32nd (about 3%) of shared_buffers, but not less than 64kB nor more than the size of one WAL segment, typically 16MB. This value can be set manually if the automatic choice is too large or too small, but any positive value less than 32kB will be treated as 32kB. This parameter can only be set at server start.
The contents of the WAL buffers are written out to disk at every transaction commit, so extremely large values are unlikely to provide a significant benefit. However, setting this value to at least a few megabytes can improve write performance on a busy server where many clients are committing at once. The auto-tuning selected by the default setting of -1 should give reasonable results in most cases.
c)CLOG Buffers:
CLOG BUFFERS are one of the SLRU-style buffers oriented toward circular "rings" of data, like which transaction numbers have been committed or rolled back.
d)Temp_Buffer:
Sets the maximum number of temporary buffers used by each database session. These are session-local buffers used only for access to temporary tables. The default is eight megabytes (8MB). The setting can be changed within individual sessions, but only before the first use of temporary tables within the session; subsequent attempts to change the value will have no effect on that session.
A session will allocate temporary buffers as needed up to the limit given by temp_buffers. The cost of setting a large value in sessions that do not actually need many temporary buffers is only a buffer descriptor, or about 64 bytes, per increment in temp_buffers. However if a buffer is actually used an additional 8192 bytes will be consumed for it (or in general, BLCKSZ bytes).
e)Work_mem:
Specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. The value defaults to four megabytes (4MB). Note that for a complex query, several sort or hash operations might be running in parallel; each operation will be allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.
f)Maintenance_work_mem:
Specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. It defaults to 64 megabytes (64MB). Since only one of these operations can be executed at a time by a database session, and an installation normally doesn't have many of them running concurrently, it's safe to set this value significantly larger than work_mem. Larger settings might improve performance for vacuuming and for restoring database dumps.
Note that when autovacuum runs, up to autovacuum_max_workers times this memory may be allocated, so be careful not to set the default value too high. It may be useful to control for this by separately setting autovacuum_work_mem.
2.Utility(Background) Process:
a)BGWriter:
There is a separate server process called the background writer, whose function is to issue writes of "dirty" (new or modified) shared buffers. It writes shared buffers so server processes handling user queries seldom or never need to wait for a write to occur. However, the background writer does cause a net overall increase in I/O load, because while a repeatedly-dirtied page might otherwise be written only once per checkpoint interval, the background writer might write it several times as it is dirtied in the same interval. The parameters discussed in this subsection can be used to tune the behavior for local needs.
b)WallWriter:
WAL buffers are written out to disk at every transaction commit, so extremely large values are unlikely to provide a significant benefit. However, setting this value to at least a few megabytes can improve write performance on a busy server where many clients are committing at once. The auto-tuning selected by the default setting of -1 should give reasonable results in most cases.
pecifies the delay between activity rounds for the WAL writer. In each round the writer will flush WAL to disk. It then sleeps for wal_writer_delay milliseconds, and repeats. The default value is 200 milliseconds (200ms). Note that on many systems, the effective resolution of sleep delays is 10 milliseconds; setting wal_writer_delay to a value that is not a multiple of 10 might have the same results as setting it to the next higher multiple of 10. This parameter can only be set in the postgresql.conf file or on the server command line.
c)SysLogger:Error Reporting and Logging
As per the figure, it is clearly understood that all – the utility processes + user backends + Postmaster Daemon are attached to syslogger process for logging the information about their activities. Every process information is logged under $PGDATA/pg_log with the file .log.
Debugging more on the process information will cause overhead on the server. Minimal tuning is always recommended. However, increasing the debug level when required. Click Here for further on logging parameters
logging collector, which is a background process that captures log messages sent to stderr and redirects them into log files
- log_directory- data directory
- log_filename -The default is postgresql-%Y-%m-%d_%H%M%S.log
- he default permissions are 0600
When checkpoints occur, all the dirty pages must write to disk. If we increase the checkpoint_segments then checkpoint will occur less and so I/O will be less as it need to write less to disk. IF large amount of data is inserted there is more generation of checkpoints.
Write-Ahead Logging (WAL) puts a checkpoint in the transaction log every so often. The CHECKPOINT command forces an immediate checkpoint when the command is issued, without waiting for a scheduled checkpoint.
A checkpoint is a point in the transaction log sequence at which all data files have been updated to reflect the information in the log. All data files will be flushed to disk.
If executed during recovery, the CHECKPOINT command will force a restartpoint rather than writing a new checkpoint.
Only superusers can call CHECKPOINT. The command is not intended for use during normal operation.
e)Stats Collector:
PostgreSQL's statistics collector is a subsystem that supports collection and reporting of information about server activity. Presently, the collector can count accesses to tables and indexes in both disk-block and individual-row terms. It also tracks the total number of rows in each table, and information about vacuum and analyze actions for each table. It can also count calls to user-defined functions and the total time spent in each one.
PostgreSQL also supports reporting of the exact command currently being executed by other server processes. This facility is independent of the collector process.
The statistics collector transmits the collected information to other PostgreSQL processes through temporary files. These files are stored in the directory named by the stats_temp_directory parameter, pg_stat_tmp by default. For better performance, stats_temp_directory can be pointed at a RAM-based file system, decreasing physical I/O requirements. When the server shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts. When recovery is performed at server start (e.g. after immediate shutdown, server crash, and point-in-time recovery), all statistics counters are reset.
f)Archiver:
Achiver process is optional process, default is OFF.
Setting up the database in Archive mode means to capture the WAL data of each segment file once it is filled and save that data somewhere before the segment file is recycled for reuse.
On Database Archivelog mode, once the WAL data is filled in the WAL Segment, that filled segment named file is created under PGDATA/pg_xlog/archive_status by the WAL Writer naming the file as “.ready”. File naming will be “segment-filename.ready”.
Archiver Process triggers on finding the files which are in “.ready” state created by the WAL Writer process. Archiver process picks the ‘segment-file_number’ of .ready file and copies the file from $PGDATA/pg_xlog location to its concerned Archive destination given in ‘archive_command’ parameter(postgresql.conf).
On successful completion of copy from source to destination, archiver process renames the “segment-filename.ready” to “segment-filename.done”. This completes the archiving process.
It is understood that, if any files named “segement-filename.ready” found in $PGDATA/pg_xlog/archive_status. They are the pending files still to be copied to Archive destination.
II.Storage:
- postgresql.conf file already mentioned, PostgreSQL uses two other manually-edited configuration files, which control client authentication
- all three configuration files are stored in the database cluster's data directory.
- The parameters described in this section allow the configuration files to be placed elsewhere
data_directory:
Specifies the directory to use for data storage. This parameter can only be set at server start.
config_file:
Specifies the main server configuration file (customarily called postgresql.conf). This parameter can only be set on the postgres command line.
hba_file:
Specifies the configuration file for host-based authentication (customarily called pg_hba.conf). This parameter can only be set at server start.
ident_file:
Specifies the configuration file for Section 19.2 user name mapping (customarily called pg_ident.conf). This parameter can only be set at server start.
external_pid_file :
Specifies the name of an additional process-ID (PID) file that the server should create for use by server administration programs. This parameter can only be set at server start.
PG_LOG:
It is not an actual postgres directory, it is the directory where RHEL stores the actual textual LOG.
PG_XLOG:
Here the write ahead logs are stored. It is the log file, where all the logs are stored of committed and un committed transaction. It contains max 6 logs, and last one overwrites. If archiver is on, it moves there.
PG_CLOG:
It contains the commit log files, used for recovery for instant crash
PG_VERSION:
A file containing the major version number of PostgreSQL
Base:
Subdirectory containing per-database subdirectories
Global:
Subdirectory containing cluster-wide tables, such as pg_database
PG_MULTIXACT:
Subdirectory containing multitransaction status data (used for shared row locks)
PG_SUBTRANS:
Subdirectory containing subtransaction status data
PG_TBLSPC:
Subdirectory containing symbolic links to tablespaces
PG_TWOPHASE:
Subdirectory containing state files for prepared transactions
POSTMASTER.OPTS:
A file recording the command-line options the postmaster was last started with
POSTMASTER.PID:
A lock file recording the current postmaster PID and shared memory segment ID (not present after postmaster shutdown)
Postgresql Query Flow
1.parser:
The parser stage consists of two parts:
- The parser defined in gram.y and scan.l is built using the Unix tools bison and flex.
- The transformation process does modifications and augmentations to the data structures returned by the parser.
- The parser has to check the query string (which arrives as plain text) for valid syntax. If the syntax is correct a parse tree is built up and handed back; otherwise an error is returned. The parser and lexer are implemented using the well-known Unix tools bison and flex.
- The lexer is defined in the file scan.l and is responsible for recognizing identifiers, the SQL key words etc. For every key word or identifier that is found, a token is generated and handed to the parser.
- The parser is defined in the file gram.y and consists of a set of grammar rules and actions that are executed whenever a rule is fired. The code of the actions (which is actually C code) is used to build up the parse tree.
- The file scan.l is transformed to the C source file scan.c using the program flex and gram.y is transformed to gram.c using bison. After these transformations have taken place a normal C compiler can be used to create the parser. Never make any changes to the generated C files as they will be overwritten the next time flex or bison is called.
A detailed description of bison or the grammar rules given in gram.y would be beyond the scope of this paper. There are many books and documents dealing with flex and bison. You should be familiar with bison before you start to study the grammar given in gram.y otherwise you won't understand what happens there
2.Traffic Cop:
The traffic cop is the agent that is responsible for differentiating between simple and complex query
commands. Transaction control commands such as BEGIN and ROLLBACK are simple enough so as to not
need additional processing, whereas other commands such as SELECT and JOIN are passed on to the
rewriter. This discrimination reduces the processing time by performing minimal optimization on the
simple commands, and devoting more time to the complex ones.
Parsing is Two Types:
- Soft Parse – when the parsed representation of a submitted SQL statement exists in the Postgres Server(Shared Buffer) Performs syntax and semantic checks but avoids the relatively costly operation of query optimization. Reuses the existing Postgres SQL area which already has the execution plan required to execute the SQL statement
- Hard Parse – if a statement cannot be reused or if it the very first time the SQL statement is being loaded in the Postgres Server(Shared Buffer), it results in a hard parse. Also when a statement is aged out of the Postgres Server(Shared Buffer) (because the sPostgres Server(Shared Buffer) is limited in size), when it is reloaded again, it results in another hard parse. So size of the shared Buffer can also affect the amount of parse calls.
- We can query pg_prepared_statements to see what is cached. Note that it is not available across sessions and visible only to the current session.
- The pg_buffercache module provides a means for examining what's happening in the shared buffer cache in real time
- It(below Query) can even tell how much data blocks came from disk and how much came from shared_buffers i.e memory.
- explain (analyze,buffers) statement
Shared read, means it comes from the disk and it was not cached. If the query is run again, and if the cache configuration is correct (we will discuss about it below), it will show up as shared hit.
3.Rewriter:
PostgreSQL rule system consisted of two implementations:
- The first one worked using row level processing and was implemented deep in the executor. The rule system was called whenever an individual row had been accessed. This implementation was removed in 1995 when the last official release of the Berkeley Postgres project was transformed into Postgres95.
- The second implementation of the rule system is a technique called query rewriting. The rewrite system is a module that exists between the parser stage and the planner/optimizer. This technique is still implemented.
The task of the planner/optimizer is to create an optimal execution plan. A given SQL query (and hence, a query tree) can be actually executed in a wide variety of different ways, each of which will produce the same set of results. If it is computationally feasible, the query optimizer will examine each of these possible execution plans, ultimately selecting the execution plan that is expected to run the fastest.
The planner's search procedure actually works with data structures called paths, which are simply cut-down representations of plans containing only as much information as the planner needs to make its decisions. After the cheapest path is determined, a full-fledged plan tree is built to pass to the executor. This represents the desired execution plan in sufficient detail for the executor to run it. In the rest of this section we'll ignore the distinction between paths and plans.
5.Executor:
The executor takes the plan created by the planner/optimizer and recursively processes it to extract the required set of rows. This is essentially a demand-pull pipeline mechanism. Each time a plan node is called, it must deliver one more row, or report that it is done delivering rows.
The executor mechanism is used to evaluate all four basic SQL query types: SELECT, INSERT, UPDATE, and DELETE. For SELECT, the top-level executor code only needs to send each row returned by the query plan tree off to the client. For INSERT, each returned row is inserted into the target table specified for the INSERT. This is done in a special top-level plan node called ModifyTable. (A simple INSERT ... VALUES command creates a trivial plan tree consisting of a single Result node, which computes just one result row, and ModifyTable above it to perform the insertion. But INSERT ... SELECT can demand the full power of the executor mechanism.) For UPDATE, the planner arranges that each computed row includes all the updated column values, plus the TID (tuple ID, or row ID) of the original target row; this data is fed into a ModifyTable node, which uses the information to create a new updated row and mark the old row deleted. For DELETE, the only column that is actually returned by the plan is the TID, and the ModifyTable node simply uses the TID to visit each target row and mark it deleted.
Comments
Post a Comment