F.51. pg_trace#

F.51. pg_trace

F.51. pg_trace #

Real-time filtered query tracing for Tantor SE-1C

F.51.1. Overview #

pg_trace - is a tracing library that allows track executed queries in real-time including their resources, timing and query plan.

Features:

  • Filtering queries using user defined conditions

  • Builtin tracing timeout after which it will be disabled

  • Support multiple simultaneous clients listening to traces

  • Multiple collected statistics about query: resources, timing, query plan

F.51.2. Getting started #

Add extension to shared_preload_libraries in postgresql.conf:

shared_preload_libraries = 'pg_trace'

Create extension and start tracing

CREATE EXTENSION pg_trace;
SELECT pg_trace_start();

Connect to socket. I.e. using nc:

nc -U /tmp/pg_trace.sock

Run some queries:

SELECT 1;

You will see output like this:

{"traceid":0,"queryid":6865378226349601843,"dbid":5,"userid":10,"tuplescount":1,"start_time":1733221222489,"end_time":1733221222489,"total_time":0.00739,"startup_time":0,"sys_time":0,"user_time":0,"rows":1,"shared_blks_hit":0,"shared_blks_read":0,"shared_blks_written":0,"shared_blks_dirtied":0,"local_blks_hit":0,"local_blks_read":0,"local_blks_written":0,"local_blks_dirtied":0,"shared_blk_read_time":0,"shared_blk_write_time":0,"local_blk_read_time":0,"local_blk_write_time":0,"temp_blk_read_time":0,"temp_blk_write_time":0,"wal_records":0,"wal_fpi":0,"wal_bytes":0,"query":"select 1;"}

F.51.3. Functionality #

F.51.3.1. Tracing management #

To start tracing use function pg_trace_start().

CREATE FUNCTION pg_trace_start(
    backend_pid int DEFAULT NULL,
    user_id oid DEFAULT NULL,
    database_id oid DEFAULT NULL,
    duration interval DEFAULT NULL,
    query_like text DEFAULT NULL,
    plan boolean DEFAULT FALSE,
    timeout interval DEFAULT NULL
)
RETURNS boolean;

-- example
SELECT pg_trace_start(
    user_id := 123,
    query_like := 'SELECT *%',
    plan := TRUE,
    timeout := interval '1 hour'
);

This function starts new tracing session. It accepts arguments:

  • Filtering attributes, if NULL filter is not applied:

    • backend_pid - query executed on this backend

    • user_id - query executed by this effective user

    • database_id - query executed on this database

    • duration - query executed longer than provided

    • query_like - query must match template (using LIKE)

  • plan - include query plan in output trace

  • timeout - override default trace timeout with provided interval

Returns boolean - tracing was enabled when function run.

To stop tracing session run pg_trace_stop():

CREATE FUNCTION pg_trace_stop()
RETURNS boolean;

SELECT pg_trace_stop();

This function forcibly disables tracing. Returns boolean - tracing was enabled when function run.

To check if tracing is enabled use pg_trace_enabled function:

CREATE FUNCTION pg_trace_enabled()
RETURN boolean;

SELECT pg_trace_enabled();

F.51.3.2. Observability #

To send traces extension uses background worker which here called server. Server collects some metrics and expose them using SQL functions:

  • Total amount of clients connected to server

    CREATE FUNCTION @extschema@.pg_trace_metrics_clients_accepted()
    RETURNS bigint;
    
  • Total amount of clients disconnected from server

    CREATE FUNCTION @extschema@.pg_trace_metrics_clients_disconnected()
    RETURNS bigint;
    
  • Total amount of times when WaitEventSet was recreated

    CREATE FUNCTION @extschema@.pg_trace_metrics_wes_recreations()
    RETURNS bigint;
    
  • Total amount of traces was collected

    CREATE FUNCTION @extschema@.pg_trace_metrics_traces_collected()
    RETURNS bigint;
    
  • Total amount of traces were declined because of lack of free space

    CREATE FUNCTION @extschema@.pg_trace_metrics_traces_declined()
    RETURNS bigint;
    
  • Total amount of bytes send to all clients over socket

    CREATE FUNCTION @extschema@.pg_trace_metrics_send_bytes()
    RETURNS bigint;
    
  • Total amount of times when sending data over socket returned error

    CREATE FUNCTION @extschema@.pg_trace_metrics_send_errors()
    RETURNS bigint;
    

Description of pg_trace_metrics_wes_recreations see below

F.51.4. Client-Server protocol #

Communication between server (background worker) and client (application reading traces) setup in client/server manner.

Client connects to server UNIX socket (it must know its location). After connection client will start receiving messages. Each message represents one trace and contains 2 fields:

Length Trace
4 bytes N bytes

Length - length of Trace field. This is stored as little-endian serialized integer.

Trace - ASCII encoded JSON representation of Trace object.

Schema of JSON message:

Key Type Description
queryid number ID of query
dbid number OID of database
userid number OID of user
tuplescount number Total amount of tuples processed by query
start_time number UNIX timestamp of query execution start
end_time number UNIX timestamp of query execution end
total_time number amount of time query was executing in seconds
startup_time number startup time of query
startup_time number total time of query execution
sys_time number amount of time query was in kernel space
user_time number amount of time query was in user space
rows number Total number of rows retrieved or affected by the query
shared_blks_hit number Total number of shared block cache hits by the statement
shared_blks_read number Total number of shared blocks read by the statement
shared_blks_written number Total number of shared blocks written by the statement
shared_blks_dirtied number Total number of shared blocks dirtied by the statement
local_blks_hit number Total number of local block cache hits by the statement
local_blks_read number Total number of local blocks read by the statement
local_blks_written number Total number of local blocks written by the statement
local_blks_dirtied number Total number of local blocks dirtied by the statement
local_blks_dirtied number Total number of local blocks dirtied by the statement
shared_blk_read_time number Total time the statement spent reading data file shared blocks, in milliseconds
shared_blk_write_time number Total time the statement spent writing data file shared blocks, in milliseconds
local_blk_read_time number Total time the statement spent reading data file local blocks, in milliseconds
local_blk_write_time number Total time the statement spent writing data file local blocks, in milliseconds
temp_blk_read_time number Total time the statement spent reading temporary file blocks, in milliseconds
temp_blk_write_time number Total time the statement spent writing temporary file blocks, in milliseconds
wal_records number Total number of WAL records generated by the statement
wal_fpi number Total number of WAL full page images generated by the statement
wal_bytes number Total amount of WAL generated by the statement in bytes
query string Query string
plan plan object Query plan in EXPLAIN JSON format
jit_functions number Number of JIT emitted functions
jit_generation_time number Accumulated time to generate code using JIT
jit_optimization_time number Accumulated time to generate code using JIT
jit_inlining_time number Accumulated time used to inline JIT code
jit_emission_time number Accumulated time for JIT code emission

Many fields provide time information. In this schema all time information provided as UNIX timestamp.

Plan is provided as JSON object generated using EXPLAIN.

F.51.5. Configuration #

pg_trace.max_clients - max amount of clients that can be processed simultaneously.

pg_trace.trace_plan_mem - amount of memory used to store query plan in text representation. Specified as memory, with size suffix. I.e. 8 kB. By default, set to 16 kB.

Note

This setting will affect result size of single Trace record. So changing this parameter will affect others.

pg_trace.traces_mem - amount of memory to allocate for traces in server background worker. Specified as memory, with size suffix. I.e. 128 MB. By default, set to amount of memory to store exactly 5000 traces.

pg_trace.shm_queue_mem - amount of memory to allocate in shared memory for traces buffer. Specified as memory, with size suffix. I.e. 32 MB. By default, set to amount of memory to store exactly 32 traces.

Note

If you will specify amount of memory less than size of 1 Trace you will get an error. traces_mem and shm_queue_mem depend on trace_plan_mem setting.

pg_trace.unix_socket_path - path to unix socket for server. If socket already exists on it will be removed and new socket will be created. By default, /tmp/pg_trace.sock.

pg_trace.trace_timeout - default timeout for tracing session after which trace session will be automatically disabled. This values used if timeout argument in pg_trace_start is not set. Specified as time, with time suffix. I.e. 1h. By default, 10 minutes.

pg_trace.update_delay - amount of time to wait before start processing traces from shared memory queue by background worker. This time can give backends chance to write some more traces in queue before background worker will take exclusive lock. By default, 0 - no delay.

pg_trace.plan_format - format of plan to pass when ‘plan’ was requested when tracing session started. Accepted values: text, json, yaml, xml. If json is requested it’s JSON plan object will be used, otherwise (for other formats) it will be represented as plan string (possible escaped for JSON). By default, text format is used.

Parameters trace_timeout, update_delay and plan_format can be changed using configuration file update with SIGHUP sending. All others can not be changed after postgres startup - you need to restart server.

F.51.6. How it works #

Terminology:

  • Server - background worker

  • Client - user which connects to socket and listens for traces

  • Backend - backend that executes queries and passes traces to background worker

Overall architecture and work principles:

        ----------------- Trace format ----------------

                    +---------------------+
                    |   Instrumentation   |
                    |       Timings       |
                    |      Resources      |
                    +---------------------+
                    |      Query plan     |  `pg_trace.plan_max`
                    +---------------------+  



        ---------------- Architecture ----------------
                        
    +---------+           +---------+           +---------+
    | backend |           | backend |           | backend |
    +---------+           +---------+           +---------+
         \                     |                     /
          +--------------------+--------------------+
                               |
                           filtering
                               |
                               |
                               V                         `pg_trace.shm_queue_max`
                    +---------------------+     buffer         +---+---+---+          
                    | shared memory queue |--------------------| X |   |   |          
                    +---------------------+                    +---+---+---+
                               |
                               |
                               |
                               V                          `pg_trace.traces_mem`
                    +---------------------+  ring buffer  +---+---+---+---+---+
                    |  background worker  |---------------| X | X | X | X |   |
                    +---------------------+               +---+---+---+---+---+
                               |
                               |
                               |
            +------------------+------------------+
           /                   |                   \
          |                    |                    |
          V                    V                    V
          O                    O                    O
         /|\                  /|\                  /|\
         / \                  / \                  / \

                     `pg_trace.max_clients`      

When query executed and passes all passed filters it sent to background worker using shared memory queue. It contains buffer with pg_trace.shm_queue_max size.

After that background worker copies all traces in shm queue buffer to it’s own buffer. Traces are stored in ring buffer. It is allocated at server startup with size pg_trace.traces_mem. Current implementation allows all operations (freeing, searching, inserting, etc…) take O(1) complexity, so feel free to increase this parameter.

When new traces inserted they are sent to clients. Max amount of clients that can be processed simultaneously set using pg_trace.max_clients parameter.

Server is implemented using non-blocking sockets. It uses WaitEventSet PostgreSQL functionality to implement this, so all platforms must be supported. Also, this means that for new clients the same background worker is used to send data - feel free to increase max clients amount.

F.51.7. Known issues #

F.51.7.1. WaitEventSet recreation #

To trace asynchronous requests WaitEventSet is used. When new client is connected its socket is added to this set.

When client is disconnected we need to disable his tracking. Sadly, currently there is no way to remove client from set and add new later, so we just stop tracking him. Sooner or later free space will end.

To handle this situation this WaitEventSet is recreated. This operation is not free, because we need to close old file descriptor, create new and traverse through all clients and filter out active.

Amount of times we need to recreate WaitEventSet can be decreased by increasing pg_trace.max_clients setting.

Metric function pg_trace_metrics_wes_recreations returns amount of times this WaitEventSet was recreated.

F.51.7.2. Race condition #

Functions pg_trace_start and pg_trace_stop can be used at the same time.

State is updated atomically (in lock), but these functions can operate at same time. It is up to end user to track their usage.

F.51.7.3. Privileges #

By default, pg_trace_start and pg_trace_stop are created without any privilege check. Allowing these function to be used by everyone can lead to security leaks and performance degradation.

To fix this change access privileges:

REVOKE EXECUTE ON FUNCTION pg_trace_start FROM public;
REVOKE EXECUTE ON FUNCTION pg_trace_stop FROM public;

GRANT EXECUTE ON FUNCTION pg_trace_start TO superuser;
GRANT EXECUTE ON FUNCTION pg_trace_stop TO superuser;

F.51.7.4. Plan memory #

Amount of memory to store query plan is set using pg_trace.trace_plan_mem and used to store serialized representation of query plan to pass to server. It is fixed, so sometimes this can be less than result size of serialized plan. In such cases plan is not set in trace JSON object.

I.e. pg_trace.trace_plan_mem set to 1 kB, but total serialized size of query plan is 1,5 kB. In this case, plan is not set in output JSON.