F.51. pg_trace#
F.51. pg_trace #
Real-time filtered query tracing for Tantor SE-1C
F.51.1. Overview #
pg_trace
- is a tracing library that allows
track executed queries in real-time including their resources,
timing and query plan.
Features:
Filtering queries using user defined conditions
Builtin tracing timeout after which it will be disabled
Support multiple simultaneous clients listening to traces
Multiple collected statistics about query: resources, timing, query plan
F.51.2. Getting started #
Add extension to shared_preload_libraries
in
postgresql.conf
:
shared_preload_libraries = 'pg_trace'
Create extension and start tracing
CREATE EXTENSION pg_trace; SELECT pg_trace_start();
Connect to socket. I.e. using nc
:
nc -U /tmp/pg_trace.sock
Run some queries:
SELECT 1;
You will see output like this:
{"traceid":0,"queryid":6865378226349601843,"dbid":5,"userid":10,"tuplescount":1,"start_time":1733221222489,"end_time":1733221222489,"total_time":0.00739,"startup_time":0,"sys_time":0,"user_time":0,"rows":1,"shared_blks_hit":0,"shared_blks_read":0,"shared_blks_written":0,"shared_blks_dirtied":0,"local_blks_hit":0,"local_blks_read":0,"local_blks_written":0,"local_blks_dirtied":0,"shared_blk_read_time":0,"shared_blk_write_time":0,"local_blk_read_time":0,"local_blk_write_time":0,"temp_blk_read_time":0,"temp_blk_write_time":0,"wal_records":0,"wal_fpi":0,"wal_bytes":0,"query":"select 1;"}
F.51.3. Functionality #
F.51.3.1. Tracing management #
To start tracing use function
pg_trace_start()
.
CREATE FUNCTION pg_trace_start( backend_pid int DEFAULT NULL, user_id oid DEFAULT NULL, database_id oid DEFAULT NULL, duration interval DEFAULT NULL, query_like text DEFAULT NULL, plan boolean DEFAULT FALSE, timeout interval DEFAULT NULL ) RETURNS boolean; -- example SELECT pg_trace_start( user_id := 123, query_like := 'SELECT *%', plan := TRUE, timeout := interval '1 hour' );
This function starts new tracing session. It accepts arguments:
Filtering attributes, if
NULL
filter is not applied:backend_pid
- query executed on this backenduser_id
- query executed by this effective userdatabase_id
- query executed on this databaseduration
- query executed longer than providedquery_like
- query must match template (usingLIKE
)
plan
- include query plan in output tracetimeout
- override default trace timeout with provided interval
Returns boolean
- tracing was enabled when
function run.
To stop tracing session run pg_trace_stop()
:
CREATE FUNCTION pg_trace_stop() RETURNS boolean; SELECT pg_trace_stop();
This function forcibly disables tracing. Returns
boolean
- tracing was enabled when function
run.
To check if tracing is enabled use
pg_trace_enabled
function:
CREATE FUNCTION pg_trace_enabled() RETURN boolean; SELECT pg_trace_enabled();
F.51.3.2. Observability #
To send traces extension uses background worker which here called server. Server collects some metrics and expose them using SQL functions:
Total amount of clients connected to server
CREATE FUNCTION @extschema@.pg_trace_metrics_clients_accepted() RETURNS bigint;
Total amount of clients disconnected from server
CREATE FUNCTION @extschema@.pg_trace_metrics_clients_disconnected() RETURNS bigint;
Total amount of times when WaitEventSet was recreated
CREATE FUNCTION @extschema@.pg_trace_metrics_wes_recreations() RETURNS bigint;
Total amount of traces was collected
CREATE FUNCTION @extschema@.pg_trace_metrics_traces_collected() RETURNS bigint;
Total amount of traces were declined because of lack of free space
CREATE FUNCTION @extschema@.pg_trace_metrics_traces_declined() RETURNS bigint;
Total amount of bytes send to all clients over socket
CREATE FUNCTION @extschema@.pg_trace_metrics_send_bytes() RETURNS bigint;
Total amount of times when sending data over socket returned error
CREATE FUNCTION @extschema@.pg_trace_metrics_send_errors() RETURNS bigint;
Description of
pg_trace_metrics_wes_recreations
see below
F.51.4. Client-Server protocol #
Communication between server (background worker) and client (application reading traces) setup in client/server manner.
Client connects to server UNIX socket (it must know its location). After connection client will start receiving messages. Each message represents one trace and contains 2 fields:
Length | Trace |
---|---|
4 bytes | N bytes |
Length
- length of Trace
field. This is stored as little-endian serialized integer.
Trace
- ASCII encoded JSON representation of
Trace object.
Schema of JSON message:
Key | Type | Description |
---|---|---|
queryid
|
number
| ID of query |
dbid
|
number
| OID of database |
userid
|
number
| OID of user |
tuplescount
|
number
| Total amount of tuples processed by query |
start_time
|
number
| UNIX timestamp of query execution start |
end_time
|
number
| UNIX timestamp of query execution end |
total_time
|
number
| amount of time query was executing in seconds |
startup_time
|
number
| startup time of query |
startup_time
|
number
| total time of query execution |
sys_time
|
number
| amount of time query was in kernel space |
user_time
|
number
| amount of time query was in user space |
rows
|
number
| Total number of rows retrieved or affected by the query |
shared_blks_hit
|
number
| Total number of shared block cache hits by the statement |
shared_blks_read
|
number
| Total number of shared blocks read by the statement |
shared_blks_written
|
number
| Total number of shared blocks written by the statement |
shared_blks_dirtied
|
number
| Total number of shared blocks dirtied by the statement |
local_blks_hit
|
number
| Total number of local block cache hits by the statement |
local_blks_read
|
number
| Total number of local blocks read by the statement |
local_blks_written
|
number
| Total number of local blocks written by the statement |
local_blks_dirtied
|
number
| Total number of local blocks dirtied by the statement |
local_blks_dirtied
|
number
| Total number of local blocks dirtied by the statement |
shared_blk_read_time
|
number
| Total time the statement spent reading data file shared blocks, in milliseconds |
shared_blk_write_time
|
number
| Total time the statement spent writing data file shared blocks, in milliseconds |
local_blk_read_time
|
number
| Total time the statement spent reading data file local blocks, in milliseconds |
local_blk_write_time
|
number
| Total time the statement spent writing data file local blocks, in milliseconds |
temp_blk_read_time
|
number
| Total time the statement spent reading temporary file blocks, in milliseconds |
temp_blk_write_time
|
number
| Total time the statement spent writing temporary file blocks, in milliseconds |
wal_records
|
number
| Total number of WAL records generated by the statement |
wal_fpi
|
number
| Total number of WAL full page images generated by the statement |
wal_bytes
|
number
| Total amount of WAL generated by the statement in bytes |
query
|
string
| Query string |
plan
|
plan object
| Query plan in EXPLAIN JSON format |
jit_functions
|
number
| Number of JIT emitted functions |
jit_generation_time
|
number
| Accumulated time to generate code using JIT |
jit_optimization_time
|
number
| Accumulated time to generate code using JIT |
jit_inlining_time
|
number
| Accumulated time used to inline JIT code |
jit_emission_time
|
number
| Accumulated time for JIT code emission |
Many fields provide time information. In this schema all time information provided as UNIX timestamp.
Plan is provided as JSON object generated using EXPLAIN.
F.51.5. Configuration #
pg_trace.max_clients
- max amount of clients
that can be processed simultaneously.
pg_trace.trace_plan_mem
- amount of memory used
to store query plan in text representation. Specified as memory,
with size suffix. I.e. 8 kB
. By default, set to
16 kB.
Note
This setting will affect result size of single Trace record. So changing this parameter will affect others.
pg_trace.traces_mem
- amount of memory to
allocate for traces in server background worker. Specified as
memory, with size suffix. I.e. 128 MB
. By
default, set to amount of memory to store exactly 5000 traces.
pg_trace.shm_queue_mem
- amount of memory to
allocate in shared memory for traces buffer. Specified as memory,
with size suffix. I.e. 32 MB
. By default, set
to amount of memory to store exactly 32 traces.
Note
If you will specify amount of memory less than size of 1
Trace you will get an error
.
traces_mem
and shm_queue_mem
depend on trace_plan_mem
setting.
pg_trace.unix_socket_path
- path to unix socket
for server. If socket already exists on it will be removed and new
socket will be created. By default,
/tmp/pg_trace.sock
.
pg_trace.trace_timeout
- default timeout for
tracing session after which trace session will be automatically
disabled. This values used if timeout
argument
in pg_trace_start
is not set. Specified as
time, with time suffix. I.e. 1h
. By default, 10
minutes.
pg_trace.update_delay
- amount of time to wait
before start processing traces from shared memory queue by
background worker. This time can give backends chance to write
some more traces in queue before background worker will take
exclusive lock. By default, 0
- no delay.
pg_trace.plan_format
- format of plan to pass
when ‘plan’ was requested when tracing session started. Accepted
values: text
, json
,
yaml
, xml
. If
json
is requested it’s JSON plan object will be
used, otherwise (for other formats) it will be represented as plan
string (possible escaped for JSON). By default,
text
format is used.
Parameters trace_timeout
,
update_delay
and plan_format
can be changed using configuration file update with
SIGHUP
sending. All others can not be changed
after postgres startup - you need to restart server.
F.51.6. How it works #
Terminology:
Server - background worker
Client - user which connects to socket and listens for traces
Backend - backend that executes queries and passes traces to background worker
Overall architecture and work principles:
----------------- Trace format ---------------- +---------------------+ | Instrumentation | | Timings | | Resources | +---------------------+ | Query plan | `pg_trace.plan_max` +---------------------+ ---------------- Architecture ---------------- +---------+ +---------+ +---------+ | backend | | backend | | backend | +---------+ +---------+ +---------+ \ | / +--------------------+--------------------+ | filtering | | V `pg_trace.shm_queue_max` +---------------------+ buffer +---+---+---+ | shared memory queue |--------------------| X | | | +---------------------+ +---+---+---+ | | | V `pg_trace.traces_mem` +---------------------+ ring buffer +---+---+---+---+---+ | background worker |---------------| X | X | X | X | | +---------------------+ +---+---+---+---+---+ | | | +------------------+------------------+ / | \ | | | V V V O O O /|\ /|\ /|\ / \ / \ / \ `pg_trace.max_clients`
When query executed and passes all passed filters it sent to
background worker using shared memory queue. It contains buffer
with pg_trace.shm_queue_max
size.
After that background worker copies all traces in shm queue buffer
to it’s own buffer. Traces are stored in ring buffer. It is
allocated at server startup with size
pg_trace.traces_mem
. Current implementation
allows all operations (freeing, searching, inserting, etc…) take
O(1) complexity, so feel free to increase this parameter.
When new traces inserted they are sent to clients. Max amount of
clients that can be processed simultaneously set using
pg_trace.max_clients
parameter.
Server is implemented using non-blocking sockets. It uses
WaitEventSet
PostgreSQL functionality to
implement this, so all platforms must be supported. Also, this
means that for new clients the same background worker is used to
send data - feel free to increase max clients amount.
F.51.7. Known issues #
F.51.7.1. WaitEventSet recreation #
To trace asynchronous requests WaitEventSet
is used. When new client is connected its socket is added to
this set.
When client is disconnected we need to disable his tracking. Sadly, currently there is no way to remove client from set and add new later, so we just stop tracking him. Sooner or later free space will end.
To handle this situation this WaitEventSet is recreated. This operation is not free, because we need to close old file descriptor, create new and traverse through all clients and filter out active.
Amount of times we need to recreate
WaitEventSet
can be decreased by increasing
pg_trace.max_clients
setting.
Metric function
pg_trace_metrics_wes_recreations
returns
amount of times this WaitEventSet was recreated.
F.51.7.2. Race condition #
Functions pg_trace_start
and
pg_trace_stop
can be used at the same time.
State is updated atomically (in lock), but these functions can operate at same time. It is up to end user to track their usage.
F.51.7.3. Privileges #
By default, pg_trace_start
and
pg_trace_stop
are created without any
privilege check. Allowing these function to be used by everyone
can lead to security leaks and performance degradation.
To fix this change access privileges:
REVOKE EXECUTE ON FUNCTION pg_trace_start FROM public; REVOKE EXECUTE ON FUNCTION pg_trace_stop FROM public; GRANT EXECUTE ON FUNCTION pg_trace_start TO superuser; GRANT EXECUTE ON FUNCTION pg_trace_stop TO superuser;
F.51.7.4. Plan memory #
Amount of memory to store query plan is set using
pg_trace.trace_plan_mem
and used to store
serialized representation of query plan to pass to server. It is
fixed, so sometimes this can be less than result size of
serialized plan. In such cases plan is not set in
trace JSON object.
I.e. pg_trace.trace_plan_mem
set to 1 kB, but
total serialized size of query plan is 1,5 kB. In this case,
plan is not set in output JSON.