F.65. transp_anon#
F.65. transp_anon #
transp_anon
(transparent anonymization)
is an extension to mask or replace personally identifiable
information or commercially sensitive data from a PostgreSQL database.
F.65.1. Definitions of the terms used in this project #
The data can be altered with several techniques:
Deletion or Nullification simply removes data.
Static Substitution consistently replaces the data with a generic value. For instance: replacing all values of a TEXT column with the value “CONFIDENTIAL”.
Variance is the action of “shifting” dates and numeric values. For example, by applying a +/- 10% variance to a salary column, the dataset will remain meaningful.
Generalization reduces the accuracy of the data by replacing it with a range of values. Instead of saying “Bob is 28 years old”, you can say “Bob is between 20 and 30 years old”. This is useful for analytics because the data remains true.
Shuffling mixes values within the same columns. This method is open to being reversed if the shuffling algorithm can be deciphered.
Randomization replaces sensitive data with random-but-plausible values. The goal is to avoid any identification from the data record while remaining suitable for testing, data analysis and data processing.
Partial scrambling is similar to static substitution but leaves out some part of the data. For instance : a credit card number can be replaced by “40XX XXXX XXXX XX96”
Custom rules are designed to alter data following specific needs. For instance, randomizing simultaneously a zipcode and a city name while keeping them coherent.
Pseudonymization is a way to protect personal information by hiding it using additional information. Encryption and Hashing are two examples of pseudonymization techniques. However a pseudonymizated data is still linked to the original data.
F.65.2. Anonymization & Data Masking for PostgreSQL #
The project has a declarative approach of anonymization. This means you can declare the masking rules using the PostgreSQL Data Definition Language (DDL) and specify your anonymization strategy inside the table definition itself.
Once the maskings rules are defined, you can access the anonymized data using: * Dynamic Masking: Hide personal information only for the masked users
In addition, various Masking Functions are available : randomization, faking, partial scrambling, shuffling, noise or your own custom function!
Beyond masking, it is also possible to use a fourth approach called Generalization which is perfect for statistics and data analytics.
F.65.2.1. Example #
=# SELECT * FROM people; id | firstname | lastname | phone ----+-----------+----------+------------ T1 | Sarah | Conor | 0609110911
Step 1 : Activate the dynamic masking engine
=# CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE; =# SELECT transp_anon.init();
Step 2 : Declare a masked user and give it permissions to select data
=# CREATE ROLE skynet LOGIN; =# GRANT SELECT ON TABLE people TO skynet; =# SECURITY LABEL FOR transp_anon ON ROLE skynet IS 'MASKED';
Step 3 : Declare the masking rules
=# SECURITY LABEL FOR transp_anon ON COLUMN people.lastname -# IS 'MASKED WITH FUNCTION transp_anon.fake_last_name()'; =# SECURITY LABEL FOR transp_anon ON COLUMN people.phone -# IS 'MASKED WITH FUNCTION transp_anon.partial(phone,2,$$******$$,2)';
Step 4 : Connect with the masked user
=# \connect - skynet => SELECT * FROM people; id | firstname | lastname | phone ----+-----------+-----------+------------ T1 | Sarah | Stranahan | 06******11
F.65.3. Configuration #
The extension has currently a few options that be defined for the entire instance (
inside postgresql.conf
or with ALTER SYSTEM
).
It is also possible and often a good idea to define them at the database level like this:
ALTER DATABASE people SET transp_anon.restrict_to_trusted_schemas = on;
Only superuser can change the parameters below :
F.65.3.1. transp_anon.enabled #
Type | Boolean |
Default value | “true” |
Visible | only to superusers |
Allows to globally enable or disable transp_anon.
F.65.3.2. transp_anon.inherit_labels #
Type | Boolean |
Default value | “true” |
Visible | only to superusers |
Inherit security labels from relation ancestors (partitioned tables and inheritance tables)if any.
F.65.3.3. transp_anon.algorithm #
Type | Text |
Default value | “sha256” |
Visible | only to superusers |
This is the hashing method used by pseudonymizing functions. Checkout the pgcrypto documentation for the list of available options.
See transp_anon.salt
to learn why this parameter is a very
sensitive information.
F.65.3.4. transp_anon.restrict_to_trusted_schemas #
Type | Boolean |
Default value | off |
Visible | to all users |
By enabling this parameter, masking rules must be defined using functions located
in a limited list of namespaces. By default, only the transp_anon
schema is trusted.
This improves security by preventing users from declaring their custom masking filters. This also means that the schema must be explicit inside the masking rules.
For more details, check out the Write your own masks section of the Masking functions chapter.
F.65.3.5. transp_anon.salt #
Type | Text |
Default value | (empty) |
Visible | only to superusers |
This is the salt used by pseudonymizing functions. It is very important to define a custom salt for each database like this:
ALTER DATABASE foo SET transp_anon.salt = 'This_Is_A_Very_Secret_Salt';
If a masked user can read the salt, he/she can run a brute force attack to retrieve the original data based on the 3 elements:
The pseudonymized data
The hashing algorithm (see
transp_anon.algorithm
)The salt
The salt and the name of the hashing algorithm should be protected with the same
level of security that the data itself. This is why you should store the salt
directly within the database with ALTER DATABASE
.
F.65.4. Declare masking rules #
The main idea of this extension is to offer anonymization by design.
The data masking rules should be written by the people who develop the application because they have the best knowledge of how the data model works. Therefore masking rules must be implemented directly inside the database schema.
This allows to mask the data directly inside the PostgreSQL instance without using an external tool and thus limiting the exposure and the risks of data leak.
The data masking rules are declared simply by using SECURITY LABEL:
CREATE TABLE player( id SERIAL, name TEXT, points INT); INSERT INTO player VALUES ( 1, 'Kareem Abdul-Jabbar', 38387), ( 5, 'Michael Jordan', 32292 ); SECURITY LABEL FOR transp_anon ON COLUMN player.name IS 'MASKED WITH FUNCTION transp_anon.fake_last_name()'; SECURITY LABEL FOR transp_anon ON COLUMN player.id IS 'MASKED WITH VALUE NULL';
F.65.4.1. Escaping String literals #
As you may have noticed the masking rule definitions are placed between single quotes. Therefore if you need to use a string inside a masking rule, you need to use C-Style escapes like this:
SECURITY LABEL FOR transp_anon ON COLUMN player.name IS E'MASKED WITH VALUE \'CONFIDENTIAL\'';
Or use dollar quoting which is easier to read:
SECURITY LABEL FOR transp_anon ON COLUMN player.name IS 'MASKED WITH VALUE $$CONFIDENTIAL$$';
F.65.4.2. Listing masking rules #
To display all the masking rules declared in the current database, check out the
transp_anon.pg_masking_rules
:
SELECT * FROM transp_anon.pg_masking_rules;
F.65.4.3. Debugging masking rules #
When an error occurs to due a wrong masking rule, you can get more detailed
information about the problem by setting client_min_messages
to
DEBUG
and you will get useful details
SET client_min_messages=DEBUG;
F.65.4.4. Removing a masking rule #
You can simply erase a masking rule like this:
SECURITY LABEL FOR transp_anon ON COLUMN player.name IS NULL;
To remove all rules at once, you can use:
SELECT transp_anon.remove_masks_for_all_columns();
F.65.4.5. Limitations #
Masking rules are forbidden for views CREATE VIEW
Sampling doesn't work TABLESAMPLE methods
F.65.5. Masking functions #
The extension provides functions to implement the following main anonymization strategies:
Also you may use additional masking functions or write your own :
Depending on your data, you may need to use different strategies on different columns :
For names and other “direct identifiers” , Faking is often useful
Shuffling is convenient for foreign keys
Adding Noise is interesting for numeric values and dates
Partial Scrambling is perfect for email address and phone numbers
etc.
F.65.5.1. Destruction #
First of all, the fastest and safest way to anonymize a data is to destroy it.
In many cases, the best approach to hide the content of a column is to replace all the values with a single static value.
For instance, you can replace a entire column by the word “CONFIDENTIAL” like this:
SECURITY LABEL FOR transp_anon ON COLUMN users.address IS 'MASKED WITH VALUE ''CONFIDENTIAL'' ';
F.65.5.2. Adding Noise #
This is also called Variance. The idea is to “shift” dates and numeric values. For example, by applying a +/- 10% variance to a salary column, the dataset will remain meaningful.
transp_anon.noise(original_value,ratio)
where original_value can be aninteger
, abigint
or adouble precision
. If the ratio is 0.33, the return value will be the original value randomly shifted with a ratio of +/- 33%transp_anon.dnoise(original_value, interval)
where original_value can be a date, a timestamp, or a time. If interval = “2 days”, the return value will be the original value randomly shifted by +/- 2 days
WARNING : The noise()
masking functions are
vulnerable to a form of repeat attack, especially with Dynamic Masking. A masked user can
guess an original value by requesting its masked value multiple times and then
simply use the AVG()
function to get a close approximation.
(See Noise Reduction Attack for more details).
In a nutshell, these
functions are best fitted for Anonymous Dumps and Static Masking. They should be
avoided when using Dynamic
Masking.
F.65.5.3. Randomization #
The extension provides a large choice of functions to generate purely random data:
F.65.5.3.1. Basic Random values #
transp_anon.random_date()
returns a datetransp_anon.random_string(n)
returns a TEXT value containingn
letterstransp_anon.random_zip()
returns a 5-digit codetransp_anon.random_phone(p)
returns a 8-digit phone withp
as a prefixtransp_anon.random_hash(seed)
returns a hash of a random string for a given seed
F.65.5.3.2. Random between #
To pick any value inside between two bounds:
transp_anon.random_date_between(d1,d2)
returns a date betweend1
andd2
transp_anon.random_int_between(i1,i2)
returns an integer betweeni1
andi2
transp_anon.random_bigint_between(b1,b2)
returns a bigint betweenb1
andb2
NOTE: With these functions, the lower and upper bounds are
included. For instance transp_anon.random_int_between(1,3)
returns either 1, 2 or 3.
For more advanced interval descriptions, check out the Random in Range section.
F.65.5.3.3. Random in Array #
The random_in
function returns an element a given array
For example:
transp_anon.random_in(ARRAY[1,2,3])
returns an int between 1 and 3transp_anon.random_in(ARRAY['red','green','blue'])
returns a text
F.65.5.3.4. Random in Enum #
This is one especially useful when working with ENUM
types!
transp_anon.random_in_enum(variable_of_an_enum_type)
returns any val
CREATE TYPE card AS ENUM ('visa', 'mastercard', ‘amex’); SELECT transp_anon.random_in_enum(NULL::CARD); random_in_enum ---------------- mastercard CREATE TABLE customer ( id INT, ... credit_card CARD ); SECURITY LABEL FOR transp_anon ON COLUMN customer.creditcard IS 'MASKED WITH FUNCTION transp_anon.random_in_enum(creditcard)'
F.65.5.3.5. Random in Range #
RANGE typesare a powerfull way to describe an interval of values, where can define inclusive or excluvive bounds, see examples.
There a function for each subtype of range:
transp_anon.random_in_int4range('[5,6)')
returns an INT of value 5transp_anon.random_in_int8range('(6,7]')
returns a BIGINT of value 7`transp_anon.random_in_numrange(“[0.1,0.9]”) returns a NUMERIC between 0.1 and 0.9
transp_anon.random_in_daterange('[2001-01-01, 2001-12-31)')
returns a date in 2001transp_anon.random_in_tsrange('[2022-10-01,2022-10-31]')
returns a TIMESTAMP in october 2022transp_anon.random_in_tstzrange('[2022-10-01,2022-10-31]')
returns a TIMESTAMP WITH TIMEZONE in october 2022
NOTE: It is not possible to get a random value from a RANGE
with an infinite bound. For example
transp_anon.random_in_int4range('[2022,)')
returns NULL.
F.65.5.4. Faking #
The idea of Faking is to replace sensitive data with random-but-plausible values. The goal is to avoid any identification from the data record while remaining suitable for testing, data analysis and data processing.
In order to use the faking functions, you have to init()
the
extension in your database first:
SELECT transp_anon.init();
The init()
function will import a default dataset of random
data (iban, names, cities, etc.).
Note
This dataset is in Russian and very small ( 1000 values for each category ). If you want to use localized data or load a specific dataset, please read the Custom Fake Data section.
Once the fake data is loaded, you have access to these faking functions:
transp_anon.fake_address()
returns a complete post addresstransp_anon.fake_city()
returns an existing citytransp_anon.fake_country()
returns a countrytransp_anon.fake_company()
returns a generic company nametransp_anon.fake_email()
returns a valid email addresstransp_anon.fake_first_name()
returns a generic first nametransp_anon.fake_iban()
returns a valid IBANtransp_anon.fake_last_name()
returns a generic last nametransp_anon.fake_postcode()
returns a valid zipcode
For TEXT and VARCHAR columns, you can use the classic Lorem Ipsum generator:
transp_anon.lorem_ipsum()
returns 5 paragraphstransp_anon.lorem_ipsum(2)
returns 2 paragraphstransp_anon.lorem_ipsum( paragraphs := 4 )
returns 4 paragraphstransp_anon.lorem_ipsum( words := 20 )
returns 20 wordstransp_anon.lorem_ipsum( characters := 7 )
returns 7 characterstransp_anon.lorem_ipsum( characters := transp_anon.length(table.column) )
returns the same amount of characters as the original string
F.65.5.5. Pseudonymization #
Pseudonymization is similar to Faking in the sense that it generates realistic values. The main difference is that the pseudonymization is deterministic : the functions always will return the same fake value based on a seed and an optional salt.
In order to use the faking functions, you have to init()
the
extension in your database first:
SELECT transp_anon.init();
Once the fake data is loaded you have access to 10 pseudo functions:
transp_anon.pseudo_first_name(seed,salt)
returns a generic first nametransp_anon.pseudo_last_name(seed,salt)
returns a generic last nametransp_anon.pseudo_email(seed,salt)
returns a valid email addresstransp_anon.pseudo_city(seed,salt)
returns an existing citytransp_anon.pseudo_country(seed,salt)
returns a countrytransp_anon.pseudo_company(seed,salt)
returns a generic company nametransp_anon.pseudo_iban(seed,salt)
returns a valid IBAN
The second argument (salt
) is optional. You can call each
function with only the seed like this transp_anon.pseudo_city('bob')
.
The salt is here to increase complexity and avoid dictionary and brute force attacks
(see warning below). If a specific salt is not given, the value of the
transp_anon.salt
GUC parameter is used instead (see the Generic Hashing section for more
details).
The seed can be any information related to the subject. For instance, we can consistently generate the same fake email address for a given person by using her login as the seed :
SECURITY LABEL FOR transp_anon ON COLUMN users.emailaddress IS 'MASKED WITH FUNCTION transp_anon.pseudo_email(users.login) ';
NOTE : You may want to produce unique values using a
pseudonymization function. For instance, if you want to mask an email
column that is declared as UNIQUE
. In this case, you will need to
initialize the extension with a fake dataset that is way bigger
than the numbers of rows of the table. Otherwise you may see some “collisions”
happening, i.e. two different original values producing the same pseudo value.
WARNING : Pseudonymization is often confused with anonymization
but in fact they serve 2 different purposes : pseudonymization
is
a way to protect the personal information but the pseudonymized
data is still “linked” to the real data. The GDPR makes it very clear
that personal data which has undergone pseudonymization is still related to a
person. (see GDPR
Recital 26)
F.65.5.6. Generic hashing #
In theory, hashing is not a valid anonymization technique, however in practice it is sometimes necessary to generate a determinist hash of the original data.
For instance, when a pair of primary key / foreign key is a “natural key”, it may contain actual information ( like a customer number containing a birth date or something similar).
Hashing such columns allows to keep referential integrity intact even for relatively unusual source data. Therefore, the
transp_anon.digest(value,salt,algorithm)
lets you choose a salt, and a hash algorithm from a pre-defined listtransp_anon.hash(value)
will return a text hash of the value using a secret salt (defined by thetransp_anon.salt
parameter) and hash algorithm (defined by thetransp_anon.algorithm
parameter). The default value oftransp_anon.algorithm
issha256
and possible values are: md5, sha1, sha224, sha256, sha384 or sha512. The default value oftransp_anon.salt
is an empty string. You can modify these values with:ALTER DATABASE foo SET transp_anon.salt TO 'xsfnjefnjsnfjsnf'; ALTER DATABASE foo SET transp_anon.algorithm TO 'sha384';
Keep in mind that hashing is a form a Pseudonymization. This means that
the data can be “de-anonymized” using the hashed value and the masking
function. If an attacker gets access to these 2 elements, he or she could
re-identify some persons using brute force
or dictionary
attacks. Therefore, the salt and the algorithm used to hash
the data must be protected with the same level of security that
the original dataset.
In a nutshell, we recommend that you use the transp_anon.hash()
function rather than transp_anon.digest()
because the salt will
not appear clearly in the masking rule.
Furthermore: in practice the hash function will return a long string of character like this:
SELECT transp_anon.hash('bob'); hash ---------------------------------------------------------------------------------------------------------------------------------- 95b6accef02c5a725a8c9abf19ab5575f99ca3d9997984181e4b3f81d96cbca4d0977d694ac490350e01d0d213639909987ef52de8e44d6258d536c55e427397
For some columns, this may be too long and you may have to cut some parts the hash in order to fit into the column. For instance, if you have a foreign key based on a phone number and the column is a VARCHAR(12) you can transform the data like this:
SECURITY LABEL FOR transp_anon ON COLUMN people.phone_number IS 'MASKED WITH FUNCTION transp_anon.left(transp_anon.hash(phone_number),12)'; SECURITY LABEL FOR transp_anon ON COLUMN call_history.fk_phone_number IS 'MASKED WITH FUNCTION transp_anon.left(transp_anon.hash(fk_phone_number),12)';
Of course, cutting the hash value to 12 characters will increase the risk of “ collision” (2 different values having the same fake hash). In such case, it’s up to you to evaluate this risk.
F.65.5.7. Partial Scrambling #
Partial scrambling leaves out some part of the data. For instance : a credit card number can be replaced by “40XX XXXX XXXX XX96”.
2 functions are available:
transp_anon.partial('abcdefgh',1,'xxxx',3)
will return “ axxxxfgh”;transp_anon.partial_email('[email protected]')
will become “da******@gm******.com”
F.65.5.8. Conditional Masking #
In some situations, you may want to apply a masking filter only for some value or for a limited number of lines in the table.
For instance, if you want to “preserve NULL
values”, i.e. masking only the lines that contains a value, you can use
the transp_anon.ternary
function, which works like a CASE
WHEN x THEN y ELSE z
statement:
SECURITY LABEL FOR transp_anon ON COLUMN player.score IS 'MASKED WITH FUNCTION transp_anon.ternary(score IS NULL, NULL, transp_anon.random_int_between(0,100));
You may also want to exclude some lines within the table. Like keeping the password of some users so that they still may be able to connect to a testing deployment of your application:
SECURITY LABEL FOR transp_anon ON COLUMN account.password IS 'MASKED WITH FUNCTION transp_anon.ternary( id > 1000, NULL::TEXT, password)';
WARNING : Conditional masking may create a partially deterministic “connection” between the original data and the masked data. And that connection can be used to retrieve personal information from the masked data. For instance, if NULL values are preserved for a “deceased_date” column, it will reveal which persons are still actually alive… In a nutshell: conditional masking may often produce a dataset that is not fully anonymized and therefore would still technically contain personal information.
F.65.5.9. Generalization #
Generalization is the principle of replacing the original value by a range containing this value. For instance, instead of saying “Paul is 42 years old”, you would say “Paul is between 40 and 50 years old”.
Note
The generalization functions are a data type transformation. Therefore it is not possible to use them with the dynamic masking engine. However they are useful to create anonymized views. See example below.
Let’s imagine a table containing health information:
SELECT * FROM patient; id | name | zipcode | birth | disease ----+----------+----------+------------+--------------- 1 | Alice | 47678 | 1979-12-29 | Heart Disease 2 | Bob | 47678 | 1959-03-22 | Heart Disease 3 | Caroline | 47678 | 1988-07-22 | Heart Disease 4 | David | 47905 | 1997-03-04 | Flu 5 | Eleanor | 47909 | 1999-12-15 | Heart Disease 6 | Frank | 47906 | 1968-07-04 | Cancer 7 | Geri | 47605 | 1977-10-30 | Heart Disease 8 | Harry | 47673 | 1978-06-13 | Cancer 9 | Ingrid | 47607 | 1991-12-12 | Cancer
We can build a view upon this table to suppress some columns ( SSN
and name
) and generalize the zipcode and the birth date like
this:
CREATE VIEW anonymized_patient AS SELECT 'REDACTED' AS lastname, transp_anon.generalize_int4range(zipcode,100) AS zipcode, transp_anon.generalize_tsrange(birth,'decade') AS birth disease FROM patients;
The anonymized table now looks like that:
SELECT * FROM anonymized_patient; lastname | zipcode | birth | disease ----------+---------------+-----------------------------+--------------- REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Heart Disease REDACTED | [47600,47700) | ["1950-01-01","1960-01-01") | Heart Disease REDACTED | [47600,47700) | ["1980-01-01","1990-01-01") | Heart Disease REDACTED | [47900,48000) | ["1990-01-01","2000-01-01") | Flu REDACTED | [47900,48000) | ["1990-01-01","2000-01-01") | Heart Disease REDACTED | [47900,48000) | ["1960-01-01","1970-01-01") | Cancer REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Heart Disease REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Cancer REDACTED | [47600,47700) | ["1990-01-01","2000-01-01") | Cancer
The generalized values are still useful for statistics because they remain true, but they are less accurate, and therefore reduce the risk of re-identification.
PostgreSQL offers several RANGE types which are perfect for dates and numeric values.
For numeric values, 3 functions are available:
generalize_int4range(value, step)
generalize_int8range(value, step)
generalize_numrange(value, step)
…where value
is the data that will be generalized, and
step
is the size of each range.
F.65.5.10. Shuffling #
Shuffling mixes values within the same columns.
anon.shuffle_column(shuffle_table, shuffle_column, primary_key)
will rearrange all values in a given column. You need to provide a primary key of the table.
This is useful for foreign keys because referential integrity will be kept.
F.65.5.11. Using pg_catalog
functions #
Since version 1.3, the pg_catalog
schema is not trusted by
default. This is a security measure designed to prevent users from using
sophisticated functions in masking rules (such as pg_catalog.query_to_xml
, pg_catalog.ts_stat
or the
system administration functions) that should not be used as masking functions.
However, the extension provides bindings to some useful and safe functions from
the pg_catalog
schema for your convenience:
transp_anon.concat(TEXT,TEXT)
transp_anon.concat(TEXT,TEXT, TEXT)
transp_anon.date_add(TIMESTAMP WITH TIME ZONE,INTERVAL)
transp_anon.date_part(TEXT,TIMESTAMP)
transp_anon.date_part(TEXT,INTERVAL)
transp_anon.date_subtract(TIMESTAMP WITH TIME ZONE, INTERVAL )
transp_anon.date_trunc(TEXT,TIMESTAMP)
transp_anon.date_trunc(TEXT,TIMESTAMP WITH TIME ZONE,TEXT)
transp_anon.date_trunc(TEXT,INTERVAL)
transp_anon.left(TEXT,INTEGER)
transp_anon.length(TEXT)
transp_anon.lower(TEXT)
transp_anon.make_date(INT,INT,INT )
transp_anon.make_time(INT,INT,DOUBLE PRECISION)
transp_anon.md5(TEXT)
transp_anon.random()
transp_anon.replace(TEXT,TEXT,TEXT)
transp_anon.regexp_replace(TEXT,TEXT,TEXT)
transp_anon.regexp_replace(TEXT,TEXT,TEXT,TEXT)
transp_anon.right(TEXT,INTEGER)
transp_anon.substr(TEXT,INTEGER)
transp_anon.substr(TEXT,INTEGER,INTEGER)
transp_anon.upper(TEXT)
If you need more bindings, you can either
Write your own mapping function in a trusted schema (see below)
Set the
pg_catalog
schema asTRUSTED
(not recommended)open an issue
F.65.5.12. Write your own Masks #
You can also use your own function as a mask. The function must either be destructive (like Partial Scrambling) or insert some randomness in the dataset (like Faking).
Especially for complex data types, you may have to write your own function. This will be a common use case if you have to hide certain parts of a JSON field.
For example:
CREATE TABLE company ( business_name TEXT, info JSONB )
The info
field contains unstructured data like this:
SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green'; jsonb_pretty ---------------------------------- { "employees": [ { "lastName": "Doe", "firstName": "John" }, { "lastName": "Smith", "firstName": "Anna" }, { "lastName": "Jones", "firstName": "Peter" } ] } (1 row)
Using the PostgreSQL JSON functions and operators you can walk through the keys and replace the sensitive values as needed.
CREATE SCHEMA custom_masks; -- This step requires superuser privilege SECURITY LABEL FOR transp_anon ON SCHEMA custom_masks IS 'TRUSTED'; CREATE FUNCTION custom_masks.remove_last_name(j JSONB) RETURNS JSONB VOLATILE LANGUAGE SQL AS $func$ SELECT json_build_object( 'employees' , array_agg( jsonb_set(e ,'{lastName}', to_jsonb(transp_anon.fake_last_name())) ) )::JSONB FROM jsonb_array_elements( j->'employees') e $func$;
Then check that the function is working correctly:
SELECT custom_masks.remove_last_name(info) FROM company;
When that’s ok you can declare this function as the mask of the info
field:
SECURITY LABEL FOR transp_anon ON COLUMN company.info IS 'MASKED WITH FUNCTION custom_masks.remove_last_name(info)';
Select the data:
# SELECT anonymize_table('company'); # SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green'; jsonb_pretty ------------------------------------- { "employees": [ + { + "lastName": "Prawdzik",+ "firstName": "John" + }, + { + "lastName": "Baltazor",+ "firstName": "Anna" + }, + { + "lastName": "Taylan", + "firstName": "Peter" + } + ] + } (1 row)
As you can see, manipulating a sophisticated JSON structure with SQL is possible, but it can be tricky at first! There are multiple ways of walking through the keys and updating values. You will probably have to try different approaches, depending on your real JSON data and the performance you want to reach.
F.65.6. Dynamic masking #
You can hide some data from a role by declaring this role as a MASKED
one. Other roles will still access the original data.
Example:
CREATE TABLE people ( id TEXT, firstname TEXT, lastname TEXT, phone TEXT); INSERT INTO people VALUES ('T1','Sarah', 'Conor','0609110911'); SELECT * FROM people; =# SELECT * FROM people; id | firstname | lastname | phone ----+----------+----------+------------ T1 | Sarah | Conor | 0609110911 (1 row)
Step 1 : Activate the dynamic masking engine
=# CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE; =# SELECT transp_anon.init();
Step 2 : Declare a masked user
=# CREATE ROLE skynet LOGIN; =# GRANT SELECT ON TABLE people TO skynet; =# SECURITY LABEL FOR transp_anon ON ROLE skynet IS 'MASKED';
Step 3 : Declare the masking rules
SECURITY LABEL FOR transp_anon ON COLUMN people.lastname IS 'MASKED WITH FUNCTION transp_anon.fake_last_name()'; SECURITY LABEL FOR transp_anon ON COLUMN people.phone IS 'MASKED WITH FUNCTION transp_anon.partial(phone,2,$$******$$,2)';
Step 4 : Connect with the masked user
=# \c - skynet => SELECT * FROM people; id | firstname | lastname | phone ----+----------+-----------+------------ T1 | Sarah | Stranahan | 06******11 (1 row)
F.65.6.1. How to unmask a role #
Simply remove the security label like this:
SECURITY LABEL FOR transp_anon ON ROLE bob IS NULL;
To unmask all masked roles at once you can type:
SELECT transp_anon.remove_masks_for_all_roles();
F.65.7. Generalization #
F.65.7.1. Reducing the accuracy of sensitive data #
The idea of generalization is to replace data with a broader, less accurate value. For instance, instead of saying “Bob is 28 years old”, you can say “Bob is between 20 and 30 years old”. This is interesting for analytics because the data remains true while avoiding the risk of re-identification.
PostgreSQL can handle generalization very easily with the RANGE types, a very powerful way to store and manipulate a set of values contained between a lower and an upper bound.
F.65.7.2. Example #
Here’s a basic table containing medical data:
# SELECT * FROM patient; ssn | firstname | zipcode | birth | disease -------------+-----------+---------+------------+--------------- 253-51-6170 | Alice | 47012 | 1989-12-29 | Heart Disease 091-20-0543 | Bob | 42678 | 1979-03-22 | Allergy 565-94-1926 | Caroline | 42678 | 1971-07-22 | Heart Disease 510-56-7882 | Eleanor | 47909 | 1989-12-15 | Acne 098-24-5548 | David | 47905 | 1997-03-04 | Flu 118-49-5228 | Jean | 47511 | 1993-09-14 | Flu 263-50-7396 | Tim | 47900 | 1981-02-25 | Heart Disease 109-99-6362 | Bernard | 47168 | 1992-01-03 | Asthma 287-17-2794 | Sophie | 42020 | 1972-07-14 | Asthma 409-28-2014 | Arnold | 47000 | 1999-11-20 | Diabetes (10 rows)
We want the anonymized data to remain true because it will be used for statistics. We can build a view upon this table to remove useless columns and generalize the indirect identifiers:
CREATE MATERIALIZED VIEW generalized_patient AS SELECT 'REDACTED'::TEXT AS firstname, transp_anon.generalize_int4range(zipcode,1000) AS zipcode, transp_anon.generalize_daterange(birth,'decade') AS birth, disease FROM patient;
This will give us a less accurate view of the data:
# SELECT * FROM generalized_patient; firstname | zipcode | birth | disease -----------+---------------+-------------------------+--------------- REDACTED | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease REDACTED | [42000,43000) | [1970-01-01,1980-01-01) | Allergy REDACTED | [42000,43000) | [1970-01-01,1980-01-01) | Heart Disease REDACTED | [47000,48000) | [1980-01-01,1990-01-01) | Acne REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Flu REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Flu REDACTED | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Asthma REDACTED | [42000,43000) | [1970-01-01,1980-01-01) | Asthma REDACTED | [47000,48000) | [1990-01-01,2000-01-01) | Diabetes (10 rows)
F.65.7.3. Generalization Functions #
PostgreSQL Anonymizer provides 6 generalization functions. One for each RANGE type. Generally these functions take the original value as the first parameter, and a second parameter for the length of each step.
For numeric values :
transp_anon.generalize_int4range(42,5)
returns the range[40,45)
transp_anon.generalize_int8range(12345,1000)
returns the range[12000,13000)
transp_anon.generalize_numrange(42.32378,10)
returns the range[40,50)
For time values :
transp_anon.generalize_tsrange('1904-11-07','year')
returns['1904-01-01','1905-01-01')
transp_anon.generalize_tstzrange('1904-11-07','week')
returns['1904-11-07','1904-11-14')
transp_anon.generalize_daterange('1904-11-07','decade')
returns[1900-01-01,1910-01-01)
The possible steps are : microseconds, milliseconds, second, minute, hour, day, week, month, year, decade, century and millennium.
F.65.7.4. Limitations #
F.65.7.4.1. Singling out and extreme values #
“Singling Out” is the possibility to isolate an individual in a dataset by using extreme value or exceptional values.
For example:
# SELECT * FROM employees; id | name | job | salary ------+----------------+------+-------- 1578 | xkjefus3sfzd | NULL | 1498 2552 | cksnd2se5dfa | NULL | 2257 5301 | fnefckndc2xn | NULL | 45489 7114 | npodn5ltyp3d | NULL | 1821
In this table, we can see that a particular employee has a very high salary, very far from the average salary. Therefore this person is probably the CEO of the company.
With generalization, this is important because the size of the range (the “ step”) must be wide enough to prevent the identification of one single individual.
WARNING : By definition, with generalization the data remains true, but the column type is changed. This means that the transformation is not transparent.
F.65.8. Custom Fake Data #
This extension is delivered with a small set of fake data by default. For each fake
function ( fake_email()
, fake_first_name()
) we
provide only 1000 unique values, and they are only in Russian.
Here’s how you can create your own set of fake data!
F.65.8.1. Generate your own fake dataset #
As an example, here’s a python script that will generate fake data for you:
$TANTOR_DIR/utils/data_generator/data_generator.py
This script requires library Faker. You can install it with:
pip3 install faker
To produce 5000 emails in Russian & English, you’d call the scripts like this:
data_generator.py --table email --locales ru_RU,en --lines 5000
This will output the fake data in CSV
format.
Use data_generator.py --help
for more details about the script
parameters.
You can load the fake data directly into the extension like this:
TRUNCATE transp_anon.email; COPY transp_anon.email FROM PROGRAM 'data_generator.py --table email --locales ru_RU,en --lines 5000'; SELECT setval('transp_anon.email_oid_seq', max(oid)) FROM transp_anon.email; CLUSTER transp_anon.email;
F.65.8.2. Load your own fake data #
If you want to use your own dataset, you can import custom CSV files with :
SELECT transp_anon.init('/path/to/custom_csv_files/')
Look at the data
folder to find the format of the CSV files.
F.65.9. Performances #
Any anonymization process has a price as it will consume CPU time, RAM space and probably a bunch of disk I/O… Here’s a a quick overview of the question depending on what strategy you are using….
In a nutshell, the anonymization performances will mainly depend on 2 important factors:
The size of the database
The number of masking rules
F.65.9.1. Dynamic Masking #
With dynamic masking, the real data is replaced on-the-fly every time a masked user sends a query to the database. This means that the masking users will have slower response time than regular (unmasked) users. This is generally ok because usually masked users are not considered as important as the regular ones.
If you apply 3 or 4 rules to a table, the response time for the masked users should approx. 20% to 30% slower than for the normal users.
As the masking rules are applied for each queries of the masked users, the dynamic masking is appropriate when you have a limited number of masked users that connect only from time to time to the database. For instance, a data analyst connecting once a week to generate a business report.
If there are multiple masked users or if a masked user is very active, you should probably export the masked data once-a-week on a secondary instance and let these users connect to this secondary instance.
Note
In this case, the cost of anonymization is “paid” only by the masked users.
F.65.9.2. How to speed things up ? #
F.65.9.2.1. Prefer MASKED WITH VALUE
whenever possible #
It is always faster to replace the original data with a static value instead of calling a masking function.
F.65.9.2.2. Sampling #
If you need to anonymize data for testing purpose, chances are that a smaller subset of your database will be enough. In that case, you can easily speed up the anonymization by downsizing the volume of data.
Checkout the Sampling section for more details.
F.65.9.2.3. Materialized Views #
Dynamic masking is not always required! In some cases, it is more efficient to build CREATE MATERIALIZED VIEW instead.
For instance:
CREATE MATERIALIZED VIEW masked_customer AS SELECT id, anon.random_last_name() AS name, anon.random_date_between('1920-01-01'::DATE,now()) AS birth, fk_last_order, store_id FROM customer;
F.65.10. Security #
F.65.10.1. Permissions #
Here’s an overview of what users can do depending on the priviledge they have:
Action | Superuser | Owner | Masked Role |
---|---|---|---|
Create the extension | Yes | ||
Drop the extension | Yes | ||
Init the extension | Yes | ||
Reset the extension | Yes | ||
Configure the extension | Yes | ||
Put a mask upon a role | Yes | ||
Start dynamic masking | Yes | ||
Stop dynamic masking | Yes | ||
Create a table | Yes | Yes | |
Declare a masking rule | Yes | Yes | |
Insert, delete, update a row | Yes | Yes | |
Static Masking | Yes | Yes | |
Select the real data | Yes | Yes | |
Regular Dump | Yes | Yes | |
Anonymous Dump | Yes | Yes | |
Use the masking functions | Yes | Yes | Yes |
Select the masked data | Yes | Yes | Yes |
View the masking rules | Yes | Yes | Yes |
F.65.10.2. Limit masking filters only to trusted schemas #
By default, the database owner can only write masking rules with functions that are located in the trusted schemas which are controlled by the superusers.
Out of the box, only the transp_anon
schema is declared as
trusted. This means that by defautt the functions from the pg_catalog
cannot be used in masking rules.
For more details, read the Using pg_catalog functions section.
F.65.10.3. Security context of the functions #
Most of the functions of this extension are declared with the SECURITY
INVOKER
tag. This means that these functions are executed with the
privileges of the user that calls them. This is an important restriction.
This extension contains another few functions declared with the tag SECURITY
DEFINER
.
F.65.11. Demos #
Demos show different cases.
F.65.11.1. Add more emails #
Add values to transp_anon.email
BEGIN; CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE; SELECT transp_anon.init(); SELECT count(*) FROM transp_anon.email; CREATE TEMPORARY TABLE tmp_email AS SELECT * FROM transp_anon.email; TRUNCATE transp_anon.email; INSERT INTO transp_anon.email SELECT ROW_NUMBER() OVER (), concat(u.username,'@', d.domain) FROM ( SELECT split_part(val,'@',1) AS username FROM tmp_email ORDER BY RANDOM() LIMIT 10 ) u, ( SELECT split_part(val,'@',2) AS domain FROM tmp_email ORDER BY RANDOM() LIMIT 5 ) d ; SELECT count(*) FROM transp_anon.email; ROLLBACK;
F.65.11.2. Dynamic anonymization for another role #
In this demo we create masked role skynet
. Select operations
will return anonymized data for this role.
-- STEP 1 : Activate the masking engine CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE; SELECT transp_anon.init(); -- STEP 2 : Declare a masked user CREATE ROLE skynet LOGIN; SECURITY LABEL FOR transp_anon ON ROLE skynet IS 'MASKED'; -- STEP 3 : Declare the masking rules CREATE TABLE people ( id TEXT, firstname TEXT, lastname TEXT, phone TEXT); INSERT INTO people VALUES ('T1','Sarah', 'Conor','0609110911'); SELECT * FROM people; -- STEP 3 : Declare the masking rules SECURITY LABEL FOR transp_anon ON COLUMN people.lastname IS 'MASKED WITH FUNCTION transp_anon.fake_last_name()'; SECURITY LABEL FOR transp_anon ON COLUMN people.phone IS 'MASKED WITH FUNCTION transp_anon.partial(phone,2,$$******$$,2)'; -- STEP 4 : Connect with the masked user \! psql demo -U skynet -c 'SELECT * FROM people;' -- STEP 5 : Clean up DROP EXTENSION anon CASCADE; REASSIGN OWNED BY skynet TO postgres; DROP OWNED BY skynet; DROP ROLE skynet;
F.65.11.3. Declare masking rules for a column #
In this demo masking rules are declared for a column player.name
.
BEGIN; CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE; SELECT transp_anon.init(); CREATE TABLE player( id SERIAL, name TEXT, points INT); INSERT INTO player VALUES ( 1, 'Kareem Abdul-Jabbar', 38387), ( 5, 'Michael Jordan', 32292); SECURITY LABEL FOR transp_anon ON COLUMN player.name IS 'MASKED WITH FUNCTION transp_anon.fake_last_name()'; ROLLBACK;
F.65.11.4. Generalization #
Demo using generalization functions.
BEGIN; CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE; CREATE TABLE patient ( ssn TEXT, firstname TEXT, zipcode INTEGER, birth DATE, disease TEXT ); INSERT INTO patient VALUES ('253-51-6170','Alice',47012,'1989-12-29','Heart Disease'), ('091-20-0543','Bob',42678,'1979-03-22','Allergy'), ('565-94-1926','Caroline',42678,'1971-07-22','Heart Disease'), ('510-56-7882','Eleanor',47909,'1989-12-15','Acne'), ('098-24-5548','David',47905,'1997-03-04','Flu'), ('118-49-5228','Jean',47511,'1993-09-14','Flu'), ('263-50-7396','Tim',47900,'1981-02-25','Heart Disease'), ('109-99-6362','Bernard',47168,'1992-01-03','Asthma'), ('287-17-2794','Sophie',42020,'1972-07-14','Asthma'), ('409-28-2014','Arnold',47000,'1999-11-20','Diabetes') ; SELECT * FROM patient; CREATE MATERIALIZED VIEW generalized_patient AS SELECT 'REDACTED' AS firstname, transp_anon.generalize_int4range(zipcode,1000) AS zipcode, transp_anon.generalize_daterange(birth,'decade') AS birth, disease FROM patient; SELECT * FROM generalized_patient; ROLLBACK;
F.65.11.5. Hashing #
This is a demo of the hashing function We will try to pseudonymize a natural foreign key
BEGIN; -- We have a simplistic customer table CREATE TABLE customer ( id SERIAL, name TEXT, phone_number TEXT UNIQUE NOT NULL ); INSERT INTO customer VALUES (2046,'Omar Little','410-719-9009'), (8123,'Russell Bell','410-617-7308'), (3456,'Avon Barksdale','410-385-2983'); -- We have a log of their calls to our customer service hotline CREATE TABLE hotline_call_history ( id SERIAL, fk_phone_number TEXT REFERENCES customer(phone_number) DEFERRABLE, call_start_time TIMESTAMP, call_end_time TIMESTAMP ); INSERT INTO hotline_call_history VALUES (834,'410-617-7308','2004-05-17 09:41:01','2004-05-17 09:44:24'), (835,'410-385-2983','2004-05-17 11:22:55','2004-05-17 11:34:18'), (839,'410-719-9009','2004-05-18 16:02:03','2004-05-18 16:22:56'), (878,'410-385-2983','2004-05-20 13:13:34','2004-05-18 13:51:00'); -- We can get a details view of the calls like this SELECT c.id AS customer_id, c.name AS customer_name, h.call_start_time, h.call_end_time FROM hotline_call_history h JOIN customer c ON h.fk_phone_number = c.phone_number WHERE extract(year from h.call_start_time) = 2004 ORDER BY h.call_start_time; -- Now let's pseudonymize this ! CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE; -- Init SELECT transp_anon.init(); SET transp_anon.salt = '4ag3b803de6180361'; SET transp_anon.algorithm = 'md5'; -- SELECT transp_anon.set_secret_salt('4ag3b803de6180361'); -- SELECT transp_anon.set_secret_algorithm('md5'); -- Masking Rules SECURITY LABEL FOR transp_anon ON COLUMN customer.name IS 'MASKED WITH FUNCTION transp_anon.concat( transp_anon.pseudo_first_name(name), $$ $$, transp_anon.pseudo_last_name(name))'; SECURITY LABEL FOR transp_anon ON COLUMN customer.phone_number IS 'MASKED WITH FUNCTION transp_anon.hash(phone_number)'; SECURITY LABEL FOR transp_anon ON COLUMN hotline_call_history.fk_phone_number IS 'MASKED WITH FUNCTION transp_anon.hash(fk_phone_number)'; -- Apply the masking rules SET CONSTRAINTS ALL DEFERRED; SELECT current_user \gset SECURITY LABEL FOR transp_anon ON ROLE :current_user IS 'MASKED'; -- Launch the same request SELECT c.id AS customer_id, c.name AS customer_name, h.call_start_time, h.call_end_time FROM hotline_call_history h JOIN customer c ON h.fk_phone_number = c.phone_number WHERE extract(year from h.call_start_time) = 2004 ORDER BY h.call_start_time; -- ROLLBACK;
F.65.11.6. Masking rule inheritance #
Demo show anonymization during inheritance
BEGIN; CREATE TABLE public.invoice ( name text, amount float, published_date date ); CREATE TABLE public.invoice_2020 ( CHECK ( published_date >= DATE '2020-01-01' AND published_date <= DATE '2020-12-31' ) ) INHERITS (invoice); INSERT INTO invoice_2020 VALUES('John DOE', 236.25, '2020-12-09'); CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE; SELECT transp_anon.init(); SELECT current_user \gset SECURITY LABEL FOR transp_anon ON ROLE :current_user IS 'MASKED'; -- -- Put mask on the mother table -- SECURITY LABEL FOR transp_anon ON COLUMN public.invoice.name IS 'MASKED WITH VALUE $$CONFIDENTIAL$$'; SELECT public.invoice.name, public.invoice_2020.name FROM public.invoice, public.invoice_2020; -- name | name -- --------------+-------------- -- CONFIDENTIAL | CONFIDENTIAL -- -- Put mask on the child table -- SECURITY LABEL FOR transp_anon ON COLUMN public.invoice_2020.name IS 'MASKED WITH VALUE $$DELETED$$'; SELECT public.invoice.name, public.invoice_2020.name FROM public.invoice, public.invoice_2020; -- name | name -- --------------+--------- -- CONFIDENTIAL | DELETED ROLLBACK;
F.65.11.7. Noise reduction attack #
This is a basic example of a repeat attacks against the noise()
masking
functions when they're used with dynamic masking. We will demonstrate how
a masked role can guess a real data using only the masking view
BEGIN; -- STEP 1 : Activate the masking engine CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE; SELECT transp_anon.init(); -- STEP 2: Declare a masked user CREATE ROLE attacker LOGIN; SECURITY LABEL FOR transp_anon ON ROLE attacker IS 'MASKED'; -- STEP 3: Load the data DROP TABLE IF EXISTS people CASCADE; CREATE TABLE people ( id INT, name TEXT, age INT); INSERT INTO people VALUES (157, 'Mike Ehrmantraut' , 63); INSERT INTO people VALUES (482, 'Jimmy McGill', 47); -- STEP 4: Declare the masking rules SECURITY LABEL FOR transp_anon ON COLUMN people.name IS 'MASKED WITH VALUE $$CONFIDENTIAL$$ '; SECURITY LABEL FOR transp_anon ON COLUMN people.age IS 'MASKED WITH FUNCTION transp_anon.noise(age, 0.33)'; -- STEP 5: Now let's connect with the masked user SELECT current_user \gset SECURITY LABEL FOR transp_anon ON ROLE :current_user IS 'MASKED'; -- When attacker asks for the age of person #157, he/she gets a "noised" value SELECT age FROM people WHERE id = 157; -- Now let's do this 10000 times and get the average DO $$ DECLARE v iNT; a int[]; BEGIN FOR i in 1..10000 LOOP SELECT age into v FROM people WHERE id=157; a:=array_append(a,v); END LOOP; SELECT avg(u) into v FROM unnest(a) u; RAISE NOTICE 'Age of Person 157: %', v; END $$; ROLLBACK;
F.65.11.8. Return NULL or fake value #
BEGIN; CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE; SELECT transp_anon.init(); CREATE OR REPLACE FUNCTION null_or_random_date(val TIMESTAMPTZ) RETURNS TIMESTAMPTZ AS $$ SELECT transp_anon.random_date(); $$ LANGUAGE SQL VOLATILE RETURNS NULL ON NULL INPUT; SELECT null_or_random_date(NULL); SELECT null_or_random_date('2020-05-05'); ROLLBACK;
F.65.11.9. Partial #
BEGIN; CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE; SELECT transp_anon.partial('abcdefgh',1,'xxxx',3); SELECT transp_anon.partial('+33142928100',4,'******',2); SELECT transp_anon.partial_email('[email protected]'); ROLLBACK;
F.65.11.10. Difference fake functions #
BEGIN; CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE; SELECT transp_anon.init(); --let's use `TEMPORARY` instead of `pg_temp` for clarity --SET search_path TO pg_temp, public; CREATE TEMPORARY TABLE customer( id SERIAL, full_name TEXT, hair_color TEXT, birth DATE, employer TEXT, zipcode TEXT, fk_shop INTEGER ); INSERT INTO customer VALUES (911,'Chuck Norris','brown','1940-03-10','Texas Rangers', '75001',12), (312,'David Hasselhoff','black','1952-07-17','Baywatch', '90001',423) ; SELECT * FROM customer; UPDATE customer SET full_name=transp_anon.fake_first_name() || ' ' || transp_anon.fake_last_name(), hair_color=transp_anon.random_in(ARRAY['black','blonde','brown','red',NULL]), birth=transp_anon.random_date_between('1920-01-01'::DATE,now()), employer=transp_anon.fake_company(), zipcode=transp_anon.random_zip() ; SELECT * FROM customer; ROLLBACK;
F.65.11.11. Write your own mask #
BEGIN; CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE; SELECT transp_anon.init(); CREATE TABLE company ( business_name TEXT, info JSONB ); INSERT INTO company VALUES ( 'Soylent Green', '{"employees":[ {"firstName":"John", "lastName":"Doe"}, {"firstName":"Anna", "lastName":"Smith"}, {"firstName":"Peter", "lastName":"Jones"} ]}'); SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green'; CREATE FUNCTION transp_anon.anonymize_last_name(j JSONB) RETURNS JSONB VOLATILE LANGUAGE SQL AS $func$ SELECT json_build_object( 'employees' , array_agg( jsonb_set(e ,'{lastName}', to_jsonb(transp_anon.fake_last_name())) ) )::JSONB FROM jsonb_array_elements( j->'employees') e $func$; SELECT transp_anon.anonymize_last_name(info) FROM company; SECURITY LABEL FOR transp_anon ON COLUMN company.info IS 'MASKED WITH FUNCTION transp_anon.anonymize_last_name(info)'; SELECT current_user \gset SECURITY LABEL FOR transp_anon ON ROLE :current_user IS 'MASKED'; SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green'; ROLLBACK;