F.65. transp_anon#

F.65. transp_anon

F.65. transp_anon #

transp_anon (transparent anonymization) is an extension to mask or replace personally identifiable information or commercially sensitive data from a PostgreSQL database.

F.65.1. Definitions of the terms used in this project #

The data can be altered with several techniques:

  • Deletion or Nullification simply removes data.

  • Static Substitution consistently replaces the data with a generic value. For instance: replacing all values of a TEXT column with the value CONFIDENTIAL.

  • Variance is the action of shifting dates and numeric values. For example, by applying a +/- 10% variance to a salary column, the dataset will remain meaningful.

  • Generalization reduces the accuracy of the data by replacing it with a range of values. Instead of saying Bob is 28 years old, you can say Bob is between 20 and 30 years old. This is useful for analytics because the data remains true.

  • Shuffling mixes values within the same columns. This method is open to being reversed if the shuffling algorithm can be deciphered.

  • Randomization replaces sensitive data with random-but-plausible values. The goal is to avoid any identification from the data record while remaining suitable for testing, data analysis and data processing.

  • Partial scrambling is similar to static substitution but leaves out some part of the data. For instance : a credit card number can be replaced by 40XX XXXX XXXX XX96

  • Custom rules are designed to alter data following specific needs. For instance, randomizing simultaneously a zipcode and a city name while keeping them coherent.

  • Pseudonymization is a way to protect personal information by hiding it using additional information. Encryption and Hashing are two examples of pseudonymization techniques. However a pseudonymizated data is still linked to the original data.

F.65.2. Anonymization & Data Masking for PostgreSQL #

The project has a declarative approach of anonymization. This means you can declare the masking rules using the PostgreSQL Data Definition Language (DDL) and specify your anonymization strategy inside the table definition itself.

Once the maskings rules are defined, you can access the anonymized data using: * Dynamic Masking: Hide personal information only for the masked users

In addition, various Masking Functions are available : randomization, faking, partial scrambling, shuffling, noise or your own custom function!

Beyond masking, it is also possible to use a fourth approach called Generalization which is perfect for statistics and data analytics.

F.65.2.1. Example #

=# SELECT * FROM people;
id | firstname | lastname |   phone
----+-----------+----------+------------
T1 | Sarah     | Conor    | 0609110911
            

Step 1 : Activate the dynamic masking engine

=# CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE;
=# SELECT transp_anon.init();
            

Step 2 : Declare a masked user and give it permissions to select data

=# CREATE ROLE skynet LOGIN;
=# GRANT SELECT ON TABLE people TO skynet;
=# SECURITY LABEL FOR transp_anon ON ROLE skynet IS 'MASKED';
            

Step 3 : Declare the masking rules

=# SECURITY LABEL FOR transp_anon ON COLUMN people.lastname
-# IS 'MASKED WITH FUNCTION transp_anon.fake_last_name()';

=# SECURITY LABEL FOR transp_anon ON COLUMN people.phone
-# IS 'MASKED WITH FUNCTION transp_anon.partial(phone,2,$$******$$,2)';
            

Step 4 : Connect with the masked user

=# \connect - skynet
=> SELECT * FROM people;
id | firstname | lastname  |   phone
----+-----------+-----------+------------
T1 | Sarah     | Stranahan | 06******11
            

F.65.3. Configuration #

The extension has currently a few options that be defined for the entire instance ( inside postgresql.conf or with ALTER SYSTEM).

It is also possible and often a good idea to define them at the database level like this:

ALTER DATABASE people SET transp_anon.restrict_to_trusted_schemas = on;
            

Only superuser can change the parameters below :

F.65.3.1. transp_anon.enabled #

Type Boolean
Default value true
Visible only to superusers

Allows to globally enable or disable transp_anon.

F.65.3.2. transp_anon.inherit_labels #

Type Boolean
Default value true
Visible only to superusers

Inherit security labels from relation ancestors (partitioned tables and inheritance tables)if any.

F.65.3.3. transp_anon.algorithm #

Type Text
Default value sha256
Visible only to superusers

This is the hashing method used by pseudonymizing functions. Checkout the pgcrypto documentation for the list of available options.

See transp_anon.salt to learn why this parameter is a very sensitive information.

F.65.3.4. transp_anon.restrict_to_trusted_schemas #

Type Boolean
Default value off
Visible to all users

By enabling this parameter, masking rules must be defined using functions located in a limited list of namespaces. By default, only the transp_anon schema is trusted.

This improves security by preventing users from declaring their custom masking filters. This also means that the schema must be explicit inside the masking rules.

For more details, check out the Write your own masks section of the Masking functions chapter.

F.65.3.5. transp_anon.salt #

Type Text
Default value (empty)
Visible only to superusers

This is the salt used by pseudonymizing functions. It is very important to define a custom salt for each database like this:

ALTER DATABASE foo SET transp_anon.salt = 'This_Is_A_Very_Secret_Salt';
            

If a masked user can read the salt, he/she can run a brute force attack to retrieve the original data based on the 3 elements:

  • The pseudonymized data

  • The hashing algorithm (see transp_anon.algorithm)

  • The salt

The salt and the name of the hashing algorithm should be protected with the same level of security that the data itself. This is why you should store the salt directly within the database with ALTER DATABASE.

F.65.4. Declare masking rules #

The main idea of this extension is to offer anonymization by design.

The data masking rules should be written by the people who develop the application because they have the best knowledge of how the data model works. Therefore masking rules must be implemented directly inside the database schema.

This allows to mask the data directly inside the PostgreSQL instance without using an external tool and thus limiting the exposure and the risks of data leak.

The data masking rules are declared simply by using SECURITY LABEL:

CREATE TABLE player( id SERIAL, name TEXT, points INT);

INSERT INTO player VALUES
( 1, 'Kareem Abdul-Jabbar', 38387),
( 5, 'Michael Jordan', 32292 );

SECURITY LABEL FOR transp_anon ON COLUMN player.name
IS 'MASKED WITH FUNCTION transp_anon.fake_last_name()';

SECURITY LABEL FOR transp_anon ON COLUMN player.id
IS 'MASKED WITH VALUE NULL';
        

F.65.4.1. Escaping String literals #

As you may have noticed the masking rule definitions are placed between single quotes. Therefore if you need to use a string inside a masking rule, you need to use C-Style escapes like this:

SECURITY LABEL FOR transp_anon ON COLUMN player.name
IS E'MASKED WITH VALUE \'CONFIDENTIAL\'';
            

Or use dollar quoting which is easier to read:

SECURITY LABEL FOR transp_anon ON COLUMN player.name
IS 'MASKED WITH VALUE $$CONFIDENTIAL$$';
            

F.65.4.2. Listing masking rules #

To display all the masking rules declared in the current database, check out the transp_anon.pg_masking_rules:

SELECT * FROM transp_anon.pg_masking_rules;
            

F.65.4.3. Debugging masking rules #

When an error occurs to due a wrong masking rule, you can get more detailed information about the problem by setting client_min_messages to DEBUG and you will get useful details

SET client_min_messages=DEBUG;
            

F.65.4.4. Removing a masking rule #

You can simply erase a masking rule like this:

SECURITY LABEL FOR transp_anon ON COLUMN player.name IS NULL;
            

To remove all rules at once, you can use:

SELECT transp_anon.remove_masks_for_all_columns();
            

F.65.4.5. Limitations #

F.65.5. Masking functions #

The extension provides functions to implement the following main anonymization strategies:

Also you may use additional masking functions or write your own :

Depending on your data, you may need to use different strategies on different columns :

  • For names and other direct identifiers , Faking is often useful

  • Shuffling is convenient for foreign keys

  • Adding Noise is interesting for numeric values and dates

  • Partial Scrambling is perfect for email address and phone numbers

  • etc.

F.65.5.1. Destruction #

First of all, the fastest and safest way to anonymize a data is to destroy it.

In many cases, the best approach to hide the content of a column is to replace all the values with a single static value.

For instance, you can replace a entire column by the word CONFIDENTIAL like this:

SECURITY LABEL FOR transp_anon
ON COLUMN users.address
IS 'MASKED WITH VALUE ''CONFIDENTIAL'' ';
            

F.65.5.2. Adding Noise #

This is also called Variance. The idea is to shift dates and numeric values. For example, by applying a +/- 10% variance to a salary column, the dataset will remain meaningful.

  • transp_anon.noise(original_value,ratio) where original_value can be an integer, a bigint or a double precision. If the ratio is 0.33, the return value will be the original value randomly shifted with a ratio of +/- 33%

  • transp_anon.dnoise(original_value, interval) where original_value can be a date, a timestamp, or a time. If interval = 2 days, the return value will be the original value randomly shifted by +/- 2 days

WARNING : The noise() masking functions are vulnerable to a form of repeat attack, especially with Dynamic Masking. A masked user can guess an original value by requesting its masked value multiple times and then simply use the AVG() function to get a close approximation. (See Noise Reduction Attack for more details). In a nutshell, these functions are best fitted for Anonymous Dumps and Static Masking. They should be avoided when using Dynamic Masking.

F.65.5.3. Randomization #

The extension provides a large choice of functions to generate purely random data:

F.65.5.3.1. Basic Random values #
  • transp_anon.random_date() returns a date

  • transp_anon.random_string(n) returns a TEXT value containing n letters

  • transp_anon.random_zip() returns a 5-digit code

  • transp_anon.random_phone(p) returns a 8-digit phone with p as a prefix

  • transp_anon.random_hash(seed) returns a hash of a random string for a given seed

F.65.5.3.2. Random between #

To pick any value inside between two bounds:

  • transp_anon.random_date_between(d1,d2) returns a date between d1 and d2

  • transp_anon.random_int_between(i1,i2) returns an integer between i1 and i2

  • transp_anon.random_bigint_between(b1,b2) returns a bigint between b1 and b2

NOTE: With these functions, the lower and upper bounds are included. For instance transp_anon.random_int_between(1,3) returns either 1, 2 or 3.

For more advanced interval descriptions, check out the Random in Range section.

F.65.5.3.3. Random in Array #

The random_in function returns an element a given array

For example:

  • transp_anon.random_in(ARRAY[1,2,3]) returns an int between 1 and 3

  • transp_anon.random_in(ARRAY['red','green','blue']) returns a text

F.65.5.3.4. Random in Enum #

This is one especially useful when working with ENUM types!

  • transp_anon.random_in_enum(variable_of_an_enum_type) returns any val

CREATE TYPE card AS ENUM ('visa', 'mastercard', ‘amex’);

SELECT transp_anon.random_in_enum(NULL::CARD);
random_in_enum
----------------
mastercard

CREATE TABLE customer (
id INT,
...
credit_card CARD
);

SECURITY LABEL FOR transp_anon ON COLUMN customer.creditcard
IS 'MASKED WITH FUNCTION transp_anon.random_in_enum(creditcard)'
                
F.65.5.3.5. Random in Range #

RANGE typesare a powerfull way to describe an interval of values, where can define inclusive or excluvive bounds, see examples.

There a function for each subtype of range:

  • transp_anon.random_in_int4range('[5,6)') returns an INT of value 5

  • transp_anon.random_in_int8range('(6,7]') returns a BIGINT of value 7

  • `transp_anon.random_in_numrange([0.1,0.9]) returns a NUMERIC between 0.1 and 0.9

  • transp_anon.random_in_daterange('[2001-01-01, 2001-12-31)') returns a date in 2001

  • transp_anon.random_in_tsrange('[2022-10-01,2022-10-31]') returns a TIMESTAMP in october 2022

  • transp_anon.random_in_tstzrange('[2022-10-01,2022-10-31]') returns a TIMESTAMP WITH TIMEZONE in october 2022

NOTE: It is not possible to get a random value from a RANGE with an infinite bound. For example transp_anon.random_in_int4range('[2022,)') returns NULL.

F.65.5.4. Faking #

The idea of Faking is to replace sensitive data with random-but-plausible values. The goal is to avoid any identification from the data record while remaining suitable for testing, data analysis and data processing.

In order to use the faking functions, you have to init() the extension in your database first:

SELECT transp_anon.init();
            

The init() function will import a default dataset of random data (iban, names, cities, etc.).

Note

This dataset is in Russian and very small ( 1000 values for each category ). If you want to use localized data or load a specific dataset, please read the Custom Fake Data section.

Once the fake data is loaded, you have access to these faking functions:

  • transp_anon.fake_address() returns a complete post address

  • transp_anon.fake_city() returns an existing city

  • transp_anon.fake_country() returns a country

  • transp_anon.fake_company() returns a generic company name

  • transp_anon.fake_email() returns a valid email address

  • transp_anon.fake_first_name() returns a generic first name

  • transp_anon.fake_iban() returns a valid IBAN

  • transp_anon.fake_last_name() returns a generic last name

  • transp_anon.fake_postcode() returns a valid zipcode

For TEXT and VARCHAR columns, you can use the classic Lorem Ipsum generator:

  • transp_anon.lorem_ipsum() returns 5 paragraphs

  • transp_anon.lorem_ipsum(2) returns 2 paragraphs

  • transp_anon.lorem_ipsum( paragraphs := 4 ) returns 4 paragraphs

  • transp_anon.lorem_ipsum( words := 20 ) returns 20 words

  • transp_anon.lorem_ipsum( characters := 7 ) returns 7 characters

  • transp_anon.lorem_ipsum( characters := transp_anon.length(table.column) ) returns the same amount of characters as the original string

F.65.5.5. Pseudonymization #

Pseudonymization is similar to Faking in the sense that it generates realistic values. The main difference is that the pseudonymization is deterministic : the functions always will return the same fake value based on a seed and an optional salt.

In order to use the faking functions, you have to init() the extension in your database first:

SELECT transp_anon.init();
            

Once the fake data is loaded you have access to 10 pseudo functions:

  • transp_anon.pseudo_first_name(seed,salt) returns a generic first name

  • transp_anon.pseudo_last_name(seed,salt) returns a generic last name

  • transp_anon.pseudo_email(seed,salt) returns a valid email address

  • transp_anon.pseudo_city(seed,salt) returns an existing city

  • transp_anon.pseudo_country(seed,salt) returns a country

  • transp_anon.pseudo_company(seed,salt) returns a generic company name

  • transp_anon.pseudo_iban(seed,salt) returns a valid IBAN

The second argument (salt) is optional. You can call each function with only the seed like this transp_anon.pseudo_city('bob'). The salt is here to increase complexity and avoid dictionary and brute force attacks (see warning below). If a specific salt is not given, the value of the transp_anon.salt GUC parameter is used instead (see the Generic Hashing section for more details).

The seed can be any information related to the subject. For instance, we can consistently generate the same fake email address for a given person by using her login as the seed :

SECURITY LABEL FOR transp_anon
ON COLUMN users.emailaddress
IS 'MASKED WITH FUNCTION transp_anon.pseudo_email(users.login) ';
            

NOTE : You may want to produce unique values using a pseudonymization function. For instance, if you want to mask an email column that is declared as UNIQUE. In this case, you will need to initialize the extension with a fake dataset that is way bigger than the numbers of rows of the table. Otherwise you may see some collisions happening, i.e. two different original values producing the same pseudo value.

WARNING : Pseudonymization is often confused with anonymization but in fact they serve 2 different purposes : pseudonymization is a way to protect the personal information but the pseudonymized data is still linked to the real data. The GDPR makes it very clear that personal data which has undergone pseudonymization is still related to a person. (see GDPR Recital 26)

F.65.5.6. Generic hashing #

In theory, hashing is not a valid anonymization technique, however in practice it is sometimes necessary to generate a determinist hash of the original data.

For instance, when a pair of primary key / foreign key is a natural key, it may contain actual information ( like a customer number containing a birth date or something similar).

Hashing such columns allows to keep referential integrity intact even for relatively unusual source data. Therefore, the

  • transp_anon.digest(value,salt,algorithm) lets you choose a salt, and a hash algorithm from a pre-defined list

  • transp_anon.hash(value) will return a text hash of the value using a secret salt (defined by the transp_anon.salt parameter) and hash algorithm (defined by the transp_anon.algorithm parameter). The default value of transp_anon.algorithm is sha256 and possible values are: md5, sha1, sha224, sha256, sha384 or sha512. The default value of transp_anon.salt is an empty string. You can modify these values with:

    ALTER DATABASE foo SET transp_anon.salt TO 'xsfnjefnjsnfjsnf';
    ALTER DATABASE foo SET transp_anon.algorithm TO 'sha384';
                        

Keep in mind that hashing is a form a Pseudonymization. This means that the data can be de-anonymized using the hashed value and the masking function. If an attacker gets access to these 2 elements, he or she could re-identify some persons using brute force or dictionary attacks. Therefore, the salt and the algorithm used to hash the data must be protected with the same level of security that the original dataset.

In a nutshell, we recommend that you use the transp_anon.hash() function rather than transp_anon.digest() because the salt will not appear clearly in the masking rule.

Furthermore: in practice the hash function will return a long string of character like this:

SELECT transp_anon.hash('bob');
                                hash
----------------------------------------------------------------------------------------------------------------------------------
95b6accef02c5a725a8c9abf19ab5575f99ca3d9997984181e4b3f81d96cbca4d0977d694ac490350e01d0d213639909987ef52de8e44d6258d536c55e427397
            

For some columns, this may be too long and you may have to cut some parts the hash in order to fit into the column. For instance, if you have a foreign key based on a phone number and the column is a VARCHAR(12) you can transform the data like this:

SECURITY LABEL FOR transp_anon ON COLUMN people.phone_number
IS 'MASKED WITH FUNCTION transp_anon.left(transp_anon.hash(phone_number),12)';

SECURITY LABEL FOR transp_anon ON COLUMN call_history.fk_phone_number
IS 'MASKED WITH FUNCTION transp_anon.left(transp_anon.hash(fk_phone_number),12)';
            

Of course, cutting the hash value to 12 characters will increase the risk of collision (2 different values having the same fake hash). In such case, it’s up to you to evaluate this risk.

F.65.5.7. Partial Scrambling #

Partial scrambling leaves out some part of the data. For instance : a credit card number can be replaced by 40XX XXXX XXXX XX96.

2 functions are available:

  • transp_anon.partial('abcdefgh',1,'xxxx',3) will return axxxxfgh;

  • transp_anon.partial_email('[email protected]') will become da******@gm******.com

F.65.5.8. Conditional Masking #

In some situations, you may want to apply a masking filter only for some value or for a limited number of lines in the table.

For instance, if you want to preserve NULL values, i.e. masking only the lines that contains a value, you can use the transp_anon.ternary function, which works like a CASE WHEN x THEN y ELSE z statement:

SECURITY LABEL FOR transp_anon ON COLUMN player.score
IS 'MASKED WITH FUNCTION transp_anon.ternary(score IS NULL,
                                        NULL,
                                        transp_anon.random_int_between(0,100));
            

You may also want to exclude some lines within the table. Like keeping the password of some users so that they still may be able to connect to a testing deployment of your application:

SECURITY LABEL FOR transp_anon ON COLUMN account.password
IS 'MASKED WITH FUNCTION transp_anon.ternary( id > 1000, NULL::TEXT, password)';
            

WARNING : Conditional masking may create a partially deterministic connection between the original data and the masked data. And that connection can be used to retrieve personal information from the masked data. For instance, if NULL values are preserved for a deceased_date column, it will reveal which persons are still actually alive… In a nutshell: conditional masking may often produce a dataset that is not fully anonymized and therefore would still technically contain personal information.

F.65.5.9. Generalization #

Generalization is the principle of replacing the original value by a range containing this value. For instance, instead of saying Paul is 42 years old, you would say Paul is between 40 and 50 years old.

Note

The generalization functions are a data type transformation. Therefore it is not possible to use them with the dynamic masking engine. However they are useful to create anonymized views. See example below.

Let’s imagine a table containing health information:

SELECT * FROM patient;
id |   name   |  zipcode |   birth    |    disease
----+----------+----------+------------+---------------
1 | Alice    |    47678 | 1979-12-29 | Heart Disease
2 | Bob      |    47678 | 1959-03-22 | Heart Disease
3 | Caroline |    47678 | 1988-07-22 | Heart Disease
4 | David    |    47905 | 1997-03-04 | Flu
5 | Eleanor  |    47909 | 1999-12-15 | Heart Disease
6 | Frank    |    47906 | 1968-07-04 | Cancer
7 | Geri     |    47605 | 1977-10-30 | Heart Disease
8 | Harry    |    47673 | 1978-06-13 | Cancer
9 | Ingrid   |    47607 | 1991-12-12 | Cancer
            

We can build a view upon this table to suppress some columns ( SSN and name ) and generalize the zipcode and the birth date like this:

CREATE VIEW anonymized_patient AS
SELECT
    'REDACTED' AS lastname,
    transp_anon.generalize_int4range(zipcode,100) AS zipcode,
    transp_anon.generalize_tsrange(birth,'decade') AS birth
    disease
FROM patients;
            

The anonymized table now looks like that:

SELECT * FROM anonymized_patient;
lastname |   zipcode     |           birth             |    disease
----------+---------------+-----------------------------+---------------
REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Heart Disease
REDACTED | [47600,47700) | ["1950-01-01","1960-01-01") | Heart Disease
REDACTED | [47600,47700) | ["1980-01-01","1990-01-01") | Heart Disease
REDACTED | [47900,48000) | ["1990-01-01","2000-01-01") | Flu
REDACTED | [47900,48000) | ["1990-01-01","2000-01-01") | Heart Disease
REDACTED | [47900,48000) | ["1960-01-01","1970-01-01") | Cancer
REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Heart Disease
REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Cancer
REDACTED | [47600,47700) | ["1990-01-01","2000-01-01") | Cancer
            

The generalized values are still useful for statistics because they remain true, but they are less accurate, and therefore reduce the risk of re-identification.

PostgreSQL offers several RANGE types which are perfect for dates and numeric values.

For numeric values, 3 functions are available:

  • generalize_int4range(value, step)

  • generalize_int8range(value, step)

  • generalize_numrange(value, step)

…where value is the data that will be generalized, and step is the size of each range.

F.65.5.10. Shuffling #

Shuffling mixes values within the same columns.

  • anon.shuffle_column(shuffle_table, shuffle_column, primary_key) will rearrange all values in a given column. You need to provide a primary key of the table.

This is useful for foreign keys because referential integrity will be kept.

F.65.5.11. Using pg_catalog functions #

Since version 1.3, the pg_catalog schema is not trusted by default. This is a security measure designed to prevent users from using sophisticated functions in masking rules (such as pg_catalog.query_to_xml , pg_catalog.ts_stat or the system administration functions) that should not be used as masking functions.

However, the extension provides bindings to some useful and safe functions from the pg_catalog schema for your convenience:

  • transp_anon.concat(TEXT,TEXT)

  • transp_anon.concat(TEXT,TEXT, TEXT)

  • transp_anon.date_add(TIMESTAMP WITH TIME ZONE,INTERVAL)

  • transp_anon.date_part(TEXT,TIMESTAMP)

  • transp_anon.date_part(TEXT,INTERVAL)

  • transp_anon.date_subtract(TIMESTAMP WITH TIME ZONE, INTERVAL )

  • transp_anon.date_trunc(TEXT,TIMESTAMP)

  • transp_anon.date_trunc(TEXT,TIMESTAMP WITH TIME ZONE,TEXT)

  • transp_anon.date_trunc(TEXT,INTERVAL)

  • transp_anon.left(TEXT,INTEGER)

  • transp_anon.length(TEXT)

  • transp_anon.lower(TEXT)

  • transp_anon.make_date(INT,INT,INT )

  • transp_anon.make_time(INT,INT,DOUBLE PRECISION)

  • transp_anon.md5(TEXT)

  • transp_anon.random()

  • transp_anon.replace(TEXT,TEXT,TEXT)

  • transp_anon.regexp_replace(TEXT,TEXT,TEXT)

  • transp_anon.regexp_replace(TEXT,TEXT,TEXT,TEXT)

  • transp_anon.right(TEXT,INTEGER)

  • transp_anon.substr(TEXT,INTEGER)

  • transp_anon.substr(TEXT,INTEGER,INTEGER)

  • transp_anon.upper(TEXT)

If you need more bindings, you can either

  • Write your own mapping function in a trusted schema (see below)

  • Set the pg_catalog schema as TRUSTED (not recommended)

  • open an issue

F.65.5.12. Write your own Masks #

You can also use your own function as a mask. The function must either be destructive (like Partial Scrambling) or insert some randomness in the dataset (like Faking).

Especially for complex data types, you may have to write your own function. This will be a common use case if you have to hide certain parts of a JSON field.

For example:

CREATE TABLE company (
    business_name TEXT,
    info JSONB
)
            

The info field contains unstructured data like this:

SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green';
        jsonb_pretty
----------------------------------
{
    "employees": [
        {
            "lastName": "Doe",
            "firstName": "John"
        },
        {
            "lastName": "Smith",
            "firstName": "Anna"
        },
        {
            "lastName": "Jones",
            "firstName": "Peter"
        }
    ]
}
(1 row)
            

Using the PostgreSQL JSON functions and operators you can walk through the keys and replace the sensitive values as needed.

CREATE SCHEMA custom_masks;

-- This step requires superuser privilege
SECURITY LABEL FOR transp_anon ON SCHEMA custom_masks IS 'TRUSTED';

CREATE FUNCTION custom_masks.remove_last_name(j JSONB)
RETURNS JSONB
VOLATILE
LANGUAGE SQL
AS $func$
SELECT
json_build_object(
    'employees' ,
    array_agg(
    jsonb_set(e ,'{lastName}', to_jsonb(transp_anon.fake_last_name()))
    )
)::JSONB
FROM jsonb_array_elements( j->'employees') e
$func$;
            

Then check that the function is working correctly:

SELECT custom_masks.remove_last_name(info) FROM company;
            

When that’s ok you can declare this function as the mask of the info field:

SECURITY LABEL FOR transp_anon ON COLUMN company.info
IS 'MASKED WITH FUNCTION custom_masks.remove_last_name(info)';
            

Select the data:

# SELECT anonymize_table('company');
# SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green';
            jsonb_pretty
-------------------------------------
{
    "employees": [                 +
        {                          +
            "lastName": "Prawdzik",+
            "firstName": "John"    +
        },                         +
        {                          +
            "lastName": "Baltazor",+
            "firstName": "Anna"    +
        },                         +
        {                          +
            "lastName": "Taylan",  +
            "firstName": "Peter"   +
        }                          +
    ]                              +
}
(1 row)
            

As you can see, manipulating a sophisticated JSON structure with SQL is possible, but it can be tricky at first! There are multiple ways of walking through the keys and updating values. You will probably have to try different approaches, depending on your real JSON data and the performance you want to reach.

F.65.6. Dynamic masking #

You can hide some data from a role by declaring this role as a MASKED one. Other roles will still access the original data.

Example:

CREATE TABLE people ( id TEXT, firstname TEXT, lastname TEXT, phone TEXT);
INSERT INTO people VALUES ('T1','Sarah', 'Conor','0609110911');
SELECT * FROM people;

=# SELECT * FROM people;
id | firstname | lastname |   phone
----+----------+----------+------------
T1 | Sarah    | Conor    | 0609110911
(1 row)
        

Step 1 : Activate the dynamic masking engine

=# CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE;
=# SELECT transp_anon.init();
        

Step 2 : Declare a masked user

=# CREATE ROLE skynet LOGIN;
=# GRANT SELECT ON TABLE people TO skynet;
=# SECURITY LABEL FOR transp_anon ON ROLE skynet IS 'MASKED';
        

Step 3 : Declare the masking rules

SECURITY LABEL FOR transp_anon ON COLUMN people.lastname
IS 'MASKED WITH FUNCTION transp_anon.fake_last_name()';

SECURITY LABEL FOR transp_anon ON COLUMN people.phone
IS 'MASKED WITH FUNCTION transp_anon.partial(phone,2,$$******$$,2)';
        

Step 4 : Connect with the masked user

=# \c - skynet
=> SELECT * FROM people;
id | firstname | lastname  |   phone
----+----------+-----------+------------
T1 | Sarah    | Stranahan | 06******11
(1 row)
        

F.65.6.1. How to unmask a role #

Simply remove the security label like this:

SECURITY LABEL FOR transp_anon ON ROLE bob IS NULL;
            

To unmask all masked roles at once you can type:

SELECT transp_anon.remove_masks_for_all_roles();
            

F.65.6.2. Limitations #

F.65.6.2.1. Performances #

Dynamic Masking is known to be very slow with some queries, especially if you try to join 2 tables on a masked key using hashing or pseudonymization.

F.65.7. Generalization #

F.65.7.1. Reducing the accuracy of sensitive data #

The idea of generalization is to replace data with a broader, less accurate value. For instance, instead of saying Bob is 28 years old, you can say Bob is between 20 and 30 years old. This is interesting for analytics because the data remains true while avoiding the risk of re-identification.

PostgreSQL can handle generalization very easily with the RANGE types, a very powerful way to store and manipulate a set of values contained between a lower and an upper bound.

F.65.7.2. Example #

Here’s a basic table containing medical data:

# SELECT * FROM patient;
    ssn     | firstname | zipcode |   birth    |    disease
-------------+-----------+---------+------------+---------------
253-51-6170 | Alice     |   47012 | 1989-12-29 | Heart Disease
091-20-0543 | Bob       |   42678 | 1979-03-22 | Allergy
565-94-1926 | Caroline  |   42678 | 1971-07-22 | Heart Disease
510-56-7882 | Eleanor   |   47909 | 1989-12-15 | Acne
098-24-5548 | David     |   47905 | 1997-03-04 | Flu
118-49-5228 | Jean      |   47511 | 1993-09-14 | Flu
263-50-7396 | Tim       |   47900 | 1981-02-25 | Heart Disease
109-99-6362 | Bernard   |   47168 | 1992-01-03 | Asthma
287-17-2794 | Sophie    |   42020 | 1972-07-14 | Asthma
409-28-2014 | Arnold    |   47000 | 1999-11-20 | Diabetes
(10 rows)
            

We want the anonymized data to remain true because it will be used for statistics. We can build a view upon this table to remove useless columns and generalize the indirect identifiers:

CREATE MATERIALIZED VIEW generalized_patient AS
SELECT
'REDACTED'::TEXT AS firstname,
transp_anon.generalize_int4range(zipcode,1000) AS zipcode,
transp_anon.generalize_daterange(birth,'decade') AS birth,
disease
FROM patient;
            

This will give us a less accurate view of the data:

# SELECT * FROM generalized_patient;
firstname |    zipcode    |          birth          |    disease
-----------+---------------+-------------------------+---------------
REDACTED  | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease
REDACTED  | [42000,43000) | [1970-01-01,1980-01-01) | Allergy
REDACTED  | [42000,43000) | [1970-01-01,1980-01-01) | Heart Disease
REDACTED  | [47000,48000) | [1980-01-01,1990-01-01) | Acne
REDACTED  | [47000,48000) | [1990-01-01,2000-01-01) | Flu
REDACTED  | [47000,48000) | [1990-01-01,2000-01-01) | Flu
REDACTED  | [47000,48000) | [1980-01-01,1990-01-01) | Heart Disease
REDACTED  | [47000,48000) | [1990-01-01,2000-01-01) | Asthma
REDACTED  | [42000,43000) | [1970-01-01,1980-01-01) | Asthma
REDACTED  | [47000,48000) | [1990-01-01,2000-01-01) | Diabetes
(10 rows)
            

F.65.7.3. Generalization Functions #

PostgreSQL Anonymizer provides 6 generalization functions. One for each RANGE type. Generally these functions take the original value as the first parameter, and a second parameter for the length of each step.

For numeric values :

  • transp_anon.generalize_int4range(42,5) returns the range [40,45)

  • transp_anon.generalize_int8range(12345,1000) returns the range [12000,13000)

  • transp_anon.generalize_numrange(42.32378,10) returns the range [40,50)

For time values :

  • transp_anon.generalize_tsrange('1904-11-07','year') returns ['1904-01-01','1905-01-01')

  • transp_anon.generalize_tstzrange('1904-11-07','week') returns ['1904-11-07','1904-11-14')

  • transp_anon.generalize_daterange('1904-11-07','decade') returns [1900-01-01,1910-01-01)

The possible steps are : microseconds, milliseconds, second, minute, hour, day, week, month, year, decade, century and millennium.

F.65.7.4. Limitations #

F.65.7.4.1. Singling out and extreme values #

Singling Out is the possibility to isolate an individual in a dataset by using extreme value or exceptional values.

For example:

# SELECT * FROM employees;

id  |  name          | job  | salary
------+----------------+------+--------
1578 | xkjefus3sfzd   | NULL |    1498
2552 | cksnd2se5dfa   | NULL |    2257
5301 | fnefckndc2xn   | NULL |   45489
7114 | npodn5ltyp3d   | NULL |    1821
                

In this table, we can see that a particular employee has a very high salary, very far from the average salary. Therefore this person is probably the CEO of the company.

With generalization, this is important because the size of the range (the step) must be wide enough to prevent the identification of one single individual.

WARNING : By definition, with generalization the data remains true, but the column type is changed. This means that the transformation is not transparent.

F.65.8. Custom Fake Data #

This extension is delivered with a small set of fake data by default. For each fake function ( fake_email(), fake_first_name()) we provide only 1000 unique values, and they are only in Russian.

Here’s how you can create your own set of fake data!

F.65.8.1. Generate your own fake dataset #

As an example, here’s a python script that will generate fake data for you:

$TANTOR_DIR/utils/data_generator/data_generator.py

This script requires library Faker. You can install it with:

pip3 install faker
                

To produce 5000 emails in Russian & English, you’d call the scripts like this:

data_generator.py --table email --locales ru_RU,en --lines 5000
            

This will output the fake data in CSV format.

Use data_generator.py --help for more details about the script parameters.

You can load the fake data directly into the extension like this:

TRUNCATE transp_anon.email;

COPY transp_anon.email
FROM
PROGRAM 'data_generator.py --table email --locales ru_RU,en --lines 5000';

SELECT setval('transp_anon.email_oid_seq', max(oid))
FROM transp_anon.email;

CLUSTER transp_anon.email;
            

F.65.8.2. Load your own fake data #

If you want to use your own dataset, you can import custom CSV files with :

SELECT transp_anon.init('/path/to/custom_csv_files/')
            

Look at the data folder to find the format of the CSV files.

F.65.9. Performances #

Any anonymization process has a price as it will consume CPU time, RAM space and probably a bunch of disk I/O… Here’s a a quick overview of the question depending on what strategy you are using….

In a nutshell, the anonymization performances will mainly depend on 2 important factors:

  • The size of the database

  • The number of masking rules

F.65.9.1. Dynamic Masking #

With dynamic masking, the real data is replaced on-the-fly every time a masked user sends a query to the database. This means that the masking users will have slower response time than regular (unmasked) users. This is generally ok because usually masked users are not considered as important as the regular ones.

If you apply 3 or 4 rules to a table, the response time for the masked users should approx. 20% to 30% slower than for the normal users.

As the masking rules are applied for each queries of the masked users, the dynamic masking is appropriate when you have a limited number of masked users that connect only from time to time to the database. For instance, a data analyst connecting once a week to generate a business report.

If there are multiple masked users or if a masked user is very active, you should probably export the masked data once-a-week on a secondary instance and let these users connect to this secondary instance.

Note

In this case, the cost of anonymization is paid only by the masked users.

F.65.9.2. How to speed things up ? #

F.65.9.2.1. Prefer MASKED WITH VALUE whenever possible #

It is always faster to replace the original data with a static value instead of calling a masking function.

F.65.9.2.2. Sampling #

If you need to anonymize data for testing purpose, chances are that a smaller subset of your database will be enough. In that case, you can easily speed up the anonymization by downsizing the volume of data.

Checkout the Sampling section for more details.

F.65.9.2.3. Materialized Views #

Dynamic masking is not always required! In some cases, it is more efficient to build CREATE MATERIALIZED VIEW instead.

For instance:

CREATE MATERIALIZED VIEW masked_customer AS
SELECT
    id,
    anon.random_last_name() AS name,
    anon.random_date_between('1920-01-01'::DATE,now()) AS birth,
    fk_last_order,
    store_id
FROM customer;
            

F.65.10. Security #

F.65.10.1. Permissions #

Here’s an overview of what users can do depending on the priviledge they have:

Action Superuser Owner Masked Role
Create the extension Yes
Drop the extension Yes
Init the extension Yes
Reset the extension Yes
Configure the extension Yes
Put a mask upon a role Yes
Start dynamic masking Yes
Stop dynamic masking Yes
Create a table Yes Yes
Declare a masking rule Yes Yes
Insert, delete, update a row Yes Yes
Static Masking Yes Yes
Select the real data Yes Yes
Regular Dump Yes Yes
Anonymous Dump Yes Yes
Use the masking functions Yes Yes Yes
Select the masked data Yes Yes Yes
View the masking rules Yes Yes Yes

F.65.10.2. Limit masking filters only to trusted schemas #

By default, the database owner can only write masking rules with functions that are located in the trusted schemas which are controlled by the superusers.

Out of the box, only the transp_anon schema is declared as trusted. This means that by defautt the functions from the pg_catalog cannot be used in masking rules.

For more details, read the Using pg_catalog functions section.

F.65.10.3. Security context of the functions #

Most of the functions of this extension are declared with the SECURITY INVOKER tag. This means that these functions are executed with the privileges of the user that calls them. This is an important restriction.

This extension contains another few functions declared with the tag SECURITY DEFINER.

F.65.11. Demos #

Demos show different cases.

F.65.11.1. Add more emails #

Add values to transp_anon.email

BEGIN;
CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE;
SELECT transp_anon.init();

SELECT count(*) FROM transp_anon.email;

CREATE TEMPORARY TABLE tmp_email
AS SELECT * FROM transp_anon.email;
TRUNCATE transp_anon.email;
INSERT INTO transp_anon.email
SELECT
ROW_NUMBER() OVER (),
concat(u.username,'@', d.domain)
FROM
(
SELECT split_part(val,'@',1) AS username
FROM tmp_email
ORDER BY RANDOM()
LIMIT 10
) u,
(
SELECT split_part(val,'@',2) AS domain
FROM tmp_email
ORDER BY RANDOM()
LIMIT 5
) d
;

SELECT count(*) FROM transp_anon.email;

ROLLBACK;
            

F.65.11.2. Dynamic anonymization for another role #

In this demo we create masked role skynet. Select operations will return anonymized data for this role.

-- STEP 1 : Activate the masking engine
CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE;
SELECT transp_anon.init();

-- STEP 2 : Declare a masked user
CREATE ROLE skynet LOGIN;
SECURITY LABEL FOR transp_anon ON ROLE skynet IS 'MASKED';

-- STEP 3 : Declare the masking rules
CREATE TABLE people ( id TEXT, firstname TEXT, lastname TEXT, phone TEXT);
INSERT INTO people VALUES ('T1','Sarah', 'Conor','0609110911');
SELECT * FROM people;

-- STEP 3 : Declare the masking rules
SECURITY LABEL FOR transp_anon ON COLUMN people.lastname
IS 'MASKED WITH FUNCTION transp_anon.fake_last_name()';

SECURITY LABEL FOR transp_anon ON COLUMN people.phone
IS 'MASKED WITH FUNCTION transp_anon.partial(phone,2,$$******$$,2)';

-- STEP 4 : Connect with the masked user
\! psql demo -U skynet -c 'SELECT * FROM people;'

-- STEP 5 : Clean up
DROP EXTENSION anon CASCADE;
REASSIGN OWNED BY skynet TO postgres;
DROP OWNED BY skynet;
DROP ROLE skynet;
            

F.65.11.3. Declare masking rules for a column #

In this demo masking rules are declared for a column player.name .

BEGIN;

CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE;

SELECT transp_anon.init();

CREATE TABLE player( id SERIAL, name TEXT, points INT);

INSERT INTO player VALUES  
( 1, 'Kareem Abdul-Jabbar',	38387),
( 5, 'Michael Jordan',	32292);

SECURITY LABEL FOR transp_anon ON COLUMN player.name 
IS 'MASKED WITH FUNCTION transp_anon.fake_last_name()';

ROLLBACK;
            

F.65.11.4. Generalization #

Demo using generalization functions.

BEGIN;

CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE;

CREATE TABLE patient (
    ssn TEXT,
    firstname TEXT,
    zipcode INTEGER,
    birth DATE,
    disease TEXT
);

INSERT INTO patient
VALUES
('253-51-6170','Alice',47012,'1989-12-29','Heart Disease'),
('091-20-0543','Bob',42678,'1979-03-22','Allergy'),
('565-94-1926','Caroline',42678,'1971-07-22','Heart Disease'),
('510-56-7882','Eleanor',47909,'1989-12-15','Acne'),
('098-24-5548','David',47905,'1997-03-04','Flu'),
('118-49-5228','Jean',47511,'1993-09-14','Flu'),
('263-50-7396','Tim',47900,'1981-02-25','Heart Disease'),
('109-99-6362','Bernard',47168,'1992-01-03','Asthma'),
('287-17-2794','Sophie',42020,'1972-07-14','Asthma'),
('409-28-2014','Arnold',47000,'1999-11-20','Diabetes')
;

SELECT * FROM patient;

CREATE MATERIALIZED VIEW generalized_patient AS
SELECT
    'REDACTED' AS firstname,
    transp_anon.generalize_int4range(zipcode,1000) AS zipcode,
    transp_anon.generalize_daterange(birth,'decade') AS birth,
    disease
FROM patient;

SELECT * FROM generalized_patient;

ROLLBACK;
            

F.65.11.5. Hashing #

This is a demo of the hashing function We will try to pseudonymize a natural foreign key

BEGIN;

-- We have a simplistic customer table
CREATE TABLE customer (
id SERIAL,
name TEXT,
phone_number TEXT UNIQUE NOT NULL
);

INSERT INTO customer VALUES
(2046,'Omar Little','410-719-9009'),
(8123,'Russell Bell','410-617-7308'),
(3456,'Avon Barksdale','410-385-2983');

-- We have a log of their calls to our customer service hotline
CREATE TABLE hotline_call_history (
id SERIAL,
fk_phone_number TEXT REFERENCES customer(phone_number) DEFERRABLE,
call_start_time TIMESTAMP,
call_end_time TIMESTAMP
);

INSERT INTO hotline_call_history VALUES
(834,'410-617-7308','2004-05-17 09:41:01','2004-05-17 09:44:24'),
(835,'410-385-2983','2004-05-17 11:22:55','2004-05-17 11:34:18'),
(839,'410-719-9009','2004-05-18 16:02:03','2004-05-18 16:22:56'),
(878,'410-385-2983','2004-05-20 13:13:34','2004-05-18 13:51:00');

-- We can get a details view of the calls like this
SELECT
c.id    AS customer_id,
c.name  AS customer_name,
h.call_start_time,
h.call_end_time
FROM
hotline_call_history h
JOIN customer c ON h.fk_phone_number = c.phone_number
WHERE
extract(year from h.call_start_time) = 2004
ORDER BY h.call_start_time;

-- Now let's pseudonymize this !
CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE;

-- Init
SELECT transp_anon.init();
SET transp_anon.salt = '4ag3b803de6180361';
SET transp_anon.algorithm = 'md5';

-- SELECT transp_anon.set_secret_salt('4ag3b803de6180361');
-- SELECT transp_anon.set_secret_algorithm('md5');

-- Masking Rules
SECURITY LABEL FOR transp_anon ON COLUMN customer.name
IS 'MASKED WITH FUNCTION transp_anon.concat(
                                transp_anon.pseudo_first_name(name),
                                $$ $$,
                                transp_anon.pseudo_last_name(name))';

SECURITY LABEL FOR transp_anon ON COLUMN customer.phone_number
IS 'MASKED WITH FUNCTION transp_anon.hash(phone_number)';

SECURITY LABEL FOR transp_anon ON COLUMN hotline_call_history.fk_phone_number
IS 'MASKED WITH FUNCTION transp_anon.hash(fk_phone_number)';


-- Apply the masking rules
SET CONSTRAINTS ALL DEFERRED;
SELECT current_user \gset
SECURITY LABEL FOR transp_anon ON ROLE :current_user IS 'MASKED';

-- Launch the same request
SELECT
c.id    AS customer_id,
c.name  AS customer_name,
h.call_start_time,
h.call_end_time
FROM
hotline_call_history h
JOIN customer c ON h.fk_phone_number = c.phone_number
WHERE
extract(year from h.call_start_time) = 2004
ORDER BY h.call_start_time;

--
ROLLBACK;
            

F.65.11.6. Masking rule inheritance #

Demo show anonymization during inheritance

BEGIN;

CREATE TABLE public.invoice (
    name            text,
    amount          float,
    published_date  date
);

CREATE TABLE public.invoice_2020 (
    CHECK ( published_date >= DATE '2020-01-01'
        AND published_date <= DATE '2020-12-31' )
) INHERITS (invoice);

INSERT INTO invoice_2020
VALUES('John DOE', 236.25, '2020-12-09');

CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE;
SELECT transp_anon.init();

SELECT current_user \gset
SECURITY LABEL FOR transp_anon ON ROLE :current_user IS 'MASKED';

--
-- Put mask on the mother table
--
SECURITY LABEL FOR transp_anon ON COLUMN public.invoice.name
IS 'MASKED WITH VALUE $$CONFIDENTIAL$$';

SELECT public.invoice.name, public.invoice_2020.name
    FROM public.invoice, public.invoice_2020;

--      name     |     name     
-- --------------+--------------
--  CONFIDENTIAL | CONFIDENTIAL

--
-- Put mask on the child table
--
SECURITY LABEL FOR transp_anon ON COLUMN public.invoice_2020.name
IS 'MASKED WITH VALUE $$DELETED$$';

SELECT public.invoice.name, public.invoice_2020.name
    FROM public.invoice, public.invoice_2020;

--      name     |  name   
-- --------------+---------
--  CONFIDENTIAL | DELETED

ROLLBACK; 
            

F.65.11.7. Noise reduction attack #

This is a basic example of a repeat attacks against the noise() masking functions when they're used with dynamic masking. We will demonstrate how a masked role can guess a real data using only the masking view

BEGIN;

-- STEP 1 : Activate the masking engine
CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE;
SELECT transp_anon.init();

-- STEP 2: Declare a masked user
CREATE ROLE attacker LOGIN;
SECURITY LABEL FOR transp_anon ON ROLE attacker IS 'MASKED';

-- STEP 3: Load the data
DROP TABLE IF EXISTS people CASCADE;
CREATE TABLE people ( id INT, name TEXT, age INT);
INSERT INTO people VALUES (157, 'Mike Ehrmantraut' , 63);
INSERT INTO people VALUES (482, 'Jimmy McGill',  47);

-- STEP 4: Declare the masking rules
SECURITY LABEL FOR transp_anon ON COLUMN people.name
IS 'MASKED WITH VALUE $$CONFIDENTIAL$$ ';

SECURITY LABEL FOR transp_anon ON COLUMN people.age
IS 'MASKED WITH FUNCTION transp_anon.noise(age, 0.33)';

-- STEP 5: Now let's connect with the masked user

SELECT current_user \gset
SECURITY LABEL FOR transp_anon ON ROLE :current_user IS 'MASKED';

-- When attacker asks for the age of person #157, he/she gets a "noised" value
SELECT age FROM people WHERE id = 157;

-- Now let's do this 10000 times and get the average
DO
$$
DECLARE
v iNT;
a int[];
BEGIN
FOR i in 1..10000
LOOP
    SELECT age into v FROM people WHERE id=157;
    a:=array_append(a,v);
END LOOP;
SELECT avg(u) into v FROM unnest(a) u;
RAISE NOTICE 'Age of Person 157: %', v; END
$$;

ROLLBACK;  
            

F.65.11.8. Return NULL or fake value #

BEGIN;

CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE;

SELECT transp_anon.init();


CREATE OR REPLACE FUNCTION null_or_random_date(val TIMESTAMPTZ)
RETURNS TIMESTAMPTZ
AS $$
SELECT transp_anon.random_date();
$$ LANGUAGE SQL VOLATILE RETURNS NULL ON NULL INPUT;

SELECT null_or_random_date(NULL);

SELECT null_or_random_date('2020-05-05');

ROLLBACK;
            

F.65.11.9. Partial #

BEGIN;

CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE;

SELECT transp_anon.partial('abcdefgh',1,'xxxx',3);

SELECT transp_anon.partial('+33142928100',4,'******',2);

SELECT transp_anon.partial_email('[email protected]');

ROLLBACK;
            

F.65.11.10. Difference fake functions #

BEGIN;

CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE;

SELECT transp_anon.init();

--let's use `TEMPORARY` instead of `pg_temp` for clarity
--SET search_path TO pg_temp, public;

CREATE TEMPORARY TABLE customer(
id SERIAL,
full_name TEXT,
hair_color TEXT,
birth DATE,
employer TEXT,
zipcode TEXT,
fk_shop INTEGER
);

INSERT INTO customer
VALUES
(911,'Chuck Norris','brown','1940-03-10','Texas Rangers', '75001',12),
(312,'David Hasselhoff','black','1952-07-17','Baywatch', '90001',423)
;

SELECT * FROM customer;

UPDATE customer
SET
full_name=transp_anon.fake_first_name() || ' ' || transp_anon.fake_last_name(),
hair_color=transp_anon.random_in(ARRAY['black','blonde','brown','red',NULL]),
birth=transp_anon.random_date_between('1920-01-01'::DATE,now()),
employer=transp_anon.fake_company(),
zipcode=transp_anon.random_zip()
;

SELECT * FROM customer;

ROLLBACK;
            

F.65.11.11. Write your own mask #

BEGIN;

CREATE EXTENSION IF NOT EXISTS transp_anon CASCADE;

SELECT transp_anon.init();

CREATE TABLE company (
business_name TEXT,
info JSONB
);

INSERT INTO company
VALUES ( 'Soylent Green',
'{"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
]}');

SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green';

CREATE FUNCTION transp_anon.anonymize_last_name(j JSONB)
RETURNS JSONB
VOLATILE
LANGUAGE SQL
AS $func$
SELECT
json_build_object(
    'employees' ,
    array_agg(
    jsonb_set(e ,'{lastName}', to_jsonb(transp_anon.fake_last_name()))
    )
)::JSONB
FROM jsonb_array_elements( j->'employees') e
$func$;

SELECT transp_anon.anonymize_last_name(info) FROM company;

SECURITY LABEL FOR transp_anon ON COLUMN company.info
IS 'MASKED WITH FUNCTION transp_anon.anonymize_last_name(info)';

SELECT current_user \gset
SECURITY LABEL FOR transp_anon ON ROLE :current_user IS 'MASKED';

SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green';

ROLLBACK;