Quick start

Create a new database and connect Tantor PipelineDB extension:

psql -c "CREATE DATABASE pipeline;"
psql -d pipeline -c "CREATE EXTENSION pipelinedb;"

Analysis of air traffic data

In this example, we will analyze the airlines passenger traffic data. The data contains information on amount of passenger by month over three years: 1958, 1959 и 1960. We will use Tantor PipelineDB to create an external table, load data from a CSV file and then analyze that data using a view.

Creating an external table

First, let’s create an external table air_travel_stream, which will store data about passenger traffic data. This table will contain three columns for each year and one column for the month:

psql -d pipeline -c "
CREATE FOREIGN TABLE air_travel_stream (
  month text,
  year_1958 int,
  year_1959 int,
  year_1960 int
) SERVER pipelinedb;"

The format of the entries in the air_travel_stream table will be as follows:

month | year_1958 | year_1959 | year_1960

where:

  • month — month of the year (e.g. JAN for January).

  • year_1958 — number of passengers in 1958.

  • year_1959 — number of passengers in 1959.

  • year_1960 — number of passengers in 1960.

Creating a view for data aggregation

After creating the air_travel_stream table, let’s create a air_travel_stats view that will aggregate the data by month. This view will provide summary data such as the minimum, maximum, average and total number of passengers for each year.

psql -d pipeline -c "
CREATE VIEW air_travel_stats WITH (action=materialize) AS
SELECT month,
  count(*) AS total_records,
  sum(year_1958) AS total_1958,
  min(year_1958) AS min_1958,
  max(year_1958) AS max_1958,
  avg(year_1958)::int AS avg_1958,
  sum(year_1959) AS total_1959,
  min(year_1959) AS min_1959,
  max(year_1959) AS max_1959,
  avg(year_1959)::int AS avg_1959,
  sum(year_1960) AS total_1960,
  min(year_1960) AS min_1960,
  max(year_1960) AS max_1960,
  avg(year_1960)::int AS avg_1960
FROM air_travel_stream
GROUP BY month;"

Loading data and executing queries

Now let’s load the data from the CSV file into the table air_travel_stream. This file contains the number of passengers for each month in 1958, 1959, and 1960.

curl -sL https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv | \
grep -v '^$' | \
psql -d pipeline -c "COPY air_travel_stream (month, year_1958, year_1959, year_1960) FROM STDIN WITH CSV HEADER"

Once the data is loaded, you can query the air_travel_stats view to retrieve the aggregated data. This query will sort the data by the number of passengers in 1958 in descending order.

psql -d pipeline -c "SELECT * FROM air_travel_stats ORDER BY total_1958 DESC;"

The result of the query will be as follows:

 month | total_records | total_1958 | min_1958 | max_1958 | avg_1958 | total_1959 | min_1959 | max_1959 | avg_1959 | total_1960 | min_1960 | max_1960 | avg_1960
-------+---------------+------------+----------+----------+----------+------------+----------+----------+----------+------------+----------+----------+----------
 AUG   |             1 |        505 |      505 |      505 |      505 |        559 |      559 |      559 |      559 |        606 |      606 |      606 |      606
 JUL   |             1 |        491 |      491 |      491 |      491 |        548 |      548 |      548 |      548 |        622 |      622 |      622 |      622
 JUN   |             1 |        435 |      435 |      435 |      435 |        472 |      472 |      472 |      472 |        535 |      535 |      535 |      535
 SEP   |             1 |        404 |      404 |      404 |      404 |        463 |      463 |      463 |      463 |        508 |      508 |      508 |      508
 MAY   |             1 |        363 |      363 |      363 |      363 |        420 |      420 |      420 |      420 |        472 |      472 |      472 |      472
 MAR   |             1 |        362 |      362 |      362 |      362 |        406 |      406 |      406 |      406 |        419 |      419 |      419 |      419
 OCT   |             1 |        359 |      359 |      359 |      359 |        407 |      407 |      407 |      407 |        461 |      461 |      461 |      461
 APR   |             1 |        348 |      348 |      348 |      348 |        396 |      396 |      396 |      396 |        461 |      461 |      461 |      461
 JAN   |             1 |        340 |      340 |      340 |      340 |        360 |      360 |      360 |      360 |        417 |      417 |      417 |      417
 DEC   |             1 |        337 |      337 |      337 |      337 |        405 |      405 |      405 |      405 |        432 |      432 |      432 |      432
 FEB   |             1 |        318 |      318 |      318 |      318 |        342 |      342 |      342 |      342 |        391 |      391 |      391 |      391
 NOV   |             1 |        310 |      310 |      310 |      310 |        362 |      362 |      362 |      362 |        390 |      390 |      390 |      390
(12 rows)

As can be seen from the results, the data are sorted by the month with the highest number of passengers in 1958. The air_travel_stats view provides a quick summary by month for each year.