Report versions

With a simple query, it is easy to see that reports when only looking at their local_number have multiple versions. Let's select the top 10 reports with the most versions.

SELECT
  local_number,
  count(*)
FROM import.substance
GROUP BY local_number
ORDER BY count(*) DESC LIMIT 10;
local_numbercount
EU-EC-2950025133
EU-EC-10015259316123
EU-EC-10015258987106
EU-EC-12370933100
EU-EC-1001457223199
EU-EC-1001081394198
EU-EC-1001526846694
EU-EC-1001516394994
EU-EC-1001526137694
EU-EC-1001494765993

(10 rows)

Can a report span multiple days?

Let's check:

SELECT
  local_number,
  COUNT(DISTINCT report_date)
FROM import.substance
GROUP BY local_number
ORDER BY COUNT(DISTINCT report_date) DESC LIMIT 10;
local_numbercount
EU-EC-10000001
EU-EC-100000001
EU-EC-100000000111
EU-EC-100000000161
EU-EC-100000000191
EU-EC-100000000211
EU-EC-100000000221
EU-EC-100000000241
EU-EC-100000000251
EU-EC-100001

(10 rows)

it seems it can't.

So what is causing multiple versions of a report?

We find that the report with local_number "EU-EC-2950025" has 133 versions. We will look at this report in more detail. First of all, we check if our previous conclusion that the date does not change holds.

SELECT
  local_number,
  report_date,
  COUNT(*)
FROM import.substance
WHERE
  local_number = 'EU-EC-2950025'
GROUP BY
  local_number,
  report_date;
local_numberreport_datecount
EU-EC-29500252009-07-13133

It does, all the versions are from exactly the same date (day). Let's look at the is_child_report boolean to see if at least one of the versions is marked as a parent.

SELECT
  local_number,
  report_date,
  is_child_report,
  COUNT(*)
FROM import.substance
WHERE
  local_number = 'EU-EC-2950025'
GROUP BY
  local_number,
  report_date,
  is_child_report;

It turns out they are all child reports.

local_numberreport_dateis_child_reportcount
EU-EC-29500252009-07-13t133

Could it have something to do with the substance_id?

SELECT
  local_number,
  report_date,
  substance_id,
  COUNt(*)
FROM
  import.substance
WHERE local_number = 'EU-EC-2950025'
GROUP BY
  local_number,
  report_date,
  is_child_report,
  substance_id;
local_numberreport_datesubstance_idcount
EU-EC-29500252009-07-13146211
EU-EC-29500252009-07-13146231
EU-EC-29500252009-07-13146401
.....................
EU-EC-29500252009-07-13768841
EU-EC-29500252009-07-132011611
EU-EC-29500252009-07-132280301

(133 rows) (truncated for readability)

As the table shows: It does. This report spans 133 substances. We can conclude for this particular report that it is a report that is related to multiple substances. Our scripts to get line listings from EMA loops all substance of product identifiers and thus will return a list of reports per substance and product. This means that where the local_number should be the so called primary key across our database, it is impossible to use it that way in the downloading process but the import.product and import.substance tables should have a UNIQUE(substance_id, local_number) combination.

Do we have unique combinations of id and local_number?

Let's see if the combination product_id and local_number is unique:

ALTER TABLE import.product ADD CONSTRAINT
  product_unique_local_number_product_id
UNIQUE (product_id, local_number);

If this is the case, we should not get an error. It is indeed true. The output is

ALTER TABLE

We repeat the process for the substance table

ALTER TABLE import.substance ADD CONSTRAINT
  substance_unique_local_number_substance_id
UNIQUE (substance_id, local_number);

Again, we should not get an error, so the output should be

ALTER TABLE

We can now say that the product and substance tables have the integrity we want, that all the records are unique for the combination between the identifier of the substance(substance_id) or product(product_id) and the report identifier (local_number)