Product, Substance, Drug

Throughout the database, we find products and substances but drugs are also mentioned. We are curious about the overlap and the differences and would like to see if we can make head or tails from the different types.

We have tables for products and substances. Is it possible to make a table that contains both and add all entries found for drugs as well? Let's first look at the unique entries in the drug list tables.

SELECT
  'product_suspect_drug_list' as source,
  count(distinct drugs)
FROM
  import.product_subject_drug_list
UNION
SELECT
  'product_concomitant_drug_list' as source,
  count(distinct drugs)
FROM
  import.product_concomitant_drug_list
UNION
SELECT
  'substance_suspect_drug_list' as source,
  count(distinct drugs)
FROM
  import.substance_subject_drug_list
UNION
SELECT
  'substance_concomitant_drug_list' as source,
  count(distinct drugs)
FROM
  import.substance_concomitant_drug_list
UNION
SELECT
  'products' as source,
  count(*)
FROM
  import.products
UNION
SELECT
  'substances' as source,
  count(*)
FROM
  import.substances;
sourcecount
substances4333
substance_suspect_drug_list39350
product_suspect_drug_list18960
product_concomitant_drug_list34939
substance_concomitant_drug_list65836
products1469

As we can see there is a big difference between the amount of products, substances and drugs.

The drug lists contains textual information which seems to be a combination of medicine (not always present as product and sometimes omitted) and substances between square brackets.

We are going to try to get everything into a new table to see if we can find any relations.

CREATE TABLE analyses.temp_names as
  SELECT distinct drugs as "words" from import.product_subject_drug_list;
insert into analyses.temp_names
  SELECT distinct drugs from import.substance_subject_drug_list;
insert into analyses.temp_names
  SELECT distinct drugs from import.product_concomitant_drug_list;
insert into analyses.temp_names
  SELECT distinct drugs from import.substance_concomitant_drug_list;
insert into analyses.temp_names
  SELECT distinct name from import.products;
insert into analyses.temp_names
  SELECT distinct name from import.substances;
create table analyses.name as select distinct words from analyses.temp_names;
drop table analyses.temp_names;
SELECT count(*) from analyses.name;
count
84990

We find that the result of getting the distinct set of names, words or whatever we should call this is less then the sum of all found words from all tables that contain names or descriptions for substance, product or drug. Can we find intersections and relations?

At least we now have one long list to search through. Let's do a little experiment.

Mentions of a product

select * from analyses.name where words ilike '%comirnaty%';
words
COMIRNATY 10 MICROGRAMS/DOSE CONCENTRATE FOR DISPERSION FOR INJECTION COVID-19 MRNA VACCINE
COMIRNATY 10 MICROGRAMS/DOSE CONCENTRATE FOR DISPERSION FOR INJECTION [TOZINAMERAN]
COMIRNATY 10 MICROGRAMS/DOSE [TOZINAMERAN]
COMIRNATY 30 MICROGRAMS/DOSE CONCENTRATE FOR DISPERSION FOR INJECTION COVID-19 MRNA VACCINE
COMIRNATY 30 MICROGRAMS/DOSE CONCENTRATE FOR DISPERSION FOR INJECTION [TOZINAMERAN]
COMIRNATY 30 MICROGRAMS/DOSE DISPERSION FOR INJECTION COVID-19 MRNA VACCINE
COMIRNATY 30 MICROGRAMS/DOSE DISPERSION FOR INJECTION [TOZINAMERAN]
COMIRNATY 30 MICROGRAMS/DOSE [TOZINAMERAN]
COMIRNATY 3 MICROGRAMS/DOSE CONCENTRATE FOR DISPERSION FOR INJECTION COVID-19 MRNA VACCINE
COMIRNATY CONCENTRATE FOR DISPERSION FOR INJECTION 10 MICROGRAMS/DOSE [TOZINAMERAN]
COMIRNATY CONCENTRATE FOR DISPERSION FOR INJECTION 30 MICROGRAMS/DOSE [TOZINAMERAN]
COMIRNATY CONCENTRATE FOR DISPERSION FOR INJECTION COVID-19 MRNA VACCINE
COMIRNATY CONCENTRATE FOR DISPERSION FOR INJECTION [TOZINAMERAN]
COMIRNATY COVID-19 MRNA VACCINE
COMIRNATY DISPERSION FOR INJECTION [TOZINAMERAN]
COMIRNATY ORIGINAL/OMICRON BA.1
COMIRNATY ORIGINAL/OMICRON BA.1 DISPERSION FOR INJECTION
COMIRNATY ORIGINAL/OMICRON BA.1 [TOZINAMERAN, RILTOZINAMERAN]
COMIRNATY ORIGINAL/OMICRON BA.4-5
COMIRNATY ORIGINAL/OMICRON BA.4-5 COVID-19 MRNA VACCINE
COMIRNATY ORIGINAL/OMICRON BA.4-5 [TOZINAMERAN, 5'-CAPPED MRNA ENCODING SARS-COV-2, OMICRON VARIANTS BA.4 AND BA.5, SPIKE PROTEIN, PRE-FUSION STABILISED
COMIRNATY ORIGINAL/OMICRON BA.4-5 [TOZINAMERAN, FAMTOZINAMERAN]
COMIRNATY [TOZINAMERAN]

As seen, there are products and substances mentioned in the drug lists, but also information about the dose and sometimes even unrelated terms.

Language specific information

One of the things we do not seem to be able to extract from the EMA database is the country a report originates from. But looking at the created analyses.name table shows something interesting. There is information in the words that are language specific. If we look for instance for the text per dag which is Dutch for per day, we find 3 rows.

select * from analyses.name where words ilike '%per dag%';
words
1000MG 3X PER DAG [NOT AVAILABLE]
DE PIL 30 MG PER DAG [NOT AVAILABLE]
KOELZALF 3X PER DAG OP DE HUID AANBRENGEN; [NOT AVAILABLE]

Another particular Dutch word is prik let's see if we can find that too:

select * from analyses.name where words ilike '%prik%';
words
6DE PRIK DT [NOT AVAILABLE]
BOOSTRIX POLIO INJSUSP WWSP 0,5ML / 4DE PRIK DKTP/HIB/HEPATITIS B [NOT AVAILABLE]
EEN PRIK IN MIJN SCHOUDER TEGEN DE SLIJMBEURSONTSTEKING [NOT AVAILABLE]
RUGGENPRIK [NOT AVAILABLE]
RUGGEPRIK [NOT AVAILABLE]
TETANUS PRIK [NOT AVAILABLE]

Here we also discover the Dutch word tegen meaning against. Let's look for that too.

select * from analyses.name where words ilike '%tegen%';
words
2 GANGBARE MEDICIJNEN TEGEN HOGE BLOEDDRUK. [NOT AVAILABLE]
AANSTIPMIDDEL TEGEN ACNE [NOT AVAILABLE]
DIVERSE PUFJES TEGEN MIJN COPD [NOT AVAILABLE]
EEN PRIK IN MIJN SCHOUDER TEGEN DE SLIJMBEURSONTSTEKING [NOT AVAILABLE]
GENEESMIDDEL TEGEN HOOIKOORTS EN ANDERE ALLERGIEKLACHTEN [NOT AVAILABLE]
IETS TEGEN MISSELIJKHEID [NOT AVAILABLE]
MIDDEL TEGEN BOEZEM FIBRILEREN [NOT AVAILABLE]
MIDDEL TEGEN TRAGE SCHILDKLIER [NOT AVAILABLE]
ONTSTEKINGSREMMER TEGEN ARTRITIS PSORIATICA [NOT AVAILABLE]
PUFJE TEGEN ASTMA [NOT AVAILABLE]
TEGEN DE MISSELIJKHEID EN TEGEN EEN ALLERGISCHE REACTIE [NOT AVAILABLE]
ZALFJES TEGEN PSORIASIS [NOT AVAILABLE]

Although we cannot discover in detail from which country a report originates, we can determine that the Dutch language is used in specific reports. Maybe this method can also be used to investigate other languages?

Word cloud

To get insight in the most used words, we generate a table that can be used to create a word cloud.

WITH words AS (
    SELECT unnest(string_to_array(lower(words), ' ')) AS word
    FROM analyses.name
)
SELECT word, count(*) as frequency  FROM words
GROUP BY word order by count(*) desc limit 25;
wordfrequency
[not16887
available]16871
acid,5725
sodium5111
not4488
hydrochloride,4425
available4292
for3875
mg3647
chloride,3021
calcium2874
hydrochloride]2603
2424
magnesium2259
nos]1974
solution1913
potassium1850
unspecified1850
chloride1742
medication1633
nos,1558

Maybe we should clean up special characters like ] and , but for now this will give us an idea.

We generate a file containing all words from the previous query.

COPY (SELECT unnest(string_to_array(lower(words), ' ')) AS word
    FROM analyses.name) TO '/var/lib/postgres/wordcloud.csv' (format CSV);

Uploading this file to Free World Cloud Generator renders a nice word cloud.

Word cloud

If we ignore the word "available" for now, it is interesting to see words like (hydro)chloride, aluminium and acid and another Dutch word: "pollen".

Can you detect the most used medicine in the word cloud?

You now should have a general idea about what is available by gathering textual information and stitching it together. Can you think of more advanced uses?