disease.database.postgresql

Provide PostgreSQL client.

class disease.database.postgresql.PostgresDatabase(db_url=None, **db_args)

Disease Normalizer database client for PostgreSQL.

__init__(db_url=None, **db_args)

Initialize Postgres connection.

>>> from disease.database.postgresql import PostgresDatabase
>>> db = PostgresDatabase(
...     user="postgres",
...     password="matthew_cannon2",
...     db_name="disease_normalizer"
... )
Parameters:

db_url (Optional[str]) – libpq-compliant database connection URI

Keyword Arguments:
  • user: Postgres username

  • password: Postgres password (optional or blank if unneeded)

  • db_name: name of database to connect to

Raises:

DatabaseInitializationException – if initial setup fails
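
As an alternative to the keyword arguments, a connection can be described by a single libpq-style URI passed as db_url. The following is a minimal sketch of assembling such a URI; build_db_url is a hypothetical helper that is not part of this package, and the default host and port are assumptions.

```python
from urllib.parse import quote


def build_db_url(user, password="", host="localhost", port=5432,
                 db_name="disease_normalizer"):
    """Assemble a libpq-style connection URI (hypothetical helper)."""
    # Percent-encode credentials so characters like "@" or "/" don't break the URI.
    credentials = quote(user, safe="")
    if password:
        credentials += ":" + quote(password, safe="")
    return f"postgresql://{credentials}@{host}:{port}/{quote(db_name, safe='')}"


db_url = build_db_url("postgres", "p@ss/word")
print(db_url)
```

The resulting string could then be supplied as PostgresDatabase(db_url=db_url) in place of the individual keyword arguments.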

add_merged_record(record)

Add merged record to database.

Parameters:

record (Dict) – merged record to add

Return type:

None

add_record(record, src_name)

Add new record to database.

Parameters:
  • record (Dict) – record to upload

  • src_name (SourceName) – name of source for record. Not used by PostgreSQL instance.

Return type:

None

add_source_metadata(src_name, meta)

Add new source metadata entry.

Parameters:
  • src_name – name of source

  • meta – metadata entry for the source

Raises:

DatabaseWriteException – if write fails

Return type:

None

check_schema_initialized()

Check if database schema is properly initialized.

Return type:

bool

Returns:

True if DB appears to be fully initialized, False otherwise

check_tables_populated()

Perform rudimentary checks to see if tables are populated. Emphasis is on rudimentary – if some fiendish element has deleted half of the disease aliases, this method won’t pick it up. It just wants to see if a few critical tables have at least a small number of records.

Return type:

bool

Returns:

True if queries are successful, False if DB appears empty
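
The kind of rudimentary check described above can be sketched as follows. This is an illustration only: it uses the standard library's sqlite3 as a stand-in for PostgreSQL so that it runs without a server, and the disease_aliases table name, the column layouts, and the one-row threshold are assumptions rather than the package's actual schema or logic.

```python
import sqlite3

# Hypothetical table names; the real schema may differ.
CRITICAL_TABLES = ("disease_concepts", "disease_aliases")


def tables_populated(conn, tables=CRITICAL_TABLES, min_rows=1):
    """Return True only if every listed table exists and has at least min_rows rows."""
    for table in tables:
        try:
            # Table names come from the fixed tuple above, never from user input.
            (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        except sqlite3.OperationalError:
            return False  # a missing table counts as "not populated"
        if count < min_rows:
            return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE disease_concepts (concept_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE disease_aliases (alias TEXT, concept_id TEXT)")
empty_result = tables_populated(conn)

# Placeholder rows, not real disease data.
conn.execute("INSERT INTO disease_concepts VALUES ('example:0001')")
conn.execute("INSERT INTO disease_aliases VALUES ('example disease', 'example:0001')")
populated_result = tables_populated(conn)
print(empty_result, populated_result)
```

A check like this only confirms that rows exist; as noted above, it would not detect partially deleted data.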

close_connection()

Perform any manual connection closure procedures if necessary.

Return type:

None

complete_write_transaction()

Conclude transaction or batch writing if relevant.

Return type:

None

delete_normalized_concepts()

Remove merged records from the database. Use when performing a new update of normalized data.

It would be faster to drop the entire table and do a cascading delete onto the merge_ref column in disease_concepts, but that requires an exclusive access lock on the DB, which can be annoying (i.e., you couldn’t have multiple processes accessing it, or pgAdmin, etc.). Instead, we’ll take down each merge table dependency and rebuild it afterwards.

Raises:
Return type:

None

delete_source(src_name)

Delete all data for a source. Use when updating source data.

All of the foreign key relations make deletes extremely slow, so this method drops and then re-adds them once deletes are finished. This makes it a little brittle, and it’d be nice to revisit in the future to perform as a single atomic transaction.

Refreshing the materialized view at the end might be redundant, because this method will almost always be called right before more data is written, but it’s probably necessary just in case that doesn’t happen.

Parameters:

src_name (SourceName) – name of source to delete

Raises:

DatabaseWriteException – if deletion call fails

Return type:

None

drop_db()

Perform complete teardown of DB. Useful for quickly resetting all data or reconstructing after apparent schema error. If in a protected environment, require confirmation.

Raises:

DatabaseWriteException – if called in a protected setting with confirmation silenced.

Return type:

None

export_db(output_directory)

Dump DB to specified location.

Parameters:

output_directory – path to directory to save DB dump in

Return type:

None

Returns:

Nothing, but saves results of pg_dump to file named disease_norm_<date and time>.sql

Raises:
  • ValueError – if output directory isn’t a directory or doesn’t exist

  • DatabaseException – if psql call fails
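
The directory validation and file naming described above might look roughly like the sketch below. build_dump_path is a hypothetical helper, and the timestamp format is an assumption, since the documentation only specifies "<date and time>"; the actual pg_dump invocation is omitted.

```python
from datetime import datetime
from pathlib import Path


def build_dump_path(output_directory, now=None):
    """Validate the output directory and return the dump file path (illustrative)."""
    directory = Path(output_directory)
    if not directory.is_dir():
        # Covers both a nonexistent path and a path that points to a regular file.
        raise ValueError(
            f"Output location {directory} isn't a directory or doesn't exist"
        )
    now = now or datetime.now()
    # Assumed timestamp format; the real naming scheme may differ.
    return directory / f"disease_norm_{now.strftime('%Y%m%d%H%M%S')}.sql"


print(build_dump_path(".", datetime(2024, 1, 2, 3, 4, 5)).name)
```

Checking the directory before starting the dump means a bad path fails fast with ValueError instead of surfacing later as a failed external command.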

get_all_concept_ids(source=None)

Retrieve concept IDs for use in generating normalized records.

Parameters:

source (Optional[SourceName]) – optionally, just get all IDs for a specific source

Return type:

Set[str]

Returns:

Set of concept IDs as strings.

get_all_records(record_type)

Retrieve all source or normalized records. Either return all source records, or all records that qualify as “normalized” (i.e., merged groups + source records that are otherwise ungrouped).

For example,

>>> from disease.database import create_db
>>> from disease.schemas import RecordType
>>> db = create_db()
>>> for record in db.get_all_records(RecordType.MERGER):
...     pass  # do something

Unlike DynamoDB, merged records are stored in a separate table from source records. As a result, when fetching all normalized records, merged records are returned first, and iteration continues with all source records that don’t belong to a normalized concept group.

Parameters:

record_type (RecordType) – type of result to return

Return type:

Generator[Dict, None, None]

Returns:

Generator that lazily provides records as they are retrieved
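
The ordering described above (merged records first, then ungrouped source records) can be illustrated with a small in-memory sketch. iter_normalized and the sample records are hypothetical; the real method reads from database tables, and treating a missing or None merge_ref as "ungrouped" is an assumption made here for illustration.

```python
from typing import Dict, Generator, List


def iter_normalized(
    merged: List[Dict], sources: List[Dict]
) -> Generator[Dict, None, None]:
    """Yield merged records first, then source records with no merge group."""
    for record in merged:
        yield record
    for record in sources:
        # A missing or None merge_ref is treated as "not part of any merged group".
        if record.get("merge_ref") is None:
            yield record


# Placeholder records, not real data.
merged = [{"concept_id": "merged:1"}]
sources = [
    {"concept_id": "src:a", "merge_ref": "merged:1"},
    {"concept_id": "src:b", "merge_ref": None},
]
ids = [r["concept_id"] for r in iter_normalized(merged, sources)]
print(ids)
```

Because the function is a generator, records are produced one at a time as the caller iterates, which mirrors the lazy behavior noted in the return description.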

get_record_by_id(concept_id, case_sensitive=True, merge=False)

Fetch record corresponding to provided concept ID.

Parameters:
  • concept_id (str) – concept ID for disease record

  • case_sensitive (bool) – not used by this implementation – lookups use case-insensitive index

  • merge (bool) – if true, look for merged record; look for identity record otherwise.

Return type:

Optional[Dict]

Returns:

complete disease record, if match is found; None otherwise

get_refs_by_type(search_term, ref_type)

Retrieve concept IDs for records matching the user’s query. Other methods are responsible for actually retrieving full records.

Parameters:
  • search_term (str) – string to match against

  • ref_type (RefType) – type of match to look for.

Return type:

List[str]

Returns:

list of associated concept IDs. Empty if lookup fails.

get_source_metadata(src_name)

Get license, versioning, data lookup, and other information for a source.

Parameters:

src_name (Union[str, SourceName]) – name of the source to get data for

Return type:

Optional[SourceMeta]

Returns:

source metadata, if lookup is successful

initialize_db()

Check if DB is set up. If not, create tables/indexes/views.

Return type:

None

list_tables()

Return names of tables in database.

Return type:

List[str]

Returns:

Table names in database

load_from_remote(url)

Load DB from remote dump. Warning: deletes all existing data. If url is not provided, tries to fetch the latest release from the VICC S3 bucket.

Parameters:

url (Optional[str]) – location of .tar.gz file created from output of pg_dump

Raises:

DatabaseException – if unable to retrieve file from URL or if psql command fails

Return type:

None

update_merge_ref(concept_id, merge_ref)

Update the merged record reference of an individual record to a new value.

Parameters:
  • concept_id (str) – record to update

  • merge_ref (Any) – new ref value

Raises:

DatabaseWriteException – if attempting to update non-existent record

Return type:

None