Loading and updating data#

The primary means of managing Disease Normalizer data is via the included command-line interface.

Note

See the ETL API documentation for information on programmatic access to the data loader classes.

disease-normalizer#

disease-normalizer [OPTIONS] COMMAND [ARGS]...

Manage Disease Normalizer data.

Options

--version#

Show the version and exit.

check-db#

disease-normalizer check-db [OPTIONS]

Perform basic checks on DB health and population. Exits with status code 1 if DB schema is uninitialized or if critical tables appear to be empty.

$ disease-normalizer check-db
$ echo $?
1  # indicates failure

This command is equivalent to the combination of the database classes’ check_schema_initialized() and check_tables_populated() methods:

>>> from disease.database import create_db
>>> db = create_db()
>>> db.check_schema_initialized() and db.check_tables_populated()
True  # DB passes checks
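
For scripted health checks, the exit status can also be inspected from Python. The following is a minimal sketch, not part of the Disease Normalizer API: the helper name `db_is_healthy` is hypothetical, and it assumes that the command exits with status 0 when the checks pass (the text above only specifies status 1 on failure).

```python
import subprocess
from typing import Sequence

# Hypothetical default; assumes the disease-normalizer CLI is on PATH.
DEFAULT_CHECK_CMD = ("disease-normalizer", "check-db", "--silent")


def db_is_healthy(cmd: Sequence[str] = DEFAULT_CHECK_CMD) -> bool:
    """Return True if the check command exits with status 0.

    Assumption: a passing check exits with 0; the documentation only
    states that a failing check exits with 1.
    """
    completed = subprocess.run(list(cmd), check=False)
    return completed.returncode == 0
```

A deployment script could call `db_is_healthy()` before starting the service and abort when it returns `False`.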

Options

--db_url <db_url>#

URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g. "http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/disease_normalizer").

--silent#

Suppress output to console.
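
The two accepted forms of --db_url described above can usually be told apart by their URL scheme. The sketch below is a rough heuristic for illustration only and is not how the Disease Normalizer itself selects a backend; the function name `guess_backend` is hypothetical.

```python
from urllib.parse import urlsplit


def guess_backend(db_url: str) -> str:
    """Guess which backend a --db_url value points at, based on its scheme.

    Heuristic only: libpq connection URIs start with "postgresql://" or
    "postgres://"; an HTTP(S) endpoint is treated as a DynamoDB server.
    """
    scheme = urlsplit(db_url).scheme.lower()
    if scheme in ("postgresql", "postgres"):
        return "postgresql"
    if scheme in ("http", "https"):
        return "dynamodb"
    # Anything else (e.g. a libpq keyword/value string) is left undecided.
    return "unknown"
```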

dump-database#

disease-normalizer dump-database [OPTIONS]

Dump data from database into file.

For example, to export a DynamoDB instance to an existing dynamodb_local_latest directory:

$ disease-normalizer dump-database -o dynamodb_local_latest --db_url http://localhost:8001

Options

-o, --output_directory <output_directory>#

Output location to write to

--db_url <db_url>#

URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g. "http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/disease_normalizer").

--silent#

Suppress output to console.

dump-mappings#

disease-normalizer dump-mappings [OPTIONS]

Produce JSON Lines file dump of concept referents (e.g. name/label, alias, xrefs) and the associated concept.

By default, produces output for all known referents to a normalized ID. The --scope option can be used to constrain this either to all non-merged identity records:

$ disease-normalizer dump-mappings --scope identity

Or to the identity records of a specific source:

$ disease-normalizer dump-mappings --scope ncit

The first object in the .jsonl file will include metadata about the parameters used to create the document.
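
Because the first line holds metadata rather than a mapping, downstream code generally needs to split it off before processing the rest. The following sketch shows one way to do that with the standard library; the function name `read_mappings_dump` is hypothetical, and no assumptions are made about the keys inside each object.

```python
import json


def read_mappings_dump(path):
    """Split a dump-mappings .jsonl file into (metadata, records).

    The first JSON object is treated as the metadata header; every
    following non-blank line is parsed as one mapping record.
    """
    with open(path, encoding="utf-8") as f:
        objects = [json.loads(line) for line in f if line.strip()]
    if not objects:
        raise ValueError(f"{path} contains no JSON objects")
    return objects[0], objects[1:]
```

For very large dumps, iterating over the file line by line instead of building a list would keep memory use flat.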

Options

--db_url <db_url>#

URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g. "http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/disease_normalizer").

--scope <scope>#

Scope of mappings – either an item type (merged/normalized vs. base source records), or base records of an individual source.

Options:

RecordType.IDENTITY | RecordType.MERGER | SourceName.NCIT | SourceName.MONDO | SourceName.DO | SourceName.ONCOTREE | SourceName.OMIM

-o, --outfile <outfile>#

Output location to write to

--cancer-only#

Whether to constrain mappings to just include cancers. Note: only supported by DO and MONDO records.

update#

disease-normalizer update [OPTIONS] [SOURCES]...

Update provided normalizer SOURCES in the disease database.

Valid SOURCES are "DO", "MONDO", "NCIt", "OMIM", and "OncoTree" (case-insensitive).

SOURCES are optional, but if not provided, either --all or --normalize must be used.

For example, the following command will update DO and MONDO source records:

$ disease-normalizer update DO MONDO

To completely reload all source records and construct normalized concepts, use the --all and --normalize options:

$ disease-normalizer update --all --normalize

The Disease Normalizer will fetch the latest available data from all sources if local data is out-of-date. To suppress this and force usage of local files only, use the --use_existing flag:

$ disease-normalizer update --all --use_existing
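
When the update command is driven from a script, the argument rules above can be checked before the process is launched. The following is a minimal sketch under stated assumptions: `build_update_args` and `VALID_SOURCES` are hypothetical names, and the validation mirrors only what this page describes (case-insensitive source names, and a requirement for --all or --normalize when no sources are given).

```python
# Source names listed on this page, lowercased for case-insensitive matching.
VALID_SOURCES = {"do", "mondo", "ncit", "omim", "oncotree"}


def build_update_args(sources=(), update_all=False, normalize=False,
                      use_existing=False):
    """Build an argv list for `disease-normalizer update`."""
    sources = list(sources)
    unknown = [s for s in sources if s.lower() not in VALID_SOURCES]
    if unknown:
        raise ValueError(f"Unrecognized sources: {unknown}")
    if not sources and not (update_all or normalize):
        raise ValueError("Provide SOURCES, or use --all or --normalize")

    argv = ["disease-normalizer", "update"]
    if update_all:
        argv.append("--all")
    if normalize:
        argv.append("--normalize")
    if use_existing:
        argv.append("--use_existing")
    argv.extend(sources)
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where the CLI is installed.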

Options

--all#

Update records for all sources.

--normalize#

Create normalized records.

--db_url <db_url>#

URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g. "http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/disease_normalizer").

--aws_instance#

Use cloud DynamoDB instance.

--use_existing#

Use most recent locally-available source data instead of fetching latest version.

--silent#

Suppress output to console.

Arguments

SOURCES#

Optional argument(s)

update-from-remote#

disease-normalizer update-from-remote [OPTIONS]

Update data from remotely-hosted DB dump. By default, fetches from latest available dump on VICC S3 bucket; specific URLs can be provided instead by command line option or the DISEASE_NORM_REMOTE_DB_URL environment variable.
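
The two ways of supplying a dump URL can be prepared from Python as follows. This is an illustrative sketch only: the helper names are hypothetical, the example URL in the usage note is a placeholder, and how the tool resolves a conflict between the option and the environment variable is not specified on this page.

```python
import os


def remote_update_command(data_url=None, db_url=None, silent=False):
    """Build an argv list for `disease-normalizer update-from-remote`."""
    argv = ["disease-normalizer", "update-from-remote"]
    if data_url:
        argv += ["--data_url", data_url]
    if db_url:
        argv += ["--db_url", db_url]
    if silent:
        argv.append("--silent")
    return argv


def env_with_remote_url(data_url, base_env=None):
    """Return a copy of the environment with the remote dump URL set."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISEASE_NORM_REMOTE_DB_URL"] = data_url
    return env
```

For instance, `subprocess.run(remote_update_command(), env=env_with_remote_url("https://example.org/dump"), check=True)` would run the command with the URL supplied through the environment rather than on the command line.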

Options

--data_url <data_url>#

URL to data dump

--db_url <db_url>#

URL endpoint for the application database. Can either be a URL to a local DynamoDB server (e.g. "http://localhost:8001") or a libpq-compliant PostgreSQL connection description (e.g. "postgresql://postgres:password@localhost:5432/disease_normalizer").

--silent#

Suppress output to console.