Installation#
The Disease Normalizer can be installed from PyPI. Users who have access to a PostgreSQL database and don’t need to regenerate the Disease Normalizer database can use the quick installation instructions. To use a DynamoDB instance, or to enable local data updates, use the full installation instructions.
Note
The Disease Normalizer defines five optional dependency groups in total:
etl
provides dependencies for regenerating data from sources. It’s necessary for users who don’t intend to rely on existing database dumps.pg
provides dependencies for connecting to a PostgreSQL database. It’s not necessary for users who are using a DynamoDB backend.dev
provides development dependencies, such as static code analysis. It’s required for contributing to the Disease Normalizer, but otherwise unnecessary.test
provides dependencies for running tests. As withdev
, it’s mostly relevant for contributors.docs
provides dependencies for documentation generation. It’s only relevant for contributors.
Note
Source data files are stored using the wags-tails library within source-specific subdirectories of a designated WagsTails data directory. By default, this location is ~/.local/share/wags_tails/
, but it can be configured by passing a Path directly to a data class on initialization, via the WAGS_TAILS_DIR
environment variable, or via XDG data environment variables.
Quick Installation#
Requirements#
A UNIX-like environment (e.g. MacOS, WSL, Ubuntu)
Python 3.8+
A recent version of PostgreSQL (ideally at least 11+)
Package installation#
Install the Disease Normalizer, and the pg
dependency group, via PyPI:
pip install "disease-normalizer[pg]"
Database setup#
Create a new PostgreSQL database. For example, using the psql createdb utility, and assuming that postgres
is a valid user:
createdb -h localhost -p 5432 -U postgres disease_normalizer
Set the environment variable DISEASE_NORM_DB_URL
to a connection description for that database. See the PostgreSQL connection string documentation for more information:
export DISEASE_NORM_DB_URL=postgres://postgres@localhost:5432/disease_normalizer
Load data#
Use the disease_norm_update_remote
shell command to load data from the most recent remotely-stored data dump:
disease_norm_update_remote
Start service#
Finally, start an instance of the Disease Normalizer API on port 5000:
uvicorn disease.main:app --port=5000
Point your browser to http://localhost:5000/disease/. You should see the SwaggerUI page demonstrating available REST endpoints.
The beginning of the response to a GET request to http://localhost:5000/disease/normalize?q=braf should look something like this:
{
"query": "nsclc",
"match_type": 60,
"normalized_id": "ncit:C2926",
"disease": {
"id": "normalize.disease.ncit:C2926",
"label": "Lung Non-Small Cell Carcinoma",
...
}
}
Full Installation#
Requirements#
A UNIX-like environment (e.g. MacOS, WSL, Ubuntu) with superuser permissions
Python 3.8+
A recent version of PostgreSQL (ideally at least 11+), if using PostgreSQL as the database backend
An available Java runtime (version 8.x or newer), or Docker Desktop, if using DynamoDB as the database backend
Package installation#
First, install the Disease Normalizer from PyPI:
pip install "disease-normalizer[etl]"
The [etl]
option installs dependencies necessary for using the disease.etl
package, which performs data loading operations.
Users intending to utilize PostgreSQL to store source data should also include the pg
dependency group:
pip install "disease-normalizer[etl,pg]"
Database setup#
The Disease Normalizer requires a separate database process for data storage and retrieval. See the instructions on database setup and population for the available database options:
By default, the Disease Normalizer will attempt to connect to a DynamoDB instance listening at http://localhost:8000
.
Load data#
For most data sources, the Disease Normalizer can acquire necessary data automatically. However, data from the Online Mendelian Inheritance in Man (OMIM), mimTitles.txt
, must be manually acquired. In order to access OMIM data, users must submit a request here. Once approved, the relevant OMIM file (mimTitles.txt
) should be renamed according to the convention omim_YYYYMMDD.tsv
, where YYYYMMDD
indicates the date that the file was generated.
To load all other source data, and then generate normalized records, use the following shell command:
disease_norm_update --update_all --update_merged
This will download the latest available versions of all source data files, extract and transform recognized disease concepts, load them into the database, and construct normalized concept groups.