The disgenet2r package contains a set of functions to retrieve, visualize, and expand DisGeNET data. DisGeNET is a discovery platform that contains information about the genetic basis of human diseases (Piñero et al. 2015, 2017, 2019). DisGeNET integrates data from several expert-curated databases and from text mining of the biomedical literature.
The current version of DisGeNET (v7.0) contains 1134942 gene-disease associations (GDAs), between 21671 genes and 30170 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 369554 variant-disease associations (VDAs), between 194515 variants and 14155 diseases, traits, and phenotypes.
The information in DisGeNET is organized according to the original data source (Table 1). Diseases are identified using the UMLS concept unique identifier (CUI), but mappings to commonly employed biomedical vocabularies such as MeSH, OMIM, DO, HPO, and ICD-9 are also provided. The genes are identified using the NCBI Entrez Identifier, but annotations to the official gene symbol, the UniProt identifier, and the Panther Protein Class are also supplied. Finally, the GDAs and VDAs can be ranked using the DisGeNET score. The DisGeNET score ranges from 0 to 1 and takes into account the evidence supporting the association (see more information at http://disgenet.org/dbinfo/).
DisGeNET data is also represented in the Resource Description Framework (RDF), which provides new opportunities for data integration by making it possible to link DisGeNET data to other external RDF datasets (Queralt-Rosinach et al. 2016).
Table 1: Sources of DisGeNET data
|Source|Type of data|Description|
|---|---|---|
|CTD_human|GDAs|The Comparative Toxicogenomics Database, human data|
|CGI|GDAs|The Cancer Genome Interpreter|
|CLINGEN|GDAs|The Clinical Genome Resource|
|GENOMICS_ENGLAND|GDAs|The Genomics England PanelApp|
|ORPHANET|GDAs|The portal for rare diseases and orphan drugs|
|PSYGENET|GDAs|Psychiatric disorders Gene association NETwork|
|HPO|GDAs|Human Phenotype Ontology|
|UNIPROT|GDAs/VDAs|The Universal Protein Resource|
|CLINVAR|GDAs/VDAs|ClinVar, public archive of relationships among sequence variation and human phenotype|
|GWASCAT|GDAs/VDAs|The NHGRI-EBI GWAS Catalog|
|GWASDB|GDAs/VDAs|The GWAS Database|
|CTD_mouse|GDAs|The Comparative Toxicogenomics Database, Mus musculus data|
|MGD|GDAs|The Mouse Genome Database|
|CTD_rat|GDAs|The Comparative Toxicogenomics Database, Rattus norvegicus data|
|RGD|GDAs|The Rat Genome Database|
|BEFREE|GDAs/VDAs|Data from text mining MEDLINE abstracts using the BeFree System (Bravo et al. 2015)|
|LHGDN|GDAs|Literature-derived human gene-disease network generated by text mining NCBI GeneRIFs (Bundschus et al. 2008)|
|CURATED|GDAs/VDAs|Human curated sources: CTD, ClinGen, CGI, UniProt, Orphanet, PsyGeNET, Genomics England PanelApp|
|INFERRED|GDAs|Inferred data from: HPO, ClinVar, GWASCat, GwasDB|
|ANIMAL_MODELS|GDAs|Data from animal models: CTD_rat, RGD, CTD_mouse, MGD|
|ALL|GDAs/VDAs|All data sources|
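The aggregate sources in the table (CURATED, INFERRED, ANIMAL_MODELS) can be written down as a plain R list, which makes it easy to check which group a given source belongs to. This is only an illustrative sketch in base R: the list `source_groups` and the helper `groups_for_source` are hypothetical names, not part of disgenet2r, and "CTD" in the CURATED row is assumed here to mean CTD_human.

```r
# Aggregate source groups, copied from the table of DisGeNET sources.
# Assumption: "CTD" in the CURATED row refers to CTD_human.
source_groups <- list(
  CURATED = c("CTD_human", "CLINGEN", "CGI", "UNIPROT",
              "ORPHANET", "PSYGENET", "GENOMICS_ENGLAND"),
  INFERRED = c("HPO", "CLINVAR", "GWASCAT", "GWASDB"),
  ANIMAL_MODELS = c("CTD_rat", "RGD", "CTD_mouse", "MGD")
)

# Hypothetical helper (not a disgenet2r function): return the name of
# every aggregate group that contains the given source.
groups_for_source <- function(source, groups = source_groups) {
  hits <- vapply(groups, function(members) source %in% members, logical(1))
  names(groups)[hits]
}

groups_for_source("ORPHANET")  # member of the CURATED group
groups_for_source("BEFREE")    # text-mined source, in none of the three groups
```

Sources such as BEFREE and LHGDN return an empty character vector because the table places them only under ALL.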
For questions regarding disgenet2r, contact our support account at firstname.lastname@example.org.
The package disgenet2r is available through Bitbucket. The package requires an R version > 3.5. Additionally, the following packages are needed: VennDiagram, stringr, tidyr, SPARQL, RCurl, igraph, ggplot2, and reshape2.
Install disgenet2r by typing in R:
To load the package:
library(disgenet2r)
In the following document, we illustrate how to use the disgenet2r package through a series of examples.
The gene2disease function retrieves the GDAs in DisGeNET for a given gene, or for a list of genes. The gene(s) can be identified by either the NCBI gene identifier or the official Gene Symbol, and the type of identifier used must be specified using the parameter vocabulary. By default, vocabulary = "HGNC"; to switch to Entrez Gene identifiers, set vocabulary = "ENTREZ".
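Because the vocabulary must match the type of identifier supplied, a small pre-check can catch mismatches before a query is sent. The sketch below is a hypothetical helper in base R, not part of disgenet2r; it assumes that Entrez identifiers are purely numeric and that gene symbols are not.

```r
# Hypothetical helper (not a disgenet2r function): guess the value to pass
# as `vocabulary` from the identifiers themselves.
# Assumption: Entrez identifiers consist only of digits; gene symbols do not.
guess_vocabulary <- function(gene) {
  gene <- as.character(gene)
  is_numeric_id <- grepl("^[0-9]+$", gene)
  if (all(is_numeric_id)) return("ENTREZ")
  if (!any(is_numeric_id)) return("HGNC")
  stop("Mixed identifier types: use either Entrez identifiers or gene symbols.")
}

guess_vocabulary(3953)              # numeric identifier
guess_vocabulary(c("LEPR", "APP"))  # gene symbols
```

Rejecting mixed input, rather than guessing, keeps a list of genes in a single vocabulary, since only one vocabulary value is given per call.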
The function also requires the user to specify the source database using the argument database. By default, all the functions in the disgenet2r package use CURATED as the source database, which includes GDAs from CTD (human data), PsyGeNET, Genomics England PanelApp, ClinGen, CGI, UniProt, and Orphanet (Table 1).
The information can be filtered using the DisGeNET score. The argument score takes the range of scores to use in the search. It is entered as a vector whose first element is the lower bound of the score and whose second element is the upper bound. Both bounds are always included. By default, score = c(0, 1).
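The inclusive behaviour of the score range can be illustrated on a small mock table. This is a sketch in base R under stated assumptions: the data frame, its column names, and the helper `filter_by_score` are invented for illustration and do not reproduce how disgenet2r applies the filter internally.

```r
# Mock results table; disease names, column names, and scores are invented.
mock_gdas <- data.frame(
  disease = c("Disease A", "Disease B", "Disease C", "Disease D"),
  score = c(0.10, 0.30, 0.65, 1.00),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose score lies in [score[1], score[2]],
# with both bounds included, mirroring the description of the score argument.
filter_by_score <- function(df, score = c(0, 1)) {
  stopifnot(length(score) == 2, score[1] <= score[2])
  df[df$score >= score[1] & df$score <= score[2], , drop = FALSE]
}

# Rows with a score of exactly 0.3 or exactly 1 are kept.
filter_by_score(mock_gdas, score = c(0.3, 1))
```

With the default range c(0, 1) every row is returned, which corresponds to running a query without a score restriction.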
In the example, the query for the Leptin Receptor (Gene Symbol LEPR, and Entrez Identifier 3953) is performed in all databases in DisGeNET (database = "ALL").
data1 <- gene2disease( gene = 3953, vocabulary = "ENTREZ", database = "ALL")
The function gene2disease produces an object of class DataGeNET.DGN that contains the results of the query.
class(data1)
## [1] "DataGeNET.DGN"
## attr(,"package")
## [1] "disgenet2r"
Type the name of the object to display its attributes: the input parameters, such as whether a single entity or a list was searched (single or list), the type of entity (gene-disease), the selected database (ALL), the score range used in the search (0-1), and the gene NCBI identifier (3953).
data1
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: gene-disease
## . Database: ALL
## . Score: 0-1
## . Term: 3953
## . Results: 416
To obtain the data frame with the results of the query, apply the extract function:
results <- extract(data1)
head(results, 3)
The same query can be performed using the Gene Symbol (LEPR). Additionally, a minimum threshold for the score can be defined. In the example, a cutoff of score = c(0.3, 1) is imposed. Notice how the number of diseases associated with the Leptin Receptor drops from 416 to 79 when the score is restricted.
data1 <- gene2disease(gene = "LEPR", vocabulary = "HGNC", database = "ALL", score = c(0.3, 1))
data1
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: gene-disease
## . Database: ALL
## . Score: 0.3-1
## . Term: LEPR
## . Results: 79
The disgenet2r package offers two options to visualize the results of querying DisGeNET for a single gene: a network showing the diseases associated with the gene of interest (Gene-Disease Network), and a network showing the MeSH Disease Classes of the diseases associated with the gene (Gene-Disease Class Network). These graphics can be obtained by changing the class argument in the plot function.
By default, the plot function produces a Gene-Disease Network from a DataGeNET.DGN object (Figure 1). In the Gene-Disease Network, the blue nodes are diseases, the pink nodes are genes, and the width of the edges is proportional to the score of the association. The prop parameter adjusts the width of the edges while keeping them proportional to the score.
plot( data1, class = "Network", prop = 20)
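To make the role of prop concrete, the sketch below scales a few mock scores into edge widths. It assumes the simplest proportional rule, width = prop * score; the exact formula used inside the package's plot method is not given in this document, so `edge_width` is a hypothetical stand-in rather than the package's implementation.

```r
# Hypothetical helper: compute edge widths proportional to the score.
# Assumption: width = prop * score (the package may apply a different rule).
edge_width <- function(score, prop = 1) {
  stopifnot(all(score >= 0 & score <= 1), prop > 0)
  prop * score
}

scores <- c(0.1, 0.5, 1.0)      # mock association scores
edge_width(scores, prop = 20)   # thicker edges
edge_width(scores, prop = 5)    # thinner edges, same relative proportions
```

Under this rule, changing prop rescales every edge by the same factor, so the ratio between the widths of any two edges stays equal to the ratio of their scores.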