Introduction

The disgenet2r package contains a set of functions to retrieve, visualize and expand DisGeNET data. DisGeNET is a discovery platform that contains information about the genetic basis of human diseases (Piñero et al. 2015, 2017, 2019). DisGeNET integrates data from several expert-curated databases and from text mining of the biomedical literature.

The current version of DisGeNET (v7.0) contains 1134942 gene-disease associations (GDAs), between 21671 genes and 30170 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 369554 variant-disease associations (VDAs), between 194515 variants and 14155 diseases, traits, and phenotypes.

The information in DisGeNET is organized according to the original data source (Table 1). Diseases are identified using the UMLS concept unique identifier (CUI), but mappings to commonly employed biomedical vocabularies such as MeSH, OMIM, DO, HPO, and ICD-9 are also provided. The genes are identified using the NCBI Entrez Identifier, but annotations to the official gene symbol, the UniProt identifier, and the Panther Protein Class are also supplied. Finally, the GDAs and VDAs can be ranked using the DisGeNET score. The DisGeNET score ranges from 0 to 1 and takes into account the evidence supporting the association (for more information, see http://disgenet.org/dbinfo/).

DisGeNET data is also represented as Resource Description Framework (RDF), which provides new opportunities for data integration by making it possible to link DisGeNET data to other external RDF datasets (Queralt-Rosinach et al. 2016).

Table 1: Sources of DisGeNET data

Source_Name       Type_of_data  Description
CTD_human         GDAs          The Comparative Toxicogenomics Database, human data
CGI               GDAs          The Cancer Genome Interpreter
CLINGEN           GDAs          The Clinical Genome Resource
GENOMICS_ENGLAND  GDAs          The Genomics England PanelApp
ORPHANET          GDAs          The portal for rare diseases and orphan drugs
PSYGENET          GDAs          Psychiatric disorders Gene association NETwork
HPO               GDAs          Human Phenotype Ontology
UNIPROT           GDAs/VDAs     The Universal Protein Resource
CLINVAR           GDAs/VDAs     ClinVar, public archive of relationships among sequence variation and human phenotype
GWASCAT           GDAs/VDAs     The NHGRI-EBI GWAS Catalog
GWASDB            GDAs/VDAs     The GWas Database
CTD_mouse         GDAs          The Comparative Toxicogenomics Database, Mus musculus data
MGD               GDAs          The Mouse Genome Database
CTD_rat           GDAs          The Comparative Toxicogenomics Database, Rattus norvegicus data
RGD               GDAs          The Rat Genome Database
BEFREE            GDAs/VDAs     Data from text mining MEDLINE abstracts using the BeFree System (Bravo et al. 2015)
LHGDN             GDAs          Literature-derived human gene-disease network generated by text mining NCBI GeneRIFs (Bundschus et al. 2008)
CURATED           GDAs/VDAs     Human curated sources: CTD, ClinGen, CGI, UniProt, Orphanet, PsyGeNET, Genomics England PanelApp
INFERRED          GDAs          Inferred data from: HPO, ClinVar, GWASCat, GwasDB
ANIMAL_MODELS     GDAs          Data from animal models: CTD_rat, RGD, CTD_mouse, MGD
ALL               GDAs/VDAs     All data sources

Contact

For questions regarding disgenet2r, contact our support account at .

Installation and first run

The package disgenet2r is available through Bitbucket. The package requires an R version > 3.5. Additionally, the following packages are needed: VennDiagram, stringr, tidyr, SPARQL, RCurl, igraph, ggplot2, and reshape2.
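Before installing, it can help to confirm that the R version and the listed dependencies are in place. The following is a minimal sketch in base R; the variable names are illustrative, and the package list is copied from the paragraph above.

```r
# Packages that the text lists as requirements for disgenet2r
required <- c("VennDiagram", "stringr", "tidyr", "SPARQL",
              "RCurl", "igraph", "ggplot2", "reshape2")

# TRUE when the running R version is greater than 3.5
r_ok <- getRversion() > "3.5"

# Required packages that are not installed in any library on this machine
missing_pkgs <- setdiff(required, rownames(installed.packages()))

if (!r_ok) {
  message("R ", getRversion(), " is too old; a version > 3.5 is needed")
}
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```

Any package reported as missing can be installed with install.packages() before moving on.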

Install disgenet2r by typing in R:

library(devtools)
install_bitbucket("ibi_group/disgenet2r")

To load the package:

library(disgenet2r)

In the following document, we illustrate how to use the disgenet2r package through a series of examples.

Retrieving Gene-Disease Associations from DisGeNET

Searching by gene

The gene2disease function retrieves the GDAs in DisGeNET for a given gene, or for a list of genes. The gene(s) can be identified by either the NCBI gene identifier or the official Gene Symbol, and the type of identifier used must be specified using the parameter vocabulary. By default, vocabulary = "HGNC". To switch to Entrez Gene identifiers, set vocabulary = "ENTREZ".
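Because the identifier type has to match the vocabulary argument, a small helper can choose the value from the input. The function guess_vocabulary below is hypothetical and not part of disgenet2r; it assumes that Entrez identifiers consist only of digits and that anything else is a gene symbol.

```r
# Hypothetical helper (not part of disgenet2r): choose the vocabulary value
# from the identifiers themselves.
guess_vocabulary <- function(genes) {
  # Avoid scientific notation such as "1e+05" for large numeric identifiers
  ids <- if (is.numeric(genes)) {
    format(genes, scientific = FALSE, trim = TRUE)
  } else {
    as.character(genes)
  }
  is_numeric_id <- grepl("^[0-9]+$", ids)
  if (all(is_numeric_id)) {
    "ENTREZ"
  } else if (!any(is_numeric_id)) {
    "HGNC"
  } else {
    stop("Mixed identifier types: use either Entrez IDs or gene symbols")
  }
}

guess_vocabulary(3953)    # "ENTREZ"
guess_vocabulary("LEPR")  # "HGNC"
```

The returned string can then be passed on, for example as vocabulary = guess_vocabulary(gene).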

The function also requires the user to specify the source database using the argument database. By default, all the functions in the disgenet2r package use CURATED as the source database, which, according to Table 1, includes GDAs from CTD (human data), PsyGeNET, Genomics England PanelApp, ClinGen, CGI, UniProt, and Orphanet.

The information can be filtered using the DisGeNET score. The argument score takes the range of scores to search, entered as a vector of two values: the first is the lower bound and the second is the upper bound. Both bounds are included in the range. By default, score=c(0,1).
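The inclusive range can be illustrated in base R on a toy table. This is only a sketch of the filtering semantics described above, not the package's implementation; the function filter_by_score, the column names, and all values are invented for the example.

```r
# Toy table standing in for query results; names and scores are invented
toy <- data.frame(
  disease = c("disease_A", "disease_B", "disease_C", "disease_D"),
  score   = c(0.10, 0.30, 0.65, 1.00),
  stringsAsFactors = FALSE
)

# Keep rows whose score lies in [lower, upper], with both bounds included
filter_by_score <- function(df, score = c(0, 1)) {
  stopifnot(length(score) == 2, score[1] <= score[2])
  df[df$score >= score[1] & df$score <= score[2], , drop = FALSE]
}

kept <- filter_by_score(toy, score = c(0.3, 1))
kept$disease
# "disease_B" "disease_C" "disease_D"
```

Note that disease_B (score 0.30) and disease_D (score 1.00) are kept because both bounds belong to the range.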

In the example, the query for the Leptin Receptor (Gene Symbol LEPR, and Entrez Identifier 3953) is performed in all databases in DisGeNET (database = "ALL").

data1 <- gene2disease( gene = 3953, vocabulary = "ENTREZ",
                       database = "ALL")

The function gene2disease returns an object of class DataGeNET.DGN that contains the results of the query.

class(data1)
## [1] "DataGeNET.DGN"
## attr(,"package")
## [1] "disgenet2r"

Type the name of the object to display its attributes: whether a single entity or a list was searched (single or list), the type of entity (gene-disease), the selected database (ALL), the score range used in the search (0-1), the NCBI gene identifier (3953), and the number of results.

data1
## Object of class 'DataGeNET.DGN'
##  . Search:      single 
##  . Type:        gene-disease 
##  . Database:     ALL 
##  . Score:        0-1 
##  . Term:        3953 
##  . Results:  416

To obtain the data frame with the results of the query, apply the extract function:

results <- extract(data1)
head( results, 3 )

The same query can be performed using the Gene Symbol (LEPR). Additionally, a minimum threshold for the score can be defined. In the example, a cutoff of score=c(0.3,1) is imposed. Notice how the number of diseases associated with the Leptin Receptor drops from 416 to 79 when the score is restricted.
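The call behind the printed object that follows is not shown in this document. A plausible reconstruction is sketched here, with the term (LEPR), database (ALL) and score range (0.3-1) read off that printed summary. It assumes that disgenet2r is installed and that the DisGeNET service is reachable, so the query is guarded and does nothing otherwise.

```r
# Score range taken from the printed summary (Score: 0.3-1)
score_range <- c(0.3, 1)

if (requireNamespace("disgenet2r", quietly = TRUE)) {
  library(disgenet2r)
  data1 <- gene2disease(
    gene       = "LEPR",     # official Gene Symbol
    vocabulary = "HGNC",     # identifier type matching the symbol
    database   = "ALL",
    score      = score_range
  )
  data1  # prints the summary of the DataGeNET.DGN object
}
```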

## Object of class 'DataGeNET.DGN'
##  . Search:      single 
##  . Type:        gene-disease 
##  . Database:     ALL 
##  . Score:        0.3-1 
##  . Term:        LEPR 
##  . Results:  79

Visualizing the diseases associated to a single gene

The disgenet2r package offers two options to visualize the results of querying DisGeNET for a single gene: a network showing the diseases associated to the gene of interest (Gene-Disease Network), and a network showing the MeSH Disease Classes of the diseases associated to the gene (Gene-Disease Class Network). These graphics can be obtained by changing the class argument in the plot function.

By default, the plot function produces a Gene-Disease Network on a DataGeNET.DGN object (Figure 1). In the Gene-Disease Network the blue nodes are diseases, the pink nodes are genes, and the width of the edges is proportional to the score of the association. The prop parameter adjusts the width of the edges while keeping it proportional to the score.

plot( data1,
      class = "Network",
      prop = 20)
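To make the role of prop concrete, the sketch below computes edge widths for a few invented scores under the assumption that the width is simply prop times score. The package's internal formula is not given in this document and may differ; the function edge_width and the scores are made up for illustration.

```r
# Assumption: width = prop * score (illustrative only, not the package's code)
edge_width <- function(score, prop = 1) {
  stopifnot(all(score >= 0 & score <= 1), prop > 0)
  prop * score
}

scores <- c(0.1, 0.5, 1)            # invented DisGeNET-style scores

w20 <- edge_width(scores, prop = 20)
w5  <- edge_width(scores, prop = 5)

# Relative widths are the same for any prop: only the overall scale changes
all.equal(w20 / max(w20), w5 / max(w5))
```

Under this assumption, a larger prop makes every edge thicker without changing which associations look stronger.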