Introduction
The disgenet2r package contains a set of functions to retrieve, visualize and expand DisGeNET data. DisGeNET is a discovery platform that contains information about the genetic basis of human diseases (J. Piñero et al. 2015, 2017; Janet Piñero et al. 2019). DisGeNET integrates data from several expert curated databases and from text-mining the biomedical literature.
The current version of DisGeNET (v7.0) contains 1134942 gene-disease associations (GDAs), between 21671 genes and 30170 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 369554 variant-disease associations (VDAs), between 194515 variants and 14155 diseases, traits, and phenotypes.
The information in DisGeNET is organized according to the original data source (Table 1). Diseases are identified using the UMLS concept unique identifier (CUI), but mappings to commonly employed biomedical vocabularies such as MeSH, OMIM, DO, HPO, and ICD-9 are also provided. The genes are identified using the NCBI Entrez Identifier, but annotations to the official gene symbol, the UniProt identifier, and the Panther Protein Class are also supplied. Finally, the GDAs and VDAs can be ranked using the DisGeNET score. The DisGeNET score ranges from 0 to 1, and takes into account the evidence supporting the association (See more information at http://disgenet.org/dbinfo/)
DisGeNET data is also represented as a Resource Description Framework (RDF), which provides new opportunities for data integration, making possible to link DisGeNET data to other external RDF datasets (Queralt-Rosinach et al. 2016).
Table: Sources of DisGeNET data
Source_Name | Type_of_data | Description |
---|---|---|
CTD_human | GDAs | The Comparative Toxicogenomics Database, human data |
CGI | GDAs | The Cancer Genome Interpreter |
CLINGEN | GDAs | The Clinical Genome Resource |
GENOMICS_ENGLAND | GDAs | The Genomics England PanelApp |
ORPHANET | GDAs | The portal for rare diseases and orphan drugs |
PSYGENET | GDAs | Psychiatric disorders Gene association NETwork |
HPO | GDAs | Human Phenotype Ontology |
UNIPROT | GDAs/VDAs | The Universal Protein Resource |
CLINVAR | GDAs/VDAs | ClinVar, public archive of relationships among sequence variation and human phenotype |
GWASCAT | GDAs/VDAs | The NHGRI-EBI GWAS Catalog |
GWASDB | GDAs/VDAs | The GWas Database |
CTD_mouse | GDAs | The Comparative Toxicogenomics Database, Mus musculus data |
MGD | GDAs | The Mouse Genome Database |
CTD_rat | GDAs | The Comparative Toxicogenomics Database, Rattus norvergicus data |
RGD | GDAs | The Rat Genome Database |
BEFREE | GDAs/VDAs | Data from text mining medline abstracts using the BeFree System (Bravo et al. 2015) |
LHGDN | GDAs | Literature-derived human gene-disease network generated by text mining NCBI GeneRIFs (Bundschus et al. 2008) |
CURATED | GDAs/VDAs | Human curated sources: CTD, ClinGen, CGI, UniProt, Orphanet, PsyGeNET, Genomics England PanelApp |
INFERRED | GDAs | Inferred data from: HPO,ClinVar, GWASCat, GwasDB |
ANIMAL_MODELS | GDAs | Data from animal models: CTD_rat, RGD, CTD_mouse, MGD |
ALL | GDAs/VDAs | All data sources |
Contact
For questions regarding disgenet2r, contact our support account at support@disgenet.org.
Installation and first run
The package disgenet2r is available through Bitbucket. The package requires an R version > 3.5. Additionally, the following packages are needed: VennDiagram, stringr, tidyr, SPARQL, RCurl, igraph, ggplot2, and reshape2.
Install disgenet2r by typing in R:
library(devtools)
install_bitbucket("ibi_group/disgenet2r")
To load the package:
library(disgenet2r)
To be able to use the disgenet2r package, you need to register for a free account here. Once you have completed the registration process, use the get_disgenet_api_key to retrieve your API key.
<- get_disgenet_api_key(
disgenet_api_key email = "user@email.com",
password = "myspwd" )
After retrieving the API key, run the line below so the key is available for all the disgenet2r functions.
Sys.setenv(DISGENET_API_KEY= disgenet_api_key)
In the following document, we illustrate how to use the disgenet2r package through a series of examples.
Retrieving Gene-Disease Associations from DisGeNET
Searching by gene
The gene2disease function retrieves the GDAs in DisGeNET for a given gene, or a for a list of genes. The gene(s) can be identified by either the NCBI gene identifier, or the official Gene Symbol, and the type of identifier used must be specified using the parameter vocabulary
. By default, vocabulary = "HGNC"
, to switch to Entrez NCBI Gene identifiers, set vocabulary to ENTREZ.
The function also requires the user to specify the source database using the argument database
. By default, all the functions in the disgenet2r package use as source database CURATED, which includes GDAs from CTD (human data), PsyGeNET, the HPO, Genomics England PanelApp, ClinGen, CGI, UniProt, and Orphanet.
The information can be filtered using the DisGeNET score. The argument score
is filled with a range of score to perform the search. The score is entered as a vector which first position is the initial value of score, and the second argument is the final value of score. Both values will always be included. By default, score=c(0,1)
.
In the example, the query for the Leptin Receptor (Gene Symbol LEPR
, and Entrez NCBI Identifier 3953
) is performed in all databases in DisGeNET (database = "ALL"
).
<- gene2disease( gene = 3953, vocabulary = "ENTREZ",
data1 database = "CURATED")
The function gene2disease produces an object DataGeNET.DGN
that contains the results of the query.
class(data1)
## [1] "DataGeNET.DGN"
## attr(,"package")
## [1] "disgenet2r"
Type the name of the object to display its attributes: the input parameters such as whether a single entity, or a list were searched (single
or list
), the type of entity (gene-disease
), the selected database (ALL
), the score range used in the search (0-1
), and the gene NCBI identifier (3953
).
data1
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: gene-disease
## . Database: CURATED
## . Score: 0-1
## . Term: 3953
## . Results: 89
To obtain the data frame with the results of the query, apply the extract function:
<- extract(data1)
results head( results, 3 )
## year_initial protein_class el disease_type
## 1 1998 DTO:05007599 strong disease
## 2 1966 DTO:05007599 disease
## 3 1966 DTO:05007599 disease
## disease_class_name protein_class_name gene_dpi year_final
## 1 Signaling 0.434 2019
## 2 Pathological Conditions, Signs and Symptoms; Nutritional and Metabolic Diseases Signaling 0.434 2021
## 3 Nutritional and Metabolic Diseases; Endocrine System Diseases Signaling 0.434 2020
## score disease_class disease_semantic_type ei gene_symbol source gene_dsi disease_name geneid
## 1 0.82 Disease or Syndrome 1.000 LEPR CURATED 0.99475 LEPTIN RECEPTOR DEFICIENCY 3953
## 2 0.80 C23;C18 Disease or Syndrome 0.937 LEPR CURATED 0.99475 Obesity 3953
## 3 0.60 C18;C19 Disease or Syndrome 0.975 LEPR CURATED 0.99475 Diabetes Mellitus, Non-Insulin-Dependent 3953
## gene_pli uniprotid diseaseid
## 1 0.84 P48357 C3554225
## 2 0.84 P48357 C0028754
## 3 0.84 P48357 C0011860
The same query can be performed using the Gene Symbol (LEPR
). Additionally, a minimum threshold for the score can be defined. In the example, a cutoff of score=c(0.2,1)
is used. Notice how the number of diseases associated to the Leptin Receptor drops from 264 to 68 when the score is restricted.
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: gene-disease
## . Database: ALL
## . Score: 0.3-1
## . Term: LEPR
## . Results: 89
Visualizing the diseases associated to a single gene
The disgenet2r package offers two options to visualize the results of querying DisGeNET for a single gene: a network showing the diseases associated to the gene of interest (Gene-Disease Network
), and a network showing the MeSH Disease Classes of the diseases associated to the gene (Gene-Disease Class Network
). These graphics can be obtained by changing the class
argument in the plot function.
By default, the plot function produces a Gene-Disease Network
on a DataGeNET.DGN
object (Figure 1). In the Gene-Disease Network
the blue nodes are diseases, the pink nodes are genes, and the width of the edges is proportional to the score of the association. The prop
parameter allows to adjust the width of the edges while keeping the proportionality to the score.
plot( data1,
class = "Network",
prop = 20)
The results can also be visualized in a network in which diseases are grouped by the MeSH Disease Class if the class
argument is set to “DiseaseClass” (Gene-Disease Class Network
, Figure 2). In the Gene-Disease Class Network
, the node size of is proportional to the fraction of diseases in the disease class, with respect to the total number of diseases with disease classes associated to the gene. In the example, the Lepin Receptor is associated mainly to Nutritional and Metabolic Diseases. There is 1 disease in the example that does not have annotations to MeSH disease class (Shown as a warning).
plot( data1,
class = "DiseaseClass",
prop = 3)
## [1] "warning: 1 disease(s) not shown in the plot"
Exploring the evidences associated to a gene
You can extract the evidences associated to a particular gene using the function gene2evidence. Additionally, you can explore the evidences for a specific gene-disease pair by specifying the disease identifier using the argument disease
.
<- gene2evidence( gene = "LEPR",
data1 vocabulary = "HGNC",
disease ="C3554225",
database = "ALL",
score =c(0.3,1))
data1
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: evidence
## . Database: ALL
## . Score: 0.3-1
## . Term: LEPR
<- extract(data1) results
pmid | year | sentence |
---|---|---|
16284652 | 2005 | Complete rescue of obesity, diabetes, and infertility in db/db mice by neuron-specific LEPR-B transgenes. |
17229951 | 2007 | Clinical and molecular genetic spectrum of congenital deficiency of the leptin receptor. |
23275530 | 2013 | Homozygous leptin receptor mutation due to uniparental disomy of chromosome 1: response to bariatric surgery. |
25751111 | 2015 | Seven novel deleterious LEPR mutations found in early-onset obesity: a deltaExon6-8 shared by subjects from Reunion Island, France, suggests a founder effect. |
29545012 | 2018 | Potential role of gender specific effect of leptin receptor deficiency in an extended consanguineous family with severe early-onset obesity. |
30847515 | 2019 | Whole genome sequencing reveals that genetic conditions are frequent in intensively ill children. |
9537324 | 1998 | A mutation in the human leptin receptor gene causes obesity and pituitary dysfunction. |
9537324 | 1998 | A mutation in the human leptin receptor gene causes obesity and pituitary dysfunction. |
9537324 | 1998 | A mutation in the human leptin receptor gene causes obesity and pituitary dysfunction. |
29568105 | 2018 | A cohort of n = 21 children living in Germany or Austria with monogenic obesity due to congenital leptin deficiency (group LEP, n = 6), leptin receptor deficiency (group LEPR, n = 6) and primarily heterozygous MC4 receptor deficiency (group MC4R, n = 9) was analyzed. |
31702041 | 2019 | In conclusion, activation of the leptin receptor ob‑R is an important pathogenic mechanism of UC, and leptin receptor deficiency may provide resistance against TNBS‑induced colitis by inhibiting the NF‑κB and RhoA signaling pathways. |
Searching multiple genes
The gene2disease function can also receive a list of genes as input, either as NCBI Gene Identifiers or Gene Symbols. In the example, we show how to create a vector with the Gene Symbols of several genes belonging to the family of voltage-gated potassium channels (Table 2) and then, we apply the function gene2disease.
Table 2: Example of voltage-gated potassium channel family members
Name | Description |
---|---|
KCNE1 | potassium channel, voltage gated subfamily E regulatory beta subunit 1 |
KCNE2 | potassium channel, voltage gated subfamily E regulatory beta subunit 2 |
KCNH1 | potassium channel, voltage gated eag related subfamily H, member 1 |
KCNH2 | potassium channel, voltage gated eag related subfamily H, member 2 |
KCNG1 | potassium voltage-gated channel modifier subfamily G member 1 |
Creating the vector with the list of genes belonging to the voltage-gated potassium channel family.
<- c( "KCNE1", "KCNE2", "KCNH1", "KCNH2", "KCNG1") myListOfGenes
The gene2disease function also requires the user to specify the source database using the argument database
, and optionally, the DisGeNET score
can also be applied to filter the results.
<- gene2disease(
data2 gene = myListOfGenes,
score =c(0.2, 1),
verbose = TRUE
)
## Warning in gene2disease(gene = myListOfGenes, score = c(0.2, 1), verbose = TRUE):
## One or more of the genes in the list is not in DisGeNET ( 'CURATED' ):
## - KCNG1
data2
## Object of class 'DataGeNET.DGN'
## . Search: list
## . Type: gene-disease
## . Database: CURATED
## . Score: 0.2-1
## . Term: KCNE1 ... KCNH2
## . Results: 71
Visualizing the diseases associated to multiple genes
By default, plotting a DataGeNET.DGN
resulting of the query with a list of genes produces a Gene-Disease Network
where the blue nodes are diseases, the pink nodes are genes, and the width of the edges is proportional to the score of the association (Figure 3).
plot( data2,
class = "Network",
prop = 10)
Setting the argument class
to “Heatmap” produces a Gene-Disease Heatmap
(Figure 4), where the scale of colors is proportional to the score of the GDA. The argument limit
can be used to limit the number of rows to the top scoring GDAs. The argument nchars
can be used to limit the length of the name of the disease. By default, the plot shows the 50 highest scoring GDAs.
plot( data2,
class ="Heatmap",
limit = 100, nchars = 50 )
These results can also be visualized as a Gene-Disease Class Heatmap
by setting the argument class
to “DiseaseClass” (Figure 5). In this case, diseases are grouped by the their MeSH disease classes, and the color scale is proportional to the percentage of diseases in each MeSH disease class. In the example, genes are associated mainly to Cardiovascular Diseases, and to Congenital, Hereditary, and Neonatal Diseases and Abnormalities.
plot( data2,
class="DiseaseClass", nchars=60)
## [1] "warning: 13 disease(s) not shown in the plot"
Searching by disease
The disease2gene function allows to retrieve the genes associated to a disease, or a list of diseases. The function uses as input the disease, or list of diseases of interest (as UMLS CUI, MeSH, OMIM, Disease Ontology, ICD9CM, NCIt, EFO, or Orphanet Identifiers), the disease vocabulary employed (OMIM
(OMIM), MESH
(MeSH),ICD9CM
(ICD9-CM), DO
(Disease Ontology), NCI
(NCI thesaurus), ORDO
(Orphanet), or EFO
(EFO) and the database (by default, CURATED
). A threshold value for the score can be set, like in the gene2disease function.
In the example, we will use the disease2gene function to retrieve the genes associated to the UMLS CUI C0036341. This function also receives as input the database, in the example, CURATED, and a score range, in the example, from 0.4 to 1.
<- disease2gene( disease = "C0036341",
data3 database = "CURATED",
score = c( 0.4,1 ) )
data3
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: disease-gene
## . Database: CURATED
## . Score: 0.4-1
## . Term: C0036341
## . Results: 227
The same results are obtained when querying DisGeNET with the MeSH identifier for Schizophrenia (D012559).
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: disease-gene
## . Database: CURATED
## . Score: 0.4-1
## . Term: D012559
## . Results: 227
The same results are obtained when querying DisGeNET with the OMIM identifier for Schizophrenia (181500).
<- disease2gene( disease = "181500", vocabulary = "OMIM",
data4 database = "CURATED",
score = c(0.4,1 ) )
data4
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: disease-gene
## . Database: CURATED
## . Score: 0.4-1
## . Term: 181500
## . Results: 227
The same results are obtained when querying DisGeNET with the ICD9-CM identifier for Schizophrenia (295).
<- disease2gene( disease = "295", vocabulary = "ICD9CM",
data4 database = "CURATED",
score = c(0.4,1 ) )
data4
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: disease-gene
## . Database: CURATED
## . Score: 0.4-1
## . Term: 295
## . Results: 227
The same results are obtained when querying DisGeNET with the NCI identifier for Schizophrenia (C3362).
<- disease2gene( disease = "C3362", vocabulary = "NCI",
data4 database = "CURATED",
score = c(0.4,1 ) )
data4
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: disease-gene
## . Database: CURATED
## . Score: 0.4-1
## . Term: C3362
## . Results: 227
The same results are obtained when querying DisGeNET with the DO identifier for Schizophrenia (5419).
<- disease2gene( disease = "HP:0100753", vocabulary = "HPO",
data4 database = "CURATED",
score = c(0.4,1 ) )
data4
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: disease-gene
## . Database: CURATED
## . Score: 0.4-1
## . Term: HP:0100753
## . Results: 227
Visualizing the genes associated to a single disease
There are two options to visualize the results from searching a single disease: a Gene-Disease Network
showing the genes related to the disease of interest (Figure 6), and a Disease-Protein Class Network
with the genes grouped by Panther Protein Class (Figure 7).
Figure 6 shows the default Gene-Disease Network
for Schizophrenia. As in the case of the gene2disease function, the blue nodes is the disease, the pink nodes are genes, and the width of the edges is proportional to the score of the association.
plot ( data3,
prop = 10)
Alternatively, in the Disease-Protein Class Network
, genes are grouped by the Panther Protein Class (Figure 7). This is a better choice when there is a large number of genes associated to the disease. This plot uses as class
argument “ProteinClass.” The resulting network will show in blue the disease, and in green the Protein Classes of the genes associated to the disease. The node size is proportional to the number of genes in the Panther Protein Class. In the example, the largest proportion of the genes associated to Schizophrenia are receptors. Notice again that not all genes have annotations to Panther Protein classes (69 genes in Figure 7)
plot( data3,
class="ProteinClass"
)
## [1] "warning: 86 gene(s) not shown in the plot"
Exploring the evidences associated to a disease
To explore the evidences supporting the associations for Schizophrenia using the function disease2evidence.
<- disease2evidence( disease = "C0036341",
data3 type = "GDA",
database = "CURATED",
score = c( 0.4,1 ) )
data3
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: evidence
## . Database: CURATED
## . Score: 0.4-1
## . Term: C0036341
Additionally, you can explore the evidences for a specific gene-disease pair by specifying the gene identifier using the argument gene
.
<- disease2evidence( disease = "C0036341",
data3 gene = c("DRD2", "DRD3"),
type = "GDA",
database = "CURATED",
score = c( 0.4,1 ) )
data3
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: evidence
## . Database: CURATED
## . Score: 0.4-1
## . Term: C0036341
<- extract(data3) results
The results are shown in the table below.
pmid | year | sentence |
---|---|---|
10831489 | 2000 | Subjective experience and striatal dopamine D(2) receptor occupancy in patients with schizophrenia stabilized by olanzapine or risperidone. |
11245917 | 2001 | Because dopamine D2 receptors are the primary targets for antipsychotic drugs, including clozapine and quetiapine, and because some studies have found D2 receptors to be elevated in schizophrenia, we examined the mRNA of three forms of the D2 receptor, particularly the new form of the dopamine D2 receptor, D2(Longer), in post-mortem brains from patients who died with schizophrenia. |
18583979 | 2008 | Systematic meta-analyses and field synopsis of genetic association studies in schizophrenia: the SzGene database. |
21187413 | 2011 | DRD2/AKT1 interaction on D2 c-AMP independent signaling, attentional processing, and response to olanzapine treatment in schizophrenia. |
10523822 | 1999 | Homozygosity at the dopamine DRD3 receptor gene in cocaine dependence. |
Searching multiple diseases
The disease2gene function also accepts as input a list of diseases (as UMLS CUI, MeSH, OMIM, Disease Ontology, Orphanet, Dechipher, or ICD9CM Identifiers), the database (by default, CURATED), and optionally, a value range for the score. In the example, we have selected a list of 10 diseases. Table 3 shows the UMLS CUIs and the corresponding disease names.
Table 3: Disease list selected for illustrating the disease2gene multiple search
UMLS_CUI | Disease_Name |
---|---|
C0036341 | Schizophrenia |
C0036341 | Alzheimer’s Disease |
C0030567 | Parkinson Disease |
C0005586 | Bipolar Disorder |
Creating the vector with the list of diseases.
<- c("C0036341", "C0002395", "C0030567","C0005586") diseasesOfInterest
In the example, we will search in CURATED data, using a score range of 0.4-1.
<- disease2gene(
data5 disease = diseasesOfInterest,
database = "CURATED",
score =c(0.4,1),
verbose = TRUE )
Visualizing the genes associated to multiple diseases
The default plot of the results of querying DisGeNET with a list of diseases produces a Gene-Disease Network
where the blue nodes are diseases, the pink nodes are genes, and the width of the edges is proportional to the score of the association (Figure 8).
plot( data5,
class = "Network",
prop = 10)
To visualize the results as a Gene-Disease Heatmap
(Figure 9) change the argument class
to “Heatmap.” In this plot, the scale of colors is proportional to the score of the GDA. The argument limit
can be used to limit the number of rows to the top scoring GDAs when the results are large. By default, the plot shows the 50 highest scoring GDAs.
plot( data5,
class="Heatmap",
limit =30,
cutoff=0.2)
## [1] "Dataframe of 440 rows has been reduced to 30 rows."
A third visualization option is a Protein Class-Disease Heatmap
(Figure 10), in which genes are grouped by protein class. This plot is obtained by setting the class
argument to “ProteinClass.” In this case, the color of the heatmap is proportional to the percentage of genes for each disease in each protein class. This heatmap displays the protein classes associated to each disease.
plot( data5,
class="ProteinClass" )
## [1] "warning: 143 gene(s) not shown in the plot"
Retrieving Variant-Disease Associations from DisGeNET
Searching by variant
The variant2disease function receives a variant, or a list of variants as input, identified by the dbSNP identifier. It produces an object DataGeNET.DGN
, with Type = "variant-disease"
.
<- variant2disease( variant= "rs121913279",
data6 database = "CURATED")
data6
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: variant-disease
## . Database: CURATED
## . Score: 0-1
## . Term: rs121913279
## . Results: 44
Visualizing the diseases associated to a single variant
The disgenet2r package offers several options to visualize the results of querying DisGeNET for a single variant: a Variant-Disease Network
showing the diseases associated to the variant of interest, a Variant-Gene-Disease Network
showing the genes, diseases, and variant of interest, and a network showing the MeSH Disease Classes of the diseases associated to the variant (Variant-Disease Class Network
). These graphics can be obtained by changing the class
argument in the plot function.
By default, the plot function produces a Variant-Disease Network
on a DataGeNET.DGN
object (Figure 11). In the Variant-Disease Network
the blue nodes are diseases, the yellow nodes are variants, the blue nodes are diseases, and the width of the edges is proportional to the score of the association.
plot( data6,
class = "Network",
prop = 10)
The results can also be visualized in a network in which diseases are grouped by the MeSH Disease Class if the class
argument is set to “DiseaseClass” (Figure 12). In the Variant-Disease Class Network
, the node size of is proportional to the fraction of diseases in the disease class, with respect to the total number of diseases with disease class associated to the gene. In the example, the variant is associated mainly to Neoplasms.
plot( data6,
class = "DiseaseClass",
prop = 3)
## [1] "warning: 2 disease(s) not shown in the plot"
## [1] "warning: 2 disease(s) not shown in the plot"
Exploring the evidences associated to a variant
You can extract the evidences associated to a particular variant using the function variant2evidence. Additionally, you can explore the evidences for a specific variant-disease pair by specifying the argument disease
.
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: evidence
## . Database: ALL
## . Score: 0-1
## . Term: rs121913279
The results are shown in the table below.
pmid | year | sentence |
---|---|---|
23826249 | 2013 | Collectively, our data suggest that PF-04691502 exhibits potent anticancer activity in colorectal cancer by targeting both PIK3CA (H1047R) mutant CSCs and their derivatives. |
27405731 | 2016 | We developed a sensitive, simple and rapid approach to detect the low-abundance PIK3CA (H1047R) mutation in real CRC specimens, providing an effective tool for guiding cancer targeted therapy. |
Searching multiple variants
The variant2disease function retrieves the information in DisGeNET for a list of variants identified by the dbSNP identifier. The function also requires the user to specify the source database using the argument database
. By default, variant2disease function uses as source database CURATED.
<- variant2disease(
data7 variant = c("rs121913013", "rs1060500621",
"rs199472709", "rs72552293",
"rs74315445", "rs199472795"),
database = "ALL")
Visualizing the diseases associated to multiple variants
The results of querying DisGeNET with a list of variants can be visualized as a Variant-Disease Network
(Figure 13), as a Variant-Gene-Disease Network
(Figure 15) and as heatmap (Variant-Disease Class Heatmap
, Figure 16). Below, the Variant-Disease Network
.
plot( data7,
class = "Network",
prop = 10)
To obtain the Variant-Gene-Disease Network
, Figure 14), change the showGenes
argument to “TRUE.”
plot( data7,
class = "Network",
showGenes = T)
The results of querying DisGeNET variant information with a list of diseases can be visualized as a Variant-Disease Heatmap
by changing the class
argument to Heatmap
(Figure 15).
plot( data7,
class = "Heatmap",
prop = 10)
The results of querying DisGeNET variant information with a list of diseases can also be visualized as a Variant-Disease Class Network
.
plot( data7,
class = "DiseaseClass")
Searching by disease
The disease2variant function allows to retrieve the variants associated to a disease, or a list of diseases. The function uses as input the disease, or list of diseases of interest (as UMLS CUI, MeSH, OMIM, Disease Ontology, ICD9CM, NCIt, EFO, or Orphanet Identifiers), the disease vocabulary employed (OMIM
(OMIM), MESH
(MeSH),ICD9CM
(ICD9-CM), DO
(Disease Ontology), NCI
(NCI thesaurus), ORDO
(Orphanet), or EFO
(EFO) and the database (by default, CURATED
). A threshold value for the score can be set, like in the gene2disease function.
<- disease2variant(disease = c("C0752166"),
data8 database = "CLINVAR",
score = c(0, 1) )
data8
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: disease-variant
## . Database: CLINVAR
## . Score: 0-1
## . Term: C0752166
## . Results: 219
Visualizing the variants associated to a single disease
The results of querying DisGeNET variant information with a list of diseases can be visualized as a Variant-Disease Network
(Figure 17).
plot( data8,
class = "Network")
Explore the evidences associated to a single disease
To explore the evidences supporting the VDAs for Schizophrenia, run the disease2evidence function. You can use the argument variant to inspect the evidences for a particular variant and Schizophrenia. You can also specify a year range for the publications
<- disease2evidence( disease = "C0036341",
data8 variant = "rs1344706",
type = "VDA",
year = c(2019, 2020),
database = "ALL",
score = c( 0.1,1 ) )
data8
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: evidence
## . Database: ALL
## . Score: 0.1-1
## . Term: C0036341
pmid | year | sentence | |
---|---|---|---|
5 | 32544854 | 2020 | ZNF804A has now been recognized as a schizophrenia risk gene by multiple genome-wide association studies with its intronic polymorphism rs1344706 being reported as the first genome-wide significant risk variant for schizophrenia. |
1 | 30079586 | 2019 | CACNA1C-rs1006737 and ZNF804A-rs1344706 polymorphisms are among the most robustly associated with schizophrenia (SCZ) and bipolar disorder (BD), and recently with brain phenotypes. |
2 | 30852803 | 2019 | ZNF804A rs1344706 has been identified as one of the risk genes for schizophrenia. |
3 | 30962687 | 2019 | rs1344706 risk allele carriers of schizophrenia had increased gray matter in the brain regions including frontal lobe, temporal lobe, and other brain regions, but the carriers of healthy individuals had decreased gray matter and white matter integrity in the frontal lobe, central network, and other brain regions. |
4 | 31122238 | 2019 | Previous studies indicated that single-nucleotide polymorphism (SNP) rs1344706 in ZNF804A was strongly associated with schizophrenia and might influence social interaction. |
If you want to inspect the evidences for Schizophrenia, and all the variants in a particular gene, use the argument gene
.
<- disease2evidence( disease = "C0036341",
data8 gene = "CACNA1C",
type = "VDA",
database = "BEFREE",
score = c( 0.1,1 ) )
data8
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: evidence
## . Database: BEFREE
## . Score: 0.1-1
## . Term: C0036341
<- extract(data8) results
pmid | year | sentence | |
---|---|---|---|
24 | 30607529 | 2020 | Both CACNA1C (rs1006737) and KCNH2 (rs3800779) were genotyped in 101 controls and 50 schizophrenia patients. |
26 | 32770953 | 2020 | However, the influence model for rs1006737 on schizophrenia in Asians and Europeans demonstrated both similarities and differences between the two ancestors. |
29 | 32770953 | 2020 | Rs1006737, rs2007044, and rs4765905 of the CACNA1C gene were associated with susceptibility to schizophrenia. |
33 | 32770953 | 2020 | Rs1006737, rs2007044, and rs4765905 of the CACNA1C gene were associated with susceptibility to schizophrenia. |
23 | 30079586 | 2019 | CACNA1C-rs1006737 and ZNF804A-rs1344706 polymorphisms are among the most robustly associated with schizophrenia (SCZ) and bipolar disorder (BD), and recently with brain phenotypes. |
Searching multiple diseases
<- disease2variant(
data8 disease = c("C3150943", "C1859062", "C2678485", "C4015695"),
database = "CURATED",
score = c(0.75, 1) )
data8
## Object of class 'DataGeNET.DGN'
## . Search: list
## . Type: disease-variant
## . Database: CURATED
## . Score: 0.75-1
## . Term: C3150943 ... C4015695
## . Results: 67
Visualizing the variants associated to multiple diseases
The results of querying DisGeNET variant information with a list of diseases can be visualized as a Variant-Disease Network
, or as a Variant-Disease Heatmap
, Figure 18), by changing the class
argument from “Network” to “Heatmap.”
plot( data8,
class = "Network")
The Variant-Disease Network
can be displayed as a Variant-Disease-Gene Network
, by setting the showGenes
parameter to TRUE
.
plot( data8,
class = "Network",
showGenes = T)
The results can be visualized as a Heatmap.
plot( data8,
class = "Heatmap",
prop = 10)
## [1] "warning: dataframe of 67 rows has been reduced to 50 rows."
Retrieving Disease-Disease Associations from DisGeNET
The disgenet2r package also allows to obtain a list of diseases that share genes or variants with a particular disease, or disease list (disease-disease associations, or DDAs).
Searching DDAs by genes for a single disease
To obtain disease-disease associations, use the disease2disease_by_gene function. This function uses as input a disease, in the same format that in disease2gene, the database to perform the search (by default, CURATED), and the argument ndiseases
, the number of diseases that share genes with the disease(s) of interest. By default ndiseases = 10
. The output is a DataGeNET.DGN
object that contains the top 10 diseases that share genes with the disease that has been searched.
<- disease2disease_by_gene(
data9 disease = "C0010674",
database = "CURATED", ndiseases = 5 )
data9
## Object of class 'DataGeNET.DGN'
## . Search: single
## . Type: disease-disease-gene
## . Database: CURATED
## . Score:
## . Term: C0010674
## . Results: 5
<- extract(data9)
qr head(qr[c("disease1_name", "disease2_name","jaccard_genes","ngenes", "pvalue")])
## disease1_name disease2_name jaccard_genes ngenes pvalue
## 1 Cystic Fibrosis Pulmonary Cystic Fibrosis 0.375 9 0
## 2 Cystic Fibrosis Fibrocystic Disease of Pancreas 0.375 9 0
## 3 Cystic Fibrosis BRONCHIECTASIS WITH OR WITHOUT ELEVATED SWEAT CHLORIDE 1 0.292 7 0
## 4 Cystic Fibrosis Congenital bilateral aplasia of vas deferens 0.280 7 0
## 5 Cystic Fibrosis Hereditary pancreatitis 0.200 6 0
The DataGeNET.DGN
object produced by the disease2disease_by_gene function also contains the Jaccard Index, also known as the Jaccard similarity coefficient for each disease pair. The Jaccard Coefficient is a similarity metric, computed as as the size of the intersection divided by the size of the union of two sample sets, in this case, the genes associates to each disease:
\[\begin{equation*} J(A, B) = \frac{\mid A \cap B \mid}{\mid A \cup B \mid} \end{equation*}\]
The pvalue
column estimates the significance of the Jaccard coefficient for a list of disease pairs. The estimation of the pvalue is estimated by randomizing 1000 times.
Visualizing the diseases associated to a single disease
The plot function applied to the DataGeNET.DGN
object generated by the disease2disease_by_gene function results in a Disease-Disease Network
, where the node in dark blue is the disease of interest and nodes in ligth blue are the diseases that share genes with it (Figure 21). The node size is proportional to the number of genes associated to each disease.
plot( data9,
class = "Network",
layout="layout.lgl" ,
prop = 0.4 )
To obtain the fully connected network, set the class
argument of the plot function to Diseasome
(Figure 22). In this network, the nodes are the diseases of interest, and the node size is proportional to the number of genes associated with them, and the edges size is proportional to the number of genes that are shared between the diseases they are connecting. The nodes in dark blue are the query diseases, and in light blue, the diseases that share genes with them.
plot( data9,
class = "Diseasome",
layout="layout.lgl" ,
prop = 5 )
A second option to visualize the results of disease2disease_by_gene is as a Disease-Disease Barplot
(Figure 23), obtained by setting the class
argument to Barplot
. The Disease-Disease Barplot
shows the top n diseases that share genes with the disease of interest. This barplot also provides the jaccard index for each disease pair.
plot( data9,
class="Barplot")
Searching DDAs via genes for multiple diseases
Visualizing the diseases associated to multiple diseases
The function disease2disease_by_gene can receive a list of diseases as input, in any of the previously described vocabularies. It will retrieve the top 10 diseases that share genes with each of the diseases in the input list. This results can be visualized as a Disease-Disease Network
, a Diseasome plot
, or as a Barplot
.
Table 4: Disease list selected for illustrating the disease2disease_by_gene function
UMLS_CUI | Disease_Name |
---|---|
C0162671 | MELAS Syndrome |
C0023264 | Leigh Disease |
C0917796 | Optic Atrophy, Hereditary, Leber |
<- c("C0013182", "C0013221", "C3658290", "C0860207", "C1274933")
diseasesOfInterest <- disease2disease_by_gene(
data10 disease = diseasesOfInterest,
database = "CURATED", ndiseases = 5)
To obtain the network, set the class
argument of the plot function to Network
(Figure 24). In this network, the nodes are the diseases of interest, and the node size is proportional to the number of genes associated with them. On the other hand, the edges size is proportional to the number of genes that are shared between the diseases they are connecting.
plot( data10,
class = "Network",
prop = 0.1)
To obtain the fully connected network, set the class
argument of the plot function to Diseasome
(Figure 25). In this network, the nodes are the diseases of interest, and the node size is proportional to the number of genes associated with them. On the other hand, the edges size is proportional to the number of genes that are shared between the diseases they are connecting.
plot( data10,
class = "Diseasome",
prop = 0.1)
To further inspect the relationship between a pair of diseases, create a Venn Diagram by setting the class
argument of the plot function to Venn
(Figure 26).
plot( data10,
class="Venn")
The Venn diagram for 4 diseases
<- disease2disease_by_gene(
data11 disease = c("C0018801", "C0028754", "C0004153", "C0011849"),
database = "CURATED",
ndiseases = 5)
plot( data11,
class="Venn"
)
The Venn diagram for 5 diseases
<- disease2disease_by_gene(
data11 disease = c("C0018801", "C0028754", "C0004153", "C0011849", "C0021368"),
database = "CURATED",
ndiseases = 5)
plot( data11,
class="Venn"
)
Performing a disease enrichment
The disease_enrichment function receives a list of genes and performs an enrichment analysis over the diseases in DisGeNET. The input list of genes should be identified with HGNC symbols, or Entrez Gene Identifiers. The vocabulary should be specified using the parameter vocabulary
. By default, vocabulary = "HGNC"
. The function has other optional argument: the source database (by default, database = “CURATED”
). The background for the Fisher test are all the genes in DisGeNET. The p-values resulting from the multiple Fisher tests are corrected for false discovery rate using the Benjamini-Hochberg method.
To perform the enrichment, run:
<- disease2gene(disease = "C0004352", database = "GENOMICS_ENGLAND")
list_of_genes
<- list_of_genes@qresult$gene_symbol
list_of_genes
<-disease_enrichment( entities =list_of_genes, vocabulary = "HGNC",
res_enrich database = "CURATED" )
<- res_enrich@qresult[1:10, c("Description", "FDR", "Ratio", "BgRatio")] table1
The column Ratio represents the number of genes from the input list that are associated to the disease. The BgRatio are the number of genes associated to the disease from the length of the universe, in the example, all genes in DisGeNET curated (9703).
Description | FDR | Ratio | BgRatio |
---|---|---|---|
Autistic Disorder | 0.0000000 | 28/28 | 298/10654 |
Intellectual Disability | 0.0000000 | 20/28 | 649/10654 |
Global developmental delay | 0.0000000 | 11/28 | 338/10654 |
Neurodevelopmental Disorders | 0.0000000 | 8/28 | 114/10654 |
Abnormal behavior | 0.0000006 | 5/28 | 30/10654 |
Seizures | 0.0000077 | 9/28 | 357/10654 |
Cardiovascular Abnormalities | 0.0003252 | 3/28 | 17/10654 |
Autism Spectrum Disorders | 0.0003252 | 5/28 | 114/10654 |
Epilepsy | 0.0004925 | 5/28 | 131/10654 |
Abnormality of the urinary system | 0.0004925 | 2/28 | 3/10654 |
To visualize the results of the enrichment, use the function plot. Use the argument cutoff
to set a minimum p value threshold, and the argument limit
to reduce the number of records shown. By default, limit=50
. Use the argument nchars
to control the lenght of the disease name.
plot( res_enrich, class = "Enrichment", count =3, cutoff= 0.05, nchars=70)
<-disease_enrichment( entities =list_of_genes, vocabulary = "HGNC",
res_enrich database = "ALL" )
<- res_enrich@qresult[1:10, c("Description", "FDR", "Ratio", "BgRatio")] table1
Description | FDR | Ratio | BgRatio |
---|---|---|---|
Autistic Disorder | 0.0e+00 | 28/28 | 1119/22929 |
Autism Spectrum Disorders | 0.0e+00 | 22/28 | 1296/22929 |
Intellectual Disability | 0.0e+00 | 23/28 | 2299/22929 |
Neurodevelopmental Disorders | 0.0e+00 | 15/28 | 556/22929 |
Pervasive Development Disorder | 0.0e+00 | 14/28 | 545/22929 |
Autistic behavior | 0.0e+00 | 12/28 | 375/22929 |
Global developmental delay | 0.0e+00 | 19/28 | 1958/22929 |
Profound Mental Retardation | 3.0e-07 | 8/28 | 297/22929 |
Seizures | 1.8e-06 | 15/28 | 2331/22929 |
Dysmorphic facies | 1.8e-06 | 8/28 | 378/22929 |
A disease_enrichment for a list of variants can also be performed. The input list of variants should be identified with dbSNP identifiers and the vocabulary should be specified using the parameter vocabulary = "DBSNP"
. The function has other optional argument: the source database (by default, database = “CURATED”
). The background for the Fisher test are all the genes in DisGeNET. The p-values resulting from the multiple Fisher tests are corrected for false discovery rate using the Benjamini-Hochberg method.
To perform the enrichment, run:
<- disease2variant(disease = "C0004352", database = "CLINVAR")
list_of_variants
<- as.character(list_of_variants@qresult$variantid)
list_of_variants
<-disease_enrichment( entities =list_of_variants, vocabulary = "DBSNP",
res_enrich database = "CURATED" )
<- res_enrich@qresult[1:10, c("Description", "FDR", "Ratio", "BgRatio")] table1
Description | FDR | Ratio | BgRatio |
---|---|---|---|
Autistic Disorder | 0e+00 | 33/33 | 151/183851 |
Seizures | 0e+00 | 11/33 | 255/183851 |
Global developmental delay | 0e+00 | 10/33 | 287/183851 |
Delayed speech and language development | 0e+00 | 6/33 | 52/183851 |
Abnormality of vision | 0e+00 | 3/33 | 5/183851 |
Coloboma of iris | 0e+00 | 3/33 | 6/183851 |
Cryptorchidism | 1e-07 | 3/33 | 19/183851 |
Motor delay | 1e-07 | 3/33 | 18/183851 |
Uranostaphyloschisis | 1e-07 | 3/33 | 19/183851 |
Impaired cognition | 1e-07 | 4/33 | 122/183851 |
To visualize the results of the enrichment, use the function plot. Use the argument cutoff
to set a minimum p value threshold, and the argument limit
to reduce the number of records shown. By default, limit=50
.
plot( res_enrich, class = "Enrichment", count =6, cutoff= 0.05)
Expanding DisGeNET data with other resources available in the LOD cloud
Currently, Wikipathways-RDF has issues and these federated queries do not work!
Retrieving the pathways associated to a disease
To obtain the pathways from WikiPathways associated to a disease, use the disease2pathway function.
<- disease2pathway( disease = "C0018801" ,
dis2path database = "ALL", score = c(0.5, 1))
dis2path# head(dis2path@qresult)
# qr <- extract(dis2path)
# head(qr, 3)
Retrieving the diseases associated to a given pathway
To obtain the pathways from WikiPathways associated to a disease, use the disease2pathway function.
<- pathway2disease( pathway = "WP1591" ,
path2dis database = "CURATED",
score = c(0.9, 1))
path2dishead(path2dis@qresult)
<- extract(path2dis)
qr head(qr, 3)
Retrieve the drug targets for a disease
The function disease2compound allows retrieved the chemicals in ChEMBL that target the genes associated to the given disease, or list of diseases. The function produces a DataGeNET.RDF object with the input disease(s), the associated genes, the DisGeNET score, and the chemicals associated to the disease proteins. If there were no chemicals found, the genes are still included in the output. It is important to higlight that because of ChEMBL restrictions for their SPARQL endpoint if 1000 compounds are retrieved for a single protein it probably means that there are more than 1000. (See more information about ChEMBL SPARQL endpoint at https://www.ebi.ac.uk/rdf/what-are-limitations-sparql-endpoints).
library("SPARQL")
<- disease2compound(
dis2com disease = c("C0020538"),
database = "CURATED" )
dis2com<- extract(dis2com)
qr head(qr, 3)
Retrieving Gene Ontology data for a gene
library("SPARQL")
<- gene2biologicalprocess(gene = "351")
gene2bp head(gene2bp@qresult)
<- gene2cellcomponent(gene = "351")
gene2cc head(gene2cc@qresult)
<- gene2molecularfunction(gene = "351")
gene2mf head(gene2mf@qresult)
<- gene2indication( gene = "1588" )
gene2indi head(gene2indi@qresult)
Retrieving Gene Ontology data for a disease
The functions disease2biologicalprocess, disease2molecularfunction, and disease2cellcomponent can retrieve the Gene Ontology (GO) molecular function, GO biological process and GO cellular component of the genes associated to a disease or a list of diseases.
Retrieving the biological processes associated to a disease
For a disease, the biological process information can be retrieved in GO terms using the function disease2biologicalprocess that receives as input an NCBI gene identfier. The function produces a DataGeNET.RDF object containing the disease, the genes and their GOBP terms.
<- disease2biologicalprocess(
disease2bp disease = "C0036341",
database = "CURATED",
score = c( 0.6,1)
)
disease2bp<- extract(disease2bp)
disease2bp head(disease2bp[c("gene_product_label","go_label")])
Retrieving the molecular functions of the genes associated to a disease}
For a particular disease, the molecular activities of associated genes can be retrieved as GO terms using the function disease2molecularfunction. The function produces a DataGeNET.RDF object containing the disease, the genes and their GOMF terms.
<- disease2molecularfunction(
disease2mf disease = "C0002395",
database = "UNIPROT"
)
disease2mf<- extract(disease2mf)
disease2mf head(disease2mf[c("gene_product_label", "go_label")])
Retrieving the molecular functions of the genes associated to a disease
For a given gene, the cell component information (i.e. where gene products are active) can be retrieved as GO terms using the function gene2cellcomponent that receives as input an NCBI gene identfier. The function produces a DataGeNET.RDF object containing the disease, the genes and their GO CC terms.
<- disease2cellcomponent( disease = "C0002395",
disease2cc database = "MGD")
<- extract(disease2cc)
disease2cc head(disease2cc[c("gene_product_label", "go_label")])
Get DisGeNET data version
get_disgenet_version()
## [1] "{ api_version : 1.2.0 , database_version : DisGeNET Database v7 , lastUpdate : May 2021 , comments : - Added GDA and VDA evidences endpoints.- Added the disease similarity endpoint.- Added the disease mapping by name endpoint. }"
COPYRIGHT
Copyright (C) 2021, IBI group.
disgenet2r is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
disgenet2r is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
How to cite disgenet2r
Piñero, J., Ramírez-Anguita, J. M., Saüch-Pitarch, J., Ronzano, F., Centeno, E., Sanz, F., & Furlong, L. I. (2020). The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Research, 48(D1), D845-D855.