COVID-19 DisGeNET data collection

This page shows the results of applying state-of-the-art text mining tools developed by MedBioinformatics Solutions to the LitCovid dataset (Chen, Allot, and Lu, 2020) to identify mentions of diseases, signs, and symptoms. The LitCovid dataset contains a selection of papers on coronavirus disease 2019 (COVID-19).

Our text mining tools scan these articles and identify mentions of genes, diseases, and phenotypes, together with mentions of the COVID-19 virus. These mentions are normalized to standard vocabularies. The data can be downloaded here.

The data is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, whose text can be found here.

For more information, please contact us at support(at)disgenet(dot)org.

The image below shows the genes and diseases most frequently mentioned together with COVID-19 and with SARS-CoV-2.

Data History

  • Version 5.0 released (September 28, 2020)
    • Variant information has been added.
    • The dataset contains 1,843 genes, 211 variants, and 4,018 diseases and phenotypes over 49,410 publications.
  • Version 4.0 released (July 14, 2020)
    • The dataset contains 863 genes and 2,563 diseases and phenotypes over 21,620 publications.
  • Version 3.0 released (June 10, 2020)
    • Gene information has been added.
    • The dataset contains 754 genes and 2,101 diseases and phenotypes over 15,430 publications.
  • Version 2.0 released (May 20, 2020)
    • The dataset contains 1,569 diseases and phenotypes over 11,297 publications.
  • Version 1.0 released (April 19, 2020)
    • The dataset contains 905 diseases and phenotypes over 4,833 publications.