Data Models

PyHGNC uses SQLAlchemy to store the data in the database. You can use an instance of pyhgnc.manager.query.QueryManager to query the content of the database.

Entity–relationship model:

ToDo: Add ER figure here!

HGNC

class pyhgnc.manager.models.HGNC(**kwargs)

Root class (table, model) for all other classes (tables, models) in PyHGNC. Basic information that has a 1:1 relationship to the identifier is stored here.

Warning

The following fields are described in the README but are not found in the HGNC JSON file:

  • homeodb (Homeobox Database ID)
  • horde_id (Symbol used within HORDE for the gene)

Hint

To link to the IUPHAR/BPS Guide to PHARMACOLOGY database, use only the number (for example, use 1 from the result objectId:1).
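The hint above can be sketched as a small helper. The function name `iuphar_object_number` is hypothetical and not part of PyHGNC; it assumes the stored value has the form `objectId:<number>`, as in the example given.

```python
def iuphar_object_number(iuphar_value):
    """Return the numeric part of a value such as 'objectId:1'.

    Hypothetical helper, not part of PyHGNC.
    """
    prefix, sep, number = iuphar_value.partition(":")
    # Reject anything that does not look like 'objectId:<digits>'.
    if prefix != "objectId" or not sep or not number.isdigit():
        raise ValueError("unexpected IUPHAR value: %r" % iuphar_value)
    return int(number)


print(iuphar_object_number("objectId:1"))  # prints 1
```

Raising on unexpected input, rather than returning the string unchanged, is a design choice of this sketch so that malformed values are noticed early.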

Variables:
  • name (str) – HGNC approved name for the gene. Equates to the “APPROVED NAME” field within the gene symbol report
  • symbol (str) – The HGNC approved gene symbol. Equates to the “APPROVED SYMBOL” field within the gene symbol report
  • orphanet (int) – Orphanet ID
  • identifier (str) – Unique ID created by the HGNC for every approved symbol (HGNC ID)
  • status (str) – Status of the symbol report, which can be either “Approved” or “Entry Withdrawn”
  • uuid (str) – universally unique identifier
  • locus_group (str) – Group name for a set of related locus types as defined by the HGNC (e.g. non-coding RNA)
  • locus_type (str) – Locus type as defined by the HGNC (e.g. RNA, transfer)
  • date_name_changed (date) – date the gene name was last changed
  • date_modified (date) – date the entry was last modified
  • date_symbol_changed (date) – date the gene symbol was last changed
  • date_approved_reserved (date) – date the entry was first approved
  • ensembl_gene (str) – Ensembl gene ID. Found within the “GENE RESOURCES” section of the gene symbol report
  • horde (str) – symbol used within HORDE for the gene (not available in JSON)
  • vega (str) – Vega gene ID. Found within the “GENE RESOURCES” section of the gene symbol report
  • lncrnadb (str) – Long Noncoding RNA Database identifier
  • entrez (str) – Entrez gene ID. Found within the “GENE RESOURCES” section of the gene symbol report
  • mirbase (str) – miRBase ID
  • iuphar (str) – The objectId used to link to the IUPHAR/BPS Guide to PHARMACOLOGY database
  • ucsc (str) – UCSC gene ID. Found within the “GENE RESOURCES” section of the gene symbol report
  • snornabase (str) – snoRNABase ID
  • imgt (str) – Symbol used within the international ImMunoGeneTics information system (IMGT)
  • pseudogeneorg (str) – Pseudogene.org ID
  • bioparadigmsslc (str) – Symbol used to link to the SLC tables database at bioparadigms.org for the gene
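  • locationsortable (str) – Same as location, but single-digit chromosomes are prefixed with a 0 so that values sort in numerical order (e.g. 02q34)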
  • merops (str) – ID used to link to the MEROPS peptidase database
  • location (str) – Cytogenetic location of the gene (e.g. 2q34).
  • cosmic (str) – Symbol used within the Catalogue Of Somatic Mutations In Cancer (COSMIC) for the gene
  • rgds (list) – relationship to RGD
  • omims (list) – relationship to OMIM
  • ccdss (list) – relationship to CCDS
  • lsdbs (list) – relationship to LSDB
  • orthology_predictions (list) – relationship to OrthologyPrediction
  • enzymes (list) – relationship to Enzyme
  • gene_families (list) – relationship to GeneFamily
  • refseq_accessions (list) – relationship to RefSeq
  • mgds (list) – relationship to MGD
  • uniprots (list) – relationship to UniProt
  • pubmeds (list) – relationship to PubMed
  • enas (list) – relationship to ENA

AliasSymbol

class pyhgnc.manager.models.AliasSymbol(**kwargs)

Other symbols used to refer to this gene as seen in the “SYNONYMS” field in the symbol report.

Attention

Symbols previously approved by the HGNC for this gene are tagged with is_previous_symbol==True. Equates to the “PREVIOUS SYMBOLS & NAMES” field within the gene symbol report.
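As a minimal sketch of how the is_previous_symbol flag separates the two kinds of entries, the following uses a plain dataclass as a stand-in for the model. The class `AliasSymbolRecord`, the field name `alias_symbol`, and the example values are assumptions made for illustration; only `is_previous_symbol` comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class AliasSymbolRecord:
    """Stand-in for an AliasSymbol row; not the PyHGNC model itself."""
    alias_symbol: str          # assumed field name
    is_previous_symbol: bool   # flag described in the text above


def split_symbols(records):
    """Separate previously approved symbols from plain synonyms."""
    previous = [r.alias_symbol for r in records if r.is_previous_symbol]
    synonyms = [r.alias_symbol for r in records if not r.is_previous_symbol]
    return previous, synonyms


# Made-up example values, for illustration only.
records = [
    AliasSymbolRecord("OLD1", True),
    AliasSymbolRecord("SYN1", False),
    AliasSymbolRecord("SYN2", False),
]
previous, synonyms = split_symbols(records)
print(previous)   # ['OLD1']
print(synonyms)   # ['SYN1', 'SYN2']
```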

Variables:

AliasName

class pyhgnc.manager.models.AliasName(**kwargs)

Other names used to refer to this gene as seen in the “SYNONYMS” field in the gene symbol report.

Attention

Gene names previously approved by the HGNC for this gene are tagged with is_previous_name==True. Equates to the “PREVIOUS SYMBOLS & NAMES” field within the gene symbol report.

Variables:

GeneFamily

class pyhgnc.manager.models.GeneFamily(**kwargs)

Name and identifier given to a gene family or group the gene has been assigned to. Equates to the “GENE FAMILY” field within the gene symbol report.

Variables:
  • familyid (int) – family identifier
  • familyname (str) – family name
  • hgncs (list) – back populates to HGNC
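The `hgncs (list)` attribute here, together with `gene_families (list)` on HGNC, suggests a link that can be followed in both directions. The sketch below illustrates that idea with the standard-library sqlite3 module and a link table. The table names, the column names other than familyid and familyname, and the example rows are assumptions made for illustration; PyHGNC's actual schema may differ.

```python
import sqlite3

# Hypothetical schema for illustration; PyHGNC's real table and column
# names may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hgnc (id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE gene_family (
        id INTEGER PRIMARY KEY, familyid INTEGER, familyname TEXT
    );
    CREATE TABLE hgnc_gene_family (
        hgnc_id INTEGER REFERENCES hgnc(id),
        gene_family_id INTEGER REFERENCES gene_family(id)
    );
""")
# Made-up rows: two genes that belong to the same family.
conn.executemany("INSERT INTO hgnc VALUES (?, ?)", [(1, "GENE_A"), (2, "GENE_B")])
conn.execute("INSERT INTO gene_family VALUES (1, 100, 'Example family')")
conn.executemany("INSERT INTO hgnc_gene_family VALUES (?, ?)", [(1, 1), (2, 1)])


def symbols_in_family(connection, familyid):
    """Family -> genes: the direction `hgncs` would expose on GeneFamily."""
    rows = connection.execute(
        """SELECT h.symbol FROM hgnc AS h
           JOIN hgnc_gene_family AS l ON l.hgnc_id = h.id
           JOIN gene_family AS f ON f.id = l.gene_family_id
           WHERE f.familyid = ? ORDER BY h.symbol""",
        (familyid,),
    )
    return [row[0] for row in rows]


def families_of_symbol(connection, symbol):
    """Gene -> families: the direction `gene_families` would expose on HGNC."""
    rows = connection.execute(
        """SELECT f.familyname FROM gene_family AS f
           JOIN hgnc_gene_family AS l ON l.gene_family_id = f.id
           JOIN hgnc AS h ON h.id = l.hgnc_id
           WHERE h.symbol = ? ORDER BY f.familyname""",
        (symbol,),
    )
    return [row[0] for row in rows]


print(symbols_in_family(conn, 100))        # ['GENE_A', 'GENE_B']
print(families_of_symbol(conn, "GENE_A"))  # ['Example family']
```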

RefSeq

class pyhgnc.manager.models.RefSeq(**kwargs)

RefSeq nucleotide accession(s). Found within the “NUCLEOTIDE SEQUENCES” section of the gene symbol report.

See also RefSeq database for more information.

Variables:
  • accession (str) – RefSeq accession number
  • hgncs (list) – back populates to HGNC

RGD

class pyhgnc.manager.models.RGD(**kwargs)

Rat genome database gene ID. Found within the “HOMOLOGS” section of the gene symbol report.

Variables:
  • rgdid (str) – Rat genome database gene ID
  • hgncs (list) – back populates to HGNC

OMIM

class pyhgnc.manager.models.OMIM(**kwargs)

Online Mendelian Inheritance in Man (OMIM) ID

Variables:
  • omimid (str) – OMIM ID
  • hgnc – back populates to pyhgnc.manager.models.HGNC

MGD

class pyhgnc.manager.models.MGD(**kwargs)

Mouse genome informatics database ID. Found within the “HOMOLOGS” section of the gene symbol report.

Variables:
  • mgdid (str) – Mouse genome informatics database ID
  • hgncs (list) – back populates to HGNC

UniProt

class pyhgnc.manager.models.UniProt(**kwargs)

Universal Protein Resource (UniProt) protein accession. Found within the “PROTEIN RESOURCES” section of the gene symbol report.

See also UniProt webpage for more information.

Variables:
  • uniprotid (str) – UniProt identifier
  • hgncs (list) – back populates to HGNC

CCDS

class pyhgnc.manager.models.CCDS(**kwargs)

Consensus CDS ID. Found within the “NUCLEOTIDE SEQUENCES” section of the gene symbol report.

See also CCDS for more information.

Variables:
  • ccdsid (str) – CCDS identifier
  • hgnc – back populates to HGNC

PubMed

class pyhgnc.manager.models.PubMed(**kwargs)

PubMed and Europe PubMed Central PMID

Variables:
  • pubmedid (str) – PubMed identifier
  • hgncs (list) – back populates to HGNC

ENA

class pyhgnc.manager.models.ENA(**kwargs)

International Nucleotide Sequence Database Collaboration (GenBank, ENA and DDBJ) accession number(s). Found within the “NUCLEOTIDE SEQUENCES” section of the gene symbol report.

Variables:
  • enaid (str) – European Nucleotide Archive (ENA) identifier
  • hgncs (list) – back populates to HGNC

Enzyme

class pyhgnc.manager.models.Enzyme(**kwargs)

Enzyme Commission number (EC number)

Variables:
  • ec_number (str) – EC number
  • hgncs (list) – back populates to HGNC

LSDB

class pyhgnc.manager.models.LSDB(**kwargs)

The name of the Locus Specific Mutation Database and URL

Variables:
  • lsdb (str) – name of the Locus Specific Mutation Database
  • url (str) – URL to database
  • hgnc – back populates to HGNC

OrthologyPrediction

class pyhgnc.manager.models.OrthologyPrediction(**kwargs)

Orthology Predictions

Warning

OrthologyPrediction is not yet fully normalized or documented.

Variables:
  • ortholog_species (int) – NCBI taxonomy identifier
  • human_entrez_gene (int) – Human Entrez gene identifier
  • human_ensembl_gene (str) – Human Ensembl gene identifier
  • human_name (str) – Human gene name
  • human_symbol (str) – Human gene symbol
  • human_chr (str) – Human gene chromosome location
  • human_assert_ids (str) –
  • ortholog_species_entrez_gene (str) – Ortholog species Entrez gene identifier
  • ortholog_species_ensembl_gene (str) – Ortholog species Ensembl gene identifier
  • ortholog_species_db_id (str) – Ortholog species database identifier
  • ortholog_species_name (str) – Ortholog species gene name
  • ortholog_species_symbol (str) – Ortholog species gene symbol
  • ortholog_species_chr (str) – Ortholog species gene chromosome location
  • ortholog_species_assert_ids (str) –
  • support (str) –
  • hgnc – back populates to HGNC

Database functions

set_connection

pyhgnc.manager.database.set_connection(connection='sqlite:////home/docs/.pyhgnc/data/pyhgnc.db')

Set the connection string for SQLAlchemy and write it to the config file.

import pyhgnc
pyhgnc.set_connection('mysql+pymysql://{user}:{passwd}@{host}/{db}?charset={charset}')

Hint

Valid connection strings:

  • mysql+pymysql://user:passwd@localhost/database?charset=utf8
  • postgresql://scott:tiger@localhost/mydatabase
  • mssql+pyodbc://user:passwd@database
  • oracle://user:passwd@127.0.0.1:1521/database
  • Linux: sqlite:////absolute/path/to/database.db
  • Windows: sqlite:///C:\path\to\database.db

Parameters: connection (str) – SQLAlchemy connection string
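The SQLite forms listed above follow one pattern: the `sqlite:///` scheme prefix followed by the file path. A minimal sketch, assuming an absolute path is given; the helper name `sqlite_url` is hypothetical and not part of PyHGNC:

```python
def sqlite_url(absolute_path):
    """Build a SQLite connection string from an absolute file path.

    Hypothetical helper, not part of PyHGNC. The scheme prefix is
    'sqlite:///', so a POSIX path that starts with '/' yields four
    slashes in total.
    """
    return "sqlite:///" + absolute_path


# Linux-style absolute path: note the resulting four slashes.
print(sqlite_url("/absolute/path/to/database.db"))
# Windows-style path: a raw string keeps the backslashes literal.
print(sqlite_url(r"C:\path\to\database.db"))
```

On Windows, writing the path as a raw string (or doubling the backslashes) avoids sequences such as `\t` being interpreted as escape characters.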

update

pyhgnc.manager.database.update(connection=None, silent=False, hgnc_file_path=None, hcop_file_path=None, low_memory=False)

Update the database with the current version of HGNC.

Parameters:
  • connection (str) – connection string
  • silent (bool) – stay silent during the import
  • hgnc_file_path (str) – path of the HGNC file to import from
  • hcop_file_path (str) – path of the HCOP (orthologs) file to import from
  • low_memory (bool) – set to True if little memory is available
Returns:

set_mysql_connection

pyhgnc.manager.database.set_mysql_connection(host='localhost', user='pyhgnc_user', passwd='pyhgnc_passwd', db='pyhgnc', charset='utf8')

Method to set a MySQL connection

Parameters:
  • host (str) – MySQL database host
  • user (str) – MySQL database user
  • passwd (str) – MySQL database password
  • db (str) – MySQL database name
  • charset (str) – MySQL database character set
Returns: connection string

Return type: str
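To illustrate how the parameters above could map onto a connection string, the following sketch fills in the MySQL template shown in the set_connection example. The function `build_mysql_connection_string` is hypothetical and is not PyHGNC's implementation. URL-quoting the password is an addition of this sketch, so that characters such as `@` or `/` do not break the string.

```python
from urllib.parse import quote_plus

# Template copied from the set_connection example in this document.
MYSQL_TEMPLATE = "mysql+pymysql://{user}:{passwd}@{host}/{db}?charset={charset}"


def build_mysql_connection_string(host="localhost", user="pyhgnc_user",
                                  passwd="pyhgnc_passwd", db="pyhgnc",
                                  charset="utf8"):
    """Fill the template with the given values.

    Hypothetical helper, not PyHGNC's implementation. The password is
    URL-quoted so that reserved characters stay unambiguous.
    """
    return MYSQL_TEMPLATE.format(
        user=user, passwd=quote_plus(passwd), host=host, db=db, charset=charset
    )


print(build_mysql_connection_string())
# mysql+pymysql://pyhgnc_user:pyhgnc_passwd@localhost/pyhgnc?charset=utf8
```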