Conversation

@sbabyanusha
Contributor

No description provided.

Collaborator

@rmadupuri rmadupuri left a comment

Thanks @sbabyanusha! This looks good. A few questions and possible improvements:

  1. The download date needs to be consistent. The archive filenames contain both Oct_07_2025 and Oct_31_2025.
  2. The gene_info file has an extra column after ensembl_id.
  3. Remove the EGFRVIII record (entrez_id -3) from gene_info.
  4. For miRNA records, set the type to miRNA in gene_info.txt.
  5. Remove mitochondrial genes.
  6. Some records have the Chromosome set to '-' or '13cen...'. Can these be backfilled from NCBI?
  7. Add LINC01394 -> FOXF2-AS1 to outdated_hugo_symbols.txt.
  8. For case 2 (genes where only the Entrez ID was updated): what were the previous Entrez IDs? Were they added to the outdated_entrez_ids list?
  9. The genes added to entrez-id-supp.txt are already present in hgnc_complete_set.txt. Can the missing Entrez IDs be backfilled there instead of being added to the supp file?
  10. There are ambiguous symbols; each of the following has a freq of 2:

      Entrez  symbol  freq
      3794    ARP11   2
      3795    ARP2    2
      3796    ARP3    2
      13884   COP1    2
      15845   CYP1    2
      47904   MAF1P1  2
      90703   TRY3    2
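
Several of the points above (ambiguous symbols, placeholder chromosome values, the negative Entrez ID) could be caught with a small sanity check before the files are committed. The following is only a minimal sketch: the column names `entrez_id`, `symbol`, `chromosome`, and `type`, the sample rows, and all function names are assumptions for illustration, and the real gene_info.txt header may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; column names and rows are assumptions, not the real file.
SAMPLE = (
    "entrez_id\tsymbol\tchromosome\ttype\n"
    "1\tGENE_A\t19\tprotein-coding\n"
    "2\tGENE_B\t-\tprotein-coding\n"
    "3\tGENE_B\t7\tpseudogene\n"
    "-3\tEGFRVIII\t7\tprotein-coding\n"
)


def load_rows(handle):
    """Read a tab-delimited file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def ambiguous_symbols(rows):
    """Return symbols that occur on more than one row, with their counts."""
    counts = Counter(row["symbol"] for row in rows)
    return {symbol: n for symbol, n in counts.items() if n > 1}


def placeholder_chromosomes(rows):
    """Return symbols whose chromosome is not a bare chromosome name.

    Values such as '-' or a cytogenetic location fall outside this set.
    """
    valid = {str(i) for i in range(1, 23)} | {"X", "Y", "MT"}
    return [row["symbol"] for row in rows if row["chromosome"] not in valid]


def negative_entrez(rows):
    """Return symbols whose Entrez ID is negative (for example -3)."""
    return [row["symbol"] for row in rows if int(row["entrez_id"]) < 0]


rows = load_rows(io.StringIO(SAMPLE))
print(ambiguous_symbols(rows))
print(placeholder_chromosomes(rows))
print(negative_entrez(rows))
```

To run it against the actual file, replace `io.StringIO(SAMPLE)` with an open file handle and adjust the column names to match the header.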

3 participants