Last updated: 2025-01-24
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The results in this page were generated with repository version 6a84acc.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/biomart.Rmd) and HTML (docs/biomart.html) files.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 6a84acc | Dave Tang | 2025-01-24 | Ensembl gene ID to Entrez gene ID |
html | ce560e1 | Dave Tang | 2024-11-15 | Build site. |
Rmd | cabd047 | Dave Tang | 2024-11-15 | Compare with database dump |
html | 3cf092d | Dave Tang | 2024-11-08 | Build site. |
Rmd | 5391fcc | Dave Tang | 2024-11-08 | wflow_publish(files = "analysis/biomart.Rmd") |
html | d6c949a | Dave Tang | 2024-10-24 | Build site. |
Rmd | e6a2c58 | Dave Tang | 2024-10-24 | Compare with org.Hs.eg.db |
html | 6e8540e | Dave Tang | 2024-10-24 | Build site. |
Rmd | ff049cd | Dave Tang | 2024-10-24 | Using biomaRt |
The biomaRt package provides an interface to BioMart databases provided by Ensembl.
biomaRt provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. The most prominent examples of BioMart databases are maintained by Ensembl, which provides biomaRt users direct access to a diverse set of data and enables a wide range of powerful online queries from gene annotation to database mining.
For more information, check out the Accessing Ensembl annotation with biomaRt guide.
To begin, install the {biomaRt} package.
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("biomaRt")
Load package.
packageVersion("biomaRt")
[1] '2.62.0'
suppressPackageStartupMessages(library(biomaRt))
If you are using Ubuntu and get a “Cannot find xml2-config” error while installing the {XML} package, a dependency of {biomaRt}, try installing (or asking the sysadmin to install) libxml2-dev:
sudo apt-get install libxml2-dev
List the available BioMart databases.
listMarts()
biomart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 113
2 ENSEMBL_MART_MOUSE Mouse strains 113
3 ENSEMBL_MART_SNP Ensembl Variation 113
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 113
Connect to the selected BioMart database by using useMart().
ensembl <- useMart("ENSEMBL_MART_ENSEMBL")
avail_datasets <- listDatasets(ensembl)
head(avail_datasets)
dataset description
1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
2 acalliptera_gene_ensembl Eastern happy genes (fAstCal1.3)
3 acarolinensis_gene_ensembl Green anole genes (AnoCar2.0v2)
4 acchrysaetos_gene_ensembl Golden eagle genes (bAquChr1.2)
5 acitrinellus_gene_ensembl Midas cichlid genes (Midas_v5)
6 amelanoleuca_gene_ensembl Giant panda genes (ASM200744v2)
version
1 ASM259213v1
2 fAstCal1.3
3 AnoCar2.0v2
4 bAquChr1.2
5 Midas_v5
6 ASM200744v2
Look for human datasets by searching the description column.
idx <- grep('human', avail_datasets$description, ignore.case = TRUE)
avail_datasets[idx, ]
dataset description version
80 hsapiens_gene_ensembl Human genes (GRCh38.p14) GRCh38.p14
Connect to the selected BioMart database and human dataset.
ensembl <- useMart("ensembl", dataset=avail_datasets[idx, 'dataset'])
ensembl
Object of class 'Mart':
Using the ENSEMBL_MART_ENSEMBL BioMart database
Using the hsapiens_gene_ensembl dataset
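Building a query requires three things: filters, attributes, and values. Filters, together with their values, restrict the query; attributes are the fields that are returned.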
Use listFilters() to show available filters.
avail_filters <- listFilters(ensembl)
head(avail_filters)
name description
1 chromosome_name Chromosome/scaffold name
2 start Start
3 end End
4 band_start Band Start
5 band_end Band End
6 marker_start Marker Start
Use listAttributes() to show available attributes.
avail_attributes <- listAttributes(ensembl)
head(avail_attributes)
name description page
1 ensembl_gene_id Gene stable ID feature_page
2 ensembl_gene_id_version Gene stable ID version feature_page
3 ensembl_transcript_id Transcript stable ID feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5 ensembl_peptide_id Protein stable ID feature_page
6 ensembl_peptide_id_version Protein stable ID version feature_page
The getBM() function is the main query function in {biomaRt}; use it once you have identified your attributes of interest and filters to use. Here’s an example that converts Affymetrix microarray probe IDs for a specific platform into Entrez Gene IDs and their descriptions.
affyids <- c("202763_at", "209310_s_at", "207500_at")
getBM(
attributes=c('affy_hg_u133_plus_2', 'entrezgene_id', 'entrezgene_description'),
filters = 'affy_hg_u133_plus_2',
values = affyids,
mart = ensembl
)
affy_hg_u133_plus_2 entrezgene_id entrezgene_description
1 209310_s_at 837 caspase 4
2 207500_at 838 caspase 5
3 202763_at 836 caspase 3
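Note that the rows above are not in the order of affyids. A minimal sketch of restoring the input order with match() follows; it uses a hand-typed copy of the output above (the data frame res is a stand-in for the getBM() result, so no query is made).

```r
# Hand-typed copy of the getBM() output shown above (stand-in, no query made).
affyids <- c("202763_at", "209310_s_at", "207500_at")
res <- data.frame(
  affy_hg_u133_plus_2 = c("209310_s_at", "207500_at", "202763_at"),
  entrezgene_id = c(837L, 838L, 836L),
  entrezgene_description = c("caspase 4", "caspase 5", "caspase 3")
)

# match() gives, for each input ID, the row index of its first hit in res.
# An input ID without a hit gives NA, which produces an all-NA row.
res_ordered <- res[match(affyids, res$affy_hg_u133_plus_2), ]
rownames(res_ordered) <- NULL
res_ordered
```

Because match() keeps only the first hit, this approach suits one-to-one results; a probe that maps to several genes would need a join instead.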
Look for filters with RefSeq.
grep("refseq", avail_filters$name, ignore.case=TRUE, value=TRUE)
[1] "with_refseq_mrna" "with_refseq_mrna_predicted"
[3] "with_refseq_ncrna" "with_refseq_ncrna_predicted"
[5] "with_refseq_peptide" "with_refseq_peptide_predicted"
[7] "refseq_mrna" "refseq_mrna_predicted"
[9] "refseq_ncrna" "refseq_ncrna_predicted"
[11] "refseq_peptide" "refseq_peptide_predicted"
RefSeq information for ACTB.
my_refseq <- 'NM_001101'
getBM(
attributes = c('refseq_mrna', 'ensembl_gene_id', 'description'),
filters = 'refseq_mrna',
values = my_refseq,
mart = ensembl
)
refseq_mrna ensembl_gene_id description
1 NM_001101 ENSG00000075624 actin beta [Source:HGNC Symbol;Acc:HGNC:132]
Find GO attribute names.
grep("^go", avail_attributes$name, ignore.case=TRUE, value=TRUE)
[1] "go_id" "go_linkage_type" "goslim_goa_accession"
[4] "goslim_goa_description"
Find Ensembl filters.
grep("^ensembl", avail_filters$name, ignore.case=TRUE, value=TRUE)
[1] "ensembl_gene_id" "ensembl_gene_id_version"
[3] "ensembl_transcript_id" "ensembl_transcript_id_version"
[5] "ensembl_peptide_id" "ensembl_peptide_id_version"
[7] "ensembl_exon_id"
ENSG00000075624 is the Ensembl gene ID for ACTB (actin beta), as the RefSeq query above shows. Here’s a query that obtains the GO terms associated with ACTB.
actb <- 'ENSG00000075624'
getBM(
  attributes=c("go_id"),
  filters="ensembl_gene_id",
  values = actb,
  mart = ensembl
) -> actb_go
tail(actb_go)
go_id
89 GO:0097433
90 GO:1900242
91 GO:0005903
92 GO:0030863
93 GO:0044305
94 GO:0098685
Use Term() to get GO terms for the GO IDs.
suppressPackageStartupMessages(library("GO.db"))
AnnotationDbi::Term(actb_go$go_id) |>
  as.data.frame() |>
  tail()
AnnotationDbi::Term(actb_go$go_id)
GO:0097433 dense body
GO:1900242 regulation of synaptic vesicle endocytosis
GO:0005903 brush border
GO:0030863 cortical cytoskeleton
GO:0044305 calyx of Held
GO:0098685 Schaffer collateral - CA1 synapse
Use GOTERM to get more information on a term.
my_go_id <- 'GO:0098685'
class(GOTERM)
[1] "GOTermsAnnDbBimap"
attr(,"package")
[1] "AnnotationDbi"
GOTERM[[my_go_id]]
GOID: GO:0098685
Term: Schaffer collateral - CA1 synapse
Ontology: CC
Definition: A synapse between the Schaffer collateral axon of a CA3
pyramidal cell and a CA1 pyramidal cell.
Use the SNP database.
snp <- useMart("ENSEMBL_MART_SNP")
avail_snp_datasets <- listDatasets(snp)
head(avail_snp_datasets)
dataset
1 btaurus_snp
2 btaurus_structvar
3 chircus_snp
4 clfamiliaris_snp
5 clfamiliaris_structvar
6 drerio_snp
description
1 Cow Short Variants (SNPs and indels excluding flagged variants) (ARS-UCD1.3)
2 Cow Structural Variants (ARS-UCD1.3)
3 Goat Short Variants (SNPs and indels excluding flagged variants) (ARS1)
4 Dog Short Variants (SNPs and indels excluding flagged variants) (ROS_Cfam_1.0)
5 Dog Structural Variants (ROS_Cfam_1.0)
6 Zebrafish Short Variants (SNPs and indels excluding flagged variants) (GRCz11)
version
1 ARS-UCD1.3
2 ARS-UCD1.3
3 ARS1
4 ROS_Cfam_1.0
5 ROS_Cfam_1.0
6 GRCz11
Look for human datasets.
idx <- grep('human', avail_snp_datasets$description, ignore.case = TRUE)
avail_snp_datasets[idx, ]
dataset
12 hsapiens_snp
13 hsapiens_snp_som
14 hsapiens_structvar
15 hsapiens_structvar_som
description
12 Human Short Variants (SNPs and indels excluding flagged variants) (GRCh38.p14)
13 Human Somatic Short Variants (SNPs and indels excluding flagged variants) (GRCh38.p14)
14 Human Structural Variants (GRCh38.p14)
15 Human Somatic Structural Variants (GRCh38.p14)
version
12 GRCh38.p14
13 GRCh38.p14
14 GRCh38.p14
15 GRCh38.p14
Get SNPs within a genomic location.
snp <- useMart("ENSEMBL_MART_SNP", dataset="hsapiens_snp")
my_snps <- getBM(
attributes=c("refsnp_id","allele","chrom_start"),
filters=c("chr_name","start","end"),
values=list(8,148350, 149000),
mart=snp
)
rbind(
head(my_snps, 3),
tail(my_snps, 3)
)
refsnp_id allele chrom_start
1 rs1450830176 G/C 148350
2 rs1360310185 C/A/T 148352
3 rs1434776028 A/T 148353
243 rs1435594779 C/G 148998
244 rs1800825262 C/G/T 148999
245 rs1800825282 G/A 149000
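In the query above, the first element of values is applied to the first filter, the second to the second, and so on. One way to keep each pair together is a named list. This is a small sketch; region, snp_filters and snp_values are names introduced here for illustration.

```r
# Keep each filter next to its value (`region` is an illustrative name).
region <- list(chr_name = 8, start = 148350, end = 149000)

snp_filters <- names(region)   # filter names, in order
snp_values  <- unname(region)  # matching values, as an unnamed list

# The two must line up one-to-one before being passed to getBM().
stopifnot(length(snp_filters) == length(snp_values))

# Not run here (needs a network connection and the `snp` mart from above):
# getBM(
#   attributes = c("refsnp_id", "allele", "chrom_start"),
#   filters = snp_filters,
#   values = snp_values,
#   mart = snp
# )
```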
Get SNP information with SNP IDs.
my_snp_ids <- c('rs547420070', 'rs77274555')
getBM(
attributes=c("refsnp_id","allele","chrom_start"),
filters=c("snp_filter"),
values=my_snp_ids,
mart=snp
)
refsnp_id allele chrom_start
1 rs547420070 A/C/G 148373
2 rs77274555 G/A/C/T 148391
Convert Ensembl gene IDs to HUGO Gene Nomenclature Committee (HGNC) gene symbols.
my_genes <- c('ENSG00000118473', 'ENSG00000162426')
getBM(
attributes=c('ensembl_gene_id', "hgnc_symbol", "description"),
filters = "ensembl_gene_id",
values=my_genes,
mart=ensembl
)
ensembl_gene_id hgnc_symbol
1 ENSG00000118473 SGIP1
2 ENSG00000162426 SLC45A1
description
1 SH3GL interacting endocytic adaptor 1 [Source:HGNC Symbol;Acc:HGNC:25412]
2 solute carrier family 45 member 1 [Source:HGNC Symbol;Acc:HGNC:17939]
Bioconductor provides annotation packages such as {org.Hs.eg.db}; here we compare biomaRt results with results using {org.Hs.eg.db}.
Install it if you haven’t already.
BiocManager::install("org.Hs.eg.db")
Get 100 Entrez Gene IDs.
suppressPackageStartupMessages(library(org.Hs.eg.db))
entrez_gene_ids <- head(keys(org.Hs.eg.db), 100)
length(entrez_gene_ids)
[1] 100
Convert them to Ensembl gene IDs.
AnnotationDbi::select(
org.Hs.eg.db,
keys = entrez_gene_ids,
columns=c("ENSEMBL","ENTREZID","SYMBOL","GENENAME"),
keytype="ENTREZID"
) -> org_table
'select()' returned 1:many mapping between keys and columns
head(org_table)
ENTREZID ENSEMBL SYMBOL GENENAME
1 1 ENSG00000121410 A1BG alpha-1-B glycoprotein
2 2 ENSG00000175899 A2M alpha-2-macroglobulin
3 3 ENSG00000291190 A2MP1 alpha-2-macroglobulin pseudogene 1
4 9 ENSG00000171428 NAT1 N-acetyltransferase 1
5 10 ENSG00000156006 NAT2 N-acetyltransferase 2
6 11 <NA> NATP N-acetyltransferase pseudogene
Perform a similar query using {biomaRt}.
getBM(
attributes=c('entrezgene_id', 'ensembl_gene_id', "hgnc_symbol", "description"),
filters = "entrezgene_id",
values=entrez_gene_ids,
mart=ensembl
) -> biomart_table
head(biomart_table)
entrezgene_id ensembl_gene_id hgnc_symbol
1 1 ENSG00000121410 A1BG
2 10 ENSG00000156006 NAT2
3 100 ENSG00000196839 ADA
4 101 ENSG00000151651 ADAM8
5 102 ENSG00000137845 ADAM10
6 103 ENSG00000160710 ADAR
description
1 alpha-1-B glycoprotein [Source:HGNC Symbol;Acc:HGNC:5]
2 N-acetyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:7646]
3 adenosine deaminase [Source:HGNC Symbol;Acc:HGNC:186]
4 ADAM metallopeptidase domain 8 [Source:HGNC Symbol;Acc:HGNC:215]
5 ADAM metallopeptidase domain 10 [Source:HGNC Symbol;Acc:HGNC:188]
6 adenosine deaminase RNA specific [Source:HGNC Symbol;Acc:HGNC:225]
Join tables.
biomart_table <- dplyr::mutate(biomart_table, entrezgene_id = as.character(entrezgene_id))
joint_table <- dplyr::full_join(x = org_table, y = biomart_table, by = dplyr::join_by(ENTREZID == entrezgene_id))
Warning in dplyr::full_join(x = org_table, y = biomart_table, by = dplyr::join_by(ENTREZID == : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 18 of `x` matches multiple rows in `y`.
ℹ Row 31 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
head(joint_table)
ENTREZID ENSEMBL SYMBOL GENENAME
1 1 ENSG00000121410 A1BG alpha-1-B glycoprotein
2 2 ENSG00000175899 A2M alpha-2-macroglobulin
3 3 ENSG00000291190 A2MP1 alpha-2-macroglobulin pseudogene 1
4 9 ENSG00000171428 NAT1 N-acetyltransferase 1
5 10 ENSG00000156006 NAT2 N-acetyltransferase 2
6 11 <NA> NATP N-acetyltransferase pseudogene
ensembl_gene_id hgnc_symbol
1 ENSG00000121410 A1BG
2 ENSG00000175899 A2M
3 ENSG00000291190
4 ENSG00000171428 NAT1
5 ENSG00000156006 NAT2
6 <NA> <NA>
description
1 alpha-1-B glycoprotein [Source:HGNC Symbol;Acc:HGNC:5]
2 alpha-2-macroglobulin [Source:HGNC Symbol;Acc:HGNC:7]
3 alpha-2-macroglobulin pseudogene 1 [Source:NCBI gene (formerly Entrezgene);Acc:3]
4 N-acetyltransferase 1 [Source:HGNC Symbol;Acc:HGNC:7645]
5 N-acetyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:7646]
6 <NA>
Comparison table.
joint_table |>
dplyr::filter(!is.na(ENSEMBL) & !is.na(ensembl_gene_id)) |>
dplyr::select(ENSEMBL, ensembl_gene_id) |>
dplyr::mutate(same = ENSEMBL == ensembl_gene_id) -> comp_table
table(comp_table$same)
FALSE TRUE
58 94
Check out the different IDs.
dplyr::filter(comp_table, same == FALSE) |>
head()
ENSEMBL ensembl_gene_id same
1 ENSG00000204574 ENSG00000236149 FALSE
2 ENSG00000204574 ENSG00000225989 FALSE
3 ENSG00000204574 ENSG00000232169 FALSE
4 ENSG00000204574 ENSG00000236342 FALSE
5 ENSG00000204574 ENSG00000206490 FALSE
6 ENSG00000204574 ENSG00000231129 FALSE
Check out some differences.
dplyr::filter(comp_table, same == FALSE) |>
head() |>
dplyr::pull(ensembl_gene_id) -> my_ensembl_gene_ids
AnnotationDbi::select(
org.Hs.eg.db,
keys = my_ensembl_gene_ids,
columns=c("ENSEMBL","ENTREZID","SYMBOL","GENENAME"),
keytype="ENSEMBL"
)
'select()' returned 1:1 mapping between keys and columns
ENSEMBL ENTREZID SYMBOL GENENAME
1 ENSG00000236149 23 ABCF1 ATP binding cassette subfamily F member 1
2 ENSG00000225989 23 ABCF1 ATP binding cassette subfamily F member 1
3 ENSG00000232169 23 ABCF1 ATP binding cassette subfamily F member 1
4 ENSG00000236342 23 ABCF1 ATP binding cassette subfamily F member 1
5 ENSG00000206490 23 ABCF1 ATP binding cassette subfamily F member 1
6 ENSG00000231129 23 ABCF1 ATP binding cassette subfamily F member 1
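In the rows shown above, the {org.Hs.eg.db} column holds a single Ensembl gene ID (ENSG00000204574) for Entrez Gene ID 23 (ABCF1), whereas {biomaRt} returned several other Ensembl gene IDs for the same Entrez Gene ID. Because the join was many-to-many (see the warning above), a FALSE row means only that one particular pairing differs; where both tables have several rows for an Entrez Gene ID, every row of one is paired with every row of the other.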
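The many-to-many warning matters for this comparison: when an Entrez Gene ID has several rows in both tables, the join pairs every row from one table with every row from the other, and most of those pairs differ even where the two sources agree. Below is a sketch of a per-gene comparison that avoids this, using made-up toy tables (org_toy, mart_toy and the IDs E1 to E4 are placeholders, not real data).

```r
# Toy tables; gene "23" has two Ensembl IDs in one source and three in the other.
org_toy <- data.frame(
  ENTREZID = c("1", "23", "23"),
  ENSEMBL = c("E1", "E2", "E3")
)
mart_toy <- data.frame(
  entrezgene_id = c("1", "23", "23", "23"),
  ensembl_gene_id = c("E1", "E2", "E3", "E4")
)

# Row-wise comparison after a many-to-many join: 2 x 3 = 6 pairs for gene
# "23", of which only 2 are equal.
joined <- merge(org_toy, mart_toy, by.x = "ENTREZID", by.y = "entrezgene_id")
table(joined$ENSEMBL == joined$ensembl_gene_id)

# Per-gene comparison: collect the set of Ensembl IDs per Entrez Gene ID
# and report what the second source has that the first lacks.
org_sets <- split(org_toy$ENSEMBL, org_toy$ENTREZID)
mart_sets <- split(mart_toy$ensembl_gene_id, mart_toy$entrezgene_id)
genes <- union(names(org_sets), names(mart_sets))

only_in_biomart <- lapply(
  setNames(genes, genes),
  function(g) setdiff(mart_sets[[g]], org_sets[[g]])
)
only_in_biomart[lengths(only_in_biomart) > 0]
```

In the toy data the row-wise comparison reports four mismatches, while the per-gene view shows that the only real difference is one extra ID (E4) in the second table.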
Get Ensembl gene IDs from an example differential gene expression results table.
edger_res <- readr::read_csv("https://raw.githubusercontent.com/davetang/muse/refs/heads/main/data/13970886_edger_res.csv", show_col_types = FALSE)
head(edger_res)
# A tibble: 6 × 6
ensembl_gene_id logFC logCPM F PValue adjusted_pvalue
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000000003 2.73 4.83 4.28 0.0684 0.109
2 ENSG00000000005 -7.00 0.541 17.6 0.00216 0.0138
3 ENSG00000000419 0.120 5.34 0.114 0.743 0.776
4 ENSG00000000457 -0.708 5.31 3.35 0.0993 0.145
5 ENSG00000000460 -0.897 3.95 2.66 0.136 0.186
6 ENSG00000000938 1.54 5.60 1.86 0.205 0.258
Convert to Entrez Gene IDs.
ensembl_to_entrez <- getBM(
attributes=c('ensembl_gene_id', "entrezgene_id"),
filters = "ensembl_gene_id",
values = edger_res$ensembl_gene_id,
mart = ensembl
)
head(ensembl_to_entrez)
ensembl_gene_id entrezgene_id
1 ENSG00000000003 7105
2 ENSG00000000005 64102
3 ENSG00000000419 8813
4 ENSG00000000457 57147
5 ENSG00000000460 55732
6 ENSG00000000938 2268
Number of missing Entrez Gene IDs.
sum(grepl(pattern = "^$", x = ensembl_to_entrez$entrezgene_id))
[1] 0
table(is.na(ensembl_to_entrez$entrezgene_id))
FALSE TRUE
24340 15000
Use {org.Hs.eg.db}.
AnnotationDbi::select(
org.Hs.eg.db,
keys = edger_res$ensembl_gene_id,
columns=c("ENSEMBL","ENTREZID"),
keytype="ENSEMBL"
) -> ensembl_to_entrez_org
'select()' returned 1:many mapping between keys and columns
head(ensembl_to_entrez_org)
ENSEMBL ENTREZID
1 ENSG00000000003 7105
2 ENSG00000000005 64102
3 ENSG00000000419 8813
4 ENSG00000000457 57147
5 ENSG00000000460 55732
6 ENSG00000000938 2268
Number of missing Entrez Gene IDs using {org.Hs.eg.db}.
sum(grepl(pattern = "^$", x = ensembl_to_entrez_org$ENTREZID))
[1] 0
table(is.na(ensembl_to_entrez_org$ENTREZID))
FALSE TRUE
28722 10968
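The counts above are row counts, and the two tables have different totals (39,340 rows from {biomaRt} and 39,690 from {org.Hs.eg.db}); the 1:many message shows that some Ensembl gene IDs occupy more than one row. Below is a sketch of counting per gene instead, on a made-up toy table (toy and the IDs G1 to G3 are placeholders).

```r
# Toy mapping table: G1 maps to two Entrez IDs, G3 has no mapping.
toy <- data.frame(
  ensembl_gene_id = c("G1", "G1", "G2", "G3"),
  entrezgene_id = c(11L, 12L, 21L, NA)
)

# Number of distinct genes with no Entrez Gene ID at all.
has_id <- tapply(!is.na(toy$entrezgene_id), toy$ensembl_gene_id, any)
n_unmapped <- sum(!has_id)

# Genes that map to more than one Entrez Gene ID.
n_ids <- tapply(
  toy$entrezgene_id, toy$ensembl_gene_id,
  function(x) length(unique(x[!is.na(x)]))
)
multi_mapped <- names(n_ids)[n_ids > 1]

n_unmapped
multi_mapped
```

Applied to ensembl_to_entrez and ensembl_to_entrez_org (with the column names adjusted), the same two summaries would put the two sources on a comparable per-gene footing.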
As I wrote in a blog post about converting Ensembl Gene IDs to gene symbols, I found the database dump that provides the lookup. Here we confirm whether the database dump generates the same results as using {biomaRt}.
Download and load database dump.
my_ensembl_ver <- '113'
my_url <- paste0("https://ftp.ensembl.org/pub/release-", my_ensembl_ver, "/mysql/ensembl_mart_", my_ensembl_ver, "/hsapiens_gene_ensembl__gene__main.txt.gz")
my_outfile <- paste0('/tmp/', basename(my_url))
db_dump <- download.file(url = my_url, destfile = my_outfile)
gene_db <- readr::read_tsv(file = my_outfile, col_names = FALSE, show_col_types = FALSE)
gene_db |>
dplyr::select(X7, X8) |>
dplyr::rename(ensembl_gene_id = X7, hgnc_symbol = X8) -> gene_db
head(gene_db)
# A tibble: 6 × 2
ensembl_gene_id hgnc_symbol
<chr> <chr>
1 ENSG00000210049 MT-TF
2 ENSG00000211459 MT-RNR1
3 ENSG00000210077 MT-TV
4 ENSG00000210082 MT-RNR2
5 ENSG00000209082 MT-TL1
6 ENSG00000198888 MT-ND1
Make a query.
ensembl <- biomaRt::useMart("ensembl", dataset='hsapiens_gene_ensembl')
gene_db_biomart <- biomaRt::getBM(
attributes=c('ensembl_gene_id', "hgnc_symbol"),
filters = "ensembl_gene_id",
values=gene_db$ensembl_gene_id,
mart=ensembl
)
Join and compare!
dplyr::inner_join(x = gene_db, y = gene_db_biomart, by = "ensembl_gene_id") |>
dplyr::mutate(same = hgnc_symbol.x == hgnc_symbol.y) |>
dplyr::filter(same == FALSE) |>
head()
# A tibble: 6 × 4
ensembl_gene_id hgnc_symbol.x hgnc_symbol.y same
<chr> <chr> <chr> <lgl>
1 ENSG00000299200 "\\N" "" FALSE
2 ENSG00000308964 "\\N" "" FALSE
3 ENSG00000303867 "\\N" "" FALSE
4 ENSG00000271254 "\\N" "" FALSE
5 ENSG00000278625 "U6" "" FALSE
6 ENSG00000278704 "\\N" "" FALSE
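Most of the mismatches above are only a difference in how a missing symbol is written: \N in the database dump and an empty string from {biomaRt}. Below is a sketch of converting both to NA before comparing, on a hand-typed copy of a few rows from the output above (cmp and to_na are names introduced here).

```r
# A few rows copied by hand from the output above.
cmp <- data.frame(
  ensembl_gene_id = c("ENSG00000299200", "ENSG00000278625", "ENSG00000271254"),
  dump = c("\\N", "U6", "\\N"),
  biomart = c("", "", "")
)

# Treat the dump's \N and biomaRt's empty string as missing.
to_na <- function(x) {
  x[x %in% c("\\N", "")] <- NA_character_
  x
}
cmp$dump <- to_na(cmp$dump)
cmp$biomart <- to_na(cmp$biomart)

# Rows that still disagree: a symbol on one side only, or two different symbols.
both_missing <- is.na(cmp$dump) & is.na(cmp$biomart)
both_equal <- !is.na(cmp$dump) & !is.na(cmp$biomart) & cmp$dump == cmp$biomart
agree <- both_missing | both_equal
cmp[!agree, ]
```

Only the U6 row remains, which is the kind of difference the following steps look at.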
Look for entries that are probably not gene symbols.
gene_db_table <- table(gene_db$hgnc_symbol)
sort(gene_db_table[gene_db_table != 1], decreasing = TRUE) |> head()
\\N Y_RNA Metazoa_SRP U6 LILRP2 U3
38023 845 216 117 72 53
Join and compare after removing ‘\N’!
gene_db |>
dplyr::filter(hgnc_symbol != '\\N') |>
dplyr::inner_join(y = gene_db_biomart, by = "ensembl_gene_id") |>
dplyr::mutate(same = hgnc_symbol.x == hgnc_symbol.y) |>
dplyr::filter(same == FALSE) |>
dplyr::pull(hgnc_symbol.x) |>
table() |>
sort(decreasing = TRUE) |>
head()
Y_RNA Metazoa_SRP U6 U3 U1 SNORA70
845 216 117 53 29 27
The remaining differences are mostly for non-coding RNAs.
From the help page of useMart():
archive - Boolean to indicate if you want to access archived versions of BioMart databases. Note that this argument is now deprecated and will be removed in the future. A better alternative is to leave archive = FALSE and to specify the url of the archived BioMart you want to access. For Ensembl you can view the list of archives using listEnsemblArchives
listEnsemblArchives()
name date url version
1 Ensembl GRCh37 Feb 2014 https://grch37.ensembl.org GRCh37
2 Ensembl 113 Oct 2024 https://oct2024.archive.ensembl.org 113
3 Ensembl 112 May 2024 https://may2024.archive.ensembl.org 112
4 Ensembl 111 Jan 2024 https://jan2024.archive.ensembl.org 111
5 Ensembl 110 Jul 2023 https://jul2023.archive.ensembl.org 110
6 Ensembl 109 Feb 2023 https://feb2023.archive.ensembl.org 109
7 Ensembl 108 Oct 2022 https://oct2022.archive.ensembl.org 108
8 Ensembl 107 Jul 2022 https://jul2022.archive.ensembl.org 107
9 Ensembl 106 Apr 2022 https://apr2022.archive.ensembl.org 106
10 Ensembl 105 Dec 2021 https://dec2021.archive.ensembl.org 105
11 Ensembl 104 May 2021 https://may2021.archive.ensembl.org 104
12 Ensembl 103 Feb 2021 https://feb2021.archive.ensembl.org 103
13 Ensembl 102 Nov 2020 https://nov2020.archive.ensembl.org 102
14 Ensembl 101 Aug 2020 https://aug2020.archive.ensembl.org 101
15 Ensembl 100 Apr 2020 https://apr2020.archive.ensembl.org 100
16 Ensembl 99 Jan 2020 https://jan2020.archive.ensembl.org 99
17 Ensembl 98 Sep 2019 https://sep2019.archive.ensembl.org 98
18 Ensembl 80 May 2015 https://may2015.archive.ensembl.org 80
19 Ensembl 77 Oct 2014 https://oct2014.archive.ensembl.org 77
20 Ensembl 75 Feb 2014 https://feb2014.archive.ensembl.org 75
21 Ensembl 54 May 2009 https://may2009.archive.ensembl.org 54
current_release
1
2 *
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
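The table returned by listEnsemblArchives() can be used to look up the host for a release rather than typing the URL by hand. Below is a sketch on a hand-typed copy of three rows from the output above; in a live session the data frame would come from listEnsemblArchives() itself, and archives, wanted and host are names introduced here.

```r
# Three rows copied by hand from the listEnsemblArchives() output above.
archives <- data.frame(
  name = c("Ensembl 113", "Ensembl 112", "Ensembl 111"),
  date = c("Oct 2024", "May 2024", "Jan 2024"),
  url = c(
    "https://oct2024.archive.ensembl.org",
    "https://may2024.archive.ensembl.org",
    "https://jan2024.archive.ensembl.org"
  ),
  version = c("113", "112", "111")
)

# Look up the archive host for one release; stop if it is not listed exactly once.
wanted <- "112"
host <- archives$url[archives$version == wanted]
stopifnot(length(host) == 1)
host
```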
Use https://may2024.archive.ensembl.org.
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", host = "https://may2024.archive.ensembl.org")
avail_datasets_v112 <- listDatasets(ensembl)
grep('sapien', avail_datasets_v112$dataset, value = TRUE)
[1] "hsapiens_gene_ensembl"
Use the hsapiens_gene_ensembl dataset.
ensembl <- useMart(
biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl",
host = "https://may2024.archive.ensembl.org"
)
Convert Ensembl gene IDs to HUGO Gene Nomenclature Committee (HGNC) gene symbols.
my_genes <- c('ENSG00000118473', 'ENSG00000162426')
getBM(
attributes=c('ensembl_gene_id', "hgnc_symbol", "description"),
filters = "ensembl_gene_id",
values=my_genes,
mart=ensembl
)
Error in .processResults(postRes, mart = mart, hostURLsep = sep, fullXmlQuery = fullXmlQuery, : Query ERROR: caught BioMart::Exception::Database: Error during query execution: Table 'ensembl_mart_112.hsapiens_gene_ensembl__ox_hgnc__dm' doesn't exist
The approach suggested in https://github.com/grimbough/biomaRt/issues/104, which uses useEnsembl() with a version argument, fails with the same error.
ensembl_112 <- useEnsembl(
biomart = "genes",
dataset = "hsapiens_gene_ensembl",
version = 112
)
getBM(
attributes=c('ensembl_gene_id', "hgnc_symbol", "description"),
filters = "ensembl_gene_id",
values=my_genes,
mart=ensembl_112
)
Error in .processResults(postRes, mart = mart, hostURLsep = sep, fullXmlQuery = fullXmlQuery, : Query ERROR: caught BioMart::Exception::Database: Error during query execution: Table 'ensembl_mart_112.hsapiens_gene_ensembl__ox_hgnc__dm' doesn't exist
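Since queries against archive hosts can fail like this, it may help to wrap them so that a failure is reported without stopping the rest of the analysis. Below is a generic sketch using tryCatch(); safe_query and failing_query are hypothetical names, and failing_query only simulates a failed query so that the example runs offline.

```r
# Run a query function; on error, report the message and return NULL.
safe_query <- function(query_fun, ...) {
  tryCatch(
    query_fun(...),
    error = function(e) {
      message("Query failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for a getBM() call that fails (simulated, no network access).
failing_query <- function(...) {
  stop("Query ERROR: caught BioMart::Exception::Database")
}

res <- safe_query(failing_query)
is.null(res)

# With a live mart, the call would look like this (not run):
# res <- safe_query(
#   getBM,
#   attributes = c("ensembl_gene_id", "hgnc_symbol", "description"),
#   filters = "ensembl_gene_id",
#   values = my_genes,
#   mart = ensembl
# )
```

Returning NULL makes the failure easy to test for later; this does not fix the missing table on the archive, it only keeps the rest of the document knitting.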
Connect to the GRCh37 archive, which hosts the last patch of hg19.
grch37 <- useMart(
biomart="ENSEMBL_MART_ENSEMBL",
host="https://grch37.ensembl.org",
path="/biomart/martservice"
)
grch37
The database timed out, so the code block below was not evaluated.
listDatasets(grch37)
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] org.Hs.eg.db_3.20.0 GO.db_3.20.0 AnnotationDbi_1.68.0
[4] IRanges_2.40.1 S4Vectors_0.44.0 Biobase_2.66.0
[7] BiocGenerics_0.52.0 biomaRt_2.62.0 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] KEGGREST_1.46.0 xfun_0.48 bslib_0.8.0
[4] httr2_1.0.7 processx_3.8.4 tzdb_0.4.0
[7] callr_3.7.6 generics_0.1.3 vctrs_0.6.5
[10] tools_4.4.1 ps_1.8.1 parallel_4.4.1
[13] curl_6.0.1 tibble_3.2.1 fansi_1.0.6
[16] RSQLite_2.3.9 blob_1.2.4 pkgconfig_2.0.3
[19] dbplyr_2.5.0 lifecycle_1.0.4 GenomeInfoDbData_1.2.13
[22] compiler_4.4.1 stringr_1.5.1 git2r_0.35.0
[25] Biostrings_2.74.1 progress_1.2.3 getPass_0.2-4
[28] httpuv_1.6.15 GenomeInfoDb_1.42.1 htmltools_0.5.8.1
[31] sass_0.4.9 yaml_2.3.10 later_1.3.2
[34] pillar_1.9.0 crayon_1.5.3 jquerylib_0.1.4
[37] whisker_0.4.1 cachem_1.1.0 tidyselect_1.2.1
[40] digest_0.6.37 stringi_1.8.4 purrr_1.0.2
[43] dplyr_1.1.4 rprojroot_2.0.4 fastmap_1.2.0
[46] cli_3.6.3 magrittr_2.0.3 utf8_1.2.4
[49] readr_2.1.5 withr_3.0.2 filelock_1.0.3
[52] prettyunits_1.2.0 UCSC.utils_1.2.0 promises_1.3.0
[55] rappdirs_0.3.3 bit64_4.5.2 rmarkdown_2.28
[58] XVector_0.46.0 httr_1.4.7 bit_4.5.0
[61] png_0.1-8 hms_1.1.3 memoise_2.0.1
[64] evaluate_1.0.1 knitr_1.48 BiocFileCache_2.14.0
[67] rlang_1.1.4 Rcpp_1.0.13 glue_1.8.0
[70] DBI_1.2.3 xml2_1.3.6 vroom_1.6.5
[73] rstudioapi_0.17.1 jsonlite_1.8.9 R6_2.5.1
[76] fs_1.6.4 zlibbioc_1.52.0