Last updated: 2024-10-24

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version e6a2c58. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/pbmc3k.csv
    Ignored:    data/pbmc3k.csv.gz
    Ignored:    data/pbmc3k/
    Ignored:    r_packages_4.4.0/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/biomart.Rmd) and HTML (docs/biomart.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd e6a2c58 Dave Tang 2024-10-24 Compare with org.Hs.eg.db
html 6e8540e Dave Tang 2024-10-24 Build site.
Rmd ff049cd Dave Tang 2024-10-24 Using biomaRt

The biomaRt package provides an interface to BioMart databases provided by Ensembl.

biomaRt provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. The most prominent examples of BioMart databases are maintain by Ensembl, which provides biomaRt users direct access to a diverse set of data and enables a wide range of powerful online queries from gene annotation to database mining.

For more information, check out the Accessing Ensembl annotation with biomaRt guide.

Installation

To begin, install the {biomaRt} package.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("biomaRt")

Package

Load package.

packageVersion("biomaRt")
[1] '2.60.1'
suppressPackageStartupMessages(library(biomaRt))

If you are using Ubuntu and get a “Cannot find xml2-config” error while installing the {XML} package, a dependency of {biomaRt}, try installing (or asking the sysadmin to install) libxml2-dev:

sudo apt-get install libxml2-dev

Getting started

List the available BioMart databases.

listMarts()
               biomart                version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 113
2   ENSEMBL_MART_MOUSE      Mouse strains 113
3     ENSEMBL_MART_SNP  Ensembl Variation 113
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 113

Connect to the selected BioMart database by using useMart().

ensembl <- useMart("ENSEMBL_MART_ENSEMBL")
avail_datasets <- listDatasets(ensembl)
head(avail_datasets)
                       dataset                           description
1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
2     acalliptera_gene_ensembl      Eastern happy genes (fAstCal1.3)
3   acarolinensis_gene_ensembl       Green anole genes (AnoCar2.0v2)
4    acchrysaetos_gene_ensembl       Golden eagle genes (bAquChr1.2)
5    acitrinellus_gene_ensembl        Midas cichlid genes (Midas_v5)
6    amelanoleuca_gene_ensembl       Giant panda genes (ASM200744v2)
      version
1 ASM259213v1
2  fAstCal1.3
3 AnoCar2.0v2
4  bAquChr1.2
5    Midas_v5
6 ASM200744v2

Look for human datasets by searching the description column.

idx <- grep('human', avail_datasets$description, ignore.case = TRUE)
avail_datasets[idx, ]
                 dataset              description    version
80 hsapiens_gene_ensembl Human genes (GRCh38.p14) GRCh38.p14

Connect to the selected BioMart database and human dataset.

ensembl <- useMart("ensembl", dataset=avail_datasets[idx, 'dataset'])
ensembl
Object of class 'Mart':
  Using the ENSEMBL_MART_ENSEMBL BioMart database
  Using the hsapiens_gene_ensembl dataset

Building a query, requires three things:

  1. filters
  2. attributes
  3. values

Use listFilters() to show available filters.

avail_filters <- listFilters(ensembl)
head(avail_filters)
             name              description
1 chromosome_name Chromosome/scaffold name
2           start                    Start
3             end                      End
4      band_start               Band Start
5        band_end                 Band End
6    marker_start             Marker Start

Use listAttributes() to show available attributes.

avail_attributes <- listAttributes(ensembl)
head(avail_attributes)
                           name                  description         page
1               ensembl_gene_id               Gene stable ID feature_page
2       ensembl_gene_id_version       Gene stable ID version feature_page
3         ensembl_transcript_id         Transcript stable ID feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5            ensembl_peptide_id            Protein stable ID feature_page
6    ensembl_peptide_id_version    Protein stable ID version feature_page

The getBM() function is the main query function in {biomaRt}; use it once you have identified your attributes of interest and filters to use. Here’s an example that converts Affymetrix microarray probe IDs for a specific platform into Entrez Gene IDs and their descriptions.

affyids <- c("202763_at", "209310_s_at", "207500_at")

getBM(
  attributes=c('affy_hg_u133_plus_2', 'entrezgene_id', 'entrezgene_description'),
  filters = 'affy_hg_u133_plus_2',
  values = affyids,
  mart = ensembl
)
  affy_hg_u133_plus_2 entrezgene_id entrezgene_description
1         209310_s_at           837              caspase 4
2           207500_at           838              caspase 5
3           202763_at           836              caspase 3

Human RefSeq to Entrez

Look for filters with RefSeq.

grep("refseq", avail_filters$name, ignore.case=TRUE, value=TRUE)
 [1] "with_refseq_mrna"              "with_refseq_mrna_predicted"   
 [3] "with_refseq_ncrna"             "with_refseq_ncrna_predicted"  
 [5] "with_refseq_peptide"           "with_refseq_peptide_predicted"
 [7] "refseq_mrna"                   "refseq_mrna_predicted"        
 [9] "refseq_ncrna"                  "refseq_ncrna_predicted"       
[11] "refseq_peptide"                "refseq_peptide_predicted"     

RefSeq information for ACTB.

my_refseq <- 'NM_001101'

getBM(
  attributes = c('refseq_mrna', 'ensembl_gene_id', 'description'),
  filters = 'refseq_mrna',
  values = my_refseq,
  mart = ensembl
)
  refseq_mrna ensembl_gene_id                                  description
1   NM_001101 ENSG00000075624 actin beta [Source:HGNC Symbol;Acc:HGNC:132]

Gene Ontology terms

Find GO attribute names.

grep("^go", avail_attributes$name, ignore.case=TRUE, value=TRUE)
[1] "go_id"                  "go_linkage_type"        "goslim_goa_accession"  
[4] "goslim_goa_description"

Find Ensembl filters.

grep("^ensembl", avail_filters$name, ignore.case=TRUE, value=TRUE)
[1] "ensembl_gene_id"               "ensembl_gene_id_version"      
[3] "ensembl_transcript_id"         "ensembl_transcript_id_version"
[5] "ensembl_peptide_id"            "ensembl_peptide_id_version"   
[7] "ensembl_exon_id"              

ENSG00000075624 is the Ensembl gene ID for DMD, which stands for Dystrophin; it encodes the dystrophin protein. Here’s a query that obtains the GO terms associated with DMD.

dmd <- 'ENSG00000075624'
getBM(
  attributes=c("go_id"),
  filters="ensembl_gene_id",
  values = dmd,
  mart = ensembl
) -> dmd_go
tail(dmd_go)
        go_id
89 GO:0097433
90 GO:1900242
91 GO:0005903
92 GO:0030863
93 GO:0044305
94 GO:0098685

Use Term() to get GO terms for the GO IDs.

suppressPackageStartupMessages(library("GO.db"))
AnnotationDbi::Term(dmd_go$go_id) |>
  as.data.frame() |>
  tail()
                    AnnotationDbi::Term(dmd_go$go_id)
GO:0097433                                 dense body
GO:1900242 regulation of synaptic vesicle endocytosis
GO:0005903                               brush border
GO:0030863                      cortical cytoskeleton
GO:0044305                              calyx of Held
GO:0098685          Schaffer collateral - CA1 synapse

Use GOTERM to get more information on a term.

my_go_id <- 'GO:0098685'
class(GOTERM)
[1] "GOTermsAnnDbBimap"
attr(,"package")
[1] "AnnotationDbi"
GOTERM[[my_go_id]]
GOID: GO:0098685
Term: Schaffer collateral - CA1 synapse
Ontology: CC
Definition: A synapse between the Schaffer collateral axon of a CA3
    pyramidal cell and a CA1 pyramidal cell.

SNPs

Use the SNP database.

snp <- useMart("ENSEMBL_MART_SNP")
avail_snp_datasets <- listDatasets(snp)
head(avail_snp_datasets)
            dataset
1       btaurus_snp
2 btaurus_structvar
3       chircus_snp
4        drerio_snp
5  drerio_structvar
6     ecaballus_snp
                                                                     description
1   Cow Short Variants (SNPs and indels excluding flagged variants) (ARS-UCD1.3)
2                                           Cow Structural Variants (ARS-UCD1.3)
3        Goat Short Variants (SNPs and indels excluding flagged variants) (ARS1)
4 Zebrafish Short Variants (SNPs and indels excluding flagged variants) (GRCz11)
5                                         Zebrafish Structural Variants (GRCz11)
6  Horse Short Variants (SNPs and indels excluding flagged variants) (EquCab3.0)
     version
1 ARS-UCD1.3
2 ARS-UCD1.3
3       ARS1
4     GRCz11
5     GRCz11
6  EquCab3.0

Look for human datasets.

idx <- grep('human', avail_snp_datasets$description, ignore.case = TRUE)
avail_snp_datasets[idx, ]
                  dataset
10           hsapiens_snp
11       hsapiens_snp_som
12     hsapiens_structvar
13 hsapiens_structvar_som
                                                                              description
10         Human Short Variants (SNPs and indels excluding flagged variants) (GRCh38.p14)
11 Human Somatic Short Variants (SNPs and indels excluding flagged variants) (GRCh38.p14)
12                                                 Human Structural Variants (GRCh38.p14)
13                                         Human Somatic Structural Variants (GRCh38.p14)
      version
10 GRCh38.p14
11 GRCh38.p14
12 GRCh38.p14
13 GRCh38.p14

Get SNPs within a genomic location.

snp <- useMart("ENSEMBL_MART_SNP", dataset="hsapiens_snp")

my_snps <- getBM(
  attributes=c("refsnp_id","allele","chrom_start"),
  filters=c("chr_name","start","end"),
  values=list(8,148350, 149000),
  mart=snp
)

rbind(
  head(my_snps, 3),
  tail(my_snps, 3)
)
       refsnp_id allele chrom_start
1   rs1450830176    G/C      148350
2   rs1360310185  C/A/T      148352
3   rs1434776028    A/T      148353
243 rs1435594779    C/G      148998
244 rs1800825262  C/G/T      148999
245 rs1800825282    G/A      149000

Get SNP information with SNP IDs.

my_snp_ids <- c('rs547420070', 'rs77274555')
 
getBM(
  attributes=c("refsnp_id","allele","chrom_start"),
  filters=c("snp_filter"),
  values=my_snp_ids,
  mart=snp
)
    refsnp_id  allele chrom_start
1 rs547420070   A/C/G      148373
2  rs77274555 G/A/C/T      148391

Gene symbols

Convert Ensembl gene IDs to HUGO Gene Nomenclature Committee (HGNC) gene symbols.

my_genes <- c('ENSG00000118473', 'ENSG00000162426')
 
getBM(
  attributes=c('ensembl_gene_id', "hgnc_symbol", "description"),
  filters = "ensembl_gene_id",
  values=my_genes,
  mart=ensembl
)
  ensembl_gene_id hgnc_symbol
1 ENSG00000118473       SGIP1
2 ENSG00000162426     SLC45A1
                                                                description
1 SH3GL interacting endocytic adaptor 1 [Source:HGNC Symbol;Acc:HGNC:25412]
2     solute carrier family 45 member 1 [Source:HGNC Symbol;Acc:HGNC:17939]

org.Hs.eg.db

Bioconductor provides annotation packages such as {org.Hs.eg.db}; here we compare biomaRt results with results using {org.Hs.eg.db}.

Install it if you haven’t already.

BiocManager::install("org.Hs.eg.db")

Get 100 Entrez Gene IDs.

suppressPackageStartupMessages(library(org.Hs.eg.db))
entrez_gene_ids <- head(keys(org.Hs.eg.db), 100)
length(entrez_gene_ids)
[1] 100

Convert them to Ensembl gene IDs.

AnnotationDbi::select(
  org.Hs.eg.db,
  keys = entrez_gene_ids,
  columns=c("ENSEMBL","ENTREZID","SYMBOL","GENENAME"),
  keytype="ENTREZID"
) -> org_table
'select()' returned 1:many mapping between keys and columns
head(org_table)
  ENTREZID         ENSEMBL SYMBOL                           GENENAME
1        1 ENSG00000121410   A1BG             alpha-1-B glycoprotein
2        2 ENSG00000175899    A2M              alpha-2-macroglobulin
3        3 ENSG00000291190  A2MP1 alpha-2-macroglobulin pseudogene 1
4        9 ENSG00000171428   NAT1              N-acetyltransferase 1
5       10 ENSG00000156006   NAT2              N-acetyltransferase 2
6       11            <NA>   NATP     N-acetyltransferase pseudogene

Perform similar query using {biomaRt}.

getBM(
  attributes=c('entrezgene_id', 'ensembl_gene_id', "hgnc_symbol", "description"),
  filters = "entrezgene_id",
  values=entrez_gene_ids,
  mart=ensembl
) -> biomart_table

head(biomart_table)
  entrezgene_id ensembl_gene_id hgnc_symbol
1             1 ENSG00000121410        A1BG
2            10 ENSG00000156006        NAT2
3           100 ENSG00000196839         ADA
4           101 ENSG00000151651       ADAM8
5           102 ENSG00000137845      ADAM10
6           103 ENSG00000160710        ADAR
                                                         description
1             alpha-1-B glycoprotein [Source:HGNC Symbol;Acc:HGNC:5]
2           N-acetyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:7646]
3              adenosine deaminase [Source:HGNC Symbol;Acc:HGNC:186]
4   ADAM metallopeptidase domain 8 [Source:HGNC Symbol;Acc:HGNC:215]
5  ADAM metallopeptidase domain 10 [Source:HGNC Symbol;Acc:HGNC:188]
6 adenosine deaminase RNA specific [Source:HGNC Symbol;Acc:HGNC:225]

Join tables.

biomart_table <- dplyr::mutate(biomart_table, entrezgene_id = as.character(entrezgene_id))

joint_table <- dplyr::full_join(x = org_table, y = biomart_table, by = dplyr::join_by(ENTREZID == entrezgene_id))
Warning in dplyr::full_join(x = org_table, y = biomart_table, by = dplyr::join_by(ENTREZID == : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 18 of `x` matches multiple rows in `y`.
ℹ Row 31 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
head(joint_table)
  ENTREZID         ENSEMBL SYMBOL                           GENENAME
1        1 ENSG00000121410   A1BG             alpha-1-B glycoprotein
2        2 ENSG00000175899    A2M              alpha-2-macroglobulin
3        3 ENSG00000291190  A2MP1 alpha-2-macroglobulin pseudogene 1
4        9 ENSG00000171428   NAT1              N-acetyltransferase 1
5       10 ENSG00000156006   NAT2              N-acetyltransferase 2
6       11            <NA>   NATP     N-acetyltransferase pseudogene
  ensembl_gene_id hgnc_symbol
1 ENSG00000121410        A1BG
2 ENSG00000175899         A2M
3 ENSG00000291190            
4 ENSG00000171428        NAT1
5 ENSG00000156006        NAT2
6            <NA>        <NA>
                                                                        description
1                            alpha-1-B glycoprotein [Source:HGNC Symbol;Acc:HGNC:5]
2                             alpha-2-macroglobulin [Source:HGNC Symbol;Acc:HGNC:7]
3 alpha-2-macroglobulin pseudogene 1 [Source:NCBI gene (formerly Entrezgene);Acc:3]
4                          N-acetyltransferase 1 [Source:HGNC Symbol;Acc:HGNC:7645]
5                          N-acetyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:7646]
6                                                                              <NA>

Comparison table.

joint_table |>
  dplyr::filter(!is.na(ENSEMBL) & !is.na(ensembl_gene_id)) |>
  dplyr::select(ENSEMBL, ensembl_gene_id) |>
  dplyr::mutate(same = ENSEMBL == ensembl_gene_id) -> comp_table

table(comp_table$same)

FALSE  TRUE 
   58    94 

Check out the different IDs.

dplyr::filter(comp_table, same == FALSE) |>
  head()
          ENSEMBL ensembl_gene_id  same
1 ENSG00000204574 ENSG00000236149 FALSE
2 ENSG00000204574 ENSG00000225989 FALSE
3 ENSG00000204574 ENSG00000232169 FALSE
4 ENSG00000204574 ENSG00000236342 FALSE
5 ENSG00000204574 ENSG00000206490 FALSE
6 ENSG00000204574 ENSG00000231129 FALSE

Check out some differences.

dplyr::filter(comp_table, same == FALSE) |>
  head() |>
  dplyr::pull(ensembl_gene_id) -> my_ensembl_gene_ids

AnnotationDbi::select(
  org.Hs.eg.db,
  keys = my_ensembl_gene_ids,
  columns=c("ENSEMBL","ENTREZID","SYMBOL","GENENAME"),
  keytype="ENSEMBL"
)
'select()' returned 1:1 mapping between keys and columns
          ENSEMBL ENTREZID SYMBOL                                  GENENAME
1 ENSG00000236149       23  ABCF1 ATP binding cassette subfamily F member 1
2 ENSG00000225989       23  ABCF1 ATP binding cassette subfamily F member 1
3 ENSG00000232169       23  ABCF1 ATP binding cassette subfamily F member 1
4 ENSG00000236342       23  ABCF1 ATP binding cassette subfamily F member 1
5 ENSG00000206490       23  ABCF1 ATP binding cassette subfamily F member 1
6 ENSG00000231129       23  ABCF1 ATP binding cassette subfamily F member 1

{org.Hs.eg.db} matched an Entrez Gene ID to one Ensembl gene ID, whereas {biomaRt} matched an Entrez Gene ID to all possible Ensembl gene IDs.

Older reference assemblies

Last patch of hg19.

grch37 <- useMart(
  biomart="ENSEMBL_MART_ENSEMBL",
  host="https://grch37.ensembl.org",
  path="/biomart/martservice"
)

grch37

Database timed out and the code block below is not evaluated.

listDatasets(grch37)

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] org.Hs.eg.db_3.19.1  GO.db_3.19.1         AnnotationDbi_1.66.0
[4] IRanges_2.38.1       S4Vectors_0.42.1     Biobase_2.64.0      
[7] BiocGenerics_0.50.0  biomaRt_2.60.1       workflowr_1.7.1     

loaded via a namespace (and not attached):
 [1] KEGGREST_1.44.1         xfun_0.44               bslib_0.7.0            
 [4] httr2_1.0.2             processx_3.8.4          callr_3.7.6            
 [7] generics_0.1.3          vctrs_0.6.5             tools_4.4.0            
[10] ps_1.7.6                curl_5.2.1              tibble_3.2.1           
[13] fansi_1.0.6             RSQLite_2.3.7           blob_1.2.4             
[16] pkgconfig_2.0.3         dbplyr_2.5.0            lifecycle_1.0.4        
[19] GenomeInfoDbData_1.2.12 compiler_4.4.0          stringr_1.5.1          
[22] git2r_0.33.0            Biostrings_2.72.1       progress_1.2.3         
[25] getPass_0.2-4           httpuv_1.6.15           GenomeInfoDb_1.40.1    
[28] htmltools_0.5.8.1       sass_0.4.9              yaml_2.3.8             
[31] later_1.3.2             pillar_1.9.0            crayon_1.5.2           
[34] jquerylib_0.1.4         whisker_0.4.1           cachem_1.1.0           
[37] tidyselect_1.2.1        digest_0.6.37           stringi_1.8.4          
[40] purrr_1.0.2             dplyr_1.1.4             rprojroot_2.0.4        
[43] fastmap_1.2.0           cli_3.6.3               magrittr_2.0.3         
[46] utf8_1.2.4              withr_3.0.1             filelock_1.0.3         
[49] prettyunits_1.2.0       UCSC.utils_1.0.0        promises_1.3.0         
[52] rappdirs_0.3.3          bit64_4.0.5             rmarkdown_2.27         
[55] XVector_0.44.0          httr_1.4.7              bit_4.0.5              
[58] png_0.1-8               hms_1.1.3               memoise_2.0.1          
[61] evaluate_0.24.0         knitr_1.47              BiocFileCache_2.12.0   
[64] rlang_1.1.4             Rcpp_1.0.12             glue_1.7.0             
[67] DBI_1.2.3               xml2_1.3.6              rstudioapi_0.16.0      
[70] jsonlite_1.8.8          R6_2.5.1                fs_1.6.4               
[73] zlibbioc_1.50.0