NCBI Matching/Linking

Screenshot of an Advanced Processing menu with 'NCBI Matching-Linking' highlighted. Bibliographic References, Duplicate Reference Check, and URL Check have already been processed

NCBI Matching/Linking looks for accession numbers that refer to entries in the NCBI databases and in some EBI databases.

If it finds a match, it applies a database-specific character style to the accession number and creates a hyperlink to the corresponding record in that database.

Example:

When the module sees this text:

More recently, closely related strains were also isolated … from the ciliate Collinia sp. endoparasitic in euphausiids from the Gulf of California (unpublished GenBank record EU090136), and in a culture-independent analysis of the microbial burden and diversity in commercial airline cabins [7].

it applies the character style db_ncbi_genbank and this link to the text “EU090136”:

http://www.ncbi.nlm.nih.gov/nuccore/156744481

Clicking on this link brings up the NCBI GenBank record for this sequence. The linked text is displayed:

 

Matching ranges of accession numbers

If the module finds a range of accession numbers, it embeds a link that includes each accession number in the range (up to a user-configurable limit, set to 100 by default).

Example:

When the module sees this text:

The GenBank/EMBL/DDBJ accession numbers for the bovine RVC sequences determined in this study are AB738402–AB738417, as detailed in Fig. 1.

it applies the character style to the entire range and generates this link:

http://www.ncbi.nlm.nih.gov/nuccore/430726479%20430726477%20430726475%20430726473%20430726471%20430726469%20430726467%20430726465%20430726463%20430726461%20430726459%20430726457%20430726455%20430726453%20430726451%20430726449

which brings up a list of all 16 accession numbers in the range.

If a range contains more accession numbers than the maximum limit (100 by default), the NCBI Matching/Linking module still styles the range as shown in the previous example, but the link points only to the first and last accession numbers in the range, not to the intermediate values.
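The range behavior described above can be sketched in Python. This is an illustrative sketch, not the module's actual implementation: the function names, the regular expression, and the use of accession strings in the URL (the examples above link by numeric record ID instead) are all assumptions made for the example.

```python
import re
from urllib.parse import quote

# Base URL copied from the examples above; everything else is hypothetical.
BASE_URL = "http://www.ncbi.nlm.nih.gov/nuccore/"
MAX_RANGE = 100  # default limit described in the text

# Assumed range shape: PREFIX+digits, an en dash or hyphen, PREFIX+digits.
RANGE_RE = re.compile(r"^([A-Z]+)(\d+)\s*[–-]\s*([A-Z]+)(\d+)$")


def expand_range(text, limit=MAX_RANGE):
    """Return the accession numbers that a link for `text` would cover."""
    m = RANGE_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not an accession range: {text!r}")
    prefix, start, end_prefix, end = m.groups()
    if prefix != end_prefix or len(start) != len(end) or int(start) > int(end):
        raise ValueError(f"inconsistent range endpoints: {text!r}")
    first, last = int(start), int(end)
    if last - first + 1 > limit:
        # Over the limit: link only to the first and last accession numbers.
        return [prefix + start, prefix + end]
    width = len(start)  # preserve zero padding in the numeric part
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


def build_link(accessions):
    """Join identifiers with an encoded space (%20), as in the example link."""
    return BASE_URL + quote(" ".join(accessions))
```

For example, `expand_range("AB738402–AB738417")` yields the 16 accession numbers from the example, while a 500-entry range collapses to its two endpoints.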

 

How to use

To use NCBI Matching/Linking:

  1. Select eXtyles > Advanced Processing > NCBI Matching/Linking.

 

Databases Linked

NCBI hosts a number of sequence and structure databases for nucleotide and amino acid sequence data. It also hosts literature databases, most obviously PubMed and PubMed Central; databases that contain other original data, such as the Gene Expression Omnibus (GEO) database; and sources of secondary data, such as the HIV-1, Human Protein Interaction database and MedGen.

A full list can be found at http://www.ncbi.nlm.nih.gov/guide/all/#databases.

The NCBI Matching/Linking module also links to a few EBI databases; more information about EBI can be found at http://www.ebi.ac.uk.

NCBI Matching/Linking uses 17 different character styles to indicate the database that has been matched. These are shown in the following tables:

 

EBI Databases

Character style                 Database
db_ebi_ArrayExpress_Array       A database of genome arrays used in ArrayExpress experiments
db_ebi_ArrayExpress_Experiment  A database of genomics experiments, results of which are included in ArrayExpress
db_ebi_ArrayExpress_GEO         Functional genomics data, either submitted directly to ArrayExpress or imported from the NCBI GEO database

 

NCBI Databases

Character style     Database
db_ncbi_ccds        CCDS, the Consensus CDS Project
db_ncbi_dbGap       dbGaP, the Database of Genotypes and Phenotypes
db_ncbi_dbnsp       dbSNP, the Database of Short Genetic Variations or single nucleotide polymorphisms (SNPs)
db_ncbi_entrezgene  Entrez Gene
db_ncbi_genbank     GenBank – both accession numbers and GI numbers are recognized
db_ncbi_genpept     Translated protein sequences from GenBank (see Entrez Protein)
db_ncbi_geo         GEO, Gene Expression Omnibus database
db_ncbi_omim        OMIM, Online Mendelian Inheritance in Man
db_ncbi_pdb         Protein Data Bank; these records form part of Entrez Protein
db_ncbi_refseq      RefSeq, the NCBI Reference Sequence database
db_ncbi_SRA         SRA, the Sequence Read Archive, containing raw sequence data from next-generation sequencing platforms
db_ncbi_swissprot   Swiss-Prot protein sequence database; these records form part of Entrez Protein
db_ncbi_unigene     UniGene*
db_ncbi_other       Records that match an NCBI database covered by the module but not listed above

*Note that as of July 2019, the UniGene database and web interface have been retired.

 

Databases Not Linked

Not all of the NCBI databases are covered by NCBI Matching/Linking. Specifically, the module does not attempt to link to the following databases:

 

Excluded Document Parts

The module excludes the reference section of the document from the matching process by default.

It is also possible to configure the module to omit other paragraph styles on a customer-specific basis.

Example:

If accession numbers never appear in the acknowledgments section of your content, that paragraph style could be excluded, which may help to avoid false-positive matches against grant numbers.

Similarly, you could exclude affiliations if complex room numbers or postal codes trigger warnings.
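As a rough illustration of style-based exclusion, the following Python sketch filters paragraphs before any matching takes place. It is a sketch under stated assumptions only: the `Paragraph` class, the `paragraphs_to_scan` function, and the style names `Reference` and `Acknowledgments` are hypothetical and do not come from the module's configuration.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    style: str  # paragraph style name (hypothetical values used below)
    text: str


# Assumed default: only the reference section is excluded.
DEFAULT_EXCLUDED = frozenset({"Reference"})


def paragraphs_to_scan(paragraphs, extra_excluded=()):
    """Return the paragraphs whose style is not on the exclusion list."""
    excluded = DEFAULT_EXCLUDED | set(extra_excluded)
    return [p for p in paragraphs if p.style not in excluded]
```

Passing `extra_excluded={"Acknowledgments"}` would keep grant numbers in that section from ever being considered as candidate accession numbers.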

 

How It Works

Each of the NCBI databases has certain constraints on the form that its accession numbers can take.

NCBI Matching/Linking uses these rules to look for strings in the text that might match one of the databases.

If a string that matches the rules for one of the databases is found, the module queries that database to see whether it contains an accession that matches the found string and, if it does, it links that string in the text and applies the appropriate character style.

If at least one match is found during this “first pass”, NCBI Matching/Linking then runs a second, less strict pass, under the assumption that, if the document contains at least one accession number, it’s worth looking for other patterns that might be accession numbers that contain errors or that have not yet been released by the database.

During this second pass, the module looks for strings in the text that loosely match the database rules but don’t correspond to an entry in the database. The module then attaches a Word comment to those pieces of text to alert the editor to the possibility that they correspond to a database record.

The second pass takes place only if the module has already found at least one database match in the document; this reduces the likelihood of false positives in content from subject areas where accession numbers do not appear.

The module also uses some semantic clues to tell it that a potential match is either more likely or less likely to match one of the databases.
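The two-pass logic described in this section can be sketched as follows. This is a simplified illustration, not the module's implementation: the two patterns are invented for the example (real accession rules differ per database), `record_exists` is a hypothetical stand-in for the database query, and the semantic clues mentioned above are not modeled.

```python
import re

# Illustrative patterns only; real accession formats vary by database.
STRICT = re.compile(r"\b[A-Z]{2}\d{6}\b")    # e.g. two letters + six digits
LOOSE = re.compile(r"\b[A-Z]{1,2}\d{5,7}\b")  # a looser variant of the same shape


def two_pass(text, record_exists):
    """Return (strings to link, strings to flag with a comment).

    `record_exists` is a caller-supplied function that stands in for
    querying the database; it takes a string and returns True or False.
    """
    # First pass: strict pattern, confirmed against the database.
    links = [m.group() for m in STRICT.finditer(text) if record_exists(m.group())]

    comments = []
    if links:  # second pass runs only after at least one confirmed match
        for m in LOOSE.finditer(text):
            candidate = m.group()
            # Flag loose matches that have no corresponding record.
            if candidate not in links and not record_exists(candidate):
                comments.append(candidate)
    return links, comments
```

With a lookup that knows only `EU090136`, a text mentioning both `EU090136` and the truncated `EU09013` would link the first and flag the second, while a text containing no confirmed accession number would produce neither links nor comments.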

 

Copyright © 2022 Atypon Systems, LLC. All Rights Reserved.