FindRelated

Description

FindRelated is a companion tool to Publication Harvester that works with a database of publications previously downloaded from PubMed using Publication Harvester. FindRelated uses the Related Citations search to find and harvest all of the publications related to publications already in the database.

Screenshot

Download

Software downloads:

The latest version of FindRelated can be downloaded as part of the Publication Harvester suite. The latest version can be downloaded from the GitHub release page

The C# source code can be downloaded from Github. (The source for t his project is combined with the source code from its sister project, Publicati onHarvester. The Publication Harvester project is an open source project.. Yo u'll need Visual Studio to compile it. It compiles just fine in Visual C# 2013 Express Edition fo r Desktop, which you can download for free from Microsoft's website.

The Publication Harvester software runs on Windows 7, 8, and 10 (and probably runs fine on previous versions). It was written in C#, and requires .NET Framework 4.0. (This should already be installed if you're running a current version of Windows.)

The following sample file may be helpful:

sample-findrelated-input.csv

Documentation

Data sources

FindRelated uses the following data file format:

setnb,pmid
X0000001,12764489
X0000001,9474027
X0000002,17130168
X0000002,12682366
X0000002,12625820

example: sample-findrelatedi-input.csv

The Related Citations search uses the Elink query to retrieve related citation data from PubMed. The following links have additional information about this query:

Related citations for document 21860326: https://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&cmd=link&linkname=pubmed_pubmed&uid=21860326
Elink query that retrieves the raw XML results for the same related publications: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=21860326&db=pubmed&cmd=neighbor_score
Elink section in the NCBI E-utilites documentation

Basic Operation

For each pair of setnb/PMID in the input, FindRelated uses the Elink query to retrieve the list of related articles, harvests them into the Publication Harvester database, and adds the rank and score to the related publications table specified by the user:

+-------------+---------+------+-----+---------+-------+
| Field       | Type    | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| PMID        | int(11) | NO   | PRI | NULL    |       |
| RelatedPMID | int(11) | NO   | PRI | NULL    |       |
| Rank        | int(11) | NO   |     | NULL    |       |
| Score       | int(11) | NO   |     | NULL    |       |
+-------------+---------+------+-----+---------+-------+

The user can specify filters using the FindRelated form:

Restrict the publication date for any related publication to a publication window (number of years before and and after the article)
Restrict to publication type categories (see the Publication Harvester documentation for an explanation of categories)
Set a maximum for the link ranking to retrieve (eg. only retrieve the top ten links for any publication)
Only retrieve related publications in the same journal
Restrict languages using abbreviations in the Medline language table

Once the related publications are harvested, FindRelated can generate reports:

The linking report contains the list of pairs of source PMID and related PMID:

-- Linking Report
SELECT PMID AS source_pmid, RelatedPMID AS related_pmid,
Rank AS link_ranking, Score AS link_score
FROM relatedpublications

The related PMID report contains the harvested information for each related publications found:

-- Related PMID report
SELECT DISTINCT rp.RelatedPMID AS related_pmid, 
p.journal, p.authors, p.year, p.month, p.day, p.title, p.volume, p.issue, p.pages, p.pubtype, p.pubtypecategoryid
FROM relatedpublications rp, publications p
WHERE rp.RelatedPMID = p.PMID

The related MeSH report contains a list of MeSH headings for each related publication:

-- Related MeSH report
SELECT DISTINCT rp.RelatedPMID AS related_pmid, mh.Heading AS related_mesh
FROM relatedpublications RP, publicationmeshheadings pmh, meshheadings mh
WHERE RP.RelatedPMID = pmh.PMID
AND pmh.MeSHHeadingID = mh.ID

The extreme relvance report contains a list of all of the source PMIDs, the most relevant related PMID (eg. the one with the highest score), its relatedness score, the least relevant related PMID, and its relatedness score and rank.

-- Extreme Relevance report
SELECT PMID as source_pmid, MostRelevantPMID as most_rlvnt_pmid, MostRelevantScore as most_rlvnt_score, 
LeastRelevantPMID as least_rlvnt_pmid, LeastRelevantScore as least_rlvnt_score, LeastRelevantRank as least_rlvnt_rank
FROM relatedpublications_extremerelevance

Note: In the above queries, relatedpublications is replaced with the name of the table generated by FindRelated (eg. for the most relevant report, if the user specified relatedxyz as the table name, it would query against the table relatedxyz_mostrelevant.

Interaction with SC/Gen

FindRelated can retrieve colleagues in the "idea space" by interacting with SC/Gen. It automatically creates a view by appending _peoplepublications to the related publications table name:

CREATE OR REPLACE VIEW relatedpublications_peoplepublications AS
SELECT p.Setnb, rp.RelatedPMID AS PMID, -1 AS AuthorPosition, 6 AS PositionType
FROM people p, peoplepublications pp, relatedpublications rp
WHERE p.Setnb = pp.Setnb
AND pp.PMID = rp.PMID;

This view is used in conjunction with SC/Gen, which can use it as an alternate people publications table. This causes SC/Gen to find colleagues and harvest publications in the "idea space", where a colleague is any author in the roster that coauthored a related paper.

Once the related colleagues are found, the FindRelated idea peer report is enabled. This report shows the list of peers found for each star, with a row for each peer publication including the position type (which is documented in the Publication Harvester documentation):

-- Idea peer report, with author position and position type for the colleagues based on the related publication
SELECT sc.StarSetnb AS star_setnb, sc.setnb,
rp.PMID AS source_pmid, rp.RelatedPMID AS related_pmid,
cp.AuthorPosition as author_position, cp.PositionType as position_type
FROM starcolleagues sc, peoplepublications pp,
   relatedpublications rp LEFT JOIN colleaguepublications cp ON (cp.PMID = rp.RelatedPMID)
WHERE sc.StarSetnb = pp.Setnb
AND pp.PMID = rp.PMID
AND cp.Setnb = sc.Setnb

Fault tolerance

FindRelated is built for fault tolerance, so that its runs can be interrupeted at any time without losing data. This is done by reading the input file into a table (the table name is the derived by appending _queue to the name of the related publications table):

+-----------+---------+------+-----+---------+-------+
| Field     | Type    | Null | Key | Default | Extra |
+-----------+---------+------+-----+---------+-------+
| Setnb     | char(8) | NO   | PRI | NULL    |       |
| PMID      | int(11) | NO   | PRI | NULL    |       |
| Processed | bit(1)  | YES  |     | NULL    |       |
| Error     | bit(1)  | YES  |     | NULL    |       |
+-----------+---------+------+-----+---------+-------+

Data is loaded into this queue automatically when you specify an input filename and click the "Start" button. The program works by first reading each Setnb/PMID pair from each row in the input file, adding those pairs to the queue table, and then processing all of the pairs as usual. Each time a pair is successfully processed, its Processed column is changed from 0 to 1. If an error occurs, its Processed column is set to 0 and its Error column is set to 1. This is how FindRelated keeps track of its queue of remaining pairs to be processed.

When you select a database from the dropdown and specify a related publications table name, the program queries the database to see if any unprocessed pairs are in the queue. If there are pairs remaining, it will display an error in the log indicating the number of pairs, and how many of those pairs are errors. To resume the run where it left off, click the "Resume" button. If you click the "Start" button, the existing tables (including the queue table) will be truncated and repopulated from the beginning.

"Lite" mode

When the "lite" mode checkbox is checked, FindRelated runs in "lite" mode. This changes the behavior in the following ways:

An output file must be specified
Only the related publication and queue tables are created
The input file is read, and the related publication search is run
The related publications are written to the table
The output file is written as a CSV file with only four columns: pmid, rltd_pmid, rltd_rank, and rltd_score
Publication filtering is disabled
Stop and resume behavior works the same way as in normal mode
No additional publication tables are required—this does not need to be run on an previously harvested database, and no additional harvesting is done

Revision history

PublicationHarvester 1.1.0.5 -- 31-Aug-2019
See GitHub release page for details
v1.0.19 -- 19-Apr-2019
Added support for NCBI API keys (add to api_key.txt in the same folder as PubMed.dll)
v1.0.18 -- 12-Oct-2017
Changed program to use https:// instead of http:// for NCBI API calls
v1.0.17 -- 13-Dec-2014
Fixed problem that caused the run to halt when the remote server drops the connection
v1.0.16 -- 28-Sep-2013
Added "lite" mode
v1.0.14 -- 04-Aug-2012
Modified the Related PMID report
v1.0.13 -- 02-Jul-2012
Added "Resume" button and fault tolerance, modified to create UTF8 tables, added least relevant columns to extreme relevance report (previously called the most relevant report)
v1.0.10 -- 05-Jun-2012
Fixed header for Idea Peer report in FindRelated
v1.0.9 -- 13-May-2012
Added 'most relevant' handling. Here's the description:
We need to keep track of the score of the most relevant pub even when it is filtered out.
When we filter the related pubs, sometimes the most highly related pubs overall will not survive the filtering. As a result there is no way to use its score as a normalizing factor to assess the closeness in idea space for the filtered pubs. The solution right now is to rerun the entire stuff by including only the top ranked related pub. This is cumbersome. I can imagine a separate MySQL table that has three columns: source_pmid, most related pmid, relatedness score for that pmid.
v1.0.8 -- 14-Jan-2012
Handled errors that caused runs to stop
v1.0.7 -- 07-Jan-2012
Handled web exception during search that caused runs to stop
v1.0.6 -- 11-Dec-2011
Fixed CSV file reading to remove errors, allow for numeric PMIDs, increase speed
v1.0.5 -- 10-Sep-2011
Implemented reports, saved settings, language filter
v1.0.4 -- 13-Aug-2011
Added filters, updated UI

License

This software is released under the GNU General Public License (GPL).

Contact Information

The Publication Harvester project is maintained by Andrew Stellman of Stellman & Greene Consulting. If you have questions, comments, patches, or bug reports, please contact pubharvester@stellman-greene.com.

We gratefully acknowledgement is given to the financial support of the National Science Foundation (Award SBE-0738142).