FindRelated is a companion tool to Publication Harvester that works with a database of publications previously downloaded from PubMed using Publication Harvester. FindRelated uses the Related Citations search to find and harvest all of the publications related to publications already in the database.
Software downloads:
The Publication Harvester software runs on Windows 7, 8, and 10 (and probably runs fine on previous versions). It was written in C#, and requires .NET Framework 4.0. (This should already be installed if you're running a current version of Windows.)
The following sample file may be helpful:
FindRelated uses the following data file format:
setnb,pmid
X0000001,12764489
X0000001,9474027
X0000002,17130168
X0000002,12682366
X0000002,12625820
example: sample-findrelatedi-input.csv
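For readers who want to prepare or check an input file programmatically, here is a minimal Python sketch of reading this format. It is not part of FindRelated; the function name `read_pairs` is hypothetical, and the only assumption about the file is the `setnb,pmid` header row shown in the format description.

```python
import csv
import io

def read_pairs(handle):
    """Return (setnb, pmid) tuples from a FindRelated-style input file."""
    pairs = []
    for row in csv.DictReader(handle):
        setnb = row["setnb"].strip()
        pmid = int(row["pmid"])  # PMIDs are stored as integers in the database
        pairs.append((setnb, pmid))
    return pairs

# The sample rows from the format description, inlined so the sketch runs
# without a file on disk.
SAMPLE = """setnb,pmid
X0000001,12764489
X0000001,9474027
X0000002,17130168
X0000002,12682366
X0000002,12625820
"""

pairs = read_pairs(io.StringIO(SAMPLE))
print(len(pairs), "pairs read; first pair:", pairs[0])
```

To read a real file instead, pass `open("input.csv", newline="")` in place of the `io.StringIO` object.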
The Related Citations search uses the Elink query to retrieve related citation data from PubMed. The following links have additional information about this query:
For each pair of setnb/PMID in the input, FindRelated uses the Elink query to retrieve the list of related articles, harvests them into the Publication Harvester database, and adds the rank and score to the related publications table specified by the user:
+-------------+---------+------+-----+---------+-------+
| Field       | Type    | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| PMID        | int(11) | NO   | PRI | NULL    |       |
| RelatedPMID | int(11) | NO   | PRI | NULL    |       |
| Rank        | int(11) | NO   |     | NULL    |       |
| Score       | int(11) | NO   |     | NULL    |       |
+-------------+---------+------+-----+---------+-------+
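The following Python sketch illustrates how an Elink response could be turned into rows for this table. It is an illustration only, not FindRelated's actual code (which is written in C#). The `cmd=neighbor_score` parameter is the E-utilities option that returns related IDs with scores; whether FindRelated sends exactly these parameters is an assumption, as is deriving the rank from the order of links in the response. The XML sample is hand-written with made-up scores and is not real PubMed output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def build_elink_url(pmid):
    """Build an ELink URL asking for related PubMed articles with scores."""
    params = {"dbfrom": "pubmed", "db": "pubmed", "id": str(pmid),
              "cmd": "neighbor_score"}
    return ELINK_URL + "?" + urlencode(params)

def parse_related(xml_text, source_pmid):
    """Return (PMID, RelatedPMID, Rank, Score) rows from an ELink response."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for linkset_db in root.iter("LinkSetDb"):
        if linkset_db.findtext("LinkName") != "pubmed_pubmed":
            continue  # only the related-articles link set is of interest
        # Assumption: rank is the 1-based position of the link in the response.
        for rank, link in enumerate(linkset_db.findall("Link"), start=1):
            rows.append((source_pmid, int(link.findtext("Id")),
                         rank, int(link.findtext("Score"))))
    return rows

# Hand-written sample response (scores are invented for illustration).
SAMPLE_XML = """
<eLinkResult><LinkSet><DbFrom>pubmed</DbFrom>
<IdList><Id>12764489</Id></IdList>
<LinkSetDb><DbTo>pubmed</DbTo><LinkName>pubmed_pubmed</LinkName>
<Link><Id>9474027</Id><Score>500</Score></Link>
<Link><Id>17130168</Id><Score>300</Score></Link>
</LinkSetDb></LinkSet></eLinkResult>
"""

request_url = build_elink_url(12764489)
related_rows = parse_related(SAMPLE_XML, 12764489)
print(request_url)
print(related_rows)
```

The sketch deliberately makes no network request; fetching `request_url` with an HTTP client and passing the response text to `parse_related` would be the next step.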
The user can specify filters using the FindRelated form:
Once the related publications are harvested, FindRelated can generate reports:
The linking report contains the list of pairs of source PMID and related PMID:
-- Linking Report
SELECT PMID AS source_pmid, RelatedPMID AS related_pmid,
       Rank AS link_ranking, Score AS link_score
FROM relatedpublications
The related PMID report contains the harvested information for each related publication found:
-- Related PMID report
SELECT DISTINCT rp.RelatedPMID AS related_pmid, p.journal, p.authors,
       p.year, p.month, p.day, p.title, p.volume, p.issue, p.pages,
       p.pubtype, p.pubtypecategoryid
FROM relatedpublications rp, publications p
WHERE rp.RelatedPMID = p.PMID
The related MeSH report contains a list of MeSH headings for each related publication:
-- Related MeSH report
SELECT DISTINCT rp.RelatedPMID AS related_pmid, mh.Heading AS related_mesh
FROM relatedpublications rp, publicationmeshheadings pmh, meshheadings mh
WHERE rp.RelatedPMID = pmh.PMID
  AND pmh.MeSHHeadingID = mh.ID
The extreme relevance report lists every source PMID along with its most relevant related PMID (i.e. the one with the highest score) and that PMID's relatedness score, followed by its least relevant related PMID with that PMID's relatedness score and rank.
-- Extreme Relevance report
SELECT PMID AS source_pmid,
       MostRelevantPMID AS most_rlvnt_pmid, MostRelevantScore AS most_rlvnt_score,
       LeastRelevantPMID AS least_rlvnt_pmid, LeastRelevantScore AS least_rlvnt_score,
       LeastRelevantRank AS least_rlvnt_rank
FROM relatedpublications_extremerelevance
Note: In the above queries, relatedpublications is replaced with the name of the table generated by FindRelated (e.g. for the extreme relevance report, if the user specified relatedxyz as the table name, it would query against the table relatedxyz_extremerelevance).
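To make the columns of the extreme relevance report concrete, the sketch below derives them from a handful of related-publication rows. This documentation does not describe how FindRelated populates its extreme relevance table, so the logic here is an assumption: "most relevant" is taken as the highest score and "least relevant" as the lowest score among the rows present. The function name and the sample scores are hypothetical.

```python
def extreme_relevance(rows):
    """Summarize (pmid, related_pmid, rank, score) rows per source PMID."""
    by_source = {}
    for pmid, related, rank, score in rows:
        by_source.setdefault(pmid, []).append((related, rank, score))

    summary = {}
    for pmid, links in by_source.items():
        most = max(links, key=lambda link: link[2])   # highest score
        least = min(links, key=lambda link: link[2])  # lowest score
        summary[pmid] = {
            "MostRelevantPMID": most[0],
            "MostRelevantScore": most[2],
            "LeastRelevantPMID": least[0],
            "LeastRelevantScore": least[2],
            "LeastRelevantRank": least[1],
        }
    return summary

# Invented rows: one source PMID with three related PMIDs.
sample_rows = [
    (12764489, 9474027, 1, 500),
    (12764489, 17130168, 2, 300),
    (12764489, 12682366, 3, 120),
]

summary = extreme_relevance(sample_rows)
print(summary[12764489])
```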
FindRelated can retrieve colleagues in the "idea space" by interacting with SC/Gen. It automatically creates a view by appending _peoplepublications to the related publications table name:
CREATE OR REPLACE VIEW relatedpublications_peoplepublications AS
SELECT p.Setnb, rp.RelatedPMID AS PMID, -1 AS AuthorPosition, 6 AS PositionType
FROM people p, peoplepublications pp, relatedpublications rp
WHERE p.Setnb = pp.Setnb
  AND pp.PMID = rp.PMID;
This view is used in conjunction with SC/Gen, which can use it as an alternate people publications table. This causes SC/Gen to find colleagues and harvest publications in the "idea space", where a colleague is any author in the roster that coauthored a related paper.
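To see what rows the view produces, the following Python sketch recreates it in an in-memory SQLite database. SQLite stands in for MySQL here only so the example runs anywhere, which is why it uses plain `CREATE VIEW` rather than `CREATE OR REPLACE VIEW`. The table definitions are heavily simplified and the data is invented; only the body of the view is taken from the definition above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-ins for the real tables (assumed columns only).
CREATE TABLE people (Setnb TEXT PRIMARY KEY);
CREATE TABLE peoplepublications (Setnb TEXT, PMID INTEGER);
CREATE TABLE relatedpublications (
    PMID INTEGER, RelatedPMID INTEGER, "Rank" INTEGER, Score INTEGER
);

-- Invented data: one person, one source publication, two related ones.
INSERT INTO people VALUES ('X0000001');
INSERT INTO peoplepublications VALUES ('X0000001', 12764489);
INSERT INTO relatedpublications VALUES (12764489, 17130168, 1, 500);
INSERT INTO relatedpublications VALUES (12764489, 12682366, 2, 300);

CREATE VIEW relatedpublications_peoplepublications AS
SELECT p.Setnb, rp.RelatedPMID AS PMID, -1 AS AuthorPosition, 6 AS PositionType
FROM people p, peoplepublications pp, relatedpublications rp
WHERE p.Setnb = pp.Setnb
  AND pp.PMID = rp.PMID;
""")

view_rows = conn.execute(
    "SELECT Setnb, PMID, AuthorPosition, PositionType "
    "FROM relatedpublications_peoplepublications ORDER BY PMID"
).fetchall()
for row in view_rows:
    print(row)
```

Each related publication appears as if it belonged to the person who authored the source publication, with the fixed author position -1 and position type 6 from the view definition.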
Once the related colleagues are found, the FindRelated idea peer report is enabled. This report shows the list of peers found for each star, with a row for each peer publication including the position type (which is documented in the Publication Harvester documentation):
-- Idea peer report, with author position and position type for the colleagues based on the related publication
SELECT sc.StarSetnb AS star_setnb, sc.setnb,
       rp.PMID AS source_pmid, rp.RelatedPMID AS related_pmid,
       cp.AuthorPosition AS author_position, cp.PositionType AS position_type
FROM starcolleagues sc, peoplepublications pp, relatedpublications rp
LEFT JOIN colleaguepublications cp ON (cp.PMID = rp.RelatedPMID)
WHERE sc.StarSetnb = pp.Setnb
  AND pp.PMID = rp.PMID
  AND cp.Setnb = sc.Setnb
FindRelated is built for fault tolerance, so that its runs can be interrupted at any time without losing data. This is done by reading the input file into a table (the table name is derived by appending _queue to the name of the related publications table):
+-----------+---------+------+-----+---------+-------+
| Field     | Type    | Null | Key | Default | Extra |
+-----------+---------+------+-----+---------+-------+
| Setnb     | char(8) | NO   | PRI | NULL    |       |
| PMID      | int(11) | NO   | PRI | NULL    |       |
| Processed | bit(1)  | YES  |     | NULL    |       |
| Error     | bit(1)  | YES  |     | NULL    |       |
+-----------+---------+------+-----+---------+-------+
Data is loaded into this queue automatically when you specify an input filename and click the "Start" button. The program works by first reading each Setnb/PMID pair from each row in the input file, adding those pairs to the queue table, and then processing all of the pairs as usual. Each time a pair is successfully processed, its Processed column is changed from 0 to 1. If an error occurs, its Processed column is set to 0 and its Error column is set to 1. This is how FindRelated keeps track of its queue of remaining pairs to be processed.
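The queue bookkeeping described above can be sketched in a few lines of Python. This is not FindRelated's implementation: SQLite stands in for MySQL, `fake_harvest` is a hypothetical placeholder for the real Elink-and-harvest step, and committing after every pair is an assumption about how interruption safety might be achieved. The sketch also picks up previously errored pairs again on the next call; whether FindRelated retries them is not stated here.

```python
import sqlite3

QUEUE = "relatedpublications_queue"  # related publications table name + "_queue"

def process_queue(conn, handler):
    """Call handler(setnb, pmid) for each unprocessed pair and record the outcome."""
    pending = conn.execute(
        f"SELECT Setnb, PMID FROM {QUEUE} WHERE Processed = 0"
    ).fetchall()
    for setnb, pmid in pending:
        try:
            handler(setnb, pmid)
        except Exception:
            # Failure: leave the pair unprocessed and flag the error.
            conn.execute(
                f"UPDATE {QUEUE} SET Processed = 0, Error = 1 "
                "WHERE Setnb = ? AND PMID = ?", (setnb, pmid))
        else:
            # Success: mark the pair as processed.
            conn.execute(
                f"UPDATE {QUEUE} SET Processed = 1 "
                "WHERE Setnb = ? AND PMID = ?", (setnb, pmid))
        conn.commit()  # commit per pair so an interruption loses no finished work

conn = sqlite3.connect(":memory:")
conn.execute(
    f"CREATE TABLE {QUEUE} (Setnb TEXT, PMID INTEGER, "
    "Processed INTEGER DEFAULT 0, Error INTEGER DEFAULT 0, "
    "PRIMARY KEY (Setnb, PMID))")
conn.executemany(
    f"INSERT INTO {QUEUE} (Setnb, PMID) VALUES (?, ?)",
    [("X0000001", 12764489), ("X0000001", 9474027), ("X0000002", 17130168)])
conn.commit()

def fake_harvest(setnb, pmid):
    # Hypothetical stand-in for harvesting; fails for one PMID to show the error path.
    if pmid == 9474027:
        raise RuntimeError("simulated failure")

process_queue(conn, fake_harvest)
queue_state = conn.execute(
    f"SELECT PMID, Processed, Error FROM {QUEUE} ORDER BY PMID").fetchall()
print(queue_state)
```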
When you select a database from the dropdown and specify a related publications table name, the program queries the database to see if any unprocessed pairs are in the queue. If there are pairs remaining, it will display an error in the log indicating the number of pairs, and how many of those pairs are errors. To resume the run where it left off, click the "Resume" button. If you click the "Start" button, the existing tables (including the queue table) will be truncated and repopulated from the beginning.
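The check for leftover work can be pictured as a single query against the queue table. The sketch below is an assumption about what such a check might look like, not the query FindRelated actually runs; SQLite stands in for MySQL, and the function name `pending_summary` and the sample rows are hypothetical.

```python
import sqlite3

def pending_summary(conn, queue_table):
    """Return (unprocessed_count, error_count) for the given queue table."""
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not queue_table.replace("_", "").isalnum():
        raise ValueError("unexpected table name: " + queue_table)
    row = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(Error), 0) "
        f"FROM {queue_table} WHERE Processed = 0"
    ).fetchone()
    return row[0], row[1]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relatedxyz_queue "
    "(Setnb TEXT, PMID INTEGER, Processed INTEGER, Error INTEGER)")
conn.executemany(
    "INSERT INTO relatedxyz_queue VALUES (?, ?, ?, ?)",
    [("X0000001", 12764489, 1, 0),   # already processed
     ("X0000001", 9474027, 0, 1),    # failed earlier
     ("X0000002", 17130168, 0, 0)])  # not yet attempted

remaining, errors = pending_summary(conn, "relatedxyz_queue")
print(f"{remaining} pairs remaining, {errors} of them errors")
```

A nonzero first value corresponds to the situation in which the "Resume" button is the right choice.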
When the "lite" mode checkbox is checked, FindRelated runs in "lite" mode. This changes the behavior in the following ways:
We need to keep track of the score of the most relevant publication even when it is filtered out.
When the related publications are filtered, the most highly related publication overall sometimes does not survive the filtering. When that happens, its score cannot be used as a normalizing factor to assess closeness in idea space for the filtered publications. The current workaround is to rerun the entire process including only the top-ranked related publication, which is cumbersome. One possible solution is a separate MySQL table with three columns: source_pmid, the most related PMID, and the relatedness score for that PMID.
This software is released under the GNU General Public License (GPL).
The Publication Harvester project is maintained by Andrew Stellman of Stellman & Greene Consulting. If you have questions, comments, patches, or bug reports, please contact pubharvester@stellman-greene.com.
We gratefully acknowledge the financial support of the National Science Foundation (Award SBE-0738142).