Metawise is a unique and powerful ontology that can identify, translate and connect medically relevant relationships expressed in scientific content. Metawise uses intelligent metadata that is aligned to both reference (e.g. accession number, HGNC term, MedDRA term) and referent (e.g. Disease, Drug, Target, Pathway, Tissue) categories to harmonize differing terminology and semantics. It can also be used to translate terms used by different communities or software applications (e.g. switching from a two-dimensional chemical structure to a drug name or a SMILES string). This extends the utility and value of the data whilst preserving the integrity of the original data.
How does a big organisation know who are the internal specialists in breast cancer? In HPLC? In kinase inhibitors? It’s a difficult question, and one that gets harder as the organisation grows; yet without the answers, valuable time and knowledge can be wasted. Instem Scientific have recently completed a project with one pharmaceutical company that came up with an ingenious solution.
They collated all the publication records for any scientist within their organisation, from a number of sources (Medline, Embase, Biosis, Scopus, and Current Contents) into an Endnote dataset, and wanted to annotate each abstract with key categories – therapeutic areas, drug classes, drug processing techniques, analytical technologies. Then, they would use these categories to “tag” the abstract authors with that skill set, and hey presto! a searchable data set of internal talent and expertise.
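The tagging step described above can be sketched as a simple inverted index from category to author. This is only an illustration: the record layout, the author names and the `build_expertise_index` and `find_specialists` helpers are all hypothetical, and the real project worked from an Endnote dataset rather than Python dictionaries.

```python
from collections import defaultdict

# Hypothetical annotated records: the authors of each abstract plus the
# categories assigned to it. Names and categories are made up for illustration.
abstracts = [
    {"authors": ["A. Smith", "B. Jones"], "categories": ["Oncology", "breast cancer"]},
    {"authors": ["B. Jones"], "categories": ["Chromatography", "HPLC"]},
    {"authors": ["C. Patel", "A. Smith"], "categories": ["kinase inhibitors", "Oncology"]},
]


def build_expertise_index(records):
    """Map each category (lower-cased) to a per-author count of tagged abstracts."""
    index = defaultdict(lambda: defaultdict(int))
    for record in records:
        for category in record["categories"]:
            for author in record["authors"]:
                index[category.lower()][author] += 1
    return index


def find_specialists(index, category):
    """Return authors tagged with a category, most abstracts first, ties by name."""
    counts = index.get(category.lower(), {})
    return sorted(counts, key=lambda author: (-counts[author], author))


index = build_expertise_index(abstracts)
print(find_specialists(index, "Oncology"))
```

Ranking by the number of tagged abstracts is one possible choice; a real system might also weight by publication date or author position.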
Of course, many abstract providers already give sets of keywords, but these can be sporadic, and assembling abstracts from several different sources means that the keywords aren’t consistent across the whole set. And this is where Instem Scientific came in. We used our data integration and harmonisation tools (Metawise) and Instem Scientific’s life science vocabularies to annotate all the abstracts with both keywords and categories across biomedical observations (e.g. Oncology, breast cancer), protein target classes (GPCR, 5-HT2B), analytical techniques (Chromatography, size exclusion-HPLC) and others.
Instem’s Metawise is a concept identification and translation engine that utilizes proprietary entity recognition algorithms to identify key terms in text using a unique approach based on term structure and semantics. This means that however someone has described a particular type of cancer or protein class, Metawise can mark up the text and find the appropriate key concept, hugely enhancing both recall and accuracy.
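To make the idea of mapping varied wording onto one key concept concrete, here is a deliberately simple dictionary-lookup sketch. It is not Metawise's method, which is proprietary and based on term structure and semantics; the `SYNONYMS` table, its entries and the `annotate` function are assumptions made purely for illustration.

```python
import re

# Hypothetical synonym table: surface form -> (key concept, category).
# A production vocabulary would be far larger and curated by domain experts.
SYNONYMS = {
    "breast cancer": ("breast cancer", "Oncology"),
    "breast carcinoma": ("breast cancer", "Oncology"),
    "mammary carcinoma": ("breast cancer", "Oncology"),
    "hplc": ("HPLC", "Chromatography"),
    "high performance liquid chromatography": ("HPLC", "Chromatography"),
}

# Try longer terms first so multi-word phrases win over any shorter overlap.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(term) for term in sorted(SYNONYMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def annotate(text):
    """Return (matched text, key concept, category) for each known term in text."""
    return [
        (match.group(0),) + SYNONYMS[match.group(0).lower()]
        for match in _PATTERN.finditer(text)
    ]


print(annotate("Mammary carcinoma samples were analysed by HPLC."))
```

A plain lookup like this only finds terms that are already listed, so it cannot recognise new spellings or phrasings.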
Obviously, although this project was on public domain publication records, exactly the same process could be run across internal reports, documents, emails etc. – again so that any organisation can make best use of its own resources.
The Pharma and Healthcare industries are currently buzzing with discussions around better use of data, re-use of current and legacy data, and translation of knowledge and discoveries from preclinical into clinical, and back again (e.g. “Ensuring patient safety during clinical trials; translation to preclinical drug discovery”. David Cook & James Milligan, AstraZeneca. European Pharmaceutical Review; 3 September 2012; http://www.europeanpharmaceuticalreview.com/tag/patient-safety/).
One of the main obstacles to this is the vast range of vocabulary used to describe findings, pathologies, and diseases. It varies hugely between pre-clinical and clinical, within pre-clinical studies, and even between scientists working for the same institution. This lack of consistency makes the sharing, searching and analysing of data a huge problem.
One way forwards is the INHAND (International Harmonization of Nomenclature and Diagnostic Criteria) proposal (http://www.toxpath.org/inhand_112105.pdf), the aim of which is to create a controlled vocabulary based around proliferative and non-proliferative lesions in rats and mice, to be used across pre-clinical studies. It is a joint initiative between the STP (http://www.toxpath.org/) in the US, UK, Europe and Japan, as well as RITA (Registry of Industrial Toxicology Animal-data; http://reni.item.fraunhofer.de/reni/public/rita/).
INHAND consists of 15 working groups, each of which covers vocabulary relating to a particular organ or group of organs. So far, 7 guides have been published, the most recent being from the Mammary, Zymbal’s, Preputial, and Clitoral Glands working group, and from the Male Reproductive System working group, both published last month (August 2012; http://tpx.sagepub.com/content/40/6_suppl.toc?etoc).
INHAND has international support, and is likely to become the standard which scientific reports and papers adhere to for pre-clinical pathologies. However, the guide documents so far published by the organ working groups are in PDF format, and are not usable as a controlled terminology. It is also still a young system, and has not yet built up a vocabulary as comprehensive as other available ontologies; nor does it enable harmonisation across to the clinical world.
We at Instem have built up a very broad vocabulary around biomedical observations, pathological and physiological, clinical and pre-clinical, utilising existing terminologies (such as Gene Ontology, OMIM, MedDRA, SNOMED) and also our own text-analytics engine, Metawise. This enables entity recognition for new biomedical concepts in free text and other sources, and mapping and translation to existing vocabularies. We applied this toolset to the INHAND PDF guides, and were able to produce a consistent, searchable vocabulary from them. This work was presented at the 31st Society of Toxicologic Pathology Annual Symposium in June 2012, in a poster entitled “Metawise: Extraction and Normalisation of Toxicologic Pathology Terminology from the INHAND project for Enhanced Search”. This poster generated a lot of interest from people involved in INHAND, particularly within the organ working groups, and also from scientists at a number of pharmaceutical companies. The poster is available to view below.
INHAND is expected to produce the rest of the guide documents by the end of 2013, thus covering all organ groups. It is well on its way to becoming a key part of pre-clinical data management, and these vocabularies can be used within a harmonised overarching terminology, as a step along the path to data harmonisation across the industry.
As of January 2012, DailyMed (http://dailymed.nlm.nih.gov/) featured labels for over 32,000 drug products submitted to the FDA. These labels are a rich source of drug safety intelligence, but their free text format and inconsistent terminology use make them difficult to interrogate or use with other applications.
BioWisdom’s Metawise enabled the identification of key medical terms as well as the creation of assertional metadata. Approximately 380,000 assertions were curated from 5,000 DailyMed labels. They describe the side effects, black box warnings and laboratory tests for over 700 medicinal products, including highly-prescribed and black box warning-containing drugs. The data were originally built as a decision support tool for a US government agency and the approach was presented at the 2011 OpenTox conference in Munich.
These data have now been integrated into SIP, allowing complex questions to be answered such as:
- Which laboratory tests should be considered when prescribing ACE inhibitors?
- Which black box warnings occur in medicines that list haemolytic anaemia as an adverse event?
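Questions like these amount to a filter followed by a count over the curated assertions. The sketch below assumes a simple (drug, relation, value) triple layout; that layout, the relation names, the helper functions and every row of data are invented for illustration and do not reflect SIP's actual data model or real label content.

```python
from collections import Counter

# Invented assertions in (drug, relation, value) form. Drug names, warnings
# and pairings are placeholders, not facts taken from DailyMed labels.
ASSERTIONS = [
    ("drug_a", "has_class", "ACE inhibitor"),
    ("drug_b", "has_class", "ACE inhibitor"),
    ("drug_c", "has_class", "other class"),
    ("drug_a", "has_lab_test", "serum potassium"),
    ("drug_a", "has_lab_test", "renal function"),
    ("drug_b", "has_lab_test", "serum potassium"),
    ("drug_c", "has_lab_test", "liver function"),
    ("drug_b", "has_adverse_event", "haemolytic anaemia"),
    ("drug_c", "has_adverse_event", "haemolytic anaemia"),
    ("drug_b", "has_boxed_warning", "warning A"),
    ("drug_c", "has_boxed_warning", "warning A"),
    ("drug_c", "has_boxed_warning", "warning B"),
]


def drugs_with(relation, value):
    """Return the set of drugs that have the given relation/value assertion."""
    return {drug for drug, rel, val in ASSERTIONS if rel == relation and val == value}


def most_common_values(drugs, relation):
    """Count the values of one relation across a set of drugs, most frequent first."""
    counts = Counter(
        val for drug, rel, val in ASSERTIONS if drug in drugs and rel == relation
    )
    return counts.most_common()


# Laboratory tests listed for drugs in the ACE inhibitor class.
ace_inhibitors = drugs_with("has_class", "ACE inhibitor")
print(most_common_values(ace_inhibitors, "has_lab_test"))

# Boxed warnings on drugs that list haemolytic anaemia as an adverse event.
anaemia_drugs = drugs_with("has_adverse_event", "haemolytic anaemia")
print(most_common_values(anaemia_drugs, "has_boxed_warning"))
```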
The DailyMed data further enhance SIP’s content, particularly in the area of clinical adverse drug effects. The drug label information also benefits hugely from the SIP search interface, which offers a wide range of summary and dashboard views, querying against a rich collection of synonyms and taxonomic “tags”, and chemical structure searches. Example summary views for the questions listed above are presented in Figure 1.
Figure 1: A list of the most common laboratory tests for drugs of the ACE inhibitor pharmacological class (left image), and the most frequent boxed warnings in drugs for which haemolytic anaemia is an adverse event (right image). Data were generated in approximately 5 minutes.
In August this year I joined around 100 other delegates to attend the OpenTox 2011 meeting in the pleasant surroundings of the Technical University of Munich (TUM). The meeting was kindly organised by Barry Hardy from Douglas Connect and Stefan Kramer of the TUM, and further information can be found at http://www.opentox.org/meet/opentox2011. The program was spread over 4 days, starting with a pre-conference workshop, followed by a 3-day conference comprising presentations, poster sessions and social activities.
OpenTox is a collaborative project, with contributions from academia, government research groups, industry and individual experts. Its aim is to establish an interoperable predictive toxicology framework that can enable the creation of predictive toxicology applications. The project received funding from the Seventh Framework Programme (FP7) and is coming to a close, and the conference provided an opportunity for those affiliated with the project, as well as non-participants such as myself, to talk about their work in predictive toxicology.
There were too many presentations in the conference program to single any out in particular but, not surprisingly, quantitative structure-activity relationship (QSAR) modelling was a common theme. Something that struck me from speaking to some of the delegates who work in this area is the sparseness of information in the domain. Consequently, the correlations that come from QSAR experiments are often based on highly curated decisions from relatively small amounts of data. This contrasts with the approach we typically take at BioWisdom, where decisions are usually based on large quantities of data harvested from a variety of data sources.