From OntologPSMW

Session Track A Session 2
Duration 2 hours
Date/Time Apr 05 2017 18:30 GMT
9:30am PDT/12:30pm EDT
5:30pm BST/6:30pm CEST
Convener GaryBergCross

Meeting ID: 768423137     (4)

Please use the chatroom above. Do not use the video teleconference chat, which is only for communicating with the moderator.     (6)

When you use the Video Conference URL above, you will be given the choice of using the computer audio or using your own telephone. Some attendees had difficulties when using the computer audio choice. If this happens to you, please leave the meeting and reenter it using the telephone choice with access code 768423137.     (7)


Session 2, Track A: Automation of Knowledge Extraction and Ontology Learning     (8)

Session Chair: Gary Berg-Cross     (8A)

Context:     (8B)

Building & maintaining knowledge bases & ontologies is hard work and could use some automated help.     (8B1)

Perception:     (8C)

Various parts of AI, such as NLP and machine learning, are developing rapidly and could offer help.     (8C1)

Approach:     (8D)

As with Session 1, we aim to bring together various researchers to discuss the issues and the state of the art.     (8D1)

Sample Questions:    (8D2)

What is the range of methods used to extract knowledge and build ontologies and other knowledge structures? How have techniques been enhanced and expanded over time? What issues of knowledge building and reuse have been noted? Are there hybrid efforts?     (8D2A)

Agenda and Speakers     (8E)

Speakers:    (8F)

  • Michael Yu (UCSD) - "Inferring the hierarchical structure and function of a cell from millions of biological measurements". Slides     (8F1)

Abstract A cell operates at many physical scales. For example, genetic variation in nucleotides (1 nm) gives rise to functional changes in proteins (1-10 nm), which in turn affect protein complexes, cellular processes, pathways, organelles (10 nm-1 μm), and, ultimately, phenotypes observed in cells (1–10 μm). In the first half of the talk, I will present a general strategy for automatically inferring these cellular subsystems and their hierarchical organization based on millions of experimental measurements. The result, a “data-driven gene ontology”, complements the biological knowledge found by manual curation of literature. In the second half of the talk, I will also show how a gene ontology can be applied not only to describe cell structure but also to predict cell functions, such as growth rate, from this structure. Predictions made in this way outperform those by alternative methods that do not take advantage of the hierarchical knowledge in an ontology.     (8F2)

Short Bio Michael Yu is a Bioinformatics Ph.D. student in Trey Ideker’s laboratory at UC San Diego. His current research focuses on designing algorithms for integrating large “omics” datasets into predictive models of molecular biology and human disease. Prior to his Ph.D., he studied comparative genomics at MIT, where he received his Bachelor’s in mathematics and Master’s in computer science.     (8F3)
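To make the second half of the abstract above more concrete, here is a minimal Python sketch of how a term hierarchy might be turned into features for a predictor. It is an illustration under stated assumptions, not the method presented in the talk: the toy hierarchy, the gene names, and the function names (`ancestors`, `term_gene_sets`, `hierarchy_features`) are all hypothetical, and "count the perturbed genes under each term" is only one plausible reading of the "ontotype" features mentioned in the chat.

```python
from collections import defaultdict

# Hypothetical toy hierarchy: child term -> parent terms (not taken from the talk).
PARENTS = {
    "term_A": ["root"],
    "term_B": ["root"],
    "term_A1": ["term_A"],
}

# Hypothetical direct gene annotations for each term.
ANNOTATIONS = {
    "term_A1": {"g1", "g2"},
    "term_A": {"g3"},
    "term_B": {"g4"},
}


def ancestors(term):
    """Return the term itself plus every ancestor reachable through PARENTS."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def term_gene_sets():
    """Propagate annotations upward so each term also holds its descendants' genes."""
    genes = defaultdict(set)
    for term, direct in ANNOTATIONS.items():
        for anc in ancestors(term):
            genes[anc] |= direct
    return dict(genes)


def hierarchy_features(perturbed_genes):
    """Count, for every term, how many of the perturbed genes fall under it."""
    gene_sets = term_gene_sets()
    return {term: len(gene_sets[term] & perturbed_genes) for term in sorted(gene_sets)}


if __name__ == "__main__":
    print(hierarchy_features({"g1", "g4"}))
```

The resulting dictionary could be fed as a feature vector to any supervised learner; the point of the sketch is only that terms higher in the hierarchy aggregate evidence from the terms below them.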

  • Francesco Corcoglioniti (Post-doc at Fondazione Bruno Kessler, Italy)     (8F4)

"Frame-based Ontology Population from text with PIKES" Slides     (8F5)

Abstract: PIKES is an open source tool for ontology population from natural language English text that extracts RDF triples according to FrameBase, a Semantic Web ontology derived from FrameNet. Processing is decoupled into two phases: (i) linguistic feature extraction, where several NLP tools are used to produce an RDF graph of mentions, i.e., snippets of text denoting some entity or fact; and (ii) knowledge distillation, where the mention graph is mapped via rules to produce a knowledge graph, whose content is linked to DBpedia and organized around semantic frames, i.e., prototypical descriptions of events and situations. A single RDF/OWL representation is used, in which each triple is related to the mentions and tools it comes from. This talk provides an overview of the PIKES approach, its implementation, and related and future research developments. Full Article     (8F6)

Short bio: Francesco Corcoglioniti is a post-doc researcher at Fondazione Bruno Kessler (FBK), where he previously carried out the work for his Ph.D. in Computer Science, obtained from the University of Trento in 2016. His research interests cross the areas of Semantic Web, Data Management and Natural Language Processing, and focus on the extraction, modeling, processing, and storage of knowledge from natural language text and social media.     (8F7)
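The two-phase design described in the PIKES abstract above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions and does not use or reproduce the PIKES API: the `Mention` class, the `FRAME_LEXICON` table, the `frame:`, `role:`, `event:` and `entity:` names, and the single agent/theme rule are all hypothetical stand-ins (they are not FrameBase IRIs). What the sketch tries to show is the decoupling: phase 1 only produces mentions, phase 2 only applies rules to mentions, and every triple carries the id of the mention it came from.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    id: str      # identifier used for provenance
    text: str    # the snippet of text
    kind: str    # "entity" or "predicate"
    start: int   # character offset in the sentence


# Hypothetical lexicon mapping a verb to a made-up frame name.
FRAME_LEXICON = {"visited": "frame:Visiting"}


def extract_mentions(sentence):
    """Phase 1 (toy): known verbs become predicates, capitalized words become entities."""
    mentions = []
    for i, match in enumerate(re.finditer(r"\w+", sentence)):
        word = match.group()
        if word in FRAME_LEXICON:
            kind = "predicate"
        elif word[0].isupper():
            kind = "entity"
        else:
            continue
        mentions.append(Mention(f"m{i}", word, kind, match.start()))
    return mentions


def distill(mentions):
    """Phase 2 (toy rule): nearest entity before a predicate is its agent, nearest after is its theme.

    Each result is (subject, predicate, object, source mention id).
    """
    triples = []
    entities = [m for m in mentions if m.kind == "entity"]
    for pred in (m for m in mentions if m.kind == "predicate"):
        event = f"event:{pred.id}"
        triples.append((event, "rdf:type", FRAME_LEXICON[pred.text], pred.id))
        before = [m for m in entities if m.start < pred.start]
        after = [m for m in entities if m.start > pred.start]
        if before:
            triples.append((event, "role:agent", f"entity:{before[-1].text}", before[-1].id))
        if after:
            triples.append((event, "role:theme", f"entity:{after[0].text}", after[0].id))
    return triples


if __name__ == "__main__":
    for triple in distill(extract_mentions("Alice visited Paris")):
        print(triple)
```

A real pipeline would replace the capitalization heuristic with actual NLP tools and the single rule with a rule set; keeping the mention id on each triple is what allows a result to be traced back to the text and tool that produced it.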

  • Evangelos Pafilis (Hellenic Center for Marine Research [HCMR]) -     (8F8)

"EXTRACT 2.0: interactive extraction of environmental and biomedical contextual information." Slides     (8F9)

Abstract: EXTRACT is an interactive web-based annotation tool that employs a basic text mining technique, Named Entity Recognition (NER), to identify and extract standard-compliant terms for the annotation of metagenomic and biomedical records. A fast, dictionary-based tagger [1] constitutes EXTRACT’s core. The tagger relies on a set of dictionaries that map biological names to corresponding terms in biological ontologies or to pertinent records in public biological databases (depending on the entity type). In the case of ontology-described entity types, the NER dictionaries are constructed from the ontology term names and synonyms: after these are extracted, they are subjected to a series of filtering, rule-based expansion, and manual curation steps. In its first version, EXTRACT supported the identification of environment descriptor, tissue, disease, and organism mentions in text [2]. Environment Ontology, Brenda Tissue Ontology, and Disease Ontology terms and NCBI Taxonomy database records, respectively, were used to this end (see [2] for the mentioned web resource references). The aim of this effort was to explore easy-to-use methods to assist the annotation of metagenomics records with standards-compliant metadata. In this context, EXTRACT participated in the BioCreative V interactive annotation task (BCV-IAT) [3]. In its present version, and via the work described in [4], EXTRACT has been extended to support a wider scope of biological record annotation (e.g. protein function). In particular, it also supports the identification of genes/proteins, PubChem Compound identifiers, and Gene Ontology terms. This talk will describe the need for standards-compliant metagenomics record annotation, the EXTRACT architecture (focusing on Environment Ontology [5] term identification), the EXTRACT web interface, and its performance in BCV-IAT, and will briefly present its present version.
References – Resources
[1] Pafilis, E. et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS One, 8, e65390.
[2] Pafilis, E. et al. (2016) EXTRACT: Interactive extraction of environment metadata and term suggestion for metagenomic sample annotation. Database, 2016, baw005.
[3] Wang, Q. et al. (2016) Overview of the interactive task in BioCreative V. Database, 2016, baw119.
[4] Pafilis, E. et al. (2017) EXTRACT 2.0: text-mining-assisted interactive annotation of biomedical named entities and ontology terms. bioRxiv.
[5] Buttigieg, P.L. et al. (2016) The environment ontology in 2016: bridging domains with increased scope, semantic density, and interoperation. J. Biomed. Semantics, 7, 57.     (8F10)

Short Bio Evangelos Pafilis is a Postdoctoral Researcher at the Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC) at Hellenic Center for Marine Research (HCMR), Crete, Greece. Originally trained as a biologist, Evangelos specialized in Bioinformatics, and in particular in literature mining and data integration. Such skills were initially developed in a biomedical research context (PhD in Bioinformatics EMBL/Uni of Heidelberg). In IMBBC/HCMR he is exploring how text mining, data integration, and interactive web application development can be applied and/or extended to serve the information extraction needs of additional biological fields, such as microbiology, ecology and biodiversity.     (8F11)
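The dictionary-based tagging idea described in the EXTRACT abstract above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the tagger of [1]: the `DICTIONARY` contents, the `EX:` placeholder identifiers (deliberately not real ontology or database identifiers), and the `tag` function are hypothetical. Names and synonyms map to an entity type and an identifier, and longer names are tried first so that a multi-word name wins over a shorter one that starts at the same position.

```python
import re

# Hypothetical dictionary: lowercase name or synonym -> (entity type, placeholder identifier).
DICTIONARY = {
    "sea water": ("environment", "EX:0001"),
    "seawater": ("environment", "EX:0001"),
    "liver": ("tissue", "EX:0002"),
    "escherichia coli": ("organism", "EX:0003"),
    "e. coli": ("organism", "EX:0003"),
}

# Longest names first, so the alternation prefers the longest match at each position.
_NAMES = sorted(DICTIONARY, key=len, reverse=True)
PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(name) for name in _NAMES) + r")(?!\w)",
    re.IGNORECASE,
)


def tag(text):
    """Return (start, end, matched text, entity type, identifier) for each dictionary hit."""
    results = []
    for match in PATTERN.finditer(text):
        entity_type, identifier = DICTIONARY[match.group().lower()]
        results.append((match.start(), match.end(), match.group(), entity_type, identifier))
    return results


if __name__ == "__main__":
    for hit in tag("Samples of seawater contained E. coli."):
        print(hit)
```

Synonyms simply share an identifier, which is why the filtering and curation of names and synonyms described in the abstract matters: the quality of the output is bounded by the quality of the dictionary. This sketch does not handle spelling variants or irregular whitespace.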

Attendees     (8G)

Proceedings     (8H)

[12:16] KenBaclawski:     (8H1)

Michael Yu (UCSD) "Inferring the hierarchical structure and function of a cell from millions of biological measurements".     (8H3)

Francesco Corcoglioniti (Post-doc at Fondazione Bruno Kessler, Italy) "Frame-based Ontology Population from text with PIKES"     (8H4)

Evangelos Pafilis (Hellenic Center for Marine Research [HCMR]) "EXTRACT 2.0: interactive extraction of environmental and biomedical contextual information."     (8H5)

[12:23] KenBaclawski: The video teleconference is at     (8H6)

[12:23] KenBaclawski: The meeting page is     (8H7)

[12:26] ToddSchneider: Ken, is the Blue Jeans recording capability working?     (8H8)

[12:30] KenBaclawski: @Todd: Yes, it is working. I have tried it, and it seems adequate.     (8H9)

[12:36] Rebecca Tauber: Does anybody know the meeting code for calling in? I'm having some connection trouble.     (8H10)

[12:37] gary berg-cross: @Rebecca does work???     (8H11)

[12:38] Rebecca Tauber: @Gary - finally got it to work, thanks (loading took forever)     (8H12)

[12:39] gary berg-cross: I have a bit of an echo...I am muted...     (8H13)

[12:40] ToddSchneider: Participants using Blue Jeans teleconference tool, please do not send video (i.e., don't use a/your web cam) to conserve bandwidth.     (8H14)

[12:41] ToddSchneider: Gary, I'm not hearing any echo.     (8H15)

[12:42] ToddSchneider: How does data acquisition impact any models created (e.g., implicit or embedded assumptions or biases)?     (8H16)

[12:44] ToddSchneider: What does the hierarchy represent?     (8H17)

[12:45] Rebecca Tauber: Are you pulling the public data from open-access articles, or curated data from databases?     (8H18)

[12:52] gary berg-cross: Reloaded to get rid of the echo.     (8H19)

[12:54] Mark Underwood: What tool was used for the visualization demo?     (8H20)

[12:59] Rebecca Tauber: May or may not be useful/relevant - have you looked into GIST (genetic interaction structured terminology)? Currently in development: (look at genetic interaction curation)     (8H21)

[13:01] ToddSchneider: What is a 'bundle' of interactions? Is there some additional behavior that emerges from a 'bundle' of interactions (i.e., is it another type of interaction)?     (8H22)

[13:05] ToddSchneider: Ken, There are more people listed on BlueJeans than on this chat.     (8H23)

[13:06] Mark Underwood: Is the "framework for supervised machine learning" a general design pattern, or something special that needs a deeper dive to apply to different domains?     (8H24)

[13:10] Mark Underwood: Is it ok to socialize the demonstration web page?     (8H25)

[13:12] Rebecca Tauber: Would love to see more about this project. I work on the Evidence & Conclusion Ontology and we work closely with GO.     (8H26)

[13:12] gary berg-cross: Francesco has a nice contrast or follow up from Valentina's FRED presentation.     (8H27)

[13:14] gary berg-cross: @Michael Yu, thanks for a nicely detailed presentation.     (8H28)

[13:15] Mark Underwood: @Michael - Nice presentation - and shout-out to San Diego (Long time former resident, Leucadia / La Mesa resident)     (8H29)

[13:16] KenBaclawski: @ToddSchneider: I will remind participants to join the soaphub chat during the break between speakers.     (8H30)

[13:21] AndreaWesterinen: @Francesco How do you get the correct parse of the Bush/Bono sentence to achieve the frame that you note in the slides? For example, "in Africa" could refer to where Bush and Bono are OR to the fight of HIV. Humans know that Bush and Bono are not in Africa, but a parser would not.     (8H31)

[13:21] Michael Yu: A 'bundle' of interactions is simply a dense cluster of interactions. In our experience, we find that bundles can occur in at least two ways. The first type of bundle spans BETWEEN two terms, i.e. there are many pairwise interactions (g1, g2) where g1 is in one term and g2 is in the other term. The second type of bundle occurs WITHIN a term, i.e. there are many interactions (g1,g2) where both g1 and g2 are in that same term. Bundling is a nice property because it means that many interactions can be more easily explained (and thus predicted) by that bundling     (8H32)

[13:23] gary berg-cross: Q from Queue for Francesco " Could you define the word Frame?"     (8H33)

[13:23] KenBaclawski: Please put your questions in the main entry box at the bottom. Do not put your question in the box next to the hand.     (8H34)

[13:27] Michael Yu: @RebeccaTauber. Thanks for the link about GIST! I will look into it.     (8H35)

Also, we have previously worked with GO (Mike Cherry, Michal A Surma, Rama Balakrishnan) by suggesting that they include some data-driven terms in GO. See section "Using NeXO to systematically update and expand GO" of the Dutkowski et al. Nature Biotech 2012 paper (5th paper listed on slide 26). In general, I'm excited about merging data-driven and curated knowledge!     (8H36)

[13:31] Michael Yu: @MarkUnderwood. The visualization for was custom built. I believe it relies heavily on javascript libraries (Node.JS or related libraries I think?) More info can be found in the paper Dutkowski et al. Nucleic Acids Res 2014 (4th paper in slide 26)     (8H37)

[13:33] Michael Yu: @RebeccaTauber, the data used to create the ontology comes from both datasets tied to specific journal articles and also data that we downloaded from a database (e.g. we have used BioGRID database, which is described in the paper you linked)     (8H38)

[13:36] Michael Yu: @MarkUnderwood, sorry I'm not sure what you mean by "socialize" the demonstration web page     (8H39)

The supervised machine learning framework is a general method for creating informative features (a.k.a. "ontotype") to be used by any machine learning method, such as the decision trees I presented.     (8H40)

[13:38] Rebecca Tauber: Thanks @Michael! It's great to see GO development being driven this way... really cool.     (8H41)

[13:40] gary berg-cross: I note that these text understanding efforts leverage ontologies like YAGO to provide background knowledge. This reflects our chicken and egg discussion last time that Track A and B views are intertwined in practice.     (8H42)

[13:41] gary berg-cross: Q from queue, " Again, what was the URL for PIKE?"     (8H43)

[13:45] ToddSchneider: Were the NLP tools 'tuned' or trained?     (8H44)

[13:51] gary berg-cross: @Francesco Thanks for a wonderful presentation with lots of work to build on.     (8H45)

[14:19] gary berg-cross: @Evangelos Thank you!     (8H46)

[14:31] Mark Underwood: yes, sorry, no audio here     (8H47)

[14:31] Mark Underwood: Yes, provide on Twitter etc     (8H48)

[14:32] Mark Underwood: Got it, thanks     (8H49)

[14:36] Mark Underwood: Thank you, presenters - very worthwhile     (8H50)

Resources     (8I)

Previous Meetings     (8J)