
Purpose
Overview

OKN in Context

The Open Knowledge Network (OKN) aims to create an open knowledge graph (KG) of various generally known entities and their relationships in support of a range of publicly useful applications. This blog summarizes some of the OKN ideas and approaches, along with possible mechanisms for specifying, capturing, and using context as part of the expected OKN effort.

OKN Background

Starting in July 2016, the Big Data Interagency Working Group (BD IWG) leadership has been involved in three meetings to discuss the viability of, and possible first steps toward creating, a joint public/private open data network infrastructure, called the Open Knowledge Network (OKN). The vision of OKN, as noted, is to create an open knowledge graph (or related graphs) of the bulk of generally known and discussed entities and their relationships, ranging from the macro (have there been unusual clusters of earthquakes in the US in the past six months?) to the micro (what is the best combination of chemotherapeutic drugs for a 56 y/o female with stage 3 glioblastoma and an FLT3 mutation but no symptoms of AML?). The work builds in part on efforts of vendors like Google, which have employed private KGs as part of their search applications. Knowledge Graphs are a type of knowledge base used originally by Google and its services behind the scenes to represent general world knowledge ("things, not strings") as graphs and "enhance its search engine's results with information gathered and connected from a variety of sources." The KG is also used to present a summary of what Google knows about a search term. So, for example, a Google search on the term "ice field" yields a small box of information (officially called the Knowledge Graph Card) from a KG that says it is "an expansive area of interconnected glaciers found in a mountain region, or it is an extensive formation of packs of ice at sea." It is sourced to Wikipedia, and pictures are also provided. That Google search also brings back much more information from the Web, such as details at the National Snow and Ice Data Center, which may have been a source for the information stored in Wikipedia and leveraged by the Google KG.

While Google was first to deploy a KG, Wikipedia, Wikidata, DBpedia, YAGO, and Freebase (e.g. Freebase's schema comprises roughly 27,000 entity types and 38,000 relation types) are now among other prominent KGs, and KGs are also used by NELL (discussed later), PROSPERA (a Hadoop-based scalable knowledge-harvesting engine), and Knowledge Vault. Knowledge Vault, a Google project, is the type of effort that might be typical for OKN. It goes beyond leveraging the Semantic Web since it extracts knowledge from many different sources, such as text documents, HTML tables, and structured annotations on the Web stored in Microdata or Microformats.

Defining Knowledge Graphs

From a broad perspective, any graph-based representation of some facts might be considered a KG. In particular this includes any kind of RDF dataset or RDF "graph", but also the more expressive OWL ontologies. With this said, however, there seems to be no common definition of what a knowledge graph is and what it is not. Looking back at the 1980s, researchers from the University of Groningen and the University of Twente in the Netherlands initially introduced the term knowledge graph to formally describe their knowledge-based system, which integrated knowledge from different sources and was used to represent natural language expressions. This largely fits the conception of the sources of OKN knowledge. More recently several definitions have been attempted to reflect how the term is used in open knowledge efforts:

  • "A knowledge graph (i) mainly describes real world entities and their interrelations, organized in a graph, (ii) defines possible classes and relations of entities in a schema, (iii) allows for potentially interrelating arbitrary entities with each other and (iv) covers various topical domains." (Paulheim)
  • "Knowledge graphs are large networks of entities, their semantic types, properties, and relationships between entities." (Journal of Web Semantics)
  • "Knowledge graphs could be envisaged as a network of all kinds of things which are relevant to a specific domain or to an organization. They are not limited to abstract concepts and relations but can also contain instances of things like documents and datasets." (Semantic Web Company)
  • "We define a Knowledge Graph as an RDF graph. An RDF graph consists of a set of RDF triples where each RDF triple (s, p, o) is an ordered set of the following RDF terms: a subject s ∈ U ∪ B, a predicate p ∈ U, and an object o ∈ U ∪ B ∪ L. An RDF term is either a URI u ∈ U, a blank node b ∈ B, or a literal l ∈ L." (Färber et al.)
  • "[...] systems exist, [...], which use a variety of techniques to extract new knowledge, in the form of facts, from the web. These facts are interrelated, and hence, recently this extracted knowledge has been referred to as a knowledge graph."

For some purposes, the use of a graph DB to store the knowledge may lead to it being called a KG. To get around the lack of a formal definition of what a KG is, OKN discussions tend to be constrained to some minimum set of characteristics of knowledge graphs, such as the previously mentioned entities and relations (Ehrlinger et al., 2016). This fits the approach taken by one of our speakers, Mayank Kejriwal, who noted that the OKN approach tries to avoid the use of "unwieldy ontologies" and uses a shallow ontology to represent entity types. The size of KGs comes not from the class schema, which tends to be shallow, but from the many instances of each class/entity. Entity examples offered were similar to what is in Schema.org:

  • Ad, Posting Date, Title, Content, Phone, Email, Review ID, Social Media ID, Price, Location, Service
  • Hair Color, Eye Color, Ethnicity, Weight, Height

It remains an open issue what costs an overly simple approach to representation might incur and how we might improve on it.

OKN is meant to be an inclusive, open, community activity resulting in a knowledge infrastructure that could facilitate and empower a host of applications and open new research avenues, including how to create trustworthy knowledge networks/bases/graphs as a means of storing base facts about the real world. This "fact" knowledge is currently represented relatively simply as entities (nodes of the graph) connected by relations (edges of the graph) that express relationships among these entities. While we use KG in the sense of a knowledge base, the term is also used to suggest a series of applications that use the KG and the resulting "answers" or products a KG-based application provides. In practice the reliability of a KG's facts is inferred from its sources, such as the Web. Such sources may include noisy and conflicting information. To date this problem is mitigated by the data redundancy available from multiple Web pages: heuristically, a high belief threshold is set before a fact is believed and incorporated into a KG based on converging evidence.

Open Knowledge holds the promise of leveraging existing data pools to make positive impacts on science, education, business, and society as a whole, perhaps comparable to that of the WWW. We are already seeing the first wave of such benefits in consumer services such as Siri and Alexa. But these services are limited in the data they can use and in the extent or scope of their knowledge. Their "competence" in and knowledge about things is too limited to a particular context. Currently, for example, a simple knowledge structure like Schema.org, launched with 297 classes and 187 relations, is used by roughly ⅓ of web pages to label and categorize or type information. The items below, for example, illustrate the Auto Extension within Schema.org (a markup sketch follows the list).

  • BusOrCoach, CarUsageType, Motorcycle, MotorizedBicycle
  • Properties (20)
    • accelerationTime, acrissCode, bodyType, emissionsCO2, engineDisplacement, enginePower, engineType, fuelCapacity, meetsEmissionStandard, modelDate, payload, roofLoad, seatingCapacity, speed, tongueWeight, torque, trailerWeight, vehicleSpecialUsage, weightTotal, wheelbase
  • Enumeration values (3)
    • DrivingSchoolVehicleUsage, RentalVehicleUsage, TaxiVehicleUsage
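To make this labeling concrete, here is a minimal, hypothetical sketch (Python emitting JSON-LD) of how a page might type an item as a schema.org Car and supply a few of the Auto Extension properties listed above; the values are invented for illustration.

```python
import json

# A hypothetical JSON-LD snippet a dealer page might embed so that crawlers can
# type the item as a schema.org Car and read a few Auto Extension properties.
vehicle_markup = {
    "@context": "https://schema.org",
    "@type": "Car",
    "name": "Example Compact Sedan",          # invented example values
    "modelDate": "2017",
    "fuelCapacity": {
        "@type": "QuantitativeValue",
        "value": 50,
        "unitCode": "LTR",
    },
    "seatingCapacity": 5,
    "vehicleSpecialUsage": "https://schema.org/RentalVehicleUsage",
}

print(json.dumps(vehicle_markup, indent=2))
```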

While light and useful, these vocabularies are not very deep from an ontology perspective and seem not to address contextual issues head on. Further, while a group exists to work on an ontology in the automotive area, none is publicly available yet. Earlier vocabularies such as the Vehicle Sales Ontology (VSO) still amount to a Web vocabulary for describing cars, boats, bikes, and other vehicles for e-commerce purposes (their intended context). A similar situation applies to GS1, whose SmartSearch product vocabulary integrates with Schema.org. The intent of these efforts has not been to model a full domain but rather the easier task of improving searches for information within a domain by providing better metadata.

Another problem with current knowledge graph efforts is that they are not open to direct access or contributions beyond current corporate firewalls, and can only answer relatively limited questions in their domain areas. They can leverage only some information, but this may be enhanced by the growing number of natural language interfaces providing access to a growing number of large knowledge structures. This approach should leverage technology along with know-how and expand to many new topic areas, allowing users to generate many more useful classes of questions. But this will require mounting an open and well-supported effort, particularly as information is gathered from diverse and less well documented sources.

Context and OKNs

This track has examined the issues about, and contributions of, OKNs in light of the context challenge of extracting and expanding knowledge for robust applications. Issues of context come up immediately when we consider the Semantic Web as a source of knowledge, but also the WWW in general. Most of the knowledge available in the Semantic Web which might be leveraged by OKN is context dependent. A short list of the contexts of Semantic Web knowledge includes time, topic, provenance, and reliability. In general we have to consider the contextual dimension of knowledge: the fact that any particular statement (say, expressed as an RDF triple) is not universally true. It is conditional and true only in certain circumstances, such as for a certain time interval and/or a particular spatial region.

An example sometimes cited about facts in the Cyc KB involves facts available on the Web which differ primarily in time:

  1. Mandela is an elder statesman.
  2. Mandela is the president of South Africa (this is a bit earlier than the first).
  3. Mandela is a political prisoner (this is earlier still).

In the Cyc KB, discussed later, these facts exist in separate portions of the KB, each called a microtheory, whose facts are not contradictory. It is argued that this is simpler in the long run than adding time directly to a triple ("Mandela was a political prisoner from 1962 to 1989"); see the sketch below.
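To make the contrast concrete, the following is a minimal sketch (Python with rdflib, using a hypothetical example namespace) of holding each time-scoped assertion in its own named graph, loosely in the spirit of a microtheory partition, rather than attaching a time qualifier to every triple. It illustrates the idea only, not Cyc's actual mechanism.

```python
from rdflib import Dataset, Namespace

EX = Namespace("http://example.org/okn/")            # hypothetical namespace

ds = Dataset()

# One named graph per time-scoped, microtheory-like partition.
mt_1962_1989 = ds.graph(EX["Mt_1962-1989"])
mt_1994_1999 = ds.graph(EX["Mt_1994-1999"])

# Terse assertions; the temporal context lives in the named graph, not in the triple.
mt_1962_1989.add((EX.Mandela, EX.hasRole, EX.PoliticalPrisoner))
mt_1994_1999.add((EX.Mandela, EX.hasRole, EX.PresidentOfSouthAfrica))

# A query can then be scoped to whichever partition is relevant for a period.
for s, p, o in mt_1994_1999:
    print(s, p, o)
```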

As part of the Summit several of these circumstances have been discussed. But to date these are only a subset of possible issues with context, because context can be a grab bag of ideas, starting with whatever is in a person's mind as they consider something. Contextualizing knowledge inside a mind cannot be publicly observed, so as Hayes observes in Contexts In Context:

  • "there are many ideas about what its structure might be."

In a keynote at the CogSIMA 2012 conference (cited in Jakobson, 2014), Patrick Hayes added a bit more:

  • “Everyone agrees that meaning depends on context, but not everyone agrees what context is”. He continues that
  • “… any theory of meaning will focus on some of the things that influence it, and whatever is left over gets to be called the “context” … so the “context” gets to be a trash-can term. It means all the rest, whatever that is.”

This challenge, that context may extend indefinitely, has been discussed at several Summit sessions as well as followed up in the literature (Homola et al., 2010). Considering this, it is useful as a strategy to specify some focus and subset of context that may be pursued, and what efforts might have in mind when they discuss topics like Open Knowledge, which evoke a large contextual space.

Here we start with the guidance laid down for the Summit by John Sowa in his writings about context. As he notes, there are four kinds of context involved in language understanding (and relevant more widely to other cognitive products):

  • the text or discourse; for uttered statements we often talk of the linguistic context of what is being expressed.
  • the situation; reflecting in part that a context is a context of something. It may be text as above, but also physical and mental realms. Situations, both physical and social, structure the relations of objects and processes. For example, some process or object may come into existence, change state, or disappear as a result of some situation. Examples include accident situations, which may create injuries, deaths, and other states of involved objects.
  • common background knowledge; which may include framing knowledge about situations or text relations, or about agent intentions; and
  • the intentions of the participants.

Context also occurs as a concept in some upper-level ontology work, such as Gangemi and Mika's (2003) Descriptions and Situations (D&S) ontology design pattern. D&S provides a model of context and allows us to clearly delineate between, for example, some generic pattern we associate with a relationship, a norm, a plan, or a social role like mothering, and the actual state-of-affairs we observe in the world, where a physical object, a mother, might nurse a baby in a nursing situation. In D&S the notion of Context is specified in more detail than we attempt here, but involves a composition of a set of Parameters, Functional Roles, and Courses of Events. Parameters take on values from DOLCE's concept of (quality) Regions, Functional Roles are played by Objects (endurants like Anne), and Courses of Events sequence Events, which are perdurants. The elements of the context thus mirror (map to) the elements of the state of affairs (a set of objects, events, and their locations), but add additional semantics to them. For the Summit the important takeaway from a conceptualization like D&S is that it provides one example of situations which have been formalized and related to one notion of context.

Context and Roles

Throughout the Summit meetings, roles as context were noted as important. Important distinctions made here in upper-level ontologies such as GFO include:

  • Relational role
  • Processual role and
  • Social role

An example of this is that the role of legal advice is different in the context of a banking activity compared to that of lying under oath.


Some connections to these different types of context are noted as they are relevant to an OKN discussion. This context, and its relation to extracted knowledge and its formalization, may differ from how it is addressed in other Ontology Summit topics and issues. As a first approximation we are interested in context in the following senses:

  1. The text & informational context from which OKN might have acquired information for populating a KG. In one sense this is the data domain. This may include general information about a block of text or a web page as a source. There may be important information to document as the context for the extract.
  2. An extraction processing context or situation which may include the AI or NLP tools and processes used to extract the knowledge. In some advanced machine learning cases prior knowledge (background knowledge) may be used to judge the relevant facts of an extract.
  3. In some cases, and increasingly so, a variety of extracted information is aligned (e.g. some information converges from different sources) by means of an extant ontology, and perhaps several. This means that some aspect of the knowledge in the ontology provides an interpretive or validating activity. It is useful to note that this process of building a KG clearly shows that it is not equivalent to an ontology, or at least to the ontology that was used to support its construction. In general there is confusion from equating KGs, KBs and ontologies in discussing efforts like OKN.
  4. The domain which the knowledge is about provides a great deal of extended context beyond the immediate context in which information is found or what it is related to in a stored representation, such as its neighbors in a KB.
  5. Provenance information or source context: a key requirement for validated knowledge is the ability to trace back from a knowledge graph to the original source documents.

So open knowledge, say about glaciers, may be created/extracted from discourse and text on the web or from some particular structural format such as RDF or a table. Some information may be about entities in the context of a situation. For example, the text below from a NASA site describes a melting glacier, including information about time and place:

"There’s nothing quite like historical photos of glaciers to show what a dynamic planet we live on. Alaska’s Muir Glacier, like many Alaskan glaciers, has retreated and thinned dramatically since the 19th century.

This particular pair of images shows the glacier’s continued retreat and thinning in the second half of the 20th century. From 1941 to 2004, the front of the glacier moved back about seven miles while its thickness decreased by more than 2,625 feet, according to the National Snow and Ice Data Center."


For a human there is quite a bit of interpretation of the facts in the text and the situation assumed, using background knowledge such as how this fits into a climate change picture and the reference to historical change from the "19th century". But this knowledge connection may be captured, at least in part, by a smart extraction process that includes the page heading "Global Climate Change." Smart annotation of the page's metadata may also contribute this connection, and in a sense this information can be incorporated into the metadata used to document a KG's facts. There is a good deal of "credit" information on the page associated with particular pieces of the information, such as the pictures:

NASA Climate 365 project - a collaboration of the NASA Earth Science News Team, NASA Goddard and Jet Propulsion Laboratory communications teams, and NASA websites Earth Observatory and Global Climate Change. Photo credits: Photographed by William O. Field on Aug. 13, 1941 (left) and by Bruce F. Molnia on Aug. 31, 2004 (right). From the Glacier Photograph Collection. Boulder, Colorado USA: National Snow and Ice Data Center/World Data Center for Glaciology.

Metadata to document the knowledge should also capture the nature of the extraction tool used. These are some common uses of context for data. Some data that is used to extract facts will also have an intentional context, reflecting the fact that particular data, such as the NASA glacier data and pictures, has been intentionally produced as part of a research program. These may be tool-assisted actions organized by an intelligent agent to gather (and document) information. Indeed we may talk about such data as acquired from a research situation. One ontology that includes some of this was developed as part of EarthCube's GeoLink project, which describes data collection from physical sampling during an oceanographic cruise (Krisnadhi et al., 2015).

The following is a preliminary summary of material about this which may also be useful for the Summit Communique. Material is roughly organized as follows:

  1. Relation of OKN in the context of discussions from prior Ontology Summits
  2. An overview of Extraction and Knowledge Graph Building
  3. Contextual Examples of interest to OKN
  4. Preliminary findings
  5. Challenges and issues to investigate

Preliminary Synthesis

OKN in the Context of Prior Ontology Summits

As noted in previous Summits, especially the 2017 Summit on AI, Learning, Reasoning, and Ontologies and its track on "Using Automation and Machine Learning to Extract Knowledge and Improve Ontologies", there are substantial research efforts to automate and semi-automate extraction of information from a wide range of sources -- web crawls, scientific databases, public DBs such as PubMed, natural language processing extraction systems, and the like. One prominent example is NELL (Never-Ending Language Learner), which was discussed at the 2017 Ontology Summit. Central to the NELL effort is the idea that we will never truly understand machine or human learning until we can build computer programs that share some similarity to how humans learn. In particular that they, like people:

  • learn many different types of knowledge or functions and thus many contexts,
  • from years of diverse, mostly self-supervised experience,
  • in a staged curricular fashion, where previously learned knowledge in one context enables learning further types of knowledge,
  • where self-reflection and the ability to formulate new representations and new learning tasks enable the learner to avoid stagnation and performance plateaus.

As reported at the 2017 Summit, NELL has been learning to read the web 24 hours/day since January 2010, and so far has acquired a knowledge base with over 80 million confidence-weighted beliefs (e.g., servedWith(tea, biscuits)). NELL has also learned millions of features and parameters that enable it to read these beliefs from the web. Additionally, it has learned to reason over these beliefs to infer new beliefs, and is able to extend its ontology by synthesizing new relational predicates. NELL learns to acquire two types of knowledge in a variety of ways. It learns free-form text patterns for extracting this knowledge from sentences in a large-scale corpus of web sites. NELL exploits a coupled process which learns text patterns corresponding to type and relation assertions, and then applies them to extract new entities and relations. It also learns to extract this knowledge from semi-structured web data such as tables and lists. In the process it learns morphological regularities of instances of categories, and it learns probabilistic Horn clause rules that enable it to infer new instances of relations from other relation instances that it has already learned. Reasoning is also applied for consistency checking and removing inconsistent axioms, as in other KG generation efforts.

NELL might learn a number of facts from a sentence defining "icefield", for example:

"a mass of glacier ice; similar to an ice cap, and usually smaller and lacking a dome-like shape; somewhat controlled by terrain." In the context of this sentence and this new "background knowledge" extracted it might then extract supporting facts/particulars from following sentences:

"Kalstenius Icefield, located on Ellesmere Island, Canada, shows vast stretches of ice. The icefield produces multiple outlet glaciers that flow into a larger valley glacier."

It might also note not only the textual situation relating extracted facts but also the physical location (e.g. Ellesmere Island) and any temporal situations expressed in these statements. A toy sketch of such pattern-based extraction follows.
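The following is a highly simplified, hypothetical sketch (in Python) of the kind of coupled, pattern-based extraction described above: a "learned" text pattern for a locatedOn relation is applied to the example sentences to yield candidate facts. It is an illustration only, not NELL's actual machinery.

```python
import re

# A "learned" text pattern (here hand-written for illustration) that pairs a
# category instance (an icefield) with a location via the phrase "located on".
LOCATED_ON = re.compile(r"(?P<entity>[A-Z]\w+ Icefield), located on (?P<place>[A-Z]\w+ Island)")

text = ("Kalstenius Icefield, located on Ellesmere Island, Canada, shows vast "
        "stretches of ice. The icefield produces multiple outlet glaciers that "
        "flow into a larger valley glacier.")

facts = []
for m in LOCATED_ON.finditer(text):
    # Each match becomes a candidate (subject, relation, object) belief with a
    # placeholder confidence that converging evidence would later adjust.
    facts.append((m.group("entity"), "locatedOn", m.group("place"), 0.9))

print(facts)   # [('Kalstenius Icefield', 'locatedOn', 'Ellesmere Island', 0.9)]
```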

Techniques from natural language processing are also useful, and some have gone beyond the sentence level to take the extra-sentential context of discourse into account as part of "discourse parsing." An example of this is the use of Rhetorical Structure Theory (RST) to find the "semantic" relations between text fragments by analyzing the discourse structure of a text. RST also provides some conceptual structure that can be used to talk about context, and OWL ontologies have been employed as part of applications. In this approach, originally proposed by Mann and Thompson (1988), the smallest unit text fragment is called the Elementary Discourse Unit (EDU). Linguistically the EDU is usually a clause or a short sentence expressing a core idea. More Complex Discourse Units (CDUs), text fragments bigger than clauses and sentences, may be built from many EDUs. One particular text fragment is considered the nucleus and represents the salient part of the text, while other text, called satellites, represents additional information about and related to the nucleus. These discourse relations are organized into three categories:

  • subject matter relations,
  • presentational relations and
  • multi nuclear relations.

Subject matter relations should guide an information consumer to recognize the relation between some question and the subject of the nucleus. So in a subject matter relation such as "cause", a satellite carries a potential question, request, or problem about causes raised by a text consumer. These questions should be addressed and satisfied, solved, or answered in the nucleus. Who, what, where, and when questions are good examples of this and can be considered, of course, as context.

The central constructs in RST are rhetorical relations because text coherence is attributed principally to the presence of these relations. Of major interest are the subject matter relations which can be thought of as role relations developed as part of RST (Mann and Taboada,2005). These include:

  • Elaboration and Evaluation,
  • Interpretation,
  • Means, Purpose and Cause,
  • Condition,
  • Result,
  • Otherwise,
  • Solutionhood,
  • Unconditional and
  • Unless

As an example, elaboration may be signaled in text by conceptual relations like hyperonymy/ hyponymy, holonymy or meronymy, and/or lexical relations like synonymy. For more detail see Bärenfänger et al, 2008.

Presentational relations are supportive; that is, they play a supportive role. They often relate to some certainty of belief. An example occurs when a satellite increases an information consumer's inclination to accept as facts the assertions stated in the nucleus. This may be as simple as an assertion of some observation event about the claim, or some statement about the motivation behind the claim. Finally, a multi-nuclear relation connects two nuclei instead of connecting a nucleus and a satellite. This amounts to building out the core with more knowledge. One can imagine this as following a scientific text which paints a detailed picture of some phenomenon, as a page on icefields illustrates.


FRED is a machine reader for the Semantic Web discussed by Valentina Presutti at the 2017 Summit (http://content.iospress.com/articles/semantic-web/sw240). FRED's output is an RDF/OWL graph, whose design is based on frame semantics and discourse frames, which are useful tools for the difficult activity of reading information on the Web. Discourse frames, for example, provide some likely relations for the sense of the word "buy." This is clarified by knowing about the context of a commercial transfer that involves certain individuals, e.g. a seller, a buyer, goods, money, etc. Linguistic frames are referred to by sentences in text that describe different situations of the same type, i.e. frame occurrences. The words in a sentence "evoke" concepts as well as the perspective from which the situation expressed is viewed. Cluster patterns, which may be used to guide knowledge graph development, can then be built around these.

The experience of past Summit issues, including those on the Ontology of Big Systems, Interoperability, Big Data, and AI, establishes the role of ontology for the Semantic Web, Linked Data, and Big Data communities. These bring a wide array of real problems (performance and scalability challenges, and the variety problem in Big Data) and technologies (like automated reasoning tools) that make use of ontologies. There is a particular emphasis on the Web in making sense of data and information distributed over the Web. Building and maintaining knowledge bases and ontologies is hard engineering and could use some automated help. Tools such as the Apache Nutch framework may be used for web crawling, for example, and many natural language tools for data extraction exist. The idea is to make these more available and processable. This will require formalizing and structuring the extract in various ways into a "knowledge" form. Since there are diverse structures and much unstructured content (on the Web for example), considerable work may be needed for this transformation.

An overview of Extraction and Knowledge Graph Building

Following Craig Knoblock's From Artwork to Cyber Attacks: Lessons Learned in Building Knowledge Graphs using Semantic Web Technologies, this process can be visualized as the building of a knowledge graph in 3 broad steps:

  1. Data acquisition - crawling to find relevant pages & extracting the required information from them
  2. Mapping to an ontology or ontologies
  3. Entity linking and similarity identification as related material is found

Steps to Build a Knowledge Graph (KG)

As noted, acquisition starts with crawling information sources like the Web to identify and extract relevant information. Extraction needs to be "structured", which means that information about structure is part of the context of extracted content. Some work has long existed, for example, using suites of unsupervised learning algorithms that induce the structure of lists by exploiting "landmark" regularities, both from formatted web pages and the data contained on them (Knoblock et al., 2003). Data extraction is followed by creating a data extract structure (e.g. JSON, XML, CSV) which can then be used for efficient feature extraction and identification of such things as people, places, and processes. This activity may be assisted by templates.

This is followed by structural alignment of the data by mapping attributes in data sources to classes in ontologies.

Entities can then be identified and a knowledge graph constructed, which involves selecting a minimal "tree" that connects all semantic types. A customized Steiner tree algorithm can be used (Evans, 2016). Given that heuristic methods are applied to build a KG this way, it is unlikely that any particular KG will be fully correct or complete. In practice there is usually a trade-off between factors such as coverage and correctness, which is addressed differently in each context of the data extracted and the methods used to build a KG. KGs can also be filled in by internal processes looking for such things as consistency, as well as by external processes which add information from human and/or automated sources. An example currently used is to employ something like Freebase's data as a "gold standard" to evaluate the DBpedia data used to populate a KG. A minimal tree-selection sketch follows.
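As an illustration of the tree-selection step, here is a minimal sketch using the Steiner tree approximation in networkx (node names are hypothetical semantic types; this is not Knoblock's customized algorithm, just the generic idea).

```python
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

# A toy graph over candidate semantic types and the relations linking them;
# edge weights reflect how plausible each connection is (lower = better).
G = nx.Graph()
G.add_edge("Person", "Ad", weight=1)
G.add_edge("Ad", "PhoneNumber", weight=1)
G.add_edge("Ad", "Location", weight=2)
G.add_edge("Person", "Location", weight=4)
G.add_edge("PhoneNumber", "Location", weight=5)

# The semantic types actually found in the extracted data must all be connected.
terminals = ["Person", "PhoneNumber", "Location"]

# Approximate minimal tree connecting all terminal types.
tree = steiner_tree(G, terminals, weight="weight")
print(sorted(tree.edges()))
```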

We should again note that a key requirement for validatable application of knowledge is the ability to trace back from a KB to the original documents/Linked Data and, if it has been filled in, to other sources. As OKN involves different efforts, different extractor apps may cull the same or essentially the same facts from various web pages (e.g., multiple phone number extractors) but sometimes produce conflicting fact extractions. To produce a refined, high-quality KG, it may be necessary to investigate the origin of the different extractions and determine which should be added to the knowledge graph and which should be discarded to avoid conflicts. To address such conflict problems, it will be necessary to record provenance about every node and edge in the KG. A small conflict-resolution sketch follows.
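The following is a small, hypothetical sketch of one way conflicting extractions might be reconciled: each candidate fact keeps the set of extractors (its provenance) that produced it, and a fact is only admitted to the KG when enough independent extractors agree.

```python
from collections import defaultdict

# Candidate extractions: (subject, predicate, object) plus the extractor that produced it.
extractions = [
    (("ad:123", "hasPhone", "555-0100"), "regex_phone_extractor"),
    (("ad:123", "hasPhone", "555-0100"), "ml_phone_extractor"),
    (("ad:123", "hasPhone", "555-0199"), "html_table_extractor"),  # conflicting value
]

# Group provenance (which extractors support each fact).
support = defaultdict(set)
for fact, extractor in extractions:
    support[fact].add(extractor)

# Admit a fact only if at least two independent extractors agree (a toy threshold).
MIN_SUPPORT = 2
kg = {fact: sources for fact, sources in support.items() if len(sources) >= MIN_SUPPORT}

for fact, sources in kg.items():
    print(fact, "supported by", sorted(sources))
```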

From one perspective we might say that a KG is a cleaned up KB whose population (e.g., instances) is the realization of a more abstract and expressive ontology such as DOLCE or some piece of DOLCE.

The role of the ontology and its expressiveness facilitates validation of semantic relationships as well as supporting logical conclusions from the known facts of the KB. An ontology's explicit structure, for example, codifies the hierarchy among classes as well as among relations. And since an ontology like DOLCE is without instances, it is often not considered a KG. Another aspect to consider is the dynamic nature of KGs as they are populated with extracted facts supplemented by reasoners and alignment processes, which extends the idea of KBs. KGs may be very dynamic, while the underlying schema in an ontology is much less so.

This simple distinction and relation between KGs, KBs, and ontologies provides another view of contextualizing. There are interesting and different contextual questions that arise for each of these. We have already spoken of the textual relations that may be involved in understanding facts extracted from text as part of KG construction. Knowledge base development from a KG has more of a knowledge refinement context. An example is resolving some inconsistency of facts, perhaps considering temporal context because they were gathered at different times. Ontologies may serve a richer role of providing background knowledge contexts for KGs and KBs. They may fill in knowledge that connects facts or categorizes them in some way. They may be partitioned into MTs, which can be employed, as in the Mandela example, to distinguish facts by time interval, using an ist relation meaning that some KG proposition holds during a particular time interval. Of course, ontologies themselves have contexts as well as provide contexts. Ontologies have scope and change over time, but probably not as much as KGs. But it is likely that as more material is extracted, conflicts and contradictions will accumulate. This will require additional knowledge engineering of supporting ontologies over time, and how to do this feasibly is a research issue to consider.

Once the KG is constructed and visualization is enabled, it can be deployed. Each step may be complicated, and to be useful the extracted information and its content should be well cleaned of things that are out of context, well annotated, and documented. The structural relation between extracted elements is one aspect of context, as noted before, and so becomes suitable metadata about the information extracted. Tools such as Karma exist to support integration activities through unsupervised learning of semantic types, such as people versus places, along with descriptions of the sources (Knoblock et al., 2015). The confidence of such judgments may be affected by how much context is employed as part of this activity (Pham et al., 2016). Mapping to extant and new ontologies may be used as part of both open knowledge development and its annotation by metadata.

Contextual Examples and Approaches of interest to OKN

Microtheories

As part of two sessions we have considered a microtheory approach to handling context. The first was based on The Dimensions of Context-Space by Douglas B. Lenat (January 1997), and the second was a presentation by Cycorp principal scientist Charles Klein.

Microtheories are in part a response to the recognition that particular facts are conditional and apply only in some circumstances. To handle this, the large Cyc KB is divided into a large number of microtheories (MTs). Each MT is a collection of concepts and facts that are intended to apply to a particular realm of knowledge. The collection may be based on shared assumptions, shared topics, shared sources, or other features. But unlike the buzzing, booming Cyc KB as a whole, each MT is engineered to be free from monotonic contradictions within its scope. That is, an MT's assertions must be mutually consistent, so no hard contradictions are allowed. Any apparent contradictions, such as a bird that doesn't fly, must be resolvable by evaluation of the evidence visible in that microtheory. In contrast, because of the vast collection of MTs making up the Cyc KB, there are inconsistencies across the MTs that make up the whole KB. In Cyc, the interpretation of every fact and every inference is localized to a specific region of "context space". All conclusions that the inference draws involve only facts that are visible from that region of context space, that is, are stated either in an MT or one of its more general MTs. This reflects the fact that MTs are organized in a hierarchy and can inherit from each other. In some domains the leaf MT structure is as deep as 50 levels, which is perhaps an order of magnitude deeper than the leaf depth of a KG and some of its supporting ontologies.

Each Cyc MT is also a first-class ontological object with a name, via an Mt suffix string (e.g. #$geographicMt), that is a regular constant. That is, MT constants contain the string "Mt" by convention. So we may have a hierarchy with a first-class general MT #$geographicMt and a sub-type #$USgeographicMt, as sketched below.
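A minimal sketch (in Python, with made-up MT names and assertions) of the visibility idea: each microtheory holds its own assertions and inherits whatever is visible in its more general parents. It illustrates the structure only, not Cyc itself.

```python
class Microtheory:
    """A toy microtheory: local assertions plus inheritance from more general MTs."""
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)        # more general MTs (genlMt-style links)
        self.assertions = set()             # facts asserted directly in this MT

    def assert_fact(self, fact):
        self.assertions.add(fact)

    def visible_facts(self):
        """Facts stated in this MT or in any of its more general MTs."""
        facts = set(self.assertions)
        for parent in self.parents:
            facts |= parent.visible_facts()
        return facts


# Hypothetical hierarchy mirroring the #$geographicMt / #$USgeographicMt example.
geographic_mt = Microtheory("geographicMt")
geographic_mt.assert_fact(("Earth", "hasContinent", "NorthAmerica"))

us_geographic_mt = Microtheory("USgeographicMt", parents=[geographic_mt])
us_geographic_mt.assert_fact(("Colorado", "locatedIn", "UnitedStates"))

# Reasoning in USgeographicMt sees its own facts plus the inherited ones.
print(us_geographic_mt.visible_facts())
```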

As an aid to knowledge management, MTs distinguish the origins of the facts and provide meta-statements about the facts. Several advantages are claimed for an MT approach:

  1. As part of knowledge engineering, MTs allow us to focus development of the KB by gathering the most relevant information together. MTs help a knowledge engineer (KE) to focus on a coherent subset of information, rather than exhaustively searching through a vast KB field of knowledge.

Existing MTs may be similar and help, as a package, to guide work. The coherence of an MT ultimately improves inference, since reasoning is about a smaller group of the most relevant entities. This reduces the search space and improves efficiency. From an engineering perspective, making assertions in microtheories allows the background assumptions that apply to a particular domain to be stated only once, and makes it far easier to construct a logically consistent KB (Taylor et al., 2007).

  2. Microtheories enable a KE to use terse assertions. The earlier example, an MT with assertions for "Mandela was a political prisoner", illustrated this: the MT provided the basis for this being true for a time period. This means that we don't need to explicitly build such time contextualization into the assertions themselves. Taken as a whole, MTs make knowledge engineering more efficient.
  3. Microtheories allow us to cope with the reality that there will be global inconsistency in a large KB/KG. In building a KG at the scale of OKN we will need to cover many different interpretative points of view from which open knowledge is built, along with different facts over time and in different places. All of which means much inconsistency is inevitable and, if left unresolved, will make useful applications impossible.

It is worth noting that some similar attempts have been made outside of Cyc to develop partitioned, contextualized knowledge repositories. One has been proposed using a set of OWL 2 knowledge bases which are embedded in a context by a set of qualifying attributes (time, space, topic, etc.). These are like Cyc MTs in that they specify the boundaries (the sub-domain in which a certain statement holds) within which the knowledge base is assumed to be true (Serafini, 2012).

RDF++ and the RDF Singleton Property approach

Vinh Nguyen (formerly at the Kno.e.sis Center, now at NLM) is one of the OKN community people trying to directly address contextualizing some aspects of Linked Data for open knowledge, in her talk "CKG Portal: A Knowledge Publishing Proposal for Open Knowledge Network." Vinh proposes a lightweight design choice for OKN in which RDF triples are contextualized as RDF++. In this design some metadata is added to define a small set of contexts for an RDF fact. These include the provenance, time, information location, trust score (the probability of the target fact triple), etc., represented as metadata RDF triples. Among the advantages of this approach of leveraging RDF to document RDF context are that it:

  • Has a model theory
  • Reasoning is supported
  • Every triple has a global identifier which enables access
  • Tracking of inferred triples is enabled
  • Since sources are tracked one can give credits back to the publishers which in turn motivates Linked Data publication

More information on what is called the RDF Singleton Property approach is in Nguyen (2017); a rough sketch of the idea follows.
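As an illustration of the general idea (a minimal sketch with rdflib; the example URIs and the singletonPropertyOf namespace here are placeholders rather than the exact vocabulary of the proposal), a per-statement property instance carries the metadata for one fact:

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import XSD

EX = Namespace("http://example.org/okn/")   # hypothetical data namespace
SP = Namespace("http://example.org/sp/")    # placeholder for the singleton-property vocabulary

g = Graph()

# A generic property and a singleton instance of it used for exactly one statement.
g.add((EX["isPresidentOf#1"], SP.singletonPropertyOf, EX.isPresidentOf))

# The contextualized fact itself, stated with the singleton property.
g.add((EX.Mandela, EX["isPresidentOf#1"], EX.SouthAfrica))

# Metadata (time, provenance, trust) attached to the singleton property,
# i.e. to this particular statement rather than to the generic relation.
g.add((EX["isPresidentOf#1"], EX.validFrom, Literal("1994", datatype=XSD.gYear)))
g.add((EX["isPresidentOf#1"], EX.validUntil, Literal("1999", datatype=XSD.gYear)))
g.add((EX["isPresidentOf#1"], EX.source, Literal("https://en.wikipedia.org/wiki/Nelson_Mandela")))
g.add((EX["isPresidentOf#1"], EX.trustScore, Literal(0.95, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```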

As noted by two of our speakers, Amit Sheth (Kno.e.sis Center) and Vinh Nguyen (formerly at the Kno.e.sis Center, now at NLM), in "Evolving a Health KG", the modern healthcare system involves Big Data as a result of a complex data-driven mesh which relies on continuous patient monitoring, data streaming, and sharing. If this diverse data, such as that stored in patients' electronic health records (EHRs), is opened, it may be leveraged using advanced big data analytics to provide essential health services to patients. The totality of available information that may be assembled into a KG includes past diagnoses, hospital visits, interactions with doctors, lab results (e.g., X-rays, MRIs, and EEG results), past medications, treatment plans, and post-treatment complications. Useful, even patient-centered, applications might be built on top of a patient-accessible KG and provide a myriad of health services to patients, such as real-time monitoring for identifying health anomalies.

One particular area seems to provide a starting opportunity for an open knowledge approach:

  1. Understanding EMR/PHR Records: EMR (Electronic Medical Records) and PHR (Personal Health Records) are used to document patient visits. They capture the digital footprint representative of the health of individuals, including fine details such as chronic conditions, prescribed medications, demographic details, temporal aspects of disease, and medication outcomes. The meaningful use of documented information is often restricted due to the relatively unstructured nature of how the information is represented. The KG building techniques previously discussed, as well as supporting tools, could extract more structured knowledge, and there are several aspects of data and knowledge context to take into account. One is the assembling of a meaningful health picture from the wealth of information. Extracted data such as diagnoses and treatments need to be related in a proper situational context, and simple templates formalized as ontologies may play this integrative role. Provider intentions as part of treatment and interventions may also be an important contextual ingredient.

One particular application that might be pursued with a foundational health KG is Health Risk Assessment, an application that would provide actionable information to patients. Risk assessment of a person's health condition would be based on observational data and reports collected from physical, cyber, and social sources, and might be presented as personalized risk profiles and recommended actions. Patient-Generated Health Data (PGHD) is a new ingredient, since it may primarily be generated by small, smart personal devices (worn sensors and environmental sensors) recording fine-grained patient health information (e.g. heart rate, blood pressure) over time. Relevant research on the role of context in healthcare includes Sordo et al. (2015).

Preliminary Findings

Application Focus, Knowledge Breadth and Depth

Our OKN Summit sessions described the basic approach of extracting information from web and public resources and leveraging lightweight methods and tools, such as Schema.org, to create a network of open knowledge. This knowledge is then a resource for applications built on top of it, and this is a main point of investing in the effort to create deposits of open knowledge. It also means that, in current research, the knowledge is well scoped, leveraging shallow but broad information sources, as is the case for Knoblock's work on human trafficking. This also often means that not a great many contextual issues involving deep ontologies are taken into account. Because an application is being served, the ontology used to provide a schema for the KG may be relatively simple and not based very extensively on prior ontology work. This can be expected to change as deeper problem areas such as healthcare are targeted.

Source Context

We further discussed how NLP and ML approaches can be used to build knowledge graphs for public purposes and that source context as well as ontology context needs to be taken into account moving forward. It is obvious that the role of existing and emerging Semantic Web technologies and associated ontologies is central to making OKN a viable enterprise.

Big Knowledge, Single vs. Multiple Knowledge Bases and Ontologies

To be consistent with the Semantic Web approach, early OKN-like research (Knoblock et al., 2017) represents extracted data in some core, common ontology, and defines URIs for all entities, which may be linked to external but familiar Semantic Web resources such as GeoNames for locations. Single ontologies are not likely to be suitable as work expands and more contexts are encountered, which will require multiple ontologies. Big knowledge, with its heterogeneity and depth complexity, may be as much of a problem as Big Data, especially if we are leveraging heterogeneous, noisy, and conflicting data to create knowledge. It is hard to imagine that this problem can be avoided. The ontology experience is that, as a model of the real world, we need to select some part of it based on interest and a conceptualization of that interest. Differences of selection and interpretation are impossible to avoid and, as noted, different external factors will generate different contexts for any intelligent agent doing the selection and interpretation.

How Formal?

Formalization, as KGs or in any other form, cannot reasonably reach complete coverage, i.e., contain an understanding of each and every process, situation, and entity in the universe. But we can put understanding and knowledge in context, and one approach to handling this discussed at the Summit is to employ the microtheory (MT) approach of Cyc. MTs distinguish the origins of the facts and provide meta-statements about the facts. In Cyc, the interpretation of every fact and every inference is localized to a specific region of "context space", and all conclusions that the inference draws involve only facts that are visible from that region of context space, that is, are stated either in the leaf microtheory or one of its more general MTs.

Use of Microtheories

To date, outside of Cycorp and its Cyc Knowledge Base structured as a hierarchy of MTs, there seems to be little consistent use of MTs. OKN could employ microtheories to frame a small number of very broad reasoning contexts, starting with high-level, abstract knowledge, which fan out into progressively more specific contexts for application use. A small research project to test out this idea might be valuable to provide some focus to early OKN work, and to test the effort needed and the benefits noted.

Semantic Web and Other Sources

As part of a Linked Data approach, OKN knowledge may be published in the lightweight form of RDF and RDF graphs. Heavier semantics is likely to be needed. For OKN it then becomes particularly important to understand how to do this, and such efforts may affect resulting ontology development, use, and maintenance. An important goal of OKN going forward is to identify some of the major research problems, such as the scope, nature, and precision with which context should be specified when information is extracted. Both lightweight approaches, using RDF to document extractions, and "heavier" approaches, using Cyc microtheories to document such specifications, have been discussed and need to be investigated further as OKN proceeds. While lightweight efforts may be needed to start, experience in some areas suggests that more formal semantics will probably need to be added incrementally to avoid semantic issues and provide better reuse. This may be characterized as a middle way.

Application of Focused Meanings of Context

One issue for the development and application of ontologies is that context is seldom "explicitly stated" because this is difficult to do. But the need for a formal mechanism for specifying a context has long been recognized, and some approaches to it, such as Cyc's microtheories, descend from J. McCarthy's tradition of treating contexts as formal objects over which one can quantify and express first-order properties. Another is called Description Logics of Context (DLCs), which extend Description Logics (DLs) for context-based reasoning (Klarman et al., 2016). DLCs are founded on two-dimensional possible world semantics, where one dimension represents a usual object domain and the other a domain of contexts. In this approach we have two interacting DL languages, the object language and the context language, interpreted over their respective domains.

Contextual Knowledge Engineering

In all such efforts we expect to arrive at an understanding of contextualization that can be incorporated into ontological engineering practices to achieve richer ontologies. To some extent, lightweight efforts to associate contextual information are underway. Besides the RDF++ idea, other examples include RDF quadruples, named graphs, annotated RDF, and contextualized knowledge repositories. These are still relatively new paradigms which introduce a new factor into knowledge engineering practice. That is, along with representing individuals, concepts, properties, and their relations, we also need to document some relevant selection of contexts, and we need to section ontological knowledge of entities, concepts, and the like between and among these contexts (Homola et al., 2010). In light of all of this, future work will need to refine the tools and technology to make it easier and faster to build and validate knowledge graphs as well as new applications based on this knowledge. One important goal, still challenging, is to add knowledge into contextualized knowledge bases, perhaps structured as MTs, quickly and with little or no human interaction (Taylor et al., 2007).

Research Issues

A preliminary view of a more contextualized OKN raises several issues and perspectives to investigate.

Selecting Sources with Context and Annotation

There are many data sources and types to consider as sources for an OKN KG, even within the Semantic Web. As a research tactic, low-hanging fruit might be the harvesting of triplestores and Linked Data, since their structuring helps fact extraction. To some degree these come with some contextual information, such as the URL location of the "fact."

As noted, however, producing a refined, high-quality KG may require investigating the origin of the different data targets and determining which extractions should be added to the knowledge graph and which should be discarded to avoid conflicts. In some cases classification of the extracted information is important, and some sources may be better documented with things like metadata to make this an easier task. However, if a hierarchy of MTs is used, deciding where to fit information into the hierarchy may require a fair amount of contextual information. To address such conflict problems, it will be necessary to record provenance about every node and edge in an OKN KG. Provenance at this level of detail may be difficult to manage, and sophisticated approaches like the use of PROV-O have not been routinely used in KG construction. Faithful use of PROV-O would capture information about the data/information entity, the agent that has modified the data entity, and the activities of the modification, as sketched below.
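A minimal sketch (Python with rdflib; the fact, activity, and source URIs are hypothetical) of attaching PROV-O style provenance, entity, activity, and agent, to one extracted statement:

```python
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/okn/")      # hypothetical namespace for KG content

g = Graph()

# A node standing for one extracted fact (e.g. reified or held in a named graph).
fact = EX["fact/muir-glacier-retreat"]
g.add((fact, RDF.type, PROV.Entity))

# The activity that produced it: a run of a (hypothetical) table extractor.
run = EX["activity/extraction-run-42"]
g.add((run, RDF.type, PROV.Activity))
g.add((fact, PROV.wasGeneratedBy, run))
g.add((run, PROV.endedAtTime, Literal("2018-04-18T17:14:00", datatype=XSD.dateTime)))

# The agent (extraction tool) and the source page the fact was derived from.
tool = EX["agent/html-table-extractor"]
g.add((tool, RDF.type, PROV.SoftwareAgent))
g.add((run, PROV.wasAssociatedWith, tool))
g.add((fact, PROV.wasDerivedFrom, URIRef("https://climate.nasa.gov/images-of-change")))

print(g.serialize(format="turtle"))
```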

Data annotated with Schema.org may also be considered a useful and opportunistic place to start investigations and tests. But limitations of the semantics of Schema.org have been noted before.

Depth and Formality of Representation

Another issue of interest to the semantic community concerns the question of what the best representation of the data is for this general idea of a knowledge graph (Paulheim, 2017).

Along with representation there is the issue of depth and detail. The evolutionary path should be toward a rich and detailed ontology. What is possible here with existing approaches is an open question, as detail may come with conflicting assumptions due to data coming from varying contexts, as well as data refinement using ontologies and processes with different assumptions.

Handling Context During KG Building

Context will also affect how KG building deals with noisy, missing, and incomplete information. As mentioned, converging and redundant evidence is needed before a fact is believed and incorporated into a KG. There are many issues here, including, in the extreme, how to protect a KG from targeting by a pernicious source. Because people may differ in their beliefs, there may be multiple perspectives about certain entities and their relations. One can imagine large knowledge spaces growing around such differences as we learn how to safely improve the quality of what is learned and organized for the OKN knowledge bases. Automated cleaning of data remains tentative, and manual curation of such data is often needed. Machine learning based on these experiences needs to be pursued, but machine explanation of its processing also seems like a needed intermediate step. While KG construction in small domains, along with some heuristics for fine-tuning, is a maturing field, the job of KG refinement in the face of extensive growth as coverage expands is not well developed. An ambitious OKN that captures at least a large proportion of common entities and relations will require considerable refinement of KGs over time, especially as low-hanging-fruit areas give way to more difficult subjects.

Since a major goal of OKN is to make rich knowledge openly available to a wide audience, an issue is how to organize and store the data-based knowledge for efficient access. A lightweight path might use RDF triplestores, but a richer representation may be achievable and useful. A small query sketch follows.
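As an illustration of the lightweight access path, here is a minimal sketch of querying a small RDF store with SPARQL via rdflib; the graph content and vocabulary are hypothetical examples in the spirit of those used above.

```python
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/okn/")   # hypothetical namespace
g = Graph()
g.add((EX.MuirGlacier, EX.locatedIn, EX.Alaska))
g.add((EX.KalsteniusIcefield, EX.locatedOn, EX.EllesmereIsland))
g.add((EX.MuirGlacier, EX.retreatedSince, Literal("19th century")))

# A simple SPARQL query over the lightweight triplestore, using a property path
# to cover either location relation.
query = """
PREFIX ex: <http://example.org/okn/>
SELECT ?s ?o WHERE { ?s ex:locatedIn|ex:locatedOn ?o . }
"""
for row in g.query(query):
    print(row.s, row.o)
```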

A useful ontology for documenting source contexts is PROV-O, which can guide the documentation and understanding of provenance metadata (Belhajjame et al., 2013). It has been used widely but has not been systematically investigated as part of OKN efforts.

Enhancing Ontology Engineering Practices

As noted as part of the findings, we will need to arrive at a focused understanding of contextualization that can be incorporated into ontological engineering practices. For efforts like OKN this should include guidance and best practices for the extraction and building of KGs, as well as for how to clean, refine, and organize them with suitably robust and rich KBs and ontologies. In light of this, future work will need to refine a suite of tools and technologies to make the lifecycle of OKN KGs easier and faster to build. We will also need to support validating knowledge graphs as well as the new applications built from KG knowledge.

References:

  1. Bärenfänger, Maja, et al. "OWL ontologies as a resource for discourse parsing." LDV Forum. Vol. 23. No. 1. 2008.
  2. Belhajjame, K., et al. "PROV-O: The PROV Ontology. W3C Recommendation." World Wide Web Consortium (W3C), April 2013. https://www.w3.org/TR/2013/REC-prov-o-20130430/
  3. Ehrlinger, Lisa, and Wolfram Wöß. "Towards a Definition of Knowledge Graphs." SEMANTiCS (Posters, Demos, SuCCESS). 2016.
  4. Evans, James. Optimization algorithms for networks and graphs. Routledge, 2017.
  5. Hayes, Pat. "Contexts in context." Context in knowledge representation and natural language, AAAI Fall Symposium. 1997.
  6. Homola, Martin, Luciano Serafini, and Andrei Tamilin. "Modeling contextualized knowledge." Procs. of the 2nd Workshop on Context, Information and Ontologies (CIAO 2010). Vol. 626. 2010.
  7. Jakobson, Gabriel. "On modeling context in Situation Management." Cognitive Methods in Situation Awareness and Decision Support (CogSIMA), 2014 IEEE International Inter-Disciplinary Conference on. IEEE, 2014.
  8. Kejriwal, Mayank "Context-Rich Social Uses of Knowledge Graphs" Talk to Ontology Summit, Feb. 14 2018
  9. Knoblock, Craig A., et al. "Accurately and reliably extracting data from the web: A machine learning approach." Intelligent exploration of the web. Physica, Heidelberg, 2003. 275-287.
  10. Klarman, Szymon, and Víctor Gutiérrez-Basulto. "Description logics of context." Journal of Logic and Computation 26.3 (2016): 817-854.
  11. Krisnadhi, Adila, et al. "The GeoLink modular oceanography ontology." International Semantic Web Conference. Springer, Cham, 2015.
  12. Knoblock, Craig A., and Pedro Szekely. "A scalable architecture for extracting, aligning, linking, and visualizing multi-int data." Next Generation Analyst III. Vol. 9499. International Society for Optics and Photonics, 2015.
  13. Mann, W., and Thompson, S. "Rhetorical structure theory: Toward a functional theory of organization." Text 8(3): 243-281, 1988.
  14. Mann, W. C. and Taboada, M. (2005). RST – Rhetorical Structure Theory. Web page.
  15. Mika, Peter, and Aldo Gangemi. "Descriptions of social relations."
  16. Nguyen, Vinh Thi Kim. Semantic Web Foundations for Representing, Reasoning, and Traversing Contextualized Knowledge Graphs. Doctoral Dissertation, Wright State University, 2017.
  17. Paulheim, Heiko. "Knowledge graph refinement: A survey of approaches and evaluation methods." Semantic Web 8.3 (2017): 489-508.
  18. Pham, Minh, et al. "Semantic labeling: a domain-independent approach." International Semantic Web Conference. Springer, Cham, 2016.
  19. Serafini, Luciano, and Martin Homola. "Contextualized knowledge repositories for the semantic web." Web Semantics: Science, Services and Agents on the World Wide Web 12 (2012): 64-87.
  20. Sordo, Margarita, Saverio M. Maviglia, and Roberto A. Rocha. Modeling Contextual Knowledge for Clinical Decision Support. Tech Report PHS-2015-MS, 2015.
  21. Taylor, Matthew E., et al. "Autonomous Classification of Knowledge into an Ontology." FLAIRS Conference. 2007.