
Ontolog Forum

Charles Klein "Context Models for Production of Vital Information"

Talk given as part of the Ontology Summit 2018 on 28 March 2018

The talk slides are available at http://bit.ly/2Grcfj2

The video recording of the session is available at http://bit.ly/2E4uKIf

Transcribed by Thomas Lörtsch

[00:48:00] I want to talk about the role context can play specifically in big data productions, where there is a hard limit on an open and ### valued information market providing vital information, information on which you can critically depend, due to the complexity of all the transforms and all the provenance issues in the chains of permutations that information goes through to get to the final data, and … can play an essential role in those.

[… skipping some minutes of talk about ontologies and an introduction to microtheories in Cyc]

[00:54:21] Contexts can have these useful KR features, but most of them could be packed into … a lot of things. Many uses of context, even that very general 'ist' framework that Cyc has, are facilitating features of the representational language: you could highly complicate the expression of individual sentences and replicate and regain a lot of what you gained with context. But there is a functional articulation (###) of what context can do that has an important application to this problem of big data productions, which is that you cannot rely on the output of data that has gone through thousands of transforms, replications, applications of logic and so forth. It is not enough to say what the most recent source of it was. You need the total of that provenance trace or you cannot have any trust in the final data.
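[Illustration added for this transcript: a minimal Python sketch, with all names hypothetical, of the two ideas above, context-qualified assertions in the spirit of Cyc's (ist <context> <sentence>), and derived statements that carry their total provenance trace rather than only their most recent source. It is a sketch of the shape of the idea, not Cycorp's implementation.]

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Assertion:
    context: str                      # microtheory/context the sentence holds in (ist-style)
    sentence: str                     # e.g. "(isa Trade-123 BondTrade)"
    provenance: Tuple[str, ...] = ()  # total trace: sources and transforms, oldest first

def derive(premises: Tuple[Assertion, ...], context: str, sentence: str, rule: str) -> Assertion:
    """A derived assertion inherits the union of its premises' traces, plus the rule applied."""
    trace = tuple(dict.fromkeys(step for a in premises for step in a.provenance)) + (rule,)
    return Assertion(context, sentence, trace)

raw = Assertion("EquitiesFeedMt", "(clearingHouse Trade-123 LCH)",
                ("feed:equities-2018-03-28",))
derived = derive((raw,), "RiskAnalysisMt", "(settlementRiskParty Trade-123 LCH)",
                 "rule:clearing-to-risk")
print(derived.provenance)   # ('feed:equities-2018-03-28', 'rule:clearing-to-risk')
```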

[00:54:34 Talking about interpretation models in CycL, which are one partitioning of contexts among others, and which are templates over tabular structures without meaning that associate or translate those structures into statements in CycL. They are templates or recipes for looking at any cell content and translating it into a CycL statement. Gives an example of a bond trade with some clearing-house information added to it, 4 or 5 steps away from the trade, but an essential part of the information about that trade. Interpretations make sure that such information isn't lost in the process and that content of a given type is always entered into Cyc in exactly the same way, in a canonical representation in natural CycL terms. The interpretation model also contains access codes etc. and is tailored to a specific source of information. They call that process semantic normalization. Such an interpretation is called an SKS context, a Semantic Knowledge Source context, and is a microtheory tailored to a specific source.]
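[Illustration: a sketch, in Python with invented names and a made-up row layout, of what an interpretation model / SKS context does, a per-source template that turns a tabular row (here a bond trade with its clearing-house information) into canonical CycL-style sentences, so that content of a given type always enters the knowledge base the same way.]

```python
# Hypothetical per-source template: one CycL-style pattern per column of interest.
BOND_TRADE_TEMPLATE = [
    "(isa {trade_id} BondTrade)",
    "(tradeNotionalAmount {trade_id} (USDollarFn {notional}))",
    "(counterparty {trade_id} {counterparty})",
    "(clearedBy {trade_id} {clearing_house})",   # clearing-house info stays attached to the trade
]

def interpret_row(row: dict, template=BOND_TRADE_TEMPLATE) -> list[str]:
    """Semantic normalization of one source row into canonical CycL-style statements."""
    return [pattern.format(**row) for pattern in template]

row = {"trade_id": "Trade-123", "notional": "5000000",
       "counterparty": "CreditSuisse", "clearing_house": "LCH"}
for sentence in interpret_row(row):
    print(sentence)
```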

[00:57:58] This is a case study on how bad the situation is in banks, and I think it is the same for the OKN and any large effort to bring together all the knowledge, to approach this limit of … as much as possible and as much as is rational and implicitly [edit: not sure I understood this right] possible. It will never get past the trivial uses of that information until there are assurances, mechanical guarantees, of the correctness of what is an incredibly, incomprehensibly hard transform problem.

[00:58:33] Slide 8, showing a visualization of a highly complex social network graph. This is a fair representation of what's going on within Goldman or Credit Suisse. Goldman, for example, has 100,000 databases and 2 billion columns (not rows!). In addition to those 100,000 databases there are another 100,000 transformation nodes, so data comes from a database to a transformer, which does essentially first-order-logic transformations on it or statistical math, pumps it to another one, and so on. Those flows are absolutely as complex as this: hundreds or thousands of nodes and chains of transforms that resolve into critically important, "life depends" risk data. The OKN, and any effort to unify all knowledge, faces _this_. It will never be used for anything except toy applications and ad placement until we have a mechanical guarantee that it is valid relative to the input data. So the proposal is that we need a mechanism, and I want us to see that there is an implementation path for a system that fully applies it, via context, to ensure locally that every transform is correct, and thereby globally that the final outputs of all the processed inputs are guaranteed to be correct, and also a fair provision for end-user access to that total provenance trace, subject to their own rational suite.
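[Illustration: a simplified sketch of the "locally correct, therefore globally trustworthy" proposal, a DAG of databases and transformation nodes where each transform is checked against a local contract before its output propagates, and every output carries the full upstream provenance trace. All node names and checks are hypothetical.]

```python
from graphlib import TopologicalSorter

# node -> (transform function, upstream node names); toy stand-ins for databases and transformers
GRAPH = {
    "trades_db":   (lambda inputs: {"trades": 3},    []),
    "clearing_db": (lambda inputs: {"clearings": 3}, []),
    "join":        (lambda inputs: {**inputs["trades_db"], **inputs["clearing_db"]},
                    ["trades_db", "clearing_db"]),
    "risk_report": (lambda inputs: {"exposed": inputs["join"]["trades"]}, ["join"]),
}

def locally_valid(node: str, output: dict) -> bool:
    """Placeholder for a per-node, mechanically checked contract (e.g. counts reconcile)."""
    return all(v >= 0 for v in output.values())

results, provenance = {}, {}
ts = TopologicalSorter({n: deps for n, (_, deps) in GRAPH.items()})
for node in ts.static_order():
    fn, deps = GRAPH[node]
    output = fn({d: results[d] for d in deps})
    assert locally_valid(node, output), f"transform {node} failed its local check"
    results[node] = output
    provenance[node] = sorted(set(deps) | {p for d in deps for p in provenance[d]})

print(results["risk_report"], provenance["risk_report"])
# {'exposed': 3} ['clearing_db', 'join', 'trades_db']
```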

[01:01:19 Adding a new data source typically involves two kinds of modelers: source owners, who are familiar with the source and all its intricacies, inherent transformations etc. and who do a one-time entry of that data description to create a Cyc interpretation model, and content modelers, who are just doing what ontologists do. So the first step is information generation services that normalize the sources into CycL. The semantic normalization is the hard part. Ingesting the data can then be massively parallelized, up to 10^12 statements per second; that will never be the bottleneck. Then there are powerful interactive modeling tools to edit the model, inspect the model, browse it, see what the natural language definition is, etc. Then there is a set of information provisioning services which pipe the right content to the right people, and this context framework has a solution to the problem of securitizing the consequences of combinations of structured information, an extremely hard problem, but I believe it has a tractable and provable solution; that's too much for this talk.]
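[Illustration: a minimal sketch, with hypothetical names, of why ingestion parallelizes so easily once semantic normalization has produced independent statements: rows can be interpreted and loaded by any number of workers with no coordination between them.]

```python
from concurrent.futures import ProcessPoolExecutor

def normalize_chunk(rows: list) -> int:
    """Stand-in for an information-generation service applying an SKS template to a chunk of rows."""
    statements = [f"(isa {row['trade_id']} BondTrade)" for row in rows]
    # ...load the statements into the knowledge base here...
    return len(statements)

def parallel_ingest(rows: list, workers: int = 8, chunk_size: int = 10_000) -> int:
    """Chunks are independent after normalization, so throughput scales with worker count."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(normalize_chunk, chunks))

if __name__ == "__main__":
    rows = [{"trade_id": f"Trade-{i}"} for i in range(100_000)]
    print(parallel_ingest(rows))   # 100000
```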

[01:04:20 Raw computing power is not the problem (slide showing a shiny new Cray being rolled into the hallway)]

[01:05:30] Once we had all the data nicely in CycL, say about half a billion CycL assertions, they wanted to see another billion derived statements, and they needed that done intraday, and that is a hard problem for an ontology language. But we studied the compilation of CycL rules into SPARQL and executed them on a Cray X-MP 4TB supercomputer, which was able to generate the 1 billion consequences in about an hour. So that did show there is a path to scalability with supercomputers, but that was a 6-million-dollar machine and it still puts ontologies firmly in the space of the exotic. But subsequent testing found that certain of these industrial databases are massively parallel columnar stores that are really only used by the likes of Goldman or Facebook, for absolutely highly structured, well-interpreted data that must be subjected to exactly these kinds of transformations. The equities data that Goldman produces is, I think, 68 million attributes of 78 million positions per day, that's millions of attributes across those positions, and describing that is done by this sort of machinery: hundreds of thousands of servers, linearly scaling hardware, executing these transforms. Currently we have been looking into compiling CycL, as much as possible, into SQL executable by these highly scalable industrial columnar stores, which lay out their data much like an ontology does. That's why they are well suited to this sort of whole-stack tuples-per-second type of operation [01:07:17 ???]. So the claim is that for arbitrary logical enrichment by ontologies you can get to the dream that anything you can state can be applied.
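[Illustration: a toy sketch of the rule-compilation idea, not Cycorp's actual compiler. A restricted CycL-style rule is translated into SQL that a columnar store could execute: each antecedent predicate becomes a table, shared variables become join conditions, and the consequent becomes an INSERT ... SELECT. The one-table-per-binary-predicate layout is an assumption made for the example.]

```python
# Hypothetical rule: (implies (and (isa ?x BondTrade) (clearingHouse ?x ?ch))
#                              (settlementRiskParty ?x ?ch))
RULE = {
    "antecedents": [("isa", "?x", "'BondTrade'"), ("clearingHouse", "?x", "?ch")],
    "consequent":  ("settlementRiskParty", "?x", "?ch"),
}

def compile_rule_to_sql(rule: dict) -> str:
    """Translate a conjunctive binary-predicate rule into a single INSERT ... SELECT statement."""
    bindings, tables, conditions = {}, [], []
    for i, (pred, arg1, arg2) in enumerate(rule["antecedents"]):
        alias = f"t{i}"
        tables.append(f"{pred} AS {alias}")
        for col, arg in (("arg1", arg1), ("arg2", arg2)):
            if arg.startswith("?"):                    # variable: bind or join
                if arg in bindings:
                    conditions.append(f"{bindings[arg]} = {alias}.{col}")
                else:
                    bindings[arg] = f"{alias}.{col}"
            else:                                      # constant: filter
                conditions.append(f"{alias}.{col} = {arg}")
    pred, a1, a2 = rule["consequent"]
    return (f"INSERT INTO {pred} (arg1, arg2)\n"
            f"SELECT {bindings[a1]}, {bindings[a2]}\n"
            f"FROM {' CROSS JOIN '.join(tables)}\n"
            f"WHERE {' AND '.join(conditions)};")

print(compile_rule_to_sql(RULE))
```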


[01:07:45] If you use context that way (###) to control the model, so that you assure that, given finite input, the closure is a finite set of axioms, and in agreement with the OKN principles that data must be accessible, that information generated, however valuable, must be accessible in its totality, then you can use formal methods to verify that you have complete and correct content. If you control, through some limitations on what can be stated in the model, that you don't end up with infinities, you can introduce mathematically verified algorithms for completeness and correctness of normalization and transformation at each local node in the world-wide knowledge propagation graph, and thereby [guarantee] the validity and soundness of the total.
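[Illustration: a heavily simplified sketch of the finiteness point. If rules can only derive sentences over terms already present in the finite input, forward chaining reaches a fixed point, and the resulting closure is a finite object that can then be checked mechanically. All fact and rule names are invented.]

```python
FACTS = {("isa", "Trade-123", "BondTrade"), ("clearingHouse", "Trade-123", "LCH")}

# Rules as (antecedent patterns, consequent pattern); '?'-prefixed strings are variables.
RULES = [
    ([("isa", "?x", "BondTrade"), ("clearingHouse", "?x", "?ch")],
     ("settlementRiskParty", "?x", "?ch")),
]

def matches(pattern, fact, env):
    """Extend the variable binding env if fact matches pattern, else return None."""
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            if env.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return env

def closure(facts, rules):
    """Forward-chain to a fixed point; terminates because no rule introduces new terms."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            envs = [{}]
            for pattern in antecedents:
                envs = [m for env in envs for fact in facts
                        if (m := matches(pattern, fact, dict(env))) is not None]
            for env in envs:
                derived = tuple(env.get(term, term) for term in consequent)
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts

print(closure(FACTS, RULES) - FACTS)   # {('settlementRiskParty', 'Trade-123', 'LCH')}
```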

[...skipping a lot of Q&A…]