Solving the Big Problem: A Pragmatic Approach Towards Information Management at NASA
Andrew Schain, NASA HQ, Government Emerging Technology Subcommittee, Washington, DC 07/17/2007

Agenda
• Background
• As Built & Rationale for Design
• The Bigger Issues
• What’s Next

Background
It all started innocently enough. I was at a Semantic Technology conference 2 years ago, listening to a W3C talk about FOAF. Jeanne Holm (JPL) was in the audience too and told us of a problem she was having:
– She didn’t have the $3M to build a new expertise locator.
– Even so, there were anticipated issues with information integrity and curation.
– Customers’ expectations were demanding and their needs were real.
I opened my big mouth and said:
– NASA already had a directory that could populate a FOAF model, and
– probably all the data we needed about projects to populate a DOAP model.
We could reuse the information NASA already had and apply it to this new customer requirement!!!
Bijan Parsia, Kendall Clark, Mike Grove, Evren Sirin (now Clark & Parsia, LLC) and other MINDLAB folks built a prototype in 9 weeks and… Presto! A project is born!!!

Overview
Enable efficient expertise location by:
• Integrating already existing but disparate data sources,
• Providing a dynamic UI for exploring the information integration,
• Visualizing social networks to facilitate communication,
• Supporting incremental integration and incremental annotation.

POPS (People, Organization, Projects & Skills) Capabilities
• Provides a single integrated view of:
– People, competencies, project participation, publications.
– NASA location information.
• Visualizes perspectival social networks between people.
• Allows for local or sharable annotations of integrated info.
• Aggregates info into a query-able, reusable service.

Jspace is a “polyarchical query builder” for federated RDF stores.
• Folks can learn a query language (QL), but why? Get the machine to build queries based on customary user input: browsing.
• Browsing is better than searching!!
• Started as a clone, then a mass extension, of mSpace from the University of Southampton, UK.
• Be different? Then look, feel, and act different!
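To make "get the machine to build queries" concrete, here is a minimal sketch of turning a sequence of browsing selections into a query. It is not Jspace's actual code: it emits SPARQL purely for illustration (the query language Jspace generates is not shown on this slide), and the `example.org` predicate URIs and the `build_query` helper are hypothetical. Only the FOAF namespace is taken from the project background.

```python
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"


def escape(value):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_query(selections):
    """Build a SPARQL query from (predicate URI, selected value) pairs.

    Each click in the browser narrows the result set, so every
    selection becomes one more triple pattern on the same variable.
    """
    patterns = ["  ?person a <%s> ." % FOAF_PERSON]
    for predicate, value in selections:
        patterns.append('  ?person <%s> "%s" .' % (predicate, escape(value)))
    return "SELECT DISTINCT ?person WHERE {\n" + "\n".join(patterns) + "\n}"


# Hypothetical browsing session: the user picks a center, then a skill.
query = build_query([
    ("http://example.org/pops#center", "JPL"),
    ("http://example.org/pops#skill", "RDF"),
])
print(query)
```

The user never sees the query text; they only see the list of people shrink as they browse.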

Goes-innas & Goes-outtas @ 10,000 ft

Goes-outtas & Goes-innas @ 5,000 ft

Some Grit of SW & Interchange
• Database proxy: Apache 2.2.x
• RDF database: SWI-Prolog RDF DB
• Data adapter software: Python 2.4 and rdflib 2.2.1
• SOAP library: Java SOAP
• Java 1.4 or Java 1.5
• POPS client: jspace 0.28
• Database proxy application server: Pylons 0.9.3
• NTRS harvester: Java OAI-PMH library
• Social network visualization library: Jung
• XML between client and DB proxy
• RDF loaded into the DB
• Data sources:
– CMS: SOAP → RDF
– NTRS: OAI-PMH → RDF
– WIMS: CSV dump → RDF
– X.500: LDAP → RDF
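As an illustration of what a "CSV dump → RDF" adapter does, the sketch below converts rows of a CSV file into N-Triples lines using FOAF terms. It is an assumption-laden example, not the POPS adapter: the real adapters used rdflib, whereas this uses only the Python standard library, and the column names (`id`, `name`, `email`), the `example.org` base URI, and the function names are hypothetical rather than the actual WIMS schema.

```python
import csv
import io
from urllib.parse import quote

FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/people/"  # hypothetical base URI for people


def escape_literal(text):
    """Escape characters that are not allowed raw in an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def csv_to_ntriples(csv_text):
    """Turn each CSV row into three N-Triples statements about one person."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Percent-encode the identifier so it is safe inside a URI.
        subject = "<%s%s>" % (BASE, quote(row["id"], safe=""))
        triples.append("%s <%s> <%sPerson> ." % (subject, RDF_TYPE, FOAF))
        triples.append('%s <%sname> "%s" .'
                       % (subject, FOAF, escape_literal(row["name"])))
        triples.append("%s <%smbox> <mailto:%s> ."
                       % (subject, FOAF, row["email"]))
    return triples


# Hypothetical one-row dump.
SAMPLE = "id,name,email\n42,Alice Example,alice@example.org\n"
for line in csv_to_ntriples(SAMPLE):
    print(line)
```

Because the adapter only reads a dump, the source system and its owners are not disturbed, which is the point of the adapter layer.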

Extensible
• Adding new data sources (CiteSeer, PRACA, etc.) is done easily with no end-user or data source disruption.
• Customized views of existing data sources (tweaking the Jspace model file).
• Extend visualization to other facets.
• Everything annotatable by users or groups.
• Open Source, soup to nuts!

No, Really Extensible
• POPS isn’t really an expertise locator. It’s:
– An infrastructure for information integration.
– Generic data services (convert, federate, query, browse) for other apps and services to use.
– A generic client of those services (Jspace).
– Applicable to hundreds of information integration problems at NASA.
• How does this help with NASA’s problem?

The Problem
• Our reliance on data and the information we derive from it touches everything that we do.
• Critical information related to our daily operation is becoming more difficult to find.
• It is difficult to find relevant information that you know is available.
• And it’s virtually impossible to discover critical information that is relevant but unknown.

Our Situation
• The data problem exists within at least 5 dimensions: size, complexity, diversity, rate of growth, and trust.
• When we cannot find resources, we often recreate them. When we have trouble integrating information, we often copy it.
– These habits make NASA’s data volume and data integrity problems worse.
• Use-case scenarios and requirements change all the time.
– We cannot anticipate in advance what the next collection of information elements needs to be or for what purpose!!
• NASA needs a strategy to help us be more consistent about our use of, reliance on, and trust in our data, and which would enable information sharing and reuse.
• Our goal is to implement a strategy for organizing our information and data assets so they can be discoverable (by machines and humans) and reusable.

The Challenge
Integrate information from disjoint data sources, ad hoc’ly, to solve customer needs:
• Without upsetting delicate info-ecologies (data owners, curators, extant policies & procedures).
• Without requiring unrealistic investment in time or money.

The Inspiration

Being a Model
What do you need to determine the value / utility of a data model before you use it?
1. Models should be discoverable.
   - You or your machine must be able to find it.
2. Models should be written to the applicable standard.
   - Easy to incorporate or adopt.
3. Models should indicate a) provenance, b) currency, c) validation, d) that they work, function, and perform as expected.

Get the Machines Involved!
• Who can access the data?
• When can they access it? (How often, for what duration, etc.)
• Why would someone want this data? (What is the data good for?)
• Where does the data originate? What curation processes apply?
• What is the carrying capacity of the application that supports this data source? Is there spare capacity for accessing the data source?
• What can clients grant/do with the data? What can they not do?

Our Target Customer Experience
• Make information contained within databases and systems across projects and programs discoverable without disruption, without great expense, without loss of original contextual meaning, and without breaches of trust.
• Make attributes of trust, validity, currency, and provenance known.
• Make information easier to find and, once it is discovered, make it easier for the next person to find.
– Your experience compiling information benefits the next person’s collection.

Better (or Different) Security
• Some mechanism to manage all of the security and access-control issues.
– Working on a prototype (XACML-DL) that uses W3C’s OWL DL to manage access control policies:
  • Policy verification and consistency.
  • Policy containment.
  • Policy comparison and subsumption.

Prepare for the eventual semantic upshift
• A web-based “curation-friendly” catalog of information services.
– A database of databases, data models, and policies built from the fabric of the Web.
• Using current web standards and technologies, computers will be able to negotiate with each other for access and services.
• Customers will be able to browse, query, and search through NASA’s collection of information resources as easily as choosing a hotel or sweater.
• Use-pattern matches will assist customers and can make the experiences of others available to you.
• Through your browser (or web service), discover attributes of the information’s currency, provenance, validity, and trust.
• This service’s utility will be enriched by each customer’s use over time and it will grow incrementally, just like the Web.

The Big Picture

Next 18-24 Months
• Define the Gold, Silver, and Bronze criteria for NASA’s Reference Model Types.
• Build a prototype repository service in collaboration with our communities of practice.
• Assist developers in the construction of initial SLAPs for data and data model discovery & reuse.
• Assist developers in building a proof-of-concept repository for ontologies and SLAPs.
• Determine best practices and techniques for adding a validation bit.
• Construct go-to standards for new applications and models.
• Participate in key W3C standards groups (e.g., WS-Policy, OWL 1.1).

The Strategy
Given the heterogeneity and diversity of NASA data (e.g., scientific, administrative, operational, financial, analytic), we need a flexible approach to building information integration solutions with sufficient formality to provide cross-system discovery and reuse.
• Establish Information Management standards and mechanisms that promote enriched and ad hoc information sharing and reuse across NASA data services.
• Define a prospective solution that will augment data management capabilities as newly created data sources are integrated.
• Promote a layered approach, enriching services incrementally, when practical and requirements-driven.
• Enable integration so that the most sought-after, useful, and most easily integrated data services (databases, models, web services, etc.) are pushed to the front of the queue.
• Enable discovery and reuse of policy agreements between data providers and customers and between data systems, so attributes of confidentiality, integrity, availability, and currency are managed uniformly across diverse systems.
• Enable easier query integration across disparate hierarchies by modernizing NASA Information Standards to include a NASA Data Reference Model and definition of “gold, silver and bronze” standards for data and data models.
• Leverage current communities who have demonstrated excellence within their projects and programs.

Acknowledgements
Leadership: Jeanne Holm, Dan Schumacher, Hal Bell, Greg Robinson, the EA Data Team, Ken Griffey, Nitin Niak, many others.
Data sources: Chris Carlson, Calvin Mackey, Robin Land, Tim Sullivan.
Code & design: Clark & Parsia, LLC (Mike Grove, Evren Sirin, Bijan Parsia, and Kendall Clark) and Koansys, LLC (Chris Shenton).
R&D, proof of concept: Jim Hendler, m.c. schraefel, Mindlab people.

Questions? Complaints?

Future View
• Develop and deploy new classes of applications that merge data, services, and physical resources into a semantically aware, adaptive environment.
• Create a pervasive collaborative environment by having software “tasking” agents autonomously scan published IT service assets in conference areas, and choreograph them into an interconnected virtual work environment.
• Deploy software agents that can autonomously scan published knowledge and metadata and automatically connect them, or harvest them for information, anticipating users’ needs: give the users the data they need when they need it, in a form relevant to their current task.
• Develop agents that can resolve conflicts amongst different data sources and ascertain the trustworthiness of the published data, both within NASA and outside the Agency.
• Develop agents that can learn, anticipate needs, discover relevant data, and enter into transactions, all on behalf of their human users.

Screen Shot

Design Choices
• Java client v. JavaScript, in-browser client
• Federation v. data consolidation
• HTTP v. SOAP
• RDF v. XML
• SeRQL/RDF v. SQL/RDBMS
• Aggregation v. distributed query
– Aggregation via broker v. aggregation via client
• Visual query building v. NLP
• Browsing v. some other “direct query” interface (QBE, forms)
• Browsing v. searching

Mathematics of the who-knows-who relationship visualization
Given a set of people, $P$, and a set of relationships, $R$, that connect people and entities, we define five types of relationships:
1) same facility,
2) same department,
3) same skill and department,
4) same skill and project,
5) same skill, project, and facility.
Call these $r^{1}$ through $r^{5}$. $r^{i}_{xy}$ indicates a relationship of type $i$ between person $x$ ($p_x$) and person $y$ ($p_y$).
There is a direct connection between users $p_u$ and $p_s$ if there exists an $r^{m}_{us}$. If there is not a direct connection, we search for a path from $p_u$ to $p_s$ by finding $p_a$ such that there exist $r^{m}_{ua}$ and $r^{n}_{as}$. Then we add $(p_u, p_s, p_a, r^{m}_{ua}, r^{n}_{as})$ to the graph.
For example, if Alice is the user and Bob is the selected person, we will look for a direct relationship between them, such as Alice and Bob both working in the same department (i.e., find $r^{m}_{\text{alice},\text{bob}}$). If the direct relationship does not exist, we look at all the people Alice has relationships with, and check to see if any of them also have relationships with Bob. For example, Alice may work in the same facility as Chuck ($r^{1}_{\text{alice},\text{chuck}}$). Chuck, in turn, may have the same skill and work on the same project as Bob ($r^{4}_{\text{chuck},\text{bob}}$). Chuck then becomes a connection between Alice and Bob. All three people and their relationships are added to the graph.
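The search described above can be sketched in a few lines of Python. This is an illustrative reading of the slide, not the POPS implementation (which used the Jung library in Java): the `Person` record, its field names, and the function names are assumptions, and "same skill" or "same project" is taken to mean sharing at least one skill or project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Hypothetical record; field names are assumptions for this sketch.
    name: str
    facility: str
    department: str
    skills: frozenset = frozenset()
    projects: frozenset = frozenset()


def relationship_types(x, y):
    """Return the relationship types (1-5) that hold between x and y."""
    same_facility = x.facility == y.facility
    same_department = x.department == y.department
    shared_skill = bool(x.skills & y.skills)
    shared_project = bool(x.projects & y.projects)
    checks = {
        1: same_facility,
        2: same_department,
        3: shared_skill and same_department,
        4: shared_skill and shared_project,
        5: shared_skill and shared_project and same_facility,
    }
    return tuple(i for i, holds in checks.items() if holds)


def connect(user, selected, people):
    """Return graph entries (user, selected, via, types_ua, types_as).

    A direct connection yields one entry with via=None. Otherwise each
    intermediary related to both people yields one entry.
    """
    direct = relationship_types(user, selected)
    if direct:
        return [(user.name, selected.name, None, direct, None)]
    entries = []
    for other in people:
        if other in (user, selected):
            continue
        to_other = relationship_types(user, other)
        from_other = relationship_types(other, selected)
        if to_other and from_other:
            entries.append(
                (user.name, selected.name, other.name, to_other, from_other))
    return entries


# The Alice / Bob / Chuck example, with made-up attribute values.
alice = Person("Alice", "HQ", "IT", frozenset({"python"}), frozenset({"x"}))
bob = Person("Bob", "JPL", "KM", frozenset({"rdf"}), frozenset({"pops"}))
chuck = Person("Chuck", "HQ", "Ops", frozenset({"rdf"}), frozenset({"pops"}))
path = connect(alice, bob, [alice, bob, chuck])
```

With these values Alice and Bob share nothing directly, so Chuck is returned as the connection: type 1 (same facility) to Alice and type 4 (same skill and project) to Bob.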