n2Mate: Exploiting social capital to create a standards-rich semantic network

David Peterson (BoaB interactive, 2/84 Denham St., Townsville, QLD Australia 4810, +61 7 4724 2933, david@boabinteractive.com.au)
Anne Cregan (National ICT Australia, 223 Anzac Parade, Kensington, NSW Australia 2052, +61 2 8306 0458, anne.cregan@nicta.com.au)
Rob Atkinson (CSIRO Land & Water, Lucas Heights Research Laboratories, Private Mail Bag 7, Bangor, NSW 2234, Australia, rob.atkinson@csiro.au)
John Brisbin (BoaB interactive, 2/84 Denham St., Townsville, QLD Australia 4810, +61 7 4724 2933, john@boabinteractive.com.au)

ABSTRACT
A significant boost on the path towards a web of linked, open data is the establishment and promotion of common semantic resources, including ontologies and other operationalised vocabularies, and their instance data. Without consensus on these, we are hamstrung by the famous "n-squared" mapping problem. In addition, each vocabulary has its own associated attributes to do with why it was developed, what purposes it is best suited for, and how accurate and reliable it is at both a content and a technical level, but most of this information is opaque to the general community.

Our theory is that it is the lack of socially-sensitised processes highlighting who is using what, and why, that has led to the current unmanageable plethora of vocabularies, where it is far easier to build your own vocabulary than to try to find a suitable, reliable existing one.

We therefore suggest that there is considerable value in the development of an online facility that provides a space listing vocabulary and ontology resources together with their associated authority, governance and quality-of-service attributes. Presenting this in a visual form and providing pivotable search facilities enhances recognition and comprehension.

Additionally, and critically, the facility provides a focal point where discourse communities can make authority claims, rate vocabularies on various parameters, register their commitment to or usage of particular vocabularies, and provide feedback on their experiences. Through social interaction, we expect the most solid and useful vocabularies to emerge and form a stable semantic platform for content representation and interlinked knowledge.

Our strategy is to become sufficiently enmeshed in the native information habits of people and their derivative institutions to reveal and collect their standards-seeking needs and activities with a minimum of effort on their part.

This paper describes a pilot facility testing the theory above. Dubbed "n2Mate", it is a novel exploitation of social networking software to provide a lightweight and flexible platform for testing the efficacy of leveraging social networks to link existing registers and 'seed' an information space focussing on the use of standards in online information management.

The paper uses examples from the Australian context to provide clear illustration of the central arguments.

Keywords
Registers, vocabularies, standards, linking density, RDF graph, social networking, knowledge re-use, n2Mate, n-squared

1. SOCIAL AND TECHNICAL CONTEXT
The current emergence of a data web has re-focussed our attention on standards. To be truly effective, the semantic web needs to evolve towards a minimum number of ontologies, highly re-used and densely interlinked, rather than a sparse network with minimal interoperability.

1.1 The standard problem with standards
The project to link open data can be realised through explicit declarations by one data source in relation to another. These "hard" linkages provide a high degree of certainty, but make data maintenance exponentially difficult as the number of hard linkages grows.

Standards, understood as nodes of agreed meaning, provide a more scalable approach to data linking. By agreeing to use the same term to describe similar ideas in our different data, we establish an implicit (semantic) linkage between our data. The project to conceive, negotiate, and promote standards, however, has proven to be even more difficult than the maintenance of hard linkages.

It is often noted, with some irony, that the great thing about standards is that there are so many to choose from... and if you can't find one you like, you can always create your own.

While these sentiments provide excellent platforms for pub-based oratory, the realities are not so easily dismissed. Application designers, knowledge seekers, and agencies with a mandate to interoperate are all too familiar with the significant resource drains that occur when standards are hard to locate, difficult to apply, or confusing to distinguish between.

Standard vocabularies and data definitions have been quietly multiplying in traditional media since ancient Sumer (cf. Wikipedia, "Cuneiform"), but in more recent times the Semantic Web has inspired a hyperbolic growth in contributions to the standards project. For instance, a search in Swoogle on the word "address" returns 12,834 semantic web documents; on "book" it returns 19,601 (at 2008-01-24). For someone seeking to exercise the efficiencies of knowledge reuse, this wealth of choice is simply overwhelming and self-defeating. The current state of affairs reveals semantic fragmentation, not semantic integration and knowledge creation.

Even within a narrow domain like the Australian government, there is a wealth of terminologies and metadata "standards" available for government agencies to consider. It is not clear if a whole-of-government survey of standards has ever been undertaken, but informal observation suggests that there are hundreds of attempts to describe very similar concept spaces.

1.2 Does anyone have a wheel like mine?
People have been trying to standardise themselves in one way or another for quite some time. The most obvious benefit of this instinct toward standardisation is communication efficiency, a direct input to the rate of knowledge creation. By speaking the same language, we can communicate and collaborate far more effectively. Yet the barriers to standardisation appear to take on new forms as fast as we evolve knowledge.

In our present age the benefits of information interoperability are now well understood, if only through their absence. Most people and institutions involved in project scoping, information product development, and online service provision clearly grasp the power of knowledge re-use and the cost efficiencies of standards-based interoperation. This assertion is supported by the existence of an entire government department whose mandate is to promote effective and efficient information sharing, governance structures, tools, methods and re-usable technical components across the Australian Government.

The Australian Government Information Management Office (AGIMO) published a Government Architecture Reference Model (http://www.agimo.gov.au/services/GovDex) that discusses "...a repository of architectural artefacts (including standards, guidelines, designs and solutions) that may be utilised by agencies to deliver an increasing range of Whole of Government services."

In practice, however, we find that the task of identifying and verifying the suitability of existing artefacts is simply too time-consuming. As a consequence, there are a great many ontologies and informal vocabularies used by a very limited number of organisations or agencies, with a great sparsity of intermappings between them, even though there is a very large amount of crossover in terms of content.

More globally, the Linking Open Data (LOD) project [1] holds datasets that currently comprise over 2 billion triples but reveal only about 3 million links (SWEO, 2007), so overall the graph is very sparsely interconnected [2].

1.3 Scalable register networks
As we have argued, there are many technical standards and common policies in use across a wide range of government activities, but the very number of such activities and standards is in itself posing a significant challenge.

AGIMO and others have a role in promoting the use of common approaches, but it is increasingly difficult to track which standards apply to which set of problems.

In general, there is an issue about the scalability of any approach for improving interconnectedness. We believe that the most promising strategy is to utilise registers to hold metadata about standards and their implementation, including records of organisations, projects, standards, and controlled vocabularies (and associated people and roles). A network of such registers, coupled through normal web services mechanisms, has the potential to form a semantic fabric that addresses the business-level needs of people and institutions. Whilst this is potentially a vast undertaking, the bulk of target information already exists, and there are already a great many people actively tasked with identifying, using and promoting standards. These people are likely to be receptive to an effort such as n2Mate.

A network of registers, supported by a "register of registers", addresses the most important questions: who is doing what, which standards are relevant, who can I talk to, what is the governance model for these artefacts, and how trustworthy is the source. Through a richly populated network of registers, these become questions any organisation can rapidly address, and in doing so can promote commonality of approach within and amongst various discourse communities.

1.4 Socially-sensitive metadata
One of the dark secrets of the machine-based knowledge project is the enormous loss of content as we move from people's minds to their documents and datasets. David Snowden, amongst many
others, has pointed to the impossibility of "collecting" knowledge from people without providing a meaningful context:

"Human knowledge is deeply contextual, it is triggered by circumstance and need, and is revealed in action. ... to ask someone what he or she knows is to ask a meaningless question in a meaningless context. Tacit knowledge ... comes about when our skilled performance is punctuated in new ways through social interaction" [3].

A socially-sensitised strategy provides the meaningful context and familiar atmosphere that people require before they can (or will) reveal their knowledge in a useful way.

In many ways the current situation is akin to a train network that has millions of stations (nodes) covering the same area (knowledge domains) but with a great sparsity of tracks (mappings) between stations, and hardly any trains and passengers (services, publishers, agents, users) running on the vast majority of them.

Our experience with efficient rail networks shows that we want to reach a necessary minimum of stations interconnected with an optimised number of tracks, and attract a maximum number of trains to utilise the infrastructure. This obviously gives us a far more robust and useful semantic network to traverse.

In related research, it should be possible to show how the density of interconnectedness in the RDF graph improves the efficiency of machine process operation without producing a debilitating level of ambiguity. We would argue that the degree of interconnectedness implemented between ontologies can be taken as a proxy indicator of interoperability across the knowledge domain.

We suggest there is a cluster of persistent problems in complex information spaces that can be socially characterised as follows:

Who and what:
- Owner: Who owns it?
- Creation: Who created it?
- Maintenance: Who is responsible for maintaining it?
- Domain: Which domains is it relevant to? This will include a number of different ways of considering domains.
- Usage: Who uses it?
- Endorsement: Who endorses it? This will include reporting on various parameters and a rating system.
- Processes: What business, government or other processes is it used in? What role does it play?
- Governance: Who is in charge of it? Who has formally agreed to support, maintain, and implement it?

Quality of service parameters:
- Provenance: What guarantees are there that the information is accurate and verified?
- Currency: How often is it updated? What guarantees are there that it is up to date?
- Availability: What guarantees are there regarding the availability of the vocabulary, including dereferencing considerations?

Other considerations:
- How does it relate to other standards in the space?
- User experiences

2. SOCIAL ARCHITECTURES AND SEMANTIC NETWORKS
The principal social platform techniques we seek to exploit include:
- Popularity rankings: the number of times a standards artefact is referenced (implemented).
- Authority badges: a mechanism to advertise an authority claim over a standards artefact.
- Related to ("Friends of a Standard (FOAS)"): linkages from standards artefacts to their cohort of implementers.
- Trust ratings: showing satisfaction with the custodian of a standards artefact.
- Hero worship: most interlinked, most trusted, most useful.

Each of these techniques has corresponding interface strategies that provide a powerful social platform in which people (and institutional roles) can operate quite naturally. Each also forms a search facet that can be traversed with high-efficiency faceted search and browsing tools.

2.1 Use Case
A simple use case will help us set the stage for describing the technical architecture proposed.

A researcher is preparing her research plan on a section of the Great Barrier Reef (GBR). Although she is an experienced marine scientist, she is new to the GBR and to her host research facility. She suspects she should be using:
- standard naming conventions for the GBR regions;
- standard identifications for the particular reefs;
- standard data sampling techniques appropriate to the Australian tropics;
- standard data formats, enumerators, and vocabularies in her datasets;
- standard citations of agencies, programmes, and people referenced in her work;
- standard metadata fields and vocabularies to describe her research output;
- standard project management practice in reporting on her project's progress.

In the absence of a useful standards locator, it is not likely that she will achieve a high standard of conformance to the norms of her discourse community. In the absence of a socially-sensitised register space, it is not likely that her discourse community is actively sharing its experience and wisdom with standards.

2.2 Instance Data
The facility needs to be designed around a sufficient minimum of predicates that embody the "business logic" of the facility and establish the semantic armature we require for inferencing. We propose the following [shows predicate] as a starting point:
- Organisations are [responsible for] people, projects, standards, and vocabularies
- People are [associated with] projects
- Projects are [implemented by] standards
- Standards are [expressed with] vocabularies
- Trust or utility of standards are [ranked by] people

Using these indicative predicates as a starting point, we can answer a matrix of discovery questions through faceted visualisation. In each search operation, the user can rotate to a facet of interest to continue the discovery process:
- I know someone like me [PersonName] > What projects are they associated with?
- Those projects are like mine [ProjectName] > What standards are used in them?
- Those standards are of interest [StandardName] > How can I decide which one is most appropriate for me?

The logic described here is possible because we have imposed a limited set of predicate types. These types are native to the n2Mate facility. To take advantage of existing social networks that utilise other predicate types, Semantic Web vocabularies such as SIOC [4] and FOAF [5] will be used.

The facility will also consider structured lists of resources, like a list of country names available from the same address, to themselves be a kind of register. For instance, many applications need a list of every valid country name for users to input their address information. The ability to reference an external source that is authoritative, accurate, up-to-date, and reliably available and dereferenceable reduces the need for application maintenance.

The metadata held in these registers can be typed according to existing conceptualisations. For example, the National Data Network (http://www.nationaldatanetwork.org/) draws on ideas from the Metadata Open Forum (http://metadataopenforum.org/) to classify their metadata as: discovery metadata; quality metadata; and definitional metadata.

We note that the semantic register network can also list web services in addition to typical standards artefacts such as ontologies and vocabularies.

We intend to specifically tune this facility to the needs of government and community agencies that have a mandate to participate in the creation and maintenance of highly effective approaches to service improvement.

Semantic interpretation: MOAT
MOAT (Meaning of a Tag) could serve as the basis for giving extended quality of information to free-form folksonomy tagging.
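The core idea behind this approach, namely recording which URI a tagger actually means by a free-form tag so that identical tags from different users become machine-comparable, can be sketched in a few lines. The following is an illustrative, in-memory sketch only: the class name and example URIs are hypothetical, and it does not reproduce the actual MOAT protocol or API.

```python
# Illustrative sketch of the "meaning of a tag" pattern (hypothetical names
# and URIs; not the MOAT protocol itself).

class TagMeaningRegistry:
    """Maps free-form tags to candidate meanings (URIs), per user."""

    def __init__(self):
        # tag -> {meaning URI -> set of users who chose that meaning}
        self._meanings = {}

    def assign(self, tag, uri, user):
        """Record that `user` uses `tag` to mean the resource at `uri`."""
        self._meanings.setdefault(tag.lower(), {}).setdefault(uri, set()).add(user)

    def meanings(self, tag):
        """Candidate URIs for a tag, most widely agreed-upon first."""
        candidates = self._meanings.get(tag.lower(), {})
        return sorted(candidates, key=lambda uri: len(candidates[uri]), reverse=True)

registry = TagMeaningRegistry()
# Two users agree on one meaning; a third uses the same tag differently.
registry.assign("address", "http://example.org/vocab/PostalAddress", "alice")
registry.assign("address", "http://example.org/vocab/PostalAddress", "bob")
registry.assign("address", "http://example.org/vocab/MemoryAddress", "carol")

# The most widely shared meaning surfaces first, so "nodes of agreed
# meaning" emerge from individual tagging choices.
print(registry.meanings("address")[0])  # http://example.org/vocab/PostalAddress
```

In a deployed system these assignments would of course live on shared, distributed servers rather than in process memory; the point of the sketch is only that agreement density is directly computable from who-means-what records.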
This will allow users of the bookmarking system to have the flexibility of folksonomy and the interlinked structure of the Semantic Web. The added benefit is that MOAT (http://moat-project.org/) is a distributed system and can tap into other servers to give extended meaning to free-form tags.

3. IMPLEMENTATION OPTIONS
A demonstrator version of n2Mate can be established using readily available tools and datasets, so that a more detailed critique can be pursued with a minimum of upfront overhead. In this section we discuss some of the more promising approaches.

3.1 Key components
The registration process, and maintaining a network of linked objects, is the function of traditional registry technologies, such as ebXML Registry. Navigating and efficiently querying the contents and relationships, however, is not well supported by this environment. It is proposed to automate the harvesting of object relationships from the "Register of Registers" into a triple-store. This is the same pattern found in data mining, where transactional database content is restructured into generalised, query-oriented structures. For our purposes, automated discovery of patterns is not the focus: fast, efficient visual presentation is essential. Users will be parsing through extensive data structures, and may need to propose and refine their discovery logic in quick, exploratory sorties.

Visualisation and facet search: Gnizr + Solr
We want a tool that thinks natively in URIs and triples. Gnizr (http://code.google.com/p/gnizr/) is an open source front end that handles user account management, bookmarking, tagging, and semantic search. Every object stored by gnizr is a bookmark (URI), and the folksonomy tag interface is SKOS [6] enabled. Solr (http://lucene.apache.org/solr/) is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface.

Solr could be used to facet the data into searchable and browseable components. For example, if users are interested in what ontologies Sun Microsystems is using, they select Sun from the "Who is Using" facet. The other facets instantly re-order and re-number themselves, and the user is free to further refine the results by selecting additional facets. Faceted search visualisation can be negotiated through cluster maps (e.g., Aduna, http://www.aduna-software.org) with a high degree of efficiency.

Triple-store: Sesame
Sesame (http://sourceforge.net/projects/sesame/) could provide the backend triple store, graph manipulation, RDF inferencing, and remote SPARQL [7] endpoint access.

Figure 1: n2Mate conceptual architecture

Policy layer: PLING
The development of robust approaches to policy negotiation is being driven by a W3C Interest Group (http://www.w3.org/Policy/pling/). The n2Mate project could field test various strategies for handling issues of personal privacy, information reuse, and access control.

Trust and governance: POWDER
POWDER (http://www.w3.org/2007/powder/) is the W3C's Protocol for Web Description Resources, currently in development.

Governance is related to the idea of trust. In the context of this project, we want to explore two aspects of governance:
1. How to make it easy for agencies who have a mandate to be an authority for some asset to discharge their duty in an efficient and useful way.
2. How to provide users with a suite of trust measures that will allow them to evaluate the qualities of a particular asset in relation to their needs.

POWDER seeks to develop a mechanism through which structured metadata can be authenticated and applied to groups of web resources. It provides us with a means to both retrieve information about a block of web resources and authenticate that this information may be attributed to the owners of the information.

3.2 Testing the system with existing resources
There are already many semantically rich registers implicit in the operations of government, including the identifiers of government agencies, registers of company names, standards recognised by Standards Australia, legislation and regulations, and management areas for land, water, soils, health, etc. This represents a wealth of entities about which assertions can be made, to create a semantically rich environment.

Semantic Web data can be roughly broken down into three levels [2]:
1. Vocabulary / ontology
2. Individual occurrences of those terms, and actual instances of non-information resources
3. The links that tie the vocabularies to their occurrences

All three of these need to be captured with adequate provenance data to bootstrap n2Mate.

The following web services can be utilised to populate and update information, as well as add important metadata to the Register of Registers component of n2Mate:
- Watson (http://watson.kmi.open.ac.uk/Overview.html): a gateway to the Semantic Web, focusing on semantic data quality, relations between ontologies, and access to semantic data.
- Talis Schema Cache (http://schemacache.test.talis.com/): a cross-linked and navigable index of ontologies and vocabularies.
- Swoogle (http://swoogle.umbc.edu): a search engine for Semantic Web artefacts.
- Sindice (http://sindice.com): indexes the RDF web and pulls out the triples; from there it essentially creates a reverse lookup.
- Falcons (http://iws.seu.edu.cn/services/falcons/): currently indexing 34,566,728 objects (2008-02-01); provides bi-directional resource linking.
- Ping the Semantic Web (http://pingthesemanticweb.com/): archives the location of recently created or updated, web-accessible RDF.

3.3 Data harvesting and processing
n2Mate can leverage existing search engine services, such as those listed above, to collect data instances from target registers and sources. Many of these have, or are developing, APIs that facilitate direct access to their collections and service points.

Where well-formed registers and artefact collections exist already, n2Mate could establish harvesting relationships (presumably through appropriate API arrangements). OWL files, RDF data dumps, and SPARQL endpoints could be pointed to the n2Mate system for automated data fetching and processing.

Additionally, trust algorithms would be created from graph inferencing, metadata and social data to further guide the prospective n2Mate user, allowing them to more quickly determine the best artefact to use in their situation. This will be an evolving process that occurs over time as the quality of data and user interactions flows back and forth.

4. CONCLUSION
The unique aspect of this proposal is that it leverages the hidden formal and informal knowledge networks created by existing business processes, and marries this information with social networking models to provide a useful way of organising and navigating the wealth of available information. It uses the community of people using vocabularies to empower others, starting with the places where agreements already exist.

n2Mate provides a tool that encourages the use of standardised artefacts by exposing existing registers, leveraging social networks, and building a central reference point for users that will assist them to identify relevant semantic assets for their needs, choose amongst them, and feel confident about their utilisation.

Further, research into the strategy proposed should provide contributions to related projects, such as the development of:
- A lightweight mechanism revealing the state of interconnectedness in and between discourse communities.
- A bridging space between government, business, community, academia and science knowledge assets to enhance broadscale interoperability.
- A genetic algorithm to breed, select, and hybridise various standards artefacts such as ontologies, services, and trust authorities.

In conclusion, we suggest that there is currently a significant level of inefficiency in the applied domain of project scoping, information product development, and online service provision, due to the inadequacy and irrelevance of existing knowledge registers. We further suggest that a promising solution strategy involves using the power of social networks, coupled with semantic discovery and visualisation tools, to create a socially-sensitised semantic network of standards registers.

5. REFERENCES
5.1 Citations
[1] C. Bizer, T. Heath, D. Ayers, and Y. Raimond. Interlinking Open Data on the Web (Poster). In 4th European Semantic Web Conference (ESWC2007), pages 802–815, 2007. http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
[2] M. Hausenblas, W. Halb, Y. Raimond, and T. Heath. What is the Size of the Semantic Web? Metrics for Measuring the Giant Global Graph, 2007.
[3] Snowden, Dave. Information vs Knowledge. http://www.rkrk.net.au/index.php/Information_Vs_Knowledge
[4] J. Breslin, A. Harth, U. Bojars, and S. Decker. Towards Semantically-Interlinked Online Communities. In Second European Semantic Web Conference, ESWC 2005, Heraklion, Crete, Greece, May 29-June 1, 2005. Proceedings, 2005. http://sioc-project.org/
[5] D. Brickley and L. Miller. FOAF Vocabulary Specification. Namespace Document 2 Sept 2004, FOAF Project, 2004. http://xmlns.com/foaf/0.1/
[6] D. Brickley and A. Miles. SKOS Core Vocabulary Specification 2005-11-02. W3C Working Draft, W3C, November 2005. Updated version at http://www.w3.org/TR/swbp-skos-core-spec
[7] E. Prud'hommeaux and A. Seaborne, eds. SPARQL Query Language for RDF. http://www.w3.org/TR/rdf-sparql-query/

5.2 Special thanks
Renato Iannella provided background thinking on the Policy Aware Web. Alan Ruttenberg of the Science Commons and Tom Heath of the Linking Open Data project provided encouragement and wisdom from their broad experience. Steve Matheson from the Australian Bureau of Statistics corroborated our intuition that social platforms could play an important role in standards adoption.