=Paper=
{{Paper
|id=Vol-289/paper-2
|storemode=property
|title=Semantic Annotation of Mobile Data for Language Access
|pdfUrl=https://ceur-ws.org/Vol-289/p02.pdf
|volume=Vol-289
|dblpUrl=https://dblp.org/rec/conf/kcap/LenceviciusR07
}}
==Semantic Annotation of Mobile Data for Language Access==
<pdf width="1500px">https://ceur-ws.org/Vol-289/p02.pdf</pdf>
<pre>
 Semantic Annotation of Mobile Data for Language Access
               Raimondas Lencevicius                                                          Alexander Ran
           Nokia Research Center Cambridge                                         Nokia Research Center Cambridge
                  3 Cambridge Center                                                      3 Cambridge Center
                 Cambridge, MA 02142                                                     Cambridge, MA 02142
      Raimondas.Lencevicius@nokia.com                                                Alexander.Ran@nokia.com

ABSTRACT                                                              are equipped with more and more sensors including GPS
Mobile devices both host and collect a significant amount of data     receivers, Bluetooth transmitters and receivers, RFID receivers
that could be interesting to users. To make this data easily          and others. They also receive and store information about such
accessible, it has to be stored in semantic repositories using a      events as messages, phone calls, meetings, application usage and
well-defined ontology. Relationships between data from various        access to digital services. It is therefore natural to expect that this
sources should be explicit. A natural language interface to such      data should be collected and made accessible on the mobile device.
data is an attractive option for information access. However, there are     However, there are some open questions that need to be
semantic gaps between the data repositories and the formal            resolved in order to make this data useful and accessible both to
representation of meaning produced by language understanding          the programs and to the mobile device users. Collected real-world
systems. This paper describes a solution to the issues above. We      data must be structured and integrated with other information
have implemented a system that converts the mobile data into          available on mobile devices such as the information found in the
RDF format and annotates it with information necessary for            user’s phone book or calendar. There also needs to be an intuitive
efficient access via natural language. We have designed and           interface that allows flexible access to collected information.
implemented the Natural Query system, which automates the interface
between the natural language system and the semantic data repository.      Mobile devices store a rich set of structured information. The
Language tags are used to map between the natural language            address book or phone book application contains names, phone
meaning representation and the repository elements. Repository        numbers, addresses and affiliations of personal contacts. The
graph search is used to discover the knowledge about the              calendar application contains entries for meetings with
repository structure.                                                 participants, meeting location and time. We exchange messages
                                                                      and calls with people and organizations listed as our contacts. All
                                                                      these data are related. Retrieving these data based on their relation
Categories and Subject Descriptors                                    could be very useful for device owners. With such retrieval
H.3.3 [Information Search and Retrieval], H.5.2 [User                 capabilities they could learn who called them when they were in
Interfaces]: Natural language, I.2.4 [Knowledge                       California, or when is their next meeting with Ann from
Representation Formalisms and Methods]: Semantic networks             Accenture. Unfortunately, the relations between different data
                                                                      items are not always recorded explicitly when the events occur or
Keywords                                                              information is entered in some application. Therefore it is
Semantic annotation, query language, natural language.                important to integrate the collected data by explicating its relation
                                                                      to the data available on the device. To achieve this goal, we have
                                                                      developed an extended PIM ontology that covers all relevant
1. INTRODUCTION                                                       types of information available on the mobile device: from
     Natural language based interaction with software is              observed events, information from external data stores, to on-
increasingly viewed as a promising addition and sometimes even        device data from several mobile applications. Once the data has been
alternative to graphical user interfaces (GUIs), especially in the    structured and augmented with relations, it is stored in an RDF [15]
domain of mobile devices. Mobile devices host structured and          repository.
semi-structured information bases, software services, and
integrated devices such as cameras, music players, etc. Mobile             So far mobile applications have been designed with their
devices also make a perfect user interface to the real-world          own user interfaces, mostly GUIs, and occasional dedicated
environment. They are constantly carried with the user [2]            hardware controls. Most of the application software on the mobile
enabling gathering of user location information. Mobile devices       device could benefit from a natural language interface to its
                                                                      functionality that would simplify and streamline performing
                                                                      various tasks.
                                                                           As a rule, language systems and mobile applications software
                                                                      are developed independently of each other. To recoup the
                                                                      investment in the development of a language system it must be
                                                                      capable of integrating with a broad range of information sources.
                                                                      Unfortunately, information bases are not designed for interaction
                                                                      using natural language. As a result, this integration process is
                                                                      mostly an ad hoc, manual process. This severely limits the impact
that maturing language processing technology can have on               mobile device was richer than the types supported by standard
transforming the way we interact with the mobile devices.              ontologies; therefore we decided to create our own ontology.
                                                                       Vcard also uses string values for certain objects that we wanted to
      In our research, we have investigated ways of creating robust
                                                                       represent as full fledged RDF objects with URIs and attributes so
and portable natural language interfaces to semantic repositories.
                                                                       they would have identities and we could add information about
We created a novel Natural Query (NQ) language and data access
                                                                       them. For example, city and country fields are represented as
engine that greatly reduces the costs of providing natural language
                                                                       strings in vcard. However, in order to represent even basic
interfaces to semantic repositories. NQ can heuristically attach
                                                                       geographic relationships, cities and countries must be represented
operational semantic interpretation to a database independent
                                                                       as objects.
meaning representation of a natural language question over a
given semantic repository. NQ enables us to provide a natural               We also considered mixing and matching types from several
language interface to the integrated real-world and on-device data.    ontologies for our data. This approach has the advantage of using
NQ requires attaching basic linguistic information to structural       types possibly known by other systems. However, this approach
elements of the semantic repository. In this paper we give a brief     leads to a rather incoherent architecture of the ontology. We
overview of such annotations for ontology in the extended PIM          decided that creating a single internally consistent ontology was
domain.                                                                preferable in our case. If needed, classes and properties in our
                                                                       ontology can be related to types in vcard and foaf via equivalence
      The paper describes the mobile data conversion into RDF
                                                                       declarations using RDFS and OWL [21].
and semantic annotation (Section 2). Additional annotation and
knowledge extraction is needed for automated natural language            The main class for contacts in our ontology is the Contact class.
interface to the data repository (Section 3). Our experience with      It contains address, email, group, phoneNumber and URL
the system is presented in Section 4. We finish with the               attributes. Organizations and persons can be Contacts, so we have
description of related work and conclusions.                           Organization and Person classes inheriting from the Contact
                                                                       class. In addition to inherited attributes, Organization class also
2. MOBILE DATA INTEGRATION INTO                                        has name and representative attributes. Person class adds
SEMANTIC REPOSITORY                                                    affiliation, birthday, btDevice, familyName, givenName, and
                                                                       nickname attributes. Affiliation class showing the affiliation of a
     We had to deal with two major data sources: events gathered
                                                                       person with some organization has organization and title
by data collection framework and PIM data available from PIM
                                                                       attributes. Part of the ontology relating these classes is shown in
applications. This section describes data from both sources,
                                                                       Figure 1.
necessary data conversion and integration into semantic
repository.
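
The Contact, Organization, Person and Affiliation classes described above
can be pictured with a small Python sketch. It uses plain dataclasses as a
stand-in for the RDF classes; the field types, the list-valued defaults and
the choice of Person as the type of "representative" are assumptions made
for illustration, not part of the ontology as published.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Contact:
    # Attributes the text lists for the Contact class (types are assumed).
    address: List[str] = field(default_factory=list)
    email: List[str] = field(default_factory=list)
    group: List[str] = field(default_factory=list)
    phoneNumber: List[str] = field(default_factory=list)
    URL: List[str] = field(default_factory=list)


@dataclass
class Organization(Contact):
    # Added on top of the inherited Contact attributes.
    name: Optional[str] = None
    representative: Optional["Person"] = None  # assumed to be a Person


@dataclass
class Affiliation:
    # Links a person to an organization, with the title held there.
    organization: Optional[Organization] = None
    title: Optional[str] = None


@dataclass
class Person(Contact):
    # A list allows several affiliations per person.
    affiliation: List[Affiliation] = field(default_factory=list)
    birthday: Optional[str] = None
    btDevice: Optional[str] = None
    familyName: Optional[str] = None
    givenName: Optional[str] = None
    nickname: Optional[str] = None


# Usage: "Ann from Accenture" as a Person with one Affiliation.
accenture = Organization(name="Accenture")
ann = Person(givenName="Ann", affiliation=[Affiliation(organization=accenture)])
```

Because both Person and Organization inherit from Contact, code that works
on contacts in general can treat them uniformly.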

2.1      Mobile Device Data
      Data on mobile devices is owned by different applications.
This makes it hard to establish and explicitly indicate semantic
relationship between different data items. This situation is
acceptable as long as the users can only interact with their data
using the limited set of functionality provided by the applications.
However, if we open these data for language based access, it
becomes necessary to support access to different data items using
their semantic relationships. Some examples are referring to
people by their affiliations, titles, city of residence or office                   Figure 1. Part of Mobile PIM ontology
location; referring to meetings by their participants, subject, or
location; referring to received calls by the name of caller’s               Group class describes groups of contacts, such as office
organization.                                                          colleagues or baseball friends. It has contacts attribute that
                                                                       contains contacts belonging to the group and name attribute.
     In our project we dealt with data that originated from the
phone book application (sometimes also called address book) and              Location is a generic class describing locations that has a
the calendar application. Data in these applications are stored in     number of subclasses: Address, Country, GPSLocation,
separate Symbian databases [6]. Since these databases cannot be        GSMLocation, Locality, Pcode and Region. Address represents
changed without interfering with the functionality of standard         detailed addresses and contains country, locality, pcode, pobox,
applications we chose to integrate all data in a separate semantic     region, and street attributes. Country, Locality, Pcode and Region
repository. We designed an extended Personal Information               classes are simple with just a name attribute for respective objects.
Management (PIM) ontology that adequately represented all data         GSMLocation class describes locations as obtained from GSM
items that we were interested in and their relationships. We           network. It has carrier, cellTower and lac attributes. Carrier is
implemented a set of Python scripts that extract the data from         the cellular network operator, cellTower has a single cell tower
native databases and import them into the PIM ontology. We used        ID, and lac is a Location Area Code describing a certain region
an RDF repository for data storage.                                    within the network. GPSLocation specifies locations using
                                                                       latitude and longitude attributes.
     We created the PIM ontology to cover all data available in
the device. We considered using such standard ontologies as W3C
foaf [8] and vcard [20]. However, the information available on the
      Mobile device Calendar application contains information          needed to infer and attach these codes to some phone numbers
about meetings. Meeting class has subject, location, participants,     that enter the system without such codes. For example, the phone
start and end attributes.                                              number supplied via caller ID does not always include the country
                                                                       code. Custom code has to be written for many data items to
    Message class objects represent messages. They indicate
                                                                       convert them on entry into the form required by the semantic
messageSubject, messageBody, receiver and sender.
                                                                       repository.
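
The phone number conversion described above can be sketched as follows.
This is a minimal illustration, not the conversion code used in the
project: the function name normalize_phone, the compact "+digits" output
form and the default country code "1" are all assumptions.

```python
import re


def normalize_phone(raw: str, default_country_code: str = "1") -> str:
    """Return the number as '+<country code><national digits>'.

    Assumption: a number without a leading '+' is a national number that
    lacks only the country code (as with some caller ID values).
    """
    has_country_code = raw.strip().startswith("+")
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, parentheses
    if not digits:
        raise ValueError(f"no digits in phone number: {raw!r}")
    if has_country_code:
        return "+" + digits
    return "+" + default_country_code + digits


# Usage: both spellings map to the same key, so a call record and a
# phone book entry can be joined on it.
from_phone_book = normalize_phone("+1 555 555-5555")
from_caller_id = normalize_phone("(555) 555-5555")
```

A real converter would also need per-country rules, for example for
trunk prefixes, which this sketch deliberately leaves out.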
     One of the goals of the Semantic Web is developing standard
                                                                            The attributes of Observation objects connect with other
universal ontologies. Unfortunately, neither the existing
                                                                       objects of the repository. For example, the phoneNumber attribute
ontologies, nor the one we used in our project can be claimed to
                                                                       of a CallObserved is of type PhoneNumber, which is also used in
be standard. Attributes and data in different applications and
                                                                       the attribute phoneNumber of a Person or Organization class.
domains vary significantly. For example, some calendar
                                                                       Therefore the gathered data semantically integrates with the on-
applications may specify participants, while others don’t. Some
                                                                       device data. Common classes are the basis for building relations
address book applications may allow specifying birthdays for
                                                                       between data classes belonging to different applications.
contacts, but others do not. Ontologies seem to follow in their
structure the applications or uses that their creators considered at         Another area where observed data integrates with on-device
the ontology creation time. Classes are created based on particular    data is the location information. GSM locations gathered on the
use cases. Attributes are chosen based on data availability and        phone can be related to geographical locations, such as cities,
planned use of that data. Rather than focusing on the                  states or countries. Some data processing and additional relations
standardization, we discovered that an important value of RDF          in the RDF repository are needed for this. We use the partOf
ontology is its extensibility – ability to accommodate new types       relation between different objects to represent geographic or
and attributes at any time.                                            organizational inclusions. For example, a relation can indicate that
                                                                       Boston is a part of Massachusetts, which in turn is a part of the
2.2 Event Data                                                         USA. This attribute is also used to describe the GSM location
      For data collection on mobile devices, we have used one of       containment within a certain geographical object. Since GSM
the frameworks available within Nokia to collect events that occur     locations are somewhat imprecise, we have chosen to associate
on a mobile device: phone calls, SMS messages, nearby Bluetooth        them with town or city level geographical entities. This provides
devices, and GSM locations. All of these events are tagged with a      sufficient information in most cases. If a more precise location can
timestamp when they occur. For phone calls the device records          be determined, it could be associated with a city neighborhood,
the phone number called (or the phone number that called the           street, house or even part of the office building.
user) and call duration. For messages, the phone number and the
                                                                            For some other data, programs or users have to add
message text are recorded. A GSM location change event is
                                                                       information to facilitate integration. Bluetooth device IDs need to
recorded when the cell tower associated with the phone changes.
                                                                       be associated with specific persons, since such association is not
Finally, the phone periodically scans for Bluetooth devices in its
                                                                       usually available in the mobile device phone book. For this reason
vicinity and records their names and IDs. All observations are
                                                                       we added btDevice attribute to the Person class. It has to be filled
stored in the objects of Observation subclasses:
                                                                       in with concrete values in order to associate the
BTDeviceObserved, CallObserved, MessageObserved, and
                                                                       BTDeviceObserved observation to a specific person carrying a
LocationObserved.
                                                                       Bluetooth device.
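
The use of the partOf relation for geographic containment can be sketched
as a walk up a chain of partOf links. The Boston, Massachusetts and USA
chain is the example from the text; the cell identifier "gsm:cell-1234",
the dictionary representation and the function name are hypothetical, and
the sketch assumes each entity has at most one direct partOf parent.

```python
from typing import Dict, List

# Hypothetical partOf facts: child -> direct container.
PART_OF: Dict[str, str] = {
    "gsm:cell-1234": "Boston",  # made-up GSM location, tied to a city
    "Boston": "Massachusetts",
    "Massachusetts": "USA",
}


def containers(entity: str, part_of: Dict[str, str]) -> List[str]:
    """Follow partOf links upward and return every enclosing entity."""
    result: List[str] = []
    seen = {entity}
    current = entity
    while current in part_of:
        current = part_of[current]
        if current in seen:  # guard against accidental cycles in the data
            break
        seen.add(current)
        result.append(current)
    return result


# Usage: an observation at the cell can be related to a city, state or country.
enclosing = containers("gsm:cell-1234", PART_OF)
```

With such a closure, a question about calls received "in Massachusetts"
can match observations whose GSM location is only linked to Boston.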
     Although the gathered data is interesting by itself, it becomes
even more useful when properly linked to the data already              2.3 Discussion
available in the device. For example, a user may want to know               In a number of cases we had to decide whether to represent
where the person who called them lives. This information could         particular entities as strings or as objects using URIs. It seems that
be found by relating the call log to the phone book on the device      constructing an object is almost always worthwhile, since such
that maintains the association of phone numbers to people and          objects can be later used for inter-object relations. For example,
their addresses. To enable this connection, it is important to         by having Country, Region and City objects, we are able to
collect and preserve semantically relevant information. The            indicate partOf relations between them. Also a single URI for a
connection of gathered information to other data can be achieved       particular object, for example, a city, makes it possible to detect such
through time and location relationships, phone numbers, email          connections as people living or working in a single city.
addresses, Bluetooth IDs and other inverse functional properties.
                                                                            Overall, we found that our RDF repository is significantly
Time and location can be used to relate data items that are either
                                                                       more flexible than a relational database. It naturally supports
associated with same time period or the same location. All event
                                                                       multiple classes of contacts, multiple affiliations per person, and
data is time stamped, which makes such associations relatively
                                                                       a sophisticated typing system.
simple. Location can be related to time stamped data items
through location observed during the same period of time.
Unfortunately, for establishing some other relationships               3. NATURAL LANGUAGE INTERFACE
there might be no generic approach. For example, in order to                 Although the repository of integrated real-world and in-
connect phone call and message data to other data associated with      device data can be used in a variety of ways, for example, via
the phone number, the phone number has to be known in a                querying it using SPARQL [19], we were interested in providing an
standard form URI. We used the standard international form of          intuitive and flexible user interface to it. A general natural
the phone number with country code and long distance code, for         language interface to a rich data set could be more effective than a
example +1 555 555-5555. However, data processing may be               GUI based application.
      As a rule, information bases and language systems are               SELECT DISTINCT $person ?givenName ?familyName
developed independently of each other. Therefore information              FROM <http://localhost/pim.rdf>
bases are not designed for interaction using natural language and         WHERE { $person a pim:Person; pim:givenName
their integration is mostly an ad hoc, manual process. Figure        ?givenName; pim:familyName ?familyName; pim:affiliation
2 is a sketch of a typical architecture that is used to provide a    ?affiliation; pim:address ?person_address.
natural language interface to databases and other back-end or             $affiliation pim:organization $organization.
native services.                                                          $organization pim:address ?organization_address;
                                                                     pim:name "IBM".
                                                                          {?person_address pim:locality "Ulm"} UNION
                                                                          {?organization_address pim:locality "Ulm"}}
                                                                           Unfortunately in order for a language system to generate
                                                                     such semantic representation from the original questions, the
                                                                     language system must contain a large amount of information
                                                                     about the structure of the database and its content. Such
                                                                     information includes the facts that IBM is a name of an
                                                                     organization and Ulm is a name of a city, cities can be related to
                                                                     organization through their addresses, organizations are related to
                                                                     people through their affiliations, people are related to cities
                                                                     through their home and office addresses, and all these
                                                                     relationships and objects are represented by the specific structures
                                                                     and entities of the database.
                                                                          Entering such information into a language system is a tedious
                                                                     and costly process that is not only domain dependent but also is
                                                                     sensitive to specific choices of database organization. There is an
                                                                     obvious advantage in maintaining some independence between
                                                                     the database and the language system. One way to achieve this
                                                                     independence is to have the language system generate semantic
                                                                     representations of the questions that are as independent of the
                                                                     database organization as possible.
Figure 2. Architecture sketch of Natural Language Interface               In the example above semantic information contained in the
to Services                                                          question and independent of database organization amounts to the
     The speech recognition and generation components translate      following meaning representation:
between text and speech modalities. The language understanding
                                                                          contact.name: ?
component converts the text into a formal representation of
                                                                          organization: IBM
meaning, sometimes called a semantic frame [17]. The language
                                                                          city: Ulm
generation component converts the formal meaning representation
to a natural language text [1]. The dialog manager uses the               It is possible to have the language system produce such
context of conversation to complete frames received from the         database independent meaning representation of questions. But is
language understanding module or created by the custom               the information in such meaning representation of the question
integration code from responses of backend services. The custom      sufficient to perform the requested operation? Obviously there
integration code also translates meaning representation frames it    are several information gaps between this database independent
receives from the dialog manager into a standard database query      meaning representation and the database specific semantic
or backend specific API requests.                                    representation of the question in the form of a formal query.
    Let us assume the user asks the system about contacts in              The first gap is due to different names used to refer to the
some organization and geographical location:                         same elements in the language system and the repository. For
                                                                     example, the category called “city” in the language system
      Who do I know at IBM Ulm?
                                                                     corresponds to the attribute locality of the Address class.
      Who are my contacts at IBM in Ulm?
                                                                     Therefore there is a need to maintain the mapping between the
      What are the names of my contacts at IBM in Ulm?1
                                                                     two naming systems.
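
The mapping between the two naming systems can be sketched as a pair of
lookup tables. The tag examples ("city" for locality, "contact" for Person
and Organization, "name" for givenName and familyName) are the ones
mentioned in this paper; the Python representation, the (kind, name)
tuples and the function name build_indexes are assumptions for
illustration and do not describe how NQ stores its tags.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# A repository element as (kind, name), e.g. ("property", "locality").
Element = Tuple[str, str]

# Language-system category -> repository element (many-to-many).
TAG_PAIRS: List[Tuple[str, Element]] = [
    ("city", ("property", "locality")),
    ("contact", ("class", "Person")),
    ("contact", ("class", "Organization")),
    ("name", ("property", "givenName")),
    ("name", ("property", "familyName")),
]


def build_indexes(pairs: Iterable[Tuple[str, Element]]):
    """Build lookups in both directions from (tag, element) pairs."""
    tag_to_elements: Dict[str, Set[Element]] = defaultdict(set)
    element_to_tags: Dict[Element, Set[str]] = defaultdict(set)
    for tag, element in pairs:
        tag_to_elements[tag.lower()].add(element)
        element_to_tags[element].add(tag.lower())
    return tag_to_elements, element_to_tags


# Usage: resolve the category "city" from a meaning representation.
TAG_TO_ELEMENTS, ELEMENT_TO_TAGS = build_indexes(TAG_PAIRS)
city_elements = TAG_TO_ELEMENTS["city"]
```

Keeping the table outside both the language system and the repository
means either side can change its names without touching the other.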
     The operational semantics of these questions can be
                                                                          The second kind of gap between the two systems is that one
adequately represented with a database query. Let us consider
                                                                     element in the language system may correspond to multiple
how this request would need to be posed to an RDF repository.
                                                                     elements in the repository and vice versa. In our example the
SPARQL [19] query corresponding to our example question over
                                                                     reference to the address can map to home address, work address,
the ontology shown in Figure 1 looks as follows:
                                                                     or the organization address of the contact. This is partly due to the
                                                                     ambiguity of the natural language, which is not the main focus of
                                                                     our discussion in this paper. There are also situations where the
1
    The name of the organization and the city were selected for      granularity of categorization is different between natural language
    shortness and carry no other information                         and repository representations. This happens when several
different concepts exist in the repository for objects which are      meaning representation, the data and ontology, and the language
viewed as instances of the same concept in natural language. In       tags. It could be argued that if there were correspondence between
our example this gap required the UNION in the query to               the categories of database-independent meaning representation
represent the original natural language request.                      and the data and ontology, the language tags would not be needed.
                                                                      Unfortunately, if the ontology and language system are to be
     The third and most important source of the information gap
                                                                      developed independently, there is no way to maintain or ensure
between the meaning representation of the natural language
                                                                      such match. Thus language tags provide the many-to-many
request and the SPARQL query is due to the fact that the query
                                                                      mapping between the two independent systems of categorizations
must specify the navigation to the information in the repository
                                                                      and eliminate the first and second kind of information gaps
using the repository structure. This information about the
                                                                      between the meaning representation and semantic repositories.
repository organization is entirely absent from the natural
language question and cannot appear in a database independent               Figure 3 illustrates language tags associated with a part of
semantic representation.                                              our PIM ontology. A generalization like “Contact” can be
                                                                      attached to specific classes like “Person” and “Organization”. A
     We have designed and implemented the Natural Query (NQ)
                                                                      general reference like “Name” can be attached to multiple
language and engine [14] that bridges the gaps identified above
                                                                      elements like “givenName”, “familyName”, and so on. In our RDF
thus opening a way for portable (database independent) natural
                                                                      repository of real-world and in-device data, we added language
language interfaces to semantic repositories. NQ can
                                                                      tags to the RDF objects using a subproperty of RDFS label field.
automatically map meaning representation produced by language
systems into precise queries. NQ employs two mechanisms:
language tags and data graph search to return requested data using
                                                                      3.2 Graph search
                                                                           The third gap that exists between the database independent
only the information in the database-independent meaning
                                                                      meaning representation of the natural language request and the
representation of the user request.
                                                                      formal query that actuates it over a given database is the
                                                                      information about the organization of the data repository. In order
3.1 Language Tags                                                     to navigate from the given attributes of an object to the target of
     Language tags are words, expressions, and linguistic tokens
                                                                      the query, SPARQL queries need to know the specific path that
attached to database elements such as classes and properties.
                                                                      connects them on the database graph. In current language systems,
Multiple tags can be attached to a single element and a single tag
                                                                      this path is encoded by the query and stored in the custom
can be attached to multiple elements. Language tags are the names
                                                                      integration code for every different type of query. Thus a query
of the corresponding categories used by the language system(s).
                                                                      defines a subgraph with given properties some of which are
When a language system produces a form like the one in our
                                                                      specified in the database-independent meaning representation of
example,
                                                                      the natural language request and some are encoded in the custom
     contact.name: ?                                                  integration code component.
     organization: IBM
                                                                            While a formal query defines a connected subgraph as
     city: Ulm
                                                                      illustrated on Figure 4, the database-independent meaning
     under the NQ system its interpretation is:                       representation only identifies some nodes and edges of this
                                                                      subgraph. Identified fragment might be disconnected. In the
     find the attributes tagged as “name” of an instance of the       example        above       it    identifies     “Person”     and
class tagged as “contact” related through properties tagged as        “Organization” classes as well as “Ulm” value of “locality”
“organization” and “city” to values “IBM” and “Ulm”                   property (by reference to its language tag “city”) and “IBM” as a
respectively                                                          value of “name” property of an instance of “Organization” class.
     Contact
                                                                      This leads to an important idea: that the knowledge embedded in
                                                                      the formal queries that know the database organization, can be
                                                                      also extracted from the natural language meaning representation
                                                                      and the data repository itself.


First name     Name   Last name


                                       Address


                       City
               in


Figure 3. Language tags for database elements
     Language tags provide an opportunity for a semantic
annotation additional to the class names and their properties. In a
natural language system accessing an RDF repository data, we
have three layers of semantic information: database-independent       Figure 4. Answering query via graph search
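
     The idea above, that the connecting structure can be recovered
from the meaning representation and the data itself, can be
illustrated with a minimal Python sketch. It is not the NQ
implementation described in [14]. The toy triples, the identifiers
person1, org1, and addr1, the properties worksFor and address, and
the helper names resolve_tag, build_adjacency, and find_path are all
assumptions made for illustration. Language tags resolve terms of
the meaning representation to repository elements, and a plain
breadth-first search over the data graph recovers a path that a
hand-written query would otherwise have to spell out.

```python
from collections import deque

# Hypothetical language tags: one tag may map to several repository elements.
LANGUAGE_TAGS = {
    "contact": ["Person", "Organization"],
    "name": ["givenName", "familyName"],
    "city": ["locality"],
}

# Hypothetical toy repository as (subject, property, object) triples.
TRIPLES = [
    ("person1", "type", "Person"),
    ("person1", "givenName", "Ann"),
    ("person1", "worksFor", "org1"),
    ("org1", "type", "Organization"),
    ("org1", "name", "IBM"),
    ("org1", "address", "addr1"),
    ("addr1", "locality", "Ulm"),
    ("person2", "type", "Person"),
    ("person2", "givenName", "Bob"),
    ("person2", "worksFor", "org2"),
    ("org2", "type", "Organization"),
    ("org2", "name", "MIT"),
]


def resolve_tag(tag):
    """Return the repository elements a language tag refers to.

    An unknown tag is returned unchanged, as a single-element list.
    """
    return LANGUAGE_TAGS.get(tag, [tag])


def build_adjacency(triples):
    """Build an undirected adjacency map: node -> [(property, neighbour)]."""
    adj = {}
    for subject, prop, obj in triples:
        adj.setdefault(subject, []).append((prop, obj))
        adj.setdefault(obj, []).append((prop, subject))
    return adj


def find_path(adj, start, goal):
    """Breadth-first search; returns one shortest node path or None."""
    if start not in adj or goal not in adj:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for _prop, neighbour in adj[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


if __name__ == "__main__":
    adj = build_adjacency(TRIPLES)
    # "contact" resolves to two classes; search from each to the value "Ulm".
    for cls in resolve_tag("contact"):
        print(cls, find_path(adj, cls, "Ulm"))
```

     Breadth-first search returns only one shortest path between two
nodes. The sketch makes no attempt to connect all elements of a
request at once or to rank alternative subgraphs.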
     In Figure 4 it is possible to notice that for a given set of
elements identified by a meaning representation of a natural
language request it is possible to identify the query subgraph by
searching the database. In other words, a program could find paths
connecting the nodes known from the meaning representation,
such as “Person”, “name”, “Organization”, “City”, “Ulm”, and
“IBM”. One such path is highlighted in the picture.
     Therefore, while traditional approaches to semantic analysis
of natural language questions over databases rely on hand-crafted
code or data for representing the information about the
organization of the database, NQ extracts such knowledge from
the data repository by using graph search. Given a question “Who
are my contacts at IBM in Ulm?”, NQ finds paths connecting the
nodes known from the database-independent meaning
representation, such as “Person”, “name”, “Organization”,
“City”, “Ulm”, and “IBM”.

3.3 NQE Discussion
     NQE may find multiple subgraphs that connect all given
elements. In such cases we apply heuristic ranking of these
subgraphs in order to determine the most relevant ones. So far we
have experimented with several ranking mechanisms, all of which are
variations on path length (weight) between the elements specified
by the meaning representation. In all our experiments the results
retrieved by the system in response to natural language questions
corresponded well with the intuition of human subjects.
     The results returned by NQE are designed to support the
needs of conversational interfaces. If no results are found that
match the elements specified in the meaning representation, NQE
returns best matches that include only a subset of elements in the
query. For example, if no contacts at IBM in Ulm can be found,
contacts at IBM in other cities would be returned as well as
contacts from Ulm that are not affiliated with IBM.
     NQE can perform basic reasoning over type hierarchy. A
“Person” is substitutable for a “Contact”, a
“MobilePhoneNumber” for a “PhoneNumber”, but the opposite is
not true. NQE supports organizational and geographic inclusion
and can perform corresponding reasoning. When a calendar
application lists meetings in Helsinki and Oulu, NQE can answer
questions regarding meetings in Finland, where these cities are
located. Similarly, information about organizational structure can
be used to answer questions about Nokia while the database only
records Nokia’s internal organizations like Multimedia or
Enterprise Solutions. Finally, NQE creates structures that can be
used to produce explanations regarding how the answers relate to
the questions.
     We have created a proof of concept implementation of NQ in
Python [12] that runs on S60 [16] mobile phones. A full description
of the Natural Query system implementation is outside the scope
of this paper.

4. EXPERIENCE WITH THE SYSTEM
     We tested our system on a PIM test data set containing 550
contacts with about 150 meetings and 250 phone calls, which is
normal for executives with many active contacts and frequent
meetings. The repository contained over 11000 RDF triples. We
asked over 50 natural queries corresponding to over 600
parameterized questions.
     The system can answer questions ranging from “What is the
email of John?” to “Where does Ann work?” to “My meetings next
week in Cambridge with John from MIT” and “Who called me
yesterday during the meeting with Ann?”. Some of these
questions would convert to quite complex relational or SPARQL
queries. For example, for the query “Who called me yesterday”, we
need to find all telephone numbers of calls that occurred yesterday
and then find all people who have these telephone numbers. The NQ
query for this is very simple: “:select ‘Person’ :where ("Received
Call", Time ('yesterday'))”.
     If we classified questions according to domains, one domain
would contain questions about the personal information data from
an address book application, for example “Who works as a real
estate broker?”. Another set of questions is about meetings, for
example, “When are my meetings next month at MIT?”. Yet
another set is about calls and messages, for example, “Who called
me last Friday?”. Finally, there are questions spanning multiple
domains, for example, “What are emails of people who
participated in a meeting on Monday?”, “Who called me when I
was in Finland?”, and so on. All these types of queries were
successfully created and executed on the extended PIM data store.
     We found out that we could easily ask questions both about
the in-device data and the collected real-world data. Semantic
integration of multiple data sources enhanced our question
answering capability significantly, allowing such questions as
“Who called me when I was in Helsinki?”, “Which messages did I
receive during the meeting with Juha?”, etc. Although an
out-of-pattern detection of someone’s Bluetooth device is a weak
indication that the phone user met the owner of the Bluetooth device,
in our experiments we assumed such an implication. This allowed us
to ask questions such as “Who did I meet last week?” or “At what
time did I meet Ann last Saturday?”

Figure 5. Example question and answer

     Test NQ queries mostly returned expected answers (96%
recall, 92% precision) (Figure 5) including the approximate
answers where the exact answers were not available. For example,
the question “When was my meetings with Sam last month?” had
no exact answers, so the system returned approximate answers of
meetings with Sam that did not occur last month as well as the
meetings that occurred last month, but did not include Sam.
     The performance of the system was acceptable with answers
taking from less than a second to several seconds. The system
implementation is a prototype written in Python that was not
optimized for memory or speed. The detailed evaluation of system
performance is outside the scope of this paper. We are planning to
optimize the system performance in the near future.

5. RELATED AND FUTURE WORK
     Mobile data storage in RDF repositories is investigated by
the ConnectingMe [9] project at Nokia Research Center. We have
collaborated with ConnectingMe in the ontology and repository
development. Some tools for data extraction and conversion are
shared between our two projects.
     Semantic markup and annotation of web [7][4] and media
[3] data is a topic of active research. Our research is related to the
mobile media data annotation. There has been a lot of research on
ontology creation tools. We used one such tool, Protégé [11],
to design our extended PIM ontology.
     Event data has been gathered on mobile devices by a number
of projects including Context [13] and Reality Mining [5]. In our
work, we have extended one of the data gathering frameworks
available at Nokia.
     We have not discovered any research directly corresponding
to the Natural Query approach. The Precise system by Popescu et
al. [10] attaches language tokens to database elements in a way
very similar to language tags of NQ. Also the query derivation
approach of Precise is based on database graph search. NQ uses a
more flexible data model, supports incomplete answers, and
collects data for explanations.
     In the future, we plan to connect our system to such natural
language and speech systems as TINA [17] and Galaxy [18]. We
plan to perform user trials to evaluate our system and its user
interface to real world data. We will collect additional data such
as email messages, songs listened to, and pictures viewed and taken.
We will also optimize the current prototype implementation.

6. CONCLUSIONS
     Mobile devices are now able to continuously collect various
events interesting to the user. Mobile devices also host structured
and semi-structured information bases. We have demonstrated the
integration of all this data using a flexible and powerful RDF
repository and a common ontology. We have designed and
implemented a query language and engine NQ that can
automatically map meaning representation produced by language
systems into formal queries on RDF repositories. We have used
language tags for mapping of the meaning representation to the
data classes. NQ uses graph search to extract the information
about the repository’s structure. Our experience shows that
semantic data annotation and knowledge extraction significantly
improve the capability of natural language interfaces to mobile
data.

7. REFERENCES
[1] Baptist L. and S. Seneff, "Genesis-II: A Versatile System for
    Language Generation in Conversational System
    Applications," Proc. ICSLP '00, Vol. III, pp. 271-274,
    Beijing, China, Oct. 2000.
[2] Chipchase, J., “Why do People Carry Mobile Phones?”,
    http://www.janchipchase.com/blog/archives/2005/11/mobile_essentia.html,
    2005.
[3] Davis, M., King, S., Good, N., Sarvas, R., “From Context to
    Content: Leveraging Context to Infer Media Metadata”,
    Proceedings of the 12th annual ACM international
    Conference on Multimedia, New York, NY, USA, pp. 188-195,
    2004.
[4] Dill, S, et al., “SemTag and Seeker: Bootstrapping the
    Semantic Web via Automated Semantic Annotation”,
    Proceedings of the 12th international conference on World
    Wide Web, Budapest, Hungary, pp. 178-186, 2003.
[5] N. Eagle, "Machine Perception and Learning of Complex
    Social Systems", Ph.D. Thesis, Program in Media Arts and
    Sciences, Massachusetts Institute of Technology, June 2005.
[6] Edwards, L., Barker, R., et al. “Developing Series 60
    Applications”, Addison Wesley 2004.
[7] M. Erdmann, A. Maedche, H.P. Schnurr, S. Staab, “From
    manual to semi-automatic semantic annotation: About
    ontology-based text annotation tools”, Proceedings of the
    Workshop on Semantic Annotation and Intelligent Content,
    2000.
[8] FOAF Vocabulary Specification 0.9,
    http://xmlns.com/foaf/0.1/, 2007.
[9] Lassila, O. et al, “ConnectingMe”,
    http://research.nokia.com/research/projects/connectingme/index.html,
    2007.
[10] Popescu, A., Etzioni, O., and Kautz, H. 2003. Towards a
     theory of natural language interfaces to databases.
     Proceedings of the 8th international Conference on
     intelligent User interfaces (Miami, Florida, USA, January 12
     - 15, 2003). IUI '03. ACM Press, New York, NY, 149-157.
[11] Protégé Ontology Editor and Knowledge Acquisition
     System, http://protege.stanford.edu/, 2007.
[12] Python for S60, http://sourceforge.net/projects/pys60, 2007.
[13] Mika Raento, “Context software - A prototype platform for
     contextual mobile applications”. Proceedings of the
     International Proactive Computing Workshop. University of
     Helsinki, 2004.
[14] Ran, A., and Lencevicius, R., “Natural Language Query
     System for RDF Repositories”, To appear in Proceedings of
     the Seventh International Symposium on Natural Language
     Processing, SNLP 2007, 2007.
[15] Resource Description Framework, http://www.w3.org/RDF/,
     2007.
[16] S60 platform, http://www.s60.com, 2007.
[17] S. Seneff, "TINA: A natural language system for spoken
     language applications," Computational Linguistics, vol. 18,
     no. 1, pp. 61-86, March 1992.
[18] S. Seneff, E. Hurley, R. Lau, C. Pao, P. Schmid, and V. Zue,
     "GALAXY-II: A Reference Architecture for Conversational
     System Development," Proc. ICSLP 98, Sydney, Australia,
     November 1998.
[19] SPARQL Query Language for RDF,
     http://www.w3.org/TR/rdf-sparql-query/, 2007.
[20] Vcard, http://www.w3.org/TR/vcard-rdf, 2007.
[21] Web Ontology Language,
     http://www.w3.org/TR/owl-features/, 2007.

</pre>