        Tenth International Workshop Modelling and Reasoning in Context (MRC) – 13.07.2018 – Stockholm, Sweden




       A Pipeline for Extracting Multi-Modal Markers for Meaning in Lectures

                    Johannes Ude1 , Bianca Schüller2 , Rebekah Wegener2 , Jörg Cassens1
                     1 University of Hildesheim, 2 RWTH Aachen University
                        udejoh@uni-hildesheim.de, bianca.schueller@rwth-aachen.de,
                    rebekah.wegener@ifaar.rwth-aachen.de, cassens@cs.uni-hildesheim.de


Abstract

This article introduces initial concepts for a context sensitive computing pipeline to detect multi-modal markers for meaning from video and audio data, to notify the audience of markers of importance, and then to classify sequences of a recorded video into segments by content and importance in order to summarise the content as video and audio and in other modalities. In this paper, we first consider the linguistic background and then present the input data for the pipeline. Finally, we outline the concepts to be implemented in each step of this pipeline and discuss how the evaluation of the pipeline can be achieved.

1 Introduction

The summarisation of multimodal utterances is an important area of research in linguistics, natural language processing, and multimodal interaction. Summarisation of text is a difficult task in itself, and the methods used vary widely depending on the purpose or function of the summarisation. Most recent work in natural language processing integrates lexical, acoustic/prosodic, textual, and discourse features for effective summarisation [Maskey and Hirschberg, 2005]. Only recently, however, have behavioural features been taken into consideration [Hussein et al., 2016], and there only to summarise the movement in a video. Behaviour is frequently under-utilised as a modality because it is treated as a contextual footnote to speech. However, it can be equally meaning bearing and can often signal meaning prior to verbalisation [Butt et al., 2013]. For additional supporting arguments, see Lukin et al. [2011] and Cartmill et al. [2007].

But meaning making is most often multi-modal, including aspects of behaviour, and it is this multi-modality that we make use of in outlining a model for an automatic and context dependent note-taking system for academic lectures (see Wegener and Cassens [2016] for an overview of the research programme). Drawing on semiotic models of gesture and behaviour, linguistic models of text structure and sound, and a rich model of context, we argue that combining information from all of these modalities through data triangulation provides a better basis for information extraction and summarisation than any modality alone.

Further, we suggest that by using a rich model of context that maps the unfolding of the text in real time onto features of the context, we can produce query driven summarisation. This paper focuses on the proposed computational pipeline for processing the different input data streams; its purpose is to discuss the different components needed for implementing a functional prototype of the system.

The concept of using different modalities for the summarisation of videos has been explored in a number of earlier works; see for example Maskey and Hirschberg [2005], where an acoustic signal was used in addition to text for summarisation, with broadcast news as the proof of concept. As a classifier, Maskey and Hirschberg [2006] used a hidden Markov model.

The first application domain we are looking at is academic lectures, as these are largely monologic in nature, which makes them easier to work with computationally. They are also readily available as large corpora and represent a clear case where extracting important information is useful. Besides providing a working prototype, we would also like to use the system to showcase the concept of "smart data", meaning that we make use of existing knowledge: on the one hand to model the necessary contextual parameters and, on the other hand, to be able to learn and tune the computational model from a limited amount of data. We have successfully used this method before to model intention as expressed through behaviour [Kofod-Petersen et al., 2009]; see Butt et al. [2013] for some methodological background. In future research, a further application domain next to monologic discourse will be dialogic discourse, for example telemedicine consultations. Both monologic and dialogic domains provide foundations for multi-participant dialogic and multilingual domains, such as business meetings or team work.

2 Method and Motivation

In order to explore how humans detect and classify important information, we used note-taking data from two studies. Both were based on a stimulus lecture from a recorded first-year computer science course at MIT. The lecture is the first lecture in that course and has a length of approximately 55 minutes. It can be divided into four main parts: 1. administrivia, 2. lecture, 3. examples, and 4. conclusion [Wegener, 2017].




The administrivia and lecture parts consist of three sections each.

In the first study, by Wegener et al. [2017], experts in computer science were asked to annotate the lecture transcript according to what they consider to be important information that students should take away from the lecture. As the lecturer himself could not be asked, experienced computer scientists were asked for their opinion instead, so that their annotations could act as a ground truth for measures of importance. It was found that the four experts who took part in the study largely agree in their notions of importance, which is why their annotations were combined into one. The extractive summary that they created was transferred to notes at a later point.

The second study, by Schüller [2018], involved undergraduate and graduate students of computer science or mechanical engineering, both native and non-native speakers of English. They were asked to watch the recorded lecture and to take notes either by hand or by typing, depending on their preference. Afterwards, they filled in a short survey asking for some demographic data and for information on their note-taking practices. Notes were collected from nine students in total, one of whom was a native speaker of English. Two students were native speakers of Albanian and the remaining ones were German. There was a balance of male and female participants and of writing and typing, and there were slightly more graduate participants than undergraduates. All participants were competent users of English. Three participants spoke three languages other than English, five participants spoke one, and the native speaker spoke none. This study was carried out in order to compare what experts expect students to take notes on with what the students actually do take notes on, i.e. to compare the different notions of importance.

Initial results show that the notion of importance differs considerably between the experts and the students as well as among the students themselves, which can be seen in the number of words they consider relevant in the different sections. This difference provides a strong motivation for the development of a system that can guide students in identifying importance during academic lectures. In total, the expert notes comprised 1775 words, while the student notes ranged in fairly even steps between 93 and 691 words. This is partly due to the fact that the experts did not have to take notes while watching the recording, but is also attributed to their proficiency in the field as well as their note-taking competence. Among the students, the native speaker and the participants who spoke four languages took more notes than the others. When looking at the different lecture sections, the lowest discrepancy between expert and student notes is in the welcome (1.1), administrivia (1.3), and summary (4.) sections, while the highest discrepancy is in the examples part, followed by the course goals (1.2) and the lecture part (2.). The variation in importance across the different phases of the lecture suggests that a model of the generic structure of a lecture could be useful for the information extraction process.

While it is natural that experts consider more information to be important than what students take notes on, what needs to be regarded is the size and location of the gap between the number of words noted by experts and by students in the different phases of the lecture. Where the distance is the same, the students largely understand what the experts expect; but where the gap increases, there might be problems in capturing the students' attention. This motivates a system that alerts students when important information is signalled, so that their note-taking skills can improve. These notifications can also act as cues for an automated summarisation system to denote important parts of the lecture.

3 Linguistic background

The initial goal of this study is to obtain, automatically and without looking at the transcript of the text, a summary that is similar in nature to the experts' notes. Previous research has shown that the important parts of the lecture can be identified consistently when analysing the video and audio [Schüller, 2018]. The important parts (targets) are indicated by the lecturer's behaviour (e.g. gestures, movement, gaze and visual target) and their voice (e.g. prosodic markers, pitch, tone, loudness); important markers can, of course, also be found in the textual transcript of the audio if that is available (e.g. words such as "right", "so", "ok").

A marker is a specific pattern found in one of the three categories of data. We focus on markers that signal important parts of the lecture. This means that whenever a marker of a specific type appears in a lecture, this small part of the lecture should be considered important and therefore included in a summary.

By applying linguistic analyses from Systemic Functional Linguistics (SFL) [Halliday and Matthiessen, 2014] as well as research on note-taking triggers in Cognitive Linguistics and Psychology [Boch and Piolat, 2005] to the computer science lecture used in the experiments, markers that signal importance have been found. SFL was chosen as an approach because it looks at language as a system of choices and therefore offers useful tools for investigating how meaning-making works in texts. In the analysis, SFL, which looks at language in its social context, and cognitive linguistics, which looks at language from the perspective of language users, were combined in order to consider different perspectives for the detection of markers.

We differentiate between markers that act as flags and markers that are targets. Flags are multimodal markers that appear before the target text and are therefore a signal to pay attention to what follows, while targets are the multimodal markers that we try to extract. The following types of multimodal markers for meaning have already been observed in previous work carried out as part of this ongoing research and will be incorporated into the model:

Board-writing: Building on work by Boch and Piolat [2005], board-writing was found to act as a significant target [Schüller, 2018]. Being a behaviour, board-writing can be found within the image data.

Pointing Gestures: Schüller [2018] found pointing gestures to be targets as well. While not triggering note-taking to as high an extent as board-writing does, they showed discrepancies between experts' and students' notes and could therefore be a marker that students tend to miss. Being a behaviour as well, they are also part of the image data.




Notes as the Lecturer's Visual Target: Appearing together with the textual marker of continuatives, the image data of the lecturer looking at his notes was found to be reliable in acting as a flag [Schüller, 2018].

Continuatives: As mentioned above, when appearing together with the lecturer looking at his notes, the textual marker of continuatives acts as a flag.

'Needing', 'Wanting', 'Going': Wegener et al. [2017] discovered that certain kinds of process types, as well as the use of the going-to future, appear quite frequently in the computer science lecture from the experiment. They are textual markers that are targets.

Most commonly repeated words in lexical bundles: Taking word lists from Martinez et al. [2013], specific words from lexical bundles were found to match the continuatives and process types mentioned above, making them further textual markers that are targets [Schüller, 2018].

It needs to be added at this point that the markers above differ largely in their ubiquity or, in other words, that they are more or less dependent on the context in which they appear, such as the lecturer and the topic. For example, while processes like 'needing' and 'wanting' are tightly connected to the lecturer of the computer science lecture under study, commonly repeated words in lexical bundles appear to be more universal within the domain of academic lectures.

The next step is to look at the technical parts and determine how these markers can be detected automatically. Where possible, we will make use of existing tools that help us to extract more information from the audio and image data. After that, we want to segment the video into parts, mark important aspects, and generate a summary of the important parts.

4 Input data

In the final system [Wegener and Cassens, 2016], which will be applied to live lectures, there will be two sensor streams that collect audio and image data. The audio data will go through an acoustic signal analyser on the one hand, in order to get phonetic, prosodic, and text-level data; on the other hand, it will go through a speech-to-text processor to get grammatical, lexical, and cohesion data. The image data will go through gesture recognition to get behavioural, gesture, and micro-gesture data. Together, these form the multi-modal ensembles that are used in combination with models of context and generic structure to detect important information.

The training data consists of the video of the lecture (comprising audio and image data) and the notes from the experts. The expert notes will not be available during deployment of the system. The use of the annotation data is twofold: firstly, the expert notes are used for benchmarking the classification step of the pipeline. Because we want a summarising pipeline, the automatic summarisation output should ideally be comparable to the expert notes.

Secondly, the expert notes are a ground truth for importance that helps us with the training data. The manual annotation by the experts denotes those parts of the video where markers of importance should show up, if they exist. So instead of learning from data alone, we already know where and when in the data to look for these markers. In essence, without the expert data, we do not know whether a marker really signifies importance, nor the time and duration of importance. Neither do we know which parts should be included in the summary, so all of these aspects would have to be learned and validated if the manual annotation were not available.

In our proposed system, the computing pipeline that detects multi-modal markers for meaning in order to classify video parts by whether they belong to a summary can be learned from both manual annotation and data. For deployment and evaluation of the finished system, just the video and audio of the lecture are given.

Another input type, not in the sense of data but in the sense of a model, is knowledge from linguistic research in the form of a configuration. We already know which markers exist and what meaning they hold. In this pipeline, we use multi-modal markers acting as flags or targets, so the markers signal the parts of the lecture which are considered important.

Furthermore, the configuration can hold a generic structure potential (GSP) model for lectures [Hasan, 1994]. A generic structure potential is a statement of the likely structure of a context; a generic structure does not, however, mean that there will be no variation. The GSP model can be used to determine a weight for each marker in a specific time interval of the lecture. For example, at the beginning of the lecture there are just greetings and organisational information. Even when a marker is detected in these parts, the weights can modify whether the marker should be used or not.
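
For illustration, the following sketch shows one possible shape for such a configuration entry in Java, the intended implementation language (see section 6): lecture phases from a GSP model are mapped to per-marker weights that scale detector confidences. All phase names, boundaries, and weight values here are invented for the example; they are not results from our studies.

    import java.util.List;

    /** Minimal sketch of GSP-based marker weighting; all names and values are illustrative. */
    public class GspWeighting {

        /** One phase of the generic structure potential: a time interval with one weight per marker type. */
        record Phase(String name, double startSec, double endSec,
                     double boardWritingWeight, double continuativeWeight) {}

        // Hypothetical phase layout, loosely following the four-part structure described in section 2.
        static final List<Phase> PHASES = List.of(
                new Phase("administrivia", 0, 600, 0.2, 0.5),
                new Phase("lecture", 600, 2400, 1.0, 1.0),
                new Phase("examples", 2400, 3000, 0.9, 0.8),
                new Phase("conclusion", 3000, 3300, 0.6, 1.0));

        /** Scales a detector confidence by the weight of the phase into which the marker falls. */
        static double weightedConfidence(String markerType, double timeSec, double confidence) {
            for (Phase p : PHASES) {
                if (timeSec >= p.startSec() && timeSec < p.endSec()) {
                    double weight = markerType.equals("board-writing")
                            ? p.boardWritingWeight() : p.continuativeWeight();
                    return weight * confidence;
                }
            }
            return 0.0; // outside the modelled structure: ignore the marker
        }
    }
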
5 Pipeline

The overall system will work in two phases. The first one is the learning phase, with the given expert notes as a ground truth. In this phase, the system learns a configuration based on the expert notes. The second phase is the production phase, without the expert notes but with a generalised configuration. We separate the system into these two phases because the expert notes are a costly factor; a generalised configuration that can classify all lectures is therefore what we want to arrive at.




The main part of the pipeline, which deals with identifying the markers themselves and evaluating their meaning potential for the text, consists of at least two steps: detect markers and classify video segments. Before this pipeline starts, preprocessing steps are necessary. For the user of the overall system, these tasks are background tasks: only the input and the output are visible to the user, who hands in the video and retrieves a summary or a notification. The architecture of a pipeline seems to be a natural fit because the output of each step acts as the input for the subsequent step. Each step is therefore a computing unit which is responsible for a certain task. In the different steps of the pipeline, even processing of the same type of input can be handled by different units. For example, the same type of classifier could be implemented as a hidden Markov model, a Bayesian classifier, or a neural network, in which case it might make sense to combine all types of classifiers in an ensemble process. In this sense, the pipeline can be thought of as a directed acyclic graph, with the different input-output transformations forming the overall pipeline.

Figure 1 shows how the specific computing units in the overall pipeline are intended to fit together.

Figure 1: Computing Pipeline
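
As a minimal sketch of this design, each computing unit can be typed as a function from one intermediate representation to the next; chaining such units yields one path through the processing graph. The interface name and the composition helper below are illustrative assumptions, not a fixed API.

    import java.util.function.Function;

    /** Sketch of a computing unit: a step that turns one intermediate representation into the next. */
    public interface PipelineStep<I, O> extends Function<I, O> {

        /** Chains this step with the next one; repeated chaining forms a path through the processing graph. */
        default <R> PipelineStep<I, R> then(PipelineStep<O, R> next) {
            return input -> next.apply(this.apply(input));
        }
    }

A detector stage of type PipelineStep<Video, List<Marker>> could then be composed with a classification stage of type PipelineStep<List<Marker>, Summary> via detect.then(classify); an ensemble would merge the outputs of several parallel steps before the next stage.
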
                                                                         marker. This step is not responsible for computing whether a
   At this stage of the development, the pipeline is restricted          marker carries importance or another meaning. The meaning
to processing recorded lectures only. Because lectures are               is already interpreted by the linguistic modelling work con-
a restricted and well described context, the meaning of be-              ducted in other stages of the project.
havioural patterns is more clearly visibly. An example would                The image marker detector will be developed with Open-
concern the field, tenor and mode of the situations, where               Pose [Cao et al., 2016]. This tool can detect a human skeleton
field is “the nature of the social activity. . . ”, tenor is “the        in an image. With OpenPose, we want to identify three kinds
nature of social relations. . . ”, and mode is “the nature of            of markers. We want to identify a) whether the lecturer is
contact. . . ” [Hasan, 1999]. As outlined by Wegener and                 looking at their notes, b) pointing at the board and/or c) writ-
Cassens [2016], we have a specific mode (lecture), a specific            ing onto the board. We know that these marker types signal
tenor (student and lecturer), and a specific field (introduction         importance in the lecture [Schüller, 2018].
to computational thinking), as well as a definable material sit-            The audio marker detectors will focus on prosody and
uational setting (the sloping auditorium, with multiple chalk            loudness. In this case, we have to apply a machine learn-
boards). This makes it easier to identify markers.                       ing tool to identify the markers. Preliminary research shows
                                                                         that there are prosodic and loudness patterns which can be
5.1   Preprocessing Video                                                found when the lecturer switches to a new topic and when
The video data is preprocessed to split the video data into              they emphasize different bits of information.
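
A minimal sketch of the splitting step is given below. It assumes the widely used ffmpeg command line tool is installed; the file names, frame rate, and audio parameters are placeholders rather than settled choices.

    import java.io.IOException;

    /** Sketch: split a lecture recording into an audio track and image frames with ffmpeg (assumed installed). */
    public class VideoSplitter {

        static void run(String... command) throws IOException, InterruptedException {
            new ProcessBuilder(command).inheritIO().start().waitFor();
        }

        public static void main(String[] args) throws Exception {
            // Extract a 16 kHz mono WAV track for the acoustic analysers.
            run("ffmpeg", "-i", "lecture.mp4", "-vn", "-ac", "1", "-ar", "16000", "audio.wav");
            // Extract five image frames per second for the image marker detectors.
            run("ffmpeg", "-i", "lecture.mp4", "-vf", "fps=5", "frames/%06d.png");
        }
    }
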
5.2 Detecting Markers

The detection of markers comprises the first step in the pipeline and is done with the video (image and audio) and transcript data. The transcript data would usually come from the audio data through the speech-to-text processor, but in this particular case, the transcript was created and patterned into clauses manually. To detect markers, we want to develop several kinds of detectors which analyse the video and the transcribed text. We can categorise the detectors into audio, including text, and image, and we will develop marker detectors for each of these groups. This step is only responsible for detecting a marker and returning a confidence for it; it is not responsible for computing whether a marker carries importance or another meaning. The meaning is already interpreted by the linguistic modelling work conducted in other stages of the project.

The image marker detector will be developed with OpenPose [Cao et al., 2016], a tool that can detect a human skeleton in an image. With OpenPose, we want to identify three kinds of markers: a) whether the lecturer is looking at their notes, b) pointing at the board, and/or c) writing on the board. We know that these marker types signal importance in the lecture [Schüller, 2018].
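
As an illustration, a board-writing detector could apply a simple geometric test to the keypoints that OpenPose emits per frame (x, y, confidence triples in its BODY_25 layout). The board region and the choice of the wrist keypoint below are hand-tuned assumptions for a single camera setup, not part of OpenPose itself; a learned classifier over keypoint trajectories would be the more robust alternative.

    /** Sketch of a board-writing heuristic over one OpenPose BODY_25 pose (75 floats: 25 keypoints x 3). */
    public class BoardWritingHeuristic {

        static final int R_WRIST = 4; // index of the right wrist in the BODY_25 layout

        // Assumed pixel region of the chalk board for this particular camera setup.
        static final double BOARD_LEFT = 100, BOARD_RIGHT = 700, BOARD_TOP = 50, BOARD_BOTTOM = 450;

        /** Returns a confidence that the lecturer is writing: the wrist keypoint lies inside the board region. */
        static double confidence(float[] pose) {
            float x = pose[3 * R_WRIST], y = pose[3 * R_WRIST + 1], c = pose[3 * R_WRIST + 2];
            boolean inBoardRegion = x >= BOARD_LEFT && x <= BOARD_RIGHT && y >= BOARD_TOP && y <= BOARD_BOTTOM;
            return inBoardRegion ? c : 0.0; // reuse the keypoint confidence as the marker confidence
        }
    }
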
The audio marker detectors will focus on prosody and loudness. In this case, we have to apply a machine learning tool to identify the markers. Preliminary research shows that there are prosodic and loudness patterns which can be found when the lecturer switches to a new topic and when they emphasise different bits of information.

As for the textual data, the work of Wegener et al. [2017] shows that certain keywords (e.g. the use of 'needing', 'wanting', and 'going' by the lecturer of the computer science lecture) identify important parts in the lecture, too. Therefore, a textual marker of this type will be generated for sets of these words. Another marker to identify is whether the lecturer is using continuatives like 'so', 'ok', 'all right'. These represent flags that signal topic shifts [Schüller, 2018].

The detectors described so far are just examples. The pipeline should provide a plug-in architecture to which several marker detectors can be added. To this end, a programming interface to be used by the detectors has to be defined. This also makes it possible to have multiple extractors for the same marker, as described above. The configuration of the classification step defines how the markers should be combined to produce a contextually relevant summary.
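
One possible shape for this interface is sketched below, together with a trivial textual detector for continuatives. All names are illustrative assumptions; for brevity, only the clause-patterned transcript is passed in, whereas image and audio detectors would receive their own preprocessed streams.

    import java.util.ArrayList;
    import java.util.List;

    /** A detected marker: type, time span in seconds, and detector confidence in [0, 1]. */
    record Marker(String type, double startSec, double endSec, double confidence) {}

    /** A transcript clause with timestamps; a simplifying assumption about the preprocessed input. */
    record Clause(String text, double startSec, double endSec) {}

    /** Sketch of the plug-in interface every marker detector would implement. */
    interface MarkerDetector {
        List<Marker> detect(List<Clause> transcript);
    }

    /** Textual detector flagging continuatives such as "so", "ok", "all right" at the start of a clause. */
    class ContinuativeDetector implements MarkerDetector {
        private static final List<String> CONTINUATIVES = List.of("so", "ok", "all right");

        @Override
        public List<Marker> detect(List<Clause> transcript) {
            List<Marker> markers = new ArrayList<>();
            for (Clause clause : transcript) {
                String text = clause.text().toLowerCase().strip();
                for (String word : CONTINUATIVES) {
                    if (text.startsWith(word)) {
                        markers.add(new Marker("continuative", clause.startSec(), clause.endSec(), 1.0));
                        break;
                    }
                }
            }
            return markers;
        }
    }
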
5.3 Classifying the Video with the Detected Markers

The different kinds of data input for this step are used to classify the video. We call this data triangulation because of the combination of image, audio, and textual data. This step is only responsible for classifying, with the help of the previously detected markers, the segments of the video which are considered to be important. Its input is therefore the markers from the detection step. Because we only have markers for importance, we can only classify segments of the video as important or not important.




This part of the pipeline should be highly configurable because the markers have to be aggregated to map segments of the video into the two classes used-in-summary and not-used-in-summary. So far, the following approaches have been considered: Marker Overlapping Analysis, neural networks, and linear classification.

The Overlapping Analysis will visualise the existing video and annotate it with the markers. Then the markers will be compared with the expert input: we analyse where and which types of markers concur with the expert notes in the summary. If a marker is in line with the expert notes, then this type of marker could be used for summarisation. The interval length is configured with the same method. The configuration is made by a linguist, who configures for each marker whether the corresponding part of the video belongs to an extractive summary and, if so, how long the segment should be. The visualisation for the markers will look similar in principle to figure 2. The configuration is mostly exploratory for linguists.

The configuration of the classification can vary in complexity. A simple way would be to look at just a single marker type and use it directly for the summariser. A more complex configuration could use multiple marker types and their confidences to model a function which estimates the class. Because of the plug-in architecture, it is also possible that there are multiple markers of the same type.

An example of a classification could be to use the marker "looking at notes" together with "prosody and loudness". We could then build a configuration such that if the lecturer looks at the notes right after raising his voice, this scenario is used as a signal of importance.
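
A sketch of such a rule is shown below, reusing the Marker record from the detector sketch in section 5.2; the marker type names and the five second window are illustrative assumptions that a linguist would tune in the configuration.

    import java.util.List;

    /** Sketch of a rule-based configuration: "looking at notes" shortly after a loudness rise signals importance. */
    public class FlagTargetRule {

        /** True if a looking-at-notes marker starts within windowSec after the end of a loudness-rise marker. */
        static boolean signalsImportance(List<Marker> markers, double windowSec) {
            for (Marker flag : markers) {
                if (!flag.type().equals("loudness-rise")) continue;
                for (Marker target : markers) {
                    if (target.type().equals("looking-at-notes")
                            && target.startSec() >= flag.startSec()
                            && target.startSec() <= flag.endSec() + windowSec) {
                        return true;
                    }
                }
            }
            return false;
        }
    }
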
Figure 2: Classify Video with Marker Overlapping Analysis

Given the Marker Overlapping Analysis, we can add further types of markers which have not already been found to be flags or targets. These markers will then be explored iteratively with the same method of analysis. If they also seem to co-occur with the expert notes, the new marker types could be used to improve the quality of the summary.

5.4 Using Classified Video Segments for Summarisation or Notification

The detected markers can be used to automatically summarise the lecture. In that case, we use the time information from the important video parts and cut them into a summary. Figure 2 shows a preview of the generated video. Ideally, the markers could be used for more than just a summary. For example, the video could become searchable: a query could then be used to find every part where the lecturer is writing on the board. The aim is to have a real time processing pipeline, in which case the students could be notified directly while the lecture is being recorded.
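
The cutting itself can be reduced to merging the important time spans and extracting each merged span from the recording. The sketch below prints ffmpeg commands for this; the file names are placeholders, and the final concatenation of the clips is left out for brevity.

    import java.util.ArrayList;
    import java.util.List;

    /** Sketch: merge important time spans and emit ffmpeg commands that cut them out of the recording. */
    public class SummaryCutter {

        record Span(double startSec, double endSec) {}

        /** Merges overlapping spans (input sorted by start time) so that each summary clip is contiguous. */
        static List<Span> merge(List<Span> sortedSpans) {
            List<Span> merged = new ArrayList<>();
            for (Span span : sortedSpans) {
                if (!merged.isEmpty() && span.startSec() <= merged.get(merged.size() - 1).endSec()) {
                    Span last = merged.remove(merged.size() - 1);
                    merged.add(new Span(last.startSec(), Math.max(last.endSec(), span.endSec())));
                } else {
                    merged.add(span);
                }
            }
            return merged;
        }

        static void printCutCommands(List<Span> spans) {
            int clip = 0;
            for (Span span : spans) {
                System.out.printf("ffmpeg -i lecture.mp4 -ss %.2f -to %.2f -c copy clip%03d.mp4%n",
                        span.startSec(), span.endSec(), clip++);
            }
        }
    }
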
6 Development methodology and evaluation

The methodology for the programming part is feature driven development [Coad et al., 1999]. First, the image, audio, and textual marker detectors will be implemented; only after this step is done do we know what input data exists for the classification step. Then, a server will be implemented. We want to use a dedicated server because some marker detectors have a high computational complexity, so it could be beneficial if the computation load can be spread over multiple physical machines. After that, a graphical user interface (GUI) will be implemented, primarily to provide the Marker Overlapping Analysis. While the choice of technology can still change, the GUI will likely be implemented with the Eclipse RAP framework. This implies that the server and the GUI will be implemented in the programming language Java, which was chosen because programming experience with this language exists. Tools and frameworks in other languages (Python comes to mind in particular for NLP functions) will be used through Java language bindings or wrappers.

For the first evaluation, a second video of the same lecturer and of a lecture in the same semester should be summarised with this tool and the previous configuration. It is plausible that both the configuration and the learned models depend on the individual lecturer, so the first evaluation will need to take that into account. Ideally, the same experts that provided the training notes would analyse the output and judge the quality of the summary.

A second evaluation scenario will compare the Marker Overlapping Analysis with neural networks or linear classification, again using the same lecturer. In this second evaluation, a metric could be used to quantify the quality: for example, how do the experts judge the quality of the different configurations, and how well does the automatic summariser agree with the expert annotations?
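
One candidate metric is temporal agreement between the automatic summary and the expert annotation, expressed as precision and recall over overlapping time spans. The sketch below reuses the Span record from the cutting sketch in section 5.4 and assumes both span lists have already been merged so they are non-overlapping.

    import java.util.List;

    /** Sketch of a time-overlap metric between an automatic summary and the expert annotation. */
    public class SummaryAgreement {

        static double overlapSeconds(List<SummaryCutter.Span> a, List<SummaryCutter.Span> b) {
            double total = 0;
            for (SummaryCutter.Span x : a)
                for (SummaryCutter.Span y : b)
                    total += Math.max(0, Math.min(x.endSec(), y.endSec()) - Math.max(x.startSec(), y.startSec()));
            return total;
        }

        static double totalSeconds(List<SummaryCutter.Span> spans) {
            return spans.stream().mapToDouble(s -> s.endSec() - s.startSec()).sum();
        }

        /** Precision: the fraction of the automatic summary that overlaps expert-annotated importance. */
        static double precision(List<SummaryCutter.Span> auto, List<SummaryCutter.Span> expert) {
            return overlapSeconds(auto, expert) / totalSeconds(auto);
        }

        /** Recall: the fraction of the expert summary that is covered by the automatic one. */
        static double recall(List<SummaryCutter.Span> auto, List<SummaryCutter.Span> expert) {
            return overlapSeconds(auto, expert) / totalSeconds(expert);
        }
    }
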
Third and last, the tool needs to be tested with a) different lecturers and b) different topics. This is delegated to future work to improve the usefulness of the deployed tool.




A use case scenario could be that a student is preparing for an exam and wants to watch the most significant parts of a lecture. This system could provide an extractive summary for the student: the student just has to upload the video, the pipeline builds a summary based on a configuration, and the student then downloads the shortened video.

However, since note taking and summarisation are important skills to be learned by students, our system could also be integrated with academic writing support systems to facilitate learning to write good summaries.

7 Summary

We have outlined a research programme to build and evaluate a pipeline for extracting multi-modal markers for meaning in lectures. The overall architecture has been described and the dependency on existing tools outlined. Finally, possible evaluation methods have been described.

While implementation of the prototype has started, we are still in the early stages of realisation. The overall architecture has been defined, but, for example, which machine learning methods or tools are to be used primarily has not been finalised. In this respect, this is a concept paper, but it is also an incremental update on our previously published work on the underlying theory. The next steps are implementing the pipeline as a whole and performing the evaluations outlined.

8 Acknowledgements

We would like to thank the two host institutions, the University of Hildesheim and RWTH Aachen University, for making cross disciplinary student participation in this project possible. The first author is currently writing his master's thesis on the project; the second author participated in the RWTH Undergraduate Research Opportunities Program (UROP) and wrote her bachelor's thesis on the topic.
References

Françoise Boch and Annie Piolat. Note taking and learning: A summary of research. The WAC Journal, 16, pages 101–113, 2005.

David Butt, Rebekah Wegener, and Jörg Cassens. Modelling behaviour semantically. In P. Brézillon, P. Blackburn, and R. Dapoigny, editors, Proceedings of CONTEXT 2013, number 8175 in LNCS, pages 343–349, Annecy, France, 2013. Springer.

Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. arXiv:1611.08050 [cs], November 2016.

John Cartmill, Alison Moore, David Butt, and Lyn Squire. Surgical teamwork: systemic functional linguistics and the analysis of verbal and non verbal meaning in surgery. ANZ Journal of Surgery, pages 925–929, 2007.

Peter Coad, Eric Lefebvre, and Jeff De Luca. Java Modeling in Color with UML. Prentice Hall, Upper Saddle River, NJ, 1999.

M. A. K. Halliday and Christian Matthiessen. An introduction to functional grammar (4th ed.). Routledge, London/New York, 2014.

Ruqaiya Hasan. Situation and the definition of genre. In Allen Grimshaw, editor, What's going on here? Complementary Analysis of Professional Talk: volume 2 of the multiple analysis project. Ablex, Norwood, N.J., 1994.

Ruqaiya Hasan. Speaking with reference to context. In Mohsen Ghadessy, editor, Text and Context in Functional Linguistics. John Benjamins, Amsterdam, 1999.

Fairouz Hussein, Sari Awwad, and Massimo Piccardi. Joint action recognition and summarization by sub-modular inference. In ICASSP, 2016.

Anders Kofod-Petersen, Rebekah Wegener, and Jörg Cassens. Closed doors – modelling intention in behavioural interfaces. In Anders Kofod-Petersen, Helge Langseth, and Odd Erik Gundersen, editors, Proceedings of the Norwegian Artificial Intelligence Society Symposium (NAIS 2009), pages 93–102, Trondheim, Norway, November 2009. Tapir Akademiske Forlag.

Annabelle Lukin, Alison Moore, Maria Herke, Rebekah Wegener, and Canzhong Wu. Halliday's model of register revisited and explored. Linguistics and the Human Sciences, 2011.

Ron Martinez, Svenja Adolphs, and Ronald Carter. Listening for needles in haystacks: how lecturers introduce key terms. ELT Journal, 67(3), pages 313–323, 2013.

Sameer Maskey and Julia Hirschberg. Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In Interspeech 2005, pages 621–624, 2005.

Sameer Maskey and Julia Hirschberg. Summarizing speech without text using hidden Markov models. In Proceedings of NAACL HLT, Companion Volume: Short Papers, pages 89–92. ACL, 2006.

Sylvain Meignier and Teva Merlin. LIUM SpkDiarization: An open source toolkit for diarization. In CMU SPUD Workshop, Dallas, Texas, 2010.

Bianca Schüller. Understanding the identification of important information in academic lectures: Linguistic and cognitive approaches. Bachelor thesis, RWTH Aachen University, 2018.

Rebekah Wegener and Jörg Cassens. Multi-modal markers for meaning: using behavioural, acoustic and textual cues for automatic, context dependent summarization of lectures. In J. Cassens, R. Wegener, and A. Kofod-Petersen, editors, Proceedings of the Eighth International Workshop on Modelling and Reasoning in Context, 2016.

Rebekah Wegener, Bianca Schüller, and Jörg Cassens. Needing and wanting in academic lectures: Profiling the academic lecture across context. In Phil Chappell and John S. Knox, editors, Transforming Contexts: Papers from the 44th International Systemic Functional Congress, Wollongong, Australia, 2017.

Rebekah Wegener. Instantiation in modelling multimodal communication: Challenges and proposals, part 1. Presentation at the European Systemic Functional Linguistics Conference, Salamanca, Spain, 2017.


