=Paper=
{{Paper
|id=Vol-2134/paper04
|storemode=property
|title=A Pipeline for Extracting Multi-Modal Markers for Meaning in Lectures
|pdfUrl=https://ceur-ws.org/Vol-2134/paper04.pdf
|volume=Vol-2134
|authors=Johannes Ude,Bianca Schüller,Rebekah Wegener,Jörg Cassens
|dblpUrl=https://dblp.org/rec/conf/ijcai/UdeSWC18
}}
==A Pipeline for Extracting Multi-Modal Markers for Meaning in Lectures==
Johannes Ude¹, Bianca Schüller², Rebekah Wegener², Jörg Cassens¹
¹University of Hildesheim, ²RWTH Aachen University
udejoh@uni-hildesheim.de, bianca.schueller@rwth-aachen.de, rebekah.wegener@ifaar.rwth-aachen.de, cassens@cs.uni-hildesheim.de
Abstract

This article introduces initial concepts for a context sensitive computing pipeline to detect multi-modal markers for meaning in video and audio data, to notify the audience of markers of importance, and then to classify sequences of a recorded video into segments by content and importance in order to summarise the content as video and audio and in other modalities. In this paper, we first consider the linguistic background and then present the input data for the pipeline. Finally, we outline the concepts which are to be implemented in each step of this pipeline and discuss how the evaluation of this pipeline can be achieved.

1 Introduction

The summarisation of multimodal utterances is an important area of research in linguistics, natural language processing, and multimodal interaction. Summarisation of text is a difficult task in itself, and the methods used vary widely depending on the purpose or function of the summarisation. Most recent work in natural language processing integrates lexical, acoustic/prosodic, textual, and discourse features for effective summarisation [Maskey and Hirschberg, 2005]. Only recently, however, have behavioural features been taken into consideration [Hussein et al., 2016], and there only to summarise the movement in a video. Behaviour is frequently under-utilised as a modality because it is treated as a contextual footnote to speech. However, it can be equally meaning-bearing and can often signal meaning prior to verbalisation [Butt et al., 2013]. For additional supporting arguments, see Lukin et al. [2011] and Cartmill et al. [2007].

But meaning making is most often multi-modal, including aspects of behaviour, and it is this multi-modality that we make use of in outlining a model for an automatic and context dependent note-taking system for academic lectures (see Wegener and Cassens [2016] for an overview of the research program). Drawing on semiotic models of gesture and behaviour, linguistic models of text structure and sound, and a rich model of context, we argue that the combination of information from all of these modalities through data triangulation provides a better basis for information extraction and summarisation than each modality alone. Further, we suggest that by using a rich model of context that maps the unfolding of the text in real time to features of the context, we can produce query-driven summarisation.

This paper focuses on the proposed computational pipeline for processing the different input data streams. The aim of the work presented here is to discuss the different components needed for implementing a functional prototype of the system.

The concept of using different modalities for the summarisation of videos has been employed in a number of earlier works; see for example Maskey and Hirschberg [2005], where an acoustic signal was used in addition to text in order to summarise, with broadcast news as the proof of concept. As a classifier, Maskey and Hirschberg [2006] used a hidden Markov model.

The first application domain we are looking at is academic lectures, as these are largely monologic in nature, making them easier to work with computationally. They are also readily available as large corpora and represent a clear case where extracting important information is useful. Besides providing a working prototype, we would also like to use the system to showcase the concept of “smart data”, meaning that we make use of existing knowledge: on the one hand to model the necessary contextual parameters and, on the other hand, to be able to use a limited amount of data to learn and tune the computational model. We have successfully used this method before to model intention as expressed through behaviour [Kofod-Petersen et al., 2009]; see Butt et al. [2013] for some methodological background. In future research, a further application domain next to monologic discourse will be dialogic discourse, for example telemedicine consultations. Both monologic and dialogic domains provide foundations for multi-participant dialogic and multilingual domains, such as business meetings or team work.

2 Method and Motivation

In order to explore how humans detect and classify important information, we used note-taking data from two studies. Both were based on a stimulus lecture from a recorded first-year computer science course at MIT. The lecture is the first lecture in that course and has a length of approximately 55 minutes. It can be divided into four main parts: 1. administrivia, 2. lecture, 3. examples, and 4. conclusion [Wegener, 2017]. The administrivia and lecture parts consist of three sections each.
In the first study, by Wegener et al. [2017], experts in computer science were asked to annotate the lecture transcript according to what they consider to be important information that students should take away from the lecture. As the lecturer could not be asked personally, experienced computer scientists were asked for their opinion instead, so that their annotations could act as a ground truth for measures of importance. It was found that the four experts who took part in the study largely agree in their notions of importance, which is why their annotations were combined into one. The extractive summary that they created was transferred to notes at a later point.

The second study, by Schüller [2018], involved undergraduate and graduate students of computer science or mechanical engineering, both native and non-native speakers of English. They were asked to watch the recorded lecture and to take notes either by hand or by typing, depending on their preferences. Afterwards, they filled in a short survey that asked them for some demographic data and for information on their note-taking practices. Notes were collected from nine students in total, one of whom was a native speaker of English. Two students were native speakers of Albanian and the remaining ones were German. There was a balance of male and female participants and of writing and typing, and there were slightly more graduate participants than undergraduates. All participants were competent users of English. Three participants spoke three languages other than English, five participants spoke one, and the native speaker spoke none. This study was done in order to compare what experts expect students to take notes on with what the students actually do take notes on, i.e. to compare the different notions of importance.

Initial results show that the notion of importance differs considerably between the experts and the students, as well as among the students themselves, which can be seen in the number of words they consider relevant in the different sections. This difference provides a strong motivation for the development of a system that can guide students in identifying importance during academic lectures. In total, the expert notes comprised 1775 words, while the student notes ranged in fairly even steps between 93 and 691 words. This is partly due to the fact that the experts did not have to take notes while watching the recording, but it is also attributed to the proficiency they have in their field as well as their note-taking competence. Among the students, the native speaker and the participants who spoke four languages took more notes than the others. When looking at the different lecture sections, the lowest discrepancy between expert and student notes is in the welcome (1.1), administrivia (1.3), and summary (4.) sections, while the highest discrepancy is in the examples part, followed by the course goals (1.2) and the lecture part (2.). The variation seen in importance across different phases of the lecture suggests that a model of the generic structure of a lecture could be useful for the information extraction process.

While it is natural that experts consider more information to be important than what students take notes on, what needs to be regarded is the size and location of the gap between the number of words noted by the experts and by the students in the different phases of the lecture. Where the distance is small, the students largely understand what the experts expect; but where the gap increases, there might be problems in capturing the students’ attention. This motivates a system that alerts students when important information is signalled, so that their note-taking skills can improve. These notifications can also act as cues for an automated summarisation system to denote important parts of the lecture.

3 Linguistic background

The initial goal of this study is to obtain, automatically and without looking at the transcript of the text, a summary that is similar in nature to the experts’ notes. Previous research has shown that the important parts of the lecture can be identified consistently when analysing the video and audio [Schüller, 2018]. The important parts (targets) are indicated by the lecturer’s behaviour (e.g. gestures, movement, gaze, and visual target) and their voice (e.g. prosodic markers, pitch, tone, loudness); important markers can, of course, also be found in the textual transcript of the audio if that is available (e.g. words such as “right”, “so”, “ok”).

A marker is a specific pattern found in one of these three categories of data. We focus on markers that signal important parts of the lecture. This means that whenever a marker of a specific type appears in a lecture, the corresponding small part of the lecture should be considered important and therefore included in a summary.
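On the computational side, a marker of this kind can be represented quite directly. The following is a minimal data-structure sketch in Java (the language envisaged for the prototype, see Section 6); the field and type names are our own illustrative choices, not a fixed design of the project.

```java
import java.time.Duration;

/** Modality in which a marker was detected. */
enum Modality { IMAGE, AUDIO, TEXT }

/**
 * A detected marker: a typed pattern found in one of the three categories
 * of data, located on the time line of the lecture. Field names are
 * illustrative; a real design might carry further metadata.
 */
record Marker(String type,          // e.g. "board-writing", "continuative"
              Modality modality,    // category of data the pattern was found in
              Duration start,       // offset from the start of the lecture
              Duration end,         // end of the interval the pattern covers
              double confidence) {} // detector confidence in [0, 1]
```

The flag/target distinction introduced below could then be carried either in the type or as an additional field.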
By applying linguistic analyses from Systemic Functional Linguistics (SFL) [Halliday and Matthiessen, 2014], as well as research on note-taking triggers in Cognitive Linguistics and Psychology [Boch and Piolat, 2005], to the computer science lecture the experiments focused on, markers that signal importance have been found. SFL was chosen as an approach because it looks at language as a system of choices and therefore offers useful tools for investigating how meaning-making works in texts. In the analysis, SFL, which looks at language in its social context, and cognitive linguistics, which looks at language from the perspective of language users, were combined in order to consider different perspectives for the detection of markers.

We differentiate between markers that act as flags and markers that are targets. Flags are multimodal markers that appear before the target text and are therefore a signal to pay attention to what follows, while targets are the multimodal markers that we try to extract. The following types of multimodal markers for meaning have already been observed in previous work carried out as part of this ongoing research and will be incorporated into the model:

Board-writing: Building on work by Boch and Piolat [2005], board-writing was found to act as a significant target [Schüller, 2018]. Being a behaviour, board-writing can be found within the image data.

Pointing Gestures: Schüller [2018] found pointing gestures to be targets as well. While they do not trigger note-taking to as high an extent as board-writing does, they showed discrepancies between experts’ and students’ notes and could therefore be a marker that students tend to miss. Being a behaviour as well, they are also part of the image data.
Notes as the Lecturer’s Visual Target: Appearing together with the textual marker of continuatives, the image data of the lecturer looking at his notes was found to act reliably as a flag [Schüller, 2018].

Continuatives: As mentioned above, when appearing together with the lecturer looking at his notes, the textual marker of continuatives acts as a flag.

‘Needing’, ‘Wanting’, ‘Going’: Wegener et al. [2017] discovered that certain kinds of process types, as well as the use of the going-to future, appear quite frequently in the computer science lecture from the experiment. They are textual markers that are targets.

Most commonly repeated words in lexical bundles: Taking word lists from Martinez et al. [2013], specific words from lexical bundles were found to match the continuatives and process types mentioned above, making them further textual markers that are targets [Schüller, 2018].

It needs to be added at this point that the markers above differ largely in their ubiquity; in other words, they are more or less dependent on the context in which they appear, such as the lecturer and the topic. For example, while processes like ‘needing’ and ‘wanting’ are tightly connected to the lecturer of the computer science lecture under study, commonly repeated words in lexical bundles appear to be more universal within the domain of academic lectures.

The next step is to look at the technical parts and determine how these markers could be detected automatically. Where possible, we will make use of existing tools that help us extract more information from the audio and image data. After that, we want to segment the video into parts, mark important aspects, and generate a summary of the important parts.

4 Input data

In the final system [Wegener and Cassens, 2016], which will be applied to live lectures, there will be two sensor streams that collect audio and image data. The audio data will go through an acoustic signal analyser on the one hand, in order to get phonetic, prosodic, and text-level data; on the other hand, it will go through a speech-to-text processor to get grammatical, lexical, and cohesion data. The image data will go through gesture recognition to get behavioural, gesture, and micro-gesture data. Together, these form the multi-modal ensembles that are used in combination with models of context and generic structure to detect important information.

The training data consists of the video of the lecture (comprising audio and image data) and the notes from the experts. The expert notes will not be available during deployment of the system. The use of the annotation data is twofold. Firstly, the expert notes are used for benchmarking the classification step of the pipeline: because we want a summarising pipeline, the automatic summarisation output should ideally be comparable to the expert notes.

Secondly, the expert notes are a ground truth for importance that helps us with the training data. The manual annotation by the experts denotes those parts of the video where markers of importance should show up, if they exist. So instead of learning from data alone, we already know where and when in the data to look for these markers. In essence, without the expert data we do not know whether a marker really signifies importance, nor the time and duration of the importance; neither do we know which parts should be included in the summary. All of these aspects would have to be learned and validated if the manual annotation were not available.

In our proposed system, the computing pipeline that detects multi-modal markers for meaning in order to classify video parts by whether or not they belong to a summary can thus be learned from both manual annotation and data. For deployment and evaluation of the finished system, just the video and audio of the lecture are given.

Another input type, not in the sense of data but in the sense of a model, is knowledge from linguistic research, provided as a configuration. We already know which markers exist and what meaning they hold. In this pipeline, we use multi-modal markers acting as flags or targets, so the markers signal the parts of the lecture which are considered important.

Furthermore, the configuration can hold a generic structure potential (GSP) model for lectures [Hasan, 1994]. A generic structure potential is a statement of the likely structure of a context; a generic structure does not, however, mean that there will be no variation. The GSP model can be used to determine a weight for each marker in a specific time interval of the lecture. For example, at the beginning of the lecture there are just greetings and organisational information. Even when a marker is detected in these parts, the weights could modify whether this marker should be used or not.
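To make this concrete, the sketch below shows one possible shape for such a weighting configuration in Java. The stage boundaries and weight values are invented for illustration; in practice they would be derived from the GSP model and the expert annotation.

```java
import java.util.List;
import java.util.Map;

/** One stage of the generic structure potential, with per-marker-type weights. */
record GspStage(String name, double startMin, double endMin,
                Map<String, Double> weights) {}

class GspConfiguration {
    // Hypothetical stages for the 55-minute stimulus lecture; the minute
    // boundaries and weights are illustrative guesses, not measured values.
    static final List<GspStage> STAGES = List.of(
        new GspStage("administrivia", 0, 12, Map.of("board-writing", 0.2)),
        new GspStage("lecture",      12, 35, Map.of("board-writing", 1.0)),
        new GspStage("examples",     35, 50, Map.of("board-writing", 1.0)),
        new GspStage("conclusion",   50, 55, Map.of("board-writing", 0.5)));

    /** Weight of a marker type at a given minute of the lecture; defaults to 1. */
    static double weightAt(String markerType, double minute) {
        for (GspStage stage : STAGES) {
            if (minute >= stage.startMin() && minute < stage.endMin()) {
                return stage.weights().getOrDefault(markerType, 1.0);
            }
        }
        return 1.0;
    }
}
```

A detected marker’s confidence would then be multiplied by this weight before classification, so that, for instance, board-writing during the administrivia stage contributes little to the summary.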
5 Pipeline

The overall system will work in two phases. The first one is the learning phase, with the given expert notes as a ground truth; in this phase, the system learns a configuration based on the expert notes. The second phase is the production phase, without the expert notes but with a generalised configuration. We separate the system into these two phases because the expert notes are a costly factor; what we ultimately want is a generalised configuration that can classify all lectures.

The main part of the pipeline, which deals with identifying the markers themselves and evaluating their meaning potential for the text, consists of at least two steps: detect markers and classify video segments. Before this pipeline starts, preprocessing steps are necessary. For the user of the overall system, these tasks represent background tasks: only the input and the output are visible, as the user hands in the video and retrieves a summary or a notification. The architecture of a pipeline seems to be a natural fit, because the output of each step acts as the input for the subsequent step. Each step is therefore a computing unit which is responsible for a certain task. In the different steps of the pipeline, even processing of the same type of input can be handled by different units. For example, the same type of classifier could be implemented as a hidden Markov model, a Bayesian classifier, or a neural network, in which case it might make sense to combine all types of classifiers in an ensemble process. In this sense, the pipeline can be thought of as a directed acyclic graph, with the different input-output transformations forming the overall pipeline.

Figure 1 shows how the specific computing units in the overall pipeline are intended to fit together.

Figure 1: Computing Pipeline

At this stage of the development, the pipeline is restricted to processing recorded lectures only. Because lectures are a restricted and well-described context, the meaning of behavioural patterns is more clearly visible. An example would concern the field, tenor, and mode of the situations, where field is “the nature of the social activity…”, tenor is “the nature of social relations…”, and mode is “the nature of contact…” [Hasan, 1999]. As outlined by Wegener and Cassens [2016], we have a specific mode (lecture), a specific tenor (student and lecturer), and a specific field (introduction to computational thinking), as well as a definable material situational setting (the sloping auditorium, with multiple chalk boards). This makes it easier to identify markers.

5.1 Preprocessing Video

The video data is preprocessed to split it into audio and image data, to generate a transcript of the audio data, and to identify when the lecturer is speaking. The video should be split into images and audio because we only have marker detectors for one modality at a time. For the other preprocessing, we use additional software. To generate a transcript, we will use a speech-to-text generator like Dragon NaturallySpeaking. For the speaker diarization, i.e. to identify whether the lecturer or someone else is speaking, we will use the LIUM software [Meignier and Merlin, 2010]. Under many circumstances, only the lecturer would have the microphone, which would negate the necessity for speaker diarization. However, as in this particular instance, the microphone can be shared by two or more lecturers. If we know that the main speaker is not speaking, we could modify the weight so that such parts will not be included in the summary. Additionally, speaker diarization represents an important preprocessing step in dialogic speaking situations, meaning that considering speaker diarization already now will make it easier to apply the system to different domains in the future.

In any case, speaker diarization is not a single processing step but includes useful subtasks such as the detection of changes in prosody and loudness. In our experiments, speaker diarization sometimes misclassified the main speaker; further analysis showed that features leading to this misclassification could be used elsewhere, within the acoustic marker detector. The speech-to-text function is optional as well: the textual marker detectors of course depend on the speech-to-text generator, but during deployment the system should work without the textual markers.

5.2 Detecting Markers

The detection of markers comprises the first step in the pipeline and is done with the video (image and audio) and transcript data. The transcript data would usually come from the audio data through the speech-to-text processor, but in this particular case the transcript was created and segmented into clauses manually. To detect markers, we want to develop several kinds of detectors which analyse the video and the transcribed text. We can categorise the detectors into audio, including text, and image, and for each of these groups we will develop marker detectors. This step is only responsible for detecting a marker and returning a confidence for a certain marker; it is not responsible for computing whether a marker carries importance or another meaning. The meaning is already interpreted by the linguistic modelling work conducted in other stages of the project.

The image marker detectors will be developed with OpenPose [Cao et al., 2016], a tool that can detect a human skeleton in an image. With OpenPose, we want to identify three kinds of markers: whether the lecturer is a) looking at their notes, b) pointing at the board, and/or c) writing onto the board. We know that these marker types signal importance in the lecture [Schüller, 2018].
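As a sketch of how such an image marker might be derived, assume OpenPose’s per-frame keypoints have already been parsed into simple coordinate triples. The keypoint choice and thresholds below are illustrative guesses, not tuned values from the project:

```java
/** A 2D keypoint with detection confidence, as parsed from OpenPose output. */
record Keypoint(double x, double y, double confidence) {}

class PointingHeuristic {
    /**
     * Crude per-frame test for a pointing gesture: the arm counts as
     * extended when the direct shoulder-wrist distance is close to the sum
     * of the shoulder-elbow and elbow-wrist distances.
     */
    static boolean looksLikePointing(Keypoint shoulder, Keypoint elbow,
                                     Keypoint wrist) {
        if (shoulder.confidence() < 0.3 || elbow.confidence() < 0.3
                || wrist.confidence() < 0.3) {
            return false; // keypoints too unreliable in this frame
        }
        double upperArm = dist(shoulder, elbow);
        double forearm = dist(elbow, wrist);
        double direct = dist(shoulder, wrist);
        return direct > 0.95 * (upperArm + forearm); // arm roughly straight
    }

    private static double dist(Keypoint a, Keypoint b) {
        return Math.hypot(a.x() - b.x(), a.y() - b.y());
    }
}
```

Per-frame decisions of this kind would still have to be smoothed over time and combined with the arm’s direction relative to the board before being emitted as a pointing marker with a confidence.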
The audio marker detectors will focus on prosody and loudness. In this case, we have to apply a machine learning tool to identify the markers. Preliminary research shows that there are prosodic and loudness patterns which can be found when the lecturer switches to a new topic and when they emphasise different bits of information.

As for the textual data, the work of Wegener et al. [2017] shows that certain keywords (e.g. the use of ‘needing’, ‘wanting’, and ‘going’ by the lecturer of the computer science lecture) identify important parts of the lecture, too. Therefore, a textual marker of this type will be generated for sets of these words. Another marker to identify is whether the lecturer is using continuatives like ‘so’, ‘ok’, ‘all right’; these represent flags that signal topic shifts [Schüller, 2018].

The detectors described so far are just examples. The pipeline should provide a plug-in architecture to which several marker detectors can be added. To this end, a programming interface to be used by the detectors has to be defined. This also makes it possible to have multiple extractors for the same marker, as described above. The configuration of the classification step defines how the markers should be combined to produce a contextually relevant summary.
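One minimal shape such a detector interface could take is sketched below. The names are placeholders, the input is simplified to a clause-segmented transcript, and the marker structure repeats, in reduced form, the sketch from Section 3, so that the example is self-contained:

```java
import java.time.Duration;
import java.util.List;

/** Reduced marker structure (cf. the sketch in Section 3). */
record Marker(String type, Duration start, Duration end, double confidence) {}

/** A transcript clause located on the lecture time line. */
record TimedClause(String text, Duration start, Duration end) {}

/** Plug-in interface that every marker detector would implement. */
interface MarkerDetector {
    /** The marker type this detector emits, e.g. "continuative". */
    String markerType();

    /** Scan the input and return all detected markers. */
    List<Marker> detect(List<TimedClause> clauses);
}

/** Example plug-in: flags clauses that open with a continuative. */
class ContinuativeDetector implements MarkerDetector {
    private static final List<String> CONTINUATIVES =
        List.of("so", "ok", "all right");

    public String markerType() { return "continuative"; }

    public List<Marker> detect(List<TimedClause> clauses) {
        return clauses.stream()
            .filter(c -> CONTINUATIVES.stream()
                .anyMatch(w -> c.text().toLowerCase().startsWith(w)))
            .map(c -> new Marker(markerType(), c.start(), c.end(), 1.0))
            .toList();
    }
}
```

Image and audio detectors would need differently typed inputs, so a fuller design might parameterise the interface over the input modality; the confidence of 1.0 here simply reflects that this detector is a hard rule.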
5.3 Classifying the Video with the Detected Markers

The different kinds of data input for this step are used to classify the video. We call this data triangulation because of the combination of image, audio, and textual data. This step is only responsible for classifying those segments of the video which are considered important, with the help of the previously detected markers. It therefore takes as its input the markers from the detection step. Because we only have markers for importance, we can only classify segments of the video as important or not important.

This part of the pipeline should be highly configurable, because the markers have to be aggregated to map segments of the video into the two classes used-in-summary and not-used-in-summary. So far, the following approaches have been considered: marker overlapping analysis, neural networks, and linear classification.

The marker overlapping analysis will visualise the existing video and annotate it with the markers. The markers are then compared with the expert input: we analyse where, and which, types of markers concur with the expert notes in the summary. If a marker type is in line with the expert notes, it could be used for summarisation. The interval length is configured with the same method. The configuration is made by a linguist who configures, for each marker, whether the part of the video belongs to an extractive summary and, if so, how long the segment should be. The visualisation of the markers will look similar in principle to Figure 2. The configuration is mostly exploratory for linguists.

The configuration of the classification can vary in complexity. A simple way would be to look at just a single marker type and use it directly for the summariser. A more complex configuration could use multiple marker types and their confidences to model a function which estimates the class. Because of the plug-in architecture, it is also possible that there are multiple markers of the same type.

An example of a classification could be to use the marker “looking at notes” together with “prosody and loudness”. We could then build a configuration such that if the lecturer is looking at the notes right after raising his voice, this scenario could be used as a signal of importance.
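Such a flag-then-target configuration could be expressed as a simple rule over the detected markers; the window length and the marker type names below are illustrative choices:

```java
import java.time.Duration;
import java.util.List;

/** Reduced marker structure, as in the earlier sketches. */
record Marker(String type, Duration start, Duration end, double confidence) {}

class FlagThenTargetRule {
    /** How soon after the flag the target must begin; an illustrative guess. */
    static final Duration WINDOW = Duration.ofSeconds(10);

    /**
     * True if a "looking-at-notes" marker starts within WINDOW after the end
     * of a raised-voice ("prosody-loudness") marker.
     */
    static boolean signalsImportance(List<Marker> markers) {
        return markers.stream()
            .filter(m -> m.type().equals("prosody-loudness"))
            .anyMatch(flag -> markers.stream()
                .filter(m -> m.type().equals("looking-at-notes"))
                .anyMatch(target ->
                    !target.start().isBefore(flag.end())
                    && target.start().minus(flag.end()).compareTo(WINDOW) <= 0));
    }
}
```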
Figure 2: Classify Video with Marker Overlapping Analysis

Given the marker overlapping analysis, we can add further types of markers which have not already been found to be flags or targets. These markers will then be explored iteratively with the same method of analysis. If they, too, seem to co-occur with the expert notes, the new marker types could be used to improve the quality of the summary.

5.4 Using Classified Video Segments for Summarisation or Notification

The detected markers can be used to automatically summarise the lecture. In that case, we use the time information of the important video parts and cut them into a summary; Figure 2 shows a preview of the generated video. Ideally, the markers could be used for more than just a summary. For example, the video could become searchable: a query could then be used to find every part where the lecturer is writing on the board. The aim is to have a real-time processing pipeline, in which case the students could be notified directly while the lecture is being recorded.

6 Development methodology and evaluation

The methodology for the programming part is feature driven development [Coad et al., 1999]. First, the image, audio, and textual marker detectors will be implemented; only after this step is done do we know what input data exists for the classification step. Then, a server will be implemented. We want to use a dedicated server because some marker detectors have a high computational complexity, so it could be beneficial if the computational load can be distributed across multiple physical client machines. After that, a graphical user interface (GUI) will be implemented, primarily to provide the marker overlapping analysis. While the choice of technology can still change, the GUI will likely be implemented with the Eclipse RAP framework. This implies that the server and the GUI will be implemented in the programming language Java; Java is chosen because programming experience with this language exists. Tools and frameworks in other languages (Python comes to mind in particular for NLP functions) will be used through Java language bindings or wrappers.

For the first evaluation, a second video of the same lecturer, from a lecture in the same semester, should be summarised with this tool and the previous configuration. It is plausible that both the configuration and the models learned depend on the individual lecturer, so the first evaluation will need to take that into account. Ideally, the same experts that provided the training notes would analyse the output and judge the quality of the summary.

A second evaluation scenario will compare the marker overlapping analysis with neural networks or linear classification, again using the same lecturer. In this second evaluation, a metric could be used to quantify the quality: for example, how do the experts judge the quality of the different configurations, and how well does the automatic summariser agree with the expert annotations?
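Agreement with the expert annotations could, for instance, be measured as time-based overlap between the automatic summary and the expert-annotated segments. A minimal sketch, assuming both summaries are given as lists of half-open [start, end) intervals in seconds:

```java
import java.util.List;

/** A summary segment as a half-open interval [start, end) in seconds. */
record Segment(double start, double end) {
    double length() { return end - start; }
}

class OverlapF1 {
    /** Seconds shared by two interval lists (each assumed non-overlapping). */
    static double overlap(List<Segment> a, List<Segment> b) {
        double total = 0;
        for (Segment x : a) {
            for (Segment y : b) {
                total += Math.max(0, Math.min(x.end(), y.end())
                                     - Math.max(x.start(), y.start()));
            }
        }
        return total;
    }

    /** F1 score of the system summary against the expert segments. */
    static double f1(List<Segment> system, List<Segment> expert) {
        double shared = overlap(system, expert);
        double precision = shared / system.stream().mapToDouble(Segment::length).sum();
        double recall = shared / expert.stream().mapToDouble(Segment::length).sum();
        return 2 * precision * recall / (precision + recall);
    }
}
```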
Third and last, the tool needs to be tested with a) different lecturers and b) different topics. This is delegated to future work to improve the usefulness of the deployed tool.

A use case scenario could be that of a student studying for an exam who wants to watch the most significant parts of a lecture as preparation. This system could provide an extractive summary for the student: the student just has to upload the video, the pipeline builds a summary based on a configuration, and the student then downloads the shortened video.

However, since note-taking and summarisation are important skills to be learned by students, our system could also be integrated with academic writing support systems to facilitate learning to write good summaries.

7 Summary

We have outlined a research program to build and evaluate a pipeline for extracting multi-modal markers for meaning in lectures. The overall architecture has been described and the dependencies on existing tools outlined. Finally, possible evaluation methods have been described.

While implementation of the prototype has started, we are still in the early stages of realisation. The overall architecture has been defined, but, for example, which machine learning methods or tools are to be used primarily has not been finalised. This paper is in that sense a concept paper, but it is also an incremental update on our previously published work on the underlying theory. The next steps are implementing the pipeline as a whole and performing the evaluations outlined.

8 Acknowledgements

We would like to thank the two host institutions, the University of Hildesheim and RWTH Aachen University, for making cross-disciplinary student participation in this project possible. The first author is currently writing his master’s thesis on the project; the second author participated in the RWTH Undergraduate Research Opportunities Program (UROP) and wrote her bachelor’s thesis on the topic.

References

Françoise Boch and Annie Piolat. Note taking and learning: A summary of research. The WAC Journal, 16:101–113, 2005.

David Butt, Rebekah Wegener, and Jörg Cassens. Modelling behaviour semantically. In P. Brézillon, P. Blackburn, and R. Dapoigny, editors, Proceedings of CONTEXT 2013, number 8175 in LNCS, pages 343–349, Annecy, France, 2013. Springer.

Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. arXiv:1611.08050 [cs], November 2016.

John Cartmill, Alison Moore, David Butt, and Lyn Squire. Surgical teamwork: systemic functional linguistics and the analysis of verbal and non-verbal meaning in surgery. ANZ Journal of Surgery, pages 925–929, 2007.

Peter Coad, Eric Lefebvre, and Jeff De Luca. Java Modeling in Color with UML. Prentice Hall, Upper Saddle River, NJ, 1999.

M. A. K. Halliday and Christian Matthiessen. An Introduction to Functional Grammar (4th ed.). Routledge, London/New York, 2014.

Ruqaiya Hasan. Situation and the definition of genre. In Allen Grimshaw, editor, What’s Going On Here? Complementary Analysis of Professional Talk: Volume 2 of the Multiple Analysis Project. Ablex, Norwood, NJ, 1994.

Ruqaiya Hasan. Speaking with reference to context. In Mohsen Ghadessy, editor, Text and Context in Functional Linguistics. John Benjamins, Amsterdam, 1999.

Fairouz Hussein, Sari Awwad, and Massimo Piccardi. Joint action recognition and summarization by sub-modular inference. In ICASSP, 2016.

Anders Kofod-Petersen, Rebekah Wegener, and Jörg Cassens. Closed doors – modelling intention in behavioural interfaces. In Anders Kofod-Petersen, Helge Langseth, and Odd Erik Gundersen, editors, Proceedings of the Norwegian Artificial Intelligence Society Symposium (NAIS 2009), pages 93–102, Trondheim, Norway, November 2009. Tapir Akademiske Forlag.

Annabelle Lukin, Alison Moore, Maria Herke, Rebekah Wegener, and Canzhong Wu. Halliday’s model of register revisited and explored. Linguistics and the Human Sciences, 2011.

Ron Martinez, Svenja Adolphs, and Ronald Carter. Listening for needles in haystacks: how lecturers introduce key terms. ELT Journal, 67(3):313–323, 2013.

Sameer Maskey and Julia Hirschberg. Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. In Interspeech 2005, pages 621–624, 2005.

Sameer Maskey and Julia Hirschberg. Summarizing speech without text using hidden Markov models. In Proceedings of NAACL HLT, Companion Volume: Short Papers, pages 89–92. ACL, 2006.

Sylvain Meignier and Teva Merlin. LIUM SpkDiarization: An open source toolkit for diarization. In CMU SPUD Workshop, Dallas, Texas, 2010.

Bianca Schüller. Understanding the identification of important information in academic lectures: Linguistic and cognitive approaches. Bachelor thesis, RWTH Aachen University, 2018.

Rebekah Wegener and Jörg Cassens. Multi-modal markers for meaning: using behavioural, acoustic and textual cues for automatic, context dependent summarization of lectures. In J. Cassens, R. Wegener, and A. Kofod-Petersen, editors, Proceedings of the Eighth International Workshop on Modelling and Reasoning in Context, 2016.

Rebekah Wegener, Bianca Schüller, and Jörg Cassens. Needing and wanting in academic lectures: Profiling the academic lecture across context. In Phil Chappell and John S. Knox, editors, Transforming Contexts: Papers from the 44th International Systemic Functional Congress, Wollongong, Australia, 2017.

Rebekah Wegener. Instantiation in modelling multimodal communication: Challenges and proposals, part 1. Presentation at the European Systemic Functional Linguistics Conference, Salamanca, Spain, 2017.