<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Possible Optimisation Procedure for US and MRI Tongue Contours</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Réka Trencsényi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>László Czap</string-name>
          <email>czap@uni-miskolc.hu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Debrecen, Department of Electrical and Electronic Engineering</institution>
          ,
          <addr-line>Debrecen</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Miskolc, Institute of Automation and Infocommunication</institution>
          ,
          <addr-line>Miskolc</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <fpage>259</fpage>
      <lpage>269</lpage>
      <abstract>
        <p>The topic of this article is speech research. The main instruments of the study are US and MRI recordings of human speakers made during speech. In these dynamic recordings, primarily the motion of the tongue is analysed and tracked by automatic tongue contour tracking algorithms. The tongue contours are used to elaborate geometric transformations between US and MRI frames, which are the starting points for optimising the match between US and MRI tongue contours belonging to the same speech sound. As a result, the radial US geometry and the rectangular MRI geometry are embedded into each other in a biunique way.</p>
      </abstract>
      <kwd-group>
        <kwd>Data visualisation</kwd>
        <kwd>computational linguistics</kwd>
        <kwd>speech research</kwd>
        <kwd>dynamic US and MRI records</kwd>
        <kwd>automatic tongue contour tracking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        One of the fundamental tools of the study of speech production is the analysis of
dynamic records of human speakers, made by ultrasound (US) [
        <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
        ] and magnetic
resonance imaging (MRI) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] techniques. Investigating and processing these
two-dimensional records, created in the so-called sagittal plane and resulting in a side view
of the human body, relevant qualitative and quantitative information can be gained
about the main features of articulation. Qualitative statements mainly refer to the
relative position of the tongue and palate in the case of different speech sounds
and sound transitions, while quantitative descriptions focus on the recognition and
connection of the geometric parameters which have high importance in the
understanding of the relationships between the acoustic and articulatory characteristics
of speech. Quantitative analyses can be performed in several ways with a wide
variety [
        <xref ref-type="bibr" rid="ref3 ref5 ref6">3, 5, 6</xref>
        ]. The starting points of the investigations of our present study are
tongue contours fitted to the frames of US [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and MRI [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] records by automatic
algorithms [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The US and MRI sources used differ from each other in many
details, such as the gender and nationality of the speakers, the geometry, resolution,
and scale of the images, and the visually evaluable anatomic segments of the vocal
tract. The aim of our research work is to match the US and MRI sources by
elaborating, applying, and optimising the proper geometric transformations between
the US and MRI tongue contours in a biunique way.
      </p>
      <p>
        In the literature, several publications can be found that deal with the fusion
of information arising from sources produced by different imaging techniques. The
demand for automatic tongue contour tracking algorithms emerged even in the
previous decade [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], confirming the necessity of fully automated procedures like
our algorithm, which does not require any manual actions, as it is based on dynamic
programming [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Another benefit of our present results is that we have been working
with dynamic US and MRI records instead of exclusively static frames belonging
to sustained sounds [<xref ref-type="bibr" rid="ref1">1</xref>]. The US videos were made with the Micro system of the
MTA-ELTE Lendület Lingual Articulation Research Group of the Hungarian Academy of
Sciences, and the MRI videos, made by fast MRI, were downloaded from the website
of the University of Southern California. Studies have also appeared that aim to
perform transformations between coordinate systems connected to US and MRI
frames, relying on the optimisation of distances measured between special points
of the human head [<xref ref-type="bibr" rid="ref1">1</xref>]. In comparison with [<xref ref-type="bibr" rid="ref1">1</xref>], it must be emphasised that our
transformations relate directly to the tongue contours, and the transformation is
carried out in one step, without any intermediate coordinate system, so starting
from the US contour, one gets to the MRI contour immediately. Furthermore, the
optimisation procedure minimises the global distance between the linked US and
MRI tongue contours for more than one sound simultaneously.
      </p>
      <p>2. Transformation and Optimisation</p>
      <p>2.1. The Geometrical Considerations and Mathematical Formulas of the Transformations for Tongue Contours</p>
      <p>
        When writing the exact mathematical form of the transformation, we relied on the
special geometry of the available US records. Namely, the imaging US head scans
a radial region of the oral cavity which is seen at an angle of 90° measured
from a fixed centre O. Consequently, it is natural to treat the US images and the
points of the corresponding tongue contours in a polar coordinate system of origin O,
in which the position of each pixel is given unambiguously by the radius r measured from
point O and the signed angle φ measured from the central vertical axis of the image.
The aim of the transformation is to embed the radial geometry of
the US frames into the rectangular geometry of the MRI records, described by
two-dimensional Cartesian coordinates, so that the US and MRI tongue contours
assigned to the same sound overlap with each other as much as possible.
The transformation of the US tongue contours can include three basic operations:
the scaling of the radial range, the scaling of the angular range, and the translation
of the angular range. The three operations can be realised mathematically by the
formulas

        r′ = R · r,    φ′ = Φ · φ,    φ′₀ = φ₀ + Δφ₀,    (2.1)

        where the scale factors R and Φ allow the normalisation of the radial and
angular ranges, and the term Δφ₀ performs the translation of the initial angle φ₀ of the
angular range.
      </p>
      <p>
        The mathematical operations of (2.1) can be interpreted in the
physical plane of the images in the following way: by scaling the radial range, the
magnification of the tongue contours can be modified. The scaling of the angular
range creates the possibility to change the width of the angular range covered by
the tongue contours. The translation of the angular range means the rotation of
the tongue contours in the plane of the image. Thus, relationships (2.1) fit the US
tongue contour to the corresponding MRI frame. Applying the inverse of
transformations (2.1), however, the reverse conversion can also be executed, i.e., by dint of
the inverse operations

        r = r′ / R,    φ = φ′ / Φ,    φ₀ = φ′₀ − Δφ₀,    (2.2)

        the MRI tongue contour can be mapped onto the corresponding US frame. The
parameter set {R, Φ, Δφ₀} of the transformations performed in the directions
US-MRI and MRI-US must necessarily be the same since, thereby, the maintaining of
the relative scale ratio of the US and MRI environments can be ensured
independently of the direction of the conversion. During the investigations, we fixed the
value of factor Φ by Φ = 1, which means that the transformation is conformal.</p>
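      <p>The forward and inverse operations of (2.1) and (2.2) can be sketched per contour point as follows. This is a minimal illustration in Python (the article's own scripts, in the appendix, are MATLAB); the parameter values are hypothetical, and the angular translation is folded into the per-point angle with the sign convention of the forward appendix script.</p>

```python
import math

# Hypothetical parameter values: in the article, R and the angular
# translation (FIKORR in the appendix scripts) are found by optimisation,
# and the angular scale factor is fixed at 1 (a conformal transformation).
R = 1.8
PHI = 1.0
dphi0 = math.radians(12.0)

def us_to_mri(r, phi):
    """Forward transformation in the spirit of (2.1): scale the radius,
    scale the polar angle, and shift the angle by dphi0."""
    return R * r, PHI * phi - dphi0

def mri_to_us(r_p, phi_p):
    """Inverse transformation in the spirit of (2.2), using the same
    parameter set, so the mapping is biunique."""
    return r_p / R, (phi_p + dphi0) / PHI

# A US contour point in polar form: radius in pixels and signed angle
# measured from the central vertical axis of the image.
r, phi = 92.5, math.radians(-35.0)
r_p, phi_p = us_to_mri(r, phi)          # point mapped into the MRI frame
r_back, phi_back = mri_to_us(r_p, phi_p)  # round trip recovers the original
```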
      <p>
        Transformations (2.1) and (2.2) become valid by the numerical determination
of the parameters R and Δφ₀, to which the optimisation of the values of the
parameters offers a possible way. During the optimisation procedure, using an algorithm
elaborated by us, we find the parameter set for which the distance
between the transformed US tongue contour and the MRI tongue contour serving as
a reference curve is minimal. The calculation of the distance is carried out for all
of the possible pairs of points of the two curves, and then the average of the smallest
distances assigned to each point of the US tongue contour is minimised [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. For a
successful transformation, however, not only the exact values of the parameters
R and Δφ₀ are needed; the centre O′ designated in the MRI frame, which is the
image of the centre O of the US record, must also be known. Beyond these, during
the construction of the optimisation algorithm, the peak of the epiglottis can also
serve as a good reference point, as demonstrated by Figure 1, where the peaks of the
epiglottis E and E′ are marked by green circles in the US and MRI frames, and
the centres O and O′ are located by red crosses.
      </p>
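      <p>The distance used as the optimisation criterion can be sketched as follows: a minimal Python illustration (the article's scripts are in MATLAB), assuming contours given as plain lists of (x, y) points; the sample curves are made up.</p>

```python
import math

def contour_distance(curve_a, curve_b):
    """Average, over the points of curve_a, of the smallest distance to
    any point of curve_b: the quantity minimised by the optimisation
    procedure (cf. [12]). Curves are lists of (x, y) points."""
    total = 0.0
    for ax, ay in curve_a:
        total += min(math.hypot(ax - bx, ay - by) for bx, by in curve_b)
    return total / len(curve_a)

# Two short illustrative contours: the second is the first shifted by one
# pixel vertically, so each smallest distance, and hence the average, is 1.
us_curve = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
mri_curve = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
d = contour_distance(us_curve, mri_curve)
```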
      <p>
        The meaning of the parameter set {R, Δφ₀, O′, E, E′} can be understood by
the geometrical considerations of Figure 2, where the left-side block depicts the
points of the US frames (O and E), while the right-side block carries the points
of the MRI frames (O′ and E′), in agreement with Figure 1. The radial distances
r₁ and r₂ are measured between the centre of the images and the peaks of the
epiglottis. The polar angles φ₁ and φ₂ are made by the central vertical axis of the
image and the radii r₁ and r₂. Using these quantities, the parameter R is interpreted as
a magnification factor by

        R = r₂ / r₁,    (2.3)

        and the parameter Δφ₀ is produced by

        Δφ₀ = φ₁ + φ₂,    (2.4)

        which is actually a difference, since φ₁ is negative and φ₂ is positive, as the polar
angle is related to the vertical direction in both frames.
      </p>
      <p>2.2. The Optimisation Procedure for Minimising the Distance Between Tongue Contours</p>
      <p>
Proceeding along the geometrical features of Figure 2, we created an
optimisation algorithm in MATLAB via mathematical formulas which enable the
simultaneous optimisation of the parameters {R, Δφ₀, O′, E, E′}. Some details of
the MATLAB scripts can be found in the appendix, where the coordinate pairs of
O′, E, and E′ appear among the quantities to be optimised, while FIKORR stands
for Δφ₀. The construction of the mathematical formulas for r₁, r₂ and φ₁, φ₂
follows the geometrical structure of Figure 2. As explained in (2.3) and (2.4),
the scale factor R is the ratio of the radial distances between the centre and the
peak of the epiglottis, measured in the US and MRI frames separately, and the
angular displacement Δφ₀ is given as the sum of the signed polar angles of the
US and MRI frames, made by a radial section and the central vertical axis of the
image. In the MATLAB scripts, cu1 denotes the actual US or MRI curve, whose
points with complex coordinates are to be transformed, uh1 and mri1 give the US
and MRI curves to be compared, and the computed distance value provides the average of the
smallest distances of all of the possible pairs of points of the two tongue contours.
In the appendix, the codes are written only for one of the contours belonging to
the investigated speech sounds, labelled by index 1; to have the complete
script, these program blocks must be repeated with the same structure for the other
examined sounds, with indices 2, 3, 4, ..., as well.</p>
      <p>2.2.1. Results</p>
      <p>We fulfilled the optimisation of the parameters {R, Δφ₀, O′, E, E′} for two speech
sounds, k and t, simultaneously, and we obtained the numerical results referred to
as (2.5). Based on the values in (2.5), it can be observed in Figure 3 that the optimised
positions of the peaks of the epiglottis are very close to the static position taken when
the voice box is at rest, without speaking. In addition, the centre of the MRI frame
is located under the jaw of the speaker.</p>
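      <p>How the parameters R and Δφ₀ follow from the centre and epiglottis points via (2.3) and (2.4) can be sketched as below. All pixel coordinates here are hypothetical (only the US centre loosely echoes a = [471, 335] from the appendix); image y coordinates are taken to grow downwards, as is usual for pixel grids.</p>

```python
import math

# Hypothetical pixel coordinates: centre O and epiglottis peak E in the
# US frame, centre O' and epiglottis peak E' in the MRI frame.
O, E = (471.0, 335.0), (430.0, 210.0)
Op, Ep = (320.0, 470.0), (380.0, 260.0)

def polar(centre, point):
    """Radius and signed polar angle measured from the central vertical
    axis of the image (image y grows downwards)."""
    dx = point[0] - centre[0]
    dy = centre[1] - point[1]
    return math.hypot(dx, dy), math.atan2(dx, dy)

r1, phi1 = polar(O, E)      # US frame: r1 and phi1 (phi1 negative here)
r2, phi2 = polar(Op, Ep)    # MRI frame: r2 and phi2 (phi2 positive here)

R = r2 / r1                 # magnification factor, equation (2.3)
dphi0 = phi1 + phi2         # angular translation, equation (2.4)
```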
      <p>Using the parameters of (2.5), the transformation of the US and MRI tongue
contours can be implemented in a bidirectional way according to Figure 4, where
the tongue contours belonging to sound k can be seen. The green curves stand
for the US, while the red curves represent the MRI tongue contours. In the MRI
frame, the contour of the palate is also drawn, by the yellow curve. The figures
clearly show that the US and MRI tongue contours fit each other in an acceptable
way, since it is also demonstrated visually that the minimisation of the distance by the
optimisation algorithm detailed in the appendix lets the two curves mostly overlap
with each other. At this level, this is a sufficient criterion for the acceptance of
the results, without any specified lower or upper limit for the distance between
the two curves, because we aimed to find the relative position of the two tongue
contours which corresponds to the minimal distance ensured by parameters (2.5).
So, graphically, a realistic matching is obtained, and only this was expected. For
instance, if the transformed tongue contour were out of the region of the oral cavity
represented by the US or MRI image, or were placed visually at an unrealistic distance
from the reference curve, then the optimisation would surely be false.</p>
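      <p>The logic of the simultaneous optimisation can be sketched as a coarse grid search in Python. This is an illustrative stand-in only: the article's actual procedure runs in MATLAB and also optimises the positions O′, E, and E′, whereas the sketch below searches only over R and Δφ₀ with the angular scale factor fixed at 1; all helper names and sample data are made up.</p>

```python
import math

def transform(points, R, dphi0):
    """Map polar US points (r, phi) to Cartesian coordinates with the
    conformal version of (2.1), the angular scale factor fixed at 1."""
    return [(R * r * math.sin(phi - dphi0),
             R * r * math.cos(phi - dphi0)) for r, phi in points]

def contour_distance(a, b):
    """Average of the smallest point-pair distances, as in Section 2.2."""
    return sum(min(math.hypot(ax - bx, ay - by) for bx, by in b)
               for ax, ay in a) / len(a)

def optimise(us_sounds, mri_sounds, R_grid, dphi_grid):
    """Grid search for the (R, dphi0) pair minimising the total distance
    over all sounds simultaneously."""
    def total_cost(params):
        R, dphi0 = params
        return sum(contour_distance(transform(us, R, dphi0), mri)
                   for us, mri in zip(us_sounds, mri_sounds))
    candidates = [(R, d) for R in R_grid for d in dphi_grid]
    return min(candidates, key=total_cost)

# Synthetic check: build an "MRI" contour from a "US" contour with known
# parameters, then recover those parameters by the search.
us = [(1.0, 0.1), (1.2, 0.3), (0.9, -0.2)]
mri = transform(us, 1.5, 0.2)
best = optimise([us], [mri], [1.0, 1.5, 2.0], [0.0, 0.2, 0.4])
```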
      <sec id="sec-1-1">
        <title>2.2.2. Validation</title>
        <p>In order to verify the results, we also checked the projection onto each other
of US-MRI pairs of contours which were not present in the set of sounds k and t,
keeping the setting of parameters (2.5) provided by the optimisation. Accordingly,
Figure 5 exemplifies our results in the case of such a sound. It can be stated that
the matching of the US and MRI tongue contours is approximately as good as in
the case of sounds k and t.</p>
        <p>Further validation of the results is also currently in progress. To gain more
experience about the harmonisation of US and MRI geometry and to improve the
fitting of the tongue contours, we aim to develop our research work in several
directions. We wish to extend the optimisation procedure to more than two sounds
to understand the connection between the result of the optimisation and the
number of speech sounds. We would also like to investigate the optimisation as a
function of different sound contexts and of speakers of different nationalities and
genders. Furthermore, advancing to a large number of speech sounds, we plan to
involve machine learning algorithms as well.</p>
        <p>Acknowledgements. We would like to thank the MTA-ELTE Lendület Lingual
Articulation Research Group for providing the recordings with the Micro system.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Appendix</title>
      <sec id="sec-2-1">
        <title>Program code for the optimisation</title>
        <p>Program code for the transformation of the US tongue contour to the
MRI frame
for k = 1 : lcu1
xtr1(k) = R * r_UH1(k) * sin(FI * firad_UH1(k) - FIKORR);
ytr1(k) = R * r_UH1(k) * cos(FI * firad_UH1(k) - FIKORR);
pxtr1(k) = b(2) + xtr1(k);
pytr1(k) = b(1) - ytr1(k);
end
uh1 = 340 - pxtr1 + 1i * pytr1;
Program code for the transformation of the MRI tongue contour to the
US frame
a = [471, 335];
b = [n1, 340 - n2];
clear px1 py1 x1 y1 firad_MRI1 fideg_MRI1 r_MRI1 xtr1 ytr1 pxtr1 pytr1
for k = 1 : lcu1
px1(k) = real(cu1(k));
py1(k) = imag(cu1(k));
x1(k) = px1(k) - b(2);
y1(k) = b(1) - py1(k);
firad_MRI1(k) = atan(x1(k)/y1(k));
fideg_MRI1(k) = 180 * atan(x1(k)/y1(k))/pi;
r_MRI1(k) = y1(k)/cos(firad_MRI1(k));
end
for k = 1 : lcu1
xtr1(k) = r_MRI1(k)/R * sin((firad_MRI1(k) - FIKORR)/FI);
ytr1(k) = r_MRI1(k)/R * cos((firad_MRI1(k) - FIKORR)/FI);
pxtr1(k) = a(2) + xtr1(k);
pytr1(k) = a(1) - ytr1(k);
end
mri1 = 670 - pxtr1 + 1i * pytr1;
Program code for the calculation of the distances between US and MRI
tongue contours</p>
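        <p>The polar decomposition step of the MRI-to-US script above can be expressed in Python as a quick check. The values of n1, n2 and the sample point are hypothetical; 340 mirrors the frame-size constant used in the scripts.</p>

```python
import math

# Centre of the radial geometry, mirroring b = [n1, 340 - n2] in the
# MATLAB script; n1 and n2 are hypothetical here.
n1, n2 = 300.0, 120.0
b = (n1, 340.0 - n2)        # b(1), b(2) in MATLAB indexing

cu = complex(250.0, 180.0)  # a contour point with complex coordinates, like cu1

px, py = cu.real, cu.imag
x = px - b[1]               # horizontal offset from the centre
y = b[0] - py               # vertical offset (image y grows downwards)
firad = math.atan(x / y)    # signed angle from the central vertical axis
fideg = 180.0 * firad / math.pi
r = y / math.cos(firad)     # radius; equals hypot(x, y) for angles below 90 degrees
```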
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>M.</given-names> <surname>Aron</surname></string-name>,
          <string-name><given-names>M.-O.</given-names> <surname>Berger</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Kerrien</surname></string-name>:
          <article-title>Multimodal fusion of electromagnetic, ultrasound and MRI data for building an articulatory model</article-title>,
          <source>in: 8th International Seminar on Speech Production - ISSP'08</source>, Strasbourg, France, Dec.
          <year>2008</year>
          , inria-00326290, url: https://hal.inria.fr/inria-00326290/document.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>J.</given-names> <surname>Cleland</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Wrench</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Scobbie</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Semple</surname></string-name>:
          <article-title>Comparing articulatory images: An MRI/Ultrasound Tongue Image database</article-title>,
          <source>in: Proceedings of the 9th International Seminar on Speech Production</source>,
          <year>2011</year>
          , pp.
          <fpage>163</fpage>
          -
          <lpage>170</lpage>
          , url: https://eresearch.qmu.ac.uk/handle/20.500.12289/2477.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>S. G.</given-names> <surname>Danner</surname></string-name>,
          <string-name><given-names>A. V.</given-names> <surname>Barbosa</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Goldstein</surname></string-name>:
          <article-title>Quantitative analysis of multimodal speech data</article-title>,
          <source>Journal of Phonetics</source>
          <volume>71</volume>
          (
          <year>2018</year>
          ), pp.
          <fpage>268</fpage>
          -
          <lpage>283</lpage>
          , doi: https://doi.org/10.1016/j.wocn.2018.09.007.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>B.</given-names> <surname>Denby</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Stone</surname></string-name>:
          <article-title>Speech synthesis from real time ultrasound images of the tongue</article-title>,
          <source>in: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing</source>, vol.
          <volume>1</volume>
          ,
          <year>2004</year>
          , pp. I-685, doi: https://doi.org/10.1109/ICASSP.2004.1326078.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>L.</given-names> <surname>Fulcher</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Lodermeyer</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Kahler</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Becker</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Kniesburges</surname></string-name>:
          <article-title>Geometry of the vocal tract and properties of phonation near threshold: calculations and measurements</article-title>,
          <source>Applied Sciences 9.13</source>
          (
          <year>2019</year>
          ), 2755, doi: https://doi.org/10.3390/app9132755.
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ojalammi</surname>
          </string-name>
          , J. Malinen:
          <article-title>Automated segmentation of upper airways from MRI-vocal tract geometry extraction</article-title>
          ,
          <source>International Conference on Bioimaging 3</source>
          (
          <year>2017</year>
          ), pp.
          <fpage>77</fpage>
          -
          <lpage>84</lpage>
          , doi: https://doi.org/10.5220/0006138300770084.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>A. D.</given-names> <surname>Scott</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Wylezinska</surname></string-name>,
          <string-name><given-names>M. J.</given-names> <surname>Birch</surname></string-name>,
          <string-name><given-names>M. E.</given-names> <surname>Miquel</surname></string-name>:
          <article-title>Speech MRI: morphology and function</article-title>,
          <source>Physica Medica 30.6</source>
          (
          <year>2014</year>
          ), pp.
          <fpage>604</fpage>
          -
          <lpage>618</lpage>
          , doi: https://doi.org/10.1016/j.ejmp.2014.05.001.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <article-title>SPAN | Speech Production and Articulation Knowledge Group: the rtMRI IPA chart (John Esling)</article-title>
          , accessed 9 May 2020, url: https://sail.usc.edu/span/rtmri_ipa/je_2015.html.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stone</surname>
          </string-name>
          :
          <article-title>A guide to analysing tongue motion from ultrasound images</article-title>
          ,
          <source>Clinical Linguistics and Phonetics</source>
          <volume>19</volume>
          .
          <fpage>6</fpage>
          -
          <lpage>7</lpage>
          (
          <year>2005</year>
          ), pp.
          <fpage>455</fpage>
          -
          <lpage>501</lpage>
          , doi: https://doi.org/10.1080/02699200500113558.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Csapó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Roussel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Denby</surname>
          </string-name>
          :
          <article-title>A comparative study on the contour tracking algorithms in ultrasound tongue images with automatic re-initialization</article-title>
          ,
          <source>Journal of the Acoustical Society of America 139.5</source>
          (
          <issue>2016</issue>
          ),
          <fpage>EL154</fpage>
          -
          <lpage>EL160</lpage>
          , doi: https://doi.org/10.1121/1.4951024.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>L.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Czap</surname></string-name>:
          <article-title>Automatic tracking of tongue contours in ultrasound records</article-title>,
          <source>Beszédtudomány - Speech Science 27.1</source>
          (
          <year>2019</year>
          ), pp.
          <fpage>331</fpage>
          -
          <lpage>343</lpage>
          , doi: https://doi.org/10.15775/Beszkut.2019.331-343.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>N.</given-names> <surname>Zharkova</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Hewlett</surname></string-name>:
          <article-title>Measuring lingual coarticulation from midsagittal tongue contours: Description and example calculations using English /t/ and /a/</article-title>,
          <source>Journal of Phonetics 37.2</source>
          (
          <year>2009</year>
          ), pp.
          <fpage>248</fpage>
          -
          <lpage>256</lpage>
          , doi: https://doi.org/10.1016/j.wocn.2008.10.005.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>