<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Possible Optimisation Procedure for US and MRI Tongue Contours</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Réka Trencsényi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>László Czap</string-name>
          <email>czap@uni-miskolc.hu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Debrecen, Department of Electrical and Electronic Engineering</institution>
          ,
          <addr-line>Debrecen</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Miskolc, Institute of Automation and Infocommunication</institution>
          ,
          <addr-line>Miskolc</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <fpage>259</fpage>
      <lpage>269</lpage>
      <abstract>
        <p>The topic of this article is speech research. The main instruments of the study are US and MRI recordings of human speakers made during speech. In these dynamic recordings, primarily the motion of the tongue is analysed and tracked by automatic tongue contour tracking algorithms. The tongue contours are used to elaborate geometric transformations between US and MRI frames, which are the starting points for optimising the match between US and MRI tongue contours belonging to the same speech sound. As a result, the radial US geometry and the rectangular MRI geometry are embedded into each other in a biunique way.</p>
      </abstract>
      <kwd-group>
        <kwd>Data visualisation</kwd>
        <kwd>computational linguistics</kwd>
        <kwd>speech research</kwd>
        <kwd>dynamic US and MRI records</kwd>
        <kwd>automatic tongue contour tracking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        One of the fundamental tools of the study of speech production is the analysis of
dynamic records of human speakers, made by ultrasound (US) [
        <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
        ] and magnetic
resonance imaging (MRI) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] techniques. Investigating and processing these
two-dimensional records, created in the so-called sagittal plane and resulting in a side view
of the human body, relevant qualitative and quantitative information can be gained
about the main features of articulation. Qualitative statements mainly refer to the
relative position of the tongue and palate in the case of different speech sounds
and sound transitions, while quantitative descriptions focus on the recognition and
connection of the geometric parameters which have high importance in the
understanding of the relationships between the acoustic and articulatory characteristics
of speech. Quantitative analyses can be performed in several ways with a wide
variety [
        <xref ref-type="bibr" rid="ref3 ref5 ref6">3, 5, 6</xref>
        ]. The starting points of the investigations of our present study are
tongue contours fitted to the frames of US [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and MRI [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] records by automatic
algorithms [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The US and MRI sources used differ from each other in many
details, such as the gender and nationality of the speakers, the geometry, resolution,
and scale of the images, and the visually evaluable anatomic segments of the vocal
tract. The aim of our research work is to match the US and MRI sources by
elaborating, applying, and optimising the proper geometric transformations between
the US and MRI tongue contours in a biunique way.
      </p>
      <p>
        In the literature, several publications can be found that deal with the fusion
of information arising from sources produced by different imaging techniques. The
demand for automatic tongue contour tracking algorithms emerged even in the
previous decade [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], confirming the necessity of fully automated procedures like
our algorithm, which does not require any manual actions, as it is based on dynamic
programming [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Another benefit of our present results is that we have been working
with dynamic US and MRI records instead of exclusively static frames belonging
to sustained sounds [<xref ref-type="bibr" rid="ref1">1</xref>]. The US videos were made with the Micro system of the
MTA-ELTE Lendület Lingual Articulation Research Group of the Hungarian Academy of
Sciences, and the MRI videos, made by fast MRI, were downloaded from the website
of the University of Southern California. Studies have also appeared that aim to
perform transformations between coordinate systems connected to US and MRI
frames, relying on the optimisation of distances measured between special points
of the human head [<xref ref-type="bibr" rid="ref1">1</xref>]. In comparison with [<xref ref-type="bibr" rid="ref1">1</xref>], it must be emphasised that our
transformations relate directly to the tongue contours, and the transformation is
carried out in one step, without any intermediate coordinate system, so starting
from the US contour, one gets to the MRI contour immediately. Furthermore, the
optimisation procedure minimises the global distance between the linked US and
MRI tongue contours for more than one sound simultaneously.
      </p>
      <p>2. Transformation and Optimisation</p>
      <p>2.1. The Geometrical Considerations and Mathematical Formulas of the Transformations for Tongue Contours</p>
      <p>
        When writing the exact mathematical form of the transformation, we relied on the
special geometry of the available US records. Namely, the imaging US head scans
a radial region of the oral cavity which is seen at an angle of 90° measured
from a fixed centre O. Consequently, it is natural to treat the US images and the
points of the corresponding tongue contours in a polar coordinate system of origin O,
in which the position of each pixel is given unambiguously by the radius r measured from
point O and the signed angle φ measured from the central vertical axis of the image.
The aim of the transformation is to embed the radial geometry of
the US frames into the rectangular geometry of the MRI records, described by
two-dimensional Cartesian coordinates, so that the US and MRI tongue contours
assigned to the same sound overlap with each other as much as possible.
The transformation of the US tongue contours can include three basic operations:
the scaling of the radial range, the scaling of the angular range, and the translation
of the angular range. The three operations can be realised mathematically by the
formulas

        r′ = R · r,    φ′ = Φ · φ,    φ′₀ = φ₀ + Δφ₀,    (2.1)

        where the scale factors R and Φ allow the normalisation of the radial and
angular ranges, and the term Δφ₀ performs the translation of the initial angle φ₀ of the
angular range.
      </p>
      <p>
        The mathematical operations of (2.1) can be interpreted in the
physical plane of the images in the following way: by scaling the radial range, the
magnification of the tongue contours can be modified. The scaling of the angular
range creates the possibility to change the width of the angular range covered by
the tongue contours. The translation of the angular range means the rotation of
the tongue contours in the plane of the image. Thus, relationships (2.1) fit the US
tongue contour to the corresponding MRI frame. Applying the inverse of
transformations (2.1), however, the reverse conversion can also be executed, i.e., by dint of
the inverse operations

        r = r′ / R,    φ = φ′ / Φ,    φ₀ = φ′₀ − Δφ₀,    (2.2)

        the MRI tongue contour can be mapped onto the corresponding US frame. The
parameter set {R, Φ, Δφ₀} of the transformations performed in the directions
US-MRI and MRI-US must necessarily be the same since, thereby, the maintaining of
the relative scale ratio of the US and MRI environments can be ensured
independently of the direction of the conversion. During the investigations, we fixed the
value of factor Φ by Φ = 1, which means that the transformation is conformal.</p>
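      <p>The forward and inverse operations of (2.1) and (2.2) can be sketched per contour point as follows. This is a minimal illustration in Python (the article's own scripts, in the appendix, are MATLAB); the parameter values are hypothetical, and the angular translation is folded into the per-point angle with the sign convention of the forward appendix script.</p>

```python
import math

# Hypothetical parameter values: in the article, R and the angular
# translation (FIKORR in the appendix scripts) are found by optimisation,
# and the angular scale factor is fixed at 1 (a conformal transformation).
R = 1.8
PHI = 1.0
dphi0 = math.radians(12.0)

def us_to_mri(r, phi):
    """Forward transformation in the spirit of (2.1): scale the radius,
    scale the polar angle, and shift the angle by dphi0."""
    return R * r, PHI * phi - dphi0

def mri_to_us(r_p, phi_p):
    """Inverse transformation in the spirit of (2.2), using the same
    parameter set, so the mapping is biunique."""
    return r_p / R, (phi_p + dphi0) / PHI

# A US contour point in polar form: radius in pixels and signed angle
# measured from the central vertical axis of the image.
r, phi = 92.5, math.radians(-35.0)
r_p, phi_p = us_to_mri(r, phi)          # point mapped into the MRI frame
r_back, phi_back = mri_to_us(r_p, phi_p)  # round trip recovers the original
```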
      <p>
        Transformations (2.1) and (2.2) become valid by the numerical determination
of the parameters R and Δφ₀, to which the optimisation of the values of the
parameters offers a possible way. During the optimisation procedure, using an algorithm
elaborated by us, we find the parameter set for which the distance
between the transformed US tongue contour and the MRI tongue contour serving as
a reference curve is minimal. The calculation of the distance is carried out for all
of the possible pairs of points of the two curves, and then the average of the smallest
distances assigned to each point of the US tongue contour is minimised [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. For a
successful transformation, however, not only the exact values of the parameters
R and Δφ₀ are needed; the centre O′ designated in the MRI frame, which is the
image of the centre O of the US record, must also be known. Beyond these, during
the construction of the optimisation algorithm, the peak of the epiglottis can also
serve as a good reference point, as demonstrated by Figure 1, where the peaks of the
epiglottis E and E′ are marked by green circles in the US and MRI frames, and
the centres O and O′ are located by red crosses.
      </p>
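      <p>The distance used as the optimisation criterion can be sketched as follows: a minimal Python illustration (the article's scripts are in MATLAB), assuming contours given as plain lists of (x, y) points; the sample curves are made up.</p>

```python
import math

def contour_distance(curve_a, curve_b):
    """Average, over the points of curve_a, of the smallest distance to
    any point of curve_b: the quantity minimised by the optimisation
    procedure (cf. [12]). Curves are lists of (x, y) points."""
    total = 0.0
    for ax, ay in curve_a:
        total += min(math.hypot(ax - bx, ay - by) for bx, by in curve_b)
    return total / len(curve_a)

# Two short illustrative contours: the second is the first shifted by one
# pixel vertically, so each smallest distance, and hence the average, is 1.
us_curve = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
mri_curve = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
d = contour_distance(us_curve, mri_curve)
```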
      <p>
        The meaning of the parameter set {R, Δφ₀, O′, E, E′} can be understood by
the geometrical considerations of Figure 2, where the left-side block depicts the
points of the US frames (O and E), while the right-side block carries the points
of the MRI frames (O′ and E′), in agreement with Figure 1. The radial distances
r₁ and r₂ are measured between the centre of the images and the peaks of the
epiglottis. The polar angles φ₁ and φ₂ are made by the central vertical axis of the
image and the radii r₁ and r₂. Using these quantities, the parameter R is interpreted as
a magnification factor by

        R = r₂ / r₁,    (2.3)

        and the parameter Δφ₀ is produced by

        Δφ₀ = φ₁ + φ₂,    (2.4)

        which is actually a difference, since φ₁ is negative and φ₂ is positive, as the polar
angle is related to the vertical direction in both frames.
      </p>
      <p>2.2. The Optimisation Procedure for Minimising the Distance Between Tongue Contours</p>
      <p>
Proceeding along the geometrical features of Figure 2, we created an
optimisation algorithm in MATLAB via mathematical formulas which enable the
simultaneous optimisation of the parameters {R, Δφ₀, O′, E, E′}. Some details of
the MATLAB scripts can be found in the appendix, where the coordinate pairs of
O′, E, and E′ appear among the quantities to be optimised, while FIKORR stands
for Δφ₀. The construction of the mathematical formulas for r₁, r₂ and φ₁, φ₂
follows the geometrical structure of Figure 2. As explained in (2.3) and (2.4),
the scale factor R is the ratio of the radial distances between the centre and the
peak of the epiglottis, measured in the US and MRI frames separately, and the
angular displacement Δφ₀ is given as the sum of the signed polar angles of the
US and MRI frames, made by a radial section and the central vertical axis of the
image. In the MATLAB scripts, cu1 denotes the actual US or MRI curve, whose
points with complex coordinates are to be transformed, uh1 and mri1 give the US
and MRI curves to be compared, and the computed distance value provides the average of the
smallest distances of all of the possible pairs of points of the two tongue contours.
In the appendix, the codes are written only for one of the contours belonging to
the investigated speech sounds, labelled by index 1; to have the complete
script, these program blocks must be repeated with the same structure for the other
examined sounds, with indices 2, 3, 4, ..., as well.</p>
      <p>2.2.1. Results</p>
      <p>We fulfilled the optimisation of the parameters {R, Δφ₀, O′, E, E′} for two speech
sounds, k and t, simultaneously, and we obtained the numerical results referred to
as (2.5). Based on the values in (2.5), it can be observed in Figure 3 that the optimised
positions of the peaks of the epiglottis are very close to the static position taken when
the voice box is at rest, without speaking. In addition, the centre of the MRI frame
is located under the jaw of the speaker.</p>
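      <p>How the parameters R and Δφ₀ follow from the centre and epiglottis points via (2.3) and (2.4) can be sketched as below. All pixel coordinates here are hypothetical (only the US centre loosely echoes a = [471, 335] from the appendix); image y coordinates are taken to grow downwards, as is usual for pixel grids.</p>

```python
import math

# Hypothetical pixel coordinates: centre O and epiglottis peak E in the
# US frame, centre O' and epiglottis peak E' in the MRI frame.
O, E = (471.0, 335.0), (430.0, 210.0)
Op, Ep = (320.0, 470.0), (380.0, 260.0)

def polar(centre, point):
    """Radius and signed polar angle measured from the central vertical
    axis of the image (image y grows downwards)."""
    dx = point[0] - centre[0]
    dy = centre[1] - point[1]
    return math.hypot(dx, dy), math.atan2(dx, dy)

r1, phi1 = polar(O, E)      # US frame: r1 and phi1 (phi1 negative here)
r2, phi2 = polar(Op, Ep)    # MRI frame: r2 and phi2 (phi2 positive here)

R = r2 / r1                 # magnification factor, equation (2.3)
dphi0 = phi1 + phi2         # angular translation, equation (2.4)
```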
      <p>Using the parameters of (2.5), the transformation of the US and MRI tongue
contours can be implemented in a bidirectional way according to Figure 4, where
the tongue contours belonging to sound k can be seen. The green curves stand
for the US, while the red curves represent the MRI tongue contours. In the MRI
frame, the contour of the palate is also drawn, by the yellow curve. The figures
clearly show that the US and MRI tongue contours fit each other in an acceptable
way, since it is also demonstrated visually that the minimisation of the distance by the
optimisation algorithm detailed in the appendix lets the two curves mostly overlap
with each other. At this level, this is a sufficient criterion for the acceptance of
the results, without any specified lower or upper limit for the distance between
the two curves, because we aimed to find the relative position of the two tongue
contours which corresponds to the minimal distance ensured by parameters (2.5).
So, graphically, a realistic matching is obtained, and only this was expected. For
instance, if the transformed tongue contour were out of the region of the oral cavity
represented by the US or MRI image, or were placed visually at an unrealistic distance
from the reference curve, then the optimisation would surely be false.</p>
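      <p>The logic of the simultaneous optimisation can be sketched as a coarse grid search in Python. This is an illustrative stand-in only: the article's actual procedure runs in MATLAB and also optimises the positions O′, E, and E′, whereas the sketch below searches only over R and Δφ₀ with the angular scale factor fixed at 1; all helper names and sample data are made up.</p>

```python
import math

def transform(points, R, dphi0):
    """Map polar US points (r, phi) to Cartesian coordinates with the
    conformal version of (2.1), the angular scale factor fixed at 1."""
    return [(R * r * math.sin(phi - dphi0),
             R * r * math.cos(phi - dphi0)) for r, phi in points]

def contour_distance(a, b):
    """Average of the smallest point-pair distances, as in Section 2.2."""
    return sum(min(math.hypot(ax - bx, ay - by) for bx, by in b)
               for ax, ay in a) / len(a)

def optimise(us_sounds, mri_sounds, R_grid, dphi_grid):
    """Grid search for the (R, dphi0) pair minimising the total distance
    over all sounds simultaneously."""
    def total_cost(params):
        R, dphi0 = params
        return sum(contour_distance(transform(us, R, dphi0), mri)
                   for us, mri in zip(us_sounds, mri_sounds))
    candidates = [(R, d) for R in R_grid for d in dphi_grid]
    return min(candidates, key=total_cost)

# Synthetic check: build an "MRI" contour from a "US" contour with known
# parameters, then recover those parameters by the search.
us = [(1.0, 0.1), (1.2, 0.3), (0.9, -0.2)]
mri = transform(us, 1.5, 0.2)
best = optimise([us], [mri], [1.0, 1.5, 2.0], [0.0, 0.2, 0.4])
```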
      <sec id="sec-1-1">
        <title>2.2.2. Validation</title>
        <p>In order to verify the results, we also checked the projection onto each other
of US-MRI pairs of contours which were not present in the set of sounds k and t,
keeping the setting of parameters (2.5) provided by the optimisation. Accordingly,
Figure 5 exemplifies our results in the case of such a sound. It can be stated that
the matching of the US and MRI tongue contours is approximately as good as in
the case of sounds k and t.</p>
        <p>Further validation of the results is also currently in progress. To gain more
experience about the harmonisation of US and MRI geometry and to improve the
fitting of the tongue contours, we aim to develop our research work in several
directions. We wish to extend the optimisation procedure to more than two sounds
to understand the connection between the result of the optimisation and the
number of speech sounds. We would also like to investigate the optimisation as a
function of different sound contexts and of speakers of different nationalities and
genders. Furthermore, advancing to a large number of speech sounds, we plan to
involve machine learning algorithms as well.</p>
        <p>Acknowledgements. We would like to thank the MTA-ELTE Lendület Lingual
Articulation Research Group for providing the recordings with the Micro system.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Appendix</title>
      <sec id="sec-2-1">
        <title>Program code for the optimisation</title>
        <p>Program code for the transformation of the US tongue contour to the
MRI frame
for k = 1 : lcu1
xtr1(k) = R * r_UH1(k) * sin(FI * firad_UH1(k) - FIKORR);
ytr1(k) = R * r_UH1(k) * cos(FI * firad_UH1(k) - FIKORR);
pxtr1(k) = b(2) + xtr1(k);
pytr1(k) = b(1) - ytr1(k);
end
uh1 = 340 - pxtr1 + 1i * pytr1;
Program code for the transformation of the MRI tongue contour to the
US frame
a = [471, 335];
b = [n1, 340 - n2];
clear px1 py1 x1 y1 firad_MRI1 fideg_MRI1 r_MRI1 xtr1 ytr1 pxtr1 pytr1
for k = 1 : lcu1
px1(k) = real(cu1(k));
py1(k) = imag(cu1(k));
x1(k) = px1(k) - b(2);
y1(k) = b(1) - py1(k);
firad_MRI1(k) = atan(x1(k)/y1(k));
fideg_MRI1(k) = 180 * atan(x1(k)/y1(k))/pi;
r_MRI1(k) = y1(k)/cos(firad_MRI1(k));
end
for k = 1 : lcu1
xtr1(k) = r_MRI1(k)/R * sin((firad_MRI1(k) - FIKORR)/FI);
ytr1(k) = r_MRI1(k)/R * cos((firad_MRI1(k) - FIKORR)/FI);
pxtr1(k) = a(2) + xtr1(k);
pytr1(k) = a(1) - ytr1(k);
end
mri1 = 670 - pxtr1 + 1i * pytr1;
Program code for the calculation of the distances between US and MRI
tongue contours</p>
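        <p>The polar decomposition step of the MRI-to-US script above can be expressed in Python as a quick check. The values of n1, n2 and the sample point are hypothetical; 340 mirrors the frame-size constant used in the scripts.</p>

```python
import math

# Centre of the radial geometry, mirroring b = [n1, 340 - n2] in the
# MATLAB script; n1 and n2 are hypothetical here.
n1, n2 = 300.0, 120.0
b = (n1, 340.0 - n2)        # b(1), b(2) in MATLAB indexing

cu = complex(250.0, 180.0)  # a contour point with complex coordinates, like cu1

px, py = cu.real, cu.imag
x = px - b[1]               # horizontal offset from the centre
y = b[0] - py               # vertical offset (image y grows downwards)
firad = math.atan(x / y)    # signed angle from the central vertical axis
fideg = 180.0 * firad / math.pi
r = y / math.cos(firad)     # radius; equals hypot(x, y) for angles below 90 degrees
```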
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>M.</given-names> <surname>Aron</surname></string-name>,
          <string-name><given-names>M.-O.</given-names> <surname>Berger</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Kerrien</surname></string-name>:
          <article-title>Multimodal fusion of electromagnetic, ultrasound and MRI data for building an articulatory model</article-title>,
          <source>in: 8th International Seminar on Speech Production - ISSP'08</source>, Strasbourg, France, Dec.
          <year>2008</year>
          , inria-00326290, url: https://hal.inria.fr/inria-00326290/document.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>J.</given-names> <surname>Cleland</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Wrench</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Scobbie</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Semple</surname></string-name>:
          <article-title>Comparing articulatory images: An MRI/Ultrasound Tongue Image database</article-title>,
          <source>in: Proceedings of the 9th International Seminar on Speech Production</source>,
          <year>2011</year>
          , pp.
          <fpage>163</fpage>
          -
          <lpage>170</lpage>
          , url: https://eresearch.qmu.ac.uk/handle/20.500.12289/2477.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>S. G.</given-names> <surname>Danner</surname></string-name>,
          <string-name><given-names>A. V.</given-names> <surname>Barbosa</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Goldstein</surname></string-name>:
          <article-title>Quantitative analysis of multimodal speech data</article-title>,
          <source>Journal of Phonetics</source>
          <volume>71</volume>
          (
          <year>2018</year>
          ), pp.
          <fpage>268</fpage>
          -
          <lpage>283</lpage>
          , doi: https://doi.org/10.1016/j.wocn.2018.09.007.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>B.</given-names> <surname>Denby</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Stone</surname></string-name>:
          <article-title>Speech synthesis from real time ultrasound images of the tongue</article-title>,
          <source>in: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing</source>, vol.
          <volume>1</volume>
          ,
          <year>2004</year>
          , pp. I-685, doi: https://doi.org/10.1109/ICASSP.2004.1326078.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>L.</given-names> <surname>Fulcher</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Lodermeyer</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Kahler</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Becker</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Kniesburges</surname></string-name>:
          <article-title>Geometry of the vocal tract and properties of phonation near threshold: calculations and measurements</article-title>,
          <source>Applied Sciences 9.13</source>
          (
          <year>2019</year>
          ), 2755, doi: https://doi.org/10.3390/app9132755.
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ojalammi</surname>
          </string-name>
          , J. Malinen:
          <article-title>Automated segmentation of upper airways from MRI-vocal tract geometry extraction</article-title>
          ,
          <source>International Conference on Bioimaging 3</source>
          (
          <year>2017</year>
          ), pp.
          <fpage>77</fpage>
          -
          <lpage>84</lpage>
          , doi: https://doi.org/10.5220/0006138300770084.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>A. D.</given-names> <surname>Scott</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Wylezinska</surname></string-name>,
          <string-name><given-names>M. J.</given-names> <surname>Birch</surname></string-name>,
          <string-name><given-names>M. E.</given-names> <surname>Miquel</surname></string-name>:
          <article-title>Speech MRI: morphology and function</article-title>,
          <source>Physica Medica 30.6</source>
          (
          <year>2014</year>
          ), pp.
          <fpage>604</fpage>
          -
          <lpage>618</lpage>
          , doi: https://doi.org/10.1016/j.ejmp.2014.05.001.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <article-title>SPAN | Speech Production and Articulation Knowledge Group: the rtMRI IPA chart (John Esling)</article-title>
          , accessed 9 May 2020, url: https://sail.usc.edu/span/rtmri_ipa/je_2015.html.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stone</surname>
          </string-name>
          :
          <article-title>A guide to analysing tongue motion from ultrasound images</article-title>
          ,
          <source>Clinical Linguistics and Phonetics</source>
          <volume>19</volume>
          .
          <fpage>6</fpage>
          -
          <lpage>7</lpage>
          (
          <year>2005</year>
          ), pp.
          <fpage>455</fpage>
          -
          <lpage>501</lpage>
          , doi: https://doi.org/10.1080/02699200500113558.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Csapó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Roussel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Denby</surname>
          </string-name>
          :
          <article-title>A comparative study on the contour tracking algorithms in ultrasound tongue images with automatic re-initialization</article-title>
          ,
          <source>Journal of the Acoustical Society of America 139.5</source>
          (
          <issue>2016</issue>
          ),
          <fpage>EL154</fpage>
          -
          <lpage>EL160</lpage>
          , doi: https://doi.org/10.1121/1.4951024.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>L.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Czap</surname></string-name>:
          <article-title>Automatic tracking of tongue contours in ultrasound records</article-title>,
          <source>Beszédtudomány - Speech Science 27.1</source>
          (
          <year>2019</year>
          ), pp.
          <fpage>331</fpage>
          -
          <lpage>343</lpage>
          , doi: https://doi.org/10.15775/Beszkut.2019.331-343.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>N.</given-names> <surname>Zharkova</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Hewlett</surname></string-name>:
          <article-title>Measuring lingual coarticulation from midsagittal tongue contours: Description and example calculations using English /t/ and /a/</article-title>,
          <source>Journal of Phonetics 37.2</source>
          (
          <year>2009</year>
          ), pp.
          <fpage>248</fpage>
          -
          <lpage>256</lpage>
          , doi: https://doi.org/10.1016/j.wocn.2008.10.005.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>